CN111785293B - Voice transmission method, apparatus, device, and storage medium

Info

Publication number: CN111785293B
Application number: CN202010501279.7A
Authority: CN (China)
Other versions: CN111785293A
Inventor: 毛恩云
Current assignee: Hangzhou Hikvision System Technology Co Ltd (the listed assignee may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Prior art keywords: voice, voice data, server, data, terminal device


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L19/00: Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; coding or decoding of speech or audio signals using source-filter models or psychoacoustic analysis
    • G10L19/04: Coding or decoding of speech or audio signals using predictive techniques
    • G10L19/16: Vocoder architecture
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/70: Reducing energy consumption in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The present application provides a voice transmission method, apparatus, device, and storage medium that can greatly reduce the amount of data transmitted. A voice transmission method applied to a terminal device comprises: generating corresponding first voice feature information from collected first voice data to be sent to a destination terminal device; searching the recorded correspondence between voice IDs and voice feature information for a first voice ID corresponding to the first voice feature information; if the first voice ID is found, sending the first voice ID to a server, so that the server controls the destination terminal device, according to the first voice ID, to obtain the first voice data corresponding to the first voice ID; and if the first voice ID is not found, sending the first voice data to the server, so that the server forwards the first voice data to the destination terminal device.

Description

Voice transmission method, apparatus, device, and storage medium
Technical Field
The present disclosure relates to the field of speech technologies, and in particular to a voice transmission method, apparatus, device, and storage medium.
Background
Voice transmission is required in many situations, such as ordinary landline calls, mobile phone calls, intercom calls, and network voice chat. Voice transmission usually relies on a network, and network quality generally determines voice transmission quality. Conventionally, regardless of the network state, a terminal device directly sends collected voice data to a server, and the server forwards the voice data to other terminal devices. This produces a large amount of transmitted data, which degrades voice transmission quality; in particular, when the network is abnormal, for example congested, the extra traffic is likely to aggravate the abnormality and make the voice transmission quality very poor, with stuttering, dropped audio, errors, and the like.
One approach to improving voice transmission quality is for the terminal device, when the network is abnormal, to compress the voice data and send the compressed data to the server for forwarding to other terminal devices. Although this reduces the amount of transmitted data to a certain extent, compression reduces speech clarity, and the amount of data remains large.
Disclosure of Invention
In view of the above, the present application provides a voice transmission method, apparatus, device, and storage medium that can greatly reduce the amount of data transmitted.
A first aspect of the present application provides a voice transmission method applied to a terminal device, the method comprising:
generating corresponding first voice feature information from collected first voice data to be sent to a destination terminal device;
searching the recorded correspondence between voice IDs and voice feature information for a first voice ID corresponding to the first voice feature information;
if the first voice ID is found, sending the first voice ID to a server, so that the server controls the destination terminal device, according to the first voice ID, to obtain the first voice data corresponding to the first voice ID;
and if the first voice ID is not found, sending the first voice data to the server, so that the server forwards the first voice data to the destination terminal device.
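As an illustration only, the four steps of the first aspect can be sketched as a small dispatch routine. The names (`transmit_voice`, `feature_fn`, `send_id`, `send_data`) are hypothetical, and a dictionary stands in for the recorded correspondence; the patent does not prescribe any concrete implementation.

```python
def transmit_voice(voice_data, feature_fn, id_table, send_id, send_data):
    """Send either a compact voice ID or the full voice data.

    feature_fn: maps raw voice data to feature information
    id_table:   recorded {feature_info: voice_id} correspondence
    send_id / send_data: callbacks that reach the server
    """
    feature_info = feature_fn(voice_data)   # generate feature information
    voice_id = id_table.get(feature_info)   # search the correspondence
    if voice_id is not None:
        send_id(voice_id)                   # found: send only the tiny ID
        return ("id", voice_id)
    send_data(voice_data)                   # not found: send the full data
    return ("data", voice_data)
```

The payload shrinks whenever the feature information has already been associated with an ID, which is the source of the claimed data reduction.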
According to an embodiment of the present application, generating the corresponding first voice feature information from the collected first voice data to be sent to the destination terminal device comprises:
performing voiceprint recognition on the collected first voice data to obtain corresponding voiceprint information;
encoding the collected first voice data to obtain coding information, the coding information comprising at least syllable coding information and/or semantic coding information, where the syllable coding information is syllable information identified by syllable recognition and the semantic coding information is semantic information identified by semantic recognition;
and determining the first voice feature information from the voiceprint information and the coding information.
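A minimal sketch of how the two parts might combine, under loudly stated assumptions: the patent specifies no concrete voiceprint or syllable recognizer, so a stable digest stands in for voiceprint recognition and a naive word split stands in for syllable coding.

```python
import hashlib

def voiceprint_of(voice_data: bytes) -> str:
    # Stand-in for real voiceprint recognition: a stable digest of the audio.
    return hashlib.sha256(voice_data).hexdigest()[:16]

def syllable_code_of(recognized_text: str) -> str:
    # Stand-in for syllable recognition on the recognized speech content.
    return "-".join(recognized_text.split())

def feature_info(voice_data: bytes, recognized_text: str) -> tuple:
    # Feature information combines who is speaking (voiceprint) with
    # what is said (coding information), per the embodiment above.
    return (voiceprint_of(voice_data), syllable_code_of(recognized_text))
```

The key property the real system would need is determinism: the same speaker saying the same content must map to the same feature information, so that the recorded correspondence can be hit again.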
According to an embodiment of the present application, the method further comprises: determining a voice transmission mode for voice transmission according to the detected network state of the device;
in the case that the first voice ID is not found, the method further comprises:
if the voice transmission mode is a set first mode, further sending the first voice feature information to the server in the first mode, so that the server allocates a corresponding first voice ID according to the first voice feature information;
and acquiring the first voice ID from the server, and recording the correspondence between the first voice ID and the first voice feature information.
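The learning handshake in the first mode can be sketched as follows; the class shapes and method names are illustrative assumptions, showing only that the server allocates IDs and the terminal records the returned correspondence.

```python
import itertools

class Server:
    def __init__(self):
        self._next = itertools.count(1)
        self.id_by_feature = {}

    def allocate_id(self, feature_info):
        # Reuse an existing ID for known features, otherwise assign a new one.
        if feature_info not in self.id_by_feature:
            self.id_by_feature[feature_info] = next(self._next)
        return self.id_by_feature[feature_info]

class Terminal:
    def __init__(self, server):
        self.server = server
        self.id_table = {}  # feature_info -> voice_id

    def learn(self, feature_info):
        # First mode: upload the feature information, get back an ID,
        # and record the new correspondence locally.
        voice_id = self.server.allocate_id(feature_info)
        self.id_table[feature_info] = voice_id
        return voice_id
```

After `learn` has run once for some feature information, later transmissions of matching voice data can send only the recorded ID.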
According to an embodiment of the present application, the method further comprises:
receiving second voice data sent by the server, and playing the second voice data.
According to an embodiment of the present application, the method further comprises:
acquiring from the server, and recording, the correspondence between voice IDs and voice data;
when at least one second voice ID sent by the server is received, searching the recorded correspondence between voice IDs and voice data for the voice data corresponding to each received second voice ID;
if one second voice ID is received, playing the found voice data;
if two or more second voice IDs are received, synthesizing the found voice data corresponding to each second voice ID, and playing the synthesized voice data.
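The destination-side playback rule above can be sketched as below. This is a hedged illustration: the patent leaves the synthesis method open, so simple concatenation of the audio pieces is used as a stand-in.

```python
def play_ids(voice_ids, data_table):
    """Return the audio to play for the received second voice IDs.

    data_table: recorded {voice_id: voice_data} correspondence
    """
    found = [data_table[i] for i in voice_ids if i in data_table]
    if len(found) == 1:
        return found[0]        # a single ID: play the found data as-is
    return b"".join(found)     # several IDs: synthesize, then play
```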
A second aspect of the present application provides a voice transmission method applied to a server, the method comprising:
in the case that a voice ID sent by a source terminal device is received, controlling the destination terminal device, according to the voice ID, to obtain the voice data corresponding to the voice ID;
and in the case that voice data sent by the source terminal device is received, forwarding the voice data to the destination terminal device.
According to an embodiment of the present application, controlling the destination terminal device, according to the voice ID, to obtain the voice data corresponding to the voice ID comprises:
searching the recorded correspondence between voice IDs and voice data for the voice data corresponding to each received voice ID;
if one voice ID is received, forwarding the found voice data to the destination terminal device;
if more than one voice ID is received, synthesizing the found voice data corresponding to each voice ID, and forwarding the synthesized voice data to the destination terminal device.
According to an embodiment of the present application, controlling the destination terminal device, according to the voice ID, to obtain the voice data corresponding to the voice ID comprises:
when it is determined that the recorded correspondence between voice IDs and voice data has been sent to the destination terminal device, forwarding the received voice ID to the destination terminal device, so that the destination terminal device plays the corresponding voice data according to the received voice ID.
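The server-side choice between the two embodiments above can be sketched as a single branch; function and parameter names are hypothetical.

```python
def relay(voice_id, data_table, dest_has_table, send):
    """Forward either the compact ID or the looked-up voice data.

    data_table:     server's recorded {voice_id: voice_data} correspondence
    dest_has_table: whether the correspondence was already sent to the
                    destination terminal device
    send:           callback that reaches the destination terminal device
    """
    if dest_has_table:
        send(("id", voice_id))                # destination can resolve it
    else:
        send(("data", data_table[voice_id]))  # destination needs the audio
```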
According to an embodiment of the present application, in the case that voice data sent by the source terminal device is received, the method further comprises:
when voice feature information corresponding to the voice data is also received (the source terminal device sends the voice feature information when its voice transmission mode is a set first mode), allocating a corresponding voice ID according to the voice feature information and returning the voice ID to the source terminal device; and recording the correspondence between the voice ID and the voice data.
According to an embodiment of the present application, the method further comprises:
sending the locally recorded correspondence between voice IDs and voice data to the destination terminal device, so that the destination terminal device can find the corresponding voice data according to a voice ID when it receives one.
A third aspect of the present application provides a voice transmission apparatus applied to a terminal device, comprising:
a voice feature information generation module, configured to generate corresponding first voice feature information from collected first voice data to be sent to a destination terminal device;
a voice ID search module, configured to search the recorded correspondence between voice IDs and voice feature information for a first voice ID corresponding to the first voice feature information;
a first voice transmission module, configured to, if the first voice ID is found, send the first voice ID to a server, so that the server controls the destination terminal device, according to the first voice ID, to obtain the first voice data corresponding to the first voice ID;
and a second voice transmission module, configured to, if the first voice ID is not found, send the first voice data to the server, so that the server forwards the first voice data to the destination terminal device.
According to an embodiment of the present application, when generating the corresponding first voice feature information from the collected first voice data to be sent to the destination terminal device, the voice feature information generation module is specifically configured to:
perform voiceprint recognition on the collected first voice data to obtain corresponding voiceprint information;
encode the collected first voice data to obtain coding information, the coding information comprising at least syllable coding information and/or semantic coding information, where the syllable coding information is syllable information identified by syllable recognition and the semantic coding information is semantic information identified by semantic recognition;
and determine the first voice feature information from the voiceprint information and the coding information.
According to an embodiment of the present application, the apparatus further comprises: a voice transmission mode determination module, configured to determine a voice transmission mode for voice transmission according to the detected network state of the device;
in the case that the first voice ID is not found, the second voice transmission module is further configured to:
if the voice transmission mode is a set first mode, further send the first voice feature information to the server in the first mode, so that the server allocates a corresponding first voice ID according to the first voice feature information;
and acquire the first voice ID from the server, and record the correspondence between the first voice ID and the first voice feature information.
According to an embodiment of the present application, the apparatus further comprises:
a first voice playing module, configured to receive second voice data sent by the server and play the second voice data.
According to an embodiment of the present application, the apparatus further comprises:
a correspondence acquisition module, configured to acquire from the server, and record, the correspondence between voice IDs and voice data;
a voice data search module, configured to, when at least one second voice ID sent by the server is received, search the recorded correspondence between voice IDs and voice data for the voice data corresponding to each received second voice ID;
a second voice playing module, configured to play the found voice data if one second voice ID is received;
and a third voice playing module, configured to, if two or more second voice IDs are received, synthesize the found voice data corresponding to each second voice ID and play the synthesized voice data.
A fourth aspect of the present application provides a voice transmission apparatus applied to a server, the apparatus comprising:
a third voice transmission module, configured to, in the case that a voice ID sent by a source terminal device is received, control the destination terminal device, according to the voice ID, to obtain the voice data corresponding to the voice ID;
and a fourth voice transmission module, configured to, in the case that voice data sent by the source terminal device is received, forward the voice data to the destination terminal device.
According to an embodiment of the present application, when controlling the destination terminal device, according to the voice ID, to obtain the voice data corresponding to the voice ID, the third voice transmission module is specifically configured to:
search the recorded correspondence between voice IDs and voice data for the voice data corresponding to each received voice ID;
if one voice ID is received, forward the found voice data to the destination terminal device;
if more than one voice ID is received, synthesize the found voice data corresponding to each voice ID, and forward the synthesized voice data to the destination terminal device.
According to an embodiment of the present application, when controlling the destination terminal device, according to the voice ID, to obtain the voice data corresponding to the voice ID, the third voice transmission module is specifically configured to:
when it is determined that the recorded correspondence between voice IDs and voice data has been sent to the destination terminal device, forward the received voice ID to the destination terminal device, so that the destination terminal device plays the corresponding voice data according to the received voice ID.
According to an embodiment of the present application, in the case that voice data sent by the source terminal device is received, the fourth voice transmission module is further configured to:
when voice feature information corresponding to the voice data is also received (the source terminal device sends the voice feature information when its voice transmission mode is a set first mode), allocate a corresponding voice ID according to the voice feature information and return the voice ID to the source terminal device; and record the correspondence between the voice ID and the voice data.
According to an embodiment of the present application, the apparatus further comprises:
a correspondence sending module, configured to send the locally recorded correspondence between voice IDs and voice data to the destination terminal device, so that the destination terminal device can find the corresponding voice data according to a voice ID when it receives one.
A fifth aspect of the present application provides an electronic device comprising a processor and a memory; the memory stores a program that can be invoked by the processor; when executing the program, the processor implements the voice transmission method described in the foregoing embodiments.
A sixth aspect of the present application provides a machine-readable storage medium on which a program is stored; when executed by a processor, the program implements the voice transmission method described in the foregoing embodiments.
The embodiment of the application has the following beneficial effects:
In the embodiments of the present application, a terminal device can learn and record the correspondence between voice identifiers (IDs) and voice feature information. After generating the corresponding first voice feature information from collected first voice data, the terminal device searches this correspondence for a first voice ID corresponding to the first voice feature information. If it is not found, the terminal device sends the first voice data to the server, and the server forwards the first voice data to the destination terminal device. If it is found, the terminal device only needs to send the first voice ID to the server, and the server controls the destination terminal device, according to the first voice ID, to obtain the first voice data corresponding to the first voice ID; for example, if the server has learned the correspondence between voice IDs and voice data, it can find the first voice data in that correspondence and forward it to the destination terminal device. Since a voice ID is much smaller than the voice data it identifies, transmitting the ID instead of the data greatly reduces the amount of data transmitted.
Drawings
Fig. 1 is a flowchart of a voice transmission method according to an embodiment of the present application;
Fig. 2 is a schematic structural diagram of a voice transmission apparatus according to an embodiment of the present application;
Fig. 3 is a block diagram of a voice transmission system according to an embodiment of the present application;
Fig. 4 is an interaction diagram among a source terminal device, a server, and a destination terminal device according to an embodiment of the present application;
Fig. 5 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application, as detailed in the appended claims.
The terminology used in the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the present application. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various devices, these devices should not be limited by these terms. These terms are only used to distinguish one device from another of the same type. For example, a first device could also be termed a second device, and similarly, a second device could also be termed a first device, without departing from the scope of the present application. The word "if" as used herein may be interpreted as "when", "upon", or "in response to determining", depending on the context.
In order to make the description of the present application clearer and concise, some technical terms in the present application are explained below:
Voiceprint: a voiceprint is the spectrum of sound waves carrying speech information, as displayed by electro-acoustic instruments. Modern research shows that voiceprints are not only distinctive but also relatively stable: after adulthood, a person's voice remains relatively stable over a long period.
Syllable: a syllable is the smallest phonetic unit formed by combining vowel and consonant phonemes in a phonological system; a single vowel phoneme can also form a syllable by itself. Chinese syllables are composed of combinations of vowel and consonant phonemes in the phonological system.
Weak network: different applications define a weak network differently; the definition considers not only the minimum rate of each network type but also the service scenario and application type. In mobile scenarios, networks with rates below that of 2G are generally considered weak networks, and 3G networks may also be classified as weak. In addition, Wi-Fi with a weak signal is typically counted as a weak network.
The voice transmission method of the embodiments of the present application can be applied to voice communication scenarios such as patrols of residential communities, parks, campuses, factories, prisons, and parking lots, where staff usually need to communicate by voice. Taking a factory as an example, suppose two security guards are on patrol, each carrying a handheld terminal capable of voice communication. When one guard discovers an intruder or an equipment anomaly, he can communicate with the other guard by voice through the handheld terminal, and this process requires voice transmission. Of course, the above scenarios are merely exemplary; the method is not limited to them, and the voice communication scenario may be, but is not limited to, a real-time communication scenario.
The voice transmission method of the embodiments of the present application is described in more detail below, though the method is not limited to these embodiments. In one embodiment, referring to fig. 1, a voice transmission method applied to a terminal device may include the following steps:
S100: generating corresponding first voice feature information from collected first voice data to be sent to a destination terminal device;
S200: searching the recorded correspondence between voice IDs and voice feature information for a first voice ID corresponding to the first voice feature information;
S300: if the first voice ID is found, sending the first voice ID to a server, so that the server controls the destination terminal device, according to the first voice ID, to obtain the first voice data corresponding to the first voice ID;
S400: if the first voice ID is not found, sending the first voice data to the server, so that the server forwards the first voice data to the destination terminal device.
The execution body of the voice transmission method is a terminal device, and more specifically may be a processing chip of the terminal device. The terminal device may be an intercom, a landline phone, a mobile phone, a virtual machine, or the like; the specific type is not limited, as long as it supports voice transmission and has a certain processing capability.
The terminal device may be provided with a voice acquisition module and a voice playing module. Steps S100 to S400 may be implemented on the voice acquisition side, in which case the terminal device acts as a source terminal device that needs to send voice data to a destination terminal device. The voice playing module is used for playing voice data, in which case the terminal device acts as a destination terminal device; for example, when it receives voice data that the server forwarded from a source terminal device, it can play that voice data.
The terminal device is connected to the server. The server generally has larger storage and processing capacity and may consist of one computer device or several computer devices; the specific type is not limited. Besides the above terminal device, the server may be connected to other terminal devices; the number of connected terminal devices is not limited.
In this embodiment, two voice transmission modes may be set: a first mode and a second mode. The terminal device determines the voice transmission mode according to the detected network state of the device: it enters the first mode when the network state is normal, and the second mode when the network state is abnormal.
The terminal device may detect its network state according to a set policy, for example periodically, or upon a network state event triggered by the lower layers (the terminal device may listen for network state events, which indicate whether its network state is normal); the specific detection method is not limited. Network state abnormalities here may include, for example, network delay, packet loss, throttling, and retransmission, that is, various network problems caused by network congestion or instability, without particular limitation.
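The mode decision can be sketched as follows; the concrete thresholds and metrics are assumptions for illustration, since the patent only requires that the mode follow the detected network state.

```python
FIRST_MODE = "first"    # network normal: learn ID<->feature correspondences
SECOND_MODE = "second"  # network abnormal: rely on what was already learned

def select_mode(delay_ms: float, packet_loss: float) -> str:
    # Illustrative definition of "network abnormal": high delay or high loss.
    network_normal = delay_ms < 200 and packet_loss < 0.05
    return FIRST_MODE if network_normal else SECOND_MODE
```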
The difference between the two modes is that in the first mode, the correspondence between voice IDs and voice feature information is learned while the method shown in fig. 1 is carried out, whereas in the second mode, the method shown in fig. 1 may be carried out using the correspondence learned and recorded in the first mode. This is described in detail in the following embodiments.
In step S100, corresponding first voice feature information is generated from the collected first voice data to be sent to the destination terminal device.
The terminal device needs to send the collected first voice data to the destination terminal device so that the first voice data is played there. The destination terminal device may be any one or more of the other terminal devices connected to the server.
Alternatively, the first voice data may be acquired by the terminal device, or acquired by the terminal device after being acquired by an external device (such as an external microphone), and of course, the specific device from which the first voice data is acquired is not limited.
Alternatively, the terminal device may collect a plurality of pieces of first voice data at a time. Taking the terminal device being an interphone as an example, when a person turns on the interphone, speaks N sentences into it, and then turns it off, two sentences whose interval exceeds a preset time (such as 0.5 s) can be determined to be two separate pieces of voice data, so the interphone can be determined to have collected N pieces of voice data, where N is greater than 1.
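The gap-based segmentation rule can be sketched as follows (a hypothetical Python helper operating on utterance start/end timestamps in seconds; the 0.5 s default mirrors the example above):

```python
def split_into_pieces(utterances, gap_threshold=0.5):
    """Group timed utterances into pieces of voice data: two sentences
    whose silent gap exceeds gap_threshold seconds become separate
    pieces. Each utterance is a (start_time, end_time) pair in seconds."""
    pieces, current = [], []
    prev_end = None
    for start, end in utterances:
        if prev_end is not None and start - prev_end > gap_threshold:
            pieces.append(current)  # gap too long: close the current piece
            current = []
        current.append((start, end))
        prev_end = end
    if current:
        pieces.append(current)
    return pieces
```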
In this case, first voice feature information corresponding to each piece of first voice data may be generated, and the first voice feature information can characterize the corresponding first voice data. When first voice feature information is generated for a plurality of pieces of first voice data, the subsequent steps S200 to S400 may be performed for each piece of first voice feature information (of course, steps S300 and S400 are not both performed; one of them is selected according to the search result).
In step S200, a first voice ID corresponding to the first voice feature information is found in the corresponding relationship between the recorded voice ID and the voice feature information.
In related approaches, when the first voice data is collected, it is sent directly to the server, or compressed and then sent to the server, after which the server sends the first voice data to the destination terminal device.
In this embodiment, no matter in the first mode or the second mode, the first voice data is not directly sent to the server, but the first voice ID corresponding to the first voice feature information is first searched in the corresponding relationship between the locally recorded voice ID and the voice feature information, and the transmission mode of the first voice data is determined according to the searching condition.
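The lookup-then-decide behavior of steps S200-S400 can be sketched as follows (a minimal Python sketch; the function and callback names are hypothetical):

```python
def transmit_voice(feature, voice_data, feature_to_id, send_id, send_data):
    """Step S200: look up the locally recorded feature -> voice ID
    correspondence. Step S300: if found, send only the voice ID.
    Step S400: otherwise fall back to sending the voice data itself."""
    voice_id = feature_to_id.get(feature)
    if voice_id is not None:
        send_id(voice_id)
    else:
        send_data(voice_data)
    return voice_id
```

Either branch reaches the server; only the size of the payload differs.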
In step S300, if the first voice ID is found, the first voice ID is sent to a server, so that the server controls the destination terminal device according to the first voice ID to obtain first voice data corresponding to the first voice ID.
If the first voice ID is found in the correspondence between the recorded voice IDs and voice feature information, the related information of the first voice data (including the correspondence between the first voice ID and the first voice feature information in the terminal device, and the correspondence between the first voice ID and the first voice data in the server) has been learned before. In this case, only the first voice ID needs to be sent to the server, and the server can control the destination terminal device according to the first voice ID to obtain the first voice data corresponding to the first voice ID.
Optionally, when the server controls the destination terminal device according to the first voice ID to obtain the corresponding first voice data, the server may find the first voice data corresponding to the received first voice ID in the correspondence between the locally recorded voice IDs and voice data. If the number of received voice IDs is 1, the found voice data is forwarded to the destination terminal device; if the number of received voice IDs is greater than 1, the found voice data corresponding to each voice ID is synthesized, and the synthesized voice data is forwarded to the destination terminal device.
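That server-side branch can be sketched as follows (a hypothetical Python helper; plain byte concatenation stands in for whatever synthesis method the embodiment actually uses):

```python
def forward_by_ids(voice_ids, id_to_data, forward):
    """Server side: look up each received voice ID in the recorded
    ID -> voice data correspondence; forward a single piece directly,
    or merge several pieces before forwarding."""
    pieces = [id_to_data[vid] for vid in voice_ids]
    forward(pieces[0] if len(pieces) == 1 else b"".join(pieces))
```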
In this way, only the first voice ID needs to be transmitted between the terminal device and the server, and the first voice data no longer needs to be transmitted. Since the data amount of the first voice ID is much smaller than that of the first voice data, the data transmission amount between the terminal device and the server can be greatly reduced, while the destination terminal device can still obtain the required first voice data.
Alternatively, when the server controls the destination terminal device according to the first voice ID to obtain the corresponding first voice data, and it is determined that the recorded correspondence between voice IDs and voice data has already been sent to the destination terminal device, the server may forward the received first voice ID to the destination terminal device so that the destination terminal device plays the corresponding first voice data according to the received first voice ID. When doing so, the destination terminal device searches for the first voice data corresponding to the first voice ID in the correspondence between the recorded voice IDs and voice data; if 1 piece of first voice data is found, it is played, and if several pieces are found, they are merged and played.
In this manner, only the first voice ID needs to be transmitted both between the terminal device and the server and between the server and the destination terminal device. During voice transmission, the data transmission amount on both links can therefore be greatly reduced; in other words, this manner further reduces the required data transmission amount, while the destination terminal device can still obtain the required first voice data.
In summary, no matter how the server controls the destination terminal device to obtain the first voice data corresponding to the first voice ID according to the first voice ID, only the first voice ID needs to be transmitted between the terminal device and the server, and the data transmission amount between the terminal device and the server can be greatly reduced.
In step S400, if the first voice ID is not found, the first voice data is sent to the server, so that the server forwards the first voice data to the destination terminal device.
In either the first mode or the second mode, if the terminal device does not find the first voice ID, the related information of the first voice data has not yet been learned, so the first voice data may be sent directly to the server and forwarded by the server to the destination terminal device.
Alternatively, when the first voice data is sent to the server, the first voice data may be compressed first, then the compressed first voice data is sent to the server, and the server forwards the compressed first voice data to the destination terminal device.
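As a sketch of this optional compression step, any lossless codec suffices; zlib is used here purely as a stand-in for the unspecified compression scheme:

```python
import zlib

def compress_voice(raw: bytes) -> bytes:
    # Compress before sending to the server (lossless stand-in).
    return zlib.compress(raw)

def decompress_voice(blob: bytes) -> bytes:
    # Inverse operation on the receiving side.
    return zlib.decompress(blob)
```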
It will be appreciated that, when the terminal device sends the first voice ID, the first voice data, or other information, it may also carry indication information (such as address information of the destination terminal device) indicating the destination terminal device, so that the server can send the relevant information to the destination terminal device indicated by the indication information. The same applies to the transmission of other information and is not repeated below.
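One possible shape for a message carrying such indication information is sketched below (a hypothetical JSON envelope; all field names are illustrative, not defined by this disclosure):

```python
import json

def make_message(dest_addr, voice_id=None, voice_payload=None):
    """Wrap either a voice ID or (already encoded) voice data together
    with indication information identifying the destination terminal."""
    msg = {"dest": dest_addr}  # indication information (destination address)
    if voice_id is not None:
        msg["voice_id"] = voice_id
    if voice_payload is not None:
        msg["voice_data"] = voice_payload
    return json.dumps(msg)
```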
In one embodiment, the above method flow may be performed by the voice transmission apparatus 100, and as shown in fig. 2, the voice transmission apparatus 100 may include 4 modules: a voice feature information generation module 101, a voice ID lookup module 102, a first voice transmission module 103, and a second voice transmission module 104. The voice feature information generating module 101 is configured to perform the step S100, the voice ID searching module 102 is configured to perform the step S200, the first voice transmission module 103 is configured to perform the step S300, and the second voice transmission module 104 is configured to perform the step S400.
In one embodiment, in step S100, the generating corresponding first voice feature information according to the collected first voice data to be sent to the destination terminal device may include the following steps:
s101: voiceprint recognition is carried out on the collected first voice data, and corresponding voiceprint information is obtained;
s102: encoding the acquired first voice data to obtain encoded information, wherein the encoded information at least comprises: syllable coding information and/or semantic coding information; the syllable coding information is syllable information identified according to a syllable identification mode, and the semantic coding information is semantic information identified according to a semantic identification mode;
s103: and determining the first voice characteristic information according to the voiceprint information and the coding information.
Since voiceprints of different persons generally differ while a given person's voiceprint is generally stable, voiceprint information can represent the identity of the voice's originator. In step S101, an existing voiceprint recognition method may be used to perform voiceprint recognition on the first voice data to obtain the corresponding voiceprint information, which can represent the identity of the source of the first voice data.
Syllables or semantics of the speech may represent the content of the speech to some extent, and in general, syllables or semantics of different utterances are different, so in this embodiment, syllable coding information and/or semantic coding information are also determined by identifying syllable information and/or semantic information of the first speech data in step S102, to be used for representing the content of the speech.
Syllable recognition of the first voice data can be performed using an existing syllable recognition algorithm to obtain syllable information, and semantic recognition can be performed using an existing semantic recognition algorithm to obtain semantic information. During recognition, encoding can be carried out as syllables are identified, for example encoding each syllable as soon as it is recognized; the syllable information finally obtained is then the syllable coding information.
In step S103, the first voice feature information is determined according to the voiceprint information and the encoding information, for example, the voiceprint information and the encoding information may be determined as the first voice feature information. In the above way, the voice data of different contents sent by different people can be basically distinguished according to the voice characteristic information.
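Combining voiceprint information and coding information into feature information can be sketched as follows (a hypothetical helper; any representation that keeps the pair hashable, and therefore usable as a lookup key, would serve):

```python
def build_feature(voiceprint, syllable_codes):
    """Step S103 sketch: the feature information is the pair of
    voiceprint information (who is speaking) and syllable coding
    information (what is said), so the same sentence from the same
    speaker always yields the same key."""
    return (voiceprint, tuple(syllable_codes))
```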
Of course, if different people do not need to be distinguished in the actual scene, the encoded information may also be determined as the first voice feature information, and voiceprint information is not needed, which is not particularly limited.
In this embodiment, syllable information and/or semantic information are used, but unlike their usual use, they are not used to reconstruct speech; they are used to find the corresponding voice data. Taking the semantic recognition algorithm being a Chinese semantic recognition algorithm as an example, since this embodiment is not concerned with the true meaning of the voice data, the recognized semantic information only needs to distinguish voice data of different contents; the language of the voice data can therefore be of any type, such as Chinese, English, Russian, or even a dialect.
In one embodiment, before step S100, further comprising: and determining a voice transmission mode for voice transmission according to the detected network state of the device.
For example, when the detected network state of the device is normal, the first mode may be entered; and when the detected network state of the device is abnormal, entering a second mode.
In step S400, in the case that the first voice ID is not found, the method further includes:
S410: if the voice transmission mode is a set first mode, further sending the first voice characteristic information to the server according to the first mode so that the server can distribute a corresponding first voice ID according to the first voice characteristic information;
S420: acquiring the first voice ID from the server, and recording the correspondence between the first voice ID and the first voice feature information.
In other words, when the voice transmission mode is the first mode and the first voice ID is not found, both the first voice data and the first voice feature information are sent to the server, so that not only is the transmission of the first voice data realized, but also the learning of the related information of the first voice data is realized.
After receiving the first voice data and the first voice feature information, the server side can allocate a corresponding first voice ID according to the first voice feature information, and the first voice ID can identify the first voice data. After the server allocates the first voice ID, the server may record the corresponding relationship between the first voice ID and the first voice data locally, or may record the corresponding relationship between the first voice ID, the first voice data and the first voice feature information locally, and return the first voice ID to the terminal device, and forward the first voice data to the destination terminal device.
The first voice ID in this embodiment may uniquely identify the first voice data. In general, a plurality of terminal devices are connected to the server, and uniqueness of each voice ID must be ensured across all terminal devices when it is assigned, so uniqueness is more easily guaranteed by having the server perform the assignment. For example, voice IDs may be assigned in the order 0, 1, 2, 3, 4, and so on, which keeps the processing simple. The specific allocation mode is not limited, as long as voice data with different contents spoken by different people is assigned different voice IDs, that is, different voice feature information is assigned different voice IDs.
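Sequential allocation with per-feature uniqueness can be sketched as follows (a hypothetical server-side Python class):

```python
class VoiceIdAllocator:
    """Server-side allocation sketch: IDs are handed out in the order
    0, 1, 2, ..., which keeps them unique across all connected
    terminals; identical feature information always maps to the same ID."""

    def __init__(self):
        self._next_id = 0
        self._id_by_feature = {}

    def allocate(self, feature):
        # Reuse the existing ID for already-seen feature information.
        if feature not in self._id_by_feature:
            self._id_by_feature[feature] = self._next_id
            self._next_id += 1
        return self._id_by_feature[feature]
```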
The terminal equipment acquires the first voice ID from the server and records the corresponding relation between the first voice ID and the first voice characteristic information. When the first voice data is acquired subsequently, after the corresponding first voice characteristic information is generated, the corresponding first voice ID can be found in the corresponding relation, and only the first voice ID is required to be sent to the server side, and the first voice data is not required to be sent to the server side.
Optionally, in addition to the correspondence between the first voice ID and the first voice feature information, the correspondence between the first voice ID or the first voice feature information and the first voice data may be recorded in the terminal device; alternatively, the correspondence relationship among the first voice ID, the first voice feature information, and the first voice data may be recorded, which is not particularly limited. Of course, in order to reduce the amount of memory required by the terminal device, voice data may not be stored in the terminal device.
By the above manner, for each piece of voice data acquired or collected in the first mode, the corresponding relationship between the voice ID and the voice feature information of the voice data is generally recorded in the terminal device, and the server side generally records the corresponding relationship between the voice data and the voice ID, so that the learning process can be continuously performed to continuously enrich the corresponding relationship and cover more and more words and sentences. In this way, voice transmission can be achieved by transmitting the voice ID between the terminal device and the server by using the voice ID as an association between voice feature information in the terminal device and voice data in the server.
In addition, in step S200, after the corresponding first voice feature information is generated, the first voice ID corresponding to it is searched for in the correspondence between the recorded voice IDs and voice feature information; only when the first voice ID is not found are the first voice feature information and the first voice data sent to the server, so that the related information of the first voice data is learned. In this way, repeated learning of the related information of the same voice data is avoided: only voice data with unfamiliar content from an unfamiliar speaker triggers learning, after which the terminal device records the voice feature information-voice ID correspondence and the server records the voice ID-voice data correspondence.
In the case of an abnormal network state, the quality of the voice data may degrade during transmission, which generally makes it unsuitable for learning the related information of the first voice data, so the terminal device need not send the first voice feature information to the server in this case. Of course, this is not a limitation and may be chosen as desired; for example, learning may still be performed where poor speech quality is acceptable.
A more specific example is provided below in connection with fig. 3 and 4, but this should not be taken as a limitation.
As shown in fig. 3, the server 300 may be connected to a plurality of terminal devices 201-203, where the current terminal device 201 needs to send first voice data to the terminal device 202, and in this case, the terminal device 201 is a source terminal device, the terminal device 202 is a destination terminal device, and the terminal device 203 is another terminal device. Of course, the server 300 may also be connected to more terminal devices, which are not shown in the figure.
As shown in fig. 4, there are two voice transmission modes of the source terminal device 201, namely, a first mode and a second mode, and the source terminal device 201 enters the first mode when detecting that the network state of the device is normal, and enters the second mode when detecting that the network state of the device is abnormal.
In the first mode:
after the source terminal device 201 collects the first voice data, first voice feature information corresponding to the first voice data can be generated according to the first voice data, where the first voice feature information includes voiceprint information and syllable coding information;
next, the source terminal device 201 searches for a first voice ID corresponding to the first voice feature information in the correspondence between the recorded voice ID and the voice feature information;
if not found, the related information of the first voice data has not yet been learned, so the first voice data and the first voice feature information are sent to the server 300. After receiving them, the server 300 allocates a corresponding first voice ID according to the first voice feature information, records the correspondence between the first voice ID and the first voice data, and returns the first voice ID to the source terminal device 201. After receiving the first voice ID returned by the server, the source terminal device 201 records the correspondence between the first voice feature information and the first voice ID. In addition, after receiving the first voice data and the first voice feature information, the server 300 forwards the first voice data to the destination terminal device 202 (either piece by piece or merged). After receiving the first voice data, the destination terminal device 202 may play it.
If found, the related information of the first voice data has been learned before, including the correspondence between the first voice ID and the first voice feature information in the source terminal device 201 and the correspondence between the first voice data and the first voice ID in the server 300, so only the first voice ID needs to be sent to the server 300. After receiving the first voice ID, the server 300 searches the correspondence between the recorded voice IDs and voice data for the first voice data corresponding to the first voice ID and forwards it to the destination terminal device 202 (either piece by piece or merged). After receiving the first voice data, the destination terminal device 202 may play it.
In the second mode:
after the source terminal device 201 collects the first voice data, first voice feature information corresponding to the first voice data can be generated according to the first voice data, where the first voice feature information includes voiceprint information and syllable coding information;
next, the source terminal device 201 searches for a first voice ID corresponding to the first voice feature information in the correspondence between the recorded voice ID and the voice feature information;
If found, the related information of the first voice data has been learned before, including the correspondence between the first voice ID and the first voice feature information in the source terminal device 201 and the correspondence between the first voice data and the first voice ID in the server 300, so only the first voice ID needs to be sent to the server 300. After receiving the first voice ID, the server 300 searches the correspondence between the recorded voice IDs and voice data for the first voice data corresponding to the first voice ID and forwards it to the destination terminal device 202 (either piece by piece or merged). After receiving the first voice data, the destination terminal device 202 may play it.
If not found, the related information of the first voice data has not been learned; however, because the current network state is abnormal, only the first voice data is sent to the server 300 (one piece at a time or several pieces together). After receiving the first voice data, the server 300 directly forwards it to the destination terminal device 202 (either piece by piece or merged) and does not learn any related information. After receiving the first voice data, the destination terminal device 202 may play it.
In the above embodiments, the terminal device acts as the source terminal device. The role of the terminal device differs under different processing logic; in some processing logic the terminal device may of course also be the destination terminal device, as in the several embodiments described below.
In one embodiment, the method further comprises:
and receiving second voice data sent by the server side, and playing the second voice data.
The terminal device may have a voice playing function, for example, may have a voice player through which the second voice data is played.
In one embodiment, the method further comprises:
S500: acquiring and recording the correspondence between voice IDs and voice data from the server;
S600: when at least one second voice ID sent by the server is received, searching for the voice data corresponding to each received second voice ID according to the correspondence between the recorded voice IDs and voice data;
S700: if 1 second voice ID is received, playing the found voice data;
S800: if two or more second voice IDs are received, synthesizing the found voice data corresponding to each second voice ID, and playing the synthesized voice data.
When the server is idle, it may send the recorded (and not yet synchronized) correspondences between voice IDs and voice data to each connected terminal device. Idle here may refer to a time when neither voice data nor any other information needs to be transmitted. Optionally, after sending a correspondence between a voice ID and voice data to each terminal device, the server may delete the locally recorded correspondence, or mark it as synchronized with synchronization identification information.
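Idle-time synchronization with a synchronization flag (rather than deletion) can be sketched as follows (a hypothetical Python helper; the record layout is illustrative):

```python
def sync_when_idle(records, terminals, send):
    """Push every not-yet-synchronized (voice ID, voice data) record
    to each connected terminal, then mark it with a synchronization
    flag instead of deleting it."""
    for rec in records:
        if not rec.get("synced"):
            for term in terminals:
                send(term, rec["id"], rec["data"])
            rec["synced"] = True
```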
After the terminal device obtains the correspondence between the voice ID and the voice data from the server, the correspondence may be recorded. When the source terminal device finds a second voice ID (there may be multiple pieces of voice data, so that multiple second voice IDs may be found) corresponding to the voice data to be sent locally, the source terminal device may send the second voice ID to the server, and the server may forward the second voice ID to the terminal device.
When the terminal device receives at least one second voice ID sent by the server, it can find the voice data corresponding to each second voice ID in the correspondence between the recorded voice IDs and voice data. If only 1 second voice ID is received, only one piece of voice data is found, and it is played directly. If two or more second voice IDs are received, the pieces of voice data corresponding to the found second voice IDs are synthesized into one complete piece of voice data and the synthesized voice data is played; the synthesis method is not limited.
In this embodiment, the server may synchronize the correspondence between the voice ID and the voice data to the connected terminal device, so that the subsequent terminal device may directly use the correspondence between the voice ID and the voice data to implement voice transmission.
In one embodiment, when the server is idle, the recorded voice ID and voice feature information may be sent to each connected terminal device, so that other terminal devices do not need to repeatedly learn the voice ID and voice feature information, and the learning process is quickened.
The above is the embodiment content of the voice transmission method applied to the terminal device, and the embodiment content of the method applied to the voice transmission of the server is described below.
In one embodiment, the voice transmission method is applied to the server, and the method may include the following steps:
T100: when a voice ID sent by a source terminal device is received, controlling a destination terminal device according to the voice ID to obtain the voice data corresponding to the voice ID;
T200: when voice data sent by the source terminal device is received, forwarding the voice data to the destination terminal device.
The main execution body of the voice transmission method is a server, and the server can have larger storage capacity and processing capacity and can be composed of one computer device or a plurality of computer devices.
The server may be connected to a plurality of terminal devices, as shown in fig. 3, and the server 300 may be connected to a plurality of terminal devices 201-203, where, assuming that the terminal device 201 needs to send first voice data to the terminal device 202, in this case, the terminal device 201 is a source terminal device, the terminal device 202 is a destination terminal device, and the terminal device 203 is another terminal device. Of course, the server 300 may also be connected to more terminal devices, which are not shown in the figure.
In step T100, when a voice ID sent by a source terminal device is received, a destination terminal device is controlled according to the voice ID to obtain voice data corresponding to the voice ID.
The voice ID may be a voice ID corresponding to second voice feature information found by the source terminal device in a correspondence between the recorded voice ID and the voice feature information, and the second voice feature information may be voice feature information corresponding to the source terminal device generated by the source terminal device according to the collected voice data.
There are various ways of controlling the destination terminal device to obtain the voice data corresponding to the voice ID according to the voice ID. For example, the voice data corresponding to the received voice ID may be found from the correspondence between the voice ID recorded by the server and the voice data, and the voice data may be transmitted to the destination terminal device. For another example, the server may forward the received voice ID to the destination terminal device, so that the destination terminal device finds the voice data corresponding to the received voice ID in the correspondence between the recorded voice ID and the voice data.
Of course, the above manner is not limited, and the correspondence between the voice ID and the voice data may be recorded in other places, for example, in the cloud space, and the server may forward the received voice ID to the cloud space, and after the cloud space finds the voice data corresponding to the received voice ID from the correspondence between the recorded voice ID and the voice data, forward the found voice data to the destination terminal device.
After the destination terminal device obtains the voice data corresponding to the voice ID, it may play the obtained voice data. The destination terminal device may have a voice playing module that plays the voice data, for example by invoking a voice player.
In step T200, when voice data sent by the source terminal device is received, the voice data is forwarded to the destination terminal device.
The received voice data may be, for example, voice data sent because the source terminal device did not find a voice ID corresponding to second voice feature information in the correspondence between its recorded voice IDs and voice feature information, where the second voice feature information is generated by the source terminal device according to the collected voice data.
When voice data is received, it is forwarded to the destination terminal device. After the destination terminal device obtains the voice data, it may play it, for example through a voice playing module that invokes a voice player.
In this embodiment, in some cases, only a voice ID needs to be transmitted between the source terminal device and the server, and the server may control the destination terminal device to obtain voice data corresponding to the voice ID according to the voice ID, so that the data transmission amount between the source terminal device and the server may be greatly reduced.
In one embodiment, in step T100, the step of controlling the destination terminal device to obtain the voice data corresponding to the voice ID according to the voice ID may include the following steps:
T101: searching for the voice data corresponding to each received voice ID in the correspondence between the recorded voice IDs and voice data;
T102: if the number of received voice IDs is 1, forwarding the found voice data to the destination terminal device;
T103: if the number of received voice IDs is greater than 1, synthesizing the found voice data corresponding to each voice ID, and forwarding the synthesized voice data to the destination terminal device.
Optionally, the correspondence between voice IDs and voice feature information in the source terminal device and the correspondence between voice IDs and voice data in the server are learned and recorded synchronously. When a voice ID is received, this indicates that the correspondence between that voice ID and the second voice feature information has already been recorded in the source terminal device, so the correspondence between the voice ID and the voice data has generally already been recorded in the server.
Thus, in this embodiment, the voice data corresponding to the received voice ID may be found in the corresponding relationship between the recorded voice ID and the voice data.
Optionally, the source terminal device may collect a plurality of pieces of voice data at a time. Taking an interphone as the source terminal device as an example: when a person turns on the interphone, speaks N sentences into it, and then turns it off, two sentences separated by more than a preset interval, such as 0.5 s, can be determined to be two pieces of voice data; the interphone can then be determined to have collected N pieces of voice data, where N is greater than 1.
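The interval-based piece determination described above can be sketched as follows, assuming each sentence is represented by hypothetical (start, end) timestamps and using the 0.5 s interval from the example; the actual silence-detection mechanism is not specified in the text.

```python
def split_into_pieces(utterances, gap_threshold=0.5):
    """Group (start, end) utterances: a silence gap longer than the
    threshold starts a new piece of voice data."""
    pieces = []
    current = []
    for start, end in utterances:
        if current and start - current[-1][1] > gap_threshold:
            pieces.append(current)  # gap too long: close the current piece
            current = []
        current.append((start, end))
    if current:
        pieces.append(current)
    return pieces

# Three sentences; the gap between the first and second exceeds 0.5 s,
# while the second and third are only 0.2 s apart and form one piece.
utts = [(0.0, 1.0), (2.0, 3.0), (3.2, 4.0)]
```

Real audio would of course be segmented from the waveform itself, not from pre-computed timestamps.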
In this case, voice feature information corresponding to each piece of voice data may be generated, where the voice feature information characterizes the corresponding voice data. When voice feature information is generated for a plurality of pieces of voice data, a plurality of voice IDs may be found, and the source terminal device may send the plurality of voice IDs to the server; of course, only one voice ID may also be found.
If the number of received voice IDs is 1, the found voice data is forwarded to the destination terminal device. If the number of received voice IDs is greater than 1, the voice data corresponding to the found voice IDs are synthesized, and the synthesized voice data is forwarded to the destination terminal device.
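Steps T101–T103 might look like the following sketch, where simple byte concatenation stands in for the unspecified synthesis of multiple pieces of voice data; all names are illustrative.

```python
def resolve_and_forward(id_to_data, voice_ids, forward):
    """T101: look up each received voice ID; T102/T103: forward one piece
    directly, or synthesize (here: concatenate) several before forwarding."""
    found = [id_to_data[vid] for vid in voice_ids if vid in id_to_data]
    if len(found) == 1:
        forward(found[0])          # T102: single ID, forward directly
    elif len(found) > 1:
        forward(b"".join(found))   # T103: synthesize, then forward

sent = []
table = {1: b"good ", 2: b"morning"}
resolve_and_forward(table, [1, 2], sent.append)
```

Concatenation is only a placeholder: actual synthesis of encoded voice frames would depend on the codec in use.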
Optionally, if the server receives both a voice ID and voice data (where the voice ID does not correspond to the received voice data but to other voice data), then after the voice data corresponding to the voice ID is found, the found voice data and the received voice data may be synthesized, and the synthesized voice data forwarded to the destination terminal device.
In this embodiment, only the voice ID needs to be transmitted between the source terminal device and the server, and the corresponding voice data does not need to be transmitted. Since the data size of a voice ID is much smaller than that of the voice data, the amount of data transmitted between the source terminal device and the server can be greatly reduced, while the destination terminal device can still obtain the required voice data.
In one embodiment, in step T100, the step of controlling the destination terminal device to obtain the voice data corresponding to the voice ID according to the voice ID may include the following steps:
T104: when it is determined that the recorded correspondence between the received voice ID and the voice data has been sent to the destination terminal, forwarding the received voice ID to the destination terminal device, so that the destination terminal device plays the corresponding voice data according to the received voice ID.
Optionally, the correspondence between voice IDs and voice feature information in the source terminal device and the correspondence between voice IDs and voice data in the server are learned and recorded synchronously. The server may synchronize its recorded correspondence between voice IDs and voice data to other terminal devices (including the destination terminal device) when idle, and may mark a correspondence as synchronized with synchronization identification information once synchronization completes.
Thus, in this embodiment, whether the recorded correspondence between the received voice ID and the voice data has been sent to the destination terminal may be determined according to the synchronization identification information; if it has, the received voice ID is forwarded to the destination terminal device, so that the destination terminal device plays the corresponding voice data according to the received voice ID.
When playing the corresponding voice data according to the received voice ID, the destination terminal device may look up the voice data corresponding to the received voice ID in its recorded correspondence between voice IDs and voice data. When one piece of voice data is found, it can be played directly; when two or more pieces are found, they can be synthesized and the synthesized voice data played.
In this embodiment, only the voice ID needs to be transmitted between the source terminal device and the server, and likewise between the server and the destination terminal device. In the voice transmission process, the amount of data transmitted between the source terminal device and the server, and between the server and the destination terminal device, can thus be greatly reduced; in other words, this manner further reduces the required data transmission amount, while the destination terminal device can still obtain the required voice data.
In one embodiment, in case of receiving voice data sent by the source terminal device, the method further comprises:
T210: when voice characteristic information corresponding to the voice data sent by the source terminal device is received (the source terminal device sends the voice characteristic information when the voice transmission mode of the device is a set first mode), allocating a corresponding voice ID according to the voice characteristic information and returning it to the source terminal device; and recording the correspondence between the voice ID and the voice data.
When voice data is received together with its corresponding voice characteristic information sent by the source terminal device, this indicates that the source terminal device is currently in the first mode. The first mode is the mode the source terminal device enters when it detects that its network state is normal; in this case, the relevant information of the voice data needs to be learned.
After receiving the voice data and the corresponding voice feature information, the server may allocate a corresponding voice ID according to the voice feature information; the voice ID identifies the voice data. After allocating the voice ID, the server may record the correspondence between the voice ID and the voice data locally (or the correspondence among the voice ID, the voice data, and the voice feature information), return the voice ID to the source terminal device, and forward the voice data to the destination terminal device.
After receiving the voice ID, the source terminal device may record the correspondence between the voice ID and the voice feature information in the device. When the same voice data is collected later, the source terminal device can find the corresponding voice ID in this correspondence according to the corresponding voice feature information and send the voice ID to the server; see the description of step T100 in the foregoing embodiment.
The voice ID may uniquely identify the voice data. In general, a plurality of terminal devices are connected to a server, and uniqueness across all terminal devices must be ensured when voice IDs are assigned, so uniqueness is more easily ensured by having the server assign them. For example, the voice IDs may be assigned sequentially as 0, 1, 2, 3, 4, …, which keeps the handling simple. The specific allocation mode is not limited, as long as voice data of different contents spoken by different people are allocated different voice IDs, i.e., different voice IDs are allocated for different voice feature information.
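The sequential allocation example (0, 1, 2, …), combined with the rule that different voice feature information receives different voice IDs, can be sketched as follows; `IdAllocator` and its method names are assumptions, not part of the patent.

```python
class IdAllocator:
    """Server-side allocator: one unique sequential ID per distinct piece
    of voice feature information, across all connected terminals."""

    def __init__(self):
        self.next_id = 0
        self.feature_to_id = {}  # voice feature information -> voice ID

    def allocate(self, feature_info):
        # Different feature information always gets a different ID; the
        # same feature information maps back to the ID already assigned.
        if feature_info not in self.feature_to_id:
            self.feature_to_id[feature_info] = self.next_id
            self.next_id += 1
        return self.feature_to_id[feature_info]

alloc = IdAllocator()
```

Keeping the counter on the server is what makes cross-terminal uniqueness trivial, as the paragraph above notes.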
In one embodiment, the method further comprises:
T300: sending the correspondence between the locally recorded voice IDs and voice data to the destination terminal device, so that the destination terminal device can find the corresponding voice data according to the voice ID when receiving the voice ID.
When the server is idle, it may send the recorded correspondence between voice IDs and voice data to each connected terminal device (including the destination terminal device). Idle here may refer to times when no voice data or other information needs to be transmitted. Optionally, after sending the correspondence to each terminal device, the server may delete the locally recorded correspondence, or mark it as synchronized with synchronization identification information.
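The idle-time synchronization with a synchronization flag might be sketched like this; the record layout and names are assumptions, and the alternative of deleting records after sending is omitted.

```python
def sync_when_idle(records, terminals):
    """records: {voice_id: {"data": ..., "synced": bool}};
    terminals: list of per-terminal dicts receiving voice_id -> data."""
    for vid, rec in records.items():
        if not rec["synced"]:
            for term in terminals:
                term[vid] = rec["data"]  # push the correspondence
            rec["synced"] = True         # mark via synchronization flag

recs = {1: {"data": b"hello", "synced": False}}
dest = {}
sync_when_idle(recs, [dest])
```

Marking rather than deleting lets the server later decide, per voice ID, whether forwarding just the ID is safe.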
After obtaining the correspondence between voice IDs and voice data from the server, the destination terminal device may record it. When the source terminal device finds locally the voice ID corresponding to the voice data to be sent (there may be multiple pieces of voice data, so multiple voice IDs may be found), it may send the voice IDs to the server, which may forward them to the destination terminal device.
When the destination terminal device receives at least one voice ID sent by the server, it can look up the voice data corresponding to each voice ID in its recorded correspondence between voice IDs and voice data. If only one voice ID is received, only one piece of voice data is found, and it is played directly. If two or more voice IDs are received, the pieces of voice data corresponding to the found voice IDs are synthesized into one complete piece of voice data, which is then played; the synthesis mode is not limited.
In this embodiment, the server may synchronize the correspondence between the voice ID and the voice data to the destination terminal device, so that the subsequent destination terminal device may directly use the correspondence between the voice ID and the voice data to implement voice transmission.
The present application also provides a voice transmission apparatus, applied to a terminal device, referring to fig. 2, the voice transmission apparatus 100 includes:
the voice characteristic information generating module 101 is configured to generate corresponding first voice characteristic information according to the collected first voice data to be sent to the destination terminal device;
a voice ID searching module 102, configured to search a first voice ID corresponding to the first voice feature information in a corresponding relationship between a recorded voice ID and voice feature information;
the first voice transmission module 103 is configured to send the first voice ID to a server if the first voice ID is found, so that the server controls the destination terminal device according to the first voice ID to obtain first voice data corresponding to the first voice ID;
and the second voice transmission module 104 is configured to send first voice data to the server if the first voice ID is not found, so that the server forwards the first voice data to the destination terminal device.
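The four modules above, taken together, amount to the following terminal-side flow; this is a hedged sketch with hypothetical names (`send_voice`, `generate_features`) and message shapes, not the patent's actual interfaces.

```python
def send_voice(voice_data, id_by_feature, send_to_server, generate_features):
    """Terminal side: generate feature information (module 101), look up a
    recorded voice ID (module 102), then send either the ID (module 103)
    or the raw voice data (module 104)."""
    feature = generate_features(voice_data)
    voice_id = id_by_feature.get(feature)
    if voice_id is not None:
        send_to_server({"voice_id": voice_id})     # ID found: send ID only
    else:
        send_to_server({"voice_data": voice_data}) # no ID: send full data

outbox = []
known = {"feat:hi": 5}  # recorded voice feature information -> voice ID
send_voice(b"hi", known, outbox.append, lambda d: "feat:" + d.decode())
send_voice(b"new", known, outbox.append, lambda d: "feat:" + d.decode())
```

The lambda stands in for the voiceprint-plus-encoding feature generation described in the next embodiment.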
In one embodiment, when the voice feature information generating module generates the corresponding first voice feature information according to the collected first voice data to be sent to the destination terminal device, the voice feature information generating module is specifically configured to:
Voiceprint recognition is carried out on the collected first voice data, and corresponding voiceprint information is obtained;
encoding the acquired first voice data to obtain encoded information, wherein the encoded information at least comprises: syllable coding information and/or semantic coding information; the syllable coding information is syllable information identified according to a syllable identification mode, and the semantic coding information is semantic information identified according to a semantic identification mode;
and determining the first voice characteristic information according to the voiceprint information and the coding information.
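One plausible way to determine feature information from voiceprint information and coding information is to combine them into a single key, e.g. via a hash; the real recognizers and the exact combination rule are not given in the text, so everything here is an assumption.

```python
import hashlib

def make_feature_info(voiceprint_info, syllable_info, semantic_info=""):
    """Combine who spoke (voiceprint) with what was said (syllable and/or
    semantic coding), so the same words from the same speaker always map
    to one feature key."""
    combined = "|".join([voiceprint_info, syllable_info, semantic_info])
    return hashlib.sha256(combined.encode("utf-8")).hexdigest()

f1 = make_feature_info("speaker-A", "ni-hao")
f2 = make_feature_info("speaker-A", "ni-hao")  # same speaker, same words
f3 = make_feature_info("speaker-B", "ni-hao")  # different speaker
```

A stable key like this is what makes the ID lookup in module 102 possible across collections.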
In one embodiment,
the apparatus further comprises: the voice transmission mode determining module is used for determining a voice transmission mode for voice transmission according to the detected network state of the equipment;
in the case that the first voice ID is not found, the second voice transmission module is further configured to:
if the voice transmission mode is a set first mode, further sending the first voice characteristic information to the server according to the first mode, so that the server allocates a corresponding first voice ID according to the first voice characteristic information;
and acquiring the first voice ID from the server, and recording the corresponding relation between the first voice ID and the first voice characteristic information.
In one embodiment, the apparatus further comprises:
the first voice playing module is used for receiving second voice data sent by the server and playing the second voice data.
In one embodiment, the apparatus further comprises:
the corresponding relation acquisition module is used for acquiring and recording the corresponding relation between the voice ID and the voice data from the server;
the voice data searching module is used for searching the voice data corresponding to the received second voice ID according to the corresponding relation between the recorded voice ID and the voice data when at least one second voice ID sent by the server is received;
the second voice playing module is used for playing the searched voice data if 1 second voice ID is received;
and the third voice playing module is used for synthesizing the voice data corresponding to the searched second voice IDs and playing the synthesized voice data if more than two second voice IDs are received.
The application also provides a voice transmission device, which is applied to a server, and comprises:
the third voice transmission module is used for controlling the destination terminal equipment to obtain voice data corresponding to the voice ID according to the voice ID under the condition that the voice ID sent by the source terminal equipment is received;
And the fourth voice transmission module is used for forwarding the voice data to the destination terminal equipment under the condition of receiving the voice data sent by the source terminal equipment.
In one embodiment, the third voice transmission module is specifically configured to, when controlling the destination terminal device according to the voice ID to obtain the voice data corresponding to the voice ID:
searching the voice data corresponding to the received voice ID in the corresponding relation between the recorded voice ID and the voice data;
if the number of the received voice IDs is 1, forwarding the searched voice data to the target terminal equipment;
if the number of the received voice IDs is greater than 1, synthesizing the voice data corresponding to the searched voice IDs, and forwarding the synthesized voice data to the target terminal equipment.
In one embodiment, the third voice transmission module is specifically configured to, when controlling the destination terminal device according to the voice ID to obtain the voice data corresponding to the voice ID:
when it is determined that the recorded correspondence between the received voice ID and the voice data has been sent to the destination terminal, forward the received voice ID to the destination terminal device, so that the destination terminal device plays the corresponding voice data according to the received voice ID.
In one embodiment, in case of receiving voice data sent by the source terminal device, the fourth voice transmission module is further configured to:
when voice characteristic information corresponding to the voice data sent by the source terminal device is received, where the source terminal device sends the voice characteristic information when the voice transmission mode of the device is a set first mode, allocate a corresponding voice ID according to the voice characteristic information and return the voice ID to the source terminal device; and record the correspondence between the voice ID and the voice data.
In one embodiment, the apparatus further comprises:
and the corresponding relation transmitting module is used for transmitting the corresponding relation between the locally recorded voice ID and the voice data to the destination terminal equipment so that the destination terminal equipment can find the corresponding voice data according to the voice ID when receiving the voice ID.
The implementation process of the functions and roles of each unit in the above device is specifically shown in the implementation process of the corresponding steps in the above method, and will not be described herein again.
For the apparatus embodiments, since they essentially correspond to the method embodiments, reference may be made to the description of the method embodiments for the relevant points. The apparatus embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units.
The application also provides electronic equipment, which comprises a processor and a memory; the memory stores a program that can be called by the processor; wherein the processor, when executing the program, implements the voice transmission method as described in the foregoing embodiment.
The embodiments of the voice transmission apparatus can be applied to electronic devices. Taking software implementation as an example, the apparatus in a logical sense is formed by the processor of the electronic device where it is located reading corresponding computer program instructions from nonvolatile memory into memory and running them. In terms of hardware, fig. 5 is a hardware structure diagram of an electronic device where the voice transmission apparatus 100 is located according to an exemplary embodiment of the present application; besides the processor 510, the memory 530, the interface 520, and the nonvolatile storage 540 shown in fig. 5, the electronic device where the apparatus 100 is located generally includes other hardware according to its actual functions, which will not be described here.
The present application also provides a machine-readable storage medium having stored thereon a program which, when executed by a processor, implements a voice transmission method as described in the foregoing embodiments.
The present application may take the form of a computer program product embodied on one or more storage media (including, but not limited to, magnetic disk storage, CD-ROM, optical storage, etc.) having program code embodied therein. Machine-readable storage media include both permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of machine-readable storage media include, but are not limited to: phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, may be used to store information that may be accessed by the computing device.
The foregoing description of the preferred embodiments of the present invention is not intended to limit the invention to the precise form disclosed, and any modifications, equivalents, improvements and alternatives falling within the spirit and principles of the present invention are intended to be included within the scope of the present invention.

Claims (12)

1. A voice transmission method, applied to a terminal device, the method comprising:
generating corresponding first voice characteristic information according to the acquired first voice data to be sent to the target terminal equipment;
searching a first voice ID corresponding to the first voice characteristic information in the corresponding relation between the recorded voice ID and the voice characteristic information;
if the first voice ID is found, the first voice ID is sent to a server side, so that the server side controls the destination terminal equipment according to the first voice ID to obtain first voice data corresponding to the first voice ID;
and if the first voice ID is not found, sending first voice data to the server side so that the server side can forward the first voice data to the target terminal equipment.
2. The voice transmission method according to claim 1, wherein the generating the corresponding first voice feature information according to the collected first voice data to be sent to the destination terminal device includes:
voiceprint recognition is carried out on the collected first voice data, and corresponding voiceprint information is obtained;
encoding the acquired first voice data to obtain encoded information, wherein the encoded information at least comprises: syllable coding information and/or semantic coding information; the syllable coding information is syllable information identified according to a syllable identification mode, and the semantic coding information is semantic information identified according to a semantic identification mode;
And determining the first voice characteristic information according to the voiceprint information and the coding information.
3. The voice transmission method of claim 1, wherein,
the method further comprises the following steps: determining a voice transmission mode for voice transmission according to the detected network state of the device;
in the case that the first voice ID is not found, the method further includes:
if the voice transmission mode is a set first mode, further sending the first voice characteristic information to the server according to the first mode, so that the server allocates a corresponding first voice ID according to the first voice characteristic information;
and acquiring the first voice ID from the server, and recording the corresponding relation between the first voice ID and the first voice characteristic information.
4. A method of voice transmission according to any one of claims 1 to 3, further comprising:
acquiring and recording the corresponding relation between the voice ID and the voice data from the server;
when at least one second voice ID sent by the server is received, searching voice data corresponding to the received second voice ID according to the corresponding relation between the recorded voice ID and the voice data;
If 1 second voice ID is received, playing the searched voice data;
if more than two second voice IDs are received, synthesizing the searched voice data corresponding to each second voice ID, and playing the synthesized voice data.
5. A voice transmission method, characterized in that it is applied to a server, the method comprising:
under the condition that a voice ID sent by a source terminal device is received, controlling a target terminal device to obtain voice data corresponding to the voice ID according to the voice ID;
and forwarding the voice data to the destination terminal equipment under the condition that the voice data sent by the source terminal equipment are received.
6. The voice transmission method according to claim 5, wherein controlling a destination terminal device to obtain voice data corresponding to the voice ID according to the voice ID comprises:
searching the voice data corresponding to the received voice ID in the corresponding relation between the recorded voice ID and the voice data;
if the number of the received voice IDs is 1, forwarding the searched voice data to the destination terminal device;
if the number of the received voice IDs is greater than 1, synthesizing the voice data corresponding to the searched voice IDs, and forwarding the synthesized voice data to the destination terminal device.
7. The voice transmission method according to claim 5, wherein controlling a destination terminal device to obtain voice data corresponding to the voice ID according to the voice ID comprises:
when it is determined that the recorded correspondence between the received voice ID and the voice data has been sent to the destination terminal, forwarding the received voice ID to the destination terminal device, so that the destination terminal device plays the corresponding voice data according to the received voice ID.
8. The voice transmission method according to claim 5, wherein in the case of receiving voice data transmitted from the source terminal device, the method further comprises:
when voice characteristic information corresponding to the voice data sent by the source terminal device is received, wherein the source terminal device sends the voice characteristic information when the voice transmission mode of the device is a set first mode, allocating a corresponding voice ID according to the voice characteristic information and returning the voice ID to the source terminal device; and recording the correspondence between the voice ID and the voice data.
9. A voice transmission apparatus, characterized in that it is applied to a terminal device, the apparatus comprising:
the voice characteristic information generation module is used for generating corresponding first voice characteristic information according to the acquired first voice data to be sent to the target terminal equipment;
The voice ID searching module is used for searching a first voice ID corresponding to the first voice characteristic information in the corresponding relation between the recorded voice ID and the voice characteristic information;
the first voice transmission module is used for sending the first voice ID to a server if the first voice ID is found, so that the server controls the destination terminal equipment to obtain first voice data corresponding to the first voice ID according to the first voice ID;
and the second voice transmission module is used for sending the first voice data to the server if the first voice ID is not found, so that the server forwards the first voice data to the target terminal equipment.
10. A voice transmission device, for application to a server, the device comprising:
the third voice transmission module is used for controlling the destination terminal equipment to obtain voice data corresponding to the voice ID according to the voice ID under the condition that the voice ID sent by the source terminal equipment is received;
and the fourth voice transmission module is used for forwarding the voice data to the destination terminal equipment under the condition of receiving the voice data sent by the source terminal equipment.
11. An electronic device, comprising a processor and a memory; the memory stores a program that can be called by the processor; wherein the processor, when executing the program, implements the voice transmission method according to any one of claims 1-8.
12. A machine readable storage medium having stored thereon a program which, when executed by a processor, implements the speech transmission method according to any of claims 1-8.
CN202010501279.7A 2020-06-04 2020-06-04 Voice transmission method, device and equipment and storage medium Active CN111785293B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010501279.7A CN111785293B (en) 2020-06-04 2020-06-04 Voice transmission method, device and equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111785293A CN111785293A (en) 2020-10-16
CN111785293B true CN111785293B (en) 2023-04-25

Family

ID=72754598

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010501279.7A Active CN111785293B (en) 2020-06-04 2020-06-04 Voice transmission method, device and equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111785293B (en)

Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE10310257A1 (en) * 2003-03-05 2004-09-16 Cit Gmbh User access verification method e.g. for application server via data network, involves setting up communication link to voice communication terminal of user
CN102348095A (en) * 2011-09-14 2012-02-08 宋健 Method for keeping stable transmission of images in mobile equipment video communication
CN104700836A (en) * 2013-12-10 2015-06-10 阿里巴巴集团控股有限公司 Voice recognition method and voice recognition system
CN105072015A (en) * 2015-06-30 2015-11-18 网易(杭州)网络有限公司 Voice information processing method, server, and terminal
CN105206273A (en) * 2015-09-06 2015-12-30 上海智臻智能网络科技股份有限公司 Voice transmission control method and system
CN106098069A (en) * 2016-06-21 2016-11-09 佛山科学技术学院 A kind of identity identifying method and terminal unit
CN106230689A (en) * 2016-07-25 2016-12-14 北京奇虎科技有限公司 Method, device and the server that a kind of voice messaging is mutual
WO2017000481A1 (en) * 2015-06-29 2017-01-05 中兴通讯股份有限公司 Dialing method and apparatus for voice call
CN107767872A (en) * 2017-10-13 2018-03-06 深圳市汉普电子技术开发有限公司 Audio recognition method, terminal device and storage medium
CN108022600A (en) * 2017-10-26 2018-05-11 珠海格力电器股份有限公司 Equipment control method and device, storage medium and server
CN108023941A (en) * 2017-11-23 2018-05-11 阿里巴巴集团控股有限公司 Sound control method and device and electronic equipment
CN109117235A (en) * 2018-08-24 2019-01-01 腾讯科技(深圳)有限公司 A kind of business data processing method, device and relevant device
CN109639738A (en) * 2019-01-30 2019-04-16 维沃移动通信有限公司 The method and terminal device of voice data transmission
CN109724215A (en) * 2018-06-27 2019-05-07 平安科技(深圳)有限公司 Air conditioning control method, air conditioning control device, air-conditioning equipment and storage medium
CN110364170A (en) * 2019-05-29 2019-10-22 平安科技(深圳)有限公司 Voice transmission method, device, computer installation and storage medium
CN110867188A (en) * 2018-08-13 2020-03-06 珠海格力电器股份有限公司 Method and device for providing content service, storage medium and electronic device
CN110875041A (en) * 2018-08-29 2020-03-10 阿里巴巴集团控股有限公司 Voice control method, device and system

Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE10310257A1 (en) * 2003-03-05 2004-09-16 Cit Gmbh User access verification method e.g. for application server via data network, involves setting up communication link to voice communication terminal of user
CN102348095A (en) * 2011-09-14 2012-02-08 Song Jian Method for keeping stable transmission of images in mobile equipment video communication
CN104700836A (en) * 2013-12-10 2015-06-10 Alibaba Group Holding Ltd Voice recognition method and voice recognition system
WO2017000481A1 (en) * 2015-06-29 2017-01-05 ZTE Corporation Dialing method and apparatus for voice call
CN105072015A (en) * 2015-06-30 2015-11-18 NetEase (Hangzhou) Network Co Ltd Voice information processing method, server, and terminal
CN105206273A (en) * 2015-09-06 2015-12-30 Shanghai Zhizhen Intelligent Network Technology Co Ltd Voice transmission control method and system
CN106098069A (en) * 2016-06-21 2016-11-09 Foshan University Identity authentication method and terminal device
CN106230689A (en) * 2016-07-25 2016-12-14 Beijing Qihoo Technology Co Ltd Voice information interaction method, device and server
CN107767872A (en) * 2017-10-13 2018-03-06 Shenzhen Hampoo Electronic Technology Development Co Ltd Speech recognition method, terminal device and storage medium
CN108022600A (en) * 2017-10-26 2018-05-11 Gree Electric Appliances Inc of Zhuhai Equipment control method and device, storage medium and server
CN108023941A (en) * 2017-11-23 2018-05-11 Alibaba Group Holding Ltd Voice control method and device, and electronic device
CN109724215A (en) * 2018-06-27 2019-05-07 Ping An Technology (Shenzhen) Co Ltd Air conditioning control method, air conditioning control device, air conditioning equipment and storage medium
CN110867188A (en) * 2018-08-13 2020-03-06 Gree Electric Appliances Inc of Zhuhai Method and device for providing content service, storage medium and electronic device
CN109117235A (en) * 2018-08-24 2019-01-01 Tencent Technology (Shenzhen) Co Ltd Service data processing method, device and related equipment
CN110875041A (en) * 2018-08-29 2020-03-10 Alibaba Group Holding Ltd Voice control method, device and system
CN109639738A (en) * 2019-01-30 2019-04-16 Vivo Mobile Communication Co Ltd Voice data transmission method and terminal device
CN110364170A (en) * 2019-05-29 2019-10-22 Ping An Technology (Shenzhen) Co Ltd Voice transmission method, device, computer device and storage medium

Also Published As

Publication number Publication date
CN111785293A (en) 2020-10-16

Similar Documents

Publication Publication Date Title
Tao et al. End-to-end audiovisual speech recognition system with multitask learning
CN109643158B (en) Command processing using multi-modal signal analysis
US9602938B2 (en) Sound library and method
CN1910654B (en) Method and system for determining the topic of a conversation and obtaining and presenting related content
US8086461B2 (en) System and method for tracking persons of interest via voiceprint
US11687526B1 (en) Identifying user content
CN107644637B (en) Phoneme synthesizing method and device
JPH06110755A (en) System for indexing data set
JP2014519071A (en) Search system and method using acoustic context
JP6432177B2 (en) Interactive communication system, terminal device and program
US20220414351A1 (en) Methods and systems for control of content in an alternate language or accent
CN106486134B (en) Language state determination device and method
JP2006279111A (en) Information processor, information processing method and program
Wyatt et al. A Privacy-Sensitive Approach to Modeling Multi-Person Conversations.
CN111785293B (en) Voice transmission method, device and equipment and storage medium
CN110874554B (en) Action recognition method, terminal device, server, system and storage medium
US20210327423A1 (en) Method and system for monitoring content of a communication session over a network
CN112017655B (en) Intelligent voice recording and playback method and system thereof
CN109522799A (en) Information cuing method, device, computer equipment and storage medium
JP2014089524A (en) Message management device, message presentation device, message presentation system, control method and program for message management device and message presentation device, and recording medium
WO2019150708A1 (en) Information processing device, information processing system, information processing method, and program
JPWO2019207918A1 (en) Information processing equipment, information processing methods and programs
CN112634879B (en) Voice conference management method, device, equipment and medium
CN111523428B (en) Self-rescue prompting method in disasters, electronic equipment and storage medium
JP7166139B2 (en) Information processing system and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant