CN111785293A - Voice transmission method, device and equipment and storage medium - Google Patents


Info

Publication number: CN111785293A
Application number: CN202010501279.7A
Authority: CN (China)
Prior art keywords: voice, voice data, server, data, terminal equipment
Legal status: Granted (active)
Other languages: Chinese (zh)
Other versions: CN111785293B
Inventor: 毛恩云
Current Assignee: Hangzhou Hikvision System Technology Co Ltd
Original Assignee: Hangzhou Hikvision System Technology Co Ltd
Application filed by Hangzhou Hikvision System Technology Co Ltd
Priority to CN202010501279.7A; publication of CN111785293A; application granted; publication of CN111785293B

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L19/00: Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; coding or decoding of speech or audio signals using source filter models or psychoacoustic analysis
    • G10L19/04: Analysis-synthesis using predictive techniques
    • G10L19/16: Vocoder architecture
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/70: Reducing energy consumption in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application provides a voice transmission method, apparatus, device, and storage medium, which can greatly reduce the amount of transmitted data. The voice transmission method is applied to a terminal device and comprises the following steps: generating corresponding first voice feature information from collected first voice data to be sent to a destination terminal device; searching for a first voice ID corresponding to the first voice feature information in a recorded correspondence between voice IDs and voice feature information; if the first voice ID is found, sending the first voice ID to a server, so that the server controls the destination terminal device to obtain, according to the first voice ID, the first voice data corresponding to the first voice ID; and if the first voice ID is not found, sending the first voice data to the server, so that the server forwards the first voice data to the destination terminal device.

Description

Voice transmission method, device and equipment and storage medium
Technical Field
The present application relates to the field of voice technologies, and in particular, to a voice transmission method, apparatus and device, and a storage medium.
Background
Voice transmission is needed in many voice communication scenarios, such as fixed-line phone calls, mobile phone calls, intercom calls, and network voice chat. Voice transmission usually relies on a network, and network quality largely determines voice quality. Conventionally, regardless of the network state, a terminal device sends collected voice data directly to a server, and the server forwards it to other terminal devices. This produces a large amount of transmitted data, which degrades voice transmission quality; in particular, when the network is already abnormal, for example congested, it is likely to aggravate the abnormality, leading to very poor transmission quality such as stuttering, loss, and errors.
One existing improvement is that, when the network is abnormal, the terminal device compresses the voice data and sends the compressed data to the server for forwarding to other terminal devices. Although this reduces the amount of transmitted data to some extent, it lowers intelligibility, and the amount of data is still large.
Disclosure of Invention
In view of the above, the present application provides a voice transmission method, apparatus and device, and storage medium, which can greatly reduce data transmission amount.
A first aspect of the present application provides a voice transmission method applied to a terminal device, the method comprising:
generating corresponding first voice feature information from collected first voice data to be sent to a destination terminal device;
searching for a first voice ID corresponding to the first voice feature information in a recorded correspondence between voice IDs and voice feature information;
if the first voice ID is found, sending the first voice ID to a server, so that the server controls the destination terminal device to obtain, according to the first voice ID, the first voice data corresponding to the first voice ID; and
if the first voice ID is not found, sending the first voice data to the server, so that the server forwards the first voice data to the destination terminal device.
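The four steps of this first aspect reduce to a single decision on the sending terminal. The following is a minimal sketch, not the patent's implementation: `feature_of`, `id_table`, `send_id`, and `send_data` are hypothetical stand-ins for the terminal's feature generator, recorded correspondence, and transport calls.

```python
def send_voice(first_voice_data, feature_of, id_table, send_id, send_data):
    """Terminal-side send flow: transmit only a short voice ID when the
    feature information has already been learned, otherwise fall back
    to sending the full voice data."""
    # Generate first voice feature information from the collected data.
    feature = feature_of(first_voice_data)
    # Search the recorded voice ID <-> feature information correspondence.
    voice_id = id_table.get(feature)
    if voice_id is not None:
        send_id(voice_id)        # found: only the ID goes to the server
        return "sent-id"
    send_data(first_voice_data)  # not found: send the raw voice data
    return "sent-data"
```

Because a voice ID is only a few bytes while raw audio is kilobytes per second, the found branch is what reduces the amount of transmitted data.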
According to an embodiment of the present application, generating the corresponding first voice feature information from the collected first voice data to be sent to the destination terminal device includes:
performing voiceprint recognition on the collected first voice data to obtain corresponding voiceprint information;
encoding the collected first voice data to obtain coding information, the coding information comprising at least syllable coding information and/or semantic coding information, where the syllable coding information is syllable information obtained by syllable recognition and the semantic coding information is semantic information obtained by semantic recognition; and
determining the first voice feature information from the voiceprint information and the coding information.
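As a rough sketch of this embodiment, the feature information can be thought of as a pair (voiceprint information, coding information). Real voiceprint recognition and syllable/semantic recognition are far more involved; the hash-based helpers below are hypothetical placeholders that merely make the combination concrete.

```python
import hashlib

def voice_feature_info(voice_data: bytes) -> str:
    """Toy feature-information builder: combine a 'voiceprint' part
    (who is speaking) with a 'coding' part (what is being said).
    Both parts are stand-in hashes, not real recognizers."""
    voiceprint = hashlib.sha256(voice_data[:160]).hexdigest()[:8]
    coding = hashlib.sha256(voice_data).hexdigest()[:8]
    # The first voice feature information is determined from both the
    # voiceprint information and the coding information.
    return f"{voiceprint}:{coding}"
```

The point of combining both parts is that identical wording spoken by different people should map to different feature information, and hence to different voice IDs.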
According to one embodiment of the present application,
the method further comprises the following steps: determining a voice transmission mode for voice transmission according to the detected network state of the equipment;
in the case where the first voice ID is not found, the method further includes:
if the voice transmission mode is a set first mode, further sending the first voice feature information to the server according to the first mode, so that the server allocates a corresponding first voice ID according to the first voice feature information;
and acquiring the first voice ID from the server, and recording the corresponding relation between the first voice ID and the first voice characteristic information.
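This first-mode learning handshake can be sketched as follows; `Server.allocate` and `learn` are illustrative names, not from the patent.

```python
class Server:
    """Minimal first-mode server sketch: allocates a voice ID for new
    feature information and records the ID <-> voice data correspondence."""
    def __init__(self):
        self._next_id = 1
        self._by_feature = {}  # feature info -> (voice ID, voice data)

    def allocate(self, feature, voice_data):
        # Allocate a new ID only for feature information not seen before.
        if feature not in self._by_feature:
            self._by_feature[feature] = (self._next_id, voice_data)
            self._next_id += 1
        return self._by_feature[feature][0]

def learn(terminal_table, server, feature, voice_data):
    """Terminal side: obtain the allocated first voice ID from the server
    and record the ID <-> feature information correspondence locally."""
    voice_id = server.allocate(feature, voice_data)
    terminal_table[feature] = voice_id
    return voice_id
```

After this exchange both sides hold a table, so the next time the same feature information appears only the voice ID needs to travel.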
According to an embodiment of the application, the method further comprises:
receiving second voice data sent by the server, and playing the second voice data.
According to an embodiment of the application, the method further comprises:
obtaining, from the server, and recording a correspondence between voice IDs and voice data;
when at least one second voice ID sent by the server is received, searching for the voice data corresponding to each received second voice ID in the recorded correspondence between voice IDs and voice data;
if one second voice ID is received, playing the found voice data; and
if two or more second voice IDs are received, synthesizing the voice data corresponding to the found second voice IDs and playing the synthesized voice data.
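The destination-terminal playback branch above can be sketched like this; byte concatenation stands in for real voice synthesis, and the function names are illustrative.

```python
def play_second_ids(second_ids, id_to_data, play):
    """Destination-terminal sketch: look each received second voice ID up
    in the correspondence obtained from the server, then play a single
    piece directly or synthesize several pieces before playing."""
    pieces = [id_to_data[vid] for vid in second_ids]
    if len(pieces) == 1:
        play(pieces[0])         # one ID: play the found voice data
    else:
        play(b"".join(pieces))  # two or more IDs: synthesize first
```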
A second aspect of the present application provides a voice transmission method applied to a server, the method comprising:
when a voice ID sent by a source terminal device is received, controlling a destination terminal device, according to the voice ID, to obtain the voice data corresponding to the voice ID; and
when voice data sent by the source terminal device is received, forwarding the voice data to the destination terminal device.
According to an embodiment of the present application, controlling the destination terminal device to obtain the voice data corresponding to the voice ID includes:
searching for the voice data corresponding to each received voice ID in a recorded correspondence between voice IDs and voice data;
if one voice ID is received, forwarding the found voice data to the destination terminal device; and
if more than one voice ID is received, synthesizing the voice data corresponding to each found voice ID and forwarding the synthesized voice data to the destination terminal device.
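The server-side counterpart can be sketched similarly; again concatenation is a stand-in for synthesis, and all names are illustrative.

```python
def forward_for_ids(voice_ids, id_to_data, forward_to_dest):
    """Server sketch: look up each received voice ID in the recorded
    ID <-> voice data correspondence and forward the found data,
    synthesizing first when more than one ID was received."""
    found = [id_to_data[vid] for vid in voice_ids if vid in id_to_data]
    if not found:
        return False  # nothing recorded for these IDs
    forward_to_dest(found[0] if len(found) == 1 else b"".join(found))
    return True
```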
According to an embodiment of the present application, controlling the destination terminal device to obtain the voice data corresponding to the voice ID includes:
when it is determined that the recorded correspondence between voice IDs and voice data has already been sent to the destination terminal device, forwarding the received voice ID to the destination terminal device, so that the destination terminal device plays the corresponding voice data according to the received voice ID.
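Combining the two embodiments, the server's relay decision hinges on whether the destination already holds the correspondence table. A minimal sketch, with hypothetical message tuples:

```python
def server_relay(voice_id, id_to_data, dest_has_table, send_to_dest):
    """If the ID <-> voice data correspondence was already sent to the
    destination terminal, forward just the voice ID (the destination
    resolves the data locally); otherwise look the data up and forward
    the voice data itself."""
    if dest_has_table:
        send_to_dest(("id", voice_id))
    else:
        send_to_dest(("data", id_to_data[voice_id]))
```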
According to an embodiment of the present application, in the case of receiving voice data sent by the source terminal device, the method further includes:
when voice feature information corresponding to the voice data is also received from the source terminal device (the source terminal device sends the voice feature information when its voice transmission mode is the set first mode), allocating a corresponding voice ID according to the voice feature information, returning the voice ID to the source terminal device, and recording the correspondence between the voice ID and the voice data.
According to an embodiment of the application, the method further comprises:
sending the locally recorded correspondence between voice IDs and voice data to the destination terminal device, so that the destination terminal device can look up the corresponding voice data when it receives a voice ID.
A third aspect of the present application provides a voice transmission apparatus, which is applied to a terminal device, and includes:
the voice characteristic information generating module is used for generating corresponding first voice characteristic information according to the collected first voice data to be sent to the target terminal equipment;
the voice ID searching module is used for searching a first voice ID corresponding to the first voice characteristic information in the corresponding relation between the recorded voice ID and the voice characteristic information;
the first voice transmission module is used for sending the first voice ID to a server if the first voice ID is found, so that the server controls the target terminal equipment to obtain first voice data corresponding to the first voice ID according to the first voice ID;
and the second voice transmission module is used for sending first voice data to the server if the first voice ID is not found, so that the server forwards the first voice data to the target terminal equipment.
According to an embodiment of the present application, when the voice feature information generating module generates corresponding first voice feature information according to the collected first voice data to be sent to the destination terminal device, the voice feature information generating module is specifically configured to:
carrying out voiceprint recognition on the collected first voice data to obtain corresponding voiceprint information;
encoding the collected first voice data to obtain encoded information, wherein the encoded information at least comprises: syllable coding information and/or semantic coding information; the syllable coding information is syllable information identified according to a syllable identification mode, and the semantic coding information is semantic information identified according to a semantic identification mode;
and determining the first voice characteristic information according to the voiceprint information and the coding information.
According to one embodiment of the present application,
the apparatus further comprises: the voice transmission mode determining module is used for determining a voice transmission mode for voice transmission according to the detected network state of the equipment;
in the case that the first voice ID is not found, the second voice transmission module is further configured to:
if the voice transmission mode is a set first mode, further sending the first voice feature information to the server according to the first mode, so that the server allocates a corresponding first voice ID according to the first voice feature information;
and acquiring the first voice ID from the server, and recording the corresponding relation between the first voice ID and the first voice characteristic information.
According to an embodiment of the application, the apparatus further comprises:
and the first voice playing module is used for receiving second voice data sent by the server and playing the second voice data.
According to an embodiment of the application, the apparatus further comprises:
the corresponding relation acquisition module is used for acquiring and recording the corresponding relation between the voice ID and the voice data from the server;
the voice data searching module is used for searching the voice data corresponding to the received second voice ID according to the recorded corresponding relation between the voice ID and the voice data when at least one second voice ID sent by the server is received;
the second voice playing module is used for playing the searched voice data if 1 second voice ID is received;
and the third voice playing module is used for synthesizing the searched voice data corresponding to each second voice ID and playing the synthesized voice data if more than two second voice IDs are received.
The fourth aspect of the present application provides a voice transmission apparatus, which is applied to a server, and the apparatus includes:
the third voice transmission module is used for controlling the destination terminal equipment to obtain voice data corresponding to the voice ID according to the voice ID under the condition of receiving the voice ID sent by the source terminal equipment;
and the fourth voice transmission module is used for forwarding the voice data to the destination terminal equipment under the condition of receiving the voice data sent by the source terminal equipment.
According to an embodiment of the present application, when the third voice transmission module controls the destination terminal device to obtain the voice data corresponding to the voice ID according to the voice ID, the third voice transmission module is specifically configured to:
searching the received voice data corresponding to the voice ID in the corresponding relation between the recorded voice ID and the voice data;
if the number of the received voice IDs is 1, forwarding the searched voice data to the target terminal equipment;
and if the number of the received voice IDs is more than 1, synthesizing the voice data corresponding to each searched voice ID, and forwarding the synthesized voice data to the target terminal equipment.
According to an embodiment of the present application, when the third voice transmission module controls the destination terminal device to obtain the voice data corresponding to the voice ID according to the voice ID, the third voice transmission module is specifically configured to:
and when the recorded corresponding relation between the received voice ID and the voice data is determined to be sent to the target terminal, forwarding the received voice ID to the target terminal equipment, so that the target terminal equipment plays the corresponding voice data according to the received voice ID.
According to an embodiment of the present application, in the case of receiving voice data sent by the source terminal device, the fourth voice transmission module is further configured to:
when voice feature information corresponding to the voice data is also received from the source terminal device (the source terminal device sends the voice feature information when its voice transmission mode is the set first mode), allocate a corresponding voice ID according to the voice feature information, return the voice ID to the source terminal device, and record the correspondence between the voice ID and the voice data.
According to an embodiment of the application, the apparatus further comprises:
and the corresponding relation sending module is used for sending the locally recorded corresponding relation between the voice ID and the voice data to the target terminal equipment so that the target terminal equipment can find the corresponding voice data according to the voice ID when receiving the voice ID.
A fifth aspect of the present application provides an electronic device, comprising a processor and a memory; the memory stores a program that can be called by the processor; wherein the processor, when executing the program, implements the voice transmission method as described in the foregoing embodiments.
A sixth aspect of the present application provides a machine-readable storage medium on which a program is stored, the program, when executed by a processor, implementing the voice transmission method as described in the foregoing embodiments.
The embodiment of the application has the following beneficial effects:
In the embodiments of the application, a correspondence between voice identifiers (IDs) and voice feature information can be learned and recorded on the terminal device. After generating first voice feature information from collected first voice data, the terminal device can search this correspondence for the corresponding first voice ID. If it is not found, the terminal device sends the first voice data to the server, and the server forwards it to the destination terminal device. If it is found, the terminal device only needs to send the first voice ID to the server, and the server controls the destination terminal device to obtain the first voice data corresponding to the first voice ID; for example, if the server has also learned the correspondence between voice IDs and voice data, it can look up the first voice data for the first voice ID in that correspondence and forward it to the destination terminal device. In this way, once a correspondence has been learned, only a short voice ID needs to be transmitted instead of the voice data itself, which greatly reduces the amount of data transmitted. This is especially suitable for voice transmission over a weak network or when the network is unstable, and the intelligibility of the voice data is not affected.
Drawings
Fig. 1 is a flowchart illustrating a voice transmission method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a voice transmission apparatus according to an embodiment of the present application;
fig. 3 is a block diagram of a voice transmission system according to an embodiment of the present application;
fig. 4 is a schematic interaction diagram among a source terminal device, a server, and a destination terminal device according to an embodiment of the present application;
fig. 5 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish elements of the same type from one another. For example, a first device may also be referred to as a second device, and similarly, a second device may also be referred to as a first device, without departing from the scope of the present application. The word "if" as used herein may be interpreted as "upon", "when", or "in response to determining", depending on the context.
In order to make the description of the present application clearer and more concise, some technical terms in the present application are explained below:
voiceprint: voiceprint (Voiceprint) is a spectrum of sound waves carrying verbal information displayed by an electro-acoustic apparatus. Modern scientific research shows that the voiceprint not only has specificity, but also has relative stability, and the voice of a human can keep relatively stable and unchanged for a long time after the human grows up.
Syllable: syllables (syllables) are the smallest phonetic unit of the combined pronunciation of a single vowel phone and a consonant phone in a phonogram, and a single vowel phone can also be self-syllabled. The Chinese (Chinese) syllable is formed by combining vowel phoneme and consonant phoneme in a phonogram language system.
Weak network: for the data definition of the weak network, the meaning defined by different applications is different, and not only the lowest rate of each type of network is considered, but also the division is carried out by combining a service scene and an application type. According to the characteristics of mobility, the general rate is lower than that of the 2G network, and the 3G network can be divided into weak networks. In addition, weak signal Wifi is also typically incorporated into the weak network.
The voice transmission method of the present application can be applied to voice communication scenarios such as patrols in districts, parks, communities, factories, prisons, and parking lots, where staff usually need to communicate by voice. Taking a factory as an example, suppose two security guards on patrol each carry a handheld terminal capable of voice communication; when one guard notices an intruder or an equipment fault, the guards can talk through their handheld terminals, which requires voice transmission. Of course, this scenario is only an example; the voice communication scenario may be, but is not limited to, a real-time communication scenario.
In the embodiments of the application, a correspondence between voice identifiers (IDs) and voice feature information can be learned and recorded on the terminal device. After generating first voice feature information from collected first voice data, the terminal device can search this correspondence for the corresponding first voice ID. If it is not found, the terminal device sends the first voice data to the server, and the server forwards it to the destination terminal device. If it is found, the terminal device only needs to send the first voice ID to the server, and the server controls the destination terminal device to obtain the first voice data corresponding to the first voice ID; for example, if the server has also learned the correspondence between voice IDs and voice data, it can look up the first voice data for the first voice ID in that correspondence and forward it to the destination terminal device. In this way, once a correspondence has been learned, only a short voice ID needs to be transmitted instead of the voice data itself, which greatly reduces the amount of data transmitted. This is especially suitable for voice transmission over a weak network or when the network is unstable, and the intelligibility of the voice data is not affected.
The voice transmission method is described in more detail below, though it is not limited to what follows. In one embodiment, referring to fig. 1, the voice transmission method is applied to a terminal device and may include the following steps:
S100: generating corresponding first voice feature information from the collected first voice data to be sent to the destination terminal device;
S200: searching for a first voice ID corresponding to the first voice feature information in the recorded correspondence between voice IDs and voice feature information;
S300: if the first voice ID is found, sending the first voice ID to a server, so that the server controls the destination terminal device to obtain, according to the first voice ID, the first voice data corresponding to the first voice ID;
S400: if the first voice ID is not found, sending the first voice data to the server, so that the server forwards the first voice data to the destination terminal device.
The voice transmission method is executed by the terminal device, and more specifically may be executed by a processing chip of the terminal device. The terminal device may be an intercom, a landline telephone, a mobile phone, a virtual machine, etc.; the specific type is not limited, as long as the device supports voice transmission and has some processing capability.
The terminal device may be provided with a voice collection module and a voice playing module. Steps S100 to S400 can be implemented with the voice collection module, in which case the terminal device acts as a source terminal device and needs to send voice data to a destination terminal device. The voice playing module may be used to play voice data, in which case the terminal device acts as a destination terminal device; for example, when it receives voice data from a source terminal device forwarded by the server, it can play the data through the voice playing module.
The terminal device is connected to the server. The server generally has large storage and processing capacity and may consist of one or more computer devices; the specific type is not limited. Besides this terminal device, the server can be connected to other terminal devices, and the number of connected terminal devices is not limited.
In the embodiments of the present application, two voice transmission modes may be set: a first mode and a second mode. The terminal device determines the voice transmission mode according to the detected network state of the device: it enters the first mode when the network state is normal, and the second mode when the network state is abnormal.
The terminal device may detect the network state of the device according to a set policy, for example periodically, or in response to a network state event triggered by the underlying system (the terminal device can listen for network state events, which indicate whether its network state is normal); the specific detection method is not limited to these. A network state abnormality may include, for example, network delay, packet loss, throttling, retransmission, and various problems caused by network congestion or instability, without being limited to the above.
The first mode and the second mode differ in that, in the first mode, the correspondence between voice IDs and voice feature information is learned while the method shown in fig. 1 is performed, whereas in the second mode the method shown in fig. 1 is performed using the correspondence already learned and recorded in the first mode. Details are given in the following embodiments.
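The mode decision described above amounts to a simple mapping from the detected network state; the state labels below are illustrative, not defined by the patent.

```python
# Example abnormal states; real detection would classify delay, loss, etc.
ABNORMAL_STATES = {"delay", "packet-loss", "throttling", "retransmission", "congestion"}

def choose_mode(network_state: str) -> str:
    """Enter the first mode (learn correspondences while transmitting)
    on a normal network, and the second mode (rely on correspondences
    learned earlier) when the network state is abnormal."""
    return "second" if network_state in ABNORMAL_STATES else "first"
```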
In step S100, corresponding first voice feature information is generated according to the collected first voice data to be sent to the destination terminal device.
The terminal device needs to send the collected first voice data to the destination terminal device, so that the first voice data is played on the destination terminal device. The destination terminal device may be any one or more of other terminal devices connected to the server.
Optionally, the first voice data may be collected by the terminal device, or may be acquired by the terminal device after being collected by an external device (such as an external microphone), and of course, the specific device by which the first voice data is collected is not limited.
Optionally, the device may collect a plurality of pieces of first voice data at a time. Taking an interphone as the terminal device: when a user turns the interphone on, speaks N sentences into it, and then turns it off, any two utterances separated by an interval longer than a preset time (for example 0.5 s) may be treated as two separate pieces of voice data, so the interphone can be considered to have collected N pieces of voice data, where N is greater than 1.
In this case, first voice feature information may be generated for each piece of first voice data, and the first voice feature information can characterize the corresponding first voice data. When first voice feature information is generated for a plurality of pieces of first voice data, the subsequent steps S200 to S400 may be performed for each piece of first voice feature information (steps S300 and S400 are mutually exclusive: one of them is selected according to the search result).
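The piece-splitting rule above (a silence gap longer than the preset 0.5 s interval starts a new piece) can be sketched as follows; the function name and the timestamp representation are illustrative assumptions.

```python
def split_into_pieces(utterances, gap_threshold=0.5):
    """Group (start, end) speech timestamps into pieces of voice data:
    a silence gap longer than gap_threshold seconds starts a new piece."""
    pieces, current, prev_end = [], [], None
    for start, end in utterances:
        if prev_end is not None and start - prev_end > gap_threshold:
            pieces.append(current)   # gap exceeded: close the current piece
            current = []
        current.append((start, end))
        prev_end = end
    if current:
        pieces.append(current)
    return pieces
```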
In step S200, a first voice ID corresponding to the first voice feature information is searched in a correspondence between the recorded voice ID and the voice feature information.
In a related approach, when the first voice data is collected, it is sent to the server directly, or sent after compression, and the server then forwards it to the destination terminal device.
In this embodiment, no matter in the first mode or the second mode, the first voice data is not directly sent to the server, but the first voice ID corresponding to the first voice feature information is searched in the correspondence between the locally recorded voice ID and the voice feature information, and the transmission mode of the first voice data is determined according to the search condition.
In step S300, if the first voice ID is found, the first voice ID is sent to a server, so that the server controls the destination terminal device to obtain first voice data corresponding to the first voice ID according to the first voice ID.
If the first voice ID is found in the recorded correspondence between voice IDs and voice feature information, the related information of the first voice data has already been learned, namely the correspondence between the first voice ID and the first voice feature information in the terminal device, and the correspondence between the first voice ID and the first voice data in the server. In this case, the terminal device only needs to send the first voice ID to the server, and the server can control the destination terminal device to obtain the first voice data corresponding to that ID.
Optionally, when the server controls the destination terminal device to obtain first voice data corresponding to the first voice ID according to the first voice ID, the server may search for the first voice data corresponding to the received first voice ID in the correspondence between the locally recorded voice ID and the voice data, and if the number of the received voice IDs is 1, forward the searched voice data to the destination terminal device; and if the number of the received voice IDs is more than 1, synthesizing the voice data corresponding to each searched voice ID, and forwarding the synthesized voice data to the target terminal equipment.
In this way, only the first voice ID needs to be transmitted between the terminal device and the server, and the first voice data itself no longer needs to be transmitted. Since the data volume of the first voice ID is much smaller than that of the first voice data, the amount of data transmitted between the terminal device and the server can be greatly reduced, while the destination terminal device still obtains the required first voice data.
Alternatively, when the server determines that the recorded correspondence between voice IDs and voice data has already been sent to the destination terminal device, the server may simply forward the received first voice ID to the destination terminal device, so that the destination terminal device plays the corresponding first voice data according to that ID. To do so, the destination terminal device searches the recorded correspondence between voice IDs and voice data for the first voice data corresponding to the received first voice ID; if one piece is found, it is played directly, and if multiple pieces are found, they are synthesized and the synthesized data is played.
In this way, only the first voice ID needs to be transmitted between the terminal device and the server, and only the first voice ID needs to be transmitted between the server and the destination terminal device, so that in the voice transmission process, the data transmission amount between the terminal device and the server can be greatly reduced, and the data transmission amount between the server and the destination terminal device can be greatly reduced.
In summary, no matter how the server controls the destination terminal device to obtain the first voice data corresponding to the first voice ID according to the first voice ID, it can be ensured that only the first voice ID needs to be transmitted between the terminal device and the server, which can greatly reduce the data transmission amount between the terminal device and the server.
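The server-side handling of received voice IDs described above can be sketched as follows; all names are illustrative assumptions, and "synthesis" of multiple pieces is modeled as simple concatenation, which is only one possible synthesis method.

```python
def forward_for_ids(voice_ids, id_to_data, send_to_destination):
    """Look up each received voice ID in the recorded ID-to-data
    correspondence; forward a single piece directly, or synthesize
    several pieces (here: concatenation) before forwarding."""
    pieces = [id_to_data[vid] for vid in voice_ids]
    payload = pieces[0] if len(pieces) == 1 else b"".join(pieces)
    send_to_destination(payload)
    return payload
```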
In step S400, if the first voice ID is not found, sending first voice data to the server, so that the server forwards the first voice data to the destination terminal device.
In the first mode or the second mode, if the terminal device does not find the first voice ID, it indicates that the relevant information of the first voice data has not been learned, so the first voice data can be directly sent to the server and forwarded to the destination terminal device by the server.
Optionally, when the first voice data is sent to the server, the first voice data may also be compressed first, and then the compressed first voice data is sent to the server, and the server forwards the compressed first voice data to the destination terminal device.
It can be understood that when the terminal device sends the first voice ID, the first voice data, or other information, it may also carry indication information (for example, the address of the destination terminal device) identifying the destination terminal device, so that the server can deliver the information to the destination terminal device indicated by that indication information. The same applies to the sending of other information and is not repeated below.
In one embodiment, the above method flow can be executed by the voice transmission apparatus 100, as shown in fig. 2, the voice transmission apparatus 100 can include 4 modules: the system comprises a voice characteristic information generation module 101, a voice ID search module 102, a first voice transmission module 103 and a second voice transmission module 104. The voice feature information generating module 101 is configured to execute the step S100, the voice ID searching module 102 is configured to execute the step S200, the first voice transmission module 103 is configured to execute the step S300, and the second voice transmission module 104 is configured to execute the step S400.
In an embodiment, in step S100, the generating corresponding first voice feature information according to the collected first voice data to be sent to the destination terminal device may include the following steps:
s101: carrying out voiceprint recognition on the collected first voice data to obtain corresponding voiceprint information;
s102: encoding the collected first voice data to obtain encoded information, wherein the encoded information at least comprises syllable coding information and/or semantic coding information; the syllable coding information is syllable information recognized by a syllable recognition method, and the semantic coding information is semantic information recognized by a semantic recognition method;
s103: and determining the first voice characteristic information according to the voiceprint information and the coding information.
Since different persons generally have different voiceprints, and a given person's voiceprint is usually stable over time, voiceprint information can represent the identity of the person who uttered the voice. In step S101, an existing voiceprint recognition method may be used to perform voiceprint recognition on the first voice data to obtain the corresponding voiceprint information, which represents the identity of the source of the first voice data.
The syllables or semantics of speech can represent its content to some extent, and in general, different utterances differ in syllables or semantics. Therefore, in step S102 of this embodiment, the syllable coding information and/or semantic coding information is determined by recognizing the syllable information and/or semantic information of the first voice data, so as to represent the content of the speech.
Syllable recognition may be performed on the first voice data using an existing syllable recognition algorithm to obtain syllable information, and semantic recognition may be performed using an existing semantic recognition algorithm to obtain semantic information. Encoding can be carried out alongside recognition: for example, each syllable is encoded as soon as it is recognized, so that the syllable information finally obtained is already syllable coding information.
In step S103, the first speech characteristic information is determined according to the voiceprint information and the coding information, for example, the voiceprint information and the coding information may be determined as the first speech characteristic information. By the above mode, voice data with different contents sent by different people can be basically distinguished according to the voice characteristic information.
Of course, if it is not necessary to distinguish different persons in the actual scene, the encoded information may also be determined as the first speech feature information, and the voiceprint information is not required, which is not limited specifically.
Although syllable information and/or semantic information is used in this embodiment, unlike its usual use it is not used to reconstruct speech, but to look up the corresponding voice data. Because this embodiment is not concerned with the real meaning of the voice data, as long as the recognized semantic information can distinguish voice data with different contents, the voice data may be in any language, for example Chinese, English, Russian, or even a dialect.
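Steps S101-S103 combine voiceprint information and coding information into the first voice feature information. One possible compact representation is sketched below; the patent does not prescribe a concrete representation, so the hash-based key and all names here are assumptions.

```python
import hashlib

def make_feature_key(voiceprint: bytes, syllable_codes, semantic_codes=()) -> str:
    """Derive a comparable feature key from voiceprint information plus
    syllable (and optionally semantic) coding information, so that voice
    data with different speakers or contents yields different keys."""
    h = hashlib.sha256()
    h.update(voiceprint)
    for code in syllable_codes:
        h.update(b"s" + str(code).encode())   # tag syllable codes
    for code in semantic_codes:
        h.update(b"m" + str(code).encode())   # tag semantic codes
    return h.hexdigest()
```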
In one embodiment, before step S100, the method further includes: and determining a voice transmission mode for voice transmission according to the detected network state of the equipment.
For example, when the detected network state of the device is normal, the device may enter the first mode; when the detected network state of the device is abnormal, the device can enter a second mode.
In step S400, if the first voice ID is not found, the method further includes:
s410: if the voice transmission mode is a set first mode, further sending the first voice feature information to the server according to the first mode, so that the server allocates a corresponding first voice ID according to the first voice feature information;
s420: and acquiring the first voice ID from the server, and recording the corresponding relation between the first voice ID and the first voice characteristic information.
In other words, when the voice transmission mode is the first mode and the first voice ID is not found, both the first voice data and the first voice feature information are sent to the server, so that not only the transmission of the first voice data is realized, but also the learning of the related information of the first voice data is realized.
After receiving the first voice data and the first voice feature information, the server may assign a corresponding first voice ID according to the first voice feature information, where the first voice ID may identify the first voice data. After the server allocates the first voice ID, the server may locally record a corresponding relationship between the first voice ID and the first voice data, or may locally record a corresponding relationship between the first voice ID, the first voice data, and the first voice feature information, return the first voice ID to the terminal device, and forward the first voice data to the destination terminal device.
The first voice ID in this embodiment can uniquely identify the first voice data. Since a plurality of terminal devices are generally connected to the server, the uniqueness of a voice ID must be guaranteed across all of them, and having the server allocate IDs makes this easier to ensure. For example, voice IDs may simply be allocated sequentially (0, 1, 2, 3, 4, ...), which keeps processing simple. The specific allocation method is not limited, as long as voice data with different contents spoken by different people receives different voice IDs, that is, different voice feature information is allocated different voice IDs.
The terminal device acquires the first voice ID from the server and records the correspondence between the first voice ID and the first voice feature information. Thus, when the same first voice data is collected again and the corresponding first voice feature information is generated, the corresponding first voice ID can be found in the recorded correspondence, and only the first voice ID, rather than the first voice data, needs to be sent to the server.
Optionally, in addition to the correspondence between the first voice ID and the first voice feature information, the terminal device may also record the correspondence between the first voice ID or the first voice feature information and the first voice data; alternatively, the corresponding relationship between the first voice ID, the first voice feature information, and the first voice data may also be recorded, which is not limited specifically. Of course, in order to reduce the amount of memory required by the terminal device, the terminal device may not have stored therein voice data.
Through the above manner, for each piece of voice data acquired or collected in the first mode, the terminal device generally records the corresponding relationship between the voice ID of the voice data and the voice feature information, and the server generally records the corresponding relationship between the voice data and the voice ID, and this learning process is continuously performed to continuously enrich the corresponding relationship, and cover more and more words and sentences. In this way, the voice ID can be used as the association between the voice feature information in the terminal device and the voice data in the server, and the voice transmission can be realized by transmitting the voice ID between the terminal device and the server.
Moreover, in step S200, after the first voice feature information is generated, the corresponding first voice ID is searched for in the recorded correspondence between voice IDs and voice feature information, and only when it is not found are the first voice feature information and the first voice data sent to the server for learning. This avoids repeatedly learning the related information of the same voice data: only voice data whose speaker or content has not been encountered before causes the terminal device to learn and record a new correspondence between voice feature information and a voice ID, and the server to learn and record the corresponding voice ID and voice data.
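The first-mode flow described above (local lookup; on a miss, send data plus features, receive the assigned ID, and record the mapping) can be sketched as follows. Class and method names are assumptions, and the sequential ID allocation follows the example given earlier.

```python
class Server:
    """Minimal server sketch: assigns sequential voice IDs and records
    the ID-to-data correspondence."""
    def __init__(self):
        self.next_id = 0
        self.id_to_data = {}
        self.forwarded = []           # data forwarded to the destination

    def receive_data(self, voice_data, feature_info):
        voice_id = self.next_id       # sequential allocation
        self.next_id += 1
        self.id_to_data[voice_id] = voice_data
        self.forwarded.append(voice_data)
        return voice_id               # returned to the source device (step S420)

    def receive_id(self, voice_id):
        # Step S300 path: look up the data for a known ID and forward it.
        self.forwarded.append(self.id_to_data[voice_id])


class SourceDevice:
    """First-mode flow: look up the feature key locally; on a miss, send
    data plus features and record the ID the server assigns."""
    def __init__(self, server):
        self.server = server
        self.feature_to_id = {}

    def send_voice(self, voice_data, feature_info):
        voice_id = self.feature_to_id.get(feature_info)
        if voice_id is not None:
            self.server.receive_id(voice_id)                               # step S300
        else:
            voice_id = self.server.receive_data(voice_data, feature_info)  # steps S400/S410
            self.feature_to_id[feature_info] = voice_id                    # step S420
        return voice_id
```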
In the case of an abnormal network state, the quality of the transmitted voice data may be degraded, which generally makes it unsuitable for learning the related information of the first voice data, so the terminal device may refrain from sending the first voice feature information to the server. This is not a strict limitation, however, and may be chosen as needed; for example, learning may still be performed when the degraded voice quality is acceptable.
A more specific example is provided below in conjunction with fig. 3 and 4, but should not be taken as limiting.
As shown in fig. 3, the server 300 may be connected to a plurality of terminal devices 201-203. Here the terminal device 201 needs to send the first voice data to the terminal device 202; in this case, the terminal device 201 is the source terminal device, the terminal device 202 is the destination terminal device, and the terminal device 203 is another terminal device. Of course, the server 300 may be connected to more terminal devices, which are not shown in the figure.
As shown in fig. 4, there are two voice transmission modes of the source terminal device 201, which are a first mode and a second mode, respectively, where the source terminal device 201 enters the first mode when detecting that the network state of the device is normal, and enters the second mode when detecting that the network state of the device is abnormal.
In the first mode:
after acquiring the first voice data, the source terminal device 201 may generate first voice feature information corresponding to the first voice data according to the first voice data, where the first voice feature information includes, for example, voiceprint information and syllable encoding information;
next, the source terminal device 201 searches for a first voice ID corresponding to the first voice feature information in the correspondence between the recorded voice ID and the voice feature information;
if the first voice ID is not found, the related information of the first voice data has not yet been learned, and the first voice data and the first voice feature information are both sent to the server 300. After receiving them, the server 300 allocates a corresponding first voice ID according to the first voice feature information, records the correspondence between the first voice ID and the first voice data, and returns the first voice ID to the source terminal device 201, which then records the correspondence between the first voice feature information and the first voice ID. In addition, the server 300 forwards the first voice data to the destination terminal device 202 (pieces may be forwarded individually or combined). After receiving the first voice data, the destination terminal device 202 may play it.
If the first voice ID is found, the related information of the first voice data has been learned before, including the correspondence between the first voice ID and the first voice feature information in the source terminal device 201 and the correspondence between the first voice data and the first voice ID in the server 300; in this case only the first voice ID needs to be sent to the server 300. After receiving the first voice ID, the server 300 searches the recorded correspondence between voice IDs and voice data for the corresponding first voice data and forwards it to the destination terminal device 202 (pieces may be forwarded individually or combined). After receiving the first voice data, the destination terminal device 202 may play it.
In a second mode:
after acquiring the first voice data, the source terminal device 201 may generate first voice feature information corresponding to the first voice data according to the first voice data, where the first voice feature information includes, for example, voiceprint information and syllable encoding information;
next, the source terminal device 201 searches for a first voice ID corresponding to the first voice feature information in the correspondence between the recorded voice ID and the voice feature information;
if the first voice data is found, it is described that the related information of the first voice data has been learned before, including the corresponding relationship between the first voice ID and the first voice feature information in the source terminal device 201 and the corresponding relationship between the first voice data and the first voice ID in the server 300, at this time, only the first voice ID needs to be sent to the server 300. After receiving the first voice ID, the server 300 searches for the first voice data corresponding to the first voice ID in the recorded correspondence between the voice ID and the voice data, and forwards the found first voice data to the destination terminal device 202 (which may be forwarded singly or forwarded in multiple combined ways). After receiving the first voice data, the destination terminal device 202 may play the first voice data.
If the first voice ID is not found, the related information of the first voice data has not yet been learned; however, because the current network state is abnormal, only the first voice data is sent to the server 300 (pieces may be sent individually or together). After receiving the first voice data, the server 300 directly forwards it to the destination terminal device 202 (individually or combined), without learning the related information. After receiving the first voice data, the destination terminal device 202 may play it.
In the above embodiments, the terminal device acts as the source terminal device. Its identity differs under different processing logic; in some processing logic the terminal device may of course also serve as the destination terminal device, as in the embodiments described below.
In one embodiment, the method further comprises:
and receiving second voice data sent by the server, and playing the second voice data.
The terminal device may have a voice playing function, for example, may have a voice player, and the second voice data may be played by the voice player.
In one embodiment, the method further comprises:
s500: acquiring and recording a corresponding relation between the voice ID and the voice data from the server;
s600: when at least one second voice ID sent by the server is received, searching the voice data corresponding to the received second voice ID according to the recorded corresponding relationship between the voice ID and the voice data;
s700: if 1 second voice ID is received, playing the searched voice data;
s800: and if more than two second voice IDs are received, synthesizing the voice data corresponding to the searched second voice IDs, and playing the synthesized voice data.
When the server is idle, it may send the recorded, not yet synchronized correspondences between voice IDs and voice data to each connected terminal device. "Idle" may refer to periods when no voice data or other information needs to be transmitted. Optionally, after sending a correspondence to every terminal device, the server may delete the locally recorded correspondence, or mark it as synchronized using synchronization flag information.
After the terminal device obtains the corresponding relationship between the voice ID and the voice data from the server, the terminal device may record the corresponding relationship. When the source terminal device locally finds a second voice ID corresponding to the voice data to be sent (there may be multiple pieces of voice data, so that multiple second voice IDs can be found), the source terminal device may send the second voice ID to the server, and the server may forward the second voice ID to the terminal device.
When receiving at least one second voice ID sent by the server, the terminal device may find the voice data corresponding to the second voice ID in the recorded correspondence between the voice ID and the voice data. If only 1 second voice ID is received, only one piece of voice data is found, and the found voice data is directly played. If more than two second voice IDs are received, synthesizing the voice data corresponding to each found second voice ID into a section of complete voice data, and playing the synthesized voice data, wherein the synthesis mode is not limited.
In this embodiment, the server may synchronize the correspondence between the voice ID and the voice data to the connected terminal device, so that the subsequent terminal device may directly utilize the correspondence between the voice ID and the voice data to implement voice transmission.
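The idle-time synchronization in this embodiment can be sketched as follows: push every not-yet-synchronized (voice ID, voice data) pair to each connected terminal and mark it as synchronized. The record structure and all names are assumptions.

```python
def synchronize_when_idle(server_records, terminal_maps):
    """server_records: {voice_id: {"data": bytes, "synced": bool}};
    terminal_maps: per-terminal ID-to-data dicts to be filled in."""
    for voice_id, record in server_records.items():
        if not record["synced"]:
            for terminal in terminal_maps:
                terminal[voice_id] = record["data"]   # push to every terminal
            record["synced"] = True                   # mark as synchronized
```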
In one embodiment, when the server is idle, the recorded voice ID and the voice feature information may be sent to each connected terminal device, so that other terminal devices do not need to repeatedly learn the voice ID and the voice feature information, and the learning process is accelerated.
The above is the content of the embodiment of the voice transmission method applied to the terminal device, and the content of the embodiment of the voice transmission method applied to the server is described below.
In one embodiment, the voice transmission method is applied to a server, and the method may include the following steps:
t100: under the condition of receiving a voice ID sent by the source terminal device, controlling the destination terminal device according to the voice ID to obtain the voice data corresponding to the voice ID;
t200: and under the condition of receiving the voice data sent by the source terminal equipment, forwarding the voice data to the destination terminal equipment.
The main execution body of the voice transmission method is a server, and the server can have larger storage capacity and processing capacity and can be composed of one computer device or a plurality of computer devices.
The server may be connected to a plurality of terminal devices. As shown in fig. 3, the server 300 may be connected to terminal devices 201-203; if the terminal device 201 needs to send the first voice data to the terminal device 202, then the terminal device 201 is the source terminal device, the terminal device 202 is the destination terminal device, and the terminal device 203 is another terminal device. Of course, the server 300 may be connected to more terminal devices, which are not shown in the figure.
In step T100, when receiving a voice ID sent by a source terminal device, a destination terminal device is controlled according to the voice ID to obtain voice data corresponding to the voice ID.
The voice ID may be a voice ID corresponding to second voice feature information found in a correspondence between a recorded voice ID and voice feature information by the source terminal device, and the second voice feature information may be voice feature information generated by the source terminal device according to collected voice data.
There are various ways of controlling the destination terminal device to obtain the voice data corresponding to the voice ID according to the voice ID. For example, the voice data corresponding to the received voice ID may be found from the correspondence between the voice ID and the voice data recorded by the server, and the voice data may be sent to the destination terminal device. For another example, the server may forward the received voice ID to the destination terminal device, so that the destination terminal device finds the voice data corresponding to the received voice ID in the correspondence between the recorded voice ID and the voice data.
Certainly, the above manner is not limited, and the corresponding relationship between the voice ID and the voice data may also be recorded in other places, for example, in a cloud space, the server may forward the received voice ID to the cloud space, and after the cloud space finds the voice data corresponding to the received voice ID from the recorded corresponding relationship between the voice ID and the voice data, the cloud space forwards the found voice data to the destination terminal device.
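The first two server-side options above can be sketched as a simple dispatch; all names are assumptions, and the cloud-space variant is omitted for brevity.

```python
def control_destination(voice_id, id_to_data, dest_has_correspondence,
                        send_data_to_dest, send_id_to_dest):
    """If the destination already holds the ID-to-data correspondence,
    forward only the voice ID; otherwise look the data up on the server
    and forward the voice data itself."""
    if dest_has_correspondence:
        send_id_to_dest(voice_id)
    else:
        send_data_to_dest(id_to_data[voice_id])
```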
After obtaining the voice data corresponding to the voice ID, the destination terminal device can play it. The destination terminal device may be provided with a voice playing module through which the voice data is played, and the voice playing module may invoke a voice player to play the voice data.
In step T200, when receiving the voice data sent by the source terminal device, the server forwards the voice data to the destination terminal device.
The received voice data may be, for example, the voice data sent by the source terminal device when the source terminal device does not find the voice ID corresponding to the second voice feature information in the recorded correspondence between the voice ID and the voice feature information, where the second voice feature information may be the voice feature information generated by the source terminal device according to the collected voice data.
Upon receiving the voice data, the server forwards it to the destination terminal device, which can then play it. The destination terminal device may be provided with a voice playing module through which the voice data is played, and the voice playing module may invoke a voice player to play the voice data.
In this embodiment, in some cases only the voice ID needs to be transmitted between the source terminal device and the server, and the server can control the destination terminal device to obtain the corresponding voice data according to the voice ID. This greatly reduces the amount of data transmitted between the source terminal device and the server; when the network state is abnormal, for example congested, it helps improve the network state and avoid worsening it further. The method is particularly suitable for weak-network scenarios, as it reduces the bandwidth occupied by voice transmission.
In an embodiment, in step T100, controlling the destination terminal device to obtain the voice data corresponding to the voice ID according to the voice ID may include the following steps:
t101: searching, in the recorded correspondence between voice IDs and voice data, for the voice data corresponding to the received voice ID;
t102: if the number of received voice IDs is 1, forwarding the found voice data to the destination terminal device;
t103: if the number of received voice IDs is greater than 1, synthesizing the voice data found for each received voice ID, and forwarding the synthesized voice data to the destination terminal device.
Optionally, the correspondence between voice IDs and voice feature information in the source terminal device and the correspondence between voice IDs and voice data in the server are learned and recorded synchronously. When a voice ID is received, this indicates that the source terminal device has recorded the correspondence between that voice ID and the second voice feature information, and in general the server has recorded the correspondence between that voice ID and the voice data.
Therefore, in this embodiment, the voice data corresponding to the received voice ID can be found in the recorded correspondence between voice IDs and voice data.
Optionally, the source terminal device may collect a plurality of pieces of voice data at a time. Taking an interphone as the source terminal device: when a person turns on the interphone, speaks N sentences into it, and then turns it off, two sentences separated by more than a preset interval, for example 0.5 s, can be treated as two pieces of voice data; the interphone can then be determined to have collected N pieces of voice data, where N is greater than 1.
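The segmentation rule just described, in which a pause longer than a preset interval (for example 0.5 s) starts a new piece of voice data, can be sketched as follows. Representing each utterance as a (start_time, end_time) pair is an assumption made for illustration; the patent does not prescribe a representation.

```python
def split_into_pieces(utterances, gap=0.5):
    """Group consecutive utterances into pieces of voice data; a pause
    longer than `gap` seconds between utterances starts a new piece.
    `utterances` is a time-ordered list of (start_time, end_time) pairs."""
    pieces = []
    current = []
    last_end = None
    for start, end in utterances:
        # a silence longer than the preset interval closes the current piece
        if last_end is not None and start - last_end > gap:
            pieces.append(current)
            current = []
        current.append((start, end))
        last_end = end
    if current:
        pieces.append(current)
    return pieces
```

With the 0.5 s default, two sentences 0.2 s apart fall into one piece, while a 1.0 s pause starts a new one.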
In this case, voice feature information corresponding to each piece of voice data may be generated, each item of voice feature information characterizing its corresponding voice data. When voice feature information is generated for a plurality of pieces of voice data, a plurality of voice IDs may be found, in which case the source terminal device sends the plurality of voice IDs to the server; alternatively, only one voice ID may be found.
If the number of received voice IDs is 1, the found voice data is forwarded to the destination terminal device. If the number of received voice IDs is greater than 1, the voice data found for each voice ID is synthesized, and the synthesized voice data is forwarded to the destination terminal device.
Optionally, if the server receives both a voice ID and voice data (where the voice data does not correspond to that voice ID, the voice ID instead corresponding to other voice data), then after finding the voice data corresponding to the voice ID, the server may synthesize the found voice data with the received voice data and forward the synthesized voice data to the destination terminal device.
In this embodiment, only the voice ID needs to be transmitted between the source terminal device and the server, and the corresponding voice data no longer needs to be transmitted. Since the data volume of a voice ID is much smaller than that of the voice data, the data transmission amount between the source terminal device and the server can be greatly reduced, while the destination terminal device still obtains the required voice data.
In an embodiment, in step T100, controlling the destination terminal device to obtain the voice data corresponding to the voice ID according to the voice ID may include the following steps:
t104: when it is determined that the recorded correspondence between the received voice ID and its voice data has been sent to the destination terminal device, forwarding the received voice ID to the destination terminal device, so that the destination terminal device plays the corresponding voice data according to the received voice ID.
Optionally, the correspondence between voice IDs and voice feature information in the source terminal device and the correspondence between voice IDs and voice data in the server are learned and recorded synchronously. The server may, when idle, synchronize the recorded correspondence between voice IDs and voice data to the other terminal devices (including the destination terminal device), and may mark a correspondence as synchronized with synchronization identification information once synchronization is complete.
Therefore, in this embodiment, whether the recorded correspondence between the received voice ID and its voice data has been sent to the destination terminal device can be determined from the synchronization identification information; if it has, the received voice ID is forwarded to the destination terminal device, so that the destination terminal device plays the corresponding voice data according to the received voice ID.
When playing the corresponding voice data according to the received voice ID, the destination terminal device finds the voice data corresponding to the received voice ID in its recorded correspondence between voice IDs and voice data. When one piece of voice data is found, it can be played directly; when two or more pieces are found, they can be synthesized and the synthesized voice data played.
In this embodiment, only the voice ID needs to be transmitted between the source terminal device and the server, and likewise only the voice ID between the server and the destination terminal device, so during voice transmission the data transmission amount on both links can be greatly reduced.
In one embodiment, in the case of receiving voice data transmitted by a source terminal device, the method further includes:
t210: when voice feature information corresponding to the voice data is received from the source terminal device (the source terminal device sends the voice feature information when its voice transmission mode is a set first mode), allocating a corresponding voice ID according to the voice feature information, returning the voice ID to the source terminal device, and recording the correspondence between the voice ID and the voice data.
If, along with the voice data, voice feature information corresponding to that data is received from the source terminal device, this indicates that the source terminal device is currently in the first mode. The first mode is the mode the source terminal device enters on detecting that its network state is normal; in this case, the related information of the voice data needs to be learned.
After receiving the voice data and the corresponding voice feature information, the server may allocate a corresponding voice ID according to the voice feature information; the voice ID identifies the voice data. After allocating the voice ID, the server may locally record the correspondence between the voice ID and the voice data (or between the voice ID, the voice data, and the voice feature information), return the voice ID to the source terminal device, and forward the voice data to the destination terminal device.
After receiving the voice ID, the source terminal device may record the correspondence between the voice ID and the voice feature information. When the same voice data is collected again later, the source terminal device can find the corresponding voice ID from this correspondence according to the voice feature information and need only send that voice ID to the server; for details, refer to the description of step T100 in the foregoing embodiment.
The voice ID may uniquely identify the voice data. Generally, a plurality of terminal devices are connected to the server, and a voice ID must be unique across all of them, so uniqueness is easier to guarantee when the server performs the allocation. For example, voice IDs may be allocated in increasing order 0, 1, 2, 3, 4, and so on, which keeps processing simple. The specific allocation mode is not limited; it is only required that voice data with different contents spoken by different people be allocated different voice IDs, that is, that different voice feature information be allocated different voice IDs.
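A minimal sketch of such an allocator follows: IDs are handed out in increasing order starting from 0, and identical voice feature information always maps back to the same ID. Treating the feature information as a hashable key is an illustrative assumption, not something the text specifies.

```python
class VoiceIdAllocator:
    """Server-side voice ID allocation: sequential IDs 0, 1, 2, ...,
    with distinct voice feature information receiving distinct IDs."""

    def __init__(self):
        self._next_id = 0
        self._by_feature = {}  # voice feature information -> voice ID

    def allocate(self, feature_info):
        # reuse the recorded ID for feature information seen before,
        # so the same voice data is never assigned two different IDs
        if feature_info not in self._by_feature:
            self._by_feature[feature_info] = self._next_id
            self._next_id += 1
        return self._by_feature[feature_info]
```

A single server-held allocator keeps IDs unique across all connected terminal devices, which is exactly why server-side allocation is the simpler choice.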
In one embodiment, the method further comprises:
t300: sending the locally recorded correspondence between voice IDs and voice data to the destination terminal device, so that when the destination terminal device receives a voice ID it can find the corresponding voice data.
When idle, the server may send the recorded correspondence between voice IDs and voice data to each connected terminal device (including the destination terminal device). Idle here may refer to a period in which neither voice data nor any other information needs to be transmitted. Optionally, after sending the correspondence to each terminal device, the server may delete the locally recorded correspondence, or mark it as synchronized with synchronization identification information.
After obtaining the correspondence between voice IDs and voice data from the server, the destination terminal device may record it. When the source terminal device locally finds the voice ID corresponding to voice data to be sent (there may be multiple pieces of voice data, and hence multiple voice IDs), it sends the voice ID to the server, and the server forwards the voice ID to the destination terminal device.
When receiving at least one voice ID from the server, the destination terminal device searches for the corresponding voice data in its recorded correspondence between voice IDs and voice data. If only one voice ID is received, only one piece of voice data is found and is played directly. If two or more voice IDs are received, the pieces of voice data found for the voice IDs are synthesized into one complete piece of voice data and the synthesized voice data is played; the synthesis mode is not limited.
In this embodiment, the server may synchronize the correspondence between the voice ID and the voice data to the destination terminal device, so that the subsequent destination terminal device may directly utilize the correspondence between the voice ID and the voice data to implement voice transmission.
The present application further provides a voice transmission apparatus, applied to a terminal device, and referring to fig. 2, the voice transmission apparatus 100 includes:
the voice feature information generating module 101 is configured to generate corresponding first voice feature information according to the collected first voice data to be sent to the destination terminal device;
a voice ID search module 102, configured to search for a first voice ID corresponding to the first voice feature information in a correspondence between a recorded voice ID and voice feature information;
the first voice transmission module 103 is configured to send the first voice ID to a server if the first voice ID is found, so that the server controls the destination terminal device to obtain first voice data corresponding to the first voice ID according to the first voice ID;
the second voice transmission module 104 is configured to send the first voice data to the server if the first voice ID is not found, so that the server forwards the first voice data to the destination terminal device.
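Taken together, modules 101 through 104 implement the following source-side decision. The sketch below is illustrative only: the feature extraction and the two network sends are passed in as stand-in callables, and the recorded correspondence is modeled as a dictionary (none of these names come from the patent).

```python
def transmit(voice_data, make_features, feature_to_id, send_id, send_data):
    """Source-terminal flow of modules 101-104: derive the first voice
    feature information, look up a recorded first voice ID, and send
    either the ID (if found) or the raw voice data (if not)."""
    features = make_features(voice_data)      # module 101
    voice_id = feature_to_id.get(features)    # module 102
    if voice_id is not None:
        send_id(voice_id)      # module 103: only the small ID crosses the link
    else:
        send_data(voice_data)  # module 104: fall back to sending the data
    return voice_id
```

The bandwidth saving of the first mode comes entirely from the `send_id` branch being taken for previously learned voice data.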
In one embodiment, when the voice feature information generating module generates corresponding first voice feature information according to the collected first voice data to be sent to the destination terminal device, the voice feature information generating module is specifically configured to:
carrying out voiceprint recognition on the collected first voice data to obtain corresponding voiceprint information;
encoding the collected first voice data to obtain encoded information, wherein the encoded information at least comprises: syllable coding information and/or semantic coding information; the syllable coding information is syllable information identified according to a syllable identification mode, and the semantic coding information is semantic information identified according to a semantic identification mode;
and determining the first voice characteristic information according to the voiceprint information and the coding information.
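One way the voiceprint information and the coding information might be combined into a single comparable feature value is sketched below. Hashing the pair into a fixed-length digest is an illustrative choice made here for the sketch; the patent only requires that the feature information be determined from both, and the `voiceprint_of` and `encode` callables are hypothetical stand-ins for the recognition and encoding steps.

```python
import hashlib

def first_voice_feature_info(voice_data, voiceprint_of, encode):
    """Determine first voice feature information from voiceprint
    information (who is speaking) and coding information, i.e.
    syllable and/or semantic coding (what is being said)."""
    voiceprint = voiceprint_of(voice_data)
    coding = encode(voice_data)
    # fold both components into one fixed-length, comparable value
    digest = hashlib.sha256()
    digest.update(repr(voiceprint).encode())
    digest.update(repr(coding).encode())
    return digest.hexdigest()
```

Because both the speaker and the content feed the digest, the same sentence spoken by different people yields different feature information, which is what the ID allocation rule requires.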
In one embodiment of the present invention,
the apparatus further comprises: the voice transmission mode determining module is used for determining a voice transmission mode for voice transmission according to the detected network state of the equipment;
in the case that the first voice ID is not found, the second voice transmission module is further configured to:
if the voice transmission mode is a set first mode, further sending the first voice feature information to the server according to the first mode, so that the server allocates a corresponding first voice ID according to the first voice feature information;
and acquiring the first voice ID from the server, and recording the corresponding relation between the first voice ID and the first voice characteristic information.
In one embodiment, the apparatus further comprises:
and the first voice playing module is used for receiving second voice data sent by the server and playing the second voice data.
In one embodiment, the apparatus further comprises:
the corresponding relation acquisition module is used for acquiring and recording the corresponding relation between the voice ID and the voice data from the server;
the voice data searching module is used for searching the voice data corresponding to the received second voice ID according to the recorded corresponding relation between the voice ID and the voice data when at least one second voice ID sent by the server is received;
the second voice playing module is used for playing the found voice data if a single second voice ID is received;
and the third voice playing module is used for synthesizing the searched voice data corresponding to each second voice ID and playing the synthesized voice data if more than two second voice IDs are received.
The application also provides a voice transmission device, which is applied to a server side, and the device comprises:
the third voice transmission module is used for controlling the destination terminal equipment to obtain voice data corresponding to the voice ID according to the voice ID under the condition of receiving the voice ID sent by the source terminal equipment;
and the fourth voice transmission module is used for forwarding the voice data to the destination terminal equipment under the condition of receiving the voice data sent by the source terminal equipment.
In an embodiment, when the third voice transmission module controls the destination terminal device to obtain the voice data corresponding to the voice ID according to the voice ID, the third voice transmission module is specifically configured to:
searching the received voice data corresponding to the voice ID in the corresponding relation between the recorded voice ID and the voice data;
if the number of the received voice IDs is 1, forwarding the searched voice data to the target terminal equipment;
and if the number of the received voice IDs is more than 1, synthesizing the voice data corresponding to each searched voice ID, and forwarding the synthesized voice data to the target terminal equipment.
In an embodiment, when the third voice transmission module controls the destination terminal device to obtain the voice data corresponding to the voice ID according to the voice ID, the third voice transmission module is specifically configured to:
and when the recorded corresponding relation between the received voice ID and the voice data is determined to be sent to the target terminal, forwarding the received voice ID to the target terminal equipment, so that the target terminal equipment plays the corresponding voice data according to the received voice ID.
In an embodiment, in the case of receiving voice data sent by a source terminal device, the fourth voice transmission module is further configured to:
when receiving voice feature information corresponding to the voice data sent by the source terminal equipment, and the source terminal equipment sends the voice feature information when the voice transmission mode of the source terminal equipment is a set first mode, distributing corresponding voice ID according to the voice feature information and returning the voice ID to the source terminal equipment; the correspondence between the voice ID and the voice data is recorded.
In one embodiment, the apparatus further comprises:
and the corresponding relation sending module is used for sending the locally recorded corresponding relation between the voice ID and the voice data to the target terminal equipment so that the target terminal equipment can find the corresponding voice data according to the voice ID when receiving the voice ID.
The implementation of the functions and roles of each unit in the above apparatus is described in detail in the implementation of the corresponding steps in the above method, and is not repeated here.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, wherein the units described as separate parts may or may not be physically separate, and the parts shown as units may or may not be physical units.
The application also provides an electronic device, which comprises a processor and a memory; the memory stores a program that can be called by the processor; wherein the processor, when executing the program, implements the voice transmission method as described in the foregoing embodiments.
The embodiments of the voice transmission apparatus can be applied to an electronic device. Taking a software implementation as an example, as a logical apparatus it is formed by the processor of the electronic device in which it resides reading corresponding computer program instructions from nonvolatile memory into memory and running them. In terms of hardware, fig. 5 shows the hardware structure of an electronic device in which the voice transmission apparatus 100 resides according to an exemplary embodiment of the present application; besides the processor 510, memory 530, interface 520, and nonvolatile memory 540 shown in fig. 5, the electronic device may also include other hardware according to its actual functions, which is not described again.
The present application also provides a machine-readable storage medium on which a program is stored, which when executed by a processor, implements the voice transmission method as described in the foregoing embodiments.
This application may take the form of a computer program product embodied on one or more storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having program code embodied therein. Machine-readable storage media include both permanent and non-permanent, removable and non-removable media, and the storage of information may be accomplished by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of machine-readable storage media include, but are not limited to: phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium may be used to store information that may be accessed by a computing device.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the scope of protection of the present application.

Claims (12)

1. A voice transmission method is applied to a terminal device, and comprises the following steps:
generating corresponding first voice characteristic information according to the collected first voice data to be sent to the target terminal equipment;
searching a first voice ID corresponding to the first voice characteristic information in the corresponding relation between the recorded voice ID and the voice characteristic information;
if the first voice ID is found, sending the first voice ID to a server, so that the server controls the target terminal equipment to obtain first voice data corresponding to the first voice ID according to the first voice ID;
and if the first voice ID is not found, sending first voice data to the server so that the server forwards the first voice data to the target terminal equipment.
2. The voice transmission method according to claim 1, wherein the generating corresponding first voice feature information according to the collected first voice data to be sent to the destination terminal device comprises:
carrying out voiceprint recognition on the collected first voice data to obtain corresponding voiceprint information;
encoding the collected first voice data to obtain encoded information, wherein the encoded information at least comprises: syllable coding information and/or semantic coding information; the syllable coding information is syllable information identified according to a syllable identification mode, and the semantic coding information is semantic information identified according to a semantic identification mode;
and determining the first voice characteristic information according to the voiceprint information and the coding information.
3. The voice transmission method according to claim 1,
the method further comprises the following steps: determining a voice transmission mode for voice transmission according to the detected network state of the equipment;
in the case where the first voice ID is not found, the method further includes:
if the voice transmission mode is a set first mode, further sending the first voice feature information to the server according to the first mode, so that the server allocates a corresponding first voice ID according to the first voice feature information;
and acquiring the first voice ID from the server, and recording the corresponding relation between the first voice ID and the first voice characteristic information.
4. A method for transmitting speech according to any one of claims 1 to 3, characterized in that the method further comprises:
acquiring and recording a corresponding relation between the voice ID and the voice data from the server;
when at least one second voice ID sent by the server is received, searching the voice data corresponding to the received second voice ID according to the recorded corresponding relationship between the voice ID and the voice data;
if a single second voice ID is received, playing the found voice data;
and if more than two second voice IDs are received, synthesizing the voice data corresponding to the searched second voice IDs, and playing the synthesized voice data.
5. A voice transmission method, applied to a server, the method comprising:
under the condition of receiving a voice ID sent by source terminal equipment, controlling target terminal equipment according to the voice ID to obtain voice data corresponding to the voice ID;
and forwarding the voice data to the destination terminal equipment under the condition of receiving the voice data sent by the source terminal equipment.
6. The voice transmission method according to claim 5, wherein controlling the destination terminal device to obtain the voice data corresponding to the voice ID according to the voice ID comprises:
searching the received voice data corresponding to the voice ID in the corresponding relation between the recorded voice ID and the voice data;
if the number of the received voice IDs is 1, forwarding the searched voice data to the target terminal equipment;
and if the number of the received voice IDs is more than 1, synthesizing the voice data corresponding to each searched voice ID, and forwarding the synthesized voice data to the target terminal equipment.
7. The voice transmission method according to claim 5, wherein controlling the destination terminal device to obtain the voice data corresponding to the voice ID according to the voice ID comprises:
and when the recorded corresponding relation between the received voice ID and the voice data is determined to be sent to the target terminal, forwarding the received voice ID to the target terminal equipment, so that the target terminal equipment plays the corresponding voice data according to the received voice ID.
8. The voice transmission method according to claim 5, wherein in case of receiving the voice data transmitted by the source terminal device, the method further comprises:
when receiving voice feature information corresponding to the voice data sent by the source terminal equipment, and the source terminal equipment sends the voice feature information when the voice transmission mode of the source terminal equipment is a set first mode, distributing corresponding voice ID according to the voice feature information and returning the voice ID to the source terminal equipment; the correspondence between the voice ID and the voice data is recorded.
9. A voice transmission device is applied to a terminal device, and comprises:
the voice characteristic information generating module is used for generating corresponding first voice characteristic information according to the collected first voice data to be sent to the target terminal equipment;
the voice ID searching module is used for searching a first voice ID corresponding to the first voice characteristic information in the corresponding relation between the recorded voice ID and the voice characteristic information;
the first voice transmission module is used for sending the first voice ID to a server if the first voice ID is found, so that the server controls the target terminal equipment to obtain first voice data corresponding to the first voice ID according to the first voice ID;
and the second voice transmission module is used for sending first voice data to the server if the first voice ID is not found, so that the server forwards the first voice data to the target terminal equipment.
10. A voice transmission apparatus, applied to a server, the apparatus comprising:
the third voice transmission module is used for controlling the destination terminal equipment to obtain voice data corresponding to the voice ID according to the voice ID under the condition of receiving the voice ID sent by the source terminal equipment;
and the fourth voice transmission module is used for forwarding the voice data to the destination terminal equipment under the condition of receiving the voice data sent by the source terminal equipment.
11. An electronic device comprising a processor and a memory; the memory stores a program that can be called by the processor; wherein the processor, when executing the program, implements the voice transmission method of any one of claims 1-8.
12. A machine-readable storage medium, having stored thereon a program which, when executed by a processor, implements the voice transmission method according to any one of claims 1 to 8.
CN202010501279.7A 2020-06-04 2020-06-04 Voice transmission method, device and equipment and storage medium Active CN111785293B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010501279.7A CN111785293B (en) 2020-06-04 2020-06-04 Voice transmission method, device and equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111785293A true CN111785293A (en) 2020-10-16
CN111785293B CN111785293B (en) 2023-04-25

Family

ID=72754598

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010501279.7A Active CN111785293B (en) 2020-06-04 2020-06-04 Voice transmission method, device and equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111785293B (en)

Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE10310257A1 (en) * 2003-03-05 2004-09-16 Cit Gmbh User access verification method e.g. for application server via data network, involves setting up communication link to voice communication terminal of user
CN102348095A (en) * 2011-09-14 2012-02-08 宋健 Method for keeping stable transmission of images in mobile equipment video communication
CN104700836A (en) * 2013-12-10 2015-06-10 阿里巴巴集团控股有限公司 Voice recognition method and voice recognition system
WO2017000481A1 (en) * 2015-06-29 2017-01-05 中兴通讯股份有限公司 Dialing method and apparatus for voice call
CN105072015A (en) * 2015-06-30 2015-11-18 网易(杭州)网络有限公司 Voice information processing method, server, and terminal
CN105206273A (en) * 2015-09-06 2015-12-30 上海智臻智能网络科技股份有限公司 Voice transmission control method and system
CN106098069A (en) * 2016-06-21 2016-11-09 佛山科学技术学院 Identity authentication method and terminal device
CN106230689A (en) * 2016-07-25 2016-12-14 北京奇虎科技有限公司 Voice information interaction method, device, and server
CN107767872A (en) * 2017-10-13 2018-03-06 深圳市汉普电子技术开发有限公司 Speech recognition method, terminal device, and storage medium
CN108022600A (en) * 2017-10-26 2018-05-11 珠海格力电器股份有限公司 Equipment control method and device, storage medium and server
CN108023941A (en) * 2017-11-23 2018-05-11 阿里巴巴集团控股有限公司 Sound control method and device and electronic equipment
CN109724215A (en) * 2018-06-27 2019-05-07 平安科技(深圳)有限公司 Air conditioning control method, air conditioning control device, air-conditioning equipment and storage medium
CN110867188A (en) * 2018-08-13 2020-03-06 珠海格力电器股份有限公司 Method and device for providing content service, storage medium and electronic device
CN109117235A (en) * 2018-08-24 2019-01-01 腾讯科技(深圳)有限公司 Service data processing method, device, and related equipment
CN110875041A (en) * 2018-08-29 2020-03-10 阿里巴巴集团控股有限公司 Voice control method, device and system
CN109639738A (en) * 2019-01-30 2019-04-16 维沃移动通信有限公司 Voice data transmission method and terminal device
CN110364170A (en) * 2019-05-29 2019-10-22 平安科技(深圳)有限公司 Voice transmission method and device, computer device, and storage medium

Also Published As

Publication number Publication date
CN111785293B (en) 2023-04-25

Similar Documents

Publication Publication Date Title
US10412206B1 (en) Communications for multi-mode device
US9905228B2 (en) System and method of performing automatic speech recognition using local private data
CN1910654B (en) Method and system for determining the topic of a conversation and obtaining and presenting related content
CN109658916B (en) Speech synthesis method, speech synthesis device, storage medium and computer equipment
EP3676831B1 (en) Natural language user input processing restriction
CN107644637B (en) Speech synthesis method and device
US8086461B2 (en) System and method for tracking persons of interest via voiceprint
US20200126560A1 (en) Smart speaker and operation method thereof
US11687526B1 (en) Identifying user content
WO2009063445A2 (en) A method and apparatus for fast search in call-center monitoring
US20070294122A1 (en) System and method for interacting in a multimodal environment
US12026476B2 (en) Methods and systems for control of content in an alternate language or accent
CN110581927A (en) Call content processing and prompting method and device
CN113316078A (en) Data processing method and device, computer equipment and storage medium
JP2006279111A (en) Information processor, information processing method and program
Wyatt et al. A Privacy-Sensitive Approach to Modeling Multi-Person Conversations.
CN106486134B (en) Language state determination device and method
US20220366927A1 (en) End-To-End Time-Domain Multitask Learning for ML-Based Speech Enhancement
US20210327423A1 (en) Method and system for monitoring content of a communication session over a network
CN111785293B (en) Voice transmission method, device and equipment and storage medium
CN109616116B (en) Communication system and communication method thereof
WO2019150708A1 (en) Information processing device, information processing system, information processing method, and program
US11318373B2 (en) Natural speech data generation systems and methods
CN113506565B (en) Speech recognition method, device, computer readable storage medium and processor
CN109712606A (en) Information acquisition method, device, equipment, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant