CN112185425A - Audio signal processing method, device, equipment and storage medium - Google Patents
Audio signal processing method, device, equipment and storage medium Download PDFInfo
- Publication number
- CN112185425A CN112185425A CN201910604779.0A CN201910604779A CN112185425A CN 112185425 A CN112185425 A CN 112185425A CN 201910604779 A CN201910604779 A CN 201910604779A CN 112185425 A CN112185425 A CN 112185425A
- Authority
- CN
- China
- Prior art keywords
- audio
- voice
- vad
- feature
- audio information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/87—Detection of discrete points within a voice signal
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
Abstract
The embodiment of the invention provides an audio signal processing method, apparatus, device, and storage medium. The method comprises the following steps: first, when voice endpoint detection (VAD) is woken up, determining the voice features in the received audio information; second, identifying whether the speech-synthesis audio being played includes those voice features; and third, where the speech-synthesis audio includes the voice features, determining that the voice features woke the VAD falsely. This resolves the problem of the device "talking to itself" and improves the accuracy of intelligent voice communication.
Description
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to an audio signal processing method, apparatus, device, and storage medium.
Background
With the rapid development of artificial intelligence and computer technology, intelligent voice dialogue has been widely developed and applied, and intelligent voice communication between people and devices is receiving broad attention.
To respond to human speech in real time, a device (e.g., a smart speaker) uses Voice Activity Detection (VAD) to decide whether to respond to received audio. With current voice endpoint detection, the device may pick up the very audio it is playing, send that audio to the server, and receive a fresh response in return, so that the device and server fall into a loop. For example: the device's playback element plays "Hello, happy to know you"; the device's sound-receiving element picks up the "Hello, happy to know you" being played and sends that audio to the server; the server responds to it again in a cycle; and so the device ends up "talking to itself", disrupting intelligent voice communication between the user and the device.
Disclosure of Invention
In view of this, one or more embodiments of the present invention describe an audio signal processing method, apparatus, device, and storage medium that resolve the device-side "talking to itself" problem and improve the accuracy of intelligent voice communication.
According to a first aspect, there is provided an audio signal processing method, which may comprise:
determining the voice features in the received audio information when voice endpoint detection (VAD) is woken up;
identifying whether the speech-synthesis audio being played includes the voice features; and
where the speech-synthesis audio includes the voice features, determining that the voice features woke the VAD falsely.
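The three steps of the first aspect can be sketched as follows. This is a toy illustration only: `extract_voice_features` is a hypothetical stand-in for real acoustic feature extraction, and "the played audio includes the features" is modeled as a simple set-containment test.

```python
def extract_voice_features(audio):
    # Toy stand-in for feature extraction: the set of non-silent
    # sample values, rounded for comparison.
    return {round(x, 3) for x in audio if abs(x) > 0.1}

def is_false_wake(received_audio, played_tts_audio):
    # Step 1: VAD has already fired; pull voice features from the
    # received audio.  Step 2: check whether the speech-synthesis
    # audio the device just played contains those same features.
    # Step 3: if it does, the wake-up is judged false.
    feats = extract_voice_features(received_audio)
    tts_feats = extract_voice_features(played_tts_audio)
    return bool(feats) and feats <= tts_feats
```

A false wake is reported only when every extracted feature also occurs in the played audio, which mirrors the claim's "the speech-synthesis audio includes the voice features" condition.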
According to a second aspect, there is provided an audio signal processing apparatus, which may include:
a receiving module configured to determine the voice features in the received audio information when voice endpoint detection (VAD) is woken up;
a recognition module configured to identify whether the speech-synthesis audio being played includes the voice features; and
a processing module configured to determine, where the speech-synthesis audio includes the voice features, that the voice features woke the VAD falsely.
According to a third aspect, there is provided a speaker device comprising at least one processor and a memory, the memory storing computer program instructions and the processor executing the program in the memory to control the speaker device to implement the audio signal processing method of the first aspect.
According to a fourth aspect, there is provided a computing device comprising at least one processor and a memory, the memory being adapted to store computer program instructions, the processor being adapted to execute a program of the memory to control a server to implement the audio signal processing method as shown in the first aspect.
According to a fifth aspect, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the audio signal processing method of the first aspect.
With the scheme of the embodiments of the invention, whether the speech-synthesis audio being played includes the voice features in the received audio information is identified, and where it does, the voice features are determined to have woken the VAD falsely. The voice features that falsely woke the VAD are then used as negative samples for training the VAD model, so that the model is updated and the updated model intercepts audio information that would falsely wake the VAD. This resolves the device-side "talking to itself" problem, reduces misrecognition during voice interaction, and improves recognition accuracy. The overall power consumption of the voice interaction system is reduced while the user experience is improved.
In addition, when interference is received, the device no longer needs to send it to the server, saving the server's processing resources and reducing the waste of communication resources between device and server. For the device, performance keeps being optimized the longer the device is in service, improving the accuracy of intelligent voice communication.
Drawings
The present invention may be better understood from the following description of specific embodiments thereof taken in conjunction with the accompanying drawings, in which like or similar reference characters identify like or similar features.
Fig. 1 shows a schematic structural diagram of an audio signal interaction system according to an embodiment;
fig. 2 shows a flow diagram of an audio signal processing method according to an embodiment;
FIG. 3 illustrates a flow diagram of a method for updating a VAD model based on a server according to one embodiment;
FIG. 4 shows a flow diagram of a method of audio processing in a voice interactive system based on VAD techniques according to one embodiment;
fig. 5 shows a block diagram of the structure of an audio signal processing apparatus according to an embodiment;
FIG. 6 illustrates a schematic structural diagram of a computing device, according to one embodiment.
Detailed Description
Features and exemplary embodiments of various aspects of the present invention will be described in detail below, and in order to make objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not to be construed as limiting the invention. It will be apparent to one skilled in the art that the present invention may be practiced without some of these specific details. The following description of the embodiments is merely intended to provide a better understanding of the present invention by illustrating examples of the present invention.
It is noted that, herein, relational terms such as "first" and "second" may be used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between them. Also, the terms "comprises," "comprising," or any variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to it. Without further limitation, an element introduced by the phrase "comprising a(n)" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises it.
In order to solve the problems in the prior art, embodiments of the present invention provide an audio processing method, apparatus, device, server, and storage medium.
Fig. 1 shows a schematic structural diagram of an audio signal interaction system according to an embodiment.
As shown in fig. 1, a practical application scenario of the voice interaction system may include an audio device 10, with a sound-receiving element A and a broadcasting element B, and a server 20.
The sound receiving element a receives the sound source 1, the audio device 10 needs to report the audio information of the sound source 1 to the server 20, the server 20 determines the response audio information 2 for the audio information, the server sends the response audio information 2 to the audio device 10, and the broadcasting element B broadcasts the response audio information 2.
Because the sound-receiving element A and the broadcasting element B operate in the same scene, element A may pick up the response audio being played by element B (marked as response audio information 3 because it is re-received audio), and the audio device 10 sends this response audio information 3 to the server 20. To prevent content played by the audio device 10 from triggering VAD during intelligent voice interaction with the user, and thereby raising the recognition error rate, the server 20 matches response audio information 3 against response audio information 2 (i.e., the audio information the server 20 last sent to the audio device 10); if they match, the server trains the audio processing model on response audio information 3 to obtain a target audio processing model. The server 20 synchronizes the target audio processing model to the audio device 10, so that when the audio device 10 receives new audio information it determines from the target model whether the new audio information is interfering audio, and if so does not send it to the server. Further, the interfering audio may include: audio played by the audio device 10 itself and picked up again, and/or echoes of the audio played by the audio device 10.
In the embodiment of the invention, this audio processing method resolves the device-side "talking to itself" problem, reduces misrecognition during voice interaction, and improves recognition accuracy. The overall power consumption of the voice interaction system is reduced while the user experience is improved. In addition, when interference is received, the device no longer needs to send it to the server, saving the server's processing resources and reducing the waste of communication resources between device and server. For the device, performance keeps being optimized the longer the device is in service, improving the accuracy of intelligent voice communication.
The above voice interaction system can be applied in different scenarios, such as: smart home, in-vehicle, wearables, medical, education, intelligent audio input/output, and smart shopping. It can be embedded in various products that have a sound-receiving element and a broadcasting element, such as smart speakers, smart devices for children or adults, shopping software, audio-playback software, and smart home appliances.
Therefore, based on the above scenario of the voice interaction system, the following describes the audio processing method in detail with reference to fig. 2.
Fig. 2 shows a flow diagram of an audio signal processing method according to an embodiment.
As shown in fig. 2, the method flow includes steps 210-230. First, step 210: when voice endpoint detection (VAD) is woken up, determine the voice features in the received audio information. Second, step 220: identify whether the speech-synthesis audio being played includes the voice features. Then, step 230: where the speech-synthesis audio includes the voice features, determine that the voice features woke the VAD falsely.
The above steps are described in detail below:
involving step 210: the speech features include at least one of: echo, noise, clutter, silence characteristics of the played audio.
Specifically, the voice features are divided into an echo feature, a noise feature and a silence feature of the played audio according to the audio energy.
Alternatively, the speech features may be divided into echo, noise and silence features of the played audio according to the loudness, pitch and spectral distribution of the audio. Involving step 220: it is identified whether the played speech synthesis audio includes speech features.
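The energy-based division can be sketched as a per-frame classification. The frame length and the two dB thresholds below are illustrative assumptions, not values taken from the patent:

```python
import numpy as np

def classify_frames(signal, frame_len=160, silence_db=-50.0, noise_db=-25.0):
    # Split the signal into fixed-length frames and label each one
    # by its RMS energy in dB: very quiet -> silence, moderate ->
    # noise, loud -> candidate echo of the played audio.
    labels = []
    for i in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[i:i + frame_len]
        rms = np.sqrt(np.mean(frame ** 2)) + 1e-12  # avoid log(0)
        db = 20 * np.log10(rms)
        if db < silence_db:
            labels.append("silence")
        elif db < noise_db:
            labels.append("noise")
        else:
            labels.append("echo")
    return labels
```

A high-energy frame alone is not proof of an echo; in the method it still has to be matched against the played speech-synthesis audio (step 220) before being treated as one.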
Specifically, whether the speech-synthesis audio being played includes the voice features is identified using the trained VAD model.

Where the speech-synthesis audio does not include the voice features, the VAD is determined to have been woken correctly. The received audio information may then be sent to the server; the server matches feedback information for the audio information, sends the feedback information to the device, and the device plays the corresponding feedback audio.

Where the speech-synthesis audio does include the voice features, step 230 is performed.

Regarding step 230: where the speech-synthesis audio includes the voice features, the voice features are determined to have woken the VAD falsely.
In one example, before determining that the voice features woke the VAD falsely, the method may further include:

marking the audio information to obtain marked audio information; and using the marked audio information as a training negative sample for the VAD model, training the VAD model to obtain the trained VAD model.

Further, the echo, noise, and silence features of the played audio divided in step 210 are each marked, and the marked echo, noise, and silence features of the played audio are used as training negative samples for the VAD model.

For example: according to the audio energy of the noise and silence features, the audio segments corresponding to the noise and silence features are respectively selected from the audio information; the segments are marked, and the marked segments are used as training negative samples for the VAD model.
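Collecting the marked segments as VAD training negatives might look like the following toy sketch; the frame objects and the convention "label 0 = non-speech" are assumptions, not specified by the patent:

```python
def select_negative_samples(frames, labels):
    # Pair each frame judged noise or silence with label 0
    # (non-speech), so it can serve as a VAD training negative.
    return [(frame, 0) for frame, lab in zip(frames, labels)
            if lab in ("noise", "silence")]
```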
It can also be understood here that the audio information includes at least one feature, and each feature has a corresponding audio processing model; that is, one feature corresponds to one audio processing model, and the VAD model comprises a plurality of audio processing models. Training in this way therefore yields a more accurate VAD model.
In addition, the VAD may also be woken by mistake by sound played at the device side. To prevent this, when the similarity between the voiceprint feature of the played speech-synthesis audio and the voiceprint feature of the received audio information exceeds a preset threshold, the voice feature is determined to be an echo feature of the played audio. Likewise, to handle keywords that wake the VAD falsely because their pronunciation resembles the true wake keyword, such a keyword, once identified as falsely waking the VAD, is used as a training negative sample for the VAD model.
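A common way to compare voiceprint features is cosine similarity between embedding vectors. The sketch below assumes that representation; the 0.9 threshold is illustrative and is not a value from the patent:

```python
import numpy as np

def is_playback_echo(voiceprint_played, voiceprint_received, threshold=0.9):
    # Cosine similarity between the two voiceprint embeddings; if it
    # exceeds the preset threshold, treat the received audio as an
    # echo of what the device itself played.
    a = np.asarray(voiceprint_played, dtype=float)
    b = np.asarray(voiceprint_received, dtype=float)
    sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return sim >= threshold
```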
By identifying whether the speech-synthesis audio being played includes the voice features in the received audio information, the voice features are determined to have woken the VAD falsely where the speech-synthesis audio includes them. The voice features that falsely woke the VAD are then used as negative samples for training the VAD model, so that the model is updated and the updated model intercepts audio information that would falsely wake the VAD. This resolves the device-side "talking to itself" problem, reduces misrecognition during voice interaction, and improves recognition accuracy. The overall power consumption of the voice interaction system is reduced while the user experience is improved. In addition, when interference is received, the device no longer needs to send it to the server, saving the server's processing resources and reducing the waste of communication resources between device and server.
It is noted that, in one possible example, the VAD model can be updated and used on the device side, as in fig. 2; in another possible example, the VAD model is updated at the server side, and the device uses the updated model to intercept audio information that falsely wakes the VAD. The method for updating the VAD model at the server is therefore described in detail with reference to fig. 3. Fig. 3 shows a flow diagram of a method for updating a VAD model at the server according to one embodiment.
As shown in fig. 3, the method flow includes steps 310-350. The details are as follows:
step 310: and receiving the audio information sent by the equipment terminal, and identifying the voice characteristics in the audio information.
Step 320: and identifying whether the played voice synthesis audio sent to the equipment side comprises voice characteristics.
Step 330: in the case where the speech synthesis audio includes speech features, audio information corresponding to the speech features is taken as training negative samples of the VAD model.
And under the condition that the voice synthesis audio does not comprise voice characteristics, determining feedback information corresponding to the audio information according to the audio information, then sending the feedback information to the equipment end, and playing the feedback audio corresponding to the feedback information by the equipment end.
Step 340: the VAD model is trained based on the training negative samples to determine a VAD model after training.
Step 350: and sending the VAD model after training to the equipment end so that the equipment end intercepts interference information according to the VAD model after training.
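Steps 310-350 can be condensed into one server-side cycle. The callables `extract_features` and `train` below are hypothetical placeholders for the patent's feature extraction and model training:

```python
def server_update_cycle(audio_info, sent_tts_audio, vad_model,
                        extract_features, train):
    # Step 310: extract voice features from the device's audio.
    feats = extract_features(audio_info)
    # Step 320: do any of them also appear in the TTS audio that
    # was last sent to the device?
    if feats & extract_features(sent_tts_audio):
        # Steps 330-340: treat the audio as a training negative.
        vad_model = train(vad_model, negatives=[audio_info])
        # Step 350: the updated model is pushed back to the device;
        # no feedback audio is generated for a false wake.
        return vad_model, None
    # Otherwise the wake was genuine: feedback should be generated.
    return vad_model, audio_info
```

With toy stand-ins (`extract_features` as `set`, `train` appending negatives to a list), a false wake updates the model and suppresses feedback, while a genuine wake leaves the model untouched and passes the audio on.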
Thus, by keeping the device-side VAD model updated, the device no longer needs to send interference to the server when it receives interference, saving the server's processing resources and reducing the waste of communication resources between device and server. For the device, performance keeps being optimized the longer the device is in service, improving the accuracy of intelligent voice communication.
For convenience of understanding, the following description illustrates an audio processing method provided by an embodiment of the present invention in a case where a voice interactive system is applied to a smart speaker in combination with VAD technology.
Received silence, noise, and speech-synthesis audio information is identified through speech recognition technology, and the original VAD model is trained on that audio information; the device-side VAD module is then updated. When the device receives new audio information, the updated VAD module determines whether it is interference; if it is, the device actively intercepts it and does not send it to the server.
Fig. 4 shows a flow chart of a method of audio processing in a voice interactive system based on VAD techniques according to an embodiment.
Before describing the specific method, the system is introduced: it includes a smart speaker (i.e., the device side) and its corresponding server side. The smart speaker may include a feature extraction module and a VAD module; the server side may include a speech recognition module, a dialogue module, a synthesis module, and a training-data integration module.
As shown in fig. 4, the method includes steps 410 to 490, which are specifically as follows:
step 410: receiving the audio information 1, and determining the audio characteristic A in the audio information 1.
The smart speaker receives audio information 1, and the feature extraction module extracts features from it to determine audio feature A. For example: the time-domain signal corresponding to audio information 1 is converted to the frequency domain, and features are extracted from the frequency domain using mel-frequency cepstral coefficients (MFCCs).
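A very rough stand-in for this time-domain-to-frequency-domain step is a real-cepstrum computation (log magnitude spectrum, then inverse FFT). A true MFCC additionally applies a mel filter bank and a DCT; this sketch only illustrates the idea and is not the patent's implementation:

```python
import numpy as np

def cepstral_features(frame, n_coeffs=13):
    # Time domain -> frequency domain via the FFT, then a real
    # cepstrum: log of the magnitude spectrum, inverse FFT, keep
    # the first few coefficients as the feature vector.
    spectrum = np.abs(np.fft.rfft(frame)) + 1e-10  # avoid log(0)
    cepstrum = np.fft.irfft(np.log(spectrum))
    return cepstrum[:n_coeffs]
```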
Step 420: and judging whether to send the audio information to the server.
The VAD module in the smart speaker analyses audio feature A and, from that analysis, determines whether audio information 1 needs to be sent to the server for processing.

If it does need to be sent, other processing modules in the smart speaker (e.g., algorithm modules) preprocess audio information 1, for example with noise reduction, gain, dereverberation, echo cancellation, etc. If it does not, the VAD module may turn off sound pickup.
Step 430: and identifying the audio information 1, and determining character information aiming at the audio information 1.
The server receives audio information 1 from the smart speaker and converts it into text via the speech recognition module.
Step 440: the dialogue module carries out semantic analysis on the character information to generate audio information 2, wherein the audio information 2 comprises response information aiming at the audio information 1.
Further, the key information is extracted and the meaning expressed by the text is judged, and response information for the text is generated from the key information. For example: if the key information is "play the song", the corresponding song is searched for, and that song's audio information serves as the response information.
The synthesis module converts the response information into audio information 2 and sends it to the smart speaker, which plays it; one round of voice interaction is thus completed.

At this point, the synthesis module must also send audio information 2 to the training-data integration module, so that when audio with the same voice features as audio information 2 is later received, the VAD model in the training-data integration module can be trained on that audio.
To better explain the method provided by the present application, a second round of voice interaction, building on the first round above, is described below.
Next, step 450: the intelligent sound box receives the audio information 3 while playing the audio information 2, and sends the audio information 3 to the server.
For the manner in which the smart speaker sends audio information 3 to the server, refer to steps 410 and 420.
Step 460: and the server receives the audio information 3 sent by the intelligent sound box, and determines that the audio information 3 comprises the voice feature B.
Step 470: the server determines whether the audio information 2 includes the speech feature B.
Further, in one implementation, where audio information 2 includes voice feature B, the VAD model is trained on audio information 3 to obtain the trained VAD model: audio information 3 is marked, and the marked audio information 3 is used as a training negative sample for the VAD model.

Alternatively, in another example, where audio information 3 consists of noise and/or silence features, audio information 3 is used directly as a training negative sample: according to the audio energy of the noise and/or silence features, the corresponding audio segments are selected from audio information 3 and labeled as training negative samples for the VAD model.

Alternatively, in another example, where audio information 3 contains speech features as well as noise and/or silence features, the speech features may first be aligned with text (forced alignment) through the acoustic model of the speech recognition module; the remainder is then labeled as silence and noise features, which serve as training negative samples for the VAD model. Where audio information 2 includes the speech feature, the VAD model is trained on audio information 3 to obtain the trained model: audio information 3 is marked and used as a training negative sample.
Step 480: the server sends the trained VAD model to the smart speaker; when the smart speaker receives other audio information, the trained VAD model is used to determine whether that audio information is interference.
Step 490: and the intelligent sound box receives the VAD model, and under the condition that the intelligent sound box receives the audio information 4, the audio information 4 is input into the VAD model to obtain an output result.
If the output result indicates that the audio information 4 is interference information, the audio information 4 is not transmitted to the server. On the contrary, when the output result indicates that the audio information 4 is non-interference information, the audio information 4 is sent to the server.
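The device-side gating of step 490 can be sketched as follows. The model interface (a callable returning True for non-interference audio) and the transport function are assumptions for illustration, not the patent's actual interfaces.

```python
def handle_audio(vad_model, audio, send_to_server):
    """Device-side gate (step 490): run the on-device VAD model on the
    received audio and forward it to the server only when the output
    result marks it as non-interference."""
    if vad_model(audio):              # True: non-interference (genuine speech)
        send_to_server(audio)
        return "forwarded"
    return "dropped"                  # interference never leaves the device

# Stub model and transport for illustration.
sent = []
handle_audio(lambda a: False, "tv noise", sent.append)    # dropped locally
handle_audio(lambda a: True, "user query", sent.append)   # sent to server
```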
It should be noted that, in practical applications, the VAD model needs to be continuously optimized on a large number of samples: audio whose output result indicates non-interference information may in fact be interference information that the device-side VAD model has simply not yet seen samples of. For this reason, the computations of steps 470 to 490 are performed on the server side, so that the VAD model can continue to be optimized.
This completes one iteration cycle of updating the VAD model. With daily use, the VAD model in the VAD module adapts more and more to the environment in which the smart speaker is placed and captures the user's voice more accurately, thereby solving the device-side problem of "talking to itself", reducing speech misrecognition during voice interaction, and improving recognition accuracy.
In the embodiment of the present invention, it is identified whether the played speech synthesis audio includes a voice feature of the received audio information; if it does, the VAD is determined to have been falsely woken. The voice feature that falsely woke the VAD is then used as a negative sample to train and update the VAD model, and the updated VAD model intercepts audio information that would falsely wake the VAD. This solves the device-side "talking to itself" problem, reduces speech misrecognition during voice interaction, and improves recognition accuracy; it also lowers the overall power consumption of the voice interaction system and improves the user experience. In addition, when interference information is received, the device side does not need to send it to the server side, which saves server-side processing resources and reduces waste of communication resources between the device side and the server side.
Fig. 5 shows a block diagram of an audio signal processing apparatus according to an embodiment.
As shown in fig. 5, the apparatus 50 may include:
A transceiving module 501, configured to determine a voice feature in received audio information in the case that voice endpoint detection (VAD) is woken up.
A recognition module 502, configured to identify whether the played speech synthesis audio includes the voice feature.
A processing module 503, configured to determine that the VAD is falsely woken by the voice feature in the case that the speech synthesis audio includes the voice feature.
The apparatus 50 may further include a training module 504, configured to label the audio information to obtain labeled audio information, use the labeled audio information as a training negative sample of the VAD model, and train the VAD model to obtain the trained VAD model.
The training module 504 may be specifically configured to divide the voice feature into an echo feature of the played audio, a noise feature, and a silence feature according to audio energy; label the divided echo, noise, and silence features respectively; and use the labeled features as training negative samples of the VAD model.
In one possible example, the voice feature is determined to be an echo feature of the played audio when the similarity between the voiceprint feature of the played speech synthesis audio and the voiceprint feature of the audio information is higher than a preset threshold.
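The voiceprint comparison can be sketched with cosine similarity over embedding vectors. The embedding extractor and the threshold value of 0.8 are assumptions; the patent only specifies "similarity higher than a preset threshold".

```python
import numpy as np

def is_echo(tts_voiceprint, audio_voiceprint, threshold=0.8):
    """Classify the voice feature as an echo of the played TTS audio when
    the cosine similarity of the two voiceprint embeddings exceeds the
    preset threshold (embedding source and 0.8 value are assumptions)."""
    a = np.asarray(tts_voiceprint, dtype=float)
    b = np.asarray(audio_voiceprint, dtype=float)
    sim = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return sim > threshold
```

Identical embeddings give similarity 1.0 (echo); orthogonal embeddings give 0.0 (not an echo).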
In another possible example, when a keyword that falsely wakes the VAD is identified, the keyword is used as a training negative sample of the VAD model. The recognition module 502 in the embodiment of the present invention may be specifically configured to identify, according to the trained VAD model, whether the played speech synthesis audio includes the voice feature.
The processing module 503 in the embodiment of the present invention may be specifically configured to determine that the VAD is correctly woken when the speech synthesis audio does not include the voice feature.
FIG. 6 illustrates a schematic structural diagram of a computing device, according to one embodiment.
As shown in fig. 6, the computing device has an exemplary hardware architecture capable of implementing the audio signal processing method and apparatus according to embodiments of the present invention.
The apparatus may include a processor 601 and a memory 602 storing computer program instructions.
Specifically, the processor 601 may include a central processing unit (CPU) or an application-specific integrated circuit (ASIC), or may be configured as one or more integrated circuits implementing embodiments of the present application.
The processor 601 realizes any one of the audio signal processing methods in the above-described embodiments by reading and executing computer program instructions stored in the memory 602.
The transceiver 603 is mainly used for communication between the apparatuses in the embodiments of the present invention and other devices.
In one example, the device may also include a bus 604. As shown in fig. 6, the processor 601, the memory 602, and the transceiver 603 are connected via a bus 604 and communicate with each other.
Embodiments of the present invention also provide a computer-readable storage medium on which a computer program is stored; when the computer program is executed in a computer, it causes the computer to perform the steps of the audio signal processing method of the embodiments of the present invention.
It is to be understood that the invention is not limited to the particular arrangements and instrumentalities described in the above embodiments and shown in the drawings. For convenience and brevity, detailed descriptions of known methods are omitted here; for the specific working processes of the system, modules, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments.
It will be apparent to those skilled in the art that the method procedures of the present invention are not limited to the specific steps described and illustrated; various changes, modifications, additions, or equivalent substitutions and reorderings of steps are possible within the technical scope of the present invention once its spirit is appreciated.
Claims (11)
1. An audio signal processing method, comprising:
determining a voice feature in received audio information in a case that voice endpoint detection (VAD) is woken up;
identifying whether played speech synthesis audio includes the voice feature;
determining that the VAD is falsely woken by the voice feature in a case that the speech synthesis audio includes the voice feature.
2. The method of claim 1, further comprising:
marking the audio information to obtain marked audio information;
and taking the marked audio information as a training negative sample of the VAD model, and training the VAD model to determine the VAD model after training.
3. The method of claim 2, wherein the determining speech characteristics in the received audio information comprises:
and dividing the voice characteristics into echo characteristics, noise characteristics and mute characteristics of the played audio according to the audio energy.
4. The method of claim 3, wherein the using the labeled audio information as training negative samples of a VAD model comprises:
respectively marking the echo characteristic, the noise characteristic and the mute characteristic of the divided played audio;
and taking the echo feature, the noise feature and the mute feature of the marked played audio as training negative samples of the VAD model.
5. The method of claim 3, further comprising:
and under the condition that the similarity between the voiceprint characteristics of the played voice synthesis audio and the voiceprint characteristics of the audio information is higher than a preset threshold value, determining the voice characteristics as echo characteristics of the played audio.
6. The method of claim 3, further comprising:
and in the case that a keyword which falsely awakens the VAD is identified, taking the keyword as a training negative sample of the VAD model.
7. The method of claim 2, wherein the identifying whether the played speech synthesis audio includes the speech feature comprises:
identifying whether the played speech synthesis audio includes the speech feature according to the VAD model after the training.
8. The method of claim 1, further comprising:
determining to wake the VAD correctly if the speech synthesis audio does not include the speech feature.
9. An audio signal processing apparatus, comprising:
a transceiving module, configured to determine a voice feature in received audio information in a case that voice endpoint detection (VAD) is woken up;
a recognition module, configured to identify whether played speech synthesis audio includes the voice feature;
a processing module, configured to determine that the VAD is falsely woken by the voice feature in a case that the speech synthesis audio includes the voice feature.
10. A smart speaker device, comprising at least one processor and a memory, wherein the memory is configured to store computer program instructions, and the processor is configured to execute the instructions stored in the memory to control the smart speaker device to implement the method of any one of claims 1-8.
11. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed in a computer, causes the computer to perform the method of any one of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910604779.0A CN112185425A (en) | 2019-07-05 | 2019-07-05 | Audio signal processing method, device, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112185425A true CN112185425A (en) | 2021-01-05 |
Family
ID=73915987
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910604779.0A Pending CN112185425A (en) | 2019-07-05 | 2019-07-05 | Audio signal processing method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112185425A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100246837A1 (en) * | 2009-03-29 | 2010-09-30 | Krause Lee S | Systems and Methods for Tuning Automatic Speech Recognition Systems |
CN106940998A (en) * | 2015-12-31 | 2017-07-11 | 阿里巴巴集团控股有限公司 | A kind of execution method and device of setting operation |
CN107977183A (en) * | 2017-11-16 | 2018-05-01 | 百度在线网络技术(北京)有限公司 | voice interactive method, device and equipment |
CN108831477A (en) * | 2018-06-14 | 2018-11-16 | 出门问问信息科技有限公司 | A kind of audio recognition method, device, equipment and storage medium |
CN109461446A (en) * | 2018-12-24 | 2019-03-12 | 出门问问信息科技有限公司 | Method, device, system and storage medium for identifying user target request |
CN109753665A (en) * | 2019-01-30 | 2019-05-14 | 北京声智科技有限公司 | Wake up the update method and device of model |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113241073A (en) * | 2021-06-29 | 2021-08-10 | 深圳市欧瑞博科技股份有限公司 | Intelligent voice control method and device, electronic equipment and storage medium |
CN113270099A (en) * | 2021-06-29 | 2021-08-17 | 深圳市欧瑞博科技股份有限公司 | Intelligent voice extraction method and device, electronic equipment and storage medium |
CN113270099B (en) * | 2021-06-29 | 2023-08-29 | 深圳市欧瑞博科技股份有限公司 | Intelligent voice extraction method and device, electronic equipment and storage medium |
CN113241073B (en) * | 2021-06-29 | 2023-10-31 | 深圳市欧瑞博科技股份有限公司 | Intelligent voice control method, device, electronic equipment and storage medium |
CN114360523A (en) * | 2022-03-21 | 2022-04-15 | 深圳亿智时代科技有限公司 | Keyword dataset acquisition and model training methods, devices, equipment and medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP6613347B2 (en) | Method and apparatus for pushing information | |
US8543402B1 (en) | Speaker segmentation in noisy conversational speech | |
CN108694940B (en) | Voice recognition method and device and electronic equipment | |
CN112397083B (en) | Voice processing method and related device | |
CN102568478B (en) | Video play control method and system based on voice recognition | |
CN107799126A (en) | Sound end detecting method and device based on Supervised machine learning | |
CN107147618A (en) | A kind of user registering method, device and electronic equipment | |
CN103971680A (en) | Method and device for recognizing voices | |
CN104575504A (en) | Method for personalized television voice wake-up by voiceprint and voice identification | |
CN105679310A (en) | Method and system for speech recognition | |
CN112185425A (en) | Audio signal processing method, device, equipment and storage medium | |
EP3989217B1 (en) | Method for detecting an audio adversarial attack with respect to a voice input processed by an automatic speech recognition system, corresponding device, computer program product and computer-readable carrier medium | |
CN103680497A (en) | Voice recognition system and voice recognition method based on video | |
CN111210829A (en) | Speech recognition method, apparatus, system, device and computer readable storage medium | |
CN111145763A (en) | GRU-based voice recognition method and system in audio | |
CN106558306A (en) | Method for voice recognition, device and equipment | |
CN110875045A (en) | Voice recognition method, intelligent device and intelligent television | |
CN111916068A (en) | Audio detection method and device | |
CN111178081A (en) | Semantic recognition method, server, electronic device and computer storage medium | |
CN108322770A (en) | Video frequency program recognition methods, relevant apparatus, equipment and system | |
CN113889091A (en) | Voice recognition method and device, computer readable storage medium and electronic equipment | |
CN110808050A (en) | Voice recognition method and intelligent equipment | |
CN111540357A (en) | Voice processing method, device, terminal, server and storage medium | |
CN115132197B (en) | Data processing method, device, electronic equipment, program product and medium | |
CN111063338B (en) | Audio signal identification method, device, equipment, system and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||