CN112185425A - Audio signal processing method, device, equipment and storage medium

Audio signal processing method, device, equipment and storage medium

Info

Publication number
CN112185425A
CN112185425A
Authority
CN
China
Prior art keywords
audio, voice, VAD, feature, audio information
Prior art date
Legal status
Pending
Application number
CN201910604779.0A
Other languages
Chinese (zh)
Inventor
徐涛
曹元斌
Current Assignee
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date
2019-07-05
Filing date
2019-07-05
Publication date
2021-01-05
Application filed by Alibaba Group Holding Ltd
Priority to CN201910604779.0A
Publication of CN112185425A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 - Detection of presence or absence of voice signals
    • G10L25/87 - Detection of discrete points within a voice signal
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Embodiments of the invention provide an audio signal processing method, apparatus, device, and storage medium. The method comprises: first, when voice endpoint detection (VAD) is woken up, determining the voice features in the received audio information; second, identifying whether the played speech synthesis audio includes the voice features; then, if the speech synthesis audio includes the voice features, determining that the voice features falsely woke the VAD. This solves the problem of the device "talking to itself" and improves the accuracy of intelligent voice communication.

Description

Audio signal processing method, device, equipment and storage medium
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to an audio signal processing method, apparatus, device, and storage medium.
Background
With the rapid development of artificial intelligence and computing, intelligent voice dialogue is being widely developed and deployed, and intelligent voice communication between people and devices is receiving broad attention.
To respond in real time to human speech, a device (e.g., a smart speaker) uses Voice Activity Detection (VAD) to decide whether to respond to received audio. In current applications of voice endpoint detection, the device may pick up the very audio it is playing, send that received audio to the server, and the server will feed a response back to the device, so that the device and the server fall into a loop. For example: the device's playback element plays "Hello, nice to meet you", the device's sound-receiving element picks up the "Hello, nice to meet you" that is being played and sends that audio to the server, and the server keeps responding to it in a cycle. The device thus ends up "talking to itself", which disrupts intelligent voice communication between people and the device.
Disclosure of Invention
In view of this, one or more embodiments of the present invention describe an audio signal processing method, apparatus, device, and storage medium that solve the problem of the device "talking to itself" and improve the accuracy of intelligent voice communication.
According to a first aspect, there is provided an audio signal processing method, which may comprise:
determining the voice features in the received audio information when voice endpoint detection (VAD) is woken up;
identifying whether the played speech synthesis audio includes the voice features;
in the case where the speech synthesis audio includes the voice features, determining that the voice features falsely woke the VAD.
According to a second aspect, there is provided an audio signal processing apparatus, which may include:
a receiving module, configured to determine the voice features in the received audio information when voice endpoint detection (VAD) is woken up;
a recognition module, configured to identify whether the played speech synthesis audio includes the voice features;
and a processing module, configured to determine that the voice features falsely woke the VAD in the case where the speech synthesis audio includes the voice features.
According to a third aspect, there is provided an audio enclosure device comprising at least one processor and a memory, the memory being configured to store computer program instructions, and the processor being configured to execute the program in the memory to control the audio enclosure device to implement the audio signal processing method of the first aspect.
According to a fourth aspect, there is provided a computing device comprising at least one processor and a memory, the memory being configured to store computer program instructions, and the processor being configured to execute the program in the memory to control a server to implement the audio signal processing method of the first aspect.
According to a fifth aspect, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to execute the audio signal processing method of the first aspect.
With the scheme of the embodiments of the invention, whether the played speech synthesis audio includes the voice features found in the received audio information is identified, and if it does, the voice features are determined to have falsely woken the VAD. The voice features of the false wake are then used as negative samples for training the VAD model, so that the VAD model is updated, and the updated VAD model intercepts audio information that would falsely wake the VAD. This solves the problem of the device "talking to itself", reduces voice misrecognition during voice interaction, and improves recognition accuracy. The overall power consumption of the voice interaction system is reduced while the user experience is improved.
In addition, when interference information is received, the device does not need to send it to the server, which saves server processing resources and reduces the waste of communication resources between the device and the server. For the device, performance keeps improving as the device is used over time, raising the accuracy of intelligent voice communication.
Drawings
The present invention may be better understood from the following description of specific embodiments thereof taken in conjunction with the accompanying drawings, in which like or similar reference characters identify like or similar features.
Fig. 1 shows a schematic structural diagram of an audio signal interaction system according to an embodiment;
fig. 2 shows a flow diagram of an audio signal processing method according to an embodiment;
FIG. 3 illustrates a flow diagram of a method for updating a VAD model based on a server according to one embodiment;
FIG. 4 shows a flow diagram of a method of audio processing in a voice interactive system based on VAD techniques according to one embodiment;
fig. 5 shows a block diagram of the structure of an audio signal processing apparatus according to an embodiment;
FIG. 6 illustrates a schematic structural diagram of a computing device, according to one embodiment.
Detailed Description
Features and exemplary embodiments of various aspects of the present invention will be described in detail below, and in order to make objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not to be construed as limiting the invention. It will be apparent to one skilled in the art that the present invention may be practiced without some of these specific details. The following description of the embodiments is merely intended to provide a better understanding of the present invention by illustrating examples of the present invention.
It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another without necessarily requiring or implying any actual such relationship or order between those entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
In order to solve the problems in the prior art, embodiments of the present invention provide an audio processing method, apparatus, device, server, and storage medium.
Fig. 1 shows a schematic structural diagram of an audio signal interaction system according to an embodiment.
As shown in fig. 1, a practical application scenario of the voice interaction system may include an audio device 10 with a sound-receiving element A and a broadcasting element B, and a server 20.
The sound-receiving element A receives sound source 1, the audio device 10 reports the audio information of sound source 1 to the server 20, the server 20 determines response audio information 2 for that audio information and sends it to the audio device 10, and the broadcasting element B plays response audio information 2.
Since the sound-receiving element A and the broadcasting element B of the voice interaction system are in the same scene, the sound-receiving element A may pick up the response audio information 3 being played by the broadcasting element B (marked 3 because it is re-received audio), and the audio device 10 sends the received response audio information 3 to the server 20. To avoid the content played by the audio device 10 triggering VAD and raising the recognition error rate while the audio device 10 is interacting with the user, the server 20 matches response audio information 3 against response audio information 2 (i.e., the audio information the server 20 last sent to the audio device 10); if they match, the audio processing model is trained on response audio information 3 to obtain a target audio processing model. The server 20 synchronizes the target audio processing model to the audio device 10, so that when the audio device 10 receives new audio information, it determines with the target audio processing model whether the new audio information is interfering audio; if so, the new audio information is not sent to the server. The interfering audio may include: audio device 10 receiving audio played by audio device 10 itself and/or echoes of audio played by audio device 10.
In the embodiments of the invention, the audio processing method solves the problem of the device "talking to itself", reduces voice misrecognition during voice interaction, and improves recognition accuracy. The overall power consumption of the voice interaction system is reduced while the user experience is improved. In addition, when interference information is received, the device does not need to send it to the server, saving server processing resources and reducing the waste of communication resources between the device and the server. For the device, performance keeps improving as the device is used over time, raising the accuracy of intelligent voice communication.
The above voice interaction system can be applied to different scenarios, such as smart home, smart vehicle, smart wearables, healthcare, education, intelligent audio input/output, and smart shopping; it can be embedded in a variety of products that have sound-receiving and broadcasting elements, such as smart speakers, smart devices for children or adults, shopping software, audio playback software, and smart home appliances.
Based on the above voice interaction system scenario, the audio processing method is described in detail below with reference to fig. 2.
Fig. 2 shows a flow diagram of an audio signal processing method according to an embodiment.
As shown in fig. 2, the method comprises steps 210-230. First, step 210: when voice endpoint detection (VAD) is woken up, determine the voice features in the received audio information. Second, step 220: identify whether the played speech synthesis audio includes the voice features. Then, step 230: if the speech synthesis audio includes the voice features, determine that the voice features falsely woke the VAD.
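Before the step-by-step detail, the flow can be summarized in a short sketch. This is illustrative only: the two helper functions (RMS energy as a stand-in voice feature, a tolerance match as a stand-in containment test) are assumptions, and only the three-step control flow mirrors steps 210-230.

    import numpy as np

    def determine_voice_features(audio: np.ndarray) -> float:
        # Stand-in feature: root-mean-square energy of the audio
        return float(np.sqrt(np.mean(audio ** 2)))

    def tts_contains_features(tts_audio: np.ndarray, feature: float) -> bool:
        # Stand-in containment test: feature match within a tolerance
        return abs(determine_voice_features(tts_audio) - feature) < 0.05

    def process_wake(received: np.ndarray, played_tts: np.ndarray, vad_woken: bool):
        if not vad_woken:
            return None
        feature = determine_voice_features(received)      # step 210
        if tts_contains_features(played_tts, feature):    # step 220
            return "false_wake"                           # step 230: false wake of VAD
        return "correct_wake"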
The above steps are described in detail below:
involving step 210: the speech features include at least one of: echo, noise, clutter, silence characteristics of the played audio.
Specifically, the voice features are divided into echo features of the played audio, noise features, and silence features according to the audio energy.
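As an illustration, a minimal sketch of such an energy-based split follows; the frame size, hop length, and the two dB thresholds are assumptions rather than values from the patent:

    import numpy as np

    FRAME = 512          # assumed frame length in samples
    HOP = 256            # assumed hop length in samples
    SILENCE_DB = -50.0   # assumed energy floor below which a frame is silence
    NOISE_DB = -30.0     # assumed boundary between noise and voiced/echo frames

    def classify_frames_by_energy(audio: np.ndarray) -> np.ndarray:
        # Per-frame RMS log-energy of a mono signal scaled to [-1, 1]
        frames = np.lib.stride_tricks.sliding_window_view(audio, FRAME)[::HOP]
        rms = np.sqrt(np.mean(frames ** 2, axis=1) + 1e-12)
        energy_db = 20.0 * np.log10(rms + 1e-12)
        # Low energy -> silence, medium -> noise, high -> candidate voice/echo
        return np.where(energy_db < SILENCE_DB, "silence",
               np.where(energy_db < NOISE_DB, "noise", "voiced"))

The silence and noise frames found this way are the ones later marked as training negatives; "voiced" frames still need the echo check of step 220.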
Alternatively, the voice features may be divided into echo features of the played audio, noise features, and silence features according to the loudness, pitch, and spectral distribution of the audio.
Regarding step 220: it is identified whether the played speech synthesis audio includes the voice features.
Specifically, whether the played speech synthesis audio includes the voice features is identified with the trained VAD model.
If the speech synthesis audio does not include the voice features, it is determined that the VAD was correctly woken. The received audio information can then be sent to the server; the server matches feedback information for the audio information and sends it to the device, and the device plays the feedback audio corresponding to that feedback information.
If the speech synthesis audio includes the voice features, step 230 is performed.
Regarding step 230: if the speech synthesis audio includes the voice features, determine that the voice features falsely woke the VAD.
In one example, before determining that the voice features falsely woke the VAD, the method may further include:
marking the audio information to obtain marked audio information; and using the marked audio information as training negative samples of the VAD model, training the VAD model to obtain the trained VAD model.
Further, the echo features of the played audio, the noise features, and the silence features divided in step 210 are each marked, and the marked echo, noise, and silence features of the played audio are used as training negative samples of the VAD model.
For example: according to the audio energy of the noise features and the silence features, the audio segments corresponding to the noise features and the silence features are selected from the audio information; the segments are marked, and the marked segments are used as training negative samples of the VAD model.
It can also be understood that the audio information includes at least one feature, and each feature has a corresponding audio processing model; that is, one feature corresponds to one audio processing model, and the VAD model includes a plurality of audio processing models. Training the models this way yields a more accurate VAD model.
In addition, the VAD may also be falsely woken by sound played by the device itself. To prevent this, when the similarity between the voiceprint features of the played speech synthesis audio and the voiceprint features of the received audio information is higher than a preset threshold, the voice features are determined to be echo features of the played audio. Likewise, to handle keywords that falsely wake the VAD because their pronunciation resembles the wake keyword, any keyword identified as falsely waking the VAD is used as a training negative sample of the VAD model.
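A hedged sketch of this echo check: the patent does not specify how voiceprint features are computed, so the two embeddings are assumed inputs from some speaker-embedding extractor, and the threshold value is likewise an assumption; only the thresholded comparison follows the text.

    import numpy as np

    SIMILARITY_THRESHOLD = 0.85  # the "preset threshold"; this value is assumed

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    def is_echo_of_played_audio(received_vp: np.ndarray, played_vp: np.ndarray) -> bool:
        # True when the received audio is judged an echo of the played TTS audio
        return cosine_similarity(received_vp, played_vp) > SIMILARITY_THRESHOLD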
In summary: by identifying whether the played speech synthesis audio includes the voice features found in the received audio information, the voice features are determined to have falsely woken the VAD when the speech synthesis audio includes them. The voice features of the false wake are then used as negative samples for training the VAD model, the VAD model is updated, and the updated VAD model intercepts audio information that would falsely wake the VAD. This solves the problem of the device "talking to itself", reduces voice misrecognition during voice interaction, and improves recognition accuracy. The overall power consumption of the voice interaction system is reduced while the user experience is improved. In addition, when interference information is received, the device does not need to send it to the server, saving server processing resources and reducing the waste of communication resources between the device and the server.
Note that in one possible example, the VAD model can be updated and used on the device side, as in fig. 2; in another possible example, the VAD model may be updated on the server side, and the device uses the updated VAD model to intercept audio information that falsely wakes the VAD. The method for updating the VAD model on the server is therefore described in detail with reference to fig. 3. Fig. 3 shows a flow diagram of a method for updating a VAD model on the server according to one embodiment.
As shown in fig. 3, the method flow includes steps 310-350. The details are as follows:
step 310: and receiving the audio information sent by the equipment terminal, and identifying the voice characteristics in the audio information.
Step 320: and identifying whether the played voice synthesis audio sent to the equipment side comprises voice characteristics.
Step 330: in the case where the speech synthesis audio includes speech features, audio information corresponding to the speech features is taken as training negative samples of the VAD model.
And under the condition that the voice synthesis audio does not comprise voice characteristics, determining feedback information corresponding to the audio information according to the audio information, then sending the feedback information to the equipment end, and playing the feedback audio corresponding to the feedback information by the equipment end.
Step 340: train the VAD model on the training negative samples to obtain the trained VAD model.
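A minimal training sketch for step 340 follows. The patent does not disclose the model family, so a scikit-learn logistic regression stands in for the VAD model, and the feature rows are assumed to come from the marking of step 330.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def retrain_vad(pos_features: np.ndarray, neg_features: np.ndarray) -> LogisticRegression:
        # Fold the newly marked false-wake audio in as negatives and refit
        X = np.vstack([pos_features, neg_features])
        y = np.concatenate([np.ones(len(pos_features)),      # 1 = genuine speech
                            np.zeros(len(neg_features))])    # 0 = echo/noise/silence
        model = LogisticRegression(max_iter=1000)
        model.fit(X, y)
        return model  # serialized and pushed to the device in step 350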
Step 350: send the trained VAD model to the device, so that the device intercepts interference information with the trained VAD model.
By updating the VAD model on the device in this way, the device does not need to send interference information to the server when it is received, which saves server processing resources and reduces the waste of communication resources between the device and the server. For the device, performance keeps improving as the device is used over time, raising the accuracy of intelligent voice communication.
For ease of understanding, the following illustrates the audio processing method provided by an embodiment of the invention when the voice interaction system is applied to a smart speaker in combination with the VAD technique.
Received silence, noise, and speech synthesis audio information is identified through speech recognition; the original VAD model is trained on this audio information, and the device's VAD module is then updated. When the device receives new audio information, the updated VAD module determines whether it is interference information; if so, the device actively intercepts the new audio information instead of sending it to the server.
Fig. 4 shows a flow chart of a method of audio processing in a voice interactive system based on VAD techniques according to an embodiment.
Before describing the method itself, note that the system includes a smart speaker (i.e., the device side) and its corresponding server side. The smart speaker may include a feature extraction module and a VAD module; the server side may include a speech recognition module, a dialogue module, a synthesis module, and a training data integration module.
As shown in fig. 4, the method includes steps 410 to 490, which are specifically as follows:
step 410: receiving the audio information 1, and determining the audio characteristic A in the audio information 1.
The intelligent sound box receives the audio information 1, and the feature extraction module performs feature extraction on the audio information 1 to determine audio features A.
For example: and converting the time domain corresponding to the audio information 1 into a frequency domain, and performing feature extraction on the frequency domain by utilizing a mel frequency spectrum cepstrum coefficient.
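A feature-extraction sketch matching this example: librosa is one common way to compute MFCCs, and the sample rate and coefficient count here are assumptions, not values from the patent.

    import librosa

    def extract_features(path: str):
        y, sr = librosa.load(path, sr=16000)   # mono audio, 16 kHz assumed
        # librosa performs the time-to-frequency transform (STFT + mel filter
        # bank) internally and returns Mel-frequency cepstral coefficients
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
        return mfcc.T                          # (frames, 13) feature matrix

The resulting frame-level feature matrix can feed either the device-side VAD module or the server-side training of step 340.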
Step 420: determine whether to send the audio information to the server.
The VAD module in the smart speaker recognizes audio feature A and, from that recognition, determines whether audio information 1 needs to be sent to the server for processing.
If it needs to be sent, other processing modules (e.g., algorithm modules) in the smart speaker preprocess audio information 1, for example: noise reduction, gain, dereverberation, echo cancellation, and so on. If it does not need to be sent, the VAD module may shut off reception of the audio.
Step 430: recognize audio information 1 and determine the text for audio information 1.
The server receives audio information 1 sent by the smart speaker and converts it into text through the speech recognition module.
Step 440: the dialogue module performs semantic analysis on the text and generates audio information 2, where audio information 2 includes the response to audio information 1.
Further, key information is extracted and the meaning expressed by the text is judged; the response to the text is generated according to the key information. For example: if the key information is "play the song of *", the song is searched according to "*", and the audio of the "*" song serves as the response information.
The synthesis module converts the response information into audio information 2 and sends audio information 2 to the smart speaker, which plays it. This completes one round of voice interaction.
At this point, besides sending audio information 2 to the smart speaker, the synthesis module also needs to send audio information 2 to the training data integration module, so that when audio with the same voice features as audio information 2 is later received, the VAD model in the training data integration module can be trained on that audio.
To better explain the method provided by the present application, a second round of voice interaction building on the first round above is described next.
Step 450: while playing audio information 2, the smart speaker receives audio information 3 and sends audio information 3 to the server.
For the way the smart speaker sends audio information 3 to the server, refer to steps 410 and 420.
Step 460: the server receives audio information 3 sent by the smart speaker and determines that audio information 3 includes voice feature B.
Step 470: the server determines whether audio information 2 includes voice feature B.
Further, in one implementation, if audio information 2 includes voice feature B, the VAD model is trained on audio information 3 to obtain the trained VAD model: audio information 3 is marked, and the marked audio information 3 is used as a training negative sample of the VAD model.
Alternatively, in another example, if audio information 3 consists of noise features and/or silence features, audio information 3 is used directly as a training negative sample of the VAD model: according to the audio energy of the noise features and/or silence features, the audio segments corresponding to them are selected from audio information 3, and the segments are marked as training negative samples of the VAD model.
Alternatively, in another example, if audio information 3 includes speech features as well as noise and/or silence features, the speech features can be aligned to text (forced alignment) through the acoustic model of the speech recognition module, and the remainder is then marked as silence and noise features, which serve as training negative samples of the VAD model. If audio information 2 includes the speech features, the VAD model is trained on audio information 3 to obtain the trained VAD model: audio information 3 is marked, and the marked audio information 3 is used as a training negative sample of the VAD model.
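For this mixed case, once forced alignment has produced the speech spans, the remainder of the audio can be carved out for silence/noise negatives. The sketch below assumes the aligner has already returned sorted (start, end) spans in samples; the aligner itself is not shown.

    def complement_spans(aligned_spans, total_len):
        # aligned_spans: sorted (start, end) speech spans in samples.
        # Returns the gaps between them, which receive silence/noise
        # negative labels for VAD training.
        gaps, cursor = [], 0
        for start, end in aligned_spans:
            if start > cursor:
                gaps.append((cursor, start))
            cursor = max(cursor, end)
        if cursor < total_len:
            gaps.append((cursor, total_len))
        return gaps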
Step 480: the server sends the trained VAD model to the smart speaker; when the smart speaker receives other audio information, the trained VAD model is used to determine whether that audio information is interference information.
Step 490: the smart speaker receives the VAD model; when the smart speaker receives audio information 4, audio information 4 is fed into the VAD model to obtain an output.
If the output indicates that audio information 4 is interference information, audio information 4 is not sent to the server. Conversely, if the output indicates that audio information 4 is non-interference information, audio information 4 is sent to the server.
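A device-side interception sketch for steps 480-490: the model object mirrors the logistic-regression stand-in sketched under step 340, and the 0/1 label convention is an assumption carried over from that sketch.

    import numpy as np

    def should_forward(model, features: np.ndarray) -> bool:
        # Returns True when the audio should be sent to the server.
        # Interference (label 0) is intercepted locally and never leaves the
        # device, saving server processing and communication resources.
        prediction = model.predict(features.reshape(1, -1))[0]
        return prediction != 0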
Note that in actual applications, because the VAD model must be continuously optimized on a large number of samples, an output indicating that audio information 4 is non-interference information may still be mistaken (it may in fact be interference) simply because the device-side VAD model has not yet seen such samples; in that case the computations of steps 470-490 are performed by the server so that the VAD model can be further optimized.
This completes one iteration cycle of updating the VAD model. With the user's daily use, the VAD model in the VAD module adapts more and more to the environment in which the smart speaker is placed and captures the user's voice more accurately, which solves the problem of the device "talking to itself", reduces voice misrecognition during voice interaction, and improves recognition accuracy.
In the embodiments of the invention, by identifying whether the played speech synthesis audio includes the voice features found in the received audio information, the VAD is determined to have been falsely woken when the speech synthesis audio includes those voice features. The voice features of the false wake are then used as negative samples for training the VAD model, the VAD model is updated, and the updated VAD model intercepts audio information that would falsely wake the VAD. This solves the problem of the device "talking to itself", reduces voice misrecognition during voice interaction, improves recognition accuracy, reduces the overall power consumption of the voice interaction system, and improves the user experience. In addition, when interference information is received, the device does not need to send it to the server, saving server processing resources and reducing the waste of communication resources between the device and the server.
Fig. 5 shows a block diagram of the structure of an audio signal processing apparatus according to an embodiment.
As shown in fig. 5, the apparatus 50 may include:
the transceiving module 501 determines the voice feature in the received audio information in case of waking up the voice endpoint to detect VAD.
A recognition module 502 for recognizing whether the played speech synthesis audio includes speech features.
A processing module 503, configured to determine that the voice feature is a false wake VAD if the speech synthesis audio includes the voice feature.
The apparatus 50 may further include a training module 504, configured to mark the audio information to obtain marked audio information, use the marked audio information as training negative samples of the VAD model, and train the VAD model to obtain the trained VAD model.
The training module 504 may be specifically configured to divide the voice features into echo features of the played audio, noise features, and silence features according to the audio energy; further, to mark each of the divided echo, noise, and silence features of the played audio; and to use the marked echo, noise, and silence features of the played audio as training negative samples of the VAD model.
In one possible example, the voice features are determined to be echo features of the played audio when the similarity between the voiceprint features of the played speech synthesis audio and the voiceprint features of the audio information is higher than a preset threshold.
In another possible example, when a keyword that falsely wakes the VAD is identified, the keyword is used as a training negative sample of the VAD model.
The recognition module 502 may be specifically configured to identify, according to the trained VAD model, whether the played speech synthesis audio includes the voice features.
The processing module 503 may be specifically configured to determine that the VAD was correctly woken when the speech synthesis audio does not include the voice features.
FIG. 6 illustrates a schematic structural diagram of a computing device, according to one embodiment.
Fig. 6 is a block diagram of an exemplary hardware architecture of a computing device capable of implementing the audio signal processing method and apparatus according to embodiments of the invention.
The apparatus may include a processor 601 and a memory 602 storing computer program instructions.
Specifically, the processor 601 may include a Central Processing Unit (CPU) or an Application-Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits implementing embodiments of the present application.
Memory 602 may include mass storage for data or instructions. By way of example, and not limitation, memory 602 may include a Hard Disk Drive (HDD), a floppy disk drive, flash memory, an optical disk, a magneto-optical disk, magnetic tape, or a Universal Serial Bus (USB) drive or a combination of two or more of these. Memory 602 may include removable or non-removable (or fixed) media, where appropriate. Memory 602 may be internal or external to the integrated gateway device, where appropriate. In a particular embodiment, the memory 602 is a non-volatile solid-state memory. In a particular embodiment, the memory 602 includes Read Only Memory (ROM). Where appropriate, the ROM may be mask-programmed ROM, Programmable ROM (PROM), Erasable PROM (EPROM), Electrically Erasable PROM (EEPROM), electrically rewritable ROM (EAROM), or flash memory, or a combination of two or more of these.
The processor 601 realizes any one of the audio signal processing methods in the above-described embodiments by reading and executing computer program instructions stored in the memory 602.
The transceiver 603 is mainly used for communication with other apparatuses or devices in embodiments of the invention.
In one example, the device may also include a bus 604. As shown in fig. 6, the processor 601, the memory 602, and the transceiver 603 are connected via a bus 604 and communicate with each other.
Bus 604 includes hardware, software, or both. By way of example, and not limitation, the bus may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association local bus (VLB), another suitable bus, or a combination of two or more of these. Bus 604 may include one or more buses, where appropriate. Although specific buses are described and shown in the embodiments of this application, any suitable buses or interconnects are contemplated.
Embodiments of the present invention also provide a computer-readable storage medium on which a computer program is stored which, when executed in a computer, causes the computer to perform the steps of the audio signal processing method of an embodiment of the invention.
It is to be understood that the invention is not limited to the particular arrangements and instrumentalities described above and shown in the drawings. For convenience and brevity, detailed descriptions of known methods are omitted here; for the specific working processes of the system, modules, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
It will be apparent to those skilled in the art that the method procedures of the present invention are not limited to the specific steps described and shown; various changes, modifications, additions, equivalent substitutions, and changes to the sequence of steps are possible within the technical scope of the present invention once its spirit is appreciated.

Claims (11)

1. An audio signal processing method, comprising:
determining voice features in the received audio information when voice endpoint detection (VAD) is woken up;
identifying whether the played speech synthesis audio includes the voice features;
determining that the voice features falsely woke the VAD if the speech synthesis audio includes the voice features.
2. The method of claim 1, further comprising:
marking the audio information to obtain marked audio information;
and using the marked audio information as training negative samples of a VAD model, and training the VAD model to obtain the trained VAD model.
3. The method of claim 2, wherein the determining voice features in the received audio information comprises:
dividing the voice features into echo features of the played audio, noise features, and silence features according to the audio energy.
4. The method of claim 3, wherein the using the marked audio information as training negative samples of the VAD model comprises:
marking each of the divided echo features, noise features, and silence features of the played audio;
and using the marked echo, noise, and silence features of the played audio as training negative samples of the VAD model.
5. The method of claim 3, further comprising:
and determining the voice features to be echo features of the played audio when the similarity between the voiceprint features of the played speech synthesis audio and the voiceprint features of the audio information is higher than a preset threshold.
6. The method of claim 3, further comprising:
and using a keyword as a training negative sample of the VAD model when the keyword is identified as falsely waking the VAD.
7. The method of claim 2, wherein the identifying whether the played speech synthesis audio includes the voice features comprises:
identifying whether the played speech synthesis audio includes the voice features according to the trained VAD model.
8. The method of claim 1, further comprising:
determining that the VAD was correctly woken if the speech synthesis audio does not include the voice features.
9. An audio signal processing apparatus, comprising:
a receiving module, configured to determine voice features in the received audio information when voice endpoint detection (VAD) is woken up;
a recognition module, configured to identify whether the played speech synthesis audio includes the voice features;
a processing module, configured to determine that the voice features falsely woke the VAD if the speech synthesis audio includes the voice features.
10. An audio enclosure device comprising at least one processor and a memory, the memory being configured to store computer program instructions, and the processor being configured to execute the program in the memory to control the audio enclosure device to implement the method of any one of claims 1-8.
11. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed in a computer, causes the computer to perform the method of any one of claims 1-8.
CN201910604779.0A 2019-07-05 2019-07-05 Audio signal processing method, device, equipment and storage medium Pending CN112185425A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910604779.0A CN112185425A (en) 2019-07-05 2019-07-05 Audio signal processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910604779.0A CN112185425A (en) 2019-07-05 2019-07-05 Audio signal processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112185425A 2021-01-05

Family

ID=73915987

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910604779.0A Pending CN112185425A (en) 2019-07-05 2019-07-05 Audio signal processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112185425A (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100246837A1 (en) * 2009-03-29 2010-09-30 Krause Lee S Systems and Methods for Tuning Automatic Speech Recognition Systems
CN106940998A (en) * 2015-12-31 2017-07-11 阿里巴巴集团控股有限公司 A kind of execution method and device of setting operation
CN107977183A (en) * 2017-11-16 2018-05-01 百度在线网络技术(北京)有限公司 voice interactive method, device and equipment
CN108831477A (en) * 2018-06-14 2018-11-16 出门问问信息科技有限公司 A kind of audio recognition method, device, equipment and storage medium
CN109461446A (en) * 2018-12-24 2019-03-12 出门问问信息科技有限公司 Method, device, system and storage medium for identifying user target request
CN109753665A (en) * 2019-01-30 2019-05-14 北京声智科技有限公司 Wake up the update method and device of model

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113241073A (en) * 2021-06-29 2021-08-10 深圳市欧瑞博科技股份有限公司 Intelligent voice control method and device, electronic equipment and storage medium
CN113270099A (en) * 2021-06-29 2021-08-17 深圳市欧瑞博科技股份有限公司 Intelligent voice extraction method and device, electronic equipment and storage medium
CN113270099B (en) * 2021-06-29 2023-08-29 深圳市欧瑞博科技股份有限公司 Intelligent voice extraction method and device, electronic equipment and storage medium
CN113241073B (en) * 2021-06-29 2023-10-31 深圳市欧瑞博科技股份有限公司 Intelligent voice control method, device, electronic equipment and storage medium
CN114360523A (en) * 2022-03-21 2022-04-15 深圳亿智时代科技有限公司 Keyword dataset acquisition and model training methods, devices, equipment and medium

Similar Documents

Publication Publication Date Title
JP6613347B2 (en) Method and apparatus for pushing information
US8543402B1 (en) Speaker segmentation in noisy conversational speech
CN108694940B (en) Voice recognition method and device and electronic equipment
CN112397083B (en) Voice processing method and related device
CN102568478B (en) Video play control method and system based on voice recognition
CN107799126A (en) Sound end detecting method and device based on Supervised machine learning
CN107147618A (en) A kind of user registering method, device and electronic equipment
CN103971680A (en) Method and device for recognizing voices
CN104575504A (en) Method for personalized television voice wake-up by voiceprint and voice identification
CN105679310A (en) Method and system for speech recognition
CN112185425A (en) Audio signal processing method, device, equipment and storage medium
EP3989217B1 (en) Method for detecting an audio adversarial attack with respect to a voice input processed by an automatic speech recognition system, corresponding device, computer program product and computer-readable carrier medium
CN103680497A (en) Voice recognition system and voice recognition method based on video
CN111210829A (en) Speech recognition method, apparatus, system, device and computer readable storage medium
CN111145763A (en) GRU-based voice recognition method and system in audio
CN106558306A (en) Method for voice recognition, device and equipment
CN110875045A (en) Voice recognition method, intelligent device and intelligent television
CN111916068A (en) Audio detection method and device
CN111178081A (en) Semantic recognition method, server, electronic device and computer storage medium
CN108322770A (en) Video frequency program recognition methods, relevant apparatus, equipment and system
CN113889091A (en) Voice recognition method and device, computer readable storage medium and electronic equipment
CN110808050A (en) Voice recognition method and intelligent equipment
CN111540357A (en) Voice processing method, device, terminal, server and storage medium
CN115132197B (en) Data processing method, device, electronic equipment, program product and medium
CN111063338B (en) Audio signal identification method, device, equipment, system and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination