CN112992189B - Voice audio detection method and device, storage medium and electronic device

Info

Publication number: CN112992189B
Application number: CN202110130677.7A
Authority: CN (China)
Prior art keywords: Fbank, audio, voice, feature
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN112992189A (application publication)
Inventors: 贾基东, 张卓博, 赵培, 苏腾荣
Current assignee: Qingdao Haier Technology Co Ltd; Haier Smart Home Co Ltd
Original assignee: Qingdao Haier Technology Co Ltd; Haier Smart Home Co Ltd
Application filed by Qingdao Haier Technology Co Ltd and Haier Smart Home Co Ltd

Classifications

All classifications fall under G (Physics) > G10 (Musical instruments; acoustics) > G10L (Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding):

    • G10L25/78: Detection of presence or absence of voice signals
    • G10L15/063: Training (creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L19/005: Correction of errors induced by the transmission channel, if related to the coding algorithm
    • G10L25/18: Analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L25/30: Analysis techniques characterised by the analysis technique, using neural networks
    • G10L25/45: Analysis techniques characterised by the type of analysis window
    • G10L25/84: Detection of presence or absence of voice signals for discriminating voice from noise
    • G10L2015/223: Execution procedure of a spoken command
    • G10L2025/783: Detection of presence or absence of voice signals based on threshold decision
    • G10L2025/786: Adaptive threshold

Landscapes

  • Engineering & Computer Science
  • Physics & Mathematics
  • Acoustics & Sound
  • Audiology, Speech & Language Pathology
  • Human Computer Interaction
  • Health & Medical Sciences
  • Computational Linguistics
  • Multimedia
  • Signal Processing
  • Artificial Intelligence
  • Spectroscopy & Molecular Physics
  • Evolutionary Computation
  • Telephonic Communication Services
  • Compression, Expansion, Code Conversion, And Decoders

Abstract

The invention provides a voice audio detection method and apparatus, a storage medium, and an electronic apparatus. The method comprises the following steps: receiving a first voice audio sent by a voice wake-up module, and acquiring a plurality of first Fbank features of the first voice audio; sequentially performing an encoding operation and a decoding operation on each first Fbank feature through a differential autoencoder neural network model to obtain a second Fbank feature corresponding to each first Fbank feature; and determining a deviation value between each first Fbank feature and its corresponding second Fbank feature, and determining again, according to the deviation value, whether the first voice audio is wake-up word audio.

Description

Voice audio detection method and device, storage medium and electronic device
Technical Field
The present invention relates to the field of communications, and in particular to a voice audio detection method and apparatus, a storage medium, and an electronic apparatus.
Background
At present, intelligent voice interaction is increasingly common on devices such as mobile phones, smart speakers, household appliances, and wearables, and voice wake-up, as the trigger mechanism for voice interaction, is used ever more frequently in daily life. In practical applications, however, voice wake-up technology must trade off the false wake-up rate against the wake-up rate: to ensure that the user can reliably wake up the device, the false wake-up rate of the device is kept high. This problem grows more prominent as intelligent interactive devices proliferate in daily life. In the prior art there are two approaches to optimizing voice wake-up performance. The first is to optimize the voice wake-up algorithm itself; however, voice wake-up is mainly applied in the far field, where the energy of the voice source attenuates severely and is easily affected by environmental noise and room reverberation, so wake-up performance remains poor. The second is to add a secondary detection mechanism after the voice wake-up module. The conventional secondary detection method is based on a Gaussian mixture model: the wake-up word is modeled with a Gaussian mixture model, and the probabilities of the wake-up word and of non-wake-up words are computed for the input audio. However, the modeling capability of a Gaussian mixture model is limited and cannot model voice accurately; in particular, in complex real pickup environments, wake-up performance based on a Gaussian mixture model degrades severely. It follows that the prior art does not solve the above technical problem well.
No effective solution has yet been proposed for the problem in the related art that, in order to guarantee the voice wake-up rate of an intelligent device, the false wake-up rate of the device is inevitably high.
Disclosure of Invention
Embodiments of the present invention provide a voice audio detection method and apparatus, a storage medium, and an electronic apparatus, to solve the problem in the related art that, in order to guarantee the voice wake-up rate of an intelligent device, the false wake-up rate of the device is high.
According to an embodiment of the present invention, a voice audio detection method is provided, including: receiving a first voice audio sent by a voice wake-up module, and acquiring a plurality of first Fbank features of the first voice audio, wherein the first voice audio has been detected by the voice wake-up module and its wake-up word probability is greater than a preset threshold; sequentially performing an encoding operation and a decoding operation on each first Fbank feature through a differential autoencoder neural network model to obtain a second Fbank feature corresponding to each first Fbank feature, wherein the differential autoencoder neural network model has learned the similar features of all wake-up word audio during model training; and determining a deviation value between each first Fbank feature and its corresponding second Fbank feature, and determining again, according to the deviation value, whether the first voice audio is wake-up word audio.
Optionally, acquiring the plurality of first Fbank features of the first voice audio includes: performing a framing operation on the first voice audio to obtain multi-frame audio; performing a pre-emphasis operation on the multi-frame audio to obtain pre-emphasized audio; performing a windowing operation on the high-frequency voice part of the pre-emphasized audio to obtain windowed audio; and extracting the plurality of first Fbank features of the first voice audio from the windowed audio.
Optionally, extracting the plurality of first Fbank features of the first voice audio from the windowed audio includes: performing a Fourier transform operation on the windowed audio to obtain a Fourier transform result; performing a Mel filtering operation on the Fourier transform result to obtain filtered audio; and performing a logarithm operation on the filtered audio to obtain the plurality of first Fbank features.
Optionally, determining the deviation value between each first Fbank feature and the second Fbank feature corresponding to each first Fbank feature includes: determining the deviation value A according to the following formula: A = Σ_K (Y_K - X_K)^2, where Y_K is the K-th second Fbank feature, X_K is the K-th first Fbank feature, and K is a positive integer.
Optionally, before the encoding operation and the decoding operation are sequentially performed on each first Fbank feature through the differential autoencoder neural network model to obtain the second Fbank feature corresponding to each first Fbank feature, the method further includes: performing model training on the differential autoencoder neural network model using a plurality of noise-free wake-up word audio samples; and, when the similar features have been learned by the differential autoencoder neural network model, saving the similar features into a plurality of GRU units respectively, wherein the differential autoencoder neural network model includes the plurality of GRU units.
Optionally, sequentially performing the encoding operation and the decoding operation on each first Fbank feature through the differential autoencoder neural network model to obtain the second Fbank feature corresponding to each first Fbank feature includes: encoding the received H_{K-1} and X_K through each GRU unit in the encoder of the differential autoencoder neural network model to obtain the feature code corresponding to each GRU unit in the encoder, wherein, when K equals 1, H_0 is a zero sequence, and when K does not equal 1, H_{K-1} is the result of the (K-1)-th GRU unit encoding H_{K-2} and X_{K-1}; and decoding the feature codes received by each GRU unit in the decoder of the differential autoencoder neural network model to obtain the second Fbank feature corresponding to each GRU unit in the decoder.
Optionally, determining again whether the first voice audio is wake-up word audio according to the deviation value includes: determining that the first voice audio is not wake-up word audio when the deviation value is greater than a preset threshold; and determining that the first voice audio is wake-up word audio when the deviation value is less than or equal to the preset threshold.
According to another embodiment of the present invention, a voice audio detection apparatus is further provided, including: a first obtaining module, configured to receive a first voice audio sent by a voice wake-up module and acquire a plurality of first Fbank features of the first voice audio, wherein the first voice audio has been detected by the voice wake-up module and its wake-up word probability is greater than a preset threshold; a second obtaining module, configured to sequentially perform an encoding operation and a decoding operation on each first Fbank feature through a differential autoencoder neural network model to obtain a second Fbank feature corresponding to each first Fbank feature, wherein the differential autoencoder neural network model has learned the similar features of all wake-up word audio during model training; and a determining module, configured to determine a deviation value between each first Fbank feature and its corresponding second Fbank feature, and determine again, according to the deviation value, whether the first voice audio is wake-up word audio.
According to yet another embodiment of the invention, there is also provided a computer-readable storage medium comprising a stored program, wherein the program when executed performs the method described in any of the above.
According to yet another embodiment of the present invention, there is also provided an electronic apparatus comprising a memory having a computer program stored therein and a processor arranged to perform the method described in any one of the above by means of the computer program.
According to the present invention, a first voice audio sent by a voice wake-up module is received, and a plurality of first Fbank features of the first voice audio are acquired, wherein the first voice audio has been detected by the voice wake-up module and its wake-up word probability is greater than a preset threshold; an encoding operation and a decoding operation are sequentially performed on each first Fbank feature through a differential autoencoder neural network model to obtain a second Fbank feature corresponding to each first Fbank feature, wherein the differential autoencoder neural network model has learned the similar features of all wake-up word audio during model training; and a deviation value between each first Fbank feature and its corresponding second Fbank feature is determined, and whether the first voice audio is wake-up word audio is determined again according to the deviation value. That is, a plurality of first Fbank features of the first voice audio are acquired, each first Fbank feature is sequentially encoded and decoded through the differential autoencoder neural network model to obtain its corresponding second Fbank feature, and whether the first voice audio is wake-up word audio can be determined from the deviation value between each first Fbank feature and its corresponding second Fbank feature. This technical solution solves the problem in the related art that, in order to guarantee the voice wake-up rate of an intelligent device, the false wake-up rate of the device is high; the false wake-up rate is thus reduced while the wake-up rate remains high, and the user's experience of interacting with the intelligent device is improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and constitute a part of this application, illustrate embodiments of the invention and, together with the description, serve to explain the invention without limiting it. In the drawings:
Fig. 1 is a block diagram of the hardware structure of an intelligent device for the voice audio detection method according to an embodiment of the present invention;
Fig. 2 is a flowchart of the voice audio detection method according to an embodiment of the present invention;
Fig. 3 is a block diagram of the voice wake-up secondary detection model according to an embodiment of the present invention;
Fig. 4 is a flowchart of the voice audio feature extraction method according to an embodiment of the present invention;
Fig. 5 is a block diagram of the differential autoencoder neural network model according to an embodiment of the present invention;
Fig. 6 is a block diagram of the voice audio detection apparatus according to an embodiment of the present invention.
Detailed Description
The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
The method provided in the embodiments of the present application may be executed on an intelligent device or a similar computing device. Taking execution on an intelligent device as an example, fig. 1 is a block diagram of the hardware structure of an intelligent device for the voice audio detection method according to an embodiment of the present invention. As shown in fig. 1, the intelligent device may include one or more processors 102 (only one is shown in fig. 1), where the processors 102 may include, but are not limited to, a microprocessor (MPU) or a programmable logic device (PLD), and a memory 104 for storing data. In an exemplary embodiment, the intelligent device may further include a transmission device 106 for communication functions and an input/output device 108. Those skilled in the art will understand that the structure shown in fig. 1 is merely illustrative and does not limit the structure of the intelligent device. For example, the intelligent device may include more or fewer components than shown in fig. 1, or have a different configuration with functionality equivalent to or greater than that shown in fig. 1.
The memory 104 may be used to store computer programs, for example, software programs and modules of application software, such as a computer program corresponding to the voice audio detection method in the embodiment of the present invention. The processor 102 executes various functional applications and data processing, thereby implementing the above-mentioned method, by running the computer program stored in the memory 104. The memory 104 may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the intelligent device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network may include a wireless network provided by a communication provider of the intelligent device. In one example, the transmission device 106 includes a network adapter (NIC), which can be connected to other network devices through a base station so as to communicate with the internet. In another example, the transmission device 106 may be a radio frequency (RF) module, which communicates with the internet wirelessly.
In this embodiment, a voice audio detection method running on the above intelligent device is provided. Fig. 2 is a flowchart of the voice audio detection method according to an embodiment of the present invention, and the flow includes the following steps:
Step S202: receiving a first voice audio sent by a voice wake-up module, and acquiring a plurality of first Fbank features of the first voice audio, wherein the first voice audio has been detected by the voice wake-up module and its wake-up word probability is greater than a preset threshold;
Step S204: sequentially performing an encoding operation and a decoding operation on each first Fbank feature through a differential autoencoder neural network model to obtain a second Fbank feature corresponding to each first Fbank feature, wherein the differential autoencoder neural network model has learned the similar features of all wake-up word audio during model training;
Step S206: determining a deviation value between each first Fbank feature and its corresponding second Fbank feature, and determining again, according to the deviation value, whether the first voice audio is wake-up word audio.
According to the present invention, a first voice audio sent by a voice wake-up module is received, and a plurality of first Fbank features of the first voice audio are acquired, wherein the first voice audio has been detected by the voice wake-up module and its wake-up word probability is greater than a preset threshold; an encoding operation and a decoding operation are sequentially performed on each first Fbank feature through a differential autoencoder neural network model to obtain a second Fbank feature corresponding to each first Fbank feature, wherein the differential autoencoder neural network model has learned the similar features of all wake-up word audio during model training; and a deviation value between each first Fbank feature and its corresponding second Fbank feature is determined, and whether the first voice audio is wake-up word audio is determined again according to the deviation value. That is, a plurality of first Fbank features of the first voice audio are acquired, each first Fbank feature is sequentially encoded and decoded through the differential autoencoder neural network model to obtain its corresponding second Fbank feature, and whether the first voice audio is wake-up word audio can be determined from the deviation value between each first Fbank feature and its corresponding second Fbank feature. This technical solution solves the problem in the related art that, in order to guarantee the voice wake-up rate of an intelligent device, the false wake-up rate of the device is high; the false wake-up rate is thus reduced while the wake-up rate remains high, and the user's experience of interacting with the intelligent device is improved.
In step S202, acquiring the plurality of first Fbank features of the first voice audio includes: performing a framing operation on the first voice audio to obtain multi-frame audio; performing a pre-emphasis operation on the multi-frame audio to obtain pre-emphasized audio; performing a windowing operation on the high-frequency voice part of the pre-emphasized audio to obtain windowed audio; and extracting the plurality of first Fbank features of the first voice audio from the windowed audio.
It should be noted that the framing operation divides the first voice audio into a plurality of sequences according to a first preset length, where one sequence is one frame and a second preset length is the frame shift; the first preset length and the second preset length may be set according to the specific situation. The pre-emphasis operation is performed on the obtained multi-frame audio according to s'(n) = s(n) - a·s(n-1), where s(n) is the voice audio at time n, s(n-1) is the voice audio at time n-1, s'(n) is the high-frequency voice part obtained after the pre-emphasis operation, and a is the pre-emphasis coefficient; the multi-frame audio includes the voice audio at time n and the voice audio at time n-1. The pre-emphasis coefficient may be set according to the specific situation. A windowing operation is then performed on the high-frequency voice part of the pre-emphasized audio; windowing mitigates the spectral leakage introduced by the framing stage. From the windowed audio, the plurality of first Fbank features of the first voice audio can be extracted.
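As an illustration of these three steps, the following is a minimal NumPy sketch, assuming a 16 kHz sample rate, the 25 ms frame and 10 ms frame shift used in step S402 below, a pre-emphasis coefficient a = 0.97, and a Hamming window; the sample rate, coefficient value, and window type are assumptions for the example and are not fixed by the patent.

```python
import numpy as np

def frame_preemphasize_window(signal, sample_rate=16000,
                              frame_ms=25, shift_ms=10, a=0.97):
    """Framing, pre-emphasis s'(n) = s(n) - a*s(n-1), and Hamming windowing."""
    signal = np.asarray(signal, dtype=np.float64)
    # Pre-emphasis applied to the whole signal before slicing it into frames.
    emphasized = np.append(signal[0], signal[1:] - a * signal[:-1])
    frame_len = int(sample_rate * frame_ms / 1000)    # 25 ms -> 400 samples at 16 kHz
    frame_shift = int(sample_rate * shift_ms / 1000)  # 10 ms -> 160 samples at 16 kHz
    num_frames = 1 + max(0, (len(emphasized) - frame_len) // frame_shift)
    window = np.hamming(frame_len)  # windowing mitigates spectral leakage from framing
    frames = np.stack([emphasized[i * frame_shift:i * frame_shift + frame_len] * window
                       for i in range(num_frames)])
    return frames  # shape: (num_frames, frame_len)
```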
When step S202 is executed, extracting the plurality of first Fbank features of the first voice audio from the windowed audio includes: performing a Fourier transform operation on the windowed audio to obtain a Fourier transform result; performing a Mel filtering operation on the Fourier transform result to obtain filtered audio; and performing a logarithm operation on the filtered audio to obtain the plurality of first Fbank features.
It should be noted that the windowed audio requires the following further processing to extract the plurality of first Fbank features of the first voice audio. A Fourier transform operation is performed on the windowed audio to convert it from the time domain to the frequency domain; combining the time-domain signal of the windowed audio with its frequency-domain signal yields the time-frequency spectrum of the windowed audio, called the Fourier transform result. A Mel filtering operation is then performed on the Fourier transform result; Mel filtering re-divides the signal frequency bands according to the sensitivity of the human ear to sounds in different bands. A bank of Mel filters is constructed according to this division of the frequency bands, and the Fourier transform result of the signal is converted into the Mel frequency domain, giving what is called the Mel-frequency spectrogram of the signal. The Mel filters are connected in parallel, each filtering one frequency band. A logarithm operation is finally applied to the filtered audio: because the dynamic range of a sound signal is extremely large, the energy of the Mel-frequency spectrogram is logarithmized and converted to sound pressure level in order to compress the dynamic range of the signal and highlight its detail, yielding the Fbank spectrogram. It should be noted that the Fbank spectrogram comprises the plurality of first Fbank features. Through this technical means, a plurality of first Fbank features can be acquired from a segment of voice audio.
Optionally, because the gain from acoustic signal to electrical signal differs across devices, the Fbank spectrogram may be normalized, that is, the values of the first Fbank features may be mapped into the interval from zero to one, to prevent device differences from degrading the performance of the subsequent differential autoencoder neural network model.
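The following sketch illustrates one possible rendering of the Fourier transform, Mel filtering, logarithm, and optional normalization steps, operating on the frames produced by the previous sketch; the FFT size (512) and filter count (40) are illustrative assumptions, not values specified by the patent.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def fbank_features(frames, sample_rate=16000, n_fft=512, n_mels=40):
    """FFT -> Mel filter bank -> log -> normalization into [0, 1]."""
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2  # per-frame power spectrum

    # Triangular Mel filter bank: band edges evenly spaced on the Mel scale.
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_edges) / sample_rate).astype(int)
    filters = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        filters[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        filters[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)

    log_mel = np.log(power @ filters.T + 1e-10)  # log compresses the dynamic range
    # Normalize to [0, 1] to absorb per-device acoustic-to-electrical gain differences.
    return (log_mel - log_mel.min()) / (log_mel.max() - log_mel.min() + 1e-10)
```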
In step S206, determining the deviation value between each first Fbank feature and the second Fbank feature corresponding to each first Fbank feature includes: determining the deviation value A according to the following formula: A = Σ_K (Y_K - X_K)^2, where Y_K is the K-th second Fbank feature, X_K is the K-th first Fbank feature, and K is a positive integer.
It should be noted that A = Σ_K (Y_K - X_K)^2 can be used to calculate the deviation value between each first Fbank feature and its corresponding second Fbank feature, where Y_K is the K-th second Fbank feature and X_K is the K-th first Fbank feature. Since K is an integer greater than or equal to one, a plurality of X_K and Y_K are obtained, namely each first Fbank feature and its corresponding second Fbank feature. Whether the first voice audio is wake-up word audio is then determined again according to the deviation value. Alternatively, the deviation value between each first Fbank feature and its corresponding second Fbank feature may be calculated from the variance and the mean square error of X_K and Y_K; if the deviation value is calculated from the variance and the mean square error, the preset threshold only needs to be adjusted correspondingly, and whether the first voice audio is wake-up word audio can still be determined again according to the deviation value. It should be noted that there are a plurality of first Fbank features, so a plurality of second Fbank features are finally obtained, with a one-to-one correspondence between the first Fbank features and the second Fbank features.
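The squared-error deviation A = Σ_K (Y_K - X_K)^2 described above can be rendered in a few lines of NumPy; the function and argument names are illustrative:

```python
import numpy as np

def deviation_value(first_fbank, second_fbank):
    """A = sum_K (Y_K - X_K)^2 over corresponding Fbank feature pairs."""
    X = np.asarray(first_fbank)   # first Fbank features X_K
    Y = np.asarray(second_fbank)  # reconstructed second Fbank features Y_K
    return float(np.sum((Y - X) ** 2))
```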
Before step S204 is executed, that is, before each first Fbank feature is sequentially encoded and decoded through the differential autoencoder neural network model to obtain its corresponding second Fbank feature, the method further includes: performing model training on the differential autoencoder neural network model using a plurality of noise-free wake-up word audio samples; and, when the similar features have been learned by the differential autoencoder neural network model, saving the similar features into a plurality of GRU units respectively, wherein the differential autoencoder neural network model includes the plurality of GRU units.
The differential autoencoder neural network model is trained with a plurality of noise-free wake-up word audio samples, where the model training is deep-learning training. The model is trained with a large amount of noise-free wake-up word audio so that it learns the similar features of all wake-up word audio, and the similar features are saved into each GRU (Gated Recurrent Unit) of the differential autoencoder neural network model. It should be noted that the wake-up word audio used for training must satisfy two conditions: it is noise-free, and it is plentiful. The noise-free condition ensures that the model learns the similar features of wake-up word audio itself; if the model instead learned the similar features of noise (audio other than wake-up word audio being noise), then sequentially encoding and decoding the first Fbank features would yield second Fbank features containing noise features (the first Fbank features and the second Fbank features are in one-to-one correspondence). In that case the second Fbank features would differ greatly from the first Fbank features, the deviation value would be too large, and the first voice audio would finally be re-determined, according to the deviation value, to be non-wake-up-word audio. If too little wake-up word audio is used for training, the similar features learned by the differential autoencoder neural network model are not necessarily accurate, and the result of determining again, according to the deviation value, whether the first voice audio is wake-up word audio will be wrong.
In step S204, sequentially performing the encoding operation and the decoding operation on each first Fbank feature through the differential autoencoder neural network model to obtain the second Fbank feature corresponding to each first Fbank feature includes: encoding the received H_{K-1} and X_K through each GRU unit in the encoder of the differential autoencoder neural network model to obtain the feature code corresponding to each GRU unit in the encoder, wherein, when K equals 1, H_0 is a zero sequence, and when K does not equal 1, H_{K-1} is the result of the (K-1)-th GRU unit encoding H_{K-2} and X_{K-1}; and decoding the feature codes received by each GRU unit in the decoder of the differential autoencoder neural network model to obtain the second Fbank feature corresponding to each GRU unit in the decoder.
It should be noted that the differential autoencoder neural network model includes an encoder and a decoder, each of which contains a plurality of GRU units. Each GRU unit in the encoder encodes the H_{K-1} and X_K it receives. The first GRU unit of the encoder receives H_0 and X_1, where H_0 is a zero sequence and X_1 is the first of the plurality of first Fbank features; it encodes H_0 and X_1 and outputs H_1 to the second GRU unit of the encoder. The (K-1)-th GRU unit of the encoder receives H_{K-2} and X_{K-1}, where X_{K-1} is the (K-1)-th first Fbank feature; it encodes H_{K-2} and X_{K-1} and outputs H_{K-1} to the K-th GRU unit of the encoder. The encoder thus finally produces a plurality of feature codes, and each GRU unit in the decoder of the differential autoencoder neural network model decodes the feature codes it receives, yielding a plurality of second Fbank features, where the plurality of first Fbank features correspond one-to-one to the plurality of second Fbank features (for example, the first first Fbank feature corresponds to the first second Fbank feature).
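To make the encode-decode recurrence concrete, here is a minimal PyTorch sketch of a GRU encoder-decoder trained to reconstruct Fbank frames with a squared-error objective. The layer sizes, the use of nn.GRU, and the training-loop details are illustrative assumptions; the patent fixes only the overall structure of GRU units in an encoder and a decoder and the training on noise-free wake-up word audio.

```python
import torch
import torch.nn as nn

class GRUAutoencoder(nn.Module):
    """Encoder GRU folds X_1..X_T into hidden states H_1..H_T (the feature codes);
    the decoder GRU unrolls those codes into reconstructed Fbank frames Y_1..Y_T."""

    def __init__(self, n_mels=40, hidden=128):
        super().__init__()
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.project = nn.Linear(hidden, n_mels)  # hidden state -> Fbank frame

    def forward(self, x):               # x: (batch, T, n_mels); H_0 defaults to zeros
        codes, _ = self.encoder(x)      # feature codes H_1..H_T
        decoded, _ = self.decoder(codes)
        return self.project(decoded)    # second Fbank features Y_1..Y_T

def train_on_clean_wake_words(model, loader, epochs=10):
    """Deep-learning training on noise-free wake-up word Fbank features only."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.MSELoss()  # reconstruction (squared-error) objective
    for _ in range(epochs):
        for batch in loader:  # batch: (batch, T, n_mels) tensor of clean wake words
            optimizer.zero_grad()
            loss = criterion(model(batch), batch)
            loss.backward()
            optimizer.step()
    return model
```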
In step S206, determining again whether the first voice audio is wake-up word audio according to the deviation value includes: determining that the first voice audio is not wake-up word audio when the deviation value is greater than a preset threshold; and determining that the first voice audio is wake-up word audio when the deviation value is less than or equal to the preset threshold.
It should be noted that if the deviation value is greater than the preset threshold, the first voice audio is determined not to be wake-up word audio; if the deviation value is less than or equal to the preset threshold, the first voice audio is determined to be wake-up word audio, where the preset threshold may be set according to the specific situation. Through this technical means, the problem in the related art that the false wake-up rate of an intelligent device must be high in order to guarantee its voice wake-up rate is solved, so that the false wake-up rate is reduced while the wake-up rate remains high, and the user's experience of interacting with the intelligent device is improved.
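A sketch of this threshold decision, reusing the model and the deviation_value helper introduced above; the threshold value here is an illustrative assumption and would in practice be tuned to the specific situation:

```python
import torch

def is_wake_word(model, first_fbank, threshold=5.0):
    """Secondary check: reconstruct the features and compare the deviation value
    against the preset threshold (<= threshold means wake-up word audio)."""
    x = torch.tensor(first_fbank, dtype=torch.float32).unsqueeze(0)  # (1, T, n_mels)
    with torch.no_grad():
        second_fbank = model(x).squeeze(0).numpy()
    return deviation_value(first_fbank, second_fbank) <= threshold
```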
In order to better understand the above technical solution, the secondary detection process for voice audio is explained below with reference to structural block diagrams.
Fig. 3 is a block diagram of the voice wake-up secondary detection model according to an embodiment of the present invention. As shown in fig. 3:
the two-time detection model comprises: the device comprises an end side voice awakening module and an awakening secondary detection module, wherein the end side voice awakening module is the voice awakening module.
The end-side voice wake-up module receives the audio stream (i.e., the first voice audio), determines the audio stream to be suspected wake-up word audio when the wake-up word probability of the audio stream is greater than the preset threshold, and sends the suspected wake-up word audio to the wake-up secondary detection module.
The wake-up secondary detection module comprises the differential autoencoder neural network model. It processes the audio stream to obtain the first Fbank features and inputs the first Fbank features into the differential autoencoder neural network model to obtain the second Fbank features. Finally, whether the first voice audio is wake-up word audio is determined again according to the deviation values between the plurality of first Fbank features and the plurality of second Fbank features, and the wake-up detection result is output.
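Chaining the illustrative helpers from the earlier sketches, this secondary detection flow reads as follows (the function names are the hypothetical ones introduced above, not names from the patent):

```python
def secondary_detection(model, suspected_audio):
    """Audio stream flagged by the wake-up module -> Fbank features -> decision."""
    frames = frame_preemphasize_window(suspected_audio)
    first_fbank = fbank_features(frames)
    return is_wake_word(model, first_fbank)
```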
Fig. 4 is a flowchart of the voice audio feature extraction method according to an embodiment of the present invention. As shown in fig. 4:
S402: perform the framing operation on the first voice audio: divide the voice audio into fixed-length segments, taking 25 ms as one frame with a 10 ms frame shift;
S404: perform the pre-emphasis operation on the obtained multi-frame audio according to s'(n) = s(n) - a·s(n-1), where s(n) is the voice audio at time n, s(n-1) is the voice audio at time n-1, s'(n) is the high-frequency voice part obtained after the pre-emphasis operation, and a is the pre-emphasis coefficient; the multi-frame audio includes the voice audio at time n and the voice audio at time n-1;
S406: perform the windowing operation on the high-frequency voice part of the pre-emphasized audio; windowing mitigates the spectral leakage introduced by the framing stage;
S408: perform the Fourier transform operation on the windowed audio to convert it from the time domain to the frequency domain; combining the time-domain signal of the windowed audio with its frequency-domain signal yields the time-frequency spectrum of the windowed audio, called the Fourier transform result;
S410: perform the Mel filtering operation on the Fourier transform result; Mel filtering re-divides the signal frequency bands according to the sensitivity of the human ear to sounds in different bands, a bank of Mel filters is constructed according to this division, and the Fourier transform result of the signal is converted into the Mel frequency domain, giving the Mel-frequency spectrogram of the signal;
S412: take the logarithm of the energy of the Mel-frequency spectrogram and convert it to sound pressure level, giving the Fbank spectrogram, where the Fbank spectrogram comprises the plurality of first Fbank features.
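For concreteness, at an assumed 16 kHz sample rate (the patent does not specify one), the S402 parameters translate into sample counts as follows:

```python
sample_rate = 16000                       # assumed; the patent does not fix a rate
frame_len = sample_rate * 25 // 1000      # 25 ms frame -> 400 samples
frame_shift = sample_rate * 10 // 1000    # 10 ms shift -> 160 samples
frames_per_second = 1 + (sample_rate - frame_len) // frame_shift  # 98 frames
```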
Fig. 5 is a block diagram of the differential autoencoder neural network model according to an embodiment of the present invention. As shown in fig. 5:
the differential self-encoder neural network model includes an encoder and a decoder, where both the encoder and the decoder include a plurality of GRU units. A plurality of GRU units of an encoder respectively receive H from the plurality of GRU units of the encoderK-1And XKAnd (6) encoding is carried out. It should be noted that, before the differential self-encoder neural network model is used to sequentially perform encoding operation and decoding operation on the first Fbank feature, the differential self-encoder neural network model needs to be model-trained by using a plurality of noiseless wake-up word audios, and when the differential self-encoder neural network model learns the similar feature, the similar feature is respectively stored in a plurality of GRU units.
According to the present invention, a first voice audio sent by a voice wake-up module is received, and a plurality of first Fbank features of the first voice audio are acquired, wherein the first voice audio has been detected by the voice wake-up module and its wake-up word probability is greater than a preset threshold; an encoding operation and a decoding operation are sequentially performed on each first Fbank feature through a differential autoencoder neural network model to obtain a second Fbank feature corresponding to each first Fbank feature, wherein the differential autoencoder neural network model has learned the similar features of all wake-up word audio during model training; and a deviation value between each first Fbank feature and its corresponding second Fbank feature is determined, and whether the first voice audio is wake-up word audio is determined again according to the deviation value. That is, a plurality of first Fbank features of the first voice audio are acquired, each first Fbank feature is sequentially encoded and decoded through the differential autoencoder neural network model to obtain its corresponding second Fbank feature, and whether the first voice audio is wake-up word audio can be determined from the deviation value between each first Fbank feature and its corresponding second Fbank feature. This technical solution solves the problem in the related art that, in order to guarantee the voice wake-up rate of an intelligent device, the false wake-up rate of the device is high; the false wake-up rate is thus reduced while the wake-up rate remains high, and the user's experience of interacting with the intelligent device is improved.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although the former is the better implementation in many cases. Based on this understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disc) and includes instructions for causing a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the methods of the embodiments of the present invention.
In this embodiment, a voice audio detection apparatus is further provided. The apparatus is used to implement the above embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, an implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
Fig. 6 is a block diagram of the voice audio detection apparatus according to an embodiment of the present invention; as shown in fig. 6, the apparatus includes:
a first obtaining module 60, configured to receive a first voice audio sent by a voice wake-up module and acquire a plurality of first Fbank features of the first voice audio, wherein the first voice audio has been detected by the voice wake-up module and its wake-up word probability is greater than a preset threshold;
a second obtaining module 62, configured to sequentially perform an encoding operation and a decoding operation on each first Fbank feature through a differential autoencoder neural network model to obtain a second Fbank feature corresponding to each first Fbank feature, wherein the differential autoencoder neural network model has learned the similar features of all wake-up word audio during model training;
a determining module 64, configured to determine a deviation value between each first Fbank feature and its corresponding second Fbank feature, and determine again, according to the deviation value, whether the first voice audio is wake-up word audio.
According to the present invention, a first voice audio sent by a voice wake-up module is received, and a plurality of first Fbank features of the first voice audio are acquired, wherein the first voice audio has been detected by the voice wake-up module and its wake-up word probability is greater than a preset threshold; an encoding operation and a decoding operation are sequentially performed on each first Fbank feature through a differential autoencoder neural network model to obtain a second Fbank feature corresponding to each first Fbank feature, wherein the differential autoencoder neural network model has learned the similar features of all wake-up word audio during model training; and a deviation value between each first Fbank feature and its corresponding second Fbank feature is determined, and whether the first voice audio is wake-up word audio is determined again according to the deviation value. That is, a plurality of first Fbank features of the first voice audio are acquired, each first Fbank feature is sequentially encoded and decoded through the differential autoencoder neural network model to obtain its corresponding second Fbank feature, and whether the first voice audio is wake-up word audio can be determined from the deviation value between each first Fbank feature and its corresponding second Fbank feature. This technical solution solves the problem in the related art that, in order to guarantee the voice wake-up rate of an intelligent device, the false wake-up rate of the device is high; the false wake-up rate is thus reduced while the wake-up rate remains high, and the user's experience of interacting with the intelligent device is improved.
Optionally, the first obtaining module 60 is further configured to perform a framing operation on the first voice audio to obtain multi-frame audio; perform a pre-emphasis operation on the multi-frame audio to obtain pre-emphasized audio; perform a windowing operation on the high-frequency voice part of the pre-emphasized audio to obtain windowed audio; and extract the plurality of first Fbank features of the first voice audio from the windowed audio.
It should be noted that the framing operation divides the first voice audio into a plurality of sequences according to a first preset length, where one sequence is one frame and a second preset length is the frame shift; the first preset length and the second preset length may be set according to the specific situation. The pre-emphasis operation is performed on the obtained multi-frame audio according to s'(n) = s(n) - a·s(n-1), where s(n) is the voice audio at time n, s(n-1) is the voice audio at time n-1, s'(n) is the high-frequency voice part obtained after the pre-emphasis operation, and a is the pre-emphasis coefficient; the multi-frame audio includes the voice audio at time n and the voice audio at time n-1. The pre-emphasis coefficient may be set according to the specific situation. A windowing operation is then performed on the high-frequency voice part of the pre-emphasized audio; windowing mitigates the spectral leakage introduced by the framing stage. From the windowed audio, the plurality of first Fbank features of the first voice audio can be extracted.
Optionally, the first obtaining module 60 is further configured to perform a Fourier transform operation on the windowed audio to obtain a Fourier transform result; perform a Mel filtering operation on the Fourier transform result to obtain filtered audio; and perform a logarithm operation on the filtered audio to obtain the plurality of first Fbank features.
It should be noted that the windowed audio requires the following further processing to extract the plurality of first Fbank features of the first voice audio. A Fourier transform operation is performed on the windowed audio to convert it from the time domain to the frequency domain; combining the time-domain signal of the windowed audio with its frequency-domain signal yields the time-frequency spectrum of the windowed audio, called the Fourier transform result. A Mel filtering operation is then performed on the Fourier transform result; Mel filtering re-divides the signal frequency bands according to the sensitivity of the human ear to sounds in different bands. A bank of Mel filters is constructed according to this division of the frequency bands, and the Fourier transform result of the signal is converted into the Mel frequency domain, giving what is called the Mel-frequency spectrogram of the signal. The Mel filters are connected in parallel, each filtering one frequency band. A logarithm operation is finally applied to the filtered audio: because the dynamic range of a sound signal is extremely large, the energy of the Mel-frequency spectrogram is logarithmized and converted to sound pressure level in order to compress the dynamic range of the signal and highlight its detail, yielding the Fbank spectrogram. It should be noted that the Fbank spectrogram comprises the plurality of first Fbank features. Through this technical means, a plurality of first Fbank features can be acquired from a segment of voice audio.
Optionally, because the gain from acoustic signal to electrical signal differs across devices, the Fbank spectrogram may be normalized, that is, the values of the first Fbank features may be mapped into the interval from zero to one, to prevent device differences from degrading the performance of the subsequent differential autoencoder neural network model.
Optionally, the determining module 64 is further configured to determine the deviation value A according to the following formula: A = Σ_K (Y_K - X_K)^2, where Y_K is the K-th second Fbank feature, X_K is the K-th first Fbank feature, and K is a positive integer.
It should be noted that A = Σ_K (Y_K - X_K)^2 can be used to calculate the deviation value between each first Fbank feature and its corresponding second Fbank feature, where Y_K is the K-th second Fbank feature and X_K is the K-th first Fbank feature. Since K is an integer greater than or equal to one, a plurality of X_K and Y_K are obtained, namely each first Fbank feature and its corresponding second Fbank feature. Whether the first voice audio is wake-up word audio is then determined again according to the deviation value. Alternatively, the deviation value between each first Fbank feature and its corresponding second Fbank feature may be calculated from the variance and the mean square error of X_K and Y_K; if the deviation value is calculated from the variance and the mean square error, the preset threshold only needs to be adjusted correspondingly, and whether the first voice audio is wake-up word audio can still be determined again according to the deviation value. It should be noted that there are a plurality of first Fbank features, so a plurality of second Fbank features are finally obtained, with a one-to-one correspondence between the first Fbank features and the second Fbank features.
Optionally, the second obtaining module 62 is further configured to perform model training on the differential self-encoder neural network model by using a plurality of noiseless wake-up word audios; and under the condition that the similar features are learned by the differential self-encoder neural network model, respectively saving the similar features into a plurality of GRU units, wherein the differential self-encoder neural network model comprises a plurality of GRU units.
The differential self-encoder neural network model is trained with a plurality of noiseless wake-up word audios, where the model training is training in the deep-learning manner. The model is trained on a large number of noiseless wake-up word audios so that it learns the features shared by all wake-up word audios, and these similar features are saved into each GRU (Gated Recurrent Unit) in the differential self-encoder neural network model. It should be noted that the wake-up word audio used to train the differential self-encoder neural network model must satisfy two conditions: it is noiseless, and it is plentiful. The noiseless condition ensures that the model learns the similar features of the wake-up word audio itself; if the model instead learned the similar features of noise (any audio that is not wake-up word audio is treated as noise), then the encoding and decoding operations performed in turn on the first Fbank features would yield second Fbank features containing noise features (the first Fbank features and the second Fbank features are in one-to-one correspondence). In that case the second Fbank features would differ greatly from the first Fbank features, the deviation value would be too large, and the first voice audio would finally be re-determined, according to the deviation value, to be non-wake-up-word audio. Likewise, if the number of wake-up word audios used for training is small, the similar features learned by the differential self-encoder neural network model are not necessarily accurate, and the re-determination, according to the deviation value, of whether the first voice audio is the wake-up word audio may be wrong.
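A hedged training sketch of this procedure follows (PyTorch). The model class, layer sizes, optimizer, and loss are illustrative assumptions; the embodiment only fixes that the model is trained to reconstruct the Fbank features of a large number of noiseless wake-up word audios.

```python
# Training sketch: fit a GRU-based autoencoder to reconstruct Fbank features
# of clean wake-up word audio only. Dimensions and optimizer are assumptions.
import torch
import torch.nn as nn

class FbankAutoencoder(nn.Module):
    def __init__(self, n_mels=40, hidden=128):
        super().__init__()
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, n_mels)

    def forward(self, x):                   # x: (batch, frames, n_mels)
        codes, _ = self.encoder(x)          # one feature code per frame
        out, _ = self.decoder(codes)
        return self.proj(out)               # reconstructed (second) Fbank features

def train(model, loader, epochs=10):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    mse = nn.MSELoss()
    for _ in range(epochs):
        for clean_fbank in loader:          # noiseless wake-word features only
            recon = model(clean_fbank)
            loss = mse(recon, clean_fbank)  # learn the shared wake-word structure
            opt.zero_grad()
            loss.backward()
            opt.step()
```

Because training sees only clean wake-up word audio, the reconstruction error stays small for wake-up words and grows for anything else, which is exactly the signal the deviation value exploits.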
Optionally, the second obtaining module 62 is further configured to encode, through each GRU unit in the encoder in the differential self-encoder neural network model, the received H_{K-1} and X_K to obtain the feature code corresponding to each GRU unit in the encoder, wherein when K = 1, H_0 is a zero sequence, and when K ≠ 1, H_{K-1} is the result of the (K-1)-th GRU unit encoding H_{K-2} and X_{K-1}; and to decode, through each GRU unit in the decoder in the differential self-encoder neural network model, the received feature codes to obtain the second Fbank feature corresponding to each GRU unit in the decoder.
It should be noted that the differential self-encoder neural network model includes an encoder and a decoder, each of which contains a plurality of GRU units. Each GRU unit in the encoder encodes the H_{K-1} and X_K it receives. The first GRU unit of the encoder receives H_0 and X_1, where H_0 is a zero sequence and X_1 is the first of the plurality of first Fbank features; it encodes H_0 and X_1 and outputs H_1 to the second GRU unit of the encoder. In general, the (K-1)-th GRU unit of the encoder receives H_{K-2} and X_{K-1}, where X_{K-1} is the (K-1)-th first Fbank feature; it encodes H_{K-2} and X_{K-1} and outputs H_{K-1} to the K-th GRU unit of the encoder. The encoder finally produces a plurality of feature codes, and each GRU unit in the decoder of the differential self-encoder neural network model decodes the feature code it receives, so that a plurality of second Fbank features are obtained, where the plurality of first Fbank features correspond to the plurality of second Fbank features one-to-one (for example, the first first Fbank feature corresponds to the first second Fbank feature).
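The per-unit recurrence described above can be made concrete as follows; this is a sketch using PyTorch GRU cells, in which the hidden size and the decoder's zero initial state are assumptions.

```python
# Step-by-step view of the encoder recurrence: H_0 is a zero sequence, and
# the K-th unit encodes (H_{K-1}, X_K) into H_K. Layer sizes are assumptions.
import torch
import torch.nn as nn

n_mels, hidden = 40, 128
enc_cell = nn.GRUCell(n_mels, hidden)
dec_cell = nn.GRUCell(hidden, hidden)
proj = nn.Linear(hidden, n_mels)

def encode_decode(first_fbanks):            # (frames, n_mels), one row per X_K
    h = torch.zeros(1, hidden)              # H_0: zero sequence
    codes = []
    for x_k in first_fbanks:                # X_1, X_2, ..., in frame order
        h = enc_cell(x_k.unsqueeze(0), h)   # H_K from (H_{K-1}, X_K)
        codes.append(h)                     # feature code of the K-th unit
    d = torch.zeros(1, hidden)
    second = []
    for c in codes:                         # decode each feature code in turn
        d = dec_cell(c, d)
        second.append(proj(d))              # K-th second Fbank feature
    return torch.cat(second)                # one-to-one with the inputs
```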
Optionally, the determining module 64 is further configured to determine that the first voice audio is not the wake-up word audio when the deviation value is greater than a preset threshold, and to determine that the first voice audio is the wake-up word audio when the deviation value is less than or equal to the preset threshold.
It should be noted that if the deviation value is greater than the preset threshold, it is determined that the first voice audio is not the wake-up word audio; if the deviation value is less than or equal to the preset threshold, it is determined that the first voice audio is the wake-up word audio, where the preset threshold can be set according to the specific situation. Through the above technical means, the problem in the related art that the false wake-up rate of an intelligent device is high when its voice wake-up rate must be guaranteed is solved, so that the false wake-up rate is reduced while the wake-up rate remains high, and the user's experience of interacting with the intelligent device is improved.
It should be noted that the above modules may be implemented by software or by hardware; in the latter case they may be implemented, by way of example and not limitation, as follows: the modules are all located in the same processor, or the modules are located in different processors in any combination.
Embodiments of the present invention also provide a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
Alternatively, in the present embodiment, the storage medium may be configured to store a computer program for executing the steps of:
S1, receiving a first voice audio sent by a voice awakening module, and acquiring a plurality of first Fbank characteristics of the first voice audio, wherein the first voice audio is detected by the voice awakening module, and the awakening word probability of the first voice audio is greater than a preset threshold value;
S2, sequentially carrying out an encoding operation and a decoding operation on each first Fbank feature through a differential self-encoder neural network model to obtain a second Fbank feature corresponding to each first Fbank feature, wherein the differential self-encoder neural network model has obtained similar features of all awakening word audios in the model training process;
and S3, determining a deviation value between each first Fbank feature and the second Fbank feature corresponding to it according to each first Fbank feature and its corresponding second Fbank feature, and determining again, according to the deviation value, whether the first voice audio is the awakening word audio (an end-to-end sketch of steps S1 to S3 is given below).
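Composed together, steps S1 to S3 could look like the following sketch, which reuses the hypothetical helpers sketched earlier (fbank, normalize_fbank, deviation_value) and a trained autoencoder; all of these names, and the threshold, are illustrative assumptions.

```python
# Hedged end-to-end sketch of S1-S3, composing the helpers sketched above.
import torch

def recheck_wake_word(audio, model, threshold):
    feats = normalize_fbank(fbank(audio))               # S1: first Fbank features
    x = torch.tensor(feats, dtype=torch.float32)[None]  # (1, frames, n_mels)
    with torch.no_grad():
        y = model(x)[0].numpy()                         # S2: second Fbank features
    return deviation_value(feats, y) <= threshold       # S3: re-determination
```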
Optionally, in this embodiment, the storage medium may include, but is not limited to: various media capable of storing computer programs, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
Optionally, in this embodiment, the processor may be configured to execute, by means of a computer program, the following steps:
S1, receiving a first voice audio sent by a voice awakening module, and acquiring a plurality of first Fbank characteristics of the first voice audio, wherein the first voice audio is detected by the voice awakening module, and the awakening word probability of the first voice audio is greater than a preset threshold value;
S2, sequentially carrying out an encoding operation and a decoding operation on each first Fbank feature through a differential self-encoder neural network model to obtain a second Fbank feature corresponding to each first Fbank feature, wherein the differential self-encoder neural network model has obtained similar features of all awakening word audios in the model training process;
and S3, determining a deviation value between each first Fbank feature and the second Fbank feature corresponding to it according to each first Fbank feature and its corresponding second Fbank feature, and determining again, according to the deviation value, whether the first voice audio is the awakening word audio.
Optionally, for specific examples in this embodiment, reference may be made to the examples described in the foregoing embodiments and optional implementations, and details are not repeated here.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general-purpose computing device; they may be centralized on a single computing device or distributed across a network of multiple computing devices. Optionally, they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by a computing device, and in some cases the steps shown or described may be performed in an order different from that described herein. Alternatively, they may be separately fabricated into individual integrated circuit modules, or multiple of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for detecting speech audio, comprising:
receiving a first voice audio sent by a voice awakening module, and acquiring a plurality of first Fbank characteristics of the first voice audio, wherein the first voice audio is detected by the voice awakening module, and the awakening word probability of the first voice audio is greater than a preset threshold value;
sequentially performing encoding operation and decoding operation on each first Fbank feature through a differential self-encoder neural network model to obtain a second Fbank feature corresponding to each first Fbank feature, wherein the differential self-encoder neural network model already obtains similar features of all awakening word audios in the model training process;
and determining a deviation value between each first Fbank characteristic and a second Fbank characteristic corresponding to each first Fbank characteristic according to each first Fbank characteristic and the second Fbank characteristic corresponding to each first Fbank characteristic, and determining whether the first voice audio is the wakeup word audio again according to the deviation value.
2. The method of claim 1, wherein obtaining a plurality of first Fbank features of the first speech audio comprises:
performing framing operation on the first voice audio to obtain multi-frame audio;
performing pre-enhancement operation on the multi-frame audio to obtain pre-enhanced audio;
windowing the high-frequency voice part of the pre-enhanced audio to obtain a windowed audio;
extracting a plurality of first Fbank features of the first voice audio from the windowed audio.
3. The method of claim 2, wherein extracting a plurality of first Fbank features of the first speech audio from the windowed audio comprises:
carrying out Fourier transform operation on the windowed audio to obtain a Fourier transform result;
performing Mel filtering operation on the Fourier transform result to obtain filtered audio;
and carrying out logarithmic operation processing on the filtered audio to obtain the plurality of first Fbank characteristics.
4. The method of claim 1, wherein determining a deviation value between each first Fbank feature and the second Fbank feature corresponding to each first Fbank feature according to each first Fbank feature and the second Fbank feature corresponding to each first Fbank feature comprises:
determining the deviation value A according to the following formula:
A = Σ_K (Y_K - X_K)^2,
wherein Y_K is the second Fbank feature, X_K is the first Fbank feature, and K is a positive integer.
5. The method according to claim 1, wherein before the encoding operation and the decoding operation are performed in turn on each first Fbank feature through the differential self-encoder neural network model to obtain the second Fbank feature corresponding to each first Fbank feature, the method further comprises:
performing model training on the differential self-encoder neural network model by using a plurality of noiseless wake-up word audios;
and under the condition that the similar features are learned by the differential self-encoder neural network model, respectively saving the similar features into a plurality of GRU units, wherein the differential self-encoder neural network model comprises a plurality of GRU units.
6. The method of claim 5, wherein the sequentially performing an encoding operation and a decoding operation on each first Fbank feature through a differential self-encoder neural network model to obtain a second Fbank feature corresponding to each first Fbank feature comprises:
encoding, through each GRU unit in an encoder in the differential self-encoder neural network model, the received H_{K-1} and X_K to obtain the feature code corresponding to each GRU unit in the encoder, wherein when K = 1, H_0 is a zero sequence, and when K ≠ 1, H_{K-1} is the result of the (K-1)-th GRU unit encoding H_{K-2} and X_{K-1};
and decoding the received characteristic codes through each GRU unit in a decoder in the neural network model of the differential self-encoder to obtain a second Fbank characteristic corresponding to each GRU unit in the decoder.
7. The method of claim 1, wherein re-determining whether the first speech audio is a wake word audio according to the deviation value comprises:
determining that the first voice audio is not the awakening word audio under the condition that the deviation value is greater than a preset threshold;
and determining the first voice audio as the awakening word audio under the condition that the deviation value is smaller than or equal to the preset threshold.
8. An apparatus for detecting speech audio, comprising:
the first acquisition module is used for receiving a first voice audio sent by the voice wake-up module and acquiring a plurality of first Fbank characteristics of the first voice audio, wherein the first voice audio is detected by the voice wake-up module, and the wake-up word probability of the first voice audio is greater than a preset threshold;
the second obtaining module is used for sequentially performing encoding operation and decoding operation on each first Fbank feature through a differential self-encoder neural network model to obtain a second Fbank feature corresponding to each first Fbank feature, wherein the differential self-encoder neural network model obtains similar features of all awakening word audios in a model training process;
and the determining module is used for determining a deviation value between each first Fbank characteristic and the second Fbank characteristic corresponding to each first Fbank characteristic according to each first Fbank characteristic and the second Fbank characteristic corresponding to each first Fbank characteristic, and determining whether the first voice audio is the wakeup word audio again according to the deviation value.
9. A computer-readable storage medium, comprising a stored program, wherein the program is operable to perform the method of any one of claims 1 to 7.
10. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method of any of claims 1 to 7 by means of the computer program.
CN202110130677.7A 2021-01-29 2021-01-29 Voice audio detection method and device, storage medium and electronic device Active CN112992189B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110130677.7A CN112992189B (en) 2021-01-29 2021-01-29 Voice audio detection method and device, storage medium and electronic device

Publications (2)

Publication Number Publication Date
CN112992189A CN112992189A (en) 2021-06-18
CN112992189B true CN112992189B (en) 2022-05-03

Family

ID=76345898

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110130677.7A Active CN112992189B (en) 2021-01-29 2021-01-29 Voice audio detection method and device, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN112992189B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114360526B (en) * 2022-03-16 2022-06-17 杭州研极微电子有限公司 Audio detection device, method, apparatus and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109036412A (en) * 2018-09-17 2018-12-18 苏州奇梦者网络科技有限公司 voice awakening method and system
WO2019126880A1 (en) * 2017-12-29 2019-07-04 Fluent.Ai Inc. A low-power keyword spotting system
CN110767231A (en) * 2019-09-19 2020-02-07 平安科技(深圳)有限公司 Voice control equipment awakening word identification method and device based on time delay neural network
CN110890093A (en) * 2019-11-22 2020-03-17 腾讯科技(深圳)有限公司 Intelligent device awakening method and device based on artificial intelligence
CN111933110A (en) * 2020-08-12 2020-11-13 北京字节跳动网络技术有限公司 Video generation method, generation model training method, device, medium and equipment
CN111933112A (en) * 2020-09-21 2020-11-13 北京声智科技有限公司 Awakening voice determination method, device, equipment and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Context-Aware Misinformation Detection: A Benchmark of Deep Learning Architectures Using Word Embeddings; Vlad-Iulian Ilie et al.; IEEE Access; 2021-12-03; pp. 162122-162124 *

Also Published As

Publication number Publication date
CN112992189A (en) 2021-06-18

Similar Documents

Publication Publication Date Title
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
CN108899044B (en) Voice signal processing method and device
CN110706693B (en) Method and device for determining voice endpoint, storage medium and electronic device
CN102549659A (en) Suppressing noise in an audio signal
CN104103278A (en) Real time voice denoising method and device
JP7486266B2 (en) Method and apparatus for determining a depth filter - Patents.com
JP4551215B2 (en) How to perform auditory intelligibility analysis of speech
CN109493883A (en) A kind of audio time-delay calculation method and apparatus of smart machine and its smart machine
CN112992189B (en) Voice audio detection method and device, storage medium and electronic device
WO2021007841A1 (en) Noise estimation method, noise estimation apparatus, speech processing chip and electronic device
CN110191397B (en) Noise reduction method and Bluetooth headset
CN115348507A (en) Impulse noise suppression method, system, readable storage medium and computer equipment
CN116612778B (en) Echo and noise suppression method, related device and medium
CN110797008B (en) Far-field voice recognition method, voice recognition model training method and server
CN106910494B (en) Audio identification method and device
CN108053834A (en) audio data processing method, device, terminal and system
CN115954013A (en) Voice processing method, device, equipment and storage medium
CN116110418A (en) Audio noise reduction method and device, storage medium and electronic device
CN115881142A (en) Training method and device for bone conduction speech coding model and storage medium
CN110189763B (en) Sound wave configuration method and device and terminal equipment
CN114937449A (en) Voice keyword recognition method and system
CN112927705A (en) Frequency response calibration method and related product
CN116758934B (en) Method, system and medium for realizing intercom function of intelligent wearable device
CN113763976A (en) Method and device for reducing noise of audio signal, readable medium and electronic equipment
CN112201229B (en) Voice processing method, device and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant