CN108922513B - Voice distinguishing method and device, computer equipment and storage medium - Google Patents


Info

Publication number: CN108922513B
Application number: CN201810561788.1A
Authority: CN (China)
Prior art keywords: voice data, distinguished, asr, target, voice
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN108922513A (en)
Inventor: 涂宏
Current Assignee: Ping An Technology Shenzhen Co Ltd
Original Assignee: Ping An Technology Shenzhen Co Ltd
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to: CN201810561788.1A, PCT/CN2018/094190 (WO2019232846A1)
Publication of CN108922513A; application granted; publication of CN108922513B

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/16: Speech classification or search using artificial neural networks
    • G10L25/18: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L25/21: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being power information
    • G10L25/24: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L25/84: Detection of presence or absence of voice signals for discriminating voice from noise

Abstract

The invention discloses a voice distinguishing method and device, computer equipment and a storage medium. The voice distinguishing method comprises the following steps: processing original voice data to be distinguished based on a voice activity detection algorithm to obtain target voice data to be distinguished; acquiring corresponding ASR voice features based on the target voice data to be distinguished; and inputting the ASR voice features into a pre-trained ASR-RNN model for distinguishing to obtain a target distinguishing result. The voice distinguishing method can distinguish target voice from interfering voice well, and can still distinguish voice accurately even when the noise interference in the voice data is very strong.

Description

Voice distinguishing method and device, computer equipment and storage medium
Technical Field
The present invention relates to the field of speech processing, and in particular, to a method and an apparatus for speech discrimination, a computer device, and a storage medium.
Background
Voice distinguishing refers to screening silence and other non-speech content out of the input voice so that only the voice segments that are meaningful for recognition (i.e., the target voice) are retained. Existing voice distinguishing methods have significant drawbacks, especially in the presence of noise: as the noise increases, distinguishing the voice becomes more difficult, the target voice and the interfering voice cannot be distinguished accurately, and the voice distinguishing effect is not ideal.
Disclosure of Invention
The embodiment of the invention provides a voice distinguishing method, a voice distinguishing device, computer equipment and a storage medium, and aims to solve the problem that the voice distinguishing effect is not ideal.
The embodiment of the invention provides a voice distinguishing method, which comprises the following steps:
processing original voice data to be distinguished based on a voice activity detection algorithm to obtain target voice data to be distinguished;
acquiring corresponding ASR voice characteristics based on the target voice data to be distinguished;
and inputting the ASR speech features into a pre-trained ASR-RNN model for distinguishing, and obtaining a target distinguishing result.
An embodiment of the present invention provides a voice distinguishing device, including:
the target voice data to be distinguished acquisition module is used for processing the original voice data to be distinguished based on a voice activity detection algorithm to acquire the target voice data to be distinguished;
the voice feature acquisition module is used for acquiring corresponding ASR voice features based on the target voice data to be distinguished;
and the target distinguishing result acquisition module is used for inputting the ASR speech features into a pre-trained ASR-RNN model for distinguishing and acquiring a target distinguishing result.
An embodiment of the present invention provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the steps of the voice distinguishing method when executing the computer program.
An embodiment of the present invention provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the steps of the voice distinguishing method.
In the voice distinguishing method, the voice distinguishing device, the computer equipment and the storage medium provided by the embodiment of the invention, the original voice data to be distinguished is processed based on the voice activity detection algorithm to obtain the target voice data to be distinguished, the original voice data to be distinguished is firstly distinguished once through the voice activity detection algorithm to obtain the target voice data to be distinguished with a smaller range, and non-voice can be removed primarily and effectively. And then, acquiring corresponding ASR speech characteristics based on the target speech data to be distinguished, and providing a technical basis for performing corresponding ASR-RNN model recognition according to the ASR speech characteristics. And finally, inputting the ASR voice characteristics into a pre-trained ASR-RNN model for distinguishing to obtain a target distinguishing result, wherein the ASR-RNN model is a recognition model which is specially trained according to the ASR voice characteristics and the characteristics of the voice in time sequence and is used for accurately distinguishing the voice, the target voice and the interference voice can be correctly distinguished from the voice data to be distinguished of the target, and the accuracy of voice distinguishing is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is a diagram of an application environment of a speech discrimination method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a speech discrimination method according to an embodiment of the present invention;
FIG. 3 is a detailed flowchart of step S10 in FIG. 2;
FIG. 4 is a detailed flowchart of step S20 in FIG. 2;
FIG. 5 is a detailed flowchart of step S21 in FIG. 4;
FIG. 6 is a detailed flowchart of step S24 in FIG. 4;
FIG. 7 is a detailed flowchart of FIG. 2 before step S30;
FIG. 8 is a diagram of a voice differentiating apparatus according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of a computer device according to an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 illustrates an application environment of the voice distinguishing method provided by an embodiment of the present invention. The application environment of the voice distinguishing method comprises a server and a client, where the server and the client are connected through a network. The client is a device capable of human-computer interaction with a user, including but not limited to computers, smart phones, tablets and the like. The server may be implemented as an independent server or as a server cluster composed of a plurality of servers. The voice distinguishing method provided by the embodiment of the present invention is applied to the server.
As shown in fig. 2, fig. 2 shows a flowchart of a voice distinguishing method in the embodiment, the voice distinguishing method includes the following steps:
s10: and processing the original voice data to be distinguished based on a voice activity detection algorithm to obtain target voice data to be distinguished.
Voice Activity Detection (VAD) aims to identify and eliminate a long silent period from a Voice signal stream, so as to save Voice channel resources without reducing service quality, save precious bandwidth resources, reduce end-to-end delay, and improve user experience. The voice activity detection algorithm (VAD algorithm), i.e. the algorithm specifically used in voice activity detection, may be of various kinds. It will be appreciated that VAD can be applied to speech discrimination, being able to discriminate between target speech and interfering speech. The target speech is a speech portion in which a voiceprint is continuously changed and is obvious in the speech data, and the interfering speech may be a speech portion in which no pronunciation is caused due to silence in the speech data, or may be ambient noise. The original voice data to be distinguished is the most originally acquired voice data to be distinguished, and the original voice data to be distinguished refers to voice data to be preliminarily distinguished and processed by adopting a VAD algorithm. The target voice data to be distinguished refers to voice data for voice distinguishing acquired after the original voice data to be distinguished is processed through a voice activity detection algorithm.
In this embodiment, the VAD algorithm is used to process the original voice data to be distinguished, target voice and interfering voice are preliminarily screened in the original voice data to be distinguished, and the preliminarily screened target voice portion is used as the target voice data to be distinguished. It can be understood that the preliminarily screened-out interfering voice no longer needs to be distinguished, which improves the efficiency of voice distinguishing. The target voice preliminarily screened from the original voice data to be distinguished, however, still contains interfering voice; in particular, when the noise in the original voice data to be distinguished is relatively large, more interfering voice (such as noise) remains mixed with the preliminarily screened target voice, and the VAD algorithm alone clearly cannot distinguish the voice effectively in this situation. Therefore, the preliminarily screened target voice mixed with interfering voice is taken as the target voice data to be distinguished, so that it can be distinguished more accurately afterwards. Using the VAD algorithm to preliminarily distinguish the original voice data to be distinguished removes a large amount of interfering voice while keeping the preliminarily screened data available for further distinguishing, which facilitates the subsequent, finer voice distinguishing.
In a specific embodiment, as shown in fig. 3, in step S10, processing original voice data to be distinguished based on a voice activity detection algorithm to obtain target voice data to be distinguished, including the following steps:
s11: processing original voice data to be distinguished according to a short-time energy characteristic value calculation formula to obtain a corresponding short-time energy characteristic value, reserving the original voice data to be distinguished with the short-time energy characteristic value larger than a first threshold value, and determining the original voice data to be distinguished as first original voice data to be distinguished, wherein the short-time energy characteristic value calculation formula is
$E = \sum_{n=1}^{N} s^{2}(n)$

where N is the length of the voice frame, s(n) is the signal amplitude in the time domain, and n is the time.
The short-time energy feature value describes the energy corresponding to a frame of speech (a frame is generally 10-30 ms) in the time domain, and the short time of the short-time energy is understood to be the time of a frame (i.e. the length of a speech frame). Since the short-term energy eigenvalue of the target speech is much higher than that of the interfering speech (silence), the target speech and the interfering speech can be distinguished from each other by the short-term energy eigenvalue.
In this embodiment, the original voice data to be distinguished is processed according to the short-time energy characteristic value calculation formula (the original voice data to be distinguished needs to be subjected to framing processing in advance), the short-time energy characteristic value of each frame of the original voice data to be distinguished is calculated and obtained, the short-time energy characteristic value of each frame is compared with a preset first threshold value, the original voice data to be distinguished larger than the first threshold value is reserved, and the original voice data to be distinguished is determined as the first original voice data to be distinguished. The first threshold is a boundary value used to measure whether the short-term energy characteristic value belongs to the target voice or the interfering voice. In this embodiment, according to the comparison result between the short-time energy characteristic value and the first threshold, the target voice in the original voice data to be distinguished can be preliminarily distinguished from the short-time energy characteristic value, and a large amount of interference voice in the original voice data to be distinguished can be effectively removed.
S12: processing the original voice data to be distinguished according to a zero-crossing rate characteristic value calculation formula to obtain a corresponding zero-crossing rate characteristic value, reserving the original voice data to be distinguished with the zero-crossing rate characteristic value smaller than a second threshold value, and determining the original voice data to be distinguished as second original voice data to be distinguished, wherein the zero-crossing rate characteristic value calculation formula is
$Z = \frac{1}{2}\sum_{n=1}^{N-1}\left|\operatorname{sgn}\big(s(n+1)\big)-\operatorname{sgn}\big(s(n)\big)\right|$

where N is the length of the voice frame, s(n) is the signal amplitude in the time domain, and n is the time.
The zero-crossing rate characteristic value describes the number of times that the speech signal waveform within a frame of speech crosses the horizontal axis (zero level). Since the zero-crossing rate characteristic value of the target speech is much lower than that of the interfering speech, the target speech and the interfering speech can be distinguished from each other according to the zero-crossing rate characteristic value.
In this embodiment, the original voice data to be distinguished is processed according to the zero-crossing rate characteristic value calculation formula, the zero-crossing rate characteristic value of each frame of the original voice data to be distinguished is calculated, the zero-crossing rate characteristic value of each frame is compared with a preset second threshold value, and the original voice data to be distinguished smaller than the second threshold value is retained and determined as the second original voice data to be distinguished. The second threshold is a boundary value used to measure whether the zero-crossing rate characteristic value belongs to the target voice or the interfering voice. In this embodiment, according to the comparison result between the zero-crossing rate characteristic value and the second threshold, the target voice in the original voice data to be distinguished can be preliminarily distinguished from the perspective of the zero-crossing rate characteristic value, and a large amount of interfering voice in the original voice data to be distinguished can be effectively removed.
S13: and taking the first original distinguishing voice data and the second original distinguishing voice data as target voice data to be distinguished.
In this embodiment, the first original voice data to be distinguished is obtained from the original voice data to be distinguished according to the short-time energy characteristic value, and the second original voice data to be distinguished is obtained from the original voice data to be distinguished according to the zero-crossing rate characteristic value. The first and second original voice data to be distinguished start from two different angles of voice distinguishing, and both angles distinguish voice well, so the first original voice data to be distinguished and the second original voice data to be distinguished are combined (taking their intersection) and used together as the target voice data to be distinguished.
Steps S11 to S13 may primarily and effectively remove most of the interfering voice data in the original voice data to be distinguished, retain the original voice data to be distinguished mixed with the target voice and a small amount of interfering voice (such as noise), and use the original voice data to be distinguished as the target voice data to be distinguished, which may effectively and primarily distinguish the original voice data to be distinguished.
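The following Python sketch illustrates how steps S11 to S13 could be implemented; it is an illustrative example rather than the patented implementation, and the frame length, frame shift and the two thresholds are assumed values that would need to be tuned for real data.

```python
import numpy as np

def frame_signal(signal, frame_len=400, hop=200):
    """Split a 1-D signal into overlapping frames (frame_len samples, hop-sample shift)."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])

def short_time_energy(frames):
    """Short-time energy per frame: E = sum_n s(n)^2."""
    return np.sum(frames ** 2, axis=1)

def zero_crossing_rate(frames):
    """Zero-crossing count per frame: 0.5 * sum |sgn(s(n+1)) - sgn(s(n))|."""
    signs = np.sign(frames)
    signs[signs == 0] = 1                      # treat exact zeros as positive
    return 0.5 * np.sum(np.abs(np.diff(signs, axis=1)), axis=1)

def vad_select(signal, energy_threshold=1e-3, zcr_threshold=50.0):
    """Keep frames whose energy exceeds the first threshold AND whose zero-crossing
    count is below the second threshold (intersection of S11 and S12)."""
    frames = frame_signal(signal)
    keep_energy = short_time_energy(frames) > energy_threshold   # S11
    keep_zcr = zero_crossing_rate(frames) < zcr_threshold        # S12
    return frames[keep_energy & keep_zcr]                        # S13: intersection

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    signal = rng.standard_normal(16000) * 0.01   # 1 s of synthetic noise at 16 kHz
    print(vad_select(signal).shape)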
S20: and acquiring corresponding ASR voice characteristics based on the target voice data to be distinguished.
Among them, ASR (Automatic Speech Recognition) is a technology for converting voice data into a computer readable input, for example, converting voice data into a form of a key, a binary code, or a character sequence. The speech features in the target speech data to be distinguished can be extracted through ASR, and the extracted speech is the ASR speech features corresponding to the extracted speech. It is to be appreciated that ASR is capable of converting speech data that would otherwise not be directly readable by a computer into computer-readable ASR speech features that can be represented in a vector fashion.
In this embodiment, the ASR is used to process the target speech data to be distinguished, and obtain corresponding ASR speech features, where the ASR speech features can well reflect potential features of the target speech data to be distinguished, and can distinguish the target speech data to be distinguished according to the ASR speech features, so as to provide an important technical premise for performing corresponding ASR-RNN (Recurrent neural networks, RNN) model recognition according to the ASR speech features.
In a specific embodiment, as shown in fig. 4, in step S20, acquiring corresponding ASR speech features based on the target speech data to be distinguished, including the following steps:
s21: and preprocessing the target voice data to be distinguished to acquire preprocessed voice data.
In this embodiment, the target voice data to be distinguished is preprocessed, and corresponding preprocessed voice data is obtained. The ASR voice features of the target voice data to be distinguished can be better extracted by preprocessing the target voice data to be distinguished, so that the extracted ASR voice features can represent the target voice data to be distinguished better, and the ASR voice features are adopted for voice distinguishing.
In a specific embodiment, as shown in fig. 5, in step S21, preprocessing the target voice data to be distinguished to obtain preprocessed voice data, including the following steps:
S211: performing pre-emphasis processing on the target voice data to be distinguished, where the calculation formula of the pre-emphasis processing is $s'_n = s_n - a \cdot s_{n-1}$, in which $s_n$ is the signal amplitude in the time domain, $s_{n-1}$ is the signal amplitude at the moment preceding $s_n$, $s'_n$ is the signal amplitude in the time domain after pre-emphasis, and $a$ is the pre-emphasis coefficient, with a value range of $0.9 < a < 1.0$.
The pre-emphasis is a signal processing method for compensating the high-frequency component of the input signal at the transmitting end. As the signal rate increases, the signal is greatly damaged during transmission, and the damaged signal needs to be compensated for in order to obtain a better signal waveform at the receiving end. The idea of the pre-emphasis technology is to enhance the high-frequency component of the signal at the transmitting end of the transmission line to compensate the excessive attenuation of the high-frequency component in the transmission process, so that the receiving end can obtain a better signal waveform. The pre-emphasis has no influence on noise, so that the output signal-to-noise ratio can be effectively improved.
In this embodiment, the target voice data to be distinguished is pre-emphasized according to the formula $s'_n = s_n - a \cdot s_{n-1}$, where $s_n$ is the signal amplitude in the time domain, i.e. the amplitude of the speech represented in the time domain by the voice data, $s_{n-1}$ is the signal amplitude at the moment preceding $s_n$, $s'_n$ is the signal amplitude in the time domain after pre-emphasis, and $a$ is the pre-emphasis coefficient with $0.9 < a < 1.0$; here $a = 0.97$ is preferred. By adopting the pre-emphasis processing, the interference caused by vocal cords, lips and the like in the sounding process can be eliminated, the suppressed high-frequency part of the target voice data to be distinguished can be effectively compensated, the high-frequency formants of the target voice data to be distinguished can be highlighted, the signal amplitude of the target voice data to be distinguished is enhanced, and the extraction of the ASR voice features is facilitated.
S212: and performing frame division processing on the pre-emphasized target voice data to be distinguished.
In this embodiment, after pre-emphasizing the voice data to be distinguished, frame division processing should be performed. Framing refers to a speech processing technique that divides an entire speech signal into several segments, where each frame is in the range of 10-30ms and approximately 1/2 of the frame length is used as a frame shift. The frame shift refers to an overlapping area between two adjacent frames, so that the problem of overlarge change of the two adjacent frames can be avoided. The voice data to be distinguished of the target is subjected to framing processing, the voice data to be distinguished of the target can be divided into a plurality of sections of voice data, the voice data to be distinguished of the target can be subdivided, and the extraction of the ASR voice features is facilitated.
S213: windowing the voice data to be distinguished of the target after framing to obtain preprocessed voice data, wherein the calculation formula of windowing is as follows
$s'_n = s_n \times \left(0.54 - 0.46\,\cos\!\left(\frac{2\pi n}{N-1}\right)\right)$

where N is the window length, n is the time, $s_n$ is the signal amplitude in the time domain, and $s'_n$ is the signal amplitude in the time domain after windowing.
In this embodiment, windowing is performed after the frame division processing of the target voice data to be distinguished. Since discontinuous portions are present at both the start and the end of each frame, the more frames are divided, the larger the error with respect to the original target voice data to be distinguished becomes. Windowing can solve this problem: it makes the framed target voice data to be distinguished continuous, and each frame then exhibits the characteristics of a periodic function. The windowing processing specifically refers to processing the target voice data to be distinguished with a window function; the window function may be a Hamming window, and the windowing formula is

$s'_n = s_n \times \left(0.54 - 0.46\,\cos\!\left(\frac{2\pi n}{N-1}\right)\right)$

where N is the Hamming window length, n is the time, $s_n$ is the signal amplitude in the time domain, and $s'_n$ is the signal amplitude in the time domain after windowing. Windowing the framed target voice data to be distinguished yields the preprocessed voice data, makes its signal continuous in the time domain, and facilitates the extraction of the ASR voice features of the target voice data to be distinguished.
The preprocessing operation of the target voice data to be distinguished in the steps S211 to S213 provides a basis for extracting ASR voice features of the target voice data to be distinguished, so that the extracted ASR voice features can better represent the target voice data to be distinguished, and voice distinguishing is performed according to the ASR voice features.
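A minimal sketch of the preprocessing of steps S211 to S213 is given below; the 16 kHz sampling rate, the 25 ms frame with 10 ms shift and the pre-emphasis coefficient a = 0.97 are assumptions consistent with, but not mandated by, the description above.

```python
import numpy as np

def preprocess(signal, a=0.97, frame_len=400, hop=160):
    # S211: pre-emphasis  s'_n = s_n - a * s_{n-1}
    emphasized = np.append(signal[0], signal[1:] - a * signal[:-1])

    # S212: framing into overlapping frames (~25 ms frames, ~10 ms shift at 16 kHz)
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    frames = np.stack([emphasized[i * hop:i * hop + frame_len] for i in range(n_frames)])

    # S213: windowing each frame with a Hamming window
    window = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(frame_len) / (frame_len - 1))
    return frames * window

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    preprocessed = preprocess(rng.standard_normal(16000))
    print(preprocessed.shape)   # (number of frames, frame_len)
```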
S22: and performing fast Fourier transform on the preprocessed voice data to obtain the frequency spectrum of the target voice data to be distinguished, and obtaining the power spectrum of the target voice data to be distinguished according to the frequency spectrum.
Among them, fast Fourier Transform (FFT) is a general term of an efficient and Fast calculation method for calculating discrete Fourier transform by using a computer, and is abbreviated as FFT. The multiplication times required by a computer for calculating the discrete Fourier transform can be greatly reduced by adopting the algorithm, and particularly, the more the number of the converted sampling points is, the more remarkable the calculation amount of the FFT algorithm is saved.
In the present embodiment, the preprocessed voice data is subjected to a fast Fourier transform to convert it from signal amplitude in the time domain to signal amplitude in the frequency domain (the frequency spectrum). The formula for calculating the frequency spectrum is

$s(k) = \sum_{n=1}^{N} s(n)\, e^{-\frac{2\pi i}{N} k n}, \quad 1 \le k \le N$

where N is the size of the frame, s(k) is the signal amplitude in the frequency domain, s(n) is the signal amplitude in the time domain, n is the time, and i is the imaginary unit. After the frequency spectrum of the preprocessed voice data is obtained, the power spectrum of the preprocessed voice data can be obtained directly from the frequency spectrum; the power spectrum of the preprocessed voice data is hereinafter referred to as the power spectrum of the target voice data to be distinguished. The formula for calculating the power spectrum of the target voice data to be distinguished is

$P(k) = \frac{1}{N}\,\left|s(k)\right|^{2}, \quad 1 \le k \le N$

where N is the size of the frame and s(k) is the signal amplitude in the frequency domain. Converting the signal amplitude of the preprocessed voice data from the time domain to the frequency domain and then obtaining the power spectrum of the target voice data to be distinguished from it provides an important technical basis for extracting the ASR voice features from the power spectrum of the target voice data to be distinguished.
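The following sketch illustrates step S22: a per-frame fast Fourier transform followed by the power spectrum; the 512-point FFT size is an assumed value, not one specified in this embodiment.

```python
import numpy as np

def power_spectrum(frames, n_fft=512):
    """Per-frame frequency spectrum and power spectrum (step S22)."""
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)    # frequency-domain amplitudes s(k)
    return (np.abs(spectrum) ** 2) / n_fft             # P(k) = |s(k)|^2 / N

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.standard_normal((98, 400))            # e.g. output of the preprocessing sketch
    print(power_spectrum(frames).shape)                # (98, 257)
```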
S23: and processing the power spectrum of the target voice data to be distinguished by adopting a Mel scale filter bank to obtain the Mel power spectrum of the target voice data to be distinguished.
Processing the power spectrum of the target voice data to be distinguished with a mel-scale filter bank amounts to performing mel frequency analysis on it, and mel frequency analysis is based on human auditory perception. It has been observed that the human ear behaves like a filter bank and only focuses on certain frequency components (human hearing is selective with respect to frequency): the ear only lets signals of certain frequencies pass and simply ignores signals of the frequencies it does not want to perceive. These filters, however, are not uniformly distributed on the frequency axis; there are many filters in the low-frequency region, distributed densely, while in the high-frequency region the number of filters becomes smaller and the distribution is sparse. It can be understood that the mel-scale filter bank has high resolution in the low-frequency part, which is consistent with the auditory characteristics of the human ear, and this is also the physical meaning of the mel scale.
In this embodiment, a mel scale filter bank is used to process the power spectrum of the target voice data to be distinguished, a mel power spectrum of the target voice data to be distinguished is obtained, and a mel scale filter bank is used to segment the frequency domain signal, so that each frequency segment corresponds to a numerical value, and if the number of the filters is 22, 22 energy values corresponding to the mel power spectrum of the target voice data to be distinguished can be obtained. The power spectrum of the target voice data to be distinguished is subjected to Mel frequency analysis, so that a Mel power spectrum obtained after analysis retains a frequency part closely related to the characteristics of human ears, and the frequency part can well reflect the characteristics of the target voice data to be distinguished.
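A sketch of step S23 is given below, building a bank of triangular mel-scale filters (22 filters, matching the example above) and applying it to the power spectrum; the sampling rate and FFT size are assumptions carried over from the previous sketch.

```python
import numpy as np

def mel_filterbank(n_filters=22, n_fft=512, sample_rate=16000):
    """Triangular filters spaced uniformly on the mel scale (dense at low frequencies)."""
    hz_to_mel = lambda hz: 2595.0 * np.log10(1.0 + hz / 700.0)
    mel_to_hz = lambda mel: 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)

    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            bank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            bank[i - 1, k] = (right - k) / max(right - center, 1)
    return bank

def mel_power_spectrum(power, n_filters=22):
    """One energy value per filter and per frame (the mel power spectrum of step S23)."""
    return power @ mel_filterbank(n_filters, n_fft=(power.shape[1] - 1) * 2).T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    power = rng.random((98, 257))
    print(mel_power_spectrum(power).shape)   # (98, 22)
```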
S24: and performing cepstrum analysis on the Mel power spectrum to obtain Mel frequency cepstrum coefficients of the target voice data to be distinguished.
Among them, cepstrum (cepstrum) is an inverse fourier transform performed after a fourier transform spectrum of a signal is subjected to a logarithmic operation, and is also called a complex cepstrum because a general fourier spectrum is a complex spectrum.
In this embodiment, cepstrum analysis is performed on the mel-power spectrum, and according to the result of the cepstrum, mel-frequency cepstrum coefficients of the target voice data to be distinguished are analyzed and acquired. By this cepstrum analysis, the features contained in the mel power spectrum of the target speech data to be distinguished, which originally have too high feature dimensions and are difficult to use directly, can be converted into features (mel-frequency cepstrum coefficient feature vectors for training or recognition) which are easy to use by performing cepstrum analysis on the mel power spectrum. The Mel frequency cepstrum coefficient can be used as a coefficient for distinguishing different voices by ASR voice characteristics, and the ASR voice characteristics can reflect the difference between the voices and can be used for identifying and distinguishing target voice data to be distinguished.
In a specific embodiment, as shown in fig. 6, in step S24, performing cepstrum analysis on the mel-power spectrum to obtain mel-frequency cepstrum coefficients of the target voice data to be distinguished, including the following steps:
s241: and taking the logarithm value of the Mel power spectrum to obtain the Mel power spectrum to be transformed.
In this embodiment, according to the definition of the cepstrum, the logarithm (log) of the mel power spectrum is taken to obtain the mel power spectrum m to be transformed.
S242: and performing discrete cosine transform on the Mel power spectrum to be transformed to obtain Mel frequency cepstrum coefficients of the target voice data to be distinguished.
In this embodiment, a discrete cosine transform (DCT) is performed on the mel power spectrum m to be transformed to obtain the mel frequency cepstrum coefficients of the corresponding target voice data to be distinguished; generally, the 2nd to 13th coefficients are taken as the ASR voice features, which can reflect the differences between the voice data. The discrete cosine transform formula of the mel power spectrum m to be transformed is

$C(i) = \sum_{j=1}^{N} m(j)\,\cos\!\left(\frac{\pi i\,(2j-1)}{2N}\right)$

where N is the frame length, m is the mel power spectrum to be transformed, and j is the independent variable of the mel power spectrum to be transformed. Because the mel filters overlap, the energy values obtained with the mel-scale filters are correlated; the discrete cosine transform can compress, reduce the dimension of, and abstract the mel power spectrum m to be transformed to obtain the corresponding ASR voice features. Compared with the Fourier transform, the result of the discrete cosine transform has no imaginary part, which gives it a clear advantage in terms of computation.
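The following sketch illustrates steps S241 and S242: the logarithm of the mel power spectrum is taken and a type-II discrete cosine transform is applied, keeping the 2nd to 13th coefficients as the ASR voice features; the unnormalized DCT and the small epsilon added before the logarithm are implementation choices, not requirements of this embodiment.

```python
import numpy as np

def mfcc_from_mel(mel_power, n_coeffs=12):
    """Cepstral analysis (steps S241-S242): log of the mel power spectrum, then a
    type-II DCT; the 2nd to 13th coefficients are kept as the ASR voice features."""
    log_mel = np.log(mel_power + 1e-10)           # S241: logarithm (epsilon avoids log(0))

    n = log_mel.shape[1]
    j = np.arange(n)
    # S242: DCT-II basis  cos(pi * i * (2j + 1) / (2N)) for i = 0 .. n-1
    basis = np.cos(np.pi * np.outer(np.arange(n), 2 * j + 1) / (2 * n))
    cepstrum = log_mel @ basis.T

    return cepstrum[:, 1:1 + n_coeffs]            # coefficients 2..13

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mel_power = rng.random((98, 22)) + 0.1
    print(mfcc_from_mel(mel_power).shape)          # (98, 12)
```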
Steps S21 to S24 perform feature extraction on the target voice data to be distinguished based on the ASR technology. The finally obtained ASR voice features can reflect the target voice data to be distinguished well; these ASR voice features can be trained in a deep network model to obtain the ASR-RNN model, so that the trained ASR-RNN model gives more accurate voice distinguishing results and can accurately distinguish noise from voice even under heavy noise.
It should be noted that the features extracted above are mel frequency cepstrum coefficients, but the ASR voice features should not be limited to mel frequency cepstrum coefficients alone: any voice feature obtained with the ASR technology can be used as an ASR voice feature for recognition and model training, as long as it can effectively reflect the characteristics of the voice data.
S30: and inputting the ASR speech characteristics into a pre-trained ASR-RNN model for distinguishing, and obtaining a target distinguishing result.
The ASR-RNN model refers to a recurrent neural network model obtained by training with ASR speech features, where RNN stands for recurrent neural network. The ASR-RNN model is obtained by training with the ASR speech features extracted from the speech data to be trained, so that the model can recognize ASR speech features and distinguish speech according to them. Specifically, the speech data to be trained comprises target speech and noise; when the ASR-RNN model is trained, the ASR speech features of the target speech and the ASR speech features of the noise are both extracted, so that the trained ASR-RNN model can recognize the target speech and the noise according to their ASR speech features (most of the interfering speech, such as the silent, unvoiced portions of the speech data and part of the noise, has already been removed when the VAD was used to distinguish the original speech data to be distinguished, so the interfering speech distinguished by the ASR-RNN model specifically refers to the noise portion), thereby achieving the purpose of effectively distinguishing the target speech from the interfering speech.
In the embodiment, the ASR speech features are input into the ASR-RNN model which is trained in advance for distinguishing, and the ASR speech features can reflect the features of the speech data, so that the ASR speech features extracted from the target speech data to be distinguished can be recognized according to the ASR-RNN model, and the target speech data to be distinguished can be accurately distinguished according to the ASR speech features. The pre-trained ASR-RNN model combines the characteristics of ASR speech characteristics and the characteristic of deep extraction of the characteristics by a recurrent neural network, distinguishes speech from the ASR speech characteristics of speech data, and still has high accuracy under the condition of severe noise conditions. Specifically, since the features extracted by ASR also include ASR speech features of noise, in the ASR-RNN model, noise can be accurately distinguished, and the problem that the current speech distinguishing method (including but not limited to VAD) cannot effectively distinguish speech under the condition of large noise influence is solved.
In a specific embodiment, in step S30, before the step of inputting the ASR speech features into the ASR-RNN model trained in advance for distinguishing and obtaining the target distinguishing result, the speech distinguishing method further includes the following steps: and obtaining the ASR-RNN model.
As shown in FIG. 7, the step of obtaining the ASR-RNN model specifically includes:
s31: and acquiring the voice data to be trained, and extracting the ASR voice characteristics to be trained of the voice data to be trained.
The speech data to be trained refers to the training sample set of speech data required for training the ASR-RNN model; it can be an open-source speech training set or a speech training set built by collecting a large amount of sample speech data. In the speech data to be trained, the target speech and the interfering speech (specifically, noise) are distinguished in advance, for example by setting different label values for the target speech and the noise respectively. For example, the target speech portion of the speech data to be trained is labeled 1 (representing "true") and the noise portion is labeled 0 (representing "false"); the recognition accuracy of the ASR-RNN model can then be checked against these preset label values, which provides a reference for improvement, so that the network parameters in the ASR-RNN model can be updated and the ASR-RNN model can be continuously optimized. In this embodiment, the ratio of target speech to noise may specifically be 1:1; this ratio avoids the overfitting phenomenon caused by unequal amounts of target speech and noise in the speech data to be trained. Overfitting is the phenomenon in which the hypothesis is made excessively strict in order to obtain a consistent hypothesis; avoiding it is a core task in classifier design.
In this embodiment, the speech data to be trained is obtained, and the feature of the speech data to be trained, that is, the speech feature of the ASR to be trained, is extracted, and the step of extracting the speech feature of the ASR to be trained is the same as that in steps S21 to S24, and is not described herein again. The speech data to be trained comprises a training sample of target speech and a training sample of noise, and the two parts of speech data have respective ASR speech characteristics, so that a corresponding ASR-RNN model can be extracted and trained by using the speech characteristics of the ASR to be trained, and the ASR-RNN model obtained by training according to the speech characteristics of the ASR to be trained can accurately distinguish the target speech from the noise (the noise belongs to interference speech).
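A sketch of the data preparation in step S31 is shown below, assuming the target-speech and noise samples are already available as sequences of per-frame ASR voice features; labels of 1 (target speech) and 0 (noise) are attached and the two classes are balanced to the 1:1 ratio mentioned above. The feature extraction itself would reuse steps S21 to S24.

```python
import numpy as np

def build_training_set(target_sequences, noise_sequences, seed=0):
    """Balance target-speech sequences (label 1) and noise sequences (label 0)
    to a 1:1 ratio and shuffle them (step S31). Each sequence is a (T, d) array
    of per-frame ASR voice features; labels are given per frame."""
    rng = np.random.default_rng(seed)
    n = min(len(target_sequences), len(noise_sequences))

    samples = [(seq, np.ones(len(seq), dtype=int)) for seq in target_sequences[:n]]
    samples += [(seq, np.zeros(len(seq), dtype=int)) for seq in noise_sequences[:n]]

    order = rng.permutation(len(samples))       # shuffle whole sequences, keep frame order inside
    return [samples[i] for i in order]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    target = [rng.standard_normal((50, 12)) for _ in range(8)]
    noise = [rng.standard_normal((50, 12)) for _ in range(10)]
    data = build_training_set(target, noise)
    print(len(data), data[0][0].shape)          # 16 (50, 12)
```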
S32: the RNN model is initialized.
The RNN model is a recurrent neural network model. The RNN model includes an input layer, a hidden layer and an output layer composed of neurons, together with the weights and biases of the neuron connections between the layers, which determine the properties and recognition effect of the RNN model. In contrast to conventional neural networks such as DNNs (Deep Neural Networks), an RNN is a neural network that models sequence data (e.g., a time series), i.e., the current output of a sequence is related to the previous outputs. Concretely, the network memorizes the previous hidden-layer state and applies it to the calculation of the current output; that is, the nodes of the hidden layer are no longer unconnected but connected, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous moment. Because voice data has the characteristic of a time sequence, training the RNN model with the speech data to be trained can accurately extract the respective deep features of the target voice and the interfering voice over the time sequence, and thus distinguish the voice accurately.
In this embodiment, the RNN model is initialized; the initialization operation sets the initial values of the weights and biases in the RNN model, and the initial values may be set to small values, for example within the range [-0.3, 0.3]. Reasonable initialization of the RNN model gives the model flexible adjustment capability in the initial stage and allows the model to be adjusted effectively during training, which avoids a poor distinguishing effect of the trained model caused by poor adjustment capability in the initial stage.
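A sketch of the initialization in step S32 follows, drawing all weights and biases uniformly from [-0.3, 0.3] as suggested above; the input dimension of 12 (matching the 12 cepstral coefficients), the hidden size of 64 and the two output classes are illustrative assumptions.

```python
import numpy as np

def init_rnn(input_dim=12, hidden_dim=64, output_dim=2, scale=0.3, seed=0):
    """Initialise U, W, V, b, c with small values in [-scale, scale] (step S32)."""
    rng = np.random.default_rng(seed)
    uniform = lambda *shape: rng.uniform(-scale, scale, size=shape)
    return {
        "U": uniform(hidden_dim, input_dim),    # input layer  -> hidden layer
        "W": uniform(hidden_dim, hidden_dim),   # hidden layer -> hidden layer (across time)
        "V": uniform(output_dim, hidden_dim),   # hidden layer -> output layer
        "b": uniform(hidden_dim),               # bias between input and hidden layer
        "c": uniform(output_dim),               # bias between hidden and output layer
    }

if __name__ == "__main__":
    params = init_rnn()
    print({k: v.shape for k, v in params.items()})
```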
S33: inputting the ASR speech features to be trained into the RNN model, and acquiring an output value of the RNN model according to a forward propagation algorithm, wherein the output value is expressed as:
$\hat{y}_t = \sigma(V h_t + c)$

where σ denotes the activation function, V denotes the weight of the connection between the hidden layer and the output layer, $h_t$ denotes the hidden state at time t, and c denotes the bias between the hidden layer and the output layer.

In this embodiment, the RNN forward propagation process is a series of linear operations and activation operations performed in the RNN model in time order, based on the weights and biases connecting the neurons and on the input ASR speech features to be trained, to obtain the output value of each layer of the network in the RNN model. In particular, since the RNN is a neural network that models sequential data (here, specifically, a time series), computing the hidden state $h_t$ of the hidden layer at time t requires both the hidden state $h_{t-1}$ at time t-1 and the ASR speech feature to be trained input at time t. From the RNN forward propagation process, the RNN forward propagation algorithm can be obtained: for any time t, the output from the input layer of the RNN model to the hidden layer is calculated from the input ASR speech features to be trained, and the output of the hidden layer (namely the hidden state $h_t$) is expressed as

$h_t = \sigma(U x_t + W h_{t-1} + b)$

where σ represents an activation function (specifically, tanh may be used; tanh continuously enlarges the differences between the ASR speech features to be trained during the recurrence, which helps distinguish target speech from noise), U represents the weight of the connection between the input layer and the hidden layer, W represents the weight of the connection between the hidden layers (the connection between the hidden layers is realized through the time sequence), $x_t$ represents the ASR speech feature to be trained input at time t, $h_{t-1}$ represents the hidden state at time t-1, and b represents the bias between the input layer and the hidden layer. The output from the hidden layer of the RNN model to the output layer is then computed, and the output of the output layer (i.e., the output value of the RNN model) is expressed as

$\hat{y}_t = \sigma(V h_t + c)$

where the activation function σ may specifically be the softmax function (the softmax function works well for classification problems), V represents the weight of the connection between the hidden layer and the output layer, $h_t$ represents the hidden state at time t, and c represents the bias between the hidden layer and the output layer. The output value of the RNN model $\hat{y}_t$, that is, the output value calculated layer by layer in sequence by the forward propagation algorithm, may be referred to as the predicted output value. After the server obtains the output value of the RNN model, the network parameters (weights and biases) in the RNN model can be updated and adjusted according to this output value, so that the obtained RNN model can distinguish according to the time-sequence characteristics of the speech and obtain accurate recognition results from the differences between the ASR speech features of the target speech and of the interfering speech over time.
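The forward propagation of step S33 can be sketched as follows, looping h_t = tanh(U x_t + W h_{t-1} + b) and ŷ_t = softmax(V h_t + c) over the frames of one sequence; the parameter shapes are illustrative and follow the initialization sketch above.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rnn_forward(params, x_seq):
    """Forward propagation over one sequence of ASR features (step S33).
    Returns hidden states h_t and predicted outputs y_hat_t for every time step."""
    U, W, V = params["U"], params["W"], params["V"]
    b, c = params["b"], params["c"]

    h = np.zeros(U.shape[0])                  # h_0: initial hidden state
    hs, ys = [], []
    for x_t in x_seq:
        h = np.tanh(U @ x_t + W @ h + b)      # h_t = tanh(U x_t + W h_{t-1} + b)
        ys.append(softmax(V @ h + c))         # y_hat_t = softmax(V h_t + c)
        hs.append(h)
    return np.array(hs), np.array(ys)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = {
        "U": rng.uniform(-0.3, 0.3, (64, 12)), "W": rng.uniform(-0.3, 0.3, (64, 64)),
        "V": rng.uniform(-0.3, 0.3, (2, 64)), "b": rng.uniform(-0.3, 0.3, 64),
        "c": rng.uniform(-0.3, 0.3, 2),
    }
    hs, ys = rnn_forward(params, rng.standard_normal((50, 12)))
    print(hs.shape, ys.shape)                 # (50, 64) (50, 2)
```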
S34: and performing error back transmission based on the output value, updating the weight and bias of each layer of the RNN model, and obtaining the ASR-RNN model, wherein the formula for updating the weight V is as follows:
$V' = V - \alpha \sum_{t=1}^{\tau} (\hat{y}_t - y_t)\, h_t^{T}$

where V represents the weight of the connection between the hidden layer and the output layer before updating, V' represents the weight of the connection between the hidden layer and the output layer after updating, α represents the learning rate, t represents the time t, τ represents the total duration, $\hat{y}_t$ represents the predicted output value, $y_t$ represents the true output value, $h_t$ represents the hidden state at time t, and T represents the matrix transposition operation; the formula for updating the bias c is:

$c' = c - \alpha \sum_{t=1}^{\tau} (\hat{y}_t - y_t)$

where c represents the bias between the hidden layer and the output layer before updating, and c' represents the bias between the hidden layer and the output layer after updating; the formula for updating the weight U is:

$U' = U - \alpha \sum_{t=1}^{\tau} \operatorname{diag}\!\left(1 - h_t \odot h_t\right) \delta_t\, x_t^{T}$

where U represents the weight of the connection between the input layer and the hidden layer before updating, U' represents the weight of the connection between the input layer and the hidden layer after updating, diag() represents the matrix operation of constructing a diagonal matrix or returning the diagonal elements of a matrix in the form of a vector, $\delta_t$ represents the gradient of the hidden layer state, and $x_t$ represents the ASR speech feature to be trained input at time t; the formula for updating the weight W is:

$W' = W - \alpha \sum_{t=1}^{\tau} \operatorname{diag}\!\left(1 - h_t \odot h_t\right) \delta_t\, h_{t-1}^{T}$

where W represents the weight of the connection between the hidden layers before updating, and W' represents the weight of the connection between the hidden layers after updating; the formula for updating the bias b is:

$b' = b - \alpha \sum_{t=1}^{\tau} \operatorname{diag}\!\left(1 - h_t \odot h_t\right) \delta_t$

where b represents the bias between the input layer and the hidden layer before updating, and b' represents the bias between the input layer and the hidden layer after updating.
In this embodiment, after the server obtains the output value (predicted output value) $\hat{y}_t$ of the RNN model according to the forward propagation algorithm, the error generated by the ASR speech features to be trained during RNN model training can be calculated from $\hat{y}_t$, and a suitable error function can be constructed from this error (for example, a logarithmic error function may be used to represent the generated error). The server then performs error back-propagation with this error function and adjusts and updates the weights (U, W and V) and biases (b and c) of each layer of the RNN model. Specifically, the preset label value can be regarded as the true output value (i.e. representing the objective fact: a label value of 1 represents target voice and a label value of 0 represents interfering voice) and is denoted by $y_t$. In the process of training the RNN model, the RNN model produces an error when calculating the forward output at each position in the time series, and this error can be measured with an error function L, expressed as:

$L = \sum_{t=1}^{\tau} L_t$

where t denotes the time t, τ denotes the total duration, and $L_t$ denotes the error produced at time t as represented by the error function. After the server obtains the error function, the weights and biases of the RNN model can be updated according to BPTT (Back Propagation Through Time, the time-based back propagation algorithm) to obtain the ASR-RNN model from the ASR speech features to be trained. Specifically, the formula for updating the weight V is:

$V' = V - \alpha \sum_{t=1}^{\tau} (\hat{y}_t - y_t)\, h_t^{T}$

where V represents the weight of the connection between the hidden layer and the output layer before updating, V' represents the weight of the connection between the hidden layer and the output layer after updating, α represents the learning rate, $\hat{y}_t$ represents the predicted output value, $y_t$ represents the true output value, $h_t$ represents the hidden state at time t, and T represents the matrix transposition operation. The formula for updating the bias c is:

$c' = c - \alpha \sum_{t=1}^{\tau} (\hat{y}_t - y_t)$

where c denotes the bias between the hidden layer and the output layer before updating, and c' denotes the bias between the hidden layer and the output layer after updating. Compared with the weight V and the bias c, the back-propagation of the weight U, the weight W and the bias b is different: the gradient loss at a certain time t is determined jointly by the gradient loss corresponding to the output at the current position and the gradient loss at time t+1. Therefore, updating the weight U, the weight W and the bias b requires the gradient $\delta_t$ of the hidden layer state. The gradient of the hidden layer state at time t of the sequence is expressed as:

$\delta_t = \frac{\partial L}{\partial h_t}$

There is a relation between $\delta_{t+1}$ and $\delta_t$, so that $\delta_t$ can be found from $\delta_{t+1}$; the expression relating them is:

$\delta_t = W^{T} \operatorname{diag}\!\left(1 - h_{t+1} \odot h_{t+1}\right) \delta_{t+1} + V^{T} (\hat{y}_t - y_t)$

where $\delta_{t+1}$ represents the gradient of the hidden layer state at time t+1 of the sequence, diag() represents the matrix operation of constructing a diagonal matrix or returning the diagonal elements of a matrix in the form of a vector, and $h_{t+1}$ represents the hidden layer state at time t+1 of the sequence. The gradient $\delta_\tau$ of the hidden layer state at time τ can be obtained first, and $\delta_t$ is then obtained by layer-by-layer back-propagation recursion using the relation between $\delta_{t+1}$ and $\delta_t$. Since there is no later time after τ, it follows directly from the gradient calculation that:

$\delta_\tau = V^{T} (\hat{y}_\tau - y_\tau)$

and $\delta_t$ can be obtained by recursion from $\delta_\tau$. Once $\delta_t$ is obtained, the updates of the weight U, the weight W and the bias b can be calculated. The formula for updating the weight U is:

$U' = U - \alpha \sum_{t=1}^{\tau} \operatorname{diag}\!\left(1 - h_t \odot h_t\right) \delta_t\, x_t^{T}$

where U represents the weight of the connection between the input layer and the hidden layer before updating, U' represents the weight of the connection between the input layer and the hidden layer after updating, diag() represents the matrix operation of constructing a diagonal matrix or returning the diagonal elements of a matrix in the form of a vector, $\delta_t$ represents the gradient of the hidden layer state, and $x_t$ represents the ASR speech feature to be trained input at time t. The formula for updating the weight W is:

$W' = W - \alpha \sum_{t=1}^{\tau} \operatorname{diag}\!\left(1 - h_t \odot h_t\right) \delta_t\, h_{t-1}^{T}$

where W represents the weight of the connection between the hidden layers before updating, and W' represents the weight of the connection between the hidden layers after updating. The formula for updating the bias b is:

$b' = b - \alpha \sum_{t=1}^{\tau} \operatorname{diag}\!\left(1 - h_t \odot h_t\right) \delta_t$

where b denotes the bias between the input layer and the hidden layer before updating, and b' denotes the bias between the input layer and the hidden layer after updating. Training can be stopped when the changes of all the weights and biases are smaller than the stop-iteration threshold ε, or when training reaches the maximum number of iterations MAX. Through the error generated between the predicted output value of the ASR speech features to be trained in the RNN model and the preset label value (the true output value), the weights and biases of each layer of the RNN model are updated based on this error, so that the finally obtained ASR-RNN model can train and learn deep features over the time sequence according to the ASR speech features and achieve the purpose of accurately distinguishing the voice.
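The following sketch performs one BPTT update (step S34) on a single training sequence, using the gradient expressions above with a tanh hidden layer and a softmax output so that the output-layer error reduces to ŷ_t - y_t; the learning rate and dimensions are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def bptt_update(params, x_seq, labels, alpha=0.05):
    """One BPTT step (S34): forward pass, then back-propagate delta_t through time
    and update U, W, V, b, c with the gradient formulas given above."""
    U, W, V, b, c = (params[k] for k in ("U", "W", "V", "b", "c"))
    tau = len(x_seq)

    # Forward pass (step S33): h_t = tanh(U x_t + W h_{t-1} + b), y_hat_t = softmax(V h_t + c)
    hs, ys = [np.zeros(U.shape[0])], []
    for x_t in x_seq:
        hs.append(np.tanh(U @ x_t + W @ hs[-1] + b))
        ys.append(softmax(V @ hs[-1] + c))

    # One-hot true outputs y_t from the per-frame labels (1 = target speech, 0 = noise)
    y_true = np.eye(V.shape[0])[labels]
    dy = np.array(ys) - y_true                          # y_hat_t - y_t

    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    db, dc = np.zeros_like(b), np.zeros_like(c)

    delta_next = np.zeros(U.shape[0])                   # delta_{t+1}
    for t in range(tau - 1, -1, -1):
        h_t, h_prev = hs[t + 1], hs[t]
        # delta_t = W^T diag(1 - h_{t+1} (.) h_{t+1}) delta_{t+1} + V^T (y_hat_t - y_t)
        if t == tau - 1:
            delta = V.T @ dy[t]
        else:
            delta = W.T @ ((1 - hs[t + 2] ** 2) * delta_next) + V.T @ dy[t]
        grad_h = (1 - h_t ** 2) * delta                 # diag(1 - h_t (.) h_t) delta_t
        dV += np.outer(dy[t], h_t)
        dc += dy[t]
        dU += np.outer(grad_h, x_seq[t])
        dW += np.outer(grad_h, h_prev)
        db += grad_h
        delta_next = delta

    for name, grad in (("U", dU), ("W", dW), ("V", dV), ("b", db), ("c", dc)):
        params[name] = params[name] - alpha * grad      # e.g. V' = V - alpha * sum_t (...)
    return params

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = {
        "U": rng.uniform(-0.3, 0.3, (64, 12)), "W": rng.uniform(-0.3, 0.3, (64, 64)),
        "V": rng.uniform(-0.3, 0.3, (2, 64)), "b": rng.uniform(-0.3, 0.3, 64),
        "c": rng.uniform(-0.3, 0.3, 2),
    }
    params = bptt_update(params, rng.standard_normal((50, 12)), rng.integers(0, 2, 50))
    print(params["V"].shape)
```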
The steps S31-S34 adopt the ASR speech features to be trained to train the RNN model, so that the ASR-RNN model obtained by training can train and learn deep features related to the sequence (time sequence) according to the ASR speech features, and the speech can be effectively distinguished according to the ASR speech features of the target speech and the interference speech and by combining time sequence factors. In the case of serious noise interference, the target voice and the noise can be accurately distinguished.
In the voice distinguishing method provided by this embodiment, the original voice data to be distinguished is first processed based on a voice activity detection (VAD) algorithm to obtain the target voice data to be distinguished. This first pass narrows the original voice data down to a smaller range: a large amount of interfering voice is removed at a preliminary stage, while the portions in which the target voice and the interfering voice are mixed are retained as the target voice data to be distinguished. Next, the corresponding ASR speech features are obtained based on the target voice data to be distinguished; these features make the voice distinguishing result more accurate and allow interfering voice (such as noise) to be separated from the target voice even under heavy noise, which provides an important technical premise for the subsequent recognition with the ASR-RNN model. Finally, the ASR speech features are input into the pre-trained ASR-RNN model for distinguishing, and the target distinguishing result is obtained. The ASR-RNN model is a recognition model trained specifically on the ASR speech features extracted from the voice data to be trained and on the time-sequence characteristics of voice, so it can correctly separate target voice from interfering voice in the target voice data to be distinguished in which the two are mixed (after the VAD pass, most of the remaining interference is noise), thereby improving the accuracy of voice distinguishing.
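Purely as an illustration of how the three stages fit together, the overall flow could be organised as in the sketch below; the function names and the predict interface are hypothetical and the three stage implementations are passed in by the caller.

```python
import numpy as np

def distinguish_speech(raw_frames, vad_filter, extract_asr_features, asr_rnn_model):
    """Three-stage flow of this embodiment: VAD screening, ASR feature
    extraction, then classification by the pre-trained ASR-RNN model.

    vad_filter, extract_asr_features and asr_rnn_model stand in for the
    three modules described above and are supplied by the caller.
    """
    # 1) first pass: voice activity detection removes most interfering voice
    target_frames = vad_filter(raw_frames)
    # 2) compute the ASR speech features (e.g. MFCCs) of the retained frames
    features = np.stack([extract_asr_features(frame) for frame in target_frames])
    # 3) the pre-trained ASR-RNN model distinguishes target voice from the
    #    remaining interference (mostly noise after the VAD pass)
    return asr_rnn_model.predict(features)
```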
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an execution order; the execution order of each process should be determined by its function and internal logic, and should not limit the implementation of the embodiments of the present invention in any way.
Fig. 8 is a schematic block diagram of a voice distinguishing apparatus corresponding one-to-one to the voice distinguishing method in the embodiment. As shown in Fig. 8, the voice distinguishing apparatus includes a target to-be-distinguished voice data obtaining module 10, a voice feature obtaining module 20 and a target distinguishing result obtaining module 30. The functions implemented by these modules correspond one-to-one to the steps of the voice distinguishing method in the embodiment; to avoid repetition, they are not described in detail again here.
The target voice data to be distinguished acquisition module 10 is configured to process the original voice data to be distinguished based on a voice activity detection algorithm to obtain the target voice data to be distinguished.
The voice feature obtaining module 20 is configured to obtain the corresponding ASR speech features based on the target voice data to be distinguished.
The target distinguishing result obtaining module 30 is configured to input the ASR speech features into a pre-trained ASR-RNN model for distinguishing and obtain a target distinguishing result.
Preferably, the target to-be-distinguished voice data acquiring module 10 includes a first original distinguishing voice data acquiring unit 11, a second original distinguishing voice data acquiring unit 12, and a target to-be-distinguished voice data acquiring unit 13.
A first original distinguishing voice data obtaining unit 11, configured to process the original voice data to be distinguished according to a short-time energy characteristic value calculation formula to obtain the corresponding short-time energy characteristic value, retain the original voice data to be distinguished whose short-time energy characteristic value is greater than a first threshold, and determine it as the first original distinguishing voice data, where the short-time energy characteristic value calculation formula is
$$E = \sum_{n=0}^{N-1} s^{2}(n)$$
N is the length of the voice frame, s(n) is the signal amplitude in the time domain, and n is the time.
A second original distinguishing voice data obtaining unit 12, configured to process the original voice data to be distinguished according to a zero-crossing rate characteristic value calculation formula to obtain the corresponding zero-crossing rate characteristic value, retain the original voice data to be distinguished whose zero-crossing rate characteristic value is smaller than a second threshold, and determine it as the second original distinguishing voice data, where the zero-crossing rate characteristic value calculation formula is
$$Z = \frac{1}{2}\sum_{n=1}^{N-1}\left|\,\mathrm{sgn}\left(s(n)\right) - \mathrm{sgn}\left(s(n-1)\right)\right|$$
N is the length of the voice frame, s(n) is the signal amplitude in the time domain, and n is the time.
A target to-be-distinguished voice data obtaining unit 13, configured to use the first original distinguishing voice data and the second original distinguishing voice data as the target voice data to be distinguished.
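The two characteristic values above lend themselves to a direct frame-level implementation, sketched below for illustration only: the threshold values are placeholders (the patent fixes no numbers), the frames are assumed to be NumPy arrays, and taking the union of the two retained sets is one reading of how unit 13 combines the first and second original distinguishing voice data.

```python
import numpy as np

def short_time_energy(frame):
    """Short-time energy of one voice frame s(n), n = 0..N-1."""
    return float(np.sum(frame.astype(np.float64) ** 2))

def zero_crossing_rate(frame):
    """Zero-crossing rate of one voice frame."""
    signs = np.sign(frame.astype(np.float64))
    return 0.5 * float(np.sum(np.abs(np.diff(signs))))

def vad_select(frames, energy_threshold, zcr_threshold):
    """Voice activity detection over a list of frames.

    Frames with high short-time energy (unit 11) or low zero-crossing
    rate (unit 12) are retained; unit 13 keeps both sets together,
    taken here as their union.
    """
    kept = []
    for frame in frames:
        keep_by_energy = short_time_energy(frame) > energy_threshold
        keep_by_zcr = zero_crossing_rate(frame) < zcr_threshold
        if keep_by_energy or keep_by_zcr:
            kept.append(frame)
    return kept
```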
Preferably, the voice feature acquisition module 20 includes a preprocessed voice data acquisition unit 21, a power spectrum acquisition unit 22, a mel-power spectrum acquisition unit 23, and a mel-frequency cepstrum coefficient unit 24.
The preprocessing unit 21 is configured to preprocess the target voice data to be distinguished, and acquire preprocessed voice data.
The power spectrum obtaining unit 22 is configured to perform fast fourier transform on the preprocessed voice data, obtain a frequency spectrum of the target voice data to be distinguished, and obtain a power spectrum of the target voice data to be distinguished according to the frequency spectrum.
The mel-power spectrum obtaining unit 23 is configured to process the power spectrum of the target voice data to be distinguished by using the mel-scale filter bank, and obtain the mel-power spectrum of the target voice data to be distinguished.
The mel-frequency cepstrum coefficient unit 24 is configured to perform cepstrum analysis on the mel power spectrum to obtain the mel-frequency cepstrum coefficients of the target voice data to be distinguished.
Preferably, the pre-processing unit 21 includes a pre-emphasis sub-unit 211, a framing sub-unit 212, and a windowing sub-unit 213.
A pre-emphasis subunit 211, configured to perform pre-emphasis processing on the target voice data to be distinguished, where the pre-emphasis calculation formula is s'_n = s_n − a·s_{n−1}, in which s_n is the signal amplitude in the time domain, s_{n−1} is the signal amplitude at the moment preceding s_n, s'_n is the signal amplitude in the time domain after pre-emphasis, and a is the pre-emphasis coefficient with value range 0.9 < a < 1.0.
A framing subunit 212, configured to perform framing processing on the pre-emphasized target voice data to be distinguished.
A windowing subunit 213, configured to perform windowing on the framed target to-be-distinguished voice data to obtain preprocessed voice data, where the windowing formula is
$$s'_n = s_n\left(0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right)\right),\quad 0 \le n \le N-1$$
wherein N is the window length, n is the time, s_n is the signal amplitude in the time domain, and s'_n is the signal amplitude in the time domain after windowing.
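A minimal sketch of the pre-emphasis, framing and windowing steps follows. It assumes a Hamming window and uses illustrative values for the pre-emphasis coefficient, frame length and frame shift; the patent only constrains 0.9 < a < 1.0.

```python
import numpy as np

def preprocess(signal, a=0.97, frame_len=400, frame_shift=160):
    """Pre-emphasis, framing and windowing of the target voice data.

    a, frame_len and frame_shift are illustrative values; the signal is
    assumed to be a 1-D NumPy array at least frame_len samples long.
    """
    signal = signal.astype(np.float64)

    # pre-emphasis: s'_n = s_n - a * s_{n-1}
    emphasized = np.append(signal[0], signal[1:] - a * signal[:-1])

    # framing: overlapping frames of frame_len samples, one every frame_shift samples
    n_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    frames = np.stack([
        emphasized[i * frame_shift: i * frame_shift + frame_len]
        for i in range(n_frames)
    ])

    # windowing: multiply every frame by a Hamming window
    window = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(frame_len) / (frame_len - 1))
    return frames * window
```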
Preferably, the mel frequency cepstrum coefficient unit 24 includes a mel power spectrum acquiring sub-unit 241 to be transformed and a mel frequency cepstrum coefficient sub-unit 242.
The to-be-transformed mel power spectrum obtaining subunit 241 is configured to take the logarithm of the mel power spectrum to obtain the mel power spectrum to be transformed.
A mel-frequency cepstrum coefficient subunit 242, configured to perform discrete cosine transform on the mel power spectrum to be transformed, and obtain mel-frequency cepstrum coefficients of the target voice data to be distinguished.
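By way of illustration, the chain of fast Fourier transform, power spectrum, mel-scale filter bank, logarithm and discrete cosine transform described by units 22-24 and subunits 241-242 could look as follows; the sample rate, FFT size, number of filters and number of retained coefficients are assumptions, not values from the patent.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_from_frames(frames, sample_rate=8000, n_fft=512, n_filters=26, n_ceps=13):
    """Mel-frequency cepstral coefficients from pre-processed frames."""
    # fast Fourier transform and power spectrum of each frame
    spectrum = np.fft.rfft(frames, n_fft)
    power = (np.abs(spectrum) ** 2) / n_fft

    # triangular mel-scale filter bank between 0 Hz and the Nyquist frequency
    high_mel = 2595.0 * np.log10(1.0 + (sample_rate / 2) / 700.0)
    mel_points = np.linspace(0.0, high_mel, n_filters + 2)
    hz_points = 700.0 * (10.0 ** (mel_points / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_points / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)

    # mel power spectrum, logarithm, then discrete cosine transform
    mel_power = np.maximum(power @ fbank.T, np.finfo(float).eps)
    log_mel = np.log(mel_power)
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_ceps]
```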
Preferably, the speech distinguishing device further comprises an ASR-RNN model obtaining module 40, wherein the ASR-RNN model obtaining module 40 comprises an ASR speech feature obtaining unit 41 to be trained, an initialization unit 42, an output value obtaining unit 43 and an updating unit 44.
The to-be-trained ASR speech feature obtaining unit 41 is configured to obtain the speech data to be trained and extract the to-be-trained ASR speech features of the speech data to be trained.
An initialization unit 42 for initializing the RNN model.
An output value obtaining unit 43, configured to input the ASR speech feature to be trained into the RNN model, and obtain an output value of the RNN model according to a forward propagation algorithm, where the output value is represented as:
$$\hat{y}_t = \sigma\left(V h_t + c\right)$$
σ denotes the activation function, V denotes the weight of the connection between the hidden layer and the output layer, h_t denotes the hidden state at time t, and c denotes the offset between the hidden layer and the output layer.
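A minimal sketch of this forward propagation is given below, assuming a tanh hidden layer and a sigmoid output activation (the patent only denotes the output activation as σ); the function name rnn_forward is illustrative.

```python
import numpy as np

def rnn_forward(U, W, V, b, c, xs):
    """Forward propagation of a simple RNN over an input sequence xs."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    h = np.zeros(W.shape[0])                  # initial hidden state h_0
    hs, y_hats = [], []
    for x in xs:
        h = np.tanh(U @ x + W @ h + b)        # hidden state h_t
        hs.append(h)
        y_hats.append(sigmoid(V @ h + c))     # output value y_hat_t = sigma(V h_t + c)
    return hs, y_hats
```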
The updating unit 44 is configured to perform error back-propagation based on the output value, update the weight and the bias of each layer of the RNN model, and obtain the ASR-RNN model, where the formula for updating the weight V is:
$$V' = V - \alpha\sum_{t=1}^{\tau}\left(\hat{y}_t - y_t\right) h_t^{T}$$
V denotes the weight of the connection between the hidden layer and the output layer before updating, V' denotes the weight of the connection between the hidden layer and the output layer after updating, α denotes the learning rate, t denotes time t, τ denotes the total duration, ŷ_t denotes the predicted output value, y_t denotes the true output value, h_t denotes the hidden state at time t, and T denotes the matrix transpose operation; the formula for updating the bias c is:
$$c' = c - \alpha\sum_{t=1}^{\tau}\left(\hat{y}_t - y_t\right)$$
c represents the offset between the hidden layer and the output layer before updating, c' represents the offset between the hidden layer and the output layer after updating; the formula for updating the weight U is as follows:
$$U' = U - \alpha\sum_{t=1}^{\tau}\mathrm{diag}\left(1 - h_t \odot h_t\right)\delta_t\, x_t^{T}$$
U denotes the weight of the connection between the input layer and the hidden layer before updating, U' denotes the weight of the connection between the input layer and the hidden layer after updating, diag() denotes the matrix operation of constructing a diagonal matrix or returning the diagonal elements of a matrix as a vector, δ_t denotes the gradient of the hidden-layer state, and x_t denotes the ASR speech feature to be trained that is input at time t; the formula for updating the weight W is:
$$W' = W - \alpha\sum_{t=1}^{\tau}\mathrm{diag}\left(1 - h_t \odot h_t\right)\delta_t\, h_{t-1}^{T}$$
W denotes the weight of the connection between hidden layers before updating, and W' denotes the weight of the connection between hidden layers after updating; the formula for updating the bias b is:
$$b' = b - \alpha\sum_{t=1}^{\tau}\mathrm{diag}\left(1 - h_t \odot h_t\right)\delta_t$$
b denotes the offset between the input layer and the hidden layer before updating, and b' denotes the offset between the input layer and the hidden layer after updating.
The present embodiment provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the method for distinguishing between voices in the embodiment is implemented, and in order to avoid repetition, details are not repeated here. Alternatively, the computer program is executed by the processor to implement the functions of each module/unit in the voice differentiating apparatus in the embodiment, and is not described herein again to avoid repetition.
It is to be understood that the computer-readable storage medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), electrical carrier signals, telecommunication signals, and the like.
Fig. 9 is a schematic diagram of the computer apparatus in the present embodiment. As shown in fig. 9, the computer device 50 comprises a processor 51, a memory 52 and a computer program 53 stored in the memory 52 and executable on the processor 51. The processor 51, when executing the computer program 53, implements the various steps of the speech discrimination method in the embodiment, such as steps S10, S20 and S30 shown in fig. 2. Alternatively, the processor 51, when executing the computer program 53, implements the functions of the modules/units of the speech distinguishing device in the embodiment, such as the functions of the target to-be-distinguished speech data obtaining module 10, the speech feature obtaining module 20, the target distinguishing result obtaining module 30 and the ASR-RNN model obtaining module 40 shown in fig. 8.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (8)

1. A method for speech discrimination, comprising:
processing original voice data to be distinguished based on a voice activity detection algorithm to obtain target voice data to be distinguished;
acquiring corresponding ASR voice characteristics based on the target voice data to be distinguished;
inputting the ASR speech features into a pre-trained ASR-RNN model for distinguishing, and acquiring a target distinguishing result;
wherein the step of obtaining the ASR-RNN model comprises:
acquiring voice data to be trained, and extracting ASR voice features to be trained of the voice data to be trained;
initializing an RNN model;
inputting the ASR speech features to be trained into an RNN model, and acquiring an output value of the RNN model according to a forward propagation algorithm, wherein the output value is expressed as:
$$\hat{y}_t = \sigma\left(V h_t + c\right)$$
σ denotes the activation function, V denotes the weight of the connection between the hidden layer and the output layer, h_t denotes the hidden state at time t, and c denotes the offset between the hidden layer and the output layer;
and performing error back transmission based on the output value, updating the weight and bias of each layer of the RNN model, and obtaining the ASR-RNN model, wherein the formula for updating the weight V is as follows:
$$V' = V - \alpha\sum_{t=1}^{\tau}\left(\hat{y}_t - y_t\right) h_t^{T}$$
V denotes the weight of the connection between the hidden layer and the output layer before updating, V' denotes the weight of the connection between the hidden layer and the output layer after updating, α denotes the learning rate, t denotes time t, τ denotes the total duration, ŷ_t denotes the predicted output value, y_t denotes the true output value, h_t denotes the hidden state at time t, and T denotes the matrix transpose operation; the formula for updating the bias c is:
$$c' = c - \alpha\sum_{t=1}^{\tau}\left(\hat{y}_t - y_t\right)$$
c represents the offset between the hidden layer and the output layer before updating, c' represents the offset between the hidden layer and the output layer after updating; the formula for updating the weight U is as follows:
$$U' = U - \alpha\sum_{t=1}^{\tau}\mathrm{diag}\left(1 - h_t \odot h_t\right)\delta_t\, x_t^{T}$$
U denotes the weight of the connection between the input layer and the hidden layer before updating, U' denotes the weight of the connection between the input layer and the hidden layer after updating, diag() denotes the matrix operation of constructing a diagonal matrix or returning the diagonal elements of a matrix as a vector, δ_t denotes the gradient of the hidden-layer state, and x_t denotes the ASR speech feature to be trained that is input at time t; the formula for updating the weight W is:
$$W' = W - \alpha\sum_{t=1}^{\tau}\mathrm{diag}\left(1 - h_t \odot h_t\right)\delta_t\, h_{t-1}^{T}$$
W denotes the weight of the connection between hidden layers before updating, and W' denotes the weight of the connection between hidden layers after updating; the formula for updating the bias b is:
$$b' = b - \alpha\sum_{t=1}^{\tau}\mathrm{diag}\left(1 - h_t \odot h_t\right)\delta_t$$
b denotes the offset between the input layer and the hidden layer before updating, and b' denotes the offset between the input layer and the hidden layer after updating.
2. The voice differentiating method according to claim 1, wherein the processing the original voice data to be differentiated based on the voice activity detection algorithm to obtain the target voice data to be differentiated comprises:
processing the original voice data to be distinguished according to a short-time energy characteristic value calculation formula to obtain a corresponding short-time energy characteristic value, retaining the original voice data to be distinguished whose short-time energy characteristic value is greater than a first threshold, and determining it as first original distinguishing voice data, wherein the short-time energy characteristic value calculation formula is
$$E = \sum_{n=0}^{N-1} s^{2}(n)$$
wherein N is the length of the voice frame, s(n) is the signal amplitude in the time domain, and n is the time;
processing the original voice data to be distinguished according to a zero-crossing rate characteristic value calculation formula to obtain a corresponding zero-crossing rate characteristic value, retaining the original voice data to be distinguished whose zero-crossing rate characteristic value is smaller than a second threshold, and determining it as second original distinguishing voice data, wherein the zero-crossing rate characteristic value calculation formula is
$$Z = \frac{1}{2}\sum_{n=1}^{N-1}\left|\,\mathrm{sgn}\left(s(n)\right) - \mathrm{sgn}\left(s(n-1)\right)\right|$$
wherein N is the length of the voice frame, s(n) is the signal amplitude in the time domain, and n is the time;
and taking the first original distinguishing voice data and the second original distinguishing voice data as the target voice data to be distinguished.
3. The speech discrimination method according to claim 1, wherein the obtaining of the corresponding ASR speech features based on the target speech data to be discriminated comprises:
preprocessing the target voice data to be distinguished to obtain preprocessed voice data;
performing fast Fourier transform on the preprocessed voice data to obtain a frequency spectrum of the target voice data to be distinguished, and obtaining a power spectrum of the target voice data to be distinguished according to the frequency spectrum;
processing the power spectrum of the target voice data to be distinguished by adopting a Mel scale filter bank to obtain a Mel power spectrum of the target voice data to be distinguished;
and performing cepstrum analysis on the Mel power spectrum to obtain Mel frequency cepstrum coefficients of the target voice data to be distinguished.
4. The voice distinguishing method according to claim 3, wherein the preprocessing the target voice data to be distinguished to obtain preprocessed voice data includes:
pre-emphasis processing is carried out on the target voice data to be distinguished, wherein the pre-emphasis calculation formula is s'_n = s_n − a·s_{n−1}, in which s_n is the signal amplitude in the time domain, s_{n−1} is the signal amplitude at the moment preceding s_n, s'_n is the signal amplitude in the time domain after pre-emphasis, and a is the pre-emphasis coefficient with value range 0.9 < a < 1.0;
Carrying out frame division processing on the pre-emphasized target voice data to be distinguished;
windowing the voice data to be distinguished of the target after framing to obtain preprocessed voice data, wherein the calculation formula of windowing is as follows
$$s'_n = s_n\left(0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right)\right),\quad 0 \le n \le N-1$$
wherein N is the window length, n is the time, s_n is the signal amplitude in the time domain, and s'_n is the signal amplitude in the time domain after windowing.
5. The speech discrimination method of claim 3, wherein performing cepstrum analysis on the mel-power spectrum to obtain mel-frequency cepstrum coefficients of the target speech data to be discriminated comprises:
taking the logarithm value of the Mel power spectrum to obtain the Mel power spectrum to be transformed;
and performing discrete cosine transform on the Mel power spectrum to be transformed to obtain Mel frequency cepstrum coefficient of the target voice data to be distinguished.
6. A speech differentiating apparatus, comprising:
the target voice data to be distinguished acquisition module is used for processing the original voice data to be distinguished based on a voice activity detection algorithm to acquire the target voice data to be distinguished;
the voice feature acquisition module is used for acquiring corresponding ASR voice features based on the target voice data to be distinguished;
the target distinguishing result acquisition module is used for inputting the ASR speech features into a pre-trained ASR-RNN model for distinguishing and acquiring a target distinguishing result;
the speech distinguishing device also comprises an ASR-RNN model obtaining module used for obtaining the ASR-RNN model, and the ASR-RNN model obtaining module comprises:
the ASR voice feature acquisition unit to be trained is used for acquiring the voice data to be trained and extracting the ASR voice feature to be trained of the voice data to be trained;
an initialization unit for initializing the RNN model;
an output value obtaining unit, configured to input the ASR speech feature to be trained into the RNN model, and obtain an output value of the RNN model according to a forward propagation algorithm, where the output value is represented as:
$$\hat{y}_t = \sigma\left(V h_t + c\right)$$
σ denotes the activation function, V denotes the weight of the connection between the hidden layer and the output layer, h_t denotes the hidden state at time t, and c denotes the offset between the hidden layer and the output layer;
an updating unit, configured to perform error back transmission based on the output value, update the weight and the offset of each layer of the RNN model, and acquire the ASR-RNN model, wherein the formula for updating the weight V is:
$$V' = V - \alpha\sum_{t=1}^{\tau}\left(\hat{y}_t - y_t\right) h_t^{T}$$
V denotes the weight of the connection between the hidden layer and the output layer before updating, V' denotes the weight of the connection between the hidden layer and the output layer after updating, α denotes the learning rate, t denotes time t, τ denotes the total duration, ŷ_t denotes the predicted output value, y_t denotes the true output value, h_t denotes the hidden state at time t, and T denotes the matrix transpose operation; the formula for updating the bias c is:
$$c' = c - \alpha\sum_{t=1}^{\tau}\left(\hat{y}_t - y_t\right)$$
c represents the offset between the hidden layer and the output layer before updating, c' represents the offset between the hidden layer and the output layer after updating; the formula for updating the weight U is as follows:
$$U' = U - \alpha\sum_{t=1}^{\tau}\mathrm{diag}\left(1 - h_t \odot h_t\right)\delta_t\, x_t^{T}$$
U denotes the weight of the connection between the input layer and the hidden layer before updating, U' denotes the weight of the connection between the input layer and the hidden layer after updating, diag() denotes the matrix operation of constructing a diagonal matrix or returning the diagonal elements of a matrix as a vector, δ_t denotes the gradient of the hidden-layer state, and x_t denotes the ASR speech feature to be trained that is input at time t; the formula for updating the weight W is:
$$W' = W - \alpha\sum_{t=1}^{\tau}\mathrm{diag}\left(1 - h_t \odot h_t\right)\delta_t\, h_{t-1}^{T}$$
w represents the weight of the connection between the hidden layers before updating, and W' represents the weight of the connection between the hidden layers after updating; the formula for updating bias b is:
$$b' = b - \alpha\sum_{t=1}^{\tau}\mathrm{diag}\left(1 - h_t \odot h_t\right)\delta_t$$
b denotes the offset between the input layer and the hidden layer before updating, and b' denotes the offset between the input layer and the hidden layer after updating.
7. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the speech discrimination method according to any one of claims 1 to 5 when executing the computer program.
8. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the speech discrimination method according to one of claims 1 to 5.
CN201810561788.1A 2018-06-04 2018-06-04 Voice distinguishing method and device, computer equipment and storage medium Active CN108922513B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810561788.1A CN108922513B (en) 2018-06-04 2018-06-04 Voice distinguishing method and device, computer equipment and storage medium
PCT/CN2018/094190 WO2019232846A1 (en) 2018-06-04 2018-07-03 Speech differentiation method and apparatus, and computer device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810561788.1A CN108922513B (en) 2018-06-04 2018-06-04 Voice distinguishing method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN108922513A CN108922513A (en) 2018-11-30
CN108922513B true CN108922513B (en) 2023-03-17

Family

ID=64419509

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810561788.1A Active CN108922513B (en) 2018-06-04 2018-06-04 Voice distinguishing method and device, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN108922513B (en)
WO (1) WO2019232846A1 (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109658920B (en) * 2018-12-18 2020-10-09 百度在线网络技术(北京)有限公司 Method and apparatus for generating a model
CN109545192B (en) 2018-12-18 2022-03-08 百度在线网络技术(北京)有限公司 Method and apparatus for generating a model
CN109545193B (en) * 2018-12-18 2023-03-14 百度在线网络技术(北京)有限公司 Method and apparatus for generating a model
CN110265065B (en) * 2019-05-13 2021-08-03 厦门亿联网络技术股份有限公司 Method for constructing voice endpoint detection model and voice endpoint detection system
CN110189747A (en) * 2019-05-29 2019-08-30 大众问问(北京)信息科技有限公司 Voice signal recognition methods, device and equipment
CN110148401B (en) * 2019-07-02 2023-12-15 腾讯科技(深圳)有限公司 Speech recognition method, device, computer equipment and storage medium
CN110838307B (en) * 2019-11-18 2022-02-25 思必驰科技股份有限公司 Voice message processing method and device
CN113223511B (en) * 2020-01-21 2024-04-16 珠海市煊扬科技有限公司 Audio processing device for speech recognition
CN112581940A (en) * 2020-09-17 2021-03-30 国网江苏省电力有限公司信息通信分公司 Discharging sound detection method based on edge calculation and neural network
CN112598114B (en) * 2020-12-17 2023-11-03 海光信息技术股份有限公司 Power consumption model construction method, power consumption measurement method, device and electronic equipment
CN112908303A (en) * 2021-01-28 2021-06-04 广东优碧胜科技有限公司 Audio signal processing method and device and electronic equipment
CN117648717A (en) * 2024-01-29 2024-03-05 知学云(北京)科技股份有限公司 Privacy protection method for artificial intelligent voice training

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100580770C (en) * 2005-08-08 2010-01-13 中国科学院声学研究所 Voice end detection method based on energy and harmonic
US9263036B1 (en) * 2012-11-29 2016-02-16 Google Inc. System and method for speech recognition using deep recurrent neural networks
US9858919B2 (en) * 2013-11-27 2018-01-02 International Business Machines Corporation Speaker adaptation of neural network acoustic models using I-vectors
CN104157290B (en) * 2014-08-19 2017-10-24 大连理工大学 A kind of method for distinguishing speek person based on deep learning
CN105139864B (en) * 2015-08-17 2019-05-07 北京眼神智能科技有限公司 Audio recognition method and device
KR102450853B1 (en) * 2015-11-30 2022-10-04 삼성전자주식회사 Apparatus and method for speech recognition
CN107871497A (en) * 2016-09-23 2018-04-03 北京眼神科技有限公司 Audio recognition method and device
CN107799126B (en) * 2017-10-16 2020-10-16 苏州狗尾草智能科技有限公司 Voice endpoint detection method and device based on supervised machine learning
CN107680597B (en) * 2017-10-23 2019-07-09 平安科技(深圳)有限公司 Audio recognition method, device, equipment and computer readable storage medium
CN107731233B (en) * 2017-11-03 2021-02-09 王华锋 Voiceprint recognition method based on RNN

Also Published As

Publication number Publication date
CN108922513A (en) 2018-11-30
WO2019232846A1 (en) 2019-12-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant