WO2019232846A1 - Speech differentiation method and apparatus, and computer device and storage medium - Google Patents

Speech differentiation method and apparatus, and computer device and storage medium Download PDF

Info

Publication number
WO2019232846A1
WO2019232846A1 · PCT/CN2018/094190 · CN2018094190W
Authority
WO
WIPO (PCT)
Prior art keywords
speech
target
asr
voice data
data
Prior art date
Application number
PCT/CN2018/094190
Other languages
French (fr)
Chinese (zh)
Inventor
涂宏
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2019232846A1 publication Critical patent/WO2019232846A1/en

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L25/21 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being power information
    • G10L25/24 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/84 Detection of presence or absence of voice signals for discriminating voice from noise

Definitions

  • the present application relates to the field of speech processing, and in particular, to a method, a device, a computer device, and a storage medium for distinguishing speech.
  • Speech discrimination refers to silence filtering of the input speech: only the speech segments that are more meaningful for recognition (that is, the target speech) are retained.
  • Current speech discrimination methods have significant shortcomings, especially in the presence of noise: as the noise grows, speech discrimination becomes more difficult, the target speech cannot be accurately separated from the interfering speech, and the discrimination result is therefore not ideal.
  • the embodiments of the present application provide a method, an apparatus, a computer device, and a storage medium for speech discrimination, so as to solve the problem that the effect of speech discrimination is not ideal.
  • An embodiment of the present application provides a method for distinguishing speech, including:
  • processing original to-be-differentiated speech data based on a voice activity detection algorithm to obtain target to-be-differentiated speech data; acquiring corresponding ASR speech features based on the target to-be-differentiated speech data; and inputting the ASR speech features into a pre-trained ASR-RNN model for discrimination to obtain a target discrimination result.
  • An embodiment of the present application provides a voice distinguishing device, including:
  • Target to-be-differentiated voice data acquisition module for processing original to-be-differentiated voice data based on a voice activity detection algorithm, to obtain target to-be-differentiated voice data
  • a voice feature acquisition module configured to acquire a corresponding ASR voice feature based on the target to-be-differentiated voice data
  • a target discrimination result acquisition module is configured to input the ASR speech features into a pre-trained ASR-RNN model for discrimination, and obtain a target discrimination result.
  • An embodiment of the present application provides a computer device including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor.
  • When the processor executes the computer-readable instructions, the following steps are implemented:
  • processing original to-be-differentiated speech data based on a voice activity detection algorithm to obtain target to-be-differentiated speech data; acquiring corresponding ASR speech features based on the target to-be-differentiated speech data; and inputting the ASR speech features into a pre-trained ASR-RNN model for discrimination to obtain a target discrimination result.
  • This embodiment of the present application provides one or more non-volatile readable storage media storing computer-readable instructions.
  • When the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps:
  • processing original to-be-differentiated speech data based on a voice activity detection algorithm to obtain target to-be-differentiated speech data; acquiring corresponding ASR speech features based on the target to-be-differentiated speech data; and inputting the ASR speech features into a pre-trained ASR-RNN model for discrimination to obtain a target discrimination result.
  • FIG. 1 is an application environment diagram of a speech discrimination method according to an embodiment of the present application
  • FIG. 2 is a flowchart of a speech discrimination method according to an embodiment of the present application.
  • FIG. 3 is a specific flowchart of step S10 in FIG. 2;
  • FIG. 4 is a specific flowchart of step S20 in FIG. 2;
  • FIG. 5 is a specific flowchart of step S21 in FIG. 4;
  • FIG. 6 is a specific flowchart of step S24 in FIG. 4;
  • FIG. 7 is a specific flowchart of the step performed before step S30 in FIG. 2;
  • FIG. 8 is a schematic diagram of a voice distinguishing device according to an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a computer device in an embodiment of the present application.
  • FIG. 1 illustrates an application environment of a speech discrimination method provided by an embodiment of the present application.
  • The application environment of the speech discrimination method includes a server and a client, where the server and the client are connected through a network.
  • Clients are devices that can interact with users, including but not limited to computers, smartphones, and tablets.
  • the server can be implemented by an independent server or a server cluster composed of multiple servers.
  • the speech discrimination method provided in the embodiments of the present application is applied to a server.
  • FIG. 2 shows a flowchart of the voice discrimination method in this embodiment.
  • the voice discrimination method includes the following steps:
  • S10 The original to-be-distinguished speech data is processed based on the voice activity detection algorithm, and the target to-be-distinguished speech data is obtained.
  • A VAD (Voice Activity Detection) algorithm is an algorithm specifically used for voice activity detection, and such algorithms come in various types. Understandably, VAD can be applied to speech discrimination to distinguish target speech from interfering speech.
  • The target speech refers to the portion of the speech data in which the voiceprint changes significantly and continuously; the interfering speech may be a portion of the speech data with no pronunciation due to silence, or it may be environmental noise.
  • The original to-be-differentiated speech data is the to-be-differentiated speech data as originally obtained, that is, the speech data that is to undergo preliminary discrimination using a VAD algorithm.
  • The target to-be-differentiated speech data refers to the speech data obtained after the original to-be-differentiated speech data has been processed for speech discrimination using the voice activity detection algorithm.
  • The VAD algorithm is used to process the original to-be-differentiated speech data: the target speech is initially screened out of the original to-be-differentiated speech data, and the initially screened target speech portion is used as the target to-be-differentiated speech data. Understandably, the interfering speech removed in this initial screening does not need to be distinguished again, which improves the efficiency of speech discrimination. However, the target speech initially screened from the original to-be-differentiated speech data still contains interfering speech; in particular, when the original to-be-differentiated speech data is relatively noisy, more interfering speech (such as noise) remains mixed with the preliminary target speech, and at that point the VAD algorithm alone clearly cannot distinguish the speech effectively.
  • Therefore, the initially screened target speech, still mixed with interfering speech, is used as the target to-be-differentiated speech data so that it can be distinguished more accurately later.
  • Using the VAD algorithm to perform preliminary speech discrimination on the original to-be-differentiated speech data narrows down the original to-be-differentiated speech data and removes a large amount of interfering speech at the same time, which benefits the subsequent, finer speech discrimination.
  • processing the original to-be-differentiated voice data based on the voice activity detection algorithm to obtain the target to-be-differentiated voice data includes the following steps:
  • S11 Process the original to-be-distinguished speech data according to the short-time energy feature value calculation formula, obtain the corresponding short-time energy feature value, retain the original to-be-distinguished speech data whose short-time energy feature value is greater than a first threshold, and determine it as the first original distinguished speech data. The short-time energy feature value calculation formula is $E=\sum_{n=1}^{N}s^{2}(n)$, where N is the speech frame length, s(n) is the signal amplitude in the time domain, and n is the time.
  • the short-term energy characteristic value describes the energy corresponding to a frame of speech (a frame generally takes 10-30ms) in its time domain.
  • the "short-term" of this short-term energy should be understood as the time of a frame (that is, speech Frame length). Since the short-term energy feature value of the target voice is much higher than the short-term energy feature value of the interfering voice (silence), the target voice and the interfering voice can be distinguished according to the short-term energy feature value.
  • The original to-be-distinguished speech data is processed according to the short-time energy feature value calculation formula (the original to-be-distinguished speech data needs to be framed in advance), and the short-time energy feature value of each frame is calculated. The short-time energy feature value of each frame is compared with a preset first threshold; the original to-be-distinguished speech data whose value is greater than the first threshold is retained and determined as the first original distinguished speech data.
  • the first threshold is a cut-off value for measuring whether the short-term energy characteristic value belongs to the target speech or the interference speech.
  • In this way, the target speech in the original to-be-distinguished speech data can be obtained from the perspective of the short-time energy feature value, and a large amount of interfering speech is effectively removed from the original to-be-distinguished speech data.
  • S12 Process the original to-be-distinguished speech data according to the zero-crossing rate feature value calculation formula, obtain the corresponding zero-crossing rate feature value, retain the original to-be-distinguished speech data whose zero-crossing rate feature value is less than a second threshold, and determine it as the second original distinguished speech data. The zero-crossing rate feature value calculation formula is $Z=\frac{1}{2}\sum_{n=2}^{N}\left|\operatorname{sgn}(s(n))-\operatorname{sgn}(s(n-1))\right|$, where N is the speech frame length, s(n) is the signal amplitude in the time domain, and n is the time.
  • The zero-crossing rate feature value describes the number of times the speech signal waveform crosses the horizontal axis (zero level) within one frame of speech. Since the zero-crossing rate feature value of the target speech is much lower than that of the interfering speech, the target speech and the interfering speech can be distinguished according to the zero-crossing rate feature value.
  • The original to-be-distinguished speech data is processed according to the zero-crossing rate feature value calculation formula, and the zero-crossing rate feature value of each frame of the original to-be-distinguished speech data is calculated. Each value is compared with a preset second threshold; the original to-be-distinguished speech data whose value is smaller than the second threshold is retained and determined as the second original distinguished speech data.
  • The second threshold is a cut-off value for measuring whether the zero-crossing rate feature value belongs to the target speech or the interfering speech.
  • In this way, the target speech in the original to-be-distinguished speech data can be obtained from the perspective of the zero-crossing rate feature value, and a large amount of interfering speech is effectively removed from the original to-be-distinguished speech data.
  • S13 Use the first original distinguished speech data and the second original distinguished speech data as the target to-be-distinguished speech data.
  • The first original distinguished speech data is obtained by distinguishing the original to-be-distinguished speech data from the perspective of the short-time energy feature value, while the second original distinguished speech data is obtained by distinguishing the original to-be-distinguished speech data from the perspective of the zero-crossing rate feature value.
  • The first and second original distinguished speech data thus come from different perspectives of speech discrimination, and both perspectives distinguish the speech well. Therefore, the first original distinguished speech data and the second original distinguished speech data are merged (by taking their intersection) and together used as the target to-be-distinguished speech data.
  • Steps S11-S13 can initially and effectively remove most of the interfering speech data in the original to-be-distinguished speech data, retaining the original to-be-distinguished speech data in which the target speech is still mixed with a small amount of interfering speech (such as noise); this retained data is used as the target to-be-distinguished speech data, which achieves an effective preliminary speech discrimination of the original to-be-distinguished speech data.
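  • For illustration only, the following minimal sketch (in Python with NumPy) shows how frames could be retained based on short-time energy and zero-crossing rate as in steps S11-S13. The function names, frame sizes, and threshold values are hypothetical and not taken from the patent; the signal is assumed to be at least one frame long.

```python
import numpy as np

def frame_signal(s, frame_len=400, hop_len=200):
    """Split a 1-D signal into frames of frame_len samples (one frame per row)."""
    n_frames = 1 + (len(s) - frame_len) // hop_len
    return np.stack([s[i * hop_len: i * hop_len + frame_len]
                     for i in range(n_frames)])

def vad_filter(s, frame_len=400, hop_len=200,
               energy_thresh=1e-3, zcr_thresh=100):
    """Steps S11-S13: keep frames whose short-time energy exceeds the first
    threshold AND whose zero-crossing rate is below the second threshold."""
    frames = frame_signal(s, frame_len, hop_len)
    # S11: E = sum over the frame of s(n)^2
    energy = np.sum(frames ** 2, axis=1)
    # S12: Z = 0.5 * sum over the frame of |sgn(s(n)) - sgn(s(n-1))|
    zcr = 0.5 * np.sum(np.abs(np.diff(np.sign(frames), axis=1)), axis=1)
    # S13: merge the two criteria by taking their intersection
    keep = (energy > energy_thresh) & (zcr < zcr_thresh)
    return frames[keep]
```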
  • S20 Obtain the corresponding ASR voice characteristics based on the target to-be-differentiated voice data.
  • ASR (Automatic Speech Recognition) is a technology that converts speech data into computer-readable input, for example, converting speech data into keys, binary codes, or character sequences.
  • ASR can extract the speech features in the target to-be-distinguished speech data, and the extracted features are the corresponding ASR speech features.
  • ASR can convert voice data that cannot be read directly by a computer into ASR voice features that can be read by a computer, and the ASR voice features can be represented in a vector manner.
  • ASR is used to process the target to-be-differentiated voice data to obtain the corresponding ASR voice characteristics.
  • These ASR speech features can well reflect the latent characteristics of the target to-be-distinguished speech data, allow the speech data to be distinguished, and provide an important technical prerequisite for the subsequent recognition by the ASR-RNN (RNN, Recurrent Neural Networks) model based on the ASR speech features.
  • step S20 acquiring the corresponding ASR voice feature based on the target to-be-differentiated voice data includes the following steps:
  • S21 Preprocess the target to-be-differentiated voice data to obtain preprocessed voice data.
  • the target to-be-differentiated voice data is pre-processed, and corresponding pre-processed voice data is obtained.
  • Preprocessing the target to-be-distinguished speech data allows the ASR speech features of the target to-be-distinguished speech data to be extracted better, so that the extracted ASR speech features better represent the target to-be-distinguished speech data and can be used for speech discrimination.
  • pre-processing the target to-be-differentiated voice data to obtain pre-processed voice data includes the following steps:
  • S211 Perform pre-emphasis processing on the target to-be-distinguished speech data.
  • pre-emphasis is a signal processing method that compensates the high-frequency component of the input signal at the transmitting end.
  • the idea of the pre-emphasis technology is to enhance the high-frequency component of the signal at the transmitting end of the transmission line to compensate for the excessive attenuation of the high-frequency component during transmission, so that the receiving end can obtain a better signal waveform.
  • Pre-emphasis has no effect on noise, so it can effectively improve the output signal-to-noise ratio.
  • The pre-emphasis formula is $s'(n)=s(n)-a\cdot s(n-1)$, where a is the weighting coefficient and the range of a is 0.9 < a < 1.0.
  • A value of a = 0.97 gives a good pre-emphasis effect.
  • Using pre-emphasis processing can eliminate interference caused by the vocal cords and lips during utterance, effectively compensate the suppressed high-frequency part of the target to-be-distinguished speech data, highlight the high-frequency formants of the target to-be-distinguished speech data, and strengthen the signal amplitude of the target to-be-distinguished speech data, which helps to extract the ASR speech features.
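  • As an illustration, a minimal pre-emphasis sketch is shown below, assuming the formula s'(n) = s(n) - a·s(n-1) with a = 0.97; the function name is hypothetical.

```python
import numpy as np

def pre_emphasis(s, a=0.97):
    """s'(n) = s(n) - a * s(n-1); the first sample is kept unchanged."""
    return np.append(s[0], s[1:] - a * s[:-1])
```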
  • S212 Frame the target to-be-differentiated voice data after pre-emphasis.
  • Framing refers to the speech processing technology that cuts the entire voice signal into several segments.
  • the size of each frame is in the range of 10-30ms, and the frame shift is about 1/2 frame length.
  • Frame shift refers to the overlapping area between two adjacent frames, which can avoid the problem of excessive changes in adjacent two frames.
  • Frame processing is performed on the target to-be-differentiated voice data, which can divide the target to-be-differentiated voice data into several pieces of voice data, and the target to-be-differentiated voice data can be subdivided to facilitate the extraction of ASR voice features.
  • S213 Perform windowing on the framed target to-be-differentiated voice data to obtain pre-processed voice data.
  • The windowing calculation formula is $s'(n)=s(n)\cdot w(n)$, where N is the window length, n is the time, s(n) is the signal amplitude in the time domain, and s'(n) is the signal amplitude in the time domain after windowing.
  • The window function may be a Hamming window, in which case $w(n)=0.54-0.46\cos\left(\frac{2\pi n}{N-1}\right)$, $0\le n\le N-1$, where N is the Hamming window length, n is the time, s(n) is the signal amplitude in the time domain, and s'(n) is the signal amplitude in the time domain after windowing.
  • Windowing the framed target to-be-distinguished speech data to obtain the pre-processed speech data makes the time-domain signal of the target to-be-distinguished speech data continuous after framing, which helps to extract the ASR speech features of the target to-be-distinguished speech data.
  • The pre-processing operations on the target to-be-distinguished speech data in steps S211 to S213 provide a basis for extracting the ASR speech features of the target to-be-distinguished speech data; they make the extracted ASR speech features more representative of the target to-be-distinguished speech data, so that speech discrimination can be performed according to these ASR speech features.
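  • A minimal framing-and-windowing sketch follows, assuming 25 ms frames with roughly half-frame shift and a Hamming window; the sampling rate and helper names are hypothetical and not specified by the patent.

```python
import numpy as np

def frame_and_window(s, sample_rate=16000, frame_ms=25, shift_ms=12.5):
    """Cut the signal into frames and apply a Hamming window
    w(n) = 0.54 - 0.46 * cos(2*pi*n / (N - 1)) to each frame."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * shift_ms / 1000)
    window = np.hamming(frame_len)
    n_frames = 1 + (len(s) - frame_len) // hop_len
    frames = np.stack([s[i * hop_len: i * hop_len + frame_len]
                       for i in range(n_frames)])
    return frames * window  # pre-processed speech data, one frame per row
```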
  • S22 Perform a fast Fourier transform on the pre-processed speech data to obtain the frequency spectrum of the target speech data to be distinguished, and obtain the power spectrum of the target speech data to be distinguished according to the frequency spectrum.
  • FFT (Fast Fourier Transform) is a collective term for efficient and fast methods of computing the discrete Fourier transform with a computer.
  • the use of this algorithm can greatly reduce the number of multiplications required by the computer to calculate the discrete Fourier transform. In particular, the more the number of transformed sampling points, the more significant the FFT algorithm's computational savings will be.
  • fast Fourier transform is performed on the pre-processed voice data to convert the pre-processed voice data from the signal amplitude in the time domain to the signal amplitude (spectrum) in the frequency domain.
  • The formula for calculating the spectrum is $S(k)=\sum_{n=1}^{N}s(n)\,e^{-2\pi i k n / N}$, $1\le k\le N$, where N is the frame size, S(k) is the signal amplitude in the frequency domain, s(n) is the signal amplitude in the time domain, n is the time, and i is the imaginary unit.
  • the power spectrum of the pre-processed voice data can be directly obtained according to the frequency spectrum.
  • the power spectrum of the pre-processed voice data is hereinafter referred to as the power spectrum of the target voice data to be distinguished.
  • The formula for calculating the power spectrum of the target to-be-distinguished speech data is $P(k)=\frac{1}{N}\left|S(k)\right|^{2}$, $1\le k\le N$, where N is the frame size and S(k) is the signal amplitude in the frequency domain.
  • In this way, the pre-processed speech data is converted from the signal amplitude in the time domain to the signal amplitude in the frequency domain, and the power spectrum of the target to-be-distinguished speech data is then obtained from the signal amplitude in the frequency domain, providing an important technical basis for extracting the ASR speech features from the power spectrum.
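  • As a sketch (assuming an FFT length equal to the frame length N and using the one-sided spectrum; the function name is illustrative), the spectrum and power spectrum of each pre-processed frame could be computed as:

```python
import numpy as np

def power_spectrum(frames):
    """S(k) = FFT of each frame; P(k) = |S(k)|^2 / N."""
    n = frames.shape[1]
    spectrum = np.fft.rfft(frames, n=n, axis=1)   # frequency-domain amplitude
    return (np.abs(spectrum) ** 2) / n            # power spectrum, per frame
```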
  • S23 Use the Mel scale filter bank to process the power spectrum of the target speech data to be distinguished, and obtain the Mel power spectrum of the target speech data to be distinguished.
  • Processing the power spectrum of the target to-be-distinguished speech data with a Mel-scale filter bank amounts to performing a Mel frequency analysis of the power spectrum, and Mel frequency analysis is an analysis based on human auditory perception.
  • The human ear acts like a filter bank and focuses only on certain specific frequency components (human hearing is selective with respect to frequency); that is, the ear only lets signals of certain frequencies pass and simply ignores frequency components it does not wish to perceive.
  • These filters are not uniformly distributed on the frequency axis: in the low-frequency region there are many filters and they are densely distributed, whereas in the high-frequency region the number of filters becomes relatively small and their distribution is sparse. Understandably, the Mel-scale filter bank has high resolution in the low-frequency part, which is consistent with the hearing characteristics of the human ear; this is also the physical meaning of the Mel scale.
  • a Mel scale filter bank is used to process the power spectrum of the target speech data to be distinguished, and a Mel power spectrum of the target speech data to be distinguished is obtained.
  • Specifically, the frequency-domain signal is segmented by the Mel-scale filter bank so that each frequency segment corresponds to one numerical value; if the number of filters is 22, 22 energy values corresponding to the Mel power spectrum of the target to-be-distinguished speech data are obtained.
  • The Mel power spectrum obtained after this analysis retains the frequency portion closely related to the characteristics of the human ear, and this frequency portion can well reflect the characteristics of the target to-be-distinguished speech data.
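  • A minimal Mel filter-bank sketch is given below, assuming 22 triangular filters as in the example above. The Mel/Hz conversion formulas used are the standard ones and the helper names are hypothetical; this is not the patent's exact implementation.

```python
import numpy as np

def mel_filterbank(n_filters=22, n_fft=400, sample_rate=16000):
    """Build triangular filters evenly spaced on the Mel scale."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(0.0, hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fb[i - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    return fb

def mel_power_spectrum(power_spec, fb):
    """One energy value per filter, per frame: the Mel power spectrum."""
    return power_spec @ fb.T
```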
  • S24 Perform cepstrum analysis on the Mel power spectrum to obtain a Mel frequency cepstrum coefficient of the target speech data to be distinguished.
  • Cepstrum refers to the inverse Fourier transform of the logarithm of the Fourier-transform spectrum of a signal; since the general Fourier spectrum is a complex spectrum, the cepstrum is also called the complex cepstrum.
  • a cepstrum analysis is performed on the Mel power spectrum, and based on the cepstrum result, the Mel frequency cepstrum coefficient of the target speech data to be distinguished is analyzed and obtained.
  • Through cepstrum analysis of the Mel power spectrum, the features contained in the Mel power spectrum of the target to-be-distinguished speech data, whose original feature dimension is too high, can be converted directly into easy-to-use features.
  • The Mel frequency cepstrum coefficients can therefore be used as the ASR speech features, that is, coefficients that distinguish different kinds of speech.
  • the ASR voice feature can reflect the difference between voices, and can be used to identify and distinguish target to-be-differentiated voice data.
  • step S24 cepstrum analysis is performed on the Mel power spectrum to obtain the Mel frequency cepstrum coefficient of the target speech data to be distinguished, including the following steps:
  • S241 Take the logarithm (log) of the Mel power spectrum to obtain the Mel power spectrum m to be transformed.
  • S242 Perform discrete cosine transform on the Mel power spectrum to be transformed to obtain a Mel frequency cepstrum coefficient of the target speech data to be distinguished.
  • Specifically, a discrete cosine transform (DCT) is performed on the Mel power spectrum m to be transformed to obtain the corresponding Mel frequency cepstrum coefficients of the target to-be-distinguished speech data.
  • Generally, the second to thirteenth coefficients are taken and used as the ASR speech features, since they reflect the differences between speech data.
  • The discrete cosine transform performs dimensionality reduction and abstraction on the Mel power spectrum m to be transformed, and the corresponding ASR speech features are obtained. Compared with the Fourier transform, the result of the discrete cosine transform has no imaginary part, which gives it an obvious advantage in terms of computation.
  • Steps S21-S24 are based on the ASR technology to perform feature extraction on the target to-be-differentiated voice data.
  • the final obtained ASR speech feature can well reflect the target to-be-differentiated voice data.
  • The ASR speech features can then be used in deep network model training to obtain the ASR-RNN model; this makes the ASR-RNN model obtained by training more accurate when distinguishing speech, so that it can accurately distinguish noise from speech even under very noisy conditions.
  • the features extracted above are Mel frequency cepstrum coefficients.
  • However, the ASR speech features should not be limited to Mel frequency cepstrum coefficients only; any speech features obtained by ASR technology that can effectively reflect the speech data may be used as ASR speech features for recognition and model training.
  • S30 The ASR speech features are input into a pre-trained ASR-RNN model for discrimination, and a target discrimination result is obtained.
  • The ASR-RNN model refers to a recurrent neural network (RNN) model trained using ASR speech features.
  • the ASR-RNN model is trained using ASR speech features extracted from the speech data to be trained, so the model can recognize ASR speech features and distinguish speech based on ASR speech features.
  • the speech data to be trained includes target speech and noise.
  • During ASR-RNN model training, the ASR speech features of the target speech and the ASR speech features of the noise are extracted, so that the trained ASR-RNN model can recognize, based on the ASR speech features, both the target speech and the interfering speech (noise) in speech data.
  • When VAD is used to distinguish the original to-be-differentiated speech data, most of the interfering speech has already been removed, such as the silent (unvoiced) parts of the speech data and part of the noise; therefore the interfering speech distinguished by the ASR-RNN model here specifically refers to the noise part, which achieves the purpose of effectively distinguishing the target speech from the interfering speech.
  • The ASR speech features are input into the pre-trained ASR-RNN model for discrimination. Since ASR speech features reflect the characteristics of the speech data, the ASR-RNN model can recognize the ASR speech features extracted from the target to-be-distinguished speech data, so that the target to-be-distinguished speech data is accurately distinguished based on these ASR speech features.
  • This pre-trained ASR-RNN model combines ASR speech features with the recurrent neural network's ability to extract features in depth, and distinguishes speech based on the ASR speech features of the speech data; it still achieves very high accuracy under very bad noise conditions.
  • Moreover, because the features extracted by ASR also include the ASR speech features of noise, noise can likewise be distinguished accurately, whereas current speech discrimination methods (including but not limited to VAD) are strongly affected by noise.
  • In an embodiment, before step S30 of inputting the ASR speech features into a pre-trained ASR-RNN model for discrimination and obtaining a target discrimination result, the speech discrimination method further includes a step of obtaining the ASR-RNN model.
  • Specifically, the step of obtaining the ASR-RNN model includes:
  • S31 Acquire the speech data to be trained, and extract the to-be-trained ASR speech features.
  • the voice data to be trained refers to a training set of voice data required for training the ASR-RNN model.
  • The speech data to be trained may directly be an open-source speech training set, or a speech training set built by collecting a large amount of sample speech data.
  • In the speech data to be trained, the target speech and the interfering speech (here, specifically noise) are distinguished in advance; a specific way of doing so is to set different label values for the target speech and the noise. For example, all target speech parts in the speech data to be trained are marked as 1 (representing "true"), and noise parts are marked as 0 (representing "false").
  • By setting the label values in advance, the recognition accuracy of the ASR-RNN model can be tested, which provides a reference for improvement, so that the network parameters in the ASR-RNN model can be updated and the ASR-RNN model continuously optimized.
  • The ratio of target speech to noise may specifically be 1:1; adopting this ratio avoids over-fitting caused by unequal amounts of target speech and noise in the speech data to be trained.
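  • For illustration, a hypothetical helper for assembling such a balanced, labeled training set (names and the shuffling step are assumptions, not part of the patent) might look like:

```python
import numpy as np

def build_training_set(target_frames, noise_frames):
    """Label target-speech frames 1 ('true') and noise frames 0 ('false'),
    truncating both sets to a 1:1 ratio to avoid over-fitting."""
    n = min(len(target_frames), len(noise_frames))
    x = np.concatenate([target_frames[:n], noise_frames[:n]])
    y = np.concatenate([np.ones(n, dtype=int), np.zeros(n, dtype=int)])
    order = np.random.permutation(len(x))   # shuffle the combined set
    return x[order], y[order]
```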
  • overfitting refers to the phenomenon that the assumptions become too strict in order to obtain a consistent hypothesis. Avoiding overfitting is a core task in classifier design.
  • The speech data to be trained is obtained and its features are extracted; these features are the to-be-trained ASR speech features.
  • The steps for extracting the to-be-trained ASR speech features are the same as steps S21-S24 and are not repeated here.
  • The speech data to be trained includes training samples of the target speech and training samples of noise, and both parts have their own ASR speech features. Therefore, the to-be-trained ASR speech features can be extracted and used to train the corresponding ASR-RNN model, so that the ASR-RNN model obtained by training on the to-be-trained ASR speech features can accurately distinguish the target speech from the noise (noise being a kind of interfering speech).
  • the RNN model is a recurrent neural network model.
  • the RNN model includes an input layer, a hidden layer, and an output layer composed of neurons.
  • the RNN model includes the weights and biases of each neuron connection between the layers. These weights and biases determine the nature and recognition effect of the RNN model.
  • RNN is a neural network that models sequence data (such as time series), that is, the current output of a sequence is related to the previous output.
  • Specifically, the network remembers the state of the previous hidden layer and applies it to the current output calculation; that is, the nodes between hidden layers are no longer unconnected but connected, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous moment. Because speech data has temporal characteristics, the RNN model can be trained with the speech data to be trained to accurately extract the respective deep features of the target speech and the interfering speech over time, achieving accurate speech discrimination.
  • S32 Initialize the RNN model.
  • This initialization operation is to set the initial values of weights and offsets in the RNN model.
  • The initial values can be set to small values, for example within the interval [-0.3, 0.3].
  • Reasonable initialization of the RNN model gives the model more flexible adjustment capability in the early stage, so that the model can be adjusted effectively during training; otherwise, very poor adjustment capability in the initial stage would result in a trained model whose discrimination is not good.
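  • A minimal initialization sketch follows, drawing the weights U, W, V and offsets b, c uniformly from [-0.3, 0.3] as described above; the dictionary layout, dimensions, and seed are hypothetical choices.

```python
import numpy as np

def init_rnn(n_input, n_hidden, n_output, scale=0.3, seed=0):
    """Initialise weights U, W, V and offsets b, c uniformly in [-0.3, 0.3]."""
    rng = np.random.default_rng(seed)
    return {
        'U': rng.uniform(-scale, scale, (n_hidden, n_input)),   # input -> hidden
        'W': rng.uniform(-scale, scale, (n_hidden, n_hidden)),  # hidden -> hidden
        'V': rng.uniform(-scale, scale, (n_output, n_hidden)),  # hidden -> output
        'b': rng.uniform(-scale, scale, (n_hidden, 1)),
        'c': rng.uniform(-scale, scale, (n_output, 1)),
    }
```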
  • S33 Input the ASR speech features to be trained into the RNN model, and obtain the output value of the RNN model according to the forward propagation algorithm.
  • The output value is expressed as $\hat{y}_t=\sigma(Vh_t+c)$, where σ represents the activation function, V represents the weight of the connection between the hidden layer and the output layer, $h_t$ represents the hidden state at time t, and c represents the offset between the hidden layer and the output layer.
  • The process of RNN forward propagation is a series of linear and activation operations performed in the RNN model along the time series, based on the weights and offsets of each neuron in the RNN model and the input ASR speech features.
  • Because the RNN is a neural network that models sequence (specifically, time-series) data, the hidden state $h_t$ at time t must be computed jointly from the hidden state $h_{t-1}$ at time t-1 and the ASR speech feature $x_t$ input at time t: $h_t=\sigma(Ux_t+Wh_{t-1}+b)$, where U represents the weight of the connection between the input layer and the hidden layer, W represents the weight of the connections between hidden layers (the time-series connection between hidden layers), $h_{t-1}$ represents the hidden state at time t-1, and b represents the offset between the input layer and the hidden layer.
  • The output of the output layer (that is, the output value of the RNN model) is expressed as $\hat{y}_t=\sigma(Vh_t+c)$, where the activation function σ used here can be the softmax function (the softmax function works well for classification problems), V represents the weight of the connection between the hidden layer and the output layer, $h_t$ represents the hidden state at time t, and c represents the offset between the hidden layer and the output layer.
  • This output value of the RNN model (the output of the output layer), that is, the value calculated layer by layer through the forward propagation algorithm, can be called the predicted output value $\hat{y}_t$.
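  • A minimal forward-propagation sketch under these formulas is shown below, assuming a tanh hidden activation and a softmax output; the function names and the use of column vectors are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z, axis=0, keepdims=True))
    return e / np.sum(e, axis=0, keepdims=True)

def rnn_forward(params, xs):
    """Forward propagation over a sequence of ASR feature vectors xs.
    h_t = tanh(U x_t + W h_{t-1} + b);  y_hat_t = softmax(V h_t + c)."""
    U, W, V = params['U'], params['W'], params['V']
    b, c = params['b'], params['c']
    h = np.zeros((W.shape[0], 1))
    hs, ys = [], []
    for x in xs:                      # xs: list of column vectors (n_input, 1)
        h = np.tanh(U @ x + W @ h + b)
        hs.append(h)
        ys.append(softmax(V @ h + c))
    return hs, ys
```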
  • S34 Perform error back propagation based on the output value, update the weights and offsets of each layer of the RNN model, and obtain the ASR-RNN model.
  • After the server obtains the output value of the RNN model, it can update and adjust the network parameters (weights and offsets) in the RNN model according to this output value, so that the obtained model produces accurate recognition results from the differences between the ASR speech features of the target speech and those of the interfering speech, together with their behaviour over time.
  • Specifically, after the server obtains the output value (predicted output value) of the RNN model according to the forward propagation algorithm, it can use the to-be-trained ASR speech features with preset label values to calculate the error produced by the to-be-trained ASR speech features during training of the RNN model, and construct a suitable error function based on this error (for example, a logarithmic error function can be used to represent the error). The server then uses this error function for error back propagation, adjusting and updating the weights (U, W, and V) and offsets (b and c) of each layer of the RNN model.
  • The preset label value can be called the real output value (that is, it represents the objective fact: a label value of 1 represents target speech, and a label value of 0 represents interfering speech), and is denoted $y_t$.
  • At each position in the time series, the RNN model produces an error when calculating the forward output, so the error function L can be expressed as $L=\sum_{t=1}^{\tau}L_t$, where t refers to time t, τ represents the total duration, and $L_t$ represents the error generated at time t.
  • After the server obtains the error function, it can update the weights and offsets of the RNN model according to BPTT (Back Propagation Through Time) to obtain the ASR-RNN model trained on the to-be-trained ASR speech features.
  • The formula for updating the weight V is $V'=V-\eta\sum_{t=1}^{\tau}(\hat{y}_t-y_t)\,h_t^{T}$, where V represents the weight of the connection between the hidden layer and the output layer before the update, V' represents that weight after the update, η represents the learning rate, $\hat{y}_t$ represents the predicted output value, $y_t$ represents the real output value, $h_t$ represents the hidden state at time t, and T represents the matrix transposition operation.
  • The formula for updating the offset c is $c'=c-\eta\sum_{t=1}^{\tau}(\hat{y}_t-y_t)$, where c represents the offset between the hidden layer and the output layer before the update, and c' represents that offset after the update.
  • For weight U, weight W and offset b, the gradient loss at a given time t is determined jointly by the gradient loss corresponding to the output at the current position and the gradient loss at time t+1. Therefore, updating weight U, weight W and offset b requires the gradient $\delta_t$ of the hidden-layer state.
  • The gradient of the hidden-layer state at time t is $\delta_t=\partial L/\partial h_t$. Since there is a relation between $\delta_{t+1}$ and $\delta_t$, $\delta_t$ can be determined from $\delta_{t+1}$; the relation is $\delta_t=W^{T}\operatorname{diag}\!\left(1-h_{t+1}\odot h_{t+1}\right)\delta_{t+1}+V^{T}(\hat{y}_t-y_t)$, where $\delta_{t+1}$ represents the gradient of the hidden-layer state at time t+1, diag() represents a calculation function for matrix operations used to construct a diagonal matrix or to return the diagonal elements of a matrix as a vector, and $h_{t+1}$ represents the hidden state at time t+1.
  • The formula for updating the weight U is $U'=U-\eta\sum_{t=1}^{\tau}\operatorname{diag}\!\left(1-h_t\odot h_t\right)\delta_t\,x_t^{T}$, where U represents the weight of the connection between the input layer and the hidden layer before the update, U' represents that weight after the update, diag() represents the matrix-operation function that constructs a diagonal matrix or returns the diagonal elements of a matrix as a vector, $\delta_t$ represents the gradient of the hidden-layer state, and $x_t$ represents the to-be-trained ASR speech features at time t;
  • the formula for updating the weight W is $W'=W-\eta\sum_{t=1}^{\tau}\operatorname{diag}\!\left(1-h_t\odot h_t\right)\delta_t\,h_{t-1}^{T}$, where W represents the weight of the connections between the hidden layers before the update and W' represents that weight after the update;
  • the formula for updating the offset b is $b'=b-\eta\sum_{t=1}^{\tau}\operatorname{diag}\!\left(1-h_t\odot h_t\right)\delta_t$, where b represents the offset between the input layer and the hidden layer before the update, and b' represents that offset after the update.
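  • A minimal BPTT sketch following the update formulas reconstructed above is given below. It assumes a tanh hidden activation, a softmax output with a logarithmic (cross-entropy) error, and one-hot label vectors; the function names, shapes, and learning rate are hypothetical and this is not the patent's exact implementation.

```python
import numpy as np

def bptt_update(params, xs, hs, ys_hat, ys_true, lr=0.01):
    """One gradient step of Back Propagation Through Time (BPTT)."""
    U, W, V = params['U'], params['W'], params['V']
    b, c = params['b'], params['c']
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    db, dc = np.zeros_like(b), np.zeros_like(c)
    delta_next = np.zeros_like(b)        # delta_{t+1}; zero beyond the last step
    h_next = np.zeros_like(b)            # h_{t+1}
    for t in reversed(range(len(xs))):
        err = ys_hat[t] - ys_true[t]     # (y_hat_t - y_t), shape (n_output, 1)
        dV += err @ hs[t].T              # sum_t (y_hat_t - y_t) h_t^T
        dc += err
        # delta_t = W^T diag(1 - h_{t+1}^2) delta_{t+1} + V^T (y_hat_t - y_t)
        delta = W.T @ ((1 - h_next ** 2) * delta_next) + V.T @ err
        grad_pre = (1 - hs[t] ** 2) * delta          # diag(1 - h_t^2) delta_t
        h_prev = hs[t - 1] if t > 0 else np.zeros_like(hs[t])
        dU += grad_pre @ xs[t].T
        dW += grad_pre @ h_prev.T
        db += grad_pre
        delta_next, h_next = delta, hs[t]
    for name, grad in (('U', dU), ('W', dW), ('V', dV), ('b', db), ('c', dc)):
        params[name] -= lr * grad        # e.g. V' = V - eta * dL/dV
    return params
```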
  • When the error function converges, the training can be stopped; alternatively, when the training reaches the maximum number of iterations MAX, the training is stopped.
  • That is, the error between the predicted output value of the to-be-trained ASR speech features in the RNN model and the preset label value (the real output value) is used to update the weights and offsets of each layer of the RNN model, so that the finally obtained ASR-RNN model has learned deep, time-series-related features from the ASR speech features and achieves the purpose of accurately distinguishing speech.
  • Steps S31-S34 train the RNN model with the to-be-trained ASR speech features, so that the trained ASR-RNN model learns deep sequence (timing) features based on the ASR speech features and can distinguish speech effectively by combining the ASR speech features of the target speech and the interfering speech with timing factors; even under severe noise interference, the target speech and the noise can still be distinguished accurately.
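  • Putting the pieces together, a hypothetical training loop reusing the `rnn_forward` and `bptt_update` sketches above might look like the following; `params`, `xs_train`, and `ys_train` are assumed to have been prepared as in the earlier sketches, and the iteration limit and tolerance are illustrative values only.

```python
import numpy as np

MAX_ITERS = 100        # maximum number of iterations MAX (hypothetical value)
TOLERANCE = 1e-4       # convergence tolerance on the error (assumption)

prev_loss = np.inf
for it in range(MAX_ITERS):
    hs, ys_hat = rnn_forward(params, xs_train)
    # L = sum_t L_t, here a cross-entropy error per time step
    loss = -sum(float(y.T @ np.log(y_hat + 1e-10))
                for y, y_hat in zip(ys_train, ys_hat))
    params = bptt_update(params, xs_train, hs, ys_hat, ys_train, lr=0.01)
    if abs(prev_loss - loss) < TOLERANCE:
        break              # error has converged, stop training
    prev_loss = loss
```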
  • the original voice data to be differentiated is processed based on a voice activity detection algorithm (VAD), and the target voice data to be distinguished is obtained.
  • The original to-be-distinguished speech data is first distinguished by the voice activity detection algorithm, yielding target to-be-distinguished speech data of smaller scope; this initially and effectively removes interfering speech data from the original to-be-distinguished speech data, while retaining the portion in which the target speech is still mixed with interfering speech. Using this retained portion as the target to-be-distinguished speech data provides an effective preliminary speech discrimination of the original to-be-distinguished speech data and removes a large amount of interfering speech.
  • Then, the corresponding ASR speech features are obtained based on the target to-be-distinguished speech data. These ASR speech features make the speech discrimination result more accurate: even under noisy conditions, interfering speech (such as noise) can be distinguished accurately from the target speech, and the features provide an important technical prerequisite for the subsequent recognition by the ASR-RNN model based on the ASR speech features.
  • the ASR speech features are input into a pre-trained ASR-RNN model to distinguish them and obtain the target discrimination result.
  • The ASR-RNN model is a recognition model specially trained, according to the ASR speech features extracted from the speech data to be trained and the timing characteristics of speech, to distinguish speech effectively. It can therefore correctly separate the target speech from the interfering speech in the target to-be-distinguished speech data, in which target speech and interfering speech are mixed (because VAD has already been used to distinguish the data once, the interfering speech here mostly refers to noise), improving the accuracy of speech discrimination.
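  • For completeness, a hypothetical inference step using the `rnn_forward` sketch above is shown below: each frame's ASR feature vector is classified, and frames whose predicted label is 1 are treated as target speech. The function name and the argmax decision rule are assumptions, not taken from the patent.

```python
import numpy as np

def distinguish(params, feature_frames):
    """Return a 0/1 label per frame: 1 = target speech, 0 = interfering speech."""
    xs = [f.reshape(-1, 1) for f in feature_frames]
    _, ys_hat = rnn_forward(params, xs)
    return [int(np.argmax(y)) for y in ys_hat]
```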
  • FIG. 8 shows a principle block diagram of a voice distinguishing device corresponding to the voice distinguishing method in the embodiment.
  • the voice discrimination device includes a target to-be-differentiated voice data acquisition module 10, a voice feature acquisition module 20, and a target discrimination result acquisition module 30.
  • The implementation functions of the target to-be-differentiated voice data acquisition module 10, the voice feature acquisition module 20, and the target discrimination result acquisition module 30 correspond one-to-one to the steps of the voice discrimination method in the embodiment; to avoid redundancy, they are not elaborated one by one in this embodiment.
  • the target to-be-differentiated voice data acquisition module 10 is configured to process the original to-be-differentiated voice data based on a voice activity detection algorithm to obtain the target to-be-differentiated voice data.
  • the voice feature obtaining module 20 is configured to obtain a corresponding ASR voice feature based on the target to-be-differentiated voice data.
  • the target discrimination result acquisition module 30 is configured to input ASR speech features into a pre-trained ASR-RNN model for discrimination, and obtain a target discrimination result.
  • The target to-be-differentiated voice data acquisition module 10 includes a first original distinguished speech data acquisition unit 11, a second original distinguished speech data acquisition unit 12, and a target to-be-differentiated voice data acquisition unit 13.
  • The first original distinguished speech data acquisition unit 11 is configured to process the original to-be-distinguished speech data according to the short-time energy feature value calculation formula, obtain the corresponding short-time energy feature value, retain the original to-be-distinguished speech data whose short-time energy feature value is greater than the first threshold, and determine it as the first original distinguished speech data. The short-time energy feature value calculation formula is $E=\sum_{n=1}^{N}s^{2}(n)$, where N is the speech frame length, s(n) is the signal amplitude in the time domain, and n is the time.
  • The second original distinguished speech data acquisition unit 12 is configured to process the original to-be-distinguished speech data according to the zero-crossing rate feature value calculation formula, obtain the corresponding zero-crossing rate feature value, retain the original to-be-distinguished speech data whose zero-crossing rate feature value is less than the second threshold, and determine it as the second original distinguished speech data. The zero-crossing rate feature value calculation formula is $Z=\frac{1}{2}\sum_{n=2}^{N}\left|\operatorname{sgn}(s(n))-\operatorname{sgn}(s(n-1))\right|$, where N is the speech frame length, s(n) is the signal amplitude in the time domain, and n is the time.
  • The target to-be-differentiated voice data acquisition unit 13 is configured to use the first original distinguished speech data and the second original distinguished speech data as the target to-be-differentiated voice data.
  • the speech feature acquisition module 20 includes a pre-processed speech data acquisition unit 21, a power spectrum acquisition unit 22, a Mel power spectrum acquisition unit 23, and a Mel frequency cepstrum coefficient unit 24.
  • The pre-processed speech data acquisition unit 21 is configured to pre-process the target to-be-differentiated voice data to obtain pre-processed voice data.
  • the power spectrum obtaining unit 22 is configured to perform a fast Fourier transform on the pre-processed speech data, obtain a frequency spectrum of the target speech data to be distinguished, and obtain a power spectrum of the target speech data to be distinguished according to the frequency spectrum.
  • the Mel power spectrum acquisition unit 23 is configured to process a power spectrum of the target speech data to be distinguished by using a Mel scale filter bank, and obtain a Mel power spectrum of the target speech data to be distinguished.
  • the Mel frequency cepstrum coefficient unit 24 is configured to perform cepstrum analysis on the Mel power spectrum to obtain a Mel frequency cepstrum coefficient of the target speech data to be distinguished.
  • The pre-processed speech data acquisition unit 21 includes a pre-emphasis sub-unit 211, a frame sub-unit 212, and a windowing sub-unit 213.
  • the pre-emphasis sub-unit 211 is configured to perform pre-emphasis processing on target voice data to be distinguished.
  • The frame sub-unit 212 is configured to perform framing on the pre-emphasized target to-be-differentiated voice data.
  • a windowing sub-unit 213 is configured to perform windowing on the framed target to-be-differentiated speech data to obtain pre-processed speech data.
  • The windowing calculation formula is $s'(n)=s(n)\cdot w(n)$, with $w(n)=0.54-0.46\cos\left(\frac{2\pi n}{N-1}\right)$ for a Hamming window, where N is the window length, n is the time, s(n) is the signal amplitude in the time domain, and s'(n) is the signal amplitude in the time domain after windowing.
  • The Mel frequency cepstrum coefficient unit 24 includes a to-be-transformed Mel power spectrum acquisition sub-unit 241 and a Mel frequency cepstrum coefficient sub-unit 242.
  • The to-be-transformed Mel power spectrum acquisition sub-unit 241 is configured to take the logarithm of the Mel power spectrum to obtain the to-be-transformed Mel power spectrum.
  • the Mel frequency cepstrum coefficient sub-unit 242 is configured to perform a discrete cosine transform of the Mel power spectrum to be transformed to obtain a Mel frequency cepstrum coefficient of the target speech data to be distinguished.
  • the speech discrimination device further includes an ASR-RNN model acquisition module 40.
  • The ASR-RNN model acquisition module 40 includes a to-be-trained ASR speech feature acquisition unit 41, an initialization unit 42, an output value acquisition unit 43, and an update unit 44.
  • The to-be-trained ASR speech feature acquisition unit 41 is configured to acquire the speech data to be trained and extract the to-be-trained ASR speech features.
  • the initialization unit 42 is configured to initialize an RNN model.
  • The output value acquisition unit 43 is configured to input the to-be-trained ASR speech features into the RNN model and obtain the output value of the RNN model according to the forward propagation algorithm; the output value is expressed as $\hat{y}_t=\sigma(Vh_t+c)$, where σ represents the activation function, V represents the weight of the connection between the hidden layer and the output layer, $h_t$ represents the hidden state at time t, and c represents the offset between the hidden layer and the output layer.
  • the updating unit 44 is configured to perform error back propagation based on the output value, update the weights and offsets of each layer of the RNN model, and obtain an ASR-RNN model.
  • The formula for updating the weight V is $V'=V-\eta\sum_{t=1}^{\tau}(\hat{y}_t-y_t)\,h_t^{T}$, where V represents the weight of the connection between the hidden layer and the output layer before the update, V' represents that weight after the update, η represents the learning rate, t represents time t, τ represents the total duration, $\hat{y}_t$ represents the predicted output value, $y_t$ represents the real output value, $h_t$ represents the hidden state at time t, and T represents the matrix transposition operation;
  • the formula for updating the offset c is $c'=c-\eta\sum_{t=1}^{\tau}(\hat{y}_t-y_t)$, where c represents the offset between the hidden layer and the output layer before the update and c' represents that offset after the update;
  • the formula for updating the weight U is $U'=U-\eta\sum_{t=1}^{\tau}\operatorname{diag}\!\left(1-h_t\odot h_t\right)\delta_t\,x_t^{T}$, where U represents the weight of the connection between the input layer and the hidden layer before the update and U' represents the weight of the connection between the input layer and the hidden layer after the update.
  • This embodiment provides one or more non-volatile readable storage media storing computer-readable instructions.
  • When the computer-readable instructions are executed by one or more processors, the one or more processors implement the speech distinguishing method in the embodiment; to avoid repetition, details are not described here again.
  • Alternatively, when the computer-readable instructions are executed by one or more processors, the one or more processors implement the functions of each module/unit in the speech distinguishing device in the embodiment; to avoid repetition, details are not described here again.
  • The computer-readable storage medium may include: any entity or device capable of carrying the computer-readable instructions, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electric carrier signal, a telecommunication signal, and so on.
  • FIG. 9 is a schematic diagram of a computer device in this embodiment.
  • the computer device 50 includes a processor 51, a memory 52, and computer-readable instructions 53 stored in the memory 52 and executable on the processor 51.
  • the processor 51 executes the computer-readable instructions 53
  • each step of the method for distinguishing speech in the embodiment is implemented, for example steps S10, S20, and S30 shown in FIG. 2.
  • Alternatively, when the processor 51 executes the computer-readable instructions 53, the functions of the modules/units of the voice distinguishing device in the embodiment are realized, for example the functions of the target to-be-differentiated voice data acquisition module 10, the voice feature acquisition module 20, the target discrimination result acquisition module 30, and the ASR-RNN model acquisition module 40 shown in FIG. 8.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each of the units may exist separately physically, or two or more units may be integrated into one unit.
  • the above integrated unit may be implemented in the form of hardware or in the form of software functional unit.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Machine Translation (AREA)

Abstract

Provided are a speech differentiation method and apparatus, and a computer device and a storage medium. The speech differentiation method comprises: processing, based on a speech activity detection algorithm, original speech data to be differentiated, and acquiring target speech data to be differentiated (S10); acquiring a corresponding ASR speech feature based on the target speech data to be differentiated (S20); and inputting the ASR speech feature into a pre-trained ASR-RNN model for differentiation, and acquiring a target differentiation result (S30). By means of the speech differentiation method, target speech can be differentiated well from interference speech, and the speech can still be accurately differentiated where noise interference in speech data is strong.

Description

语音区分方法、装置、计算机设备及存储介质Speech distinguishing method, device, computer equipment and storage medium
本申请以2018年6月4日提交的申请号为201810561788.1,名称为“语音区分方法、装置、计算机设备及存储介质”的中国专利申请为基础,并要求其优先权。This application is based on a Chinese patent application filed on June 4, 2018 with the application number 201810561788.1, entitled "Voice distinguishing method, device, computer equipment and storage medium", and claims its priority.
技术领域Technical field
本申请涉及语音处理领域,尤其涉及一种语音区分方法、装置、计算机设备及存储介质。The present application relates to the field of speech processing, and in particular, to a method, a device, a computer device, and a storage medium for distinguishing speech.
背景技术Background technique
语音区分是指对输入的语音进行静音筛选,仅保留对识别更有意义的语音段(即目标语音)。目前的语音区分方法存在很大的不足,尤其在噪音存在的情况下,随着噪音的变大,进行语音区分的难度就越大,无法准确区分出目标语音和干扰语音,导致语音区分的效果不理想。Speech discrimination refers to mute filtering of the input speech, and only retain the speech segments (that is, the target speech) that are more meaningful for recognition. The current methods of speech discrimination have great shortcomings, especially in the presence of noise, as the noise becomes larger, the difficulty of speech discrimination becomes more difficult, and the target speech and the interference speech cannot be accurately distinguished, resulting in the effect of speech discrimination. not ideal.
发明内容Summary of the Invention
本申请实施例提供一种语音区分方法、装置、计算机设备及存储介质,以解决在进行语音区分效果不理想的问题。The embodiments of the present application provide a method, an apparatus, a computer device, and a storage medium for speech discrimination, so as to solve the problem that the effect of speech discrimination is not ideal.
本申请实施例提供一种语音区分方法,包括:An embodiment of the present application provides a method for distinguishing speech, including:
基于语音活动检测算法处理原始待区分语音数据,获取目标待区分语音数据;Processing the original to-be-differentiated voice data based on the voice activity detection algorithm to obtain the target to-be-differentiated voice data;
基于所述目标待区分语音数据,获取相对应的ASR语音特征;Obtaining corresponding ASR voice characteristics based on the target to-be-differentiated voice data;
将所述ASR语音特征输入到预先训练好的ASR-RNN模型中进行区分,获取目标区分结果。The ASR speech features are input into a pre-trained ASR-RNN model for discrimination, and a target discrimination result is obtained.
本申请实施例提供一种语音区分装置,包括:An embodiment of the present application provides a voice distinguishing device, including:
目标待区分语音数据获取模块,用于基于语音活动检测算法处理原始待区分语音数据,获取目标待区分语音数据;Target to-be-differentiated voice data acquisition module, for processing original to-be-differentiated voice data based on a voice activity detection algorithm, to obtain target to-be-differentiated voice data;
语音特征获取模块,用于基于所述目标待区分语音数据,获取相对应的ASR语音特征;A voice feature acquisition module, configured to acquire a corresponding ASR voice feature based on the target to-be-differentiated voice data;
目标区分结果获取模块,用于将所述ASR语音特征输入到预先训练好的ASR-RNN模型中进行区分,获取目标区分结果。A target discrimination result acquisition module is configured to input the ASR speech features into a pre-trained ASR-RNN model for discrimination, and obtain a target discrimination result.
本申请实施例提供一种计算机设备,包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机可读指令,所述处理器执行所述计算机可读指令时实现如下步骤:An embodiment of the present application provides a computer device including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. The processor implements the computer-readable instructions to implement The following steps:
基于语音活动检测算法处理原始待区分语音数据,获取目标待区分语音数据;Processing the original to-be-differentiated voice data based on the voice activity detection algorithm to obtain the target to-be-differentiated voice data;
基于所述目标待区分语音数据,获取相对应的ASR语音特征;Obtaining corresponding ASR voice characteristics based on the target to-be-differentiated voice data;
将所述ASR语音特征输入到预先训练好的ASR-RNN模型中进行区分,获取目标区分结果。The ASR speech features are input into a pre-trained ASR-RNN model for discrimination, and a target discrimination result is obtained.
本申请实施例提供一个或多个存储有计算机可读指令的非易失性可读存储介质,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器执行如下步骤:This embodiment of the present application provides one or more non-volatile readable storage media storing computer-readable instructions. When the computer-readable instructions are executed by one or more processors, the one or more processors are Perform the following steps:
基于语音活动检测算法处理原始待区分语音数据,获取目标待区分语音数据;Processing the original to-be-differentiated voice data based on the voice activity detection algorithm to obtain the target to-be-differentiated voice data;
基于所述目标待区分语音数据,获取相对应的ASR语音特征;Obtaining corresponding ASR voice characteristics based on the target to-be-differentiated voice data;
将所述ASR语音特征输入到预先训练好的ASR-RNN模型中进行区分,获取目标区分结果。The ASR speech features are input into a pre-trained ASR-RNN model for discrimination, and a target discrimination result is obtained.
本申请的一个或多个实施例的细节在下面的附图和描述中提出,本申请的其他特征和优点将从说明书、附图以及权利要求变得明显。Details of one or more embodiments of the present application are set forth in the accompanying drawings and description below, and other features and advantages of the present application will become apparent from the description, the drawings, and the claims.
附图说明BRIEF DESCRIPTION OF THE DRAWINGS
为了更清楚地说明本申请实施例的技术方案,下面将对本申请实施例的描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动性的前提下,还可以根据这些附图获得其他的附图。In order to explain the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments of the application will be briefly introduced below. Obviously, the drawings in the following description are just some embodiments of the application. For those of ordinary skill in the art, other drawings can be obtained based on these drawings without paying creative labor.
图1是本申请一实施例中语音区分方法的一应用环境图;FIG. 1 is an application environment diagram of a speech discrimination method according to an embodiment of the present application; FIG.
图2是本申请一实施例中语音区分方法的一流程图;FIG. 2 is a flowchart of a speech discrimination method according to an embodiment of the present application; FIG.
图3是图2中步骤S10的一具体流程图;FIG. 3 is a specific flowchart of step S10 in FIG. 2;
图4是图2中步骤S20的一具体流程图;FIG. 4 is a specific flowchart of step S20 in FIG. 2;
图5是图4中步骤S21的一具体流程图;5 is a specific flowchart of step S21 in FIG. 4;
图6是图4中步骤S24的一具体流程图;6 is a specific flowchart of step S24 in FIG. 4;
图7是图2中步骤S30之前的一具体流程图;FIG. 7 is a specific flowchart before step S30 in FIG. 2; FIG.
图8是本申请一实施例中语音区分装置的一示意图;8 is a schematic diagram of a voice distinguishing device according to an embodiment of the present application;
图9是本申请一实施例中计算机设备的一示意图。FIG. 9 is a schematic diagram of a computer device in an embodiment of the present application.
具体实施方式Detailed ways
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。In the following, the technical solutions in the embodiments of the present application will be clearly and completely described with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are part of the embodiments of the present application, but not all of the embodiments. Based on the embodiments in this application, all other embodiments obtained by a person of ordinary skill in the art without creative efforts shall fall within the protection scope of this application.
图1示出本申请实施例提供的语音区分方法的应用环境。该语音识别方法的应用环境包括服务端和客户端,其中,服务端和客户端之间通过网络进行连接。客户端是可与用户进行人机交互的设备,包括但不限于电脑、智能手机和平板等设备。服务端具体可以用独立的服务器或者多个服务器组成的服务器集群实现。本申请实施例提供的语音区分方法应用于服务端。FIG. 1 illustrates an application environment of a speech discrimination method provided by an embodiment of the present application. The application environment of the speech recognition method includes a server and a client, wherein the server and the client are connected through a network. Clients are devices that can interact with users, including but not limited to computers, smartphones, and tablets. The server can be implemented by an independent server or a server cluster composed of multiple servers. The speech discrimination method provided in the embodiments of the present application is applied to a server.
如图2所示,图2示出本实施例中语音区分方法的一流程图,该语音区分方法包括如下步骤:As shown in FIG. 2, FIG. 2 shows a flowchart of the voice discrimination method in this embodiment. The voice discrimination method includes the following steps:
S10:基于语音活动检测算法处理原始待区分语音数据,获取目标待区分语音数据。S10: The original speech data to be distinguished is processed based on the speech activity detection algorithm, and the target speech data to be distinguished is obtained.
其中,语音活动检测(Voice Activity Detection,以下简称VAD),目的是从声音信号流里识别和消除长时间的静音期,以达到在不降低业务质量的情况下节省话路资源的作用,可以节省宝贵的带宽资源,降低端到端的时延,提升用户体验。语音活动检测算法(VAD算法)即语音活动检测时具体采用的算法,该算法可以有多种。可以理解地,VAD可以应用在语音区分,能够区分目标语音和干扰语音。目标语音是指语音数据中声纹连续变化明显的语音部分,干扰语音可以是语音数据中由于静默而没有发音的语音部分,也可以是环境噪音。原始待区分语音数据是最原始获取到的待区分语音数据,该原始待区分语音数据是指待采用VAD算法进行初步区分处理的语音数据。目标待区分语音数据是指通过语音活动检测算法对原始待区分语音数据进行处理后,获取的用于进行语音区分的语音数据。Among them, Voice Activity Detection (hereinafter referred to as VAD) is to identify and eliminate the long silence period from the sound signal stream, so as to save the voice channel resources without reducing the service quality, which can save Precious bandwidth resources reduce end-to-end delay and improve user experience. The voice activity detection algorithm (VAD algorithm) is an algorithm specifically used in the voice activity detection, and the algorithm may have various types. Understandably, VAD can be applied to speech discrimination, and can distinguish target speech and interference speech. The target voice refers to the voice part in which the voiceprint continuously changes significantly in the voice data, and the interference voice may be a voice part in the voice data that is not pronounced due to silence, or it may be environmental noise. The original to-be-differentiated voice data is the most originally obtained to-be-differentiated voice data, and the original to-be-differentiated voice data refers to voice data to be subjected to preliminary distinguishing processing using a VAD algorithm. The target to-be-differentiated speech data refers to the speech data obtained by using the voice activity detection algorithm to process the original to-be-differentiated speech data for speech discrimination.
本实施例中,采用VAD算法对原始待区分语音数据进行处理,从原始待区分语音数据中初步筛选出目标语音和干扰语音,并将初步筛选出的目标语音部分作为目标待区分语音数据。可以理解地,对于初步筛选出的干扰语音不必再进行区分,以提高语音区分的效率。而从原始待区分语音数据中初步筛选出的目标语音仍然存在干扰语音的内容,尤其当原始待区分语音数据的噪音比较大时,初步筛选出的目标语音混杂的干扰语音(如噪音)就越多,显然此时采用VAD算法是无法有效区分语音的,因此应将初步筛选出的混杂着干扰语音的目标语音作为目标待区分语音数据,以对初步筛选出的目标语音进行更精确的区分。通过采用VAD算法对原始待区分语音数据进行初步语音区分,可以根据初步筛选的原始待区分语音数据进行再区分,同时去除大量的干扰语音,有利于后续进一步的语音区分。In this embodiment, the VAD algorithm is used to process the original to-be-differentiated voice data, and the target to-be-differentiated voice is initially selected from the original to-be-differentiated voice data, and the initially-selected target voice portion is used as the target to-be-differentiated voice data. Understandably, it is not necessary to distinguish the interfering voices that are initially screened to improve the efficiency of voice discrimination. However, the target voice initially screened from the original to-be-differentiated voice data still contains the content of interfering speech, especially when the original voice data to be distinguished is relatively noisy, the interfering voices (such as noise) mixed with the preliminary target voice are more mixed. It is obvious that it is impossible to effectively distinguish the speech by using the VAD algorithm at this time. Therefore, the target voice that is preliminarily screened with interfering voices should be used as the target to-be-differentiated voice data in order to more accurately distinguish the target voice that is initially screened. By using the VAD algorithm to perform preliminary speech discrimination on the original to-be-differentiated voice data, it is possible to re-differentiate the original to-be-differentiated voice data and to remove a large amount of interfering speech at the same time, which is beneficial to subsequent further speech discrimination.
在一具体实施方式中,如图3所示,步骤S10中,基于语音活动检测算法处理原始待区分语音数据,获取目标待区分语音数据,包括如下步骤:In a specific implementation, as shown in FIG. 3, in step S10, processing the original to-be-differentiated voice data based on the voice activity detection algorithm to obtain the target to-be-differentiated voice data includes the following steps:
S11:根据短时能量特征值计算公式对原始待区分语音数据进行处理,获取对应的短时能量特征值,将短时能量特征值大于第一阈值的原始待区分数据保留,确定为第一原始区分语音数据,其中,短时能量特征值计算公式为
E = \sum_{n=1}^{N} s(n)^{2}
N为语音帧长,s(n)为时域上的信号幅度,n为时间。
S11: Process the original speech data to be distinguished according to the short-term energy characteristic value calculation formula, obtain the corresponding short-term energy characteristic value, and retain the original to-be-differentiated data whose short-term energy characteristic value is greater than the first threshold, and determine it as the first original Differentiate speech data, where the short-term energy eigenvalue calculation formula is
E = \sum_{n=1}^{N} s(n)^{2}
N is the speech frame length, s (n) is the signal amplitude in the time domain, and n is the time.
其中,短时能量特征值描述的是一帧语音(一帧一般取10-30ms)在其时域上对应的能量,该短时能量的“短时”应理解为一帧的时间(即语音帧长)。由于目标语音的短时能量特征值,相比于干扰语音(静音)的短时能量特征值会高出很多,因此可以根据该短时能量特征值来区分目标语音和干扰语音。Among them, the short-term energy characteristic value describes the energy corresponding to a frame of speech (a frame generally takes 10-30ms) in its time domain. The "short-term" of this short-term energy should be understood as the time of a frame (that is, speech Frame length). Since the short-term energy feature value of the target voice is much higher than the short-term energy feature value of the interfering voice (silence), the target voice and the interfering voice can be distinguished according to the short-term energy feature value.
本实施例中,根据短时能量特征值计算公式处理原始待区分语音数据(需要预先对原始待区分语音数据作分帧的处理),计算并获取原始待区分语音数据各帧的短时能量特征值,将各帧的短时能量特征值与预先设置的第一阈值进行比较,将大于第一阈值的原始待区分语音数据保留,并确定为第一原始区分语音数据。该第一阈值是用于衡量短时能量特征值是属于目标语音还是干扰语音的分界值。本实施例中,根据短时能量特征值和第一阈值的比较结果,可以从短时能量特征值的角度初步区分得到原始待区分语音数据中的目标语音,并有效去除原始待区分语音数据中大量的干扰语音。In this embodiment, the original speech data to be distinguished is processed according to the short-term energy feature value calculation formula (the original speech data to be distinguished needs to be framed in advance), and the short-term energy characteristics of each frame of the original speech data to be distinguished are calculated and obtained. Value, comparing the short-term energy characteristic value of each frame with a preset first threshold value, retaining the original to-be-differentiated voice data that is greater than the first threshold, and determining it as the first original distinguishing voice data. The first threshold is a cut-off value for measuring whether the short-term energy characteristic value belongs to the target speech or the interference speech. In this embodiment, according to the comparison result of the short-term energy feature value and the first threshold value, the target voice in the original to-be-differentiated voice data can be obtained from the perspective of the short-term energy feature value, and the original to-be-differentiated voice data is effectively removed A lot of disturbing speech.
S12:根据过零率特征值计算公式对原始待区分语音数据进行处理,获取对应的过零率特征值,将过零率特征值小于第二阈值的原始待区分语音数据保留,确定为第二原始区分语音数据,其中,过零率特征值计算公式为
Z = \frac{1}{2} \sum_{n=2}^{N} \left| \operatorname{sgn}\!\left(s(n)\right) - \operatorname{sgn}\!\left(s(n-1)\right) \right|
N为语音帧长,s(n)为时域上的信号幅度n为时间。
S12: Process the original to-be-differentiated voice data according to the calculation formula of the zero-crossing rate feature value, obtain the corresponding zero-crossing rate feature value, and retain the original to-be-differentiated voice data with the zero-cross rate feature value less than the second threshold, and determine as the second The original distinguished speech data, where the zero-crossing rate eigenvalue calculation formula is
Z = \frac{1}{2} \sum_{n=2}^{N} \left| \operatorname{sgn}\!\left(s(n)\right) - \operatorname{sgn}\!\left(s(n-1)\right) \right|
N is the speech frame length, s(n) is the signal amplitude in the time domain, and n is time.
其中,过零率特征值是描述一帧语音中语音信号波形穿过横轴(零电平)的次数。由于目标语音的过零率特征值,相比于干扰语音的过零率特征值会低很多,因此可以根据该短时能量特征值来区分目标语音和干扰语音。Among them, the zero-crossing rate characteristic value describes the number of times a voice signal waveform passes through the horizontal axis (zero level) in a frame of speech. Since the feature value of the zero-crossing rate of the target voice is much lower than the feature value of the zero-crossing rate of the interfering voice, the target voice and the interfering voice can be distinguished according to the short-term energy feature value.
本实施例中,根据过零率特征值计算公式处理原始待区分语音数据,计算并获取原始待区分语音数据各帧的过零率特征值,将各帧的过零率特征值与预先设置的第二阈值进行比较,将小于第二阈值的原始待区分语音数据保留,并确定为第二原始区分语音数据。该第二阈值是用于衡量短时能量特征值是属于目标语音还是干扰语音的分界值。本实施例中,根据过零率特征值和第二阈值的比较结果,可以从过零率特征值的角度初步区分得到原始待区分语音数据中的目标语音,并有效去除原始待区分语音数据中大量的干扰语音。In this embodiment, the original to-be-differentiated speech data is processed according to the zero-crossing rate feature value calculation formula, and the zero-crossing rate feature value of each frame of the original to-be-differentiated voice data is calculated and obtained. The second threshold value is compared, and the original to-be-differentiated voice data smaller than the second threshold value is retained and determined as the second original distinguished voice data. The second threshold is a cutoff value for measuring whether the short-term energy characteristic value belongs to the target speech or the interference speech. In this embodiment, according to the comparison result of the zero-crossing rate feature value and the second threshold value, the target voice in the original to-be-differentiated voice data can be obtained from the perspective of the zero-crossing rate feature value, and the original to-be-differentiated voice data can be effectively removed. A lot of disturbing speech.
S13:将第一原始区分语音数据和第二原始区分语音数据作为目标待区分语音数据。S13: Use the first original distinguished speech data and the second original distinguished speech data as target to-be-separated speech data.
本实施例中,第一原始区分语音数据是根据短时能量特征值的角度从原始待区分语音数据中区分并获取的,第二原始区分语音数据是根据过零率特征值的角度从原始待区分语音数据中区分并获取的。第一原始区分语音数据和第二原始区分语音数据分别从区分语音的不同角度出发,这两个角度都能够很好地区分语音,因此将第一原始区分语音数据和第二原始区分语音数据合并(以取交集的方式合并)在一起,作为目标待区分语音数据。In this embodiment, the first original distinguished speech data is distinguished and obtained from the original to-be-differentiated speech data according to the angle of the short-term energy feature value, and the second original distinguished speech data is from the original to-be-distant speech data according to the angle of the zero-crossing rate feature value. Distinguish and acquire in speech data. The first original distinguished speech data and the second original distinguished speech data are from different perspectives of distinguishing speech. Both of these angles can distinguish the speech well. Therefore, the first original distinguished speech data and the second original distinguished speech data are merged. (Merged in the manner of taking intersections) together as the target speech data to be distinguished.
步骤S11-S13可以初步有效地去除原始待区分语音数据中大部分的干扰语音数据,保留混杂着目标语音和少部分干扰语音(如噪音)的原始待区分语音数据,并将该原始待区分语音数据作为目标待区分语音数据,能够对原始待区分语音数据作有效的初步语音区分。Steps S11-S13 can initially and effectively remove most of the interfering voice data in the original to-be-differentiated voice data, retain the original to-be-differentiated voice data that is mixed with the target voice and a small part of the interfering voice (such as noise), and the original to-be-differentiated voice data The data is used as the target to-be-differentiated speech data, which can make effective preliminary speech discrimination on the original to-be-differentiated speech data.
S20:基于目标待区分语音数据,获取相对应的ASR语音特征。S20: Obtain the corresponding ASR voice characteristics based on the target to-be-differentiated voice data.
其中,ASR(Automatic Speech Recognition,自动语音识别技术)是将语音数据转换为计算机可读输入的技术,例如将语音数据转化为按键、二进制编码或者字符序列等形式。通过ASR可以提取目标待区分语音数据中的语音特征,提取到的语音即为与其相对应的ASR语音特征。可以理解地,ASR能够将原本计算机无法直接读取的语音数据转换为计算机能够读取的ASR语音特征,该ASR语音特征可以采用向量的方式表示。Among them, ASR (Automatic Speech Recognition) is a technology that converts speech data into computer-readable input, for example, converts speech data into keys, binary codes, or character sequences. The ASR can extract the voice features in the target to-be-differentiated voice data, and the extracted voice is the corresponding ASR voice feature. Understandably, ASR can convert voice data that cannot be read directly by a computer into ASR voice features that can be read by a computer, and the ASR voice features can be represented in a vector manner.
本实施例中,采用ASR对目标待区分语音数据进行处理,获取相对应的ASR语音特征,该ASR语音特征可以很好地反映目标待区分语音数据的潜在特征,可以根据ASR语音特征对目标待区分语音数据进行区分,为后续根据该ASR语音特征进行相应的ASR-RNN(RNN,Recurrent neural networks,循环神经网络)模型识别提供重要的技术前提。In this embodiment, ASR is used to process the target to-be-differentiated voice data to obtain the corresponding ASR voice characteristics. This ASR voice feature can well reflect the potential characteristics of the target to-be-differentiated voice data. Differentiate speech data to distinguish, and provide important technical prerequisites for subsequent ASR-RNN (RNN, Recurrent Neural Networks) model recognition based on the ASR speech characteristics.
在一具体实施方式中,如图4所示,步骤S20中,基于目标待区分语音数据,获取相对应的ASR语音特征,包括如下步骤:In a specific implementation, as shown in FIG. 4, in step S20, acquiring the corresponding ASR voice feature based on the target to-be-differentiated voice data includes the following steps:
S21:对目标待区分语音数据进行预处理,获取预处理语音数据。S21: Preprocess the target to-be-differentiated voice data to obtain preprocessed voice data.
本实施例中,对目标待区分语音数据进行预处理,并获取相对应的预处理语音数据。对目标待区分语音数据进行预处理能够更好地提取目标待区分语音数据的ASR语音特征,使得提取出的ASR语音特征更能代表该目标待区分语音数据,以采用该ASR语音特征进行语音区分。In this embodiment, the target to-be-differentiated voice data is pre-processed, and corresponding pre-processed voice data is obtained. Preprocessing the target to-be-differentiated voice data can better extract the ASR voice characteristics of the target to-be-differentiated voice data, so that the extracted ASR speech features can better represent the target-to-be-differentiated voice data, so as to use the ASR speech features for speech discrimination .
在一具体实施方式中,如图5所示,步骤S21中,对目标待区分语音数据进行预处理,获取预处理语音数据,包括如下步骤:In a specific implementation, as shown in FIG. 5, in step S21, pre-processing the target to-be-differentiated voice data to obtain pre-processed voice data includes the following steps:
S211:对目标待区分语音数据作预加重处理,预加重处理的计算公式为s' n=s n-a*s n-1,其中,s n为时域上的信号幅度,s n-1为与s n相对应的上一时刻的信号幅度,s' n为预加重后时域上的信号幅度,a为预加重系数,a的取值范围为0.9<a<1.0。 S211: Pre-emphasis processing is performed on the target to-be-differentiated voice data. The calculation formula for the pre-emphasis processing is s' n = s n -a * s n-1 , where s n is the signal amplitude in the time domain and s n-1 s n is the amplitude of the signal corresponding to the previous time, s' n for the signal amplitude in the time domain after the pre-emphasis, a is the pre-emphasis coefficient, a is in the range of 0.9 <a <1.0.
其中,预加重是一种在发送端对输入信号高频分量进行补偿的信号处理方式。随着信号速率的增加,信号在传输过程中受损很大,为了使接收端能得到比较好的信号波形,就需要对受损的信号进行补偿。预加重技术的思想就是在传输线的发送端增强信号的高频成分,以补偿高频分量在传输过程中的过大衰减,使得接收端能够得到较好的信号波形。预加重对噪声并没有影响,因此能够有效提高输出信噪比。Among them, pre-emphasis is a signal processing method that compensates the high-frequency component of the input signal at the transmitting end. With the increase of the signal rate, the signal is greatly damaged in the transmission process. In order to obtain a better signal waveform at the receiving end, the damaged signal needs to be compensated. The idea of the pre-emphasis technology is to enhance the high-frequency component of the signal at the transmitting end of the transmission line to compensate for the excessive attenuation of the high-frequency component during transmission, so that the receiving end can obtain a better signal waveform. Pre-emphasis has no effect on noise, so it can effectively improve the output signal-to-noise ratio.
本实施例中,对目标待区分语音数据作预加重处理,该预加重处理的公式为s' n=s n-a*s n-1,其中,s n为时域上的信号幅度,即语音数据在时域上表达的语音的幅值(幅度),s n-1为与s n相对的上一时刻的信号幅度,s' n为预加重后时域上的信号幅度,a为预加重系数,a的取值范围为0.9<a<1.0,这里取0.97预加重的效果比较好。采用该预加重处理能够消除发声过程中声带和嘴唇等造成的干扰,可以有效补偿目标待区分语音数据被压抑的高频部分,并且能够突显目标待区分语音数据高频的共振峰,加强目标待区分语音数据的信号幅度,有助于提取ASR语音特征。 Formula of the present embodiment, the target to be differentiated for the voice data pre-emphasis, the pre-emphasis is s' n = s n -a * s n-1, wherein the amplitude of the signal s n on the time domain, i.e., voice magnitude (amplitude) expression of the voice data in the time domain, s n-1 s n is the opposite of the signal amplitude of a time, s' n for the amplitude of the signal on the time-domain pre-emphasis, a is pre- The weighting coefficient, the range of a is 0.9 <a <1.0. Here, the effect of pre-emphasis of 0.97 is better. The use of the pre-emphasis processing can eliminate interference caused by vocal cords and lips during the utterance process, can effectively compensate the suppressed high-frequency part of the target voice data to be distinguished, and can highlight the high-frequency formants of the target voice data to be distinguished, strengthening the target Distinguishing the signal amplitude of speech data helps to extract ASR speech features.
S212:将预加重后的目标待区分语音数据进行分帧处理。S212: Frame the target to-be-differentiated voice data after pre-emphasis.
本实施例中,在预加重目标待区分语音数据后,还应进行分帧处理。分帧是指将整段的语音信号切分成若干段的语音处理技术,每帧的大小在10-30ms的范围内,以大概1/2帧长作为帧移。帧移是指相邻两帧间的重叠区域,能够避免相邻两帧变化过大的问题。对目标待区分语音数据进行分帧处理,能够将目标待区分语音数据分成若干段的语音数据,可以细分目标待区分语音数据,便于ASR语音特征的提取。In this embodiment, after the pre-emphasis target has to distinguish the voice data, it should also perform frame processing. Framing refers to the speech processing technology that cuts the entire voice signal into several segments. The size of each frame is in the range of 10-30ms, and the frame shift is about 1/2 frame length. Frame shift refers to the overlapping area between two adjacent frames, which can avoid the problem of excessive changes in adjacent two frames. Frame processing is performed on the target to-be-differentiated voice data, which can divide the target to-be-differentiated voice data into several pieces of voice data, and the target to-be-differentiated voice data can be subdivided to facilitate the extraction of ASR voice features.
S213:将分帧后的目标待区分语音数据进行加窗处理,获取预处理语音数据,加窗的计算公式为
s'_{n} = s_{n} \left[ 0.54 - 0.46 \cos\!\left( \frac{2\pi n}{N-1} \right) \right]
其中,N为窗长,n为时间,s n为时域上的信号幅度,s' n为加窗后时域上的信号幅度。
S213: Perform windowing on the framed target to-be-differentiated voice data to obtain pre-processed voice data. The calculation formula for the windowing is
s'_{n} = s_{n} \left[ 0.54 - 0.46 \cos\!\left( \frac{2\pi n}{N-1} \right) \right]
Wherein, N is the window length, n is time, s_n is the signal amplitude in the time domain, and s'_n is the signal amplitude in the time domain after windowing.
本实施例中,在对目标待区分语音数据进行分帧处理后,每一帧的起始段和末尾端都会出现不连续的地方,所以分帧越多与目标待区分语音数据的误差也就越大。采用加窗能够解决这个问题,可以使分帧后的目标待区分语音数据变得连续,并且使得每一帧能够表现出周期函数的特征。加窗处理具体是指采用窗函数对目标待区分语音数据进行处理,窗函数可以选择汉明窗,则该加窗的公式为
s'_{n} = s_{n} \left[ 0.54 - 0.46 \cos\!\left( \frac{2\pi n}{N-1} \right) \right]
N为汉明窗窗长,n为时间,s n为时域上的信号幅度,s' n为加窗后时域上的信号幅度。对目标待区分语音数据进行加窗处理,获取预处理语音数据,能够使得分帧后的目标待区分语音数据在时域上的信号变得连续,有助于提取目标待区分语音数据的ASR语音特征。
In this embodiment, after frame processing is performed on the target to-be-differentiated voice data, discontinuities appear at the beginning and end of each frame, so the more frames there are, the more errors there are with the target to-be-differentiated voice data. Bigger. The use of windowing can solve this problem, which can make the target to-be-differentiated voice data after framed become continuous, and each frame can show the characteristics of the periodic function. The windowing process specifically refers to using the window function to process the target to-be-differentiated speech data. The window function can select the Hamming window, and the windowing formula is
s'_{n} = s_{n} \left[ 0.54 - 0.46 \cos\!\left( \frac{2\pi n}{N-1} \right) \right]
N is the Hamming window length, n is time, s_n is the signal amplitude in the time domain, and s'_n is the signal amplitude in the time domain after windowing. Windowing the framed target to-be-differentiated voice data to obtain the pre-processed voice data makes its time-domain signal continuous, which helps to extract the ASR voice features of the target to-be-differentiated voice data.
上述步骤S211-S213对目标待区分语音数据的预处理操作,为提取目标待区分语音数据的ASR语音特征提供了基础,能够使得提取的ASR语音特征更能代表该目标待区分语音数据,并根据该ASR语音特征进行语音区分。The pre-processing operations on the target to-be-differentiated voice data in steps S211 to S213 provide a basis for extracting the ASR voice characteristics of the target to-be-differentiated voice data, which can make the extracted ASR voice features more representative of the target to-be-differentiated voice data, and according to This ASR voice feature performs voice discrimination.
S22:对预处理语音数据作快速傅里叶变换,获取目标待区分语音数据的频谱,并根据频谱获取目标待区分语音数据的功率谱。S22: Perform a fast Fourier transform on the pre-processed speech data to obtain the frequency spectrum of the target speech data to be distinguished, and obtain the power spectrum of the target speech data to be distinguished according to the frequency spectrum.
其中,快速傅里叶变换(Fast Fourier Transformation,简称FFT),指利用计算机计算离散傅里叶变换的高效、快速计算方法的统称,简称FFT。采用这种算法能使计算机计算离散傅里叶变换所需要的乘法次数大为减少,特别是被变换的抽样点数越多,FFT算法计算量的节省就越显著。Among them, Fast Fourier Transform (FFT) refers to a collective term for an efficient and fast method for computing a discrete Fourier transform using a computer, and is referred to as FFT for short. The use of this algorithm can greatly reduce the number of multiplications required by the computer to calculate the discrete Fourier transform. In particular, the more the number of transformed sampling points, the more significant the FFT algorithm's computational savings will be.
本实施例中,对预处理语音数据进行快速傅里叶变换,以将预处理语音数据从时域上的信号幅度转换为在频域上的信号幅度(频谱)。该计算频谱的公式为
s(k) = \sum_{n=1}^{N} s(n)\, e^{-\frac{2\pi i}{N} nk}
1≤k≤N,N为帧的大小,s(k)为频域上的信号幅度,s(n)为时域上的信号幅度,n为时间,i为复数单位。在获取预处理语音数据的频谱后,可以根据该频谱直接求得预处理语音数据的功率谱,以下将预处理语音数据的功率谱称为目标待区分语音数据的功率谱。该计算目标待区分语音数据的功率谱的公式为
P(k) = \frac{1}{N} \left| s(k) \right|^{2}
1≤k≤N,N为帧的大小,s(k)为频域上的信号幅度。通过将预处理语音数据从时域上的信号幅度转换为频域上的信号幅度,再根据该频域上的信号幅度获取目标待区分语音数据的功率谱,为从目标待区分语音数据的功率谱中提取ASR语音特征提供重要的技术基础。
In this embodiment, fast Fourier transform is performed on the pre-processed voice data to convert the pre-processed voice data from the signal amplitude in the time domain to the signal amplitude (spectrum) in the frequency domain. The formula for calculating the spectrum is
s(k) = \sum_{n=1}^{N} s(n)\, e^{-\frac{2\pi i}{N} nk}
1≤k≤N, N is the frame size, s (k) is the signal amplitude in the frequency domain, s (n) is the signal amplitude in the time domain, n is time, and i is a complex unit. After obtaining the frequency spectrum of the pre-processed voice data, the power spectrum of the pre-processed voice data can be directly obtained according to the frequency spectrum. The power spectrum of the pre-processed voice data is hereinafter referred to as the power spectrum of the target voice data to be distinguished. The formula for calculating the power spectrum of the target speech data to be distinguished is
P(k) = \frac{1}{N} \left| s(k) \right|^{2}
1≤k≤N, N is the size of the frame, and s (k) is the signal amplitude in the frequency domain. The pre-processed speech data is converted from the signal amplitude in the time domain to the signal amplitude in the frequency domain, and then the power spectrum of the target speech data to be distinguished is obtained according to the signal amplitude in the frequency domain. Extracting ASR speech features from the spectrum provides an important technical basis.
S23:采用梅尔刻度滤波器组处理目标待区分语音数据的功率谱,获取目标待区分语音数据的梅尔功率谱。S23: Use the Mel scale filter bank to process the power spectrum of the target speech data to be distinguished, and obtain the Mel power spectrum of the target speech data to be distinguished.
其中,采用梅尔刻度滤波器组处理目标待区分语音数据的功率谱是对功率谱进行的梅尔频率分析,梅尔频率分析是基于人类听觉感知的分析。观测发现,人耳就像一个滤波器组一样,只关注某些特定的频率分量(人的听觉对频率是有选择性的),也就是说人耳只让某些频率的信号通过,而直接无视不想感知的某些频率信号。然而这些滤波器在频率坐标轴上却不是统一分布的,在低频区域有很多的滤波器,他们分布比较密集,但在高频区域,滤波器的数目就变得比较少,分布很稀疏。可以理解地,梅尔刻度滤波器组在低频部分的分辨率高,跟人耳的听觉特性是相符的,这也是梅尔刻度的物理意义所在。Among them, the power spectrum of the target speech data to be processed using the Mel scale filter bank is a Mel frequency analysis of the power spectrum, and the Mel frequency analysis is an analysis based on human auditory perception. Observation found that the human ear is like a filter bank, focusing only on certain specific frequency components (human hearing is selective to frequencies), which means that the human ear only allows signals of certain frequencies to pass through, and directly Ignore certain frequency signals that you don't want to perceive. However, these filters are not uniformly distributed on the frequency axis. There are many filters in the low frequency region, and they are densely distributed. However, in the high frequency region, the number of filters becomes relatively small and the distribution is sparse. Understandably, the resolution of the Mel scale filter bank in the low frequency part is high, which is consistent with the hearing characteristics of the human ear, which is also the physical meaning of the Mel scale.
本实施例中,采用梅尔刻度滤波器组处理目标待区分语音数据的功率谱,获取目标待区分语音数据的梅尔功率谱,通过采用梅尔刻度滤波器组对频域信号进行切分,使得最后每个频率段对应一个数值,若滤波器的个数为22,则可以得到目标待区分语音数据的梅尔功率谱对应的22个能量值。通过对目标待区分语音数据的功率谱进行梅尔频率分析,使得其分析后获取的梅尔功率谱保留着与人耳特性密切相关的频率部分,该频率部分能够很好地反映出目标待区分语音数据的特征。In this embodiment, a Mel scale filter bank is used to process the power spectrum of the target speech data to be distinguished, and a Mel power spectrum of the target speech data to be distinguished is obtained. The frequency domain signal is segmented by using the Mel scale filter bank. Make each frequency segment correspond to a numerical value. If the number of filters is 22, 22 energy values corresponding to the Mel power spectrum of the target speech data to be distinguished can be obtained. By performing Mel frequency analysis on the power spectrum of the target speech data to be distinguished, the Mel power spectrum obtained after the analysis retains a frequency portion closely related to the characteristics of the human ear, and this frequency portion can well reflect the target to be distinguished Characteristics of speech data.
S24:在梅尔功率谱上进行倒谱分析,获取目标待区分语音数据的梅尔频率倒谱系数。S24: Perform cepstrum analysis on the Mel power spectrum to obtain a Mel frequency cepstrum coefficient of the target speech data to be distinguished.
其中,倒谱(cepstrum)是指一种信号的傅里叶变换谱经对数运算后再进行的傅里叶反变换,由于一般傅里叶谱是复数谱,因而倒谱又称复倒谱。Among them, cepstrum refers to the inverse Fourier transform of the Fourier transform spectrum of a signal after logarithmic operation. Since the general Fourier spectrum is a complex spectrum, the cepstrum is also called complex cepstrum. .
本实施例中,对梅尔功率谱进行倒谱分析,根据倒谱的结果,分析并获取目标待区分语音数据的梅尔频率倒谱系数。通过该倒谱分析,可以将原本特征维数过高,难以直接使用的目标待区分语音数据的梅尔功率谱中包含的特征,通过在梅尔功率谱上进行倒谱分析,转换成易于使用的特征(用来进行训练或识别的梅尔频率倒谱系数特征向量)。该梅尔频率倒谱系数能够作为ASR语音特征对不同语音进行区分的系数,该ASR语音特征可以反映语音之间的区别,可以用来识别和区分目标待区分语音数据。In this embodiment, a cepstrum analysis is performed on the Mel power spectrum, and based on the cepstrum result, the Mel frequency cepstrum coefficient of the target speech data to be distinguished is analyzed and obtained. Through the cepstrum analysis, the features contained in the Mel power spectrum of the target speech data to be distinguished, which is too high in original feature dimension, can be directly converted into easy-to-use through cepstrum analysis on the Mel power spectrum. (Mel frequency cepstrum coefficient feature vector used for training or identification). The Mel frequency cepstrum coefficient can be used as a coefficient for distinguishing different voices from ASR voice features. The ASR voice feature can reflect the difference between voices, and can be used to identify and distinguish target to-be-differentiated voice data.
在一具体实施方式中,如图6所示,步骤S24中,在梅尔功率谱上进行倒谱分析,获取目标待区分语音数据的梅尔频率倒谱系数,包括如下步骤:In a specific embodiment, as shown in FIG. 6, in step S24, cepstrum analysis is performed on the Mel power spectrum to obtain the Mel frequency cepstrum coefficient of the target speech data to be distinguished, including the following steps:
S241:取梅尔功率谱的对数值,获取待变换梅尔功率谱。S241: Take the log value of the Mel power spectrum, and obtain the Mel power spectrum to be transformed.
本实施例中,根据倒谱的定义,对梅尔功率谱取对数值log,获取待变换梅尔功率谱m。In this embodiment, according to the definition of the cepstrum, a log value log of the Mel power spectrum is taken to obtain the Mel power spectrum m to be transformed.
S242:对待变换梅尔功率谱作离散余弦变换,获取目标待区分语音数据的梅尔频率倒谱系数。S242: Perform discrete cosine transform on the Mel power spectrum to be transformed to obtain a Mel frequency cepstrum coefficient of the target speech data to be distinguished.
本实施例中,对待变换梅尔功率谱m作离散余弦变换(Discrete Cosine Transform,DCT),获取相对应的目标待区分语音数据的梅尔频率倒谱系数,一般取第2个到第13个系数作为ASR语音特征,该ASR语音特征能够反映语音数据间的区别。对待变换梅尔功率谱m作离散余弦变换的公式为
C_{i} = \sum_{j=1}^{N} m_{j} \cos\!\left[ \frac{\pi i}{N} \left( j - \frac{1}{2} \right) \right]
i=0,1,2,...,N-1,N为帧长,m为待变换梅尔功率谱,j为待变换梅尔功率谱的自变量。由于梅尔滤波器之间是有重叠的,所以采用梅尔刻度滤波器获取的能量值之间是具有相关性的,离散余弦变换可以对待变换梅尔功率谱m进行降维压缩和抽象,并获得相应的ASR语音特征,相比于傅里叶变换,离散余弦变换的结果没有虚部,在计算方面有明显的优势。
In this embodiment, a discrete cosine transform (DCT) is performed on the Mel power spectrum m to be transformed to obtain a corresponding Mel frequency cepstrum coefficient of the target speech data to be distinguished. Generally, the second to thirteenth coefficients are taken. Coefficients are used as ASR speech features, which can reflect the differences between speech data. The formula for discrete cosine transform of the transformed Mel power spectrum m is
C_{i} = \sum_{j=1}^{N} m_{j} \cos\!\left[ \frac{\pi i}{N} \left( j - \frac{1}{2} \right) \right]
i = 0, 1, 2, ..., N-1, N is the frame length, m is the Mel power spectrum to be transformed, and j is the independent variable of the Mel power spectrum to be transformed. Because there is overlap between Mel filters, there is a correlation between the energy values obtained by using Mel scale filters. Discrete cosine transform can perform dimensionality reduction and abstraction on the transformed Mel power spectrum m, and The corresponding ASR speech features are obtained. Compared with the Fourier transform, the result of the discrete cosine transform has no imaginary part, and has obvious advantages in terms of calculation.
步骤S21-S24基于ASR技术对目标待区分语音数据进行特征提取的处理,最终获取的ASR语音特征能够很好地体现目标待区分语音数据,该ASR语音特征能够在深度网络模型训练获取得到ASR-RNN模型,使训练获取的ASR-RNN模型在进行语音区分时的结果更为精确,即使在噪音很大的条件下,也可以精确地将噪音和语音区分开来。Steps S21-S24 are based on the ASR technology to perform feature extraction on the target to-be-differentiated voice data. The final obtained ASR speech feature can well reflect the target to-be-differentiated voice data. The ASR voice feature can be obtained by deep network model training to obtain ASR- The RNN model makes the ASR-RNN model obtained during training more accurate when distinguishing speech, and can accurately distinguish noise from speech even under very noisy conditions.
需要说明的是,以上提取的特征为梅尔频率倒谱系数,在这里不应将ASR语音特征限定为只有梅尔频率倒谱系数一种,而应当认为采用ASR技术获取的语音特征,只要能够有效反映语音数据特征,都是可以作为ASR语音特征进行识别和模型训练的。It should be noted that the features extracted above are Mel frequency cepstrum coefficients. Here, the ASR speech features should not be limited to only Mel frequency cepstrum coefficients. Instead, it should be considered that the speech features obtained by ASR technology can be used as long as they can The features that effectively reflect speech data can be used as ASR speech features for recognition and model training.
S30:将ASR语音特征输入到预先训练好的ASR-RNN模型中进行区分,获取目标区分结果。S30: The ASR speech features are input into a pre-trained ASR-RNN model for discrimination, and a target discrimination result is obtained.
其中,ASR-RNN模型是指采用ASR语音特征训练得到的循环神经网络模型,RNN即指循环神经网络(Recurrent neural networks)。该ASR-RNN模型是采用待训练语音数据提取的ASR语音特征进行训练得到的,因此该模型能够识别ASR语音特征,从而根据ASR语音特征区分语音。具体地,待训练语音数据包括目标语音和噪音,在进行ASR-RNN模型训练时提取目标语音的ASR语音特征和噪音的ASR语音特征,使得训练获取的ASR-RNN模型能够根据ASR语音特征识别目标语音和干扰语音中的噪音(在采用VAD区分原始待区分语音数据时已经去除了大部分的干扰语音,如语音数据中由于静默而没有发音的语音部分和一部分噪音,所以这里ASR-DBN模型区分的干扰语音具体是指噪音部分),实现对目标语音和干扰语音进行有效区分的目的。Among them, the ASR-RNN model refers to a recurrent neural network model trained using ASR speech features, and RNN refers to recurrent neural networks. The ASR-RNN model is trained using ASR speech features extracted from the speech data to be trained, so the model can recognize ASR speech features and distinguish speech based on ASR speech features. Specifically, the speech data to be trained includes target speech and noise. When performing ASR-RNN model training, the ASR speech feature of the target speech and the ASR speech feature of the noise are extracted, so that the ASR-RNN model obtained by training can recognize the target based on the ASR speech feature. Noise in speech and interfering speech (when VAD is used to distinguish the original to-be-differentiated speech data, most of the interfering speech has been removed, such as the speech data and part of the noise that are not pronounced due to silence in the speech data, so the ASR-DBN model distinguishes The interference speech specifically refers to the noise part), to achieve the purpose of effectively distinguishing between the target speech and the interference speech.
本实施例中,将ASR语音特征输入到预先训练好的ASR-RNN模型中进行区分,由于ASR语音特征能够反映语音数据的特征,因此可以根据ASR-RNN模型对目标待区分语音数据提取的ASR语音特征进行识别,从而根据ASR语音特征对目标待区分语音数据作出精确的语音区分。该预先训练好的ASR-RNN模型结合了ASR语音特征和循环神经网络对特征进行深层提取的特点,从语音数据的ASR语音特征上对语音进行了区分,在噪音条件非常恶劣的情况下仍然有很高的精确率。具体地,由于ASR提取的特征也包含了噪音的ASR语音特征,因此,在该ASR-RNN模型中,噪音也是可以精确地进行区分,解决当前语音区分方法(包括但不限于VAD)在噪音影响较大的条件下无法有效进行语音区分的问题。In this embodiment, the ASR voice features are input into a pre-trained ASR-RNN model to distinguish them. Since the ASR voice features can reflect the characteristics of the voice data, the ASR of the target to be distinguished voice data can be extracted according to the ASR-RNN model. The speech features are recognized, so that the target speech data to be distinguished is accurately distinguished based on the ASR speech features. This pre-trained ASR-RNN model combines the features of ASR speech features and recurrent neural network to extract features in depth, and distinguishes speech from the ASR speech features of speech data. It is still available under very bad noise conditions. Very high accuracy. Specifically, since the features extracted by ASR also include the ASR speech features of noise, in this ASR-RNN model, noise can also be accurately distinguished, and the current speech discrimination methods (including but not limited to VAD) are affected by noise. The problem that the speech cannot be effectively distinguished under larger conditions.
在一具体实施方式中,步骤S30,在将ASR语音特征输入到预先训练好的ASR-RNN模型中进行区分,获取目标区分结果的步骤之前,语音区分方法还包括如下步骤:获取ASR-RNN模型。In a specific implementation, step S30, before the steps of inputting ASR voice features into a pre-trained ASR-RNN model to distinguish and obtain a target discrimination result, the voice discrimination method further includes the following steps: obtaining an ASR-RNN model .
如图7所示,获取ASR-RNN模型的步骤具体包括:As shown in FIG. 7, the steps of obtaining the ASR-RNN model include:
S31:获取待训练语音数据,并提取待训练语音数据的待训练ASR语音特征。S31: Acquire speech data to be trained, and extract speech features of the ASR to be trained.
其中,待训练语音数据是指训练ASR-RNN模型所需的语音数据训练样本集,该待训练语音数据可以是直接采用开源的语音训练集,或者是通过收集大量样本语音数据的语音训练集。该待训练语音数据是将目标语音和干扰语音(在这里具体为噪音)提前区分好的,区分采取的具体方式可以是对目标语音和噪音分别设置不同的标签值。例如,将待训练语音数据中的目标语音部分都标记为1(代表“真”),将噪音部分都标记为0(代表“假”),通过提前设置的标签值可以检验ASR-RNN模型识别的精确度,以便提供改进的参考,更新ASR-RNN模型中的网络参数,不断优化ASR-RNN模型。本实施例中,目标语音和噪音的比例具体可以取1:1,采用该比例能够避免因待训练语音数据中目标语音和噪音数量不相同而出现过拟合现象。其中,过拟合是指为了得到一致假设而使假设变得过度严格的现象,避免过拟合是分类器设计中的一个核心任务。The voice data to be trained refers to a training set of voice data required for training the ASR-RNN model. The voice data to be trained may be an open source voice training set directly, or a voice training set by collecting a large amount of sample voice data. The to-be-trained voice data distinguishes the target voice and the interfering voice (here, specifically noise) in advance, and a specific method for distinguishing may be to set different label values for the target voice and noise respectively. For example, all target speech parts in the speech data to be trained are marked as 1 (representing "true"), and noisy parts are marked as 0 (representing "false"). The ASR-RNN model recognition can be tested by setting the label value in advance Accuracy in order to provide improved references, update network parameters in the ASR-RNN model, and continuously optimize the ASR-RNN model. In this embodiment, the ratio of the target voice and the noise may specifically be 1: 1, and adopting this ratio can avoid overfitting due to different target voice and noise amounts in the voice data to be trained. Among them, overfitting refers to the phenomenon that the assumptions become too strict in order to obtain a consistent hypothesis. Avoiding overfitting is a core task in classifier design.
本实施例中,获取待训练语音数据,并提取该待训练语音数据的特征,该特征即待训练ASR语音特征,提取待训练ASR语音特征的步骤与步骤S21-S24相同,在此不再赘述。待训练语音数据包括目标语音的训练样本和噪音的训练样本,这两部分语音数据都有各自的ASR语音特征,因此,可以提取并采用 待训练ASR语音特征训练相对应的ASR-RNN模型,使得根据该待训练ASR语音特征训练获取的ASR-RNN模型可以精确地区分目标语音和噪音(噪音属于干扰语音)。In this embodiment, the voice data to be trained is obtained and the feature of the voice data to be trained is extracted. This feature is the voice feature of the ASR to be trained. The steps of extracting the voice feature of the ASR to be trained are the same as steps S21-S24, and will not be repeated here . The speech data to be trained includes training samples of the target speech and training samples of noise. Both parts of the speech data have their own ASR speech features. Therefore, the corresponding ASR-RNN model can be extracted and trained using the ASR speech features to be trained, so that The ASR-RNN model obtained by training the ASR speech features to be trained can accurately distinguish the target speech and noise (noise belongs to interference speech).
S32:初始化RNN模型。S32: Initialize the RNN model.
其中,RNN模型即循环神经网络模型。RNN模型包括由神经元组成的输入层、隐藏层和输出层。RNN模型包括各层之间各个神经元连接的权值和偏置,这些权值和偏置决定了RNN模型的性质及识别效果。与传统的神经网络如DNN(Deep Neural Network,深度神经网络)相比,RNN是一种对序列数据(如时间序列)建模的神经网络,即一个序列当前的输出与前面的输出有关。具体的表现形式为网络会对前面的隐藏层状态进行记忆并应用于当前输出的计算中,即隐藏层之间的节点不再是无连接而是有连接的,隐藏层的输入不仅包括输入层的输出还包括上一时刻隐藏层的输出。由于语音数据具有时序上的特点,因此可以采用待训练语音数据训练RNN模型,精确提取目标语音和干扰语音在时序上各自的深层特征,实现语音的精确区分。Among them, the RNN model is a recurrent neural network model. The RNN model includes an input layer, a hidden layer, and an output layer composed of neurons. The RNN model includes the weights and biases of each neuron connection between the layers. These weights and biases determine the nature and recognition effect of the RNN model. Compared with traditional neural networks such as DNN (Deep Neural Network, Deep Neural Network), RNN is a neural network that models sequence data (such as time series), that is, the current output of a sequence is related to the previous output. The specific expression is that the network will remember the state of the previous hidden layer and apply it to the current output calculation, that is, the nodes between the hidden layers are no longer unconnected but connected, and the input of the hidden layer includes not only the input layer The output also includes the output of the hidden layer at the previous moment. Due to the temporal characteristics of the speech data, the RNN model can be trained with the speech data to be trained to accurately extract the respective deep features of the target speech and the interfering speech in time to achieve accurate speech discrimination.
本实施例中,初始化RNN模型,该初始化操作即设置RNN模型中权值和偏置的初始值,该初始值初始设置时可以设置为较小的值,如设置在区间[-0.3-0.3]之间。合理的初始化RNN模型可以使模型在初期有较灵活的调整能力,可以在模型训练过程中对模型进行有效的调整,而不会使模型在初始阶段的调整能力就很差,导致训练出的模型区分效果不好。In this embodiment, the RNN model is initialized. This initialization operation is to set the initial values of weights and offsets in the RNN model. The initial value can be set to a smaller value when initially set, such as in the interval [-0.3-0.3] between. Reasonable initialization of the RNN model can make the model have more flexible adjustment capabilities in the early stage. The model can be adjusted effectively during the model training process without making the model's adjustment capability in the initial stage very poor, resulting in a trained model. The distinction is not good.
S33:将待训练ASR语音特征输入到RNN模型中,根据前向传播算法获取RNN模型的输出值,输出值表示为:
\hat{y}_{t} = \sigma\!\left( V h_{t} + c \right)
σ表示激活函数,V表示隐藏层和输出层之间连接的权值,h t表示t时刻的隐藏状态,c表示隐藏层和输出层之间的偏置。
S33: Input the ASR speech features to be trained into the RNN model, and obtain the output value of the RNN model according to the forward propagation algorithm. The output value is expressed as:
\hat{y}_{t} = \sigma\!\left( V h_{t} + c \right)
σ represents the activation function, V represents the weight of the connection between the hidden layer and the output layer, h t represents the hidden state at time t, and c represents the offset between the hidden layer and the output layer.
本实施例中,RNN前向传播的过程即根据RNN模型中连接各个神经元的权值、偏置和输入的待训练ASR语音特征按照时间序列在RNN模型中进行的一系列线性运算和激活运算,得到的RNN模型中网络每一层的输出值。特别地,由于RNN是对序列(这里具体可以是时间序列)数据进行建模的神经网络,在计算t时刻隐藏层的隐藏状态h t时,需要根据t-1时刻的隐层状态h t-1和t时刻输入的待训练ASR语音特征共同求得。由RNN模型前向传播的过程,可以得到RNN模型的前向传播算法:对于任意时刻t,根据输入的待训练ASR语音特征从RNN模型的输入层计算到隐藏层的输出,该隐藏层的输出(即隐藏状态h t)表示为:h t=σ(Ux t+Wh t-1+b),其中,σ表示激活函数(这里具体可以采用tanh激活函数,tanh在循环过程中会不断扩大待训练ASR语音特征的特征之间的区别,有利于区分目标语音和噪音),U表示输入层到隐藏层之间连接的权值,W表示隐藏层之间连接的权值(由时间序列实现的隐藏层之间的连接),h t-1表示t-1时刻的隐藏状态,b表示输入层和隐藏层之间的偏置。从RNN模型的隐藏层计算到输出层的输出,该输出层的输出(即RNN模型的输出值)表示为
\hat{y}_{t} = \sigma\!\left( V h_{t} + c \right)
其中,这里的激活函数具体采用的可以是softmax函数(该softmax函数用于分类问题效果比较好),V表示隐藏层和输出层之间连接的权值,h t表示t时刻的隐藏状态,c表示隐藏层和输出层之间的偏置。该RNN模型的输出值(输出层的输出)
\hat{y}_{t}
即为通过前向传播算法按序列一层层计算得到的输出值,可以称为预测输出值。服务器获取RNN模型的输出值后,可以根据该输出值更新、调整RNN模型中的网络参数(权值和偏置),以使获取的RNN模型能够根据语音具有的时序性特点进行区分,通过目标语音的ASR语音特征和干扰语音的ASR语音特征及在时序上表现的不同,得到精确的识别结果。
In this embodiment, the process of RNN forward propagation is a series of linear operations and activation operations performed in the RNN model according to the time series according to the weighted, biased, and input ASR speech features of each neuron in the RNN model. To get the output value of each layer of the network in the RNN model. In particular, since the RNN is a neural network that models sequence (specifically, time series) data, when calculating the hidden state h t of the hidden layer at time t , it is necessary to calculate the hidden layer state h t- at time t-1 The ASR speech features input at times 1 and t are obtained together. From the process of RNN model forward propagation, the RNN model's forward propagation algorithm can be obtained: for any time t, according to the input ASR speech features to be trained, it is calculated from the input layer of the RNN model to the output of the hidden layer, and the output of the hidden layer (That is, the hidden state h t ) is expressed as: h t = σ (Ux t + Wh t-1 + b), where σ represents the activation function (specifically, the tanh activation function can be used here, and tanh will continue to expand during the cycle. Train the differences between the features of the ASR speech features to help distinguish the target speech from noise), U represents the weight of the connection between the input layer and the hidden layer, and W represents the weight of the connection between the hidden layers (implemented by time series Connection between hidden layers), h t-1 represents the hidden state at t-1, and b represents the offset between the input layer and the hidden layer. From the hidden layer of the RNN model to the output of the output layer, the output of the output layer (that is, the output value of the RNN model) is expressed as
\hat{y}_{t} = \sigma\!\left( V h_{t} + c \right)
Among them, the activation function used here can be a softmax function (the softmax function is better for classification problems), V represents the weight of the connection between the hidden layer and the output layer, h t represents the hidden state at time t, c Represents the offset between the hidden layer and the output layer. The output value of this RNN model (the output of the output layer)
\hat{y}_{t}
That is, the output value calculated by the layer by layer through the forward propagation algorithm can be called the predicted output value. After the server obtains the output value of the RNN model, it can update and adjust the network parameters (weights and offsets) in the RNN model according to the output value, so that the obtained RNN model can be distinguished according to the time-series characteristics of the voice. The difference between the ASR voice characteristics of the speech and the ASR voice characteristics of the interfering speech and their timing performance results in accurate recognition results.
S34:基于输出值进行误差反传,更新RNN模型各层的权值和偏置,获取ASR-RNN模型,其中,更新权值V的公式为:
V' = V - \alpha \sum_{t=1}^{\tau} \left( \hat{y}_{t} - y_{t} \right) \left( h_{t} \right)^{T}
V表示更新前隐藏层和输出层之间连接的的权值,V' 表示更新后隐藏层和输出层之间连接的的权值,α表示学习率,t表示t时刻,τ表示总时长,
\hat{y}_{t}
表示预测输出值,y t表示真实输出值,h t表示t时刻的隐藏状态,T表示矩阵转置运算;更新偏置c的公式为:
c' = c - \alpha \sum_{t=1}^{\tau} \left( \hat{y}_{t} - y_{t} \right)
c表示更新前隐藏层和输出层之间的偏置,c'表示更新后隐藏层和输出层之间的偏置;更新权值U的公式为:
U' = U - \alpha \sum_{t=1}^{\tau} \operatorname{diag}\!\left( 1 - h_{t} \odot h_{t} \right) \delta_{t} \left( x_{t} \right)^{T}
U表示更新前输入层到隐藏层之间连接的权值,U'表示更新后输入层到隐藏层之间连接的权值,diag()表示构造一个对角矩阵或者以向量的形式返回一个矩阵上对角线元素的矩阵运算,δ t表示隐藏层状态的梯度,x t表示t时刻输入的待训练ASR语音特征;更新权值W的公式为:
W' = W - \alpha \sum_{t=1}^{\tau} \operatorname{diag}\!\left( 1 - h_{t} \odot h_{t} \right) \delta_{t} \left( h_{t-1} \right)^{T}
W表示更新前隐藏层之间连接的权值,W'表示更新后隐藏层之间连接的权值;更新偏置b的公式为:
b' = b - \alpha \sum_{t=1}^{\tau} \operatorname{diag}\!\left( 1 - h_{t} \odot h_{t} \right) \delta_{t}
b表示更新前输入层和隐藏层之间的偏置,b'表示更新后输入层和隐藏层之间的偏置。
S34: Perform error back propagation based on the output value, update the weights and offsets of each layer of the RNN model, and obtain the ASR-RNN model. The formula for updating the weight V is:
V' = V - \alpha \sum_{t=1}^{\tau} \left( \hat{y}_{t} - y_{t} \right) \left( h_{t} \right)^{T}
V represents the weight of the connection between the hidden layer and the output layer before the update, V 'represents the weight of the connection between the hidden layer and the output layer after the update, α represents the learning rate, t represents the time t, and τ represents the total duration,
\hat{y}_{t}
represents the predicted output value, y_t represents the real output value, h_t represents the hidden state at time t, and T represents the matrix transposition operation; the formula for updating the offset c is:
c' = c - \alpha \sum_{t=1}^{\tau} \left( \hat{y}_{t} - y_{t} \right)
c represents the offset between the hidden layer and the output layer before the update, c 'represents the offset between the hidden layer and the output layer after the update; the formula for updating the weight U is:
U' = U - \alpha \sum_{t=1}^{\tau} \operatorname{diag}\!\left( 1 - h_{t} \odot h_{t} \right) \delta_{t} \left( x_{t} \right)^{T}
U represents the weight of the connection between the input layer and the hidden layer before the update, U' represents the weight of the connection between the input layer and the hidden layer after the update, diag() denotes the matrix operation that constructs a diagonal matrix from a vector (or returns the diagonal elements of a matrix as a vector), δ_t represents the gradient of the hidden layer state, and x_t represents the to-be-trained ASR speech feature input at time t; the formula for updating the weight W is:
W' = W - \alpha \sum_{t=1}^{\tau} \operatorname{diag}\!\left( 1 - h_{t} \odot h_{t} \right) \delta_{t} \left( h_{t-1} \right)^{T}
W represents the weight of the connections between the hidden layers before the update, W 'represents the weight of the connections between the hidden layers after the update; the formula for updating the offset b is:
b' = b - \alpha \sum_{t=1}^{\tau} \operatorname{diag}\!\left( 1 - h_{t} \odot h_{t} \right) \delta_{t}
b indicates the offset between the input layer and the hidden layer before the update, and b 'indicates the offset between the input layer and the hidden layer after the update.
本实施例中,服务端在根据前向传播算法获取RNN模型的输出值(预测输出值)
\hat{y}_{t}
后,可以根据
\hat{y}_{t}
与预先设置好标签值的待训练ASR语音特征,计算待训练ASR语音特征在该RNN模型训练时产生的误差,并根据该误差构建合适的误差函数(如采用对数误差函数来表示产生的误差)。服务端再采用该误差函数进行误差反传,调整、更新RNN模型各层的权值(U、W和V)和权值(b和c)。具体地,预先设置好的标签值可以称为真实输出值(即代表客观事实,标签值1代表目标语音,标签值为0代表干扰语音),用y t表示。在训练RNN模型的过程中,时间序列上RNN模型在每一层计算前向输出时都有误差,衡量该误差可以采用误差函数L,表示为:
L = \sum_{t=1}^{\tau} L_{t}
其中,t即指t时刻,τ表示总时长,L t表示由误差函数表示的在t时刻产生的误差。服务端得到误差函数后,可以根据BPTT(Back Propagation Trough Time,基于时间的反向传播算法)更新RNN模型的权值和偏置,获取基于待训练ASR语音特征的ASR-RNN模型。具体地,更新权值V的公式为:
V' = V - \alpha \sum_{t=1}^{\tau} \left( \hat{y}_{t} - y_{t} \right) \left( h_{t} \right)^{T}
其中,V表示更新前隐藏层和输出层之间连接的的权值,V'表示更新后隐藏层和输出层之间连接的的权值,α表示学习率,
\hat{y}_{t}
表示预测输出值,y t表示真实输出值,h t表示t时刻的隐藏状态,T表示矩阵转置运算。更新偏置c的公式为:
c' = c - \alpha \sum_{t=1}^{\tau} \left( \hat{y}_{t} - y_{t} \right)
c表示更新前隐藏层和输出层之间的偏置,c'表示更新后隐藏层和输出层之间的 偏置。相比较于权值V和偏置c,权值U、权值W和偏置b,在反向传播时,某一时刻t的梯度损失由当前位置的输出对应的梯度损失和t+1时刻的梯度损失两部分共同决定。因此权值U、权值W和偏置b的更新需要借助隐藏层状态的梯度δ t得到。t序列时刻隐藏层状态的梯度δ t表示为:
\delta_{t} = \frac{\partial L}{\partial h_{t}}
δ t+1与δ t之间存在联系,根据δ t+1可以求得δ t,其联系的表达式为:
\delta_{t} = W^{T} \operatorname{diag}\!\left( 1 - h_{t+1} \odot h_{t+1} \right) \delta_{t+1} + V^{T} \left( \hat{y}_{t} - y_{t} \right)
其中,δ t+1表示t+1序列时刻隐藏层状态的梯度,diag()表示一种矩阵运算的计算函数,该计算函数用于构造一个对角矩阵或者以向量的形式返回一个矩阵上对角线元素,h t+1表示t+1序列时刻的隐藏层状态。则可以通过得到τ时刻隐藏层状态的梯度δ τ,利用δ t+1与δ t之间联系的表达式
\delta_{t} = W^{T} \operatorname{diag}\!\left( 1 - h_{t+1} \odot h_{t+1} \right) \delta_{t+1} + V^{T} \left( \hat{y}_{t} - y_{t} \right)
由δ τ一层层反向传播递推得到δ t。由于δ τ后面没有其他的时刻,因此根据梯度计算可以直接得到:
\delta_{\tau} = V^{T} \left( \hat{y}_{\tau} - y_{\tau} \right)
则可以根据δ τ递推求得δ t。得到δ t后,即可以计算权值U、权值W和偏置b。更新权值U的公式为:
U' = U - \alpha \sum_{t=1}^{\tau} \operatorname{diag}\!\left( 1 - h_{t} \odot h_{t} \right) \delta_{t} \left( x_{t} \right)^{T}
U表示更新前输入层到隐藏层之间连接的权值,U'表示更新后输入层到隐藏层之间连接的权值,diag()表示构造一个对角矩阵或者以向量的形式返回一个矩阵上对角线元素的矩阵运算,δ t表示隐藏层状态的梯度,x t表示t时刻输入的待训练ASR语音特征;更新权值W的公式为:
W' = W - \alpha \sum_{t=1}^{\tau} \operatorname{diag}\!\left( 1 - h_{t} \odot h_{t} \right) \delta_{t} \left( h_{t-1} \right)^{T}
W表示更新前隐藏层之间连接的权值,W'表示更新后隐藏层之间连接的权值;更新偏置b的公式为:
b' = b - \alpha \sum_{t=1}^{\tau} \operatorname{diag}\!\left( 1 - h_{t} \odot h_{t} \right) \delta_{t}
b表示更新前输入层和隐藏层之间的偏置,b'表示更新后输入层和隐藏层之间的偏置。当所有权值和偏置的变化值都小于停止迭代阈值∈时,即可停止训练;或者,训练达到最大迭代次数MAX时,停止训练。通过待训练ASR语音特征在RNN模型中的预测输出值和预先设置好的标签值(真实输出值)之间产生的误差,基于该误差实现RNN模型各层权值和偏置的更新,使得最终获取的ASR-RNN模型能够根据ASR语音特征,训练并学习关于时间序列的深层特征,实现精确区分语音的目的。
In this embodiment, after the server obtains the output value (predicted output value) of the RNN model
Figure PCTCN2018094190-appb-000017
according to the forward propagation algorithm, it can use
Figure PCTCN2018094190-appb-000018
together with the to-be-trained ASR speech features whose label values have been set in advance to calculate the error produced by those features during training of the RNN model, and construct a suitable error function from that error (for example, a logarithmic error function). The server then uses this error function for error back propagation, adjusting and updating the weights (U, W, and V) and offsets (b and c) of each layer of the RNN model. Specifically, the preset label value may be called the real output value (it represents the objective fact: a label value of 1 represents the target speech and a label value of 0 represents the interfering speech) and is denoted y_t. During training of the RNN model, an error arises each time the model computes the forward output of a layer along the time series; this error can be measured with an error function L, expressed as:
Figure PCTCN2018094190-appb-000019
Here t refers to time t, τ represents the total duration, and L_t represents the error produced at time t as expressed by the error function. After the server obtains the error function, it can update the weights and offsets of the RNN model according to BPTT (Back Propagation Through Time) to obtain the ASR-RNN model based on the to-be-trained ASR speech features. Specifically, the formula for updating the weight V is:
Figure PCTCN2018094190-appb-000020
Among them, V represents the weight of the connection between the hidden layer and the output layer before the update, V 'represents the weight of the connection between the hidden layer and the output layer after the update, and α represents the learning rate,
Figure PCTCN2018094190-appb-000021
Represents the predicted output value, y t represents the real output value, h t represents the hidden state at time t, and T represents the matrix transposition operation. The formula for updating the offset c is:
Figure PCTCN2018094190-appb-000022
c represents the offset between the hidden layer and the output layer before the update, and c' represents the offset between the hidden layer and the output layer after the update. Unlike the weight V and the offset c, for the weight U, the weight W, and the offset b, the gradient loss at a given time t during back propagation is determined jointly by two parts: the gradient loss corresponding to the output at the current position and the gradient loss at time t+1. Therefore, updating the weight U, the weight W, and the offset b requires the gradient δ_t of the hidden layer state. The gradient δ_t of the hidden layer state at sequence time t is expressed as:
Figure PCTCN2018094190-appb-000023
There is a relationship between δ_{t+1} and δ_t, so δ_t can be obtained from δ_{t+1}; the relationship is expressed as:
Figure PCTCN2018094190-appb-000024
Here δ_{t+1} represents the gradient of the hidden layer state at sequence time t+1, diag() denotes the matrix operation that constructs a diagonal matrix or returns the diagonal elements of a matrix in the form of a vector, and h_{t+1} represents the hidden layer state at sequence time t+1. Thus, by obtaining the gradient δ_τ of the hidden layer state at time τ and using the expression relating δ_{t+1} and δ_t,
Figure PCTCN2018094190-appb-000025
δ_t can be obtained by back-propagating recursively, layer by layer, from δ_τ. Since there is no later time after δ_τ, it can be obtained directly from the gradient calculation:
Figure PCTCN2018094190-appb-000026
δ_t can then be obtained recursively from δ_τ. Once δ_t is obtained, the weight U, the weight W, and the offset b can be updated. The formula for updating the weight U is:
Figure PCTCN2018094190-appb-000027
U represents the weight of the connection between the input layer and the hidden layer before the update, U' represents the weight of the connection between the input layer and the hidden layer after the update, diag() denotes the matrix operation that constructs a diagonal matrix or returns the diagonal elements of a matrix in the form of a vector, δ_t represents the gradient of the hidden layer state, and x_t represents the to-be-trained ASR speech feature input at time t; the formula for updating the weight W is:
Figure PCTCN2018094190-appb-000028
W represents the weight of the connections between the hidden layers before the update, W 'represents the weight of the connections between the hidden layers after the update; the formula for updating the offset b is:
Figure PCTCN2018094190-appb-000029
b indicates the offset between the input layer and the hidden layer before the update, and b' indicates the offset between the input layer and the hidden layer after the update. Training can be stopped when the changes of all the weights and offsets are smaller than the stop-iteration threshold ε, or when the maximum number of iterations MAX is reached. Based on the error between the predicted output values produced by the to-be-trained ASR speech features in the RNN model and the preset label values (real output values), the weights and offsets of each layer of the RNN model are updated, so that the finally obtained ASR-RNN model can learn deep time-series features from the ASR speech features and accurately distinguish speech.
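To make the above training procedure concrete, the following Python/NumPy sketch implements one BPTT update for a simple RNN of this form. It is an illustration under stated assumptions rather than the implementation of this application: it assumes a tanh recurrence h_t = tanh(U x_t + W h_{t-1} + b), a sigmoid output ŷ_t = σ(V h_t + c), a logarithmic (cross-entropy) error, and illustrative layer sizes, learning rate α, threshold ε, and MAX.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bptt_update(X, y, U, W, V, b, c, alpha=0.01):
    # X: (tau, n_in) to-be-trained ASR feature frames; y: (tau,) labels, 1 = target speech, 0 = interfering speech
    tau, n_hidden = X.shape[0], U.shape[0]
    h = np.zeros((tau + 1, n_hidden))            # h[0] is the initial hidden state; h[t+1] is the state at time t
    y_hat = np.zeros(tau)
    for t in range(tau):                         # forward propagation
        h[t + 1] = np.tanh(U @ X[t] + W @ h[t] + b)
        y_hat[t] = sigmoid(V @ h[t + 1] + c)
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    db, dc = np.zeros_like(b), 0.0
    delta_next = np.zeros(n_hidden)              # delta_{t+1}; zero beyond the last time step
    for t in reversed(range(tau)):               # error back propagation through time
        dout = y_hat[t] - y[t]                   # gradient of the logarithmic error at the output
        dV += dout * h[t + 1]
        dc += dout
        delta = V * dout                         # contribution of the output at time t
        if t + 1 < tau:                          # plus the contribution propagated back from time t+1
            delta += W.T @ (delta_next * (1.0 - h[t + 2] ** 2))
        pre = delta * (1.0 - h[t + 1] ** 2)      # diag(1 - h_t^2) applied to delta_t
        dU += np.outer(pre, X[t])
        dW += np.outer(pre, h[t])
        db += pre
        delta_next = delta
    return U - alpha * dU, W - alpha * dW, V - alpha * dV, b - alpha * db, c - alpha * dc

# Illustrative training loop with the two stopping rules described above.
rng = np.random.default_rng(0)
n_in, n_hidden, MAX, EPS = 13, 16, 100, 1e-4
U = rng.normal(scale=0.1, size=(n_hidden, n_in))
W = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
V = rng.normal(scale=0.1, size=n_hidden)
b, c = np.zeros(n_hidden), 0.0
X = rng.normal(size=(50, n_in))                  # dummy sequence of 50 feature frames
y = (rng.random(50) > 0.5).astype(float)         # dummy 0/1 frame labels
for _ in range(MAX):
    old = np.concatenate([U.ravel(), W.ravel(), V.ravel(), b.ravel(), [c]])
    U, W, V, b, c = bptt_update(X, y, U, W, V, b, c)
    new = np.concatenate([U.ravel(), W.ravel(), V.ravel(), b.ravel(), [c]])
    if np.max(np.abs(new - old)) < EPS:          # all weight and offset changes below epsilon
        break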
Steps S31-S34 train the RNN model with the to-be-trained ASR speech features, so that the trained ASR-RNN model can learn deep sequence (timing) features from the ASR speech features and can effectively distinguish speech based on the ASR speech features of the target speech and the interfering speech together with the timing factor. Even under severe noise interference, the target speech can still be accurately separated from the noise.
In the speech differentiation method provided in this embodiment, the original to-be-differentiated voice data is first processed based on a voice activity detection (VAD) algorithm to obtain the target to-be-differentiated voice data. Distinguishing the original voice data once with the voice activity detection algorithm yields target to-be-differentiated voice data with a smaller range, which preliminarily and effectively removes interfering voice data from the original data while retaining the portion in which target speech and interfering speech are mixed; taking this as the target to-be-differentiated voice data provides an effective preliminary discrimination and removes a large amount of interfering speech. Then, the corresponding ASR speech features are obtained from the target to-be-differentiated voice data. These ASR speech features make the result of speech discrimination more accurate: even under very noisy conditions, interfering speech (such as noise) can be accurately separated from the target speech, which provides an important technical premise for the subsequent ASR-RNN model recognition based on those features. Finally, the ASR speech features are input into the pre-trained ASR-RNN model for discrimination to obtain the target discrimination result. The ASR-RNN model is a recognition model specially trained, based on the ASR speech features extracted from the speech data to be trained and the temporal characteristics of speech, to distinguish speech effectively; it can correctly separate the target speech from the interfering speech (since VAD has already been applied once, the interfering speech here mostly refers to noise) in the target to-be-differentiated voice data, improving the accuracy of speech discrimination.
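For orientation, the three-stage flow summarized in the preceding paragraph can be sketched as a single function; the helper names and the predict interface below are hypothetical placeholders introduced for illustration, not components defined by this application.

def differentiate_speech(raw_frames, vad_screen, extract_asr_features, asr_rnn_model):
    # raw_frames: original to-be-differentiated speech, one array per frame
    # vad_screen: callable applying the short-time energy / zero-crossing-rate screening (step 1)
    # extract_asr_features: callable returning ASR (MFCC-style) features for the kept frames (step 2)
    # asr_rnn_model: pre-trained model with a predict(features) method returning per-frame scores (step 3)
    target_frames = vad_screen(raw_frames)
    features = extract_asr_features(target_frames)
    scores = asr_rnn_model.predict(features)
    # a score near 1 marks target speech, a score near 0 marks interfering speech or noise
    return [frame for frame, score in zip(target_frames, scores) if score >= 0.5]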
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of this application.
FIG. 8 shows a schematic block diagram of a speech differentiation apparatus corresponding one-to-one to the speech differentiation method of the embodiment. As shown in FIG. 8, the speech differentiation apparatus includes a target to-be-differentiated voice data acquisition module 10, a voice feature acquisition module 20, and a target discrimination result acquisition module 30. The functions implemented by these modules correspond one-to-one to the steps of the speech differentiation method in the embodiment; to avoid repetition, they are not described in detail here.
The target to-be-differentiated voice data acquisition module 10 is configured to process the original to-be-differentiated voice data based on a voice activity detection algorithm to obtain the target to-be-differentiated voice data.
The voice feature acquisition module 20 is configured to obtain the corresponding ASR voice features based on the target to-be-differentiated voice data.
The target discrimination result acquisition module 30 is configured to input the ASR voice features into the pre-trained ASR-RNN model for discrimination and obtain the target discrimination result.
Preferably, the target to-be-differentiated voice data acquisition module 10 includes a first original distinguished speech data acquisition unit 11, a second original distinguished speech data acquisition unit 12, and a target to-be-differentiated voice data acquisition unit 13.
The first original distinguished speech data acquisition unit 11 is configured to process the original to-be-differentiated voice data according to the short-time energy feature value calculation formula, obtain the corresponding short-time energy feature values, and retain the original to-be-differentiated data whose short-time energy feature value is greater than a first threshold as the first original distinguished speech data, where the short-time energy feature value calculation formula is
Figure PCTCN2018094190-appb-000030
N is the speech frame length, s(n) is the signal amplitude in the time domain, and n is time.
The second original distinguished speech data acquisition unit 12 is configured to process the original to-be-differentiated voice data according to the zero-crossing rate feature value calculation formula, obtain the corresponding zero-crossing rate feature values, and retain the original to-be-differentiated voice data whose zero-crossing rate feature value is less than a second threshold as the second original distinguished speech data, where the zero-crossing rate feature value calculation formula is
Figure PCTCN2018094190-appb-000031
N is the speech frame length, s(n) is the signal amplitude in the time domain, and n is time.
The target to-be-differentiated voice data acquisition unit 13 is configured to use the first original distinguished speech data and the second original distinguished speech data as the target to-be-differentiated voice data.
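A minimal sketch of the screening performed by units 11-13 is given below; it assumes the signal has already been cut into frames, and the two thresholds are illustrative values rather than ones specified by this application.

import numpy as np

def short_time_energy(frame):
    # sum over the frame of the squared time-domain amplitude s(n)
    return float(np.sum(frame.astype(float) ** 2))

def zero_crossing_rate(frame):
    # fraction of adjacent sample pairs whose signs differ
    signs = np.sign(frame)
    return float(np.sum(np.abs(np.diff(signs))) / (2.0 * len(frame)))

def vad_screen(frames, energy_threshold=0.1, zcr_threshold=0.3):
    # unit 11 keeps frames whose short-time energy exceeds the first threshold,
    # unit 12 keeps frames whose zero-crossing rate is below the second threshold,
    # and unit 13 takes their union as the target to-be-differentiated voice data.
    return [f for f in frames
            if short_time_energy(f) > energy_threshold or zero_crossing_rate(f) < zcr_threshold]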
Preferably, the voice feature acquisition module 20 includes a pre-processed voice data acquisition unit 21, a power spectrum acquisition unit 22, a Mel power spectrum acquisition unit 23, and a Mel frequency cepstrum coefficient unit 24.
The pre-processing unit 21 is configured to pre-process the target to-be-differentiated voice data to obtain pre-processed voice data.
The power spectrum acquisition unit 22 is configured to perform a fast Fourier transform on the pre-processed voice data to obtain the frequency spectrum of the target to-be-differentiated voice data, and to obtain the power spectrum of the target to-be-differentiated voice data from the frequency spectrum.
The Mel power spectrum acquisition unit 23 is configured to process the power spectrum of the target to-be-differentiated voice data with a Mel-scale filter bank to obtain the Mel power spectrum of the target to-be-differentiated voice data.
The Mel frequency cepstrum coefficient unit 24 is configured to perform cepstrum analysis on the Mel power spectrum to obtain the Mel frequency cepstrum coefficients of the target to-be-differentiated voice data.
Preferably, the pre-processing unit 21 includes a pre-emphasis sub-unit 211, a framing sub-unit 212, and a windowing sub-unit 213.
The pre-emphasis sub-unit 211 is configured to perform pre-emphasis processing on the target to-be-differentiated voice data, where the pre-emphasis formula is s'_n = s_n - a*s_{n-1}, s_n is the signal amplitude in the time domain, s_{n-1} is the signal amplitude at the previous time corresponding to s_n, s'_n is the signal amplitude in the time domain after pre-emphasis, and a is the pre-emphasis coefficient with 0.9 < a < 1.0.
The framing sub-unit 212 is configured to perform framing on the pre-emphasized target to-be-differentiated voice data.
The windowing sub-unit 213 is configured to perform windowing on the framed target to-be-differentiated voice data to obtain the pre-processed voice data, where the windowing formula is
Figure PCTCN2018094190-appb-000032
N is the window length, n is time, s_n is the signal amplitude in the time domain, and s'_n is the signal amplitude in the time domain after windowing.
Preferably, the Mel frequency cepstrum coefficient unit 24 includes a to-be-transformed Mel power spectrum acquisition sub-unit 241 and a Mel frequency cepstrum coefficient sub-unit 242.
The to-be-transformed Mel power spectrum acquisition sub-unit 241 is configured to take the logarithm of the Mel power spectrum to obtain the to-be-transformed Mel power spectrum.
The Mel frequency cepstrum coefficient sub-unit 242 is configured to perform a discrete cosine transform on the to-be-transformed Mel power spectrum to obtain the Mel frequency cepstrum coefficients of the target to-be-differentiated voice data.
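The chain described by units 22-24 and sub-units 241-242 (fast Fourier transform, power spectrum, Mel filter bank, logarithm, discrete cosine transform) can be sketched as follows; the triangular Mel filter-bank construction is a standard textbook version assumed for illustration, and the sample rate, FFT size, filter count, and number of coefficients are example values.

import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_filters, n_fft, sample_rate):
    # standard triangular Mel-scale filter bank (an assumption, not specified in the text)
    high_mel = 2595.0 * np.log10(1.0 + (sample_rate / 2.0) / 700.0)
    mel_points = np.linspace(0.0, high_mel, n_filters + 2)
    hz_points = 700.0 * (10.0 ** (mel_points / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_points / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(1, center - left)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(1, right - center)
    return fbank

def mfcc(frames, sample_rate=16000, n_fft=512, n_filters=26, n_ceps=13):
    spectrum = np.fft.rfft(frames, n_fft)                                  # unit 22: FFT -> frequency spectrum
    power = (np.abs(spectrum) ** 2) / n_fft                                # unit 22: power spectrum
    mel_power = power @ mel_filterbank(n_filters, n_fft, sample_rate).T   # unit 23: Mel power spectrum
    log_mel = np.log(np.maximum(mel_power, 1e-10))                         # sub-unit 241: logarithm
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_ceps]          # sub-unit 242: DCT -> MFCC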
Preferably, the speech differentiation apparatus further includes an ASR-RNN model acquisition module 40, which includes a to-be-trained ASR speech feature acquisition unit 41, an initialization unit 42, an output value acquisition unit 43, and an updating unit 44.
The to-be-trained ASR speech feature acquisition unit 41 is configured to acquire the to-be-trained speech data and extract the to-be-trained ASR speech features of the to-be-trained speech data.
The initialization unit 42 is configured to initialize the RNN model.
The output value acquisition unit 43 is configured to input the to-be-trained ASR speech features into the RNN model and obtain the output value of the RNN model according to the forward propagation algorithm, where the output value is expressed as:
Figure PCTCN2018094190-appb-000033
σ represents the activation function, V represents the weight of the connection between the hidden layer and the output layer, h_t represents the hidden state at time t, and c represents the offset between the hidden layer and the output layer.
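Only the output equation ŷ_t = σ(V h_t + c) is reproduced above as a figure; the sketch below additionally assumes the standard simple-RNN recurrence h_t = tanh(U x_t + W h_{t-1} + b), which is consistent with the parameters U, W, and b referred to by the updating unit 44 below.

import numpy as np

def rnn_forward(X, U, W, V, b, c):
    # X: (tau, n_in) sequence of to-be-trained ASR feature frames
    tau, n_hidden = X.shape[0], U.shape[0]
    h_prev = np.zeros(n_hidden)
    hidden_states, outputs = [], []
    for t in range(tau):
        h_t = np.tanh(U @ X[t] + W @ h_prev + b)        # assumed hidden-state recurrence
        y_hat_t = 1.0 / (1.0 + np.exp(-(V @ h_t + c)))  # output y_hat_t = sigma(V h_t + c)
        hidden_states.append(h_t)
        outputs.append(float(y_hat_t))
        h_prev = h_t
    return np.array(hidden_states), np.array(outputs)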
The updating unit 44 is configured to perform error back propagation based on the output value, update the weights and offsets of each layer of the RNN model, and obtain an ASR-RNN model. The formula for updating the weight V is:
Figure PCTCN2018094190-appb-000034
V represents the weight of the connection between the hidden layer and the output layer before the update, V 'represents the weight of the connection between the hidden layer and the output layer after the update, α represents the learning rate, t represents the time t, and τ represents the total duration,
Figure PCTCN2018094190-appb-000035
Represents the predicted output value, y t represents the real output value, h t represents the hidden state at time t, T represents the matrix transposition operation; the formula for updating the offset c is:
Figure PCTCN2018094190-appb-000036
c represents the offset between the hidden layer and the output layer before the update, c 'represents the offset between the hidden layer and the output layer after the update; the formula for updating the weight U is:
Figure PCTCN2018094190-appb-000037
U represents the weight of the connection between the input layer and the hidden layer before the update, U' represents the weight of the connection between the input layer and the hidden layer after the update, diag() denotes the matrix operation that constructs a diagonal matrix or returns the diagonal elements of a matrix in the form of a vector, δ_t represents the gradient of the hidden layer state, and x_t represents the to-be-trained ASR speech feature input at time t; the formula for updating the weight W is:
Figure PCTCN2018094190-appb-000038
W represents the weight of the connections between the hidden layers before the update, W 'represents the weight of the connections between the hidden layers after the update; the formula for updating the offset b is:
Figure PCTCN2018094190-appb-000039
b indicates the offset between the input layer and the hidden layer before the update, and b 'indicates the offset between the input layer and the hidden layer after the update.
This embodiment provides one or more non-volatile readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to implement the speech differentiation method of the embodiment; to avoid repetition, the details are not repeated here. Alternatively, when the computer-readable instructions are executed by one or more processors, the one or more processors implement the functions of the modules/units of the speech differentiation apparatus of the embodiment; to avoid repetition, the details are not repeated here.
It can be understood that the computer-readable storage medium may include any entity or apparatus capable of carrying the computer-readable instructions, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electric carrier signal, a telecommunication signal, and the like.
FIG. 9 is a schematic diagram of the computer device in this embodiment. As shown in FIG. 9, the computer device 50 includes a processor 51, a memory 52, and computer-readable instructions 53 stored in the memory 52 and executable on the processor 51. When the processor 51 executes the computer-readable instructions 53, the steps of the speech differentiation method of the embodiment are implemented, for example steps S10, S20, and S30 shown in FIG. 2. Alternatively, when the processor 51 executes the computer-readable instructions 53, the functions of the modules/units of the speech differentiation apparatus of the embodiment are implemented, such as the functions of the target to-be-differentiated voice data acquisition module 10, the voice feature acquisition module 20, the target discrimination result acquisition module 30, and the ASR-RNN model acquisition module 40 shown in FIG. 8.
Those skilled in the art can clearly understand that, for convenience and brevity of description, only the above division of functional units and modules is used as an example; in practical applications, the above functions may be assigned to different functional units or modules as needed, that is, the internal structure of the apparatus may be divided into different functional units or modules to complete all or part of the functions described above.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
The above embodiments are only intended to describe the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of their technical features can be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application, and shall all be included within the protection scope of this application.

Claims (20)

  1. 一种语音区分方法,其特征在于,包括:A method for distinguishing speech, comprising:
    基于语音活动检测算法处理原始待区分语音数据,获取目标待区分语音数据;Processing the original to-be-differentiated voice data based on the voice activity detection algorithm to obtain the target to-be-differentiated voice data;
    基于所述目标待区分语音数据,获取相对应的ASR语音特征;Obtaining corresponding ASR voice characteristics based on the target to-be-differentiated voice data;
    将所述ASR语音特征输入到预先训练好的ASR-RNN模型中进行区分,获取目标区分结果。The ASR speech features are input into a pre-trained ASR-RNN model for discrimination, and a target discrimination result is obtained.
  2. 根据权利要求1所述的语音区分方法,其特征在于,在所述将所述ASR语音特征输入到预先训练好的ASR-RNN模型中进行区分,获取区分结果的步骤之前,所述语音区分方法还包括:获取ASR-RNN模型;The speech discrimination method according to claim 1, characterized in that, before the step of inputting the ASR speech features into a pre-trained ASR-RNN model for discrimination, and obtaining a discrimination result, the speech discrimination method Also includes: obtaining ASR-RNN model;
    所述获取ASR-RNN模型的步骤包括:The step of obtaining the ASR-RNN model includes:
    获取待训练语音数据,并提取所述待训练语音数据的待训练ASR语音特征;Acquiring voice data to be trained, and extracting voice features of the ASR to be trained of the voice data to be trained;
    初始化RNN模型;Initialize the RNN model;
    将待训练ASR语音特征输入到RNN模型中,根据前向传播算法获取RNN模型的输出值,所述输出值表示为:
    Figure PCTCN2018094190-appb-100001
    σ表示激活函数,V表示隐藏层和输出层之间连接的权值,h t表示t时刻的隐藏状态,c表示隐藏层和输出层之间的偏置;
    The ASR speech features to be trained are input into the RNN model, and the output value of the RNN model is obtained according to the forward propagation algorithm, and the output value is expressed as:
    Figure PCTCN2018094190-appb-100001
    σ represents the activation function, V represents the weight of the connection between the hidden layer and the output layer, h t represents the hidden state at time t, and c represents the offset between the hidden layer and the output layer;
    基于所述输出值进行误差反传,更新RNN模型各层的权值和偏置,获取ASR-RNN模型,其中,更新权值V的公式为:
    Figure PCTCN2018094190-appb-100002
    V表示更新前隐藏层和输出层之间连接的的权值,V'表示更新后隐藏层和输出层之间连接的的权值,α表示学习率,t表示t时刻,τ表示总时长,
    Figure PCTCN2018094190-appb-100003
    表示预测输出值,y t表示真实输出值,h t表示t时刻的隐藏状态,T表示矩阵转置运算;更新偏置c的公式为:
    Figure PCTCN2018094190-appb-100004
    c表示更新前隐藏层和输出层之间的偏置,c'表示更新后隐藏层和输出层之间的偏置;更新权值U的公式为:
    Figure PCTCN2018094190-appb-100005
    U表示更新前输入层到隐藏层之间连接的权值,U'表示更新后输入层到隐藏层之间连接的权值,diag()表示构造一个对角矩阵或者以向量的形式返回一个矩阵上对角线元素的矩阵运算,δ t表示隐藏层状态的梯度,x t表示t时刻输入的待训练ASR语音特征;更新权值W的公式为:
    Figure PCTCN2018094190-appb-100006
    W表示更新前隐藏层之间连接的权值,W'表示更新后隐藏层之间连接的权值;更新偏置b的公式为:
    Perform error back propagation based on the output value, update the weights and offsets of each layer of the RNN model, and obtain an ASR-RNN model, where the formula for updating the weight V is:
    Figure PCTCN2018094190-appb-100002
    V represents the weight of the connection between the hidden layer and the output layer before the update, V 'represents the weight of the connection between the hidden layer and the output layer after the update, α represents the learning rate, t represents the time t, and τ represents the total duration,
    Figure PCTCN2018094190-appb-100003
    Represents the predicted output value, y t represents the real output value, h t represents the hidden state at time t, T represents the matrix transposition operation; the formula for updating the offset c is:
    Figure PCTCN2018094190-appb-100004
    c represents the offset between the hidden layer and the output layer before the update, c 'represents the offset between the hidden layer and the output layer after the update; the formula for updating the weight U is:
    Figure PCTCN2018094190-appb-100005
    U represents the weight of the connection between the input layer and the hidden layer before the update, U' represents the weight of the connection between the input layer and the hidden layer after the update, diag() denotes the matrix operation that constructs a diagonal matrix or returns the diagonal elements of a matrix in the form of a vector, δ_t represents the gradient of the hidden layer state, and x_t represents the to-be-trained ASR speech feature input at time t; the formula for updating the weight W is:
    Figure PCTCN2018094190-appb-100006
    W represents the weight of the connections between the hidden layers before the update, W 'represents the weight of the connections between the hidden layers after the update; the formula for updating the offset b is:
    Figure PCTCN2018094190-appb-100007
    b表示更新前输入层和隐藏层之间的偏置,b'表示更新后输入层和隐藏层之间的偏置。
    Figure PCTCN2018094190-appb-100007
    b indicates the offset between the input layer and the hidden layer before the update, and b 'indicates the offset between the input layer and the hidden layer after the update.
  3. 根据权利要求1所述的语音区分方法,其特征在于,所述基于语音活动检测算法处理原始待区分语音数据,获取目标待区分语音数据,包括:The method of claim 1, wherein the processing of the original speech data to be distinguished based on the speech activity detection algorithm to obtain the target speech data to be distinguished comprises:
    根据短时能量特征值计算公式对所述原始待区分语音数据进行处理,获取对应的短时能量特征值,将所述短时能量特征值大于第一阈值的所述原始待区分数据保留,确定为第一原始区分语音数据,短时能量特征值计算公式为
    Figure PCTCN2018094190-appb-100008
    其中,N为语音帧长,s(n)为时域上的信号幅度,n为时间;
    Processing the original to-be-differentiated voice data according to a short-time energy feature value calculation formula, obtaining the corresponding short-time energy feature values, and retaining the original to-be-differentiated data whose short-time energy feature value is greater than a first threshold as the first original distinguished speech data, where the short-time energy feature value calculation formula is
    Figure PCTCN2018094190-appb-100008
    Where N is the speech frame length, s (n) is the signal amplitude in the time domain, and n is the time;
    根据过零率特征值计算公式对所述原始待区分语音数据进行处理,获取对应的过零率特征值,将所述过零率特征值小于第二阈值的所述原始待区分语音数据保留,确定为第二原始区分语音数据,过零率特征值计算公式为
    Figure PCTCN2018094190-appb-100009
    其中,N为语音帧长,s(n)为时域上的信号幅度,n为时间;
    Processing the original to-be-differentiated voice data according to a zero-crossing rate feature value calculation formula, obtaining the corresponding zero-crossing rate feature values, and retaining the original to-be-differentiated voice data whose zero-crossing rate feature value is less than a second threshold as the second original distinguished speech data, where the zero-crossing rate feature value calculation formula is
    Figure PCTCN2018094190-appb-100009
    Where N is the speech frame length, s (n) is the signal amplitude in the time domain, and n is the time;
    将所述第一原始区分语音数据和所述第二原始区分语音数据作为所述目标待区分语音数据。Use the first original distinguished speech data and the second original distinguished speech data as the target to-be-differentiated speech data.
  4. 根据权利要求1所述的语音区分方法,其特征在于,所述基于所述目标待区分语音数据,获取相对应的ASR语音特征,包括:The method according to claim 1, wherein the acquiring the corresponding ASR voice feature based on the target to-be-differentiated voice data comprises:
    对所述目标待区分语音数据进行预处理,获取预处理语音数据;Pre-processing the target to-be-differentiated voice data to obtain pre-processed voice data;
    对所述预处理语音数据作快速傅里叶变换,获取目标待区分语音数据的频谱,并根据所述频谱获取目标待区分语音数据的功率谱;Performing a fast Fourier transform on the pre-processed speech data to obtain a frequency spectrum of target speech data to be distinguished, and obtaining a power spectrum of target speech data to be distinguished according to the frequency spectrum;
    采用梅尔刻度滤波器组处理所述目标待区分语音数据的功率谱,获取目标待区分语音数据的梅尔功率谱;Adopting a Mel scale filter bank to process the power spectrum of the target speech data to be distinguished, and obtain a Mel power spectrum of the target speech data to be distinguished;
    在所述梅尔功率谱上进行倒谱分析,获取目标待区分语音数据的梅尔频率倒谱系数。A cepstrum analysis is performed on the Mel power spectrum to obtain a Mel frequency cepstrum coefficient of the target speech data to be distinguished.
  5. 根据权利要求4所述的语音区分方法,其特征在于,所述对所述目标待区分语音数据进行预处理,获取预处理语音数据,包括:The speech discrimination method according to claim 4, wherein the pre-processing the target to-be-differentiated voice data to obtain pre-processed voice data comprises:
    对所述目标待区分语音数据作预加重处理,预加重处理的计算公式为s' n=s n-a*s n-1,其中,s n为时域上的信号幅度,s n-1为与s n相对应的上一时刻的信号幅度,s' n为预加重后时域上的信号幅度,a为预加重系数,a的取值范围为0.9<a<1.0; Distinguish the speech of the target data to be processed for pre-emphasis, pre-emphasis process is calculated as s' n = s n -a * s n-1, wherein the amplitude of the signal on the time domain s n, s n-1 s n is the amplitude of the signal corresponding to the previous time, s' n for the pre-emphasis signal amplitude on the time domain, a is a pre-emphasis coefficient, a is in the range of 0.9 <a <1.0;
    将预加重后的所述目标待区分语音数据进行分帧处理;Performing frame processing on the pre-emphasized target to-be-differentiated voice data;
    将分帧后的所述目标待区分语音数据进行加窗处理,获取预处理语音数据,加窗的计算公式为
    Figure PCTCN2018094190-appb-100010
    其中,N为窗长,n为时间,s n为时域上的信号幅度,s' n为加窗后时域上的信号幅度。
    The windowed processing is performed on the target to-be-differentiated voice data after framing to obtain preprocessed voice data. The calculation formula of the windowing is
    Figure PCTCN2018094190-appb-100010
    wherein N is the window length, n is time, s_n is the signal amplitude in the time domain, and s'_n is the signal amplitude in the time domain after windowing.
  6. 根据权利要求4所述的语音区分方法,其特征在于,所述在所述梅尔功率谱上进行倒谱分析,获取目标待区分语音数据的梅尔频率倒谱系数,包括:The speech discrimination method according to claim 4, wherein the performing cepstrum analysis on the Mel power spectrum to obtain a Mel frequency cepstrum coefficient of target speech data to be distinguished comprises:
    取所述梅尔功率谱的对数值,获取待变换梅尔功率谱;Taking a log value of the Mel power spectrum to obtain a Mel power spectrum to be transformed;
    对所述待变换梅尔功率谱作离散余弦变换,获取目标待区分语音数据的梅尔频率倒谱系数。Performing discrete cosine transform on the Mel power spectrum to be transformed to obtain a Mel frequency cepstrum coefficient of target speech data to be distinguished.
  7. 一种语音区分装置,其特征在于,包括:A voice distinguishing device, comprising:
    目标待区分语音数据获取模块,用于基于语音活动检测算法处理原始待区分语音数据,获取目标待区分语音数据;Target to-be-differentiated voice data acquisition module, for processing original to-be-differentiated voice data based on a voice activity detection algorithm, to obtain target to-be-differentiated voice data;
    语音特征获取模块,用于基于所述目标待区分语音数据,获取相对应的ASR语音特征;A voice feature acquisition module, configured to acquire a corresponding ASR voice feature based on the target to-be-differentiated voice data;
    目标区分结果获取模块,用于将所述ASR语音特征输入到预先训练好的ASR-RNN模型中进行区分,获取目标区分结果。A target discrimination result acquisition module is configured to input the ASR speech features into a pre-trained ASR-RNN model for discrimination, and obtain a target discrimination result.
  8. 根据权利要求7所述的语音区分装置,其特征在于,所述语音区分装置还包括ASR-RNN模型获取模块,所述ASR-RNN模型获取模块包括:The speech discrimination device according to claim 7, wherein the speech discrimination device further comprises an ASR-RNN model acquisition module, and the ASR-RNN model acquisition module comprises:
    待训练ASR语音特征获取单元,用于获取待训练语音数据,并提取所述待训练语音数据的待训练ASR语音特征;A speech feature acquisition unit to be trained, for acquiring speech data to be trained, and extracting speech feature of the ASR to be trained from the speech data to be trained;
    初始化单元,用于初始化RNN模型;An initialization unit for initializing the RNN model;
    输出值获取单元,用于将待训练ASR语音特征输入到RNN模型中,根据前向传播算法获取RNN模型的输出值,所述输出值表示为:
    Figure PCTCN2018094190-appb-100011
    σ表示激活函数,V表示隐藏层和输出层之间连接的权值,h t表示t时刻的隐藏状态,c表示隐藏层和输出层之间的偏置;
    An output value obtaining unit is configured to input the ASR speech features to be trained into the RNN model, and obtain an output value of the RNN model according to a forward propagation algorithm, where the output value is expressed as:
    Figure PCTCN2018094190-appb-100011
    σ represents the activation function, V represents the weight of the connection between the hidden layer and the output layer, h t represents the hidden state at time t, and c represents the offset between the hidden layer and the output layer;
    更新单元,用于基于所述输出值进行误差反传,更新RNN模型各层的权值和偏置,获取ASR-RNN模型,其中,更新权值V的公式为:
    Figure PCTCN2018094190-appb-100012
    V表示更新前隐藏层和输出层之间连接的的权值,V'表示更新后隐藏层和输出层之间连接的的权值,α表示学习率,t表示t时刻,τ表示总时长,
    Figure PCTCN2018094190-appb-100013
    表示预测输出值,y t表示真实输出值,h t表示t时刻的隐藏状态,T表示矩阵转置运算;更新偏置c的公式为:
    Figure PCTCN2018094190-appb-100014
    c表示更新前隐藏层和输出层之间的偏置,c'表示更新后隐藏层和输出层之间的偏置;更新权值U的公式为:
    Figure PCTCN2018094190-appb-100015
    U表示更新前输入层到隐藏层之间连接的权值,U'表示更新后输入层到隐藏层之间连接的权值,diag()表示构造一个对角矩阵或者以向量的形式返回一个矩阵上对角线元素的矩阵运算,δ t表示隐藏层状态的梯度,x t表示t时刻输入的待训练ASR语音特征;更新权值W的公式为:
    An updating unit is configured to perform error back propagation based on the output value, update the weights and offsets of each layer of the RNN model, and obtain an ASR-RNN model. The formula for updating the weight V is:
    Figure PCTCN2018094190-appb-100012
    V represents the weight of the connection between the hidden layer and the output layer before the update, V 'represents the weight of the connection between the hidden layer and the output layer after the update, α represents the learning rate, t represents the time t, and τ represents the total duration,
    Figure PCTCN2018094190-appb-100013
    Represents the predicted output value, y t represents the real output value, h t represents the hidden state at time t, T represents the matrix transposition operation; the formula for updating the offset c is:
    Figure PCTCN2018094190-appb-100014
    c represents the offset between the hidden layer and the output layer before the update, c 'represents the offset between the hidden layer and the output layer after the update; the formula for updating the weight U is:
    Figure PCTCN2018094190-appb-100015
    U represents the weight of the connection between the input layer and the hidden layer before the update, U' represents the weight of the connection between the input layer and the hidden layer after the update, diag() denotes the matrix operation that constructs a diagonal matrix or returns the diagonal elements of a matrix in the form of a vector, δ_t represents the gradient of the hidden layer state, and x_t represents the to-be-trained ASR speech feature input at time t; the formula for updating the weight W is:
    Figure PCTCN2018094190-appb-100016
    W表示更新前隐藏层之间连接的权值,W'表示更新后隐藏层之间连接的权值;更新偏置b的公式为:
    Figure PCTCN2018094190-appb-100017
    b表示更新前输入层和隐藏层之间的偏置,b'表示更新后输入层和隐藏层之间的偏置。
    Figure PCTCN2018094190-appb-100016
    W represents the weight of the connections between the hidden layers before the update, W 'represents the weight of the connections between the hidden layers after the update; the formula for updating the offset b is:
    Figure PCTCN2018094190-appb-100017
    b indicates the offset between the input layer and the hidden layer before the update, and b 'indicates the offset between the input layer and the hidden layer after the update.
  9. 一种计算机设备,包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机可读指令,其特征在于,所述处理器执行所述计算机可读指令时实现如下步骤:A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, and is characterized in that the processor implements the computer-readable instructions as follows step:
    基于语音活动检测算法处理原始待区分语音数据,获取目标待区分语音数据;Processing the original to-be-differentiated voice data based on the voice activity detection algorithm to obtain the target to-be-differentiated voice data;
    基于所述目标待区分语音数据,获取相对应的ASR语音特征;Obtaining corresponding ASR voice characteristics based on the target to-be-differentiated voice data;
    将所述ASR语音特征输入到预先训练好的ASR-RNN模型中进行区分,获取目标区分结果。The ASR speech features are input into a pre-trained ASR-RNN model for discrimination, and a target discrimination result is obtained.
  10. 根据权利要求9所述的计算机设备,其特征在于,在所述将所述ASR语音特征输入到预先训练好的ASR-RNN模型中进行区分,获取区分结果的步骤之前,所述处理器执行所述计算机可读指令时还实现如下步骤:获取ASR-RNN模型;The computer device according to claim 9, characterized in that before the step of inputting the ASR speech features into a pre-trained ASR-RNN model for differentiation, and obtaining a discrimination result, the processor executes all When describing the computer-readable instructions, the following steps are also implemented: obtaining an ASR-RNN model;
    所述获取ASR-RNN模型的步骤包括:The step of obtaining the ASR-RNN model includes:
    获取待训练语音数据,并提取所述待训练语音数据的待训练ASR语音特征;Acquiring voice data to be trained, and extracting voice features of the ASR to be trained of the voice data to be trained;
    初始化RNN模型;Initialize the RNN model;
    将待训练ASR语音特征输入到RNN模型中,根据前向传播算法获取RNN模型的输出值,所述输出值表示为:
    Figure PCTCN2018094190-appb-100018
    σ表示激活函数,V表示隐藏层和输出层之间连接的权值,h t表示t时刻的隐藏状态,c表示隐藏层和输出层之间的偏置;
    The ASR speech features to be trained are input into the RNN model, and the output value of the RNN model is obtained according to the forward propagation algorithm, and the output value is expressed as:
    Figure PCTCN2018094190-appb-100018
    σ represents the activation function, V represents the weight of the connection between the hidden layer and the output layer, h t represents the hidden state at time t, and c represents the offset between the hidden layer and the output layer;
    基于所述输出值进行误差反传,更新RNN模型各层的权值和偏置,获取ASR-RNN模型,其中,更新权值V的公式为:
    Figure PCTCN2018094190-appb-100019
    V表示更新前隐藏层和输出层之间连接的的权值,V'表示更新后隐藏层和输出层之间连接的的权值,α表示学习率,t表示t时刻,τ表示总时长,
    Figure PCTCN2018094190-appb-100020
    表示预测输出值,y t表示真实输出值,h t表示t时刻的隐藏状态,T表示矩阵转置运算;更新偏置c的公式为:
    Figure PCTCN2018094190-appb-100021
    c表示更新前隐藏层和输出层之间的偏置,c'表示更新后隐藏层和输出层之间的偏置;更新权值U的公式为:
    Figure PCTCN2018094190-appb-100022
    U表示更新前输入层到隐藏层之间连接的权值,U'表示更新后输入层到隐藏层之间连接的权值,diag()表示构造一个对角矩阵或者以向量的形式返回一个矩阵上对角线元素的矩阵运算,δ t表示隐藏层状态的梯度,x t表示t时刻输入的待训练ASR语音特征;更新权值W的公式为:
    Figure PCTCN2018094190-appb-100023
    W表示更新前隐藏层之间连接的权值,W'表示更新后隐藏层之间连接的权值;更新偏置b的公式为:
    Perform error back propagation based on the output value, update the weights and offsets of each layer of the RNN model, and obtain an ASR-RNN model, where the formula for updating the weight V is:
    Figure PCTCN2018094190-appb-100019
    V represents the weight of the connection between the hidden layer and the output layer before the update, V 'represents the weight of the connection between the hidden layer and the output layer after the update, α represents the learning rate, t represents the time t, and τ represents the total duration,
    Figure PCTCN2018094190-appb-100020
    Represents the predicted output value, y t represents the real output value, h t represents the hidden state at time t, T represents the matrix transposition operation; the formula for updating the offset c is:
    Figure PCTCN2018094190-appb-100021
    c represents the offset between the hidden layer and the output layer before the update, c 'represents the offset between the hidden layer and the output layer after the update; the formula for updating the weight U is:
    Figure PCTCN2018094190-appb-100022
    U represents the weight of the connection between the input layer and the hidden layer before the update, U' represents the weight of the connection between the input layer and the hidden layer after the update, diag() denotes the matrix operation that constructs a diagonal matrix or returns the diagonal elements of a matrix in the form of a vector, δ_t represents the gradient of the hidden layer state, and x_t represents the to-be-trained ASR speech feature input at time t; the formula for updating the weight W is:
    Figure PCTCN2018094190-appb-100023
    W represents the weight of the connections between the hidden layers before the update, W 'represents the weight of the connections between the hidden layers after the update; the formula for updating the offset b is:
    Figure PCTCN2018094190-appb-100024
    b表示更新前输入层和隐藏层之间的偏置,b'表示更新后输入层和隐藏层之间的偏置。
    Figure PCTCN2018094190-appb-100024
    b indicates the offset between the input layer and the hidden layer before the update, and b 'indicates the offset between the input layer and the hidden layer after the update.
  11. 根据权利要求9所述的计算机设备,其特征在于,所述基于语音活动检测算法处理原始待区分语音数据,获取目标待区分语音数据,包括:The computer device according to claim 9, wherein the processing of the original voice data to be distinguished based on the voice activity detection algorithm to obtain the target voice data to be distinguished comprises:
    根据短时能量特征值计算公式对所述原始待区分语音数据进行处理,获取对应的短时能量特征值,将所述短时能量特征值大于第一阈值的所述原始待区分数据保留,确定为第一原始区分语音数据,短时能量特征值计算公式为
    Figure PCTCN2018094190-appb-100025
    其中,N为语音帧长,s(n)为时域上的信号幅度,n为时间;
    Processing the original to-be-differentiated voice data according to a short-time energy feature value calculation formula, obtaining the corresponding short-time energy feature values, and retaining the original to-be-differentiated data whose short-time energy feature value is greater than a first threshold as the first original distinguished speech data, where the short-time energy feature value calculation formula is
    Figure PCTCN2018094190-appb-100025
    Where N is the speech frame length, s (n) is the signal amplitude in the time domain, and n is the time;
    根据过零率特征值计算公式对所述原始待区分语音数据进行处理,获取对应的过零率特征值,将所述过零率特征值小于第二阈值的所述原始待区分语音数据保留,确定为第二原始区分语音数据,过零率特征值计算公式为
    Figure PCTCN2018094190-appb-100026
    其中,N为语音帧长,s(n)为时域上的信号幅度,n为 时间;
    Processing the original to-be-differentiated voice data according to a zero-crossing rate feature value calculation formula, obtaining the corresponding zero-crossing rate feature values, and retaining the original to-be-differentiated voice data whose zero-crossing rate feature value is less than a second threshold as the second original distinguished speech data, where the zero-crossing rate feature value calculation formula is
    Figure PCTCN2018094190-appb-100026
    Where N is the speech frame length, s (n) is the signal amplitude in the time domain, and n is the time;
    将所述第一原始区分语音数据和所述第二原始区分语音数据作为所述目标待区分语音数据。Use the first original distinguished speech data and the second original distinguished speech data as the target to-be-differentiated speech data.
  12. 根据权利要求9所述的计算机设备,其特征在于,所述基于所述目标待区分语音数据,获取相对应的ASR语音特征,包括:The computer device according to claim 9, wherein the acquiring the corresponding ASR voice feature based on the target to-be-differentiated voice data comprises:
    对所述目标待区分语音数据进行预处理,获取预处理语音数据;Pre-processing the target to-be-differentiated voice data to obtain pre-processed voice data;
    对所述预处理语音数据作快速傅里叶变换,获取目标待区分语音数据的频谱,并根据所述频谱获取目标待区分语音数据的功率谱;Performing a fast Fourier transform on the pre-processed speech data to obtain a frequency spectrum of target speech data to be distinguished, and obtaining a power spectrum of target speech data to be distinguished according to the frequency spectrum;
    采用梅尔刻度滤波器组处理所述目标待区分语音数据的功率谱,获取目标待区分语音数据的梅尔功率谱;Adopting a Mel scale filter bank to process the power spectrum of the target speech data to be distinguished, and obtain a Mel power spectrum of the target speech data to be distinguished;
    在所述梅尔功率谱上进行倒谱分析,获取目标待区分语音数据的梅尔频率倒谱系数。A cepstrum analysis is performed on the Mel power spectrum to obtain a Mel frequency cepstrum coefficient of the target speech data to be distinguished.
  13. 根据权利要求12所述的计算机设备,其特征在于,所述对所述目标待区分语音数据进行预处理,获取预处理语音数据,包括:The computer device according to claim 12, wherein the pre-processing the target to-be-differentiated voice data to obtain the pre-processed voice data comprises:
    对所述目标待区分语音数据作预加重处理,预加重处理的计算公式为s' n=s n-a*s n-1,其中,s n为时域上的信号幅度,s n-1为与s n相对应的上一时刻的信号幅度,s' n为预加重后时域上的信号幅度,a为预加重系数,a的取值范围为0.9<a<1.0; Distinguish the speech of the target data to be processed for pre-emphasis, pre-emphasis process is calculated as s' n = s n -a * s n-1, wherein the amplitude of the signal on the time domain s n, s n-1 s n is the amplitude of the signal corresponding to the previous time, s' n for the pre-emphasis signal amplitude on the time domain, a is a pre-emphasis coefficient, a is in the range of 0.9 <a <1.0;
    将预加重后的所述目标待区分语音数据进行分帧处理;Performing frame processing on the pre-emphasized target to-be-differentiated voice data;
    将分帧后的所述目标待区分语音数据进行加窗处理,获取预处理语音数据,加窗的计算公式为
    Figure PCTCN2018094190-appb-100027
    其中,N为窗长,n为时间,s n为时域上的信号幅度,s' n为加窗后时域上的信号幅度。
    The windowed processing is performed on the target to-be-differentiated voice data after framing to obtain preprocessed voice data. The calculation formula of the windowing is
    Figure PCTCN2018094190-appb-100027
    wherein N is the window length, n is time, s_n is the signal amplitude in the time domain, and s'_n is the signal amplitude in the time domain after windowing.
  14. 根据权利要求12所述的计算机设备,其特征在于,所述在所述梅尔功率谱上进行倒谱分析,获取目标待区分语音数据的梅尔频率倒谱系数,包括:The computer device according to claim 12, wherein the performing cepstrum analysis on the Mel power spectrum to obtain a Mel frequency cepstrum coefficient of target speech data to be distinguished comprises:
    取所述梅尔功率谱的对数值,获取待变换梅尔功率谱;Taking a log value of the Mel power spectrum to obtain a Mel power spectrum to be transformed;
    对所述待变换梅尔功率谱作离散余弦变换,获取目标待区分语音数据的梅尔频率倒谱系数。Performing discrete cosine transform on the Mel power spectrum to be transformed to obtain a Mel frequency cepstrum coefficient of target speech data to be distinguished.
  15. 一个或多个存储有计算机可读指令的非易失性可读存储介质,其特征在于,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器执行如下步骤:One or more non-volatile readable storage media storing computer readable instructions, characterized in that when the computer readable instructions are executed by one or more processors, the one or more processors are caused to execute The following steps:
    基于语音活动检测算法处理原始待区分语音数据,获取目标待区分语音数据;Processing the original to-be-differentiated voice data based on the voice activity detection algorithm to obtain the target to-be-differentiated voice data;
    基于所述目标待区分语音数据,获取相对应的ASR语音特征;Obtaining corresponding ASR voice characteristics based on the target to-be-differentiated voice data;
    将所述ASR语音特征输入到预先训练好的ASR-RNN模型中进行区分,获取目标区分结果。The ASR speech features are input into a pre-trained ASR-RNN model for discrimination, and a target discrimination result is obtained.
  16. 根据权利要求15所述的非易失性可读存储介质,其特征在于,在所述将所述ASR语音特征输入到预先训练好的ASR-RNN模型中进行区分,获取区分结果的步骤之前,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器还执行如下步骤:获取ASR-RNN模型;The non-volatile readable storage medium according to claim 15, characterized in that before the step of inputting the ASR speech feature into a pre-trained ASR-RNN model for discrimination, and obtaining a discrimination result, When the computer-readable instructions are executed by one or more processors, the one or more processors further perform the following steps: obtaining an ASR-RNN model;
    所述获取ASR-RNN模型的步骤包括:The step of obtaining the ASR-RNN model includes:
    获取待训练语音数据,并提取所述待训练语音数据的待训练ASR语音特征;Acquiring voice data to be trained, and extracting voice features of the ASR to be trained of the voice data to be trained;
    初始化RNN模型;Initialize the RNN model;
    将待训练ASR语音特征输入到RNN模型中,根据前向传播算法获取RNN模型的输出值,所述输出值表示为:
    Figure PCTCN2018094190-appb-100028
    σ表示激活函数,V表示隐藏层和输出层之间连接的权值,h t表示t时刻的隐藏状态,c表示隐藏层和输出层之间的偏置;
    The ASR speech features to be trained are input into the RNN model, and the output value of the RNN model is obtained according to the forward propagation algorithm, and the output value is expressed as:
    Figure PCTCN2018094190-appb-100028
    σ represents the activation function, V represents the weight of the connection between the hidden layer and the output layer, h t represents the hidden state at time t, and c represents the offset between the hidden layer and the output layer;
    基于所述输出值进行误差反传,更新RNN模型各层的权值和偏置,获取ASR-RNN模型,其中,更新权值V的公式为:
    Figure PCTCN2018094190-appb-100029
    V表示更新前隐藏层和输出层之间连接的的权值,V'表 示更新后隐藏层和输出层之间连接的的权值,α表示学习率,t表示t时刻,τ表示总时长,
    Figure PCTCN2018094190-appb-100030
    表示预测输出值,y t表示真实输出值,h t表示t时刻的隐藏状态,T表示矩阵转置运算;更新偏置c的公式为:
    Figure PCTCN2018094190-appb-100031
    c表示更新前隐藏层和输出层之间的偏置,c'表示更新后隐藏层和输出层之间的偏置;更新权值U的公式为:
    Figure PCTCN2018094190-appb-100032
    U表示更新前输入层到隐藏层之间连接的权值,U'表示更新后输入层到隐藏层之间连接的权值,diag()表示构造一个对角矩阵或者以向量的形式返回一个矩阵上对角线元素的矩阵运算,δ t表示隐藏层状态的梯度,x t表示t时刻输入的待训练ASR语音特征;更新权值W的公式为:
    Figure PCTCN2018094190-appb-100033
    W表示更新前隐藏层之间连接的权值,W'表示更新后隐藏层之间连接的权值;更新偏置b的公式为:
    Perform error back propagation based on the output value, update the weights and offsets of each layer of the RNN model, and obtain an ASR-RNN model, where the formula for updating the weight V is:
    Figure PCTCN2018094190-appb-100029
    V represents the weight of the connection between the hidden layer and the output layer before the update, V 'represents the weight of the connection between the hidden layer and the output layer after the update, α represents the learning rate, t represents the time t, and τ represents the total duration,
    Figure PCTCN2018094190-appb-100030
    Represents the predicted output value, y t represents the real output value, h t represents the hidden state at time t, T represents the matrix transposition operation; the formula for updating the offset c is:
    Figure PCTCN2018094190-appb-100031
    c represents the offset between the hidden layer and the output layer before the update, c 'represents the offset between the hidden layer and the output layer after the update; the formula for updating the weight U is:
    Figure PCTCN2018094190-appb-100032
    U represents the weight of the connection between the input layer and the hidden layer before the update, U' represents the weight of the connection between the input layer and the hidden layer after the update, diag() denotes the matrix operation that constructs a diagonal matrix or returns the diagonal elements of a matrix in the form of a vector, δ_t represents the gradient of the hidden layer state, and x_t represents the to-be-trained ASR speech feature input at time t; the formula for updating the weight W is:
    Figure PCTCN2018094190-appb-100033
    W represents the weight of the connections between the hidden layers before the update, W 'represents the weight of the connections between the hidden layers after the update; the formula for updating the offset b is:
    Figure PCTCN2018094190-appb-100034
    b表示更新前输入层和隐藏层之间的偏置,b'表示更新后输入层和隐藏层之间的偏置。
    Figure PCTCN2018094190-appb-100034
    b indicates the offset between the input layer and the hidden layer before the update, and b 'indicates the offset between the input layer and the hidden layer after the update.
  17. 根据权利要求15所述的非易失性可读存储介质,其特征在于,所述基于语音活动检测算法处理原始待区分语音数据,获取目标待区分语音数据,包括:The non-volatile readable storage medium according to claim 15, wherein the processing of the original voice data to be distinguished based on the voice activity detection algorithm to obtain the target voice data to be distinguished comprises:
    根据短时能量特征值计算公式对所述原始待区分语音数据进行处理,获取对应的短时能量特征值,将所述短时能量特征值大于第一阈值的所述原始待区分数据保留,确定为第一原始区分语音数据,短时能量特征值计算公式为
    Figure PCTCN2018094190-appb-100035
    其中,N为语音帧长,s(n)为时域上的信号幅度,n为时间;
    Processing the original to-be-differentiated voice data according to a short-time energy feature value calculation formula, obtaining the corresponding short-time energy feature values, and retaining the original to-be-differentiated data whose short-time energy feature value is greater than a first threshold as the first original distinguished speech data, where the short-time energy feature value calculation formula is
    Figure PCTCN2018094190-appb-100035
    Where N is the speech frame length, s (n) is the signal amplitude in the time domain, and n is the time;
    根据过零率特征值计算公式对所述原始待区分语音数据进行处理,获取对应的过零率特征值,将所述过零率特征值小于第二阈值的所述原始待区分语音数据保留,确定为第二原始区分语音数据,过零率特征值计算公式为
    Figure PCTCN2018094190-appb-100036
    其中,N为语音帧长,s(n)为时域上的信号幅度,n为时间;
    Processing the original to-be-differentiated voice data according to a zero-crossing rate feature value calculation formula, obtaining the corresponding zero-crossing rate feature values, and retaining the original to-be-differentiated voice data whose zero-crossing rate feature value is less than a second threshold as the second original distinguished speech data, where the zero-crossing rate feature value calculation formula is
    Figure PCTCN2018094190-appb-100036
    Where N is the speech frame length, s (n) is the signal amplitude in the time domain, and n is the time;
    Using the first original differentiated voice data and the second original differentiated voice data as the target voice data to be differentiated.
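    As an illustration of the frame selection in claim 17, the following is a minimal NumPy sketch that computes the short-time energy and zero-crossing rate of each frame and keeps the union of the two retained sets; the function name, array layout and threshold values are assumptions for the example only.

```python
import numpy as np

def select_target_frames(frames, energy_threshold, zcr_threshold):
    """frames: array of shape (num_frames, N) of time-domain samples.
    Keeps frames whose short-time energy exceeds the first threshold or
    whose zero-crossing rate is below the second threshold (sketch)."""
    # short-time energy: E = sum_{n=0}^{N-1} s(n)^2
    energy = np.sum(frames ** 2, axis=1)

    # zero-crossing rate: 0.5 * sum_{n=1}^{N-1} |sgn(s(n)) - sgn(s(n-1))|
    zcr = 0.5 * np.sum(np.abs(np.diff(np.sign(frames), axis=1)), axis=1)

    keep = (energy > energy_threshold) | (zcr < zcr_threshold)
    return frames[keep]
```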
  18. The non-volatile readable storage medium according to claim 15, wherein the obtaining of the corresponding ASR speech features based on the target voice data to be differentiated comprises:
    Pre-processing the target voice data to be differentiated to obtain pre-processed voice data;
    Performing a fast Fourier transform on the pre-processed voice data to obtain the frequency spectrum of the target voice data to be differentiated, and obtaining the power spectrum of the target voice data to be differentiated according to the frequency spectrum;
    Processing the power spectrum of the target voice data to be differentiated with a Mel-scale filter bank to obtain the Mel power spectrum of the target voice data to be differentiated;
    Performing cepstral analysis on the Mel power spectrum to obtain the Mel-frequency cepstral coefficients of the target voice data to be differentiated.
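    A minimal sketch of the spectral steps of claim 18 (fast Fourier transform, power spectrum, Mel power spectrum), assuming pre-processed frames and a Mel-scale filter bank matrix built elsewhere; the FFT length and array shapes are illustrative assumptions.

```python
import numpy as np

def mel_power_spectrum(frames, mel_filters, nfft=512):
    """frames: (num_frames, frame_len) pre-processed (windowed) frames.
    mel_filters: (num_filters, nfft // 2 + 1) Mel-scale filter bank matrix.
    Returns the Mel power spectrum of each frame (sketch)."""
    spectrum = np.abs(np.fft.rfft(frames, n=nfft, axis=1))   # frequency spectrum
    power_spectrum = (spectrum ** 2) / nfft                  # power spectrum
    return power_spectrum @ mel_filters.T                    # Mel power spectrum
```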
  19. The non-volatile readable storage medium according to claim 18, wherein the pre-processing of the target voice data to be differentiated to obtain pre-processed voice data comprises:
    Performing pre-emphasis on the target voice data to be differentiated, where the pre-emphasis calculation formula is s'_n = s_n - a·s_{n-1}, where s_n is the signal amplitude in the time domain, s_{n-1} is the signal amplitude at the previous time instant corresponding to s_n, s'_n is the signal amplitude in the time domain after pre-emphasis, and a is the pre-emphasis coefficient with a value range of 0.9 < a < 1.0;
    Performing framing on the pre-emphasized target voice data to be differentiated;
    Performing windowing on the framed target voice data to be differentiated to obtain the pre-processed voice data, where the windowing calculation formula is
    s'_n = s_n × (0.54 - 0.46·cos(2πn/(N-1)))
    where N is the window length, n is time, s_n is the signal amplitude in the time domain, and s'_n is the signal amplitude in the time domain after windowing.
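    A minimal sketch of the pre-processing in claim 19 (pre-emphasis, framing, windowing), assuming a 1-D NumPy signal; the frame length, frame shift and a = 0.97 are illustrative choices, with a inside the stated range 0.9 < a < 1.0.

```python
import numpy as np

def preprocess(signal, frame_len=400, frame_shift=160, a=0.97):
    """Pre-emphasis, framing and windowing of a 1-D signal (sketch).
    Assumes len(signal) >= frame_len."""
    # pre-emphasis: s'_n = s_n - a * s_{n-1}
    emphasized = np.append(signal[0], signal[1:] - a * signal[:-1])

    # framing with a fixed shift between consecutive frames
    num_frames = (len(emphasized) - frame_len) // frame_shift + 1
    frames = np.stack([emphasized[i * frame_shift:i * frame_shift + frame_len]
                       for i in range(num_frames)])

    # windowing: s'_n = s_n * (0.54 - 0.46 * cos(2*pi*n / (N - 1)))
    n = np.arange(frame_len)
    window = 0.54 - 0.46 * np.cos(2 * np.pi * n / (frame_len - 1))
    return frames * window
```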
  20. The non-volatile readable storage medium according to claim 18, wherein the performing of cepstral analysis on the Mel power spectrum to obtain the Mel-frequency cepstral coefficients of the target voice data to be differentiated comprises:
    Taking the logarithm of the Mel power spectrum to obtain the Mel power spectrum to be transformed;
    Performing a discrete cosine transform on the Mel power spectrum to be transformed to obtain the Mel-frequency cepstral coefficients of the target voice data to be differentiated.
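    A minimal sketch of the cepstral analysis in claim 20, taking the logarithm of the Mel power spectrum and applying a discrete cosine transform; keeping the first 13 coefficients and adding a small epsilon before the logarithm are assumptions for the example.

```python
import numpy as np
from scipy.fftpack import dct

def cepstral_analysis(mel_power, num_ceps=13):
    """mel_power: (num_frames, num_filters) Mel power spectrum.
    Returns the Mel-frequency cepstral coefficients of each frame (sketch)."""
    log_mel = np.log(mel_power + 1e-10)   # logarithm of the Mel power spectrum
    return dct(log_mel, type=2, axis=-1, norm='ortho')[:, :num_ceps]
```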
PCT/CN2018/094190 2018-06-04 2018-07-03 Speech differentiation method and apparatus, and computer device and storage medium WO2019232846A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810561788.1A CN108922513B (en) 2018-06-04 2018-06-04 Voice distinguishing method and device, computer equipment and storage medium
CN201810561788.1 2018-06-04

Publications (1)

Publication Number Publication Date
WO2019232846A1 true WO2019232846A1 (en) 2019-12-12

Family

ID=64419509

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/094190 WO2019232846A1 (en) 2018-06-04 2018-07-03 Speech differentiation method and apparatus, and computer device and storage medium

Country Status (2)

Country Link
CN (1) CN108922513B (en)
WO (1) WO2019232846A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112581940A (en) * 2020-09-17 2021-03-30 国网江苏省电力有限公司信息通信分公司 Discharging sound detection method based on edge calculation and neural network
CN112598114A (en) * 2020-12-17 2021-04-02 海光信息技术股份有限公司 Power consumption model construction method, power consumption measurement method and device and electronic equipment
CN113223511A (en) * 2020-01-21 2021-08-06 珠海市煊扬科技有限公司 Audio processing device for speech recognition
CN117648717A (en) * 2024-01-29 2024-03-05 知学云(北京)科技股份有限公司 Privacy protection method for artificial intelligent voice training

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109545193B (en) * 2018-12-18 2023-03-14 百度在线网络技术(北京)有限公司 Method and apparatus for generating a model
CN109658920B (en) * 2018-12-18 2020-10-09 百度在线网络技术(北京)有限公司 Method and apparatus for generating a model
CN109545192B (en) * 2018-12-18 2022-03-08 百度在线网络技术(北京)有限公司 Method and apparatus for generating a model
CN110265065B (en) * 2019-05-13 2021-08-03 厦门亿联网络技术股份有限公司 Method for constructing voice endpoint detection model and voice endpoint detection system
CN110189747A (en) * 2019-05-29 2019-08-30 大众问问(北京)信息科技有限公司 Voice signal recognition methods, device and equipment
CN110288999B (en) * 2019-07-02 2020-12-11 腾讯科技(深圳)有限公司 Speech recognition method, speech recognition device, computer equipment and storage medium
CN110838307B (en) * 2019-11-18 2022-02-25 思必驰科技股份有限公司 Voice message processing method and device
CN112908303A (en) * 2021-01-28 2021-06-04 广东优碧胜科技有限公司 Audio signal processing method and device and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105139864A (en) * 2015-08-17 2015-12-09 北京天诚盛业科技有限公司 Voice recognition method and voice recognition device
US20170154033A1 (en) * 2015-11-30 2017-06-01 Samsung Electronics Co., Ltd. Speech recognition apparatus and method
CN107731233A (en) * 2017-11-03 2018-02-23 王华锋 A kind of method for recognizing sound-groove based on RNN
CN107799126A (en) * 2017-10-16 2018-03-13 深圳狗尾草智能科技有限公司 Sound end detecting method and device based on Supervised machine learning

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100580770C (en) * 2005-08-08 2010-01-13 中国科学院声学研究所 Voice end detection method based on energy and harmonic
US9263036B1 (en) * 2012-11-29 2016-02-16 Google Inc. System and method for speech recognition using deep recurrent neural networks
US9858919B2 (en) * 2013-11-27 2018-01-02 International Business Machines Corporation Speaker adaptation of neural network acoustic models using I-vectors
CN104157290B (en) * 2014-08-19 2017-10-24 大连理工大学 A kind of method for distinguishing speek person based on deep learning
CN107871497A (en) * 2016-09-23 2018-04-03 北京眼神科技有限公司 Audio recognition method and device
CN107680597B (en) * 2017-10-23 2019-07-09 平安科技(深圳)有限公司 Audio recognition method, device, equipment and computer readable storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105139864A (en) * 2015-08-17 2015-12-09 北京天诚盛业科技有限公司 Voice recognition method and voice recognition device
US20170154033A1 (en) * 2015-11-30 2017-06-01 Samsung Electronics Co., Ltd. Speech recognition apparatus and method
CN107799126A (en) * 2017-10-16 2018-03-13 深圳狗尾草智能科技有限公司 Sound end detecting method and device based on Supervised machine learning
CN107731233A (en) * 2017-11-03 2018-02-23 王华锋 A kind of method for recognizing sound-groove based on RNN

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113223511A (en) * 2020-01-21 2021-08-06 珠海市煊扬科技有限公司 Audio processing device for speech recognition
CN113223511B (en) * 2020-01-21 2024-04-16 珠海市煊扬科技有限公司 Audio processing device for speech recognition
CN112581940A (en) * 2020-09-17 2021-03-30 国网江苏省电力有限公司信息通信分公司 Discharging sound detection method based on edge calculation and neural network
CN112598114A (en) * 2020-12-17 2021-04-02 海光信息技术股份有限公司 Power consumption model construction method, power consumption measurement method and device and electronic equipment
CN112598114B (en) * 2020-12-17 2023-11-03 海光信息技术股份有限公司 Power consumption model construction method, power consumption measurement method, device and electronic equipment
CN117648717A (en) * 2024-01-29 2024-03-05 知学云(北京)科技股份有限公司 Privacy protection method for artificial intelligent voice training
CN117648717B (en) * 2024-01-29 2024-05-03 知学云(北京)科技股份有限公司 Privacy protection method for artificial intelligent voice training

Also Published As

Publication number Publication date
CN108922513B (en) 2023-03-17
CN108922513A (en) 2018-11-30

Similar Documents

Publication Publication Date Title
WO2019232846A1 (en) Speech differentiation method and apparatus, and computer device and storage medium
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
JP7177167B2 (en) Mixed speech identification method, apparatus and computer program
Zhao et al. Monaural speech dereverberation using temporal convolutional networks with self attention
WO2019232833A1 (en) Speech differentiating method and device, computer device and storage medium
Bhat et al. A real-time convolutional neural network based speech enhancement for hearing impaired listeners using smartphone
CN112735456B (en) Speech enhancement method based on DNN-CLSTM network
CN110767244B (en) Speech enhancement method
Zhao et al. Late reverberation suppression using recurrent neural networks with long short-term memory
WO2019232867A1 (en) Voice discrimination method and apparatus, and computer device, and storage medium
CN112700786B (en) Speech enhancement method, device, electronic equipment and storage medium
CN111292762A (en) Single-channel voice separation method based on deep learning
CN111968666B (en) Hearing aid voice enhancement method based on depth domain self-adaptive network
CN108922543B (en) Model base establishing method, voice recognition method, device, equipment and medium
CN111899757A (en) Single-channel voice separation method and system for target speaker extraction
US20230395087A1 (en) Machine Learning for Microphone Style Transfer
Hou et al. Domain adversarial training for speech enhancement
Lee et al. Dynamic noise embedding: Noise aware training and adaptation for speech enhancement
WO2022213825A1 (en) Neural network-based end-to-end speech enhancement method and apparatus
CN111899750A (en) Speech enhancement algorithm combining cochlear speech features and hopping deep neural network
CN114613387A (en) Voice separation method and device, electronic equipment and storage medium
Poovarasan et al. Speech enhancement using sliding window empirical mode decomposition and hurst-based technique
Schmidt et al. Reduction of non-stationary noise using a non-negative latent variable decomposition
CN116403594A (en) Speech enhancement method and device based on noise update factor
CN111091847A (en) Deep clustering voice separation method based on improvement

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18921594

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 12.03.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18921594

Country of ref document: EP

Kind code of ref document: A1