WO2019232833A1 - Speech differentiation method and device, computer device and storage medium - Google Patents

Speech differentiation method and device, computer device and storage medium

Info

Publication number
WO2019232833A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
target
asr
data
voice data
Prior art date
Application number
PCT/CN2018/092651
Other languages
English (en)
Chinese (zh)
Inventor
涂宏
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2019232833A1 publication Critical patent/WO2019232833A1/fr

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18 - Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L25/21 - Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being power information
    • G10L25/24 - Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/27 - Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques characterised by the analysis technique, using neural networks
    • G10L25/78 - Detection of presence or absence of voice signals
    • G10L25/84 - Detection of presence or absence of voice signals for discriminating voice from noise

Definitions

  • the present application relates to the field of speech processing, and in particular, to a method, a device, a computer device, and a storage medium for distinguishing speech.
  • Speech discrimination refers to filtering silence out of the input speech so that only the speech segments that are meaningful for recognition (that is, the target speech) are retained.
  • Current speech discrimination methods still have significant shortcomings, especially in the presence of noise: as the noise grows, speech discrimination becomes more difficult, and the target speech and the interference speech cannot be accurately distinguished, so the speech discrimination effect is not ideal.
  • the embodiments of the present application provide a method, an apparatus, a computer device, and a storage medium for speech discrimination, so as to solve the problem that the effect of speech discrimination is not ideal.
  • An embodiment of the present application provides a method for distinguishing speech, including:
  • processing original to-be-differentiated voice data based on a voice activity detection algorithm to obtain target to-be-differentiated voice data;
  • acquiring corresponding ASR speech features based on the target to-be-differentiated voice data;
  • inputting the ASR speech features into a pre-trained ASR-DNN model for discrimination, and obtaining a target discrimination result.
  • An embodiment of the present application provides a voice distinguishing device, including:
  • a target to-be-differentiated voice data acquisition module, configured to process original to-be-differentiated voice data based on a voice activity detection algorithm to obtain target to-be-differentiated voice data;
  • a voice feature acquisition module, configured to acquire corresponding ASR voice features based on the target to-be-differentiated voice data;
  • a target discrimination result acquisition module, configured to input the ASR speech features into a pre-trained ASR-DNN model for discrimination and obtain a target discrimination result.
  • An embodiment of the present application provides a computer device including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor.
  • When the processor executes the computer-readable instructions, the following steps are implemented:
  • processing original to-be-differentiated voice data based on a voice activity detection algorithm to obtain target to-be-differentiated voice data; acquiring corresponding ASR speech features based on the target to-be-differentiated voice data; and inputting the ASR speech features into a pre-trained ASR-DNN model for discrimination to obtain a target discrimination result.
  • This embodiment of the present application provides one or more non-volatile readable storage media storing computer-readable instructions.
  • When the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps:
  • processing original to-be-differentiated voice data based on a voice activity detection algorithm to obtain target to-be-differentiated voice data; acquiring corresponding ASR speech features based on the target to-be-differentiated voice data; and inputting the ASR speech features into a pre-trained ASR-DNN model for discrimination to obtain a target discrimination result.
  • FIG. 1 is an application environment diagram of a speech discrimination method according to an embodiment of the present application
  • FIG. 2 is a flowchart of a speech discrimination method according to an embodiment of the present application.
  • FIG. 3 is a specific flowchart of step S10 in FIG. 2;
  • FIG. 4 is a specific flowchart of step S20 in FIG. 2;
  • FIG. 5 is a specific flowchart of step S21 in FIG. 4;
  • FIG. 6 is a specific flowchart of step S24 in FIG. 4;
  • FIG. 7 is a specific flowchart of the steps performed before step S30 in FIG. 2;
  • FIG. 8 is a schematic diagram of a voice distinguishing device according to an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a computer device in an embodiment of the present application.
  • FIG. 1 illustrates an application environment of a speech discrimination method provided by an embodiment of the present application.
  • The application environment of the speech discrimination method includes a server and a client connected through a network. The client is a device that can perform human-computer interaction with a user, including but not limited to computers, smart phones, and tablets; the server can be implemented as an independent server or as a server cluster composed of multiple servers.
  • the speech discrimination method provided in the embodiments of the present application is applied to a server.
  • FIG. 2 shows a flowchart of the voice discrimination method in this embodiment.
  • the voice discrimination method includes the following steps:
  • S10 Process the original speech data to be distinguished based on the speech activity detection algorithm to obtain the target speech data to be distinguished.
  • VAD (Voice Activity Detection) algorithms are algorithms specifically used for voice activity detection, and they come in various types. Understandably, VAD can be applied to speech discrimination to distinguish target speech from interference speech.
  • the target voice refers to the voice part in which the voiceprint continuously changes significantly in the voice data, and the interference voice may be a voice part in the voice data that is not pronounced due to silence, or it may be environmental noise.
  • The original to-be-differentiated voice data is the to-be-differentiated voice data as originally obtained, that is, the voice data that is to undergo preliminary distinguishing processing using the VAD algorithm.
  • the target voice data to be distinguished refers to the voice data used for voice discrimination obtained after processing the original voice data to be distinguished through a voice activity detection algorithm.
  • The VAD algorithm is used to process the original to-be-differentiated voice data, the target voice is initially selected from the original to-be-differentiated voice data, and the initially selected target voice portion is used as the target to-be-differentiated voice data. Understandably, the interference speech that is screened out at this stage does not need to be distinguished again, which improves the efficiency of speech discrimination. However, the target voice initially screened from the original to-be-differentiated voice data still contains interfering speech; in particular, when the original voice data to be distinguished is relatively noisy, more interfering speech (such as noise) remains mixed with the preliminary target voice.
  • Therefore, the initially screened target voice that is still mixed with interfering voice is used as the target to-be-differentiated voice data, so that the initially screened target voice can be distinguished more accurately.
  • The server uses the VAD algorithm to perform preliminary speech discrimination on the original to-be-differentiated voice data; it can then re-differentiate the preliminarily filtered data while a large amount of interfering speech has already been removed, which is beneficial to subsequent, more refined speech discrimination.
  • processing the original to-be-differentiated voice data based on the voice activity detection algorithm to obtain the target to-be-differentiated voice data includes the following steps:
  • S11 Process the original speech data to be distinguished according to the short-term energy feature value calculation formula, obtain the corresponding short-term energy feature value, retain the original to-be-differentiated data whose short-term energy feature value is greater than the first threshold, and determine it as the first original distinguished speech data. The short-term energy feature value calculation formula is $E = \sum_{n=0}^{N-1} s^{2}(n)$, where N is the speech frame length, s(n) is the signal amplitude in the time domain, and n is the time.
  • The short-term energy feature value describes the energy of one frame of speech (a frame generally spans 10-30 ms) in the time domain; the "short-term" in short-term energy should be understood as the duration of one frame (that is, the speech frame length). Since the short-term energy feature value of the target voice is much higher than that of the interfering voice (silence), the target voice and the interfering voice can be distinguished according to the short-term energy feature value.
  • Specifically, the original speech data to be distinguished is processed according to the short-term energy feature value calculation formula (the original speech data to be distinguished needs to be framed in advance), and the short-term energy feature value of each frame of the original speech data to be distinguished is calculated. The short-term energy feature value of each frame is then compared with a preset first threshold, and the original to-be-differentiated voice data whose value is greater than the first threshold is retained and determined as the first original distinguished speech data.
  • The first threshold is a cut-off value for deciding whether a short-term energy feature value belongs to the target speech or to the interference speech.
  • In this way, the target voice in the original to-be-differentiated voice data can be obtained from the perspective of the short-term energy feature value, effectively removing a large amount of interfering speech from the original to-be-differentiated voice data.
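  • As an illustration only, the following Python sketch shows how step S11 could be computed; the frame shape, the placeholder signal values, and the first threshold are assumptions, not values taken from this application.

```python
import numpy as np

def short_term_energy(frames):
    """Short-term energy per frame: E = sum_{n=0}^{N-1} s(n)^2."""
    return np.sum(frames ** 2, axis=1)

# frames: one row per 10-30 ms frame (placeholder random data for illustration).
frames = np.random.randn(100, 256) * 0.01
energy = short_term_energy(frames)
first_threshold = 0.05                               # assumed value, not from the application
first_original = frames[energy > first_threshold]    # first original distinguished speech data
```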
  • S12 Process the original to-be-differentiated voice data according to the zero-crossing rate feature value calculation formula, obtain the corresponding zero-crossing rate feature value, retain the original to-be-differentiated voice data whose zero-crossing rate feature value is less than the second threshold, and determine it as the second original distinguished speech data. The zero-crossing rate feature value calculation formula is $Z = \frac{1}{2}\sum_{n=1}^{N-1}\left|\operatorname{sgn}[s(n)] - \operatorname{sgn}[s(n-1)]\right|$, where N is the speech frame length, s(n) is the signal amplitude in the time domain, and n is the time.
  • The zero-crossing rate feature value describes the number of times the voice signal waveform crosses the horizontal axis (zero level) within one frame of speech. Since the zero-crossing rate feature value of the target voice is much lower than that of the interfering voice, the target voice and the interfering voice can be distinguished according to the zero-crossing rate feature value.
  • Specifically, the original to-be-differentiated speech data is processed according to the zero-crossing rate feature value calculation formula, and the zero-crossing rate feature value of each frame of the original to-be-differentiated voice data is calculated. Each value is then compared with a preset second threshold, and the original to-be-differentiated voice data whose value is smaller than the second threshold is retained and determined as the second original distinguished speech data.
  • The second threshold is a cut-off value for deciding whether a zero-crossing rate feature value belongs to the target speech or to the interference speech.
  • In this way, the target voice in the original to-be-differentiated voice data can be obtained from the perspective of the zero-crossing rate feature value, effectively removing a large amount of interfering speech from the original to-be-differentiated voice data.
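  • A corresponding sketch of step S12, reusing the placeholder frames from the previous sketch; the second threshold value is likewise an assumption.

```python
import numpy as np

def zero_crossing_rate(frames):
    """Zero-crossing rate per frame: half the summed sign changes of s(n) within the frame."""
    signs = np.sign(frames)
    signs[signs == 0] = 1                      # treat exact zeros as positive
    return 0.5 * np.sum(np.abs(np.diff(signs, axis=1)), axis=1)

zcr = zero_crossing_rate(frames)
second_threshold = 60                          # assumed value, not from the application
second_original = frames[zcr < second_threshold]   # second original distinguished speech data
```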
  • S13 Use the first original distinguished speech data and the second original distinguished speech data as the target to-be-differentiated speech data.
  • The first original distinguished speech data is obtained from the original to-be-differentiated speech data from the angle of the short-term energy feature value, and the second original distinguished speech data is obtained from the original to-be-differentiated speech data from the angle of the zero-crossing rate feature value.
  • The first and second original distinguished speech data thus come from two different perspectives on distinguishing speech, and both perspectives distinguish the speech well. Therefore, the first original distinguished speech data and the second original distinguished speech data are merged (by taking their intersection) and used together as the target speech data to be distinguished.
  • Steps S11-S13 can initially and effectively remove most of the interfering voice data in the original to-be-differentiated voice data and retain the portion that mixes the target voice with a small amount of interfering voice (such as noise); this retained portion is used as the target to-be-differentiated speech data, which constitutes an effective preliminary speech discrimination on the original to-be-differentiated speech data.
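  • Continuing the sketches above, the intersection merge of step S13 could look as follows (all variables are the assumed placeholders defined in the previous sketches):

```python
import numpy as np

# Keep only the frames that satisfy both the energy and the zero-crossing criteria.
keep_energy = np.where(energy > first_threshold)[0]
keep_zcr = np.where(zcr < second_threshold)[0]
target_idx = np.intersect1d(keep_energy, keep_zcr)
target_frames = frames[target_idx]      # target to-be-differentiated speech data
```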
  • S20 Obtain the corresponding ASR voice characteristics based on the target to-be-differentiated voice data.
  • ASR (Automatic Speech Recognition) is a technology that converts speech data into computer-readable input, for example converting speech data into keys, binary codes, or character sequences.
  • ASR can extract the voice features in the target to-be-differentiated voice data; the extracted features are the corresponding ASR voice features.
  • ASR can convert voice data that cannot be read directly by a computer into ASR voice features that the computer can read, and the ASR voice features can be represented as vectors.
  • ASR is used to process the target to-be-differentiated voice data to obtain the corresponding ASR voice features.
  • These ASR voice features can well reflect the underlying characteristics of the target to-be-differentiated voice data and are used to distinguish that data; they provide an important technical prerequisite for the subsequent recognition by the ASR-DNN (Deep Neural Network) model based on the ASR voice features.
  • In an embodiment, step S20 of acquiring the corresponding ASR voice features based on the target to-be-differentiated voice data includes the following steps:
  • S21 Preprocess the target to-be-differentiated voice data to obtain preprocessed voice data.
  • The target to-be-differentiated voice data is pre-processed, and corresponding pre-processed voice data is obtained.
  • Pre-processing the target to-be-differentiated voice data makes it possible to extract ASR voice features that better represent the target to-be-differentiated voice data, so that speech discrimination can be performed using these ASR speech features.
  • pre-processing the target to-be-differentiated voice data to obtain pre-processed voice data includes the following steps:
  • S211 Perform pre-emphasis processing on the target to-be-differentiated voice data.
  • pre-emphasis is a signal processing method that compensates the high-frequency component of the input signal at the transmitting end.
  • the idea of the pre-emphasis technology is to enhance the high-frequency component of the signal at the transmitting end of the transmission line to compensate for the excessive attenuation of the high-frequency component during transmission, so that the receiving end can obtain a better signal waveform.
  • Pre-emphasis has no effect on noise, so it can effectively improve the output signal-to-noise ratio.
  • The pre-emphasis formula is $s'(n) = s(n) - a \cdot s(n-1)$, where a is the weighting coefficient and the range of a is 0.9 < a < 1.0.
  • A value of a = 0.97 generally gives a good pre-emphasis effect.
  • Pre-emphasis processing can eliminate interference caused by the vocal cords and lips during the utterance process, effectively compensate the suppressed high-frequency part of the target voice data to be distinguished, and highlight its high-frequency formants, strengthening the signal amplitude of the target to-be-differentiated voice data and helping to extract the ASR speech features.
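  • A minimal sketch of the pre-emphasis filter; the coefficient of 0.97 follows the suggestion above, and the handling of the first sample is an implementation assumption.

```python
import numpy as np

def pre_emphasis(signal, a=0.97):
    """s'(n) = s(n) - a * s(n-1); the first sample is kept unchanged."""
    return np.append(signal[0], signal[1:] - a * signal[:-1])
```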
  • S212 Frame the target to-be-differentiated voice data after pre-emphasis.
  • Framing refers to the speech processing technology that cuts the entire voice signal into several segments.
  • the size of each frame is in the range of 10-30ms, and the frame shift is about 1/2 frame length.
  • Frame shift refers to the overlapping area between two adjacent frames, which can avoid the problem of excessive changes in adjacent two frames.
  • Frame processing divides the target to-be-differentiated voice data into several segments of voice data; subdividing the target to-be-differentiated voice data in this way facilitates the extraction of ASR voice features.
  • S213 Perform windowing on the framed target to-be-differentiated voice data to obtain pre-processed voice data.
  • The windowing uses a Hamming window, and the windowing formula is $s'(n) = s(n)\left[0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right)\right]$, $0 \le n \le N-1$, where N is the Hamming window length, n is the time, s(n) is the signal amplitude in the time domain, and s'(n) is the signal amplitude in the time domain after windowing.
  • Windowing the framed target to-be-differentiated voice data to obtain the pre-processed speech data makes the time-domain signal of each frame continuous after framing, which helps to extract the ASR voice features of the target to-be-differentiated voice data.
  • The pre-processing operations of steps S211 to S213 on the target to-be-differentiated voice data provide the basis for extracting its ASR voice features; they make the extracted ASR voice features more representative of the target to-be-differentiated voice data so that voice discrimination can be performed according to these features.
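  • As an illustration of steps S212 and S213, the following sketch frames a signal and applies a Hamming window; the frame length of 400 samples (25 ms at an assumed 16 kHz sampling rate) and the half-frame shift are placeholder choices.

```python
import numpy as np

def frame_and_window(signal, frame_len=400, frame_shift=200):
    """Cut the signal into overlapping frames and apply a Hamming window to each frame."""
    num_frames = 1 + (len(signal) - frame_len) // frame_shift   # assumes len(signal) >= frame_len
    frames = np.stack([signal[i * frame_shift: i * frame_shift + frame_len]
                       for i in range(num_frames)])
    n = np.arange(frame_len)
    hamming = 0.54 - 0.46 * np.cos(2 * np.pi * n / (frame_len - 1))
    return frames * hamming            # s'(n) = s(n) * w(n)
```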
  • S22 Perform a fast Fourier transform on the pre-processed speech data to obtain the frequency spectrum of the target speech data to be distinguished, and obtain the power spectrum of the target speech data to be distinguished according to the frequency spectrum.
  • FFT (Fast Fourier Transform) is the collective term for efficient, fast methods of computing the discrete Fourier transform on a computer.
  • Using this algorithm greatly reduces the number of multiplications the computer needs in order to compute the discrete Fourier transform; the more sampling points there are in the transform, the more significant the computational savings of the FFT algorithm.
  • A fast Fourier transform is performed on the pre-processed voice data to convert it from the signal amplitude in the time domain to the signal amplitude in the frequency domain (the spectrum).
  • The formula for calculating the spectrum is $s(k) = \sum_{n=0}^{N-1} s(n)\, e^{-2\pi i k n / N}$, $0 \le k < N$, where N is the frame size, s(k) is the signal amplitude in the frequency domain, s(n) is the signal amplitude in the time domain, n is time, and i is the imaginary unit.
  • After the pre-processed speech data is converted from the time-domain signal amplitude to the frequency-domain signal amplitude, the power spectrum of the target speech data to be distinguished is obtained from the frequency-domain signal amplitude (typically as the squared magnitude $|s(k)|^{2}/N$). Extracting ASR speech features from this spectrum provides an important technical basis.
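  • A sketch of step S22; the FFT size of 512 points is an assumed placeholder, and the power spectrum is taken as the squared magnitude divided by the FFT size.

```python
import numpy as np

def power_spectrum(windowed_frames, n_fft=512):
    """FFT each windowed frame and return the power spectrum |s(k)|^2 / N."""
    spectrum = np.fft.rfft(windowed_frames, n=n_fft, axis=1)   # s(k), frequency-domain amplitude
    return (np.abs(spectrum) ** 2) / n_fft
```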
  • S23 Use the Mel scale filter bank to process the power spectrum of the target speech data to be distinguished, and obtain the Mel power spectrum of the target speech data to be distinguished.
  • Processing the power spectrum of the target speech data to be distinguished with a Mel scale filter bank amounts to performing a Mel frequency analysis of that power spectrum.
  • the Mel frequency analysis is an analysis based on human auditory perception.
  • The human ear behaves like a filter bank, focusing only on certain specific frequency components (human hearing is selective with respect to frequency); that is, the ear only lets signals of certain frequencies pass and simply ignores frequency components it does not want to perceive.
  • These filters are not uniformly distributed along the frequency axis: there are many, densely spaced filters in the low-frequency region, while in the high-frequency region the filters become fewer and sparsely distributed. Understandably, the Mel scale filter bank therefore has a high resolution in the low-frequency part, which matches the hearing characteristics of the human ear; this is the physical meaning of the Mel scale.
  • a Mel scale filter bank is used to process the power spectrum of the target speech data to be distinguished, and a Mel power spectrum of the target speech data to be distinguished is obtained.
  • Specifically, the frequency-domain signal is segmented with the Mel scale filter bank so that each frequency segment corresponds to one numerical value; if the number of filters is 22, 22 energy values forming the Mel power spectrum of the target speech data to be distinguished are obtained.
  • The Mel power spectrum obtained by this analysis retains the frequency portion closely related to the characteristics of the human ear, and this frequency portion can well reflect the characteristics of the target speech data to be distinguished.
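  • The following sketch applies a Mel scale filter bank to the power spectrum; the use of librosa to build the triangular filters, the 16 kHz sampling rate, and the 22 filters are assumptions for illustration, not details from this application.

```python
import numpy as np
import librosa   # used here only to build the triangular Mel filters

def mel_power_spectrum(power_spec, sr=16000, n_fft=512, n_mels=22):
    """Apply a Mel scale filter bank; with 22 filters, 22 energy values per frame result."""
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)   # shape (n_mels, 1 + n_fft // 2)
    return power_spec @ mel_fb.T                                      # shape (num_frames, n_mels)
```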
  • S24 Perform cepstrum analysis on the Mel power spectrum to obtain a Mel frequency cepstrum coefficient of the target speech data to be distinguished.
  • The cepstrum is the inverse Fourier transform of the logarithm of the Fourier transform spectrum of a signal; since the general Fourier spectrum is a complex spectrum, the cepstrum is also called the complex cepstrum.
  • a cepstrum analysis is performed on the Mel power spectrum, and based on the cepstrum result, the Mel frequency cepstrum coefficient of the target speech data to be distinguished is analyzed and obtained.
  • The features contained in the Mel power spectrum of the target speech data to be distinguished, whose original feature dimension is too high, can be converted into easy-to-use Mel frequency cepstrum coefficients by performing cepstrum analysis on the Mel power spectrum.
  • These Mel frequency cepstrum coefficients can be used as ASR voice features, that is, as coefficients for distinguishing different voices.
  • The ASR voice features reflect the differences between voices and can be used to identify and distinguish the target to-be-differentiated voice data.
  • step S24 cepstrum analysis is performed on the Mel power spectrum to obtain the Mel frequency cepstrum coefficient of the target speech data to be distinguished, including the following steps:
  • S241 Take the logarithm of the Mel power spectrum to obtain the Mel power spectrum m to be transformed.
  • S242 Perform discrete cosine transform on the Mel power spectrum to be transformed to obtain a Mel frequency cepstrum coefficient of the target speech data to be distinguished.
  • a discrete cosine transform is performed on the Mel power spectrum m to be transformed to obtain a corresponding Mel frequency cepstrum coefficient of the target speech data to be distinguished.
  • After the discrete cosine transform (DCT), the second to thirteenth coefficients are generally taken.
  • These coefficients are used as the ASR speech features and can reflect the differences between speech data.
  • The formula for the discrete cosine transform of the Mel power spectrum m to be transformed is $C(i) = \sum_{j=1}^{N} m(j)\cos\left(\frac{\pi i\,(j-0.5)}{N}\right)$, where N is the frame length, m is the Mel power spectrum to be transformed, and j is the independent variable of the Mel power spectrum to be transformed. Because adjacent Mel filters overlap, the energy values obtained with the Mel scale filters are correlated.
  • The discrete cosine transform performs dimensionality reduction and abstraction on the Mel power spectrum m to be transformed, and the corresponding ASR speech features are obtained. Compared with the Fourier transform, the result of the discrete cosine transform has no imaginary part, which gives it an obvious computational advantage.
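  • A sketch of the cepstrum analysis of steps S241-S242 (logarithm followed by a DCT, keeping the second to thirteenth coefficients); the orthonormal DCT-II from scipy and the small constant added before the logarithm are implementation assumptions.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_from_mel(mel_power, num_ceps=12):
    """Log of the Mel power spectrum, then DCT; keep the 2nd-13th coefficients as ASR features."""
    m = np.log(mel_power + 1e-10)                    # Mel power spectrum to be transformed
    cepstra = dct(m, type=2, axis=1, norm='ortho')   # discrete cosine transform
    return cepstra[:, 1:1 + num_ceps]
```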
  • Steps S21-S24 are based on the ASR technology to perform feature extraction on the target to-be-differentiated voice data.
  • the final obtained ASR speech feature can well reflect the target to-be-differentiated voice data.
  • The ASR voice features can then be used to train a deep network model to obtain the ASR-DNN model, which makes the ASR-DNN model obtained in training more accurate in speech discrimination; even under very noisy conditions it can accurately distinguish noise from speech.
  • The features extracted above are Mel frequency cepstrum coefficients, but the ASR speech features should not be limited to Mel frequency cepstrum coefficients only; any speech features obtained by ASR technology that effectively reflect the speech data can be used as ASR speech features for recognition and model training.
  • The ASR-DNN model refers to a deep neural network (DNN) model trained using ASR speech features.
  • Because the ASR-DNN model is trained using ASR speech features, the model can recognize ASR speech features and thus distinguish speech based on them. For example, if the speech data used for training includes target speech and noise, the ASR speech features of the target speech and of the noise are both extracted when the ASR-DNN model is trained, so that the trained ASR-DNN model can perform recognition based on the ASR speech features (the interfering speech distinguished by the ASR-DNN model here mainly refers to the noise part).
  • S30 The ASR speech features are input into the pre-trained ASR-DNN model for discrimination. Since the ASR speech features can reflect the characteristics of the speech data, the ASR speech features extracted from the target to-be-differentiated speech data can be recognized by the ASR-DNN model, so that the target speech data to be distinguished is accurately distinguished based on the ASR speech features.
  • The pre-trained ASR-DNN model combines the ASR speech features with the deep feature extraction of a neural network and distinguishes speech from the nature of the speech itself; it still has a high accuracy rate under very bad noise conditions.
  • Because the features extracted by ASR also include the ASR voice features of the noise, the noise can also be accurately distinguished, which overcomes the weakness of current voice discrimination methods (including but not limited to VAD) under noisy conditions.
  • In an embodiment, before step S30 of inputting the ASR voice features into the pre-trained ASR-DNN model for discrimination to obtain the target discrimination result, the voice discrimination method further includes the step of obtaining the ASR-DNN model.
  • the steps of obtaining the ASR-DNN model include:
  • S31 Acquire speech data to be trained, and extract speech features of the ASR to be trained.
  • the voice data to be trained refers to the voice data required to obtain the ASR-DNN model.
  • the voice data to be trained can be an open source voice training set directly, or a voice training set by collecting a large amount of sample voice data.
  • The to-be-trained voice data distinguishes the target voice and the noise in advance (the ratio of target voice to noise may be 1:1); a specific way of distinguishing them is to set different label values for the target voice and the noise, for example marking all target speech parts of the speech data to be trained as 1 (representing "true") and all noise parts as 0 (representing "false").
  • By setting the label values in advance, the recognition accuracy of the ASR-DNN model can be verified and a reference for improvement is provided, so that the network parameters in the ASR-DNN model can be updated and the ASR-DNN model continuously optimized.
  • The voice data to be trained is obtained and its features are extracted; these features are the to-be-trained ASR speech features.
  • The steps for extracting the to-be-trained ASR speech features are the same as steps S21-S24 and are not repeated here.
  • The speech data to be trained includes a target speech part and a noise part, and both types of speech data have their own ASR speech features; therefore, the corresponding ASR speech features can be extracted and used to train the ASR-DNN model.
  • The ASR-DNN model obtained by training on these ASR speech features can accurately distinguish the target speech and the noise (noise belongs to the interference speech).
  • the DNN model is a deep neural network model.
  • the deep neural network model includes an input layer, a hidden layer, and an output layer composed of neurons.
  • the deep neural network model includes weights and biases of each neuron connection between layers. These weights and biases determine the nature and recognition effect of the DNN model.
  • S32 Initialize the DNN model.
  • This initialization operation sets the initial values of the weights and biases in the DNN model.
  • The initial values can be set to small values, for example within the interval [-0.3, 0.3], or empirical values can be used directly as the initial weights and biases.
  • Reasonable initialization of the DNN model gives the model more flexible adjustment capability in the early stage and allows the model to be adjusted effectively during training; otherwise a poor adjustment capability at the initial stage would result in a trained model whose discrimination is not good.
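  • A minimal initialization sketch; the uniform range follows the [-0.3, 0.3] interval mentioned above, while the layer sizes and the column-vector layout of the activations are assumptions.

```python
import numpy as np

def init_dnn(layer_sizes, scale=0.3, seed=0):
    """Initialize weights W and biases b with small values, e.g. uniformly in [-0.3, 0.3]."""
    rng = np.random.default_rng(seed)
    weights = [rng.uniform(-scale, scale, (n_out, n_in))
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
    biases = [rng.uniform(-scale, scale, (n_out, 1)) for n_out in layer_sizes[1:]]
    return weights, biases
```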
  • S33 Input the ASR speech features to be trained into the DNN model, and obtain the output value of the DNN model according to the forward propagation algorithm.
  • The DNN forward propagation algorithm is a series of linear operations and activation operations performed in the DNN model according to the weights W, the biases b, and the input value vector $x_{i}$ of each neuron; the computation starts from the input layer and proceeds layer by layer until the output layer produces the output value.
  • the output value of each layer of the network in the DNN model can be calculated until the output value of the last layer is calculated.
  • Let the total number of layers in the DNN model be L.
  • The output of layer l is $a^{i,l} = \sigma(W^{l} a^{i,l-1} + b^{l})$, where l denotes the current layer and σ is the activation function.
  • the activation function specifically used here may be a sigmoid or tanh activation function.
  • Forward propagation is performed layer by layer according to the number of layers, and the final output values $a^{i,L}$ of the network in the DNN model are obtained.
  • Based on these output values $a^{i,L}$, the network parameters of the DNN model (the weights W and biases b connecting each neuron) are adjusted to obtain an ASR-DNN model with an excellent speech discrimination capability.
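  • A sketch of the forward propagation with a sigmoid activation (the sigmoid choice follows the text; the column-vector convention is an assumption), reusing the parameters from the initialization sketch above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Compute a^l = sigma(W^l a^{l-1} + b^l) layer by layer up to the output layer."""
    activations, zs = [x], []
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    return activations, zs
```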
  • S34 Perform error back propagation based on the output value, update the weights and offsets of each layer of the DNN model, and obtain the ASR-DNN model.
  • Specifically, the training error of the to-be-trained ASR speech features can be calculated from the outputs $a^{i,L}$ and the preset labels of the ASR speech features in the DNN model.
  • Based on this error, a suitable error function is constructed (for example an error function that measures the error with the mean square error); error back propagation is then carried out according to the error function to adjust and update the weights W and biases b of each layer of the DNN model.
  • The ASR-DNN model is the DNN model trained on the basis of the ASR speech features.
  • The back-propagation algorithm is used to update the weights W and biases b of each layer of the DNN model: the extremum that minimizes the error function is sought according to the back-propagation algorithm, and the weights W and biases b of each layer are optimized and updated accordingly to obtain the ASR-DNN model.
  • Specifically, the iteration step size of the model training is set to α, together with the maximum number of iterations MAX and the stop-iteration threshold ε.
  • The sensitivity $\delta^{i,l}$ is a common factor that appears in every parameter update, so the error can be propagated by using the sensitivity $\delta^{i,l}$ to update the network parameters in the DNN model.
  • The sensitivity of layer l can be calculated as $\delta^{i,l} = (W^{l+1})^{T}\delta^{i,l+1} \odot \sigma'(z^{i,l})$; with it, the weights W and biases b of each layer of the DNN model can be updated as $W^{l} = W^{l} - \frac{\alpha}{m}\sum_{i=1}^{m}\delta^{i,l}\,(a^{i,l-1})^{T}$ and $b^{l} = b^{l} - \frac{\alpha}{m}\sum_{i=1}^{m}\delta^{i,l}$, where α is the iteration step size of the model training, m is the total number of samples of the input to-be-trained ASR speech features, and T denotes the matrix transposition operation.
  • When the changes of all W and b fall below the stop-iteration threshold ε, the training can be stopped; alternatively, the training is stopped when it reaches the maximum number of iterations MAX.
  • In this way, the weights W and biases b of each layer of the DNN model are updated, so that the finally obtained ASR-DNN model can distinguish speech based on the ASR speech features.
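  • A minimal back-propagation sketch under the assumptions above (mean square error and sigmoid activations, both illustrative choices consistent with the text rather than a definitive reproduction of the application's training procedure), reusing the forward() sketch.

```python
import numpy as np

def backprop_update(x, y, weights, biases, alpha=0.1):
    """One back-propagation step: delta^l = (W^{l+1})^T delta^{l+1} * sigma'(z^l)."""
    activations, zs = forward(x, weights, biases)     # from the forward-propagation sketch
    out = activations[-1]
    delta = (out - y) * out * (1 - out)               # output-layer sensitivity for MSE loss
    m = x.shape[1]                                    # total number of training samples
    for l in range(len(weights) - 1, -1, -1):
        grad_W = delta @ activations[l].T / m
        grad_b = np.sum(delta, axis=1, keepdims=True) / m
        if l > 0:                                     # propagate sensitivity with the old weights
            a_prev = activations[l]
            delta = (weights[l].T @ delta) * a_prev * (1 - a_prev)
        weights[l] -= alpha * grad_W                  # W^l = W^l - (alpha/m) * sum(delta a^T)
        biases[l] -= alpha * grad_b                   # b^l = b^l - (alpha/m) * sum(delta)
    return weights, biases
```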
  • Steps S31-S34 train the DNN model by using the ASR speech features to be trained, so that the ASR-DNN model obtained by training can effectively distinguish the speech, and can still accurately distinguish the target speech from the noise in the case of severe noise interference.
  • The ASR-DNN model further extracts the deep features of the to-be-trained ASR speech features during model training.
  • The trained weights and biases in the ASR-DNN network reflect these deep features based on the ASR speech features.
  • the ASR-DNN model can perform deep feature recognition based on the target speech ASR feature and the noise ASR speech feature to achieve accurate discrimination between the target speech and noise.
  • the original voice data to be differentiated is processed based on a voice activity detection algorithm (VAD), and the target voice data to be distinguished is obtained.
  • The original voice data to be distinguished is first processed by the voice activity detection algorithm to obtain the narrower-scope target to-be-differentiated voice data; this initially and effectively removes interfering voice data from the original to-be-differentiated voice data, retains the portion in which the target voice is mixed with interfering voice, and uses this portion as the target to-be-differentiated voice data, thereby making an effective preliminary speech distinction on the original to-be-differentiated voice data and removing a large amount of interfering speech.
  • The corresponding ASR speech features are then obtained.
  • These ASR speech features make the result of speech discrimination more accurate; even under heavy noise, interfering speech (such as noise) can be accurately distinguished from the target speech, providing an important technical prerequisite for the subsequent recognition by the ASR-DNN model based on the ASR speech features.
  • The ASR speech features are input into the pre-trained ASR-DNN model for discrimination, and the target discrimination result is obtained.
  • The ASR-DNN model is a recognition model specially trained to distinguish speech effectively based on ASR speech features; it can separate the target speech and the interference speech from the mixed data (because VAD has already been used for a first pass of discrimination, most of the interference speech here is noise).
  • In this way, the target speech and the interference speech in the target to-be-differentiated speech data are correctly distinguished, improving the accuracy of speech discrimination.
  • FIG. 8 shows a principle block diagram of a voice distinguishing device corresponding to the voice distinguishing method in the embodiment.
  • the voice discrimination device includes a target to-be-differentiated voice data acquisition module 10, a voice feature acquisition module 20, and a target discrimination result acquisition module 30.
  • The functions implemented by the target to-be-differentiated voice data acquisition module 10, the voice feature acquisition module 20, and the target discrimination result acquisition module 30 correspond to the steps of the voice discrimination method in the embodiment; to avoid redundant description, this embodiment does not elaborate on them one by one.
  • the target to-be-differentiated voice data acquisition module 10 is configured to process the original to-be-differentiated voice data based on a voice activity detection algorithm to obtain the target to-be-differentiated voice data.
  • the voice feature obtaining module 20 is configured to obtain a corresponding ASR voice feature based on the target to-be-differentiated voice data.
  • the target discrimination result acquisition module 30 is configured to input ASR speech features into a pre-trained ASR-DNN model for discrimination, and obtain a target discrimination result.
  • The target to-be-differentiated voice data acquisition module 10 includes a first original distinguished speech data acquisition unit 11, a second original distinguished speech data acquisition unit 12, and a target to-be-differentiated speech data acquisition unit 13.
  • The first original distinguished speech data acquisition unit 11 is configured to process the original speech data to be distinguished according to the short-term energy feature value calculation formula, obtain the corresponding short-term energy feature value, retain the original to-be-differentiated data whose short-term energy feature value is greater than the first threshold, and determine it as the first original distinguished speech data.
  • The short-term energy feature value calculation formula is $E = \sum_{n=0}^{N-1} s^{2}(n)$, where N is the speech frame length, s(n) is the signal amplitude in the time domain, and n is the time.
  • The second original distinguished speech data acquisition unit 12 is configured to process the original to-be-differentiated speech data according to the zero-crossing rate feature value calculation formula, obtain the corresponding zero-crossing rate feature value, retain the original to-be-differentiated voice data whose zero-crossing rate feature value is less than the second threshold, and determine it as the second original distinguished voice data.
  • The zero-crossing rate feature value calculation formula is $Z = \frac{1}{2}\sum_{n=1}^{N-1}\left|\operatorname{sgn}[s(n)] - \operatorname{sgn}[s(n-1)]\right|$, where N is the speech frame length, s(n) is the signal amplitude in the time domain, and n is the time.
  • The target to-be-differentiated speech data acquisition unit 13 is configured to use the first original distinguished speech data and the second original distinguished speech data as the target to-be-differentiated speech data.
  • the speech feature acquisition module 20 includes a pre-processed speech data acquisition unit 21, a power spectrum acquisition unit 22, a Mel power spectrum acquisition unit 23, and a Mel frequency cepstrum coefficient unit 24.
  • the pre-processing unit 21 is configured to pre-process the target to-be-differentiated voice data to obtain pre-processed voice data.
  • the power spectrum obtaining unit 22 is configured to perform a fast Fourier transform on the pre-processed speech data, obtain a frequency spectrum of the target speech data to be distinguished, and obtain a power spectrum of the target speech data to be distinguished according to the frequency spectrum.
  • the Mel power spectrum acquisition unit 23 is configured to use a Mel scale filter bank to process the power spectrum of the target speech data to be distinguished, and obtain a Mel power spectrum of the target speech data to be distinguished.
  • the Mel frequency cepstrum coefficient unit 24 is configured to perform cepstrum analysis on the Mel power spectrum to obtain a Mel frequency cepstrum coefficient of the target speech data to be distinguished.
  • the pre-processing unit 21 includes a pre-emphasis sub-unit 211, a frame sub-unit 212, and a windowing sub-unit 213.
  • the pre-emphasis sub-unit 211 is configured to perform pre-emphasis processing on target voice data to be distinguished.
  • The frame sub-unit 212 is configured to perform framing on the pre-emphasized target to-be-differentiated voice data.
  • a windowing sub-unit 213 is configured to perform windowing on the framed target to-be-differentiated speech data to obtain pre-processed speech data.
  • The windowing formula (using a Hamming window) is $s'(n) = s(n)\left[0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right)\right]$, where N is the window length, n is the time, s(n) is the signal amplitude in the time domain, and s'(n) is the signal amplitude in the time domain after windowing.
  • the Mel frequency cepstrum coefficient unit 24 includes a Mel power spectrum acquisition sub-unit 241 and a Mel frequency cepstrum coefficient sub-unit 242 to be transformed.
  • The to-be-transformed Mel power spectrum acquisition subunit 241 is configured to take the logarithm of the Mel power spectrum to obtain the Mel power spectrum to be transformed.
  • the Mel frequency cepstrum coefficient sub-unit 242 is configured to perform a discrete cosine transform of the Mel power spectrum to be transformed to obtain a Mel frequency cepstrum coefficient of the target speech data to be distinguished.
  • the speech discrimination device further includes an ASR-DNN model acquisition module 40.
  • The ASR-DNN model acquisition module 40 includes a to-be-trained ASR speech feature acquisition unit 41, an initialization unit 42, an output value acquisition unit 43, and an update unit 44.
  • The to-be-trained ASR speech feature acquisition unit 41 is configured to acquire the speech data to be trained and extract the to-be-trained ASR speech features.
  • the initialization unit 42 is configured to initialize a DNN model.
  • An output value obtaining unit 43 is configured to input the ASR speech features to be trained into the DNN model, and obtain the output value of the DNN model according to the forward propagation algorithm.
  • An update unit 44 is configured to perform error back propagation based on the output value, update the weights and offsets of each layer of the DNN model, and obtain an ASR-DNN model.
  • This embodiment provides one or more non-volatile readable storage media storing computer-readable instructions.
  • When the computer-readable instructions are executed by one or more processors, the one or more processors are caused to implement the method for distinguishing speech in the embodiment; to avoid repetition, details are not described herein again.
  • Alternatively, when the computer-readable instructions are executed by one or more processors, the one or more processors are caused to perform the functions of the modules/units in the speech distinguishing device in the embodiment; to avoid repetition, details are not repeated here.
  • The computer-readable storage medium may include any entity or device capable of carrying the computer-readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electric carrier signal, or a telecommunication signal.
  • FIG. 9 is a schematic diagram of a computer device in this embodiment.
  • the computer device 50 includes a processor 51, a memory 52, and computer-readable instructions 53 stored in the memory 52 and executable on the processor 51.
  • When the processor 51 executes the computer-readable instructions 53, each step of the method for distinguishing speech in the embodiment is implemented, for example steps S10, S20, and S30 shown in FIG. 2.
  • Alternatively, when the processor 51 executes the computer-readable instructions 53, the functions of the modules/units of the speech distinguishing device in the embodiment are implemented, for example the functions of the target to-be-differentiated voice data acquisition module 10, the voice feature acquisition module 20, and the target discrimination result acquisition module 30 shown in FIG. 8.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each of the units may exist separately physically, or two or more units may be integrated into one unit.
  • the above integrated unit may be implemented in the form of hardware or in the form of software functional unit.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Machine Translation (AREA)

Abstract

The invention concerns a speech differentiation method and device, a computer device, and a storage medium. The method comprises: processing original voice data to be differentiated based on a voice activity detection algorithm, and acquiring target voice data to be differentiated (S10); acquiring a corresponding ASR speech feature based on the target voice data to be differentiated (S20); and inputting the ASR speech feature into a pre-trained ASR-DNN model for differentiation and obtaining a target differentiation result (S30). The method can effectively differentiate a target speech from an interference speech and can accurately differentiate speech even when the voice data suffers severe noise interference.
PCT/CN2018/092651 2018-06-04 2018-06-25 Procédé et dispositif de différentiation vocale, dispositif d'ordinateur et support d'informations WO2019232833A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810561723.7 2018-06-04
CN201810561723.7A CN109036470B (zh) 2018-06-04 2018-06-04 语音区分方法、装置、计算机设备及存储介质

Publications (1)

Publication Number Publication Date
WO2019232833A1 true WO2019232833A1 (fr) 2019-12-12

Family

ID=64611733

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/092651 WO2019232833A1 (fr) 2018-06-04 2018-06-25 Procédé et dispositif de différentiation vocale, dispositif d'ordinateur et support d'informations

Country Status (2)

Country Link
CN (1) CN109036470B (fr)
WO (1) WO2019232833A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111582020A (zh) * 2020-03-25 2020-08-25 平安科技(深圳)有限公司 信号处理方法、装置、计算机设备及存储介质
CN113488073A (zh) * 2021-07-06 2021-10-08 浙江工业大学 一种基于多特征融合的伪造语音检测方法及装置

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109036470B (zh) * 2018-06-04 2023-04-21 平安科技(深圳)有限公司 语音区分方法、装置、计算机设备及存储介质
CN110556125B (zh) * 2019-10-15 2022-06-10 出门问问信息科技有限公司 基于语音信号的特征提取方法、设备及计算机存储介质
CN113744730B (zh) * 2021-09-13 2023-09-08 北京奕斯伟计算技术股份有限公司 声音检测方法及装置

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1912993A (zh) * 2005-08-08 2007-02-14 中国科学院声学研究所 基于能量及谐波的语音端点检测方法
CN105261357A (zh) * 2015-09-15 2016-01-20 百度在线网络技术(北京)有限公司 基于统计模型的语音端点检测方法及装置
WO2017052739A1 (fr) * 2015-09-24 2017-03-30 Google Inc. Détection d'activité vocale
CN107527630A (zh) * 2017-09-22 2017-12-29 百度在线网络技术(北京)有限公司 语音端点检测方法、装置和计算机设备
CN109036470A (zh) * 2018-06-04 2018-12-18 平安科技(深圳)有限公司 语音区分方法、装置、计算机设备及存储介质

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103065629A (zh) * 2012-11-20 2013-04-24 广东工业大学 一种仿人机器人的语音识别系统
CN103646649B (zh) * 2013-12-30 2016-04-13 中国科学院自动化研究所 一种高效的语音检测方法
CN106611604B (zh) * 2015-10-23 2020-04-14 中国科学院声学研究所 一种基于深度神经网络的自动语音叠音检测方法
CN105895078A (zh) * 2015-11-26 2016-08-24 乐视致新电子科技(天津)有限公司 动态选择语音模型的语音识别方法及装置
US9922664B2 (en) * 2016-03-28 2018-03-20 Nuance Communications, Inc. Characterizing, selecting and adapting audio and acoustic training data for automatic speech recognition systems
CN106782511A (zh) * 2016-12-22 2017-05-31 太原理工大学 修正线性深度自编码网络语音识别方法
CN107644401A (zh) * 2017-08-11 2018-01-30 西安电子科技大学 基于深度神经网络的乘性噪声去除方法

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1912993A (zh) * 2005-08-08 2007-02-14 中国科学院声学研究所 基于能量及谐波的语音端点检测方法
CN105261357A (zh) * 2015-09-15 2016-01-20 百度在线网络技术(北京)有限公司 基于统计模型的语音端点检测方法及装置
WO2017052739A1 (fr) * 2015-09-24 2017-03-30 Google Inc. Détection d'activité vocale
CN107527630A (zh) * 2017-09-22 2017-12-29 百度在线网络技术(北京)有限公司 语音端点检测方法、装置和计算机设备
CN109036470A (zh) * 2018-06-04 2018-12-18 平安科技(深圳)有限公司 语音区分方法、装置、计算机设备及存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TONG, SIBO ET AL.: "A Comparative Study of Robustness of Deep Learning Approaches for VAD", 2016 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, 25 March 2016 (2016-03-25), pages 5695 - 5699, XP032901694, DOI: 10.1109/ICASSP.2016.7472768 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111582020A (zh) * 2020-03-25 2020-08-25 平安科技(深圳)有限公司 信号处理方法、装置、计算机设备及存储介质
CN113488073A (zh) * 2021-07-06 2021-10-08 浙江工业大学 一种基于多特征融合的伪造语音检测方法及装置
CN113488073B (zh) * 2021-07-06 2023-11-24 浙江工业大学 一种基于多特征融合的伪造语音检测方法及装置

Also Published As

Publication number Publication date
CN109036470B (zh) 2023-04-21
CN109036470A (zh) 2018-12-18

Similar Documents

Publication Publication Date Title
WO2019232846A1 (fr) Procédé et appareil de différenciation vocale, dispositif informatique et support de stockage
CN110600017B (zh) 语音处理模型的训练方法、语音识别方法、系统及装置
WO2019232833A1 (fr) Procédé et dispositif de différentiation vocale, dispositif d'ordinateur et support d'informations
CN108447495B (zh) 一种基于综合特征集的深度学习语音增强方法
WO2019232829A1 (fr) Procédé et appareil de reconnaissance d'empreinte vocale, dispositif informatique et support d'enregistrement
CN110767244B (zh) 语音增强方法
WO2019232867A1 (fr) Procédé et appareil de discrimination vocale, et dispositif informatique et support de stockage
CN112735456B (zh) 一种基于dnn-clstm网络的语音增强方法
CN111292762A (zh) 一种基于深度学习的单通道语音分离方法
JP2006079079A (ja) 分散音声認識システム及びその方法
CN111899757B (zh) 针对目标说话人提取的单通道语音分离方法及系统
CN108922543B (zh) 模型库建立方法、语音识别方法、装置、设备及介质
Lee et al. Dynamic noise embedding: Noise aware training and adaptation for speech enhancement
CN111899750A (zh) 联合耳蜗语音特征和跳变深层神经网络的语音增强算法
CN114613387A (zh) 语音分离方法、装置、电子设备与存储介质
CN117310668A (zh) 融合注意力机制与深度残差收缩网络的水声目标识别方法
CN111785262A (zh) 一种基于残差网络及融合特征的说话人年龄性别分类方法
US20230186943A1 (en) Voice activity detection method and apparatus, and storage medium
CN113327589B (zh) 一种基于姿态传感器的语音活动检测方法
CN114283835A (zh) 一种适用于实际通信条件下的语音增强与检测方法
Agrawal et al. Deep variational filter learning models for speech recognition
CN114512133A (zh) 发声对象识别方法、装置、服务器及存储介质
Therese et al. A linear visual assessment tendency based clustering with power normalized cepstral coefficients for audio signal recognition system
CN110689875A (zh) 一种语种识别方法、装置及可读存储介质
Pan et al. Application of hidden Markov models in speech command recognition

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18921545

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 12.03.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18921545

Country of ref document: EP

Kind code of ref document: A1