WO2019232846A1 - Speech differentiation method and apparatus, and computer device and storage medium - Google Patents

Speech differentiation method and apparatus, and computer device and storage medium Download PDF

Info

Publication number
WO2019232846A1
WO2019232846A1 · PCT/CN2018/094190 · CN2018094190W
Authority
WO
WIPO (PCT)
Prior art keywords
speech
target
asr
voice data
data
Prior art date
Application number
PCT/CN2018/094190
Other languages
French (fr)
Chinese (zh)
Inventor
涂宏
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2019232846A1 publication Critical patent/WO2019232846A1/en

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L25/21 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being power information
    • G10L25/24 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/84 Detection of presence or absence of voice signals for discriminating voice from noise

Definitions

  • the present application relates to the field of speech processing, and in particular, to a method, a device, a computer device, and a storage medium for distinguishing speech.
  • Speech discrimination refers to silence filtering of the input speech: only the speech segments that are more meaningful for recognition (that is, the target speech) are retained.
  • Current speech discrimination methods have significant shortcomings, especially in the presence of noise: as the noise grows, speech discrimination becomes more difficult, the target speech cannot be accurately separated from the interfering speech, and the discrimination result is therefore not ideal.
  • the embodiments of the present application provide a method, an apparatus, a computer device, and a storage medium for speech discrimination, so as to solve the problem that the effect of speech discrimination is not ideal.
  • An embodiment of the present application provides a method for distinguishing speech, including:
  • processing original to-be-differentiated speech data based on a voice activity detection algorithm to obtain target to-be-differentiated speech data; acquiring corresponding ASR speech features based on the target to-be-differentiated speech data; and inputting the ASR speech features into a pre-trained ASR-RNN model for discrimination to obtain a target discrimination result.
  • An embodiment of the present application provides a voice distinguishing device, including:
  • Target to-be-differentiated voice data acquisition module for processing original to-be-differentiated voice data based on a voice activity detection algorithm, to obtain target to-be-differentiated voice data
  • a voice feature acquisition module configured to acquire a corresponding ASR voice feature based on the target to-be-differentiated voice data
  • a target discrimination result acquisition module is configured to input the ASR speech features into a pre-trained ASR-RNN model for discrimination, and obtain a target discrimination result.
  • An embodiment of the present application provides a computer device including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor.
  • When the processor executes the computer-readable instructions, the following steps are implemented:
  • processing original to-be-differentiated speech data based on a voice activity detection algorithm to obtain target to-be-differentiated speech data; acquiring corresponding ASR speech features based on the target to-be-differentiated speech data; and inputting the ASR speech features into a pre-trained ASR-RNN model for discrimination to obtain a target discrimination result.
  • This embodiment of the present application provides one or more non-volatile readable storage media storing computer-readable instructions.
  • When the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps:
  • processing original to-be-differentiated speech data based on a voice activity detection algorithm to obtain target to-be-differentiated speech data; acquiring corresponding ASR speech features based on the target to-be-differentiated speech data; and inputting the ASR speech features into a pre-trained ASR-RNN model for discrimination to obtain a target discrimination result.
  • FIG. 1 is an application environment diagram of a speech discrimination method according to an embodiment of the present application
  • FIG. 2 is a flowchart of a speech discrimination method according to an embodiment of the present application.
  • FIG. 3 is a specific flowchart of step S10 in FIG. 2;
  • FIG. 4 is a specific flowchart of step S20 in FIG. 2;
  • FIG. 5 is a specific flowchart of step S21 in FIG. 4;
  • FIG. 6 is a specific flowchart of step S24 in FIG. 4;
  • FIG. 7 is a specific flowchart of the step performed before step S30 in FIG. 2;
  • FIG. 8 is a schematic diagram of a voice distinguishing device according to an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a computer device in an embodiment of the present application.
  • FIG. 1 illustrates an application environment of a speech discrimination method provided by an embodiment of the present application.
  • The application environment of the speech discrimination method includes a server and a client, where the server and the client are connected through a network.
  • Clients are devices that can interact with users, including but not limited to computers, smartphones, and tablets.
  • the server can be implemented by an independent server or a server cluster composed of multiple servers.
  • the speech discrimination method provided in the embodiments of the present application is applied to a server.
  • FIG. 2 shows a flowchart of the voice discrimination method in this embodiment.
  • the voice discrimination method includes the following steps:
  • S10 The original to-be-distinguished speech data is processed based on the voice activity detection algorithm, and the target to-be-distinguished speech data is obtained.
  • A VAD (Voice Activity Detection) algorithm is an algorithm specifically used for voice activity detection, and such algorithms come in various types. Understandably, VAD can be applied to speech discrimination to distinguish target speech from interfering speech.
  • The target speech refers to the portion of the speech data in which the voiceprint changes significantly and continuously; the interfering speech may be a portion of the speech data with no pronunciation due to silence, or it may be environmental noise.
  • The original to-be-differentiated speech data is the to-be-differentiated speech data as originally obtained, that is, the speech data that is to undergo preliminary discrimination using a VAD algorithm.
  • The target to-be-differentiated speech data refers to the speech data obtained after the original to-be-differentiated speech data has been processed for speech discrimination using the voice activity detection algorithm.
  • The VAD algorithm is used to process the original to-be-differentiated speech data: the target speech is initially screened out of the original to-be-differentiated speech data, and the initially screened target speech portion is used as the target to-be-differentiated speech data. Understandably, the interfering speech removed in this initial screening does not need to be distinguished again, which improves the efficiency of speech discrimination. However, the target speech initially screened from the original to-be-differentiated speech data still contains interfering speech; in particular, when the original to-be-differentiated speech data is relatively noisy, more interfering speech (such as noise) remains mixed with the preliminary target speech, and at that point the VAD algorithm alone clearly cannot distinguish the speech effectively.
  • Therefore, the initially screened target speech, still mixed with interfering speech, is used as the target to-be-differentiated speech data so that it can be distinguished more accurately later.
  • Using the VAD algorithm to perform preliminary speech discrimination on the original to-be-differentiated speech data narrows down the original to-be-differentiated speech data and removes a large amount of interfering speech at the same time, which benefits the subsequent, finer speech discrimination.
  • processing the original to-be-differentiated voice data based on the voice activity detection algorithm to obtain the target to-be-differentiated voice data includes the following steps:
  • S11 Process the original to-be-distinguished speech data according to the short-time energy feature value calculation formula, obtain the corresponding short-time energy feature value, retain the original to-be-distinguished speech data whose short-time energy feature value is greater than a first threshold, and determine it as the first original distinguished speech data. The short-time energy feature value calculation formula is $E=\sum_{n=1}^{N}s^{2}(n)$, where N is the speech frame length, s(n) is the signal amplitude in the time domain, and n is the time.
  • the short-term energy characteristic value describes the energy corresponding to a frame of speech (a frame generally takes 10-30ms) in its time domain.
  • the "short-term" of this short-term energy should be understood as the time of a frame (that is, speech Frame length). Since the short-term energy feature value of the target voice is much higher than the short-term energy feature value of the interfering voice (silence), the target voice and the interfering voice can be distinguished according to the short-term energy feature value.
  • The original to-be-distinguished speech data is processed according to the short-time energy feature value calculation formula (the original to-be-distinguished speech data needs to be framed in advance), and the short-time energy feature value of each frame is calculated. The short-time energy feature value of each frame is compared with a preset first threshold; the original to-be-distinguished speech data whose value is greater than the first threshold is retained and determined as the first original distinguished speech data.
  • the first threshold is a cut-off value for measuring whether the short-term energy characteristic value belongs to the target speech or the interference speech.
  • In this way, the target speech in the original to-be-distinguished speech data can be obtained from the perspective of the short-time energy feature value, and a large amount of interfering speech is effectively removed from the original to-be-distinguished speech data.
  • S12 Process the original to-be-distinguished speech data according to the zero-crossing rate feature value calculation formula, obtain the corresponding zero-crossing rate feature value, retain the original to-be-distinguished speech data whose zero-crossing rate feature value is less than a second threshold, and determine it as the second original distinguished speech data. The zero-crossing rate feature value calculation formula is $Z=\frac{1}{2}\sum_{n=2}^{N}\left|\operatorname{sgn}(s(n))-\operatorname{sgn}(s(n-1))\right|$, where N is the speech frame length, s(n) is the signal amplitude in the time domain, and n is the time.
  • The zero-crossing rate feature value describes the number of times the speech signal waveform crosses the horizontal axis (zero level) within one frame of speech. Since the zero-crossing rate feature value of the target speech is much lower than that of the interfering speech, the target speech and the interfering speech can be distinguished according to the zero-crossing rate feature value.
  • The original to-be-distinguished speech data is processed according to the zero-crossing rate feature value calculation formula, and the zero-crossing rate feature value of each frame of the original to-be-distinguished speech data is calculated. Each value is compared with a preset second threshold; the original to-be-distinguished speech data whose value is smaller than the second threshold is retained and determined as the second original distinguished speech data.
  • The second threshold is a cut-off value for measuring whether the zero-crossing rate feature value belongs to the target speech or the interfering speech.
  • In this way, the target speech in the original to-be-distinguished speech data can be obtained from the perspective of the zero-crossing rate feature value, and a large amount of interfering speech is effectively removed from the original to-be-distinguished speech data.
  • S13 Use the first original distinguished speech data and the second original distinguished speech data as the target to-be-distinguished speech data.
  • The first original distinguished speech data is obtained by distinguishing the original to-be-distinguished speech data from the perspective of the short-time energy feature value, while the second original distinguished speech data is obtained by distinguishing the original to-be-distinguished speech data from the perspective of the zero-crossing rate feature value.
  • The first and second original distinguished speech data thus come from different perspectives of speech discrimination, and both perspectives distinguish the speech well. Therefore, the first original distinguished speech data and the second original distinguished speech data are merged (by taking their intersection) and together used as the target to-be-distinguished speech data.
  • Steps S11-S13 can initially and effectively remove most of the interfering speech data in the original to-be-distinguished speech data, retaining the original to-be-distinguished speech data in which the target speech is still mixed with a small amount of interfering speech (such as noise); this retained data is used as the target to-be-distinguished speech data, which achieves an effective preliminary speech discrimination of the original to-be-distinguished speech data.
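  • For illustration only, the following minimal sketch (in Python with NumPy) shows how frames could be retained based on short-time energy and zero-crossing rate as in steps S11-S13. The function names, frame sizes, and threshold values are hypothetical and not taken from the patent; the signal is assumed to be at least one frame long.

```python
import numpy as np

def frame_signal(s, frame_len=400, hop_len=200):
    """Split a 1-D signal into frames of frame_len samples (one frame per row)."""
    n_frames = 1 + (len(s) - frame_len) // hop_len
    return np.stack([s[i * hop_len: i * hop_len + frame_len]
                     for i in range(n_frames)])

def vad_filter(s, frame_len=400, hop_len=200,
               energy_thresh=1e-3, zcr_thresh=100):
    """Steps S11-S13: keep frames whose short-time energy exceeds the first
    threshold AND whose zero-crossing rate is below the second threshold."""
    frames = frame_signal(s, frame_len, hop_len)
    # S11: E = sum over the frame of s(n)^2
    energy = np.sum(frames ** 2, axis=1)
    # S12: Z = 0.5 * sum over the frame of |sgn(s(n)) - sgn(s(n-1))|
    zcr = 0.5 * np.sum(np.abs(np.diff(np.sign(frames), axis=1)), axis=1)
    # S13: merge the two criteria by taking their intersection
    keep = (energy > energy_thresh) & (zcr < zcr_thresh)
    return frames[keep]
```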
  • S20 Obtain the corresponding ASR voice characteristics based on the target to-be-differentiated voice data.
  • ASR (Automatic Speech Recognition) is a technology that converts speech data into computer-readable input, for example, converting speech data into keys, binary codes, or character sequences.
  • ASR can extract the speech features in the target to-be-distinguished speech data, and the extracted features are the corresponding ASR speech features.
  • ASR can convert voice data that cannot be read directly by a computer into ASR voice features that can be read by a computer, and the ASR voice features can be represented in a vector manner.
  • ASR is used to process the target to-be-differentiated voice data to obtain the corresponding ASR voice characteristics.
  • These ASR speech features can well reflect the latent characteristics of the target to-be-distinguished speech data, allow the speech data to be distinguished, and provide an important technical prerequisite for the subsequent recognition by the ASR-RNN (RNN, Recurrent Neural Networks) model based on the ASR speech features.
  • step S20 acquiring the corresponding ASR voice feature based on the target to-be-differentiated voice data includes the following steps:
  • S21 Preprocess the target to-be-differentiated voice data to obtain preprocessed voice data.
  • the target to-be-differentiated voice data is pre-processed, and corresponding pre-processed voice data is obtained.
  • Preprocessing the target to-be-distinguished speech data allows the ASR speech features of the target to-be-distinguished speech data to be extracted better, so that the extracted ASR speech features better represent the target to-be-distinguished speech data and can be used for speech discrimination.
  • pre-processing the target to-be-differentiated voice data to obtain pre-processed voice data includes the following steps:
  • S211 Perform pre-emphasis processing on the target to-be-distinguished speech data.
  • pre-emphasis is a signal processing method that compensates the high-frequency component of the input signal at the transmitting end.
  • the idea of the pre-emphasis technology is to enhance the high-frequency component of the signal at the transmitting end of the transmission line to compensate for the excessive attenuation of the high-frequency component during transmission, so that the receiving end can obtain a better signal waveform.
  • Pre-emphasis has no effect on noise, so it can effectively improve the output signal-to-noise ratio.
  • The pre-emphasis formula is $s'(n)=s(n)-a\cdot s(n-1)$, where a is the weighting coefficient and the range of a is 0.9 < a < 1.0.
  • A value of a = 0.97 gives a good pre-emphasis effect.
  • Using pre-emphasis processing can eliminate interference caused by the vocal cords and lips during utterance, effectively compensate the suppressed high-frequency part of the target to-be-distinguished speech data, highlight the high-frequency formants of the target to-be-distinguished speech data, and strengthen the signal amplitude of the target to-be-distinguished speech data, which helps to extract the ASR speech features.
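  • As an illustration, a minimal pre-emphasis sketch is shown below, assuming the formula s'(n) = s(n) - a·s(n-1) with a = 0.97; the function name is hypothetical.

```python
import numpy as np

def pre_emphasis(s, a=0.97):
    """s'(n) = s(n) - a * s(n-1); the first sample is kept unchanged."""
    return np.append(s[0], s[1:] - a * s[:-1])
```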
  • S212 Frame the target to-be-differentiated voice data after pre-emphasis.
  • Framing refers to the speech processing technology that cuts the entire voice signal into several segments.
  • the size of each frame is in the range of 10-30ms, and the frame shift is about 1/2 frame length.
  • Frame shift refers to the overlapping area between two adjacent frames, which can avoid the problem of excessive changes in adjacent two frames.
  • Frame processing is performed on the target to-be-differentiated voice data, which can divide the target to-be-differentiated voice data into several pieces of voice data, and the target to-be-differentiated voice data can be subdivided to facilitate the extraction of ASR voice features.
  • S213 Perform windowing on the framed target to-be-differentiated voice data to obtain pre-processed voice data.
  • The windowing calculation formula is $s'(n)=s(n)\cdot w(n)$, where N is the window length, n is the time, s(n) is the signal amplitude in the time domain, and s'(n) is the signal amplitude in the time domain after windowing.
  • The window function may be a Hamming window, in which case $w(n)=0.54-0.46\cos\left(\frac{2\pi n}{N-1}\right)$, $0\le n\le N-1$, where N is the Hamming window length, n is the time, s(n) is the signal amplitude in the time domain, and s'(n) is the signal amplitude in the time domain after windowing.
  • Windowing the framed target to-be-distinguished speech data to obtain the pre-processed speech data makes the time-domain signal of the target to-be-distinguished speech data continuous after framing, which helps to extract the ASR speech features of the target to-be-distinguished speech data.
  • The pre-processing operations on the target to-be-distinguished speech data in steps S211 to S213 provide a basis for extracting the ASR speech features of the target to-be-distinguished speech data; they make the extracted ASR speech features more representative of the target to-be-distinguished speech data, so that speech discrimination can be performed according to these ASR speech features.
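  • A minimal framing-and-windowing sketch follows, assuming 25 ms frames with roughly half-frame shift and a Hamming window; the sampling rate and helper names are hypothetical and not specified by the patent.

```python
import numpy as np

def frame_and_window(s, sample_rate=16000, frame_ms=25, shift_ms=12.5):
    """Cut the signal into frames and apply a Hamming window
    w(n) = 0.54 - 0.46 * cos(2*pi*n / (N - 1)) to each frame."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * shift_ms / 1000)
    window = np.hamming(frame_len)
    n_frames = 1 + (len(s) - frame_len) // hop_len
    frames = np.stack([s[i * hop_len: i * hop_len + frame_len]
                       for i in range(n_frames)])
    return frames * window  # pre-processed speech data, one frame per row
```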
  • S22 Perform a fast Fourier transform on the pre-processed speech data to obtain the frequency spectrum of the target speech data to be distinguished, and obtain the power spectrum of the target speech data to be distinguished according to the frequency spectrum.
  • FFT (Fast Fourier Transform) is a collective term for efficient and fast methods of computing the discrete Fourier transform with a computer.
  • the use of this algorithm can greatly reduce the number of multiplications required by the computer to calculate the discrete Fourier transform. In particular, the more the number of transformed sampling points, the more significant the FFT algorithm's computational savings will be.
  • fast Fourier transform is performed on the pre-processed voice data to convert the pre-processed voice data from the signal amplitude in the time domain to the signal amplitude (spectrum) in the frequency domain.
  • The formula for calculating the spectrum is $S(k)=\sum_{n=1}^{N}s(n)\,e^{-2\pi i k n / N}$, $1\le k\le N$, where N is the frame size, S(k) is the signal amplitude in the frequency domain, s(n) is the signal amplitude in the time domain, n is the time, and i is the imaginary unit.
  • the power spectrum of the pre-processed voice data can be directly obtained according to the frequency spectrum.
  • the power spectrum of the pre-processed voice data is hereinafter referred to as the power spectrum of the target voice data to be distinguished.
  • The formula for calculating the power spectrum of the target to-be-distinguished speech data is $P(k)=\frac{1}{N}\left|S(k)\right|^{2}$, $1\le k\le N$, where N is the frame size and S(k) is the signal amplitude in the frequency domain.
  • In this way, the pre-processed speech data is converted from the signal amplitude in the time domain to the signal amplitude in the frequency domain, and the power spectrum of the target to-be-distinguished speech data is then obtained from the signal amplitude in the frequency domain, providing an important technical basis for extracting the ASR speech features from the power spectrum.
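  • As a sketch (assuming an FFT length equal to the frame length N and using the one-sided spectrum; the function name is illustrative), the spectrum and power spectrum of each pre-processed frame could be computed as:

```python
import numpy as np

def power_spectrum(frames):
    """S(k) = FFT of each frame; P(k) = |S(k)|^2 / N."""
    n = frames.shape[1]
    spectrum = np.fft.rfft(frames, n=n, axis=1)   # frequency-domain amplitude
    return (np.abs(spectrum) ** 2) / n            # power spectrum, per frame
```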
  • S23 Use the Mel scale filter bank to process the power spectrum of the target speech data to be distinguished, and obtain the Mel power spectrum of the target speech data to be distinguished.
  • Processing the power spectrum of the target to-be-distinguished speech data with a Mel-scale filter bank amounts to performing a Mel frequency analysis of the power spectrum, and Mel frequency analysis is an analysis based on human auditory perception.
  • The human ear acts like a filter bank and focuses only on certain specific frequency components (human hearing is selective with respect to frequency); that is, the ear only lets signals of certain frequencies pass and simply ignores frequency components it does not wish to perceive.
  • These filters are not uniformly distributed on the frequency axis: in the low-frequency region there are many filters and they are densely distributed, whereas in the high-frequency region the number of filters becomes relatively small and their distribution is sparse. Understandably, the Mel-scale filter bank has high resolution in the low-frequency part, which is consistent with the hearing characteristics of the human ear; this is also the physical meaning of the Mel scale.
  • a Mel scale filter bank is used to process the power spectrum of the target speech data to be distinguished, and a Mel power spectrum of the target speech data to be distinguished is obtained.
  • Specifically, the frequency-domain signal is segmented by the Mel-scale filter bank so that each frequency segment corresponds to one numerical value; if the number of filters is 22, 22 energy values corresponding to the Mel power spectrum of the target to-be-distinguished speech data are obtained.
  • The Mel power spectrum obtained after this analysis retains the frequency portion closely related to the characteristics of the human ear, and this frequency portion can well reflect the characteristics of the target to-be-distinguished speech data.
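  • A minimal Mel filter-bank sketch is given below, assuming 22 triangular filters as in the example above. The Mel/Hz conversion formulas used are the standard ones and the helper names are hypothetical; this is not the patent's exact implementation.

```python
import numpy as np

def mel_filterbank(n_filters=22, n_fft=400, sample_rate=16000):
    """Build triangular filters evenly spaced on the Mel scale."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(0.0, hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fb[i - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    return fb

def mel_power_spectrum(power_spec, fb):
    """One energy value per filter, per frame: the Mel power spectrum."""
    return power_spec @ fb.T
```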
  • S24 Perform cepstrum analysis on the Mel power spectrum to obtain a Mel frequency cepstrum coefficient of the target speech data to be distinguished.
  • Cepstrum refers to the inverse Fourier transform of the logarithm of the Fourier-transform spectrum of a signal; since the general Fourier spectrum is a complex spectrum, the cepstrum is also called the complex cepstrum.
  • a cepstrum analysis is performed on the Mel power spectrum, and based on the cepstrum result, the Mel frequency cepstrum coefficient of the target speech data to be distinguished is analyzed and obtained.
  • Through cepstrum analysis of the Mel power spectrum, the features contained in the Mel power spectrum of the target to-be-distinguished speech data, whose original feature dimension is too high, can be converted directly into easy-to-use features.
  • The Mel frequency cepstrum coefficients can therefore be used as the ASR speech features, that is, coefficients that distinguish different kinds of speech.
  • the ASR voice feature can reflect the difference between voices, and can be used to identify and distinguish target to-be-differentiated voice data.
  • step S24 cepstrum analysis is performed on the Mel power spectrum to obtain the Mel frequency cepstrum coefficient of the target speech data to be distinguished, including the following steps:
  • S241 Take the logarithm (log) of the Mel power spectrum to obtain the Mel power spectrum m to be transformed.
  • S242 Perform discrete cosine transform on the Mel power spectrum to be transformed to obtain a Mel frequency cepstrum coefficient of the target speech data to be distinguished.
  • Specifically, a discrete cosine transform (DCT) is performed on the Mel power spectrum m to be transformed to obtain the corresponding Mel frequency cepstrum coefficients of the target to-be-distinguished speech data.
  • Generally, the second to thirteenth coefficients are taken and used as the ASR speech features, since they reflect the differences between speech data.
  • The discrete cosine transform performs dimensionality reduction and abstraction on the Mel power spectrum m to be transformed, and the corresponding ASR speech features are obtained. Compared with the Fourier transform, the result of the discrete cosine transform has no imaginary part, which gives it an obvious advantage in terms of computation.
  • Steps S21-S24 are based on the ASR technology to perform feature extraction on the target to-be-differentiated voice data.
  • the final obtained ASR speech feature can well reflect the target to-be-differentiated voice data.
  • The ASR speech features can then be used in deep network model training to obtain the ASR-RNN model; this makes the ASR-RNN model obtained by training more accurate when distinguishing speech, so that it can accurately distinguish noise from speech even under very noisy conditions.
  • the features extracted above are Mel frequency cepstrum coefficients.
  • However, the ASR speech features should not be limited to Mel frequency cepstrum coefficients only; any speech features obtained by ASR technology that can effectively reflect the speech data may be used as ASR speech features for recognition and model training.
  • S30 The ASR speech features are input into a pre-trained ASR-RNN model for discrimination, and a target discrimination result is obtained.
  • The ASR-RNN model refers to a recurrent neural network (RNN) model trained using ASR speech features.
  • the ASR-RNN model is trained using ASR speech features extracted from the speech data to be trained, so the model can recognize ASR speech features and distinguish speech based on ASR speech features.
  • the speech data to be trained includes target speech and noise.
  • During ASR-RNN model training, the ASR speech features of the target speech and the ASR speech features of the noise are extracted, so that the trained ASR-RNN model can recognize, based on the ASR speech features, both the target speech and the interfering speech (noise) in speech data.
  • When VAD is used to distinguish the original to-be-differentiated speech data, most of the interfering speech has already been removed, such as the silent (unvoiced) parts of the speech data and part of the noise; therefore the interfering speech distinguished by the ASR-RNN model here specifically refers to the noise part, which achieves the purpose of effectively distinguishing the target speech from the interfering speech.
  • The ASR speech features are input into the pre-trained ASR-RNN model for discrimination. Since ASR speech features reflect the characteristics of the speech data, the ASR-RNN model can recognize the ASR speech features extracted from the target to-be-distinguished speech data, so that the target to-be-distinguished speech data is accurately distinguished based on these ASR speech features.
  • This pre-trained ASR-RNN model combines ASR speech features with the recurrent neural network's ability to extract features in depth, and distinguishes speech based on the ASR speech features of the speech data; it still achieves very high accuracy under very bad noise conditions.
  • Moreover, because the features extracted by ASR also include the ASR speech features of noise, noise can likewise be distinguished accurately, whereas current speech discrimination methods (including but not limited to VAD) are strongly affected by noise.
  • In an embodiment, before step S30 of inputting the ASR speech features into a pre-trained ASR-RNN model for discrimination and obtaining a target discrimination result, the speech discrimination method further includes a step of obtaining the ASR-RNN model.
  • Specifically, the step of obtaining the ASR-RNN model includes:
  • S31 Acquire the speech data to be trained, and extract the to-be-trained ASR speech features.
  • the voice data to be trained refers to a training set of voice data required for training the ASR-RNN model.
  • The speech data to be trained may directly be an open-source speech training set, or a speech training set built by collecting a large amount of sample speech data.
  • In the speech data to be trained, the target speech and the interfering speech (here, specifically noise) are distinguished in advance; a specific way of doing so is to set different label values for the target speech and the noise. For example, all target speech parts in the speech data to be trained are marked as 1 (representing "true"), and noise parts are marked as 0 (representing "false").
  • By setting the label values in advance, the recognition accuracy of the ASR-RNN model can be tested, which provides a reference for improvement, so that the network parameters in the ASR-RNN model can be updated and the ASR-RNN model continuously optimized.
  • The ratio of target speech to noise may specifically be 1:1; adopting this ratio avoids over-fitting caused by unequal amounts of target speech and noise in the speech data to be trained.
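  • For illustration, a hypothetical helper for assembling such a balanced, labeled training set (names and the shuffling step are assumptions, not part of the patent) might look like:

```python
import numpy as np

def build_training_set(target_frames, noise_frames):
    """Label target-speech frames 1 ('true') and noise frames 0 ('false'),
    truncating both sets to a 1:1 ratio to avoid over-fitting."""
    n = min(len(target_frames), len(noise_frames))
    x = np.concatenate([target_frames[:n], noise_frames[:n]])
    y = np.concatenate([np.ones(n, dtype=int), np.zeros(n, dtype=int)])
    order = np.random.permutation(len(x))   # shuffle the combined set
    return x[order], y[order]
```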
  • overfitting refers to the phenomenon that the assumptions become too strict in order to obtain a consistent hypothesis. Avoiding overfitting is a core task in classifier design.
  • The speech data to be trained is obtained and its features are extracted; these features are the to-be-trained ASR speech features.
  • The steps for extracting the to-be-trained ASR speech features are the same as steps S21-S24 and are not repeated here.
  • The speech data to be trained includes training samples of the target speech and training samples of noise, and both parts have their own ASR speech features. Therefore, the to-be-trained ASR speech features can be extracted and used to train the corresponding ASR-RNN model, so that the ASR-RNN model obtained by training on the to-be-trained ASR speech features can accurately distinguish the target speech from the noise (noise being a kind of interfering speech).
  • the RNN model is a recurrent neural network model.
  • the RNN model includes an input layer, a hidden layer, and an output layer composed of neurons.
  • the RNN model includes the weights and biases of each neuron connection between the layers. These weights and biases determine the nature and recognition effect of the RNN model.
  • RNN is a neural network that models sequence data (such as time series), that is, the current output of a sequence is related to the previous output.
  • Specifically, the network remembers the state of the previous hidden layer and applies it to the current output calculation; that is, the nodes between hidden layers are no longer unconnected but connected, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous moment. Because speech data has temporal characteristics, the RNN model can be trained with the speech data to be trained to accurately extract the respective deep features of the target speech and the interfering speech over time, achieving accurate speech discrimination.
  • S32 Initialize the RNN model.
  • This initialization operation is to set the initial values of weights and offsets in the RNN model.
  • The initial values can be set to small values, for example within the interval [-0.3, 0.3].
  • Reasonable initialization of the RNN model gives the model more flexible adjustment capability in the early stage, so that the model can be adjusted effectively during training; otherwise, very poor adjustment capability in the initial stage would result in a trained model whose discrimination is not good.
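  • A minimal initialization sketch follows, drawing the weights U, W, V and offsets b, c uniformly from [-0.3, 0.3] as described above; the dictionary layout, dimensions, and seed are hypothetical choices.

```python
import numpy as np

def init_rnn(n_input, n_hidden, n_output, scale=0.3, seed=0):
    """Initialise weights U, W, V and offsets b, c uniformly in [-0.3, 0.3]."""
    rng = np.random.default_rng(seed)
    return {
        'U': rng.uniform(-scale, scale, (n_hidden, n_input)),   # input -> hidden
        'W': rng.uniform(-scale, scale, (n_hidden, n_hidden)),  # hidden -> hidden
        'V': rng.uniform(-scale, scale, (n_output, n_hidden)),  # hidden -> output
        'b': rng.uniform(-scale, scale, (n_hidden, 1)),
        'c': rng.uniform(-scale, scale, (n_output, 1)),
    }
```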
  • S33 Input the ASR speech features to be trained into the RNN model, and obtain the output value of the RNN model according to the forward propagation algorithm.
  • The output value is expressed as $\hat{y}_t=\sigma(Vh_t+c)$, where σ represents the activation function, V represents the weight of the connection between the hidden layer and the output layer, $h_t$ represents the hidden state at time t, and c represents the offset between the hidden layer and the output layer.
  • The process of RNN forward propagation is a series of linear and activation operations performed in the RNN model along the time series, based on the weights and offsets of each neuron in the RNN model and the input ASR speech features.
  • Because the RNN is a neural network that models sequence (specifically, time-series) data, the hidden state $h_t$ at time t must be computed jointly from the hidden state $h_{t-1}$ at time t-1 and the ASR speech feature $x_t$ input at time t: $h_t=\sigma(Ux_t+Wh_{t-1}+b)$, where U represents the weight of the connection between the input layer and the hidden layer, W represents the weight of the connections between hidden layers (the time-series connection between hidden layers), $h_{t-1}$ represents the hidden state at time t-1, and b represents the offset between the input layer and the hidden layer.
  • The output of the output layer (that is, the output value of the RNN model) is expressed as $\hat{y}_t=\sigma(Vh_t+c)$, where the activation function σ used here can be the softmax function (the softmax function works well for classification problems), V represents the weight of the connection between the hidden layer and the output layer, $h_t$ represents the hidden state at time t, and c represents the offset between the hidden layer and the output layer.
  • This output value of the RNN model (the output of the output layer), that is, the value calculated layer by layer through the forward propagation algorithm, can be called the predicted output value $\hat{y}_t$.
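  • A minimal forward-propagation sketch under these formulas is shown below, assuming a tanh hidden activation and a softmax output; the function names and the use of column vectors are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z, axis=0, keepdims=True))
    return e / np.sum(e, axis=0, keepdims=True)

def rnn_forward(params, xs):
    """Forward propagation over a sequence of ASR feature vectors xs.
    h_t = tanh(U x_t + W h_{t-1} + b);  y_hat_t = softmax(V h_t + c)."""
    U, W, V = params['U'], params['W'], params['V']
    b, c = params['b'], params['c']
    h = np.zeros((W.shape[0], 1))
    hs, ys = [], []
    for x in xs:                      # xs: list of column vectors (n_input, 1)
        h = np.tanh(U @ x + W @ h + b)
        hs.append(h)
        ys.append(softmax(V @ h + c))
    return hs, ys
```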
  • S34 Perform error back propagation based on the output value, update the weights and offsets of each layer of the RNN model, and obtain the ASR-RNN model.
  • After the server obtains the output value of the RNN model, it can update and adjust the network parameters (weights and offsets) in the RNN model according to this output value, so that the obtained model produces accurate recognition results from the differences between the ASR speech features of the target speech and those of the interfering speech, together with their behaviour over time.
  • Specifically, after the server obtains the output value (predicted output value) of the RNN model according to the forward propagation algorithm, it can use the to-be-trained ASR speech features with preset label values to calculate the error produced by the to-be-trained ASR speech features during training of the RNN model, and construct a suitable error function based on this error (for example, a logarithmic error function can be used to represent the error). The server then uses this error function for error back propagation, adjusting and updating the weights (U, W, and V) and offsets (b and c) of each layer of the RNN model.
  • The preset label value can be called the real output value (that is, it represents the objective fact: a label value of 1 represents target speech, and a label value of 0 represents interfering speech), and is denoted $y_t$.
  • At each position in the time series, the RNN model produces an error when calculating the forward output, so the error function L can be expressed as $L=\sum_{t=1}^{\tau}L_t$, where t refers to time t, τ represents the total duration, and $L_t$ represents the error generated at time t.
  • After the server obtains the error function, it can update the weights and offsets of the RNN model according to BPTT (Back Propagation Through Time) to obtain the ASR-RNN model trained on the to-be-trained ASR speech features.
  • The formula for updating the weight V is $V'=V-\eta\sum_{t=1}^{\tau}(\hat{y}_t-y_t)\,h_t^{T}$, where V represents the weight of the connection between the hidden layer and the output layer before the update, V' represents that weight after the update, η represents the learning rate, $\hat{y}_t$ represents the predicted output value, $y_t$ represents the real output value, $h_t$ represents the hidden state at time t, and T represents the matrix transposition operation.
  • The formula for updating the offset c is $c'=c-\eta\sum_{t=1}^{\tau}(\hat{y}_t-y_t)$, where c represents the offset between the hidden layer and the output layer before the update, and c' represents that offset after the update.
  • For weight U, weight W and offset b, the gradient loss at a given time t is determined jointly by the gradient loss corresponding to the output at the current position and the gradient loss at time t+1. Therefore, updating weight U, weight W and offset b requires the gradient $\delta_t$ of the hidden-layer state.
  • The gradient of the hidden-layer state at time t is $\delta_t=\partial L/\partial h_t$. Since there is a relation between $\delta_{t+1}$ and $\delta_t$, $\delta_t$ can be determined from $\delta_{t+1}$; the relation is $\delta_t=W^{T}\operatorname{diag}\!\left(1-h_{t+1}\odot h_{t+1}\right)\delta_{t+1}+V^{T}(\hat{y}_t-y_t)$, where $\delta_{t+1}$ represents the gradient of the hidden-layer state at time t+1, diag() represents a calculation function for matrix operations used to construct a diagonal matrix or to return the diagonal elements of a matrix as a vector, and $h_{t+1}$ represents the hidden state at time t+1.
  • The formula for updating the weight U is $U'=U-\eta\sum_{t=1}^{\tau}\operatorname{diag}\!\left(1-h_t\odot h_t\right)\delta_t\,x_t^{T}$, where U represents the weight of the connection between the input layer and the hidden layer before the update, U' represents that weight after the update, diag() represents the matrix-operation function that constructs a diagonal matrix or returns the diagonal elements of a matrix as a vector, $\delta_t$ represents the gradient of the hidden-layer state, and $x_t$ represents the to-be-trained ASR speech features at time t;
  • the formula for updating the weight W is $W'=W-\eta\sum_{t=1}^{\tau}\operatorname{diag}\!\left(1-h_t\odot h_t\right)\delta_t\,h_{t-1}^{T}$, where W represents the weight of the connections between the hidden layers before the update and W' represents that weight after the update;
  • the formula for updating the offset b is $b'=b-\eta\sum_{t=1}^{\tau}\operatorname{diag}\!\left(1-h_t\odot h_t\right)\delta_t$, where b represents the offset between the input layer and the hidden layer before the update, and b' represents that offset after the update.
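  • A minimal BPTT sketch following the update formulas reconstructed above is given below. It assumes a tanh hidden activation, a softmax output with a logarithmic (cross-entropy) error, and one-hot label vectors; the function names, shapes, and learning rate are hypothetical and this is not the patent's exact implementation.

```python
import numpy as np

def bptt_update(params, xs, hs, ys_hat, ys_true, lr=0.01):
    """One gradient step of Back Propagation Through Time (BPTT)."""
    U, W, V = params['U'], params['W'], params['V']
    b, c = params['b'], params['c']
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    db, dc = np.zeros_like(b), np.zeros_like(c)
    delta_next = np.zeros_like(b)        # delta_{t+1}; zero beyond the last step
    h_next = np.zeros_like(b)            # h_{t+1}
    for t in reversed(range(len(xs))):
        err = ys_hat[t] - ys_true[t]     # (y_hat_t - y_t), shape (n_output, 1)
        dV += err @ hs[t].T              # sum_t (y_hat_t - y_t) h_t^T
        dc += err
        # delta_t = W^T diag(1 - h_{t+1}^2) delta_{t+1} + V^T (y_hat_t - y_t)
        delta = W.T @ ((1 - h_next ** 2) * delta_next) + V.T @ err
        grad_pre = (1 - hs[t] ** 2) * delta          # diag(1 - h_t^2) delta_t
        h_prev = hs[t - 1] if t > 0 else np.zeros_like(hs[t])
        dU += grad_pre @ xs[t].T
        dW += grad_pre @ h_prev.T
        db += grad_pre
        delta_next, h_next = delta, hs[t]
    for name, grad in (('U', dU), ('W', dW), ('V', dV), ('b', db), ('c', dc)):
        params[name] -= lr * grad        # e.g. V' = V - eta * dL/dV
    return params
```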
  • When the error function converges, the training can be stopped; alternatively, when the training reaches the maximum number of iterations MAX, the training is stopped.
  • That is, the error between the predicted output value of the to-be-trained ASR speech features in the RNN model and the preset label value (the real output value) is used to update the weights and offsets of each layer of the RNN model, so that the finally obtained ASR-RNN model has learned deep, time-series-related features from the ASR speech features and achieves the purpose of accurately distinguishing speech.
  • Steps S31-S34 train the RNN model with the to-be-trained ASR speech features, so that the trained ASR-RNN model learns deep sequence (timing) features based on the ASR speech features and can distinguish speech effectively by combining the ASR speech features of the target speech and the interfering speech with timing factors; even under severe noise interference, the target speech and the noise can still be distinguished accurately.
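  • Putting the pieces together, a hypothetical training loop reusing the `rnn_forward` and `bptt_update` sketches above might look like the following; `params`, `xs_train`, and `ys_train` are assumed to have been prepared as in the earlier sketches, and the iteration limit and tolerance are illustrative values only.

```python
import numpy as np

MAX_ITERS = 100        # maximum number of iterations MAX (hypothetical value)
TOLERANCE = 1e-4       # convergence tolerance on the error (assumption)

prev_loss = np.inf
for it in range(MAX_ITERS):
    hs, ys_hat = rnn_forward(params, xs_train)
    # L = sum_t L_t, here a cross-entropy error per time step
    loss = -sum(float(y.T @ np.log(y_hat + 1e-10))
                for y, y_hat in zip(ys_train, ys_hat))
    params = bptt_update(params, xs_train, hs, ys_hat, ys_train, lr=0.01)
    if abs(prev_loss - loss) < TOLERANCE:
        break              # error has converged, stop training
    prev_loss = loss
```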
  • the original voice data to be differentiated is processed based on a voice activity detection algorithm (VAD), and the target voice data to be distinguished is obtained.
  • The original to-be-distinguished speech data is first distinguished by the voice activity detection algorithm, yielding target to-be-distinguished speech data of smaller scope; this initially and effectively removes interfering speech data from the original to-be-distinguished speech data, while retaining the portion in which the target speech is still mixed with interfering speech. Using this retained portion as the target to-be-distinguished speech data provides an effective preliminary speech discrimination of the original to-be-distinguished speech data and removes a large amount of interfering speech.
  • Then, the corresponding ASR speech features are obtained based on the target to-be-distinguished speech data. These ASR speech features make the speech discrimination result more accurate: even under noisy conditions, interfering speech (such as noise) can be distinguished accurately from the target speech, and the features provide an important technical prerequisite for the subsequent recognition by the ASR-RNN model based on the ASR speech features.
  • the ASR speech features are input into a pre-trained ASR-RNN model to distinguish them and obtain the target discrimination result.
  • The ASR-RNN model is a recognition model specially trained, according to the ASR speech features extracted from the speech data to be trained and the timing characteristics of speech, to distinguish speech effectively. It can therefore correctly separate the target speech from the interfering speech in the target to-be-distinguished speech data, in which target speech and interfering speech are mixed (because VAD has already been used to distinguish the data once, the interfering speech here mostly refers to noise), improving the accuracy of speech discrimination.
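  • For completeness, a hypothetical inference step using the `rnn_forward` sketch above is shown below: each frame's ASR feature vector is classified, and frames whose predicted label is 1 are treated as target speech. The function name and the argmax decision rule are assumptions, not taken from the patent.

```python
import numpy as np

def distinguish(params, feature_frames):
    """Return a 0/1 label per frame: 1 = target speech, 0 = interfering speech."""
    xs = [f.reshape(-1, 1) for f in feature_frames]
    _, ys_hat = rnn_forward(params, xs)
    return [int(np.argmax(y)) for y in ys_hat]
```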
  • FIG. 8 shows a principle block diagram of a voice distinguishing device corresponding to the voice distinguishing method in the embodiment.
  • the voice discrimination device includes a target to-be-differentiated voice data acquisition module 10, a voice feature acquisition module 20, and a target discrimination result acquisition module 30.
  • The implementation functions of the target to-be-differentiated voice data acquisition module 10, the voice feature acquisition module 20, and the target discrimination result acquisition module 30 correspond one-to-one to the steps of the voice discrimination method in the embodiment; to avoid redundancy, they are not elaborated one by one in this embodiment.
  • the target to-be-differentiated voice data acquisition module 10 is configured to process the original to-be-differentiated voice data based on a voice activity detection algorithm to obtain the target to-be-differentiated voice data.
  • the voice feature obtaining module 20 is configured to obtain a corresponding ASR voice feature based on the target to-be-differentiated voice data.
  • the target discrimination result acquisition module 30 is configured to input ASR speech features into a pre-trained ASR-RNN model for discrimination, and obtain a target discrimination result.
  • The target to-be-differentiated voice data acquisition module 10 includes a first original distinguished speech data acquisition unit 11, a second original distinguished speech data acquisition unit 12, and a target to-be-differentiated voice data acquisition unit 13.
  • The first original distinguished speech data acquisition unit 11 is configured to process the original to-be-distinguished speech data according to the short-time energy feature value calculation formula, obtain the corresponding short-time energy feature value, retain the original to-be-distinguished speech data whose short-time energy feature value is greater than the first threshold, and determine it as the first original distinguished speech data. The short-time energy feature value calculation formula is $E=\sum_{n=1}^{N}s^{2}(n)$, where N is the speech frame length, s(n) is the signal amplitude in the time domain, and n is the time.
  • The second original distinguished speech data acquisition unit 12 is configured to process the original to-be-distinguished speech data according to the zero-crossing rate feature value calculation formula, obtain the corresponding zero-crossing rate feature value, retain the original to-be-distinguished speech data whose zero-crossing rate feature value is less than the second threshold, and determine it as the second original distinguished speech data. The zero-crossing rate feature value calculation formula is $Z=\frac{1}{2}\sum_{n=2}^{N}\left|\operatorname{sgn}(s(n))-\operatorname{sgn}(s(n-1))\right|$, where N is the speech frame length, s(n) is the signal amplitude in the time domain, and n is the time.
  • The target to-be-differentiated voice data acquisition unit 13 is configured to use the first original distinguished speech data and the second original distinguished speech data as the target to-be-differentiated voice data.
  • the speech feature acquisition module 20 includes a pre-processed speech data acquisition unit 21, a power spectrum acquisition unit 22, a Mel power spectrum acquisition unit 23, and a Mel frequency cepstrum coefficient unit 24.
  • The pre-processed speech data acquisition unit 21 is configured to pre-process the target to-be-differentiated voice data to obtain pre-processed voice data.
  • the power spectrum obtaining unit 22 is configured to perform a fast Fourier transform on the pre-processed speech data, obtain a frequency spectrum of the target speech data to be distinguished, and obtain a power spectrum of the target speech data to be distinguished according to the frequency spectrum.
  • the Mel power spectrum acquisition unit 23 is configured to process a power spectrum of the target speech data to be distinguished by using a Mel scale filter bank, and obtain a Mel power spectrum of the target speech data to be distinguished.
  • the Mel frequency cepstrum coefficient unit 24 is configured to perform cepstrum analysis on the Mel power spectrum to obtain a Mel frequency cepstrum coefficient of the target speech data to be distinguished.
  • The pre-processed speech data acquisition unit 21 includes a pre-emphasis sub-unit 211, a frame sub-unit 212, and a windowing sub-unit 213.
  • the pre-emphasis sub-unit 211 is configured to perform pre-emphasis processing on target voice data to be distinguished.
  • The frame sub-unit 212 is configured to perform framing on the pre-emphasized target to-be-differentiated voice data.
  • a windowing sub-unit 213 is configured to perform windowing on the framed target to-be-differentiated speech data to obtain pre-processed speech data.
  • The windowing calculation formula is $s'(n)=s(n)\cdot w(n)$, with $w(n)=0.54-0.46\cos\left(\frac{2\pi n}{N-1}\right)$ for a Hamming window, where N is the window length, n is the time, s(n) is the signal amplitude in the time domain, and s'(n) is the signal amplitude in the time domain after windowing.
  • The Mel frequency cepstrum coefficient unit 24 includes a to-be-transformed Mel power spectrum acquisition sub-unit 241 and a Mel frequency cepstrum coefficient sub-unit 242.
  • The to-be-transformed Mel power spectrum acquisition sub-unit 241 is configured to take the logarithm of the Mel power spectrum to obtain the to-be-transformed Mel power spectrum.
  • the Mel frequency cepstrum coefficient sub-unit 242 is configured to perform a discrete cosine transform of the Mel power spectrum to be transformed to obtain a Mel frequency cepstrum coefficient of the target speech data to be distinguished.
  • the speech discrimination device further includes an ASR-RNN model acquisition module 40.
  • The ASR-RNN model acquisition module 40 includes a to-be-trained ASR speech feature acquisition unit 41, an initialization unit 42, an output value acquisition unit 43, and an update unit 44.
  • The to-be-trained ASR speech feature acquisition unit 41 is configured to acquire the speech data to be trained and extract the to-be-trained ASR speech features.
  • the initialization unit 42 is configured to initialize an RNN model.
  • The output value acquisition unit 43 is configured to input the to-be-trained ASR speech features into the RNN model and obtain the output value of the RNN model according to the forward propagation algorithm; the output value is expressed as $\hat{y}_t=\sigma(Vh_t+c)$, where σ represents the activation function, V represents the weight of the connection between the hidden layer and the output layer, $h_t$ represents the hidden state at time t, and c represents the offset between the hidden layer and the output layer.
  • the updating unit 44 is configured to perform error back propagation based on the output value, update the weights and offsets of each layer of the RNN model, and obtain an ASR-RNN model.
  • The formula for updating the weight V is $V'=V-\eta\sum_{t=1}^{\tau}(\hat{y}_t-y_t)\,h_t^{T}$, where V represents the weight of the connection between the hidden layer and the output layer before the update, V' represents that weight after the update, η represents the learning rate, t represents time t, τ represents the total duration, $\hat{y}_t$ represents the predicted output value, $y_t$ represents the real output value, $h_t$ represents the hidden state at time t, and T represents the matrix transposition operation;
  • the formula for updating the offset c is $c'=c-\eta\sum_{t=1}^{\tau}(\hat{y}_t-y_t)$, where c represents the offset between the hidden layer and the output layer before the update and c' represents that offset after the update;
  • the formula for updating the weight U is $U'=U-\eta\sum_{t=1}^{\tau}\operatorname{diag}\!\left(1-h_t\odot h_t\right)\delta_t\,x_t^{T}$, where U represents the weight of the connection between the input layer and the hidden layer before the update and U' represents the weight of the connection between the input layer and the hidden layer after the update.
  • This embodiment provides one or more non-volatile readable storage media storing computer-readable instructions.
  • When the computer-readable instructions are executed by one or more processors, the one or more processors implement the speech distinguishing method in the embodiment; to avoid repetition, details are not described here again.
  • Alternatively, when the computer-readable instructions are executed by one or more processors, the one or more processors implement the functions of each module/unit in the speech distinguishing device in the embodiment; to avoid repetition, details are not described here again.
  • The computer-readable storage medium may include: any entity or device capable of carrying the computer-readable instructions, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electric carrier signal, a telecommunication signal, and so on.
  • FIG. 9 is a schematic diagram of a computer device in this embodiment.
  • the computer device 50 includes a processor 51, a memory 52, and computer-readable instructions 53 stored in the memory 52 and executable on the processor 51.
  • the processor 51 executes the computer-readable instructions 53
  • each step of the method for distinguishing speech in the embodiment is implemented, for example steps S10, S20, and S30 shown in FIG. 2.
  • Alternatively, when the processor 51 executes the computer-readable instructions 53, the functions of the modules/units of the voice distinguishing device in the embodiment are realized, for example the functions of the target to-be-differentiated voice data acquisition module 10, the voice feature acquisition module 20, the target discrimination result acquisition module 30, and the ASR-RNN model acquisition module 40 shown in FIG. 8.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each of the units may exist separately physically, or two or more units may be integrated into one unit.
  • the above integrated unit may be implemented in the form of hardware or in the form of software functional unit.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Machine Translation (AREA)

Abstract

Provided are a speech differentiation method and apparatus, and a computer device and a storage medium. The speech differentiation method comprises: processing, based on a speech activity detection algorithm, original speech data to be differentiated, and acquiring target speech data to be differentiated (S10); acquiring a corresponding ASR speech feature based on the target speech data to be differentiated (S20); and inputting the ASR speech feature into a pre-trained ASR-RNN model for differentiation, and acquiring a target differentiation result (S30). By means of the speech differentiation method, target speech can be differentiated well from interference speech, and the speech can still be accurately differentiated where noise interference in speech data is strong.

Description

语音区分方法、装置、计算机设备及存储介质Speech distinguishing method, device, computer equipment and storage medium
本申请以2018年6月4日提交的申请号为201810561788.1,名称为“语音区分方法、装置、计算机设备及存储介质”的中国专利申请为基础,并要求其优先权。This application is based on a Chinese patent application filed on June 4, 2018 with the application number 201810561788.1, entitled "Voice distinguishing method, device, computer equipment and storage medium", and claims its priority.
技术领域Technical field
本申请涉及语音处理领域,尤其涉及一种语音区分方法、装置、计算机设备及存储介质。The present application relates to the field of speech processing, and in particular, to a method, a device, a computer device, and a storage medium for distinguishing speech.
背景技术Background technique
语音区分是指对输入的语音进行静音筛选,仅保留对识别更有意义的语音段(即目标语音)。目前的语音区分方法存在很大的不足,尤其在噪音存在的情况下,随着噪音的变大,进行语音区分的难度就越大,无法准确区分出目标语音和干扰语音,导致语音区分的效果不理想。Speech discrimination refers to mute filtering of the input speech, and only retain the speech segments (that is, the target speech) that are more meaningful for recognition. The current methods of speech discrimination have great shortcomings, especially in the presence of noise, as the noise becomes larger, the difficulty of speech discrimination becomes more difficult, and the target speech and the interference speech cannot be accurately distinguished, resulting in the effect of speech discrimination. not ideal.
发明内容Summary of the Invention
本申请实施例提供一种语音区分方法、装置、计算机设备及存储介质,以解决在进行语音区分效果不理想的问题。The embodiments of the present application provide a method, an apparatus, a computer device, and a storage medium for speech discrimination, so as to solve the problem that the effect of speech discrimination is not ideal.
本申请实施例提供一种语音区分方法,包括:An embodiment of the present application provides a method for distinguishing speech, including:
基于语音活动检测算法处理原始待区分语音数据,获取目标待区分语音数据;Processing the original to-be-differentiated voice data based on the voice activity detection algorithm to obtain the target to-be-differentiated voice data;
基于所述目标待区分语音数据,获取相对应的ASR语音特征;Obtaining corresponding ASR voice characteristics based on the target to-be-differentiated voice data;
将所述ASR语音特征输入到预先训练好的ASR-RNN模型中进行区分,获取目标区分结果。The ASR speech features are input into a pre-trained ASR-RNN model for discrimination, and a target discrimination result is obtained.
本申请实施例提供一种语音区分装置,包括:An embodiment of the present application provides a voice distinguishing device, including:
目标待区分语音数据获取模块,用于基于语音活动检测算法处理原始待区分语音数据,获取目标待区分语音数据;Target to-be-differentiated voice data acquisition module, for processing original to-be-differentiated voice data based on a voice activity detection algorithm, to obtain target to-be-differentiated voice data;
语音特征获取模块,用于基于所述目标待区分语音数据,获取相对应的ASR语音特征;A voice feature acquisition module, configured to acquire a corresponding ASR voice feature based on the target to-be-differentiated voice data;
目标区分结果获取模块,用于将所述ASR语音特征输入到预先训练好的ASR-RNN模型中进行区分,获取目标区分结果。A target discrimination result acquisition module is configured to input the ASR speech features into a pre-trained ASR-RNN model for discrimination, and obtain a target discrimination result.
本申请实施例提供一种计算机设备,包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机可读指令,所述处理器执行所述计算机可读指令时实现如下步骤:An embodiment of the present application provides a computer device including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. The processor implements the computer-readable instructions to implement The following steps:
基于语音活动检测算法处理原始待区分语音数据,获取目标待区分语音数据;Processing the original to-be-differentiated voice data based on the voice activity detection algorithm to obtain the target to-be-differentiated voice data;
基于所述目标待区分语音数据,获取相对应的ASR语音特征;Obtaining corresponding ASR voice characteristics based on the target to-be-differentiated voice data;
将所述ASR语音特征输入到预先训练好的ASR-RNN模型中进行区分,获取目标区分结果。The ASR speech features are input into a pre-trained ASR-RNN model for discrimination, and a target discrimination result is obtained.
本申请实施例提供一个或多个存储有计算机可读指令的非易失性可读存储介质,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器执行如下步骤:This embodiment of the present application provides one or more non-volatile readable storage media storing computer-readable instructions. When the computer-readable instructions are executed by one or more processors, the one or more processors are Perform the following steps:
基于语音活动检测算法处理原始待区分语音数据,获取目标待区分语音数据;Processing the original to-be-differentiated voice data based on the voice activity detection algorithm to obtain the target to-be-differentiated voice data;
基于所述目标待区分语音数据,获取相对应的ASR语音特征;Obtaining corresponding ASR voice characteristics based on the target to-be-differentiated voice data;
将所述ASR语音特征输入到预先训练好的ASR-RNN模型中进行区分,获取目标区分结果。The ASR speech features are input into a pre-trained ASR-RNN model for discrimination, and a target discrimination result is obtained.
本申请的一个或多个实施例的细节在下面的附图和描述中提出,本申请的其他特征和优点将从说明书、附图以及权利要求变得明显。Details of one or more embodiments of the present application are set forth in the accompanying drawings and description below, and other features and advantages of the present application will become apparent from the description, the drawings, and the claims.
附图说明BRIEF DESCRIPTION OF THE DRAWINGS
为了更清楚地说明本申请实施例的技术方案,下面将对本申请实施例的描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动性的前提下,还可以根据这些附图获得其他的附图。In order to explain the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments of the application will be briefly introduced below. Obviously, the drawings in the following description are just some embodiments of the application. For those of ordinary skill in the art, other drawings can be obtained based on these drawings without paying creative labor.
图1是本申请一实施例中语音区分方法的一应用环境图;FIG. 1 is an application environment diagram of a speech discrimination method according to an embodiment of the present application; FIG.
图2是本申请一实施例中语音区分方法的一流程图;FIG. 2 is a flowchart of a speech discrimination method according to an embodiment of the present application; FIG.
图3是图2中步骤S10的一具体流程图;FIG. 3 is a specific flowchart of step S10 in FIG. 2;
图4是图2中步骤S20的一具体流程图;FIG. 4 is a specific flowchart of step S20 in FIG. 2;
图5是图4中步骤S21的一具体流程图;5 is a specific flowchart of step S21 in FIG. 4;
图6是图4中步骤S24的一具体流程图;6 is a specific flowchart of step S24 in FIG. 4;
图7是图2中步骤S30之前的一具体流程图;FIG. 7 is a specific flowchart before step S30 in FIG. 2; FIG.
图8是本申请一实施例中语音区分装置的一示意图;8 is a schematic diagram of a voice distinguishing device according to an embodiment of the present application;
图9是本申请一实施例中计算机设备的一示意图。FIG. 9 is a schematic diagram of a computer device in an embodiment of the present application.
具体实施方式Detailed ways
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。In the following, the technical solutions in the embodiments of the present application will be clearly and completely described with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are part of the embodiments of the present application, but not all of the embodiments. Based on the embodiments in this application, all other embodiments obtained by a person of ordinary skill in the art without creative efforts shall fall within the protection scope of this application.
图1示出本申请实施例提供的语音区分方法的应用环境。该语音识别方法的应用环境包括服务端和客户端,其中,服务端和客户端之间通过网络进行连接。客户端是可与用户进行人机交互的设备,包括但不限于电脑、智能手机和平板等设备。服务端具体可以用独立的服务器或者多个服务器组成的服务器集群实现。本申请实施例提供的语音区分方法应用于服务端。FIG. 1 illustrates an application environment of a speech discrimination method provided by an embodiment of the present application. The application environment of the speech recognition method includes a server and a client, wherein the server and the client are connected through a network. Clients are devices that can interact with users, including but not limited to computers, smartphones, and tablets. The server can be implemented by an independent server or a server cluster composed of multiple servers. The speech discrimination method provided in the embodiments of the present application is applied to a server.
如图2所示,图2示出本实施例中语音区分方法的一流程图,该语音区分方法包括如下步骤:As shown in FIG. 2, FIG. 2 shows a flowchart of the voice discrimination method in this embodiment. The voice discrimination method includes the following steps:
S10:基于语音活动检测算法处理原始待区分语音数据,获取目标待区分语音数据。S10: The original speech data to be distinguished is processed based on the speech activity detection algorithm, and the target speech data to be distinguished is obtained.
其中,语音活动检测(Voice Activity Detection,以下简称VAD),目的是从声音信号流里识别和消除长时间的静音期,以达到在不降低业务质量的情况下节省话路资源的作用,可以节省宝贵的带宽资源,降低端到端的时延,提升用户体验。语音活动检测算法(VAD算法)即语音活动检测时具体采用的算法,该算法可以有多种。可以理解地,VAD可以应用在语音区分,能够区分目标语音和干扰语音。目标语音是指语音数据中声纹连续变化明显的语音部分,干扰语音可以是语音数据中由于静默而没有发音的语音部分,也可以是环境噪音。原始待区分语音数据是最原始获取到的待区分语音数据,该原始待区分语音数据是指待采用VAD算法进行初步区分处理的语音数据。目标待区分语音数据是指通过语音活动检测算法对原始待区分语音数据进行处理后,获取的用于进行语音区分的语音数据。Among them, Voice Activity Detection (hereinafter referred to as VAD) is to identify and eliminate the long silence period from the sound signal stream, so as to save the voice channel resources without reducing the service quality, which can save Precious bandwidth resources reduce end-to-end delay and improve user experience. The voice activity detection algorithm (VAD algorithm) is an algorithm specifically used in the voice activity detection, and the algorithm may have various types. Understandably, VAD can be applied to speech discrimination, and can distinguish target speech and interference speech. The target voice refers to the voice part in which the voiceprint continuously changes significantly in the voice data, and the interference voice may be a voice part in the voice data that is not pronounced due to silence, or it may be environmental noise. The original to-be-differentiated voice data is the most originally obtained to-be-differentiated voice data, and the original to-be-differentiated voice data refers to voice data to be subjected to preliminary distinguishing processing using a VAD algorithm. The target to-be-differentiated speech data refers to the speech data obtained by using the voice activity detection algorithm to process the original to-be-differentiated speech data for speech discrimination.
本实施例中,采用VAD算法对原始待区分语音数据进行处理,从原始待区分语音数据中初步筛选出目标语音和干扰语音,并将初步筛选出的目标语音部分作为目标待区分语音数据。可以理解地,对于初步筛选出的干扰语音不必再进行区分,以提高语音区分的效率。而从原始待区分语音数据中初步筛选出的目标语音仍然存在干扰语音的内容,尤其当原始待区分语音数据的噪音比较大时,初步筛选出的目标语音混杂的干扰语音(如噪音)就越多,显然此时采用VAD算法是无法有效区分语音的,因此应将初步筛选出的混杂着干扰语音的目标语音作为目标待区分语音数据,以对初步筛选出的目标语音进行更精确的区分。通过采用VAD算法对原始待区分语音数据进行初步语音区分,可以根据初步筛选的原始待区分语音数据进行再区分,同时去除大量的干扰语音,有利于后续进一步的语音区分。In this embodiment, the VAD algorithm is used to process the original to-be-differentiated voice data, and the target to-be-differentiated voice is initially selected from the original to-be-differentiated voice data, and the initially-selected target voice portion is used as the target to-be-differentiated voice data. Understandably, it is not necessary to distinguish the interfering voices that are initially screened to improve the efficiency of voice discrimination. However, the target voice initially screened from the original to-be-differentiated voice data still contains the content of interfering speech, especially when the original voice data to be distinguished is relatively noisy, the interfering voices (such as noise) mixed with the preliminary target voice are more mixed. It is obvious that it is impossible to effectively distinguish the speech by using the VAD algorithm at this time. Therefore, the target voice that is preliminarily screened with interfering voices should be used as the target to-be-differentiated voice data in order to more accurately distinguish the target voice that is initially screened. By using the VAD algorithm to perform preliminary speech discrimination on the original to-be-differentiated voice data, it is possible to re-differentiate the original to-be-differentiated voice data and to remove a large amount of interfering speech at the same time, which is beneficial to subsequent further speech discrimination.
在一具体实施方式中,如图3所示,步骤S10中,基于语音活动检测算法处理原始待区分语音数据,获取目标待区分语音数据,包括如下步骤:In a specific implementation, as shown in FIG. 3, in step S10, processing the original to-be-differentiated voice data based on the voice activity detection algorithm to obtain the target to-be-differentiated voice data includes the following steps:
S11:根据短时能量特征值计算公式对原始待区分语音数据进行处理,获取对应的短时能量特征值,将短时能量特征值大于第一阈值的原始待区分数据保留,确定为第一原始区分语音数据,其中,短时能量特征值计算公式为
E = \sum_{n=1}^{N} s(n)^{2}
N为语音帧长,s(n)为时域上的信号幅度,n为时间。
S11: Process the original speech data to be distinguished according to the short-term energy characteristic value calculation formula, obtain the corresponding short-term energy characteristic value, and retain the original to-be-differentiated data whose short-term energy characteristic value is greater than the first threshold, and determine it as the first original Differentiate speech data, where the short-term energy eigenvalue calculation formula is
E = \sum_{n=1}^{N} s(n)^{2}
N is the speech frame length, s (n) is the signal amplitude in the time domain, and n is the time.
其中,短时能量特征值描述的是一帧语音(一帧一般取10-30ms)在其时域上对应的能量,该短时能量的“短时”应理解为一帧的时间(即语音帧长)。由于目标语音的短时能量特征值,相比于干扰语音(静音)的短时能量特征值会高出很多,因此可以根据该短时能量特征值来区分目标语音和干扰语音。Among them, the short-term energy characteristic value describes the energy corresponding to a frame of speech (a frame generally takes 10-30ms) in its time domain. The "short-term" of this short-term energy should be understood as the time of a frame (that is, speech Frame length). Since the short-term energy feature value of the target voice is much higher than the short-term energy feature value of the interfering voice (silence), the target voice and the interfering voice can be distinguished according to the short-term energy feature value.
本实施例中,根据短时能量特征值计算公式处理原始待区分语音数据(需要预先对原始待区分语音数据作分帧的处理),计算并获取原始待区分语音数据各帧的短时能量特征值,将各帧的短时能量特征值与预先设置的第一阈值进行比较,将大于第一阈值的原始待区分语音数据保留,并确定为第一原始区分语音数据。该第一阈值是用于衡量短时能量特征值是属于目标语音还是干扰语音的分界值。本实施例中,根据短时能量特征值和第一阈值的比较结果,可以从短时能量特征值的角度初步区分得到原始待区分语音数据中的目标语音,并有效去除原始待区分语音数据中大量的干扰语音。In this embodiment, the original speech data to be distinguished is processed according to the short-term energy feature value calculation formula (the original speech data to be distinguished needs to be framed in advance), and the short-term energy characteristics of each frame of the original speech data to be distinguished are calculated and obtained. Value, comparing the short-term energy characteristic value of each frame with a preset first threshold value, retaining the original to-be-differentiated voice data that is greater than the first threshold, and determining it as the first original distinguishing voice data. The first threshold is a cut-off value for measuring whether the short-term energy characteristic value belongs to the target speech or the interference speech. In this embodiment, according to the comparison result of the short-term energy feature value and the first threshold value, the target voice in the original to-be-differentiated voice data can be obtained from the perspective of the short-term energy feature value, and the original to-be-differentiated voice data is effectively removed A lot of disturbing speech.
S12:根据过零率特征值计算公式对原始待区分语音数据进行处理,获取对应的过零率特征值,将过零率特征值小于第二阈值的原始待区分语音数据保留,确定为第二原始区分语音数据,其中,过零率特征值计算公式为
Z = \frac{1}{2} \sum_{n=2}^{N} \left| \operatorname{sgn}\!\left(s(n)\right) - \operatorname{sgn}\!\left(s(n-1)\right) \right|
N为语音帧长,s(n)为时域上的信号幅度n为时间。
S12: Process the original to-be-differentiated voice data according to the calculation formula of the zero-crossing rate feature value, obtain the corresponding zero-crossing rate feature value, and retain the original to-be-differentiated voice data with the zero-cross rate feature value less than the second threshold, and determine as the second The original distinguished speech data, where the zero-crossing rate eigenvalue calculation formula is
Z = \frac{1}{2} \sum_{n=2}^{N} \left| \operatorname{sgn}\!\left(s(n)\right) - \operatorname{sgn}\!\left(s(n-1)\right) \right|
N is the speech frame length, s(n) is the signal amplitude in the time domain, and n is time.
其中,过零率特征值是描述一帧语音中语音信号波形穿过横轴(零电平)的次数。由于目标语音的过零率特征值,相比于干扰语音的过零率特征值会低很多,因此可以根据该短时能量特征值来区分目标语音和干扰语音。Among them, the zero-crossing rate characteristic value describes the number of times a voice signal waveform passes through the horizontal axis (zero level) in a frame of speech. Since the feature value of the zero-crossing rate of the target voice is much lower than the feature value of the zero-crossing rate of the interfering voice, the target voice and the interfering voice can be distinguished according to the short-term energy feature value.
本实施例中,根据过零率特征值计算公式处理原始待区分语音数据,计算并获取原始待区分语音数据各帧的过零率特征值,将各帧的过零率特征值与预先设置的第二阈值进行比较,将小于第二阈值的原始待区分语音数据保留,并确定为第二原始区分语音数据。该第二阈值是用于衡量短时能量特征值是属于目标语音还是干扰语音的分界值。本实施例中,根据过零率特征值和第二阈值的比较结果,可以从过零率特征值的角度初步区分得到原始待区分语音数据中的目标语音,并有效去除原始待区分语音数据中大量的干扰语音。In this embodiment, the original to-be-differentiated speech data is processed according to the zero-crossing rate feature value calculation formula, and the zero-crossing rate feature value of each frame of the original to-be-differentiated voice data is calculated and obtained. The second threshold value is compared, and the original to-be-differentiated voice data smaller than the second threshold value is retained and determined as the second original distinguished voice data. The second threshold is a cutoff value for measuring whether the short-term energy characteristic value belongs to the target speech or the interference speech. In this embodiment, according to the comparison result of the zero-crossing rate feature value and the second threshold value, the target voice in the original to-be-differentiated voice data can be obtained from the perspective of the zero-crossing rate feature value, and the original to-be-differentiated voice data can be effectively removed. A lot of disturbing speech.
S13:将第一原始区分语音数据和第二原始区分语音数据作为目标待区分语音数据。S13: Use the first original distinguished speech data and the second original distinguished speech data as target to-be-separated speech data.
本实施例中,第一原始区分语音数据是根据短时能量特征值的角度从原始待区分语音数据中区分并获取的,第二原始区分语音数据是根据过零率特征值的角度从原始待区分语音数据中区分并获取的。第一原始区分语音数据和第二原始区分语音数据分别从区分语音的不同角度出发,这两个角度都能够很好地区分语音,因此将第一原始区分语音数据和第二原始区分语音数据合并(以取交集的方式合并)在一起,作为目标待区分语音数据。In this embodiment, the first original distinguished speech data is distinguished and obtained from the original to-be-differentiated speech data according to the angle of the short-term energy feature value, and the second original distinguished speech data is from the original to-be-distant speech data according to the angle of the zero-crossing rate feature value. Distinguish and acquire in speech data. The first original distinguished speech data and the second original distinguished speech data are from different perspectives of distinguishing speech. Both of these angles can distinguish the speech well. Therefore, the first original distinguished speech data and the second original distinguished speech data are merged. (Merged in the manner of taking intersections) together as the target speech data to be distinguished.
步骤S11-S13可以初步有效地去除原始待区分语音数据中大部分的干扰语音数据,保留混杂着目标语音和少部分干扰语音(如噪音)的原始待区分语音数据,并将该原始待区分语音数据作为目标待区分语音数据,能够对原始待区分语音数据作有效的初步语音区分。Steps S11-S13 can initially and effectively remove most of the interfering voice data in the original to-be-differentiated voice data, retain the original to-be-differentiated voice data that is mixed with the target voice and a small part of the interfering voice (such as noise), and the original to-be-differentiated voice data The data is used as the target to-be-differentiated speech data, which can make effective preliminary speech discrimination on the original to-be-differentiated speech data.
S20:基于目标待区分语音数据,获取相对应的ASR语音特征。S20: Obtain the corresponding ASR voice characteristics based on the target to-be-differentiated voice data.
其中,ASR(Automatic Speech Recognition,自动语音识别技术)是将语音数据转换为计算机可读输入的技术,例如将语音数据转化为按键、二进制编码或者字符序列等形式。通过ASR可以提取目标待区分语音数据中的语音特征,提取到的语音即为与其相对应的ASR语音特征。可以理解地,ASR能够将原本计算机无法直接读取的语音数据转换为计算机能够读取的ASR语音特征,该ASR语音特征可以采用向量的方式表示。Among them, ASR (Automatic Speech Recognition) is a technology that converts speech data into computer-readable input, for example, converts speech data into keys, binary codes, or character sequences. The ASR can extract the voice features in the target to-be-differentiated voice data, and the extracted voice is the corresponding ASR voice feature. Understandably, ASR can convert voice data that cannot be read directly by a computer into ASR voice features that can be read by a computer, and the ASR voice features can be represented in a vector manner.
本实施例中,采用ASR对目标待区分语音数据进行处理,获取相对应的ASR语音特征,该ASR语音特征可以很好地反映目标待区分语音数据的潜在特征,可以根据ASR语音特征对目标待区分语音数据进行区分,为后续根据该ASR语音特征进行相应的ASR-RNN(RNN,Recurrent neural networks,循环神经网络)模型识别提供重要的技术前提。In this embodiment, ASR is used to process the target to-be-differentiated voice data to obtain the corresponding ASR voice characteristics. This ASR voice feature can well reflect the potential characteristics of the target to-be-differentiated voice data. Differentiate speech data to distinguish, and provide important technical prerequisites for subsequent ASR-RNN (RNN, Recurrent Neural Networks) model recognition based on the ASR speech characteristics.
在一具体实施方式中,如图4所示,步骤S20中,基于目标待区分语音数据,获取相对应的ASR语音特征,包括如下步骤:In a specific implementation, as shown in FIG. 4, in step S20, acquiring the corresponding ASR voice feature based on the target to-be-differentiated voice data includes the following steps:
S21:对目标待区分语音数据进行预处理,获取预处理语音数据。S21: Preprocess the target to-be-differentiated voice data to obtain preprocessed voice data.
本实施例中,对目标待区分语音数据进行预处理,并获取相对应的预处理语音数据。对目标待区分语音数据进行预处理能够更好地提取目标待区分语音数据的ASR语音特征,使得提取出的ASR语音特征更能代表该目标待区分语音数据,以采用该ASR语音特征进行语音区分。In this embodiment, the target to-be-differentiated voice data is pre-processed, and corresponding pre-processed voice data is obtained. Preprocessing the target to-be-differentiated voice data can better extract the ASR voice characteristics of the target to-be-differentiated voice data, so that the extracted ASR speech features can better represent the target-to-be-differentiated voice data, so as to use the ASR speech features for speech discrimination .
在一具体实施方式中,如图5所示,步骤S21中,对目标待区分语音数据进行预处理,获取预处理语音数据,包括如下步骤:In a specific implementation, as shown in FIG. 5, in step S21, pre-processing the target to-be-differentiated voice data to obtain pre-processed voice data includes the following steps:
S211:对目标待区分语音数据作预加重处理,预加重处理的计算公式为s' n=s n-a*s n-1,其中,s n为时域上的信号幅度,s n-1为与s n相对应的上一时刻的信号幅度,s' n为预加重后时域上的信号幅度,a为预加重系数,a的取值范围为0.9<a<1.0。 S211: Pre-emphasis processing is performed on the target to-be-differentiated voice data. The calculation formula for the pre-emphasis processing is s' n = s n -a * s n-1 , where s n is the signal amplitude in the time domain and s n-1 s n is the amplitude of the signal corresponding to the previous time, s' n for the signal amplitude in the time domain after the pre-emphasis, a is the pre-emphasis coefficient, a is in the range of 0.9 <a <1.0.
其中,预加重是一种在发送端对输入信号高频分量进行补偿的信号处理方式。随着信号速率的增加,信号在传输过程中受损很大,为了使接收端能得到比较好的信号波形,就需要对受损的信号进行补偿。预加重技术的思想就是在传输线的发送端增强信号的高频成分,以补偿高频分量在传输过程中的过大衰减,使得接收端能够得到较好的信号波形。预加重对噪声并没有影响,因此能够有效提高输出信噪比。Among them, pre-emphasis is a signal processing method that compensates the high-frequency component of the input signal at the transmitting end. With the increase of the signal rate, the signal is greatly damaged in the transmission process. In order to obtain a better signal waveform at the receiving end, the damaged signal needs to be compensated. The idea of the pre-emphasis technology is to enhance the high-frequency component of the signal at the transmitting end of the transmission line to compensate for the excessive attenuation of the high-frequency component during transmission, so that the receiving end can obtain a better signal waveform. Pre-emphasis has no effect on noise, so it can effectively improve the output signal-to-noise ratio.
本实施例中,对目标待区分语音数据作预加重处理,该预加重处理的公式为s' n=s n-a*s n-1,其中,s n为时域上的信号幅度,即语音数据在时域上表达的语音的幅值(幅度),s n-1为与s n相对的上一时刻的信号幅度,s' n为预加重后时域上的信号幅度,a为预加重系数,a的取值范围为0.9<a<1.0,这里取0.97预加重的效果比较好。采用该预加重处理能够消除发声过程中声带和嘴唇等造成的干扰,可以有效补偿目标待区分语音数据被压抑的高频部分,并且能够突显目标待区分语音数据高频的共振峰,加强目标待区分语音数据的信号幅度,有助于提取ASR语音特征。 Formula of the present embodiment, the target to be differentiated for the voice data pre-emphasis, the pre-emphasis is s' n = s n -a * s n-1, wherein the amplitude of the signal s n on the time domain, i.e., voice magnitude (amplitude) expression of the voice data in the time domain, s n-1 s n is the opposite of the signal amplitude of a time, s' n for the amplitude of the signal on the time-domain pre-emphasis, a is pre- The weighting coefficient, the range of a is 0.9 <a <1.0. Here, the effect of pre-emphasis of 0.97 is better. The use of the pre-emphasis processing can eliminate interference caused by vocal cords and lips during the utterance process, can effectively compensate the suppressed high-frequency part of the target voice data to be distinguished, and can highlight the high-frequency formants of the target voice data to be distinguished, strengthening the target Distinguishing the signal amplitude of speech data helps to extract ASR speech features.
S212:将预加重后的目标待区分语音数据进行分帧处理。S212: Frame the target to-be-differentiated voice data after pre-emphasis.
本实施例中,在预加重目标待区分语音数据后,还应进行分帧处理。分帧是指将整段的语音信号切分成若干段的语音处理技术,每帧的大小在10-30ms的范围内,以大概1/2帧长作为帧移。帧移是指相邻两帧间的重叠区域,能够避免相邻两帧变化过大的问题。对目标待区分语音数据进行分帧处理,能够将目标待区分语音数据分成若干段的语音数据,可以细分目标待区分语音数据,便于ASR语音特征的提取。In this embodiment, after the pre-emphasis target has to distinguish the voice data, it should also perform frame processing. Framing refers to the speech processing technology that cuts the entire voice signal into several segments. The size of each frame is in the range of 10-30ms, and the frame shift is about 1/2 frame length. Frame shift refers to the overlapping area between two adjacent frames, which can avoid the problem of excessive changes in adjacent two frames. Frame processing is performed on the target to-be-differentiated voice data, which can divide the target to-be-differentiated voice data into several pieces of voice data, and the target to-be-differentiated voice data can be subdivided to facilitate the extraction of ASR voice features.
S213:将分帧后的目标待区分语音数据进行加窗处理,获取预处理语音数据,加窗的计算公式为
s'_{n} = s_{n} \left[ 0.54 - 0.46 \cos\!\left( \frac{2\pi n}{N-1} \right) \right]
其中,N为窗长,n为时间,s n为时域上的信号幅度,s' n为加窗后时域上的信号幅度。
S213: Perform windowing on the framed target to-be-differentiated voice data to obtain pre-processed voice data. The calculation formula for the windowing is
s'_{n} = s_{n} \left[ 0.54 - 0.46 \cos\!\left( \frac{2\pi n}{N-1} \right) \right]
Wherein, N is the window length, n is time, s_n is the signal amplitude in the time domain, and s'_n is the signal amplitude in the time domain after windowing.
本实施例中,在对目标待区分语音数据进行分帧处理后,每一帧的起始段和末尾端都会出现不连续的地方,所以分帧越多与目标待区分语音数据的误差也就越大。采用加窗能够解决这个问题,可以使分帧后的目标待区分语音数据变得连续,并且使得每一帧能够表现出周期函数的特征。加窗处理具体是指采用窗函数对目标待区分语音数据进行处理,窗函数可以选择汉明窗,则该加窗的公式为
s'_{n} = s_{n} \left[ 0.54 - 0.46 \cos\!\left( \frac{2\pi n}{N-1} \right) \right]
N为汉明窗窗长,n为时间,s n为时域上的信号幅度,s' n为加窗后时域上的信号幅度。对目标待区分语音数据进行加窗处理,获取预处理语音数据,能够使得分帧后的目标待区分语音数据在时域上的信号变得连续,有助于提取目标待区分语音数据的ASR语音特征。
In this embodiment, after frame processing is performed on the target to-be-differentiated voice data, discontinuities appear at the beginning and end of each frame, so the more frames there are, the more errors there are with the target to-be-differentiated voice data. Bigger. The use of windowing can solve this problem, which can make the target to-be-differentiated voice data after framed become continuous, and each frame can show the characteristics of the periodic function. The windowing process specifically refers to using the window function to process the target to-be-differentiated speech data. The window function can select the Hamming window, and the windowing formula is
s'_{n} = s_{n} \left[ 0.54 - 0.46 \cos\!\left( \frac{2\pi n}{N-1} \right) \right]
N is the Hamming window length, n is time, s_n is the signal amplitude in the time domain, and s'_n is the signal amplitude in the time domain after windowing. Windowing the framed target to-be-differentiated voice data to obtain the pre-processed voice data makes its time-domain signal continuous, which helps to extract the ASR voice features of the target to-be-differentiated voice data.
上述步骤S211-S213对目标待区分语音数据的预处理操作,为提取目标待区分语音数据的ASR语音特征提供了基础,能够使得提取的ASR语音特征更能代表该目标待区分语音数据,并根据该ASR语音特征进行语音区分。The pre-processing operations on the target to-be-differentiated voice data in steps S211 to S213 provide a basis for extracting the ASR voice characteristics of the target to-be-differentiated voice data, which can make the extracted ASR voice features more representative of the target to-be-differentiated voice data, and according to This ASR voice feature performs voice discrimination.
S22:对预处理语音数据作快速傅里叶变换,获取目标待区分语音数据的频谱,并根据频谱获取目标待区分语音数据的功率谱。S22: Perform a fast Fourier transform on the pre-processed speech data to obtain the frequency spectrum of the target speech data to be distinguished, and obtain the power spectrum of the target speech data to be distinguished according to the frequency spectrum.
其中,快速傅里叶变换(Fast Fourier Transformation,简称FFT),指利用计算机计算离散傅里叶变换的高效、快速计算方法的统称,简称FFT。采用这种算法能使计算机计算离散傅里叶变换所需要的乘法次数大为减少,特别是被变换的抽样点数越多,FFT算法计算量的节省就越显著。Among them, Fast Fourier Transform (FFT) refers to a collective term for an efficient and fast method for computing a discrete Fourier transform using a computer, and is referred to as FFT for short. The use of this algorithm can greatly reduce the number of multiplications required by the computer to calculate the discrete Fourier transform. In particular, the more the number of transformed sampling points, the more significant the FFT algorithm's computational savings will be.
本实施例中,对预处理语音数据进行快速傅里叶变换,以将预处理语音数据从时域上的信号幅度转换为在频域上的信号幅度(频谱)。该计算频谱的公式为
s(k) = \sum_{n=1}^{N} s(n)\, e^{-\frac{2\pi i}{N} nk}
1≤k≤N,N为帧的大小,s(k)为频域上的信号幅度,s(n)为时域上的信号幅度,n为时间,i为复数单位。在获取预处理语音数据的频谱后,可以根据该频谱直接求得预处理语音数据的功率谱,以下将预处理语音数据的功率谱称为目标待区分语音数据的功率谱。该计算目标待区分语音数据的功率谱的公式为
P(k) = \frac{1}{N} \left| s(k) \right|^{2}
1≤k≤N,N为帧的大小,s(k)为频域上的信号幅度。通过将预处理语音数据从时域上的信号幅度转换为频域上的信号幅度,再根据该频域上的信号幅度获取目标待区分语音数据的功率谱,为从目标待区分语音数据的功率谱中提取ASR语音特征提供重要的技术基础。
In this embodiment, fast Fourier transform is performed on the pre-processed voice data to convert the pre-processed voice data from the signal amplitude in the time domain to the signal amplitude (spectrum) in the frequency domain. The formula for calculating the spectrum is
s(k) = \sum_{n=1}^{N} s(n)\, e^{-\frac{2\pi i}{N} nk}
1≤k≤N, N is the frame size, s (k) is the signal amplitude in the frequency domain, s (n) is the signal amplitude in the time domain, n is time, and i is a complex unit. After obtaining the frequency spectrum of the pre-processed voice data, the power spectrum of the pre-processed voice data can be directly obtained according to the frequency spectrum. The power spectrum of the pre-processed voice data is hereinafter referred to as the power spectrum of the target voice data to be distinguished. The formula for calculating the power spectrum of the target speech data to be distinguished is
P(k) = \frac{1}{N} \left| s(k) \right|^{2}
1≤k≤N, N is the size of the frame, and s (k) is the signal amplitude in the frequency domain. The pre-processed speech data is converted from the signal amplitude in the time domain to the signal amplitude in the frequency domain, and then the power spectrum of the target speech data to be distinguished is obtained according to the signal amplitude in the frequency domain. Extracting ASR speech features from the spectrum provides an important technical basis.
S23:采用梅尔刻度滤波器组处理目标待区分语音数据的功率谱,获取目标待区分语音数据的梅尔功率谱。S23: Use the Mel scale filter bank to process the power spectrum of the target speech data to be distinguished, and obtain the Mel power spectrum of the target speech data to be distinguished.
其中,采用梅尔刻度滤波器组处理目标待区分语音数据的功率谱是对功率谱进行的梅尔频率分析,梅尔频率分析是基于人类听觉感知的分析。观测发现,人耳就像一个滤波器组一样,只关注某些特定的频率分量(人的听觉对频率是有选择性的),也就是说人耳只让某些频率的信号通过,而直接无视不想感知的某些频率信号。然而这些滤波器在频率坐标轴上却不是统一分布的,在低频区域有很多的滤波器,他们分布比较密集,但在高频区域,滤波器的数目就变得比较少,分布很稀疏。可以理解地,梅尔刻度滤波器组在低频部分的分辨率高,跟人耳的听觉特性是相符的,这也是梅尔刻度的物理意义所在。Among them, the power spectrum of the target speech data to be processed using the Mel scale filter bank is a Mel frequency analysis of the power spectrum, and the Mel frequency analysis is an analysis based on human auditory perception. Observation found that the human ear is like a filter bank, focusing only on certain specific frequency components (human hearing is selective to frequencies), which means that the human ear only allows signals of certain frequencies to pass through, and directly Ignore certain frequency signals that you don't want to perceive. However, these filters are not uniformly distributed on the frequency axis. There are many filters in the low frequency region, and they are densely distributed. However, in the high frequency region, the number of filters becomes relatively small and the distribution is sparse. Understandably, the resolution of the Mel scale filter bank in the low frequency part is high, which is consistent with the hearing characteristics of the human ear, which is also the physical meaning of the Mel scale.
本实施例中,采用梅尔刻度滤波器组处理目标待区分语音数据的功率谱,获取目标待区分语音数据的梅尔功率谱,通过采用梅尔刻度滤波器组对频域信号进行切分,使得最后每个频率段对应一个数值,若滤波器的个数为22,则可以得到目标待区分语音数据的梅尔功率谱对应的22个能量值。通过对目标待区分语音数据的功率谱进行梅尔频率分析,使得其分析后获取的梅尔功率谱保留着与人耳特性密切相关的频率部分,该频率部分能够很好地反映出目标待区分语音数据的特征。In this embodiment, a Mel scale filter bank is used to process the power spectrum of the target speech data to be distinguished, and a Mel power spectrum of the target speech data to be distinguished is obtained. The frequency domain signal is segmented by using the Mel scale filter bank. Make each frequency segment correspond to a numerical value. If the number of filters is 22, 22 energy values corresponding to the Mel power spectrum of the target speech data to be distinguished can be obtained. By performing Mel frequency analysis on the power spectrum of the target speech data to be distinguished, the Mel power spectrum obtained after the analysis retains a frequency portion closely related to the characteristics of the human ear, and this frequency portion can well reflect the target to be distinguished Characteristics of speech data.
S24:在梅尔功率谱上进行倒谱分析,获取目标待区分语音数据的梅尔频率倒谱系数。S24: Perform cepstrum analysis on the Mel power spectrum to obtain a Mel frequency cepstrum coefficient of the target speech data to be distinguished.
其中,倒谱(cepstrum)是指一种信号的傅里叶变换谱经对数运算后再进行的傅里叶反变换,由于一般傅里叶谱是复数谱,因而倒谱又称复倒谱。Among them, cepstrum refers to the inverse Fourier transform of the Fourier transform spectrum of a signal after logarithmic operation. Since the general Fourier spectrum is a complex spectrum, the cepstrum is also called complex cepstrum. .
本实施例中,对梅尔功率谱进行倒谱分析,根据倒谱的结果,分析并获取目标待区分语音数据的梅尔频率倒谱系数。通过该倒谱分析,可以将原本特征维数过高,难以直接使用的目标待区分语音数据的梅尔功率谱中包含的特征,通过在梅尔功率谱上进行倒谱分析,转换成易于使用的特征(用来进行训练或识别的梅尔频率倒谱系数特征向量)。该梅尔频率倒谱系数能够作为ASR语音特征对不同语音进行区分的系数,该ASR语音特征可以反映语音之间的区别,可以用来识别和区分目标待区分语音数据。In this embodiment, a cepstrum analysis is performed on the Mel power spectrum, and based on the cepstrum result, the Mel frequency cepstrum coefficient of the target speech data to be distinguished is analyzed and obtained. Through the cepstrum analysis, the features contained in the Mel power spectrum of the target speech data to be distinguished, which is too high in original feature dimension, can be directly converted into easy-to-use through cepstrum analysis on the Mel power spectrum. (Mel frequency cepstrum coefficient feature vector used for training or identification). The Mel frequency cepstrum coefficient can be used as a coefficient for distinguishing different voices from ASR voice features. The ASR voice feature can reflect the difference between voices, and can be used to identify and distinguish target to-be-differentiated voice data.
在一具体实施方式中,如图6所示,步骤S24中,在梅尔功率谱上进行倒谱分析,获取目标待区分语音数据的梅尔频率倒谱系数,包括如下步骤:In a specific embodiment, as shown in FIG. 6, in step S24, cepstrum analysis is performed on the Mel power spectrum to obtain the Mel frequency cepstrum coefficient of the target speech data to be distinguished, including the following steps:
S241:取梅尔功率谱的对数值,获取待变换梅尔功率谱。S241: Take the log value of the Mel power spectrum, and obtain the Mel power spectrum to be transformed.
本实施例中,根据倒谱的定义,对梅尔功率谱取对数值log,获取待变换梅尔功率谱m。In this embodiment, according to the definition of the cepstrum, a log value log of the Mel power spectrum is taken to obtain the Mel power spectrum m to be transformed.
S242:对待变换梅尔功率谱作离散余弦变换,获取目标待区分语音数据的梅尔频率倒谱系数。S242: Perform discrete cosine transform on the Mel power spectrum to be transformed to obtain a Mel frequency cepstrum coefficient of the target speech data to be distinguished.
本实施例中,对待变换梅尔功率谱m作离散余弦变换(Discrete Cosine Transform,DCT),获取相对应的目标待区分语音数据的梅尔频率倒谱系数,一般取第2个到第13个系数作为ASR语音特征,该ASR语音特征能够反映语音数据间的区别。对待变换梅尔功率谱m作离散余弦变换的公式为
C_{i} = \sum_{j=1}^{N} m_{j} \cos\!\left[ \frac{\pi i}{N} \left( j - \frac{1}{2} \right) \right]
i=0,1,2,...,N-1,N为帧长,m为待变换梅尔功率谱,j为待变换梅尔功率谱的自变量。由于梅尔滤波器之间是有重叠的,所以采用梅尔刻度滤波器获取的能量值之间是具有相关性的,离散余弦变换可以对待变换梅尔功率谱m进行降维压缩和抽象,并获得相应的ASR语音特征,相比于傅里叶变换,离散余弦变换的结果没有虚部,在计算方面有明显的优势。
In this embodiment, a discrete cosine transform (DCT) is performed on the Mel power spectrum m to be transformed to obtain a corresponding Mel frequency cepstrum coefficient of the target speech data to be distinguished. Generally, the second to thirteenth coefficients are taken. Coefficients are used as ASR speech features, which can reflect the differences between speech data. The formula for discrete cosine transform of the transformed Mel power spectrum m is
C_{i} = \sum_{j=1}^{N} m_{j} \cos\!\left[ \frac{\pi i}{N} \left( j - \frac{1}{2} \right) \right]
i = 0, 1, 2, ..., N-1, N is the frame length, m is the Mel power spectrum to be transformed, and j is the independent variable of the Mel power spectrum to be transformed. Because there is overlap between Mel filters, there is a correlation between the energy values obtained by using Mel scale filters. Discrete cosine transform can perform dimensionality reduction and abstraction on the transformed Mel power spectrum m, and The corresponding ASR speech features are obtained. Compared with the Fourier transform, the result of the discrete cosine transform has no imaginary part, and has obvious advantages in terms of calculation.
步骤S21-S24基于ASR技术对目标待区分语音数据进行特征提取的处理,最终获取的ASR语音特征能够很好地体现目标待区分语音数据,该ASR语音特征能够在深度网络模型训练获取得到ASR-RNN模型,使训练获取的ASR-RNN模型在进行语音区分时的结果更为精确,即使在噪音很大的条件下,也可以精确地将噪音和语音区分开来。Steps S21-S24 are based on the ASR technology to perform feature extraction on the target to-be-differentiated voice data. The final obtained ASR speech feature can well reflect the target to-be-differentiated voice data. The ASR voice feature can be obtained by deep network model training to obtain ASR- The RNN model makes the ASR-RNN model obtained during training more accurate when distinguishing speech, and can accurately distinguish noise from speech even under very noisy conditions.
需要说明的是,以上提取的特征为梅尔频率倒谱系数,在这里不应将ASR语音特征限定为只有梅尔频率倒谱系数一种,而应当认为采用ASR技术获取的语音特征,只要能够有效反映语音数据特征,都是可以作为ASR语音特征进行识别和模型训练的。It should be noted that the features extracted above are Mel frequency cepstrum coefficients. Here, the ASR speech features should not be limited to only Mel frequency cepstrum coefficients. Instead, it should be considered that the speech features obtained by ASR technology can be used as long as they can The features that effectively reflect speech data can be used as ASR speech features for recognition and model training.
S30:将ASR语音特征输入到预先训练好的ASR-RNN模型中进行区分,获取目标区分结果。S30: The ASR speech features are input into a pre-trained ASR-RNN model for discrimination, and a target discrimination result is obtained.
其中,ASR-RNN模型是指采用ASR语音特征训练得到的循环神经网络模型,RNN即指循环神经网络(Recurrent neural networks)。该ASR-RNN模型是采用待训练语音数据提取的ASR语音特征进行训练得到的,因此该模型能够识别ASR语音特征,从而根据ASR语音特征区分语音。具体地,待训练语音数据包括目标语音和噪音,在进行ASR-RNN模型训练时提取目标语音的ASR语音特征和噪音的ASR语音特征,使得训练获取的ASR-RNN模型能够根据ASR语音特征识别目标语音和干扰语音中的噪音(在采用VAD区分原始待区分语音数据时已经去除了大部分的干扰语音,如语音数据中由于静默而没有发音的语音部分和一部分噪音,所以这里ASR-DBN模型区分的干扰语音具体是指噪音部分),实现对目标语音和干扰语音进行有效区分的目的。Among them, the ASR-RNN model refers to a recurrent neural network model trained using ASR speech features, and RNN refers to recurrent neural networks. The ASR-RNN model is trained using ASR speech features extracted from the speech data to be trained, so the model can recognize ASR speech features and distinguish speech based on ASR speech features. Specifically, the speech data to be trained includes target speech and noise. When performing ASR-RNN model training, the ASR speech feature of the target speech and the ASR speech feature of the noise are extracted, so that the ASR-RNN model obtained by training can recognize the target based on the ASR speech feature. Noise in speech and interfering speech (when VAD is used to distinguish the original to-be-differentiated speech data, most of the interfering speech has been removed, such as the speech data and part of the noise that are not pronounced due to silence in the speech data, so the ASR-DBN model distinguishes The interference speech specifically refers to the noise part), to achieve the purpose of effectively distinguishing between the target speech and the interference speech.
本实施例中,将ASR语音特征输入到预先训练好的ASR-RNN模型中进行区分,由于ASR语音特征能够反映语音数据的特征,因此可以根据ASR-RNN模型对目标待区分语音数据提取的ASR语音特征进行识别,从而根据ASR语音特征对目标待区分语音数据作出精确的语音区分。该预先训练好的ASR-RNN模型结合了ASR语音特征和循环神经网络对特征进行深层提取的特点,从语音数据的ASR语音特征上对语音进行了区分,在噪音条件非常恶劣的情况下仍然有很高的精确率。具体地,由于ASR提取的特征也包含了噪音的ASR语音特征,因此,在该ASR-RNN模型中,噪音也是可以精确地进行区分,解决当前语音区分方法(包括但不限于VAD)在噪音影响较大的条件下无法有效进行语音区分的问题。In this embodiment, the ASR voice features are input into a pre-trained ASR-RNN model to distinguish them. Since the ASR voice features can reflect the characteristics of the voice data, the ASR of the target to be distinguished voice data can be extracted according to the ASR-RNN model. The speech features are recognized, so that the target speech data to be distinguished is accurately distinguished based on the ASR speech features. This pre-trained ASR-RNN model combines the features of ASR speech features and recurrent neural network to extract features in depth, and distinguishes speech from the ASR speech features of speech data. It is still available under very bad noise conditions. Very high accuracy. Specifically, since the features extracted by ASR also include the ASR speech features of noise, in this ASR-RNN model, noise can also be accurately distinguished, and the current speech discrimination methods (including but not limited to VAD) are affected by noise. The problem that the speech cannot be effectively distinguished under larger conditions.
在一具体实施方式中,步骤S30,在将ASR语音特征输入到预先训练好的ASR-RNN模型中进行区分,获取目标区分结果的步骤之前,语音区分方法还包括如下步骤:获取ASR-RNN模型。In a specific implementation, step S30, before the steps of inputting ASR voice features into a pre-trained ASR-RNN model to distinguish and obtain a target discrimination result, the voice discrimination method further includes the following steps: obtaining an ASR-RNN model .
如图7所示,获取ASR-RNN模型的步骤具体包括:As shown in FIG. 7, the steps of obtaining the ASR-RNN model include:
S31:获取待训练语音数据,并提取待训练语音数据的待训练ASR语音特征。S31: Acquire speech data to be trained, and extract speech features of the ASR to be trained.
其中,待训练语音数据是指训练ASR-RNN模型所需的语音数据训练样本集,该待训练语音数据可以是直接采用开源的语音训练集,或者是通过收集大量样本语音数据的语音训练集。该待训练语音数据是将目标语音和干扰语音(在这里具体为噪音)提前区分好的,区分采取的具体方式可以是对目标语音和噪音分别设置不同的标签值。例如,将待训练语音数据中的目标语音部分都标记为1(代表“真”),将噪音部分都标记为0(代表“假”),通过提前设置的标签值可以检验ASR-RNN模型识别的精确度,以便提供改进的参考,更新ASR-RNN模型中的网络参数,不断优化ASR-RNN模型。本实施例中,目标语音和噪音的比例具体可以取1:1,采用该比例能够避免因待训练语音数据中目标语音和噪音数量不相同而出现过拟合现象。其中,过拟合是指为了得到一致假设而使假设变得过度严格的现象,避免过拟合是分类器设计中的一个核心任务。The voice data to be trained refers to a training set of voice data required for training the ASR-RNN model. The voice data to be trained may be an open source voice training set directly, or a voice training set by collecting a large amount of sample voice data. The to-be-trained voice data distinguishes the target voice and the interfering voice (here, specifically noise) in advance, and a specific method for distinguishing may be to set different label values for the target voice and noise respectively. For example, all target speech parts in the speech data to be trained are marked as 1 (representing "true"), and noisy parts are marked as 0 (representing "false"). The ASR-RNN model recognition can be tested by setting the label value in advance Accuracy in order to provide improved references, update network parameters in the ASR-RNN model, and continuously optimize the ASR-RNN model. In this embodiment, the ratio of the target voice and the noise may specifically be 1: 1, and adopting this ratio can avoid overfitting due to different target voice and noise amounts in the voice data to be trained. Among them, overfitting refers to the phenomenon that the assumptions become too strict in order to obtain a consistent hypothesis. Avoiding overfitting is a core task in classifier design.
本实施例中,获取待训练语音数据,并提取该待训练语音数据的特征,该特征即待训练ASR语音特征,提取待训练ASR语音特征的步骤与步骤S21-S24相同,在此不再赘述。待训练语音数据包括目标语音的训练样本和噪音的训练样本,这两部分语音数据都有各自的ASR语音特征,因此,可以提取并采用 待训练ASR语音特征训练相对应的ASR-RNN模型,使得根据该待训练ASR语音特征训练获取的ASR-RNN模型可以精确地区分目标语音和噪音(噪音属于干扰语音)。In this embodiment, the voice data to be trained is obtained and the feature of the voice data to be trained is extracted. This feature is the voice feature of the ASR to be trained. The steps of extracting the voice feature of the ASR to be trained are the same as steps S21-S24, and will not be repeated here . The speech data to be trained includes training samples of the target speech and training samples of noise. Both parts of the speech data have their own ASR speech features. Therefore, the corresponding ASR-RNN model can be extracted and trained using the ASR speech features to be trained, so that The ASR-RNN model obtained by training the ASR speech features to be trained can accurately distinguish the target speech and noise (noise belongs to interference speech).
S32:初始化RNN模型。S32: Initialize the RNN model.
其中,RNN模型即循环神经网络模型。RNN模型包括由神经元组成的输入层、隐藏层和输出层。RNN模型包括各层之间各个神经元连接的权值和偏置,这些权值和偏置决定了RNN模型的性质及识别效果。与传统的神经网络如DNN(Deep Neural Network,深度神经网络)相比,RNN是一种对序列数据(如时间序列)建模的神经网络,即一个序列当前的输出与前面的输出有关。具体的表现形式为网络会对前面的隐藏层状态进行记忆并应用于当前输出的计算中,即隐藏层之间的节点不再是无连接而是有连接的,隐藏层的输入不仅包括输入层的输出还包括上一时刻隐藏层的输出。由于语音数据具有时序上的特点,因此可以采用待训练语音数据训练RNN模型,精确提取目标语音和干扰语音在时序上各自的深层特征,实现语音的精确区分。Among them, the RNN model is a recurrent neural network model. The RNN model includes an input layer, a hidden layer, and an output layer composed of neurons. The RNN model includes the weights and biases of each neuron connection between the layers. These weights and biases determine the nature and recognition effect of the RNN model. Compared with traditional neural networks such as DNN (Deep Neural Network, Deep Neural Network), RNN is a neural network that models sequence data (such as time series), that is, the current output of a sequence is related to the previous output. The specific expression is that the network will remember the state of the previous hidden layer and apply it to the current output calculation, that is, the nodes between the hidden layers are no longer unconnected but connected, and the input of the hidden layer includes not only the input layer The output also includes the output of the hidden layer at the previous moment. Due to the temporal characteristics of the speech data, the RNN model can be trained with the speech data to be trained to accurately extract the respective deep features of the target speech and the interfering speech in time to achieve accurate speech discrimination.
本实施例中,初始化RNN模型,该初始化操作即设置RNN模型中权值和偏置的初始值,该初始值初始设置时可以设置为较小的值,如设置在区间[-0.3-0.3]之间。合理的初始化RNN模型可以使模型在初期有较灵活的调整能力,可以在模型训练过程中对模型进行有效的调整,而不会使模型在初始阶段的调整能力就很差,导致训练出的模型区分效果不好。In this embodiment, the RNN model is initialized. This initialization operation is to set the initial values of weights and offsets in the RNN model. The initial value can be set to a smaller value when initially set, such as in the interval [-0.3-0.3] between. Reasonable initialization of the RNN model can make the model have more flexible adjustment capabilities in the early stage. The model can be adjusted effectively during the model training process without making the model's adjustment capability in the initial stage very poor, resulting in a trained model. The distinction is not good.
S33:将待训练ASR语音特征输入到RNN模型中,根据前向传播算法获取RNN模型的输出值,输出值表示为:
\hat{y}_{t} = \sigma\!\left( V h_{t} + c \right)
σ表示激活函数,V表示隐藏层和输出层之间连接的权值,h t表示t时刻的隐藏状态,c表示隐藏层和输出层之间的偏置。
S33: Input the ASR speech features to be trained into the RNN model, and obtain the output value of the RNN model according to the forward propagation algorithm. The output value is expressed as:
\hat{y}_{t} = \sigma\!\left( V h_{t} + c \right)
σ represents the activation function, V represents the weight of the connection between the hidden layer and the output layer, h t represents the hidden state at time t, and c represents the offset between the hidden layer and the output layer.
本实施例中,RNN前向传播的过程即根据RNN模型中连接各个神经元的权值、偏置和输入的待训练ASR语音特征按照时间序列在RNN模型中进行的一系列线性运算和激活运算,得到的RNN模型中网络每一层的输出值。特别地,由于RNN是对序列(这里具体可以是时间序列)数据进行建模的神经网络,在计算t时刻隐藏层的隐藏状态h t时,需要根据t-1时刻的隐层状态h t-1和t时刻输入的待训练ASR语音特征共同求得。由RNN模型前向传播的过程,可以得到RNN模型的前向传播算法:对于任意时刻t,根据输入的待训练ASR语音特征从RNN模型的输入层计算到隐藏层的输出,该隐藏层的输出(即隐藏状态h t)表示为:h t=σ(Ux t+Wh t-1+b),其中,σ表示激活函数(这里具体可以采用tanh激活函数,tanh在循环过程中会不断扩大待训练ASR语音特征的特征之间的区别,有利于区分目标语音和噪音),U表示输入层到隐藏层之间连接的权值,W表示隐藏层之间连接的权值(由时间序列实现的隐藏层之间的连接),h t-1表示t-1时刻的隐藏状态,b表示输入层和隐藏层之间的偏置。从RNN模型的隐藏层计算到输出层的输出,该输出层的输出(即RNN模型的输出值)表示为
\hat{y}_{t} = \sigma\!\left( V h_{t} + c \right)
其中,这里的激活函数具体采用的可以是softmax函数(该softmax函数用于分类问题效果比较好),V表示隐藏层和输出层之间连接的权值,h t表示t时刻的隐藏状态,c表示隐藏层和输出层之间的偏置。该RNN模型的输出值(输出层的输出)
\hat{y}_{t}
即为通过前向传播算法按序列一层层计算得到的输出值,可以称为预测输出值。服务器获取RNN模型的输出值后,可以根据该输出值更新、调整RNN模型中的网络参数(权值和偏置),以使获取的RNN模型能够根据语音具有的时序性特点进行区分,通过目标语音的ASR语音特征和干扰语音的ASR语音特征及在时序上表现的不同,得到精确的识别结果。
In this embodiment, the process of RNN forward propagation is a series of linear operations and activation operations performed in the RNN model according to the time series according to the weighted, biased, and input ASR speech features of each neuron in the RNN model. To get the output value of each layer of the network in the RNN model. In particular, since the RNN is a neural network that models sequence (specifically, time series) data, when calculating the hidden state h t of the hidden layer at time t , it is necessary to calculate the hidden layer state h t- at time t-1 The ASR speech features input at times 1 and t are obtained together. From the process of RNN model forward propagation, the RNN model's forward propagation algorithm can be obtained: for any time t, according to the input ASR speech features to be trained, it is calculated from the input layer of the RNN model to the output of the hidden layer, and the output of the hidden layer (That is, the hidden state h t ) is expressed as: h t = σ (Ux t + Wh t-1 + b), where σ represents the activation function (specifically, the tanh activation function can be used here, and tanh will continue to expand during the cycle. Train the differences between the features of the ASR speech features to help distinguish the target speech from noise), U represents the weight of the connection between the input layer and the hidden layer, and W represents the weight of the connection between the hidden layers (implemented by time series Connection between hidden layers), h t-1 represents the hidden state at t-1, and b represents the offset between the input layer and the hidden layer. From the hidden layer of the RNN model to the output of the output layer, the output of the output layer (that is, the output value of the RNN model) is expressed as
\hat{y}_{t} = \sigma\!\left( V h_{t} + c \right)
Among them, the activation function used here can be a softmax function (the softmax function is better for classification problems), V represents the weight of the connection between the hidden layer and the output layer, h t represents the hidden state at time t, c Represents the offset between the hidden layer and the output layer. The output value of this RNN model (the output of the output layer)
\hat{y}_{t}
That is, the output value calculated by the layer by layer through the forward propagation algorithm can be called the predicted output value. After the server obtains the output value of the RNN model, it can update and adjust the network parameters (weights and offsets) in the RNN model according to the output value, so that the obtained RNN model can be distinguished according to the time-series characteristics of the voice. The difference between the ASR voice characteristics of the speech and the ASR voice characteristics of the interfering speech and their timing performance results in accurate recognition results.
S34:基于输出值进行误差反传,更新RNN模型各层的权值和偏置,获取ASR-RNN模型,其中,更新权值V的公式为:
V' = V - \alpha \sum_{t=1}^{\tau} \left( \hat{y}_{t} - y_{t} \right) \left( h_{t} \right)^{T}
V表示更新前隐藏层和输出层之间连接的的权值,V' 表示更新后隐藏层和输出层之间连接的的权值,α表示学习率,t表示t时刻,τ表示总时长,
\hat{y}_{t}
表示预测输出值,y t表示真实输出值,h t表示t时刻的隐藏状态,T表示矩阵转置运算;更新偏置c的公式为:
c' = c - \alpha \sum_{t=1}^{\tau} \left( \hat{y}_{t} - y_{t} \right)
c表示更新前隐藏层和输出层之间的偏置,c'表示更新后隐藏层和输出层之间的偏置;更新权值U的公式为:
U' = U - \alpha \sum_{t=1}^{\tau} \operatorname{diag}\!\left( 1 - h_{t} \odot h_{t} \right) \delta_{t} \left( x_{t} \right)^{T}
U表示更新前输入层到隐藏层之间连接的权值,U'表示更新后输入层到隐藏层之间连接的权值,diag()表示构造一个对角矩阵或者以向量的形式返回一个矩阵上对角线元素的矩阵运算,δ t表示隐藏层状态的梯度,x t表示t时刻输入的待训练ASR语音特征;更新权值W的公式为:
W' = W - \alpha \sum_{t=1}^{\tau} \operatorname{diag}\!\left( 1 - h_{t} \odot h_{t} \right) \delta_{t} \left( h_{t-1} \right)^{T}
W表示更新前隐藏层之间连接的权值,W'表示更新后隐藏层之间连接的权值;更新偏置b的公式为:
b' = b - \alpha \sum_{t=1}^{\tau} \operatorname{diag}\!\left( 1 - h_{t} \odot h_{t} \right) \delta_{t}
b表示更新前输入层和隐藏层之间的偏置,b'表示更新后输入层和隐藏层之间的偏置。
S34: Perform error back propagation based on the output value, update the weights and offsets of each layer of the RNN model, and obtain the ASR-RNN model. The formula for updating the weight V is:
V' = V - \alpha \sum_{t=1}^{\tau} \left( \hat{y}_{t} - y_{t} \right) \left( h_{t} \right)^{T}
V represents the weight of the connection between the hidden layer and the output layer before the update, V 'represents the weight of the connection between the hidden layer and the output layer after the update, α represents the learning rate, t represents the time t, and τ represents the total duration,
\hat{y}_{t}
represents the predicted output value, y_t represents the real output value, h_t represents the hidden state at time t, and T represents the matrix transposition operation; the formula for updating the offset c is:
c' = c - \alpha \sum_{t=1}^{\tau} \left( \hat{y}_{t} - y_{t} \right)
c represents the offset between the hidden layer and the output layer before the update, c 'represents the offset between the hidden layer and the output layer after the update; the formula for updating the weight U is:
U' = U - \alpha \sum_{t=1}^{\tau} \operatorname{diag}\!\left( 1 - h_{t} \odot h_{t} \right) \delta_{t} \left( x_{t} \right)^{T}
U represents the weight of the connection between the input layer and the hidden layer before the update, U' represents the weight of the connection between the input layer and the hidden layer after the update, diag() denotes the matrix operation that constructs a diagonal matrix from a vector (or returns the diagonal elements of a matrix as a vector), δ_t represents the gradient of the hidden layer state, and x_t represents the to-be-trained ASR speech feature input at time t; the formula for updating the weight W is:
W' = W - \alpha \sum_{t=1}^{\tau} \operatorname{diag}\!\left( 1 - h_{t} \odot h_{t} \right) \delta_{t} \left( h_{t-1} \right)^{T}
W represents the weight of the connections between the hidden layers before the update, W 'represents the weight of the connections between the hidden layers after the update; the formula for updating the offset b is:
b' = b - \alpha \sum_{t=1}^{\tau} \operatorname{diag}\!\left( 1 - h_{t} \odot h_{t} \right) \delta_{t}
b indicates the offset between the input layer and the hidden layer before the update, and b 'indicates the offset between the input layer and the hidden layer after the update.
本实施例中,服务端在根据前向传播算法获取RNN模型的输出值(预测输出值)
\hat{y}_{t}
后,可以根据
\hat{y}_{t}
与预先设置好标签值的待训练ASR语音特征,计算待训练ASR语音特征在该RNN模型训练时产生的误差,并根据该误差构建合适的误差函数(如采用对数误差函数来表示产生的误差)。服务端再采用该误差函数进行误差反传,调整、更新RNN模型各层的权值(U、W和V)和权值(b和c)。具体地,预先设置好的标签值可以称为真实输出值(即代表客观事实,标签值1代表目标语音,标签值为0代表干扰语音),用y t表示。在训练RNN模型的过程中,时间序列上RNN模型在每一层计算前向输出时都有误差,衡量该误差可以采用误差函数L,表示为:
L = \sum_{t=1}^{\tau} L_{t}
其中,t即指t时刻,τ表示总时长,L t表示由误差函数表示的在t时刻产生的误差。服务端得到误差函数后,可以根据BPTT(Back Propagation Trough Time,基于时间的反向传播算法)更新RNN模型的权值和偏置,获取基于待训练ASR语音特征的ASR-RNN模型。具体地,更新权值V的公式为:
V' = V - \alpha \sum_{t=1}^{\tau} \left( \hat{y}_{t} - y_{t} \right) \left( h_{t} \right)^{T}
其中,V表示更新前隐藏层和输出层之间连接的的权值,V'表示更新后隐藏层和输出层之间连接的的权值,α表示学习率,
\hat{y}_{t}
表示预测输出值,y t表示真实输出值,h t表示t时刻的隐藏状态,T表示矩阵转置运算。更新偏置c的公式为:
c' = c - \alpha \sum_{t=1}^{\tau} \left( \hat{y}_{t} - y_{t} \right)
c表示更新前隐藏层和输出层之间的偏置,c'表示更新后隐藏层和输出层之间的 偏置。相比较于权值V和偏置c,权值U、权值W和偏置b,在反向传播时,某一时刻t的梯度损失由当前位置的输出对应的梯度损失和t+1时刻的梯度损失两部分共同决定。因此权值U、权值W和偏置b的更新需要借助隐藏层状态的梯度δ t得到。t序列时刻隐藏层状态的梯度δ t表示为:
\delta_{t} = \frac{\partial L}{\partial h_{t}}
δ t+1与δ t之间存在联系,根据δ t+1可以求得δ t,其联系的表达式为:
\delta_{t} = W^{T} \operatorname{diag}\!\left( 1 - h_{t+1} \odot h_{t+1} \right) \delta_{t+1} + V^{T} \left( \hat{y}_{t} - y_{t} \right)
其中,δ t+1表示t+1序列时刻隐藏层状态的梯度,diag()表示一种矩阵运算的计算函数,该计算函数用于构造一个对角矩阵或者以向量的形式返回一个矩阵上对角线元素,h t+1表示t+1序列时刻的隐藏层状态。则可以通过得到τ时刻隐藏层状态的梯度δ τ,利用δ t+1与δ t之间联系的表达式
\delta_{t} = W^{T} \operatorname{diag}\!\left( 1 - h_{t+1} \odot h_{t+1} \right) \delta_{t+1} + V^{T} \left( \hat{y}_{t} - y_{t} \right)
由δ τ一层层反向传播递推得到δ t。由于δ τ后面没有其他的时刻,因此根据梯度计算可以直接得到:
\delta_{\tau} = V^{T} \left( \hat{y}_{\tau} - y_{\tau} \right)
则可以根据δ τ递推求得δ t。得到δ t后,即可以计算权值U、权值W和偏置b。更新权值U的公式为:
U' = U - \alpha \sum_{t=1}^{\tau} \operatorname{diag}\!\left( 1 - h_{t} \odot h_{t} \right) \delta_{t} \left( x_{t} \right)^{T}
U表示更新前输入层到隐藏层之间连接的权值,U'表示更新后输入层到隐藏层之间连接的权值,diag()表示构造一个对角矩阵或者以向量的形式返回一个矩阵上对角线元素的矩阵运算,δ t表示隐藏层状态的梯度,x t表示t时刻输入的待训练ASR语音特征;更新权值W的公式为:
W' = W - \alpha \sum_{t=1}^{\tau} \operatorname{diag}\!\left( 1 - h_{t} \odot h_{t} \right) \delta_{t} \left( h_{t-1} \right)^{T}
W表示更新前隐藏层之间连接的权值,W'表示更新后隐藏层之间连接的权值;更新偏置b的公式为:
b' = b - \alpha \sum_{t=1}^{\tau} \operatorname{diag}\!\left( 1 - h_{t} \odot h_{t} \right) \delta_{t}
b表示更新前输入层和隐藏层之间的偏置,b'表示更新后输入层和隐藏层之间的偏置。当所有权值和偏置的变化值都小于停止迭代阈值∈时,即可停止训练;或者,训练达到最大迭代次数MAX时,停止训练。通过待训练ASR语音特征在RNN模型中的预测输出值和预先设置好的标签值(真实输出值)之间产生的误差,基于该误差实现RNN模型各层权值和偏置的更新,使得最终获取的ASR-RNN模型能够根据ASR语音特征,训练并学习关于时间序列的深层特征,实现精确区分语音的目的。
In this embodiment, after the server obtains the output value (predicted output value) of the RNN model
Figure PCTCN2018094190-appb-000017
according to the forward propagation algorithm, it can use
Figure PCTCN2018094190-appb-000018
together with the to-be-trained ASR speech features whose label values have been set in advance to calculate the error produced by those features during training of the RNN model, and construct a suitable error function from that error (for example, a logarithmic error function). The server then uses this error function for error back propagation, adjusting and updating the weights (U, W, and V) and offsets (b and c) of each layer of the RNN model. Specifically, the preset label value may be called the real output value (it represents the objective fact: a label value of 1 represents the target speech and a label value of 0 represents the interfering speech) and is denoted y_t. During training of the RNN model, an error arises each time the model computes the forward output of a layer along the time series; this error can be measured with an error function L, expressed as:
Figure PCTCN2018094190-appb-000019
Here t refers to time t, τ represents the total duration, and L_t represents the error produced at time t as expressed by the error function. After the server obtains the error function, it can update the weights and offsets of the RNN model according to BPTT (Back Propagation Through Time) to obtain the ASR-RNN model based on the to-be-trained ASR speech features. Specifically, the formula for updating the weight V is:
Figure PCTCN2018094190-appb-000020
Among them, V represents the weight of the connection between the hidden layer and the output layer before the update, V 'represents the weight of the connection between the hidden layer and the output layer after the update, and α represents the learning rate,
Figure PCTCN2018094190-appb-000021
Represents the predicted output value, y t represents the real output value, h t represents the hidden state at time t, and T represents the matrix transposition operation. The formula for updating the offset c is:
Figure PCTCN2018094190-appb-000022
c represents the offset between the hidden layer and the output layer before the update, and c' represents the offset between the hidden layer and the output layer after the update. Unlike the weight V and the offset c, for the weight U, the weight W, and the offset b, the gradient loss at a given time t during back propagation is determined jointly by two parts: the gradient loss corresponding to the output at the current position and the gradient loss at time t+1. Therefore, updating the weight U, the weight W, and the offset b requires the gradient δ_t of the hidden layer state. The gradient δ_t of the hidden layer state at sequence time t is expressed as:
Figure PCTCN2018094190-appb-000023
There is a relationship between δ_{t+1} and δ_t, so δ_t can be obtained from δ_{t+1}; the relationship is expressed as:
Figure PCTCN2018094190-appb-000024
Here δ_{t+1} represents the gradient of the hidden layer state at sequence time t+1, diag() denotes the matrix operation that constructs a diagonal matrix or returns the diagonal elements of a matrix in the form of a vector, and h_{t+1} represents the hidden layer state at sequence time t+1. Thus, by obtaining the gradient δ_τ of the hidden layer state at time τ and using the expression relating δ_{t+1} and δ_t,
Figure PCTCN2018094190-appb-000025
δ_t can be obtained by back-propagating recursively, layer by layer, from δ_τ. Since there is no later time after δ_τ, it can be obtained directly from the gradient calculation:
Figure PCTCN2018094190-appb-000026
δ_t can then be obtained recursively from δ_τ. Once δ_t is obtained, the weight U, the weight W, and the offset b can be updated. The formula for updating the weight U is:
Figure PCTCN2018094190-appb-000027
U represents the weight of the connection between the input layer and the hidden layer before the update, U' represents the weight of the connection between the input layer and the hidden layer after the update, diag() denotes the matrix operation that constructs a diagonal matrix or returns the diagonal elements of a matrix in the form of a vector, δ_t represents the gradient of the hidden layer state, and x_t represents the to-be-trained ASR speech feature input at time t; the formula for updating the weight W is:
Figure PCTCN2018094190-appb-000028
W represents the weight of the connections between the hidden layers before the update, W 'represents the weight of the connections between the hidden layers after the update; the formula for updating the offset b is:
Figure PCTCN2018094190-appb-000029
b indicates the offset between the input layer and the hidden layer before the update, and b' indicates the offset between the input layer and the hidden layer after the update. Training can be stopped when the changes of all the weights and offsets are smaller than the stop-iteration threshold ε, or when the maximum number of iterations MAX is reached. Based on the error between the predicted output values produced by the to-be-trained ASR speech features in the RNN model and the preset label values (real output values), the weights and offsets of each layer of the RNN model are updated, so that the finally obtained ASR-RNN model can learn deep time-series features from the ASR speech features and accurately distinguish speech.
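To make the above training procedure concrete, the following Python/NumPy sketch implements one BPTT update for a simple RNN of this form. It is an illustration under stated assumptions rather than the implementation of this application: it assumes a tanh recurrence h_t = tanh(U x_t + W h_{t-1} + b), a sigmoid output ŷ_t = σ(V h_t + c), a logarithmic (cross-entropy) error, and illustrative layer sizes, learning rate α, threshold ε, and MAX.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bptt_update(X, y, U, W, V, b, c, alpha=0.01):
    # X: (tau, n_in) to-be-trained ASR feature frames; y: (tau,) labels, 1 = target speech, 0 = interfering speech
    tau, n_hidden = X.shape[0], U.shape[0]
    h = np.zeros((tau + 1, n_hidden))            # h[0] is the initial hidden state; h[t+1] is the state at time t
    y_hat = np.zeros(tau)
    for t in range(tau):                         # forward propagation
        h[t + 1] = np.tanh(U @ X[t] + W @ h[t] + b)
        y_hat[t] = sigmoid(V @ h[t + 1] + c)
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    db, dc = np.zeros_like(b), 0.0
    delta_next = np.zeros(n_hidden)              # delta_{t+1}; zero beyond the last time step
    for t in reversed(range(tau)):               # error back propagation through time
        dout = y_hat[t] - y[t]                   # gradient of the logarithmic error at the output
        dV += dout * h[t + 1]
        dc += dout
        delta = V * dout                         # contribution of the output at time t
        if t + 1 < tau:                          # plus the contribution propagated back from time t+1
            delta += W.T @ (delta_next * (1.0 - h[t + 2] ** 2))
        pre = delta * (1.0 - h[t + 1] ** 2)      # diag(1 - h_t^2) applied to delta_t
        dU += np.outer(pre, X[t])
        dW += np.outer(pre, h[t])
        db += pre
        delta_next = delta
    return U - alpha * dU, W - alpha * dW, V - alpha * dV, b - alpha * db, c - alpha * dc

# Illustrative training loop with the two stopping rules described above.
rng = np.random.default_rng(0)
n_in, n_hidden, MAX, EPS = 13, 16, 100, 1e-4
U = rng.normal(scale=0.1, size=(n_hidden, n_in))
W = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
V = rng.normal(scale=0.1, size=n_hidden)
b, c = np.zeros(n_hidden), 0.0
X = rng.normal(size=(50, n_in))                  # dummy sequence of 50 feature frames
y = (rng.random(50) > 0.5).astype(float)         # dummy 0/1 frame labels
for _ in range(MAX):
    old = np.concatenate([U.ravel(), W.ravel(), V.ravel(), b.ravel(), [c]])
    U, W, V, b, c = bptt_update(X, y, U, W, V, b, c)
    new = np.concatenate([U.ravel(), W.ravel(), V.ravel(), b.ravel(), [c]])
    if np.max(np.abs(new - old)) < EPS:          # all weight and offset changes below epsilon
        break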
Steps S31-S34 train the RNN model with the to-be-trained ASR speech features, so that the trained ASR-RNN model can learn deep sequence (timing) features from the ASR speech features and can effectively distinguish speech based on the ASR speech features of the target speech and the interfering speech together with the timing factor. Even under severe noise interference, the target speech can still be accurately separated from the noise.
In the speech differentiation method provided in this embodiment, the original to-be-differentiated voice data is first processed based on a voice activity detection (VAD) algorithm to obtain the target to-be-differentiated voice data. Distinguishing the original voice data once with the voice activity detection algorithm yields target to-be-differentiated voice data with a smaller range, which preliminarily and effectively removes interfering voice data from the original data while retaining the portion in which target speech and interfering speech are mixed; taking this as the target to-be-differentiated voice data provides an effective preliminary discrimination and removes a large amount of interfering speech. Then, the corresponding ASR speech features are obtained from the target to-be-differentiated voice data. These ASR speech features make the result of speech discrimination more accurate: even under very noisy conditions, interfering speech (such as noise) can be accurately separated from the target speech, which provides an important technical premise for the subsequent ASR-RNN model recognition based on those features. Finally, the ASR speech features are input into the pre-trained ASR-RNN model for discrimination to obtain the target discrimination result. The ASR-RNN model is a recognition model specially trained, based on the ASR speech features extracted from the speech data to be trained and the temporal characteristics of speech, to distinguish speech effectively; it can correctly separate the target speech from the interfering speech (since VAD has already been applied once, the interfering speech here mostly refers to noise) in the target to-be-differentiated voice data, improving the accuracy of speech discrimination.
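For orientation, the three-stage flow summarized in the preceding paragraph can be sketched as a single function; the helper names and the predict interface below are hypothetical placeholders introduced for illustration, not components defined by this application.

def differentiate_speech(raw_frames, vad_screen, extract_asr_features, asr_rnn_model):
    # raw_frames: original to-be-differentiated speech, one array per frame
    # vad_screen: callable applying the short-time energy / zero-crossing-rate screening (step 1)
    # extract_asr_features: callable returning ASR (MFCC-style) features for the kept frames (step 2)
    # asr_rnn_model: pre-trained model with a predict(features) method returning per-frame scores (step 3)
    target_frames = vad_screen(raw_frames)
    features = extract_asr_features(target_frames)
    scores = asr_rnn_model.predict(features)
    # a score near 1 marks target speech, a score near 0 marks interfering speech or noise
    return [frame for frame, score in zip(target_frames, scores) if score >= 0.5]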
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of this application.
FIG. 8 shows a schematic block diagram of a speech differentiation apparatus corresponding one-to-one to the speech differentiation method of the embodiment. As shown in FIG. 8, the speech differentiation apparatus includes a target to-be-differentiated voice data acquisition module 10, a voice feature acquisition module 20, and a target discrimination result acquisition module 30. The functions implemented by these modules correspond one-to-one to the steps of the speech differentiation method in the embodiment; to avoid repetition, they are not described in detail here.
The target to-be-differentiated voice data acquisition module 10 is configured to process the original to-be-differentiated voice data based on a voice activity detection algorithm to obtain the target to-be-differentiated voice data.
The voice feature acquisition module 20 is configured to obtain the corresponding ASR voice features based on the target to-be-differentiated voice data.
The target discrimination result acquisition module 30 is configured to input the ASR voice features into the pre-trained ASR-RNN model for discrimination and obtain the target discrimination result.
Preferably, the target to-be-differentiated voice data acquisition module 10 includes a first original distinguished speech data acquisition unit 11, a second original distinguished speech data acquisition unit 12, and a target to-be-differentiated voice data acquisition unit 13.
The first original distinguished speech data acquisition unit 11 is configured to process the original to-be-differentiated voice data according to the short-time energy feature value calculation formula, obtain the corresponding short-time energy feature values, and retain the original to-be-differentiated data whose short-time energy feature value is greater than a first threshold as the first original distinguished speech data, where the short-time energy feature value calculation formula is
Figure PCTCN2018094190-appb-000030
N is the speech frame length, s(n) is the signal amplitude in the time domain, and n is time.
The second original distinguished speech data acquisition unit 12 is configured to process the original to-be-differentiated voice data according to the zero-crossing rate feature value calculation formula, obtain the corresponding zero-crossing rate feature values, and retain the original to-be-differentiated voice data whose zero-crossing rate feature value is less than a second threshold as the second original distinguished speech data, where the zero-crossing rate feature value calculation formula is
Figure PCTCN2018094190-appb-000031
N is the speech frame length, s(n) is the signal amplitude in the time domain, and n is time.
The target to-be-differentiated voice data acquisition unit 13 is configured to use the first original distinguished speech data and the second original distinguished speech data as the target to-be-differentiated voice data.
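A minimal sketch of the screening performed by units 11-13 is given below; it assumes the signal has already been cut into frames, and the two thresholds are illustrative values rather than ones specified by this application.

import numpy as np

def short_time_energy(frame):
    # sum over the frame of the squared time-domain amplitude s(n)
    return float(np.sum(frame.astype(float) ** 2))

def zero_crossing_rate(frame):
    # fraction of adjacent sample pairs whose signs differ
    signs = np.sign(frame)
    return float(np.sum(np.abs(np.diff(signs))) / (2.0 * len(frame)))

def vad_screen(frames, energy_threshold=0.1, zcr_threshold=0.3):
    # unit 11 keeps frames whose short-time energy exceeds the first threshold,
    # unit 12 keeps frames whose zero-crossing rate is below the second threshold,
    # and unit 13 takes their union as the target to-be-differentiated voice data.
    return [f for f in frames
            if short_time_energy(f) > energy_threshold or zero_crossing_rate(f) < zcr_threshold]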
Preferably, the voice feature acquisition module 20 includes a pre-processed voice data acquisition unit 21, a power spectrum acquisition unit 22, a Mel power spectrum acquisition unit 23, and a Mel frequency cepstrum coefficient unit 24.
The pre-processing unit 21 is configured to pre-process the target to-be-differentiated voice data to obtain pre-processed voice data.
The power spectrum acquisition unit 22 is configured to perform a fast Fourier transform on the pre-processed voice data to obtain the frequency spectrum of the target to-be-differentiated voice data, and to obtain the power spectrum of the target to-be-differentiated voice data from the frequency spectrum.
The Mel power spectrum acquisition unit 23 is configured to process the power spectrum of the target to-be-differentiated voice data with a Mel-scale filter bank to obtain the Mel power spectrum of the target to-be-differentiated voice data.
The Mel frequency cepstrum coefficient unit 24 is configured to perform cepstrum analysis on the Mel power spectrum to obtain the Mel frequency cepstrum coefficients of the target to-be-differentiated voice data.
Preferably, the pre-processing unit 21 includes a pre-emphasis sub-unit 211, a framing sub-unit 212, and a windowing sub-unit 213.
The pre-emphasis sub-unit 211 is configured to perform pre-emphasis processing on the target to-be-differentiated voice data, where the pre-emphasis formula is s'_n = s_n - a*s_{n-1}, s_n is the signal amplitude in the time domain, s_{n-1} is the signal amplitude at the previous time corresponding to s_n, s'_n is the signal amplitude in the time domain after pre-emphasis, and a is the pre-emphasis coefficient with 0.9 < a < 1.0.
The framing sub-unit 212 is configured to perform framing on the pre-emphasized target to-be-differentiated voice data.
The windowing sub-unit 213 is configured to perform windowing on the framed target to-be-differentiated voice data to obtain the pre-processed voice data, where the windowing formula is
Figure PCTCN2018094190-appb-000032
N is the window length, n is time, s_n is the signal amplitude in the time domain, and s'_n is the signal amplitude in the time domain after windowing.
Preferably, the Mel frequency cepstrum coefficient unit 24 includes a to-be-transformed Mel power spectrum acquisition sub-unit 241 and a Mel frequency cepstrum coefficient sub-unit 242.
The to-be-transformed Mel power spectrum acquisition sub-unit 241 is configured to take the logarithm of the Mel power spectrum to obtain the to-be-transformed Mel power spectrum.
The Mel frequency cepstrum coefficient sub-unit 242 is configured to perform a discrete cosine transform on the to-be-transformed Mel power spectrum to obtain the Mel frequency cepstrum coefficients of the target to-be-differentiated voice data.
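The chain described by units 22-24 and sub-units 241-242 (fast Fourier transform, power spectrum, Mel filter bank, logarithm, discrete cosine transform) can be sketched as follows; the triangular Mel filter-bank construction is a standard textbook version assumed for illustration, and the sample rate, FFT size, filter count, and number of coefficients are example values.

import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_filters, n_fft, sample_rate):
    # standard triangular Mel-scale filter bank (an assumption, not specified in the text)
    high_mel = 2595.0 * np.log10(1.0 + (sample_rate / 2.0) / 700.0)
    mel_points = np.linspace(0.0, high_mel, n_filters + 2)
    hz_points = 700.0 * (10.0 ** (mel_points / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_points / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(1, center - left)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(1, right - center)
    return fbank

def mfcc(frames, sample_rate=16000, n_fft=512, n_filters=26, n_ceps=13):
    spectrum = np.fft.rfft(frames, n_fft)                                  # unit 22: FFT -> frequency spectrum
    power = (np.abs(spectrum) ** 2) / n_fft                                # unit 22: power spectrum
    mel_power = power @ mel_filterbank(n_filters, n_fft, sample_rate).T   # unit 23: Mel power spectrum
    log_mel = np.log(np.maximum(mel_power, 1e-10))                         # sub-unit 241: logarithm
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_ceps]          # sub-unit 242: DCT -> MFCC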
Preferably, the speech differentiation apparatus further includes an ASR-RNN model acquisition module 40, which includes a to-be-trained ASR speech feature acquisition unit 41, an initialization unit 42, an output value acquisition unit 43, and an updating unit 44.
The to-be-trained ASR speech feature acquisition unit 41 is configured to acquire the to-be-trained speech data and extract the to-be-trained ASR speech features of the to-be-trained speech data.
The initialization unit 42 is configured to initialize the RNN model.
The output value acquisition unit 43 is configured to input the to-be-trained ASR speech features into the RNN model and obtain the output value of the RNN model according to the forward propagation algorithm, where the output value is expressed as:
Figure PCTCN2018094190-appb-000033
σ represents the activation function, V represents the weight of the connection between the hidden layer and the output layer, h_t represents the hidden state at time t, and c represents the offset between the hidden layer and the output layer.
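Only the output equation ŷ_t = σ(V h_t + c) is reproduced above as a figure; the sketch below additionally assumes the standard simple-RNN recurrence h_t = tanh(U x_t + W h_{t-1} + b), which is consistent with the parameters U, W, and b referred to by the updating unit 44 below.

import numpy as np

def rnn_forward(X, U, W, V, b, c):
    # X: (tau, n_in) sequence of to-be-trained ASR feature frames
    tau, n_hidden = X.shape[0], U.shape[0]
    h_prev = np.zeros(n_hidden)
    hidden_states, outputs = [], []
    for t in range(tau):
        h_t = np.tanh(U @ X[t] + W @ h_prev + b)        # assumed hidden-state recurrence
        y_hat_t = 1.0 / (1.0 + np.exp(-(V @ h_t + c)))  # output y_hat_t = sigma(V h_t + c)
        hidden_states.append(h_t)
        outputs.append(float(y_hat_t))
        h_prev = h_t
    return np.array(hidden_states), np.array(outputs)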
The updating unit 44 is configured to perform error back propagation based on the output value, update the weights and offsets of each layer of the RNN model, and obtain an ASR-RNN model. The formula for updating the weight V is:
Figure PCTCN2018094190-appb-000034
V represents the weight of the connection between the hidden layer and the output layer before the update, V 'represents the weight of the connection between the hidden layer and the output layer after the update, α represents the learning rate, t represents the time t, and τ represents the total duration,
Figure PCTCN2018094190-appb-000035
Represents the predicted output value, y t represents the real output value, h t represents the hidden state at time t, T represents the matrix transposition operation; the formula for updating the offset c is:
Figure PCTCN2018094190-appb-000036
c represents the offset between the hidden layer and the output layer before the update, c 'represents the offset between the hidden layer and the output layer after the update; the formula for updating the weight U is:
Figure PCTCN2018094190-appb-000037
U represents the weight of the connection between the input layer and the hidden layer before the update, U' represents the weight of the connection between the input layer and the hidden layer after the update, diag() denotes the matrix operation that constructs a diagonal matrix or returns the diagonal elements of a matrix in the form of a vector, δ_t represents the gradient of the hidden layer state, and x_t represents the to-be-trained ASR speech feature input at time t; the formula for updating the weight W is:
Figure PCTCN2018094190-appb-000038
W represents the weight of the connections between the hidden layers before the update, W 'represents the weight of the connections between the hidden layers after the update; the formula for updating the offset b is:
Figure PCTCN2018094190-appb-000039
b indicates the offset between the input layer and the hidden layer before the update, and b 'indicates the offset between the input layer and the hidden layer after the update.
This embodiment provides one or more non-volatile readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to implement the speech differentiation method of the embodiment; to avoid repetition, the details are not repeated here. Alternatively, when the computer-readable instructions are executed by one or more processors, the one or more processors implement the functions of the modules/units of the speech differentiation apparatus of the embodiment; to avoid repetition, the details are not repeated here.
It can be understood that the computer-readable storage medium may include any entity or apparatus capable of carrying the computer-readable instructions, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electric carrier signal, a telecommunication signal, and the like.
FIG. 9 is a schematic diagram of the computer device in this embodiment. As shown in FIG. 9, the computer device 50 includes a processor 51, a memory 52, and computer-readable instructions 53 stored in the memory 52 and executable on the processor 51. When the processor 51 executes the computer-readable instructions 53, the steps of the speech differentiation method of the embodiment are implemented, for example steps S10, S20, and S30 shown in FIG. 2. Alternatively, when the processor 51 executes the computer-readable instructions 53, the functions of the modules/units of the speech differentiation apparatus of the embodiment are implemented, such as the functions of the target to-be-differentiated voice data acquisition module 10, the voice feature acquisition module 20, the target discrimination result acquisition module 30, and the ASR-RNN model acquisition module 40 shown in FIG. 8.
Those skilled in the art can clearly understand that, for convenience and brevity of description, only the above division of functional units and modules is used as an example; in practical applications, the above functions may be assigned to different functional units or modules as needed, that is, the internal structure of the apparatus may be divided into different functional units or modules to complete all or part of the functions described above.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
The above embodiments are only intended to describe the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of their technical features can be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application, and shall all be included within the protection scope of this application.

Claims (20)

  1. 一种语音区分方法,其特征在于,包括:A method for distinguishing speech, comprising:
    基于语音活动检测算法处理原始待区分语音数据,获取目标待区分语音数据;Processing the original to-be-differentiated voice data based on the voice activity detection algorithm to obtain the target to-be-differentiated voice data;
    基于所述目标待区分语音数据,获取相对应的ASR语音特征;Obtaining corresponding ASR voice characteristics based on the target to-be-differentiated voice data;
    将所述ASR语音特征输入到预先训练好的ASR-RNN模型中进行区分,获取目标区分结果。The ASR speech features are input into a pre-trained ASR-RNN model for discrimination, and a target discrimination result is obtained.
  2. 根据权利要求1所述的语音区分方法,其特征在于,在所述将所述ASR语音特征输入到预先训练好的ASR-RNN模型中进行区分,获取区分结果的步骤之前,所述语音区分方法还包括:获取ASR-RNN模型;The speech discrimination method according to claim 1, characterized in that, before the step of inputting the ASR speech features into a pre-trained ASR-RNN model for discrimination, and obtaining a discrimination result, the speech discrimination method Also includes: obtaining ASR-RNN model;
    所述获取ASR-RNN模型的步骤包括:The step of obtaining the ASR-RNN model includes:
    获取待训练语音数据,并提取所述待训练语音数据的待训练ASR语音特征;Acquiring voice data to be trained, and extracting voice features of the ASR to be trained of the voice data to be trained;
    初始化RNN模型;Initialize the RNN model;
    将待训练ASR语音特征输入到RNN模型中,根据前向传播算法获取RNN模型的输出值,所述输出值表示为:
    Figure PCTCN2018094190-appb-100001
    σ表示激活函数,V表示隐藏层和输出层之间连接的权值,h t表示t时刻的隐藏状态,c表示隐藏层和输出层之间的偏置;
    The ASR speech features to be trained are input into the RNN model, and the output value of the RNN model is obtained according to the forward propagation algorithm, and the output value is expressed as:
    Figure PCTCN2018094190-appb-100001
    σ represents the activation function, V represents the weight of the connection between the hidden layer and the output layer, h t represents the hidden state at time t, and c represents the offset between the hidden layer and the output layer;
    基于所述输出值进行误差反传,更新RNN模型各层的权值和偏置,获取ASR-RNN模型,其中,更新权值V的公式为:
    Figure PCTCN2018094190-appb-100002
    V表示更新前隐藏层和输出层之间连接的的权值,V'表示更新后隐藏层和输出层之间连接的的权值,α表示学习率,t表示t时刻,τ表示总时长,
    Figure PCTCN2018094190-appb-100003
    表示预测输出值,y t表示真实输出值,h t表示t时刻的隐藏状态,T表示矩阵转置运算;更新偏置c的公式为:
    Figure PCTCN2018094190-appb-100004
    c表示更新前隐藏层和输出层之间的偏置,c'表示更新后隐藏层和输出层之间的偏置;更新权值U的公式为:
    Figure PCTCN2018094190-appb-100005
    U表示更新前输入层到隐藏层之间连接的权值,U'表示更新后输入层到隐藏层之间连接的权值,diag()表示构造一个对角矩阵或者以向量的形式返回一个矩阵上对角线元素的矩阵运算,δ t表示隐藏层状态的梯度,x t表示t时刻输入的待训练ASR语音特征;更新权值W的公式为:
    Figure PCTCN2018094190-appb-100006
    W表示更新前隐藏层之间连接的权值,W'表示更新后隐藏层之间连接的权值;更新偏置b的公式为:
    Perform error back propagation based on the output value, update the weights and offsets of each layer of the RNN model, and obtain an ASR-RNN model, where the formula for updating the weight V is:
    Figure PCTCN2018094190-appb-100002
    V represents the weight of the connection between the hidden layer and the output layer before the update, V 'represents the weight of the connection between the hidden layer and the output layer after the update, α represents the learning rate, t represents the time t, and τ represents the total duration,
    Figure PCTCN2018094190-appb-100003
    Represents the predicted output value, y t represents the real output value, h t represents the hidden state at time t, T represents the matrix transposition operation; the formula for updating the offset c is:
    Figure PCTCN2018094190-appb-100004
    c represents the offset between the hidden layer and the output layer before the update, c 'represents the offset between the hidden layer and the output layer after the update; the formula for updating the weight U is:
    Figure PCTCN2018094190-appb-100005
    U represents the weight of the connection between the input layer and the hidden layer before the update, U' represents the weight of the connection between the input layer and the hidden layer after the update, diag() denotes the matrix operation that constructs a diagonal matrix or returns the diagonal elements of a matrix in the form of a vector, δ_t represents the gradient of the hidden layer state, and x_t represents the to-be-trained ASR speech feature input at time t; the formula for updating the weight W is:
    Figure PCTCN2018094190-appb-100006
    W represents the weight of the connections between the hidden layers before the update, W 'represents the weight of the connections between the hidden layers after the update; the formula for updating the offset b is:
    Figure PCTCN2018094190-appb-100007
    b表示更新前输入层和隐藏层之间的偏置,b'表示更新后输入层和隐藏层之间的偏置。
    Figure PCTCN2018094190-appb-100007
    b indicates the offset between the input layer and the hidden layer before the update, and b 'indicates the offset between the input layer and the hidden layer after the update.
  3. 根据权利要求1所述的语音区分方法,其特征在于,所述基于语音活动检测算法处理原始待区分语音数据,获取目标待区分语音数据,包括:The method of claim 1, wherein the processing of the original speech data to be distinguished based on the speech activity detection algorithm to obtain the target speech data to be distinguished comprises:
    根据短时能量特征值计算公式对所述原始待区分语音数据进行处理,获取对应的短时能量特征值,将所述短时能量特征值大于第一阈值的所述原始待区分数据保留,确定为第一原始区分语音数据,短时能量特征值计算公式为
    Figure PCTCN2018094190-appb-100008
    其中,N为语音帧长,s(n)为时域上的信号幅度,n为时间;
    Processing the original to-be-differentiated voice data according to a short-time energy feature value calculation formula, obtaining the corresponding short-time energy feature values, and retaining the original to-be-differentiated data whose short-time energy feature value is greater than a first threshold as the first original distinguished speech data, where the short-time energy feature value calculation formula is
    Figure PCTCN2018094190-appb-100008
    Where N is the speech frame length, s (n) is the signal amplitude in the time domain, and n is the time;
    根据过零率特征值计算公式对所述原始待区分语音数据进行处理,获取对应的过零率特征值,将所述过零率特征值小于第二阈值的所述原始待区分语音数据保留,确定为第二原始区分语音数据,过零率特征值计算公式为
    Figure PCTCN2018094190-appb-100009
    其中,N为语音帧长,s(n)为时域上的信号幅度,n为时间;
    Processing the original to-be-differentiated voice data according to a zero-crossing rate feature value calculation formula, obtaining the corresponding zero-crossing rate feature values, and retaining the original to-be-differentiated voice data whose zero-crossing rate feature value is less than a second threshold as the second original distinguished speech data, where the zero-crossing rate feature value calculation formula is
    Figure PCTCN2018094190-appb-100009
    Where N is the speech frame length, s (n) is the signal amplitude in the time domain, and n is the time;
    将所述第一原始区分语音数据和所述第二原始区分语音数据作为所述目标待区分语音数据。Use the first original distinguished speech data and the second original distinguished speech data as the target to-be-differentiated speech data.
  4. 根据权利要求1所述的语音区分方法,其特征在于,所述基于所述目标待区分语音数据,获取相对应的ASR语音特征,包括:The method according to claim 1, wherein the acquiring the corresponding ASR voice feature based on the target to-be-differentiated voice data comprises:
    对所述目标待区分语音数据进行预处理,获取预处理语音数据;Pre-processing the target to-be-differentiated voice data to obtain pre-processed voice data;
    对所述预处理语音数据作快速傅里叶变换,获取目标待区分语音数据的频谱,并根据所述频谱获取目标待区分语音数据的功率谱;Performing a fast Fourier transform on the pre-processed speech data to obtain a frequency spectrum of target speech data to be distinguished, and obtaining a power spectrum of target speech data to be distinguished according to the frequency spectrum;
    采用梅尔刻度滤波器组处理所述目标待区分语音数据的功率谱,获取目标待区分语音数据的梅尔功率谱;Adopting a Mel scale filter bank to process the power spectrum of the target speech data to be distinguished, and obtain a Mel power spectrum of the target speech data to be distinguished;
    在所述梅尔功率谱上进行倒谱分析,获取目标待区分语音数据的梅尔频率倒谱系数。A cepstrum analysis is performed on the Mel power spectrum to obtain a Mel frequency cepstrum coefficient of the target speech data to be distinguished.
  5. 根据权利要求4所述的语音区分方法,其特征在于,所述对所述目标待区分语音数据进行预处理,获取预处理语音数据,包括:The speech discrimination method according to claim 4, wherein the pre-processing the target to-be-differentiated voice data to obtain pre-processed voice data comprises:
    对所述目标待区分语音数据作预加重处理,预加重处理的计算公式为s' n=s n-a*s n-1,其中,s n为时域上的信号幅度,s n-1为与s n相对应的上一时刻的信号幅度,s' n为预加重后时域上的信号幅度,a为预加重系数,a的取值范围为0.9<a<1.0; Distinguish the speech of the target data to be processed for pre-emphasis, pre-emphasis process is calculated as s' n = s n -a * s n-1, wherein the amplitude of the signal on the time domain s n, s n-1 s n is the amplitude of the signal corresponding to the previous time, s' n for the pre-emphasis signal amplitude on the time domain, a is a pre-emphasis coefficient, a is in the range of 0.9 <a <1.0;
    将预加重后的所述目标待区分语音数据进行分帧处理;Performing frame processing on the pre-emphasized target to-be-differentiated voice data;
    将分帧后的所述目标待区分语音数据进行加窗处理,获取预处理语音数据,加窗的计算公式为
    Figure PCTCN2018094190-appb-100010
    其中,N为窗长,n为时间,s n为时域上的信号幅度,s' n为加窗后时域上的信号幅度。
    The windowed processing is performed on the target to-be-differentiated voice data after framing to obtain preprocessed voice data. The calculation formula of the windowing is
    Figure PCTCN2018094190-appb-100010
    wherein N is the window length, n is time, s_n is the signal amplitude in the time domain, and s'_n is the signal amplitude in the time domain after windowing.
  6. 根据权利要求4所述的语音区分方法,其特征在于,所述在所述梅尔功率谱上进行倒谱分析,获取目标待区分语音数据的梅尔频率倒谱系数,包括:The speech discrimination method according to claim 4, wherein the performing cepstrum analysis on the Mel power spectrum to obtain a Mel frequency cepstrum coefficient of target speech data to be distinguished comprises:
    取所述梅尔功率谱的对数值,获取待变换梅尔功率谱;Taking a log value of the Mel power spectrum to obtain a Mel power spectrum to be transformed;
    对所述待变换梅尔功率谱作离散余弦变换,获取目标待区分语音数据的梅尔频率倒谱系数。Performing discrete cosine transform on the Mel power spectrum to be transformed to obtain a Mel frequency cepstrum coefficient of target speech data to be distinguished.
  7. 一种语音区分装置,其特征在于,包括:A voice distinguishing device, comprising:
    目标待区分语音数据获取模块,用于基于语音活动检测算法处理原始待区分语音数据,获取目标待区分语音数据;Target to-be-differentiated voice data acquisition module, for processing original to-be-differentiated voice data based on a voice activity detection algorithm, to obtain target to-be-differentiated voice data;
    语音特征获取模块,用于基于所述目标待区分语音数据,获取相对应的ASR语音特征;A voice feature acquisition module, configured to acquire a corresponding ASR voice feature based on the target to-be-differentiated voice data;
    目标区分结果获取模块,用于将所述ASR语音特征输入到预先训练好的ASR-RNN模型中进行区分,获取目标区分结果。A target discrimination result acquisition module is configured to input the ASR speech features into a pre-trained ASR-RNN model for discrimination, and obtain a target discrimination result.
  8. 根据权利要求7所述的语音区分装置,其特征在于,所述语音区分装置还包括ASR-RNN模型获取模块,所述ASR-RNN模型获取模块包括:The speech discrimination device according to claim 7, wherein the speech discrimination device further comprises an ASR-RNN model acquisition module, and the ASR-RNN model acquisition module comprises:
    待训练ASR语音特征获取单元,用于获取待训练语音数据,并提取所述待训练语音数据的待训练ASR语音特征;A speech feature acquisition unit to be trained, for acquiring speech data to be trained, and extracting speech feature of the ASR to be trained from the speech data to be trained;
    初始化单元,用于初始化RNN模型;An initialization unit for initializing the RNN model;
    输出值获取单元,用于将待训练ASR语音特征输入到RNN模型中,根据前向传播算法获取RNN模型的输出值,所述输出值表示为:
    Figure PCTCN2018094190-appb-100011
    σ表示激活函数,V表示隐藏层和输出层之间连接的权值,h t表示t时刻的隐藏状态,c表示隐藏层和输出层之间的偏置;
    An output value obtaining unit is configured to input the ASR speech features to be trained into the RNN model, and obtain an output value of the RNN model according to a forward propagation algorithm, where the output value is expressed as:
    Figure PCTCN2018094190-appb-100011
    σ represents the activation function, V represents the weight of the connection between the hidden layer and the output layer, h t represents the hidden state at time t, and c represents the offset between the hidden layer and the output layer;
    更新单元,用于基于所述输出值进行误差反传,更新RNN模型各层的权值和偏置,获取ASR-RNN模型,其中,更新权值V的公式为:
    Figure PCTCN2018094190-appb-100012
    V表示更新前隐藏层和输出层之间连接的的权值,V'表示更新后隐藏层和输出层之间连接的的权值,α表示学习率,t表示t时刻,τ表示总时长,
    Figure PCTCN2018094190-appb-100013
    表示预测输出值,y t表示真实输出值,h t表示t时刻的隐藏状态,T表示矩阵转置运算;更新偏置c的公式为:
    Figure PCTCN2018094190-appb-100014
    c表示更新前隐藏层和输出层之间的偏置,c'表示更新后隐藏层和输出层之间的偏置;更新权值U的公式为:
    Figure PCTCN2018094190-appb-100015
    U表示更新前输入层到隐藏层之间连接的权值,U'表示更新后输入层到隐藏层之间连接的权值,diag()表示构造一个对角矩阵或者以向量的形式返回一个矩阵上对角线元素的矩阵运算,δ t表示隐藏层状态的梯度,x t表示t时刻输入的待训练ASR语音特征;更新权值W的公式为:
    An updating unit is configured to perform error back propagation based on the output value, update the weights and offsets of each layer of the RNN model, and obtain an ASR-RNN model. The formula for updating the weight V is:
    Figure PCTCN2018094190-appb-100012
    V represents the weight of the connection between the hidden layer and the output layer before the update, V 'represents the weight of the connection between the hidden layer and the output layer after the update, α represents the learning rate, t represents the time t, and τ represents the total duration,
    Figure PCTCN2018094190-appb-100013
    Represents the predicted output value, y t represents the real output value, h t represents the hidden state at time t, T represents the matrix transposition operation; the formula for updating the offset c is:
    Figure PCTCN2018094190-appb-100014
    c represents the offset between the hidden layer and the output layer before the update, c 'represents the offset between the hidden layer and the output layer after the update; the formula for updating the weight U is:
    Figure PCTCN2018094190-appb-100015
    U represents the weight of the connection between the input layer and the hidden layer before the update, U' represents the weight of the connection between the input layer and the hidden layer after the update, diag() denotes the matrix operation that constructs a diagonal matrix or returns the diagonal elements of a matrix in the form of a vector, δ_t represents the gradient of the hidden layer state, and x_t represents the to-be-trained ASR speech feature input at time t; the formula for updating the weight W is:
    Figure PCTCN2018094190-appb-100016
    W表示更新前隐藏层之间连接的权值,W'表示更新后隐藏层之间连接的权值;更新偏置b的公式为:
    Figure PCTCN2018094190-appb-100017
    b表示更新前输入层和隐藏层之间的偏置,b'表示更新后输入层和隐藏层之间的偏置。
    Figure PCTCN2018094190-appb-100016
    W represents the weight of the connections between the hidden layers before the update, W 'represents the weight of the connections between the hidden layers after the update; the formula for updating the offset b is:
    Figure PCTCN2018094190-appb-100017
    b indicates the offset between the input layer and the hidden layer before the update, and b 'indicates the offset between the input layer and the hidden layer after the update.
  9. 一种计算机设备,包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机可读指令,其特征在于,所述处理器执行所述计算机可读指令时实现如下步骤:A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, and is characterized in that the processor implements the computer-readable instructions as follows step:
    基于语音活动检测算法处理原始待区分语音数据,获取目标待区分语音数据;Processing the original to-be-differentiated voice data based on the voice activity detection algorithm to obtain the target to-be-differentiated voice data;
    基于所述目标待区分语音数据,获取相对应的ASR语音特征;Obtaining corresponding ASR voice characteristics based on the target to-be-differentiated voice data;
    将所述ASR语音特征输入到预先训练好的ASR-RNN模型中进行区分,获取目标区分结果。The ASR speech features are input into a pre-trained ASR-RNN model for discrimination, and a target discrimination result is obtained.
  10. 根据权利要求9所述的计算机设备,其特征在于,在所述将所述ASR语音特征输入到预先训练好的ASR-RNN模型中进行区分,获取区分结果的步骤之前,所述处理器执行所述计算机可读指令时还实现如下步骤:获取ASR-RNN模型;The computer device according to claim 9, characterized in that before the step of inputting the ASR speech features into a pre-trained ASR-RNN model for differentiation, and obtaining a discrimination result, the processor executes all When describing the computer-readable instructions, the following steps are also implemented: obtaining an ASR-RNN model;
    所述获取ASR-RNN模型的步骤包括:The step of obtaining the ASR-RNN model includes:
    获取待训练语音数据,并提取所述待训练语音数据的待训练ASR语音特征;Acquiring voice data to be trained, and extracting voice features of the ASR to be trained of the voice data to be trained;
    初始化RNN模型;Initialize the RNN model;
    将待训练ASR语音特征输入到RNN模型中,根据前向传播算法获取RNN模型的输出值,所述输出值表示为:
    Figure PCTCN2018094190-appb-100018
    σ表示激活函数,V表示隐藏层和输出层之间连接的权值,h t表示t时刻的隐藏状态,c表示隐藏层和输出层之间的偏置;
    The ASR speech features to be trained are input into the RNN model, and the output value of the RNN model is obtained according to the forward propagation algorithm, and the output value is expressed as:
    Figure PCTCN2018094190-appb-100018
    σ represents the activation function, V represents the weight of the connection between the hidden layer and the output layer, h t represents the hidden state at time t, and c represents the offset between the hidden layer and the output layer;
    基于所述输出值进行误差反传,更新RNN模型各层的权值和偏置,获取ASR-RNN模型,其中,更新权值V的公式为:
    Figure PCTCN2018094190-appb-100019
    V表示更新前隐藏层和输出层之间连接的的权值,V'表示更新后隐藏层和输出层之间连接的的权值,α表示学习率,t表示t时刻,τ表示总时长,
    Figure PCTCN2018094190-appb-100020
    表示预测输出值,y t表示真实输出值,h t表示t时刻的隐藏状态,T表示矩阵转置运算;更新偏置c的公式为:
    Figure PCTCN2018094190-appb-100021
    c表示更新前隐藏层和输出层之间的偏置,c'表示更新后隐藏层和输出层之间的偏置;更新权值U的公式为:
    Figure PCTCN2018094190-appb-100022
    U表示更新前输入层到隐藏层之间连接的权值,U'表示更新后输入层到隐藏层之间连接的权值,diag()表示构造一个对角矩阵或者以向量的形式返回一个矩阵上对角线元素的矩阵运算,δ t表示隐藏层状态的梯度,x t表示t时刻输入的待训练ASR语音特征;更新权值W的公式为:
    Figure PCTCN2018094190-appb-100023
    W表示更新前隐藏层之间连接的权值,W'表示更新后隐藏层之间连接的权值;更新偏置b的公式为:
    Perform error back propagation based on the output value, update the weights and offsets of each layer of the RNN model, and obtain an ASR-RNN model, where the formula for updating the weight V is:
    Figure PCTCN2018094190-appb-100019
    V represents the weight of the connection between the hidden layer and the output layer before the update, V 'represents the weight of the connection between the hidden layer and the output layer after the update, α represents the learning rate, t represents the time t, and τ represents the total duration,
    Figure PCTCN2018094190-appb-100020
    Represents the predicted output value, y t represents the real output value, h t represents the hidden state at time t, T represents the matrix transposition operation; the formula for updating the offset c is:
    Figure PCTCN2018094190-appb-100021
    c represents the offset between the hidden layer and the output layer before the update, c 'represents the offset between the hidden layer and the output layer after the update; the formula for updating the weight U is:
    Figure PCTCN2018094190-appb-100022
    U represents the weight of the connection between the input layer and the hidden layer before the update, U' represents the weight of the connection between the input layer and the hidden layer after the update, diag() denotes the matrix operation that constructs a diagonal matrix or returns the diagonal elements of a matrix in the form of a vector, δ_t represents the gradient of the hidden layer state, and x_t represents the to-be-trained ASR speech feature input at time t; the formula for updating the weight W is:
    Figure PCTCN2018094190-appb-100023
    W represents the weight of the connections between the hidden layers before the update, W 'represents the weight of the connections between the hidden layers after the update; the formula for updating the offset b is:
    Figure PCTCN2018094190-appb-100024
    b表示更新前输入层和隐藏层之间的偏置,b'表示更新后输入层和隐藏层之间的偏置。
    Figure PCTCN2018094190-appb-100024
    b indicates the offset between the input layer and the hidden layer before the update, and b 'indicates the offset between the input layer and the hidden layer after the update.
  11. 根据权利要求9所述的计算机设备,其特征在于,所述基于语音活动检测算法处理原始待区分语音数据,获取目标待区分语音数据,包括:The computer device according to claim 9, wherein the processing of the original voice data to be distinguished based on the voice activity detection algorithm to obtain the target voice data to be distinguished comprises:
    根据短时能量特征值计算公式对所述原始待区分语音数据进行处理,获取对应的短时能量特征值,将所述短时能量特征值大于第一阈值的所述原始待区分数据保留,确定为第一原始区分语音数据,短时能量特征值计算公式为
    Figure PCTCN2018094190-appb-100025
    其中,N为语音帧长,s(n)为时域上的信号幅度,n为时间;
    Processing the original to-be-differentiated voice data according to a short-time energy feature value calculation formula, obtaining the corresponding short-time energy feature values, and retaining the original to-be-differentiated data whose short-time energy feature value is greater than a first threshold as the first original distinguished speech data, where the short-time energy feature value calculation formula is
    Figure PCTCN2018094190-appb-100025
    Where N is the speech frame length, s (n) is the signal amplitude in the time domain, and n is the time;
    根据过零率特征值计算公式对所述原始待区分语音数据进行处理,获取对应的过零率特征值,将所述过零率特征值小于第二阈值的所述原始待区分语音数据保留,确定为第二原始区分语音数据,过零率特征值计算公式为
    Figure PCTCN2018094190-appb-100026
    其中,N为语音帧长,s(n)为时域上的信号幅度,n为 时间;
    Processing the original to-be-differentiated voice data according to a zero-crossing rate feature value calculation formula, obtaining the corresponding zero-crossing rate feature values, and retaining the original to-be-differentiated voice data whose zero-crossing rate feature value is less than a second threshold as the second original distinguished speech data, where the zero-crossing rate feature value calculation formula is
    Figure PCTCN2018094190-appb-100026
    Where N is the speech frame length, s (n) is the signal amplitude in the time domain, and n is the time;
    将所述第一原始区分语音数据和所述第二原始区分语音数据作为所述目标待区分语音数据。Use the first original distinguished speech data and the second original distinguished speech data as the target to-be-differentiated speech data.
  12. 根据权利要求9所述的计算机设备,其特征在于,所述基于所述目标待区分语音数据,获取相对应的ASR语音特征,包括:The computer device according to claim 9, wherein the acquiring the corresponding ASR voice feature based on the target to-be-differentiated voice data comprises:
    对所述目标待区分语音数据进行预处理,获取预处理语音数据;Pre-processing the target to-be-differentiated voice data to obtain pre-processed voice data;
    对所述预处理语音数据作快速傅里叶变换,获取目标待区分语音数据的频谱,并根据所述频谱获取目标待区分语音数据的功率谱;Performing a fast Fourier transform on the pre-processed speech data to obtain a frequency spectrum of target speech data to be distinguished, and obtaining a power spectrum of target speech data to be distinguished according to the frequency spectrum;
    采用梅尔刻度滤波器组处理所述目标待区分语音数据的功率谱,获取目标待区分语音数据的梅尔功率谱;Adopting a Mel scale filter bank to process the power spectrum of the target speech data to be distinguished, and obtain a Mel power spectrum of the target speech data to be distinguished;
    在所述梅尔功率谱上进行倒谱分析,获取目标待区分语音数据的梅尔频率倒谱系数。A cepstrum analysis is performed on the Mel power spectrum to obtain a Mel frequency cepstrum coefficient of the target speech data to be distinguished.
  13. 根据权利要求12所述的计算机设备,其特征在于,所述对所述目标待区分语音数据进行预处理,获取预处理语音数据,包括:The computer device according to claim 12, wherein the pre-processing the target to-be-differentiated voice data to obtain the pre-processed voice data comprises:
    对所述目标待区分语音数据作预加重处理,预加重处理的计算公式为s' n=s n-a*s n-1,其中,s n为时域上的信号幅度,s n-1为与s n相对应的上一时刻的信号幅度,s' n为预加重后时域上的信号幅度,a为预加重系数,a的取值范围为0.9<a<1.0; Distinguish the speech of the target data to be processed for pre-emphasis, pre-emphasis process is calculated as s' n = s n -a * s n-1, wherein the amplitude of the signal on the time domain s n, s n-1 s n is the amplitude of the signal corresponding to the previous time, s' n for the pre-emphasis signal amplitude on the time domain, a is a pre-emphasis coefficient, a is in the range of 0.9 <a <1.0;
    将预加重后的所述目标待区分语音数据进行分帧处理;Performing frame processing on the pre-emphasized target to-be-differentiated voice data;
    将分帧后的所述目标待区分语音数据进行加窗处理,获取预处理语音数据,加窗的计算公式为
    Figure PCTCN2018094190-appb-100027
    其中,N为窗长,n为时间,s n为时域上的信号幅度,s' n为加窗后时域上的信号幅度。
    The windowed processing is performed on the target to-be-differentiated voice data after framing to obtain preprocessed voice data. The calculation formula of the windowing is
    Figure PCTCN2018094190-appb-100027
    wherein N is the window length, n is time, s_n is the signal amplitude in the time domain, and s'_n is the signal amplitude in the time domain after windowing.
  14. 根据权利要求12所述的计算机设备,其特征在于,所述在所述梅尔功率谱上进行倒谱分析,获取目标待区分语音数据的梅尔频率倒谱系数,包括:The computer device according to claim 12, wherein the performing cepstrum analysis on the Mel power spectrum to obtain a Mel frequency cepstrum coefficient of target speech data to be distinguished comprises:
    取所述梅尔功率谱的对数值,获取待变换梅尔功率谱;Taking a log value of the Mel power spectrum to obtain a Mel power spectrum to be transformed;
    对所述待变换梅尔功率谱作离散余弦变换,获取目标待区分语音数据的梅尔频率倒谱系数。Performing discrete cosine transform on the Mel power spectrum to be transformed to obtain a Mel frequency cepstrum coefficient of target speech data to be distinguished.
  15. 一个或多个存储有计算机可读指令的非易失性可读存储介质,其特征在于,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器执行如下步骤:One or more non-volatile readable storage media storing computer readable instructions, characterized in that when the computer readable instructions are executed by one or more processors, the one or more processors are caused to execute The following steps:
    基于语音活动检测算法处理原始待区分语音数据,获取目标待区分语音数据;Processing the original to-be-differentiated voice data based on the voice activity detection algorithm to obtain the target to-be-differentiated voice data;
    基于所述目标待区分语音数据,获取相对应的ASR语音特征;Obtaining corresponding ASR voice characteristics based on the target to-be-differentiated voice data;
    将所述ASR语音特征输入到预先训练好的ASR-RNN模型中进行区分,获取目标区分结果。The ASR speech features are input into a pre-trained ASR-RNN model for discrimination, and a target discrimination result is obtained.
  16. 根据权利要求15所述的非易失性可读存储介质,其特征在于,在所述将所述ASR语音特征输入到预先训练好的ASR-RNN模型中进行区分,获取区分结果的步骤之前,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器还执行如下步骤:获取ASR-RNN模型;The non-volatile readable storage medium according to claim 15, characterized in that before the step of inputting the ASR speech feature into a pre-trained ASR-RNN model for discrimination, and obtaining a discrimination result, When the computer-readable instructions are executed by one or more processors, the one or more processors further perform the following steps: obtaining an ASR-RNN model;
    所述获取ASR-RNN模型的步骤包括:The step of obtaining the ASR-RNN model includes:
    获取待训练语音数据,并提取所述待训练语音数据的待训练ASR语音特征;Acquiring voice data to be trained, and extracting voice features of the ASR to be trained of the voice data to be trained;
    初始化RNN模型;Initialize the RNN model;
    将待训练ASR语音特征输入到RNN模型中,根据前向传播算法获取RNN模型的输出值,所述输出值表示为:
    Figure PCTCN2018094190-appb-100028
    σ表示激活函数,V表示隐藏层和输出层之间连接的权值,h t表示t时刻的隐藏状态,c表示隐藏层和输出层之间的偏置;
    The ASR speech features to be trained are input into the RNN model, and the output value of the RNN model is obtained according to the forward propagation algorithm, and the output value is expressed as:
    Figure PCTCN2018094190-appb-100028
    σ represents the activation function, V represents the weight of the connection between the hidden layer and the output layer, h t represents the hidden state at time t, and c represents the offset between the hidden layer and the output layer;
    基于所述输出值进行误差反传,更新RNN模型各层的权值和偏置,获取ASR-RNN模型,其中,更新权值V的公式为:
    Figure PCTCN2018094190-appb-100029
    V表示更新前隐藏层和输出层之间连接的的权值,V'表 示更新后隐藏层和输出层之间连接的的权值,α表示学习率,t表示t时刻,τ表示总时长,
    Figure PCTCN2018094190-appb-100030
    表示预测输出值,y t表示真实输出值,h t表示t时刻的隐藏状态,T表示矩阵转置运算;更新偏置c的公式为:
    Figure PCTCN2018094190-appb-100031
    c表示更新前隐藏层和输出层之间的偏置,c'表示更新后隐藏层和输出层之间的偏置;更新权值U的公式为:
    Figure PCTCN2018094190-appb-100032
    U表示更新前输入层到隐藏层之间连接的权值,U'表示更新后输入层到隐藏层之间连接的权值,diag()表示构造一个对角矩阵或者以向量的形式返回一个矩阵上对角线元素的矩阵运算,δ t表示隐藏层状态的梯度,x t表示t时刻输入的待训练ASR语音特征;更新权值W的公式为:
    Figure PCTCN2018094190-appb-100033
    W表示更新前隐藏层之间连接的权值,W'表示更新后隐藏层之间连接的权值;更新偏置b的公式为:
    Perform error back propagation based on the output value, update the weights and offsets of each layer of the RNN model, and obtain an ASR-RNN model, where the formula for updating the weight V is:
    Figure PCTCN2018094190-appb-100029
    V represents the weight of the connection between the hidden layer and the output layer before the update, V 'represents the weight of the connection between the hidden layer and the output layer after the update, α represents the learning rate, t represents the time t, and τ represents the total duration,
    Figure PCTCN2018094190-appb-100030
    Represents the predicted output value, y t represents the real output value, h t represents the hidden state at time t, T represents the matrix transposition operation; the formula for updating the offset c is:
    Figure PCTCN2018094190-appb-100031
    c represents the offset between the hidden layer and the output layer before the update, c 'represents the offset between the hidden layer and the output layer after the update; the formula for updating the weight U is:
    Figure PCTCN2018094190-appb-100032
    U represents the weight of the connection between the input layer and the hidden layer before the update, U' represents the weight of the connection between the input layer and the hidden layer after the update, diag() denotes the matrix operation that constructs a diagonal matrix or returns the diagonal elements of a matrix in the form of a vector, δ_t represents the gradient of the hidden layer state, and x_t represents the to-be-trained ASR speech feature input at time t; the formula for updating the weight W is:
    Figure PCTCN2018094190-appb-100033
    W represents the weight of the connections between the hidden layers before the update, W 'represents the weight of the connections between the hidden layers after the update; the formula for updating the offset b is:
    Figure PCTCN2018094190-appb-100034
    b表示更新前输入层和隐藏层之间的偏置,b'表示更新后输入层和隐藏层之间的偏置。
    Figure PCTCN2018094190-appb-100034
    b indicates the offset between the input layer and the hidden layer before the update, and b 'indicates the offset between the input layer and the hidden layer after the update.
  17. 根据权利要求15所述的非易失性可读存储介质,其特征在于,所述基于语音活动检测算法处理原始待区分语音数据,获取目标待区分语音数据,包括:The non-volatile readable storage medium according to claim 15, wherein the processing of the original voice data to be distinguished based on the voice activity detection algorithm to obtain the target voice data to be distinguished comprises:
    根据短时能量特征值计算公式对所述原始待区分语音数据进行处理,获取对应的短时能量特征值,将所述短时能量特征值大于第一阈值的所述原始待区分数据保留,确定为第一原始区分语音数据,短时能量特征值计算公式为
    Figure PCTCN2018094190-appb-100035
    其中,N为语音帧长,s(n)为时域上的信号幅度,n为时间;
    Processing the original to-be-differentiated voice data according to a short-time energy feature value calculation formula, obtaining the corresponding short-time energy feature values, and retaining the original to-be-differentiated data whose short-time energy feature value is greater than a first threshold as the first original distinguished speech data, where the short-time energy feature value calculation formula is
    Figure PCTCN2018094190-appb-100035
    Where N is the speech frame length, s (n) is the signal amplitude in the time domain, and n is the time;
    根据过零率特征值计算公式对所述原始待区分语音数据进行处理,获取对应的过零率特征值,将所述过零率特征值小于第二阈值的所述原始待区分语音数据保留,确定为第二原始区分语音数据,过零率特征值计算公式为
    Figure PCTCN2018094190-appb-100036
    其中,N为语音帧长,s(n)为时域上的信号幅度,n为时间;
    Processing the original to-be-differentiated voice data according to a zero-crossing rate feature value calculation formula, obtaining the corresponding zero-crossing rate feature values, and retaining the original to-be-differentiated voice data whose zero-crossing rate feature value is less than a second threshold as the second original distinguished speech data, where the zero-crossing rate feature value calculation formula is
    Figure PCTCN2018094190-appb-100036
    Where N is the speech frame length, s (n) is the signal amplitude in the time domain, and n is the time;
    Using the first original differentiated voice data and the second original differentiated voice data as the target voice data to be differentiated.
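    As an illustration of the frame selection in claim 17, the following is a minimal NumPy sketch that computes the short-time energy and zero-crossing rate of each frame and keeps the union of the two retained sets; the function name, array layout and threshold values are assumptions for the example only.

```python
import numpy as np

def select_target_frames(frames, energy_threshold, zcr_threshold):
    """frames: array of shape (num_frames, N) of time-domain samples.
    Keeps frames whose short-time energy exceeds the first threshold or
    whose zero-crossing rate is below the second threshold (sketch)."""
    # short-time energy: E = sum_{n=0}^{N-1} s(n)^2
    energy = np.sum(frames ** 2, axis=1)

    # zero-crossing rate: 0.5 * sum_{n=1}^{N-1} |sgn(s(n)) - sgn(s(n-1))|
    zcr = 0.5 * np.sum(np.abs(np.diff(np.sign(frames), axis=1)), axis=1)

    keep = (energy > energy_threshold) | (zcr < zcr_threshold)
    return frames[keep]
```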
  18. The non-volatile readable storage medium according to claim 15, wherein the obtaining of the corresponding ASR speech features based on the target voice data to be differentiated comprises:
    Pre-processing the target voice data to be differentiated to obtain pre-processed voice data;
    Performing a fast Fourier transform on the pre-processed voice data to obtain the frequency spectrum of the target voice data to be differentiated, and obtaining the power spectrum of the target voice data to be differentiated according to the frequency spectrum;
    Processing the power spectrum of the target voice data to be differentiated with a Mel-scale filter bank to obtain the Mel power spectrum of the target voice data to be differentiated;
    Performing cepstral analysis on the Mel power spectrum to obtain the Mel-frequency cepstral coefficients of the target voice data to be differentiated.
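    A minimal sketch of the spectral steps of claim 18 (fast Fourier transform, power spectrum, Mel power spectrum), assuming pre-processed frames and a Mel-scale filter bank matrix built elsewhere; the FFT length and array shapes are illustrative assumptions.

```python
import numpy as np

def mel_power_spectrum(frames, mel_filters, nfft=512):
    """frames: (num_frames, frame_len) pre-processed (windowed) frames.
    mel_filters: (num_filters, nfft // 2 + 1) Mel-scale filter bank matrix.
    Returns the Mel power spectrum of each frame (sketch)."""
    spectrum = np.abs(np.fft.rfft(frames, n=nfft, axis=1))   # frequency spectrum
    power_spectrum = (spectrum ** 2) / nfft                  # power spectrum
    return power_spectrum @ mel_filters.T                    # Mel power spectrum
```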
  19. The non-volatile readable storage medium according to claim 18, wherein the pre-processing of the target voice data to be differentiated to obtain pre-processed voice data comprises:
    Performing pre-emphasis on the target voice data to be differentiated, where the pre-emphasis calculation formula is s'_n = s_n - a·s_{n-1}, where s_n is the signal amplitude in the time domain, s_{n-1} is the signal amplitude at the previous time instant corresponding to s_n, s'_n is the signal amplitude in the time domain after pre-emphasis, and a is the pre-emphasis coefficient with a value range of 0.9 < a < 1.0;
    Performing framing on the pre-emphasized target voice data to be differentiated;
    Performing windowing on the framed target voice data to be differentiated to obtain the pre-processed voice data, where the windowing calculation formula is
    s'_n = s_n × (0.54 - 0.46·cos(2πn/(N-1)))
    where N is the window length, n is time, s_n is the signal amplitude in the time domain, and s'_n is the signal amplitude in the time domain after windowing.
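    A minimal sketch of the pre-processing in claim 19 (pre-emphasis, framing, windowing), assuming a 1-D NumPy signal; the frame length, frame shift and a = 0.97 are illustrative choices, with a inside the stated range 0.9 < a < 1.0.

```python
import numpy as np

def preprocess(signal, frame_len=400, frame_shift=160, a=0.97):
    """Pre-emphasis, framing and windowing of a 1-D signal (sketch).
    Assumes len(signal) >= frame_len."""
    # pre-emphasis: s'_n = s_n - a * s_{n-1}
    emphasized = np.append(signal[0], signal[1:] - a * signal[:-1])

    # framing with a fixed shift between consecutive frames
    num_frames = (len(emphasized) - frame_len) // frame_shift + 1
    frames = np.stack([emphasized[i * frame_shift:i * frame_shift + frame_len]
                       for i in range(num_frames)])

    # windowing: s'_n = s_n * (0.54 - 0.46 * cos(2*pi*n / (N - 1)))
    n = np.arange(frame_len)
    window = 0.54 - 0.46 * np.cos(2 * np.pi * n / (frame_len - 1))
    return frames * window
```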
  20. The non-volatile readable storage medium according to claim 18, wherein the performing of cepstral analysis on the Mel power spectrum to obtain the Mel-frequency cepstral coefficients of the target voice data to be differentiated comprises:
    Taking the logarithm of the Mel power spectrum to obtain the Mel power spectrum to be transformed;
    Performing a discrete cosine transform on the Mel power spectrum to be transformed to obtain the Mel-frequency cepstral coefficients of the target voice data to be differentiated.
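    A minimal sketch of the cepstral analysis in claim 20, taking the logarithm of the Mel power spectrum and applying a discrete cosine transform; keeping the first 13 coefficients and adding a small epsilon before the logarithm are assumptions for the example.

```python
import numpy as np
from scipy.fftpack import dct

def cepstral_analysis(mel_power, num_ceps=13):
    """mel_power: (num_frames, num_filters) Mel power spectrum.
    Returns the Mel-frequency cepstral coefficients of each frame (sketch)."""
    log_mel = np.log(mel_power + 1e-10)   # logarithm of the Mel power spectrum
    return dct(log_mel, type=2, axis=-1, norm='ortho')[:, :num_ceps]
```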
PCT/CN2018/094190 2018-06-04 2018-07-03 Speech differentiation method and apparatus, and computer device and storage medium WO2019232846A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810561788.1A CN108922513B (en) 2018-06-04 2018-06-04 Voice distinguishing method and device, computer equipment and storage medium
CN201810561788.1 2018-06-04

Publications (1)

Publication Number Publication Date
WO2019232846A1 true WO2019232846A1 (en) 2019-12-12

Family

ID=64419509

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/094190 WO2019232846A1 (en) 2018-06-04 2018-07-03 Speech differentiation method and apparatus, and computer device and storage medium

Country Status (2)

Country Link
CN (1) CN108922513B (en)
WO (1) WO2019232846A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112581940A (en) * 2020-09-17 2021-03-30 国网江苏省电力有限公司信息通信分公司 Discharging sound detection method based on edge calculation and neural network
CN112598114A (en) * 2020-12-17 2021-04-02 海光信息技术股份有限公司 Power consumption model construction method, power consumption measurement method and device and electronic equipment
CN113223511A (en) * 2020-01-21 2021-08-06 珠海市煊扬科技有限公司 Audio processing device for speech recognition
CN117648717A (en) * 2024-01-29 2024-03-05 知学云(北京)科技股份有限公司 Privacy protection method for artificial intelligent voice training

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109545193B (en) * 2018-12-18 2023-03-14 百度在线网络技术(北京)有限公司 Method and apparatus for generating a model
CN109658920B (en) * 2018-12-18 2020-10-09 百度在线网络技术(北京)有限公司 Method and apparatus for generating a model
CN109545192B (en) * 2018-12-18 2022-03-08 百度在线网络技术(北京)有限公司 Method and apparatus for generating a model
CN110265065B (en) * 2019-05-13 2021-08-03 厦门亿联网络技术股份有限公司 Method for constructing voice endpoint detection model and voice endpoint detection system
CN110189747A (en) * 2019-05-29 2019-08-30 大众问问(北京)信息科技有限公司 Voice signal recognition methods, device and equipment
CN110288999B (en) * 2019-07-02 2020-12-11 腾讯科技(深圳)有限公司 Speech recognition method, speech recognition device, computer equipment and storage medium
CN110838307B (en) * 2019-11-18 2022-02-25 思必驰科技股份有限公司 Voice message processing method and device
CN112908303A (en) * 2021-01-28 2021-06-04 广东优碧胜科技有限公司 Audio signal processing method and device and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105139864A (en) * 2015-08-17 2015-12-09 北京天诚盛业科技有限公司 Voice recognition method and voice recognition device
US20170154033A1 (en) * 2015-11-30 2017-06-01 Samsung Electronics Co., Ltd. Speech recognition apparatus and method
CN107731233A (en) * 2017-11-03 2018-02-23 王华锋 A kind of method for recognizing sound-groove based on RNN
CN107799126A (en) * 2017-10-16 2018-03-13 深圳狗尾草智能科技有限公司 Sound end detecting method and device based on Supervised machine learning

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100580770C (en) * 2005-08-08 2010-01-13 中国科学院声学研究所 Voice end detection method based on energy and harmonic
US9263036B1 (en) * 2012-11-29 2016-02-16 Google Inc. System and method for speech recognition using deep recurrent neural networks
US9858919B2 (en) * 2013-11-27 2018-01-02 International Business Machines Corporation Speaker adaptation of neural network acoustic models using I-vectors
CN104157290B (en) * 2014-08-19 2017-10-24 大连理工大学 A kind of method for distinguishing speek person based on deep learning
CN107871497A (en) * 2016-09-23 2018-04-03 北京眼神科技有限公司 Audio recognition method and device
CN107680597B (en) * 2017-10-23 2019-07-09 平安科技(深圳)有限公司 Audio recognition method, device, equipment and computer readable storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105139864A (en) * 2015-08-17 2015-12-09 北京天诚盛业科技有限公司 Voice recognition method and voice recognition device
US20170154033A1 (en) * 2015-11-30 2017-06-01 Samsung Electronics Co., Ltd. Speech recognition apparatus and method
CN107799126A (en) * 2017-10-16 2018-03-13 深圳狗尾草智能科技有限公司 Sound end detecting method and device based on Supervised machine learning
CN107731233A (en) * 2017-11-03 2018-02-23 王华锋 A kind of method for recognizing sound-groove based on RNN

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113223511A (en) * 2020-01-21 2021-08-06 珠海市煊扬科技有限公司 Audio processing device for speech recognition
CN113223511B (en) * 2020-01-21 2024-04-16 珠海市煊扬科技有限公司 Audio processing device for speech recognition
CN112581940A (en) * 2020-09-17 2021-03-30 国网江苏省电力有限公司信息通信分公司 Discharging sound detection method based on edge calculation and neural network
CN112598114A (en) * 2020-12-17 2021-04-02 海光信息技术股份有限公司 Power consumption model construction method, power consumption measurement method and device and electronic equipment
CN112598114B (en) * 2020-12-17 2023-11-03 海光信息技术股份有限公司 Power consumption model construction method, power consumption measurement method, device and electronic equipment
CN117648717A (en) * 2024-01-29 2024-03-05 知学云(北京)科技股份有限公司 Privacy protection method for artificial intelligent voice training
CN117648717B (en) * 2024-01-29 2024-05-03 知学云(北京)科技股份有限公司 Privacy protection method for artificial intelligent voice training

Also Published As

Publication number Publication date
CN108922513B (en) 2023-03-17
CN108922513A (en) 2018-11-30

Similar Documents

Publication Publication Date Title
WO2019232846A1 (en) Speech differentiation method and apparatus, and computer device and storage medium
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
JP7177167B2 (en) Mixed speech identification method, apparatus and computer program
Zhao et al. Monaural speech dereverberation using temporal convolutional networks with self attention
WO2019232833A1 (en) Speech differentiating method and device, computer device and storage medium
Bhat et al. A real-time convolutional neural network based speech enhancement for hearing impaired listeners using smartphone
CN112735456B (en) Speech enhancement method based on DNN-CLSTM network
CN110767244B (en) Speech enhancement method
Zhao et al. Late reverberation suppression using recurrent neural networks with long short-term memory
WO2019232867A1 (en) Voice discrimination method and apparatus, and computer device, and storage medium
CN112700786B (en) Speech enhancement method, device, electronic equipment and storage medium
CN111292762A (en) Single-channel voice separation method based on deep learning
CN111968666B (en) Hearing aid voice enhancement method based on depth domain self-adaptive network
CN108922543B (en) Model base establishing method, voice recognition method, device, equipment and medium
CN111899757A (en) Single-channel voice separation method and system for target speaker extraction
US20230395087A1 (en) Machine Learning for Microphone Style Transfer
Hou et al. Domain adversarial training for speech enhancement
Lee et al. Dynamic noise embedding: Noise aware training and adaptation for speech enhancement
WO2022213825A1 (en) Neural network-based end-to-end speech enhancement method and apparatus
CN111899750A (en) Speech enhancement algorithm combining cochlear speech features and hopping deep neural network
CN114613387A (en) Voice separation method and device, electronic equipment and storage medium
Poovarasan et al. Speech enhancement using sliding window empirical mode decomposition and hurst-based technique
Schmidt et al. Reduction of non-stationary noise using a non-negative latent variable decomposition
CN116403594A (en) Speech enhancement method and device based on noise update factor
CN111091847A (en) Deep clustering voice separation method based on improvement

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18921594

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 12.03.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18921594

Country of ref document: EP

Kind code of ref document: A1