WO2019232845A1 - Voice data processing method and apparatus, computer device, and storage medium


Info

Publication number
WO2019232845A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice data, frame, speech, short term
Application number
PCT/CN2018/094184
Other languages
French (fr)
Chinese (zh)
Inventor
涂宏 (TU Hong)
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Ping An Technology (Shenzhen) Co., Ltd.
Publication of WO2019232845A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/05 Word boundary detection
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training

Abstract

A voice data processing method and apparatus, a computer device, and a storage medium. The voice data processing method comprises: obtaining original voice data (S10); performing framing and segmentation processing on the original voice data by using a VAD algorithm to obtain at least two frames of voice data to be detected (S20); performing feature extraction on each frame of the voice data to be detected by using an ASR voice feature extraction algorithm to obtain filter voice features to be detected (S30); recognizing the filter voice features to be detected by using a trained ASR-LSTM speech recognition model to obtain a recognition probability value (S40); and, if the recognition probability value is greater than a preset probability value, using the voice data to be detected as target voice data (S50). The voice data processing method effectively eliminates the interference of noise and silence, thereby improving the accuracy of model recognition.

Description

Voice data processing method, apparatus, computer device and storage medium
This application is based on, and claims priority to, Chinese invention patent application No. 201810561725.6, filed on June 4, 2018 and entitled "Voice data processing method, apparatus, computer device and storage medium".
Technical Field
This application relates to the technical field of speech recognition, and in particular to a voice data processing method, apparatus, computer device and storage medium.
Background
Voice Activity Detection (VAD), also known as voice endpoint detection or voice boundary detection, identifies and removes long periods of silence from an audio signal stream, so that channel resources are saved without degrading service quality.
At present, training or running a speech recognition model requires relatively clean voice data, but the voice data actually available is often mixed with noise or silence. When such noisy voice data is used for training, the resulting speech recognition model has low accuracy, which hinders the wider adoption of speech recognition models.
Summary
In view of this, it is necessary to provide a voice data processing method, apparatus, computer device and storage medium that address the technical problem of the low accuracy of speech recognition models in the prior art.
A voice data processing method includes:
obtaining original voice data;
performing framing and segmentation processing on the original voice data by using a VAD algorithm to obtain at least two frames of voice data to be detected;
performing feature extraction on each frame of the voice data to be detected by using an ASR voice feature extraction algorithm to obtain filter voice features to be detected;
recognizing the filter voice features to be detected by using a trained ASR-LSTM speech recognition model to obtain a recognition probability value; and
if the recognition probability value is greater than a preset probability value, using the voice data to be detected as target voice data.
A voice data processing apparatus includes:
an original voice data obtaining module, configured to obtain original voice data;
a to-be-detected voice data obtaining module, configured to perform framing and segmentation processing on the original voice data by using a VAD algorithm to obtain at least two frames of voice data to be detected;
a to-be-detected filter voice feature obtaining module, configured to perform feature extraction on each frame of the voice data to be detected by using an ASR voice feature extraction algorithm to obtain filter voice features to be detected;
a recognition probability value obtaining module, configured to recognize the filter voice features to be detected by using a trained ASR-LSTM speech recognition model to obtain a recognition probability value; and
a target voice data obtaining module, configured to use the voice data to be detected as target voice data if the recognition probability value is greater than a preset probability value.
A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor, when executing the computer-readable instructions, implements the following steps:
obtaining original voice data;
performing framing and segmentation processing on the original voice data by using a VAD algorithm to obtain at least two frames of voice data to be detected;
performing feature extraction on each frame of the voice data to be detected by using an ASR voice feature extraction algorithm to obtain filter voice features to be detected;
recognizing the filter voice features to be detected by using a trained ASR-LSTM speech recognition model to obtain a recognition probability value; and
if the recognition probability value is greater than a preset probability value, using the voice data to be detected as target voice data.
One or more non-volatile readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
obtaining original voice data;
performing framing and segmentation processing on the original voice data by using a VAD algorithm to obtain at least two frames of voice data to be detected;
performing feature extraction on each frame of the voice data to be detected by using an ASR voice feature extraction algorithm to obtain filter voice features to be detected;
recognizing the filter voice features to be detected by using a trained ASR-LSTM speech recognition model to obtain a recognition probability value; and
if the recognition probability value is greater than a preset probability value, using the voice data to be detected as target voice data.
Details of one or more embodiments of this application are set forth in the accompanying drawings and the description below. Other features and advantages of this application will become apparent from the description, the drawings, and the claims.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of this application more clearly, the following briefly introduces the drawings required in the description of the embodiments. Apparently, the drawings described below show only some embodiments of this application, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a diagram of an application environment of a voice data processing method according to an embodiment of this application;
FIG. 2 is a flowchart of a voice data processing method according to an embodiment of this application;
FIG. 3 is a detailed flowchart of step S20 in FIG. 2;
FIG. 4 is a detailed flowchart of step S30 in FIG. 2;
FIG. 5 is another flowchart of a voice data processing method according to an embodiment of this application;
FIG. 6 is a detailed flowchart of step S63 in FIG. 5;
FIG. 7 is a schematic diagram of a voice data processing apparatus according to an embodiment of this application;
FIG. 8 is a schematic diagram of a computer device according to an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of this application are described below clearly and completely with reference to the drawings in the embodiments of this application. Apparently, the described embodiments are only some rather than all of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort fall within the protection scope of this application.
The voice data processing method provided in this application can be applied in the application environment shown in FIG. 1, in which a computer device communicates with a server through a network. The computer device may be, but is not limited to, a personal computer, a laptop, a smartphone, a tablet, or a portable wearable device. The server may be implemented as a stand-alone server.
Specifically, the voice data processing method is applied on computer devices deployed by banks, securities firms, insurance companies, or other institutions, and is used to preprocess original voice data to obtain training data, so that the training data can be used to train a voiceprint model or another speech model and thereby improve model recognition accuracy.
In an embodiment, as shown in FIG. 2, a voice data processing method is provided. Taking the method applied to the server in FIG. 1 as an example, the method includes the following steps:
S10: Obtain original voice data.
The original voice data is speaker voice data recorded with a recording device and has not yet been processed. In this embodiment, the original voice data may be voice data in wav, mp3, or another format. The original voice data includes target voice data and interfering voice data. The target voice data refers to the portion of the original voice data in which the voiceprint changes continuously and distinctly, generally the speaker's speech. Correspondingly, the interfering voice data refers to the portion of the original voice data other than the target voice data, that is, everything other than the speaker's speech. Specifically, the interfering voice data includes silent segments and noise segments. A silent segment is a portion of the original voice data in which nothing is voiced, for example when the speaker pauses to think or breathe while speaking. A noise segment is a portion corresponding to environmental noise, such as the sound of doors and windows opening and closing or of objects colliding.
S20: Perform framing and segmentation processing on the original voice data by using a VAD algorithm to obtain at least two frames of voice data to be detected.
The voice data to be detected is the original voice data from which the silent segments of the interfering voice data have been cut out by the VAD algorithm. The VAD (Voice Activity Detection) algorithm accurately locates the start and end of the target voice data in a noisy environment. The VAD algorithm can identify and remove long silent segments from the signal stream of the original voice data, thereby eliminating the silent-segment interference in the original voice data and improving the precision of voice data processing.
A frame is the smallest observation unit of voice data, and framing is the process of dividing voice data along its time axis. The original voice data is not stationary as a whole but can be regarded as stationary locally, so framing the original voice data yields relatively stationary single frames of voice data. Speech recognition and voiceprint recognition require a stationary signal as input, so the server first performs framing on the original voice data.
Segmentation is the process of cutting out the single-frame voice data that belongs to silent segments. In this embodiment, the VAD algorithm is used to segment the framed original voice data and remove the silent segments, so as to obtain at least two frames of voice data to be detected.
In an embodiment, as shown in FIG. 3, step S20 of performing framing and segmentation processing on the original voice data by using the VAD algorithm to obtain at least two frames of voice data to be detected specifically includes the following steps:
S21: Perform framing on the original voice data to obtain at least two single frames of voice data.
Framing collects N sampling points into one observation unit, called a frame. Typically N is 256 or 512, covering roughly 20-30 ms. To avoid excessive change between two adjacent frames, the adjacent frames share an overlapping region of M sampling points, where M is usually about 1/2 or 1/3 of N. This process is called framing. Specifically, after the original voice data is framed, at least two single frames of voice data are obtained, each containing N sampling points.
Further, in the single frames obtained by framing the original voice data, discontinuities appear at the beginning and end of each frame, and the more frames there are, the larger the error between the framed single-frame voice data and the original voice data. To make the framed single frames continuous, so that each frame exhibits the characteristics of a periodic function, each single frame of voice data is further subjected to windowing and pre-emphasis, yielding single-frame voice data of better quality.
Windowing multiplies each frame by a Hamming window. Because the amplitude-frequency characteristic of the Hamming window has large side-lobe attenuation, windowing the single-frame voice data increases the continuity between the left end and the right end of the frame; that is, windowing the framed single-frame voice data converts a non-stationary speech signal into a short-time stationary signal. Let the framed signal be S(n), n = 0, 1, ..., N-1, where N is the frame size, and let the Hamming window be W(n); then the windowed signal is S'(n) = S(n) × W(n), where

W(n) = (1 - a) - a × cos(2πn / (N - 1)), 0 ≤ n ≤ N - 1,

N is the frame size, different values of a produce different Hamming windows, and a is generally taken as 0.46.
To increase the amplitude of the high-frequency components of the speech signal relative to the low-frequency components, and thus cancel the effects of glottal excitation and lip/nasal radiation, the single-frame voice data needs pre-emphasis, which helps improve the signal-to-noise ratio. The signal-to-noise ratio is the ratio of signal to noise in an electronic device or electronic system.

Pre-emphasis passes the windowed single-frame voice data through a high-pass filter H(Z) = 1 - μZ⁻¹, where the value of μ is between 0.9 and 1.0 and Z denotes the transform variable of the single-frame voice data. The goal of pre-emphasis is to boost the high-frequency part so that the spectrum of the signal becomes flatter; the spectrum can then be computed with the same signal-to-noise ratio over the whole band from low to high frequencies, highlighting the high-frequency formants.
Understandably, preprocessing the original voice data by framing, windowing, and pre-emphasis gives the preprocessed single-frame voice data high resolution and good stationarity with little deviation from the original voice data, which improves the efficiency and quality of the subsequent segmentation that produces the at least two frames of voice data to be detected.
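To make the preceding steps concrete, the following is a minimal Python/NumPy sketch of the preprocessing pipeline, assuming a mono floating-point waveform. The frame size N, the Hamming coefficient a = 0.46, and the half-frame overlap follow the text; the pre-emphasis coefficient μ = 0.97 is an assumed value inside the 0.9-1.0 range stated above, and pre-emphasis is applied before windowing, the usual order in practice.

```python
import numpy as np

def preprocess(signal, N=256, hop=128, mu=0.97, a=0.46):
    """Pre-emphasis, framing, and Hamming windowing of a mono waveform."""
    # pre-emphasis: y[k] = x[k] - mu * x[k-1], i.e. H(Z) = 1 - mu * Z^-1
    y = np.append(signal[0], signal[1:] - mu * signal[:-1])
    # zero-pad so the last partial frame still has N samples
    num_frames = max(1, 1 + int(np.ceil((len(y) - N) / hop)))
    y = np.pad(y, (0, max(0, (num_frames - 1) * hop + N - len(y))))
    frames = np.stack([y[i * hop:i * hop + N] for i in range(num_frames)])
    # Hamming window: W(n) = (1 - a) - a * cos(2*pi*n / (N - 1))
    window = (1 - a) - a * np.cos(2 * np.pi * np.arange(N) / (N - 1))
    return frames * window
```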
S22: Perform segmentation on the single-frame voice data by using the short-time energy calculation formula to obtain the short-time energy corresponding to each single frame of voice data, and retain the single frames whose short-time energy is greater than a first threshold as first voice data.
The short-time energy calculation formula is

E(n) = Σ_{m=0}^{N-1} x_n(m)²,

where N is the frame length of the single-frame voice data, x_n(m) is the n-th single frame of voice data, E(n) is the short-time energy, and m is the time index.
Short-time energy is the energy of one frame of the speech signal. The first threshold is a preset lower threshold. The first voice data is the single-frame voice data whose short-time energy is greater than the first threshold. The VAD algorithm can detect four parts in the single-frame voice data: the silent segment, the transition segment, the speech segment, and the ending segment. Specifically, the short-time energy calculation formula is applied to every single frame of voice data to obtain its short-time energy, and the single frames whose short-time energy is greater than the first threshold are retained as first voice data. In this embodiment, retaining the single frames whose short-time energy is greater than the first threshold marks the starting point and shows that the single-frame voice data after that point enters the transition segment; that is, the first voice data finally obtained comprises the transition segment, the speech segment, and the ending segment. Understandably, the first voice data obtained on the basis of short-time energy in step S22 results from cutting out the single frames whose short-time energy is not greater than the first threshold, that is, from removing the silent-segment interference in the single-frame voice data.
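A minimal sketch of this first pass, assuming `frames` is the (num_frames, N) array produced by the preprocessing sketch above; the threshold value t1 is a placeholder to be tuned.

```python
import numpy as np

def short_time_energy(frames):
    # E(n) = sum over m = 0..N-1 of x_n(m)^2, one scalar per frame
    return np.sum(frames ** 2, axis=1)

def keep_above_energy_threshold(frames, t1):
    # retain the single frames whose short-time energy exceeds the
    # (low) first threshold t1; the result is the "first voice data"
    return frames[short_time_energy(frames) > t1]
```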
S23: Perform segmentation on the first voice data by using the zero-crossing rate calculation formula to obtain the zero-crossing rate corresponding to the first voice data, and retain the first voice data whose zero-crossing rate is greater than a second threshold to obtain at least two frames of voice data to be detected.
The zero-crossing rate calculation formula is

Z_n = (1/2) Σ_{m=0}^{N-1} |sgn[x_n(m)] - sgn[x_n(m-1)]|,

where sgn[] is the sign function,

sgn[x] = 1 for x ≥ 0, and sgn[x] = -1 for x < 0,

x_n(m) is the n-th frame of first voice data, Z_n is the zero-crossing rate, and m is the time index.
The second threshold is a preset higher threshold. Exceeding the first threshold does not necessarily mark the beginning of the speech segment; it may be caused by a short burst of noise. Therefore, the zero-crossing rate of each frame of first voice data (that is, the original voice data in and after the transition segment) is calculated. If the zero-crossing rate corresponding to a frame of first voice data is not greater than the second threshold, that frame is regarded as belonging to a silent segment and is cut out; that is, only the first voice data whose zero-crossing rate is greater than the second threshold is retained, so that at least two frames of voice data to be detected are obtained and the interfering voice data in the transition segment of the first voice data is further removed.
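Continuing the sketch, the zero-crossing rate and the combined double-threshold pass might look as follows; the thresholds t1 and t2 are again placeholders.

```python
import numpy as np

def zero_crossing_rate(frames):
    # Z_n = 1/2 * sum over m of |sgn(x_n(m)) - sgn(x_n(m-1))|,
    # with sgn(x) = 1 for x >= 0 and -1 for x < 0
    signs = np.where(frames >= 0, 1, -1)
    return 0.5 * np.sum(np.abs(np.diff(signs, axis=1)), axis=1)

def double_threshold_vad(frames, t1, t2):
    # pass 1: keep frames whose short-time energy exceeds t1
    first = frames[np.sum(frames ** 2, axis=1) > t1]
    # pass 2: keep frames whose zero-crossing rate exceeds t2
    return first[zero_crossing_rate(first) > t2]
```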
In this embodiment, the short-time energy calculation formula is first applied to segment the original voice data: the corresponding short-time energy is computed, and the single frames whose short-time energy is greater than the first threshold are retained, marking the starting point and showing that the subsequent single-frame voice data enters the transition segment, which preliminarily cuts out the silent segments. Then the zero-crossing rate of each frame of first voice data (the original voice data in and after the transition segment) is computed, and the first voice data whose zero-crossing rate is not greater than the second threshold is cut out, leaving at least two frames of voice data to be detected whose zero-crossing rate is greater than the second threshold. In this embodiment, the VAD algorithm removes the interfering voice data corresponding to the silent segments in the first voice data by this double-threshold method, which is simple to implement and improves the efficiency of voice data processing.
S30: Perform feature extraction on each frame of the voice data to be detected by using an ASR voice feature extraction algorithm to obtain filter voice features to be detected.
The filter voice features to be detected are the filter-bank features obtained by performing feature extraction on the voice data to be detected with the ASR voice feature extraction algorithm. Filter-bank (Fbank) features are speech features commonly used in speech recognition. The Mel-cepstral features in common use undergo dimensionality reduction during model training or recognition, which loses part of the information; to avoid this problem, this embodiment uses filter-bank features instead of the common Mel features, which helps improve the accuracy of subsequent model recognition. ASR (Automatic Speech Recognition) is a technology that converts human speech into text and generally comprises three parts: speech feature extraction, acoustic models with pattern matching, and language models with language processing. The ASR voice feature extraction algorithm is the algorithm in ASR technology that implements speech feature extraction.
Because an acoustic model or speech recognition model recognizes the speech features extracted from the voice data to be detected, rather than the voice data itself, feature extraction must be performed on the voice data to be detected first. In this embodiment, the ASR voice feature extraction algorithm is applied to each frame of the voice data to be detected to obtain the filter voice features to be detected, providing technical support for subsequent model recognition.
In an embodiment, as shown in FIG. 4, step S30 of performing feature extraction on the voice data to be detected by using the ASR voice feature extraction algorithm to obtain the filter voice features to be detected specifically includes the following steps:
S31: Perform a fast Fourier transform on each frame of the voice data to be detected to obtain the spectrum corresponding to each frame of the voice data to be detected.
The spectrum corresponding to the voice data to be detected is its energy spectrum in the frequency domain. Signal characteristics are usually hard to see from the time-domain form of a speech signal, so the signal is converted into an energy distribution in the frequency domain for observation; different energy distributions represent the characteristics of different speech. In this embodiment, a fast Fourier transform is performed on each frame of the voice data to be detected to obtain its spectrum, that is, the energy spectrum.
The fast Fourier transform (FFT) is the collective name for fast algorithms that compute the discrete Fourier transform (DFT). It converts a time-domain signal into a frequency-domain energy spectrum. Because the voice data to be detected is the signal left after preprocessing and voice activity detection of the original voice data, it is mainly a time-domain signal whose characteristics are hard to see directly; therefore, a fast Fourier transform is performed on each frame to obtain the energy distribution over the spectrum.
The formula of the fast Fourier transform is X_i(w) = FFT{x_i(k)}, where x_i(k) is the i-th frame of voice data to be detected in the time domain, X_i(w) is the speech signal spectrum corresponding to the i-th frame in the frequency domain, k is the time index, and w is the frequency in the speech signal spectrum. Specifically, the discrete Fourier transform is computed as

X_i(w) = Σ_{k=0}^{N-1} x_i(k) · W_N^{wk},

where the rotation (twiddle) factor is

W_N = e^{-j2π/N},

and N is the number of sampling points contained in each frame of the voice data to be detected. When the amount of data is large, the DFT has high algorithmic complexity, a heavy computational load, and long running time, so the fast Fourier transform is used instead to speed up the computation and save time. Specifically, the fast Fourier transform exploits the properties of the rotation factor W_N in the DFT formula, namely its periodicity, symmetry, and reducibility, and converts the above formula through butterfly operations to reduce the algorithmic complexity.
Specifically, the DFT of N sampling points is decomposed into butterfly operations, and the FFT consists of several iterated stages of butterflies. Suppose the number of sampling points in each frame of the voice data to be detected is 2^L (L being a positive integer); if there are fewer than 2^L points, zeros are padded until the frame contains 2^L points. The butterfly computation is then

X(k) = X'(k) + W_N^k · X''(k),
X(k + N/2) = X'(k) - W_N^k · X''(k), k = 0, 1, ..., N/2 - 1,

where X'(k) is the discrete Fourier transform of the even-indexed branch and X''(k) is the discrete Fourier transform of the odd-indexed branch. Converting the N-point DFT into even-indexed and odd-indexed discrete Fourier transforms through butterfly operations reduces the algorithmic complexity and achieves efficient computation.
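A minimal sketch of step S31, computing the per-frame energy spectrum with NumPy's FFT; zero-padding to the next power of two mirrors the 2^L assumption of the butterfly decomposition.

```python
import numpy as np

def frame_power_spectrum(frames):
    # zero-pad each frame to 2^L samples, then take |X_i(w)|^2;
    # np.fft.rfft returns the non-redundant half of the spectrum
    n = frames.shape[1]
    nfft = 1 << (n - 1).bit_length()
    return np.abs(np.fft.rfft(frames, n=nfft, axis=1)) ** 2
```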
S32: Pass the spectrum through a Mel filter bank to obtain the filter voice features to be detected.
The Mel filter bank passes the energy spectrum output by the fast Fourier transform (that is, the spectrum of the voice data to be detected) through a set of Mel-scale triangular filters; a filter bank with M filters is defined, the filters used are triangular, and the center frequency of the m-th filter is f(m), m = 1, 2, ..., M. M is usually 22 to 26. The Mel filter bank smooths the spectrum and performs filtering cancellation, highlighting the formant characteristics of the speech and reducing the computational load. Then the logarithmic energy output by each triangular filter in the Mel filter bank is computed as

s(m) = ln( Σ_w |X_i(w)|² · H_m(w) ), 1 ≤ m ≤ M,

where M is the number of triangular filters, m denotes the m-th triangular filter, H_m(w) is the frequency response of the m-th triangular filter, X_i(w) is the speech signal spectrum corresponding to the i-th frame of voice data to be detected, and w is the frequency in the speech signal spectrum. This logarithmic energy constitutes the filter voice features to be detected.
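A sketch of step S32, assuming the power spectrum from the previous sketch. The filter-bank construction follows the common Mel-scale recipe, since the exact center frequencies f(m) are not specified in the text, and M = 26 is an assumed value inside the stated 22-26 range.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(num_filters, nfft, sample_rate):
    # M triangular filters with centers spaced evenly on the Mel scale
    low, high = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    mel_points = np.linspace(low, high, num_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((num_filters, nfft // 2 + 1))
    for m in range(1, num_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def fbank_features(power_spec, sample_rate, num_filters=26):
    # s(m) = ln(sum over w of |X_i(w)|^2 * H_m(w))
    nfft = (power_spec.shape[1] - 1) * 2
    H = mel_filter_bank(num_filters, nfft, sample_rate)
    return np.log(np.maximum(power_spec @ H.T, 1e-10))
```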
In this embodiment, a fast Fourier transform is first performed on each frame of the voice data to be detected to obtain the corresponding spectrum, reducing computational complexity, speeding up the computation, and saving time. Then the spectrum is passed through the Mel filter bank and the logarithmic energy output by each triangular filter in the Mel filter bank is computed to obtain the filter voice features to be detected, which cancels the filtering, highlights the formant characteristics of the speech, and reduces the computational load.
S40: Recognize the filter voice features to be detected by using a trained ASR-LSTM speech recognition model to obtain a recognition probability value.
The ASR-LSTM speech recognition model is a pre-trained model for distinguishing speech from noise in the filter voice features to be detected. Specifically, it is obtained by training an LSTM (long short-term memory) neural network on training filter voice features extracted with the ASR voice feature extraction algorithm. The recognition probability value is the probability, produced when the ASR-LSTM speech recognition model recognizes the filter voice features to be detected, that the features are speech; it is a real number between 0 and 1. Specifically, the filter voice features corresponding to each frame of the voice data to be detected are input into the ASR-LSTM speech recognition model for recognition, yielding for each frame a recognition probability value, that is, the likelihood that the frame is speech.
S50: If the recognition probability value is greater than a preset probability value, use the voice data to be detected as target voice data.
Because the voice data to be detected is single-frame voice data from which the silent segments have been removed, silent-segment interference has already been excluded. Specifically, if the recognition probability value is greater than the preset probability value, the voice data to be detected is considered not to be a noise segment; that is, the voice data to be detected whose recognition probability value is greater than the preset probability value is determined to be target voice data. Understandably, by recognizing the voice data to be detected after silence removal, the server can exclude both the silent-segment and noise-segment interference from the target voice data, so that the target voice data can be used as training data for a voiceprint model or another speech model, improving the model's recognition accuracy. If the recognition probability value is not greater than the preset probability value, the frame of voice data to be detected is very likely noise and is excluded, avoiding the low recognition accuracy that would result from training a model on such data.
In this embodiment, the original voice data, which includes target voice data and interfering voice data, is obtained first, and the VAD algorithm is used to frame and segment it, preliminarily cutting out the silent-segment interference and laying the groundwork for obtaining relatively clean target voice data. The ASR voice feature extraction algorithm then extracts features from each frame of the voice data to be detected to obtain the filter voice features to be detected, which effectively avoids the information loss caused by dimensionality reduction during model training. If the recognition probability value is greater than the preset probability value, the voice data to be detected is taken as target voice data, so the target voice data obtained contains neither the removed silent segments nor the noise segments; that is, relatively clean target voice data is obtained, which helps subsequent training of a voiceprint model or another speech model on the target voice data and improves the model's recognition accuracy.
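Putting steps S20-S50 together, a hypothetical end-to-end selection might look like the sketch below; `model` stands for any trained classifier returning a per-frame speech probability (the ASR-LSTM model of step S40), and p0 is the preset probability value.

```python
import numpy as np

def select_target_voice_data(frames, features, model, p0=0.5):
    # features: (num_frames, M) filter voice features of the frames to detect;
    # model(features) is assumed to return speech probabilities in [0, 1]
    probs = np.asarray(model(features))
    return frames[probs > p0]   # frames kept as target voice data
```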
In an embodiment, the voice data processing method further includes pre-training the ASR-LSTM speech recognition model.
As shown in FIG. 5, pre-training the ASR-LSTM speech recognition model specifically includes the following steps:
S61: Obtain training voice data.
The training voice data is time-continuous voice data obtained from an open-source speech database and used for model training. The training voice data includes clean speech data and clean noise data, which are already labeled in the open-source speech database for model training. The ratio of clean speech data to clean noise data in the training voice data is 1:1; obtaining equal proportions of clean speech data and clean noise data effectively prevents the model training from overfitting, making the recognition of the model trained on the training voice data more accurate. In this embodiment, after the server obtains the training voice data, it also frames the training voice data into at least two frames so that features can subsequently be extracted from each frame of training voice data.
S62: Perform feature extraction on the training voice data by using the ASR voice feature extraction algorithm to obtain training filter voice features.
Because acoustic model training operates on the speech features extracted from the training voice data rather than directly on the training voice data, feature extraction must be performed on the training voice data first. Understandably, because the training voice data is sequential in time, the training filter voice features extracted from each frame carry that temporal ordering. Specifically, the server applies the ASR voice feature extraction algorithm to each frame of the training voice data to obtain training filter voice features that carry the timing state, providing technical support for subsequent model training. In this embodiment, the feature extraction steps applied to the training voice data with the ASR voice feature extraction algorithm are the same as those in step S30 and, to avoid repetition, are not repeated here.
S63: Input the training filter voice features into a long short-term memory neural network model for training to obtain the trained ASR-LSTM speech recognition model.
The long short-term memory (LSTM) model is a recurrent neural network model suited to processing and predicting important events in time series with relatively long intervals and delays. The LSTM model has temporal memory and is therefore used to process the training filter voice features that carry the timing state. It is one of the neural network models with long-term memory capability and has a three-layer structure of input layer, hidden layer, and output layer. The input layer, the first layer of the LSTM model, receives external signals, that is, it receives the training filter voice features. The output layer, the last layer, outputs signals to the outside, that is, it outputs the computation results of the LSTM model. The hidden layers are the layers of the LSTM model between the input and output layers; they are trained on the filter voice features so that the parameters of each hidden layer are adjusted to obtain the ASR-LSTM speech recognition model. Understandably, training with the LSTM model exploits the temporal ordering of the filter voice features and thereby improves the accuracy of the ASR-LSTM speech recognition model. In this embodiment, the output layer of the LSTM model uses Softmax (a regression model) for regression processing to classify the output weight matrix. Softmax is a classification function commonly used in neural networks; it maps the outputs of multiple neurons into the interval [0, 1], which can be understood as probabilities and is simple and convenient to compute, so it is used for multi-class output, making the output more accurate.
In this embodiment, equal proportions of speech data and noise data are first obtained from an open-source speech database to prevent the model training from overfitting, so that the speech recognition model obtained by training on the training voice data is more accurate. Then, the ASR voice feature extraction algorithm is applied to each frame of the training voice data to obtain the training filter voice features. Finally, a long short-term memory neural network model with temporal memory capability is trained on the training filter voice features to obtain the trained ASR-LSTM speech recognition model, which therefore attains a high recognition accuracy.
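For illustration, a high-level sketch of such a speech/noise classifier in tf.keras is shown below; the layer width, optimizer, and input shape are illustrative assumptions rather than values from the text, which only fixes the LSTM hidden layer and the Softmax output.

```python
import tensorflow as tf

def build_asr_lstm(timesteps, num_features):
    # LSTM hidden layer followed by a Softmax output over {speech, noise}
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(timesteps, num_features)),
        tf.keras.layers.LSTM(64),
        tf.keras.layers.Dense(2, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```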
In an embodiment, as shown in FIG. 6, step S63 of inputting the training filter voice features into the long short-term memory neural network model for training to obtain the trained ASR-LSTM speech recognition model specifically includes the following steps:
S631: In the hidden layer of the long short-term memory neural network model, compute on the training filter voice features with a first activation function to obtain the neurons carrying an activation state identifier.
Each neuron in the hidden layer of the long short-term memory neural network model includes three gates: an input gate, a forget gate, and an output gate. The forget gate decides which past information is discarded in the neuron. The input gate decides which information is added to the neuron. The output gate decides which information the neuron outputs. The first activation function is a function for activating the neuron state, and the neuron state determines the information discarded, added, and output by each gate (that is, the input gate, forget gate, and output gate). The activation state identifier is either a pass identifier or a fail identifier. In this embodiment, the identifiers corresponding to the input gate, forget gate, and output gate are i, f, and o, respectively.
In this embodiment, the Sigmoid (S-shaped growth curve) function is specifically selected as the first activation function. The Sigmoid function is an S-shaped function common in biology; in information science, because it is monotonically increasing and has a monotonically increasing inverse, it is often used as the threshold function of neural networks and maps a variable into the interval 0-1. The first activation function is computed as

σ(z) = 1 / (1 + e^(-z)),

where z denotes the output value of the forget gate.
Specifically, the activation state of each neuron (training filter voice feature) is computed to obtain the neurons whose activation state identifier is the pass identifier. In this embodiment, the forget gate formula f_t = σ(z) = σ(W_f · [h_{t-1}, x_t] + b_f) is used to compute which information the forget gate receives (that is, only the neurons carrying the pass identifier are received), where f_t is the forget threshold (that is, the activation state), W_f is the weight matrix of the forget gate, b_f is the bias term of the forget gate, h_{t-1} is the output of the neuron at the previous time step, x_t is the input data at the current time step (the training filter voice features), t denotes the current time step, and t-1 denotes the previous time step. The forget gate also includes a forget threshold: applying the forget gate formula to the training filter voice features yields a scalar in the interval 0-1 (the forget threshold), which determines the proportion of past information the neuron receives based on a joint judgment of the current and past states, achieving dimensionality reduction of the data, reducing the computational load, and improving training efficiency.
S632: In the hidden layer of the long short-term memory neural network model, compute on the neurons carrying the activation state identifier with a second activation function to obtain the output values of the hidden layer of the long short-term memory neural network model.
The output values of the hidden layer of the long short-term memory neural network model include the output value of the input gate, the output value of the output gate, and the neuron state. Specifically, in the input gate of the hidden layer of the long short-term memory neural network model, the second activation function is applied to the neurons carrying the pass activation state identifier to obtain the output values of the hidden layer. In this embodiment, because a linear model is not expressive enough, the tanh (hyperbolic tangent) function is used as the activation function of the input gate (that is, the second activation function); it adds the non-linear factors that let the trained ASR-LSTM speech recognition model solve more complex problems. Moreover, the tanh activation function converges quickly, which saves training time and increases training efficiency.
Specifically, the output value of the input gate is computed by the input gate formula i_t = σ(W_i · [h_{t-1}, x_t] + b_i), where W_i is the weight matrix of the input gate, i_t is the input threshold, and b_i is the bias term of the input gate. Applying the input gate formula to the training filter voice features yields a scalar in the interval 0-1 (the input threshold), which controls the proportion of current information the neuron receives, that is, the proportion of newly input information accepted, based on a joint judgment of the current and past states, reducing the computational load and improving training efficiency.
Then the neuron state formulas

C'_t = tanh(W_c · [h_{t-1}, x_t] + b_c)

and

C_t = f_t * C_{t-1} + i_t * C'_t

are used to compute the current neuron state, where W_c is the weight matrix of the neuron state, b_c is the bias term of the neuron state, C_{t-1} is the neuron state at the previous time step, C_t is the neuron state at the current time step, and C'_t is the candidate state. The element-wise product of the neuron state with the forget threshold (and of the candidate state with the input threshold) lets the model output only the required information, improving the efficiency of model learning.
Finally, the output gate formula o_t = σ(W_o · [h_{t-1}, x_t] + b_o) is used to compute which information the output gate outputs, and then the formula h_t = o_t * tanh(C_t) is used to compute the output value of the neuron at the current time step, where o_t is the output threshold, W_o is the weight matrix of the output gate, b_o is the bias term of the output gate, and h_t is the output value of the neuron at the current time step.
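The gate equations of steps S631 and S632 can be collected into one forward step; a minimal NumPy sketch, with the weight matrices acting on the concatenation [h_{t-1}, x_t], is given below.

```python
import numpy as np

def sigmoid(z):
    # first activation function: sigma(z) = 1 / (1 + e^-z)
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    """One forward step of the LSTM cell described in S631-S632."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)             # forget gate
    i_t = sigmoid(W_i @ z + b_i)             # input gate
    c_bar = np.tanh(W_c @ z + b_c)           # candidate state C'_t
    c_t = f_t * c_prev + i_t * c_bar         # C_t = f_t * C_{t-1} + i_t * C'_t
    o_t = sigmoid(W_o @ z + b_o)             # output gate
    h_t = o_t * np.tanh(c_t)                 # h_t = o_t * tanh(C_t)
    return h_t, c_t
```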
S633: Perform error back-propagation updates on the long short-term memory neural network model based on the output values of its hidden layer, to obtain the trained ASR-LSTM speech recognition model.
First, the error terms at any time t are computed by back-propagation through time: the error term of the output gate $\delta_{o,t}$, the error term of the input gate $\delta_{i,t}$, the error term of the forget gate $\delta_{f,t}$, and the error term of the neuron state $\delta_{\tilde{c},t}$. With $\delta_t$ denoting the error propagated to the hidden-layer output $h_t$ and $\circ$ denoting element-wise multiplication, these take the standard form $\delta_{o,t} = \delta_t \circ \tanh(C_t) \circ o_t \circ (1-o_t)$, $\delta_{i,t} = \delta_t \circ o_t \circ (1-\tanh^2(C_t)) \circ \tilde{C}_t \circ i_t \circ (1-i_t)$, $\delta_{f,t} = \delta_t \circ o_t \circ (1-\tanh^2(C_t)) \circ C_{t-1} \circ f_t \circ (1-f_t)$, and $\delta_{\tilde{c},t} = \delta_t \circ o_t \circ (1-\tanh^2(C_t)) \circ i_t \circ (1-\tilde{C}_t^2)$.
Then, error back-propagation updates are performed according to the weight-update formula $\frac{\partial E}{\partial W} = \sum_{t=1}^{T} \delta_{a,t} \cdot (b_h^{t-1})^{\top}$, where T denotes the time step, W denotes a weight matrix such as $W_i$, $W_c$, $W_o$, or $W_f$, the subscript a indexes the corresponding gate output such as $i_t$, $f_t$, $o_t$, or $\tilde{C}_t$, $\delta$ denotes the error term, $C_{t-1}$ is the state data of the neuron at the previous moment, and $b_h^{t-1}$ is the output value of the hidden layer at the previous moment. The biases are updated according to the bias-update formula $\frac{\partial E}{\partial b} = \sum_{t=1}^{T} \delta_{a,t}$, where b is the bias term of each gate and $\delta_{a,t}$ is the error of each gate at time t.
Finally, the updated weights are obtained by evaluating the weight-update formula, and the biases are updated according to the bias-update formula; applying the updated weights and biases of each layer to the long short-term memory neural network model yields the trained ASR-LSTM speech recognition model. Further, the weights of the ASR-LSTM speech recognition model implement its ability to decide which old information to discard, which new information to add, and which information to output. The output layer of the ASR-LSTM speech recognition model ultimately outputs a probability value, which represents the probability that the training speech data is determined to be speech after recognition by the model. This can be widely applied to speech data processing to accurately identify the training filter speech features.
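As a concrete illustration of the update step, the sketch below accumulates the weight and bias gradients of one gate over T time steps from its per-step error terms, in the spirit of the formulas above; the caching of forward activations and all names are assumptions made for the example.

```python
import numpy as np

def gate_gradients(deltas, h_hist, x_hist):
    """Accumulate dE/dW and dE/db for one gate over T time steps.

    deltas[t] is the gate's error term delta_{a,t}; h_hist[t] is the hidden
    output h_{t-1} and x_hist[t] the input x_t cached from the forward pass.
    The h part of the concatenation corresponds to the document's b_h^{t-1}.
    """
    T = len(deltas)
    z0 = np.concatenate([h_hist[0], x_hist[0]])
    dW = np.zeros((deltas[0].shape[0], z0.shape[0]))
    db = np.zeros_like(deltas[0])
    for t in range(T):
        z = np.concatenate([h_hist[t], x_hist[t]])   # [h_{t-1}, x_t]
        dW += np.outer(deltas[t], z)                 # sum_t delta_{a,t} (b_h^{t-1})^T
        db += deltas[t]                              # sum_t delta_{a,t}
    return dW, db

# A plain gradient step would then apply: W -= lr * dW; b -= lr * db
```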
In this embodiment, the first activation function is applied to the training filter speech features in the hidden layer of the long short-term memory neural network model to obtain the neurons carrying the activation-state identifier, which reduces the dimensionality of the data, lowers the amount of computation, and improves training efficiency. The second activation function is then applied in the hidden layer to the neurons carrying the activation-state identifier to obtain the output values of the hidden layer, so that error back-propagation updates can be performed on the model based on those output values to obtain the updated weights and biases. Applying the updated weights and biases to the long short-term memory neural network model yields the ASR-LSTM speech recognition model, which can be widely applied to speech data processing to accurately identify the training filter speech features.
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and should not limit the implementation of the embodiments of this application in any way.
In one embodiment, a voice data processing device is provided, which corresponds one-to-one to the voice data processing method in the above embodiments. As shown in FIG. 7, the voice data processing device includes an original voice data acquisition module 10, a voice-data-to-be-tested acquisition module 20, a filter-speech-feature-to-be-tested acquisition module 30, a recognition probability value acquisition module 40, and a target voice data acquisition module 50. Each functional module is described in detail as follows:
The original voice data acquisition module 10 is configured to acquire original voice data.

The voice-data-to-be-tested acquisition module 20 is configured to frame and segment the original voice data by using a VAD algorithm to obtain at least two frames of voice data to be tested.

The filter-speech-feature-to-be-tested acquisition module 30 is configured to perform feature extraction on each frame of the voice data to be tested by using an ASR speech feature extraction algorithm to obtain filter speech features to be tested.

The recognition probability value acquisition module 40 is configured to recognize the filter speech features to be tested by using a trained ASR-LSTM speech recognition model to obtain a recognition probability value.

The target voice data acquisition module 50 is configured to use the voice data to be tested as target voice data if the recognition probability value is greater than a preset probability value.
Specifically, the voice-data-to-be-tested acquisition module 20 includes a single-frame voice data acquisition unit 21, a first voice data acquisition unit 22, and a voice-data-to-be-tested acquisition unit 23.

The single-frame voice data acquisition unit 21 is configured to frame the original voice data to obtain at least two frames of single-frame voice data.

The first voice data acquisition unit 22 is configured to segment the single-frame voice data by using the short-time energy calculation formula to obtain the corresponding short-time energy, and to retain the single-frame voice data whose short-time energy is greater than a first threshold as first voice data.

The voice-data-to-be-tested acquisition unit 23 is configured to segment the first voice data by using the zero-crossing rate calculation formula to obtain the corresponding zero-crossing rate, and to retain the first voice data whose zero-crossing rate is greater than a second threshold, obtaining at least two frames of voice data to be tested.
Specifically, the short-time energy calculation formula is $E(n) = \sum_{m=0}^{N-1} x_n^2(m)$, where N is the frame length of a single frame of voice data, $x_n(m)$ is the single-frame voice data of the n-th frame, $E(n)$ is the short-time energy, and m is the time index.

The zero-crossing rate calculation formula is $Z_n = \frac{1}{2}\sum_{m=0}^{N-1}\left|\operatorname{sgn}[x_n(m)] - \operatorname{sgn}[x_n(m-1)]\right|$, where $\operatorname{sgn}[\cdot]$ is the sign function, $x_n(m)$ is the first voice data of the n-th frame, $Z_n$ is the zero-crossing rate, and m is the time index.
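For illustration, the two formulas translate directly into code. The sketch below frames a signal, then applies the short-time energy threshold followed by the zero-crossing-rate threshold; the frame length and threshold values are placeholders, not values taken from the disclosure.

```python
import numpy as np

def vad_segment(signal, frame_len=256, energy_thresh=1e-3, zcr_thresh=10.0):
    """Frame the signal, then keep frames passing both thresholds."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)

    kept = []
    for x in frames:
        energy = np.sum(x ** 2)                       # E(n), short-time energy
        if energy <= energy_thresh:                   # first threshold
            continue
        # Z_n, zero-crossing rate over the frame
        zcr = 0.5 * np.sum(np.abs(np.sign(x[1:]) - np.sign(x[:-1])))
        if zcr > zcr_thresh:                          # second threshold
            kept.append(x)
    return kept
```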
Specifically, the filter-speech-feature-to-be-tested acquisition module 30 includes a spectrum acquisition unit 31 and a filter-speech-feature-to-be-tested acquisition unit 32.

The spectrum acquisition unit 31 is configured to perform a fast Fourier transform on each frame of the voice data to be tested to obtain the spectrum corresponding to the voice data to be tested.

The filter-speech-feature-to-be-tested acquisition unit 32 is configured to pass the spectrum through a Mel filter bank to obtain the filter speech features to be tested.
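For illustration, the two units can be sketched as follows, assuming a triangular Mel filter bank; the sampling rate, the number of filters, and the bin mapping are assumptions made for the example, not values from the disclosure.

```python
import numpy as np

def mel_filter_features(frame, sample_rate=16000, n_filters=26):
    """FFT magnitude spectrum passed through a triangular Mel filter bank."""
    spectrum = np.abs(np.fft.rfft(frame))             # spectrum of one frame
    n_fft = 2 * (spectrum.shape[0] - 1)

    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # Filter edges equally spaced on the Mel scale, mapped back to FFT bins
    mel_pts = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor(n_fft * mel_to_hz(mel_pts) / sample_rate).astype(int)

    feats = np.zeros(n_filters)
    for i in range(n_filters):
        left, centre, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, right):
            if k < centre:                            # rising slope of the triangle
                w = (k - left) / max(centre - left, 1)
            else:                                     # falling slope of the triangle
                w = (right - k) / max(right - centre, 1)
            feats[i] += w * spectrum[k]
    return np.log(feats + 1e-10)                      # log filter-bank energies
```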
Specifically, the voice data processing device further includes an ASR-LSTM speech recognition model training module 60, configured to pre-train the ASR-LSTM speech recognition model.

The ASR-LSTM speech recognition model training module 60 includes a training voice data acquisition unit 61, a training filter speech feature acquisition unit 62, and an ASR-LSTM speech recognition model acquisition unit 63.

The training voice data acquisition unit 61 is configured to acquire training voice data.

The training filter speech feature acquisition unit 62 is configured to perform feature extraction on the training voice data by using the ASR speech feature extraction algorithm to obtain training filter speech features.

The ASR-LSTM speech recognition model acquisition unit 63 is configured to input the training filter speech features into the long short-term memory neural network model for training to obtain the trained ASR-LSTM speech recognition model.
Specifically, the ASR-LSTM speech recognition model acquisition unit 63 includes an activation-state neuron acquisition subunit 631, a model output value acquisition subunit 632, and an ASR-LSTM speech recognition model acquisition subunit 633.

The activation-state neuron acquisition subunit 631 is configured to apply the first activation function to the training filter speech features in the hidden layer of the long short-term memory neural network model to obtain the neurons carrying the activation-state identifier.

The model output value acquisition subunit 632 is configured to apply the second activation function to the neurons carrying the activation-state identifier in the hidden layer of the long short-term memory neural network model to obtain the output values of the hidden layer.

The ASR-LSTM speech recognition model acquisition subunit 633 is configured to perform error back-propagation updates on the long short-term memory neural network model based on the output values of its hidden layer to obtain the trained ASR-LSTM speech recognition model.
For the specific limitations of the voice data processing device, reference may be made to the limitations of the voice data processing method above, which are not repeated here. Each module in the above voice data processing device may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor in a computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 8. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for the operation of the operating system and the computer-readable instructions in the non-volatile storage medium. The database of the computer device stores data generated or acquired during execution of the voice data processing method, such as the target voice data. The network interface of the computer device communicates with external terminals through a network connection. When executed by the processor, the computer-readable instructions implement a voice data processing method.
In one embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When the processor executes the computer-readable instructions, the following steps are implemented: acquiring original voice data; framing and segmenting the original voice data by using the VAD algorithm to obtain at least two frames of voice data to be tested; performing feature extraction on each frame of the voice data to be tested by using the ASR speech feature extraction algorithm to obtain filter speech features to be tested; recognizing the filter speech features to be tested by using the trained ASR-LSTM speech recognition model to obtain a recognition probability value; and, if the recognition probability value is greater than a preset probability value, using the voice data to be tested as the target voice data.
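Read together, these steps form a single pipeline. The sketch below chains the helper functions from the earlier examples in this section; `model.predict` and the preset probability value are hypothetical stand-ins for the trained ASR-LSTM model, used only for illustration.

```python
def process_voice(signal, model, preset_prob=0.5):
    """Hypothetical end-to-end pipeline following the five steps above."""
    targets = []
    for frame in vad_segment(signal):              # S10/S20: framing + VAD segmentation
        feats = mel_filter_features(frame)         # S30: ASR feature extraction
        prob = model.predict(feats)                # S40: trained ASR-LSTM recognition
        if prob > preset_prob:                     # S50: compare with preset probability
            targets.append(frame)                  # keep as target voice data
    return targets
```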
In one embodiment, when the processor executes the computer-readable instructions, the following steps are further implemented: framing the original voice data to obtain at least two frames of single-frame voice data; segmenting the single-frame voice data by using the short-time energy calculation formula to obtain the corresponding short-time energy, and retaining the single-frame voice data whose short-time energy is greater than the first threshold as the first voice data; and segmenting the first voice data by using the zero-crossing rate calculation formula to obtain the corresponding zero-crossing rate, and retaining the first voice data whose zero-crossing rate is greater than the second threshold to obtain at least two frames of voice data to be tested.
Specifically, the short-time energy calculation formula is $E(n) = \sum_{m=0}^{N-1} x_n^2(m)$, where N is the frame length of a single frame of voice data, $x_n(m)$ is the single-frame voice data of the n-th frame, $E(n)$ is the short-time energy, and m is the time index; the zero-crossing rate calculation formula is $Z_n = \frac{1}{2}\sum_{m=0}^{N-1}\left|\operatorname{sgn}[x_n(m)] - \operatorname{sgn}[x_n(m-1)]\right|$, where $\operatorname{sgn}[\cdot]$ is the sign function, $x_n(m)$ is the first voice data of the n-th frame, $Z_n$ is the zero-crossing rate, and m is the time index.
In one embodiment, when the processor executes the computer-readable instructions, the following steps are further implemented: performing a fast Fourier transform on each frame of the voice data to be tested to obtain the spectrum corresponding to the voice data to be tested; and passing the spectrum through the Mel filter bank to obtain the filter speech features to be tested.

In one embodiment, when the processor executes the computer-readable instructions, the following steps are further implemented: acquiring training voice data; performing feature extraction on the training voice data by using the ASR speech feature extraction algorithm to obtain training filter speech features; and inputting the training filter speech features into the long short-term memory neural network model for training to obtain the trained ASR-LSTM speech recognition model.

In one embodiment, when the processor executes the computer-readable instructions, the following steps are further implemented: applying the first activation function to the training filter speech features in the hidden layer of the long short-term memory neural network model to obtain the neurons carrying the activation-state identifier; applying the second activation function to the neurons carrying the activation-state identifier in the hidden layer to obtain the output values of the hidden layer; and performing error back-propagation updates on the long short-term memory neural network model based on the output values of its hidden layer to obtain the ASR-LSTM speech recognition model.
In one embodiment, one or more non-volatile readable storage media storing computer-readable instructions are provided. When the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps: acquiring original voice data; framing and segmenting the original voice data by using the VAD algorithm to obtain at least two frames of voice data to be tested; performing feature extraction on each frame of the voice data to be tested by using the ASR speech feature extraction algorithm to obtain filter speech features to be tested; recognizing the filter speech features to be tested by using the trained ASR-LSTM speech recognition model to obtain a recognition probability value; and, if the recognition probability value is greater than a preset probability value, using the voice data to be tested as the target voice data.
In one embodiment, when the computer-readable instructions are executed by the one or more processors, the following steps are further performed: framing the original voice data to obtain at least two frames of single-frame voice data; segmenting the single-frame voice data by using the short-time energy calculation formula to obtain the corresponding short-time energy, and retaining the single-frame voice data whose short-time energy is greater than the first threshold as the first voice data; and segmenting the first voice data by using the zero-crossing rate calculation formula to obtain the corresponding zero-crossing rate, and retaining the first voice data whose zero-crossing rate is greater than the second threshold to obtain at least two frames of voice data to be tested.
Specifically, the short-time energy calculation formula is $E(n) = \sum_{m=0}^{N-1} x_n^2(m)$, where N is the frame length of a single frame of voice data, $x_n(m)$ is the single-frame voice data of the n-th frame, $E(n)$ is the short-time energy, and m is the time index; the zero-crossing rate calculation formula is $Z_n = \frac{1}{2}\sum_{m=0}^{N-1}\left|\operatorname{sgn}[x_n(m)] - \operatorname{sgn}[x_n(m-1)]\right|$, where $\operatorname{sgn}[\cdot]$ is the sign function, $x_n(m)$ is the first voice data of the n-th frame, $Z_n$ is the zero-crossing rate, and m is the time index.
In one embodiment, when the computer-readable instructions are executed by the one or more processors, the following steps are further performed: performing a fast Fourier transform on each frame of the voice data to be tested to obtain the spectrum corresponding to the voice data to be tested; and passing the spectrum through the Mel filter bank to obtain the filter speech features to be tested.

In one embodiment, when the computer-readable instructions are executed by the one or more processors, the following steps are further performed: acquiring training voice data; performing feature extraction on the training voice data by using the ASR speech feature extraction algorithm to obtain training filter speech features; and inputting the training filter speech features into the long short-term memory neural network model for training to obtain the trained ASR-LSTM speech recognition model.

In one embodiment, when the computer-readable instructions are executed by the one or more processors, the following steps are further performed: applying the first activation function to the training filter speech features in the hidden layer of the long short-term memory neural network model to obtain the neurons carrying the activation-state identifier; applying the second activation function to the neurons carrying the activation-state identifier in the hidden layer to obtain the output values of the hidden layer; and performing error back-propagation updates on the long short-term memory neural network model based on the output values of its hidden layer to obtain the trained ASR-LSTM speech recognition model.
A person of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by computer-readable instructions instructing the relevant hardware; the computer-readable instructions may be stored in a non-volatile computer-readable storage medium, and when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Those skilled in the art can clearly understand that, for convenience and brevity of description, only the above division of functional units and modules is used as an example. In practical applications, the above functions may be allocated to different functional units and modules as required; that is, the internal structure of the device may be divided into different functional units or modules to complete all or part of the functions described above.
The above embodiments are only used to describe the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or equivalently replace some of the technical features; such modifications or replacements do not depart the essence of the corresponding technical solutions from the spirit and scope of the technical solutions of the embodiments of this application, and shall all be included within the protection scope of this application.

Claims (20)

  1. A voice data processing method, comprising:
    acquiring original voice data;
    framing and segmenting the original voice data by using a VAD algorithm to obtain at least two frames of voice data to be tested;
    performing feature extraction on each frame of the voice data to be tested by using an ASR speech feature extraction algorithm to obtain filter speech features to be tested;
    recognizing the filter speech features to be tested by using a trained ASR-LSTM speech recognition model to obtain a recognition probability value; and
    if the recognition probability value is greater than a preset probability value, using the voice data to be tested as target voice data.
  2. The voice data processing method according to claim 1, wherein the framing and segmenting the original voice data by using a VAD algorithm to obtain at least two frames of voice data to be tested comprises:
    framing the original voice data to obtain at least two frames of single-frame voice data;
    segmenting the single-frame voice data by using a short-time energy calculation formula to obtain corresponding short-time energy, and retaining the single-frame voice data whose short-time energy is greater than a first threshold as first voice data; and
    segmenting the first voice data by using a zero-crossing rate calculation formula to obtain a corresponding zero-crossing rate, and retaining the first voice data whose zero-crossing rate is greater than a second threshold to obtain at least two frames of the voice data to be tested.
  3. The voice data processing method according to claim 2, wherein the short-time energy calculation formula is $E(n) = \sum_{m=0}^{N-1} x_n^2(m)$, where N is the frame length of a single frame of voice data, $x_n(m)$ is the single-frame voice data of the n-th frame, $E(n)$ is the short-time energy, and m is the time index; and
    the zero-crossing rate calculation formula is $Z_n = \frac{1}{2}\sum_{m=0}^{N-1}\left|\operatorname{sgn}[x_n(m)] - \operatorname{sgn}[x_n(m-1)]\right|$, where $\operatorname{sgn}[\cdot]$ is the sign function, $x_n(m)$ is the first voice data of the n-th frame, $Z_n$ is the zero-crossing rate, and m is the time index.
  4. The voice data processing method according to claim 1, wherein the performing feature extraction on each frame of the voice data to be tested by using an ASR speech feature extraction algorithm to obtain filter speech features to be tested comprises:
    performing a fast Fourier transform on each frame of the voice data to be tested to obtain a spectrum corresponding to the voice data to be tested; and
    passing the spectrum through a Mel filter bank to obtain the filter speech features to be tested.
  5. The voice data processing method according to claim 1, further comprising: pre-training the ASR-LSTM speech recognition model;
    wherein the pre-training the ASR-LSTM speech recognition model comprises:
    acquiring training voice data;
    performing feature extraction on the training voice data by using the ASR speech feature extraction algorithm to obtain training filter speech features; and
    inputting the training filter speech features into a long short-term memory neural network model for training to obtain the trained ASR-LSTM speech recognition model.
  6. The voice data processing method according to claim 5, wherein the inputting the training filter speech features into a long short-term memory neural network model for training to obtain the trained ASR-LSTM speech recognition model comprises:
    applying a first activation function to the training filter speech features in a hidden layer of the long short-term memory neural network model to obtain neurons carrying an activation-state identifier;
    applying a second activation function to the neurons carrying the activation-state identifier in the hidden layer of the long short-term memory neural network model to obtain output values of the hidden layer; and
    performing error back-propagation updates on the long short-term memory neural network model based on the output values of its hidden layer to obtain the trained ASR-LSTM speech recognition model.
  7. A voice data processing device, comprising:
    an original voice data acquisition module, configured to acquire original voice data;
    a voice-data-to-be-tested acquisition module, configured to frame and segment the original voice data by using a VAD algorithm to obtain at least two frames of voice data to be tested;
    a filter-speech-feature-to-be-tested acquisition module, configured to perform feature extraction on each frame of the voice data to be tested by using an ASR speech feature extraction algorithm to obtain filter speech features to be tested;
    a recognition probability value acquisition module, configured to recognize the filter speech features to be tested by using a trained ASR-LSTM speech recognition model to obtain a recognition probability value; and
    a target voice data acquisition module, configured to use the voice data to be tested as target voice data if the recognition probability value is greater than a preset probability value.
  8. The voice data processing device according to claim 7, wherein the voice-data-to-be-tested acquisition module comprises:
    a single-frame voice data acquisition unit, configured to frame the original voice data to obtain at least two frames of single-frame voice data;
    a first voice data acquisition unit, configured to segment the single-frame voice data by using a short-time energy calculation formula to obtain corresponding short-time energy, and to retain the original voice data whose short-time energy is greater than a first threshold as first voice data; and
    a voice-data-to-be-tested acquisition unit, configured to segment the first voice data by using a zero-crossing rate calculation formula to obtain a corresponding zero-crossing rate, and to retain the original voice data whose zero-crossing rate is greater than a second threshold, obtaining at least two frames of the voice data to be tested.
  9. A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
    acquiring original voice data;
    framing and segmenting the original voice data by using a VAD algorithm to obtain at least two frames of voice data to be tested;
    performing feature extraction on each frame of the voice data to be tested by using an ASR speech feature extraction algorithm to obtain filter speech features to be tested;
    recognizing the filter speech features to be tested by using a trained ASR-LSTM speech recognition model to obtain a recognition probability value; and
    if the recognition probability value is greater than a preset probability value, using the voice data to be tested as target voice data.
  10. The computer device according to claim 9, wherein the framing and segmenting the original voice data by using a VAD algorithm to obtain at least two frames of voice data to be tested comprises:
    framing the original voice data to obtain at least two frames of single-frame voice data;
    segmenting the single-frame voice data by using a short-time energy calculation formula to obtain corresponding short-time energy, and retaining the single-frame voice data whose short-time energy is greater than a first threshold as first voice data; and
    segmenting the first voice data by using a zero-crossing rate calculation formula to obtain a corresponding zero-crossing rate, and retaining the first voice data whose zero-crossing rate is greater than a second threshold to obtain at least two frames of the voice data to be tested.
  11. The computer device according to claim 10, wherein the short-time energy calculation formula is $E(n) = \sum_{m=0}^{N-1} x_n^2(m)$, where N is the frame length of a single frame of voice data, $x_n(m)$ is the single-frame voice data of the n-th frame, $E(n)$ is the short-time energy, and m is the time index; and
    the zero-crossing rate calculation formula is $Z_n = \frac{1}{2}\sum_{m=0}^{N-1}\left|\operatorname{sgn}[x_n(m)] - \operatorname{sgn}[x_n(m-1)]\right|$, where $\operatorname{sgn}[\cdot]$ is the sign function, $x_n(m)$ is the first voice data of the n-th frame, $Z_n$ is the zero-crossing rate, and m is the time index.
  12. The computer device according to claim 9, wherein the performing feature extraction on each frame of the voice data to be tested by using an ASR speech feature extraction algorithm to obtain filter speech features to be tested comprises:
    performing a fast Fourier transform on each frame of the voice data to be tested to obtain a spectrum corresponding to the voice data to be tested; and
    passing the spectrum through a Mel filter bank to obtain the filter speech features to be tested.
  13. The computer device according to claim 9, wherein the processor further implements the following step when executing the computer-readable instructions: pre-training the ASR-LSTM speech recognition model;
    wherein the pre-training the ASR-LSTM speech recognition model comprises:
    acquiring training voice data;
    performing feature extraction on the training voice data by using the ASR speech feature extraction algorithm to obtain training filter speech features; and
    inputting the training filter speech features into a long short-term memory neural network model for training to obtain the trained ASR-LSTM speech recognition model.
  14. The computer device according to claim 13, wherein the inputting the training filter speech features into a long short-term memory neural network model for training to obtain the trained ASR-LSTM speech recognition model comprises:
    applying a first activation function to the training filter speech features in a hidden layer of the long short-term memory neural network model to obtain neurons carrying an activation-state identifier;
    applying a second activation function to the neurons carrying the activation-state identifier in the hidden layer of the long short-term memory neural network model to obtain output values of the hidden layer; and
    performing error back-propagation updates on the long short-term memory neural network model based on the output values of its hidden layer to obtain the trained ASR-LSTM speech recognition model.
  15. One or more non-volatile readable storage media storing computer-readable instructions, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
    acquiring original voice data;
    framing and segmenting the original voice data by using a VAD algorithm to obtain at least two frames of voice data to be tested;
    performing feature extraction on each frame of the voice data to be tested by using an ASR speech feature extraction algorithm to obtain filter speech features to be tested;
    recognizing the filter speech features to be tested by using a trained ASR-LSTM speech recognition model to obtain a recognition probability value; and
    if the recognition probability value is greater than a preset probability value, using the voice data to be tested as target voice data.
  16. The non-volatile readable storage medium according to claim 15, wherein the framing and segmenting the original voice data by using a VAD algorithm to obtain at least two frames of voice data to be tested comprises:
    framing the original voice data to obtain at least two frames of single-frame voice data;
    segmenting the single-frame voice data by using a short-time energy calculation formula to obtain corresponding short-time energy, and retaining the single-frame voice data whose short-time energy is greater than a first threshold as first voice data; and
    segmenting the first voice data by using a zero-crossing rate calculation formula to obtain a corresponding zero-crossing rate, and retaining the first voice data whose zero-crossing rate is greater than a second threshold to obtain at least two frames of the voice data to be tested.
  17. The non-volatile readable storage medium according to claim 16, wherein the short-time energy calculation formula is $E(n) = \sum_{m=0}^{N-1} x_n^2(m)$, where N is the frame length of a single frame of voice data, $x_n(m)$ is the single-frame voice data of the n-th frame, $E(n)$ is the short-time energy, and m is the time index; and
    the zero-crossing rate calculation formula is $Z_n = \frac{1}{2}\sum_{m=0}^{N-1}\left|\operatorname{sgn}[x_n(m)] - \operatorname{sgn}[x_n(m-1)]\right|$, where $\operatorname{sgn}[\cdot]$ is the sign function, $x_n(m)$ is the first voice data of the n-th frame, $Z_n$ is the zero-crossing rate, and m is the time index.
  18. The non-volatile readable storage medium according to claim 15, wherein the performing feature extraction on each frame of the voice data to be tested by using an ASR speech feature extraction algorithm to obtain filter speech features to be tested comprises:
    performing a fast Fourier transform on each frame of the voice data to be tested to obtain a spectrum corresponding to the voice data to be tested; and
    passing the spectrum through a Mel filter bank to obtain the filter speech features to be tested.
  19. The non-volatile readable storage medium according to claim 15, wherein the computer-readable instructions, when executed by the one or more processors, further cause the one or more processors to perform the following step: pre-training the ASR-LSTM speech recognition model;
    wherein the pre-training the ASR-LSTM speech recognition model comprises:
    acquiring training voice data;
    performing feature extraction on the training voice data by using the ASR speech feature extraction algorithm to obtain training filter speech features; and
    inputting the training filter speech features into a long short-term memory neural network model for training to obtain the trained ASR-LSTM speech recognition model.
  20. The non-volatile readable storage medium according to claim 15, wherein the inputting the training filter speech features into a long short-term memory neural network model for training to obtain the trained ASR-LSTM speech recognition model comprises:
    applying a first activation function to the training filter speech features in a hidden layer of the long short-term memory neural network model to obtain neurons carrying an activation-state identifier;
    applying a second activation function to the neurons carrying the activation-state identifier in the hidden layer of the long short-term memory neural network model to obtain output values of the hidden layer; and
    performing error back-propagation updates on the long short-term memory neural network model based on the output values of its hidden layer to obtain the trained ASR-LSTM speech recognition model.
Kind code of ref document: A1