CN116778970B - Voice detection model training method in strong noise environment - Google Patents

Voice detection model training method in strong noise environment

Info

Publication number
CN116778970B
CN116778970B (application CN202311076367.7A)
Authority
CN
China
Prior art keywords
voice
model
noise
data
speech
Prior art date
Legal status
Active
Application number
CN202311076367.7A
Other languages
Chinese (zh)
Other versions
CN116778970A (en)
Inventor
李春霞 (Li Chunxia)
Current Assignee
Changchun Mingxi Technology Co., Ltd.
Original Assignee
Changchun Mingxi Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Changchun Mingxi Technology Co., Ltd.
Priority to CN202311076367.7A
Publication of CN116778970A publication Critical patent/CN116778970A/en
Application granted granted Critical
Publication of CN116778970B publication Critical patent/CN116778970B/en
Legal status: Active


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 - Detection of presence or absence of voice signals
    • G10L25/84 - Detection of presence or absence of voice signals for discriminating voice from noise
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques characterised by the analysis technique, using neural networks
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Noise Elimination (AREA)

Abstract

The application provides a method for training a voice detection model in a strong-noise environment, comprising the following steps: acquiring voice data in a strong-noise environment and preprocessing it; performing sliding-window segmentation and converting the original voice signal into a spectral representation via the Fourier transform; feeding the spectra into a convolutional neural network (CNN) to extract meaningful voice feature data; introducing a bidirectional long short-term memory (BLSTM) progressive learning model to estimate corpus-level progressive ratio masks (PRMs), then incorporating the estimated masks into a minima-controlled recursive averaging procedure to construct the voice detection model, and optimizing the model's parameters by computing the loss with an improved optimization algorithm; and continuously optimizing and fine-tuning the voice detection model according to user feedback and model performance. The method adaptively adjusts the trade-off between noise reduction and speech distortion, achieving adaptive optimization across a variety of noise environments; by exploiting the information provided by the PRMs, the model estimates noise more accurately, further improving voice detection.

Description

Voice detection model training method in strong noise environment
Technical Field
The application relates to the technical field of voice detection, and in particular to a method for training a voice detection model in a strong-noise environment.
Background
Training of voice detection models is one of the key technologies and core components required for voice reconnaissance, and detection accuracy directly determines how much voice information can be gathered. Voice application scenarios are steadily growing richer, and different scenarios are typically accompanied by noise; because of this noise, recognition accuracy remains a persistent problem. Although the prior art has improved markedly, existing training methods still achieve insufficient recognition accuracy in specific noise environments, which motivates the present method for training a voice detection model in a strong-noise environment.
Disclosure of Invention
To solve the above problems, the present application proposes a method for training a speech detection model in a strong-noise environment, addressing the insufficient recognition accuracy that existing speech detection models built for a specific noise environment exhibit in actual detection.
The application is realized by the following technical scheme:
the application provides a training method of a voice detection model in a strong noise environment, which comprises the following steps:
s1: acquiring voice data in an in-situ recording in a strong noise environment, and preprocessing the voice data;
s2: sliding window segmentation is carried out on the preprocessed voice data, and original voice signals are converted into frequency spectrum representation through Fourier transformation on each segment;
s3: inputting the frequency spectrum into a convolutional neural network CNN, and automatically extracting meaningful voice characteristic data from the input data;
s4: after a two-way long-short-term memory progressive learning model is introduced according to voice characteristic data to estimate a progressive ratio mask of a corpus level, the estimated progressive ratio mask is incorporated into a minimum control recursive average method program to construct a voice detection model, and parameter optimization is carried out on the model by improving the calculation loss of an optimization algorithm;
s5: and continuously optimizing and fine-tuning the voice detection model according to the user feedback and the model performance.
Further, the step of acquiring voice data from field recordings in a strong-noise environment and preprocessing the voice data comprises:
acquiring, via the voice acquisition module, voice data in different noise environments; removing silent segments based on the audio signal strength of the acquired data; assigning a label to each audio sample; and standardizing the voice data with the Z-Score method.
Further, the step of performing sliding-window segmentation on the preprocessed voice data and converting each segment of the original voice signal into a spectral representation via the Fourier transform comprises:
setting the frame length and frame shift of the window according to the actual task, so that consecutive audio segments overlap; applying a window function to each frame to suppress spectral leakage; and computing the intensity of each frequency component of every frame via the Fourier transform, thereby obtaining the spectrum.
Further, the step of feeding the spectrum into the convolutional neural network CNN and automatically extracting meaningful voice feature data from the input comprises:
after the spectrum is fed into the CNN, the convolution kernels slide over the input data and perform their computations, and the CNN automatically identifies important frequency patterns, harmonic structures, and timbre characteristics.
Further, the step of estimating corpus-level progressive ratio masks with the bidirectional long short-term memory (BLSTM) progressive learning model introduced according to the voice feature data, incorporating the estimated masks into the minima-controlled recursive averaging procedure to construct the voice detection model, and optimizing the model's parameters by computing the loss with the improved optimization algorithm comprises:
predicting progressive ratio masks (PRMs) with a BLSTM used as a regression model, wherein the PRMs are generated by the intermediate layers and serve as learning targets corresponding to the ratio between clean speech and noise, i.e. the "mask"; taking log-power-spectrum (LPS) features as the input of the voice detection model and the ideal ratio mask (IRM) as the output to obtain a series of PRMs that help trade off noise reduction against speech distortion; adaptively controlling that trade-off; estimating the noise accurately from the information provided by the PRMs; and computing the loss with the improved optimization algorithm, based on a weighted MMSE criterion over the m target layers, to optimize the parameters.
Further, the step of obtaining a series of progressive ratio masks (PRMs) that help trade off noise reduction against speech distortion, with the log-power-spectrum (LPS) features as the input of the voice detection model and the ideal ratio mask (IRM) as the output, comprises:
The PRMs trade off noise reduction against speech distortion and are defined as

$$\mathrm{PRM}_m(t, f) = \frac{|S(t, f)| + |N_m(t, f)|}{|Y(t, f)|}$$

where $t$ is the time frame, $f$ is the frequency bin, $S(t, f)$ is the short-time Fourier transform of the speech signal at time frame $t$ and frequency bin $f$, $N_m(t, f)$ is the short-time Fourier transform of the noise associated with the $m$-th progressive-ratio-mask target at that T-F unit, and $Y(t, f)$ is the noisy short-time Fourier transform of the input signal at that T-F unit.
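A numpy sketch of this definition; the clipping to [0, 1] and the epsilon guard are assumptions added for numerical safety, and the formula itself is reconstructed from the variable descriptions above:

```python
import numpy as np

def progressive_ratio_mask(S, N_m, Y, eps=1e-8):
    """PRM_m(t, f) = (|S(t, f)| + |N_m(t, f)|) / |Y(t, f)| per T-F unit.
    S, N_m, Y: complex STFT matrices (frames x bins) of the clean speech,
    the noise of the m-th target, and the noisy input, respectively."""
    mask = (np.abs(S) + np.abs(N_m)) / (np.abs(Y) + eps)
    return np.clip(mask, 0.0, 1.0)  # masks are ratios, kept in [0, 1]
```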
Further, the improved optimization algorithm is

$$E = \sum_{m=1}^{M} \alpha_m \left\| \hat{\mathbf{x}}^{m}(\mathbf{W}, \mathbf{b}) - \mathbf{x}^{m} \right\|_2^2$$

where $\alpha_m$ is the weighting factor of the $m$-th target layer, $(\mathbf{W}, \mathbf{b})$ is the set of weight matrices and bias vectors, $\hat{\mathbf{x}}^{m}(\mathbf{W}, \mathbf{b})$ is the neural network output at the $m$-th target layer, and $\mathbf{x}^{m}$ is the corresponding reference target.
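In code, this weighted multi-target MMSE loss reduces to a few lines; the list-of-tensors interface and the example weights are assumptions:

```python
import torch

def weighted_mmse_loss(outputs, targets, alphas):
    """E = sum_m alpha_m * ||x_hat_m - x_m||^2 (mean over elements).
    outputs/targets: lists of (batch, frames, bins) tensors, one pair per
    target layer; alphas: per-layer weighting factors."""
    return sum(a * torch.mean((o - t) ** 2)
               for a, o, t in zip(alphas, outputs, targets))

# e.g. loss = weighted_mmse_loss(masks, prm_targets, alphas=[0.2, 0.3, 0.5])
```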
Further, the step of continuously optimizing and fine-tuning the voice detection model according to user feedback and model performance comprises:
if the model performs poorly in some situations, continuously optimizing the voice detection model by adding noise of various types and levels from a public database to the clean voice data, thereby generating more training samples.
The beneficial effects of the application are as follows: feature data are extracted after the voice data are preprocessed; a bidirectional long short-term memory progressive learning model estimates corpus-level progressive ratio masks from the voice feature data, the estimated masks are incorporated into a minima-controlled recursive averaging procedure to construct the voice detection model, and the model's parameters are optimized by computing the loss with an improved optimization algorithm; the method adaptively adjusts the trade-off between noise reduction and speech distortion, thereby achieving adaptive optimization across a variety of noise environments; moreover, the information provided by the PRMs allows a more accurate detection model to be constructed.
Drawings
Fig. 1 is a flow chart of the method for training a voice detection model in a strong-noise environment according to the present application.
The realization, functional characteristics and advantages of the present application are further described with reference to the accompanying drawings in combination with the embodiments.
Detailed Description
In order to more clearly and completely describe the technical scheme of the application, the application is further described below with reference to the accompanying drawings.
Referring to Fig. 1, the present application proposes a method for training a voice detection model in a strong-noise environment, comprising the following steps:
S1: acquiring voice data from field recordings in a strong-noise environment, and preprocessing the voice data;
S2: performing sliding-window segmentation on the preprocessed voice data, and converting each segment of the original voice signal into a spectral representation via the Fourier transform;
S3: feeding the spectra into a convolutional neural network (CNN), which automatically extracts meaningful voice feature data from the input;
S4: introducing a bidirectional long short-term memory (BLSTM) progressive learning model to estimate corpus-level progressive ratio masks from the voice feature data, incorporating the estimated masks into a minima-controlled recursive averaging procedure to construct the voice detection model, and optimizing the model's parameters by computing the loss with an improved optimization algorithm;
S5: continuously optimizing and fine-tuning the voice detection model according to user feedback and model performance.
In a specific implementation, voice data recorded in the field in a strong-noise environment are obtained and preprocessed. Voice is recorded in different noise environments through professional audio equipment such as microphones, and additional recordings are made with devices such as mobile phones; that is, input voice is collected from a variety of device sources. The data should cover as many voice and noise types as possible, including different speakers, accents, speaking rates, and a variety of noise environments. Although the goal is to detect speech in noisy environments, some noise, such as a light breeze or rain, does not help model training; Fourier-transform or spectral-subtraction methods can be used to reduce such extraneous noise. Before the voice data are input into the model, they must be normalized to the same volume level; otherwise the model would be biased toward larger signals and ignore smaller ones.

Sliding-window segmentation is then performed on the preprocessed voice data. Because the audio signal changes over time, the whole signal cannot be processed at once and must first be segmented. Specifically, a window length is chosen, for example 20 ms, and the window is then slid forward at a regular interval, for example 10 ms, taking the signal inside the window as one segment; every two adjacent segments thus overlap by 50%, which preserves the continuity of the signal. Each segment of the original speech signal is converted into a spectral representation by the Fourier transform. The Fourier transform converts a time-domain signal into a frequency-domain signal; in audio processing one is usually interested in the frequency components of the signal, that is, the intensity of each frequency component, rather than the raw waveform. Each segmented signal is therefore converted into a spectral representation: for each segment, the intensities of its frequency components are computed, yielding a spectrum.

The spectrum is input into a convolutional neural network (CNN), which automatically extracts meaningful voice feature data from the input. A CNN is a deep learning model that extracts meaningful features automatically through convolution kernels and pooling layers. Specifically, the convolution kernels slide over the input data and perform their computations, producing new feature maps; the pooling layers then reduce the dimensionality of the feature maps, cutting the amount of computation and strengthening the model's generalization. In audio processing, these strengths let the CNN automatically extract meaningful speech features from the spectrum, identifying important frequency patterns, harmonic structures, timbres, and similar characteristics.

A bidirectional long short-term memory (BLSTM) progressive learning model is then introduced to estimate corpus-level progressive ratio masks from the voice feature data. A BLSTM is a neural network that can gather information from both the forward and backward directions of an input sequence, so it understands context better than a unidirectional LSTM; this is very useful in speech recognition, since speech is continuous and both the preceding and the following context affect the current interpretation. In this framework, the BLSTM is used as a regression model to predict progressive ratio masks (PRMs). The estimated masks are incorporated into a minima-controlled recursive averaging procedure to construct the voice detection model, and the model's parameters are optimized by computing the loss with an improved optimization algorithm. The PRMs are generated by the intermediate layers and serve as learning targets; they correspond to the ratio between clean speech and noise, i.e. the "mask". Log-power-spectrum (LPS) features are used as the model input and the ideal ratio mask (IRM) as the output, yielding a series of PRMs that help balance noise reduction against speech distortion. Concretely, focusing only on noise reduction may suppress noise excessively and distort the speech, while focusing only on preserving voice quality may fail to remove the noise; the PRMs find the balance point between the two, reducing noise while preserving voice quality. Finally, the voice detection model is continuously optimized and fine-tuned according to user feedback and model performance: if the model performs poorly in some situations, it is optimized further by adding noise of various types and levels from a public database to clean voice data, generating more training samples.
In this embodiment, step S1 comprises: acquiring, via the voice acquisition module, voice data in different noise environments, including indoor, outdoor, quiet, and traffic-noise scenes, among others; in each environment a large number of voice samples must be collected to ensure the model performs well under all conditions. Silent segments are removed based on the audio signal strength of the acquired data, because silent segments carry no voice information and only add noise and complexity to the data; removing them improves the model's accuracy and efficiency. A label is assigned to each audio sample so that the model can be trained: labels may be binary, such as whether the sample contains speech, or multi-class, such as the category or emotion of the speech; labels are annotated by professionals or generated by automated tools. The voice data are standardized by the Z-Score method: during preprocessing, data usually must be standardized to eliminate differences in scale. Z-Score standardization, a common method, subtracts the mean and divides by the standard deviation; the resulting data have zero mean and unit standard deviation, which makes the model easier to train and prevents features with large value ranges from dominating the learning process.
In this embodiment, step S2 comprises: setting the frame length and frame shift of the window for the preprocessed voice data according to the actual task, so that consecutive audio segments overlap. When framing the speech signal, a suitable window length and frame shift must be chosen: the window length determines how much data each computation uses, while the frame shift determines the step size of each move; setting these two parameters guarantees a certain overlap between consecutive audio segments, capturing the continuity information in the speech signal. A window function is applied to each frame to suppress spectral leakage, and the intensity of each frequency component is then computed by the Fourier transform, yielding the spectrum. A window function reshapes the input signal, mainly to suppress spectral leakage, a phenomenon that arises when a non-periodic signal is Fourier-transformed and that distorts the result of spectral analysis. After the window function is applied, each frame of the audio signal can be Fourier-transformed; the Fourier transform converts a time-domain signal into a frequency-domain signal and helps analyze the intensity of each frequency component of the speech signal. These frequency components and their corresponding intensities make up the spectrum, from which information such as pitch and timbre can be read more intuitively.
In this embodiment, step S3 comprises: a convolutional neural network (CNN) is a deep learning algorithm; the preceding stage converts the audio signal into spectral form, which is then input into the CNN as a 2-D image. After the spectrum is input, the convolution kernels (filters) slide over the input data and perform their computations, helping the network automatically identify important frequency patterns, harmonic structures, timbre characteristics, and the like. A CNN has two main layer types, convolution layers and pooling layers: in a convolution layer, each kernel convolves the input data and produces a feature map reflecting a specific feature of the input; a pooling layer downsamples the feature maps output by the convolution layer, reducing the dimensionality of the data while retaining the important information. CNNs extract spectral features effectively when processing audio data because they automatically learn and recognize complex patterns and structures, making them an ideal choice for tasks such as audio signal classification and recognition.
In this embodiment, step S4 comprises: predicting progressive ratio masks (PRMs) with a BLSTM used as a regression model. A bidirectional long short-term memory network (BLSTM) is a recurrent neural network that captures not only past but also future information; when processing audio data, it can serve as a regression model to predict the progressive ratio masks. The PRMs are generated by the intermediate layers and serve as learning targets; they correspond to the ratio between clean speech and noise, i.e. the "mask". Log-power-spectrum (LPS) features, one of the important representations of an audio signal, serve as the model input; the ideal ratio mask (IRM), an effective masking strategy that defines the ideal energy proportion each frequency component should have in order to extract clean speech from a noisy signal, serves as the output. This yields a series of PRMs that help balance noise reduction against speech distortion; by adaptively controlling that trade-off, the noise can be estimated accurately from the information the PRMs provide. The loss is then computed with the improved optimization algorithm according to a weighted minimum-mean-square-error (MMSE) criterion over the m target layers, so as to optimize the parameters; because the criterion considers m target layers, the model becomes finer-grained and better captures the complex structures and patterns of the audio signal. Through continuous optimization and adjustment, the model's performance keeps improving, handling audio signals in high-noise environments ever better. The PRMs trade off noise reduction against speech distortion and are defined as

$$\mathrm{PRM}_m(t, f) = \frac{|S(t, f)| + |N_m(t, f)|}{|Y(t, f)|}$$

where $t$ is the time frame, $f$ is the frequency bin, $S(t, f)$ is the short-time Fourier transform of the speech signal at time frame $t$ and frequency bin $f$, $N_m(t, f)$ is the short-time Fourier transform of the noise associated with the $m$-th progressive-ratio-mask target at that T-F unit, and $Y(t, f)$ is the noisy short-time Fourier transform of the input signal at that T-F unit.

The improved optimization algorithm is

$$E = \sum_{m=1}^{M} \alpha_m \left\| \hat{\mathbf{x}}^{m}(\mathbf{W}, \mathbf{b}) - \mathbf{x}^{m} \right\|_2^2$$

where $\alpha_m$ is the weighting factor of the $m$-th target layer, $(\mathbf{W}, \mathbf{b})$ is the set of weight matrices and bias vectors, $\hat{\mathbf{x}}^{m}(\mathbf{W}, \mathbf{b})$ is the neural network output at the $m$-th target layer, and $\mathbf{x}^{m}$ is the corresponding reference target. By combining the advantage of estimating PRMs within a progressive learning framework with the conventional IMCRA, satisfactory ASR results are achieved without retraining the acoustic model: corpus-level PRMs are estimated by the BLSTM progressive learning model; the estimated PRMs are incorporated into the improved minima-controlled recursive averaging (IMCRA) procedure; and the PRMs are then combined with the IMCRA gain function, providing a new gain function for recovering the clean speech signal frame by frame, as sketched below. The method adaptively adjusts the trade-off between noise reduction and speech distortion, thereby achieving adaptive optimization across a variety of noise environments; moreover, the information provided by the PRMs supports a more accurate detection model that estimates the noise more precisely.
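The following heavily simplified sketch illustrates the idea; the full IMCRA minima tracking and bias compensation are omitted, and using the estimated PRM directly as a speech-presence proxy, as well as the Wiener-style gain and its floor, are assumptions for illustration only:

```python
import numpy as np

def prm_guided_noise_update(noise_psd, Y, prm, alpha=0.85):
    """Recursive averaging of the noise PSD: T-F units the PRM marks as
    speech-dominated (prm near 1) keep the old estimate, while
    noise-dominated units (prm near 0) update quickly toward |Y|^2."""
    alpha_tf = alpha + (1.0 - alpha) * prm      # per-T-F smoothing factor
    return alpha_tf * noise_psd + (1.0 - alpha_tf) * np.abs(Y) ** 2

def spectral_gain(Y, noise_psd, floor=0.05, eps=1e-12):
    """Wiener-style gain from the updated noise estimate, with a gain floor
    to limit speech distortion; applied frame by frame to recover speech."""
    snr_post = np.abs(Y) ** 2 / (noise_psd + eps)
    gain = np.maximum(1.0 - 1.0 / np.maximum(snr_post, 1.0), floor)
    return gain * Y                             # enhanced STFT frame
```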
Of course, the present application can be implemented in various other embodiments; based on this embodiment, those skilled in the art can derive other embodiments without any inventive effort, all of which fall within the scope of the present application.

Claims (4)

1. A method for training a voice detection model in a strong-noise environment, characterized by comprising the following steps:
S1: acquiring voice data from field recordings in a strong-noise environment, and preprocessing the voice data;
S2: performing sliding-window segmentation on the preprocessed voice data, and converting each segment of the original voice signal into a spectral representation via the Fourier transform;
S3: feeding the spectra into a convolutional neural network CNN, which automatically extracts meaningful voice feature data from the input;
S4: introducing a bidirectional long short-term memory progressive learning model to estimate corpus-level progressive ratio masks from the voice feature data, incorporating the estimated masks into a minima-controlled recursive averaging procedure to construct the voice detection model, and optimizing the model's parameters by computing the loss with an improved optimization algorithm;
S5: continuously optimizing and fine-tuning the voice detection model according to user feedback and model performance;
wherein the step of feeding the spectrum into the convolutional neural network CNN and automatically extracting meaningful voice feature data from the input comprises:
after the spectrum is fed into the CNN, the convolution kernels slide over the input data and perform their computations, and the CNN automatically identifies important frequency patterns, harmonic structures, and timbre characteristics;
wherein the step of estimating corpus-level progressive ratio masks with the bidirectional long short-term memory progressive learning model introduced according to the voice feature data, incorporating the estimated masks into the minima-controlled recursive averaging procedure to construct the voice detection model, and optimizing the model's parameters by computing the loss with the improved optimization algorithm comprises:
predicting progressive ratio masks PRMs with a BLSTM used as a regression model, wherein the PRMs are generated by the intermediate layers and serve as learning targets corresponding to the ratio between clean speech and noise, i.e. the "mask"; taking log-power-spectrum LPS features as the input of the voice detection model and the ideal ratio mask IRM as the output to obtain a series of progressive ratio masks PRMs that help trade off noise reduction against speech distortion; adaptively controlling that trade-off; estimating the noise accurately from the information provided by the PRMs; and computing the loss with the improved optimization algorithm, based on a weighted MMSE criterion over the m target layers, to optimize the parameters; wherein the step of obtaining the series of progressive ratio masks PRMs that help trade off noise reduction against speech distortion, with the log-power-spectrum LPS features as the input of the voice detection model and the ideal ratio mask IRM as the output, comprises:
the PRMs trade off noise reduction against speech distortion and are defined as

$$\mathrm{PRM}_m(t, f) = \frac{|S(t, f)| + |N_m(t, f)|}{|Y(t, f)|}$$

where $t$ is the time frame, $f$ is the frequency bin, $S(t, f)$ is the short-time Fourier transform of the speech signal at time frame $t$ and frequency bin $f$, $N_m(t, f)$ is the short-time Fourier transform of the noise associated with the $m$-th progressive-ratio-mask target at that T-F unit, and $Y(t, f)$ is the noisy short-time Fourier transform of the input signal at that T-F unit; and wherein the improved optimization algorithm is

$$E = \sum_{m=1}^{M} \alpha_m \left\| \hat{\mathbf{x}}^{m}(\mathbf{W}, \mathbf{b}) - \mathbf{x}^{m} \right\|_2^2$$

where $\alpha_m$ is the weighting factor of the $m$-th target layer, $(\mathbf{W}, \mathbf{b})$ is the set of weight matrices and bias vectors, $\hat{\mathbf{x}}^{m}(\mathbf{W}, \mathbf{b})$ is the neural network output at the $m$-th target layer, and $\mathbf{x}^{m}$ is the corresponding reference target.
2. The method for training a voice detection model in a strong-noise environment according to claim 1, wherein the step of acquiring voice data from field recordings in a strong-noise environment and preprocessing the voice data comprises:
acquiring, via the voice acquisition module, voice data in different noise environments; removing silent segments based on the audio signal strength of the acquired data; assigning a label to each audio sample; and standardizing the voice data with the Z-Score method.
3. The method according to claim 1, wherein the step of performing sliding-window segmentation on the preprocessed voice data and converting each segment of the original voice signal into a spectral representation via the Fourier transform comprises:
setting the frame length and frame shift of the window according to the actual task, so that consecutive audio segments overlap; applying a window function to each frame to suppress spectral leakage; and computing the intensity of each frequency component of every frame via the Fourier transform, thereby obtaining the spectrum.
4. The method according to claim 1, wherein the step of continuously optimizing and fine-tuning the voice detection model according to user feedback and model performance comprises:
if the model performs poorly in some situations, continuously optimizing the voice detection model by adding noise of various types and levels from a public database to the clean voice data, thereby generating more training samples.
CN202311076367.7A 2023-08-25 2023-08-25 Voice detection model training method in strong noise environment Active CN116778970B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311076367.7A CN116778970B (en) 2023-08-25 2023-08-25 Voice detection model training method in strong noise environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311076367.7A CN116778970B (en) 2023-08-25 2023-08-25 Voice detection model training method in strong noise environment

Publications (2)

Publication Number Publication Date
CN116778970A CN116778970A (en) 2023-09-19
CN116778970B true CN116778970B (en) 2023-11-24

Family

ID=87993485

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311076367.7A Active CN116778970B (en) 2023-08-25 2023-08-25 Voice detection model training method in strong noise environment

Country Status (1)

Country Link
CN (1) CN116778970B (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10672414B2 (en) * 2018-04-13 2020-06-02 Microsoft Technology Licensing, Llc Systems, methods, and computer-readable media for improved real-time audio processing

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107452389A (en) * 2017-07-20 2017-12-08 大象声科(深圳)科技有限公司 A kind of general monophonic real-time noise-reducing method
WO2020042706A1 (en) * 2018-08-31 2020-03-05 大象声科(深圳)科技有限公司 Deep learning-based acoustic echo cancellation method
CN110473564A (en) * 2019-07-10 2019-11-19 西北工业大学深圳研究院 A kind of multi-channel speech enhancement method based on depth Wave beam forming
CN110867181A (en) * 2019-09-29 2020-03-06 北京工业大学 Multi-target speech enhancement method based on SCNN and TCNN joint estimation
CN113077812A (en) * 2021-03-19 2021-07-06 北京声智科技有限公司 Speech signal generation model training method, echo cancellation method, device and equipment
CN113936681A (en) * 2021-10-13 2022-01-14 东南大学 Voice enhancement method based on mask mapping and mixed hole convolution network

Also Published As

Publication number Publication date
CN116778970A (en) 2023-09-19


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant