CN116778913A - Speech recognition method and system for enhancing noise robustness - Google Patents

Speech recognition method and system for enhancing noise robustness

Info

Publication number
CN116778913A
CN116778913A
Authority
CN
China
Prior art keywords
voice data
noise
clean
layer
noisy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311075628.3A
Other languages
Chinese (zh)
Other versions
CN116778913B (en)
Inventor
柯登峰
王运峰
陈立德
徐艳艳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ocdop Ltd
Beijing Forestry University
Original Assignee
Ocdop Ltd
Beijing Forestry University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ocdop Ltd, Beijing Forestry University filed Critical Ocdop Ltd
Priority to CN202311075628.3A priority Critical patent/CN116778913B/en
Publication of CN116778913A publication Critical patent/CN116778913A/en
Application granted granted Critical
Publication of CN116778913B publication Critical patent/CN116778913B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering
    • G10L 21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L 21/0232 Processing in the frequency domain
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 2015/0635 Training updating or merging of old and new templates; Mean values; Weighting

Abstract

The application relates to the technical field of speech signal processing and discloses a speech recognition method and system for enhancing noise robustness. The method comprises: acquiring noise data and clean voice data with text labels, and generating noisy voice data from the clean voice data and the noise data; preprocessing the clean and noisy voice data to extract the mel spectrum of each; constructing an automatic speech recognition model and inputting the mel spectra of the clean and noisy voice data into it to obtain a recognition result for each; training the model based on the text labels, the recognition result of the clean voice data and the recognition result of the noisy voice data to obtain a trained model; and recognizing noisy voice data with the trained model. The method improves the noise robustness of the automatic speech recognition model.

Description

Speech recognition method and system for enhancing noise robustness
Technical Field
The application relates to the technical field of speech signal processing, and in particular to a speech recognition method and system for enhancing noise robustness.
Background
Automatic speech recognition (ASR) transcribes audio into text. With the development of deep learning, end-to-end ASR has found ever more applications in daily life and is now widely used in phone voice assistants, in-car voice navigation, intelligent robots and similar products. On clean speech, current recognition accuracy is excellent and can even exceed that of human listeners.
Real environments, however, are full of noise. In practice, end-to-end ASR is often degraded by background noise, which severely reduces recognition accuracy, especially at low signal-to-noise ratio (SNR), and can even render the model unusable; this remains a major obstacle to deploying end-to-end ASR in real life. The mainstream remedy is data enhancement: an enhancement module is introduced before the encoding stage of the conventional encoder-decoder model. Enhancement techniques include traditional statistical methods such as Wiener filtering and DNN-based speech enhancement such as time-frequency masking, signal approximation and spectral mapping. The original clean speech is augmented with various noise types at different SNR values and fed to the end-to-end model; the noisy speech features are enhanced so that they match those of the corresponding clean speech, training the model to extract noise-invariant features. The lower layers of the end-to-end model receive larger gradient updates than the higher layers.
However, the speech-enhancement part typically differs from the recognition part, so the enhancement is not optimized for the final objective and yields a suboptimal solution. Moreover, at low SNR the speech information is heavily masked by the noise, and even with extensive training the feature extractor struggles to produce identical features for clean and noisy inputs; the noise-invariance constraint then becomes a burden, causing recognition robustness to drop sharply at low SNR and even degrading recognition at high SNR.
Disclosure of Invention
In view of the foregoing, an object of the present application is to provide a speech recognition method with enhanced noise robustness, which introduces a noise feature repair network layer (NFRN layer) to alleviate the severe degradation of automatic speech recognition performance at low signal-to-noise ratio and thereby strengthen the noise robustness of the automatic speech recognition model.
A second object of the application is to provide a speech recognition system with enhanced noise robustness.
The first technical solution adopted by the application is as follows: a speech recognition method for enhancing noise robustness, comprising the steps of:
S100: acquiring noise data and clean voice data with text labels, and generating noisy voice data from the clean voice data and the noise data; preprocessing the clean voice data and the noisy voice data to extract the mel spectrum of each;
S200: constructing an automatic speech recognition model comprising a noise feature repair network layer, an encoding layer and a decoding layer, the noise feature repair network layer comprising three convolution layers, three transposed-convolution (deconvolution) layers and a Sigmoid activation function connected in sequence, each convolution layer and each transposed-convolution layer being followed by a batch normalization operation and a ReLU activation function;
S300: inputting the mel spectra of the clean voice data and the noisy voice data into the automatic speech recognition model to obtain a recognition result for each, and training the model on the text labels, the recognition result of the clean voice data and the recognition result of the noisy voice data to obtain a trained automatic speech recognition model;
S400: recognizing noisy voice data with the trained automatic speech recognition model.
Preferably, the preprocessing in step S100 comprises:
resampling the clean voice data and the noisy voice data;
applying pre-emphasis to the resampled clean and noisy voice data; processing with a short-time Fourier transform; and finally converting to the mel spectrum, yielding the mel spectrum of the clean voice data and the mel spectrum of the noisy voice data.
Preferably, step S300 comprises the following sub-steps:
S301: inputting the mel spectra of the clean voice data and the noisy voice data into the noise feature repair network layer to obtain noise feature repair weights for each; multiplying the mel spectrum of the clean voice data by its repair weights and the mel spectrum of the noisy voice data by its repair weights, yielding the repaired speech features of the clean voice data and of the noisy voice data;
S302: inputting the repaired speech features of the clean and noisy voice data into the encoding layer to obtain an encoding result for each;
S303: inputting the encoding results of the clean and noisy voice data into the decoding layer to obtain the recognition result of the clean voice data and the recognition result of the noisy voice data.
Preferably, the encoding layer of the automatic speech recognition model consists of six Conformer-Des layers; the Conformer-Des contains densely connected convolution modules.
Preferably, the Conformer-Des comprises a layer normalization module, two densely connected convolution modules whose outputs are scaled by 1/2, a convolution module and a multi-head self-attention module; the first densely connected convolution module is connected to the multi-head self-attention module, which is followed in sequence by the convolution module, the second densely connected convolution module and the layer normalization module.
Preferably, the densely connected convolution module comprises, connected in sequence, a layer normalization, a first densely connected convolution layer, a Swish activation function, a first dropout, a second densely connected convolution layer and a second dropout.
Preferably, the decoding layer of the automatic speech recognition model consists of six Transformer decoder layers;
each decoder layer comprises four modules: a self-attention module, a layer normalization module, an encoder-decoder attention mechanism and a feed-forward neural network.
Preferably, step S300 comprises:
computing a cross-entropy loss from the text labels and the recognition result of the clean voice data, and a cross-entropy loss from the recognition result of the clean voice data and the recognition result of the noisy voice data; training on these losses until convergence to obtain the trained automatic speech recognition model.
The second technical solution adopted by the application is as follows: a speech recognition system for enhancing noise robustness, comprising a preprocessing module, an automatic speech recognition model construction module, a training module and a recognition module;
the preprocessing module acquires noise data and clean voice data with text labels, generates noisy voice data from the clean voice data and the noise data, and preprocesses the clean and noisy voice data to extract the mel spectrum of each;
the model construction module constructs an automatic speech recognition model comprising a noise feature repair network layer, an encoding layer and a decoding layer, the noise feature repair network layer comprising three convolution layers, three transposed-convolution layers and a Sigmoid activation function connected in sequence, each convolution and transposed-convolution layer being followed by batch normalization and a ReLU activation function;
the training module inputs the mel spectra of the clean and noisy voice data into the model to obtain a recognition result for each, and trains the model on the text labels, the recognition result of the clean voice data and the recognition result of the noisy voice data to obtain a trained automatic speech recognition model;
the recognition module recognizes noisy voice data with the trained automatic speech recognition model.
The beneficial effects of the technical solution are as follows:
(1) The disclosed speech recognition method repairs noisy speech information through a proposed Noisy Feature Repair Network (NFRN) built from convolution and deconvolution operations. It allows the feature-repair and encoding/decoding results of clean voice data and the corresponding noisy voice data to differ, granting the NFRN layer, encoding layer and decoding layer the greatest possible learning capacity and constraining only the final output, which effectively improves the noise robustness of the automatic speech recognition model at low signal-to-noise ratio;
(2) The disclosed method introduces a noise feature repair network layer (NFRN layer) to address the severe drop in recognition performance of an automatic speech recognition model at low signal-to-noise ratio, thereby enhancing its noise robustness.
Drawings
FIG. 1 is a flow diagram of a speech recognition method for enhancing noise robustness according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an automatic speech recognition model according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a noise feature repair network layer (NFRN layer) according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a Conformer-Des according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a densely connected convolution module according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a speech recognition system with enhanced noise robustness according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described in further detail below with reference to the accompanying drawings and examples. The following detailed description and drawings illustrate the principles of the application but do not limit its scope, which is defined by the claims; that is, the application is not limited to the preferred embodiments described.
In the description of the present application, it should be noted that, unless otherwise indicated, "plurality" means two or more; the terms "first", "second" and the like are used for description only and are not to be construed as indicating or implying relative importance; the specific meaning of the above terms in the present application can be understood by those of ordinary skill in the art as appropriate.
Example 1
As shown in FIG. 1, one embodiment of the present application provides a speech recognition method for enhancing noise robustness, comprising the steps of:
S100: acquiring noise data and clean voice data with text labels, and generating noisy voice data from the clean voice data and the noise data; preprocessing the clean voice data and the noisy voice data to extract the mel spectrum of the clean voice data and the mel spectrum of the noisy voice data;
The English LibriSpeech dataset and its transcripts are collected and organized to obtain clean voice data with text labels; the train-clean-100 subset of LibriSpeech is used as the training set, the dev-clean subset as the validation set and the test-clean subset as the test set. Noise data of seven types are obtained from the Freesound website.
One noise type is randomly selected from the seven types, a noise segment is randomly selected within that type, a signal-to-noise ratio is randomly drawn from [0, 5, 10, 15, 20, 25] dB, and the noise segment is added at that ratio to the corresponding clean voice data (clean voice data in the training or test set) to form the noisy voice data.
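By way of illustration (the patent gives no code), a minimal sketch of this mixing step, assuming power-based scaling of the noise to reach the target SNR and a hypothetical helper name:

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add a noise segment to a clean utterance at a target SNR (sketch of S100)."""
    # Tile or trim the noise so it covers the whole clean utterance.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]
    # Scale the noise so that 10*log10(P_clean / P_noise) equals snr_db.
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise
```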
For example, for testing, 120 different audio files are randomly selected from the test-clean subset of LibriSpeech; 56 different noise clips are taken from each of the seven noise types, a signal-to-noise ratio is selected from [0, 5, 10, 15, 20] dB, and the noise is added to the 120 clean utterances, yielding 4200 test items with different signal-to-noise ratios and noise types (120 utterances × 7 noise types × 5 signal-to-noise ratios).
Preprocessing the clean voice data and the noisy voice data to extract the mel spectrum of each specifically comprises the following steps:
resampling the clean and noisy voice data according to the parameters in Table 1, converting all audio to a 16 kHz sampling rate, and applying pre-emphasis with coefficient 0.97 to the resampled data; then applying the short-time Fourier transform (STFT) with a frame shift of 256 samples and a window/frame length of 1024 samples; and finally converting to the mel spectrum with a bank of 80 mel filters spanning 0 Hz to 8000 Hz, yielding the mel spectrum of the clean voice data and the mel spectrum of the noisy voice data.
TABLE 1 Audio parameters

Parameter | Value
Sampling rate | 16 kHz
Pre-emphasis coefficient | 0.97
Frame shift (hop length) | 256 samples
Window/frame length | 1024 samples
Number of mel filters | 80
Minimum frequency | 0 Hz
Maximum frequency | 8000 Hz
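As an illustration of this preprocessing chain (resample, pre-emphasis, STFT, mel), the sketch below uses librosa with the Table 1 parameters; the helper name and the choice of librosa are our assumptions:

```python
import librosa
import numpy as np

def extract_mel(wav: np.ndarray, sr: int) -> np.ndarray:
    """Resample -> 0.97 pre-emphasis -> STFT (1024/256) -> 80-band mel spectrum."""
    if sr != 16000:
        wav = librosa.resample(wav, orig_sr=sr, target_sr=16000)
    wav = librosa.effects.preemphasis(wav, coef=0.97)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=16000,
        n_fft=1024, win_length=1024, hop_length=256,  # STFT settings of Table 1
        n_mels=80, fmin=0, fmax=8000,                 # 80 mel filters, 0-8000 Hz
    )
    return mel  # shape: (80, number of frames)
```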
S200: constructing an automatic speech recognition model; as shown in FIG. 2, the model comprises a noise feature repair network layer (NFRN layer), an encoding layer (Encoder) and a decoding layer (Decoder), the decoding layer being followed by a Softmax activation function.
(1) A noise feature repair network layer (NFRN layer);
as shown in fig. 3, three layers of convolution operations are sequentially performed in front of the noise feature repair network layer (NFRN layer), all convolution kernels are 3×3, and the channel numbers are respectively 8, 16 and 32 in an increasing manner; each layer of convolution operation is followed by a Batch normalization (Batch Norm) operation and a RELU activation function. After three layers of convolution, three layers of reverse convolution (namely reverse convolution) operations are sequentially connected, the sizes of all the reverse convolution kernels are 3 multiplied by 3, the channel numbers are respectively reduced by 16, 8 and 1, and a batch of normalization and RELU activation functions are also connected after each layer of reverse convolution operation. After all convolution operations and reverse convolution operations are completed, a Sigmoid activation function is finally connected, and the Sigmoid activation function has the function of normalizing the result between [0,1] to finally obtain the output result of the NFRN layer.
(2) An encoding layer (Encoder);
The encoding layer consists of six Conformer-Des layers; Conformer-Des is this application's modification of the conventional Conformer, in which the feed-forward neural network modules are replaced with densely connected convolution modules. As shown in FIG. 4, the Conformer-Des comprises a layer normalization module, two densely connected convolution modules whose outputs are scaled by 1/2, a convolution module and a multi-head self-attention module; the first densely connected convolution module is connected to the multi-head self-attention module, which is followed in sequence by the convolution module, the second densely connected convolution module and the layer normalization module. Each layer adds a residual connection.
The Conformer mainly adds a Conformer module on top of the Transformer. The Conformer module contains two feed-forward neural network modules, each scaled by 1/2 and consisting of two linear transformations and a nonlinear activation function; a multi-head self-attention module and a convolution module are sandwiched between the two feed-forward modules in a Macaron-like arrangement, followed by layer normalization. Each layer adds a residual connection so that information from the upper layer is passed effectively to the lower layer. The convolution module contains a gating mechanism consisting of a pointwise convolution, a gated linear unit (GLU) and a one-dimensional depthwise convolution, followed by batch normalization (BatchNorm) to help train deeper models.
Given the input $x_i$ of the $i$-th Conformer module, the output $y_i$ of the module is obtained as:

$$\tilde{x}_i = x_i + \tfrac{1}{2}\,\mathrm{FFN}(x_i)$$
$$x'_i = \tilde{x}_i + \mathrm{MHSA}(\tilde{x}_i)$$
$$x''_i = x'_i + \mathrm{Conv}(x'_i)$$
$$y_i = \mathrm{Layernorm}\!\left(x''_i + \tfrac{1}{2}\,\mathrm{FFN}(x''_i)\right)$$

where $x_i$ is the input of the $i$-th Conformer module; $\mathrm{FFN}$ denotes the feed-forward neural network module and $\tilde{x}_i$ is the output of the first feed-forward module in the $i$-th Conformer module; $\mathrm{MHSA}$ denotes the multi-head self-attention module and $x'_i$ is its output; $\mathrm{Conv}$ denotes the convolution module and $x''_i$ is its output; $y_i$ is the output of the $i$-th Conformer module; and $\mathrm{Layernorm}$ denotes layer normalization.
The structure of the densely connected convolution module is shown in FIG. 5: a layer normalization (Layer Norm) is connected to a first densely connected convolution layer (DenseNet), followed in sequence by a Swish activation function, a first dropout (Dropout), a second densely connected convolution layer and a second dropout (Dropout).
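The internal wiring of a densely connected convolution layer is not detailed in the patent; the following sketch (layer names, growth size and the final projection are our assumptions, in the spirit of DenseNet) shows one plausible reading of FIG. 5:

```python
import torch
import torch.nn as nn

class DenseConvLayer(nn.Module):
    """Densely connected 1-D conv: output = concat(input, new features)."""
    def __init__(self, in_ch: int, growth: int, kernel: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, growth, kernel, padding=kernel // 2)

    def forward(self, x):  # x: (batch, channels, frames)
        return torch.cat([x, self.conv(x)], dim=1)  # dense connection

class DenseConvModule(nn.Module):
    """FIG. 5: LayerNorm -> dense conv -> Swish -> Dropout -> dense conv -> Dropout."""
    def __init__(self, dim: int, growth: int = 64, p: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.dense1 = DenseConvLayer(dim, growth)
        self.swish = nn.SiLU()                    # SiLU is the Swish activation
        self.drop1 = nn.Dropout(p)
        self.dense2 = DenseConvLayer(dim + growth, growth)
        self.drop2 = nn.Dropout(p)
        self.proj = nn.Conv1d(dim + 2 * growth, dim, 1)  # back to model dim (assumed)

    def forward(self, x):  # x: (batch, frames, dim)
        y = self.norm(x).transpose(1, 2)          # Conv1d expects (batch, dim, frames)
        y = self.drop1(self.swish(self.dense1(y)))
        y = self.drop2(self.dense2(y))
        return self.proj(y).transpose(1, 2)       # (batch, frames, dim)
```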
Compared with the feed-forward neural network module, the densely connected convolutional network gives the speech recognition model better noise robustness; the application therefore improves the Conformer model by combining it with the densely connected convolutional network (in place of the feed-forward neural network module), further raising the noise robustness of the automatic speech recognition model.
(3) A decoding layer (Decoder);
The decoding layer consists of six conventional Transformer decoder layers. Each decoder layer comprises four modules: a self-attention module (Self Attention), a layer normalization module (Layer Norm), an encoder-decoder attention mechanism (Encoder-Decoder Attention) and a feed-forward neural network (Feed Forward). All attention layers add residual connections, so information from the upper layer is passed effectively to the lower layer; Layer Norm normalizes the activations of the layer, speeding up training and leading to faster convergence.
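For reference, such a decoder stack can be instantiated with PyTorch's built-in modules; the hyperparameters below follow the embodiment's training settings given later (model dimension 256, 4 attention heads, 6 layers), while batch_first is our choice:

```python
import torch.nn as nn

decoder_layer = nn.TransformerDecoderLayer(d_model=256, nhead=4,
                                           dropout=0.1, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
# usage: decoder(tgt=token_embeddings, memory=encoder_output) -> (batch, len, 256)
```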
The core of the Transformer is the self-attention module (Self Attention), which is also its main innovation. The Transformer introduces three representations, Query, Key and Value, denoted Q, K and V; they are obtained from the input speech feature X (for the first decoding layer, X is the output of the encoding layer; for each subsequent decoding layer, X is the output of the previous decoding layer) by multiplication with three corresponding weight matrices:

$$Q = XW^{Q},\qquad K = XW^{K},\qquad V = XW^{V}$$

where Q, K and V are the products of the speech feature with the corresponding weight matrices $W^{Q}$, $W^{K}$ and $W^{V}$.

After Q, K and V are obtained, Q and K are multiplied (MatMul) to give $QK^{T}$, which is scaled by $1/\sqrt{d_k}$, where $d_k$ is the dimension of K; a mask matrix then blocks the information after each word, a Softmax operation is applied, and the result is finally multiplied (MatMul) with V to give the output Z:

$$Z = \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

where Z is the output of the self-attention module and T denotes transposition.
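A compact sketch of this scaled dot-product self-attention (tensor shapes and the causal mask construction are illustrative):

```python
import math
import torch

def self_attention(X, Wq, Wk, Wv, causal=True):
    """Z = softmax(Q K^T / sqrt(d_k)) V, with an optional mask on future positions."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # Q = X W^Q, K = X W^K, V = X W^V
    d_k = K.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # MatMul, then Scale
    if causal:  # mask blocks the information after each word
        L = scores.shape[-1]
        future = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))
    return torch.softmax(scores, dim=-1) @ V            # Softmax, then MatMul with V
```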
S300: inputting the mel spectra of the clean voice data and the noisy voice data into the automatic speech recognition model to obtain a recognition result for each, and training the model on the text labels, the recognition result of the clean voice data and the recognition result of the noisy voice data to obtain a trained automatic speech recognition model;
Inputting the mel spectrum of the clean voice data and the mel spectrum of the noisy voice data into the automatic speech recognition model yields the recognition result of the clean voice data (the recognized text of the clean utterance) and the recognition result of the noisy voice data (the recognized text of the noisy utterance), specifically comprising the following sub-steps:
S301: inputting the mel spectra of the clean voice data and the noisy voice data into the NFRN layer to obtain noise feature repair weights for each; multiplying the mel spectrum of the clean voice data by its repair weights and the mel spectrum of the noisy voice data by its repair weights, yielding the repaired speech features of the clean voice data and of the noisy voice data;
After the mel spectra of the clean and noisy voice data pass through the NFRN layer, repair weights normalized to [0,1] are obtained and multiplied element-wise with the original mel spectrum. The aim of this operation is to discard useless noise information as far as possible (multiplying by 0) and to keep intact speech information as far as possible (multiplying by 1); for the noisy voice data, repair weights between 0 and 1 repair the features, finally yielding the repaired speech features of the noisy voice data.
S302: inputting the repaired speech features of the clean and noisy voice data into the encoding layer to obtain an encoding result for each;
S303: inputting the encoding results of the clean and noisy voice data into the decoding layer to obtain the recognition result of the clean voice data and the recognition result of the noisy voice data.
The inputs for training the automatic speech recognition model are the text (the transcript corresponding to each clean utterance), the recognition result of the clean voice data and the recognition result of the noisy voice data. At the last decoding layer, a KL-divergence constraint is applied between the recognition result of the clean voice data and that of the noisy voice data, forcing the two to agree during training and thereby enhancing noise robustness; finally, the output for the clean voice data is constrained with the text so that the model can converge.
Training the automatic speech recognition model on the text labels, the recognition result of the clean voice data and the recognition result of the noisy voice data comprises:
computing a cross-entropy loss from the text labels and the recognition result of the clean voice data, and a loss between the recognition result of the clean voice data and the recognition result of the noisy voice data (the KL-divergence constraint described above); training on these losses until convergence yields the trained automatic speech recognition model.
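A sketch of this objective: the cross-entropy term follows the text constraint, the consistency term uses the KL-divergence constraint described above, and the weight alpha and the detach on the clean branch are our assumptions:

```python
import torch.nn.functional as F

def asr_loss(clean_logits, noisy_logits, targets, alpha=1.0):
    """Cross-entropy to the text plus a KL constraint between clean/noisy outputs."""
    # clean_logits, noisy_logits: (batch, seq, vocab); targets: (batch, seq) token ids
    ce = F.cross_entropy(clean_logits.transpose(1, 2), targets)   # text constraint
    # KL(clean || noisy): pushes the noisy output toward the clean output
    kl = F.kl_div(F.log_softmax(noisy_logits, dim=-1),
                  F.softmax(clean_logits, dim=-1).detach(),       # clean as target
                  reduction="batchmean")
    return ce + alpha * kl
```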
An Adam optimizer is used for training with an initial learning rate of 0.002; the number of self-attention heads is 4, the downsampling factor is one quarter, the dropout rate is 0.1 and the attention output dimension is 256. The training data are the 100.6-hour train-clean-100 subset of the English LibriSpeech dataset (251 speakers); the validation data are the 5.4-hour dev-clean subset (40 speakers). During training and validation, noise of random types is added with probability 0.5, and training runs for 50 iterations.
Further, in one embodiment, the method further comprises testing the trained automatic speech recognition model;
The word error rate (WER) is computed over the 4200 test items (120 utterances × 7 noise types × 5 signal-to-noise ratios). WER is an important index for evaluating the performance of an automatic speech recognition model: it measures the word-level errors between the predicted result and the label text, so a smaller value indicates a better recognition result. WER is commonly used for languages such as English and Arabic; because it is computed at the word level, it cannot be applied to a Chinese dataset. WER is calculated as:

$$WER = \frac{S + D + I}{N} \times 100\%$$

where $S$ (Substitution) is the number of wrongly substituted words, $D$ (Deletion) is the number of wrongly deleted words, $I$ (Insertion) is the number of wrongly inserted words, and $N$ is the total number of words in the label text.
S400: recognizing the noisy voice data with the trained automatic speech recognition model.
Noisy voice data to be recognized are acquired and preprocessed to obtain the mel spectrum, which is input into the trained automatic speech recognition model to obtain the recognition result (the corresponding text).
Example 2
As shown in FIG. 6, one embodiment of the present application provides a speech recognition system for enhancing noise robustness, comprising a preprocessing module, an automatic speech recognition model construction module, a training module and a recognition module;
the preprocessing module acquires noise data and clean voice data with text labels, generates noisy voice data from the clean voice data and the noise data, and preprocesses the clean and noisy voice data to extract the mel spectrum of each;
the model construction module constructs an automatic speech recognition model comprising a noise feature repair network layer, an encoding layer and a decoding layer, the noise feature repair network layer comprising three convolution layers, three transposed-convolution layers and a Sigmoid activation function connected in sequence, each convolution and transposed-convolution layer being followed by batch normalization and a ReLU activation function;
the training module inputs the mel spectra of the clean and noisy voice data into the model to obtain a recognition result for each, and trains the model on the text labels, the recognition result of the clean voice data and the recognition result of the noisy voice data to obtain a trained automatic speech recognition model;
the recognition module recognizes noisy voice data with the trained automatic speech recognition model.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
If the functions are implemented in the form of software functional units and sold or used as a stand-alone product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes media that can store program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk or an optical disk.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the application is subject to the protection scope of the claims.

Claims (9)

1. A speech recognition method for enhancing noise robustness, comprising the steps of:
S100: acquiring noise data and clean voice data with text labels, and generating noisy voice data from the clean voice data and the noise data; preprocessing the clean voice data and the noisy voice data to extract the mel spectrum of each;
S200: constructing an automatic speech recognition model comprising a noise feature repair network layer, an encoding layer and a decoding layer, the noise feature repair network layer comprising three convolution layers, three transposed-convolution (deconvolution) layers and a Sigmoid activation function connected in sequence, each convolution layer and each transposed-convolution layer being followed by a batch normalization operation and a ReLU activation function;
S300: inputting the mel spectra of the clean voice data and the noisy voice data into the automatic speech recognition model to obtain a recognition result for each, and training the model on the text labels, the recognition result of the clean voice data and the recognition result of the noisy voice data to obtain a trained automatic speech recognition model;
S400: recognizing noisy voice data with the trained automatic speech recognition model.
2. The method according to claim 1, wherein the preprocessing in step S100 comprises:
resampling the clean voice data and the noisy voice data;
applying pre-emphasis to the resampled clean and noisy voice data; processing with a short-time Fourier transform; and finally converting to the mel spectrum, yielding the mel spectrum of the clean voice data and the mel spectrum of the noisy voice data.
3. The method according to claim 1, wherein step S300 comprises the sub-steps of:
S301: inputting the mel spectra of the clean voice data and the noisy voice data into the noise feature repair network layer to obtain noise feature repair weights for each; multiplying the mel spectrum of the clean voice data by its repair weights and the mel spectrum of the noisy voice data by its repair weights, yielding the repaired speech features of the clean voice data and of the noisy voice data;
S302: inputting the repaired speech features of the clean and noisy voice data into the encoding layer to obtain an encoding result for each;
S303: inputting the encoding results of the clean and noisy voice data into the decoding layer to obtain the recognition result of the clean voice data and the recognition result of the noisy voice data.
4. The method of claim 1, wherein the encoding layer of the automatic speech recognition model consists of six Conformer-Des layers; the Conformer-Des contains densely connected convolution modules.
5. The method of claim 4, wherein the Conformer-Des comprises a layer normalization module, two densely connected convolution modules whose outputs are scaled by 1/2, a convolution module and a multi-head self-attention module; the first densely connected convolution module is connected to the multi-head self-attention module, which is followed in sequence by the convolution module, the second densely connected convolution module and the layer normalization module.
6. The method of claim 4 or 5, wherein the densely connected convolution module comprises, connected in sequence, a layer normalization, a first densely connected convolution layer, a Swish activation function, a first dropout, a second densely connected convolution layer and a second dropout.
7. The method according to claim 1, wherein the decoding layer of the automatic speech recognition model consists of six Transformer decoder layers;
each decoder layer comprises four modules: a self-attention module, a layer normalization module, an encoder-decoder attention mechanism and a feed-forward neural network.
8. The method according to claim 1, wherein step S300 comprises:
computing a cross-entropy loss from the text labels and the recognition result of the clean voice data, and a cross-entropy loss from the recognition result of the clean voice data and the recognition result of the noisy voice data; training on these losses until convergence to obtain the trained automatic speech recognition model.
9. A speech recognition system for enhancing noise robustness, comprising a preprocessing module, an automatic speech recognition model construction module, a training module and a recognition module;
the preprocessing module acquires noise data and clean voice data with text labels, generates noisy voice data from the clean voice data and the noise data, and preprocesses the clean and noisy voice data to extract the mel spectrum of each;
the model construction module constructs an automatic speech recognition model comprising a noise feature repair network layer, an encoding layer and a decoding layer, the noise feature repair network layer comprising three convolution layers, three transposed-convolution layers and a Sigmoid activation function connected in sequence, each convolution and transposed-convolution layer being followed by batch normalization and a ReLU activation function;
the training module inputs the mel spectra of the clean and noisy voice data into the model to obtain a recognition result for each, and trains the model on the text labels, the recognition result of the clean voice data and the recognition result of the noisy voice data to obtain a trained automatic speech recognition model;
the recognition module recognizes noisy voice data with the trained automatic speech recognition model.
CN202311075628.3A 2023-08-25 2023-08-25 Speech recognition method and system for enhancing noise robustness Active CN116778913B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311075628.3A CN116778913B (en) 2023-08-25 2023-08-25 Speech recognition method and system for enhancing noise robustness

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311075628.3A CN116778913B (en) 2023-08-25 2023-08-25 Speech recognition method and system for enhancing noise robustness

Publications (2)

Publication Number Publication Date
CN116778913A true CN116778913A (en) 2023-09-19
CN116778913B CN116778913B (en) 2023-10-20

Family

ID=87993524

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311075628.3A Active CN116778913B (en) 2023-08-25 2023-08-25 Speech recognition method and system for enhancing noise robustness

Country Status (1)

Country Link
CN (1) CN116778913B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100318354A1 (en) * 2009-06-12 2010-12-16 Microsoft Corporation Noise adaptive training for speech recognition
CN108257606A (en) * 2018-01-15 2018-07-06 江南大学 A kind of robust speech personal identification method based on the combination of self-adaptive parallel model
CN110047502A (en) * 2019-04-18 2019-07-23 广州九四智能科技有限公司 The recognition methods of hierarchical voice de-noising and system under noise circumstance
CN110930976A (en) * 2019-12-02 2020-03-27 北京声智科技有限公司 Voice generation method and device
CN114067784A (en) * 2021-11-24 2022-02-18 云知声智能科技股份有限公司 Training method and device of fundamental frequency extraction model and fundamental frequency extraction method and device
CN114842833A (en) * 2022-05-11 2022-08-02 合肥讯飞数码科技有限公司 Speech recognition method and related device, electronic equipment and storage medium
CN115641834A (en) * 2022-09-09 2023-01-24 平安科技(深圳)有限公司 Voice synthesis method and device, electronic equipment and storage medium
WO2023036017A1 (en) * 2021-09-07 2023-03-16 广西电网有限责任公司贺州供电局 Speech recognition method and system for power grid dispatching
CN116092501A (en) * 2023-03-14 2023-05-09 澳克多普有限公司 Speech enhancement method, speech recognition method, speaker recognition method and speaker recognition system
CN116364109A (en) * 2023-03-03 2023-06-30 重庆邮电大学 Speech enhancement network signal-to-noise ratio estimator and loss optimization method


Also Published As

Publication number Publication date
CN116778913B (en) 2023-10-20

Similar Documents

Publication Publication Date Title
CN110827801B (en) Automatic voice recognition method and system based on artificial intelligence
CN108737667B (en) Voice quality inspection method and device, computer equipment and storage medium
CN112017644B (en) Sound transformation system, method and application
Revathi et al. Speaker independent continuous speech and isolated digit recognition using VQ and HMM
CN110634476B (en) Method and system for rapidly building robust acoustic model
CN114783418B (en) End-to-end voice recognition method and system based on sparse self-attention mechanism
CN106297769B (en) A kind of distinctive feature extracting method applied to languages identification
CN111091809B (en) Regional accent recognition method and device based on depth feature fusion
Thomas et al. Acoustic and data-driven features for robust speech activity detection
Kurian et al. Continuous speech recognition system for Malayalam language using PLP cepstral coefficient
CN114495969A (en) Voice recognition method integrating voice enhancement
CN116778913B (en) Speech recognition method and system for enhancing noise robustness
CN114626424B (en) Data enhancement-based silent speech recognition method and device
CN115631757A (en) Convolution countermeasure sample construction method and device for voice identity anonymity
Dua et al. Noise robust automatic speech recognition: review and analysis
Shin et al. Speaker-invariant psychological stress detection using attention-based network
CN112489651B (en) Voice recognition method, electronic device and storage device
Tailor et al. Deep learning approach for spoken digit recognition in Gujarati language
Kurian et al. Connected digit speech recognition system for Malayalam language
Rahim et al. Robust numeric recognition in spoken language dialogue
Feng et al. Noise Classification Speech Enhancement Generative Adversarial Network
CN117041430B (en) Method and device for improving outbound quality and robustness of intelligent coordinated outbound system
Chiang et al. WhisperHakka: A Hybrid Architecture Speech Recognition System for Low-Resource Taiwanese Hakka
Kaur et al. Correlative consideration concerning feature extraction techniques for speech recognition—a review
CN116386601A (en) Intelligent voice customer service question answering method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant