CN114582330A - Training method of voice recognition model, voice recognition method and electronic equipment - Google Patents

Training method of voice recognition model, voice recognition method and electronic equipment

Info

Publication number
CN114582330A
CN114582330A (Application CN202210235275.8A)
Authority
CN
China
Prior art keywords
voice
recognition model
training
noise
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210235275.8A
Other languages
Chinese (zh)
Inventor
朱秋实
戴礼荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202210235275.8A priority Critical patent/CN114582330A/en
Publication of CN114582330A publication Critical patent/CN114582330A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a training method of a speech recognition model, which comprises the following steps: respectively processing clean speech and noisy speech by using a feature extraction module of the speech recognition model to obtain clean speech features and noisy speech features; processing the noisy speech features by using a context module of the speech recognition model to obtain a context representation; clustering the clean speech features by using a quantization module of the speech recognition model to obtain quantized clean speech features; processing the context representation and the quantized clean speech features by using a pre-training loss function to obtain a pre-training loss value; and optimizing the speech recognition model according to the pre-training loss value. The invention also discloses a speech recognition method, an electronic device, and a storage medium.

Description

Training method of voice recognition model, voice recognition method and electronic equipment
Technical Field
The invention belongs to the field of voice signal processing, and particularly relates to a training method of a voice recognition model, a voice recognition method, electronic equipment and a storage medium.
Background
Speech Recognition, also known as Automatic Speech Recognition (ASR), aims to convert human speech into computer-readable words or instructions. Noise-robust speech recognition is a sub-field of speech recognition, and current approaches are mainly based on front-end speech enhancement, multi-task learning, and adversarial training. In front-end-enhancement methods, a speech enhancement network is trained in advance, the noisy speech is enhanced by this network, and the enhanced speech is then fed into a speech recognition model for recognition. In multi-task learning methods, the speech data is passed through a neural network to obtain the output of the last layer, on which two tasks are learned jointly: one branch performs a Connectionist Temporal Classification (CTC) task through a classification layer, while the other branch classifies the noise type through an auxiliary classification layer. In adversarial-training methods, a Gradient Reversal Layer (GRL) is introduced before the classification layer of the neural network so that the noise classifier has difficulty distinguishing the noise type, which makes the representation before the gradient reversal layer noise-invariant. All three methods bring performance improvements on noise-robust speech recognition tasks.
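For illustration only, a gradient reversal layer of the kind used in such adversarial training can be written as the following PyTorch-style sketch; the scaling factor lam is an assumption, not a value taken from the prior art described above.

```python
# Hypothetical sketch of a Gradient Reversal Layer (GRL) used in
# adversarial training for noise-invariant representations.
import torch

class GradientReversal(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam          # scaling factor for the reversed gradient
        return x.clone()       # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back to the shared encoder,
        # so the encoder learns features the noise classifier cannot exploit.
        return -ctx.lam * grad_output, None

def grl(x, lam=1.0):
    return GradientReversal.apply(x, lam)
```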
Currently, unsupervised pre-training methods are also provided in the prior art. Unsupervised Pre-Training is also called Self-Supervised Pre-Training; however, existing unsupervised pre-training methods suffer from poor robustness and insufficient generalization capability.
Disclosure of Invention
In view of the above problems, the present invention provides a training method of a speech recognition model, a speech recognition method, an electronic device, and a storage medium, which are intended to solve at least one of the above problems.
According to an embodiment of the present invention, there is provided a training method of a speech recognition model, including:
respectively processing clean voice and voice with noise by using a feature extraction module of the voice recognition model to obtain clean voice features and voice features with noise;
processing the voice characteristics with noise by using a context module of the voice recognition model to obtain a context representation;
clustering the clean voice features by using a quantization module of the voice recognition model to obtain quantized clean voice features;
processing the context representation and the quantized clean voice features by using a pre-training loss function to obtain a pre-training loss value;
and optimizing the voice recognition model according to the pre-training loss value.
According to an embodiment of the present invention, the training method of the speech recognition model further includes:
preprocessing the noise-carrying voice with the label by utilizing the linear mapping layer to obtain the preprocessed noise-carrying voice with the label;
processing the preprocessed noise-carrying voice with the label by using a feature extraction module of the voice recognition model to obtain the noise-carrying voice feature with the label;
processing the noise-carrying voice characteristics with the labels by utilizing a context module of the voice recognition model to obtain the context representation with the labels;
processing the context representation with the label by utilizing a fine tuning loss function, and optimizing a voice recognition model according to a fine tuning loss value;
and iteratively performing pre-training loss optimization and fine-tuning loss optimization until the pre-training loss value and/or the fine-tuning loss value meets a preset condition, so as to obtain a trained voice recognition model.
According to an embodiment of the present invention, the feature extraction module employs a multi-layer convolutional neural network or a multi-layer depthwise separable convolutional neural network, each layer of the depthwise separable convolutional neural network including a channel-wise convolution and a point-wise convolution;
the context module employs a multi-layer Transformer network, each layer of which includes a self-attention layer and a feedforward neural network layer;
the quantization module employs a product quantization codebook and is optimized through a Gumbel-softmax activation function.
According to an embodiment of the present invention, the pre-training loss function is determined by a weighted sum of a contrast loss function, a codebook diversity loss function, an L2 loss function, and a consistency loss function.
According to an embodiment of the present invention, the above-mentioned contrast loss function is determined by equation (1):

L_m = -\log \frac{\exp(\mathrm{sim}(c_t, q_t)/\kappa)}{\sum_{\tilde{q} \in Q_t} \exp(\mathrm{sim}(c_t, \tilde{q})/\kappa)}   (1)

wherein the codebook diversity loss function is determined by equation (2):

L_d = \frac{1}{GV} \sum_{g=1}^{G} \sum_{v=1}^{V} p_{g,v} \log p_{g,v}   (2)

wherein p_{g,v} is determined by equation (3):

p_{g,v} = \frac{\exp((l_{g,v} + n_v)/\tau)}{\sum_{k=1}^{V} \exp((l_{g,k} + n_k)/\tau)}   (3)

wherein the consistency loss function is determined by equation (4):

L_c = \| Z_{noisy} - Z_{clean} \|_2   (4)

wherein Z_{noisy} represents the noisy speech features, Z_{clean} represents the clean speech features, c_t and q_t represent the context representation and the quantized clean speech feature of the t-th frame, sim(·,·) represents cosine similarity, p_{g,v} represents the probability of selecting the v-th entry of the g-th codebook group, l_{g,v} represents the corresponding logit, n_v and n_k represent random noise perturbations with different means and variances obeying a Gaussian distribution, τ and κ represent non-negative temperature coefficients, G represents the number of quantization codebook groups, V represents the number of entries in each codebook group, and Q_t represents all quantized candidate features, including the positive sample feature and the negative sample features.
According to an embodiment of the present invention, the fine-tuning loss function adopts a Connectionist Temporal Classification (CTC) loss function or a cross-entropy loss function.
According to an embodiment of the present invention, the processing the noisy speech feature with a tag by using the context module of the speech recognition model to obtain the context characterization with the tag includes:
a linear mapping layer is connected after the context module that processes the tagged noisy speech features.
According to an embodiment of the present invention, there is provided a speech recognition method including:
acquiring a voice to be recognized;
and processing the voice to be recognized by utilizing a voice recognition model to obtain a voice recognition result, wherein the voice recognition model is obtained by training according to the training method of the voice recognition model.
According to an embodiment of the present invention, there is provided an electronic apparatus including:
one or more processors;
a storage device to store one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method for training the speech recognition model and the method for speech recognition.
According to an embodiment of the present invention, there is provided a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform the above-described method of training a speech recognition model and the above-described method of speech recognition.
According to the training method of the voice recognition model, the voice recognition model is trained by acquiring the fusion representation comprising the clean voice characteristic and the noisy voice characteristic by utilizing the clean voice and the noisy voice, so that the robustness of the voice recognition model is improved, the generalization capability of the voice recognition model is improved, and the accuracy rate of the voice recognition in a noise scene is improved.
Drawings
FIG. 1 schematically shows a flow diagram of a method of training a speech recognition model according to an embodiment of the present invention;
FIG. 2 schematically shows a flow diagram of a method of training a speech recognition model according to another embodiment of the present invention;
FIG. 3 schematically shows a network architecture diagram of a feature extraction module according to an embodiment of the invention;
FIG. 4 schematically illustrates a block diagram of a speech recognition model according to an embodiment of the present invention;
FIG. 5 schematically shows a flow chart of a speech recognition method according to an embodiment of the invention;
FIG. 6 schematically illustrates a block diagram of an electronic device adapted to implement the training method of a speech recognition model and the speech recognition method according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the accompanying drawings in combination with the embodiments.
In the field of speech recognition, the methods based on front-end speech enhancement, multi-task learning, and adversarial training all have limitations when improving noisy speech recognition. Front-end-enhancement methods require training a speech enhancement network first and then a speech recognition model. Because the training criterion of the speech enhancement network is not consistent with that of the speech recognition model, the enhancement network cannot guarantee an improvement in speech recognition performance in a noisy environment. Multi-task learning needs to classify the noise type, so the noise types must be known in advance; for scenes with unknown noise types, the generalization capability of the model is weak. Adversarial training can improve speech recognition accuracy on a specific noise type, but often at the cost of reducing the accuracy of the speech recognition model on clean speech.
In contrast, unsupervised pre-training improves speech recognition performance by exploiting a large amount of unlabeled speech data. An unsupervised pre-training method needs no labeled data in the pre-training stage; it learns common structural information in speech from large amounts of unlabeled data and shares this information to improve recognition performance. For example, unsupervised pre-training is first performed on a large amount of unlabeled speech data to obtain a pre-trained model; the pre-trained model is then used to extract speech representations or to initialize a speech recognition model, which is further fine-tuned with labeled data, significantly improving recognition performance. The unlabeled data required for unsupervised pre-training, whether clean or noisy, is easy to acquire. It is therefore of great significance to study whether an unsupervised pre-training method can improve speech recognition accuracy in noisy scenes.
FIG. 1 schematically shows a flow chart of a method of training a speech recognition model according to an embodiment of the present invention.
As shown in fig. 1, operations S101 to S105 are included.
In operation S101, the clean speech and the noisy speech are respectively processed by using the feature extraction module of the speech recognition model to obtain a clean speech feature and a noisy speech feature.
The feature extraction module comprises a multi-layer convolutional neural network or a multi-layer depthwise separable convolutional neural network and is used for extracting shallow local features from the raw speech.
In operation S102, the noisy speech feature is processed by using the context module of the speech recognition model to obtain a context representation.
The context module employs a Transformer neural network for learning speech context information.
In operation S103, the clean speech features are clustered by using a quantization module of the speech recognition model, so as to obtain quantized clean speech features.
In operation S104, the context characterization and the quantized clean speech feature are processed using a pre-training loss function to obtain a pre-training loss value.
In operation S105, the speech recognition model is optimized according to the pre-training loss value.
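For illustration, one pre-training step covering operations S101 to S105 might look like the following sketch, assuming the feature extractor, context module, quantization module, and pre-training loss described later in this description; the function and argument names are illustrative only, not the patent's implementation.

```python
# Hypothetical single pre-training step for operations S101-S105:
# shared feature extraction, context encoding of the noisy branch,
# quantization of the clean branch, loss computation, optimization.
import torch

def pretrain_step(x_clean, x_noisy, extractor, context, quantizer,
                  loss_fn, optimizer):
    z_clean = extractor(x_clean)            # S101: clean speech features
    z_noisy = extractor(x_noisy)            # S101: noisy speech features
    c_noisy = context(z_noisy)              # S102: context representation
    q_clean, p_gv = quantizer(z_clean)      # S103: quantized clean features
    # S104: pre-training loss (contrast + diversity + L2 + consistency)
    loss = loss_fn(c_noisy, q_clean, p_gv, z_noisy, z_clean)
    optimizer.zero_grad()
    loss.backward()                         # S105: optimize the model
    optimizer.step()
    return loss.item()
```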
According to the training method of the voice recognition model, the voice recognition model is trained by acquiring the fusion representation comprising the clean voice characteristic and the noisy voice characteristic by utilizing the clean voice and the noisy voice, so that the robustness of the voice recognition model is improved, the generalization capability of the voice recognition model is improved, and the accuracy rate of the voice recognition in a noise scene is improved.
FIG. 2 schematically shows a flow chart of a method of training a speech recognition model according to another embodiment of the present invention.
As shown in fig. 2, the training method of the speech recognition model further includes operations S106 to S110.
In operation S106, the tagged noisy speech is preprocessed by the linear mapping layer to obtain a preprocessed tagged noisy speech.
In operation S107, the preprocessed noise-bearing voice with the tag is processed by the feature extraction module of the voice recognition model, so as to obtain the noise-bearing voice feature with the tag.
In operation S108, the noise-bearing speech feature with the tag is processed by the context module of the speech recognition model, so as to obtain a context representation with the tag.
In operation S109, the tagged context tokens are processed using a fine tuning loss function and the speech recognition model is optimized according to the fine tuning loss value.
In operation S110, the pre-training loss optimization and the fine-tuning loss optimization are performed iteratively until the pre-training loss value and/or the fine-tuning loss value satisfy a preset condition, so as to obtain a trained speech recognition model.
The preset conditions include, but are not limited to, a threshold value set by an operator, a loss function value of the speech recognition model no longer changing (i.e., the model spontaneously converges to a certain value), a loss function value of the speech recognition model decreasing with a small change, or a loss function value of the speech recognition model fluctuating within a small range of values.
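As one possible realization of such a preset condition, the following sketch stops the iteration of operation S110 once the monitored loss value no longer changes appreciably; the tolerance and patience values are assumptions, not values from the patent.

```python
# Illustrative convergence check for operation S110.
def has_converged(loss_history, tol=1e-4, patience=3):
    if len(loss_history) <= patience:
        return False
    recent = loss_history[-(patience + 1):]
    # True when the last `patience` loss changes are all below the tolerance.
    return all(abs(recent[i] - recent[i + 1]) < tol for i in range(patience))
```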
According to the training method of the voice recognition model provided by the embodiment of the invention, the noisy voice characteristics are obtained by utilizing the noisy voice with the label, and the voice recognition model is optimized by fine tuning the loss function, so that the generalization capability of the voice recognition model is improved, and the application range of the voice recognition model is expanded.
According to an embodiment of the present invention, the feature extraction module employs a multi-layer convolutional neural network or a multi-layer depthwise separable convolutional neural network, where each layer of the depthwise separable convolutional neural network includes a channel-wise convolution (DepthwiseConv1d) and a point-wise convolution (PointwiseConv1d).
The feature extraction module preferably employs a 7-layer convolutional neural network or a 7-layer depthwise separable convolutional neural network.
Fig. 3 schematically shows a network structure diagram of a feature extraction module according to an embodiment of the present invention, wherein fig. 3(a) represents a 7-layer ordinary one-dimensional convolutional neural network, in which Conv1d denotes the convolution function, and fig. 3(b) represents a 7-layer depthwise separable convolutional neural network, in which DepthwiseConv1d and PointwiseConv1d denote the depthwise separable convolution functions. The feature extraction module is further described in detail below with reference to fig. 3.
As shown in fig. 3, k represents the size of the convolution kernel and s represents the convolution stride. Seven convolutional layers are adopted, with strides of (5, 2, 2, 2, 2, 2, 2) and kernel sizes of (10, 3, 3, 3, 3, 2, 2), respectively. The feature extractor module f is shared by the raw clean speech and the noisy speech: Z_noisy = f(X_noisy) and Z_clean = f(X_clean) yield the noisy speech features Z_noisy and the clean speech features Z_clean, respectively. The feature extraction module takes 16 kHz speech samples as input: the clean speech samples X_clean = (x_1^clean, ..., x_T^clean) pass through the feature extractor module f to obtain the clean speech features Z_clean = f(X_clean) = (z_1^clean, ..., z_T^clean), where z_t^clean represents the clean speech feature of the t-th frame; similarly, the noisy speech samples X_noisy = (x_1^noisy, ..., x_T^noisy) pass through the feature extractor module f to obtain Z_noisy = f(X_noisy) = (z_1^noisy, ..., z_T^noisy).
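A minimal PyTorch-style sketch of such a seven-layer convolutional feature extractor is given below; only the kernel sizes and strides follow the values above, while the channel width and activation function are assumptions.

```python
# Sketch of the 7-layer convolutional feature extractor
# (strides (5,2,2,2,2,2,2), kernels (10,3,3,3,3,2,2)).
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        kernels = (10, 3, 3, 3, 3, 2, 2)
        strides = (5, 2, 2, 2, 2, 2, 2)
        layers, in_ch = [], 1
        for k, s in zip(kernels, strides):
            layers += [nn.Conv1d(in_ch, dim, kernel_size=k, stride=s), nn.GELU()]
            in_ch = dim
        self.conv = nn.Sequential(*layers)

    def forward(self, wav):                  # wav: (batch, samples) at 16 kHz
        z = self.conv(wav.unsqueeze(1))      # -> (batch, dim, frames)
        return z.transpose(1, 2)             # -> (batch, frames, dim)

# The same (shared) extractor f is applied to clean and noisy waveforms:
# z_clean = f(x_clean); z_noisy = f(x_noisy)
```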
The context module adopts a multi-layer Transformer network, and each layer of the Transformer network comprises a self-attention layer and a feedforward neural network layer.
The context module preferably consists of 12 Transformer blocks, each composed of a self-attention layer and a feedforward neural network layer. The noisy speech features Z_noisy enter the context module g to obtain the context representation C_noisy = g(Z_noisy) = (c_1, ..., c_T), where c_t represents the context representation of the t-th frame.
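For illustration, the context module could be sketched as a stack of standard Transformer encoder blocks; the number of attention heads, feedforward width, and dropout are assumptions, only the 12-layer depth follows the description above.

```python
# Sketch of the context module as 12 Transformer encoder blocks
# (self-attention layer + feedforward neural network layer per block).
import torch.nn as nn

class ContextModule(nn.Module):
    def __init__(self, dim=512, layers=12, heads=8, ff=2048, dropout=0.1):
        super().__init__()
        block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=ff,
            dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=layers)

    def forward(self, z_noisy):         # (batch, frames, dim)
        return self.encoder(z_noisy)    # context representation C_noisy
```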
The quantization module adopts a product quantization codebook and is optimized through the Gumbel-softmax activation function, with the aim of clustering identical pronunciation features together. The clean speech features Z_clean = (z_1^clean, ..., z_T^clean) enter the quantization module to obtain the quantized features q_clean = VQ(Z_clean) = (q_1^clean, ..., q_T^clean). The quantized clean features q_clean provide the training targets for the entire model during the pre-training phase.
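A possible sketch of a product-quantization module trained with the Gumbel-softmax trick is shown below; the number of groups, entries per group, codeword dimension, and the batch-averaging of the selection probabilities are illustrative assumptions.

```python
# Sketch of a product quantizer optimized with Gumbel-softmax.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProductQuantizer(nn.Module):
    def __init__(self, dim=512, groups=2, entries=320, out_dim=256):
        super().__init__()
        self.G, self.V = groups, entries
        self.to_logits = nn.Linear(dim, groups * entries)
        # One codeword of size out_dim // groups per (group, entry) pair.
        self.codebook = nn.Parameter(
            torch.randn(groups, entries, out_dim // groups))

    def forward(self, z_clean, tau=2.0):            # (batch, frames, dim)
        logits = self.to_logits(z_clean)            # (B, T, G*V)
        logits = logits.view(*logits.shape[:2], self.G, self.V)
        # Differentiable (straight-through) codeword selection.
        probs = F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)
        q = torch.einsum('btgv,gvd->btgd', probs, self.codebook)
        q_clean = q.reshape(*q.shape[:2], -1)       # concatenate the groups
        # Soft selection probabilities p_{g,v}, averaged over the batch,
        # feed the codebook diversity loss of equation (2).
        p_gv = logits.softmax(dim=-1).mean(dim=(0, 1))
        return q_clean, p_gv
```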
According to an embodiment of the present invention, the pre-training loss function is determined by a weighted sum of a contrast loss function, a codebook diversity loss function, an L2 loss function, and a consistency loss function.
The above-mentioned pre-training loss function is determined by the following formula: L = L_m + α L_d + β L_f + γ L_c, where L represents the pre-training loss function, L_m represents the contrast loss function, L_d represents the codebook diversity loss function, L_f represents the L2 loss function, L_c represents the consistency loss function, and α, β, and γ represent weighting coefficients.
According to an embodiment of the present invention, the above-mentioned contrast loss function is determined by equation (1):

L_m = -\log \frac{\exp(\mathrm{sim}(c_t, q_t)/\kappa)}{\sum_{\tilde{q} \in Q_t} \exp(\mathrm{sim}(c_t, \tilde{q})/\kappa)}   (1)

The contrast loss function is used to reduce the information distance between the predicted noisy speech frame and the true clean speech frame, and to increase the information distance between the predicted noisy speech frame and randomly sampled speech frames.

The codebook diversity loss function is determined by equation (2):

L_d = \frac{1}{GV} \sum_{g=1}^{G} \sum_{v=1}^{V} p_{g,v} \log p_{g,v}   (2)

The codebook diversity loss function is used to make the model use the quantization codebook as uniformly as possible.

The L2 loss function is used to stabilize the training process.

p_{g,v} is determined by equation (3):

p_{g,v} = \frac{\exp((l_{g,v} + n_v)/\tau)}{\sum_{k=1}^{V} \exp((l_{g,k} + n_k)/\tau)}   (3)

The consistency loss function is determined by equation (4):

L_c = \| Z_{noisy} - Z_{clean} \|_2   (4)

The consistency loss function is used to constrain the consistency of the clean speech and noisy speech outputs of the feature extractor module.

Here, Z_{noisy} represents the noisy speech features, Z_{clean} represents the clean speech features, c_t and q_t represent the context representation and the quantized clean speech feature of the t-th frame, sim(·,·) represents cosine similarity, p_{g,v} represents the probability of selecting the v-th entry of the g-th codebook group, l_{g,v} represents the corresponding logit, n_v and n_k represent random noise perturbations with different means and variances obeying a Gaussian distribution, τ and κ represent non-negative temperature coefficients, G represents the number of quantization codebook groups, V represents the number of entries in each codebook group, and Q_t represents all quantized candidate features, including the positive sample feature and the negative sample features.
The pre-training loss function determined by the formula can be used for better optimizing the voice recognition model, so that the voice recognition model with higher voice recognition accuracy rate is obtained.
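As a non-authoritative sketch, the weighted pre-training loss L = L_m + αL_d + βL_f + γL_c built from equations (1) to (4) could be computed as follows; the number of distractors, the loss weights, and the exact form of the L2 feature penalty L_f are assumptions.

```python
# Illustrative computation of the combined pre-training loss.
import torch
import torch.nn.functional as F

def contrast_loss(c, q_pos, q_neg, kappa=0.1):
    """Eq. (1): c, q_pos: (B, T, D); q_neg: (B, T, K, D) distractors."""
    pos = F.cosine_similarity(c, q_pos, dim=-1) / kappa                # (B, T)
    neg = F.cosine_similarity(c.unsqueeze(2), q_neg, dim=-1) / kappa   # (B, T, K)
    logits = torch.cat([pos.unsqueeze(2), neg], dim=2)   # positive at index 0
    flat = logits.flatten(0, 1)                                        # (B*T, K+1)
    target = torch.zeros(flat.shape[0], dtype=torch.long, device=flat.device)
    return F.cross_entropy(flat, target)

def diversity_loss(p_gv):
    """Eq. (2): p_gv is the (G, V) matrix of average codeword usage."""
    return (p_gv * torch.log(p_gv + 1e-7)).sum() / p_gv.numel()

def consistency_loss(z_noisy, z_clean):
    """Eq. (4): L2 distance between noisy and clean features."""
    return (z_noisy - z_clean).norm(dim=-1).mean()

def pretrain_loss(c, q_pos, q_neg, p_gv, z_noisy, z_clean,
                  feat_penalty, alpha=0.1, beta=10.0, gamma=1.0):
    return (contrast_loss(c, q_pos, q_neg)
            + alpha * diversity_loss(p_gv)
            + beta * feat_penalty            # assumed L2 feature penalty L_f
            + gamma * consistency_loss(z_noisy, z_clean))
```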
According to an embodiment of the present invention, the fine-tuning loss function adopts a Connectionist Temporal Classification (CTC) loss function or a Cross-Entropy loss function.
With the fine-tuning loss function, the speech model can be conveniently tested, decoded, and evaluated in terms of word error rate.
According to an embodiment of the present invention, the processing the noisy speech feature with a tag by using the context module of the speech recognition model to obtain the context characterization with the tag includes:
and connecting a linear mapping layer after the context module that processes the labeled noisy speech features.
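For illustration, the fine-tuning stage (a linear mapping layer after the context module, optimized with a CTC loss) might be sketched as follows; the vocabulary size and blank index are assumptions.

```python
# Sketch of the fine-tuning head: linear mapping layer + CTC loss.
import torch
import torch.nn as nn

class FineTuneHead(nn.Module):
    def __init__(self, dim=512, vocab=32):
        super().__init__()
        self.proj = nn.Linear(dim, vocab)            # linear mapping layer
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, c_noisy, targets, in_lens, tgt_lens):
        # c_noisy: (batch, frames, dim) context representation of labeled noisy speech
        log_probs = self.proj(c_noisy).log_softmax(-1)   # (B, T, vocab)
        return self.ctc(log_probs.transpose(0, 1),       # CTC expects (T, B, vocab)
                        targets, in_lens, tgt_lens)
```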
Fig. 4 is a schematic diagram illustrating a structure of a speech recognition model according to an embodiment of the present invention, and the method for training the speech recognition model according to the embodiment of the present invention is further described with reference to fig. 4.
As shown in fig. 4, the training method of a speech recognition model according to an embodiment of the present invention includes a pre-training stage and a fine-tuning stage. The speech recognition model includes a feature extractor module (feature encoder), a context module (Transformer encoder), and a quantization module (vector quantization), and the loss functions mainly include the pre-training loss functions (e.g., the contrast loss function and the consistency loss function) and the fine-tuning loss function, where C represents the context representation, Z represents the speech features (e.g., the noisy speech features and the clean speech features) obtained by the feature extraction module, and X represents the speech to be processed. A noise-robust representation is obtained through unsupervised pre-training on noisy speech data, and the model is then fine-tuned on labeled noisy data to improve speech recognition performance in different noise scenes.
In order to make full use of easily obtained unlabeled data to obtain a pre-trained model and thereby improve the noise robustness of the speech recognition model in different noise scenes, and in order to verify the effectiveness of the training method provided by the embodiments of the present invention, the following specific implementation is provided.
First, the unlabeled data used in the pre-training phase is the public English dataset LibriSpeech [9], using the train-clean-100 subset, which contains 100 hours of labeled clean speech. The noise data come from the open-source FreeSound [10] and NoiseX-92 [11] datasets. The noise can be roughly divided into two classes: relatively stationary noise, including the 'Car', 'Metro', and 'Traffic' types, and relatively non-stationary noise, including the 'Babble', 'Airport/Station', 'Cafe', and 'AC/Vacuum' types. Noisy data are synthesized by manually controlling the signal-to-noise ratio. First, the wav2vec 2.0 model [4] is pre-trained without supervision on the noisy data and then fine-tuned on the noisy data, yielding recognition results for the models at different signal-to-noise ratios. The proposed model (FIG. 1) is then trained, in which the clean speech and the noisy speech pass through the same feature extractor module, Z_noisy = f(X_noisy) and Z_clean = f(X_clean), to obtain the clean speech features Z_clean and the noisy speech features Z_noisy, respectively. The noisy features then enter the context module, C_noisy = g(Z_noisy), and the clean features enter the quantization module, q_clean = VQ(Z_clean); the quantized clean features provide the training targets for the entire model during the pre-training phase. Through the contrast loss function, the model learns the ability to predict the clean speech representation from the noisy speech representation.
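A minimal sketch of synthesizing noisy speech at a chosen signal-to-noise ratio, as used to construct the noisy data above, is given below; it is illustrative only and not code from the patent.

```python
# Mix a noise signal into a speech signal at a target SNR (in dB).
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    # Loop or crop the noise to the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that the mixture reaches the requested SNR.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```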
Secondly, after the pre-training process is finished, the quantization module is removed, a linear mapping layer is connected after the context module, and the pre-trained model is fine-tuned with the CTC loss function using the labeled noisy data. The fine-tuning data are the 100 h of labeled data after dynamic noise addition. The experimental results are shown in Table 1 for the relatively stationary noise types and in Table 2 for the relatively non-stationary noise types, where the 'Pre-train' column indicates the noise data used in the pre-training stage, the 'Fine-tune' column indicates the noise data used in the fine-tuning stage, and 'No' indicates no pre-training stage.
TABLE 1 relatively stationary noise types, word error rates at different SNR for different methods
(table not reproduced in this text extraction)
TABLE 2 Relatively non-stationary noise types, word error rates of different methods at different signal-to-noise ratios
(table not reproduced in this text extraction)
TABLE 3 Cosine similarity between the fine-tuned representations of each model and the fine-tuned clean speech baseline system
(table not reproduced in this text extraction)
From the experimental results of Tables 2 and 3 it can be found that: 1) the datasets used are identical to those in the reference article, and although the model structure used differs somewhat, the results are comparable to the baseline system in that article. 2) For the wav2vec 2.0 model, compared with 'no pre-train FreeSound fine-tune' and 'no pre-train clean fine-tune', fine-tuning on the noisy dataset improves recognition performance on the noisy test set. On both the clean and the noisy test sets, the recognition performance of 'clean pre-train FreeSound fine-tune' is better than that of 'no pre-train FreeSound fine-tune', which indicates that additionally introducing a pre-trained model can improve the noise robustness of the model. The performance of 'FreeSound pre-train FreeSound fine-tune' is better than that of 'clean pre-train FreeSound fine-tune', showing that the wav2vec 2.0 model can still learn a more robust representation when pre-trained in a noisy environment, improving the noise robustness of the ASR system at different signal-to-noise ratios; however, the performance of this model on the clean test set drops considerably. 3) To alleviate this problem, noise-clean paired data are generated and the clean speech is treated as the training target of the model during the pre-training phase. Under the same conditions, the proposed training method not only brings a further performance improvement on the noisy test set but also reduces the performance degradation on the clean test set; the experimental results show the effectiveness of the method. 4) To further verify the robustness of the pre-trained model to noise types, experiments were performed in which the NoiseX-92 noise dataset was used to generate the pre-training data and the FreeSound noise dataset was used to generate the fine-tuning data. The results are shown in Tables 2 and 3. The pre-trained models obtained on different noise types can still improve the noise robustness of the speech recognition model, and the improvement in speech recognition performance is larger when the pre-training data and the fine-tuning data come from the same domain.
In addition, the representations after fine-tuning of the different models were measured quantitatively. Taking the clean speech baseline system as the reference, the difference between the fine-tuned representation of each model and the fine-tuned representation of the clean speech baseline system was measured, using cosine similarity as the criterion; the results are shown in Table 3. It can be seen from Table 3 that the proposed method achieves the largest cosine similarity among the different methods and across the test sets with different signal-to-noise ratios, which indicates that the representations obtained by the method under different noise conditions are cleaner and more helpful for the ASR task.
In general, the invention provides a noise-robust speech recognition model based on unsupervised pre-training. The aim is to improve the robustness of the speech recognition model under different noise environments through an unsupervised pre-training method. The training of the model mainly comprises a pre-training stage and a fine-tuning stage: a noise-robust representation is obtained through unsupervised pre-training on noisy speech data, and the model is then fine-tuned on labeled noisy data to improve speech recognition performance in different noise scenes.
Fig. 5 schematically shows a flow chart of a speech recognition method according to an embodiment of the invention.
As shown in fig. 5, operations S510 to S520 are included.
In operation S510, a voice to be recognized is acquired;
in operation S520, the speech to be recognized is processed by using the speech recognition model, and a speech recognition result is obtained, wherein the speech recognition model is obtained by training according to the training method of the speech recognition model.
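For illustration, the recognition step of operations S510 and S520 could be sketched as follows, assuming the modules and fine-tuning head sketched earlier in this description; greedy CTC decoding and the character vocabulary are assumptions.

```python
# Sketch of inference with the trained model: extract features, encode
# context, project to the vocabulary, and greedily decode the CTC output.
import torch

@torch.no_grad()
def recognize(wav, extractor, context, head, vocab):
    z = extractor(wav.unsqueeze(0))           # (1, frames, dim)
    c = context(z)                            # context representation
    ids = head.proj(c).argmax(-1).squeeze(0)  # frame-wise best tokens
    out, prev = [], 0                         # collapse repeats, drop blanks (id 0)
    for i in ids.tolist():
        if i != prev and i != 0:
            out.append(vocab[i])
        prev = i
    return "".join(out)
```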
FIG. 6 schematically illustrates a block diagram of an electronic device adapted to implement the training method of a speech recognition model and the speech recognition method according to an embodiment of the present invention.
As shown in fig. 6, an electronic device 600 according to an embodiment of the present invention includes a processor 601 which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. Processor 601 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or associated chipset, and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), among others. The processor 601 may also include onboard memory for caching purposes. Processor 601 may include a single processing unit or multiple processing units for performing the different actions of the method flows according to embodiments of the present invention.
In the RAM 603, various programs and data necessary for the operation of the electronic apparatus 600 are stored. The processor 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. The processor 601 performs various operations of the method flow according to the embodiments of the present invention by executing programs in the ROM 602 and/or RAM 603. It is to be noted that the programs may also be stored in one or more memories other than the ROM 602 and RAM 603. The processor 601 may also perform various operations of method flows according to embodiments of the present invention by executing programs stored in the one or more memories.
Electronic device 600 may also include input/output (I/O) interface 605, where input/output (I/O) interface 605 is also connected to bus 604, according to an embodiment of the invention. The electronic device 600 may also include one or more of the following components connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the internet. The driver 610 is also connected to the I/O interface 605 as needed. A removable medium 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 610 as necessary, so that a computer program read out therefrom is mounted in the storage section 608 as necessary.
The present invention also provides a computer-readable storage medium, which may be contained in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement a method according to an embodiment of the invention.
According to embodiments of the present invention, the computer readable storage medium may be a non-volatile computer readable storage medium, which may include, for example but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to an embodiment of the present invention, a computer-readable storage medium may include the ROM 602 and/or the RAM603 described above and/or one or more memories other than the ROM 602 and the RAM 603.
The above-mentioned embodiments, objects, technical solutions and advantages of the present invention are further described in detail, it should be understood that the above-mentioned embodiments are only examples of the present invention, and should not be construed as limiting the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method of training a speech recognition model, comprising:
respectively processing clean voice and voice with noise by using a feature extraction module of the voice recognition model to obtain clean voice features and voice features with noise;
processing the voice characteristics with noise by utilizing a context module of the voice recognition model to obtain a context representation;
clustering the clean voice features by utilizing a quantization module of the voice recognition model to obtain quantized clean voice features;
processing the context representation and the quantized clean voice features by utilizing a pre-training loss function to obtain a pre-training loss value;
and optimizing the voice recognition model according to the pre-training loss value.
2. The method of claim 1, further comprising:
preprocessing the noise-carrying voice with the label by utilizing the linear mapping layer to obtain the preprocessed noise-carrying voice with the label;
processing the preprocessed noise-carrying voice with the label by utilizing a feature extraction module of the voice recognition model to obtain the noise-carrying voice feature with the label;
processing the noise-carrying voice characteristics with the labels by utilizing a context module of the voice recognition model to obtain context representations with the labels;
processing the tagged context tokens with the fine tuning loss function and optimizing the speech recognition model according to a fine tuning loss value;
and iterating and performing pre-training loss optimization and fine-tuning loss optimization until the pre-training loss value and/or the fine-tuning loss value meet a preset condition to obtain a trained voice recognition model.
3. The method of claim 1, wherein the feature extraction module employs a multi-layer convolutional neural network or a multi-layer depthwise separable convolutional neural network, each layer of the depthwise separable convolutional neural network comprising a channel-wise convolution and a point-wise convolution;
wherein the context module employs a multi-layer Transformer network, each layer of which comprises a self-attention layer and a feedforward neural network layer;
wherein the quantization module employs a product quantization codebook and is optimized through a Gumbel-softmax activation function.
4. The method of claim 1, wherein the pre-training loss function is determined by a weighted sum of a contrast loss function, a codebook diversity loss function, an L2 loss function, and a consistency loss function.
5. The method of claim 4, wherein the contrast loss function is determined by equation (1):

L_m = -\log \frac{\exp(\mathrm{sim}(c_t, q_t)/\kappa)}{\sum_{\tilde{q} \in Q_t} \exp(\mathrm{sim}(c_t, \tilde{q})/\kappa)}   (1)

wherein the codebook diversity loss function is determined by equation (2):

L_d = \frac{1}{GV} \sum_{g=1}^{G} \sum_{v=1}^{V} p_{g,v} \log p_{g,v}   (2)

wherein p_{g,v} is determined by equation (3):

p_{g,v} = \frac{\exp((l_{g,v} + n_v)/\tau)}{\sum_{k=1}^{V} \exp((l_{g,k} + n_k)/\tau)}   (3)

wherein the consistency loss function is determined by equation (4):

L_c = \| Z_{noisy} - Z_{clean} \|_2   (4)

wherein Z_{noisy} represents the noisy speech features, Z_{clean} represents the clean speech features, c_t and q_t represent the context representation and the quantized clean speech feature of the t-th frame, sim(·,·) represents cosine similarity, p_{g,v} represents the probability of selecting the v-th entry of the g-th codebook group, l_{g,v} represents the corresponding logit, n_v and n_k represent random noise perturbations with different means and variances obeying a Gaussian distribution, τ and κ represent non-negative temperature coefficients, G represents the number of quantization codebook groups, V represents the number of entries in each codebook group, and Q_t represents all quantized features, including the positive sample feature and the negative sample features.
6. The method of claim 1, wherein the fine-tuning loss function employs a Connectionist Temporal Classification (CTC) loss function or a cross-entropy loss function.
7. The method of claim 1, wherein the processing the tagged noisy speech feature with a context module of the speech recognition model to obtain a tagged context characterization comprises:
a linear mapping layer is accessed to the context module for processing the tagged noisy speech feature.
8. A speech recognition method, comprising:
acquiring a voice to be recognized;
processing the speech to be recognized by using a speech recognition model to obtain a speech recognition result, wherein the speech recognition model is obtained by training according to the method of any one of claims 1 to 7.
9. An electronic device, comprising:
one or more processors;
a storage device to store one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-8.
10. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method of any one of claims 1 to 8.
CN202210235275.8A 2022-03-11 2022-03-11 Training method of voice recognition model, voice recognition method and electronic equipment Pending CN114582330A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210235275.8A CN114582330A (en) 2022-03-11 2022-03-11 Training method of voice recognition model, voice recognition method and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210235275.8A CN114582330A (en) 2022-03-11 2022-03-11 Training method of voice recognition model, voice recognition method and electronic equipment

Publications (1)

Publication Number Publication Date
CN114582330A true CN114582330A (en) 2022-06-03

Family

ID=81774626

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210235275.8A Pending CN114582330A (en) 2022-03-11 2022-03-11 Training method of voice recognition model, voice recognition method and electronic equipment

Country Status (1)

Country Link
CN (1) CN114582330A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115472167A (en) * 2022-08-17 2022-12-13 南京龙垣信息科技有限公司 Voiceprint recognition model training method and system based on big data self-supervision
CN115206293A (en) * 2022-09-15 2022-10-18 四川大学 Multi-task air traffic control voice recognition method and device based on pre-training
CN115206293B (en) * 2022-09-15 2022-11-29 四川大学 Multi-task air traffic control voice recognition method and device based on pre-training
CN117975945A (en) * 2024-03-28 2024-05-03 深圳市友杰智新科技有限公司 Network generation method, device, equipment and medium for improving noisy speech recognition rate

Similar Documents

Publication Publication Date Title
CN114582330A (en) Training method of voice recognition model, voice recognition method and electronic equipment
KR100192854B1 (en) Method for spectral estimation to improve noise robustness for speech recognition
US20100174389A1 (en) Automatic audio source separation with joint spectral shape, expansion coefficients and musical state estimation
CN112509563B (en) Model training method and device and electronic equipment
CN109192200B (en) Speech recognition method
CN111161744B (en) Speaker clustering method for simultaneously optimizing deep characterization learning and speaker identification estimation
CN110428364B (en) Method and device for expanding Parkinson voiceprint spectrogram sample and computer storage medium
CN116635934A (en) Unsupervised learning of separate phonetic content and style representations
CN114724548A (en) Training method of multi-mode speech recognition model, speech recognition method and equipment
CN112053694A (en) Voiceprint recognition method based on CNN and GRU network fusion
Wang et al. Gated convolutional LSTM for speech commands recognition
CN115394287A (en) Mixed language voice recognition method, device, system and storage medium
CN116741159A (en) Audio classification and model training method and device, electronic equipment and storage medium
CN112927709A (en) Voice enhancement method based on time-frequency domain joint loss function
CN113763965A (en) Speaker identification method with multiple attention characteristics fused
US20230360636A1 (en) Quality estimation for automatic speech recognition
CN111640438B (en) Audio data processing method and device, storage medium and electronic equipment
CN113160823B (en) Voice awakening method and device based on impulse neural network and electronic equipment
CN113379757B (en) Method for training brain image segmentation model and brain image segmentation method
CN114818789A (en) Ship radiation noise identification method based on data enhancement
CN113538507B (en) Single-target tracking method based on full convolution network online training
CN110361090B (en) Future illuminance prediction method based on relevance of photovoltaic array sensor
Noyum et al. Boosting the predictive accurary of singer identification using discrete wavelet transform for feature extraction
CN113808573A (en) Dialect classification method and system based on mixed domain attention and time sequence self-attention
Gong et al. A Robust Feature Extraction Method for Sound Signals Based on Gabor and MFCC

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination