CN113870845A - Speech recognition model training method, device, equipment and medium - Google Patents

Speech recognition model training method, device, equipment and medium

Info

Publication number
CN113870845A
Authority
CN
China
Prior art keywords
recognition model
network
training
initial
teacher
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111130120.XA
Other languages
Chinese (zh)
Inventor
李泽远
王健宗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202111130120.XA priority Critical patent/CN113870845A/en
Publication of CN113870845A publication Critical patent/CN113870845A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit

Abstract

The invention relates to the field of artificial intelligence, and provides a speech recognition model training method, device, equipment and medium. The method comprises the following steps: acquiring a speech sample set containing speech samples; inputting a speech sample into an initial recognition model; obtaining an audio segment to be processed through audio enhancement processing; performing teacher acoustic feature extraction through a teacher network in the initial recognition model to obtain a first feature vector, and performing student acoustic feature extraction through a student network in the initial recognition model to obtain a second feature vector; performing alignment comparison processing in combination with a dynamic queue in the teacher network to obtain a loss value; and when the loss value does not reach a preset convergence condition, iteratively updating until convergence is reached, so as to obtain the trained speech recognition model. The invention performs speech recognition jointly through the teacher network and the student network and improves training efficiency. The method is applicable to the field of artificial intelligence and can further promote the construction of smart cities.

Description

Speech recognition model training method, device, equipment and medium
Technical Field
The invention relates to the field of artificial intelligence speech recognition, and in particular to a speech recognition model training method, apparatus, computer device and storage medium.
Background
Speech translation is the process of converting one natural language (the source language) into another natural language (the target language). Unlike traditional machine translation, the input of speech translation is speech itself and the output is text. With increasing international exchange, communication across different languages has become more and more frequent, and client-based online speech translation is widely used to overcome language barriers.
Online speech translation generally involves two stages. The first is speech recognition, that is, converting the speech signal in a first language input by the user into text. The second is translating that text online through a machine translation device to obtain text in a second language as the translation result, and finally providing the text or speech information in the second language to the user.
Disclosure of Invention
The invention provides a speech recognition model training method and device, a computer device and a storage medium. They realize self-supervised training of a speech recognition model without manual labeling: teacher acoustic feature extraction and student acoustic feature extraction are performed, and alignment comparison processing between the teacher network and the student network is carried out using a dynamic queue, so that training proceeds continuously and the training speed is improved. The structure of the student network is ultimately simplified while recognition accuracy is preserved, which improves the efficiency and accuracy of subsequent speech translation.
A method of speech recognition model training, comprising:
acquiring a voice sample set; the set of speech samples comprises a plurality of speech samples;
inputting the voice sample into an initial recognition model containing initial parameters;
carrying out audio enhancement processing on the voice sample through the initial recognition model to obtain an audio clip to be processed;
performing teacher acoustic feature extraction on the audio clip to be processed through a teacher network to obtain a first feature vector, and performing student acoustic feature extraction on the audio clip to be processed through a student network to obtain a second feature vector; wherein the initial recognition model comprises the teacher network and the student network; the student network is obtained after distillation learning is carried out on the teacher network;
aligning and comparing the first feature vector, the second feature vector and a dynamic queue in the teacher network to obtain a loss value;
and when the loss value does not reach a preset convergence condition, iteratively updating the initial parameters of the initial recognition model until the loss value reaches the convergence condition, and recording the initial recognition model after convergence as a trained voice recognition model.
A speech recognition model training apparatus comprising:
the acquisition module is used for acquiring a voice sample set; the set of speech samples comprises a plurality of speech samples;
the input module is used for inputting the voice sample into an initial recognition model containing initial parameters;
the enhancement module is used for carrying out audio enhancement processing on the voice sample through the initial recognition model to obtain an audio clip to be processed;
the extraction module is used for performing teacher acoustic feature extraction on the audio clip to be processed through a teacher network to obtain a first feature vector, and performing student acoustic feature extraction on the audio clip to be processed through a student network to obtain a second feature vector; wherein the initial recognition model comprises the teacher network and the student network; the student network is obtained after distillation learning is carried out on the teacher network;
the loss module is used for aligning and comparing the first characteristic vector, the second characteristic vector and the dynamic queue in the teacher network to obtain a loss value;
and the training module is used for iteratively updating the initial parameters of the initial recognition model when the loss value does not reach a preset convergence condition until the loss value reaches the convergence condition, and recording the initial recognition model after convergence as a trained voice recognition model.
A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above-mentioned speech recognition model training method when executing the computer program.
A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, carries out the steps of the above-mentioned speech recognition model training method.
The invention provides a speech recognition model training method and device, a computer device and a storage medium. A voice sample set containing a plurality of voice samples is acquired; the voice sample is input into an initial recognition model containing initial parameters; audio enhancement processing is performed on the voice sample through the initial recognition model to obtain an audio segment to be processed; teacher acoustic feature extraction is performed on the audio segment to be processed through a teacher network to obtain a first feature vector, and student acoustic feature extraction is performed on the audio segment to be processed through a student network to obtain a second feature vector, where the initial recognition model comprises the teacher network and the student network and the student network is obtained after distillation learning is carried out on the teacher network; alignment comparison processing is performed on the first feature vector, the second feature vector and a dynamic queue in the teacher network to obtain a loss value; and when the loss value does not reach a preset convergence condition, the initial parameters of the initial recognition model are iteratively updated until the loss value reaches the convergence condition, and the converged initial recognition model is recorded as the trained speech recognition model. In this way, useful audio information is enhanced automatically through the audio enhancement processing, a large number of voice samples do not need to be labeled, and labor cost is saved. Teacher acoustic features are extracted through the teacher network, student acoustic features are extracted through the student network obtained by distillation learning from the teacher network, alignment comparison processing is carried out in combination with the dynamic queue, and the speech recognition model is obtained through iterative training. By applying the distillation learning method and the self-supervised model training of the teacher network and the student network, manual labeling time and workload are reduced, speech recognition efficiency is improved through the student network, and the accuracy of speech recognition is improved through joint speech recognition by the teacher network and the student network.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from these drawings without inventive effort.
FIG. 1 is a diagram illustrating an application environment of a speech recognition model training method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method for training a speech recognition model according to an embodiment of the invention;
FIG. 3 is a flowchart of step S50 of the method for training a speech recognition model according to an embodiment of the present invention;
FIG. 4 is a schematic block diagram of a speech recognition model training apparatus according to an embodiment of the present invention;
FIG. 5 is a functional block diagram of a loss module of the speech recognition model training apparatus according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a computer device in an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The speech recognition model training method provided by the invention can be applied to the application environment shown in fig. 1, wherein a client (computer equipment or terminal) communicates with a server through a network. The client (computer device or terminal) includes, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices. The server may be an independent server, or may be a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like.
In an embodiment, as shown in fig. 2, a method for training a speech recognition model is provided, which mainly includes the following steps S10-S60:
s10, acquiring a voice sample set; the set of speech samples comprises a plurality of speech samples.
Understandably, the voice sample set is a set of all the voice samples. The voice samples are historically collected audio files; each voice sample may be an audio file of a preset duration, and a longer audio file can be segmented according to the preset duration to obtain the voice samples.
S20, inputting the voice sample into an initial recognition model containing initial parameters.
Understandably, the initial recognition model includes the initial parameters, the initial parameters are parameters of each level in the initial recognition model, the initial recognition model includes a teacher network and a student network, and the initial parameters include a teacher parameter corresponding to the teacher network and a student parameter corresponding to the student network.
S30, performing audio enhancement processing on the voice sample through the initial recognition model to obtain an audio clip to be processed.
Understandably, the audio enhancement processing proceeds as follows. First, pre-emphasis is applied to the high-frequency part of the voice sample: because the power spectrum of a speech signal falls as frequency increases, most of the speech energy is concentrated in the low-frequency part and the signal-to-noise ratio of the high-frequency part is very low, so the signal-to-noise ratio of the high-frequency part is raised with a first-order or second-order high-pass filter. Second, the pre-emphasized voice sample is framed and windowed: a preset duration (for example 10 ms, 15 ms or 20 ms) is taken as one frame, and to keep smooth, continuous transitions between frames, adjacent frames partially overlap (for example by 1 ms or 2 ms); preferably, the overlap is less than one third of the preset duration. Windowing applies a window function to each framed signal to extract it. Third, a Fourier transform is applied to each extracted frame signal and the magnitude is squared. Finally, the squared-magnitude signal is passed through a filter bank and converted to log power to obtain a feature vector, and the frame signals after audio enhancement processing are spliced to obtain the audio segment to be processed, which is a segment composed of feature vectors describing frequency-domain characteristics.
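The audio enhancement processing described above can be sketched in a few lines of Python. This is only a minimal illustration: the sampling rate, frame length, overlap, pre-emphasis coefficient and number of filter bands are hypothetical values that the description does not fix, and a simple rectangular band grouping stands in for the unspecified filter bank.

```python
import numpy as np

def audio_enhance(signal, sr=16000, frame_ms=20, overlap_ms=5,
                  pre_emph=0.97, n_bands=40):
    """Sketch of the audio enhancement step: pre-emphasis, framing with
    partial overlap, windowing, FFT power spectrum, filtering and log-power
    conversion. `signal` is a 1-D numpy array; all parameters are
    illustrative assumptions."""
    # 1) Pre-emphasis: a first-order high-pass filter boosts the
    #    high-frequency part, whose signal-to-noise ratio is otherwise low.
    emphasized = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])

    # 2) Framing with partial overlap (less than one third of the frame
    #    length), then windowing each frame with a Hamming window.
    frame_len = int(sr * frame_ms / 1000)
    step = frame_len - int(sr * overlap_ms / 1000)
    emphasized = np.pad(emphasized, (0, max(0, frame_len - len(emphasized))))
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // step)
    window = np.hamming(frame_len)
    frames = np.stack([emphasized[i * step:i * step + frame_len] * window
                       for i in range(n_frames)])

    # 3) Fourier transform and squared magnitude (power spectrum).
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2

    # 4) Filtering and log-power conversion; each frame yields one feature
    #    vector, and the frames together form the audio segment to be processed.
    bands = np.array_split(power, n_bands, axis=1)
    return np.log(np.stack([b.sum(axis=1) for b in bands], axis=1) + 1e-10)
```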
S40, performing teacher acoustic feature extraction on the audio clip to be processed through a teacher network to obtain a first feature vector, and performing student acoustic feature extraction on the audio clip to be processed through a student network to obtain a second feature vector; wherein the initial recognition model comprises the teacher network and the student network; the student network is obtained after distilling learning is carried out on the teacher network.
Understandably, the teacher network is a neural network model trained in advance. It extracts acoustic features from the input audio segment to be processed, outputs a first feature vector according to the extracted teacher acoustic features, and recognizes the output first feature vector to obtain text content. The student network is obtained after performing distillation learning on the teacher network; it extracts student acoustic features from the input audio segment to be processed in a distillation learning manner, outputs a second feature vector according to the extracted student acoustic features, and recognizes the output second feature vector to obtain text content. Preferably, the teacher network is a model constructed based on Bert and the student network is a model constructed based on TinyBert. The teacher acoustic feature extraction process performs Bert model encoding and feature normalization on the input audio segment to be processed, and the student acoustic feature extraction process performs compressed encoding and feature normalization after learning the teacher network with the distillation learning method.
The teacher acoustic features are features related to acoustic frequency, that is, features mapped to text content by sequence coding learned in the frequency domain. The student acoustic features are the mapping relations learned from the teacher acoustic features by applying the distillation learning method. The distillation learning method transfers and learns the parameters of corresponding layers, training a simple model (the student model) by using the output of a pre-trained complex model (the teacher network) as a supervision signal; for example, the TinyBert-based student network is obtained by distillation learning from the Bert-based teacher network, with the student network distilling every N layers.
In an embodiment, before the step S40, that is, before the teacher acoustic feature extraction is performed on the audio segment to be processed through a teacher network to obtain the first feature vector, the method includes:
acquiring a pre-training sample set; the pre-training sample set comprises a plurality of pre-training samples; one of the pre-training samples corresponds to one of the text labels.
Understandably, the pre-training sample set is a set of all the pre-training samples; it may be a subset of the speech sample set after the audio enhancement processing. A pre-training sample is a historically collected sample that has undergone a small amount of manual labeling and the audio enhancement processing, and the text label is the manually labeled text content of the corresponding pre-training sample.
Inputting the pre-training sample into an initial network containing teacher parameters; the initial network is a model constructed based on Bert.
Understandably, the teacher parameters are the parameters of each layer of the initial network, and the Bert model is a language model whose network structure adopts masked predictive coding.
And extracting frequency domain features of the pre-training samples through the initial network by using the Moco training method, coding the extracted frequency domain features to obtain a feature vector to be recognized, and inserting the feature vector to be recognized into a dynamic queue in the initial network.
Understandably, the Moco training method updates negative samples by using a dynamic queue, so that training on large numbers of samples can be accommodated while consistency among the negative samples is maintained. During frequency domain feature extraction, the dynamic queue is used to pull the representation close to the correct sample and push it away from the negative samples (i.e., incorrect samples). The initial dynamic queue consists of all collected negative samples, that is, samples different from the input pre-training sample. The frequency domain feature extraction extracts features related to human voice frequency from the pre-training sample, and the encoding process encodes the extracted frequency domain features, that is, performs sequence conversion with a mapping function to obtain the feature vector to be recognized. The feature vector to be recognized is then updated into the dynamic queue, so the updated dynamic queue includes the feature vector to be recognized and a plurality of negative samples.
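As a rough illustration of the dynamic queue used by the Moco training method above, the sketch below keeps a fixed-size first-in, first-out store of negative-sample feature vectors and scores a query against it by dot product; the queue size, feature dimension and temperature are assumptions, not values given in the description.

```python
import numpy as np

class DynamicQueue:
    """Minimal sketch of a MoCo-style dynamic queue of negative samples."""
    def __init__(self, dim=256, size=4096, seed=0):
        rng = np.random.default_rng(seed)
        # the initial queue holds only collected negative samples
        self.keys = rng.standard_normal((size, dim)).astype(np.float32)
        self.keys /= np.linalg.norm(self.keys, axis=1, keepdims=True)

    def enqueue(self, new_key):
        """Insert the newest feature vector and drop the oldest one
        (first in, first out), so the queue size stays constant."""
        self.keys = np.vstack([self.keys[1:], new_key[None, :]])

    def similarities(self, query, temperature=0.07):
        """Dot-product similarity of a query against every stored key,
        scaled by a temperature (the temperature is an assumption)."""
        return (self.keys @ query) / temperature

# usage: enqueue the feature vector to be recognized, then score it so that
# it is pulled toward the correct sample and away from the negatives
queue = DynamicQueue()
vec = np.ones(256, dtype=np.float32) / np.sqrt(256.0)
queue.enqueue(vec)
scores = queue.similarities(vec)
```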
And performing character prediction according to the feature vector to be recognized and the inserted dynamic queue to obtain a text recognition result corresponding to the pre-training sample.
Understandably, the text prediction process is as follows: the feature vector to be recognized undergoes a same-dimension sequence conversion, and at the same time point-wise code conversion is performed between the feature vector to be recognized and each feature vector in the inserted dynamic queue; masked predictive coding and fine-tuning text decoding are then applied to the sequence-converted and code-converted feature vectors, and comparison prediction over all the masking sequences after fine-tuning text decoding yields the text recognition result corresponding to the pre-training sample.
In an embodiment, the performing text prediction according to the feature vector to be recognized and the inserted dynamic queue to obtain a text recognition result corresponding to the pre-training sample includes:
and performing conversion coding on the characteristic vector to be identified to obtain a first coding sequence, and performing dot product coding on the characteristic vector to be identified and the inserted dynamic queue to obtain a plurality of second coding sequences.
Understandably, the conversion coding is a transform with the same dimension as the input feature vector and can be regarded as regularization; coding normalization is applied to the feature vector to be recognized to obtain the first coding sequence. The dot-product coding computes a dot product between the feature vector to be recognized and each feature vector inserted in the dynamic queue, and regularizes the same-dimension feature vectors after the dot-product calculation, giving the second coding sequences corresponding one-to-one to the feature vectors in the dynamic queue.
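The two codings above could look roughly like the following; the exact operations (how the dot product is applied and how the regularization is done) are not spelled out in the description, so this sketch simply uses L2 normalization and treats that interpretation as an assumption.

```python
import numpy as np

def conversion_coding(vec):
    """Same-dimension transform of the feature vector to be recognized;
    L2 normalization stands in for the regularization step (assumption)."""
    return vec / (np.linalg.norm(vec) + 1e-10)

def dot_product_coding(vec, queue):
    """Dot-product coding: weight each feature vector in the dynamic queue
    by its dot product with the feature vector to be recognized, then
    regularize, giving one second coding sequence per queue entry."""
    weighted = queue * (queue @ vec)[:, None]
    return weighted / (np.linalg.norm(weighted, axis=1, keepdims=True) + 1e-10)
```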
And carrying out masking prediction coding on the first coding sequence and each second coding sequence to obtain a plurality of masking sequences, and updating the dynamic queue.
Understandably, the masking predictive coding, also called MPC (Masked Predictive Coding), performs predictive coding with a machine-learning Transformer model: 15% of the positions of each masking sequence are randomly masked; among the selected masked frames, 80% are represented by a zero vector, 10% are represented by the information of another random frame, and the remaining 10% are left unchanged. This finally yields the masking sequence corresponding to the first coding sequence and the masking sequence corresponding to each second coding sequence. After the masking predictive coding, the dynamic queue is updated by removing the feature vector inserted earliest, that is, according to a first-in first-out rule, so that the size of the dynamic queue stays unchanged; the size of the negative-sample dictionary is therefore maintained and model space is saved.
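The 15% / 80% / 10% / 10% masking rule can be written out directly; the sketch below covers only the frame-masking step, not the Transformer-based predictive coder itself, and the helper name is invented for illustration.

```python
import numpy as np

def mask_frames(frames, mask_ratio=0.15, rng=None):
    """Sketch of the masking rule of masked predictive coding: 15% of the
    frames are selected; of those, 80% become a zero vector, 10% are
    replaced with the information of another random frame, and the
    remaining 10% are left unchanged."""
    rng = np.random.default_rng() if rng is None else rng
    frames = frames.copy()
    n = len(frames)
    selected = rng.choice(n, size=max(1, int(n * mask_ratio)), replace=False)
    for idx in selected:
        r = rng.random()
        if r < 0.8:                                  # 80%: zero vector
            frames[idx] = 0.0
        elif r < 0.9:                                # 10%: random other frame
            frames[idx] = frames[rng.integers(n)]
        # remaining 10%: frame is kept unchanged
    return frames, selected   # masked frames and the positions to predict
```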
And carrying out fine-tuning character decoding on each masking sequence, and carrying out comparison prediction on all the masking sequences after the fine-tuning character decoding to obtain the text recognition result.
Understandably, the fine-tuning character decoding approximates sequence variables for the input masking sequence, approaching the sequence variable of each unit sequence in the masking sequence and its nearest neighbor, and decodes a corresponding text vector according to the sequence variables. The comparison prediction compares the text vectors decoded and output for each masking sequence, shrinking the distance to the feature vector to be recognized in the dynamic queue and enlarging the distance to the negative samples in the dynamic queue, thereby predicting the character corresponding to each unit sequence. The characters corresponding to all the unit sequences are spliced, and according to the Bert model's context-dependent semantic prediction, the piece of text with the highest prediction probability is obtained and determined as the text recognition result.
and determining a contrast loss value according to the text label corresponding to the pre-training sample and the text recognition result.
Understandably, the text label and the text recognition result are input into a loss function in the initial network, and the contrast loss value corresponding to the pre-training sample is calculated. The loss function may be set as required; for example, it may be a cross-entropy loss function, i.e., a logarithmic loss over the text label and the text recognition result, indicating the difference between the two.
And when the contrast loss value does not reach the pre-training convergence condition, iteratively updating the teacher parameters of the initial network until the contrast loss value reaches the pre-training convergence condition, and recording the converged initial network as the teacher network.
Understandably, the pre-training convergence condition may be that the contrast loss value becomes small and no longer decreases after 3000 computations; that is, when the contrast loss value is small and does not decrease again after 3000 computations, training is stopped and the converged initial network is recorded as the teacher network. The pre-training convergence condition may also be that the contrast loss value is smaller than a set threshold; that is, when the contrast loss value is smaller than the set threshold, training is stopped and the converged initial network is recorded as the teacher network. Thus, as long as the contrast loss value does not reach the pre-training convergence condition, the teacher parameters of the initial network are continuously adjusted, drawing the initial network ever closer to an accurate result so that recognition accuracy keeps increasing. In this way the accuracy of speech recognition and the efficiency of recognizing text from speech are improved, the capacity of the teacher network is optimized, and the dynamic queue does not need to grow continuously to serve as negative samples for speech recognition.
The invention thus achieves: acquiring the pre-training sample set; inputting the pre-training sample into an initial network containing teacher parameters, the initial network being a model constructed based on Bert; extracting frequency domain features of the pre-training samples through the initial network by using the Moco training method, coding the extracted frequency domain features to obtain the feature vector to be recognized, and inserting the feature vector to be recognized into the dynamic queue in the initial network; performing character prediction according to the feature vector to be recognized and the inserted dynamic queue to obtain the text recognition result corresponding to the pre-training sample; determining the contrast loss value according to the text label corresponding to the pre-training sample and the text recognition result; and, when the contrast loss value does not reach the pre-training convergence condition, iteratively updating the teacher parameters of the initial network until the contrast loss value reaches the pre-training convergence condition, and recording the converged initial network as the teacher network.
In an embodiment, said recording said initial network after convergence as a teacher network comprises:
and performing interlayer distillation treatment on each layer in the teacher network by using a distillation learning method to obtain a distillation layer.
Understandably, the distillation learning method transfers and learns the parameters of corresponding layers, training a simple model (the student model) by using the output of a pre-trained complex model (the teacher model, i.e., the teacher network) as a supervision signal. For example, the TinyBert-based student network is obtained by distillation learning from the Bert-based teacher network, distilling every N layers: if the teacher network has 12 layers and the student network is set to 4 layers, a Transformer loss is computed every 3 layers, and the mapping function is g(m) = 3 × m, where m is the index of the coding-related layer in the student network. The specific correspondence is as follows: the 1st Transformer layer of the student network corresponds to the 3rd layer of the teacher network, the 2nd layer of the student network to the 6th layer of the teacher network, the 3rd layer of the student network to the 9th layer of the teacher network, and the 4th layer of the student network to the 12th layer of the teacher network.
The interlayer distillation processing marks one distillation layer every preset N layers.
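For the 12-layer teacher and 4-layer student described above, the every-N-layers mapping g(m) = 3 × m and the parameter migration can be sketched as follows; the layer objects are placeholders, since the actual Bert/TinyBert structures are not reproduced here.

```python
def map_student_to_teacher(student_layers=4, teacher_layers=12):
    """Every-N-layers distillation mapping g(m) = N * m, e.g. student layer 1
    maps to teacher layer 3 and student layer 4 to teacher layer 12."""
    interval = teacher_layers // student_layers        # N = 3 in this example
    return {m: interval * m for m in range(1, student_layers + 1)}

def migrate_distillation_layers(teacher, mapping):
    """Copy (migrate) the parameters of each mapped teacher layer into the
    corresponding distillation layer of the student; the teacher parameters
    themselves stay frozen during later training."""
    return [teacher[t - 1] for t in sorted(mapping.values())]

# map_student_to_teacher() -> {1: 3, 2: 6, 3: 9, 4: 12}
teacher = [f"teacher_layer_{i}_params" for i in range(1, 13)]   # placeholders
student_layers = migrate_distillation_layers(teacher, map_student_to_teacher())
```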
And structurally splicing all the distillation layers, and migrating from the teacher network to obtain student parameters in each distillation layer.
Understandably, the structure splicing is a process of splicing input and output between adjacent distillation layers, student parameters in each distillation layer are obtained by migration from the teacher network, the teacher parameters are frozen in a speech recognition model training process, and the student parameters are updated in an iteration mode.
Constructing the student network based on TinyBert according to all the migrated distillation layers; wherein a hierarchy of the student network is smaller than a hierarchy of the teacher network.
Understandably, input-output vector alignment is performed according to all the migrated distillation layers, thereby constructing the TinyBert-based student network. The hierarchy of the student network is smaller than that of the teacher network, and the levels migrated into the student network include the embedding layer, the conversion layer and the prediction layer of the teacher network.
According to the invention, each layer in the teacher network is subjected to interlayer distillation processing by using the distillation learning method to obtain the distillation layers; all the distillation layers are structurally spliced, and the student parameters in each distillation layer are obtained by migration from the teacher network; and the TinyBert-based student network is constructed from all the migrated distillation layers. In this way, parameters in the distillation layers can be migrated with the distillation learning method, samples do not need to be labeled in advance, and only acoustic feature vectors need to be extracted; the hierarchical mapping relation is compressed from the multi-layer output of the teacher network into the single-layer output of the student network, which speeds up the output of the student network.
And S50, aligning and comparing the first feature vector, the second feature vector and the dynamic queue in the teacher network to obtain a loss value.
Understandably, in the alignment comparison processing the first feature vector is added into the dynamic queue as a new historical feature vector; this alleviates the problem that no correct feature vector can be found for alignment, and thus avoids an endless loop. Inner-product processing is performed between the first feature vector and each historical feature vector, and at the same time between the second feature vector and each historical feature vector, so as to determine the loss value. The alignment comparison aligns to the correct acoustic features (including the teacher acoustic features and the student acoustic features) and contrasts against (stays away from) other irrelevant features. While outputting the loss value, the student network, having migrated the conversion layer and the prediction layer under the distillation learning method, can also perform masked predictive coding and text prediction on the second feature vector to obtain the text content corresponding to the second feature vector.
In an embodiment, as shown in fig. 3, the performing alignment comparison processing on the first feature vector, the second feature vector, and a dynamic queue in the teacher network to obtain a loss value includes:
adding the first feature vector into the dynamic queue as a new historical feature vector; wherein the dynamic queue comprises a plurality of the historical feature vectors.
Understandably, the initial dynamic queue consists of the historical feature vectors of negative samples; the queue is updated dynamically during continuous training and learning, and new historical feature vectors are continuously introduced, that is, the first feature vector of each input is added into the dynamic queue.
And performing inner product processing on the first feature vector and each historical feature vector to obtain a first similarity value, and performing inner product processing on the second feature vector and each historical feature vector to obtain a second similarity value.
Understandably, the inner product processing performs dot-product processing between the input feature vector and each historical feature vector, introducing the weight of each feature vector in the inner product, so as to obtain a similarity value.
And calculating the cross entropy of the first similarity value and the second similarity value to obtain the loss value.
Understandably, calculating the loss between the first similarity value and the second similarity value by using a cross entropy formula to obtain the loss value.
The invention thus achieves: adding the first feature vector into the dynamic queue as a new historical feature vector, the dynamic queue comprising a plurality of historical feature vectors; performing inner product processing on the first feature vector and each historical feature vector to obtain the first similarity value, and simultaneously performing inner product processing on the second feature vector and each historical feature vector to obtain the second similarity value; and calculating the cross entropy of the first similarity value and the second similarity value to obtain the loss value. The alignment between the teacher network output and the student network output is therefore compared by means of the dynamic queue and inner product processing, and the corresponding loss value is determined, providing an iteration basis for training the speech recognition model. The teacher parameters in the teacher network are frozen, and the student parameters in the student network can be optimized iteratively with the teacher network's recognition as the standard; no samples need to be labeled for the student network, achieving a self-supervised learning mode, improving the learning efficiency of the student network and reducing its learning cost.
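A compact numpy sketch of the alignment comparison in step S50: the teacher output is enqueued as a new historical feature vector, both outputs are scored against the queue by inner product, and the cross entropy of the two similarity distributions gives the loss. Normalizing the similarity values with a softmax is an assumption; the description only states that the cross entropy of the two similarity values is calculated.

```python
import numpy as np

def alignment_loss(first_vec, second_vec, queue):
    """Sketch of the alignment comparison processing: enqueue the first
    (teacher) feature vector, take inner products of both feature vectors
    with every historical feature vector, and compute the cross entropy."""
    # add the first feature vector as a new historical feature vector (FIFO)
    queue = np.vstack([queue[1:], first_vec[None, :]])

    first_sim = queue @ first_vec      # first similarity values
    second_sim = queue @ second_vec    # second similarity values

    def softmax(x):                    # turns similarities into a distribution
        e = np.exp(x - x.max())
        return e / e.sum()

    p, q = softmax(first_sim), softmax(second_sim)
    loss = -np.sum(p * np.log(q + 1e-10))    # cross entropy H(p, q)
    return loss, queue
```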
And S60, when the loss value does not reach the preset convergence condition, iteratively updating the initial parameters of the initial recognition model until the loss value reaches the convergence condition, and recording the initial recognition model after convergence as the trained voice recognition model.
Understandably, the convergence condition may be that the loss value becomes small and no longer decreases after 5000 computations; that is, when the loss value is small and does not decrease again after 5000 computations, training is stopped and the converged initial recognition model is recorded as the trained speech recognition model. The convergence condition may also be that the loss value is smaller than a set convergence threshold; that is, when the loss value is smaller than the set convergence threshold, training is stopped and the converged initial recognition model is recorded as the trained speech recognition model. Thus, as long as the loss value does not reach the convergence condition, the initial parameters of the initial recognition model are continuously adjusted, with the teacher parameters frozen and the student parameters adjusted, so that the network is continuously drawn toward an accurate result and the accuracy of speech recognition increases. In this way the accuracy of speech recognition and the efficiency of recognizing text from speech are improved, the capacity of the speech recognition model is optimized, and the dynamic queue does not need to grow continuously to serve as negative samples for speech recognition.
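Putting the pieces together, the outer training loop can be sketched as a toy end-to-end example: a frozen random projection stands in for the teacher network, a trainable projection for the student network, and training stops when the loss reaches the preset convergence condition. The shapes, the simple update rule and the stopping threshold are all assumptions (the description does not specify the optimizer), and the example reuses the alignment_loss sketch above.

```python
import numpy as np

def train_until_convergence(samples, queue, dim=256, lr=1e-3,
                            threshold=1e-3, max_steps=5000, seed=0):
    """Toy sketch of steps S20-S60: teacher parameters are frozen, student
    parameters are iteratively updated until the convergence condition."""
    rng = np.random.default_rng(seed)
    teacher_w = rng.standard_normal((dim, dim)) * 0.01   # frozen teacher params
    student_w = rng.standard_normal((dim, dim)) * 0.01   # iteratively updated

    loss = np.inf
    for step in range(max_steps):
        x = samples[step % len(samples)]          # enhanced audio segment features
        first_vec = teacher_w @ x                 # teacher acoustic features (S40)
        second_vec = student_w @ x                # student acoustic features (S40)
        loss, queue = alignment_loss(first_vec, second_vec, queue)   # S50
        if loss < threshold:                      # preset convergence condition (S60)
            break
        # crude update nudging the student output toward the teacher output;
        # this stands in for the unspecified iterative update of student params
        student_w += lr * np.outer(first_vec - second_vec, x)
    return student_w, loss
```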
The invention thus achieves: acquiring the voice sample set comprising a plurality of voice samples; inputting the voice sample into the initial recognition model containing initial parameters; performing audio enhancement processing on the voice sample through the initial recognition model to obtain the audio segment to be processed; performing teacher acoustic feature extraction on the audio segment to be processed through the teacher network to obtain the first feature vector, and performing student acoustic feature extraction on the audio segment to be processed through the student network to obtain the second feature vector, where the initial recognition model comprises the teacher network and the student network and the student network is obtained after distillation learning is carried out on the teacher network; performing alignment comparison processing on the first feature vector, the second feature vector and the dynamic queue in the teacher network to obtain the loss value; and, when the loss value does not reach the preset convergence condition, iteratively updating the initial parameters of the initial recognition model until the loss value reaches the convergence condition, and recording the converged initial recognition model as the trained speech recognition model.
Useful audio information is thus enhanced automatically through the audio enhancement processing, a large number of voice samples do not need to be labeled, and labor cost is saved; teacher acoustic features are extracted through the teacher network and student acoustic features through the student network obtained by distillation learning from the teacher network, alignment comparison processing is carried out in combination with the dynamic queue, and the speech recognition model is obtained through iterative training. By applying the distillation learning method and the self-supervised model training of the teacher network and the student network, manual labeling time and workload are reduced, the student network speeds up speech recognition and improves its efficiency, and the accuracy of speech recognition is improved through joint speech recognition by the teacher network and the student network.
In an embodiment, the recording the initial recognition model after convergence as a trained speech recognition model comprises:
inputting the speech to be recognized into the speech recognition model trained by the speech recognition model training method, distilling and extracting frequency domain characteristics of the speech to be recognized through a student network in the speech recognition model, and performing character prediction according to the distilled and extracted frequency domain characteristics to obtain a text to be translated corresponding to the speech to be recognized; the speech to be recognized is obtained from a translation request containing a translation target language.
Understandably, after training is completed, the student network in the speech recognition model performs distillation-based extraction of frequency domain features, that is, it extracts the student acoustic features learned through the distillation learning method, and the character prediction obtains the text to be translated by predicting from the distilled frequency domain features. The number of extraction levels is thereby greatly reduced, the translation lag and the speech recognition time are greatly shortened, and the timeliness and accuracy of translation are improved. The translation request arises while audio files are collected in real time: short-duration audio files extracted at short intervals during collection trigger the request, each short-duration audio file is recorded as the speech to be recognized, and the translation target language is the language into which translation is required.
Inputting the text to be translated into a trained translation model corresponding to the translation target language, and performing translation processing through the translation model to obtain a translation text corresponding to the text to be translated.
Understandably, different translation target languages correspond to different translation models, each translation model is trained, each translation model can perform corresponding mapping relation conversion with the content of the translation target language on the input text content, so that a translation text corresponding to the text to be translated is obtained, and the translation text represents the text content converted into the corresponding translation target language.
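The inference flow of this embodiment (speech to be recognized, then the student network of the trained recognition model, then the translation model for the requested target language) could be wired up roughly as below; every object and method name here is hypothetical, introduced only to show the order of the calls.

```python
def translate_speech(speech, recognition_model, translation_models, target_lang):
    """Hypothetical sketch: recognize the speech with the student network of
    the trained speech recognition model, then translate the resulting text
    with the translation model corresponding to the translation target language."""
    text_to_translate = recognition_model.student_network.recognize(speech)
    translation_model = translation_models[target_lang]   # one trained model per language
    return translation_model.translate(text_to_translate)
```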
The invention thus achieves: inputting the speech to be recognized into the speech recognition model trained by the speech recognition model training method, performing distillation-based extraction of frequency domain features of the speech to be recognized through the student network in the speech recognition model, and performing character prediction according to the distilled frequency domain features to obtain the text to be translated corresponding to the speech to be recognized, the speech to be recognized being obtained from a translation request containing a translation target language; and inputting the text to be translated into the trained translation model corresponding to the translation target language and performing translation processing through the translation model to obtain the translation text corresponding to the text to be translated.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
In one embodiment, a speech recognition model training apparatus is provided, which corresponds to the speech recognition model training method in the above embodiments one to one. As shown in fig. 4, the speech recognition model training apparatus includes an obtaining module 11, an input module 12, an enhancing module 13, an extracting module 14, a losing module 15, and a training module 16. The functional modules are explained in detail as follows:
an obtaining module 11, configured to obtain a voice sample set; the set of speech samples comprises a plurality of speech samples;
an input module 12, configured to input the speech sample into an initial recognition model containing initial parameters;
the enhancement module 13 is configured to perform audio enhancement processing on the voice sample through the initial recognition model to obtain an audio segment to be processed;
the extraction module 14 is configured to perform teacher acoustic feature extraction on the audio segment to be processed through a teacher network to obtain a first feature vector, and perform student acoustic feature extraction on the audio segment to be processed through a student network to obtain a second feature vector; wherein the initial recognition model comprises the teacher network and the student network; the student network is obtained after distillation learning is carried out on the teacher network;
a loss module 15, configured to perform alignment comparison processing on the first feature vector, the second feature vector, and a dynamic queue in the teacher network to obtain a loss value;
and the training module 16 is configured to iteratively update the initial parameters of the initial recognition model when the loss value does not reach a preset convergence condition, and record the initial recognition model after convergence as a trained speech recognition model until the loss value reaches the convergence condition.
In one embodiment, as shown in fig. 5, the loss module 15 includes:
the adding submodule 51 is configured to add the first feature vector into the dynamic queue as a new historical feature vector; wherein the dynamic queue comprises a plurality of the historical feature vectors;
an inner product sub-module 52, configured to perform inner product processing on the first feature vector and each historical feature vector to obtain a first similar value, and perform inner product processing on the second feature vector and each historical feature vector to obtain a second similar value;
and the calculating submodule 53 is configured to calculate a cross entropy of the first similarity value and the second similarity value, so as to obtain the loss value.
For the specific definition of the speech recognition model training device, reference may be made to the above definition of the speech recognition model training method, which is not described herein again. The modules in the speech recognition model training device can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 6. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a speech recognition model training method.
In one embodiment, a computer device is provided, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the computer program, the method for training a speech recognition model in the above embodiments is implemented.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which, when being executed by a processor, carries out the speech recognition model training method of the above-mentioned embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, databases, or other media used in embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (10)

1. A method for training a speech recognition model, comprising:
acquiring a voice sample set; the set of speech samples comprises a plurality of speech samples;
inputting the voice sample into an initial recognition model containing initial parameters;
carrying out audio enhancement processing on the voice sample through the initial recognition model to obtain an audio clip to be processed;
performing teacher acoustic feature extraction on the audio clip to be processed through a teacher network to obtain a first feature vector, and performing student acoustic feature extraction on the audio clip to be processed through a student network to obtain a second feature vector; wherein the initial recognition model comprises the teacher network and the student network; the student network is obtained after distillation learning is carried out on the teacher network;
aligning and comparing the first feature vector, the second feature vector and a dynamic queue in the teacher network to obtain a loss value;
and when the loss value does not reach a preset convergence condition, iteratively updating the initial parameters of the initial recognition model until the loss value reaches the convergence condition, and recording the initial recognition model after convergence as a trained voice recognition model.
2. The method for training a speech recognition model according to claim 1, wherein before performing teacher acoustic feature extraction on the audio segment to be processed through a teacher network to obtain the first feature vector, the method comprises:
acquiring a pre-training sample set; the pre-training sample set comprises a plurality of pre-training samples; one pre-training sample corresponds to one text label;
inputting the pre-training sample into an initial network containing teacher parameters; the initial network is a model constructed based on Bert;
extracting frequency domain features of the pre-training samples through the initial network by using a Moco training method, coding the extracted frequency domain features to obtain feature vectors to be recognized, and inserting the feature vectors to be recognized into a dynamic queue in the initial network;
performing character prediction according to the feature vector to be recognized and the inserted dynamic queue to obtain a text recognition result corresponding to the pre-training sample;
determining a contrast loss value according to the text label corresponding to the pre-training sample and the text recognition result;
and when the contrast loss value does not reach the pre-training convergence condition, iteratively updating the teacher parameters of the initial network until the contrast loss value reaches the pre-training convergence condition, and recording the converged initial network as the teacher network.
3. The method of training a speech recognition model according to claim 2, wherein said recording the initial network after convergence as a teacher network comprises:
performing interlayer distillation treatment on each layer in the teacher network by using a distillation learning method to obtain a distillation layer;
performing structural splicing on all the distillation layers, and migrating from the teacher network to obtain student parameters in each distillation layer;
constructing the student network based on TinyBert according to all the migrated distillation layers; wherein a hierarchy of the student network is smaller than a hierarchy of the teacher network.
4. The method for training the speech recognition model according to claim 2, wherein the performing text prediction according to the feature vector to be recognized and the inserted dynamic queue to obtain the text recognition result corresponding to the pre-training sample comprises:
performing conversion coding on the feature vector to be recognized to obtain a first coding sequence, and simultaneously performing dot-product coding on the feature vector to be recognized and the inserted dynamic queue to obtain a plurality of second coding sequences;
masking prediction coding is carried out on the first coding sequence and each second coding sequence to obtain a plurality of masking sequences, and the dynamic queue is updated;
and carrying out fine-tuning character decoding on each masking sequence, and carrying out comparison prediction on all the masking sequences after the fine-tuning character decoding to obtain the text recognition result.
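Claim 4 leaves the coding and decoding operators abstract; the fragment below only illustrates the masked-prediction and character-decoding portion in a BERT-like way. The class name, masking probability, and the linear decoder are assumptions made for illustration, not operators defined by the claim.

```python
import torch
import torch.nn as nn

class MaskedCharDecoder(nn.Module):
    """Toy masked-prediction head over a coding sequence (one illustrative reading of claim 4)."""

    def __init__(self, dim: int, vocab_size: int, mask_prob: float = 0.15):
        super().__init__()
        self.mask_prob = mask_prob
        self.mask_token = nn.Parameter(torch.zeros(dim))   # learnable [MASK] embedding
        self.decoder = nn.Linear(dim, vocab_size)          # character decoding layer

    def forward(self, coding_seq: torch.Tensor):
        # coding_seq: (seq_len, dim), e.g. the first coding sequence from claim 4
        mask = torch.rand(coding_seq.size(0), device=coding_seq.device) < self.mask_prob
        masked = coding_seq.clone()
        masked[mask] = self.mask_token                      # the "masked sequence"
        return self.decoder(masked), mask                   # per-position character logits
```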
5. The method for training a speech recognition model according to claim 1, wherein the aligning and comparing the first feature vector, the second feature vector and the dynamic queue in the teacher network to obtain a loss value comprises:
adding the first feature vector into the dynamic queue as a new historical feature vector; wherein the dynamic queue comprises a plurality of the historical feature vectors;
performing inner product processing on the first feature vector and each historical feature vector to obtain a first similarity value, and simultaneously performing inner product processing on the second feature vector and each historical feature vector to obtain a second similarity value;
and calculating the cross entropy of the first similarity value and the second similarity value to obtain the loss value.
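Read literally, claim 5 computes the loss in three steps: enqueue the teacher feature as a new historical feature, take inner products of both features with every historical feature, and take a cross entropy between the two resulting similarity distributions. A minimal sketch, assuming batched, L2-normalised feature vectors and a fixed-length FIFO queue:

```python
import torch
import torch.nn.functional as F

def alignment_loss(first_vec, second_vec, queue):
    """Alignment/comparison loss of claim 5 (illustrative shapes and FIFO policy).

    first_vec  : (batch, dim) teacher feature vectors
    second_vec : (batch, dim) student feature vectors
    queue      : (queue_len, dim) historical feature vectors
    """
    first_vec = first_vec.detach()                     # the teacher branch is not back-propagated

    # step 1: the first feature vector becomes a new historical feature vector
    queue = torch.cat([first_vec, queue], dim=0)[: queue.shape[0]]

    # step 2: inner products with every historical feature vector
    sim_first = first_vec @ queue.t()                  # first similarity values
    sim_second = second_vec @ queue.t()                # second similarity values

    # step 3: cross entropy between the two similarity distributions
    target = sim_first.softmax(dim=-1)
    loss = -(target * F.log_softmax(sim_second, dim=-1)).sum(dim=-1).mean()
    return loss, queue
```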
6. The speech recognition model training method according to any one of claims 1 to 5, wherein after recording the converged initial recognition model as the trained speech recognition model, the method further comprises:
inputting speech to be recognized into the speech recognition model trained by the speech recognition model training method, performing distillation-based frequency domain feature extraction on the speech to be recognized through the student network in the speech recognition model, and performing character prediction according to the extracted frequency domain features to obtain a text to be translated corresponding to the speech to be recognized; wherein the speech to be recognized is obtained from a translation request containing a translation target language;
inputting the text to be translated into a trained translation model corresponding to the translation target language, and performing translation processing through the translation model to obtain a translation text corresponding to the text to be translated.
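To show how the trained model would be used end to end as in claim 6, here is a schematic flow with hypothetical interfaces: `request`, `asr_model.student.extract_features`, `predict_characters`, and the `translators` mapping are all assumed names for illustration, not APIs defined by the patent.

```python
def recognise_and_translate(request, asr_model, translators):
    """Recognition-then-translation flow of claim 6 (hypothetical interfaces)."""
    speech = request["speech"]                         # speech to be recognized
    target_lang = request["target_language"]           # translation target language

    # distillation-based frequency domain features via the student network
    features = asr_model.student.extract_features(speech)
    text_to_translate = asr_model.student.predict_characters(features)

    # trained translation model corresponding to the target language
    translator = translators[target_lang]
    return translator.translate(text_to_translate)
```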
7. A speech recognition model training apparatus, comprising:
the acquisition module is used for acquiring a speech sample set; the speech sample set comprises a plurality of speech samples;
the input module is used for inputting the speech sample into an initial recognition model containing initial parameters;
the enhancement module is used for performing audio enhancement processing on the speech sample through the initial recognition model to obtain an audio clip to be processed;
the extraction module is used for performing teacher acoustic feature extraction on the audio clip to be processed through a teacher network to obtain a first feature vector, and performing student acoustic feature extraction on the audio clip to be processed through a student network to obtain a second feature vector; wherein the initial recognition model comprises the teacher network and the student network, and the student network is obtained by performing distillation learning on the teacher network;
the loss module is used for aligning and comparing the first characteristic vector, the second characteristic vector and the dynamic queue in the teacher network to obtain a loss value;
and the training module is used for iteratively updating the initial parameters of the initial recognition model when the loss value does not reach a preset convergence condition, until the loss value reaches the convergence condition, and recording the converged initial recognition model as a trained speech recognition model.
8. The speech recognition model training apparatus of claim 7, wherein the loss module comprises:
the adding submodule is used for adding the first feature vector into the dynamic queue as a new historical feature vector; wherein the dynamic queue comprises a plurality of historical feature vectors;
the inner product submodule is used for performing inner product processing on the first feature vector and each historical feature vector to obtain a first similarity value, and simultaneously performing inner product processing on the second feature vector and each historical feature vector to obtain a second similarity value;
and the calculating submodule is used for calculating the cross entropy of the first similarity value and the second similarity value to obtain the loss value.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the speech recognition model training method according to any one of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium in which a computer program is stored, wherein the computer program, when executed by a processor, implements the speech recognition model training method according to any one of claims 1 to 6.
CN202111130120.XA 2021-09-26 2021-09-26 Speech recognition model training method, device, equipment and medium Pending CN113870845A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111130120.XA CN113870845A (en) 2021-09-26 2021-09-26 Speech recognition model training method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111130120.XA CN113870845A (en) 2021-09-26 2021-09-26 Speech recognition model training method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN113870845A true CN113870845A (en) 2021-12-31

Family

ID=78994688

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111130120.XA Pending CN113870845A (en) 2021-09-26 2021-09-26 Speech recognition model training method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN113870845A (en)


Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114443891A (en) * 2022-01-14 2022-05-06 北京有竹居网络技术有限公司 Encoder generation method, fingerprint extraction method, medium, and electronic device
CN114443891B (en) * 2022-01-14 2022-12-06 北京有竹居网络技术有限公司 Encoder generation method, fingerprint extraction method, medium, and electronic device
CN114782776A (en) * 2022-04-19 2022-07-22 中国矿业大学 Multi-module knowledge distillation method based on MoCo model
CN114974228A (en) * 2022-05-24 2022-08-30 名日之梦(北京)科技有限公司 Rapid voice recognition method based on hierarchical recognition
CN115662401A (en) * 2022-12-14 2023-01-31 国家电网有限公司客户服务中心 Customer service call voice recognition method based on continuous learning
CN115662401B (en) * 2022-12-14 2023-03-10 国家电网有限公司客户服务中心 Customer service call voice recognition method based on continuous learning
CN116756574A (en) * 2023-08-16 2023-09-15 腾讯科技(深圳)有限公司 Training method, using method, device and equipment of multi-mode pre-training model
CN116756574B (en) * 2023-08-16 2023-11-21 腾讯科技(深圳)有限公司 Training method, using method, device and equipment of multi-mode pre-training model

Similar Documents

Publication Publication Date Title
CN113870845A (en) Speech recognition model training method, device, equipment and medium
CN110377911B (en) Method and device for identifying intention under dialog framework
CN113724695B (en) Electronic medical record generation method, device, equipment and medium based on artificial intelligence
CN111696526B (en) Method for generating voice recognition model, voice recognition method and device
CN107844481B (en) Text recognition error detection method and device
CN109582952B (en) Poetry generation method, poetry generation device, computer equipment and medium
CN112509555B (en) Dialect voice recognition method, device, medium and electronic equipment
CN112712813B (en) Voice processing method, device, equipment and storage medium
CN111402891A (en) Speech recognition method, apparatus, device and storage medium
CN111444382B (en) Audio processing method and device, computer equipment and storage medium
CN110751945A (en) End-to-end voice recognition method
CN110597966A (en) Automatic question answering method and device
CN112885336B (en) Training and recognition method and device of voice recognition system and electronic equipment
CN113254613A (en) Dialogue question-answering method, device, equipment and storage medium
CN112767922A (en) Speech recognition method for contrast predictive coding self-supervision structure joint training
CN113688955A (en) Text recognition method, device, equipment and medium
US11501759B1 (en) Method, system for speech recognition, electronic device and storage medium
CN112365886B (en) Training method and device of speech recognition model and computer equipment
CN114913871A (en) Target object classification method, system, electronic device and storage medium
CN113160823A (en) Voice awakening method and device based on pulse neural network and electronic equipment
CN114333770A (en) Automatic pronunciation assessment method, device, equipment and storage medium
CN113849634A (en) Method for improving interpretability of depth model recommendation scheme
CN116089593B (en) Multi-pass man-machine dialogue method and device based on time sequence feature screening coding module
CN117009471A (en) Interpreted text generation model training method, interpreted text generation method and device thereof
CN116719931A (en) Text classification method and refrigerator

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40062756
Country of ref document: HK