CN111862953B - Training method of voice recognition model, voice recognition method and device


Info

Publication number
CN111862953B
Authority
CN
China
Prior art keywords
voice
feature vector
recognition model
training
model
Prior art date
Legal status
Active
Application number
CN201911240191.8A
Other languages
Chinese (zh)
Other versions
CN111862953A (en)
Inventor
蒋栋蔚
Current Assignee
Beijing Didi Infinity Technology and Development Co Ltd
Original Assignee
Beijing Didi Infinity Technology and Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Didi Infinity Technology and Development Co Ltd filed Critical Beijing Didi Infinity Technology and Development Co Ltd
Priority to CN201911240191.8A
Publication of CN111862953A
Application granted
Publication of CN111862953B

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04: Speech or audio signals analysis-synthesis techniques using predictive techniques
    • G10L19/16: Vocoder architecture
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a training method for a voice recognition model, a voice recognition method, and corresponding devices, in the technical field of voice recognition. The training method comprises: acquiring a voice sample set comprising a plurality of voice sequences; performing frame calculation on the voice sequences in the voice sample set to extract their Fbank feature vectors; and performing downsampling and a mask operation on the Fbank feature vectors to generate mask feature vectors, which are input into a pre-training model to complete the pre-training of the voice recognition model. Because unlabeled voice sequences are used during training, dependence on labeled data is greatly reduced, so that the recognition quality is preserved while the cost of use is also reduced.

Description

Training method of voice recognition model, voice recognition method and device
Technical Field
The present application relates to the field of speech recognition technology, and in particular, to a training method for a speech recognition model, a speech recognition method and a device.
Background
In recent years, with the development of deep learning, voice recognition technology has undergone revolutionary changes, drawing on advances in deep learning to achieve its breakthroughs. However, training a voice recognition model by deep learning generally requires a large amount of expensive labeled data to obtain an accurate recognition effect. This poses a great challenge to industrial voice recognition systems: it not only increases the cost of use but also makes large-scale adoption difficult.
Disclosure of Invention
Accordingly, the present application is directed to a training method for a speech recognition model, a speech recognition method, and corresponding devices, so as to alleviate the above technical problems.
In a first aspect, an embodiment of the present application provides a method for training a speech recognition model, where the speech recognition model is a model of a Transformer structure comprising an encoder and a decoder, the encoder includes a plurality of encoding layers, the decoder includes a plurality of decoding layers, and the encoder and a masked predictive coding (MPC) layer form a pre-training model of the speech recognition model. The method includes: acquiring a voice sample set comprising a plurality of voice sequences, and performing frame calculation on the plurality of voice sequences in the voice sample set to extract Fbank feature vectors of the voice sequences, where the plurality of voice sequences in the voice sample set are unlabeled voice sequences; performing downsampling by a specified multiple on the Fbank feature vector to generate a downsampled feature vector corresponding to the Fbank feature vector; performing a mask operation on the downsampled feature vector to generate a mask feature vector; inputting the mask feature vector into the pre-training model, and outputting a prediction vector corresponding to the Fbank feature vector through the encoder and the MPC layer of the pre-training model; and calculating a loss function between the prediction vector and the Fbank feature vector, adjusting parameters of the encoder in the speech recognition model according to the loss function, and continuing to train the speech recognition model after the parameters are adjusted until the loss function converges to a preset value, thereby completing the pre-training process of the speech recognition model.
In one possible implementation, after the pre-training process is completed, the encoder and the decoder of the speech recognition model form a fine tuning model of the speech recognition model, and the method further includes: and inputting the Fbank feature vector of the voice sequence into the fine tuning model to carry out fine tuning processing on parameters of the voice recognition model.
In one possible implementation manner, the step of performing frame calculation on a plurality of voice sequences in the voice sample set to extract Fbank feature vectors of the voice sequences includes: inputting the unlabeled voice sequence to a preset feature extraction system, and carrying out frame calculation on the unlabeled voice sequence through the feature extraction system to extract Fbank feature vectors of the voice sequence.
In one possible embodiment, the step of performing downsampling by a specified multiple on the Fbank feature vector to generate a downsampled feature vector corresponding to the Fbank feature vector includes: sequentially dividing the frames of the Fbank feature vector into groups of the specified multiple; for each group of frames, randomly selecting a specified number of frames within the group; and combining the frames selected from each group to generate the downsampled feature vector corresponding to the Fbank feature vector, where the specified number is smaller than the specified multiple.
In one possible embodiment, the above specified multiple is 8 times, and the specified number is 1.
In one possible implementation, the step of performing the mask operation on the downsampled feature vector to generate the mask feature vector includes: traversing each frame of the downsampled feature vector; and computing a random number for each frame according to a preset random function, and, if the random number is smaller than a preset random value, setting the vector value corresponding to that frame to 0.
In one possible embodiment, the loss function is an L1 loss function.
In a second aspect, an embodiment of the present application provides a speech recognition method, which is implemented by a speech recognition model, where the speech recognition model is trained according to the method of the first aspect, and the method includes: acquiring voice to be recognized; inputting the voice to be recognized into a trained voice recognition model, and outputting a character recognition result corresponding to the voice to be recognized through the voice recognition model.
In a third aspect, an embodiment of the present application provides a training apparatus for a speech recognition model, where the speech recognition model is a model of a Transformer structure comprising an encoder and a decoder, the encoder includes a plurality of encoding layers, the decoder includes a plurality of decoding layers, and the encoder and a masked predictive coding (MPC) layer form a pre-training model of the speech recognition model. The apparatus includes: a sample acquisition module, configured to acquire a voice sample set comprising a plurality of voice sequences and to perform frame calculation on the voice sequences in the voice sample set to extract Fbank feature vectors of the voice sequences, where the voice sequences in the voice sample set are unlabeled voice sequences; a downsampling module, configured to perform downsampling by a specified multiple on the Fbank feature vector to generate a downsampled feature vector corresponding to the Fbank feature vector; a mask module, configured to perform a mask operation on the downsampled feature vector to generate a mask feature vector; an input module, configured to input the mask feature vector into the pre-training model and to output a prediction vector corresponding to the Fbank feature vector through the encoder and the MPC layer of the pre-training model; and a training module, configured to calculate a loss function between the prediction vector and the Fbank feature vector, adjust parameters of the encoder in the speech recognition model according to the loss function, and continue training the speech recognition model after the parameters are adjusted until the loss function converges to a preset value, thereby completing the pre-training process of the speech recognition model.
In one possible implementation manner, after the pre-training process is completed, the encoder and the decoder of the speech recognition model form a fine tuning model of the speech recognition model, and the apparatus further includes: and the fine tuning module is used for inputting the Fbank feature vector of the voice sequence into the fine tuning model so as to carry out fine tuning processing on the parameters of the voice recognition model.
In one possible embodiment, the sample acquisition module is further configured to: inputting the unlabeled voice sequence to a preset feature extraction system, and carrying out frame calculation on the unlabeled voice sequence through the feature extraction system to extract Fbank feature vectors of the voice sequence.
In one possible implementation, the downsampling module is configured to: sequentially divide the frames of the Fbank feature vector into groups of the specified multiple; for each group of frames, randomly select a specified number of frames within the group; and combine the frames selected from each group to generate the downsampled feature vector corresponding to the Fbank feature vector, where the specified number is smaller than the specified multiple.
In one possible embodiment, the above specified multiple is 8 times, and the specified number is 1.
In one possible implementation, the masking module is configured to: traverse each frame of the downsampled feature vector; and compute a random number for each frame according to a preset random function, setting the vector value corresponding to a frame to 0 if its random number is smaller than the preset random value.
In one possible embodiment, the loss function is an L1 loss function.
In a fourth aspect, an embodiment of the present application provides a speech recognition device, which is implemented by a speech recognition model, where the speech recognition model is trained according to the method of the first aspect, and the device includes: the voice acquisition module is used for acquiring voice to be recognized; the recognition module is used for inputting the voice to be recognized into the trained voice recognition model, and outputting a character recognition result corresponding to the voice to be recognized through the voice recognition model.
In a fifth aspect, an embodiment of the present application provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the methods described in the first and second aspects above when executing the computer program.
In a sixth aspect, embodiments of the present application provide a computer-readable storage medium having a computer program stored thereon, where the computer program, when executed by a processor, performs the steps of the methods of the first and second aspects described above.
The embodiment of the application has the following beneficial effects:
According to the training method for the voice recognition model, the voice recognition method and the devices provided above, the voice sequences in the acquired voice sample set are unlabeled. Frame calculation on the voice sequences in the voice sample set extracts their Fbank feature vectors; the Fbank feature vectors are downsampled, and a mask operation on the resulting downsampled feature vectors generates mask feature vectors. After the mask feature vectors are input into the pre-training model, prediction vectors are obtained; the loss function between the prediction vectors and the Fbank feature vectors is then calculated, and the parameters of the encoder in the voice recognition model are adjusted according to the loss function, realizing the pre-training of the voice recognition model. Because unlabeled voice sequences are used during training, dependence on labeled data is greatly reduced, preserving recognition quality while also reducing the cost of use.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application. The objectives and other advantages of the application will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
In order to make the above objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the description below are some embodiments of the application and that other drawings may be obtained from these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of an architecture of an electronic device according to an embodiment of the present application;
FIG. 2 is a flowchart of a training method of a speech recognition model according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a pre-training model and a fine-tuning model according to an embodiment of the present application;
FIG. 4 is a flowchart of another method for training a speech recognition model according to an embodiment of the present application;
FIG. 5 is a flowchart of a method for speech recognition according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a training device for a speech recognition model according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of another training device for a speech recognition model according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a voice recognition device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the present application will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Voice recognition models are being applied more and more widely; at present they are already used in the online ride-hailing field to recognize calls between drivers and passengers. In these anti-cheating application scenarios, the data used are mainly pre-trip calls between drivers and passengers: a voice recognition model converts the driver-passenger dialogue into text, and a natural language understanding method then extracts the intentions of both parties and the liability determination from the text. The accuracy requirement on the voice recognition model is therefore high.
However, driver-passenger calls are characterized by poor voice quality, which makes labeling difficult and has left very little accumulated labeled data. Existing voice recognition technology relies on a large amount of labeled data, and voice recognition models perform poorly when data are scarce. Yet the ride-hailing scenario places very high demands on recognition accuracy: if voice recognition does not reach a very high level, downstream natural language understanding is strongly affected, ultimately leading to inaccurate liability judgments. Moreover, a large amount of labeled data greatly increases the cost of use, making such systems difficult to popularize at scale.
Based on the above, the embodiment of the application provides a training method for a speech recognition model, a speech recognition method and a device, which alleviate the above technical problems.
In order to facilitate understanding of this embodiment, a method for training a speech recognition model according to an embodiment of the present application is described in detail below.
FIG. 1 shows a schematic diagram of exemplary hardware and software components of an electronic device 100 in which some embodiments of the application may be implemented. The electronic device 100 may be a general purpose computer or a special purpose computer, both of which may be used to implement the training method of the speech recognition model of the present application.
Electronic device 100 may include network port 110, one or more processors 120 for executing program instructions, communication bus 130, and various forms of storage media 140, such as magnetic disk, ROM, or RAM, or any combination thereof, coupled to a network. By way of example, the computer platform may also include program instructions stored in ROM, RAM, or other types of non-transitory storage media, or any combination thereof. The apparatus of the present application may be implemented in accordance with these program instructions. The electronic device 100 also includes an Input/Output (I/O) interface 150 between a computer and other Input/Output devices (e.g., keyboard, display screen).
When the electronic device is operating, the processor 120 communicates with the storage medium 140 via the communication bus 130 and executes machine-readable instructions to implement the steps of the training method of the voice recognition model in the following embodiments. For example, configuration files may be stored in the storage medium 140; after the processor reads the system file of a designated client from the storage medium 140, the training method of the voice recognition model may be executed on the electronic device, thereby carrying out the training process of the voice recognition model described in the following embodiments.
For ease of illustration, only one processor is depicted in the electronic device 100. It should be noted, however, that the electronic device 100 of the present application may also include a plurality of processors, and thus steps performed by one processor described in the present application may also be performed jointly by a plurality of processors or separately. For example, if the processor of the electronic device 100 performs step a and step B, it should be understood that step a and step B may also be performed by two different processors together or performed separately in one processor. For example, the first processor performs step a, the second processor performs step B, or the first processor and the second processor together perform steps a and B.
Based on the above description of the electronic device, an embodiment of the present application first describes a training method of a speech recognition model. Specifically, the speech recognition model is a model of a Transformer structure including an encoder and a decoder; the encoder includes a plurality of encoding layers, the decoder includes a plurality of decoding layers, and the encoder and a masked predictive coding (MPC) layer form the pre-training model of the speech recognition model.
The above model of the Transformer structure is based on the attention mechanism and was proposed on the basis of the earlier LAS (Listen, Attend and Spell) end-to-end architecture. It generally adopts an encoder-decoder structure; specifically, the Transformer end-to-end architecture mainly uses neural-network self-attention mechanisms to implement both the encoder and the decoder. For a trained voice recognition model, after speech is input, the text recognition result corresponding to the speech can be output.
Specifically, fig. 2 shows a flowchart of the training method of the speech recognition model; the method may include the following steps:
Step S202, acquiring a voice sample set comprising a plurality of voice sequences, and performing frame calculation on the plurality of voice sequences in the voice sample set to extract Fbank feature vectors of the voice sequences;
the voice sequences included in the voice sample set in the embodiment of the application are unlabeled voice sequences;
specifically, in the training process, the voice sample set is equivalent to a voice data set, and voice data in the voice data set is unlabeled voice data, so that a large number of open-source Mandarin data sets can be collected to serve as sample data in the data set, and in addition, in order to improve the general applicability of the voice sample set, a large number of driver-by-driver voices can be collected by a voice recording system arranged at a client of a network vehicle in the field of network vehicle.
Further, the Fbank feature vector is a feature vector extracted from speech based on a filter bank algorithm: after the speech is pre-processed into frames, the audio is processed in a manner similar to the human ear, which improves voice recognition performance.
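For concreteness, the following is a minimal sketch of how such Fbank features might be extracted with an off-the-shelf toolkit. The patent does not name a specific extraction system; the 80 mel bins and 25 ms / 10 ms framing below are common defaults assumed for illustration, not values from the text.

```python
# Illustrative only: Fbank extraction with torchaudio's Kaldi-compatible API.
import torchaudio

waveform, sample_rate = torchaudio.load("utterance.wav")  # (channels, samples)

fbank = torchaudio.compliance.kaldi.fbank(
    waveform,
    sample_frequency=sample_rate,
    num_mel_bins=80,    # per-frame feature dimension (assumed)
    frame_length=25.0,  # analysis window, milliseconds
    frame_shift=10.0,   # hop between frames, milliseconds
)
print(fbank.shape)  # (num_frames, 80): one Fbank feature vector per frame
```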
Step S204, performing downsampling processing of a designated multiple on the Fbank feature vector to generate a downsampled feature vector corresponding to the Fbank feature vector;
step S206, performing mask operation on the downsampled feature vector to generate a mask feature vector;
specifically, the downsampling process in step S204 described above is usually an integer multiple of downsampling, and by the downsampling process, the data amount can be appropriately reduced, and the feature vector can be conveniently processed thereafter.
Step S208, inputting the mask feature vector into a pre-training model, and outputting a prediction vector corresponding to the Fbank feature vector through an encoder of the pre-training model and the MPC layer;
specifically, the MPC (Masked Predictive Coding, masking prediction coding) layer is mainly used for predicting Fbank features, in a speech recognition model, an encoder is usually an encoder based on a neural network model, and the MPC layer can reduce and train cognitive redundancy of the encoder in the neural network to reduce redundancy of information and improve recognition accuracy of the speech recognition model.
Step S210, calculating a loss function of the prediction vector and the Fbank feature vector, adjusting parameters of an encoder in the speech recognition model according to the loss function, and continuing to train the speech recognition model after the parameters are adjusted until the loss function converges to a preset value, so as to complete a pre-training process of the speech recognition model.
Specifically, the training in this step amounts to back-propagation training of the speech recognition model; if the loss function converges to the preset value, the training is complete. Alternatively, a number of training iterations can be set, each round of training counting as one iteration: once the iteration count reaches a preset value, the loss function can be regarded as converged and the pre-training of the speech recognition model as completed. The criterion can be chosen according to actual use conditions, and the embodiment of the present application is not limited in this respect.
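A pre-training loop with both stopping criteria might look like the following sketch. The optimizer, learning rate, thresholds, and the assumption that the data loader yields (masked input, target Fbank) pairs are all illustrative choices, not specified by the text.

```python
import torch
import torch.nn.functional as F

def pretrain(model, loader, max_iters=100_000, loss_threshold=0.01, lr=1e-4):
    """Illustrative unsupervised MPC pre-training loop."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    step = 0
    for masked_fbank, target_fbank in loader:   # unlabeled speech only
        pred = model(masked_fbank)              # prediction vector
        loss = F.l1_loss(pred, target_fbank)    # L1 loss against Fbank target
        optimizer.zero_grad()
        loss.backward()                         # back-propagation training
        optimizer.step()
        step += 1
        # stop when the loss converges to a preset value, or treat a fixed
        # iteration budget as convergence, as the text allows
        if loss.item() < loss_threshold or step >= max_iters:
            break
```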
According to the training method of the voice recognition model provided above, the voice sequences in the acquired voice sample set are unlabeled. Frame calculation on the voice sequences in the voice sample set extracts their Fbank feature vectors; the Fbank feature vectors are downsampled, and a mask operation on the resulting downsampled feature vectors generates mask feature vectors. After the mask feature vectors are input into the pre-training model, prediction vectors are obtained; the loss function between the prediction vectors and the Fbank feature vectors is then calculated, and the parameters of the encoder in the voice recognition model are adjusted accordingly, realizing the pre-training of the model. Because unlabeled voice sequences are used during training, dependence on labeled data is greatly reduced, preserving recognition quality while also reducing the cost of use.
In general, the training process shown in fig. 2 is a pre-training process. In actual use, after pre-training is completed, the method further includes a fine-tuning stage of the speech recognition model: the predictive coding layer, i.e. the MPC layer, is removed, and a decoder of the Transformer structure is added to form the fine-tuning model of the speech recognition model. The method therefore further includes: inputting the Fbank feature vector of the voice sequence into the fine-tuning model to fine-tune the parameters of the speech recognition model.
For ease of understanding, fig. 3 shows a schematic diagram of the structure of a pre-training model and a fine-tuning model, as shown in fig. 3, where (a) in fig. 3 is the pre-training model and (b) in fig. 3 is the fine-tuning model.
Specifically, in the pre-training model in fig. 3, the input Fbank features are predicted by the MPC layer. The Fbank features may be extracted by a preset feature extraction system in which a corresponding filter bank algorithm is integrated to carry out the extraction and generate the Fbank feature vectors. The process of extracting Fbank feature vectors of a voice sequence shown in fig. 2 may therefore include: inputting the unlabeled voice sequence into the preset feature extraction system, and performing frame calculation on the unlabeled voice sequence through the feature extraction system to extract the Fbank feature vectors of the voice sequence.
Further, in the fine-tuning model shown in fig. 3 (b), the decoder usually recognizes the voice sequence and outputs the text vector corresponding to it. Specifically, the decoder may be implemented with self-attention: the Speller structure layer of the decoder starts working after receiving the start command, the high-dimensional feature vector corresponding to the voice sequence is input into the self-attention computation, and the text recognition result corresponding to the voice sequence is output.
In a specific implementation of the self-attention mechanism, each operation outputs a text vector corresponding to the voice sequence, and each operation also produces an intermediate state vector; the intermediate state vector and the high-dimensional feature vector output at each step in the decoder serve as the input of the next step of the self-attention mechanism for interactive computation. Meanwhile, each self-attention step outputs corresponding weights α1, α2, α3, …, and through the loss function computation the weights representing the importance of the character vector output at each decoder step at different positions in the high-dimensional feature vector can be obtained. The high-dimensional feature vector is weighted according to the weight values obtained at each step to obtain a weighted high-dimensional feature vector; this weighted vector is taken as the new high-dimensional feature vector H and is input, together with the intermediate state vector, into the next operation of the self-attention mechanism for interaction. The above operations are repeated, and the text vector corresponding to the voice sequence is finally output.
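The weighting of the high-dimensional feature vector can be pictured with a generic attention step, sketched below. This is a textbook dot-product attention shown for illustration only; it does not claim to reproduce the exact decoder computation of the patent.

```python
import torch

def attention_step(decoder_state: torch.Tensor, encoder_feats: torch.Tensor):
    """One illustrative attention step: score each encoder frame against the
    current decoder state, then form the weighted feature vector H."""
    # decoder_state: (d,); encoder_feats: (frames, d)
    scores = encoder_feats @ decoder_state   # one score per frame
    alpha = torch.softmax(scores, dim=0)     # weights α1, α2, α3, ...
    H = alpha @ encoder_feats                # weighted high-dimensional vector
    return H, alpha                          # H feeds the next decoding step
```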
In addition, since the present application uses unlabeled voice sequences, the above pre-training process is equivalent to unsupervised training, and such unsupervised pre-training can bring a large improvement to the downstream voice recognition task. During unsupervised pre-training, the data are downsampled. Specifically, on the basis of fig. 2, fig. 4 shows a flowchart of another training method of a voice recognition model, which describes the training process in detail and includes the following steps:
step S402, a voice sample set comprising a plurality of voice sequences is obtained, and frame calculation is carried out on the plurality of voice sequences in the voice sample set so as to extract Fbank feature vectors of the voice sequences;
step S404, sequentially selecting a group of frame numbers with specified multiples for Fbank feature vectors;
step S406, for each group of frames, randomly selecting a specified number of frames, and combining the selected specified number of frames in each group of frames to generate a downsampled feature vector corresponding to the Fbank feature vector;
specifically, the specified number is smaller than a specified multiple.
Steps S404 and S406 above constitute the downsampling process. In the embodiment of the present application, the specified multiple is 8 and the specified number is 1, so that during unsupervised pre-training the input Fbank feature vector is downsampled by a factor of 8: in the frame sequence corresponding to the Fbank feature vector, seven of every eight frames are randomly dropped and only one frame is retained.
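A minimal sketch of this grouping-and-selection scheme is given below; the function and variable names are illustrative, and trailing frames that do not fill a complete group are simply dropped here, a detail the text does not settle.

```python
import torch

def downsample(fbank: torch.Tensor, multiple: int = 8, keep: int = 1):
    """Illustrative downsampling: from every group of `multiple` consecutive
    frames, retain `keep` randomly chosen frame(s)."""
    groups = fbank.shape[0] // multiple     # fbank: (frames, feat_dim)
    kept = []
    for g in range(groups):
        start = g * multiple
        # randomly pick `keep` frame indices inside this group of frames
        idx = start + torch.randperm(multiple)[:keep].sort().values
        kept.append(fbank[idx])
    return torch.cat(kept, dim=0)           # (groups * keep, feat_dim)
```

With multiple = 8 and keep = 1, each call reproduces the eight-fold downsampling described above.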
Step S408, traversing each frame of the downsampled feature vector, computing a random number for each frame according to a preset random function, and, if the random number is smaller than the preset random value, setting the vector value corresponding to that frame to 0, so as to generate the mask feature vector;
In practical use, the random number generated by the random function is usually a value in the interval [0, 1], and the preset random value is usually 0.15. During the mask operation, each frame of the downsampled voice feature is therefore masked with a probability of 15%, i.e. the value of that frame's voice feature vector is set to 0, before the result is input into the pre-training model in the following steps.
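This per-frame procedure can be sketched as follows; the names are illustrative, and only the traversal, the random draw, and the zeroing come from the text.

```python
import torch

def mask_frames(feats: torch.Tensor, p: float = 0.15) -> torch.Tensor:
    """Illustrative mask operation: zero out each frame with probability p."""
    masked = feats.clone()
    for i in range(masked.shape[0]):   # traverse each frame
        if torch.rand(1).item() < p:   # random number vs. preset random value
            masked[i] = 0.0            # set the frame's feature vector to 0
    return masked
```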
Step S410, inputting the mask feature vector into a pre-training model, and outputting a prediction vector corresponding to the Fbank feature vector through an encoder of the pre-training model and the MPC layer;
step S412, calculating a loss function of the predictive vector and the Fbank feature vector, adjusting parameters of the encoder in the speech recognition model according to the loss function, and continuing to train the speech recognition model after the parameters are adjusted until the loss function converges to a preset value, thereby completing the pre-training process of the speech recognition model.
In practical use, the loss function is generally an L1 loss function. In the embodiment of the present application, the difference between the prediction vector corresponding to the Fbank feature vector and the Fbank feature vector itself is therefore computed through the L1 loss, and convergence is judged accordingly; when the loss converges to the preset value, the pre-training process of the speech recognition model is complete.
The prediction vector is usually a high-dimensional feature vector: the high-dimensional feature vector of the voice sequence is output through the multiple encoding layers of the pre-training model's encoder. Specifically, the encoder usually adopts a twelve-layer structure with a corresponding feature dimension of 512; the encoder structure can also be set according to actual use conditions, and the embodiment of the present application is not limited in this respect.
In practical use, in order to minimize modification of the Transformer-structure model, the embodiment of the present application uses Fbank feature vectors as both the input and the output of the encoder; the Fbank prediction vector output by the encoder has the same dimension as the input Fbank feature vector. Further, after the unsupervised pre-training is completed, the MPC layer is removed and a decoder of the Transformer structure is added to fine-tune on the downstream speech recognition task. During the fine-tuning stage, all parameters of the whole model are trainable end to end, i.e. the process from (a) to (b) in fig. 3.
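The transition from (a) to (b) in fig. 3 might then be sketched as below, reusing the MPCPretrainModel sketched earlier: the MPC head is simply not carried over, a Transformer decoder is added, and every parameter remains trainable end to end. The decoder depth, vocabulary handling, and causal-mask construction are assumptions for illustration.

```python
import torch
import torch.nn as nn

class FineTuneModel(nn.Module):
    """Pre-trained encoder plus a newly added Transformer decoder (sketch)."""

    def __init__(self, pretrained: nn.Module, vocab_size: int,
                 d_model: int = 512, n_dec_layers: int = 6, n_heads: int = 8):
        super().__init__()
        # reuse the pre-trained projection and encoder; the MPC head is dropped
        self.input_proj = pretrained.input_proj
        self.encoder = pretrained.encoder
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_dec_layers)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, fbank: torch.Tensor, tokens: torch.Tensor):
        # fbank: (batch, frames, feat_dim); tokens: (batch, steps) of char ids
        memory = self.encoder(self.input_proj(fbank))
        steps = tokens.size(1)
        causal = torch.triu(torch.full((steps, steps), float("-inf")),
                            diagonal=1)          # no peeking at future chars
        hidden = self.decoder(self.embed(tokens), memory, tgt_mask=causal)
        return self.out(hidden)  # character logits; all parameters trainable
```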
Further, based on the above training method of the speech recognition model, an embodiment of the present application also provides a speech recognition method. The method is implemented by a speech recognition model trained according to the training method of the speech recognition model described above and, as shown in the flowchart of fig. 5, includes the following steps:
step S502, obtaining voice to be recognized;
step S504, inputting the voice to be recognized into the trained voice recognition model, and outputting a character recognition result corresponding to the voice to be recognized through the voice recognition model.
The voice recognition method provided by the embodiment of the present application is realized through the voice recognition model. Since the voice sequences in the voice sample set used to train the model are unlabeled, dependence on labeled data during training is greatly reduced, which preserves the recognition quality of the voice recognition method while also reducing the cost of use.
Based on the same inventive concept, the embodiment of the application also provides a training device of the speech recognition model corresponding to the training method of the speech recognition model, and because the principle of solving the problem of the device in the embodiment of the application is similar to that of the training method of the speech recognition model in the embodiment of the application, the implementation of the device can refer to the implementation of the method, and the repetition is omitted.
Specifically, fig. 6 shows a schematic structural diagram of a training device for a speech recognition model, where the training device includes:
the sample obtaining module 60 is configured to obtain a speech sample set including a plurality of speech sequences, and perform frame computation on the plurality of speech sequences in the speech sample set to extract Fbank feature vectors of the speech sequences, where the plurality of speech sequences included in the speech sample set are unlabeled speech sequences;
the downsampling module 62 is configured to perform downsampling processing of a specified multiple on the Fbank feature vector, and generate a downsampled feature vector corresponding to the Fbank feature vector;
a masking module 64, configured to perform a masking operation on the downsampled feature vector, and generate a masked feature vector;
the input module 66 is configured to input the mask feature vector to the pre-training model, and output a prediction vector corresponding to the Fbank feature vector through an encoder and an MPC layer of the pre-training model;
the training module 68 is configured to calculate a loss function of the prediction vector and the Fbank feature vector, adjust parameters of an encoder in the speech recognition model according to the loss function, and continue training the speech recognition model after the parameters are adjusted until the loss function converges to a preset value, thereby completing a pre-training process of the speech recognition model.
Further, after the pre-training process is completed, the encoder and the decoder of the speech recognition model form a fine tuning model of the speech recognition model, so on the basis of fig. 6, fig. 7 also shows a schematic structural diagram of another training device for the speech recognition model, and the device further includes, in addition to the structure shown in fig. 6:
the fine tuning module 70 is configured to input Fbank feature vectors of the speech sequence to the fine tuning model, so as to perform fine tuning processing on parameters of the speech recognition model.
Further, the sample acquisition module is further configured to: inputting the unlabeled voice sequence to a preset feature extraction system, and carrying out frame calculation on the unlabeled voice sequence through the feature extraction system to extract Fbank feature vectors of the voice sequence.
Further, the downsampling module is configured to: sequentially divide the frames of the Fbank feature vector into groups of the specified multiple; for each group of frames, randomly select a specified number of frames within the group; and combine the frames selected from each group to generate the downsampled feature vector corresponding to the Fbank feature vector, where the specified number is smaller than the specified multiple.
Further, the above specified multiple is 8 times, and the specified number is 1.
Further, the masking module is configured to: traverse each frame of the downsampled feature vector; and compute a random number for each frame according to a preset random function, setting the vector value corresponding to a frame to 0 if its random number is smaller than the preset random value, so as to generate the mask feature vector.
Further, the loss function is an L1 loss function.
Corresponding to the above-mentioned voice recognition method, the embodiment of the present application further provides a voice recognition device, which is implemented by a voice recognition model, where the voice recognition model is trained according to the method shown in fig. 2 or fig. 4, as shown in a schematic structural diagram of a voice recognition device in fig. 8, and the device includes:
a voice acquisition module 80, configured to acquire voice to be recognized;
the recognition module 82 is configured to input the voice to be recognized into the trained voice recognition model, and output a text recognition result corresponding to the voice to be recognized through the voice recognition model.
The device provided by the embodiment of the present application has the same implementation principle and technical effects as those of the foregoing method embodiment, and for the sake of brief description, reference may be made to the corresponding content in the foregoing method embodiment where the device embodiment is not mentioned.
The embodiment of the application also provides a computer readable storage medium, wherein a computer program is stored on the computer readable storage medium, and the computer program executes the steps of the method when being executed by a processor.
The computer program products of the training method of the speech recognition model, the speech recognition method, and the devices provided by the embodiments of the present application include a computer-readable storage medium storing program code; the instructions included in the program code can be used to execute the methods described in the method embodiments above, and specific implementations can be found in the method embodiments and are not repeated here.
In addition, in the description of embodiments of the present application, unless explicitly stated and limited otherwise, the terms "mounted" and "connected" are to be construed broadly: a connection may be fixed, detachable, or integral; it may be mechanical or electrical; and it may be direct, indirect through an intermediate medium, or an internal communication between two elements. The specific meaning of the above terms in the present application will be understood by those skilled in the art in specific cases.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In the description of the present application, it should be noted that the directions or positional relationships indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present application and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present application. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that: the above examples are only specific embodiments of the present application for illustrating the technical solution of the present application, but not for limiting the scope of the present application, and although the present application has been described in detail with reference to the foregoing examples, it will be understood by those skilled in the art that the present application is not limited thereto: any person skilled in the art may modify or easily conceive of the technical solution described in the foregoing embodiments, or perform equivalent substitution of some of the technical features, while remaining within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application. Therefore, the protection scope of the application is subject to the protection scope of the claims.

Claims (12)

1. A method of training a speech recognition model, wherein the speech recognition model is a model of a Transformer structure comprising an encoder and a decoder, the encoder comprising a plurality of encoding layers, the decoder comprising a plurality of decoding layers, and the encoder and a masked predictive coding (MPC) layer forming a pre-training model of the speech recognition model, the method comprising:
acquiring a voice sample set comprising a plurality of voice sequences, and performing frame calculation on the plurality of voice sequences in the voice sample set to extract Fbank feature vectors of the voice sequences, wherein the plurality of voice sequences in the voice sample set are unlabeled voice sequences;
performing downsampling processing of a designated multiple on the Fbank feature vector to generate a downsampled feature vector corresponding to the Fbank feature vector;
performing mask operation on the downsampled feature vector to generate a mask feature vector;
inputting the mask feature vector into the pre-training model, and outputting a prediction vector corresponding to the Fbank feature vector through an encoder and an MPC layer of the pre-training model;
and calculating a loss function of the prediction vector and the Fbank feature vector, adjusting parameters of the encoder in the voice recognition model according to the loss function, and continuing to train the voice recognition model after the parameters are adjusted until the loss function converges to a preset value, so as to finish a pre-training process of the voice recognition model.
2. The method of claim 1, wherein after the pre-training process is completed, the encoder and decoder of the speech recognition model form a fine-tuning model of the speech recognition model, the method further comprising:
and inputting the Fbank feature vector of the voice sequence into the fine tuning model to carry out fine tuning processing on parameters of the voice recognition model.
3. The method of claim 1, wherein the step of performing frame computation on a plurality of the voice sequences in the voice sample set to extract Fbank feature vectors of the voice sequences comprises:
inputting the unlabeled voice sequence to a preset feature extraction system, and carrying out frame calculation on the unlabeled voice sequence through the feature extraction system so as to extract Fbank feature vectors of the voice sequence.
4. The method of claim 1, wherein the step of performing downsampling processing on the Fbank feature vector by a specified multiple to generate a downsampled feature vector corresponding to the Fbank feature vector comprises:
sequentially dividing the frames of the Fbank feature vector into groups of the specified multiple;
for each group of frames, randomly selecting a specified number of frames within the group, and combining the frames selected from each group to generate the downsampled feature vector corresponding to the Fbank feature vector; wherein the specified number is smaller than the specified multiple.
5. The method of claim 4, wherein the specified multiple is 8 times and the specified number is 1.
6. The method of claim 4, wherein masking the downsampled feature vector to generate a masked feature vector comprises:
traversing each frame of the downsampled feature vector;
and respectively calculating random numbers for each frame of the downsampled feature vector according to a preset random function, and setting a vector value corresponding to the frame to be 0 if the random numbers are smaller than a preset random value so as to generate a mask feature vector.
7. The method of claim 1, wherein the loss function is an L1 loss function.
8. A speech recognition method, characterized in that the method is implemented by a speech recognition model, which is trained according to the method of any one of claims 1-7, the method comprising:
acquiring voice to be recognized;
inputting the voice to be recognized into the trained voice recognition model, and outputting a character recognition result corresponding to the voice to be recognized through the voice recognition model.
9. A training device for a speech recognition model, wherein the speech recognition model is a model of a Transformer structure comprising an encoder and a decoder, the encoder comprising a plurality of encoding layers, the decoder comprising a plurality of decoding layers, and the encoder and a masked predictive coding (MPC) layer forming a pre-training model of the speech recognition model, the device comprising:
the system comprises a sample acquisition module, a frame calculation module and a frame calculation module, wherein the sample acquisition module is used for acquiring a voice sample set comprising a plurality of voice sequences, and performing frame calculation on the plurality of voice sequences in the voice sample set to extract Fbank feature vectors of the voice sequences, wherein the plurality of voice sequences in the voice sample set are unlabeled voice sequences;
the downsampling module is used for performing downsampling processing of specified multiples on the Fbank feature vector to generate a downsampled feature vector corresponding to the Fbank feature vector;
the mask module is used for carrying out mask operation on the downsampled feature vector to generate a mask feature vector;
the input module is used for inputting the mask feature vector into the pre-training model and outputting a prediction vector corresponding to the Fbank feature vector through an encoder and an MPC layer of the pre-training model;
the training module is used for calculating a loss function of the prediction vector and the Fbank feature vector, adjusting parameters of the encoder in the voice recognition model according to the loss function, and continuing training the voice recognition model after the parameters are adjusted until the loss function converges to a preset value, so that a pre-training process of the voice recognition model is completed.
10. A speech recognition device, characterized in that the device is realized by a speech recognition model, which is trained according to the method of any one of claims 1-7, the device comprising:
the voice acquisition module is used for acquiring voice to be recognized;
the recognition module is used for inputting the voice to be recognized into the trained voice recognition model, and outputting a character recognition result corresponding to the voice to be recognized through the voice recognition model.
11. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method of any one of claims 1 to 8 when executing the computer program.
12. A computer-readable storage medium, characterized in that it has stored thereon a computer program which, when executed by a processor, performs the steps of the method according to any of claims 1 to 8.
CN201911240191.8A 2019-12-05 2019-12-05 Training method of voice recognition model, voice recognition method and device Active CN111862953B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911240191.8A CN111862953B (en) 2019-12-05 2019-12-05 Training method of voice recognition model, voice recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911240191.8A CN111862953B (en) 2019-12-05 2019-12-05 Training method of voice recognition model, voice recognition method and device

Publications (2)

Publication Number Publication Date
CN111862953A CN111862953A (en) 2020-10-30
CN111862953B true CN111862953B (en) 2023-08-22

Family

ID=72970698

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911240191.8A Active CN111862953B (en) 2019-12-05 2019-12-05 Training method of voice recognition model, voice recognition method and device

Country Status (1)

Country Link
CN (1) CN111862953B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11676571B2 (en) * 2021-01-21 2023-06-13 Qualcomm Incorporated Synthesized speech generation
CN113066477A (en) * 2021-03-03 2021-07-02 北京嘀嘀无限科技发展有限公司 Information interaction method and device and electronic equipment
CN112862507A (en) * 2021-03-15 2021-05-28 深圳前海微众银行股份有限公司 Method, device, equipment, medium and product for preventing network appointment vehicle driver and passenger disputes
CN113345424B (en) * 2021-05-31 2024-02-27 平安科技(深圳)有限公司 Voice feature extraction method, device, equipment and storage medium
CN113380237A (en) * 2021-06-09 2021-09-10 中国科学技术大学 Unsupervised pre-training speech recognition model for enhancing local dependency relationship and training method
CN113409585B (en) * 2021-06-21 2022-07-26 云南驰煦智慧城市建设发展有限公司 Parking charging method, system and readable storage medium
CN113408702B (en) * 2021-06-23 2022-12-27 腾讯音乐娱乐科技(深圳)有限公司 Music neural network model pre-training method, electronic device and storage medium
CN113327600A (en) * 2021-06-30 2021-08-31 北京有竹居网络技术有限公司 Training method, device and equipment of voice recognition model
CN113257238B (en) * 2021-07-13 2021-10-01 北京世纪好未来教育科技有限公司 Training method of pre-training model, coding feature acquisition method and related device
CN113782007A (en) * 2021-09-07 2021-12-10 上海企创信息科技有限公司 Voice recognition method and device, voice recognition equipment and storage medium
CN114882873B (en) * 2022-07-12 2022-09-23 深圳比特微电子科技有限公司 Speech recognition model training method and device and readable storage medium
CN116300430B (en) * 2023-02-14 2023-11-14 成都创科升电子科技有限责任公司 MPC control parameter optimizing method and application thereof in parallel connection platform
CN115862601B (en) * 2023-03-01 2023-05-02 贝壳找房(北京)科技有限公司 Data generation method, electronic device and readable storage medium
CN116705058B (en) * 2023-08-04 2023-10-27 贝壳找房(北京)科技有限公司 Processing method of multimode voice task, electronic equipment and readable storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2830054A1 (en) * 2013-07-22 2015-01-28 Fraunhofer Gesellschaft zur Förderung der angewandten Forschung e.V. Audio encoder, audio decoder and related methods using two-channel processing within an intelligent gap filling framework

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102334160A (en) * 2009-01-28 2012-01-25 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Audio encoder, audio decoder, encoded audio information, methods for encoding and decoding an audio signal and computer program
CN102859583A (en) * 2010-01-12 2013-01-02 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Audio encoder, audio decoder, method for encoding an audio information, method for decoding an audio information and computer program using a modification of a number representation of a numeric previous context value
WO2018107810A1 (en) * 2016-12-15 2018-06-21 Ping An Technology (Shenzhen) Co., Ltd. Voiceprint recognition method and apparatus, and electronic device and medium
CN109313892A (en) * 2017-05-17 2019-02-05 Beijing Didi Infinity Technology and Development Co., Ltd. Robust language identification method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Transformer Based Unsupervised Pre-training for Acoustic Representation Learning; Ruixiong Zhang; IEEE; full text *

Also Published As

Publication number Publication date
CN111862953A (en) 2020-10-30

Similar Documents

Publication Publication Date Title
CN111862953B (en) Training method of voice recognition model, voice recognition method and device
KR102324801B1 (en) End-to-end text-to-speech conversion
CN110648658B (en) Method and device for generating voice recognition model and electronic equipment
CN112735373B (en) Speech synthesis method, device, equipment and storage medium
CN112435656B (en) Model training method, voice recognition method, device, equipment and storage medium
CN111916111B (en) Intelligent voice outbound method and device with emotion, server and storage medium
CN112750419B (en) Speech synthesis method, device, electronic equipment and storage medium
WO2022134894A1 (en) Speech recognition method and apparatus, computer device, and storage medium
CN110288980A (en) Audio recognition method, the training method of model, device, equipment and storage medium
CN110246488B (en) Voice conversion method and device of semi-optimized cycleGAN model
CN110827801A (en) Automatic voice recognition method and system based on artificial intelligence
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN112489629A (en) Voice transcription model, method, medium, and electronic device
CN111401259B (en) Model training method, system, computer readable medium and electronic device
Oyamada et al. Non-native speech conversion with consistency-aware recursive network and generative adversarial network
CN108538283B (en) Method for converting lip image characteristics into voice coding parameters
CN111091809B (en) Regional accent recognition method and device based on depth feature fusion
CN114171002A (en) Voice recognition method and device, electronic equipment and storage medium
CN113782042A (en) Speech synthesis method, vocoder training method, device, equipment and medium
CN111414959B (en) Image recognition method, device, computer readable medium and electronic equipment
CN115424605B (en) Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium
CN116312502A (en) End-to-end stream type voice recognition method and device based on sequential sampling blocking mechanism
CN112767914B (en) Singing voice synthesis method and synthesis equipment, and computer storage medium
CN115080736A (en) Model adjusting method and device of discriminant language model
CN114299989A (en) Voice filtering method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant