CN111862953A - Training method of voice recognition model, voice recognition method and device - Google Patents

Training method of voice recognition model, voice recognition method and device Download PDF

Info

Publication number
CN111862953A
CN111862953A (application number CN201911240191.8A)
Authority
CN
China
Prior art keywords
voice
recognition model
training
feature vector
fbank
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911240191.8A
Other languages
Chinese (zh)
Other versions
CN111862953B (en)
Inventor
蒋栋蔚
Current Assignee
Beijing Didi Infinity Technology and Development Co Ltd
Original Assignee
Beijing Didi Infinity Technology and Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Didi Infinity Technology and Development Co Ltd filed Critical Beijing Didi Infinity Technology and Development Co Ltd
Priority to CN201911240191.8A priority Critical patent/CN111862953B/en
Publication of CN111862953A publication Critical patent/CN111862953A/en
Application granted granted Critical
Publication of CN111862953B publication Critical patent/CN111862953B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04: Speech or audio signals analysis-synthesis techniques using predictive techniques
    • G10L19/16: Vocoder architecture
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a training method for a speech recognition model, a speech recognition method, and a corresponding apparatus, relating to the technical field of speech recognition. The training method comprises the following steps: acquiring a speech sample set comprising a plurality of speech sequences; performing frame-wise computation on the speech sequences in the sample set to extract their Fbank feature vectors; performing down-sampling and a mask operation on the Fbank feature vectors to generate masked feature vectors; and inputting the masked feature vectors into a pre-training model to complete the pre-training of the speech recognition model. Because unlabeled speech sequences are used in the training process, the method greatly reduces the dependence on labeled data when training the speech recognition model, preserving recognition quality while lowering the cost of use.

Description

Training method of voice recognition model, voice recognition method and device
Technical Field
The invention relates to the technical field of voice recognition, in particular to a training method of a voice recognition model, a voice recognition method and a device.
Background
In recent years, the development of deep learning has brought revolutionary changes to speech recognition, and drawing on advances in deep learning is one way to achieve breakthroughs in speech recognition technology. However, training a speech recognition model with deep learning usually requires a large amount of expensive labeled data to obtain accurate recognition results. This poses a great challenge to industrial speech recognition systems: it not only increases the cost of use but also makes large-scale deployment difficult.
Disclosure of Invention
In view of the above, the present invention provides a method for training a speech recognition model, a method for speech recognition, and an apparatus thereof, so as to alleviate the above technical problems.
In a first aspect, an embodiment of the present invention provides a method for training a speech recognition model, where the speech recognition model is a model with a Transformer structure comprising an encoder and a decoder, the encoder includes multiple encoding layers, the decoder includes multiple decoding layers, and the encoder together with a masked predictive coding (MPC) layer constitutes a pre-training model of the speech recognition model. The method includes: acquiring a speech sample set comprising a plurality of speech sequences, and performing frame-wise computation on the speech sequences in the sample set to extract Fbank feature vectors of the speech sequences, where the speech sequences in the sample set are unlabeled; down-sampling each Fbank feature vector by a specified multiple to generate a down-sampled feature vector corresponding to the Fbank feature vector; performing a mask operation on the down-sampled feature vector to generate a masked feature vector; inputting the masked feature vector into the pre-training model, and outputting a prediction vector corresponding to the Fbank feature vector through the encoder and the MPC layer of the pre-training model; and computing a loss function between the prediction vector and the Fbank feature vector, adjusting the parameters of the encoder in the speech recognition model according to the loss function, and continuing to train the model with the adjusted parameters until the loss function converges to a preset value, completing the pre-training process of the speech recognition model.
In a possible implementation, after the pre-training process is completed, the encoder and the decoder of the speech recognition model form a fine-tuning model of the speech recognition model, and the method further includes: inputting the Fbank feature vector of the speech sequence into the fine-tuning model so as to fine-tune the parameters of the speech recognition model.
In a possible implementation, the step of performing frame-wise computation on the speech sequences in the speech sample set to extract their Fbank feature vectors includes: inputting the unlabeled speech sequence into a preset feature extraction system, and performing frame-wise computation on it through the feature extraction system to extract the Fbank feature vector of the speech sequence.
In a possible implementation, the step of down-sampling the Fbank feature vector by a specified multiple to generate the corresponding down-sampled feature vector includes: sequentially partitioning the frames of the Fbank feature vector into groups whose size equals the specified multiple; randomly selecting a specified number of frames from each group; and combining the frames selected from every group to generate the down-sampled feature vector corresponding to the Fbank feature vector, where the specified number is less than the specified multiple.
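The grouped random selection described above can be sketched in plain Python. This is an illustrative sketch, not part of the patent text; the function name is invented, and the defaults of 8 and 1 follow the preferred values stated in the next paragraph:

```python
import random

def downsample(frames, multiple=8, count=1, rng=None):
    """Keep `count` randomly chosen frames from each consecutive group of
    `multiple` frames; `frames` is a list of per-frame feature vectors.
    A trailing partial group (fewer than `multiple` frames) is dropped here,
    which is an assumption the patent does not specify."""
    assert count < multiple, "the specified number must be less than the specified multiple"
    rng = rng or random.Random(0)  # fixed seed for reproducibility of the sketch
    out = []
    for start in range(0, len(frames) - multiple + 1, multiple):
        group = frames[start:start + multiple]
        out.extend(rng.sample(group, count))
    return out

# 32 frames grouped by 8, keeping 1 frame per group, yield 4 frames.
frames = [[float(i)] for i in range(32)]
print(len(downsample(frames)))  # 4
```

Each kept frame comes from its own group, so temporal order is preserved while the sequence length shrinks by the specified multiple.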
In one possible embodiment, the specified multiple is 8 and the specified number is 1.
In a possible implementation, the step of performing the mask operation on the down-sampled feature vector to generate the masked feature vector includes: traversing each frame of the down-sampled feature vector; computing a random number for each frame according to a preset random function; and, if the random number is smaller than a preset threshold, setting the vector value of that frame to 0.
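The per-frame masking rule above can be sketched as follows. The 0.15 masking probability is an illustrative assumption; the patent only says the threshold is preset:

```python
import random

def mask_frames(frames, p_mask=0.15, rng=None):
    """Zero out a frame's whole vector when that frame's random draw
    falls below the preset threshold `p_mask` (assumed value)."""
    rng = rng or random.Random(0)
    masked = []
    for frame in frames:
        if rng.random() < p_mask:           # draw one random number per frame
            masked.append([0.0] * len(frame))  # masked position set to 0
        else:
            masked.append(list(frame))
    return masked

frames = [[1.0, 2.0] for _ in range(50)]
masked = mask_frames(frames, p_mask=0.5)
```

The pre-training objective then asks the model to predict the original features at the zeroed positions.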
In one possible implementation, the loss function is an L1 loss function.
In a second aspect, an embodiment of the present invention provides a speech recognition method implemented by a speech recognition model trained according to the method of the first aspect. The method includes: acquiring speech to be recognized; and inputting the speech to be recognized into the trained speech recognition model, and outputting, through the model, the text recognition result corresponding to the speech.
In a third aspect, an embodiment of the present invention provides a training apparatus for a speech recognition model, where the speech recognition model is a model with a Transformer structure comprising an encoder and a decoder, the encoder includes multiple encoding layers, the decoder includes multiple decoding layers, and the encoder together with a masked predictive coding (MPC) layer constitutes a pre-training model of the speech recognition model. The apparatus includes: a sample acquisition module configured to acquire a speech sample set comprising a plurality of unlabeled speech sequences, and to perform frame-wise computation on the speech sequences in the sample set to extract their Fbank feature vectors; a down-sampling module configured to down-sample each Fbank feature vector by a specified multiple to generate a corresponding down-sampled feature vector; a mask module configured to perform a mask operation on the down-sampled feature vector to generate a masked feature vector; an input module configured to input the masked feature vector into the pre-training model and to output, through the encoder and the MPC layer of the pre-training model, a prediction vector corresponding to the Fbank feature vector; and a training module configured to compute a loss function between the prediction vector and the Fbank feature vector, adjust the parameters of the encoder in the speech recognition model according to the loss function, and continue training the model with the adjusted parameters until the loss function converges to a preset value, completing the pre-training process of the speech recognition model.
In a possible implementation, after the pre-training process is completed, the encoder and the decoder of the speech recognition model form a fine-tuning model of the speech recognition model, and the apparatus further includes: a fine-tuning module configured to input the Fbank feature vector of the speech sequence into the fine-tuning model so as to fine-tune the parameters of the speech recognition model.
In a possible implementation, the sample acquisition module is further configured to: input the unlabeled speech sequence into a preset feature extraction system, and perform frame-wise computation on it through the feature extraction system to extract the Fbank feature vector of the speech sequence.
In a possible implementation, the down-sampling module is configured to: sequentially partition the frames of the Fbank feature vector into groups whose size equals the specified multiple; randomly select a specified number of frames from each group; and combine the frames selected from every group to generate the down-sampled feature vector corresponding to the Fbank feature vector, where the specified number is less than the specified multiple.
In one possible embodiment, the specified multiple is 8 and the specified number is 1.
In a possible implementation, the mask module is configured to: traverse each frame of the down-sampled feature vector; compute a random number for each frame according to a preset random function; and, if the random number is smaller than a preset threshold, set the vector value of that frame to 0.
In one possible implementation, the loss function is an L1 loss function.
In a fourth aspect, an embodiment of the present invention provides a speech recognition apparatus implemented by a speech recognition model trained according to the method of the first aspect. The apparatus includes: a speech acquisition module configured to acquire the speech to be recognized; and a recognition module configured to input the speech to be recognized into the trained speech recognition model and to output, through the model, the text recognition result corresponding to the speech.
In a fifth aspect, an embodiment of the present invention provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the methods according to the first and second aspects when executing the computer program.
In a sixth aspect, the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to perform the steps of the method according to the first and second aspects.
The embodiment of the invention has the following beneficial effects:
The embodiment of the invention provides a training method for a speech recognition model, a speech recognition method, and an apparatus. When the speech recognition model is trained, the speech sequences contained in the acquired speech sample set are unlabeled. Fbank feature vectors of the speech sequences are extracted by performing frame-wise computation on the sequences in the sample set; the Fbank feature vectors are down-sampled, and a mask operation on the resulting down-sampled feature vectors generates masked feature vectors. After the masked feature vectors are input into the pre-training model, prediction vectors are obtained; a loss function between the prediction vectors and the Fbank feature vectors is then computed, and the parameters of the encoder in the speech recognition model are adjusted according to the loss function, realizing the pre-training of the model. Because unlabeled speech sequences are used in the training process, the dependence on labeled data is greatly reduced, preserving recognition quality while lowering the cost of use.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a schematic structural diagram of an electronic device according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for training a speech recognition model according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a pre-training model and a fine-tuning model according to an embodiment of the present invention;
FIG. 4 is a flow chart of another method for training a speech recognition model according to an embodiment of the present invention;
FIG. 5 is a flow chart of a speech recognition method according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a training apparatus for speech recognition models according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of another training apparatus for speech recognition models according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Speech recognition models are increasingly widely applied; at present they are used in the ride-hailing field to recognize conversations between drivers and passengers. Ride-hailing platforms typically generate tens of millions of orders every day, and among these orders it is inevitable that some drivers divert a passenger's order to another company, or demand an offline transaction (trading directly with the passenger to bypass the platform and avoid its service fee). Anti-cheating applications therefore rely mainly on the pre-trip conversation between driver and passenger: the conversation is converted into text by a speech recognition model, and the intent of the driver and passenger is then extracted from the text by natural language understanding and judged. Consequently, the accuracy requirements placed on the speech recognition model keep rising.
However, driver-passenger conversations typically have poor audio quality, which makes labeling difficult, so very little labeled data has accumulated. Existing speech recognition technology depends on large amounts of labeled data, and speech recognition models perform poorly when data is scarce. Yet the ride-hailing scenario demands very high recognition accuracy: if speech recognition accuracy is insufficient, downstream natural language understanding suffers greatly, ultimately leading to inaccurate attribution of responsibility between drivers and passengers. At the same time, acquiring large amounts of labeled data sharply increases the cost of use and makes large-scale deployment difficult.
Based on this, the embodiments of the present invention provide a training method for a speech recognition model, a speech recognition method, and an apparatus, which alleviate the above technical problems.
For the convenience of understanding the present embodiment, a method for training a speech recognition model provided in the embodiments of the present application is described in detail below.
FIG. 1 illustrates a schematic diagram of exemplary hardware and software components of an electronic device 100, which may implement some embodiments of the present application. The electronic device 100 may be a general purpose computer or a special purpose computer, both of which may be used to implement the speech recognition model training method of the present application.
The electronic device 100 may include a network port 110 connected to a network, one or more processors 120 for executing program instructions, a communication bus 130, and a different form of storage medium 140, such as a disk, ROM, or RAM, or any combination thereof. Illustratively, the computer platform may also include program instructions stored in ROM, RAM, or other types of non-transitory storage media, or any combination thereof. The apparatus of the present application may be implemented in accordance with these program instructions. The electronic device 100 also includes an Input/Output (I/O) interface 150 between the computer and other Input/Output devices (e.g., keyboard, display screen).
When the electronic device is in operation, the processor 120 communicates with the storage medium 140 via the communication bus 130, and the processor 120 executes the machine-readable instructions to implement the steps of the training method of the speech recognition model in the following embodiments, for example: the voice recognition model and the pre-training model of the voice recognition model may be configured by connecting to the upper computer through the network port 110, and the configuration files may be stored in the storage medium 140, and after the processor reads the system file of the specified client from the storage medium 140, the training method of the voice recognition model may be executed on the electronic device, thereby performing the training process of the voice recognition model of the following embodiments.
For ease of illustration, only one processor is depicted in electronic device 100. However, it should be noted that the electronic device 100 in the present application may also comprise a plurality of processors, and thus the steps performed by one processor described in the present application may also be performed by a plurality of processors in combination or individually. For example, if the processor of the electronic device 100 executes steps a and B, it should be understood that steps a and B may also be executed by two different processors together, or executed separately in one processor. For example, a first processor performs step a and a second processor performs step B, or the first processor and the second processor perform steps a and B together.
Based on the above description of the electronic device, the embodiment of the present application first describes a training method for a speech recognition model. Specifically, the speech recognition model is a model with a Transformer structure comprising an encoder and a decoder, where the encoder includes multiple encoding layers and the decoder includes multiple decoding layers, and the encoder together with the MPC layer constitutes the pre-training model of the speech recognition model.
The Transformer model builds on the existing LAS (Listen, Attend and Spell) end-to-end architecture and is based on the attention mechanism. It generally adopts an encoder-decoder structure; specifically, the Transformer end-to-end architecture mainly uses the self-attention mechanism of a neural network to implement both the encoder and the decoder. Once trained, the speech recognition model takes speech as input and outputs the corresponding text recognition result.
Specifically, as shown in fig. 2, a flow chart of a training method of a speech recognition model may include the following steps:
Step S202, acquiring a speech sample set comprising a plurality of speech sequences, and performing frame-wise computation on the speech sequences in the sample set to extract Fbank feature vectors of the speech sequences;
the voice sample set comprises a plurality of voice sequences, wherein the plurality of voice sequences are unmarked voice sequences;
specifically, in the training process, the voice sample set is equivalent to a voice data set, and the voice data in the voice data set is voice data without labels, so that a large number of open-source mandarin data sets can be collected as sample data in the data set.
Further, an Fbank feature vector is a feature vector extracted from speech with the Filter Bank algorithm. Typically the speech is first pre-processed into frames, and the audio is then processed in a manner modeled on the human ear, which improves speech recognition performance.
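The frame-wise pre-processing can be illustrated with a plain framing routine. Full Fbank extraction would additionally apply a window function, an FFT, and a mel filter bank to each frame; the 400-sample / 160-sample figures (25 ms windows with a 10 ms hop at 16 kHz) are common defaults, not values stated in the patent:

```python
def split_into_frames(samples, frame_len=400, hop=160):
    """Split a waveform into overlapping frames. Each frame would then be
    windowed, transformed with an FFT, and passed through a mel filter
    bank to yield one Fbank feature vector (steps omitted in this sketch)."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames

# One second of 16 kHz audio yields 98 frames at these settings.
frames = split_into_frames([0.0] * 16000)
```

The per-frame Fbank vectors produced this way form the feature sequence that the down-sampling and mask steps below operate on.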
Step S204, performing down-sampling processing of specified multiple on the Fbank characteristic vector to generate a down-sampling characteristic vector corresponding to the Fbank characteristic vector;
step S206, performing mask operation on the down-sampling feature vector to generate a mask feature vector;
specifically, the down-sampling process in step S204 described above, which is generally an integer multiple of down-sampling, can appropriately reduce the amount of data by the down-sampling process, and facilitate the subsequent processing of the feature vector.
Step S208, inputting the mask feature vector into a pre-training model, and outputting a prediction vector corresponding to the Fbank feature vector through an encoder and an MPC layer of the pre-training model;
specifically, the MPC (Masked Predictive Coding) layer is mainly used for predicting the Fbank features, and in the speech recognition model, the encoder is usually an encoder based on a neural network model, and the MPC layer may perform reduced training on the cognitive redundancy of the encoder in the neural network, so as to reduce the redundancy of information and improve the recognition accuracy of the speech recognition model.
Step S210, calculating a loss function of the prediction vector and the Fbank characteristic vector, adjusting parameters of an encoder in the voice recognition model according to the loss function, continuing to train the voice recognition model after the parameters are adjusted until the loss function converges to a preset value, and finishing a pre-training process of the voice recognition model.
Specifically, this step amounts to back-propagation training of the speech recognition model; if the loss function converges to the preset value, training is complete. Alternatively, a maximum number of training iterations may be set, each training pass counting as one iteration: if the iteration count reaches the preset value, the loss function may likewise be regarded as converged and the pre-training of the speech recognition model as complete. The choice can be made according to actual usage and is not limited in the embodiment of the present invention.
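The loss computation and the two stopping criteria just described can be sketched as follows. The L1 loss matches the possible implementation stated earlier; the threshold and iteration budget are illustrative assumptions:

```python
def l1_loss(pred, target):
    """Mean absolute error between the prediction vector and the
    true Fbank feature vector (the L1 loss of the patent)."""
    assert len(pred) == len(target)
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def should_stop(loss, threshold=0.01, step=0, max_steps=100000):
    """Stop pre-training when the loss has converged below the preset
    value, or when the iteration budget is exhausted (assumed numbers)."""
    return loss <= threshold or step >= max_steps

print(l1_loss([1.0, 2.0], [1.5, 1.5]))  # 0.5
```

In a real training loop the loss would be back-propagated through the MPC layer and encoder each step; here only the scalar bookkeeping is shown.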
According to the training method provided by the embodiment of the invention, when the speech recognition model is trained, Fbank feature vectors of the speech sequences are extracted by performing frame-wise computation on the sequences in the speech sample set; the Fbank feature vectors are down-sampled; a mask operation on the down-sampled feature vectors generates masked feature vectors; the masked feature vectors are input into the pre-training model to obtain prediction vectors; a loss function between the prediction vectors and the Fbank feature vectors is computed; and the parameters of the encoder in the speech recognition model are adjusted according to the loss function, realizing the pre-training of the model. Because unlabeled speech sequences are used in the training process, the dependence on labeled data is greatly reduced, preserving recognition quality while lowering the cost of use.
In general, the training process shown in fig. 2 is the pre-training process. In practical use, after pre-training is complete, the method further includes a fine-tuning stage for the speech recognition model. Specifically, in the fine-tuning stage, the predictive coding (MPC) layer is removed and a Transformer decoder is added to form the fine-tuning model of the speech recognition model, so the method further includes: inputting the Fbank feature vector of the speech sequence into the fine-tuning model so as to fine-tune the parameters of the speech recognition model.
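The relationship between the two stages, the pre-training model (encoder plus MPC head) and the fine-tuning model (the same encoder plus the Transformer decoder), can be sketched as a shell. The class and the lambda stand-ins are purely illustrative, not the patent's implementation:

```python
class SpeechModel:
    """Shows how one shared encoder is paired with an MPC prediction head
    during pre-training, then with the decoder during fine-tuning."""
    def __init__(self, encoder, mpc_head, decoder):
        self.encoder = encoder
        self.mpc_head = mpc_head
        self.decoder = decoder

    def pretrain_forward(self, masked_features):
        # encoder + MPC layer predict the original Fbank features
        return self.mpc_head(self.encoder(masked_features))

    def finetune_forward(self, fbank_features):
        # MPC head dropped; encoder + decoder produce the text output
        return self.decoder(self.encoder(fbank_features))

# Trivial stand-in callables in place of real neural network modules.
model = SpeechModel(encoder=lambda x: x,
                    mpc_head=lambda h: h,
                    decoder=lambda h: len(h))
print(model.finetune_forward([0.1, 0.2, 0.3]))  # 3
```

The point of the design is that the encoder parameters learned without labels in pre-training are reused unchanged when the decoder is attached, which is what reduces the amount of labeled data needed.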
For ease of understanding, fig. 3 shows a schematic structural diagram of a pre-training model and a fine-tuning model, as shown in fig. 3, where (a) in fig. 3 is the pre-training model and (b) in fig. 3 is the fine-tuning model.
Specifically, the pre-training model in fig. 3 predicts the input Fbank features through the MPC layer. The Fbank features may be extracted by a preset feature extraction system in which the corresponding Filter Bank algorithm is integrated to carry out the extraction and generate the Fbank feature vectors. Thus the process of extracting the Fbank feature vectors of a speech sequence shown in fig. 2 may include: inputting the unlabeled speech sequence into the preset feature extraction system, and performing frame-wise computation on it through the system to extract the Fbank feature vector of the speech sequence.
Further, for the fine-tuning model shown in (b) of fig. 3, the decoder recognizes the speech sequence and outputs the text vector corresponding to the speech sequence. Specifically, self-attention may be used to implement the decoder: the Transformer structure layer of the decoder starts to operate after receiving a start operation instruction, the high-dimensional feature vector corresponding to the speech sequence is input into the self-attention mechanism, and the text recognition result corresponding to the speech sequence is output.
In a specific implementation, each step of the self-attention operation outputs a text vector corresponding to the speech sequence, and an intermediate state vector is produced at each step; the intermediate state vector and the high-dimensional feature vector output by the decoder at each step serve as the input of the next step of the self-attention mechanism for interactive operation. Meanwhile, each step of self-attention outputs corresponding weight values α1, α2, α3, …, from which weight values representing the importance, within the high-dimensional feature vector, of the different positions of the character vector output at each step of the decoder can be obtained through the loss function operation. The high-dimensional feature vector is weighted according to the weight value obtained at each step to produce a weighted high-dimensional feature vector, which is taken as the new high-dimensional feature vector H to interact with the intermediate state vector in the next operation of the self-attention mechanism. These operations are repeated, and the text vector corresponding to the speech sequence is finally output.
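The weight-then-sum step described above (score every position of the encoder output against the current state, normalize into weights α, and form a weighted feature vector) is essentially one step of scaled dot-product attention. The following is a hedged numpy sketch of that single step, with illustrative names; it is not the patent's exact decoder:

```python
import numpy as np

def attention_step(query, H):
    """One attention step: score each of the T positions of the encoder
    output H (shape (T, d)) against the current state vector `query`
    (shape (d,)), softmax the scores into weights alpha_1..alpha_T, and
    return the weighted high-dimensional feature vector."""
    d = H.shape[-1]
    scores = H @ query / np.sqrt(d)         # one score per position, (T,)
    alpha = np.exp(scores - scores.max())   # numerically stable softmax
    alpha = alpha / alpha.sum()             # weights sum to 1
    context = alpha @ H                     # weighted feature vector, (d,)
    return context, alpha
```

In a full decoder this step would be repeated once per output character, with the weighted vector interacting with the next intermediate state as described above.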
In addition, since unlabeled speech sequences are used in the present application, the above pre-training process is equivalent to an unsupervised training process. In order to enable the unsupervised pre-training to bring a greater improvement to the downstream speech recognition task, the data is down-sampled during the unsupervised pre-training. Specifically, on the basis of fig. 2, fig. 4 shows a flow chart of another training method for the speech recognition model and describes the training process in detail. The method includes the following steps:
step S402, a voice sample set comprising a plurality of voice sequences is obtained, and frame calculation is carried out on the plurality of voice sequences in the voice sample set so as to extract Fbank characteristic vectors of the voice sequences;
step S404, sequentially selecting groups of frames from the Fbank feature vector, each group containing the specified multiple of frames;
step S406, randomly selecting a specified number of frames from each group, and combining the frames selected from each group to generate the down-sampled feature vector corresponding to the Fbank feature vector;
specifically, the specified number is smaller than the specified multiple.
The processes of step S404 and step S406 constitute the down-sampling processing. In the embodiment of the present invention, the specified multiple is 8 and the specified number is 1, so that during unsupervised pre-training the input Fbank feature vector is down-sampled by a factor of 8; that is, for the frame sequence corresponding to the Fbank feature vector, seven frames out of every eight are randomly dropped and only one frame is kept.
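Steps S404 and S406 can be sketched as follows. This is a minimal numpy illustration of the described grouping and random selection (multiple = 8, number kept = 1), with illustrative function and parameter names rather than the patent's implementation:

```python
import numpy as np

def downsample(fbank_feats, multiple=8, keep=1, rng=None):
    """Split the frame sequence into consecutive groups of `multiple`
    frames (step S404) and randomly keep `keep` frame(s) from each group,
    combining the kept frames into the down-sampled feature vector
    (step S406). With multiple=8 and keep=1, seven of every eight frames
    are randomly dropped."""
    rng = rng or np.random.default_rng()
    n_groups = len(fbank_feats) // multiple    # any ragged tail is dropped
    kept = []
    for g in range(n_groups):
        group = fbank_feats[g * multiple : (g + 1) * multiple]
        idx = rng.choice(multiple, size=keep, replace=False)
        kept.append(group[np.sort(idx)])
    return np.concatenate(kept)                # (n_groups * keep, dim)
```

For 16 input frames with multiple=8 and keep=1, the output contains exactly 2 frames, one drawn from each consecutive group of 8.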
Step S408, traversing each frame of the down-sampled feature vector, calculating a random number for each frame according to a preset random function, and, if the random number is smaller than a preset random value, setting the vector value corresponding to that frame to 0 so as to generate the mask feature vector;
in practical use, the random number generated by the random function is usually a value in the interval 0-100%, and the preset random value is usually 15%, so that when performing the mask calculation, the down-sampled speech feature of each frame is masked with a probability of 15%; that is, the value of the speech feature vector of that frame is set to 0. The masked feature vector is then input into the pre-training model according to the following steps.
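The masking of step S408 can be sketched as follows: draw a random number in [0, 1) per frame and zero the frame out when it falls below the preset value (15% here). A minimal numpy illustration with assumed names, not the patent's code:

```python
import numpy as np

def mask_features(downsampled, p=0.15, rng=None):
    """Step S408 sketch: for each frame of the down-sampled feature
    vector, draw a random number; if it is below the preset random value
    p, set that frame's vector values to 0, producing the mask feature
    vector fed to the pre-training model."""
    rng = rng or np.random.default_rng()
    masked = downsampled.copy()
    hit = rng.random(len(masked)) < p     # True means: mask this frame
    masked[hit] = 0.0
    return masked, hit
```

On average about 15% of the frames are zeroed; the returned `hit` array records which frames the MPC layer must reconstruct.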
Step S410, inputting the mask feature vector into a pre-training model, and outputting a prediction vector corresponding to the Fbank feature vector through an encoder and an MPC layer of the pre-training model;
step S412, calculating a loss function of the prediction vector and the Fbank feature vector, adjusting the parameters of the encoder in the speech recognition model according to the loss function, and continuing to train the speech recognition model after the parameters are adjusted until the loss function converges to a preset value, thereby completing the pre-training process of the speech recognition model.
In actual use, the loss function is usually an L1 loss function. Therefore, in the embodiment of the present invention, the difference between the prediction vector corresponding to the Fbank feature vector and the Fbank feature vector itself is calculated through the L1 loss function to determine whether training has converged; when the loss converges to the preset value, the pre-training process of the speech recognition model is completed.
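The L1 loss of step S412 is simply the mean absolute difference between the prediction and the original Fbank features. A one-line numpy sketch (the reduction to a mean is an assumption; the patent only specifies an L1 loss):

```python
import numpy as np

def l1_loss(pred, target):
    """L1 loss between the prediction vector output by the pre-training
    model and the original Fbank feature vector; pre-training stops once
    this value converges to the preset threshold."""
    return np.mean(np.abs(pred - target))
```

For example, predictions [[1, 2], [3, 4]] against targets [[1.5, 2], [2, 4]] give absolute differences [0.5, 0, 1, 0] and hence a loss of 0.375.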
The prediction vector is usually a high-dimensional feature vector; that is, the high-dimensional feature vector of the speech sequence is output through the multiple encoding layers of the encoder of the pre-training model. Specifically, the encoding structure of the encoder usually adopts twelve layers, and the corresponding feature dimension is 512.
In practical applications, in order to reduce modifications to the Transformer-structured model, in the embodiment of the present invention Fbank feature vectors are used as both the input and the output of the encoder; accordingly, the dimensionality of the Fbank prediction vector output by the encoder is the same as that of the input Fbank feature vector. Further, after the unsupervised pre-training is completed, the MPC layer is removed and a Transformer-structured decoder is added to fine-tune the model for the downstream speech recognition task. In the fine-tuning phase, all parameters of the entire model are trainable end to end, i.e., the process from (a) to (b) in fig. 3.
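The transition from (a) to (b) in fig. 3 can be summarized in a few lines: the pre-training model and the fine-tuning model share the same encoder object, so the pre-trained encoder parameters carry over when the MPC layer is swapped for a decoder. The following is a deliberately simplified structural sketch (the class and dict layout are illustrative, not the patent's implementation):

```python
class Encoder:
    """Stand-in for the twelve-layer, 512-dimensional Transformer encoder;
    only its identity (shared parameters) matters in this sketch."""
    def __init__(self):
        self.config = {"layers": 12, "dim": 512}

def build_pretrain_model(encoder):
    # (a) pre-training: encoder topped with the MPC prediction layer
    return {"encoder": encoder, "head": "MPC"}

def build_finetune_model(pretrained):
    # (b) fine-tuning: remove the MPC layer, keep the SAME encoder, and
    # add a Transformer-structured decoder; all parameters stay trainable
    return {"encoder": pretrained["encoder"], "head": "decoder"}
```

The key point the sketch demonstrates is that the fine-tuning model's encoder is the very object trained during unsupervised pre-training, so no encoder parameters are discarded.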
Further, based on the above training method for the speech recognition model, an embodiment of the present invention further provides a speech recognition method, where the method is implemented by a speech recognition model, the speech recognition model is obtained by training according to the above training method for the speech recognition model, and as shown in fig. 5, the method includes the following steps:
Step S502, obtaining the voice to be recognized;
step S504, the speech to be recognized is input into the trained speech recognition model, and the character recognition result corresponding to the speech to be recognized is output through the speech recognition model.
The speech recognition method provided by the embodiment of the invention is implemented through the speech recognition model. When the speech recognition model is trained, the plurality of speech sequences included in the acquired speech sample set are unlabeled speech sequences, so the dependence on labeled data in training the speech recognition model is greatly reduced; the recognition effect of the speech recognition method is ensured while the use cost is also reduced.
Based on the same inventive concept, the embodiment of the present application further provides a training apparatus for a speech recognition model corresponding to the training method for the speech recognition model, and because the principle of the apparatus in the embodiment of the present application for solving the problem is similar to the training method for the speech recognition model in the embodiment of the present application, the implementation of the apparatus can refer to the implementation of the method, and repeated details are not repeated.
Specifically, fig. 6 shows a schematic structural diagram of a training apparatus for a speech recognition model, which includes:
A sample obtaining module 60, configured to obtain a speech sample set including a plurality of speech sequences, and perform frame calculation on the plurality of speech sequences in the speech sample set to extract Fbank feature vectors of the speech sequences, where the plurality of speech sequences included in the speech sample set are unmarked speech sequences;
the down-sampling module 62 is configured to perform down-sampling processing on the Fbank feature vector by a specified multiple, and generate a down-sampling feature vector corresponding to the Fbank feature vector;
a mask module 64, configured to perform a mask operation on the downsampling feature vector to generate a mask feature vector;
the input module 66 is configured to input the mask feature vector to a pre-training model, and output a prediction vector corresponding to the Fbank feature vector through an encoder and an MPC layer of the pre-training model;
and the training module 68 is configured to calculate a loss function of the prediction vector and the Fbank feature vector, adjust parameters of an encoder in the speech recognition model according to the loss function, continue training the speech recognition model after the parameters are adjusted until the loss function converges to a preset value, and complete a pre-training process of the speech recognition model.
Further, after the pre-training process is completed, the encoder and the decoder of the speech recognition model form a fine-tuning model of the speech recognition model, so on the basis of fig. 6, fig. 7 further shows a schematic structural diagram of another training apparatus of the speech recognition model, and in addition to the structure shown in fig. 6, the apparatus further includes:
And the fine tuning module 70 is configured to input the Fbank feature vector of the voice sequence into the fine tuning model, so as to perform fine tuning processing on the parameter of the voice recognition model.
Further, the sample acquiring module is further configured to: and inputting the unmarked voice sequence into a preset feature extraction system, and performing frame calculation on the unmarked voice sequence through the feature extraction system to extract the Fbank feature vector of the voice sequence.
Further, the down-sampling module is configured to: sequentially selecting a group of frame numbers of specified multiples from the Fbank characteristic vector; randomly selecting a specified number of frames from each group of frame numbers, and combining the selected specified number of frames from each group of frame numbers to generate a down-sampling feature vector corresponding to the Fbank feature vector; wherein the designated number is less than the designated multiple.
Further, the specified multiple is 8 and the specified number is 1.
Further, the mask module is configured to: traversing each frame of the downsampled feature vector; and respectively calculating a random number for each frame of the down-sampling feature vector according to a preset random function, and if the random number is smaller than a preset random value, setting a vector value corresponding to the frame to be 0 so as to generate a mask feature vector.
Further, the loss function is an L1 loss function.
Corresponding to the above speech recognition method, an embodiment of the present invention further provides a speech recognition apparatus, which is implemented by a speech recognition model, where the speech recognition model is obtained by training according to the method shown in fig. 2 or fig. 4, and as shown in fig. 8, the apparatus includes:
a voice obtaining module 80, configured to obtain a voice to be recognized;
and the recognition module 82 is configured to input the speech to be recognized into the trained speech recognition model, and output a character recognition result corresponding to the speech to be recognized through the speech recognition model.
The device provided by the embodiment of the present application has the same implementation principle and technical effect as the foregoing method embodiments; for the sake of brevity, where the device embodiment omits details, reference may be made to the corresponding contents in the foregoing method embodiments.
An embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the computer program performs the steps of the method.
The computer program product of the training method for a speech recognition model, the speech recognition method, and the apparatus provided in the embodiments of the present invention includes a computer-readable storage medium storing program code; the instructions included in the program code may be used to execute the methods described in the foregoing method embodiments. For specific implementations, reference may be made to the method embodiments, which are not repeated here.
In addition, in the description of the embodiments of the present invention, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "coupled" are to be construed broadly, e.g., as a fixed connection, a removable connection, or an integral connection; as a mechanical or electrical connection; as a direct connection or an indirect connection through an intermediate medium; or as an internal communication between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to the specific circumstances.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that the foregoing embodiments are merely illustrative of the technical solutions of the present invention and not restrictive of them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be equivalently replaced, within the technical scope of the present disclosure; such modifications, changes, or substitutions do not depart from the spirit and scope of the embodiments of the present invention and should be construed as being included therein. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (12)

1. A method for training a speech recognition model, wherein the speech recognition model is a Transformer-structured model, and comprises an encoder and a decoder, the encoder comprises a plurality of encoding layers, the decoder comprises a plurality of decoding layers, and the encoder and a Masked Predictive Coding (MPC) layer form a pre-training model of the speech recognition model, the method comprises:
acquiring a voice sample set comprising a plurality of voice sequences, and performing frame calculation on the plurality of voice sequences in the voice sample set to extract Fbank feature vectors of the voice sequences, wherein the plurality of voice sequences in the voice sample set are unmarked voice sequences;
performing down-sampling processing on the Fbank characteristic vector by a designated multiple to generate a down-sampling characteristic vector corresponding to the Fbank characteristic vector;
performing mask operation on the down-sampling feature vector to generate a mask feature vector;
inputting the mask feature vector into the pre-training model, and outputting a prediction vector corresponding to the Fbank feature vector through an encoder and an MPC layer of the pre-training model;
calculating a loss function of the prediction vector and the Fbank characteristic vector, adjusting parameters of the encoder in the voice recognition model according to the loss function, continuing to train the voice recognition model after the parameters are adjusted until the loss function converges to a preset value, and finishing a pre-training process of the voice recognition model.
2. The method of claim 1, wherein the encoder and decoder of the speech recognition model form a fine-tuning model of the speech recognition model after the pre-training process is completed, the method further comprising:
and inputting the Fbank characteristic vector of the voice sequence into the fine tuning model so as to perform fine tuning processing on the parameters of the voice recognition model.
3. The method according to claim 1, wherein said step of performing frame computation on a plurality of said speech sequences in said set of speech samples to extract Fbank feature vectors of said speech sequences comprises:
and inputting the unlabeled voice sequence into a preset feature extraction system, and performing frame calculation on the unlabeled voice sequence through the feature extraction system to extract the Fbank feature vector of the voice sequence.
4. The method according to claim 1, wherein the step of performing down-sampling processing on the Fbank feature vector by a specified multiple to generate a down-sampled feature vector corresponding to the Fbank feature vector comprises:
sequentially selecting a group of frame numbers of specified multiples from the Fbank characteristic vector;
Randomly selecting a specified number of frames from each group of frames, and combining the specified number of frames selected from each group of frames to generate a down-sampling feature vector corresponding to the Fbank feature vector; wherein the specified number is less than the specified multiple.
5. The method of claim 4, wherein the specified multiple is 8 and the specified number is 1.
6. The method of claim 4, wherein masking the downsampled feature vector to generate a masked feature vector comprises:
traversing each frame of the downsampled feature vector;
and respectively calculating a random number for each frame of the down-sampling feature vector according to a preset random function, and if the random number is smaller than a preset random value, setting a vector value corresponding to the frame to be 0 so as to generate a mask feature vector.
7. The method of claim 1, wherein the loss function is an L1 loss function.
8. A speech recognition method, characterized in that the method is implemented by a speech recognition model, the speech recognition model is obtained by training according to the method of any one of claims 1 to 7, and the method comprises:
Acquiring a voice to be recognized;
and inputting the voice to be recognized into the trained voice recognition model, and outputting a character recognition result corresponding to the voice to be recognized through the voice recognition model.
9. An apparatus for training a speech recognition model, wherein the speech recognition model is a Transformer-structured model and comprises an encoder and a decoder, the encoder comprises a plurality of encoding layers, the decoder comprises a plurality of decoding layers, and the encoder and a Masked Predictive Coding (MPC) layer constitute a pre-training model of the speech recognition model, the apparatus comprising:
the system comprises a sample acquisition module, a frame calculation module and a comparison module, wherein the sample acquisition module is used for acquiring a voice sample set comprising a plurality of voice sequences, and performing frame calculation on the plurality of voice sequences in the voice sample set to extract Fbank feature vectors of the voice sequences, and the plurality of voice sequences in the voice sample set are unmarked voice sequences;
the down-sampling module is used for performing down-sampling processing on the Fbank characteristic vector by a specified multiple to generate a down-sampling characteristic vector corresponding to the Fbank characteristic vector;
the mask module is used for performing mask operation on the down-sampling feature vector to generate a mask feature vector;
The input module is used for inputting the mask feature vector into the pre-training model and outputting a prediction vector corresponding to the Fbank feature vector through an encoder and an MPC layer of the pre-training model;
and the training module is used for calculating a loss function of the prediction vector and the Fbank characteristic vector, adjusting parameters of the encoder in the voice recognition model according to the loss function, continuing to train the voice recognition model after the parameters are adjusted until the loss function converges to a preset value, and finishing the pre-training process of the voice recognition model.
10. A speech recognition device, characterized in that the device is realized by a speech recognition model, the speech recognition model is obtained by training according to the method of any one of claims 1-7, and the device comprises:
the voice acquisition module is used for acquiring the voice to be recognized;
and the recognition module is used for inputting the voice to be recognized to the trained voice recognition model and outputting a character recognition result corresponding to the voice to be recognized through the voice recognition model.
11. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the method of any of the preceding claims 1-8 are implemented when the computer program is executed by the processor.
12. A computer-readable storage medium, having stored thereon a computer program which, when being executed by a processor, is adapted to carry out the steps of the method according to any one of claims 1 to 8.
CN201911240191.8A 2019-12-05 2019-12-05 Training method of voice recognition model, voice recognition method and device Active CN111862953B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911240191.8A CN111862953B (en) 2019-12-05 2019-12-05 Training method of voice recognition model, voice recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911240191.8A CN111862953B (en) 2019-12-05 2019-12-05 Training method of voice recognition model, voice recognition method and device

Publications (2)

Publication Number Publication Date
CN111862953A true CN111862953A (en) 2020-10-30
CN111862953B CN111862953B (en) 2023-08-22

Family

ID=72970698

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911240191.8A Active CN111862953B (en) 2019-12-05 2019-12-05 Training method of voice recognition model, voice recognition method and device

Country Status (1)

Country Link
CN (1) CN111862953B (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112862507A (en) * 2021-03-15 2021-05-28 深圳前海微众银行股份有限公司 Method, device, equipment, medium and product for preventing network appointment vehicle driver and passenger disputes
CN113066477A (en) * 2021-03-03 2021-07-02 北京嘀嘀无限科技发展有限公司 Information interaction method and device and electronic equipment
CN113257238A (en) * 2021-07-13 2021-08-13 北京世纪好未来教育科技有限公司 Training method of pre-training model, coding feature acquisition method and related device
CN113327600A (en) * 2021-06-30 2021-08-31 北京有竹居网络技术有限公司 Training method, device and equipment of voice recognition model
CN113345424A (en) * 2021-05-31 2021-09-03 平安科技(深圳)有限公司 Voice feature extraction method, device, equipment and storage medium
CN113380237A (en) * 2021-06-09 2021-09-10 中国科学技术大学 Unsupervised pre-training speech recognition model for enhancing local dependency relationship and training method
CN113409585A (en) * 2021-06-21 2021-09-17 云南驰煦智慧城市建设发展有限公司 Parking charging method, system and readable storage medium
CN113408702A (en) * 2021-06-23 2021-09-17 腾讯音乐娱乐科技(深圳)有限公司 Music neural network model pre-training method, electronic device and storage medium
CN113782007A (en) * 2021-09-07 2021-12-10 上海企创信息科技有限公司 Voice recognition method and device, voice recognition equipment and storage medium
US20220230623A1 (en) * 2021-01-21 2022-07-21 Qualcomm Incorporated Synthesized speech generation
CN114882873A (en) * 2022-07-12 2022-08-09 深圳比特微电子科技有限公司 Speech recognition model training method and device and readable storage medium
CN115862601A (en) * 2023-03-01 2023-03-28 贝壳找房(北京)科技有限公司 Data generation method, electronic device and readable storage medium
CN116300430A (en) * 2023-02-14 2023-06-23 成都创科升电子科技有限责任公司 MPC control parameter optimizing method and application thereof in parallel connection platform
CN116705058A (en) * 2023-08-04 2023-09-05 贝壳找房(北京)科技有限公司 Processing method of multimode voice task, electronic equipment and readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102334160A (en) * 2009-01-28 2012-01-25 弗劳恩霍夫应用研究促进协会 Audio encoder, audio decoder, encoded audio information, methods for encoding and decoding an audio signal and computer program
CN102859583A (en) * 2010-01-12 2013-01-02 弗劳恩霍弗实用研究促进协会 Audio encoder, audio decoder, method for encoding and audio information, method for decoding an audio information and computer program using a modification of a number representation of a numeric previous context value
US20150287417A1 (en) * 2013-07-22 2015-10-08 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Apparatus and method for encoding and decoding an encoded audio signal using temporal noise/patch shaping
WO2018107810A1 (en) * 2016-12-15 2018-06-21 平安科技(深圳)有限公司 Voiceprint recognition method and apparatus, and electronic device and medium
CN109313892A (en) * 2017-05-17 2019-02-05 北京嘀嘀无限科技发展有限公司 Steady language identification method and system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MATTIA A. DI GANGI等: "Enhancing Transformer for End-to-end Speech-to-Text Translation", PROCEEDINGS OF MT SUMMIT XVII *
RUIXIONG ZHANG: "TRANSFORMER BASED UNSUPERVISED PRE-TRAINING FOR ACOUSTIC REPRESENTATION LEARNING", IEEE *
范汝超: "端到端的语音识别研究", 中国优秀博硕士学位论文全文数据库(硕士) *

Also Published As

Publication number Publication date
CN111862953B (en) 2023-08-22

Similar Documents

Publication Publication Date Title
CN111862953A (en) Training method of voice recognition model, voice recognition method and device
KR102324801B1 (en) End-to-end text-to-speech conversion
CN110648658B (en) Method and device for generating voice recognition model and electronic equipment
CN109891434B (en) Generating audio using neural networks
CN112435656B (en) Model training method, voice recognition method, device, equipment and storage medium
CN110288980A (en) Speech recognition method, model training method, apparatus, device and storage medium
WO2022134894A1 (en) Speech recognition method and apparatus, computer device, and storage medium
Bai et al. Listen attentively, and spell once: Whole sentence generation via a non-autoregressive architecture for low-latency speech recognition
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN111460833A (en) Text generation method, device and equipment
CN113609965B (en) Training method and device of character recognition model, storage medium and electronic equipment
CN111429887A (en) End-to-end speech keyword recognition method, device and equipment
CN113539273B (en) Voice recognition method and device, computer equipment and storage medium
CN112509555A (en) Dialect voice recognition method, dialect voice recognition device, dialect voice recognition medium and electronic equipment
CN110297909A (en) Classification method and device for unlabeled corpus
CN113961736A (en) Method and device for generating image by text, computer equipment and storage medium
CN115964638A (en) Multi-mode social data emotion classification method, system, terminal, equipment and application
CN111653270A (en) Voice processing method and device, computer readable storage medium and electronic equipment
CN113782042B (en) Speech synthesis method, vocoder training method, device, equipment and medium
Reza et al. A customized residual neural network and bi-directional gated recurrent unit-based automatic speech recognition model
Beckmann et al. Word-level embeddings for cross-task transfer learning in speech processing
CN111091809B (en) Regional accent recognition method and device based on deep feature fusion
CN112530401A (en) Voice synthesis method, system and device
CN116741155A (en) Speech recognition method, training method, device and equipment of speech recognition model
CN111477212B (en) Content identification, model training and data processing method, system and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant