CN111951789B - Training of speech recognition model, speech recognition method, apparatus, device and medium

Info

Publication number
CN111951789B
CN111951789B · Application CN202010821094.4A
Authority
CN
China
Prior art keywords
recognition model
training
speech recognition
data
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010821094.4A
Other languages
Chinese (zh)
Other versions
CN111951789A (en)
Inventor
李�杰
王晓瑞
李岩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202010821094.4A
Publication of CN111951789A
Application granted
Publication of CN111951789B
Status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of the present disclosure relate to a method, an apparatus, a device and a medium for training a speech recognition model and for speech recognition. The training method of the speech recognition model comprises the following steps: acquiring first speech data; inputting the first speech data into a first speech recognition model, and acquiring at least one piece of first text data output by the first speech recognition model; recognizing second text data from the first text data according to a preset grammar rule, and generating a first speech recognition sample from the second text data and the first speech data; acquiring a second speech recognition sample; and inputting the first speech recognition sample and the second speech recognition sample into the first speech recognition model, and continuing to train the first speech recognition model to generate a second speech recognition model. The embodiments of the present disclosure can improve the efficiency of training data generation, accelerate the training of the speech recognition model, and improve the recognition accuracy of the speech recognition model.

Description

Training of speech recognition model, speech recognition method, apparatus, device and medium
Technical Field
The present disclosure relates to the field of data processing, and in particular, to methods, apparatuses, devices, and media for training speech recognition models and for performing speech recognition.
Background
In the related art, speech data can be converted into text data by an end-to-end model, which simplifies both the sequence conversion operation and the training process.
A sequence may be text, speech, image, or video sequence data. For example, when the end-to-end model is a speech recognition model, the training data consists of speech and text pairs: a large amount of speech is collected and correspondingly transcribed into text, forming speech and text pairs with which the model is trained.
In this approach, ensuring an accurate mapping between speech and text and improving recognition accuracy in unknown speech domains, that is, improving the generalization capability of the model, requires a large number of speech and text pairs for training, which in turn consumes a large amount of time and labor.
Disclosure of Invention
The present disclosure provides a method, an apparatus, a device and a medium for training a speech recognition model, and a method, an apparatus, a device and a medium for speech recognition, so as to at least solve the problem of low training efficiency of the speech recognition model in the related art. The technical scheme of the disclosure is as follows:
according to a first aspect of the embodiments of the present disclosure, there is provided a training method of a speech recognition model, including:
acquiring first voice data; inputting the first voice data into a first voice recognition model, and acquiring at least one first text data output by the first voice recognition model;
recognizing, according to a preset grammar rule, second text data from each first text data, and generating a first voice recognition sample from the second text data and the first voice data;
acquiring a second voice recognition sample, wherein the second voice recognition sample comprises second voice data and third text data, and the semantics of the second voice data are the same as the semantics of the third text data;
and inputting the first voice recognition sample and the second voice recognition sample into the first voice recognition model, and continuing training the first voice recognition model to generate a second voice recognition model.
Optionally, the recognizing, according to a preset grammar rule, second text data from each of the first text data includes:
extracting matched grammatical features from each first text data, and calculating the grammatical priority of each first text data according to the matched grammatical features;
and comparing the grammar priorities of the first text data to obtain the first text data with the highest grammar priority as second text data.
Optionally, the extracting the matched grammatical feature from the first text data, and calculating the grammatical priority of the first text data according to the matched grammatical feature includes:
inputting the first text data into a pre-trained grammar priority calculation model, wherein the grammar priority calculation model is a bidirectional encoder representation model of an encoder and decoder structure based on an attention mechanism;
in the grammar priority calculation model, deleting at least one text unit in the first text data to form at least two text segments, wherein the ratio of the total number of words of each text unit to the total number of words of the first text data is a set ratio;
respectively acquiring a first text segment in front of each text unit and generating a first prediction result of each text unit;
respectively acquiring second text segments behind the text units and generating second prediction results of the text units;
generating a target prediction result of each text unit according to each first prediction result and each second prediction result;
combining the target prediction result with each text segment to generate grammar prediction data;
calculating a difference between the grammatical prediction data and the first text data as a grammatical priority of the first text data.
Optionally, inputting the first speech recognition sample and the second speech recognition sample into the first speech recognition model, and continuing training the first speech recognition model, including:
generating at least one first training data set from the plurality of first speech recognition samples;
generating at least one second training data set based on the plurality of second speech recognition samples;
generating at least one third training data set according to a plurality of first voice recognition samples and a plurality of second voice recognition samples, wherein the number of samples included in the first training data set, the number of samples included in the second training data set and the number of samples included in the third training data set are the same;
and alternately inputting the first training data set, the second training data set and the third training data set into the first voice recognition model, and continuing training the first voice recognition model, wherein two adjacent input training data sets are different.
Optionally, while continuing to train the first speech recognition model, the method further includes:
calculating a generalization error of the first speech recognition model;
and if the generalization error of the first speech recognition model is less than or equal to a first error threshold, continuing to train the first speech recognition model according to the second training data group.
Optionally, the generating the second speech recognition model includes:
calculating the generalization error of the first speech recognition model in the training process of the first speech recognition model;
and if the generalization error of the first speech recognition model is less than or equal to a second error threshold, stopping training the first speech recognition model, and taking the first speech recognition model at the current moment as a second speech recognition model, wherein the second error threshold is less than the first error threshold.
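The interplay of the two error thresholds above can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the disclosure's implementation: train_step, generalization_error, and the threshold values are placeholders.

```python
# Hedged sketch of the two-threshold schedule: when the generalization error
# reaches the first threshold, switch to the manually labeled (second)
# training data groups only; when it reaches the smaller second threshold,
# stop and keep the model as the second speech recognition model.
import random

FIRST_THRESHOLD = 0.10   # assumed value, for illustration only
SECOND_THRESHOLD = 0.05  # smaller than the first threshold, per the disclosure

def train_step(model, group):          # placeholder for one optimization step
    model["error"] *= random.uniform(0.95, 0.999)

def generalization_error(model):       # placeholder, e.g. error on held-out data
    return model["error"]

def continue_training(model, mixed_groups, labeled_groups):
    groups, labeled_only = mixed_groups, False
    while True:
        for group in groups:
            train_step(model, group)
            err = generalization_error(model)
            if err <= SECOND_THRESHOLD:
                return model           # training stops here
            if err <= FIRST_THRESHOLD and not labeled_only:
                groups, labeled_only = labeled_groups, True
                break                  # continue with labeled groups only

second_model = continue_training({"error": 0.5}, ["g1", "g2", "g3"], ["labeled"])
```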
Optionally, before inputting the first speech data into the first speech recognition model, the method further includes:
obtaining a plurality of third voice recognition samples, wherein the third voice recognition samples comprise third voice data and fourth text data;
inputting each of the third speech recognition samples to an initial machine learning model for training, the initial machine learning model comprising an attention-based encoder and decoder;
extracting the voice features in the third voice data in the encoder for encoding to obtain feature vectors;
decoding, in the decoder, the feature vectors to form predictive text data;
calculating a difference between the predicted text data and the third text data;
and when the difference value meets the training condition, taking the current machine learning model as a first speech recognition model.
According to a second aspect of the embodiments of the present disclosure, there is provided a speech recognition method including:
acquiring voice data to be recognized; acquiring a voice recognition model, wherein the voice recognition model is trained using the training method of the voice recognition model according to any one of the embodiments of the present disclosure;
and inputting the voice data to be recognized into the voice recognition model, and acquiring recognized text data output by the voice recognition model.
According to a third aspect of the embodiments of the present disclosure, there is provided a training apparatus for a speech recognition model, including:
a first voice data acquisition unit configured to perform acquisition of first voice data;
a first text data acquisition unit configured to input the first voice data into a first voice recognition model, and acquire at least one first text data output by the first voice recognition model;
a first speech recognition sample generation unit configured to perform recognition of second text data from each of the first text data according to a preset grammar rule, and generate a first speech recognition sample based on the first speech data;
a second voice recognition sample acquisition unit configured to perform acquisition of a second voice recognition sample including second voice data and third text data, the semantics of the second voice data and the semantics of the third text data being the same;
and the second voice recognition model generation unit is configured to input the first voice recognition sample and the second voice recognition sample into the first voice recognition model, continue training the first voice recognition model and generate a second voice recognition model.
Optionally, the first speech recognition sample generating unit includes:
a grammatical feature extraction subunit configured to perform extraction of matched grammatical features from each of the first text data, and calculate a grammatical priority of each of the first text data according to the matched grammatical features;
and the second text data screening subunit is configured to compare grammar priorities of the first text data, acquire the first text data with the highest grammar priority, and serve as the second text data.
Optionally, the grammatical feature extraction subunit includes:
a grammar priority calculation subunit configured to perform input of the first text data into a pre-trained grammar priority calculation model, the grammar priority calculation model being a bidirectional encoder representation model of an attention-based encoder and decoder structure;
a text shielding subunit configured to execute deleting at least one text unit in the first text data in the grammar priority calculation model to form at least two text segments, where a ratio of the total number of words of each text unit to the total number of words of the first text data is a set ratio;
a first prediction result obtaining subunit configured to perform obtaining first text segments before each text unit, and generate a first prediction result of each text unit;
a second prediction result obtaining subunit configured to perform obtaining of a second text segment after each of the text units, respectively, and generate a second prediction result for each of the text units;
a target prediction result obtaining subunit configured to perform generation of a target prediction result for each text unit according to each first prediction result and each second prediction result;
a syntax prediction data obtaining subunit configured to perform combining the target prediction result and each of the text segments to generate syntax prediction data;
a syntax prediction difference calculation subunit configured to perform calculation of a difference between the syntax prediction data and the first text data as a syntax priority of the first text data.
Optionally, the second speech recognition model generating unit includes:
a first training data set acquisition subunit configured to perform generating at least one first training data set from the plurality of first speech recognition samples;
a second training data set acquisition subunit configured to perform generating at least one second training data set from a plurality of second speech recognition samples;
a third training data set obtaining subunit configured to perform generating at least one third training data set according to a plurality of first voice recognition samples and a plurality of second voice recognition samples, where the number of samples included in the first training data set, the number of samples included in the second training data set, and the number of samples included in the third training data set are the same;
a training data alternating training subunit configured to perform alternating input of the first training data set, the second training data set, and the third training data set into the first speech recognition model, and continue training of the first speech recognition model, wherein two adjacent input training data sets are different.
Optionally, the training apparatus for the speech recognition model further includes:
a label data post-training unit configured to perform calculating a generalization error of the first speech recognition model while continuing to train the first speech recognition model;
and if the generalization error of the first speech recognition model is less than or equal to a first error threshold, continuing to train the first speech recognition model according to the second training data group.
Optionally, the second speech recognition model generating unit includes:
a model training completion detection unit configured to perform calculation of a generalization error of the first speech recognition model during training of the first speech recognition model;
and if the generalization error of the first speech recognition model is less than or equal to a second error threshold, stopping training the first speech recognition model, and taking the first speech recognition model at the current moment as a second speech recognition model, wherein the second error threshold is less than the first error threshold.
Optionally, the training apparatus for the speech recognition model further includes:
a third speech recognition sample acquisition unit configured to perform acquiring a plurality of third speech recognition samples including third speech data and fourth text data before inputting the first speech data into the first speech recognition model;
a first speech recognition model training unit configured to perform training by inputting each of the third speech recognition samples to an initial machine learning model including an attention-based encoder and decoder;
a voice encoding unit configured to perform, in the encoder, extracting and encoding the voice feature in the third voice data to obtain a feature vector;
a text decoding unit configured to perform decoding of the feature vector in the decoder to form predicted text data;
a predicted-text difference calculation unit configured to perform calculation of a difference between the predicted-text data and the third text data;
a first speech recognition model generation unit configured to perform, when the difference satisfies a training condition, taking a current machine learning model as a first speech recognition model.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a speech recognition apparatus including:
a to-be-recognized voice data acquisition unit configured to perform acquisition of voice data to be recognized;
a speech recognition model obtaining unit configured to perform obtaining of a speech recognition model, wherein the speech recognition model is obtained by training using a training method of the speech recognition model according to any one of the embodiments of the present disclosure;
and the recognition text data acquisition unit is configured to input the voice data to be recognized into the voice recognition model and acquire the recognition text data output by the voice recognition model.
According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement the training method of the speech recognition model according to any embodiment of the present disclosure or the speech recognition method according to any embodiment of the present disclosure.
According to a sixth aspect of the embodiments of the present disclosure, there is provided a storage medium, wherein instructions, when executed by a processor of an electronic device, enable the processor to perform the training method of the speech recognition model according to any one of the embodiments of the present disclosure or the speech recognition method according to any one of the embodiments of the present disclosure.
According to a seventh aspect of the embodiments of the present disclosure, there is provided a computer program product for use in conjunction with an electronic device, the computer program product including a computer-readable storage medium and a computer program mechanism embedded therein, the program being loaded into and executed by a computer to implement the method for training a speech recognition model according to any of the embodiments of the present disclosure or the method for speech recognition according to any of the embodiments of the present disclosure.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
text recognition is performed on the first speech data by a pre-trained first speech recognition model to obtain a plurality of pieces of first text data corresponding to the speech data, so that text data with the same semantics as the first speech data can be obtained accurately. Second text data are screened out of the plurality of pieces of first text data according to a grammar rule, yielding text data with accurate grammar. The second text data and the first speech data are combined into a first speech recognition sample, so that training samples are generated automatically and quickly, while remaining close to accurate, manually labeled training samples. Continuing to train the first speech recognition model with the first speech recognition samples means the model is trained on automatically generated samples, which accelerates training, improves training efficiency, alleviates the problem of low training efficiency, reduces the number of manually labeled samples, and reduces the labor cost of generating training samples. Continuing to train the first speech recognition model with the second speech recognition samples means the model is also trained on accurate samples, which improves the recognition accuracy of the resulting speech recognition model.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a flow diagram illustrating a method of training a speech recognition model according to an exemplary embodiment.
FIG. 2 is a schematic diagram illustrating an encoder and decoder model based on an attention mechanism in accordance with an exemplary embodiment.
Fig. 3 is a schematic diagram of an encoder shown in accordance with an exemplary embodiment.
Fig. 4 is a schematic diagram of a decoder according to an example embodiment.
FIG. 5 is a flow diagram illustrating a method of training a speech recognition model according to an example embodiment.
FIG. 6 is a flow diagram illustrating a method of training a speech recognition model according to an example embodiment.
FIG. 7 is a flow diagram illustrating a method of speech recognition according to an example embodiment.
FIG. 8 is a block diagram illustrating a training apparatus for a speech recognition model according to an example embodiment.
FIG. 9 is a block diagram illustrating a speech recognition apparatus according to an example embodiment.
FIG. 10 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Fig. 1 is a flowchart illustrating a training method of a speech recognition model according to an exemplary embodiment. The method is used in, and executed by, an electronic device. As shown in fig. 1, it includes the following steps.
In step S11, first voice data is acquired.
The first speech data serves as the source speech of speech recognition samples used to train a speech recognition model. The first speech data may include speech fragments in at least one language. For example, the first speech data may be a Chinese sentence such as 'I love singing', or may mix English and Chinese, for example an English segment ('I love to') followed by a Chinese segment (meaning 'sing'). The first speech data may be acquired in various ways, for example captured from a network or collected by a recording device.
In step S12, the first speech data is input into a first speech recognition model, and at least one piece of first text data output by the first speech recognition model is obtained.
The first speech recognition model is used for recognizing speech data and generating text data with the same semantics as the speech data. The first speech recognition model may be generated by training a machine learning model. It is in fact an end-to-end model; using an end-to-end model as the speech recognition model reduces processing operations such as preprocessing and feature processing of the original speech data, converting the speech data directly into text data, which reduces the errors those operations would introduce and improves the efficiency and accuracy of speech recognition. In addition, using the first speech recognition model to recognize speech data as text data, and thereby generating sample pairs of speech data and text data, produces training samples automatically and greatly reduces the labor cost of transcribing speech into text.
The voice data is a voice sequence, and the text data is a text sequence. Illustratively, the speech recognition model is used for recognizing a speech sequence into a text sequence with the same meaning; for another example, a speech recognition model is used to translate a speech sequence of one language into a text sequence of another language; in another example, a speech sequence of a long text is converted into a short text sequence, i.e., a text summary is generated from a speech chapter. In addition, there are other application scenarios, which may be specifically set according to needs, and thus, the embodiment of the present disclosure is not specifically limited.
In an implementation manner of the embodiment of the present disclosure, before inputting the first speech data into the first speech recognition model, optionally, the method further includes: obtaining a plurality of third voice recognition samples, wherein the third voice recognition samples comprise third voice data and fourth text data; inputting each of the third speech recognition samples to an initial machine learning model for training, the initial machine learning model comprising an attention-based encoder and decoder; extracting the voice features in the third voice data in the encoder for encoding to obtain feature vectors; decoding, in the decoder, the feature vectors to form predictive text data; calculating a difference between the predicted text data and the third text data; and when the difference value meets the training condition, taking the current machine learning model as a first speech recognition model.
The third speech recognition sample is used for training the initial machine learning model to generate a first speech recognition model. And the semantics of the third voice data included in the third voice recognition sample is the same as the semantics of the fourth text data included in the third voice recognition sample. Optionally, the fourth text data in the third speech recognition sample is artificially labeled text data. The initial machine learning model is trained by adopting the artificially labeled voice recognition sample, so that model training is realized by adopting an accurate voice recognition sample, and the voice recognition accuracy of the machine learning model can be improved. The third speech recognition sample is actually a manually labeled correct sample pair. Generally, the text data formed by the manual labeling mode can ensure that the grammar of the text is accurate and corresponds to the speech data accurately, so that the sample representativeness of the third speech recognition sample formed by outputting the text data according to the manual labeling is the best. The first speech recognition model is an initial machine learning model trained by the third speech recognition sample pair.
In addition, in order to save labor cost, the initial machine learning model is trained for only a small number of iterations or on a small amount of data, so the trained initial machine learning model, that is, the first speech recognition model, is in an under-fit state and needs further training.
The structure of the initial machine learning model is shown in fig. 2. The encoder performs feature extraction on the speech data and encodes the extracted speech features to generate feature vectors, where the speech features characterize parameters of the speech data and the attribute values of those parameters. The attention module corrects the feature vector at the current moment according to the feature vectors at historical moments and sends the corrected feature vector to the decoder, thereby enhancing the vector representation. The decoder decodes the corrected feature vector to form at least one piece of predicted text data. The classifier classifies the predicted text data and outputs at least one piece of predicted text data. Illustratively, the classifier may use a softmax loss function; by configuring a threshold on the softmax output, text data with several different probabilities can be output, yielding several sequence conversion results. In general, the classifier may be configured to output the text data with the highest probability; in the embodiment of the present disclosure, a plurality of text data may be output by configuring the threshold, for example, 5 pieces of text data. In addition, the classifier may also use a Connectionist Temporal Classification (CTC) loss function, which may be set as required; this is not limited in the embodiments of the present disclosure.
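A small sketch of how several hypotheses might be emitted from the classifier output follows; the candidate strings, probabilities, and threshold are placeholders, not values from the disclosure.

```python
# Sketch: keep every candidate transcript whose (assumed) softmax probability
# clears a configured threshold, so the model outputs several hypotheses.
THRESHOLD = 0.05

hypotheses = {                          # candidate text: softmax probability
    "I love singing": 0.62,
    "I love seeing": 0.21,
    "eye love singing": 0.08,
    "I loaf singing": 0.02,
}
first_text_data = [t for t, p in sorted(hypotheses.items(), key=lambda kv: -kv[1])
                   if p >= THRESHOLD]   # candidates passed on as first text data
```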
The encoder encodes speech data (x) of arbitrary length into a feature vector (c), specifically by segmenting the speech data and encoding the segments. The decoder parses the feature vector (c) according to context information to form text data (y). The feature vector thus describes the features of the speech data.
When calculating the feature vector, the encoder usually segments the speech data and extracts features from each speech segment to form speech elements. The encoder is configured in advance with an initial hidden layer vector and, taking a speech element as input, calculates the hidden layer vector for the current moment. The speech elements are then taken as input in sequence, the hidden layer vector obtained at the previous moment is transformed into the hidden layer vector for the current moment, and the hidden layer vector obtained once all speech elements have been input is the feature vector.
Illustratively, as shown in fig. 3, h1, h2, h3, ..., hn are hidden layer vectors, each depending on the state at the previous moment and the current input. h0 is the default initial hidden layer vector, x1, x2, x3, ..., xn are the speech data, and c is the feature vector. h1 is calculated from h0 and the input x1 at that moment, h2 is calculated from h1 and the input x2 at that moment, and so on; c is calculated from hn and the input xn at that moment.
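In code form, the recurrence of fig. 3 can be sketched as follows; f is an arbitrary stand-in for the encoder's transition function, not the network actually used.

```python
# Sketch of fig. 3: h_t = f(h_{t-1}, x_t), and c is produced at the last step.
def f(h_prev, x_t):                    # stand-in transition function
    return [0.5 * a + 0.5 * b for a, b in zip(h_prev, x_t)]

def encode(xs, h0):
    h = h0
    for x_t in xs:                     # x1 ... xn in order
        h = f(h, x_t)                  # h1 = f(h0, x1), h2 = f(h1, x2), ...
    return h                           # the feature vector c

c = encode([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], h0=[0.0, 0.0])
```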
When parsing the feature vector, the decoder usually takes the feature vector as input, calculates the hidden layer vector for each moment, determines candidate speech segments, computes the probability (e.g., confidence) of each candidate segment, and determines the target speech segment from those probabilities. Specifically, the hidden layer vector for the current moment may be calculated from the hidden layer vector obtained at the previous moment.
Illustratively, as shown in fig. 4, h1', h2', h3', ..., hn' are hidden layer vectors, each depending on the state at the previous moment and the current input. h0' is the default initial hidden layer vector, y1, y2, y3, ..., yn are the output sequence, and c is the feature vector. h1' is calculated from h0' and c, h2' is calculated from h1' and c, and so on; hn' is calculated from hn-1' and c. Meanwhile, the probabilities of several candidate speech segments are calculated from h0', h1' and c, and the target speech segment y1 is determined and output; then the probabilities of several candidate segments are calculated from h1', y1 and c, and the target segment y2 is determined and output; and so on, until yn is output according to hn-1', yn-1 and c. Concatenating y1, y2, y3, ..., yn yields the output sequence, that is, the text data.
When the decoder parses the feature vector, the target speech segment is related not only to the decoder's hidden layer vector at the previous moment, the feature vector, and the target segment of the previous moment, but also to the hidden layer vectors in the encoder. Through the attention module, for the calculation of each target segment, a weight is determined for each hidden layer vector in the encoder; the decoder input at the current moment and the encoder hidden layer vectors at all moments are weighted and summed, and the hidden layer vector and target segment for the next moment are calculated from the result. The target speech segment is thereby determined more accurately, and the text data is determined accurately.
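The attention-weighted decoding described above can be sketched as follows; the dot-product score, the transition g, and the "pick the most probable segment" step are illustrative stand-ins rather than the patent's formulas.

```python
# Sketch of the attention step: for each decoder step, weight every encoder
# hidden vector, sum them, and use the result together with the previous
# decoder state and output to produce the next state and output.
import math

def attention_context(dec_h, enc_hs):
    scores = [sum(a * b for a, b in zip(dec_h, h)) for h in enc_hs]  # dot products
    zmax = max(scores)
    exps = [math.exp(s - zmax) for s in scores]
    weights = [e / sum(exps) for e in exps]                          # softmax
    return [sum(w * h[i] for w, h in zip(weights, enc_hs))           # weighted sum
            for i in range(len(enc_hs[0]))]

def g(dec_h, y_prev, ctx):             # stand-in decoder transition
    return [0.4 * a + 0.3 * y_prev + 0.3 * b for a, b in zip(dec_h, ctx)]

enc_hs = [[1.0, 0.0], [0.0, 1.0]]      # encoder hidden vectors h1 ... hn
dec_h, y_prev, out = [0.1, 0.2], 0.0, []
for _ in range(3):                     # emit y1, y2, y3
    ctx = attention_context(dec_h, enc_hs)
    dec_h = g(dec_h, y_prev, ctx)
    y_prev = max(dec_h)                # stand-in "choose most probable segment"
    out.append(y_prev)
```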
The difference between the predicted text data and the third text data measures how far the prediction is from the accurate speech recognition result. Specifically, the difference may be calculated with a preset loss function. Illustratively, the loss function takes the negative logarithm of the product of the probabilities that the initial learning model assigns to the correct labels, where a correct label refers to data in the predicted text data that is the same as, or similar to, the third text data. The training condition detects whether training of the initial machine learning model is complete. Optionally, the training condition checks whether the difference is less than or equal to a set difference threshold, or whether the number of training iterations exceeds a set threshold. The training condition may also take other forms, which the embodiments of the present disclosure do not limit. When the difference satisfies the training condition, training of the initial machine learning model is finished, and the initial machine learning model at the current moment is determined to be the first speech recognition model.
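As a worked form of that loss: if the model assigns probability p_t to the correct label at output step t, the loss is the negative logarithm of the product of the p_t, which equals the sum of their negative logarithms. The probability values below are assumed for illustration.

```python
# Sketch of the loss above: -log(p1 * p2 * ... * pn) = -(log p1 + ... + log pn).
import math

probs_of_correct_labels = [0.9, 0.8, 0.95]        # assumed example values
loss = -sum(math.log(p) for p in probs_of_correct_labels)
# identical to -math.log(0.9 * 0.8 * 0.95); a smaller loss means a better fit
```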
By adopting an encoder model and a decoder model based on an attention mechanism as an initial machine learning model, input data of a decoder at the last moment can be paid more attention to, input voice and output text are aligned in the time sequence direction, text data output by a first voice recognition model obtained through training is aligned with the voice data, and the recognition accuracy of the first voice recognition model is improved; and training the initial machine learning model by using the third voice data and the fourth text data with the same semantics as training samples to generate a first voice recognition model, so that the accuracy of the training samples is improved, and the recognition accuracy of the trained first voice recognition model is improved.
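Putting the pieces above together, the pretraining of the first speech recognition model can be sketched as follows. This is a minimal stand-in under assumed dimensions, not the patent's exact architecture: a single-layer GRU encoder-decoder with dot-product attention, trained by teacher forcing on randomly generated stand-in data.

```python
# Hedged sketch of pretraining an attention-based encoder-decoder on
# speech-feature/text pairs; all shapes and sizes are illustrative.
import torch
import torch.nn as nn

FEAT_DIM, HID, VOCAB = 80, 256, 5000   # assumed dimensions

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(FEAT_DIM, HID, batch_first=True)
    def forward(self, x):               # x: (B, T, FEAT_DIM) speech features
        states, last = self.rnn(x)      # states: every h_t; last: final hidden
        return states, last

class AttnDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, HID)
        self.rnn = nn.GRU(2 * HID, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)
    def forward(self, enc_states, h, y_prev):
        q = h[-1].unsqueeze(1)                          # decoder state (B, 1, HID)
        w = torch.softmax((q * enc_states).sum(-1), -1) # weights over encoder h_t
        ctx = (w.unsqueeze(-1) * enc_states).sum(1, keepdim=True)
        inp = torch.cat([self.emb(y_prev).unsqueeze(1), ctx], dim=-1)
        o, h = self.rnn(inp, h)
        return self.out(o.squeeze(1)), h

enc, dec = Encoder(), AttnDecoder()
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()        # negative log-probability of correct label

x = torch.randn(4, 120, FEAT_DIM)      # stand-in batch of third speech data
y = torch.randint(0, VOCAB, (4, 12))   # stand-in fourth text data (token ids)
enc_states, h = enc(x)
loss, tok = 0.0, torch.zeros(4, dtype=torch.long)
for t in range(y.size(1)):             # teacher forcing over the target sequence
    logits, h = dec(enc_states, h, tok)
    loss = loss + loss_fn(logits, y[:, t])
    tok = y[:, t]
opt.zero_grad(); loss.backward(); opt.step()
```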
In step S13, second text data is recognized from each of the first text data according to a preset grammar rule, and a first speech recognition sample is generated based on the first speech data.
And according to a preset grammar rule, identifying second text data from the first text data, and screening out text data with the most accurate grammar from the plurality of first text data. The grammar rules are used to detect whether the grammar of the text data is accurate. The plurality of first text data may be respectively scored according to grammar rules, the score of the first text data may be used as an evaluation result of the first text data, and the first text data with the highest score may be determined as the second text data, where the evaluation result of the text data is used to evaluate whether the text data is grammar-accurate. Illustratively, the grammar rule is used for detecting whether the position relation between each participle and adjacent participles in the text data is correct.
The second text data is used for forming a sample pair with the first voice data to serve as training data so as to train the first voice recognition model. It is to be understood that the second text data may be the text data that most conforms to the grammar rule in the first text data, that is, the text data with the most accurate grammar. And the first voice data and the second text data obtained by screening form a target sample pair, so that the accuracy of the sample can be improved.
Illustratively, the first text data includes two candidate transcripts: "Hangzhou in May is a season like a painting" and "Hangzhou in May is a season with scenery like a painting". The second sentence is grammatically accurate, and is therefore taken as the second text data.
In step S14, a second speech recognition sample is obtained, where the second speech recognition sample includes second speech data and third text data, and the semantics of the second speech data and the semantics of the third text data are the same.
The second speech recognition sample is actually the correct pair of samples that were manually labeled. Generally, the text data formed by the manual labeling mode can ensure that the grammar of the text is accurate and corresponds to the speech data accurately, so that the sample representativeness of the second speech recognition sample formed by outputting the text data according to the manual labeling is the best.
In step S15, the first speech recognition sample and the second speech recognition sample are input to the first speech recognition model, and the first speech recognition model is trained to generate a second speech recognition model.
The first speech recognition sample is a sample that is automatically generated by the computer device. The second speech recognition sample is the correct sample to be manually labeled. The second voice recognition model is a trained model, the second voice recognition model is a model obtained by training the first voice recognition model through a large number of samples, and the voice recognition accuracy of the second voice recognition model is higher compared with that of the first voice recognition model.
Because labor cost is too high, it is difficult to obtain enough manually labeled samples to train the first speech recognition model. By automatically generating text data with the same semantics as arbitrary speech data, a large number of first speech recognition samples can be formed, so that a large number of training samples are generated quickly to continue training the first speech recognition model and improve its recognition accuracy. In practice, error samples may exist among the first speech recognition samples generated in bulk: for example, the semantics of the speech data may differ from the semantics of the text data, and/or the grammar of the text data may be wrong, and training the first speech recognition model on such error samples would reduce its recognition accuracy. Therefore, second speech recognition samples, manually labeled correct samples, are added to train the first speech recognition model and improve its recognition accuracy.
In addition, the first speech recognition samples may be formed by recognizing speech data arbitrarily captured from different domains into text data. The variety of the first speech data can be increased through configuration, which increases the variety of the first speech recognition samples, widens the coverage of the training samples, and improves the recognition accuracy of the first speech recognition model on unknown speech, that is, its generalization capability.
According to the technical solution of the embodiments of the present disclosure, text recognition is performed on the first speech data by a pre-trained first speech recognition model to obtain a plurality of pieces of first text data corresponding to the speech data, so that text data with the same semantics as the first speech data can be obtained accurately; second text data are screened out of the plurality of pieces of first text data according to a grammar rule, yielding text data with accurate grammar; the second text data and the first speech data are combined into a first speech recognition sample, so that training samples are generated automatically and quickly while remaining close to accurate, manually labeled samples; continuing to train the first speech recognition model with the first speech recognition samples accelerates training, improves training efficiency, reduces the number of manually labeled samples, and reduces the labor cost of generating training samples; and continuing to train the first speech recognition model with the second speech recognition samples trains the model on accurate samples, which improves the recognition accuracy of the speech recognition model.
Fig. 5 is a flowchart illustrating a method for training a speech recognition model according to an exemplary embodiment, which is a further refinement of the above technical solution, and the technical solution in the embodiment may be combined with various alternatives in one or more of the above embodiments. As shown in fig. 5, the training method of the speech recognition model includes the following steps.
In step S21, first voice data is acquired.
For contents not described exhaustively in this embodiment, reference may be made to the foregoing embodiments.
In step S22, the first speech data is input into a first speech recognition model, and at least one piece of first text data output by the first speech recognition model is obtained.
In step S23, a matching grammatical feature is extracted from each of the first text data, and a grammatical priority of each of the first text data is calculated based on the matching grammatical feature.
The grammatical features characterize parameters of the grammar of the text data, and the attribute values of those parameters, e.g. the position of words within a sentence and/or the intra-sentence type of the words. The grammar priority evaluates the degree to which the grammar of the text data conforms to human grammar rules: the more the grammar of the text data conforms to human grammar rules, the higher its grammar priority; the more it deviates, the lower its grammar priority. Since the grammars of different languages usually differ, a matched calculation method can be configured for the grammar of each language to perform the grammar priority calculation. Illustratively, the grammar priority may be calculated by a pre-trained machine learning model.
Optionally, the extracting the matched grammatical feature from the first text data, and calculating the grammatical priority of the first text data according to the matched grammatical feature includes: inputting the first text data into a pre-trained grammar priority calculation model, wherein the grammar priority calculation model is a bidirectional encoder representation model of an encoder and decoder structure based on an attention mechanism; in the grammar priority calculation model, deleting at least one text unit in the first text data to form at least two text segments, wherein the ratio of the total number of words of each text unit to the total number of words of the first text data is a set ratio; respectively acquiring a first text segment in front of each text unit and generating a first prediction result of each text unit; respectively acquiring second text segments behind the text units and generating second prediction results of the text units; generating a target prediction result of each text unit according to each first prediction result and each second prediction result; combining the target prediction result with each text segment to generate grammar prediction data; calculating a difference between the grammatical prediction data and the first text data as a grammatical priority of the first text data.
A Bidirectional Encoder Representation (BERT) model of an encoder and decoder structure based on an attention mechanism is a model obtained by training according to a large-scale unlabeled corpus, and the BERT model is used for obtaining semantic representation of texts containing rich semantic information. The BERT model is a pre-training model, the input of the BERT model comprises original word vectors of all characters or words in a text, and the output of the BERT model comprises vector representation of all characters or words in the text after full-text semantic information is fused. The BERT model adopts an unsupervised method and adopts massive text data for training. The BERT model is used for learning grammar rules in the text data so as to perform grammar priority calculation on the text data.
The BERT model training process is specifically as follows: tokens at random positions in the input text are masked, and then only those masked tokens are predicted.
Here the first text data is the input text of the BERT model, and a text unit is a masked input word. The first text data may be word-segmented to form at least one text unit; illustratively, for the first text data "I like singing", the three text units are: I, like, and singing. Masking may refer to deleting the text unit and replacing it with special marker information, that is, after the text unit is deleted, the position of every text unit (including the deleted one) is still maintained. For each deleted text unit, the text before it and the text after it are determined as two text segments, so two text segments exist for each deleted text unit, and a text segment may be empty. The first text segment before a text unit is all the text preceding the unit in the text data; the second text segment after a text unit is all the text following the unit in the text data.
A text unit includes at least one word, a word being the minimum editable unit of text. In the first text data, any number of text units at arbitrary positions may be masked. Generally, the more text units, i.e., the more words, are masked, the lower the prediction accuracy; for example, if all words are deleted, the prediction accuracy of the BERT model is extremely low. Therefore, a set ratio can be configured to limit the proportion of the number of words in the masked text units to the number of words in the first text data, improving the prediction accuracy of the BERT model. The set ratio determines the number of words included in the text units masked (or deleted) in the text data. The set ratio is usually 15%, but it may also be set according to practical situations; the embodiments of the present disclosure are not particularly limited in this respect.
Obtaining the first text segment before a text unit and generating a first prediction result for the text unit can be understood as obtaining a prediction in the first direction; obtaining the second text segment after the text unit and generating a second prediction result can be understood as obtaining a prediction in the second direction. Obtaining predictions in both directions embodies the bidirectional semantic representation of the BERT encoder. A target prediction result for the text unit is generated from its first and second prediction results, so that the predictions in both directions are considered together. Combining the target prediction result with the first text segment before the unit and the second text segment after it, in effect, fills the target prediction result into the position from which the unit was deleted. Filling the target prediction results matched to all text units into their matched deletion positions generates the grammar prediction data.
The difference between the grammar prediction data and the first text data represents the prediction accuracy of the BERT model. In practice, during training the BERT model learns semantic representations of grammatically correct text data, so the difference between the grammar prediction data and grammatically correct text data represents the gap between what the BERT model predicts and correct grammar. The difference between the grammar prediction data and the first text data is therefore taken as the grammar priority of the first text data, and can be used to evaluate the degree to which the grammar of the first text data conforms to the grammar rules.
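A sketch of grammar-priority scoring with a masked language model follows. It uses a public BERT checkpoint as a stand-in for the patent's grammar priority calculation model, masks one text unit at a time (a pseudo-log-likelihood) rather than masking a set ratio of units in one pass as described above, and the candidate sentences are placeholders.

```python
# Hedged sketch: score each candidate transcript by how well the surrounding
# context predicts each masked token; a higher score is treated here as a
# higher grammar priority.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
mlm = BertForMaskedLM.from_pretrained("bert-base-uncased")
mlm.eval()

def grammar_priority(sentence: str) -> float:
    ids = tok(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):            # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tok.mask_token_id           # mask (delete) one text unit
        with torch.no_grad():
            logits = mlm(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, -1)[ids[i]].item()
    return total                                # higher = closer to correct grammar

# Usage: score each candidate transcript and keep the best as second text data.
candidates = ["i like singing", "i singing like"]
second_text_data = max(candidates, key=grammar_priority)
```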
In addition, one BERT model can only perform grammar priority calculation for the text of the same language, different BERT models respectively correspond to grammars of different languages, and the BERT model matched with the language is selected to perform grammar priority calculation according to the language corresponding to the text data; alternatively, a BERT model may be configured to perform grammar priority calculation for texts of multiple languages.
Using the pre-trained BERT model to calculate the grammar priority of text data yields accurate grammar priorities for the first text data. Because the BERT model is trained on text data by an unsupervised method, the cost of grammar evaluation is reduced and the grammar evaluation model can be trained quickly.
In step S24, the grammar priorities of the first text data are compared, the first text data with the highest grammar priority is obtained as the second text data, and the first speech recognition sample is generated based on the first speech data.
The first text data with the highest grammar priority is used for determining the first text data which best accords with the grammar rule of the person, namely determining the first text data with the most accurate grammar. The first text data with the highest grammar priority is selected to generate the first voice recognition sample, so that the first voice recognition sample can be aligned with the manually marked voice recognition sample, the grammar accuracy of the first voice recognition sample is improved, the grammar accuracy of the output text of the second voice recognition model generated by training is improved, and the voice recognition accuracy of the second voice recognition model is improved.
In step S25, a second speech recognition sample is obtained, where the second speech recognition sample includes second speech data and third text data, and the semantics of the second speech data and the semantics of the third text data are the same.
In step S26, the first speech recognition sample and the second speech recognition sample are input to the first speech recognition model, and the first speech recognition model is trained to generate a second speech recognition model.
According to the technical scheme of the embodiment of the disclosure, grammar priority calculation is performed on a plurality of text data, and the text data with the highest grammar priority is used as the second text data, so that grammar accuracy of the second text data is improved, grammar accuracy of the first voice recognition sample can be improved, and grammar accuracy of the text data output by the second voice recognition model is improved.
Fig. 6 is a flowchart illustrating a method for training a speech recognition model according to an exemplary embodiment, which is a further refinement of the above technical solution, and the technical solution in the embodiment may be combined with various alternatives in one or more of the above embodiments. As shown in fig. 6, the training method of the speech recognition model includes the following steps.
In step S31, first voice data is acquired.
For details not described exhaustively in this embodiment, refer to the foregoing embodiments.
In step S32, the first speech data is input into a first speech recognition model, and at least one first text data output by the speech recognition model is obtained.
In step S33, second text data is recognized from each of the first text data according to a preset grammar rule, and a first speech recognition sample is generated based on the first speech data.
In step S34, a second speech recognition sample is obtained, where the second speech recognition sample includes second speech data and third text data, and the semantics of the second speech data and the semantics of the third text data are the same.
In step S35, at least one first training data set is generated based on the plurality of first speech recognition samples.
A large number of first speech recognition samples are obtained and grouped to generate a plurality of first training data sets. The first training data set includes only samples automatically generated by the computer device.
In step S36, at least one second training data set is generated based on the plurality of second speech recognition samples.
Similarly, a large number of second speech recognition samples are obtained and grouped to generate a plurality of second training data sets. The second training data set comprises only manually labeled samples.
In step S37, at least one third training data set is generated according to a plurality of first speech recognition samples and a plurality of second speech recognition samples, wherein the number of samples included in the first training data set, the number of samples included in the second training data set, and the number of samples included in the third training data set are the same.
The third training data set is a mixed set of samples automatically generated by the computer device and manually labeled samples.
Configuring the first training data set, the second training data set, and the third training data set to contain the same number of samples avoids errors caused by unequal input sizes, which reduces the error introduced during model training and improves the speech recognition accuracy of the model.
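A minimal sketch of this grouping step is given below; the group size of 32 and the half-and-half mixing ratio of the third training data sets are illustrative assumptions, since the disclosure fixes only that all three kinds of groups contain the same number of samples.

```python
import random

def make_training_groups(auto_samples, labeled_samples, group_size=32):
    """Build the three kinds of training data groups with identical sizes:
    auto-generated only, manually labeled only, and an even mixture."""
    def chunks(pool):
        pool = list(pool)
        random.shuffle(pool)
        return [pool[i:i + group_size]
                for i in range(0, len(pool) - group_size + 1, group_size)]

    first_groups = chunks(auto_samples)      # computer-generated samples only
    second_groups = chunks(labeled_samples)  # manually labeled samples only
    half = group_size // 2
    third_groups = [random.sample(auto_samples, group_size - half)
                    + random.sample(labeled_samples, half)
                    for _ in range(min(len(first_groups), len(second_groups)))]
    return first_groups, second_groups, third_groups
```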
In step S38, the first training data set, the second training data set, and the third training data set are alternately input to the first speech recognition model, and the first speech recognition model is continuously trained to generate a second speech recognition model, where two training data sets input adjacently are different.
Alternately inputting the first, second, and third training data sets into the first speech recognition model prevents the input from solidifying into a single data set type, which would reduce the model's recognition accuracy on unknown samples. Alternating the inputs increases the randomness of the training process, improving the recognition accuracy of the second speech recognition model on unknown samples and hence its generalization capability.
Typically, the model is trained in multiple rounds, each consisting of multiple steps. The training process of the first speech recognition model thus comprises several rounds of training, one round comprising several steps, and each step selects part of the training data to train the model. The training data can be grouped, and the model trained on different groups in different rounds and steps. Illustratively, the number of rounds is 30-40 and the number of steps per round is 5-10; for example, the first speech recognition model may be trained for 40 rounds of 5 steps each. Each step trains the first speech recognition model with any one of the first, second, or third training data sets, and the type of data set used at each step is typically chosen at random. "The two training data sets input adjacently are different" means that two adjacent steps use training data sets of different types.
For example, the first speech recognition model may be trained as follows: in each round, the first step uses a first training data set, the second step a second training data set, and the third step a third training data set; the pattern then repeats in subsequent steps, the fourth step again using a first training data set, and so on.
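A sketch of this alternating schedule, using the illustrative 40 rounds of 5 steps, follows; model.train_step is an assumed one-step training interface, not an API defined by this disclosure.

```python
import random

def train_alternating(model, first_groups, second_groups, third_groups,
                      rounds=40, steps_per_round=5):
    """Alternate the three group types across rounds x steps; the type used
    in one step is never the type used in the step before it."""
    pools = {"first": first_groups, "second": second_groups, "third": third_groups}
    previous = None
    for _ in range(rounds):
        for _ in range(steps_per_round):
            kind = random.choice([k for k in pools if k != previous])
            model.train_step(random.choice(pools[kind]))  # assumed one-step update
            previous = kind
```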
The generalization error of the first speech recognition model can be computed after each training step (in the example above, after each of the five steps of a round), and the number of training rounds and steps can be adjusted accordingly based on that error.
The generalization capability of the first speech recognition model describes its ability to give correct responses to unknown inputs, and is generally evaluated through the model's generalization error. In general, samples are divided into a training set and a test set; the training data in the embodiments of the present disclosure constitutes the training set. The training set is used to train the model, while the test set is used to evaluate the trained model's predictive performance, and a portion of the target sample pairs may be selected from the training data to form the test set. The generalization error can be defined as the proportion of erroneous samples among all samples in the model's predictions on the test set, and the model's generalization capability is best when this error is lowest. A model trained too many times, however, may fail to retain an accurate input/output mapping: during training the model passes through an under-fitting state and an over-fitting state. Initially the model is under-fitted, and the generalization error falls as the number of training iterations grows; once the number of iterations exceeds a certain point, the model becomes over-fitted and the generalization error rises with further training. The generalization error therefore first falls and then rises, and its lowest point is also the turning point from under-fitting to over-fitting.
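Under these definitions, the generalization error might be computed as in the following sketch; model.transcribe is an assumed interface, and exact string match is a simplification of "erroneous sample" (a practical system might instead threshold an edit distance).

```python
def generalization_error(model, test_set):
    """Generalization error as described above: the proportion of test
    samples whose prediction is wrong."""
    wrong = sum(1 for speech, reference in test_set
                if model.transcribe(speech) != reference)  # assumed interface
    return wrong / len(test_set)
```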
The training samples of the first speech recognition model are usually manually labeled. Because labeling costs are high, massive sample sets are difficult to obtain, yet only training on massive samples could drive the first speech recognition model into an over-fitting state; the first speech recognition model is therefore typically under-fitted. Automatically generating a large number of first speech recognition samples and continuing to train the first speech recognition model reduces the generalization error of the trained second speech recognition model and improves its generalization capability.
Optionally, while continuing to train the first speech recognition model, the method further includes: calculating a generalization error of the first speech recognition model; and if the generalization error of the first speech recognition model is less than or equal to a first error threshold, continuing to train the first speech recognition model according to the second training data group.
The first error threshold determines when the first speech recognition model switches to training only on the second training data sets; the generalization error typically falls to the first error threshold or below at the final stage of training.
In fact, since a first speech recognition sample may be erroneous, a first training data set may contain erroneous samples, and training the first speech recognition model on it may reduce the model's speech recognition accuracy. Training only on the second training data sets at the later stage therefore further improves the speech recognition accuracy of the first speech recognition model.
When the generalization error is less than or equal to the first error threshold, the first speech recognition model is trained entirely on the second training data sets, that is, on training data generated by manual labeling, during the final stage of training. Because manually labeled training data has the most accurate grammar and semantic expression, it is used for the last few rounds or steps of training.
In addition, the first error threshold corresponds to a total number of training steps: the total number of steps corresponding to the first error threshold can typically be determined from experimental statistics, so the first error threshold can be characterized by a step count. Illustratively, with 30-40 rounds of 5-10 steps each, the generalization error falls to the first error threshold or below in roughly the last 2-5 rounds. The first speech recognition model may therefore also be trained using only the second training data sets starting from the 2nd to 5th round from the end.
Training the first speech recognition model on the second training data sets once the generalization error is less than or equal to the first error threshold reduces the impact of erroneous samples in the first training data sets on recognition accuracy; at the same time, finishing training on labeled data further improves the conversion accuracy of the first speech recognition model.
Optionally, the generating the second speech recognition model includes: calculating the generalization error of the first speech recognition model in the training process of the first speech recognition model; and if the generalization error of the first speech recognition model is less than or equal to a second error threshold, stopping training the first speech recognition model, and taking the first speech recognition model at the current moment as a second speech recognition model, wherein the second error threshold is less than the first error threshold.
The second error threshold is used to judge whether training of the first speech recognition model is complete. Illustratively, the second error threshold is the minimum of the generalization error. The second error threshold may also be set to a value greater than that minimum as needed; this embodiment of the disclosure does not limit it.
In practical applications, the second error threshold likewise corresponds to a total number of training steps. For example, with 30-40 rounds of 5-10 steps each, the total step count characterizes the second error threshold: once training of all rounds and all steps is complete, the generalization error of the first speech recognition model is judged to be less than or equal to the second error threshold.
Because the generalization error rises after the turning point, which reduces the generalization capability of the first speech recognition model, training can be continued only until the generalization error of the first speech recognition model reaches its minimum, at which point the model's generalization capability is at its best.
Configuring the second error threshold as the minimum of the generalization error keeps the generalization error of the first speech recognition model close to that minimum, improving the generalization capability of the first speech recognition model.
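The two thresholds can be combined into a single control loop, sketched below under the same assumed interfaces as the earlier sketches; the numeric threshold values are invented for illustration, and the loop omits the adjacent-groups-differ constraint for brevity.

```python
import random

def train_with_thresholds(model, first_groups, second_groups, third_groups,
                          test_set, first_threshold=0.15, second_threshold=0.05):
    """Two-threshold control: alternate all three group types until the
    generalization error reaches first_threshold, then use only the manually
    labeled groups, and stop at second_threshold (both values are made up)."""
    labeled_only = False
    while True:
        pool = (second_groups if labeled_only
                else random.choice([first_groups, second_groups, third_groups]))
        model.train_step(random.choice(pool))          # assumed interface
        error = generalization_error(model, test_set)  # sketch above
        if error <= second_threshold:
            return model  # training complete: this is the second model
        if error <= first_threshold:
            labeled_only = True  # final stage: labeled data only
```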
According to the technical solution of this embodiment of the disclosure, the first, second, and third training data sets are generated from the first and second speech recognition samples and alternately input into the first speech recognition model for training. This increases the randomness of the training samples, improving the trained second speech recognition model's recognition accuracy on unknown samples and its generalization capability. Configuring every training data set to contain the same number of samples avoids errors caused by unequal input sizes, which reduces the error introduced during model training and improves the model's speech recognition accuracy.
Fig. 7 is a flowchart illustrating a speech recognition method according to an exemplary embodiment. The speech recognition method is used in and executed by an electronic device and, as shown in fig. 7, includes the following steps.
In step S41, voice data to be recognized is acquired.
The speech to be recognized serves as the input sequence for sequence conversion and contains the voice of at least one speaker. It can be acquired in various ways, for example by capturing it from a network or by means of a voice recording device.
In step S42, a speech recognition model is obtained, and the speech recognition model is obtained by training using the training method of the speech recognition model according to any one of the embodiments of the present disclosure.
Obtaining the speech recognition model by the training method of the speech recognition model according to any embodiment of the present disclosure allows a large number of training samples to be obtained quickly and automatically; the pre-trained first speech recognition model is further trained into the target model, which improves the target model's generalization capability as well as the accuracy of its input/output mapping and of its grammar.
In step S43, the speech data to be recognized is input into the speech recognition model, and the recognition text data output by the speech recognition model is acquired.
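At inference time the trained model is simply applied to the new utterance, as in the following sketch; load_audio and transcribe are hypothetical helper names standing in for the feature-extraction and decoding steps.

```python
def recognize(speech_recognition_model, audio_path):
    """Feed one utterance to the trained model and return its transcript."""
    speech = load_audio(audio_path)                     # assumed feature loader
    return speech_recognition_model.transcribe(speech)  # assumed decoding call
```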
For example, the target model may be applied in many scenarios: converting a speaker's voice into text in the same language; converting a speaker's voice into text in a different language; generating a dialog reply text from the speaker's voice to realize human-machine dialog; or generating a summary text from the speaker's speech content.
According to the technical solution of this embodiment, the speech recognition model is obtained by the training method of the speech recognition model according to any embodiment of the present disclosure. A large number of training samples can be generated quickly and automatically, which speeds up sample generation, reduces the labor cost of training the target model, and improves training efficiency; at the same time, the training samples approach manually labeled, accurate samples, which improves the recognition quality of the speech recognition model. Performing speech recognition with a model obtained by this training method reduces the labor cost of producing recognized text while improving recognition efficiency.
FIG. 8 is a block diagram illustrating a training apparatus for a speech recognition model according to an example embodiment. Referring to fig. 8, the apparatus includes a first voice data acquiring unit 121, a first text data acquiring unit 122, a first voice recognition sample generating unit 123, a second voice recognition sample acquiring unit 124, and a second voice recognition model generating unit 125.
A first voice data acquisition unit 121 configured to perform acquisition of first voice data;
a first text data obtaining unit 122 configured to perform inputting the first voice data into a first voice recognition model, and obtain at least one first text data output by the voice recognition model;
a first speech recognition sample generation unit 123 configured to perform recognition of second text data from each of the first text data according to a preset grammar rule, and generate a first speech recognition sample based on the first speech data;
a second speech recognition sample acquisition unit 124 configured to perform acquisition of a second speech recognition sample including second speech data and third text data, the semantics of the second speech data and the semantics of the third text data being the same;
a second speech recognition model generating unit 125 configured to perform inputting the first speech recognition sample and the second speech recognition sample into the first speech recognition model, continue training the first speech recognition model, and generate a second speech recognition model.
According to the technical solution of this embodiment of the disclosure: text recognition is performed on the first speech data by the pre-trained first speech recognition model to obtain a plurality of first text data corresponding to the speech data, so text data with the same semantics as the first speech data can be obtained accurately; second text data is screened out of the plurality of first text data according to grammar rules, yielding grammatically accurate text data; combining the second text data and the first speech data into a first speech recognition sample automatically generates training samples, speeding up sample generation while keeping the samples close to accurate, manually labeled ones; continuing to train the first speech recognition model on the first speech recognition samples trains the speech recognition model on automatically generated samples, which accelerates training, improves training efficiency, reduces the number of manually labeled samples required, and lowers the labor cost of generating training samples; and continuing to train the first speech recognition model on the second speech recognition samples trains the speech recognition model on accurate samples, improving its speech recognition accuracy.
In an implementation manner of the embodiment of the present disclosure, optionally, the first speech recognition sample generating unit 123 includes:
a grammatical feature extraction subunit configured to perform extraction of matched grammatical features from each of the first text data, and calculate a grammatical priority of each of the first text data according to the matched grammatical features;
and the second text data screening subunit is configured to compare grammar priorities of the first text data, acquire the first text data with the highest grammar priority, and serve as the second text data.
In an implementation manner of the embodiment of the present disclosure, optionally, the grammatical feature extraction subunit includes the following subunits (a sketch approximating their two-sided prediction follows the list below):
a grammar priority calculation subunit configured to perform input of the first text data into a pre-trained grammar priority calculation model, the grammar priority calculation model being a bidirectional encoder representation model of an attention-based encoder and decoder structure;
a text shielding subunit configured to execute deleting at least one text unit in the first text data in the grammar priority calculation model to form at least two text segments, where a ratio of the total number of words of each text unit to the total number of words of the first text data is a set ratio;
a first prediction result obtaining subunit configured to perform obtaining first text segments before each text unit, and generate a first prediction result of each text unit;
a second prediction result obtaining subunit configured to perform obtaining of a second text segment after each of the text units, respectively, and generate a second prediction result for each of the text units;
a target prediction result obtaining subunit configured to perform generation of a target prediction result for each text unit according to each first prediction result and each second prediction result;
a syntax prediction data obtaining subunit configured to perform combining the target prediction result and each of the text segments to generate syntax prediction data;
a syntax prediction difference calculation subunit configured to perform calculation of a difference between the syntax prediction data and the first text data as a syntax priority of the first text data.
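One way to approximate the two-sided prediction performed by these subunits with an off-the-shelf masked language model is sketched below (the transformers package and checkpoint name are assumptions, and tokens is the token list of one first text data, e.g. from tokenizer.tokenize): the deleted unit is predicted once from the preceding segment and once from the following segment, and the two distributions are averaged into the target prediction result.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForMaskedLM.from_pretrained("bert-base-chinese")
model.eval()

def two_sided_prediction(tokens, pos):
    """Predict the deleted unit at `pos` twice, once from the preceding
    segment and once from the following segment, then average the two
    distributions into the target prediction result."""
    cls, sep = tokenizer.cls_token_id, tokenizer.sep_token_id
    mask = tokenizer.mask_token_id
    ids = tokenizer.convert_tokens_to_ids(tokens)

    def predict(segment, mask_index):
        with torch.no_grad():
            logits = model(torch.tensor([segment])).logits[0, mask_index]
        return torch.softmax(logits, dim=-1)

    left = [cls] + ids[:pos] + [mask, sep]       # first prediction: left segment
    right = [cls, mask] + ids[pos + 1:] + [sep]  # second prediction: right segment
    probs = (predict(left, pos + 1) + predict(right, 1)) / 2
    return tokenizer.convert_ids_to_tokens([int(probs.argmax())])[0]
```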
In an implementation manner of the embodiment of the present disclosure, optionally, the second speech recognition model generating unit 125 includes:
a first training data set acquisition subunit configured to perform generating at least one first training data set from the plurality of first speech recognition samples;
a second training data set acquisition subunit configured to perform generating at least one second training data set from a plurality of second speech recognition samples;
a third training data set obtaining subunit configured to perform generating at least one third training data set according to a plurality of first voice recognition samples and a plurality of second voice recognition samples, where the number of samples included in the first training data set, the number of samples included in the second training data set, and the number of samples included in the third training data set are the same;
a training data alternating training subunit configured to perform alternating input of the first training data set, the second training data set, and the third training data set into the first speech recognition model, and continue training of the first speech recognition model, wherein two adjacent input training data sets are different.
In an implementation manner of the embodiment of the present disclosure, optionally, the training apparatus for the speech recognition model further includes:
a label data post-training unit configured to perform calculating a generalization error of the first speech recognition model while continuing to train the first speech recognition model; and if the generalization error of the first speech recognition model is less than or equal to a first error threshold, continuing to train the first speech recognition model according to the second training data group.
In an implementation manner of the embodiment of the present disclosure, optionally, the second speech recognition model generating unit 125 includes:
a model training completion detection unit configured to perform calculation of a generalization error of the first speech recognition model during training of the first speech recognition model; and if the generalization error of the first speech recognition model is less than or equal to a second error threshold, stopping training the first speech recognition model, and taking the first speech recognition model at the current moment as a second speech recognition model, wherein the second error threshold is less than the first error threshold.
In an implementation manner of the embodiment of the present disclosure, optionally, the training apparatus for the speech recognition model further includes:
a third speech recognition sample acquisition unit configured to perform acquiring a plurality of third speech recognition samples including third speech data and fourth text data before inputting the first speech data into the first speech recognition model;
a first speech recognition model training unit configured to perform training by inputting each of the third speech recognition samples to an initial machine learning model including an attention-based encoder and decoder;
a voice encoding unit configured to perform, in the encoder, extracting and encoding the voice feature in the third voice data to obtain a feature vector;
a text decoding unit configured to perform decoding of the feature vector in the decoder to form predicted text data;
a predicted-text difference calculation unit configured to perform calculation of a difference between the predicted-text data and the third text data;
a first speech recognition model generation unit configured to perform, when the difference satisfies a training condition, taking a current machine learning model as a first speech recognition model.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
FIG. 9 is a block diagram illustrating a speech recognition apparatus according to an example embodiment. Referring to fig. 9, the apparatus includes a to-be-recognized speech data acquiring unit 221, a speech recognition model acquiring unit 222, and a recognized text data acquiring unit 223.
A to-be-recognized voice data acquisition unit 221 configured to perform acquisition of voice data to be recognized;
a speech recognition model obtaining unit 222 configured to perform obtaining of a speech recognition model, wherein the speech recognition model is obtained by training using a training method of a speech recognition model according to any one of the embodiments of the present disclosure;
a recognition text data obtaining unit 223 configured to perform inputting the voice data to be recognized into the voice recognition model, and obtaining the recognition text data output by the voice recognition model.
According to the technical solution of this embodiment, the speech recognition model is obtained by the training method of the speech recognition model according to any embodiment of the present disclosure. A large number of training samples can be generated quickly and automatically, which speeds up sample generation, reduces the labor cost of training the target model, and improves training efficiency; at the same time, the training samples approach manually labeled, accurate samples, which improves the recognition quality of the speech recognition model. Performing speech recognition with a model obtained by this training method reduces the labor cost of producing recognized text while improving recognition efficiency.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 10 is a schematic structural diagram illustrating an electronic device according to an exemplary embodiment. As shown in fig. 10, the electronic device includes:
one or more processors 310 (in fig. 10, one processor 310 is taken as an example); and
a memory 320.
The processor 310 and the memory 320 in the device may be connected by a bus or other means; fig. 10 takes connection by a bus as an example.
The memory 320, as a non-transitory computer-readable storage medium, may be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the training method of a speech recognition model in the embodiments of the present disclosure (for example, the first speech data acquisition unit 121, the first text data acquisition unit 122, the first speech recognition sample generation unit 123, the second speech recognition sample acquisition unit 124, and the second speech recognition model generation unit 125 shown in fig. 8), or the program instructions/modules corresponding to the speech recognition method in the embodiments of the present disclosure (the to-be-recognized speech data acquisition unit 221, the speech recognition model acquisition unit 222, and the recognition text data acquisition unit 223 shown in fig. 9). By running the software programs, instructions, and modules stored in the memory 320, the processor 310 performs the various functional applications and data processing of the computer device, that is, implements the training method of the speech recognition model of the above method embodiment: acquiring first voice data; inputting the first voice data into a first voice recognition model, and acquiring at least one first text data output by the voice recognition model; recognizing second text data from the first text data according to a preset grammar rule, and generating a first voice recognition sample according to the first voice data; acquiring a second voice recognition sample, wherein the second voice recognition sample comprises second voice data and third text data, and the semantics of the second voice data are the same as the semantics of the third text data; and inputting the first voice recognition sample and the second voice recognition sample into the first voice recognition model, and continuing training the first voice recognition model to generate a second voice recognition model.
Alternatively, the processor implements the speech recognition method of the above method embodiment, namely: acquiring voice data to be recognized; acquiring a voice recognition model, wherein the voice recognition model is obtained by training with the training method of the voice recognition model according to any embodiment of the present disclosure; and inputting the voice data to be recognized into the voice recognition model, and acquiring recognized text data output by the voice recognition model.
The memory 320 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the computer device, and the like. Further, the memory 320 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 320 may optionally include memory located remotely from processor 310, which may be connected to the terminal device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 330 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic apparatus. The output device 340 may include a display device such as a display screen.
In an exemplary embodiment, a storage medium comprising instructions is also provided, such as the memory 320 comprising instructions, which are executable by the processor 310 of the electronic device to perform the methods described above. Alternatively, the storage medium may be a non-transitory computer-readable storage medium, which may be, for example, a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
In an exemplary embodiment, a computer program product for use in conjunction with an electronic device is also provided. The computer program product comprises a computer-readable storage medium and a computer program mechanism embedded therein; the program is loaded into and executed by a computer to implement the training method of a speech recognition model according to any embodiment of the present disclosure or the speech recognition method according to any embodiment of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method for training a speech recognition model, comprising:
acquiring first voice data;
inputting the first voice data into a first voice recognition model, and acquiring at least one first text data output by the voice recognition model;
according to a preset grammar rule, second text data are recognized from the first text data, and a first voice recognition sample is generated according to the first voice data, wherein the grammar rule is used for detecting whether the grammar of the first text data is accurate or not;
acquiring a second voice recognition sample, wherein the second voice recognition sample comprises second voice data and third text data, and the semantics of the second voice data are the same as the semantics of the third text data;
and inputting the first voice recognition sample and the second voice recognition sample into the first voice recognition model, and continuing training the first voice recognition model to generate a second voice recognition model.
2. The method for training a speech recognition model according to claim 1, wherein the recognizing second text data from each of the first text data according to a predetermined grammar rule comprises:
extracting matched grammatical features from each first text data, and calculating the grammatical priority of each first text data according to the matched grammatical features;
and comparing the grammar priorities of the first text data to obtain the first text data with the highest grammar priority as second text data.
3. The method for training a speech recognition model according to claim 1, wherein the inputting the first speech recognition sample and the second speech recognition sample into the first speech recognition model and continuing training the first speech recognition model comprises:
generating at least one first training data set from the plurality of first speech recognition samples;
generating at least one second training data set based on the plurality of second speech recognition samples;
generating at least one third training data set according to a plurality of first voice recognition samples and a plurality of second voice recognition samples, wherein the number of samples included in the first training data set, the number of samples included in the second training data set and the number of samples included in the third training data set are the same;
and alternately inputting the first training data set, the second training data set and the third training data set into the first voice recognition model, and continuing training the first voice recognition model, wherein two adjacent input training data sets are different.
4. The method for training a speech recognition model according to claim 3, further comprising, while continuing to train the first speech recognition model:
calculating a generalization error of the first speech recognition model;
and if the generalization error of the first speech recognition model is less than or equal to a first error threshold, continuing to train the first speech recognition model according to the second training data group.
5. The method for training a speech recognition model according to claim 4, wherein the generating a second speech recognition model comprises:
calculating the generalization error of the first speech recognition model in the training process of the first speech recognition model;
and if the generalization error of the first speech recognition model is less than or equal to a second error threshold, stopping training the first speech recognition model, and taking the first speech recognition model at the current moment as a second speech recognition model, wherein the second error threshold is less than the first error threshold.
6. A speech recognition method, comprising:
acquiring voice data to be recognized;
obtaining a speech recognition model, wherein the speech recognition model is obtained by training by adopting the training method of the speech recognition model according to any one of claims 1 to 5;
and inputting the voice data to be recognized into the voice recognition model, and acquiring recognized text data output by the voice recognition model.
7. An apparatus for training a speech recognition model, comprising:
a first voice data acquisition unit configured to perform acquisition of first voice data;
a first text data acquisition unit configured to perform input of the first voice data into a first voice recognition model, and acquire at least one first text data output by the voice recognition model;
a first speech recognition sample generation unit configured to perform recognition of second text data from each of the first text data according to a preset grammar rule, and generate a first speech recognition sample based on the first speech data, the grammar rule being used to detect whether a grammar of each of the first text data is accurate;
a second voice recognition sample acquisition unit configured to perform acquisition of a second voice recognition sample including second voice data and third text data, the semantics of the second voice data and the semantics of the third text data being the same;
and the second voice recognition model generation unit is configured to input the first voice recognition sample and the second voice recognition sample into the first voice recognition model, continue training the first voice recognition model and generate a second voice recognition model.
8. A speech recognition apparatus, comprising:
a to-be-recognized voice data acquisition unit configured to perform acquisition of voice data to be recognized;
a speech recognition model acquisition unit configured to perform acquisition of a speech recognition model trained using a training method of the speech recognition model according to any one of claims 1 to 5;
and the recognition text data acquisition unit is configured to input the voice data to be recognized into the voice recognition model and acquire the recognition text data output by the voice recognition model.
9. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method of training a speech recognition model according to any one of claims 1 to 5 or to implement the method of speech recognition according to claim 6.
10. A storage medium in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform a method of training a speech recognition model according to any one of claims 1 to 5, or to implement a method of speech recognition according to claim 6.
CN202010821094.4A 2020-08-14 2020-08-14 Training of speech recognition model, speech recognition method, apparatus, device and medium Active CN111951789B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010821094.4A CN111951789B (en) 2020-08-14 2020-08-14 Training of speech recognition model, speech recognition method, apparatus, device and medium

Publications (2)

Publication Number Publication Date
CN111951789A CN111951789A (en) 2020-11-17
CN111951789B true CN111951789B (en) 2021-08-17

Family

ID=73342928

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010821094.4A Active CN111951789B (en) 2020-08-14 2020-08-14 Training of speech recognition model, speech recognition method, apparatus, device and medium

Country Status (1)

Country Link
CN (1) CN111951789B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112633136B (en) * 2020-12-18 2024-03-22 深圳追一科技有限公司 Video analysis method, device, electronic equipment and storage medium
CN112530415B (en) * 2021-02-10 2021-07-16 北京百度网讯科技有限公司 Negative reply recognition model acquisition and negative reply recognition method and device
CN113656575B (en) * 2021-07-13 2024-02-02 北京搜狗科技发展有限公司 Training data generation method and device, electronic equipment and readable medium
CN113610354A (en) * 2021-07-15 2021-11-05 北京淇瑀信息科技有限公司 Policy distribution method and device for third-party platform user and electronic equipment
CN113270104B (en) * 2021-07-19 2021-10-15 深圳市思特克电子技术开发有限公司 Artificial intelligence processing method and system for voice
CN113689860B (en) * 2021-07-29 2024-05-10 北京捷通华声科技股份有限公司 Training of voice recognition model, voice recognition method, device and equipment
CN117556232B (en) * 2023-11-30 2024-06-04 广州方舟信息科技有限公司 Scoring model training method, medicine question-answering method and related devices

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105185373A (en) * 2015-08-06 2015-12-23 百度在线网络技术(北京)有限公司 Rhythm-level prediction model generation method and apparatus, and rhythm-level prediction method and apparatus
CN107808662A (en) * 2016-09-07 2018-03-16 阿里巴巴集团控股有限公司 Update the method and device in the syntax rule storehouse of speech recognition
CN108630197A (en) * 2017-03-23 2018-10-09 三星电子株式会社 Training method and equipment for speech recognition
CN108962228A (en) * 2018-07-16 2018-12-07 北京百度网讯科技有限公司 model training method and device
CN110277089A (en) * 2019-07-09 2019-09-24 广东美的制冷设备有限公司 Update method, household electrical appliance and the server of offline speech recognition modeling
CN110853628A (en) * 2019-11-18 2020-02-28 苏州思必驰信息科技有限公司 Model training method and device, electronic equipment and storage medium
CN111462734A (en) * 2020-03-31 2020-07-28 苏州思必驰信息科技有限公司 Semantic slot filling model training method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8335688B2 (en) * 2004-08-20 2012-12-18 Multimodal Technologies, Llc Document transcription system training

Also Published As

Publication number Publication date
CN111951789A (en) 2020-11-17

Similar Documents

Publication Publication Date Title
CN111951789B (en) Training of speech recognition model, speech recognition method, apparatus, device and medium
CN106297800B (en) Self-adaptive voice recognition method and equipment
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN110781663B (en) Training method and device of text analysis model, text analysis method and device
CN107844481B (en) Text recognition error detection method and device
CN111145733B (en) Speech recognition method, speech recognition device, computer equipment and computer readable storage medium
CN108710704B (en) Method and device for determining conversation state, electronic equipment and storage medium
CN109522558B (en) Deep learning-based Chinese character-staggering correction method
CN112183064B (en) Text emotion reason recognition system based on multi-task joint learning
CN112257437B (en) Speech recognition error correction method, device, electronic equipment and storage medium
CN109036471B (en) Voice endpoint detection method and device
CN110852040B (en) Punctuation prediction model training method and text punctuation determination method
CN115050077A (en) Emotion recognition method, device, equipment and storage medium
CN116127953B (en) Chinese spelling error correction method, device and medium based on contrast learning
CN112116907A (en) Speech recognition model establishing method, speech recognition device, speech recognition equipment and medium
CN114067786A (en) Voice recognition method and device, electronic equipment and storage medium
CN115563959A (en) Chinese pinyin spelling error correction-oriented self-supervision pre-training method, system and medium
CN115455946A (en) Voice recognition error correction method and device, electronic equipment and storage medium
CN109166569B (en) Detection method and device for phoneme mislabeling
CN112380861A (en) Model training method and device and intention identification method and device
CN115512692B (en) Voice recognition method, device, equipment and storage medium
CN116189657A (en) Multi-mode voice recognition error correction method and system
CN115240712A (en) Multi-mode-based emotion classification method, device, equipment and storage medium
CN112883221B (en) Semantic information correction method and device and intelligent cabin
CN115527520A (en) Anomaly detection method, device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant