CN111179916A - Re-scoring model training method, voice recognition method and related device - Google Patents



Publication number
CN111179916A
Authority
CN
China
Prior art keywords
recognition result
voice recognition
model
label
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911413152.3A
Other languages
Chinese (zh)
Other versions
CN111179916B (en)
Inventor
李安
陈江
胡正伦
傅正佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Baiguoyuan Information Technology Co Ltd
Original Assignee
Guangzhou Baiguoyuan Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Baiguoyuan Information Technology Co Ltd filed Critical Guangzhou Baiguoyuan Information Technology Co Ltd
Priority to CN201911413152.3A priority Critical patent/CN111179916B/en
Publication of CN111179916A publication Critical patent/CN111179916A/en
Priority to PCT/CN2020/138536 priority patent/WO2021136029A1/en
Application granted granted Critical
Publication of CN111179916B publication Critical patent/CN111179916B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/69 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for evaluating synthetic or decoded voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention discloses a re-scoring model training method, a voice recognition method, and a related device. The training method comprises the following steps: acquiring multiple voice recognition results of a voice data sample and a first label of the voice data sample, the first label being a pre-labeled label; acquiring scores of each voice recognition result under multiple different language models; obtaining a sample feature vector and a second label of the voice data sample based on the voice recognition results, the scores, and the first label; and training a model with the sample feature vectors and second labels to obtain the re-scoring model. The embodiment mines the intrinsic association between the second label and the scores given by the different language models to obtain the optimal way of combining those scores, eliminating artificial subjectivity and ensuring the accuracy of the voice recognition result; even if the scoring mechanism of any language model changes, the weights among the scores need not be modified, which improves the generality and universality of the re-scoring model.

Description

Re-scoring model training method, voice recognition method and related device
Technical Field
The embodiment of the invention relates to the technical field of voice recognition, and in particular to a re-scoring model training method and device, a voice recognition method and device, equipment, and a storage medium.
Background
Automatic Speech Recognition (ASR) is a technology that converts speech into text, and ASR can be applied to speech translation, man-machine interaction, smart home and other application scenarios.
During the decoding process of speech recognition, one piece of speech data can yield multiple speech recognition results. For example, for the speech content "I am a good student", the decoding process may produce candidates such as "grasp is a sound learning sign", "wonderful rise in time of nest", "bedroom good students", and "I am a good student" (the first three being near-homophone mistranscriptions of the correct sentence in the original Chinese), and it must then be decided which candidate is the most accurate or reasonable.
In the prior art, each speech recognition result is generally scored: the higher the score, the more reasonable or accurate the result is considered to be. However, a single score is a weak criterion of accuracy, so each speech recognition result is scored by several language models and the scores are then judged comprehensively.
However, in the existing scoring mechanism, the multiple scores of each recognition result are either added directly or combined into a total score with manually set weights. On the one hand, the final score is then influenced by human subjectivity and its accuracy suffers; on the other hand, whenever one scoring mechanism changes, its weight must be reset, so the approach has poor applicability.
Disclosure of Invention
The embodiment of the invention provides a re-scoring model training method and device, a voice recognition method and device, equipment, and a storage medium, aiming to solve the problems in the prior art that the re-scoring of voice recognition is strongly influenced by human subjectivity and has poor applicability.
In a first aspect, an embodiment of the present invention provides a method for training a re-scoring model, including:
acquiring a plurality of voice recognition results of voice data samples and a first label of the voice data samples, wherein the first label is a pre-labeled label of the voice data samples;
acquiring the scores of each voice recognition result under a plurality of different language models;
obtaining a sample feature vector and a second label of the voice data sample for re-scoring model training based on the voice recognition result, the score and the first label;
and training a model by adopting the sample characteristic vector and the second label to obtain a re-scoring model for re-scoring the voice recognition result.
In a second aspect, an embodiment of the present invention provides a speech recognition method, including:
acquiring a plurality of voice recognition results of voice data to be recognized;
acquiring the scores of each voice recognition result under a plurality of different language models;
obtaining a feature vector of the voice data to be recognized based on the voice recognition result and the score;
inputting the feature vectors into a pre-trained re-scoring model to obtain a final score of each voice recognition result;
determining the voice recognition result with the minimum final score as the final recognition result of the voice data to be recognized;
the re-scoring model is trained by the re-scoring model training method provided by the embodiment of the invention.
In a third aspect, an embodiment of the present invention provides a re-scoring model training device, including:
the voice recognition result and first label acquisition module is used for acquiring a plurality of voice recognition results of voice data samples and first labels of the voice data samples, wherein the first labels are pre-labeled labels of the voice data samples;
the scoring module is used for acquiring scores of each voice recognition result under a plurality of different language models;
a sample feature vector and second label obtaining module, configured to obtain, based on the speech recognition result, the score and the first label, a sample feature vector and a second label, which are used by the speech data sample for re-scoring model training;
and the model training module is used for training a model by adopting the sample characteristic vector and the second label to obtain a re-scoring model for re-scoring the voice recognition result.
In a fourth aspect, an embodiment of the present invention provides a speech recognition apparatus, including:
the voice recognition result acquisition module is used for acquiring a plurality of voice recognition results of the voice data to be recognized;
the initial score acquisition module is used for acquiring the scores of each voice recognition result under a plurality of different language models;
a feature vector obtaining module, configured to obtain a feature vector of the to-be-recognized speech data based on the speech recognition result and the score;
the final score prediction module is used for inputting the feature vector into a pre-trained re-scoring model to obtain a final score of each voice recognition result;
the voice recognition result determining module is used for determining the voice recognition result with the minimum final score as the final recognition result of the voice data to be recognized;
the re-scoring model is trained by the re-scoring model training method in any one of the embodiments of the invention.
In a fifth aspect, an embodiment of the present invention provides an apparatus, where the apparatus includes:
one or more processors;
a storage device for storing one or more programs,
which, when executed by the one or more processors, cause the one or more processors to implement a re-scoring model training method and/or a speech recognition method as described in any of the embodiments of the invention.
In a sixth aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements a re-scoring model training method and/or a speech recognition method according to any embodiment of the present invention.
In the embodiment of the invention, after multiple voice recognition results of a voice data sample and the first label of the voice data sample are obtained, the scores of each voice recognition result under multiple different language models are acquired; a sample feature vector and a second label used for re-scoring model training are obtained from the voice recognition results, the scores, and the first label; and the model is trained with the sample feature vectors and second labels to obtain the re-scoring model. By training on these sample feature vectors and second labels, the embodiment mines the intrinsic association between the second label and the scores given by the multiple different language models, obtains the optimal combination of the models' scores, eliminates artificial subjectivity, and ensures the accuracy of the voice recognition result; even if the scoring mechanism of any language model changes, the weights among the scores need not be modified, which improves the generality and universality of the re-scoring model.
Drawings
FIG. 1 is a flowchart illustrating steps of a re-scoring model training method according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a weighted directed acyclic graph obtained after decoding a speech data sample according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating steps of a re-scoring model training method according to a second embodiment of the present invention;
FIG. 4 is a flowchart illustrating steps of a speech recognition method according to a third embodiment of the present invention;
fig. 5 is a block diagram of a re-scoring model training apparatus according to a fourth embodiment of the present invention;
fig. 6 is a block diagram of a speech recognition apparatus according to a fifth embodiment of the present invention;
fig. 7 is a block diagram of a device according to a sixth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
Fig. 1 is a flowchart of the steps of a re-scoring model training method according to an embodiment of the present invention. The method is applicable to the situation of training a re-scoring model and may be executed by the re-scoring model training device of the embodiment, which may be implemented in hardware or software and integrated into the apparatus of the embodiment. Specifically, as shown in fig. 1, the re-scoring model training method may include the following steps:
s101, obtaining a plurality of voice recognition results of voice data samples and first labels of the voice data samples, wherein the first labels are pre-labeled labels of the voice data samples.
In the embodiment of the invention, a piece of voice data can yield multiple voice recognition results after passing through the speech recognition encoding-decoding model, each result comprising a series of ordered characters and words; that is, multiple speech-recognition decoding paths can be obtained for one piece of voice data.
As shown in fig. 2, the speech recognition decoding result is a weighted directed acyclic graph. Each path through the graph represents one sequence of candidate words from the decoding process: each circle in fig. 2 represents a character or word of the decoded speech data, each edge carries a weight, and every path from the leftmost circle to the rightmost circle is regarded as one speech recognition result. Meanwhile, the voice data sample has a manually labeled, ground-truth transcription, and this real transcription is the first label of the voice data sample. Illustratively, for the speech content "I am a good student", the decoding process may yield near-homophone candidates alongside the correct sentence "I am a good student".
Alternatively, for the voice data sample, the multiple speech recognition results may be obtained with an Encoder-Decoder (encoding-decoding) model; of course, they may also be obtained in other ways, for example, generated manually.
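To make the lattice-path picture concrete, the following toy sketch (all node names, words, and edge weights are invented for illustration) enumerates every path of a small weighted directed acyclic graph as one candidate recognition result, accumulating edge weights along the way; in this toy convention a lower accumulated weight means a better path.

```python
# Hypothetical toy lattice: node -> list of (next_node, word, weight).
# Every path from node 0 to node 3 is one candidate recognition result.
LATTICE = {
    0: [(1, "I", 0.1), (1, "eye", 0.9)],
    1: [(2, "am", 0.2)],
    2: [(3, "good", 0.3), (3, "wood", 0.8)],
}

def enumerate_paths(lattice, start=0, end=3):
    """Depth-first enumeration of all decoding paths with their total weight."""
    results = []

    def walk(node, words, weight):
        if node == end:
            results.append((" ".join(words), weight))
            return
        for nxt, word, w in lattice.get(node, []):
            walk(nxt, words + [word], weight + w)

    walk(start, [], 0.0)
    # Sort best-first: lower accumulated weight = better in this toy.
    return sorted(results, key=lambda r: r[1])

paths = enumerate_paths(LATTICE)
```

Each entry of `paths` corresponds to one "speech recognition result" in the sense of the text: an ordered word sequence read off one lattice path.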
And S102, acquiring scores of each voice recognition result under a plurality of different language models.
In an embodiment of the present invention, a language model constructs a probability distribution p(s) over character strings s, where p(s) expresses the probability that the character string s is a sentence, i.e., the probability that this combination of characters forms a sentence in natural language (human speech).
After the voice recognition results are obtained, each result can be input into multiple different language models to obtain its scores; each score expresses the probability that the voice recognition result conforms to natural language. Optionally, the different language models may be an acoustic model, an n-gram language model, and an RNNLM model.
The acoustic model represents knowledge of acoustics, phonetics, environmental variability, speaker gender, accent, and so on; it may be trained with LSTM + CTC to obtain a mapping from speech features to phonemes. The task of the acoustic model is to give the probability of the observed speech given a hypothesized text.
The n-gram language model is a statistics-based language model that predicts the n-th word from the preceding (n-1) words; that is, it computes the probability of a sentence as the probability of the series of words that make it up.
The RNNLM model is a language model trained with an RNN or one of its variant networks; its task is to predict the next word from the preceding context.
Of course, in practical applications, a person skilled in the art can also score each speech recognition result through other language models, and the embodiment of the present invention does not limit what kind of language model is used to score the speech recognition results, nor does it limit the number of language models.
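As an illustration of scoring each candidate under a language model, the toy sketch below uses an invented bigram count table with add-one smoothing as a stand-in for one of the scorers; a real system would use the trained acoustic, n-gram, and RNNLM models named above, each contributing one score per candidate.

```python
import math

# Invented bigram counts standing in for a trained n-gram model.
BIGRAM_COUNTS = {("i", "am"): 5, ("am", "a"): 4, ("a", "good"): 3,
                 ("good", "student"): 6}

def ngram_score(sentence, counts=BIGRAM_COUNTS):
    """Toy bigram log-score with add-one smoothing over a tiny table."""
    words = sentence.lower().split()
    total = sum(counts.values())
    score = 0.0
    for bigram in zip(words, words[1:]):
        score += math.log((counts.get(bigram, 0) + 1) / (total + 1))
    return score

def score_candidates(candidates, scorers):
    """Map each candidate to its list of scores, one per language model."""
    return {c: [s(c) for s in scorers] for c in candidates}

cands = ["I am a good student", "I am a wood student"]
scores = score_candidates(cands, [ngram_score])
```

In this sketch the plausible sentence outscores the near-homophone error because its bigrams occur in the count table, which is the behavior the text expects from each language model.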
S103, obtaining a sample feature vector and a second label of the voice data sample for re-scoring model training based on the voice recognition result, the score and the first label.
In an optional embodiment of the present invention, for each speech recognition result, the result may be analyzed to extract word-frequency statistics, word-ordering features, sentence-length features, word counts, and the like as its sentence and word structural features. The scores of the speech recognition result under the multiple different language models and these structural features are combined into the sample feature vector of the speech recognition result, and the character error rate of the speech recognition result, computed against the first label of the voice data sample, is used as its second label.
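A minimal sketch of assembling such a sample feature vector, assuming three per-model scores and the simple structural features named in the text (word count, sentence length, a word-frequency statistic); the particular features chosen here are illustrative, not a fixed list from the patent.

```python
from collections import Counter

def sentence_features(candidate):
    """Structural features: word count, character length of the sentence,
    and the maximum repeat count of any word (a word-frequency statistic)."""
    words = candidate.split()
    freq = Counter(words)
    return [len(words), len(candidate), max(freq.values())]

def sample_feature_vector(candidate, lm_scores):
    # Concatenate the per-model scores with the structural features.
    return list(lm_scores) + sentence_features(candidate)

# Hypothetical scores from three language models for one candidate.
vec = sample_feature_vector("I am a good student", [-3.2, -4.1, -2.8])
```

The resulting vector, paired with the candidate's character error rate as the second label, forms one training example for the re-scoring model.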
And S104, training a model with the sample feature vectors and the second labels to obtain a re-scoring model for re-scoring the voice recognition results.
Specifically, the sample feature vectors may be input into a model with initialized parameters to obtain an estimated character error rate for each speech recognition result; the loss rate is then computed from the estimated character error rate and the second label of the speech recognition result. Training stops when the loss rate reaches a preset value; otherwise, the model parameters are adjusted according to the loss rate and training iterates until the loss rate satisfies the preset condition, yielding a re-scoring model for re-scoring speech recognition results. That is, for the multiple speech recognition results of each voice data sample, the re-scoring model predicts a character error rate as the final score, and the speech recognition result with the lowest final score is the optimal recognition result of the voice data sample.
According to the embodiment of the invention, the sample feature vector and the second label of the voice data sample used for re-scoring model training are obtained from the voice recognition results, the scores, and the first label. The intrinsic association between the second label and the scores of the multiple different language models is mined, so that the optimal combination of the models' scores is obtained, artificial subjectivity is eliminated, and the accuracy of the voice recognition result is ensured; even if the scoring mechanism of any language model changes, the weights among the scores need not be modified, which improves the generality and universality of the re-scoring model.
Example two
Fig. 3 is a flowchart of the steps of a re-scoring model training method according to a second embodiment of the present invention, which is optimized on the basis of the first embodiment. Specifically, as shown in fig. 3, the re-scoring model training method of the second embodiment may include the following steps:
s201, inputting the voice data samples into a decoding model to obtain a plurality of voice recognition results, wherein the voice data samples have pre-labeled first labels.
In an embodiment of the present invention, the voice data sample may be any voice data, which may be input into a speech recognition decoding model (e.g., an Encoder-Decoder) to obtain multiple recognition results. Each recognition result carries a probability expressing how likely it is to be the correct transcription, and the pre-labeled first label is the real text corresponding to the manually labeled voice data sample.
S202, extracting a preset number of voice recognition results as a plurality of voice recognition results of the voice data samples.
Specifically, all the speech recognition results may be ranked by the probability of each result, and the top-K results extracted as the multiple speech recognition results of the voice data sample.
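A sketch of this top-K selection, with an invented set of decoder hypotheses and their path probabilities:

```python
def top_k_results(results, k):
    """results: list of (candidate, probability) pairs from the decoder.
    Keep the K most probable hypotheses, as described in S202."""
    return sorted(results, key=lambda r: r[1], reverse=True)[:k]

# Hypothetical decoder output: candidates with their path probabilities.
decoded = [("bedroom good students", 0.05),
           ("I am a good student", 0.60),
           ("I am a wood student", 0.25)]
best_two = top_k_results(decoded, 2)
```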
S203, inputting each voice recognition result into a plurality of different language models respectively, and obtaining scores of the voice recognition results under the different language models.
In alternative embodiments of the present invention, the language models may comprise three models: an acoustic model, an n-gram language model, and an RNNLM language model. After the multiple speech recognition results are obtained, each result can be input into the acoustic model, the n-gram language model, and the RNNLM language model to obtain 3 scores per result. Of course, in practical applications, a person skilled in the art may also input the speech recognition results into other language models; the embodiment of the present invention limits neither the kind nor the number of language models.
S204, aiming at each voice recognition result, analyzing the voice recognition result to extract sentence and word structural characteristics of the voice recognition result.
In the embodiment of the present invention, the speech recognition result is composed of a series of ordered characters and words, so statistics such as the number of characters and words it contains, the frequency of occurrence of those characters and words, the sentence length, and the word order can be counted as its sentence and word structural features.
S205, combining the scores of the voice recognition results under a plurality of different language models and the sentence word structure characteristics into a sample feature vector of the voice recognition results.
Specifically, the multiple scores of the speech recognition result and its sentence-word structural features may be concatenated into one sample feature vector, e.g., (score 1, score 2, score 3, word-frequency statistic, word-ordering feature, sentence-length feature, word count).
S206, calculating the character error rate of the voice recognition result by adopting the voice recognition result and the first label to be used as a second label of the voice recognition result.
The Character Error Rate (CER) is a scoring criterion used to evaluate the quality of an ASR model. Here it is determined by the total number of insertions, deletions, and substitutions needed to turn the predicted value into the true value: a voice data sample has a ground-truth first label, which is the real text corresponding to the sample; a speech recognition result is not necessarily that real text, so characters must be inserted, deleted, and substituted to transform it into the real text, and the counted number of such operations is the character error rate.
For example, if the real-text label is "I am a good student" and the speech recognition result is "holding a good student", one word must be substituted and one word inserted, so the character error rate is determined to be 2.
For each voice recognition result, it can be compared with the first label of the voice data sample to calculate its character error rate, which serves as the second label of that recognition result.
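The counting convention above (the total number of insertions, deletions, and substitutions) is the classic edit-distance computation; a self-contained sketch:

```python
def char_error_count(hyp, ref):
    """Minimum number of insertions, deletions, and substitutions that
    turn hypothesis hyp into reference ref (the counting convention in
    the text; dividing by len(ref) would give a rate)."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all of hyp[:i]
    for j in range(n + 1):
        d[0][j] = j                      # insert all of ref[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[m][n]
```

Applied per candidate against the first label, this count is exactly the second label used for training.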
And S207, carrying out normalization processing on the sample characteristic vector to obtain the normalized sample characteristic vector.
In an optional embodiment of the present invention, the maximum and minimum sample feature vectors may be determined from all sample feature vectors of a voice data sample, their difference computed as the vector difference value, and each sample feature vector scaled by that difference to obtain its normalized form. The min-max normalization can be written as:

x'_i = (x_i - x_min) / (x_max - x_min)

where x_i is the sample feature vector corresponding to the i-th voice recognition result, x'_i is the normalized sample feature vector, and x_max and x_min are the maximum and minimum sample feature vectors among the voice recognition results of the voice data sample. Normalization unifies the sample feature vectors of the voice recognition results under one scale, which aids their quantitative expression, provides high-quality training data for model training, and improves the precision of model training.
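A sketch of this min-max normalization, applied here per feature dimension (the per-dimension treatment is an assumption made for the sketch; the formula in the text is stated over whole vectors):

```python
def min_max_normalize(vectors):
    """Per-dimension min-max scaling of a list of feature vectors,
    a sketch of step S207. Dimensions with zero range map to 0.0."""
    dims = len(vectors[0])
    lo = [min(v[d] for v in vectors) for d in range(dims)]
    hi = [max(v[d] for v in vectors) for d in range(dims)]
    return [[(v[d] - lo[d]) / (hi[d] - lo[d]) if hi[d] > lo[d] else 0.0
             for d in range(dims)]
            for v in vectors]
```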
And S208, initializing model parameters.
Specifically, the model of the embodiment of the present invention may be a machine learning model such as linear regression, a support vector machine, or a decision tree; the present example uses linear regression, whose modeling equation is:

y_i = a · x_i + b

where a is the coefficient vector, x_i is the i-th sample feature vector, b is the bias, and y_i is the estimated character error rate. After initializing a, the purpose of model training is to obtain the optimal a so that y_i approximates the second label.
S209, inputting the normalized sample feature vector of each speech recognition result into the model to obtain the estimated character error rate of that speech recognition result.
Specifically, the normalized sample feature vectors obtained in S207, i.e., the x'_i, may be input into the initialized model; for each x'_i the model outputs an estimated character error rate y_i.
S210, calculating the loss rate by adopting the estimated character error rate and the second label.
In the embodiment of the present invention, the loss function is the mean square loss function:

mse_loss = (1/k) Σ_{i=1..k} (y_i - ŷ_i)²

where mse_loss is the loss rate, y_i is the estimated character error rate of the i-th speech recognition result, ŷ_i is its second label (the true character error rate), and k is the number of speech recognition results of one voice data sample. Substituting the estimated character error rates and the second labels into the mean square loss function yields the loss rate.
S211, when the loss rate does not meet the preset condition, calculating a gradient by adopting the loss rate.
If the calculated loss rate is smaller than a preset threshold, iteration of the model stops; otherwise, the gradient is calculated from the loss rate, specifically using a preset gradient algorithm.
S212, adjusting the model parameters by adopting the gradient, and returning to S209.
Specifically, gradient descent may be performed on the current model parameters using the gradient and a preset learning rate, after which the procedure returns to S209 and continues iterating the model until the loss rate is smaller than the preset threshold; alternatively, training may stop when the number of iterations reaches a preset count. The result is a re-scoring model for re-scoring voice recognition results.
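Steps S208 to S212 can be sketched as plain gradient descent on a linear model under the mean square loss; the learning rate, epoch count, and explicit bias term below are illustrative choices for the sketch, not values taken from the text.

```python
def train_rescorer(X, y, lr=0.01, epochs=2000):
    """Gradient-descent linear regression: learn coefficients a and bias b
    so that a . x + b approximates the character error rate (S208-S212)."""
    n, dims = len(X), len(X[0])
    a = [0.0] * dims   # S208: initialize model parameters
    b = 0.0
    for _ in range(epochs):
        grad_a = [0.0] * dims
        grad_b = 0.0
        for xi, yi in zip(X, y):
            pred = sum(w * f for w, f in zip(a, xi)) + b  # S209: estimate
            err = pred - yi                               # S210: loss term
            for d in range(dims):                         # S211: gradient
                grad_a[d] += 2 * err * xi[d] / n
            grad_b += 2 * err / n
        a = [w - lr * g for w, g in zip(a, grad_a)]       # S212: update
        b -= lr * grad_b
    return a, b

# Fit a toy relation (invented data): label = 2 * feature + 1.
weights, bias = train_rescorer([[0.0], [1.0], [2.0], [3.0]],
                               [1.0, 3.0, 5.0, 7.0])
```

A real run would use the normalized sample feature vectors as X and the character error rates (second labels) as y, and stop on the loss threshold rather than a fixed epoch count.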
The embodiment of the invention inputs voice data samples into a decoding model to obtain multiple voice recognition results and extracts a preset number of them as the voice recognition results of the voice data sample. Each voice recognition result is input into multiple different language models to obtain its scores under those models; each result is then analyzed to extract its sentence and word structural features, and the scores and structural features are combined into the sample feature vector of that result. The character error rate, computed from the voice recognition result and the first label of the voice data sample, serves as the second label, and the re-scoring model is trained with the sample feature vectors and second labels. In this way, the intrinsic association between the second label and the scores of the multiple different language models can be mined to obtain the optimal combination of the models' scores, eliminating artificial subjectivity and ensuring the accuracy of the voice recognition result; even if the scoring mechanism of any language model changes, the weights among the scores need not be modified, which improves the generality and universality of the re-scoring model.
Further, the character error rate calculated from the voice recognition result and the first label is used as the second label of the voice recognition result, so that the model indirectly learns the character error rate of the voice data sample through the voice recognition result, and a better voice recognition result can be obtained when the model is used.
Example Three
Fig. 4 is a flowchart of the steps of a speech recognition method according to a third embodiment of the present invention. The speech recognition method of this embodiment is applicable to speech recognition scenarios and may be executed by a speech recognition apparatus of the present invention, which may be implemented in hardware or software and integrated in a device. Specifically, as shown in fig. 4, the speech recognition method may include the following steps:
S301, obtaining a plurality of voice recognition results of the voice data to be recognized.
In the embodiment of the present invention, the speech data to be recognized may be any data whose speech needs to be converted into text, for example, speech data in a short video, speech data on a chat interface of an instant messaging application, and the like.
S302, obtaining scores of each voice recognition result under a plurality of different language models.
Optionally, each speech recognition result may be input into the acoustic model, the n-gram language model, and the RNNLM language model, respectively, to obtain 3 scores for each speech recognition result.
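The per-model scoring in S302 can be sketched as follows. The three stand-in scorers are hypothetical placeholders for the acoustic model, n-gram LM, and RNNLM scores (the real models are not reproduced here); only the shape of the computation — one score per hypothesis per model — follows the text.

```python
def collect_model_scores(hypotheses, scorers):
    """Score every recognition hypothesis under every model.

    hypotheses: list of candidate transcripts for one utterance.
    scorers:    list of callables, one per model (e.g. acoustic model,
                n-gram LM, RNNLM), each mapping a transcript to a score.
    Returns a list of score vectors, one per hypothesis.
    """
    return [[scorer(hyp) for scorer in scorers] for hyp in hypotheses]

# Hypothetical stand-in scorers for illustration only:
acoustic_score = lambda hyp: 0.1 * len(hyp)            # placeholder
ngram_score    = lambda hyp: 0.2 * len(hyp.split())    # placeholder
rnnlm_score    = lambda hyp: 0.05 * len(set(hyp))      # placeholder
```

Each hypothesis thus yields a 3-element score vector that later joins the sentence-word structure features in the feature vector.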
S303, obtaining a feature vector of the voice data to be recognized based on the voice recognition result and the score.
Specifically, reference may be made to S204-S207 in the second embodiment, which will not be described in detail herein.
S304, inputting the feature vectors into a pre-trained re-scoring model to obtain the final score of each voice recognition result.
In the embodiment of the present invention, the re-scoring model may be trained by the re-scoring model training method described in the first or second embodiment. The re-scoring model re-scores the plurality of speech recognition results of the speech data to be recognized: after the feature vector is input into the pre-trained re-scoring model, the final score of each speech recognition result is obtained.
S305, determining the voice recognition result with the minimum final score as the final recognition result of the voice data to be recognized.
In the embodiment of the invention, the final score expresses the character error rate of the voice recognition result relative to the real result; the smaller the character error rate, the closer the voice recognition result is to the real result. Therefore, the voice recognition result with the minimum final score can be determined as the final recognition result of the voice data to be recognized.
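The selection rule of S305 (the smallest final score, i.e. the smallest estimated character error rate, wins) can be sketched as a one-liner; the function name is an assumption for illustration.

```python
def pick_final_result(hypotheses, final_scores):
    """Return the hypothesis with the minimum final score.

    The final score estimates the character error rate of each
    hypothesis, so the smallest score marks the candidate closest
    to the true transcript.
    """
    best_index = min(range(len(final_scores)), key=final_scores.__getitem__)
    return hypotheses[best_index]
```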
When the re-scoring model is trained, the sample feature vector and the second label of the voice data sample are obtained based on the voice recognition result, the score, and the first label, and the inherent correlation between the second label and the scores produced by the plurality of different language models is mined to obtain an optimal way of combining the scores of the different language models. Therefore, when the plurality of voice recognition results of the voice data to be recognized are re-scored by the re-scoring model, subjective human factors are eliminated, ensuring the accuracy of the voice recognition result; moreover, even if the scoring mechanism of any language model changes, the weights among the scores do not need to be modified, improving the generality and universality of the re-scoring model.
Example Four
Fig. 5 is a block diagram of a structure of a re-scoring model training apparatus according to a fourth embodiment of the present invention, and as shown in fig. 5, the re-scoring model training apparatus according to the fourth embodiment of the present invention may specifically include the following modules:
a voice recognition result and first label obtaining module 401, configured to obtain a plurality of voice recognition results of a voice data sample and a first label of the voice data sample, where the first label is a pre-labeled label of the voice data sample;
a scoring module 402, configured to obtain a score of each speech recognition result under a plurality of different language models;
a sample feature vector and second label obtaining module 403, configured to obtain, based on the speech recognition result, the score and the first label, a sample feature vector and a second label of the speech data sample for re-scoring model training;
and a model training module 404, configured to train a model by using the sample feature vector and the second label to obtain a re-scoring model for re-scoring the speech recognition result.
Optionally, the voice recognition result and first label obtaining module 401 includes:
the decoding submodule is used for inputting the voice data samples into a decoding model to obtain a plurality of voice recognition results, and the voice data samples are provided with first labels which are labeled in advance;
and the voice recognition result extraction submodule is used for extracting a preset number of voice recognition results as a plurality of voice recognition results of the voice data samples.
Optionally, the scoring module 402 includes:
and the scoring model input submodule is used for respectively inputting each voice recognition result into a plurality of different language models to obtain scores of the voice recognition results under the different language models.
Optionally, the language models include acoustic models, n-gram language models, and RNNLM language models.
Optionally, the sample feature vector and second tag obtaining module 403 includes:
the sentence-word structure feature obtaining submodule is used for analyzing each voice recognition result to extract the sentence-word structure features of the voice recognition result;
the feature combination submodule is used for combining the scores of the voice recognition result under the plurality of different language models and the sentence-word structure features into the sample feature vector of the voice recognition result;
and the second label obtaining submodule is used for calculating the character error rate of the voice recognition result using the voice recognition result and the first label, as the second label of the voice recognition result.
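As a sketch of how the second label might be computed: the character error rate is commonly realized as the Levenshtein edit distance between the recognition result and the first label, divided by the label length. The patent does not spell out the formula, so this standard realization is an assumption.

```python
def character_error_rate(hypothesis, reference):
    """Character error rate: edit distance between the recognition
    result and the reference (the first label), divided by the
    reference length. Standard Levenshtein dynamic program."""
    m, n = len(hypothesis), len(reference)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hypothesis[i - 1] == reference[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n] / max(n, 1)
```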
Optionally, the sentence-word structure features comprise at least one of:
word frequency statistics features, character or word ordering features, sentence length features, and word count.
Optionally, the method further comprises:
and the feature normalization processing module is used for normalizing the sample feature vector to obtain the normalized sample feature vector.
Optionally, the feature normalization processing module includes:
the maximum and minimum sample feature vector determining submodule is used for determining the maximum sample feature vector and the minimum sample feature vector among all sample feature vectors;
the difference value calculation submodule is used for calculating the difference between the maximum sample feature vector and the minimum sample feature vector to obtain a vector difference value;
and the sample feature vector calculation submodule is used for calculating the ratio of the sample feature vector to the vector difference value as the normalized sample feature vector.
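The normalization submodules above can be sketched as follows. Note that the code follows the text literally (dividing each sample feature vector by the element-wise difference between the maximum and minimum sample feature vectors); the more common min-max form additionally subtracts the minimum first.

```python
import numpy as np

def normalize_features(sample_vectors):
    """Normalize sample feature vectors as described: divide each
    vector by the element-wise difference between the maximum and
    minimum sample feature vectors over the whole sample set."""
    vectors = np.asarray(sample_vectors, dtype=float)
    v_max = vectors.max(axis=0)   # maximum sample feature vector
    v_min = vectors.min(axis=0)   # minimum sample feature vector
    diff = v_max - v_min          # vector difference value
    diff[diff == 0] = 1.0         # guard against a zero-range dimension
    return vectors / diff
```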
Optionally, the model training module 404 includes:
the initialization model submodule is used for initializing model parameters;
the feature input submodule is used for inputting the normalized sample feature vector of the speech recognition result into the model to obtain the estimated character error rate of the speech recognition result;
the loss rate calculation submodule is used for calculating the loss rate by adopting the estimated character error rate and the second label;
the gradient calculation submodule is used for calculating a gradient by adopting the loss rate when the loss rate does not meet a preset condition;
and the model parameter adjusting submodule is used for adjusting the model parameters by adopting the gradient and returning to the characteristic input submodule.
Optionally, the loss rate calculation sub-module includes:
and the loss rate calculation unit is used for substituting the estimated character error rate and the second label into a preset mean square loss function to calculate the loss rate.
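A minimal sketch of the mean-square loss used by the loss rate calculation unit, assuming the estimated character error rates and the second labels are given as plain lists:

```python
def mean_square_loss(estimated_cer, second_labels):
    """Mean-square loss between the estimated character error rates
    and the second labels (the true character error rates)."""
    assert len(estimated_cer) == len(second_labels)
    n = len(estimated_cer)
    return sum((p - t) ** 2 for p, t in zip(estimated_cer, second_labels)) / n
```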
The re-scoring model training device provided by the embodiment of the invention can execute the re-scoring model training method provided by the first or second embodiment of the invention, and has the corresponding functional modules and beneficial effects of the executed method.
Example Five
Fig. 6 is a block diagram of a speech recognition apparatus according to a fifth embodiment of the present invention, and as shown in fig. 6, the speech recognition apparatus according to the fifth embodiment of the present invention may specifically include the following modules:
a voice recognition result obtaining module 501, configured to obtain multiple voice recognition results of voice data to be recognized;
an initial score obtaining module 502, configured to obtain scores of each speech recognition result in a plurality of different language models;
a feature vector obtaining module 503, configured to obtain a feature vector of the to-be-recognized speech data based on the speech recognition result and the score;
a final score prediction module 504, configured to input the feature vector into a pre-trained re-scoring model to obtain a final score of each speech recognition result;
a voice recognition result determining module 505, configured to determine a voice recognition result with a minimum final score as a final recognition result of the to-be-recognized voice data;
wherein the re-scoring model is trained by the re-scoring model training method according to any embodiment of the invention.
The voice recognition device provided by the embodiment of the invention can execute the voice recognition method provided by the third embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Example Six
Referring to fig. 7, a schematic diagram of the structure of an apparatus in one example of the invention is shown. As shown in fig. 7, the apparatus may specifically include: a processor 60, a memory 61, a display 62 with touch functionality, an input device 63, an output device 64 and a communication device 65. The number of the processors 60 in the device may be one or more, and one processor 60 is taken as an example in fig. 7. The processor 60, the memory 61, the display 62, the input means 63, the output means 64 and the communication means 65 of the device may be connected by a bus or other means, as exemplified by the bus connection in fig. 7.
The memory 61, as a computer-readable storage medium, is used for storing software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the re-scoring model training method according to the first and second embodiments of the present invention (for example, the speech recognition result and first label obtaining module 401, the scoring module 402, the sample feature vector and second label obtaining module 403, and the model training module 404 in the re-scoring model training device according to the fourth embodiment of the present invention), or program instructions/modules corresponding to the speech recognition method according to the third embodiment of the present invention (for example, the speech recognition result obtaining module 501, the initial score obtaining module 502, the feature vector obtaining module 503, and the final score predicting module 504 in the speech recognition device according to the fifth embodiment of the present invention). The memory 61 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and the application programs required for at least one function, and the data storage area may store data created according to the use of the device, and the like. Further, the memory 61 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, the memory 61 may further include memory located remotely from the processor 60, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The display screen 62 is a display screen 62 with a touch function, which may be a capacitive screen, an electromagnetic screen, or an infrared screen. In general, the display screen 62 is used for displaying data according to instructions from the processor 60, and is also used for receiving touch operations applied to the display screen 62 and sending corresponding signals to the processor 60 or other devices. Optionally, when the display screen 62 is an infrared screen, the display screen further includes an infrared touch frame, and the infrared touch frame is disposed around the display screen 62, and may also be configured to receive an infrared signal and send the infrared signal to the processor 60 or other devices.
The communication device 65 is used for establishing a communication connection with other devices, and may be a wired communication device and/or a wireless communication device.
The input means 63 may be used for receiving input numeric or character information and generating key signal inputs related to user settings and function control of the apparatus, and may be a camera for acquiring images and a sound pickup apparatus for acquiring audio data. The output device 64 may include an audio device such as a speaker. It should be noted that the specific composition of the input device 63 and the output device 64 can be set according to actual conditions.
The processor 60 executes the various functional applications and data processing of the device by running the software programs, instructions, and modules stored in the memory 61, thereby implementing the above-described re-scoring model training method and/or speech recognition method.
In particular, in the embodiment, when the processor 60 executes one or more programs stored in the memory 61, the method for training the re-scoring model and/or the method for recognizing the speech provided by the embodiment of the present invention are/is specifically implemented.
Embodiments of the present invention further provide a computer-readable storage medium, where instructions, when executed by a processor of a device, enable the device to perform the re-scoring model training method and/or the speech recognition method according to the above method embodiments.
It should be noted that, as for the embodiments of the apparatus, the device, and the storage medium, since they are basically similar to the embodiments of the method, the description is relatively simple, and in relevant places, reference may be made to the partial description of the embodiments of the method.
From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes instructions for enabling a computer device (which may be a robot, a personal computer, a server, or a network device) to execute the re-scoring model training method and/or the speech recognition method according to any embodiment of the present invention.
It should be noted that, in the above re-scoring model training apparatus and speech recognition apparatus, the included units and modules are divided only according to functional logic, and the division is not limited thereto as long as the corresponding functions can be realized; in addition, the specific names of the functional units are only for convenience of distinguishing them from each other and are not used to limit the protection scope of the present invention.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by suitable instruction execution devices. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (15)

1. A method for training a re-scoring model, characterized by comprising the following steps:
acquiring a plurality of voice recognition results of voice data samples and a first label of the voice data samples, wherein the first label is a pre-labeled label of the voice data samples;
acquiring the scores of each voice recognition result under a plurality of different language models;
obtaining a sample feature vector and a second label of the voice data sample for re-scoring model training based on the voice recognition result, the score and the first label;
and training a model by adopting the sample characteristic vector and the second label to obtain a re-scoring model for re-scoring the voice recognition result.
2. The method of claim 1, wherein obtaining the plurality of speech recognition results for the speech data sample and the first label for the speech data sample comprises:
inputting the voice data samples into a decoding model to obtain a plurality of voice recognition results, wherein the voice data samples are provided with pre-labeled first labels;
and extracting a preset number of voice recognition results as a plurality of voice recognition results of the voice data samples.
3. The method of claim 1, wherein obtaining scores of each speech recognition result under a plurality of different language models comprises:
and respectively inputting each voice recognition result into a plurality of different language models to obtain the scores of the voice recognition results under the different language models.
4. The method of any of claims 1-3, wherein the language models include an acoustic model, an n-gram language model, and an RNNLM language model.
5. The method of claim 1, wherein obtaining a sample feature vector and a second label for the speech data sample for re-scoring model training based on the speech recognition result, the score, and the first label comprises:
analyzing the voice recognition result aiming at each voice recognition result to extract sentence and word structural characteristics of the voice recognition result;
combining the scores of the voice recognition result under a plurality of different language models and the sentence and word structure characteristics into a sample characteristic vector of the voice recognition result;
and calculating the character error rate of the voice recognition result by adopting the voice recognition result and the first label to be used as a second label of the voice recognition result.
6. The method of claim 5, wherein the sentence structural features comprise at least one of:
word frequency statistical characteristics, word or word ordering characteristics, sentence length characteristics, word number.
7. The method of claim 1, further comprising, prior to said training a model using said sample feature vector and said second label to obtain a re-scoring model:
and carrying out normalization processing on the sample characteristic vector to obtain the normalized sample characteristic vector.
8. The method according to claim 7, wherein the normalizing the sample feature vector to obtain a normalized sample feature vector comprises:
determining a maximum sample feature vector and a minimum sample feature vector in all sample feature vectors;
calculating a difference value by adopting the maximum sample characteristic vector and the minimum sample characteristic vector to obtain a vector difference value;
and calculating the ratio of the sample characteristic vector to the vector difference value to be used as the sample characteristic vector after the normalization processing of the sample characteristic vector.
9. The method of claim 7 or 8, wherein training a model using the sample feature vectors and the second label to obtain a re-scoring model comprises:
initializing model parameters;
inputting the normalized sample feature vector of the speech recognition result into the model to obtain an estimated character error rate of the speech recognition result;
calculating a loss rate by adopting the estimated character error rate and the second label;
when the loss rate does not meet a preset condition, calculating a gradient by adopting the loss rate;
and adjusting the model parameters using the gradient, and returning to the step of inputting the normalized sample feature vector of the speech recognition result into the model to obtain the estimated character error rate of the speech recognition result.
10. The method of claim 9, wherein calculating a loss rate using the estimated character error rate and the second label comprises:
and substituting the estimated character error rate and the second label into a preset mean square loss function to calculate the loss rate.
11. A speech recognition method, comprising:
acquiring a plurality of voice recognition results of voice data to be recognized;
acquiring the scores of each voice recognition result under a plurality of different language models;
obtaining a feature vector of the voice data to be recognized based on the voice recognition result and the score;
inputting the feature vector into a pre-trained re-scoring model to obtain a final score of each voice recognition result;
determining the voice recognition result with the minimum final score as the final recognition result of the voice data to be recognized;
wherein the re-scoring model is trained by the re-scoring model training method according to any one of claims 1-10.
12. A re-scoring model training device, comprising:
the voice recognition result and first label acquisition module is used for acquiring a plurality of voice recognition results of voice data samples and first labels of the voice data samples, wherein the first labels are pre-labeled labels of the voice data samples;
the scoring module is used for acquiring scores of each voice recognition result under a plurality of different language models;
a sample feature vector and second label obtaining module, configured to obtain, based on the speech recognition result, the score and the first label, a sample feature vector and a second label, which are used by the speech data sample for re-scoring model training;
and the model training module is used for training a model by adopting the sample characteristic vector and the second label to obtain a re-scoring model for re-scoring the voice recognition result.
13. A speech recognition apparatus, comprising:
the voice recognition result acquisition module is used for acquiring a plurality of voice recognition results of the voice data to be recognized;
the initial score acquisition module is used for acquiring the scores of each voice recognition result under a plurality of different language models;
a feature vector obtaining module, configured to obtain a feature vector of the to-be-recognized speech data based on the speech recognition result and the score;
the final score prediction module is used for inputting the feature vector into a pre-trained re-scoring model to obtain a final score of each voice recognition result;
the voice recognition result determining module is used for determining the voice recognition result with the minimum final score as the final recognition result of the voice data to be recognized;
wherein the re-scoring model is trained by the re-scoring model training method according to any one of claims 1-10.
14. An apparatus, characterized in that the apparatus comprises:
one or more processors;
a storage device for storing one or more programs,
when executed by the one or more processors, cause the one or more processors to implement the re-scoring model training method of any one of claims 1-10 and/or the speech recognition method of claim 11.
15. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out a re-scoring model training method as claimed in any one of claims 1 to 10 and/or a speech recognition method as claimed in claim 11.
CN201911413152.3A 2019-12-31 2019-12-31 Training method for re-scoring model, voice recognition method and related device Active CN111179916B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201911413152.3A CN111179916B (en) 2019-12-31 2019-12-31 Training method for re-scoring model, voice recognition method and related device
PCT/CN2020/138536 WO2021136029A1 (en) 2019-12-31 2020-12-23 Training method and device for re-scoring model and method and device for speech recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911413152.3A CN111179916B (en) 2019-12-31 2019-12-31 Training method for re-scoring model, voice recognition method and related device

Publications (2)

Publication Number Publication Date
CN111179916A true CN111179916A (en) 2020-05-19
CN111179916B CN111179916B (en) 2023-10-13

Family

ID=70657647

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911413152.3A Active CN111179916B (en) 2019-12-31 2019-12-31 Training method for re-scoring model, voice recognition method and related device

Country Status (2)

Country Link
CN (1) CN111179916B (en)
WO (1) WO2021136029A1 (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112562640A (en) * 2020-12-01 2021-03-26 北京声智科技有限公司 Multi-language speech recognition method, device, system and computer readable storage medium
CN112700766A (en) * 2020-12-23 2021-04-23 北京猿力未来科技有限公司 Training method and device of voice recognition model and voice recognition method and device
CN112885336A (en) * 2021-01-29 2021-06-01 深圳前海微众银行股份有限公司 Training and recognition method and device of voice recognition system, and electronic equipment
CN112967720A (en) * 2021-01-29 2021-06-15 南京迪港科技有限责任公司 End-to-end voice-to-text model optimization method under small amount of accent data
WO2021136029A1 (en) * 2019-12-31 2021-07-08 百果园技术(新加坡)有限公司 Training method and device for re-scoring model and method and device for speech recognition
CN113380228A (en) * 2021-06-08 2021-09-10 北京它思智能科技有限公司 Online voice recognition method and system based on recurrent neural network language model
CN113378586A (en) * 2021-07-15 2021-09-10 北京有竹居网络技术有限公司 Speech translation method, translation model training method, device, medium, and apparatus
CN117633174A (en) * 2023-11-22 2024-03-01 北京万物可知技术有限公司 Voting consensus system based on multiple large model conversations

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113793593B (en) * 2021-11-18 2022-03-18 北京优幕科技有限责任公司 Training data generation method and device suitable for speech recognition model

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130158982A1 (en) * 2011-11-29 2013-06-20 Educational Testing Service Computer-Implemented Systems and Methods for Content Scoring of Spoken Responses
CN103426428A (en) * 2012-05-18 2013-12-04 华硕电脑股份有限公司 Speech recognition method and speech recognition system
CN103474061A (en) * 2013-09-12 2013-12-25 河海大学 Automatic distinguishing method based on integration of classifier for Chinese dialects
CN108711422A (en) * 2018-05-14 2018-10-26 腾讯科技(深圳)有限公司 Audio recognition method, device, computer readable storage medium and computer equipment
CN110111775A (en) * 2019-05-17 2019-08-09 腾讯科技(深圳)有限公司 A kind of Streaming voice recognition methods, device, equipment and storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9070360B2 (en) * 2009-12-10 2015-06-30 Microsoft Technology Licensing, Llc Confidence calibration in automatic speech recognition systems
JP6065842B2 (en) * 2011-12-16 2017-01-25 日本電気株式会社 Dictionary learning device, pattern matching device, dictionary learning method, and computer program
CN105845130A (en) * 2016-03-30 2016-08-10 乐视控股(北京)有限公司 Acoustic model training method and device for speech recognition
CN108415898B (en) * 2018-01-19 2021-09-24 思必驰科技股份有限公司 Word graph re-scoring method and system for deep learning language model
CN110349597B (en) * 2019-07-03 2021-06-25 山东师范大学 Voice detection method and device
CN110580290B (en) * 2019-09-12 2022-12-13 北京小米智能科技有限公司 Method and device for optimizing training set for text classification
CN111179916B (en) * 2019-12-31 2023-10-13 广州市百果园信息技术有限公司 Training method for re-scoring model, voice recognition method and related device

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021136029A1 (en) * 2019-12-31 2021-07-08 百果园技术(新加坡)有限公司 Training method and device for re-scoring model and method and device for speech recognition
CN112562640A (en) * 2020-12-01 2021-03-26 北京声智科技有限公司 Multi-language speech recognition method, device, system and computer readable storage medium
CN112562640B (en) * 2020-12-01 2024-04-12 北京声智科技有限公司 Multilingual speech recognition method, device, system, and computer-readable storage medium
CN112700766A (en) * 2020-12-23 2021-04-23 北京猿力未来科技有限公司 Training method and device of voice recognition model and voice recognition method and device
CN112700766B (en) * 2020-12-23 2024-03-19 北京猿力未来科技有限公司 Training method and device of voice recognition model, and voice recognition method and device
CN112885336A (en) * 2021-01-29 2021-06-01 深圳前海微众银行股份有限公司 Training and recognition method and device of voice recognition system, and electronic equipment
CN112967720A (en) * 2021-01-29 2021-06-15 南京迪港科技有限责任公司 End-to-end voice-to-text model optimization method under small amount of accent data
CN112885336B (en) * 2021-01-29 2024-02-02 深圳前海微众银行股份有限公司 Training and recognition method and device of voice recognition system and electronic equipment
CN113380228A (en) * 2021-06-08 2021-09-10 北京它思智能科技有限公司 Online voice recognition method and system based on recurrent neural network language model
CN113378586A (en) * 2021-07-15 2021-09-10 北京有竹居网络技术有限公司 Speech translation method, translation model training method, device, medium, and apparatus
CN117633174A (en) * 2023-11-22 2024-03-01 北京万物可知技术有限公司 Voting consensus system based on multiple large model conversations

Also Published As

Publication number Publication date
WO2021136029A1 (en) 2021-07-08
CN111179916B (en) 2023-10-13

Similar Documents

Publication Publication Date Title
CN111179916B (en) Training method for re-scoring model, voice recognition method and related device
CN106297800B (en) Self-adaptive voice recognition method and equipment
US8751226B2 (en) Learning a verification model for speech recognition based on extracted recognition and language feature information
CN111402861B (en) Voice recognition method, device, equipment and storage medium
CN112397056B (en) Voice evaluation method and computer storage medium
CN106340297A (en) Speech recognition method and system based on cloud computing and confidence calculation
CN109976702A (en) Audio recognition method, device and terminal
WO2019167296A1 (en) Device, method, and program for natural language processing
CN111369974A (en) Dialect pronunciation labeling method, language identification method and related device
JP5180800B2 (en) Recording medium for storing statistical pronunciation variation model, automatic speech recognition system, and computer program
CN110853669B (en) Audio identification method, device and equipment
JP6810580B2 (en) Language model learning device and its program
CN112216267A (en) Prosody prediction method, device, equipment and storage medium
CN115132174A (en) Voice data processing method and device, computer equipment and storage medium
JP6973192B2 (en) Devices, methods and programs that utilize the language model
CN117292680A (en) Voice recognition method for power transmission operation detection based on small sample synthesis
CN116978367A (en) Speech recognition method, device, electronic equipment and storage medium
CN113327596B (en) Training method of voice recognition model, voice recognition method and device
JP2019204415A (en) Wording generation method, wording device and program
CN115240712A (en) Multi-mode-based emotion classification method, device, equipment and storage medium
CN113096646B (en) Audio recognition method and device, electronic equipment and storage medium
CN114420159A (en) Audio evaluation method and device and non-transitory storage medium
CN112908359A (en) Voice evaluation method and device, electronic equipment and computer readable medium
WO2022074760A1 (en) Data processing device, data processing method, and data processing program
CN113555006B (en) Voice information identification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant