CN111951785A

CN111951785A - Voice recognition method and device and terminal equipment

Info

Publication number: CN111951785A
Application number: CN201910407618.2A
Authority: CN
Inventors: 陈明
Original assignee: Wuhan TCL Group Industrial Research Institute Co Ltd
Current assignee: Wuhan TCL Group Industrial Research Institute Co Ltd
Priority date: 2019-05-16
Filing date: 2019-05-16
Publication date: 2020-11-17
Anticipated expiration: 2039-05-16
Also published as: CN111951785B

Abstract

The invention is suitable for the technical field of voice recognition, and provides a voice recognition method, a voice recognition device and terminal equipment, wherein the method comprises the following steps: calculating a first conditional probability of a sentence according to a pre-trained language model; adjusting a first loss function of the voice recognition model according to the first conditional probability to obtain a second loss function; and training the voice recognition model by using the second loss function, and performing voice recognition by using the trained voice recognition model. The invention can improve the accuracy of voice recognition.

Description

Voice recognition method and device and terminal equipment

Technical Field

The invention belongs to the technical field of voice recognition, and particularly relates to a voice recognition method, a voice recognition device and terminal equipment.

Background

The voice recognition technology aims to recognize input voice signals and output characters which can be read by a computer, and can be applied to smart homes, smart vehicles, smart customer service robots and the like. With the development of Deep learning technology, the speech recognition technology is changed from the traditional machine learning Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) to the Deep Neural Network (DNN) based technology. The DNN-based speech recognition techniques are divided into two categories: one is to replace the original GMM part by DNN, namely Deep Neural Networks and Hidden Markov models (DNN-HMM), and the other is to use end-to-end speech recognition technology based on the Deep Neural Networks.

Because the End-To-End Speech Recognition technology (End-To-End Automatic Speech Recognition) based on the deep neural network can directly realize the input and decoding Recognition of Speech, complex alignment work and pronunciation dictionary making work are not needed, a large amount of early-stage preparation time can be saved, and the method is widely applied. At present, the existing end-to-end speech recognition technology (such as continuous time sequence classification (CTC), deep full feedforward connection neural network (DFSMN), Attention mechanism sequence to sequence network (Seq 2 Seq-Attention) and the like) cannot learn a complex language model, and the complex language model usually recognizes input speech through sound waveforms, so that the logic of recognized characters is poor. Therefore, when the trained speech recognition model is used for speech recognition, if the trained speech recognition model meets more complex speech, the recognition accuracy is lower.

Disclosure of Invention

In view of this, embodiments of the present invention provide a speech recognition method, a speech recognition device, and a terminal device, so as to solve the problem in the prior art that a trained speech recognition model has low recognition accuracy when encountering complex speech.

A first aspect of an embodiment of the present invention provides a speech recognition method, including:

calculating a first conditional probability of a sentence according to a pre-trained language model;

adjusting a first loss function of the voice recognition model according to the first conditional probability to obtain a second loss function;

and training the voice recognition model by using the second loss function, and performing voice recognition by using the trained voice recognition model.

A second aspect of an embodiment of the present invention provides a speech recognition apparatus, including:

the first conditional probability calculating module is used for calculating the first conditional probability of the sentence according to the pre-trained language model;

the adjusting module is used for adjusting a first loss function of the voice recognition model according to the first conditional probability to obtain a second loss function;

and the voice recognition module is used for training the voice recognition model by utilizing the second loss function and performing voice recognition by utilizing the trained voice recognition model.

A third aspect of embodiments of the present invention provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the steps of the method according to the first aspect when executing the computer program.

A fourth aspect of embodiments of the present invention provides a computer-readable storage medium, in which a computer program is stored, which, when executed by a processor, performs the steps of the method according to the first aspect.

In the embodiment of the invention, a pre-trained language model is used for calculating the first conditional probability of a sentence, the original first loss function of a speech recognition model is corrected to obtain a second loss function, the second loss function is further used for training the speech recognition model, the optimization of the loss function of the speech recognition model is realized, and the characteristics of the pre-trained language model are introduced; because the first loss function is optimized by adopting the first conditional probability of the pre-trained language model, the pre-trained language model is embedded into the speech recognition model, and the recognition precision of the trained speech recognition model is higher.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.

Fig. 1 is a schematic flow chart of a speech recognition method according to an embodiment of the present invention;

fig. 2 is a flowchart illustrating a specific implementation process of adjusting the first loss function according to the second conditional probability and the influence coefficient according to the embodiment of the present invention;

fig. 3 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present invention;

fig. 4 is a schematic diagram of a terminal device according to an embodiment of the present invention.

Detailed Description

In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.

In order to explain the technical means of the present invention, the following description will be given by way of specific examples.

Fig. 1 is a schematic flow chart of a speech recognition method according to an embodiment of the present invention, which is detailed as follows:

s101: a first conditional probability of the sentence is calculated from the pre-trained language model.

It should be noted that the language model can summarize the internal relation between words from a large amount of text information, reduce the error rate of recognized words, and make the recognition result more logical, and the commonly used language models are n-gram language models and neural network-based language models.

The pre-trained speech model in the embodiment of the invention can be trained by a speech model training tool SRILM and adopting an n-gram speech model, wherein a parameter n represents that the probability of the current word occurrence is related to the probability of the previous n-1 words occurrence. In the embodiment of the present invention, a ternary language model, that is, a language model with n ═ 3 is trained, and the probability of the current word occurrence is related to the probability of the 2 preceding words occurrence. And the sentence refers to a sentence predicted to be generated by the speech recognition model according to the input sample (speech data).

Further, the calculating the conditional probability of the sentence according to the pre-trained language model includes:

for each sentence, its first conditional probability is calculated according to:

in the above formula (1), P (S) represents the first conditional probability of sentence S, C (w)_i-(n-1),…,w_i-1,w_i) Meaning word w_i-(n-1),…,w_i-1Word w after occurrence_iNumber of occurrences, C (w)_i-(n-1),…,w_i-1) Meaning word w_i-(n-1),…w_i-2Word w after occurrence_i-1The number of occurrences, m represents the number of samples, n represents a positive integer greater than 1, and i represents the ith word.

Since the n-gram language model means that the probability of the current word occurrence is related to the probability of the previous n-1 words occurrence, the first conditional probability p (S) for sentence S can be expressed as:

in the above formula (2), P (w)_i|w_i-(n-1),…,w_i-1) Meaning word w_iIn the word w_i-(n-1),…,w_i-1The probability of occurrence in the case of occurrence can be calculated by a maximum likelihood estimation method, and the above equation (1) can be obtained.

S102: and adjusting the first loss function of the voice recognition model according to the first conditional probability to obtain a second loss function.

The loss function is a difference value between the predicted value and the true value, and reflects the deviation degree between the predicted value and the true value, and the lower the deviation degree between the predicted value and the true value is, the more accurate the prediction result is, so the smaller the loss function is, the better the quality of the finally trained model is, namely, the higher the accuracy of the speech recognition is.

The first loss function refers to an original loss function of the speech recognition model. The original loss function of the speech recognition model is adjusted by utilizing the first conditional probability, so that the characteristics of the pre-trained speech model are introduced, and the accuracy of the trained speech recognition model can be improved.

Specifically, the adjusting a first loss function of the speech recognition model according to the first conditional probability includes:

calculating a second conditional probability using the first conditional probability;

and adjusting the first loss function according to the second conditional probability and the influence coefficient of the pre-trained language model.

And after the first conditional probability P (S) is obtained through calculation, the first conditional probability P (S) is transformed to obtain a second conditional probability T, and the first loss function is adjusted by utilizing the T and the influence coefficient r of the pre-trained language model.

Further, the calculating a second conditional probability using the first conditional probability includes:

calculating by using the first conditional probability according to the following formula:

in the above equation (3), T represents the calculated second conditional probability, p (S) represents the first conditional probability, and length represents the length of the sentence S, that is, the number of words included in S.

As shown in fig. 2, fig. 2 is a flowchart illustrating a specific implementation process of adjusting the first loss function according to the second conditional probability and the influence coefficient of the pre-trained language model, and includes the following steps S201 to S203:

s201: obtaining a plurality of sentences obtained through prediction, and calculating a second conditional probability of each sentence;

multiple predicted sentences are obtained from the speech recognition model, assuming that k predicted sentences are obtained, y _ pred¹,y_pred²,…,y_pred^k. Respectively calculating the T value of each sentence by using the above equations (1) and (3) to obtain T¹,T²,…,T^k。

S202: calculating average conditional probability according to the second conditional probabilities of all sentences and the influence coefficients;

calculating an average conditional probability T according to the T value obtained in the step S201 and the influence coefficient r of the pre-trained language model and the following formula_i：

In the above formula (4), T_iRepresenting the calculated average conditional probability, r representing the influence coefficient, k representing the number of sentences, j representing the jth sentence, T^jRepresenting a second conditional probability for the jth sentence.

S203: adjusting the first loss function using the average conditional probability.

The method for adjusting the first loss function by using the average conditional probability comprises the following steps: and adding the average conditional probability to the original loss function to obtain a second loss function, namely the adjusted loss function.

It should be noted that, since different influence coefficients r influence the recognition accuracy of the finally trained speech recognition model, different values of r will be adopted for different sample data.

In a preferred implementation manner of the embodiment of the present invention, the influence coefficient is an optimal influence coefficient, and the method for obtaining the optimal influence coefficient includes:

respectively training the voice recognition model by adopting a plurality of influence coefficients, and determining the influence coefficient which enables the recognition precision of the voice recognition model to be highest according to a training result, namely the optimal influence coefficient;

the adjusting the first loss function according to the second conditional probability and the influence coefficient of the pre-trained language model includes:

and adjusting the first loss function according to the second conditional probability and the optimal influence coefficient.

In general, the selectable value range of the influence coefficient r is between 0 and 1. In the embodiment of the invention, through practical training, the following conclusion can be obtained: when the value range of the influence coefficient r is [0.1,0.5], the converged speech recognition model has better recognition accuracy. However, for voice data with different sizes and different field ranges, different influence coefficients r should be selected, that is, the selection of the influence coefficient r is related to the size of the input voice data and the field range, and in an actual process, an optimal influence coefficient can be selected as required.

Optionally, the training the speech recognition model with a plurality of influence coefficients respectively includes:

and presetting a value interval for the influence coefficients, adjusting the values of the influence coefficients according to a preset step length, and training the voice recognition model by using each influence coefficient.

In the training process of the voice recognition model, a value interval can be preset for r, assuming that the value interval is [0.1,0.5], the value of r is automatically adjusted according to the step length of 0.1, the voice recognition model is trained by the value, and the r which enables the converged voice recognition model to have the highest recognition accuracy is determined according to the training result, namely the optimal influence coefficient.

After the optimal influence coefficient is determined, the loss function is adjusted according to the optimal influence coefficient and the first conditional probability, and a speech recognition model is trained according to the adjusted first loss function.

S103: and training the voice recognition model by using the second loss function, and performing voice recognition by using the trained voice recognition model.

It should be noted that the training process of the speech recognition model is as follows: inputting sample data with labels to a voice recognition model, wherein the sample data is voice data and a text corresponding to the voice data; and performing feature extraction on the sample data to obtain a feature sequence, encoding the feature sequence, decoding to obtain a predicted value, performing difference between the predicted value and a true value to obtain a loss function, and performing model training according to the loss function until the model converges to obtain the trained speech recognition model.

And then, carrying out parameter adjustment on the voice recognition model by using the value of the second loss function, and finally obtaining the voice recognition model with the optimal parameters, namely the trained voice recognition model.

When the trained voice recognition model is used for voice recognition, the audio data to be recognized is input into the trained voice recognition model, and the trained voice recognition model outputs the text corresponding to the audio to be recognized, so that the voice recognition can be realized.

It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.

Fig. 3 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present invention, where the apparatus includes: a first conditional probability calculation module 31, an adjustment module 32 and a speech recognition module 33. Wherein:

and a first conditional probability calculating module 31, configured to calculate a first conditional probability of the sentence according to the pre-trained language model.

Further, the first conditional probability calculating module 31 is specifically configured to: for each sentence, its first conditional probability is calculated according to:

in the above formula, P (S) represents the first conditional probability of sentence S, C (w)_i-(n-1),…,w_i-1,w_i) Meaning word w_i-(n-1),…,w_i-1Word w after occurrence_iNumber of occurrences, C (w)_i-(n-1),…,w_i-1) Meaning word w_i-(n-1),…w_i-2Word w after occurrence_i-1The number of occurrences, m represents the number of samples, n represents a positive integer greater than 1, and i represents the ith word.

And the adjusting module 32 is configured to adjust the first loss function of the speech recognition model according to the first conditional probability to obtain a second loss function.

Further, the adjusting module 32 includes: a second conditional probability calculating unit 321 and an adjusting unit 322, wherein:

the second conditional probability calculating unit 321 is configured to calculate a second conditional probability by using the first conditional probability.

Further, the second conditional probability calculating unit 321 is specifically configured to:

in the above equation (3), T represents the calculated second conditional probability, p (S) represents the first conditional probability, and length represents the length of the sentence S.

The adjusting unit 322 is configured to adjust the first loss function according to the second conditional probability and the influence coefficient of the pre-trained language model.

Further, the adjusting unit 322 includes:

a first calculating subunit 3221 configured to obtain a plurality of predicted sentences, and calculate a second conditional probability of each sentence;

a second calculating subunit 3222, configured to calculate an average conditional probability according to the second conditional probabilities of all the sentences and the influence coefficients;

an adjusting subunit 3223 is configured to adjust the first loss function by using the average conditional probability.

And a speech recognition module 33, configured to train the speech recognition model by using the second loss function, and perform speech recognition by using the trained speech recognition model.

Preferably, the influence coefficient is an optimal influence coefficient, and the apparatus further includes an optimal influence coefficient obtaining module 34, configured to use a plurality of influence coefficients to respectively train the speech recognition model, and determine, according to a training result, an influence coefficient that maximizes the recognition accuracy of the speech recognition model, that is, the optimal influence coefficient;

preferably, the adjusting unit 322 is configured to adjust the first loss function according to the second conditional probability and the optimal influence coefficient.

Further, the optimal influence coefficient obtaining module 34 is specifically configured to: and presetting a value interval for the influence coefficients, adjusting the values of the influence coefficients according to a preset step length, and training the voice recognition model by using each influence coefficient.

Fig. 4 is a schematic diagram of a terminal device according to an embodiment of the present invention. As shown in fig. 4, the terminal device 4 of this embodiment includes: a processor 40, a memory 41 and a computer program 42, such as a speech recognition program, stored in said memory 41 and operable on said processor 40. The processor 40, when executing the computer program 42, implements the steps in the various speech recognition method embodiments described above, such as the steps S101 to S103 shown in fig. 1. Alternatively, the processor 40, when executing the computer program 42, implements the functions of the modules/units in the above-mentioned device embodiments, such as the functions of the modules 31 to 33 shown in fig. 3.

Illustratively, the computer program 42 may be partitioned into one or more modules/units that are stored in the memory 41 and executed by the processor 40 to implement the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution process of the computer program 42 in the terminal device 4. For example, the computer program 42 may be divided into a first conditional probability calculation module, an adjustment module, and a speech recognition module, each of which functions as follows:

The terminal device 4 may be a desktop computer, a notebook, a palm computer, a cloud server, or other computing devices. The terminal device may include, but is not limited to, a processor 40, a memory 41. Those skilled in the art will appreciate that fig. 4 is merely an example of a terminal device 4 and does not constitute a limitation of terminal device 4 and may include more or fewer components than shown, or some components may be combined, or different components, e.g., the terminal device may also include input-output devices, network access devices, buses, etc.

The Processor 40 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), an off-the-shelf Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.

The memory 41 may be an internal storage unit of the terminal device 4, such as a hard disk or a memory of the terminal device 4. The memory 41 may also be an external storage device of the terminal device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the terminal device 4. Further, the memory 41 may also include both an internal storage unit and an external storage device of the terminal device 4. The memory 41 is used for storing the computer program and other programs and data required by the terminal device. The memory 41 may also be used to temporarily store data that has been output or is to be output.

The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims

1. A speech recognition method, comprising:

2. The method of claim 1, wherein said calculating a first conditional probability for a sentence according to a pre-trained language model comprises:

3. The method of claim 1, wherein said adjusting a first loss function of a speech recognition model based on said first conditional probability comprises:

4. The method of claim 3, wherein said calculating a second conditional probability using said first conditional probability comprises:

in the above formula, T represents the calculated second conditional probability, p (S) represents the first conditional probability, and length represents the length of sentence S.

5. The method of claim 4, wherein said adjusting said first penalty function based on said second conditional probability and an influence coefficient of said pre-trained language model comprises:

obtaining a plurality of sentences obtained through prediction, and calculating a second conditional probability of each sentence;

calculating average conditional probability according to the second conditional probabilities of all sentences and the influence coefficients;

adjusting the first loss function using the average conditional probability.

6. The method as claimed in claim 5, wherein the influence coefficient is an optimal influence coefficient, and the optimal influence coefficient is obtained by:

7. The method of claim 6, wherein the training the speech recognition model with the plurality of influence coefficients comprises:

8. A speech recognition apparatus, comprising:

9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.