CN109545186B

CN109545186B - Speech recognition training system and method

Info

Publication number: CN109545186B
Application number: CN201811538408.9A
Authority: CN
Inventors: 胡杰
Original assignee: Momenta Suzhou Technology Co Ltd
Current assignee: Momenta Suzhou Technology Co Ltd
Priority date: 2018-12-16
Filing date: 2018-12-16
Publication date: 2022-05-27
Anticipated expiration: 2038-12-16
Also published as: CN109545186A

Abstract

The invention relates to a voice recognition training system and a method thereof, belonging to the technical field of voice recognition; in the prior art, the error correction mechanism or the error correction algorithm of the loss function is general or single, and the error of voice recognition cannot be corrected quickly and accurately; the invention provides a speech recognition training system and method, which can provide the precision and speed of the system by setting a plurality of loss functions according to the error of common recognition to respectively deal with different situations.

Description

Speech recognition training system and method

Technical Field

The invention relates to a voice recognition technology, in particular to a Chinese voice recognition training method.

Background

The voice communication with the machine is carried out, so that the machine can understand what you say, which is a thing that people dreams for a long time. The Chinese Internet of things school and enterprise alliance can take the voice recognition ratio as the hearing system of the machine. Speech recognition technology is a high technology that allows machines to convert speech signals into corresponding text or commands through a recognition and understanding process.

In recent years, research on the application of artificial neural networks to speech recognition has been promoted. In these studies, multi-layer perceptual networks based on back-propagation algorithms (BP algorithms) were mostly employed. Artificial neural networks have the ability to distinguish complex classification boundaries, which apparently contributes significantly to pattern classification. Especially in the aspect of telephone voice recognition, the method becomes a hot spot of the current voice recognition application due to the wide application prospect.

However, in the existing artificial neural network applied to the speech recognition network, the error correction mechanism or the error correction algorithm of the loss function for error correction is general or single, and the error correction cannot be quickly and accurately performed on the speech recognition error.

Disclosure of Invention

In view of the problems in the prior art, the present invention provides a speech recognition training system, which is characterized in that: the system comprises: the device comprises a feature extraction unit, a voice recognition unit and a loss function;

the feature extraction unit is used for extracting features of the voice information to be recognized;

the voice recognition unit is used for carrying out voice recognition on the input voice information to be recognized to obtain a recognition result;

the system compares the voice information to be recognized with the recognition result through pre-labeling of the voice information to be recognized, constructs the loss function, and finally corrects the voice recognition unit and the feature extraction unit step by step through step reverse conduction of the loss function;

the loss function is formed by the sum of at least two different types of loss functions.

Preferably, the two different types of loss functions are respectively: a homophonic loss function and an approximate loss function.

Preferably, the homophonic loss function represents the probability of recognition errors of different characters with the same pronunciation; the approximate loss function represents the probability of recognition errors of different characters with similar pronunciation.

Preferably, the loss function of the system is a homophonic loss function + b approximate loss function, wherein a and b are weighting coefficients.

Preferably, when the recognition result includes different characters with similar pronunciation, b > a; b < a when the recognition result includes different characters having the same pronunciation.

Preferably, the character recognition unit includes a first character recognition unit and a second character recognition unit, which correspond to the homophonic loss function and the approximate loss function, respectively.

Preferably, the system further comprises a mapping unit predicting the recognition result by mapping of a dictionary or a dictionary.

Preferably, the system further comprises a sentence loss function representing a probability of an ambiguity-prone sentence recognition error.

Preferably, the number of the voice recognition units is plural.

The invention also provides a method for carrying out voice recognition training by using the system, which is characterized by comprising the following steps: the method comprises the following steps:

a characteristic extraction step: carrying out feature extraction on voice information to be recognized;

a voice recognition step: carrying out voice recognition on the input voice information to be recognized to obtain a recognition result;

and (3) correcting: the system compares the voice information to be recognized with the recognition result through pre-labeling of the voice information to be recognized, constructs the loss function, and finally corrects the voice recognition unit and the feature extraction unit step by step through step reverse conduction of the loss function;

Preferably, the loss function further comprises a sentence loss function representing a probability of an ambiguity-prone sentence occurrence recognition error.

The invention points of the invention include but are not limited to the following points:

(1) the invention proposes that the loss function is expressed by the sum of the same loss function and the approximate loss function; the problem of different types of errors in speech recognition can be solved by setting the weight between the two under different conditions; the classification can also be limited to the field of common words according to actual conditions.

(2) The loss functions of the present invention may also include sentence loss functions that provide the accuracy and speed of the training system for sentences that are prone to ambiguity.

(3) The invention can also use a plurality of voice recognition units, namely two recurrent neural networks, and the two can respectively work in a targeted manner, thereby improving the working efficiency.

Drawings

FIG. 1 is a speech recognition training structure based on deep learning in embodiment 1 of the present invention;

fig. 2 is a speech recognition training structure based on deep learning in embodiment 2 of the present invention.

Detailed Description

The technical scheme of the invention is further explained by the specific implementation mode in combination with the attached drawings.

To better illustrate the invention and to facilitate the understanding of the technical solutions thereof, typical but non-limiting examples of the invention are as follows:

the invention provides a speech recognition training method based on deep learning, which comprises the steps of firstly determining speech information to be recognized, extracting characteristics of the input speech information through a Convolutional Neural Network (CNN), then inputting the extracted characteristics into a Recurrent Neural Network (RNN), then outputting a recognition result through the Recurrent Neural Network, then comparing the specific speech content with the recognition result through marking of the speech information to be recognized, constructing a loss function, finally conducting reversely step by step through the loss function, and correcting the Neural Network step by step to achieve the training purpose.

The invention provides a speech recognition training system based on deep learning, which is characterized in that speech information to be detected is input through a speech input module of the system and then passes through a preprocessing unit, the preprocessing unit filters the input speech information to avoid interference, then the filtered speech information is input into a feature extraction unit for feature extraction, and the extracted features are input into a speech recognition unit (generally a neural network) for recognition.

Example 1

The speech recognition training system of the invention is shown in fig. 1 and comprises a preprocessing unit, a feature extraction unit, a speech recognition unit, a loss function and a mapping unit; the feature extraction unit is specifically a convolutional neural network CNN, and the character recognition unit is specifically a recurrent neural network RNN.

The system also needs some preparation work in the early stage, specifically, the following (1) is a training sample set, namely, a voice information sample is labeled, and specific labeling indexes identify specific voice information contents; (2) classifying all characters in the Chinese character library, and marking the classification of each character, wherein the specific classification is as follows:

in some embodiments, in mode 1, the classification by glyph is employed

Words with similar glyphs are all labeled as category 1, for example: "vegetables" and "dishes", "go" and "lose", "forest" and "wood", etc.;

all characters with no similarity of font are marked as category 2;

in some embodiments, approach 2 is employed, classifying by pronunciation

Different words with the same pronunciation are labeled as category 1, for example: "Ming" and "Ming", "ren" and "ren";

the characters with similar pronunciation are marked as category 2, and the pronunciation in the text can be different from the edge sound and the nose sound, the front nose sound and the back nose sound, and the tone; for example: "flow" and "cow", "root" and "more", "creep" and "permit", and the like;

because the number of the Chinese characters is more, the Chinese characters with the same or similar pronunciation are also more; the classification can be limited to the range of common words as required, and the rarely used words can be labeled in another way.

The classification according to the mode 1 can be used for a character recognition technology, is similar to a voice recognition technology in a mode that feature extraction is carried out through a convolutional neural network, then classification is carried out through a cyclic neural network, finally correction is carried out through a loss function, and finally training is completed.

The feature extraction unit is realized by constructing a convolutional neural network CNN, the convolutional neural network firstly performs initial feature extraction on the voice information through convolutional cores, and the initially extracted features can comprise part of the voice information and can be one word or a plurality of words; then, carrying out feature extraction on the features extracted at the previous level step by a secondary extraction layer or multiple extraction layers in the convolutional neural network to obtain required accurate features, and removing redundant features; and finally, all sub-voice information formed by extracting the same voice information characteristic is connected in series by a full connection layer of the convolutional neural network to form a complete extraction characteristic set.

The speech recognition unit is realized by constructing a Recurrent Neural Network (RNN), the input of the Recurrent Neural Network (RNN) comprises two kinds of data, the first kind of data is characteristic data extracted by the Convolutional Neural Network (CNN), the second kind of data is output data of the Recurrent Neural Network (RNN) at the previous time, and finally the Recurrent Neural Network (RNN) outputs a speech recognition result; in order to ensure the accuracy of speech recognition, the general usage of the language is usually considered, and therefore, on the basis of the above, the input of the recurrent neural network RNN may further include a third type of data, i.e., the predicted result of the recurrent neural network RNN at the previous time to the time, where the third type of data may be obtained through a dictionary or a dictionary mapping.

Obtaining a voice information recognition result through a convolutional neural network CNN and a cyclic neural network RNN, comparing the voice information recognition result with a voice information pre-label, performing back propagation on data when the comparison result is different, and gradually correcting each neural network in the back propagation process; the above process is repeated until the accuracy or error rate of the identification result reaches the set threshold.

The comparison between the recognition result and the pre-marked result is embodied by a loss function, while the errors compared according to the past experience are mainly classified into two types, one type is homophonic error, the other type is approximate error, and the two types belong to character errors;

the character error can be indirectly adjusted through a character loss function to extract the most expressive characteristic;

the total loss function of the embodiment is the homophonic loss function + the approximate loss function, so that the error in the speech recognition can be well solved.

Example 2

The speech recognition training system of the invention is shown in fig. 2 and comprises a preprocessing unit, a feature extraction unit, a speech recognition unit 1, a loss function 1, a speech recognition unit 2, a loss function 2 and a mapping unit; the feature extraction unit is specifically a convolutional neural network CNN, and the character recognition unit is specifically a recurrent neural network RNN.

the characters with similar pronunciation are marked as category 2, and the same pronunciation can be different from the edge sound and the nose sound, the front nose sound and the back nose sound, and the tone; for example: "current" and "cow", "root" and "more", "creep" and "permit", and the like;

then, feature extraction is carried out on the input voice information through a feature extraction unit, the extracted features are simultaneously and respectively input into a voice recognition unit 1 and a voice recognition unit 2, and then recognition results are output by the voice recognition unit 1 and the voice recognition unit 2; and then comparing the marking of the voice information to be recognized, namely the content of the specific voice with the recognition result, constructing a loss function, and finally conducting reversely step by the loss function so as to modify the neural network step by step to realize the training purpose.

The above function is a homophonic loss function + b approximate loss function, where a and b are weight coefficients, and a + b is 1; b < a if category 1 is included in the recognition result, preferably b is 0.7-0.9; b > a if category 2 is included in the recognition result, preferably b is 0.1-0.2; if both are included, let b be a.

Here, the speech recognition unit 1 and the speech recognition unit 2, i.e. the first recurrent neural network and the second recurrent neural network, may focus on different directions, the second recurrent neural network connected to the approximate loss function may specifically recognize a specific type of speech, such as characters with similar speech (category 1), while the first recurrent neural network connected to the homophonic loss function may focus on recognizing characters with the same speech (category 2), which is one of the innovative points of the present invention.

In this embodiment, the feature extraction unit is common, and in addition to this, a speech recognition unit (not shown) may be common, and then the text recognition unit outputs the results to the loss function 1 and the loss function 2, respectively, at the same time.

In addition, in the speech recognition process, besides a single character recognition error, a sentence recognition error also exists, and the sentence recognition error is quite common, so that a speech recognition unit 3 and a loss function 3 can be arranged in the system in parallel with a speech recognition unit 1, a loss function 1, a speech recognition unit 2 and a loss function 2;

the sentence recognition error generally includes the following cases:

(1) errors caused by different sentence breaks;

(2) errors due to ambiguous words;

(3) errors due to the bias phrase;

(4) errors due to multiple fixed or idioms;

if the speech information to be recognized includes the above situations, a sentence loss function can be included in the total loss function, and the specific function can be: a homophonic loss function + b approximate loss function + c sentence loss function; if the above sentences are included in the recognition result, c may be 0.5, a + b may be 0.5, and the values of a and b may be referred to the above ratio.

The preferred embodiments of the present invention have been described in detail, however, the present invention is not limited to the specific details of the above embodiments, and various simple modifications may be made to the technical solution of the present invention within the technical idea of the present invention, and these simple modifications are within the protective scope of the present invention.

It should be noted that the various technical features described in the above embodiments can be combined in any suitable manner without contradiction, and the invention is not described in any way for the possible combinations in order to avoid unnecessary repetition.

In addition, any combination of the various embodiments of the present invention can be made, and the same should be considered as the disclosure of the present invention as long as the idea of the present invention is not violated.

Claims

1. A speech recognition training system, characterized by: the system comprises: the device comprises a feature extraction unit, a voice recognition unit and a loss function;

the system compares the pre-labeling of the voice information to be recognized with the recognition result, constructs the loss function, and finally corrects the voice recognition unit and the feature extraction unit step by step through the step-by-step reverse conduction of the loss function;

the loss function is formed by the sum of at least two different types of loss functions;

the two different types of loss functions are respectively: a homophonic loss function and an approximate loss function;

the homophonic loss function represents the probability of recognition errors of different characters with the same pronunciation; the approximate loss function represents the probability of recognition errors of different characters with similar pronunciation.

2. The system of claim 1, wherein: the loss function of the system is a homophonic loss function + b approximate loss function, wherein a and b are weight coefficients.

3. The system of claim 2, wherein: b > a when the recognition result includes different characters with similar pronunciation; b < a when the recognition result includes different characters having the same pronunciation.

4. The system of claim 1, wherein: the character recognition unit includes a first character recognition unit and a second character recognition unit, which correspond to the homophonic loss function and the approximate loss function, respectively.

5. The system of claim 1, wherein: the system further comprises a mapping unit predicting the recognition result by a mapping of a dictionary or dictionaries.

6. The system of claim 1, wherein: the system also includes a sentence loss function that represents a probability of an ambiguity-prone sentence occurrence recognition error.

7. The system of claim 1, the speech recognition unit being plural.

8. Method for speech recognition training using a system according to any of claims 1-6, characterized in that: the method comprises the following steps:

error correction: the system compares the voice information to be recognized with the recognition result through pre-labeling of the voice information to be recognized, constructs the loss function, and finally corrects the voice recognition unit and the feature extraction unit step by step through step reverse conduction of the loss function;

9. The method of claim 8, wherein: the loss function also includes a sentence loss function that represents a probability of an ambiguity-prone sentence recognition error.