CN111833844A - Training method and system of mixed model for speech recognition and language classification - Google Patents


Info

Publication number
CN111833844A
CN111833844A (application CN202010739233.9A)
Authority
CN
China
Prior art keywords
training
language
layer
speech recognition
language classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010739233.9A
Other languages
Chinese (zh)
Inventor
陆一帆
钱彦旻
朱森
陈梦姣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AI Speech Ltd
Original Assignee
AI Speech Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by AI Speech Ltd filed Critical AI Speech Ltd
Priority to CN202010739233.9A
Publication of CN111833844A
Legal status: Withdrawn


Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 — Speech recognition
    • G10L 15/005 — Language recognition
    • G10L 15/02 — Feature extraction for speech recognition; selection of recognition unit
    • G10L 15/06 — Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 — Training
    • G10L 15/08 — Speech classification or search
    • G10L 15/16 — Speech classification or search using artificial neural networks
    • G10L 15/26 — Speech-to-text systems
    • G10L 25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00–G10L 21/00
    • G10L 25/03 — characterised by the type of extracted parameters
    • G10L 25/18 — the extracted parameters being spectral information of each sub-band
    • G10L 25/24 — the extracted parameters being the cepstrum
    • G10L 2015/025 — Phonemes, fenemes or fenones being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of the invention provide a training method for a hybrid model for speech recognition and language classification. The method comprises the following steps: performing feature extraction and data alignment on mixed training audio data carrying text labels and language labels, to determine input data for training; inputting the input data into the N intermediate layers, performing speech recognition training based on the speech recognition results output by the speech recognition layer and the text labels, and training the neural network parameters of the N intermediate layers and the speech recognition layer; and, after the speech recognition training is completed, training only the neural network parameters of the language classification layer based on the language classification results output by the language classification layer and the language labels, thereby completing the language classification training. The embodiments of the invention also provide a training system for the hybrid model. By combining speech recognition and language classification, the embodiments simplify the system structure, save training cost, and improve the overall performance of the hybrid model.

Description

Training method and system of mixed model for speech recognition and language classification
Technical Field
The invention relates to the field of speech recognition, and in particular to a training method and training system for a hybrid model for speech recognition and language classification.
Background
For multilingual speech recognition, separate dialect and Mandarin speech recognition (ASR) modules and a language identification module are typically trained with neural networks on existing dialect and Mandarin audio. Audio fed into such a system must first pass through the language identification module to determine which language it belongs to; the corresponding ASR module is then called to convert the sound into text, which in turn interacts with other modules (such as semantic understanding and speech synthesis).
In the course of implementing the invention, the inventors found at least the following problems in the related art:
(1) High training and deployment costs
Multiple dialect and Mandarin speech recognition (ASR) modules and language identification modules must be prepared separately; training several models is time-consuming, and deploying several ASR resources online occupies considerable resources, so training and deployment costs are high.
(2) Interdependence between modules and mutual influence on performance
The accuracy of language identification affects the performance of the subsequent speech recognition, which places high demands on the language module: if the language is identified incorrectly, speech recognition performance is very likely to be poor, which in turn degrades the accuracy of the modules downstream of recognition.
(3) Poor integrability
An ASR module or language module on its own is hard to turn into a genuinely usable product; in most cases it must be combined with other models (such as semantic understanding, speech synthesis, and dialogue systems). These models often need the recognized text and the language information at the same time, but the serial structure above cannot output both simultaneously and therefore cannot meet the requirement, so integrability is poor.
Disclosure of Invention
Embodiments of the invention at least solve the problems in the prior art of high training and deployment cost, interdependent modules whose performance affects one another, and poor integrability.
In a first aspect, an embodiment of the present invention provides a training method for a hybrid model for speech recognition and language classification, where the hybrid model is a deep neural network having N intermediate layers, and the Nth intermediate layer branches into a speech recognition layer and a language classification layer, the speech recognition layer outputting a speech recognition result and the language classification layer outputting a language classification result. The training method includes:
performing feature extraction and data alignment on mixed training audio data with text labels and language labels to determine input data for training;
inputting the input data for training into the N intermediate layers, performing speech recognition training based on the speech recognition result output by the speech recognition layer and the text label, and training the neural network parameters of the N intermediate layers and the speech recognition layer;
and, after the speech recognition training is finished, training only the neural network parameters of the language classification layer based on the language classification result output by the language classification layer and the language label, thereby completing the language classification training.
In a second aspect, an embodiment of the present invention provides a training system for a hybrid model for speech recognition and language classification, where the hybrid model is a deep neural network having N intermediate layers, and the Nth intermediate layer branches into a speech recognition layer and a language classification layer, the speech recognition layer outputting a speech recognition result and the language classification layer outputting a language classification result. The training system includes:
an input data determination program module, configured to perform feature extraction and data alignment on mixed training audio data with text labels and language labels to determine input data for training;
an output program module, configured to input the input data for training into the N intermediate layers, perform speech recognition training based on the speech recognition result output by the speech recognition layer and the text label, and train the neural network parameters of the N intermediate layers and the speech recognition layer;
and a training program module, configured to train, after the speech recognition training is finished, only the neural network parameters of the language classification layer based on the language classification result output by the language classification layer and the language label, thereby completing the language classification training.
In a third aspect, an embodiment of the present invention provides an electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the training method for a hybrid model for speech recognition and language classification according to any embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention provides a storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the steps of the training method for a hybrid model for speech recognition and language classification according to any embodiment of the present invention.
The embodiments of the invention have the following beneficial effects: speech recognition and language classification are combined, and the parameters of the speech recognition part are not affected during language classification training, so the language classification information can additionally be output while the speech recognition performance remains unchanged. This achieves the effect of merging the two tasks, simplifies the system structure, saves training cost, and at the same time improves the overall performance of the hybrid model.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flow chart of a method for training a hybrid model for speech recognition and language classification according to an embodiment of the present invention;
FIG. 2 is a flow chart of a training phase of a method for training a hybrid model for speech recognition and language classification according to an embodiment of the present invention;
FIG. 3 is a network structure diagram of a training method for a hybrid model of speech recognition and language classification according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a training system for a hybrid model of speech recognition and language classification according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
FIG. 1 is a flowchart of a training method for a hybrid model for speech recognition and language classification according to an embodiment of the present invention, where the hybrid model is a deep neural network having N intermediate layers, and the Nth intermediate layer branches into a speech recognition layer and a language classification layer, the speech recognition layer outputting a speech recognition result and the language classification layer outputting a language classification result. The method includes the following steps:
S11: performing feature extraction and data alignment on mixed training audio data with text labels and language labels to determine input data for training;
S12: inputting the input data for training into the N intermediate layers, performing speech recognition training based on the speech recognition result output by the speech recognition layer and the text label, and training the neural network parameters of the N intermediate layers and the speech recognition layer;
S13: after the speech recognition training is finished, training only the neural network parameters of the language classification layer based on the language classification result output by the language classification layer and the language label, thereby completing the language classification training.
In this embodiment, the speech recognition module and the language classification module are combined, and the resulting hybrid module can output the recognized text and the language information at the same time. The focus here is on training the language part; how the speech recognition model is trained is not restricted. As shown in FIG. 2, training comprises four stages: data preparation, feature extraction, data alignment, and model training.
For step S11, training the hybrid module requires data preparation. Audio with text labels and language labels must be prepared; it may be labeled manually or by other means. What must be determined is the correct transcript of each utterance and whether it is dialect or Mandarin. The higher the labeling accuracy, the better, as this benefits the later model training.
After a large amount of labeled audio data has been collected and the wav files and their corresponding text labels organized, features are extracted from the audio; FBANK features are adopted. More specifically, performing feature extraction on the mixed training audio data with text labels and language labels includes:
framing the mixed training audio data with a window of 25 ms frame length and 10 ms frame shift, and determining the m-dimensional FBANK features and the Mel-frequency cepstral coefficient features of each frame in the mixed training audio data. (The parameters given here are well-established values in the speech recognition field and may be generalized: the features may be FBANK (filter bank), MFCC (Mel-frequency cepstral coefficients), or PLP (perceptual linear prediction cepstral coefficients); the frame length may be 20-40 ms and the frame shift 10-20 ms.) The data alignment is then determined from the Mel-frequency cepstral coefficient features.
As an embodiment, performing feature extraction and data alignment on the mixed training audio data with text labels and language labels to determine input data for training includes:
performing feature extraction on the mixed training audio with text labels and language labels, and determining the m-dimensional FBANK features and the Mel-frequency cepstral coefficient features of each frame in the mixed training audio, where the mixed training audio contains audio in multiple languages, the languages including Mandarin and dialects;
and performing supervised training with the mixed training audio and the n-dimensional Mel-frequency cepstral coefficient features of each frame, to determine the data alignment of each frame.
In this embodiment, supervised training requires knowing the phoneme and the language information on each frame of each audio clip. This step may train a Gaussian mixture model (GMM; a neural network model could also be trained to generate the alignment, and the method is not limited). Mel-frequency cepstral coefficient (MFCC) features are extracted from the audio: the audio is framed with a window of 25 ms frame length and 10 ms frame shift, the n-dimensional MFCC features of each frame are extracted, and a pronunciation dictionary is prepared (the pronunciation dictionary is a pre-prepared union of the phoneme sets of the dialect speech and the Mandarin speech). The corresponding GMM model is trained from the MFCC features and the pronunciation dictionary and is then used to generate the corresponding frame-level data alignment. For speech recognition, the alignment of each frame is a phoneme; for language classification, the alignment of each frame is a language class or silence.
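A minimal sketch of how such a frame-level alignment yields the two per-frame target streams follows; the phoneme names and the phoneme-to-language table are hypothetical stand-ins for the union pronunciation dictionary:

```python
# Sketch (hypothetical phoneme inventory): derive per-frame ASR and language
# targets from a frame-level phoneme alignment produced by a forced aligner.
SIL = "sil"

# Hypothetical mapping from each phoneme in the union dictionary to its
# language class; silence maps to its own class, as described above.
phone2lang = {
    "a_man": "mandarin", "b_man": "mandarin",   # Mandarin phonemes
    "a_dia": "dialect",  "b_dia": "dialect",    # dialect phonemes
    SIL: "silence",
}

def frame_targets(phone_alignment):
    """phone_alignment: one phoneme label per 10 ms frame."""
    asr_targets = list(phone_alignment)                       # phoneme per frame
    lang_targets = [phone2lang[p] for p in phone_alignment]   # language/silence per frame
    return asr_targets, lang_targets

asr_t, lang_t = frame_targets(["sil", "a_man", "b_man", "sil"])
# lang_t -> ['silence', 'mandarin', 'mandarin', 'silence']
```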
For step S12, the m-dimensional FBANK features of each frame and the corresponding per-frame alignments prepared in step S11 are input to the N intermediate layers, whose structure is shown in FIG. 3; the intermediate layers of the neural network may adopt multiple layers of DNN (deep neural network), LSTM (long short-term memory network), FSMN (feedforward sequential memory network), and the like. The model has two outputs: one is the ASR output and the other is the language output, so the speech recognition result output by the speech recognition layer can be obtained. Because the data preparation provides the text label corresponding to each utterance, the text labels can be used as training targets from which the speech recognition output is learned, thereby improving the recognition accuracy. The speech recognition method here is only one training example, and the training scheme is not limited. In this way, the speech recognition parameters in the N intermediate layers are trained continuously.
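A minimal PyTorch sketch of the branched topology of FIG. 3 follows. The plain-DNN trunk, the layer sizes, and the output dimensions are assumptions of the sketch (the embodiment equally allows LSTM or FSMN intermediate layers):

```python
# Sketch (illustrative sizes): N shared intermediate layers branching into a
# speech recognition (ASR) head and a language classification head.
import torch
import torch.nn as nn

class HybridASRLangModel(nn.Module):
    def __init__(self, feat_dim=80, hidden=512, n_layers=5,
                 n_phones=200, n_langs=3):  # e.g. mandarin / dialect / silence
        super().__init__()
        layers, dim = [], feat_dim
        for _ in range(n_layers):            # the N shared intermediate layers
            layers += [nn.Linear(dim, hidden), nn.ReLU()]
            dim = hidden
        self.trunk = nn.Sequential(*layers)
        self.asr_head = nn.Linear(hidden, n_phones)   # speech recognition layer
        self.lang_head = nn.Linear(hidden, n_langs)   # language classification layer

    def forward(self, frames):               # frames: (batch, feat_dim)
        h = self.trunk(frames)                # output of the Nth intermediate layer
        return self.asr_head(h), self.lang_head(h)
```

A single forward pass thus yields the phoneme posteriors and the language posteriors at once, which is exactly the simultaneous text-plus-language output that the serial prior-art structure cannot provide.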
For step S13, after the speech recognition part has been trained, as one implementation, training based on the language classification result output by the language classification layer and the language label includes: performing classification optimization against the per-frame data alignment using maximum likelihood estimation under a cross-entropy training criterion, fitting the language classification result to the language label.
After the speech recognition performance meets the requirements, one more output, for the language, is added. When the language branch of the network is trained, the cross-entropy training criterion is adopted and MLE (maximum likelihood estimation) is used to optimize the classification of each frame, minimizing the per-frame classification error rate. The gradient propagates to and updates only the parameters of the branched language NN layer; the neural network layers of the speech recognition part receive no gradient updates, i.e., the parameters of that part of the network are unchanged. Since only the neural network parameters of the language part are trained, the language information can be output while the speech recognition performance remains unchanged.
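Continuing the sketch above (and reusing the hypothetical HybridASRLangModel), the two-stage schedule can be expressed by freezing everything except the language head in stage 2, so that gradients reach only the branched language layer:

```python
# Sketch: stage-2 training updates only the language classification layer.
import torch.nn.functional as F
import torch.optim as optim

model = HybridASRLangModel()

# Stage 1 (not shown): train trunk + asr_head on per-frame phoneme targets
# with a standard cross-entropy loss.

# Stage 2: freeze the trunk and the ASR head; only the language head learns.
for p in model.parameters():
    p.requires_grad = False
for p in model.lang_head.parameters():
    p.requires_grad = True

optimizer = optim.Adam(model.lang_head.parameters(), lr=1e-4)

def lang_step(frames, lang_targets):
    """One per-frame cross-entropy step on the language branch only.
    frames: (batch, feat_dim); lang_targets: (batch,) class indices."""
    _, lang_logits = model(frames)
    loss = F.cross_entropy(lang_logits, lang_targets)  # per-frame CE criterion
    optimizer.zero_grad()
    loss.backward()   # gradients flow only into lang_head parameters
    optimizer.step()
    return loss.item()
```

Because the shared trunk and the ASR head are frozen, the ASR output is identical before and after stage 2, matching the requirement that speech recognition performance stay unchanged.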
According to this embodiment, speech recognition and language classification are combined, and the parameters of the speech recognition part are not affected during language classification training, so the language classification information can additionally be output while the speech recognition performance remains unchanged. This merges the two tasks, simplifies the system structure, saves training cost, and improves the overall performance of the hybrid model.
FIG. 4 is a schematic structural diagram of a training system for a hybrid model for speech recognition and language classification according to an embodiment of the present invention; the system can execute the training method for a hybrid model for speech recognition and language classification of any of the above embodiments and is configured in a terminal.
The embodiment provides a training system of a hybrid model for speech recognition and language classification, which includes: an input data determination program module 11, an output program module 12 and a training program module 13.
The input data determination program module 11 is configured to perform feature extraction and data alignment on mixed training audio data with text labels and language labels, and determine input data for training; the output program module 12 is configured to input the input data for training into the N intermediate layers, perform speech recognition training based on the speech recognition result output by the speech recognition layer and the text label, and train the neural network parameters of the N intermediate layers and the speech recognition layer; the training program module 13 is configured to train, after the speech recognition training is completed, only the neural network parameters of the language classification layer based on the language classification result output by the language classification layer and the language label, thereby completing the language classification training.
Further, the input data determination program module is for:
performing feature extraction on the mixed training audio with text labels and language labels, and determining the m-dimensional FBANK features and the Mel-frequency cepstral coefficient features of each frame in the mixed training audio, where the mixed training audio contains audio in multiple languages, the languages including Mandarin and dialects;
and performing supervised training with the mixed training audio and the n-dimensional Mel-frequency cepstral coefficient features of each frame, to determine the data alignment of each frame.
Further, the training program module is to:
performing classification optimization against the per-frame data alignment using maximum likelihood estimation under a cross-entropy training criterion, fitting the language classification result to the language label.
Further, the input data determination program module is for:
framing the mixed training audio data with a window of 25 ms frame length and 10 ms frame shift, and determining the m-dimensional FBANK features and the Mel-frequency cepstral coefficient features of each frame in the mixed training audio data.
Further, the structure of the N intermediate layers comprises at least: a deep neural network, a long short-term memory network, and a feedforward sequential memory network.
An embodiment of the present invention also provides a non-volatile computer storage medium storing computer-executable instructions that can execute the training method for a hybrid model for speech recognition and language classification in any of the above method embodiments;
as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
performing feature extraction and data alignment on mixed training audio data with text labels and language labels to determine input data for training;
inputting the input data for training into the N intermediate layers, performing speech recognition training based on the speech recognition result output by the speech recognition layer and the text label, and training the neural network parameters of the N intermediate layers and the speech recognition layer;
and, after the speech recognition training is finished, training only the neural network parameters of the language classification layer based on the language classification result output by the language classification layer and the language label, thereby completing the language classification training.
As a non-volatile computer-readable storage medium, it may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the methods in the embodiments of the present invention. One or more program instructions are stored in the non-volatile computer-readable storage medium and, when executed by a processor, perform the training method for a hybrid model for speech recognition and language classification in any of the method embodiments described above.
The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the device, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory located remotely from the processor, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
An embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method for training a hybrid model for speech recognition and language classification according to any embodiment of the present invention.
The client of the embodiment of the present application exists in various forms, including but not limited to:
(1) Mobile communication devices: such devices are characterized by mobile communication capability and are primarily aimed at providing voice and data communication. Such terminals include smart phones, multimedia phones, feature phones, and low-end phones.
(2) Ultra-mobile personal computer devices: these belong to the category of personal computers, have computing and processing functions, and generally also support mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as tablet computers.
(3) Portable entertainment devices: such devices can display and play multimedia content. They include audio and video players, handheld game consoles, e-book readers, smart toys, and portable in-vehicle navigation devices.
(4) Other electronic devices with data processing capabilities.
In this document, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A training method for a hybrid model for speech recognition and language classification, wherein the hybrid model is a deep neural network having N intermediate layers, and the Nth intermediate layer branches into a speech recognition layer outputting a speech recognition result and a language classification layer outputting a language classification result, the training method comprising:
performing feature extraction and data alignment on mixed training audio data with text labels and language labels to determine input data for training;
inputting the input data for training into the N intermediate layers, performing speech recognition training based on the speech recognition result output by the speech recognition layer and the text label, and training the neural network parameters of the N intermediate layers and the speech recognition layer;
and, after the speech recognition training is finished, training only the neural network parameters of the language classification layer based on the language classification result output by the language classification layer and the language label, thereby completing the language classification training.
2. The method of claim 1, wherein the performing feature extraction and data alignment on the mixed training audio data with text labels and language labels comprises:
performing feature extraction on the mixed training audio with text labels and language labels, and determining the m-dimensional FBANK features and the Mel-frequency cepstral coefficient features of each frame in the mixed training audio, where the mixed training audio contains audio in multiple languages, the languages including Mandarin and dialects;
and performing supervised training with the mixed training audio and the n-dimensional Mel-frequency cepstral coefficient features of each frame, to determine the data alignment of each frame.
3. The method of claim 1, wherein training only the neural network parameters of the language classification layer based on the language classification result output by the language classification layer and the language label comprises:
performing classification optimization against the per-frame data alignment using maximum likelihood estimation under a cross-entropy training criterion, fitting the language classification result to the language label.
4. The method of claim 1, wherein the feature extraction of the mixed training audio data with text labels and language labels comprises:
framing the mixed training audio data with a window of 25 ms frame length and 10 ms frame shift, and determining the m-dimensional FBANK features and the Mel-frequency cepstral coefficient features of each frame in the mixed training audio data.
5. The method of claim 1, wherein the structure of the N intermediate layers comprises at least: a deep neural network, a long short-term memory network, and a feedforward sequential memory network.
6. A training system for a hybrid model for speech recognition and language classification, wherein the hybrid model is a deep neural network having N intermediate layers, and the Nth intermediate layer branches into a speech recognition layer and a language classification layer, the speech recognition layer outputting a speech recognition result and the language classification layer outputting a language classification result, the training system comprising:
an input data determination program module, configured to perform feature extraction and data alignment on mixed training audio data with text labels and language labels to determine input data for training;
an output program module, configured to input the input data for training into the N intermediate layers, perform speech recognition training based on the speech recognition result output by the speech recognition layer and the text label, and train the neural network parameters of the N intermediate layers and the speech recognition layer;
and a training program module, configured to train, after the speech recognition training is finished, only the neural network parameters of the language classification layer based on the language classification result output by the language classification layer and the language label, thereby completing the language classification training.
7. The system of claim 6, wherein the input data determination program module is to:
performing feature extraction on the mixed training audio with text labels and language labels, and determining the m-dimensional FBANK features and the Mel-frequency cepstral coefficient features of each frame in the mixed training audio, where the mixed training audio contains audio in multiple languages, the languages including Mandarin and dialects;
and performing supervised training with the mixed training audio and the n-dimensional Mel-frequency cepstral coefficient features of each frame, to determine the data alignment of each frame.
8. The system of claim 6, wherein the training program module is to:
performing classification optimization against the per-frame data alignment using maximum likelihood estimation under a cross-entropy training criterion, fitting the language classification result to the language label.
9. The system of claim 6, wherein the input data determination program module is to:
framing the mixed training audio data with a window of 25 ms frame length and 10 ms frame shift, and determining the m-dimensional FBANK features and the Mel-frequency cepstral coefficient features of each frame in the mixed training audio data.
10. The system of claim 6, wherein the structure of the N intermediate layers comprises at least: a deep neural network, a long short-term memory network, and a feedforward sequential memory network.
CN202010739233.9A 2020-07-28 2020-07-28 Training method and system of mixed model for speech recognition and language classification Withdrawn CN111833844A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010739233.9A CN111833844A (en) 2020-07-28 2020-07-28 Training method and system of mixed model for speech recognition and language classification


Publications (1)

Publication Number Publication Date
CN111833844A true CN111833844A (en) 2020-10-27

Family

ID=72919152

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010739233.9A Withdrawn CN111833844A (en) 2020-07-28 2020-07-28 Training method and system of mixed model for speech recognition and language classification

Country Status (1)

Country Link
CN (1) CN111833844A (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104143327A (en) * 2013-07-10 2014-11-12 腾讯科技(深圳)有限公司 Acoustic model training method and device
CN109326277A (en) * 2018-12-05 2019-02-12 四川长虹电器股份有限公司 Semi-supervised phoneme forces alignment model method for building up and system
CN110033760A (en) * 2019-04-15 2019-07-19 北京百度网讯科技有限公司 Modeling method, device and the equipment of speech recognition
CN110349564A (en) * 2019-07-22 2019-10-18 苏州思必驰信息科技有限公司 Across the language voice recognition methods of one kind and device
CN110517664A (en) * 2019-09-10 2019-11-29 科大讯飞股份有限公司 Multi-party speech recognition methods, device, equipment and readable storage medium storing program for executing
CN110930980A (en) * 2019-12-12 2020-03-27 苏州思必驰信息科技有限公司 Acoustic recognition model, method and system for Chinese and English mixed speech


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113129925A (en) * 2021-04-20 2021-07-16 深圳追一科技有限公司 Mouth action driving model training method and assembly based on VC model
CN113077781A (en) * 2021-06-04 2021-07-06 北京世纪好未来教育科技有限公司 Voice recognition method and device, electronic equipment and storage medium
CN113077781B (en) * 2021-06-04 2021-09-07 北京世纪好未来教育科技有限公司 Voice recognition method and device, electronic equipment and storage medium
CN113327596A (en) * 2021-06-17 2021-08-31 北京百度网讯科技有限公司 Training method of voice recognition model, voice recognition method and device
CN114596845A (en) * 2022-04-13 2022-06-07 马上消费金融股份有限公司 Training method of voice recognition model, voice recognition method and device
WO2023231576A1 (en) * 2022-05-30 2023-12-07 京东科技信息技术有限公司 Generation method and apparatus for mixed language speech recognition model


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province
Applicant after: Sipic Technology Co.,Ltd.
Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province
Applicant before: AI SPEECH Co.,Ltd.
WW01 Invention patent application withdrawn after publication
Application publication date: 20201027