CN111816160A - Mandarin and Cantonese mixed speech recognition model training method and system


Info

Publication number: CN111816160A
Application number: CN202010737658.6A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: training, network layers, model, mandarin, speech recognition
Legal status: Withdrawn
Inventors: 朱森, 钱彦旻, 陆一帆, 陈梦姣
Current Assignee: AI Speech Ltd
Original Assignee: AI Speech Ltd
Application filed by AI Speech Ltd
Priority to CN202010737658.6A
Publication of CN111816160A


Classifications

    • G10L15/005 Language recognition
    • G10L15/02 Feature extraction for speech recognition; selection of recognition unit
    • G10L15/063 Training (creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/26 Speech to text systems
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units
    (All under G PHYSICS; G10 MUSICAL INSTRUMENTS; ACOUSTICS; G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING; G10L15/00 Speech recognition.)

Abstract

The invention discloses a method for training a Mandarin and Cantonese mixed speech recognition model, comprising: training a multitask model with mixed speech training samples of N languages, wherein the multitask model comprises a plurality of shared network layers and N task neural network layers that are connected to the last shared network layer and correspond to the N languages; and migrating the network parameters of the shared network layers to a speech recognition model to be trained, so as to complete its training. The embodiment of the invention first trains the multitask model with multilingual mixed speech training samples, then reuses the multitask model's network parameters through parameter migration, and trains a Mandarin and Cantonese mixed speech recognition model based on joint Mandarin-Cantonese modeling. This solves the problem of mixed Mandarin and Cantonese speech recognition without heavily modifying the original recognition service, reuses existing achievements, and reduces both model training cost and service development cost.

Description

Mandarin and Cantonese mixed speech recognition model training method and system
Technical Field
The invention relates to the technical field of speech recognition, and in particular to a method and system for training a Mandarin and Cantonese mixed speech recognition model.
Background
With the continuous development of mobile terminal devices and speech recognition technology, several mixed Mandarin-and-dialect speech recognition schemes have appeared, such as the iFlytek voice input method, the Baidu voice input method, the Sogou input method, and Alibaba's intelligent customer service, all of which offer mixed Mandarin and dialect speech recognition.
The existing solutions are all algorithms based on deep learning frameworks; they use different acoustic modeling units according to their respective circumstances and recognize multiple languages simultaneously through different acoustic training processes and algorithms.
Two solutions are common. The first uses a language classifier to determine which language the speech belongs to, and then feeds the speech into the corresponding speech recognition system, as shown in Fig. 1. However, this approach must introduce a language identification module to classify the languages, so the recognition result depends on that module's classification performance, and recognition degrades when the classifier is unstable. Because the overall accuracy compounds the accuracy of language classification with the accuracy of the downstream recognizer, it is lower than that of a single recognition system, and the scheme is difficult to make robust across varied scenarios. Moreover, several speech recognition systems must be deployed on the server, so the engineering cost is high.
The second scheme adopts mixed speech recognition: the modeling units of multiple languages are merged, the audio data and text data of the different languages are mixed, and the conventional training flow is reused for mixed recognition; alternatively, the dictionaries, training data, and corpus texts of the languages can be mixed before reusing the conventional training flow. This mixed recognition method is easy to implement and has low engineering cost. However, because it mixes the training data of several languages, balanced data across all languages is hard to achieve in practice, and the languages also differ in pronunciation. When the data volumes are unbalanced or poorly chosen, the distribution of pronunciation phonemes across languages in the training set becomes skewed, the trained model is biased toward the language with more data, and the performance of the mixed system drops sharply compared with a per-language system; the overall recognition rate struggles to serve every language well, and the ASR (Automatic Speech Recognition) performance loss is large.
Disclosure of Invention
The embodiment of the invention provides a Mandarin and Cantonese mixed speech recognition model training method and system, and a Mandarin and Cantonese mixed speech recognition method and system, to solve at least one of the above technical problems.
In a first aspect, an embodiment of the present invention provides a method for training a Mandarin and Cantonese mixed speech recognition model, comprising:
training a multitask model with mixed speech training samples of N languages, wherein the multitask model comprises a plurality of shared network layers and N task neural network layers that are connected to the last shared network layer and correspond to the N languages;
and migrating the network parameters of the plurality of shared network layers to a speech recognition model to be trained, so as to complete the training of the speech recognition model to be trained.
In a second aspect, an embodiment of the present invention provides a Mandarin and Cantonese mixed speech recognition method, comprising: inputting mixed Mandarin and Cantonese speech into the speech recognition model obtained by the Mandarin and Cantonese mixed speech recognition model training method of the embodiment of the invention for mixed speech recognition.
In a third aspect, an embodiment of the present invention provides a system for training a Mandarin and Cantonese mixed speech recognition model, comprising:
the multitask model training module, used for training a multitask model with mixed speech training samples of N languages, wherein the multitask model comprises a plurality of shared network layers and N task neural network layers that are connected to the last shared network layer and correspond to the N languages;
and the speech recognition model training module, used for migrating the network parameters of the plurality of shared network layers to a speech recognition model to be trained, so as to complete the training of the speech recognition model to be trained.
In a fourth aspect, an embodiment of the present invention provides a Mandarin and Cantonese mixed speech recognition system, comprising:
the speech recognition model, obtained by training with the above Mandarin and Cantonese mixed speech recognition model training method;
and the speech input module, used for inputting mixed Mandarin and Cantonese speech into the speech recognition model for mixed speech recognition.
In a fifth aspect, an electronic device is provided, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform any of the Mandarin and Cantonese mixed speech recognition methods of the present invention described above.
In a sixth aspect, an embodiment of the present invention provides a storage medium storing one or more programs comprising execution instructions, where the execution instructions can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device) to perform any one of the Mandarin and Cantonese mixed speech recognition methods of the present invention.
The embodiment of the invention has the following beneficial effects: a multitask model is first trained with mixed speech training samples of multiple languages; the network parameters of the multitask model are then reused through parameter migration, and a Mandarin and Cantonese mixed speech recognition model is trained based on joint Mandarin-Cantonese modeling. This solves the problem of mixed Mandarin and Cantonese speech recognition without heavily modifying the original recognition service, reuses existing achievements, and reduces both model training cost and service development cost.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The following drawings depict only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a diagram illustrating a prior art approach for hybrid speech recognition using a language classifier;
FIG. 2 is a flowchart of one embodiment of the Mandarin and Cantonese mixed speech recognition model training method of the present invention;
FIG. 3 is a functional block diagram of an embodiment of the Mandarin and Cantonese mixed speech recognition model training system of the present invention;
FIG. 4 is a functional block diagram of one embodiment of the Mandarin and Cantonese mixed speech recognition system of the present invention;
FIG. 5 is a schematic diagram of an embodiment of the Mandarin and Cantonese mixed speech recognition model training method of the present invention;
FIG. 6 is a schematic diagram of another embodiment of the Mandarin and Cantonese mixed speech recognition model training method of the present invention;
FIG. 7 is a flow diagram of one embodiment of the Mandarin and Cantonese mixed speech recognition method of the present invention;
FIG. 8 is a schematic structural diagram of an embodiment of an electronic device of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
As used in this disclosure, "module," "device," "system," and the like are intended to refer to a computer-related entity, either hardware, a combination of hardware and software, or software in execution. In particular, for example, an element may be, but is not limited to being, a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. Also, an application or script running on a server, or a server, may be an element. One or more elements may be in a process and/or thread of execution and an element may be localized on one computer and/or distributed between two or more computers and may be operated by various computer-readable media. The elements may also communicate by way of local and/or remote processes based on a signal having one or more data packets, e.g., from a data packet interacting with another element in a local system, distributed system, and/or across a network in the internet with other systems by way of the signal.
Finally, it should also be noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between them. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but possibly other elements not expressly listed or inherent to it. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
FIG. 2 is a flow chart of an embodiment of the method for training a Mandarin and Cantonese mixed speech recognition model according to the present invention, wherein the method comprises:
s10, training a multitask model by adopting a mixed voice training sample of N languages, wherein the multitask model comprises a plurality of shared network layers and N task neural network layers which are connected with the last layer of the shared network layers and correspond to the N languages;
and S20, transferring the network parameters of the plurality of shared network layers to a speech recognition model to be trained so as to complete the training of the speech recognition model to be trained.
The embodiment of the invention first trains the multitask model with multilingual mixed speech training samples, then reuses the multitask model's network parameters through parameter migration, and trains a Mandarin and Cantonese mixed speech recognition model based on joint Mandarin-Cantonese modeling. This solves the problem of mixed Mandarin and Cantonese speech recognition without heavily modifying the original recognition service, reuses existing achievements, and reduces both model training cost and service development cost.
In this multilingual multitask training mode, the commonality among different languages is learned through the parameters of the shared network layers, while the individuality of each language is learned through the output of its task-specific layer (i.e., its task neural network layer); the trained network therefore covers the pronunciation of a wide variety of phonemes, and the model is more robust. Through transfer learning, the trained model parameters are migrated to a new model to assist its training; the learned parameters are shared with the new model, which speeds up and optimizes its learning. The modeling units of Mandarin and Cantonese are integrated, and sharing larger-granularity modeling units avoids the insufficient training of model parameters caused by characters that appear too rarely in the training text.
In some embodiments, training the multitask model with the mixed speech training samples of the N languages comprises:
training network parameters of the N task neural network layers based on N loss functions corresponding to the N task neural network layers;
collectively training network parameters of the plurality of shared network layers based on at least N loss functions corresponding to the N task neural network layers.
In some embodiments, the multitask model further comprises a language classification network layer connected to the last layer of the plurality of shared network layers;
and co-training the network parameters of the plurality of shared network layers based at least on the N loss functions corresponding to the N task neural network layers comprises:
collectively training the network parameters of the plurality of shared network layers based on N loss functions corresponding to the N task neural network layers and a loss function corresponding to the language classification network layer.
In some embodiments, collectively training the network parameters of the plurality of shared network layers based on the N loss functions corresponding to the N task neural network layers and the loss functions corresponding to the language classification network layers comprises: training network parameters of the plurality of shared network layers based on a weighted sum of the N loss functions corresponding to the N task neural network layers and the loss functions corresponding to the language classification network layers.
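To make the structure above concrete, the following is a minimal PyTorch sketch of a shared-trunk multitask acoustic model with N task heads and a language classification head. It is an illustration only, not the patent's reference implementation; the layer count, hidden size, N = 3, and per-language phoneme inventory sizes are assumptions.

    # Minimal sketch (assumptions as noted above): shared network layers plus
    # N task neural network layers and one language classification (LID) head.
    import torch.nn as nn

    class MultiTaskAcousticModel(nn.Module):
        def __init__(self, feat_dim=40, hidden=512, n_shared=4,
                     phone_sizes=(100, 120, 90)):      # N = 3 languages
            super().__init__()
            layers, in_dim = [], feat_dim
            for _ in range(n_shared):                  # shared network layers
                layers += [nn.Linear(in_dim, hidden), nn.ReLU()]
                in_dim = hidden
            self.shared = nn.Sequential(*layers)
            # One task neural network layer per language, attached to the
            # last shared layer; each predicts that language's phonemes.
            self.task_heads = nn.ModuleList(
                [nn.Linear(hidden, p) for p in phone_sizes])
            # Language classification layer, also on the last shared layer.
            self.lid_head = nn.Linear(hidden, len(phone_sizes))

        def forward(self, feats):                      # (batch, frames, feat_dim)
            h = self.shared(feats)
            return [head(h) for head in self.task_heads], self.lid_head(h)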
In some embodiments, the present invention further provides a Mandarin and Cantonese mixed speech recognition method, comprising: inputting mixed Mandarin and Cantonese speech into the speech recognition model obtained by the Mandarin and Cantonese mixed speech recognition model training method of any embodiment of the invention for mixed speech recognition.
Referring to FIG. 3, a schematic block diagram of an embodiment of a Mandarin and Cantonese mixed speech recognition model training system 300 according to the present invention is shown, wherein the system comprises:
a multitask model training module 310, configured to train a multitask model with a mixed speech training sample of N languages, where the multitask model includes a plurality of shared network layers and N task neural network layers corresponding to the N languages, and the N task neural network layers are connected to a last layer of the plurality of shared network layers;
and a speech recognition model training module 320, configured to migrate the network parameters of the multiple shared network layers to a speech recognition model to be trained, so as to complete training of the speech recognition model to be trained.
The embodiment of the invention first trains the multitask model with multilingual mixed speech training samples, then reuses the multitask model's network parameters through parameter migration, and trains a Mandarin and Cantonese mixed speech recognition model based on joint Mandarin-Cantonese modeling. This solves the problem of mixed Mandarin and Cantonese speech recognition without heavily modifying the original recognition service, reuses existing achievements, and reduces both model training cost and service development cost.
In some embodiments, training the multitask model with the mixed speech training samples of the N languages comprises:
training network parameters of the N task neural network layers based on N loss functions corresponding to the N task neural network layers;
collectively training network parameters of the plurality of shared network layers based on at least N loss functions corresponding to the N task neural network layers.
In some embodiments, the multitask model further comprises a language classification network layer connected to the last layer of the plurality of shared network layers;
and co-training the network parameters of the plurality of shared network layers based at least on the N loss functions corresponding to the N task neural network layers comprises:
collectively training the network parameters of the plurality of shared network layers based on N loss functions corresponding to the N task neural network layers and a loss function corresponding to the language classification network layer.
In some embodiments, collectively training the network parameters of the plurality of shared network layers based on the N loss functions corresponding to the N task neural network layers and the loss functions corresponding to the language classification network layers comprises: training network parameters of the plurality of shared network layers based on a weighted sum of the N loss functions corresponding to the N task neural network layers and the loss functions corresponding to the language classification network layers.
As shown in FIG. 4, which is a schematic block diagram of an embodiment of a Mandarin and Cantonese mixed speech recognition system of the present invention, the system 400 comprises:
the speech recognition model 410, obtained with the Mandarin and Cantonese mixed speech recognition model training method according to any embodiment of the invention;
and the speech input module 420, used for inputting mixed Mandarin and Cantonese speech into the speech recognition model for mixed speech recognition.
In order to demonstrate more intuitively the technical contribution of the present invention relative to the prior art, the invention is further described below with reference to specific embodiments.
The scheme mainly comprises three steps:
(1) Data preparation
Mixing the audio data and text data of Mandarin and Cantonese respectively, merging the dictionaries, and deriving the modeling units according to the modeling-unit construction method described below; meanwhile, since dialect data are scarce, the data are expanded through signal processing, real-device transcription, online crawling, and the like;
extracting features from the audio: FBANK features are adopted, the audio is framed with a window of 25 ms frame length and 10 ms frame shift, and 40-dimensional FBANK features are extracted from each frame to train the neural network;
since the neural network is trained with N languages, the features and annotated texts of the N languages must each be prepared; the features of the N languages are then combined and randomly shuffled to ensure the randomness of the training model's input features.
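As a concrete illustration of this preparation step, the sketch below extracts the 25 ms / 10 ms, 40-dimensional FBANK features with torchaudio's Kaldi-compatible frontend and then shuffles the pooled utterances. The toolchain and the corpus_entries manifest are assumptions; any equivalent feature frontend would serve.

    # Sketch of FBANK extraction and shuffling (paths/manifest are placeholders).
    import random
    import torchaudio
    import torchaudio.compliance.kaldi as kaldi

    def extract_fbank(wav_path):
        waveform, sr = torchaudio.load(wav_path)   # (channels, samples)
        return kaldi.fbank(waveform, sample_frequency=sr,
                           frame_length=25.0,      # 25 ms window
                           frame_shift=10.0,       # 10 ms shift
                           num_mel_bins=40)        # -> (frames, 40)

    # Hypothetical manifest: (wav path, annotated text, language id 1..N).
    corpus_entries = [("mandarin/utt001.wav", "...", 1),
                      ("cantonese/utt001.wav", "...", 2)]
    samples = [(extract_fbank(p), text, lang) for p, text, lang in corpus_entries]
    random.shuffle(samples)                        # randomize model input order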
(2) Multi-language multitask training
We use N languages for multitask training. In the supervised multitask training mode, the training criterion can use a frame-level CE (Cross Entropy) loss function or a sequence-level CTC (Connectionist Temporal Classification) loss function to train the model parameters. In the network structure illustrated in FIG. 5, each output represents a language-specific training task whose output corresponds to that language's loss; so when the neural network has N language training tasks, there are N language loss outputs plus one language-identification (LID) loss output.
FIG. 5 is a schematic diagram of an embodiment of the Mandarin and Cantonese mixed speech recognition model training method of the present invention. The NN (Neural Network) layer represents a neural network layer, which may be a commonly used DNN (Deep Neural Network), LSTM (Long Short-Term Memory), FSMN (Feedforward Sequential Memory Network), etc.
Meanwhile, Language ID (LID) labels are introduced for each language and trained with the CE criterion, aiming to minimize the per-frame language discrimination error. This reduces language-domain classification errors, improves the network's robustness to different languages, mitigates the impact of unbalanced data across languages, and accelerates the convergence of the neural network.
Multi-language multi-tasking training process:
(a) First, the input data of the N languages' features are randomly shuffled; corresponding labels are prepared for each language according to its training criterion and modeling unit. Here, frame-level phonemes are used for modeling, yielding frame-level phoneme labels for each of the N languages; all labels are then combined and arranged in the input order of the training features;
here, the method of obtaining the LID labels is explained:
based on the prior knowledge of each sample's language category, different language categories are encoded with different numbers. For example, all sample data of language category 1 are labeled 1 along the length of their feature data, all sample data of language category 2 are labeled 2 along the length of their feature data, and so on, yielding the LID labels of each language, which are then arranged in the input order of the training features;
at this point we have prepared the feature input data, the phoneme labels of each language, and the LID labels of each language.
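A minimal sketch of this LID-label construction, assuming one feature tensor per utterance and the 1..N numbering used in the text:

    # Sketch: one frame-level LID label per feature frame of each utterance.
    import torch

    def make_lid_labels(utt_feats, utt_lang_ids):
        """utt_feats: list of (frames, 40) tensors; utt_lang_ids: ints 1..N."""
        return [torch.full((f.shape[0],), lid, dtype=torch.long)
                for f, lid in zip(utt_feats, utt_lang_ids)]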
(b) Each loss is then multiplied by a different scaling factor (α for the language tasks, β for the LID task) depending on the specific task, which avoids biasing the final model parameter update toward individual tasks:
Loss = α1·loss1(output1) + α2·loss2(output2) + … + αN·lossN(outputN) + β·loss(LID)
The loss(output) of each language is used to update that language's Task NN layer parameters, while all output losses jointly update the bottom shared-layer parameters. The bottom shared layers thus learn the characteristics of every language, so the trained network covers a wider range of human phoneme pronunciations. Moreover, because the training sets of different languages gather diverse environments, parameters learned for one language in a given environment also improve the recognition of other languages in that scenario; from the perspective of training-data robustness, the network parameters are therefore more robust.
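This update rule can be sketched as follows; it is illustrative only, the α/β values are assumptions, and for clarity the batch is assumed pre-split per language so that each language's frames reach only its own head (LID targets are assumed re-indexed from 0 for cross-entropy):

    # Sketch of one training step: per-language CE losses scaled by alpha_i,
    # the LID loss scaled by beta; the summed loss backpropagates through the
    # task heads, the LID head, and the shared trunk together.
    import torch.nn.functional as F

    alphas, beta = (1.0, 1.0, 1.0), 0.5            # assumed scaling factors

    def train_step(model, lang_batches, lid_feats, lid_targets, optimizer):
        """lang_batches: [(feats, frame_phoneme_targets)], one per language."""
        total = 0.0
        for i, (feats, targets) in enumerate(lang_batches):
            task_outputs, _ = model(feats)
            total = total + alphas[i] * F.cross_entropy(
                task_outputs[i].flatten(0, 1), targets.flatten())
        _, lid_logits = model(lid_feats)           # LID loss over all frames
        total = total + beta * F.cross_entropy(
            lid_logits.flatten(0, 1), lid_targets.flatten())
        optimizer.zero_grad()
        total.backward()                           # updates heads and shared layers
        optimizer.step()
        return float(total)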
(3) Transfer learning initialization training
FIG. 6 is a schematic diagram of another embodiment of a method for training a Mandarin and Cantonese mixed speech recognition model according to the present invention.
The shared-layer parameters of the multilingual multitask training network are taken as the initialization model for mixed Mandarin and Cantonese recognition; through transfer learning, the trained model parameters are migrated to the new model to assist its training. The model parameters learned from multiple languages are thus shared with the new Mandarin and Cantonese (dialect) model, which speeds up and optimizes the new model's learning.
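A minimal sketch of this migration step in PyTorch; the MandarinCantoneseModel class is hypothetical and only needs a shared trunk shaped like the multitask model's:

    # Sketch: warm-start the new model's shared layers with the learned
    # multitask parameters; its single output layer trains from scratch.
    mixed_model = MandarinCantoneseModel(feat_dim=40, hidden=512, n_units=5000)
    mixed_model.shared.load_state_dict(multitask_model.shared.state_dict())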
Mandarin and Cantonese mixed training process:
(a) Data preparation: different modeling units can be used depending on the task. Here the character modeling units of Mandarin and Cantonese are prepared with the modeling-unit integration method described below; the feature data of the two languages are mixed and randomly shuffled, and the prepared character labels of the two languages are arranged according to the feature data;
(b) Model parameters are updated with the CTC training criterion. This is consistent with the conventional training method and has only one network output, so the final trained model can recognize Mandarin and Cantonese simultaneously.
The dictionaries of the two languages are merged and the corpus texts are combined to train the language model, so that the acoustic model and the language model can recognize both languages simultaneously during decoding; compared with the previous single systems, the mixed recognition system loses no performance.
Through multilingual multitask learning, the invention trains N languages (N ≥ 3; e.g., Mandarin, Sichuanese, Cantonese, Shanghainese, foreign languages, etc.) in a multitask fashion, so that the commonality among different languages is learned through the multi-layer shared model parameters while the individuality of each language is learned through its own output layer; the trained network covers the pronunciation of a wide variety of phonemes, and the model is more robust;
through transfer learning, the trained model parameters are migrated to a new model to assist its training; the learned parameters are shared with the new model, which speeds up and optimizes its learning;
a modeling unit integrating Mandarin and Cantonese: modeling is performed at the character level. From the Mandarin and Cantonese dictionaries respectively, the word frequency of each character under the same pinyin pronunciation is counted in the training text. Characters whose frequency exceeds a threshold are modeled as independent modeling units, while characters below the threshold are modeled uniformly by the high-frequency character of the same pronunciation, yielding character-level modeling units for Mandarin and for Cantonese. The identical character units in the two sets are then merged and shared while the differing characters are kept separate, giving a set of mixed Mandarin-Cantonese modeling units. Sharing these larger-granularity modeling units avoids the insufficient training of model parameters caused by characters that appear too rarely in the training text.
The model trained in this way not only recognizes the two languages simultaneously, but also requires no framework change in the kernel engineering, and its performance can reach or even exceed that of a single-language recognition system.
The invention provides a multilingual multitask training process and a processing method for mixing Mandarin and Cantonese modeling units, which solve the problem of mixed Mandarin and Cantonese recognition and realize mixed recognition of arbitrary utterances across languages, without heavily modifying the original recognition service; existing achievements are reused, and model training cost and service development cost are greatly reduced.
It should be noted that the present invention is not readily apparent to those skilled in the art; in fact, the inventors tried at least the following earlier versions in the course of making the invention:
Decoding the same audio with Mandarin and Cantonese recognition resources separately, then determining the final text result from the confidence scores or semantic analysis of the two recognition results. This method requires no complex model training and is simple to implement, needing only some post-processing at the recognition back end. However, it is not general, and high engineering cost and wasted resources remain.
In acoustic model training, directly merging the original modeling units of the two languages before training: this is the most obvious and easily implemented method, but it does not share modeling units across languages, and the final performance loss is large.
In fact, the invention has at least the following advantages compared with the prior art and various technical solutions tried by the inventor in the process of invention creation:
(1) the multilingual multitask training mode lets the commonality among different languages be learned through the multi-layer shared model parameters, while the individuality of each language is learned through its own output layer, so the trained network covers the pronunciation of a wide variety of phonemes and the model is more robust;
(2) through transfer learning, the trained model parameters are migrated to a new model to assist its training; the learned parameters are shared with the new model, speeding up and optimizing its learning;
(3) a modeling unit integrating Mandarin and Cantonese: by sharing larger-granularity modeling units, the insufficient training of model parameters caused by characters that appear too rarely in the training text is avoided.
As shown in FIG. 7, an embodiment of the present invention provides a Mandarin and Cantonese mixed speech recognition method, comprising:
S71, training a multitask model with mixed speech training samples of N languages to obtain the parameter values of the multitask model; the multitask model has a plurality of language-shared network layers (a first layer to an nth layer) and N+1 parallel task-specific layers connected to the deepest language-shared layer, where both the language-shared layers and the task-specific layers are neural network layers and N ≥ 3;
S72, migrating the parameter values of the language-shared network layers to the Mandarin-Cantonese speech recognition model; the Mandarin-Cantonese speech recognition model has the same language-shared network layers as the multitask training model, plus a Mandarin-Cantonese recognition task-specific layer, which is a neural network layer connected to the deepest language-shared layer;
S73, jointly modeling Mandarin and Cantonese, and training the Mandarin-Cantonese speech recognition model;
and S74, recognizing mixed Mandarin and Cantonese speech based on the trained Mandarin-Cantonese speech recognition model.
The multitask training model has a plurality of language-shared network layers, whose depth (number of layers) can be set according to usage requirements. The first shared layer receives the input data, which comprise the feature data extracted from the speech training samples of the N languages, the phoneme labels of each language, and the Language ID (hereinafter LID) label of each speech training sample.
In this embodiment, the N languages may be Mandarin, Sichuanese, Cantonese, Shanghainese, and a foreign language. Audio of the N languages is acquired, each language corresponding to multiple recordings, together with the annotated texts; the audio is framed with a window of 25 ms frame length and 10 ms frame shift, and 40-dimensional FBANK features are extracted from each frame. The features extracted from the recordings of the N languages are combined and randomly shuffled to generate the training features, forming the training features of the mixed speech training samples of the N languages and ensuring the randomness of the input features during model training.
In this embodiment, frame-level phonemes are used as the modeling units for each language; the modeling function may be a conventional one in the art, yielding the frame-level phoneme labels of that language, after which all of a language's frame-level phoneme labels are combined. The phoneme labels of the N languages are arranged in the input order of the training features.
Since the language of each speech sample is known in advance (i.e., the prior information of each language category), the samples are encoded with different numbers, the same LID being assigned to samples of the same language. For example, the LIDs of all speech samples of language category 1 are set to 1 along the length of their feature data; the LIDs of all samples of language category 2 are set to 2 along the length of their feature data; and so on for every sample, after which the LIDs are arranged in the input order of the training features.
The multitask training model has a plurality of language-shared network layers (the first layer to the nth layer); their depth can be set according to usage requirements and computational capacity. Each of the language-shared layers may adopt the same neural network structure or different ones.
As shown in FIG. 5, the first shared layer (which may be a neural network layer) receives the input data; the computed data are fed into the second shared layer, whose output in turn becomes the input of the next shared layer, and so on, step by step, up to the deepest shared layer.
The language-shared network layers are employed so that what is shared between different languages can be learned by them.
After computation by the deepest language-shared layer, its output data are divided into N subsets by language category according to the prior language-category information, each subset corresponding to the output data of one language.
The deepest language-shared layer is connected to N+1 parallel task-specific layers, each of which is a neural network layer. Each of the first N task-specific layers trains on one subset and generates output data denoted Output_i, where 1 ≤ i ≤ N; the (N+1)-th task-specific layer trains on the output data of the deepest language-shared layer, and its output is the recognized LID.
Each task-specific layer is a neural network layer; the task-specific layers may adopt the same neural network structure or different ones, and a task-specific layer's structure may be the same as that of the shared network layers or different from every one of them.
By adopting the task-specific layers, the individuality of different languages is learned through each language's own task-specific layer, so the trained network covers the pronunciation of a wide variety of phonemes and is more robust.
The language-shared layers and the task-specific layers are trained in a supervised manner; the training criterion may use a frame-level CE (Cross Entropy) loss or a sequence-level CTC (Connectionist Temporal Classification) loss to train the model parameters. Each output Output_i corresponds to one language-specific training task, and the loss output of that language is computed from Output_i. In this embodiment there are N language-specific training tasks, producing N language loss outputs and one LID loss output; from these, the total loss can be computed.
To illustrate the LID loss output: suppose the input mixed speech training samples cover three languages, English, Sichuanese, and Mandarin, labeled 1, 2, and 3 respectively. Because the (N+1)-th task-specific layer trains with some deviation in accuracy, the LIDs it recognizes for the speech might be 1, 2, 4; the LID loss output can be computed from this recognition result.
In this embodiment, the (N+1)-th task-specific layer is trained with the CE criterion, aiming to minimize the per-frame language discrimination error, reduce language-domain classification errors, increase the network's robustness to different languages, mitigate the impact of unbalanced data across languages, and accelerate the convergence of the neural network.
In this embodiment, each loss is assigned a scaling factor (α_i for the language tasks, β for the LID task) according to the specific task, which avoids biasing the network-layer parameter updates toward individual tasks.
The total loss is calculated as:
Loss = α1·loss1(output1) + α2·loss2(output2) + … + αN·lossN(outputN) + β·loss(LID)
The loss output of each language iteratively updates the parameters of that language's task-specific layer, while the total loss iteratively updates the parameters of the language-shared layers. Through this learning scheme, the shared layers learn the characteristics of every language, so they cover a wider range of human phoneme pronunciations; moreover, since the training sample sets of different languages gather diverse environments, parameters learned for one language in a given environment also improve the recognition of other languages in that scenario, making the network parameters more robust.
In step S72, the parameter values of the language-shared network layers are migrated to the Mandarin-Cantonese speech recognition model. The Mandarin-Cantonese speech recognition model has the same language-shared network layers as the multitask training model, plus a Mandarin-Cantonese recognition task-specific layer, which is a neural network layer connected to the deepest language-shared layer.
The Mandarin-Cantonese speech recognition model recognizes mixed Mandarin and Cantonese speech. As shown in FIG. 4, it has the same language-shared network layers as the multitask training model and also has a Mandarin-Cantonese recognition task-specific layer, a neural network layer connected to the deepest language-shared layer. The number of layers and the per-layer structure of its language-shared network layers are respectively the same as those of the corresponding layers in the multitask training model.
The structure of the Mandarin-Cantonese recognition task-specific layer may be the same as that of one of the language-shared network layers, or different from all of them.
The parameter values of the language-shared network layers in the trained multitask training model are migrated to the Mandarin-Cantonese speech recognition model as the initialization parameters of its language-shared network layers. Migrating the parameters learned in this way speeds up and optimizes the learning of the Mandarin-Cantonese speech recognition model.
Step S73, jointly modeling Mandarin and Cantonese and training the Mandarin-Cantonese speech recognition model, comprises the following steps:
a. Acquire the audio of Mandarin and Cantonese and the corresponding annotated texts; frame the audio with a window of 25 ms frame length and 10 ms frame shift, and extract 40-dimensional FBANK features from each frame; mix the features of Mandarin and Cantonese and randomly shuffle their order. Furthermore, if the dialect data in the training samples are scarce, the data can be expanded through signal processing, real-device transcription, online crawling, and the like.
b. Integrate the modeling units of Mandarin and Cantonese and model with the integrated units to obtain the character labels of Mandarin and Cantonese, comprising:
from the Mandarin dictionary and the Cantonese dictionary respectively, count in the training text the word frequency of each character with the same pronunciation; take characters whose frequency exceeds a preset threshold as high-frequency characters and use them as modeling units, and replace characters whose frequency falls below the threshold with the high-frequency character of the same pronunciation, thereby obtaining a Mandarin character modeling unit set and a Cantonese character modeling unit set;
for example, in the mandarin dictionary, the pronunciation of "collusion" is the same, the word frequency of "collusion" is greater than a preset threshold, the word frequency of "eye" is lower than the preset threshold, the "eye" is replaced by "collusion", and "collusion" is used as a modeling unit.
Combining modeling units with the same pronunciation in a mandarin character modeling unit and a cantonese character modeling unit, and independently reserving modeling units with different pronunciations;
for example, if there are a mandarin modeling unit "collusion" and a cantonese modeling unit "none", and the pronunciations of the mandarin modeling unit "collusion" and the cantonese modeling unit "none" are both m and u, the mandarin modeling unit "collusion" and the cantonese modeling unit "none" are merged. Thus, a set of modeling units mixed by Mandarin and Cantonese are obtained. Therefore, a modeling unit with larger granularity is formed, and the condition that parameter training is insufficient due to the fact that some characters are too few in the training text can be avoided.
Model with the integrated Mandarin and Cantonese modeling units; the modeling function may be a conventional one in the art, yielding the character labels of Mandarin and Cantonese, which are then arranged in the input order of the training features.
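The unit-integration logic above can be sketched as follows; the threshold value and lexicon format are assumptions, and a pronunciation group whose characters all fall below the threshold falls back to its most frequent character:

    # Sketch of modeling-unit integration: keep high-frequency characters,
    # remap rarer homophones onto the most frequent character of the same
    # pronunciation, then merge the Mandarin and Cantonese unit sets.
    from collections import Counter, defaultdict

    def build_units(lexicon, corpus_text, threshold=50):
        """lexicon: {char: pronunciation}; returns (units, char_remap)."""
        freq = Counter(ch for ch in corpus_text if ch in lexicon)
        groups = defaultdict(list)
        for ch, pron in lexicon.items():
            groups[pron].append(ch)
        units, remap = set(), {}
        for chars in groups.values():
            head = max(chars, key=lambda c: freq[c])  # most frequent homophone
            for ch in chars:
                keep = freq[ch] > threshold or ch == head
                remap[ch] = ch if keep else head
                if keep:
                    units.add(ch)
        return units, remap

    # Identical units merge automatically; differing characters are kept.
    # mixed_units = mandarin_units | cantonese_units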
c. Train the Mandarin-Cantonese speech recognition model using the randomly shuffled FBANK features and the Mandarin and Cantonese character labels as input, with the CTC (Connectionist Temporal Classification) training criterion, to obtain the trained Mandarin-Cantonese speech recognition model.
In this embodiment, the CTC training criterion is used for the iterative update of the model parameters, and the model has only one network output; the final trained Mandarin-Cantonese speech recognition model can recognize both Mandarin and Cantonese.
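For illustration, the single-output CTC criterion of step c could look like this in PyTorch; tensor shapes and the merged unit-inventory size are assumptions:

    # Sketch of the CTC training criterion over the merged character units.
    import torch
    import torch.nn as nn

    ctc = nn.CTCLoss(blank=0, zero_infinity=True)
    T, B, U = 200, 8, 5001                 # frames, batch, merged units + blank
    logits = torch.randn(T, B, U, requires_grad=True)
    log_probs = logits.log_softmax(-1)     # CTC expects log-probabilities
    targets = torch.randint(1, U, (B, 30))            # character-label ids
    input_lens = torch.full((B,), T, dtype=torch.long)
    target_lens = torch.full((B,), 30, dtype=torch.long)
    loss = ctc(log_probs, targets, input_lens, target_lens)
    loss.backward()                        # one network output, one criterion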
The model trained in this way not only meets the requirement of recognizing Mandarin and Cantonese, but also requires no framework change in the kernel engineering, and its performance can reach or even exceed that of a single-language recognition system.
Step S74, recognizing mixed Mandarin and Cantonese speech based on the trained Mandarin-Cantonese speech recognition model, comprises:
inputting mixed Mandarin and Cantonese audio, recognizing the mixed speech with the trained Mandarin-Cantonese speech recognition model, and outputting the corresponding text information.
Furthermore, the trained Mandarin-Cantonese speech recognition model is an acoustic model. To improve the accuracy of mixed speech recognition, a language model is also trained: the dictionaries of Mandarin and Cantonese are merged and the training corpus texts are combined to train the language model. When decoding mixed Mandarin and Cantonese audio, the acoustic model and the language model recognize both languages simultaneously, so the mixed recognition system loses no performance compared with a single system.
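The dictionary and corpus merging for the language model can be sketched simply; the file names are placeholders, and in practice a word carrying different pronunciations in the two languages would need per-language lexicon entries:

    # Sketch: merge lexicons and corpora, then train e.g. an n-gram LM
    # (KenLM/SRILM or similar) on the merged corpus.
    merged_lexicon = {**mandarin_lexicon, **cantonese_lexicon}

    with open("merged_corpus.txt", "w", encoding="utf-8") as out:
        for path in ("mandarin_corpus.txt", "cantonese_corpus.txt"):
            with open(path, encoding="utf-8") as f:
                out.write(f.read())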
Compared with the scheme of decoding the same audio with Mandarin and Cantonese resources separately and then determining the final text result from the confidence scores or semantic analysis of the two recognition results, the scheme of this embodiment has low engineering cost and wastes no resources.
Compared with the scheme of directly merging the original modeling units of the two languages and then training the acoustic model, the scheme of this embodiment suffers less performance loss.
It should be noted that for simplicity of explanation, the foregoing method embodiments are described as a series of acts or combination of acts, but those skilled in the art will appreciate that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In some embodiments, the present invention further provides a storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, performs the steps of the Mandarin and Cantonese hybrid speech recognition method.
In some embodiments, an embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to perform a mandarin and cantonese hybrid speech recognition method.
The Mandarin and Cantonese hybrid speech recognition device according to the embodiment of the present invention may be used to execute the Mandarin and Cantonese hybrid speech recognition method according to the embodiment of the present invention, and accordingly achieves the technical effects of that method, which are not described here again. In the embodiment of the present invention, the relevant functional modules may be implemented by a hardware processor.
Fig. 8 is a schematic hardware structure diagram of an electronic device for a mandarin and cantonese hybrid speech recognition method according to another embodiment of the present application. As shown in fig. 8, the apparatus includes:
one or more processors 810 and a memory 820, with one processor 810 being an example in FIG. 8.
The apparatus for performing the mandarin and cantonese hybrid speech recognition method may further include: an input device 830 and an output device 840.
The processor 810, the memory 820, the input device 830, and the output device 840 may be connected by a bus or other means, such as the bus connection in fig. 8.
The memory 820, as a non-volatile computer-readable storage medium, can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the Mandarin and Cantonese hybrid speech recognition method in the embodiment of the present application. The processor 810 executes the various functional applications and data processing of the server by running the non-volatile software programs, instructions, and modules stored in the memory 820, that is, implementing the method of the above method embodiment.
The memory 820 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application required by at least one function, and the data storage area may store data created according to the use of the Mandarin and Cantonese hybrid speech recognition device, and the like. Further, the memory 820 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the memory 820 optionally includes memory located remotely from the processor 810; such remote memory may be connected to the device over a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 830 may receive input numeric or character information and generate signals related to user settings and function control of the device. The output device 840 may include a display device such as a display screen.
The one or more modules are stored in the memory 820 and, when executed by the one or more processors 810, perform the Mandarin and Cantonese hybrid speech recognition method of any of the above method embodiments.
The above product can execute the method provided by the embodiments of the present application, and has the functional modules and beneficial effects corresponding to that method. For technical details not described in detail in this embodiment, reference may be made to the methods provided in the embodiments of the present application.
The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) Mobile communication devices: these devices are characterized by mobile communication capability and are primarily aimed at providing voice and data communication. Such terminals include smart phones (e.g., iPhones), multimedia phones, feature phones, and low-end phones.
(2) Ultra-mobile personal computer devices: these devices belong to the category of personal computers, have computing and processing functions, and generally also have mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as iPads.
(3) Portable entertainment devices: these devices can display and play multimedia content. They include audio and video players (e.g., iPods), handheld game consoles, e-book readers, smart toys, and portable car navigation devices.
(4) Servers: similar in architecture to general-purpose computers, but with higher requirements on processing capability, stability, reliability, security, scalability, and manageability, because highly reliable services must be provided.
(5) Other electronic devices with data interaction functions.
The above-described device embodiments are merely illustrative; the units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a general hardware platform, and certainly also by hardware. Based on this understanding, the above technical solutions, in essence or in the part contributing to the related art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium such as a ROM/RAM, a magnetic disk, or an optical disk, and which includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods of the embodiments or of some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (12)

1. A method for training a mandarin and cantonese mixed speech recognition model comprises the following steps:
training a multitask model by adopting a mixed voice training sample of N languages, wherein the multitask model comprises a plurality of shared network layers and N task neural network layers which are connected with the last layer of the shared network layers and correspond to the N languages;
and migrating the network parameters of the plurality of shared network layers to a speech recognition model to be trained, so as to complete the training of the speech recognition model to be trained.
2. The method of claim 1, wherein training a multitask model with mixed speech training samples in N languages comprises:
training network parameters of the N task neural network layers based on N loss functions corresponding to the N task neural network layers;
collectively training network parameters of the plurality of shared network layers based on at least N loss functions corresponding to the N task neural network layers.
3. The method of claim 2, wherein the multitasking model further comprises a language classification network layer connected to a last layer of the plurality of shared network layers;
the co-training the network parameters of the plurality of shared network layers based at least on the N loss functions corresponding to the N task neural network layers comprises:
collectively training the network parameters of the plurality of shared network layers based on N loss functions corresponding to the N task neural network layers and a loss function corresponding to the language classification network layer.
4. The method of claim 3, wherein collectively training the network parameters of the plurality of shared network layers based on the N penalty functions corresponding to the N task neural network layers and the penalty functions corresponding to the language classification network layers comprises:
training network parameters of the plurality of shared network layers based on a weighted sum of the N loss functions corresponding to the N task neural network layers and the loss functions corresponding to the language classification network layers.
5. A mandarin and cantonese hybrid speech recognition method, comprising:
inputting mixed speech of Mandarin and Cantonese into the speech recognition model trained by the method of any one of claims 1 to 4, and performing mixed speech recognition.
6. A mandarin and cantonese hybrid speech recognition model training system, comprising:
the multi-task model training module is used for training a multi-task model by adopting a mixed voice training sample of N languages, wherein the multi-task model comprises a plurality of shared network layers and N task neural network layers which are connected with the last layer of the shared network layers and correspond to the N languages;
and the speech recognition model training module is used for transferring the network parameters of the plurality of shared network layers to a speech recognition model to be trained so as to complete the training of the speech recognition model to be trained.
7. The system of claim 6, wherein the training of the multitask model with the mixed speech training samples in the N languages comprises:
training network parameters of the N task neural network layers based on N loss functions corresponding to the N task neural network layers;
collectively training network parameters of the plurality of shared network layers based on at least N loss functions corresponding to the N task neural network layers.
8. The system of claim 7, wherein the multitasking model further comprises a language classification network layer connected to a last layer of the plurality of shared network layers;
the co-training the network parameters of the plurality of shared network layers based at least on the N loss functions corresponding to the N task neural network layers comprises:
collectively training the network parameters of the plurality of shared network layers based on N loss functions corresponding to the N task neural network layers and a loss function corresponding to the language classification network layer.
9. The system of claim 8, wherein collectively training the network parameters of the plurality of shared network layers based on the N penalty functions corresponding to the N task neural network layers and the penalty functions corresponding to the language classification network layers comprises:
training network parameters of the plurality of shared network layers based on a weighted sum of the N loss functions corresponding to the N task neural network layers and the loss functions corresponding to the language classification network layers.
10. A mandarin and cantonese hybrid speech recognition system comprising:
a speech recognition model trained using the method of any one of claims 1 to 4;
and a voice input module configured to input mixed speech of Mandarin and Cantonese into the speech recognition model to perform mixed speech recognition.
11. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1-5.
12. A storage medium on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 5.
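
By way of illustration only (the following sketches are not part of the claims), the multitask model recited in claims 1 and 3 can be realized as a stack of shared layers whose last layer feeds both the N per-language task heads and a language classifier. The minimal sketch below assumes a PyTorch-style API; the feed-forward shared stack, the layer sizes, and the per-language output-unit counts are illustrative assumptions, not values taken from this application.

```python
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    """Multitask model of claims 1 and 3: a plurality of shared network
    layers, N per-language task heads, and a language-classification head,
    all attached to the last shared layer."""

    def __init__(self, feat_dim=80, hidden=512, units_per_lang=(4000, 4000)):
        super().__init__()
        # Shared network layers: a common acoustic representation.
        self.shared = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # N task neural network layers, one per language.
        self.task_heads = nn.ModuleList(
            nn.Linear(hidden, n) for n in units_per_lang
        )
        # Language classification network layer.
        self.lang_head = nn.Linear(hidden, len(units_per_lang))

    def forward(self, feats):
        h = self.shared(feats)
        return [head(h) for head in self.task_heads], self.lang_head(h)
```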
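Claims 2 to 4 recite that each task head is trained on its own loss while the shared layers are trained jointly on a weighted sum of the N task losses and the language-classification loss. Continuing the sketch above, one way to express that joint objective; the loss weights and frame-level cross-entropy targets are again illustrative assumptions:

```python
def joint_loss(task_logits, lang_logits, labels, lang_ids,
               task_weights=(1.0, 1.0), lang_weight=0.5):
    """Weighted sum of the N task losses and the language loss (claim 4).

    Each frame's target unit is scored only by the head of the language
    that frame belongs to; the language head is scored on every frame.
    """
    ce = nn.CrossEntropyLoss()
    loss = lang_weight * ce(lang_logits, lang_ids)
    for lang, (w, logits) in enumerate(zip(task_weights, task_logits)):
        mask = lang_ids == lang          # frames of this language
        if mask.any():
            loss = loss + w * ce(logits[mask], labels[mask])
    return loss

# Illustrative training step on a mixed two-language batch.
model = MultiTaskModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
feats = torch.randn(8, 80)               # 8 frames of 80-dim features
labels = torch.randint(0, 4000, (8,))    # frame-level target units
lang_ids = torch.randint(0, 2, (8,))     # 0 = Mandarin, 1 = Cantonese
task_logits, lang_logits = model(feats)
joint_loss(task_logits, lang_logits, labels, lang_ids).backward()
optimizer.step()
```

In this formulation each task head receives gradient only from the term of its own language, consistent with claim 2, while the shared layers receive gradient from all terms, consistent with claims 3 and 4.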
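Finally, the migration step of claims 1 and 6 copies the trained shared-layer parameters into the speech recognition model to be trained, which must therefore use the same shared-stack topology; that model is then trained on mixed Mandarin-Cantonese data and used for recognition as in claims 5 and 10. A last illustrative fragment, with the joint output-unit count assumed for the example:

```python
class MixedSpeechModel(nn.Module):
    """Target model of claim 1: the same shared topology, with a single
    output layer over joint Mandarin-Cantonese modeling units."""

    def __init__(self, feat_dim=80, hidden=512, n_mixed_units=6000):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.output = nn.Linear(hidden, n_mixed_units)

    def forward(self, feats):
        return self.output(self.shared(feats))

target = MixedSpeechModel()
# Migrate the shared-layer parameters from the trained multitask model,
# then fine-tune `target` on mixed Mandarin-Cantonese training data.
target.shared.load_state_dict(model.shared.state_dict())

# Recognition per claims 5 and 10: score mixed speech with the model.
with torch.no_grad():
    best_units = target(feats).argmax(dim=-1)
```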
CN202010737658.6A 2020-07-28 2020-07-28 Mandarin and cantonese mixed speech recognition model training method and system Withdrawn CN111816160A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010737658.6A CN111816160A (en) 2020-07-28 2020-07-28 Mandarin and cantonese mixed speech recognition model training method and system

Publications (1)

Publication Number Publication Date
CN111816160A true CN111816160A (en) 2020-10-23

Family

ID=72864240

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010737658.6A Withdrawn CN111816160A (en) 2020-07-28 2020-07-28 Mandarin and cantonese mixed speech recognition model training method and system

Country Status (1)

Country Link
CN (1) CN111816160A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017114201A1 (en) * 2015-12-31 2017-07-06 Alibaba Group Holding Ltd. Method and device for executing a setting operation
US20180053500A1 (en) * 2016-08-22 2018-02-22 Google Inc. Multi-accent speech recognition
CN107481717A (en) * 2017-08-01 2017-12-15 Baidu Online Network Technology (Beijing) Co., Ltd. Acoustic model training method and system
CN108682417A (en) * 2018-05-14 2018-10-19 Institute of Automation, Chinese Academy of Sciences Speech acoustic modeling method for small-data speech recognition
CN110428818A (en) * 2019-08-09 2019-11-08 Institute of Automation, Chinese Academy of Sciences Low-resource multilingual speech recognition model and speech recognition method
CN110675865A (en) * 2019-11-06 2020-01-10 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for training hybrid language recognition models

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JIANGYAN YI et al.: "Language-Adversarial Transfer Learning for Low-Resource Speech Recognition", IEEE/ACM Transactions on Audio, Speech, and Language Processing *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2022020056A (en) * 2020-11-04 2022-01-31 Beijing Baidu Netcom Science Technology Co., Ltd. Speech recognition method, device, electronic apparatus and storage medium
JP7268113B2 (en) 2020-11-04 2023-05-02 Beijing Baidu Netcom Science Technology Co., Ltd. Speech recognition method, device, electronic device and storage medium
CN112561056A (en) * 2020-12-07 2021-03-26 北京百度网讯科技有限公司 Neural network model training method and device, electronic equipment and storage medium
CN112614485A (en) * 2020-12-30 2021-04-06 竹间智能科技(上海)有限公司 Recognition model construction method, voice recognition method, electronic device, and storage medium
CN113012706A (en) * 2021-02-18 2021-06-22 联想(北京)有限公司 Data processing method and device and electronic equipment
CN113571045B (en) * 2021-06-02 2024-03-12 北京它思智能科技有限公司 Method, system, equipment and medium for identifying Minnan language voice
CN113571045A (en) * 2021-06-02 2021-10-29 北京它思智能科技有限公司 Minnan language voice recognition method, system, equipment and medium
CN113241064A (en) * 2021-06-28 2021-08-10 科大讯飞股份有限公司 Voice recognition method, voice recognition device, model training method, model training device, electronic equipment and storage medium
CN113241064B (en) * 2021-06-28 2024-02-13 科大讯飞股份有限公司 Speech recognition, model training method and device, electronic equipment and storage medium
CN113469338A (en) * 2021-06-30 2021-10-01 平安科技(深圳)有限公司 Model training method, model training device, terminal device, and storage medium
CN113469338B (en) * 2021-06-30 2023-10-31 平安科技(深圳)有限公司 Model training method, model training device, terminal device and storage medium
CN113327600A (en) * 2021-06-30 2021-08-31 北京有竹居网络技术有限公司 Training method, device and equipment of voice recognition model
CN117174111A (en) * 2023-11-02 2023-12-05 浙江同花顺智能科技有限公司 Overlapping voice detection method, device, electronic equipment and storage medium
CN117174111B (en) * 2023-11-02 2024-01-30 浙江同花顺智能科技有限公司 Overlapping voice detection method, device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111816160A (en) Mandarin and cantonese mixed speech recognition model training method and system
CN110717339B (en) Semantic representation model processing method and device, electronic equipment and storage medium
CN110516253B (en) Chinese spoken language semantic understanding method and system
CN108920666B (en) Semantic understanding-based searching method, system, electronic device and storage medium
CN108417205B (en) Semantic understanding training method and system
CN108711420B (en) Multilingual hybrid model establishing method, multilingual hybrid model establishing device, multilingual hybrid model data obtaining device and electronic equipment
CN105427858B (en) Realize the method and system that voice is classified automatically
CN105074816B (en) Promote the exploitation of oral account natural language interface
WO2021072875A1 (en) Intelligent dialogue generation method, device, computer apparatus and computer storage medium
CN111090727B (en) Language conversion processing method and device and dialect voice interaction system
CN111081280B (en) Text-independent speech emotion recognition method and device and emotion recognition algorithm model generation method
CN109858038B (en) Text punctuation determination method and device
CN110706692A (en) Training method and system of child voice recognition model
CN109767763B (en) Method and device for determining user-defined awakening words
JP2020004382A (en) Method and device for voice interaction
CN108491380B (en) Anti-multitask training method for spoken language understanding
CN111832308A (en) Method and device for processing consistency of voice recognition text
CN114596844A (en) Acoustic model training method, voice recognition method and related equipment
CN111159358A (en) Multi-intention recognition training and using method and device
CN110942774A (en) Man-machine interaction system, and dialogue method, medium and equipment thereof
CN111241248A (en) Synonymy question generation model training method and system and synonymy question generation method
CN114911932A (en) Heterogeneous graph structure multi-conversation person emotion analysis method based on theme semantic enhancement
CN112017643B (en) Speech recognition model training method, speech recognition method and related device
CN115617975A (en) Intention identification method and device for few-sample and multi-turn conversations
CN111680514A (en) Information processing and model training method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant after: Sipic Technology Co.,Ltd.

Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant before: AI SPEECH Co.,Ltd.

WW01 Invention patent application withdrawn after publication

Application publication date: 20201023