CN110930982A - Multi-accent acoustic model and multi-accent voice recognition method - Google Patents
- Publication number: CN110930982A (application CN201911050896.3A)
- Authority: CN (China)
- Legal status: Pending
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/063—Training
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
- G10L15/26—Speech to text systems
- G10L15/28—Constructional details of speech recognition systems
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
Abstract
The invention provides a multi-accent acoustic model and a multi-accent speech recognition method. The multi-accent acoustic model comprises a plurality of BLSTM layers, a plurality of Softmax output layers and a gating unit; the BLSTM layers are connected in series in sequence and then connected to each Softmax output layer, and the gating unit is located between two adjacent BLSTM layers. The method improves the traditional Mandarin acoustic model structure: the Softmax output layer of the traditional Mandarin acoustic model is replicated once for each accent category to be recognized, so that each Softmax output layer is an accent-specific output layer, i.e., each accent has its own dedicated output layer; and the gating unit applies an accent-specific adjustment to the output of a BLSTM layer of the neural network, so that the model is better suited to a variety of accents.
Description
Technical Field
The invention belongs to the technical field of speech recognition, and particularly relates to a multi-accent acoustic model and a multi-accent speech recognition method.
Background
A speech recognition system built for Mandarin on a Mandarin acoustic model that combines a neural network with a hidden Markov model can achieve satisfactory results on standard Mandarin speech. However, its performance degrades markedly when it is applied to accented speech recognition tasks, mainly because an acoustic model built from Mandarin data cannot accurately classify the phoneme states of accented speech data. A dedicated acoustic model therefore needs to be constructed for the task of recognizing accented speech.
The accents of a language arise mainly either from speakers of other languages (i.e., a different native language) or from speakers of one of the language's dialects. In Chinese, the latter is the main source of accents. Chinese can be roughly divided into seven dialect groups: Mandarin (Guanhua), Wu, Xiang, Hakka, Min, Yue (Cantonese) and Gan. In addition, a large, internally complex dialect area can be divided into several smaller dialect areas, and dialects at the city and county level are called local dialects, such as Cantonese, the Qingdao dialect and the Tangshan dialect. The varieties of accent derived from different dialects are thus also very complicated, which gives rise to the multi-accent speech recognition problem frequently encountered in practical applications of speech recognition.
In a practical production environment, a large amount of Mandarin speech data is easy to obtain, whereas accented speech data often suffers from data sparsity because labeling it is complex and labor-intensive. To make full use of limited data and achieve the best performance of a speech recognition system, a common practice is to train a robust Mandarin acoustic model on the abundant Mandarin speech data and then adapt it to a single accent using that accent's data, yielding an accent-specific acoustic model; this is called accent-specific acoustic model adaptation. However, this approach requires separate adaptive training, including a search for optimal configuration parameters, for each target accent, and it ultimately produces multiple acoustic models, so the resulting accent-specific acoustic models are costly in both training complexity and storage space.
This problem can be alleviated by optimizing the Mandarin acoustic model directly on multi-accent speech data with conventional methods, but the performance of the multi-accent acoustic model obtained by such conventional adaptation is usually inferior to that of accent-specific adaptation.
Disclosure of Invention
In order to overcome the existing problems that multiple accents cannot be recognized, or are recognized with low accuracy, or to at least partially solve these problems, embodiments of the present invention provide a multi-accent acoustic model and a multi-accent speech recognition method.
According to one aspect of the invention, a multi-accent acoustic model is provided, comprising a plurality of bidirectional long short-term memory network (BLSTM) layers, a plurality of Softmax output layers and a gating unit, wherein the BLSTM layers are connected in series in sequence and then connected to each Softmax output layer, and the gating unit is located between two adjacent BLSTM layers among the plurality of BLSTM layers;
the number of Softmax output layers is equal to the number of accent categories, and each Softmax output layer is in one-to-one correspondence with an accent category.
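The architecture described above can be sketched as follows. This is a minimal illustration with hypothetical names, not the patent's implementation: plain dense layers stand in for the BLSTM layers, and an additive gating unit is placed between them.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class MultiAccentModel:
    """Shared hidden layers, a gating unit, and one Softmax output layer
    per accent category (one-to-one with the accent categories)."""
    def __init__(self, n_accents, hidden=4, n_states=6, seed=0):
        rng = np.random.default_rng(seed)
        # Two plain dense layers stand in for the stack of BLSTM layers.
        self.W1 = rng.normal(size=(hidden, hidden))
        self.W2 = rng.normal(size=(hidden, hidden))
        # Gating-unit parameters: V (M x N) maps the accent one-hot to hidden space.
        self.V = rng.normal(size=(hidden, n_accents))
        self.b = np.zeros(hidden)
        # One accent-specific output layer per accent category.
        self.heads = [rng.normal(size=(n_states, hidden)) for _ in range(n_accents)]

    def forward(self, x, accent_id):
        v_a = np.eye(len(self.heads))[accent_id]    # one-hot accent category label
        h1 = np.tanh(self.W1 @ x)                   # stand-in for BLSTM layer 1
        g = h1 + self.V @ v_a + self.b              # additive gating unit
        h2 = np.tanh(self.W2 @ g)                   # stand-in for BLSTM layer 2
        return softmax(self.heads[accent_id] @ h2)  # accent-specific posteriors

model = MultiAccentModel(n_accents=3)
p = model.forward(np.ones(4), accent_id=1)          # posteriors over triphone states
```

The key structural point is that `W1`, `W2`, `V` and `b` are shared across accents, while `heads` holds one output layer per accent category.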
On the basis of the technical scheme, the invention can be improved as follows.
Preferably, the gate control unit is an addition type gate control unit or a dot-and-multiply type gate control unit.
According to another aspect of the present invention, there is provided a multi-accent speech recognition method, comprising:
extracting acoustic features of accent data to be recognized;
inputting the acoustic features into a trained multi-accent acoustic model, and outputting the posterior probability of the triphone state of the accent data to be recognized;
and obtaining a text sequence of the accent data to be recognized according to the posterior probability of the triphone state of the accent data to be recognized.
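The last step above, turning state posteriors into text, can be illustrated with a deliberately simplified greedy sketch. The tokens and mapping are hypothetical; a real system instead combines the posteriors with a language model and pronunciation dictionary in a decoder, as the description later explains.

```python
import numpy as np

def states_to_text(posteriors, state_to_token):
    """Greedy sketch: take the most likely triphone state per frame and
    collapse consecutive repeats into a token sequence."""
    states = posteriors.argmax(axis=1)           # per-frame best state
    seq = [int(s) for i, s in enumerate(states)
           if i == 0 or s != states[i - 1]]      # collapse repeats
    return [state_to_token[s] for s in seq]

# Toy posteriors: 4 frames over 3 states (hypothetical tokens).
post = np.array([[0.8, 0.1, 0.1],
                 [0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.1, 0.1, 0.8]])
text = states_to_text(post, {0: "ni", 1: "hao", 2: "ma"})
```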
Preferably, the multi-accent acoustic model is trained by:
extracting acoustic features and accent category labels of each piece of accent data in an accent data training set comprising accent data of multiple categories;
and training the multi-accent acoustic model based on the acoustic features and accent category labels of each piece of accent data.
Preferably, the training of the multi-accent acoustic model based on the acoustic features and accent category labels of each piece of accent data includes:
inputting the acoustic features of each piece of accent data into the first BLSTM layer of the multi-accent acoustic model and passing them through the BLSTM layers in sequence, while inputting the first output vector of the BLSTM layer preceding the gating unit, together with the accent category label of the current accent data, into the gating unit;
the second output vector produced by the gating-unit operation serves as the input of the BLSTM layer following the gating unit, and the Softmax output layer corresponding to the current accent data outputs the posterior probabilities of the triphone states of the current accent data.
Preferably, when the gating unit is an additive gating unit, the specific operation of the gating unit is:
g(h_i, v_a) = h_i + V·v_a + b;
where h_i is the first output vector of the i-th BLSTM layer, v_a is the accent category label of the current accent data, g(h_i, v_a) is the second output vector produced by the gating-unit operation, V is an M×N matrix, the dimensions of h_i and v_a are M and N respectively, b is a bias vector, and M and N are positive integers.
Preferably, when the gating unit is a dot-product gating unit, the specific operation of the gating unit is:
g(h_i, v_a) = h_i ⊙ (V·v_a) + b;
where ⊙ denotes element-wise multiplication, h_i is the first output vector of the i-th BLSTM layer, v_a is the accent category label of the current accent data, g(h_i, v_a) is the second output vector produced by the gating-unit operation, V is an M×N matrix, the dimensions of h_i and v_a are M and N respectively, b is a bias vector, and M and N are positive integers.
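The two gating operations can be written out directly. This is a sketch of the formulas only (toy sizes; the element-wise reading of the dot-product variant is assumed, since g(h_i, v_a) must have the same dimension M as the next layer's input):

```python
import numpy as np

M, N = 4, 3                      # M: hidden size, N: number of accent categories
rng = np.random.default_rng(1)
V = rng.normal(size=(M, N))      # the M x N matrix V
b = np.zeros(M)                  # bias vector b
h_i = rng.normal(size=M)         # first output vector of the i-th BLSTM layer
v_a = np.eye(N)[2]               # one-hot accent category label v_a

g_add = h_i + V @ v_a + b        # additive gating unit
g_dot = h_i * (V @ v_a) + b      # dot-product (element-wise) gating unit
```

With a one-hot v_a, the product V·v_a simply selects the column of V for the current accent, so each accent gets its own learned shift (additive) or scaling (dot-product) of the hidden representation.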
Preferably, a one-hot code of each accent category is used as the category label of each piece of accent data.
Preferably, the multi-accent acoustic model is trained by a mini-batch stochastic gradient descent method.
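These two preferences, one-hot accent labels plus shuffled mini-batches, can be sketched as below. The accent names and utterance IDs are hypothetical; the patent does not name the categories.

```python
import random
import numpy as np

# Hypothetical accent categories; one-hot code per category serves as its label.
accents = ["accent_A", "accent_B", "accent_C", "accent_D"]
one_hot = {a: np.eye(len(accents))[i] for i, a in enumerate(accents)}

# Pool utterances from all accents, shuffle, and cut into mini-batches so
# that every mini-batch mixes accents and all accents are learned together.
data = [("utt%02d" % i, accents[i % len(accents)]) for i in range(12)]
random.Random(42).shuffle(data)
batches = [data[i:i + 4] for i in range(0, len(data), 4)]
```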
The invention has the following beneficial effects:
the traditional Mandarin acoustic model structure is improved; the Softmax output layer of the traditional Mandarin acoustic model is replicated once for each accent category to be recognized, so that each Softmax output layer is an accent-specific output layer, i.e., each accent has its own dedicated output layer; and the gating unit applies an accent-specific adjustment to the output of a BLSTM layer of the neural network, so that the model is better suited to a variety of accents.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a diagram of a conventional Mandarin acoustic model architecture;
FIG. 2 is a diagram of a multi-accent acoustic model architecture according to an embodiment of the present invention;
FIG. 3 is a flowchart of a method for performing multi-accent speech recognition using the multi-accent acoustic model of FIG. 2 according to an embodiment of the present invention.
Detailed Description
Referring to FIG. 2, a multi-accent acoustic model is provided for recognizing accent data of various categories. The multi-accent acoustic model includes a plurality of BLSTM layers, a plurality of Softmax output layers, and a gating unit; the BLSTM layers are connected in series in sequence and then connected to each Softmax output layer, and the gating unit is located between two adjacent BLSTM layers. The number of Softmax output layers is equal to the number of accent categories, and each Softmax output layer is in one-to-one correspondence with an accent category.
It can be understood that, referring to FIG. 1, which shows the architecture of a conventional standard Mandarin acoustic model, the conventional Mandarin acoustic model generally adopts a deep model consisting mainly of a plurality of BLSTM layers and a single Softmax output layer; the BLSTM layers are sequentially connected in series and then connected to the Softmax output layer, and the model is trained on a standard Mandarin speech training set. The trained Mandarin acoustic model can then be used to recognize Mandarin.
The traditional Mandarin acoustic model can only recognize standard Mandarin; for regionally accented speech data it either fails to recognize the speech or recognizes it with very low accuracy. The embodiment of the invention provides an improved multi-accent acoustic model that can accurately recognize speech data of various accents.
Referring to FIG. 2, the multi-accent acoustic model provided in an embodiment of the present invention improves on the network architecture of the Mandarin acoustic model: the Softmax output layer of the ordinary acoustic model is replicated n times, where n is the number of accent categories the multi-accent acoustic model can recognize, so that each accent category corresponds to one Softmax output layer; and a gating unit is added between two adjacent BLSTM layers of the neural network.
The embodiment of the invention uses transfer learning to optimize, starting from a robust Mandarin acoustic model trained on a large volume of Mandarin speech data, a model that combines a BLSTM-based multi-task classification structure with an accent-information-based gating mechanism. In the multi-task structure, the speech recognition tasks for the different accents share the BLSTM layers of the neural network, while the output layers are accent-specific (one Softmax output layer per accent category), i.e., each accent has its own dedicated Softmax output layer. The gating mechanism uses a gating unit to apply an accent-specific adjustment to the hidden-layer output of the neural network, so that the model is better suited to a variety of accents.
It can be understood that in the embodiment of the present invention a gating unit is placed between the BLSTM layers of an ordinary acoustic model. It should be noted that a gating unit may be placed between any two adjacent BLSTM layers; in the embodiment of the present invention, a gating unit is placed between every pair of adjacent BLSTM layers.
On the basis of the above embodiments, in the embodiments of the present invention, the gating unit is an additive gating unit or a dot-product gating unit. The type of gating unit may be chosen according to the amount of accented speech data and the performance of the Mandarin acoustic model. The gating unit operates as follows: the output vector h_i of the i-th layer and the accent category label vector v_a are fed into the gating unit together; after the gating unit applies its operation, the transformed vector g(h_i, v_a) is passed as input to the (i+1)-th layer, the gating unit being located between the i-th BLSTM layer and the (i+1)-th BLSTM layer.
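The data flow through a gated layer stack can be sketched as follows. All names are illustrative; `np.tanh` stands in for the BLSTM layers and a trivial additive gate (V equal to the identity, b = 0) stands in for the learned gating unit.

```python
import numpy as np

def gated_forward(layers, gate, x, v_a, gate_after=0):
    """Feed x through the stacked layers; after layer i (= gate_after), the
    gating unit transforms (h_i, v_a) and the result enters layer i+1."""
    h = x
    for i, layer in enumerate(layers):
        h = layer(h)
        if i == gate_after:
            h = gate(h, v_a)   # g(h_i, v_a) replaces h_i as the next input
    return h

layers = [np.tanh, np.tanh]            # stand-ins for two BLSTM layers
gate = lambda h, v_a: h + v_a          # trivial additive gate (V = I, b = 0)
out = gated_forward(layers, gate, np.zeros(3), np.ones(3))
```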
Referring to fig. 3, a multi-accent speech recognition method is provided, which performs multi-accent speech recognition based on the multi-accent acoustic models provided in the above embodiments. The method comprises the following steps:
extracting acoustic features of accent data to be recognized;
inputting the acoustic features into a trained multi-accent acoustic model, and outputting the posterior probability of the triphone state of the accent data to be recognized;
and obtaining a text sequence of the accent data to be recognized according to the posterior probability of the triphone state of the accent data to be recognized.
It can be understood that the embodiment of the present invention recognizes multi-accent speech data based on the multi-accent acoustic model provided by the above embodiment. During recognition, the acoustic features extracted from the speech data to be recognized are input into the trained multi-accent acoustic model, which outputs the posterior probabilities of the triphone states of that speech data. These posterior probabilities are then combined with a language model and a pronunciation dictionary in a decoder to obtain the recognized text sequence, completing recognition of the accent data to be recognized.
On the basis of the above embodiment, in the embodiment of the present invention, the multi-accent acoustic model is trained in the following manner:
extracting acoustic features and accent category labels of each piece of accent data in an accent data training set comprising accent data of multiple categories;
and training the multi-accent acoustic model based on the acoustic features and accent category labels of each piece of accent data.
On the basis of the foregoing embodiments, in an embodiment of the present invention, the training the multi-accent acoustic model based on the acoustic features and accent category labels of each piece of accent data includes:
inputting the acoustic features of each piece of accent data into the first BLSTM layer of the multi-accent acoustic model and passing them through the BLSTM layers in sequence, while inputting the first output vector of the BLSTM layer preceding the gating unit, together with the accent category label of the current accent data, into the gating unit;
the second output vector produced by the gating-unit operation serves as the input of the BLSTM layer following the gating unit, and the Softmax output layer corresponding to the current accent data outputs the posterior probabilities of the triphone states of the current accent data.
It is to be understood that the multi-accent acoustic model is built on a robust Mandarin acoustic model trained on a Mandarin speech training set with sufficient data. The accent-specific output layers and the gating units are added to the Mandarin acoustic model during the model initialization phase, and the modified network is then optimized using a stochastic gradient descent method.
In practical application, a speech recognition system for a specific accent is usually deployed according to regions, so that for a regional accent problem, speech data acquired in one region can be divided into the same accent, and a speech recognition acoustic model is constructed by using the acquired accent data.
It should be noted that the traditional Mandarin acoustic model usually adopts a hybrid architecture of a neural network and a hidden Markov model, in which the neural network part is built from multiple bidirectional long short-term memory (BLSTM) layers and its output target is the posterior probabilities of the states of context-dependent phonemes.
The multi-accent acoustic model provided by the embodiment of the invention is obtained by improving the traditional Mandarin acoustic model. After the improved multi-accent acoustic model is obtained, it must be trained. To train it, accented speech data is collected by region and labeled with accent category labels to build a multi-accent speech training set. The training set should contain data for every target accent, the amounts of data for the different categories should in principle be kept roughly equal, and a one-hot code representing the accent category is stored for each utterance (i.e., the one-hot code of each accent category serves as the category label of each piece of accent data). The multi-accent acoustic model is then trained on this training set. During training, the multi-accent data is shuffled and mini-batch stochastic gradient descent is used to ensure that multiple accents are learned simultaneously. When a sample of one accent is fed into the multi-accent acoustic model, only its corresponding Softmax output layer and the shared hidden layers are updated, while the other Softmax output layers remain unchanged. A learning rate smaller than the one used to train the Mandarin acoustic model is used, so that the Mandarin acoustic model is effectively preserved and exploited, i.e., the improved multi-accent acoustic model still recognizes Mandarin speech data well.
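The selective-update rule described above can be sketched in isolation. This toy example (hypothetical names; the shared hidden layers are omitted) only shows that the head for the current sample's accent moves while the other heads stay fixed:

```python
import numpy as np

def update_step(heads, head_grads, accent_id, lr=0.5):
    """When a sample of accent `accent_id` is fed in, only its own Softmax
    output layer (and, not shown here, the shared BLSTM layers) is updated;
    the other accent-specific output layers remain unchanged."""
    heads[accent_id] = heads[accent_id] - lr * head_grads[accent_id]
    return heads

heads = [np.ones((2, 2)) for _ in range(3)]   # three accent-specific heads
grads = [np.ones((2, 2)) for _ in range(3)]   # toy gradients
heads = update_step(heads, grads, accent_id=0)
```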
On the basis of the above embodiments, in the embodiment of the present invention, when the gating unit is an additive gating unit, the specific operation of the gating unit is:
g(h_i, v_a) = h_i + V·v_a + b;
where h_i is the first output vector of the i-th BLSTM layer, v_a is the accent category label of the current accent data, g(h_i, v_a) is the second output vector produced by the gating-unit operation, V is an M×N matrix, the dimensions of h_i and v_a are M and N respectively, b is a bias vector, and M and N are positive integers.
When the gating unit is a dot-product gating unit, the specific operation of the gating unit is:
g(h_i, v_a) = h_i ⊙ (V·v_a) + b;
where ⊙ denotes element-wise multiplication, h_i is the first output vector of the i-th BLSTM layer, v_a is the accent category label of the current accent data, g(h_i, v_a) is the second output vector produced by the gating-unit operation, V is an M×N matrix, the dimensions of h_i and v_a are M and N respectively, b is a bias vector, and M and N are positive integers.
The choice of gating unit may be determined according to the amount of accented training speech data and the performance of the Mandarin acoustic model, which is not limited in the embodiments of the present invention.
For the trained multi-accent acoustic model, the acoustic features of each accent's data are input into the model; the posterior probabilities of the phoneme states output by the accent-specific output layer (the corresponding Softmax output layer) of the neural network are combined, through a decoder, with the language model and pronunciation dictionary constructed for the specific task to obtain the recognized text sequence, completing recognition of the accent to be recognized.
In the following, the multi-accent acoustic model provided by the embodiment of the present invention is compared with various conventional acoustic models, and the recognition error rates of different acoustic models are shown in table 1.
TABLE 1
The rationality and effectiveness of the multi-accent acoustic model constructed according to the embodiment of the invention have been verified in practice; the word error rates for speech recognition of accents from various regions are shown in Table 1. In the table, the Mandarin acoustic model was trained on 7000 hours of Mandarin speech data, and the multi-accent acoustic model was built from accented speech data of four regions (20 hours per region); the test set for each accent is two hours, and all speech data is customer-service call speech from real scenarios.
Compared with the Mandarin acoustic model (the baseline), the word error rate of the multi-accent acoustic model constructed in the embodiment of the invention is reduced by 9.8% on average, and it also shows a performance improvement over both the traditional multi-accent acoustic model and the accent-specific acoustic models, which indicates that the multi-accent acoustic model of the embodiment of the invention is an efficient, high-performance acoustic model for multi-accent speech recognition.
According to the multi-accent acoustic model and multi-accent speech recognition method provided by the embodiments of the invention, the traditional Mandarin acoustic model structure is improved: the Softmax output layer of the traditional Mandarin acoustic model is replicated once for each accent category to be recognized, each Softmax output layer is an accent-specific output layer, and each accent has its own dedicated output layer; the gating unit applies an accent-specific adjustment to the output of a BLSTM layer of the neural network. That is, by means of transfer learning, a multi-task classification model with shared hidden layers (the BLSTM layers shared by the several Softmax output layers) is combined with an accent-information-based gating mechanism, and, starting from a robust Mandarin acoustic model, a multi-accent acoustic model is obtained by optimizing it on mixed multi-accent data for several target accents simultaneously. This yields a multi-accent acoustic model that is robust across multiple target accents while saving time and cost.
By contrast, under the traditional approach of building multiple accent-specific acoustic models, each accent-specific model needs a large amount of data for its specific accent during training, and such training data is difficult to obtain.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (9)
1. A multi-accent acoustic model for recognizing accent data of various categories, comprising a plurality of bidirectional long short-term memory network (BLSTM) layers, a plurality of Softmax output layers and a gate control unit, wherein the plurality of BLSTM layers are sequentially connected in series and then connected in series with each Softmax output layer, and the gate control unit is located between two adjacent BLSTM layers among the plurality of BLSTM layers;
the number of the Softmax output layers is equal to the number of the types of the accent data, and each Softmax output layer corresponds to the type of the accent data in a one-to-one mode.
2. The multi-accent acoustic model of claim 1, wherein the gate control unit is an additive gate control unit or a dot-product gate control unit.
3. A multi-accent speech recognition method, comprising:
extracting acoustic features of accent data to be recognized;
inputting the acoustic features into the trained multi-accent acoustic model of claim 1, and outputting the posterior probability of the triphone state of the accent data to be recognized;
and obtaining a text sequence of the accent data to be recognized according to the posterior probability of the triphone state of the accent data to be recognized.
4. The multi-accent speech recognition method of claim 3, wherein the multi-accent acoustic model is trained by:
extracting acoustic features and accent category labels of each piece of accent data in an accent data training set comprising accent data of multiple categories;
and training the multi-accent acoustic model based on the acoustic features and accent category labels of each piece of accent data.
5. The multi-accent speech recognition method of claim 4, wherein the training of the multi-accent acoustic model based on the acoustic features and accent category labels of each piece of accent data comprises:
inputting the acoustic features of each piece of accent data into the first BLSTM layer of the multi-accent acoustic model and passing them sequentially through the plurality of BLSTM layers, while inputting the first output vector of the BLSTM layer preceding the gate control unit, together with the accent category label of the current accent data, into the gate control unit;
and taking the second output vector produced by the specific operation of the gate control unit as the input of the BLSTM layer following the gate control unit, and outputting the posterior probability of the triphone state of the current accent data from the Softmax output layer corresponding to the current accent data.
6. The multi-accent speech recognition method of claim 5,
when the gate control unit is an additive gate control unit, the specific operation of the gate control unit is:
g(h_i, v_a) = h_i + V·v_a + b;
wherein h_i is the first output vector of the i-th BLSTM layer, v_a is the accent category label of the current accent data, g(h_i, v_a) is the second output vector after the gate control unit operation, V is an M × N matrix, the dimensions of h_i and v_a are M and N respectively, b is a bias vector, and M and N are positive integers.
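The additive gate of claim 6 can be illustrated numerically as follows; the dimensions (M = N = 2) and all matrix values are made up for illustration, and the one-hot form of v_a is an assumption consistent with claims 6 and 8:

```python
def additive_gate(h, V, v, b):
    """g(h_i, v_a) = h_i + V v_a + b, per claim 6.

    h: length-M first output vector of the i-th BLSTM layer
    V: M x N matrix, given as a list of M rows
    v: length-N accent category label vector
    b: length-M bias vector
    """
    Vv = [sum(row[n] * v[n] for n in range(len(v))) for row in V]
    return [h[m] + Vv[m] + b[m] for m in range(len(h))]

# With a one-hot accent label, V v_a selects one column of V,
# i.e. a learned per-accent shift of the hidden activations.
g = additive_gate(h=[1.0, 2.0],
                  V=[[0.5, -0.5],
                     [0.0, 1.0]],
                  v=[1.0, 0.0],   # accent 0 as a one-hot label
                  b=[0.1, 0.1])
# g ≈ [1.0 + 0.5 + 0.1, 2.0 + 0.0 + 0.1] = [1.6, 2.1]
```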
7. The multi-accent speech recognition method of claim 5,
when the gate control unit is a dot-multiply gate control unit, the specific operation of the gate control unit is:
g(h_i, v_a) = h_i · V·v_a + b;
wherein h_i is the first output vector of the i-th BLSTM layer, v_a is the accent category label of the current accent data, g(h_i, v_a) is the second output vector after the gate control unit operation, V is an M × N matrix, the dimensions of h_i and v_a are M and N respectively, b is a bias vector, and M and N are positive integers.
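The dot-multiply gate of claim 7 differs from the additive gate only in combining h_i with V v_a multiplicatively; reading the dot product element-wise between the two length-M vectors (a standard interpretation for gating, assumed here), a numerical sketch with the same made-up values is:

```python
def dot_multiply_gate(h, V, v, b):
    """g(h_i, v_a) = h_i · V v_a + b, per claim 7, with · taken
    element-wise between the length-M vectors h_i and V v_a
    (an assumed reading of the dot-multiply gate)."""
    Vv = [sum(row[n] * v[n] for n in range(len(v))) for row in V]
    return [h[m] * Vv[m] + b[m] for m in range(len(h))]

g = dot_multiply_gate(h=[1.0, 2.0],
                      V=[[0.5, -0.5],
                         [0.0, 1.0]],
                      v=[1.0, 0.0],   # accent 0 as a one-hot label
                      b=[0.1, 0.1])
# g ≈ [1.0 * 0.5 + 0.1, 2.0 * 0.0 + 0.1] = [0.6, 0.1]
```

Where the additive gate shifts hidden activations per accent, this variant rescales them per accent before adding the bias.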
8. The multi-accent speech recognition method according to any one of claims 4 to 7, wherein a unique code (e.g., a one-hot vector) of each accent data category is used as the category label of that accent data.
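Reading claim 8's unique code as a one-hot vector (an assumption, but one consistent with v_a being a length-N vector multiplied by the M × N matrix V in claims 6 and 7), the label encoding is simply:

```python
def one_hot(accent_index, num_categories):
    """Encode an accent category index as the unique label vector v_a:
    a length-N vector with a 1.0 at the category's position."""
    return [1.0 if i == accent_index else 0.0
            for i in range(num_categories)]

# Example: the second of three accent categories
label = one_hot(1, 3)
# label == [0.0, 1.0, 0.0]
```

With such a label, V·v_a in the gate operations selects exactly one column of V, so each accent category gets its own learned adjustment vector.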
9. The method of claim 5, wherein the multi-accent acoustic model is trained using a mini-batch stochastic gradient descent method.
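The mini-batch stochastic gradient descent of claim 9 can be sketched in a framework-free form; the toy loss, gradient function, and hyperparameter values below are illustrative only, not those used for the actual acoustic model:

```python
def minibatch_sgd(params, grad_fn, data, lr=0.1, batch_size=2, epochs=1):
    """Mini-batch SGD sketch: update parameters against the gradient
    averaged over each small batch of training examples.
    grad_fn(params, batch) returns the gradient as a list of floats."""
    for _ in range(epochs):
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grads = grad_fn(params, batch)
            params = [p - lr * g for p, g in zip(params, grads)]
    return params

# Toy usage: fit a single parameter p to minimize mean (p - x)^2
def mean_squared_grad(params, batch):
    return [sum(2.0 * (params[0] - x) for x in batch) / len(batch)]

fitted = minibatch_sgd([0.0], mean_squared_grad, [1.0, 1.0, 1.0, 1.0],
                       lr=0.5, batch_size=2)
# fitted == [1.0]
```

In the acoustic-model setting, each batch would hold acoustic feature sequences with their accent labels, and the gradients would flow through the selected Softmax head, the gate, and the shared BLSTM layers.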
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911050896.3A CN110930982A (en) | 2019-10-31 | 2019-10-31 | Multi-accent acoustic model and multi-accent voice recognition method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911050896.3A CN110930982A (en) | 2019-10-31 | 2019-10-31 | Multi-accent acoustic model and multi-accent voice recognition method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110930982A true CN110930982A (en) | 2020-03-27 |
Family
ID=69849958
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911050896.3A Pending CN110930982A (en) | 2019-10-31 | 2019-10-31 | Multi-accent acoustic model and multi-accent voice recognition method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110930982A (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104036774A (en) * | 2014-06-20 | 2014-09-10 | 国家计算机网络与信息安全管理中心 | Method and system for recognizing Tibetan dialects |
CN106875942A (en) * | 2016-12-28 | 2017-06-20 | 中国科学院自动化研究所 | Acoustic model adaptive approach based on accent bottleneck characteristic |
US20180053500A1 (en) * | 2016-08-22 | 2018-02-22 | Google Inc. | Multi-accent speech recognition |
US20190088251A1 (en) * | 2017-09-18 | 2019-03-21 | Samsung Electronics Co., Ltd. | Speech signal recognition system and method |
CN109829058A (en) * | 2019-01-17 | 2019-05-31 | 西北大学 | A kind of classifying identification method improving accent recognition accuracy rate based on multi-task learning |
US20200160836A1 (en) * | 2018-11-21 | 2020-05-21 | Google Llc | Multi-dialect and multilingual speech recognition |
CN112992119A (en) * | 2021-01-14 | 2021-06-18 | 安徽大学 | Deep neural network-based accent classification method and model thereof |
Non-Patent Citations (7)
Title |
---|
HAN ZHU et al.: "Multi-Accent Adaptation based on Gate Mechanism", INTERSPEECH 2019, pages 744-748 *
HAN ZHU et al.: "Multi-Accent Adaptation based on Gate Mechanism", https://arxiv.org/abs/2011.02774 *
JIANGYAN YI et al.: "Improving BLSTM RNN based Mandarin speech recognition using accent dependent bottleneck features", 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA) *
XUESONG YANG et al.: "Joint Modeling of Accents and Acoustics for Multi-Accent Speech Recognition", 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) *
ZHOU Gang: "Research on Speech Recognition of the Lhasa Dialect of Tibetan", China Master's Theses Full-text Database *
LI Deyi et al.: "Intelligent Summarization and Deep Learning (Artificial Intelligence and Robotics Advanced Technology Series)", Beijing Institute of Technology Press, pages 103-104 *
遥遥子YY: "Learning Kaldi decision-tree tying and triphones from scratch", retrieved from the Internet: https://blog.csdn.net/qq_37591044/article/details/102395480 *
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111508501A (en) * | 2020-07-02 | 2020-08-07 | 成都晓多科技有限公司 | Voice recognition method and system with accent for telephone robot |
CN111508501B (en) * | 2020-07-02 | 2020-09-29 | 成都晓多科技有限公司 | Voice recognition method and system with accent for telephone robot |
CN113593524A (en) * | 2021-01-26 | 2021-11-02 | 腾讯科技(深圳)有限公司 | Method and device for training acoustic model for accent recognition, and storage medium |
CN113593525A (en) * | 2021-01-26 | 2021-11-02 | 腾讯科技(深圳)有限公司 | Method, device and storage medium for training accent classification model and accent classification |
CN112885351A (en) * | 2021-04-30 | 2021-06-01 | 浙江非线数联科技股份有限公司 | Dialect voice recognition method and device based on transfer learning |
CN112885351B (en) * | 2021-04-30 | 2021-07-23 | 浙江非线数联科技股份有限公司 | Dialect voice recognition method and device based on transfer learning |
CN113593534A (en) * | 2021-05-28 | 2021-11-02 | 思必驰科技股份有限公司 | Method and apparatus for multi-accent speech recognition |
CN113593534B (en) * | 2021-05-28 | 2023-07-14 | 思必驰科技股份有限公司 | Method and device for multi-accent speech recognition |
US11776323B2 (en) | 2022-02-15 | 2023-10-03 | Ford Global Technologies, Llc | Biometric task network |
CN114596845A (en) * | 2022-04-13 | 2022-06-07 | 马上消费金融股份有限公司 | Training method of voice recognition model, voice recognition method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110930982A (en) | Multi-accent acoustic model and multi-accent voice recognition method | |
EP3966816B1 (en) | Large-scale multilingual speech recognition with a streaming end-to-end model | |
US8126717B1 (en) | System and method for predicting prosodic parameters | |
CN112889073A (en) | Cross-language classification using multi-language neural machine translation | |
CN110459208B (en) | Knowledge migration-based sequence-to-sequence speech recognition model training method | |
JP2004279701A (en) | Method and device for sound model creation, and speech recognition device | |
Masumura et al. | Large context end-to-end automatic speech recognition via extension of hierarchical recurrent encoder-decoder models | |
JP2023544336A (en) | System and method for multilingual speech recognition framework | |
JP2022512233A (en) | Neural adjustment code for multilingual style-dependent speech language processing | |
CN112700778A (en) | Speech recognition method and speech recognition apparatus | |
CN115039170A (en) | Proper noun recognition in end-to-end speech recognition | |
CN116303966A (en) | Dialogue behavior recognition system based on prompt learning | |
US11990117B2 (en) | Using speech recognition to improve cross-language speech synthesis | |
CN111553157A (en) | Entity replacement-based dialog intention identification method | |
US20220310080A1 (en) | Multi-Task Learning for End-To-End Automated Speech Recognition Confidence and Deletion Estimation | |
Hu et al. | The USTC system for blizzard challenge 2017 | |
KR20220128401A (en) | Attention-based joint acoustics and text on-device end-to-end (E2E) models | |
US20230317059A1 (en) | Alignment Prediction to Inject Text into Automatic Speech Recognition Training | |
Razavi et al. | An HMM-based formalism for automatic subword unit derivation and pronunciation generation | |
Wu et al. | Factored recurrent neural network language model in TED lecture transcription | |
Abraham et al. | An automated technique to generate phone-to-articulatory label mapping | |
US11823697B2 (en) | Improving speech recognition with speech synthesis-based model adapation | |
WO2022086640A1 (en) | Fast emit low-latency streaming asr with sequence-level emission regularization | |
Farooq et al. | Learning cross-lingual mappings for data augmentation to improve low-resource speech recognition | |
Wang et al. | Speech-and-text transformer: Exploiting unpaired text for end-to-end speech recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||