CN110930982A - Multi-accent acoustic model and multi-accent voice recognition method - Google Patents
- Publication number: CN110930982A (application CN201911050896.3A)
- Authority: CN (China)
- Legal status: Pending
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/063—Training
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
- G10L15/26—Speech to text systems
- G10L15/28—Constructional details of speech recognition systems
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
Abstract
The invention provides a multi-accent acoustic model and a multi-accent speech recognition method. The multi-accent acoustic model comprises a plurality of BLSTM layers, a plurality of Softmax output layers and a gating unit; the BLSTM layers are connected in series in sequence and then connected to each Softmax output layer, and the gating unit is located between two adjacent BLSTM layers. The method improves the traditional Mandarin acoustic model structure: the Softmax output layer of the traditional Mandarin acoustic model is replicated once for each accent category to be recognized, so that each Softmax output layer is an accent-specific output layer, i.e., each accent has its own dedicated output layer; and the gating unit applies an accent-specific adjustment to the output of a BLSTM layer of the neural network, so that the model is better suited to a variety of accents.
Description
Technical Field
The invention belongs to the technical field of speech recognition, and particularly relates to a multi-accent acoustic model and a multi-accent speech recognition method.
Background
A speech recognition system built for Mandarin on a Mandarin acoustic model that combines a neural network with a hidden Markov model can achieve satisfactory results on standard Mandarin speech. However, its performance degrades markedly when it is applied to accented speech recognition tasks, mainly because an acoustic model built from Mandarin data cannot accurately classify the phoneme states of accented speech data. A dedicated acoustic model therefore needs to be constructed for the task of recognizing accented speech.
The accents of a language arise mainly either from speakers of other languages (i.e., a different native language) or from speakers of one of the language's dialects. In Chinese, the latter is the main source of accents. Chinese can be roughly divided into seven dialect groups: Mandarin (Guanhua), Wu, Xiang, Hakka, Min, Yue (Cantonese) and Gan. In addition, a large, internally complex dialect area can be divided into several smaller dialect areas, and dialects at the city and county level are called local dialects, such as Cantonese, the Qingdao dialect and the Tangshan dialect. The varieties of accent derived from different dialects are thus also very complicated, which gives rise to the multi-accent speech recognition problem frequently encountered in practical applications of speech recognition.
In a practical production environment, a large amount of Mandarin speech data is easy to obtain, whereas accented speech data often suffers from data sparsity because labeling it is complex and labor-intensive. To make full use of limited data and achieve the best performance of a speech recognition system, a common practice is to train a robust Mandarin acoustic model on the abundant Mandarin speech data and then adapt it to a single accent using that accent's data, yielding an accent-specific acoustic model; this is called accent-specific acoustic model adaptation. However, this approach requires separate adaptive training, including a search for optimal configuration parameters, for each target accent, and it ultimately produces multiple acoustic models, so the resulting accent-specific acoustic models are costly in both training complexity and storage space.
This problem can be alleviated by optimizing the Mandarin acoustic model directly on multi-accent speech data with conventional methods, but the performance of the multi-accent acoustic model obtained by such conventional adaptation is usually inferior to that of accent-specific adaptation.
Disclosure of Invention
In order to overcome the existing problems that multiple accents cannot be recognized, or are recognized with low accuracy, or to at least partially solve these problems, embodiments of the present invention provide a multi-accent acoustic model and a multi-accent speech recognition method.
According to one aspect of the invention, a multi-accent acoustic model is provided, comprising a plurality of bidirectional long short-term memory network (BLSTM) layers, a plurality of Softmax output layers and a gating unit, wherein the BLSTM layers are connected in series in sequence and then connected to each Softmax output layer, and the gating unit is located between two adjacent BLSTM layers among the plurality of BLSTM layers;
the number of Softmax output layers is equal to the number of accent categories, and each Softmax output layer is in one-to-one correspondence with an accent category.
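The architecture described above can be sketched as follows. This is a minimal illustration with hypothetical names, not the patent's implementation: plain dense layers stand in for the BLSTM layers, and an additive gating unit is placed between them.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class MultiAccentModel:
    """Shared hidden layers, a gating unit, and one Softmax output layer
    per accent category (one-to-one with the accent categories)."""
    def __init__(self, n_accents, hidden=4, n_states=6, seed=0):
        rng = np.random.default_rng(seed)
        # Two plain dense layers stand in for the stack of BLSTM layers.
        self.W1 = rng.normal(size=(hidden, hidden))
        self.W2 = rng.normal(size=(hidden, hidden))
        # Gating-unit parameters: V (M x N) maps the accent one-hot to hidden space.
        self.V = rng.normal(size=(hidden, n_accents))
        self.b = np.zeros(hidden)
        # One accent-specific output layer per accent category.
        self.heads = [rng.normal(size=(n_states, hidden)) for _ in range(n_accents)]

    def forward(self, x, accent_id):
        v_a = np.eye(len(self.heads))[accent_id]    # one-hot accent category label
        h1 = np.tanh(self.W1 @ x)                   # stand-in for BLSTM layer 1
        g = h1 + self.V @ v_a + self.b              # additive gating unit
        h2 = np.tanh(self.W2 @ g)                   # stand-in for BLSTM layer 2
        return softmax(self.heads[accent_id] @ h2)  # accent-specific posteriors

model = MultiAccentModel(n_accents=3)
p = model.forward(np.ones(4), accent_id=1)          # posteriors over triphone states
```

The key structural point is that `W1`, `W2`, `V` and `b` are shared across accents, while `heads` holds one output layer per accent category.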
On the basis of the technical scheme, the invention can be improved as follows.
Preferably, the gate control unit is an addition type gate control unit or a dot-and-multiply type gate control unit.
According to another aspect of the present invention, there is provided a multi-accent speech recognition method, comprising:
extracting acoustic features of accent data to be recognized;
inputting the acoustic features into a trained multi-accent acoustic model, and outputting the posterior probability of the triphone state of the accent data to be recognized;
and obtaining a text sequence of the accent data to be recognized according to the posterior probability of the triphone state of the accent data to be recognized.
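The last step above, turning state posteriors into text, can be illustrated with a deliberately simplified greedy sketch. The tokens and mapping are hypothetical; a real system instead combines the posteriors with a language model and pronunciation dictionary in a decoder, as the description later explains.

```python
import numpy as np

def states_to_text(posteriors, state_to_token):
    """Greedy sketch: take the most likely triphone state per frame and
    collapse consecutive repeats into a token sequence."""
    states = posteriors.argmax(axis=1)           # per-frame best state
    seq = [int(s) for i, s in enumerate(states)
           if i == 0 or s != states[i - 1]]      # collapse repeats
    return [state_to_token[s] for s in seq]

# Toy posteriors: 4 frames over 3 states (hypothetical tokens).
post = np.array([[0.8, 0.1, 0.1],
                 [0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.1, 0.1, 0.8]])
text = states_to_text(post, {0: "ni", 1: "hao", 2: "ma"})
```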
Preferably, the multi-accent acoustic model is trained by:
extracting acoustic features and accent category labels of each piece of accent data in an accent data training set comprising accent data of multiple categories;
and training the multi-accent acoustic model based on the acoustic features and accent category labels of each piece of accent data.
Preferably, the training of the multi-accent acoustic model based on the acoustic features and accent category labels of each piece of accent data includes:
inputting the acoustic features of each piece of accent data into the first BLSTM layer of the multi-accent acoustic model and passing them through the BLSTM layers in sequence, while inputting the first output vector of the BLSTM layer preceding the gating unit, together with the accent category label of the current accent data, into the gating unit;
the second output vector produced by the gating-unit operation serves as the input of the BLSTM layer following the gating unit, and the Softmax output layer corresponding to the current accent data outputs the posterior probabilities of the triphone states of the current accent data.
Preferably, when the gating unit is an additive gating unit, the specific operation of the gating unit is:
g(h_i, v_a) = h_i + V·v_a + b;
where h_i is the first output vector of the i-th BLSTM layer, v_a is the accent category label of the current accent data, g(h_i, v_a) is the second output vector produced by the gating-unit operation, V is an M×N matrix, the dimensions of h_i and v_a are M and N respectively, b is a bias vector, and M and N are positive integers.
Preferably, when the gating unit is a dot-product gating unit, the specific operation of the gating unit is:
g(h_i, v_a) = h_i ⊙ (V·v_a) + b;
where ⊙ denotes element-wise multiplication, h_i is the first output vector of the i-th BLSTM layer, v_a is the accent category label of the current accent data, g(h_i, v_a) is the second output vector produced by the gating-unit operation, V is an M×N matrix, the dimensions of h_i and v_a are M and N respectively, b is a bias vector, and M and N are positive integers.
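The two gating operations can be written out directly. This is a sketch of the formulas only (toy sizes; the element-wise reading of the dot-product variant is assumed, since g(h_i, v_a) must have the same dimension M as the next layer's input):

```python
import numpy as np

M, N = 4, 3                      # M: hidden size, N: number of accent categories
rng = np.random.default_rng(1)
V = rng.normal(size=(M, N))      # the M x N matrix V
b = np.zeros(M)                  # bias vector b
h_i = rng.normal(size=M)         # first output vector of the i-th BLSTM layer
v_a = np.eye(N)[2]               # one-hot accent category label v_a

g_add = h_i + V @ v_a + b        # additive gating unit
g_dot = h_i * (V @ v_a) + b      # dot-product (element-wise) gating unit
```

With a one-hot v_a, the product V·v_a simply selects the column of V for the current accent, so each accent gets its own learned shift (additive) or scaling (dot-product) of the hidden representation.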
Preferably, a one-hot code of each accent category is used as the category label of each piece of accent data.
Preferably, the multi-accent acoustic model is trained by a mini-batch stochastic gradient descent method.
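These two preferences, one-hot accent labels plus shuffled mini-batches, can be sketched as below. The accent names and utterance IDs are hypothetical; the patent does not name the categories.

```python
import random
import numpy as np

# Hypothetical accent categories; one-hot code per category serves as its label.
accents = ["accent_A", "accent_B", "accent_C", "accent_D"]
one_hot = {a: np.eye(len(accents))[i] for i, a in enumerate(accents)}

# Pool utterances from all accents, shuffle, and cut into mini-batches so
# that every mini-batch mixes accents and all accents are learned together.
data = [("utt%02d" % i, accents[i % len(accents)]) for i in range(12)]
random.Random(42).shuffle(data)
batches = [data[i:i + 4] for i in range(0, len(data), 4)]
```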
The invention has the following beneficial effects:
the traditional Mandarin acoustic model structure is improved; the Softmax output layer of the traditional Mandarin acoustic model is replicated once for each accent category to be recognized, so that each Softmax output layer is an accent-specific output layer, i.e., each accent has its own dedicated output layer; and the gating unit applies an accent-specific adjustment to the output of a BLSTM layer of the neural network, so that the model is better suited to a variety of accents.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a diagram of a conventional Mandarin acoustic model architecture;
FIG. 2 is a diagram of a multi-accent acoustic model architecture according to an embodiment of the present invention;
FIG. 3 is a flowchart of a method for performing multi-accent speech recognition using the multi-accent acoustic model of FIG. 2 according to an embodiment of the present invention.
Detailed Description
Referring to FIG. 2, a multi-accent acoustic model is provided for recognizing accent data of various categories. The multi-accent acoustic model includes a plurality of BLSTM layers, a plurality of Softmax output layers, and a gating unit; the BLSTM layers are connected in series in sequence and then connected to each Softmax output layer, and the gating unit is located between two adjacent BLSTM layers. The number of Softmax output layers is equal to the number of accent categories, and each Softmax output layer is in one-to-one correspondence with an accent category.
It can be understood that, referring to FIG. 1, which shows the architecture of a conventional standard Mandarin acoustic model, the conventional Mandarin acoustic model generally adopts a deep model consisting mainly of a plurality of BLSTM layers and a single Softmax output layer; the BLSTM layers are sequentially connected in series and then connected to the Softmax output layer, and the model is trained on a standard Mandarin speech training set. The trained Mandarin acoustic model can then be used to recognize Mandarin.
The traditional Mandarin acoustic model can only recognize standard Mandarin; for regionally accented speech data it either fails to recognize the speech or recognizes it with very low accuracy. The embodiment of the invention provides an improved multi-accent acoustic model that can accurately recognize speech data of various accents.
Referring to FIG. 2, the multi-accent acoustic model provided in an embodiment of the present invention improves on the network architecture of the Mandarin acoustic model: the Softmax output layer of the ordinary acoustic model is replicated n times, where n is the number of accent categories the multi-accent acoustic model can recognize, so that each accent category corresponds to one Softmax output layer; and a gating unit is added between two adjacent BLSTM layers of the neural network.
The embodiment of the invention uses transfer learning to optimize, starting from a robust Mandarin acoustic model trained on a large volume of Mandarin speech data, a model that combines a BLSTM-based multi-task classification structure with an accent-information-based gating mechanism. In the multi-task structure, the speech recognition tasks for the different accents share the BLSTM layers of the neural network, while the output layers are accent-specific (one Softmax output layer per accent category), i.e., each accent has its own dedicated Softmax output layer. The gating mechanism uses a gating unit to apply an accent-specific adjustment to the hidden-layer output of the neural network, so that the model is better suited to a variety of accents.
It can be understood that in the embodiment of the present invention a gating unit is placed between the BLSTM layers of an ordinary acoustic model. It should be noted that a gating unit may be placed between any two adjacent BLSTM layers; in the embodiment of the present invention, a gating unit is placed between every pair of adjacent BLSTM layers.
On the basis of the above embodiments, in the embodiments of the present invention, the gating unit is an additive gating unit or a dot-product gating unit. The type of gating unit may be chosen according to the amount of accented speech data and the performance of the Mandarin acoustic model. The gating unit operates as follows: the output vector h_i of the i-th layer and the accent category label vector v_a are fed into the gating unit together; after the gating unit applies its operation, the transformed vector g(h_i, v_a) is passed as input to the (i+1)-th layer, the gating unit being located between the i-th BLSTM layer and the (i+1)-th BLSTM layer.
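The data flow through a gated layer stack can be sketched as follows. All names are illustrative; `np.tanh` stands in for the BLSTM layers and a trivial additive gate (V equal to the identity, b = 0) stands in for the learned gating unit.

```python
import numpy as np

def gated_forward(layers, gate, x, v_a, gate_after=0):
    """Feed x through the stacked layers; after layer i (= gate_after), the
    gating unit transforms (h_i, v_a) and the result enters layer i+1."""
    h = x
    for i, layer in enumerate(layers):
        h = layer(h)
        if i == gate_after:
            h = gate(h, v_a)   # g(h_i, v_a) replaces h_i as the next input
    return h

layers = [np.tanh, np.tanh]            # stand-ins for two BLSTM layers
gate = lambda h, v_a: h + v_a          # trivial additive gate (V = I, b = 0)
out = gated_forward(layers, gate, np.zeros(3), np.ones(3))
```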
Referring to fig. 3, a multi-accent speech recognition method is provided, which performs multi-accent speech recognition based on the multi-accent acoustic models provided in the above embodiments. The method comprises the following steps:
extracting acoustic features of accent data to be recognized;
inputting the acoustic features into a trained multi-accent acoustic model, and outputting the posterior probability of the triphone state of the accent data to be recognized;
and obtaining a text sequence of the accent data to be recognized according to the posterior probability of the triphone state of the accent data to be recognized.
It can be understood that the embodiment of the present invention recognizes multi-accent speech data based on the multi-accent acoustic model provided by the above embodiment. During recognition, the acoustic features extracted from the speech data to be recognized are input into the trained multi-accent acoustic model, which outputs the posterior probabilities of the triphone states of that speech data. These posterior probabilities are then combined with a language model and a pronunciation dictionary in a decoder to obtain the recognized text sequence, completing recognition of the accent data to be recognized.
On the basis of the above embodiment, in the embodiment of the present invention, the multi-accent acoustic model is trained in the following manner:
extracting acoustic features and accent category labels of each piece of accent data in an accent data training set comprising accent data of multiple categories;
and training the multi-accent acoustic model based on the acoustic features and accent category labels of each piece of accent data.
On the basis of the foregoing embodiments, in an embodiment of the present invention, the training the multi-accent acoustic model based on the acoustic features and accent category labels of each piece of accent data includes:
inputting the acoustic features of each piece of accent data into the first BLSTM layer of the multi-accent acoustic model and passing them through the BLSTM layers in sequence, while inputting the first output vector of the BLSTM layer preceding the gating unit, together with the accent category label of the current accent data, into the gating unit;
the second output vector produced by the gating-unit operation serves as the input of the BLSTM layer following the gating unit, and the Softmax output layer corresponding to the current accent data outputs the posterior probabilities of the triphone states of the current accent data.
It is to be understood that the multi-accent acoustic model is built on a robust Mandarin acoustic model trained on a Mandarin speech training set with sufficient data. The accent-specific output layers and the gating units are added to the Mandarin acoustic model during the model initialization phase, and the modified network is then optimized using a stochastic gradient descent method.
In practical application, a speech recognition system for a specific accent is usually deployed according to regions, so that for a regional accent problem, speech data acquired in one region can be divided into the same accent, and a speech recognition acoustic model is constructed by using the acquired accent data.
It should be noted that the traditional Mandarin acoustic model usually adopts a hybrid architecture of a neural network and a hidden Markov model, in which the neural network part is built from multiple bidirectional long short-term memory (BLSTM) layers and its output target is the posterior probabilities of the states of context-dependent phonemes.
The multi-accent acoustic model provided by the embodiment of the invention is obtained by improving the traditional Mandarin acoustic model. After the improved multi-accent acoustic model is obtained, it must be trained. To train it, accented speech data is collected by region and labeled with accent category labels to build a multi-accent speech training set. The training set should contain data for every target accent, the amounts of data for the different categories should in principle be kept roughly equal, and a one-hot code representing the accent category is stored for each utterance (i.e., the one-hot code of each accent category serves as the category label of each piece of accent data). The multi-accent acoustic model is then trained on this training set. During training, the multi-accent data is shuffled and mini-batch stochastic gradient descent is used to ensure that multiple accents are learned simultaneously. When a sample of one accent is fed into the multi-accent acoustic model, only its corresponding Softmax output layer and the shared hidden layers are updated, while the other Softmax output layers remain unchanged. A learning rate smaller than the one used to train the Mandarin acoustic model is used, so that the Mandarin acoustic model is effectively preserved and exploited, i.e., the improved multi-accent acoustic model still recognizes Mandarin speech data well.
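The selective-update rule described above can be sketched in isolation. This toy example (hypothetical names; the shared hidden layers are omitted) only shows that the head for the current sample's accent moves while the other heads stay fixed:

```python
import numpy as np

def update_step(heads, head_grads, accent_id, lr=0.5):
    """When a sample of accent `accent_id` is fed in, only its own Softmax
    output layer (and, not shown here, the shared BLSTM layers) is updated;
    the other accent-specific output layers remain unchanged."""
    heads[accent_id] = heads[accent_id] - lr * head_grads[accent_id]
    return heads

heads = [np.ones((2, 2)) for _ in range(3)]   # three accent-specific heads
grads = [np.ones((2, 2)) for _ in range(3)]   # toy gradients
heads = update_step(heads, grads, accent_id=0)
```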
On the basis of the above embodiments, in the embodiment of the present invention, when the gating unit is an additive gating unit, the specific operation of the gating unit is:
g(h_i, v_a) = h_i + V·v_a + b;
where h_i is the first output vector of the i-th BLSTM layer, v_a is the accent category label of the current accent data, g(h_i, v_a) is the second output vector produced by the gating-unit operation, V is an M×N matrix, the dimensions of h_i and v_a are M and N respectively, b is a bias vector, and M and N are positive integers.
When the gating unit is a dot-product gating unit, the specific operation of the gating unit is:
g(h_i, v_a) = h_i ⊙ (V·v_a) + b;
where ⊙ denotes element-wise multiplication, h_i is the first output vector of the i-th BLSTM layer, v_a is the accent category label of the current accent data, g(h_i, v_a) is the second output vector produced by the gating-unit operation, V is an M×N matrix, the dimensions of h_i and v_a are M and N respectively, b is a bias vector, and M and N are positive integers.
The choice of gating unit may be determined according to the amount of accented training speech data and the performance of the Mandarin acoustic model, which is not limited in the embodiments of the present invention.
For the trained multi-accent acoustic model, the acoustic features of each accent's data are input into the model; the posterior probabilities of the phoneme states output by the accent-specific output layer (the corresponding Softmax output layer) of the neural network are combined, through a decoder, with the language model and pronunciation dictionary constructed for the specific task to obtain the recognized text sequence, completing recognition of the accent to be recognized.
In the following, the multi-accent acoustic model provided by the embodiment of the present invention is compared with various conventional acoustic models, and the recognition error rates of different acoustic models are shown in table 1.
TABLE 1
The rationality and effectiveness of the multi-accent acoustic model constructed according to the embodiment of the invention have been verified in practice; the word error rates for speech recognition of accents from various regions are shown in Table 1. In the table, the Mandarin acoustic model was trained on 7000 hours of Mandarin speech data, and the multi-accent acoustic model was built from accented speech data of four regions (20 hours per region); the test set for each accent is two hours, and all speech data is customer-service call speech from real scenarios.
Compared with the Mandarin acoustic model (the baseline), the word error rate of the multi-accent acoustic model constructed in the embodiment of the invention is reduced by 9.8% on average, and it also shows a performance improvement over both the traditional multi-accent acoustic model and the accent-specific acoustic models, which indicates that the multi-accent acoustic model of the embodiment of the invention is an efficient, high-performance acoustic model for multi-accent speech recognition.
According to the multi-accent acoustic model and multi-accent speech recognition method provided by the embodiments of the invention, the traditional Mandarin acoustic model structure is improved: the Softmax output layer of the traditional Mandarin acoustic model is replicated once for each accent category to be recognized, each Softmax output layer is an accent-specific output layer, and each accent has its own dedicated output layer; the gating unit applies an accent-specific adjustment to the output of a BLSTM layer of the neural network. That is, by means of transfer learning, a multi-task classification model with shared hidden layers (the BLSTM layers shared by the several Softmax output layers) is combined with an accent-information-based gating mechanism, and, starting from a robust Mandarin acoustic model, a multi-accent acoustic model is obtained by optimizing it on mixed multi-accent data for several target accents simultaneously. This yields a multi-accent acoustic model that is robust across multiple target accents while saving time and cost.
By contrast, under the traditional approach of building multiple accent-specific acoustic models, each accent-specific model needs a large amount of data for its specific accent during training, and such training data is difficult to obtain.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (9)
1. A multi-accent acoustic model for recognizing accent data of various categories, comprising a plurality of bidirectional long short-term memory network (BLSTM) layers, a plurality of Softmax output layers and a gate control unit, wherein the plurality of BLSTM layers are sequentially connected in series and then connected in series with each Softmax output layer, and the gate control unit is located between two adjacent BLSTM layers among the plurality of BLSTM layers;
the number of the Softmax output layers is equal to the number of the types of the accent data, and each Softmax output layer corresponds to the type of the accent data in a one-to-one mode.
2. The multi-accent acoustic model of claim 1, wherein the gate control unit is an additive gate control unit or a dot-product gate control unit.
3. A multi-accent speech recognition method, comprising:
extracting acoustic features of accent data to be recognized;
inputting the acoustic features into the trained multi-accent acoustic model of claim 1, and outputting the posterior probability of the triphone state of the accent data to be recognized;
and obtaining a text sequence of the accent data to be recognized according to the posterior probability of the triphone state of the accent data to be recognized.
4. The multi-accent speech recognition method of claim 3, wherein the multi-accent acoustic model is trained by:
extracting acoustic features and accent category labels of each piece of accent data in an accent data training set comprising accent data of multiple categories;
and training the multi-accent acoustic model based on the acoustic features and accent category labels of each piece of accent data.
5. The multi-accent speech recognition method of claim 4, wherein the training of the multi-accent acoustic model based on the acoustic features and accent category labels of each piece of accent data comprises:
inputting the acoustic features of each piece of accent data into the first BLSTM layer of the multi-accent acoustic model and passing them sequentially through the plurality of BLSTM layers, while inputting the first output vector of the BLSTM layer preceding the gate control unit, together with the accent category label of the current accent data, into the gate control unit;
and taking the second output vector produced by the specific operation of the gate control unit as the input of the BLSTM layer following the gate control unit, and outputting the posterior probability of the triphone state of the current accent data from the Softmax output layer corresponding to the current accent data.
6. The multi-accent speech recognition method of claim 5,
when the gate control unit is an additive gate control unit, the specific operation of the gate control unit is:
g(h_i, v_a) = h_i + V·v_a + b;
wherein h_i is the first output vector of the i-th BLSTM layer, v_a is the accent category label of the current accent data, g(h_i, v_a) is the second output vector after the gate control unit operation, V is an M × N matrix, the dimensions of h_i and v_a are M and N respectively, b is a bias vector, and M and N are positive integers.
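The additive gate of claim 6 can be illustrated numerically as follows; the dimensions (M = N = 2) and all matrix values are made up for illustration, and the one-hot form of v_a is an assumption consistent with claims 6 and 8:

```python
def additive_gate(h, V, v, b):
    """g(h_i, v_a) = h_i + V v_a + b, per claim 6.

    h: length-M first output vector of the i-th BLSTM layer
    V: M x N matrix, given as a list of M rows
    v: length-N accent category label vector
    b: length-M bias vector
    """
    Vv = [sum(row[n] * v[n] for n in range(len(v))) for row in V]
    return [h[m] + Vv[m] + b[m] for m in range(len(h))]

# With a one-hot accent label, V v_a selects one column of V,
# i.e. a learned per-accent shift of the hidden activations.
g = additive_gate(h=[1.0, 2.0],
                  V=[[0.5, -0.5],
                     [0.0, 1.0]],
                  v=[1.0, 0.0],   # accent 0 as a one-hot label
                  b=[0.1, 0.1])
# g ≈ [1.0 + 0.5 + 0.1, 2.0 + 0.0 + 0.1] = [1.6, 2.1]
```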
7. The multi-accent speech recognition method of claim 5,
when the gate control unit is a dot-multiply gate control unit, the specific operation of the gate control unit is:
g(h_i, v_a) = h_i · V·v_a + b;
wherein h_i is the first output vector of the i-th BLSTM layer, v_a is the accent category label of the current accent data, g(h_i, v_a) is the second output vector after the gate control unit operation, V is an M × N matrix, the dimensions of h_i and v_a are M and N respectively, b is a bias vector, and M and N are positive integers.
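The dot-multiply gate of claim 7 differs from the additive gate only in combining h_i with V v_a multiplicatively; reading the dot product element-wise between the two length-M vectors (a standard interpretation for gating, assumed here), a numerical sketch with the same made-up values is:

```python
def dot_multiply_gate(h, V, v, b):
    """g(h_i, v_a) = h_i · V v_a + b, per claim 7, with · taken
    element-wise between the length-M vectors h_i and V v_a
    (an assumed reading of the dot-multiply gate)."""
    Vv = [sum(row[n] * v[n] for n in range(len(v))) for row in V]
    return [h[m] * Vv[m] + b[m] for m in range(len(h))]

g = dot_multiply_gate(h=[1.0, 2.0],
                      V=[[0.5, -0.5],
                         [0.0, 1.0]],
                      v=[1.0, 0.0],   # accent 0 as a one-hot label
                      b=[0.1, 0.1])
# g ≈ [1.0 * 0.5 + 0.1, 2.0 * 0.0 + 0.1] = [0.6, 0.1]
```

Where the additive gate shifts hidden activations per accent, this variant rescales them per accent before adding the bias.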
8. The multi-accent speech recognition method according to any one of claims 4 to 7, wherein a unique code (e.g., a one-hot vector) of each accent data category is used as the category label of that accent data.
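Reading claim 8's unique code as a one-hot vector (an assumption, but one consistent with v_a being a length-N vector multiplied by the M × N matrix V in claims 6 and 7), the label encoding is simply:

```python
def one_hot(accent_index, num_categories):
    """Encode an accent category index as the unique label vector v_a:
    a length-N vector with a 1.0 at the category's position."""
    return [1.0 if i == accent_index else 0.0
            for i in range(num_categories)]

# Example: the second of three accent categories
label = one_hot(1, 3)
# label == [0.0, 1.0, 0.0]
```

With such a label, V·v_a in the gate operations selects exactly one column of V, so each accent category gets its own learned adjustment vector.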
9. The method of claim 5, wherein the multi-accent acoustic model is trained using a mini-batch stochastic gradient descent method.
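The mini-batch stochastic gradient descent of claim 9 can be sketched in a framework-free form; the toy loss, gradient function, and hyperparameter values below are illustrative only, not those used for the actual acoustic model:

```python
def minibatch_sgd(params, grad_fn, data, lr=0.1, batch_size=2, epochs=1):
    """Mini-batch SGD sketch: update parameters against the gradient
    averaged over each small batch of training examples.
    grad_fn(params, batch) returns the gradient as a list of floats."""
    for _ in range(epochs):
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grads = grad_fn(params, batch)
            params = [p - lr * g for p, g in zip(params, grads)]
    return params

# Toy usage: fit a single parameter p to minimize mean (p - x)^2
def mean_squared_grad(params, batch):
    return [sum(2.0 * (params[0] - x) for x in batch) / len(batch)]

fitted = minibatch_sgd([0.0], mean_squared_grad, [1.0, 1.0, 1.0, 1.0],
                       lr=0.5, batch_size=2)
# fitted == [1.0]
```

In the acoustic-model setting, each batch would hold acoustic feature sequences with their accent labels, and the gradients would flow through the selected Softmax head, the gate, and the shared BLSTM layers.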
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911050896.3A CN110930982A (en) | 2019-10-31 | 2019-10-31 | Multi-accent acoustic model and multi-accent voice recognition method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911050896.3A CN110930982A (en) | 2019-10-31 | 2019-10-31 | Multi-accent acoustic model and multi-accent voice recognition method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110930982A true CN110930982A (en) | 2020-03-27 |
Family
ID=69849958
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911050896.3A Pending CN110930982A (en) | 2019-10-31 | 2019-10-31 | Multi-accent acoustic model and multi-accent voice recognition method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110930982A (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104036774A (en) * | 2014-06-20 | 2014-09-10 | 国家计算机网络与信息安全管理中心 | Method and system for recognizing Tibetan dialects |
CN106875942A (en) * | 2016-12-28 | 2017-06-20 | 中国科学院自动化研究所 | Acoustic model adaptive approach based on accent bottleneck characteristic |
US20180053500A1 (en) * | 2016-08-22 | 2018-02-22 | Google Inc. | Multi-accent speech recognition |
US20190088251A1 (en) * | 2017-09-18 | 2019-03-21 | Samsung Electronics Co., Ltd. | Speech signal recognition system and method |
CN109829058A (en) * | 2019-01-17 | 2019-05-31 | 西北大学 | A kind of classifying identification method improving accent recognition accuracy rate based on multi-task learning |
US20200160836A1 (en) * | 2018-11-21 | 2020-05-21 | Google Llc | Multi-dialect and multilingual speech recognition |
CN112992119A (en) * | 2021-01-14 | 2021-06-18 | 安徽大学 | Deep neural network-based accent classification method and model thereof |
Non-Patent Citations (7)
Title |
---|
HAN ZHU et al.: "Multi-Accent Adaptation based on Gate Mechanism", INTERSPEECH 2019, pages 744-748 *
HAN ZHU et al.: "Multi-Accent Adaptation based on Gate Mechanism", https://arxiv.org/abs/2011.02774 *
JIANGYAN YI et al.: "Improving BLSTM RNN based Mandarin speech recognition using accent dependent bottleneck features", 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA) *
XUESONG YANG et al.: "Joint Modeling of Accents and Acoustics for Multi-Accent Speech Recognition", 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) *
ZHOU Gang: "Research on Speech Recognition of the Lhasa Dialect of Tibetan", China Master's Theses Full-text Database *
LI Deyi et al.: "Intelligent Summarization and Deep Learning (Artificial Intelligence and Robotics Advanced Technology Series)", Beijing Institute of Technology Press, pages 103-104 *
遥遥子YY: "Learning Kaldi decision-tree tying and triphones from scratch", retrieved from the Internet: https://blog.csdn.net/qq_37591044/article/details/102395480 *
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111508501A (en) * | 2020-07-02 | 2020-08-07 | 成都晓多科技有限公司 | Voice recognition method and system with accent for telephone robot |
CN111508501B (en) * | 2020-07-02 | 2020-09-29 | 成都晓多科技有限公司 | Voice recognition method and system with accent for telephone robot |
CN113593524A (en) * | 2021-01-26 | 2021-11-02 | 腾讯科技(深圳)有限公司 | Method and device for training acoustic model for accent recognition, and storage medium |
CN113593525A (en) * | 2021-01-26 | 2021-11-02 | 腾讯科技(深圳)有限公司 | Method, device and storage medium for training accent classification model and accent classification |
CN112885351A (en) * | 2021-04-30 | 2021-06-01 | 浙江非线数联科技股份有限公司 | Dialect voice recognition method and device based on transfer learning |
CN112885351B (en) * | 2021-04-30 | 2021-07-23 | 浙江非线数联科技股份有限公司 | Dialect voice recognition method and device based on transfer learning |
CN113593534A (en) * | 2021-05-28 | 2021-11-02 | 思必驰科技股份有限公司 | Method and apparatus for multi-accent speech recognition |
CN113593534B (en) * | 2021-05-28 | 2023-07-14 | 思必驰科技股份有限公司 | Method and device for multi-accent speech recognition |
US11776323B2 (en) | 2022-02-15 | 2023-10-03 | Ford Global Technologies, Llc | Biometric task network |
CN114596845A (en) * | 2022-04-13 | 2022-06-07 | 马上消费金融股份有限公司 | Training method of voice recognition model, voice recognition method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110930982A (en) | Multi-accent acoustic model and multi-accent voice recognition method | |
EP3966816B1 (en) | Large-scale multilingual speech recognition with a streaming end-to-end model | |
US8126717B1 (en) | System and method for predicting prosodic parameters | |
CN112889073A (en) | Cross-language classification using multi-language neural machine translation | |
CN110459208B (en) | Knowledge migration-based sequence-to-sequence speech recognition model training method | |
JP2004279701A (en) | Method and device for sound model creation, and speech recognition device | |
Masumura et al. | Large context end-to-end automatic speech recognition via extension of hierarchical recurrent encoder-decoder models | |
JP2023544336A (en) | System and method for multilingual speech recognition framework | |
JP2022512233A (en) | Neural adjustment code for multilingual style-dependent speech language processing | |
CN112700778A (en) | Speech recognition method and speech recognition apparatus | |
CN115039170A (en) | Proper noun recognition in end-to-end speech recognition | |
CN116303966A (en) | Dialogue behavior recognition system based on prompt learning | |
US11990117B2 (en) | Using speech recognition to improve cross-language speech synthesis | |
CN111553157A (en) | Entity replacement-based dialog intention identification method | |
US20220310080A1 (en) | Multi-Task Learning for End-To-End Automated Speech Recognition Confidence and Deletion Estimation | |
Hu et al. | The USTC system for blizzard challenge 2017 | |
KR20220128401A (en) | Attention-based joint acoustics and text on-device end-to-end (E2E) models | |
US20230317059A1 (en) | Alignment Prediction to Inject Text into Automatic Speech Recognition Training | |
Razavi et al. | An HMM-based formalism for automatic subword unit derivation and pronunciation generation | |
Wu et al. | Factored recurrent neural network language model in TED lecture transcription | |
Abraham et al. | An automated technique to generate phone-to-articulatory label mapping | |
US11823697B2 (en) | Improving speech recognition with speech synthesis-based model adapation | |
WO2022086640A1 (en) | Fast emit low-latency streaming asr with sequence-level emission regularization | |
Farooq et al. | Learning cross-lingual mappings for data augmentation to improve low-resource speech recognition | |
Wang et al. | Speech-and-text transformer: Exploiting unpaired text for end-to-end speech recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||