CN110930982A - Multi-accent acoustic model and multi-accent voice recognition method - Google Patents

Multi-accent acoustic model and multi-accent voice recognition method

Info

Publication number
CN110930982A
CN110930982A
Authority
CN
China
Prior art keywords
accent
data
acoustic model
control unit
blstm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911050896.3A
Other languages
Chinese (zh)
Inventor
计哲
黄远
高圣翔
沈亮
林格平
徐艳云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
National Computer Network and Information Security Management Center
Original Assignee
Institute of Information Engineering of CAS
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS, National Computer Network and Information Security Management Center filed Critical Institute of Information Engineering of CAS
Priority to CN201911050896.3A
Publication of CN110930982A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L 2015/025 - Phonemes, fenemes or fenones being the recognition units
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks
    • G10L 15/26 - Speech to text systems
    • G10L 15/28 - Constructional details of speech recognition systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a multi-accent acoustic model and a multi-accent speech recognition method. The multi-accent acoustic model comprises a plurality of BLSTM layers, a plurality of Softmax output layers, and gating units; the BLSTM layers are connected in series in sequence and then connected to each Softmax output layer, and a gating unit is located between two adjacent BLSTM layers. The method improves the structure of the conventional Mandarin acoustic model: according to the number of accent categories to be recognized, the Softmax output layer of the conventional Mandarin acoustic model is replicated into a plurality of Softmax output layers, each an accent-specific output layer, so that each accent has its own dedicated output layer; and the gating units apply an accent-specific adjustment to the outputs of the BLSTM layers of the neural network, so that the model fits the various accents better.

Description

Multi-accent acoustic model and multi-accent voice recognition method
Technical Field
The invention belongs to the technical field of voice recognition, and particularly relates to a multi-accent acoustic model and a multi-accent voice recognition method.
Background
A speech recognition system built for Mandarin on an acoustic model combining a neural network with a hidden Markov model can achieve satisfactory results on standard Mandarin speech. When such a Mandarin acoustic model is applied to a speech recognition task involving accents, however, its performance drops markedly, mainly because an acoustic model built for Mandarin cannot accurately classify the phoneme states of accented speech data. A dedicated acoustic model therefore needs to be constructed for the task of recognizing accented speech.
The accents of a language arise mainly from speakers whose native language is a different language, or from speakers of some dialect of the language; for Chinese, the latter is the main source of accents. Chinese can be roughly divided into seven major dialect groups: Mandarin (Guanhua), Wu, Xiang, Gan, Hakka, Min, and Yue (Cantonese). A large and internally complex dialect area can in turn be divided into several smaller dialect areas, and dialects at the city and county level are called local dialects, such as the Guangzhou, Qingdao, and Tangshan dialects. The variety of accents derived from the different dialects is correspondingly large, which gives rise to the multi-accent speech recognition problem frequently encountered in practical applications of speech recognition.
In a practical production environment, large amounts of Mandarin speech data are easy to obtain, whereas accented speech data often suffer from data sparsity because labeling is complex and labor costs are high. To make full use of the limited data and obtain the best possible recognition performance, a common practice is to train a robust Mandarin acoustic model on the large Mandarin corpus and then adapt it with data for a single accent, yielding an accent-specific acoustic model; this is called accent-specific acoustic model adaptation. However, this method requires separate adaptive training, and a separate search for optimal configuration parameters, for each target accent; it ultimately produces one acoustic model per accent, which is costly in both training complexity and storage space.
These costs can be avoided by optimizing the Mandarin acoustic model directly on multi-accent speech data with conventional adaptation methods, but the multi-accent acoustic model obtained in this conventional way usually performs worse than accent-specific adaptation.
Disclosure of Invention
In order to overcome, or at least partially solve, the above problems that multiple accents cannot be recognized or are recognized with low accuracy, embodiments of the present invention provide a multi-accent acoustic model and a multi-accent speech recognition method.
According to one aspect of the invention, a multi-accent acoustic model is provided, comprising a plurality of bidirectional long short-term memory (BLSTM) layers, a plurality of Softmax output layers, and gating units, wherein the BLSTM layers are connected in series in sequence and then connected to each Softmax output layer, and a gating unit is located between two adjacent BLSTM layers among the plurality of BLSTM layers;
the number of Softmax output layers is equal to the number of accent data categories, with a one-to-one correspondence between Softmax output layers and accent categories.
On the basis of the technical scheme, the invention can be improved as follows.
Preferably, the gating unit is an additive gating unit or a dot-product gating unit.
According to another aspect of the present invention, there is provided a multi-accent speech recognition method, comprising:
extracting acoustic features of accent data to be recognized;
inputting the acoustic features into a trained multi-accent acoustic model, and outputting the posterior probability of the triphone state of the accent data to be recognized;
and obtaining a text sequence of the accent data to be recognized according to the posterior probability of the triphone state of the accent data to be recognized.
Preferably, the multi-accent acoustic model is trained by:
extracting acoustic features and accent category labels of each piece of accent data in an accent data training set comprising accent data of multiple categories;
and training the multi-accent acoustic model based on the acoustic features and accent category labels of each piece of accent data.
Preferably, the training of the multi-accent acoustic model based on the acoustic features and accent category labels of each piece of accent data includes:
inputting the acoustic features of each piece of accent data into the first BLSTM layer of the multi-accent acoustic model and propagating them through the BLSTM layers in sequence, while feeding the first output vector of the BLSTM layer preceding the gating unit, together with the accent category label of the current accent data, into the gating unit;
and using the second output vector produced by the gating-unit operation as the input of the BLSTM layer following the gating unit, the posterior probabilities of the triphone states of the current accent data being output by the Softmax output layer corresponding to the current accent data.
Preferably, when the gating unit is an additive gating unit, the specific operation of the gating unit is:
g(h_i, v_a) = h_i + V v_a + b;
where h_i is the first output vector of the i-th BLSTM layer, v_a is the accent category label of the current accent data, g(h_i, v_a) is the second output vector after the gating-unit operation, V is an M×N matrix, the dimensions of h_i and v_a are M and N respectively, b is a bias vector, and M and N are positive integers.
Preferably, when the gating unit is a dot-product gating unit, the specific operation of the gating unit is:
g(h_i, v_a) = h_i · (V v_a) + b;
where · denotes element-wise multiplication, h_i is the first output vector of the i-th BLSTM layer, v_a is the accent category label of the current accent data, g(h_i, v_a) is the second output vector after the gating-unit operation, V is an M×N matrix, the dimensions of h_i and v_a are M and N respectively, b is a bias vector, and M and N are positive integers.
Preferably, a one-hot code of each accent data category is used as the category label of each piece of accent data.
Preferably, the multi-accent acoustic model is trained by a mini-batch stochastic gradient descent method.
The invention has the beneficial effects that:
the structure of the conventional Mandarin acoustic model is improved: according to the number of accent categories to be recognized, the Softmax output layer of the conventional Mandarin acoustic model is replicated into a plurality of Softmax output layers, each an accent-specific output layer, so that each accent has its own dedicated output layer; and the gating units apply an accent-specific adjustment to the outputs of the BLSTM layers of the neural network, so that the model fits the various accents better.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a diagram of a conventional Mandarin acoustic model architecture;
FIG. 2 is a diagram of a multi-accent acoustic model architecture according to an embodiment of the present invention;
FIG. 3 is a flowchart of a method for performing multi-accent speech recognition using the multi-accent acoustic model of FIG. 2 according to an embodiment of the present invention.
Detailed Description
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings required for the embodiments or the prior-art solutions are briefly described below. The drawings in the following description show some embodiments of the present invention, and those skilled in the art can obtain other drawings based on them without creative effort.
Referring to FIG. 2, a multi-accent acoustic model is provided for recognizing accent data of multiple categories. The model includes a plurality of BLSTM layers, a plurality of Softmax output layers, and gating units; the BLSTM layers are connected in series in sequence and then connected to each Softmax output layer, and a gating unit is located between two adjacent BLSTM layers among the plurality of BLSTM layers. The number of Softmax output layers is equal to the number of accent data categories, with a one-to-one correspondence between Softmax output layers and accent categories.
It can be understood, referring to FIG. 1, which shows the architecture of a conventional standard-Mandarin acoustic model, that a conventional Mandarin acoustic model generally adopts a deep model consisting mainly of several BLSTM layers and a single Softmax output layer; the BLSTM layers are connected in series in sequence and then connected to the Softmax output layer, and the model is trained on a standard-Mandarin speech training set. The trained Mandarin acoustic model can then be used to recognize Mandarin.
The conventional Mandarin acoustic model can only recognize standard Mandarin; on speech data with regional accents it either fails to recognize or recognizes with very low accuracy. The embodiment of the invention provides an improved multi-accent acoustic model that can accurately recognize speech data of multiple accents.
Referring to FIG. 2, the multi-accent acoustic model provided in an embodiment of the present invention improves on the network architecture of the Mandarin acoustic model. The Softmax output layer of the ordinary acoustic model is replicated n times, where n is the number of accent categories that the multi-accent acoustic model can recognize, so that each accent category corresponds to one Softmax output layer; and a gating unit is added between any two adjacent BLSTM layers of the neural network, as sketched below.
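For concreteness, the following is a minimal PyTorch sketch of this architecture. The class names (MultiAccentModel, AdditiveGate), layer sizes, and constructor parameters are illustrative assumptions, not taken from the patent; the sketch only shows the structural idea of shared BLSTM layers, gating units between them, and one output layer per accent.

    import torch
    import torch.nn as nn

    class AdditiveGate(nn.Module):
        # Additive gating unit: g(h_i, v_a) = h_i + V v_a + b
        def __init__(self, hidden_dim: int, num_accents: int):
            super().__init__()
            self.V = nn.Linear(num_accents, hidden_dim, bias=True)  # V is M x N; its bias plays the role of b

        def forward(self, h, v_a):
            # h: (batch, time, hidden_dim); v_a: (batch, num_accents) one-hot accent label
            return h + self.V(v_a).unsqueeze(1)  # broadcast the accent-specific shift over time

    class MultiAccentModel(nn.Module):
        # BLSTM stack with a gating unit between adjacent layers and one
        # Softmax output layer per accent category (the BLSTM layers are shared).
        def __init__(self, feat_dim, hidden_dim, num_layers, num_accents, num_senones):
            super().__init__()
            in_dims = [feat_dim] + [2 * hidden_dim] * (num_layers - 1)
            self.blstms = nn.ModuleList(
                nn.LSTM(d, hidden_dim, batch_first=True, bidirectional=True) for d in in_dims)
            self.gates = nn.ModuleList(
                AdditiveGate(2 * hidden_dim, num_accents) for _ in range(num_layers - 1))
            self.heads = nn.ModuleList(
                nn.Linear(2 * hidden_dim, num_senones) for _ in range(num_accents))

        def forward(self, feats, v_a, accent_id):
            h = feats
            for i, blstm in enumerate(self.blstms):
                h, _ = blstm(h)
                if i < len(self.gates):           # gating unit between layers i and i+1
                    h = self.gates[i](h, v_a)
            return self.heads[accent_id](h)       # logits; softmax is applied in the loss/decoder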
The embodiment of the invention optimizes, by a transfer learning method, on the basis of a robust Mandarin acoustic model trained on large-volume Mandarin speech data. Concretely, the implementation combines a multi-task classification model based on the BLSTM layers with a gating mechanism based on accent information. Under the multi-task classification model, the speech recognition tasks for the several accents share the BLSTM layers of the neural network, while the accent-specific output layers (one Softmax output layer per accent category) are kept separate, i.e., each accent has its own dedicated Softmax output layer. Under the gating mechanism, a gating unit applies an accent-specific adjustment to the hidden-layer outputs of the neural network, so that the model fits the various accents better.
It can be understood that in the embodiment of the present invention gating units are placed between the BLSTM layers of an ordinary acoustic model. It should be noted that a gating unit may be placed between any two adjacent BLSTM layers; in the embodiment of the present invention, a gating unit is placed between every pair of adjacent BLSTM layers.
On the basis of the above embodiments, in the embodiment of the present invention the gating unit is an additive gating unit or a dot-product gating unit. The type of gating unit can be chosen according to the amount of accented speech data and the performance of the Mandarin acoustic model. The gating unit operates as follows: the output vector h_i of the i-th layer and the accent category label vector v_a are fed into the gating unit together; after the gating unit's specific operation, the transformed vector g(h_i, v_a) is passed as input to the (i+1)-th layer, the gating unit being located between the i-th BLSTM layer and the (i+1)-th BLSTM layer.
Referring to FIG. 3, a multi-accent speech recognition method is provided, which performs multi-accent speech recognition based on the multi-accent acoustic model provided in the above embodiments. The method comprises the following steps:
extracting acoustic features of accent data to be recognized;
inputting the acoustic features into a trained multi-accent acoustic model, and outputting the posterior probability of the triphone state of the accent data to be recognized;
and obtaining a text sequence of the accent data to be recognized according to the posterior probability of the triphone state of the accent data to be recognized.
It can be understood that the embodiment of the present invention recognizes multi-accent speech data based on the multi-accent acoustic model provided by the above embodiments. In the recognition process, the acoustic features extracted from the speech data to be recognized are input into the trained multi-accent acoustic model, which outputs the posterior probabilities of the triphone states of the speech data. These posterior probabilities are then combined with a language model and a pronunciation dictionary and passed through a decoder to obtain the recognized text sequence, completing the recognition of the accent data to be recognized; a sketch of this flow follows.
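As a minimal illustration of this flow, the following sketch assumes the hypothetical MultiAccentModel above and uses torchaudio's Kaldi-style filterbank extraction as a stand-in acoustic front end (the patent does not specify the features); the decoder step is only indicated in a comment, since WFST decoding with a language model and lexicon is beyond the scope of a sketch. Because systems are deployed per region, the accent category of the input is assumed known at recognition time.

    import torch
    import torchaudio

    def recognize(model, wav_path: str, accent_id: int, num_accents: int):
        # Hypothetical front end: 40-dim log-Mel filterbank features
        waveform, sample_rate = torchaudio.load(wav_path)
        feats = torchaudio.compliance.kaldi.fbank(
            waveform, num_mel_bins=40, sample_frequency=sample_rate)
        feats = feats.unsqueeze(0)                    # (1, time, 40)
        v_a = torch.zeros(1, num_accents)
        v_a[0, accent_id] = 1.0                       # one-hot accent category label
        with torch.no_grad():
            logits = model(feats, v_a, accent_id)
            posteriors = torch.log_softmax(logits, dim=-1)  # triphone-state posteriors
        # A decoder would now combine these posteriors with the language model
        # and pronunciation dictionary to produce the recognized text sequence.
        return posteriors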
On the basis of the above embodiment, in the embodiment of the present invention, the multi-accent acoustic model is trained in the following manner:
extracting acoustic features and accent category labels of each piece of accent data in an accent data training set comprising accent data of multiple categories;
and training the multi-accent acoustic model based on the acoustic features and accent category labels of each piece of accent data.
On the basis of the foregoing embodiments, in an embodiment of the present invention, the training the multi-accent acoustic model based on the acoustic features and accent category labels of each piece of accent data includes:
inputting the acoustic features of each piece of accent data into the first BLSTM layer of the multi-accent acoustic model and propagating them through the BLSTM layers in sequence, while feeding the first output vector of the BLSTM layer preceding the gating unit, together with the accent category label of the current accent data, into the gating unit;
and using the second output vector produced by the gating-unit operation as the input of the BLSTM layer following the gating unit, the posterior probabilities of the triphone states of the current accent data being output by the Softmax output layer corresponding to the current accent data.
It is to be understood that the multi-accent acoustic model is built on a performance-robust Mandarin acoustic model trained on a Mandarin speech training set with sufficient data. The accent-specific output layers and the gating units are added to the Mandarin acoustic model at the model-initialization stage, and the modified network (i.e., the multi-accent acoustic model of the embodiment of the invention) is then optimized by stochastic gradient descent.
In practical applications, a speech recognition system for a specific accent is usually deployed by region; for the regional-accent problem, the speech data collected in one region can therefore be treated as one accent, and the collected accent data used to construct the speech recognition acoustic model.
It should be noted that the conventional Mandarin acoustic model usually adopts a hybrid architecture of a neural network and a hidden Markov model, in which the neural network part is built from multiple bidirectional long short-term memory (BLSTM) layers and its output targets are the posterior probabilities of the states of context-dependent phonemes.
The multi-accent acoustic model provided by the embodiment of the invention is obtained by improving the conventional Mandarin acoustic model. Once the improved multi-accent acoustic model has been obtained, it must be trained. For training, accented speech data are collected by region and labeled (i.e., annotated with accent category labels) to construct a multi-accent speech training set. The training set should contain data for all target accents, the amounts of data for the different categories should in principle be kept roughly equal, and a one-hot code representing the accent category is stored for each utterance (i.e., the one-hot code of each accent category serves as the category label of each piece of accent data). The multi-accent acoustic model is then trained on this training set. During training, the multi-accent data are shuffled, and mini-batch stochastic gradient descent is used so that the several accents are learned simultaneously. When an accented data sample is fed into the multi-accent acoustic model, only its corresponding Softmax output layer and the shared hidden layers are updated, while the other Softmax output layers remain unchanged. A learning rate smaller than the one used to train the Mandarin acoustic model is used, to make effective use of the Mandarin acoustic model, i.e., so that the improved multi-accent acoustic model also retains a good recognition rate on Mandarin speech data. A sketch of this training scheme follows.
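The following is a minimal sketch of this training scheme, reusing the hypothetical MultiAccentModel above; the data-loader interface (batches of features, frame-level senone targets, and a single accent id per batch) and the hyperparameters are illustrative assumptions. Note that the selective update described in the text falls out of backpropagation automatically: each batch's loss passes only through the shared BLSTM layers, the gating units, and the one accent-specific output layer that produced it, so the other Softmax output layers receive no gradient.

    import torch
    import torch.nn.functional as F

    def train_multi_accent(model, loader, num_accents: int,
                           lr: float = 1e-4, epochs: int = 5):
        # Small learning rate, to preserve what the Mandarin model has learned
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        for _ in range(epochs):
            for feats, senone_targets, accent_id in loader:  # loader shuffles across accents
                v_a = F.one_hot(torch.full((feats.size(0),), accent_id),
                                num_classes=num_accents).float()
                logits = model(feats, v_a, accent_id)        # (batch, time, num_senones)
                loss = F.cross_entropy(logits.flatten(0, 1), senone_targets.flatten())
                opt.zero_grad()
                loss.backward()   # output layers of other accents get no gradient
                opt.step()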
On the basis of the above embodiments, in the embodiment of the present invention, when the gating unit is an additive gating unit, the specific operation of the gating unit is:
g(h_i, v_a) = h_i + V v_a + b;
where h_i is the first output vector of the i-th BLSTM layer, v_a is the accent category label of the current accent data, g(h_i, v_a) is the second output vector after the gating-unit operation, V is an M×N matrix, the dimensions of h_i and v_a are M and N respectively, b is a bias vector, and M and N are positive integers.
When the gating unit is a dot-product gating unit, the specific operation of the gating unit is:
g(h_i, v_a) = h_i · (V v_a) + b;
where · denotes element-wise multiplication, h_i is the first output vector of the i-th BLSTM layer, v_a is the accent category label of the current accent data, g(h_i, v_a) is the second output vector after the gating-unit operation, V is an M×N matrix, the dimensions of h_i and v_a are M and N respectively, b is a bias vector, and M and N are positive integers.
The specific choice of gating unit can be made according to the amount of accented training speech data and the performance of the Mandarin acoustic model; the embodiments of the present invention do not limit this. Both variants are sketched below.
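To make the two variants concrete, the following is a minimal sketch consistent with the dimensions given above; the module name and initialization are assumptions, and the dot-product variant is implemented as element-wise multiplication, which is the reading consistent with the stated dimensions (an inner product would collapse h_i to a scalar).

    import torch
    import torch.nn as nn

    class Gate(nn.Module):
        # Gating unit between BLSTM layers i and i+1.
        #   additive:     g(h_i, v_a) = h_i + V v_a + b
        #   dot-product:  g(h_i, v_a) = h_i * (V v_a) + b   (element-wise)
        # h_i has per-frame dimension M, v_a has dimension N, V is M x N.
        def __init__(self, M: int, N: int, mode: str = "additive"):
            super().__init__()
            self.mode = mode
            self.V = nn.Parameter(torch.randn(M, N) * 0.01)
            self.b = nn.Parameter(torch.zeros(M))

        def forward(self, h_i, v_a):
            # h_i: (batch, time, M); v_a: (batch, N) one-hot accent label
            shift = (v_a @ self.V.T).unsqueeze(1)   # (batch, 1, M), broadcast over time
            if self.mode == "additive":
                return h_i + shift + self.b
            return h_i * shift + self.b             # dot-product (element-wise) variant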
With the trained multi-accent acoustic model, for each accent the acoustic features of the accent data are input into the model; the posterior probabilities of the phoneme states output by the accent-specific output layer (the corresponding Softmax output layer) of the neural network are combined with the language model and pronunciation dictionary constructed for the specific task, and a decoder produces the recognized text sequence, completing recognition of the accent to be recognized.
In the following, the multi-accent acoustic model provided by the embodiment of the present invention is compared with various conventional acoustic models, and the recognition error rates of different acoustic models are shown in table 1.
TABLE 1
[Table 1 is reproduced as an image in the original publication; it lists the recognition word error rates of the different acoustic models.]
The rationality and effectiveness of the multi-accent acoustic model constructed according to the embodiment of the present invention have been verified in practice; Table 1 shows the word error rates of speech recognition for accents from several regions. In the table, the Mandarin acoustic model was trained on 7000 hours of Mandarin speech data, and the multi-accent acoustic model was built with accented speech data from four regions (20 hours per region); the test set for each accent is two hours, and all speech data are customer-service call recordings from real-world scenarios.
Compared with the Mandarin acoustic model (the baseline), the word error rate of the multi-accent acoustic model constructed according to the embodiment of the present invention is reduced by 9.8% on average, and it also improves on both the conventional multi-accent acoustic model and the accent-specific acoustic models, showing that the multi-accent acoustic model of the embodiment of the invention is an efficient, high-performance acoustic model for multi-accent speech recognition.
According to the multi-accent acoustic model and multi-accent speech recognition method provided by the embodiments of the present invention, the structure of the conventional Mandarin acoustic model is improved: the Softmax output layer of the conventional Mandarin acoustic model is replicated according to the number of accent categories to be recognized, each Softmax output layer being an accent-specific output layer, i.e., each accent has its own dedicated output layer; and the gating units apply an accent-specific adjustment to the outputs of the BLSTM layers of the neural network. That is, transfer learning is used: starting from a performance-robust Mandarin acoustic model, a multi-task classification model with shared hidden layers (the several Softmax output layers share the BLSTM layers) and an accent-information-based gating mechanism are optimized on mixed multi-accent data toward several target accents at once, yielding, while saving time and cost, a multi-accent acoustic model whose performance is robust across the target accents.
In the conventional approach of building multiple accent-specific acoustic models, by contrast, each accent-specific model requires a large amount of data for its specific accent during training, and such accent data are difficult to obtain for a training set.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (9)

1. A multi-accent acoustic model for recognizing accent data of multiple categories, comprising a plurality of bidirectional long short-term memory network (BLSTM) layers, a plurality of Softmax output layers, and gating units, wherein the plurality of BLSTM layers are sequentially connected in series and then connected in series with each Softmax output layer, and a gating unit is located between two adjacent BLSTM layers among the plurality of BLSTM layers;
the number of Softmax output layers is equal to the number of accent data categories, with a one-to-one correspondence between Softmax output layers and accent categories.
2. The multi-accent acoustic model of claim 1, wherein the gating unit is an additive gating unit or a dot-product gating unit.
3. A multi-accent speech recognition method, comprising:
extracting acoustic features of accent data to be recognized;
inputting the acoustic features into the trained multi-accent acoustic model of claim 1, and outputting the posterior probability of the triphone state of the accent data to be recognized;
and obtaining a text sequence of the accent data to be recognized according to the posterior probability of the triphone state of the accent data to be recognized.
4. The multi-accent speech recognition method of claim 3, wherein the multi-accent acoustic models are trained by:
extracting acoustic features and accent category labels of each piece of accent data in an accent data training set comprising accent data of multiple categories;
and training the multi-accent acoustic model based on the acoustic features and accent category labels of each piece of accent data.
5. The multi-accent speech recognition method of claim 4, wherein the training of the multi-accent acoustic models based on the acoustic features and accent category labels of each piece of accent data comprises:
inputting the acoustic features of each piece of accent data into the first BLSTM layer of the multi-accent acoustic model and propagating them through the BLSTM layers in sequence, while feeding the first output vector of the BLSTM layer preceding the gating unit, together with the accent category label of the current accent data, into the gating unit;
and using the second output vector produced by the gating-unit operation as the input of the BLSTM layer following the gating unit, the posterior probabilities of the triphone states of the current accent data being output by the Softmax output layer corresponding to the current accent data.
6. The multi-accent speech recognition method of claim 5,
when the gating unit is an additive gating unit, the specific operation of the gating unit is:
g(h_i, v_a) = h_i + V v_a + b;
where h_i is the first output vector of the i-th BLSTM layer, v_a is the accent category label of the current accent data, g(h_i, v_a) is the second output vector after the gating-unit operation, V is an M×N matrix, the dimensions of h_i and v_a are M and N respectively, b is a bias vector, and M and N are positive integers.
7. The multi-accent speech recognition method of claim 5,
when the gating unit is a dot-product gating unit, the specific operation of the gating unit is:
g(h_i, v_a) = h_i · (V v_a) + b;
where · denotes element-wise multiplication, h_i is the first output vector of the i-th BLSTM layer, v_a is the accent category label of the current accent data, g(h_i, v_a) is the second output vector after the gating-unit operation, V is an M×N matrix, the dimensions of h_i and v_a are M and N respectively, b is a bias vector, and M and N are positive integers.
8. The multi-accent speech recognition method according to any one of claims 4 to 7, wherein a one-hot code of each accent data category is used as the category label of each piece of accent data.
9. The method of claim 5, wherein the multi-accent acoustic model is trained using a mini-batch stochastic gradient descent method.
CN201911050896.3A 2019-10-31 2019-10-31 Multi-accent acoustic model and multi-accent voice recognition method Pending CN110930982A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911050896.3A CN110930982A (en) 2019-10-31 2019-10-31 Multi-accent acoustic model and multi-accent voice recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911050896.3A CN110930982A (en) 2019-10-31 2019-10-31 Multi-accent acoustic model and multi-accent voice recognition method

Publications (1)

Publication Number Publication Date
CN110930982A true CN110930982A (en) 2020-03-27

Family

ID=69849958

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911050896.3A Pending CN110930982A (en) 2019-10-31 2019-10-31 Multi-accent acoustic model and multi-accent voice recognition method

Country Status (1)

Country Link
CN (1) CN110930982A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111508501A (en) * 2020-07-02 2020-08-07 成都晓多科技有限公司 Voice recognition method and system with accent for telephone robot
CN112885351A (en) * 2021-04-30 2021-06-01 浙江非线数联科技股份有限公司 Dialect voice recognition method and device based on transfer learning
CN113593534A (en) * 2021-05-28 2021-11-02 思必驰科技股份有限公司 Method and apparatus for multi-accent speech recognition
CN113593524A (en) * 2021-01-26 2021-11-02 腾讯科技(深圳)有限公司 Method and device for training acoustic model for accent recognition, and storage medium
CN113593525A (en) * 2021-01-26 2021-11-02 腾讯科技(深圳)有限公司 Method, device and storage medium for training accent classification model and accent classification
CN114596845A (en) * 2022-04-13 2022-06-07 马上消费金融股份有限公司 Training method of voice recognition model, voice recognition method and device
US11776323B2 (en) 2022-02-15 2023-10-03 Ford Global Technologies, Llc Biometric task network

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104036774A (en) * 2014-06-20 2014-09-10 国家计算机网络与信息安全管理中心 Method and system for recognizing Tibetan dialects
CN106875942A (en) * 2016-12-28 2017-06-20 中国科学院自动化研究所 Acoustic model adaptive approach based on accent bottleneck characteristic
US20180053500A1 (en) * 2016-08-22 2018-02-22 Google Inc. Multi-accent speech recognition
US20190088251A1 (en) * 2017-09-18 2019-03-21 Samsung Electronics Co., Ltd. Speech signal recognition system and method
CN109829058A (en) * 2019-01-17 2019-05-31 西北大学 A kind of classifying identification method improving accent recognition accuracy rate based on multi-task learning
US20200160836A1 (en) * 2018-11-21 2020-05-21 Google Llc Multi-dialect and multilingual speech recognition
CN112992119A (en) * 2021-01-14 2021-06-18 安徽大学 Deep neural network-based accent classification method and model thereof

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104036774A (en) * 2014-06-20 2014-09-10 国家计算机网络与信息安全管理中心 Method and system for recognizing Tibetan dialects
US20180053500A1 (en) * 2016-08-22 2018-02-22 Google Inc. Multi-accent speech recognition
CN106875942A (en) * 2016-12-28 2017-06-20 中国科学院自动化研究所 Acoustic model adaptive approach based on accent bottleneck characteristic
US20190088251A1 (en) * 2017-09-18 2019-03-21 Samsung Electronics Co., Ltd. Speech signal recognition system and method
US20200160836A1 (en) * 2018-11-21 2020-05-21 Google Llc Multi-dialect and multilingual speech recognition
CN109829058A (en) * 2019-01-17 2019-05-31 西北大学 A kind of classifying identification method improving accent recognition accuracy rate based on multi-task learning
CN112992119A (en) * 2021-01-14 2021-06-18 安徽大学 Deep neural network-based accent classification method and model thereof

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
HAN ZHU et al.: "Multi-Accent Adaptation based on Gate Mechanism", INTERSPEECH 2019, pages 744-748 *
HAN ZHU et al.: "Multi-Accent Adaptation based on Gate Mechanism", https://arxiv.org/abs/2011.02774 *
JIANGYAN YI et al.: "Improving BLSTM RNN based Mandarin speech recognition using accent dependent bottleneck features", 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA) *
XUESONG YANG et al.: "Joint Modeling of Accents and Acoustics for Multi-Accent Speech Recognition", 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) *
ZHOU GANG: "Research on speech recognition of the Lhasa dialect of Tibetan" (in Chinese), China Masters' Theses Full-text Database *
LI DEYI et al.: "Intelligent Summarization and Deep Learning" (in Chinese), Advanced Technologies in Artificial Intelligence and Robotics series, Beijing Institute of Technology Press, pages 103-104 *
YAOYAOZI YY: "Learning Kaldi decision-tree tying and triphones from scratch" (in Chinese), retrieved from the Internet: https://blog.csdn.net/qq_37591044/article/details/102395480 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111508501A (en) * 2020-07-02 2020-08-07 成都晓多科技有限公司 Voice recognition method and system with accent for telephone robot
CN111508501B (en) * 2020-07-02 2020-09-29 成都晓多科技有限公司 Voice recognition method and system with accent for telephone robot
CN113593524A (en) * 2021-01-26 2021-11-02 腾讯科技(深圳)有限公司 Method and device for training acoustic model for accent recognition, and storage medium
CN113593525A (en) * 2021-01-26 2021-11-02 腾讯科技(深圳)有限公司 Method, device and storage medium for training accent classification model and accent classification
CN112885351A (en) * 2021-04-30 2021-06-01 浙江非线数联科技股份有限公司 Dialect voice recognition method and device based on transfer learning
CN112885351B (en) * 2021-04-30 2021-07-23 浙江非线数联科技股份有限公司 Dialect voice recognition method and device based on transfer learning
CN113593534A (en) * 2021-05-28 2021-11-02 思必驰科技股份有限公司 Method and apparatus for multi-accent speech recognition
CN113593534B (en) * 2021-05-28 2023-07-14 思必驰科技股份有限公司 Method and device for multi-accent speech recognition
US11776323B2 (en) 2022-02-15 2023-10-03 Ford Global Technologies, Llc Biometric task network
CN114596845A (en) * 2022-04-13 2022-06-07 马上消费金融股份有限公司 Training method of voice recognition model, voice recognition method and device

Similar Documents

Publication Publication Date Title
CN110930982A (en) Multi-accent acoustic model and multi-accent voice recognition method
EP3966816B1 (en) Large-scale multilingual speech recognition with a streaming end-to-end model
US8126717B1 (en) System and method for predicting prosodic parameters
CN112889073A (en) Cross-language classification using multi-language neural machine translation
CN110459208B (en) Knowledge migration-based sequence-to-sequence speech recognition model training method
JP2004279701A (en) Method and device for sound model creation, and speech recognition device
Masumura et al. Large context end-to-end automatic speech recognition via extension of hierarchical recurrent encoder-decoder models
JP2023544336A (en) System and method for multilingual speech recognition framework
JP2022512233A (en) Neural adjustment code for multilingual style-dependent speech language processing
CN112700778A (en) Speech recognition method and speech recognition apparatus
CN115039170A (en) Proper noun recognition in end-to-end speech recognition
CN116303966A (en) Dialogue behavior recognition system based on prompt learning
US11990117B2 (en) Using speech recognition to improve cross-language speech synthesis
CN111553157A (en) Entity replacement-based dialog intention identification method
US20220310080A1 (en) Multi-Task Learning for End-To-End Automated Speech Recognition Confidence and Deletion Estimation
Hu et al. The USTC system for blizzard challenge 2017
KR20220128401A (en) Attention-based joint acoustics and text on-device end-to-end (E2E) models
US20230317059A1 (en) Alignment Prediction to Inject Text into Automatic Speech Recognition Training
Razavi et al. An HMM-based formalism for automatic subword unit derivation and pronunciation generation
Wu et al. Factored recurrent neural network language model in TED lecture transcription
Abraham et al. An automated technique to generate phone-to-articulatory label mapping
US11823697B2 (en) Improving speech recognition with speech synthesis-based model adapation
WO2022086640A1 (en) Fast emit low-latency streaming asr with sequence-level emission regularization
Farooq et al. Learning cross-lingual mappings for data augmentation to improve low-resource speech recognition
Wang et al. Speech-and-text transformer: Exploiting unpaired text for end-to-end speech recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination