CN112133304B - Low-delay speech recognition model based on feedforward neural network and training method - Google Patents
- Publication number: CN112133304B (application CN202010988191.2A)
- Authority: CN (China)
- Prior art keywords: word, neural network, feedforward neural, speech recognition, sequence
- Legal status: Active (as listed by Google Patents; an assumption, not a legal conclusion)
Classifications
- G10L15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
- G06F40/216 — Parsing using statistical methods
- G06F40/242 — Dictionaries
- G06F40/30 — Semantic analysis
- G06N3/084 — Backpropagation, e.g. using gradient descent
- G10L15/063 — Training of speech recognition systems
Abstract
The invention discloses a low-delay speech recognition model based on a feedforward neural network, together with a training method. The model comprises an encoder, a summarizer and a decoder. The training method comprises the following steps: S11, extract speech features; S12, the encoder converts the acoustic feature sequence into a high-level semantic representation; S13, the summarizer converts this representation, together with a preset position code, into a high-level semantic representation for each word position; S14, the decoder further extracts word-level semantic information from the representation at each word position. The recognition method additionally includes a prediction stage, which selects the highest-probability word at each position according to the decoder output.
Description
Technical Field
The invention relates to the technical field of intelligent information processing, and in particular to a low-delay speech recognition model based on a feedforward neural network and a corresponding training method.
Background
Speech is one of the most natural modes of human interaction. Speech recognition is an intelligent information processing technique that converts speech into the corresponding text. Converting speech into text facilitates further processing by computers, so speech recognition is widely used in systems such as intelligent dialogue, intelligent customer service and intelligent translation.
Although many speech recognition techniques exist, they all search for the most probable sequence from left to right in the prediction stage: predicting the next word requires feeding the previously predicted word back into the model. For example, when generating the sentence "today / weather / good", predicting "good" requires the earlier prediction "weather" as input. This leads to the following problems:
1. Because the sequence is generated from left to right by search, the model cannot use information provided by words on the right when predicting a word on the left;
2. Because the sequence is generated from left to right by search, the prediction stage is hard to parallelize, which limits prediction speed;
3. Because the whole sequence is generated from left to right by search, the neural network must run many feedforward passes, which further slows prediction.
Disclosure of Invention
In order to remedy these defects of the prior art and to predict each word without depending on the previous word, thereby improving prediction speed, the invention adopts the following technical scheme:
a low-delay speech recognition model based on a feedforward neural network comprises the following components: the encoder converts the acoustic feature sequence into high-level semantic representations, the summarizer converts the high-level semantic representations into high-level semantic representations corresponding to each word position through preset position coding and the high-level semantic representations, and the decoder further extracts semantic information at a word level from the high-level semantic representations corresponding to each word position.
The encoder, the summarizer and the decoder are each a stack of attention modules, every module consisting of an attention mechanism and a feedforward neural network joined by residual connections. The attention mechanism is an inner-product (scaled dot-product) attention mechanism:

Attention(Q, K, V) = softmax(QK^T / sqrt(D)) V

where Q ∈ R^(Tq×D) denotes the Query, K ∈ R^(TK×D) the Key, and V ∈ R^(TK×D) the Value; TK and Tq are the sequence lengths of the Key and the Query respectively, and D is the input dimension. The attention mechanism of the encoder and the decoder is a self-attention mechanism, i.e. Q, K and V are the same sequence; the summarizer is a mutual (cross) attention mechanism, where K and V are the same sequence and Q is a preset position code.
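As an illustrative sketch (not the patented implementation; the shapes and random values are invented for demonstration), the inner-product attention described above can be written in NumPy:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Inner-product (scaled dot-product) attention.

    Q: (Tq, D), K: (Tk, D), V: (Tk, D) -> output (Tq, D).
    """
    D = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(D)       # (Tq, Tk) similarity matrix
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted sum of Value vectors

# Self-attention, as in the encoder and decoder: Q, K, V are the same sequence.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
out = attention(X, X, X)
```

In the summarizer the same function would be called with Q set to the position code and K = V set to the encoder output.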
The attention mechanism is a multi-head attention mechanism:

MHA(Q, K, V) = Cat(head1, …, headn) W^O
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

where W_i^Q, W_i^K, W_i^V and W^O are parameters and Cat is the concatenation (splicing) operation.
The feedforward neural network is a position-wise feedforward neural network:

FFN(x) = W2 relu(W1 x + b1) + b2

where W1, W2, b1, b2 are parameters and relu is the rectified linear unit activation function. "Position-wise" means that, for a sequence of vectors, each vector is transformed by the same feedforward network, and the transformed vectors form a new sequence.
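A minimal NumPy sketch of the position-wise feedforward network; row-vector inputs are assumed, so x @ W1 corresponds to W1 x in the formula, and all dimensions are illustrative:

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = W2 relu(W1 x + b1) + b2, applied to every position independently.

    x: (T, D) sequence; the same weights transform each of the T vectors.
    """
    h = np.maximum(0.0, x @ W1 + b1)  # ReLU hidden layer, shape (T, D_ff)
    return h @ W2 + b2                # back to the model dimension, shape (T, D)

rng = np.random.default_rng(1)
D, D_ff, T = 8, 16, 4
W1, b1 = rng.normal(size=(D, D_ff)), np.zeros(D_ff)
W2, b2 = rng.normal(size=(D_ff, D)), np.zeros(D)
x = rng.normal(size=(T, D))
y = position_wise_ffn(x, W1, b1, W2, b2)
```

Because each position is transformed independently, transforming a single position in isolation gives the same result as transforming it inside the full sequence.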
The position code is a sine-cosine position code, each element of which is computed as:

PE(i, 2j) = sin(i / 10000^(2j/D))
PE(i, 2j+1) = cos(i / 10000^(2j/D))

where 2j and 2j+1 are the element indices of the i-th position vector in the position coding sequence.
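Assuming the standard sinusoidal formulation recovered above, the position code can be sketched in NumPy (the table size is illustrative):

```python
import numpy as np

def sincos_position_encoding(num_positions, D):
    """PE[i, 2j] = sin(i / 10000^(2j/D)), PE[i, 2j+1] = cos(i / 10000^(2j/D)).

    Assumes an even model dimension D; returns a (num_positions, D) table.
    """
    PE = np.zeros((num_positions, D))
    pos = np.arange(num_positions)[:, None]       # position index i
    j2 = np.arange(0, D, 2)                       # even element indices 2j
    angle = pos / np.power(10000.0, j2 / D)
    PE[:, 0::2] = np.sin(angle)                   # even slots: sine
    PE[:, 1::2] = np.cos(angle)                   # odd slots: cosine
    return PE

PE = sincos_position_encoding(10, 8)
```

These fixed vectors serve as the Query of the summarizer, one per word position.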
The model is trained with the maximum likelihood estimation criterion and the back-propagation algorithm. A dictionary is built for the text, and from the paired acoustic feature sequences and text the loss function is computed:

L = − Σ_{j=1..J} log P(y_j | x)

where y_j is the correct label at position j and J is the total sequence length. A word sequence length J is fixed in advance; if a sentence is shorter than J, it is padded at the end with <eos> symbols, so that the sentence length is predicted automatically in the speech recognition stage.
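A toy sketch of the <eos> padding and the summed negative log-likelihood loss; the small vocabulary and the uniform distribution are invented for illustration:

```python
import numpy as np

def pad_to_length(tokens, J, eos="<eos>"):
    """Pad a token sequence to the fixed length J with <eos> symbols."""
    assert len(tokens) <= J
    return tokens + [eos] * (J - len(tokens))

def nll_loss(log_probs, targets):
    """Negative log-likelihood summed over all J positions.

    log_probs: (J, |V|) per-position log-probabilities from the decoder.
    targets:   length-J list of target word indices, including <eos> padding.
    """
    return -sum(log_probs[j, t] for j, t in enumerate(targets))

vocab = {"<eos>": 0, "today": 1, "weather": 2, "good": 3}
sent = pad_to_length(["today", "weather", "good"], J=5)
targets = [vocab[w] for w in sent]

# With a uniform distribution over 4 words, each position contributes log 4.
log_probs = np.log(np.full((5, 4), 0.25))
loss = nll_loss(log_probs, targets)
```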
The dictionary is built from the text by counting the frequency of each word, sorting the words in decreasing order of frequency, and selecting the top N words as the dictionary; words not in the dictionary are denoted as <UNK>.
The training method for the low-delay speech recognition model based on the feedforward neural network comprises the following steps:
s11, extracting voice features, and acquiring acoustic feature sequences and corresponding texts from training corpora;
s12, the encoder converts the acoustic feature sequence into high-level semantic representation;
s13, converting the summarizer into a high-level semantic representation corresponding to each word position through preset position coding and high-level semantic representation;
s14, the decoder further extracts semantic information at the word level from the representation corresponding to each word position;
the low-delay speech recognition method based on the feedforward neural network comprises the following steps:
s21, voice feature extraction, extracting acoustic feature sequences from voice signals;
s22, the encoder converts the acoustic feature sequence into high-level semantic representation;
s23, converting the summarizer into a high-level semantic representation corresponding to each word position through position coding and high-level semantic representation;
s24, the decoder further extracts semantic information at the word level from the high level semantic representation corresponding to each word position;
s25, a prediction stage, selecting the word with the highest probability at each position according to the output of the decoder.
The prediction stage selects the highest-probability word at each word position:

ŷ_j = argmax_{y_j} P(y_j | x)

where y_j is a word in the dictionary, j is the corresponding position, P is its probability, argmax takes the maximizing argument, and ŷ_j is the predicted word at position j.
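A sketch of this parallel argmax prediction: every position is decided independently from the decoder's per-position distribution, with no left-to-right search. The tiny vocabulary and probability values are invented:

```python
import numpy as np

def predict(probs, inv_vocab):
    """Pick the highest-probability word at every position in parallel.

    probs: (J, |V|) per-position word distributions from the decoder.
    """
    idx = probs.argmax(axis=-1)          # argmax per position, no search
    return [inv_vocab[i] for i in idx]

inv_vocab = {0: "<eos>", 1: "today", 2: "weather", 3: "good"}
probs = np.array([
    [0.1, 0.7, 0.1, 0.1],   # position 0 -> "today"
    [0.1, 0.1, 0.7, 0.1],   # position 1 -> "weather"
    [0.1, 0.1, 0.1, 0.7],   # position 2 -> "good"
    [0.7, 0.1, 0.1, 0.1],   # position 3 -> "<eos>" marks the sentence end
])
words = predict(probs, inv_vocab)
```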
After training, speech can be recognized: the character, word or sub-word at each position is predicted directly from the input speech, with no search procedure, which greatly accelerates speech recognition and achieves low-delay, fast recognition.
The invention has the advantages and beneficial effects that:
the invention directly predicts the words, words or sub-words of each position according to the input voice, avoids sequence search, directly predicts the words of each position, is beneficial to computer parallelization calculation, does not need feedforward for multiple times, greatly accelerates the speed of voice recognition and realizes low-delay rapid recognition.
Drawings
FIG. 1 is a flow chart of speech recognition of the present invention.
FIG. 2 is a schematic diagram of the structure of the speech recognition model of the present invention.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present invention, are given by way of illustration and explanation only, not limitation.
The low-delay speech recognition model based on the feedforward neural network comprises: an encoder model, a feedforward neural network based on a self-attention mechanism, which converts the speech feature sequence into a high-level semantic representation; a position-dependent summarizer model, a feedforward neural network based on a mutual attention mechanism, which converts the high-level representation output by the encoder into a high-level semantic representation for each word position; and a decoder model, a feedforward neural network based on a self-attention mechanism, which further extracts word-level semantic information.
The encoder, the summarizer and the decoder are all formed by stacking a plurality of layers of attention modules, and each attention module is composed of an attention mechanism and a feedforward neural network which are connected through residual errors.
The attention mechanism is an inner-product (scaled dot-product) attention mechanism:

Attention(Q, K, V) = softmax(QK^T / sqrt(D)) V

where Q ∈ R^(Tq×D) denotes the Query, K ∈ R^(TK×D) the Key, and V ∈ R^(TK×D) the Value; TK and Tq are the sequence lengths of the Key and the Query respectively, and D is the input dimension.
The attention mechanism of the encoder and decoder is a self-attention mechanism, i.e. Q, V, K is the same sequence; the summarizer is a mutual attention mechanism, i.e. V, K is the same sequence and Q is a position code set in advance.
The attention mechanism may be a multi-head attention mechanism:

MHA(Q, K, V) = Cat(head1, …, headn) W^O
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

where W_i^Q, W_i^K, W_i^V and W^O are parameters and Cat is the concatenation (splicing) operation.
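Under the multi-head reconstruction above, a NumPy sketch of the mechanism; the output projection Wo and all dimensions are assumptions for illustration, not the patented parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Inner-product attention as defined earlier in the description.
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(Q, K, V, heads, Wo):
    """MHA: project Q, K, V per head, attend, concatenate, project out.

    heads: list of (Wq, Wk, Wv) projection triples, one per head.
    Wo:    output projection mixing the concatenated heads.
    """
    outs = [attention(Q @ Wq, K @ Wk, V @ Wv) for Wq, Wk, Wv in heads]
    return np.concatenate(outs, axis=-1) @ Wo  # Cat(head1, ..., headn) Wo

rng = np.random.default_rng(2)
D, Dh, n = 8, 4, 2                       # model dim, per-head dim, head count
heads = [tuple(rng.normal(size=(D, Dh)) for _ in range(3)) for _ in range(n)]
Wo = rng.normal(size=(n * Dh, D))
X = rng.normal(size=(5, D))
out = multi_head_attention(X, X, X, heads, Wo)
```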
The attention mechanism is followed by a position-wise feedforward neural network:

FFN(x) = W2 relu(W1 x + b1) + b2

where W1, W2, b1, b2 are parameters and relu is the rectified linear unit activation function. "Position-wise" means that each vector in the sequence is transformed by the same feedforward network, and the transformed vectors form a new sequence.
Each attention module further comprises a layer normalization operation.
The position code is a sine-cosine position code, each element of which is computed as:

PE(i, 2j) = sin(i / 10000^(2j/D))
PE(i, 2j+1) = cos(i / 10000^(2j/D))

where 2j and 2j+1 are the element indices of the i-th position vector in the position coding sequence.
In the training stage, the acoustic feature sequences and the corresponding text are obtained from the training corpus; the acoustic features may be, for example, MFCC (Mel-frequency cepstral coefficient) or FBANK (log Mel filter-bank) features.
Establishing a dictionary according to texts in a training corpus, specifically comprising:
1. counting the occurrence frequency of words in the text;
2. sorting in decreasing order of word frequency and selecting the top N words as the dictionary; words not in the dictionary are denoted as <UNK>.
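The two dictionary-building steps above can be sketched as follows (the toy corpus and the value of N are invented for illustration):

```python
from collections import Counter

def build_dictionary(texts, N):
    """Keep the N most frequent words; everything else maps to <UNK>."""
    counts = Counter(w for line in texts for w in line.split())
    words = [w for w, _ in counts.most_common(N)]  # sorted by decreasing frequency
    vocab = {w: i for i, w in enumerate(words)}
    vocab["<UNK>"] = len(vocab)                    # index for out-of-dictionary words
    return vocab

texts = ["today weather good", "today weather bad", "today good"]
vocab = build_dictionary(texts, N=2)  # "today" is the most frequent word
```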
The first-layer attention module of the encoder takes the acoustic feature sequence as input; each subsequent attention module takes the output of the previous layer.
The position-dependent summarizer model uses a mutual attention mechanism to generate the representation for each word position from the preset position code and the encoder output: in the first-layer attention module, K and V are the encoder output and Q is the position code; in each subsequent attention module, Q is the output of the previous attention module while K and V remain the encoder output.
The position-dependent summarizer extracts the high-level representation for each word position from the high-level representation sequence output by the encoder and the position code of each word position. The first-layer attention module of the summarizer takes the position code and the encoder's high-level representation as input and produces a code for each word position; the second- to n-th-layer attention modules take the codes produced by the previous layer together with the encoder's high-level representation and refine the codes for the word positions.
The input of the decoder's first-layer attention module is the output of the position-dependent summarizer, i.e. the representation for each word position; each subsequent attention module takes the output of the previous layer, extracting a high-level semantic representation for each word.
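Putting the three components together, a shape-level sketch of the dataflow: random matrices stand in for the trained encoder stack, position code and decoder, so this illustrates only the mutual-attention summarizer wiring, not the trained model:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(3)
T_audio, J, D = 20, 5, 8        # acoustic frames, word positions, model dim

# Encoder output: one high-level vector per acoustic frame (stand-in values).
H = rng.normal(size=(T_audio, D))

# Summarizer: mutual attention where Q is the preset position code (one query
# per word position) and K = V = the encoder output H. The 20-frame sequence
# is thereby compressed into J word-position summaries.
PE = rng.normal(size=(J, D))    # stands in for the sine-cosine position code
S = attention(PE, H, H)         # (J, D): one summary vector per word slot

# Decoder: self-attention over the J word-position summaries.
Y = attention(S, S, S)
```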
The model is trained with the maximum likelihood estimation criterion and the back-propagation algorithm, computing the loss function from the paired speech and text:

L = − Σ_{j=1..J} log P(y_j | x)

where y_j is the correct label at position j and J is the total sequence length. A word sequence length J is fixed in advance; if a sentence is shorter than J, it is padded at the end with <eos> symbols, so that the sentence length is predicted automatically in the speech recognition stage. Back-propagation computes the gradient of the loss function with respect to each parameter.
In the prediction stage, after training, speech can be recognized: the character, word or sub-word at each position is predicted directly from the input speech without any search procedure, which greatly accelerates speech recognition and achieves low-delay, fast recognition. Prediction proceeds as follows:
1. Extract the speech signal into an acoustic feature sequence; the extracted features may be, for example, Mel-frequency cepstral coefficients;
2. The low-delay speech recognition model based on the feedforward neural network performs speech recognition as follows:
(1) the encoder extracts a high-level semantic representation;
(2) the position-dependent summarizer converts the high-level representation extracted by the encoder into a high-level semantic representation for each word position;
(3) the decoder further extracts the high-level semantic representation of each word and makes the prediction;
(4) Prediction stage, i.e. the speech recognition stage: select the highest-probability word at each position according to the decoder output. Linear and nonlinear transformations map the extracted high-level word representations to a word probability distribution at each position, and the highest-probability word is selected at each word position:

ŷ_j = argmax_{y_j} P(y_j | x)

where y_j is a word in the dictionary, j is the corresponding position, P is its probability, argmax takes the maximizing argument, and ŷ_j is the predicted word at position j.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.
Claims (9)
1. A low-delay speech recognition model based on a feedforward neural network, comprising an encoder and a decoder, characterized by further comprising a summarizer, wherein the encoder converts the acoustic feature sequence into high-level semantic representations, the summarizer converts the high-level semantic representations, together with a preset position code, into a high-level semantic representation for each word position, and the decoder further extracts word-level semantic information from the high-level semantic representation at each word position;
the encoder, the summarizer and the decoder are all formed by stacking a plurality of layers of attention modules, each attention module consists of an attention mechanism and a feedforward neural network which are connected by residual errors, and the attention mechanism is an inner product attention mechanism:
wherein the content of the first and second substances,the representation of a Query is that of a Query,the expression of the Key is shown,represents Value, TKAnd TqThe sequence lengths of Key and Query are respectively, D is the dimension of input, the attention mechanism of an encoder and a decoder is a self-attention mechanism, and Q, V, K is the same sequence; the summarizer is a mutual attention mechanism, V, K is the same sequence, and Q is a preset position code.
2. A feedforward neural network-based low-latency speech recognition model according to claim 1, wherein the attention mechanism is a multi-head attention mechanism:
MHA(Q, K, V) = Cat(head1, …, headn) W^O
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

where W_i^Q, W_i^K, W_i^V and W^O are parameters and Cat is the concatenation operation.
3. A feedforward neural network-based low-latency speech recognition model according to claim 1, wherein the feedforward neural network is a position-by-position feedforward neural network:
FFN(x)=W2 relu(W1x+b1)+b2
wherein W1, W2, b1, b2 are parameters and relu is the rectified linear unit activation function; "position-by-position" means that each vector in the sequence is transformed by the same feedforward network, and the transformed vectors form a new sequence.
4. A feedforward neural network-based low-latency speech recognition model according to claim 1, wherein the position code is a sine-cosine position code, each position element of which is calculated as follows:

PE(i, 2j) = sin(i / 10000^(2j/D))
PE(i, 2j+1) = cos(i / 10000^(2j/D))

where 2j and 2j+1 are the element indices of the i-th position vector in the position coding sequence.
5. A feedforward neural network-based low-latency speech recognition model according to any one of claims 1-4, wherein the model is trained using the maximum likelihood estimation criterion and a back-propagation algorithm on pairs of acoustic feature sequences and text, and a dictionary is built for the text.
6. The feedforward neural network-based low-latency speech recognition model of claim 5, wherein the dictionary is built by counting the frequency of each word in the text, sorting the words in decreasing order of word frequency, and selecting the top N words as the dictionary; words not in the dictionary are denoted as <UNK>.
7. A method for training a low-latency speech recognition model based on a feedforward neural network as claimed in claim 1, comprising the steps of:
s11, extracting voice features, and acquiring an acoustic feature sequence and a corresponding text from the training corpus;
s12, the encoder converts the acoustic feature sequence into high-level semantic representation;
s13, converting the summarizer into high-level semantic representation corresponding to each word position through preset position coding and high-level semantic representation;
the decoder further extracts semantic information at the word level from the representation corresponding to each word position S14.
8. A speech recognition method based on the feedforward neural network low-latency speech recognition model of claim 1, comprising the steps of:
s21, voice feature extraction, extracting acoustic feature sequences from voice signals;
s22, the encoder converts the acoustic feature sequence into high-level semantic representation;
s23, converting the summarizer into a high-level semantic representation corresponding to each word position through position coding and high-level semantic representation;
s24, the decoder further extracts semantic information at the word level from the high level semantic representation corresponding to each word position;
s25, a prediction stage, selecting the word with the highest probability at each position according to the output of the decoder.
Priority and publication information
- Application CN202010988191.2A, filed 2020-09-18 (also the priority date).
- Published as CN112133304A on 2020-12-25; granted as CN112133304B on 2022-05-06 (status: Active).