CN112133304B - Low-delay speech recognition model based on feedforward neural network and training method - Google Patents


Info

Publication number
CN112133304B
Authority
CN
China
Prior art keywords
word
neural network
feedforward neural
speech recognition
sequence
Prior art date
Legal status
Active
Application number
CN202010988191.2A
Other languages
Chinese (zh)
Other versions
CN112133304A (en)
Inventor
白烨
温正棋
Current Assignee
Zhongke Extreme Element Hangzhou Intelligent Technology Co., Ltd.
Original Assignee
Zhongke Extreme Element Hangzhou Intelligent Technology Co., Ltd.
Priority date: 2020-09-18
Filing date: 2020-09-18
Publication date: 2022-05-06
Application filed by Zhongke Extreme Element Hangzhou Intelligent Technology Co., Ltd.
Priority to CN202010988191.2A
Publication of CN112133304A (application)
Application granted
Publication of CN112133304B (grant)
Legal status: Active

Classifications

    • G10L 15/22 — Speech recognition: procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/063 — Speech recognition: training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G06F 40/216 — Natural language analysis: parsing using statistical methods
    • G06F 40/242 — Natural language analysis: lexical tools, dictionaries
    • G06F 40/30 — Natural language analysis: semantic analysis
    • G06N 3/084 — Neural networks: learning by backpropagation, e.g. using gradient descent

Abstract

The invention discloses a low-delay speech recognition model based on a feedforward neural network, and a training method for it. The model comprises an encoder, a summarizer and a decoder. The training method comprises the following steps: S11, extracting voice features; S12, the encoder converts the acoustic feature sequence into a high-level semantic representation; S13, the summarizer, using preset position codes, converts the high-level semantic representation into a representation corresponding to each word position; S14, the decoder further extracts word-level semantic information from the representation corresponding to each word position. The recognition method additionally comprises a prediction stage, which selects the most probable word at each position according to the decoder output.

Description

Low-delay speech recognition model based on feedforward neural network and training method
Technical Field
The invention relates to the technical field of intelligent information processing, in particular to a low-delay speech recognition model based on a feedforward neural network and a training method.
Background
Speech is one of the most natural ways for humans to interact. Speech recognition is an intelligent information processing technique that converts speech into the corresponding text. Converting speech into text enables further processing by computers, so speech recognition technology is widely used in intelligent dialogue, intelligent customer service, intelligent translation and similar systems.
Although many speech recognition techniques exist, they all search for the most probable sequence from left to right in the prediction stage: predicting the next word requires feeding the previously predicted word back into the model. For example, when generating the sentence "today _ weather _ good", predicting "good" requires the earlier prediction "weather" as input. This leads to the following problems:
1. because the entire sequence is generated from left to right by search, the model cannot use information provided by words on the right when predicting a word on the left;
2. because the entire sequence is generated from left to right by search, the prediction stage is difficult to parallelize, which limits prediction speed;
3. because the entire sequence is generated from left to right by search, the neural network must perform many feedforward passes, which further slows prediction.
Disclosure of Invention
To overcome these defects of the prior art and to predict each word without depending on the previous word, thereby improving prediction speed, the invention adopts the following technical scheme:
A low-delay speech recognition model based on a feedforward neural network comprises an encoder, a summarizer and a decoder. The encoder converts the acoustic feature sequence into a high-level semantic representation; the summarizer, using preset position codes, converts this representation into a high-level semantic representation corresponding to each word position; and the decoder further extracts word-level semantic information from the representation at each word position.
The encoder, the summarizer and the decoder are each a stack of attention modules; each attention module consists of an attention mechanism and a feedforward neural network joined by residual connections. The attention mechanism is an inner-product attention mechanism:

Att(Q, K, V) = softmax(QK^T / sqrt(D)) V

where Q ∈ ℝ^{T_q × D} is the Query, K ∈ ℝ^{T_K × D} is the Key, V ∈ ℝ^{T_K × D} is the Value; T_K and T_q are the sequence lengths of Key and Query respectively, and D is the input dimension. The attention mechanism in the encoder and decoder is self-attention, where Q, K and V are the same sequence; the summarizer uses a mutual (cross) attention mechanism, where K and V are the same sequence and Q is a predetermined position code.
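For concreteness, a minimal NumPy sketch of this inner-product attention (an illustration only, not the patent's implementation; array shapes follow the T_q, T_K, D notation above):

```python
import numpy as np

def inner_product_attention(Q, K, V):
    """Scaled inner-product attention: softmax(Q K^T / sqrt(D)) V.

    Q: (Tq, D) Query sequence; K, V: (TK, D) Key/Value sequences.
    Returns: (Tq, D) attention-weighted combination of the Values.
    """
    D = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(D)                   # (Tq, TK) similarities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # (Tq, D)
```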
The attention mechanism is a multi-head attention mechanism:

MHA(Q, K, V) = Cat(head_1, …, head_n)
head_i = Att(Q W_i^Q, K W_i^K, V W_i^V)

where W_i^Q, W_i^K and W_i^V are parameter matrices and Cat is the concatenation (splicing) operation.
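A corresponding sketch of the multi-head variant, reusing inner_product_attention from the sketch above (the per-head projection shapes are assumptions):

```python
import numpy as np

def multi_head_attention(Q, K, V, Wq, Wk, Wv):
    """Multi-head attention: project Q, K, V per head with the parameter
    matrices W_i^Q, W_i^K, W_i^V, attend in each subspace, concatenate.

    Wq, Wk, Wv: lists of n projection matrices, each (D, D // n)
    (shapes are illustrative).
    """
    heads = [
        inner_product_attention(Q @ Wq[i], K @ Wk[i], V @ Wv[i])
        for i in range(len(Wq))
    ]
    return np.concatenate(heads, axis=-1)  # Cat(head_1, ..., head_n)
```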
The feedforward neural network is a position-by-position feedforward neural network:

FFN(x) = W_2 relu(W_1 x + b_1) + b_2

where W_1, W_2, b_1 and b_2 are parameters and relu is the rectified linear unit activation function. Applied position by position to a sequence of vectors, the same feedforward network transforms each vector, and the transformed vectors form a new sequence.
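A position-by-position FFN sketch under the same conventions (row-vector layout, so x @ W stands for W x in the formula):

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """Position-by-position feedforward network:
    FFN(x) = W_2 relu(W_1 x + b_1) + b_2.

    x: (T, D) sequence. Because the same parameters are applied at every
    position, the whole sequence is transformed with two matrix multiplies.
    """
    hidden = np.maximum(0.0, x @ W1 + b1)  # relu, (T, D_ff)
    return hidden @ W2 + b2                # (T, D) new sequence
```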
The position code is a sine-cosine position code; each element is calculated as follows:

PE(i, 2j) = sin(i / 10000^(2j/D))
PE(i, 2j+1) = cos(i / 10000^(2j/D))

where 2j and 2j+1 are the element indexes of the i-th position vector in the position coding sequence.
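A sketch that tabulates these sine-cosine codes (assuming an even dimension D):

```python
import numpy as np

def sinusoidal_position_codes(n_positions, D):
    """Sine-cosine position codes: PE(i, 2j) = sin(i / 10000^(2j/D)),
    PE(i, 2j+1) = cos(i / 10000^(2j/D)). Assumes D is even."""
    PE = np.zeros((n_positions, D))
    i = np.arange(n_positions)[:, None]      # position index i
    two_j = np.arange(0, D, 2)[None, :]      # even element indexes 2j
    angle = i / np.power(10000.0, two_j / D)
    PE[:, 0::2] = np.sin(angle)              # even elements
    PE[:, 1::2] = np.cos(angle)              # odd elements
    return PE
```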
The model is trained with the maximum likelihood estimation criterion and the back-propagation algorithm. From the paired acoustic feature sequences and text, a dictionary is built for the text and the loss function is calculated:

L = −Σ_{j=1}^{J} log P(y_j | X)

where y_j is the correct label at position j and J is the total sequence length, set in advance. If a sentence is shorter than J, it is padded at the end with <eos> symbols, so that sentence length is predicted automatically in the speech recognition stage.
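A sketch of this loss for one utterance, assuming the model outputs per-position log-probabilities and the target is already padded with <eos>:

```python
import numpy as np

def sequence_nll_loss(log_probs, targets):
    """Maximum-likelihood loss L = -sum_{j=1..J} log P(y_j | X).

    log_probs: (J, V) per-position log-probabilities output by the model;
    targets: (J,) integer word ids, already padded to length J with <eos>.
    """
    J = targets.shape[0]
    return -np.sum(log_probs[np.arange(J), targets])
```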
The dictionary is built from the text by counting the occurrence frequency of each word, sorting the words in descending order of word frequency, and selecting the first N words as the dictionary; words not in the dictionary are denoted as < UNK >.
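A possible dictionary-building routine matching this description (the reserved <eos>/<UNK> entries and their ids are assumptions, not specified by the patent):

```python
from collections import Counter

def build_dictionary(sentences, N):
    """Top-N frequency dictionary. <eos> (padding) and <UNK>
    (out-of-dictionary words) are reserved entries."""
    counts = Counter(w for s in sentences for w in s.split())
    vocab = ["<eos>", "<UNK>"] + [w for w, _ in counts.most_common(N)]
    return {word: idx for idx, word in enumerate(vocab)}

# Example: build_dictionary(["today weather good", "today weather bad"], N=3)
# gives "today" and "weather" low ids; words outside the dictionary are
# replaced by "<UNK>" before lookup.
```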
A training method for the low-delay speech recognition model based on the feedforward neural network comprises the following steps:
s11, extracting voice features, and acquiring acoustic feature sequences and corresponding texts from training corpora;
s12, the encoder converts the acoustic feature sequence into high-level semantic representation;
s13, converting the summarizer into a high-level semantic representation corresponding to each word position through preset position coding and high-level semantic representation;
s14, the decoder further extracts semantic information at the word level from the representation corresponding to each word position;
the low-delay speech recognition method based on the feedforward neural network comprises the following steps:
s21, voice feature extraction, extracting acoustic feature sequences from voice signals;
s22, the encoder converts the acoustic feature sequence into high-level semantic representation;
s23, converting the summarizer into a high-level semantic representation corresponding to each word position through position coding and high-level semantic representation;
s24, the decoder further extracts semantic information at the word level from the high level semantic representation corresponding to each word position;
s25, a prediction stage, selecting the word with the highest probability at each position according to the output of the decoder.
The prediction stage selects the word with the maximum probability for each word position:

ŷ_j = argmax_{y_j} P(y_j | X)

where y_j ranges over the words in the dictionary, j is the corresponding position, P is its probability, argmax selects the maximizing word, and ŷ_j is the predicted word at position j.
After training, speech can be recognized: the character, word or sub-word at each position is predicted directly from the input speech, with no search procedure, which greatly accelerates speech recognition and achieves low-delay, fast recognition.
The invention has the advantages and beneficial effects that:
the invention directly predicts the words, words or sub-words of each position according to the input voice, avoids sequence search, directly predicts the words of each position, is beneficial to computer parallelization calculation, does not need feedforward for multiple times, greatly accelerates the speed of voice recognition and realizes low-delay rapid recognition.
Drawings
FIG. 1 is a flow chart of speech recognition of the present invention.
FIG. 2 is a schematic diagram of the structure of the speech recognition model of the present invention.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present invention, are given by way of illustration and explanation only, not limitation.
The low-delay speech recognition model based on the feedforward neural network comprises: an encoder model, a self-attention feedforward neural network that converts the speech feature sequence into a high-level semantic representation; a position-dependent summarizer model, a cross-attention feedforward neural network that converts the high-level representation output by the encoder model into a high-level semantic representation corresponding to each word position; and a decoder model, a self-attention feedforward neural network that further extracts word-level semantic information.
The encoder, the summarizer and the decoder are all formed by stacking a plurality of layers of attention modules, and each attention module is composed of an attention mechanism and a feedforward neural network which are connected through residual errors.
The attention mechanism is an inner-product attention mechanism:

Att(Q, K, V) = softmax(QK^T / sqrt(D)) V

where Q ∈ ℝ^{T_q × D} is the Query, K ∈ ℝ^{T_K × D} is the Key, V ∈ ℝ^{T_K × D} is the Value; T_K and T_q are the sequence lengths of Key and Query respectively, and D is the input dimension.
The attention mechanism of the encoder and decoder is self-attention, i.e. Q, K and V are the same sequence; the summarizer uses a mutual (cross) attention mechanism, i.e. K and V are the same sequence and Q is a position code set in advance.
The attention mechanism may be a multi-head attention mechanism:

MHA(Q, K, V) = Cat(head_1, …, head_n)
head_i = Att(Q W_i^Q, K W_i^K, V W_i^V)

where W_i^Q, W_i^K and W_i^V are parameter matrices and Cat is the concatenation (splicing) operation.
The attention mechanism is followed by a position-by-position feedforward neural network:

FFN(x) = W_2 relu(W_1 x + b_1) + b_2

where W_1, W_2, b_1 and b_2 are parameters and relu is the rectified linear unit activation function. Applied position by position to a sequence of vectors, the same feedforward network transforms each vector, and the transformed vectors form a new sequence.
Each attention module further comprises a layer normalization operation.
The position code is a sine-cosine position code; each element is calculated as follows:

PE(i, 2j) = sin(i / 10000^(2j/D))
PE(i, 2j+1) = cos(i / 10000^(2j/D))

where 2j and 2j+1 are the element indexes of the i-th position vector in the position coding sequence.
In the training stage, acoustic feature sequences and the corresponding texts are obtained from the training corpus; the acoustic features may be, for example, MFCC (Mel-frequency cepstral coefficient) or FBANK (log Mel filter-bank) features.
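For illustration, one way such features could be extracted (the patent prescribes no tool; librosa, the 16 kHz sample rate and 13 coefficients are assumptions here):

```python
import librosa

def extract_mfcc(wav_path, n_mfcc=13):
    """Compute an MFCC feature sequence from a waveform file."""
    audio, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
    return mfcc.T  # one feature vector per frame
```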
Establishing a dictionary according to texts in a training corpus, specifically comprising:
1. counting the occurrence frequency of words in the text;
2. sorting according to the sequence of word frequency from big to small, and selecting the first N words as a dictionary; words not in the dictionary are denoted as < UNK >.
The first-layer attention module of the encoder takes the acoustic feature sequence as input, and each subsequent attention module takes as input the output of the previous layer.
The position-dependent summarizer model uses the mutual attention mechanism to generate the representation corresponding to each word position from the preset position codes and the encoder output: in the first-layer attention module, Q is the position code while K and V are the encoder output; in every subsequent layer, Q is the output of the previous attention module while K and V remain the encoder output.
In other words, the position-dependent summarizer extracts the high-level representation corresponding to each word position from the high-level representation sequence produced by the encoder and the position code of that word position: the first attention module of the summarizer combines the position codes with the encoder representation to obtain codes corresponding to the word positions, and the attention modules from the second to the n-th layer combine the codes produced by the previous layer with the encoder representation to refine them.
The first-layer attention module of the decoder takes as input the output of the position-dependent summarizer, i.e. the representation corresponding to each word position; each subsequent layer takes the output of the previous layer, extracting the high-level semantic representation of each word.
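A sketch of this summarizer data flow, reusing inner_product_attention from the sketch above (residual connections, layer normalization and the FFN sub-layer are omitted for brevity):

```python
def summarizer_forward(encoder_out, position_codes, n_layers):
    """Summarizer stack as described above: in layer 1 the queries are the
    position codes; in every later layer the queries are the previous layer's
    output; K and V are always the encoder output.

    encoder_out: (T, D) encoder representation; position_codes: (J, D).
    Returns: (J, D), one high-level representation per word position.
    """
    Q = position_codes
    for _ in range(n_layers):
        Q = inner_product_attention(Q, encoder_out, encoder_out)
    return Q
```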
The model is trained with the maximum likelihood estimation criterion and the back-propagation algorithm; the loss function is calculated from the paired speech and text:

L = −Σ_{j=1}^{J} log P(y_j | X)

where y_j is the correct label at position j and J is the total sequence length, set in advance. If a sentence is shorter than J, it is padded at the end with <eos> symbols, so that sentence length is predicted automatically in the speech recognition stage. Back-propagation computes the gradient of the loss function with respect to each parameter.
In the prediction stage, after training, speech can be recognized: the character, word or sub-word at each position is predicted directly from the input speech, with no search procedure, which greatly accelerates speech recognition and achieves low-delay, fast recognition. The stage comprises the following steps:
1. extracting an acoustic feature sequence from the voice signal, where the extracted features can be, for example, Mel-frequency cepstral coefficients;
2. the low-delay speech recognition model based on the feedforward neural network carries out speech recognition, and the method comprises the following steps:
(1) extracting a high-level semantic representation by an encoder;
(2) converting the high-level speech representation extracted by the encoder into a high-level semantic representation corresponding to each word position through a position-dependent summarizer;
(3) further extracting high-level semantic representation of each word and making prediction through a decoder;
(4) the prediction stage, i.e. the speech recognition stage: according to the decoder output, linear and nonlinear transformations map the extracted high-level word representations into a word probability distribution at each position, and the word with the maximum probability is selected at each word position:

ŷ_j = argmax_{y_j} P(y_j | X)

where y_j ranges over the words in the dictionary, j is the corresponding position, P is its probability, argmax selects the maximizing word, and ŷ_j is the predicted word at position j.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. A low-delay speech recognition model based on a feedforward neural network, comprising an encoder and a decoder, characterized by further comprising a summarizer, wherein the encoder converts the acoustic feature sequence into high-level semantic representations; the summarizer, using preset position codes, converts the high-level semantic representations into a high-level semantic representation corresponding to each word position; and the decoder further extracts semantic information at the word level from the high-level semantic representation corresponding to each word position;
the encoder, the summarizer and the decoder are all formed by stacking a plurality of layers of attention modules, each attention module consists of an attention mechanism and a feedforward neural network which are connected by residual errors, and the attention mechanism is an inner product attention mechanism:
Figure FDA0003530422030000011
wherein the content of the first and second substances,
Figure FDA0003530422030000012
the representation of a Query is that of a Query,
Figure FDA0003530422030000013
the expression of the Key is shown,
Figure FDA0003530422030000014
represents Value, TKAnd TqThe sequence lengths of Key and Query are respectively, D is the dimension of input, the attention mechanism of an encoder and a decoder is a self-attention mechanism, and Q, V, K is the same sequence; the summarizer is a mutual attention mechanism, V, K is the same sequence, and Q is a preset position code.
2. A feedforward neural network-based low-latency speech recognition model according to claim 1, wherein the attention mechanism is a multi-head attention mechanism:
MHA(Q, K, V) = Cat(head_1, …, head_n)
head_i = Att(Q W_i^Q, K W_i^K, V W_i^V)

wherein W_i^Q, W_i^K and W_i^V are parameter matrices and Cat is the concatenation (splicing) operation.
3. A feedforward neural network-based low-latency speech recognition model according to claim 1, wherein the feedforward neural network is a position-by-position feedforward neural network:
FFN(x) = W_2 relu(W_1 x + b_1) + b_2

wherein W_1, W_2, b_1 and b_2 are parameters and relu is the rectified linear unit activation function; applied position by position to a sequence of vectors, the same feedforward network transforms each vector, and the transformed vectors form a new sequence.
4. A feedforward neural network-based low-latency speech recognition model according to claim 1, wherein the position code is a sine-cosine position code, each position element of which is calculated as follows:
PE(i, 2j) = sin(i / 10000^(2j/D))
PE(i, 2j+1) = cos(i / 10000^(2j/D))

wherein 2j and 2j+1 are the element indexes of the i-th position vector in the position coding sequence.
5. A feedforward neural network-based low-latency speech recognition model according to any one of claims 1-4, wherein the model is trained using the maximum likelihood estimation criterion and a back-propagation algorithm on paired acoustic feature sequences and text, and a dictionary is built for the text;
the dictionary is built from the texts in the training corpus by counting the occurrence frequency of words in the texts, sorting the words in descending order of word frequency, and selecting the first N words as the dictionary; words not in the dictionary are denoted as < UNK >.
6. The feedforward neural network-based low-latency speech recognition model of claim 5, wherein the dictionary is built by counting the occurrence frequency of words in the text, sorting the words in descending order of word frequency, and selecting the first N words as the dictionary; words not in the dictionary are denoted as < UNK >.
7. A method for training a low-latency speech recognition model based on a feedforward neural network as claimed in claim 1, comprising the steps of:
s11, extracting voice features, and acquiring an acoustic feature sequence and a corresponding text from the training corpus;
s12, the encoder converts the acoustic feature sequence into high-level semantic representation;
s13, converting the summarizer into high-level semantic representation corresponding to each word position through preset position coding and high-level semantic representation;
the decoder further extracts semantic information at the word level from the representation corresponding to each word position S14.
8. A speech recognition method using the feedforward neural network-based low-latency speech recognition model of claim 1, characterized by comprising the steps of:
s21, voice feature extraction, extracting acoustic feature sequences from voice signals;
s22, the encoder converts the acoustic feature sequence into high-level semantic representation;
s23, converting the summarizer into a high-level semantic representation corresponding to each word position through position coding and high-level semantic representation;
s24, the decoder further extracts semantic information at the word level from the high level semantic representation corresponding to each word position;
s25, a prediction stage, selecting the word with the highest probability at each position according to the output of the decoder.
9. The method of claim 8, wherein the prediction stage selects the most probable word for each word position:
ŷ_j = argmax_{y_j} P(y_j | X)

wherein y_j ranges over the words in the dictionary, j is the corresponding position, P is its probability, argmax selects the maximizing word, and ŷ_j is the predicted word at position j.
CN202010988191.2A 2020-09-18 2020-09-18 Low-delay speech recognition model based on feedforward neural network and training method Active CN112133304B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010988191.2A CN112133304B (en) 2020-09-18 2020-09-18 Low-delay speech recognition model based on feedforward neural network and training method


Publications (2)

Publication Number Publication Date
CN112133304A CN112133304A (en) 2020-12-25
CN112133304B true CN112133304B (en) 2022-05-06

Family

ID=73841382

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010988191.2A Active CN112133304B (en) 2020-09-18 2020-09-18 Low-delay speech recognition model based on feedforward neural network and training method

Country Status (1)

Country Link
CN (1) CN112133304B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112687288B (en) * 2021-03-12 2021-12-03 北京世纪好未来教育科技有限公司 Echo cancellation method, echo cancellation device, electronic equipment and readable storage medium

Citations (4)

Publication number Priority date Publication date Assignee Title
CN109543180A (en) * 2018-11-08 2019-03-29 中山大学 A kind of text emotion analysis method based on attention mechanism
CN110288980A (en) * 2019-06-17 2019-09-27 平安科技(深圳)有限公司 Audio recognition method, the training method of model, device, equipment and storage medium
CN110689876A (en) * 2019-10-14 2020-01-14 腾讯科技(深圳)有限公司 Voice recognition method and device, electronic equipment and storage medium
CN111125326A (en) * 2019-12-06 2020-05-08 贝壳技术有限公司 Method, device, medium and electronic equipment for realizing man-machine conversation

Family Cites Families (9)

Publication number Priority date Publication date Assignee Title
US10839790B2 (en) * 2017-02-06 2020-11-17 Facebook, Inc. Sequence-to-sequence convolutional architecture
KR102449842B1 (en) * 2017-11-30 2022-09-30 삼성전자주식회사 Method for training language model and apparatus therefor
KR102462426B1 (en) * 2017-12-14 2022-11-03 삼성전자주식회사 Electronic device and method for analyzing meaning of speech
US11210475B2 (en) * 2018-07-23 2021-12-28 Google Llc Enhanced attention mechanisms
CN109145290B (en) * 2018-07-25 2020-07-07 东北大学 Semantic similarity calculation method based on word vector and self-attention mechanism
CN109949796B (en) * 2019-02-28 2021-04-06 天津大学 End-to-end architecture Lasa dialect voice recognition method based on Tibetan component
JP7041281B2 (en) * 2019-07-04 2022-03-23 浙江大学 Address information feature extraction method based on deep neural network model
CN110827816A (en) * 2019-11-08 2020-02-21 杭州依图医疗技术有限公司 Voice instruction recognition method and device, electronic equipment and storage medium
CN111048082B (en) * 2019-12-12 2022-09-06 中国电子科技集团公司第二十八研究所 Improved end-to-end speech recognition method


Also Published As

Publication number Publication date
CN112133304A (en) 2020-12-25


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant