CN112133304B - Low-delay speech recognition model based on feedforward neural network and training method - Google Patents
- Publication number: CN112133304B (application CN202010988191.2A)
- Authority: CN (China)
- Prior art keywords: word, neural network, feedforward neural, speech recognition, sequence
- Legal status: Active (as listed by Google Patents; an assumption, not a legal conclusion)
Classifications
- G10L15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
- G06F40/216 — Parsing using statistical methods
- G06F40/242 — Dictionaries
- G06F40/30 — Semantic analysis
- G06N3/084 — Backpropagation, e.g. using gradient descent
- G10L15/063 — Training of speech recognition systems
Abstract
The invention discloses a low-delay speech recognition model based on a feedforward neural network, together with a training method. The model comprises an encoder, a summarizer and a decoder. The training method comprises the following steps: S11, extract speech features; S12, the encoder converts the acoustic feature sequence into a high-level semantic representation; S13, the summarizer converts this representation, together with a preset position code, into a high-level semantic representation for each word position; S14, the decoder further extracts word-level semantic information from the representation at each word position. The recognition method additionally includes a prediction stage, which selects the highest-probability word at each position according to the decoder output.
Description
Technical Field
The invention relates to the technical field of intelligent information processing, and in particular to a low-delay speech recognition model based on a feedforward neural network and a corresponding training method.
Background
Speech is one of the most natural modes of human interaction. Speech recognition is an intelligent information processing technique that converts speech into the corresponding text. Converting speech into text facilitates further processing by computers, so speech recognition is widely used in systems such as intelligent dialogue, intelligent customer service and intelligent translation.
Although many speech recognition techniques exist, they all search for the most probable sequence from left to right in the prediction stage: predicting the next word requires feeding the previously predicted word back into the model. For example, when generating the sentence "today / weather / good", predicting "good" requires the earlier prediction "weather" as input. This leads to the following problems:
1. Because the sequence is generated from left to right by search, the model cannot use information provided by words on the right when predicting a word on the left;
2. Because the sequence is generated from left to right by search, the prediction stage is hard to parallelize, which limits prediction speed;
3. Because the whole sequence is generated from left to right by search, the neural network must run many feedforward passes, which further slows prediction.
Disclosure of Invention
In order to remedy these defects of the prior art and to predict each word without depending on the previous word, thereby improving prediction speed, the invention adopts the following technical scheme:
a low-delay speech recognition model based on a feedforward neural network comprises the following components: the encoder converts the acoustic feature sequence into high-level semantic representations, the summarizer converts the high-level semantic representations into high-level semantic representations corresponding to each word position through preset position coding and the high-level semantic representations, and the decoder further extracts semantic information at a word level from the high-level semantic representations corresponding to each word position.
The encoder, the summarizer and the decoder are each a stack of attention modules, every module consisting of an attention mechanism and a feedforward neural network joined by residual connections. The attention mechanism is an inner-product (scaled dot-product) attention mechanism:

Attention(Q, K, V) = softmax(QK^T / sqrt(D)) V

where Q ∈ R^(Tq×D) denotes the Query, K ∈ R^(TK×D) the Key, and V ∈ R^(TK×D) the Value; TK and Tq are the sequence lengths of the Key and the Query respectively, and D is the input dimension. The attention mechanism of the encoder and the decoder is a self-attention mechanism, i.e. Q, K and V are the same sequence; the summarizer is a mutual (cross) attention mechanism, where K and V are the same sequence and Q is a preset position code.
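As an illustrative sketch (not the patented implementation; the shapes and random values are invented for demonstration), the inner-product attention described above can be written in NumPy:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Inner-product (scaled dot-product) attention.

    Q: (Tq, D), K: (Tk, D), V: (Tk, D) -> output (Tq, D).
    """
    D = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(D)       # (Tq, Tk) similarity matrix
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted sum of Value vectors

# Self-attention, as in the encoder and decoder: Q, K, V are the same sequence.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
out = attention(X, X, X)
```

In the summarizer the same function would be called with Q set to the position code and K = V set to the encoder output.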
The attention mechanism is a multi-head attention mechanism:

MHA(Q, K, V) = Cat(head1, …, headn) W^O
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

where W_i^Q, W_i^K, W_i^V and W^O are parameters and Cat is the concatenation (splicing) operation.
The feedforward neural network is a position-wise feedforward neural network:

FFN(x) = W2 relu(W1 x + b1) + b2

where W1, W2, b1, b2 are parameters and relu is the rectified linear unit activation function. "Position-wise" means that, for a sequence of vectors, each vector is transformed by the same feedforward network, and the transformed vectors form a new sequence.
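A minimal NumPy sketch of the position-wise feedforward network; row-vector inputs are assumed, so x @ W1 corresponds to W1 x in the formula, and all dimensions are illustrative:

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = W2 relu(W1 x + b1) + b2, applied to every position independently.

    x: (T, D) sequence; the same weights transform each of the T vectors.
    """
    h = np.maximum(0.0, x @ W1 + b1)  # ReLU hidden layer, shape (T, D_ff)
    return h @ W2 + b2                # back to the model dimension, shape (T, D)

rng = np.random.default_rng(1)
D, D_ff, T = 8, 16, 4
W1, b1 = rng.normal(size=(D, D_ff)), np.zeros(D_ff)
W2, b2 = rng.normal(size=(D_ff, D)), np.zeros(D)
x = rng.normal(size=(T, D))
y = position_wise_ffn(x, W1, b1, W2, b2)
```

Because each position is transformed independently, transforming a single position in isolation gives the same result as transforming it inside the full sequence.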
The position code is a sine-cosine position code, each element of which is computed as:

PE(i, 2j) = sin(i / 10000^(2j/D))
PE(i, 2j+1) = cos(i / 10000^(2j/D))

where 2j and 2j+1 are the element indices of the i-th position vector in the position coding sequence.
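Assuming the standard sinusoidal formulation recovered above, the position code can be sketched in NumPy (the table size is illustrative):

```python
import numpy as np

def sincos_position_encoding(num_positions, D):
    """PE[i, 2j] = sin(i / 10000^(2j/D)), PE[i, 2j+1] = cos(i / 10000^(2j/D)).

    Assumes an even model dimension D; returns a (num_positions, D) table.
    """
    PE = np.zeros((num_positions, D))
    pos = np.arange(num_positions)[:, None]       # position index i
    j2 = np.arange(0, D, 2)                       # even element indices 2j
    angle = pos / np.power(10000.0, j2 / D)
    PE[:, 0::2] = np.sin(angle)                   # even slots: sine
    PE[:, 1::2] = np.cos(angle)                   # odd slots: cosine
    return PE

PE = sincos_position_encoding(10, 8)
```

These fixed vectors serve as the Query of the summarizer, one per word position.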
The model is trained with the maximum likelihood estimation criterion and the back-propagation algorithm. A dictionary is built for the text, and from the paired acoustic feature sequences and text the loss function is computed:

L = − Σ_{j=1..J} log P(y_j | x)

where y_j is the correct label at position j and J is the total sequence length. A word sequence length J is fixed in advance; if a sentence is shorter than J, it is padded at the end with <eos> symbols, so that the sentence length is predicted automatically in the speech recognition stage.
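A toy sketch of the <eos> padding and the summed negative log-likelihood loss; the small vocabulary and the uniform distribution are invented for illustration:

```python
import numpy as np

def pad_to_length(tokens, J, eos="<eos>"):
    """Pad a token sequence to the fixed length J with <eos> symbols."""
    assert len(tokens) <= J
    return tokens + [eos] * (J - len(tokens))

def nll_loss(log_probs, targets):
    """Negative log-likelihood summed over all J positions.

    log_probs: (J, |V|) per-position log-probabilities from the decoder.
    targets:   length-J list of target word indices, including <eos> padding.
    """
    return -sum(log_probs[j, t] for j, t in enumerate(targets))

vocab = {"<eos>": 0, "today": 1, "weather": 2, "good": 3}
sent = pad_to_length(["today", "weather", "good"], J=5)
targets = [vocab[w] for w in sent]

# With a uniform distribution over 4 words, each position contributes log 4.
log_probs = np.log(np.full((5, 4), 0.25))
loss = nll_loss(log_probs, targets)
```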
The dictionary is built from the text by counting the frequency of each word, sorting the words in decreasing order of frequency, and selecting the top N words as the dictionary; words not in the dictionary are denoted as <UNK>.
The training method for the low-delay speech recognition model based on the feedforward neural network comprises the following steps:
s11, extracting voice features, and acquiring acoustic feature sequences and corresponding texts from training corpora;
s12, the encoder converts the acoustic feature sequence into high-level semantic representation;
s13, converting the summarizer into a high-level semantic representation corresponding to each word position through preset position coding and high-level semantic representation;
s14, the decoder further extracts semantic information at the word level from the representation corresponding to each word position;
the low-delay speech recognition method based on the feedforward neural network comprises the following steps:
s21, voice feature extraction, extracting acoustic feature sequences from voice signals;
s22, the encoder converts the acoustic feature sequence into high-level semantic representation;
s23, converting the summarizer into a high-level semantic representation corresponding to each word position through position coding and high-level semantic representation;
s24, the decoder further extracts semantic information at the word level from the high level semantic representation corresponding to each word position;
s25, a prediction stage, selecting the word with the highest probability at each position according to the output of the decoder.
The prediction stage selects the highest-probability word at each word position:

ŷ_j = argmax_{y_j} P(y_j | x)

where y_j is a word in the dictionary, j is the corresponding position, P is its probability, argmax takes the maximizing argument, and ŷ_j is the predicted word at position j.
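A sketch of this parallel argmax prediction: every position is decided independently from the decoder's per-position distribution, with no left-to-right search. The tiny vocabulary and probability values are invented:

```python
import numpy as np

def predict(probs, inv_vocab):
    """Pick the highest-probability word at every position in parallel.

    probs: (J, |V|) per-position word distributions from the decoder.
    """
    idx = probs.argmax(axis=-1)          # argmax per position, no search
    return [inv_vocab[i] for i in idx]

inv_vocab = {0: "<eos>", 1: "today", 2: "weather", 3: "good"}
probs = np.array([
    [0.1, 0.7, 0.1, 0.1],   # position 0 -> "today"
    [0.1, 0.1, 0.7, 0.1],   # position 1 -> "weather"
    [0.1, 0.1, 0.1, 0.7],   # position 2 -> "good"
    [0.7, 0.1, 0.1, 0.1],   # position 3 -> "<eos>" marks the sentence end
])
words = predict(probs, inv_vocab)
```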
After training, speech can be recognized: the character, word or sub-word at each position is predicted directly from the input speech, with no search procedure, which greatly accelerates speech recognition and achieves low-delay, fast recognition.
The invention has the advantages and beneficial effects that:
the invention directly predicts the words, words or sub-words of each position according to the input voice, avoids sequence search, directly predicts the words of each position, is beneficial to computer parallelization calculation, does not need feedforward for multiple times, greatly accelerates the speed of voice recognition and realizes low-delay rapid recognition.
Drawings
FIG. 1 is a flow chart of speech recognition of the present invention.
FIG. 2 is a schematic diagram of the structure of the speech recognition model of the present invention.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present invention, are given by way of illustration and explanation only, not limitation.
The low-delay speech recognition model based on the feedforward neural network comprises: an encoder model, a feedforward neural network based on a self-attention mechanism, which converts the speech feature sequence into a high-level semantic representation; a position-dependent summarizer model, a feedforward neural network based on a mutual attention mechanism, which converts the high-level representation output by the encoder into a high-level semantic representation for each word position; and a decoder model, a feedforward neural network based on a self-attention mechanism, which further extracts word-level semantic information.
The encoder, the summarizer and the decoder are all formed by stacking a plurality of layers of attention modules, and each attention module is composed of an attention mechanism and a feedforward neural network which are connected through residual errors.
The attention mechanism is an inner-product (scaled dot-product) attention mechanism:

Attention(Q, K, V) = softmax(QK^T / sqrt(D)) V

where Q ∈ R^(Tq×D) denotes the Query, K ∈ R^(TK×D) the Key, and V ∈ R^(TK×D) the Value; TK and Tq are the sequence lengths of the Key and the Query respectively, and D is the input dimension.
The attention mechanism of the encoder and decoder is a self-attention mechanism, i.e. Q, V, K is the same sequence; the summarizer is a mutual attention mechanism, i.e. V, K is the same sequence and Q is a position code set in advance.
The attention mechanism may be a multi-head attention mechanism:

MHA(Q, K, V) = Cat(head1, …, headn) W^O
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

where W_i^Q, W_i^K, W_i^V and W^O are parameters and Cat is the concatenation (splicing) operation.
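Under the multi-head reconstruction above, a NumPy sketch of the mechanism; the output projection Wo and all dimensions are assumptions for illustration, not the patented parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Inner-product attention as defined earlier in the description.
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(Q, K, V, heads, Wo):
    """MHA: project Q, K, V per head, attend, concatenate, project out.

    heads: list of (Wq, Wk, Wv) projection triples, one per head.
    Wo:    output projection mixing the concatenated heads.
    """
    outs = [attention(Q @ Wq, K @ Wk, V @ Wv) for Wq, Wk, Wv in heads]
    return np.concatenate(outs, axis=-1) @ Wo  # Cat(head1, ..., headn) Wo

rng = np.random.default_rng(2)
D, Dh, n = 8, 4, 2                       # model dim, per-head dim, head count
heads = [tuple(rng.normal(size=(D, Dh)) for _ in range(3)) for _ in range(n)]
Wo = rng.normal(size=(n * Dh, D))
X = rng.normal(size=(5, D))
out = multi_head_attention(X, X, X, heads, Wo)
```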
The attention mechanism is followed by a position-wise feedforward neural network:

FFN(x) = W2 relu(W1 x + b1) + b2

where W1, W2, b1, b2 are parameters and relu is the rectified linear unit activation function. "Position-wise" means that each vector in the sequence is transformed by the same feedforward network, and the transformed vectors form a new sequence.
Each attention module further comprises a layer normalization operation.
The position code is a sine-cosine position code, each element of which is computed as:

PE(i, 2j) = sin(i / 10000^(2j/D))
PE(i, 2j+1) = cos(i / 10000^(2j/D))

where 2j and 2j+1 are the element indices of the i-th position vector in the position coding sequence.
In the training stage, the acoustic feature sequences and the corresponding text are obtained from the training corpus; the acoustic features may be, for example, MFCC (Mel-frequency cepstral coefficient) or FBANK (log Mel filter-bank) features.
Establishing a dictionary according to texts in a training corpus, specifically comprising:
1. counting the occurrence frequency of words in the text;
2. sorting in decreasing order of word frequency and selecting the top N words as the dictionary; words not in the dictionary are denoted as <UNK>.
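The two dictionary-building steps above can be sketched as follows (the toy corpus and the value of N are invented for illustration):

```python
from collections import Counter

def build_dictionary(texts, N):
    """Keep the N most frequent words; everything else maps to <UNK>."""
    counts = Counter(w for line in texts for w in line.split())
    words = [w for w, _ in counts.most_common(N)]  # sorted by decreasing frequency
    vocab = {w: i for i, w in enumerate(words)}
    vocab["<UNK>"] = len(vocab)                    # index for out-of-dictionary words
    return vocab

texts = ["today weather good", "today weather bad", "today good"]
vocab = build_dictionary(texts, N=2)  # "today" is the most frequent word
```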
The first-layer attention module of the encoder takes the acoustic feature sequence as input; each subsequent attention module takes the output of the previous layer.
The position-dependent summarizer model uses a mutual attention mechanism to generate the representation for each word position from the preset position code and the encoder output: in the first-layer attention module, K and V are the encoder output and Q is the position code; in each subsequent attention module, Q is the output of the previous attention module while K and V remain the encoder output.
The position-dependent summarizer extracts the high-level representation for each word position from the high-level representation sequence output by the encoder and the position code of each word position. The first-layer attention module of the summarizer takes the position code and the encoder's high-level representation as input and produces a code for each word position; the second- to n-th-layer attention modules take the codes produced by the previous layer together with the encoder's high-level representation and refine the codes for the word positions.
The input of the decoder's first-layer attention module is the output of the position-dependent summarizer, i.e. the representation for each word position; each subsequent attention module takes the output of the previous layer, extracting a high-level semantic representation for each word.
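Putting the three components together, a shape-level sketch of the dataflow: random matrices stand in for the trained encoder stack, position code and decoder, so this illustrates only the mutual-attention summarizer wiring, not the trained model:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(3)
T_audio, J, D = 20, 5, 8        # acoustic frames, word positions, model dim

# Encoder output: one high-level vector per acoustic frame (stand-in values).
H = rng.normal(size=(T_audio, D))

# Summarizer: mutual attention where Q is the preset position code (one query
# per word position) and K = V = the encoder output H. The 20-frame sequence
# is thereby compressed into J word-position summaries.
PE = rng.normal(size=(J, D))    # stands in for the sine-cosine position code
S = attention(PE, H, H)         # (J, D): one summary vector per word slot

# Decoder: self-attention over the J word-position summaries.
Y = attention(S, S, S)
```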
The model is trained with the maximum likelihood estimation criterion and the back-propagation algorithm, computing the loss function from the paired speech and text:

L = − Σ_{j=1..J} log P(y_j | x)

where y_j is the correct label at position j and J is the total sequence length. A word sequence length J is fixed in advance; if a sentence is shorter than J, it is padded at the end with <eos> symbols, so that the sentence length is predicted automatically in the speech recognition stage. Back-propagation computes the gradient of the loss function with respect to each parameter.
In the prediction stage, after training, speech can be recognized: the character, word or sub-word at each position is predicted directly from the input speech without any search procedure, which greatly accelerates speech recognition and achieves low-delay, fast recognition. Prediction proceeds as follows:
1. Extract the speech signal into an acoustic feature sequence; the extracted features may be, for example, Mel-frequency cepstral coefficients;
2. The low-delay speech recognition model based on the feedforward neural network performs speech recognition as follows:
(1) the encoder extracts a high-level semantic representation;
(2) the position-dependent summarizer converts the high-level representation extracted by the encoder into a high-level semantic representation for each word position;
(3) the decoder further extracts the high-level semantic representation of each word and makes the prediction;
(4) Prediction stage, i.e. the speech recognition stage: select the highest-probability word at each position according to the decoder output. Linear and nonlinear transformations map the extracted high-level word representations to a word probability distribution at each position, and the highest-probability word is selected at each word position:

ŷ_j = argmax_{y_j} P(y_j | x)

where y_j is a word in the dictionary, j is the corresponding position, P is its probability, argmax takes the maximizing argument, and ŷ_j is the predicted word at position j.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.
Claims (9)
1. A low-delay speech recognition model based on a feedforward neural network, comprising an encoder and a decoder, characterized by further comprising a summarizer, wherein the encoder converts the acoustic feature sequence into high-level semantic representations, the summarizer converts the high-level semantic representations, together with a preset position code, into a high-level semantic representation for each word position, and the decoder further extracts word-level semantic information from the high-level semantic representation at each word position;
the encoder, the summarizer and the decoder are all formed by stacking a plurality of layers of attention modules, each attention module consists of an attention mechanism and a feedforward neural network which are connected by residual errors, and the attention mechanism is an inner product attention mechanism:
wherein the content of the first and second substances,the representation of a Query is that of a Query,the expression of the Key is shown,represents Value, TKAnd TqThe sequence lengths of Key and Query are respectively, D is the dimension of input, the attention mechanism of an encoder and a decoder is a self-attention mechanism, and Q, V, K is the same sequence; the summarizer is a mutual attention mechanism, V, K is the same sequence, and Q is a preset position code.
2. A feedforward neural network-based low-latency speech recognition model according to claim 1, wherein the attention mechanism is a multi-head attention mechanism:
MHA(Q, K, V) = Cat(head1, …, headn) W^O
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

where W_i^Q, W_i^K, W_i^V and W^O are parameters and Cat is the concatenation operation.
3. A feedforward neural network-based low-latency speech recognition model according to claim 1, wherein the feedforward neural network is a position-by-position feedforward neural network:
FFN(x)=W2 relu(W1x+b1)+b2
wherein W1, W2, b1, b2 are parameters and relu is the rectified linear unit activation function; "position-by-position" means that each vector in the sequence is transformed by the same feedforward network, and the transformed vectors form a new sequence.
4. A feedforward neural network-based low-latency speech recognition model according to claim 1, wherein the position code is a sine-cosine position code, each position element of which is calculated as follows:

PE(i, 2j) = sin(i / 10000^(2j/D))
PE(i, 2j+1) = cos(i / 10000^(2j/D))

where 2j and 2j+1 are the element indices of the i-th position vector in the position coding sequence.
5. A feedforward neural network-based low-latency speech recognition model according to any one of claims 1-4, wherein the model is trained using the maximum likelihood estimation criterion and a back-propagation algorithm on pairs of acoustic feature sequences and text, and a dictionary is built for the text.
6. The feedforward neural network-based low-latency speech recognition model of claim 5, wherein the dictionary is built by counting the frequency of each word in the text, sorting the words in decreasing order of word frequency, and selecting the top N words as the dictionary; words not in the dictionary are denoted as <UNK>.
7. A method for training a low-latency speech recognition model based on a feedforward neural network as claimed in claim 1, comprising the steps of:
s11, extracting voice features, and acquiring an acoustic feature sequence and a corresponding text from the training corpus;
s12, the encoder converts the acoustic feature sequence into high-level semantic representation;
s13, converting the summarizer into high-level semantic representation corresponding to each word position through preset position coding and high-level semantic representation;
the decoder further extracts semantic information at the word level from the representation corresponding to each word position S14.
8. A speech recognition method based on the feedforward neural network low-latency speech recognition model of claim 1, comprising the steps of:
s21, voice feature extraction, extracting acoustic feature sequences from voice signals;
s22, the encoder converts the acoustic feature sequence into high-level semantic representation;
s23, converting the summarizer into a high-level semantic representation corresponding to each word position through position coding and high-level semantic representation;
s24, the decoder further extracts semantic information at the word level from the high level semantic representation corresponding to each word position;
s25, a prediction stage, selecting the word with the highest probability at each position according to the output of the decoder.
Priority and publication information
- Application CN202010988191.2A, filed 2020-09-18 (also the priority date).
- Published as CN112133304A on 2020-12-25; granted as CN112133304B on 2022-05-06 (status: Active).