CN113936641B - Customizable end-to-end system for Chinese-English mixed speech recognition - Google Patents

Customizable end-to-end system for Chinese-English mixed speech recognition

Info

Publication number
CN113936641B
CN113936641B
Authority
CN
China
Prior art keywords: english, sequence, attention, sequence represented, acoustic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111548173.3A
Other languages
Chinese (zh)
Other versions
CN113936641A (en)
Inventor
陶建华 (Jianhua Tao)
张帅 (Shuai Zhang)
易江燕 (Jiangyan Yi)
Current Assignee
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science
Priority to CN202111548173.3A
Publication of CN113936641A
Application granted
Publication of CN113936641B
Legal status: Active

Classifications

    • G10L 15/005 — Speech recognition; language recognition
    • G10L 15/02 — Feature extraction for speech recognition; selection of recognition unit
    • G10L 15/063 — Training of speech recognition systems (creation of reference templates; adaptation to speaker characteristics)
    • G10L 15/183 — Speech classification or search using natural language modelling with context dependencies, e.g. language models
    • G06F 40/126 — Handling natural language data; use of codes for handling textual entities; character encoding
    • G06F 40/237 — Natural language analysis; lexical tools
    • G06F 40/284 — Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention provides a customizable end-to-end system for Chinese-English mixed speech recognition. The system comprises an acoustic encoder, an English vocabulary encoder, a decoder, and a softmax function. The end-to-end model follows an acoustic-encoder / English-vocabulary-encoder / decoder structure, and attention-based modeling is used in all three components. The model is customized by encoding the English words or phrases of interest in advance, converting the discrete words into the model's hidden-layer representations and assembling them into a retrievable vector list. During recognition, the decoder attends simultaneously to the high-dimensional representation of the acoustic features and to the final representation of the English vocabulary. The system can thus be customized for English proper nouns in different domains, accurately recognize both Chinese and English in Chinese-English mixed expressions, and reduce the model's dependence on training data.

Description

Customizable end-to-end system for Chinese-English mixed speech recognition
Technical Field
The invention belongs to the field of voice recognition, and particularly relates to a customizable end-to-end system for Chinese-English mixed voice recognition.
Background
The existing Chinese-English mixed speech recognition mainly follows two technical routes. 1. Pipeline methods: an acoustic model, a pronunciation model, and a language model are trained and modeled separately, and the three models are then integrated into a unified decoding graph in the form of a directed acyclic graph; the recognition process is a search over this decoding graph. 2. End-to-end methods: the acoustic model, pronunciation model, and language model are modeled and optimized jointly; no decoding graph needs to be constructed, and training and decoding are concise.
The prior art has the following defects:
(1) In the pipeline method, because the acoustic model, the pronunciation model, and the language model are trained and modeled separately, errors accumulate: errors of the acoustic model propagate to the downstream pronunciation and language models, degrading performance. On the other hand, owing to the complexity of the statistical language model, the constructed decoding graph is very large and therefore unsuitable for on-device applications such as mobile phones and smart speakers;
(2) Existing end-to-end models require a large amount of training data, but Chinese-English mixed data is extremely difficult to obtain and insufficient for effective training of an end-to-end model. Moreover, an end-to-end model trained on Chinese-English mixed data from one specific domain cannot effectively solve the English recognition problem in other domains.
Disclosure of Invention
In order to solve the above technical problems, the present invention provides a customizable end-to-end system for Chinese-English mixed speech recognition.
The first aspect of the invention discloses a customizable end-to-end system for Chinese-English mixed speech recognition. The system comprises:
an acoustic encoder, an English vocabulary encoder, and a decoder;
the acoustic encoder: extracts acoustic features from a speech waveform to obtain an acoustic feature sequence; performs convolution and re-encoding operations on the acoustic feature sequence to obtain a down-sampled and re-encoded feature sequence; and inputs this feature sequence into the multi-head self-attention module of the acoustic encoder, which is based on a multi-head self-attention mechanism, to obtain the high-dimensional representation sequence of the acoustic features;
the English vocabulary encoder: divides English words into finer-grained English sub-words to obtain the English sub-word sequence of an English word or phrase; re-encodes the sub-word sequence; and inputs it into an encoding module based on a multi-head self-attention mechanism to obtain the final representation sequence of the English vocabulary;
the decoder: re-encodes the labeled target text to obtain the target-text-encoded sequence; inputs the high-dimensional representation sequence of the acoustic features, the final representation sequence of the English vocabulary, and the target-text-encoded sequence into the multi-head attention modules of the decoder, which is based on multi-head attention, to obtain the context-vector sequence of the acoustic features and the context-vector sequence of the English vocabulary; concatenates and fuses the two context-vector sequences; and inputs the result into a fully-connected layer to obtain the final decoded representation.
According to the system of the first aspect of the present application, the specific method for performing convolution and re-encoding operations on the acoustic feature sequence to obtain a down-sampled and re-encoded feature sequence includes:
performing convolution operations on the acoustic feature sequence with several 2-dimensional convolution kernels, the down-sampling ratio being controlled by the stride of the convolution; connecting an activation function after each convolution operation to perform a nonlinear transformation; and, after stacking multiple convolution layers, using a fully-connected mapping layer to map the acoustic features into high-dimensional vectors and then adding position encoding information, expressed using absolute positions, to the vector sequence, thereby realizing the down-sampling and re-encoding of the acoustic feature sequence.
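The down-sampling arithmetic and the absolute position encoding described above can be sketched as follows. This is a minimal numpy sketch under assumptions not fixed by the text: a 3×3 kernel with stride 2 and padding 1 (the stride controls the down-sampling ratio, as stated), and a sinusoidal scheme as one common form of absolute position encoding.

```python
import numpy as np

def sinusoidal_position_encoding(length, dim):
    """Absolute sinusoidal position encoding (one common absolute scheme)."""
    pos = np.arange(length)[:, None]                   # (length, 1)
    i = np.arange(dim // 2)[None, :]                   # (1, dim/2)
    angles = pos / np.power(10000.0, 2 * i / dim)      # (length, dim/2)
    pe = np.zeros((length, dim))
    pe[:, 0::2] = np.sin(angles)                       # even dims: sine
    pe[:, 1::2] = np.cos(angles)                       # odd dims: cosine
    return pe

def conv_output_length(length, kernel=3, stride=2, padding=1):
    """Frame count along time after one strided convolution."""
    return (length + 2 * padding - kernel) // stride + 1

# Two stacked stride-2 convolutions down-sample the time axis by ~4x.
T = 100
T1 = conv_output_length(T)        # after the first convolution layer
T2 = conv_output_length(T1)       # after the second convolution layer
pe = sinusoidal_position_encoding(T2, 256)  # added to the mapped vectors
```

Each stride-2 layer roughly halves the frame rate, so stacking layers multiplies the down-sampling ratio.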
According to the system of the first aspect of the present application, the multi-head self-attention module of the acoustic encoder based on the multi-head self-attention mechanism is formed by stacking a plurality of modules with the same structure, and residual connection is performed between each module with the same structure; each module with the same structure comprises two subsections, and the specific structure comprises: the first subsection is a multi-headed self-attention layer followed by a fully-connected mapping layer of the second subsection, each subsection being subjected to a layer normalization operation, with a residual connection between the two subsections.
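The two-sub-part block structure just described (a multi-head self-attention layer followed by a fully-connected mapping layer, each with layer normalization and a residual connection) can be sketched in numpy. For brevity this sketch uses random untrained weights, a single attention head, and a ReLU activation; all three are simplifications, not the patent's configuration.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, d):
    # Single-head self-attention with random (untrained) projections.
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = softmax(q @ k.T / np.sqrt(d))
    return scores @ v

def encoder_block(x, d=8, d_ff=32):
    # Sub-part 1: self-attention, residual connection, layer normalization.
    x = layer_norm(x + self_attention(x, d))
    # Sub-part 2: fully-connected mapping layer, residual, layer norm.
    rng = np.random.default_rng(1)
    W1 = rng.standard_normal((d, d_ff)) / np.sqrt(d)
    W2 = rng.standard_normal((d_ff, d)) / np.sqrt(d_ff)
    ff = np.maximum(x @ W1, 0.0) @ W2   # ReLU stand-in for the activation
    return layer_norm(x + ff)

y = encoder_block(np.random.default_rng(2).standard_normal((5, 8)))
```

Stacking several such blocks, with residual connections between them, yields the encoder described in the claim.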
According to the system of the first aspect of the present application, the specific method for re-encoding the English words or English phrases and inputting them into the encoding module based on the multi-head self-attention mechanism to obtain the final representation sequence of the English vocabulary includes:
re-encoding the English words or English phrases and inputting them into the multi-head self-attention module of the English vocabulary encoder, based on a multi-head self-attention mechanism, where they are converted into a high-dimensional English vocabulary feature sequence; this sequence is then input into a long short-term memory neural network to obtain the final representation sequence of the English vocabulary.
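The step above summarizes each customized word or phrase as a single vector via an LSTM. The sketch below is a minimal numpy rendition under stated assumptions: the sub-word tokens, the `##` continuation convention, and the random untrained embeddings and LSTM weights are all invented for illustration, and the LSTM's final hidden state stands in for the phrase summary.

```python
import numpy as np

def lstm_last_state(x, d_hidden, seed=0):
    """Minimal LSTM forward pass; the final hidden state is used as the
    single-vector summary of a sub-word sequence."""
    rng = np.random.default_rng(seed)
    d_in = x.shape[1]
    W = rng.standard_normal((4 * d_hidden, d_in + d_hidden)) * 0.1
    b = np.zeros(4 * d_hidden)
    h = np.zeros(d_hidden)
    c = np.zeros(d_hidden)
    sigm = lambda z: 1.0 / (1.0 + np.exp(-z))
    for t in range(x.shape[0]):
        z = W @ np.concatenate([x[t], h]) + b
        i, f, g, o = np.split(z, 4)          # input, forget, cell, output gates
        c = sigm(f) * c + sigm(i) * np.tanh(g)
        h = sigm(o) * np.tanh(c)
    return h

# Hypothetical sub-word split of a phrase to be customized, then
# (random, untrained) embedding and summarization by the LSTM.
subwords = ["deep", "learn", "##ing"]        # illustrative tokens
emb = {w: np.random.default_rng(i).standard_normal(16)
       for i, w in enumerate(subwords)}
seq = np.stack([emb[w] for w in subwords])   # (3, 16) sub-word sequence
summary = lstm_last_state(seq, d_hidden=8)   # one vector per phrase
```

Running every customized entry through this path yields the vector list the decoder later retrieves from.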
According to the system of the first aspect of the present application, the multi-head self-attention module of the english vocabulary encoder based on the multi-head self-attention mechanism is formed by stacking a plurality of modules with the same structure, and each module with the same structure is connected with each other by a residual error; each module with the same structure comprises two subsections, and the specific structure comprises: the first subsection is a multi-headed self-attention layer followed by a fully-connected mapping layer of the second subsection, each subsection being subjected to a layer normalization operation, with a residual connection between the two subsections.
According to the system of the first aspect of the present application, the specific method for inputting the high-dimensional representation sequence of the acoustic features, the final representation sequence of the English vocabulary, and the target-text-encoded sequence into the multi-head attention modules of the multi-head-attention-based decoder to obtain the context-vector sequence of the acoustic features and the context-vector sequence of the English vocabulary includes:
inputting the target-text-encoded sequence into the multi-head self-attention module of the target text to obtain a high-dimensional representation of the target sequence;
using the high-dimensional representation of the target sequence as the query vector of both the multi-head attention module of the acoustic features and the multi-head attention module of the English vocabulary; using the high-dimensional representation of the acoustic features as the keys and values of the acoustic-feature attention module, and the final representation of the English vocabulary as the keys and values of the English-vocabulary attention module; computing cosine distances between the query vector and the keys element by element, and deriving from these distances the attention score of each key in the two attention modules; and weighted-averaging the value sequences in the two attention modules with the keys' attention scores to obtain the context-vector sequence of the acoustic features and the context-vector sequence of the English vocabulary.
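This dual attention step can be sketched in numpy. The sketch is illustrative, not the implementation: the dimensions and the random stand-in representations are assumptions, and scoring is done with a softmax over cosine similarities as one reading of the cosine-distance scoring described above.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cosine_attention(queries, keys, values):
    """Scores from cosine similarity between queries and keys, then a
    weighted average of the values (as described above)."""
    qn = queries / np.linalg.norm(queries, axis=-1, keepdims=True)
    kn = keys / np.linalg.norm(keys, axis=-1, keepdims=True)
    scores = softmax(qn @ kn.T)          # (n_query, n_key)
    return scores @ values               # context vectors

rng = np.random.default_rng(0)
d = 8
target_hidden = rng.standard_normal((4, d))   # decoder queries
acoustic = rng.standard_normal((20, d))       # acoustic keys/values
vocab = rng.standard_normal((6, d))           # vocabulary keys/values

ctx_acoustic = cosine_attention(target_hidden, acoustic, acoustic)
ctx_vocab = cosine_attention(target_hidden, vocab, vocab)

# Concatenate the two context sequences and fuse them with a
# fully-connected layer (random, untrained weights).
fused = np.concatenate([ctx_acoustic, ctx_vocab], axis=-1)   # (4, 2d)
W_fc = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)
decoded = fused @ W_fc                                        # (4, d)
```

The same queries attend to both sources, so a strong match in the vocabulary list can steer decoding toward a customized word.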
According to the system of the first aspect of the present application, the multi-head self-attention module of the target text is formed by stacking a plurality of modules with the same structure, and residual connection is performed between each module with the same structure; each module with the same structure comprises two subsections, and the specific structure comprises: the first subsection is a multi-headed self-attention layer followed by a fully-connected mapping layer of the second subsection, each subsection being subjected to a layer normalization operation, with a residual connection between the two subsections.
According to the system of the first aspect of the present application, the specific method for re-encoding the labeled target text to obtain the target-text-encoded sequence includes:
performing word embedding mapping on the labeled target text to obtain a high-dimensional target word vector sequence represented by the corresponding target word vectors, and adding position encoding information and timing information to the target word vector sequence to obtain the target-text-encoded sequence;
the specific method for re-encoding the English words or English phrases is as follows: performing word embedding mapping on the English words or English phrases to obtain a high-dimensional English word or phrase vector sequence represented by the corresponding vectors, and adding position encoding information and timing information to this vector sequence.
A second aspect of the invention provides an electronic device comprising a memory and a processor, the memory having stored thereon a computer program which, when executed by the processor, performs a method in a customizable chinese-english hybrid speech recognition end-to-end system according to the first aspect of the invention.
A third aspect of the invention provides a storage medium storing a computer program executable by one or more processors and operable to implement a method in a customizable chinese-english hybrid speech recognition end-to-end system according to the first aspect of the invention.
In conclusion, the scheme of the invention can customize the model for English proper nouns in different domains, accurately recognize both Chinese and English in Chinese-English mixed expressions, and reduce the model's dependence on training data. In principle, the English phrases to be customized are encoded in advance to serve as a query list for the model, and the recognition process is guided by the decoder's query results, so that the customized words are recognized accurately. The vocabulary can be customized in advance for English domain-specific terms in different fields. Through model customization, the recognition accuracy of the Chinese-English hybrid recognition system is effectively improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a block diagram of a customizable chinese-to-english hybrid speech recognition end-to-end system according to an embodiment of the present invention;
FIG. 2 is a block diagram illustrating a customizable chinese-english hybrid speech recognition end-to-end system, according to an embodiment of the present invention;
fig. 3 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1:
a first aspect of the present invention discloses a customizable chinese-english hybrid speech recognition end-to-end system, fig. 1 is a structural diagram of a customizable chinese-english hybrid speech recognition end-to-end system according to an embodiment of the present invention, specifically as shown in fig. 1, where the system 100 includes:
an acoustic encoder 101, an English vocabulary encoder 102, and a decoder 103.
The acoustic encoder 101: extracts acoustic features from a speech waveform to obtain an acoustic feature sequence; performs convolution and re-encoding operations on the acoustic feature sequence to obtain a down-sampled and re-encoded feature sequence; and inputs this feature sequence into the multi-head self-attention module of the acoustic encoder, which is based on a multi-head self-attention mechanism, to obtain the high-dimensional representation sequence of the acoustic features;
in some embodiments, the specific method of extracting acoustic features of a speech waveform comprises: performing voice waveform framing processing, and dividing continuous voice waveform points into short-time audio frames with fixed length, so as to facilitate subsequent feature extraction; extracting fbank (filter-bank) acoustic features from the short-time audio frame;
in some embodiments, the specific method for performing convolution and re-encoding operations on the acoustic feature sequence to obtain a down-sampled and re-encoded feature sequence includes:
performing convolution operations on the acoustic feature sequence with several 2-dimensional convolution kernels, the down-sampling ratio being controlled by the stride of the convolution; connecting an activation function after each convolution operation to perform a nonlinear transformation; and, after stacking multiple convolution layers, using a fully-connected mapping layer to map the acoustic features into high-dimensional vectors and then adding position encoding information, expressed using absolute positions, to the vector sequence, thereby realizing the down-sampling and re-encoding of the acoustic feature sequence;
in some embodiments, the multi-head self-attention module of the multi-head self-attention mechanism-based acoustic encoder is formed by stacking a plurality of structurally identical modules, and residual connection is performed between each structurally identical module; each module with the same structure comprises two subsections, and the specific structure comprises: the first sub-part is a multi-head self-attention layer, and is followed by a fully-connected mapping layer of a second sub-part, each sub-part is subjected to layer normalization operation, and residual connection is carried out between the two sub-parts;
the english vocabulary encoder 102: dividing English words into English sub-words with smaller granularity to obtain English sub-word sequences of the English words or English phrases, re-encoding the English sub-word sequences, and inputting the re-encoded English sub-word sequences into an encoding module based on a multi-head self-attention mechanism to obtain a sequence finally represented by an English word list;
in some embodiments, the specific method for recoding the english word or the english word group includes: performing word embedding mapping on the English words or the English phrases to obtain a high-dimensional English word or English phrase vector sequence represented by corresponding English words or English phrase vectors, and adding position coding information and time sequence information in the English word or English phrase vector sequence;
in some embodiments, the specific method for re-encoding the english word or the english word group and inputting the re-encoded english word or english word group to the encoding module based on the multi-head attention mechanism to obtain the final expressed sequence of the english vocabulary includes:
re-encoding the English words or English phrases and inputting them into the multi-head self-attention module of the English vocabulary encoder, based on a multi-head self-attention mechanism, to be converted into a high-dimensional English vocabulary feature sequence; then, in order to summarize this sequence as a single vector representation, which facilitates the decoder's subsequent attention computation over it, inputting the high-dimensional English vocabulary feature sequence into a long short-term memory neural network to obtain the final representation sequence of the English vocabulary;
in some embodiments, the multi-head self-attention module of the english vocabulary encoder based on the multi-head self-attention mechanism is formed by stacking a plurality of modules with the same structure, and residual connection is performed between each module with the same structure; each module with the same structure comprises two subsections, and the specific structure comprises: the first sub-part is a multi-head self-attention layer, and is followed by a fully-connected mapping layer of a second sub-part, each sub-part is subjected to layer normalization operation, and residual connection is carried out between the two sub-parts;
the multi-head attention mechanism expands the traditional attention mechanism to have multiple heads, so that each head has a different role in participating in the output of the encoder; specifically, the multiple heads independently compute h attentions and then connect their outputs to another linear projection; the attention formula is as follows:
Figure 845486DEST_PATH_IMAGE001
wherein Q, K, V represents the set of queries, keys, and values entered, respectively, the formula is as follows:
Figure 762626DEST_PATH_IMAGE002
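The multi-head attention computation described above can be sketched in numpy. The projection matrices here are random and untrained, and h = 4 heads is an illustrative choice for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, h=4, seed=0):
    """Scaled dot-product attention computed independently by h heads,
    whose outputs are concatenated and linearly projected (W_O)."""
    d_model = Q.shape[-1]
    d_k = d_model // h
    rng = np.random.default_rng(seed)
    heads = []
    for _ in range(h):
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
                      for _ in range(3))
        q, k, v = Q @ Wq, K @ Wk, V @ Wv
        att = softmax(q @ k.T / np.sqrt(d_k)) @ v   # one head's output
        heads.append(att)
    W_o = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    return np.concatenate(heads, axis=-1) @ W_o     # concat + projection

x = np.random.default_rng(1).standard_normal((5, 16))
out = multi_head_attention(x, x, x, h=4)            # self-attention: Q=K=V
```

Passing the same sequence as Q, K, and V gives self-attention; passing decoder states as Q against encoder outputs as K and V gives the cross-attention used in the decoder.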
the decoder 103: recoding the labeled target text to obtain a sequence represented by target text coding, respectively inputting the sequence represented by the high-dimensional representation of the acoustic features, the sequence represented by the final English word list and the sequence represented by the target text coding into a multi-head attention module of a decoder based on multi-head attention to obtain a sequence represented by the context vector of the acoustic features and a sequence represented by the context vector of the English word list, splicing and fusing the sequence represented by the context vector of the acoustic features and the sequence represented by the context vector of the English word list, and inputting the sequences into a full-connection layer to obtain final decoding representation;
in some embodiments, the specific method for re-encoding the labeled target text to obtain the sequence represented by the target text code includes:
performing word embedding mapping on the labeled target text to obtain a high-dimensional target word vector sequence represented by the corresponding target word vectors, and adding position encoding information and timing information to the target word vector sequence to obtain the target-text-encoded sequence;
in some embodiments, the specific method for obtaining the sequence represented by the context vector of the acoustic feature and the sequence represented by the context vector of the english vocabulary by inputting the sequence represented by the high-dimensional representation of the acoustic feature, the sequence represented by the final english vocabulary, and the sequence represented by the target text code into a multi-head attention module of a multi-head attention-based decoder includes:
as shown in fig. 2, inputting the sequence represented by the target text code into the multi-head self-attention module of the target text to obtain a high-dimensional representation of the target sequence;
the high dimension of the target sequence is used as a query vector of a multi-head self-attention module of the acoustic features and a multi-head self-attention module of an English vocabulary, the high dimension of the acoustic features is used as a key and a value of the multi-head self-attention module of the acoustic features, the English vocabulary finally represents the key and the value of the multi-head self-attention module of the English vocabulary, the query vector is used for calculating cosine distances element by element, and the attention score of each key in the multi-head self-attention module of the acoustic features and the multi-head self-attention module of the English vocabulary is obtained according to the distance; carrying out weighted average on value sequences in a multi-head self-attention module of the acoustic features and a multi-head self-attention module of the English vocabulary by using the attention scores of the keys to obtain a sequence represented by a context vector of the acoustic features and a sequence represented by a context vector of the English vocabulary;
in some embodiments, the multi-head self-attention module of the target text is formed by stacking a plurality of modules with the same structure, and residual connection is performed between each module with the same structure; each module with the same structure comprises two subsections, and the specific structure comprises: the first sub-part is a multi-head self-attention layer, and is followed by a fully-connected mapping layer of a second sub-part, each sub-part is subjected to layer normalization operation, and residual connection is carried out between the two sub-parts;
in some embodiments, the system 100 further comprises: a softmax function 104;
the final decoded representation is input to the softmax function 104 to get the goal of the maximum probability.
In summary, the technical solutions of the aspects of the present invention have the following advantages compared with the prior art:
the method can realize the customized model aiming at the English proper nouns in different fields, realize the accurate recognition of Chinese and English in Chinese and English mixed expression, and reduce the dependence of the model on training data. In principle, the English phrases to be customized are coded in advance to serve as a query list of the model, and the recognition process is guided according to the query result of the decoder, so that the customized words are accurately recognized. The word list can be customized in advance aiming at English special words in different fields.
Example 2:
as shown in fig. 1, the system 100 includes:
acoustic encoder 101, English vocabulary encoder 102, and decoder 103
The acoustic encoder 101: extracting acoustic features of a voice waveform to obtain an acoustic feature sequence, performing convolution and recoding operations on the acoustic feature sequence to obtain a down-sampled and recoded feature sequence, inputting the down-sampled and recoded feature sequence into a multi-head self-attention module of an acoustic encoder based on a multi-head self-attention mechanism to obtain a sequence represented by high dimensions of the acoustic features;
in some embodiments, the specific method of extracting acoustic features of a speech waveform comprises: the waveform is divided into 25-millisecond frames with a 10-millisecond shift between adjacent frames (so that neighboring frames overlap), and after framing, 80-dimensional fbank features are extracted to serve as the acoustic features;
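The framing arithmetic can be sketched as follows. The 16 kHz sampling rate is an assumption not stated in the text, which specifies only the 25 ms window and 10 ms shift; the signal here is random noise standing in for audio.

```python
import numpy as np

sr = 16000                    # assumed sampling rate (Hz)
win = int(0.025 * sr)         # 25 ms window -> 400 samples per frame
hop = int(0.010 * sr)         # 10 ms shift  -> 160 samples between frames

def frame(signal, win, hop):
    """Slice a 1-D signal into overlapping fixed-length frames."""
    n = 1 + max(0, (len(signal) - win) // hop)
    return np.stack([signal[i * hop: i * hop + win] for i in range(n)])

signal = np.random.default_rng(0).standard_normal(sr)   # 1 s of "audio"
frames = frame(signal, win, hop)   # each row is one 25 ms analysis frame
```

In the embodiment, each such frame would then be passed through a mel filter bank to produce one 80-dimensional fbank vector.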
in some embodiments, the specific method for performing convolution and re-encoding operations on the acoustic feature sequence to obtain a down-sampled and re-encoded feature sequence includes:
using a 3×3 convolution kernel with a stride of 2, each convolution operation followed by a ReLU activation function for nonlinear transformation; each convolution operation down-samples the acoustic features to half of their original rate, so with 2 convolution layers the acoustic features are down-sampled to one quarter of the initial sampling rate; then a fully-connected mapping layer maps the acoustic features into 256-dimensional vectors, and position encoding information, expressed using absolute positions, is added to the vector sequence, realizing the down-sampling and re-encoding of the acoustic feature sequence;
in some embodiments, the multi-head self-attention module of the multi-head self-attention mechanism-based acoustic encoder is formed by stacking 12 structurally identical modules, with residual connections between them; each module comprises two sub-parts: the first sub-part is a multi-head self-attention layer, and the second sub-part is a fully-connected mapping layer. The number of heads is set to 4, the dimension of the fully-connected layer is 1024, and the activation function is GLU; a layer normalization operation is applied to each sub-part, a residual connection joins the two sub-parts, and a dropout operation with parameter 0.1 is applied to both the self-attention layer and the fully-connected layer;
the multi-head attention mechanism extends the traditional attention mechanism to multiple heads, so that each head plays a different role when attending to the encoder output. Specifically, the h heads compute their attentions independently, and their outputs are then concatenated and passed through another linear projection. In this way, the acoustic encoder converts the original acoustic features into a high-dimensional feature representation;
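The head-splitting-and-concatenation scheme described above can be illustrated with a minimal pure-Python sketch. The final output projection and all learned weight matrices are omitted, and the vectors are toy lists rather than trained representations:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(q, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    d = len(q)
    scores = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                      for k in keys])
    return [sum(w * v[i] for w, v in zip(scores, values))
            for i in range(len(values[0]))]

def multi_head(q, keys, values, heads=4):
    """Split the model dimension across `heads` heads, attend per head
    independently, then concatenate the per-head outputs (the final
    linear projection is omitted for brevity)."""
    d = len(q) // heads
    out = []
    for h in range(heads):
        s, e = h * d, (h + 1) * d
        out += attention(q[s:e], [k[s:e] for k in keys],
                         [v[s:e] for v in values])
    return out
```

With a single key/value pair, each head's attention weight is 1, so the concatenated output reproduces the value vector, which is a convenient sanity check.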
the English vocabulary encoder 102: splits English words into finer-grained English sub-words to obtain the English sub-word sequence of each English word or phrase, re-encodes the sub-word sequence, and inputs the re-encoded sequence into an encoding module based on the multi-head self-attention mechanism to obtain the final representation sequence of the English vocabulary;
in some embodiments, the specific method for re-encoding the English words or phrases includes: performing word-embedding mapping on the English words or phrases to obtain a 256-dimensional vector sequence representing them, and adding positional encoding information and timing information to the vector sequence;
in some embodiments, the specific method for re-encoding the English words or phrases and inputting them into the encoding module based on the multi-head self-attention mechanism to obtain the final representation sequence of the English vocabulary includes:
re-encoding the English words or phrases and inputting them into the multi-head self-attention module of the English vocabulary encoder based on the multi-head self-attention mechanism, which converts them into a high-dimensional English vocabulary feature sequence; in order to convert this high-dimensional feature sequence into a single-vector representation, which simplifies the subsequent decoder's attention computation over it, the sequence is input into a long short-term memory neural network to obtain the final representation sequence of the English vocabulary;
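The idea of summarizing a variable-length sub-word feature sequence into a single vector via the final state of an LSTM can be sketched with a toy single-unit cell. The scalar weights and the shared gate parameters below are illustrative stand-ins, not the patent's trained network:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_summary(seq, w=0.5, u=0.1):
    """Toy single-unit LSTM over a sequence of scalars; the final hidden
    state h serves as the single-vector summary of the sub-word sequence.
    For brevity all gates share the same scalar parameters (w, u), which
    stand in for the learned weight matrices."""
    h = c = 0.0
    for x in seq:
        i = sigmoid(w * x + u * h)    # input gate
        f = sigmoid(w * x + u * h)    # forget gate
        o = sigmoid(w * x + u * h)    # output gate
        g = math.tanh(w * x + u * h)  # candidate cell state
        c = f * c + i * g
        h = o * math.tanh(c)
    return h

# the whole sub-word sequence collapses to one number (one vector, in general)
print(lstm_summary([1.0, 2.0, 3.0]))
```

In the real system each customized English phrase would be summarized this way, giving the decoder one query-table entry per phrase.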
in some embodiments, the multi-head self-attention module of the English vocabulary encoder based on the multi-head self-attention mechanism is formed by stacking 6 structurally identical modules, with residual connections between them; each module comprises two sub-parts: the first sub-part is a multi-head self-attention layer, and the second sub-part is a fully-connected mapping layer. The number of heads is set to 4, the dimension of the fully-connected layer is 1024, and the activation function is GLU; a layer normalization operation is applied to each sub-part, a residual connection joins the two sub-parts, and a dropout operation with parameter 0.1 is applied to both the self-attention layer and the fully-connected layer;
the multi-head attention mechanism extends the traditional attention mechanism to multiple heads, so that each head plays a different role when attending to the encoder output; specifically, the h heads compute their attentions independently, and their outputs are then concatenated and passed through another linear projection;
the decoder 103: re-encodes the labeled target text to obtain the target-text-encoded sequence; inputs the high-dimensional acoustic feature sequence, the final English vocabulary representation sequence, and the target-text-encoded sequence into the multi-head attention module of the multi-head-attention-based decoder to obtain the context-vector sequence of the acoustic features and the context-vector sequence of the English vocabulary; concatenates and fuses the two context-vector sequences; and inputs the result into a fully-connected layer to obtain the final decoded representation;
in some embodiments, the specific method for re-encoding the labeled target text to obtain the sequence represented by the target text code includes:
performing word-embedding mapping on the labeled target text to obtain a 256-dimensional target word vector sequence, and adding positional encoding information and timing information to the target word vector sequence to obtain the target-text-encoded sequence;
in some embodiments, the specific method for obtaining the context-vector sequence of the acoustic features and the context-vector sequence of the English vocabulary, by inputting the high-dimensional acoustic feature sequence, the final English vocabulary representation sequence, and the target-text-encoded sequence into the multi-head attention module of the multi-head-attention-based decoder, includes:
as shown in fig. 2, inputting the sequence represented by the target text code into the multi-head self-attention module of the target text to obtain a high-dimensional representation of the target sequence;
the high-dimensional representation of the target sequence serves as the query vector for both the multi-head attention module over the acoustic features and the multi-head attention module over the English vocabulary; the high-dimensional acoustic feature representation serves as the keys and values of the acoustic attention module, and the final English vocabulary representation serves as the keys and values of the vocabulary attention module. The query vector is compared against the keys element by element using cosine distance, and the attention score of each key in the two attention modules is obtained from this distance; the value sequences in the two attention modules are then averaged, weighted by the attention scores of their keys, to obtain the context-vector sequence of the acoustic features and the context-vector sequence of the English vocabulary;
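A minimal sketch of this decoding step, cosine-scored attention over the acoustic memory and the English-vocabulary memory with a shared query, followed by concatenation of the two context vectors, might look like this (the fully-connected fusion layer and all learned projections are omitted):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 for a zero vector)."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb) if na and nb else 0.0

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def context(query, keys, values):
    """Cosine-scored attention: score each key against the query,
    normalize the scores, then take the weighted average of the values."""
    weights = softmax([cosine(query, k) for k in keys])
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

def fuse(query, acoustic_kv, vocab_kv):
    """One decoder step: attend over the acoustic memory and the English
    vocabulary memory with the same query, then concatenate the two
    context vectors (the fully-connected fusion layer is omitted)."""
    ak, av = acoustic_kv
    vk, vv = vocab_kv
    return context(query, ak, av) + context(query, vk, vv)
```

A high cosine score against a customized-phrase key thus pulls that phrase's value into the fused context, which is how the query list can bias recognition toward the customized words.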
in some embodiments, the multi-head self-attention module of the target text is formed by stacking 6 structurally identical modules, with residual connections between them; each module comprises two sub-parts: the first sub-part is a multi-head self-attention layer, and the second sub-part is a fully-connected mapping layer. The number of heads is set to 4, the dimension of the fully-connected layer is 1024, and the activation function is GLU; a layer normalization operation is applied to each sub-part, a residual connection joins the two sub-parts, and a dropout operation with parameter 0.1 is applied to both the self-attention layer and the fully-connected layer;
in some embodiments, the system 100 further comprises: a softmax function 104;
the final decoded representation is input to the softmax function 104 to obtain the target with the maximum probability.
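The final softmax-and-argmax step can be sketched as follows; the logits and the vocabulary entries are invented purely for illustration:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def predict(logits, vocab):
    """Map the decoder's output-layer logits through softmax and return
    the highest-probability token together with its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return vocab[best], probs[best]

# hypothetical logits over a tiny mixed Chinese-English vocabulary
token, prob = predict([0.1, 2.0, -1.0], ["ni", "hello", "hao"])
print(token, prob)
```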
Example 3:
the invention discloses an electronic device comprising a memory and a processor. The memory stores a computer program, and when the processor executes the computer program, it implements the steps of the customizable end-to-end method for Chinese-English mixed speech recognition according to any one of the first aspect of the disclosure.
Fig. 3 is a block diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 3, the electronic device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. The processor of the electronic device is configured to provide computing and control capabilities. The memory of the electronic device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for their operation. The communication interface of the electronic device is used for wired or wireless communication with an external terminal; the wireless communication can be realized through WIFI, a carrier network, Near Field Communication (NFC), or other technologies. The display screen of the electronic device can be a liquid crystal display or an electronic ink display, and the input device can be a touch layer covering the display screen, a key, trackball, or touchpad arranged on the housing of the electronic device, or an external keyboard, touchpad, or mouse.
It will be understood by those skilled in the art that the structure shown in fig. 3 is only a partial block diagram related to the technical solution of the present disclosure and does not limit the electronic device to which the solution of the present application is applied; a specific electronic device may include more or fewer components than those shown in the drawings, combine some components, or have a different arrangement of components.
Example 4:
the invention discloses a computer-readable storage medium. The computer-readable storage medium stores a computer program which, when executed by a processor, performs the steps of the customizable end-to-end method for Chinese-English hybrid speech recognition according to any one of the first aspect of the present disclosure.
It should be noted that the technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations are described; however, any combination of these technical features that contains no contradiction should be considered within the scope of the present description. The above examples express only several embodiments of the present application, and while their description is relatively specific and detailed, they are not to be construed as limiting the scope of the invention. A person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these fall within its scope of protection. Therefore, the protection scope of the present patent shall be subject to the appended claims.
Embodiments of the subject matter and the functional operations described in this specification can be implemented in: digital electronic circuitry, tangibly embodied computer software or firmware, computer hardware including the structures disclosed in this specification and their structural equivalents, or a combination of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions, encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode and transmit information to suitable receiver apparatus for execution by the data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
Computers suitable for executing computer programs include, for example, general and/or special purpose microprocessors, or any other type of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory and/or a random access memory. The basic components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer does not necessarily have such a device. Moreover, a computer may be embedded in another device, e.g., a mobile telephone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., an internal hard disk or a removable disk), magneto-optical disks, and CD ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Further, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (8)

1. A customizable Chinese-English hybrid speech recognition end-to-end system, the system comprising:
the system comprises an acoustic encoder, an English vocabulary encoder and a decoder;
the acoustic encoder: extracting acoustic features of a voice waveform to obtain an acoustic feature sequence, performing convolution and recoding operations on the acoustic feature sequence to obtain a down-sampled and recoded feature sequence, inputting the down-sampled and recoded feature sequence into a multi-head self-attention module of an acoustic encoder based on a multi-head self-attention mechanism to obtain a sequence represented by high dimensions of the acoustic features;
the English vocabulary encoder: splitting English words into finer-grained English sub-words to obtain the English sub-word sequence of each English word or phrase, re-encoding the sub-word sequence, and inputting the re-encoded sequence into an encoding module based on the multi-head self-attention mechanism to obtain the final representation sequence of the English vocabulary;
the specific method for re-encoding the English words or phrases and inputting them into the encoding module based on the multi-head self-attention mechanism to obtain the final representation sequence of the English vocabulary comprises the following steps:
re-encoding the English words or phrases and inputting them into the multi-head self-attention module of the English vocabulary encoder based on the multi-head self-attention mechanism, converting them into a high-dimensional English vocabulary feature sequence, and inputting that sequence into a long short-term memory neural network to obtain the final representation sequence of the English vocabulary;
the decoder: re-encoding the labeled target text to obtain the target-text-encoded sequence; inputting the high-dimensional acoustic feature sequence, the final English vocabulary representation sequence, and the target-text-encoded sequence into the multi-head attention module of the multi-head-attention-based decoder to obtain the context-vector sequence of the acoustic features and the context-vector sequence of the English vocabulary; concatenating and fusing the two context-vector sequences; and inputting the result into a fully-connected layer to obtain the final decoded representation;
the specific method for inputting the high-dimensional acoustic feature sequence, the final English vocabulary representation sequence, and the target-text-encoded sequence into the multi-head attention module of the multi-head-attention-based decoder to obtain the context-vector sequence of the acoustic features and the context-vector sequence of the English vocabulary comprises the following steps:
inputting the target-text-encoded sequence into the multi-head self-attention module of the target text to obtain a high-dimensional representation of the target sequence;
using the high-dimensional representation of the target sequence as the query vector of both the multi-head attention module over the acoustic features and the multi-head attention module over the English vocabulary, the high-dimensional acoustic feature representation as the keys and values of the acoustic attention module, and the final English vocabulary representation as the keys and values of the vocabulary attention module; computing cosine distances element by element between the query vector and the keys, and obtaining from these distances the attention score of each key in the two attention modules; and performing a weighted average of the value sequences in the two attention modules using the attention scores of the keys to obtain the context-vector sequence of the acoustic features and the context-vector sequence of the English vocabulary.
2. The customizable end-to-end system for Chinese-English hybrid speech recognition according to claim 1, wherein said specific method for performing convolution and re-encoding operations on said acoustic feature sequence to obtain a down-sampled and re-encoded feature sequence comprises:
performing convolution operations on the acoustic feature sequence using a plurality of 2-dimensional convolution kernels, and controlling the down-sampling ratio by setting the stride of the convolution operations; connecting an activation function after each convolution operation for non-linear transformation; stacking multiple convolution layers, then using a fully-connected mapping layer to map the acoustic features into high-dimensional vectors, and adding positional encoding information, expressed with absolute positions, to the vector sequence, realizing the down-sampling and re-encoding of the acoustic feature sequence.
3. The customizable end-to-end system for Chinese-English hybrid speech recognition according to claim 1, wherein the multi-headed self-attention module of the multi-headed self-attention mechanism-based acoustic encoder is stacked by a plurality of modules with the same structure, and each module with the same structure is connected with each other by a residual error; each module with the same structure comprises two subsections, and the specific structure comprises: the first subsection is a multi-headed self-attention layer followed by a fully-connected mapping layer of the second subsection, each subsection being subjected to a layer normalization operation, with a residual connection between the two subsections.
4. The customizable end-to-end system for Chinese-English hybrid speech recognition according to claim 1, wherein the multi-head self-attention module of the English vocabulary encoder based on the multi-head self-attention mechanism is formed by stacking a plurality of modules with the same structure, and residual connection is performed between each module with the same structure; each module with the same structure comprises two subsections, and the specific structure comprises: the first subsection is a multi-headed self-attention layer followed by a fully-connected mapping layer of the second subsection, each subsection being subjected to a layer normalization operation, with a residual connection between the two subsections.
5. The customizable end-to-end system for Chinese-English hybrid speech recognition according to claim 1, wherein the multi-headed self-attention module of the target text is stacked by a plurality of modules with the same structure, and residual connection is performed between each module with the same structure; each module with the same structure comprises two subsections, and the specific structure comprises: the first subsection is a multi-headed self-attention layer followed by a fully-connected mapping layer of the second subsection, each subsection being subjected to a layer normalization operation, with a residual connection between the two subsections.
6. The customizable end-to-end system for Chinese-English hybrid speech recognition according to claim 1, wherein said specific method for re-encoding the labeled target text to obtain the sequence of encoded representations of the target text comprises:
performing word-embedding mapping on the labeled target text to obtain a high-dimensional target word vector sequence, and adding positional encoding information and timing information to the target word vector sequence to obtain the target-text-encoded sequence;
the specific method for recoding the English words or the English phrases comprises the following steps: and performing word embedding mapping on the English words or the English phrases to obtain a high-dimensional English word or English phrase vector sequence represented by the corresponding English words or English phrase vectors, and adding position coding information and time sequence information in the English word or English phrase vector sequence.
7. An electronic device, comprising a memory and a processor, the memory having stored thereon a computer program which, when executed by the processor, performs the method in a customizable chinese-english hybrid speech recognition end-to-end system according to any one of claims 1 to 6.
8. A storage medium storing a computer program executable by one or more processors and operable to implement a method in a customizable chinese-english hybrid speech recognition end-to-end system according to any one of claims 1 to 6.
CN202111548173.3A 2021-12-17 2021-12-17 Customizable end-to-end system for Chinese-English mixed speech recognition Active CN113936641B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111548173.3A CN113936641B (en) 2021-12-17 2021-12-17 Customizable end-to-end system for Chinese-English mixed speech recognition


Publications (2)

Publication Number Publication Date
CN113936641A CN113936641A (en) 2022-01-14
CN113936641B true CN113936641B (en) 2022-03-25

Family

ID=79289296

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111548173.3A Active CN113936641B (en) 2021-12-17 2021-12-17 Customizable end-to-end system for Chinese-English mixed speech recognition

Country Status (1)

Country Link
CN (1) CN113936641B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117524193B (en) * 2024-01-08 2024-03-29 浙江同花顺智能科技有限公司 Training method, device, equipment and medium for Chinese-English mixed speech recognition system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108417202A (en) * 2018-01-19 2018-08-17 苏州思必驰信息科技有限公司 Audio recognition method and system
CN112037798A (en) * 2020-09-18 2020-12-04 中科极限元(杭州)智能科技股份有限公司 Voice recognition method and system based on trigger type non-autoregressive model
CN112509564A (en) * 2020-10-15 2021-03-16 江苏南大电子信息技术股份有限公司 End-to-end voice recognition method based on connection time sequence classification and self-attention mechanism
CN113270086A (en) * 2021-07-19 2021-08-17 中国科学院自动化研究所 Voice recognition text enhancement system fusing multi-mode semantic invariance
CN113284485A (en) * 2021-07-09 2021-08-20 中国科学院自动化研究所 End-to-end framework for unified Chinese and English mixed text generation and speech recognition

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10971142B2 (en) * 2017-10-27 2021-04-06 Baidu Usa Llc Systems and methods for robust speech recognition using generative adversarial networks


Also Published As

Publication number Publication date
CN113936641A (en) 2022-01-14

Similar Documents

Publication Publication Date Title
EP3680894B1 (en) Real-time speech recognition method and apparatus based on truncated attention, device and computer-readable storage medium
US11776531B2 (en) Encoder-decoder models for sequence to sequence mapping
EP3583594B1 (en) End-to-end text-to-speech conversion
CN111198937B (en) Dialog generation device, dialog generation program, dialog generation apparatus, computer-readable storage medium, and electronic apparatus
WO2020140487A1 (en) Speech recognition method for human-machine interaction of smart apparatus, and system
CN106910497B (en) Chinese word pronunciation prediction method and device
CN110516253B (en) Chinese spoken language semantic understanding method and system
CN110288980A (en) Audio recognition method, the training method of model, device, equipment and storage medium
CN111862977A (en) Voice conversation processing method and system
CN111144110A (en) Pinyin marking method, device, server and storage medium
US11886813B2 (en) Efficient automatic punctuation with robust inference
CN114298053B (en) Event joint extraction system based on feature and attention mechanism fusion
CN110288972A (en) Speech synthesis model training method, phoneme synthesizing method and device
CN111401081A (en) Neural network machine translation method, model and model forming method
CN113327578B (en) Acoustic model training method and device, terminal equipment and storage medium
KR20220130565A (en) Keyword detection method and apparatus thereof
WO2023165111A1 (en) Method and system for identifying user intention trajectory in customer service hotline
CN114678032B (en) Training method, voice conversion method and device and electronic equipment
CN113936641B (en) Customizable end-to-end system for Chinese-English mixed speech recognition
Macoskey et al. Bifocal neural asr: Exploiting keyword spotting for inference optimization
CN115803806A (en) Systems and methods for training dual-mode machine-learned speech recognition models
CN116341651A (en) Entity recognition model training method and device, electronic equipment and storage medium
CN115394321A (en) Audio emotion recognition method, device, equipment, storage medium and product
CN107808664B (en) Sparse neural network-based voice recognition method, voice recognition device and electronic equipment
Picheny et al. Trends and advances in speech recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant