CN113470620A - Speech recognition method - Google Patents

Speech recognition method

Info

Publication number
CN113470620A
CN113470620A
Authority
CN
China
Prior art keywords
voice
layer
speech recognition
data preprocessing
recognition model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110761056.9A
Other languages
Chinese (zh)
Inventor
张玉腾
宁新
杜静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Dongting Intelligent Technology Co ltd
Original Assignee
Qingdao Dongting Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao Dongting Intelligent Technology Co ltd filed Critical Qingdao Dongting Intelligent Technology Co ltd
Priority to CN202110761056.9A priority Critical patent/CN113470620A/en
Publication of CN113470620A publication Critical patent/CN113470620A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G10L15/26 - Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a speech recognition method, which comprises the following steps: performing data preprocessing on a speech file, wherein the data preprocessing comprises speech data preprocessing and text data preprocessing; the speech data preprocessing obtains FBank feature data from the speech file, and the text data preprocessing obtains the text content of the speech file and extracts the characters appearing in the text content to create a dictionary; constructing a speech recognition model, wherein the speech recognition model segments the speech sequence based on the CTC algorithm and recognizes the segmented fragments based on an attention mechanism; training the speech recognition model based on the FBank feature data and the dictionary data; and recognizing the speech file with the trained speech recognition model and splicing the recognition results into a speech recognition result. In this way, streaming speech recognition results can be improved by taking the preceding context into account.

Description

Speech recognition method
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a speech recognition method.
Background
Due to the rapid development of deep learning technology, more and more end-to-end speech recognition methods based on deep learning have appeared in the field of speech recognition. Compared with traditional methods, an end-to-end speech recognition method simplifies the system architecture: it consists only of neural networks, takes audio data as input, and directly outputs graphemes of the target language. This avoids the need for language-specific experts when building such systems and lowers the implementation threshold.
Streaming speech recognition is the core technology of systems such as dialogue systems, simultaneous interpretation, and real-time subtitles. In recent years, the performance of end-to-end speech recognition systems has outperformed highly optimized hybrid systems. The streaming end-to-end speech recognition systems proposed so far fall mainly into two categories: CTC-based neural network models and RNN-T (Recurrent Neural Network Transducer) based neural network models. These models perform recognition frame by frame, so streaming speech recognition is easy to achieve. Streaming speech recognition is therefore in great demand in real life, but existing streaming approaches cannot effectively exploit context information during recognition.
Disclosure of Invention
To solve the problem that the prior art cannot effectively use context during recognition, the invention provides a speech recognition method that can effectively exploit context and achieves higher recognition efficiency and accuracy.
A speech recognition method according to an embodiment of the present invention includes:
performing data preprocessing on a speech file, wherein the data preprocessing comprises speech data preprocessing and text data preprocessing; the speech data preprocessing is used for obtaining FBank feature data from the speech file, and the text data preprocessing is used for obtaining the text content of the speech file and extracting the characters appearing in the text content to create a dictionary;
constructing a speech recognition model, wherein the speech recognition model segments the speech sequence based on the CTC algorithm, and recognizes the segmented fragments based on an attention mechanism;
training the speech recognition model based on the FBank feature data and the dictionary data;
and recognizing the speech file with the trained speech recognition model, and splicing the recognition results into a speech recognition result.
Further, the speech data preprocessing comprises:
converting the speech file into WAV format with a sampling rate of 8 kHz and a single channel, and extracting the FBank features of each audio file.
Further, the text data preprocessing comprises:
extracting the characters appearing in the audio files according to their text content, and creating a dictionary; assigning each character in the dictionary an index starting from 0, replacing each character in the original text with its corresponding index, and generating the text to be trained.
Furthermore, the speech recognition model comprises a down-sampling layer; the down-sampling layer takes the FBank feature data as input, sequentially performs two-dimensional convolution operations, each followed by a nonlinear transformation, then adds a position feature to the convolution output, and the sum of the convolution output and the position feature is taken as the output of the down-sampling layer.
Further, the speech recognition model further includes an encoding layer that encodes the output of the down-sampling layer using a plurality of Transformer Encoder network blocks based on the attention mechanism.
Furthermore, the speech recognition model further comprises a trigger layer; the trigger layer is composed of a CTC module and is used for identifying the time points at which graphemes are output in the sequence produced by the encoding layer, and for segmenting the encoding layer output into output blocks.
Further, the speech recognition model further comprises a decoding layer composed of a plurality of Transformer Decoder network blocks based on the attention mechanism; the output blocks of the encoding layer segmented by the trigger layer are sequentially fed into the decoding layer to obtain the output of the decoding layer.
Further, the decoding layer decodes based on a beam search algorithm.
Further, the CTC-based trigger layer generates segmentation events, and whether to activate the decoding layer is determined according to the softmax result of each segmentation event.
The invention has the following beneficial effects: after data preprocessing of a speech file, the attention-based speech recognition method is applied to streaming speech recognition; a CTC model cuts the input sequence into small fragments, the attention-based model recognizes the result of each fragment, and the fragment recognition results are finally spliced to obtain a complete result, so that the streaming speech recognition result can be improved by taking the preceding context into account, making the recognition result more accurate and reliable.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and that those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flow diagram of a method of speech recognition provided in accordance with an exemplary embodiment;
FIG. 2 is a block diagram of a speech recognition model provided in accordance with an exemplary embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be described in detail below. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the examples given herein without any inventive step, are within the scope of the present invention.
Referring to fig. 1, an embodiment of the present invention provides a speech recognition method, which specifically includes the following steps:
101. Performing data preprocessing on a speech file, wherein the data preprocessing comprises speech data preprocessing and text data preprocessing; the speech data preprocessing is used for obtaining FBank feature data from the speech file, and the text data preprocessing is used for obtaining the text content of the speech file and extracting the characters appearing in the text content to create a dictionary. Data set preparation thus consists mainly of two steps: speech data preprocessing and text data preprocessing.
102. Constructing a speech recognition model, wherein the speech recognition model segments the speech sequence based on the CTC algorithm, and recognizes the segmented fragments based on an attention mechanism.
103. Training the speech recognition model based on the FBank feature data and the dictionary data.
104. Recognizing the speech file with the trained speech recognition model, and splicing the recognition results into a speech recognition result. Streaming audio is decoded in a frame-synchronous manner.
The attention-based speech recognition method is applied to streaming speech recognition: the CTC model cuts the input sequence into small fragments, the attention-based model recognizes the result of each fragment, and the fragment recognition results are finally spliced to obtain a complete result, so that the streaming speech recognition result can be improved by taking the preceding context into account, making the recognition result more accurate and reliable.
As a possible implementation of the above embodiment, data set preparation first includes two steps, speech data preprocessing and text data preprocessing:
Speech data preprocessing: the speech file is converted to WAV format with a sampling rate of 8 kHz and a single channel. To extract the FBank features of each audio file, the speech signal is first pre-emphasized to boost the high-frequency components and flatten the signal spectrum; pre-emphasis is implemented with a first-order FIR high-pass filter, with the following formula:
y(n)=x(n)-ax(n-1),0.9<a<1.0
where a is the pre-emphasis coefficient. The speech signal is then framed, that is, cut into short time segments of fixed length within which it can be treated as a stationary signal; the frame length is usually set to 20-50 ms. The framed speech is then windowed to reduce spectral-leakage errors and make the time-domain signal better satisfy the periodicity requirement of the Fourier transform; a Hamming window is usually chosen as the window function:
w(n) = 0.54 - 0.46 cos(2πn/(N-1)), 0 ≤ n ≤ N-1
where N is the number of samples in a frame. A short-time Fourier transform is then applied to each windowed speech segment to obtain frequency-domain information, and the power spectrum is computed as:
P(k) = |X(k)|^2 / N
Finally, the FBank features are obtained by applying Mel filtering to the power spectrum and taking the logarithm of the result. The conversion between the frequency f and the Mel frequency m is:
m = 2595 log10(1 + f/700)
f = 700 (10^(m/2595) - 1)
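As an illustrative sketch of this preprocessing chain (pre-emphasis, framing, Hamming windowing, power spectrum, Mel filtering, and logarithm), the following NumPy code computes log-FBank features. The frame length, frame shift, FFT size, and number of Mel filters are assumed values rather than parameters stated in the patent.

```python
import numpy as np

def fbank_features(signal, sample_rate=8000, pre_emph=0.97,
                   frame_len_ms=25, frame_shift_ms=10, n_fft=512, n_mels=40):
    """Compute log Mel filterbank (FBank) features from a mono waveform."""
    # Pre-emphasis: y(n) = x(n) - a * x(n - 1)
    y = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])

    # Framing (20-50 ms frames; 25 ms frames with a 10 ms shift assumed here)
    flen = int(sample_rate * frame_len_ms / 1000)
    fshift = int(sample_rate * frame_shift_ms / 1000)
    y = np.pad(y, (0, max(0, flen - len(y))))
    n_frames = 1 + (len(y) - flen) // fshift
    frames = np.stack([y[i * fshift:i * fshift + flen] for i in range(n_frames)])

    # Hamming window: w(n) = 0.54 - 0.46 cos(2*pi*n / (N - 1))
    frames = frames * np.hamming(flen)

    # Short-time Fourier transform and power spectrum P(k) = |X(k)|^2 / N
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # Mel filterbank: m = 2595 log10(1 + f/700), f = 700 (10^(m/2595) - 1)
    high_mel = 2595 * np.log10(1 + (sample_rate / 2) / 700)
    hz = 700 * (10 ** (np.linspace(0, high_mel, n_mels + 2) / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz / sample_rate).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)

    # Log of the Mel-filtered power spectrum
    return np.log(np.maximum(power @ fbank.T, 1e-10))
```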
Text data preprocessing: the characters appearing in the data set are extracted from the text content corresponding to the audio, and a dictionary is created. Three special symbols are then added to the dictionary, <blank>, <eos/sos>, and <unk>, which respectively denote the blank in CTC, the text start/end symbols, and the unknown-character symbol. Each character in the dictionary is assigned an index starting from 0, each character in the original text is replaced with its corresponding index, and the text to be trained is generated. Note that the <blank> index is 0, the <unk> index is 1, the <eos/sos> index is 2, and the indices of the other characters can be set arbitrarily.
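A minimal sketch of this text preprocessing step follows; the function names are illustrative, and only the index convention (<blank> = 0, <unk> = 1, <eos/sos> = 2) is taken from the description above.

```python
def build_dictionary(transcripts):
    """Create the character dictionary with the three special symbols:
    <blank> = 0 (CTC blank), <unk> = 1 (unknown character), <eos/sos> = 2."""
    vocab = {"<blank>": 0, "<unk>": 1, "<eos/sos>": 2}
    for text in transcripts:
        for ch in text:
            if ch not in vocab:
                vocab[ch] = len(vocab)  # remaining indices assigned in order of appearance
    return vocab

def text_to_indices(text, vocab):
    """Replace each character of the original text with its dictionary index."""
    return [vocab.get(ch, vocab["<unk>"]) for ch in text]
```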
Referring to the structure diagram of the speech recognition model shown in FIG. 2, the down-sampling layer takes the processed FBank features of the speech as input and passes them through two two-dimensional convolution operations whose parameters are, in order: convolution kernel 3, stride 2; convolution kernel 5, stride 3. After each two-dimensional convolution operation, a nonlinear transformation is applied. A position feature is then added to the convolution output; this feature is generated by absolute positional encoding, defined as follows:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
where pos is the input position, PE(pos, 2i) is the position code for the even dimensions, PE(pos, 2i+1) is the position code for the odd dimensions, and d_model is the dimension of the position feature. The output of the two-dimensional convolution and the position feature are added to obtain the output of the down-sampling layer.
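The following PyTorch sketch illustrates such a down-sampling layer under stated assumptions: the model dimension, the ReLU nonlinearity, and the linear projection that flattens the convolution output per frame are choices made here, not details given in the patent; only the convolution parameters (kernel 3 / stride 2, kernel 5 / stride 3) and the sinusoidal position feature follow the description.

```python
import math
import torch
import torch.nn as nn

class DownsamplingLayer(nn.Module):
    """Two 2-D convolutions (kernel 3 / stride 2, then kernel 5 / stride 3),
    each followed by a nonlinearity, plus absolute sinusoidal position features."""
    def __init__(self, d_model=256, max_len=5000):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, d_model, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(d_model, d_model, kernel_size=5, stride=3), nn.ReLU(),
        )
        self.proj = nn.LazyLinear(d_model)        # flattens the conv output per frame
        # PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); PE(pos, 2i+1) = cos(...)
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                        * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, fbank):                     # fbank: (batch, time, n_mels)
        x = self.conv(fbank.unsqueeze(1))         # (batch, d_model, T', F')
        x = x.permute(0, 2, 1, 3).flatten(2)      # (batch, T', d_model * F')
        x = self.proj(x)                          # (batch, T', d_model)
        return x + self.pe[: x.size(1)]           # add the position feature
```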
Encoding layer: this layer is composed of N Transformer Encoder network blocks, where N is an integer greater than 2. The input of the first Transformer block is the output of the down-sampling layer, and the input of every other Transformer block is the output of the previous block. A Transformer block comprises, in order, a multi-head attention layer, a normalization layer, a residual connection, and a feed-forward layer; it can be constructed according to the existing structure and is not described further here.
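For the encoding layer, an off-the-shelf Transformer encoder can serve as a stand-in in a sketch; the values of N, the number of attention heads, and the feed-forward size below are assumptions, not values given in the patent.

```python
import torch.nn as nn

# N Transformer Encoder blocks (multi-head attention, layer norm, residual
# connections, feed-forward); N = 6 and nhead = 4 are assumed values.
encoder_layer = nn.TransformerEncoderLayer(d_model=256, nhead=4,
                                           dim_feedforward=1024, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

# enc_out = encoder(downsampled_features)   # (batch, T', d_model)
```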
Activation layer (trigger layer): this layer is composed of a CTC module. It identifies the time points at which graphemes are output in the sequence produced by the encoding layer and controls whether the decoding-layer network is activated. Feeding the encoding layer output into the activation layer yields an output sequence; owing to the characteristics of CTC, this sequence is obtained by selecting the maximum value at each time step. For each run of consecutive identical non-<blank> symbols in the sequence, only the first is kept and the remainder are replaced with <blank>; the encoding layer output is then segmented at the indices i of the non-<blank> symbols, producing H output blocks of the encoding layer. When segmenting, the cut position can be controlled by a parameter e and set to i - e, so that the model can see more history information.
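One possible reading of this segmentation rule is sketched below: trigger points are the first frames of non-blank runs in the greedy CTC path, and each cut is moved back by e frames (the `history` parameter) to expose extra left context. The variable names and the value of e are illustrative.

```python
def trigger_segments(enc_out, ctc_logits, blank_id=0, history=4):
    """Segment the encoding layer output at CTC trigger points.
    enc_out: (T, d_model) encoder frames; ctc_logits: (T, vocab) CTC scores.
    A trigger is the first frame of a run of identical non-blank symbols in
    the greedy CTC path; each cut is placed at i - e (`history` frames back)
    so that the decoder sees extra left context."""
    path = ctc_logits.argmax(dim=-1).tolist()     # maximum symbol at each time step
    triggers, prev = [], blank_id
    for t, sym in enumerate(path):
        if sym != blank_id and sym != prev:       # keep only the first of a run
            triggers.append(t)
        prev = sym
    cuts = [max(0, t - history) for t in triggers] + [enc_out.size(0)]
    blocks = [enc_out[cuts[h]:cuts[h + 1]] for h in range(len(triggers))]
    return triggers, blocks                       # H trigger points and H blocks
```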
Decoding layer: this layer is composed of M Transformer Decoder network blocks, where M is an integer greater than 2. The output blocks of the encoding layer segmented by the activation layer are sequentially fed into the decoding layer to obtain the output of the decoding layer. The results of all the blocks are spliced to obtain the recognition result.
Finally, the streaming audio is decoded in a frame-synchronous manner. The CTC-based activation layer generates segmentation events, and whether to activate the decoding layer is decided according to the softmax result of each event; in our experiments, events with a result greater than 0.6 activate the decoding layer. The encoding layer output between the two most recent events is sent to the decoding layer, and the decoding process uses the conventional beam search algorithm. During decoding, because the sequence fed to the decoding layer each time corresponds to one symbol of the CTC result, there is no misalignment problem, and the length-constraint penalty factor used in label-synchronous decoding is not required.
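The frame-synchronous decoding loop can be sketched as follows. Here `decode_block` stands in for the Transformer decoder with beam search, the 0.6 activation threshold is the value quoted above, and the whole utterance is processed in one pass for simplicity; this is a hedged sketch rather than the patent's implementation.

```python
import torch

ACTIVATION_THRESHOLD = 0.6   # events above this softmax value activate the decoder

def frame_synchronous_decode(fbank, downsample, encoder, ctc_head, decode_block):
    """Whenever the CTC softmax at a frame assigns more than the threshold to a
    non-blank symbol, the encoder output between the two most recent events is
    sent to the decoder; decode_block wraps the Transformer decoder plus beam
    search and returns the graphemes for one block."""
    enc_out = encoder(downsample(fbank))                  # (1, T', d_model)
    probs = torch.softmax(ctc_head(enc_out), dim=-1)      # (1, T', vocab)
    hypothesis, last_event = [], 0
    for t in range(enc_out.size(1)):
        top_prob, top_sym = probs[0, t].max(dim=-1)
        if top_sym.item() != 0 and top_prob.item() > ACTIVATION_THRESHOLD:
            segment = enc_out[:, last_event:t + 1]        # output between two events
            hypothesis.extend(decode_block(segment, hypothesis))
            last_event = t + 1
    return hypothesis
```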
In some embodiments of the present invention, for training of the speech recognition model, let S = (s_1, ..., s_T) denote a CTC frame sequence of length T, where s_t ∈ E ∪ {<blank>}, E denotes the set of distinct graphemes, and <blank> denotes the blank symbol. Let K = (k_1, ..., k_L), where k_l ∈ E, denote a grapheme sequence of length L, and assume that when repeated labels are collapsed into single instances and blank symbols are removed, the sequence S reduces to K. The CTC derivation is as follows:
p_ctc(K|H) = Σ_{S → K} p(S|K) p(S|H)
(the sum runs over all frame-level sequences S that reduce to K)
where p (S | K) represents transition probability and p (S | H) represents acoustic model.
With the alignment information provided by the activation layer, the probability modeled by the decoding layer is derived as follows:
p_att(K|H) = Π_{l=1..L} p(k_l | k_1, ..., k_(l-1), h_1, ..., h_(t_l))
(t_l denotes the trigger frame of the l-th grapheme and h_t the t-th frame of the encoding layer output)
The activation layer and the decoding layer are co-trained using a multi-objective loss function, with the following formula:
L_MTL = λ log p_ctc(K|H) + (1 - λ) log p_att(K|H)
where λ is a hyperparameter controlling the relative weights of p_ctc and p_att.
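A PyTorch sketch of this joint objective, written in its negative log-likelihood form as a loss to minimize, is shown below; λ = 0.3 and the target/padding conventions are assumptions, not values taken from the patent.

```python
import torch.nn as nn
import torch.nn.functional as F

ctc_criterion = nn.CTCLoss(blank=0, zero_infinity=True)

def multi_objective_loss(ctc_log_probs, enc_lens, ctc_targets, tgt_lens,
                         dec_logits, dec_targets, lam=0.3):
    """lam * L_ctc + (1 - lam) * L_att, the loss form of the objective above."""
    # CTC branch of the activation layer: ctc_log_probs has shape (T, batch, vocab)
    l_ctc = ctc_criterion(ctc_log_probs, ctc_targets, enc_lens, tgt_lens)
    # Attention branch of the decoding layer: dec_logits (batch, L, vocab)
    l_att = F.cross_entropy(dec_logits.transpose(1, 2), dec_targets)
    return lam * l_ctc + (1 - lam) * l_att
```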
The speech recognition method provided by the above embodiments of the present invention applies attention-based speech recognition to streaming speech recognition: the CTC model cuts the input sequence into small fragments, the attention-based model recognizes the result of each fragment, and the fragment recognition results are finally spliced to obtain a complete result, so that the streaming speech recognition result can be improved by taking the preceding context into account.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
What has been described above includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, to the extent that the term "includes" is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term "comprising" as "comprising" is interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification of the claims is intended to mean a "non-exclusive or".
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (9)

1. A speech recognition method, comprising:
performing data preprocessing on a speech file, wherein the data preprocessing comprises speech data preprocessing and text data preprocessing; the speech data preprocessing is used for obtaining FBank feature data from the speech file, and the text data preprocessing is used for obtaining the text content of the speech file and extracting the characters appearing in the text content to create a dictionary;
constructing a speech recognition model, wherein the speech recognition model segments the speech sequence based on the CTC algorithm, and recognizes the segmented fragments based on an attention mechanism;
training the speech recognition model based on the FBank feature data and the dictionary data;
and recognizing the speech file with the trained speech recognition model, and splicing the recognition results into a speech recognition result.
2. The speech recognition method of claim 1, wherein the speech data preprocessing comprises:
converting the speech file into WAV format with a sampling rate of 8 kHz and a single channel, and extracting the FBank features of each audio file.
3. The speech recognition method of claim 1, wherein the text data preprocessing comprises:
extracting the characters appearing in the audio files according to their text content, and creating a dictionary; assigning each character in the dictionary an index starting from 0, replacing each character in the original text with its corresponding index, and generating the text to be trained.
4. The speech recognition method according to claim 1, wherein the speech recognition model comprises a down-sampling layer; the down-sampling layer takes the FBank feature data as input, sequentially performs two-dimensional convolution operations, each followed by a nonlinear transformation, then adds a position feature to the convolution output, and adds the convolution output and the position feature to obtain the output of the down-sampling layer.
5. The speech recognition method of claim 4, wherein the speech recognition model further comprises an encoding layer that encodes the output of the down-sampling layer using a plurality of Transformer Encoder network blocks based on the attention mechanism.
6. The speech recognition method of claim 5, wherein the speech recognition model further comprises a trigger layer, the trigger layer being composed of a CTC module and used for identifying the time points at which graphemes are output in the sequence produced by the encoding layer and for segmenting the encoding layer output into output blocks.
7. The speech recognition method of claim 6, wherein the speech recognition model further comprises a decoding layer composed of a plurality of Transformer Decoder network blocks based on the attention mechanism, and the output blocks of the encoding layer segmented by the trigger layer are sequentially fed into the decoding layer to obtain the output of the decoding layer.
8. The speech recognition method of claim 7, wherein the decoding layer decodes based on a beam search algorithm.
9. The speech recognition method of claim 7, wherein the CTC-based trigger layer further generates segmentation events, and whether to activate the decoding layer is determined according to the softmax result of each segmentation event.
CN202110761056.9A 2021-07-06 2021-07-06 Speech recognition method Pending CN113470620A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110761056.9A CN113470620A (en) 2021-07-06 2021-07-06 Speech recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110761056.9A CN113470620A (en) 2021-07-06 2021-07-06 Speech recognition method

Publications (1)

Publication Number Publication Date
CN113470620A true CN113470620A (en) 2021-10-01

Family

ID=77878353

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110761056.9A Pending CN113470620A (en) 2021-07-06 2021-07-06 Speech recognition method

Country Status (1)

Country Link
CN (1) CN113470620A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190279614A1 (en) * 2018-03-09 2019-09-12 Microsoft Technology Licensing, Llc Advancing word-based speech recognition processing
CN110992941A (en) * 2019-10-22 2020-04-10 国网天津静海供电有限公司 Power grid dispatching voice recognition method and device based on spectrogram
CN111048082A (en) * 2019-12-12 2020-04-21 中国电子科技集团公司第二十八研究所 Improved end-to-end speech recognition method
CN111415667A (en) * 2020-03-25 2020-07-14 极限元(杭州)智能科技股份有限公司 Stream-type end-to-end speech recognition model training and decoding method
CN111429889A (en) * 2019-01-08 2020-07-17 百度在线网络技术(北京)有限公司 Method, apparatus, device and computer readable storage medium for real-time speech recognition based on truncated attention
CN112037798A (en) * 2020-09-18 2020-12-04 中科极限元(杭州)智能科技股份有限公司 Voice recognition method and system based on trigger type non-autoregressive model
CN112382278A (en) * 2020-11-18 2021-02-19 北京百度网讯科技有限公司 Streaming voice recognition result display method and device, electronic equipment and storage medium
CN112489637A (en) * 2020-11-03 2021-03-12 北京百度网讯科技有限公司 Speech recognition method and device

Similar Documents

Publication Publication Date Title
CN110827801B (en) Automatic voice recognition method and system based on artificial intelligence
CN111968679B (en) Emotion recognition method and device, electronic equipment and storage medium
CN111477221A (en) Speech recognition system using bidirectional time sequence convolution and self-attention mechanism network
CN111429889A (en) Method, apparatus, device and computer readable storage medium for real-time speech recognition based on truncated attention
CN111968629A (en) Chinese speech recognition method combining Transformer and CNN-DFSMN-CTC
CN110797002B (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN112217947B (en) Method, system, equipment and storage medium for transcribing text by customer service telephone voice
CN113674732B (en) Voice confidence detection method and device, electronic equipment and storage medium
CN114360557B (en) Voice tone conversion method, model training method, device, equipment and medium
CN116364055B (en) Speech generation method, device, equipment and medium based on pre-training language model
CN111028824A (en) Method and device for synthesizing Minnan
CN112489616A (en) Speech synthesis method
CN114783418B (en) End-to-end voice recognition method and system based on sparse self-attention mechanism
CN113436612A (en) Intention recognition method, device and equipment based on voice data and storage medium
CN113724718A (en) Target audio output method, device and system
CN113793599B (en) Training method of voice recognition model, voice recognition method and device
CN113782042B (en) Speech synthesis method, vocoder training method, device, equipment and medium
CN113470620A (en) Speech recognition method
CN114626424B (en) Data enhancement-based silent speech recognition method and device
CN115240645A (en) Stream type voice recognition method based on attention re-scoring
CN114783428A (en) Voice translation method, voice translation device, voice translation model training method, voice translation model training device, voice translation equipment and storage medium
CN114512121A (en) Speech synthesis method, model training method and device
CN115565547A (en) Abnormal heart sound identification method based on deep neural network
CN114550741A (en) Semantic recognition method and system
CN117095674B (en) Interactive control method and system for intelligent doors and windows

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination