CN111916064A - End-to-end neural network speech recognition model training method - Google Patents

End-to-end neural network speech recognition model training method

Info

Publication number
CN111916064A
CN111916064A
Authority
CN
China
Prior art keywords
recognition model
audio
speech recognition
model
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010794361.3A
Other languages
Chinese (zh)
Inventor
陈虞君
杨植麟
张宇韬
杜羽伦
陈欣梅
陈贤鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Ruikelun Intelligent Technology Co ltd
Original Assignee
Beijing Ruikelun Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Ruikelun Intelligent Technology Co ltd filed Critical Beijing Ruikelun Intelligent Technology Co ltd
Priority to CN202010794361.3A
Publication of CN111916064A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks

Abstract

The invention relates to the technical field of computer information processing, and in particular to a training method for an end-to-end neural network speech recognition model, comprising the following steps: step 1, collecting voice information and storing it as audio files; step 2, preliminarily screening the audio files so that their volumes are consistent; step 3, manually labeling the content of the audio files and generating data files; step 4, preprocessing the labeled data and computing the feature distribution; step 5, constructing an audio preprocessing module that applies speed change, noise addition and frequency-domain disturbance enhancement to the audio files; step 6, constructing a speech recognition model with an end-to-end deep learning model; step 7, optimizing the speech recognition model; and step 8, obtaining decoded text information from the input audio signal. The invention provides an end-to-end speech recognition model that aims to significantly improve recognition performance.

Description

End-to-end neural network speech recognition model training method
Technical Field
The invention relates to the technical field of computer information processing, in particular to a training method of an end-to-end neural network speech recognition model.
Background
Today, with the rapid development of information technology, more and more fields have begun to use machine learning to replace the complex, repetitive work in traditional industries that consumes manpower and material resources. For example, online shopping websites use voice assistants or conversational robots to handle common customer problems, and transportation departments use computer vision to recognize car license plates. Adopting machine learning can effectively reduce production costs while ensuring high accuracy.
In actual production, transcription of voice information is the basic work underlying many voice analysis products; the transcription function is the foundation for analyzing audio signals. Voice input methods, for example, are software built directly on a speech transcription model, and most telephone precision-marketing systems in industry also need to convert telephone signals into text for analysis.
Most conventional speech-to-text products are based on a two-stage recognition method. This approach splits the speech recognition task into two stages, an acoustic recognition task and a language-modeling task, with a separate model built and trained for each. In the first stage, the most likely words are produced from the speech alone; in the second stage, a separately trained language model corrects the text obtained from the speech signal. Two-stage models dominated speech recognition engines for a long time; the staged tasks can be performed using, for example, hidden Markov models and Gaussian mixture models. In a two-stage model each stage has its own task objective, but the optimization methods for the different objectives are not the same, and this mismatch means the overall effect of the model cannot be guaranteed to be optimal. With the rapid development of deep models in recent years, another class of end-to-end speech recognition models has begun to show its effectiveness. The main purpose of an end-to-end speech recognition model is to learn a direct non-linear transformation from the input speech signal to the output text. Supervision pairs of speech and text are fed into the model, so that speech and text can be put directly into one-to-one correspondence, which facilitates direct iteration on the data. In recent years end-to-end speech recognition models have performed excellently, and the models achieving the best recognition results on public data sets are all end-to-end recognition engines, so end-to-end speech recognition is a very promising field.
Disclosure of Invention
The invention aims to solve the problem of the poor performance of traditional speech-to-text models, and provides an end-to-end speech recognition model intended to significantly improve recognition performance.
In order to achieve this purpose, the invention provides the following technical scheme: a training method for an end-to-end neural network speech recognition model, comprising the following steps:
step 1, collecting voice information and storing it as audio files;
step 2, preliminarily screening the audio files so that their volumes are consistent;
step 3, manually labeling the content of the audio files and generating data files;
step 4, preprocessing the labeled data and computing the feature distribution;
step 5, constructing an audio preprocessing module that applies speed change, noise addition and frequency-domain disturbance enhancement to the audio files;
step 6, constructing a speech recognition model with an end-to-end deep learning model;
step 7, optimizing the speech recognition model;
and step 8, obtaining decoded text information from the input audio signal.
Preferably, step 4 further comprises randomly selecting labeled audio files and extracting audio spectral features, wherein the audio spectral features include, but are not limited to, representations such as Mel filter-bank spectrograms (Mel Bank Features) and the Short-Time Fourier Transform (STFT).
Preferably, the speech recognition model in step 6 is constructed with a Transformer model, and relative position coding is adopted to supplement the speech signal while it is processed.
Preferably, in step 6, an N-layer Transformer model is used as the encoder and an M-layer Transformer model is used as the decoder to construct the recognition engine of the speech recognition model.
Preferably, the speech recognition model in step 6 is divided into two parts, an Encoder and a Decoder, and the learning target is to maximize the probability of each text being obtained from the input audio signal.
The invention has the following beneficial effects: compared with traditional speech recognition engines, the model improves recognition efficiency and achieves better transcription results for audio signals; speech signals are converted into text efficiently, the cost of manual transcription can largely be eliminated, and information extraction and data mining on audio content are better served.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and that other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a flow chart of model training according to the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it is to be understood that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The first embodiment is as follows:
referring to fig. 1, a method for training an end-to-end neural network speech recognition model comprises the following steps:
Step 1, collecting voice information. The voice input can come from various sources, for example stored voice calls can be used as speech signal data, speech can be collected from open-source data websites, or speech signals and subtitle information from specific websites can be processed, so as to obtain enough speech files for training the speech recognition model. The audio files are then stored in wav or mp3 format.
Step 2, preliminary screening of the speech signals. After the speech signals are collected, because they vary greatly in quality and volume, VAD (voice activity detection) is first performed to determine the effective speech portions, and the effective recording segments of each call are extracted as the data set to be labeled.
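The patent does not name a specific VAD algorithm. As a rough illustration of this screening step, a simple frame-energy threshold can stand in for VAD; the 16 kHz sample rate, frame length and threshold below are illustrative assumptions, not values from the patent.

```python
import numpy as np

def energy_vad(wave, sr=16000, frame_ms=30, threshold_db=-35.0):
    """Toy energy-based VAD: return (start, end) sample ranges of frames
    whose RMS level exceeds a fixed threshold relative to the signal peak."""
    frame = int(sr * frame_ms / 1000)
    n = len(wave) // frame
    frames = wave[:n * frame].reshape(n, frame)
    rms = np.sqrt((frames ** 2).mean(axis=1) + 1e-12)
    level_db = 20 * np.log10(rms / (np.abs(wave).max() + 1e-12))
    return [(i * frame, (i + 1) * frame) for i in range(n) if level_db[i] > threshold_db]
```

The segments returned in this way correspond to the "effective recording segments" that are passed on to the labeling step.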
Step 3, manual labeling. After the preliminary screening of the speech signals is finished, the audio is collected and labeled manually. Before labeling, in order to increase labeling speed, a pre-recognition of the given audio is performed with an existing model: a basic model first recognizes the data to be labeled, so that the annotators only need to correct the recognition results.
After the manual labeling is finished, the labeled data are collected and arranged into a text file, recording each labeled audio file together with its labeling result.
Step 4, data preprocessing of the labeled data, that is, pre-extraction of the speech signals, aimed at quickly collecting the feature distribution of all the data. A randomly selected portion of the labeled speech files is processed and their audio spectral features are extracted. Audio spectral features include, but are not limited to, representations such as Mel filter-bank spectrograms (Mel Bank Features) and the Short-Time Fourier Transform (STFT).
After the audio spectrograms are obtained, the distribution of the corresponding spectra of all the audio files is acquired statistically.
For example, suppose the features of the whole audio spectral signal follow a Gaussian distribution:

N(x; μ, Σ) = (2π)^(-d/2) |Σ|^(-1/2) exp(-(1/2) (x - μ)^T Σ^(-1) (x - μ))

where μ is the mean of these features and Σ is the covariance matrix of all feature vectors. After the mean and covariance matrix of the features are obtained, the signal distribution of all the audio can be estimated. In addition, all the text used for labeling is collected, and by reading the labeled text all characters are gathered as the character library of the model.
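A minimal sketch of this preprocessing step is given below; the use of librosa, the 16 kHz sample rate and the 80 mel bands are illustrative assumptions rather than requirements of the patent.

```python
import numpy as np
import librosa  # assumed feature-extraction library; any STFT/mel implementation works

def spectral_features(path, n_mels=80):
    """Log mel filter-bank features of one labeled audio file (frames x n_mels)."""
    wave, sr = librosa.load(path, sr=16000)
    mel = librosa.feature.melspectrogram(y=wave, sr=sr, n_fft=400,
                                         hop_length=160, n_mels=n_mels)
    return np.log(mel + 1e-6).T

def estimate_distribution(paths):
    """Mean vector mu and covariance matrix Sigma of the pooled frame features,
    modelling the Gaussian feature distribution described above."""
    frames = np.concatenate([spectral_features(p) for p in paths], axis=0)
    return frames.mean(axis=0), np.cov(frames, rowvar=False)
```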
Step 5, constructing an audio preprocessing module. When an audio signal is input into the recognition model, it is first processed; the processing includes traditional data augmentation methods applied to the audio, such as speed change and noise addition, as well as disturbance enhancement of the frequency-domain signal. For example, a frequency-domain stretching method is used to stretch and transform the frequency-domain signal, with the formula:

X′ = T X

where X represents the original input audio frequency-domain signal, T represents an affine transformation, and X′ represents the frequency-domain signal after the affine transformation. Through this affine transformation, a tiny disturbance is added to the audio's frequency-domain information, achieving the purpose of audio data augmentation.
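A hedged sketch of such an augmentation module follows; the speed-change range, noise level and the choice of T as an identity matrix plus a small random perturbation are illustrative assumptions (the patent only requires that T be an affine transformation of the frequency-domain signal).

```python
import numpy as np
import librosa  # assumed library for time stretching and the STFT

def augment(wave, sr=16000, rng=None):
    """Speed change, additive noise, and a small affine perturbation X' = T X
    of the frequency-domain signal."""
    rng = rng or np.random.default_rng()
    wave = librosa.effects.time_stretch(wave, rate=rng.uniform(0.9, 1.1))  # speed change
    wave = wave + rng.normal(0.0, 0.005, size=wave.shape)                  # additive noise
    spec = librosa.stft(wave, n_fft=400, hop_length=160)                   # X (frequency domain)
    n_bins = spec.shape[0]
    T = np.eye(n_bins) + 0.01 * rng.normal(size=(n_bins, n_bins))          # affine transform T
    return librosa.istft(T @ spec, hop_length=160)                         # X' back to waveform
```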
Step 6, constructing the speech recognition model with an end-to-end deep learning model; a Transformer model is used for the construction. In order to capture the speech information accurately, relative position coding is used to supplement the speech signal while it is processed. The end-to-end recognition engine is built from an N-layer Transformer model used as the encoder and an M-layer Transformer model used as the decoder.
Each Transformer layer contains two structures (sub-layers): a multi-head attention layer (multi-head attention) and a fully connected layer (feed-forward network).
The multi-head attention layer is composed of several attention networks. For the i-th attention layer, the input consists of the three components Q, K and V, which carry the audio information of the speech signal together with the relative position coding and are computed as:

Qi = h(n-1) WQ,  Ki = h(n-1) WK,  Vi = h(n-1) WV

where h(n-1) is the hidden representation of the previous layer. Given the three inputs Qi, Ki and Vi, the attention mechanism is computed as:

Attention(Qi, Ki, Vi) = softmax(Qi·Kj + Qi·W·R(i-j) + u·Kj + v·W·R(i-j)) · Vi

where Qi, Ki and Vi are the hidden representation vectors of the i-th Transformer layer, R(i-j) is the relative position coding, and u and v are learned bias vectors. The multi-head attention mechanism uses several attention mechanisms at the same time to gather and fuse the information of the input signals Q, K, V, computed as:

MultiHead(Q, K, V) = concat(Att1, Att2, Att3, ..., Attn)

where Atti = Attentioni(Q, K, V) and concat denotes splicing (concatenation) of the representation vectors produced by each attention head.
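The attention score above can be sketched per head roughly as follows; this is a toy illustration of the four-term formula (content, position, global-content and global-position terms), with tensor shapes and the pre-projected relative encodings chosen for clarity rather than taken from the patent.

```python
import torch

def rel_attention(q, k, v, rel, u, bias_v):
    """One attention head with relative position coding.
    q, k, v   : (T, d)    query / key / value vectors
    rel       : (T, T, d) pre-projected relative encodings W R(i-j)
    u, bias_v : (d,)      learned global bias vectors"""
    content    = q @ k.t()                                # Qi . Kj
    position   = torch.einsum('id,ijd->ij', q, rel)       # Qi . W R(i-j)
    g_content  = (k @ u).unsqueeze(0)                     # u . Kj (same for every query)
    g_position = torch.einsum('d,ijd->ij', bias_v, rel)   # v . W R(i-j)
    weights = torch.softmax(content + position + g_content + g_position, dim=-1)
    return weights @ v                                    # weighted sum of the values
```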
The decoder of the Transformer model has essentially the same structure as the encoder, but each decoder layer comprises three sub-networks: two multi-head attention mechanisms and a fully connected network layer. The two multi-head attention mechanisms receive, respectively, the decoder input signal and the representation vectors of the result produced by the encoder, from which the decoder output is obtained.
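A minimal sketch of this N-layer encoder / M-layer decoder arrangement is shown below, using PyTorch's stock nn.Transformer as a stand-in; note that the stock module uses ordinary attention rather than the relative position coding described above, and every hyperparameter value here is an illustrative assumption.

```python
import torch
import torch.nn as nn

class SpeechTransformer(nn.Module):
    """Encoder-decoder speech recognition sketch: spectrogram frames in, character logits out."""
    def __init__(self, n_mels=80, vocab=4000, d_model=256, n_enc=12, n_dec=6, n_heads=4):
        super().__init__()
        self.front = nn.Linear(n_mels, d_model)      # project spectrogram frames
        self.embed = nn.Embedding(vocab, d_model)    # embed text tokens
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=n_heads,
            num_encoder_layers=n_enc, num_decoder_layers=n_dec,
            batch_first=True)
        self.out = nn.Linear(d_model, vocab)         # per-step character logits

    def forward(self, feats, tokens):
        # feats: (B, T_audio, n_mels), tokens: (B, T_text)
        tgt_mask = self.transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.transformer(self.front(feats), self.embed(tokens), tgt_mask=tgt_mask)
        return self.out(h)                           # (B, T_text, vocab)
```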
Step 7, optimizing the speech recognition model. Because the model is modeled and trained as a sequence-to-sequence model, the speech recognition model is divided into an Encoder part and a Decoder part, and the learning goal is to maximize the probability of each text given the input audio signal, i.e.:

max P(Y | X′; Θ)

where X′ is the affine-transformed audio signal mentioned above, Θ denotes the model parameters, and Y is the text information. In order to learn the model effectively, the method uses the Teacher Forcing technique, which helps the model quickly learn to produce the target text.
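A minimal teacher-forcing training step consistent with this objective might look like the following; the padding index, the optimizer and the SpeechTransformer sketch above are assumptions for illustration.

```python
import torch.nn.functional as F

def train_step(model, feats, tokens, optimizer, pad_id=0):
    """One teacher-forcing update maximizing P(Y | X'; Theta): the decoder is fed
    the gold tokens <bos> y1 ... and trained to predict y1 ... <eos>."""
    logits = model(feats, tokens[:, :-1])                       # (B, T-1, vocab)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tokens[:, 1:].reshape(-1),
                           ignore_index=pad_id)                 # -log P(y_t | y_<t, X')
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```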
Step 8, obtaining the text result through search.
Through the training of the speech recognition model, decoded text information can be obtained from the input audio signal. However, since the model has an encoder-decoder structure, the text sequence is determined using beam search. Here the beam width of the search is set to M, and the specific procedure is as follows:
1. The input signal X and the start symbol <bos> are fed together into the encoder-decoder model to obtain the first character y1; the M words with the highest probability are each placed into a beam in order.
2. In each beam, the next word is decoded from the words already produced and the input signal X. Specifically, assuming the model has processed k words, in each beam it predicts the (k+1)-th word y(k+1) from y1, y2, ..., yk and the input signal X.
3. Prediction in a beam ends when, based on the first k words and the input signal X, the beam predicts the end symbol <eos>.
Since model training and prediction can be carried out in batch mode, in a batch-wise beam search the prediction for the batch ends only when every result in the batch has produced <eos>.
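A single-utterance sketch of this beam search procedure (steps 1 to 3 above) is given below; the beam width, maximum length and token IDs are illustrative, and the batch-wise variant would simply keep decoding until every item in the batch has emitted <eos>.

```python
import torch

def beam_search(model, feats, bos_id, eos_id, beam=4, max_len=100):
    """Decode one utterance; feats has shape (1, T_audio, n_mels)."""
    hyps = [([bos_id], 0.0)]                                   # (token list, log-prob)
    with torch.no_grad():
        for _ in range(max_len):
            candidates = []
            for seq, score in hyps:
                if seq[-1] == eos_id:                          # finished hypothesis
                    candidates.append((seq, score))
                    continue
                logits = model(feats, torch.tensor([seq]))     # (1, len(seq), vocab)
                logp = torch.log_softmax(logits[0, -1], dim=-1)
                top = torch.topk(logp, beam)
                for p, idx in zip(top.values.tolist(), top.indices.tolist()):
                    candidates.append((seq + [idx], score + p))
            hyps = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam]
            if all(seq[-1] == eos_id for seq, _ in hyps):      # every beam produced <eos>
                break
    return hyps[0][0]
```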
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (5)

1. A method for training an end-to-end neural network speech recognition model, characterized by comprising the following steps:
step 1, collecting voice information and storing it as audio files;
step 2, preliminarily screening the audio files so that their volumes are consistent;
step 3, manually labeling the content of the audio files and generating data files;
step 4, preprocessing the labeled data and computing the feature distribution;
step 5, constructing an audio preprocessing module that applies speed change, noise addition and frequency-domain disturbance enhancement to the audio files;
step 6, constructing a speech recognition model with an end-to-end deep learning model;
step 7, optimizing the speech recognition model;
and step 8, obtaining decoded text information from the input audio signal.
2. The method for training the end-to-end neural network speech recognition model according to claim 1, wherein: in step 4, labeled audio files are randomly selected and their audio spectral features are extracted, the audio spectral features including, but not limited to, representations such as Mel filter-bank spectrograms (Mel Bank Features) and the Short-Time Fourier Transform (STFT).
3. The method for training the end-to-end neural network speech recognition model according to claim 1, wherein: the speech recognition model in step 6 is constructed with a Transformer model, and relative position coding is adopted to supplement the speech signal while it is processed.
4. The method for training the end-to-end neural network speech recognition model according to claim 3, wherein: in step 6, an N-layer Transformer model is used as the encoder and an M-layer Transformer model as the decoder to construct the recognition engine of the speech recognition model.
5. The method for training the end-to-end neural network speech recognition model according to claim 1, wherein: the speech recognition model in step 6 is divided into two parts, an Encoder and a Decoder, and the learning goal is to maximize the probability of obtaining each text from the input audio signal.
CN202010794361.3A 2020-08-10 2020-08-10 End-to-end neural network speech recognition model training method Pending CN111916064A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010794361.3A CN111916064A (en) 2020-08-10 2020-08-10 End-to-end neural network speech recognition model training method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010794361.3A CN111916064A (en) 2020-08-10 2020-08-10 End-to-end neural network speech recognition model training method

Publications (1)

Publication Number Publication Date
CN111916064A true CN111916064A (en) 2020-11-10

Family

ID=73284844

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010794361.3A Pending CN111916064A (en) 2020-08-10 2020-08-10 End-to-end neural network speech recognition model training method

Country Status (1)

Country Link
CN (1) CN111916064A (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190130896A1 (en) * 2017-10-26 2019-05-02 Salesforce.Com, Inc. Regularization Techniques for End-To-End Speech Recognition
CN108053836A (en) * 2018-01-18 2018-05-18 成都嗨翻屋文化传播有限公司 A kind of audio automation mask method based on deep learning
CN109065032A (en) * 2018-07-16 2018-12-21 杭州电子科技大学 A kind of external corpus audio recognition method based on depth convolutional neural networks
KR20200092511A (en) * 2019-01-15 2020-08-04 한양대학교 산학협력단 Deep neural network based non-autoregressive speech synthesizer method and system
CN109859760A (en) * 2019-02-19 2019-06-07 成都富王科技有限公司 Phone robot voice recognition result bearing calibration based on deep learning
CN110428818A (en) * 2019-08-09 2019-11-08 中国科学院自动化研究所 The multilingual speech recognition modeling of low-resource, audio recognition method
CN110459205A (en) * 2019-09-24 2019-11-15 京东数字科技控股有限公司 Audio recognition method and device, computer can storage mediums
CN110751945A (en) * 2019-10-17 2020-02-04 成都三零凯天通信实业有限公司 End-to-end voice recognition method
CN111061861A (en) * 2019-12-12 2020-04-24 西安艾尔洛曼数字科技有限公司 XLNET-based automatic text abstract generation method
CN111145718A (en) * 2019-12-30 2020-05-12 中国科学院声学研究所 Chinese mandarin character-voice conversion method based on self-attention mechanism
CN111415667A (en) * 2020-03-25 2020-07-14 极限元(杭州)智能科技股份有限公司 Stream-type end-to-end speech recognition model training and decoding method

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113345410A (en) * 2021-05-11 2021-09-03 科大讯飞股份有限公司 Training method of general speech and target speech synthesis model and related device
CN113257239A (en) * 2021-06-15 2021-08-13 深圳市北科瑞声科技股份有限公司 Voice recognition method and device, electronic equipment and storage medium
CN113257239B (en) * 2021-06-15 2021-10-08 深圳市北科瑞声科技股份有限公司 Voice recognition method and device, electronic equipment and storage medium
CN113793604A (en) * 2021-09-14 2021-12-14 思必驰科技股份有限公司 Speech recognition system optimization method and device
CN113793604B (en) * 2021-09-14 2024-01-05 思必驰科技股份有限公司 Speech recognition system optimization method and device
CN115862596A (en) * 2023-03-03 2023-03-28 山东山大鸥玛软件股份有限公司 Deep learning-based spoken English speech recognition method

Similar Documents

Publication Publication Date Title
CN101136199B (en) Voice data processing method and equipment
US8321218B2 (en) Searching in audio speech
CN111916064A (en) End-to-end neural network speech recognition model training method
US8255215B2 (en) Method and apparatus for locating speech keyword and speech recognition system
Tripathi et al. Environment sound classification using an attention-based residual neural network
CN111798840A (en) Voice keyword recognition method and device
Ram et al. Sparse subspace modeling for query by example spoken term detection
Pham et al. Hybrid data augmentation and deep attention-based dilated convolutional-recurrent neural networks for speech emotion recognition
CN114783418B (en) End-to-end voice recognition method and system based on sparse self-attention mechanism
Shekofteh et al. Feature extraction based on speech attractors in the reconstructed phase space for automatic speech recognition systems
Lounnas et al. CLIASR: a combined automatic speech recognition and language identification system
Almekhlafi et al. A classification benchmark for Arabic alphabet phonemes with diacritics in deep neural networks
Zhang et al. Dropdim: A regularization method for transformer networks
Zhou et al. Extracting unit embeddings using sequence-to-sequence acoustic models for unit selection speech synthesis
CN116010874A (en) Emotion recognition method based on deep learning multi-mode deep scale emotion feature fusion
Tailor et al. Deep learning approach for spoken digit recognition in Gujarati language
Saputri et al. Identifying Indonesian local languages on spontaneous speech data
CN114898779A (en) Multi-mode fused speech emotion recognition method and system
Tasnia et al. An overview of bengali speech recognition: Methods, challenges, and future direction
Li Robotic emotion recognition using two-level features fusion in audio signals of speech
CN113257240A (en) End-to-end voice recognition method based on countermeasure training
CN113763939B (en) Mixed voice recognition system and method based on end-to-end model
Zhang et al. A Non-Autoregressivee Network for Chinese Text to Speech and Voice Cloning
Huang et al. Latent discriminative representation learning for speaker recognition
Tao et al. The NLPR Speech Synthesis entry for Blizzard Challenge 2017

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination