CN115862596A - Deep learning-based spoken English speech recognition method - Google Patents

Deep learning-based spoken English speech recognition method

Info

Publication number
CN115862596A
Authority
CN
China
Prior art keywords
model
english
audio
training
deep learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310194346.9A
Other languages
Chinese (zh)
Inventor
马磊
陈义学
夏彬彬
侯庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHANDONG SHANDA OUMA SOFTWARE CO Ltd
Original Assignee
SHANDONG SHANDA OUMA SOFTWARE CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHANDONG SHANDA OUMA SOFTWARE CO Ltd filed Critical SHANDONG SHANDA OUMA SOFTWARE CO Ltd
Priority to CN202310194346.9A priority Critical patent/CN115862596A/en
Publication of CN115862596A publication Critical patent/CN115862596A/en
Pending legal-status Critical Current

Landscapes

  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention provides a deep-learning-based spoken English speech recognition method, belonging to the technical field of speech recognition. The method comprises the following steps: designing a Transformer-based English speech recognition model; adjusting the model structure and optimizing the parameters of the English speech recognition model on a training data set; and performing speech recognition on spoken English audio files collected in real time with the adjusted English speech recognition model to generate the audio recognition text. By adopting an end-to-end Transformer-based speech recognition method and constructing a training data set of examinees' spoken English, the whole network model is optimized as a whole during training, a global optimum is ensured, and the recognition accuracy is effectively improved.

Description

Deep learning-based spoken English speech recognition method
Technical Field
The invention relates to the technical field of speech recognition, and in particular to a deep-learning-based spoken English speech recognition method.
Background
In recent years, with the rapid development of pattern recognition and artificial intelligence and the deepening application of machine learning, especially deep learning, the research and application fields of speech recognition technology have become increasingly broad. Speech recognition technology uses a computer to transcribe a speech signal into the corresponding text or command; it is essentially a pattern recognition process.
The model most commonly used in the early days of speech recognition was the GMM-HMM, but its modeling capability is limited and it cannot fully and accurately represent speech features and structure; with the development of deep learning, more and more neural-network-based speech recognition models have appeared. Replacing the GMM with a DNN to model the observation-state probabilities improves recognition accuracy, but the DNN-HMM model is difficult to train: every frame of speech in the training data must be labeled, and manual labeling is laborious. In addition, the LSTM-RNN is widely used in acoustic models because it can capture the context-dependent information of sequence data, but the RNN computation at each time step requires the output of the previous step as input, so it can only be computed serially and is slow; moreover, RNNs are prone to vanishing gradients during training, converge more slowly, and require more computational resources.
At present, when existing neural-network-based speech recognition models are applied to speech recognition systems for spoken English examinations, they can meet the basic requirement of automatic speech recognition, but they suffer from complicated training procedures, difficult data annotation and heavy consumption of computational resources. As a result, the speech conversion pipeline of existing speech recognition systems is very complex, and when transcribing examinees' speech in spoken English examinations, the recognition accuracy still needs to be improved.
Disclosure of Invention
In view of the problems in the prior art, the invention aims to provide a deep-learning-based spoken English speech recognition method that adopts an end-to-end Transformer-based speech recognition approach and constructs a training data set of examinees' spoken English, so that the whole network model is optimized as a whole during training, a global optimum is ensured, and the recognition accuracy is effectively improved.
In order to achieve this purpose, the invention is realized by the following technical solution:
A deep-learning-based spoken English speech recognition method comprises the following steps:
S1: designing a Transformer-based English speech recognition model;
S2: adjusting the model structure and optimizing the parameters of the English speech recognition model on a training data set;
S3: performing speech recognition on spoken English audio files collected in real time with the adjusted English speech recognition model, and generating an audio recognition text.
Further, step S1 includes:
constructing a position information embedding module, a multi-head self-attention module, a feed-forward neural network and a cross-attention module, and combining them into a Transformer-based English speech recognition model.
Further, step S2 specifically includes the following steps:
S21: collecting audio paired with text and manually annotated audio data for the spoken English examination, and constructing a data set;
S22: preprocessing the data set and dividing it into a training set and a test set according to a preset proportion;
S23: extracting logarithmic Mel-spectrum features from the data set and normalizing them, globally scaling the input features to between -1 and 1;
S24: setting the optimizer parameters and hyper-parameters for model training;
S25: training the English speech recognition model on the training set to obtain a pre-trained model, and fine-tuning the pre-trained model with the test set to obtain the final Model.
Further, step S3 specifically includes the following steps:
S31: deploying the trained Model;
S32: collecting spoken English audio files and suppressing their noise with an LMS adaptive-filter noise-reduction method to obtain a preprocessed audio sequence x;
S33: resampling the preprocessed audio sequence x to 16,000 Hz and cutting it into 30-second segments to form a batch of audio segments X;
S34: extracting logarithmic Mel-spectrum features from the batch of audio segments X and normalizing the features;
S35: feeding the normalized feature sequence F of the batch of audio segments X into the Model for prediction to obtain the probability distribution P of the recognized text;
S36: looking up the probability distribution P of the recognized text in the vocabulary table to obtain the audio recognition text.
Further, step S21 specifically includes:
collecting 11,000 hours of audio paired with text and 1,500 hours of manually annotated audio data for the spoken English examination, and constructing these audio data into a data set.
Further, step S22 specifically includes:
resampling all audio data in the data set to 16,000 Hz, cutting the resampled audio into 30-second segments and marking each segment with its preset label, thereby completing the preprocessing of the data set;
the preprocessed data set is divided into a training set and a test set in the proportion 8:2.
Further, step S24 specifically includes:
setting the parameters of the Adam optimizer used for model training, and setting the learning rate learningRate according to the formula
learningRate = d^(-0.5) × min(step^(-0.5), step × warmupSteps^(-1.5))
where step is the current training step, d is 512 and warmupSteps is 5000;
configuring the feed-forward neural network as a 6-layer perceptron, and setting the number of heads of the multi-head self-attention module to 6;
forming a residual attention module from a multi-head self-attention module and a feed-forward neural network;
building the encoder and the decoder from 5 residual attention modules each, and setting the dropout rate to 0.1.
Further, step S25 specifically includes:
training the English speech recognition model on the training set for 300,000 iterations to obtain a pre-trained model;
fine-tuning the pre-trained model with the manually annotated audio data in the test set to obtain the final Model.
Further, step S1 further includes:
forming time-ordered feature vectors from the word vectors by a position information embedding method;
the position information features are expressed with sine and cosine functions, and the position information embedding formulas are:
PE(p, 2i) = sin(p / 10000^(2i/d)),  PE(p, 2i+1) = cos(p / 10000^(2i/d))
where p represents the position, i represents the dimension, and d is 512.
Further, step S31 includes:
deploying the Model with the ONNX framework, and deploying it in encrypted form so that it can run on a variety of devices;
when the encrypted deployment is performed, the encryption algorithm used is:
ModelEncrypt[i] = Model[i] XOR key[i mod keyLength]
where ModelEncrypt is the encrypted model, Model is the trained model, i represents the i-th element of the model, key represents the generated random character string, and keyLength represents the length of the key.
Compared with the prior art, the invention has the following beneficial effects. The invention provides a deep-learning-based spoken English speech recognition method: first, a Transformer-based English speech recognition model is designed; then, the structure and parameters of the designed model are tuned on a training set. The data set combines audio paired with text from the Internet with manually annotated audio data for the spoken English examination; the diversity of the Internet audio data helps make the trained model robust, and the model is then fine-tuned with the manually annotated data. Finally, the trained optimal model is deployed and applied to real data, thereby realizing spoken English speech recognition.
The invention adopts an improved end-to-end Transformer-based speech recognition method and constructs a training data set of examinees' spoken English, so that the whole network model is optimized as a whole during training, a global optimum is ensured, and the recognition accuracy is effectively improved. Because the network model uses a relatively uniform network structure, it can be deployed on low-latency, high-precision devices and is computationally efficient. After the model is deployed, speech features are input and English words are output, which simplifies the speech recognition process.
Therefore, compared with the prior art, the invention has prominent substantive features and remarkable progress, and the beneficial effects of the implementation are also obvious.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a process flow diagram of an embodiment of the present invention.
FIG. 2 is a schematic diagram of a model structure according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a residual attention module according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to the accompanying drawings.
Fig. 1 shows a deep-learning-based spoken English speech recognition method, which includes the following steps:
s1: and designing an English voice recognition model based on a Transformer.
Specifically, a position information embedding module, a multi-head self-attention module, a feedforward neural network and a cross-attention module are constructed and combined into an English speech recognition model based on a Transformer.
As shown in Fig. 2, the overall model consists of a position information embedding part, a multi-head self-attention module, a feed-forward neural network part and a cross-attention module part. It can solve the audio recognition problem of the spoken English examination: without a separate language model, it forms time-ordered feature vectors from the word vectors by a position information embedding method and integrates the position information with the attention mechanism to realize the complete audio recognition process.
When constructing the model, the temporal information of the audio sequence is taken into account, so time-ordered feature vectors are formed from the word vectors by a position information embedding method. The position information features are expressed with sine and cosine functions, and the position information embedding formulas are:
PE(p, 2i) = sin(p / 10000^(2i/d)),  PE(p, 2i+1) = cos(p / 10000^(2i/d))
where p represents the position, i represents the dimension, and d is 512.
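To make the formulas concrete, the following is a minimal NumPy sketch of the sinusoidal position information embedding (not the patent's own code; the function name and the 1,500-frame example are illustrative assumptions, while d = 512 follows the text):

```python
import numpy as np

def positional_encoding(seq_len: int, d: int = 512) -> np.ndarray:
    """Sinusoidal position embeddings: PE(p, 2i) = sin(p / 10000^(2i/d)),
    PE(p, 2i+1) = cos(p / 10000^(2i/d))."""
    pe = np.zeros((seq_len, d))
    p = np.arange(seq_len)[:, None]            # positions
    two_i = np.arange(0, d, 2)[None, :]        # even dimension indices 2i
    angles = p / np.power(10000.0, two_i / d)
    pe[:, 0::2] = np.sin(angles)               # even dimensions
    pe[:, 1::2] = np.cos(angles)               # odd dimensions
    return pe

# Example: position embeddings for a 1500-frame feature sequence
print(positional_encoding(1500).shape)  # (1500, 512)
```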
S2: adjusting the model structure and optimizing the parameters of the English speech recognition model on a training data set.
Specifically, the method comprises the following five steps:
s21: collecting audio matched with the text and manual annotation audio data aiming at the oral English test, and constructing a data set.
As an example, the construction of the data set is completed by collecting audio paired with text with a duration of 1.1 ten thousand hours and manually labeled audio data for an english oral examination with a duration of 0.15 ten thousand hours.
S22: preprocessing the data set and dividing it into a training set and a test set according to a preset proportion.
As an example, all audio data in the data set are resampled to 16,000 Hz, the audio is cut into 30-second segments, and each segment is matched with its label. The preprocessed data set is divided into a training set and a test set in the proportion 8:2.
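As a rough sketch of this resampling, segmentation and 8:2 split (an illustrative assumption rather than the patent's own code: librosa is assumed for loading and resampling, and the helper names are made up):

```python
import librosa
import numpy as np

TARGET_SR = 16000          # 16,000 Hz, as specified above
SEGMENT_SECONDS = 30

def load_and_segment(path: str) -> np.ndarray:
    """Resample one audio file to 16 kHz and cut it into 30-second segments."""
    audio, _ = librosa.load(path, sr=TARGET_SR)       # resampled on load
    seg_len = TARGET_SR * SEGMENT_SECONDS
    pad = (-len(audio)) % seg_len                     # pad the tail to a full segment
    audio = np.pad(audio, (0, pad))
    return audio.reshape(-1, seg_len)                 # (num_segments, seg_len)

def split_train_test(items: list, ratio: float = 0.8, seed: int = 0):
    """Shuffle and split the preprocessed items into training and test sets (8:2)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(items))
    cut = int(len(items) * ratio)
    return [items[i] for i in order[:cut]], [items[i] for i in order[cut:]]
```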
S23: extracting logarithmic Mel-spectrum features from the data set and normalizing them, globally scaling the input features to between -1 and 1.
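A minimal sketch of the log-Mel feature extraction and global scaling follows (the frame parameters n_fft = 400, hop_length = 160 and n_mels = 80 are illustrative assumptions, and the global minimum and maximum would in practice be computed over the whole training set):

```python
import librosa
import numpy as np

def log_mel_features(segment: np.ndarray, sr: int = 16000,
                     n_mels: int = 80) -> np.ndarray:
    """Logarithmic Mel-spectrum features of one 30-second segment."""
    mel = librosa.feature.melspectrogram(y=segment, sr=sr, n_fft=400,
                                         hop_length=160, n_mels=n_mels)
    return np.log10(np.maximum(mel, 1e-10)).T         # (frames, n_mels)

def scale_to_unit_range(features: np.ndarray,
                        global_min: float, global_max: float) -> np.ndarray:
    """Globally scale the input features to the range [-1, 1]."""
    return 2.0 * (features - global_min) / (global_max - global_min) - 1.0
```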
S24: setting the optimizer parameters and hyper-parameters for model training.
It should be noted that in this method the Adam optimizer is used for model training, with the optimizer parameters set to β1 = 0.9, β2 = 0.95 and ε = 10^(-8), and the learning rate is set according to the formula
learningRate = d^(-0.5) × min(step^(-0.5), step × warmupSteps^(-1.5))
where step is the current training step, d is 512 and warmupSteps is 5000.
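A small sketch of this warm-up schedule, assuming the standard Transformer interpretation of the formula with d = 512 and warmupSteps = 5000 taken from the text:

```python
def learning_rate(step: int, d: int = 512, warmup_steps: int = 5000) -> float:
    """Linear warm-up for warmup_steps, then inverse-square-root decay."""
    step = max(step, 1)
    return d ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The schedule rises until step 5000 and decays afterwards
for s in (100, 5000, 50000, 300000):
    print(s, round(learning_rate(s), 6))
```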
Setting model training hyper-parameters as follows:
the feedforward neural network is set as a 6-layer perceptron, the head number of the multi-head self-attention module is set as 6, a residual attention module is formed by the multi-head self-attention module and the feedforward neural network, the encoder and the decoder are respectively formed by 5 residual attention modules, and the parameter discarding rate is set as 0.1.
As shown in Fig. 3, the residual attention module is composed of a multi-head self-attention module and a feed-forward neural network. The multi-head self-attention module expands the model's ability to attend to different positions: each weight matrix maps the input vectors into a different representation subspace, and the combined output matrix is compressed and used as the input of the feed-forward neural network.
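The following PyTorch sketch illustrates one such residual attention module. It is an illustrative assumption rather than the patent's implementation: the feed-forward part is simplified to two linear layers instead of the 6-layer perceptron described above, and 8 heads are used because PyTorch's MultiheadAttention requires the model dimension 512 to be divisible by the head count, whereas the text specifies 6 heads.

```python
import torch
import torch.nn as nn

class ResidualAttentionBlock(nn.Module):
    """Multi-head self-attention followed by a feed-forward network,
    each wrapped in a residual connection (cf. Fig. 3)."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, dropout: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model),
                                 nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + self.drop(attn_out))       # residual around self-attention
        x = self.norm2(x + self.drop(self.ffn(x)))    # residual around the FFN
        return x

# Stacking 5 such blocks gives an encoder of the kind described above
encoder = nn.Sequential(*[ResidualAttentionBlock() for _ in range(5)])
```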
S25: training the English speech recognition model on the training set to obtain a pre-trained model, and fine-tuning the pre-trained model with the test set to obtain the final Model.
As an example, the model is trained on the data set for 300,000 iterations to obtain a pre-trained model, and on this basis the final Model is obtained by fine-tuning with the manually annotated data set.
S3: performing speech recognition on spoken English audio files collected in real time with the adjusted English speech recognition model, and generating an audio recognition text.
This step deploys the model in practice and can be realized through the following six steps:
s301: and deploying the trained Model.
In this step, the model is deployed with the ONNX framework and is deployed in encrypted form so that it can run on a variety of devices. The encryption algorithm used is:
ModelEncrypt[i] = Model[i] XOR key[i mod keyLength]
where ModelEncrypt is the encrypted model, Model is the trained final Model, i denotes the i-th element of the model, key is a generated random character string, and keyLength is the length of the key.
Encrypting the deployed ONNX model allows it to run on a variety of devices while ensuring its security.
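A minimal sketch of such a keyed, element-wise encryption of the exported model file follows. It is an assumption consistent with the formula above (XOR with a cycled key), not a security recommendation or the patent's own code; the file names are illustrative, and since XOR is symmetric the same function also decrypts.

```python
import secrets

def xor_crypt(data: bytes, key: bytes) -> bytes:
    """Element-wise: out[i] = data[i] XOR key[i mod keyLength]."""
    key_length = len(key)
    return bytes(b ^ key[i % key_length] for i, b in enumerate(data))

# Encrypt an exported ONNX model before shipping it to devices
key = secrets.token_bytes(32)                     # generated random key
with open("model.onnx", "rb") as f:
    encrypted = xor_crypt(f.read(), key)
with open("model.onnx.enc", "wb") as f:
    f.write(encrypted)
```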
S302: preprocessing the spoken English audio files; during preprocessing, the noise of the audio is suppressed with an LMS adaptive-filter noise-reduction method to obtain the preprocessed audio sequence x.
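A minimal sketch of LMS adaptive-filter noise suppression is given below. It is an illustrative assumption: the classical LMS canceller needs a noise reference signal, which the text does not describe, and the tap count and step size are likewise assumed.

```python
import numpy as np

def lms_denoise(noisy: np.ndarray, noise_ref: np.ndarray,
                taps: int = 32, mu: float = 0.01) -> np.ndarray:
    """Adaptive noise cancellation: an FIR filter learns to predict the noise
    component from the reference, and the prediction is subtracted."""
    w = np.zeros(taps)
    clean = np.zeros_like(noisy)
    for n in range(taps, len(noisy)):
        x = noise_ref[n - taps:n][::-1]       # most recent reference samples
        noise_est = np.dot(w, x)
        e = noisy[n] - noise_est              # error signal = de-noised sample
        w += 2 * mu * e * x                   # LMS weight update
        clean[n] = e
    return clean
```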
S303: the preprocessed audio sequence x is resampled to 16,000 Hz and cut into 30-second segments to form a batch of audio segments X.
S304: logarithmic Mel-spectrum features are extracted from the batch of audio segments X and the features are normalized.
S305: the normalized feature sequence F of the batch of audio segments X is fed into the Model for prediction to obtain the probability distribution P of the recognized text.
S306: the obtained probability distribution P of the recognized text is looked up in the vocabulary table to obtain the audio recognition text.
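As a rough sketch of this look-up step (greedy decoding is an illustrative assumption, and the tiny vocabulary below is hypothetical):

```python
import numpy as np

def lookup_text(prob_dist: np.ndarray, vocab: list) -> str:
    """Map the per-step probability distribution P (steps x vocabulary size)
    to text: take the most likely entry at each step and look it up."""
    token_ids = prob_dist.argmax(axis=-1)
    return " ".join(vocab[i] for i in token_ids)

# Hypothetical example with a three-word vocabulary table
vocab = ["<blank>", "hello", "world"]
P = np.array([[0.1, 0.8, 0.1],
              [0.2, 0.1, 0.7]])
print(lookup_text(P, vocab))   # -> "hello world"
```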
As an example, the present invention further provides a specific embodiment to explain the implementation of step S3. The experimental environment is a Linux system configured with an Intel(R) Xeon(R) E5-2620 v4 @ 2.10 GHz CPU, 128 GB of memory and four NVIDIA A40 GPUs with 48 GB of memory each; the model has already been trained, so no model training is needed in this implementation before the model is put into formal use. 100 read-aloud question recordings and 100 open-question recordings from spoken English examinations are selected, each 60 seconds long.
The specific implementation steps are as follows:
s311: deploying the Model obtained by training;
s312: preprocessing each 100 spoken English reading audio and open question audio files, and suppressing the noise of the audio by using an LMS adaptive filter noise reduction method in the preprocessing process to obtain a preprocessed audio sequence x;
s313: resampling the preprocessed audio sequence X to 16000Hz, and intercepting the audio sequence X into 30-second segments to form batch audio segments X;
s314: carrying out logarithmic Mel spectrum feature extraction on the obtained batch of audio fragments X, and carrying out normalization processing on the features;
s315: inputting the normalized feature sequence F of the batch of audio clips X into a Model for prediction to obtain probability distribution P of the recognition text;
s316: and performing table look-up on the obtained probability distribution P of the recognition text to obtain the audio recognition text.
Using the deep-learning-based spoken English speech recognition method described above, the spoken-examination data of 200 randomly selected examinees from an English test were recognized and manually quality-checked; the recognition results are shown in Table 1:
Table 1: Recognition accuracy statistics for the spoken-examination data of 200 examinees in an English test

Data type               Accuracy
Read-aloud questions    0.930
Open questions          0.875
As can be seen from Table 1, the spoken English speech recognition method based on deep learning provided by the invention can effectively overcome the defects in the prior art.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
The method for recognizing spoken English based on deep learning provided by the invention is described in detail above. The principles and embodiments of the present invention are explained herein using specific examples, which are presented only to assist in understanding the method and its core concepts. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.

Claims (10)

1. A deep-learning-based spoken English speech recognition method, characterized by comprising the following steps:
S1: designing a Transformer-based English speech recognition model;
S2: adjusting the model structure and optimizing the parameters of the English speech recognition model on a training data set;
S3: performing speech recognition on spoken English audio files collected in real time with the adjusted English speech recognition model, and generating an audio recognition text.
2. The deep-learning-based spoken English speech recognition method according to claim 1, wherein said step S1 comprises:
constructing a position information embedding module, a multi-head self-attention module, a feed-forward neural network and a cross-attention module, and combining them into a Transformer-based English speech recognition model.
3. The deep-learning-based spoken English speech recognition method according to claim 2, wherein step S2 specifically comprises the following steps:
S21: collecting audio paired with text and manually annotated audio data for the spoken English examination, and constructing a data set;
S22: preprocessing the data set and dividing it into a training set and a test set according to a preset proportion;
S23: extracting logarithmic Mel-spectrum features from the data set and normalizing them, globally scaling the input features to between -1 and 1;
S24: setting the optimizer parameters and hyper-parameters for model training;
S25: training the English speech recognition model on the training set to obtain a pre-trained model, and fine-tuning the pre-trained model with the test set to obtain the final Model.
4. The deep-learning-based spoken English speech recognition method according to claim 3, wherein step S3 specifically comprises the following steps:
S31: deploying the trained Model;
S32: collecting spoken English audio files and suppressing their noise with an LMS adaptive-filter noise-reduction method to obtain a preprocessed audio sequence x;
S33: resampling the preprocessed audio sequence x to 16,000 Hz and cutting it into 30-second segments to form a batch of audio segments X;
S34: extracting logarithmic Mel-spectrum features from the batch of audio segments X and normalizing the features;
S35: feeding the normalized feature sequence F of the batch of audio segments X into the Model for prediction to obtain the probability distribution P of the recognized text;
S36: looking up the probability distribution P of the recognized text in the vocabulary table to obtain the audio recognition text.
5. The deep-learning-based spoken English speech recognition method according to claim 3, wherein step S21 is specifically:
collecting 11,000 hours of audio paired with text and 1,500 hours of manually annotated audio data for the spoken English examination, and constructing these audio data into a data set.
6. The deep-learning-based spoken English speech recognition method according to claim 3, wherein step S22 is specifically:
resampling all audio data in the data set to 16,000 Hz, cutting the resampled audio into 30-second segments and marking each segment with its preset label, thereby completing the preprocessing of the data set;
the preprocessed data set is divided into a training set and a test set in the proportion 8:2.
7. The deep-learning-based spoken English speech recognition method according to claim 3, wherein step S24 is specifically:
setting the parameters of the Adam optimizer used for model training, and setting the learning rate learningRate according to the formula
learningRate = d^(-0.5) × min(step^(-0.5), step × warmupSteps^(-1.5))
where step is the current training step, d is 512 and warmupSteps is 5000;
configuring the feed-forward neural network as a 6-layer perceptron, and setting the number of heads of the multi-head self-attention module to 6;
forming a residual attention module from a multi-head self-attention module and a feed-forward neural network;
building the encoder and the decoder from 5 residual attention modules each, and setting the dropout rate to 0.1.
8. The deep-learning-based spoken English speech recognition method according to claim 3, wherein step S25 is specifically:
training the English speech recognition model on the training set for 300,000 iterations to obtain a pre-trained model;
fine-tuning the pre-trained model with the manually annotated audio data in the test set to obtain the final Model.
9. The deep-learning-based spoken English speech recognition method according to claim 2, wherein step S1 further comprises:
forming time-ordered feature vectors from the word vectors by a position information embedding method;
the position information features are expressed with sine and cosine functions, and the position information embedding formulas are:
PE(p, 2i) = sin(p / 10000^(2i/d)),  PE(p, 2i+1) = cos(p / 10000^(2i/d))
where p represents the position, i represents the dimension, and d is 512.
10. The deep-learning-based spoken English speech recognition method according to claim 4, wherein said step S31 comprises:
deploying the Model with the ONNX framework, and deploying it in encrypted form so that it can run on a variety of devices;
when the encrypted deployment is performed, the encryption algorithm used is:
ModelEncrypt[i] = Model[i] XOR key[i mod keyLength]
where ModelEncrypt is the encrypted model, Model is the trained model, i represents the i-th element of the model, key represents the generated random character string, and keyLength represents the length of the key.
CN202310194346.9A 2023-03-03 2023-03-03 Deep learning-based spoken English speech recognition method Pending CN115862596A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310194346.9A CN115862596A (en) 2023-03-03 2023-03-03 Deep learning-based spoken English speech recognition method

Publications (1)

Publication Number Publication Date
CN115862596A true CN115862596A (en) 2023-03-28

Family

ID=85659833

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310194346.9A Pending CN115862596A (en) 2023-03-03 2023-03-03 Deep learning-based spoken English speech recognition method

Country Status (1)

Country Link
CN (1) CN115862596A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111916064A (en) * 2020-08-10 2020-11-10 北京睿科伦智能科技有限公司 End-to-end neural network speech recognition model training method
CN113570030A (en) * 2021-01-18 2021-10-29 腾讯科技(深圳)有限公司 Data processing method, device, equipment and storage medium
CN113539244A (en) * 2021-07-22 2021-10-22 广州虎牙科技有限公司 End-to-end speech recognition model training method, speech recognition method and related device
CN114360503A (en) * 2021-11-18 2022-04-15 腾讯科技(深圳)有限公司 Voice recognition method, system, storage medium and terminal equipment
CN113889085A (en) * 2021-11-22 2022-01-04 北京百度网讯科技有限公司 Speech recognition method, apparatus, device, storage medium and program product
CN114333824A (en) * 2021-12-31 2022-04-12 佛山科学技术学院 Partial information fusion voice recognition network and method based on Transformer model and terminal
CN115050371A (en) * 2022-07-12 2022-09-13 深圳市普渡科技有限公司 Speech recognition method, speech recognition device, computer equipment and storage medium

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117912027A (en) * 2024-03-18 2024-04-19 山东大学 Intelligent identification method and system suitable for RPA process automation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20230328