CN111916064A - End-to-end neural network speech recognition model training method - Google Patents

End-to-end neural network speech recognition model training method

Info

Publication number
CN111916064A
CN111916064A
Authority
CN
China
Prior art keywords
recognition model
audio
speech recognition
model
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010794361.3A
Other languages
Chinese (zh)
Inventor
陈虞君
杨植麟
张宇韬
杜羽伦
陈欣梅
陈贤鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Ruikelun Intelligent Technology Co ltd
Original Assignee
Beijing Ruikelun Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Ruikelun Intelligent Technology Co ltd filed Critical Beijing Ruikelun Intelligent Technology Co ltd
Priority to CN202010794361.3A
Publication of CN111916064A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks

Abstract

The invention relates to the technical field of computer information processing, and in particular to a training method for an end-to-end neural network speech recognition model, comprising the following steps: step 1, collecting voice information and storing it as audio files; step 2, preliminarily screening the audio files so that their volumes are consistent; step 3, manually labeling the content of the audio files and generating data files; step 4, preprocessing the labeled data and computing the feature distribution; step 5, constructing an audio preprocessing module that applies speed change, noise addition and frequency-domain disturbance enhancement to the audio files; step 6, constructing a speech recognition model with an end-to-end deep learning model; step 7, optimizing the speech recognition model; and step 8, obtaining decoded text information from the input audio signal. The invention provides an end-to-end speech recognition model that aims to significantly improve recognition performance.

Description

End-to-end neural network speech recognition model training method
Technical Field
The invention relates to the technical field of computer information processing, in particular to a training method of an end-to-end neural network speech recognition model.
Background
Today, with the rapid development of information technology, more and more fields have begun to use machine learning to replace the complex, repetitive work in traditional industries that consumes manpower and material resources. For example, online shopping websites use voice assistants or conversational robots to handle common customer problems, and transportation departments use computer vision to recognize car license plates. Adopting machine learning can effectively reduce production costs while ensuring high accuracy.
In actual production, transcription of voice information is the basic work underlying many voice analysis products; the transcription function is the foundation for analyzing audio signals. Voice input methods, for example, are software built directly on a speech transcription model, and most telephone precision-marketing systems in industry also need to convert telephone signals into text for analysis.
Most conventional speech-to-text products are based on a two-stage recognition method. This approach splits the speech recognition task into two stages, an acoustic recognition task and a language-modeling task, with a separate model built and trained for each. In the first stage, the most likely words are produced from the speech alone; in the second stage, a separately trained language model corrects the text obtained from the speech signal. Two-stage models dominated speech recognition engines for a long time; the staged tasks can be performed using, for example, hidden Markov models and Gaussian mixture models. In a two-stage model each stage has its own task objective, but the optimization methods for the different objectives are not the same, and this mismatch means the overall effect of the model cannot be guaranteed to be optimal. With the rapid development of deep models in recent years, another class of end-to-end speech recognition models has begun to show its effectiveness. The main purpose of an end-to-end speech recognition model is to learn a direct non-linear transformation from the input speech signal to the output text. Supervision pairs of speech and text are fed into the model, so that speech and text can be put directly into one-to-one correspondence, which facilitates direct iteration on the data. In recent years end-to-end speech recognition models have performed excellently, and the models achieving the best recognition results on public data sets are all end-to-end recognition engines, so end-to-end speech recognition is a very promising field.
Disclosure of Invention
The invention aims to solve the problem of the poor performance of traditional speech-to-text models, and provides an end-to-end speech recognition model intended to significantly improve recognition performance.
In order to achieve this purpose, the invention provides the following technical scheme: a training method for an end-to-end neural network speech recognition model, comprising the following steps:
step 1, collecting voice information and storing it as audio files;
step 2, preliminarily screening the audio files so that their volumes are consistent;
step 3, manually labeling the content of the audio files and generating data files;
step 4, preprocessing the labeled data and computing the feature distribution;
step 5, constructing an audio preprocessing module that applies speed change, noise addition and frequency-domain disturbance enhancement to the audio files;
step 6, constructing a speech recognition model with an end-to-end deep learning model;
step 7, optimizing the speech recognition model;
and step 8, obtaining decoded text information from the input audio signal.
Preferably, step 4 further comprises randomly selecting labeled audio files and extracting audio spectral features, wherein the audio spectral features include, but are not limited to, representations such as Mel filter-bank spectrograms (Mel Bank Features) and the Short-Time Fourier Transform (STFT).
Preferably, the speech recognition model in step 6 is constructed with a Transformer model, and relative position coding is adopted to supplement the speech signal while it is processed.
Preferably, in step 6, an N-layer Transformer model is used as the encoder and an M-layer Transformer model is used as the decoder to construct the recognition engine of the speech recognition model.
Preferably, the speech recognition model in step 6 is divided into two parts, an Encoder and a Decoder, and the learning target is to maximize the probability of each text being obtained from the input audio signal.
The invention has the following beneficial effects: compared with traditional speech recognition engines, the model improves recognition efficiency and achieves better transcription results for audio signals; speech signals are converted into text efficiently, the cost of manual transcription can largely be eliminated, and information extraction and data mining on audio content are better served.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and that other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a flow chart of model training according to the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it is to be understood that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The first embodiment is as follows:
referring to fig. 1, a method for training an end-to-end neural network speech recognition model comprises the following steps:
Step 1, collecting voice information. The voice input can come from various sources, for example stored voice calls can be used as speech signal data, speech can be collected from open-source data websites, or speech signals and subtitle information from specific websites can be processed, so as to obtain enough speech files for training the speech recognition model. The audio files are then stored in wav or mp3 format.
Step 2, preliminary screening of the speech signals. After the speech signals are collected, because they vary greatly in quality and volume, VAD (voice activity detection) is first performed to determine the effective speech portions, and the effective recording segments of each call are extracted as the data set to be labeled.
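The patent does not name a specific VAD algorithm. As a rough illustration of this screening step, a simple frame-energy threshold can stand in for VAD; the 16 kHz sample rate, frame length and threshold below are illustrative assumptions, not values from the patent.

```python
import numpy as np

def energy_vad(wave, sr=16000, frame_ms=30, threshold_db=-35.0):
    """Toy energy-based VAD: return (start, end) sample ranges of frames
    whose RMS level exceeds a fixed threshold relative to the signal peak."""
    frame = int(sr * frame_ms / 1000)
    n = len(wave) // frame
    frames = wave[:n * frame].reshape(n, frame)
    rms = np.sqrt((frames ** 2).mean(axis=1) + 1e-12)
    level_db = 20 * np.log10(rms / (np.abs(wave).max() + 1e-12))
    return [(i * frame, (i + 1) * frame) for i in range(n) if level_db[i] > threshold_db]
```

The segments returned in this way correspond to the "effective recording segments" that are passed on to the labeling step.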
Step 3, manual labeling. After the preliminary screening of the speech signals is finished, the audio is collected and labeled manually. Before labeling, in order to increase labeling speed, a pre-recognition of the given audio is performed with an existing model: a basic model first recognizes the data to be labeled, so that the annotators only need to correct the recognition results.
After the manual labeling is finished, the labeled data are collected and arranged into a text file, recording each labeled audio file together with its labeling result.
Step 4, data preprocessing of the labeled data, that is, pre-extraction of the speech signals, aimed at quickly collecting the feature distribution of all the data. A randomly selected portion of the labeled speech files is processed and their audio spectral features are extracted. Audio spectral features include, but are not limited to, representations such as Mel filter-bank spectrograms (Mel Bank Features) and the Short-Time Fourier Transform (STFT).
After the audio spectrograms are obtained, the distribution of the corresponding spectra of all the audio files is acquired statistically.
For example, suppose the features of the whole audio spectral signal follow a Gaussian distribution:

N(x; μ, Σ) = (2π)^(-d/2) |Σ|^(-1/2) exp(-(1/2) (x - μ)^T Σ^(-1) (x - μ))

where μ is the mean of these features and Σ is the covariance matrix of all feature vectors. After the mean and covariance matrix of the features are obtained, the signal distribution of all the audio can be estimated. In addition, all the text used for labeling is collected, and by reading the labeled text all characters are gathered as the character library of the model.
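A minimal sketch of this preprocessing step is given below; the use of librosa, the 16 kHz sample rate and the 80 mel bands are illustrative assumptions rather than requirements of the patent.

```python
import numpy as np
import librosa  # assumed feature-extraction library; any STFT/mel implementation works

def spectral_features(path, n_mels=80):
    """Log mel filter-bank features of one labeled audio file (frames x n_mels)."""
    wave, sr = librosa.load(path, sr=16000)
    mel = librosa.feature.melspectrogram(y=wave, sr=sr, n_fft=400,
                                         hop_length=160, n_mels=n_mels)
    return np.log(mel + 1e-6).T

def estimate_distribution(paths):
    """Mean vector mu and covariance matrix Sigma of the pooled frame features,
    modelling the Gaussian feature distribution described above."""
    frames = np.concatenate([spectral_features(p) for p in paths], axis=0)
    return frames.mean(axis=0), np.cov(frames, rowvar=False)
```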
Step 5, constructing an audio preprocessing module. When an audio signal is input into the recognition model, it is first processed; the processing includes traditional data augmentation methods applied to the audio, such as speed change and noise addition, as well as disturbance enhancement of the frequency-domain signal. For example, a frequency-domain stretching method is used to stretch and transform the frequency-domain signal, with the formula:

X′ = T X

where X represents the original input audio frequency-domain signal, T represents an affine transformation, and X′ represents the frequency-domain signal after the affine transformation. Through this affine transformation, a tiny disturbance is added to the audio's frequency-domain information, achieving the purpose of audio data augmentation.
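A hedged sketch of such an augmentation module follows; the speed-change range, noise level and the choice of T as an identity matrix plus a small random perturbation are illustrative assumptions (the patent only requires that T be an affine transformation of the frequency-domain signal).

```python
import numpy as np
import librosa  # assumed library for time stretching and the STFT

def augment(wave, sr=16000, rng=None):
    """Speed change, additive noise, and a small affine perturbation X' = T X
    of the frequency-domain signal."""
    rng = rng or np.random.default_rng()
    wave = librosa.effects.time_stretch(wave, rate=rng.uniform(0.9, 1.1))  # speed change
    wave = wave + rng.normal(0.0, 0.005, size=wave.shape)                  # additive noise
    spec = librosa.stft(wave, n_fft=400, hop_length=160)                   # X (frequency domain)
    n_bins = spec.shape[0]
    T = np.eye(n_bins) + 0.01 * rng.normal(size=(n_bins, n_bins))          # affine transform T
    return librosa.istft(T @ spec, hop_length=160)                         # X' back to waveform
```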
Step 6, constructing the speech recognition model with an end-to-end deep learning model; a Transformer model is used for the construction. In order to capture the speech information accurately, relative position coding is used to supplement the speech signal while it is processed. The end-to-end recognition engine is built from an N-layer Transformer model used as the encoder and an M-layer Transformer model used as the decoder.
Each Transformer layer contains two structures (sub-layers): a multi-head attention layer (multi-head attention) and a fully connected layer (feed-forward network).
The multi-head attention layer is composed of several attention networks. For the i-th attention layer, the input consists of the three components Q, K and V, which carry the audio information of the speech signal together with the relative position coding and are computed as:

Qi = h(n-1) WQ,  Ki = h(n-1) WK,  Vi = h(n-1) WV

where h(n-1) is the hidden representation of the previous layer. Given the three inputs Qi, Ki and Vi, the attention mechanism is computed as:

Attention(Qi, Ki, Vi) = softmax(Qi·Kj + Qi·W·R(i-j) + u·Kj + v·W·R(i-j)) · Vi

where Qi, Ki and Vi are the hidden representation vectors of the i-th Transformer layer, R(i-j) is the relative position coding, and u and v are learned bias vectors. The multi-head attention mechanism uses several attention mechanisms at the same time to gather and fuse the information of the input signals Q, K, V, computed as:

MultiHead(Q, K, V) = concat(Att1, Att2, Att3, ..., Attn)

where Atti = Attentioni(Q, K, V) and concat denotes splicing (concatenation) of the representation vectors produced by each attention head.
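The attention score above can be sketched per head roughly as follows; this is a toy illustration of the four-term formula (content, position, global-content and global-position terms), with tensor shapes and the pre-projected relative encodings chosen for clarity rather than taken from the patent.

```python
import torch

def rel_attention(q, k, v, rel, u, bias_v):
    """One attention head with relative position coding.
    q, k, v   : (T, d)    query / key / value vectors
    rel       : (T, T, d) pre-projected relative encodings W R(i-j)
    u, bias_v : (d,)      learned global bias vectors"""
    content    = q @ k.t()                                # Qi . Kj
    position   = torch.einsum('id,ijd->ij', q, rel)       # Qi . W R(i-j)
    g_content  = (k @ u).unsqueeze(0)                     # u . Kj (same for every query)
    g_position = torch.einsum('d,ijd->ij', bias_v, rel)   # v . W R(i-j)
    weights = torch.softmax(content + position + g_content + g_position, dim=-1)
    return weights @ v                                    # weighted sum of the values
```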
The decoder of the Transformer model has essentially the same structure as the encoder, but each decoder layer comprises three sub-networks: two multi-head attention mechanisms and a fully connected network layer. The two multi-head attention mechanisms receive, respectively, the decoder input signal and the representation vectors of the result produced by the encoder, from which the decoder output is obtained.
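A minimal sketch of this N-layer encoder / M-layer decoder arrangement is shown below, using PyTorch's stock nn.Transformer as a stand-in; note that the stock module uses ordinary attention rather than the relative position coding described above, and every hyperparameter value here is an illustrative assumption.

```python
import torch
import torch.nn as nn

class SpeechTransformer(nn.Module):
    """Encoder-decoder speech recognition sketch: spectrogram frames in, character logits out."""
    def __init__(self, n_mels=80, vocab=4000, d_model=256, n_enc=12, n_dec=6, n_heads=4):
        super().__init__()
        self.front = nn.Linear(n_mels, d_model)      # project spectrogram frames
        self.embed = nn.Embedding(vocab, d_model)    # embed text tokens
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=n_heads,
            num_encoder_layers=n_enc, num_decoder_layers=n_dec,
            batch_first=True)
        self.out = nn.Linear(d_model, vocab)         # per-step character logits

    def forward(self, feats, tokens):
        # feats: (B, T_audio, n_mels), tokens: (B, T_text)
        tgt_mask = self.transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.transformer(self.front(feats), self.embed(tokens), tgt_mask=tgt_mask)
        return self.out(h)                           # (B, T_text, vocab)
```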
Step 7, optimizing the speech recognition model. Because the model is modeled and trained as a sequence-to-sequence model, the speech recognition model is divided into an Encoder part and a Decoder part, and the learning goal is to maximize the probability of each text given the input audio signal, i.e.:

max P(Y | X′; Θ)

where X′ is the affine-transformed audio signal mentioned above, Θ denotes the model parameters, and Y is the text information. In order to learn the model effectively, the method uses the Teacher Forcing technique, which helps the model quickly learn to produce the target text.
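A minimal teacher-forcing training step consistent with this objective might look like the following; the padding index, the optimizer and the SpeechTransformer sketch above are assumptions for illustration.

```python
import torch.nn.functional as F

def train_step(model, feats, tokens, optimizer, pad_id=0):
    """One teacher-forcing update maximizing P(Y | X'; Theta): the decoder is fed
    the gold tokens <bos> y1 ... and trained to predict y1 ... <eos>."""
    logits = model(feats, tokens[:, :-1])                       # (B, T-1, vocab)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tokens[:, 1:].reshape(-1),
                           ignore_index=pad_id)                 # -log P(y_t | y_<t, X')
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```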
Step 8, obtaining the text result through search.
Through the training of the speech recognition model, decoded text information can be obtained from the input audio signal. However, since the model has an encoder-decoder structure, the text sequence is determined using beam search. Here the beam width of the search is set to M, and the specific procedure is as follows:
1. The input signal X and the start symbol <bos> are fed together into the encoder-decoder model to obtain the first character y1; the M words with the highest probability are each placed into a beam in order.
2. In each beam, the next word is decoded from the words already produced and the input signal X. Specifically, assuming the model has processed k words, in each beam it predicts the (k+1)-th word y(k+1) from y1, y2, ..., yk and the input signal X.
3. Prediction in a beam ends when, based on the first k words and the input signal X, the beam predicts the end symbol <eos>.
Since model training and prediction can be carried out in batch mode, in a batch-wise beam search the prediction for the batch ends only when every result in the batch has produced <eos>.
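A single-utterance sketch of this beam search procedure (steps 1 to 3 above) is given below; the beam width, maximum length and token IDs are illustrative, and the batch-wise variant would simply keep decoding until every item in the batch has emitted <eos>.

```python
import torch

def beam_search(model, feats, bos_id, eos_id, beam=4, max_len=100):
    """Decode one utterance; feats has shape (1, T_audio, n_mels)."""
    hyps = [([bos_id], 0.0)]                                   # (token list, log-prob)
    with torch.no_grad():
        for _ in range(max_len):
            candidates = []
            for seq, score in hyps:
                if seq[-1] == eos_id:                          # finished hypothesis
                    candidates.append((seq, score))
                    continue
                logits = model(feats, torch.tensor([seq]))     # (1, len(seq), vocab)
                logp = torch.log_softmax(logits[0, -1], dim=-1)
                top = torch.topk(logp, beam)
                for p, idx in zip(top.values.tolist(), top.indices.tolist()):
                    candidates.append((seq + [idx], score + p))
            hyps = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam]
            if all(seq[-1] == eos_id for seq, _ in hyps):      # every beam produced <eos>
                break
    return hyps[0][0]
```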
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (5)

1. A method for training an end-to-end neural network speech recognition model, characterized by comprising the following steps:
step 1, collecting voice information and storing it as audio files;
step 2, preliminarily screening the audio files so that their volumes are consistent;
step 3, manually labeling the content of the audio files and generating data files;
step 4, preprocessing the labeled data and computing the feature distribution;
step 5, constructing an audio preprocessing module that applies speed change, noise addition and frequency-domain disturbance enhancement to the audio files;
step 6, constructing a speech recognition model with an end-to-end deep learning model;
step 7, optimizing the speech recognition model;
and step 8, obtaining decoded text information from the input audio signal.
2. The method for training the end-to-end neural network speech recognition model according to claim 1, wherein: in step 4, labeled audio files are randomly selected and their audio spectral features are extracted, the audio spectral features including, but not limited to, representations such as Mel filter-bank spectrograms (Mel Bank Features) and the Short-Time Fourier Transform (STFT).
3. The method for training the end-to-end neural network speech recognition model according to claim 1, wherein: the speech recognition model in step 6 is constructed with a Transformer model, and relative position coding is adopted to supplement the speech signal while it is processed.
4. The method for training the end-to-end neural network speech recognition model according to claim 3, wherein: in step 6, an N-layer Transformer model is used as the encoder and an M-layer Transformer model as the decoder to construct the recognition engine of the speech recognition model.
5. The method for training the end-to-end neural network speech recognition model according to claim 1, wherein: the speech recognition model in step 6 is divided into two parts, an Encoder and a Decoder, and the learning goal is to maximize the probability of obtaining each text from the input audio signal.
CN202010794361.3A 2020-08-10 2020-08-10 End-to-end neural network speech recognition model training method Pending CN111916064A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010794361.3A CN111916064A (en) 2020-08-10 2020-08-10 End-to-end neural network speech recognition model training method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010794361.3A CN111916064A (en) 2020-08-10 2020-08-10 End-to-end neural network speech recognition model training method

Publications (1)

Publication Number Publication Date
CN111916064A true CN111916064A (en) 2020-11-10

Family

ID=73284844

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010794361.3A Pending CN111916064A (en) 2020-08-10 2020-08-10 End-to-end neural network speech recognition model training method

Country Status (1)

Country Link
CN (1) CN111916064A (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190130896A1 (en) * 2017-10-26 2019-05-02 Salesforce.Com, Inc. Regularization Techniques for End-To-End Speech Recognition
CN108053836A (en) * 2018-01-18 2018-05-18 成都嗨翻屋文化传播有限公司 A kind of audio automation mask method based on deep learning
CN109065032A (en) * 2018-07-16 2018-12-21 杭州电子科技大学 A kind of external corpus audio recognition method based on depth convolutional neural networks
KR20200092511A (en) * 2019-01-15 2020-08-04 한양대학교 산학협력단 Deep neural network based non-autoregressive speech synthesizer method and system
CN109859760A (en) * 2019-02-19 2019-06-07 成都富王科技有限公司 Phone robot voice recognition result bearing calibration based on deep learning
CN110428818A (en) * 2019-08-09 2019-11-08 中国科学院自动化研究所 The multilingual speech recognition modeling of low-resource, audio recognition method
CN110459205A (en) * 2019-09-24 2019-11-15 京东数字科技控股有限公司 Audio recognition method and device, computer can storage mediums
CN110751945A (en) * 2019-10-17 2020-02-04 成都三零凯天通信实业有限公司 End-to-end voice recognition method
CN111061861A (en) * 2019-12-12 2020-04-24 西安艾尔洛曼数字科技有限公司 XLNET-based automatic text abstract generation method
CN111145718A (en) * 2019-12-30 2020-05-12 中国科学院声学研究所 Chinese mandarin character-voice conversion method based on self-attention mechanism
CN111415667A (en) * 2020-03-25 2020-07-14 极限元(杭州)智能科技股份有限公司 Stream-type end-to-end speech recognition model training and decoding method

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113345410A (en) * 2021-05-11 2021-09-03 科大讯飞股份有限公司 Training method of general speech and target speech synthesis model and related device
CN113257239A (en) * 2021-06-15 2021-08-13 深圳市北科瑞声科技股份有限公司 Voice recognition method and device, electronic equipment and storage medium
CN113257239B (en) * 2021-06-15 2021-10-08 深圳市北科瑞声科技股份有限公司 Voice recognition method and device, electronic equipment and storage medium
CN113793604A (en) * 2021-09-14 2021-12-14 思必驰科技股份有限公司 Speech recognition system optimization method and device
CN113793604B (en) * 2021-09-14 2024-01-05 思必驰科技股份有限公司 Speech recognition system optimization method and device
CN115862596A (en) * 2023-03-03 2023-03-28 山东山大鸥玛软件股份有限公司 Deep learning-based spoken English speech recognition method

Similar Documents

Publication Publication Date Title
CN101136199B (en) Voice data processing method and equipment
US8321218B2 (en) Searching in audio speech
CN111916064A (en) End-to-end neural network speech recognition model training method
US8255215B2 (en) Method and apparatus for locating speech keyword and speech recognition system
Tripathi et al. Environment sound classification using an attention-based residual neural network
CN111798840A (en) Voice keyword recognition method and device
Ram et al. Sparse subspace modeling for query by example spoken term detection
Pham et al. Hybrid data augmentation and deep attention-based dilated convolutional-recurrent neural networks for speech emotion recognition
CN114783418B (en) End-to-end voice recognition method and system based on sparse self-attention mechanism
Shekofteh et al. Feature extraction based on speech attractors in the reconstructed phase space for automatic speech recognition systems
Lounnas et al. CLIASR: a combined automatic speech recognition and language identification system
Almekhlafi et al. A classification benchmark for Arabic alphabet phonemes with diacritics in deep neural networks
Zhang et al. Dropdim: A regularization method for transformer networks
Zhou et al. Extracting unit embeddings using sequence-to-sequence acoustic models for unit selection speech synthesis
CN116010874A (en) Emotion recognition method based on deep learning multi-mode deep scale emotion feature fusion
Tailor et al. Deep learning approach for spoken digit recognition in Gujarati language
Saputri et al. Identifying Indonesian local languages on spontaneous speech data
CN114898779A (en) Multi-mode fused speech emotion recognition method and system
Tasnia et al. An overview of bengali speech recognition: Methods, challenges, and future direction
Li Robotic emotion recognition using two-level features fusion in audio signals of speech
CN113257240A (en) End-to-end voice recognition method based on countermeasure training
CN113763939B (en) Mixed voice recognition system and method based on end-to-end model
Zhang et al. A Non-Autoregressivee Network for Chinese Text to Speech and Voice Cloning
Huang et al. Latent discriminative representation learning for speaker recognition
Tao et al. The NLPR Speech Synthesis entry for Blizzard Challenge 2017

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination