CN111986661A - Deep neural network speech recognition method based on speech enhancement in complex environment - Google Patents

Deep neural network speech recognition method based on speech enhancement in complex environment

Info

Publication number
CN111986661A
Authority
CN
China
Prior art keywords
voice
speech
signal
frame
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010880777.7A
Other languages
Chinese (zh)
Other versions
CN111986661B (en)
Inventor
王兰美
梁涛
朱衍波
廖桂生
王桂宝
孙长征
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Shaanxi University of Technology
Original Assignee
Xidian University
Shaanxi University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University, Shaanxi University of Technology filed Critical Xidian University
Priority to CN202010880777.7A priority Critical patent/CN111986661B/en
Publication of CN111986661A publication Critical patent/CN111986661A/en
Application granted granted Critical
Publication of CN111986661B publication Critical patent/CN111986661B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 - Noise filtering

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

A deep neural network speech recognition method based on speech enhancement in a complex environment builds its model on a deep learning neural network and speech enhancement. First, a complex speech-environment data set is constructed, and speech enhancement is applied to the speech signals to be recognized under various complex conditions during the front-end speech-signal preprocessing stage. A language text data set is then established, a language model is built and trained; a Chinese dictionary file is created. Finally, a neural network acoustic model is built and trained on the enhanced speech training set with the help of the language model and the dictionary, yielding an acoustic model weight file and thereby achieving accurate recognition of Chinese speech in complex environments. This addresses the problems of existing speech recognition algorithms, which are sensitive to noise, demand high speech quality, and serve only a single application scenario.

Description

Deep neural network speech recognition method based on speech enhancement in complex environment
Technical Field
The invention belongs to the field of voice recognition, and particularly relates to a deep neural network voice recognition method based on voice enhancement in a complex environment.
Background
In recent years, technology has advanced rapidly, the economy has prospered, and society has progressed; having solved the basic needs of food, clothing, housing, and transport, people now demand more from life. This pursuit of a better life has driven the wide adoption of virtual social software that integrates life, work, and entertainment, such as QQ and WeChat. Such software brings great convenience to people's lives, work, and communication, and nearly every social application now includes a speech recognition function. Speech recognition frees users from traditional interaction methods such as the keyboard and mouse, allowing information to be conveyed in the most natural way: spoken communication. At the same time, speech recognition is gradually being applied in many fields, including industry, communications, home appliances, home services, medical care, and consumer electronics.
Most of today's social software achieves very high speech recognition accuracy under pure speech conditions, without background noise and without interfering sound sources. When the speech signal to be recognized contains noise, interference, or reverberation, however, the accuracy of existing speech recognition systems drops sharply. The main reason is that existing systems do not address denoising and interference suppression in the front-end speech-signal preprocessing stage or in the acoustic-model construction stage.
Existing Chinese speech recognition algorithms place strict requirements on the quality of the speech signal and have poor robustness; recognition can fail when the speech quality is poor or the audio is heavily polluted. They are therefore applied only in a narrow range of pure, ideal speech conditions. To broaden the application of speech recognition to real-life environments and overcome the shortcomings of existing algorithms, the invention provides a deep neural network speech recognition method based on speech enhancement in a complex environment. The method takes the deep learning neural network and speech enhancement as its technical background. First, speech enhancement is applied at the speech recognition front end to the speech signals to be recognized under various complex conditions; a language text data set is established, a language model is built and trained; a Chinese dictionary file is created; a neural network acoustic model is then built and trained on the enhanced speech training set with the help of the language model and the dictionary, yielding an acoustic model weight file and a speech recognition system that performs well in complex speech environments.
With a view to applying speech recognition in real life, the complex-environment speech recognition technology provided by the invention covers four speech environments: pure speech, Gaussian white noise, background noise or interfering sound sources, and reverberation. The method offers high recognition accuracy, strong model generalization, and good robustness to a variety of environmental factors.
Disclosure of Invention
The invention aims to provide a deep neural network speech recognition method based on speech enhancement in a complex environment.
In order to achieve the purpose, the invention adopts the following technical solutions:
A deep neural network speech recognition method based on speech enhancement in a complex environment builds its model on a deep learning neural network and speech enhancement; the overall flow of the speech recognition scheme is shown in FIG. 1. First, a complex speech-environment data set is constructed, and speech enhancement is applied to the speech signals to be recognized under complex conditions during the front-end speech-signal preprocessing stage; a language text data set is then established, a language model is built and trained; a Chinese dictionary file is created; finally, a neural network acoustic model is built and trained on the enhanced speech training set with the help of the language model and the dictionary to obtain an acoustic model weight file, thereby achieving accurate recognition of Chinese speech in complex environments. This addresses the problems of existing speech recognition algorithms, which are sensitive to noise, demand high speech quality, and serve only a single application scenario. The deep neural network speech recognition method based on speech enhancement in the complex environment comprises the following steps:
Step one: establish and process the complex-environment speech data set. Speech is collected in a pure environment, a Gaussian white noise environment, an environment with background noise or an interfering sound source, and a reverberation environment, forming the speech data set C of the speech recognition system. The speech data of each environment in C is then split into a training portion and a test portion at a ratio of 5:1 (training utterances : test utterances). The training and test portions collected under each environment are pooled and shuffled to form the training set X and the test set T. The i-th utterance in the training set X is denoted $x_i$; the j-th utterance in the test set T is denoted $t_j$. At the same time, a label document in txt format is edited for every utterance in the training set X; its content comprises the name of the utterance and the corresponding correct Chinese pinyin sequence. A partial view of a training-set label document is shown in FIG. 2.
Step two: perform speech enhancement on the established speech training set X and test set T to obtain the enhanced training set $\tilde{X}$ and test set $\tilde{T}$. The i-th utterance in the enhanced training set $\tilde{X}$ is denoted $\tilde{x}_i$; the j-th utterance in the enhanced test set $\tilde{T}$ is denoted $\tilde{t}_j$. Taking the i-th utterance $x_i$ of the training set as an example, the speech enhancement proceeds as follows. The speech signal $x_i$ to be enhanced is read with the built-in audioread function of the matlab software, yielding the sampling rate $f_s$ and a matrix $x_i(n)$ of speech samples, where $x_i(n)$ is the sample value at time n. Pre-emphasis is applied to $x_i(n)$ to obtain $y_i(n)$; a Hamming window is then applied to $y_i(n)$ for framing, giving the per-frame information $y_{i,r}(n)$, where $y_{i,r}(n)$ is the speech information matrix of the r-th frame of the i-th pre-emphasized speech signal. An FFT of $y_{i,r}(n)$ yields the short-time spectrum $Y_{i,r}(\omega_k)$ of the r-th frame of the i-th speech signal. The gammatone weighting function $H_l$ is then applied band by band to $Y_{i,r}(\omega_k)$ to obtain the power $P_{i,r,l}[r,l]$ in the l-th band of the r-th frame of the i-th signal, where l = 0, ..., 39; the power of every band of the r-th frame is obtained in the same way. Noise reduction and dereverberation are then carried out, followed by spectrum integration, giving the enhanced short-time spectrum $\tilde{Y}_{i,r}(\omega_k)$ of the r-th frame of the i-th signal; the other frames are processed in turn in the same way to obtain the short-time spectrum of each frame, and the enhanced frames are synthesized in the time domain via IFFT to obtain the enhanced speech signal $\tilde{x}_i$, which is placed in the enhanced training set $\tilde{X}$. The speech-enhancement flow is shown in FIG. 3.
Step three: build the speech recognition acoustic model. The acoustic model is built with CNN + CTC. The input layer takes a speech signal $\tilde{x}_i$ from the training set $\tilde{X}$ enhanced in step two and processes it with the MFCC feature extraction algorithm to obtain a 200-dimensional feature-value sequence. The hidden layers alternate convolutional and pooling layers repeatedly, with Dropout layers introduced to prevent overfitting; the convolution kernel size is 3 and the pooling window size is 2. The output layer is a fully-connected layer of 1423 neurons activated by a softmax function, and the CTC loss function is used as the loss function to realize connectionist temporal classification over the output sequence; the 1423-dimensional output corresponds exactly to the 1423 common Chinese pinyin syllables in the Chinese dictionary dict.txt file built in step four. The acoustic-model network framework is shown in FIG. 4, where the specific parameters of the convolutional, pooling, Dropout, and fully-connected layers are labeled.
Step four: build the 2-gram language model and the dictionary for speech recognition. This comprises establishing the language text data set, building the 2-gram language model, and collecting and establishing the Chinese dictionary. The language text data set takes the form of electronic txt files whose contents are newspapers, Chinese-language textbook passages, and well-known novels. The Chinese dictionary is a dict.txt file in which the Chinese characters corresponding to 1423 Chinese pinyin syllables commonly used in daily life are listed, taking into account that one pronunciation can map to several Chinese characters. Part of the constructed dictionary is shown in FIG. 5.
Step five: train the built 2-gram language model with the established language text data set to obtain the word-occurrence-count table and the state-transition table of the language model. The language model is trained as follows: the text content of the language text data set is read in a loop, the number of occurrences of each single word and the number of co-occurrences of each pair of adjacent words are counted, and the results are summarized into a single-word count table and a two-word state-transition table. The language-model training block diagram is shown in FIG. 6.
Step six: use the trained language model, the established dictionary, and the enhanced speech training set $\tilde{X}$ to train the built acoustic model, obtaining the acoustic model weight file and the other parameter configuration files. The acoustic-model training proceeds as follows: initialize the weights of each part of the acoustic network model; import the utterances of the training set $\tilde{X}$ in turn. For any speech signal $\tilde{x}_i$, the MFCC feature extraction algorithm first produces a 200-dimensional feature-value sequence, which is then processed in turn by the convolutional, pooling, Dropout, and fully-connected layers listed in FIG. 7; finally the output layer, a fully-connected layer of 1423 neurons activated by a softmax function, yields the 1423-dimensional acoustic features of the speech signal. These 1423-dimensional acoustic feature values are then decoded under the action of the language model and the dictionary, and the recognized Chinese pinyin sequence of the speech signal $\tilde{x}_i$ is output. The pinyin sequence recognized by the acoustic model is compared with the pinyin label sequence of $\tilde{x}_i$ in the training set $\tilde{X}$ to compute the error, which is back-propagated to update the weights of every part of the acoustic model; the loss function is the CTC loss and the optimizer is the Adam algorithm. The batch size is set to 16 and the number of epochs to 50, and the weight file is saved every 500 trained utterances. The utterances of the training set $\tilde{X}$ are processed according to these steps until the acoustic-model loss converges, after which the weight file and the various configuration files of the acoustic model are saved. The acoustic-model training block diagram is shown in FIG. 7.
Step seven: use the trained speech-enhancement-based Chinese speech recognition system to recognize the utterances of the test set $\tilde{T}$, count the speech recognition accuracy, and compare the performance with the traditional algorithm. The flow of the speech recognition test system is shown in FIG. 8. The recognition accuracy of this method and the performance comparison with the traditional algorithm are shown in part in FIG. 9 and FIG. 10.
Advantages of the invention
The deep neural network speech recognition method based on speech enhancement in a complex environment addresses the problems of existing speech recognition algorithms, which are sensitive to noise and other complex environmental factors, demand high speech quality, and serve only a single application scenario. Because the proposed method uses deep neural network learning for acoustic modeling, the resulting model has strong transfer-learning capability, and the introduction of the speech enhancement method gives the speech recognition system strong robustness against interference from complex environmental factors.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings used in the description of the present invention will be briefly introduced to better understand the inventive content of the present invention.
FIG. 1 is a detailed flow chart of the speech recognition technique of the present invention;
FIG. 2 is a partial display diagram of the phonetic labels of the speech recognition training set according to the present invention;
FIG. 3 is a block diagram of a speech recognition speech enhancement flow diagram according to the present invention;
FIG. 4 is a diagram of a speech recognition acoustic model network framework of the present invention;
FIG. 5 is a partial display diagram of a dictionary constructed according to the present invention;
FIG. 6 is a flow chart of the language model training of the present invention;
FIG. 7 is a training diagram of an acoustic model of the present invention;
FIG. 8 is a block flow diagram of a speech recognition test system of the present invention;
FIG. 9 is a diagram showing the comparison between the speech recognition algorithm of the present invention and the conventional algorithm in a noisy environment;
FIG. 10 is a comparison between the effect of the speech recognition algorithm of the present invention and the conventional algorithm in a reverberation environment;
Detailed Description
The deep neural network speech recognition method based on speech enhancement in the complex environment comprises the following specific implementation steps:
Step one: establish and process the complex-environment speech data set. Speech is collected in a pure environment, a Gaussian white noise environment, an environment with background noise or an interfering sound source, and a reverberation environment, forming the speech data set C of the speech recognition system. The speech data of each environment in C is then split into a training portion and a test portion at a ratio of 5:1 (training utterances : test utterances). The training and test portions collected under each environment are pooled and shuffled to form the training set X and the test set T. The i-th utterance in the training set X is denoted $x_i$; the j-th utterance in the test set T is denoted $t_j$. At the same time, a label document in txt format is edited for every utterance in the training set X; its content comprises the name of the utterance and the corresponding correct Chinese pinyin sequence. A partial view of a training-set label document is shown in FIG. 2.
The specific collection methods are as follows. First, pure-condition speech is collected: several speakers are recorded under ideal laboratory conditions, reading Chinese newspapers, novels, and school texts, each utterance within 10 seconds, for a total of 3000 pure utterances. Speech in the Gaussian white noise and reverberation environments is synthesized with Adobe Audition software: the recorded pure speech is mixed with Gaussian white noise, and the software's reverberation environment is used to re-synthesize the reverberant speech; 3000 utterances are produced for the Gaussian white noise environment and 3000 for the reverberation environment. Finally, speech with background noise or an interfering sound source is mainly recorded on site: several speakers record in noisy places such as factories and restaurants, for a total of 3000 utterances. All collected speech files are in wav format. The collected speech is partitioned as follows: 2500 utterances of each speech environment are used as the training set of the speech recognition system and the remaining 500 as the test set. In total the training set X contains 10000 utterances and the test set T contains 2000 utterances; both are shuffled to avoid overfitting of the trained model.
Step two: perform speech enhancement on the established speech training set X and test set T to obtain the enhanced training set $\tilde{X}$ and test set $\tilde{T}$. The i-th utterance in the enhanced training set $\tilde{X}$ is denoted $\tilde{x}_i$; the j-th utterance in the enhanced test set $\tilde{T}$ is denoted $\tilde{t}_j$. Taking the i-th utterance $x_i$ of the training set as an example, the speech enhancement proceeds as follows. The speech signal $x_i$ to be enhanced is read with the built-in audioread function of the matlab software, yielding the sampling rate $f_s$ and a matrix $x_i(n)$ of speech samples, where $x_i(n)$ is the sample value at time n. Pre-emphasis is applied to $x_i(n)$ to obtain $y_i(n)$; a Hamming window is then applied to $y_i(n)$ for framing, giving the per-frame information $y_{i,r}(n)$, where $y_{i,r}(n)$ is the speech information matrix of the r-th frame of the i-th pre-emphasized speech signal. An FFT of $y_{i,r}(n)$ yields the short-time spectrum $Y_{i,r}(\omega_k)$ of the r-th frame of the i-th speech signal. The gammatone weighting function $H_l$ is then applied band by band to $Y_{i,r}(\omega_k)$ to obtain the power $P_{i,r,l}[r,l]$ in the l-th band of the r-th frame of the i-th signal, where l = 0, ..., 39; the power of every band of the r-th frame is obtained in the same way. Noise reduction and dereverberation are then carried out, followed by spectrum integration, giving the enhanced short-time spectrum $\tilde{Y}_{i,r}(\omega_k)$ of the r-th frame of the i-th signal; the other frames are processed in turn in the same way to obtain the short-time spectrum of each frame, and the enhanced frames are synthesized in the time domain via IFFT to obtain the enhanced speech signal $\tilde{x}_i$, which is placed in the enhanced training set $\tilde{X}$. The speech-enhancement flow is shown in FIG. 3.
The speech enhancement steps are detailed below.
(I) Speech signal pre-emphasis
Pre-emphasis is applied to the i-th speech signal matrix $x_i(n)$ in the training set X to obtain $y_i(n)$, where $y_i(n) = x_i(n) - \alpha x_i(n-1)$; $\alpha$ is a constant, taken as $\alpha = 0.98$ in this patent, and $x_i(n-1)$ is the sample matrix of the i-th training utterance at time n-1.
(II) Windowing and framing
A Hamming window w(n) is applied to the pre-emphasized speech signal $y_i(n)$ to split the continuous signal into discrete frames $y_{i,r}(n)$, where
$$w(n) = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n \le N-1$$
is the Hamming window function and N is the window length; in this patent the frame length is 50 ms and the frame shift is 10 ms. Windowing and framing the pre-emphasized signal $y_i(n)$ yields the matrix information $y_{i,r}(n)$ of each frame, where $y_{i,r}(n)$ denotes the speech information matrix of the r-th frame of the i-th speech signal after pre-emphasis, windowing, and framing.
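As an illustrative sketch of the pre-emphasis and windowing/framing steps above (the patent itself relies on matlab's built-in speech-processing functions; NumPy and the 16 kHz sampling rate are assumed here):

```python
import numpy as np

def pre_emphasis(x, alpha=0.98):
    """y(n) = x(n) - alpha * x(n-1), with alpha = 0.98 as stated in the text."""
    y = np.copy(x).astype(float)
    y[1:] = x[1:] - alpha * x[:-1]
    return y

def frame_and_window(y, fs=16000, frame_ms=50, shift_ms=10):
    """Split y into 50 ms frames with a 10 ms shift and apply a Hamming window.

    Assumes len(y) >= one frame length.
    """
    frame_len = int(fs * frame_ms / 1000)      # 800 samples at 16 kHz
    frame_shift = int(fs * shift_ms / 1000)    # 160 samples at 16 kHz
    n_frames = 1 + (len(y) - frame_len) // frame_shift
    window = np.hamming(frame_len)             # 0.54 - 0.46*cos(2*pi*n/(N-1))
    frames = np.stack([
        y[r * frame_shift: r * frame_shift + frame_len] * window
        for r in range(n_frames)
    ])
    return frames                              # shape: (g, frame_len)
```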
(III) FFT
The speech information matrix $y_{i,r}(n)$ of the r-th frame of the i-th speech signal is transformed from the time domain to the frequency domain by an FFT, giving the short-time spectrum $Y_{i,r}(\omega_k)$ of the r-th frame of the i-th speech signal.
(IV) Computing the speech signal power $P_{i,r,l}[r,l]$
The short-time spectrum of each frame is processed with the gammatone weighting function to obtain the power of each frequency band of each frame of the speech signal:
$$P_{i,r,l}[r,l] = \sum_{k=0}^{N-1}\bigl|Y_{i,r}(\omega_k)\,H_l(\omega_k)\bigr|^{2}$$
Here $P_{i,r,l}[r,l]$ is the power of the speech signal $y_i(n)$ in the l-th band of the r-th frame, k is the index of the dummy variable representing discrete frequency, and $\omega_k = 2\pi k/N$ is the discrete frequency. Since the FFT uses the 50 ms frame length and the sampling rate of the speech signal is 16 kHz, N = 1024. $H_l$ is the spectrum of the gammatone filter bank for the l-th band evaluated at frequency index k; it is a built-in function of the matlab speech-processing software whose input parameter is the band l. $Y_{i,r}(\omega_k)$ is the short-time spectrum of the r-th frame of the speech signal, and L = 40 is the total number of channels.
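A short sketch of this band-power computation follows; the gammatone magnitude responses H are assumed to be precomputed as an L x N matrix (e.g. from a gammatone filter-bank toolbox), since the patent obtains them from a matlab built-in:

```python
import numpy as np

def band_powers(frames, H, n_fft=1024):
    """P[r, l] = sum_k |Y_r(w_k) * H_l(w_k)|^2 for every frame r and band l.

    frames : (g, frame_len) windowed frames from the previous step
    H      : (L, n_fft) gammatone magnitude responses, assumed precomputed
    """
    Y = np.fft.fft(frames, n=n_fft, axis=1)              # (g, n_fft) short-time spectra
    P = np.einsum('rk,lk->rl', np.abs(Y) ** 2, np.abs(H) ** 2)
    return Y, P                                          # P has shape (g, L)
```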
(V) Noise reduction and dereverberation of the speech signal
After the power $P_{i,r,l}[r,l]$ of the speech signal has been obtained, noise reduction and dereverberation are performed as follows (a sketch of steps (1) and (2) is given after this list):
(1) Compute the low-pass power $M_{i,r,l}[r,l]$ of the l-th band of the r-th frame:
$$M_{i,r,l}[r,l] = \lambda M_{i,r,l}[r-1,l] + (1-\lambda)\,P_{i,r,l}[r,l]$$
where $M_{i,r,l}[r-1,l]$ is the low-pass power of the l-th band of the (r-1)-th frame and $\lambda$ is a forgetting factor that varies with the bandwidth of the low-pass filter; in this patent $\lambda = 0.4$.
(2) Remove the slowly varying components and the power falling-edge envelope from the signal by processing the power $P_{i,r,l}[r,l]$ to obtain the enhanced power of the l-th band of the r-th frame:
$$\tilde{P}_{i,r,l}[r,l] = \max\bigl(P_{i,r,l}[r,l] - M_{i,r,l}[r,l],\; c_0\,P_{i,r,l}[r,l]\bigr)$$
where $c_0$ is a constant factor, taken as $c_0 = 0.01$ in this patent.
(3) Each band of each frame of the signal is enhanced in turn according to steps (1) and (2).
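A minimal sketch of steps (1) and (2), assuming the max() form of the envelope-removal rule written above and initializing the low-pass power with the first frame's power (a detail the text does not specify):

```python
import numpy as np

def suppress_slow_components(P, lam=0.4, c0=0.01):
    """Noise reduction / dereverberation on the band powers P of shape (g, L).

    M[r, l]     = lam * M[r-1, l] + (1 - lam) * P[r, l]    # low-pass power
    P_enh[r, l] = max(P[r, l] - M[r, l], c0 * P[r, l])     # assumed envelope-removal rule
    """
    M = np.zeros_like(P)
    P_enh = np.zeros_like(P)
    for r in range(P.shape[0]):
        prev = M[r - 1] if r > 0 else P[0]                 # initialization is an assumption
        M[r] = lam * prev + (1.0 - lam) * P[r]
        P_enh[r] = np.maximum(P[r] - M[r], c0 * P[r])
    return P_enh
```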
(VI) Spectrum integration
With the enhanced power $\tilde{P}_{i,r,l}[r,l]$ of every band of every frame of the speech signal available, spectrum integration is performed to obtain the enhanced short-time spectrum of each frame. The spectrum-integration formula is
$$\tilde{Y}_{i,r}(\omega_k) = \mu_{i,r}[r,k]\,Y_{i,r}(\omega_k)$$
where $\mu_{i,r}[r,k]$ is the spectral weight coefficient at the k-th index of the r-th frame, $Y_{i,r}(\omega_k)$ is the short-time spectrum of the r-th frame of the un-enhanced i-th speech signal, and $\tilde{Y}_{i,r}(\omega_k)$ is the short-time spectrum of the r-th frame of the enhanced i-th speech signal.
The weight $\mu_{i,r}[r,k]$ is obtained as
$$\mu_{i,r}[r,k] = \frac{\sum_{l=0}^{L-1}\omega_{i,r,l}[r,l]\,\bigl|H_l(\omega_k)\bigr|}{\sum_{l=0}^{L-1}\bigl|H_l(\omega_k)\bigr|}, \qquad 0 \le k \le N/2$$
$$\mu_{i,r}[r,k] = \mu_{i,r}[r,N-k], \qquad N/2 \le k \le N-1$$
where $H_l$ is the spectrum of the gammatone filter bank for the l-th band evaluated at frequency index k, and $\omega_{i,r,l}[r,l]$ is the weighting coefficient of the l-th band of the r-th frame of the i-th speech signal, i.e. the ratio of the enhanced to the original power of the signal:
$$\omega_{i,r,l}[r,l] = \frac{\tilde{P}_{i,r,l}[r,l]}{P_{i,r,l}[r,l]}$$
The enhanced short-time spectrum of the r-th frame of the i-th speech signal is thus obtained after spectrum integration; processing each frame in the same way yields the enhanced short-time spectrum of every frame of the i-th speech signal. An IFFT applied to the enhanced spectrum $\tilde{Y}_{i,r}(\omega_k)$ of each frame gives the time-domain signal of that frame, and the frames are spliced (overlap-added) in the time domain to obtain the enhanced speech signal. The IFFT and time-domain frame-splicing operations are
$$\tilde{y}_{i,r}(n) = \mathrm{IFFT}\bigl[\tilde{Y}_{i,r}(\omega_k)\bigr]$$
$$\tilde{x}_i(n) = \sum_{r=1}^{g}\tilde{y}_{i,r}(n)$$
where $\tilde{x}_i(n)$ is the enhanced speech-signal matrix, $\tilde{y}_{i,r}(n)$ is the enhanced speech-signal matrix of the r-th frame, and g is the total number of frames of the speech signal, which varies with the duration of the signal. Having obtained the sample matrix $\tilde{x}_i(n)$ of the enhanced speech signal, the built-in audiowrite function of the matlab speech-processing software is used to write it out at the sampling rate $f_s$ = 16 kHz, producing the enhanced speech signal $\tilde{x}_i$.
This completes the enhancement of one utterance of the speech training set; the training set X and the test set T are processed in turn according to the above steps. The enhanced training-set utterances are stored in the set $\tilde{X}$ and the enhanced test-set utterances are stored in the set $\tilde{T}$.
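The spectral weighting and frame resynthesis above can be sketched as follows; the overlap-add splicing is an assumption, since the text only states that the frames are spliced in the time domain:

```python
import numpy as np

def integrate_and_resynthesize(Y, P, P_enh, H, frame_len=800, frame_shift=160):
    """Re-weight each frame's spectrum by the band-power gains, then overlap-add.

    Y     : (g, N) original short-time spectra
    P     : (g, L) original band powers
    P_enh : (g, L) enhanced band powers
    H     : (L, N) gammatone magnitude responses (assumed precomputed)
    """
    g, N = Y.shape
    w = P_enh / np.maximum(P, 1e-12)                 # omega[r, l]: enhanced / original power
    absH = np.abs(H)
    mu = (w @ absH) / np.maximum(absH.sum(axis=0), 1e-12)   # mu[r, k] for 0 <= k <= N/2
    half = N // 2
    mu[:, half + 1:] = mu[:, 1:half][:, ::-1]        # mirror: mu[r, k] = mu[r, N - k]
    Y_enh = mu * Y
    frames = np.real(np.fft.ifft(Y_enh, axis=1))[:, :frame_len]
    out = np.zeros(frame_shift * (g - 1) + frame_len)
    for r in range(g):                               # overlap-add splicing (assumed)
        out[r * frame_shift: r * frame_shift + frame_len] += frames[r]
    return out
```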
Step three: build the speech recognition acoustic model. The acoustic model is built with CNN + CTC. The input layer takes a speech signal $\tilde{x}_i$ from the training set $\tilde{X}$ enhanced in step two and extracts a 200-dimensional feature-value sequence with the MFCC feature extraction algorithm. The hidden layers alternate convolutional and pooling layers repeatedly, with Dropout layers introduced to prevent overfitting; the convolution kernel size is 3 and the pooling window size is 2. The output layer is a fully-connected layer of 1423 neurons activated by a softmax function, and the CTC loss function is used as the loss function to realize connectionist temporal classification over the output sequence; the 1423-dimensional output corresponds exactly to the 1423 common Chinese pinyin syllables in the Chinese dictionary dict.txt document built in step four. The acoustic-model network framework is shown in FIG. 4, where the specific parameters of the convolutional, pooling, Dropout, and fully-connected layers are labeled. An illustrative model sketch follows.
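A minimal sketch of a CNN + CTC acoustic model of the kind described (Keras/TensorFlow is assumed; the number of convolutional blocks and the filter counts are illustrative placeholders, while the kernel size 3, pooling size 2, Dropout layers, and the 1423-way softmax output follow the text):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_acoustic_model(feat_dim=200, n_pinyin=1423):
    """CNN stack over MFCC features -> per-step 1423-way softmax, trained with CTC."""
    feats = layers.Input(shape=(None, feat_dim, 1), name='mfcc')    # (time, feature, 1)
    x = feats
    for filters in (32, 64, 128):                  # illustrative depth / filter counts
        x = layers.Conv2D(filters, kernel_size=3, padding='same', activation='relu')(x)
        x = layers.MaxPooling2D(pool_size=2)(x)
        x = layers.Dropout(0.2)(x)                 # Dropout to prevent overfitting
    x = layers.Reshape((-1, x.shape[2] * x.shape[3]))(x)
    y = layers.Dense(n_pinyin, activation='softmax')(x)             # one pinyin per step
    return Model(feats, y)
```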
Step four: build the speech recognition language model. This comprises establishing the language text data set, designing the 2-gram language model, and collecting and establishing the Chinese dictionary.
(I) Establishment of the language text data set
First, the text data set needed to train the language model is established. The language text data set takes the form of electronic txt files whose contents are newspapers, Chinese-language textbook passages, and well-known novels. These txt files constitute the language text database, and the text data selected for it must be representative so that it reflects the Chinese language habits of daily life.
(II) Building the 2-gram language model
This patent builds the language model with the word-based 2-gram algorithm. The 2 in 2-gram indicates that the probability of the current word is taken to depend only on the word immediately preceding it; 2 is the constrained word-sequence memory length. The 2-gram algorithm can be expressed as
$$S(W) = P(w_1)\prod_{d=2}^{q}P(w_d \mid w_{d-1})$$
where W denotes a text sequence, $w_1, w_2, \ldots, w_q$ are the individual words of the sequence, q is the length of the sequence, S(W) is the probability that the text sequence conforms to linguistic habit, and d indexes the d-th word.
(III) Chinese dictionary establishment
The language-model dictionary of the speech recognition system is built as a dict.txt file in which the Chinese characters corresponding to 1423 Chinese pinyin syllables commonly used in daily life are listed, taking into account that one pronunciation can map to several Chinese characters. Part of the dictionary constructed by the invention is shown in FIG. 5.
Step five: train the built 2-gram language model with the established language text data set to obtain the word-occurrence-count table and the state-transition table of the language model. The language-model training block diagram is shown in FIG. 6. The language model is trained as follows (a sketch is given after this list):
(1) Read the text content of the language text data set in a loop, count the occurrences of each single word, and summarize them into the single-word count table.
(2) Read the data set again in a loop, count the number of times each pair of adjacent words occurs together, and summarize the counts into the two-word state-transition table.
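A minimal sketch of these two counting passes and of scoring a sequence with the resulting tables (tokenization and file handling are assumed):

```python
from collections import Counter

def train_2gram(sentences):
    """Build the single-word count table and the two-word state-transition table.

    sentences: iterable of word lists (tokenization of the text corpus is assumed).
    """
    unigram = Counter()
    bigram = Counter()
    for words in sentences:
        unigram.update(words)
        bigram.update(zip(words[:-1], words[1:]))
    return unigram, bigram

def sequence_score(words, unigram, bigram):
    """S(W) = P(w1) * prod_d P(w_d | w_{d-1}), with maximum-likelihood counts."""
    total = sum(unigram.values())
    score = unigram[words[0]] / total if total else 0.0
    for prev, cur in zip(words[:-1], words[1:]):
        score *= bigram[(prev, cur)] / unigram[prev] if unigram[prev] else 0.0
    return score
```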
Step six: use the trained language model, the established dictionary, and the enhanced speech training set $\tilde{X}$ to train the built acoustic model, obtaining the acoustic model weight file and the other parameter configuration files. The acoustic model is trained as follows:
(1) Initialize the weights of each part of the acoustic network model.
(2) Import the utterances of the training set $\tilde{X}$ in turn. For any speech signal $\tilde{x}_i$ in the training set, the MFCC feature extraction algorithm first produces a 200-dimensional feature-value sequence of the speech signal; this sequence is then processed in turn by the convolutional, pooling, Dropout, and fully-connected layers listed in FIG. 7; finally the output layer, a fully-connected layer of 1423 neurons activated by a softmax function, yields the 1423-dimensional acoustic features of the speech signal.
(3) The 1423-dimensional acoustic feature values are decoded under the action of the language model and the dictionary, and the recognized Chinese pinyin sequence of the speech signal $\tilde{x}_i$ is output.
(4) The Chinese pinyin sequence recognized by the acoustic model is compared with the Chinese pinyin label sequence of the i-th utterance $\tilde{x}_i$ in the training set $\tilde{X}$ to compute the error, which is back-propagated to update the weights of every part of the acoustic model; the loss function is the CTC loss and the optimizer is the Adam algorithm. The batch size is set to 16 and the number of epochs to 50, and the weight file is saved every 500 trained utterances. The CTC loss function is
$$L_{\mathrm{CTC}} = -\sum_{(e,z)}\ln F(z \mid e)$$
where $L_{\mathrm{CTC}}$ is the total loss produced by training on the training set, e denotes an input speech signal $\tilde{x}_i$ of the training set $\tilde{X}$ obtained after speech enhancement, z is the output Chinese character sequence, and $F(z \mid e)$ is the probability that the output sequence is z given the input e.
(5) Train the speech recognition acoustic model according to the above steps until the acoustic-model loss converges, at which point acoustic-model training is complete; save the weight file and the various configuration files of the acoustic model. The acoustic-model training diagram is shown in FIG. 7. A sketch of this training setup follows.
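A hedged sketch of the training configuration of step six (batch size 16, 50 epochs, Adam, CTC loss); it wraps the acoustic model from the earlier sketch with Keras' ctc_batch_cost, and the data generator and checkpoint cadence are placeholders rather than details given in the patent:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model, optimizers

def add_ctc_training_head(base_model):
    """Attach a CTC loss head to the acoustic model and compile it with Adam."""
    labels = layers.Input(shape=(None,), dtype='int32', name='labels')
    input_len = layers.Input(shape=(1,), dtype='int32', name='input_len')
    label_len = layers.Input(shape=(1,), dtype='int32', name='label_len')
    y_pred = base_model.output
    ctc = layers.Lambda(
        lambda args: tf.keras.backend.ctc_batch_cost(*args), name='ctc_loss'
    )([labels, y_pred, input_len, label_len])
    train_model = Model([base_model.input, labels, input_len, label_len], ctc)
    train_model.compile(optimizer=optimizers.Adam(),
                        loss=lambda y_true, y_out: y_out)   # loss already computed above
    return train_model

# Usage sketch (the generator is a placeholder yielding batches of 16 utterances):
# train_model = add_ctc_training_head(build_acoustic_model())
# train_model.fit(train_generator, epochs=50)
```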
Step seven: use the trained speech-enhancement-based Chinese speech recognition system to recognize the utterances of the test set $\tilde{T}$, count the speech recognition accuracy, and compare the performance with the traditional algorithm. The flow of the speech recognition test system is shown in FIG. 8. The recognition accuracy of this method and the performance comparison with the traditional algorithm in a noisy environment are shown in part in FIG. 9; the recognition accuracy and the performance comparison with the traditional algorithm in a reverberant environment are shown in part in FIG. 10.
The specific implementation mode is as follows:
(1) and (3) carrying out voice recognition test on 2000 unenhanced voice test sets T of the established complex environment voice database by using a traditional voice recognition system, and counting the accuracy of voice recognition. Representative speech recognition results are listed in the figure description, and fig. 9 and fig. 10 are shown in the figure description.
(2) 2000 enhanced speech test sets of a built speech database using a speech enhancement based speech recognition system of the present invention
Figure BDA0002654055760000141
And carrying out voice recognition test and counting the voice recognition accuracy of the method. And a representative speech recognition is illustrated in the accompanying drawingsThe result figures are shown in the accompanying description of FIGS. 9 and 10.
(3) And finally, performing performance analysis on the voice recognition system based on the voice enhancement provided by the invention.
After statistics is completed, the voice recognition algorithm based on voice enhancement greatly improves the recognition accuracy of the voice in a Gaussian white noise environment, an environment with background noise or an interference sound source and a reverberation environment, and the performance is improved by about 30%; compared with the traditional speech recognition algorithm, the algorithm of the invention has greatly improved recognition accuracy, and particularly has poor performance on speech recognition in a Gaussian white noise environment, an environment with background noise or an interference sound source and a reverberation environment. A comparison graph of the recognition effect of the voice recognition algorithm and the traditional voice recognition algorithm under the condition of partial noise is shown in the figure description figure 9. A comparison graph of the recognition effect of the speech recognition algorithm of the invention and the recognition effect of the traditional speech recognition algorithm under the partial reverberation environment is shown in the figure 10.
Therefore, the deep neural network speech recognition method based on speech enhancement in the complex environment well solves the problems that the existing speech recognition algorithm is sensitive to a noise environment, high in requirement on speech quality and single in applicable scene, and realizes speech recognition in the complex speech environment.
The symbol i appearing in each step denotes the i-th speech signal of the training and test sets subjected to speech-enhancement processing, i = 1, 2, ..., 12000; the symbol r denotes the r-th frame of the speech signal, r = 1, 2, 3, ..., g; g denotes the total number of frames after the speech signal is framed, and its value varies with the duration of the processed speech; the symbol l denotes the l-th frequency band of the speech signal, l = 0, 1, 2, ..., 39; k is the index of the dummy variable representing discrete frequency, k = 0, 1, 2, ..., N-1.
Although the present invention has been described with reference to the preferred embodiments, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.
Advantages of the invention
The invention builds its model on a deep learning neural network and speech enhancement. First, a complex speech-environment data set is constructed, and speech enhancement is applied to the speech signals to be recognized under various complex conditions during the front-end speech-signal preprocessing stage; a language text data set is then established, a language model is built and trained; a Chinese dictionary file is created; a neural network acoustic model is then built and trained on the enhanced speech training set with the help of the language model and the dictionary to obtain an acoustic model weight file, thereby achieving accurate recognition of Chinese speech in complex environments. This addresses the problems of existing speech recognition algorithms, which are sensitive to noise factors, demand high speech quality, and serve only a single application scenario.

Claims (1)

1. The deep neural network speech recognition method based on speech enhancement in the complex environment comprises the following specific implementation steps:
step one, establishing and processing the complex-environment speech data set; collecting speech in a pure environment, a Gaussian white noise environment, an environment with background noise or an interfering sound source, and a reverberation environment to form the speech data set C of the speech recognition system; then splitting the speech data of each environment in the speech data set C into a training portion and a test portion at a ratio of 5:1 (training utterances : test utterances); pooling and shuffling the training and test portions collected under each environment to form the training set X and the test set T; the i-th utterance in the training set X is denoted $x_i$; the j-th utterance in the test set T is denoted $t_j$; at the same time, editing a label document in txt format for each utterance in the training set X, the content of the label document comprising the name of the utterance and the corresponding correct Chinese pinyin sequence; a partial view of a training-set label document is shown in FIG. 2;
the specific collection methods are as follows: first, pure-condition speech is collected by recording several speakers under ideal laboratory conditions, reading Chinese newspapers, novels, and school texts, each utterance within 10 seconds, for a total of 3000 pure utterances; speech in the Gaussian white noise and reverberation environments is synthesized with Adobe Audition software, specifically by mixing the recorded pure speech with Gaussian white noise and by re-synthesizing speech with the software's reverberation environment, producing 3000 utterances for the Gaussian white noise environment and 3000 for the reverberation environment; finally, speech with background noise or an interfering sound source is mainly recorded on site by several speakers in noisy places such as factories and restaurants, for a total of 3000 utterances; all collected speech files are in wav format; the collected speech is partitioned as follows: 2500 utterances of each speech environment are used as the training set of the speech recognition system and the remaining 500 as the test set; in total the training set X contains 10000 utterances and the test set T contains 2000 utterances, and both are shuffled to avoid overfitting of the trained model;
step two, performing speech enhancement on the established speech training set X and test set T to obtain the enhanced training set $\tilde{X}$ and test set $\tilde{T}$; the i-th utterance in the enhanced training set $\tilde{X}$ is denoted $\tilde{x}_i$, and the j-th utterance in the enhanced test set $\tilde{T}$ is denoted $\tilde{t}_j$; taking the i-th utterance $x_i$ of the training set as an example, the speech enhancement proceeds as follows: the speech signal $x_i$ to be enhanced is read with the built-in audioread function of the matlab software, yielding the sampling rate $f_s$ and a matrix $x_i(n)$ of speech samples, where $x_i(n)$ is the sample value at time n; pre-emphasis is applied to $x_i(n)$ to obtain $y_i(n)$; a Hamming window is then applied to $y_i(n)$ for framing, giving the per-frame information $y_{i,r}(n)$, where $y_{i,r}(n)$ is the speech information matrix of the r-th frame of the i-th pre-emphasized speech signal; an FFT of $y_{i,r}(n)$ yields the short-time spectrum $Y_{i,r}(\omega_k)$ of the r-th frame of the i-th speech signal; the gammatone weighting function $H_l$ is then applied band by band to $Y_{i,r}(\omega_k)$ to obtain the power $P_{i,r,l}[r,l]$ in the l-th band of the r-th frame of the i-th signal, where l = 0, ..., 39; the power of every band of the r-th frame is obtained in the same way; noise reduction and dereverberation are then carried out, followed by spectrum integration, giving the enhanced short-time spectrum $\tilde{Y}_{i,r}(\omega_k)$ of the r-th frame of the i-th speech signal; the speech signals of the other frames are processed in turn in the same way to obtain the short-time spectrum of each frame, and the enhanced frames are synthesized in the time domain via IFFT to obtain the enhanced speech signal $\tilde{x}_i$, which is placed in the enhanced speech training set $\tilde{X}$; a specific speech-enhancement flow diagram is shown in FIG. 3;
the speech enhancement steps are detailed below:
(I) speech signal pre-emphasis
pre-emphasis is applied to the i-th speech signal matrix $x_i(n)$ in the training set X to obtain $y_i(n)$, where $y_i(n) = x_i(n) - \alpha x_i(n-1)$; $\alpha$ is a constant, taken as $\alpha = 0.98$ in this patent; $x_i(n-1)$ is the sample matrix of the i-th training utterance at time n-1;
(II) windowing and framing
a Hamming window w(n) is applied to the pre-emphasized speech signal $y_i(n)$ to split the continuous signal into discrete frames $y_{i,r}(n)$, where
$$w(n) = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n \le N-1$$
is the Hamming window function and N is the window length; in this patent the frame length is 50 ms and the frame shift is 10 ms; windowing and framing the pre-emphasized signal $y_i(n)$ yields the matrix information $y_{i,r}(n)$ of each frame, where $y_{i,r}(n)$ denotes the speech information matrix of the r-th frame of the i-th speech signal after pre-emphasis, windowing, and framing;
(III) FFT
the speech information matrix $y_{i,r}(n)$ of the r-th frame of the i-th speech signal is transformed from the time domain to the frequency domain by an FFT, giving the short-time spectrum $Y_{i,r}(\omega_k)$ of the r-th frame of the i-th speech signal;
(IV) computing the speech signal power $P_{i,r,l}[r,l]$
the short-time spectrum of each frame is processed with the gammatone weighting function to obtain the power of each frequency band of each frame of the speech signal:
$$P_{i,r,l}[r,l] = \sum_{k=0}^{N-1}\bigl|Y_{i,r}(\omega_k)\,H_l(\omega_k)\bigr|^{2}$$
where $P_{i,r,l}[r,l]$ is the power of the speech signal $y_i(n)$ in the l-th band of the r-th frame, k is the index of the dummy variable representing discrete frequency, and $\omega_k = 2\pi k/N$ is the discrete frequency; since the FFT uses the 50 ms frame length and the sampling rate of the speech signal is 16 kHz, N = 1024; $H_l$ is the spectrum of the gammatone filter bank for the l-th band evaluated at frequency index k, a built-in function of the matlab speech-processing software whose input parameter is the band l; $Y_{i,r}(\omega_k)$ is the short-time spectrum of the r-th frame of the speech signal, and L = 40 is the total number of channels;
(V) noise reduction and dereverberation of the speech signal
after the power $P_{i,r,l}[r,l]$ of the speech signal has been obtained, noise reduction and dereverberation are performed as follows:
(1) compute the low-pass power $M_{i,r,l}[r,l]$ of the l-th band of the r-th frame:
$$M_{i,r,l}[r,l] = \lambda M_{i,r,l}[r-1,l] + (1-\lambda)\,P_{i,r,l}[r,l]$$
where $M_{i,r,l}[r-1,l]$ is the low-pass power of the l-th band of the (r-1)-th frame and $\lambda$ is a forgetting factor that varies with the bandwidth of the low-pass filter; in this patent $\lambda = 0.4$;
(2) remove the slowly varying components and the power falling-edge envelope from the signal by processing the power $P_{i,r,l}[r,l]$ to obtain the enhanced power of the l-th band of the r-th frame:
$$\tilde{P}_{i,r,l}[r,l] = \max\bigl(P_{i,r,l}[r,l] - M_{i,r,l}[r,l],\; c_0\,P_{i,r,l}[r,l]\bigr)$$
where $c_0$ is a constant factor, taken as $c_0 = 0.01$ in this patent;
(3) each band of each frame of the signal is enhanced in turn according to steps (1) and (2);
(VI) spectrum integration
with the enhanced power $\tilde{P}_{i,r,l}[r,l]$ of every band of every frame of the speech signal available, spectrum integration is performed to obtain the enhanced short-time spectrum of each frame; the spectrum-integration formula is
$$\tilde{Y}_{i,r}(\omega_k) = \mu_{i,r}[r,k]\,Y_{i,r}(\omega_k)$$
where $\mu_{i,r}[r,k]$ is the spectral weight coefficient at the k-th index of the r-th frame, $Y_{i,r}(\omega_k)$ is the short-time spectrum of the r-th frame of the un-enhanced i-th speech signal, and $\tilde{Y}_{i,r}(\omega_k)$ is the short-time spectrum of the r-th frame of the enhanced i-th speech signal;
the weight $\mu_{i,r}[r,k]$ is obtained as
$$\mu_{i,r}[r,k] = \frac{\sum_{l=0}^{L-1}\omega_{i,r,l}[r,l]\,\bigl|H_l(\omega_k)\bigr|}{\sum_{l=0}^{L-1}\bigl|H_l(\omega_k)\bigr|}, \qquad 0 \le k \le N/2$$
$$\mu_{i,r}[r,k] = \mu_{i,r}[r,N-k], \qquad N/2 \le k \le N-1$$
where $H_l$ is the spectrum of the gammatone filter bank for the l-th band evaluated at frequency index k, and $\omega_{i,r,l}[r,l]$ is the weighting coefficient of the l-th band of the r-th frame of the i-th speech signal, i.e. the ratio of the enhanced to the original power of the signal:
$$\omega_{i,r,l}[r,l] = \frac{\tilde{P}_{i,r,l}[r,l]}{P_{i,r,l}[r,l]}$$
The enhanced short-time spectrum of the r-th frame of the i-th voice signal is thereby obtained after spectrum integration, and each frame is processed in turn to obtain the enhanced short-time spectrum of every frame of the i-th voice signal; each enhanced frame spectrum Ỹ_{i,r}[r,ω_k] is converted by an IFFT to the time-domain voice signal of that frame, and the frames are spliced in the time domain to obtain the enhanced voice signal ỹ_i(n); the IFFT and the time-domain frame splicing are as follows:
ỹ_{i,r}(n) = IFFT( Ỹ_{i,r}[r,ω_k] )
ỹ_i(n) = [ ỹ_{i,1}(n), ỹ_{i,2}(n), ..., ỹ_{i,g}(n) ]
where ỹ_i(n) is the matrix of the enhanced speech signal, ỹ_{i,r}(n) is the enhanced speech signal matrix of the r-th frame, and g is the total frame number of the voice signal, whose value varies with the duration of the voice signal; the enhanced sample matrix ỹ_i(n) is then written with the voice-processing audio function built into matlab software at the speech sampling rate f_s = 16 kHz, obtaining the enhanced voice signal ỹ_i(n);
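A sketch of the inverse transform, frame splicing and write-out; non-overlapping frames and plain concatenation are assumed (the patent states only "frame splicing"), and scipy's wav writer stands in for the matlab audio function:

```python
# Illustrative sketch: back to the time domain and write the enhanced wav file.
import numpy as np
from scipy.io import wavfile

def frames_to_wav(enhanced_spectra, frame_len, path, fs=16000):
    """enhanced_spectra: (G, N) complex array of enhanced frame spectra Y~[r, w_k]."""
    frames = np.fft.ifft(enhanced_spectra, axis=1).real[:, :frame_len]  # IFFT of each frame
    y = frames.reshape(-1)                        # splice the frames along the time axis
    y = y / (np.max(np.abs(y)) + 1e-12)           # normalise to avoid clipping
    wavfile.write(path, fs, (y * 32767).astype(np.int16))  # write at fs = 16 kHz
    return y
```

With 50 ms frames at a 16 kHz sampling rate, frame_len would be 800 samples per frame.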
At this point the enhancement of one voice in the voice training set is finished; the training set X and the test set T are processed in turn according to the above steps, the enhanced training-set speech is stored in the set X̃, and the enhanced test-set speech is stored in the set T̃;
step three, building the voice recognition acoustic model; the acoustic model built by the method is modeled with CNN + CTC, and the input layer takes each speech signal x̃_i(n) from the training set X̃ enhanced in step two, extracting a 200-dimensional feature sequence with the MFCC feature extraction algorithm; the hidden layers alternate convolution layers and Dropout layers with pooling layers to prevent overfitting, the convolution kernel size of the convolution layers being 3 and the pooling window size being 2; finally, the output layer is a fully connected layer of 1423 neurons activated by a softmax function, and the CTC loss function is used to realize connectionist temporal classification output, the 1423-dimensional output corresponding exactly to the 1423 common Chinese pinyin syllables in the Chinese dictionary txt document built in step four; the specific speech recognition acoustic model network framework is shown in figure 4, where the parameters of the convolutional layers, pooling layers, Dropout layers and fully connected layer are all labeled;
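Since the exact layer stack is only given in figure 4, the following sketch shows one plausible CNN + CTC arrangement with kernel size 3, pooling window 2 and a 1423-way softmax output; the filter counts, the dense width and the extra CTC blank class are assumptions, not the patent's configuration:

```python
# Illustrative sketch of a CNN + CTC acoustic model in the spirit of step three.
import tensorflow as tf
from tensorflow.keras import layers, Model

NUM_PINYIN = 1423   # common pinyin syllables in the dictionary
FEAT_DIM = 200      # MFCC feature dimension per frame

def build_acoustic_model():
    inp = layers.Input(shape=(None, FEAT_DIM, 1), name="mfcc")   # (time, 200, 1)
    x = inp
    for filters in (32, 64, 128):                                # assumed filter counts
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.Dropout(0.2)(x)                               # Dropout against overfitting
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(pool_size=2)(x)                  # pooling window of size 2
    x = layers.TimeDistributed(layers.Flatten())(x)              # keep the (downsampled) time axis
    x = layers.Dense(256, activation="relu")(x)                  # assumed dense width
    # 1423 pinyin classes + 1 CTC blank (the blank is an implementation detail
    # not spelled out in the patent text), softmax activated.
    out = layers.Dense(NUM_PINYIN + 1, activation="softmax")(x)
    return Model(inp, out)

def ctc_loss(labels, y_pred, input_len, label_len):
    # CTC loss via the Keras helper; the length arguments are (batch, 1) tensors.
    return tf.keras.backend.ctc_batch_cost(labels, y_pred, input_len, label_len)
```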
step four, building the voice recognition language model; the language model building comprises establishing a language text data set, designing a 2-gram language model and compiling a Chinese character dictionary;
(I) Language text database establishment
Firstly, the text data set required for training the language model is established; the language text data set takes the form of electronic txt files whose contents are newspapers, Chinese textbook passages and well-known novels; these txt files constitute the language text database, and the text data selected for the database must be representative so as to reflect everyday Chinese language habits;
(II) 2-gram language model building
The method builds the language model with the 2-gram algorithm, a language model training method in which the text is divided word by word; the 2 in 2-gram indicates that each probability factor involves at most 2 consecutive words, i.e. the probability of the current word appearing is conditioned only on the word immediately before it, so 2 bounds the memory length over the word sequence; the 2-gram algorithm can be expressed as:
S(W) = P(w_1)·Π_{d=2}^{q} P(w_d | w_{d-1})
where W represents a text sequence, w_1, w_2, ..., w_q are the individual words in the sequence, and q is the length of the sequence; S(W) is the probability that the text sequence conforms to linguistic habit; d indexes the d-th word;
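As an illustration of the 2-gram score above, the following sketch evaluates S(W) in log form from the count tables produced in step five; the add-one smoothing and the table names are assumptions added here so unseen pairs do not zero out the product:

```python
# Illustrative sketch: score a word sequence with a 2-gram model built from count tables.
import math

def bigram_log_score(words, unigram_counts, bigram_counts, vocab_size):
    """Return log S(W) = log P(w1) + sum_d log P(w_d | w_{d-1}), add-one smoothed."""
    if not words:
        return 0.0
    total = sum(unigram_counts.values())
    logp = math.log((unigram_counts.get(words[0], 0) + 1) / (total + vocab_size))
    for prev, cur in zip(words, words[1:]):
        num = bigram_counts.get((prev, cur), 0) + 1          # smoothed pair count
        den = unigram_counts.get(prev, 0) + vocab_size       # smoothed history count
        logp += math.log(num / den)
    return logp
```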
(III) Chinese dictionary establishment
A language model dictionary for the voice recognition system is built; the dictionary of a language is stable and essentially invariant; the Chinese character dictionary in the invention is expressed as a dit.txt file that lists the 1423 pinyin syllables commonly used in daily life together with their corresponding Chinese characters, taking into account that in Chinese one pronunciation can map to several characters; part of the dictionary built by the invention is shown in figure 5 of the drawings;
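As an illustration, such a pinyin-to-characters dictionary could be loaded as follows; the one-syllable-per-line, tab-separated layout is an assumption, since the actual file format is only shown in figure 5:

```python
# Illustrative sketch: load a pinyin -> candidate-characters dictionary.
# Assumed line format: "<pinyin>\t<char1><char2>..." (one pinyin syllable per line).
def load_pinyin_dict(path):
    pinyin_to_chars = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split("\t")
            if len(parts) == 2:
                pinyin, chars = parts
                pinyin_to_chars[pinyin] = list(chars)   # homophones: one sound, many characters
    return pinyin_to_chars
```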
step five, training the built 2-gram language model with the established language text data set to obtain the word occurrence table and the state transition table of the language model; the specific language model training block diagram is shown in figure 6 of the drawings; the language model is trained as follows (a counting sketch follows this list):
(1) the text contents of the language text data set are read in a loop, the occurrences of single words are counted, and the counts are summarized into a single-word occurrence table;
(2) the number of times two words appear together in the language text data set is likewise accumulated in a loop and summarized into the two-word state transition table;
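A counting sketch matching the two steps above; treating each Chinese character as a "word" and reading the data set as an iterable of text lines are assumptions:

```python
# Illustrative sketch: build the single-word occurrence table and the
# two-word state-transition table from the language text data set.
from collections import Counter

def build_count_tables(lines):
    unigram_counts, bigram_counts = Counter(), Counter()
    for line in lines:
        tokens = list(line.strip())                      # per-character tokens (assumption)
        unigram_counts.update(tokens)                    # occurrence count of each word
        bigram_counts.update(zip(tokens, tokens[1:]))    # counts of adjacent word pairs
    return unigram_counts, bigram_counts
```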
step six, the built acoustic model is trained with the trained language model, the established dictionary and the enhanced voice training set X̃, yielding the weight file and the other parameter configuration files of the acoustic model; the specific acoustic model training process comprises the following steps (sketched in code after this list):
(1) initializing weights of all parts of the acoustic network model;
(2) the speech in the training set X̃ is imported in sequence for training; for any speech signal x̃_i(n), the MFCC feature extraction algorithm first produces a 200-dimensional feature sequence of the voice signal; this sequence is then processed in turn by the convolution layers, pooling layers, Dropout layers and fully connected layers listed in figure 7; finally the output layer, a fully connected layer of 1423 neurons activated by a softmax function, yields the 1423-dimensional acoustic features of the voice signal;
(3) once the feature values are obtained, the 1423-dimensional acoustic features are decoded under the action of the language model and the dictionary, and the recognized Chinese pinyin sequence of the voice signal x̃_i(n) is output;
(4) the Chinese pinyin sequence recognized by the acoustic model is compared with the Chinese pinyin label sequence of the i-th voice x̃_i(n) in the training set X̃; the error is computed and back-propagated to update the weights of all parts of the acoustic model; the loss function is the CTC loss and the optimizer is the Adam algorithm; the training batch size is set to 16, the number of iterations (epochs) to 50, and a weight file is saved once every 500 trained voices; the CTC loss is as follows:
L_CTC = -Σ_{(e,z)} ln F(z|e)
where L_CTC represents the total loss produced by training on the training set, e denotes a speech signal x̃_i(n) of the speech-enhanced training set X̃ given as input, z is the output Chinese character sequence, and F(z|e) is the probability that the output sequence is z when the input is e;
(5) the speech recognition acoustic model is trained in sequence according to the above steps until its loss converges, at which point the acoustic model training is finished; the weight file and the various configuration files of the acoustic model are saved; the specific speech recognition acoustic model training diagram is shown in figure 7 of the drawings;
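A minimal sketch of this training loop under the settings stated above (Adam, batch size 16, 50 epochs, a checkpoint roughly every 500 voices), reusing the hypothetical build_acoustic_model() and ctc_loss() helpers from the step-three sketch; train_batches is an assumed generator yielding MFCC batches, label batches and their lengths in the shapes expected by ctc_batch_cost:

```python
# Illustrative sketch of the step-six training loop (not the patent's code).
import tensorflow as tf

BATCH_SIZE, EPOCHS, SAVE_EVERY = 16, 50, 500        # settings stated in step (4)

def train_acoustic_model(train_batches):
    """train_batches(batch_size) yields (mfcc, labels, input_len, label_len) batches."""
    model = build_acoustic_model()                  # from the step-three sketch
    optimizer = tf.keras.optimizers.Adam()          # Adam optimizer
    trained_voices = 0
    for epoch in range(EPOCHS):
        for mfcc, labels, input_len, label_len in train_batches(BATCH_SIZE):
            with tf.GradientTape() as tape:
                y_pred = model(mfcc, training=True)   # (batch, time, 1424) posteriors
                loss = tf.reduce_mean(ctc_loss(labels, y_pred, input_len, label_len))
            grads = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(grads, model.trainable_variables))
            trained_voices += BATCH_SIZE
            if trained_voices % SAVE_EVERY < BATCH_SIZE:   # roughly every 500 voices
                model.save_weights(f"acoustic_model_{trained_voices}.h5")
    return model
```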
step seven, the trained Chinese voice recognition system based on voice enhancement is used to recognize the voices of the test set T̃, the voice recognition accuracy is counted, and a performance comparison analysis with the traditional algorithm is carried out; a specific speech recognition test system flow diagram is shown in figure 8; part of the comparison of the recognition accuracy of the present patent with the conventional algorithm in noisy environments is illustrated in figure 9, and part of the comparison in reverberant environments is illustrated in figure 10;
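One plausible way to count the recognition accuracy is a label-level edit-distance accuracy averaged over the test set; the patent does not specify the exact metric, so the following sketch is an assumption:

```python
# Illustrative sketch: recognition accuracy as 1 - (edit distance / reference length).
def sequence_accuracy(reference, hypothesis):
    m, n = len(reference), len(hypothesis)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                               # deletions
    for j in range(n + 1):
        d[0][j] = j                               # insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return max(0.0, 1.0 - d[m][n] / max(m, 1))

# Averaging sequence_accuracy over all test utterances gives the reported accuracy.
```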
the specific implementation mode is as follows:
(1) a traditional voice recognition system is used to run a voice recognition test on the 2000 non-enhanced voices of the test set T of the established complex-environment voice database, and the voice recognition accuracy is counted; representative voice recognition results are shown in figures 9 and 10;
(2) the speech recognition system based on speech enhancement of the present invention is used to run a voice recognition test on the 2000 enhanced voices of the test set T̃ of the built speech database, and the voice recognition accuracy of the method is counted; representative voice recognition results are shown in figures 9 and 10;
(3) finally, the performance analysis is carried out on the voice recognition system based on the voice enhancement;
after the statistics are completed, the voice recognition algorithm based on voice enhancement is found to greatly improve recognition accuracy in a Gaussian white noise environment, in environments with background noise or an interfering sound source, and in a reverberant environment, with performance improved by about 30%; compared with the traditional speech recognition algorithm, which performs poorly in exactly these environments, the algorithm of the invention performs well; a comparison of the recognition performance of the speech recognition algorithm of the invention and the traditional algorithm in selected noise environments is shown in figure 9, and the corresponding comparison in selected reverberation environments is shown in figure 10;
therefore, the deep neural network speech recognition method based on speech enhancement in a complex environment effectively solves the problems that existing speech recognition algorithms are sensitive to noisy environments, place high demands on speech quality and suit only a single scenario, and it realizes speech recognition in complex speech environments;
the symbol i appearing in each step denotes the i-th speech signal of the training and test sets subjected to speech enhancement processing, i = 1, 2, ..., 12000; the symbol r denotes the r-th frame of the speech signal, r = 1, 2, ..., g; g is the total frame number after the voice signal is framed, and its value varies with the duration of the processed voice; the symbol l denotes the l-th frequency band of the speech signal, l = 0, 1, ..., 39; k is the index of the discrete frequencies, k = 0, 1, ..., N-1.
CN202010880777.7A 2020-08-28 2020-08-28 Deep neural network voice recognition method based on voice enhancement in complex environment Active CN111986661B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010880777.7A CN111986661B (en) 2020-08-28 2020-08-28 Deep neural network voice recognition method based on voice enhancement in complex environment

Publications (2)

Publication Number Publication Date
CN111986661A true CN111986661A (en) 2020-11-24
CN111986661B CN111986661B (en) 2024-02-09

Family

ID=73440031

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010880777.7A Active CN111986661B (en) 2020-08-28 2020-08-28 Deep neural network voice recognition method based on voice enhancement in complex environment

Country Status (1)

Country Link
CN (1) CN111986661B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160240190A1 (en) * 2015-02-12 2016-08-18 Electronics And Telecommunications Research Institute Apparatus and method for large vocabulary continuous speech recognition
KR20190032868A (en) * 2017-09-20 2019-03-28 현대자동차주식회사 Method and apparatus for voice recognition
CN109272990A (en) * 2018-09-25 2019-01-25 江南大学 Audio recognition method based on convolutional neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
潘粤成; 刘卓; 潘文豪; 蔡典仑; 韦政松: "An end-to-end Mandarin speech recognition method based on CNN/CTC" (一种基于CNN/CTC的端到端普通话语音识别方法), Modern Information Technology (现代信息科技), no. 05 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112633175A (en) * 2020-12-24 2021-04-09 哈尔滨理工大学 Single note real-time recognition algorithm based on multi-scale convolution neural network under complex environment
CN112786051A (en) * 2020-12-28 2021-05-11 出门问问(苏州)信息科技有限公司 Voice data identification method and device
CN112786051B (en) * 2020-12-28 2023-08-01 问问智能信息科技有限公司 Voice data recognition method and device
CN113257262A (en) * 2021-05-11 2021-08-13 广东电网有限责任公司清远供电局 Voice signal processing method, device, equipment and storage medium
CN113808581A (en) * 2021-08-17 2021-12-17 山东大学 Chinese speech recognition method for acoustic and language model training and joint optimization
CN113808581B (en) * 2021-08-17 2024-03-12 山东大学 Chinese voice recognition method based on acoustic and language model training and joint optimization
CN114444609A (en) * 2022-02-08 2022-05-06 腾讯科技(深圳)有限公司 Data processing method and device, electronic equipment and computer readable storage medium
CN116580708A (en) * 2023-05-30 2023-08-11 中国人民解放军61623部队 Intelligent voice processing method and system

Also Published As

Publication number Publication date
CN111986661B (en) 2024-02-09

Similar Documents

Publication Publication Date Title
CN111986661A (en) Deep neural network speech recognition method based on speech enhancement in complex environment
CN112017644B (en) Sound transformation system, method and application
CN110782872A (en) Language identification method and device based on deep convolutional recurrent neural network
CN102568476B (en) Voice conversion method based on self-organizing feature map network cluster and radial basis network
CN114566189B (en) Speech emotion recognition method and system based on three-dimensional depth feature fusion
Rammo et al. Detecting the speaker language using CNN deep learning algorithm
CN115602165B (en) Digital employee intelligent system based on financial system
CN111899757A (en) Single-channel voice separation method and system for target speaker extraction
CN114495969A (en) Voice recognition method integrating voice enhancement
CN111508466A (en) Text processing method, device and equipment and computer readable storage medium
CN109452932A (en) A kind of Constitution Identification method and apparatus based on sound
CN112185363A (en) Audio processing method and device
Almekhlafi et al. A classification benchmark for Arabic alphabet phonemes with diacritics in deep neural networks
CN113744715A (en) Vocoder speech synthesis method, device, computer equipment and storage medium
CN113539243A (en) Training method of voice classification model, voice classification method and related device
CN114626424B (en) Data enhancement-based silent speech recognition method and device
Li et al. Intelligibility enhancement via normal-to-lombard speech conversion with long short-term memory network and bayesian Gaussian mixture model
CN112951270B (en) Voice fluency detection method and device and electronic equipment
CN114283822A (en) Many-to-one voice conversion method based on gamma pass frequency cepstrum coefficient
CN114550675A (en) Piano transcription method based on CNN-Bi-LSTM network
CN111009252A (en) Speech enhancement system and method of embedding codec
Dua et al. A review on Gujarati language based automatic speech recognition (ASR) systems
CN114863939B (en) Panda attribute identification method and system based on sound
Agrawal et al. Robust raw waveform speech recognition using relevance weighted representations
Shome et al. A robust DNN model for text-independent speaker identification using non-speaker embeddings in diverse data conditions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant