CN111986661B - Deep neural network voice recognition method based on voice enhancement in complex environment - Google Patents

Deep neural network voice recognition method based on voice enhancement in complex environment

Info

Publication number: CN111986661B (application number CN202010880777.7A; prior publication CN111986661A)
Authority: CN (China)
Prior art keywords: voice, speech, frame, signal, training
Legal status: Active (granted)
Other languages: Chinese (zh)
Inventors: 王兰美, 梁涛, 朱衍波, 廖桂生, 王桂宝, 孙长征
Current assignees: Xidian University; Shaanxi University of Technology
Application filed by Xidian University and Shaanxi University of Technology, with priority to CN202010880777.7A

Classifications

    • G10L 15/063: Speech recognition; creation of reference templates; training of speech recognition systems
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 21/0208: Speech enhancement; noise filtering
    • G06N 3/02, G06N 3/08: Neural networks; learning methods


Abstract

The deep neural network voice recognition method based on voice enhancement in a complex environment builds its model against the technical background of deep-learning neural networks and voice enhancement. First, a complex voice-environment data set is built, and voice enhancement is applied to the various voice signals to be recognized under complex voice conditions in the voice-signal preprocessing stage at the front end of voice recognition. Then a language text data set is established, a language model is built and trained with an algorithm, and a Chinese dictionary file is established. Finally, a neural network acoustic model is built and trained on the enhanced voice training set with the help of the language model and the dictionary, yielding the acoustic model weight file and thereby realizing accurate recognition of Chinese voice in a complex environment. The method well solves the problems that existing voice recognition algorithms are sensitive to noise factors, demand high voice quality and apply to only a single scene.

Description

Deep neural network voice recognition method based on voice enhancement in complex environment
Technical Field
The invention belongs to the field of voice recognition, and particularly relates to a deep neural network voice recognition method based on voice enhancement in a complex environment.
Background
In recent years, rapid technological innovation, economic prosperity and social progress have led people, once the basic needs of food, clothing, housing and transportation are met, to put forward further demands for a better life. Virtual social software that integrates life, work and entertainment has broad prospects, which has driven the large-scale emergence of applications such as QQ and WeChat. Virtual social software brings great convenience to people's life, work and communication, and in particular every major social application now provides a voice recognition function. Voice recognition frees people from the constraints of traditional interaction modes such as the keyboard and mouse, so that information can be conveyed by voice, the most natural mode of communication. At the same time, voice recognition is gradually finding wide application in fields such as industry, communications, home appliances, home services, medical treatment and consumer electronics.
Most of today's social software reaches a very high level of voice recognition accuracy under clean voice conditions, without background noise and without interfering sound sources. When the voice signal to be recognized contains noise, interference and reverberation, however, the accuracy of existing voice recognition systems drops sharply. The main reason is that existing voice recognition systems do not consider denoising and interference suppression in the voice-signal preprocessing stage at the recognition front end or in the acoustic-model building stage.
Existing Chinese voice recognition algorithms place strict requirements on voice-signal quality and have poor robustness; when the voice quality is poor or the audio is seriously polluted, voice recognition fails. Because such algorithms find only small-scale application under pure, ideal voice conditions, the present invention provides a deep neural network voice recognition method based on voice enhancement in a complex environment, aimed at the shortcomings of existing algorithms, in order to extend the application of voice recognition to real-life environments. The method takes deep-learning neural networks and voice enhancement as its technical background. First, voice enhancement is applied at the voice recognition front end to the voice signals to be recognized under various complex voice conditions; a language text data set is built, a language model is constructed and trained with an algorithm; a Chinese dictionary file is established; a neural network acoustic model is built and trained on the enhanced voice training set with the help of the language model and the dictionary to obtain the acoustic model weight file, thus building a well-performing voice recognition system for complex voice environments.
With a view to applying voice recognition technology in real life, the complex-environment voice recognition technology provided by the invention covers four voice environments: pure voice conditions, Gaussian white noise environments, environments with background noise or interfering sound sources, and reverberant environments. The method has high recognition accuracy, strong model generalization capability and good robustness to various environmental factors.
Disclosure of Invention
The invention aims to provide a deep neural network voice recognition method based on voice enhancement in a complex environment.
In order to achieve the above object, the present invention adopts the following technical solutions:
The deep neural network voice recognition method based on voice enhancement in a complex environment builds its model against the technical background of deep-learning neural networks and voice enhancement; the flow of the voice recognition scheme is shown in FIG. 1. First, a complex voice-environment data set is built, and voice enhancement is applied to the voice signals to be recognized under complex voice conditions in the voice-signal preprocessing stage at the front end of voice recognition. Then a language text data set is established, a language model is built and trained with an algorithm, and a Chinese dictionary file is established. Finally, a neural network acoustic model is built and trained on the enhanced voice training set with the help of the language model and the dictionary to obtain the acoustic model weight file, thereby realizing accurate recognition of Chinese voice in a complex environment. The method well solves the problems that existing voice recognition algorithms are sensitive to noise, demand high voice quality and apply to only a single scene. The deep neural network voice recognition method based on voice enhancement in a complex environment comprises the following steps:
Step one, establishing and processing a voice data set in a complex environment. Clean-environment voice, Gaussian-white-noise-environment voice, voice with background noise or interfering sound sources, and voice recorded in reverberant environments are collected together to form the voice data set C of the voice recognition system in this step. The voice data of each environment in C is then divided into a training set and a test set, the allocation ratio being 5:1 (number of training-set utterances : number of test-set utterances). The training and test sets of all environments are then pooled and shuffled to form the overall training set X and test set T. The i-th utterance in training set X is denoted x_i; the j-th utterance in test set T is denoted t_j. At the same time, a label document in txt format is edited for each utterance in training set X; its content is the name of the utterance and the corresponding correct Chinese pinyin sequence. A partial view of a training-set voice label document is shown in FIG. 2.
Step two, voice enhancement is performed on the established voice training set X and test set T to obtain the enhanced voice training set X̂ and the enhanced test set T̂. The i-th utterance of X̂ is denoted x̂_i, and the j-th utterance of T̂ is denoted t̂_j. Taking the i-th utterance x_i of the voice training set as an example, the specific voice enhancement steps are as follows. The voice signal x_i to be enhanced is read with the audioread voice-processing function built into matlab software, giving the sampling rate f_s of the voice signal and the matrix x_i(n) containing the voice information, where x_i(n) is the voice sample value at time n. Then x_i(n) is pre-emphasized to obtain y_i(n). Next, a Hamming window is applied to y_i(n) and the signal is divided into frames, giving the information y_{i,r}(n) of each frame of the voice signal, where y_{i,r}(n) is the voice information matrix of the r-th frame of the i-th pre-emphasized voice signal. An FFT is then applied to y_{i,r}(n) to obtain the short-time spectrum Y_{i,r}(k) of the r-th frame of the i-th voice signal. The gammatone weighting functions H_l are then applied band by band to Y_{i,r}(k) to obtain the power P_{i,r,l}(r,l) on the l-th frequency band of the r-th frame of the i-th voice signal, where l takes the values 0, ..., 39; the power of each frequency band of the r-th frame is obtained in turn in this way. Noise reduction and dereverberation are then carried out, followed by spectrum integration, which yields the enhanced short-time spectrum Ŷ_{i,r}(k) of the r-th frame of the i-th voice signal. The voice signals of the other frames are processed in the same way to obtain the enhanced short-time spectrum of each frame, and the enhanced voice signal x̂_i is synthesized by IFFT and time-domain frame splicing. x̂_i is placed in the enhanced voice training set X̂. The specific voice-data enhancement flow is shown in FIG. 3.
Step three, building the voice recognition acoustic model. The voice recognition acoustic model built in this patent is modeled with CNN+CTC. The input layer takes the voice signals x̂_i of the training set X̂ enhanced in step two, and each training voice signal x̂_i is processed into a 200-dimensional feature-value sequence. The hidden layers connect convolution layers and pooling layers alternately and repeatedly, and Dropout layers are introduced to prevent overfitting; the convolution kernel size of the convolution layers is 3 and the pooling window size is 2. The output layer is a fully connected layer of 1423 neurons activated by a softmax function, and the CTC loss function is used to realize connectionist-temporal-classification multi-output; the 1423-dimensional output corresponds exactly to the 1423 commonly used Chinese pinyin entries in the Chinese dictionary text document built in step four. The network framework of the voice recognition acoustic model is shown in FIG. 4, which also gives the specific parameters of the convolution, pooling, Dropout and fully connected layers.
Step four, building the 2-gram language model and dictionary for voice recognition. The language model construction comprises the establishment of a language text data set, the construction of the 2-gram language model, and the collection and establishment of a Chinese dictionary. The language text data set takes the form of electronic txt files whose content comprises newspapers, middle-school texts and famous novels. As for the dictionary, the dictionary of a language is stable and unchanging; the Chinese-character dictionary in the present invention is expressed as a text file in which the 1423 Chinese pinyin syllables commonly used in daily life are annotated with their corresponding Chinese characters, taking into account the case in which one pinyin syllable corresponds to several Chinese characters. A partial view of the dictionary built by the invention is shown in FIG. 5.
The built 2-gram language model is trained with the established language text data set to obtain the word occurrence-count table and the state-transition table of the language model. The language model is trained as follows: the text content of the language text data set is read in a loop, the number of occurrences of each single word and the number of occurrences of each pair of adjacent words are counted, and the results are summarized into a single-word occurrence-count table and a two-word state-transition table. The language model training flow is shown in FIG. 6.
Step six, the constructed acoustic model is learned and trained with the enhanced voice training set X̂, using the trained language model and the established dictionary, to obtain the weight file and the other parameter configuration files of the acoustic model. The specific acoustic model training flow is as follows: the weights of every part of the acoustic network model are initialized; the utterances of the voice training set X̂ are imported in turn for training. For an arbitrary voice signal x̂_i, an MFCC feature-extraction algorithm first produces the 200-dimensional feature-value sequence of the voice signal; this sequence is then processed in turn by the convolution, pooling, Dropout and fully connected layers as described with reference to the drawings; finally the output layer, a fully connected layer of 1423 neurons activated by a softmax function, yields the 1423-dimensional acoustic features of the voice signal. After the feature values are obtained, the 1423-dimensional acoustic feature values are decoded under the action of the language model and the dictionary, and the recognized Chinese pinyin sequence of the voice signal x̂_i is output. The Chinese pinyin sequence recognized by the acoustic model is compared with the Chinese pinyin label sequence of x̂_i in the training set X̂ and the error is calculated; the weights of every part of the acoustic model are updated by back-propagation, the CTC loss function is used as the loss, and the Adam algorithm is used for optimization. The training batch size is set to batch_size = 16 and the number of iterations to epoch = 50, and the weight file is saved once every 500 utterances. The training set X̂ is processed according to these steps until the loss of the acoustic model converges and the acoustic model is trained. The weight file and each configuration file of the acoustic model are saved. The specific voice recognition acoustic model training flow is shown in FIG. 7.
Step seven, the trained voice-enhancement-based Chinese voice recognition system is used to recognize the utterances of the test set T̂, the voice recognition accuracy is counted, and a performance comparison with the traditional algorithm is carried out. The specific flow of the voice recognition test system is shown in FIG. 8. The voice recognition accuracy of this patent and part of the performance comparison with the traditional algorithm are shown in FIG. 9 and FIG. 10.
Advantages of the invention
The deep neural network voice recognition method based on voice enhancement in a complex environment well solves the problems that existing voice recognition algorithms are sensitive to noise and other complex environmental factors, demand high voice quality and apply to only a single scene. At the same time, because the proposed method uses deep-learning neural network technology for acoustic modeling, the resulting model has strong transfer-learning capability, and the introduction of the voice enhancement method gives the voice recognition system strong robustness against interference from complex environmental factors.
Drawings
In order to illustrate the technical solution of the present invention more clearly, the drawings needed in the description of the invention are briefly introduced below, so that the invention may be better understood.
FIG. 1 is a flowchart of a speech recognition scheme according to the present invention;
FIG. 2 is a partial display of a training-set voice label document of the present invention;
FIG. 3 is a speech recognition speech enhancement flow framework diagram of the present invention;
FIG. 4 is a diagram of a network framework of a speech recognition acoustic model in accordance with the present invention;
FIG. 5 is a dictionary part display diagram constructed in accordance with the present invention;
FIG. 6 is a flow chart of language model training of the present invention;
FIG. 7 is a training diagram of an acoustic model of the present invention;
FIG. 8 is a flow chart of a speech recognition testing system of the present invention;
FIG. 9 is a diagram showing the comparison of the effects of the speech recognition algorithm of the present invention and the conventional algorithm in a noisy environment;
FIG. 10 is a comparative illustration of the effect of the speech recognition algorithm of the present invention in a reverberant environment compared to a conventional algorithm;
Detailed Description
The deep neural network voice recognition method based on voice enhancement in a complex environment comprises the following specific implementation steps:
Step one, establishing and processing a voice data set in a complex environment. Clean-environment voice, Gaussian-white-noise-environment voice, voice with background noise or interfering sound sources, and voice recorded in reverberant environments are collected together to form the voice data set C of the voice recognition system in this step. The voice data of each environment in C is then divided into a training set and a test set, the allocation ratio being 5:1 (number of training-set utterances : number of test-set utterances). The training and test sets of all environments are then pooled and shuffled to form the overall training set X and test set T. The i-th utterance in training set X is denoted x_i; the j-th utterance in test set T is denoted t_j. At the same time, a label document in txt format is edited for each utterance in training set X; its content is the name of the utterance and the corresponding correct Chinese pinyin sequence. A partial view of a training-set voice label document is shown in FIG. 2.
The specific collection method is as follows. First, voice under pure conditions is collected: multiple speakers are recorded under ideal laboratory conditions, with Chinese newspapers, novels and student texts as material and each single utterance within 10 seconds, giving 3000 pure voice recordings in total. Voice in the Gaussian white noise environment and in the reverberant environment is synthesized with Adobe audio software: the recorded pure voice is mixed with Gaussian white noise, and for reverberation the pure voice is resynthesized directly in the reverberant environments provided by the software; 3000 utterances are produced for the Gaussian white noise environment and 3000 for the reverberant environment. Finally, voice with background noise or interfering sound sources is mainly recorded in the field: several speakers are recorded on site in relatively noisy places such as factories and restaurants, for a total of 3000 utterances. All collected voice files are in wav format. The collected voices are divided as follows: 2500 utterances of each voice environment are used as the training set of the voice recognition system, and the remaining 500 as the test set. In total this gives a voice recognition training set X of 10000 utterances and a test set T of 2000 utterances; the training set and the test set are each shuffled to avoid overfitting of the trained model.
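The per-environment split and the shuffling described above can be sketched in Python as follows; the folder layout, file naming and the build_datasets helper are illustrative assumptions, not part of the patent.

```python
import random
from pathlib import Path

# Hypothetical layout: one folder of .wav recordings per acoustic environment.
ENVIRONMENTS = ["clean", "white_noise", "background_noise", "reverb"]

def build_datasets(root, per_env_train=2500, per_env_test=500, seed=0):
    """Split each environment 5:1 into train/test, then pool and shuffle both sets."""
    rng = random.Random(seed)
    train, test = [], []
    for env in ENVIRONMENTS:
        wavs = sorted(Path(root, env).glob("*.wav"))
        rng.shuffle(wavs)
        train += wavs[:per_env_train]                               # 2500 per environment
        test += wavs[per_env_train:per_env_train + per_env_test]    # 500 per environment
    rng.shuffle(train)   # shuffle the pooled sets so training order does not follow environment
    rng.shuffle(test)
    return train, test   # 10000 training and 2000 test utterances in total
```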
Step two, voice enhancement is performed on the established voice training set X and test set T to obtain the enhanced voice training set X̂ and the enhanced test set T̂. The i-th utterance of X̂ is denoted x̂_i, and the j-th utterance of T̂ is denoted t̂_j. Taking the i-th utterance x_i of the voice training set as an example, the specific voice enhancement steps are as follows. The voice signal x_i to be enhanced is read with the audioread voice-processing function built into matlab software, giving the sampling rate f_s of the voice signal and the matrix x_i(n) containing the voice information, where x_i(n) is the voice sample value at time n. Then x_i(n) is pre-emphasized to obtain y_i(n). Next, a Hamming window is applied to y_i(n) and the signal is divided into frames, giving the information y_{i,r}(n) of each frame of the voice signal, where y_{i,r}(n) is the voice information matrix of the r-th frame of the i-th pre-emphasized voice signal. An FFT is then applied to y_{i,r}(n) to obtain the short-time spectrum Y_{i,r}(k) of the r-th frame of the i-th voice signal. The gammatone weighting functions H_l are then applied band by band to Y_{i,r}(k) to obtain the power P_{i,r,l}(r,l) on the l-th frequency band of the r-th frame of the i-th voice signal, where l takes the values 0, ..., 39; the power of each frequency band of the r-th frame is obtained in turn in this way. Noise reduction and dereverberation are then carried out, followed by spectrum integration, which yields the enhanced short-time spectrum Ŷ_{i,r}(k) of the r-th frame of the i-th voice signal. The voice signals of the other frames are processed in the same way to obtain the enhanced short-time spectrum of each frame, and the enhanced voice signal x̂_i is synthesized by IFFT and time-domain frame splicing. x̂_i is placed in the enhanced voice training set X̂. The specific voice-data enhancement flow is shown in FIG. 3.
Each step of voice enhancement is specifically described in detail as follows:
(I) Pre-emphasis of the voice signal
The i-th voice-signal matrix x_i(n) in training set X is pre-emphasized to give y_i(n), where y_i(n) = x_i(n) − α·x_i(n−1); α is a constant, taken as α = 0.98 in this patent; x_i(n−1) is the sample matrix of the i-th training-set voice at time n−1.
(II) Windowing and framing
The pre-emphasized voice signal y_i(n) is windowed with a Hamming window w(n) and framed, turning the continuous voice signal into the frame-by-frame discrete signals y_{i,r}(n);
where w(n) = 0.54 − 0.46·cos(2πn/(N−1)), 0 ≤ n ≤ N−1, is the Hamming window function and N is the window length; in this patent the frame length is taken as 50 ms and the frame shift as 10 ms. Windowing and framing the pre-emphasized voice signal y_i(n) gives the matrix information y_{i,r}(n) of each frame of the voice signal; y_{i,r}(n) denotes the voice information matrix of the r-th frame of the i-th voice signal after pre-emphasis, windowing and framing.
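As an illustration of the pre-emphasis and windowing/framing of steps (I) and (II), here is a minimal NumPy sketch; α = 0.98, the 50 ms frame length and the 10 ms frame shift at f_s = 16 kHz follow the text, while the function names are our own.

```python
import numpy as np

def pre_emphasis(x, alpha=0.98):
    """y(n) = x(n) - alpha * x(n - 1), keeping y(0) = x(0)."""
    x = np.asarray(x, dtype=np.float64)
    y = x.copy()
    y[1:] = x[1:] - alpha * x[:-1]
    return y

def frame_and_window(y, fs=16000, frame_ms=50, shift_ms=10):
    """Cut y into overlapping frames and apply a Hamming window to each frame."""
    frame_len = int(fs * frame_ms / 1000)    # 800 samples per frame
    shift = int(fs * shift_ms / 1000)        # 160 samples frame shift
    assert len(y) >= frame_len, "utterance shorter than one frame"
    n_frames = 1 + (len(y) - frame_len) // shift
    window = np.hamming(frame_len)           # 0.54 - 0.46*cos(2*pi*n/(N-1))
    return np.stack([y[r * shift: r * shift + frame_len] * window
                     for r in range(n_frames)])   # shape (n_frames, frame_len)
```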
(III) FFT transforms
The voice information matrix y_{i,r}(n) of the r-th frame of the i-th voice signal is transformed from the time domain to the frequency domain by FFT to obtain the short-time spectrum Y_{i,r}(k) of the r-th frame of the i-th voice signal.
(IV) Computing the voice-signal power P_{i,r,l}(r,l)
The short-time spectrum Y_{i,r}(k) of each frame is processed with the gammatone weighting functions to obtain the power of each frequency band of each frame of the voice signal:
P_{i,r,l}(r,l) = Σ_{k=0}^{N−1} |Y_{i,r}(ω_k)·H_l(ω_k)|²,
where P_{i,r,l}(r,l) is the power of the voice signal y_i(n) in the l-th frequency band of the r-th frame, k is a dummy-variable index of the discrete frequency, and ω_k = 2πk/N is the discrete frequency. Since a frame length of 50 ms is used for the FFT and the sampling rate of the voice signal is 16 kHz, N = 1024. H_l, the spectrum of the gammatone filter bank of the l-th frequency band evaluated at frequency index k, is a built-in matlab voice-processing function whose input parameter is the frequency band l; Y_{i,r} is the short-time spectrum of the r-th frame of the voice signal, and L = 40 is the total number of channels.
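A sketch of the FFT and gammatone band-power computation of steps (III) and (IV) follows; the 40-band gammatone magnitude response H is assumed to be given as a 40 × 513 array (the patent obtains it from a matlab voice-processing function), and the summation over the one-sided spectrum is an implementation choice.

```python
import numpy as np

N_FFT = 1024     # 50 ms frames at 16 kHz (800 samples), zero-padded to 1024 FFT points
N_BANDS = 40     # gammatone channels l = 0, ..., 39

def band_powers(frames, H):
    """Return the short-time spectra Y[r, k] and the band powers P[r, l].

    frames : (n_frames, frame_len) windowed time-domain frames
    H      : (N_BANDS, N_FFT // 2 + 1) gammatone magnitude responses (assumed given)
    P[r, l] = sum_k |Y_r(w_k) * H_l(w_k)|^2
    """
    Y = np.fft.rfft(frames, n=N_FFT, axis=1)                       # (n_frames, 513)
    P = ((np.abs(Y)[:, None, :] * H[None, :, :]) ** 2).sum(axis=2)
    return Y, P                                                     # P: (n_frames, N_BANDS)
```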
(V) Noise reduction and dereverberation of the voice signal
After the voice-signal power P_{i,r,l}(r,l) has been obtained, noise reduction and dereverberation are carried out, with the following specific steps:
(1) Compute the low-pass power M_{i,r,l}[r,l] of the l-th frequency band of the r-th frame, using the formula
M_{i,r,l}[r,l] = λ·M_{i,r,l}[r−1,l] + (1−λ)·P_{i,r,l}[r,l],
where M_{i,r,l}[r−1,l] is the low-pass power of the l-th frequency band of frame r−1, and λ is a forgetting factor that varies with the bandwidth of the low-pass filter; λ = 0.4 in this patent.
(2) Remove the slowly varying components and the power falling-edge envelope of the signal: the power P_{i,r,l}[r,l] of the voice signal is processed to obtain the enhanced power P̂_{i,r,l}[r,l] of the l-th frequency band of the r-th frame,
P̂_{i,r,l}[r,l] = max( P_{i,r,l}[r,l] − M_{i,r,l}[r,l], c_0·P_{i,r,l}[r,l] ),
where c_0 is a constant factor; this patent takes c_0 = 0.01.
(3) Each frequency band of each frame of the signal is enhanced according to steps (1) and (2); a sketch of this per-band processing is given below.
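The per-band noise-reduction/dereverberation of step (V) can be sketched as below; λ = 0.4 and c_0 = 0.01 follow the text, while initializing the low-pass power with the first frame's power and the max(·) flooring form of the enhanced power are assumptions.

```python
import numpy as np

def suppress_slow_components(P, lam=0.4, c0=0.01):
    """Per-band noise reduction / dereverberation on the band powers P (n_frames, n_bands)."""
    M = np.zeros_like(P)
    P_hat = np.zeros_like(P)
    for r in range(P.shape[0]):
        prev = M[r - 1] if r > 0 else P[0]              # assumed initialization
        M[r] = lam * prev + (1.0 - lam) * P[r]          # low-pass power of each band
        P_hat[r] = np.maximum(P[r] - M[r], c0 * P[r])   # drop the slow component, floor at c0*P
    return P_hat
```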
(VI) Spectrum integration
After the enhanced power P̂_{i,r,l}[r,l] of each frequency band of each frame of the voice signal has been obtained, spectrum integration of the voice signal yields the enhanced short-time spectrum of each frame. The spectrum-integration formula is
Ŷ_{i,r}(k) = μ_{i,r}[r,k]·Y_{i,r}(k),
where μ_{i,r}[r,k] is the spectral weight coefficient at the k-th index of the r-th frame, Y_{i,r}(k) is the short-time spectrum of the r-th frame of the unenhanced i-th voice signal, and Ŷ_{i,r}(k) is the enhanced short-time spectrum of the r-th frame of the i-th voice signal.
The weight μ_{i,r}[r,k] is obtained as
μ_{i,r}[r,k] = ( Σ_{l=0}^{L−1} ω_{i,r,l}[r,l]·|H_l(ω_k)| ) / ( Σ_{l=0}^{L−1} |H_l(ω_k)| ), 0 ≤ k ≤ N/2,
μ_{i,r}[r,k] = μ_{i,r}[r,N−k], N/2 ≤ k ≤ N−1,
where H_l is the spectrum of the gammatone filter bank of the l-th frequency band evaluated at frequency index k, and ω_{i,r,l}[r,l] is the weight coefficient of the l-th frequency band of the r-th frame of the i-th voice signal, namely the ratio of the enhanced spectrum to the original spectrum of the signal, obtained as
ω_{i,r,l}[r,l] = sqrt( P̂_{i,r,l}[r,l] / P_{i,r,l}[r,l] ).
The enhanced short-time spectrum of the r-th frame of the i-th voice signal is thus obtained after spectrum integration, and each frame is processed in turn according to the same operation to obtain the enhanced short-time spectrum of every frame of the i-th voice signal. The enhanced spectrum Ŷ_{i,r}(k) of each frame is transformed by IFFT to obtain the time-domain voice signal ŷ_{i,r}(n) of each frame, and the frames are spliced in the time domain to give the enhanced voice signal x̂_i(n):
x̂_i(n) = Σ_{r=1}^{G} ŷ_{i,r}(n), G being the total number of frames,
where x̂_i(n) is the enhanced voice-signal matrix, ŷ_{i,r}(n) is the enhanced voice-signal matrix of the r-th frame, and G, the total number of frames of the voice signal, varies with the duration of the voice signal. After the sample matrix x̂_i(n) of the enhanced voice signal at time n has been obtained, the enhanced voice signal x̂_i is written out with the audiowrite function built into matlab software at the voice-signal sampling rate f_s = 16 kHz.
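Continuing the previous sketches, the spectrum integration and time-domain resynthesis can be written roughly as below; the square-root form of ω, the use of the one-sided spectrum and the plain overlap-add are assumptions consistent with the prose description.

```python
import numpy as np

def integrate_and_resynthesize(Y, P, P_hat, H, frame_len=800, shift=160):
    """Reweight each frame's spectrum with mu[r, k] and splice the enhanced frames."""
    eps = 1e-12
    w = np.sqrt((P_hat + eps) / (P + eps))        # omega[r, l]: enhanced / original band ratio
    mu = (w @ H) / (H.sum(axis=0) + eps)          # mu[r, k] over the one-sided spectrum
    Y_hat = mu * Y                                # enhanced short-time spectra
    frames = np.fft.irfft(Y_hat, axis=1)[:, :frame_len]   # IFFT back to time-domain frames
    out = np.zeros(shift * (len(frames) - 1) + frame_len)
    for r, frame in enumerate(frames):            # time-domain frame splicing (overlap-add)
        out[r * shift: r * shift + frame_len] += frame
    return out
# The result can be written to a 16 kHz wav file, e.g. with scipy.io.wavfile.write.
```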
At this point the enhancement of one voice in the voice training set is finished; the training set X and the test set T are then processed in turn according to the above steps. The enhanced training-set voices are saved in the set X̂, and the enhanced test-set voices are saved in the set T̂.
Step three, building the voice recognition acoustic model. The voice recognition acoustic model built in this patent is modeled with CNN+CTC. The input layer takes the voice signals x̂_i of the training set X̂ enhanced in step two, and their feature-value sequences are extracted with an MFCC feature-extraction algorithm. The hidden layers connect convolution layers and pooling layers alternately and repeatedly, and Dropout layers are introduced to prevent overfitting; the convolution kernel size of the convolution layers is 3 and the pooling window size is 2. Finally, the output layer uses a fully connected layer of 1423 neurons activated by a softmax function, and the CTC loss function is used to realize connectionist-temporal-classification multi-output; the 1423-dimensional output corresponds exactly to the 1423 commonly used Chinese pinyin entries in the Chinese dictionary text.txt document built in step four. The network framework of the voice recognition acoustic model is shown in FIG. 4, which also gives the specific parameters of the convolution, pooling, Dropout and fully connected layers.
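A minimal sketch of an acoustic model in the spirit of step three, written with tensorflow.keras, is given below; the 200-dimensional input features, kernel size 3 and pooling size 2 follow the text, but the number of blocks, the channel widths and the extra output unit for the CTC blank are illustrative assumptions rather than the exact parameters of FIG. 4.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_acoustic_model(feat_dim=200, n_pinyin=1423):
    """CNN acoustic model: (time, feat_dim) features -> per-step pinyin posteriors for CTC."""
    feats = layers.Input(shape=(None, feat_dim, 1), name="features")   # variable-length input
    x = feats
    for filters in (32, 64, 128):                    # channel widths are illustrative
        x = layers.Conv2D(filters, kernel_size=3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(pool_size=2)(x)      # pooling window size 2
        x = layers.Dropout(0.2)(x)                   # Dropout against overfitting
    freq = feat_dim // 8                             # 200 -> 25 after three poolings
    x = layers.Reshape((-1, freq * 128))(x)          # (batch, reduced time, flattened features)
    x = layers.Dense(256, activation="relu")(x)      # assumed hidden width
    out = layers.Dense(n_pinyin + 1, activation="softmax", name="pinyin")(x)  # +1 for the CTC blank
    return Model(feats, out)

acoustic_model = build_acoustic_model()
```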
Step four, building the voice recognition language model. The language model construction comprises the establishment of a language text data set, the design of a 2-gram language model and the collection of a Chinese dictionary.
(I) Establishment of the language text database
First, the text data set required to train the language model is built. The language text data set takes the form of electronic txt files whose content comprises newspapers, middle-school texts and famous novels. The text data in the language text database should be chosen to be representative, so that it reflects the Chinese language habits of daily life.
(II) 2-gram language model construction
This patent builds the language model with the 2-gram algorithm, a language-model training method that performs its own word segmentation. The 2 in 2-gram indicates that the probability of the current word is considered to depend only on the word immediately preceding it; 2 is the constraint on the memory length of the word sequence. The 2-gram formula can be expressed as
S(W) = P(w_1)·Π_{d=2}^{q} P(w_d | w_{d−1}),
where W denotes a text sequence, w_1, w_2, ..., w_q denote the individual words of the text sequence, q is the length of the text sequence, S(W) is the probability that the text sequence conforms to linguistic habit, and d indexes the d-th word.
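As a small worked illustration of this formula, the sketch below scores a word sequence from a single-word count table and a two-word state-transition (count) table; the count tables and the absence of smoothing are assumptions.

```python
from collections import Counter

def bigram_score(words, unigram: Counter, bigram: Counter):
    """S(W) = P(w1) * prod_d P(w_d | w_{d-1}), estimated from raw counts."""
    total = sum(unigram.values())
    if total == 0 or unigram[words[0]] == 0:
        return 0.0
    score = unigram[words[0]] / total
    for prev, cur in zip(words, words[1:]):
        if unigram[prev] == 0:
            return 0.0                      # unseen history; no smoothing in this sketch
        score *= bigram[(prev, cur)] / unigram[prev]
    return score
```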
(III) Chinese dictionary creation
A language-model dictionary is constructed for the voice recognition system. The dictionary of a language is stable and unchanging; the Chinese-character dictionary in the present invention is expressed as a text file in which the 1423 Chinese pinyin syllables commonly used in daily life are annotated with their corresponding Chinese characters, taking into account the case in which one pinyin syllable corresponds to several Chinese characters. A partial view of the dictionary built by the invention is shown in FIG. 5.
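Reading such a dictionary amounts to building a mapping from each pinyin syllable to its candidate characters; the file layout assumed below (one pinyin followed by its characters per line) is only illustrative of FIG. 5, not the actual file.

```python
def load_dictionary(path="dict.txt"):
    """Return {pinyin: [candidate Chinese characters]} from a 'pinyin char char ...' text file."""
    lexicon = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if parts:
                lexicon[parts[0]] = parts[1:]   # one sound may map to several characters
    return lexicon
```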
The built 2-gram language model is trained with the established language text data set to obtain the word occurrence-count table and the state-transition table of the language model. The specific language model training flow is shown in FIG. 6. The language model is trained in the following two passes (a counting sketch follows the list):
(1) The text content of the language text data set is read in a loop, the number of occurrences of each single word is counted, and the results are summarized into a single-word occurrence-count table.
(2) The number of times each pair of adjacent words occurs together in the language text data set is obtained in a loop, and the results are summarized into a two-word state-transition table.
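The two counting passes can be sketched as below; treating each Chinese character as a "word" and reading the corpus from a folder of txt files are assumptions.

```python
from collections import Counter
from pathlib import Path

def train_bigram(corpus_dir="corpus"):
    """Build the single-word count table and the two-word state-transition table."""
    unigram, bigram = Counter(), Counter()
    for txt in Path(corpus_dir).glob("*.txt"):
        text = txt.read_text(encoding="utf-8")
        tokens = [ch for ch in text if not ch.isspace()]   # character-level tokens
        unigram.update(tokens)                              # single-word occurrence counts
        bigram.update(zip(tokens, tokens[1:]))              # adjacent-pair occurrence counts
    return unigram, bigram
```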
Step six, the constructed acoustic model is learned and trained with the enhanced voice training set X̂, using the trained language model and the established dictionary, to obtain the weight file and the other parameter configuration files of the acoustic model. The specific acoustic model training flow is as follows (a keras-based training sketch is given after the five steps below):
(1) Initialize the weights of every part of the acoustic network model;
(2) The utterances of the voice training set X̂ are imported in turn for training. For an arbitrary voice signal x̂_i, an MFCC feature-extraction algorithm first produces the 200-dimensional feature-value sequence of the voice signal; as described with reference to FIG. 7, this sequence is then processed in turn by all the convolution, pooling, Dropout and fully connected layers; finally, the output layer, a fully connected layer of 1423 neurons activated by a softmax function, yields the 1423-dimensional acoustic features of the voice signal;
(3) After the feature values are obtained, the 1423-dimensional acoustic feature values are decoded under the action of the language model and the dictionary, and the recognized Chinese pinyin sequence of the voice signal x̂_i is output;
(4) The Chinese pinyin sequence recognized by the acoustic model is compared with the Chinese pinyin label sequence of the i-th utterance x̂_i of the training set X̂ and the error is calculated; the weights of every part of the acoustic model are updated by back-propagation, the CTC loss function is used as the loss, and the Adam algorithm is used for optimization. The training batch size is set to batch_size = 16 and the number of iterations to epoch = 50, and the weight file is saved once every 500 training utterances. The CTC loss function is as follows:
ℒ = −Σ_{(e,z)} ln F(z|e),
where ℒ is the total loss produced after training on the training set, e is the input voice, i.e. a voice signal x̂_i of the voice-enhanced training set X̂, z is the output Chinese character sequence, and F(z|e) is the probability that the output sequence is z when the input is e.
(5) The voice recognition acoustic model is trained according to the above steps until the loss of the acoustic model converges, at which point training of the acoustic model is finished. The weight file and each configuration file of the acoustic model are saved. The specific voice recognition acoustic model training flow is shown in FIG. 7.
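A rough tensorflow.keras training harness for step six is sketched below, reusing build_acoustic_model from the sketch in step three; the Adam optimizer, batch_size = 16 and epoch = 50 follow the text, while the CTC wiring via K.ctc_batch_cost, the data generator and the checkpoint interval (roughly every 500 utterances) are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model, backend as K

def add_ctc_head(acoustic_model, max_label_len=64):
    """Wrap the acoustic model so its output is the CTC loss of each utterance."""
    labels = layers.Input(shape=(max_label_len,), dtype="int32", name="labels")
    input_len = layers.Input(shape=(1,), dtype="int32", name="input_length")
    label_len = layers.Input(shape=(1,), dtype="int32", name="label_length")
    ctc = layers.Lambda(lambda a: K.ctc_batch_cost(a[0], a[1], a[2], a[3]), name="ctc")(
        [labels, acoustic_model.output, input_len, label_len])
    return Model([acoustic_model.input, labels, input_len, label_len], ctc)

train_model = add_ctc_head(build_acoustic_model())
train_model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                    loss=lambda y_true, y_pred: y_pred)        # the Lambda already returns the loss
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "acoustic_weights_{epoch:02d}.h5", save_weights_only=True,
    save_freq=500 // 16)                                       # about every 500 utterances
# train_model.fit(train_generator, epochs=50, callbacks=[checkpoint])
# (train_generator yields batches of 16: features, labels, lengths, plus dummy targets)
```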
Step seven, the trained voice-enhancement-based Chinese voice recognition system is used to recognize the utterances of the test set T̂, the voice recognition accuracy is counted, and a performance comparison with the traditional algorithm is carried out. The specific flow of the voice recognition test system is shown in FIG. 8. The voice recognition accuracy of this patent compared with the traditional algorithm in a noisy environment is partly shown in FIG. 9, and the comparison in a reverberant environment is partly shown in FIG. 10.
The specific implementation mode is as follows:
(1) The traditional voice recognition system is used to run a voice recognition test on the 2000 unenhanced voices of the test set T of the established complex-environment voice database, and the voice recognition accuracy is counted. Representative voice recognition results are shown in FIG. 9 and FIG. 10.
(2) The voice-enhancement-based voice recognition system of the present invention is used to run a voice recognition test on the 2000 enhanced voices of the test set T̂ of the established voice database, and the voice recognition accuracy of the method is counted. Representative voice recognition results are shown in FIG. 9 and FIG. 10.
(3) Finally, performance analysis is carried out on the voice recognition system based on voice enhancement.
Statistics show that the voice-enhancement-based voice recognition algorithm provided by the invention greatly improves the recognition accuracy for voice in the Gaussian white noise environment, the background-noise or interfering-sound-source environment and the reverberant environment, with a performance improvement of about 30%; compared with the traditional voice recognition algorithm, the recognition accuracy is greatly improved. In particular, the traditional algorithm performs poorly on voice recognition in the Gaussian white noise, background-noise or interfering-sound-source and reverberant environments, whereas the present algorithm performs excellently there. FIG. 9 compares the recognition effect of the voice recognition algorithm of the invention with that of the traditional voice recognition algorithm in part of the noisy environments, and FIG. 10 gives the comparison in part of the reverberant environments.
Therefore, the deep neural network voice recognition method based on voice enhancement in a complex environment well solves the problems that existing voice recognition algorithms are sensitive to noisy environments, demand high voice quality and apply to only a single scene, and realizes voice recognition in complex voice environments.
The symbol i appearing in the above steps denotes the i-th voice signal of the training and test sets subjected to voice-enhancement processing, i = 1, 2, ..., 12000; the symbol r denotes the r-th frame of a voice signal, r = 1, 2, 3, ..., G; G denotes the total number of frames after framing of the voice signal, and its value changes with the duration of the processed voice; the symbol l denotes the l-th frequency band of the voice signal, l = 0, 1, 2, ..., 39; k is a dummy-variable index of the discrete frequency, k = 0, 1, 2, ..., N−1.
The present invention is not limited to the preferred embodiments, but is capable of modification and variation in detail, and other modifications and variations can be made by those skilled in the art without departing from the scope of the present invention.

Claims (1)

1. A deep neural network voice recognition method based on voice enhancement in a complex environment, comprising the following specific implementation steps:
step one, establishing and processing a voice data set in a complex environment; clean-environment voice, Gaussian-white-noise-environment voice, voice with background noise or interfering sound sources, and voice recorded in reverberant environments are collected to form the voice data set C of the voice recognition system; the voice data of each environment in the voice data set C is then divided into a training set and a test set, the allocation ratio being 5:1 (number of training-set utterances : number of test-set utterances); the training and test sets of all environments are pooled and shuffled to form the training set X and the test set T; the i-th utterance in training set X is denoted x_i; the j-th utterance in test set T is denoted t_j; at the same time, a label document in txt format is edited for each utterance in training set X, its content being the name of the utterance and the corresponding correct Chinese pinyin sequence;
the specific collection method is as follows: first, voice under pure conditions is collected, with multiple speakers recorded under ideal laboratory conditions, Chinese newspapers, novels and student texts as material and each single utterance within 10 seconds, giving 3000 pure voice recordings in total; voice in the Gaussian white noise environment and in the reverberant environment is synthesized with Adobe audio software, the recorded pure voice being mixed with Gaussian white noise and, for reverberation, the pure voice being resynthesized directly in the reverberant environments provided by the software; 3000 utterances are produced for the Gaussian white noise environment and 3000 for the reverberant environment; finally, voice with background noise or interfering sound sources is mainly recorded in the field, several speakers being recorded on site in relatively noisy places such as factories and restaurants, for a total of 3000 utterances; all collected voice files are in wav format; the collected voices are divided as follows: 2500 utterances of each voice environment are used as the training set of the voice recognition system, and the remaining 500 as the test set; in total this gives a voice recognition training set X of 10000 utterances and a test set T of 2000 utterances; the training set and the test set are each shuffled to avoid overfitting of the trained model;
step two, voice enhancement is performed on the established voice training set X and test set T to obtain the enhanced voice training set X̂ and the enhanced test set T̂; the i-th utterance of X̂ is denoted x̂_i, and the j-th utterance of T̂ is denoted t̂_j; taking the i-th utterance x_i of the voice training set as an example, the specific voice enhancement steps are as follows: the voice signal x_i to be enhanced is read with the audioread voice-processing function built into matlab software, giving the sampling rate f_s of the voice signal and the matrix x_i(n) containing the voice information, where x_i(n) is the voice sample value at time n; then x_i(n) is pre-emphasized to obtain y_i(n); next, a Hamming window is applied to y_i(n) and the signal is divided into frames, giving the information y_{i,r}(n) of each frame of the voice signal, where y_{i,r}(n) is the voice information matrix of the r-th frame of the i-th pre-emphasized voice signal; an FFT is then applied to y_{i,r}(n) to obtain the short-time spectrum Y_{i,r}(k) of the r-th frame of the i-th voice signal; the gammatone weighting functions H_l are then applied band by band to Y_{i,r}(k) to obtain the power P_{i,r,l}(r,l) on the l-th frequency band of the r-th frame of the i-th voice signal, where l takes the values 0, ..., 39, and the power of each frequency band of the r-th frame is obtained in turn in this way; noise reduction and dereverberation are then carried out, followed by spectrum integration, which yields the enhanced short-time spectrum Ŷ_{i,r}(k) of the r-th frame of the i-th voice signal; the voice signals of the other frames are processed in the same way to obtain the enhanced short-time spectrum of each frame, and the enhanced voice signal x̂_i is synthesized by IFFT and time-domain frame splicing; x̂_i is placed in the enhanced voice training set X̂;
each step of voice enhancement is specifically described in detail as follows:
(I) Pre-emphasis of the voice signal
the i-th voice-signal matrix x_i(n) in training set X is pre-emphasized to give y_i(n), where y_i(n) = x_i(n) − α·x_i(n−1), α being a constant, α = 0.98; x_i(n−1) is the sample matrix of the i-th training-set voice at time n−1;
(II) Windowing and framing
the pre-emphasized voice signal y_i(n) is windowed with a Hamming window w(n) and framed, turning the continuous voice signal into the frame-by-frame discrete signals y_{i,r}(n);
where w(n) = 0.54 − 0.46·cos(2πn/(N−1)), 0 ≤ n ≤ N−1, is the Hamming window function and N is the window length; the frame length is taken as 50 ms and the frame shift as 10 ms; windowing and framing the pre-emphasized voice signal y_i(n) gives the matrix information y_{i,r}(n) of each frame of the voice signal; y_{i,r}(n) denotes the voice information matrix of the r-th frame of the i-th voice signal after pre-emphasis, windowing and framing;
(III) FFT transforms
the voice information matrix y_{i,r}(n) of the r-th frame of the i-th voice signal is transformed from the time domain to the frequency domain by FFT to obtain the short-time spectrum Y_{i,r}(k) of the r-th frame of the i-th voice signal;
(IV) Computing the voice-signal power P_{i,r,l}(r,l)
the short-time spectrum Y_{i,r}(k) of each frame is processed with the gammatone weighting functions to obtain the power of each frequency band of each frame of the voice signal:
P_{i,r,l}(r,l) = Σ_{k=0}^{N−1} |Y_{i,r}(ω_k)·H_l(ω_k)|²,
where P_{i,r,l}(r,l) is the power of the voice signal y_i(n) in the l-th frequency band of the r-th frame, k is a dummy-variable index of the discrete frequency, and ω_k = 2πk/N is the discrete frequency; since a frame length of 50 ms is used for the FFT and the sampling rate of the voice signal is 16 kHz, N = 1024; H_l, the spectrum of the gammatone filter bank of the l-th frequency band evaluated at frequency index k, is a built-in matlab voice-processing function whose input parameter is the frequency band l; Y_{i,r} is the short-time spectrum of the r-th frame of the voice signal, and L = 40 is the total number of channels;
(V) Noise reduction and dereverberation of the voice signal
after the voice-signal power P_{i,r,l}(r,l) has been obtained, noise reduction and dereverberation are carried out, with the following specific steps:
(1) compute the low-pass power M_{i,r,l}[r,l] of the l-th frequency band of the r-th frame, using the formula
M_{i,r,l}[r,l] = λ·M_{i,r,l}[r−1,l] + (1−λ)·P_{i,r,l}[r,l],
where M_{i,r,l}[r−1,l] is the low-pass power of the l-th frequency band of frame r−1, and λ is a forgetting factor that varies with the bandwidth of the low-pass filter, λ = 0.4;
(2) remove the slowly varying components and the power falling-edge envelope of the signal: the power P_{i,r,l}[r,l] of the voice signal is processed to obtain the enhanced power P̂_{i,r,l}[r,l] of the l-th frequency band of the r-th frame,
P̂_{i,r,l}[r,l] = max( P_{i,r,l}[r,l] − M_{i,r,l}[r,l], c_0·P_{i,r,l}[r,l] ),
where c_0 is a constant factor, c_0 = 0.01;
(3) each frequency band of each frame of the signal is enhanced according to steps (1) and (2);
(VI) Spectrum integration
after the enhanced power P̂_{i,r,l}[r,l] of each frequency band of each frame of the voice signal has been obtained, spectrum integration of the voice signal yields the enhanced short-time spectrum of each frame; the spectrum-integration formula is
Ŷ_{i,r}(k) = μ_{i,r}[r,k]·Y_{i,r}(k),
where μ_{i,r}[r,k] is the spectral weight coefficient at the k-th index of the r-th frame, Y_{i,r}(k) is the short-time spectrum of the r-th frame of the unenhanced i-th voice signal, and Ŷ_{i,r}(k) is the enhanced short-time spectrum of the r-th frame of the i-th voice signal;
the weight μ_{i,r}[r,k] is obtained as
μ_{i,r}[r,k] = ( Σ_{l=0}^{L−1} ω_{i,r,l}[r,l]·|H_l(ω_k)| ) / ( Σ_{l=0}^{L−1} |H_l(ω_k)| ), 0 ≤ k ≤ N/2,
μ_{i,r}[r,k] = μ_{i,r}[r,N−k], N/2 ≤ k ≤ N−1,
where H_l is the spectrum of the gammatone filter bank of the l-th frequency band evaluated at frequency index k, and ω_{i,r,l}[r,l] is the weight coefficient of the l-th frequency band of the r-th frame of the i-th voice signal, namely the ratio of the enhanced spectrum to the original spectrum of the signal, obtained as
ω_{i,r,l}[r,l] = sqrt( P̂_{i,r,l}[r,l] / P_{i,r,l}[r,l] );
the enhanced short-time spectrum of the r-th frame of the i-th voice signal is thus obtained after spectrum integration, and each frame is processed in turn according to the same operation to obtain the enhanced short-time spectrum of every frame of the i-th voice signal; the enhanced spectrum Ŷ_{i,r}(k) of each frame is transformed by IFFT to obtain the time-domain voice signal ŷ_{i,r}(n) of each frame, and the frames are spliced in the time domain to give the enhanced voice signal x̂_i(n):
x̂_i(n) = Σ_{r=1}^{G} ŷ_{i,r}(n), G being the total number of frames,
where x̂_i(n) is the enhanced voice-signal matrix, ŷ_{i,r}(n) is the enhanced voice-signal matrix of the r-th frame, and G, the total number of frames of the voice signal, varies with the duration of the voice signal; after the sample matrix x̂_i(n) of the enhanced voice signal at time n has been obtained, the enhanced voice signal x̂_i is written out with the audiowrite function built into matlab software at the voice-signal sampling rate f_s = 16 kHz;
at this point the enhancement of one voice in the voice training set is finished, and the training set X and the test set T are then processed in turn according to the above steps; the enhanced training-set voices are saved in the set X̂, and the enhanced test-set voices are saved in the set T̂;
step three, building the voice recognition acoustic model; the built voice recognition acoustic model is modeled with CNN+CTC; the input layer takes the voice signals x̂_i of the training set X̂ enhanced in step two, and their feature-value sequences are extracted with an MFCC feature-extraction algorithm; the hidden layers connect convolution layers and pooling layers alternately and repeatedly, and Dropout layers are introduced to prevent overfitting, the convolution kernel size of the convolution layers being 3 and the pooling window size being 2; finally, the output layer uses a fully connected layer of 1423 neurons activated by a softmax function, and the CTC loss function is used to realize connectionist-temporal-classification multi-output, the 1423-dimensional output corresponding exactly to the 1423 commonly used Chinese pinyin entries in the Chinese dictionary text.txt document built in step four; the network framework of the voice recognition acoustic model comprises convolution layers, pooling layers, a Dropout layer and a fully connected layer;
step four, constructing a voice recognition language model; the language model construction comprises the establishment of a language text data set, the design of a 2-gram language model and the collection of a Chinese dictionary;
establishment of language text database
Firstly, establishing a text data set required by training a language model; the language text data set is expressed in the form of an electronic version txt file, and the content is newspaper, middle school lessons and famous novels; the method comprises the steps of collecting electronic version txt files of newspapers, middle school lessons and famous novels to establish a language text database, and noticing that text data in the language text database are selected to be representative, so that Chinese language habits in daily life can be reflected;
(II) Construction of the 2-gram language model
The language model is constructed with the 2-gram algorithm, a language-model training method that performs its own word segmentation. The 2 in 2-gram is the constraint on the memory length of the word sequence: the probability of the current word is assumed to depend only on the words immediately preceding it. The 2-gram score of a text sequence can be expressed as: S(W) = P(w_1) · P(w_2 | w_1) · P(w_3 | w_2) · … · P(w_q | w_(q-1)), i.e. the product over d of the conditional probabilities P(w_d | w_(d-1));
where W denotes the text sequence, w_1, w_2, ..., w_q denote the individual words of the text sequence, q denotes the length of the text sequence, S(W) denotes the probability that the text sequence conforms to linguistic habits, and d indexes the d-th word;
(III) Creation of the Chinese dictionary
A language-model dictionary of the voice recognition system is constructed. The dictionary of a language is stable and does not change; for the Chinese dictionary it is expressed as a text file in which the 1423 Chinese pinyin syllables commonly used in daily life are each annotated with their corresponding Chinese characters, taking the one-sound-many-characters nature of Chinese into account;
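As a sketch, such a dictionary file can be loaded as follows. The patent does not specify the file layout, so the one-syllable-per-line format assumed here is purely hypothetical.

```python
def load_pinyin_dict(path="text.txt"):
    """Load the pinyin-to-characters dictionary file.
    The line format assumed here is hypothetical:
        zhong4  中 众 重 仲
    i.e. a toned pinyin syllable followed by the Chinese characters sharing it."""
    pinyin_to_chars = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2:
                pinyin_to_chars[parts[0]] = parts[1:]  # one syllable -> its homophone characters
    return pinyin_to_chars
```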
Step five, the constructed 2-gram language model is trained with the established language text data set to obtain the word-occurrence-count table and the state-transition table of the language model. The specific training procedure of the language model is as follows (a minimal counting sketch is given after this list):
(1) Loop over the text content of the language text data set, count the number of occurrences of each single word, and summarize the counts to obtain the single-word occurrence table;
(2) Loop over the language text data set, count how often each pair of words occurs together, and summarize the counts to obtain the two-word state-transition table;
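The counting sketch referred to above could look like the following; the file handling, the character-level treatment of "words" and the unsmoothed maximum-likelihood probability estimate are assumptions for illustration, not the patent's implementation.

```python
from collections import Counter, defaultdict

def train_2gram(corpus_files):
    """Build the single-word occurrence table and the two-word state-transition
    table by looping over the language text data set (each Chinese character is
    treated as a 'word' here; real word segmentation is assumed elsewhere)."""
    unigram = Counter()            # single-word occurrence table
    bigram = defaultdict(Counter)  # two-word state-transition table
    for path in corpus_files:
        with open(path, encoding="utf-8") as f:
            for line in f:
                words = list(line.strip())
                unigram.update(words)
                for a, b in zip(words, words[1:]):  # adjacent word pairs
                    bigram[a][b] += 1
    return unigram, bigram

def score_2gram(sentence, unigram, bigram):
    """S(W): product of the conditional probabilities P(w_d | w_(d-1)),
    estimated by maximum likelihood from the two tables (no smoothing)."""
    words = list(sentence)
    p = 1.0
    for a, b in zip(words, words[1:]):
        p *= bigram[a][b] / max(unigram[a], 1)
    return p
```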
Step six, using the trained language model, the established dictionary and the enhanced voice training set, the constructed acoustic model is trained; this yields the weight file and the other parameter-configuration files of the acoustic model. The specific acoustic-model training flow is as follows (a minimal training-step sketch is given after this list):
(1) Initialize all the weights of the acoustic network model;
(2) Import the utterances of the enhanced voice training set in sequence. For an arbitrary speech signal, first apply the MFCC feature extraction algorithm to obtain the 200-dimensional feature-value sequence of the speech signal; then, as described for figure 7, process this 200-dimensional feature sequence with the convolution layers, pooling layers, Dropout layers and fully connected layer in turn; finally, produce the output with the 1423-neuron fully connected layer activated by the softmax function, obtaining the 1423-dimensional acoustic features of the speech signal;
(3) After the feature values are obtained, decode the 1423-dimensional acoustic features under the action of the language model and the dictionary, and output the recognized Chinese pinyin sequence of the speech signal;
(4) Compare the Chinese pinyin sequence recognized by the acoustic model with the label of the i-th utterance of the training set, calculate the error, and back-propagate it to update all the weights of the acoustic model. The CTC loss function is adopted and optimized with the Adam algorithm, the training batch size is set to batch size = 16, and the weight file is saved once every 500 training utterances. The CTC loss function is as follows: Loss = − Σ ln F(z | e), summed over all input/label pairs (e, z) of the training set;
In the above formula, Loss represents the total loss generated over the training set, e represents the input speech, i.e. a speech signal of the speech-enhanced training set, z is the output Chinese character sequence, and F(z|e) represents the probability of outputting z when e is input;
(5) Train the voice recognition acoustic model by repeating the above steps until its loss converges, completing the acoustic-model training; save the weight file and the various configuration files of the acoustic model. In summary, the input voice is first enhanced and then subjected to feature extraction, and the acoustic model is trained with the help of the language model and the dictionary to obtain the final acoustic model;
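The training-step sketch referred to above is given below as a hedged Keras/TensorFlow illustration of item (4), assuming the model from the earlier sketch; tensor shapes, helper names and the checkpoint file name are placeholders, not details from the patent.

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam()  # loss optimized with the Adam algorithm
BATCH_SIZE = 16                         # training batch size, as stated above

@tf.function
def train_step(model, features, labels, label_len, frame_len):
    """One training step: forward pass, CTC loss, back-propagation."""
    with tf.GradientTape() as tape:
        y_pred = model(features, training=True)  # (batch, frames, 1423 + blank)
        # CTC loss: -log F(z | e), averaged over the batch
        loss = tf.reduce_mean(
            tf.keras.backend.ctc_batch_cost(labels, y_pred, frame_len, label_len))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

# After every 500 training utterances the weights would be checkpointed, e.g.:
# model.save_weights("acoustic_model_step_%d.h5" % step)
```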
Step seven, the trained speech-enhancement-based Chinese voice recognition system performs voice recognition on the test set, the recognition accuracy is counted, and its performance is compared with that of the traditional algorithm. A test utterance is input, a feature vector is obtained through speech enhancement and feature extraction, and the text output is obtained with a speech decoding and search algorithm according to the acoustic model, the dictionary and the language model;
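A sketch of this recognition pipeline might look as follows. For brevity it uses greedy CTC decoding in place of the full dictionary-and-2-gram language-model search described in the patent, and all helper names (`enhance`, `extract_mfcc`, `index_to_pinyin`) are placeholders for the components described in the earlier steps.

```python
import numpy as np
import tensorflow as tf

def recognize(wave, enhance, extract_mfcc, acoustic_model, index_to_pinyin):
    """Test-time pipeline sketch: speech enhancement -> MFCC feature extraction
    -> acoustic model -> greedy CTC decoding -> pinyin label sequence."""
    feats = extract_mfcc(enhance(wave))                 # (frames, 200) feature vectors
    y_pred = acoustic_model.predict(feats[np.newaxis])  # (1, frames', labels)
    frame_count = np.array([y_pred.shape[1]])
    decoded, _ = tf.keras.backend.ctc_decode(y_pred, frame_count, greedy=True)
    label_ids = decoded[0].numpy()[0]
    return [index_to_pinyin[i] for i in label_ids if i >= 0]  # -1 marks CTC padding
```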
The specific implementation is as follows:
(1) Use a traditional voice recognition system to run the recognition test on the 2000 unenhanced utterances of the test set T of the established complex-environment voice database, and count the voice recognition accuracy;
(2) Use the speech-enhancement-based voice recognition system to run the recognition test on the 2000 enhanced utterances of the enhanced test set of the established voice database, and count the voice recognition accuracy of the proposed method;
(3) Finally, analyze the performance of the proposed speech-enhancement-based voice recognition system;
The statistics show that the proposed speech-enhancement-based voice recognition algorithm greatly improves the recognition accuracy for speech in a Gaussian white-noise environment, in an environment with background noise or an interfering sound source, and in a reverberant environment, with a performance gain of about 30%. Compared with the traditional voice recognition algorithm, the recognition accuracy is greatly improved; in particular, in the conditions where the traditional algorithm performs very poorly, namely recognition under Gaussian white noise, background noise or interfering sound sources, and reverberation, the proposed algorithm performs excellently;
Therefore, the speech-enhancement-based deep neural network voice recognition method for complex environments effectively solves the problems that existing voice recognition algorithms are sensitive to noisy environments, place high demands on voice quality and are applicable to only a single scenario, and it realizes voice recognition in complex acoustic environments;
The symbol i appearing in the above steps denotes the i-th speech signal of the training and test sets subjected to speech enhancement, i = 1, 2, ..., 12000; the symbol r denotes the r-th frame of a speech signal, r = 1, 2, 3, ..., g; g denotes the total number of frames after the speech signal is divided into frames, and its value changes with the duration of the processed speech; the symbol l denotes the l-th frequency band of the speech signal, l = 0, 1, 2, ..., 39; k is a dummy variable indexing the discrete frequency, k = 0, 1, 2, ....