CN111986661A - Deep neural network speech recognition method based on speech enhancement in complex environment - Google Patents

Deep neural network speech recognition method based on speech enhancement in complex environment

Info

Publication number
CN111986661A
Authority
CN
China
Prior art keywords
voice
speech
signal
frame
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010880777.7A
Other languages
Chinese (zh)
Other versions
CN111986661B (en)
Inventor
王兰美
梁涛
朱衍波
廖桂生
王桂宝
孙长征
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Shaanxi University of Technology
Original Assignee
Xidian University
Shaanxi University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University, Shaanxi University of Technology filed Critical Xidian University
Priority to CN202010880777.7A priority Critical patent/CN111986661B/en
Publication of CN111986661A publication Critical patent/CN111986661A/en
Application granted granted Critical
Publication of CN111986661B publication Critical patent/CN111986661B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 - Noise filtering

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

A deep neural network speech recognition method based on speech enhancement in a complex environment builds its model on a deep learning neural network and speech enhancement. First, a complex speech-environment data set is constructed, and speech enhancement is applied to the speech signals to be recognized under various complex conditions during the front-end speech-signal preprocessing stage. A language text data set is then established, a language model is built and trained; a Chinese dictionary file is created. Finally, a neural network acoustic model is built and trained on the enhanced speech training set with the help of the language model and the dictionary, yielding an acoustic model weight file and thereby achieving accurate recognition of Chinese speech in complex environments. This addresses the problems of existing speech recognition algorithms, which are sensitive to noise, demand high speech quality, and serve only a single application scenario.

Description

Deep neural network speech recognition method based on speech enhancement in complex environment
Technical Field
The invention belongs to the field of voice recognition, and particularly relates to a deep neural network voice recognition method based on voice enhancement in a complex environment.
Background
In recent years, technology has advanced rapidly, the economy has prospered, and society has progressed; having solved the basic needs of food, clothing, housing, and transport, people now demand more from life. This pursuit of a better life has driven the wide adoption of virtual social software that integrates life, work, and entertainment, such as QQ and WeChat. Such software brings great convenience to people's lives, work, and communication, and nearly every social application now includes a speech recognition function. Speech recognition frees users from traditional interaction methods such as the keyboard and mouse, allowing information to be conveyed in the most natural way: spoken communication. At the same time, speech recognition is gradually being applied in many fields, including industry, communications, home appliances, home services, medical care, and consumer electronics.
Most of today's social software achieves very high speech recognition accuracy under pure speech conditions, without background noise and without interfering sound sources. When the speech signal to be recognized contains noise, interference, or reverberation, however, the accuracy of existing speech recognition systems drops sharply. The main reason is that existing systems do not address denoising and interference suppression in the front-end speech-signal preprocessing stage or in the acoustic-model construction stage.
Existing Chinese speech recognition algorithms place strict requirements on the quality of the speech signal and have poor robustness; recognition can fail when the speech quality is poor or the audio is heavily polluted. They are therefore applied only in a narrow range of pure, ideal speech conditions. To broaden the application of speech recognition to real-life environments and overcome the shortcomings of existing algorithms, the invention provides a deep neural network speech recognition method based on speech enhancement in a complex environment. The method takes the deep learning neural network and speech enhancement as its technical background. First, speech enhancement is applied at the speech recognition front end to the speech signals to be recognized under various complex conditions; a language text data set is established, a language model is built and trained; a Chinese dictionary file is created; a neural network acoustic model is then built and trained on the enhanced speech training set with the help of the language model and the dictionary, yielding an acoustic model weight file and a speech recognition system that performs well in complex speech environments.
With a view to applying speech recognition in real life, the complex-environment speech recognition technology provided by the invention covers four speech environments: pure speech, Gaussian white noise, background noise or interfering sound sources, and reverberation. The method offers high recognition accuracy, strong model generalization, and good robustness to a variety of environmental factors.
Disclosure of Invention
The invention aims to provide a deep neural network speech recognition method based on speech enhancement in a complex environment.
In order to achieve the purpose, the invention adopts the following technical solutions:
A deep neural network speech recognition method based on speech enhancement in a complex environment builds its model on a deep learning neural network and speech enhancement; the overall flow of the speech recognition scheme is shown in FIG. 1. First, a complex speech-environment data set is constructed, and speech enhancement is applied to the speech signals to be recognized under complex conditions during the front-end speech-signal preprocessing stage; a language text data set is then established, a language model is built and trained; a Chinese dictionary file is created; finally, a neural network acoustic model is built and trained on the enhanced speech training set with the help of the language model and the dictionary to obtain an acoustic model weight file, thereby achieving accurate recognition of Chinese speech in complex environments. This addresses the problems of existing speech recognition algorithms, which are sensitive to noise, demand high speech quality, and serve only a single application scenario. The deep neural network speech recognition method based on speech enhancement in the complex environment comprises the following steps:
Step one: establish and process the complex-environment speech data set. Speech is collected in a pure environment, a Gaussian white noise environment, an environment with background noise or an interfering sound source, and a reverberation environment, forming the speech data set C of the speech recognition system. The speech data of each environment in C is then split into a training portion and a test portion at a ratio of 5:1 (training utterances : test utterances). The training and test portions collected under each environment are pooled and shuffled to form the training set X and the test set T. The i-th utterance in the training set X is denoted $x_i$; the j-th utterance in the test set T is denoted $t_j$. At the same time, a label document in txt format is edited for every utterance in the training set X; its content comprises the name of the utterance and the corresponding correct Chinese pinyin sequence. A partial view of a training-set label document is shown in FIG. 2.
Step two: perform speech enhancement on the established speech training set X and test set T to obtain the enhanced training set $\tilde{X}$ and test set $\tilde{T}$. The i-th utterance in the enhanced training set $\tilde{X}$ is denoted $\tilde{x}_i$; the j-th utterance in the enhanced test set $\tilde{T}$ is denoted $\tilde{t}_j$. Taking the i-th utterance $x_i$ of the training set as an example, the speech enhancement proceeds as follows. The speech signal $x_i$ to be enhanced is read with the built-in audioread function of the matlab software, yielding the sampling rate $f_s$ and a matrix $x_i(n)$ of speech samples, where $x_i(n)$ is the sample value at time n. Pre-emphasis is applied to $x_i(n)$ to obtain $y_i(n)$; a Hamming window is then applied to $y_i(n)$ for framing, giving the per-frame information $y_{i,r}(n)$, where $y_{i,r}(n)$ is the speech information matrix of the r-th frame of the i-th pre-emphasized speech signal. An FFT of $y_{i,r}(n)$ yields the short-time spectrum $Y_{i,r}(\omega_k)$ of the r-th frame of the i-th speech signal. The gammatone weighting function $H_l$ is then applied band by band to $Y_{i,r}(\omega_k)$ to obtain the power $P_{i,r,l}[r,l]$ in the l-th band of the r-th frame of the i-th signal, where l = 0, ..., 39; the power of every band of the r-th frame is obtained in the same way. Noise reduction and dereverberation are then carried out, followed by spectrum integration, giving the enhanced short-time spectrum $\tilde{Y}_{i,r}(\omega_k)$ of the r-th frame of the i-th signal; the other frames are processed in turn in the same way to obtain the short-time spectrum of each frame, and the enhanced frames are synthesized in the time domain via IFFT to obtain the enhanced speech signal $\tilde{x}_i$, which is placed in the enhanced training set $\tilde{X}$. The speech-enhancement flow is shown in FIG. 3.
Step three: build the speech recognition acoustic model. The acoustic model is built with CNN + CTC. The input layer takes a speech signal $\tilde{x}_i$ from the training set $\tilde{X}$ enhanced in step two and processes it with the MFCC feature extraction algorithm to obtain a 200-dimensional feature-value sequence. The hidden layers alternate convolutional and pooling layers repeatedly, with Dropout layers introduced to prevent overfitting; the convolution kernel size is 3 and the pooling window size is 2. The output layer is a fully-connected layer of 1423 neurons activated by a softmax function, and the CTC loss function is used as the loss function to realize connectionist temporal classification over the output sequence; the 1423-dimensional output corresponds exactly to the 1423 common Chinese pinyin syllables in the Chinese dictionary dict.txt file built in step four. The acoustic-model network framework is shown in FIG. 4, where the specific parameters of the convolutional, pooling, Dropout, and fully-connected layers are labeled.
Step four: build the 2-gram language model and the dictionary for speech recognition. This comprises establishing the language text data set, building the 2-gram language model, and collecting and establishing the Chinese dictionary. The language text data set takes the form of electronic txt files whose contents are newspapers, Chinese-language textbook passages, and well-known novels. The Chinese dictionary is a dict.txt file in which the Chinese characters corresponding to 1423 Chinese pinyin syllables commonly used in daily life are listed, taking into account that one pronunciation can map to several Chinese characters. Part of the constructed dictionary is shown in FIG. 5.
Step five: train the built 2-gram language model with the established language text data set to obtain the word-occurrence-count table and the state-transition table of the language model. The language model is trained as follows: the text content of the language text data set is read in a loop, the number of occurrences of each single word and the number of co-occurrences of each pair of adjacent words are counted, and the results are summarized into a single-word count table and a two-word state-transition table. The language-model training block diagram is shown in FIG. 6.
Step six: use the trained language model, the established dictionary, and the enhanced speech training set $\tilde{X}$ to train the built acoustic model, obtaining the acoustic model weight file and the other parameter configuration files. The acoustic-model training proceeds as follows: initialize the weights of each part of the acoustic network model; import the utterances of the training set $\tilde{X}$ in turn. For any speech signal $\tilde{x}_i$, the MFCC feature extraction algorithm first produces a 200-dimensional feature-value sequence, which is then processed in turn by the convolutional, pooling, Dropout, and fully-connected layers listed in FIG. 7; finally the output layer, a fully-connected layer of 1423 neurons activated by a softmax function, yields the 1423-dimensional acoustic features of the speech signal. These 1423-dimensional acoustic feature values are then decoded under the action of the language model and the dictionary, and the recognized Chinese pinyin sequence of the speech signal $\tilde{x}_i$ is output. The pinyin sequence recognized by the acoustic model is compared with the pinyin label sequence of $\tilde{x}_i$ in the training set $\tilde{X}$ to compute the error, which is back-propagated to update the weights of every part of the acoustic model; the loss function is the CTC loss and the optimizer is the Adam algorithm. The batch size is set to 16 and the number of epochs to 50, and the weight file is saved every 500 trained utterances. The utterances of the training set $\tilde{X}$ are processed according to these steps until the acoustic-model loss converges, after which the weight file and the various configuration files of the acoustic model are saved. The acoustic-model training block diagram is shown in FIG. 7.
Step seven: use the trained speech-enhancement-based Chinese speech recognition system to recognize the utterances of the test set $\tilde{T}$, count the speech recognition accuracy, and compare the performance with the traditional algorithm. The flow of the speech recognition test system is shown in FIG. 8. The recognition accuracy of this method and the performance comparison with the traditional algorithm are shown in part in FIG. 9 and FIG. 10.
Advantages of the invention
The deep neural network speech recognition method based on speech enhancement in a complex environment addresses the problems of existing speech recognition algorithms, which are sensitive to noise and other complex environmental factors, demand high speech quality, and serve only a single application scenario. Because the proposed method uses deep neural network learning for acoustic modeling, the resulting model has strong transfer-learning capability, and the introduction of the speech enhancement method gives the speech recognition system strong robustness against interference from complex environmental factors.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings used in the description of the present invention will be briefly introduced to better understand the inventive content of the present invention.
FIG. 1 is a detailed flow chart of the speech recognition technique of the present invention;
FIG. 2 is a partial display diagram of the phonetic labels of the speech recognition training set according to the present invention;
FIG. 3 is a block diagram of a speech recognition speech enhancement flow diagram according to the present invention;
FIG. 4 is a diagram of a speech recognition acoustic model network framework of the present invention;
FIG. 5 is a partial display diagram of a dictionary constructed according to the present invention;
FIG. 6 is a flow chart of the language model training of the present invention;
FIG. 7 is a training diagram of an acoustic model of the present invention;
FIG. 8 is a block flow diagram of a speech recognition test system of the present invention;
FIG. 9 is a diagram showing the comparison between the speech recognition algorithm of the present invention and the conventional algorithm in a noisy environment;
FIG. 10 is a comparison between the effect of the speech recognition algorithm of the present invention and the conventional algorithm in a reverberation environment;
Detailed Description
The deep neural network speech recognition method based on speech enhancement in the complex environment comprises the following specific implementation steps:
Step one: establish and process the complex-environment speech data set. Speech is collected in a pure environment, a Gaussian white noise environment, an environment with background noise or an interfering sound source, and a reverberation environment, forming the speech data set C of the speech recognition system. The speech data of each environment in C is then split into a training portion and a test portion at a ratio of 5:1 (training utterances : test utterances). The training and test portions collected under each environment are pooled and shuffled to form the training set X and the test set T. The i-th utterance in the training set X is denoted $x_i$; the j-th utterance in the test set T is denoted $t_j$. At the same time, a label document in txt format is edited for every utterance in the training set X; its content comprises the name of the utterance and the corresponding correct Chinese pinyin sequence. A partial view of a training-set label document is shown in FIG. 2.
The specific collection methods are as follows. First, pure-condition speech is collected: several speakers are recorded under ideal laboratory conditions, reading Chinese newspapers, novels, and school texts, each utterance within 10 seconds, for a total of 3000 pure utterances. Speech in the Gaussian white noise and reverberation environments is synthesized with Adobe Audition software: the recorded pure speech is mixed with Gaussian white noise, and the software's reverberation environment is used to re-synthesize the reverberant speech; 3000 utterances are produced for the Gaussian white noise environment and 3000 for the reverberation environment. Finally, speech with background noise or an interfering sound source is mainly recorded on site: several speakers record in noisy places such as factories and restaurants, for a total of 3000 utterances. All collected speech files are in wav format. The collected speech is partitioned as follows: 2500 utterances of each speech environment are used as the training set of the speech recognition system and the remaining 500 as the test set. In total the training set X contains 10000 utterances and the test set T contains 2000 utterances; both are shuffled to avoid overfitting of the trained model.
Step two: perform speech enhancement on the established speech training set X and test set T to obtain the enhanced training set $\tilde{X}$ and test set $\tilde{T}$. The i-th utterance in the enhanced training set $\tilde{X}$ is denoted $\tilde{x}_i$; the j-th utterance in the enhanced test set $\tilde{T}$ is denoted $\tilde{t}_j$. Taking the i-th utterance $x_i$ of the training set as an example, the speech enhancement proceeds as follows. The speech signal $x_i$ to be enhanced is read with the built-in audioread function of the matlab software, yielding the sampling rate $f_s$ and a matrix $x_i(n)$ of speech samples, where $x_i(n)$ is the sample value at time n. Pre-emphasis is applied to $x_i(n)$ to obtain $y_i(n)$; a Hamming window is then applied to $y_i(n)$ for framing, giving the per-frame information $y_{i,r}(n)$, where $y_{i,r}(n)$ is the speech information matrix of the r-th frame of the i-th pre-emphasized speech signal. An FFT of $y_{i,r}(n)$ yields the short-time spectrum $Y_{i,r}(\omega_k)$ of the r-th frame of the i-th speech signal. The gammatone weighting function $H_l$ is then applied band by band to $Y_{i,r}(\omega_k)$ to obtain the power $P_{i,r,l}[r,l]$ in the l-th band of the r-th frame of the i-th signal, where l = 0, ..., 39; the power of every band of the r-th frame is obtained in the same way. Noise reduction and dereverberation are then carried out, followed by spectrum integration, giving the enhanced short-time spectrum $\tilde{Y}_{i,r}(\omega_k)$ of the r-th frame of the i-th signal; the other frames are processed in turn in the same way to obtain the short-time spectrum of each frame, and the enhanced frames are synthesized in the time domain via IFFT to obtain the enhanced speech signal $\tilde{x}_i$, which is placed in the enhanced training set $\tilde{X}$. The speech-enhancement flow is shown in FIG. 3.
The speech enhancement steps are detailed below.
(I) Speech signal pre-emphasis
Pre-emphasis is applied to the i-th speech signal matrix $x_i(n)$ in the training set X to obtain $y_i(n)$, where $y_i(n) = x_i(n) - \alpha x_i(n-1)$; $\alpha$ is a constant, taken as $\alpha = 0.98$ in this patent, and $x_i(n-1)$ is the sample matrix of the i-th training utterance at time n-1.
(II) Windowing and framing
A Hamming window w(n) is applied to the pre-emphasized speech signal $y_i(n)$ to split the continuous signal into discrete frames $y_{i,r}(n)$, where
$$w(n) = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n \le N-1$$
is the Hamming window function and N is the window length; in this patent the frame length is 50 ms and the frame shift is 10 ms. Windowing and framing the pre-emphasized signal $y_i(n)$ yields the matrix information $y_{i,r}(n)$ of each frame, where $y_{i,r}(n)$ denotes the speech information matrix of the r-th frame of the i-th speech signal after pre-emphasis, windowing, and framing.
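As an illustrative sketch of the pre-emphasis and windowing/framing steps above (the patent itself relies on matlab's built-in speech-processing functions; NumPy and the 16 kHz sampling rate are assumed here):

```python
import numpy as np

def pre_emphasis(x, alpha=0.98):
    """y(n) = x(n) - alpha * x(n-1), with alpha = 0.98 as stated in the text."""
    y = np.copy(x).astype(float)
    y[1:] = x[1:] - alpha * x[:-1]
    return y

def frame_and_window(y, fs=16000, frame_ms=50, shift_ms=10):
    """Split y into 50 ms frames with a 10 ms shift and apply a Hamming window.

    Assumes len(y) >= one frame length.
    """
    frame_len = int(fs * frame_ms / 1000)      # 800 samples at 16 kHz
    frame_shift = int(fs * shift_ms / 1000)    # 160 samples at 16 kHz
    n_frames = 1 + (len(y) - frame_len) // frame_shift
    window = np.hamming(frame_len)             # 0.54 - 0.46*cos(2*pi*n/(N-1))
    frames = np.stack([
        y[r * frame_shift: r * frame_shift + frame_len] * window
        for r in range(n_frames)
    ])
    return frames                              # shape: (g, frame_len)
```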
(III) FFT
The speech information matrix $y_{i,r}(n)$ of the r-th frame of the i-th speech signal is transformed from the time domain to the frequency domain by an FFT, giving the short-time spectrum $Y_{i,r}(\omega_k)$ of the r-th frame of the i-th speech signal.
(IV) Computing the speech signal power $P_{i,r,l}[r,l]$
The short-time spectrum of each frame is processed with the gammatone weighting function to obtain the power of each frequency band of each frame of the speech signal:
$$P_{i,r,l}[r,l] = \sum_{k=0}^{N-1}\bigl|Y_{i,r}(\omega_k)\,H_l(\omega_k)\bigr|^{2}$$
Here $P_{i,r,l}[r,l]$ is the power of the speech signal $y_i(n)$ in the l-th band of the r-th frame, k is the index of the dummy variable representing discrete frequency, and $\omega_k = 2\pi k/N$ is the discrete frequency. Since the FFT uses the 50 ms frame length and the sampling rate of the speech signal is 16 kHz, N = 1024. $H_l$ is the spectrum of the gammatone filter bank for the l-th band evaluated at frequency index k; it is a built-in function of the matlab speech-processing software whose input parameter is the band l. $Y_{i,r}(\omega_k)$ is the short-time spectrum of the r-th frame of the speech signal, and L = 40 is the total number of channels.
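A short sketch of this band-power computation follows; the gammatone magnitude responses H are assumed to be precomputed as an L x N matrix (e.g. from a gammatone filter-bank toolbox), since the patent obtains them from a matlab built-in:

```python
import numpy as np

def band_powers(frames, H, n_fft=1024):
    """P[r, l] = sum_k |Y_r(w_k) * H_l(w_k)|^2 for every frame r and band l.

    frames : (g, frame_len) windowed frames from the previous step
    H      : (L, n_fft) gammatone magnitude responses, assumed precomputed
    """
    Y = np.fft.fft(frames, n=n_fft, axis=1)              # (g, n_fft) short-time spectra
    P = np.einsum('rk,lk->rl', np.abs(Y) ** 2, np.abs(H) ** 2)
    return Y, P                                          # P has shape (g, L)
```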
(V) Noise reduction and dereverberation of the speech signal
After the power $P_{i,r,l}[r,l]$ of the speech signal has been obtained, noise reduction and dereverberation are performed as follows (a sketch of steps (1) and (2) is given after this list):
(1) Compute the low-pass power $M_{i,r,l}[r,l]$ of the l-th band of the r-th frame:
$$M_{i,r,l}[r,l] = \lambda M_{i,r,l}[r-1,l] + (1-\lambda)\,P_{i,r,l}[r,l]$$
where $M_{i,r,l}[r-1,l]$ is the low-pass power of the l-th band of the (r-1)-th frame and $\lambda$ is a forgetting factor that varies with the bandwidth of the low-pass filter; in this patent $\lambda = 0.4$.
(2) Remove the slowly varying components and the power falling-edge envelope from the signal by processing the power $P_{i,r,l}[r,l]$ to obtain the enhanced power of the l-th band of the r-th frame:
$$\tilde{P}_{i,r,l}[r,l] = \max\bigl(P_{i,r,l}[r,l] - M_{i,r,l}[r,l],\; c_0\,P_{i,r,l}[r,l]\bigr)$$
where $c_0$ is a constant factor, taken as $c_0 = 0.01$ in this patent.
(3) Each band of each frame of the signal is enhanced in turn according to steps (1) and (2).
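A minimal sketch of steps (1) and (2), assuming the max() form of the envelope-removal rule written above and initializing the low-pass power with the first frame's power (a detail the text does not specify):

```python
import numpy as np

def suppress_slow_components(P, lam=0.4, c0=0.01):
    """Noise reduction / dereverberation on the band powers P of shape (g, L).

    M[r, l]     = lam * M[r-1, l] + (1 - lam) * P[r, l]    # low-pass power
    P_enh[r, l] = max(P[r, l] - M[r, l], c0 * P[r, l])     # assumed envelope-removal rule
    """
    M = np.zeros_like(P)
    P_enh = np.zeros_like(P)
    for r in range(P.shape[0]):
        prev = M[r - 1] if r > 0 else P[0]                 # initialization is an assumption
        M[r] = lam * prev + (1.0 - lam) * P[r]
        P_enh[r] = np.maximum(P[r] - M[r], c0 * P[r])
    return P_enh
```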
(VI) Spectrum integration
With the enhanced power $\tilde{P}_{i,r,l}[r,l]$ of every band of every frame of the speech signal available, spectrum integration is performed to obtain the enhanced short-time spectrum of each frame. The spectrum-integration formula is
$$\tilde{Y}_{i,r}(\omega_k) = \mu_{i,r}[r,k]\,Y_{i,r}(\omega_k)$$
where $\mu_{i,r}[r,k]$ is the spectral weight coefficient at the k-th index of the r-th frame, $Y_{i,r}(\omega_k)$ is the short-time spectrum of the r-th frame of the un-enhanced i-th speech signal, and $\tilde{Y}_{i,r}(\omega_k)$ is the short-time spectrum of the r-th frame of the enhanced i-th speech signal.
The weight $\mu_{i,r}[r,k]$ is obtained as
$$\mu_{i,r}[r,k] = \frac{\sum_{l=0}^{L-1}\omega_{i,r,l}[r,l]\,\bigl|H_l(\omega_k)\bigr|}{\sum_{l=0}^{L-1}\bigl|H_l(\omega_k)\bigr|}, \qquad 0 \le k \le N/2$$
$$\mu_{i,r}[r,k] = \mu_{i,r}[r,N-k], \qquad N/2 \le k \le N-1$$
where $H_l$ is the spectrum of the gammatone filter bank for the l-th band evaluated at frequency index k, and $\omega_{i,r,l}[r,l]$ is the weighting coefficient of the l-th band of the r-th frame of the i-th speech signal, i.e. the ratio of the enhanced to the original power of the signal:
$$\omega_{i,r,l}[r,l] = \frac{\tilde{P}_{i,r,l}[r,l]}{P_{i,r,l}[r,l]}$$
The enhanced short-time spectrum of the r-th frame of the i-th speech signal is thus obtained after spectrum integration; processing each frame in the same way yields the enhanced short-time spectrum of every frame of the i-th speech signal. An IFFT applied to the enhanced spectrum $\tilde{Y}_{i,r}(\omega_k)$ of each frame gives the time-domain signal of that frame, and the frames are spliced (overlap-added) in the time domain to obtain the enhanced speech signal. The IFFT and time-domain frame-splicing operations are
$$\tilde{y}_{i,r}(n) = \mathrm{IFFT}\bigl[\tilde{Y}_{i,r}(\omega_k)\bigr]$$
$$\tilde{x}_i(n) = \sum_{r=1}^{g}\tilde{y}_{i,r}(n)$$
where $\tilde{x}_i(n)$ is the enhanced speech-signal matrix, $\tilde{y}_{i,r}(n)$ is the enhanced speech-signal matrix of the r-th frame, and g is the total number of frames of the speech signal, which varies with the duration of the signal. Having obtained the sample matrix $\tilde{x}_i(n)$ of the enhanced speech signal, the built-in audiowrite function of the matlab speech-processing software is used to write it out at the sampling rate $f_s$ = 16 kHz, producing the enhanced speech signal $\tilde{x}_i$.
This completes the enhancement of one utterance of the speech training set; the training set X and the test set T are processed in turn according to the above steps. The enhanced training-set utterances are stored in the set $\tilde{X}$ and the enhanced test-set utterances are stored in the set $\tilde{T}$.
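The spectral weighting and frame resynthesis above can be sketched as follows; the overlap-add splicing is an assumption, since the text only states that the frames are spliced in the time domain:

```python
import numpy as np

def integrate_and_resynthesize(Y, P, P_enh, H, frame_len=800, frame_shift=160):
    """Re-weight each frame's spectrum by the band-power gains, then overlap-add.

    Y     : (g, N) original short-time spectra
    P     : (g, L) original band powers
    P_enh : (g, L) enhanced band powers
    H     : (L, N) gammatone magnitude responses (assumed precomputed)
    """
    g, N = Y.shape
    w = P_enh / np.maximum(P, 1e-12)                 # omega[r, l]: enhanced / original power
    absH = np.abs(H)
    mu = (w @ absH) / np.maximum(absH.sum(axis=0), 1e-12)   # mu[r, k] for 0 <= k <= N/2
    half = N // 2
    mu[:, half + 1:] = mu[:, 1:half][:, ::-1]        # mirror: mu[r, k] = mu[r, N - k]
    Y_enh = mu * Y
    frames = np.real(np.fft.ifft(Y_enh, axis=1))[:, :frame_len]
    out = np.zeros(frame_shift * (g - 1) + frame_len)
    for r in range(g):                               # overlap-add splicing (assumed)
        out[r * frame_shift: r * frame_shift + frame_len] += frames[r]
    return out
```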
Step three: build the speech recognition acoustic model. The acoustic model is built with CNN + CTC. The input layer takes a speech signal $\tilde{x}_i$ from the training set $\tilde{X}$ enhanced in step two and extracts a 200-dimensional feature-value sequence with the MFCC feature extraction algorithm. The hidden layers alternate convolutional and pooling layers repeatedly, with Dropout layers introduced to prevent overfitting; the convolution kernel size is 3 and the pooling window size is 2. The output layer is a fully-connected layer of 1423 neurons activated by a softmax function, and the CTC loss function is used as the loss function to realize connectionist temporal classification over the output sequence; the 1423-dimensional output corresponds exactly to the 1423 common Chinese pinyin syllables in the Chinese dictionary dict.txt document built in step four. The acoustic-model network framework is shown in FIG. 4, where the specific parameters of the convolutional, pooling, Dropout, and fully-connected layers are labeled. An illustrative model sketch follows.
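A minimal sketch of a CNN + CTC acoustic model of the kind described (Keras/TensorFlow is assumed; the number of convolutional blocks and the filter counts are illustrative placeholders, while the kernel size 3, pooling size 2, Dropout layers, and the 1423-way softmax output follow the text):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_acoustic_model(feat_dim=200, n_pinyin=1423):
    """CNN stack over MFCC features -> per-step 1423-way softmax, trained with CTC."""
    feats = layers.Input(shape=(None, feat_dim, 1), name='mfcc')    # (time, feature, 1)
    x = feats
    for filters in (32, 64, 128):                  # illustrative depth / filter counts
        x = layers.Conv2D(filters, kernel_size=3, padding='same', activation='relu')(x)
        x = layers.MaxPooling2D(pool_size=2)(x)
        x = layers.Dropout(0.2)(x)                 # Dropout to prevent overfitting
    x = layers.Reshape((-1, x.shape[2] * x.shape[3]))(x)
    y = layers.Dense(n_pinyin, activation='softmax')(x)             # one pinyin per step
    return Model(feats, y)
```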
Step four: build the speech recognition language model. This comprises establishing the language text data set, designing the 2-gram language model, and collecting and establishing the Chinese dictionary.
(I) Establishment of the language text data set
First, the text data set needed to train the language model is established. The language text data set takes the form of electronic txt files whose contents are newspapers, Chinese-language textbook passages, and well-known novels. These txt files constitute the language text database, and the text data selected for it must be representative so that it reflects the Chinese language habits of daily life.
(II) Building the 2-gram language model
This patent builds the language model with the word-based 2-gram algorithm. The 2 in 2-gram indicates that the probability of the current word is taken to depend only on the word immediately preceding it; 2 is the constrained word-sequence memory length. The 2-gram algorithm can be expressed as
$$S(W) = P(w_1)\prod_{d=2}^{q}P(w_d \mid w_{d-1})$$
where W denotes a text sequence, $w_1, w_2, \ldots, w_q$ are the individual words of the sequence, q is the length of the sequence, S(W) is the probability that the text sequence conforms to linguistic habit, and d indexes the d-th word.
(III) Chinese dictionary establishment
The language-model dictionary of the speech recognition system is built as a dict.txt file in which the Chinese characters corresponding to 1423 Chinese pinyin syllables commonly used in daily life are listed, taking into account that one pronunciation can map to several Chinese characters. Part of the dictionary constructed by the invention is shown in FIG. 5.
Step five: train the built 2-gram language model with the established language text data set to obtain the word-occurrence-count table and the state-transition table of the language model. The language-model training block diagram is shown in FIG. 6. The language model is trained as follows (a sketch is given after this list):
(1) Read the text content of the language text data set in a loop, count the occurrences of each single word, and summarize them into the single-word count table.
(2) Read the data set again in a loop, count the number of times each pair of adjacent words occurs together, and summarize the counts into the two-word state-transition table.
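A minimal sketch of these two counting passes and of scoring a sequence with the resulting tables (tokenization and file handling are assumed):

```python
from collections import Counter

def train_2gram(sentences):
    """Build the single-word count table and the two-word state-transition table.

    sentences: iterable of word lists (tokenization of the text corpus is assumed).
    """
    unigram = Counter()
    bigram = Counter()
    for words in sentences:
        unigram.update(words)
        bigram.update(zip(words[:-1], words[1:]))
    return unigram, bigram

def sequence_score(words, unigram, bigram):
    """S(W) = P(w1) * prod_d P(w_d | w_{d-1}), with maximum-likelihood counts."""
    total = sum(unigram.values())
    score = unigram[words[0]] / total if total else 0.0
    for prev, cur in zip(words[:-1], words[1:]):
        score *= bigram[(prev, cur)] / unigram[prev] if unigram[prev] else 0.0
    return score
```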
Step six: use the trained language model, the established dictionary, and the enhanced speech training set $\tilde{X}$ to train the built acoustic model, obtaining the acoustic model weight file and the other parameter configuration files. The acoustic model is trained as follows:
(1) Initialize the weights of each part of the acoustic network model.
(2) Import the utterances of the training set $\tilde{X}$ in turn. For any speech signal $\tilde{x}_i$ in the training set, the MFCC feature extraction algorithm first produces a 200-dimensional feature-value sequence of the speech signal; this sequence is then processed in turn by the convolutional, pooling, Dropout, and fully-connected layers listed in FIG. 7; finally the output layer, a fully-connected layer of 1423 neurons activated by a softmax function, yields the 1423-dimensional acoustic features of the speech signal.
(3) The 1423-dimensional acoustic feature values are decoded under the action of the language model and the dictionary, and the recognized Chinese pinyin sequence of the speech signal $\tilde{x}_i$ is output.
(4) The Chinese pinyin sequence recognized by the acoustic model is compared with the Chinese pinyin label sequence of the i-th utterance $\tilde{x}_i$ in the training set $\tilde{X}$ to compute the error, which is back-propagated to update the weights of every part of the acoustic model; the loss function is the CTC loss and the optimizer is the Adam algorithm. The batch size is set to 16 and the number of epochs to 50, and the weight file is saved every 500 trained utterances. The CTC loss function is
$$L_{\mathrm{CTC}} = -\sum_{(e,z)}\ln F(z \mid e)$$
where $L_{\mathrm{CTC}}$ is the total loss produced by training on the training set, e denotes an input speech signal $\tilde{x}_i$ of the training set $\tilde{X}$ obtained after speech enhancement, z is the output Chinese character sequence, and $F(z \mid e)$ is the probability that the output sequence is z given the input e.
(5) Train the speech recognition acoustic model according to the above steps until the acoustic-model loss converges, at which point acoustic-model training is complete; save the weight file and the various configuration files of the acoustic model. The acoustic-model training diagram is shown in FIG. 7. A sketch of this training setup follows.
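A hedged sketch of the training configuration of step six (batch size 16, 50 epochs, Adam, CTC loss); it wraps the acoustic model from the earlier sketch with Keras' ctc_batch_cost, and the data generator and checkpoint cadence are placeholders rather than details given in the patent:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model, optimizers

def add_ctc_training_head(base_model):
    """Attach a CTC loss head to the acoustic model and compile it with Adam."""
    labels = layers.Input(shape=(None,), dtype='int32', name='labels')
    input_len = layers.Input(shape=(1,), dtype='int32', name='input_len')
    label_len = layers.Input(shape=(1,), dtype='int32', name='label_len')
    y_pred = base_model.output
    ctc = layers.Lambda(
        lambda args: tf.keras.backend.ctc_batch_cost(*args), name='ctc_loss'
    )([labels, y_pred, input_len, label_len])
    train_model = Model([base_model.input, labels, input_len, label_len], ctc)
    train_model.compile(optimizer=optimizers.Adam(),
                        loss=lambda y_true, y_out: y_out)   # loss already computed above
    return train_model

# Usage sketch (the generator is a placeholder yielding batches of 16 utterances):
# train_model = add_ctc_training_head(build_acoustic_model())
# train_model.fit(train_generator, epochs=50)
```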
Step seven: use the trained speech-enhancement-based Chinese speech recognition system to recognize the utterances of the test set $\tilde{T}$, count the speech recognition accuracy, and compare the performance with the traditional algorithm. The flow of the speech recognition test system is shown in FIG. 8. The recognition accuracy of this method and the performance comparison with the traditional algorithm in a noisy environment are shown in part in FIG. 9; the recognition accuracy and the performance comparison with the traditional algorithm in a reverberant environment are shown in part in FIG. 10.
The specific implementation mode is as follows:
(1) and (3) carrying out voice recognition test on 2000 unenhanced voice test sets T of the established complex environment voice database by using a traditional voice recognition system, and counting the accuracy of voice recognition. Representative speech recognition results are listed in the figure description, and fig. 9 and fig. 10 are shown in the figure description.
(2) 2000 enhanced speech test sets of a built speech database using a speech enhancement based speech recognition system of the present invention
Figure BDA0002654055760000141
And carrying out voice recognition test and counting the voice recognition accuracy of the method. And a representative speech recognition is illustrated in the accompanying drawingsThe result figures are shown in the accompanying description of FIGS. 9 and 10.
(3) And finally, performing performance analysis on the voice recognition system based on the voice enhancement provided by the invention.
After statistics is completed, the voice recognition algorithm based on voice enhancement greatly improves the recognition accuracy of the voice in a Gaussian white noise environment, an environment with background noise or an interference sound source and a reverberation environment, and the performance is improved by about 30%; compared with the traditional speech recognition algorithm, the algorithm of the invention has greatly improved recognition accuracy, and particularly has poor performance on speech recognition in a Gaussian white noise environment, an environment with background noise or an interference sound source and a reverberation environment. A comparison graph of the recognition effect of the voice recognition algorithm and the traditional voice recognition algorithm under the condition of partial noise is shown in the figure description figure 9. A comparison graph of the recognition effect of the speech recognition algorithm of the invention and the recognition effect of the traditional speech recognition algorithm under the partial reverberation environment is shown in the figure 10.
Therefore, the deep neural network speech recognition method based on speech enhancement in the complex environment well solves the problems that the existing speech recognition algorithm is sensitive to a noise environment, high in requirement on speech quality and single in applicable scene, and realizes speech recognition in the complex speech environment.
The symbol i appearing in each step denotes the i-th speech signal of the training and test sets subjected to speech-enhancement processing, i = 1, 2, ..., 12000; the symbol r denotes the r-th frame of the speech signal, r = 1, 2, 3, ..., g; g denotes the total number of frames after the speech signal is framed, and its value varies with the duration of the processed speech; the symbol l denotes the l-th frequency band of the speech signal, l = 0, 1, 2, ..., 39; k is the index of the dummy variable representing discrete frequency, k = 0, 1, 2, ..., N-1.
Although the present invention has been described with reference to the preferred embodiments, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.
Advantages of the invention
The invention builds its model on a deep learning neural network and speech enhancement. First, a complex speech-environment data set is constructed, and speech enhancement is applied to the speech signals to be recognized under various complex conditions during the front-end speech-signal preprocessing stage; a language text data set is then established, a language model is built and trained; a Chinese dictionary file is created; a neural network acoustic model is then built and trained on the enhanced speech training set with the help of the language model and the dictionary to obtain an acoustic model weight file, thereby achieving accurate recognition of Chinese speech in complex environments. This addresses the problems of existing speech recognition algorithms, which are sensitive to noise factors, demand high speech quality, and serve only a single application scenario.

Claims (1)

1. The deep neural network speech recognition method based on speech enhancement in the complex environment comprises the following specific implementation steps:
step one, establishing and processing the complex-environment speech data set; collecting speech in a pure environment, a Gaussian white noise environment, an environment with background noise or an interfering sound source, and a reverberation environment to form the speech data set C of the speech recognition system; then splitting the speech data of each environment in the speech data set C into a training portion and a test portion at a ratio of 5:1 (training utterances : test utterances); pooling and shuffling the training and test portions collected under each environment to form the training set X and the test set T; the i-th utterance in the training set X is denoted $x_i$; the j-th utterance in the test set T is denoted $t_j$; at the same time, editing a label document in txt format for each utterance in the training set X, the content of the label document comprising the name of the utterance and the corresponding correct Chinese pinyin sequence; a partial view of a training-set label document is shown in FIG. 2;
the specific collection methods are as follows: first, pure-condition speech is collected by recording several speakers under ideal laboratory conditions, reading Chinese newspapers, novels, and school texts, each utterance within 10 seconds, for a total of 3000 pure utterances; speech in the Gaussian white noise and reverberation environments is synthesized with Adobe Audition software, specifically by mixing the recorded pure speech with Gaussian white noise and by re-synthesizing speech with the software's reverberation environment, producing 3000 utterances for the Gaussian white noise environment and 3000 for the reverberation environment; finally, speech with background noise or an interfering sound source is mainly recorded on site by several speakers in noisy places such as factories and restaurants, for a total of 3000 utterances; all collected speech files are in wav format; the collected speech is partitioned as follows: 2500 utterances of each speech environment are used as the training set of the speech recognition system and the remaining 500 as the test set; in total the training set X contains 10000 utterances and the test set T contains 2000 utterances, and both are shuffled to avoid overfitting of the trained model;
step two, performing speech enhancement on the established speech training set X and test set T to obtain the enhanced training set $\tilde{X}$ and test set $\tilde{T}$; the i-th utterance in the enhanced training set $\tilde{X}$ is denoted $\tilde{x}_i$, and the j-th utterance in the enhanced test set $\tilde{T}$ is denoted $\tilde{t}_j$; taking the i-th utterance $x_i$ of the training set as an example, the speech enhancement proceeds as follows: the speech signal $x_i$ to be enhanced is read with the built-in audioread function of the matlab software, yielding the sampling rate $f_s$ and a matrix $x_i(n)$ of speech samples, where $x_i(n)$ is the sample value at time n; pre-emphasis is applied to $x_i(n)$ to obtain $y_i(n)$; a Hamming window is then applied to $y_i(n)$ for framing, giving the per-frame information $y_{i,r}(n)$, where $y_{i,r}(n)$ is the speech information matrix of the r-th frame of the i-th pre-emphasized speech signal; an FFT of $y_{i,r}(n)$ yields the short-time spectrum $Y_{i,r}(\omega_k)$ of the r-th frame of the i-th speech signal; the gammatone weighting function $H_l$ is then applied band by band to $Y_{i,r}(\omega_k)$ to obtain the power $P_{i,r,l}[r,l]$ in the l-th band of the r-th frame of the i-th signal, where l = 0, ..., 39; the power of every band of the r-th frame is obtained in the same way; noise reduction and dereverberation are then carried out, followed by spectrum integration, giving the enhanced short-time spectrum $\tilde{Y}_{i,r}(\omega_k)$ of the r-th frame of the i-th speech signal; the speech signals of the other frames are processed in turn in the same way to obtain the short-time spectrum of each frame, and the enhanced frames are synthesized in the time domain via IFFT to obtain the enhanced speech signal $\tilde{x}_i$, which is placed in the enhanced speech training set $\tilde{X}$; a specific speech-enhancement flow diagram is shown in FIG. 3;
the speech enhancement steps are detailed below:
(I) speech signal pre-emphasis
pre-emphasis is applied to the i-th speech signal matrix $x_i(n)$ in the training set X to obtain $y_i(n)$, where $y_i(n) = x_i(n) - \alpha x_i(n-1)$; $\alpha$ is a constant, taken as $\alpha = 0.98$ in this patent; $x_i(n-1)$ is the sample matrix of the i-th training utterance at time n-1;
(II) windowing and framing
a Hamming window w(n) is applied to the pre-emphasized speech signal $y_i(n)$ to split the continuous signal into discrete frames $y_{i,r}(n)$, where
$$w(n) = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n \le N-1$$
is the Hamming window function and N is the window length; in this patent the frame length is 50 ms and the frame shift is 10 ms; windowing and framing the pre-emphasized signal $y_i(n)$ yields the matrix information $y_{i,r}(n)$ of each frame, where $y_{i,r}(n)$ denotes the speech information matrix of the r-th frame of the i-th speech signal after pre-emphasis, windowing, and framing;
(III) FFT
the speech information matrix $y_{i,r}(n)$ of the r-th frame of the i-th speech signal is transformed from the time domain to the frequency domain by an FFT, giving the short-time spectrum $Y_{i,r}(\omega_k)$ of the r-th frame of the i-th speech signal;
(IV) computing the speech signal power $P_{i,r,l}[r,l]$
the short-time spectrum of each frame is processed with the gammatone weighting function to obtain the power of each frequency band of each frame of the speech signal:
$$P_{i,r,l}[r,l] = \sum_{k=0}^{N-1}\bigl|Y_{i,r}(\omega_k)\,H_l(\omega_k)\bigr|^{2}$$
where $P_{i,r,l}[r,l]$ is the power of the speech signal $y_i(n)$ in the l-th band of the r-th frame, k is the index of the dummy variable representing discrete frequency, and $\omega_k = 2\pi k/N$ is the discrete frequency; since the FFT uses the 50 ms frame length and the sampling rate of the speech signal is 16 kHz, N = 1024; $H_l$ is the spectrum of the gammatone filter bank for the l-th band evaluated at frequency index k, a built-in function of the matlab speech-processing software whose input parameter is the band l; $Y_{i,r}(\omega_k)$ is the short-time spectrum of the r-th frame of the speech signal, and L = 40 is the total number of channels;
(V) noise reduction and dereverberation of the speech signal
after the power $P_{i,r,l}[r,l]$ of the speech signal has been obtained, noise reduction and dereverberation are performed as follows:
(1) compute the low-pass power $M_{i,r,l}[r,l]$ of the l-th band of the r-th frame:
$$M_{i,r,l}[r,l] = \lambda M_{i,r,l}[r-1,l] + (1-\lambda)\,P_{i,r,l}[r,l]$$
where $M_{i,r,l}[r-1,l]$ is the low-pass power of the l-th band of the (r-1)-th frame and $\lambda$ is a forgetting factor that varies with the bandwidth of the low-pass filter; in this patent $\lambda = 0.4$;
(2) remove the slowly varying components and the power falling-edge envelope from the signal by processing the power $P_{i,r,l}[r,l]$ to obtain the enhanced power of the l-th band of the r-th frame:
$$\tilde{P}_{i,r,l}[r,l] = \max\bigl(P_{i,r,l}[r,l] - M_{i,r,l}[r,l],\; c_0\,P_{i,r,l}[r,l]\bigr)$$
where $c_0$ is a constant factor, taken as $c_0 = 0.01$ in this patent;
(3) each band of each frame of the signal is enhanced in turn according to steps (1) and (2);
(VI) spectrum integration
with the enhanced power $\tilde{P}_{i,r,l}[r,l]$ of every band of every frame of the speech signal available, spectrum integration is performed to obtain the enhanced short-time spectrum of each frame; the spectrum-integration formula is
$$\tilde{Y}_{i,r}(\omega_k) = \mu_{i,r}[r,k]\,Y_{i,r}(\omega_k)$$
where $\mu_{i,r}[r,k]$ is the spectral weight coefficient at the k-th index of the r-th frame, $Y_{i,r}(\omega_k)$ is the short-time spectrum of the r-th frame of the un-enhanced i-th speech signal, and $\tilde{Y}_{i,r}(\omega_k)$ is the short-time spectrum of the r-th frame of the enhanced i-th speech signal;
the weight $\mu_{i,r}[r,k]$ is obtained as
$$\mu_{i,r}[r,k] = \frac{\sum_{l=0}^{L-1}\omega_{i,r,l}[r,l]\,\bigl|H_l(\omega_k)\bigr|}{\sum_{l=0}^{L-1}\bigl|H_l(\omega_k)\bigr|}, \qquad 0 \le k \le N/2$$
$$\mu_{i,r}[r,k] = \mu_{i,r}[r,N-k], \qquad N/2 \le k \le N-1$$
where $H_l$ is the spectrum of the gammatone filter bank for the l-th band evaluated at frequency index k, and $\omega_{i,r,l}[r,l]$ is the weighting coefficient of the l-th band of the r-th frame of the i-th speech signal, i.e. the ratio of the enhanced to the original power of the signal:
$$\omega_{i,r,l}[r,l] = \frac{\tilde{P}_{i,r,l}[r,l]}{P_{i,r,l}[r,l]}$$
The enhanced short-time spectrum of the r-th frame of the i-th voice signal is thereby obtained after spectrum integration, and each frame is processed in turn to obtain the enhanced short-time spectrum of every frame of the i-th voice signal; each enhanced frame spectrum Ỹ_{i,r}[r,ω_k] is converted by an IFFT to the time-domain voice signal of that frame, and the frames are spliced in the time domain to obtain the enhanced voice signal ỹ_i(n); the IFFT and the time-domain frame splicing are as follows:
ỹ_{i,r}(n) = IFFT( Ỹ_{i,r}[r,ω_k] )
ỹ_i(n) = [ ỹ_{i,1}(n), ỹ_{i,2}(n), ..., ỹ_{i,g}(n) ]
where ỹ_i(n) is the matrix of the enhanced speech signal, ỹ_{i,r}(n) is the enhanced speech signal matrix of the r-th frame, and g is the total frame number of the voice signal, whose value varies with the duration of the voice signal; the enhanced sample matrix ỹ_i(n) is then written with the voice-processing audio function built into matlab software at the speech sampling rate f_s = 16 kHz, obtaining the enhanced voice signal ỹ_i(n);
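A sketch of the inverse transform, frame splicing and write-out; non-overlapping frames and plain concatenation are assumed (the patent states only "frame splicing"), and scipy's wav writer stands in for the matlab audio function:

```python
# Illustrative sketch: back to the time domain and write the enhanced wav file.
import numpy as np
from scipy.io import wavfile

def frames_to_wav(enhanced_spectra, frame_len, path, fs=16000):
    """enhanced_spectra: (G, N) complex array of enhanced frame spectra Y~[r, w_k]."""
    frames = np.fft.ifft(enhanced_spectra, axis=1).real[:, :frame_len]  # IFFT of each frame
    y = frames.reshape(-1)                        # splice the frames along the time axis
    y = y / (np.max(np.abs(y)) + 1e-12)           # normalise to avoid clipping
    wavfile.write(path, fs, (y * 32767).astype(np.int16))  # write at fs = 16 kHz
    return y
```

With 50 ms frames at a 16 kHz sampling rate, frame_len would be 800 samples per frame.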
At this point the enhancement of one voice in the voice training set is finished; the training set X and the test set T are processed in turn according to the above steps, the enhanced training-set speech is stored in the set X̃, and the enhanced test-set speech is stored in the set T̃;
step three, building the voice recognition acoustic model; the acoustic model built by the method is modeled with CNN + CTC, and the input layer takes each speech signal x̃_i(n) from the training set X̃ enhanced in step two, extracting a 200-dimensional feature sequence with the MFCC feature extraction algorithm; the hidden layers alternate convolution layers and Dropout layers with pooling layers to prevent overfitting, the convolution kernel size of the convolution layers being 3 and the pooling window size being 2; finally, the output layer is a fully connected layer of 1423 neurons activated by a softmax function, and the CTC loss function is used to realize connectionist temporal classification output, the 1423-dimensional output corresponding exactly to the 1423 common Chinese pinyin syllables in the Chinese dictionary txt document built in step four; the specific speech recognition acoustic model network framework is shown in figure 4, where the parameters of the convolutional layers, pooling layers, Dropout layers and fully connected layer are all labeled;
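Since the exact layer stack is only given in figure 4, the following sketch shows one plausible CNN + CTC arrangement with kernel size 3, pooling window 2 and a 1423-way softmax output; the filter counts, the dense width and the extra CTC blank class are assumptions, not the patent's configuration:

```python
# Illustrative sketch of a CNN + CTC acoustic model in the spirit of step three.
import tensorflow as tf
from tensorflow.keras import layers, Model

NUM_PINYIN = 1423   # common pinyin syllables in the dictionary
FEAT_DIM = 200      # MFCC feature dimension per frame

def build_acoustic_model():
    inp = layers.Input(shape=(None, FEAT_DIM, 1), name="mfcc")   # (time, 200, 1)
    x = inp
    for filters in (32, 64, 128):                                # assumed filter counts
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.Dropout(0.2)(x)                               # Dropout against overfitting
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(pool_size=2)(x)                  # pooling window of size 2
    x = layers.TimeDistributed(layers.Flatten())(x)              # keep the (downsampled) time axis
    x = layers.Dense(256, activation="relu")(x)                  # assumed dense width
    # 1423 pinyin classes + 1 CTC blank (the blank is an implementation detail
    # not spelled out in the patent text), softmax activated.
    out = layers.Dense(NUM_PINYIN + 1, activation="softmax")(x)
    return Model(inp, out)

def ctc_loss(labels, y_pred, input_len, label_len):
    # CTC loss via the Keras helper; the length arguments are (batch, 1) tensors.
    return tf.keras.backend.ctc_batch_cost(labels, y_pred, input_len, label_len)
```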
step four, building the voice recognition language model; the language model building comprises establishing a language text data set, designing a 2-gram language model and compiling a Chinese character dictionary;
(I) Language text database establishment
Firstly, the text data set required for training the language model is established; the language text data set takes the form of electronic txt files whose contents are newspapers, Chinese textbook passages and well-known novels; these txt files constitute the language text database, and the text data selected for the database must be representative so as to reflect everyday Chinese language habits;
(II) 2-gram language model building
The method builds the language model with the 2-gram algorithm, a language model training method in which the text is divided word by word; the 2 in 2-gram indicates that each probability factor involves at most 2 consecutive words, i.e. the probability of the current word appearing is conditioned only on the word immediately before it, so 2 bounds the memory length over the word sequence; the 2-gram algorithm can be expressed as:
S(W) = P(w_1)·Π_{d=2}^{q} P(w_d | w_{d-1})
where W represents a text sequence, w_1, w_2, ..., w_q are the individual words in the sequence, and q is the length of the sequence; S(W) is the probability that the text sequence conforms to linguistic habit; d indexes the d-th word;
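As an illustration of the 2-gram score above, the following sketch evaluates S(W) in log form from the count tables produced in step five; the add-one smoothing and the table names are assumptions added here so unseen pairs do not zero out the product:

```python
# Illustrative sketch: score a word sequence with a 2-gram model built from count tables.
import math

def bigram_log_score(words, unigram_counts, bigram_counts, vocab_size):
    """Return log S(W) = log P(w1) + sum_d log P(w_d | w_{d-1}), add-one smoothed."""
    if not words:
        return 0.0
    total = sum(unigram_counts.values())
    logp = math.log((unigram_counts.get(words[0], 0) + 1) / (total + vocab_size))
    for prev, cur in zip(words, words[1:]):
        num = bigram_counts.get((prev, cur), 0) + 1          # smoothed pair count
        den = unigram_counts.get(prev, 0) + vocab_size       # smoothed history count
        logp += math.log(num / den)
    return logp
```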
(III) Chinese dictionary establishment
A language model dictionary for the voice recognition system is built; the dictionary of a language is stable and essentially invariant; the Chinese character dictionary in the invention is expressed as a dit.txt file that lists the 1423 pinyin syllables commonly used in daily life together with their corresponding Chinese characters, taking into account that in Chinese one pronunciation can map to several characters; part of the dictionary built by the invention is shown in figure 5 of the drawings;
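As an illustration, such a pinyin-to-characters dictionary could be loaded as follows; the one-syllable-per-line, tab-separated layout is an assumption, since the actual file format is only shown in figure 5:

```python
# Illustrative sketch: load a pinyin -> candidate-characters dictionary.
# Assumed line format: "<pinyin>\t<char1><char2>..." (one pinyin syllable per line).
def load_pinyin_dict(path):
    pinyin_to_chars = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split("\t")
            if len(parts) == 2:
                pinyin, chars = parts
                pinyin_to_chars[pinyin] = list(chars)   # homophones: one sound, many characters
    return pinyin_to_chars
```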
step five, training the built 2-gram language model with the established language text data set to obtain the word occurrence table and the state transition table of the language model; the specific language model training block diagram is shown in figure 6 of the drawings; the language model is trained as follows (a counting sketch follows this list):
(1) the text contents of the language text data set are read in a loop, the occurrences of single words are counted, and the counts are summarized into a single-word occurrence table;
(2) the number of times two words appear together in the language text data set is likewise accumulated in a loop and summarized into the two-word state transition table;
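A counting sketch matching the two steps above; treating each Chinese character as a "word" and reading the data set as an iterable of text lines are assumptions:

```python
# Illustrative sketch: build the single-word occurrence table and the
# two-word state-transition table from the language text data set.
from collections import Counter

def build_count_tables(lines):
    unigram_counts, bigram_counts = Counter(), Counter()
    for line in lines:
        tokens = list(line.strip())                      # per-character tokens (assumption)
        unigram_counts.update(tokens)                    # occurrence count of each word
        bigram_counts.update(zip(tokens, tokens[1:]))    # counts of adjacent word pairs
    return unigram_counts, bigram_counts
```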
step six, the built acoustic model is trained with the trained language model, the established dictionary and the enhanced voice training set X̃, yielding the weight file and the other parameter configuration files of the acoustic model; the specific acoustic model training process comprises the following steps (sketched in code after this list):
(1) initializing weights of all parts of the acoustic network model;
(2) the speech in the training set X̃ is imported in sequence for training; for any speech signal x̃_i(n), the MFCC feature extraction algorithm first produces a 200-dimensional feature sequence of the voice signal; this sequence is then processed in turn by the convolution layers, pooling layers, Dropout layers and fully connected layers listed in figure 7; finally the output layer, a fully connected layer of 1423 neurons activated by a softmax function, yields the 1423-dimensional acoustic features of the voice signal;
(3) once the feature values are obtained, the 1423-dimensional acoustic features are decoded under the action of the language model and the dictionary, and the recognized Chinese pinyin sequence of the voice signal x̃_i(n) is output;
(4) the Chinese pinyin sequence recognized by the acoustic model is compared with the Chinese pinyin label sequence of the i-th voice x̃_i(n) in the training set X̃; the error is computed and back-propagated to update the weights of all parts of the acoustic model; the loss function is the CTC loss and the optimizer is the Adam algorithm; the training batch size is set to 16, the number of iterations (epochs) to 50, and a weight file is saved once every 500 trained voices; the CTC loss is as follows:
L_CTC = -Σ_{(e,z)} ln F(z|e)
where L_CTC represents the total loss produced by training on the training set, e denotes a speech signal x̃_i(n) of the speech-enhanced training set X̃ given as input, z is the output Chinese character sequence, and F(z|e) is the probability that the output sequence is z when the input is e;
(5) the speech recognition acoustic model is trained in sequence according to the above steps until its loss converges, at which point the acoustic model training is finished; the weight file and the various configuration files of the acoustic model are saved; the specific speech recognition acoustic model training diagram is shown in figure 7 of the drawings;
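A minimal sketch of this training loop under the settings stated above (Adam, batch size 16, 50 epochs, a checkpoint roughly every 500 voices), reusing the hypothetical build_acoustic_model() and ctc_loss() helpers from the step-three sketch; train_batches is an assumed generator yielding MFCC batches, label batches and their lengths in the shapes expected by ctc_batch_cost:

```python
# Illustrative sketch of the step-six training loop (not the patent's code).
import tensorflow as tf

BATCH_SIZE, EPOCHS, SAVE_EVERY = 16, 50, 500        # settings stated in step (4)

def train_acoustic_model(train_batches):
    """train_batches(batch_size) yields (mfcc, labels, input_len, label_len) batches."""
    model = build_acoustic_model()                  # from the step-three sketch
    optimizer = tf.keras.optimizers.Adam()          # Adam optimizer
    trained_voices = 0
    for epoch in range(EPOCHS):
        for mfcc, labels, input_len, label_len in train_batches(BATCH_SIZE):
            with tf.GradientTape() as tape:
                y_pred = model(mfcc, training=True)   # (batch, time, 1424) posteriors
                loss = tf.reduce_mean(ctc_loss(labels, y_pred, input_len, label_len))
            grads = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(grads, model.trainable_variables))
            trained_voices += BATCH_SIZE
            if trained_voices % SAVE_EVERY < BATCH_SIZE:   # roughly every 500 voices
                model.save_weights(f"acoustic_model_{trained_voices}.h5")
    return model
```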
step seven, the trained Chinese voice recognition system based on voice enhancement is used to recognize the voices of the test set T̃, the voice recognition accuracy is counted, and a performance comparison analysis with the traditional algorithm is carried out; a specific speech recognition test system flow diagram is shown in figure 8; part of the comparison of the recognition accuracy of the present patent with the conventional algorithm in noisy environments is illustrated in figure 9, and part of the comparison in reverberant environments is illustrated in figure 10;
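One plausible way to count the recognition accuracy is a label-level edit-distance accuracy averaged over the test set; the patent does not specify the exact metric, so the following sketch is an assumption:

```python
# Illustrative sketch: recognition accuracy as 1 - (edit distance / reference length).
def sequence_accuracy(reference, hypothesis):
    m, n = len(reference), len(hypothesis)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                               # deletions
    for j in range(n + 1):
        d[0][j] = j                               # insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return max(0.0, 1.0 - d[m][n] / max(m, 1))

# Averaging sequence_accuracy over all test utterances gives the reported accuracy.
```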
the specific implementation mode is as follows:
(1) a traditional voice recognition system is used to run a voice recognition test on the 2000 non-enhanced voices of the test set T of the established complex-environment voice database, and the voice recognition accuracy is counted; representative voice recognition results are shown in figures 9 and 10;
(2) the speech recognition system based on speech enhancement of the present invention is used to run a voice recognition test on the 2000 enhanced voices of the test set T̃ of the built speech database, and the voice recognition accuracy of the method is counted; representative voice recognition results are shown in figures 9 and 10;
(3) finally, the performance analysis is carried out on the voice recognition system based on the voice enhancement;
after the statistics are completed, the voice recognition algorithm based on voice enhancement is found to greatly improve recognition accuracy in a Gaussian white noise environment, in environments with background noise or an interfering sound source, and in a reverberant environment, with performance improved by about 30%; compared with the traditional speech recognition algorithm, which performs poorly in exactly these environments, the algorithm of the invention performs well; a comparison of the recognition performance of the speech recognition algorithm of the invention and the traditional algorithm in selected noise environments is shown in figure 9, and the corresponding comparison in selected reverberation environments is shown in figure 10;
therefore, the deep neural network speech recognition method based on speech enhancement in a complex environment effectively solves the problems that existing speech recognition algorithms are sensitive to noisy environments, place high demands on speech quality and suit only a single scenario, and it realizes speech recognition in complex speech environments;
the symbol i appearing in each step denotes the i-th speech signal of the training and test sets subjected to speech enhancement processing, i = 1, 2, ..., 12000; the symbol r denotes the r-th frame of the speech signal, r = 1, 2, ..., g; g is the total frame number after the voice signal is framed, and its value varies with the duration of the processed voice; the symbol l denotes the l-th frequency band of the speech signal, l = 0, 1, ..., 39; k is the index of the discrete frequencies, k = 0, 1, ..., N-1.
CN202010880777.7A 2020-08-28 2020-08-28 Deep neural network voice recognition method based on voice enhancement in complex environment Active CN111986661B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010880777.7A CN111986661B (en) 2020-08-28 2020-08-28 Deep neural network voice recognition method based on voice enhancement in complex environment

Publications (2)

Publication Number Publication Date
CN111986661A true CN111986661A (en) 2020-11-24
CN111986661B CN111986661B (en) 2024-02-09

Family

ID=73440031

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010880777.7A Active CN111986661B (en) 2020-08-28 2020-08-28 Deep neural network voice recognition method based on voice enhancement in complex environment

Country Status (1)

Country Link
CN (1) CN111986661B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160240190A1 (en) * 2015-02-12 2016-08-18 Electronics And Telecommunications Research Institute Apparatus and method for large vocabulary continuous speech recognition
KR20190032868A (en) * 2017-09-20 2019-03-28 현대자동차주식회사 Method and apparatus for voice recognition
CN109272990A (en) * 2018-09-25 2019-01-25 江南大学 Audio recognition method based on convolutional neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
潘粤成; 刘卓; 潘文豪; 蔡典仑; 韦政松: "An end-to-end Mandarin speech recognition method based on CNN/CTC" (一种基于CNN/CTC的端到端普通话语音识别方法), Modern Information Technology (现代信息科技), no. 05 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112633175A (en) * 2020-12-24 2021-04-09 哈尔滨理工大学 Single note real-time recognition algorithm based on multi-scale convolution neural network under complex environment
CN112786051A (en) * 2020-12-28 2021-05-11 出门问问(苏州)信息科技有限公司 Voice data identification method and device
CN112786051B (en) * 2020-12-28 2023-08-01 问问智能信息科技有限公司 Voice data recognition method and device
CN113257262A (en) * 2021-05-11 2021-08-13 广东电网有限责任公司清远供电局 Voice signal processing method, device, equipment and storage medium
CN113808581A (en) * 2021-08-17 2021-12-17 山东大学 Chinese speech recognition method for acoustic and language model training and joint optimization
CN113808581B (en) * 2021-08-17 2024-03-12 山东大学 Chinese voice recognition method based on acoustic and language model training and joint optimization
CN114444609A (en) * 2022-02-08 2022-05-06 腾讯科技(深圳)有限公司 Data processing method and device, electronic equipment and computer readable storage medium
CN116580708A (en) * 2023-05-30 2023-08-11 中国人民解放军61623部队 Intelligent voice processing method and system

Also Published As

Publication number Publication date
CN111986661B (en) 2024-02-09

Similar Documents

Publication Publication Date Title
CN111986661A (en) Deep neural network speech recognition method based on speech enhancement in complex environment
CN112017644B (en) Sound transformation system, method and application
CN110782872A (en) Language identification method and device based on deep convolutional recurrent neural network
CN102568476B (en) Voice conversion method based on self-organizing feature map network cluster and radial basis network
CN114566189B (en) Speech emotion recognition method and system based on three-dimensional depth feature fusion
Rammo et al. Detecting the speaker language using CNN deep learning algorithm
CN115602165B (en) Digital employee intelligent system based on financial system
CN111899757A (en) Single-channel voice separation method and system for target speaker extraction
CN114495969A (en) Voice recognition method integrating voice enhancement
CN111508466A (en) Text processing method, device and equipment and computer readable storage medium
CN109452932A (en) A kind of Constitution Identification method and apparatus based on sound
CN112185363A (en) Audio processing method and device
Almekhlafi et al. A classification benchmark for Arabic alphabet phonemes with diacritics in deep neural networks
CN113744715A (en) Vocoder speech synthesis method, device, computer equipment and storage medium
CN113539243A (en) Training method of voice classification model, voice classification method and related device
CN114626424B (en) Data enhancement-based silent speech recognition method and device
Li et al. Intelligibility enhancement via normal-to-lombard speech conversion with long short-term memory network and bayesian Gaussian mixture model
CN112951270B (en) Voice fluency detection method and device and electronic equipment
CN114283822A (en) Many-to-one voice conversion method based on gamma pass frequency cepstrum coefficient
CN114550675A (en) Piano transcription method based on CNN-Bi-LSTM network
CN111009252A (en) Speech enhancement system and method of embedding codec
Dua et al. A review on Gujarati language based automatic speech recognition (ASR) systems
CN114863939B (en) Panda attribute identification method and system based on sound
Agrawal et al. Robust raw waveform speech recognition using relevance weighted representations
Shome et al. A robust DNN model for text-independent speaker identification using non-speaker embeddings in diverse data conditions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant