CN108133702A - Deep neural network speech enhancement model based on the MEE optimality criterion - Google Patents

Deep neural network speech enhancement model based on the MEE optimality criterion

Info

Publication number
CN108133702A
CN108133702A CN201711384226.6A
Authority
CN
China
Prior art keywords
layer
mee
dnn
neural network
deep neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711384226.6A
Other languages
Chinese (zh)
Inventor
周翊
黄张翼
舒晓峰
孙旭光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN201711384226.6A priority Critical patent/CN108133702A/en
Publication of CN108133702A publication Critical patent/CN108133702A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Complex Calculations (AREA)

Abstract

The present invention relates to a deep neural network speech enhancement model based on the MEE (minimum error entropy) optimality criterion, belonging to the field of artificial-intelligence speech enhancement. The model comprises an input layer, hidden layers, and an output layer, and operates in two stages: a training stage and an enhancement stage. In the training stage, clean speech is mixed pairwise with several noise types to build noisy speech at different signal-to-noise ratios; features are extracted from the noisy speech and input to a DNN for training. In the enhancement stage, the same features are extracted from the noisy speech under test and decoded by the trained DNN, which outputs an estimate of the clean-speech features; waveform reconstruction then yields the denoised speech file. The present invention generalizes well to denoising noisy speech containing the non-stationary noise encountered in practical problems.

Description

Deep neural network speech enhancement model based on the MEE optimality criterion
Technical field
The invention belongs to the field of artificial-intelligence speech enhancement and mainly concerns the application of deep neural networks in speech acoustic models.
Background art
In recent years, following the success of the deep neural network (DNN) in the field of speech recognition, the speech enhancement task has also made significant progress. The deep nonlinear structure of a DNN can be designed into a fine noise-reduction filter, and when trained on large amounts of data a DNN can fully learn the complicated nonlinear relationship between noisy speech and clean speech.
A speech enhancement model based on a deep neural network needs a cost function to update the network weights. In the regression task of speech enhancement, the minimum mean-square error (MSE) criterion is generally taken as the optimization criterion. Its advantage is that it is simple to compute, but it is only suitable for stationary noise such as Gaussian noise. MSE is a global similarity measure: every sample point in the space under test carries weight, and points far from the line y = x are weighted heavily, so MSE amplifies the influence of points far from the mean of the error distribution. MSE therefore performs best when the error follows a Gaussian distribution. In practical problems, however, noisy speech contains many non-stationary noises whose errors do not follow a Gaussian distribution, so the effect of the MSE criterion in practice is usually not ideal.
In contrast to the global measure of MSE, minimum error entropy (MEE) is a local similarity measure whose behavior is governed mainly by the kernel width. With a suitably chosen kernel width, the performance surface of the MEE criterion is smoother than the MSE performance surface over most of the space. MEE is not only robust but also better suited to the non-Gaussian noise found in practical problems. To address the defect that the MSE criterion performs poorly on non-stationary noise, a deep neural network speech enhancement model is therefore needed in which the MEE optimality criterion replaces the traditional MSE criterion.
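As an editorial illustration of this locality (not part of the patent; the error values and the kernel width of 0.1 are invented for the demonstration), the following minimal Python sketch contrasts the global MSE cost with the kernel-based information potential that MEE maximizes:

```python
import numpy as np

def mse(e):
    # Global measure: every sample contributes, and a single outlier dominates.
    return np.mean(e ** 2)

def information_potential(e, h=0.1):
    # Local measure used by MEE: pairwise error differences pass through a
    # Gaussian kernel of width h, so pairs far from the bulk of the error
    # distribution contribute almost nothing.
    diff = e[:, None] - e[None, :]
    return np.mean(np.exp(-diff ** 2 / (2 * h ** 2)) / (np.sqrt(2 * np.pi) * h))

errors = np.array([0.01, -0.02, 0.015, 0.0, 5.0])   # last entry is an outlier
print(mse(errors))                   # ~5.0: the outlier dominates the cost
print(information_potential(errors)) # barely differs from the outlier-free value
```

The outlier inflates the MSE cost by orders of magnitude while leaving the information potential almost unchanged, which is the robustness to non-Gaussian errors argued above.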
Summary of the invention
In view of this, the purpose of the present invention is to provide a deep neural network speech enhancement model based on the MEE optimality criterion that generalizes well to denoising noisy speech containing the non-stationary noise encountered in practical problems.
To achieve the above purpose, the present invention provides the following technical solution:
A deep neural network speech enhancement model based on the MEE optimality criterion, as shown in Fig. 2, comprising an input layer, hidden layers, and an output layer; there are three hidden layers with 1024 nodes each.
As shown in Fig. 1, the model is divided into a training stage and an enhancement stage.
The training stage: clean speech is mixed pairwise with several noise types to build noisy speech at different signal-to-noise ratios; features are extracted from the noisy speech and input to a deep neural network (DNN) for training.
The enhancement stage: the same features are extracted from the noisy speech under test and input to the trained DNN for decoding; the DNN outputs an estimate of the clean-speech features, and waveform reconstruction is then carried out to obtain the denoised speech file. The feature extraction in both stages is identical; a sketch is given below.
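As an illustration of the feature extraction shared by both stages (a sketch under assumed parameters: 512-sample frames, 256-sample hop, Hann window; the patent does not fix these values), the log power spectrum can be computed as follows:

```python
import numpy as np

def log_power_spectrum(wave, n_fft=512, hop=256):
    # Frame the waveform, window each frame, and keep the log power spectrum;
    # the phase is retained for waveform reconstruction in the enhancement stage.
    window = np.hanning(n_fft)
    frames = [wave[i:i + n_fft] * window
              for i in range(0, len(wave) - n_fft + 1, hop)]
    spec = np.fft.rfft(np.stack(frames), axis=1)
    return np.log(np.abs(spec) ** 2 + 1e-12), np.angle(spec)

noisy = np.random.randn(16000)               # stand-in for one second of noisy speech
features, phase = log_power_spectrum(noisy)  # features feed the DNN; phase is reused later
```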
Further, in the DNN training stage, the DNN weights are updated with the error back-propagation (BP) algorithm. The input produces an activation response at each hidden layer; the output of each hidden layer is the input of the next layer, until the last layer produces the predicted value. The difference between the predicted value and the reference signal is the error to be back-propagated, and each weight and bias of the DNN is adjusted according to this error.
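A minimal sketch of the forward pass just described, assuming sigmoid hidden activations, a linear output layer, and 257-dimensional log-power-spectrum features (only the three hidden layers of 1024 nodes come from the patent; the rest are assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, weights, biases):
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigmoid(W @ h + b)               # activation response of each hidden layer
    return weights[-1] @ h + biases[-1]      # last layer yields the predicted value

rng = np.random.default_rng(0)
sizes = [257, 1024, 1024, 1024, 257]         # input, 3 hidden layers of 1024 nodes, output
weights = [rng.normal(0, 0.01, (o, i)) for i, o in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(o) for o in sizes[1:]]
prediction = forward(rng.normal(size=257), weights, biases)
```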
Further, the minimum error entropy (MEE) cost function is finally defined as:

$$H(e) = -\log\left(\frac{1}{n^{2}}\sum_{i=1}^{n}\sum_{u=1}^{n}K_{h}\big(e(i)-e(u)\big)\right) \qquad (1)$$

where n represents the number of hidden-layer nodes; e(i) and e(u) represent the errors of the i-th and the u-th neuron respectively; the error e = target − output represents the difference between the estimate of the clean-speech log power spectrum output by the trained DNN and the reference value; h represents the kernel width, i.e. the smoothing parameter, set to 0.01 in the present invention; and the Gaussian kernel function K is expressed as:

$$K_{h}(x) = \frac{1}{\sqrt{2\pi}\,h}\exp\left(-\frac{x^{2}}{2h^{2}}\right) \qquad (2)$$
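Transcribing Eqs. (1) and (2) directly into Python gives the following sketch; the vectorized pairwise-difference computation and the toy values are implementation choices, not from the patent:

```python
import numpy as np

def gaussian_kernel(x, h):
    # Eq. (2): Gaussian kernel with width (smoothing parameter) h.
    return np.exp(-x ** 2 / (2 * h ** 2)) / (np.sqrt(2 * np.pi) * h)

def mee_cost(target, output, h=0.01):
    # Eq. (1): negative log of the mean pairwise kernel value over the errors.
    e = target - output                    # e = target - output, as defined above
    diff = e[:, None] - e[None, :]         # all pairwise differences e(i) - e(u)
    v = np.mean(gaussian_kernel(diff, h))  # information potential
    return -np.log(v)

t = np.array([0.50, 0.10, -0.30])          # toy reference log power spectrum
y = np.array([0.48, 0.12, -0.31])          # toy DNN estimate
print(mee_cost(t, y))
```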
In order to use the BP algorithm, the analytical expression of the gradient Δω is needed. Because the logarithm in Eq. (1) is monotonically increasing, minimizing H(e) is equivalent to minimizing its operand, which can be expressed as:

$$J = -\frac{1}{n^{2}}\sum_{i=1}^{n}\sum_{u=1}^{n}K_{h}\big(e(i)-e(u)\big) \qquad (3)$$

where y_k = output, so that e(k) = target_k − y_k.

When i = k, the derivative of K_h(e(i) − e(u)) with respect to y_k is:

$$\frac{\partial K_{h}\big(e(k)-e(u)\big)}{\partial y_{k}} = \frac{e(k)-e(u)}{h^{2}}\,K_{h}\big(e(k)-e(u)\big) \qquad (4)$$

When u = k, by the symmetry of the Gaussian kernel the derivative with respect to y_k is:

$$\frac{\partial K_{h}\big(e(i)-e(k)\big)}{\partial y_{k}} = \frac{e(k)-e(i)}{h^{2}}\,K_{h}\big(e(k)-e(i)\big) \qquad (5)$$

Combining Eqs. (3), (4), and (5) gives:

$$\frac{\partial J}{\partial y_{k}} = -\frac{2}{n^{2}h^{2}}\sum_{j=1}^{n}\big(e(k)-e(j)\big)\,K_{h}\big(e(k)-e(j)\big) \qquad (6)$$

Simplifying Eq. (6) and applying the chain rule gives:

$$\Delta W_{kj} = \frac{\partial J}{\partial W_{kj}} = \frac{\partial J}{\partial y_{k}}\,f'\big(\mathrm{net}(k)\big)\,x_{j} \qquad (7)$$

where W_kj represents the j-th weight of the k-th neuron, net(k) represents the net input of that neuron, x_j is the input carried by the weight, f(·) is the activation function of the neuron, and f′(·) represents the derivative of f(·).

In summary, given a learning rate η, the weight update of the BP algorithm using MEE as the cost function is obtained from Eq. (7):

$$W_{kj} \leftarrow W_{kj} - \eta\,\Delta W_{kj} \qquad (8)$$
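A sketch of the update of Eq. (8) driven by the output-layer gradient of Eq. (6); the single linear layer (so that f′(net) = 1) and the random data are illustrative assumptions:

```python
import numpy as np

def mee_grad_wrt_output(target, output, h=0.01):
    # dJ/dy_k from Eq. (6), with J as in Eq. (3).
    e = target - output
    n = len(e)
    diff = e[:, None] - e[None, :]                                 # e(k) - e(j)
    k = np.exp(-diff ** 2 / (2 * h ** 2)) / (np.sqrt(2 * np.pi) * h)
    return -(2.0 / (n ** 2 * h ** 2)) * np.sum(diff * k, axis=1)

rng = np.random.default_rng(0)
x = rng.normal(size=4)                     # input to the layer
W = rng.normal(size=(3, 4))                # weights W_kj
target = rng.normal(size=3)
eta = 0.1                                  # learning rate
y = W @ x                                  # linear layer: f(net) = net, f'(net) = 1
W -= eta * np.outer(mee_grad_wrt_output(target, y), x)   # Eq. (8)
```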
The beneficial effects of the present invention are as follows: in the speech enhancement model based on a deep neural network, the minimum error entropy (MEE) optimality criterion replaces the conventional minimum mean-square error criterion, effectively addressing the denoising of noisy speech containing the non-stationary noise found in practical problems.
Description of the drawings
To make the purpose, technical solution, and beneficial effects of the present invention clearer, the following drawings are provided for explanation:
Fig. 1 is a block diagram of the deep neural network speech enhancement system;
Fig. 2 is a diagram of the BP network.
Specific embodiments
The preferred embodiments of the present invention are described in detail below with reference to the drawings.
From the TIMIT data set, 4620 clean utterances are selected and mixed with white noise, pink noise, Volvo noise, and car noise at signal-to-noise ratios of −5 dB and 5 dB to form the training set. Another 200 clean utterances are likewise mixed with babble noise and factory noise at the same signal-to-noise ratios to form the test set.
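A sketch of building one noisy training utterance at a target SNR as described above; file I/O is omitted, and the random arrays stand in for real TIMIT and noise recordings:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    # Scale the noise so that 10*log10(P_clean / P_noise_scaled) equals snr_db.
    noise = noise[:len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise

clean = np.random.randn(16000)   # stand-in for a TIMIT utterance
noise = np.random.randn(16000)   # stand-in for, e.g., white or factory noise
training_pairs = [(mix_at_snr(clean, noise, snr), clean) for snr in (-5, 5)]
```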
In the training stage, features are extracted from the training-set speech; the log power spectrum is selected as the feature and is input separately to the MSE-DNN network and to the MEE-DNN network proposed by the present invention for training.
After network training is complete, the log power spectrum is likewise extracted from the test-set speech and input separately to the two different DNN networks to obtain estimates of the clean-speech log power spectrum; waveform reconstruction is then carried out with the overlap-add method, yielding enhanced speech files that can be auditioned.
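A sketch of the reconstruction step, combining the DNN's estimate of the clean log power spectrum with the noisy phase kept at analysis time; the parameters mirror the feature-extraction sketch above, and the synthesis windowing is a simplification rather than the patent's exact overlap-add procedure:

```python
import numpy as np

def reconstruct(log_power, phase, n_fft=512, hop=256):
    mag = np.sqrt(np.exp(log_power))         # invert the log power spectrum
    spec = mag * np.exp(1j * phase)          # reuse the noisy phase
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    window = np.hanning(n_fft)
    out = np.zeros(hop * (len(frames) - 1) + n_fft)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + n_fft] += frame * window   # overlap-add
    return out

# e.g. enhanced = reconstruct(dnn_estimate, phase), with phase from the analysis step
```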
The quality of speech enhanced by the MSE-DNN network and by the MEE-DNN network is compared in Table 1, where N1 denotes babble noise and N2 denotes factory noise.
Table 1
Finally, it should be noted that the above preferred embodiments merely illustrate the technical solution of the present invention and do not restrict it. Although the present invention has been described in detail through the above preferred embodiments, those skilled in the art should understand that various changes in form and detail may be made to it without departing from the scope defined by the claims of the present invention.

Claims (3)

1. A deep neural network speech enhancement model based on the MEE optimality criterion, characterized in that: the model comprises an input layer, hidden layers, and an output layer; there are three hidden layers with 1024 nodes each;
the model is divided into a training stage and an enhancement stage;
the training stage: clean speech is mixed pairwise with several noise types to build noisy speech at different signal-to-noise ratios; features are extracted from the noisy speech and input to a deep neural network (DNN) for training;
the enhancement stage: the same features are extracted from the noisy speech under test and input to the trained DNN for decoding; the DNN outputs an estimate of the clean-speech features, and waveform reconstruction is then carried out to obtain the denoised speech file.
2. The deep neural network speech enhancement model based on the MEE optimality criterion of claim 1, characterized in that: in the DNN training stage, the DNN weights are updated with the error back-propagation (BP) algorithm; the input produces an activation response at each hidden layer, the output of each hidden layer being the input of the next layer, until the last layer produces the predicted value; the difference between the predicted value and the reference signal is the error to be back-propagated, and each weight and bias of the DNN is adjusted according to this error.
3. The deep neural network speech enhancement model based on the MEE optimality criterion of claim 1, characterized in that: the minimum error entropy (MEE) cost function is finally defined as:

$$H(e) = -\log\left(\frac{1}{n^{2}}\sum_{i=1}^{n}\sum_{u=1}^{n}K_{h}\big(e(i)-e(u)\big)\right) \qquad (1)$$

wherein n represents the number of hidden-layer nodes; e(i) and e(u) represent the errors of the i-th and the u-th neuron respectively; the error e = target − output represents the difference between the estimate of the clean-speech log power spectrum output by the trained DNN and the reference value; h represents the kernel width, i.e. the smoothing parameter; and the Gaussian kernel function K is expressed as:

$$K_{h}(x) = \frac{1}{\sqrt{2\pi}\,h}\exp\left(-\frac{x^{2}}{2h^{2}}\right) \qquad (2)$$

in order to use the BP algorithm, the analytical expression of the gradient Δω is needed; because the logarithm in Eq. (1) is monotonically increasing, minimizing H(e) is equivalent to minimizing its operand, which is expressed as:

$$J = -\frac{1}{n^{2}}\sum_{i=1}^{n}\sum_{u=1}^{n}K_{h}\big(e(i)-e(u)\big) \qquad (3)$$

wherein y_k = output, so that e(k) = target_k − y_k;

when i = k, the derivative of K_h(e(i) − e(u)) with respect to y_k is:

$$\frac{\partial K_{h}\big(e(k)-e(u)\big)}{\partial y_{k}} = \frac{e(k)-e(u)}{h^{2}}\,K_{h}\big(e(k)-e(u)\big) \qquad (4)$$

when u = k, the derivative with respect to y_k is:

$$\frac{\partial K_{h}\big(e(i)-e(k)\big)}{\partial y_{k}} = \frac{e(k)-e(i)}{h^{2}}\,K_{h}\big(e(k)-e(i)\big) \qquad (5)$$

combining Eqs. (3), (4), and (5) gives:

$$\frac{\partial J}{\partial y_{k}} = -\frac{2}{n^{2}h^{2}}\sum_{j=1}^{n}\big(e(k)-e(j)\big)\,K_{h}\big(e(k)-e(j)\big) \qquad (6)$$

simplifying Eq. (6) and applying the chain rule gives:

$$\Delta W_{kj} = \frac{\partial J}{\partial W_{kj}} = \frac{\partial J}{\partial y_{k}}\,f'\big(\mathrm{net}(k)\big)\,x_{j} \qquad (7)$$

wherein W_kj represents the j-th weight of the k-th neuron, net(k) represents the net input of that neuron, x_j is the input carried by the weight, f(·) is the activation function of the neuron, and f′(·) represents the derivative of f(·);

in summary, given a learning rate η, the weight update of the BP algorithm using MEE as the cost function is obtained from Eq. (7):

$$W_{kj} \leftarrow W_{kj} - \eta\,\Delta W_{kj} \qquad (8)$$
CN201711384226.6A 2017-12-20 2017-12-20 Deep neural network speech enhancement model based on MEE optimality criterion Pending CN108133702A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711384226.6A CN108133702A (en) 2017-12-20 2017-12-20 Deep neural network speech enhancement model based on MEE optimality criterion


Publications (1)

Publication Number Publication Date
CN108133702A true CN108133702A (en) 2018-06-08

Family

ID=62390713

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711384226.6A Pending CN108133702A (en) Deep neural network speech enhancement model based on MEE optimality criterion

Country Status (1)

Country Link
CN (1) CN108133702A (en)



Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104883330A (en) * 2014-02-27 2015-09-02 清华大学 Blind equalization method and blind equalization system
CN106157953A * 2015-04-16 2016-11-23 科大讯飞股份有限公司 Continuous speech recognition method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JORGE M. SANTOS et al.: "Robust Sound Event Classification Using Deep Neural Networks", IEEE/ACM Transactions on Audio, Speech, and Language Processing *
YONG XU et al.: "An Experimental Study on Speech Enhancement Based on Deep Neural Networks", IEEE Signal Processing Letters *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020010566A1 * 2018-07-12 2020-01-16 Intel Corporation Devices and methods for link adaptation
CN109378010A * 2018-10-29 2019-02-22 珠海格力电器股份有限公司 Neural network model training method, speech denoising method, and device
CN109326299A * 2018-11-14 2019-02-12 平安科技(深圳)有限公司 Speech enhancement method, device, and storage medium based on fully convolutional neural networks
CN109326299B * 2018-11-14 2023-04-25 平安科技(深圳)有限公司 Speech enhancement method, device, and storage medium based on fully convolutional neural networks
CN109256144A * 2018-11-20 2019-01-22 中国科学技术大学 Speech enhancement method based on ensemble learning and noise-aware training
CN109256144B * 2018-11-20 2022-09-06 中国科学技术大学 Speech enhancement method based on ensemble learning and noise-aware training
CN109658949A * 2018-12-29 2019-04-19 重庆邮电大学 Speech enhancement method based on deep neural network
CN110111803A * 2019-05-09 2019-08-09 南京工程学院 Transfer learning speech enhancement method based on self-attention multi-kernel maximum mean discrepancy
CN110211602B * 2019-05-17 2021-09-03 北京华控创为南京信息技术有限公司 Intelligent speech-enhanced communication method and device
CN110211602A * 2019-05-17 2019-09-06 北京华控创为南京信息技术有限公司 Intelligent speech-enhanced communication method and device
WO2021027132A1 * 2019-08-12 2021-02-18 平安科技(深圳)有限公司 Audio processing method and apparatus and computer storage medium
CN111081266A * 2019-12-18 2020-04-28 暗物智能科技(广州)有限公司 Method and system for training a generative adversarial network and for speech enhancement
CN112086100A * 2020-08-17 2020-12-15 杭州电子科技大学 Urban noise identification method using a multilayer random neural network based on quantization error entropy
CN112086100B * 2020-08-17 2022-12-02 杭州电子科技大学 Urban noise identification method using a multilayer random neural network based on quantization error entropy
CN115331689A * 2022-08-11 2022-11-11 北京声智科技有限公司 Training method, apparatus, device, storage medium, and product for a speech noise-reduction model

Similar Documents

Publication Publication Date Title
CN108133702A (en) Deep neural network speech enhancement model based on MEE optimality criterion
CN109524020B (en) Speech enhancement processing method
CN105023580B (en) Unsupervised noise estimation and speech enhancement method based on separable deep autoencoding
CN111429947B (en) Speech emotion recognition method based on multi-stage residual convolutional neural network
CN108899051A (en) Speech emotion recognition model and recognition method based on joint feature representation
CN109147774B (en) Improved time-delay neural network acoustic model
CN108172238A (en) Speech enhancement algorithm based on multiple convolutional neural networks in a speech recognition system
Shen et al. Reinforcement learning based speech enhancement for robust speech recognition
CN104751227B (en) Construction method and system for the deep neural network of speech recognition
CN109256118B (en) End-to-end Chinese dialect identification system and method based on generative auditory model
CN108922513A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN107967920A (en) Improved autoencoder neural network speech enhancement algorithm
CN111899757A (en) Single-channel voice separation method and system for target speaker extraction
CN102945673A (en) Continuous speech recognition method with dynamically changing speech command range
CN114708855B (en) Voice awakening method and system based on binary residual error neural network
CN113823264A (en) Speech recognition method, speech recognition device, computer-readable storage medium and computer equipment
Sangeetha et al. Emotion speech recognition based on adaptive fractional deep belief network and reinforcement learning
Shi et al. End-to-End Monaural Speech Separation with Multi-Scale Dynamic Weighted Gated Dilated Convolutional Pyramid Network.
CN112885375A (en) Global signal-to-noise ratio estimation method based on auditory filter bank and convolutional neural network
Mu et al. Voice activity detection optimized by adaptive attention span transformer
McAuley et al. Subband correlation and robust speech recognition
Shen et al. Transducer-based language embedding for spoken language identification.
CN110619886B (en) End-to-end voice enhancement method for low-resource Tujia language
CN113571095A (en) Speech emotion recognition method and system based on nested deep neural network
Naini et al. Whisper Activity Detection Using CNN-LSTM Based Attention Pooling Network Trained for a Speaker Identification Task.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180608