CN109410976B - Speech enhancement method based on binaural sound source localization and deep learning in binaural hearing aid - Google Patents

Speech enhancement method based on binaural sound source localization and deep learning in binaural hearing aid

Info

Publication number
CN109410976B
CN109410976B (application CN201811292475.7A)
Authority
CN
China
Prior art keywords
speech
voice
quadrant
binaural
deep learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811292475.7A
Other languages
Chinese (zh)
Other versions
CN109410976A (en)
Inventor
李如玮
李涛
孙晓月
杨登才
潘冬梅
张永亚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201811292475.7A priority Critical patent/CN109410976B/en
Publication of CN109410976A publication Critical patent/CN109410976A/en
Application granted granted Critical
Publication of CN109410976B publication Critical patent/CN109410976B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • G10L21/0272 Voice signal separating
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques in which the extracted parameters are the cepstrum
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques using neural networks

Abstract

A speech enhancement method based on binaural sound source localization and deep learning in a binaural digital hearing aid, belonging to the field of speech signal processing. First, a two-stage deep neural network accurately localizes the target speech, and the result is combined with spatial filtering to remove noise arriving from directions different from that of the target speech. Then a latency-controlled bidirectional long short-term memory deep neural network combined with a classifier is built as the deep learning model; the extracted multi-resolution auditory cepstrum coefficients are used as feature input, and the nonlinear processing capability of deep learning classifies each time-frequency unit of the noisy speech as either a speech unit or a noise unit. Finally, a speech waveform synthesis algorithm removes the noise that shares the direction of the target speech. The algorithm thus removes both noise from directions different from the target speech and noise from the same direction, and ultimately yields enhanced speech that satisfies the speech intelligibility and comfort requirements of the hearing-impaired. All deep learning models are trained offline, so the real-time requirement is met.

Description

Speech enhancement method based on binaural sound source localization and deep learning in binaural hearing aid
Technical Field
The invention belongs to the technical field of speech signal processing, and relates to two key speech signal processing technologies in digital hearing aids: target speech localization and speech enhancement.
Background
Hearing impairment is a chronic condition that seriously affects quality of life. The incidence of hearing loss among the elderly over 65 years of age is about 30% to 40% in the United States, 20% in Canada, 35% in Europe, and 35% in China, and it rises sharply with increasing age. At present, the worldwide population of people over 60 years old has reached 600 million, of which China accounts for nearly 30%, and 22.28% of elderly people in China have hearing thresholds within the normal range. The development of hearing aids has brought hope to these hearing-impaired patients. A hearing aid is a device that amplifies sound to compensate for lost hearing, and it is an important means of overcoming the communication difficulties of hearing-impaired patients. In recent years, with the adoption of digital technology, hearing aid technology has developed rapidly. Compared with traditional hearing aids, digital hearing aids are far more controllable: they can divide the signal into frequency bands flexibly and adjust the sound intensity in multiple bands. Digital hearing aids are also more flexible; advanced algorithms free them from dependence on fixed analog circuits, so the algorithms can be adjusted and updated to meet patients' needs. The miniaturization, interactivity and intelligence of digital hearing aids have also greatly increased patients' acceptance of them.
However, even today, with the market full of hearing aids of all kinds, the wearing rate among hearing-impaired people in China is only about 1%, because hearing aid performance drops sharply in noisy environments: the hearing-impaired cannot hear clearly and at the same time feel uncomfortable. In the cocktail-party scenario in particular, a hearing-impaired listener cannot pick out the voice he or she wants to hear at all, which seriously affects quality of life. Surveys have found that even in the United States, where digital hearing aid technology is highly developed, 70% of hearing aid users are not satisfied with their performance in noisy environments, and 95% of users wish for improved speech intelligibility and comfort in noise.
Speech enhancement is a key technology for improving the performance of digital hearing aids in noisy environments. The speech enhancement module in a digital hearing aid processes the acquired raw digital signal to suppress background noise and improve speech quality, so that the output is easier for the patient to accept; it reduces listener fatigue (a subjective measure), improves the intelligibility of speech for the patient (an objective measure), and improves the robustness of subsequent processing stages to input noise.
Currently, the speech enhancement algorithms commonly used in digital hearing aids include spectral subtraction, multi-channel adaptive noise reduction, synchronous-detection noise reduction, harmonic enhancement, Wiener filtering, MMSE estimation of the short-time spectral amplitude, auditory-masking-based methods, binaural noise suppression, and so on. These methods are applied in various monaural hearing aids and improve speech intelligibility and comfort for the hearing-impaired to a certain extent. For the hearing-impaired, however, existing speech enhancement methods cannot achieve satisfactory results, because each has its own defects while real noise is highly varied: some methods can only partially remove noise arriving from directions different from the target speech, some introduce new noise while removing noise, and some cause unrecoverable damage to the speech itself. Because of these shortcomings, less than 1 percent of hearing-impaired people in China wear hearing aids, which seriously degrades their quality of life. Eliminating the influence of noise on digital hearing aid performance, improving the speech intelligibility and comfort experienced by hearing-impaired wearers, and relieving their suffering is therefore the development trend of digital hearing aids.
In recent years deep learning has developed rapidly in the field of speech signal processing, and many deep-learning-based speech enhancement algorithms have been proposed. Their basic idea is to train a neural network with several layers on noisy speech signals and the corresponding clean speech so that it becomes robust to noise. Because of the strong nonlinear fitting capability of deep learning, such methods are well suited to enhancing speech contaminated by non-stationary noise. Common deep-learning-based speech enhancement approaches are: (1) estimating the clean speech signal within the noisy speech signal with a deep neural network; (2) estimating a time-frequency mask between clean speech and noise with a deep neural network; (3) estimating the clean speech signal and the noise signal separately with a deep neural network; (4) classifying the noise with a deep neural network and then training a dedicated speech enhancement model for each class. In the algorithms above, however, either all of the collected data are fed directly into the deep network, whose complexity is then too high to meet the real-time requirement of a digital hearing aid, or characteristic parameters are extracted from the collected data as the network input, but the parameters extracted by current methods do not describe the differences between speech and noise well enough to improve speech intelligibility and comfort for the hearing-impaired. The autonomous learning ability of deep neural networks nevertheless remains unmatched by other methods.
Therefore, a speech enhancement algorithm that exploits the learning capability of deep networks is needed to provide satisfactory speech intelligibility and comfort for the hearing-impaired in noisy environments.
The invention provides a speech enhancement algorithm based on binaural sound source localization and deep learning. First, a two-stage deep neural network accurately localizes the target speech, and spatial filtering then removes noise arriving from directions different from that of the target speech. Next, a deep learning model that combines an LC-BLSTM-DNN with a classifier is built; the extracted multi-resolution auditory cepstrum coefficients are used as feature input, and the nonlinear processing capability of deep learning classifies each time-frequency unit of the noisy speech as either a speech unit or a noise unit. Finally, a speech waveform synthesis algorithm removes the noise that shares the direction of the target speech, yielding the final enhanced speech. All deep learning models are trained offline, and using the trained models satisfies the real-time requirement of the digital hearing aid.
Disclosure of Invention
The technical scheme adopted by the invention is as follows. First, exploiting the strong data-driven capability of deep learning, a two-stage deep learning network is constructed to model the statistical relationship between binaural features and the corresponding azimuth angles, obtaining the spatial information of the target speech, which is then combined with spatial filtering to remove noise from directions different from the target speech. Second, a suitable deep learning model is built that learns, from primary characteristic parameters representing the differences between speech and noise, features that can be used directly to classify noise and speech, and classifies them, removing the noise that shares the direction of the target speech and finally achieving enhancement of the spatial noisy speech. The specific steps of the method are as follows:
Firstly, performing time-frequency analysis on the input signals of the digital hearing aid with a Gammatone filter bank, which simulates the working mechanism of the basilar membrane and the auditory nerve in the human auditory system.
And step two, extracting two binaural spatial cues, the binaural time difference and the binaural level difference, from the frequency-domain signals produced by the Gammatone filtering. These two features differ for speech arriving from different directions, so they can serve as effective localization features, and they complement each other: the binaural time difference localizes better at low frequencies, while the binaural level difference localizes better at high frequencies.
And step three, dividing the possible azimuth angles of the noisy speech into four quadrants, taking the line connecting the two ears as the x-axis and its perpendicular bisector as the y-axis: front-right is the first quadrant, front-left the second, rear-left the third, and rear-right the fourth. A first-level deep learning network is then constructed that takes the binaural time difference and binaural level difference of the noisy speech as feature input and outputs the quadrant in which the target speech lies, thereby resolving the front-back confusion of target speech localization.
And step four, constructing a second-level deep learning network for each quadrant and selecting, according to the quadrant determined in step three, the second-level network used for azimuth localization. The normalized cross-correlation function and the binaural level difference are the input of the second-level network, which finally outputs the direction of the target speech.
And step five, using the target speech azimuth obtained in step four and an existing spatial filtering algorithm to select the head-related transfer function of the corresponding direction and unmix the target speech from the mixture, removing noise from directions different from the target sound source and obtaining the preliminary enhanced speech in the same direction as the target speech.
And step six, performing time-frequency analysis on the preliminary enhanced speech obtained in step five, i.e. analyzing it with the same Gammatone filter bank as in step one to obtain its frequency-domain representation.
And step seven, extracting multi-resolution auditory cepstrum coefficients from the spectrum signals obtained in the step six.
And step eight, constructing a deep neural network model that extracts features directly usable by the classifier from the coefficients of step seven: the multi-resolution auditory cepstrum coefficients extracted in step seven are taken as feature input, and the output is the features, learned from these coefficients, that are used directly to classify speech versus noise.
And step nine, constructing a feature classifier that outputs a value representing the ideal binary mask from the feature parameters obtained in step eight. The units containing speech information are thus identified, and a speech waveform synthesis algorithm finally removes the noise sharing the direction of the target speech, giving the final enhanced speech, i.e. the output signal of the digital hearing aid.
Advantageous effects
The invention provides a high-performance speech enhancement algorithm combining binaural sound source localization and deep learning, aimed at the problems of existing digital hearing aid speech enhancement algorithms: poor handling of non-stationary noise and failure to deliver adequate speech intelligibility and comfort for the hearing-impaired in noisy environments. Its advantages are as follows. First, binaural sound source localization is combined with spatial filtering, so noise arriving from directions different from that of the target sound source is removed. Second, the binaural localization uses a two-stage deep learning azimuth localization algorithm, which resolves the front-back confusion of source directions that traditional algorithms cannot. Finally, newly extracted characteristic parameters that accurately represent the difference between speech and noise are combined with the constructed deep learning model for feature extraction and classification, removing the noise that shares the direction of the target speech. The result is enhanced speech that satisfies the speech intelligibility and comfort requirements of the hearing-impaired. All models are trained offline, which guarantees good real-time performance and low power consumption in the hearing aid.
Drawings
FIG. 1 Flow chart of an implementation of the present invention
FIG. 2 Binaural sound source localization flow chart
FIG. 3 Deep neural network model for extracting speech features
FIG. 4 Feature classifier structure
FIG. 5 Deep neural network training flow chart
Detailed Description
Firstly, performing time-frequency analysis on the input signals of the digital hearing aid with a Gammatone filter bank, which simulates the working mechanism of the basilar membrane and the auditory nerve in the human auditory system.
(1) The input signals of the digital hearing aid, x_l(k) and x_r(k), are passed through a Gammatone filter bank whose impulse response is

g_f(t) = t^(m−1) · e^(−2πBt) · cos(2π·f_c·t + φ) · U(t)

dividing each signal into 64 frequency bands and giving the decomposed signals x_{f,l}(k) and x_{f,r}(k). Here m is the order of the filter, φ is its initial phase, U(t) is the unit step function, and B is the bandwidth. f is the band index and runs from 1 to 64, f_c is the center frequency of the filter and ranges from 50 Hz to 8 kHz, l and r denote the left and right ears, and k is the sample index.
(2) Using a Hamming window, the decomposed signals x_{f,l}(k) and x_{f,r}(k) are divided into frames and windowed. Exploiting the short-time stationarity of speech, 20 ms (320 samples of the 16 kHz speech signal) is taken as the frame length and the frame shift is 10 ms (160 samples). The Hamming window function is defined in equation (1), and framing and windowing are defined in equations (2) and (3).

w(n) = 0.54 − 0.46·cos(2πn/(L−1)),  0 ≤ n ≤ L−1    (1)

Where w(n) is the Hamming window function, n is the sample index within a frame, and L is the window length.

x_{ft,l}(n) = x′_{ft,l}(n) × w(n),  0 ≤ n ≤ L−1    (2)
x_{ft,r}(n) = x′_{ft,r}(n) × w(n),  0 ≤ n ≤ L−1    (3)

Where n is the sample index within a frame, x_{ft,l}(n) and x_{ft,r}(n) are the framed, windowed time-frequency units, x′_{ft,l}(n) and x′_{ft,r}(n) are the time-frequency units before framing and windowing, w(n) is the Hamming window function, t is the frame index, and f is the band index.
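The patent itself contains no source code; as an illustration only, the following Python sketch shows how this decomposition and framing stage could be implemented. The direct-form Gammatone impulse responses, the ERB spacing of the 64 center frequencies, the filter order of 4 and the 128 ms impulse-response length are assumptions of the sketch, not values fixed by the patent.

import numpy as np

def gammatone_filterbank(x, fs=16000, n_bands=64, f_lo=50.0, f_hi=8000.0, order=4):
    """Decompose signal x into n_bands Gammatone channels with ERB-spaced centers."""
    erb = lambda f: 24.7 * (4.37 * f / 1000.0 + 1.0)          # ERB bandwidth of a channel
    erb_lo = 21.4 * np.log10(4.37 * f_lo / 1000.0 + 1.0)      # ERB-rate scale endpoints
    erb_hi = 21.4 * np.log10(4.37 * f_hi / 1000.0 + 1.0)
    fc = (10 ** (np.linspace(erb_lo, erb_hi, n_bands) / 21.4) - 1.0) * 1000.0 / 4.37
    t = np.arange(int(0.128 * fs)) / fs                        # 128 ms impulse responses
    out = np.zeros((n_bands, len(x)))
    for b, f in enumerate(fc):
        bw = 1.019 * erb(f)
        g = t ** (order - 1) * np.exp(-2 * np.pi * bw * t) * np.cos(2 * np.pi * f * t)
        # assumes x is longer than one impulse response, so 'same' keeps len(x) samples
        out[b] = np.convolve(x, g, mode="same")
    return out, fc

def frame_and_window(band_sig, fs=16000, frame_ms=20, hop_ms=10):
    """Split one band signal into 20 ms Hamming-windowed frames with a 10 ms hop."""
    flen, hop = int(fs * frame_ms / 1000), int(fs * hop_ms / 1000)
    win = np.hamming(flen)
    n_frames = 1 + (len(band_sig) - flen) // hop               # assumes len(band_sig) >= flen
    return np.stack([band_sig[i * hop:i * hop + flen] * win for i in range(n_frames)])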
And step two, extracting two binaural spatial cues, the binaural time difference and the binaural level difference, from the frequency-domain signals produced by the Gammatone filtering. These two features differ for speech arriving from different directions, so they can serve as effective localization features, and they complement each other: the binaural time difference localizes better at low frequencies, while the binaural level difference localizes better at high frequencies.
Two binaural cue features, the binaural time difference (equation (4)) together with the normalized cross-correlation function (equation (5)) and the binaural level difference (equation (6)), are extracted from the signals decomposed in the frequency domain by the Gammatone filter bank, as information describing the azimuth of the speech signal:

ITD(t,f) = argmax_τ CCF(t,f,τ)    (4)

where

CCF(t,f,τ) = Σ_i x_{ft,l}(i)·x_{ft,r}(i−τ) / √( Σ_i x_{ft,l}²(i) · Σ_i x_{ft,r}²(i) )    (5)

ILD(t,f) = 10·log₁₀( Σ_i x_{ft,l}²(i) / Σ_i x_{ft,r}²(i) )    (6)

Where i is the sample index within the frame and τ is the time delay, ranging from −1 ms to 1 ms, so that a 33-dimensional CCF is obtained at the 16 kHz sampling rate; t is the frame index and f is the band index.
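As a hedged illustration, the sketch below computes these per-unit cues from one pair of framed, windowed time-frequency units; the function name, the small regularization constants and the sign convention of the lag are assumptions, not specified by the patent.

import numpy as np

def binaural_cues(xl, xr, fs=16000, max_itd_ms=1.0):
    """CCF over 33 lags (±1 ms at 16 kHz), ITD as the lag of the peak, and ILD in dB."""
    max_lag = int(round(max_itd_ms * 1e-3 * fs))                 # 16 samples -> 33 lags
    full = np.correlate(xl, xr, mode="full")                     # lags -(N-1)..(N-1)
    center = len(xl) - 1                                         # index of zero lag
    ccf = full[center - max_lag: center + max_lag + 1]
    ccf = ccf / (np.sqrt(np.sum(xl ** 2) * np.sum(xr ** 2)) + 1e-12)
    itd = (np.argmax(ccf) - max_lag) / fs                        # delay of the correlation peak (s)
    ild = 10.0 * np.log10((np.sum(xl ** 2) + 1e-12) / (np.sum(xr ** 2) + 1e-12))
    return ccf, itd, ild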
And step three, dividing the possible azimuth angles of the noisy speech into four quadrants, taking the line connecting the two ears as the x-axis and its perpendicular bisector as the y-axis: front-right is the first quadrant, front-left the second, rear-left the third, and rear-right the fourth. A first-level deep learning network is then constructed that takes the binaural time difference and binaural level difference of the noisy speech as feature input and outputs the quadrant in which the target speech lies, thereby resolving the front-back confusion of target speech localization.
(1) The azimuth angles to which the noisy speech may belong are divided into four quadrants, taking the line connecting the two ears as the x-axis and its perpendicular bisector as the y-axis, labelled front-right as the first quadrant, front-left the second, rear-left the third, and rear-right the fourth.
(2) A first-level deep belief network is built in which the activation functions of the 4 hidden layers are all sigmoid functions and the output layer uses a purelin (linear) activation function. The ITD and ILD features of noisy speech with known azimuth are used as input and the quadrant to which the azimuth belongs as the label, and the network is trained until the mean square error no longer changes, as shown in FIG. 5.
(3) The noisy speech received by the hearing aid is analyzed in time and frequency, the extracted ITD and ILD are fed to the trained first-level deep learning network, and its output is the quadrant to which the target speech azimuth belongs.
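For illustration, a PyTorch sketch of this first-level quadrant network follows. The patent describes a deep belief network trained until the mean square error stops changing; this sketch uses plain back-propagation without RBM pre-training, and the input dimension, hidden width, learning rate and epoch count are assumptions.

import torch
import torch.nn as nn

class QuadrantNet(nn.Module):
    """Four sigmoid hidden layers, linear ("purelin") output trained with MSE."""
    def __init__(self, in_dim=128, hidden=128):
        super().__init__()
        layers, d = [], in_dim
        for _ in range(4):                          # 4 hidden layers with sigmoid activations
            layers += [nn.Linear(d, hidden), nn.Sigmoid()]
            d = hidden
        layers.append(nn.Linear(d, 1))              # linear output: quadrant index as target
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

def train_quadrant_net(model, features, quadrant_labels, epochs=100, lr=1e-3):
    """features: (N, in_dim) float tensor of ITD/ILD vectors; quadrant_labels: (N, 1) in {1..4}."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(features), quadrant_labels)
        loss.backward()
        opt.step()
    return model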
And step four, constructing a second-level deep learning network for each quadrant and selecting, according to the quadrant determined in step three, the second-level network used for azimuth localization. The normalized cross-correlation function and the binaural level difference are the input of the second-level network, which finally outputs the direction of the target speech.
(1) Four deep neural networks with four hidden layers each are built, one per quadrant; every hidden layer has 128 units, the output layer has a single unit, and sigmoid functions are used as the activation between layers. The CCF and ILD extracted from noisy speech with known azimuth are used as input and the true azimuth as the label, and each network is trained until the mean square error no longer decreases, as shown in FIG. 5.
(2) The corresponding second-level deep learning network C is selected according to the output of the first-level deep neural network. The CCF and ILD of the noisy speech are fed to network C, and its output is the estimated target speech azimuth.
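A sketch of the per-quadrant azimuth network and of the two-stage selection logic is given below; the input dimension (for instance a 33-lag CCF concatenated with a 64-band ILD) and the rounding of the first-level output to a quadrant index are illustrative assumptions.

import torch
import torch.nn as nn

class AzimuthNet(nn.Module):
    """One per quadrant: four hidden layers of 128 sigmoid units, scalar azimuth output."""
    def __init__(self, in_dim=97):                 # e.g. 33-lag CCF + 64-band ILD (assumed layout)
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.Sigmoid(),
            nn.Linear(128, 128), nn.Sigmoid(),
            nn.Linear(128, 128), nn.Sigmoid(),
            nn.Linear(128, 128), nn.Sigmoid(),
            nn.Linear(128, 1))

    def forward(self, x):
        return self.net(x)

def localize(quadrant_net, azimuth_nets, itd_ild_vec, ccf_ild_vec):
    """quadrant_net: trained first-level net; azimuth_nets: dict {1..4: trained AzimuthNet}."""
    q = int(torch.clamp(torch.round(quadrant_net(itd_ild_vec)), 1, 4).item())
    return q, azimuth_nets[q](ccf_ild_vec).item()   # estimated target-speech azimuth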
And step five, the azimuth of the target speech obtained in step four is used, with the existing spatial filtering algorithm, to select the head-related transfer function of the corresponding direction and to unmix the target speech from the mixed speech. Noise from directions different from the target sound source is removed by the deconvolution process shown in equation (7), giving the preliminary enhanced speech s_l(k), s_r(k) in the same direction as the target speech.
x_E(k) = Σ_{a=0}^{k−1} h(a)·s_E(k−a),  E ∈ {l, r}    (7)

Where h(k) is the head-related impulse response in the direction of the target speech and x_E(k) is the signal received by the hearing aid; E ∈ {l, r} denotes the left or right ear, k is the sample index, and a is the deconvolution delay, taken from 0 to k−1. Solving equation (7) for s_E(k) by deconvolution yields the preliminary enhanced speech.
The complete sound source localization flowchart is shown in fig. 2.
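As an illustrative stand-in for this spatial filtering step, the sketch below deconvolves each ear signal by the head-related impulse response of the estimated direction using a regularized frequency-domain inverse filter; the HRIR lookup table, the regularization constant and the nearest-azimuth selection are assumptions, and a practical system would use the patent's chosen spatial filtering algorithm instead.

import numpy as np

def deconvolve_hrtf(x_ear, hrir, eps=1e-3):
    """Regularized frequency-domain deconvolution of one ear signal by the
    target-direction head-related impulse response (Wiener-style inverse filter)."""
    n = len(x_ear) + len(hrir) - 1
    X = np.fft.rfft(x_ear, n)
    H = np.fft.rfft(hrir, n)
    S = X * np.conj(H) / (np.abs(H) ** 2 + eps)
    return np.fft.irfft(S, n)[:len(x_ear)]

def spatial_filter(x_l, x_r, azimuth_deg, hrir_db):
    """hrir_db: dict mapping azimuth (deg) -> (hrir_left, hrir_right), assumed to come
    from a measured HRTF set; the nearest stored azimuth is used."""
    nearest = min(hrir_db, key=lambda a: abs(a - azimuth_deg))
    h_l, h_r = hrir_db[nearest]
    return deconvolve_hrtf(x_l, h_l), deconvolve_hrtf(x_r, h_r)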
Step six, time-frequency analysis is performed on s_l(k), s_r(k), i.e. the preliminary enhanced speech is analyzed with the same Gammatone filter bank as in step one, giving its frequency-domain representation s_{f,l}(k), s_{f,r}(k).
And step seven, extracting multi-resolution auditory cepstrum coefficients from the spectrum signals obtained in the step six.
(1) By setting different frame lengths n_t, the signals s_{f,l}(k), s_{f,r}(k) are framed and windowed to obtain the time-frequency unit representations s_{ft,l}(n_t), s_{ft,r}(n_t). Time-frequency units with different frame lengths capture both the high-resolution detail and the low-resolution global structure.
(2) The energy of each time-frequency unit (the cochleagram) is computed:

CG_E(t,f) = Σ_i s_{ft,E}²(i)

Where E ∈ {l, r} is the left/right-ear index and i is the sample index within the frame.
(3) Cube-root compression is applied to the energy of each time-frequency unit: C_G_E(t,f) = (CG_E(t,f))^(1/3). This compresses the multi-resolution cochleagram features, better represents the difference between speech and noise, and is computationally simple.
(4) Finally, a discrete cosine transform (DCT) is applied to C_G_E(t,f) to decorrelate it and obtain the multi-resolution auditory cepstrum coefficients, as shown in equation (8):

F_E(t,d) = √(2/M) · Σ_{f=1}^{M} C_G_E(t,f) · cos( (π·d/(2M)) · (2f − 1) )    (8)

The formula above is the DCT of C_G_E(t,f), where M is the number of channels of the Gammatone filter bank (M = 64 in this algorithm) and d indexes the retained coefficients. Tests show that when d is greater than 36 the values of F_E(t,d) are extremely small, which also indicates that almost all of the information in F_E(t,d) is contained in the first 36 dimensions.
And step eight, constructing a deep neural network model that extracts features directly usable by the classifier from the coefficients of step seven: the multi-resolution auditory cepstrum coefficients extracted in step seven are taken as feature input, and the output is the features, learned from these coefficients, that are used directly to classify speech versus noise.
A network consisting of 3 LC-BLSTM layers and 2 fully connected DNN layers is built, as shown in FIG. 3. LC-BLSTM denotes a latency-controlled bidirectional long short-term memory network, a variant of the LSTM in which the unidirectional network of a typical LSTM is replaced by a bidirectional one, so that information flows both forward and backward along the time axis and context can be exploited more fully.
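For illustration, a PyTorch sketch of such a feature-extraction model follows. True latency control (processing in fixed-size chunks with limited future context) is omitted for brevity, and the input, hidden and output dimensions are assumptions rather than values given by the patent.

import torch
import torch.nn as nn

class BLSTM_DNN(nn.Module):
    """Three bidirectional LSTM layers followed by two fully connected layers."""
    def __init__(self, in_dim=72, lstm_hidden=256, dnn_hidden=512, out_dim=64):
        super().__init__()
        self.blstm = nn.LSTM(in_dim, lstm_hidden, num_layers=3,
                             batch_first=True, bidirectional=True)
        self.dnn = nn.Sequential(
            nn.Linear(2 * lstm_hidden, dnn_hidden), nn.ReLU(),
            nn.Linear(dnn_hidden, out_dim), nn.ReLU())

    def forward(self, x):                      # x: (batch, frames, in_dim) MRACC features
        h, _ = self.blstm(x)
        return self.dnn(h)                     # per-frame learned features for the classifier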
Step nine, a feature classifier is constructed that takes the feature parameters obtained in step eight, which can be used directly for classification, as input and outputs values representing the ideal binary mask. The units containing speech information are thus identified, and a speech waveform synthesis algorithm finally removes the noise sharing the direction of the target speech, giving the final enhanced speech, i.e. the output signal of the digital hearing aid.
The classifier is formed by adding a softmax layer to the output layer of the LC-BLSTM-DNN network, as shown in FIG. 4: the output layer of the LC-BLSTM-DNN network is the input of the feature classifier, and the output is a binary mask that distinguishes noise from speech.
Steps eight and nine together form a complete deep neural network for distinguishing speech from noise. It takes the multi-resolution auditory cepstrum coefficients F_E(t,f) of noisy speech whose clean speech and noise are known as feature input, uses binary masks computed from the known noise and clean speech as labels, and is trained with the error back-propagation algorithm unrolled in time; overfitting is suppressed with dropout. The training flow chart is shown in FIG. 5. The binary mask is computed as in equation (9): for each time-frequency unit, if the local signal-to-noise ratio SNR(t,f) exceeds a threshold LC (set here to 5 dB to better guarantee speech quality), the mask value of that unit is set to 1, and otherwise to 0.

IBM(t,f) = 1 if SNR(t,f) > LC, and 0 otherwise    (9)
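A minimal sketch of this label computation (the array shapes and the epsilon guard are assumptions):

import numpy as np

def ideal_binary_mask(clean_energy, noise_energy, lc_db=5.0):
    """Ideal binary mask labels per time-frequency unit (equation (9)): 1 where the
    local SNR exceeds the threshold LC (5 dB in the patent), 0 otherwise.
    clean_energy, noise_energy: (frames, channels) cochleagram energies."""
    snr_db = 10.0 * np.log10((clean_energy + 1e-12) / (noise_energy + 1e-12))
    return (snr_db > lc_db).astype(np.float32)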
At test time, the multi-resolution auditory cepstrum coefficients F_E(t,f) finally extracted in step seven are used as the feature input, and the network outputs a binary mask that distinguishes noise from speech, i.e. it marks the time-frequency units containing the speech signal; a speech waveform synthesis algorithm then removes the noise sharing the direction of the target speech, giving the final enhanced speech. The enhanced speech is evaluated objectively with the Perceptual Evaluation of Speech Quality (PESQ) score. The comparison algorithm uses other features and does not use sound source localization to remove the noise arriving from directions different from the sound source. The noise consists of 15 noise types from the NOISEX-92 database, including white, babble, pink, f16 and others; the other experimental data are identical. Table 1 lists the enhancement effect for several noise types and signal-to-noise ratios; comparison shows that the speech enhancement effect of the invention under the various noises is on average 0.25 higher than that of the comparison algorithm.
In summary, the complete speech enhancement algorithm is shown in FIG. 1. The binaural input is first fed to the sound source localization module to determine the spatial information of the target source, and spatial filtering then removes the noise from directions different from the target speech. The resulting noisy speech, now containing only noise in the same direction as the target speech, is fed to the deep learning module, which extracts features directly usable by the classifier; the classifier classifies the input according to these features and identifies the units containing speech information. Finally, a speech waveform synthesis algorithm removes the noise sharing the direction of the target speech and the enhanced speech is obtained.
Table 1. Final enhancement effect

Claims (1)

1. A speech enhancement method based on binaural sound source localization and deep learning in a binaural hearing aid is characterized by comprising the following specific steps:
firstly, performing time-frequency analysis on the input signals of the digital hearing aid with a Gammatone filter bank, which simulates the working mechanism of the basilar membrane and the auditory nerve in the human auditory system;
extracting two binaural spatial cues, the binaural time difference and the binaural level difference, from the frequency-domain signals produced by the Gammatone filtering, and judging the coordinate quadrant and azimuth angle of the target speech from these two features;
dividing the possible azimuth angles of the noisy speech into four quadrants, taking the line connecting the two ears as the x-axis and its perpendicular bisector as the y-axis, namely front-right as the first quadrant, front-left as the second, rear-left as the third and rear-right as the fourth; then constructing a first-level deep learning network that takes the binaural time difference and binaural level difference of the noisy speech as feature input and outputs the quadrant in which the target speech lies, thereby resolving the front-back confusion of target speech localization;
step four, constructing a second-level deep learning network for each quadrant, and selecting the second-level deep learning network for positioning by combining the quadrant to which the azimuth angle of the target voice judged in the step three belongs; taking the normalized cross-correlation function and the binaural sound level difference as the input of a second-level deep learning network, and finally outputting the direction of the target voice;
fifthly, removing noise in a direction different from the direction of the target sound source according to the spatial information of the target voice obtained in the fourth step by utilizing spatial filtering to obtain a primary enhanced voice in the same direction as the target voice;
step six, performing time-frequency analysis on the preliminary enhanced speech obtained in step five, namely analyzing it with the same Gammatone filter bank as in step one to obtain its frequency-domain representation;
Step seven, extracting multi-resolution auditory cepstrum coefficients from the spectrum signals obtained in the step six;
step eight, constructing a deep neural network model for extracting the features directly used for classifier classification from the step seven, taking the multi-resolution auditory cepstrum coefficient extracted from the step seven as feature input, and outputting the feature which is learned from the multi-resolution auditory cepstrum coefficient and directly used for classification to distinguish voice and noise;
step nine, constructing a feature classifier, and outputting a value representing ideal binary masking according to the feature parameters obtained in the step eight; the method comprises the steps of obtaining a unit containing voice information, and finally removing noise in the same direction as a target voice by using a voice waveform synthesis algorithm to obtain a final enhanced voice, namely an output signal of a digital hearing aid;
firstly, performing time-frequency analysis on the input signals of the digital hearing aid with a Gammatone filter bank, which simulates the working mechanism of the basilar membrane and the auditory nerve in the human auditory system;
(1) the input signals of the digital hearing aid, x_l(k) and x_r(k), are passed through a Gammatone filter bank whose impulse response is

g_f(t) = t^(m−1) · e^(−2πBt) · cos(2π·f_c·t + φ) · U(t)

dividing each signal into 64 frequency bands and giving the decomposed signals x_{f,l}(k) and x_{f,r}(k); here m is the order of the filter, φ is its initial phase, U(t) is the unit step function, and B is the bandwidth; f is the band index and runs from 1 to 64, f_c is the center frequency of the filter and ranges from 50 Hz to 8 kHz, l and r denote the left and right ears, and k is the sample index;
(2) using a Hamming window, the decomposed signals x_{f,l}(k) and x_{f,r}(k) are divided into frames and windowed; exploiting the short-time stationarity of speech, 20 ms (320 samples of the 16 kHz speech signal) is taken as the frame length and the frame shift is 10 ms (160 samples); the Hamming window function is defined in equation (1), and framing and windowing are defined in equations (2) and (3);

w(n) = 0.54 − 0.46·cos(2πn/(L−1)),  0 ≤ n ≤ L−1    (1)

wherein w(n) is the Hamming window function, n is the sample index within a frame, and L is the window length;

x_{ft,l}(n) = x′_{ft,l}(n) × w(n),  0 ≤ n ≤ L−1    (2)
x_{ft,r}(n) = x′_{ft,r}(n) × w(n),  0 ≤ n ≤ L−1    (3)

where n is the sample index within a frame, x_{ft,l}(n) and x_{ft,r}(n) are the framed, windowed time-frequency units, x′_{ft,l}(n) and x′_{ft,r}(n) are the time-frequency units before framing and windowing, w(n) is the Hamming window function, t is the frame index, and f is the index of the band;
step two, extracting two binaural spatial cues, the binaural time difference and the binaural level difference, from the frequency-domain signals produced by the Gammatone filtering;
two binaural cue features, the binaural time difference (equation (4)) together with the normalized cross-correlation function (equation (5)) and the binaural level difference (equation (6)), are extracted from the signals decomposed in the frequency domain by the Gammatone filter bank, as information describing the azimuth of the speech signal:

ITD(t,f) = argmax_τ CCF(t,f,τ)    (4)

where

CCF(t,f,τ) = Σ_i x_{ft,l}(i)·x_{ft,r}(i−τ) / √( Σ_i x_{ft,l}²(i) · Σ_i x_{ft,r}²(i) )    (5)

ILD(t,f) = 10·log₁₀( Σ_i x_{ft,l}²(i) / Σ_i x_{ft,r}²(i) )    (6)

wherein i is the sample index within the frame and τ is the time delay, ranging from −1 ms to 1 ms, so that a 33-dimensional CCF is obtained at the 16 kHz sampling rate; t is the frame index and f is the index of the band;
dividing the possible azimuth angles of the noisy speech into four quadrants, taking the line connecting the two ears as the x-axis and its perpendicular bisector as the y-axis, namely front-right as the first quadrant, front-left as the second, rear-left as the third and rear-right as the fourth; then constructing a first-level deep learning network that takes the binaural time difference and binaural level difference of the noisy speech as feature input and outputs the quadrant in which the noisy speech lies, thereby resolving the front-back confusion of target speech localization;
(1) the azimuth angles to which the noisy speech may belong are divided into four quadrants, taking the line connecting the two ears as the x-axis and its perpendicular bisector as the y-axis, labelled front-right as the first quadrant, front-left the second, rear-left the third and rear-right the fourth;
(2) a first-level deep belief network is built in which the activation functions of the 4 hidden layers are all sigmoid functions and the output layer uses a purelin (linear) activation function; the ITD and ILD features of noisy speech with known azimuth are used as input and the quadrant to which the azimuth belongs as the label, and the network is trained until the mean square error no longer changes;
(3) the ITD and ILD extracted after time-frequency analysis of the noisy speech received by the hearing aid are used as input to the trained first-level deep learning network, whose output is the quadrant to which the target speech azimuth belongs;
step four, constructing a second-level deep learning network for each quadrant, and selecting the second-level deep learning network for azimuth angle positioning by combining the quadrant to which the azimuth angle of the target voice judged in the step three belongs; taking the normalized cross-correlation function and the binaural sound level difference as the input of a second-level deep learning network, and finally outputting the direction of the target voice;
(1) four deep neural networks with four hidden layers each are built, one per quadrant; every hidden layer has 128 units, the output layer has a single unit, and sigmoid functions are used as the activation between layers; the CCF and ILD extracted from noisy speech with known azimuth are used as input and the true azimuth as the label, and each network is trained until the mean square error no longer decreases;
(2) the corresponding second-level deep learning network C is selected according to the output of the first-level deep neural network; the CCF and ILD of the noisy speech are used as the input of network C, and its output is the estimated target speech azimuth;
step five, the azimuth of the target speech obtained in step four is used, with the existing spatial filtering algorithm, to select the head-related transfer function of the corresponding direction and to unmix the target speech from the mixed speech; noise from directions different from the target sound source is removed by the deconvolution process shown in equation (7), giving the preliminary enhanced speech s_l(k), s_r(k) in the same direction as the target speech;

x_E(k) = Σ_{a=0}^{k−1} h(a)·s_E(k−a),  E ∈ {l, r}    (7)

where h(k) is the head-related impulse response in the direction of the target speech and x_E(k) is the signal received by the hearing aid; E ∈ {l, r} denotes the left or right ear, k is the sample index, and a is the deconvolution delay, taken from 0 to k−1; solving equation (7) for s_E(k) by deconvolution yields the preliminary enhanced speech;
step six, time-frequency analysis is performed on s_l(k), s_r(k), i.e. the preliminary enhanced speech is analyzed with the same Gammatone filter bank as in step one, giving its frequency-domain representation s_{f,l}(k), s_{f,r}(k);
Step seven, extracting multi-resolution auditory cepstrum coefficients from the spectrum signals obtained in the step six;
(1) by setting different frame lengths n_t, the signals s_{f,l}(k), s_{f,r}(k) are framed and windowed to obtain the time-frequency unit representations s_{ft,l}(n_t), s_{ft,r}(n_t); time-frequency units with different frame lengths capture both the high-resolution detail and the low-resolution global structure;
(2) the energy of each time-frequency unit is computed:

CG_E(t,f) = Σ_i s_{ft,E}²(i)

wherein E ∈ {l, r} is the left/right-ear index and i is the sample index within the frame;
(3) cube-root compression is applied to the energy of each time-frequency unit: C_G_E(t,f) = (CG_E(t,f))^(1/3);
(4) finally, a discrete cosine transform (DCT) is applied to C_G_E(t,f) to decorrelate it and obtain the multi-resolution auditory cepstrum coefficients, as shown in equation (8):

F_E(t,d) = √(2/M) · Σ_{f=1}^{M} C_G_E(t,f) · cos( (π·d/(2M)) · (2f − 1) )    (8)

the formula above is the DCT of C_G_E(t,f), where M is the number of channels of the Gammatone filter bank and M = 64, and d indexes the retained coefficients; tests show that when d is greater than 36 the values of F_E(t,d) are extremely small, which also indicates that almost all of the information in F_E(t,d) is contained in the first 36 dimensions;
step eight, constructing a deep neural network model for extracting the features directly used for classifier classification from the step seven, taking the multi-resolution auditory cepstrum coefficient extracted from the step seven as feature input, and outputting the feature which is learned from the multi-resolution auditory cepstrum coefficient and directly used for classification to distinguish voice and noise;
building a network consisting of 3 LC-BLSTM layers and 2 fully connected DNN layers, wherein LC-BLSTM is a latency-controlled bidirectional long short-term memory network;
step nine, constructing a feature classifier, taking the feature parameters which are obtained in the step eight and can be directly used for classifier classification as input, and outputting a value representing ideal binary masking; namely, a unit containing voice information is obtained, and finally, the voice waveform synthesis algorithm is utilized to remove noise in the same direction as the target voice, so that the final enhanced voice, namely the output signal of the digital hearing aid, is obtained.
CN201811292475.7A 2018-11-01 2018-11-01 Speech enhancement method based on binaural sound source localization and deep learning in binaural hearing aid Active CN109410976B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811292475.7A CN109410976B (en) 2018-11-01 2018-11-01 Speech enhancement method based on binaural sound source localization and deep learning in binaural hearing aid

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811292475.7A CN109410976B (en) 2018-11-01 2018-11-01 Speech enhancement method based on binaural sound source localization and deep learning in binaural hearing aid

Publications (2)

Publication Number Publication Date
CN109410976A CN109410976A (en) 2019-03-01
CN109410976B true CN109410976B (en) 2022-12-16

Family

ID=65471037

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811292475.7A Active CN109410976B (en) 2018-11-01 2018-11-01 Speech enhancement method based on binaural sound source localization and deep learning in binaural hearing aid

Country Status (1)

Country Link
CN (1) CN109410976B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109859767B (en) * 2019-03-06 2020-10-13 哈尔滨工业大学(深圳) Environment self-adaptive neural network noise reduction method, system and storage medium for digital hearing aid
CN110010150A (en) * 2019-04-15 2019-07-12 吉林大学 Auditory Perception speech characteristic parameter extracting method based on multiresolution
CN110136737A (en) * 2019-06-18 2019-08-16 北京拙河科技有限公司 A kind of voice de-noising method and device
CN110415702A (en) * 2019-07-04 2019-11-05 北京搜狗科技发展有限公司 Training method and device, conversion method and device
CN110517705B (en) * 2019-08-29 2022-02-18 北京大学深圳研究生院 Binaural sound source positioning method and system based on deep neural network and convolutional neural network
CN110728970B (en) * 2019-09-29 2022-02-25 东莞市中光通信科技有限公司 Method and device for digital auxiliary sound insulation treatment
CN111354353B (en) * 2020-03-09 2023-09-19 联想(北京)有限公司 Voice data processing method and device
CN111429930B (en) * 2020-03-16 2023-02-28 云知声智能科技股份有限公司 Noise reduction model processing method and system based on adaptive sampling rate
KR20210147155A (en) 2020-05-27 2021-12-07 현대모비스 주식회사 Apparatus of daignosing noise quality of motor
CN111916101B (en) * 2020-08-06 2022-01-21 大象声科(深圳)科技有限公司 Deep learning noise reduction method and system fusing bone vibration sensor and double-microphone signals
CN112735456B (en) * 2020-11-23 2024-01-16 西安邮电大学 Speech enhancement method based on DNN-CLSTM network
CN116711007A (en) * 2021-04-01 2023-09-05 深圳市韶音科技有限公司 Voice enhancement method and system
CN113794963B (en) * 2021-09-14 2022-08-05 深圳大学 Speech enhancement system based on low-cost wearable sensor

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103778920A (en) * 2014-02-12 2014-05-07 北京工业大学 Speech enhancing and frequency response compensation fusion method in digital hearing-aid
CN105741849A (en) * 2016-03-06 2016-07-06 北京工业大学 Voice enhancement method for fusing phase estimation and human ear hearing characteristics in digital hearing aid
CN106782565A (en) * 2016-11-29 2017-05-31 重庆重智机器人研究院有限公司 A kind of vocal print feature recognition methods and system
EP3203472A1 (en) * 2016-02-08 2017-08-09 Oticon A/s A monaural speech intelligibility predictor unit
CN107479030A (en) * 2017-07-14 2017-12-15 重庆邮电大学 Based on frequency dividing and improved broad sense cross-correlation ears delay time estimation method
CN107845389A (en) * 2017-12-21 2018-03-27 北京工业大学 A kind of sound enhancement method based on multiresolution sense of hearing cepstrum coefficient and depth convolutional neural networks
CN108091345A (en) * 2017-12-27 2018-05-29 东南大学 A kind of ears speech separating method based on support vector machines
CN108122559A (en) * 2017-12-21 2018-06-05 北京工业大学 Binaural sound sources localization method based on deep learning in a kind of digital deaf-aid

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170061978A1 (en) * 2014-11-07 2017-03-02 Shannon Campbell Real-time method for implementing deep neural network based speech separation

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103778920A (en) * 2014-02-12 2014-05-07 北京工业大学 Speech enhancing and frequency response compensation fusion method in digital hearing-aid
EP3203472A1 (en) * 2016-02-08 2017-08-09 Oticon A/s A monaural speech intelligibility predictor unit
CN105741849A (en) * 2016-03-06 2016-07-06 北京工业大学 Voice enhancement method for fusing phase estimation and human ear hearing characteristics in digital hearing aid
CN106782565A (en) * 2016-11-29 2017-05-31 重庆重智机器人研究院有限公司 A kind of vocal print feature recognition methods and system
CN107479030A (en) * 2017-07-14 2017-12-15 重庆邮电大学 Based on frequency dividing and improved broad sense cross-correlation ears delay time estimation method
CN107845389A (en) * 2017-12-21 2018-03-27 北京工业大学 A kind of sound enhancement method based on multiresolution sense of hearing cepstrum coefficient and depth convolutional neural networks
CN108122559A (en) * 2017-12-21 2018-06-05 北京工业大学 Binaural sound sources localization method based on deep learning in a kind of digital deaf-aid
CN108091345A (en) * 2017-12-27 2018-05-29 东南大学 A kind of ears speech separating method based on support vector machines

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Binaural sound source localization algorithm based on Gammatone filter decomposition, HRTF and GMM; Li Ruwei et al.; Journal of Beijing University of Technology; 2018-10-11; Vol. 44, No. 11; pp. 1385-1390 *
Multi-channel frequency response compensation algorithm based on two-step noise reduction in digital hearing aids; Li Ruwei et al.; Beijing Biomedical Engineering; 2018-08-16; Vol. 37, No. 4; pp. 337-344, 426 *

Also Published As

Publication number Publication date
CN109410976A (en) 2019-03-01

Similar Documents

Publication Publication Date Title
CN109410976B (en) Speech enhancement method based on binaural sound source localization and deep learning in binaural hearing aid
CN107845389B (en) Speech enhancement method based on multi-resolution auditory cepstrum coefficient and deep convolutional neural network
CN110728989B (en) Binaural speech separation method based on long-time and short-time memory network L STM
CN110517705B (en) Binaural sound source positioning method and system based on deep neural network and convolutional neural network
CN107767859B (en) Method for detecting speaker intelligibility of cochlear implant signal in noise environment
CN111916101B (en) Deep learning noise reduction method and system fusing bone vibration sensor and double-microphone signals
CN107479030A (en) Based on frequency dividing and improved broad sense cross-correlation ears delay time estimation method
CN111292762A (en) Single-channel voice separation method based on deep learning
CN109164415B (en) Binaural sound source positioning method based on convolutional neural network
CN108122559B (en) Binaural sound source positioning method based on deep learning in digital hearing aid
CN105741849A (en) Voice enhancement method for fusing phase estimation and human ear hearing characteristics in digital hearing aid
WO2007028250A2 (en) Method and device for binaural signal enhancement
CN113470671B (en) Audio-visual voice enhancement method and system fully utilizing vision and voice connection
Hummersone A psychoacoustic engineering approach to machine sound source separation in reverberant environments
Aroudi et al. Cognitive-driven binaural LCMV beamformer using EEG-based auditory attention decoding
Roman et al. Pitch-based monaural segregation of reverberant speech
CN109147808A (en) A kind of Speech enhancement hearing-aid method
CN112885375A (en) Global signal-to-noise ratio estimation method based on auditory filter bank and convolutional neural network
Gabbay et al. Seeing through noise: Speaker separation and enhancement using visually-derived speech
Patil et al. Marathi speech intelligibility enhancement using i-ams based neuro-fuzzy classifier approach for hearing aid users
May Robust speech dereverberation with a neural network-based post-filter that exploits multi-conditional training of binaural cues
Kothapally et al. Speech Detection and Enhancement Using Single Microphone for Distant Speech Applications in Reverberant Environments.
Xu et al. Learning to separate voices by spatial regions
Magadum et al. An Innovative Method for Improving Speech Intelligibility in Automatic Sound Classification Based on Relative-CNN-RNN
Gajecki et al. A Fused Deep Denoising Sound Coding Strategy for Bilateral Cochlear Implants

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant