CN109410976B - Speech enhancement method based on binaural sound source localization and deep learning in binaural hearing aid - Google Patents

Speech enhancement method based on binaural sound source localization and deep learning in binaural hearing aid

Info

Publication number
CN109410976B
CN109410976B (application CN201811292475.7A)
Authority
CN
China
Prior art keywords
speech
voice
quadrant
binaural
deep learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811292475.7A
Other languages
Chinese (zh)
Other versions
CN109410976A (en)
Inventor
李如玮
李涛
孙晓月
杨登才
潘冬梅
张永亚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201811292475.7A priority Critical patent/CN109410976B/en
Publication of CN109410976A publication Critical patent/CN109410976A/en
Application granted granted Critical
Publication of CN109410976B publication Critical patent/CN109410976B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • G10L21/0272 Voice signal separating
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques in which the extracted parameters are the cepstrum
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques using neural networks

Abstract

A speech enhancement method based on binaural sound source localization and deep learning in a binaural digital hearing aid, belonging to the field of speech signal processing. First, a two-stage deep neural network accurately localizes the target speech, and the result is combined with spatial filtering to remove noise arriving from directions different from that of the target speech. Then a latency-controlled bidirectional long short-term memory deep neural network combined with a classifier is built as the deep learning model; the extracted multi-resolution auditory cepstrum coefficients are used as feature input, and the nonlinear processing capability of deep learning classifies each time-frequency unit of the noisy speech as either a speech unit or a noise unit. Finally, a speech waveform synthesis algorithm removes the noise that shares the direction of the target speech. The algorithm thus removes both noise from directions different from the target speech and noise from the same direction, and ultimately yields enhanced speech that satisfies the speech intelligibility and comfort requirements of the hearing-impaired. All deep learning models are trained offline, so the real-time requirement is met.

Description

Speech enhancement method based on binaural sound source localization and deep learning in binaural hearing aid
Technical Field
The invention belongs to the technical field of speech signal processing, and relates to two key speech signal processing technologies in digital hearing aids: target speech localization and speech enhancement.
Background
Hearing impairment is a chronic condition that seriously affects quality of life. The incidence of hearing loss among the elderly over 65 years of age is about 30% to 40% in the United States, 20% in Canada, 35% in Europe, and 35% in China, and it rises sharply with increasing age. At present, the worldwide population of people over 60 years old has reached 600 million, of which China accounts for nearly 30%, and 22.28% of elderly people in China have hearing thresholds within the normal range. The development of hearing aids has brought hope to these hearing-impaired patients. A hearing aid is a device that amplifies sound to compensate for lost hearing, and it is an important means of overcoming the communication difficulties of hearing-impaired patients. In recent years, with the adoption of digital technology, hearing aid technology has developed rapidly. Compared with traditional hearing aids, digital hearing aids are far more controllable: they can divide the signal into frequency bands flexibly and adjust the sound intensity in multiple bands. Digital hearing aids are also more flexible; advanced algorithms free them from dependence on fixed analog circuits, so the algorithms can be adjusted and updated to meet patients' needs. The miniaturization, interactivity and intelligence of digital hearing aids have also greatly increased patients' acceptance of them.
However, even today, with the market full of hearing aids of all kinds, the wearing rate among hearing-impaired people in China is only about 1%, because hearing aid performance drops sharply in noisy environments: the hearing-impaired cannot hear clearly and at the same time feel uncomfortable. In the cocktail-party scenario in particular, a hearing-impaired listener cannot pick out the voice he or she wants to hear at all, which seriously affects quality of life. Surveys have found that even in the United States, where digital hearing aid technology is highly developed, 70% of hearing aid users are not satisfied with their performance in noisy environments, and 95% of users wish for improved speech intelligibility and comfort in noise.
Speech enhancement is a key technology for improving the performance of digital hearing aids in noisy environments. The speech enhancement module in a digital hearing aid processes the acquired raw digital signal to suppress background noise and improve speech quality, so that the output is easier for the patient to accept; it reduces listener fatigue (a subjective measure), improves the intelligibility of speech for the patient (an objective measure), and improves the robustness of subsequent processing stages to input noise.
Currently, the speech enhancement algorithms commonly used in digital hearing aids include spectral subtraction, multi-channel adaptive noise reduction, synchronous-detection noise reduction, harmonic enhancement, Wiener filtering, MMSE estimation of the short-time spectral amplitude, auditory-masking-based methods, binaural noise suppression, and so on. These methods are applied in various monaural hearing aids and improve speech intelligibility and comfort for the hearing-impaired to a certain extent. For the hearing-impaired, however, existing speech enhancement methods cannot achieve satisfactory results, because each has its own defects while real noise is highly varied: some methods can only partially remove noise arriving from directions different from the target speech, some introduce new noise while removing noise, and some cause unrecoverable damage to the speech itself. Because of these shortcomings, less than 1 percent of hearing-impaired people in China wear hearing aids, which seriously degrades their quality of life. Eliminating the influence of noise on digital hearing aid performance, improving the speech intelligibility and comfort experienced by hearing-impaired wearers, and relieving their suffering is therefore the development trend of digital hearing aids.
In recent years deep learning has developed rapidly in the field of speech signal processing, and many deep-learning-based speech enhancement algorithms have been proposed. Their basic idea is to train a neural network with several layers on noisy speech signals and the corresponding clean speech so that it becomes robust to noise. Because of the strong nonlinear fitting capability of deep learning, such methods are well suited to enhancing speech contaminated by non-stationary noise. Common deep-learning-based speech enhancement approaches are: (1) estimating the clean speech signal within the noisy speech signal with a deep neural network; (2) estimating a time-frequency mask between clean speech and noise with a deep neural network; (3) estimating the clean speech signal and the noise signal separately with a deep neural network; (4) classifying the noise with a deep neural network and then training a dedicated speech enhancement model for each class. In the algorithms above, however, either all of the collected data are fed directly into the deep network, whose complexity is then too high to meet the real-time requirement of a digital hearing aid, or characteristic parameters are extracted from the collected data as the network input, but the parameters extracted by current methods do not describe the differences between speech and noise well enough to improve speech intelligibility and comfort for the hearing-impaired. The autonomous learning ability of deep neural networks nevertheless remains unmatched by other methods.
Therefore, a speech enhancement algorithm that exploits the learning capability of deep networks is needed to provide satisfactory speech intelligibility and comfort for the hearing-impaired in noisy environments.
The invention provides a speech enhancement algorithm based on binaural sound source localization and deep learning. First, a two-stage deep neural network accurately localizes the target speech, and spatial filtering then removes noise arriving from directions different from that of the target speech. Next, a deep learning model that combines an LC-BLSTM-DNN with a classifier is built; the extracted multi-resolution auditory cepstrum coefficients are used as feature input, and the nonlinear processing capability of deep learning classifies each time-frequency unit of the noisy speech as either a speech unit or a noise unit. Finally, a speech waveform synthesis algorithm removes the noise that shares the direction of the target speech, yielding the final enhanced speech. All deep learning models are trained offline, and using the trained models satisfies the real-time requirement of the digital hearing aid.
Disclosure of Invention
The technical scheme adopted by the invention is as follows. First, exploiting the strong data-driven capability of deep learning, a two-stage deep learning network is constructed to model the statistical relationship between binaural features and the corresponding azimuth angles, obtaining the spatial information of the target speech, which is then combined with spatial filtering to remove noise from directions different from the target speech. Second, a suitable deep learning model is built that learns, from primary characteristic parameters representing the differences between speech and noise, features that can be used directly to classify noise and speech, and classifies them, removing the noise that shares the direction of the target speech and finally achieving enhancement of the spatial noisy speech. The specific steps of the method are as follows:
Firstly, performing time-frequency analysis on the input signals of the digital hearing aid with a Gammatone filter bank, which simulates the working mechanism of the basilar membrane and the auditory nerve in the human auditory system.
And step two, extracting two binaural spatial cues, the binaural time difference and the binaural level difference, from the frequency-domain signals produced by the Gammatone filtering. These two features differ for speech arriving from different directions, so they can serve as effective localization features, and they complement each other: the binaural time difference localizes better at low frequencies, while the binaural level difference localizes better at high frequencies.
And step three, dividing the possible azimuth angles of the noisy speech into four quadrants, taking the line connecting the two ears as the x-axis and its perpendicular bisector as the y-axis: front-right is the first quadrant, front-left the second, rear-left the third, and rear-right the fourth. A first-level deep learning network is then constructed that takes the binaural time difference and binaural level difference of the noisy speech as feature input and outputs the quadrant in which the target speech lies, thereby resolving the front-back confusion of target speech localization.
And step four, constructing a second-level deep learning network for each quadrant and selecting, according to the quadrant determined in step three, the second-level network used for azimuth localization. The normalized cross-correlation function and the binaural level difference are the input of the second-level network, which finally outputs the direction of the target speech.
And step five, using the target speech azimuth obtained in step four and an existing spatial filtering algorithm to select the head-related transfer function of the corresponding direction and unmix the target speech from the mixture, removing noise from directions different from the target sound source and obtaining the preliminary enhanced speech in the same direction as the target speech.
And step six, performing time-frequency analysis on the preliminary enhanced speech obtained in step five, i.e. analyzing it with the same Gammatone filter bank as in step one to obtain its frequency-domain representation.
And step seven, extracting multi-resolution auditory cepstrum coefficients from the spectrum signals obtained in the step six.
And step eight, constructing a deep neural network model that extracts features directly usable by the classifier from the coefficients of step seven: the multi-resolution auditory cepstrum coefficients extracted in step seven are taken as feature input, and the output is the features, learned from these coefficients, that are used directly to classify speech versus noise.
And step nine, constructing a feature classifier that outputs a value representing the ideal binary mask from the feature parameters obtained in step eight. The units containing speech information are thus identified, and a speech waveform synthesis algorithm finally removes the noise sharing the direction of the target speech, giving the final enhanced speech, i.e. the output signal of the digital hearing aid.
Advantageous effects
The invention provides a high-performance speech enhancement algorithm combining binaural sound source localization and deep learning, aimed at the problems of existing digital hearing aid speech enhancement algorithms: poor handling of non-stationary noise and failure to deliver adequate speech intelligibility and comfort for the hearing-impaired in noisy environments. Its advantages are as follows. First, binaural sound source localization is combined with spatial filtering, so noise arriving from directions different from that of the target sound source is removed. Second, the binaural localization uses a two-stage deep learning azimuth localization algorithm, which resolves the front-back confusion of source directions that traditional algorithms cannot. Finally, newly extracted characteristic parameters that accurately represent the difference between speech and noise are combined with the constructed deep learning model for feature extraction and classification, removing the noise that shares the direction of the target speech. The result is enhanced speech that satisfies the speech intelligibility and comfort requirements of the hearing-impaired. All models are trained offline, which guarantees good real-time performance and low power consumption in the hearing aid.
Drawings
FIG. 1 Flow chart of an implementation of the present invention
FIG. 2 Binaural sound source localization flow chart
FIG. 3 Deep neural network model for extracting speech features
FIG. 4 Feature classifier structure
FIG. 5 Deep neural network training flow chart
Detailed Description
Firstly, performing time-frequency analysis on the input signals of the digital hearing aid with a Gammatone filter bank, which simulates the working mechanism of the basilar membrane and the auditory nerve in the human auditory system.
(1) The input signals of the digital hearing aid, x_l(k) and x_r(k), are passed through a Gammatone filter bank whose impulse response is

g_f(t) = t^(m−1) · e^(−2πBt) · cos(2π·f_c·t + φ) · U(t)

dividing each signal into 64 frequency bands and giving the decomposed signals x_{f,l}(k) and x_{f,r}(k). Here m is the order of the filter, φ is its initial phase, U(t) is the unit step function, and B is the bandwidth. f is the band index and runs from 1 to 64, f_c is the center frequency of the filter and ranges from 50 Hz to 8 kHz, l and r denote the left and right ears, and k is the sample index.
(2) Using a Hamming window, the decomposed signals x_{f,l}(k) and x_{f,r}(k) are divided into frames and windowed. Exploiting the short-time stationarity of speech, 20 ms (320 samples of the 16 kHz speech signal) is taken as the frame length and the frame shift is 10 ms (160 samples). The Hamming window function is defined in equation (1), and framing and windowing are defined in equations (2) and (3).

w(n) = 0.54 − 0.46·cos(2πn/(L−1)),  0 ≤ n ≤ L−1    (1)

Where w(n) is the Hamming window function, n is the sample index within a frame, and L is the window length.

x_{ft,l}(n) = x′_{ft,l}(n) × w(n),  0 ≤ n ≤ L−1    (2)
x_{ft,r}(n) = x′_{ft,r}(n) × w(n),  0 ≤ n ≤ L−1    (3)

Where n is the sample index within a frame, x_{ft,l}(n) and x_{ft,r}(n) are the framed, windowed time-frequency units, x′_{ft,l}(n) and x′_{ft,r}(n) are the time-frequency units before framing and windowing, w(n) is the Hamming window function, t is the frame index, and f is the band index.
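The patent itself contains no source code; as an illustration only, the following Python sketch shows how this decomposition and framing stage could be implemented. The direct-form Gammatone impulse responses, the ERB spacing of the 64 center frequencies, the filter order of 4 and the 128 ms impulse-response length are assumptions of the sketch, not values fixed by the patent.

import numpy as np

def gammatone_filterbank(x, fs=16000, n_bands=64, f_lo=50.0, f_hi=8000.0, order=4):
    """Decompose signal x into n_bands Gammatone channels with ERB-spaced centers."""
    erb = lambda f: 24.7 * (4.37 * f / 1000.0 + 1.0)          # ERB bandwidth of a channel
    erb_lo = 21.4 * np.log10(4.37 * f_lo / 1000.0 + 1.0)      # ERB-rate scale endpoints
    erb_hi = 21.4 * np.log10(4.37 * f_hi / 1000.0 + 1.0)
    fc = (10 ** (np.linspace(erb_lo, erb_hi, n_bands) / 21.4) - 1.0) * 1000.0 / 4.37
    t = np.arange(int(0.128 * fs)) / fs                        # 128 ms impulse responses
    out = np.zeros((n_bands, len(x)))
    for b, f in enumerate(fc):
        bw = 1.019 * erb(f)
        g = t ** (order - 1) * np.exp(-2 * np.pi * bw * t) * np.cos(2 * np.pi * f * t)
        # assumes x is longer than one impulse response, so 'same' keeps len(x) samples
        out[b] = np.convolve(x, g, mode="same")
    return out, fc

def frame_and_window(band_sig, fs=16000, frame_ms=20, hop_ms=10):
    """Split one band signal into 20 ms Hamming-windowed frames with a 10 ms hop."""
    flen, hop = int(fs * frame_ms / 1000), int(fs * hop_ms / 1000)
    win = np.hamming(flen)
    n_frames = 1 + (len(band_sig) - flen) // hop               # assumes len(band_sig) >= flen
    return np.stack([band_sig[i * hop:i * hop + flen] * win for i in range(n_frames)])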
And step two, extracting two binaural spatial cues, the binaural time difference and the binaural level difference, from the frequency-domain signals produced by the Gammatone filtering. These two features differ for speech arriving from different directions, so they can serve as effective localization features, and they complement each other: the binaural time difference localizes better at low frequencies, while the binaural level difference localizes better at high frequencies.
Two binaural cue features, the binaural time difference (equation (4)) together with the normalized cross-correlation function (equation (5)) and the binaural level difference (equation (6)), are extracted from the signals decomposed in the frequency domain by the Gammatone filter bank, as information describing the azimuth of the speech signal:

ITD(t,f) = argmax_τ CCF(t,f,τ)    (4)

where

CCF(t,f,τ) = Σ_i x_{ft,l}(i)·x_{ft,r}(i−τ) / √( Σ_i x_{ft,l}²(i) · Σ_i x_{ft,r}²(i) )    (5)

ILD(t,f) = 10·log₁₀( Σ_i x_{ft,l}²(i) / Σ_i x_{ft,r}²(i) )    (6)

Where i is the sample index within the frame and τ is the time delay, ranging from −1 ms to 1 ms, so that a 33-dimensional CCF is obtained at the 16 kHz sampling rate; t is the frame index and f is the band index.
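As a hedged illustration, the sketch below computes these per-unit cues from one pair of framed, windowed time-frequency units; the function name, the small regularization constants and the sign convention of the lag are assumptions, not specified by the patent.

import numpy as np

def binaural_cues(xl, xr, fs=16000, max_itd_ms=1.0):
    """CCF over 33 lags (±1 ms at 16 kHz), ITD as the lag of the peak, and ILD in dB."""
    max_lag = int(round(max_itd_ms * 1e-3 * fs))                 # 16 samples -> 33 lags
    full = np.correlate(xl, xr, mode="full")                     # lags -(N-1)..(N-1)
    center = len(xl) - 1                                         # index of zero lag
    ccf = full[center - max_lag: center + max_lag + 1]
    ccf = ccf / (np.sqrt(np.sum(xl ** 2) * np.sum(xr ** 2)) + 1e-12)
    itd = (np.argmax(ccf) - max_lag) / fs                        # delay of the correlation peak (s)
    ild = 10.0 * np.log10((np.sum(xl ** 2) + 1e-12) / (np.sum(xr ** 2) + 1e-12))
    return ccf, itd, ild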
And step three, dividing the possible azimuth angles of the noisy speech into four quadrants, taking the line connecting the two ears as the x-axis and its perpendicular bisector as the y-axis: front-right is the first quadrant, front-left the second, rear-left the third, and rear-right the fourth. A first-level deep learning network is then constructed that takes the binaural time difference and binaural level difference of the noisy speech as feature input and outputs the quadrant in which the target speech lies, thereby resolving the front-back confusion of target speech localization.
(1) The azimuth angles to which the noisy speech may belong are divided into four quadrants, taking the line connecting the two ears as the x-axis and its perpendicular bisector as the y-axis, labelled front-right as the first quadrant, front-left the second, rear-left the third, and rear-right the fourth.
(2) A first-level deep belief network is built in which the activation functions of the 4 hidden layers are all sigmoid functions and the output layer uses a purelin (linear) activation function. The ITD and ILD features of noisy speech with known azimuth are used as input and the quadrant to which the azimuth belongs as the label, and the network is trained until the mean square error no longer changes, as shown in FIG. 5.
(3) The noisy speech received by the hearing aid is analyzed in time and frequency, the extracted ITD and ILD are fed to the trained first-level deep learning network, and its output is the quadrant to which the target speech azimuth belongs.
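For illustration, a PyTorch sketch of this first-level quadrant network follows. The patent describes a deep belief network trained until the mean square error stops changing; this sketch uses plain back-propagation without RBM pre-training, and the input dimension, hidden width, learning rate and epoch count are assumptions.

import torch
import torch.nn as nn

class QuadrantNet(nn.Module):
    """Four sigmoid hidden layers, linear ("purelin") output trained with MSE."""
    def __init__(self, in_dim=128, hidden=128):
        super().__init__()
        layers, d = [], in_dim
        for _ in range(4):                          # 4 hidden layers with sigmoid activations
            layers += [nn.Linear(d, hidden), nn.Sigmoid()]
            d = hidden
        layers.append(nn.Linear(d, 1))              # linear output: quadrant index as target
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

def train_quadrant_net(model, features, quadrant_labels, epochs=100, lr=1e-3):
    """features: (N, in_dim) float tensor of ITD/ILD vectors; quadrant_labels: (N, 1) in {1..4}."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(features), quadrant_labels)
        loss.backward()
        opt.step()
    return model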
And step four, constructing a second-level deep learning network for each quadrant and selecting, according to the quadrant determined in step three, the second-level network used for azimuth localization. The normalized cross-correlation function and the binaural level difference are the input of the second-level network, which finally outputs the direction of the target speech.
(1) Four deep neural networks with four hidden layers each are built, one per quadrant; every hidden layer has 128 units, the output layer has a single unit, and sigmoid functions are used as the activation between layers. The CCF and ILD extracted from noisy speech with known azimuth are used as input and the true azimuth as the label, and each network is trained until the mean square error no longer decreases, as shown in FIG. 5.
(2) The corresponding second-level deep learning network C is selected according to the output of the first-level deep neural network. The CCF and ILD of the noisy speech are fed to network C, and its output is the estimated target speech azimuth.
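A sketch of the per-quadrant azimuth network and of the two-stage selection logic is given below; the input dimension (for instance a 33-lag CCF concatenated with a 64-band ILD) and the rounding of the first-level output to a quadrant index are illustrative assumptions.

import torch
import torch.nn as nn

class AzimuthNet(nn.Module):
    """One per quadrant: four hidden layers of 128 sigmoid units, scalar azimuth output."""
    def __init__(self, in_dim=97):                 # e.g. 33-lag CCF + 64-band ILD (assumed layout)
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.Sigmoid(),
            nn.Linear(128, 128), nn.Sigmoid(),
            nn.Linear(128, 128), nn.Sigmoid(),
            nn.Linear(128, 128), nn.Sigmoid(),
            nn.Linear(128, 1))

    def forward(self, x):
        return self.net(x)

def localize(quadrant_net, azimuth_nets, itd_ild_vec, ccf_ild_vec):
    """quadrant_net: trained first-level net; azimuth_nets: dict {1..4: trained AzimuthNet}."""
    q = int(torch.clamp(torch.round(quadrant_net(itd_ild_vec)), 1, 4).item())
    return q, azimuth_nets[q](ccf_ild_vec).item()   # estimated target-speech azimuth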
And step five, the azimuth of the target speech obtained in step four is used, with the existing spatial filtering algorithm, to select the head-related transfer function of the corresponding direction and to unmix the target speech from the mixed speech. Noise from directions different from the target sound source is removed by the deconvolution process shown in equation (7), giving the preliminary enhanced speech s_l(k), s_r(k) in the same direction as the target speech.
x_E(k) = Σ_{a=0}^{k−1} h(a)·s_E(k−a),  E ∈ {l, r}    (7)

Where h(k) is the head-related impulse response in the direction of the target speech and x_E(k) is the signal received by the hearing aid; E ∈ {l, r} denotes the left or right ear, k is the sample index, and a is the deconvolution delay, taken from 0 to k−1. Solving equation (7) for s_E(k) by deconvolution yields the preliminary enhanced speech.
The complete sound source localization flowchart is shown in fig. 2.
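As an illustrative stand-in for this spatial filtering step, the sketch below deconvolves each ear signal by the head-related impulse response of the estimated direction using a regularized frequency-domain inverse filter; the HRIR lookup table, the regularization constant and the nearest-azimuth selection are assumptions, and a practical system would use the patent's chosen spatial filtering algorithm instead.

import numpy as np

def deconvolve_hrtf(x_ear, hrir, eps=1e-3):
    """Regularized frequency-domain deconvolution of one ear signal by the
    target-direction head-related impulse response (Wiener-style inverse filter)."""
    n = len(x_ear) + len(hrir) - 1
    X = np.fft.rfft(x_ear, n)
    H = np.fft.rfft(hrir, n)
    S = X * np.conj(H) / (np.abs(H) ** 2 + eps)
    return np.fft.irfft(S, n)[:len(x_ear)]

def spatial_filter(x_l, x_r, azimuth_deg, hrir_db):
    """hrir_db: dict mapping azimuth (deg) -> (hrir_left, hrir_right), assumed to come
    from a measured HRTF set; the nearest stored azimuth is used."""
    nearest = min(hrir_db, key=lambda a: abs(a - azimuth_deg))
    h_l, h_r = hrir_db[nearest]
    return deconvolve_hrtf(x_l, h_l), deconvolve_hrtf(x_r, h_r)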
Step six, time-frequency analysis is performed on s_l(k), s_r(k), i.e. the preliminary enhanced speech is analyzed with the same Gammatone filter bank as in step one, giving its frequency-domain representation s_{f,l}(k), s_{f,r}(k).
And step seven, extracting multi-resolution auditory cepstrum coefficients from the spectrum signals obtained in the step six.
(1) By setting different frame lengths n_t, the signals s_{f,l}(k), s_{f,r}(k) are framed and windowed to obtain the time-frequency unit representations s_{ft,l}(n_t), s_{ft,r}(n_t). Time-frequency units with different frame lengths capture both the high-resolution detail and the low-resolution global structure.
(2) The energy of each time-frequency unit (the cochleagram) is computed:

CG_E(t,f) = Σ_i s_{ft,E}²(i)

Where E ∈ {l, r} is the left/right-ear index and i is the sample index within the frame.
(3) Cube-root compression is applied to the energy of each time-frequency unit: C_G_E(t,f) = (CG_E(t,f))^(1/3). This compresses the multi-resolution cochleagram features, better represents the difference between speech and noise, and is computationally simple.
(4) Finally, a discrete cosine transform (DCT) is applied to C_G_E(t,f) to decorrelate it and obtain the multi-resolution auditory cepstrum coefficients, as shown in equation (8):

F_E(t,d) = √(2/M) · Σ_{f=1}^{M} C_G_E(t,f) · cos( (π·d/(2M)) · (2f − 1) )    (8)

The formula above is the DCT of C_G_E(t,f), where M is the number of channels of the Gammatone filter bank (M = 64 in this algorithm) and d indexes the retained coefficients. Tests show that when d is greater than 36 the values of F_E(t,d) are extremely small, which also indicates that almost all of the information in F_E(t,d) is contained in the first 36 dimensions.
And step eight, constructing a deep neural network model that extracts features directly usable by the classifier from the coefficients of step seven: the multi-resolution auditory cepstrum coefficients extracted in step seven are taken as feature input, and the output is the features, learned from these coefficients, that are used directly to classify speech versus noise.
A network consisting of 3 LC-BLSTM layers and 2 fully connected DNN layers is built, as shown in FIG. 3. LC-BLSTM denotes a latency-controlled bidirectional long short-term memory network, a variant of the LSTM in which the unidirectional network of a typical LSTM is replaced by a bidirectional one, so that information flows both forward and backward along the time axis and context can be exploited more fully.
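For illustration, a PyTorch sketch of such a feature-extraction model follows. True latency control (processing in fixed-size chunks with limited future context) is omitted for brevity, and the input, hidden and output dimensions are assumptions rather than values given by the patent.

import torch
import torch.nn as nn

class BLSTM_DNN(nn.Module):
    """Three bidirectional LSTM layers followed by two fully connected layers."""
    def __init__(self, in_dim=72, lstm_hidden=256, dnn_hidden=512, out_dim=64):
        super().__init__()
        self.blstm = nn.LSTM(in_dim, lstm_hidden, num_layers=3,
                             batch_first=True, bidirectional=True)
        self.dnn = nn.Sequential(
            nn.Linear(2 * lstm_hidden, dnn_hidden), nn.ReLU(),
            nn.Linear(dnn_hidden, out_dim), nn.ReLU())

    def forward(self, x):                      # x: (batch, frames, in_dim) MRACC features
        h, _ = self.blstm(x)
        return self.dnn(h)                     # per-frame learned features for the classifier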
Step nine, a feature classifier is constructed that takes the feature parameters obtained in step eight, which can be used directly for classification, as input and outputs values representing the ideal binary mask. The units containing speech information are thus identified, and a speech waveform synthesis algorithm finally removes the noise sharing the direction of the target speech, giving the final enhanced speech, i.e. the output signal of the digital hearing aid.
The classifier is formed by adding a softmax layer to the output layer of the LC-BLSTM-DNN network, as shown in FIG. 4: the output layer of the LC-BLSTM-DNN network is the input of the feature classifier, and the output is a binary mask that distinguishes noise from speech.
Steps eight and nine together form a complete deep neural network for distinguishing speech from noise. It takes the multi-resolution auditory cepstrum coefficients F_E(t,f) of noisy speech whose clean speech and noise are known as feature input, uses binary masks computed from the known noise and clean speech as labels, and is trained with the error back-propagation algorithm unrolled in time; overfitting is suppressed with dropout. The training flow chart is shown in FIG. 5. The binary mask is computed as in equation (9): for each time-frequency unit, if the local signal-to-noise ratio SNR(t,f) exceeds a threshold LC (set here to 5 dB to better guarantee speech quality), the mask value of that unit is set to 1, and otherwise to 0.

IBM(t,f) = 1 if SNR(t,f) > LC, and 0 otherwise    (9)
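A minimal sketch of this label computation (the array shapes and the epsilon guard are assumptions):

import numpy as np

def ideal_binary_mask(clean_energy, noise_energy, lc_db=5.0):
    """Ideal binary mask labels per time-frequency unit (equation (9)): 1 where the
    local SNR exceeds the threshold LC (5 dB in the patent), 0 otherwise.
    clean_energy, noise_energy: (frames, channels) cochleagram energies."""
    snr_db = 10.0 * np.log10((clean_energy + 1e-12) / (noise_energy + 1e-12))
    return (snr_db > lc_db).astype(np.float32)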
At test time, the multi-resolution auditory cepstrum coefficients F_E(t,f) finally extracted in step seven are used as the feature input, and the network outputs a binary mask that distinguishes noise from speech, i.e. it marks the time-frequency units containing the speech signal; a speech waveform synthesis algorithm then removes the noise sharing the direction of the target speech, giving the final enhanced speech. The enhanced speech is evaluated objectively with the Perceptual Evaluation of Speech Quality (PESQ) score. The comparison algorithm uses other features and does not use sound source localization to remove the noise arriving from directions different from the sound source. The noise consists of 15 noise types from the NOISEX-92 database, including white, babble, pink, f16 and others; the other experimental data are identical. Table 1 lists the enhancement effect for several noise types and signal-to-noise ratios; comparison shows that the speech enhancement effect of the invention under the various noises is on average 0.25 higher than that of the comparison algorithm.
In summary, the complete speech enhancement algorithm is shown in FIG. 1. The binaural input is first fed to the sound source localization module to determine the spatial information of the target source, and spatial filtering then removes the noise from directions different from the target speech. The resulting noisy speech, now containing only noise in the same direction as the target speech, is fed to the deep learning module, which extracts features directly usable by the classifier; the classifier classifies the input according to these features and identifies the units containing speech information. Finally, a speech waveform synthesis algorithm removes the noise sharing the direction of the target speech and the enhanced speech is obtained.
Table 1. Final enhancement effect

Claims (1)

1. A speech enhancement method based on binaural sound source localization and deep learning in a binaural hearing aid is characterized by comprising the following specific steps:
firstly, performing time-frequency analysis on the input signals of the digital hearing aid with a Gammatone filter bank, which simulates the working mechanism of the basilar membrane and the auditory nerve in the human auditory system;
extracting two binaural spatial cues, the binaural time difference and the binaural level difference, from the frequency-domain signals produced by the Gammatone filtering, and judging the coordinate quadrant and azimuth angle of the target speech from these two features;
dividing the possible azimuth angles of the noisy speech into four quadrants, taking the line connecting the two ears as the x-axis and its perpendicular bisector as the y-axis, namely front-right as the first quadrant, front-left as the second, rear-left as the third and rear-right as the fourth; then constructing a first-level deep learning network that takes the binaural time difference and binaural level difference of the noisy speech as feature input and outputs the quadrant in which the target speech lies, thereby resolving the front-back confusion of target speech localization;
step four, constructing a second-level deep learning network for each quadrant, and selecting the second-level deep learning network for positioning by combining the quadrant to which the azimuth angle of the target voice judged in the step three belongs; taking the normalized cross-correlation function and the binaural sound level difference as the input of a second-level deep learning network, and finally outputting the direction of the target voice;
fifthly, removing noise in a direction different from the direction of the target sound source according to the spatial information of the target voice obtained in the fourth step by utilizing spatial filtering to obtain a primary enhanced voice in the same direction as the target voice;
step six, performing time-frequency analysis on the preliminary enhanced speech obtained in step five, namely analyzing it with the same Gammatone filter bank as in step one to obtain its frequency-domain representation;
Step seven, extracting multi-resolution auditory cepstrum coefficients from the spectrum signals obtained in the step six;
step eight, constructing a deep neural network model for extracting the features directly used for classifier classification from the step seven, taking the multi-resolution auditory cepstrum coefficient extracted from the step seven as feature input, and outputting the feature which is learned from the multi-resolution auditory cepstrum coefficient and directly used for classification to distinguish voice and noise;
step nine, constructing a feature classifier, and outputting a value representing ideal binary masking according to the feature parameters obtained in the step eight; the method comprises the steps of obtaining a unit containing voice information, and finally removing noise in the same direction as a target voice by using a voice waveform synthesis algorithm to obtain a final enhanced voice, namely an output signal of a digital hearing aid;
firstly, performing time-frequency analysis on the input signals of the digital hearing aid with a Gammatone filter bank, which simulates the working mechanism of the basilar membrane and the auditory nerve in the human auditory system;
(1) the input signals of the digital hearing aid, x_l(k) and x_r(k), are passed through a Gammatone filter bank whose impulse response is

g_f(t) = t^(m−1) · e^(−2πBt) · cos(2π·f_c·t + φ) · U(t)

dividing each signal into 64 frequency bands and giving the decomposed signals x_{f,l}(k) and x_{f,r}(k); here m is the order of the filter, φ is its initial phase, U(t) is the unit step function, and B is the bandwidth; f is the band index and runs from 1 to 64, f_c is the center frequency of the filter and ranges from 50 Hz to 8 kHz, l and r denote the left and right ears, and k is the sample index;
(2) using a Hamming window, the decomposed signals x_{f,l}(k) and x_{f,r}(k) are divided into frames and windowed; exploiting the short-time stationarity of speech, 20 ms (320 samples of the 16 kHz speech signal) is taken as the frame length and the frame shift is 10 ms (160 samples); the Hamming window function is defined in equation (1), and framing and windowing are defined in equations (2) and (3);

w(n) = 0.54 − 0.46·cos(2πn/(L−1)),  0 ≤ n ≤ L−1    (1)

wherein w(n) is the Hamming window function, n is the sample index within a frame, and L is the window length;

x_{ft,l}(n) = x′_{ft,l}(n) × w(n),  0 ≤ n ≤ L−1    (2)
x_{ft,r}(n) = x′_{ft,r}(n) × w(n),  0 ≤ n ≤ L−1    (3)

where n is the sample index within a frame, x_{ft,l}(n) and x_{ft,r}(n) are the framed, windowed time-frequency units, x′_{ft,l}(n) and x′_{ft,r}(n) are the time-frequency units before framing and windowing, w(n) is the Hamming window function, t is the frame index, and f is the index of the band;
step two, extracting two binaural spatial cues, the binaural time difference and the binaural level difference, from the frequency-domain signals produced by the Gammatone filtering;
two binaural cue features, the binaural time difference (equation (4)) together with the normalized cross-correlation function (equation (5)) and the binaural level difference (equation (6)), are extracted from the signals decomposed in the frequency domain by the Gammatone filter bank, as information describing the azimuth of the speech signal:

ITD(t,f) = argmax_τ CCF(t,f,τ)    (4)

where

CCF(t,f,τ) = Σ_i x_{ft,l}(i)·x_{ft,r}(i−τ) / √( Σ_i x_{ft,l}²(i) · Σ_i x_{ft,r}²(i) )    (5)

ILD(t,f) = 10·log₁₀( Σ_i x_{ft,l}²(i) / Σ_i x_{ft,r}²(i) )    (6)

wherein i is the sample index within the frame and τ is the time delay, ranging from −1 ms to 1 ms, so that a 33-dimensional CCF is obtained at the 16 kHz sampling rate; t is the frame index and f is the index of the band;
dividing the possible azimuth angles of the noisy speech into four quadrants, taking the line connecting the two ears as the x-axis and its perpendicular bisector as the y-axis, namely front-right as the first quadrant, front-left as the second, rear-left as the third and rear-right as the fourth; then constructing a first-level deep learning network that takes the binaural time difference and binaural level difference of the noisy speech as feature input and outputs the quadrant in which the noisy speech lies, thereby resolving the front-back confusion of target speech localization;
(1) the azimuth angles to which the noisy speech may belong are divided into four quadrants, taking the line connecting the two ears as the x-axis and its perpendicular bisector as the y-axis, labelled front-right as the first quadrant, front-left the second, rear-left the third and rear-right the fourth;
(2) a first-level deep belief network is built in which the activation functions of the 4 hidden layers are all sigmoid functions and the output layer uses a purelin (linear) activation function; the ITD and ILD features of noisy speech with known azimuth are used as input and the quadrant to which the azimuth belongs as the label, and the network is trained until the mean square error no longer changes;
(3) the ITD and ILD extracted after time-frequency analysis of the noisy speech received by the hearing aid are used as input to the trained first-level deep learning network, whose output is the quadrant to which the target speech azimuth belongs;
step four, constructing a second-level deep learning network for each quadrant, and selecting the second-level deep learning network for azimuth angle positioning by combining the quadrant to which the azimuth angle of the target voice judged in the step three belongs; taking the normalized cross-correlation function and the binaural sound level difference as the input of a second-level deep learning network, and finally outputting the direction of the target voice;
(1) four deep neural networks with four hidden layers each are built, one per quadrant; every hidden layer has 128 units, the output layer has a single unit, and sigmoid functions are used as the activation between layers; the CCF and ILD extracted from noisy speech with known azimuth are used as input and the true azimuth as the label, and each network is trained until the mean square error no longer decreases;
(2) the corresponding second-level deep learning network C is selected according to the output of the first-level deep neural network; the CCF and ILD of the noisy speech are used as the input of network C, and its output is the estimated target speech azimuth;
step five, the azimuth of the target speech obtained in step four is used, with the existing spatial filtering algorithm, to select the head-related transfer function of the corresponding direction and to unmix the target speech from the mixed speech; noise from directions different from the target sound source is removed by the deconvolution process shown in equation (7), giving the preliminary enhanced speech s_l(k), s_r(k) in the same direction as the target speech;

x_E(k) = Σ_{a=0}^{k−1} h(a)·s_E(k−a),  E ∈ {l, r}    (7)

where h(k) is the head-related impulse response in the direction of the target speech and x_E(k) is the signal received by the hearing aid; E ∈ {l, r} denotes the left or right ear, k is the sample index, and a is the deconvolution delay, taken from 0 to k−1; solving equation (7) for s_E(k) by deconvolution yields the preliminary enhanced speech;
step six, time-frequency analysis is performed on s_l(k), s_r(k), i.e. the preliminary enhanced speech is analyzed with the same Gammatone filter bank as in step one, giving its frequency-domain representation s_{f,l}(k), s_{f,r}(k);
Step seven, extracting multi-resolution auditory cepstrum coefficients from the spectrum signals obtained in the step six;
(1) by setting different frame lengths n_t, the signals s_{f,l}(k), s_{f,r}(k) are framed and windowed to obtain the time-frequency unit representations s_{ft,l}(n_t), s_{ft,r}(n_t); time-frequency units with different frame lengths capture both the high-resolution detail and the low-resolution global structure;
(2) the energy of each time-frequency unit is computed:

CG_E(t,f) = Σ_i s_{ft,E}²(i)

wherein E ∈ {l, r} is the left/right-ear index and i is the sample index within the frame;
(3) cube-root compression is applied to the energy of each time-frequency unit: C_G_E(t,f) = (CG_E(t,f))^(1/3);
(4) finally, a discrete cosine transform (DCT) is applied to C_G_E(t,f) to decorrelate it and obtain the multi-resolution auditory cepstrum coefficients, as shown in equation (8):

F_E(t,d) = √(2/M) · Σ_{f=1}^{M} C_G_E(t,f) · cos( (π·d/(2M)) · (2f − 1) )    (8)

the formula above is the DCT of C_G_E(t,f), where M is the number of channels of the Gammatone filter bank and M = 64, and d indexes the retained coefficients; tests show that when d is greater than 36 the values of F_E(t,d) are extremely small, which also indicates that almost all of the information in F_E(t,d) is contained in the first 36 dimensions;
step eight, constructing a deep neural network model for extracting the features directly used for classifier classification from the step seven, taking the multi-resolution auditory cepstrum coefficient extracted from the step seven as feature input, and outputting the feature which is learned from the multi-resolution auditory cepstrum coefficient and directly used for classification to distinguish voice and noise;
building a network consisting of 3 LC-BLSTM layers and 2 fully connected DNN layers, wherein LC-BLSTM is a latency-controlled bidirectional long short-term memory network;
step nine, constructing a feature classifier, taking the feature parameters which are obtained in the step eight and can be directly used for classifier classification as input, and outputting a value representing ideal binary masking; namely, a unit containing voice information is obtained, and finally, the voice waveform synthesis algorithm is utilized to remove noise in the same direction as the target voice, so that the final enhanced voice, namely the output signal of the digital hearing aid, is obtained.
CN201811292475.7A 2018-11-01 2018-11-01 Speech enhancement method based on binaural sound source localization and deep learning in binaural hearing aid Active CN109410976B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811292475.7A CN109410976B (en) 2018-11-01 2018-11-01 Speech enhancement method based on binaural sound source localization and deep learning in binaural hearing aid

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811292475.7A CN109410976B (en) 2018-11-01 2018-11-01 Speech enhancement method based on binaural sound source localization and deep learning in binaural hearing aid

Publications (2)

Publication Number Publication Date
CN109410976A CN109410976A (en) 2019-03-01
CN109410976B true CN109410976B (en) 2022-12-16

Family

ID=65471037

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811292475.7A Active CN109410976B (en) 2018-11-01 2018-11-01 Speech enhancement method based on binaural sound source localization and deep learning in binaural hearing aid

Country Status (1)

Country Link
CN (1) CN109410976B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109859767B (en) * 2019-03-06 2020-10-13 哈尔滨工业大学(深圳) Environment self-adaptive neural network noise reduction method, system and storage medium for digital hearing aid
CN110010150A (en) * 2019-04-15 2019-07-12 吉林大学 Auditory Perception speech characteristic parameter extracting method based on multiresolution
CN110136737A (en) * 2019-06-18 2019-08-16 北京拙河科技有限公司 A kind of voice de-noising method and device
CN110415702A (en) * 2019-07-04 2019-11-05 北京搜狗科技发展有限公司 Training method and device, conversion method and device
CN110517705B (en) * 2019-08-29 2022-02-18 北京大学深圳研究生院 Binaural sound source positioning method and system based on deep neural network and convolutional neural network
CN110728970B (en) * 2019-09-29 2022-02-25 东莞市中光通信科技有限公司 Method and device for digital auxiliary sound insulation treatment
CN111354353B (en) * 2020-03-09 2023-09-19 联想(北京)有限公司 Voice data processing method and device
CN111429930B (en) * 2020-03-16 2023-02-28 云知声智能科技股份有限公司 Noise reduction model processing method and system based on adaptive sampling rate
KR20210147155A (en) 2020-05-27 2021-12-07 현대모비스 주식회사 Apparatus of daignosing noise quality of motor
CN111916101B (en) * 2020-08-06 2022-01-21 大象声科(深圳)科技有限公司 Deep learning noise reduction method and system fusing bone vibration sensor and double-microphone signals
CN112735456B (en) * 2020-11-23 2024-01-16 西安邮电大学 Speech enhancement method based on DNN-CLSTM network
CN116711007A (en) * 2021-04-01 2023-09-05 深圳市韶音科技有限公司 Voice enhancement method and system
CN113794963B (en) * 2021-09-14 2022-08-05 深圳大学 Speech enhancement system based on low-cost wearable sensor

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103778920A (en) * 2014-02-12 2014-05-07 北京工业大学 Speech enhancing and frequency response compensation fusion method in digital hearing-aid
CN105741849A (en) * 2016-03-06 2016-07-06 北京工业大学 Voice enhancement method for fusing phase estimation and human ear hearing characteristics in digital hearing aid
CN106782565A (en) * 2016-11-29 2017-05-31 重庆重智机器人研究院有限公司 A kind of vocal print feature recognition methods and system
EP3203472A1 (en) * 2016-02-08 2017-08-09 Oticon A/s A monaural speech intelligibility predictor unit
CN107479030A (en) * 2017-07-14 2017-12-15 重庆邮电大学 Based on frequency dividing and improved broad sense cross-correlation ears delay time estimation method
CN107845389A (en) * 2017-12-21 2018-03-27 北京工业大学 A kind of sound enhancement method based on multiresolution sense of hearing cepstrum coefficient and depth convolutional neural networks
CN108091345A (en) * 2017-12-27 2018-05-29 东南大学 A kind of ears speech separating method based on support vector machines
CN108122559A (en) * 2017-12-21 2018-06-05 北京工业大学 Binaural sound sources localization method based on deep learning in a kind of digital deaf-aid

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170061978A1 (en) * 2014-11-07 2017-03-02 Shannon Campbell Real-time method for implementing deep neural network based speech separation

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103778920A (en) * 2014-02-12 2014-05-07 北京工业大学 Speech enhancing and frequency response compensation fusion method in digital hearing-aid
EP3203472A1 (en) * 2016-02-08 2017-08-09 Oticon A/s A monaural speech intelligibility predictor unit
CN105741849A (en) * 2016-03-06 2016-07-06 北京工业大学 Voice enhancement method for fusing phase estimation and human ear hearing characteristics in digital hearing aid
CN106782565A (en) * 2016-11-29 2017-05-31 重庆重智机器人研究院有限公司 A kind of vocal print feature recognition methods and system
CN107479030A (en) * 2017-07-14 2017-12-15 重庆邮电大学 Based on frequency dividing and improved broad sense cross-correlation ears delay time estimation method
CN107845389A (en) * 2017-12-21 2018-03-27 北京工业大学 A kind of sound enhancement method based on multiresolution sense of hearing cepstrum coefficient and depth convolutional neural networks
CN108122559A (en) * 2017-12-21 2018-06-05 北京工业大学 Binaural sound sources localization method based on deep learning in a kind of digital deaf-aid
CN108091345A (en) * 2017-12-27 2018-05-29 东南大学 A kind of ears speech separating method based on support vector machines

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Binaural sound source localization algorithm based on Gammatone filter decomposition, HRTF and GMM; Li Ruwei et al.; Journal of Beijing University of Technology; 2018-10-11; Vol. 44, No. 11; pp. 1385-1390 *
Multi-channel frequency response compensation algorithm based on two-step noise reduction in digital hearing aids; Li Ruwei et al.; Beijing Biomedical Engineering; 2018-08-16; Vol. 37, No. 4; pp. 337-344, 426 *

Also Published As

Publication number Publication date
CN109410976A (en) 2019-03-01

Similar Documents

Publication Publication Date Title
CN109410976B (en) Speech enhancement method based on binaural sound source localization and deep learning in binaural hearing aid
CN107845389B (en) Speech enhancement method based on multi-resolution auditory cepstrum coefficient and deep convolutional neural network
CN110728989B (en) Binaural speech separation method based on long-time and short-time memory network L STM
CN110517705B (en) Binaural sound source positioning method and system based on deep neural network and convolutional neural network
CN107767859B (en) Method for detecting speaker intelligibility of cochlear implant signal in noise environment
CN111916101B (en) Deep learning noise reduction method and system fusing bone vibration sensor and double-microphone signals
CN107479030A (en) Based on frequency dividing and improved broad sense cross-correlation ears delay time estimation method
CN111292762A (en) Single-channel voice separation method based on deep learning
CN109164415B (en) Binaural sound source positioning method based on convolutional neural network
CN108122559B (en) Binaural sound source positioning method based on deep learning in digital hearing aid
CN105741849A (en) Voice enhancement method for fusing phase estimation and human ear hearing characteristics in digital hearing aid
WO2007028250A2 (en) Method and device for binaural signal enhancement
CN113470671B (en) Audio-visual voice enhancement method and system fully utilizing vision and voice connection
Hummersone A psychoacoustic engineering approach to machine sound source separation in reverberant environments
Aroudi et al. Cognitive-driven binaural LCMV beamformer using EEG-based auditory attention decoding
Roman et al. Pitch-based monaural segregation of reverberant speech
CN109147808A (en) A kind of Speech enhancement hearing-aid method
CN112885375A (en) Global signal-to-noise ratio estimation method based on auditory filter bank and convolutional neural network
Gabbay et al. Seeing through noise: Speaker separation and enhancement using visually-derived speech
Patil et al. Marathi speech intelligibility enhancement using i-ams based neuro-fuzzy classifier approach for hearing aid users
May Robust speech dereverberation with a neural network-based post-filter that exploits multi-conditional training of binaural cues
Kothapally et al. Speech Detection and Enhancement Using Single Microphone for Distant Speech Applications in Reverberant Environments.
Xu et al. Learning to separate voices by spatial regions
Magadum et al. An Innovative Method for Improving Speech Intelligibility in Automatic Sound Classification Based on Relative-CNN-RNN
Gajecki et al. A Fused Deep Denoising Sound Coding Strategy for Bilateral Cochlear Implants

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant