CN106531179A

CN106531179A - Multi-channel speech enhancement method based on semantic prior selective attention

Info

Publication number: CN106531179A
Application number: CN201510574907.3A
Authority: CN
Inventors: 付强; 王晓飞; 国雁萌; 颜永红
Original assignee: Institute of Acoustics CAS
Current assignee: Institute of Acoustics CAS
Priority date: 2015-09-10
Filing date: 2015-09-10
Publication date: 2017-03-22
Anticipated expiration: 2035-09-10
Also published as: CN106531179B

Abstract

The invention provides a multi-channel speech enhancement method based on semantic-prior selective attention. The method comprises the following steps: picking up speech signals arriving from any direction in a reverberant environment with a multi-microphone array, collecting the multi-channel speech signals and pre-processing them; detecting a specific activation word in the pre-processed speech signals with an activation-word speech recognition model; processing the unsegmented signal that contains the activation-word segment to obtain the complete activation-word segment; analyzing the activation-word segment with a reverberation-robust multi-channel phase-difference sound source localization method to obtain the direction of arrival of the target sound source; and enhancing the speech from that direction while suppressing noise from other directions and room reverberation in the far-field scenario, so that enhanced speech from the target direction is obtained. The method is applicable to smart appliances, smart homes, in-vehicle systems, wearable devices and other settings that require far-field speech input and interaction, and is especially suitable for environments with complex acoustic noise and interference.

Description

A multi-channel speech enhancement method based on semantic-prior selective attention

Technical field

The present invention relates to the field of speech processing, and in particular to a multi-channel speech enhancement method based on semantic-prior selective attention.

Background art

With the continued spread of voice communication and human-machine voice interaction systems, people increasingly expect to do away with cumbersome equipment such as microphones and headsets and to converse with machines as naturally as with other people. However, speech is a sound wave, and as it propagates through air it is affected by many factors, such as attenuation, multiple reflections from walls and obstacles (reverberation), other simultaneously active sound sources, and ambient noise. When multiple voice systems and multiple speakers share the same environment, whether a system can receive the speech information correctly largely determines whether it can be put to practical use. Speech enhancement is an effective means of extracting the target speech signal in a complex noise environment, and is divided into single-channel speech enhancement and multi-channel speech enhancement.

Single-channel speech enhancement removes noise mainly by exploiting differences between the time-frequency distributions of speech and noise. Its two key problems are noise estimation and a priori SNR estimation: the former is the key factor in reducing noise, while the latter governs the amount of residual "musical noise". In many cases single-channel enhancement algorithms significantly improve the signal-to-noise ratio, and they are particularly effective at removing stationary noise (white noise, car noise).

Multi-channel speech enhancement exploits the ability of a microphone array to pick up spatial information. By combining time-domain, frequency-domain and spatial information, it obtains spatially selective reception. In general, multi-channel speech enhancement needs prior knowledge of the arrival angle of the incoming wave in order to form a reliable steering vector, and then applies spatial filtering theory to suppress interference from non-target directions. Compared with single-channel speech enhancement, it offers better noise suppression.

The human auditory system can cope with multiple sound sources and reverberation, and can even detect and track the voice of interest when several people are talking, mainly because it has a specific selective attention ability. When a person is interested in a particular target sound, they can, depending on the task and environment, choose the features that best distinguish the target speech from the ambient sound, compare and screen them against prior knowledge, reject the interfering sounds and obtain the target speech.

For voice applications, many kinds of noise or interference may be present in real scenes such as the home, the car and outdoors. Existing speech enhancement or separation methods find it very difficult to pick up the target speech without distortion while also eliminating or suppressing non-target signals, particularly at low signal-to-noise ratio when multiple coherent sound sources are present simultaneously and reverberation is strong.

Speech enhancement based on multiple channels (a microphone array) uses the amplitude and phase differences among the signals received by the microphones to form spatial selectivity toward the target direction, so that beamforming (Beamforming, BM) and directional speech activity detection (Directive speech activity detection, DSAD) algorithms point at the target direction and suppress or reject interference from non-target directions. However, the direction of arrival (DOA) of the target sound source is still not known in advance. Under a single-source assumption, the DOA of the target source can be determined with sound source localization (Source Location, SL) techniques, but in real application environments this assumption rarely holds. In most cases several sound sources are present simultaneously, and their number is unknown. In a reverberant field with room reflections the situation is more complicated still, and the target sound source is subject to excessive noise.

Summary of the invention

The object of the invention is to overcome the above drawbacks of existing multi-channel speech enhancement methods. By combining semantics-based sound source identification with signal-processing-based sound source localization, and incorporating the "spatial filtering" property of a microphone array, the invention proposes a multi-channel speech enhancement method based on semantic-prior selective attention that can effectively overcome noise and interference.

To achieve the above object, the invention provides a multi-channel speech enhancement method based on semantic-prior selective attention, the method comprising: picking up speech signals arriving from any direction in a reverberant environment with a multi-microphone array, collecting the multi-channel speech signals and pre-processing them; detecting a specific activation word in the pre-processed speech signals with an activation-word speech recognition model; processing the unsegmented signal containing the activation-word segment to obtain the complete activation-word segment; processing the activation-word segment with a reverberation-robust multi-channel phase-difference sound source localization method to obtain the direction of arrival of the target sound source; and enhancing the speech from that direction while suppressing noise from other directions and room reverberation in the far-field scenario, to obtain the enhanced speech from the target direction.

In the above technical solution, the method specifically comprises:

Step 1) A multi-microphone array picks up speech signals arriving from any direction in a reverberant environment, and the multi-channel speech signals are collected;

Step 2) The multi-channel speech signals collected in step 1) are pre-processed;

Step 3) An activation-word speech recognition model is used to detect whether a specific activation word is present in the pre-processed speech signals; if the detection result is positive, the unsegmented signal containing the activation-word segment is retained and the method proceeds to step 4); otherwise, it returns to step 1);

Step 4) Voice activity detection is performed on the unsegmented signal containing the activation-word segment to obtain the complete activation-word segment; the activation-word segment is analyzed with the reverberation-robust multi-channel phase-difference sound source localization method to obtain the direction of arrival of the target sound source; the speech from that direction is enhanced, and residual directional noise, diffuse noise from the environment and room reverberation in the far-field scenario are suppressed, yielding the enhanced speech from the target direction.

In the above technical solution, step 2) is carried out as follows: if acoustic echo is present in the multi-channel speech signals, echo cancellation, diffuse background noise suppression and gain control are applied to the picked-up multi-channel speech signals; otherwise, only diffuse background noise suppression and gain control are applied to the multi-channel speech signals.
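As an illustration of the conditional pre-processing in step 2), the following is a minimal sketch, not the method of the invention. It assumes that acoustic echo is present exactly when a loudspeaker reference signal `ref` is available, swaps in a textbook NLMS adaptive filter for echo cancellation, and uses a single shared RMS gain for gain control. The function names, the parameters `taps`, `mu` and `target_rms`, and the omission of the diffuse-noise suppression stage are all assumptions of this sketch.

```python
import numpy as np

def nlms_echo_cancel(mic, ref, taps=64, mu=0.5, eps=1e-8):
    """Subtract the echo of the loudspeaker signal `ref` from `mic` (NLMS filter)."""
    mic = np.asarray(mic, dtype=float)
    ref = np.asarray(ref, dtype=float)
    w = np.zeros(taps)                                  # adaptive echo-path estimate
    padded = np.concatenate([np.zeros(taps - 1), ref])
    out = np.empty_like(mic)
    for n in range(len(mic)):
        x = padded[n:n + taps][::-1]                    # latest reference samples, newest first
        e = mic[n] - w @ x                              # residual after removing estimated echo
        w += mu * e * x / (x @ x + eps)                 # normalized LMS update
        out[n] = e
    return out

def preprocess(channels, ref=None, target_rms=0.1):
    """Hypothetical step 2): echo cancellation only if a reference exists, then gain control."""
    channels = np.atleast_2d(np.asarray(channels, dtype=float))
    if ref is not None:                                 # assumption: echo present <=> reference given
        channels = np.stack([nlms_echo_cancel(ch, ref) for ch in channels])
    # Diffuse background noise suppression would go here; it is omitted in this sketch.
    rms = np.sqrt(np.mean(channels ** 2)) + 1e-12
    return channels * (target_rms / rms)                # one shared gain for all channels
```

A single gain is applied to all channels so that the inter-channel amplitude and phase relations, which the later localization stage depends on, are left unchanged.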

In the above technical solution, the detection in step 3) of whether a specific activation word is present in the pre-processed speech signals is carried out as follows: a speaker-dependent or speaker-independent activation-word speech recognition model is trained from a large amount of prior activation-word data or from the data of a specific speaker; a recognition decoding strategy is used to detect the activation-word content and compute a confidence score, which completes the classification decision; speech recognition is combined with a keyword retrieval algorithm to detect the activation word.
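The text does not say how the confidence score is computed. One common possibility, shown here purely as an assumption, is to compare per-frame log-likelihoods from the activation-word model with those of a generic "filler" model and to threshold the averaged log-likelihood ratio. The function names, the filler model and the threshold of 0.8 are hypothetical.

```python
import math

def activation_word_confidence(keyword_loglik, filler_loglik):
    """Map the average per-frame log-likelihood ratio to a confidence in (0, 1)."""
    if len(keyword_loglik) != len(filler_loglik) or len(keyword_loglik) == 0:
        raise ValueError("need two equal-length, non-empty score sequences")
    llr = sum(k - f for k, f in zip(keyword_loglik, filler_loglik)) / len(keyword_loglik)
    llr = max(-50.0, min(50.0, llr))        # clamp to keep exp() in range
    return 1.0 / (1.0 + math.exp(-llr))     # logistic squashing of the ratio

def detect_activation_word(keyword_loglik, filler_loglik, threshold=0.8):
    """Return (detected, confidence); `threshold` is an assumed tuning parameter."""
    conf = activation_word_confidence(keyword_loglik, filler_loglik)
    return conf >= threshold, conf
```

In such a scheme, a positive decision would trigger step 4) on the retained unsegmented signal, and a negative decision would return the system to signal pickup.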

In the above technical solution, step 4) specifically comprises:

Step 4-1) Voice activity detection is used to detect the start point and end point of the activation word, giving the complete multi-channel activation-word segment;

Step 4-2) The activation-word segment is analyzed with the reverberation-robust multi-channel phase-difference sound source localization method to obtain the direction-of-arrival information of the target sound source, that is, the direction of the target speaker who uttered the specific semantics; the speech from that direction is then enhanced according to the direction-of-arrival information;

Step 4-3) Multi-channel post-filtering is used to further suppress residual directional noise, diffuse noise from the environment and room reverberation in the far-field scenario, yielding the enhanced speech from the target direction.
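The endpoint detection of step 4-1) can be sketched with a simple energy-based detector. This is an illustrative assumption, since the text does not specify the voice activity detection algorithm. Frame energy is averaged across channels, compared with a noise floor taken as the 10th percentile of the frame energies, and the first and last active frames (padded by a few hangover frames) give the endpoints. All parameter names and values are hypothetical.

```python
import numpy as np

def activation_word_endpoints(x, frame_len=256, hop=128, ratio_db=10.0, hangover=3):
    """x: (channels, samples). Return (start_sample, end_sample), or None if no speech."""
    x = np.atleast_2d(np.asarray(x, dtype=float))
    n_frames = 1 + (x.shape[1] - frame_len) // hop
    if n_frames < 1:
        return None
    # Mean energy per frame, pooled over all channels.
    energy = np.array([np.mean(x[:, i * hop:i * hop + frame_len] ** 2)
                       for i in range(n_frames)])
    noise_floor = np.percentile(energy, 10) + 1e-12     # assumed noise-floor estimate
    active = 10.0 * np.log10(energy / noise_floor + 1e-12) > ratio_db
    idx = np.flatnonzero(active)
    if idx.size == 0:
        return None
    first = max(idx[0] - hangover, 0)                   # pad so word onsets are not clipped
    last = min(idx[-1] + hangover, n_frames - 1)
    return first * hop, last * hop + frame_len
```

Because one pair of endpoints is shared by all channels, the segment `x[:, start:end]` keeps the channels time-aligned for the phase-difference analysis of step 4-2).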

In the above technical solution, step 4-2) specifically comprises:

Step 4-2-1) The activation-word segment is transformed to the time-frequency domain, and at each frequency the coherent part and the incoherent part of the signal are tracked separately;

Step 4-2-2) The time-frequency points occupied by direct sound are collected;

Step 4-2-3) Within the time-frequency points occupied by direct sound, the distribution of the signal time difference of arrival is obtained from the low-frequency part that is free of spatial aliasing;

Step 4-2-4) In the high-frequency part, the time-difference-of-arrival information obtained at low frequencies is used to remove the effect of spatial aliasing, giving the time-difference-of-arrival information of the full band; the direction-of-arrival information is then obtained;

Step 4-2-5) The speech from that direction is enhanced according to the direction-of-arrival information.
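Steps 4-2-3) and 4-2-4) can be sketched for a two-microphone pair as follows. This is a simplified illustration under stated assumptions, not the localization method of the invention: the direct-sound time-frequency points of steps 4-2-1) and 4-2-2) are supplied as a boolean `direct_mask`, the time difference is taken as a median rather than a full distribution, and the source is assumed to be in the far field. Below the spatial aliasing frequency c/(2d) the inter-channel phase is unambiguous; above it, the low-frequency estimate selects the integer number of 2π wraps. All names are hypothetical.

```python
import numpy as np

def estimate_doa(X1, X2, freqs, mic_dist, direct_mask=None, c=343.0):
    """X1, X2: complex STFTs of shape (n_freq, n_frames) from two microphones.

    Returns (tdoa_seconds, doa_degrees from broadside). A positive TDOA means
    the sound reaches microphone 1 first.
    """
    if direct_mask is None:                        # assumption: all points are direct sound
        direct_mask = np.ones(X1.shape, dtype=bool)
    f = np.broadcast_to(np.asarray(freqs, dtype=float)[:, None], X1.shape)
    phase = np.angle(X1 * np.conj(X2))             # wrapped phase difference, equals 2*pi*f*tau
    usable = direct_mask & (f > 0)
    f_alias = c / (2.0 * mic_dist)                 # above this, the phase may wrap

    low = usable & (f < f_alias)                   # step 4-2-3): unambiguous region
    tau_low_all = phase[low] / (2.0 * np.pi * f[low])
    tau_low = np.median(tau_low_all)

    high = usable & (f >= f_alias)                 # step 4-2-4): resolve the 2*pi ambiguity
    k = np.round((2.0 * np.pi * f[high] * tau_low - phase[high]) / (2.0 * np.pi))
    tau_high_all = (phase[high] + 2.0 * np.pi * k) / (2.0 * np.pi * f[high])

    tau = np.median(np.concatenate([tau_low_all, tau_high_all]))
    sin_theta = np.clip(c * tau / mic_dist, -1.0, 1.0)
    return tau, np.degrees(np.arcsin(sin_theta))
```

With 10 cm spacing, for example, the aliasing frequency is about 1.7 kHz, so most of the speech band lies in the region that needs the low-frequency estimate to be unwrapped.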

In the above technical solution, the speech enhancement in step 4-2-5) can be carried out in two ways:

First way: according to the direction-of-arrival information, the speech from the known direction is enhanced with a beamforming method, and coherent sound sources from other directions are suppressed;

Second way: spatial target speech signal detection is performed using the known direction, accepting speech from the target region and rejecting sound sources from other directions.

The advantages of the invention are:

1. The method of the invention can be used in smart appliances, smart homes, in-vehicle systems, wearable devices and other settings that require far-field speech input and interaction, and is particularly suitable for environments with complex acoustic noise and interference;

2. The method of the invention can selectively pick up the target signal under far-field hands-free conditions and suppress interference and noise.

Brief description of the drawings

Fig. 1 is a flow chart of the multi-channel speech enhancement method based on semantic-prior selective attention according to the invention;

Fig. 2 is a flow chart of spatial target speech signal detection using the known direction according to the invention.

Detailed description of the embodiments

The target speech differs from other sounds in many features. To make full use of such features for detection, priority should be given to the features for which the most prior knowledge is available and which are most reliable. For example, when a loudspeaker is playing sound, sound correlated with the loudspeaker signal can be regarded as echo interference; if the semantics of the target speech are known, the semantics are an obvious distinguishing feature; if the direction of arrival (DOA) of the target speech is known, a large amount of irrelevant sound can be removed by checking DOA information. By detecting and comparing these various kinds of distinguishing information, the influence of interfering sound can ultimately be suppressed and the target speech segment picked out of the sound mixture.

The invention is described below with reference to the accompanying drawings.

As shown in Fig. 1, a multi-channel speech enhancement method based on semantic-prior selective attention comprises:

If acoustic echo is present in the speech signals, echo cancellation, diffuse background noise suppression and gain control are applied to the picked-up multi-channel speech signals; otherwise, only diffuse background noise suppression and any necessary gain control are applied to the multi-channel speech signals;

A speaker-dependent or speaker-independent activation-word speech recognition model is trained from a large amount of prior activation-word data or from the data of a specific speaker; a recognition decoding strategy is used to detect the activation-word content and compute a confidence score, which completes the classification decision; speech recognition is combined with a keyword retrieval algorithm to detect the activation word.

Step 4) Speech enhancement is performed on the unsegmented signal containing the activation-word segment; specifically:

Step 4-1) Voice activity detection (VAD: Voice Activity Detection) is used to detect the start point and end point of the activation word, giving the complete multi-channel activation-word segment;

Step 4-2) The activation-word segment is analyzed with the reverberation-robust multi-channel phase-difference sound source localization method to obtain the DOA information of the target sound source, that is, the direction of the target speaker who uttered the specific semantics; specifically:

Step 4-2-2) The time-frequency points occupied by direct sound are collected;

Step 4-2-3) Within the time-frequency points occupied by direct sound, the distribution of the time difference of arrival (TDOA: Time Difference Of Arrival) is obtained from the low-frequency part that is free of spatial aliasing;

Step 4-2-4) In the high-frequency part, the TDOA information obtained at low frequencies is used to remove the effect of spatial aliasing, giving the TDOA of the full-band signal, from which the DOA information is then obtained;

Step 4-2-5) The speech from the known direction is enhanced according to the DOA information. The speech from the known direction can be enhanced in two ways:

First way: according to the DOA information, the speech from the known direction is enhanced with a beamforming method, and coherent sound sources from other directions are suppressed;

In this embodiment, a multi-channel minimum variance distortionless response (MVDR) beamforming method with diagonal loading (Diagonal Loading) is used to suppress coherent sound sources from other directions. In other embodiments, directional interference can also be suppressed with an iteration-based blind source separation (Blind Source Separation) technique.
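For reference, a standard narrowband MVDR beamformer with diagonal loading can be sketched per frequency bin as below. The weights are w = R⁻¹d / (dᴴR⁻¹d), where R is the spatial covariance matrix and d the steering vector toward the estimated DOA. The uniform linear array geometry, the far-field steering model, the scaling of the loading term by the average channel power, and all function names are assumptions of this sketch rather than details given in the text.

```python
import numpy as np

def steering_vector(freq, mic_positions, doa_deg, c=343.0):
    """Far-field steering vector for a linear array; DOA is measured from broadside."""
    delays = np.asarray(mic_positions, dtype=float) * np.sin(np.radians(doa_deg)) / c
    return np.exp(-2j * np.pi * freq * delays)

def mvdr_weights(R, d, loading=1e-2):
    """MVDR weights for one frequency bin, with diagonal loading for robustness."""
    M = R.shape[0]
    # Loading proportional to the average channel power (an assumed convention).
    R_loaded = R + loading * (np.trace(R).real / M) * np.eye(M)
    Rinv_d = np.linalg.solve(R_loaded, d)          # R^-1 d without forming the inverse
    return Rinv_d / (d.conj() @ Rinv_d)            # normalize so that w^H d = 1

def beamform(X, w):
    """Apply weights to multi-channel STFT data X of shape (mics, frames): y = w^H X."""
    return w.conj() @ X
```

Diagonal loading trades some interference rejection for robustness to steering errors and to poorly estimated covariance matrices, which matters when the DOA comes from a short activation-word segment.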

Second way: spatial target speech signal detection (DSAD) is performed using the known direction, accepting speech from the target region and rejecting sound sources from other directions.

As shown in Fig. 2, taking dual-channel DSAD as an example, a decision is made at each time-frequency point using the beam-to-reference energy ratio (Beam-to-Reference Ratio, BRR) and the signal-to-noise ratio (SNR). The decision threshold for the BRR is combined with a direct-to-reverberant energy ratio (Direct-to-Reverberate Ratio, DRR) tracking mechanism, so that the detection threshold of each time-frequency point adapts to the environment and the likelihood estimate at each time-frequency point becomes more accurate. A sidelobe suppression mechanism reduces the effect of high-frequency aliasing, which in turn improves the completeness and accuracy of the decision.
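The text does not define the BRR or the threshold adaptation rule, so the following dual-channel sketch rests on explicit assumptions. The BRR is taken as the power of a delay-and-sum beam steered at the target TDOA divided by the mean power of the two channels, and the BRR threshold is relaxed by up to `relax_db` as the DRR falls. The SNR and DRR values are assumed to be supplied by separate estimators, the sidelobe suppression mechanism is not implemented, and all names and numeric values are hypothetical.

```python
import numpy as np

def dsad_mask(X1, X2, freqs, tdoa, snr_db, drr_db,
              brr_thresh_db=-1.0, snr_thresh_db=3.0, relax_db=2.0):
    """Accept/reject each time-frequency point; X1, X2 have shape (n_freq, n_frames).

    `tdoa` is the delay of channel 2 relative to channel 1 for the target direction.
    """
    f = np.asarray(freqs, dtype=float)[:, None]
    X2_aligned = X2 * np.exp(2j * np.pi * f * tdoa)        # undo the target delay on channel 2
    beam = 0.5 * (X1 + X2_aligned)                         # delay-and-sum beam at the target
    ref_pow = 0.5 * (np.abs(X1) ** 2 + np.abs(X2) ** 2) + 1e-12
    brr_db = 10.0 * np.log10(np.abs(beam) ** 2 / ref_pow + 1e-12)
    # Assumed adaptation rule: the lower the DRR, the more the threshold is relaxed.
    drr_lin = 10.0 ** (np.asarray(drr_db, dtype=float) / 10.0)
    thresh_db = brr_thresh_db - relax_db / (1.0 + drr_lin)
    return (brr_db > thresh_db) & (np.asarray(snr_db) > snr_thresh_db)
```

For a source at the target TDOA the aligned channels add coherently and the BRR is close to 0 dB, while for other directions it drops at most frequencies; the remaining false acceptances at aliased high frequencies are what a sidelobe suppression stage would need to address.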

Step 4-3) Multi-channel post-filtering is used to further suppress residual directional noise, diffuse noise from the environment and room reverberation in the far-field scenario, yielding the enhanced speech.

Claims

1. A multi-channel speech enhancement method based on semantic-prior selective attention, the method comprising: picking up speech signals arriving from any direction in a reverberant environment with a multi-microphone array, collecting the multi-channel speech signals and pre-processing them; detecting a specific activation word in the pre-processed speech signals with an activation-word speech recognition model; processing the unsegmented signal containing the activation-word segment to obtain the complete activation-word segment; analyzing the activation-word segment with a reverberation-robust multi-channel phase-difference sound source localization method to obtain the direction of arrival of the target sound source; and enhancing the speech from that direction while suppressing noise from other directions and room reverberation in the far-field scenario, to obtain the enhanced speech from the target direction.

2. The multi-channel speech enhancement method based on semantic-prior selective attention according to claim 1, characterized in that the method specifically comprises:

3. The multi-channel speech enhancement method based on semantic-prior selective attention according to claim 2, characterized in that step 2) is carried out as follows: if acoustic echo is present in the multi-channel speech signals, echo cancellation, diffuse background noise suppression and gain control are applied to the picked-up multi-channel speech signals; otherwise, only diffuse background noise suppression and gain control are applied to the multi-channel speech signals.

4. The multi-channel speech enhancement method based on semantic-prior selective attention according to claim 2, characterized in that the detection in step 3) of whether a specific activation word is present in the pre-processed speech signals is carried out as follows: a speaker-dependent or speaker-independent activation-word speech recognition model is trained from a large amount of prior activation-word data or from the data of a specific speaker; a recognition decoding strategy is used to detect the activation-word content and compute a confidence score, which completes the classification decision; speech recognition is combined with a keyword retrieval algorithm to detect the activation word.

5. The multi-channel speech enhancement method based on semantic-prior selective attention according to claim 2, characterized in that step 4) specifically comprises:

6. The multi-channel speech enhancement method based on semantic-prior selective attention according to claim 5, characterized in that step 4-2) specifically comprises:

Step 4-2-2) The time-frequency points occupied by direct sound are collected;

7. The multi-channel speech enhancement method based on semantic-prior selective attention according to claim 6, characterized in that the speech enhancement in step 4-2-5) can be carried out in two ways: