CN106531179B

CN106531179B - A multi-channel speech enhancement method with selective attention based on a semantic prior

Info

Publication number: CN106531179B
Application number: CN201510574907.3A
Authority: CN
Inventors: 付强; 王晓飞; 国雁萌; 颜永红
Original assignee: Institute of Acoustics CAS
Current assignee: Institute of Acoustics CAS
Priority date: 2015-09-10
Filing date: 2015-09-10
Publication date: 2019-08-20
Anticipated expiration: 2035-09-10
Also published as: CN106531179A

Abstract

The present invention provides a multi-channel speech enhancement method with selective attention based on a semantic prior. The method comprises: a multi-microphone array picks up speech signals from any direction in a reverberant environment, and the multi-channel speech signal is acquired and pre-processed; an activation word speech recognition model is used to detect a specific activation word present in the pre-processed speech signal; the unsegmented signal containing the activation word segment is processed to obtain the complete activation word segment; the activation word segment is analyzed with a reverberation-robust multi-channel phase-difference sound source localization method to obtain the direction of arrival of the target sound source; the speech from that direction is enhanced, and noise from other directions and room reverberation in far-field scenarios are suppressed, yielding the enhanced speech of the target direction. The method of the present invention can be used wherever far-field speech input and interaction are needed, such as smart appliances, smart homes, in-vehicle systems, and wearable devices, and is particularly suitable for environments with complex acoustic noise and interference.

Description

A multi-channel speech enhancement method with selective attention based on a semantic prior

Technical field

The present invention relates to the field of speech processing, and in particular to a multi-channel speech enhancement method with selective attention based on a semantic prior.

Background art

As voice communication and human-machine voice interaction systems become increasingly widespread, people increasingly expect to do away with cumbersome equipment such as microphones and earphones and to communicate with machines by voice as naturally as in human conversation. However, speech is a sound wave, and it is affected in various ways as it propagates through air, for example by attenuation of the sound wave, by multiple reflections from walls and obstacles (reverberation), and by other sound sources and background noise that are present at the same time. When multiple voice systems and multiple speakers share the same environment, ensuring that a system correctly receives the voice information further determines whether voice systems can become practical. Speech enhancement is an effective means of extracting the target speech signal in complex noise environments, and is divided into single-channel speech enhancement and multi-channel speech enhancement.

Single-channel speech enhancement mainly exploits the difference between the time-frequency distributions of speech and noise to remove noise. Its two key problems are noise estimation and a priori signal-to-noise ratio (SNR) estimation; the former is the key factor in reducing noise, while the latter determines the level of residual "musical noise". Single-channel enhancement algorithms can significantly improve the SNR in many cases, and are especially effective at removing stationary noise (white noise, vehicle noise).
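
To make the two estimation problems concrete, the following minimal Python sketch computes a per-frame spectral gain for a single frequency bin from a noise power estimate and an a priori SNR estimate. The decision-directed update, the Wiener-type gain, and all names and default values (`decision_directed_gains`, `alpha`, `gain_floor`) are assumptions chosen for illustration; the text does not name a particular estimator.

```python
from typing import List, Sequence


def decision_directed_gains(
    noisy_power: Sequence[float],
    noise_power: float,
    alpha: float = 0.98,
    gain_floor: float = 0.1,
) -> List[float]:
    """Per-frame gains for one frequency bin (illustrative sketch only).

    noisy_power: |Y(n)|^2 for successive frames of the noisy signal.
    noise_power: estimated noise power for this bin (assumed stationary here).
    """
    if noise_power <= 0.0:
        raise ValueError("noise_power must be positive")
    gains = []
    prev_clean_power = 0.0  # gain-weighted power of the previous frame
    for power in noisy_power:
        posterior_snr = power / noise_power
        # Decision-directed a priori SNR: blend the previous clean-speech
        # estimate with the instantaneous (posterior SNR - 1) estimate.
        prior_snr = (alpha * prev_clean_power / noise_power
                     + (1.0 - alpha) * max(posterior_snr - 1.0, 0.0))
        # Wiener-type gain; the floor limits residual "musical noise".
        gain = max(prior_snr / (1.0 + prior_snr), gain_floor)
        prev_clean_power = gain * gain * power
        gains.append(gain)
    return gains
```

In this sketch the smoothing factor `alpha` trades noise reduction against the amount of residual musical noise, which is the dependency the paragraph above describes.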

Multi-channel speech enhancement exploits the ability of a microphone array to pick up spatial information, and can combine time-domain, frequency-domain, and spatial information to obtain spatially selective reception. In general, multi-channel speech enhancement needs prior information on the azimuth of the incoming wave in order to form a reliable steering vector; it then applies spatial filtering theory to suppress interfering sound from non-target directions. Compared with single-channel speech enhancement, multi-channel speech enhancement has better noise suppression capability.

The reason human hearing can cope with multiple sound sources and reverberation, and can even detect and track the voice of interest when several people speak, is mainly that human hearing has a specific selective attention ability. When a person is interested in a certain target sound, he or she can, according to the specific task and environment, select the features that best distinguish the target sound from ambient sound, compare and screen them against prior knowledge, exclude interfering sounds, and obtain the target sound.

In voice applications, the noise and interference that may be present in real scenes such as homes, vehicles, and outdoors are diverse. Existing speech enhancement or separation methods all find it extremely difficult to pick up the target speech without distortion while simultaneously eliminating or suppressing non-target signals, especially when multiple coherent sound sources are present at the same time, reverberation is strong, and the signal-to-noise ratio is low.

Multi-channel (microphone array) speech enhancement uses the amplitude and phase differences between the signals received by multiple microphones to form spatial selectivity toward the signal from the target direction, so that beamforming (Beamforming, BM) and directional speech activity detection (Directive Speech Activity Detection, DSAD) algorithms point at the target direction and thereby suppress or reject interfering signals from non-target directions. However, the direction of arrival (DOA) of the target sound source cannot be known in advance. Under a single-source assumption, the DOA of the target sound source can be determined with sound source localization (Source Location, SL) techniques, but in real application environments this assumption is hard to satisfy. In most cases several sound sources are present at the same time, and their number is unknown. In a reverberant field with room reflections the situation is even more complicated, leaving the target sound source with excessive noise.

Summary of the invention

The object of the present invention is to overcome the above drawbacks of current multi-channel speech enhancement methods. By combining semantics-based sound source identification with signal-processing-based sound source localization, and incorporating the "spatial filtering" property of microphone arrays, it proposes a multi-channel speech enhancement method with selective attention based on a semantic prior that can effectively overcome noise and interference.

To achieve the above object, the present invention provides a multi-channel speech enhancement method with selective attention based on a semantic prior. The method comprises: a multi-microphone array picks up speech signals from any direction in a reverberant environment, and the multi-channel speech signal is acquired and pre-processed; an activation word speech recognition model is used to detect a specific activation word present in the pre-processed speech signal; the unsegmented signal containing the activation word segment is processed to obtain the complete activation word segment; the activation word segment is processed with a reverberation-robust multi-channel phase-difference sound source localization method to obtain the direction of arrival of the target sound source; the speech from that direction is enhanced, and noise from other directions and room reverberation in far-field scenarios are suppressed, yielding the enhanced speech of the target direction.

In the above technical solution, the method specifically includes:

Step 1) a multi-microphone array picks up speech signals from any direction in a reverberant environment and acquires the multi-channel speech signal;

Step 2) the multi-channel speech signal acquired in step 1) is pre-processed;

Step 3) an activation word speech recognition model is used to detect whether a specific activation word is present in the pre-processed speech signal; if the detection result is affirmative, the unsegmented signal containing the activation word segment is retained and the method proceeds to step 4); otherwise, it returns to step 1);

Step 4) voice activity detection is performed on the unsegmented signal containing the activation word segment to obtain the complete activation word segment; the activation word segment is analyzed with a reverberation-robust multi-channel phase-difference sound source localization method to obtain the direction of arrival of the target sound source; the speech from that direction is enhanced, and residual directional noise, diffuse noise from the environment, and room reverberation in far-field scenarios are suppressed, yielding the enhanced speech of the target direction.
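
The control flow of steps 1) to 4) can be sketched as a simple loop. This is only an illustration of the branching described above: the function name `selective_attention_loop`, the `Block` layout, and the three callables passed in are hypothetical placeholders, not components defined by the patent.

```python
from typing import Callable, Iterable, Iterator, List

# Hypothetical layout: one list of samples per microphone channel.
Block = List[List[float]]


def selective_attention_loop(
    blocks: Iterable[Block],
    preprocess: Callable[[Block], Block],
    contains_activation_word: Callable[[Block], bool],
    enhance_target: Callable[[Block], List[float]],
) -> Iterator[List[float]]:
    """Yield enhanced speech only for blocks that contain the activation word."""
    for block in blocks:                           # step 1: acquire multi-channel audio
        cleaned = preprocess(block)                # step 2: pre-processing
        if not contains_activation_word(cleaned):  # step 3: activation word detection
            continue                               # negative result: back to step 1
        yield enhance_target(cleaned)              # step 4: localize and enhance
```

The semantic prior acts as a gate here: localization and enhancement run only on audio that has already been confirmed to contain the activation word.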

In the above technical solution, the specific process of step 2) is: if acoustic echo is present in the multi-channel speech signal, echo cancellation, diffuse background noise suppression, and gain control are applied to the picked-up multi-channel speech signal; otherwise, only diffuse background noise suppression and gain control are applied to the multi-channel speech signal.
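
The branch in step 2) can be illustrated with a small sketch. The text does not say which echo canceller or gain control is used, so the normalized LMS (NLMS) adaptive filter and the shared peak-based gain below are stand-ins chosen for illustration; the names `nlms_echo_cancel`, `preprocess`, `echo_ref`, and all default values are assumptions, and diffuse background noise suppression is left out.

```python
from typing import List, Optional, Sequence


def nlms_echo_cancel(mic: Sequence[float], ref: Sequence[float],
                     taps: int = 8, mu: float = 0.5, eps: float = 1e-8) -> List[float]:
    """Remove the loudspeaker echo from one channel with an NLMS adaptive filter."""
    weights = [0.0] * taps
    history = [0.0] * taps          # most recent reference samples, newest first
    residual = []
    for d, x in zip(mic, ref):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - echo_estimate   # echo-cancelled output sample
        norm = sum(h * h for h in history) + eps
        weights = [w + mu * error * h / norm for w, h in zip(weights, history)]
        residual.append(error)
    return residual


def preprocess(channels: Sequence[Sequence[float]],
               echo_ref: Optional[Sequence[float]] = None,
               target_peak: float = 0.5) -> List[List[float]]:
    """Echo cancellation only when a loudspeaker reference exists, then gain control."""
    if echo_ref is not None:
        channels = [nlms_echo_cancel(ch, echo_ref) for ch in channels]
    # Diffuse background noise suppression would go here (omitted in this sketch).
    peak = max((abs(s) for ch in channels for s in ch), default=0.0)
    gain = target_peak / peak if peak > 0.0 else 1.0
    # One gain shared by all channels keeps inter-channel level differences intact.
    return [[s * gain for s in ch] for ch in channels]
```

A shared gain is used because later localization relies on differences between the channels.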

In the above technical solution, the specific process in step 3) of using the activation word speech recognition model to detect whether a specific activation word is present in the pre-processed speech signal is: an activation word speech recognition model, speaker-dependent or speaker-independent, is trained from a large amount of prior activation word data or from the data of a specific speaker; the activation word content is detected with a recognition decoding strategy and a confidence score is computed to complete the classification decision; speech recognition and a keyword retrieval algorithm are combined to realize detection of the activation word.
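
The confidence-based decision can be illustrated as follows. The source does not define how the confidence is computed, so this sketch assumes a common choice: the average per-frame log-likelihood ratio between the activation word model and a background ("filler") model, compared against a threshold. The function names, the filler model, and the threshold value are all assumptions.

```python
from typing import Sequence


def activation_confidence(keyword_loglik: Sequence[float],
                          filler_loglik: Sequence[float]) -> float:
    """Average per-frame log-likelihood ratio (assumed confidence measure)."""
    if not keyword_loglik or len(keyword_loglik) != len(filler_loglik):
        raise ValueError("need two non-empty score sequences of equal length")
    ratios = [k - f for k, f in zip(keyword_loglik, filler_loglik)]
    return sum(ratios) / len(ratios)


def is_activation_word(keyword_loglik: Sequence[float],
                       filler_loglik: Sequence[float],
                       threshold: float = 0.5) -> bool:
    """Classification decision: accept when the confidence reaches the threshold."""
    return activation_confidence(keyword_loglik, filler_loglik) >= threshold
```

Under this assumption, raising `threshold` lowers false activations at the cost of more missed activation words.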

In the above technical solution, step 4) specifically includes:

Step 4-1) the start point and end point of the activation word are detected by voice activity detection, obtaining the complete multi-channel activation word segment;

Step 4-2) the activation word segment is analyzed with the reverberation-robust multi-channel phase-difference sound source localization method; the direction-of-arrival information of the target sound source is obtained, that is, the direction of the target speaker who uttered the specific semantic content; the speech from that direction is enhanced according to the direction-of-arrival information;

Step 4-3) multi-channel post-filtering is used to further suppress residual directional noise, diffuse noise from the environment, and room reverberation in far-field scenarios, yielding the enhanced speech of the target direction.
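
Step 4-1) can be illustrated with a deliberately simple endpoint detector. The source does not specify the voice activity detection algorithm, so the frame-energy threshold below is an assumption, as are the function names and the default `frame_len` and `threshold`. The endpoints are found on the first channel and applied to all channels so that the segment stays time-aligned across microphones.

```python
from typing import List, Optional, Sequence, Tuple


def frame_energies(signal: Sequence[float], frame_len: int) -> List[float]:
    """Mean energy of consecutive non-overlapping frames."""
    n_frames = len(signal) // frame_len
    return [
        sum(s * s for s in signal[i * frame_len:(i + 1) * frame_len]) / frame_len
        for i in range(n_frames)
    ]


def find_endpoints(signal: Sequence[float], frame_len: int = 160,
                   threshold: float = 1e-3) -> Optional[Tuple[int, int]]:
    """Return (start, end) sample indices of the active region, or None."""
    energies = frame_energies(signal, frame_len)
    active = [i for i, e in enumerate(energies) if e > threshold]
    if not active:
        return None
    return active[0] * frame_len, (active[-1] + 1) * frame_len


def cut_activation_segment(channels: Sequence[Sequence[float]], frame_len: int = 160,
                           threshold: float = 1e-3) -> List[List[float]]:
    """Cut every channel with the endpoints found on the first channel."""
    span = find_endpoints(channels[0], frame_len, threshold)
    if span is None:
        return []
    start, end = span
    return [list(ch[start:end]) for ch in channels]
```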

In the above technical solution, step 4-2) specifically includes:

Step 4-2-1) the activation word segment is transformed to the time-frequency domain, and at each frequency bin the coherent part and the incoherent part of the signal are tracked separately;

Step 4-2-2) the time-frequency bins dominated by the direct sound are counted;

Step 4-2-3) within the time-frequency bins dominated by the direct sound, the distribution of the signal time difference of arrival is obtained in the low-frequency part that is free of spatial aliasing;

Step 4-2-4) in the high-frequency part, the time-difference-of-arrival information obtained at low frequencies is used to remove the influence of spatial aliasing, giving the time-difference-of-arrival information of the full band; the direction-of-arrival information is then obtained;

Step 4-2-5) the speech from that direction is enhanced according to the direction-of-arrival information.
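
Steps 4-2-3) and 4-2-4) can be illustrated for a single microphone pair. This is a minimal sketch, not the patented localization method: it assumes the direct-sound-dominated bins have already been selected, that `phases[k]` is the wrapped inter-microphone phase at `freqs_hz[k]` with the convention phase = 2*pi*f*tau, a far-field source, and a speed of sound of 343 m/s. The function names are hypothetical.

```python
import math
from typing import Sequence


def tdoa_from_phases(freqs_hz: Sequence[float], phases: Sequence[float],
                     mic_distance_m: float, c: float = 343.0) -> float:
    """Estimate the time difference of arrival (seconds) from wrapped phases."""
    tau_max = mic_distance_m / c
    # Below this frequency |2*pi*f*tau| < pi for every physically possible delay,
    # so the wrapped phase is unambiguous (no spatial aliasing).
    f_alias = 1.0 / (2.0 * tau_max)
    low = [p / (2.0 * math.pi * f)
           for f, p in zip(freqs_hz, phases) if 0.0 < f < f_alias]
    if not low:
        raise ValueError("no alias-free low-frequency bins available")
    tau_low = sum(low) / len(low)
    taus = []
    for f, p in zip(freqs_hz, phases):
        if f <= 0.0:
            continue
        # Pick the 2*pi multiple that brings the wrapped phase closest to the
        # phase predicted by the low-frequency estimate.
        m = round((2.0 * math.pi * f * tau_low - p) / (2.0 * math.pi))
        taus.append((p + 2.0 * math.pi * m) / (2.0 * math.pi * f))
    return sum(taus) / len(taus)


def doa_from_tdoa(tau: float, mic_distance_m: float, c: float = 343.0) -> float:
    """Convert a delay to an angle in degrees relative to broadside (far field)."""
    s = max(-1.0, min(1.0, c * tau / mic_distance_m))
    return math.degrees(math.asin(s))
```

The full-band average uses many more bins than the alias-free band alone, which is the benefit of unwrapping the high-frequency phases instead of discarding them.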

In the above technical solution, there are two ways of enhancing the speech in step 4-2-5):

First way: according to the direction-of-arrival information, the speech from the known direction is enhanced with a beamforming method, suppressing coherent sound sources from other directions;

Second way: spatial target speech signal detection is performed using the known direction, accepting speech from the target region and rejecting sound sources from other directions.
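
As an illustration of the first way, the sketch below steers a uniform linear array toward a known direction with a plain delay-and-sum beamformer, operating on one frequency bin. Delay-and-sum is used here only because it is the simplest beamformer; the array geometry, the far-field assumption, the speed of sound, and the function names are assumptions for illustration.

```python
import cmath
import math
from typing import List, Sequence


def steering_vector(freq_hz: float, doa_deg: float, n_mics: int,
                    spacing_m: float, c: float = 343.0) -> List[complex]:
    """Far-field steering vector of a uniform linear array for one frequency."""
    delay = spacing_m * math.sin(math.radians(doa_deg)) / c  # delay between neighbours
    return [cmath.exp(-2j * math.pi * freq_hz * m * delay) for m in range(n_mics)]


def delay_and_sum(bin_values: Sequence[complex], steer: Sequence[complex]) -> complex:
    """Phase-align the channels toward the steered direction and average them."""
    return sum(s.conjugate() * x for s, x in zip(steer, bin_values)) / len(steer)
```

A source in the steered direction passes with unit gain, while sources from other directions add out of phase and are attenuated.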

The advantages of the present invention are:

1. The method of the present invention can be used wherever far-field speech input and interaction are needed, such as smart appliances, smart homes, in-vehicle systems, and wearable devices, and is particularly suitable for environments with complex acoustic noise and interference;

2. The method of the present invention can selectively pick up the target signal under far-field hands-free conditions while suppressing interference and noise.

Brief description of the drawings

Fig. 1 is a flow chart of the multi-channel speech enhancement method with selective attention based on a semantic prior according to the present invention;

Fig. 2 is a flow chart of spatial target speech signal detection using a known direction according to the present invention.

Detailed description of the embodiments

Many features distinguish the target speech from other sounds. To make full use of such features for detection, priority should be given to the features for which prior knowledge is most plentiful and most reliable. For example, when a loudspeaker is playing sound, sound correlated with the loudspeaker signal can be regarded as echo interference; if the semantic content of the target speech is known, the semantics are an obvious distinguishing feature; if the direction of arrival (Direction of Arrival, DOA) of the target speech is known, detecting the DOA information can be used to remove a large amount of unrelated sound. By detecting and comparing the various distinguishing cues, the influence of interfering sounds can finally be suppressed and the target speech segment can be screened out of the mixed sound.

The present invention will now be described in detail with reference to the accompanying drawings.

As shown in Fig. 1, a multi-channel speech enhancement method with selective attention based on a semantic prior includes:

Step 1) a multi-microphone array picks up speech signals from any direction in a reverberant environment and acquires the multi-channel speech signal;

Step 2) the multi-channel speech signal acquired in step 1) is pre-processed;

If acoustic echo is present in the speech signal, echo cancellation, diffuse background noise suppression, and gain control are applied to the picked-up multi-channel speech signal; otherwise, only diffuse background noise suppression and the necessary gain control are applied to the multi-channel speech signal;

Step 3) an activation word speech recognition model is used to detect whether a specific activation word is present in the pre-processed speech signal; if the detection result is affirmative, the unsegmented signal containing the activation word segment is retained and the method proceeds to step 4); otherwise, it returns to step 1);

An activation word speech recognition model, speaker-dependent or speaker-independent, is trained from a large amount of prior activation word data or from the data of a specific speaker; the activation word content is detected with a recognition decoding strategy and a confidence score is computed to complete the classification decision; speech recognition and a keyword retrieval algorithm are combined to realize detection of the activation word.

Step 4) speech enhancement is performed on the unsegmented signal containing the activation word segment; this specifically includes:

Step 4-1) the start point and end point of the activation word are detected by voice activity detection (VAD: Voice Activity Detection), obtaining the complete multi-channel activation word segment;

Step 4-2) the activation word segment is analyzed with the reverberation-robust multi-channel phase-difference sound source localization method; the DOA information of the target sound source is obtained, that is, the direction of the target speaker who uttered the specific semantic content; this specifically includes:

Step 4-2-1) the activation word segment is transformed to the time-frequency domain, and at each frequency bin the coherent part and the incoherent part of the signal are tracked separately;

Step 4-2-2) the time-frequency bins dominated by the direct sound are counted;

Step 4-2-3) within the time-frequency bins dominated by the direct sound, the distribution of the time difference of arrival (TDOA: Time Difference Of Arrival) is obtained in the low-frequency part that is free of spatial aliasing;

Step 4-2-4) in the high-frequency part, the TDOA information obtained at low frequencies is used to remove the influence of spatial aliasing, giving the TDOA of the full-band signal; the DOA information is then obtained;

Step 4-2-5) the speech from the known direction is enhanced according to the DOA information; in step 4-2-5) there are two ways of enhancing the speech from the known direction:

First way: according to the DOA information, the speech from the known direction is enhanced with a beamforming method, suppressing coherent sound sources from other directions;

In the present embodiment, a multi-channel minimum variance distortionless response beamforming method based on diagonal loading (Diagonal Loading) is used to suppress coherent sound sources from other directions. In other embodiments, directional interference may also be suppressed with a sub-band-based blind source separation (Blind Source Separation) technique.
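
The diagonally loaded minimum variance distortionless response (MVDR) weights for one frequency bin are w = (R + δI)⁻¹ d / (dᴴ (R + δI)⁻¹ d), where R is the spatial covariance matrix, d is the steering vector toward the target DOA, and δ is the loading factor. The sketch below is restricted to two microphones so that the matrix inverse can be written in closed form with the standard library only; the function names and the default `loading` value are assumptions, and how R is estimated is outside the sketch.

```python
from typing import List, Sequence


def mvdr_weights_2ch(R: Sequence[Sequence[complex]], d: Sequence[complex],
                     loading: float = 1e-3) -> List[complex]:
    """MVDR weights for a 2-microphone array with diagonal loading.

    R: 2x2 Hermitian spatial covariance matrix for one frequency bin.
    d: steering vector (length 2) toward the target direction.
    """
    # Diagonal loading: add `loading` to the diagonal before inversion.
    a, b = R[0][0] + loading, R[0][1]
    c, e = R[1][0], R[1][1] + loading
    det = a * e - b * c
    r_inv = [[e / det, -b / det], [-c / det, a / det]]
    r_inv_d = [r_inv[0][0] * d[0] + r_inv[0][1] * d[1],
               r_inv[1][0] * d[0] + r_inv[1][1] * d[1]]
    denom = d[0].conjugate() * r_inv_d[0] + d[1].conjugate() * r_inv_d[1]
    return [r_inv_d[0] / denom, r_inv_d[1] / denom]


def apply_weights(w: Sequence[complex], x: Sequence[complex]) -> complex:
    """Beamformer output w^H x for one time-frequency bin."""
    return w[0].conjugate() * x[0] + w[1].conjugate() * x[1]
```

The loading term keeps the inversion well conditioned when R is close to singular, for example when a single coherent interferer dominates the covariance estimate.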

Second way: spatial target speech signal detection (DSAD) is performed using the known direction, accepting speech from the target region and rejecting sound sources from other directions.

As shown in Fig. 2, taking two-channel DSAD as an example, a decision is made for each time-frequency bin using the beam-to-reference energy ratio (Beam-to-Reference Ratio, BRR) and the signal-to-noise ratio (SNR). For the decision threshold of the BRR, a tracking mechanism for the direct-to-reverberant energy ratio (Direct-to-Reverberate Ratio, DRR) is incorporated, so that the detection threshold of each time-frequency bin can adapt to the environment, improving the accuracy of the likelihood estimate at each time-frequency bin. A sidelobe suppression mechanism reduces the influence of high-frequency aliasing, which in turn improves the accuracy of the full-band decision.
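
The per-bin decision can be sketched as follows. The source gives no formulas, so this sketch assumes that the BRR is the ratio of beamformer output power to reference-microphone power and that the SNR is the ratio of reference power to estimated noise power. Fixed thresholds stand in for the DRR-adaptive threshold, and sidelobe suppression is not modeled. All names and default values are assumptions.

```python
import math
from typing import List, Sequence


def dsad_mask(beam_power: Sequence[float], ref_power: Sequence[float],
              noise_power: Sequence[float], brr_threshold: float = 0.5,
              snr_threshold_db: float = 3.0) -> List[bool]:
    """Mark time-frequency bins judged to contain target-direction speech."""
    tiny = 1e-12  # guards against division by zero and log of zero
    mask = []
    for pb, pr, pn in zip(beam_power, ref_power, noise_power):
        brr = pb / max(pr, tiny)  # assumed definition of the BRR
        snr_db = 10.0 * math.log10(max(pr, tiny) / max(pn, tiny))
        mask.append(brr >= brr_threshold and snr_db >= snr_threshold_db)
    return mask
```

A bin is accepted only when most of its energy survives the beam steered at the target and it stands clearly above the noise estimate.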

Step 4-3) multi-channel post-filtering is used to further suppress residual directional noise, diffuse noise from the environment, and room reverberation in far-field scenarios, yielding the enhanced speech.

Claims

1. A multi-channel speech enhancement method with selective attention based on a semantic prior, the method comprising: a multi-microphone array picks up speech signals from any direction in a reverberant environment, and the multi-channel speech signal is acquired and pre-processed; an activation word speech recognition model is used to detect a specific activation word present in the pre-processed speech signal; the unsegmented signal containing the activation word segment is processed to obtain the complete activation word segment; the activation word segment is analyzed with a reverberation-robust multi-channel phase-difference sound source localization method to obtain the direction of arrival of the target sound source; the speech from that direction is enhanced, and noise from other directions and room reverberation in far-field scenarios are suppressed, yielding the enhanced speech of the target direction;

The method specifically includes:

Step 1) a multi-microphone array picks up speech signals from any direction in a reverberant environment and acquires the multi-channel speech signal;

Step 2) the multi-channel speech signal acquired in step 1) is pre-processed;

Step 3) an activation word speech recognition model is used to detect whether a specific activation word is present in the pre-processed speech signal; if the detection result is affirmative, the unsegmented signal containing the activation word segment is retained and the method proceeds to step 4); otherwise, it returns to step 1);

Step 4) voice activity detection is performed on the unsegmented signal containing the activation word segment to obtain the complete activation word segment; the activation word segment is analyzed with the reverberation-robust multi-channel phase-difference sound source localization method to obtain the direction of arrival of the target sound source; the speech from that direction is enhanced, and residual directional noise, diffuse noise from the environment, and room reverberation in far-field scenarios are suppressed, yielding the enhanced speech of the target direction;

In step 3), the specific process of using the activation word speech recognition model to detect whether a specific activation word is present in the pre-processed speech signal is: an activation word speech recognition model, speaker-dependent or speaker-independent, is trained from a large amount of prior activation word data or from the data of a specific speaker; the activation word content is detected with a recognition decoding strategy and a confidence score is computed to complete the classification decision; speech recognition and a keyword retrieval algorithm are combined to realize detection of the activation word.

2. The multi-channel speech enhancement method with selective attention based on a semantic prior according to claim 1, characterized in that the specific process of step 2) is: if acoustic echo is present in the multi-channel speech signal, echo cancellation, diffuse background noise suppression, and gain control are applied to the picked-up multi-channel speech signal; otherwise, only diffuse background noise suppression and gain control are applied to the multi-channel speech signal.

3. The multi-channel speech enhancement method with selective attention based on a semantic prior according to claim 1, characterized in that step 4) specifically includes:

Step 4-1) the start point and end point of the activation word are detected by voice activity detection, obtaining the complete multi-channel activation word segment;

Step 4-2) the activation word segment is analyzed with the reverberation-robust multi-channel phase-difference sound source localization method; the direction-of-arrival information of the target sound source is obtained, that is, the direction of the target speaker who uttered the specific semantic content; the speech from that direction is enhanced according to the direction-of-arrival information;

Step 4-3) multi-channel post-filtering is used to further suppress residual directional noise, diffuse noise from the environment, and room reverberation in far-field scenarios, yielding the enhanced speech of the target direction.

4. The multi-channel speech enhancement method with selective attention based on a semantic prior according to claim 3, characterized in that step 4-2) specifically includes:

Step 4-2-1) the activation word segment is transformed to the time-frequency domain, and at each frequency bin the coherent part and the incoherent part of the signal are tracked separately;

Step 4-2-2) the time-frequency bins dominated by the direct sound are counted;

Step 4-2-3) within the time-frequency bins dominated by the direct sound, the distribution of the signal time difference of arrival is obtained in the low-frequency part that is free of spatial aliasing;

Step 4-2-4) in the high-frequency part, the time-difference-of-arrival information obtained at low frequencies is used to remove the influence of spatial aliasing, giving the time-difference-of-arrival information of the full band; the direction-of-arrival information is then obtained;

5. The multi-channel speech enhancement method with selective attention based on a semantic prior according to claim 4, characterized in that there are two ways of enhancing the speech in step 4-2-5):

First way: according to the direction-of-arrival information, the speech from the known direction is enhanced with a beamforming method, suppressing coherent sound sources from other directions;

Second way: spatial target speech signal detection is performed using the known direction, accepting speech from the target region and rejecting sound sources from other directions.