CN109523999B - Front-end processing method and system for improving far-field speech recognition


Info

Publication number
CN109523999B
CN109523999B
Authority
CN
China
Prior art keywords
signal
time
reverberation
energy
target
Prior art date
Legal status
Active
Application number
CN201811602419.9A
Other languages
Chinese (zh)
Other versions
CN109523999A (en)
Inventor
Li Junfeng (李军锋)
Gao Fei (高飞)
Yan Yonghong (颜永红)
Current Assignee
Institute of Acoustics CAS
Original Assignee
Institute of Acoustics CAS
Priority date
Filing date
Publication date
Application filed by Institute of Acoustics CAS filed Critical Institute of Acoustics CAS
Priority to CN201811602419.9A priority Critical patent/CN109523999B/en
Publication of CN109523999A publication Critical patent/CN109523999A/en
Application granted granted Critical
Publication of CN109523999B publication Critical patent/CN109523999B/en


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0272 Voice signal separating
    • G10L2021/02082 Noise filtering, the noise being echo, reverberation of the speech

Abstract

The application provides a front-end processing method and system for improving far-field speech recognition. The method comprises the following steps: analyzing the room impulse response signal to obtain the time point dividing the early reverberation signal from the late reverberation signal, and intercepting the direct sound signal and the early reverberation signal; convolving the direct sound signal and the early reverberation signal with a clean speech signal from a speech library in the time domain to obtain a time-domain target signal; calculating the energies of the time-domain target signal and of the components of the time-domain mixed signal other than the target signal to obtain the target signal energy and the other-signal energy, and obtaining an ideal ratio mask from these two energies; and, after converting the time-domain mixed signal into a frequency-domain mixed signal, multiplying the magnitude of the frequency-domain mixed signal by the ideal ratio mask and combining the result with the phase of the frequency-domain mixed signal to obtain a reconstructed signal. The invention thereby separates the target signal from the mixed speech under noisy, reverberant conditions by masking the signal magnitude with the ideal ratio mask.

Description

Front-end processing method and system for improving far-field speech recognition
Technical Field
The present invention relates to the field of audio signal processing, and in particular, to a front-end processing method and system for improving far-field speech recognition.
Background
With the continuous development of speech technology, voice interaction has become widespread, with applications ranging from national defense to the home. More and more applications, such as smart homes and service robots, are built on speech recognition. In real voice-interaction scenarios, however, background noise and room reverberation interfere with the propagation of speech; this interference degrades speech quality and speech intelligibility and is particularly harmful to speech recognition. Separating speech from these disturbances is therefore especially important for speech recognition.
Based on research into the auditory masking phenomenon, the Ideal Binary Mask (IBM) was proposed to separate target speech from noisy speech. Its main idea is to keep, by means of a local threshold, the time-frequency units in which the target signal energy exceeds the noise energy and to discard the remaining units. Many studies have shown that the IBM can improve speech intelligibility and speech quality. The Ideal Ratio Mask (IRM) can be regarded as a soft-decision version of the IBM; it retains more of the speech information and performs better for speech recognition. In a noisy environment the IRM is computed from the energy ratio of clean speech to noisy speech. When the scenario changes to a noisy reverberant environment, current practice still applies this noise-only approach; however, noise is additive whereas reverberation is convolutive (multiplicative in the frequency domain) and consists of direct sound, early reflections and late reverberation, so handling reverberation in this way is clearly unreasonable.
The Room Impulse Response (RIR) is commonly used to describe the reverberation characteristics of a room. Studies show that the direct sound and early reflections in the room impulse response are the components beneficial to human hearing, and some studies take the direct sound together with the first 50 ms of early reverberation as the target speech; experimental results show that such masking can effectively improve speech intelligibility and speech quality under noisy reverberant conditions. However, reverberation varies with the acoustic characteristics of the room environment, early reflections of different lengths affect speech intelligibility differently, and truncating at a fixed 50 ms regardless of reverberation time does not perform well.
Disclosure of Invention
In order to solve the above problems, the present invention provides a front-end processing method and system for improving far-field speech recognition.
In order to achieve the above purpose, the embodiments of the present application adopt the following technical solutions:
In a first aspect, the present application provides a front-end processing method for improving far-field speech recognition, including: analyzing the room impulse response signal to obtain the time point dividing the early reverberation signal from the late reverberation signal, and intercepting the direct sound signal and the early reverberation signal, the room impulse response signal consisting, in order, of the direct sound signal, the early reverberation signal and the late reverberation signal; convolving the direct sound signal and the early reverberation signal with a clean speech signal from a speech library in the time domain to obtain a time-domain target signal; calculating the energies of the time-domain target signal and of the components of the time-domain mixed signal other than the target signal to obtain the target signal energy and the other-signal energy, and obtaining an ideal ratio mask from these two energies, the time-domain mixed signal being obtained by convolving the room impulse response signal with the speech in the speech library in the time domain and then adding a noise signal; and, after converting the time-domain mixed signal into a frequency-domain mixed signal, multiplying the magnitude of the frequency-domain mixed signal by the ideal ratio mask and combining the result with the phase of the frequency-domain mixed signal to obtain a reconstructed signal.
In another possible implementation, calculating the room impulse response signal to obtain the time point dividing the early reverberation signal from the late reverberation signal includes: determining the split time point of the early and late reverberation signals by computing an echo density function of the room impulse response signal, the normalized echo density NED being defined as:
NED = (1 / erfc(1/√2)) · Σ_l ω(l) · 1{ |h(l)| > δ }
where erfc(1/√2) ≈ 0.3173 is the fraction of samples expected to lie more than one standard deviation from the mean of a Gaussian distribution, 1{·} is an indicator function that returns 1 when its argument is true and 0 otherwise, ω(l) is a weight function, h(l) is a sample of the room impulse response within the current window, and δ is the standard deviation of the room impulse response signal in the current window. As the reverberation changes from early to late reverberation, NED rises from 0 towards 1, and the split time between the early and late reverberation signals is defined as the moment when the NED of the late reverberation signal becomes arbitrarily close to 1.
In another possible implementation, calculating the room impulse response signal to obtain the time point dividing the early reverberation signal from the late reverberation signal includes: calculating the split time point of the early and late reverberation signals by means of the kurtosis, the fourth-order moment of a statistical process, based on the assumption that the late reverberation forms a diffuse field. The (excess) kurtosis γ4 is defined as:
γ4 = E[(x - μ)^4] / δ^4 - 3
where E denotes expectation over the impulse response x being processed, μ is its mean and δ is its standard deviation; the split time is defined as the instant at which the kurtosis computed in a sliding window reaches zero.
In another possible implementation, the calculating the room impulse response signal to obtain the divided time points of the early reverberation signal and the late reverberation signal includes: calculating the division time point of the early reverberation signal and the late reverberation signal according to the room characteristics, wherein the time t is defined as:
[equation not reproduced in this text: t is defined in terms of the room volume V and the room surface area S]
where V and S are the volume of the room and the surface area of the room, respectively.
In another possible implementation, calculating the energies of the time-domain target signal and of the components of the time-domain mixed signal other than the target signal to obtain the target signal energy and the other-signal energy, and obtaining the ideal ratio mask from these energies, specifically includes: performing a Fourier transform on the time-domain target signal and on the other signals, respectively, and computing the target signal energy and the other-signal energy; and substituting the target signal energy and the other-signal energy into the ideal ratio mask formula to obtain the ideal ratio mask, the ideal ratio mask IRM(k, l) being:
IRM(k, l) = D(k, l) / (D(k, l) + R(k, l))
where D(k, l) is the target signal energy, R(k, l) is the energy of the components of the mixed signal other than the target signal, k is the frequency-band index and l is the frame index.
In another possible implementation, the time-domain mixed signal is obtained by convolving the room impulse response signal with the speech in the speech library in the time domain and then adding a noise signal; the time-domain mixed signal is generated as:
m(t) = s(t) * h(t) + n(t)
where * denotes time-domain convolution, s(t) is the clean speech signal, h(t) is the room impulse response signal, n(t) is the noise signal and t is the time index.
In another possible implementation, converting the time-domain mixed signal into a frequency-domain mixed signal, multiplying the magnitude of the frequency-domain mixed signal by the ideal ratio mask and then using the phase of the frequency-domain mixed signal to obtain a reconstructed signal specifically includes:
performing a short-time Fourier transform on the time-domain mixed signal to obtain the frequency-domain mixed signal;
and multiplying the magnitude of the frequency-domain mixed signal by the ideal ratio mask and combining the result with the phase of the frequency-domain mixed signal to obtain the reconstructed signal, the reconstructed signal s′(t) being computed as:
s′(t) = istft{ M(k, l) × IRM(k, l) × exp[ j∠M(k, l) ] }
where istft denotes the inverse short-time Fourier transform, M(k, l) is the magnitude of the frequency-domain mixed signal, ∠M(k, l) is its phase, k is the frequency-band index and l is the frame index.
In a second aspect, the present application provides a front-end processing system for improving far-field speech recognition, comprising: an interception unit, configured to analyze the room impulse response signal to obtain the time point dividing the early reverberation signal from the late reverberation signal and to intercept the direct sound signal and the early reverberation signal, the room impulse response signal consisting, in order, of the direct sound signal, the early reverberation signal and the late reverberation signal; a first generating unit, configured to convolve the direct sound signal and the early reverberation signal with a clean speech signal from a speech library in the time domain to obtain a time-domain target signal; a second generating unit, configured to calculate the energies of the time-domain target signal and of the components of the time-domain mixed signal other than the target signal to obtain the target signal energy and the other-signal energy, and to obtain an ideal ratio mask from these two energies, the time-domain mixed signal being obtained by convolving the room impulse response signal with the speech in the speech library in the time domain and then adding a noise signal; and a third generating unit, configured to convert the time-domain mixed signal into a frequency-domain mixed signal, multiply the magnitude of the frequency-domain mixed signal by the ideal ratio mask, and then use the phase of the frequency-domain mixed signal to obtain a reconstructed signal.
In another possible implementation, the second generating unit is specifically configured to perform a Fourier transform on the time-domain target signal and on the other signals, respectively, and to compute the target signal energy and the other-signal energy; and to substitute the target signal energy and the other-signal energy into the ideal ratio mask formula to obtain the ideal ratio mask, the ideal ratio mask IRM(k, l) being:
IRM(k, l) = D(k, l) / (D(k, l) + R(k, l))
where D(k, l) is the target signal energy, R(k, l) is the energy of the components of the mixed signal other than the target signal, k is the frequency-band index and l is the frame index.
In another possible implementation, the third generating unit is specifically configured to:
perform a short-time Fourier transform on the time-domain mixed signal to obtain the frequency-domain mixed signal;
and multiply the magnitude of the frequency-domain mixed signal by the ideal ratio mask and combine the result with the phase of the frequency-domain mixed signal to obtain the reconstructed signal, the reconstructed signal s′(t) being computed as:
s′(t) = istft{ M(k, l) × IRM(k, l) × exp[ j∠M(k, l) ] }
where istft denotes the inverse short-time Fourier transform, M(k, l) is the magnitude of the frequency-domain mixed signal, ∠M(k, l) is its phase, k is the frequency-band index and l is the frame index.
The invention analyzes room impulse response signals with different acoustic characteristics to intercept the early reverberation signal, combines the intercepted signal with ideal ratio masking, and applies the resulting mask to the mixed speech signal to obtain a reconstructed signal, thereby separating the target signal from the mixed speech under noisy, reverberant conditions by masking the signal magnitude with the ideal ratio mask.
Drawings
The drawings that accompany the detailed description can be briefly described as follows.
Fig. 1 is a flowchart of a front-end processing method for improving far-field speech recognition according to an embodiment of the present disclosure;
Fig. 2 is a schematic diagram of the composition of a room impulse response signal provided in an embodiment of the present application;
fig. 3 is a block diagram of a front-end processing system for improving far-field speech recognition according to an embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
Fig. 1 is a flowchart of a front-end processing method for improving far-field speech recognition according to an embodiment of the present disclosure. The front-end processing method for improving far-field speech recognition shown in fig. 1 specifically comprises the following implementation steps:
step S102, the room impulse response signal is calculated to obtain the division time points of the early reverberation signal and the late reverberation signal, and the direct sound signal and the early reverberation signal are intercepted.
Preferably, as shown in fig. 2, the room impulse response signal in the present application consists of the direct sound, the early reverberation and the late reverberation. The direct sound and the early reverberation are the components beneficial to human hearing; accordingly, the present application mainly obtains the time point dividing the early reverberation from the late reverberation and then intercepts the direct sound and the early reverberation.
Specifically, after the room impulse response signal is up-sampled to a certain frequency, the division time points of the early reverberation signal and the late reverberation signal of the room impulse response signal are calculated, and then the direct sound signal and the early reverberation signal are intercepted.
The room impulse response signal is upsampled to a certain frequency so that the end of the early reverberation can be located with finer time resolution.
Preferably, the present application upsamples the room impulse response signal to 48kHz, although other frequencies are possible.
In one embodiment, the split time point of the early and late reverberation signals of the room impulse response signal is determined by computing an echo density function of the room impulse response signal, the normalized echo density NED being defined as:
NED = (1 / erfc(1/√2)) · Σ_l ω(l) · 1{ |h(l)| > δ }
where erfc(1/√2) ≈ 0.3173 is the fraction of samples expected to lie more than one standard deviation from the mean of a Gaussian distribution, 1{·} is an indicator function that returns 1 when its argument is true and 0 otherwise, ω(l) is a weight function, h(l) is a sample of the room impulse response within the current window, and δ is the standard deviation of the room impulse response signal in the current window.
As the reverberation changes from early to late reverberation, NED rises from 0 towards 1, and the split time between the early and late reverberation signals is defined as the moment when the NED of the late reverberation signal becomes arbitrarily close to 1.
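The sliding-window computation of this criterion can be prototyped in a few lines. The sketch below is a minimal illustration only and assumes the standard normalized-echo-density formulation (equal window weights, threshold at one standard deviation); the window length, hop size, the 0.9 decision threshold and the function name ned_split_point are illustrative choices rather than values taken from this patent.

import numpy as np
from scipy.special import erfc

def ned_split_point(rir, fs, win_ms=20.0, hop_ms=1.0, threshold=0.9):
    """Estimate the early/late split time (in seconds) of a room impulse
    response from its normalized echo density (NED).

    In each sliding window the fraction of samples whose magnitude exceeds
    the window standard deviation is counted and normalized by
    erfc(1/sqrt(2)), the fraction expected for a Gaussian (fully diffuse)
    signal, so NED rises from ~0 towards 1 as the tail becomes diffuse.
    """
    win = int(win_ms * 1e-3 * fs)
    hop = int(hop_ms * 1e-3 * fs)
    norm = erfc(1.0 / np.sqrt(2.0))              # Gaussian reference fraction
    for start in range(0, len(rir) - win, hop):
        frame = rir[start:start + win]
        sigma = np.std(frame)                    # std dev in current window
        ned = np.mean(np.abs(frame) > sigma) / norm
        if ned >= threshold:                     # tail is (nearly) diffuse
            return (start + win // 2) / fs
    return len(rir) / fs                         # no transition detected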
In one embodiment, the split time point of the early and late reverberation signals is calculated by means of the kurtosis, the fourth-order moment of a statistical process, based on the assumption that the late reverberation forms a diffuse field. The (excess) kurtosis γ4 is defined as:
γ4 = E[(x - μ)^4] / δ^4 - 3
where E denotes expectation over the impulse response x being processed, μ is its mean and δ is its standard deviation;
the split time is defined as the instant at which the kurtosis computed in a sliding window reaches zero.
In one embodiment, the divided time points of the early reverberation signal and the late reverberation signal are calculated by room characteristics, and the time t is defined as:
[equation not reproduced in this text: t is defined in terms of the room volume V and the room surface area S]
where V and S are the volume of the room and the surface area of the room, respectively.
Step S104, convolving the direct sound signal and the early reverberation signal with the clean speech signal in the speech library in the time domain to obtain a time-domain target signal.
Preferably, the speech library used in the present application is Hub5, a corpus of English telephone speech in which recruited speakers were connected by a robot operator and talked freely about a daily-life topic announced by the robot operator at the beginning of the call. The sampling frequency of the speech library is 8000 Hz. Here, a clean speech signal refers to a recording without any further processing.
Specifically, the direct sound signal and the early reverberation signal are down-sampled to the sampling frequency of the voice signal, and then convolved with the clean voice signal in the voice library in the time domain to obtain a time domain target signal.
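The steps above can be strung together as in the sketch below: upsample the RIR for a finer split-point estimate, keep everything up to the split point (direct sound plus early reverberation), resample that segment back to the 8 kHz speech rate, and convolve it with a clean utterance. The 48 kHz and 8 kHz rates follow the text; ned_split_point is the earlier sketch, and truncating the convolution output to the utterance length is an illustrative convenience.

import numpy as np
from scipy.signal import resample_poly, fftconvolve

def make_target_signal(rir, fs_rir, clean, fs_speech=8000, fs_hi=48000):
    """Time-domain target signal: clean speech convolved with the direct-sound
    plus early-reverberation portion of the room impulse response."""
    # 1. Upsample the RIR so the early/late split can be located precisely.
    rir_hi = resample_poly(rir, fs_hi, fs_rir)

    # 2. Locate the split point and keep direct sound + early reverberation.
    t_split = ned_split_point(rir_hi, fs_hi)          # seconds
    early_hi = rir_hi[:int(t_split * fs_hi)]

    # 3. Bring the truncated response down to the speech sampling rate and
    #    convolve with the clean utterance in the time domain.
    early = resample_poly(early_hi, fs_speech, fs_hi)
    return fftconvolve(clean, early)[:len(clean)]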
Step S106, calculating the energies of the time-domain target signal and of the components of the time-domain mixed signal other than the target signal to obtain the target signal energy and the other-signal energy, and obtaining an ideal ratio mask from these two energies.
Preferably, the time-domain mixed signal is obtained by convolving the room impulse response signal with all the speech in the speech library in the time domain and then adding the noise signal. The time-domain mixed signal is generated as:
m(t) = s(t) * h(t) + n(t)
where * denotes time-domain convolution, s(t) is the clean speech signal, h(t) is the room impulse response signal, n(t) is the noise signal and t is the time index.
Here, the noise signal refers to background noise in a real voice-interaction scene. Such noise and room reverberation interfere with the propagation of speech; the interference not only degrades speech quality and intelligibility but also harms speech recognition.
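For completeness, the mixture model m(t) = s(t) * h(t) + n(t) can be synthesized as in the sketch below; scaling the noise to a chosen signal-to-noise ratio is an added convenience that the text does not specify, and the function name is illustrative.

import numpy as np
from scipy.signal import fftconvolve

def make_mixture(clean, rir, noise, snr_db=10.0):
    """Reverberant speech plus additive noise: m(t) = (s * h)(t) + n(t)."""
    reverberant = fftconvolve(clean, rir)[:len(clean)]
    # Assumes the noise recording is at least as long as the utterance.
    noise = noise[:len(reverberant)]
    # Scale the noise so the reverberant-speech-to-noise ratio equals snr_db.
    gain = np.sqrt(np.sum(reverberant ** 2) /
                   (np.sum(noise ** 2) * 10.0 ** (snr_db / 10.0)))
    return reverberant + gain * noise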
Specifically, a Fourier transform is applied separately to the time-domain target signal and to the components of the time-domain mixed signal other than the target signal, and the target signal energy D(k, l) and the other-signal energy R(k, l) are computed; D(k, l) and R(k, l) are then substituted into the ideal ratio mask formula to obtain the ideal ratio mask. The ideal ratio mask IRM(k, l) is:
IRM(k, l) = D(k, l) / (D(k, l) + R(k, l))
where D(k, l) is the target signal energy, R(k, l) is the energy of the components of the mixed signal other than the target signal, k is the frequency-band index and l is the frame index.
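A minimal sketch of the mask computation follows, assuming the "other signals" are simply the mixture minus the target in the time domain and that energies are compared per STFT bin; the STFT parameters, the small epsilon guarding the division and the function name are illustrative.

import numpy as np
from scipy.signal import stft

def ideal_ratio_mask(target, mixture, fs=8000, nperseg=256, noverlap=192):
    """IRM(k, l) = D(k, l) / (D(k, l) + R(k, l)), with D the target energy and
    R the energy of everything else (late reverberation + noise) per bin."""
    residual = mixture[:len(target)] - target           # "other signals"
    _, _, T = stft(target, fs, nperseg=nperseg, noverlap=noverlap)
    _, _, Rspec = stft(residual, fs, nperseg=nperseg, noverlap=noverlap)
    d_energy = np.abs(T) ** 2
    r_energy = np.abs(Rspec) ** 2
    return d_energy / (d_energy + r_energy + 1e-12)     # avoid divide-by-zero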
Step S108, after the time-domain mixed signal is converted into a frequency-domain mixed signal, the magnitude of the frequency-domain mixed signal is multiplied by the ideal ratio mask and the result is combined with the phase of the frequency-domain mixed signal to obtain a reconstructed signal.
Specifically, a short-time Fourier transform is applied to the time-domain mixed signal to obtain the frequency-domain mixed signal; the magnitude of the frequency-domain mixed signal is then multiplied by the ideal ratio mask and combined with the phase of the frequency-domain mixed signal to obtain the reconstructed signal, the reconstructed signal s′(t) being computed as:
s′(t) = istft{ M(k, l) × IRM(k, l) × exp[ j∠M(k, l) ] }
where istft denotes the inverse short-time Fourier transform, M(k, l) is the magnitude of the frequency-domain mixed signal, ∠M(k, l) is its phase, k is the frequency-band index and l is the frame index.
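The reconstruction step then reduces to masking the mixture magnitude, reusing the mixture phase and inverting the STFT, as in the sketch below; the STFT parameters must match those used when computing the mask, and ideal_ratio_mask refers to the earlier sketch.

import numpy as np
from scipy.signal import stft, istft

def reconstruct(mixture, irm, fs=8000, nperseg=256, noverlap=192):
    """s'(t) = istft{ |M(k,l)| * IRM(k,l) * exp(j * angle(M(k,l))) }."""
    _, _, M = stft(mixture, fs, nperseg=nperseg, noverlap=noverlap)
    masked = np.abs(M) * irm * np.exp(1j * np.angle(M))   # keep mixture phase
    _, s_hat = istft(masked, fs, nperseg=nperseg, noverlap=noverlap)
    return s_hat

In use, the output of ideal_ratio_mask(target, mixture) would be passed in as irm, with both functions sharing the same signal length and STFT settings.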
The invention analyzes room impulse response signals with different acoustic characteristics to intercept the early reverberation signal, combines the intercepted signal with ideal ratio masking, and applies the resulting mask to the mixed speech signal to obtain a reconstructed signal, thereby separating the target signal from the mixed speech under noisy, reverberant conditions by masking the signal magnitude with the ideal ratio mask.
Fig. 3 is a block diagram of a front-end processing system for improving far-field speech recognition according to an embodiment of the present disclosure. The front-end processing system shown in fig. 3 comprises: an interception unit 301, a first generating unit 302, a second generating unit 303 and a third generating unit 304.
The interception unit 301 is configured to calculate the room impulse response signal, obtain the time point dividing the early reverberation signal from the late reverberation signal, and intercept the direct sound signal and the early reverberation signal.
After the room impulse response signal is up-sampled to a certain frequency, the division time points of the early reverberation signal and the late reverberation signal of the room impulse response signal are calculated, and then the direct sound signal and the early reverberation signal are intercepted.
In one embodiment, the split time point of the early and late reverberation signals of the room impulse response signal is determined by computing an echo density function of the room impulse response signal, the normalized echo density NED being defined as:
NED = (1 / erfc(1/√2)) · Σ_l ω(l) · 1{ |h(l)| > δ }
where erfc(1/√2) ≈ 0.3173 is the fraction of samples expected to lie more than one standard deviation from the mean of a Gaussian distribution, 1{·} is an indicator function that returns 1 when its argument is true and 0 otherwise, ω(l) is a weight function, h(l) is a sample of the room impulse response within the current window, and δ is the standard deviation of the room impulse response signal in the current window.
As the reverberation changes from early to late reverberation, NED rises from 0 towards 1, and the split time between the early and late reverberation signals is defined as the moment when the NED of the late reverberation signal becomes arbitrarily close to 1.
In one embodiment, the split time point of the early and late reverberation signals is calculated by means of the kurtosis, the fourth-order moment of a statistical process, based on the assumption that the late reverberation forms a diffuse field. The (excess) kurtosis γ4 is defined as:
γ4 = E[(x - μ)^4] / δ^4 - 3
where E denotes expectation over the impulse response x being processed, μ is its mean and δ is its standard deviation;
the split time is defined as the instant at which the kurtosis computed in a sliding window reaches zero.
In one embodiment, the divided time points of the early reverberation signal and the late reverberation signal are calculated by room characteristics, and the time t is defined as:
[equation not reproduced in this text: t is defined in terms of the room volume V and the room surface area S]
where V and S are the volume of the room and the surface area of the room, respectively.
The first generating unit 302 is configured to convolve the direct sound signal and the early reverberation signal with the clean speech signal in the speech library in a time domain to obtain a time domain target signal.
After the direct sound signal and the early reverberation signal are down-sampled to the sampling frequency of the speech signal, they are convolved with a clean speech signal from the speech library in the time domain to obtain a time-domain target signal.
The second generating unit 303 is configured to calculate the energies of the time-domain target signal and of the components of the time-domain mixed signal other than the target signal to obtain the target signal energy and the other-signal energy, and to obtain an ideal ratio mask from these two energies.
Preferably, the time-domain mixed signal is obtained by convolving the room impulse response signal with all the speech in the speech library in the time domain and then adding the noise signal. The time-domain mixed signal is generated as:
m(t) = s(t) * h(t) + n(t)
where * denotes time-domain convolution, s(t) is the clean speech signal, h(t) is the room impulse response signal, n(t) is the noise signal and t is the time index.
A Fourier transform is applied separately to the time-domain target signal and to the components of the time-domain mixed signal other than the target signal, and the target signal energy D(k, l) and the other-signal energy R(k, l) are computed; these are then substituted into the ideal ratio mask formula to obtain the ideal ratio mask. The ideal ratio mask IRM(k, l) is:
IRM(k, l) = D(k, l) / (D(k, l) + R(k, l))
where D(k, l) is the target signal energy, R(k, l) is the energy of the components of the mixed signal other than the target signal, k is the frequency-band index and l is the frame index.
The third generating unit 304 is configured to convert the time-domain mixed signal into a frequency-domain mixed signal, multiply the magnitude of the frequency-domain mixed signal by the ideal ratio mask, and use the phase of the frequency-domain mixed signal to obtain a reconstructed signal.
A short-time Fourier transform is applied to the time-domain mixed signal to obtain the frequency-domain mixed signal; the magnitude of the frequency-domain mixed signal is then multiplied by the ideal ratio mask and combined with the phase of the frequency-domain mixed signal to obtain the reconstructed signal, the reconstructed signal s′(t) being computed as:
s′(t) = istft{ M(k, l) × IRM(k, l) × exp[ j∠M(k, l) ] }
where istft denotes the inverse short-time Fourier transform, M(k, l) is the magnitude of the frequency-domain mixed signal, ∠M(k, l) is its phase, k is the frequency-band index and l is the frame index.
The invention analyzes room impulse response signals with different acoustic characteristics to intercept the early reverberation signal, combines the intercepted signal with ideal ratio masking, and applies the resulting mask to the mixed speech signal to obtain a reconstructed signal, thereby separating the target signal from the mixed speech under noisy, reverberant conditions by masking the signal magnitude with the ideal ratio mask.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be replaced by equivalents, and such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.

Claims (10)

1. A front-end processing method for enhancing far-field speech recognition, comprising:
calculating the impulse response signal of the room to obtain the division time points of the early reverberation signal and the late reverberation signal, and intercepting the direct sound signal and the early reverberation signal; the room impulse response signal is composed of the direct sound signal, the early reverberation signal and the late reverberation signal in sequence;
convolving the direct sound signal and the early reverberation signal with a clean speech signal in a speech library in a time domain to obtain a time domain target signal;
calculating the energies of the time-domain target signal and of the components of the time-domain mixed signal other than the time-domain target signal to obtain the target signal energy and the other-signal energy, and obtaining an ideal ratio mask from the target signal energy and the other-signal energy; the time-domain mixed signal is obtained by convolving the room impulse response signal with the speech in the speech library in the time domain and then adding a noise signal;
and after the time-domain mixed signal is converted into a frequency-domain mixed signal, multiplying the magnitude of the frequency-domain mixed signal by the ideal ratio mask, and then using the phase of the frequency-domain mixed signal to obtain a reconstructed signal.
2. The method of claim 1, wherein the calculating the room impulse response signal to obtain the divided time points of the early reverberation signal and the late reverberation signal comprises:
determining the split time point of the early and late reverberation signals of the room impulse response signal by computing an echo density function of the room impulse response signal, the normalized echo density NED being defined as:
NED = (1 / erfc(1/√2)) · Σ_l ω(l) · 1{ |h(l)| > δ }
where erfc(1/√2) ≈ 0.3173 is the fraction of samples expected to lie more than one standard deviation from the mean of a Gaussian distribution, 1{·} is an indicator function that returns 1 when its argument is true and 0 otherwise, ω(l) is a weight function, h(l) is a sample of the room impulse response within the current window, and δ is the standard deviation of the room impulse response signal in the current window;
as the reverberation changes from early to late reverberation, NED rises from 0 towards 1, and the split time between the early and late reverberation signals is defined as the moment when the NED of the late reverberation signal becomes arbitrarily close to 1.
3. The method of claim 1, wherein the calculating the room impulse response signal to obtain the divided time points of the early reverberation signal and the late reverberation signal comprises:
calculating the split time point of the early and late reverberation signals by means of the kurtosis, the fourth-order moment of a statistical process, based on the assumption that the late reverberation forms a diffuse field, the (excess) kurtosis γ4 being defined as:
γ4 = E[(x - μ)^4] / δ^4 - 3
where E denotes expectation over the impulse response x being processed, μ is its mean and δ is its standard deviation;
the split time point is defined as the instant at which the kurtosis computed in a sliding window reaches zero.
4. The method of claim 1, wherein the calculating the room impulse response signal to obtain the divided time points of the early reverberation signal and the late reverberation signal comprises:
calculating the division time point of the early reverberation signal and the late reverberation signal according to the room characteristics, wherein the time t is defined as:
[equation not reproduced in this text: t is defined in terms of the room volume V and the room surface area S]
where V and S are the volume of the room and the surface area of the room, respectively.
5. The method according to claim 1, wherein calculating the energies of the time-domain target signal and of the components of the time-domain mixed signal other than the target signal to obtain the target signal energy and the other-signal energy, and obtaining the ideal ratio mask from the target signal energy and the other-signal energy, specifically comprises:
respectively carrying out Fourier transform on the time domain target signal and the other signals, and calculating to obtain target signal energy and other signal energy;
substituting the target signal energy and the other signal energy into an ideal ratio masking formula to obtain the ideal ratio masking; the ideal ratio masking formula IRM (k, l) is:
IRM(k, l) = D(k, l) / (D(k, l) + R(k, l))
where D (k, l) represents the target signal energy, R (k, l) represents the other signal energy except the target signal energy among the mixed signal energy, k represents the band index, and l represents the frame index.
6. The method according to claim 1, wherein the time-domain mixed signal is obtained by convolving the room impulse response signal with the speech in the speech library in the time domain and then adding a noise signal, the time-domain mixed signal being generated as:
m(t) = s(t) * h(t) + n(t)
where * denotes time-domain convolution, s(t) is the clean speech signal, h(t) is the room impulse response signal, n(t) is the noise signal and t is the time index.
7. The method according to claim 1, wherein converting the time-domain mixed signal into a frequency-domain mixed signal, multiplying the magnitude of the frequency-domain mixed signal by the ideal ratio mask, and obtaining a reconstructed signal by using the phase of the frequency-domain mixed signal specifically comprises:
carrying out a short-time Fourier transform on the time-domain mixed signal to obtain the frequency-domain mixed signal;
and multiplying the magnitude of the frequency-domain mixed signal by the ideal ratio mask and obtaining the reconstructed signal by using the phase of the frequency-domain mixed signal, the reconstructed signal s′(t) being computed as:
s′(t) = istft{ M(k, l) × IRM(k, l) × exp[ j∠M(k, l) ] }
where istft denotes the inverse short-time Fourier transform, M(k, l) is the magnitude of the frequency-domain mixed signal, IRM(k, l) is the ideal ratio mask, ∠M(k, l) is the phase of the frequency-domain mixed signal, k is the frequency-band index and l is the frame index.
8. A front-end processing system that promotes far-field speech recognition, comprising:
the interception unit is used for calculating the room impulse response signal to obtain the division time points of the early reverberation signal and the late reverberation signal and intercepting the direct sound signal and the early reverberation signal; the room impulse response signal is composed of the direct sound signal, the early reverberation signal and the late reverberation signal in sequence;
the first generation unit is used for convolving the direct sound signal and the early reverberation signal with a clean speech signal in a speech library on a time domain to obtain a time domain target signal;
the second generating unit is used for calculating the energies of the time-domain target signal and of the components of the time-domain mixed signal other than the time-domain target signal to obtain the target signal energy and the other-signal energy, and for obtaining an ideal ratio mask from the target signal energy and the other-signal energy; the time-domain mixed signal is obtained by convolving the room impulse response signal with the speech in the speech library in the time domain and then adding a noise signal;
and the third generating unit is used for converting the time-domain mixed signal into a frequency-domain mixed signal, multiplying the magnitude of the frequency-domain mixed signal by the ideal ratio mask, and then using the phase of the frequency-domain mixed signal to obtain a reconstructed signal.
9. The system according to claim 8, characterized in that the second generation unit is specifically configured to,
respectively carrying out Fourier transform on the time domain target signal and the other signals, and calculating to obtain target signal energy and other signal energy;
substituting the target signal energy and the other signal energy into an ideal ratio masking formula to obtain the ideal ratio masking; the ideal ratio masking formula IRM (k, l) is:
IRM(k, l) = D(k, l) / (D(k, l) + R(k, l))
where D (k, l) represents the target signal energy, R (k, l) represents the other signal energy except the target signal energy among the mixed signal energy, k represents the band index, and l represents the frame index.
10. The system according to claim 8, characterized in that the third generation unit is in particular adapted to,
carrying out short-time Fourier transform on the time domain mixed signal to obtain a frequency domain mixed signal;
and multiplying the magnitude of the frequency-domain mixed signal by the ideal ratio mask and obtaining the reconstructed signal by using the phase of the frequency-domain mixed signal, the reconstructed signal s′(t) being computed as:
s′(t) = istft{ M(k, l) × IRM(k, l) × exp[ j∠M(k, l) ] }
where istft denotes the inverse short-time Fourier transform, M(k, l) is the magnitude of the frequency-domain mixed signal, IRM(k, l) is the ideal ratio mask, ∠M(k, l) is the phase of the frequency-domain mixed signal, k is the frequency-band index and l is the frame index.
CN201811602419.9A 2018-12-26 2018-12-26 Front-end processing method and system for improving far-field speech recognition Active CN109523999B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811602419.9A CN109523999B (en) 2018-12-26 2018-12-26 Front-end processing method and system for improving far-field speech recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811602419.9A CN109523999B (en) 2018-12-26 2018-12-26 Front-end processing method and system for improving far-field speech recognition

Publications (2)

Publication Number Publication Date
CN109523999A CN109523999A (en) 2019-03-26
CN109523999B true CN109523999B (en) 2021-03-23

Family

ID=65797174

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811602419.9A Active CN109523999B (en) 2018-12-26 2018-12-26 Front-end processing method and system for improving far-field speech recognition

Country Status (1)

Country Link
CN (1) CN109523999B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110428852B (en) * 2019-08-09 2021-07-16 南京人工智能高等研究院有限公司 Voice separation method, device, medium and equipment
CN111312273A (en) * 2020-05-11 2020-06-19 腾讯科技(深圳)有限公司 Reverberation elimination method, apparatus, computer device and storage medium
CN112201262A (en) * 2020-09-30 2021-01-08 珠海格力电器股份有限公司 Sound processing method and device
CN112201229A (en) * 2020-10-09 2021-01-08 百果园技术(新加坡)有限公司 Voice processing method, device and system
CN112652290B (en) * 2020-12-14 2023-01-20 北京达佳互联信息技术有限公司 Method for generating reverberation audio signal and training method of audio processing model
CN112735461A (en) * 2020-12-29 2021-04-30 西安讯飞超脑信息科技有限公司 Sound pickup method, related device and equipment
CN113643714B (en) * 2021-10-14 2022-02-18 阿里巴巴达摩院(杭州)科技有限公司 Audio processing method, device, storage medium and computer program
CN116189698A (en) * 2021-11-25 2023-05-30 广州视源电子科技股份有限公司 Training method and device for voice enhancement model, storage medium and equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090122999A1 (en) * 2007-11-13 2009-05-14 Samsung Electronics Co., Ltd Method of improving acoustic properties in music reproduction apparatus and recording medium and music reproduction apparatus suitable for the method
CN105427860A (en) * 2015-11-11 2016-03-23 百度在线网络技术(北京)有限公司 Far field voice recognition method and device
CN105427859A (en) * 2016-01-07 2016-03-23 深圳市音加密科技有限公司 Front voice enhancement method for identifying speaker
CN107481731A (en) * 2017-08-01 2017-12-15 百度在线网络技术(北京)有限公司 A kind of speech data Enhancement Method and system
CN108389586A (en) * 2017-05-17 2018-08-10 宁波桑德纳电子科技有限公司 A kind of long-range audio collecting device, monitoring device and long-range collection sound method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090122999A1 (en) * 2007-11-13 2009-05-14 Samsung Electronics Co., Ltd Method of improving acoustic properties in music reproduction apparatus and recording medium and music reproduction apparatus suitable for the method
CN105427860A (en) * 2015-11-11 2016-03-23 百度在线网络技术(北京)有限公司 Far field voice recognition method and device
CN105427859A (en) * 2016-01-07 2016-03-23 深圳市音加密科技有限公司 Front voice enhancement method for identifying speaker
CN108389586A (en) * 2017-05-17 2018-08-10 宁波桑德纳电子科技有限公司 A kind of long-range audio collecting device, monitoring device and long-range collection sound method
CN107481731A (en) * 2017-08-01 2017-12-15 百度在线网络技术(北京)有限公司 A kind of speech data Enhancement Method and system

Also Published As

Publication number Publication date
CN109523999A (en) 2019-03-26

Similar Documents

Publication Publication Date Title
CN109523999B (en) Front-end processing method and system for improving far-field speech recognition
EP3791565B1 (en) Method and apparatus utilizing residual echo estimate information to derive secondary echo reduction parameters
US8046219B2 (en) Robust two microphone noise suppression system
EP2845189B1 (en) A universal reconfigurable echo cancellation system
JP5007442B2 (en) System and method using level differences between microphones for speech improvement
CN108447496B (en) Speech enhancement method and device based on microphone array
EP2568695A1 (en) Method and device for suppressing residual echo
KR20130108063A (en) Multi-microphone robust noise suppression
JP2003534570A (en) How to suppress noise in adaptive beamformers
EP3245795B1 (en) Reverberation suppression using multiple beamformers
US20200286501A1 (en) Apparatus and a method for signal enhancement
Thiergart et al. An informed MMSE filter based on multiple instantaneous direction-of-arrival estimates
Yang Multilayer adaptation based complex echo cancellation and voice enhancement
Compernolle DSP techniques for speech enhancement
EP1286334A2 (en) Method and circuit arrangement for reducing noise during voice communication in communications systems
JP2005514668A (en) Speech enhancement system with a spectral power ratio dependent processor
Zhang et al. A microphone array dereverberation algorithm based on TF-GSC and postfiltering
Sugiyama et al. Automatic gain control with integrated signal enhancement for specified target and background-noise levels
ZHANG et al. Fast echo cancellation algorithm in smart speaker
Fukui et al. Hands-free audio conferencing unit with low-complexity dereverberation
Ma et al. Application of Deep Learning-based Single-channel Speech Enhancement for Frequency-modulation Transmitted Speech
Nguyen Power Level In Dual− Microphone System
Haque et al. Acoustic Echo Cancellation for the Advancement in Telecommunication
Wang et al. Time-Frequency Thresholding: A new algorithm in wavelet package speech enhancement
KR20200054754A (en) Audio signal processing method and apparatus for enhancing speech recognition in noise environments

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant