CN109523999B - Front-end processing method and system for improving far-field speech recognition


Info

Publication number
CN109523999B
CN109523999B
Authority
CN
China
Prior art keywords
signal
time
reverberation
energy
target
Prior art date
Legal status
Active
Application number
CN201811602419.9A
Other languages
Chinese (zh)
Other versions
CN109523999A (en)
Inventor
Li Junfeng (李军锋)
Gao Fei (高飞)
Yan Yonghong (颜永红)
Current Assignee
Institute of Acoustics CAS
Original Assignee
Institute of Acoustics CAS
Priority date
Filing date
Publication date
Application filed by Institute of Acoustics CAS filed Critical Institute of Acoustics CAS
Priority to CN201811602419.9A priority Critical patent/CN109523999B/en
Publication of CN109523999A publication Critical patent/CN109523999A/en
Application granted granted Critical
Publication of CN109523999B publication Critical patent/CN109523999B/en


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0272 Voice signal separating
    • G10L2021/02082 Noise filtering, the noise being echo, reverberation of the speech

Abstract

The application provides a front-end processing method and system for improving far-field speech recognition. The method comprises the following steps: analyzing the room impulse response signal to obtain the time point dividing the early reverberation signal from the late reverberation signal, and intercepting the direct sound signal and the early reverberation signal; convolving the direct sound signal and the early reverberation signal with a clean speech signal from a speech library in the time domain to obtain a time-domain target signal; calculating the energies of the time-domain target signal and of the components of the time-domain mixed signal other than the target signal to obtain the target signal energy and the other-signal energy, and obtaining an ideal ratio mask from these two energies; and, after converting the time-domain mixed signal into a frequency-domain mixed signal, multiplying the magnitude of the frequency-domain mixed signal by the ideal ratio mask and combining the result with the phase of the frequency-domain mixed signal to obtain a reconstructed signal. The invention thereby separates the target signal from the mixed speech under noisy, reverberant conditions by masking the signal magnitude with the ideal ratio mask.

Description

Front-end processing method and system for improving far-field speech recognition
Technical Field
The present invention relates to the field of audio signal processing, and in particular, to a front-end processing method and system for improving far-field speech recognition.
Background
With the continuous development of speech technology, voice interaction has become widespread, with applications ranging from national defense to the home. More and more applications, such as smart homes and service robots, are built on speech recognition. In real voice-interaction scenarios, however, background noise and room reverberation interfere with the propagation of speech; this interference degrades speech quality and speech intelligibility and is particularly harmful to speech recognition. Separating speech from these disturbances is therefore especially important for speech recognition.
Based on research into the auditory masking phenomenon, the Ideal Binary Mask (IBM) was proposed to separate target speech from noisy speech. Its main idea is to keep, by means of a local threshold, the time-frequency units in which the target signal energy exceeds the noise energy and to discard the remaining units. Many studies have shown that the IBM can improve speech intelligibility and speech quality. The Ideal Ratio Mask (IRM) can be regarded as a soft-decision version of the IBM; it retains more of the speech information and performs better for speech recognition. In a noisy environment the IRM is computed from the energy ratio of clean speech to noisy speech. When the scenario changes to a noisy reverberant environment, current practice still applies this noise-only approach; however, noise is additive whereas reverberation is convolutive (multiplicative in the frequency domain) and consists of direct sound, early reflections and late reverberation, so handling reverberation in this way is clearly unreasonable.
The Room Impulse Response (RIR) is commonly used to describe the reverberation characteristics of a room. Studies show that the direct sound and early reflections in the room impulse response are the components beneficial to human hearing, and some studies take the direct sound together with the first 50 ms of early reverberation as the target speech; experimental results show that such masking can effectively improve speech intelligibility and speech quality under noisy reverberant conditions. However, reverberation varies with the acoustic characteristics of the room environment, early reflections of different lengths affect speech intelligibility differently, and truncating at a fixed 50 ms regardless of reverberation time does not perform well.
Disclosure of Invention
In order to solve the above problems, the present invention provides a front-end processing method and system for improving far-field speech recognition.
In order to achieve the above purpose, the embodiments of the present application adopt the following technical solutions:
In a first aspect, the present application provides a front-end processing method for improving far-field speech recognition, including: analyzing the room impulse response signal to obtain the time point dividing the early reverberation signal from the late reverberation signal, and intercepting the direct sound signal and the early reverberation signal, the room impulse response signal consisting, in order, of the direct sound signal, the early reverberation signal and the late reverberation signal; convolving the direct sound signal and the early reverberation signal with a clean speech signal from a speech library in the time domain to obtain a time-domain target signal; calculating the energies of the time-domain target signal and of the components of the time-domain mixed signal other than the target signal to obtain the target signal energy and the other-signal energy, and obtaining an ideal ratio mask from these two energies, the time-domain mixed signal being obtained by convolving the room impulse response signal with the speech in the speech library in the time domain and then adding a noise signal; and, after converting the time-domain mixed signal into a frequency-domain mixed signal, multiplying the magnitude of the frequency-domain mixed signal by the ideal ratio mask and combining the result with the phase of the frequency-domain mixed signal to obtain a reconstructed signal.
In another possible implementation, calculating the room impulse response signal to obtain the time point dividing the early reverberation signal from the late reverberation signal includes: determining the split time point of the early and late reverberation signals by computing an echo density function of the room impulse response signal, the normalized echo density NED being defined as:
NED = (1 / erfc(1/√2)) · Σ_l ω(l) · 1{ |h(l)| > δ }
where erfc(1/√2) ≈ 0.3173 is the fraction of samples expected to lie more than one standard deviation from the mean of a Gaussian distribution, 1{·} is an indicator function that returns 1 when its argument is true and 0 otherwise, ω(l) is a weight function, h(l) is a sample of the room impulse response within the current window, and δ is the standard deviation of the room impulse response signal in the current window. As the reverberation changes from early to late reverberation, NED rises from 0 towards 1, and the split time between the early and late reverberation signals is defined as the moment when the NED of the late reverberation signal becomes arbitrarily close to 1.
In another possible implementation, calculating the room impulse response signal to obtain the time point dividing the early reverberation signal from the late reverberation signal includes: calculating the split time point of the early and late reverberation signals by means of the kurtosis, the fourth-order moment of a statistical process, based on the assumption that the late reverberation forms a diffuse field. The (excess) kurtosis γ4 is defined as:
γ4 = E[(x - μ)^4] / δ^4 - 3
where E denotes expectation over the impulse response x being processed, μ is its mean and δ is its standard deviation; the split time is defined as the instant at which the kurtosis computed in a sliding window reaches zero.
In another possible implementation, the calculating the room impulse response signal to obtain the divided time points of the early reverberation signal and the late reverberation signal includes: calculating the division time point of the early reverberation signal and the late reverberation signal according to the room characteristics, wherein the time t is defined as:
[equation not reproduced in this text: t is defined in terms of the room volume V and the room surface area S]
where V and S are the volume of the room and the surface area of the room, respectively.
In another possible implementation, calculating the energies of the time-domain target signal and of the components of the time-domain mixed signal other than the target signal to obtain the target signal energy and the other-signal energy, and obtaining the ideal ratio mask from these energies, specifically includes: performing a Fourier transform on the time-domain target signal and on the other signals, respectively, and computing the target signal energy and the other-signal energy; and substituting the target signal energy and the other-signal energy into the ideal ratio mask formula to obtain the ideal ratio mask, the ideal ratio mask IRM(k, l) being:
IRM(k, l) = D(k, l) / (D(k, l) + R(k, l))
where D(k, l) is the target signal energy, R(k, l) is the energy of the components of the mixed signal other than the target signal, k is the frequency-band index and l is the frame index.
In another possible implementation, the time-domain mixed signal is obtained by convolving the room impulse response signal with the speech in the speech library in the time domain and then adding a noise signal; the time-domain mixed signal is generated as:
m(t) = s(t) * h(t) + n(t)
where * denotes time-domain convolution, s(t) is the clean speech signal, h(t) is the room impulse response signal, n(t) is the noise signal and t is the time index.
In another possible implementation, converting the time-domain mixed signal into a frequency-domain mixed signal, multiplying the magnitude of the frequency-domain mixed signal by the ideal ratio mask and then using the phase of the frequency-domain mixed signal to obtain a reconstructed signal specifically includes:
performing a short-time Fourier transform on the time-domain mixed signal to obtain the frequency-domain mixed signal;
and multiplying the magnitude of the frequency-domain mixed signal by the ideal ratio mask and combining the result with the phase of the frequency-domain mixed signal to obtain the reconstructed signal, the reconstructed signal s′(t) being computed as:
s′(t) = istft{ M(k, l) × IRM(k, l) × exp[ j∠M(k, l) ] }
where istft denotes the inverse short-time Fourier transform, M(k, l) is the magnitude of the frequency-domain mixed signal, ∠M(k, l) is its phase, k is the frequency-band index and l is the frame index.
In a second aspect, the present application provides a front-end processing system for improving far-field speech recognition, comprising: an interception unit, configured to analyze the room impulse response signal to obtain the time point dividing the early reverberation signal from the late reverberation signal and to intercept the direct sound signal and the early reverberation signal, the room impulse response signal consisting, in order, of the direct sound signal, the early reverberation signal and the late reverberation signal; a first generating unit, configured to convolve the direct sound signal and the early reverberation signal with a clean speech signal from a speech library in the time domain to obtain a time-domain target signal; a second generating unit, configured to calculate the energies of the time-domain target signal and of the components of the time-domain mixed signal other than the target signal to obtain the target signal energy and the other-signal energy, and to obtain an ideal ratio mask from these two energies, the time-domain mixed signal being obtained by convolving the room impulse response signal with the speech in the speech library in the time domain and then adding a noise signal; and a third generating unit, configured to convert the time-domain mixed signal into a frequency-domain mixed signal, multiply the magnitude of the frequency-domain mixed signal by the ideal ratio mask, and then use the phase of the frequency-domain mixed signal to obtain a reconstructed signal.
In another possible implementation, the second generating unit is specifically configured to perform a Fourier transform on the time-domain target signal and on the other signals, respectively, and to compute the target signal energy and the other-signal energy; and to substitute the target signal energy and the other-signal energy into the ideal ratio mask formula to obtain the ideal ratio mask, the ideal ratio mask IRM(k, l) being:
IRM(k, l) = D(k, l) / (D(k, l) + R(k, l))
where D(k, l) is the target signal energy, R(k, l) is the energy of the components of the mixed signal other than the target signal, k is the frequency-band index and l is the frame index.
In another possible implementation, the third generating unit is specifically configured to:
perform a short-time Fourier transform on the time-domain mixed signal to obtain the frequency-domain mixed signal;
and multiply the magnitude of the frequency-domain mixed signal by the ideal ratio mask and combine the result with the phase of the frequency-domain mixed signal to obtain the reconstructed signal, the reconstructed signal s′(t) being computed as:
s′(t) = istft{ M(k, l) × IRM(k, l) × exp[ j∠M(k, l) ] }
where istft denotes the inverse short-time Fourier transform, M(k, l) is the magnitude of the frequency-domain mixed signal, ∠M(k, l) is its phase, k is the frequency-band index and l is the frame index.
The invention analyzes room impulse response signals with different acoustic characteristics to intercept the early reverberation signal, combines the intercepted signal with ideal ratio masking, and applies the resulting mask to the mixed speech signal to obtain a reconstructed signal, thereby separating the target signal from the mixed speech under noisy, reverberant conditions by masking the signal magnitude with the ideal ratio mask.
Drawings
The drawings that accompany the detailed description can be briefly described as follows.
Fig. 1 is a flowchart of a front-end processing method for improving far-field speech recognition according to an embodiment of the present disclosure;
Fig. 2 is a schematic diagram of the composition of a room impulse response signal provided in an embodiment of the present application;
fig. 3 is a block diagram of a front-end processing system for improving far-field speech recognition according to an embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
Fig. 1 is a flowchart of a front-end processing method for improving far-field speech recognition according to an embodiment of the present disclosure. The front-end processing method for improving far-field speech recognition shown in fig. 1 specifically comprises the following implementation steps:
step S102, the room impulse response signal is calculated to obtain the division time points of the early reverberation signal and the late reverberation signal, and the direct sound signal and the early reverberation signal are intercepted.
Preferably, as shown in fig. 2, the room impulse response signal in the present application consists of the direct sound, the early reverberation and the late reverberation. The direct sound and the early reverberation are the components beneficial to human hearing; accordingly, the present application mainly obtains the time point dividing the early reverberation from the late reverberation and then intercepts the direct sound and the early reverberation.
Specifically, after the room impulse response signal is up-sampled to a certain frequency, the division time points of the early reverberation signal and the late reverberation signal of the room impulse response signal are calculated, and then the direct sound signal and the early reverberation signal are intercepted.
The room impulse response signal is upsampled to a certain frequency so that the end of the early reverberation can be located with finer time resolution.
Preferably, the present application upsamples the room impulse response signal to 48kHz, although other frequencies are possible.
In one embodiment, the split time point of the early and late reverberation signals of the room impulse response signal is determined by computing an echo density function of the room impulse response signal, the normalized echo density NED being defined as:
NED = (1 / erfc(1/√2)) · Σ_l ω(l) · 1{ |h(l)| > δ }
where erfc(1/√2) ≈ 0.3173 is the fraction of samples expected to lie more than one standard deviation from the mean of a Gaussian distribution, 1{·} is an indicator function that returns 1 when its argument is true and 0 otherwise, ω(l) is a weight function, h(l) is a sample of the room impulse response within the current window, and δ is the standard deviation of the room impulse response signal in the current window.
As the reverberation changes from early to late reverberation, NED rises from 0 towards 1, and the split time between the early and late reverberation signals is defined as the moment when the NED of the late reverberation signal becomes arbitrarily close to 1.
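The sliding-window computation of this criterion can be prototyped in a few lines. The sketch below is a minimal illustration only and assumes the standard normalized-echo-density formulation (equal window weights, threshold at one standard deviation); the window length, hop size, the 0.9 decision threshold and the function name ned_split_point are illustrative choices rather than values taken from this patent.

import numpy as np
from scipy.special import erfc

def ned_split_point(rir, fs, win_ms=20.0, hop_ms=1.0, threshold=0.9):
    """Estimate the early/late split time (in seconds) of a room impulse
    response from its normalized echo density (NED).

    In each sliding window the fraction of samples whose magnitude exceeds
    the window standard deviation is counted and normalized by
    erfc(1/sqrt(2)), the fraction expected for a Gaussian (fully diffuse)
    signal, so NED rises from ~0 towards 1 as the tail becomes diffuse.
    """
    win = int(win_ms * 1e-3 * fs)
    hop = int(hop_ms * 1e-3 * fs)
    norm = erfc(1.0 / np.sqrt(2.0))              # Gaussian reference fraction
    for start in range(0, len(rir) - win, hop):
        frame = rir[start:start + win]
        sigma = np.std(frame)                    # std dev in current window
        ned = np.mean(np.abs(frame) > sigma) / norm
        if ned >= threshold:                     # tail is (nearly) diffuse
            return (start + win // 2) / fs
    return len(rir) / fs                         # no transition detected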
In one embodiment, the split time point of the early and late reverberation signals is calculated by means of the kurtosis, the fourth-order moment of a statistical process, based on the assumption that the late reverberation forms a diffuse field. The (excess) kurtosis γ4 is defined as:
γ4 = E[(x - μ)^4] / δ^4 - 3
where E denotes expectation over the impulse response x being processed, μ is its mean and δ is its standard deviation;
the split time is defined as the instant at which the kurtosis computed in a sliding window reaches zero.
In one embodiment, the divided time points of the early reverberation signal and the late reverberation signal are calculated by room characteristics, and the time t is defined as:
[equation not reproduced in this text: t is defined in terms of the room volume V and the room surface area S]
where V and S are the volume of the room and the surface area of the room, respectively.
Step S104, convolving the direct sound signal and the early reverberation signal with the clean speech signal in the speech library in the time domain to obtain a time-domain target signal.
Preferably, the speech library used in the present application is Hub5, a corpus of English telephone speech in which recruited speakers were connected by a robot operator and talked freely about a daily-life topic announced by the robot operator at the beginning of the call. The sampling frequency of the speech library is 8000 Hz. Here, a clean speech signal refers to a recording without any further processing.
Specifically, the direct sound signal and the early reverberation signal are down-sampled to the sampling frequency of the voice signal, and then convolved with the clean voice signal in the voice library in the time domain to obtain a time domain target signal.
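The steps above can be strung together as in the sketch below: upsample the RIR for a finer split-point estimate, keep everything up to the split point (direct sound plus early reverberation), resample that segment back to the 8 kHz speech rate, and convolve it with a clean utterance. The 48 kHz and 8 kHz rates follow the text; ned_split_point is the earlier sketch, and truncating the convolution output to the utterance length is an illustrative convenience.

import numpy as np
from scipy.signal import resample_poly, fftconvolve

def make_target_signal(rir, fs_rir, clean, fs_speech=8000, fs_hi=48000):
    """Time-domain target signal: clean speech convolved with the direct-sound
    plus early-reverberation portion of the room impulse response."""
    # 1. Upsample the RIR so the early/late split can be located precisely.
    rir_hi = resample_poly(rir, fs_hi, fs_rir)

    # 2. Locate the split point and keep direct sound + early reverberation.
    t_split = ned_split_point(rir_hi, fs_hi)          # seconds
    early_hi = rir_hi[:int(t_split * fs_hi)]

    # 3. Bring the truncated response down to the speech sampling rate and
    #    convolve with the clean utterance in the time domain.
    early = resample_poly(early_hi, fs_speech, fs_hi)
    return fftconvolve(clean, early)[:len(clean)]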
Step S106, calculating the energies of the time-domain target signal and of the components of the time-domain mixed signal other than the target signal to obtain the target signal energy and the other-signal energy, and obtaining an ideal ratio mask from these two energies.
Preferably, the time-domain mixed signal is obtained by convolving the room impulse response signal with all the speech in the speech library in the time domain and then adding the noise signal. The time-domain mixed signal is generated as:
m(t) = s(t) * h(t) + n(t)
where * denotes time-domain convolution, s(t) is the clean speech signal, h(t) is the room impulse response signal, n(t) is the noise signal and t is the time index.
Here, the noise signal refers to background noise in a real voice-interaction scene. Such noise and room reverberation interfere with the propagation of speech; the interference not only degrades speech quality and intelligibility but also harms speech recognition.
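For completeness, the mixture model m(t) = s(t) * h(t) + n(t) can be synthesized as in the sketch below; scaling the noise to a chosen signal-to-noise ratio is an added convenience that the text does not specify, and the function name is illustrative.

import numpy as np
from scipy.signal import fftconvolve

def make_mixture(clean, rir, noise, snr_db=10.0):
    """Reverberant speech plus additive noise: m(t) = (s * h)(t) + n(t)."""
    reverberant = fftconvolve(clean, rir)[:len(clean)]
    # Assumes the noise recording is at least as long as the utterance.
    noise = noise[:len(reverberant)]
    # Scale the noise so the reverberant-speech-to-noise ratio equals snr_db.
    gain = np.sqrt(np.sum(reverberant ** 2) /
                   (np.sum(noise ** 2) * 10.0 ** (snr_db / 10.0)))
    return reverberant + gain * noise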
Specifically, a Fourier transform is applied separately to the time-domain target signal and to the components of the time-domain mixed signal other than the target signal, and the target signal energy D(k, l) and the other-signal energy R(k, l) are computed; D(k, l) and R(k, l) are then substituted into the ideal ratio mask formula to obtain the ideal ratio mask. The ideal ratio mask IRM(k, l) is:
IRM(k, l) = D(k, l) / (D(k, l) + R(k, l))
where D(k, l) is the target signal energy, R(k, l) is the energy of the components of the mixed signal other than the target signal, k is the frequency-band index and l is the frame index.
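A minimal sketch of the mask computation follows, assuming the "other signals" are simply the mixture minus the target in the time domain and that energies are compared per STFT bin; the STFT parameters, the small epsilon guarding the division and the function name are illustrative.

import numpy as np
from scipy.signal import stft

def ideal_ratio_mask(target, mixture, fs=8000, nperseg=256, noverlap=192):
    """IRM(k, l) = D(k, l) / (D(k, l) + R(k, l)), with D the target energy and
    R the energy of everything else (late reverberation + noise) per bin."""
    residual = mixture[:len(target)] - target           # "other signals"
    _, _, T = stft(target, fs, nperseg=nperseg, noverlap=noverlap)
    _, _, Rspec = stft(residual, fs, nperseg=nperseg, noverlap=noverlap)
    d_energy = np.abs(T) ** 2
    r_energy = np.abs(Rspec) ** 2
    return d_energy / (d_energy + r_energy + 1e-12)     # avoid divide-by-zero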
Step S108, after the time-domain mixed signal is converted into a frequency-domain mixed signal, the magnitude of the frequency-domain mixed signal is multiplied by the ideal ratio mask and the result is combined with the phase of the frequency-domain mixed signal to obtain a reconstructed signal.
Specifically, a short-time Fourier transform is applied to the time-domain mixed signal to obtain the frequency-domain mixed signal; the magnitude of the frequency-domain mixed signal is then multiplied by the ideal ratio mask and combined with the phase of the frequency-domain mixed signal to obtain the reconstructed signal, the reconstructed signal s′(t) being computed as:
s′(t) = istft{ M(k, l) × IRM(k, l) × exp[ j∠M(k, l) ] }
where istft denotes the inverse short-time Fourier transform, M(k, l) is the magnitude of the frequency-domain mixed signal, ∠M(k, l) is its phase, k is the frequency-band index and l is the frame index.
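The reconstruction step then reduces to masking the mixture magnitude, reusing the mixture phase and inverting the STFT, as in the sketch below; the STFT parameters must match those used when computing the mask, and ideal_ratio_mask refers to the earlier sketch.

import numpy as np
from scipy.signal import stft, istft

def reconstruct(mixture, irm, fs=8000, nperseg=256, noverlap=192):
    """s'(t) = istft{ |M(k,l)| * IRM(k,l) * exp(j * angle(M(k,l))) }."""
    _, _, M = stft(mixture, fs, nperseg=nperseg, noverlap=noverlap)
    masked = np.abs(M) * irm * np.exp(1j * np.angle(M))   # keep mixture phase
    _, s_hat = istft(masked, fs, nperseg=nperseg, noverlap=noverlap)
    return s_hat

In use, the output of ideal_ratio_mask(target, mixture) would be passed in as irm, with both functions sharing the same signal length and STFT settings.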
The invention analyzes room impulse response signals with different acoustic characteristics to intercept the early reverberation signal, combines the intercepted signal with ideal ratio masking, and applies the resulting mask to the mixed speech signal to obtain a reconstructed signal, thereby separating the target signal from the mixed speech under noisy, reverberant conditions by masking the signal magnitude with the ideal ratio mask.
Fig. 3 is a block diagram of a front-end processing system for improving far-field speech recognition according to an embodiment of the present disclosure. The front-end processing system shown in fig. 3 comprises: an interception unit 301, a first generating unit 302, a second generating unit 303 and a third generating unit 304.
The interception unit 301 is configured to calculate the room impulse response signal, obtain the time point dividing the early reverberation signal from the late reverberation signal, and intercept the direct sound signal and the early reverberation signal.
After the room impulse response signal is up-sampled to a certain frequency, the division time points of the early reverberation signal and the late reverberation signal of the room impulse response signal are calculated, and then the direct sound signal and the early reverberation signal are intercepted.
In one embodiment, the split time point of the early and late reverberation signals of the room impulse response signal is determined by computing an echo density function of the room impulse response signal, the normalized echo density NED being defined as:
NED = (1 / erfc(1/√2)) · Σ_l ω(l) · 1{ |h(l)| > δ }
where erfc(1/√2) ≈ 0.3173 is the fraction of samples expected to lie more than one standard deviation from the mean of a Gaussian distribution, 1{·} is an indicator function that returns 1 when its argument is true and 0 otherwise, ω(l) is a weight function, h(l) is a sample of the room impulse response within the current window, and δ is the standard deviation of the room impulse response signal in the current window.
As the reverberation changes from early to late reverberation, NED rises from 0 towards 1, and the split time between the early and late reverberation signals is defined as the moment when the NED of the late reverberation signal becomes arbitrarily close to 1.
In one embodiment, the split time point of the early and late reverberation signals is calculated by means of the kurtosis, the fourth-order moment of a statistical process, based on the assumption that the late reverberation forms a diffuse field. The (excess) kurtosis γ4 is defined as:
γ4 = E[(x - μ)^4] / δ^4 - 3
where E denotes expectation over the impulse response x being processed, μ is its mean and δ is its standard deviation;
the split time is defined as the instant at which the kurtosis computed in a sliding window reaches zero.
In one embodiment, the divided time points of the early reverberation signal and the late reverberation signal are calculated by room characteristics, and the time t is defined as:
[equation not reproduced in this text: t is defined in terms of the room volume V and the room surface area S]
where V and S are the volume of the room and the surface area of the room, respectively.
The first generating unit 302 is configured to convolve the direct sound signal and the early reverberation signal with the clean speech signal in the speech library in a time domain to obtain a time domain target signal.
After the direct sound signal and the early reverberation signal are down-sampled to the sampling frequency of the speech signal, they are convolved with a clean speech signal from the speech library in the time domain to obtain a time-domain target signal.
The second generating unit 303 is configured to calculate the energies of the time-domain target signal and of the components of the time-domain mixed signal other than the target signal to obtain the target signal energy and the other-signal energy, and to obtain an ideal ratio mask from these two energies.
Preferably, the time-domain mixed signal is obtained by convolving the room impulse response signal with all the speech in the speech library in the time domain and then adding the noise signal. The time-domain mixed signal is generated as:
m(t) = s(t) * h(t) + n(t)
where * denotes time-domain convolution, s(t) is the clean speech signal, h(t) is the room impulse response signal, n(t) is the noise signal and t is the time index.
A Fourier transform is applied separately to the time-domain target signal and to the components of the time-domain mixed signal other than the target signal, and the target signal energy D(k, l) and the other-signal energy R(k, l) are computed; these are then substituted into the ideal ratio mask formula to obtain the ideal ratio mask. The ideal ratio mask IRM(k, l) is:
IRM(k, l) = D(k, l) / (D(k, l) + R(k, l))
where D(k, l) is the target signal energy, R(k, l) is the energy of the components of the mixed signal other than the target signal, k is the frequency-band index and l is the frame index.
The third generating unit 304 is configured to convert the time-domain mixed signal into a frequency-domain mixed signal, multiply the magnitude of the frequency-domain mixed signal by the ideal ratio mask, and use the phase of the frequency-domain mixed signal to obtain a reconstructed signal.
A short-time Fourier transform is applied to the time-domain mixed signal to obtain the frequency-domain mixed signal; the magnitude of the frequency-domain mixed signal is then multiplied by the ideal ratio mask and combined with the phase of the frequency-domain mixed signal to obtain the reconstructed signal, the reconstructed signal s′(t) being computed as:
s′(t) = istft{ M(k, l) × IRM(k, l) × exp[ j∠M(k, l) ] }
where istft denotes the inverse short-time Fourier transform, M(k, l) is the magnitude of the frequency-domain mixed signal, ∠M(k, l) is its phase, k is the frequency-band index and l is the frame index.
The invention analyzes room impulse response signals with different acoustic characteristics to intercept the early reverberation signal, combines the intercepted signal with ideal ratio masking, and applies the resulting mask to the mixed speech signal to obtain a reconstructed signal, thereby separating the target signal from the mixed speech under noisy, reverberant conditions by masking the signal magnitude with the ideal ratio mask.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be replaced by equivalents, and such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.

Claims (10)

1. A front-end processing method for enhancing far-field speech recognition, comprising:
calculating the impulse response signal of the room to obtain the division time points of the early reverberation signal and the late reverberation signal, and intercepting the direct sound signal and the early reverberation signal; the room impulse response signal is composed of the direct sound signal, the early reverberation signal and the late reverberation signal in sequence;
convolving the direct sound signal and the early reverberation signal with a clean speech signal in a speech library in a time domain to obtain a time domain target signal;
calculating the energies of the time-domain target signal and of the components of the time-domain mixed signal other than the time-domain target signal to obtain the target signal energy and the other-signal energy, and obtaining an ideal ratio mask from the target signal energy and the other-signal energy; the time-domain mixed signal is obtained by convolving the room impulse response signal with the speech in the speech library in the time domain and then adding a noise signal;
and after the time-domain mixed signal is converted into a frequency-domain mixed signal, multiplying the magnitude of the frequency-domain mixed signal by the ideal ratio mask, and then using the phase of the frequency-domain mixed signal to obtain a reconstructed signal.
2. The method of claim 1, wherein the calculating the room impulse response signal to obtain the divided time points of the early reverberation signal and the late reverberation signal comprises:
determining the split time point of the early and late reverberation signals of the room impulse response signal by computing an echo density function of the room impulse response signal, the normalized echo density NED being defined as:
NED = (1 / erfc(1/√2)) · Σ_l ω(l) · 1{ |h(l)| > δ }
where erfc(1/√2) ≈ 0.3173 is the fraction of samples expected to lie more than one standard deviation from the mean of a Gaussian distribution, 1{·} is an indicator function that returns 1 when its argument is true and 0 otherwise, ω(l) is a weight function, h(l) is a sample of the room impulse response within the current window, and δ is the standard deviation of the room impulse response signal in the current window;
as the reverberation changes from early to late reverberation, NED rises from 0 towards 1, and the split time between the early and late reverberation signals is defined as the moment when the NED of the late reverberation signal becomes arbitrarily close to 1.
3. The method of claim 1, wherein the calculating the room impulse response signal to obtain the divided time points of the early reverberation signal and the late reverberation signal comprises:
calculating the split time point of the early and late reverberation signals by means of the kurtosis, the fourth-order moment of a statistical process, based on the assumption that the late reverberation forms a diffuse field, the (excess) kurtosis γ4 being defined as:
γ4 = E[(x - μ)^4] / δ^4 - 3
where E denotes expectation over the impulse response x being processed, μ is its mean and δ is its standard deviation;
the split time point is defined as the instant at which the kurtosis computed in a sliding window reaches zero.
4. The method of claim 1, wherein the calculating the room impulse response signal to obtain the divided time points of the early reverberation signal and the late reverberation signal comprises:
calculating the division time point of the early reverberation signal and the late reverberation signal according to the room characteristics, wherein the time t is defined as:
[equation not reproduced in this text: t is defined in terms of the room volume V and the room surface area S]
where V and S are the volume of the room and the surface area of the room, respectively.
5. The method according to claim 1, wherein calculating the energies of the time-domain target signal and of the components of the time-domain mixed signal other than the target signal to obtain the target signal energy and the other-signal energy, and obtaining the ideal ratio mask from the target signal energy and the other-signal energy, specifically comprises:
respectively carrying out Fourier transform on the time domain target signal and the other signals, and calculating to obtain target signal energy and other signal energy;
substituting the target signal energy and the other signal energy into an ideal ratio masking formula to obtain the ideal ratio masking; the ideal ratio masking formula IRM (k, l) is:
IRM(k, l) = D(k, l) / (D(k, l) + R(k, l))
where D (k, l) represents the target signal energy, R (k, l) represents the other signal energy except the target signal energy among the mixed signal energy, k represents the band index, and l represents the frame index.
6. The method according to claim 1, wherein the time-domain mixed signal is obtained by convolving the room impulse response signal with the speech in the speech library in the time domain and then adding a noise signal, the time-domain mixed signal being generated as:
m(t) = s(t) * h(t) + n(t)
where * denotes time-domain convolution, s(t) is the clean speech signal, h(t) is the room impulse response signal, n(t) is the noise signal and t is the time index.
7. The method according to claim 1, wherein converting the time-domain mixed signal into a frequency-domain mixed signal, multiplying the magnitude of the frequency-domain mixed signal by the ideal ratio mask, and obtaining a reconstructed signal by using the phase of the frequency-domain mixed signal specifically comprises:
carrying out a short-time Fourier transform on the time-domain mixed signal to obtain the frequency-domain mixed signal;
and multiplying the magnitude of the frequency-domain mixed signal by the ideal ratio mask and obtaining the reconstructed signal by using the phase of the frequency-domain mixed signal, the reconstructed signal s′(t) being computed as:
s′(t) = istft{ M(k, l) × IRM(k, l) × exp[ j∠M(k, l) ] }
where istft denotes the inverse short-time Fourier transform, M(k, l) is the magnitude of the frequency-domain mixed signal, IRM(k, l) is the ideal ratio mask, ∠M(k, l) is the phase of the frequency-domain mixed signal, k is the frequency-band index and l is the frame index.
8. A front-end processing system that promotes far-field speech recognition, comprising:
the interception unit is used for calculating the room impulse response signal to obtain the division time points of the early reverberation signal and the late reverberation signal and intercepting the direct sound signal and the early reverberation signal; the room impulse response signal is composed of the direct sound signal, the early reverberation signal and the late reverberation signal in sequence;
the first generation unit is used for convolving the direct sound signal and the early reverberation signal with a clean speech signal in a speech library on a time domain to obtain a time domain target signal;
the second generating unit is used for calculating the energies of the time-domain target signal and of the components of the time-domain mixed signal other than the time-domain target signal to obtain the target signal energy and the other-signal energy, and for obtaining an ideal ratio mask from the target signal energy and the other-signal energy; the time-domain mixed signal is obtained by convolving the room impulse response signal with the speech in the speech library in the time domain and then adding a noise signal;
and the third generating unit is used for converting the time-domain mixed signal into a frequency-domain mixed signal, multiplying the magnitude of the frequency-domain mixed signal by the ideal ratio mask, and then using the phase of the frequency-domain mixed signal to obtain a reconstructed signal.
9. The system according to claim 8, characterized in that the second generation unit is specifically configured to,
respectively carrying out Fourier transform on the time domain target signal and the other signals, and calculating to obtain target signal energy and other signal energy;
substituting the target signal energy and the other signal energy into an ideal ratio masking formula to obtain the ideal ratio masking; the ideal ratio masking formula IRM (k, l) is:
IRM(k, l) = D(k, l) / (D(k, l) + R(k, l))
where D (k, l) represents the target signal energy, R (k, l) represents the other signal energy except the target signal energy among the mixed signal energy, k represents the band index, and l represents the frame index.
10. The system according to claim 8, characterized in that the third generation unit is in particular adapted to,
carrying out short-time Fourier transform on the time domain mixed signal to obtain a frequency domain mixed signal;
and multiplying the magnitude of the frequency-domain mixed signal by the ideal ratio mask and obtaining the reconstructed signal by using the phase of the frequency-domain mixed signal, the reconstructed signal s′(t) being computed as:
s′(t) = istft{ M(k, l) × IRM(k, l) × exp[ j∠M(k, l) ] }
where istft denotes the inverse short-time Fourier transform, M(k, l) is the magnitude of the frequency-domain mixed signal, IRM(k, l) is the ideal ratio mask, ∠M(k, l) is the phase of the frequency-domain mixed signal, k is the frequency-band index and l is the frame index.
CN201811602419.9A 2018-12-26 2018-12-26 Front-end processing method and system for improving far-field speech recognition Active CN109523999B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811602419.9A CN109523999B (en) 2018-12-26 2018-12-26 Front-end processing method and system for improving far-field speech recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811602419.9A CN109523999B (en) 2018-12-26 2018-12-26 Front-end processing method and system for improving far-field speech recognition

Publications (2)

Publication Number Publication Date
CN109523999A CN109523999A (en) 2019-03-26
CN109523999B true CN109523999B (en) 2021-03-23

Family

ID=65797174

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811602419.9A Active CN109523999B (en) 2018-12-26 2018-12-26 Front-end processing method and system for improving far-field speech recognition

Country Status (1)

Country Link
CN (1) CN109523999B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110428852B (en) * 2019-08-09 2021-07-16 南京人工智能高等研究院有限公司 Voice separation method, device, medium and equipment
CN111312273A (en) * 2020-05-11 2020-06-19 腾讯科技(深圳)有限公司 Reverberation elimination method, apparatus, computer device and storage medium
CN112201262A (en) * 2020-09-30 2021-01-08 珠海格力电器股份有限公司 Sound processing method and device
CN112201229A (en) * 2020-10-09 2021-01-08 百果园技术(新加坡)有限公司 Voice processing method, device and system
CN112652290B (en) * 2020-12-14 2023-01-20 北京达佳互联信息技术有限公司 Method for generating reverberation audio signal and training method of audio processing model
CN112735461A (en) * 2020-12-29 2021-04-30 西安讯飞超脑信息科技有限公司 Sound pickup method, related device and equipment
CN113643714B (en) * 2021-10-14 2022-02-18 阿里巴巴达摩院(杭州)科技有限公司 Audio processing method, device, storage medium and computer program
CN116189698A (en) * 2021-11-25 2023-05-30 广州视源电子科技股份有限公司 Training method and device for voice enhancement model, storage medium and equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090122999A1 (en) * 2007-11-13 2009-05-14 Samsung Electronics Co., Ltd Method of improving acoustic properties in music reproduction apparatus and recording medium and music reproduction apparatus suitable for the method
CN105427860A (en) * 2015-11-11 2016-03-23 百度在线网络技术(北京)有限公司 Far field voice recognition method and device
CN105427859A (en) * 2016-01-07 2016-03-23 深圳市音加密科技有限公司 Front voice enhancement method for identifying speaker
CN107481731A (en) * 2017-08-01 2017-12-15 百度在线网络技术(北京)有限公司 A kind of speech data Enhancement Method and system
CN108389586A (en) * 2017-05-17 2018-08-10 宁波桑德纳电子科技有限公司 A kind of long-range audio collecting device, monitoring device and long-range collection sound method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090122999A1 (en) * 2007-11-13 2009-05-14 Samsung Electronics Co., Ltd Method of improving acoustic properties in music reproduction apparatus and recording medium and music reproduction apparatus suitable for the method
CN105427860A (en) * 2015-11-11 2016-03-23 百度在线网络技术(北京)有限公司 Far field voice recognition method and device
CN105427859A (en) * 2016-01-07 2016-03-23 深圳市音加密科技有限公司 Front voice enhancement method for identifying speaker
CN108389586A (en) * 2017-05-17 2018-08-10 宁波桑德纳电子科技有限公司 A kind of long-range audio collecting device, monitoring device and long-range collection sound method
CN107481731A (en) * 2017-08-01 2017-12-15 百度在线网络技术(北京)有限公司 A kind of speech data Enhancement Method and system

Also Published As

Publication number Publication date
CN109523999A (en) 2019-03-26

Similar Documents

Publication Publication Date Title
CN109523999B (en) Front-end processing method and system for improving far-field speech recognition
EP3791565B1 (en) Method and apparatus utilizing residual echo estimate information to derive secondary echo reduction parameters
US8046219B2 (en) Robust two microphone noise suppression system
EP2845189B1 (en) A universal reconfigurable echo cancellation system
JP5007442B2 (en) System and method using level differences between microphones for speech improvement
CN108447496B (en) Speech enhancement method and device based on microphone array
EP2568695A1 (en) Method and device for suppressing residual echo
KR20130108063A (en) Multi-microphone robust noise suppression
JP2003534570A (en) How to suppress noise in adaptive beamformers
EP3245795B1 (en) Reverberation suppression using multiple beamformers
US20200286501A1 (en) Apparatus and a method for signal enhancement
Thiergart et al. An informed MMSE filter based on multiple instantaneous direction-of-arrival estimates
Yang Multilayer adaptation based complex echo cancellation and voice enhancement
Compernolle DSP techniques for speech enhancement
EP1286334A2 (en) Method and circuit arrangement for reducing noise during voice communication in communications systems
JP2005514668A (en) Speech enhancement system with a spectral power ratio dependent processor
Zhang et al. A microphone array dereverberation algorithm based on TF-GSC and postfiltering
Sugiyama et al. Automatic gain control with integrated signal enhancement for specified target and background-noise levels
ZHANG et al. Fast echo cancellation algorithm in smart speaker
Fukui et al. Hands-free audio conferencing unit with low-complexity dereverberation
Ma et al. Application of Deep Learning-based Single-channel Speech Enhancement for Frequency-modulation Transmitted Speech
Nguyen Power Level In Dual− Microphone System
Haque et al. Acoustic Echo Cancellation for the Advancement in Telecommunication
Wang et al. Time-Frequency Thresholding: A new algorithm in wavelet package speech enhancement
KR20200054754A (en) Audio signal processing method and apparatus for enhancing speech recognition in noise environments

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant