CN113362847A - Audio signal processing method and device and storage medium - Google Patents

Audio signal processing method and device and storage medium

Info

Publication number
CN113362847A
Authority
CN
China
Prior art keywords
sound source
signals
signal
determining
microphones
Prior art date
Legal status
Pending
Application number
CN202110582749.1A
Other languages
Chinese (zh)
Inventor
Hou Haining (侯海宁)
Current Assignee
Beijing Xiaomi Mobile Software Co Ltd
Beijing Xiaomi Pinecone Electronic Co Ltd
Original Assignee
Beijing Xiaomi Mobile Software Co Ltd
Beijing Xiaomi Pinecone Electronic Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Xiaomi Mobile Software Co Ltd, Beijing Xiaomi Pinecone Electronic Co Ltd filed Critical Beijing Xiaomi Mobile Software Co Ltd
Priority to CN202110582749.1A
Publication of CN113362847A

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 - Voice signal separating
    • G10L21/0208 - Noise filtering
    • G10L21/0216 - Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161 - Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02165 - Two microphones, one receiving mainly the noise signal and the other one mainly the speech signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The disclosure relates to an audio signal processing method and apparatus, and a storage medium. The method includes: acquiring original noisy signals collected by at least two microphones for at least two sound sources; performing sound source separation on the original noisy signals of the at least two microphones to obtain the respective frequency domain estimation signals of the at least two sound sources; determining, based on the frequency domain estimation signals, the observed estimation signals of each sound source at the at least two microphones; determining an enhanced output signal for each sound source based on the observed estimation signals corresponding to the at least two sound sources; and determining the audio signals emitted by the at least two sound sources according to the filtered enhanced output signals of each sound source. With the scheme of the embodiments of the disclosure, interference can be reduced and the speech quality of the audio signal improved.

Description

Audio signal processing method and device and storage medium
Technical Field
The present disclosure relates to the field of signal processing, and in particular, to an audio signal processing method and apparatus, and a storage medium.
Background
In the related art, smart devices mostly pick up sound with a microphone array and apply microphone beamforming to improve the quality of the processed speech signal, and hence the speech recognition rate, in real environments. However, multi-microphone beamforming is sensitive to microphone position errors, which strongly affects performance, and a larger number of microphones also raises product cost.
Therefore, more and more smart devices are now equipped with only two microphones. Two-microphone systems often enhance speech using blind source separation, a technique entirely different from multi-microphone beamforming. However, the speech signal after blind source separation often contains residual noise, resulting in a low signal-to-noise ratio.
Disclosure of Invention
The present disclosure provides an audio signal processing method and apparatus, and a storage medium.
According to a first aspect of embodiments of the present disclosure, there is provided an audio signal processing method, including:
acquiring original noisy signals collected by at least two microphones for at least two sound sources respectively;
carrying out sound source separation on original noisy signals of the at least two microphones to obtain frequency domain estimation signals of the at least two sound sources;
determining observed estimated signals of each sound source at the at least two microphones respectively based on the respective frequency domain estimated signals of the at least two sound sources;
determining an enhanced output signal for each sound source based on the observed estimated signals corresponding to the at least two sound sources;
and determining the audio signals emitted by the at least two sound sources respectively according to the enhanced output signals of each sound source after filtering processing.
In some embodiments, the performing sound source separation on the original noisy signals of each of the at least two microphones to obtain frequency domain estimation signals of each of the at least two sound sources includes:
performing sound source separation on the original noisy signals by using the deblurred separation matrix of each frame of signals, to obtain the respective frequency domain estimation signals of the at least two sound sources; wherein the frequency domain estimation signals carry phase information of the audio signals emitted by the sound sources.
In some embodiments, the method further comprises:
and determining the deblurred separation matrix by using the separation matrix and the inverse matrix of the separation matrix.
In some embodiments, the method further comprises:
when the current frame is not the first frame, determining the separation matrix of the current frame based on the separation matrix of the previous frame of the current frame and the original noisy signal of the current frame; or
When the current frame is the first frame, a separation matrix of the current frame is determined based on a predetermined identity matrix and an original noisy signal of the current frame.
In some embodiments, the observation estimation signal carries phase information of the audio signal emitted by the sound source; determining an enhanced output signal for each sound source based on the observed estimated signals corresponding to the at least two sound sources, comprising:
determining estimated coordinate information of the at least two sound sources according to the observation estimation signals;
determining time delay differences from the at least two sound sources to the at least two microphones according to the estimated coordinate information and the coordinate information of the at least two microphones;
determining the enhanced output signal of each sound source according to the time delay difference.
In some embodiments, said determining a delay difference from said at least two sound sources to said at least two microphones based on said estimated coordinate information and coordinate information of said at least two microphones comprises:
determining the distance from each sound source to the at least two microphones respectively according to the estimated coordinate information and the coordinate information of the at least two microphones;
and determining the time delay difference according to the distance and the sound velocity.
In some embodiments, said determining said enhanced output signal for each sound source from said time delay difference comprises:
and determining an enhanced output signal of each sound source according to the time delay difference and the observation estimation signals of each sound source at each microphone.
In some embodiments, the method further comprises:
and performing the filtering processing on the enhanced output signal of each sound source according to the observation estimation signal.
In some embodiments, said performing said filtering process on said enhanced output signal of each sound source based on said observed estimated signal comprises:
determining an interference signal of the enhanced output signal according to the observation estimation signal;
and carrying out the filtering processing on the enhanced output signal according to the interference signal.
According to a second aspect of the embodiments of the present disclosure, there is provided an audio signal processing apparatus including:
the first acquisition module is used for acquiring original noisy signals acquired by at least two microphones for at least two sound sources respectively;
the separation module is used for carrying out sound source separation on the original noisy signals of the at least two microphones to obtain frequency domain estimation signals of the at least two sound sources;
a first determining module, configured to determine, based on respective frequency domain estimated signals of the at least two sound sources, observed estimated signals of each of the sound sources at the at least two microphones, respectively;
a second determining module for determining an enhanced output signal for each sound source based on the observed estimated signals corresponding to the at least two sound sources;
and the third determining module is used for determining the audio signals emitted by the at least two sound sources respectively according to the enhanced output signals of the sound sources after filtering processing.
In some embodiments, the separation module comprises:
the separation submodule is used for performing sound source separation on the original noisy signals by using the deblurred separation matrix of each frame of signals, to obtain the respective frequency domain estimation signals of the at least two sound sources; wherein the frequency domain estimation signals carry phase information of the audio signals emitted by the sound sources.
In some embodiments, the apparatus further comprises:
and the fourth determining module is used for determining the deblurred separation matrix by using the separation matrix and the inverse matrix of the separation matrix.
In some embodiments, the apparatus further comprises:
a fifth determining module, configured to determine, when the current frame is not the first frame, a separation matrix of the current frame based on the separation matrix of the previous frame of the current frame and an original noisy signal of the current frame; or
And a sixth determining module, configured to determine, when the current frame is the first frame, a separation matrix of the current frame based on the predetermined identity matrix and the original noisy signal of the current frame.
In some embodiments, the observation estimation signal carries phase information of the audio signal emitted by the sound source; the second determining module includes:
a first determining submodule, configured to determine estimated coordinate information of the at least two sound sources according to the observation estimation signal;
a second determining submodule, configured to determine, according to the estimated coordinate information and the coordinate information of the at least two microphones, a delay difference from the at least two sound sources to the at least two microphones;
a third determining submodule, configured to determine the enhanced output signal of each sound source according to the time delay difference.
In some embodiments, the first determining submodule is specifically configured to:
determining the distance from each sound source to the at least two microphones respectively according to the estimated coordinate information and the coordinate information of the at least two microphones;
and determining the time delay difference according to the distance and the sound velocity.
In some embodiments, the third determining submodule is specifically configured to:
and determining an enhanced output signal of each sound source according to the time delay difference and the observation estimation signals of each sound source at each microphone.
In some embodiments, the apparatus further comprises:
and the filtering module is used for carrying out filtering processing on the enhanced output signal of each sound source according to the observation estimation signal.
In some embodiments, the filtering module comprises:
a fourth determining submodule, configured to determine an interference signal of the enhanced output signal according to the observation estimation signal;
and the filtering submodule is used for carrying out filtering processing on the enhanced output signal according to the interference signal.
According to a third aspect of embodiments of the present disclosure, there is provided an audio signal processing apparatus, the apparatus comprising at least: a processor and a memory for storing executable instructions operable on the processor, wherein:
the processor is configured to execute the executable instructions, and the executable instructions perform the steps of any of the audio signal processing methods described above.
According to a fourth aspect of embodiments of the present disclosure, there is provided a non-transitory computer-readable storage medium having stored therein computer-executable instructions that, when executed by a processor, implement the steps in any of the audio signal processing methods described above.
The technical solutions provided by the embodiments of the disclosure can have the following beneficial effects: after the audio signals are separated into the respective frequency domain estimation signals of each sound source, the observed estimation signals of each sound source at the plurality of microphones are further determined from the frequency domain estimation signals, and the audio signal of each sound source is then enhanced and filtered, which improves the signal-to-noise ratio of the separated signals and thereby the signal quality.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a flowchart I illustrating an audio signal processing method according to an exemplary embodiment;
FIG. 2 is a flowchart II illustrating an audio signal processing method according to an exemplary embodiment;
FIG. 3 is a block diagram illustrating an application scenario of an audio signal processing method according to an exemplary embodiment;
FIG. 4 is a flowchart III illustrating an audio signal processing method according to an exemplary embodiment;
FIG. 5 is a block diagram illustrating the structure of an audio signal processing apparatus according to an exemplary embodiment;
FIG. 6 is a block diagram illustrating the physical structure of an audio signal processing apparatus according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
Fig. 1 is a flowchart illustrating an audio signal processing method according to an exemplary embodiment, as shown in fig. 1, including the steps of:
s101, acquiring original noisy signals acquired by at least two microphones for at least two sound sources respectively;
step S102, carrying out sound source separation on original noisy signals of the at least two microphones to obtain frequency domain estimation signals of the at least two sound sources;
step S103, determining observation estimated signals of each sound source at the at least two microphones respectively based on respective frequency domain estimated signals of the at least two sound sources;
step S104, determining an enhanced output signal of each sound source based on the observation estimated signals corresponding to the at least two sound sources;
and step S105, determining the audio signals emitted by the at least two sound sources according to the enhanced output signals of each sound source after filtering processing.
The method disclosed by the embodiment of the disclosure is applied to the terminal. Here, the terminal is an electronic device into which two or more microphones are integrated. For example, the terminal may be a vehicle-mounted terminal, a computer, a server, or the like.
In an embodiment, the terminal may further be: an electronic device connected to a predetermined device in which two or more microphones are integrated; and the electronic equipment receives the audio signal collected by the predetermined equipment based on the connection and sends the processed audio signal to the predetermined equipment based on the connection. For example, the predetermined device is a sound box or the like.
In practical application, the terminal comprises at least two microphones, and the at least two microphones simultaneously detect audio signals sent by at least two sound sources respectively to obtain original noisy signals of the at least two microphones respectively. Here, it is understood that in the present embodiment, the at least two microphones detect the audio signals emitted by the two sound sources synchronously.
In the embodiment of the present disclosure, the number of the microphones is 2 or more, and the number of the sound sources is 2 or more.
In the embodiment of the present disclosure, the original noisy signal is a mixed signal comprising the sounds emitted by the at least two sound sources. For example, if the number of microphones is 2, namely microphone 1 and microphone 2, and the number of sound sources is 2, namely sound source 1 and sound source 2, then the original noisy signal of microphone 1 is an audio signal containing both sound source 1 and sound source 2, and the original noisy signal of microphone 2 likewise contains both sound source 1 and sound source 2.
As another example, the number of the microphones is 3, which are respectively the microphone 1, the microphone 2 and the microphone 3; the number of the sound sources is 3, namely a sound source 1, a sound source 2 and a sound source 3; the original noisy signal of the microphone 1 is an audio signal comprising a sound source 1, a sound source 2 and a sound source 3; the original noisy signals of said microphone 2 and said microphone 3 are likewise audio signals each comprising a sound source 1, a sound source 2 and a sound source 3.
It is understood that, for a given microphone, the sound emitted by the target sound source is the desired audio signal, while the signals of the other sound sources at that microphone are noise signals. The embodiments of the disclosure need to recover the audio signals emitted by the at least two sound sources from the signals of the at least two microphones.
It will be appreciated that the number of sound sources is generally the same as the number of microphones. In embodiments where the number of microphones is smaller than the number of sound sources, the sound sources may be reduced in dimension to match the number of microphones.
It will be understood that when the microphones collect the audio signals from the sound sources, the audio signal of at least one audio frame may be collected, and the collected audio signals are the original noisy signals of each microphone. The original noisy signal may be either a time domain signal or a frequency domain signal; if it is a time domain signal, it can be converted into a frequency domain signal by a time-frequency transform.
Here, the time domain signal may be transformed into a frequency domain signal based on the Fast Fourier Transform (FFT), based on the short-time Fourier transform (STFT), or based on other Fourier transforms.
For example, if the time domain signal of the p-th microphone in the n-th frame is $x_p^n(m)$, transforming the time domain signal of the n-th frame into a frequency domain signal, the original noisy signal of the n-th frame is determined as

$$X_p(k, n) = \mathrm{FFT}\big(x_p^n(m)\big),$$

where m is the discrete time index within the n-th frame of the time domain signal and k is the frequency bin. Thus, the present embodiment can obtain the original noisy signal of each frame through the time-to-frequency-domain transform. Of course, the original noisy signal of each frame may also be obtained based on other fast Fourier transform formulations, which is not limited here.
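To make the framing and windowed transform concrete, the following is a minimal Python/NumPy sketch of this step; the Hann window, 1024-sample frame length, and 50% hop are illustrative assumptions, not values fixed by the disclosure.

```python
import numpy as np

def stft_frames(x, nfft=1024, hop=512):
    """Windowed Nfft-point FFT per frame; the result plays the role of X_p(k, n).

    Returns an array of shape (num_frames, nfft // 2 + 1), i.e. K = Nfft/2 + 1
    frequency bins per frame, matching the bin count used later in the text.
    """
    window = np.hanning(nfft)
    num_frames = 1 + (len(x) - nfft) // hop
    frames = np.stack([x[i * hop:i * hop + nfft] * window
                       for i in range(num_frames)])
    return np.fft.rfft(frames, n=nfft, axis=1)

# x_p: one microphone's original noisy time-domain signal (assumed mono float array)
x_p = np.random.randn(16000)
X_p = stft_frames(x_p)   # X_p[n, k] ~ original noisy signal of frame n at bin k
```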
According to the original noisy signal of the frequency domain, an initial frequency domain estimation signal can be obtained in a priori estimation mode.
Illustratively, the original noisy signal may be separated based on an initialized separation matrix, such as an identity matrix, or based on the separation matrix obtained for the previous frame, to obtain the frequency domain estimation signal of each frame of each sound source. This provides a basis for separating the audio signals of the sound sources using the frequency domain estimation signals and the separation matrix.
In the embodiment of the present disclosure, after sound source separation, the frequency domain estimation signal of each sound source is obtained, but noise residue may still exist in the frequency domain estimation signal of each sound source. Therefore, in order to reduce the noise residual and further improve the signal-to-noise ratio of the signal, post-processing is also required to be performed on the frequency domain estimation signal after separation.
Here, the observed signals of the respective sound sources at the respective microphones, i.e., the above-described observed estimated signals, may be estimated using the frequency domain estimated signals. The frequency domain estimation signal of each sound source can be enhanced, filtered and the like through the observation estimation signal, and finally, the audio signal sent by each enhanced sound source is obtained.
Therefore, through the embodiment of the disclosure, the audio signal after the blind source separation is further post-processed, and signal enhancement and filtering are realized, so that the signal-to-noise ratio of the signal is improved, the residual noise is reduced, and the signal quality is improved.
In some embodiments, as shown in fig. 2, in the step S102, the performing sound source separation on the original noisy signals of each of the at least two microphones to obtain frequency domain estimation signals of each of the at least two sound sources includes:
step S202, performing sound source separation on the original noisy signals by using the deblurred separation matrix of each frame of signals, to obtain the respective frequency domain estimation signals of the at least two sound sources; wherein the frequency domain estimation signals carry phase information of the audio signals emitted by the sound sources.
In the embodiment of the present disclosure, the separation matrix may be used to perform sound source separation on the original noisy signal, and after separating the original noisy signal of each frame, the separation matrix may be updated, and the updated separation matrix is used to separate the signal of the next frame.
For example, after the original noisy signal of each frame is obtained, the separated signal of the current frame may be obtained based on the separation matrix and the original noisy signal of the current frame, namely by multiplying the original noisy signal of the current frame by the separation matrix. For example, if the separation matrix is $W(k)$ and the original noisy signal of the current frame is $X(k, n)$, the separated signal of the current frame is $Y(k, n) = W(k)X(k, n)$.
In the embodiment of the present disclosure, the deblurred separation matrix is the separation matrix after amplitude deblurring. Here, the deblurring processing may include adjusting the amplitude with the MDP (Minimal Distortion Principle) algorithm. The separated signals obtained with the deblurred separation matrix can recover the estimates of each sound source's observed data at the microphones; therefore, in the embodiment of the present disclosure, the observed estimation signals of each sound source at the at least two microphones can be determined from the frequency domain estimation signals.
Exemplarily, for the case of two sound sources s1 and s2 and two microphones mic1 and mic2, with the separated signal $Y(k,\tau) = [Y_1(k,\tau), Y_2(k,\tau)]^T$, the observed estimation signal of each sound source can be recovered:

The observed estimation signal of sound source s1 at mic1 is $Y_1(k,\tau) = h_{11} s_1(k,\tau)$, i.e. $Y_{11}(k,\tau) = Y_1(k,\tau)$, where $h_{11}$ is a transfer function and $s_1(k,\tau)$ is the signal of sound source s1.

The observed estimation signal of sound source s2 at mic2 is $Y_2(k,\tau) = h_{22} s_2(k,\tau)$, i.e. $Y_{22}(k,\tau) = Y_2(k,\tau)$, where $h_{22}$ is a transfer function and $s_2(k,\tau)$ is the signal of sound source s2.

Since the observed signal at each microphone is a superposition of the two sources' observations, the observed estimation signal of sound source s2 at mic1 is $Y_{12}(k,\tau) = X_1(k,\tau) - Y_{11}(k,\tau)$, and the observed estimation signal of sound source s1 at mic2 is $Y_{21}(k,\tau) = X_2(k,\tau) - Y_{22}(k,\tau)$, where k denotes the frequency bin and τ denotes the frame index of the audio signal.
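Once the separated spectra and the microphone mixtures are available as arrays, the bookkeeping above reduces to two copies and two subtractions. A minimal sketch, assuming complex STFT arrays of shape (K, T) and hypothetical variable names:

```python
import numpy as np

def observed_estimates(Y, X):
    """Recover per-microphone observation estimates for 2 sources / 2 mics.

    Y: separated signals after MDP deblurring, shape (2, K, T); Y[0] = Y1, Y[1] = Y2.
    X: original noisy STFTs, shape (2, K, T); X[0] at mic1, X[1] at mic2.
    Returns Y11, Y12, Y21, Y22 with Yij = estimate of source j observed at mic i.
    """
    Y11 = Y[0]          # source s1 at mic1
    Y22 = Y[1]          # source s2 at mic2
    Y12 = X[0] - Y11    # source s2 at mic1: residual of mic1's mixture
    Y21 = X[1] - Y22    # source s1 at mic2: residual of mic2's mixture
    return Y11, Y12, Y21, Y22
```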
In some embodiments, as shown in fig. 2, the method further comprises:
step S201, determining the deblurred separation matrix by using the separation matrix and an inverse matrix of the separation matrix.
In the embodiment of the present disclosure, the separation matrix may be deblurred by using an MDP algorithm, that is, the deblurred separation matrix is determined by using the separation matrix and an inverse matrix of the separation matrix.
Illustratively, the separation matrix $W(k,\tau)$ is amplitude-deblurred as $W(k,\tau) = \mathrm{diag}(\mathrm{inv}W(k,\tau)) \cdot W(k,\tau)$, where $\mathrm{inv}W(k,\tau)$ is the inverse of $W(k,\tau)$ and $\mathrm{diag}(\mathrm{inv}W(k,\tau))$ sets the off-diagonal elements of $\mathrm{inv}W(k,\tau)$ to 0.
Therefore, separating the frequency domain estimation signals with the separation matrix deblurred by the MDP algorithm recovers the observed signal of each sound source at each microphone while retaining the original phase information, which facilitates determining the direction of each sound source and enhancing the separated signal of each sound source.
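As a sketch of this amplitude deblurring at a single frequency bin (the 2x2 shape and NumPy types are assumptions):

```python
import numpy as np

def mdp_deblur(W):
    """Minimal Distortion Principle rescaling: W <- diag(inv(W)) @ W."""
    invW = np.linalg.inv(W)
    # np.diag(np.diag(...)) keeps the main diagonal of inv(W) and zeroes the rest
    return np.diag(np.diag(invW)) @ W
```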
In some embodiments, the method further comprises:
when the current frame is not the first frame, determining the separation matrix of the current frame based on the separation matrix of the previous frame of the current frame and the original noisy signal of the current frame; or
When the current frame is the first frame, a separation matrix of the current frame is determined based on a predetermined identity matrix and an original noisy signal of the current frame.
In an embodiment, if the separation matrix is the separation matrix of the current frame, the separation signal of the current frame is obtained based on the separation matrix of the current frame and the original noisy signal of the current frame.
In another embodiment, if the separation matrix is the separation matrix of the previous frame of the current frame, the separation signal of the current frame is obtained based on the separation matrix of the previous frame and the original noisy signal of the current frame.
In an embodiment, if the frame index of the audio signal collected by a microphone is n, where n is a natural number greater than or equal to 1, then n = 1 is the first frame. If the current frame is the first frame, the separation matrix of the first frame is an identity matrix. For example, if the number of microphones is 2, the identity matrix is

$$I_2 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix};$$

if the number of microphones is 3, the identity matrix is

$$I_3 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix};$$

and by analogy, if the number of microphones is N, the identity matrix is the $N \times N$ identity matrix $I_N$.
In other embodiments, if the current frame is an audio frame after the first frame, the separation matrix of the current frame is determined based on the separation matrix of the previous frame of the current frame and the original noisy signal of the current frame.
In one embodiment, an audio frame may be an audio segment of a predetermined duration.
For example, the separation matrix of the current frame may be determined based on the separation matrix of the previous frame and the original noisy signal of the current frame as follows: the covariance matrix of the current frame is calculated from the original noisy signal and the covariance matrix of the previous frame, and the separation matrix of the current frame is then calculated based on the covariance matrix of the current frame and the separation matrix of the previous frame.
In some embodiments, the observation estimation signal carries phase information of the audio signal emitted by the sound source; determining an enhanced output signal for each sound source based on the observed estimated signals corresponding to the at least two sound sources, comprising:
determining estimated coordinate information of the at least two sound sources according to the observation estimation signals;
determining time delay differences from the at least two sound sources to the at least two microphones according to the estimated coordinate information and the coordinate information of the at least two microphones;
determining the enhanced output signal of each sound source according to the time delay difference.
Here, the observed estimation signals retain the original phase information, so the azimuth of each sound source can be determined from them. That is, direction tracking and delay-and-sum beamforming are applied to the separated signals to obtain the main-path signal of each sound source, thereby achieving signal enhancement.
In the embodiment of the present disclosure, an SRP-PHAT (Steered Response Power with Phase Transform) direction-finding algorithm may be applied to the observed estimation signals to determine the position of each sound source. For example, for the case of two sound sources s1 and s2, the direction-finding algorithm may determine the estimated coordinate information $(\hat{x}_{s1}, \hat{y}_{s1}, \hat{z}_{s1})$ of sound source s1 and $(\hat{x}_{s2}, \hat{y}_{s2}, \hat{z}_{s2})$ of sound source s2, where x, y, z denote the three coordinate axes.

Since the positions of the microphones are known, for example, in the case of two microphones with coordinate information $m_1 = (x_{m1}, y_{m1}, z_{m1})$ and $m_2 = (x_{m2}, y_{m2}, z_{m2})$, the distance from each sound source to each microphone, and hence the delay difference of the sound propagation, can be determined based on the sound source position and the microphone positions.
Here, the delay difference refers to the difference between the propagation delays from a sound source to the first microphone and to the second microphone.
By using the time delay difference, the interference signal in the separated audio signal can be estimated, and the enhanced output signal of each sound source can be further determined.
In some embodiments, said determining a delay difference from said at least two sound sources to said at least two microphones based on said estimated coordinate information and coordinate information of said at least two microphones comprises:
determining the distance from each sound source to the at least two microphones respectively according to the estimated coordinate information and the coordinate information of the at least two microphones;
and determining the time delay difference according to the distance and the sound velocity.
Here, the distance between each sound source and each microphone can be determined using the estimated coordinate information of each sound source and the coordinate information of each microphone, and further the distance difference between the same sound source and the different microphones:

$$d_1 = \|\hat{s}_1 - m_1\| - \|\hat{s}_1 - m_2\|, \qquad d_2 = \|\hat{s}_2 - m_1\| - \|\hat{s}_2 - m_2\|,$$

where $d_1$ is the difference between the distance from sound source s1 to the first microphone and its distance to the second microphone, and $d_2$ is the corresponding difference for sound source s2.

Using the distance difference and the sound velocity, the delay differences are obtained: the delay difference corresponding to sound source s1 is $\tau_1 = f_s d_1 / c$, and the delay difference corresponding to sound source s2 is $\tau_2 = f_s d_2 / c$, where $f_s$ is the sampling frequency and c is the speed of sound.
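A minimal sketch of these two formulas, assuming the source and microphone positions are 3-D NumPy vectors and using illustrative values for the sampling rate and sound speed:

```python
import numpy as np

FS = 16000               # sampling frequency f_s in Hz (assumed)
SPEED_OF_SOUND = 343.0   # c in m/s (assumed)

def delay_difference(src, mic1, mic2, fs=FS, c=SPEED_OF_SOUND):
    """Delay difference (in samples) of one source between the two microphones."""
    d = np.linalg.norm(src - mic1) - np.linalg.norm(src - mic2)  # distance difference
    return fs * d / c                                            # tau = f_s * d / c
```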
In some embodiments, said determining said enhanced output signal for each sound source from said time delay difference comprises:
and determining an enhanced output signal of each sound source according to the time delay difference and the observation estimation signals of each sound source at each microphone.
In the embodiment of the present disclosure, delay-and-sum beamforming may be performed using the delay difference and the observed estimation signals of each sound source at the different microphones, to obtain the enhanced output of the main path, that is, the enhanced output signal.
Illustratively, for sound source s1, its observed estimation signal $Y_{11}(k,\tau)$ at the first microphone, its observed estimation signal $Y_{21}(k,\tau)$ at the second microphone, and the delay difference $\tau_1$ may be used for delay-and-sum beamforming, yielding the enhanced output of the main path:

$$Y_{M1}(k,\tau) = \frac{1}{2}\Big(Y_{11}(k,\tau) + e^{j 2\pi (k-1)\tau_1 / N_{fft}}\, Y_{21}(k,\tau)\Big),$$

where $k = 1, 2, \dots, K$ is the frequency bin index, K is the total number of frequency bins in one frame, and $N_{fft}$ is the system frame length; exp denotes the exponential function, j the imaginary unit, and π the circular constant.

Accordingly, for sound source s2, its observed estimation signal $Y_{12}(k,\tau)$ at the first microphone, its observed estimation signal $Y_{22}(k,\tau)$ at the second microphone, and the delay difference $\tau_2$ may be used for delay-and-sum beamforming, yielding the enhanced output of its main path:

$$Y_{M2}(k,\tau) = \frac{1}{2}\Big(Y_{12}(k,\tau) + e^{j 2\pi (k-1)\tau_2 / N_{fft}}\, Y_{22}(k,\tau)\Big).$$
Therefore, the audio signal obtained by blind source separation with the separation matrix is further enhanced, the noise is reduced, and the signal quality is improved.
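The main-path computation can be sketched as below; treating the first-microphone observation as the reference and the sign of the steering phase are assumptions consistent with the formulas above:

```python
import numpy as np

def delay_and_sum(Y_ref, Y_other, tau, nfft=1024):
    """Align the second observation estimate to the reference one and average.

    Y_ref, Y_other: observation estimates of ONE source at the two mics, shape (K, T).
    tau: delay difference in samples for this source.
    """
    K = Y_ref.shape[0]
    k = np.arange(K)[:, None]                        # bin indices, i.e. (k - 1) in the text
    phase = np.exp(1j * 2 * np.pi * k * tau / nfft)  # per-bin steering phase
    return 0.5 * (Y_ref + phase * Y_other)           # enhanced main-path output Y_M
```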
In some embodiments, the method further comprises:
and performing the filtering processing on the enhanced output signal of each sound source according to the observation estimation signal.
In the embodiment of the present disclosure, after the separated audio signal is enhanced to obtain an enhanced output signal, further noise reduction filtering may be performed in a manner of adaptive filtering or the like.
In an embodiment, the observation estimation signal may be used as a reference signal, and an adaptive filter is used to determine an interference residue in the enhanced output signal, so as to filter and remove the interference residue.
Therefore, the enhanced signals are further filtered, the signal-to-noise ratio of the signals is improved, and the signal quality and the separation effect of the signals after blind source separation are improved.
In some embodiments, said performing said filtering process on said enhanced output signal of each sound source based on said observed estimated signal comprises:
determining an interference signal of the enhanced output signal according to the observation estimation signal;
and carrying out the filtering processing on the enhanced output signal according to the interference signal.
Illustratively, for sound source s1, the observed estimation signal $Y_{22}(k,\tau)$ or $Y_{12}(k,\tau)$ of sound source s2 can be used as the reference signal and passed through an adaptive filter $\hat{w}_1(k) = [\hat{w}_1^{(0)}(k), \dots, \hat{w}_1^{(L-1)}(k)]^T$ of length L to estimate the interference residue in $Y_{M1}(k,\tau)$, i.e., the interference signal of the enhanced output signal described above. Denoting the interference residue in $Y_{M1}(k,\tau)$ as $Y_{C1}(k,\tau)$, then

$$Y_{C1}(k,\tau) = \hat{w}_1^T(k,\tau)\, u_1(k,\tau).$$

Thus, the interference signal can be used to perform adaptive cancellation, i.e., filtering, on the enhanced output signal, so as to obtain the filtered output result and complete the noise reduction processing.

It should be noted that the above adaptive filter $\hat{w}_1(k)$ can be updated by

$$\hat{w}_1(k, n+1) = \hat{w}_1(k, n) + \mu\, \frac{e_1(k, n)\, u_1(k, n)}{u_1^T(k, n)\, u_1(k, n)},$$

where $u_1(k,n) = [|Y_{22}(k,n)|^2, \dots, |Y_{22}(k,n-L+1)|^2]^T$ is the reference input vector, μ is the step size, and $e_1(k,n)$ is the estimation error.
Similarly, for sound source s2, the observed estimation signal $Y_{11}(k,\tau)$ or $Y_{21}(k,\tau)$ can be used as the reference signal, the adaptive filter is used to estimate the interference signal, and the enhanced output signal is then filtered with it to obtain the noise-reduced output.
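One plausible realization of this adaptive cancellation is sketched below. Because the reference vector in the text is built from magnitude-squared values, the sketch cancels in the power domain and maps the result back to the complex spectrum as a gain; that mapping, the step size mu, and the filter length L are assumptions, not details fixed by the disclosure.

```python
import numpy as np

def cancel_interference(Y_M, ref, L=4, mu=0.1, eps=1e-8):
    """Per-bin adaptive cancellation of interference residue in Y_M.

    Y_M: beamformed output of one source, shape (K, T), complex.
    ref: observation estimate of the competing source (reference), same shape.
    """
    K, T = Y_M.shape
    w = np.zeros((K, L))        # real filter taps acting on the power reference
    u = np.zeros((K, L))        # |ref|^2 over the last L frames, per bin
    Y_E = np.empty_like(Y_M)
    for n in range(T):
        u = np.roll(u, 1, axis=1)
        u[:, 0] = np.abs(ref[:, n]) ** 2
        Y_C = np.sum(w * u, axis=1)              # estimated interference power
        power = np.abs(Y_M[:, n]) ** 2
        e = power - Y_C                          # estimation error (power domain)
        # NLMS-style update, normalized by the reference vector energy
        w += mu * e[:, None] * u / (np.sum(u * u, axis=1, keepdims=True) + eps)
        # map the cancelled power back to the complex spectrum as a gain (assumption)
        gain = np.sqrt(np.maximum(e, 0.0) / (power + eps))
        Y_E[:, n] = gain * Y_M[:, n]
    return Y_E
```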
Furthermore, the frequency domain output signal may be transformed back to the time domain by an inverse Fourier transform. For example, the noise-reduced frequency domain signal may be subjected to the ISTFT (inverse short-time Fourier transform) and overlap-add to obtain the separated and enhanced time domain signal, thereby restoring the audio signal emitted by each sound source.
Embodiments of the present disclosure also provide the following examples:
FIG. 4 is a flowchart III illustrating an audio signal processing method according to an exemplary embodiment. In this method, as shown in fig. 3, the sound sources include sound source 1 and sound source 2, and the microphones include microphone 1 and microphone 2. Based on the audio signal processing method, the audio signals of sound source 1 and sound source 2 are restored from the original noisy signals of microphone 1 and microphone 2. As shown in fig. 4, the method includes the following steps:
step S401: initialize W(k) and $V_p(k)$.

If the system frame length is $N_{fft}$, the number of frequency bins in one frame is $K = N_{fft}/2 + 1$.

1) Initialize the separation matrix at each frequency bin to the identity matrix:

$$W(k) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix},$$

where k is the frequency bin index, taking values k = 1, 2, ..., K.

2) Initialize the weighted covariance matrix of each sound source at each frequency bin to the zero matrix:

$$V_p(k) = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix},$$

where p = 1, 2 indexes the microphones.
Step S402: obtaining an original noisy signal of a p microphone in an n frame;
to pair
Figure BDA0003084556490000133
Windowing and Nfft point obtaining corresponding frequency domain signals:
Figure BDA0003084556490000134
wherein m is the number of points selected by Fourier transform; wherein the STFT is a short-time Fourier transform; the above-mentioned
Figure BDA0003084556490000135
Time domain signals of the nth frame of the p microphone; here, the time domain signal is an original noisy signal.
Then the X ispThe observed signal for (k,) n is: x (k, n) ═ X1(k,n),X2(k,n)]T(ii) a Wherein, [ X ]1(k,n),X2(k,n)]TIs a transposed matrix.
Step S403: obtaining frequency domain estimation signals of two sound source signals by using W (k) of the previous frame;
let the a priori frequency domain estimates of the two source signals Y (k, n) be [ Y [ [ Y ]1(k,n),Y2(k,n)]TWherein Y is1(k,n),Y2(k, n) are estimated values of the sound source 1 and the sound source 2 at the time frequency points (k, n), respectively.
The observation matrix X (k, n) is separated by a separation matrix W (k) to obtain: y (k, n) ═ w (k)' X (k, n); where W' (k) is the separation matrix of the previous frame (i.e., the frame previous to the current frame).
Then the prior frequency domain estimation of the p sound source in the n frame is:
Figure BDA0003084556490000136
step S404: update the weighted covariance matrix $V_p(k, n)$.

The updated weighted covariance matrix is calculated as

$$V_p(k, n) = \beta V_p(k, n-1) + (1 - \beta)\, \varphi_p(n)\, X(k, n) X^H(k, n),$$

where β is a smoothing coefficient (in one embodiment, β = 0.98); $V_p(k, n-1)$ is the weighted covariance matrix of the previous frame; $X^H(k, n)$ is the conjugate transpose of X(k, n); $\varphi_p(n) = \frac{G'(r_p(n))}{r_p(n)}$ is the weighting coefficient, in which $r_p(n) = \sqrt{\sum_{k=1}^{K} |Y_p(k, n)|^2}$ is the auxiliary variable and $G(r_p(n))$ is the contrast function.
Step S405: update separation matrix W (k, τ):
wi(k,τ)=(W(k,τ-1)Vi(k,τ))-1ei
Figure BDA00030845564900001312
W(k,τ)=[w1(k,τ),w2(k,τ)]H(ii) a i is 1, 2. Wherein e isiIs a feature vector.
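Steps S404 and S405 together form one frame of an auxiliary-function-based update. The sketch below assumes the contrast function G(r) = r (so the weight is 1/r), a common but not mandated choice, and operates at a single frequency bin for two sources and two microphones:

```python
import numpy as np

def auxiva_update(W_prev, V_prev, X_n, Y_n, beta=0.98, eps=1e-8):
    """One-frame update at a single frequency bin (2 sources, 2 mics).

    W_prev: previous separation matrix, shape (2, 2), complex.
    V_prev: previous weighted covariances, shape (2, 2, 2); V_prev[p] for source p.
    X_n: observation vector at this bin/frame, shape (2,), complex.
    Y_n: prior separated estimates over ALL bins of this frame, shape (2, K),
         used only through the auxiliary variable r_p.
    """
    V = np.empty_like(V_prev)
    rows = []
    for p in range(2):
        r_p = np.sqrt(np.sum(np.abs(Y_n[p]) ** 2)) + eps   # auxiliary variable
        phi = 1.0 / r_p                                    # G(r) = r  =>  G'(r)/r = 1/r
        V[p] = beta * V_prev[p] + (1 - beta) * phi * np.outer(X_n, X_n.conj())
        e_p = np.eye(2)[:, p]                              # p-th basis vector
        w_p = np.linalg.solve(W_prev @ V[p], e_p)          # (W V_p)^{-1} e_p
        w_p /= np.sqrt(np.real(w_p.conj() @ V[p] @ w_p))   # normalize w^H V w = 1
        rows.append(w_p)
    W = np.stack(rows).conj()                              # W = [w_1, w_2]^H
    return W, V
```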
Step S406: and (3) carrying out amplitude deblurring processing on W (k, tau) by using an MDP algorithm:
w (k, τ) ═ diag (invW (k, τ)) · W (k, τ), where invW (k, τ) is the inverse matrix of W (k, τ). diag (invW (k, τ)) indicates that the non-dominant diagonal element is set to 0.
In some embodiments, the original noisy signals may be separated by using the processed separation matrix to obtain respective frequency domain estimation signals of each sound source, but a certain noise residual may still exist in the separated signals. Therefore, in order to improve the signal-to-noise ratio, the separated signals may be post-processed here. The method specifically comprises the following steps:
step S407: determine the observed signal of each sound source at each microphone using the separation matrix.

The original noisy signal is separated with the MDP-processed W(k, τ), giving $Y(k, \tau) = [Y_1(k, \tau), Y_2(k, \tau)]^T$. By the property of the MDP algorithm, the recovered frequency domain estimate Y(k, τ) is exactly the estimate of each sound source's observed data at the corresponding microphone, i.e., the estimate of the observed signal of sound source s1 at mic1 is $Y_1(k, \tau) = h_{11} s_1(k, \tau)$, denoted $Y_{11}(k, \tau) = Y_1(k, \tau)$, and the estimate of the observed signal of sound source s2 at mic2 is $Y_2(k, \tau) = h_{22} s_2(k, \tau)$, denoted $Y_{22}(k, \tau) = Y_2(k, \tau)$.

Since the observed signal at each microphone is a superposition of the two sources' observations, the estimate of the observed data of sound source s2 at mic1 is $Y_{12}(k, \tau) = X_1(k, \tau) - Y_{11}(k, \tau)$, and the estimate of the observed data of sound source s1 at mic2 is $Y_{21}(k, \tau) = X_2(k, \tau) - Y_{22}(k, \tau)$.
In this way, based on the MDP algorithm, the observed signals of each sound source at each microphone are completely recovered, and the original phase information is retained, so that the azimuth of each sound source can be further estimated based on the signals.
Step S408: and (3) estimating the azimuth of each sound source by using the observation signal estimation of each sound source at each mic position respectively by using an SRP-PHAT algorithm:
y pair by using SRP-PAHT algorithm11(k, τ) and Y21(k, τ) process to estimate the bearing of the sound source s 1:
the above-mentioned observation estimation signal of the sound source s1 is traversed:
Figure BDA0003084556490000141
wherein, Yi(τ)=[Yi(1,τ),…,Yi(K,τ)]TThe signal is estimated for the observations of the ith microphone's frame τ. K is the length of the system frame; means that the two vector correspondences are multiplied; denotes the adjoint.
Similarly, the sound source s2 is also processed by the algorithm.
For any point s on the unit sphere with coordinates $(s_x, s_y, s_z)$, satisfying $s_x^2 + s_y^2 + s_z^2 = 1$, the delay difference between any two microphones is calculated as

$$\tau(s) = \frac{f_s}{c}\big(\|s - m_1\| - \|s - m_2\|\big),$$

where $f_s$ is the sampling frequency of the system, c is the speed of sound, and $m_1$, $m_2$ are the microphone coordinates.

From τ(s), the corresponding SRP (Steered Response Power) is found:

$$\mathrm{SRP}(s) = \mathrm{Re}\left\{\sum_{k=1}^{K} R(k, \tau)\, e^{j 2\pi (k-1)\tau(s)/N_{fft}}\right\}.$$

Traversing all points s on the unit sphere, the point with the maximum SRP value gives the estimated sound source position:

$$\hat{s} = \arg\max_{s} \mathrm{SRP}(s).$$
In the embodiment of the present disclosure, the estimated coordinate information of the sound sources may be obtained by the above method, for example: the coordinates of sound source s1 are $\hat{s}_1 = (\hat{x}_{s1}, \hat{y}_{s1}, \hat{z}_{s1})$ and the coordinates of sound source s2 are $\hat{s}_2 = (\hat{x}_{s2}, \hat{y}_{s2}, \hat{z}_{s2})$.
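A grid-search sketch of the localization just described; the candidate grid, the PHAT weighting, and the real-part SRP objective follow the standard SRP-PHAT algorithm and should be read as assumptions where the text leaves details open:

```python
import numpy as np

def srp_phat_direction(Y1, Y2, mic1, mic2, candidates, fs=16000, c=343.0, nfft=1024):
    """Pick the unit-sphere candidate s that maximizes the PHAT-weighted SRP.

    Y1, Y2: observation estimates of ONE source at the two mics for one frame, shape (K,).
    candidates: array of unit-sphere points, shape (S, 3).
    """
    K = Y1.shape[0]
    cross = Y1 * Y2.conj()
    R = cross / (np.abs(cross) + 1e-12)     # PHAT weighting: keep phase only
    k = np.arange(K)
    best_s, best_val = None, -np.inf
    for s in candidates:
        # delay difference (in samples) for this candidate position
        tau = fs * (np.linalg.norm(s - mic1) - np.linalg.norm(s - mic2)) / c
        srp = np.real(np.sum(R * np.exp(1j * 2 * np.pi * k * tau / nfft)))
        if srp > best_val:
            best_s, best_val = s, srp
    return best_s
```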
Step S409: and (3) obtaining main path signals of each sound source by using a delay-sum beam forming technology, and improving the signal-to-noise ratio:
determining the time delay difference from each sound source to each microphone based on the estimated azimuth information:
Figure BDA0003084556490000158
Figure BDA0003084556490000159
Figure BDA00030845564900001510
Figure BDA00030845564900001511
using Y11(k, τ) and Y21(k, τ) to the sound source s1Performing beam delay and sum beam forming to obtain the enhanced output of the main channel
Figure BDA00030845564900001512
Using Y12(k, τ) and Y22(k, τ) to the sound source s2Performing beam delay and sum beamforming to obtain an enhanced output signal of a main path thereof:
Figure BDA0003084556490000161
where K is 1,2, …, K.
Step S410, removing interference residue from the enhanced output signal:
for sound source s1Further removing YM1Interference residue in (k, τ), using Y22(k, τ) or Y12(k, τ) as reference signal, in this case Y22(k, τ) through an adaptive filter
Figure BDA0003084556490000162
To estimate YM1Interference remains in (k, τ). Note YM1The interference residue in (k, τ) is YC1(k, τ), then:
Figure BDA0003084556490000163
Figure BDA0003084556490000164
the updating method comprises the following steps:
Figure BDA0003084556490000165
wherein the content of the first and second substances,
u1(k,n)=[|Y22(k,n)|2,...,|Y22(k,n-L+1)|2]is a reference input vector;
Figure BDA0003084556490000166
to estimate the error.
The output after adaptive noise cancellation is:
Figure BDA0003084556490000167
the YM is further removed from the sound source s2 in the same manner2Interference residue in (k, τ), using Y11(k, τ) or Y21(k, τ) as reference signal, in this case Y11(k, τ) through an adaptive filter
Figure BDA0003084556490000168
L-1, to estimate YM2Interference remains in (k, τ). Note YM2The interference residue in (k, τ) is YC2(k, τ), then:
Figure BDA0003084556490000169
Figure BDA00030845564900001610
the updating method comprises the following steps:
Figure BDA00030845564900001611
wherein u is2(k,n)=[|Y11(k,n)|2,...,|Y11(k,n-L+1)|2]Is a reference input vector;
Figure BDA0003084556490000171
to estimate the error;
the output after adaptive noise cancellation is:
Figure BDA0003084556490000172
in this way, a separated signal with reduced residual interference can be obtained.
Step S411: and performing time-frequency conversion on the separated signals to obtain time-domain audio signals emitted by each sound source.
Are respectively paired with YE1(τ)=[YE1(1,τ),...,YE1(K,τ)]And YE2(τ)=[YE2(1,τ),...,YE2(K,τ)]K is 1, K is subjected to ISTFT and overlap addition to obtain a time domain sound source signal which is subjected to separation and post-processing enhancement and is recorded as
Figure BDA0003084556490000173
Wherein m is 1, …, Nfft; i is 1, 2.
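A minimal overlap-add synthesis sketch matching this step; the Hann synthesis window and the window-squared normalization are assumptions about the analysis/synthesis pairing:

```python
import numpy as np

def istft_overlap_add(Y, nfft=1024, hop=512):
    """Inverse STFT with overlap-add; Y has shape (T, K) with K = nfft // 2 + 1."""
    window = np.hanning(nfft)
    T = Y.shape[0]
    length = hop * (T - 1) + nfft
    out = np.zeros(length)
    norm = np.zeros(length)
    for n in range(T):
        out[n * hop:n * hop + nfft] += np.fft.irfft(Y[n], n=nfft) * window
        norm[n * hop:n * hop + nfft] += window ** 2   # accumulate for normalization
    return out / np.maximum(norm, 1e-12)
```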
Fig. 5 is a block diagram illustrating an apparatus for processing an audio signal according to an exemplary embodiment. Referring to fig. 5, the apparatus 500 includes a first obtaining module 501, a separating module 502, a first determining module 503, a second determining module 504, and a third determining module 505.
A first obtaining module 501, configured to obtain original noisy signals that are collected by at least two microphones for at least two sound sources respectively;
a separation module 502, configured to perform sound source separation on original noisy signals of the at least two microphones, to obtain frequency domain estimation signals of the at least two sound sources;
a first determining module 503, configured to determine the observed estimation signals of each sound source at the at least two microphones based on the respective frequency domain estimation signals of the at least two sound sources;
a second determining module 504, configured to determine an enhanced output signal for each sound source based on the observed estimated signals corresponding to the at least two sound sources;
a third determining module 505, configured to determine, according to the enhanced output signal of each sound source after the filtering processing, audio signals emitted by the at least two sound sources respectively.
In some embodiments, the separation module comprises:
the separation submodule is used for performing sound source separation on the original noisy signals by using the deblurred separation matrix of each frame of signals, to obtain the respective frequency domain estimation signals of the at least two sound sources; wherein the frequency domain estimation signals carry phase information of the audio signals emitted by the sound sources.
In some embodiments, the apparatus further comprises:
and the fourth determining module is used for determining the deblurred separation matrix by using the separation matrix and the inverse matrix of the separation matrix.
In some embodiments, the apparatus further comprises:
a fifth determining module, configured to determine, when the current frame is not the first frame, a separation matrix of the current frame based on the separation matrix of the previous frame of the current frame and an original noisy signal of the current frame; or
And a sixth determining module, configured to determine, when the current frame is the first frame, a separation matrix of the current frame based on the predetermined identity matrix and the original noisy signal of the current frame.
In some embodiments, the observation estimation signal carries phase information of the audio signal emitted by the sound source; the second determining module includes:
a first determining submodule, configured to determine estimated coordinate information of the at least two sound sources according to the observation estimation signal;
a second determining submodule, configured to determine, according to the estimated coordinate information and the coordinate information of the at least two microphones, a delay difference from the at least two sound sources to the at least two microphones;
a third determining submodule, configured to determine the enhanced output signal of each sound source according to the time delay difference.
In some embodiments, the first determining submodule is specifically configured to:
determining the distance from each sound source to the at least two microphones respectively according to the estimated coordinate information and the coordinate information of the at least two microphones;
and determining the time delay difference according to the distance and the sound velocity.
In some embodiments, the third determining submodule is specifically configured to:
and determining an enhanced output signal of each sound source according to the time delay difference and the observation estimation signals of each sound source at each microphone.
In some embodiments, the apparatus further comprises:
and the filtering module is used for carrying out filtering processing on the enhanced output signal of each sound source according to the observation estimation signal.
In some embodiments, the filtering module comprises:
a fourth determining submodule, configured to determine an interference signal of the enhanced output signal according to the observation estimation signal;
and the filtering submodule is used for carrying out filtering processing on the enhanced output signal according to the interference signal.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 6 is a block diagram illustrating a physical structure of an audio signal processing apparatus 600 according to an exemplary embodiment. For example, the apparatus 600 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and so forth.
Referring to fig. 6, apparatus 600 may include one or more of the following components: a processing component 601, a memory 602, a power component 603, a multimedia component 604, an audio component 605, an input/output (I/O) interface 606, a sensor component 607, and a communication component 608.
The processing component 601 generally controls the overall operation of the device 600, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 601 may include one or more processors 610 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 601 may also include one or more modules that facilitate interaction between the processing component 601 and other components. For example, the processing component 601 may include a multimedia module to facilitate interaction between the multimedia component 604 and the processing component 601.
The memory 602 is configured to store various types of data to support operations at the apparatus 600. Examples of such data include instructions for any application or method operating on the apparatus 600, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 602 may be implemented by any type or combination of volatile or non-volatile storage devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 603 provides power to the various components of the device 600. The power supply component 603 may include: a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the device 600.
The multimedia component 604 includes a screen that provides an output interface between the device 600 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 604 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the device 600 is in an operating mode, such as a shooting mode or a video mode. Each front camera and/or rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
Audio component 605 is configured to output and/or input audio signals. For example, audio component 605 includes a Microphone (MIC) configured to receive external audio signals when apparatus 600 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 602 or transmitted via the communication component 608. In some embodiments, audio component 605 also includes a speaker for outputting audio signals.
The I/O interface 606 provides an interface between the processing component 601 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 607 includes one or more sensors for providing various aspects of status assessment for the apparatus 600. For example, the sensor component 607 may detect the open/closed state of the apparatus 600 and the relative positioning of components, such as the display and keypad of the apparatus 600. The sensor component 607 may also detect a change in the position of the apparatus 600 or a component of the apparatus 600, the presence or absence of user contact with the apparatus 600, the orientation or acceleration/deceleration of the apparatus 600, and a change in the temperature of the apparatus 600. The sensor component 607 may include a proximity sensor configured to detect the presence of nearby objects in the absence of any physical contact. The sensor component 607 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 607 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 608 is configured to facilitate wired or wireless communication between the apparatus 600 and other devices. The apparatus 600 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 608 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 608 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, or other technologies.
In an exemplary embodiment, the apparatus 600 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as the memory 602 comprising instructions, executable by the processor 610 of the apparatus 600 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer readable storage medium having instructions therein, which when executed by a processor of a mobile terminal, enable the mobile terminal to perform any of the methods provided in the above embodiments.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (20)

1. An audio signal processing method, comprising:
acquiring original noisy signals collected by at least two microphones for at least two sound sources respectively;
performing sound source separation on the original noisy signals of the at least two microphones to obtain respective frequency domain estimation signals of the at least two sound sources;
determining observation estimation signals of each sound source at the at least two microphones respectively, based on the respective frequency domain estimation signals of the at least two sound sources;
determining an enhanced output signal of each sound source based on the observation estimation signals corresponding to the at least two sound sources;
and determining, according to the enhanced output signal of each sound source after filtering processing, the audio signals emitted by the at least two sound sources respectively.
2. The method according to claim 1, wherein the performing sound source separation on the original noisy signals of the at least two microphones to obtain respective frequency domain estimation signals of the at least two sound sources comprises:
performing sound source separation on the original noisy signals by using the deblurred separation matrix of each frame of signals to obtain respective frequency domain estimation signals of the at least two sound sources; wherein the frequency domain estimation signal carries phase information of the audio signal emitted by the sound source.
3. The method of claim 2, further comprising:
and determining the deblurred separation matrix by using the separation matrix and the inverse matrix of the separation matrix.
4. The method of claim 2, further comprising:
when the current frame is not the first frame, determining the separation matrix of the current frame based on the separation matrix of the previous frame of the current frame and the original noisy signal of the current frame; or
when the current frame is the first frame, determining the separation matrix of the current frame based on a predetermined identity matrix and the original noisy signal of the current frame.
5. The method according to claim 1, wherein the observation estimation signal carries phase information of the audio signal emitted by the sound source; and the determining an enhanced output signal of each sound source based on the observation estimation signals corresponding to the at least two sound sources comprises:
determining estimated coordinate information of the at least two sound sources according to the observation estimation signals;
determining time delay differences from the at least two sound sources to the at least two microphones according to the estimated coordinate information and the coordinate information of the at least two microphones;
determining the enhanced output signal of each sound source according to the time delay difference.
6. The method of claim 5, wherein the determining time delay differences from the at least two sound sources to the at least two microphones according to the estimated coordinate information and the coordinate information of the at least two microphones comprises:
determining the distance from each sound source to the at least two microphones respectively according to the estimated coordinate information and the coordinate information of the at least two microphones;
and determining the time delay difference according to the distance and the sound velocity.
7. The method of claim 5, wherein the determining the enhanced output signal of each sound source according to the time delay difference comprises:
and determining an enhanced output signal of each sound source according to the time delay difference and the observation estimation signals of each sound source at each microphone.
8. The method of claim 1, further comprising:
and performing the filtering processing on the enhanced output signal of each sound source according to the observation estimation signal.
9. The method of claim 8, wherein said performing said filtering process on said enhanced output signal of each sound source based on said observed estimated signal comprises:
determining an interference signal of the enhanced output signal according to the observation estimation signal;
and carrying out the filtering processing on the enhanced output signal according to the interference signal.
10. An audio signal processing apparatus, comprising:
a first acquisition module, configured to acquire original noisy signals collected by at least two microphones for at least two sound sources respectively;
a separation module, configured to perform sound source separation on the original noisy signals of the at least two microphones to obtain respective frequency domain estimation signals of the at least two sound sources;
a first determining module, configured to determine, based on the respective frequency domain estimation signals of the at least two sound sources, observation estimation signals of each sound source at the at least two microphones respectively;
a second determining module, configured to determine an enhanced output signal of each sound source based on the observation estimation signals corresponding to the at least two sound sources;
and a third determining module, configured to determine, according to the enhanced output signal of each sound source after filtering processing, the audio signals emitted by the at least two sound sources respectively.
11. The apparatus of claim 10, wherein the separation module comprises:
a separation submodule, configured to perform sound source separation on the original noisy signals by using the deblurred separation matrix of each frame of signals, so as to obtain respective frequency domain estimation signals of the at least two sound sources; wherein the frequency domain estimation signal carries phase information of the audio signal emitted by the sound source.
12. The apparatus of claim 11, further comprising:
a fourth determining module, configured to determine the deblurred separation matrix by using the separation matrix and the inverse matrix of the separation matrix.
13. The apparatus of claim 11, further comprising:
a fifth determining module, configured to determine, when the current frame is not the first frame, a separation matrix of the current frame based on the separation matrix of the previous frame of the current frame and an original noisy signal of the current frame; or
a sixth determining module, configured to determine, when the current frame is the first frame, the separation matrix of the current frame based on a predetermined identity matrix and the original noisy signal of the current frame.
14. The apparatus according to claim 10, wherein the observation estimation signal carries phase information of the audio signal emitted by the sound source; the second determining module includes:
a first determining submodule, configured to determine estimated coordinate information of the at least two sound sources according to the observation estimation signal;
a second determining submodule, configured to determine, according to the estimated coordinate information and the coordinate information of the at least two microphones, time delay differences from the at least two sound sources to the at least two microphones;
a third determining submodule, configured to determine the enhanced output signal of each sound source according to the time delay difference.
15. The apparatus according to claim 14, wherein the second determining submodule is specifically configured to:
determining the distance from each sound source to the at least two microphones respectively according to the estimated coordinate information and the coordinate information of the at least two microphones;
and determining the time delay difference according to the distance and the sound velocity.
16. The apparatus according to claim 14, wherein the third determining submodule is specifically configured to:
and determining an enhanced output signal of each sound source according to the time delay difference and the observation estimation signals of each sound source at each microphone.
17. The apparatus of claim 10, further comprising:
a filtering module, configured to perform filtering processing on the enhanced output signal of each sound source according to the observation estimation signals.
18. The apparatus of claim 17, wherein the filtering module comprises:
a fourth determining submodule, configured to determine an interference signal of the enhanced output signal according to the observation estimation signal;
and a filtering submodule, configured to perform filtering processing on the enhanced output signal according to the interference signal.
19. An audio signal processing apparatus, characterized by comprising at least: a processor and a memory for storing executable instructions operable on the processor, wherein:
the processor is configured to execute the executable instructions to perform the steps of the audio signal processing method provided in any one of claims 1 to 9.
20. A non-transitory computer-readable storage medium having stored therein computer-executable instructions that, when executed by a processor, implement the steps in the audio signal processing method provided in any one of claims 1 to 9.
Priority Applications (1)

Application number: CN202110582749.1A; priority date: 2021-05-26; filing date: 2021-05-26; title: Audio signal processing method and device and storage medium; status: pending

Publications (1)

Publication number: CN113362847A; publication date: 2021-09-07

Family

ID=77527795

Citations (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090012779A1 (en) * 2007-03-05 2009-01-08 Yohei Ikeda Sound source separation apparatus and sound source separation method
CN101911724A (en) * 2008-03-18 2010-12-08 高通股份有限公司 Speech enhancement using multiple microphones on multiple devices
CN106531156A (en) * 2016-10-19 2017-03-22 兰州交通大学 Speech signal enhancement technology method based on indoor multi-mobile source real-time processing
CN106887239A (en) * 2008-01-29 2017-06-23 高通股份有限公司 For the enhanced blind source separation algorithm of the mixture of height correlation
US20170193975A1 (en) * 2015-12-31 2017-07-06 Harman International Industries, Inc. Active noise-control system with source-separated reference signal
CN107919133A (en) * 2016-10-09 2018-04-17 赛谛听股份有限公司 For the speech-enhancement system and sound enhancement method of destination object
CN108463848A (en) * 2016-03-23 2018-08-28 谷歌有限责任公司 Adaptive audio for multichannel speech recognition enhances
CN108962276A (en) * 2018-07-24 2018-12-07 北京三听科技有限公司 A kind of speech separating method and device
CN109243483A (en) * 2018-10-17 2019-01-18 西安交通大学 A kind of noisy frequency domain convolution blind source separation method
CN109584900A (en) * 2018-11-15 2019-04-05 昆明理工大学 A kind of blind source separation algorithm of signals and associated noises
CN110010148A (en) * 2019-03-19 2019-07-12 中国科学院声学研究所 A kind of blind separation method in frequency domain and system of low complex degree
US20190355374A1 (en) * 2018-05-16 2019-11-21 Nanjing Horizon Robotics Technology Co., Ltd. Method and apparatus for reducing noise of mixed signal
CN111009257A (en) * 2019-12-17 2020-04-14 北京小米智能科技有限公司 Audio signal processing method and device, terminal and storage medium
CN111128221A (en) * 2019-12-17 2020-05-08 北京小米智能科技有限公司 Audio signal processing method and device, terminal and storage medium
CN111402917A (en) * 2020-03-13 2020-07-10 北京松果电子有限公司 Audio signal processing method and device and storage medium
CN111429939A (en) * 2020-02-20 2020-07-17 西安声联科技有限公司 Sound signal separation method of double sound sources and sound pickup
CN111435598A (en) * 2019-01-15 2020-07-21 北京地平线机器人技术研发有限公司 Voice signal processing method and device, computer readable medium and electronic equipment
CN112349292A (en) * 2020-11-02 2021-02-09 深圳地平线机器人科技有限公司 Signal separation method and device, computer readable storage medium, electronic device
CN112565119A (en) * 2020-11-30 2021-03-26 西北工业大学 Broadband DOA estimation method based on time-varying mixed signal blind separation

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination