CN113053408B - Sound source separation method and device - Google Patents
Publication number: CN113053408B (application CN202110268230.6A)
Authority: CN (China)
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L2021/02161—Number of inputs available containing the signal or the noise to be suppressed
- G10L2021/02166—Microphone arrays; Beamforming
Abstract
The invention relates to a sound source separation method comprising the following steps: according to an array element spacing preset in a microphone array, setting a first differential beamformer at a first end of the array and a second differential beamformer at a second end; transforming the received mixed signal to the short-time frequency domain to obtain a first signal; calculating a DOA estimate for each frame of the first signal; calculating a first DOA error and a second DOA error; feeding the first signal into the first and second differential beamformers to obtain a first far-end signal, a first near-end signal, a second far-end signal, and a second near-end signal; performing, according to the first DOA error, first adaptive cancellation processing on the first near-end signal and the second far-end signal to obtain a first output signal; performing, according to the second DOA error, second adaptive cancellation processing on the first far-end signal and the second near-end signal to obtain a second output signal; and applying short-time inverse Fourier transforms to the first and second output signals to obtain a first separated signal and a second separated signal.
Description
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a sound source separation method and apparatus.
Background
In the prior art, sound sources can be separated by designing two fixed beamformers for a uniform linear microphone array, with main lobes pointing to the two endfire directions of the linear array; the array's received signals are weighted by the two fixed weight vectors to produce two output signals, which are the separated signals. Alternatively, blind source separation is performed, for example by independent component analysis.
However, with a conventional fixed beamformer, when the number of array elements is small and the array aperture is small, the low-frequency main lobe is wide and suppression of the opposite-end signal is weak, so a large amount of opposite-end signal remains in the separated signal.
Other blind source separation methods, meanwhile, must solve for a separation matrix and therefore have high computational complexity.
Disclosure of Invention
The invention aims to provide a sound source separation method that addresses the defects of the prior art, namely the large amount of opposite-end signal remaining in the separated signal and the high computational complexity of blind source separation methods.
In a first aspect, the present invention provides a sound source separation method, including:
according to an array element distance preset in a microphone array, arranging a first differential beam former at a first end of the microphone array and arranging a second differential beam former at a second end of the microphone array; wherein the main lobe direction of the first differential beamformer is towards the first end, the null direction of the first differential beamformer is towards the second end, the main lobe direction of the second differential beamformer is towards the second end, and the null direction of the second differential beamformer is towards the first end;
carrying out short-time Fourier transform on a mixed signal received by a microphone array, and transforming the mixed signal to a short-time frequency domain to obtain a first signal; wherein the mixed signal is a mixed signal generated by a first sound source at the first end and a second sound source at the second end;
calculating a direction of arrival (DOA) estimate for each frame in the first signal;
calculating a first DOA error corresponding to a first sound source and a second DOA error corresponding to a second sound source according to the DOA estimation;
inputting the first signal into a first differential beam former and a second differential beam former respectively to obtain a first far-end signal and a first near-end signal output by the first differential beam former and a second far-end signal and a second near-end signal output by the second differential beam former;
according to the first DOA error, performing first adaptive cancellation processing on the first near-end signal and the second far-end signal to obtain a first output signal; according to the second DOA error, second self-adaptive cancellation processing is carried out on the first far-end signal and the second near-end signal to obtain a second output signal;
and respectively carrying out short-time Fourier inverse transformation on the first output signal and the second output signal to obtain a first separation signal and a second separation signal.
Preferably, the array element spacing is in the range of 2.0 cm to 3.5 cm.
Preferably, the calculating, according to the DOA estimation, a first DOA error corresponding to a first sound source and a second DOA error corresponding to a second sound source specifically includes:
calculating the first DOA error according to the formula err_A = |0° - θ|;
calculating the second DOA error according to the formula err_B = |180° - θ|;
wherein err_A is the first DOA error, err_B is the second DOA error, and θ is the DOA estimate.
Preferably, performing the first adaptive cancellation processing on the first near-end signal and the second far-end signal according to the first DOA error to obtain the first output signal specifically includes:
comparing the first DOA error with a preset error threshold, and when the first DOA error is not greater than the preset error threshold, taking the first near-end signal as the first output signal of the current frame without updating the first adaptive processing filter coefficients;
and when the first DOA error is greater than the preset error threshold, treating the current frame as the second far-end signal, which is not retained, and updating the first adaptive processing filter coefficients.
Preferably, performing the second adaptive cancellation processing on the first far-end signal and the second near-end signal according to the second DOA error to obtain the second output signal specifically includes:
comparing the second DOA error with a preset error threshold, and when the second DOA error is not greater than the preset error threshold, taking the second near-end signal as the second output signal of the current frame without updating the second adaptive processing filter coefficients;
and when the second DOA error is greater than the preset error threshold, treating the current frame as the first far-end signal, which is not retained, and updating the second adaptive processing filter coefficients.
In a second aspect, the present invention provides a sound source separating apparatus comprising:
a setting unit, configured to set a first differential beamformer at the first end of the microphone array and a second differential beamformer at the second end of the microphone array according to an array element spacing preset in the microphone array; wherein the main lobe direction of the first differential beamformer faces the first end, its null direction faces the second end, the main lobe direction of the second differential beamformer faces the second end, and its null direction faces the first end;
a transformation unit, configured to perform short-time Fourier transform on a mixed signal received by the microphone array and transform it to the short-time frequency domain to obtain a first signal; wherein the mixed signal is generated by a first sound source at the first end and a second sound source at the second end;
a calculation unit for calculating a direction of arrival, DOA, estimate for each frame in the first signal;
the calculating unit is further used for calculating a first DOA error corresponding to the first sound source and a second DOA error corresponding to the second sound source according to the DOA estimation;
a processing unit, configured to input the first signal into a first differential beam former and a second differential beam former respectively, so as to obtain a first far-end signal and a first near-end signal output by the first differential beam former, and a second far-end signal and a second near-end signal output by the second differential beam former;
the processing unit is further configured to perform a first adaptive cancellation process on the first near-end signal and the second far-end signal according to the first DOA error to obtain a first output signal; according to the second DOA error, second self-adaptive cancellation processing is carried out on the first far-end signal and the second near-end signal to obtain a second output signal;
the transformation unit is further configured to perform short-time inverse fourier transformation on the first output signal and the second output signal, respectively, to obtain a first separated signal and a second separated signal.
Preferably, the array element spacing is in the range of 2.0 cm to 3.5 cm.
Preferably, the computing unit is specifically configured to:
according to the formula err_A = |0° - θ|, calculating the first DOA error;
according to the formula err_B = |180° - θ|, calculating the second DOA error;
wherein err_A is the first DOA error, err_B is the second DOA error, and θ is the DOA estimate.
Preferably, the processing unit is specifically configured to:
comparing the first DOA error with a preset error threshold, and when the first DOA error is not greater than the preset error threshold, taking the first near-end signal as the first output signal of the current frame without updating the first adaptive processing filter coefficients;
and when the first DOA error is greater than the preset error threshold, treating the current frame as the second far-end signal and updating the first adaptive processing filter coefficients.
Preferably, the processing unit is specifically configured to:
comparing the second DOA error with a preset error threshold, and when the second DOA error is not greater than the preset error threshold, taking the second near-end signal as the second output signal of the current frame without updating the second adaptive processing filter coefficients;
and when the second DOA error is greater than the preset error threshold, treating the current frame as the first far-end signal and updating the second adaptive processing filter coefficients.
In a third aspect, the invention provides an apparatus comprising a memory for storing a program and a processor for performing the method of any of the first aspects.
In a fourth aspect, the present invention provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method according to any one of the first aspect.
In a fifth aspect, the invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the method of any of the first aspects.
According to the sound source separation method provided by the embodiment of the invention, adaptive cancellation processing is added after the outputs of the differential beamformers, so that less opposite-end interference remains in the output signals, and updates of the adaptive cancellation filter coefficients are gated by the DOA error, which reduces speech damage after separation. In addition, the method uses fixed beamforming and adaptive cancellation and does not involve solving a separation matrix, so its computational complexity is lower than that of blind source separation methods such as independent component analysis.
Drawings
Fig. 1 is a schematic flow chart of a sound source separation method according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a microphone array according to an embodiment of the invention;
fig. 3 is a schematic structural diagram of a sound source separation apparatus according to a second embodiment of the present invention.
Detailed Description
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments.
The terms "first," "second," and the like in the description and claims of the present application and in the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Fig. 1 is a schematic flow chart of a sound source separation method according to an embodiment of the present invention, where an execution subject of the method is a device with a computing function, such as a terminal and a server. The technical solution of the present invention is described in detail below with reference to fig. 1.
Step 110, according to an array element spacing preset in a microphone array, setting a first differential beamformer at a first end of the microphone array and a second differential beamformer at a second end of the microphone array; the main lobe direction of the first differential beamformer faces the first end and its null direction faces the second end, while the main lobe direction of the second differential beamformer faces the second end and its null direction faces the first end.
specifically, in the present application, the array element spacing may be set to 2.0cm to 3.5cm based on a small-spacing microphone array, and the speakers a and B are respectively located at two ends of the microphone array, such as the 2mic array shown in fig. 2. Two first-order differential beamformers can be designed according to the array element spacing, namely a first differential beamformer and a second differential beamformer, wherein the main lobe direction of the first differential beamformer is 0 degrees, namely the talker a direction, and the null direction is 180 degrees, namely the talker B direction, the first differential beamformer is opposite to the first differential beamformer, namely the main lobe direction of the second differential beamformer is 180 degrees, and the null direction is 0 degree. Speaker a corresponds to a first sound source and speaker B corresponds to a second sound source.
Step 120, performing short-time Fourier transform on the mixed signal received by the microphone array and transforming it to the short-time frequency domain to obtain a first signal. Specifically, the microphone array receives the mixed sound signal of the first sound source and the second sound source; since a speech signal is short-time stationary and is generally analyzed in the short-time frequency domain, the mixed sound signal is short-time Fourier transformed to obtain the first signal.
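As an illustrative sketch of this transform step (the frame length, hop, and Hann window are our assumptions, not values from the patent):

```python
import numpy as np

def stft(x, frame_len=512, hop=256):
    """Short-time Fourier transform: frame the signal, window each frame,
    and FFT it. Returns an array of shape (n_frames, frame_len // 2 + 1)."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * win
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)
```

Each microphone channel would be transformed this way, giving the per-frame, per-bin first signal on which the DOA estimation and beamforming below operate.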
Step 130, calculating a direction of arrival (DOA) estimate for each frame of the first signal. Specifically, the DOA estimate of each frame can be obtained in real time by any commonly used DOA estimation method, such as a beamforming algorithm, a subspace algorithm, or a deconvolution algorithm.
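The patent leaves the estimator open; as one concrete possibility for a 2-mic array, a per-frame phase-difference estimator might look like the following sketch (the function name, the 100 Hz cutoff, and the median pooling are our assumptions):

```python
import numpy as np

C = 343.0  # speed of sound, m/s

def doa_per_frame(X, freqs, d=0.03):
    """Broadband DOA estimate (degrees) for one STFT frame of a 2-mic array.

    X: array of shape (2, n_bins) holding the two channels' spectra.
    Converts the inter-channel phase difference of each bin to an angle
    and takes the median; one of many admissible estimators.
    """
    phase = np.angle(X[0] * np.conj(X[1]))  # phase lead of mic 1 over mic 2
    valid = freqs > 100.0                   # skip unreliable very-low bins
    cos_theta = C * phase[valid] / (2.0 * np.pi * freqs[valid] * d)
    cos_theta = np.clip(cos_theta, -1.0, 1.0)
    return np.rad2deg(np.arccos(np.median(cos_theta)))
```

A frame dominated by speaker A would yield an estimate near 0°, and one dominated by speaker B an estimate near 180°.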
Step 140, calculating the first DOA error corresponding to the first sound source and the second DOA error corresponding to the second sound source according to the DOA estimate. Specifically, the DOA estimate may be denoted θ; the first DOA error err_A, corresponding to the first sound source (the speaker A direction), and the second DOA error err_B, corresponding to the second sound source (the speaker B direction), are then calculated as err_A = |0° - θ| and err_B = |180° - θ|.
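As a minimal illustration, the two DOA errors are simply the angular distances of the frame's estimate from the two endfire directions (a sketch; the function name is ours, angles in degrees):

```python
def doa_errors(theta_deg):
    """DOA errors toward the two endfire directions (speakers A and B)."""
    err_a = abs(0.0 - theta_deg)    # distance to speaker A at 0 degrees
    err_b = abs(180.0 - theta_deg)  # distance to speaker B at 180 degrees
    return err_a, err_b
```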
Step 150, inputting the first signal into a first differential beam former and a second differential beam former respectively to obtain a first far-end signal and a first near-end signal output by the first differential beam former, and a second far-end signal and a second near-end signal output by the second differential beam former;
specifically, the first main lobe direction of the first differential beamformer is a, and the null direction is B, i.e., the a-direction signal is retained and the B-direction signal is suppressed in the output signal. The second differential beamformer, in contrast, retains the B-direction signals and suppresses the a-direction signals. The first near-end signal output by the first differential beamformer may be denoted as SA1 and the first far-end signal as SB1, and the second near-end signal SA2 and the second far-end signal output by the second differential beamformer may be denoted as SB 2.
Step 160, according to the first DOA error, performing a first adaptive cancellation process on the first near-end signal and the second far-end signal to obtain a first output signal; according to the second DOA error, second self-adaptive cancellation processing is carried out on the first far-end signal and the second near-end signal to obtain a second output signal;
for the first adaptive cancellation processing, the first DOA error may be compared with a preset error threshold, when the first DOA error is not greater than the preset error threshold, the first output signal of the current frame is a first near-end signal, and at this time, the coefficient of the first adaptive processing filter is not updated;
when the first DOA error is larger than the preset error threshold value, the current frame is the second far-end signal, and the first self-adaptive processing filter coefficient is updated, so that the first near-end signal of the current frame is continuously processed through the updated first self-adaptive processing filter coefficient.
Specifically, the first adaptive processing filter is updated from the input signals during the first adaptive cancellation process. The error threshold is an experimental value and can be set from experiments; for example, it may be set to θ_th = 30°. Updating the filter during target-direction speech would damage the target speech signal, so whether the first adaptive processing filter is updated is controlled by the real-time first DOA error: the filter is updated only on non-target signal, and during target-signal frames the current filter coefficients are merely used to process the data without being changed. If err_A ≤ θ_th, the frame is an A-direction signal and is to be retained, i.e. SA1 is kept, and the output is denoted T_A. If err_A > θ_th, the frame is interference noise or a B-direction signal and is to be cancelled; the filter coefficients are then updated, and the first output signal is determined frame by frame.
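The DOA-gated adaptive cancellation described above can be sketched as a per-bin, single-tap NLMS canceller whose coefficient update is frozen on target frames. The single-tap structure, step size, and threshold default are our assumptions; the patent only requires an LMS/NLMS/RLS-type canceller with DOA-gated updates:

```python
import numpy as np

def gated_nlms_cancel(SA1, SB2, err_a, theta_th=30.0, mu=0.5, eps=1e-8):
    """First adaptive cancellation with DOA-gated filter updates.

    SA1: first near-end beam output, complex STFT of shape (frames, bins).
    SB2: second far-end beam output, used as the interference reference.
    err_a: first DOA error per frame, in degrees.
    """
    n_frames, n_bins = SA1.shape
    w = np.zeros(n_bins, dtype=complex)  # one adaptive tap per frequency bin
    out = np.empty_like(SA1)
    for t in range(n_frames):
        e = SA1[t] - w * SB2[t]          # subtract estimated interference leakage
        out[t] = e
        if err_a[t] > theta_th:          # interference-dominated frame: adapt
            w += mu * np.conj(SB2[t]) * e / (np.abs(SB2[t]) ** 2 + eps)
        # target-speech frame: freeze w to avoid damaging the target voice
    return out
```

On interference frames the filter adapts toward the leakage path; on target frames the current coefficients are applied unchanged. The second cancellation would use the same routine with SB1 as the reference and SA2 as the primary input.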
Correspondingly, for the second adaptive cancellation processing, the second DOA error is compared with the preset error threshold. When the second DOA error is not greater than the preset error threshold, the second output signal of the current frame is the second near-end signal, and the second adaptive processing filter coefficients are not updated;
when the second DOA error is greater than the preset error threshold, the current frame is treated as the first far-end signal and the second adaptive processing filter coefficients are updated.
Specifically, the second adaptive cancellation processing is performed on the first far-end signal SB1 and the second near-end signal SA2, with err_B likewise controlling whether the second adaptive processing filter coefficients are updated: if err_B > θ_th, the frame is interference noise or an A-direction signal and is to be cancelled, so the second adaptive processing filter coefficients are updated; if err_B ≤ θ_th, the frame is a B-direction signal and is to be retained, i.e. SA2 is kept, and the output is denoted T_B.
The first adaptive cancellation processing may use any one of the Least Mean Square (LMS) algorithm, the Normalized LMS (NLMS) algorithm, or the Recursive Least Squares (RLS) algorithm. The second adaptive cancellation processing uses the same algorithm as the first.
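For reference, the per-frequency-bin NLMS coefficient update in such a canceller conventionally takes the following textbook form (not quoted from the patent; the symbols are ours):

```latex
e(n) = d(n) - w(n)\,x(n), \qquad
w(n+1) = w(n) + \mu \, \frac{x^{*}(n)\, e(n)}{\lvert x(n)\rvert^{2} + \varepsilon}
```

where d(n) is the near-end beam output, x(n) the opposite-end reference, w(n) the filter coefficient, μ the step size, and ε a small regularization constant; in this method the update is applied only on frames whose DOA error exceeds the preset threshold.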
And 170, performing short-time inverse Fourier transform on the first output signal and the second output signal respectively to obtain a first separation signal and a second separation signal.
Specifically, short-time inverse Fourier transforms are applied to the two output signals T_A and T_B respectively to obtain the final first separated signal A and second separated signal B.
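The reconstruction step can be sketched as a windowed overlap-add inverse STFT (the window and hop are assumptions, chosen to match a Hann-windowed analysis stage):

```python
import numpy as np

def istft(S, frame_len=512, hop=256):
    """Inverse STFT via windowed overlap-add with least-squares normalization.

    S: complex spectrogram of shape (n_frames, frame_len // 2 + 1), as
    produced by a Hann-windowed analysis STFT with the same hop.
    """
    win = np.hanning(frame_len)
    n_frames = S.shape[0]
    x = np.zeros((n_frames - 1) * hop + frame_len)
    norm = np.zeros_like(x)
    for t in range(n_frames):
        frame = np.fft.irfft(S[t], n=frame_len)
        x[t * hop : t * hop + frame_len] += frame * win   # synthesis window
        norm[t * hop : t * hop + frame_len] += win ** 2   # window-overlap energy
    return x / np.maximum(norm, 1e-12)                    # normalize interior
```

Running T_A and T_B through this routine would yield the two time-domain separated signals; the interior of the signal is reconstructed exactly when analysis and synthesis windows match.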
Furthermore, the scheme can be extended to arrays with more elements; it is only necessary to design the two corresponding differential beamformers for the linear microphone array.
According to the sound source separation method provided by the embodiment of the invention, adaptive cancellation processing is added after the outputs of the differential beamformers, so that less opposite-end interference remains in the output signals, and updates of the adaptive cancellation filter coefficients are gated by the DOA error, which reduces speech damage after separation. In addition, the method uses fixed beamforming and adaptive cancellation and does not involve solving a separation matrix, so its computational complexity is lower than that of blind source separation methods such as independent component analysis.
Fig. 3 is a schematic structural diagram of a sound source separation apparatus according to a second embodiment of the present invention, as shown in fig. 3, the sound source separation apparatus includes: a setting unit 310, a transformation unit 320, a calculation unit 330 and a processing unit 340.
The setting unit 310 is configured to set a first differential beamformer at the first end of the microphone array and a second differential beamformer at the second end of the microphone array according to an array element spacing preset in the microphone array; the main lobe direction of the first differential beamformer faces the first end and its null direction faces the second end, while the main lobe direction of the second differential beamformer faces the second end and its null direction faces the first end.
the transformation unit 320 is configured to perform short-time fourier transformation on the mixed signal received by the microphone array, and transform the mixed signal to a short-time frequency domain to obtain a first signal; the mixed signal is generated by a first sound source at a first end and a second sound source at a second end;
the calculating unit 330 is configured to calculate a direction of arrival DOA estimate for each frame in the first signal;
the calculating unit 330 is further configured to calculate, according to the DOA estimation, a first DOA error corresponding to the first sound source and a second DOA error corresponding to the second sound source;
the processing unit 340 is configured to input the first signal into the first differential beam former and the second differential beam former respectively to obtain a first far-end signal and a first near-end signal output by the first differential beam former, and a second far-end signal and a second near-end signal output by the second differential beam former;
the processing unit 340 is further configured to perform a first adaptive cancellation process on the first near-end signal and the second far-end signal according to the first DOA error to obtain a first output signal; according to the second DOA error, second self-adaptive cancellation processing is carried out on the first far-end signal and the second near-end signal to obtain a second output signal;
the transforming unit 320 is further configured to perform short-time inverse fourier transform on the first output signal and the second output signal, respectively, to obtain a first separated signal and a second separated signal.
Wherein the array element spacing is in the range of 2.0 cm to 3.5 cm.
Wherein, the calculating unit 330 is specifically configured to:
according to the formula err_A = |0° - θ|, calculating the first DOA error;
according to the formula err_B = |180° - θ|, calculating the second DOA error;
wherein err_A is the first DOA error, err_B is the second DOA error, and θ is the DOA estimate.
Wherein the processing unit 340 is specifically configured to:
comparing the first DOA error with a preset error threshold, and when the first DOA error is not greater than the preset error threshold, taking the first near-end signal as the first output signal of the current frame without updating the first adaptive processing filter coefficients;
and when the first DOA error is greater than the preset error threshold, treating the current frame as the second far-end signal and updating the first adaptive processing filter coefficients.
Wherein, the processing unit 340 is specifically configured to:
comparing the second DOA error with a preset error threshold, and when the second DOA error is not greater than the preset error threshold, taking the second near-end signal as the second output signal of the current frame without updating the second adaptive processing filter coefficients;
and when the second DOA error is greater than the preset error threshold, treating the current frame as the first far-end signal and updating the second adaptive processing filter coefficients.
According to the sound source separation device provided by the embodiment of the invention, adaptive cancellation processing is added after the outputs of the differential beamformers, so that less opposite-end interference remains in the output signals, and updates of the adaptive cancellation filter coefficients are gated by the DOA error, which reduces speech damage after separation. In addition, the device uses fixed beamforming and adaptive cancellation and does not involve solving a separation matrix, so its computational complexity is lower than that of blind source separation methods such as independent component analysis.
The third embodiment of the invention provides equipment, which comprises a memory and a processor, wherein the memory is used for storing programs, and the memory can be connected with the processor through a bus. The memory may be a non-volatile memory such as a hard disk drive and a flash memory, in which a software program and a device driver are stored. The software program is capable of performing various functions of the above-described methods provided by embodiments of the present invention; the device drivers may be network and interface drivers. The processor is used for executing a software program, and the software program can realize the method provided by the first embodiment of the invention when being executed.
A fourth embodiment of the present invention provides a computer program product containing instructions which, when the product runs on a computer, cause the computer to execute the method provided in the first embodiment of the present invention.
A fifth embodiment of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method provided in the first embodiment of the present invention.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, the various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (10)
1. A sound source separation method, characterized by comprising:
according to the array element distance preset in a microphone array, arranging a first differential beam former at a first end of the microphone array and arranging a second differential beam former at a second end of the microphone array; wherein the main lobe direction of the first differential beamformer is towards the first end, the null direction of the first differential beamformer is towards the second end, the main lobe direction of the second differential beamformer is towards the second end, and the null direction of the second differential beamformer is towards the first end;
carrying out short-time Fourier transform on a mixed signal received by a microphone array, and transforming the mixed signal to a short-time frequency domain to obtain a first signal; wherein the mixed signal is a mixed signal generated by a first sound source at the first end and a second sound source at the second end;
calculating a direction of arrival (DOA) estimate for each frame in the first signal;
calculating a first DOA error corresponding to a first sound source and a second DOA error corresponding to a second sound source according to the DOA estimation;
inputting the first signal into a first differential beam former and a second differential beam former respectively to obtain a first far-end signal and a first near-end signal output by the first differential beam former and a second far-end signal and a second near-end signal output by the second differential beam former;
according to the first DOA error, performing first adaptive cancellation processing on the first near-end signal and the second far-end signal to obtain a first output signal; and according to the second DOA error, performing second adaptive cancellation processing on the first far-end signal and the second near-end signal to obtain a second output signal;
and respectively carrying out short-time Fourier inverse transformation on the first output signal and the second output signal to obtain a first separation signal and a second separation signal.
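As an illustrative sketch only (not part of the claims), the pair of first-order differential beamformers of claim 1 can be realized for a two-element array in the short-time frequency domain. The function name and the default spacing and sound-speed values below are assumptions chosen for the example:

```python
import numpy as np

def differential_beam_pair(X1, X2, freqs, d=0.025, c=343.0):
    """First-order differential beams for a two-element endfire array.

    X1, X2 : complex STFT coefficients of one frame from the two microphones
    freqs  : bin centre frequencies in Hz
    d      : element spacing in metres (within the 2.0 cm to 3.5 cm range of claim 2)
    c      : speed of sound in m/s
    Returns one beam per end; each beam's null points at the opposite end.
    """
    # Phase factor of the acoustic propagation delay across one spacing.
    delay = np.exp(-1j * 2 * np.pi * freqs * d / c)
    beam_a = X1 - X2 * delay  # main lobe toward end 1, null toward end 2
    beam_b = X2 - X1 * delay  # main lobe toward end 2, null toward end 1
    return beam_a, beam_b
```

A plane wave arriving from one end is cancelled exactly by the beam whose null faces that end, which is the property the subsequent cancellation stage relies on.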
2. The method of claim 1, wherein the array element spacing is in the range of 2.0 cm to 3.5 cm.
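The claims do not spell out how the per-frame DOA estimate is obtained. One common two-microphone approach, sketched here as an assumption, infers the angle from the inter-channel phase difference; the claim-2 spacing keeps the array free of spatial aliasing below roughly c/(2d) ≈ 6.9 kHz at d = 2.5 cm. The function name and the 100 Hz lower cutoff are illustrative choices:

```python
import numpy as np

def doa_estimate(X1, X2, freqs, d=0.025, c=343.0):
    """Per-frame DOA in degrees from the inter-microphone phase difference.

    Convention: 0 deg is endfire toward microphone 1 and 180 deg is endfire
    toward microphone 2, matching the error formulas of claim 3. Only bins
    below the spatial-aliasing limit f < c / (2 d) are used, so that the
    per-bin phase difference is unambiguous.
    """
    valid = (freqs > 100.0) & (freqs < c / (2 * d))
    cross = X1[valid] * np.conj(X2[valid])              # cross-spectrum per bin
    tau = np.angle(cross) / (2 * np.pi * freqs[valid])  # inter-mic delay per bin (s)
    cos_theta = np.clip(np.mean(tau) * c / d, -1.0, 1.0)
    return np.degrees(np.arccos(cos_theta))
```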
3. The method according to claim 1, wherein said calculating, from said DOA estimates, a first DOA error for a first acoustic source and a second DOA error for a second acoustic source specifically comprises:
calculating the first DOA error according to the formula err_A = |0° - θ|;
calculating the second DOA error according to the formula err_B = |180° - θ|;
wherein err_A is the first DOA error, err_B is the second DOA error, and θ is the DOA estimate in degrees.
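The two formulas of claim 3 simply measure how far the frame's DOA estimate lies from the two endfire directions. A direct transcription (symbol names assumed):

```python
def doa_errors(theta_deg):
    """err_A = |0 - theta| and err_B = |180 - theta| per claim 3, theta in degrees."""
    return abs(0.0 - theta_deg), abs(180.0 - theta_deg)
```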
4. The method of claim 1, wherein the performing a first adaptive cancellation process on the first near-end signal and the second far-end signal according to the first DOA error to obtain a first output signal specifically comprises:
comparing the first DOA error with a preset error threshold; when the first DOA error is not greater than the preset error threshold, taking the first output signal of the current frame as the first near-end signal without updating the coefficients of the first adaptive processing filter;
and when the first DOA error is greater than the preset error threshold, determining that the current frame is the second far-end signal, and updating the coefficients of the first adaptive processing filter.
5. The method according to claim 1, wherein the performing, according to the second DOA error, second adaptive cancellation processing on the first far-end signal and the second near-end signal to obtain a second output signal specifically includes:
comparing the second DOA error with a preset error threshold; when the second DOA error is not greater than the preset error threshold, taking the second output signal of the current frame as the second near-end signal without updating the coefficients of the second adaptive processing filter;
and when the second DOA error is greater than the preset error threshold, determining that the current frame is the first far-end signal, and updating the coefficients of the second adaptive processing filter.
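Claims 4 and 5 gate the adaptation of the cancellation filter on the DOA error: the filter that removes far-end leakage from a near-end beam is updated only in frames dominated by the interfering source, and frozen otherwise, so adaptation cannot distort the desired speech. A per-bin normalized-LMS sketch of this gating; the step size, threshold, and regularizer are arbitrary illustrative values, not taken from the patent:

```python
import numpy as np

def gated_cancel(near, far, w, err_doa, err_thresh=30.0, mu=0.5, eps=1e-8):
    """One frame of DOA-gated adaptive cancellation (cf. claims 4 and 5).

    near, far : complex beamformer outputs for this frame, shape (n_bins,)
    w         : per-bin cancellation filter coefficients, shape (n_bins,)
    err_doa   : DOA error of the desired source for this frame, in degrees
    Returns (output frame, possibly updated coefficients).
    """
    out = near - w * far  # subtract the estimated far-end leakage
    if err_doa > err_thresh:
        # Frame is dominated by the interfering (far-end) source:
        # adapt with a normalized LMS step, independently per bin.
        w = w + mu * np.conj(far) * out / (np.abs(far) ** 2 + eps)
    # Otherwise the frame carries the desired source: freeze the filter.
    return out, w
```

When the far-end beam leaks into the near-end beam through a fixed per-bin gain, repeated updates during interferer-dominant frames drive the filter toward that gain, after which the leakage is cancelled and desired-source frames pass through with the filter frozen.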
6. A sound source separation device, characterized by comprising:
a setting unit, configured to arrange a first differential beam former at a first end of a microphone array and a second differential beam former at a second end of the microphone array according to an array element spacing preset in the microphone array; wherein the main lobe direction of the first differential beam former is toward the first end, the null direction of the first differential beam former is toward the second end, the main lobe direction of the second differential beam former is toward the second end, and the null direction of the second differential beam former is toward the first end;
a conversion unit, configured to perform a short-time Fourier transform on a mixed signal received by the microphone array and transform it to the short-time frequency domain to obtain a first signal; wherein the mixed signal is generated by a first sound source at the first end and a second sound source at the second end;
a calculation unit for calculating a direction of arrival, DOA, estimate for each frame in the first signal;
the calculating unit is further used for calculating a first DOA error corresponding to the first sound source and a second DOA error corresponding to the second sound source according to the DOA estimation;
a processing unit, configured to input the first signal into a first differential beam former and a second differential beam former respectively, so as to obtain a first far-end signal and a first near-end signal output by the first differential beam former, and a second far-end signal and a second near-end signal output by the second differential beam former;
the processing unit is further configured to perform first adaptive cancellation processing on the first near-end signal and the second far-end signal according to the first DOA error to obtain a first output signal, and to perform second adaptive cancellation processing on the first far-end signal and the second near-end signal according to the second DOA error to obtain a second output signal;
the transformation unit is further configured to perform short-time inverse fourier transformation on the first output signal and the second output signal, respectively, to obtain a first separated signal and a second separated signal.
7. The apparatus of claim 6, wherein the array element spacing is in the range of 2.0 cm to 3.5 cm.
8. The apparatus according to claim 6, wherein the computing unit is specifically configured to:
calculating the first DOA error according to the formula err_A = |0° - θ|;
calculating the second DOA error according to the formula err_B = |180° - θ|;
wherein err_A is the first DOA error, err_B is the second DOA error, and θ is the DOA estimate in degrees.
9. The apparatus according to claim 6, wherein the processing unit is specifically configured to:
comparing the first DOA error with a preset error threshold; when the first DOA error is not greater than the preset error threshold, taking the first output signal of the current frame as the first near-end signal without updating the coefficients of the first adaptive processing filter;
and when the first DOA error is greater than the preset error threshold, determining that the current frame is the second far-end signal, and updating the coefficients of the first adaptive processing filter.
10. The apparatus according to claim 6, wherein the processing unit is specifically configured to:
comparing the second DOA error with a preset error threshold; when the second DOA error is not greater than the preset error threshold, taking the second output signal of the current frame as the second near-end signal without updating the coefficients of the second adaptive processing filter;
and when the second DOA error is greater than the preset error threshold, determining that the current frame is the first far-end signal, and updating the coefficients of the second adaptive processing filter.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110268230.6A CN113053408B (en) | 2021-03-12 | 2021-03-12 | Sound source separation method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113053408A CN113053408A (en) | 2021-06-29 |
CN113053408B true CN113053408B (en) | 2022-06-14 |
Family
ID=76511725
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110268230.6A Active CN113053408B (en) | 2021-03-12 | 2021-03-12 | Sound source separation method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113053408B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102831898A (en) * | 2012-08-31 | 2012-12-19 | 厦门大学 | Microphone array voice enhancement device with sound source direction tracking function and method thereof |
CN106710603A (en) * | 2016-12-23 | 2017-05-24 | 上海语知义信息技术有限公司 | Speech recognition method and system based on linear microphone array |
CN110554357A (en) * | 2019-09-12 | 2019-12-10 | 苏州思必驰信息科技有限公司 | Sound source positioning method and device |
CN110931036A (en) * | 2019-12-07 | 2020-03-27 | 杭州国芯科技股份有限公司 | Microphone array beam forming method |
CN111429939A (en) * | 2020-02-20 | 2020-07-17 | 西安声联科技有限公司 | Sound signal separation method of double sound sources and sound pickup |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101470528B1 (en) * | 2008-06-09 | 2014-12-15 | 삼성전자주식회사 | Adaptive mode controller and method of adaptive beamforming based on detection of desired sound of speaker's direction |
US9210499B2 (en) * | 2012-12-13 | 2015-12-08 | Cisco Technology, Inc. | Spatial interference suppression using dual-microphone arrays |
Non-Patent Citations (3)
Title |
---|
Speech Enhancement Based on the General Transfer Function GSC and Postfiltering; Sharon Gannot, Israel Cohen; IEEE Transactions on Speech and Audio Processing; 20041231; full text *
A microphone array anti-reverberation algorithm using sidelobe enhancement (in Chinese); Li Jianwen et al.; Journal of Xiamen University (Natural Science); 20171231 (No. 05); full text *
A robust adaptive microphone array algorithm for speech recognition (in Chinese); Zhao Xianyu et al.; Journal of Tsinghua University (Science and Technology); 20041030 (No. 10); full text *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||