CN111429939B - Sound signal separation method of double sound sources and pickup - Google Patents

Sound signal separation method of double sound sources and pickup

Info

Publication number
CN111429939B
CN111429939B
Authority
CN
China
Prior art keywords
sound
sound source
array element
voice frame
microphone array
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010251574.1A
Other languages
Chinese (zh)
Other versions
CN111429939A (en)
Inventor
黄海
刘佳
隆弢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xi'an Shenglian Technology Co ltd
Original Assignee
Xi'an Shenglian Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xi'an Shenglian Technology Co ltd filed Critical Xi'an Shenglian Technology Co ltd
Publication of CN111429939A
Application granted
Publication of CN111429939B

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272: Voice signal separating
    • G10L21/028: Voice signal separating using properties of sound source

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Obtaining Desirable Characteristics In Audible-Bandwidth Transducers (AREA)

Abstract

An embodiment of the invention provides a sound signal separation method for dual sound sources and a sound pickup. A mixed sound signal is divided into speech frames; the delay differences with which each speech frame reaches different array element combinations of a microphone array are estimated; the propagation direction of each frame is then judged from the determined delay differences; and the sound signals corresponding to the different sound sources are separated in real time according to the propagation direction and output. Because the delay estimation is performed with the generalized cross-correlation algorithm, the delay can be estimated accurately while the computational load of the algorithm remains low, so the algorithm can track the sound source bearings accurately and efficiently in a real-time system, thereby achieving automatic separation of the sound signals of the first and second sound sources.

Description

Sound signal separation method of double sound sources and pickup
Technical Field
The invention relates to the technical field of speech processing, in particular to a sound signal separation method for dual sound sources and a sound pickup.
Background
In recent years, with the rapid development of speech recognition technology, multi-channel speech recognition scenarios have created an urgent demand for real-time sound source separation. For example, in important meeting scenarios, real-time minute-taking and the quality of the record play a significant role. In current practice, however, meeting records are either taken down and organized on site by hand, or the meeting is recorded first and the recording is played back and transcribed later. Both approaches are time-consuming and cumbersome manual work. Sound can also simply be captured as a recording, but when a particular passage needs to be reviewed, the whole recording must be played back, which takes a long time.
Sound source orientation techniques exist in the prior art, but they generally suffer from low positioning accuracy and poor real-time tracking; in addition, sound source separation switches between sources too slowly and misjudges the speech to be separated.
Therefore, in practical applications, sound source separation technology suffers from low positioning accuracy, untimely switching between sources, and misjudgment of the separated speech.
Disclosure of Invention
The embodiment of the invention provides a sound signal separation method for dual sound sources and a sound pickup, which are used to solve the prior-art problems of low positioning accuracy, untimely switching of sound source separation, and misjudgment of voice separation.
In view of the above technical problems, in a first aspect, an embodiment of the present invention provides a method for separating sound signals of two sound sources, including:
receiving a mixed sound signal from a first sound source and a second sound source;
dividing a received mixed sound signal into voice frames with preset frame lengths, judging the propagation direction of each voice frame, and determining the propagation direction corresponding to each voice frame;
and separating the sound signal from the first sound source and the sound signal from the second sound source according to the corresponding propagation direction of each voice frame.
In a second aspect, an embodiment of the present invention provides an acoustic signal separation apparatus, including:
a receiving module for receiving a mixed sound signal from a first sound source and a second sound source;
the processing module is used for dividing the received mixed sound signal into voice frames with preset frame lengths, judging the propagation direction of each voice frame and determining the propagation direction corresponding to each voice frame;
and the separation module is used for separating the sound signal from the first sound source and the sound signal from the second sound source according to the corresponding propagation direction of each voice frame.
In a third aspect, embodiments of the present invention provide a pickup,
the device comprises a microphone array unit, a processing unit and an output unit; the microphone array unit comprises a microphone array and an audio coding unit;
the microphone array unit is used for sending the collected sound signals to the processing unit;
the processing unit is configured to perform the sound signal separation method of the dual sound source according to any one of the above, separate the sound signal from the first sound source and the sound signal from the second sound source, and send the sound signal from the first sound source and the sound signal from the second sound source to the output unit, respectively;
The output unit is used for respectively outputting the sound signal from the first sound source and the sound signal from the second sound source;
the audio coding unit is used for converting sound waves received by the microphone array into electric signals to obtain sound signals.
In a fourth aspect, an embodiment of the present invention provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the dual-sound-source sound signal separation method described above.
In a fifth aspect, embodiments of the present invention provide a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the dual sound source sound signal separation method described above.
The embodiment of the invention provides a sound signal separation method for dual sound sources and a sound pickup. The mixed sound signal is divided into speech frames; the delay differences with which each frame reaches different array element combinations of the microphone array are estimated; the propagation direction of each frame is then judged from the determined delay differences; and the sound signals corresponding to the different sound sources are separated in real time according to the propagation direction and output. Because the delay estimation is performed with the generalized cross-correlation algorithm, the delay can be estimated accurately while the computational load remains low, so the algorithm can track the sound source bearings accurately and efficiently in a real-time system, thereby achieving automatic separation of the sound signals of the first and second sound sources.
Drawings
In order to explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings required by the embodiments or the prior-art description are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention; other drawings can be obtained from them by a person skilled in the art without inventive effort.
Fig. 1 is a flow chart of a method for separating sound signals of dual sound sources according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a microphone array according to another embodiment of the present invention;
FIG. 3 is a block diagram of a sound source separation algorithm provided in accordance with another embodiment of the present invention;
fig. 4 is a schematic diagram of a dual-sound source separation application scenario provided in another embodiment of the present invention;
FIG. 5 is a schematic diagram of the audio signals of a dialogue between A and B according to another embodiment of the present invention;
FIG. 6 is a schematic diagram of the separated sound signal of A according to another embodiment of the present invention;
FIG. 7 is a schematic diagram of the separated sound signal of B according to another embodiment of the present invention;
FIG. 8 is a block diagram of a dual source real-time separation pickup structure provided in accordance with another embodiment of the present invention;
FIG. 9 is a schematic diagram of a sound source separation algorithm implementation area according to another embodiment of the present invention;
FIG. 10 is a comparative schematic diagram of test results of dual-source separation in an actual scenario provided by another embodiment of the present invention, where a represents the dialogue speech of A and B, b represents the separated speech of A, and c represents the separated speech of B.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments are described below clearly and completely with reference to the accompanying drawings. The described embodiments are obviously some, but not all, embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
Fig. 1 is a flow chart of a method for separating sound signals of dual sound sources according to an embodiment of the present invention, referring to fig. 1, the method includes the following steps:
step 101: receiving a mixed sound signal from a first sound source and a second sound source;
step 102: dividing a received mixed sound signal into voice frames with preset frame lengths, judging the propagation direction of each voice frame, and determining the propagation direction corresponding to each voice frame;
Step 103: separating the sound signal from the first sound source and the sound signal from the second sound source according to the propagation direction corresponding to each voice frame.
The method provided by this embodiment is executed by a computer, a server, or a device that processes sound signals (for example, a sound pickup). The preset frame length typically ranges from a few milliseconds to several tens of milliseconds; in this embodiment it is set to 4 ms. Adjacent speech frames overlap in time; for example, the overlap ratio is 75%.
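As an illustration of this framing step, the following minimal sketch (function and variable names are illustrative, not from the patent; it assumes the 16 kHz sampling rate used later in this description) splits a signal into 4 ms frames with 75% overlap:

```python
# A minimal framing sketch under assumed parameters; names are illustrative.
import numpy as np

def split_into_frames(signal, fs=16000, frame_ms=4.0, overlap=0.75):
    """Divide a 1-D signal into overlapping speech frames."""
    frame_len = int(fs * frame_ms / 1000)     # 4 ms at 16 kHz -> 64 samples
    hop = int(frame_len * (1.0 - overlap))    # 75% overlap -> 16-sample hop
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])

frames = split_into_frames(np.random.randn(16000))   # 1 s of dummy audio
print(frames.shape)                                  # (n_frames, 64)
```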
This embodiment provides a sound signal separation method for dual sound sources and a sound pickup. The mixed sound signal is divided into speech frames; the delay differences with which each frame reaches different array element combinations of the microphone array are estimated; the propagation direction of each frame is judged from the determined delay differences; and the sound signals corresponding to the different sound sources are separated in real time according to the propagation direction and output. Because the delay estimation is performed with the generalized cross-correlation algorithm, the delay can be estimated accurately while the computational load remains low, so the algorithm can track the sound source bearings accurately and efficiently in a real-time system, thereby achieving automatic separation of the sound signals of the first and second sound sources.
Further, on the basis of the foregoing embodiment, dividing the received mixed sound signal into voice frames of the preset frame length, judging the propagation direction of each voice frame, and determining the propagation direction corresponding to each voice frame includes:
dividing the received mixed sound signal into voice frames of the preset frame length;
determining, from the positions of the array elements of the microphone array that receives the mixed sound signal, the maximum delay difference corresponding to each array element combination, and taking the combinations whose maximum delay difference is greater than a preset threshold value as the selected array element combinations;
for any target voice frame among the voice frames, determining, through a generalized cross-correlation function, the delay difference with which each selected array element combination receives the target voice frame, and determining the propagation direction of the target voice frame from these delay differences;
wherein an array element combination is a pair of any two microphone array elements of the microphone array, and adjacent voice frames overlap in time.
Further, on the basis of the foregoing embodiments, determining the maximum delay differences corresponding to the different array element combinations according to the positions of the array elements of the microphone array that receives the mixed sound signal, and obtaining the array element combinations whose maximum delay difference is greater than a preset threshold value as the selected array element combinations, includes:
determining the maximum delay differences corresponding to the different array element combinations in the microphone array according to the formula

$\tau_{ij}^{\max} = \left\lceil \dfrac{d_{ij} f_s}{c} \right\rceil$

and obtaining the array element combinations corresponding to the three largest maximum delay differences as the selected array element combinations;
where $\tau_{ij}^{\max}$ is the maximum delay difference (in sampling points) corresponding to the array element combination consisting of microphone array element i and microphone array element j, $d_{ij}$ is the distance between microphone array elements i and j in the microphone array, c = 340 m/s is the speed of sound, $f_s$ = 16 kHz is the sampling frequency, and $\lceil \cdot \rceil$ denotes rounding up.
Taking the linearly arranged microphone array shown in Fig. 2 as an example, the selected array element combinations are determined as follows:
define the direction from microphone array element M8 to M1 in Fig. 2 as the 0° direction and the direction from M1 to M8 as the 180° direction, with a spacing d = 11 mm between adjacent elements.
After the maximum delay difference of each array element combination has been computed with the formula above, the combinations corresponding to the three largest maximum delay differences are selected, namely the M1, M8 combination; the M1, M7 and M2, M8 combinations; and the M1, M6, M2, M7 and M3, M8 combinations. These are the selected array element combinations. The propagation direction of the sound signal is then determined through the delay differences of the selected combinations.
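To make this selection concrete, a small sketch follows (illustrative names; it assumes the Fig. 2 geometry with d = 11 mm, c = 340 m/s and f_s = 16 kHz, and it picks the pairs by the three largest inter-element distances, which for this geometry is a tie-free way of picking the three largest maximum delay differences):

```python
import math
from itertools import combinations

d, c, fs = 0.011, 340.0, 16000            # spacing (m), sound speed, sample rate
dist = {(i, j): (j - i) * d for i, j in combinations(range(1, 9), 2)}

# maximum delay difference in samples for each pair: ceil(d_ij * fs / c)
tau_max = {p: math.ceil(x * fs / c) for p, x in dist.items()}

top3 = sorted(set(dist.values()), reverse=True)[:3]       # 77, 66, 55 mm
selected = sorted(p for p, x in dist.items() if x in top3)
print(selected)                    # [(1,6), (1,7), (1,8), (2,7), (2,8), (3,8)]
print([tau_max[p] for p in selected])      # [3, 4, 4, 3, 4, 3] samples
```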
Further, on the basis of the foregoing embodiments, determining, through a generalized cross-correlation function, the delay difference with which each selected array element combination receives any target voice frame includes:
for any target voice frame among the voice frames, calculating the delay difference with which each selected array element combination receives the target voice frame through

$r_{ij}(p) = \mathrm{IFFT}\{\Phi(\omega_{k'})\, X_i(\omega_{k'})\, X_j^*(\omega_{k'})\}$

and

$\hat{\tau}_{ij} = \arg\max_p r_{ij}(p)$

where $\hat{\tau}_{ij}$ denotes the delay difference with which microphone array elements i and j of a selected combination receive the target voice frame, $X_i(\omega_{k'})$ and $X_j(\omega_{k'})$ denote the spectra, obtained by the fast Fourier transform, of the sound signals received by microphone array elements i and j respectively, $\Phi(\omega_{k'})$ is the weighting function, and $\mathrm{IFFT}\{\cdot\}$ denotes the inverse fast Fourier transform.
Specifically, the delay differences are calculated through the generalized cross-correlation function as follows:
for the M1, M8 array element combination, from the sound signal received by M1 and the sound signal received by M8, the delay difference with which M1 and M8 receive the sound signal in the speech frame is calculated through the formulas

$r_{18}(p) = \mathrm{IFFT}\{\Phi(\omega_{k'})\, X_1(\omega_{k'})\, X_8^*(\omega_{k'})\}$

and

$\hat{\tau}_{18} = \arg\max_p r_{18}(p)$

where $X_1(\omega_{k'})$ and $X_8(\omega_{k'})$ are the spectra of the frames received by M1 and M8. In the same way, the delay differences $\hat{\tau}_{17}$, $\hat{\tau}_{28}$, $\hat{\tau}_{16}$, $\hat{\tau}_{27}$ and $\hat{\tau}_{38}$ are obtained for the M1, M7; M2, M8; M1, M6; M2, M7; and M3, M8 combinations.
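The per-pair computation above can be sketched in a few lines (a minimal numpy implementation under the weighting Φ(ω) = 1 used in this embodiment; names are illustrative, and the sign of the result simply indicates which element the wavefront reaches first):

```python
import numpy as np

def gcc_delay(xi, xj, max_lag, weight=None):
    """Estimate the arrival-delay difference (in samples) between two
    microphone signals via the generalized cross-correlation."""
    n = 2 * len(xi)                       # zero-pad to avoid circular wrap-around
    Xi, Xj = np.fft.rfft(xi, n), np.fft.rfft(xj, n)
    cross = Xi * np.conj(Xj)              # cross-power spectrum X_i * X_j^*
    if weight is not None:                # Phi(w) = 1 when weight is None
        cross = cross * weight(cross)
    r = np.fft.irfft(cross, n)            # generalized cross-correlation
    lags = np.concatenate((r[-max_lag:], r[: max_lag + 1]))  # -max_lag..max_lag
    return int(np.argmax(lags)) - max_lag

# e.g. the M1, M8 pair, whose maximum delay difference is 4 samples:
# tau_18 = gcc_delay(frame_m1, frame_m8, max_lag=4)
```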
Further, on the basis of the above embodiments, determining, through the generalized cross-correlation function, the delay difference with which each selected array element combination receives the target voice frame, and determining the propagation direction of the target voice frame according to these delay differences, includes:
dividing the selected array element combinations having the same maximum delay difference into the same group;
determining, through the generalized cross-correlation function, the delay difference with which each selected array element combination receives the target voice frame, and, for each group, calculating the average of the delay differences of the selected combinations in the group as the group delay difference;
judging the propagation direction of the target voice frame according to the group delay difference of each group and the set judgment standard of each group.
Further, on the basis of the above embodiments, judging the propagation direction of the target voice frame according to the group delay difference of each group and the set judgment standard of each group includes:
counting, over the groups, a first number of group delay differences smaller than the set judgment standard of their group and a second number of group delay differences larger than the set judgment standard of their group;
if the first number is larger than the second number, the propagation direction of the target voice frame is the first direction; if the first number is smaller than the second number, the propagation direction of the target voice frame is the second direction.
For any group, the set judgment standard of the group is equal to half of the maximum delay difference of the array element combinations in the group; and grouping the selected array element combinations means dividing the combinations with the same maximum delay difference into the same group.
For example, the selected array element combinations are grouped as follows:
first group: the M1, M8 array element combination;
second group: the M1, M7 and M2, M8 array element combinations;
third group: the M1, M6; M2, M7; and M3, M8 array element combinations.
The group delay difference of the first group is $\tau_1 = \hat{\tau}_{18}$, that of the second group is $\tau_2 = (\hat{\tau}_{17} + \hat{\tau}_{28})/2$, and that of the third group is $\tau_3 = (\hat{\tau}_{16} + \hat{\tau}_{27} + \hat{\tau}_{38})/3$.

The set judgment standard of the first group is $\tau_{18}^{\max}/2$, that of the second group is $\tau_{17}^{\max}/2$, and that of the third group is $\tau_{16}^{\max}/2$.
When the label is less than 0, the orientation of the microphone array element M8 to the microphone array element M1 is represented as 0 DEG direction, and the orientation of the label more than 0 microphone array element M1 to the microphone array element M8 is represented as 180 DEG direction.
If $\tau_1 < -\tau_{18}^{\max}/2$ (corresponding to label = -1), the first number is increased by 1; if $\tau_1 > \tau_{18}^{\max}/2$ (corresponding to label = 1), the second number is increased by 1. If $\tau_2 < -\tau_{17}^{\max}/2$ (corresponding to label = -1), the first number is increased by 1; if $\tau_2 > \tau_{17}^{\max}/2$ (corresponding to label = 1), the second number is increased by 1. If $\tau_3 < -\tau_{16}^{\max}/2$ (corresponding to label = -1), the first number is increased by 1; if $\tau_3 > \tau_{16}^{\max}/2$ (corresponding to label = 1), the second number is increased by 1.
If the first number cnt1 is greater than the second number cnt2, the sound source corresponding to the voice frame is the first sound source, located in the direction from M8 to M1; if cnt1 is less than cnt2, the sound source corresponding to the voice frame is the second sound source, located in the direction from M1 to M8.
After the sound source corresponding to each voice frame has been determined in this way, the sound signals of the two sound sources can be separated.
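Under the definitions above (three groups, judgment standard equal to half of each group's maximum delay difference, and the signed-threshold reading of the comparisons), the per-frame voting could be sketched as follows; the tie case is not specified by the patent:

```python
def judge_direction(group_delays, group_max_delays):
    """Vote on the propagation direction of one speech frame.
    group_delays     -- the averaged delay difference of each group (tau_1..tau_3)
    group_max_delays -- the maximum delay difference of each group (in samples)
    Returns 'first' (0 deg, toward M1) or 'second' (180 deg, toward M8).
    """
    cnt1 = cnt2 = 0
    for tau, tau_max in zip(group_delays, group_max_delays):
        threshold = tau_max / 2.0      # the group's set judgment standard
        if tau < -threshold:           # label = -1
            cnt1 += 1
        elif tau > threshold:          # label = +1
            cnt2 += 1
    # ties are not specified by the patent; 'second' is returned here
    return 'first' if cnt1 > cnt2 else 'second'
```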
Further, on the basis of the foregoing embodiments, the separating the sound signal from the first sound source and the sound signal from the second sound source according to the propagation direction corresponding to each speech frame includes:
according to the corresponding propagation direction of each voice frame, determining a sound signal composed of each voice frame with the propagation direction being the first direction as a sound signal from the first sound source;
and determining a sound signal composed of each voice frame with the propagation direction being the second direction as a sound signal from the second sound source according to the propagation direction corresponding to each voice frame.
Specifically, the sound signal composed of the voice frames whose propagation direction is the first direction is taken as the sound signal of the first sound source, and the sound signal composed of the voice frames whose propagation direction is the second direction is taken as the sound signal of the second sound source; the first sound source is located in the direction opposite to the first direction, and the second sound source in the direction opposite to the second direction.
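One simple rendering of this composition step is sketched below (an illustrative sketch only: the frame overlap is ignored for clarity, and the channel that does not own a frame receives silence for that frame):

```python
import numpy as np

def route_frames(frames, directions):
    """Compose the two output channels: each frame goes to the channel of its
    judged source, and the other channel gets silence for that frame."""
    ch1 = np.concatenate([f if d == 'first' else np.zeros_like(f)
                          for f, d in zip(frames, directions)])
    ch2 = np.concatenate([f if d == 'second' else np.zeros_like(f)
                          for f, d in zip(frames, directions)])
    return ch1, ch2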
Fig. 3 is a schematic flow chart of the sound source separation of dual sound sources provided by this embodiment. Referring to Fig. 3, while sound source A and sound source B are talking, the microphone array receives the sound signals, the propagation direction of each signal is determined by calculation, the signals emitted by the different sources are separated according to the propagation direction, and the signals of the different sources are enhanced and output on two channels, so that clear speech is output for each source.
In general, the method provided by this embodiment includes the following steps: (1) estimate the delay differences from the source signal to the different microphone combinations; (2) judge the sound source direction from the resulting delay estimates; (3) separate the sound sources in the different directions in real time.
Taking dual sound sources as an example, assume in this embodiment the microphone array shown in Fig. 2, in which 8 electret omnidirectional microphones are arranged linearly; the direction from M8 to M1 is defined as the 0° direction, the direction from M1 to M8 as the 180° direction, and the spacing between microphones is d = 11 mm. After the sound signal is received, the sound source separation is computed as follows:
(1) Select a weighting function $\Phi(\omega_{k'})$; here $\Phi(\omega_{k'}) = 1$ is chosen;
(2) Perform short-time processing. The signal received by the microphones is divided into short-time speech frames with a certain overlap ratio; the frame length may range from a few milliseconds to tens of milliseconds. In the dual-source separation algorithm, a frame length of 4 ms with an overlap of 75% is selected. Framing yields, at time t, a group of array element output signals:

$\{x_n(t), x_n(t+1), \ldots, x_n(t+K-1)\}, \quad n = 1, 2, 3, 6, 7, 8;$

(3) Estimate the spectrum of $x_n(t)$:

$X_n(\omega_{k'}) = \mathrm{FFT}\{x_n(t), x_n(t+1), \ldots, x_n(t+K-1)\}$

where $\mathrm{FFT}\{\cdot\}$ is the fast Fourier transform;
(4) Calculate the maximum delay point between the different microphone combinations:

$\tau_{ij}^{\max} = \left\lceil \dfrac{d_{ij} f_s}{c} \right\rceil$

where $d_{ij}$ is the distance between microphones i and j, c = 340 m/s is the speed of sound, $f_s$ = 16 kHz is the sampling frequency, and $\lceil \cdot \rceil$ denotes rounding up;
(5) According to $\tau_{ij}^{\max}$, divide the microphone combinations into three groups, pairs with equal maximum delay points being placed in one group:
(1) the M1, M8 microphone pair;
(2) the M1, M7 and M2, M8 microphone pairs;
(3) the M1, M6, M2, M7 and M3, M8 microphone pairs;
(6) Calculate the generalized cross-correlation function of each microphone pair in the three groups of step (5):

$r_{18}(p) = \mathrm{IFFT}\{\Phi(\omega_{k'})\, X_1(\omega_{k'})\, X_8^*(\omega_{k'})\}$

where $\mathrm{IFFT}\{\cdot\}$ is the inverse fast Fourier transform; in the same way,

$r_{17}(p), \; r_{28}(p), \; r_{16}(p), \; r_{27}(p), \; r_{38}(p);$

(7) Obtain the delay estimates of the microphone pairs in the three groups:

$\hat{\tau}_{18} = \arg\max_p r_{18}(p)$

and in the same way

$\hat{\tau}_{17}, \; \hat{\tau}_{28}, \; \hat{\tau}_{16}, \; \hat{\tau}_{27}, \; \hat{\tau}_{38};$

(8) From these, three delays are derived:

$\tau_1 = \hat{\tau}_{18}, \quad \tau_2 = (\hat{\tau}_{17} + \hat{\tau}_{28})/2, \quad \tau_3 = (\hat{\tau}_{16} + \hat{\tau}_{27} + \hat{\tau}_{38})/3;$
(9) Voice activity detection (Voice Activity Detection, VAD): set a suitable threshold on the peak value of the cross-correlation function; if the peak is above the threshold, the current frame is judged to be a speech signal; if it is below the threshold, the current frame is judged to be a noise signal and the delay value of the previous frame is used as the delay value of the current frame (this fallback is sketched in the code after this step list);
(10) Take half of the maximum delay point of each group as the judgment standard, and let label be the flag value of the judged angular direction, i.e.

$\tau_1 < -\tau_{18}^{\max}/2$ corresponds to label = -1; $\tau_1 > \tau_{18}^{\max}/2$ corresponds to label = 1;

and likewise the following judgment criteria are set:

$\tau_2 < -\tau_{17}^{\max}/2$ corresponds to label = -1; $\tau_2 > \tau_{17}^{\max}/2$ corresponds to label = 1;

$\tau_3 < -\tau_{16}^{\max}/2$ corresponds to label = -1; $\tau_3 > \tau_{16}^{\max}/2$ corresponds to label = 1;
When the current frame is a speech signal, the computation of label is smoothed with a filter, making the algorithm more robust; when the current frame is a noise signal, the label value of the previous frame is used as the label value of the current frame;
(11) Judge the sound source direction from the label value:
label smaller than 0 is judged as the 0° direction;
label larger than 0 is judged as the 180° direction;
(12) Count separately the number cnt1 of judgments for the 0° direction and the number cnt2 of judgments for the 180° direction:
if cnt1 > cnt2, the observed speech signal of the frame is finally judged to come from the 0° direction;
if cnt1 < cnt2, the observed speech signal of the frame is finally judged to come from the 180° direction;
(13) Optimize the code and handle misjudgments during voice separation, thereby realizing automatic separation of the two sound sources.
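Putting steps (2) through (12) together, a hypothetical per-frame driver might look like the following sketch. It reuses the split_into_frames, gcc_delay and judge_direction helpers sketched earlier and hard-codes the groups of the Fig. 2 array; all names are illustrative, and the step (9) VAD fallback is only indicated in a comment:

```python
# Selected pairs of the Fig. 2 array (0-indexed M1..M8 -> 0..7), grouped by
# equal spacing; GROUP_MAX is each group's maximum delay point in samples,
# from ceil(d_ij * fs / c) with d = 11 mm spacing.
GROUPS = [[(0, 7)], [(0, 6), (1, 7)], [(0, 5), (1, 6), (2, 7)]]
GROUP_MAX = [4, 4, 3]

def process_frame(frame_8ch, prev_delays):
    """Decide the direction of one speech frame.
    frame_8ch   -- (8, K) array holding the frame seen by each element
    prev_delays -- per-group delays of the previous frame, for the VAD fallback
    """
    delays = []
    for g, (pairs, tmax) in enumerate(zip(GROUPS, GROUP_MAX)):
        taus = [gcc_delay(frame_8ch[i], frame_8ch[j], tmax) for i, j in pairs]
        tau = sum(taus) / len(taus)        # the group delay difference
        # Step (9) VAD, sketched: if the cross-correlation peak of this group
        # fell below a threshold, reuse prev_delays[g] instead of tau.
        delays.append(tau)
    return judge_direction(delays, GROUP_MAX), delays
```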
The generalized cross-correlation algorithm (Generalized Cross-Correlation, GCC) is described as follows:
GCC is currently the most widely used delay estimation algorithm. It is computationally efficient and its decision delay is short, so it has good target-tracking ability. It is also easy to implement in a system and performs particularly well in scenes with a high signal-to-noise ratio. In an indoor environment with strong reverberation, the GCC estimate may contain errors, but these do not cause the separation algorithm as a whole to break down.
Assume there is an unknown sound source in some direction in the sound field; the output signal of the nth element of an N-element microphone array can then be expressed as:

$x_n(k) = a_n s(k - D_n) + b_n(k), \quad n = 1, 2, \ldots, N \quad (1)$

where $a_n$ is a sound propagation attenuation factor satisfying $0 \le a_n \le 1$; $D_n$ is the propagation delay from the unknown source to microphone n; s(k) is the sound emitted by the speaker, i.e. the source signal, whose spectrum is broadband in nature; and $b_n(k)$ is the additive noise received by the nth microphone. $b_n(k)$ is assumed to follow a zero-mean Gaussian distribution and to be statistically uncorrelated with the source signal s(k) and with the noise received at the other microphones.

Under this signal model, the signal delay difference between the ith and jth microphones can be expressed as:

$\tau_{ij} = D_j - D_i \quad (2)$

where $i, j = 1, 2, \ldots, N$ and $i \ne j$. The goal of delay estimation is to obtain an estimate $\hat{\tau}_{ij}$ of $\tau_{ij}$ from the observed signals $x_n(k)$.

Following the generalized cross-correlation approach, suppose there are only two microphones whose output signals are denoted $x_1(k)$ and $x_2(k)$. Their cross-correlation function is defined as:

$r_{x_1 x_2}(p) = E[x_1(k)\, x_2(k+p)] \quad (3)$

where $E[\cdot]$ denotes mathematical expectation. Substituting $x_1(k)$ and $x_2(k)$ into the cross-correlation function (3) gives:

$r_{x_1 x_2}(p) = a_1 a_2\, r_{ss}(p - \tau_{12}) + a_1 r_{s b_2}(p + D_1) + a_2 r_{b_1 s}(p - D_2) + r_{b_1 b_2}(p) \quad (4)$

Since $b_n(k)$ is Gaussian white noise, uncorrelated with the source signal and with the noise received at the other microphone, the cross terms vanish, leaving $r_{x_1 x_2}(p) = a_1 a_2\, r_{ss}(p - \tau_{12})$. From (4) it follows easily that $r_{x_1 x_2}(p)$ attains its maximum at $p = D_2 - D_1$. Thus the relative arrival time difference of $x_1(k)$ and $x_2(k)$ is:

$\hat{\tau}_{12} = \arg\max_{p} r_{x_1 x_2}(p) \quad (5)$

where $p \in [-\tau_{\max}, \tau_{\max}]$ and $\tau_{\max}$ is the maximum possible delay.
When equation (5) is implemented digitally, the cross-correlation function (Cross Correlation Function, CCF) is unknown and must be estimated, typically by replacing the statistical expectation defined in equation (3) with a time average.
Assume that at time t we have a set of observed samples of $x_m$, i.e. $\{x_m(t), x_m(t+1), \ldots, x_m(t+K-1)\}$, m = 1, 2; the corresponding cross-correlation function can then be estimated by:

$\hat{r}_{x_1 x_2}(p) = \dfrac{1}{K} \sum_{k=0}^{K-1-p} x_1(t+k)\, x_2(t+k+p) \quad (6)$

or by:

$\hat{r}_{x_1 x_2}(p) = \dfrac{1}{K-p} \sum_{k=0}^{K-1-p} x_1(t+k)\, x_2(t+k+p) \quad (7)$
where K is the size of the speech frame. The difference between equations (6) and (7) is that the former is a biased estimate while the latter is unbiased. The former is widely used in many applications because of its low estimation variance and asymptotic unbiasedness.
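The two estimators (6) and (7) translate directly into code; a minimal sketch for a non-negative lag p (illustrative names):

```python
import numpy as np

def ccf_estimate(x1, x2, p, biased=True):
    """Time-average estimate of the cross-correlation at non-negative lag p:
    biased (eq. 6) divides by K; unbiased (eq. 7) divides by K - p."""
    K = len(x1)
    s = float(np.dot(x1[: K - p], x2[p:]))   # sum_k x1(t+k) * x2(t+k+p)
    return s / K if biased else s / (K - p)
```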
Furthermore, the cross-correlation function can also be estimated via the forward and inverse discrete Fourier transforms, i.e.:

$\hat{r}_{x_1 x_2}(p) = \dfrac{1}{K} \sum_{k'=0}^{K-1} \hat{X}_1(\omega_{k'})\, \hat{X}_2^*(\omega_{k'})\, e^{j \omega_{k'} p} \quad (8)$

where $\omega_{k'} = 2\pi k'/K$ is the angular frequency and $\hat{X}_n(\omega_{k'})$ is the short-time discrete Fourier transform of $x_n(k)$ at time t. Equations (6) and (8) produce the same cross-correlation estimate; the latter, however, has been widely used in systems because the fast Fourier transform and its inverse implement the cross-correlation function more efficiently.
In summary, the generalized cross-correlation method weights the cross-power spectrum between the sensor outputs, and this weighting can effectively improve the performance of the delay estimation. Combining the signal model of (1), the GCC estimate of the relative arrival time difference of $x_1(k)$ and $x_2(k)$ is:

$\hat{\tau}_{12}^{\mathrm{GCC}} = \arg\max_p \hat{r}^{\mathrm{GCC}}_{x_1 x_2}(p) \quad (9)$

where

$\hat{r}^{\mathrm{GCC}}_{x_1 x_2}(p) = \mathrm{IFFT}\{\Phi(\omega_{k'})\, \hat{S}_{x_1 x_2}(\omega_{k'})\}$

is the generalized cross-correlation function, $S_{x_1 x_2}(\omega_{k'}) = E[X_1(\omega_{k'})\, X_2^*(\omega_{k'})]$ is the cross-power spectrum of $x_1(k)$ and $x_2(k)$, $(\cdot)^*$ denotes complex conjugation, and $\Phi(\omega_{k'})$ is a weighting function (sometimes also called pre-filtering), so that the weighted cross-power spectrum is $\Phi(\omega_{k'})\, S_{x_1 x_2}(\omega_{k'})$.

In an actual system, the cross-power spectrum $S_{x_1 x_2}(\omega_{k'})$ is usually estimated by substituting the instantaneous value for the expected value, i.e.:

$\hat{S}_{x_1 x_2}(\omega_{k'}) = \hat{X}_1(\omega_{k'})\, \hat{X}_2^*(\omega_{k'})$
There are a number of member algorithms in the class of GCC-based algorithms, depending on how the weighting function $\Phi(\omega_{k'})$ is chosen. Different weighting functions have different properties: the Phase Transform (PHAT) better resists additive noise; the Smoothed Coherence Transform (SCOT) improves robustness to multipath effects in reverberant environments; moreover, different weighting functions can be combined according to the use requirements. In general, the weighting function should be selected according to the specific application and the corresponding conditions of use.
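As an illustration of these member algorithms, the PHAT weighting could be plugged into the gcc_delay sketch given earlier (the small constant guarding against division by zero is an implementation assumption):

```python
import numpy as np

def phat(cross):
    """PHAT weighting: Phi(w) = 1/|S_x1x2(w)|, keeping only phase information."""
    return 1.0 / (np.abs(cross) + 1e-12)

# plugged into the earlier sketch:
# tau = gcc_delay(frame_m1, frame_m8, max_lag=4, weight=phat)
# (SCOT would instead need the auto-spectra of both channels, 1/sqrt(S11*S22).)
```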
Based on the above embodiment, the flow of the algorithm implementation steps is shown in Fig. 3. The steps of the sound source separation method of this patent are shown inside the dashed box, and possible later extensions outside it. For example, the separated two-channel speech can be passed through beamforming or speech enhancement algorithms to further reduce the influence of various noises on the speech quality.
The invention provides a real-time separation sound pickup intended to solve the problem of real-time separation of two sound sources in a question-and-answer setting. At present, most existing sound source orientation methods suffer from low positioning accuracy and poor real-time tracking; in addition, sound source separation suffers from untimely switching, misjudgment of voice separation, and similar problems.
It should be noted that, in the above sound signal separation method and in the dual-sound-source real-time separation pickup, the omnidirectional microphones may be electret or MEMS microphones.
To further explain the present embodiment, the microphone array may use 8 electret omnidirectional microphones, or any other number greater than or equal to 2, arranged as a uniform linear array, a non-uniform linear array, a circular array, or any other array geometry; and the method can be extended to 3-way, 4-way, and multi-way sound source separation.
The microphone-array-based sound source separation method and structure and the dual-sound-source real-time separation pickup apply a sound source separation algorithm based on delay estimation. Because the delay estimation is performed with the generalized cross-correlation algorithm, the delay can be estimated accurately while the computational load remains low, so the algorithm can track the sound source bearings accurately and efficiently in a real-time system, thereby achieving automatic separation of the sound signals of the first and second sound sources.
The invention provides a separation pickup aimed at real-time separation and extraction of dual-source voice signals, so as to complete voice recognition and transcription of the different voice signals. It solves the following problems of the prior art: (1) speaker identification: the algorithm must track the speakers in real time and determine who is speaking; when the speaker changes, the algorithm must quickly make a judgment and give the correct result. (2) Speech enhancement: the algorithm must enhance the desired speech while suppressing or cancelling the effect of the interfering speech; for example, when the speech of interest originates from sound source 1, the desired signal to be enhanced is voice 1 while the effect of voice 2 is suppressed or cancelled, and vice versa. (3) Sound source separation: the algorithm must separate the speech signals of the different sound sources while extracting the speech of the same speaker.
In a second aspect, the present application provides a sound pickup including a microphone array unit, a processing unit, and an output unit; the microphone array unit comprises a microphone array and an audio coding unit;
the microphone array unit is used for sending the collected sound signals to the processing unit;
the processing unit is used for executing the double-sound-source sound signal separation method, separating the sound signals from the first sound source and the second sound source, and respectively sending the sound signals from the first sound source and the second sound source to the output unit;
The output unit is used for respectively outputting the sound signal from the first sound source and the sound signal from the second sound source;
the audio coding unit is used for converting sound waves received by the microphone array into electric signals to obtain sound signals.
Further, on the basis of the foregoing embodiments, the arrangement manner of the microphone array elements in the microphone array includes a uniform linear array, a non-uniform linear array, and a circular array.
Further, on the basis of the above embodiments, the microphone array elements in the microphone array are electret omnidirectional microphones or MEMS microphones.
Further, on the basis of the above embodiments, the sensitivity of the microphone array elements is -29 dB ± 1 dB, the frequency response covers 100 Hz to 10 kHz, and the signal-to-noise ratio is at least 60 dB.
Further, each microphone is cylindrical, 9.7 mm in diameter and 4.5 mm in height.
In fact, the uniform linear microphone array shown in Fig. 2 achieves good voice separation when sound source A and sound source B are located at the two ends of the array. To enhance the separation effect, sound source A and sound source B may be located in the regions shown in Fig. 9: sound source A within the cone extending from the 0° direction of the array at an angle of up to 60° from that direction, and sound source B within the cone extending from the 180° direction at an angle of up to 60° from that direction.
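A hypothetical helper (names and the angle bookkeeping are illustrative) that checks whether a source azimuth lies inside one of the two cones of Fig. 9:

```python
def in_pickup_cone(azimuth_deg):
    """Return which separation cone of Fig. 9 an azimuth falls into, if any."""
    a = azimuth_deg % 360.0
    if a <= 60.0 or a >= 300.0:
        return 'A'       # within 60 degrees of the 0-degree direction
    if 120.0 <= a <= 240.0:
        return 'B'       # within 60 degrees of the 180-degree direction
    return None          # outside both cones; separation is not guaranteed
```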
Further, on the basis of the above embodiments, the processing unit includes an acquisition encoding unit, an FPGA processing unit, and a DSP processing unit;
the acquisition encoding unit receives the sound signals sent by the microphone array unit, performs first preprocessing on them, and transmits the preprocessed signals to the FPGA processing unit in time-division-multiplexed form;
the FPGA processing unit performs second preprocessing on the received sound signals and transmits them to the DSP processing unit;
the DSP processing unit separates the received sound signals, determines the sound signal from the first sound source and the sound signal from the second sound source, and sends both to the FPGA processing unit, which in turn sends them to the output unit;
wherein the first preprocessing comprises gain control, A/D analog-to-digital conversion and automatic level control of the sound signal, and the second preprocessing comprises serial-to-parallel conversion, data buffering, high-pass filtering and parallel-to-serial conversion.
A microphone array is a signal pickup device formed by arranging a number of microphones in a specific spatial geometry. The array parameters include the number of microphones, the array aperture, the element spacing, the spatial distribution of the elements, and other geometric parameters. Microphone array signal processing performs correlation analysis and joint processing of the target sound signals acquired by the array, using the statistical characteristics shared by the element signals, to meet requirements such as sound source orientation, sound source separation and signal enhancement. Consequently, microphone-array-based signal processing has gradually become a hot research topic in the field of digital signal processing.
The following describes applications of the dual-sound-source signal separation method and the pickup provided by this application:
Application mode: the microphone array of the pickup is shown in Fig. 2; it uses 8 electret omnidirectional microphones arranged linearly, with microphone No. 1 at the 0° end and microphone No. 8 at the 180° end. The pickup targets the separation of two sound sources; an application scenario is shown in Fig. 4. In Fig. 4, the two parties converse in a one-to-one manner, i.e. A and B do not speak at the same time. The linear-array pickup is placed, for example, with its 0° direction toward male speaker A and its 180° direction toward female speaker B. Recorded through the dual-sound-source real-time separation pickup provided by this patent, the dialogue of A and B is automatically separated in real time and output independently on two audio channels.
Repeated long-duration tests of multiple units of the dual-sound-source real-time separation pickup in different usage scenarios show that it can effectively and accurately separate the voices of the two parties automatically, while the pickup runs stably for long periods. For one group of test results, the dialogue voices of A and B, the separated voice of A, and the separated voice of B are shown in Figs. 5, 6 and 7 respectively; the broken line in Fig. 5 shows the flag distinguishing the sound sources. When the flag value is greater than 0 the frame is judged to be A's voice, and when it is less than 0, B's voice, so that the voices of the two parties are separated. As can be seen from Figs. 6 and 7, the voice separation is timely, there are essentially no separation errors, and the beginnings and endings of utterances are not misjudged.
The dual-sound-source real-time separation pickup can accurately and rapidly separate the voices of different sound sources automatically and is applicable to many scenarios, not limited to the following:
scenario example 1: a review room scenario.
In the interrogation rooms of public security organs, police officers investigate and collect evidence of criminal behavior from criminal suspects; the interrogation process is an important law-enforcement link. Traditional interrogation records are usually compiled by a dedicated clerk, which entails a heavy editing workload. In many cases the conversation proceeds quickly while the recording is slow, so the compiled record easily misses important points; meanwhile, details of some interrogations must be reviewed repeatedly from recordings at a later stage.
Scenario example 2: a service hall or a bank counter.
To improve service quality, the traditional survey approach usually takes the form of a questionnaire, which requires people to set aside time to fill in. Collating the questionnaires afterwards also consumes considerable manpower and time. With the separation method, the voice of the service staff and the voice of the person being served can be separated in real time, so that the service quality of the staff can be assessed.
In Fig. 10, a, b and c show, respectively, the dialogue voices of A and B, the separated voice of A, and the separated voice of B for one group of test results. The broken line in a is the flag distinguishing the sound sources: when the flag value is greater than 0 the frame is judged to be A's voice, and when it is less than 0, B's voice, so that the voices of the two parties are separated. From b and c it can be seen that the voice separation is timely, there are essentially no separation errors, and the beginnings and endings of utterances are not misjudged.
Fig. 8 is a schematic structural diagram of the pickup provided by this embodiment. Referring to Fig. 8, the dual-sound-source real-time separation pickup comprises:
a power management unit, which supplies power to the microphone array unit, the acquisition encoding unit, the FPGA (Field Programmable Gate Array) processing unit, the DSP (Digital Signal Processor) processing unit and the audio output unit, and is responsible for the power management of the whole system;
(1) The microphone array unit comprises 8 omnidirectional electret differential microphones and a differential signal-conditioning circuit. The 8 microphones are arranged linearly on a circuit board to form a linear microphone array, with a center-to-center spacing of 11 ± 0.1 mm between adjacent microphones. The microphones convert the acoustic wave signals into electrical signals, which are processed by the differential conditioning circuit and then fed to the audio coding unit;
(2) The acquisition encoding unit converts the 8 channels of analog audio from the microphone array unit into digital audio signals. Two high-performance audio acquisition encoding chips perform, in parallel, the gain control, analog-to-digital (A/D) conversion and automatic level control (Automatic Level Control, ALC) of the 8 audio channels of the microphone array unit. The acquisition encoding unit transmits the converted digital audio to the acquisition processing module of the FPGA processing unit over an IIS (Integrated Interchip Sound) bus in time-division-multiplexed form;
(3) The FPGA processing unit comprises an FPGA acquisition processing module and an FPGA output processing module; the acquisition processing module comprises a TDM timing module, a data processing module and a TDM slave timing module;
the TDM (Time Division Multiplex) timing module implements the timing generation, data processing and buffering of the signal acquisition;
the data processing module preprocesses the data delivered by the acquisition encoding unit, including serial-to-parallel conversion, data buffering, high-pass filtering and parallel-to-serial conversion;
the TDM slave timing module generates the TDM slave timing and, through TDM timing communication with the McASP0 (Multichannel Audio Serial Port 0) bus of the DSP processing unit, feeds the preprocessed 8-channel digital audio to the DSP processing unit in time-division-multiplexed form;
the DSP processing unit comprises a Short-time Fourier transform (Short-Time Fourier Transform, STFT) STFT, a sound source separation module, a voice enhancement module and an Inverse Short-time Fourier transform (ISTFT) ISTFT module, and is used for dividing 8-channel audio signals received in a dMAX mode from a McASP0 bus into small signal frames respectively, wherein the frame length is 4ms, namely 64 points, the overlapping rate between adjacent frames is 75%, the window length is 256 points, and filtering the Short-time frame signals;
The short-time frame signal is transformed from the time domain to the frequency domain through 256-point short-time Fourier STFT in the DSP processing unit;
the sound source separation module of the DSP processing unit is used for realizing sound source separation through a double sound source separation algorithm in a frequency domain, and realizing one-to-one intercom real-time separation in a designated conical area right in front (0 DEG + -60 DEG in figure 9) and right behind (180 DEG + -60 DEG in figure 9);
the voice enhancement module of the DSP processing unit is used for carrying out beam forming, voice noise reduction and reverberation removal on the two-channel voice signals separated by the sound source, and eliminating interference, noise and reverberation in a frequency domain and extracting two-channel expected audio signals;
the 256-point short-time Fourier inverse transform ISTFT module is used for transforming the two-channel expected audio signals into a time domain by utilizing overlap-add summation to obtain reconstructed two-channel audio signals with real-time separation enhancement;
(4) The DSP processing unit transmits the real-time separated and enhanced two-channel audio signals to the FPGA output processing module of the FPGA processing unit in TDM mode through the McASP1 (Multichannel Audio Serial Port 1) bus;
the FPGA output processing module implements the timing generation, data processing and buffering of the signal output: it generates the TDM slave timing, receives the real-time separated and enhanced two-channel audio signals from the DSP processing unit and buffers the data; and it generates two IIS master timings and transmits the two real-time enhanced audio channels to the audio output unit;
(5) The audio output unit is connected with the FPGA output processing module through the IIS bus and drives two IIS buses, the IIS1 bus and the IIS2 bus;
the IIS1 bus transmits the two audio channels to the audio D/A, which converts the digital audio signals into analog signals and outputs them to the audio power amplifier driving unit;
(6) The audio power amplifier driving unit amplifies the analog signals and connects to a 3.5 mm stereo headphone interface, realizing a 600-ohm unbalanced output;
the IIS2 bus transmits the two digital audio channels to a USB-to-IIS Bridge sound card bridging chip, which converts them into USB packets, realizing asynchronous USB transmission of the audio signals and virtualizing the pickup as a standard USB computer sound card, thereby achieving plug and play.
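For illustration, the short-time analysis/synthesis chain described in this structure (256-point window, 64-point hop, i.e. 4 ms, 75% overlap) might be sketched as follows; the Hamming window is one of the two windows named later in this description, and the window-gain normalization of a production ISTFT is omitted for brevity:

```python
import numpy as np

WIN, HOP = 256, 64                    # 256-point window, 64-point (4 ms) hop
window = np.hamming(WIN)              # Hamming, one of the windows named below

def stft(x):
    n = (len(x) - WIN) // HOP + 1
    return np.stack([np.fft.rfft(window * x[i*HOP : i*HOP + WIN])
                     for i in range(n)])

def istft(spectra, length):
    y = np.zeros(length)
    for i, frame in enumerate(spectra):          # overlap-add synthesis
        y[i*HOP : i*HOP + WIN] += np.fft.irfft(frame, WIN)
    return y             # a real system also normalizes by the window gain
```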
Based on the above embodiment, the acquisition encoding unit mainly uses 2 high-performance audio acquisition encoding chips to perform, in parallel, the gain control, analog-to-digital conversion and automatic level control of the 8 audio channels. The system uses a 16 kHz sampling rate with 24-bit precision. The analog audio received by the microphones is first amplified by a PGA (Programmable Gain Amplifier) and then fed to the A/D for conversion. The ALC automatically adjusts the PGA gain according to the amplitude of the analog input, tracking and monitoring the level of the PGA output at all times: when the audio input grows, the ALC circuit automatically lowers the PGA gain, and when the input shrinks, it automatically raises the gain, stabilizing the acquired signal amplitude at a certain level, or within a small fluctuation range, and thereby enlarging the dynamic input range of the system. The acquisition encoding unit converts the 8 channels of analog audio into digital audio and transmits it to the acquisition processing module of the FPGA processing unit over the IIS bus in time-division-multiplexed form.
The acquisition processing module of the FPGA processing unit mainly realizes timing generation, data processing, buffering and other functions for signal acquisition. It preprocesses the data transmitted by the acquisition encoding unit through serial-to-parallel conversion, data buffering, high-pass filtering and parallel-to-serial conversion; it generates the TDM slave timing, communicates with the McASP0 bus of the DSP processing unit over that TDM timing, and inputs the preprocessed 8 channels of audio digital signals to the DSP processing unit in time-division-multiplexed mode.
The DSP processing unit is the core of the whole sound-source-separation pickup. First, the DSP receives data from the McASP0 bus in dMAX mode. Second, it divides each of the received 8 channels of audio signals into short signal frames with a frame length of 4 ms, i.e. 64 points; the overlap rate between adjacent frames is 75% and the window length is 256 points, with a Hamming window or a Kaiser window selectable for filtering the short-time frame signals. These frame signals are then transformed from the time domain to the frequency domain by a 256-point short-time Fourier transform. The dual-sound-source separation algorithm is implemented in the frequency domain, realizing real-time separation of one-to-one conversation within the designated conical regions directly in front (0° ± 60°) and directly behind (180° ± 60°), as shown in fig. 8. The two separated channels of speech signals are each passed through a speech enhancement algorithm comprising beamforming, speech noise reduction and dereverberation, which eliminates interference, noise and reverberation in the frequency domain and extracts two channels of desired audio signals. Then, using overlap-add, the estimated two channels of desired audio signals are transformed back into the time domain by a 256-point inverse short-time Fourier transform, yielding the reconstructed, real-time separated and enhanced two-channel audio signals. Finally, the DSP processing unit transmits the real-time separated and enhanced two-channel audio signals to the output processing module of the FPGA processing unit in TDM mode through the McASP1 bus.
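The framing arithmetic above (a 64-sample hop at 16 kHz is 4 ms, and advancing a 256-point window by 64 samples gives 75% overlap) can be made concrete with a minimal STFT/ISTFT round trip in Python; the Hamming window and the normalization scheme are illustrative choices, not mandated by the patent.

```python
import numpy as np

FS = 16_000      # sampling rate from the text
HOP = 64         # 4 ms frame advance (64 samples)
NFFT = 256       # window/FFT length; HOP/NFFT = 25%, i.e. 75% overlap
WIN = np.hamming(NFFT)

def stft(x):
    """256-point STFT of a 1-D signal with a 64-sample hop."""
    n_frames = 1 + (len(x) - NFFT) // HOP
    return np.stack([np.fft.rfft(WIN * x[i * HOP:i * HOP + NFFT])
                     for i in range(n_frames)])

def istft(X):
    """Inverse STFT by windowed overlap-add with window-power
    normalization, recovering the time-domain signal."""
    n_frames = X.shape[0]
    y = np.zeros((n_frames - 1) * HOP + NFFT)
    norm = np.zeros_like(y)
    for i in range(n_frames):
        seg = slice(i * HOP, i * HOP + NFFT)
        y[seg] += WIN * np.fft.irfft(X[i], NFFT)
        norm[seg] += WIN ** 2
    return y / np.maximum(norm, 1e-12)
```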
The output processing module of the FPGA processing unit mainly realizes timing generation, data processing, buffering and other functions for signal output: it generates the TDM slave timing, receives the real-time separated and enhanced two-channel audio signals from the DSP processing unit, and buffers the data. It also generates two IIS master timings and transmits the two channels of real-time separated and enhanced audio signals to the audio output unit.
The audio output unit is connected with the output processing module of the FPGA processing unit through the IIS buses and drives two IIS buses respectively. The IIS1 bus transmits the two channels of audio data to the audio D/A, which converts the audio digital signals into analog signals and outputs them to the audio power amplifier driving unit; the audio power amplifier driving unit amplifies the analog signals and connects to a 3.5 mm stereo headphone jack, realizing 600-ohm unbalanced output. The IIS2 bus transmits the two channels of audio digital signals to a USB (Universal Serial Bus)-to-IIS Bridge sound card bridging chip, which converts them into USB packets, realizing USB asynchronous transmission of the audio signals and virtualizing the pickup as a standard computer USB sound card, so that plug-and-play is achieved.
The dual-sound-source real-time separation pickup provided by the invention can automatically separate the conversation between a police officer and a criminal suspect and output each side on its own audio channel. Given sufficient speech quality, the separated two-channel speech can be fed directly into a back-end multi-channel speech recognition system for real-time recognition and transcription, automatically producing an interrogation record. Further processing may be applied per channel by algorithms such as beamforming and speech enhancement to reduce the impact of various noises on recognition performance, and the clerk can later perform manual proofreading. The system improves the accuracy of interrogation records while greatly reducing the clerk's transcription workload. In addition, the separated speech of criminal suspects can be used to build a voiceprint library of criminal suspects, which can assist future case handling and further improve the efficiency and capability of public security investigation.
The dual-sound-source separation pickup provided by the invention can automatically separate the conversations between service-window staff and members of the public and output them independently on two audio channels. From the separated speech, a department's service center can evaluate public demands and service quality and dynamically monitor the working condition of each window.
The dual-sound-source real-time separation pickup provided by the invention is highly extensible: although the given example is based on a uniform linear array, it can equally be based on a non-uniform linear array, a circular array or other array geometries. According to actual scene requirements, the invention can obtain the angles of multiple sound sources using a sound source localization algorithm, thereby realizing real-time separation of multiple sound source channels. For example, an extended circular-array multi-source real-time separation can be applied to a multi-party conference system: the speech of different participants is automatically separated and output in real time, then fed into a back-end multi-channel speech recognition and transcription system, enabling real-time transcription and efficient completion of meeting minutes. The invention can be used for dual-sound-source separation and can also be extended to 3-channel, 4-channel and general multi-channel sound source separation.
The dual-sound-source real-time separation technique based on the GCC delay estimation algorithm provided by the invention exploits the fast and accurate delay estimation of the GCC algorithm to effectively separate the voices of two talkers. The speech onset and offset segments that occur at speaker turns are handled effectively, making the algorithm more robust. More importantly, the sound source separation algorithm adopted here has low complexity and runs in real time on a practical system.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the embodiments of the present invention, not to limit them. Although the embodiments of the present invention have been described in detail with reference to the foregoing, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents, and such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. A sound signal separation method for dual sound sources, comprising:
receiving a mixed sound signal from a first sound source and a second sound source;
dividing a received mixed sound signal into voice frames with preset frame lengths, judging the propagation direction of each voice frame, and determining the propagation direction corresponding to each voice frame;
separating a sound signal from the first sound source and a sound signal from the second sound source according to the corresponding propagation direction of each voice frame;
wherein the dividing the received mixed sound signal into voice frames of a preset frame length, judging the propagation direction of each voice frame, and determining the propagation direction corresponding to each voice frame comprises:
Dividing the received mixed sound signal into voice frames with the preset frame length;
determining the maximum delay difference corresponding to different array element combinations according to the positions of the array elements in the microphone array for receiving the mixed sound signals, and obtaining the array element combination with the maximum delay difference larger than a preset threshold value as a selected array element combination;
for any target voice frame in each voice frame, determining the time delay difference of each selected array element combination for receiving the target voice frame through a generalized cross-correlation function, and determining the propagation direction of the target voice frame according to the time delay difference of each selected array element combination for receiving the target voice frame;
wherein the array element combination is the combination of any two microphone array elements in the microphone array; there is an overlap time between adjacent speech frames.
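Read as an algorithm, claim 1 amounts to a per-frame routing loop. A minimal sketch, assuming a hypothetical helper estimate_direction() that implements the delay-difference decision of claims 2 to 4:

```python
def separate_two_sources(frames, estimate_direction):
    """Route each voice frame to source 1 or source 2 by its
    estimated propagation direction (the gist of claim 1).

    frames: sequence of multichannel voice frames, already split with
        an overlap between adjacent frames.
    estimate_direction: hypothetical callable implementing the
        delay-difference decision of claims 2-4; returns 1 or 2.
    """
    source1, source2 = [], []
    for frame in frames:
        (source1 if estimate_direction(frame) == 1 else source2).append(frame)
    return source1, source2
```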
2. The method for separating sound signals of dual sound sources according to claim 1, wherein determining maximum delay differences corresponding to different array element combinations according to positions of array elements in a microphone array for receiving the mixed sound signals, and obtaining the array element combination with the maximum delay difference greater than a preset threshold value as the selected array element combination includes:
according to the formula

$$\tau_{ij}^{\max} = \left\lceil \frac{d_{ij}\, f_s}{c} \right\rceil$$

determining the maximum delay differences corresponding to the different array element combinations in the microphone array, and taking the array element combinations corresponding to the three largest maximum delay differences as the selected array element combinations;

wherein $\tau_{ij}^{\max}$ is the maximum delay difference corresponding to the array element combination consisting of microphone array element i and microphone array element j, $d_{ij}$ is the distance between microphone array element i and microphone array element j in the microphone array, $c = 340\ \mathrm{m/s}$ is the speed of sound, $f_s = 16\ \mathrm{kHz}$ is the sampling frequency, and $\lceil \cdot \rceil$ denotes rounding up.
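A minimal sketch of the selection rule of claim 2, assuming element positions are given as 2-D coordinates in metres (a hypothetical input format); keeping every pair whose maximum delay difference is among the three largest distinct values is one plausible reading of the claim, consistent with the grouping in claim 3:

```python
import itertools
import math

C = 340.0      # speed of sound in m/s (from the claim)
FS = 16_000    # sampling frequency in Hz (from the claim)

def select_element_pairs(positions, n_values=3):
    """Select microphone pairs by maximum delay difference.

    positions: dict mapping element index -> (x, y) position in metres.
    Returns {(i, j): tau_ij_max} for every pair whose maximum delay
    difference, ceil(d_ij * fs / c) in samples, is among the n_values
    largest distinct values.
    """
    taus = {}
    for i, j in itertools.combinations(sorted(positions), 2):
        d_ij = math.dist(positions[i], positions[j])
        taus[(i, j)] = math.ceil(d_ij * FS / C)
    top = sorted(set(taus.values()), reverse=True)[:n_values]
    return {pair: t for pair, t in taus.items() if t in top}
```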
3. The sound signal separation method for dual sound sources according to claim 2, wherein, for any target voice frame among the voice frames, determining through a generalized cross-correlation function the delay difference with which each selected array element combination receives the target voice frame, and determining the propagation direction of the target voice frame according to those delay differences, comprises:
dividing the selected array element combinations having the same maximum delay difference into the same group;
determining through a generalized cross-correlation function the delay difference with which each selected array element combination receives the target voice frame, and computing from those delay differences the average delay difference of the selected array element combinations within each group, as the group delay difference;
and judging the propagation direction of the target voice frame according to the group delay difference of each group and the set judgment criterion of each group.
4. The sound signal separation method for dual sound sources according to claim 3, wherein judging the propagation direction of the target voice frame according to the group delay difference of each group and the set judgment criterion of each group comprises:
counting, over the groups, a first number of group delay differences smaller than the set judgment criterion of their group and a second number of group delay differences larger than the set judgment criterion of their group;
and if the first number is larger than the second number, the propagation direction of the target voice frame is the first direction; if the first number is smaller than the second number, the propagation direction of the target voice frame is the second direction.
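Claims 3 and 4 together describe a grouped majority vote. A sketch under the assumption that the selected pairs, their per-frame delay estimates and the per-group judgment criteria are already available; the claim leaves ties unspecified, so the tie handling below is an assumption:

```python
from collections import defaultdict

def decide_direction(pair_delays, pair_tau_max, criterion):
    """Grouped majority vote over delay differences (claims 3 and 4).

    pair_delays: dict (i, j) -> delay estimated for the target frame.
    pair_tau_max: dict (i, j) -> maximum delay difference; pairs that
        share a value form one group.
    criterion: dict tau_max -> judgment criterion set for that group.
    Returns 1 (first direction) or 2 (second direction); returning 2
    on a tie is an assumption.
    """
    groups = defaultdict(list)
    for pair, delay in pair_delays.items():
        groups[pair_tau_max[pair]].append(delay)

    first = second = 0
    for tau_max, delays in groups.items():
        group_delay = sum(delays) / len(delays)   # group delay difference
        if group_delay < criterion[tau_max]:
            first += 1
        elif group_delay > criterion[tau_max]:
            second += 1
    return 1 if first > second else 2
```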
5. The sound signal separation method for dual sound sources according to claim 1, wherein determining, for any target voice frame among the voice frames, the delay difference with which each selected array element combination receives the target voice frame through a generalized cross-correlation function comprises:
for any target voice frame among the voice frames, calculating the delay difference with which each selected array element combination receives the target voice frame by the following formulas:

$$R_{ij}(\tau) = \mathrm{IFFT}\big[\, X_i(k')\, X_j^{*}(k') \,\big]$$

and

$$\hat{\tau}_{ij} = \arg\max_{\tau} R_{ij}(\tau);$$

wherein $\hat{\tau}_{ij}$ denotes the delay difference with which microphone array element i and microphone array element j in a selected array element combination receive the target voice frame, $X_i(k')$ and $X_j(k')$ respectively denote the spectra of the sound signals received by microphone array element i and microphone array element j, $(\cdot)^{*}$ denotes complex conjugation, and the generalized cross-correlation function $R_{ij}(\tau)$ is the (inverse) fast Fourier transform result computed from $X_i(k')$ and $X_j^{*}(k')$.
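A minimal numeric counterpart to the reconstructed formulas above, using plain (unweighted) generalized cross-correlation; a PHAT-weighted variant would divide the cross-spectrum by its magnitude before the inverse transform:

```python
import numpy as np

def gcc_delay(x_i, x_j, tau_max):
    """Delay of x_j relative to x_i by generalized cross-correlation,
    searched over lags in [-tau_max, tau_max] samples.
    """
    n = len(x_i) + len(x_j)                    # zero-pad: no wrap-around
    cross = np.fft.rfft(x_i, n) * np.conj(np.fft.rfft(x_j, n))
    r = np.fft.irfft(cross, n)                 # cross-correlation via IFFT
    lags = np.r_[0:tau_max + 1, -tau_max:0]    # admissible lags
    idx = np.r_[0:tau_max + 1, n - tau_max:n]  # their positions in r
    return lags[np.argmax(r[idx])]
```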
6. The sound signal separation method for dual sound sources according to claim 4, wherein separating the sound signal from the first sound source and the sound signal from the second sound source according to the propagation direction corresponding to each voice frame comprises:
according to the corresponding propagation direction of each voice frame, determining a sound signal composed of each voice frame with the propagation direction being the first direction as a sound signal from the first sound source;
and determining a sound signal composed of each voice frame with the propagation direction being the second direction as a sound signal from the second sound source according to the propagation direction corresponding to each voice frame.
7. A pickup, characterized by comprising a microphone array unit, a processing unit and an output unit; the microphone array unit comprises a microphone array and an audio coding unit;
the microphone array unit is used for sending the collected sound signals to the processing unit;
the processing unit is configured to perform the sound signal separation method of the dual sound source according to any one of claims 1 to 6, separate the sound signal from the first sound source and the sound signal from the second sound source, and send the sound signal from the first sound source and the sound signal from the second sound source to the output unit, respectively;
The output unit is used for respectively outputting the sound signal from the first sound source and the sound signal from the second sound source;
the audio coding unit is used for converting sound waves received by the microphone array into electric signals to obtain sound signals.
8. The pickup of claim 7, wherein the arrangement of microphone elements in the microphone array includes a uniform linear array, a non-uniform linear array, and a circular array;
the microphone array elements in the microphone array are electret omnidirectional microphones or MEMS microphones;
the sensitivity of the microphone array elements in the microphone array is -29 dB ± 1 dB, the frequency response range is 100 Hz to 10 kHz, and the signal-to-noise ratio is not less than 60 dB.
9. The pickup of claim 7, wherein the processing unit comprises an acquisition encoding unit, an FPGA processing unit, and a DSP processing unit;
the acquisition encoding unit receives the sound signals sent by the microphone array unit, performs first preprocessing on the sound signals, and transmits the sound signals subjected to the first preprocessing to the FPGA processing unit in a time division multiplexing mode;
The FPGA processing unit performs second preprocessing on the received sound signals and transmits the sound signals subjected to the second preprocessing to the DSP processing unit;
the DSP processing unit separates the received sound signals, determines the sound signal from the first sound source and the sound signal from the second sound source, and sends them respectively to the FPGA processing unit, which respectively sends the sound signal from the first sound source and the sound signal from the second sound source to the output unit;
wherein the first preprocessing includes gain control, a/D analog-to-digital conversion, and automatic level control of the sound signal; the second preprocessing comprises serial-to-parallel conversion, data caching, high-pass filtering and parallel-to-serial conversion.
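Of the preprocessing stages listed in claim 9, the high-pass filtering step is the easiest to prototype in software. A sketch using a first-order DC-blocking filter; the coefficient (and hence the cut-off) is an illustrative assumption, not the filter specified by the patent:

```python
import numpy as np

def highpass_dc_block(x, r=0.995):
    """First-order DC-blocking high-pass filter:
    y[n] = x[n] - x[n-1] + r * y[n-1].

    With r = 0.995 (an illustrative assumption) the cut-off lies in
    the low tens of Hz at a 16 kHz sampling rate.
    """
    y = np.zeros(len(x))
    prev_x = prev_y = 0.0
    for n, xn in enumerate(x):
        y[n] = xn - prev_x + r * prev_y
        prev_x, prev_y = float(xn), y[n]
    return y
```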
CN202010251574.1A 2020-02-20 2020-04-01 Sound signal separation method of double sound sources and pickup Active CN111429939B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010104505 2020-02-20
CN2020101045058 2020-02-20

Publications (2)

Publication Number Publication Date
CN111429939A CN111429939A (en) 2020-07-17
CN111429939B true CN111429939B (en) 2023-06-09

Family

ID=71550497

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010251574.1A Active CN111429939B (en) 2020-02-20 2020-04-01 Sound signal separation method of double sound sources and pickup

Country Status (1)

Country Link
CN (1) CN111429939B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112866896B (en) * 2021-01-27 2022-07-15 北京拓灵新声科技有限公司 Immersive audio upmixing method and system
CN113053408B (en) * 2021-03-12 2022-06-14 云知声智能科技股份有限公司 Sound source separation method and device
CN114255749A (en) * 2021-04-06 2022-03-29 北京安声科技有限公司 Floor sweeping robot
CN113362847A (en) * 2021-05-26 2021-09-07 北京小米移动软件有限公司 Audio signal processing method and device and storage medium
CN114624652B (en) * 2022-03-16 2022-09-30 浙江浙能技术研究院有限公司 Sound source positioning method under strong multipath interference condition
CN115762525B (en) * 2022-11-18 2024-05-07 北京中科艺杺科技有限公司 Voice filtering and recording method and system based on omnibearing voice acquisition
CN115825853A (en) * 2023-02-14 2023-03-21 深圳时识科技有限公司 Sound source orientation method and device, sound source separation and tracking method and chip

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN204362290U (en) * 2015-01-26 2015-05-27 电子科技大学 A kind of self adaptation sound pick up equipment based on location, direction
CN104991573A (en) * 2015-06-25 2015-10-21 北京品创汇通科技有限公司 Locating and tracking method and apparatus based on sound source array
CN106226739A (en) * 2016-07-29 2016-12-14 太原理工大学 Merge the double sound source localization method of Substrip analysis
CN108109617A (en) * 2018-01-08 2018-06-01 深圳市声菲特科技技术有限公司 A kind of remote pickup method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9609141B2 (en) * 2012-10-26 2017-03-28 Avago Technologies General Ip (Singapore) Pte. Ltd. Loudspeaker localization with a microphone array


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant