CN107017005B - DFT-based dual-channel speech sound separation method - Google Patents

DFT-based dual-channel speech sound separation method

Info

Publication number
CN107017005B
Authority
CN
China
Prior art keywords
channel
speech
dft
sound
right channel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710287632.4A
Other languages
Chinese (zh)
Other versions
CN107017005A (en)
Inventor
叶晨
陈建清
陈适宜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University filed Critical Tongji University
Priority to CN201710287632.4A priority Critical patent/CN107017005B/en
Publication of CN107017005A publication Critical patent/CN107017005A/en
Application granted granted Critical
Publication of CN107017005B publication Critical patent/CN107017005B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 - Voice signal separating
    • G10L21/0308 - Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04S - STEREOPHONIC SYSTEMS
    • H04S1/00 - Two-channel systems
    • H04S1/007 - Two-channel systems in which the audio signals are in digital form

Abstract

The invention relates to a DFT-based dual-channel speech separation method, which comprises the following steps: S1, slicing the time-domain signal sequences of the left channel and the right channel respectively, and performing the DFT to obtain the frequency-domain signal sequences of the left channel and the right channel; S2, obtaining the included-angle condition between the left- and right-channel background-music components and the included-angle condition between the speech component and the frequency-bin signal, and separating the speech from the music; and S3, performing the inverse DFT on the result obtained in step S2 to obtain the time-domain signals of the left channel and the right channel after the speech and the music have been separated. Compared with the prior art, the method can effectively separate background music from speech by using a sliced discrete Fourier transform; different phase-difference conditions are chosen according to the pickup angle range of the sound pickup system and the distance between its two channels, making the calculation more accurate; and the final result is filtered to remove unwanted noise. The method can be applied to karaoke-type mobile phone applications.

Description

DFT-based dual-channel speech sound separation method
Technical Field
The invention relates to a voice processing method, and in particular to a DFT-based dual-channel speech separation method.
Background
The main techniques for separating the human voice operate on frequency and phase, and existing approaches basically combine two manually coordinated operations, such as filtering certain frequency bands and cancelling the phase at certain frequencies. The DFT can efficiently convert time-domain information into frequency-domain information, and the inverse DFT converts frequency-domain information back into the time domain; the DFT is widely used in digital filtering, power-spectrum analysis and communication theory. Applying this technique to the separation of the human voice from background music, and improving it, allows the voice to be separated well.
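As a minimal, non-authoritative illustration of this time/frequency conversion (not part of the claimed method), the following NumPy sketch performs a forward DFT and an inverse DFT on a toy two-channel signal; the 440 Hz tone and the 44.1 kHz sample rate are arbitrary assumptions:

```python
import numpy as np

fs = 44100                                   # assumed sample rate
t = np.arange(fs) / fs                       # one second of audio
left = np.sin(2 * np.pi * 440 * t)           # toy left channel
right = 0.5 * np.sin(2 * np.pi * 440 * t)    # toy right channel

# Forward DFT: time-domain samples -> one complex value per frequency bin
L = np.fft.rfft(left)
R = np.fft.rfft(right)

# Inverse DFT: frequency-domain values -> time-domain samples
left_back = np.fft.irfft(L, n=len(left))

# The round trip is lossless up to numerical precision
assert np.allclose(left, left_back)
```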
A reinforcement separation method for multiple specific musical instruments in single-channel music/voice separation has also been proposed. That method reinforces and separates eight instruments, namely electric guitar, clarinet, violin, piano, acoustic guitar, organ, flute and trumpet, using one layer of single-instrument separators and three layers of multi-instrument combination reinforcers: the first layer of combination reinforcers can separate 2 classes of instrument sounds, the second layer can separate 4 classes, and the third layer can separate 8 classes. However, that technology is limited to separating instrument sounds, so its field of application is narrow; it can only process single-channel music, and a single channel carries too little information to distinguish speech from background music by their differences, so the results are usually unsatisfactory.
Disclosure of Invention
The present invention aims to overcome the defects of the prior art and to provide a DFT-based dual-channel speech separation method that can separate the human voice from background music well.
The purpose of the invention can be realized by the following technical scheme:
a DFT-based two-channel speech sound separation method is used for separating speech sound and background music, and comprises the following steps:
s1, slicing the time domain signal sequences of the left channel and the right channel, and performing DFT conversion to obtain the frequency domain signal sequences of the left channel and the right channel, wherein the signal separation expression of each frequency point is as follows:
|ω_L|·e_L = |ω_humanL|·e_humanL + |ω_musicL|·e_musicL,
|ω_R|·e_R = |ω_humanR|·e_humanR + |ω_musicR|·e_musicR,    (1)
wherein |ω_L| is the modulus of the left-channel signal, e_L is the unit vector of the left-channel signal, |ω_humanL| is the modulus of the left-channel speech component, e_humanL is the unit vector of the left-channel speech component, |ω_musicL| is the modulus of the left-channel background-music component, e_musicL is the unit vector of the left-channel background-music component, |ω_R| is the modulus of the right-channel signal, e_R is the unit vector of the right-channel signal, |ω_humanR| is the modulus of the right-channel speech component, e_humanR is the unit vector of the right-channel speech component, |ω_musicR| is the modulus of the right-channel background-music component, and e_musicR is the unit vector of the right-channel background-music component;
S2, letting, at each frequency bin, |ω_humanL| = |ω_humanR| and e_humanL = e_humanR, obtaining the included-angle condition between the left- and right-channel background-music components and the included-angle condition between the speech component and the frequency-bin signal, and calculating the speech components and the background-music components in formula (1), thereby separating the speech and the music;
and S3, performing the inverse DFT on the result obtained in step S2 and filtering the noise to obtain the time-domain signals of the left channel and the right channel after the speech and the music have been separated.
In step S2, the included-angle condition between the left- and right-channel background-music components is: when the frequency of the frequency-bin signal is greater than 603 Hz,
⟨e_musicL, e_musicR⟩ = π/2,
otherwise
⟨e_musicL, e_musicR⟩ = 2π·d·sin α / λ,
wherein d is the distance between the two channels of the sound pickup system, α is the angle over which a single sound pickup device in the sound pickup system receives audio, λ is the wavelength of the frequency-bin signal, and ⟨·,·⟩ denotes the angle between two vectors.
The maximum angle at which a single sound pickup device receives audio is:
Figure GDA00022554711800000213
In step S2, the angle between the speech component and the frequency point signal is:
Figure GDA00022554711800000214
in step S1, the time domain signal sequences of the left channel and the right channel are divided into a plurality of slices having equal lengths.
Compared with the prior art, after the frequency-domain signals are obtained by the sliced discrete Fourier transform, the method can effectively separate the background music from the speech; different phase-difference conditions are chosen according to the pickup angle range of the sound pickup system and the distance between its two channels, making the calculation more accurate; and the final result is filtered to remove unwanted noise. The method can be applied to karaoke-type mobile phone applications.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
fig. 2 is a schematic diagram illustrating a relationship between a sound pickup system and a sound source according to the present embodiment.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments. The present embodiment is implemented on the premise of the technical solution of the present invention, and a detailed implementation manner and a specific operation process are given, but the scope of the present invention is not limited to the following embodiments.
Examples
As shown in fig. 1, a DFT-based two-channel speech separation method for separating speech from background music includes the following steps:
S1, slicing the time-domain signal sequences of the left channel and the right channel respectively, and performing the DFT to obtain the frequency-domain signal sequences of the left channel and the right channel.
S2, letting, at each frequency bin, |ω_humanL| = |ω_humanR| and e_humanL = e_humanR, acquiring the included-angle condition between the left- and right-channel background-music components and the included-angle condition between the speech component and the frequency-bin signal, and separating the speech and the music.
S3, performing the inverse DFT on the result obtained in step S2 and filtering the noise to obtain the time-domain signals of the left channel and the right channel after the speech and the music have been separated.
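A minimal sketch of the S1-S3 flow is given below, assuming a fixed slice length of 4096 samples, NumPy's rfft/irfft for the DFT, and a caller-supplied per-bin separation rule (the rule itself is derived in the rest of this embodiment); none of these specifics are stated in the patent text.

```python
import numpy as np

def separate_stereo(left, right, separate_bin, frame_len=4096):
    """Skeleton of steps S1-S3: slice, DFT, per-bin separation, inverse DFT."""
    n_frames = min(len(left), len(right)) // frame_len
    voice = np.zeros(n_frames * frame_len)
    music_left = np.zeros(n_frames * frame_len)
    music_right = np.zeros(n_frames * frame_len)
    for i in range(n_frames):
        sl = slice(i * frame_len, (i + 1) * frame_len)
        # S1: sliced DFT of both channels
        L = np.fft.rfft(left[sl])
        R = np.fft.rfft(right[sl])
        # S2: per-bin separation rule supplied by the caller
        H, ML, MR = separate_bin(L, R)
        # S3: inverse DFT of each separated component back to the time domain
        voice[sl] = np.fft.irfft(H, n=frame_len)
        music_left[sl] = np.fft.irfft(ML, n=frame_len)
        music_right[sl] = np.fft.irfft(MR, n=frame_len)
    return voice, music_left, music_right
```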
For each frequency bin, there is the following equation:
ω_i = ω_human + ω_music,    (1)
wherein ω_i is the complex value of the i-th frequency bin, ω_human is the speech component of the i-th frequency bin, and ω_music is the background-music component. All variables are complex; in other words, the above formula can be written as
ω_i = |ω_human|·e_human + |ω_music|·e_music.
For a song with two channels there will be
ω_L = |ω_humanL|·e_humanL + |ω_musicL|·e_musicL,
ω_R = |ω_humanR|·e_humanR + |ω_musicR|·e_musicR.    (2)
In equation (1) the left side is known: it is the complex value of a given frequency bin, which can itself be decomposed into a unit vector and a modulus. The left side supplies two real knowns while the right side contains four real unknowns, so, since the frequency bins are independent, equation (1) is numerically unsolvable on its own. The same conclusion holds for equation (2). Considering that the vocals of most everyday albums are recorded through a microphone, at any frequency bin the following should hold:
|ω_humanL| = |ω_humanR|,
e_humanL = e_humanR.
Thus, equation (2) is transformed into the following form:
ω_L = |ω_human|·e_human + |ω_musicL|·e_musicL,
ω_R = |ω_human|·e_human + |ω_musicR|·e_musicR.
the discrete Fourier transform can be as follows
Figure GDA0002255471180000043
Wherein:
Figure GDA0002255471180000044
assuming that the slicing is performed for two sequences (containing left and right channels) that are long enough, we get:
Figure GDA0002255471180000045
the result obtained after the sliced Fourier transform is
Figure GDA0002255471180000046
Wherein ω isRijJ term, ω, representing the ith slice of the right channelLijThe jth term representing the ith slice of the left channel. If all slices after inverse transformation are expected to be attached, impulse response is not generated as much as possible, and the change of any frequency point among the slices is required to be as small as possible. A section of unprocessed audio may be selected for its sliced fourier transform and the frequency bins at the same position therein may be selected for analysis to observe continuous and non-subsequent impulse response frequency phase changes.
Assuming a sinusoidal signal, sampling is performed in a time domain slice of a fixed length, where the phase of the signal in the time domain slice is:
Figure GDA0002255471180000051
for the nth sampling period, the range of sampling is considered to be:
Figure GDA0002255471180000052
where n is the number of cycles experienced by a small time domain slice,
Figure GDA0002255471180000053
is an angle that exceeds an integer period within a time slice. So for the Nth time slice, the corresponding (the latter equal sign is not equal meaning) has relative to the first time slice
Figure GDA0002255471180000054
A phase difference.
The characteristics in modulus are less obvious than those in phase, but after the signal is reconstructed in frequency domain, the time domain signal between adjacent slices must be continuous and smooth, otherwise obvious impulse response will occur.
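The inter-slice phase behaviour described above can be checked numerically with the short sketch below; the 440 Hz tone, 44.1 kHz sample rate and 1024-sample slices are assumed values, and the per-slice phase step of the dominant bin is compared against the residual angle Δφ:

```python
import numpy as np

fs, f, frame = 44100, 440.0, 1024
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * f * t)

# Phase of the dominant bin in each consecutive slice
phases = []
for start in range(0, len(x) - frame, frame):
    X = np.fft.rfft(x[start:start + frame])
    k = np.argmax(np.abs(X))                  # dominant frequency bin
    phases.append(np.angle(X[k]))

# Observed phase step between adjacent slices, wrapped to (-pi, pi]
steps = np.angle(np.exp(1j * np.diff(phases)))

# Residual angle: the fraction of a period left over after the whole cycles per slice
expected = np.angle(np.exp(1j * 2 * np.pi * ((f * frame / fs) % 1.0)))
print(steps[:3], expected)                    # the steps cluster around `expected`
```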
Next, an attempt is made to rebuild new X′_Rij and X′_Lij from the X_Rij and X_Lij in equation (7).
The frequency-domain result obtained from the DFT can now be processed by the speech/music separation algorithm. All of the following processing is performed independently at each frequency bin. Let S(m, n) denote the input parameter here. Considering that the signal comes from the left and right channels, combining with equation (2) gives:
ω_L = g1·|ω_human|·e_human + |ω_musicL|·e_musicL,
ω_R = g2·|ω_human|·e_human + |ω_musicR|·e_musicR.
If we let:
Figure GDA0002255471180000057
the following simple equation is obtained:
Figure GDA0002255471180000061
the parameter g here is not presented above, and the model assumed here is more accurate. If the speech sounds are loudness shifted during post-processing, such as when singing at the ear, g1 ≠ g 2. However, only g is considered here1=g2The case (1). It is not important what the two parameters are in particular, since
Figure GDA0002255471180000062
Is unknown. In other words, if g1,g2Increase in same ratio only will
Figure GDA0002255471180000063
Is reduced on a par, but does not affect
Figure GDA0002255471180000064
And (4) solving. Establishing at this point, let:
Figure GDA0002255471180000065
equation (13) can become:
Figure GDA0002255471180000066
Consider a given frequency bin: apart from the middle part (the speech component), everything is generated by a variety of instruments and synthesizers. The sounds emitted by these sources reach the recording points of the left and right channels with differing phase differences. It is assumed that the sound sources are uniformly distributed with respect to the left and right channels; in other words, at a given frequency bin the various phase differences are uniformly distributed. Therefore, considering the averaged overall effect at a frequency bin, the angle between the left- and right-channel background-music components is taken to be π/2. This gives a first additional condition, derived from assumption and prior knowledge, hereinafter referred to as the first phase difference condition:
⟨e_musicL, e_musicR⟩ = π/2.
this equation seems simple and is the key to solving the problem. This assumption is actually problematic because the distance between the recordings of the left and right channels is only about 30cm, and for parts within 300Hz, it is almost impossible for the left and right channels to differ by 90 °. Because the wavelength of the sound wave within 300Hz must be greater than 1 meter, considering the distance from the sound source to the left and right channel access points, there will be:
Figure GDA00022554711800000610
the frequency point emitted by the source will reach 0.6 pi only when the source is present on the extension of the two receivers. The optimization of the selection of different frequency angles is discussed in detail below. Here again in
Figure GDA00022554711800000611
The equation is solved for the condition.
Substituting equation (14) into (15) would be:
Figure GDA00022554711800000612
simplifying to obtain:
Figure GDA0002255471180000071
where θ is
Figure GDA0002255471180000072
And
Figure GDA0002255471180000073
(the angle between the two vectors above), which is usually small. In fact, it is convenient to approximate θ as 0 here. Thus:
Figure GDA0002255471180000074
Solving this quadratic equation gives two roots:
Figure GDA0002255471180000075
the negative sign is taken here in view of the problem of energy distribution. Therefore:
Figure GDA0002255471180000076
all components that need to be solved are:
Figure GDA0002255471180000077
Figure GDA0002255471180000078
Figure GDA0002255471180000079
The next step is to substitute into equation (11) and compute the inverse DFT. The final result obtained is then filtered to remove unwanted noise.
When solving equation (4) above, equation (5) was used, i.e. it was assumed that the mean of the angles over all frequency bins should be π/2. The premise of this assumption is that at any frequency bin the sound sources are rich enough and their phase differences are distributed over the whole real axis (this does not conflict with the angle lying between 0 and 180 degrees, since fixing the angle essentially maps its domain uniformly onto the real axis). Of course this is not really the case: background music often comes from small indoor recordings, and various special effects are added in software afterwards. One such scheme is the sound-image panning system mentioned above, which usually places a recorded source at a virtual, specific distance and then obtains different left and right channels by computer simulation.
In addition, the influence of the reception angle of the sound pickup system needs to be considered. For two specific observation points, namely the left and right ears for a human listener, the left- and right-channel receivers for a sound pickup system, or the two simulated reception points used in post-processing, the source locations usually lie within a limited sound field in front of the two observation points, as shown in Fig. 2, where A and B are the image points.
For a small-band accompaniment the requirement on this angle is usually not critical; in other words, the line from a sound source to the pickups does not make a very small acute angle θ with the extension of the line through the two pickups.
It is also worth examining the distance of the audio image point from the sound pickup system and the distance between the two channels of the pickup system. For a typical modern pickup system, the distance between the two channels is
d = 30 cm,
and h is normally 1-2 m; this distance is usually chosen more freely, and in practice it reflects the distance at which the image point is placed during post-production. The phase difference with which a given sound source reaches the two sound pickup devices can therefore be taken as:
Figure GDA0002255471180000081
This equation shows that the phase difference between the two pickups is not changed drastically by the distance of the sound source from the pickup system; since d is fixed, in the low-frequency range the phase difference usually fluctuates only within a certain range. This contradicts a strong assumption established in the previous section:
⟨e_musicL, e_musicR⟩ = π/2.
The reason is that when λ is large, since d is small and θ is large, the phase difference between the two sound pickup devices cannot reach π/2. By refining the range of this angle, a more accurate average value of the angle can be given.
In particular, an upper limit may be given for α in Fig. 2; it may be assumed that all sound sources lie on one side of the sound pickup system and that:
Figure GDA0002255471180000085
Under the above conditions, for a sound wave of wavelength λ, the maximum value of the phase difference between the two sound pickup devices is then bounded, while sound sources located on the perpendicular bisector of the two pickup devices exhibit no phase difference at all:
Figure GDA0002255471180000086
based on equation (23), there is:
Figure GDA0002255471180000091
Now equation (15) from the previous section is corrected; the corrected form is hereinafter referred to as the second phase difference condition:
⟨e_musicL, e_musicR⟩ = 2π·d·sin α / λ.
given the parameter values, λ is the wavelength of the current processing frequency point,
Figure GDA0002255471180000093
d = 0.3 m. For high-frequency sound waves, e.g. those above 2 kHz, since
Figure GDA0002255471180000094
the assumption given here is no longer valid, and the first phase difference condition is still used.
From equation (25) and equation (14), there is:
Figure GDA0002255471180000095
Substituting α and d gives:
Figure GDA0002255471180000096
from equation (26), when the wavelength is less than 0.2819m, the phase difference condition should be chosen as equation (15). Considering that the operation is in the frequency domain and the velocity of the acoustic wave in air is 340m/s, there are:
Figure GDA0002255471180000097
therefore, the second phase difference condition is selected when the detection frequency point is smaller than 603Hz, and the first phase difference condition is selected when the frequency point value is larger than 603 Hz. Given the constraints, the equations can be solved. Under the second phase difference condition, the following equation is given:
Figure GDA0002255471180000098
Here the coefficient in front of
Figure GDA0002255471180000099
has been removed purely for brevity of writing; moreover, as explained above, this factor has no practical effect. Under the first phase difference condition,
Figure GDA00022554711800000910
is used so that the product terms cancel directly, giving a simple result. Under the second phase difference condition, however, the simplified result has to deal with a
Figure GDA00022554711800000911
quadratic term. Numerically this quadratic term is the product of the two roots, and the problem becomes a quartic equation in one unknown.
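Before continuing, the frequency-dependent switch between the two phase difference conditions can be summarised in a short sketch. The 603 Hz threshold, the 0.3 m spacing d and the 340 m/s speed of sound are taken from the text above, while the pickup angle alpha is left as an assumed parameter, since its numerical bound appears only as an image in the source document:

```python
import numpy as np

SPEED_OF_SOUND = 340.0   # m/s, as stated in the description
D = 0.3                  # m, distance between the two pickup devices
F_SWITCH = 603.0         # Hz, threshold quoted in the description and in claim 2

def music_angle_condition(freq_hz, alpha_rad):
    """Target angle between the left/right background-music components at one bin.

    Above the threshold: first phase difference condition (pi/2).
    Below it: second condition, 2*pi*d*sin(alpha)/lambda, as in claim 2.
    alpha_rad is an assumed pickup angle, not a value taken from the patent text.
    """
    if freq_hz > F_SWITCH:
        return np.pi / 2
    wavelength = SPEED_OF_SOUND / max(freq_hz, 1e-6)
    return 2 * np.pi * D * np.sin(alpha_rad) / wavelength
```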
Considering that most of the energy of audio is concentrated in the low and middle frequencies, and that the two sound pickup devices are only about 30 cm apart, the attenuation experienced in the air does not differ much between the two channels. Specifically, because the path difference is short, air absorption attenuation and ground absorption attenuation can be neglected and only spreading attenuation need be considered; assuming further that the sound source is at least 1 m away from the two pickup devices, there is:
P1 / P2 ≤ (l2 / l1)^2 = (1.3 / 1)^2 = 1.69,
wherein l1 and l2 are the distances from the sound source to the two pickup devices, and P1 and P2 are the sound pressures with which the emitted sound wave reaches the two pickup devices. In practice this ratio should be only slightly larger than 1, rather than close to the 1.69 given by the above equation, because the sound source is usually roughly in front of the two pickup devices rather than on the extension of the line through them, and its distance is also greater than 1 m. The significance of the equation is to give an upper bound on the variation, which supports the following approximation:
|ω_musicL| ≈ |ω_musicR|.
Combining this with equation (28) gives the error range of the approximation:
Figure GDA0002255471180000103
in fact this is an acceptable error range. And it is believed that in most cases, this approximation will yield more accurate results. Substituting it into equation (27), trying to eliminate
Figure GDA0002255471180000104
Obtaining:
Figure GDA0002255471180000105
simplifying to obtain:
Figure GDA0002255471180000106
the first order term is approximated as a scalar quantity, under the same principle as equation (18):
Figure GDA0002255471180000107
The coefficients of this quadratic equation in one unknown are:
Figure GDA0002255471180000108
The solution still follows the scheme above; for consistency of sign, the root with the negative sign should again be taken, and in principle the left and right channels should not come out inverted:
Figure GDA0002255471180000111
Owing to various reflections and diffraction, the difference between the two sides is not nearly zero at low frequencies. Here it is simply written as:
Figure GDA0002255471180000112
The inverse short-time Fourier transform is then applied to the processed left and right channels to obtain time-domain signals. The time-domain signals are filtered to remove the high-frequency noise produced by the processing, giving the final result:
Figure GDA0002255471180000113
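A minimal sketch of this final reconstruction step is given below; the use of scipy.signal and the 16 kHz Butterworth low-pass cut-off are assumptions chosen for illustration, since the text only states that the high-frequency noise produced by the processing is filtered out:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def reconstruct_and_filter(frames, frame_len, fs, cutoff_hz=16000.0):
    """Inverse-DFT each processed slice, concatenate, then low-pass filter.

    frames: list of complex rfft arrays (one per slice) for one output channel.
    cutoff_hz is an assumed value; the text only asks that processing noise be removed.
    """
    # Inverse DFT per slice and concatenation back into one waveform
    signal = np.concatenate([np.fft.irfft(F, n=frame_len) for F in frames])
    # Low-pass filtering to suppress high-frequency artefacts of the processing
    b, a = butter(4, cutoff_hz, btype="low", fs=fs)
    return filtfilt(b, a, signal)
```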

Claims (5)

1. A DFT-based two-channel speech separation method for separating speech and background music, characterized by comprising the following steps:
s1, slicing the time domain signal sequences of the left channel and the right channel, and performing DFT conversion to obtain the frequency domain signal sequences of the left channel and the right channel, wherein the signal separation expression of each frequency point is as follows:
|ω_L|·e_L = |ω_humanL|·e_humanL + |ω_musicL|·e_musicL,
|ω_R|·e_R = |ω_humanR|·e_humanR + |ω_musicR|·e_musicR,    (1)
wherein |ω_L| is the modulus of the left-channel signal, e_L is the unit vector of the left-channel signal, |ω_humanL| is the modulus of the left-channel speech component, e_humanL is the unit vector of the left-channel speech component, |ω_musicL| is the modulus of the left-channel background-music component, e_musicL is the unit vector of the left-channel background-music component, |ω_R| is the modulus of the right-channel signal, e_R is the unit vector of the right-channel signal, |ω_humanR| is the modulus of the right-channel speech component, e_humanR is the unit vector of the right-channel speech component, |ω_musicR| is the modulus of the right-channel background-music component, and e_musicR is the unit vector of the right-channel background-music component;
S2, letting, at each frequency bin, |ω_humanL| = |ω_humanR| and e_humanL = e_humanR, obtaining the included-angle condition between the left- and right-channel background-music components and the included-angle condition between the speech component and the frequency-bin signal, and calculating the speech components and the background-music components in formula (1), thereby separating the speech and the music;
and S3, performing the inverse DFT on the result obtained in step S2 to obtain the time-domain signals of the left channel and the right channel after the speech and the music have been separated.
2. The DFT-based two-channel speech separation method according to claim 1, wherein in step S2 the included-angle condition between the left- and right-channel background-music components is: when the frequency of the frequency-bin signal is greater than 603 Hz,
⟨e_musicL, e_musicR⟩ = π/2,
otherwise
⟨e_musicL, e_musicR⟩ = 2π·d·sin α / λ,
wherein d is the distance between the two channels of the sound pickup system, α is the angle over which a single sound pickup device in the sound pickup system receives audio, λ is the wavelength of the frequency-bin signal, and the symbol ⟨·,·⟩ denotes the angle between two vectors.
3. The DFT-based two-channel speech separation method according to claim 2, wherein the maximum angle at which the single sound pickup device receives audio is:
Figure FDA00022554711700000113
4. The DFT-based two-channel speech separation method according to claim 1, wherein in step S1, the time domain signal sequences of the left channel and the right channel are divided into a plurality of slices with equal length.
5. The DFT-based two-channel speech separation method according to claim 1, wherein step S3 further comprises: performing noise filtering on the result of the inverse DFT.
CN201710287632.4A 2017-04-27 2017-04-27 DFT-based dual-channel speech sound separation method Active CN107017005B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710287632.4A CN107017005B (en) 2017-04-27 2017-04-27 DFT-based dual-channel speech sound separation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710287632.4A CN107017005B (en) 2017-04-27 2017-04-27 DFT-based dual-channel speech sound separation method

Publications (2)

Publication Number Publication Date
CN107017005A CN107017005A (en) 2017-08-04
CN107017005B true CN107017005B (en) 2020-03-24

Family

ID=59447955

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710287632.4A Active CN107017005B (en) 2017-04-27 2017-04-27 DFT-based dual-channel speech sound separation method

Country Status (1)

Country Link
CN (1) CN107017005B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109994098B (en) * 2019-01-11 2021-02-02 同济大学 Weighted noise active control method based on off-line reconstruction of secondary path
CN110232931B (en) * 2019-06-18 2022-03-22 广州酷狗计算机科技有限公司 Audio signal processing method and device, computing equipment and storage medium
CN112198496B (en) * 2020-09-29 2022-11-29 上海特金无线技术有限公司 Signal processing method, device and equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101604012A (en) * 2008-06-11 2009-12-16 索尼株式会社 Signal processing apparatus, signal processing method and program
CN102402977A (en) * 2010-09-14 2012-04-04 无锡中星微电子有限公司 Method for extracting accompaniment and human voice from stereo music and device of method
CN102568493A (en) * 2012-02-24 2012-07-11 大连理工大学 Underdetermined blind source separation (UBSS) method based on maximum matrix diagonal rate
JP5273080B2 (en) * 2010-03-30 2013-08-28 ブラザー工業株式会社 Singing voice separation device and program
CN104167214A (en) * 2014-08-20 2014-11-26 电子科技大学 Quick source signal reconstruction method achieving blind sound source separation of two microphones
CN104616663A (en) * 2014-11-25 2015-05-13 重庆邮电大学 Music separation method of MFCC (Mel Frequency Cepstrum Coefficient)-multi-repetition model in combination with HPSS (Harmonic/Percussive Sound Separation)
CN105654963A (en) * 2016-03-23 2016-06-08 天津大学 Voice underdetermined blind identification method and device based on frequency spectrum correction and data density clustering

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101604012A (en) * 2008-06-11 2009-12-16 索尼株式会社 Signal processing apparatus, signal processing method and program
JP5273080B2 (en) * 2010-03-30 2013-08-28 ブラザー工業株式会社 Singing voice separation device and program
CN102402977A (en) * 2010-09-14 2012-04-04 无锡中星微电子有限公司 Method for extracting accompaniment and human voice from stereo music and device of method
CN102568493A (en) * 2012-02-24 2012-07-11 大连理工大学 Underdetermined blind source separation (UBSS) method based on maximum matrix diagonal rate
CN104167214A (en) * 2014-08-20 2014-11-26 电子科技大学 Quick source signal reconstruction method achieving blind sound source separation of two microphones
CN104616663A (en) * 2014-11-25 2015-05-13 重庆邮电大学 Music separation method of MFCC (Mel Frequency Cepstrum Coefficient)-multi-repetition model in combination with HPSS (Harmonic/Percussive Sound Separation)
CN105654963A (en) * 2016-03-23 2016-06-08 天津大学 Voice underdetermined blind identification method and device based on frequency spectrum correction and data density clustering

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Voice Conversion Using Dynamic Frequency Warping With Amplitude Scaling, for Parallel or Nonparallel Corpora";Elizabeth Godoy等;《IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING》;20120531;第20卷(第4期);全文 *
"非平稳环境中的盲源分离算法研究";刘建强;《中国博士学位论文全文数据库 信息科技辑》;20090115(第01期);全文 *

Also Published As

Publication number Publication date
CN107017005A (en) 2017-08-04

Similar Documents

Publication Publication Date Title
US9111526B2 (en) Systems, method, apparatus, and computer-readable media for decomposition of a multichannel music signal
EP1741313B1 (en) A method and system for sound source separation
US10430154B2 (en) Tonal/transient structural separation for audio effects
KR20180050652A (en) Method and system for decomposing sound signals into sound objects, sound objects and uses thereof
EP2946382B1 (en) Vehicle engine sound extraction and reproduction
CN107017005B (en) DFT-based dual-channel speech sound separation method
Argenti et al. Automatic transcription of polyphonic music based on the constant-Q bispectral analysis
WO2006090589A1 (en) Sound separating device, sound separating method, sound separating program, and computer-readable recording medium
JP6452653B2 (en) A system for modeling the characteristics of musical instruments
Colonel et al. Reverse engineering of a recording mix with differentiable digital signal processing
JP2017090888A (en) Method for modeling characteristic of instrument
EP1463030B1 (en) Reverberation sound generating apparatus
CN107146630B (en) STFT-based dual-channel speech sound separation method
Lee et al. Musical onset detection based on adaptive linear prediction
Itoyama et al. Integration and adaptation of harmonic and inharmonic models for separating polyphonic musical signals
Benetos et al. Auditory spectrum-based pitched instrument onset detection
Pishdadian et al. A multi-resolution approach to common fate-based audio separation
Wu et al. Multipitch estimation by joint modeling of harmonic and transient sounds
Han et al. Reconstructing completely overlapped notes from musical mixtures
Woodruff et al. Resolving overlapping harmonics for monaural musical sound separation using pitch and common amplitude modulation
JP5397786B2 (en) Fog removal device
Giampiccolo et al. Virtual Bass Enhancement Via Music Demixing
Dziubiński et al. High accuracy and octave error immune pitch detection algorithms
Gong et al. Monaural musical octave sound separation using relaxed extended common amplitude modulation
Bailey et al. Applications of the phase vocoder in the control of real‐time electronic musical instruments

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant