WO2020057050A1 - Method for extracting direct sound and background sound, and loudspeaker system and sound reproduction method therefor - Google Patents

Method for extracting direct sound and background sound, and loudspeaker system and sound reproduction method therefor

Info

Publication number
WO2020057050A1
WO2020057050A1 (PCT/CN2019/075368)
Authority
WO
WIPO (PCT)
Prior art keywords
signal
sound
background
direct sound
channel
Prior art date
Application number
PCT/CN2019/075368
Other languages
French (fr)
Chinese (zh)
Inventor
叶超
蔡野锋
马登永
沐永生
Original Assignee
中科上声(苏州)电子有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中科上声(苏州)电子有限公司
Publication of WO2020057050A1 publication Critical patent/WO2020057050A1/en

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 — Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 — Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 — Voice signal separating

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Stereophonic System (AREA)

Abstract

A method for extracting a direct sound and a background sound, and a loudspeaker system and sound reproduction method therefor, capable of better distinguishing a direct sound from a background sound. The method comprises the following steps: S1, respectively performing a short-time Fourier transform on a left-channel signal x_L(n) and a right-channel signal x_R(n) to obtain X_L(m,k) and X_R(m,k) respectively corresponding to the left-channel signal and the right-channel signal, wherein n represents a time-domain sampling point, and m and k respectively represent a discrete time and a discrete frequency; S2, introducing a spatial factor to express the background signal as a signal generated by one signal passing through different transmission paths in a room, and respectively performing energy estimation on X_L(m,k) and X_R(m,k) to obtain the energies P_L(m,k) and P_R(m,k) of the left and right channels; S3, setting the value of the spatial factor and performing signal separation to obtain an estimate Ŝ(m,k) of the direct sound signal, an estimate N̂_L(m,k) of the left-channel background signal, and an estimate N̂_R(m,k) of the right-channel background signal in the time-frequency domain; and S4, performing the inverse Fourier transform to obtain the time-domain direct sound signal ŝ(n), left-channel background signal n̂_L(n), and right-channel background signal n̂_R(n).

Description

Direct sound and background sound extraction method, speaker system and sound reproduction method therefor

Technical Field
The present invention relates to methods for converting a stereo two-channel signal into a multi-channel signal, and in particular to a direct sound and background sound extraction method based on frequency-domain spatial decomposition, to a speaker system, and to a sound reproduction method thereof.
Background Art
At present most audio sources are still stereophonic: CDs, MP3 files, broadcast signals and so on are two-channel outputs with only a left and a right channel (L, R). All of the characteristic information, such as the direct sound signal, the reverberant sound signal, the sound source positions and the size of the sound field, is contained in these two channels. When a stereo source is reproduced over multiple loudspeakers, feeding the left and right channel signals directly to each loudspeaker confuses the spatial sound field. It is therefore necessary to use digital signal processing to convert the stereo signal into a multi-channel signal and reproduce it over a multi-loudspeaker system in order to build a realistic spatial sound field.
Traditional processing methods generally compute point by point in the time domain. When separating the direct sound and background sound signals, the point-by-point computation of the correlation coefficient easily introduces errors, so the direct sound and the background sound cannot be distinguished well.
For example, the methods based on principal component analysis (PCA) disclosed in US patent US6496584B2 and Chinese patent ZL01802081.X use the minimum mean square error method to compute weighting factors for the left and right channels and separate the speech sound from the background sound. By computing the correlation coefficient between the left and right channels, the vector relationship of the acoustic signals in three-dimensional coordinates is determined; the speech sound and background sound are then divided into four signals, left, centre, right and surround, according to the principle of energy conservation, and the surround signal is further split into left-rear and right-rear surround by a decorrelation filter, achieving a two-channel to five-channel conversion. This method computes in the time domain and is simple and fast, but the PCA analysis can only separate a single surround signal, and splitting the left-rear and right-rear surround with a decorrelation filter introduces a certain error.
Summary of the Invention
In view of the above problems, the present invention aims to provide a direct sound and background sound extraction method that can better distinguish the direct sound from the background sound. The invention also aims to provide a speaker system that performs sound reproduction based on this extraction method, and a sound reproduction method therefor.
According to a first aspect of the present invention, the technical solution adopted is as follows:
A direct sound and background sound extraction method comprises the following steps:
S1. Apply a short-time Fourier transform to the left channel signal x_L(n) and the right channel signal x_R(n) to obtain X_L(m,k) and X_R(m,k) corresponding to the left and right channel signals respectively, where n denotes the time-domain sampling point and m and k denote discrete time and discrete frequency respectively;
S2. Introduce spatial factors to express the background sound signal as a signal generated by one signal passing through different transmission paths in the room, and perform energy estimation on X_L(m,k) and X_R(m,k) respectively to obtain the energies P_L(m,k) and P_R(m,k) of the left and right channels;
S3. Set the value of the spatial factor and perform signal separation to obtain the time-frequency-domain estimate Ŝ(m,k) of the direct sound signal, the estimate N̂_L(m,k) of the left-channel background sound signal, and the estimate N̂_R(m,k) of the right-channel background sound signal;
S4. Apply the inverse Fourier transform to obtain the time-domain direct sound signal ŝ(n), left-channel background sound signal n̂_L(n), and right-channel background sound signal n̂_R(n).
In an embodiment, in step S1,
x_L(n) = Σ_{j=1}^{J} a_{L,j} · s_j(n) + n_L(n)

x_R(n) = Σ_{j=1}^{J} a_{R,j} · s_j(n) + n_R(n)

where J denotes the number of direct sound sources present in the space, s_j(n) denotes the direct sound signal of the j-th source at a given instant, a_{L,j} and a_{R,j} denote the coefficients with which the direct sound signal is assigned to the left and right channel signals respectively, and n_L(n) and n_R(n) denote the background signals of the left and right channels respectively;

X_L(m,k) = Σ_{j=1}^{J} A_{L,j}(m,k) · S_j(m,k) + N_L(m,k)

X_R(m,k) = Σ_{j=1}^{J} A_{R,j}(m,k) · S_j(m,k) + N_R(m,k)

where S_j(m,k) denotes the direct sound signal in the time-frequency domain, A_{L,j}(m,k) and A_{R,j}(m,k) denote the time-frequency-domain coefficients with which the direct sound signal is assigned to the left and right channel signals respectively, and N_L(m,k) and N_R(m,k) denote the time-frequency-domain background signals of the left and right channels respectively.
In an embodiment, step S2 specifically comprises:
S21. At a given time m and in a given frequency band k only one sound source S_i is present, i.e. S(m,k) = S_i(m,k), so that

X_L(m,k) = A_L(m,k) · S(m,k) + N_L(m,k)

X_R(m,k) = A_R(m,k) · S(m,k) + N_R(m,k)

where A_L and A_R denote the coefficients with which the direct sound signal is assigned to the left and right channel signals respectively;
S22. Introduce spatial factors B_L(m,k) and B_R(m,k) such that N_L(m,k) = B_L(m,k)·N(m,k) and N_R(m,k) = B_R(m,k)·N(m,k), with

B_L(m,k) = b_L(m,k) · e^{jφ_L(m,k)}

B_R(m,k) = b_R(m,k) · e^{jφ_R(m,k)}

|b_L(m,k)| ≤ 1, |b_R(m,k)| ≤ 1,

where N(m,k) denotes the time-frequency-domain background signal, b_L(m,k) and b_R(m,k) denote the amplitudes of the left and right channel spatial factors respectively, and φ_L(m,k) and φ_R(m,k) denote the phases of the left and right channel spatial factors respectively;
Then X_L(m,k) and X_R(m,k) simplify to:

X_L(m,k) = A_L(m,k)·S(m,k) + B_L(m,k)·N(m,k)

X_R(m,k) = A_R(m,k)·S(m,k) + B_R(m,k)·N(m,k)

The correlation coefficient between the left and right channel signals is

Φ(m,k) = E{X_L(m,k)·X_R*(m,k)} / sqrt( E{|X_L(m,k)|²} · E{|X_R(m,k)|²} )

where E{·} denotes the expectation of the signal;
S23. From the energy point of view, the energies P_L(m,k) and P_R(m,k) of the left and right channels are obtained as:

P_L(m,k) = E{|X_L(m,k)|²} = |A_L(m,k)|²·P_S(m,k) + b_L²(m,k)·P_N(m,k)

P_R(m,k) = E{|X_R(m,k)|²} = |A_R(m,k)|²·P_S(m,k) + b_R²(m,k)·P_N(m,k)

where P_S(m,k) = E{|S(m,k)|²} and P_N(m,k) = E{|N(m,k)|²} denote the energies of the direct sound signal and of the background signal respectively.
Preferably, step S3 specifically comprises: setting the value of the spatial factor so as to obtain an analytical solution for P_S(m,k), P_N(m,k), A_L(m,k) and A_R(m,k), and computing the direct sound estimate Ŝ(m,k) from formula (1) and the background estimate N̂(m,k) from formula (2); substituting the spatial factors B_L(m,k) and B_R(m,k) into formula (2) then yields N̂_L(m,k) and N̂_R(m,k).
More preferably, the value of the spatial factor is set to b_L(m,k) = b_R(m,k) = 1.
According to a second aspect of the present invention, the technical solution adopted is as follows:
A sound reproduction method for a speaker system, which uses the direct sound and background sound extraction method described above to separate the direct sound signal and the background sound signal, and distributes the direct sound signal and the background sound signal to the individual speakers of the speaker system for sound reproduction.
Specifically, the direct sound signal and the background sound signal are allocated to the individual speakers of the speaker system according to the position of the sound image in the stereo signal and the number and positions of the speakers of the speaker system.
According to a third aspect of the present invention, the technical solution adopted is as follows:
A speaker system comprising a plurality of speakers, characterised in that the speaker system further comprises an extraction device for carrying out the direct sound and background sound extraction method described above.
Specifically, the extraction device comprises an STFT module, an energy estimation module, a signal separation module and an ISTFT module connected in sequence,
the input of the STFT module being the left channel signal x_L(n) and the right channel signal x_R(n); after performing the short-time Fourier transform it outputs X_L(m,k) and X_R(m,k) corresponding to the left and right channel signals;
the energy estimation module receiving X_L(m,k) and X_R(m,k) output by the STFT module, introducing the spatial factors to express the background sound signal as a signal generated by one signal passing through different transmission paths in the room, performing energy estimation on X_L(m,k) and X_R(m,k) respectively, and outputting the resulting left and right channel energies P_L(m,k) and P_R(m,k) to the signal separation module;
the signal separation module setting the value of the spatial factor and performing the signal separation to obtain Ŝ(m,k), N̂_L(m,k) and N̂_R(m,k), which are output to the ISTFT module;
the ISTFT module performing the inverse Fourier transform and outputting the direct sound signal ŝ(n), the left-channel background sound signal n̂_L(n) and the right-channel background sound signal n̂_R(n).
By adopting the above scheme, the present invention has the following advantages over the prior art:

By defining spatial-factor variables between the left and right channel signals, the differences between the left and right channels that room reverberation, room size and other factors impose on the background sound signal during sound propagation are characterised; the background sound signals of the left and right channels can both be separated, whereas the traditional method can only separate a single background signal.
Brief Description of the Drawings

In order to explain the technical solution of the present invention more clearly, the drawings required for the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a signal-processing flowchart of the direct sound and background sound extraction method according to the present invention;

FIG. 2 shows the left and right channel signals;

FIG. 3 shows the correlation coefficient of the left and right channel signals at a certain moment;

FIGS. 4a, 4b and 4c show the separated direct sound signal, left-channel background sound signal and right-channel background sound signal, respectively.
Detailed Description

Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings, so that the advantages and features of the invention can be understood more easily by those skilled in the art. It should be noted that the description of these embodiments is intended to help the understanding of the invention and does not limit it. In addition, the technical features involved in the embodiments of the invention described below can be combined with one another as long as they do not conflict.
This embodiment provides a direct sound and background sound extraction method. Referring to the signal flowchart shown in FIG. 1, the extraction method comprises the following steps:
S1. Apply a short-time Fourier transform (STFT) to the left channel signal x_L(n) and the right channel signal x_R(n) to obtain X_L(m,k) and X_R(m,k) corresponding to the left and right channel signals respectively, where n denotes the time-domain sampling point and m and k denote discrete time and discrete frequency respectively;
S2. Introduce spatial factors to express the background sound signal as a signal generated by one signal passing through different transmission paths in the room, and perform energy estimation on X_L(m,k) and X_R(m,k) respectively to obtain the energies P_L(m,k) and P_R(m,k) of the left and right channels;
S3. Set the value of the spatial factor and perform signal separation to obtain the time-frequency-domain estimate Ŝ(m,k) of the direct sound signal, the estimate N̂_L(m,k) of the left-channel background sound signal, and the estimate N̂_R(m,k) of the right-channel background sound signal;
S4. Apply the inverse short-time Fourier transform (ISTFT) to obtain the time-domain direct sound signal ŝ(n), left-channel background sound signal n̂_L(n), and right-channel background sound signal n̂_R(n).
Specifically, as shown in FIG. 2, the left and right channel signals are:

x_L(n) = Σ_{j=1}^{J} a_{L,j} · s_j(n) + n_L(n)

x_R(n) = Σ_{j=1}^{J} a_{R,j} · s_j(n) + n_R(n)

where s_j(n) denotes the direct sound signal of the j-th source at a given instant, a_{L,j} and a_{R,j} denote the coefficients with which the direct sound signal is assigned to the left and right channel signals, and n_L(n) and n_R(n) denote the background signals of the left and right channels.
S1. After the short-time Fourier transform (STFT), this becomes:

X_L(m,k) = Σ_{j=1}^{J} A_{L,j}(m,k) · S_j(m,k) + N_L(m,k)

X_R(m,k) = Σ_{j=1}^{J} A_{R,j}(m,k) · S_j(m,k) + N_R(m,k)

where m and k denote time and frequency, respectively.
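As an aside for readers who want to experiment with step S1, the following Python sketch (not part of the patent; scipy, the synthetic test material and the chosen window length are assumptions of this description) computes the short-time Fourier transforms X_L(m,k) and X_R(m,k) of a stereo pair.

```python
# Sketch of step S1: STFT of the left and right channel signals.
# The test material, sampling rate and window length are illustrative only.
import numpy as np
from scipy.signal import stft

fs = 44100
t = np.arange(fs) / fs                               # 1 s of test material
direct = np.sin(2 * np.pi * 440 * t)                 # a "direct" tone
x_L = 0.8 * direct + 0.05 * np.random.randn(fs)      # left channel + background
x_R = 0.6 * direct + 0.05 * np.random.randn(fs)      # right channel + background

nperseg = 1024                                       # STFT window length
_, _, X_L = stft(x_L, fs=fs, nperseg=nperseg)        # X_L(m, k): (freq, frame)
_, _, X_R = stft(x_R, fs=fs, nperseg=nperseg)        # X_R(m, k)
print(X_L.shape, X_R.shape)
```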
S2. Two assumptions are made:
S21. At a given time m and in a given frequency band k only one sound source S_i is present, i.e. S(m,k) = S_i(m,k). Therefore:

X_L(m,k) = A_L(m,k) · S(m,k) + N_L(m,k)

X_R(m,k) = A_R(m,k) · S(m,k) + N_R(m,k)

where A_L(m,k) and A_R(m,k) denote the coefficients with which the direct sound signal is assigned to the left and right channel signals.
S22. Introduce the spatial factor B, expressing the background sound signal as a signal generated by one signal passing through different transmission paths in the room, analogous to the expression of the direct sound signal, i.e. N_L(m,k) = B_L(m,k)·N(m,k) and N_R(m,k) = B_R(m,k)·N(m,k), with

B_L(m,k) = b_L(m,k) · e^{jφ_L(m,k)}

B_R(m,k) = b_R(m,k) · e^{jφ_R(m,k)}

|b_L(m,k)| ≤ 1, |b_R(m,k)| ≤ 1.
In this way, the above formulas simplify to:

X_L(m,k) = A_L(m,k)·S(m,k) + B_L(m,k)·N(m,k)

X_R(m,k) = A_R(m,k)·S(m,k) + B_R(m,k)·N(m,k)

The correlation coefficient between the left and right channel signals (shown in FIG. 3) is defined as

Φ(m,k) = E{X_L(m,k)·X_R*(m,k)} / sqrt( E{|X_L(m,k)|²} · E{|X_R(m,k)|²} )

where E{·} denotes the expectation of the signal.
S23. From the energy point of view, the energies of the left and right channels are obtained as:

P_L(m,k) = E{|X_L(m,k)|²} = |A_L(m,k)|²·P_S(m,k) + b_L²(m,k)·P_N(m,k)

P_R(m,k) = E{|X_R(m,k)|²} = |A_R(m,k)|²·P_S(m,k) + b_R²(m,k)·P_N(m,k)

where P_S(m,k) = E{|S(m,k)|²} and P_N(m,k) = E{|N(m,k)|²} denote the energies of the direct sound signal and of the background signal respectively.
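Continuing from the arrays X_L and X_R of the previous sketch, the following illustrates one way step S2 can be realised in code. The expectation E{·} is approximated here by first-order recursive smoothing over time frames, which is an assumption of this sketch; the patent only specifies the expectation operator.

```python
# Sketch of step S2: channel energies and inter-channel correlation coefficient.
# E{.} is approximated by recursive smoothing along the time-frame axis.
import numpy as np

def smooth(P, alpha=0.9):
    """First-order recursive smoothing over time frames (axis 1)."""
    out = np.empty_like(P)
    out[:, 0] = P[:, 0]
    for i in range(1, P.shape[1]):
        out[:, i] = alpha * out[:, i - 1] + (1 - alpha) * P[:, i]
    return out

P_L = smooth(np.abs(X_L) ** 2)              # P_L(m, k)
P_R = smooth(np.abs(X_R) ** 2)              # P_R(m, k)
cross = smooth(X_L * np.conj(X_R))          # smoothed E{X_L * conj(X_R)}

# Correlation coefficient between the left and right channel signals
Phi = cross / np.sqrt(P_L * P_R + 1e-12)    # complex-valued, |Phi| <= 1
```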
S3. In general the background sound energy is smaller than the direct sound energy, with P_N(m,k) much smaller than P_S(m,k), so the traditional approach is to ignore P_N(m,k) in the channel energies.
Here, however, the spatial factor is assumed to be b_L = b_R = 1 and the contribution of P_N(m,k) is not ignored; this yields analytical solutions for P_S(m,k), P_N(m,k), A_L(m,k) and A_R(m,k).
The estimates Ŝ(m,k) and N̂(m,k) can then be calculated, and substituting the spatial factors B_L(m,k) and B_R(m,k) gives N̂_L(m,k) and N̂_R(m,k).
S4. Finally, the inverse Fourier transform yields the direct sound signal ŝ(n) and the background sound signals n̂_L(n) and n̂_R(n), shown in FIGS. 4a, 4b and 4c respectively.
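Step S4 is a plain inverse STFT. The sketch below is an illustration rather than the patent's code: S_hat, NL_hat and NR_hat stand for the separated time-frequency arrays Ŝ(m,k), N̂_L(m,k) and N̂_R(m,k) produced by step S3 (however they are obtained), and fs and nperseg are the values used in the STFT sketch above.

```python
# Sketch of step S4: inverse STFT of the separated time-frequency signals.
# S_hat, NL_hat and NR_hat are assumed to be complex arrays with the same
# shape as X_L, produced by the separation of step S3.
from scipy.signal import istft

_, s_hat = istft(S_hat, fs=fs, nperseg=nperseg)      # direct sound s^(n)
_, nL_hat = istft(NL_hat, fs=fs, nperseg=nperseg)    # left background n^_L(n)
_, nR_hat = istft(NR_hat, fs=fs, nperseg=nperseg)    # right background n^_R(n)
```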
In this extraction method: (1) spatial-factor variables are defined between the left and right channel signals to characterise the differences between the two channels that room reverberation, room size and other factors impose on the background sound signal during sound propagation; (2) the background sound signals of the left and right channels can both be separated, whereas the traditional method can only separate a single background signal; (3) after the spatial factor is introduced the computation is relatively simple, and analytical solutions for the direct sound and the background sound are obtained.
This embodiment also provides a sound reproduction method for a speaker system comprising multiple speakers placed at different positions. The sound reproduction method converts a stereo signal into multi-channel sound signals and specifically comprises: separating the direct sound signal and the background sound signal with the direct sound and background sound extraction method described above, and distributing the direct sound signal and the background sound signal to the individual speakers of the speaker system according to the position of the sound image in the stereo signal and the number and positions of the speakers, thereby completing the sound reproduction.
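The patent does not prescribe a particular allocation rule, so the sketch below is only one plausible assignment used for illustration: the direct sound is panned across the front speakers according to the stereo sound-image position, and the left and right background sounds feed the surround speakers. The function name and the five-speaker layout are assumptions of this description.

```python
# Illustrative routing of the separated signals to a five-speaker layout.
# The allocation rule is an assumption of this sketch, not part of the patent.
import numpy as np

def distribute(s_hat, nL_hat, nR_hat, pan=0.5):
    """pan in [0, 1]: 0 = sound image fully left, 1 = fully right."""
    gain_l = np.cos(pan * np.pi / 2)        # constant-power panning gains
    gain_r = np.sin(pan * np.pi / 2)
    return {
        "front_left": gain_l * s_hat,
        "front_right": gain_r * s_hat,
        "centre": 0.5 * (gain_l + gain_r) * s_hat,
        "surround_left": nL_hat,
        "surround_right": nR_hat,
    }

speaker_feeds = distribute(s_hat, nL_hat, nR_hat, pan=0.4)
```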
This embodiment further provides a speaker system comprising a plurality of speakers, the speaker system further comprising an extraction device for carrying out the direct sound and background sound extraction method described above. As shown in FIG. 1, the extraction device comprises an STFT module, an energy estimation module, a signal separation module and an ISTFT module connected in sequence. The STFT module takes the left channel signal x_L(n) and the right channel signal x_R(n) as input and, after the short-time Fourier transform, outputs X_L(m,k) and X_R(m,k). The energy estimation module receives X_L(m,k) and X_R(m,k) from the STFT module, introduces the spatial factors to express the background sound signal as one signal passing through different transmission paths in the room, performs energy estimation on X_L(m,k) and X_R(m,k), and outputs the channel energies P_L(m,k) and P_R(m,k) together with A_L and A_R to the signal separation module. The signal separation module sets the value of the spatial factor, performs the signal separation to obtain Ŝ(m,k), N̂_L(m,k) and N̂_R(m,k), and outputs them to the ISTFT module. The ISTFT module applies the inverse Fourier transform and outputs the direct sound signal ŝ(n), the left-channel background sound signal n̂_L(n) and the right-channel background sound signal n̂_R(n).
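Viewed as software, the extraction device maps naturally onto four small stages. The sketch below mirrors that chain end to end; note that the separation stage is a deliberately crude correlation-based stand-in of our own, because the patent's formulas (1) and (2) are reproduced only as image equations in the original filing and are not restated here.

```python
# Sketch of the extraction device as a chain of four stages:
# STFT -> energy estimation -> signal separation -> ISTFT.
# separate() below is NOT the patent's formulas (1)/(2); it is a crude
# correlation-based stand-in that only keeps the pipeline runnable.
import numpy as np
from scipy.signal import stft, istft

def extract(x_L, x_R, fs=44100, nperseg=1024, alpha=0.9):
    # STFT module
    _, _, X_L = stft(x_L, fs=fs, nperseg=nperseg)
    _, _, X_R = stft(x_R, fs=fs, nperseg=nperseg)

    # Energy estimation module (expectations via recursive smoothing)
    def smooth(P):
        out = np.empty_like(P)
        out[:, 0] = P[:, 0]
        for i in range(1, P.shape[1]):
            out[:, i] = alpha * out[:, i - 1] + (1 - alpha) * P[:, i]
        return out

    P_L = smooth(np.abs(X_L) ** 2)
    P_R = smooth(np.abs(X_R) ** 2)
    coh = np.abs(smooth(X_L * np.conj(X_R))) / np.sqrt(P_L * P_R + 1e-12)

    # Signal separation module (placeholder, not the patent's formulas)
    S_hat = coh * 0.5 * (X_L + X_R)
    NL_hat = X_L - S_hat
    NR_hat = X_R - S_hat

    # ISTFT module
    _, s_hat = istft(S_hat, fs=fs, nperseg=nperseg)
    _, nL_hat = istft(NL_hat, fs=fs, nperseg=nperseg)
    _, nR_hat = istft(NR_hat, fs=fs, nperseg=nperseg)
    return s_hat, nL_hat, nR_hat
```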
The above embodiment merely illustrates the technical concept and features of the present invention. It is a preferred embodiment whose purpose is to enable persons familiar with the art to understand and implement the invention, and it does not limit the scope of protection of the invention.

Claims (9)

  1. A direct sound and background sound extraction method, characterised by comprising the following steps:
    S1. Apply a short-time Fourier transform to the left channel signal x_L(n) and the right channel signal x_R(n) to obtain X_L(m,k) and X_R(m,k) corresponding to the left and right channel signals respectively, where n denotes the time-domain sampling point and m and k denote discrete time and discrete frequency respectively;
    S2. Introduce spatial factors to express the background sound signal as a signal generated by one signal passing through different transmission paths in the room, and perform energy estimation on X_L(m,k) and X_R(m,k) respectively to obtain the energies P_L(m,k) and P_R(m,k) of the left and right channels;
    S3. Set the value of the spatial factor and perform signal separation to obtain the time-frequency-domain estimate Ŝ(m,k) of the direct sound signal, the estimate N̂_L(m,k) of the left-channel background sound signal, and the estimate N̂_R(m,k) of the right-channel background sound signal;
    S4. Apply the inverse Fourier transform to obtain the time-domain direct sound signal ŝ(n), left-channel background sound signal n̂_L(n), and right-channel background sound signal n̂_R(n).
  2. The direct sound and background sound extraction method according to claim 1, characterised in that, in step S1,
    x_L(n) = Σ_{j=1}^{J} a_{L,j} · s_j(n) + n_L(n)

    x_R(n) = Σ_{j=1}^{J} a_{R,j} · s_j(n) + n_R(n)

    where J denotes the number of direct sound sources present in the space, s_j(n) denotes the direct sound signal of the j-th source at a given instant, a_{L,j} and a_{R,j} denote the coefficients with which the direct sound signal is assigned to the left and right channel signals respectively, and n_L(n) and n_R(n) denote the background signals of the left and right channels respectively;

    X_L(m,k) = Σ_{j=1}^{J} A_{L,j}(m,k) · S_j(m,k) + N_L(m,k)

    X_R(m,k) = Σ_{j=1}^{J} A_{R,j}(m,k) · S_j(m,k) + N_R(m,k)

    where S_j(m,k) denotes the direct sound signal in the time-frequency domain, A_{L,j}(m,k) and A_{R,j}(m,k) denote the time-frequency-domain coefficients with which the direct sound signal is assigned to the left and right channel signals respectively, and N_L(m,k) and N_R(m,k) denote the time-frequency-domain background signals of the left and right channels respectively.
  3. The direct sound and background sound extraction method according to claim 1 or 2, characterised in that step S2 specifically comprises:
    S21. At a given time m and in a given frequency band k only one sound source S_i is present, i.e. S(m,k) = S_i(m,k), so that

    X_L(m,k) = A_L(m,k) · S(m,k) + N_L(m,k)

    X_R(m,k) = A_R(m,k) · S(m,k) + N_R(m,k)

    where A_L and A_R denote the coefficients with which the direct sound signal is assigned to the left and right channel signals respectively;
    S22. Introduce spatial factors B_L(m,k) and B_R(m,k) such that N_L(m,k) = B_L(m,k)·N(m,k) and N_R(m,k) = B_R(m,k)·N(m,k), with

    B_L(m,k) = b_L(m,k) · e^{jφ_L(m,k)}

    B_R(m,k) = b_R(m,k) · e^{jφ_R(m,k)}

    |b_L(m,k)| ≤ 1, |b_R(m,k)| ≤ 1,

    where N(m,k) denotes the time-frequency-domain background signal, b_L(m,k) and b_R(m,k) denote the amplitudes of the left and right channel spatial factors respectively, and φ_L(m,k) and φ_R(m,k) denote the phases of the left and right channel spatial factors respectively;
    Then X_L(m,k) and X_R(m,k) simplify to:

    X_L(m,k) = A_L(m,k)·S(m,k) + B_L(m,k)·N(m,k)

    X_R(m,k) = A_R(m,k)·S(m,k) + B_R(m,k)·N(m,k)

    The correlation coefficient between the left and right channel signals is

    Φ(m,k) = E{X_L(m,k)·X_R*(m,k)} / sqrt( E{|X_L(m,k)|²} · E{|X_R(m,k)|²} )

    where E{·} denotes the expectation of the signal;
    S23. From the energy point of view, the energies P_L(m,k) and P_R(m,k) of the left and right channels are obtained as:

    P_L(m,k) = E{|X_L(m,k)|²} = |A_L(m,k)|²·P_S(m,k) + b_L²(m,k)·P_N(m,k)

    P_R(m,k) = E{|X_R(m,k)|²} = |A_R(m,k)|²·P_S(m,k) + b_R²(m,k)·P_N(m,k)

    where P_S(m,k) = E{|S(m,k)|²} and P_N(m,k) = E{|N(m,k)|²} denote the energies of the direct sound signal and of the background signal respectively.
  4. The direct sound and background sound extraction method according to claim 3, characterised in that step S3 specifically comprises: setting the value of the spatial factor so as to obtain an analytical solution for P_S(m,k), P_N(m,k), A_L(m,k) and A_R(m,k), and computing the direct sound estimate Ŝ(m,k) from formula (1) and the background estimate N̂(m,k) from formula (2); substituting the spatial factors B_L(m,k) and B_R(m,k) into formula (2) then yields N̂_L(m,k) and N̂_R(m,k).
  5. The direct sound and background sound extraction method according to claim 4, characterised in that the value of the spatial factor is set to b_L(m,k) = b_R(m,k) = 1.
  6. A sound reproduction method for a speaker system, characterised in that the direct sound and background sound extraction method according to any one of claims 1-5 is used to separate the direct sound signal and the background sound signal, and the direct sound signal and the background sound signal are distributed to the individual speakers of the speaker system for sound reproduction.
  7. The sound reproduction method according to claim 6, characterised in that the direct sound signal and the background sound signal are allocated to the individual speakers of the speaker system according to the position of the sound image in the stereo signal and the number and positions of the speakers of the speaker system.
  8. A speaker system comprising a plurality of speakers, characterised in that the speaker system further comprises an extraction device for carrying out the direct sound and background sound extraction method according to any one of claims 1-5.
  9. The speaker system according to claim 8, characterised in that the extraction device comprises an STFT module, an energy estimation module, a signal separation module and an ISTFT module connected in sequence,
    the input of the STFT module being the left channel signal x_L(n) and the right channel signal x_R(n); after performing the short-time Fourier transform it outputs X_L(m,k) and X_R(m,k) corresponding to the left and right channel signals;
    the energy estimation module receiving X_L(m,k) and X_R(m,k) output by the STFT module, introducing the spatial factors to express the background sound signal as a signal generated by one signal passing through different transmission paths in the room, performing energy estimation on X_L(m,k) and X_R(m,k) respectively, and outputting the resulting left and right channel energies P_L(m,k) and P_R(m,k) to the signal separation module;
    the signal separation module setting the value of the spatial factor and performing the signal separation to obtain Ŝ(m,k), N̂_L(m,k) and N̂_R(m,k), which are output to the ISTFT module;
    the ISTFT module performing the inverse Fourier transform and outputting the direct sound signal ŝ(n), the left-channel background sound signal n̂_L(n) and the right-channel background sound signal n̂_R(n).
PCT/CN2019/075368 2018-09-17 2019-02-18 Method for extracting direct sound and background sound, and loudspeaker system and sound reproduction method therefor WO2020057050A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811072475.6A CN109036455B (en) 2018-09-17 2018-09-17 Direct sound and background sound extraction method, loudspeaker system and sound reproduction method thereof
CN201811072475.6 2018-09-17

Publications (1)

Publication Number Publication Date
WO2020057050A1 true

Family

ID=64621766

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/075368 WO2020057050A1 (en) 2018-09-17 2019-02-18 Method for extracting direct sound and background sound, and loudspeaker system and sound reproduction method therefor

Country Status (2)

Country Link
CN (1) CN109036455B (en)
WO (1) WO2020057050A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109036455B (en) * 2018-09-17 2020-11-06 中科上声(苏州)电子有限公司 Direct sound and background sound extraction method, loudspeaker system and sound reproduction method thereof
CN111669697B (en) * 2020-05-25 2021-05-18 中国科学院声学研究所 Coherent sound and environmental sound extraction method and system of multichannel signal

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101341793A (en) * 2005-09-02 2009-01-07 Lg电子株式会社 Method to generate multi-channel audio signals from stereo signals
US20090198356A1 (en) * 2008-02-04 2009-08-06 Creative Technology Ltd Primary-Ambient Decomposition of Stereo Audio Signals Using a Complex Similarity Index
CN101622669A (en) * 2007-02-26 2010-01-06 高通股份有限公司 Systems, methods, and apparatus for signal separation
CN102804264A (en) * 2010-01-15 2012-11-28 弗兰霍菲尔运输应用研究公司 Apparatus and method for extracting a direct/ambience signal from a downmix signal and spatial parametric information
CN105409247A (en) * 2013-03-05 2016-03-16 弗劳恩霍夫应用研究促进协会 Apparatus and method for multichannel direct-ambient decomposition for audio signal processing
CN109036455A (en) * 2018-09-17 2018-12-18 中科上声(苏州)电子有限公司 Direct sound wave and background sound extracting method, speaker system and its sound playback method

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1286333B1 (en) * 2001-08-21 2004-10-06 Culturecom Technology (Macau) Ltd. Method and apparatus for processing a sound signal
JP5082327B2 (en) * 2006-08-09 2012-11-28 ソニー株式会社 Audio signal processing apparatus, audio signal processing method, and audio signal processing program
US8385556B1 (en) * 2007-08-17 2013-02-26 Dts, Inc. Parametric stereo conversion system and method
CN101894559B (en) * 2010-08-05 2012-06-06 展讯通信(上海)有限公司 Audio processing method and device thereof
CN103000179B (en) * 2011-09-16 2014-11-12 中国科学院声学研究所 Multichannel audio coding/decoding system and method
CN102610237A (en) * 2012-03-21 2012-07-25 山东大学 Digital signal processor (DSP) implementation system for two-channel convolution mixed voice signal blind source separation algorithm
CN104078051B (en) * 2013-03-29 2018-09-25 南京中兴软件有限责任公司 A kind of voice extracting method, system and voice audio frequency playing method and device
CN107146630B (en) * 2017-04-27 2020-02-14 同济大学 STFT-based dual-channel speech sound separation method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101341793A (en) * 2005-09-02 2009-01-07 Lg电子株式会社 Method to generate multi-channel audio signals from stereo signals
CN101622669A (en) * 2007-02-26 2010-01-06 高通股份有限公司 Systems, methods, and apparatus for signal separation
US20090198356A1 (en) * 2008-02-04 2009-08-06 Creative Technology Ltd Primary-Ambient Decomposition of Stereo Audio Signals Using a Complex Similarity Index
CN102804264A (en) * 2010-01-15 2012-11-28 弗兰霍菲尔运输应用研究公司 Apparatus and method for extracting a direct/ambience signal from a downmix signal and spatial parametric information
CN105409247A (en) * 2013-03-05 2016-03-16 弗劳恩霍夫应用研究促进协会 Apparatus and method for multichannel direct-ambient decomposition for audio signal processing
CN109036455A (en) * 2018-09-17 2018-12-18 中科上声(苏州)电子有限公司 Direct sound wave and background sound extracting method, speaker system and its sound playback method

Also Published As

Publication number Publication date
CN109036455B (en) 2020-11-06
CN109036455A (en) 2018-12-18

Similar Documents

Publication Publication Date Title
US10531198B2 (en) Apparatus and method for decomposing an input signal using a downmixer
EP3320692B1 (en) Spatial audio processing apparatus
KR101341523B1 (en) Method to generate multi-channel audio signals from stereo signals
JP5081838B2 (en) Audio encoding and decoding
KR101984115B1 (en) Apparatus and method for multichannel direct-ambient decomposition for audio signal processing
CA2835463C (en) Apparatus and method for generating an output signal employing a decomposer
CN101842834B (en) Device and method for generating a multi-channel signal using voice signal processing
JP6620235B2 (en) Apparatus and method for sound stage expansion
US7567845B1 (en) Ambience generation for stereo signals
JP6284480B2 (en) Audio signal reproducing apparatus, method, program, and recording medium
CN102907120A (en) System and method for sound processing
WO2020057051A1 (en) Multi-channel signal conversion method for vehicle audio system and, vehicle audio system
US10523171B2 (en) Method for dynamic sound equalization
WO2020057050A1 (en) Method for extracting direct sound and background sound, and loudspeaker system and sound reproduction method therefor
JP2020508590A (en) Apparatus and method for downmixing multi-channel audio signals
KR20110041062A (en) Virtual speaker apparatus and method for porocessing virtual speaker
CN109036456B (en) Method for extracting source component environment component for stereo
Kinoshita et al. Blind upmix of stereo music signals using multi-step linear prediction based reverberation extraction
AU2015238777B2 (en) Apparatus and Method for Generating an Output Signal having at least two Output Channels
JP2017163458A (en) Up-mix device and program
AU2012252490A1 (en) Apparatus and method for generating an output signal employing a decomposer

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19862882

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19862882

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 19.11.2021)
