CN109640242B - Audio source component and environment component extraction method - Google Patents


Info

Publication number
CN109640242B
Authority
CN
China
Prior art keywords
component
environment
source
frequency point
energy
Prior art date
Legal status
Active
Application number
CN201811507726.9A
Other languages
Chinese (zh)
Other versions
CN109640242A (en)
Inventor
史创
陈璐
方惠
李会勇
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN201811507726.9A
Publication of CN109640242A
Application granted
Publication of CN109640242B
Status: Active

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 1/00 Two-channel systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/02 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Acoustics & Sound (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Stereophonic System (AREA)

Abstract

The invention discloses a method for extracting the source component and the ambient component of an audio signal, belonging to the technical field of audio and video processing. The method first solves for the component values of the ambient component and the source component at each positive frequency point of each frame, under two cases, based on the positive-frequency component values of the left and right channels of the stereo audio signal in the complex frequency domain and on the source panning factor; it then determines the true solution by comparing the energies of the source and ambient components in the two sets of results, and constructs the corresponding negative-frequency component values through conjugate symmetry; finally, each frame's component values are converted from the frequency domain back to the time domain to obtain the ambient-component and source-component signals of the left and right channels of the stereo audio signal being decomposed. The invention can be used for stereo expansion, and the time-domain waveforms of the extracted source and ambient components agree closely with those of the left channel source and ambient components of the original speech.

Description

Audio source component and environment component extraction method
Technical Field
The invention belongs to the technical field of audio processing, and particularly relates to a technology for decomposing a stereo sound scene.
Background
Channel-based audio formats are widely used in daily life and are adopted by most mobile phones, computers and earphones. Audio in this format typically requires a specific sound system for playback. Given today's diverse sound systems, the audio signal must be decomposed and reconstructed to suit different playback setups and obtain better spatial quality. For example, to obtain a better listening experience, two-channel stereo from a mobile phone may be played back over a multi-channel headset. The conventional approach is to process the two-channel signal with audio re-synthesis and virtualization techniques to obtain a multi-channel audio output. According to the literature "upper and lower two-channel stereo-audio for consumer electronics" and the book "Spatial Audio Processing: MPEG Surround and Other Applications", the conventional approach solves the compatibility problem of the playback system, but the spatial quality of the reconstructed sound scene still needs improvement.
An improved approach to this problem treats the sound scene as a linear combination of a source component (primary component) and an ambient component. Denoting the left and right channels of the stereo signal by xL and xR respectively, we have xL = pL + aL and xR = pR + aR, where pL and pR are the source components of the left and right channels, and aL and aR are the corresponding ambient components. Chinese patent application CN101902679A discloses an acoustic processing technique that converts a two-channel input signal into a 5.1-channel surround output: it takes the difference of the left and right channel signals and then filters and delays that difference to obtain the ambient component of the sound scene, but this method estimates the ambient component with a large error. For channel-based audio formats, the following reasonable assumptions can be made: the source components of the left and right channels satisfy a linear relationship, pR = k·pL, where k is defined as the source panning factor; and the ambient components are uncorrelated and of equal magnitude, i.e. aL ⊥ aR and |aL| = |aR|. Based on these assumptions, Michael M. Goodwin and Jean-Marc Jot proposed the Principal Component Analysis (PCA) algorithm, a source/ambience extraction method that estimates the source component and the ambient component of the mixed signal separately. The quality of sound-scene reconstruction can then be improved by rendering the source and ambient components with different methods. PCA, however, has drawbacks: the error of the source component is large, the uncorrelatedness between the ambient components is not satisfied, and loudness distortion is present.
Disclosure of Invention
The invention aims to: in view of the existing problems, a new source environmental component extraction method based on uncorrelated environmental components is provided to further improve the accuracy of source component and environmental component extraction, and simultaneously ensure loudness equalization between channels.
The method for extracting the audio source component and the environmental component comprises the following steps:
Step 1: frame the left and right channel signals of the stereo audio signal to be decomposed, convert each frame to the frequency domain, and extract the positive-frequency component values xL[m,f] and xR[m,f] of the left and right channel signals in each frame, where m denotes the frame index and f the frequency value;
Step 2: in the complex frequency domain of the signal, obtain from the positive-frequency component values of each frame the coordinates (x1, y1) of xL[m,f] and the coordinates (x2, y2) of xR[m,f];
Step 3: solve for the component values of the ambient component and the source component at each positive frequency point of each frame under two cases:
(1) for the case where the left channel ambient component lags the right channel ambient component by 90°:
a1 = (k²·x1 - k·x2 - k·y1 + y2)/(1 + k²)
b1 = (k·x1 - x2 + k²·y1 - k·y2)/(1 + k²)
(a2, b2) = (-b1, a1),  pL = (x1 - a1) + j·(y1 - b1),  pR = k·pL
(2) for the case where the left channel ambient component leads the right channel ambient component by 90°:
a1 = (k²·x1 - k·x2 + k·y1 - y2)/(1 + k²)
b1 = (x2 - k·x1 + k²·y1 - k·y2)/(1 + k²)
(a2, b2) = (b1, -a1),  pL = (x1 - a1) + j·(y1 - b1),  pR = k·pL
where (a1, b1) and (a2, b2) are the coordinates, in the complex frequency domain, of the positive-frequency component values of the left and right channel ambient components respectively, pL and pR denote the positive-frequency component values of the left and right channel source components, and k denotes the source panning factor;
Step 4: determine the true solution at each positive frequency point of each frame: compute the source-component energy and the ambient-component energy for both sets of solutions; if a solution exists whose source-component energy is greater than its ambient-component energy, take that solution as the true solution at the current positive frequency point; otherwise take the solution whose ambient-component energy is greater than its source-component energy;
here the source-component energy is the sum of the energies of the left and right channel source-component values at the current positive frequency point, and the ambient-component energy is the sum of the energies of the left and right channel ambient-component values at that point;
Step 5: based on the true solution at each positive frequency point of each frame, construct the negative-frequency component values of the source and ambient components of the left and right channels through the conjugate-symmetry relation;
Step 6: convert each frame's component values from the frequency domain back to the time domain to obtain the ambient-component and source-component signals of the left and right channels of the stereo audio signal being decomposed.
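The six steps above can be sketched end to end in NumPy. This is an illustrative reading of the method rather than the patented implementation: the panning factor k is assumed known, frames are non-overlapping and unwindowed, and when neither candidate solution has dominant source energy (a case the text leaves open) the sketch simply falls back to the first candidate.

```python
import numpy as np

def _solve_case(x1, y1, x2, y2, k, lag):
    """Closed-form ambience solution per positive-frequency bin.
    lag=True  -> aL lags aR by 90 degrees, (a2, b2) = (-b1, a1)
    lag=False -> aL leads aR by 90 degrees, (a2, b2) = (b1, -a1)"""
    d = 1.0 + k * k
    if lag:
        a1 = (k*k*x1 - k*x2 - k*y1 + y2) / d
        b1 = (k*x1 - x2 + k*k*y1 - k*y2) / d
        return a1 + 1j*b1, -b1 + 1j*a1
    a1 = (k*k*x1 - k*x2 + k*y1 - y2) / d
    b1 = (x2 - k*x1 + k*k*y1 - k*y2) / d
    return a1 + 1j*b1, b1 - 1j*a1

def extract_components(xL, xR, k, frame_len=4096):
    """Steps 1-6: frame, FFT, solve both cases, keep the solution whose
    source energy exceeds its ambient energy, invert per frame."""
    n_frames = len(xL) // frame_len
    out = {n: np.zeros(n_frames * frame_len) for n in ("pL", "pR", "aL", "aR")}
    for m in range(n_frames):
        sl = slice(m * frame_len, (m + 1) * frame_len)
        XL, XR = np.fft.rfft(xL[sl]), np.fft.rfft(xR[sl])
        cand = [_solve_case(XL.real, XL.imag, XR.real, XR.imag, k, lag)
                for lag in (True, False)]
        AL, AR = np.empty_like(XL), np.empty_like(XL)
        for f in range(XL.size):
            pick = None
            for aLf, aRf in ((c[0][f], c[1][f]) for c in cand):
                Ep = (1 + k*k) * abs(XL[f] - aLf)**2   # |pL|^2 + |k*pL|^2
                Ea = abs(aLf)**2 + abs(aRf)**2
                if Ep > Ea:                            # energy criterion, step 4
                    pick = (aLf, aRf)
                    break
            if pick is None:          # neither candidate dominant: assumption,
                pick = (cand[0][0][f], cand[0][1][f])  # fall back to case 1
            AL[f], AR[f] = pick
        PL = XL - AL
        out["aL"][sl] = np.fft.irfft(AL, frame_len)
        out["aR"][sl] = np.fft.irfft(AR, frame_len)
        out["pL"][sl] = np.fft.irfft(PL, frame_len)
        out["pR"][sl] = np.fft.irfft(k * PL, frame_len)
    return out
```

Here `np.fft.rfft` and `np.fft.irfft` operate on the positive-frequency half-spectrum only, so steps 1-2 and 5-6 (conjugate symmetry and inversion) come for free.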
In summary, the adopted technical scheme has the following beneficial effects: the time-domain waveforms of the extracted source and ambient components agree closely with the waveforms of the source and ambient components of the left channel of the original speech; the extracted left- and right-channel ambient components exhibit no amplitude distortion, so loudness balance between them is ensured; and the source and ambient components extracted by the invention can reconstruct the original audio signal with high fidelity.
Drawings
FIG. 1 is a geometric representation of the source environment extraction method of the present invention;
FIG. 2 is a process flow diagram of a source environment extraction method of the present invention;
FIG. 3 is a time domain waveform of an original left channel source component;
FIG. 4 is a time domain waveform of an original left channel ambient component;
FIG. 5 is a time domain waveform of the left channel source component extracted by the new method of the present invention;
FIG. 6 is a time domain waveform of the left channel environmental component extracted by the new method of the present invention;
FIG. 7 is a time domain waveform of the environmental component of the right channel extracted by the new method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the following embodiments and accompanying drawings.
Aiming at the defects of the conventional PCA algorithm, and based on the uncorrelatedness of the ambient components (that is, the condition that the ambient components are orthogonal in the frequency domain), the invention provides a new source/ambient extraction method, termed UAPAE (primary and ambient component estimation based on uncorrelated ambient components), to improve the extraction accuracy of the source and ambient components and to ensure loudness balance between channels.
A stereo signal can be viewed as a linear combination of source and ambient components, where the source components satisfy a linear relationship with factor k and the ambient components are uncorrelated and of equal magnitude. Under these conditions, the ambient and source components can be separated using their geometric relationship. The signal is transformed to the frequency domain by the short-time Fourier transform, and at each time-frequency point:
xL[m,f] = pL[m,f] + aL[m,f]    (1)
xR[m,f] = pR[m,f] + aR[m,f]    (2)
where m is the frame index, f is the frequency value, xL[m,f] and xR[m,f] denote the corresponding left and right channel signals, pL[m,f] and pR[m,f] denote the source components of the left and right channels, and aL[m,f] and aR[m,f] denote the ambient components of the left and right channels.
Signal decomposition is performed at each time-frequency point. Because the Fourier transform of a real signal is conjugate-symmetric, the solution is computed only over the positive frequencies; the negative-frequency components are then constructed through the conjugate-symmetry relation, yielding the full spectrum of the signal, and the inverse Fourier transform produces a time-domain solution that is again a real signal.
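This positive-frequency-only bookkeeping is exactly what NumPy's `rfft`/`irfft` pair automates; a minimal check of the conjugate-symmetry property (NumPy assumed here, not part of the patent):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(8)          # a real time-domain signal
X = np.fft.fft(x)

# conjugate symmetry of a real signal's spectrum: X[N-f] == conj(X[f])
assert np.allclose(X[1:], np.conj(X[1:][::-1]))

# hence only the positive-frequency half need be solved; irfft rebuilds
# the negative-frequency half by conjugate symmetry and returns a real signal
assert np.allclose(np.fft.irfft(np.fft.rfft(x), 8), x)
```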
In the complex frequency domain of the signal, the coordinate relationship shown in fig. 1 can be established from the frequency-domain values: for a given time-frequency point, (x1, y1) denotes the coordinates of the left channel signal xL, (x2, y2) those of the right channel signal xR, (a1, b1) those of the left channel ambient component aL, and (a2, b2) those of the right channel ambient component aR; in fig. 1, Im denotes the imaginary part and Re the real part.
Since aL ⊥ aR and |aL| = |aR|, we have:
a1·a2 + b1·b2 = 0,  a1² + b1² = a2² + b2²    (3)
Thus we obtain:
(a2, b2) = (-b1, a1)    (4)
or
(a2, b2) = (b1, -a1)    (5)
Equations (4) and (5) correspond to the two cases in which aL lags aR by 90° and aL leads aR by 90°, respectively.
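Both relations can be verified numerically: equation (4) is a +90° rotation of the ambience coefficient (aR = j·aL) and equation (5) a -90° rotation (aR = -j·aL), and either rotation preserves magnitude and makes the real inner product a1·a2 + b1·b2 vanish. A quick sketch with arbitrary illustrative bin values:

```python
import numpy as np

aL = np.array([0.3 - 1.2j, 2.0 + 0.5j, -0.7 + 0.7j])  # arbitrary bin values
for aR in (1j * aL, -1j * aL):   # eq (4): aR = j*aL; eq (5): aR = -j*aL
    # equal magnitude: |aL| == |aR|
    assert np.allclose(np.abs(aL), np.abs(aR))
    # orthogonality in the complex plane: a1*a2 + b1*b2 == 0
    assert np.allclose(aL.real * aR.real + aL.imag * aR.imag, 0.0)
```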
Without loss of generality, in the present embodiment, one of them is selected for solution, and the solution in the other case can be obtained by a similar method.
From pR = k·pL we obtain (x2 - a2, y2 - b2) = k·(x1 - a1, y1 - b1); combining this with equation (4), we can solve:
a1 = (k²·x1 - k·x2 - k·y1 + y2)/(1 + k²)    (6)
b1 = (k·x1 - x2 + k²·y1 - k·y2)/(1 + k²)    (7)
When aL leads aR by 90°, we obtain:
a1 = (k²·x1 - k·x2 + k·y1 - y2)/(1 + k²)    (8)
b1 = (x2 - k·x1 + k²·y1 - k·y2)/(1 + k²)    (9)
the frequency spectrum of the positive frequency component of each frame of signal can be obtained through the relation.
The method yields two candidate solutions, and without an additional condition it cannot be determined which is the true one. The invention introduces a selection criterion to choose the appropriate solution: if one of the candidate solutions has source-component energy greater than its ambient-component energy, that solution is selected; otherwise, the solution whose ambient-component energy is greater than its source-component energy is selected.
Examples
Preparation of the stereo audio to be decomposed:
the source component of the left channel uses a recorded mono speech audio signal (the time domain waveform is shown in fig. 3), and the source component of the right channel multiplies the source component of the left channel by a source panning factor k, where k is 2 in this example. The left channel audio signal of the binaural sound is taken as a left channel environment component (the time domain waveform is shown in fig. 4), and the environment component of the right channel is obtained by performing hilbert transform on the left channel environment component.
Then the powers of the source and ambient components are computed, and the left and right channel source components are scaled so that the ratio of total source-component power to total power is 0.8.
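The construction of this test signal can be sketched as follows. The recorded speech and binaural ambience of the embodiment are replaced here by synthetic noise stand-ins (an assumption, not the data used in the patent), and SciPy's `hilbert`, which returns the analytic signal, is assumed; only the structure of the construction follows the text:

```python
import numpy as np
from scipy.signal import hilbert

rng = np.random.default_rng(1)
n = 4096
pL = rng.standard_normal(n)            # stand-in for the mono speech source
aL = 0.5 * rng.standard_normal(n)      # stand-in for the binaural ambience
k = 2.0
pR = k * pL                            # right source: panned by k = 2
# right ambience: 90-degree phase-shifted copy of aL via the Hilbert transform
aR = np.imag(hilbert(aL))
# scale the source so that source power / total power = 0.8
Pp = np.mean(pL**2) + np.mean(pR**2)
Pa = np.mean(aL**2) + np.mean(aR**2)
g = np.sqrt(0.8 / 0.2 * Pa / Pp)
pL, pR = g * pL, g * pR
xL, xR = pL + aL, pR + aR              # stereo mixture to decompose
ratio = (np.mean(pL**2) + np.mean(pR**2)) / (
    np.mean(pL**2) + np.mean(pR**2) + np.mean(aL**2) + np.mean(aR**2))
```

After the scaling step the source-to-total power ratio equals 0.8 by construction, since the gain g is computed from the measured powers.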
The source and ambient components of each channel are then mixed to obtain the left and right output signals, i.e. the stereo audio signal to be processed.
Referring to fig. 2, the specific operation steps of implementing sound scene decomposition on the stereo audio signal to be processed by using the extraction method of the present invention are as follows:
first, frame division processing is performed on left and right output signals of a stereo audio signal, and in this embodiment, each frame after frame division processing includes 4096 sampling points.
Then, 4096-point Fast Fourier Transform (FFT) is performed on each frame of audio signal to obtain the frequency spectrum of the left and right channel output signals.
All frames are traversed, and for all positive frequency points xL[m,f] and xR[m,f] within each frame the two cases are solved according to equations (6)-(7) and equations (8)-(9) respectively, yielding the positive-frequency components of the left and right channel ambient components, aL = a1 + j·b1 and aR = a2 + j·b2, and of the source components, pL = xL - aL and pR = xR - aR, where j denotes the imaginary unit.
The source-component energy and the ambient-component energy of the two candidate solutions are then compared to determine the true solution at each positive frequency point of each frame: if a candidate's source-component energy (summed over the left and right channels) exceeds its ambient-component energy (likewise summed), the true solution at the current positive frequency point is that candidate; otherwise the candidate whose ambient-component energy exceeds its source-component energy is taken.
In this embodiment, the source-component energy and the ambient-component energy at each positive frequency point are computed as Ep = |pL|² + |pR|² and Ea = |aL|² + |aR|².
Then, the negative-frequency component values are constructed through the conjugate-symmetry relation, based on the true solution at each positive frequency point.
Finally, the frequency-domain signals of all frames of the left and right channel source and ambient components are inverse-transformed to time-domain signals and concatenated, and the extracted components can be used for stereo expansion.
The example extracts the left channel source component and the left and right channel ambient components and plots their time-domain waveforms, shown in figs. 5-7. Comparison with the original left channel source component in fig. 3 and the ambient component in fig. 4 shows that the time-domain waveforms of the extracted left channel source and ambient components agree closely with those of the original speech, and that the extracted left and right channel ambient components exhibit no amplitude distortion, so loudness balance between them is ensured. Moreover, played back over headphones, the extracted source and ambient components are almost indistinguishable from the originals, and the original audio signal can be reconstructed with high fidelity. In conclusion, the proposed component extraction method has practical value.
While the invention has been described with reference to specific embodiments, any feature disclosed in this specification may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise; all of the disclosed features, or all of the method or process steps, may be combined in any combination, except mutually exclusive features and/or steps.

Claims (2)

1. An audio source component and environment component extraction method, comprising the steps of:
step 1: framing the left and right channel signals of a stereo audio signal to be subjected to component extraction, converting each frame to the frequency domain, and extracting the positive-frequency component values xL[m,f] and xR[m,f] of the left and right channel signals in each frame, where m denotes the frame index and f the frequency value;
step 2: in the complex frequency domain of the signal, obtaining from the positive-frequency component values of each frame the coordinates (x1, y1) of xL[m,f] and the coordinates (x2, y2) of xR[m,f];
step 3: solving for the component values of the ambient component and the source component at each positive frequency point of each frame under two cases:
(1) for the case where the left channel ambient component lags the right channel ambient component by 90°:
a1 = (k²·x1 - k·x2 - k·y1 + y2)/(1 + k²)
b1 = (k·x1 - x2 + k²·y1 - k·y2)/(1 + k²)
(a2, b2) = (-b1, a1),  pL = (x1 - a1) + j·(y1 - b1),  pR = k·pL
(2) for the case where the left channel ambient component leads the right channel ambient component by 90°:
a1 = (k²·x1 - k·x2 + k·y1 - y2)/(1 + k²)
b1 = (x2 - k·x1 + k²·y1 - k·y2)/(1 + k²)
(a2, b2) = (b1, -a1),  pL = (x1 - a1) + j·(y1 - b1),  pR = k·pL
where (a1, b1) and (a2, b2) are the coordinates, in the complex frequency domain, of the positive-frequency component values of the left and right channel ambient components respectively, pL and pR denote the positive-frequency component values of the left and right channel source components, and k denotes the source panning factor;
step 4: determining the true solution at each positive frequency point of each frame: computing the source-component energy and the ambient-component energy for both sets of solutions; if a solution exists whose source-component energy is greater than its ambient-component energy, taking that solution as the true solution at the current positive frequency point; otherwise taking the solution whose ambient-component energy is greater than its source-component energy;
the source-component energy being the sum of the energies of the left and right channel source-component values at the current positive frequency point, and the ambient-component energy being the sum of the energies of the left and right channel ambient-component values at that point;
step 5: constructing the negative-frequency component values of the source and ambient components of the left and right channels in each frame through the conjugate-symmetry relation, based on the true solution at each positive frequency point of each frame;
step 6: converting each frame's component values from the frequency domain back to the time domain to obtain the ambient-component and source-component signals of the left and right channels of the stereo audio signal to be subjected to component extraction.
2. The method of claim 1, wherein in step 1 each frame is set to contain 4096 sampling points when performing the framing process.
CN201811507726.9A 2018-12-11 2018-12-11 Audio source component and environment component extraction method Active CN109640242B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811507726.9A CN109640242B (en) 2018-12-11 2018-12-11 Audio source component and environment component extraction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811507726.9A CN109640242B (en) 2018-12-11 2018-12-11 Audio source component and environment component extraction method

Publications (2)

Publication Number Publication Date
CN109640242A CN109640242A (en) 2019-04-16
CN109640242B true CN109640242B (en) 2020-05-12

Family

ID=66072455

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811507726.9A Active CN109640242B (en) 2018-12-11 2018-12-11 Audio source component and environment component extraction method

Country Status (1)

Country Link
CN (1) CN109640242B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113518299B (en) * 2021-04-30 2022-06-03 电子科技大学 Improved method, equipment and computer readable storage medium for extracting source component and environment component
CN113449255B (en) * 2021-06-15 2022-11-11 电子科技大学 Improved method and device for estimating phase angle of environmental component under sparse constraint and storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH06113388A (en) * 1992-08-11 1994-04-22 Sutajio Ion:Kk Acoustic signal generator
US5970152A (en) * 1996-04-30 1999-10-19 Srs Labs, Inc. Audio enhancement system for use in a surround sound environment
ES2358786T3 * 2007-06-08 2011-05-13 Dolby Laboratories Licensing Corporation Hybrid derivation of surround sound audio channels by controllably combining ambient sound signal components and matrix-decoded signal components.
EP2191462A4 (en) * 2007-09-06 2010-08-18 Lg Electronics Inc A method and an apparatus of decoding an audio signal
CN104240711B (en) * 2013-06-18 2019-10-11 杜比实验室特许公司 For generating the mthods, systems and devices of adaptive audio content
CN105917674B (en) * 2013-10-30 2019-11-22 华为技术有限公司 For handling the method and mobile device of audio signal
US9602946B2 (en) * 2014-12-19 2017-03-21 Nokia Technologies Oy Method and apparatus for providing virtual audio reproduction
US10042038B1 (en) * 2015-09-01 2018-08-07 Digimarc Corporation Mobile devices and methods employing acoustic vector sensors

Also Published As

Publication number Publication date
CN109640242A (en) 2019-04-16

Similar Documents

Publication Publication Date Title
JP6703525B2 (en) Method and device for enhancing sound source
KR101935183B1 (en) A signal processing apparatus for enhancing a voice component within a multi-channal audio signal
EP2612322B1 (en) Method and device for decoding a multichannel audio signal
US8332229B2 (en) Low complexity MPEG encoding for surround sound recordings
CN106971738A (en) The method and device that compression and decompression high-order ambisonics signal are represented
WO2013090463A1 (en) Audio processing method and audio processing apparatus
EP1779385B1 (en) Method and apparatus for encoding and decoding multi-channel audio signal using virtual source location information
US10798511B1 (en) Processing of audio signals for spatial audio
CN109640242B (en) Audio source component and environment component extraction method
CN110024421A (en) Method and apparatus for self adaptive control decorrelation filters
EP3808106A1 (en) Spatial audio capture, transmission and reproduction
WO2021130405A1 (en) Combining of spatial audio parameters
CN106960672B (en) Bandwidth extension method and device for stereo audio
EP2941770B1 (en) Method for determining a stereo signal
CN113646836A (en) Sound field dependent rendering
WO2021120795A1 (en) Sampling rate processing method, apparatus and system, and storage medium and computer device
TWI762949B (en) Method for loss concealment, method for decoding a dirac encoding audio scene and corresponding computer program, loss concealment apparatus and decoder
CN109036456B (en) Method for extracting source component environment component for stereo
CN112133316A (en) Spatial audio representation and rendering
WO2018234623A1 (en) Spatial audio processing
JP6832095B2 (en) Channel number converter and its program
KR20110127783A (en) Apparatus for separating voice and method for separating voice of single channel using the same
Bae et al. A New Non-uniform Sampling & Quantization by using a Modified Correlation
CN113808608A (en) Single sound channel noise suppression method and device based on time-frequency masking smoothing strategy
Marin-Hurtado et al. Preservation of localization cues in BSS-based noise reduction: Application in binaural hearing aids

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant