CN101518100A - Dialogue enhancement techniques - Google Patents


Info

Publication number
CN101518100A
CN101518100A · CNA2007800343512A · CN200780034351A
Authority
CN
China
Prior art keywords
signal
audio signal
speech components
gain
power
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2007800343512A
Other languages
Chinese (zh)
Other versions
CN101518100B (en)
Inventor
吴贤午 (Hyen-O Oh)
郑亮源 (Yang-Won Jung)
C·法勒 (C. Faller)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
LG Electronics Inc
Original Assignee
LG Electronics Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by LG Electronics Inc filed Critical LG Electronics Inc
Priority claimed from PCT/EP2007/008028 external-priority patent/WO2008031611A1/en
Publication of CN101518100A publication Critical patent/CN101518100A/en
Application granted granted Critical
Publication of CN101518100B publication Critical patent/CN101518100B/en
Expired - Fee Related
Anticipated expiration

Landscapes

  • Stereophonic System (AREA)

Abstract

A plural-channel audio signal (e.g., a stereo audio) is processed to modify a gain (e.g., a volume or loudness) of a speech component signal (e.g., dialogue spoken by actors in a movie) relative to an ambient component signal (e.g., reflected or reverberated sound) or other component signals. In one aspect, the speech component signal is identified and modified. In one aspect, the speech component signal is identified by assuming that the speech source (e.g., the actor currently speaking) is in the center of a stereo sound image of the plural-channel audio signal and by considering the spectral content of the speech component signal.

Description

Dialogue Enhancement Techniques
Related Applications
This application claims priority to the following co-pending U.S. provisional patent applications:
U.S. Provisional Patent Application No. 60/844,806, entitled "Method of Separately Controlling Dialogue Volume," filed September 14, 2006, Attorney Docket No. 19819-047P01;
U.S. Provisional Patent Application No. 60/884,594, entitled "Separate Dialogue Volume (SDV)," filed January 11, 2007, Attorney Docket No. 19819-120P01; and
U.S. Provisional Patent Application No. 60/943,268, entitled "Enhancing Stereo Audio with Remix Capability and Separate Dialogue," filed June 11, 2007, Attorney Docket No. 19819-160P01.
Each of these provisional patent applications is incorporated herein by reference in its entirety.
Technical Field
The subject matter of this application relates generally to signal processing.
Background of the Invention
Audio enhancement techniques are often used in home entertainment systems, stereos and other consumer electronics devices to enhance bass frequencies or to simulate various listening environments (e.g., a concert hall). Some techniques attempt to make movie dialogue clearer, for example by adding more high frequencies. None of these techniques, however, addresses the problem of enhancing dialogue relative to ambience and other component signals.
Summary of the Invention
A plural-channel audio signal (e.g., stereo audio) is processed to modify the gain (e.g., the volume or loudness) of a speech component signal (e.g., dialogue spoken by actors in a movie) relative to an ambient component signal (e.g., reflected or reverberated sound) or other component signals. In one aspect, the speech component signal is identified and modified. In one aspect, the speech component signal is identified by assuming that the speech source (e.g., the actor currently speaking) is in the center of the stereo sound image of the plural-channel audio signal, and by considering the spectral content of the speech component signal.
Other implementations are disclosed, including implementations directed to methods, systems and computer-readable mediums.
Description of Drawings
Fig. 1 is a block diagram of a mixing model for dialogue enhancement techniques.
Fig. 2 is a diagram illustrating a decomposition of a stereo signal using time-frequency tiles.
Fig. 3A is a diagram of a gain function, computed as a function of the decomposition gain factor, for dialogue in the center of the sound image.
Fig. 3B is a diagram of a gain function, computed as a function of the decomposition gain factor, for dialogue that is not in the center of the sound image.
Fig. 4 is a block diagram of an example dialogue enhancement system.
Fig. 5 is a flow diagram of an example dialogue enhancement process.
Fig. 6 is a block diagram of a digital television system for implementing the features and processes described in reference to Figs. 1-5.
Detailed Description
Dialogue Enhancement Techniques
Fig. 1 is a block diagram of a mixing model 100 for dialogue enhancement techniques. In the model 100, a listener receives audio signals from left and right channels. An audio signal s corresponds to localized sound arriving from a direction determined by a factor a. The independent audio signals n_1 and n_2 correspond to laterally reflected or reverberated sound, often denoted ambient sound or ambience. A stereo signal can be recorded or mixed such that, for a given audio source, the source signal goes coherently into the left and right audio signal channels with specific directional cues (e.g., level difference, time difference), and the laterally reflected/reverberated independent signals n_1 and n_2 go into the channels determining the cues for auditory event width and listener envelopment. The model 100 can be expressed mathematically as a decomposition of a stereo signal with one audio source, capturing the localization of the audio source and the ambience:

$x_1(n) = s(n) + n_1(n)$
$x_2(n) = a\,s(n) + n_2(n)$  [1]
To obtain a decomposition that is effective in non-stationary scenarios with multiple concurrently active audio sources, the decomposition of [1] can be carried out independently in a number of frequency bands and adaptively in time:

$X_1(i,k) = S(i,k) + N_1(i,k)$
$X_2(i,k) = A(i,k)\,S(i,k) + N_2(i,k),$  [2]

where i is the subband index and k is the subband time index.
Fig. 2 is a diagram illustrating the decomposition of a stereo signal using time-frequency tiles. In each time-frequency tile 200 with indices i and k, the signals S, N_1, N_2 and the decomposition gain factor A can be estimated independently. For notational simplicity, the subband and time indices i and k are omitted in the following description.

When a subband decomposition with perceptually motivated subband bandwidths is used, the bandwidth of a subband can be chosen to be equal to one critical band. S, N_1, N_2 and A can be estimated approximately every t milliseconds (e.g., 20 ms) in each subband. For low computational complexity, a short-time Fourier transform (STFT), implemented efficiently with a fast Fourier transform (FFT), can be used. Given the stereo subband signals X_1 and X_2, estimates of S, A, N_1 and N_2 can be determined. The short-time power estimate of X_1 can be written as

$P_{X_1}(i,k) = E\{X_1^2(i,k)\},$  [3]

where E{·} is a short-time averaging operation. The same convention is used for the other signals; that is, P_{X_2}, P_S and P_N = P_{N_1} = P_{N_2} are the corresponding short-time power estimates. The powers of N_1 and N_2 are assumed to be equal; that is, the amount of laterally independent sound is assumed to be the same for the left and right channels.
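The short-time averaging operation E{·} is typically realized with a recursive (single-pole) smoother. Below is a minimal sketch of such an estimator; the function name and the smoothing constant `alpha` are illustrative choices, not taken from the patent:

```python
import numpy as np

def short_time_power(X, alpha=0.1):
    """Recursive short-time power estimate of a subband signal:
    P(k) = (1 - alpha) * P(k-1) + alpha * |X(k)|^2."""
    P = np.empty(len(X), dtype=float)
    acc = 0.0
    for k, x in enumerate(X):
        acc = (1.0 - alpha) * acc + alpha * abs(x) ** 2
        P[k] = acc
    return P
```

The same smoother would be applied to X_2, to the cross product X_1·X_2, and to the component signals to obtain the remaining short-time power estimates.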
Estimating P_S, A and P_N

Given the subband representation of the stereo signal, the powers P_{X_1} and P_{X_2} and the normalized cross-correlation can be determined. The normalized cross-correlation between the left and right channels is

$\Phi(i,k) = \frac{E\{X_1(i,k)\,X_2(i,k)\}}{\sqrt{E\{X_1^2(i,k)\}\,E\{X_2^2(i,k)\}}}.$  [4]

A, P_S and P_N can be computed as functions of the estimated P_{X_1}, P_{X_2} and Φ. Three equations relating the known and unknown variables are:

$P_{X_1} = P_S + P_N$
$P_{X_2} = A^2 P_S + P_N$  [5]
$\Phi = \frac{A\,P_S}{\sqrt{P_{X_1} P_{X_2}}}.$

Equations [5] can be solved for A, P_S and P_N, yielding

$A = \frac{B}{2C}$
$P_S = \frac{2C^2}{B}$  [6]
$P_N = P_{X_1} - \frac{2C^2}{B},$

with

$B = P_{X_2} - P_{X_1} + \sqrt{(P_{X_1} - P_{X_2})^2 + 4 P_{X_1} P_{X_2} \Phi^2}$  [7]
$C = \Phi \sqrt{P_{X_1} P_{X_2}}.$
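Because equations [6] and [7] are closed-form, the model parameters can be recovered directly from the measured quantities. A sketch under the model assumptions above (the function name is illustrative):

```python
import numpy as np

def decompose_powers(P_x1, P_x2, phi):
    """Solve eqs. [5] for A, P_S, P_N via the closed forms [6]-[7]."""
    B = P_x2 - P_x1 + np.sqrt((P_x1 - P_x2) ** 2 + 4 * P_x1 * P_x2 * phi ** 2)
    C = phi * np.sqrt(P_x1 * P_x2)
    A = B / (2 * C)              # decomposition gain factor
    P_s = 2 * C ** 2 / B         # speech (localized source) power
    P_n = P_x1 - 2 * C ** 2 / B  # ambience power per channel
    return A, P_s, P_n
```

Feeding in powers synthesized from known A, P_S and P_N recovers the originals, which is a quick consistency check on the algebra.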
Least-Squares Estimation of S, N_1 and N_2

Next, least-squares estimates of S, N_1 and N_2 are computed as functions of A, P_S and P_N. For each i and k, the signal S can be estimated as

$\hat{S} = w_1 X_1 + w_2 X_2 = w_1(S + N_1) + w_2(A S + N_2),$  [8]

where w_1 and w_2 are real-valued weights. The estimation error is

$E = (1 - w_1 - w_2 A)S - w_1 N_1 - w_2 N_2.$  [9]

The weights w_1 and w_2 are optimal in the least-squares sense when the error E is orthogonal to X_1 and X_2, i.e.,

$E\{E\,X_1\} = 0$
$E\{E\,X_2\} = 0,$  [10]

yielding the two equations

$(1 - w_1 - w_2 A)P_S - w_1 P_N = 0$
$A(1 - w_1 - w_2 A)P_S - w_2 P_N = 0,$  [11]

from which the weights are computed:

$w_1 = \frac{P_S P_N}{(A^2 + 1)P_S P_N + P_N^2}$  [12]
$w_2 = \frac{A P_S P_N}{(A^2 + 1)P_S P_N + P_N^2}.$

The estimate of N_1 can be written as

$\hat{N}_1 = w_3 X_1 + w_4 X_2 = w_3(S + N_1) + w_4(A S + N_2).$  [13]

The estimation error is

$E = (-w_3 - w_4 A)S + (1 - w_3)N_1 - w_4 N_2.$  [14]

Again, the weights are computed such that the estimation error is orthogonal to X_1 and X_2, resulting in

$w_3 = \frac{A^2 P_S P_N + P_N^2}{(A^2 + 1)P_S P_N + P_N^2}$  [15]
$w_4 = \frac{-A P_S P_N}{(A^2 + 1)P_S P_N + P_N^2}.$

The weights for computing the least-squares estimate of N_2,

$\hat{N}_2 = w_5 X_1 + w_6 X_2 = w_5(S + N_1) + w_6(A S + N_2),$  [16]

are

$w_5 = \frac{-A P_S P_N}{(A^2 + 1)P_S P_N + P_N^2}$  [17]
$w_6 = \frac{P_S P_N + P_N^2}{(A^2 + 1)P_S P_N + P_N^2}.$
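All six weights share the denominator (A² + 1)·P_S·P_N + P_N², so they are cheap to compute together. A sketch of equations [12], [15] and [17] (naming assumed):

```python
def ls_weights(A, P_s, P_n):
    """Least-squares weights w1..w6 of eqs. [12], [15] and [17]."""
    d = (A ** 2 + 1) * P_s * P_n + P_n ** 2  # common denominator
    w1 = P_s * P_n / d
    w2 = A * P_s * P_n / d
    w3 = (A ** 2 * P_s * P_n + P_n ** 2) / d
    w4 = -A * P_s * P_n / d
    w5 = -A * P_s * P_n / d
    w6 = (P_s * P_n + P_n ** 2) / d
    return w1, w2, w3, w4, w5, w6
```

The orthogonality conditions [11] can be verified numerically for any parameter choice, and w_4 = w_5 holds by symmetry.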
Post-Scaling of the Estimates

In some implementations, the least-squares estimates can be post-scaled such that the powers of the estimates equal P_S and P_N = P_{N_1} = P_{N_2}. The power of $\hat{S}$ is

$P_{\hat{S}} = (w_1 + A w_2)^2 P_S + (w_1^2 + w_2^2)P_N.$  [18]

Thus, to obtain an estimate of S with power P_S, $\hat{S}$ is post-scaled:

$\hat{S}' = \sqrt{\frac{P_S}{(w_1 + A w_2)^2 P_S + (w_1^2 + w_2^2)P_N}}\;\hat{S}.$  [19]

With similar reasoning, $\hat{N}_1$ and $\hat{N}_2$ are post-scaled:

$\hat{N}_1' = \sqrt{\frac{P_N}{(w_3 + A w_4)^2 P_S + (w_3^2 + w_4^2)P_N}}\;\hat{N}_1$  [20]

$\hat{N}_2' = \sqrt{\frac{P_N}{(w_5 + A w_6)^2 P_S + (w_5^2 + w_6^2)P_N}}\;\hat{N}_2.$
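A sketch of the post-scaling step, applying [19] and [20] per tile; the names are illustrative and the weights are assumed to come from eqs. [12], [15] and [17]:

```python
import math

def post_scale(S_hat, N1_hat, N2_hat, A, P_s, P_n, w):
    """Scale least-squares estimates so their powers match P_S and P_N."""
    w1, w2, w3, w4, w5, w6 = w
    s_scale = math.sqrt(P_s / ((w1 + A * w2) ** 2 * P_s + (w1 ** 2 + w2 ** 2) * P_n))
    n1_scale = math.sqrt(P_n / ((w3 + A * w4) ** 2 * P_s + (w3 ** 2 + w4 ** 2) * P_n))
    n2_scale = math.sqrt(P_n / ((w5 + A * w6) ** 2 * P_s + (w5 ** 2 + w6 ** 2) * P_n))
    return s_scale * S_hat, n1_scale * N1_hat, n2_scale * N2_hat
```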
Stereo Signal Synthesis

Given the signal decomposition described above, a signal similar to the original stereo signal can be obtained by applying [2] to each subband and converting the subbands back to the time domain.

To generate a signal with modified dialogue gain, the subband signals are computed as

$Y_1(i,k) = 10^{\frac{g(i,k)}{20}}\hat{S}(i,k) + \hat{N}_1(i,k)$  [21]
$Y_2(i,k) = 10^{\frac{g(i,k)}{20}}A(i,k)\hat{S}(i,k) + \hat{N}_2(i,k),$

where g(i,k) is a gain factor in dB, computed such that the dialogue gain is modified as desired.

Several observations motivate how to compute g(i,k):

Dialogue is usually in the center of the sound image; that is, the component signals at time k and frequency i that belong to dialogue will have a corresponding decomposition gain factor A(i,k) close to one (0 dB).

A speech signal contains most of its energy below 4 kHz. Above 8 kHz, speech contains virtually no energy.

Speech usually also contains no very low frequencies (e.g., below about 70 Hz).

These observations suggest setting g(i,k) to 0 dB at very low frequencies and above 8 kHz, to modify the stereo signal as little as possible. At other frequencies, g(i,k) is controlled as a function of the desired dialogue gain G_d and A(i,k):

$g(i,k) = f(G_d, A(i,k)).$  [22]

An example of a suitable function f is shown in Fig. 3A. Note that in Fig. 3A the relation between f and A(i,k) is plotted on a logarithmic (dB) scale, although A(i,k) and f may also be defined on a linear scale. A specific example of f is

$10^{\frac{g(i,k)}{20}} = 1 + \left(10^{\frac{G_d}{20}} - 1\right)\cos\!\left(\min\left\{\frac{\pi\left|10\log_{10}A(i,k)\right|}{W}, \frac{\pi}{2}\right\}\right),$  [23]

where W determines the width of the gain region of the function f, as illustrated in Fig. 3A. The constant W relates to the directional sensitivity of the dialogue gain. A value of W = 6 dB, for example, gives good results for most signals, although a different W may be optimal for a particular signal.

Because of imperfect calibration of the broadcasting or receiving equipment (e.g., different gains in the left and right channels), the dialogue may not appear exactly in the center. In this case, the function f can be shifted so that its center corresponds to the dialogue position. An example of a shifted function f is shown in Fig. 3B.
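Under the reading that [23] specifies the linear gain 10^{g/20}, the cosine-shaped gain function can be sketched as follows (the function name and argument order are illustrative; the default W = 6 dB follows the value quoted in the text):

```python
import math

def dialogue_gain_db(A, G_d_db, W_db=6.0):
    """Gain g(i,k) in dB per eq. [23]: full dialogue gain G_d at the center of
    the sound image (A = 1, i.e. 0 dB) decaying to 0 dB outside the width W."""
    arg = min(math.pi * abs(10 * math.log10(A)) / W_db, math.pi / 2)
    lin = 1.0 + (10 ** (G_d_db / 20) - 1.0) * math.cos(arg)
    return 20 * math.log10(lin)
```

A source panned exactly to the center (A = 1) receives the full desired gain, while a strongly panned source (e.g., A = 10) receives none, matching the behavior plotted in Fig. 3A.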
Alternative Implementations and Generalizations

Identifying the dialogue component signal based on the center assumption (or a location assumption in general) and the spectral range of speech is simple and adequate in many cases. However, the dialogue identification can potentially be modified and improved. One possibility is to explore more speech features, such as formants, harmonic structure and transients, to detect dialogue component signals.

As mentioned, differently shaped gain functions (e.g., Figs. 3A and 3B) may be optimal for different audio material. Thus, signal-adaptive gain functions can be used.

Dialogue gain control can also be implemented for home cinema systems with surround sound. An important aspect of dialogue gain control is detecting whether dialogue is present in the center channel. One way of doing this is to check whether the center channel has sufficient signal energy, such that dialogue is likely present in it. If dialogue is in the center channel, gain can be applied to the center channel to control the dialogue volume. If dialogue is not in the center channel (e.g., if the surround system plays back stereo content), then the two-channel dialogue gain control described in reference to Figs. 1-3 can be applied.

In some implementations, the disclosed dialogue enhancement techniques can be realized by attenuating the signals other than the speech component signal. For example, a plural-channel audio signal can include a speech component signal (e.g., a dialogue signal) and other component signals (e.g., reverberation). The other component signals can be modified (e.g., attenuated) based on the location of the speech component signal in the sound image of the plural-channel audio signal, while the speech component signal is left unchanged.
Dialogue Enhancement System

Fig. 4 is a block diagram of an example dialogue enhancement system 400. In some implementations, the system 400 includes an analysis filterbank 402, a power estimator 404, a signal estimator 406, a post-scaling module 408, a signal synthesis module 410 and a synthesis filterbank 412. While the components 402-412 of the system 400 are shown as separate processes, the processes of two or more components can be combined into a single component.

For each time k, the plural-channel signal is decomposed into subband signals i by the analysis filterbank 402. In the example shown, the left and right channels x_1(n), x_2(n) of a stereo signal are decomposed by the analysis filterbank 402 into i subbands X_1(i,k), X_2(i,k). The power estimator 404 generates the power estimates P_{X_1}, P_{X_2}, Φ, A, P_S and P_N described in reference to Figs. 1 and 2. The signal estimator 406 generates the estimated signals $\hat{S}$, $\hat{N}_1$ and $\hat{N}_2$ from the power estimates. The post-scaling module 408 scales the signal estimates to provide $\hat{S}'$, $\hat{N}_1'$ and $\hat{N}_2'$. The signal synthesis module 410 receives the post-scaled signal estimates, the decomposition gain factor A, the constant W and the desired dialogue gain G_d, and synthesizes the left and right subband signal estimates Y_1(i,k) and Y_2(i,k), which are input to the synthesis filterbank 412 to provide left and right time-domain signals y_1(n) and y_2(n) with a dialogue gain modified according to G_d.
Dialogue Enhancement Process

Fig. 5 is a flow diagram of an example dialogue enhancement process 500. In some implementations, the process 500 begins by decomposing a plural-channel audio signal into frequency subband signals (502). The decomposition can be performed by a filterbank using any of various known transforms, including but not limited to: polyphase filterbanks, the quadrature mirror filterbank (QMF), hybrid filterbanks, the discrete Fourier transform (DFT) and the modified discrete cosine transform (MDCT).

Using the subband signals, a first set of powers is estimated for two or more channels of the audio signal (504). A cross-correlation is determined using the first set of powers (506). A decomposition gain factor is estimated using the first set of powers and the cross-correlation (508). The decomposition gain factor provides a location cue for the dialogue source in the sound image. A second set of powers is estimated for a speech component signal and an ambient component signal using the first set of powers and the cross-correlation (510). Speech and ambient component signals are estimated using the second set of powers and the decomposition gain factor (512). The estimated speech and ambient component signals are post-scaled (514). Subband signals with modified dialogue gain are synthesized using the post-scaled estimated speech and ambient component signals and a desired dialogue gain (516). The desired dialogue gain can be set automatically or specified by a user. The synthesized subband signals are converted into a time-domain audio signal with modified dialogue gain using, for example, a synthesis filterbank (518).
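The steps above can be exercised end to end on a single synthetic subband. Everything below (signal length, panning, gains, tolerances) is an illustrative test harness, not part of the patent:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated subband: a "dialogue" source panned slightly right (A = 1.1)
# plus independent ambience of equal power in each channel.
A_true, Ps_true, Pn_true = 1.1, 4.0, 1.0
n = 200_000
S = rng.normal(0.0, np.sqrt(Ps_true), n)
N1 = rng.normal(0.0, np.sqrt(Pn_true), n)
N2 = rng.normal(0.0, np.sqrt(Pn_true), n)
X1, X2 = S + N1, A_true * S + N2

# (504)-(506): channel powers and normalized cross-correlation, eqs. [3]-[4].
Px1, Px2 = np.mean(X1 ** 2), np.mean(X2 ** 2)
phi = np.mean(X1 * X2) / np.sqrt(Px1 * Px2)

# (508)-(510): decomposition gain factor and component powers, eqs. [5]-[7].
B = Px2 - Px1 + np.sqrt((Px1 - Px2) ** 2 + 4 * Px1 * Px2 * phi ** 2)
C = phi * np.sqrt(Px1 * Px2)
A, Ps, Pn = B / (2 * C), 2 * C ** 2 / B, Px1 - 2 * C ** 2 / B

# (512): least-squares component estimates, eqs. [8]-[17].
d = (A ** 2 + 1) * Ps * Pn + Pn ** 2
S_hat = (Ps * Pn / d) * X1 + (A * Ps * Pn / d) * X2
N1_hat = ((A ** 2 * Ps * Pn + Pn ** 2) / d) * X1 - (A * Ps * Pn / d) * X2
N2_hat = -(A * Ps * Pn / d) * X1 + ((Ps * Pn + Pn ** 2) / d) * X2

# (516): re-mix with a +6 dB dialogue gain, eq. [21] (post-scaling omitted).
g_lin = 10 ** (6 / 20)
Y1 = g_lin * S_hat + N1_hat
Y2 = g_lin * A * S_hat + N2_hat
```

With 200,000 samples the estimated A, P_S and P_N land close to the values used to synthesize the mixture, and the speech estimate correlates strongly with the true source.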
Output Normalization for Background Suppression

In some implementations, it is desirable to suppress the audio of the background scene rather than to enhance the dialogue signal. This can be achieved by normalizing the dialogue-enhanced output signal by the dialogue gain. Normalization can be carried out in at least two different ways. In one example, the output signals Y_1(i,k) and Y_2(i,k) can be normalized by a normalization factor g_norm:

$\hat{Y}_1(i,k) = \frac{Y_1(i,k)}{g_{norm}}$  [24]
$\hat{Y}_2(i,k) = \frac{Y_2(i,k)}{g_{norm}}.$

In another example, the dialogue enhancement effect is compensated by normalizing the weights w_1 through w_6 with g_norm. The normalization factor g_norm can take the same value as the modified dialogue gain.

To maximize perceptual quality, g_norm can be modified. Normalization can be carried out in the frequency domain as well as in the time domain. When carried out in the frequency domain, normalization can be applied over the frequency band in which the dialogue gain is applied, for example between 70 Hz and 8 kHz.

Alternatively, a similar result can be achieved by attenuating N_1(i,k) and N_2(i,k) while applying no gain to S(i,k). This notion can be described by the following equations:

$\hat{Y}_1(i,k) = \hat{S}(i,k) + 10^{\frac{g_{atten}(i,k)}{20}}\hat{N}_1(i,k)$  [25]
$\hat{Y}_2(i,k) = A(i,k)\,\hat{S}(i,k) + 10^{\frac{g_{atten}(i,k)}{20}}\hat{N}_2(i,k).$
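A sketch of the attenuation variant [25]; the function and parameter names are illustrative, and the attenuation g_atten is taken as a frequency-independent constant for simplicity:

```python
def suppress_background(S, N1, N2, A, atten_db=-12.0):
    """Leave the speech estimate untouched and attenuate the ambience, eq. [25]."""
    a = 10 ** (atten_db / 20)
    y1 = S + a * N1
    y2 = A * S + a * N2
    return y1, y2
```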
Mono Detection for Separate Dialogue Volume

When the input signals X_1(i,k) and X_2(i,k) are substantially similar, for example when the input is close to a mono signal, almost every part of the input can be regarded as dialogue, and applying the desired dialogue gain would simply increase the overall volume of the signal. To prevent this, it is desirable to observe the characteristics of the input signal when using the separate dialogue volume (SDV) technique.

In [4], the normalized cross-correlation of the stereo signal is computed. The normalized cross-correlation can be used as the metric for mono signal detection. When Φ in [4] exceeds a given threshold, the input signal can be interpreted as a mono signal, and separate dialogue volume can be turned off automatically. Conversely, when Φ is less than a given threshold, the input signal can be interpreted as a stereo signal, and separate dialogue volume can be turned on automatically. The dialogue gain can serve as an algorithmic switch for separate dialogue volume:

$\hat{g}(i,k) = 1$ for $\Phi > Thr_{mono},$  [26]
$\hat{g}(i,k) = g(i,k)$ for $\Phi < Thr_{stereo}.$

In addition, when Φ lies between Thr_mono and Thr_stereo, $\hat{g}(i,k)$ can be expressed as a function of Φ:

$\hat{g}(i,k) = f(\Phi, g(i,k))$ for $Thr_{mono} > \Phi > Thr_{stereo}.$  [27]

One example is to apply a weighting inversely proportional to Φ to g(i,k):

$\hat{g}(i,k) = \frac{-\Phi + Thr_{mono}}{Thr_{mono} - Thr_{stereo}}\;g(i,k)$ for $Thr_{mono} > \Phi > Thr_{stereo}.$  [28]

To prevent abrupt changes in $\hat{g}(i,k)$, time-smoothing techniques can be incorporated.
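The mono/stereo switch of [26]-[28] reduces to a few comparisons per tile. In the sketch below the gains are treated in dB, so that a weight of zero corresponds to unity (0 dB) gain; this is an interpretation, since the text leaves the scale of [28] implicit, and the names and thresholds are illustrative:

```python
def sdv_gain_db(phi, g_db, thr_mono=0.95, thr_stereo=0.8):
    """Switch or crossfade the dialogue gain based on the normalized
    cross-correlation phi, per eqs. [26]-[28]."""
    if phi > thr_mono:        # near-mono input: bypass SDV
        return 0.0
    if phi < thr_stereo:      # clearly stereo: apply the full dialogue gain
        return g_db
    # transition region: weighting inversely proportional to phi, eq. [28]
    return (-phi + thr_mono) / (thr_mono - thr_stereo) * g_db
```

In a real system the returned gain would additionally be smoothed over time, as noted above.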
Digital Television System Example

Fig. 6 is a block diagram of an example digital television system 600 for implementing the features and processes described in reference to Figs. 1-5. Digital television (DTV) is a telecommunication system for broadcasting and receiving moving pictures and sound by means of digital signals. DTV uses digitally modulated data, which is digitally compressed and requires decoding by a specially designed television set, by a standard receiver with a set-top box, or by a PC fitted with a television card. Although the system in Fig. 6 is a DTV system, the disclosed implementations for dialogue enhancement can also be applied to analog TV systems or any other system capable of dialogue enhancement.

In some implementations, the system 600 can include an interface 602, a demodulator 604, a decoder 606, audio/video output 608, a user input interface 610, one or more processors 612, and one or more computer-readable mediums 614 (e.g., RAM, ROM, SDRAM, hard disk, optical disk, flash memory, SAN, etc.). Each of these components is coupled to one or more communication channels 616 (e.g., a bus). In some implementations, the interface 602 includes various circuits for obtaining an audio signal or a combined audio/video signal. In an analog television system, for example, the interface can include antenna electronics, a tuner or mixer, a radio frequency (RF) amplifier, a local oscillator, an intermediate frequency (IF) amplifier, one or more filters, a demodulator, an audio amplifier, etc. Other implementations of the system 600 are possible, including implementations with more or fewer components.

The tuner 602 can be a DTV tuner for receiving a digital television signal that includes video and audio content. The demodulator 604 extracts the video and audio signals from the digital television signal. If the video and audio signals are encoded (e.g., MPEG encoded), the decoder 606 decodes those signals. The A/V output 608 can be any device capable of displaying video and playing audio (e.g., a TV display, a computer monitor, an LCD, speakers, an audio system).
In some implementations, the dialogue volume level can be displayed to the user, for example using an on-screen display (OSD) invoked with a remote control. The dialogue volume level can be relative to the master volume level. One or more graphical objects can be used to display the dialogue volume level, and the dialogue volume level relative to the master volume. For example, a first graphical object (e.g., a bar) can be displayed to indicate the master volume, and a second graphical object (e.g., a line) can be displayed with, or composited on, the first graphical object to indicate the dialogue volume level.

In some implementations, the user input interface 610 can include circuitry (e.g., a wireless or infrared receiver) and/or software for receiving and decoding infrared or wireless signals generated by a remote controller. The remote controller can include a separate dialogue volume control key or button, or a separate dialogue volume control select key or button that changes the state of a master volume control key or button, so that the master volume control can be used to control either the master volume or the separate dialogue volume. In some implementations, the dialogue volume key or the master volume key can change its visual appearance to indicate its function.

An example controller and user interface are described in U.S. Patent Application No. ______, entitled "Controller and User Interface For Dialogue Enhancement Techniques," filed September 14, 2007, Attorney Docket No. 19819-160001, which patent application is incorporated by reference herein in its entirety.

In some implementations, the one or more processors 612 can execute code stored in the computer-readable medium 614 to implement the features and operations 618, 620, 622, 624, 626, 628, 630 and 632, as described in reference to Figs. 1-5.

The computer-readable medium further includes an operating system 618, an analysis/synthesis filterbank 620, a power estimator 622, a signal estimator 624, a post-scaling module 626 and a signal synthesizer 628. The term "computer-readable medium" refers to any medium that participates in providing instructions to a processor 612 for execution, including without limitation non-volatile media (e.g., optical or magnetic disks), volatile media (e.g., memory) and transmission media. Transmission media include, without limitation, coaxial cables, copper wire and fiber optics. Transmission media can also take the form of acoustic, light or radio frequency waves.

The operating system 618 can be multi-user, multiprocessing, multitasking, multithreading, real-time, etc. The operating system 618 performs basic tasks, including but not limited to: recognizing input from the user input interface 610; keeping track of and managing files and directories on the computer-readable medium 614 (e.g., memory or a storage device); controlling peripheral devices; and managing traffic on the one or more communication channels 616.
The described features can be implemented advantageously in one or more computer programs executable on a programmable system including at least one programmable processor coupled to receive data and instructions from, and to transmit data and instructions to, a data storage system, at least one input device, and at least one output device. A computer program is a set of instructions that can be used, directly or indirectly, in a computer to perform a certain activity or bring about a certain result. A computer program can be written in any form of programming language (e.g., Objective-C, Java), including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine or other unit suitable for use in a computing environment.

Suitable processors for the execution of a program of instructions include, by way of example, both general and special purpose microprocessors, and a sole processor or one of multiple processors or cores, of any kind of computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for executing instructions and one or more memories for storing instructions and data. Generally, a computer will also include, or be operatively coupled to communicate with, one or more mass storage devices for storing data files; such devices include magnetic disks, such as internal hard disks and removable disks; magneto-optical disks; and optical disks. Storage devices suitable for tangibly embodying computer program instructions and data include all forms of non-volatile memory, including by way of example semiconductor memory devices, such as EPROM, EEPROM and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, ASICs (application-specific integrated circuits).

To provide for interaction with a user, the features can be implemented on a computer having a display device, such as a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user, and a keyboard and a pointing device, such as a mouse or a trackball, by which the user can provide input to the computer.

The features can be implemented in a computer system that includes a back-end component, such as a data server, or that includes a middleware component, such as an application server or an Internet server, or that includes a front-end component, such as a client computer having a graphical user interface or an Internet browser, or any combination of them. The components of the system can be connected by any form or medium of digital data communication, such as a communication network. Examples of communication networks include, e.g., a LAN, a WAN, and the computers and networks forming the Internet.

The computer system can include clients and servers. A client and server are generally remote from each other and typically interact through a network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

A number of implementations have been described. Nevertheless, it will be understood that various modifications may be made. For example, elements of one or more implementations may be combined, deleted, modified or supplemented to form further implementations. As yet another example, the logic flows depicted in the figures do not require the particular order shown, or sequential order, to achieve desirable results. In addition, other steps may be provided, or steps may be eliminated, from the described flows, and other components may be added to, or removed from, the described systems. Accordingly, other implementations are within the scope of the following claims.

Claims (25)

1. A method comprising:
obtaining a plural-channel audio signal including a speech component signal and other component signals; and
modifying the speech component signal based on a location of the speech component signal in a sound image of the audio signal.
2. The method of claim 1, wherein the modifying further comprises:
modifying the speech component signal based on spectral content of the speech component signal.
3. The method of claim 1 or 2, wherein the modifying further comprises:
determining the location of the speech component signal in the sound image; and
applying a gain factor to the speech component signal.
4. The method of claim 3, wherein the gain factor is a function of the location of the speech component signal and a desired gain for the speech component signal.
5. The method of claim 4, wherein the function is a signal-adaptive gain function having a gain region related to a directional sensitivity of the gain factor.
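Claims 4 and 5 describe a gain that depends jointly on the component's position in the sound image and a desired gain. A minimal sketch of such a signal-adaptive gain function follows; the Gaussian taper, the use of an inter-channel level difference as the position cue, and the `width_db` region are illustrative assumptions, not the patent's exact function.

```python
import numpy as np

def position_gain_db(level_diff_db, desired_gain_db, width_db=6.0):
    """Return the gain (in dB) to apply to a component, given its
    position cue as an inter-channel level difference in dB
    (0 dB = the center of the sound image, where speech is assumed).

    The desired gain is applied in full at the center and tapers off
    outside an assumed directional-sensitivity region of width_db.
    """
    attenuation = np.exp(-0.5 * (level_diff_db / width_db) ** 2)
    return desired_gain_db * attenuation
```

With this shape, a component panned to the center receives the full user-requested boost, while hard-panned components (typically ambience or effects) are left nearly untouched, which is the behavior the gain region in claim 5 suggests.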
6. The method of any preceding claim, wherein the modifying further comprises:
normalizing the plural-channel audio signal with a normalization factor in the time domain or the frequency domain.
7. The method of any preceding claim, further comprising:
determining whether the audio signal is substantially mono; and
automatically modifying the speech component signal if the audio signal is not substantially mono.
8. The method of claim 7, wherein determining whether the audio signal is substantially mono further comprises:
determining a cross-correlation between two or more channels of the audio signal;
comparing the cross-correlation with one or more threshold values; and
determining whether the audio signal is substantially mono based on results of the comparison.
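The mono test of claim 8 can be sketched as a normalized cross-correlation compared against a threshold; the 0.95 value is an assumed threshold, not one specified by the patent.

```python
import numpy as np

def is_substantially_mono(left, right, threshold=0.95):
    """Treat a two-channel signal as substantially mono when the
    normalized cross-correlation of its channels exceeds a threshold:
    near-identical channels correlate close to 1, in which case the
    automatic modification of claim 7 would be skipped."""
    denom = np.sqrt(np.dot(left, left) * np.dot(right, right))
    if denom == 0.0:
        return True  # silent channels: nothing to separate
    return np.dot(left, right) / denom > threshold
```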
9. The method of any preceding claim, wherein the modifying further comprises:
decomposing the audio signal into a number of frequency subband signals;
estimating a first set of powers of two or more channels of the plural-channel audio signal using the subband signals;
determining a cross-correlation using the estimated first set of powers; and
estimating a decomposition gain factor using the estimated first set of powers and the cross-correlation.
10. The method of claim 9, wherein the bandwidth of at least one subband is chosen to be equal to one critical band of the human auditory system.
11. The method of claim 9, comprising:
estimating a second set of powers for the speech component signal and an ambience component signal from the first set of powers and the cross-correlation.
12. The method of claim 11, further comprising:
estimating the speech component signal and the ambience component signal using the second set of powers and the decomposition gain factor.
13. The method of claim 12, wherein the estimated speech and ambience component signals are determined using least-squares estimation.
14. The method of claim 12, wherein the cross-correlation is normalized.
15. The method of claim 13 or 14, wherein the estimated speech component signal and the estimated ambience component signal are post-scaled.
16. The method of any of claims 11 to 15, further comprising:
synthesizing subband signals using the second set of estimated powers and a user-specified gain.
17. The method of claim 16, further comprising:
converting the synthesized subband signals into a time-domain audio signal having a speech component signal modified by the user-specified gain.
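For a single subband, the power estimation, decomposition gain, and least-squares steps of claims 9 to 13 can be sketched as below. The two-channel model (speech appearing in both channels with decomposition gain `a`, plus independent ambience of equal power in each channel) and all variable names are assumptions made to illustrate the claims; a real implementation would run this per subband on filterbank or STFT output and assume the cross-correlation is positive.

```python
import numpy as np

def decompose_subband(x1, x2, eps=1e-12):
    """Estimate speech power Ps, ambience power Pn, and decomposition
    gain a for one subband pair, assuming x1 = s + n1, x2 = a*s + n2
    with independent, equal-power ambience; then return a least-squares
    (Wiener) estimate of the speech component s."""
    # First set of powers and cross-correlation (claim 9)
    p1, p2 = np.mean(x1 * x1), np.mean(x2 * x2)
    r = np.mean(x1 * x2)
    # Decomposition gain factor (claim 9): the model equations
    #   p1 = Ps + Pn,  p2 = a^2*Ps + Pn,  r = a*Ps
    # reduce to a quadratic in a; take the positive root (assumes r > 0).
    q = (p2 - p1) / (2.0 * max(r, eps))
    a = q + np.sqrt(q * q + 1.0)
    # Second set of powers (claim 11)
    ps = r / a
    pn = max(p1 - ps, 0.0)
    # Least-squares weights (claim 13): w = R^-1 c, with R the channel
    # covariance and c = E[[x1, x2] * s] = [Ps, a*Ps]
    cov = np.array([[p1, r], [r, p2]])
    c = np.array([ps, a * ps])
    w = np.linalg.solve(cov + eps * np.eye(2), c)
    s_hat = w[0] * x1 + w[1] * x2
    return s_hat, ps, pn, a
```

Scaling `s_hat` by a user-specified gain before recombining it with the channels, and converting the synthesized subbands back to the time domain, would correspond to the synthesis steps of claims 16 and 17.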
18. A method comprising:
obtaining an audio signal;
obtaining user input specifying a modification of a first component signal of the audio signal; and
modifying the first component signal based on the input and a location cue of the first component signal in a sound image of the audio signal.
19. The method of claim 18, wherein the modifying further comprises:
applying a gain factor to the first component signal.
20. The method of claim 19, wherein the gain factor is a function of the location cue of the first component signal and a desired gain.
21. The method of claim 20, wherein the function has a gain region related to a directional sensitivity of the gain factor.
22. The method of any of claims 18 to 21, wherein the modifying further comprises:
normalizing the audio signal with a normalization factor in the time domain or the frequency domain.
23. The method of any of claims 18 to 22, wherein the modifying further comprises:
decomposing the audio signal into a number of frequency subband signals;
estimating a first set of powers of two or more channels of the audio signal using the subband signals;
determining a cross-correlation using the first set of powers;
estimating a decomposition gain factor using the first set of powers and the cross-correlation;
estimating a second set of powers for the first component signal and a second component signal from the first set of powers and the cross-correlation;
estimating the first and second component signals using the second set of powers and the decomposition gain factor;
synthesizing subband signals using the estimated first and second component signals and the input; and
converting the synthesized subband signals into a time-domain audio signal with a modified first component signal.
24. A system comprising:
an interface configurable for obtaining a plural-channel audio signal including a speech component signal and other component signals; and
a processor coupled to the interface and configurable for modifying the speech component signal based on a location of the speech component signal in a sound image of the audio signal.
25. A method comprising:
obtaining a plural-channel audio signal including a speech component signal and other component signals; and
modifying the other component signals based on a location of the speech component signal in a sound image of the plural-channel audio signal.
CN2007800343512A 2006-09-14 2007-09-14 Dialogue enhancement techniques Expired - Fee Related CN101518100B (en)

Applications Claiming Priority (7)

Application Number Priority Date Filing Date Title
US84480606P 2006-09-14 2006-09-14
US60/844,806 2006-09-14
US88459407P 2007-01-11 2007-01-11
US60/884,594 2007-01-11
US94326807P 2007-06-11 2007-06-11
US60/943,268 2007-06-11
PCT/EP2007/008028 WO2008031611A1 (en) 2006-09-14 2007-09-14 Dialogue enhancement techniques

Publications (2)

Publication Number Publication Date
CN101518100A true CN101518100A (en) 2009-08-26
CN101518100B CN101518100B (en) 2011-12-07

Family

ID=41040630

Family Applications (3)

Application Number Title Priority Date Filing Date
CN2007800343809A Expired - Fee Related CN101518102B (en) 2006-09-14 2007-09-14 Dialogue enhancement techniques
CN2007800343512A Expired - Fee Related CN101518100B (en) 2006-09-14 2007-09-14 Dialogue enhancement techniques
CN2007800343194A Expired - Fee Related CN101518098B (en) 2006-09-14 2007-09-14 Controller and user interface for dialogue enhancement techniques

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN2007800343809A Expired - Fee Related CN101518102B (en) 2006-09-14 2007-09-14 Dialogue enhancement techniques

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN2007800343194A Expired - Fee Related CN101518098B (en) 2006-09-14 2007-09-14 Controller and user interface for dialogue enhancement techniques

Country Status (1)

Country Link
CN (3) CN101518102B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105791722A (en) * 2014-12-22 2016-07-20 深圳Tcl数字技术有限公司 Television sound adjusting method and television
CN106663433A (en) * 2014-07-02 2017-05-10 高通股份有限公司 Reducing correlation between higher order ambisonic (HOA) background channels
CN107659888A (en) * 2017-08-21 2018-02-02 广州酷狗计算机科技有限公司 Identify the method, apparatus and storage medium of pseudostereo audio
US10311880B2 (en) 2012-11-26 2019-06-04 Harman International Industries, Incorporated System for perceived enhancement and restoration of compressed audio signals

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9372251B2 (en) 2009-10-05 2016-06-21 Harman International Industries, Incorporated System for spatial extraction of audio signals
US9185509B2 (en) * 2009-12-23 2015-11-10 Nokia Technologies Oy Apparatus for processing of audio signals
CN104871565B (en) * 2012-12-19 2017-03-08 索尼公司 Apparatus for processing audio and method
CN106303816B (en) * 2015-05-25 2019-12-24 联想(北京)有限公司 Information control method and electronic equipment
CN112218229B (en) * 2016-01-29 2022-04-01 杜比实验室特许公司 System, method and computer readable medium for audio signal processing
CN107342092B (en) * 2017-05-08 2020-09-08 深圳市创锐智汇科技有限公司 Audio mixing system and method for automatically distributing gain
US11895369B2 (en) 2017-08-28 2024-02-06 Dolby Laboratories Licensing Corporation Media-aware navigation metadata
CN116405836B (en) * 2023-06-08 2023-09-08 安徽声讯信息技术有限公司 Microphone tuning method and system based on Internet

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6111755A (en) * 1998-03-10 2000-08-29 Park; Jae-Sung Graphic audio equalizer for personal computer system
KR100561440B1 (en) * 2004-07-24 2006-03-17 삼성전자주식회사 Apparatus and method for compensating audio volume automatically in response to the change of channel
JP2006222686A (en) * 2005-02-09 2006-08-24 Fujitsu Ten Ltd Audio device

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10311880B2 (en) 2012-11-26 2019-06-04 Harman International Industries, Incorporated System for perceived enhancement and restoration of compressed audio signals
CN104823237B (en) * 2012-11-26 2019-06-11 哈曼国际工业有限公司 For repairing system, computer readable storage medium and the method for compressed audio signal
CN106663433A (en) * 2014-07-02 2017-05-10 高通股份有限公司 Reducing correlation between higher order ambisonic (HOA) background channels
CN105791722A (en) * 2014-12-22 2016-07-20 深圳Tcl数字技术有限公司 Television sound adjusting method and television
CN105791722B (en) * 2014-12-22 2018-12-07 深圳Tcl数字技术有限公司 Television sound method of adjustment and television set
CN107659888A (en) * 2017-08-21 2018-02-02 广州酷狗计算机科技有限公司 Identify the method, apparatus and storage medium of pseudostereo audio

Also Published As

Publication number Publication date
CN101518098A (en) 2009-08-26
CN101518098B (en) 2013-10-23
CN101518102A (en) 2009-08-26
CN101518102B (en) 2013-06-19
CN101518100B (en) 2011-12-07

Similar Documents

Publication Publication Date Title
CN101518100B (en) Dialogue enhancement techniques
US8275610B2 (en) Dialogue enhancement techniques
US20200152210A1 (en) Determining the inter-channel time difference of a multi-channel audio signal
US8705769B2 (en) Two-to-three channel upmix for center channel derivation
CN102113315B (en) Method and apparatus for processing audio signal
JP5192545B2 (en) Improved audio with remixing capabilities
RU2408164C1 (en) Methods for improvement of dialogues

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20111207

Termination date: 20180914