CN101518100A - Dialogue enhancement techniques - Google Patents


Info

Publication number
CN101518100A
CN101518100A · CNA2007800343512A · CN200780034351A
Authority
CN
China
Prior art keywords
signal
audio signal
speech components
gain
power
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2007800343512A
Other languages
Chinese (zh)
Other versions
CN101518100B (en)
Inventor
吴贤午 (Hyen-O Oh)
郑亮源 (Yang-Won Jung)
C·法勒 (C. Faller)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
LG Electronics Inc
Original Assignee
LG Electronics Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by LG Electronics Inc filed Critical LG Electronics Inc
Priority claimed from PCT/EP2007/008028 external-priority patent/WO2008031611A1/en
Publication of CN101518100A publication Critical patent/CN101518100A/en
Application granted granted Critical
Publication of CN101518100B publication Critical patent/CN101518100B/en
Expired - Fee Related
Anticipated expiration

Landscapes

  • Stereophonic System (AREA)

Abstract

A plural-channel audio signal (e.g., a stereo audio) is processed to modify a gain (e.g., a volume or loudness) of a speech component signal (e.g., dialogue spoken by actors in a movie) relative to an ambient component signal (e.g., reflected or reverberated sound) or other component signals. In one aspect, the speech component signal is identified and modified. In one aspect, the speech component signal is identified by assuming that the speech source (e.g., the actor currently speaking) is in the center of a stereo sound image of the plural-channel audio signal and by considering the spectral content of the speech component signal.

Description

Dialogue Enhancement Techniques
Related Applications
This application claims priority to the following co-pending U.S. provisional patent applications:
U.S. Provisional Patent Application No. 60/844,806, entitled "Method of Separately Controlling Dialogue Volume," filed September 14, 2006, Attorney Docket No. 19819-047P01;
U.S. Provisional Patent Application No. 60/884,594, entitled "Separate Dialogue Volume (SDV)," filed January 11, 2007, Attorney Docket No. 19819-120P01; and
U.S. Provisional Patent Application No. 60/943,268, entitled "Enhancing Stereo Audio with Remix Capability and Separate Dialogue," filed June 11, 2007, Attorney Docket No. 19819-160P01.
Each of these provisional patent applications is incorporated herein by reference in its entirety.
Technical Field
The subject matter of this application relates generally to signal processing.
Background of the Invention
Audio enhancement techniques are often used in home entertainment systems, stereos and other consumer electronics devices to enhance bass frequencies or to simulate various listening environments (e.g., a concert hall). Some techniques attempt to make movie dialogue clearer, for example by adding more high frequencies. None of these techniques, however, addresses the problem of enhancing dialogue relative to ambience and other component signals.
Summary of the Invention
A plural-channel audio signal (e.g., stereo audio) is processed to modify the gain (e.g., the volume or loudness) of a speech component signal (e.g., dialogue spoken by actors in a movie) relative to an ambient component signal (e.g., reflected or reverberated sound) or other component signals. In one aspect, the speech component signal is identified and modified. In one aspect, the speech component signal is identified by assuming that the speech source (e.g., the actor currently speaking) is in the center of the stereo sound image of the plural-channel audio signal, and by considering the spectral content of the speech component signal.
Other implementations are disclosed, including implementations directed to methods, systems and computer-readable mediums.
Description of Drawings
Fig. 1 is a block diagram of a mixing model for dialogue enhancement techniques.
Fig. 2 is a diagram illustrating a decomposition of a stereo signal using time-frequency tiles.
Fig. 3A is a diagram of a gain function, computed as a function of the decomposition gain factor, for dialogue in the center of the sound image.
Fig. 3B is a diagram of a gain function, computed as a function of the decomposition gain factor, for dialogue that is not in the center of the sound image.
Fig. 4 is a block diagram of an example dialogue enhancement system.
Fig. 5 is a flow diagram of an example dialogue enhancement process.
Fig. 6 is a block diagram of a digital television system for implementing the features and processes described in reference to Figs. 1-5.
Detailed Description
Dialogue Enhancement Techniques
Fig. 1 is a block diagram of a mixing model 100 for dialogue enhancement techniques. In the model 100, a listener receives audio signals from left and right channels. An audio signal s corresponds to localized sound arriving from a direction determined by a factor a. The independent audio signals n_1 and n_2 correspond to laterally reflected or reverberated sound, often denoted ambient sound or ambience. A stereo signal can be recorded or mixed such that, for a given audio source, the source signal goes coherently into the left and right audio signal channels with specific directional cues (e.g., level difference, time difference), and the laterally reflected/reverberated independent signals n_1 and n_2 go into the channels determining the cues for auditory event width and listener envelopment. The model 100 can be expressed mathematically as a decomposition of a stereo signal with one audio source, capturing the localization of the audio source and the ambience:

$x_1(n) = s(n) + n_1(n)$
$x_2(n) = a\,s(n) + n_2(n)$  [1]
To obtain a decomposition that is effective in non-stationary scenarios with multiple concurrently active audio sources, the decomposition of [1] can be carried out independently in a number of frequency bands and adaptively in time:

$X_1(i,k) = S(i,k) + N_1(i,k)$
$X_2(i,k) = A(i,k)\,S(i,k) + N_2(i,k),$  [2]

where i is the subband index and k is the subband time index.
Fig. 2 is a diagram illustrating the decomposition of a stereo signal using time-frequency tiles. In each time-frequency tile 200 with indices i and k, the signals S, N_1, N_2 and the decomposition gain factor A can be estimated independently. For notational simplicity, the subband and time indices i and k are omitted in the following description.

When a subband decomposition with perceptually motivated subband bandwidths is used, the bandwidth of a subband can be chosen to be equal to one critical band. S, N_1, N_2 and A can be estimated approximately every t milliseconds (e.g., 20 ms) in each subband. For low computational complexity, a short-time Fourier transform (STFT), implemented efficiently with a fast Fourier transform (FFT), can be used. Given the stereo subband signals X_1 and X_2, estimates of S, A, N_1 and N_2 can be determined. The short-time power estimate of X_1 can be written as

$P_{X_1}(i,k) = E\{X_1^2(i,k)\},$  [3]

where E{·} is a short-time averaging operation. The same convention is used for the other signals; that is, P_{X_2}, P_S and P_N = P_{N_1} = P_{N_2} are the corresponding short-time power estimates. The powers of N_1 and N_2 are assumed to be equal; that is, the amount of laterally independent sound is assumed to be the same for the left and right channels.
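The short-time averaging operation E{·} is typically realized with a recursive (single-pole) smoother. Below is a minimal sketch of such an estimator; the function name and the smoothing constant `alpha` are illustrative choices, not taken from the patent:

```python
import numpy as np

def short_time_power(X, alpha=0.1):
    """Recursive short-time power estimate of a subband signal:
    P(k) = (1 - alpha) * P(k-1) + alpha * |X(k)|^2."""
    P = np.empty(len(X), dtype=float)
    acc = 0.0
    for k, x in enumerate(X):
        acc = (1.0 - alpha) * acc + alpha * abs(x) ** 2
        P[k] = acc
    return P
```

The same smoother would be applied to X_2, to the cross product X_1·X_2, and to the component signals to obtain the remaining short-time power estimates.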
Estimating P_S, A and P_N

Given the subband representation of the stereo signal, the powers P_{X_1} and P_{X_2} and the normalized cross-correlation can be determined. The normalized cross-correlation between the left and right channels is

$\Phi(i,k) = \frac{E\{X_1(i,k)\,X_2(i,k)\}}{\sqrt{E\{X_1^2(i,k)\}\,E\{X_2^2(i,k)\}}}.$  [4]

A, P_S and P_N can be computed as functions of the estimated P_{X_1}, P_{X_2} and Φ. Three equations relating the known and unknown variables are:

$P_{X_1} = P_S + P_N$
$P_{X_2} = A^2 P_S + P_N$  [5]
$\Phi = \frac{A\,P_S}{\sqrt{P_{X_1} P_{X_2}}}.$

Equations [5] can be solved for A, P_S and P_N, yielding

$A = \frac{B}{2C}$
$P_S = \frac{2C^2}{B}$  [6]
$P_N = P_{X_1} - \frac{2C^2}{B},$

with

$B = P_{X_2} - P_{X_1} + \sqrt{(P_{X_1} - P_{X_2})^2 + 4 P_{X_1} P_{X_2} \Phi^2}$  [7]
$C = \Phi \sqrt{P_{X_1} P_{X_2}}.$
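Because equations [6] and [7] are closed-form, the model parameters can be recovered directly from the measured quantities. A sketch under the model assumptions above (the function name is illustrative):

```python
import numpy as np

def decompose_powers(P_x1, P_x2, phi):
    """Solve eqs. [5] for A, P_S, P_N via the closed forms [6]-[7]."""
    B = P_x2 - P_x1 + np.sqrt((P_x1 - P_x2) ** 2 + 4 * P_x1 * P_x2 * phi ** 2)
    C = phi * np.sqrt(P_x1 * P_x2)
    A = B / (2 * C)              # decomposition gain factor
    P_s = 2 * C ** 2 / B         # speech (localized source) power
    P_n = P_x1 - 2 * C ** 2 / B  # ambience power per channel
    return A, P_s, P_n
```

Feeding in powers synthesized from known A, P_S and P_N recovers the originals, which is a quick consistency check on the algebra.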
Least-Squares Estimation of S, N_1 and N_2

Next, least-squares estimates of S, N_1 and N_2 are computed as functions of A, P_S and P_N. For each i and k, the signal S can be estimated as

$\hat{S} = w_1 X_1 + w_2 X_2 = w_1(S + N_1) + w_2(A S + N_2),$  [8]

where w_1 and w_2 are real-valued weights. The estimation error is

$E = (1 - w_1 - w_2 A)S - w_1 N_1 - w_2 N_2.$  [9]

The weights w_1 and w_2 are optimal in the least-squares sense when the error E is orthogonal to X_1 and X_2, i.e.,

$E\{E\,X_1\} = 0$
$E\{E\,X_2\} = 0,$  [10]

yielding the two equations

$(1 - w_1 - w_2 A)P_S - w_1 P_N = 0$
$A(1 - w_1 - w_2 A)P_S - w_2 P_N = 0,$  [11]

from which the weights are computed:

$w_1 = \frac{P_S P_N}{(A^2 + 1)P_S P_N + P_N^2}$  [12]
$w_2 = \frac{A P_S P_N}{(A^2 + 1)P_S P_N + P_N^2}.$

The estimate of N_1 can be written as

$\hat{N}_1 = w_3 X_1 + w_4 X_2 = w_3(S + N_1) + w_4(A S + N_2).$  [13]

The estimation error is

$E = (-w_3 - w_4 A)S + (1 - w_3)N_1 - w_4 N_2.$  [14]

Again, the weights are computed such that the estimation error is orthogonal to X_1 and X_2, resulting in

$w_3 = \frac{A^2 P_S P_N + P_N^2}{(A^2 + 1)P_S P_N + P_N^2}$  [15]
$w_4 = \frac{-A P_S P_N}{(A^2 + 1)P_S P_N + P_N^2}.$

The weights for computing the least-squares estimate of N_2,

$\hat{N}_2 = w_5 X_1 + w_6 X_2 = w_5(S + N_1) + w_6(A S + N_2),$  [16]

are

$w_5 = \frac{-A P_S P_N}{(A^2 + 1)P_S P_N + P_N^2}$  [17]
$w_6 = \frac{P_S P_N + P_N^2}{(A^2 + 1)P_S P_N + P_N^2}.$
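All six weights share the denominator (A² + 1)·P_S·P_N + P_N², so they are cheap to compute together. A sketch of equations [12], [15] and [17] (naming assumed):

```python
def ls_weights(A, P_s, P_n):
    """Least-squares weights w1..w6 of eqs. [12], [15] and [17]."""
    d = (A ** 2 + 1) * P_s * P_n + P_n ** 2  # common denominator
    w1 = P_s * P_n / d
    w2 = A * P_s * P_n / d
    w3 = (A ** 2 * P_s * P_n + P_n ** 2) / d
    w4 = -A * P_s * P_n / d
    w5 = -A * P_s * P_n / d
    w6 = (P_s * P_n + P_n ** 2) / d
    return w1, w2, w3, w4, w5, w6
```

The orthogonality conditions [11] can be verified numerically for any parameter choice, and w_4 = w_5 holds by symmetry.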
Post-Scaling of the Estimates

In some implementations, the least-squares estimates can be post-scaled such that the powers of the estimates equal P_S and P_N = P_{N_1} = P_{N_2}. The power of $\hat{S}$ is

$P_{\hat{S}} = (w_1 + A w_2)^2 P_S + (w_1^2 + w_2^2)P_N.$  [18]

Thus, to obtain an estimate of S with power P_S, $\hat{S}$ is post-scaled:

$\hat{S}' = \sqrt{\frac{P_S}{(w_1 + A w_2)^2 P_S + (w_1^2 + w_2^2)P_N}}\;\hat{S}.$  [19]

With similar reasoning, $\hat{N}_1$ and $\hat{N}_2$ are post-scaled:

$\hat{N}_1' = \sqrt{\frac{P_N}{(w_3 + A w_4)^2 P_S + (w_3^2 + w_4^2)P_N}}\;\hat{N}_1$  [20]

$\hat{N}_2' = \sqrt{\frac{P_N}{(w_5 + A w_6)^2 P_S + (w_5^2 + w_6^2)P_N}}\;\hat{N}_2.$
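A sketch of the post-scaling step, applying [19] and [20] per tile; the names are illustrative and the weights are assumed to come from eqs. [12], [15] and [17]:

```python
import math

def post_scale(S_hat, N1_hat, N2_hat, A, P_s, P_n, w):
    """Scale least-squares estimates so their powers match P_S and P_N."""
    w1, w2, w3, w4, w5, w6 = w
    s_scale = math.sqrt(P_s / ((w1 + A * w2) ** 2 * P_s + (w1 ** 2 + w2 ** 2) * P_n))
    n1_scale = math.sqrt(P_n / ((w3 + A * w4) ** 2 * P_s + (w3 ** 2 + w4 ** 2) * P_n))
    n2_scale = math.sqrt(P_n / ((w5 + A * w6) ** 2 * P_s + (w5 ** 2 + w6 ** 2) * P_n))
    return s_scale * S_hat, n1_scale * N1_hat, n2_scale * N2_hat
```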
Stereo Signal Synthesis

Given the signal decomposition described above, a signal similar to the original stereo signal can be obtained by applying [2] to each subband and converting the subbands back to the time domain.

To generate a signal with modified dialogue gain, the subband signals are computed as

$Y_1(i,k) = 10^{\frac{g(i,k)}{20}}\hat{S}(i,k) + \hat{N}_1(i,k)$  [21]
$Y_2(i,k) = 10^{\frac{g(i,k)}{20}}A(i,k)\hat{S}(i,k) + \hat{N}_2(i,k),$

where g(i,k) is a gain factor in dB, computed such that the dialogue gain is modified as desired.

Several observations motivate how to compute g(i,k):

Dialogue is usually in the center of the sound image; that is, the component signals at time k and frequency i that belong to dialogue will have a corresponding decomposition gain factor A(i,k) close to one (0 dB).

A speech signal contains most of its energy below 4 kHz. Above 8 kHz, speech contains virtually no energy.

Speech usually also contains no very low frequencies (e.g., below about 70 Hz).

These observations suggest setting g(i,k) to 0 dB at very low frequencies and above 8 kHz, to modify the stereo signal as little as possible. At other frequencies, g(i,k) is controlled as a function of the desired dialogue gain G_d and A(i,k):

$g(i,k) = f(G_d, A(i,k)).$  [22]

An example of a suitable function f is shown in Fig. 3A. Note that in Fig. 3A the relation between f and A(i,k) is plotted on a logarithmic (dB) scale, although A(i,k) and f may also be defined on a linear scale. A specific example of f is

$10^{\frac{g(i,k)}{20}} = 1 + \left(10^{\frac{G_d}{20}} - 1\right)\cos\!\left(\min\left\{\frac{\pi\left|10\log_{10}A(i,k)\right|}{W}, \frac{\pi}{2}\right\}\right),$  [23]

where W determines the width of the gain region of the function f, as illustrated in Fig. 3A. The constant W relates to the directional sensitivity of the dialogue gain. A value of W = 6 dB, for example, gives good results for most signals, although a different W may be optimal for a particular signal.

Because of imperfect calibration of the broadcasting or receiving equipment (e.g., different gains in the left and right channels), the dialogue may not appear exactly in the center. In this case, the function f can be shifted so that its center corresponds to the dialogue position. An example of a shifted function f is shown in Fig. 3B.
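Under the reading that [23] specifies the linear gain 10^{g/20}, the cosine-shaped gain function can be sketched as follows (the function name and argument order are illustrative; the default W = 6 dB follows the value quoted in the text):

```python
import math

def dialogue_gain_db(A, G_d_db, W_db=6.0):
    """Gain g(i,k) in dB per eq. [23]: full dialogue gain G_d at the center of
    the sound image (A = 1, i.e. 0 dB) decaying to 0 dB outside the width W."""
    arg = min(math.pi * abs(10 * math.log10(A)) / W_db, math.pi / 2)
    lin = 1.0 + (10 ** (G_d_db / 20) - 1.0) * math.cos(arg)
    return 20 * math.log10(lin)
```

A source panned exactly to the center (A = 1) receives the full desired gain, while a strongly panned source (e.g., A = 10) receives none, matching the behavior plotted in Fig. 3A.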
Alternative Implementations and Generalizations

Identifying the dialogue component signal based on the center assumption (or a location assumption in general) and the spectral range of speech is simple and adequate in many cases. However, the dialogue identification can potentially be modified and improved. One possibility is to explore more speech features, such as formants, harmonic structure and transients, to detect dialogue component signals.

As mentioned, differently shaped gain functions (e.g., Figs. 3A and 3B) may be optimal for different audio material. Thus, signal-adaptive gain functions can be used.

Dialogue gain control can also be implemented for home cinema systems with surround sound. An important aspect of dialogue gain control is detecting whether dialogue is present in the center channel. One way of doing this is to check whether the center channel has sufficient signal energy, such that dialogue is likely present in it. If dialogue is in the center channel, gain can be applied to the center channel to control the dialogue volume. If dialogue is not in the center channel (e.g., if the surround system plays back stereo content), then the two-channel dialogue gain control described in reference to Figs. 1-3 can be applied.

In some implementations, the disclosed dialogue enhancement techniques can be realized by attenuating the signals other than the speech component signal. For example, a plural-channel audio signal can include a speech component signal (e.g., a dialogue signal) and other component signals (e.g., reverberation). The other component signals can be modified (e.g., attenuated) based on the location of the speech component signal in the sound image of the plural-channel audio signal, while the speech component signal is left unchanged.
Dialogue Enhancement System

Fig. 4 is a block diagram of an example dialogue enhancement system 400. In some implementations, the system 400 includes an analysis filterbank 402, a power estimator 404, a signal estimator 406, a post-scaling module 408, a signal synthesis module 410 and a synthesis filterbank 412. While the components 402-412 of the system 400 are shown as separate processes, the processes of two or more components can be combined into a single component.

For each time k, the plural-channel signal is decomposed into subband signals i by the analysis filterbank 402. In the example shown, the left and right channels x_1(n), x_2(n) of a stereo signal are decomposed by the analysis filterbank 402 into i subbands X_1(i,k), X_2(i,k). The power estimator 404 generates the power estimates P_{X_1}, P_{X_2}, Φ, A, P_S and P_N described in reference to Figs. 1 and 2. The signal estimator 406 generates the estimated signals $\hat{S}$, $\hat{N}_1$ and $\hat{N}_2$ from the power estimates. The post-scaling module 408 scales the signal estimates to provide $\hat{S}'$, $\hat{N}_1'$ and $\hat{N}_2'$. The signal synthesis module 410 receives the post-scaled signal estimates, the decomposition gain factor A, the constant W and the desired dialogue gain G_d, and synthesizes the left and right subband signal estimates Y_1(i,k) and Y_2(i,k), which are input to the synthesis filterbank 412 to provide left and right time-domain signals y_1(n) and y_2(n) with a dialogue gain modified according to G_d.
Dialogue Enhancement Process

Fig. 5 is a flow diagram of an example dialogue enhancement process 500. In some implementations, the process 500 begins by decomposing a plural-channel audio signal into frequency subband signals (502). The decomposition can be performed by a filterbank using any of various known transforms, including but not limited to: polyphase filterbanks, the quadrature mirror filterbank (QMF), hybrid filterbanks, the discrete Fourier transform (DFT) and the modified discrete cosine transform (MDCT).

Using the subband signals, a first set of powers is estimated for two or more channels of the audio signal (504). A cross-correlation is determined using the first set of powers (506). A decomposition gain factor is estimated using the first set of powers and the cross-correlation (508). The decomposition gain factor provides a location cue for the dialogue source in the sound image. A second set of powers is estimated for a speech component signal and an ambient component signal using the first set of powers and the cross-correlation (510). Speech and ambient component signals are estimated using the second set of powers and the decomposition gain factor (512). The estimated speech and ambient component signals are post-scaled (514). Subband signals with modified dialogue gain are synthesized using the post-scaled estimated speech and ambient component signals and a desired dialogue gain (516). The desired dialogue gain can be set automatically or specified by a user. The synthesized subband signals are converted into a time-domain audio signal with modified dialogue gain using, for example, a synthesis filterbank (518).
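The steps above can be exercised end to end on a single synthetic subband. Everything below (signal length, panning, gains, tolerances) is an illustrative test harness, not part of the patent:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated subband: a "dialogue" source panned slightly right (A = 1.1)
# plus independent ambience of equal power in each channel.
A_true, Ps_true, Pn_true = 1.1, 4.0, 1.0
n = 200_000
S = rng.normal(0.0, np.sqrt(Ps_true), n)
N1 = rng.normal(0.0, np.sqrt(Pn_true), n)
N2 = rng.normal(0.0, np.sqrt(Pn_true), n)
X1, X2 = S + N1, A_true * S + N2

# (504)-(506): channel powers and normalized cross-correlation, eqs. [3]-[4].
Px1, Px2 = np.mean(X1 ** 2), np.mean(X2 ** 2)
phi = np.mean(X1 * X2) / np.sqrt(Px1 * Px2)

# (508)-(510): decomposition gain factor and component powers, eqs. [5]-[7].
B = Px2 - Px1 + np.sqrt((Px1 - Px2) ** 2 + 4 * Px1 * Px2 * phi ** 2)
C = phi * np.sqrt(Px1 * Px2)
A, Ps, Pn = B / (2 * C), 2 * C ** 2 / B, Px1 - 2 * C ** 2 / B

# (512): least-squares component estimates, eqs. [8]-[17].
d = (A ** 2 + 1) * Ps * Pn + Pn ** 2
S_hat = (Ps * Pn / d) * X1 + (A * Ps * Pn / d) * X2
N1_hat = ((A ** 2 * Ps * Pn + Pn ** 2) / d) * X1 - (A * Ps * Pn / d) * X2
N2_hat = -(A * Ps * Pn / d) * X1 + ((Ps * Pn + Pn ** 2) / d) * X2

# (516): re-mix with a +6 dB dialogue gain, eq. [21] (post-scaling omitted).
g_lin = 10 ** (6 / 20)
Y1 = g_lin * S_hat + N1_hat
Y2 = g_lin * A * S_hat + N2_hat
```

With 200,000 samples the estimated A, P_S and P_N land close to the values used to synthesize the mixture, and the speech estimate correlates strongly with the true source.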
Output Normalization for Background Suppression

In some implementations, it is desirable to suppress the audio of the background scene rather than to enhance the dialogue signal. This can be achieved by normalizing the dialogue-enhanced output signal by the dialogue gain. Normalization can be carried out in at least two different ways. In one example, the output signals Y_1(i,k) and Y_2(i,k) can be normalized by a normalization factor g_norm:

$\hat{Y}_1(i,k) = \frac{Y_1(i,k)}{g_{norm}}$  [24]
$\hat{Y}_2(i,k) = \frac{Y_2(i,k)}{g_{norm}}.$

In another example, the dialogue enhancement effect is compensated by normalizing the weights w_1 through w_6 with g_norm. The normalization factor g_norm can take the same value as the modified dialogue gain.

To maximize perceptual quality, g_norm can be modified. Normalization can be carried out in the frequency domain as well as in the time domain. When carried out in the frequency domain, normalization can be applied over the frequency band in which the dialogue gain is applied, for example between 70 Hz and 8 kHz.

Alternatively, a similar result can be achieved by attenuating N_1(i,k) and N_2(i,k) while applying no gain to S(i,k). This notion can be described by the following equations:

$\hat{Y}_1(i,k) = \hat{S}(i,k) + 10^{\frac{g_{atten}(i,k)}{20}}\hat{N}_1(i,k)$  [25]
$\hat{Y}_2(i,k) = A(i,k)\,\hat{S}(i,k) + 10^{\frac{g_{atten}(i,k)}{20}}\hat{N}_2(i,k).$
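A sketch of the attenuation variant [25]; the function and parameter names are illustrative, and the attenuation g_atten is taken as a frequency-independent constant for simplicity:

```python
def suppress_background(S, N1, N2, A, atten_db=-12.0):
    """Leave the speech estimate untouched and attenuate the ambience, eq. [25]."""
    a = 10 ** (atten_db / 20)
    y1 = S + a * N1
    y2 = A * S + a * N2
    return y1, y2
```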
Mono Detection for Separate Dialogue Volume

When the input signals X_1(i,k) and X_2(i,k) are substantially similar, for example when the input is close to a mono signal, almost every part of the input can be regarded as dialogue, and applying the desired dialogue gain would simply increase the overall volume of the signal. To prevent this, it is desirable to observe the characteristics of the input signal when using the separate dialogue volume (SDV) technique.

In [4], the normalized cross-correlation of the stereo signal is computed. The normalized cross-correlation can be used as the metric for mono signal detection. When Φ in [4] exceeds a given threshold, the input signal can be interpreted as a mono signal, and separate dialogue volume can be turned off automatically. Conversely, when Φ is less than a given threshold, the input signal can be interpreted as a stereo signal, and separate dialogue volume can be turned on automatically. The dialogue gain can serve as an algorithmic switch for separate dialogue volume:

$\hat{g}(i,k) = 1$ for $\Phi > Thr_{mono},$  [26]
$\hat{g}(i,k) = g(i,k)$ for $\Phi < Thr_{stereo}.$

In addition, when Φ lies between Thr_mono and Thr_stereo, $\hat{g}(i,k)$ can be expressed as a function of Φ:

$\hat{g}(i,k) = f(\Phi, g(i,k))$ for $Thr_{mono} > \Phi > Thr_{stereo}.$  [27]

One example is to apply a weighting inversely proportional to Φ to g(i,k):

$\hat{g}(i,k) = \frac{-\Phi + Thr_{mono}}{Thr_{mono} - Thr_{stereo}}\;g(i,k)$ for $Thr_{mono} > \Phi > Thr_{stereo}.$  [28]

To prevent abrupt changes in $\hat{g}(i,k)$, time-smoothing techniques can be incorporated.
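The mono/stereo switch of [26]-[28] reduces to a few comparisons per tile. In the sketch below the gains are treated in dB, so that a weight of zero corresponds to unity (0 dB) gain; this is an interpretation, since the text leaves the scale of [28] implicit, and the names and thresholds are illustrative:

```python
def sdv_gain_db(phi, g_db, thr_mono=0.95, thr_stereo=0.8):
    """Switch or crossfade the dialogue gain based on the normalized
    cross-correlation phi, per eqs. [26]-[28]."""
    if phi > thr_mono:        # near-mono input: bypass SDV
        return 0.0
    if phi < thr_stereo:      # clearly stereo: apply the full dialogue gain
        return g_db
    # transition region: weighting inversely proportional to phi, eq. [28]
    return (-phi + thr_mono) / (thr_mono - thr_stereo) * g_db
```

In a real system the returned gain would additionally be smoothed over time, as noted above.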
Digital Television System Example

Fig. 6 is a block diagram of an example digital television system 600 for implementing the features and processes described in reference to Figs. 1-5. Digital television (DTV) is a telecommunication system for broadcasting and receiving moving pictures and sound by means of digital signals. DTV uses digitally modulated data, which is digitally compressed and requires decoding by a specially designed television set, by a standard receiver with a set-top box, or by a PC fitted with a television card. Although the system in Fig. 6 is a DTV system, the disclosed implementations for dialogue enhancement can also be applied to analog TV systems or any other system capable of dialogue enhancement.

In some implementations, the system 600 can include an interface 602, a demodulator 604, a decoder 606, audio/video output 608, a user input interface 610, one or more processors 612, and one or more computer-readable mediums 614 (e.g., RAM, ROM, SDRAM, hard disk, optical disk, flash memory, SAN, etc.). Each of these components is coupled to one or more communication channels 616 (e.g., a bus). In some implementations, the interface 602 includes various circuits for obtaining an audio signal or a combined audio/video signal. In an analog television system, for example, the interface can include antenna electronics, a tuner or mixer, a radio frequency (RF) amplifier, a local oscillator, an intermediate frequency (IF) amplifier, one or more filters, a demodulator, an audio amplifier, etc. Other implementations of the system 600 are possible, including implementations with more or fewer components.

The tuner 602 can be a DTV tuner for receiving a digital television signal that includes video and audio content. The demodulator 604 extracts the video and audio signals from the digital television signal. If the video and audio signals are encoded (e.g., MPEG encoded), the decoder 606 decodes those signals. The A/V output 608 can be any device capable of displaying video and playing audio (e.g., a TV display, a computer monitor, an LCD, speakers, an audio system).
In some implementations, the dialogue volume level can be displayed to the user, for example using an on-screen display (OSD) invoked with a remote control. The dialogue volume level can be relative to the master volume level. One or more graphical objects can be used to display the dialogue volume level, and the dialogue volume level relative to the master volume. For example, a first graphical object (e.g., a bar) can be displayed to indicate the master volume, and a second graphical object (e.g., a line) can be displayed with, or composited on, the first graphical object to indicate the dialogue volume level.

In some implementations, the user input interface 610 can include circuitry (e.g., a wireless or infrared receiver) and/or software for receiving and decoding infrared or wireless signals generated by a remote controller. The remote controller can include a separate dialogue volume control key or button, or a separate dialogue volume control select key or button that changes the state of a master volume control key or button, so that the master volume control can be used to control either the master volume or the separate dialogue volume. In some implementations, the dialogue volume key or the master volume key can change its visual appearance to indicate its function.

An example controller and user interface are described in U.S. Patent Application No. ______, entitled "Controller and User Interface For Dialogue Enhancement Techniques," filed September 14, 2007, Attorney Docket No. 19819-160001, which patent application is incorporated by reference herein in its entirety.

In some implementations, the one or more processors 612 can execute code stored in the computer-readable medium 614 to implement the features and operations 618, 620, 622, 624, 626, 628, 630 and 632, as described in reference to Figs. 1-5.

The computer-readable medium further includes an operating system 618, an analysis/synthesis filterbank 620, a power estimator 622, a signal estimator 624, a post-scaling module 626 and a signal synthesizer 628. The term "computer-readable medium" refers to any medium that participates in providing instructions to a processor 612 for execution, including without limitation non-volatile media (e.g., optical or magnetic disks), volatile media (e.g., memory) and transmission media. Transmission media include, without limitation, coaxial cables, copper wire and fiber optics. Transmission media can also take the form of acoustic, light or radio frequency waves.

The operating system 618 can be multi-user, multiprocessing, multitasking, multithreading, real-time, etc. The operating system 618 performs basic tasks, including but not limited to: recognizing input from the user input interface 610; keeping track of and managing files and directories on the computer-readable medium 614 (e.g., memory or a storage device); controlling peripheral devices; and managing traffic on the one or more communication channels 616.
The described features can be implemented advantageously in one or more computer programs executable on a programmable system including at least one programmable processor coupled to receive data and instructions from, and to transmit data and instructions to, a data storage system, at least one input device, and at least one output device. A computer program is a set of instructions that can be used, directly or indirectly, in a computer to perform a certain activity or bring about a certain result. A computer program can be written in any form of programming language (e.g., Objective-C, Java), including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine or other unit suitable for use in a computing environment.

Suitable processors for the execution of a program of instructions include, by way of example, both general and special purpose microprocessors, and a sole processor or one of multiple processors or cores, of any kind of computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for executing instructions and one or more memories for storing instructions and data. Generally, a computer will also include, or be operatively coupled to communicate with, one or more mass storage devices for storing data files; such devices include magnetic disks, such as internal hard disks and removable disks; magneto-optical disks; and optical disks. Storage devices suitable for tangibly embodying computer program instructions and data include all forms of non-volatile memory, including by way of example semiconductor memory devices, such as EPROM, EEPROM and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, ASICs (application-specific integrated circuits).

To provide for interaction with a user, the features can be implemented on a computer having a display device, such as a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user, and a keyboard and a pointing device, such as a mouse or a trackball, by which the user can provide input to the computer.

The features can be implemented in a computer system that includes a back-end component, such as a data server, or that includes a middleware component, such as an application server or an Internet server, or that includes a front-end component, such as a client computer having a graphical user interface or an Internet browser, or any combination of them. The components of the system can be connected by any form or medium of digital data communication, such as a communication network. Examples of communication networks include, e.g., a LAN, a WAN, and the computers and networks forming the Internet.

The computer system can include clients and servers. A client and server are generally remote from each other and typically interact through a network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

A number of implementations have been described. Nevertheless, it will be understood that various modifications may be made. For example, elements of one or more implementations may be combined, deleted, modified or supplemented to form further implementations. As yet another example, the logic flows depicted in the figures do not require the particular order shown, or sequential order, to achieve desirable results. In addition, other steps may be provided, or steps may be eliminated, from the described flows, and other components may be added to, or removed from, the described systems. Accordingly, other implementations are within the scope of the following claims.

Claims (25)

1. A method comprising:
obtaining a plural-channel audio signal including a speech component signal and other component signals; and
modifying the speech component signal based on a location of the speech component signal in a sound image of the audio signal.
2. The method of claim 1, wherein the modifying further comprises:
modifying the speech component signal based on spectral content of the speech component signal.
3. The method of claim 1 or 2, wherein the modifying further comprises:
determining the location of the speech component signal in the sound image; and
applying a gain factor to the speech component signal.
4. The method of claim 3, wherein the gain factor is a function of the location of the speech component signal and a desired gain for the speech component signal.
5. The method of claim 4, wherein the function is a signal-adaptive gain function having a gain region related to a directional sensitivity of the gain factor.
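Claims 4 and 5 describe a gain that depends jointly on the component's position in the sound image and a desired gain. A minimal sketch of such a signal-adaptive gain function follows; the Gaussian taper, the use of an inter-channel level difference as the position cue, and the `width_db` region are illustrative assumptions, not the patent's exact function.

```python
import numpy as np

def position_gain_db(level_diff_db, desired_gain_db, width_db=6.0):
    """Return the gain (in dB) to apply to a component, given its
    position cue as an inter-channel level difference in dB
    (0 dB = the center of the sound image, where speech is assumed).

    The desired gain is applied in full at the center and tapers off
    outside an assumed directional-sensitivity region of width_db.
    """
    attenuation = np.exp(-0.5 * (level_diff_db / width_db) ** 2)
    return desired_gain_db * attenuation
```

With this shape, a component panned to the center receives the full user-requested boost, while hard-panned components (typically ambience or effects) are left nearly untouched, which is the behavior the gain region in claim 5 suggests.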
6. The method of any preceding claim, wherein the modifying further comprises:
normalizing the plural-channel audio signal with a normalization factor in the time domain or the frequency domain.
7. The method of any preceding claim, further comprising:
determining whether the audio signal is substantially mono; and
automatically modifying the speech component signal if the audio signal is not substantially mono.
8. The method of claim 7, wherein determining whether the audio signal is substantially mono further comprises:
determining a cross-correlation between two or more channels of the audio signal;
comparing the cross-correlation with one or more threshold values; and
determining whether the audio signal is substantially mono based on results of the comparison.
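The mono test of claim 8 can be sketched as a normalized cross-correlation compared against a threshold; the 0.95 value is an assumed threshold, not one specified by the patent.

```python
import numpy as np

def is_substantially_mono(left, right, threshold=0.95):
    """Treat a two-channel signal as substantially mono when the
    normalized cross-correlation of its channels exceeds a threshold:
    near-identical channels correlate close to 1, in which case the
    automatic modification of claim 7 would be skipped."""
    denom = np.sqrt(np.dot(left, left) * np.dot(right, right))
    if denom == 0.0:
        return True  # silent channels: nothing to separate
    return np.dot(left, right) / denom > threshold
```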
9. The method of any preceding claim, wherein the modifying further comprises:
decomposing the audio signal into a number of frequency subband signals;
estimating a first set of powers of two or more channels of the plural-channel audio signal using the subband signals;
determining a cross-correlation using the estimated first set of powers; and
estimating a decomposition gain factor using the estimated first set of powers and the cross-correlation.
10. The method of claim 9, wherein the bandwidth of at least one subband is chosen to be equal to one critical band of the human auditory system.
11. The method of claim 9, comprising:
estimating a second set of powers for the speech component signal and an ambience component signal from the first set of powers and the cross-correlation.
12. The method of claim 11, further comprising:
estimating the speech component signal and the ambience component signal using the second set of powers and the decomposition gain factor.
13. The method of claim 12, wherein the estimated speech and ambience component signals are determined using least-squares estimation.
14. The method of claim 12, wherein the cross-correlation is normalized.
15. The method of claim 13 or 14, wherein the estimated speech component signal and the estimated ambience component signal are post-scaled.
16. The method of any of claims 11 to 15, further comprising:
synthesizing subband signals using the second set of estimated powers and a user-specified gain.
17. The method of claim 16, further comprising:
converting the synthesized subband signals into a time-domain audio signal having a speech component signal modified by the user-specified gain.
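For a single subband, the power estimation, decomposition gain, and least-squares steps of claims 9 to 13 can be sketched as below. The two-channel model (speech appearing in both channels with decomposition gain `a`, plus independent ambience of equal power in each channel) and all variable names are assumptions made to illustrate the claims; a real implementation would run this per subband on filterbank or STFT output and assume the cross-correlation is positive.

```python
import numpy as np

def decompose_subband(x1, x2, eps=1e-12):
    """Estimate speech power Ps, ambience power Pn, and decomposition
    gain a for one subband pair, assuming x1 = s + n1, x2 = a*s + n2
    with independent, equal-power ambience; then return a least-squares
    (Wiener) estimate of the speech component s."""
    # First set of powers and cross-correlation (claim 9)
    p1, p2 = np.mean(x1 * x1), np.mean(x2 * x2)
    r = np.mean(x1 * x2)
    # Decomposition gain factor (claim 9): the model equations
    #   p1 = Ps + Pn,  p2 = a^2*Ps + Pn,  r = a*Ps
    # reduce to a quadratic in a; take the positive root (assumes r > 0).
    q = (p2 - p1) / (2.0 * max(r, eps))
    a = q + np.sqrt(q * q + 1.0)
    # Second set of powers (claim 11)
    ps = r / a
    pn = max(p1 - ps, 0.0)
    # Least-squares weights (claim 13): w = R^-1 c, with R the channel
    # covariance and c = E[[x1, x2] * s] = [Ps, a*Ps]
    cov = np.array([[p1, r], [r, p2]])
    c = np.array([ps, a * ps])
    w = np.linalg.solve(cov + eps * np.eye(2), c)
    s_hat = w[0] * x1 + w[1] * x2
    return s_hat, ps, pn, a
```

Scaling `s_hat` by a user-specified gain before recombining it with the channels, and converting the synthesized subbands back to the time domain, would correspond to the synthesis steps of claims 16 and 17.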
18. A method comprising:
obtaining an audio signal;
obtaining user input specifying a modification of a first component signal of the audio signal; and
modifying the first component signal based on the input and a location cue of the first component signal in a sound image of the audio signal.
19. The method of claim 18, wherein the modifying further comprises:
applying a gain factor to the first component signal.
20. The method of claim 19, wherein the gain factor is a function of the location cue of the first component signal and a desired gain.
21. The method of claim 20, wherein the function has a gain region related to a directional sensitivity of the gain factor.
22. The method of any of claims 18 to 21, wherein the modifying further comprises:
normalizing the audio signal with a normalization factor in the time domain or the frequency domain.
23. The method of any of claims 18 to 22, wherein the modifying further comprises:
decomposing the audio signal into a number of frequency subband signals;
estimating a first set of powers of two or more channels of the audio signal using the subband signals;
determining a cross-correlation using the first set of powers;
estimating a decomposition gain factor using the first set of powers and the cross-correlation;
estimating a second set of powers for the first component signal and a second component signal from the first set of powers and the cross-correlation;
estimating the first and second component signals using the second set of powers and the decomposition gain factor;
synthesizing subband signals using the estimated first and second component signals and the input; and
converting the synthesized subband signals into a time-domain audio signal with a modified first component signal.
24. A system comprising:
an interface configurable for obtaining a plural-channel audio signal including a speech component signal and other component signals; and
a processor coupled to the interface and configurable for modifying the speech component signal based on a location of the speech component signal in a sound image of the audio signal.
25. A method comprising:
obtaining a plural-channel audio signal including a speech component signal and other component signals; and
modifying the other component signals based on a location of the speech component signal in a sound image of the plural-channel audio signal.
CN2007800343512A 2006-09-14 2007-09-14 Dialogue enhancement techniques Expired - Fee Related CN101518100B (en)

Applications Claiming Priority (7)

Application Number Priority Date Filing Date Title
US84480606P 2006-09-14 2006-09-14
US60/844,806 2006-09-14
US88459407P 2007-01-11 2007-01-11
US60/884,594 2007-01-11
US94326807P 2007-06-11 2007-06-11
US60/943,268 2007-06-11
PCT/EP2007/008028 WO2008031611A1 (en) 2006-09-14 2007-09-14 Dialogue enhancement techniques

Publications (2)

Publication Number Publication Date
CN101518100A true CN101518100A (en) 2009-08-26
CN101518100B CN101518100B (en) 2011-12-07

Family

ID=41040630

Family Applications (3)

Application Number Title Priority Date Filing Date
CN2007800343809A Expired - Fee Related CN101518102B (en) 2006-09-14 2007-09-14 Dialogue enhancement techniques
CN2007800343512A Expired - Fee Related CN101518100B (en) 2006-09-14 2007-09-14 Dialogue enhancement techniques
CN2007800343194A Expired - Fee Related CN101518098B (en) 2006-09-14 2007-09-14 Controller and user interface for dialogue enhancement techniques

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN2007800343809A Expired - Fee Related CN101518102B (en) 2006-09-14 2007-09-14 Dialogue enhancement techniques

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN2007800343194A Expired - Fee Related CN101518098B (en) 2006-09-14 2007-09-14 Controller and user interface for dialogue enhancement techniques

Country Status (1)

Country Link
CN (3) CN101518102B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105791722A (en) * 2014-12-22 2016-07-20 深圳Tcl数字技术有限公司 Television sound adjusting method and television
CN106663433A (en) * 2014-07-02 2017-05-10 高通股份有限公司 Reducing correlation between higher order ambisonic (HOA) background channels
CN107659888A (en) * 2017-08-21 2018-02-02 广州酷狗计算机科技有限公司 Identify the method, apparatus and storage medium of pseudostereo audio
US10311880B2 (en) 2012-11-26 2019-06-04 Harman International Industries, Incorporated System for perceived enhancement and restoration of compressed audio signals

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9372251B2 (en) 2009-10-05 2016-06-21 Harman International Industries, Incorporated System for spatial extraction of audio signals
US9185509B2 (en) * 2009-12-23 2015-11-10 Nokia Technologies Oy Apparatus for processing of audio signals
CN104871565B (en) * 2012-12-19 2017-03-08 索尼公司 Apparatus for processing audio and method
CN106303816B (en) * 2015-05-25 2019-12-24 联想(北京)有限公司 Information control method and electronic equipment
CN112218229B (en) * 2016-01-29 2022-04-01 杜比实验室特许公司 System, method and computer readable medium for audio signal processing
CN107342092B (en) * 2017-05-08 2020-09-08 深圳市创锐智汇科技有限公司 Audio mixing system and method for automatically distributing gain
US11895369B2 (en) 2017-08-28 2024-02-06 Dolby Laboratories Licensing Corporation Media-aware navigation metadata
CN116405836B (en) * 2023-06-08 2023-09-08 安徽声讯信息技术有限公司 Microphone tuning method and system based on Internet

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6111755A (en) * 1998-03-10 2000-08-29 Park; Jae-Sung Graphic audio equalizer for personal computer system
KR100561440B1 (en) * 2004-07-24 2006-03-17 삼성전자주식회사 Apparatus and method for compensating audio volume automatically in response to the change of channel
JP2006222686A (en) * 2005-02-09 2006-08-24 Fujitsu Ten Ltd Audio device

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10311880B2 (en) 2012-11-26 2019-06-04 Harman International Industries, Incorporated System for perceived enhancement and restoration of compressed audio signals
CN104823237B (en) * 2012-11-26 2019-06-11 哈曼国际工业有限公司 For repairing system, computer readable storage medium and the method for compressed audio signal
CN106663433A (en) * 2014-07-02 2017-05-10 高通股份有限公司 Reducing correlation between higher order ambisonic (HOA) background channels
CN105791722A (en) * 2014-12-22 2016-07-20 深圳Tcl数字技术有限公司 Television sound adjusting method and television
CN105791722B (en) * 2014-12-22 2018-12-07 深圳Tcl数字技术有限公司 Television sound method of adjustment and television set
CN107659888A (en) * 2017-08-21 2018-02-02 广州酷狗计算机科技有限公司 Identify the method, apparatus and storage medium of pseudostereo audio

Also Published As

Publication number Publication date
CN101518098A (en) 2009-08-26
CN101518098B (en) 2013-10-23
CN101518102A (en) 2009-08-26
CN101518102B (en) 2013-06-19
CN101518100B (en) 2011-12-07

Similar Documents

Publication Publication Date Title
CN101518100B (en) Dialogue enhancement techniques
US8275610B2 (en) Dialogue enhancement techniques
US20200152210A1 (en) Determining the inter-channel time difference of a multi-channel audio signal
US8705769B2 (en) Two-to-three channel upmix for center channel derivation
CN102113315B (en) Method and apparatus for processing audio signal
JP5192545B2 (en) Improved audio with remixing capabilities
RU2408164C1 (en) Methods for improvement of dialogues

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20111207

Termination date: 20180914