CN1524399A

CN1524399A - Audio channel translation

Info

Publication number: CN1524399A
Application number: CNA028046625A
Authority: CN
Inventors: Mark Franklin Davis (马克·富兰克林·戴维斯)
Original assignee: Dolby Laboratories Licensing Corp
Current assignee: Dolby Laboratories Licensing Corp
Priority date: 2001-02-07
Filing date: 2002-02-07
Publication date: 2004-08-25
Anticipated expiration: 2022-02-07
Also published as: WO2002063925A2; HK1066966A1; CN1275498C; CA2437764A1; JP2004526355A; MXPA03007064A; WO2002063925A3; CA2437764C; DE60225806D1; EP1410686A2; EP1410686B1; DE60225806T2; KR100904985B1; ATE390823T1; AU2002251896B2; KR20030079980A; WO2002063925A8; AU2002251896A2

Abstract

A process for translating M audio input channels representing a soundfield to N audio output channels representing the same soundfield, wherein each channel is a single audio stream representing audio arriving from a direction, M and N are positive whole integers, and M is at least 2, generates one or more sets of output channels, each set having one or more output channels. Each set is associated with two or more spatially adjacent input channels and each output channel in a set is generated by a process that includes determining a measure of the correlation of the two or more input channels and the level interrelationships of the two or more input channels.

Description

Audio channel translation

Technical field

The present invention relates to audio signal processing. More particularly, it relates to translating M audio input channels representing a soundfield into N audio output channels representing the same soundfield, wherein each channel is a single audio stream representing audio arriving from a direction, M and N are positive integers, and M is at least 2.

Background art

Although humans have only two ears, we hear sound as a three-dimensional entity, relying on a number of localization cues such as head-related transfer functions (HRTFs) and head motion. Full-fidelity sound reproduction therefore requires retaining and reproducing the full three-dimensional soundfield, or at least its perceptual cues. Unfortunately, sound recording technology is not oriented toward capturing a three-dimensional soundfield, nor a two-dimensional plane of sound, nor even a one-dimensional line of sound. Current sound recording technology is oriented strictly toward capturing, preserving, and presenting zero-dimensional, discrete channels of audio.

Since Edison's invention of sound recording, most efforts to improve fidelity have focused on overcoming the imperfections of his original analog modulated-groove cylinder/disc media. These imperfections include limited and uneven frequency response, noise, distortion, wow, flutter, speed accuracy, wear, dirt, and copying generation loss. Although there were many piecemeal attempts at isolated improvements, including electronic amplification, magnetic recording, noise reduction, and record players costing more than some cars, the traditional problems of individual channel quality were arguably not finally resolved until the development of digital recording in general and the introduction of the audio Compact Disc (CD) in particular. Since then, apart from some effort to extend the quality of digital recording further to 24 bits/96 kHz sampling, the main efforts in audio reproduction research have focused on reducing the amount of data needed to maintain individual channel quality, mostly by using perceptual coders, and on increasing spatial fidelity. The latter problem is the subject of this document.

Efforts to improve spatial fidelity have proceeded along two fronts: trying to convey the perceptual cues of a full soundfield, and trying to convey an approximation of the actual original soundfield. Examples of systems that take the former approach include binaural recording and two-loudspeaker virtual surround systems. Such systems exhibit a number of unfortunate imperfections, especially in reliably localizing sounds in some directions and in requiring either headphones or a single fixed listening position.

For presenting spatial sound to multiple listeners, whether in a living room or in a commercial venue such as a movie theatre, the only viable approach has been to try to approximate the actual original soundfield. Given the discrete-channel nature of sound recording, it is not surprising that most efforts to date have involved conservative increases in the number of presentation channels. Representative systems include the panned-mono three-loudspeaker film soundtracks of the early 1950s, conventional stereo, the quadraphonic systems of the 1960s, five-channel discrete magnetic soundtracks on 70 mm film, matrix-based Dolby Surround of the 1970s, AC-3 5.1-channel surround sound of the 1990s, and, more recently, Surround EX 6.1-channel surround sound. "Dolby", "Pro Logic" and "Surround EX" are trademarks of Dolby Laboratories Licensing Corporation. To one degree or another, these systems provide improved spatial reproduction compared with monophonic presentation. However, mixing a larger number of channels imposes greater time and cost burdens on content producers, and the resulting perception is typically one of a few scattered, discrete channels rather than a continuous soundfield. Dolby Pro Logic decoding is described in U.S. Patent 4,799,260, which is incorporated herein by reference in its entirety. Details of AC-3 are set forth in "Digital Audio Compression Standard (AC-3)", Document A/52 of the Advanced Television Systems Committee (ATSC), published December 20, 1995 (available on the World Wide Web at www.atsc.org/Standards/A52/a-52.doc). See also the errata sheet of July 22 (available on the World Wide Web at www.dolby.com/tech/ATSC err.pdf).

Overview of the basis of the invention

The basis for recreating an arbitrary distribution in a source-free wave medium is provided by a theorem of Gauss, which states that the wave field within a region is completely determined by the pressure distribution along the boundary of that region. This implies that it is possible in principle to recreate the soundfield of a concert hall within a living room as follows: place the living room, with walls sealed against sound, inside the concert hall, then make the walls acoustically transparent by covering their outside with infinitely many infinitesimal microphones, each connected through suitable amplification to a corresponding loudspeaker just inside the wall. By interposing a suitable recording medium between the microphones and the loudspeakers, a complete, if impractical, system of accurate three-dimensional sound reproduction is realized. The only remaining design task is to make the system practical.

A first step toward practicality can be taken by noting that the signals of interest are band-limited, with an upper limit of about 20 kHz, and by applying the spatial sampling theorem, a variant of the more familiar temporal sampling theorem. The latter states that no information is lost if a continuous band-limited temporal waveform is sampled discretely at a rate of at least twice the highest frequency in the source. The spatial sampling theorem follows from the same considerations and stipulates that the spatial sampling interval must be at least twice as dense as the shortest wavelength in order to avoid loss of information. Since the wavelength of 20 kHz in air is about 3/8 inch, this implies that an accurate three-dimensional sound system can be implemented with an array of microphones and loudspeakers spaced no more than 3/16 inch apart. Extended over all the surfaces of a typical 9 foot by 12 foot room, this works out to about 2.5 million channels, a considerable improvement over an infinite number but still impractical at present. It does, however, establish the basic approach of using an array of discrete channels as spatial samples, from which the soundfield can be recovered by applying appropriate interpolation.
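As a rough check on the arithmetic above, the following sketch counts the spatial samples needed to tile every surface of a room at a given spacing. The function name and the 8-foot ceiling height are assumptions made for illustration (the text gives only the 9 foot by 12 foot floor plan), so the result should be expected to match the quoted figure only in order of magnitude.

```python
def surface_sample_count(width_ft, length_ft, height_ft, spacing_in):
    """Count the grid cells needed to cover all six room surfaces at the given spacing."""
    # Convert room dimensions from feet to inches.
    w, l, h = (12.0 * d for d in (width_ft, length_ft, height_ft))
    # Floor and ceiling, plus the two pairs of opposite walls.
    area_sq_in = 2.0 * (w * l + w * h + l * h)
    # One sample (one microphone/loudspeaker pair) per spacing-by-spacing cell.
    return area_sq_in / (spacing_in ** 2)


# The text gives the shortest wavelength (20 kHz in air) as about 3/8 inch;
# the sampling interval is half of that, i.e. 3/16 inch.
shortest_wavelength_in = 3.0 / 8.0
spacing_in = shortest_wavelength_in / 2.0

# ASSUMPTION: 8 ft ceiling height, which is not stated in the text.
channels = surface_sample_count(9, 12, 8, spacing_in)
print(f"{channels:,.0f} channels")
```

Under these assumptions the count comes to a little under 2.3 million, the same order of magnitude as the figure of about 2.5 million given in the text. Passing a different spacing to the same function shows how quickly the channel count falls as the sampling interval is relaxed.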

Once the soundfield has been characterized, it is possible in principle for a decoder to derive the optimal signal feed for any output loudspeaker. The channels supplied to such a decoder are referred to variously in this document as "basic", "transmitted" and "input" channels, and any output channel whose location does not correspond to the position of one of the basic channels is referred to as an "intermediate" channel. An output channel may also have a location that coincides with the position of a basic input channel.

It is therefore desirable to reduce the number of discrete-channel spatial samples, that is, of basic channels. One possible basis for doing so is the fact that above 1500 Hz the ear no longer follows individual cycles, only the critical-band envelope. This may allow a channel spacing commensurate with 1500 Hz, or about 3 inches. That would reduce the total for a 9 foot by 12 foot room to about 6000 channels, a useful saving of about 2.49 million channels compared with the previous arrangement.

In any case, a further reduction in the number of spatial sampling channels is at least theoretically possible by appealing to psychoacoustic localization limits. For centered sounds, the horizontal limit of resolution is about 1 degree of arc, and the corresponding vertical limit is about 5 degrees. If this density is extended appropriately over a sphere, the result is still several hundred to several thousand channels.

Summary of the invention

According to the present invention, a process translates M audio input channels representing a soundfield into N audio output channels representing the same soundfield, wherein each channel is a single audio stream representing audio arriving from a direction, M and N are positive integers, and M is at least 2. One or more sets of output channels are generated, each set having one or more output channels. Each set is associated with two or more spatially adjacent input channels, and each output channel in a set is generated by a process that includes determining a measure of the correlation of the two or more input channels and the level interrelationships of the two or more input channels.

In one aspect of the invention, multiple sets of output channels are associated with more than two input channels, and the process determines the correlation of the input channels associated with each set of output channels according to a hierarchical order. Each set is ranked according to the number of input channels with which its output channel or channels are associated, the greatest number of input channels receiving the highest rank, and the sets are processed in sequence according to that hierarchical order. According to a further aspect of the invention, the processing also takes into account the results of processing higher-ranked sets.

The playback or decoding aspects of the present invention assume that each of the M audio input channels representing audio arriving from a direction was generated by a passive-matrix, nearest-neighbor, amplitude-panned encoding of each source direction (that is, a source direction is assumed to map primarily to the nearest basic channel or channels), without requiring additional side-chain information (the use of side-chain or auxiliary information is optional). This makes the invention compatible with existing mixing techniques, consoles, and formats. Although such source signals can be generated by explicitly applying a passive encoding matrix, most conventional recording techniques generate such source signals inherently (thereby constituting an "effective encoding matrix"). The playback or decoding aspects of the invention are also largely compatible with naturally recorded source signals, such as recordings made with five real directional microphones, because, allowing for some possible time delay, sounds arriving from intermediate directions tend to map principally to the nearest microphones (in a horizontal array, specifically to the nearest pair of microphones).

A decoder or decoding process according to the present invention may be implemented as a lattice of coupled processing modules or modular functions (hereinafter "decoder modules"). Each decoder module generates one or more output channels (or, alternatively, control signals usable to generate one or more output channels) from the two or more closest spatially adjacent basic channels associated with that decoder module. The output channels represent the relative proportions of the audio signals in the closest spatially adjacent basic channels associated with the particular decoder module. As explained in more detail below, the decoder modules are loosely coupled to one another in the sense that modules share nodes and there is a hierarchy of decoder modules. Modules are ranked in the hierarchy according to the number of basic channels with which they are associated (the module or modules with the greatest number of associated basic channels rank highest). A supervisor function presides over the modules so that common node signals are shared equitably and higher-ranked decoder modules can affect the outputs of lower-ranked modules.
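The ranking rule described above can be sketched as a simple sort. The `DecoderModule` type, the module names, and the example channel labels are hypothetical and chosen only for illustration; the sketch shows the ordering step alone, not the supervisor's sharing of common node signals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderModule:
    name: str      # hypothetical label, for illustration only
    inputs: tuple  # the basic channels associated with this module


def processing_order(modules):
    """Rank modules by number of associated basic channels, highest first.

    Python's sort is stable, so modules with the same number of inputs
    keep the order in which they were listed.
    """
    return sorted(modules, key=lambda m: len(m.inputs), reverse=True)


# ASSUMPTION: three illustrative modules; channel labels are placeholders.
modules = [
    DecoderModule("horizontal_pair", ("1'", "3'")),
    DecoderModule("vertical_pair", ("9'", "23'")),
    DecoderModule("elevated_rear_trio", ("9'", "13'", "23'")),
]

for module in processing_order(modules):
    print(module.name, len(module.inputs))
```

Processing the three-input module first means its results are already available when the two-input modules that share its basic channels are processed, which is what allows a higher-ranked module to influence lower-ranked ones.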

Each decoder module may, in effect, include a matrix so that it generates output signals directly. Alternatively, each decoder module may generate control signals that are used, together with the control signals generated by other decoder modules, to vary the coefficients of a variable matrix, or the scale factors applied to the inputs or outputs of a fixed matrix, in order to generate all of the output signals.

Decoder modules emulate the operation of the human ear in an attempt to provide perceptually transparent reproduction. Each decoder module may be implemented as either a wideband or a multiband structure or function. In the multiband case it may use either a continuous filterbank or a block-structured processor, for example a transform-based processor, applying essentially the same processing in each band.

Although the basic invention relates generally to the spatial translation of M input channels into N output channels, where M and N are positive integers and M is at least 2, another aspect of the invention is that the number of loudspeakers receiving the N output channels can be reduced to a practical number by judicious reliance on virtual imaging, that is, the creation of perceived sound images at positions in space where no loudspeaker is located. The most common use of virtual imaging is the stereo reproduction of an image partway between two loudspeakers, obtained by panning a monophonic signal between the channels. Virtual imaging is not considered a viable technique for group presentation with a sparse number of channels, because it requires the listener to be equidistant, or nearly so, from the two loudspeakers. In movie theatres, for example, the left and right front loudspeakers are too far apart to give most of the audience a useful phantom center image, so, given the importance of the center channel as the source of much of the dialogue, a physical center loudspeaker is used instead.

As the density of loudspeakers is increased, however, a point is reached at which virtual imaging between any pair of loudspeakers becomes viable for most of the audience, at least to the extent that pans are smooth. With enough loudspeakers, the gaps between them are no longer perceived as such. Such an array has the potential to be nearly indistinguishable from the array of more than 2 million channels derived earlier.

In order to test the invention, a horizontal array was deployed with five loudspeakers on each wall (16 in total, since corner loudspeakers are shared), plus a ring of six loudspeakers above the listener at a vertical angle of about 45 degrees, plus a single loudspeaker directly above the listener, for a total of 23, plus a subwoofer (LFE channel), for a total of 24. All channels were fed from a PC (personal computer) set up for 24-channel playback. Although in current parlance this system might be called a 23.1-channel system, for simplicity it is referred to here as a 24-channel system.

Fig. 1 is a top plan view showing schematically an idealized decoding arrangement corresponding to the test arrangement just described. Five wide-range horizontal basic channels are shown as squares 1', 3', 5', 9' and 13' on the outer circle. A vertical channel, which may be derived from the five wide-range basic channels by correlation or generated reverberation, or may be supplied separately, is shown as the dashed square 23' at the center. The 23 wide-range output channels are shown as filled circles numbered 1-23. The 16 output channels on the outer circle lie in a horizontal plane, and the six output channels on the inner circle are 45 degrees above the horizontal plane. Output channel 23 is directly above one or more listeners. Five two-input decoder modules are indicated by arrows 24-28 around the outer circle, each connected between a pair of horizontal basic channels. Five additional two-input vertical decoder modules are indicated by arrows 29-33, connecting the vertical channel to each of the horizontal channels. Output channel 21, the elevated center rear channel, is derived by a three-input decoder module, indicated by the arrows between output channel 21 and basic channels 9, 13 and 23. Each module is thus associated with a respective pair or trio of closest spatially adjacent basic channels. Although the decoder modules shown in Fig. 1 have three, four or five output channels, a decoder module may have any reasonable number of output channels. An output channel may be located between one or more basic channels or at the same position as a basic channel. Thus, in the example of Fig. 1, there is also an output channel at each basic channel position. Each input channel is shared by two or three decoder modules.

As will be discussed, a design goal of the invention is that the playback processor should in principle be able to work with any number of loudspeakers in any arrangement. The 24-channel array is used as an illustrative example, but it is not the only example of the density and arrangement needed to achieve a convincing, continuously perceived soundfield in accordance with the invention.

The desire to use a large, and possibly user-selectable, number of presentation channels raises the question of how many discrete channels, and/or what other information, must be conveyed to the playback processor for it to derive, at least as one option, the 24 channels described above. Obviously, one possible approach is simply to transmit 24 discrete channels. However, apart from the fact that mixing that many separate channels is likely to be burdensome for content producers, and conveying that many channels burdensome for a transmission medium, this approach is not preferred because the 24-channel arrangement is only one of many possible arrangements, and it is desirable to be able to derive more or fewer presentation channels from a common transmitted signal array.

One way to recover output channels is to use formal spatial interpolation, forming each output as a fixed weighted sum of the transmitted channels, provided the density of those channels is great enough to allow this. However, that would require thousands to millions of transmitted channels, analogous to using an FIR filter with hundreds of taps to perform temporal interpolation of a single signal. Reducing the transmitted channels to a practical number requires applying psychoacoustic principles and a more aggressive, dynamic interpolation from far fewer channels, but this still leaves open the question of just how many channels are needed to impart the perception of a complete soundfield.

This question was addressed by an experiment performed by the inventor some years ago and recently replicated by others. The basis of the earlier experiment, at least, was the observation that conventional two-channel binaural recordings can recreate a realistic left/right image spread but suffer from unreliable front/back localization, partly because of imperfections in the HRTFs used and partly because head-motion cues are absent. To sidestep this drawback, a dual-binaural (four-channel) recording was made using two pairs of directional microphones spaced apart by a distance corresponding to the size of a human head. One pair faced forward and the other faced backward. The resulting recording was played over four loudspeakers placed close to the head, to mitigate acoustic cross-coupling effects. This arrangement provided realistic left/right timing and amplitude localization cues from each pair of loudspeakers, while the corresponding discrete positions of the microphones and loudspeakers provided unambiguous front/back information. The result was a very compelling surround-sound presentation that lacked only a proper rendition of height information. The recent experiment by others added a center front channel and two height channels, and yielded a similarly realistic result, possibly even improved by the addition of height information.

Thus, both psychoacoustic considerations and the experimental evidence suggest that the relevant perceptual information can be conveyed in perhaps four to five "binaural-like" horizontal channels plus one or more vertical channels. However, the signal crossfeed characteristic of binaural channel pairs makes them unsuitable for direct presentation to a group over loudspeakers, because there is very little separation at mid and low frequencies. So rather than introducing crossfeed at the encoder (as is done for a binaural pair) only to have to undo it in the decoder, it is simpler and more direct to keep the channels isolated and to mix the output channel signals from the nearest transmitted channels. This not only allows direct presentation without a decoder over the same number of loudspeakers and, if desired, an optional downmix to fewer channels with a passive matrix decoder, but also corresponds essentially to the existing standard 5.1-channel arrangement, at least in the horizontal plane. It is also largely compatible with natural recordings, such as recordings made with five real directional microphones, because, allowing for some possible time delay, sounds arriving from intermediate directions tend to map principally to the nearest microphones (in a horizontal array, specifically to the nearest pair of microphones).

Thus, from a perceptual standpoint, it should be possible for a channel translation decoder to accept a standard 5.1-channel program and present it convincingly over any number of horizontally arrayed loudspeakers, including the 16 horizontal loudspeakers of the 24-channel array described above. With the addition of a vertical channel, as is sometimes proposed for digital cinema systems, it should be possible to feed the entire 24-channel array with individually derived, perceptually valid signals that together produce the perception of a continuous soundfield at most listening positions. Of course, if the fine-grained source channels are available at the encoding site, additional information about them could be used to actively alter the encoding matrix scale factors in order to pre-compensate for decoder limitations, or could simply be included as additional side-chain (auxiliary) information, perhaps similar to the coupling coordinates used in AC-3 (Dolby Digital) multichannel coding. Perceptually, however, such additional information should be unnecessary, and in practice a requirement to include it is undesirable. The intended operation of the channel translation decoder is not limited to 5.1-channel sources, and it may use fewer or more channels, but there is at least some reason to believe that credible performance can be obtained from 5.1-channel sources.

This still leaves unanswered the question of how to extract intermediate output channels from a sparse array of transmitted channels. The solution proposed by one aspect of the invention is again to exploit the notion of virtual imaging, but in a somewhat different manner. It was noted earlier that virtual imaging is not viable for group presentation with a sparse loudspeaker array, because it requires the listener to be roughly equidistant from each loudspeaker. It does work, after a fashion, for a listener who happens to be so seated, giving that listener the perception of phantom intermediate channels for signals that have been amplitude-panned between the nearest real output channels. One aspect of the invention therefore proposes that the channel translation decoder comprise a series of modular interpolating signal processors, each in effect emulating an optimally placed listener and each working in a manner analogous to the human auditory system, to extract from amplitude-panned signals the components that would otherwise form virtual images and feed them to real loudspeakers. The loudspeakers are preferably arrayed densely enough that natural virtual imaging can fill the remaining gaps between them.

In general, each decoder module derives its inputs from the nearest transmitted basic channels, which, for an overhead canopy array of loudspeakers for example, may be three or more basic channels. One way of generating output channels associated with more than two basic channels would be to perform a series of pairwise operations, with the outputs of some pairwise decoder modules feeding the inputs of other modules. This has two drawbacks, however. The first is that cascading decoder modules introduces multiple cascaded time constants, so that some output channels react more quickly than others, causing audible position artifacts. The second is that pairwise correlation can only place intermediate or derived output channels along the straight line between the pair; using three or more basic channels removes this restriction. An extension of ordinary pairwise correlation has therefore been developed to correlate three or more output signals, and this technique is described below.

Horizontal localization in the human ear is based primarily on two localization cues: interaural amplitude differences and interaural time differences. The latter cue is valid only for signal pairs that are closely aligned in time, within about ±600 microseconds. The net effect is that a phantom intermediate image appears only at the position corresponding to a particular left/right amplitude difference, assuming that the signal content common to the two real channels is correlated or nearly so. (Note: two signals can have a cross-correlation value between +1 and -1. Fully correlated signals (correlation = 1) have the same waveform and are fully aligned in time, but may have different amplitudes, corresponding to off-center image positions.) As the correlation of a signal pair falls below 1, the perceived image spreads, until for two uncorrelated signals there is no intermediate image, only separate and distinct left and right images. Negative correlation is usually treated by the ear much like an uncorrelated signal pair, although the two images may appear to spread over a wider range. The correlation is performed on a critical-band basis, and above about 1500 Hz the critical-band signal envelopes are used in place of the signals themselves, to economize on the ear's computational requirements (MIPS).
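The correlation values mentioned in the note above can be illustrated with a zero-lag normalized cross-correlation. This is a minimal wideband sketch; the function name and test signals are assumptions, and it does not attempt the critical-band or envelope processing described in the text.

```python
import math


def normalized_cross_correlation(a, b):
    """Zero-lag cross-correlation of two equal-length sequences, in [-1, +1]."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    energy = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    if energy == 0.0:
        # ASSUMPTION: a silent channel is treated as uncorrelated.
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / energy


# Test signals: five whole cycles of a sine over 480 samples.
n = 480
left = [math.sin(2.0 * math.pi * 5 * k / n) for k in range(n)]

# Same waveform at half amplitude: an off-center but fully correlated pair.
right_panned = [0.5 * s for s in left]
# A cosine at the same frequency: uncorrelated with the sine at zero lag.
right_quadrature = [math.cos(2.0 * math.pi * 5 * k / n) for k in range(n)]
# Polarity-inverted copy: fully negatively correlated.
right_inverted = [-s for s in left]

for label, right in [("panned", right_panned),
                     ("quadrature", right_quadrature),
                     ("inverted", right_inverted)]:
    print(label, round(normalized_cross_correlation(left, right), 6))
```

The panned pair gives a correlation of +1 even though the amplitudes differ, which is the case the text associates with a compact off-center image; the quadrature pair gives approximately 0 and the inverted pair gives -1.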

Vertical localization is somewhat more complicated, relying on HRTF pinna cues and on the dynamic modulation of horizontal cues by head motion, but the final effect is similar to horizontal localization with respect to panned amplitude, cross-correlation, and the corresponding perceived image position and fusion. Vertical spatial resolution is, however, less precise than horizontal resolution, so adequate interpolation performance does not require as dense an array of basic channels.

An advantage of using directional processors that emulate the operation of the human ear is that any imperfections or limitations of the signal processing should be perceptually masked by similar imperfections and limitations of the ear itself, which allows for the possibility that the system will be perceived as nearly indistinguishable from the original, fully continuous presentation.

Although the invention is designed to operate effectively regardless of how many or how few output channels are available (including playback without decoding over as many loudspeakers as there are input channels, and passive downmixing to fewer channels, including mono, stereo, and compatible Lt/Rt surround), it is preferably intended to use a large and somewhat arbitrary, but nonetheless practical, number of presentation channels/loudspeakers, and to use as source material a similar or smaller number of encoded channels, including existing 5.1-channel surround tracks and possible next-generation 11- or 12-channel digital cinema soundtracks.

Implementations of the invention should embody four principles: error containment, dominant containment, constant power, and synchronized smoothing.

Error containment is the notion that, given the likelihood of decoding errors, the decoded position of each source should be, in some reasonable sense, near its true intended direction. This mandates a certain degree of conservatism in the decoding strategy. When more aggressive decoding would be accompanied by possibly greater spatial disparity in the event of errors, it is usually preferable to accept less precise decoding in exchange for assured spatial containment. Even in situations where more precise decoding could confidently be applied, it may be unwise to do so if dynamic signal conditions are likely to force the decoder to ratchet between aggressive and conservative modes, with possibly audible artifacts.

It is a more binding mutation of error containment that advantage keeps, and it requires the single good advantage signal of determining only to move in those the most contiguous output channels by decoded device.It is necessary that this condition is converged for the reflection of the signal of keeping on top, and helps feeling the discreteness of matrix decoder.When signal when being dominant, it is curbed from other output channels, method is or deducts it from relevant baseband signal, perhaps directly makes the matrix coefficient of other output channels be complementary to the matrix coefficient (" anti-advantage coefficient/signal ") that is used to produce advantage signal.

Firm power decoding not only requires total decoding power output to equal input power, and requires each sound channel that is encoded in the basic array that transmits and the I/O power of phasing signal to equate.The illusion minimum that this produces change in gain.

Smoothly mean synchronously system is applied smoothingtime constant with signal correction, and require: if the arbitrary smooth network in decoder module is switched to the fast time constant pattern, other smooth network of all in this module is switched equally.This is the predominant direction that presents slow decline/before leaving for fear of the phasing signal that newly is dominant.
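
The synchronized smoothing principle can be illustrated with a small sketch. It is only an illustration under stated assumptions: the class name `SmootherGroup`, the one-pole leaky-integrator form and the two coefficient values are hypothetical and do not come from the text; the only property taken from the text is that all smoothers of a module change mode together.

```python
class SmootherGroup:
    """Leaky-integrator smoothers of one module that always share one mode.

    slow_alpha and fast_alpha are hypothetical smoothing coefficients.
    """

    def __init__(self, count, slow_alpha=0.01, fast_alpha=0.2):
        self.slow_alpha = slow_alpha
        self.fast_alpha = fast_alpha
        self.fast = False
        self.state = [0.0] * count

    def update(self, inputs, want_fast):
        # want_fast holds one flag per smoother. If any smoother asks for the
        # fast mode, every smoother in the module is switched together.
        self.fast = any(want_fast)
        alpha = self.fast_alpha if self.fast else self.slow_alpha
        self.state = [s + alpha * (x - s) for s, x in zip(self.state, inputs)]
        return self.state


group = SmootherGroup(count=3)
slow_step = list(group.update([1.0, 1.0, 1.0], [False, False, False]))
fast_step = list(group.update([1.0, 1.0, 1.0], [False, True, False]))
```

In the second call only one smoother requests the fast mode, yet all three states advance with the fast coefficient.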

Brief description of the drawings

Fig. 1 is a schematic diagram showing a plan view of an idealized decoder arrangement.

Detailed description

Decoder module

Because the encoding of any source direction is assumed to map mainly onto the nearest neighboring channels, channel-translation decoding is based on a series of semi-autonomous decoder modules. In a general sense, each module recovers output channels, particularly intermediate output channels, from what is usually a proper subset of all the transmitted channels, in a manner similar to the human ear.

In a manner similar to the human ear, the operation of a decoder module is based on a combination of amplitude ratios, which determine the nominal ongoing dominant direction, and cross-correlation, which determines the relative width of the image.

Using control signals derived from the amplitude ratios and the cross-correlation, the processor produces the audio signals of the output channels. Because this is best done on a linear basis, to avoid generating distortion, the decoder forms weighted sums of the basic channels that contain the signal of interest. (As explained below, it may also be desirable to include non-neighboring basic channels in the weighted sums.) This limited but dynamic form of interpolation is more commonly called matrixing. If, in the source, the desired signal is mapped (amplitude panned) to the nearest M basic channels, the problem is one of M:N matrix decoding. In other words, the output channels represent relative proportions of the input channels.

Particularly in the case of two-input decoder modules, this closely resembles the problem addressed by an active 2:N matrix decoder, such as the Dolby Pro Logic matrix decoder, with the pair of decoder module inputs corresponding to the Lt/Rt encoded signals.

Note: the outputs of a 2:N matrix decoder are sometimes called basic channels. In this document, however, "basic" refers to the input channels of the channel-translation decoder.

There is, however, at least one significant difference between the operation of prior-art active 2:N decoders and that of the decoder modules of the present invention. Besides using left/right amplitude to indicate left/right position, which the channel-translation decoder also assumes, the former use inter-channel phase to indicate front/back position, specifically the sum/difference ratio of the Lt/Rt encoded channels.

Such active 2:N decoder arrangements have two problems. The first is that a fully correlated (front) but off-center signal, for example, yields a sum/difference ratio less than infinity, and so incorrectly indicates a position that is not fully in front (and similarly for an off-center rear signal that is fully anti-correlated). The result is a somewhat distorted decoding space. The second is that the position mapping is many-to-one, which introduces inherent decoding errors. For example, in a 4:2:4 matrix system, a pair of uncorrelated Left-in and Right-in signals with no Front-in or Back-in maps to the same net, uncorrelated Lt/Rt pair as does an uncorrelated Front-in/Back-in pair with no Left-in/Right-in, or as do all four uncorrelated inputs together. Faced with an uncorrelated Lt/Rt pair, the decoder has no choice but to "relax the matrix", that is, to distribute sound to all output channels with a passive matrix. It cannot decode to a signal array consisting only of Left-out/Right-out or only of Front-out/Back-out.

The underlying problem is that using inter-channel phase to encode front/back position in an N:2:N matrix system is at odds with the operation of the human ear, which does not use phase to discriminate front/back position. The present invention works best with at least three basic channels that are not collinear, so that front/back position is indicated by the assumed directions of the basic channels rather than by assigning different directions according to their relative phase or polarity. A pair of uncorrelated or anti-correlated channel-translation basic signals is thus decoded unambiguously into separate basic output channel signals, with no intermediate signal and no "rear" direction indicated. (This also avoids the unfortunate "center pile-up" effect of active 2:N decoders, in which uncorrelated Left-in and Right-in signals are reproduced with reduced separation because the decoder feeds the sum and difference of the two signals to the center and surround channels.) In principle, an Lt/Rt signal can of course be spatially expanded by cascading a 2:N decoder (with N = 4 or 5) and an N:M channel-translation system, but any limitations of the 2:N decoder, such as center pile-up, are then carried into the expanded channel outputs. These functions can also be combined in a channel-translation decoder designed to accept a two-channel Lt/Rt signal, which in that case modifies its behavior to interpret negatively correlated signals as having a rear direction, leaving the rest of the processing unchanged. Even then, however, decoding ambiguities remain because only two channels are transmitted.

Each decoder module, especially one with two input channels, is therefore similar to a prior-art active 2:N decoder with front/back detection disabled or modified and with an arbitrary number of output channels. It is of course mathematically impossible to produce a larger number of channels uniquely from a smaller number by matrixing, since that amounts to solving N linear equations in M unknowns with M greater than N (more unknown output signals than transmitted channels). It is therefore to be expected that a decoder module will at times exhibit less than perfect channel recovery when several active source direction signals are present. The human auditory system, limited to two ears, is subject to the same limitation, however, which allows the system to be perceived as discrete even with all channels operating. The quality of an isolated channel, with the other channels muted, remains a consideration, to accommodate a listener seated near one loudspeaker.

The operation of the human ear is of course frequency dependent, but most sonic images are correlated at all frequencies, and, given the successful experience with Pro Logic decoders operating as broadband systems, a broadband channel-translation system can be expected to perform satisfactorily in some applications. A multiband channel-translation decoder should also be possible, applying similar processing on a band-by-band basis and using the same encoded signal in each case, so that the number and bandwidth of the individual bands can be left to the decoder implementer as a free parameter. Although multiband processing is likely to require higher MIPS than broadband processing, the computational demand may not be excessive if the input signal is divided into data blocks and the processing is carried out block by block.

Before the algorithms that may be used by the decoder modules of the present invention are described, the question of shared nodes is considered.

Shared nodes

If the groups of basic channels used by the decoder modules were all independent, the decoder modules themselves could be independent, autonomous entities. This is generally not the case, however. A given transmitted channel is usually shared by separate output signals with two or more adjacent basic channels. If independent decoder modules are used to decode such an array, each is influenced by the output signals of adjacent channels, which can cause serious errors. In effect, the output signals of two adjacent decoder modules are "pulled" toward each other, because the shared basic node contains both signals and its level is therefore increased. If the signals are dynamic, as is often the case, the amount of interaction is dynamic too, leading to signal-dependent dynamic positioning errors that can be highly objectionable. This problem does not arise in Pro Logic and other active 2:N decoders, because they have only a single, isolated channel pair as decoder input.

It is therefore necessary to compensate for the "shared node" effect. One possible approach is to subtract an already recovered signal from the common node before attempting to recover the output signal of an adjacent decoder module that shares that node. This is often not possible, so the following approach is used instead: each decoder module estimates the common output signal energy present in its input channels, and a supervisor routine informs each module of the output signal energy estimates of its neighboring modules.

Pairwise calculation of common energy

For example, suppose the basic channel pair A/B contains a common signal X together with individual, uncorrelated signals Y and Z:

A = 0.707X + Y

B = 0.707X + Z

where the scale factor

0.707 = \sqrt{0.5}

provides a power-preserving mapping onto the nearest neighboring basic channels.

The RMS energy of A is

\int A^{2} \, dt = \overline{A^{2}} = \overline{(0.707 X + Y)^{2}} = \overline{0.5 X^{2} + 1.414 XY + Y^{2}}

= 0.5 \overline{X^{2}} + 1.414 \overline{XY} + \overline{Y^{2}}

Because X and Y are uncorrelated, \overline{XY} = 0, so

\overline{A^{2}} = 0.5 \overline{X^{2}} + \overline{Y^{2}}

That is, because X and Y are uncorrelated, the total energy in basic channel A is the sum of the energies of signals X and Y. Similarly:

\overline{B^{2}} = 0.5 \overline{X^{2}} + \overline{Z^{2}}

Because X, Y and Z are mutually uncorrelated, the averaged cross-product of A and B is:

\overline{AB} = 0.5 \overline{X^{2}}

Thus, when an output signal is shared equally by two adjacent basic channels, which may also contain independent, uncorrelated signals, the averaged cross-product of the signals equals the energy of the common signal component in each channel. If the common signal is not shared equally, that is, if it is panned toward one of the basic channels, the averaged cross-product is the geometric mean of the energies of the common components in A and B, from which the common energy estimate for each individual channel can be obtained by normalizing by the square root of the channel amplitude ratio. Real-time averages are computed with a leaky integrator having a suitable decay time constant, so as to reflect ongoing activity. The time-constant smoothing can be refined with nonlinear attack and decay time options and, in a multiband system, can be scaled with frequency.
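
The pairwise relation can be checked numerically. The following is a minimal sketch under stated assumptions: the test signals are Gaussian noise, and the helper names `mean` and `leaky_average`, the coefficient `alpha` and the signal levels are hypothetical choices made for illustration, not values from the text.

```python
import math
import random


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def leaky_average(samples, alpha):
    """One-pole leaky integrator: y += alpha * (x - y)."""
    y = 0.0
    for x in samples:
        y += alpha * (x - y)
    return y


random.seed(1)
n = 50_000
g = math.sqrt(0.5)                               # power-preserving share of X
x = [random.gauss(0.0, 1.0) for _ in range(n)]   # common signal X
y = [random.gauss(0.0, 0.5) for _ in range(n)]   # present only in A
z = [random.gauss(0.0, 0.5) for _ in range(n)]   # present only in B
a = [g * xi + yi for xi, yi in zip(x, y)]
b = [g * xi + zi for xi, zi in zip(x, z)]

# Averaged cross-product of A and B: the estimate of the common energy.
cross_ab = mean(ai * bi for ai, bi in zip(a, b))
# Energy of the common component in each channel, 0.5 * mean(X^2).
half_x2 = 0.5 * mean(xi * xi for xi in x)
# The same average formed by a leaky integrator, as a running estimate.
running = leaky_average((ai * bi for ai, bi in zip(a, b)), alpha=0.001)
```

With these signals `cross_ab` and `half_x2` agree closely, and `running` fluctuates around the same value with a spread set by `alpha`.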

Higher-order calculation of common energy

To obtain the common energy of a decoder module with three or more inputs, the averaged cross-product of all the input signals must be formed. Simply processing the inputs in pairs cannot distinguish separate output signals shared by each pair of inputs from a signal common to all inputs.

Consider, for example, three basic channels A, B and C, made up of the uncorrelated signals W, Y and Z respectively and a common signal X:

A = X + W

B = X + Y

C = X + Z

If the averaged cross-product is computed as in the second-order calculation, all terms involving combinations of W, Y and Z cancel, leaving the average of X^{3}:

\overline{ABC} = \overline{X^{3}}

Unfortunately, if X is a zero-mean time signal, the average of its cube is also zero. Unlike X^{2}, which is positive for any nonzero value of X, X^{3} has the same sign as X, so its positive and negative contributions cancel. The same obviously holds for any odd power of X, which corresponds to an odd number of module inputs, but even exponents greater than 2 can also give wrong results: for example, four inputs with components (X, X, -X, -X) have the same product and average as (X, X, X, X).

This problem can be solved with a modified averaged-product technique. Before averaging, the sign of each product is discarded by taking its absolute value. The signs of the individual terms of the product are then examined. If they are all the same, the absolute value of the product is passed to the averager; if any sign differs from the others, the negative of the absolute value is averaged instead. Because the number of possible same-sign combinations differs from the number of possible different-sign combinations, a weighting factor, equal to the ratio of the number of same-sign combinations to the number of different-sign combinations, is applied to the negated absolute-value products to compensate. For example, a three-input module has two same-sign cases out of eight possibilities, leaving six different-sign cases, so the scale factor is 2/6 = 1/3. With this compensation, the integrated or summed product grows if and only if all inputs of a decoder module contain a common signal component.

For the averages of modules of different order to be comparable, however, they must all have the same dimensions. A conventional second-order correlation is the average of the product of two inputs and therefore has the dimensions of energy or power. The terms averaged in higher-order correlations must therefore also be converted to the dimensions of power. For a k-th order correlation, the absolute value of each product must be raised to the power 2/k before averaging.

Regardless of order, the energy of each input node of a module can of course be calculated, if required, as the average of the square of the corresponding node signal, without first raising it to the k-th power and then reducing it to a second-order quantity.
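
The sign test, the compensating weight and the 2/k exponent can be combined in a few lines. This is a sketch under stated assumptions, not a definitive implementation: the function name `modified_cross_product`, the plain block average (in place of a leaky integrator) and the Gaussian test signals are all choices made for illustration.

```python
import math
import random


def modified_cross_product(frames):
    """Average the sign-tested product over frames of k simultaneous inputs."""
    k = len(frames[0])
    same_sign_cases = 2                    # all positive or all negative
    mixed_sign_cases = 2 ** k - 2          # every other sign pattern
    weight = same_sign_cases / mixed_sign_cases
    total = 0.0
    for frame in frames:
        # The exponent 2/k brings a k-th order product to the power dimension.
        magnitude = abs(math.prod(frame)) ** (2.0 / k)
        if all(v > 0 for v in frame) or all(v < 0 for v in frame):
            total += magnitude
        else:
            total -= weight * magnitude
    return total / len(frames)


random.seed(2)
n = 50_000
# Three channels A, B, C that share X and carry independent noise W, Y, Z.
shared = []
for _ in range(n):
    x = random.gauss(0.0, 1.0)
    shared.append(tuple(x + random.gauss(0.0, 0.5) for _ in range(3)))
# Three channels with nothing in common.
unrelated = [tuple(random.gauss(0.0, 1.0) for _ in range(3)) for _ in range(n)]

with_common = modified_cross_product(shared)
without_common = modified_cross_product(unrelated)
```

For three inputs the weight evaluates to 2/6 = 1/3, as in the example above. The estimate is clearly positive when a common component is present and averages toward zero when the inputs are unrelated.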

Shared nodes: neighboring levels

Using the mean squares of the basic channel signals and the modified cross-products, the amount of common output channel signal energy can be estimated. The example above involves a single interpolation processor, but if one or more of the A/B(/C) nodes is shared with another module that has its own common signal component, uncorrelated with any other signal, the averaged cross-product calculated above is unaffected, so the calculation is inherently immune to image pulling. (Note: if the two output signals are not uncorrelated, they will tend to pull the decoders somewhat, but they have a similar effect on the human ear, so the operation of the system again remains faithful to human hearing.)

Once each decoder module has calculated the estimated common output channel signal at each of its basic channels, the supervisor routine can inform neighboring modules of one another's common energy, at which point the output channel signals are generated as described below. The calculation of the common energy used by a module at a node must take into account the hierarchy of possibly overlapping modules of different order, and must subtract the common energy of a higher-order module from the estimated common energy of any lower-order module sharing the same node.

For example, suppose there are two adjacent basic channels A and B representing two horizontal directions and a basic channel C representing a vertical direction, and suppose further that there is an intermediate or derived output channel representing an interior direction (that is, a direction within the bounds of A, B and C) with signal energy X^{2}. The common energy of the three-input module (A, B, C) is X^{2}, but so is the common energy of each of the two-input modules (A, B), (B, C) and (A, C). Simply adding the common energies of the modules connected to A, namely (A, B, C), (A, B) and (A, C), would give 3X^{2} rather than X^{2}. To calculate the common node energy correctly, the common energy of each higher-order module is first subtracted from the estimated common energy of each overlapping lower-order module. The common energy X^{2} of the higher-order module (A, B, C) is thus subtracted from the common energy estimates of the two two-input modules, giving 0 in each case, and the net common energy estimate at node A is X^{2} + 0 + 0 = X^{2}.
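
The hierarchical subtraction in this example can be sketched as follows. The function name `net_common_energy`, the use of frozensets to identify modules, the clamping of negative results to zero and the numeric value chosen for the energy are assumptions made for illustration only.

```python
def net_common_energy(node, estimates):
    """Sum the common energy at `node` without counting any energy twice.

    `estimates` maps a frozenset of module inputs to that module's raw
    common-energy estimate. Clamping at zero is an assumption of this sketch.
    """
    modules = sorted((m for m in estimates if node in m), key=len, reverse=True)
    net = {}
    for module in modules:
        # Energy already attributed to higher-order modules that contain
        # every input of this module.
        claimed = sum(net[h] for h in net if module < h)
        net[module] = max(estimates[module] - claimed, 0.0)
    return sum(net.values())


x2 = 2.0  # hypothetical energy of the interior output signal
estimates = {
    frozenset("ABC"): x2,
    frozenset("AB"): x2,
    frozenset("BC"): x2,
    frozenset("AC"): x2,
}
energy_at_a = net_common_energy("A", estimates)
```

Here `energy_at_a` equals `x2`, whereas simply adding the three raw estimates at node A would give three times that value.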

Output channel signal generation

As noted above, recovering output channels from the transmitted channels in a linear fashion is essentially a matter of matrixing, that is, of forming weighted sums of the basic channels to obtain the output channel signals. The optimal choice of matrix scale factors is generally signal dependent. Indeed, if the number of currently active output channels equals the number of transmitted channels (though representing different directions), so that the system is exactly determined, it is mathematically possible to compute the inverse of the effective encoding matrix and recover isolated versions of the source signals. Even if the number of active output channels exceeds the number of basic channels, it may still be possible to compute a pseudo-inverse matrix.

Unfortunately, this approach has problems. The computational requirements, particularly for multiband processing and for high-precision floating-point implementations, are a significant factor. Even if intermediate signals are assumed to be panned between the nearest neighboring basic channels, the mathematical inverse or pseudo-inverse of the effective encoding matrix generally involves contributions from all basic channels to each output channel, because of node-sharing effects. If there is any imperfection in the decoding, which is practically unavoidable, a basic channel signal may be reproduced from an output channel that is spatially far removed from it, which is highly undesirable. In addition, pseudo-inverse calculations tend to produce minimum-RMS-energy solutions, which spread the sound as widely as possible and give minimal separation; this is quite contrary to the intent of the present invention.

Therefore, to implement a practical, fault-tolerant decoder in which spatial decoding errors are inherently contained, the same modular structure used for signal detection is used for signal generation.

The process by which a decoder module recovers its output signals is described in detail below. Note that the effective position of each output channel connected to a module is assumed to be specified by the amplitude ratios required to pan a signal to its physical position, that is, by ratios analogous to effective matrix encoding coefficients. To avoid divide-by-zero problems, a ratio is typically calculated as the matrix coefficient of one channel divided by the RMS sum of all the matrix coefficients of that channel (usually 1). For example, the energy ratio used in a two-input module with inputs L and R is the L energy divided by the sum of the L and R energies (the "L-ratio"), which ranges from 0 to 1. If a two-input decoder module has five output channels with effective encoding matrix coefficient pairs (1.0, 0), (0.89, 0.45), (0.71, 0.71), (0.45, 0.89) and (0, 1.0), the corresponding L-ratios are 1.0, 0.89, 0.71, 0.45 and 0, because each pair of scale factors has an RMS sum of 1.0.
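
The L-ratios quoted for the five-output example can be reproduced with a short sketch. The helper name `l_ratio` and the zero-input guard are assumptions of this illustration, and the last coefficient pair is taken as (0, 1.0) so that its RMS sum is 1.0 like the others.

```python
import math


def l_ratio(left, right):
    """Left coefficient divided by the RMS sum of the coefficient pair."""
    rms_sum = math.hypot(left, right)
    # Returning 0 for an all-zero pair is an assumption of this sketch.
    return left / rms_sum if rms_sum > 0.0 else 0.0


pairs = [(1.0, 0.0), (0.89, 0.45), (0.71, 0.71), (0.45, 0.89), (0.0, 1.0)]
ratios = [round(l_ratio(left, right), 2) for left, right in pairs]
```

Because each pair already has an RMS sum of about 1.0, the rounded ratios coincide with the left coefficients.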

From the signal energy of each input node (basic channel) of the decoder module, any common-node signal energy claimed by adjacent decoder modules is subtracted. This yields the normalized input signal power levels used in the rest of the calculation.

The dominant direction indicator is calculated as the vector sum of the basic directions, weighted by relative energy. For a two-input module, this reduces to the L-ratio of the normalized input signal power levels.

The output channels that bracket the dominant direction are determined by comparing the dominant direction L-ratio from the previous step with the L-ratios of the output channels. For example, if the L-ratio of the inputs to the five-output decoder module above is 0.75, the second and third output channels bracket the dominant direction, because 0.89 > 0.75 > 0.71.
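
The comparison step can be sketched as a simple search over the output channel L-ratios. The function name `bracketing_channels` and the zero-based indexing are assumptions of this illustration; the ratio values are those of the five-output example.

```python
def bracketing_channels(dominant_ratio, channel_ratios):
    """Return the indices of the adjacent output channels around the ratio.

    channel_ratios must be sorted in descending order.
    """
    for i in range(len(channel_ratios) - 1):
        if channel_ratios[i] >= dominant_ratio >= channel_ratios[i + 1]:
            return i, i + 1
    raise ValueError("dominant ratio lies outside the channel range")


channel_ratios = [1.0, 0.89, 0.71, 0.45, 0.0]
pair = bracketing_channels(0.75, channel_ratios)
```

The result `(1, 2)` uses zero-based indices, so it names the second and third output channels.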

The panning scale factors that map the dominant signal onto the nearest bracketing channels are obtained from the ratio of the anti-dominant signal levels of those channels. The anti-dominant signal associated with a particular output channel is the signal that results when the input signals of the corresponding decoder module are matrixed with the anti-dominant matrix scale factors of that output channel. The anti-dominant matrix scale factors of an output channel are those scale factors, with an RMS sum of 1, that give zero output when a single dominant signal is panned to that output channel. If the encoding matrix scale factors of an output channel are (A, B), the anti-dominant scale factors of that channel are (B, -A).

Proof

If a single dominant signal is panned to an output channel with encoding scale factors (A, B), the signal must have amplitudes (kA, kB), where k is the overall amplitude of the signal. For that channel, the anti-dominant signal is therefore (kA*B - kB*A) = 0.

Thus, if a dominant signal is made up of the two-input module input signals (x(t), y(t)), with input amplitudes (X, Y) normalized to an RMS sum of 1, the resulting dominant signal is dom(t) = Xx(t) + Yy(t). If the position of this signal is bracketed by output channels with matrix scale factors (A, B) and (C, D) respectively, the dominant signal scale factor that scales dom(t) for the channel with matrix scale factors (A, B) is:

SF(A, B) = sqrt(|DX - CY| / (|DX - CY| + |BX - AY|)),

and the corresponding dominant signal scale factor for the channel with matrix scale factors (C, D) is:

SF(C, D) = sqrt(|BX - AY| / (|DX - CY| + |BX - AY|)),

where |DX - CY| and |BX - AY| are the anti-dominant signal levels of the channels (C, D) and (A, B) respectively. As the dominant direction moves from one output channel to the other, these two scale factors move in opposite directions between 0 and 1, with a constant power sum.
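
The behavior of the two scale factors can be checked with a short sketch. It assumes that the anti-dominant levels enter as magnitudes, so that the quantity under the square root is non-negative whenever the dominant direction lies between the two channels; the function names and the panning angles are hypothetical.

```python
import math


def panning_scale_factors(ab, cd, xy):
    """Return (SF(A, B), SF(C, D)) for dominant amplitudes (X, Y)."""
    a, b = ab
    c, d = cd
    x, y = xy
    level_cd = abs(d * x - c * y)   # anti-dominant level of channel (C, D)
    level_ab = abs(b * x - a * y)   # anti-dominant level of channel (A, B)
    total = level_cd + level_ab
    if total == 0.0:
        raise ValueError("the two output channels coincide")
    return math.sqrt(level_cd / total), math.sqrt(level_ab / total)


def unit(angle_deg):
    """Coefficient pair with an RMS sum of 1 for a hypothetical pan angle."""
    r = math.radians(angle_deg)
    return math.cos(r), math.sin(r)


ch_ab, ch_cd = unit(27.0), unit(45.0)   # roughly (0.89, 0.45) and (0.71, 0.71)
at_ab = panning_scale_factors(ch_ab, ch_cd, ch_ab)
at_cd = panning_scale_factors(ch_ab, ch_cd, ch_cd)
between = panning_scale_factors(ch_ab, ch_cd, unit(36.0))
```

A signal panned exactly to one channel gets scale factors (1, 0) or (0, 1), and for any position in between the squares of the two factors add up to 1.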

The anti-dominant signal is calculated and panned, with suitable gain scaling, to all non-dominant channels. The anti-dominant signal is a matrixed signal that contains none of the dominant signal. If the inputs to a decoder module are (x(t), y(t)) with normalized amplitudes (X, Y), the dominant signal is Xx(t) + Yy(t) and the anti-dominant signal is Yx(t) - Xy(t), regardless of the positions of the non-dominant output channels.
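
As an illustration, the two matrixed signals can be formed for a block of samples. This sketch assumes that the amplitudes (X, Y) are estimated from the block RMS levels and normalized to an RMS sum of 1; the function name and the test signal are hypothetical.

```python
import math


def dominant_and_antidominant(x_sig, y_sig):
    """Split a two-input block into dominant and anti-dominant signals."""
    level_x = math.sqrt(sum(v * v for v in x_sig))
    level_y = math.sqrt(sum(v * v for v in y_sig))
    norm = math.hypot(level_x, level_y)
    if norm == 0.0:
        silence = [0.0] * len(x_sig)
        return silence, list(silence)
    amp_x, amp_y = level_x / norm, level_y / norm   # (X, Y), RMS sum of 1
    dom = [amp_x * xv + amp_y * yv for xv, yv in zip(x_sig, y_sig)]
    anti = [amp_y * xv - amp_x * yv for xv, yv in zip(x_sig, y_sig)]
    return dom, anti


# A single source panned with amplitudes (0.8, 0.6) into the two inputs.
source = [math.sin(0.05 * i) for i in range(1000)]
x_sig = [0.8 * s for s in source]
y_sig = [0.6 * s for s in source]
dom, anti = dominant_and_antidominant(x_sig, y_sig)
```

For this single panned source the dominant signal reproduces the source and the anti-dominant signal is essentially zero.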

In addition to the dominant/anti-dominant signal distribution, a second signal distribution is calculated with a "passive" matrix, which is based on the output channel matrix scale factors discussed above, scaled to preserve power.

The cross-correlation of the decoder module input signals is calculated as the averaged cross-product of the input signals divided by the square root of the product of the normalized input levels.
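
A block-based version of this calculation might look as follows. It is a sketch that assumes plain block averages in place of smoothed running averages and takes the input levels to be mean-square powers; the function name is hypothetical.

```python
import math


def cross_correlation(x_sig, y_sig):
    """Averaged cross-product divided by the root of the product of powers."""
    n = len(x_sig)
    cross = sum(a * b for a, b in zip(x_sig, y_sig)) / n
    power_x = sum(a * a for a in x_sig) / n
    power_y = sum(b * b for b in y_sig) / n
    denom = math.sqrt(power_x * power_y)
    # Returning 0 for silent input is an assumption of this sketch.
    return cross / denom if denom > 0.0 else 0.0


sine = [math.sin(0.1 * i) for i in range(10_000)]
cosine = [math.cos(0.1 * i) for i in range(10_000)]
same = cross_correlation(sine, sine)
inverted = cross_correlation(sine, [-v for v in sine])
quadrature = cross_correlation(sine, cosine)
```

Identical inputs give a correlation of 1, polarity-inverted inputs give -1, and signals in quadrature give a value near 0.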

Returning to the description of the generation process, the final output is calculated as a weighted cross-fade sum of the dominant and passive signal distributions, with the cross-fade factor derived from the cross-correlation of the decoder module input signals. For a correlation of 1, only the dominant/anti-dominant distribution is used. As the correlation decreases, the output signal array is broadened by cross-fading toward the passive distribution, which is fully reached at a low positive correlation value, typically 0.2 to 0.4, depending on the number of output channels connected to the decoder module. As the correlation decreases further toward zero, the passive amplitude output distribution is progressively bowed outward, reducing the output signal levels, to mimic the response of the human ear to such signals.
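
The cross-fade can be illustrated with a minimal sketch. The text does not give the shape of the cross-fade law, so the linear law used here, the default threshold of 0.3 and the function names are assumptions; the outward bowing of the passive distribution below the threshold is not modeled.

```python
def dominant_weight(correlation, threshold=0.3):
    """1 at a correlation of 1, falling linearly to 0 at the threshold."""
    if correlation <= threshold:
        return 0.0
    return min((correlation - threshold) / (1.0 - threshold), 1.0)


def crossfade(dominant_out, passive_out, correlation, threshold=0.3):
    """Blend two per-channel gain distributions for one module."""
    w = dominant_weight(correlation, threshold)
    return [w * d + (1.0 - w) * p for d, p in zip(dominant_out, passive_out)]


fully_dominant = crossfade([1.0, 0.0], [0.5, 0.5], correlation=1.0)
fully_passive = crossfade([1.0, 0.0], [0.5, 0.5], correlation=0.3)
halfway = crossfade([1.0, 0.0], [0.5, 0.5], correlation=0.65)
```

With this law a correlation of 1 selects the dominant distribution alone, and any correlation at or below the threshold selects the passive distribution alone.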

Vertical processing

Most of the processing described so far for generating output channel signals from adjacent basic channels is independent of the directions of the output and basic channels. Because of the horizontal orientation of the ears, however, human auditory localization tends to be less sensitive to inter-channel correlation in the vertical direction than in the horizontal direction. To remain faithful to the operation of the human ear, it may therefore be desirable to relax the correlation constraint in interpolation processors whose input channels are vertically oriented, for example by processing the correlation signal with a warping function before it is used. It is also possible, however, that applying the same processing as for horizontal channels causes no audible degradation, which would simplify the structure of the whole decoder.

Strictly speaking, vertical signals include sound from both above and below, and the decoder structure described should work equally well for both. In practice, however, there is usually little natural sound from below, so that processing and those channels can probably be omitted without impairing the perceived spatial fidelity of the system.

This notion has practical significance when channel translation is applied to existing 5.1-channel surround material, which of course has no vertical channel. Such material may nevertheless contain vertical information, such as flyovers, recorded across several or all of the horizontal channels. It should therefore be possible to extract a virtual vertical channel from such source material by looking for correlation between non-adjacent channels or groups of channels. Where such correlation exists, it usually indicates vertical information from above the listener rather than from below. In some cases it may also be possible to derive virtual vertical information from a reverberation generator, possibly keyed to a model of the intended listening environment. Once a virtual vertical channel has been extracted or derived from the 5.1-channel source, expansion to a larger number of channels, such as the 24-channel arrangement described earlier, can proceed as if a real vertical channel had been supplied.

Directional memory

In the control of decoder module generation described above, which resembles the operation of an active 2:N decoder such as a Pro Logic decoder, the only "memory" in the process lies in the smoothing networks that derive the basic control signals. At any given moment there is only one dominant direction and one input correlation value, and signal generation proceeds directly from these signals.

However, especially in complex acoustic environments (the archetypal cocktail party, for example), the human ear exhibits a degree of positional memory, or inertia: a briefly dominant sound that is clearly localized in a given direction causes other, less clearly localized sounds from that general direction to be perceived as coming from the same source.

This effect can be mimicked in the decoder modules (and indeed in Pro Logic decoding as well) by adding an explicit mechanism that keeps track of recent dominant directions and, during periods of directionally ambiguous signal conditions, weights the output signal distribution toward those recent dominant directions. This can improve the perceived discreteness and stability of the reproduction of complex signal arrays.

Modified correlation and selective channel mixing

As described above, the output distribution of each decoder module is determined on the basis of the cross-correlation of its input signals, which may in some cases underestimate the amount of output signal content. This occurs, for example, with a naturally recorded signal in which off-center sources have slightly different arrival times as well as unequal amplitudes, which reduces the correlation. The effect may be more severe if widely spaced microphones are used, with correspondingly larger inter-channel delays. To compensate, the correlation calculation can be extended to cover a range of inter-channel time delays, at the cost of somewhat higher processing MIPS. In addition, because auditory nerve cells have an effective time constant of about 1 millisecond, a more realistic correlation value can be obtained by first smoothing the detected audio with a smoother having a 1 millisecond time constant.
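
Extending the correlation over a range of inter-channel delays can be sketched as a search for the best normalized correlation among integer sample lags. The function names, the lag range and the noise test signal are assumptions of this illustration, and the 1 millisecond pre-smoothing is not included.

```python
import math
import random


def correlation(xs, ys):
    cross = sum(a * b for a, b in zip(xs, ys))
    denom = math.sqrt(sum(a * a for a in xs) * sum(b * b for b in ys))
    return cross / denom if denom > 0.0 else 0.0


def best_lagged_correlation(x_sig, y_sig, max_lag):
    """Return (correlation, lag) with the highest correlation.

    A positive lag means y_sig is delayed relative to x_sig.
    """
    n = len(x_sig)
    best = (-2.0, 0)
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            xs, ys = x_sig[:n - lag], y_sig[lag:]
        else:
            xs, ys = x_sig[-lag:], y_sig[:n + lag]
        c = correlation(xs, ys)
        if c > best[0]:
            best = (c, lag)
    return best


random.seed(3)
x_sig = [random.gauss(0.0, 1.0) for _ in range(2000)]
y_sig = [0.0] * 5 + x_sig[:-5]          # same signal, delayed by 5 samples
zero_lag = correlation(x_sig, y_sig)
best_value, best_lag = best_lagged_correlation(x_sig, y_sig, max_lag=10)
```

The zero-lag correlation of the delayed noise is close to 0, while the search finds a correlation of essentially 1 at a lag of 5 samples.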

In addition, if a content producer has an existing 5.1-channel program with strongly uncorrelated channels, the uniformity of the spread obtained with a channel-translation decoder can be improved by slightly mixing adjacent channels, thereby increasing their correlation. This causes the channel-translation decoder modules to provide a more uniform spread across the intermediate output channels. Such mixing can also be done selectively, for example by leaving the center front channel signal unmixed to preserve the compactness of the dialogue track.

Loudness compression/expansion

When the encoding process involves mixing a larger number of channels down to a smaller number, the encoded signal may clip unless some form of gain compensation is provided. The same problem exists in conventional matrix encoding, but it is potentially more significant for channel translation, because more channels are mixed into a given output channel. To avoid clipping in such cases, the encoder derives an overall gain scale factor and conveys it to the decoder in the encoded bitstream. Normally this value is 0 dB, but the encoder can set it to a nonzero attenuation value to avoid clipping, in which case the decoder provides an equivalent amount of compensating gain.
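
The encoder-side attenuation and the decoder-side compensation can be sketched as follows. The function names, the peak-based choice of attenuation and the full-scale ceiling of 1.0 are assumptions of this illustration; the text states only that a gain scale factor is conveyed in the bitstream.

```python
import math


def encoder_gain_db(peak, ceiling=1.0):
    """Attenuation (0 dB or negative) that keeps `peak` within `ceiling`."""
    if peak <= ceiling:
        return 0.0
    return 20.0 * math.log10(ceiling / peak)


def apply_gain_db(samples, gain_db):
    gain = 10.0 ** (gain_db / 20.0)
    return [gain * s for s in samples]


mix = [0.5, -1.8, 2.0, 0.1]                      # downmix that would clip
gain_db = encoder_gain_db(max(abs(s) for s in mix))
encoded = apply_gain_db(mix, gain_db)            # conveyed with gain_db
decoded = apply_gain_db(encoded, -gain_db)       # decoder restores the level
```

For this block `gain_db` is about -6 dB, the encoded samples stay within full scale, and the decoder's compensating gain restores the original levels.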

If the decoder is used to process an existing multichannel program that has no such scale factor (for example, an existing 5.1 channel soundtrack), it can use a fixed scale factor with an assumed value (approximately 0 dB), apply an expansion function based on signal level and/or dynamic range, or use any available metadata, such as a dialogue normalization value, to adjust the decoder gain.

The present invention and its various aspects may be implemented in analog circuitry or, more likely, as software functions executed in digital signal processors, programmed general-purpose digital computers, and/or special-purpose digital computers. Interfaces between analog and digital signal streams may be implemented in suitable hardware and/or as functions in software and/or firmware.

Claims

1. A method for translating M input channels representing a sound field into N output channels representing the same sound field, wherein each channel is a single audio stream representing audio arriving from a direction, M and N are positive integers, and M is at least 2, the method comprising:

generating one or more sets of output channels, each set having one or more output channels, wherein each set is associated with two or more spatially adjacent input channels, and each output channel in a set is generated by a process that includes determining a measure of the correlation of the two or more input channels and the level interrelationships of the two or more input channels.

2. the method for claim 1 is characterized in that, has one group of output channels to be in relation to two input sound channels.

3. the method for claim 1 is characterized in that, one or more described output channels groups are in relation to the input sound channel more than two.

4. the method for claim 1, it is characterized in that, one or more output channels groups are in relation to more input sound channel than one or more other output channels groups, and the correlation of the input sound channel that every group of output channels interrelates is determined in described processing according to a hierarchical order, make the number of the input sound channel that each group or a plurality of groups are got in touch according to its output channels be sorted, maximum input sound channel numbers have the highest order, and described processing according to the hierarchical order sequential processes of these groups they.

5. method as claimed in claim 4 is characterized in that, the result to the group of higher order is considered in described processing.

6. the method for claim 1 is characterized in that, the relativity measurement of described definite two or more input sound channels and the level correlation of two or more input sound channels realize in frequency domain.

7. the method for claim 1 is characterized in that, nonlinear time constant is adopted in described processing.

8. The method of any one of claims 1 or 3 to 8, characterized in that three or more input channels represent directions that do not lie on a straight line.