CN105959905B - Mixed-mode spatial sound generation system and method - Google Patents
- Publication number: CN105959905B
- Application number: CN201610268371.7A
- Authority: CN (China)
- Prior art keywords: audio object, branch, ambisonic, audio, independent
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S7/00—Indicating arrangements; Control arrangements, e.g. balance control
- H04S7/30—Control circuits for electronic adaptation of the sound field
- H04S7/305—Electronic adaptation of stereophonic audio signals to reverberation of the listening space
- H04S7/306—For headphones
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/008—Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
Abstract
The invention discloses a mixed-mode spatial sound generation system and method. The mixed-mode spatial sound generation method comprises: inputting one or more audio objects; detecting the number of audio objects; when the number of audio objects exceeds a first threshold A, activating the ambisonic-domain branch and processing the audio objects with an ambisonic method to obtain virtual surround spatial sound; otherwise activating the independent-object rendering branch and processing the audio objects with an independent-object rendering method to obtain virtual surround spatial sound. The mixed-mode spatial sound generation system and method add a rendering control module that governs how audio objects are rendered, so that virtual surround sound can be generated effectively and at high quality while computational complexity is kept low for high-quality 3D audio.
Description
Technical field
The present invention relates to the field of signal processing technology, and in particular to a mixed-mode spatial sound generation system and method.
Background art
When presenting content to a user with a virtual-reality head-mounted display (HMD), virtual 3D audio technology plays the audio content to the user over stereo headphones, and the problem of improving the virtual surround effect must be faced. In virtual reality applications, when audio content is played over stereo headphones, the aim of virtual 3D audio is to achieve an effect as if the user were listening to a loudspeaker array (such as 5.1 or 7.1).
When producing virtual-reality audio content, several sound elements are usually needed. One way to improve the sense of presence is to track the user's head movements (head tracking) and process the sound accordingly. For example, if the original sound is perceived by the user as coming from the front, then after the user turns the head 90 degrees to the left, the sound should be processed so that the user perceives it as coming from 90 degrees to the right.
Virtual reality devices come in many types here, for example a display device with head tracking, or simply a pair of stereo headphones fitted with a head-tracking sensor.
There are also several ways to implement head tracking. The most common is to use multiple sensors: a motion-sensor suite typically includes an accelerometer, a gyroscope and a magnetometer. Each kind of sensor has its own inherent strengths and weaknesses for motion tracking and absolute orientation, so common practice is to apply sensor fusion, combining the signals from the individual sensors to produce a more accurate motion-detection result.
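As a minimal sketch of such sensor fusion (not taken from the patent; the function name, signal values and the filter constant are all illustrative), a complementary filter blends the gyroscope's short-term accuracy against the magnetometer's drift-free but noisy absolute heading:

```python
def fuse_yaw(yaw_prev, gyro_rate, mag_yaw, dt, alpha=0.98):
    """One fusion step for the yaw angle: integrate the gyroscope for
    short-term accuracy, and pull toward the magnetometer heading to
    cancel long-term drift. `alpha` weights the gyroscope path."""
    gyro_yaw = yaw_prev + gyro_rate * dt
    return alpha * gyro_yaw + (1.0 - alpha) * mag_yaw

# A stationary head with a biased gyroscope (0.1 rad/s of drift) while
# the magnetometer correctly reads 0: the fused yaw stays bounded near
# zero instead of drifting off by a full radian over 10 seconds.
yaw = 0.0
for _ in range(1000):          # 10 s at 100 Hz
    yaw = fuse_yaw(yaw, gyro_rate=0.1, mag_yaw=0.0, dt=0.01)
```

With `alpha = 0.98` the drift settles at a fixed point of about 0.049 rad rather than accumulating without bound, which is why blending the two sensors beats either one alone.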
Once the head-rotation angle has been obtained, the sound must be transformed accordingly.
For audio objects, common practice is to filter with an HRTF (Head-Related Transfer Function) filter to obtain virtual surround sound. The time-domain counterpart of the HRTF is the HRIR (Head-Related Impulse Response). Alternatively, the source is convolved with a binaural room impulse response (BRIR). A binaural room impulse response consists of three parts: the direct sound, some discrete reflections, and the late reverberation (reverberation tail).
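The per-object convolution just described can be sketched as follows (a toy example: the five-sample BRIR pair is invented for illustration, while real BRIRs run to thousands of samples):

```python
import numpy as np

def binauralize(mono, brir_left, brir_right):
    """Convolve one mono audio object with a binaural room impulse
    response pair; the direct sound, discrete reflections and the
    reverberation tail are all contained in the BRIR itself."""
    return np.convolve(mono, brir_left), np.convolve(mono, brir_right)

# Toy BRIR: direct sound at t=0, one reflection, short decaying tail.
brir_l = np.array([1.0, 0.0, 0.3, 0.1, 0.05])
brir_r = np.array([0.8, 0.0, 0.4, 0.1, 0.05])
# A unit impulse as the source simply reproduces the BRIR at each ear.
left, right = binauralize(np.array([1.0, 0.0, 0.0]), brir_l, brir_r)
```

The cost of this approach is one convolution pair per audio object, which is exactly the complexity problem the next paragraph raises.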
The drawback of convolving audio objects directly with BRIRs is that if the scene is complex and contains a large number of audio objects, complexity becomes very high; for many audio playback terminals this causes excessive power consumption or even makes playback impossible. On a virtual-reality device, the positions of the audio objects must additionally be adjusted in real time according to head movement, which increases the computational load even further, making this traditional approach impractical on mobile virtual-reality devices.
Another approach is to move the sound into the ambisonic domain and then transform the signal with a rotation matrix. The specific practice is to convert the audio to a B-format signal, convert the B-format signal into virtual loudspeaker array signals, and filter the virtual loudspeaker array signals with HRTF filters to obtain virtual surround sound. However, this method lacks flexibility in sound rendering and cannot control individual sources precisely.
It can be seen that the two methods above each have strengths and weaknesses in efficiency and effect. In view of this, the art needs a solution that generates virtual surround sound both effectively and at high quality.
Summary of the invention
The object of the present invention is to provide a mixed-mode spatial sound generation system and method, to solve the prior-art problem of being unable to keep computational complexity low while producing high-quality 3D audio.
To achieve the above object, the mixed-mode spatial sound generation system of the present invention comprises a rendering control module, an ambisonic encoder, a binaural transcoder, and a headphone-and-head-tracking device. The rendering control module is connected to the ambisonic encoder and to the binaural transcoder respectively; the ambisonic encoder is connected to the binaural transcoder; and the headphone-and-head-tracking device is connected to the ambisonic encoder and to the binaural transcoder respectively. The rendering control module receives one or more audio objects and detects the number of audio objects. When the number of audio objects exceeds a first threshold A, it activates the ambisonic-domain branch formed by the ambisonic encoder, processes the audio objects with an ambisonic method, obtains virtual surround spatial sound and transfers it to the ambisonic encoder, and the ambisonic encoder outputs the binaural virtual surround signal of the virtual surround spatial sound. Otherwise it activates the independent-object rendering branch formed by the binaural transcoder, processes the audio objects with an independent-object rendering method, obtains virtual surround spatial sound, and outputs the binaural virtual surround signal of the virtual surround spatial sound.
The rendering control module is further used to detect the metadata of the audio objects; the metadata includes the time, the position of the corresponding audio object in three-dimensional space, and the divergence of the audio object. The rendering control module determines the processing mode of an audio object according to its divergence: if the divergence of an audio object is greater than a second threshold B, the audio object is provisionally assigned to the ambisonic-domain branch. After the provisional assignment, the computational complexity is calculated according to the current state of the audio-object processing device, and whether to reassign audio objects is decided according to the computational complexity; the computational complexity is obtained by counting the execution cycles of the audio-object processing device. When the computational complexity permits N audio objects and there are currently M audio objects, the independent-object rendering branch can handle 0 to N-T audio objects and the ambisonic-domain branch can handle M-N+T audio objects. If the number H of audio objects assigned to the independent-object rendering branch is less than N-T, then any number from 1 to N-T-H of the audio objects in the ambisonic-domain branch are reassigned to the independent-object rendering branch. Here N is greater than T, M is greater than 0, and H is greater than or equal to 0. If N is less than T, the independent-object rendering branch is used for everything; if N equals T, either the ambisonic-domain branch or the independent-object rendering branch is used for everything.
The rendering control module determines the assignment of an audio object according to the divergence of its source: if the divergence of the source is higher than X, then, provided the complexity constraint is met, the audio object is assigned to the ambisonic-domain branch; conversely, the audio object is assigned to the independent-object rendering branch. X is specified by the user.
The present invention also provides a mixed-mode spatial sound generation method, comprising the following steps:
Input one or more audio objects.
Detect the number of audio objects. When the number of audio objects exceeds a first threshold A, activate the ambisonic-domain branch and process the audio objects with an ambisonic method to obtain virtual surround spatial sound; otherwise activate the independent-object rendering branch and process the audio objects with an independent-object rendering method to obtain virtual surround spatial sound.
The mixed-mode spatial sound generation method further comprises detecting the metadata of the audio objects; the metadata includes the time, the position of the corresponding audio object in three-dimensional space, and the divergence of the audio object.
The mixed-mode spatial sound generation method further comprises determining the processing mode of an audio object according to its divergence: if the divergence of an audio object is greater than a second threshold B, the audio object is provisionally assigned to the ambisonic-domain branch.
After the provisional assignment, the computational complexity is calculated according to the current state of the audio-object processing device, and whether to reassign audio objects is decided according to the computational complexity.
The computational complexity is obtained by counting the execution cycles of the audio-object processing device; one ambisonic-domain branch is equivalent in complexity to T independent audio branches. When the computational complexity permits N audio objects and there are currently M audio objects, the independent-object rendering branch can handle 0 to N-T audio objects and the ambisonic-domain branch can handle M-N+T audio objects. If the number H of audio objects assigned to the independent-object rendering branch is less than N-T, then any number from 1 to N-T-H of the audio objects in the ambisonic-domain branch are reassigned to the independent-object rendering branch. Here N is greater than T, M is greater than 0, and H is greater than or equal to 0. If N is less than T, the independent-object rendering branch is used for everything; if N equals T, either the ambisonic-domain branch or the independent-object rendering branch is used for everything.
In a further preferred embodiment, the assignment of an audio object is determined according to the divergence of its source: if the divergence of the source is higher than X, then, provided the complexity constraint is met, the audio object is assigned to the ambisonic-domain branch; conversely, the audio object is assigned to the independent-object rendering branch. X is specified by the user.
The mixed-mode spatial sound generation method detects the number of audio objects and their metadata in either a static mode or a dynamic mode. The static mode detects the number of audio objects and their metadata only at the very beginning. The dynamic mode dynamically adjusts, over time, how the audio objects are assigned between the two branches, the independent-object rendering branch and the ambisonic-domain branch.
The specific practice of the dynamic mode is to use fixed-interval sampling or non-fixed-interval sampling. Fixed-interval sampling detects the number of audio objects and their metadata at fixed time intervals. Non-fixed-interval sampling is based on the start times of the audio objects: the number of audio objects and their metadata are detected at the start and end of each new audio object.
The present invention has the following advantage: the mixed-mode spatial sound generation system and method add a rendering control module that governs how audio objects are rendered, and can keep complexity low while producing high-quality 3D audio.
Brief description of the drawings
Fig. 1 is a schematic structural diagram of the mixed-mode spatial sound generation system of the present invention.
Embodiments
The following embodiments illustrate the present invention but do not limit its scope.
As shown in Fig. 1, the present invention provides a mixed-mode spatial sound generation system comprising a rendering control module, an ambisonic encoder, a binaural transcoder, and a headphone-and-head-tracking device. The rendering control module is connected to the ambisonic encoder and to the binaural transcoder respectively; the ambisonic encoder is connected to the binaural transcoder; and the headphone-and-head-tracking device is connected to the ambisonic encoder and to the binaural transcoder respectively. The rendering control module receives one or more audio objects and detects the number of audio objects. When the number of audio objects exceeds a first threshold A, it activates the ambisonic-domain branch formed by the ambisonic encoder, processes the audio objects with an ambisonic method, obtains virtual surround spatial sound and transfers it to the ambisonic encoder, and the ambisonic encoder outputs the binaural virtual surround signal of the virtual surround spatial sound. Otherwise it activates the independent-object rendering branch formed by the binaural transcoder, processes the audio objects with an independent-object rendering method, obtains virtual surround spatial sound, and outputs the binaural virtual surround signal of the virtual surround spatial sound.
The headphone-and-head-tracking device obtains the user's head-rotation angle and transmits the user's head-rotation angle to the ambisonic encoder and to the binaural transcoder respectively; the ambisonic encoder and the binaural transcoder each process the audio objects according to the user's head-rotation angle to obtain virtual surround spatial sound.
Processing the audio objects according to the user's head-rotation angle means rotating the B-format signal of the audio objects according to that angle to obtain the rotated B-format signal. Specifically, a rotation matrix is generated from the rotation angle, and then, according to this rotation matrix, the B-format signal of the audio objects (i.e. the signal to be adjusted) is rotated. "Rotation" here means multiplying the rotation matrix by the signal matrix to be adjusted; rotation does not change the magnitudes of the components of the audio-signal matrix, only their directions. The order of the rotation matrix is adapted to the audio-signal matrix: when the signal matrix to be adjusted is [W X Y]^T, the rotation matrix is 3×3; when the signal matrix to be adjusted is [W X Y Z]^T, the rotation matrix is 4×4.
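The concrete rotation matrices are not reproduced in the text above; as a sketch under the assumption of the standard first-order ambisonic B-format convention (W omnidirectional, X front, Y left, Z up), a yaw rotation by an angle θ leaves W and Z unchanged and rotates the X and Y components:

```python
import numpy as np

def yaw_rotation_matrix(theta, with_z=True):
    """Rotation matrix for a B-format signal matrix: 4x4 for [W X Y Z]^T,
    3x3 for [W X Y]^T — the matrix order matches the signal matrix."""
    c, s = np.cos(theta), np.sin(theta)
    if with_z:
        return np.array([[1.0, 0.0, 0.0, 0.0],
                         [0.0,   c,  -s, 0.0],
                         [0.0,   s,   c, 0.0],
                         [0.0, 0.0, 0.0, 1.0]])
    return np.array([[1.0, 0.0, 0.0],
                     [0.0,   c,  -s],
                     [0.0,   s,   c]])

# Rotation changes the direction of the components, not their magnitude:
b = np.array([1.0, 0.5, 0.0, 0.2])             # [W, X, Y, Z]
rotated = yaw_rotation_matrix(np.pi / 2) @ b   # head turned 90 degrees
```

A frontal source (energy in X) ends up entirely in Y after a 90-degree turn, while the overall signal norm is preserved, matching the statement that rotation only changes component direction.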
The rendering control module is further used to detect the metadata of the audio objects; the metadata includes the time, the position of the corresponding audio object in three-dimensional space, and the divergence of the audio object. The rendering control module determines the processing mode of an audio object according to its divergence: if the divergence of an audio object is greater than a second threshold B, the audio object is provisionally assigned to the ambisonic-domain branch. After the provisional assignment, the computational complexity is calculated according to the current state of the audio-object processing device, and whether to reassign audio objects is decided according to the computational complexity; the computational complexity is obtained by counting the execution cycles of the audio-object processing device.
Divergence (diffuseness) here indicates whether a sound has a definite spatial direction (such as a point source) or is relatively diffuse, as ambient sound tends to be. Divergence ranges over [0, 1]: a value of 0 means the audio object has low divergence and approximates a point source; a value of 1 means non-directional ambient sound.
One ambisonic-domain branch is equivalent in complexity to T independent audio branches, and no matter how many audio objects are assigned to one ambisonic-domain branch, it always has the complexity of T independent audio branches. Under normal circumstances T = 8, i.e. one ambisonic-domain branch is equivalent in complexity to 8 independent audio branches. The specific value of T must, however, be determined for the actual audio-object processing device, and different audio-object processing devices may have different values of T.
When the computational complexity permits N audio objects and there are currently M audio objects, the independent-object rendering branch can handle 0 to N-T audio objects and the ambisonic-domain branch can handle M-N+T audio objects. If the number H of audio objects assigned to the independent-object rendering branch is less than N-T, then any number from 1 to N-T-H of the audio objects in the ambisonic-domain branch are reassigned to the independent-object rendering branch. Here N is greater than T, M is greater than 0, and H is greater than or equal to 0. If N is less than T, the independent-object rendering branch is used for everything; if N equals T, either the ambisonic-domain branch or the independent-object rendering branch is used for everything.
For example, suppose the computational complexity permits 8 audio objects, there are currently 8 audio objects, 3 are provisionally assigned to the independent-object rendering branch, and 5 are provisionally assigned to the ambisonic-domain branch. Since one ambisonic-domain branch has the complexity of T (T = 8) independent audio branches regardless of how many audio objects are assigned to it, the provisional assignment "3 on the independent-object rendering branch, 5 on the ambisonic-domain branch" requires the complexity budget to permit 3 + 8 = 11 audio objects, while in this example it permits only 8. Therefore either the 5 audio objects in the ambisonic-domain branch must be reassigned to the independent-object rendering branch (all 8 audio objects then go to the independent-object rendering branch, satisfying the budget of 8 audio objects), or the 3 audio objects in the independent-object rendering branch must be reassigned to the ambisonic-domain branch (all 8 audio objects then go to the ambisonic-domain branch; since one ambisonic-domain branch costs the equivalent of T = 8 independent audio branches, this likewise satisfies the budget of 8 audio objects).
Now suppose the computational complexity permits 8 audio objects, there are currently 14, 3 are provisionally assigned to the independent-object rendering branch, and 11 are provisionally assigned to the ambisonic-domain branch. This provisional assignment requires the budget to permit 3 + T audio objects (with T = 8 under normal circumstances, that is 3 + T = 11, while the actual budget permits only 8), so reassignment is needed. The independent-object rendering branch may receive 0 to N-T audio objects (N, the number the budget permits, is 8 in this example, and T = 8 under normal circumstances); since N-T = 8-8 = 0 here, 0 audio objects can go to the independent-object rendering branch, and the 3 audio objects provisionally assigned to it must be reassigned to the ambisonic-domain branch. The number of audio objects actually assigned to the ambisonic-domain branch is M-N+T (M, the current number of audio objects, is 14 in this example, so M-N+T = 14-8+8 = 14), i.e. 14 audio objects are actually assigned to the ambisonic-domain branch. In other words, the result of the reassignment is that the 3 audio objects provisionally assigned to the independent-object rendering branch are moved to the ambisonic-domain branch, so that all 14 current audio objects are assigned to the ambisonic-domain branch.
Finally, suppose the computational complexity permits 12 audio objects, there are currently 20 (i.e. M = 20), 3 are provisionally assigned to the independent-object rendering branch, and 17 are provisionally assigned to the ambisonic-domain branch. This provisional assignment requires the budget to permit 3 + T audio objects (with T = 8 under normal circumstances, that is 3 + T = 11, while the actual budget permits 12 audio objects), so reassignment is possible. Since the number of audio objects assigned to the independent-object rendering branch is 3 (i.e. H = 3), which is less than N-T = 12-8 = 4, any number from 1 to N-T-H = 12-8-3 = 1 of the audio objects in the ambisonic-domain branch, i.e. 1 audio object, may be reassigned to the independent-object rendering branch.
In another embodiment, the assignment of audio objects is determined according to divergence. If the divergence of an audio object is higher than X (0 ≤ X ≤ 1), then, provided the complexity constraint is met, the source is assigned to the ambisonic-domain branch; conversely, the audio object is assigned to the independent-object rendering branch.
In a preferred embodiment, X = 0.5; that is, if the divergence of a source is higher than 0.5 (this value is of course not fixed — X may take any value between 0 and 1, or be specified by the user), then, provided the complexity constraint is met, the source is assigned to the ambisonic-domain branch; conversely, the source is assigned to the independent-object rendering branch.
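Assuming metadata fields as described earlier (time, three-dimensional position, and a divergence value in [0, 1]; the dictionary keys and example sources below are illustrative), this divergence-based assignment can be sketched as:

```python
def assign_by_divergence(objects, X=0.5):
    """Split audio objects by divergence: diffuse sources (divergence > X)
    go to the ambisonic-domain branch, point-like sources to the
    independent-object rendering branch."""
    ambisonic, independent = [], []
    for obj in objects:
        (ambisonic if obj["divergence"] > X else independent).append(obj["name"])
    return ambisonic, independent

sources = [
    {"name": "footsteps", "divergence": 0.1},  # point-like: localize precisely
    {"name": "rain",      "divergence": 0.9},  # diffuse ambient sound
    {"name": "voice",     "divergence": 0.0},
]
amb, ind = assign_by_divergence(sources)
```

Diffuse ambience tolerates the ambisonic branch's looser per-source control, while point-like sources keep the independent branch's precise localization, which is the trade-off the embodiment encodes.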
The present invention also provides a mixed-mode spatial sound generation method, comprising the following steps:
Input one or more audio objects.
Detect the number of audio objects. When the number of audio objects exceeds a first threshold A, activate the ambisonic-domain branch and process the audio objects with an ambisonic method to obtain virtual surround spatial sound; otherwise activate the independent-object rendering branch and process the audio objects with an independent-object rendering method to obtain virtual surround spatial sound.
In a preferred embodiment, the first threshold A equals 8. In other embodiments, the first threshold A may be set arbitrarily by the technician according to actual requirements.
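With A = 8 as in this preferred embodiment, the top-level branch selection amounts to a single comparison (function and branch names are illustrative):

```python
def select_branch(num_objects, A=8):
    """Activate the ambisonic-domain branch when the object count exceeds
    the first threshold A; otherwise the independent-object branch."""
    return "ambisonic_domain" if num_objects > A else "independent_object"

assert select_branch(16) == "ambisonic_domain"
assert select_branch(3) == "independent_object"
assert select_branch(8) == "independent_object"  # exactly A does not exceed A
```

The strict inequality matters at the boundary: a scene with exactly A objects still fits the independent-object branch's complexity budget.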
The mixed-mode spatial sound generation method further comprises detecting the metadata of the audio objects; the metadata includes the time, the position of the corresponding audio object in three-dimensional space, and the divergence of the audio object.
The mixed-mode spatial sound generation method further comprises determining the processing mode of an audio object according to its divergence: if the divergence of an audio object is greater than a second threshold B, the audio object is provisionally assigned to the ambisonic-domain branch.
In a preferred embodiment, the second threshold B equals 0.5. In other embodiments, the second threshold B may be set arbitrarily by the technician according to actual requirements.
After the provisional assignment, the computational complexity is calculated according to the current state of the audio-object processing device, and whether to reassign audio objects is decided according to the computational complexity.
The computational complexity can be obtained by counting the execution cycles of the audio-object processing device. When the computational complexity permits N audio objects and there are currently M audio objects, the independent-object rendering branch can handle 0 to N-T audio objects and the ambisonic-domain branch can handle M-N+T audio objects. If the number H of audio objects assigned to the independent-object rendering branch is less than N-T, then any number from 1 to N-T-H of the audio objects in the ambisonic-domain branch are reassigned to the independent-object rendering branch. Here N is greater than or equal to T, M is greater than 0, and H is greater than or equal to 0. If N is less than T, the independent-object rendering branch is used for everything; if N equals T, then, according to the divergence of the audio objects, either the ambisonic-domain branch or the independent-object rendering branch is used for everything.
In a further preferred embodiment, the assignment of an audio object is determined according to the divergence of its source: if the divergence of the source is higher than X, then, provided the complexity constraint is met, the source is assigned to the ambisonic-domain branch; conversely, the source is assigned to the independent-object rendering branch. X is specified by the user.
As described above, the independent-object rendering method and the ambisonic method each have strengths and weaknesses in efficiency and effect when processing audio objects. The strength of the independent-object rendering method is precise localization; its weakness is that if the scene is complex and contains a large number of audio objects, complexity becomes very high, and for many audio playback terminals this causes excessive power consumption or even makes playback impossible. The strength of the ambisonic method is that its computational complexity stays essentially constant; its weakness is that it lacks flexibility in sound rendering and cannot control individual sources precisely.
The mixed-mode spatial sound generation method of the present invention therefore chooses between the independent-object rendering method and the ambisonic method, determining how many audio objects to assign to the independent-object rendering branch and how many to assign to the ambisonic-domain branch. For example, when precise localization is needed, as many audio objects as possible are assigned to the independent-object rendering branch, subject to the computational-complexity requirement; when the computational load is very large, more audio objects are assigned to the ambisonic-domain branch.
The mixed-mode spatial sound generation method of the present invention detects the number of audio objects and their metadata in either a static mode or a dynamic mode. The static mode detects the number of audio objects and their metadata only at the very beginning. Because the number of audio objects differs from moment to moment while the spatial sound is being generated, and environmental factors also change, the static mode is not the optimal solution, but its advantage is simplicity.
The dynamic mode adjusts over time how the audio objects are assigned between the two branches, the independent-object rendering branch and the ambisonic-domain branch. This can be done with fixed-interval sampling or non-fixed-interval sampling. Fixed-interval sampling detects the number of audio objects and their metadata once per fixed time period (for example, once per second). Non-fixed-interval sampling is based on the start times of the audio objects: the number of audio objects and their metadata are detected at the moment each new audio object begins and ends.
Although the present invention has been described in detail above with general explanations and specific embodiments, modifications and improvements based on the invention will be apparent to those skilled in the art. Accordingly, such modifications and improvements made without departing from the spirit of the invention belong to the scope of protection claimed by the invention.
Claims (6)
1. A mixed-mode spatial sound generation system, characterized in that the mixed-mode spatial sound generation system comprises a rendering control module, an ambisonic encoder, a binaural transcoder, and a headphone with head tracker; the rendering control module is connected to the ambisonic encoder and to the binaural transcoder respectively, the ambisonic encoder is connected to the binaural transcoder, and the headphone with head tracker is connected to the ambisonic encoder and to the binaural transcoder respectively; the rendering control module is configured to receive one or more audio objects and detect the number of audio objects; when the number of audio objects is greater than a first threshold A, the ambisonic-domain branch formed by the ambisonic encoder is activated, the audio objects are processed with the ambisonic method to obtain virtual surround spatial sound, which is passed to the ambisonic encoder, and the ambisonic encoder outputs the binaural virtual surround signal of the virtual surround spatial sound; otherwise the independent-object rendering branch formed by the binaural transcoder is activated, the audio objects are processed with the independent-object rendering method to obtain virtual surround spatial sound, and the binaural virtual surround signal of the virtual surround spatial sound is output; the rendering control module is further configured to detect the metadata of the audio objects, the metadata comprising the time, the position of the corresponding audio object in three-dimensional space, and additionally the divergence; the rendering control module determines the processing mode of an audio object according to its divergence: if the divergence of an audio object is greater than a second threshold B, the audio object is temporarily assigned to the ambisonic-domain branch; after the temporary assignment is finished, the computational complexity is calculated according to the current state of the audio object processing device, and whether to reassign audio objects is determined according to the computational complexity; the computational complexity is obtained by counting the execution cycles of the audio object processing device.
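The claim-1 routing logic can be sketched in a few lines. The thresholds A and B and the object fields (`id`, `divergence`) are assumptions made for the sketch, not names used by the patent.

```python
def route(objects, A, B):
    """Route audio objects per the claim-1 logic: if the object count
    exceeds threshold A, the ambisonic-domain branch is activated;
    otherwise the independent-object rendering branch. An object whose
    divergence exceeds threshold B is temporarily assigned to the
    ambisonic branch regardless of the active branch."""
    branch = "ambisonic" if len(objects) > A else "independent"
    routing = {}
    for obj in objects:
        if obj["divergence"] > B:
            routing[obj["id"]] = "ambisonic"  # temporary assignment
        else:
            routing[obj["id"]] = branch
    return branch, routing
```

A later complexity check (execution-cycle count, per the claim) would then decide whether any of the temporary assignments are revised.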
2. The mixed-mode spatial sound generation system of claim 1, characterized in that one ambisonic-domain branch is equivalent in complexity to T independent audio branches; when the computational complexity allows N audio objects and there are currently M audio objects, the independent-object rendering branch can handle 0 to N-T audio objects and the ambisonic-domain branch can handle M-N+T audio objects; if the number H of audio objects assigned to the independent-object rendering branch is less than N-T, then any number from 1 to N-T-H of the audio objects in the ambisonic-domain branch are reassigned to the independent-object rendering branch; N is greater than T, M is greater than 0, and H is greater than or equal to 0; if N is less than T, the independent-object rendering branch is used for all objects; if N is equal to T, then according to the divergence of the audio objects, either the ambisonic-domain branch is used for all objects or the independent-object rendering branch is used for all objects.
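The reassignment arithmetic of claim 2 can be written out as a sketch, using the claim's own variables N, M, T, and H. How many of the movable objects to actually move, and the divergence decision in the N = T case, are left open by the claim, so they appear here as assumptions.

```python
def reassign(N, M, T, H, move=None):
    """N: objects the complexity budget allows; M: current objects;
    T: independent branches equivalent to one ambisonic branch;
    H: objects currently on the independent-object rendering branch.

    Returns (independent_count, ambisonic_count) after reassignment.
    """
    if N < T:
        return M, 0          # all objects use the independent branch
    if N == T:
        # The claim decides by divergence here; this sketch simply
        # keeps everything in the ambisonic domain as one outcome.
        return 0, M
    capacity = N - T         # independent branch handles 0..N-T objects
    if H < capacity:
        movable = capacity - H            # up to N-T-H objects may move
        k = movable if move is None else min(move, movable)
        return H + k, M - (H + k)
    return H, M - H
```

For instance, with N = 10, T = 2, M = 8 and H = 3, the independent branch has capacity 8, so up to 5 objects can be moved back out of the ambisonic domain.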
3. The mixed-mode spatial sound generation system of claim 1, characterized in that the rendering control module determines the assignment of an audio object according to its divergence; if the divergence of an audio object is higher than X, the audio object is assigned to the ambisonic-domain branch provided the complexity requirement is met; otherwise the audio object is assigned to the independent-object rendering branch; X is specified by the user.
4. A mixed-mode spatial sound generation method, characterized in that the mixed-mode spatial sound generation method comprises the following steps:
inputting one or more audio objects;
detecting the number of audio objects; when the number of audio objects is greater than a first threshold A, activating the ambisonic-domain branch and processing the audio objects with the ambisonic method to obtain virtual surround spatial sound; otherwise activating the independent-object rendering branch and processing the audio objects with the independent-object rendering method to obtain virtual surround spatial sound;
after the temporary assignment is finished, calculating the computational complexity according to the current state of the audio object processing device, and determining whether to reassign audio objects according to the computational complexity; the computational complexity is obtained by counting the execution cycles of the audio object processing device;
when the computational complexity allows N audio objects and there are currently M audio objects, the independent-object rendering branch can handle 0 to N-T audio objects and the ambisonic-domain branch can handle M-N+T audio objects; if the number H of audio objects assigned to the independent-object rendering branch is less than N-T, then any number from 1 to N-T-H of the audio objects in the ambisonic-domain branch are reassigned to the independent-object rendering branch; N is greater than T, M is greater than 0, and H is greater than or equal to 0; if N is less than T, the independent-object rendering branch is used for all objects; if N is equal to T, then according to the divergence of the audio objects, either the ambisonic-domain branch is used for all objects or the independent-object rendering branch is used for all objects;
the mixed-mode spatial sound generation method further comprises detecting the metadata of the audio objects, the metadata comprising the time, the position of the corresponding audio object in three-dimensional space, and additionally the divergence of the audio object;
the mixed-mode spatial sound generation method further comprises determining the processing mode of an audio object according to its divergence: if the divergence of an audio object is greater than a second threshold B, the audio object is temporarily assigned to the ambisonic-domain branch.
5. The mixed-mode spatial sound generation method of claim 4, characterized in that the assignment of an audio object is determined according to the divergence of the sound source; if the divergence of the sound source is higher than X, the audio object is assigned to the ambisonic branch provided the complexity requirement is met; otherwise the audio object is assigned to the independent-source rendering branch; X is specified by the user.
6. The mixed-mode spatial sound generation method of claim 4 or 5, characterized in that the mixed-mode spatial sound generation method detects the number of audio objects and their metadata in either a static mode or a dynamic mode; the static mode detects the number of audio objects and their metadata only once, at the start; the dynamic mode adjusts over time how the audio objects are assigned between the two branches, the independent-object rendering branch and the ambisonic-domain branch; the dynamic mode uses either fixed-interval sampling or non-fixed-interval sampling; fixed-interval sampling detects the number of audio objects and their metadata once per fixed time period; non-fixed-interval sampling is based on the start times of the audio objects, detecting the number of audio objects and their metadata at the moment each new audio object begins and ends.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610268371.7A CN105959905B (en) | 2016-04-27 | 2016-04-27 | Mixed mode spatial sound generates System and method for |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105959905A CN105959905A (en) | 2016-09-21 |
CN105959905B true CN105959905B (en) | 2017-10-24 |
Family
ID=56915643
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610268371.7A Active CN105959905B (en) | 2016-04-27 | 2016-04-27 | Mixed mode spatial sound generates System and method for |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105959905B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109286889A (en) * | 2017-07-21 | 2019-01-29 | 华为技术有限公司 | A kind of audio-frequency processing method and device, terminal device |
US10469968B2 (en) * | 2017-10-12 | 2019-11-05 | Qualcomm Incorporated | Rendering for computer-mediated reality systems |
CN111508507B (en) * | 2019-01-31 | 2023-03-03 | 华为技术有限公司 | Audio signal processing method and device |
CN113873420B (en) * | 2021-09-28 | 2023-06-23 | 联想(北京)有限公司 | Audio data processing method and device |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9516446B2 (en) * | 2012-07-20 | 2016-12-06 | Qualcomm Incorporated | Scalable downmix design for object-based surround codec with cluster analysis by synthesis |
CN105191354B (en) * | 2013-05-16 | 2018-07-24 | 皇家飞利浦有限公司 | Apparatus for processing audio and its method |
US9747910B2 (en) * | 2014-09-26 | 2017-08-29 | Qualcomm Incorporated | Switching between predictive and non-predictive quantization techniques in a higher order ambisonics (HOA) framework |
CN105120421B (en) * | 2015-08-21 | 2017-06-30 | 北京时代拓灵科技有限公司 | A kind of method and apparatus for generating virtual surround sound |
CN105376690A (en) * | 2015-11-04 | 2016-03-02 | 北京时代拓灵科技有限公司 | Method and device of generating virtual surround sound |
- 2016-04-27: CN CN201610268371.7A patent/CN105959905B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN105959905A (en) | 2016-09-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105959905B (en) | Mixed mode spatial sound generates System and method for | |
CN105872940B (en) | A kind of virtual reality sound field generation method and system | |
CN102572676B (en) | A kind of real-time rendering method for virtual auditory environment | |
US6766028B1 (en) | Headtracked processing for headtracked playback of audio signals | |
WO2018196469A1 (en) | Method and apparatus for processing audio data of sound field | |
CN104284291B (en) | The earphone dynamic virtual playback method of 5.1 path surround sounds and realize device | |
CN104041081B (en) | Sound Field Control Device, Sound Field Control Method, Program, Sound Field Control System, And Server | |
JP7038725B2 (en) | Audio signal processing method and equipment | |
CN106210990B (en) | A kind of panorama sound audio processing method | |
CN107231600B (en) | The compensation method of frequency response and its electronic device | |
CN105376690A (en) | Method and device of generating virtual surround sound | |
CN105353868B (en) | A kind of information processing method and electronic equipment | |
WO2022021898A1 (en) | Audio processing method, apparatus, and system, and storage medium | |
EP3622730B1 (en) | Spatializing audio data based on analysis of incoming audio data | |
CN106454686A (en) | Multi-channel surround sound dynamic binaural replaying method based on body-sensing camera | |
CN106331977B (en) | A kind of virtual reality panorama acoustic processing method of network K songs | |
US11696087B2 (en) | Emphasis for audio spatialization | |
JP2021535632A (en) | Methods and equipment for processing audio signals | |
TW201735667A (en) | Method, equipment and apparatus for acquiring spatial audio direction vector | |
CN105682000B (en) | A kind of audio-frequency processing method and system | |
CN101184349A (en) | Three-dimensional ring sound effect technique aimed at dual-track earphone equipment | |
TW202105164A (en) | Audio rendering for low frequency effects | |
CN101155440A (en) | Three-dimensional around sound effect technology aiming at double-track audio signal | |
CN114049871A (en) | Audio processing method and device based on virtual space and computer equipment | |
WO2018072214A1 (en) | Mixed reality audio system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |