CN105682000A

CN105682000A - Audio processing method and system

Info

Publication number: CN105682000A
Application number: CN201610017000.1A
Authority: CN
Inventors: 张晨; 孙学京; 刘皓
Original assignee: Beijing Tuoling Inc
Current assignee: Beijing Tuoling Inc
Priority date: 2016-01-11
Filing date: 2016-01-11
Publication date: 2016-06-15
Anticipated expiration: 2036-01-11
Also published as: CN105682000B

Abstract

The invention relates to a cloud audio processing method, server and system. For audio signals in different formats, the method performs binaural transcoding on each format according to the head rotation angle reported by a client, generating a binaural audio signal for each format, and then superposes these binaural signals to obtain a binaural virtual surround sound output signal. Because the audio processing is carried out on a cloud server, the method is well suited to audio processing and storage on a cloud architecture, and it mitigates both the low quality of virtual surround sound generated on a mobile terminal and the heavy computational burden on that terminal. In addition, to address the delay that server-side processing may introduce, the method further includes a step of smoothing the angle.

Description

An audio processing method and system

Technical field

The present invention relates to the field of signal processing, and in particular to an audio processing method, server and system.

Background technology

When a virtual reality headset (head-mounted display, HMD) presents content to a user, virtual 3D audio technology is used to play audio content to the user through stereo headphones. One way to improve the sense of presence is to track the user's head movement (head tracking) and process the sound accordingly. For example, if the original sound is perceived as coming from straight ahead, then after the user turns the head 90 degrees to the left, the sound should be processed so that the user perceives it as coming from 90 degrees to the right. The virtual reality device may be of many types, such as a display device with head tracking or stereo headphones fitted with a head-tracking sensor.

Head tracking can also be implemented in several ways. A common approach is to use multiple motion sensors. A motion sensor suite usually includes an accelerometer, a gyroscope and a magnetometer. Each sensor has its own inherent strengths and weaknesses in tracking motion and absolute orientation. Common practice is therefore to use sensor fusion to combine the signals from the individual sensors and produce a more accurate motion detection result.

After the head rotation angle is obtained, the sound must be transformed accordingly. One approach is to convert the sound to the Ambisonic domain and then transform the signal with a rotation matrix. An Ambisonic signal usually has more than two channels, whereas common media players support only two-channel stereo, which makes it difficult to play Ambisonic or other multichannel audio signals directly.

In view of this, the field needs an effective, high-quality solution for generating and playing virtual surround sound.

Summary of the invention

To overcome the above drawbacks of the prior art, an object of the invention is to provide a cloud audio processing method, server and system that generate virtual surround sound effectively and with high quality. The invention is mainly intended for audio playback through stereo headphones used with a virtual reality headset. The virtual surround sound is generated on a cloud server, which fits existing cloud-based network architectures: the server performs the generation and storage of the virtual surround sound, solving the problem that existing clients cannot play the various kinds of 360° 3D audio, in particular audio for virtual reality applications.

To achieve the above object, the invention provides a cloud audio processing method comprising the following steps: obtaining the rotation angle of the user's head; obtaining audio signals in different formats and, according to the rotation angle, performing binaural transcoding on the audio signal of each format to generate a binaural audio signal for that format; and superposing the binaural signals of the respective formats to obtain a binaural virtual surround sound output signal.
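The final superposition step can be sketched as a sample-wise sum of the per-format binaural streams. This is a minimal sketch, assuming equal-length stereo streams held as lists of (left, right) pairs and unity mixing gains; the name `superpose_binaural` is hypothetical and does not come from the patent.

```python
from typing import List, Sequence, Tuple

Stereo = List[Tuple[float, float]]  # one (left, right) pair per sample


def superpose_binaural(streams: Sequence[Stereo]) -> Stereo:
    """Sum several binaural (left, right) streams sample by sample.

    Assumption: every stream has the same length and sample rate.
    """
    if not streams:
        return []
    length = len(streams[0])
    if any(len(s) != length for s in streams):
        raise ValueError("all streams must have the same length")
    mixed: Stereo = []
    for n in range(length):
        left = sum(s[n][0] for s in streams)
        right = sum(s[n][1] for s in streams)
        mixed.append((left, right))
    return mixed
```

In a real system the sum would typically be followed by gain control or limiting to avoid clipping; the patent text does not specify this.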

Preferably, the audio signals in different formats include a binaural recording signal, an Ambisonic recording signal and an audio object signal.

Preferably, performing binaural transcoding on the audio signals in different formats to generate binaural audio signals for the respective formats specifically includes:

for the binaural recording signal, interpolating according to the rotation angle to generate a binaural-recording binaural signal;

for the Ambisonic recording signal, adjusting the Ambisonic recording signal according to the rotation angle and binaurally transcoding the adjusted Ambisonic recording signal to generate an Ambisonic-recording binaural signal;

for the audio object signal, adjusting the audio object signal according to the rotation angle and binaurally transcoding the adjusted audio object signal to generate an audio-object binaural signal.

Preferably, if higher spatial accuracy is required, the audio object signal is rotated according to the rotation angle, the rotated audio object signal is encoded into a higher-order B-format audio object signal, a higher-order B-format audio object binaural signal is generated by binaural transcoding, and this signal is superposed with the Ambisonic-recording binaural signal and the binaural-recording binaural signal;

if low complexity and low latency are required, the audio object signal is encoded into a first-order B-format audio object signal and superposed with the other first-order Ambisonic recording signals, the resulting mixed signal is binaurally transcoded according to the rotation angle to generate a mixed binaural signal of the audio objects and the Ambisonic recording signals, and this mixed binaural signal is superposed with the binaural-recording binaural signal.

Preferably, obtaining the rotation angle of the user's head specifically comprises obtaining the rotation angle and smoothing it.

The invention also provides a cloud audio processing server comprising: an acquisition unit, which obtains the rotation angle of the user's head; a collection unit, which collects audio signals in different formats; a binaural transcoding unit, connected to the acquisition unit and the collection unit respectively, which performs binaural transcoding on the audio signal of each format according to the rotation angle to generate a binaural audio signal for that format; and a superposition unit, connected to the binaural transcoding unit, which superposes the binaural signals of the respective formats to obtain a binaural virtual surround sound output signal.

Preferably, the binaural transcoding unit performing binaural transcoding on the audio signals in different formats to generate binaural audio signals for the respective formats specifically includes:

Preferably, if higher spatial accuracy is required, the binaural transcoding unit rotates the audio object signal according to the rotation angle, encodes the rotated audio object signal into a higher-order B-format audio object signal, and generates a higher-order B-format audio object binaural signal by binaural transcoding; the superposition unit superposes the higher-order B-format audio object binaural signal generated by the binaural transcoding unit with the Ambisonic-recording binaural signal and the binaural-recording binaural signal;

if low complexity and low latency are required, the binaural transcoding unit encodes the audio object signal into a first-order B-format audio object signal, superposes it with the other first-order Ambisonic recording signals, and binaurally transcodes the resulting mixed signal according to the rotation angle to generate a mixed binaural signal of the audio objects and the Ambisonic recording signals; the superposition unit superposes the mixed binaural signal generated by the binaural transcoding unit with the binaural-recording binaural signal.

Preferably, the cloud server further includes a smoothing unit connected to the binaural transcoding unit and the acquisition unit respectively; the smoothing unit receives the rotation angle of the user's head from the acquisition unit and smooths it.

The invention also provides an audio playback system comprising a cloud audio processing server and a client. The client includes a head tracking device, which captures the head rotation angle and uploads it to the cloud audio processing server over a network; the cloud audio processing server receives the rotation angle, generates the binaural virtual surround sound output signal, and transmits it to the client over the network.

The cloud audio processing method, server and system according to the invention generate virtual surround sound effectively and with high quality. They are mainly intended for audio playback through stereo headphones used with a virtual reality headset. The virtual surround sound is generated on a cloud server, which fits existing cloud-based network architectures: the cloud server performs the audio processing and storage, solving the problem that existing clients cannot play the various kinds of 360° 3D audio, in particular audio for virtual reality applications.

With the cloud audio processing technique of the invention, the sense of presence in multi-party voice communication can be greatly improved: users can turn their heads freely to focus on sound from a particular direction, which more closely approaches a real multi-person conversation. In streaming media scenarios in particular, adjusting the spatial orientation of the audio in real time can improve the user's audio experience, and when the audio accompanies virtual reality video content the user experience improves further.

Brief description of the drawings

Fig. 1 is a schematic block diagram of one embodiment of the cloud audio processing method of the invention;

Figs. 2a-c are schematic block diagrams of another embodiment of the cloud audio processing method of the invention;

Fig. 3 is a structural diagram of an embodiment of the audio processing server of the invention;

Fig. 4 is a structural diagram of another embodiment of the audio processing system of the invention.

Detailed description of the invention

Embodiment one: as shown in Fig. 1, processing an audio object includes the following steps:

obtaining the user's head rotation angle through a head tracking device;

according to the rotation angle, encoding the audio object into a higher-order (preferably 2nd- or 3rd-order) Ambisonic B-format signal;

converting the Ambisonic B-format signal into a virtual speaker array signal. Taking a first-order B-format signal [W₁ X₁ Y₁ Z₁]^T as an example, it is converted into the virtual speaker array signal [L₁ L₂ … L_N]^T by the following operation:

\begin{bmatrix} L_{1} \\ L_{2} \\ \vdots \\ L_{N} \end{bmatrix} = \begin{bmatrix} G_{w1} & G_{x1} & G_{y1} & G_{z1} \\ G_{w2} & G_{x2} & G_{y2} & G_{z2} \\ \vdots & \vdots & \vdots & \vdots \\ G_{wN} & G_{xN} & G_{yN} & G_{zN} \end{bmatrix} \begin{bmatrix} W_{1} \\ X_{1} \\ Y_{1} \\ Z_{1} \end{bmatrix} = G \begin{bmatrix} W_{1} \\ X_{1} \\ Y_{1} \\ Z_{1} \end{bmatrix} .

Here, N is the number of virtual speakers in the virtual speaker layout. The matrix G in the above formula is the Ambisonic decoding matrix, which can be obtained by computing a pseudo-inverse.
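One way to make the pseudo-inverse concrete is sketched below, under stated assumptions. Let C (a symbol introduced here for illustration only) be the 4×N matrix whose n-th column is the first-order encoding vector of virtual speaker n, located at azimuth θ_n and elevation φ_n, using the same encoding convention as the first-order encoding formulas of this description. If C has full row rank, then

C = \begin{bmatrix} \frac{1}{\sqrt{2}} & \cdots & \frac{1}{\sqrt{2}} \\ \cos\theta_{1}\cos\varphi_{1} & \cdots & \cos\theta_{N}\cos\varphi_{N} \\ \sin\theta_{1}\cos\varphi_{1} & \cdots & \sin\theta_{N}\cos\varphi_{N} \\ \sin\varphi_{1} & \cdots & \sin\varphi_{N} \end{bmatrix}, \qquad G = C^{+} = C^{T} (C C^{T})^{-1}, \qquad C G = I_{4},

so re-encoding the decoded speaker feeds reproduces the original B-format signal. The text does not state which pseudo-inverse construction is used; this is one common choice.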

The virtual speaker array signal of the audio object is binaurally transcoded based on binaural room impulse responses (BRIR) (usually in 3 dimensions, that is, including height information) to obtain the binaural virtual surround sound output signal of the audio object. Specifically, the virtual speaker signals are converted to headphone signals by a two-row stereo BRIR matrix: this matrix is multiplied by the virtual speaker array signal to obtain the virtual surround sound.

The BRIR matrix is

\begin{bmatrix} B_{1L} & B_{2L} & \cdots & B_{NL} \\ B_{1R} & B_{2R} & \cdots & B_{NR} \end{bmatrix},

and the virtual surround sound is then

\begin{bmatrix} L \\ R \end{bmatrix} = \begin{bmatrix} B_{1L} & B_{2L} & \cdots & B_{NL} \\ B_{1R} & B_{2R} & \cdots & B_{NR} \end{bmatrix} \begin{bmatrix} L_{1} \\ L_{2} \\ \vdots \\ L_{N} \end{bmatrix} = \begin{bmatrix} F_{WL} & F_{XL} & F_{YL} & F_{ZL} \\ F_{WR} & F_{XR} & F_{YR} & F_{ZR} \end{bmatrix} \begin{bmatrix} W_{1} \\ X_{1} \\ Y_{1} \\ Z_{1} \end{bmatrix} .
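The last equality says that the 2×N BRIR matrix and the N×4 decoding matrix G can be combined once into a 2×4 matrix F, so that each B-format frame needs only one small matrix product. The sketch below illustrates this with scalar gains; this is a simplifying assumption, since in practice each BRIR entry is a filter and the products are convolutions. The function names and the numeric values are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x k) and b (k x n)."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def combine_decoder_and_brir(brir: Matrix, decoder: Matrix) -> Matrix:
    """Pre-multiply the 2 x N BRIR matrix by the N x 4 decoder G to get the 2 x 4 matrix F."""
    return matmul(brir, decoder)


# Illustrative values for N = 2 virtual speakers (not taken from the patent).
G = [[0.5, 0.5, 0.5, 0.0],
     [0.5, -0.5, -0.5, 0.0]]
B = [[1.0, 0.25],
     [0.25, 1.0]]
F = combine_decoder_and_brir(B, G)

# One B-format frame [W, X, Y, Z]^T as a column vector.
frame = [[1.0], [0.5], [0.5], [0.0]]
two_step = matmul(B, matmul(G, frame))  # decode to speakers, then apply the BRIR matrix
one_step = matmul(F, frame)             # apply the precomputed matrix F directly
```

Both paths give the same left and right output; precomputing F simply moves the speaker-count-dependent work out of the per-frame loop.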

There may be one or more of the audio signals described above.

The binaural room impulse responses are preferably generated offline, either by real measurement or with dedicated software. Unlike the online generation used in the prior art, this avoids the need to store a large number of BRIRs and reduces memory consumption.

When audio objects are encoded into an Ambisonic B-format signal, the horizontal order is preferably greater than or equal to the vertical order. For example, when the horizontal encoding is preferably a 3rd-order Ambisonic B-format signal, the vertical encoding is preferably a 2nd- or 1st-order Ambisonic B-format signal, denoted H3V2 and H3V1 respectively. Because human perceptual resolution is lower for height than for angles in the horizontal plane, reducing the order in a particular direction in this way reduces the amount of computation without significantly degrading the user's perception of the sound.

Processing the sound field signal and the ambient sound comprises the following steps:

The ambient sound is converted into a binaural virtual surround sound output signal of the ambient sound, and the binaural virtual surround sound output signals of the audio object (here the audio object mainly refers to sound content other than the ambient sound) and of the ambient sound are then mixed, ear by ear, for output. Fig. 1 shows a schematic block diagram of one embodiment of the method. Converting the ambient sound (the sound field signal in Fig. 1) into its binaural virtual surround sound output signal preferably includes the following steps:

obtaining a 1st-order Ambisonic B-format signal of the ambient sound;

according to the rotation angle, rotating the Ambisonic B-format signal of the ambient sound to obtain a rotated Ambisonic B-format signal. Specifically, a rotation matrix is generated from the rotation angle, and the Ambisonic B-format signal of the ambient sound (the signal to be adjusted) is rotated with that matrix. Rotation here means multiplying the signal matrix to be adjusted by the rotation matrix; it does not change the magnitude of the audio signal components, only their direction. The order of the rotation matrix matches that of the audio signal matrix. For example, when the signal matrix to be adjusted is [W₂ X₂ Y₂]^T, the rotation matrix is

\begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos(\theta) & -\sin(\theta) \\ 0 & \sin(\theta) & \cos(\theta) \end{bmatrix};

when the signal matrix to be adjusted is [W₂ X₂ Y₂ Z₂]^T, the rotation matrix is

\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos(\theta) & -\sin(\theta) & 0 \\ 0 & \sin(\theta) & \cos(\theta) & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} .
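The four-channel rotation above leaves W and Z untouched and applies a 2D rotation to X and Y. A minimal sketch for a single sample follows; the function name `rotate_bformat` is hypothetical, and the sign with which a measured head angle maps to θ (for example θ equal to the negative head yaw, to compensate for head movement) is an assumption that depends on the tracker's convention.

```python
import math
from typing import Tuple


def rotate_bformat(w: float, x: float, y: float, z: float,
                   theta: float) -> Tuple[float, float, float, float]:
    """Rotate one first-order B-format sample about the vertical axis by theta radians.

    Implements the 4 x 4 rotation matrix from the text: W and Z are unchanged,
    X and Y are mixed by cos(theta) and sin(theta).
    """
    c, s = math.cos(theta), math.sin(theta)
    return (w, c * x - s * y, s * x + c * y, z)


# A source encoded straight ahead (X = 1, Y = 0), rotated by 90 degrees,
# ends up on the Y axis; W and Z pass through unchanged.
w2, x2, y2, z2 = rotate_bformat(0.7, 1.0, 0.0, 0.2, math.pi / 2)
```

Because the X and Y components are only rotated, x² + y² is preserved, which matches the statement that rotation changes direction but not magnitude.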

converting the rotated Ambisonic B-format signal of the ambient sound into a virtual speaker array signal; and binaurally transcoding the virtual speaker array signal of the ambient sound based on head-related transfer functions (HRTF) (usually in 2 dimensions, that is, without height information) to obtain the binaural virtual surround sound output signal of the ambient sound. The time-domain counterpart of the HRTF is the HRIR (head-related impulse response).

It should be noted that either BRIR or HRIR may be used to filter the audio object or the ambient sound as required. A BRIR generally consists of a room model together with a set of HRIR/HRTF describing the source directions, so if the input signal already carries room or environment information, HRIR alone is sufficient.

The method of generating virtual surround sound is preferably implemented under the following assumptions: the virtual speaker array is left-right symmetric, the user is on the central axis of the room, and the binaural room impulse responses and head-related transfer functions corresponding to the user are also left-right symmetric. Under these assumptions, a symmetry optimization for higher-order Ambisonic B-format can be used, which substantially reduces the amount of computation and improves efficiency.

The following describes how audio objects are encoded into the Ambisonic domain.

Audio objects are encoded into a first-order Ambisonic signal as follows:

W = \frac{1}{k} \sum_{i=1}^{k} s_{i} \left[ \frac{1}{\sqrt{2}} \right];

X = \frac{1}{k} \sum_{i=1}^{k} s_{i} \left[ \cos\theta_{i} \cos\varphi_{i} \right];

Y = \frac{1}{k} \sum_{i=1}^{k} s_{i} \left[ \sin\theta_{i} \cos\varphi_{i} \right];

Z = \frac{1}{k} \sum_{i=1}^{k} s_{i} \left[ \sin\varphi_{i} \right];

Here s_i is the i-th audio object, i = 1..k, and k is the number of audio objects. θ_i is the angle in the horizontal plane (azimuth), and φ_i is the angle in the vertical direction (elevation). The W channel signal represents the omnidirectional sound wave, and the X, Y and Z channel signals represent the sound waves along the three mutually orthogonal spatial directions X, Y and Z respectively.

The first-order Ambisonic B-format signal is expressed as

\begin{bmatrix} W_{1} \\ X_{1} \\ Y_{1} \\ Z_{1} \end{bmatrix} .
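The first-order encoding formulas can be sketched directly in code. The sketch below encodes one sample per audio object into one W, X, Y, Z sample, with angles in radians; the function name `encode_first_order` and the per-sample interface are assumptions made for illustration.

```python
import math
from typing import Sequence, Tuple


def encode_first_order(samples: Sequence[float],
                       azimuths: Sequence[float],
                       elevations: Sequence[float]
                       ) -> Tuple[float, float, float, float]:
    """Encode k audio-object samples into one first-order B-format sample.

    samples[i] is s_i, azimuths[i] is theta_i and elevations[i] is phi_i,
    following the W, X, Y, Z formulas in the text (including the 1/k factor).
    """
    k = len(samples)
    if k == 0 or not (k == len(azimuths) == len(elevations)):
        raise ValueError("need k >= 1 objects with matching angle lists")
    w = sum(s / math.sqrt(2.0) for s in samples) / k
    x = sum(s * math.cos(t) * math.cos(p)
            for s, t, p in zip(samples, azimuths, elevations)) / k
    y = sum(s * math.sin(t) * math.cos(p)
            for s, t, p in zip(samples, azimuths, elevations)) / k
    z = sum(s * math.sin(p) for s, p in zip(samples, elevations)) / k
    return (w, x, y, z)


# A single object straight ahead contributes only to W and X.
front = encode_first_order([1.0], [0.0], [0.0])
# Two equal objects at +90 and -90 degrees azimuth cancel in Y.
pair = encode_first_order([1.0, 1.0], [math.pi / 2, -math.pi / 2], [0.0, 0.0])
```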

Similarly, audio objects are preferably encoded into 2nd- or 3rd-order Ambisonic B-format signals according to the definitions in the following table:

If a trigonometric function in the table above is an even function of the azimuth θ, the corresponding component of the Ambisonic B-format signal is left-right symmetric; if it is an odd function of θ, the corresponding component is left-right antisymmetric. For a first-order Ambisonic B-format signal, it follows from the physical meaning and the coordinates that w, x and z do not distinguish left from right. Therefore, if the listening position is symmetric and the corresponding HRTF coefficients are assumed to be approximately symmetric as well, the binaural output components corresponding to w, x and z are identical for the left and right output channels. The y component, by contrast, is reversed between left and right, so its binaural output components are opposite for the left and right channels. For components with such symmetry, a fast algorithm (symmetry optimization) can be used in the computation to further reduce the amount of computation.
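For the first-order case, the symmetry argument can be sketched as follows: the w, x and z contributions are computed once and shared by both ears, while the y contribution is added for one ear and subtracted for the other. The sketch uses scalar gains `f_w`, `f_x`, `f_y`, `f_z` in place of filters, assigns the positive y contribution to the left ear, and uses hypothetical function names; all of these are assumptions for illustration.

```python
from typing import Tuple


def binaural_symmetric(w: float, x: float, y: float, z: float,
                       f_w: float, f_x: float, f_y: float, f_z: float
                       ) -> Tuple[float, float]:
    """Return (left, right) using 4 products instead of 8.

    Assumes left-right symmetric gains: the W, X, Z gains are equal for both
    ears and the Y gain has opposite sign for the right ear.
    """
    symmetric = f_w * w + f_x * x + f_z * z   # shared by both ears
    antisymmetric = f_y * y                   # flips sign between ears
    return (symmetric + antisymmetric, symmetric - antisymmetric)


def binaural_full(w: float, x: float, y: float, z: float,
                  f_w: float, f_x: float, f_y: float, f_z: float
                  ) -> Tuple[float, float]:
    """Reference version: evaluate both rows of the 2 x 4 matrix F explicitly."""
    left = f_w * w + f_x * x + f_y * y + f_z * z
    right = f_w * w + f_x * x - f_y * y + f_z * z
    return (left, right)
```

Both functions give the same output under the symmetry assumption; with real filters the saving applies to convolutions rather than scalar products.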

In addition, because processing the audio on the server may introduce delay, the solution adopted is to obtain the rotation angle of the user's head and smooth it. Small angle changes then do not trigger processing for a new rotation direction, which effectively mitigates the delay of server-side processing.
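The text does not specify the smoothing algorithm. One plausible sketch, under stated assumptions, combines exponential smoothing with a dead band so that small angle changes do not trigger a new rendering direction. The class name `AngleSmoother`, the parameters `alpha` and `threshold`, and their default values are all hypothetical.

```python
import math
from typing import Optional


def wrap_angle(a: float) -> float:
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (a + math.pi) % (2.0 * math.pi) - math.pi


class AngleSmoother:
    """Exponential smoothing plus a dead band on the head rotation angle."""

    def __init__(self, alpha: float = 0.5,
                 threshold: float = math.radians(2.0)) -> None:
        self.alpha = alpha              # smoothing factor in (0, 1]
        self.threshold = threshold      # minimum change that triggers an update
        self.smoothed: Optional[float] = None  # running smoothed angle
        self.applied: Optional[float] = None   # angle currently used for rendering

    def update(self, measured: float) -> float:
        """Feed one measured angle (radians); return the angle to render with."""
        if self.smoothed is None or self.applied is None:
            self.smoothed = self.applied = wrap_angle(measured)
            return self.applied
        # Smooth along the shortest arc so wrap-around at +/-pi is handled.
        delta = wrap_angle(measured - self.smoothed)
        self.smoothed = wrap_angle(self.smoothed + self.alpha * delta)
        # Only adopt the new angle when it differs enough from the applied one.
        if abs(wrap_angle(self.smoothed - self.applied)) >= self.threshold:
            self.applied = self.smoothed
        return self.applied
```

A larger `threshold` reduces how often the server must re-render, at the cost of coarser tracking; the right trade-off depends on the network and rendering latency.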

Embodiment two:

Figs. 2a-c describe an embodiment that improves the immersive experience based on cloud multichannel audio transmission. It should be noted that the invention covers two application scenarios: (1) real-time audio communication (conference scenario), as shown in Fig. 2b; and (2) audio download, as shown in Fig. 2c.

For both scenarios, the input takes three forms: independent audio objects, sound field input (wxy format) and binaural recording signals.

As shown in Fig. 2b, for the audio download scenario:

The storage server stores binaural recording signals, Ambisonic recording signals (sound field signals) and/or audio objects, and the binaural transcoding server obtains these signals from the storage server. On the binaural transcoding server, the audio objects are converted into an Ambisonic signal, for example a first-order horizontal B-format signal (wxy), and added to the other wxy signals (sound field signals). The binaural transcoding server rotates the wxy signal with a rotation matrix according to the angle sent by the client's head tracking device, converts the wxy signal to two channels, and then superposes the result with the binaural-recording binaural signal to generate the audio file for download. Compression is usually required to reduce the transmission bandwidth; the client then downloads the compressed two-channel audio. This approach is more efficient, but its drawback is that spatial localization resolution is reduced when the audio objects use only first-order B-format. The cloud-based approach is preferred; however, if binaural processing is instead placed on the client, the client downloads the wxy signal from the server and the rotation no longer needs to go through the server.

If higher spatial accuracy is required, the binaural transcoding server first rotates the audio objects according to the rotation angle, encodes the rotated audio object signals into higher-order B-format (for example 3rd order), and superposes them with the other B-format signals in the two-channel domain: after binaural transcoding, a higher-order B-format audio object binaural signal is generated and superposed with the Ambisonic-recording binaural signal and the binaural-recording binaural signal to generate the audio file.

It should be noted here that head tracking is only one form of input; other motion parameters, such as waving, are not excluded, and the invention applies to them equally.

As shown in Fig. 2c, for real-time audio communication (conference scenario):

The binaural transcoding server obtains signals directly from a binaural recording microphone array, an Ambisonic microphone array, and independent sound sources or audio objects, and performs on them processing similar to that described above.

Embodiment three:

As shown in Fig. 3, a cloud audio processing server comprises: an acquisition unit, which obtains the rotation angle of the user's head sent by the head tracking device in the client; a collection unit, which collects the binaural recording signal, the Ambisonic recording signal and the audio objects respectively; and a binaural transcoding unit, connected to the acquisition unit and the collection unit respectively, which performs binaural transcoding on the audio signal of each format according to the rotation angle, wherein the binaural recording signal is interpolated according to the rotation angle to generate a binaural-recording binaural signal. If higher spatial accuracy is required, the binaural transcoding unit rotates the audio object signal according to the rotation angle, encodes the rotated audio object signal into a higher-order B-format audio object signal, and generates a higher-order B-format audio object binaural signal by binaural transcoding; the superposition unit superposes the higher-order B-format audio object binaural signal generated by the binaural transcoding unit with the Ambisonic-recording binaural signal and the binaural-recording binaural signal;

if low complexity and low latency are required, the binaural transcoding unit encodes the audio object signal into a first-order B-format audio object signal, superposes it with the other first-order Ambisonic recording signals, and binaurally transcodes the resulting mixed signal according to the rotation angle to generate a mixed binaural signal of the audio objects and the Ambisonic recording signals; the superposition unit superposes the mixed binaural signal generated by the binaural transcoding unit with the binaural-recording binaural signal to obtain the binaural virtual surround sound output signal.

This embodiment uses a cloud server to solve the problem of transmitting and playing multichannel audio with head-tracking support.

Embodiment four:

As shown in Fig. 4, an audio processing system of the invention mainly comprises a client, a storage server and a cloud audio processing server. The client includes a head tracking module, and the storage server holds multitrack audio files stored in a specific manner. The client's head tracking module obtains the user's head movement, such as the head rotation angle, and uploads the parameters over the Internet to one or more cloud audio processing servers on the server side, which process the multitrack audio files accordingly: the cloud audio processing server extracts the audio signals in different formats from the storage server, generates the binaural virtual surround sound output signal according to the received rotation angle, and transmits the binaurally transcoded audio file to the client over the network.

The client downloads the processed audio file and preferably plays it in two-channel stereo format.

The preferred embodiments of the invention have been described in detail above with reference to the accompanying drawings. However, the invention is not limited to the specific details of the above embodiments; within the technical concept of the invention, many simple variations of the technical solution are possible, and all such variations fall within the scope of protection of the invention.

It should further be noted that the specific technical features described in the above embodiments may be combined in any suitable manner, provided they do not contradict one another. To avoid unnecessary repetition, the various possible combinations are not described separately.

In addition, the different embodiments of the invention may also be combined in any way, and such combinations should likewise be regarded as content disclosed by the invention as long as they do not depart from its idea.

Claims

1. A cloud audio processing method, characterized in that the audio processing method comprises the following steps:

obtaining the rotation angle of the user's head;

obtaining audio signals in different formats and, according to the rotation angle, performing binaural transcoding on the audio signal of each format to generate a binaural audio signal for that format;

superposing the binaural signals of the respective formats to obtain a binaural virtual surround sound output signal.

2. The cloud audio processing method according to claim 1, characterized in that:

the audio signals in different formats include a binaural recording signal, an Ambisonic recording signal and an audio object signal.

3. The cloud audio processing method according to claim 2, characterized in that:

performing binaural transcoding on the audio signals in different formats to generate binaural audio signals for the respective formats specifically includes:

4. The cloud audio processing method according to claim 3, characterized in that:

if higher spatial accuracy is required, the audio object signal is rotated according to the rotation angle, the rotated audio object signal is encoded into a higher-order B-format audio object signal, a higher-order B-format audio object binaural signal is generated by binaural transcoding, and this signal is superposed with the Ambisonic-recording binaural signal and the binaural-recording binaural signal;

5. The cloud audio processing method according to any one of claims 1-4, characterized in that:

obtaining the rotation angle of the user's head specifically comprises obtaining the rotation angle and smoothing it.

6. A cloud audio processing server, characterized in that the server comprises:

an acquisition unit, which obtains the rotation angle of the user's head;

a collection unit, which collects audio signals in different formats;

a binaural transcoding unit, connected to the acquisition unit and the collection unit respectively, which performs binaural transcoding on the audio signal of each format according to the rotation angle to generate a binaural audio signal for that format;

a superposition unit, connected to the binaural transcoding unit, which superposes the binaural signals of the respective formats to obtain a binaural virtual surround sound output signal.

7. The cloud audio processing server according to claim 6, characterized in that:

8. The cloud audio processing server according to claim 7, characterized in that:

the binaural transcoding unit performing binaural transcoding on the audio signals in different formats to generate binaural audio signals for the respective formats specifically includes:

9. The cloud audio processing server according to claim 8, characterized in that:

if higher spatial accuracy is required, the binaural transcoding unit rotates the audio object signal according to the rotation angle, encodes the rotated audio object signal into a higher-order B-format audio object signal, and generates a higher-order B-format audio object binaural signal by binaural transcoding; the superposition unit superposes the higher-order B-format audio object binaural signal generated by the binaural transcoding unit with the Ambisonic-recording binaural signal and the binaural-recording binaural signal;

10. The cloud audio processing server according to any one of claims 6-9, characterized in that:

the cloud server further includes a smoothing unit connected to the binaural transcoding unit and the acquisition unit respectively; the smoothing unit receives the rotation angle of the user's head from the acquisition unit and smooths it.

11. An audio playback system, characterized in that: the system comprises the cloud audio processing server according to any one of claims 6-10 and a client; the client includes a head tracking device, which captures the head rotation angle and uploads it to the cloud audio processing server over a network; the cloud audio processing server obtains audio signals in different formats, generates the binaural virtual surround sound output signal according to the rotation angle, and transmits it to the client over the network.

12. The audio playback system according to claim 11, characterized in that: the system further comprises a storage server that stores the audio signals in different formats; when a user requests a download, the cloud audio processing server extracts the audio signals from the storage server.