CN106060758A - Processing method for virtual reality sound field metadata - Google Patents

Processing method for virtual reality sound field metadata

Info

Publication number
CN106060758A
Authority
CN
China
Prior art keywords
audio object
point
moment
virtual reality
processing method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610391252.0A
Other languages
Chinese (zh)
Other versions
CN106060758B (en)
Inventor
杨洪旭
孙学京
张晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Tuoling Inc
Original Assignee
Beijing Tuoling Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Tuoling Inc filed Critical Beijing Tuoling Inc
Priority to CN201610391252.0A
Publication of CN106060758A
Application granted
Publication of CN106060758B
Status: Active
Anticipated expiration

Classifications

    • H — ELECTRICITY
    • H04 — ELECTRIC COMMUNICATION TECHNIQUE
    • H04S — STEREOPHONIC SYSTEMS
    • H04S5/00 — Pseudo-stereo systems, e.g. in which additional channel signals are derived from monophonic signals by means of phase shifting, time delay or reverberation
    • H04S5/005 — Pseudo-stereo systems, e.g. in which additional channel signals are derived from monophonic signals by means of phase shifting, time delay or reverberation, of the pseudo five- or more-channel type, e.g. virtual surround

Abstract

The invention discloses a processing method for virtual reality sound field metadata. The method comprises the following steps: determining the motion pattern of an audio object, and setting the motion pattern parameter m to 0 if the audio object moves linearly, or to 1 if it moves along a curve; when m = 0, representing the azimuth information of the audio object in a rectangular coordinate system and processing the movement trajectory of the audio object in the rectangular coordinate system; when m = 1, representing the azimuth information of the audio object in a polar coordinate system and processing the movement trajectory of the audio object in the polar coordinate system; and generating virtual surround sound according to the position information corresponding to the movement trajectory of the audio object. When handling a complex object that combines linear and curvilinear motion, the method represents the direction of motion and the trajectory of the audio object more faithfully. Immersion in virtual reality is improved, and the sound becomes more real, more concrete and more vivid.

Description

Processing method for virtual reality sound field metadata
Technical field
The present invention relates to the field of signal processing, and in particular to a processing method for virtual reality sound field metadata.
Background art
When content is presented to a user through a virtual reality head-mounted display (HMD), the audio content is played to the user through stereo headphones. This raises the problem of how to improve the virtual surround sound effect. In virtual reality applications, when audio content is played through stereo headphones, the goal of virtual 3D audio is to achieve an effect in which the user hears the sound as if listening to a loudspeaker array (such as 5.1 or 7.1), or even as if listening to real-world sound.
When producing virtual reality audio content, there are generally several sound sources distributed at different positions which, together with the environment, form a sound field. The user's head movements are tracked (head tracking) and the sound is processed accordingly. For example, if the original sound is perceived by the user as coming from directly ahead, then after the user turns the head 90 degrees to the left, the sound should be processed so that the user perceives it as coming from 90 degrees to the right.
The virtual reality device here can be of many types, for example a display device with head tracking, or simply a pair of stereo headphones with a head-tracking sensor.
There are also several ways to implement head tracking. A relatively common one is to use multiple sensors. A motion sensor suite usually includes an accelerometer, a gyroscope and a magnetometer. Each kind of sensor has its own inherent strengths and weaknesses in motion tracking and absolute orientation. Common practice is therefore to use sensor fusion, combining the signals from the individual sensors to produce a more accurate motion detection result.
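The patent does not name a specific fusion algorithm; purely as an illustration, a minimal complementary-filter sketch in Python (all names and the blend factor are assumptions) shows how gyroscope and magnetometer readings might be combined into one yaw estimate:

```python
# Illustrative only: a minimal complementary filter for head-yaw tracking.
# The patent does not prescribe any particular fusion algorithm; the function
# name, inputs and the 0.98 blend factor are assumptions made for this sketch.

def fuse_yaw(prev_yaw_deg, gyro_rate_dps, mag_heading_deg, dt_s, alpha=0.98):
    """Blend short-term gyroscope integration (accurate but drifting) with the
    long-term magnetometer heading (noisy but drift-free)."""
    gyro_yaw = prev_yaw_deg + gyro_rate_dps * dt_s          # integrate gyro rate
    return alpha * gyro_yaw + (1.0 - alpha) * mag_heading_deg

# Example: 10 ms update while the head turns left at 90 deg/s
yaw = fuse_yaw(prev_yaw_deg=0.0, gyro_rate_dps=90.0, mag_heading_deg=0.5, dt_s=0.01)
```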
After the head rotation angle has been obtained, the sound needs to be transformed accordingly. There are several methods for generating a virtual reality sound field. One approach is to filter each audio object, according to its position information, with the corresponding HRTF (Head Related Transfer Function) filter to obtain virtual surround sound. The time-domain counterpart of the HRTF is the HRIR (Head Related Impulse Response). Another approach is to convert the sound into a B-format signal in the ambisonic domain, convert this B-format signal into virtual loudspeaker array signals, and filter the virtual loudspeaker array signals with HRTF filters to obtain virtual surround sound.
It can be seen that both approaches require defining, designing and processing the azimuth information of audio objects in space. The usual practice is to represent a position in space in a rectangular coordinate system as (x, y, z). Assume the whole sound field space is represented by a cube with side length 2 (2 × 2 × 2), the listener's head is at the center of the cube, and the coordinates of this center point in the rectangular coordinate system are (0, 0, 0), as shown in Fig. 1. Assuming the coordinates of an audio object in this sound field are (xi, yi, zi), the following examples apply:
An audio object at the left edge of the sound field has coordinates (-1, 0, 0);
An audio object at the front edge of the sound field has coordinates (0, 1, 0);
An audio object at the top edge of the sound field has coordinates (0, 0, 1);
An audio object at the front-right edge of the sound field has coordinates (1, 1, 0);
As shown in Fig. 2, the three coordinate components must satisfy:
-1 ≤ xi ≤ 1
-1 ≤ yi ≤ 1
-1 ≤ zi ≤ 1
This way of defining audio object azimuth information is well suited to objects moving along straight-line trajectories. For example, if an object moves linearly from (x1, y1, z1) at time t1 to (x2, y2, z2) at time t2, the position Point_i of the object at any time ti between t1 and t2 can be expressed by linear interpolation as:
Point_i = Point_1 + (Point_2 - Point_1) × (ti - t1)/(t2 - t1);
where Point_i is defined as (xi, yi, zi); Point_1 as (x1, y1, z1); and Point_2 as (x2, y2, z2).
However, if the audio object moves along a curve, this metadata definition and interpolation method based on rectangular-coordinate azimuth information is no longer suitable. Suppose, for example, that an audio object makes one full revolution around the center point, with 4 azimuth points numbered 1, 2, 3 and 4 marked at front, right, back and left (see Fig. 3):
Front: Point_1 (0, 1, 0)
Right: Point_2 (1, 0, 0)
Back: Point_3 (0, -1, 0)
Left: Point_4 (-1, 0, 0)
The movement trajectory actually obtained by interpolation is not a circular motion but a polyline: the track is a rhombus, as shown in Fig. 3. For example, the interpolated midpoint between Point_1 and Point_2 is (0.5, 0.5, 0), whose distance from the center is √0.5 ≈ 0.707 instead of 1. This deviation of the movement trajectory distorts the virtual reality audio content and degrades the audio effect.
To solve this technical problem, the prior art takes relatively dense sampling points along the time dimension to approximate the curvilinear motion. The drawbacks of this are that the trajectory only approximates the curve and is not perfect, and that the dense sampling points make the metadata too large, occupying more bandwidth.
Summary of the invention
The object of the present invention is to provide a processing method for virtual reality sound field metadata that describes the curvilinear movement trajectory of an audio object in a virtual reality sound field, makes the metadata more efficient, makes the trajectory smoother when the audio object moves along a curve in space, and further adopts a hybrid (mixed) mode that is compatible with both linear and curvilinear motion of the audio object.
To achieve the above object, the processing method for virtual reality sound field metadata of the present invention comprises the following steps:
Determine the motion pattern of the audio object; if the motion pattern of the audio object is linear motion, set the motion pattern parameter m = 0; if the motion pattern of the audio object is curvilinear motion, set the motion pattern parameter m = 1;
When m = 0, represent the azimuth information of the audio object in a rectangular coordinate system, and process the movement trajectory of the audio object in the rectangular coordinate system as well;
When m = 1, represent the azimuth information of the audio object in a polar coordinate system, and process the movement trajectory of the audio object in the polar coordinate system as well;
Generate virtual surround sound according to the position information corresponding to the movement trajectory of the audio object.
When m = 0, the rectangular coordinates Point_1 of the audio object at time t1 and the rectangular coordinates Point_2 of the audio object at time t2 are obtained, and the rectangular coordinates corresponding to the audio object at time ti are denoted Point_i, where ti is any time between t1 and t2;
the specific position coordinates of Point_i are obtained by linear interpolation, the formula of which is:
Point_i = Point_1 + (Point_2 - Point_1) × (ti - t1)/(t2 - t1).
When m = 1, the polar coordinates Point'_1 of the audio object at time t1 and the polar coordinates Point'_2 of the audio object at time t2 are obtained, and the polar coordinates corresponding to the audio object at time ti are denoted Point'_i, where ti is any time between t1 and t2;
the specific position coordinates of Point'_i are obtained by linear interpolation, the formula of which is:
Point'_i = Point'_1 + (Point'_2 - Point'_1) × (ti - t1)/(t2 - t1).
The step of generating virtual surround sound according to the position information corresponding to the movement trajectory of the audio object specifically comprises:
if m = 1, obtaining the polar coordinates Point'_i corresponding to the audio object at time ti;
if m = 0, converting the rectangular coordinates Point_i corresponding to the audio object at time ti, obtained by linear interpolation, into the polar coordinates Point'_i corresponding to the audio object at time ti;
encoding the audio object into a high-order B-format signal according to the polar coordinates Point'_i corresponding to the audio object at time ti, and converting the B-format signal into virtual loudspeaker array signals;
transcoding the virtual loudspeaker array signals of the audio object into binaural signals based on binaural room impulse responses, obtaining the binaural virtual surround sound output of the audio object.
There may be one or more audio objects.
The binaural room impulse responses are generated off-line, using real measurements or generated by software.
The present invention has the following advantages: when processing a complex object in which linear motion and curvilinear motion coexist, the processing method for virtual reality sound field metadata of the present invention can represent the direction of motion and the trajectory of the audio object more faithfully. Immersion in virtual reality is improved, and the sound becomes more real, more concrete and more vivid; objects with complex movement trajectories also become simpler and faster to handle, improving work efficiency while improving immersion.
Brief description of the drawings
Fig. 1 is a schematic diagram of the sound field space represented in a rectangular coordinate system in the prior art.
Fig. 2 is a schematic diagram of the three coordinate components in the rectangular coordinate system shown in Fig. 1.
Fig. 3 is a schematic diagram of an audio object movement trajectory obtained by interpolation in a rectangular coordinate system in the prior art.
Fig. 4 is a schematic diagram of the sound field space represented in polar coordinates according to the present invention.
Fig. 5 is a schematic diagram of the three coordinate components in the polar coordinates shown in Fig. 4.
Fig. 6 is a schematic diagram of an audio object movement trajectory obtained by polar coordinate interpolation according to the present invention.
Detailed description of the invention
The following examples serve to illustrate the present invention but do not limit its scope.
As shown in Figs. 4-6, the present invention provides a processing method for virtual reality sound field metadata, comprising the following steps:
Determine the motion pattern of the audio object; if the motion pattern of the audio object is linear motion, set the motion pattern parameter m = 0; if the motion pattern of the audio object is curvilinear motion, set the motion pattern parameter m = 1;
When m = 0, represent the azimuth information of the audio object in a rectangular coordinate system, and process the movement trajectory of the audio object in the rectangular coordinate system as well;
When m = 1, represent the azimuth information of the audio object in a polar coordinate system, and process the movement trajectory of the audio object in the polar coordinate system as well;
Generate virtual surround sound according to the position information corresponding to the movement trajectory of the audio object.
When m = 0, the rectangular coordinates Point_1 of the audio object at time t1 and the rectangular coordinates Point_2 of the audio object at time t2 are obtained, and the rectangular coordinates corresponding to the audio object at time ti are denoted Point_i, where ti is any time between t1 and t2;
the specific position coordinates of Point_i are obtained by linear interpolation, the formula of which is:
Point_i = Point_1 + (Point_2 - Point_1) × (ti - t1)/(t2 - t1).
where the rectangular coordinates of Point_1 are defined as (x1, y1, z1), the rectangular coordinates of Point_2 as (x2, y2, z2), and the rectangular coordinates of Point_i as (xi, yi, zi).
When m = 1, the polar coordinates Point'_1 of the audio object at time t1 and the polar coordinates Point'_2 of the audio object at time t2 are obtained, and the polar coordinates corresponding to the audio object at time ti are denoted Point'_i, where ti is any time between t1 and t2;
the specific position coordinates of Point'_i are obtained by linear interpolation, the formula of which is:
Point'_i = Point'_1 + (Point'_2 - Point'_1) × (ti - t1)/(t2 - t1).
where the polar coordinates of Point'_1 are defined as (ρ1, θ1, φ1); the polar coordinates of Point'_2 as (ρ2, θ2, φ2); and the polar coordinates of Point'_i as (ρi, θi, φi).
In the polar coordinate system, the position of the sphere center is defined as (0, 0, 0). ρ1 is the distance of the audio object from the sphere center at time t1; θ1 is the horizontal angle of the audio object relative to the sphere center at time t1; φ1 is the elevation angle of the audio object relative to the sphere center at time t1. Likewise, ρ2, θ2 and φ2 are the distance, horizontal angle and elevation angle at time t2, and ρi, θi and φi are the distance, horizontal angle and elevation angle at time ti.
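Both interpolation branches can be expressed compactly. The following Python sketch is illustrative (function and variable names are our own; the patent specifies only the formula): the same componentwise linear interpolation is applied in either coordinate system, selected by the motion pattern parameter m:

```python
def interpolate_position(m, point_1, point_2, t1, t2, ti):
    """Interpolate an audio object position at time ti, t1 <= ti <= t2.

    m == 0: point_1 and point_2 are rectangular coordinates (x, y, z).
    m == 1: point_1 and point_2 are polar coordinates (rho, theta, phi).
    The linear interpolation formula is identical in both cases; only the
    coordinate system in which it is applied differs.
    """
    f = (ti - t1) / (t2 - t1)
    return tuple(p1 + (p2 - p1) * f for p1, p2 in zip(point_1, point_2))

# m = 1: quarter turn from front (1, 0, 0) to right (1, 90, 0), angles in degrees
mid = interpolate_position(1, (1.0, 0.0, 0.0), (1.0, 90.0, 0.0), t1=0.0, t2=1.0, ti=0.5)
# -> (1.0, 45.0, 0.0): rho stays 1, so the path stays on the circle
```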
The benefit of using polar coordinates is the following. Assume again that the audio object makes one full revolution around the center point, with 4 azimuth points marked in polar coordinates at front, right, back and left:
Front: Point_1 (1, 0, 0)
Right: Point_2 (1, 90, 0)
Back: Point_3 (1, 180, 0)
Left: Point_4 (1, 270, 0)
The movement trajectory obtained by interpolation is then a perfect circular motion, as shown in Fig. 6.
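To make the contrast with Fig. 3 concrete, a small illustrative computation (not part of the patent) interpolates the same quarter turn in both coordinate systems: rectangular interpolation pulls the midpoint inside the unit circle, while polar interpolation keeps the radius at exactly 1:

```python
import math

def lerp(p, q, f):
    return tuple(a + (b - a) * f for a, b in zip(p, q))

# Prior art: midpoint between "front" and "right" in rectangular coordinates
rect_mid = lerp((0.0, 1.0, 0.0), (1.0, 0.0, 0.0), 0.5)     # (0.5, 0.5, 0.0)
print(math.sqrt(sum(c * c for c in rect_mid)))              # 0.707...: inside the circle

# Present method: same midpoint in polar coordinates (rho, theta in degrees, phi)
rho, theta, phi = lerp((1.0, 0.0, 0.0), (1.0, 90.0, 0.0), 0.5)
print(rho, theta)                                           # 1.0 45.0: on the circle
```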
When processing a complex object in which linear motion and curvilinear motion coexist, the present invention can represent the direction of motion and the trajectory of the audio object more faithfully. Immersion in virtual reality is improved, and the sound becomes more real, more concrete and more vivid; objects with complex movement trajectories also become simpler and faster to handle, improving work efficiency while improving immersion.
The definition specification of the audio object metadata in the virtual reality sound field designed by the present invention (taking 2 audio objects as an example) is as follows:
The step of generating virtual surround sound according to the position information corresponding to the movement trajectory of the audio object specifically comprises:
if m = 1, obtaining the polar coordinates Point'_i corresponding to the audio object at time ti;
if m = 0, converting the rectangular coordinates Point_i corresponding to the audio object at time ti, obtained by linear interpolation, into the polar coordinates Point'_i corresponding to the audio object at time ti;
The formulas for converting rectangular coordinates (xi, yi, zi) into polar coordinates (ρi, θi, φi) are:
ρi = sqrt(xi² + yi² + zi²)
θi = arccos(zi / ρi)
φi = arctan(yi / sqrt(xi·xi + zi·zi))
where sqrt denotes the square-root operation.
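For illustration, these conversion formulas transcribe directly into Python (a sketch; atan2 replaces arctan, which is equivalent here because the denominator is non-negative):

```python
import math

def rect_to_polar(x, y, z):
    """Transcription of the conversion formulas above; assumes the object is
    not exactly at the sphere center (rho > 0). math.atan2 is used in place
    of arctan, which is equivalent here because sqrt(x*x + z*z) >= 0."""
    rho = math.sqrt(x * x + y * y + z * z)
    theta = math.acos(z / rho)                       # theta_i = arccos(zi / rho_i)
    phi = math.atan2(y, math.sqrt(x * x + z * z))    # phi_i = arctan(yi / sqrt(xi*xi + zi*zi))
    return rho, theta, phi

print(rect_to_polar(1.0, 1.0, 0.0))  # e.g. the front-right edge point (1, 1, 0)
```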
According to the polar coordinates Point'_i corresponding to the audio object at time ti, the audio object is encoded into a high-order (preferably 2nd-order or 3rd-order) B-format signal, and the B-format signal is converted into virtual loudspeaker array signals. Taking a first-order B-format signal [W1 X1 Y1 Z1]^T as an example (where X1 = cosθ1·cosφ1; Y1 = sinθ1·cosφ1; Z1 = sinφ1), its conversion into the virtual loudspeaker array signal [L1 L2 … LN]^T is the following operation:

[L1]   [Gw1 Gx1 Gy1 Gz1]   [W1]
[L2] = [Gw2 Gx2 Gy2 Gz2] · [X1]   = G · [W1 X1 Y1 Z1]^T
[…]    [ …   …   …   …  ]  [Y1]
[LN]   [GwN GxN GyN GzN]   [Z1]
where N is the number of virtual loudspeakers in the virtual loudspeaker topology, and the matrix G in the above formula is the ambisonic decoding matrix, which can be obtained by computing a pseudo-inverse matrix.
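As an illustrative sketch of how G can be obtained by a pseudo-inverse (the patent states only that a pseudo-inverse is used; the first-order encoding convention, the NumPy usage and the example loudspeaker layout are assumptions of this sketch):

```python
import numpy as np

def decoding_matrix(speaker_dirs_deg):
    """Compute a first-order ambisonic decoding matrix G by pseudo-inverse.

    speaker_dirs_deg: list of (azimuth, elevation) pairs in degrees, one per
    virtual loudspeaker. Each row of Y is the B-format encoding [W X Y Z] of a
    plane wave from that loudspeaker direction; G = pinv(Y^T) then satisfies
    [L1 .. LN]^T = G [W X Y Z]^T.
    """
    rows = []
    for az, el in speaker_dirs_deg:
        t, p = np.radians(az), np.radians(el)
        rows.append([1.0 / np.sqrt(2.0),        # W
                     np.cos(t) * np.cos(p),      # X
                     np.sin(t) * np.cos(p),      # Y
                     np.sin(p)])                 # Z
    Y = np.array(rows)                 # shape (N, 4)
    return np.linalg.pinv(Y.T)         # shape (N, 4)

# Hypothetical square layout of 4 virtual loudspeakers in the horizontal plane
G = decoding_matrix([(45.0, 0.0), (135.0, 0.0), (225.0, 0.0), (315.0, 0.0)])
# speaker_signals = G @ b_format, with b_format = [W, X, Y, Z]
```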
The virtual loudspeaker array signals of the audio object are transcoded into binaural signals based on binaural room impulse responses (BRIRs); the transcoding is generally 3-dimensional, i.e. it includes elevation information, and yields the binaural virtual surround sound output of the audio object. Specifically: a two-channel stereo BRIR matrix maps the virtual loudspeaker signals to the headphone signal, and multiplying this two-channel matrix with the virtual loudspeaker array signal gives the virtual surround sound. The BRIR matrix is

[BRIR_L1 BRIR_L2 … BRIR_LN]
[BRIR_R1 BRIR_R2 … BRIR_RN]

so the virtual surround sound is

[Lout]   [BRIR_L1 BRIR_L2 … BRIR_LN]   [L1]
[Rout] = [BRIR_R1 BRIR_R2 … BRIR_RN] · [L2]
                                        [… ]
                                        [LN]
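A minimal sketch of this transcoding step (illustrative only; the matrix multiplication described above is realized here in its time-domain form, a per-loudspeaker convolution followed by summation per ear):

```python
import numpy as np

def binaural_transcode(speaker_signals, brirs_left, brirs_right):
    """Convolve each virtual loudspeaker signal with the left/right BRIR for
    that loudspeaker position and sum per ear.

    speaker_signals: list of N equal-length 1-D arrays (one per loudspeaker).
    brirs_left, brirs_right: lists of N equal-length 1-D BRIR arrays.
    Returns the (left, right) binaural virtual surround sound output.
    """
    n_out = len(speaker_signals[0]) + len(brirs_left[0]) - 1
    left = np.zeros(n_out)
    right = np.zeros(n_out)
    for sig, h_l, h_r in zip(speaker_signals, brirs_left, brirs_right):
        left += np.convolve(sig, h_l)    # 'full' convolution, length n_out
        right += np.convolve(sig, h_r)
    return left, right
```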
There may be one or more audio objects.
The binaural room impulse responses are preferably generated off-line, either from real measurements or by dedicated software; therefore, unlike the online generation mode of the prior art, a large number of BRIRs need not be stored, which reduces memory consumption.
The following describes how an audio object is encoded into the ambisonic domain.
An audio object is encoded into a first-order ambisonic signal as follows:
W = (1/k) · Σ_{i=1..k} si · (1/√2);
X = (1/k) · Σ_{i=1..k} si · cosθi · cosφi;
Y = (1/k) · Σ_{i=1..k} si · sinθi · cosφi;
Z = (1/k) · Σ_{i=1..k} si · sinφi;
where si is the i-th audio object, i = 1 … k, and k is the number of audio objects; θi is the angle in the horizontal plane (the azimuth angle), and φi is the angle in the vertical direction (the elevation angle). The W channel signal represents the omnidirectional sound wave; the X, Y and Z channel signals represent the sound waves along the three orthogonal spatial directions X, Y and Z, respectively.
The first-order B-format signal is thus expressed as [W X Y Z]^T.
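These encoding equations transcribe directly into code; a minimal NumPy sketch (names are our own) for k audio objects:

```python
import numpy as np

def encode_first_order(objects, azimuths, elevations):
    """Encode k audio objects into first-order B-format per the equations above.

    objects: list of k equal-length sample arrays s_i.
    azimuths, elevations: theta_i and phi_i in radians for each object.
    Returns the averaged (W, X, Y, Z) channel signals.
    """
    k = len(objects)
    W = sum(s / np.sqrt(2.0) for s in objects) / k
    X = sum(s * np.cos(t) * np.cos(p) for s, t, p in zip(objects, azimuths, elevations)) / k
    Y = sum(s * np.sin(t) * np.cos(p) for s, t, p in zip(objects, azimuths, elevations)) / k
    Z = sum(s * np.sin(p) for s, p in zip(objects, elevations)) / k
    return W, X, Y, Z

# Example: one object directly in front (theta = 0, phi = 0)
W, X, Y, Z = encode_first_order([np.ones(4)], [0.0], [0.0])
```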
Similarly, encoding an audio object into a 2nd-order or 3rd-order B-format signal is preferably carried out according to the definitions in the table below:
For trigonometric functions in the above table that are even functions of the azimuth angle θ, the corresponding B-format signal components are left-right symmetric; for trigonometric functions that are odd functions of θ, the corresponding components are left-right antisymmetric. Taking the first-order B-format signal as an example, in terms of physical meaning and coordinates, w, x and z do not distinguish left from right; therefore, if the listener's position is symmetric and the corresponding HRTF coefficients are also approximately symmetric, the binaural output components corresponding to w, x and z are identical for the left and right output channels, whereas y is exactly opposite for left and right, so the binaural output component corresponding to y is inverted between the left and right channels. For components with such symmetry, a fast algorithm can be used, i.e. the symmetry is exploited in the computation, which further reduces the amount of computation.
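As an illustration of this fast algorithm (our own sketch, not code from the patent), the shared contribution of w, x and z is computed once and reused for both ears, with only the sign of the y contribution flipped:

```python
# Illustrative sketch of the symmetry shortcut for first-order rendering with
# (assumed) left-right symmetric HRTFs. Each input is the B-format channel
# already filtered with its left-ear HRTF; the right ear reuses the same
# filtered signals instead of being filtered separately.
def binaural_from_symmetric_components(bw, bx, by, bz):
    common = bw + bx + bz        # w, x, z contribute identically to both ears
    left = common + by
    right = common - by          # y contributes with opposite sign on the right
    return left, right
```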
The terms left, right, horizontal, vertical and other orientations referred to in the present invention are all defined from the perspective of the listener (i.e., the user).
Although the present invention has been described in detail with general descriptions and specific embodiments, modifications or improvements can be made on the basis of the present invention, which will be apparent to those skilled in the art. Such modifications or improvements, made without departing from the spirit of the present invention, fall within the scope of protection claimed by the present invention.

Claims (6)

1. A processing method for virtual reality sound field metadata, characterized in that the processing method for virtual reality sound field metadata comprises the following steps:
determining the motion pattern of an audio object; if the motion pattern of the audio object is linear motion, setting the motion pattern parameter m = 0; if the motion pattern of the audio object is curvilinear motion, setting the motion pattern parameter m = 1;
when m = 0, representing the azimuth information of the audio object in a rectangular coordinate system, and processing the movement trajectory of the audio object in the rectangular coordinate system as well;
when m = 1, representing the azimuth information of the audio object in a polar coordinate system, and processing the movement trajectory of the audio object in the polar coordinate system as well;
generating virtual surround sound according to the position information corresponding to the movement trajectory of the audio object.
2. The processing method for virtual reality sound field metadata according to claim 1, characterized in that, when m = 0, the rectangular coordinates Point_1 of the audio object at time t1 and the rectangular coordinates Point_2 of the audio object at time t2 are obtained, and the rectangular coordinates corresponding to the audio object at time ti are denoted Point_i, where ti is any time between t1 and t2;
the specific position coordinates of Point_i are obtained by linear interpolation, the formula of which is:
Point_i = Point_1 + (Point_2 - Point_1) × (ti - t1)/(t2 - t1).
3. The processing method for virtual reality sound field metadata according to claim 1, characterized in that, when m = 1, the polar coordinates Point'_1 of the audio object at time t1 and the polar coordinates Point'_2 of the audio object at time t2 are obtained, and the polar coordinates corresponding to the audio object at time ti are denoted Point'_i, where ti is any time between t1 and t2;
the specific position coordinates of Point'_i are obtained by linear interpolation, the formula of which is:
Point'_i = Point'_1 + (Point'_2 - Point'_1) × (ti - t1)/(t2 - t1).
4. The processing method for virtual reality sound field metadata according to claim 1, characterized in that generating virtual surround sound according to the position information corresponding to the movement trajectory of the audio object specifically comprises:
if m = 1, obtaining the polar coordinates Point'_i corresponding to the audio object at time ti;
if m = 0, converting the rectangular coordinates Point_i corresponding to the audio object at time ti, obtained by linear interpolation, into the polar coordinates Point'_i corresponding to the audio object at time ti;
encoding the audio object into a high-order B-format signal according to the polar coordinates Point'_i corresponding to the audio object at time ti, and converting the B-format signal into virtual loudspeaker array signals;
transcoding the virtual loudspeaker array signals of the audio object into binaural signals based on binaural room impulse responses, obtaining the binaural virtual surround sound output of the audio object.
5. The processing method for virtual reality sound field metadata according to claim 4, characterized in that there may be one or more audio objects.
6. The processing method for virtual reality sound field metadata according to claim 5, characterized in that the binaural room impulse responses are generated off-line, using real measurements or generated by software.
CN201610391252.0A 2016-06-03 2016-06-03 Processing method for virtual reality sound field metadata Active CN106060758B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610391252.0A CN106060758B (en) 2016-06-03 2016-06-03 Processing method for virtual reality sound field metadata

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610391252.0A CN106060758B (en) 2016-06-03 2016-06-03 Processing method for virtual reality sound field metadata

Publications (2)

Publication Number Publication Date
CN106060758A 2016-10-26
CN106060758B (en) 2018-03-23

Family

ID=57169737

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610391252.0A Active CN106060758B (en) 2016-06-03 2016-06-03 The processing method of virtual reality sound field metadata

Country Status (1)

Country Link
CN (1) CN106060758B (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101874414A (en) * 2007-10-30 2010-10-27 索尼克埃莫申股份公司 Method and device for improved sound field rendering accuracy within a preferred listening area
CN101521843A (en) * 2008-02-27 2009-09-02 索尼株式会社 Head-related transfer function convolution method and head-related transfer function convolution device
WO2013067712A1 (en) * 2011-11-12 2013-05-16 Liv Runchun Method for establishing 5.1 sound track on headset
CN105075293A (en) * 2013-03-29 2015-11-18 三星电子株式会社 Audio apparatus and audio providing method thereof
CN104284291A (en) * 2014-08-07 2015-01-14 华南理工大学 Headphone dynamic virtual replaying method based on 5.1 channel surround sound and implementation device thereof
CN105451152A (en) * 2015-11-02 2016-03-30 上海交通大学 Hearer-position-tracking-based real-time sound field reconstruction system and method
CN105376690A (en) * 2015-11-04 2016-03-02 北京时代拓灵科技有限公司 Method and device of generating virtual surround sound

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
郭龙 (Guo Long): "坐标系在大学物理运动学中作用的探究" [A study of the role of coordinate systems in kinematics in university physics], 《基础科学》 [Basic Science] *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106604200A (en) * 2017-01-19 2017-04-26 浪潮(苏州)金融技术服务有限公司 Audio data processing method and apparatus
GB2567244A (en) * 2017-10-09 2019-04-10 Nokia Technologies Oy Spatial audio signal processing
CN109462811A (en) * 2018-11-23 2019-03-12 武汉轻工大学 Sound field rebuilding method, equipment, storage medium and device based on non-central point
CN109462811B (en) * 2018-11-23 2020-11-17 武汉轻工大学 Sound field reconstruction method, device, storage medium and device based on non-central point
CN109618275A (en) * 2018-12-29 2019-04-12 武汉轻工大学 Non-central sound field rebuilding method, equipment, storage medium and the device of multichannel
CN109618275B (en) * 2018-12-29 2020-11-17 武汉轻工大学 Multi-channel non-center point sound field reconstruction method, equipment, storage medium and device
CN112584297A (en) * 2020-12-01 2021-03-30 中国电影科学技术研究所 Audio data processing method and device and electronic equipment
CN114020235A (en) * 2021-09-29 2022-02-08 北京城市网邻信息技术有限公司 Audio processing method in real scene space, electronic terminal and storage medium
CN114020235B (en) * 2021-09-29 2022-06-17 北京城市网邻信息技术有限公司 Audio processing method in live-action space, electronic terminal and storage medium

Also Published As

Publication number Publication date
CN106060758B (en) 2018-03-23

Similar Documents

Publication Publication Date Title
CN106060758B (en) Processing method for virtual reality sound field metadata
CN107205207B (en) A kind of virtual sound image approximation acquisition methods based on middle vertical plane characteristic
CN106210990B (en) A kind of panorama sound audio processing method
US20180220253A1 (en) Differential headtracking apparatus
CN102572676A (en) Real-time rendering method for virtual auditory environment
Mehra et al. Source and listener directivity for interactive wave-based sound propagation
CN105376690A (en) Method and device of generating virtual surround sound
CN106331977B (en) A kind of virtual reality panorama acoustic processing method of network K songs
CN106454686A (en) Multi-channel surround sound dynamic binaural replaying method based on body-sensing camera
CN105682000B (en) A kind of audio-frequency processing method and system
CN102333265B (en) Replay method of sound fields in three-dimensional local space based on continuous sound source concept
CN101184349A (en) Three-dimensional ring sound effect technique aimed at dual-track earphone equipment
Hogg et al. HRTF upsampling with a generative adversarial network using a gnomonic equiangular projection
Kaneko et al. Ear shape modeling for 3D audio and acoustic virtual reality: The shape-based average HRTF
Liu et al. A sound source localization method based on microphone array for mobile robot
Moldoveanu et al. TRAINING SYSTEM FOR IMPROVING SPATIAL SOUND LOCALIZATION.
CN109598785A (en) A kind of three-dimensional grid model view conversion method
Rumsey Binaural audio and virtual acoustics
CN113347530A (en) Panoramic audio processing method for panoramic camera
O'Dwyer et al. A machine learning approach to detecting sound-source elevation in adverse environments
Sakamoto et al. SENZI and ASURA: New high-precision sound-space sensing systems based on symmetrically arranged numerous microphones
Klein et al. Sonification of three-dimensional vector fields
Salvati et al. Sound spatialization control by means of acoustic source localization system
Wickert A Real-Time 3D Audio Simulator for Cognitive Hearing Science.
Geronazzo et al. Use of personalized binaural audio and interactive distance cues in an auditory goal-reaching task

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant