CN105895108B

CN105895108B - Panoramic sound processing method

Info

Publication number: CN105895108B
Application number: CN201610157032.1A
Authority: CN
Inventors: 潘兴德; 吴超刚
Original assignee: NANJING QINGJIN INFORMATION TECHNOLOGY Co Ltd
Current assignee: Panorama Sound (Beijing) Intelligent Technology Co.,Ltd.
Priority date: 2016-03-18
Filing date: 2016-03-18
Publication date: 2020-01-24
Anticipated expiration: 2036-03-18
Also published as: CN105895108A

Abstract

The invention discloses a panoramic sound processing method comprising the following steps: acquiring a sound object of a sound field space; establishing a three-dimensional coordinate system with the monitoring point as the origin, and determining the three-dimensional coordinate value of the sound object; dividing the three-dimensional coordinate values of the sound object into a reference block and a prediction block in time order; directly coding the three-dimensional coordinate value of the reference block, and differentially coding the three-dimensional coordinate value of the prediction block; and determining the effective action area of the sound object according to its three-dimensional coordinate value before encoding or after decoding. The invention provides methods for representing the coordinate definition, motion trajectory and action region of a sound object of a three-dimensional sound field during recording production, encoding, decoding and rendering playback, and offers high coding efficiency, good sound expression and convenient sound production.

Description

Panoramic sound processing method

Technical Field

The invention relates to the technical field of sound coding, in particular to a panoramic sound processing method.

Background

With the rapid development of computing power and networks, technologies for audio recording, mixdown and editing, encoding, decoding, rendering and playback that can represent a real three-dimensional sound field have important application value in fields such as film, television, music, games, virtual reality and online video. "Panoramic sound" is a vivid description of a three-dimensional sound field.

At present, MPEG has introduced the MPEG-H three-dimensional audio coding technology and Dolby has introduced the Atmos panoramic sound coding technology; both add the concept of sound object coding to traditional multi-channel signal coding. Dolby Atmos encodes the three-dimensional coordinates (x, y, z) of a sound object by directly recording its three-dimensional motion trajectory, and divides the rendering and playback modes of the sound object into 9 rectangular regions. MPEG-H does not encode the sound objects directly; instead it uses a parametric stereo coding technique to mix several sounds into one mono signal and encodes the spatial perceptual information (phase, intensity and correlation) of each sound object. During decoding, the mono sum signal is decoded first, and each sound object is then restored using its spatial perceptual information.

In high-quality applications such as movies, Dolby Atmos can achieve higher sound quality than MPEG-H. However, the spatial coordinate system, the coordinate representation method, the sound object coordinate encoding method and the sound object region representation method of Dolby Atmos have limitations such as low encoding efficiency, poor sound expression and inconvenient sound production.

When describing a sound field, Dolby Atmos places the coordinate origin at the height of the front left screen loudspeaker, with the X axis running from the origin to the right wall, the Y axis from the origin to the rear wall, and the Z axis from the origin to the roof. The room is also divided into nine areas: a left-screen speaker area, a middle-screen speaker area, a right-screen speaker area, a left-wall speaker area, a right-wall speaker area, a rear-wall left-side speaker area, a rear-wall right-side speaker area, a left-roof speaker area and a right-roof speaker area. The sound object is encoded with the position coordinates and the area division described above.

In Dolby Atmos the definition of the coordinate origin and the region division are separate, and sound objects such as point sources, surface sources and diffuse sources are not expressed efficiently. In addition, the loudspeaker areas of Dolby Atmos are not equivalent to the effective action area of the actual sound object; the latter is a more accurate description of the actual physical sound field.

From the perspective of coding efficiency, the fewer bits the code stream consumes while still expressing the complete information, the higher the coding efficiency. The existing coordinate definition method encodes coordinates with a fixed number of bits; for example, Dolby Atmos maps position coordinates into a unit cube to obtain a fractional number in the range [0, 1], and then stores it as an unsigned number with 12 bits. As a result, 12 bits are used whether or not the position coordinates have changed, which wastes a large amount of code stream. In practice, the position of a sound object often changes slowly, and there is large redundancy between the position coordinate data of adjacent frames or adjacent blocks.

From the perspective of sound expression, the existing division of space is fixed; for example, Dolby Atmos divides the space into the nine speaker areas listed above. This is inflexible for positioning a sound object and leaves little room for choice, which makes the sound presentation less flexible.

Disclosure of Invention

The purpose of the invention is as follows: aiming at the defects of the prior art, the invention provides a panoramic sound processing method which is high in coding efficiency and good in sound expression.

The technical scheme is as follows: the panoramic sound processing method comprises the following steps:

acquiring a sound object of a sound field space;

establishing a three-dimensional coordinate system by taking the monitoring point as an origin, and determining a three-dimensional coordinate value of the sound object;

dividing three-dimensional coordinate values of a sound object into a reference block and a prediction block in time order;

directly coding the three-dimensional coordinate value of the reference block, and differentially coding the three-dimensional coordinate value of the prediction block;

and determining the effective action area of the sound object according to the three-dimensional coordinate value of the sound object before encoding or after decoding.

As a further refinement of the technical scheme, the origin is defined as the center of the horizontal cross-section of the sound field space, at the same height as the midpoint of the line connecting the two ears of the sound engineer.

Further, the position track of the sound object is in units of frames, each frame comprises a plurality of blocks, the first block of each frame is the reference block, and the subsequent blocks are the prediction blocks.

Further, the three-dimensional coordinate value of each block of the sound object is $(x_i, y_i, z_i)$; $(x_i, y_i, z_i)$ is mapped to $(pID_i, Ax_i, Ay_i, Az_i)$, where $pID_i$ is a quadrant identifier and $Ax_i$, $Ay_i$, $Az_i$ are the absolute values of the position coordinates.

Further, the reference block directly codes $(pID_i, Ax_i, Ay_i, Az_i)$ into $(pID_j, Dx_j, Dy_j, Dz_j)$, where $pID_j$ uses 3 bits and $Ax_i$, $Ay_i$, $Az_i$, which lie in the range $[0, 1]$, are coded as 4- to 16-bit unsigned numbers $Dx_j$, $Dy_j$, $Dz_j$. The prediction block codes the difference $(\Delta x_k, \Delta y_k, \Delta z_k)$ between the coordinate values of the current block and the previous block, where $\Delta x_k$ is the difference between the x-axis coordinates of the current block and the previous block, $\Delta y_k$ is the difference between their y-axis coordinates, and $\Delta z_k$ is the difference between their z-axis coordinates. The difference $(\Delta x_k, \Delta y_k, \Delta z_k)$ is mapped to $(pID_k, |\Delta x_k|, |\Delta y_k|, |\Delta z_k|)$, where $pID_k$ is the quadrant identifier of $(\Delta x_k, \Delta y_k, \Delta z_k)$ and $|\Delta x_k|$, $|\Delta y_k|$, $|\Delta z_k|$ are the absolute values of $\Delta x_k$, $\Delta y_k$, $\Delta z_k$ respectively. $|\Delta x_k|$, $|\Delta y_k|$, $|\Delta z_k|$ lie in the range $[0, 2]$ and are coded as 4- to 17-bit unsigned numbers $Dx_k$, $Dy_k$, $Dz_k$.

Further, the unsigned numbers $Dx_k$, $Dy_k$, $Dz_k$ are coded with the DIF(n) method: let DIFData be any one of the unsigned position coordinates $Dx_k$, $Dy_k$, $Dz_k$, and compare it with $2^n - 1$. If DIFData is smaller than $2^n - 1$, it is stored with n bits; otherwise all n bits are set to 1 and 2n bits follow; and so on, until $2^{kn} - 1 >$ DIFData (k is a positive integer).

Further, the unsigned position coordinate DIFData is stored using 4 bits, 8 bits or 12 bits.

Further, the effective action area of the sound object is a cone described by three parameters: a horizontal angle, θ and γ. The horizontal angle is the angle between the x axis and the projection onto the xoy plane of the line connecting the sound object and the origin, with range [0, 2π]; θ is the angle between the z axis and the line connecting the sound object and the origin; and γ describes the opening size of the cone, defined as the angle between the generatrix of the cone and its central axis, with range [0, π/2].

Further, the two angles giving the direction of the sound object are obtained from its coordinates $(x_i, y_i, z_i)$, and γ is encoded as a 4-bit unsigned number B.

The mapping relation is as follows: $\gamma = \frac{\pi}{2} \cdot \frac{B}{2^4 - 1}$, $0 \le B \le 2^4 - 1$.
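The γ mapping above can be illustrated with a minimal Python sketch. The text gives only the relation from B to γ; rounding to the nearest code on the encoder side is an assumption made here, and the function names `encode_gamma` and `decode_gamma` are hypothetical.

```python
import math

GAMMA_BITS = 4
B_MAX = (1 << GAMMA_BITS) - 1  # 2^4 - 1 = 15
HALF_PI = math.pi / 2


def decode_gamma(b: int) -> float:
    """Map the 4-bit code B to the cone half-angle: gamma = pi/2 * B / (2^4 - 1)."""
    if not 0 <= b <= B_MAX:
        raise ValueError("B must satisfy 0 <= B <= 2^4 - 1")
    return HALF_PI * (b / B_MAX)


def encode_gamma(gamma: float) -> int:
    """Quantize gamma in [0, pi/2] to a 4-bit code (nearest-code rounding is assumed)."""
    if not 0.0 <= gamma <= HALF_PI:
        raise ValueError("gamma must lie in [0, pi/2]")
    return round(gamma / HALF_PI * B_MAX)
```

With 16 levels the step between adjacent opening angles is π/30, that is 6 degrees.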

Beneficial effects: compared with the prior art, the invention has the following advantages. It introduces sound-object-based three-dimensional sound technology on top of the traditional multi-channel stereo sound field, and provides methods for representing the coordinate definition, motion trajectory and action region of the sound objects of a three-dimensional sound field during recording production, encoding, decoding and rendering playback. It introduces the effective action region of a sound object, and represents the coordinates (x, y, z) and the effective action region of the sound object by a cone. A point source can be represented by three-dimensional coordinate values alone, whereas a surface source needs region information in addition to the three-dimensional coordinate values. Point-source and surface-source sound objects are thus represented more effectively, giving a more efficient spatial representation, a better sound field effect and a more complete three-dimensional sound field. The coding efficiency is high, the sound expression is good and sound production is convenient.

The invention adopts a differential coding method, which allows most sound objects to be coded with few bits. For example, low-speed objects moving at no more than 53 km/h can be coded with only 4 bits, which greatly saves code stream space. The few high-speed objects are handled by extension in the DIF(n) manner. Although high-speed objects use more bits, most objects are low-speed, so the coding efficiency improves overall.

The invention provides a new division mode: a cone is obtained by taking the line connecting the object and the origin as the central axis, the opening angle of the cone is adjustable, and the area covered by the cone is the effective action area of the object. The effective action area is divided from the point of view of the object, which helps the sound engineer define the desired effective action area. When the object is presented, the choice of loudspeakers can be decided flexibly according to the loudspeaker layout of the actual sound field and the presentation algorithm in use, so the resulting region division makes the reconstruction of the sound object more expressive.

From the perspective of sound production, the position of a sound object and the region division of the sound field space are defined flexibly. Sound objects can be added freely and conveniently on top of traditional 3D stereo during sound production, which makes the production or recording stage highly flexible.

Drawings

Fig. 1 is a schematic diagram of the area division of the loudspeaker of the present invention.

Detailed Description

The technical scheme of the invention is explained in detail in the following with the accompanying drawings.

Example 1: the sound field space is described here as a cube, a typical application being one in which the speakers are arranged on the boundary surfaces of the cube. The spatial coordinates of a sound object are defined as follows: the coordinate origin is the center of the horizontal cross-section, at the height of the sound engineer's ears when monitoring; the x axis points to the right (wall), the y axis points to the front (typically the screen), and the z axis points vertically upward (roof).

The sound field space is expressed in normalized coordinates. The maximum absolute coordinate value on the x, y and z axes is 1; the shorter side of the z axis is the ground, whose normalized absolute coordinate value is a (a < 1). The 8 vertex coordinates of the sound field space are then:

(1, 1, 1): the upper right corner at the front of the region;

(-1, 1, 1): the upper left corner at the front of the region;

(1, 1, -a): the lower right corner at the front of the region;

(-1, 1, -a): the lower left corner at the front of the region;

(1, -1, 1): the upper right corner at the rear of the region;

(-1, -1, 1): the upper left corner at the rear of the region;

(1, -1, -a): the lower right corner at the rear of the region;

(-1, -1, -a): the lower left corner at the rear of the region.

The position trajectory of a sound object is coded in units of frames, and each frame is further divided into several blocks. For compatibility with compression coding, 1024 samples are taken as a frame: at a sampling frequency of 48 kHz each block has 256 samples, a time interval of 5.3 ms; at a sampling frequency of 96 kHz each block has 512 samples, also a time interval of 5.3 ms. The position coordinates of a sound object in the i-th block are written (x(i), y(i), z(i)), where i = 1, 2, 3, 4. The position coordinates (x, y, z) of the sound object can be mapped to four quantities (pID, Ax, Ay, Az), namely the quadrant identifier pID and the absolute values Ax, Ay, Az of the position coordinates (value range [0, 1]).
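The block interval quoted above follows directly from the block length and the sampling frequency. A short Python check, with the hypothetical helper name `block_duration_ms`:

```python
def block_duration_ms(block_samples: int, sample_rate_hz: int) -> float:
    """Duration of one block in milliseconds."""
    return 1000.0 * block_samples / sample_rate_hz


print(round(block_duration_ms(256, 48_000), 2))  # 5.33
print(round(block_duration_ms(512, 96_000), 2))  # 5.33
```

Both configurations give the same interval, which the text rounds to 5.3 ms.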

The quadrant identifier pID of the sound object describes the quadrant in which the coordinates (x, y, z) lie. It corresponds to the sign bit information (signb(x), signb(y), signb(z)) of (x, y, z), where signb(x) is the sign bit operation:

signb(x) = 0 when x ≥ 0;

signb(x) = 1 when x < 0.

The quadrant identifier may take the following values:

TABLE 1 quadrant identifier pID Table

pID index	Sign bits (signb(x), signb(y), signb(z))
0	(0, 0, 0)
1	(0, 0, 1)
2	(0, 1, 0)
3	(0, 1, 1)
4	(1, 0, 0)
5	(1, 0, 1)
6	(1, 1, 0)
7	(1, 1, 1)
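Table 1 amounts to reading the three sign bits as a 3-bit binary number. A minimal Python sketch of the mapping between (x, y, z) and (pID, Ax, Ay, Az) follows; the function names are hypothetical and chosen only for illustration.

```python
from typing import Tuple


def signb(v: float) -> int:
    """Sign bit operation: 0 when v >= 0, 1 when v < 0."""
    return 0 if v >= 0 else 1


def to_quadrant_form(x: float, y: float, z: float) -> Tuple[int, float, float, float]:
    """Map (x, y, z) to (pID, Ax, Ay, Az) using the bit order of Table 1."""
    pid = (signb(x) << 2) | (signb(y) << 1) | signb(z)
    return pid, abs(x), abs(y), abs(z)


def from_quadrant_form(pid: int, ax: float, ay: float, az: float) -> Tuple[float, float, float]:
    """Inverse mapping: restore the signs encoded in pID."""
    if not 0 <= pid <= 7:
        raise ValueError("pID is a 3-bit value")
    sx = -1.0 if pid & 0b100 else 1.0
    sy = -1.0 if pid & 0b010 else 1.0
    sz = -1.0 if pid & 0b001 else 1.0
    return sx * ax, sy * ay, sz * az
```

For example, a point with negative y only, such as (0.5, -0.25, 0.0), has sign bits (0, 1, 0) and therefore pID 2.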

The first block of each frame is a reference block, and the spatial position information of the sound object of the block is directly coded; the subsequent block is a prediction block, and the sound object spatial position information of the block is differentially encoded.

The first block encodes (pID, Ax, Ay, Az) directly. pID uses three bits, as shown in Table 1; Ax, Ay and Az, which lie in the range [0, 1], are coded as 10-bit unsigned numbers Dx, Dy, Dz, which satisfy the mapping relationship:

The subsequent blocks are differentially encoded: the difference (Δx, Δy, Δz) between the coordinate values of the current block and the previous block is encoded, where Δx is the difference between the x-axis coordinates of the current block and the previous block, Δy is the difference between their y-axis coordinates, and Δz is the difference between their z-axis coordinates. The following relationships are satisfied:

x(k) = x(k-1) + Δx, -2 ≤ Δx ≤ 2;

y(k) = y(k-1) + Δy, -2 ≤ Δy ≤ 2;

z(k) = z(k-1) + Δz, -2 ≤ Δz ≤ 2;
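The recurrences above can be sketched in Python as a split of one frame into a reference position plus per-block differences, and the matching reconstruction. This sketch works on unquantized coordinates and ignores the bit-level mapping; the names `split_frame` and `rebuild_frame` are hypothetical.

```python
from typing import List, Tuple

Position = Tuple[float, float, float]


def split_frame(positions: List[Position]) -> Tuple[Position, List[Position]]:
    """Return the reference block position and the deltas of the prediction blocks."""
    if not positions:
        raise ValueError("a frame needs at least one block")
    reference = positions[0]
    deltas = [
        (cur[0] - prev[0], cur[1] - prev[1], cur[2] - prev[2])
        for prev, cur in zip(positions, positions[1:])
    ]
    return reference, deltas


def rebuild_frame(reference: Position, deltas: List[Position]) -> List[Position]:
    """Apply x(k) = x(k-1) + dx (and likewise for y, z) block by block."""
    track = [reference]
    for dx, dy, dz in deltas:
        x, y, z = track[-1]
        track.append((x + dx, y + dy, z + dz))
    return track
```

Because each coordinate lies in [-1, 1], every difference necessarily lies in [-2, 2], which matches the bounds stated above.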

Similar to the foregoing process, the difference (Δx, Δy, Δz) is also mapped to four quantities (pID, |Δx|, |Δy|, |Δz|). Here pID is the quadrant identifier of (Δx, Δy, Δz), and |Δx|, |Δy|, |Δz| are the absolute values of Δx, Δy, Δz, with value range [0, 2]. pID uses three bits, as shown in Table 1; |Δx|, |Δy| and |Δz| can be mapped to 11-bit unsigned numbers Dx, Dy and Dz, which satisfy the mapping relationship:

The unsigned numbers Dx, Dy and Dz are coded with the DIF(n) method, which proceeds as follows. The unsigned position coordinate to be coded, DIFData (any one of Dx, Dy and Dz), is first compared with (2^n - 1). If DIFData is smaller than (2^n - 1), it is stored with n bits; otherwise all n bits are set to 1 and 2n bits follow; and so on, until (2^(kn) - 1) > DIFData (k is a positive integer). Taking DIF(4) as an example, when the unsigned numbers Dx, Dy and Dz are DIF(4) coded, the possible values of k are 1, 2 and 3, and the specific code stream structure is as follows:
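One way to read the DIF(n) description is as an escape code: an all-ones field of k·n bits signals that a wider field of (k+1)·n bits follows. The Python sketch below implements that reading and is not the patent's own code stream layout. The field widths of n, 2n, 3n bits, the use of a bit string as the stream, and the names `dif_encode` and `dif_decode` are all assumptions.

```python
from typing import Tuple


def dif_encode(value: int, n: int = 4) -> str:
    """Encode an unsigned integer as a bit string using escape fields of n, 2n, 3n... bits."""
    if value < 0 or n < 1:
        raise ValueError("value must be unsigned and n positive")
    fields = []
    k = 1
    # While the value does not fit below the all-ones pattern, emit an escape field.
    while value >= (1 << (k * n)) - 1:
        fields.append("1" * (k * n))
        k += 1
    fields.append(format(value, "0{}b".format(k * n)))
    return "".join(fields)


def dif_decode(stream: str, n: int = 4) -> Tuple[int, int]:
    """Decode one value from the start of the stream; return (value, bits consumed)."""
    pos = 0
    k = 1
    while True:
        width = k * n
        field = stream[pos:pos + width]
        if len(field) < width:
            raise ValueError("truncated stream")
        pos += width
        if field != "1" * width:  # anything other than all ones is the value itself
            return int(field, 2), pos
        k += 1
```

Under this reading a small difference such as 5 costs 4 bits with DIF(4), while the largest 11-bit value, 2047, passes through two escape fields and ends in a 12-bit field, so k reaches 3 as the text states.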

In the differential encoding of a sound object, enough space is reserved for the coordinate difference that its storage accuracy matches that of the position coordinates in the first block. This gives the following formula:

where R is the half-length of the room, L is the displacement of the object in two adjacent blocks, and n is the number of bits used to store the difference value.

For a 10 m square room, if 4 bits are first chosen to store the difference value, the largest value that can be stored is:

Solving gives L < 0.0781 (m), so the maximum speed of the sound object in this case is:

In practical recording, the speed of most sound objects is below 53 km/h, so 4-bit storage is enough, which is very efficient. For sound objects moving at high speed, that is at more than 53 km/h, storage can be extended to 8 bits. Even for an object as fast as an airplane (assuming 100 m/s), L = 100 × 0.0053 = 0.53 (m), where L is the distance moved between two adjacent blocks; 8 bits are then fully sufficient, since L/2^8 < 5/2^10.

When the room is enlarged to 100 meters and positions are stored with 10 bits, the precision is 50/2^10, and the precision for storing the residual is even more ample. The following table lists the maximum sound image speed that can be stored for different bit counts and room sizes:

TABLE 2 object speeds that can be stored in different cases

Difference bits	10 m room	100 m room
4 bits	53 km/h	530 km/h
8 bits	848 km/h	8480 km/h
12 bits	13568 km/h	135680 km/h
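The figures in Table 2 are consistent with requiring that the displacement L per block stay below 2^n quantization steps of size R/2^10, where R is the half-length of the room. That relation is inferred here from the stated values (L < 0.0781 m and 53 km/h for a 10 m room with 4 bits), not quoted from the text. The Python sketch below reproduces the table approximately under that assumption; the name `max_speed_kmh` and the 5.3 ms default are illustrative.

```python
def max_speed_kmh(room_m: float, diff_bits: int,
                  coord_bits: int = 10, block_s: float = 0.0053) -> float:
    """Largest object speed whose per-block displacement fits in diff_bits.

    Assumes one quantization step equals (room_m / 2) / 2**coord_bits and that
    the displacement per block must stay below 2**diff_bits such steps.
    """
    half_length = room_m / 2.0
    step = half_length / (1 << coord_bits)
    max_displacement = step * (1 << diff_bits)  # e.g. 0.078125 m for 10 m, 4 bits
    return max_displacement / block_s * 3.6     # m/s converted to km/h


for bits in (4, 8, 12):
    print(bits, round(max_speed_kmh(10, bits)), round(max_speed_kmh(100, bits)))
```

The computed values differ from the table by a fraction of a percent, apparently because the table scales the rounded 53 km/h figure by 16 and 256 instead of recomputing each entry.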

Within a three-dimensional region, some areas matter for the reconstruction of a sound object while others may have no effect. From this point of view, an action region is delimited for each sound object and only the part inside that region is used, which simplifies the calculation model and the mixing operation. Besides point sources, typical sound objects include surface sources (which can be understood as point sources at a great distance) and diffuse sources (such as explosions); the effective action area of the sound object is used to describe a surface source. The effective action area is in practice supplied by the sound engineer while monitoring: the sound engineer passes the desired effective action area to the encoder as metadata, and the encoder writes it into the code stream accordingly. Since only the decoded three-dimensional coordinate values are available at the decoding side, the effective action region can be specified at encoding time from the decoded three-dimensional coordinate values, so that the effective action region before encoding and the one after decoding coincide. In fact, within a given accuracy the three-dimensional coordinate values before encoding and after decoding are very close, and differ only by the quantization error of the coordinate values.

The division method is shown in Fig. 1: once the position of the sound object is determined, a cone is opened out around the line connecting the origin and the sound object as its axis, with the origin as the vertex of the cone. The speakers enclosed by the cone are then the effective speakers.

For convenience, this division is expressed in polar form by three parameters. The first two are the angles that make up the direction of the sound object: a horizontal angle, which is the angle between the x axis and the projection onto the xoy plane of the line connecting the object and the origin, with range [0, 2π]; and θ, which is the angle between that line and the z axis. The third parameter, γ, describes the opening size of the cone and is defined as the angle between the generatrix of the cone and its central axis, with range [0, π/2]. The whole cone is thus determined, and the region division of the three-dimensional space is complete.

As for the two direction angles, the position of the object has already been defined and the position coordinates of the sound object are expressed as (x, y, z), so they are easily found.
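As an illustration of how the direction angles and the cone test could be computed from (x, y, z), here is a minimal Python sketch. The symbol name `phi` for the angle in the xoy plane, the function names, and the rule that a speaker is effective when the angle between its direction and the cone axis is at most γ are assumptions based on the geometric description above.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def direction_angles(x: float, y: float, z: float) -> Tuple[float, float]:
    """Return (phi, theta): phi is measured from the x axis in the xoy plane,
    theta is the angle between the object direction and the z axis."""
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        raise ValueError("direction is undefined at the origin")
    phi = math.atan2(y, x) % (2.0 * math.pi)
    theta = math.acos(max(-1.0, min(1.0, z / r)))
    return phi, theta


def in_cone(obj: Vec3, speaker: Vec3, gamma: float) -> bool:
    """True if the speaker direction lies within half-angle gamma of the cone axis,
    where the axis runs from the origin through the sound object."""
    dot = sum(a * b for a, b in zip(obj, speaker))
    norm_obj = math.sqrt(sum(a * a for a in obj))
    norm_spk = math.sqrt(sum(b * b for b in speaker))
    if norm_obj == 0.0 or norm_spk == 0.0:
        raise ValueError("object and speaker must not be at the origin")
    cos_angle = max(-1.0, min(1.0, dot / (norm_obj * norm_spk)))
    return math.acos(cos_angle) <= gamma
```

For an object straight ahead at (0, 1, 0) and γ = π/6, a speaker at (0.5, 1, 0) is about 26.6 degrees off axis and falls inside the cone, while one at (1, 1, 0) is 45 degrees off axis and falls outside.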

Pseudo code for the above sound object coding:

The method provides representations of the coordinate definition, motion trajectory, action region and so on of the sound objects of a three-dimensional sound field during recording production, encoding, decoding and rendering playback. In three-dimensional sound coding, the waveform of a sound object must be encoded in addition to information such as its trajectory and action region.

Because sound objects are independent of one another, high-quality sound object waveforms may be encoded independently using any of the known lossless and lossy audio coding techniques, such as APE, FLAC, MP3, AAC and AVS. For low-bit-rate applications with tight bandwidth constraints, parametric coding can be used instead: several sound objects are mixed into a sum channel and represented efficiently by parameters. Such parametric coding methods include SAC (Spatial Audio Coding), BCC (Binaural Cue Coding) and MPEG Surround.

Since the method of encoding the sound waveform is mature, it will not be described herein.

As noted above, while the present invention has been shown and described with reference to certain preferred embodiments, it is not to be construed as limited thereto. Various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A panoramic sound processing method, comprising:

acquiring a sound object of a sound field space;

2. The panoramic sound processing method of claim 1, wherein: the origin is defined as the center of the horizontal cross-section of the sound field space, at the same height as the midpoint of the line connecting the two ears of the sound engineer.

3. The panoramic sound processing method of claim 1, wherein: the position trajectory of the sound object is in units of frames, each frame comprises a plurality of blocks, the first block of each frame is the reference block, and the subsequent blocks are the prediction blocks.

4. The panoramic sound processing method of claim 3, wherein: the three-dimensional coordinate value of each block of the sound object is $(x_i, y_i, z_i)$; $(x_i, y_i, z_i)$ is mapped to $(pID_i, Ax_i, Ay_i, Az_i)$, where $pID_i$ is a quadrant identifier and $Ax_i$, $Ay_i$, $Az_i$ are the absolute values of the position coordinates.

5. The panoramic sound processing method of claim 4, wherein: the reference block directly codes $(pID_i, Ax_i, Ay_i, Az_i)$ into $(pID_j, Dx_i, Dy_i, Dz_i)$, where $pID_j$ uses 3 bits and $Ax_i$, $Ay_i$, $Az_i$, which lie in the range $[0, 1]$, are coded as 4- to 16-bit unsigned numbers $Dx_i$, $Dy_i$, $Dz_i$; the prediction block codes the difference $(\Delta x_k, \Delta y_k, \Delta z_k)$ between the coordinate values of the current block and the previous block, where $\Delta x_k$ is the difference between the x-axis coordinates of the current block and the previous block, $\Delta y_k$ is the difference between their y-axis coordinates, and $\Delta z_k$ is the difference between their z-axis coordinates; the difference $(\Delta x_k, \Delta y_k, \Delta z_k)$ is mapped to $(pID_k, |\Delta x_k|, |\Delta y_k|, |\Delta z_k|)$, where $pID_k$ is the quadrant identifier of $(\Delta x_k, \Delta y_k, \Delta z_k)$ and $|\Delta x_k|$, $|\Delta y_k|$, $|\Delta z_k|$ are the absolute values of $\Delta x_k$, $\Delta y_k$, $\Delta z_k$ respectively; $|\Delta x_k|$, $|\Delta y_k|$, $|\Delta z_k|$ lie in the range $[0, 2]$ and are coded as 4- to 17-bit unsigned numbers $Dx_k$, $Dy_k$, $Dz_k$.

6. The panoramic sound processing method of claim 5, wherein: the unsigned numbers $Dx_i$, $Dy_i$, $Dz_i$ and $Dx_k$, $Dy_k$, $Dz_k$ are coded with the DIF(n) method: the value of any one of $Dx_i$, $Dy_i$, $Dz_i$ or $Dx_k$, $Dy_k$, $Dz_k$ is taken as the unsigned position coordinate DIFData and compared with $2^n - 1$; if it is smaller than $2^n - 1$, it is stored with n bits; otherwise all n bits are set to 1, then 2n bits follow, and so on, until $2^{kn} - 1 >$ DIFData, where k is a positive integer.

7. The panoramic sound processing method of claim 6, wherein: the unsigned position coordinate DIFData is stored using any one of 4 bits, 8 bits, 10 bits and 12 bits.

8. The panoramic sound processing method of claim 6, wherein: the effective action area of the sound object is a cone described by three parameters, namely a horizontal angle, θ and γ, wherein the horizontal angle is the angle between the x axis and the projection onto the xoy plane of the line connecting the sound object and the origin, with range [0, 2π]; θ is the angle between the z axis and the line connecting the sound object and the origin; and γ describes the opening size of the cone, defined as the angle between the generatrix of the cone and its central axis, with range [0, π/2].

9. The panoramic sound processing method of claim 8, wherein: the two angles giving the direction of the sound object are obtained from its coordinates $(x_i, y_i, z_i)$, and γ is encoded as a 4-bit unsigned number B.

The mapping relation is as follows: $\gamma = \frac{\pi}{2} \cdot \frac{B}{2^4 - 1}$, $0 \le B \le 2^4 - 1$.