Background
With the rapid development of computing power and networks, technologies for recording, down-mixing and editing, encoding, decoding, rendering, and playing back audio that represents a real three-dimensional sound field have become valuable in applications such as film, television, music, games, virtual reality, and online video. "Panoramic sound" is a vivid description of a three-dimensional sound field.
At present, MPEG has introduced the MPEG-H three-dimensional audio coding technology and Dolby has introduced the Atmos panoramic sound coding technology; both propose the concept of sound object coding on top of traditional multi-channel signal coding. Dolby Atmos encodes the three-dimensional coordinates (x, y, z) of a sound object by directly recording its three-dimensional motion trajectory, and divides the rendering and playback modes of the sound object into 9 rectangular regions. MPEG-H does not encode the sound objects directly; it adopts a parametric stereo coding technique that mixes a plurality of sound objects into one mono sum signal and encodes the spatial perception information (phase, intensity, and correlation) of each sound object. During decoding, the mono sum signal is decoded first, and each sound object is then restored using its spatial perception information.
In high-quality applications such as movies, Dolby Atmos can achieve higher sound quality than MPEG-H. However, the spatial coordinate system, coordinate representation, audio object coordinate encoding, and audio object partition representation of Dolby Atmos suffer from limitations such as low encoding efficiency, poor sound expression, and inconvenient sound production.
When describing a sound field, Dolby Atmos places the coordinate origin at the height of the front left screen loudspeaker, with the X axis running from the origin to the right wall, the Y axis from the origin to the rear wall, and the Z axis from the origin to the roof. The room is also divided into nine areas: a left-screen speaker area, a middle-screen speaker area, a right-screen speaker area, a left-wall speaker area, a right-wall speaker area, a rear-wall left-side speaker area, a rear-wall right-side speaker area, a left-roof speaker area, and a right-roof speaker area. The sound object is encoded with these position coordinates and this area division.
In Dolby Atmos, the definition of the coordinate origin and the division into regions are separate from each other, so sound objects such as point sources, plane sources, and diffuse sources are not expressed efficiently. In addition, the loudspeaker areas of Dolby Atmos are not equivalent to the effective action area of the actual sound object; the latter is a more accurate description of the actual physical sound field.
From the perspective of coding efficiency, the goal is generally to spend as few bits as possible while still expressing the complete information. The existing coordinate definition method encodes the coordinates with a fixed number of bits. For example, Dolby Atmos maps the position coordinates into a unit cube to obtain a decimal number in the range [0, 1], and then stores that unsigned number with 12 bits. As a result, 12 bits are used regardless of whether the position coordinates have changed, which wastes a large part of the code stream. In practice, the position of a sound object often changes slowly, so there is large redundancy between the position coordinate data of adjacent frames or adjacent blocks.
From the perspective of sound expression, the existing spatial division is fixed. For example, Dolby Atmos divides the space into nine areas: a left screen speaker area, a middle screen speaker area, a right screen speaker area, a left wall speaker area, a right wall speaker area, a back wall left side speaker area, a back wall right side speaker area, a left roof speaker area, and a right roof speaker area. This leaves little flexibility and few choices when positioning a sound object, which makes the sound less expressive.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the defects of the prior art, the invention provides a panoramic sound processing method which is high in coding efficiency and good in sound expression.
The technical scheme is as follows: the panoramic sound processing method comprises the following steps:
acquiring a sound object of a sound field space;
establishing a three-dimensional coordinate system by taking the monitoring point as an origin, and determining a three-dimensional coordinate value of the sound object;
dividing three-dimensional coordinate values of a sound object into a reference block and a prediction block in time order;
directly coding the three-dimensional coordinate value of the reference block, and differentially coding the three-dimensional coordinate value of the prediction block;
and determining the effective action area of the sound object according to the three-dimensional coordinate value of the sound object before encoding or after decoding.
To further refine the technical scheme, the origin is defined as the point at the center of the horizontal cross-section of the sound field space, at the height of the midpoint of the line connecting the sound engineer's two ears.
Further, the position track of the sound object is in units of frames, each frame comprises a plurality of blocks, the first block of each frame is the reference block, and the subsequent blocks are the prediction blocks.
Further, the three-dimensional coordinate value of each block of the sound object is $(x_i, y_i, z_i)$, which is mapped to $(pID_i, Ax_i, Ay_i, Az_i)$, where $pID_i$ is a quadrant identifier and $Ax_i$, $Ay_i$, $Az_i$ are the absolute values of the position coordinates.
Further, for the reference block, $(pID_i, Ax_i, Ay_i, Az_i)$ is directly coded as $(pID_j, Dx_j, Dy_j, Dz_j)$, where $pID_j$ uses 3 bits and $Ax_i$, $Ay_i$, $Az_i$, in the range $[0, 1]$, are coded as unsigned numbers $Dx_j$, $Dy_j$, $Dz_j$ of 4 to 16 bits. For the prediction block, the difference $(\Delta x_k, \Delta y_k, \Delta z_k)$ between the coordinate values of the current block and the previous block is coded, where $\Delta x_k$, $\Delta y_k$, and $\Delta z_k$ are the differences between the x-axis, y-axis, and z-axis coordinates of the current block and the previous block, respectively. The difference is mapped to $(pID_k, |\Delta x_k|, |\Delta y_k|, |\Delta z_k|)$, where $pID_k$ is the quadrant identifier of $(\Delta x_k, \Delta y_k, \Delta z_k)$ and $|\Delta x_k|$, $|\Delta y_k|$, $|\Delta z_k|$ are the absolute values of $\Delta x_k$, $\Delta y_k$, $\Delta z_k$; these absolute values, in the range $[0, 2]$, are coded as unsigned numbers $Dx_k$, $Dy_k$, $Dz_k$ of 4 to 17 bits.
Further, the unsigned numbers $Dx_k$, $Dy_k$, $Dz_k$ are coded with the DIF(n) method: any one of $Dx_k$, $Dy_k$, $Dz_k$, denoted DIFdata, is compared with $2^n - 1$. If DIFdata is smaller than $2^n - 1$, it is stored with n bits; otherwise all n bits are set to 1 and 2n bits follow; and so on, until $2^{kn} - 1 >$ DIFdata (k is a positive integer).
Further, the unsigned position coordinate DIFdata is stored using 4 bits, 8 bits, or 12 bits.
Further, the effective action area of the sound object is a cone described by three parameters $(\varphi, \theta, \gamma)$, where $\varphi$ is the angle between the x axis and the projection onto the xoy plane of the line connecting the sound object and the origin, in the range $[0, 2\pi]$; $\theta$ is the angle between that line and the z axis; and $\gamma$ describes the opening size of the cone, defined as the angle between the generatrix of the cone and its central axis, in the range $[0, \pi/2]$.
Further, $\varphi$ and $\theta$ are obtained from the coordinates $(x_i, y_i, z_i)$ of the sound object, and $\gamma$ is encoded as a 4-bit unsigned number $B$ with the mapping relation $\gamma = \frac{\pi}{2} \cdot \frac{B}{2^4 - 1}$, $0 \le B \le 2^4 - 1$.
Beneficial effects: compared with the prior art, the invention has the following advantages. It introduces sound-object-based three-dimensional sound technology on top of the traditional multi-channel stereo sound field, and provides methods for representing the coordinate definition, motion trajectory, and action region of the sound objects of a three-dimensional sound field during recording and production, encoding, decoding, and rendering and playback. It introduces the effective action area of a sound object, and represents the coordinates (x, y, z) and the effective action area of the sound object by a cone. A point source can be represented by three-dimensional coordinate values alone, whereas an area source needs area information in addition to the coordinate values. Point-source and area-source sound objects are thus both represented more effectively, giving a more efficient spatial representation, a better sound field effect, and a more complete three-dimensional sound field. The coding efficiency is high, the sound expression is good, and sound production is convenient.
The invention adopts a differential coding method, which allows most sound objects to be coded with fewer bits. For example, low-speed objects moving at no more than 53 km/h can be coded with only 4 bits, which greatly saves code stream space. The few high-speed objects can be coded by extension in the DIF(n) manner. Although high-speed objects use more bits, most objects are low-speed, so the coding efficiency improves overall.
The invention provides a new division mode: a cone is obtained by taking the line connecting the object and the origin as its central axis, the opening angle of the cone is adjustable, and the area covered by the cone is the effective action area of the object. Dividing the effective action area from the point of view of the object helps the sound engineer define the desired effective action area. When the object is rendered, the loudspeakers can be selected flexibly according to the loudspeaker layout of the actual sound field and the rendering algorithm in use, so the resulting region division makes the reconstruction of the sound object more expressive.
From the perspective of sound production, the position of a sound object and the division of the sound field space are flexibly defined, so sound objects can be added conveniently and freely on top of traditional 3D stereo during production, which makes the production or recording stage highly flexible.
Example 1: The sound field space is described taking a cube as an example; in a typical application the speakers are arranged on the boundary surfaces of the cube. The spatial coordinates of the sound object are defined as follows: the coordinate origin is the center of the horizontal cross-section, at the height of the sound engineer's ears when listening, with the x axis pointing right (wall), the y axis pointing forward (typically the screen), and the z axis pointing vertically up (roof).
The sound field space is expressed in normalized coordinates: the maximum absolute coordinate value on the x, y, and z axes is 1. The shorter side of the z axis is toward the ground, where the normalized absolute coordinate value is a (a < 1). The 8 corner coordinates of the sound field space are then:
(1, 1, 1): the upper right corner at the front of the region;
(-1, 1, 1): the upper left corner at the front of the region;
(1, 1, -a): the lower right corner at the front of the region;
(-1, 1, -a): the lower left corner at the front of the region;
(1, -1, 1): the upper right corner at the back of the region;
(-1, -1, 1): the upper left corner at the back of the region;
(1, -1, -a): the lower right corner at the back of the region;
(-1, -1, -a): the lower left corner at the back of the region.
The position track of the sound object is coded in units of frames, and each frame is further divided into a plurality of blocks. For compatibility with compression coding, 1024 samples are taken as a frame: at a sampling frequency of 48 kHz, each block has 256 samples and lasts 5.3 ms; at a sampling frequency of 96 kHz, each block has 512 samples and also lasts 5.3 ms. The position coordinates of a sound object in the i-th block are denoted (x(i), y(i), z(i)), where i = 1, 2, 3, or 4. The position coordinates (x, y, z) of the sound object can be mapped to four quantities (pID, Ax, Ay, Az), namely the quadrant identifier pID and the absolute values Ax, Ay, Az of the position coordinates (value range [0, 1]).
The quadrant identifier pID of the sound object describes the quadrant in which the coordinates (x, y, z) lie. It corresponds to the sign-bit information (signb(x), signb(y), signb(z)) of (x, y, z), where signb is the sign-bit operation:
signb(x) = 0 when x ≥ 0;
signb(x) = 1 when x < 0.
The quadrant identifier may take the following values:
Table 1. Quadrant identifier pID

| pID index | Sign bits (signb(x), signb(y), signb(z)) |
| --- | --- |
| 0 | (0,0,0) |
| 1 | (0,0,1) |
| 2 | (0,1,0) |
| 3 | (0,1,1) |
| 4 | (1,0,0) |
| 5 | (1,0,1) |
| 6 | (1,1,0) |
| 7 | (1,1,1) |
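As an illustration of the mapping between (x, y, z) and (pID, Ax, Ay, Az), the following Python sketch packs the three sign bits in the order listed in Table 1. The function names `signb`, `to_quadrant_form`, and `from_quadrant_form` are hypothetical and chosen only for this example.

```python
def signb(v: float) -> int:
    """Sign-bit operation from the text: 0 when v >= 0, 1 when v < 0."""
    return 0 if v >= 0 else 1


def to_quadrant_form(x: float, y: float, z: float):
    """Map (x, y, z) to (pID, Ax, Ay, Az)."""
    # signb(x) is the most significant bit, so (1,0,1) -> pID 5 as in Table 1.
    pid = (signb(x) << 2) | (signb(y) << 1) | signb(z)
    return pid, abs(x), abs(y), abs(z)


def from_quadrant_form(pid: int, ax: float, ay: float, az: float):
    """Inverse mapping: restore the signs from the three bits of pID."""
    sx = -1.0 if pid & 0b100 else 1.0
    sy = -1.0 if pid & 0b010 else 1.0
    sz = -1.0 if pid & 0b001 else 1.0
    return sx * ax, sy * ay, sz * az
```

For instance, a sound object at (-0.5, 0.25, -1.0) has sign bits (1, 0, 1), which corresponds to pID index 5 in Table 1.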
The first block of each frame is a reference block, and the spatial position information of the sound object of the block is directly coded; the subsequent block is a prediction block, and the sound object spatial position information of the block is differentially encoded.
The first block encodes (pID, Ax, Ay, Az) directly. The pID uses three bits, as shown in Table 1; Ax, Ay, and Az, in the range [0, 1], are coded as 10-bit unsigned numbers Dx, Dy, and Dz, which satisfy the mapping relationship:
The subsequent blocks are differentially encoded; that is, the difference (Δx, Δy, Δz) between the coordinate values of the current block and the previous block is encoded, where Δx, Δy, and Δz are the differences between the x-axis, y-axis, and z-axis coordinates of the current block and the previous block, respectively. The following relationships are satisfied:
x(k)=x(k-1)+Δx,-2≤Δx≤2;
y(k)=y(k-1)+Δy,-2≤Δy≤2;
z(k)=z(k-1)+Δz,-2≤Δz≤2;
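The recurrence above can be sketched in Python as follows. This is a minimal illustration working on unquantized coordinates; the helper names `split_frame` and `rebuild_frame` are hypothetical, and quantization and bit packing are deliberately left out.

```python
def split_frame(positions):
    """Split per-block positions of one frame into a reference and differences.

    positions: list of (x, y, z) tuples, one per block; the first block is
    the reference block and the remaining blocks are prediction blocks.
    """
    reference = positions[0]
    deltas = [
        tuple(cur_c - prev_c for cur_c, prev_c in zip(cur, prev))
        for prev, cur in zip(positions, positions[1:])
    ]
    return reference, deltas


def rebuild_frame(reference, deltas):
    """Apply x(k) = x(k-1) + dx (and likewise for y, z) block by block."""
    out = [reference]
    for delta in deltas:
        out.append(tuple(p + d for p, d in zip(out[-1], delta)))
    return out
```

Because normalized coordinates lie in [-1, 1], every difference produced this way falls within [-2, 2], which matches the bounds stated above.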
Similar to the foregoing process, the difference values (Δx, Δy, Δz) are also mapped to four quantities (pID, |Δx|, |Δy|, |Δz|). The pID is the quadrant identifier of (Δx, Δy, Δz); |Δx|, |Δy|, and |Δz| are the absolute values of Δx, Δy, and Δz, with value range [0, 2]. The pID uses three bits, as shown in Table 1, and |Δx|, |Δy|, and |Δz| can be mapped to 11-bit unsigned numbers Dx, Dy, and Dz, which satisfy the mapping relationship:
The unsigned numbers Dx, Dy, and Dz are then coded with the DIF(n) method. The DIF(n) coding process is as follows: first, the unsigned position coordinate DIFdata to be coded (DIFdata is any one of Dx, Dy, and Dz) is compared with (2^n-1); if it is smaller than (2^n-1), DIFdata is stored with n bits; otherwise, all n bits are set to 1 and 2n bits follow; and so on, until (2^(kn)-1) > DIFdata (k is a positive integer). Taking DIF(4) as an example, when the unsigned numbers Dx, Dy, and Dz are coded with DIF(4), the possible values of k are 1, 2, and 3, so DIFdata is stored in a field of 4, 8, or 12 bits.
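The DIF(n) escape scheme described above can be sketched in Python as bit strings. This is one reading of the description, not a definitive bitstream layout: it assumes that each exhausted field of k·n bits is written as all ones and is followed by the next, longer field. The names `dif_encode` and `dif_decode` are hypothetical.

```python
def dif_encode(value: int, n: int = 4) -> str:
    """Encode an unsigned integer as a DIF(n) bit string."""
    if value < 0:
        raise ValueError("DIF(n) codes unsigned numbers only")
    fields = []
    k = 1
    # While the value does not fit below the all-ones pattern of k*n bits,
    # emit that all-ones pattern as an escape and move to a longer field.
    while value >= (1 << (k * n)) - 1:
        fields.append("1" * (k * n))
        k += 1
    fields.append(format(value, f"0{k * n}b"))
    return "".join(fields)


def dif_decode(bits: str, pos: int = 0, n: int = 4):
    """Decode one DIF(n) value starting at pos; return (value, next_pos)."""
    k = 1
    while True:
        width = k * n
        field = bits[pos:pos + width]
        if len(field) < width:
            raise ValueError("truncated DIF(n) code")
        pos += width
        if field != "1" * width:  # a data field is never all ones
            return int(field, 2), pos
        k += 1
```

Under this reading, a small difference such as 3 costs only 4 bits in DIF(4), while the largest 11-bit value, 2047, costs 4 + 8 + 12 = 24 bits.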
In the differential encoding of the sound object, sufficient space is left for the difference of the coordinate values so that its storage precision matches that of the position coordinates in the first block. This gives the condition:
$$\frac{L}{2^{n}} < \frac{R}{2^{10}}$$
where R is the half-length of the room, L is the displacement of the object between two adjacent blocks, and n is the number of bits used to store the difference value.
For a 10 m square room (R = 5 m), if 4 bits are first chosen to store the difference value, the displacement that can be stored is at most:
$$L < \frac{5 \times 2^{4}}{2^{10}} \approx 0.0781\ \text{m}$$
Solving gives L < 0.0781 m, so with a block interval of 5.3 ms the maximum speed of the sound object is:
$$v = \frac{0.0781\ \text{m}}{0.0053\ \text{s}} \approx 14.7\ \text{m/s} \approx 53\ \text{km/h}$$
In practical recording, the speed of most sound objects is below 53 km/h, so 4-bit storage is sufficient and very efficient. Sound objects moving faster than 53 km/h can be extended to 8-bit storage. Even for an object as fast as an airplane (assuming 100 m/s), L = 100 × 0.0053 = 0.53 m, where L is the displacement between two adjacent blocks; since L/2^8 < 5/2^10, 8 bits are fully sufficient.
When the room is enlarged to 100 meters and positions are stored with 10 bits, the precision is 50/2^10, and the precision available for storing the residual is even more ample. The following table gives the maximum sound image speed that can be stored for different bit widths and room sizes:
Table 2. Object speeds that can be stored in different cases

| Bits | 10 m room | 100 m room |
| --- | --- | --- |
| 4 bits | 53 km/h | 530 km/h |
| 8 bits | 848 km/h | 8480 km/h |
| 12 bits | 13568 km/h | 135680 km/h |
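The entries of Table 2 can be approximated with a short Python sketch. It assumes the precision condition L/2^n < R/2^10, which is inferred here from the inequality L/2^8 < 5/2^10 used above, together with the rounded block interval of 5.3 ms. The function names are hypothetical.

```python
BLOCK_TIME_S = 0.0053   # block interval from the text (rounded)
POSITION_BITS = 10      # bits used for the reference-block coordinates


def max_displacement_m(half_length_m: float, diff_bits: int) -> float:
    """Upper bound on L from the assumed condition L / 2^n < R / 2^10."""
    return half_length_m * (2 ** diff_bits) / (2 ** POSITION_BITS)


def max_speed_kmh(room_size_m: float, diff_bits: int) -> float:
    """Largest storable object speed for a room of the given side length."""
    displacement = max_displacement_m(room_size_m / 2, diff_bits)
    return displacement / BLOCK_TIME_S * 3.6  # m/s -> km/h
```

Each additional 4 bits multiplies the bound by 16, and a tenfold larger room multiplies it by 10, which is the pattern visible in the table. Results computed this way differ slightly from the table entries, which appear to be scaled from the rounded 53 km/h figure.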
Within a three-dimensional region, some areas matter for the reconstruction of a sound object, while others may have no effect. From this point of view, an action region is delimited for each sound object and only that part of the space is used, so that the calculation model and the mixing operation become simpler. Besides point sources, typical sound objects also include surface sources (which can be understood as point sources at a great distance) and diffuse sources (such as explosions); the effective action area of the sound object is used to describe the surface source. In practice, the effective action area is determined by the sound engineer while monitoring; the sound engineer passes the desired effective action area to the encoder as metadata, and the encoder writes it into the code stream accordingly. Because only the decoded three-dimensional coordinate values are available at the decoding side, the encoder can also specify the effective action area from the decoded coordinate values, so that the effective action area before encoding and the one after decoding coincide. In fact, within a certain accuracy, the three-dimensional coordinate values before encoding and after decoding are very close to each other and differ only by the quantization error of the coordinate values.
The division method is shown in Fig. 1. Once the position of the sound object is determined, a cone is opened with the line connecting the origin and the sound object as its axis and the origin as its vertex. The speakers enclosed by the cone are the effective speakers.
For convenience of expression, this division is represented in polar form by three parameters $(\varphi, \theta, \gamma)$. The pair $(\varphi, \theta)$ gives the direction of the sound object: $\varphi$ is the angle between the x axis and the projection onto the xoy plane of the line connecting the object and the origin, in the range $[0, 2\pi]$, and $\theta$ is the angle between that line and the z axis. The third parameter $\gamma$ describes the opening size of the cone and is defined as the angle between the generatrix of the cone and its central axis, in the range $[0, \pi/2]$. The entire cone is thus determined, and the region division of the three-dimensional space is complete.
As for $(\varphi, \theta)$, the position of the object has been defined previously: the position coordinates of the sound object are (x, y, z), so the two angles are easily found.
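A minimal Python sketch of the cone parameters follows. The text does not spell out how the two direction angles are computed, so the standard spherical-coordinate conversion is assumed here; the 4-bit mapping for γ follows the relation γ = π/2 × B/(2^4-1) given earlier. All function names, including the `speaker_in_cone` test, are hypothetical.

```python
import math


def cone_axis_angles(x: float, y: float, z: float):
    """Return (azimuth, theta) of the line from the origin to (x, y, z).

    Assumed convention: azimuth is measured in the xoy plane from the x axis
    and wrapped to be non-negative; theta is measured from the z axis.
    """
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        raise ValueError("the object must not sit at the origin")
    azimuth = math.atan2(y, x) % (2 * math.pi)
    theta = math.acos(z / r)
    return azimuth, theta


def quantize_gamma(gamma: float, bits: int = 4) -> int:
    """Map gamma in [0, pi/2] to the unsigned number B."""
    levels = (1 << bits) - 1
    b = round(gamma / (math.pi / 2) * levels)
    return max(0, min(levels, b))


def dequantize_gamma(b: int, bits: int = 4) -> float:
    """gamma = pi/2 * B / (2^bits - 1), as in the mapping relation."""
    return (math.pi / 2) * b / ((1 << bits) - 1)


def speaker_in_cone(obj, speaker, gamma: float) -> bool:
    """True if the speaker direction lies within gamma of the cone axis."""
    dot = sum(a * b for a, b in zip(obj, speaker))
    norm = math.sqrt(sum(a * a for a in obj)) * math.sqrt(sum(b * b for b in speaker))
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.acos(cos_angle) <= gamma
```

With B = 3 the half-opening angle is π/10 (18°), so a speaker about 5.7° off the cone axis counts as effective while one at 90° does not.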
Pseudo code for the above sound object coding:
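The Python sketch below illustrates the position coding steps described in this example; it should not be taken as the exact code stream format. Its assumptions are: a uniform quantizer that maps [0, full scale] onto the full range of the unsigned integer, 10 bits for the reference block, 11-bit differences coded with DIF(4), fields written in the order pID, Dx, Dy, Dz, and differences taken on unquantized coordinates. All function names are hypothetical.

```python
def signb(v: float) -> int:
    return 0 if v >= 0 else 1


def pid_of(x: float, y: float, z: float) -> int:
    """3-bit quadrant identifier, ordered as in Table 1."""
    return (signb(x) << 2) | (signb(y) << 1) | signb(z)


def quantize(value: float, full_scale: float, bits: int) -> int:
    """ASSUMED uniform mapping of |value| in [0, full_scale] to `bits` bits."""
    clipped = min(abs(value), full_scale)
    return round(clipped / full_scale * ((1 << bits) - 1))


def dif_encode(value: int, n: int = 4) -> str:
    """DIF(n): all-ones escape fields of n, 2n, ... bits, then the data field."""
    out, k = "", 1
    while value >= (1 << (k * n)) - 1:
        out += "1" * (k * n)
        k += 1
    return out + format(value, f"0{k * n}b")


def encode_frame(blocks, n: int = 4) -> str:
    """Encode the per-block (x, y, z) positions of one frame as a bit string."""
    first = blocks[0]
    # Reference block: pID in 3 bits, then three 10-bit magnitudes.
    bits = format(pid_of(*first), "03b")
    bits += "".join(format(quantize(c, 1.0, 10), "010b") for c in first)
    # Prediction blocks: pID of the difference, then DIF(n)-coded magnitudes.
    for prev, cur in zip(blocks, blocks[1:]):
        delta = [c - p for c, p in zip(cur, prev)]
        bits += format(pid_of(*delta), "03b")
        bits += "".join(dif_encode(quantize(d, 2.0, 11), n) for d in delta)
    return bits
```

Under these assumptions, a stationary object in a four-block frame costs 33 bits for the reference block plus 15 bits for each of the three prediction blocks, 78 bits in total.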
The method provides representations of the coordinate definition, motion trajectory, action region, and so on of the sound objects of a three-dimensional sound field during recording and production, encoding, decoding, and rendering and playback. In three-dimensional sound coding, the waveform of a sound object must also be encoded, in addition to information such as its trajectory and action region.
Because sound objects are independent of each other, high-quality sound object waveforms may be encoded independently using various known lossless and lossy audio coding techniques, such as APE, FLAC, MP3, AAC, and AVS. In low-bit-rate scenarios where bandwidth is constrained, parametric coding can be used instead: a plurality of sound objects are mixed into a sum channel and represented efficiently by parameters. Such parametric coding methods include SAC (Spatial Audio Coding), BCC (Binaural Cue Coding), MPEG Surround, and the like.
Since the method of encoding the sound waveform is mature, it will not be described herein.
As noted above, while the present invention has been shown and described with reference to certain preferred embodiments, it is not to be construed as limited thereto. Various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.