CN117581566A - Audio processing method, device and storage medium

Audio processing method, device and storage medium

Info

Publication number
CN117581566A
Authority
CN
China
Prior art keywords
metadata
acoustic
audio data
position information
information
Legal status
Pending
Application number
CN202280001320.1A
Other languages
Chinese (zh)
Inventor
吕柱良
史润宇
吕雪洋
刘晗宇
Current Assignee
Beijing Xiaomi Mobile Software Co Ltd
Original Assignee
Beijing Xiaomi Mobile Software Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Xiaomi Mobile Software Co Ltd filed Critical Beijing Xiaomi Mobile Software Co Ltd
Publication of CN117581566A

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/04 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis, using predictive techniques
    • G10L 19/16 - Vocoder architecture
    • G10L 19/18 - Vocoders using multiple modes
    • G10L 19/20 - Vocoders using multiple modes using sound class specific coding, hybrid encoders or object based coding
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04R - LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 3/00 - Circuits for transducers, loudspeakers or microphones
    • H04R 3/04 - Circuits for transducers, loudspeakers or microphones for correcting frequency response
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04S - STEREOPHONIC SYSTEMS
    • H04S 7/00 - Indicating arrangements; Control arrangements, e.g. balance control

Abstract

The present disclosure provides an audio processing method, an audio processing device, and a storage medium, belonging to the field of communication technologies. The method includes: determining metadata of each frame of audio data, where the metadata includes at least one of absolute position information of an acoustic object in the audio data, relative position information of the acoustic object, orientation information of the acoustic object, and a sound radiation range of the acoustic object; and obtaining an object audio signal based on the metadata of the audio data. The method reduces the data volume and transmission bandwidth of the metadata and improves coding efficiency, while ensuring that the renderer can correctly render the direction of each acoustic object and provide a correct spatial audio perception result without affecting the final decoding and rendering effect. It can also simulate the differences in auditory perception produced by different actual orientations of an acoustic object, thereby improving the rendering effect.

Description

Audio processing method, device and equipment, and storage medium

Technical Field
The present disclosure relates to the field of communications technologies, and in particular, to an audio processing method, apparatus, and device, and a storage medium.
Background
When an encoding device collects audio data to produce an Object Audio signal, the metadata of the object audio signal includes relative position information between an acoustic object and the listener's listening position. When rendering the object audio signal, the decoding device can render spatial audio based on this relative position information, so that the listener hears sound arriving from a specific direction and enjoys a better sense of stereo and spatial immersion.
However, in the related art, when audio data is recorded or an object audio signal is produced, if the absolute positions of the acoustic objects are fixed but the listening position keeps moving, the relative position information between the acoustic objects and the listening position differs from frame to frame. The metadata of every frame of audio data therefore has to include this relative position information, which increases the data volume of the metadata, occupies transmission bandwidth, and lowers the coding efficiency of the object audio signal. Moreover, for some application scenarios, including relative position information in the metadata complicates the subsequent rendering flow and reduces rendering efficiency. In addition, when decoding and rendering an object audio signal, the related art cannot simulate the differences in auditory perception caused by different actual orientations of an acoustic object, resulting in a poor rendering effect.
Disclosure of Invention
The present disclosure provides an audio processing method, apparatus, device, and storage medium to solve the technical problems in the related art of low coding efficiency and poor rendering effect of object audio signals.
An embodiment of the present disclosure provides an audio processing method, applied to an encoding device, including:
determining metadata of each frame of audio data, wherein the metadata comprises at least one of absolute position information of an acoustic object in the audio data, relative position information of the acoustic object, orientation information of the acoustic object, and a sound radiation range of the acoustic object;
and obtaining an object audio signal based on the metadata of the audio data.
Optionally, in one embodiment of the disclosure, the determining metadata of each frame of audio data includes:
determining that absolute position information or relative position information is required to be contained in the metadata;
wherein in response to determining that absolute position information needs to be included in the metadata, the absolute position information is included in the metadata;
in response to determining that relative position information needs to be included in the metadata, relative position information is included in the metadata, the relative position information being used to indicate a relative position between the acoustic object and a listening position of a listener.
Optionally, in one embodiment of the disclosure, the determining metadata of each frame of audio data includes:
determining whether the acoustic object has an orientation;
responsive to the acoustic object having an orientation, including orientation information of the acoustic object into the metadata, and including a marker in the metadata, the marker being for indicating that the orientation information is included in the metadata;
in response to the acoustic object not having an orientation, no orientation information is contained in the metadata.
Optionally, in one embodiment of the disclosure, the orientation information includes absolute orientation information and/or relative orientation information;
the relative orientation information is used to indicate a relative orientation between the acoustic object and a listening position.
Optionally, in one embodiment of the disclosure, the metadata further includes at least one of:
sound source size of the sound object;
the width of the acoustic object;
the height of the acoustic object;
a spatial state of an acoustic object, the spatial state comprising moving or stationary;
type of acoustic object.
Optionally, in one embodiment of the disclosure, the method further comprises:
determining environmental spatial information of the acoustic object;
determining basic information of the acoustic object;
audio data of the acoustic object is sampled in units of frames.
Optionally, in one embodiment of the disclosure, in response to the acoustic object being located in a room, the ambient space information includes at least one of:
room size;
room wall type;
wall reflection coefficient;
a room type;
reverberation time.
Optionally, in one embodiment of the disclosure, the basic information of the acoustic object includes at least one of:
number of acoustic objects;
sampling rate of sound source of the sound object;
the sound source bit width of the sound object;
the frame length of each frame of audio data.
Optionally, in one embodiment of the disclosure, the obtaining the object audio signal based on metadata of the audio data includes:
storing the environmental space information of the sound object and the basic information of the sound object as a header file;
storing metadata of each frame of audio data and each frame of audio data as an object audio data packet;
and splicing the header file and the object audio data packet to obtain at least one object audio signal.
An audio processing method provided by another embodiment of the present disclosure is applied to a decoding device, and includes:
acquiring an encoded signal sent by an encoding device;
decoding the encoded signal to obtain an object audio signal;
determining metadata of the object audio signal, wherein the metadata comprises at least one of absolute position information of an acoustic object, relative position information of the acoustic object, orientation information of the acoustic object, and a sound radiation range of the acoustic object;
rendering the object audio signal based on the metadata.
Optionally, in one embodiment of the disclosure, the orientation information includes absolute orientation information and/or relative orientation information;
the relative orientation information is used to indicate a relative orientation between the acoustic object and a listening position.
Optionally, in one embodiment of the disclosure, the metadata further includes at least one of:
sound source size of the sound object;
the width of the acoustic object;
the height of the acoustic object;
a spatial state of an acoustic object, the spatial state comprising moving or stationary;
type of acoustic object.
Optionally, in one embodiment of the disclosure, the object audio signal includes a header file and an object audio data packet;
the header file includes environmental space information of the sound object and basic information of the sound object;
the object audio data packet includes metadata of audio data and audio data.
Optionally, in one embodiment of the disclosure, in response to the acoustic object being located in a room, the ambient space information includes at least one of:
room size;
room wall type;
wall reflection coefficient;
a room type;
reverberation time.
Optionally, in one embodiment of the disclosure, the basic information of the acoustic object includes at least one of:
number of acoustic objects;
sampling rate of sound source of the sound object;
the sound source bit width of the sound object;
the frame length of each frame of audio data.
Optionally, in one embodiment of the disclosure, the rendering the object audio signal based on the metadata includes:
rendering the audio data based on the metadata and the header file.
Optionally, in one embodiment of the disclosure, the method further comprises:
encoding the object audio signal;
the encoded signal is sent to a decoding device.
An audio processing apparatus according to an embodiment of another aspect of the present disclosure includes:
a determining module, configured to determine metadata of each frame of audio data, where the metadata includes at least one of absolute position information of an acoustic object in the audio data, relative position information of the acoustic object, orientation information of the acoustic object, and a sound radiation range of the acoustic object;
and the processing module is used for obtaining an object audio signal based on the metadata of the audio data.
An audio processing apparatus according to an embodiment of another aspect of the present disclosure includes:
the acquisition module is used for acquiring the coded signals sent by the coding equipment;
the decoding module is used for decoding the encoded signal to obtain an object audio signal;
a determining module, configured to determine metadata of the object audio signal, where the metadata includes at least one of absolute position information of an acoustic object, relative position information of the acoustic object, orientation information of the acoustic object, and a sound radiation range of the acoustic object;
and the rendering module is used for rendering the object audio signals based on the metadata.
In yet another aspect, the disclosure provides a communication apparatus, which includes a processor and a memory, where the memory stores a computer program, and the processor executes the computer program stored in the memory, so that the apparatus performs the method as set forth in the embodiment of another aspect above.
In another aspect of the present disclosure, a communication apparatus includes: a processor and interface circuit;
the interface circuit is used for receiving code instructions and transmitting the code instructions to the processor;
the processor is configured to execute the code instructions to perform a method as set forth in another embodiment.
A further aspect of the present disclosure provides a computer-readable storage medium storing instructions that, when executed, cause a method as set forth in the embodiment of the further aspect to be implemented.
In summary, in the audio processing method, apparatus, device, and storage medium provided by the embodiments of the present disclosure, the encoding device may determine metadata for each frame of audio data, where the metadata includes at least one of absolute position information of an acoustic object in the audio data, relative position information of the acoustic object, orientation information of the acoustic object, and a sound radiation range of the acoustic object; the encoding device then obtains an object audio signal based on the metadata of the audio data. Because the metadata may carry absolute position information, in a scenario where the absolute positions of the acoustic objects are fixed while the listening position keeps moving during recording or production, the absolute position information only needs to be included in the metadata of a certain frame (such as the first frame), and the other frames can multiplex that absolute position information; the metadata of every frame no longer needs to carry per-frame position information between the acoustic object and the listening position. This reduces the data volume and transmission bandwidth of the metadata and improves coding efficiency, while still ensuring that the direction of each acoustic object is subsequently rendered correctly, providing a correct spatial audio perception result without affecting the final decoding and rendering effect. In addition, the metadata of the embodiments of the present disclosure may further include the orientation information of the acoustic object and the sound radiation range of the acoustic object, so that when the decoding device subsequently renders the object audio signal, it can render based on the orientation information and the sound radiation range, simulating the differences in auditory perception produced by different actual orientations of the acoustic object and thereby improving the rendering effect.
Drawings
The foregoing and/or additional aspects and advantages of the present disclosure will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
fig. 1 is a flowchart of an audio processing method according to an embodiment of the disclosure;
fig. 2 is a flowchart of an audio processing method according to another embodiment of the disclosure;
figs. 3a-3b are flowcharts of an audio processing method according to a further embodiment of the disclosure;
fig. 4 is a flowchart of an audio processing method according to another embodiment of the disclosure;
fig. 5 is a flowchart of an audio processing method according to another embodiment of the disclosure;
figs. 6a-6b are flowcharts of an audio processing method according to yet another embodiment of the disclosure;
fig. 7 is a flowchart of an audio processing method according to an embodiment of the disclosure;
fig. 8 is a flowchart of an audio processing method according to an embodiment of the disclosure;
figs. 9a-9e are flowcharts of an audio processing method according to an embodiment of the disclosure;
fig. 9f is a schematic structural diagram of an audio processing device according to an embodiment of the disclosure;
fig. 9g is a schematic structural diagram of an audio processing device according to an embodiment of the disclosure;
fig. 10 is a block diagram of a user device according to an embodiment of the disclosure;
fig. 11 is a block diagram of a network side device according to an embodiment of the disclosure.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the embodiments of the present disclosure. Rather, they are merely examples of apparatus and methods consistent with aspects of embodiments of the present disclosure as detailed in the accompanying claims.
The terminology used in the embodiments of the disclosure is for the purpose of describing particular embodiments only and is not intended to limit the embodiments of the disclosure. As used in the embodiments of the present disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms "first," "second," "third," etc. may be used in embodiments of the present disclosure to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the embodiments of the present disclosure, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used herein may be interpreted as "when," "upon," or "in response to determining."
The audio processing method, apparatus, device and storage medium provided by the embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart of an audio processing method provided in an embodiment of the disclosure, where the method is performed by an encoding device, and as shown in fig. 1, the audio processing method may include the following steps:
step 101, determining metadata of each frame of audio data.
In one embodiment of the present disclosure, the metadata may include at least one of absolute position information of the acoustic object in each frame of audio data, relative position information of the acoustic object, orientation information of the acoustic object, and a sound radiation range of the acoustic object.
It should be noted that, in one embodiment of the present disclosure, the relative position information may be used to indicate the relative position between the acoustic object and the listening position of the listener. In one embodiment of the present disclosure, the absolute position information and the relative position information may specifically be the mapping of the acoustic object's absolute or relative position onto a coordinate system. The absolute position may be, for example, the longitude and latitude of the acoustic object; the relative position may be, for example, the distance, azimuth angle, and pitch angle between the acoustic object and the listener. The listening position of the listener may be any position, including the position of any acoustic object.
Specifically, in one embodiment of the present disclosure, a method for determining the absolute position information of an acoustic object may include: first obtaining the absolute position of each acoustic object, then establishing an absolute coordinate system, where the origin of the absolute coordinate system may be any position but remains fixed, and then mapping the absolute position of each acoustic object into the absolute coordinate system to obtain the absolute position information of the acoustic object. For example, in one embodiment of the present disclosure, the absolute coordinate system may be a rectangular coordinate system, and the absolute position information of the acoustic object may be (x, y, z), where x, y, and z respectively represent the position coordinates of the acoustic object on the x-axis (e.g., the front-rear axis), the y-axis (e.g., the left-right axis), and the z-axis (e.g., the up-down axis) of the rectangular coordinate system. In another embodiment of the present disclosure, the absolute coordinate system may be a spherical coordinate system, and the absolute position information of the acoustic object may be (θ, γ, r), where θ represents the horizontal direction angle of the acoustic object in the spherical coordinate system (namely, the angle between the x-axis and the projection, onto the horizontal plane, of the line connecting the acoustic object and the origin), γ represents the vertical direction angle (namely, the angle between the line connecting the acoustic object and the origin and the horizontal plane), and r represents the straight-line distance between the acoustic object and the origin.
In another embodiment of the present disclosure, a method for determining the relative position information of an acoustic object may include: first obtaining the relative position between each acoustic object and the listening position of the listener, then establishing a relative coordinate system, where the origin of the relative coordinate system is always the listening position, so that when the listening position changes, the origin of the relative coordinate system changes with it. The relative position of each acoustic object with respect to the listening position is then mapped into the relative coordinate system to obtain the relative position information of the acoustic object. For example, in one embodiment of the present disclosure, the relative coordinate system may be a rectangular coordinate system, and the relative position information of the acoustic object may be (x, y, z), where x, y, and z respectively represent the position coordinates of the acoustic object on the x-axis (e.g., the front-rear axis), the y-axis (e.g., the left-right axis), and the z-axis (e.g., the up-down axis) of the rectangular coordinate system. In another embodiment of the present disclosure, the relative coordinate system may be a spherical coordinate system, and the relative position information of the acoustic object may be (θ, γ, r), defined in the same way as above but with the listening position as the origin.
The above-mentioned (x, y, z) and (θ, γ, r) can be converted into each other by the following formula.
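Under the conventions defined above (θ the horizontal direction angle measured from the x-axis, γ the vertical direction angle measured from the horizontal plane, and r the straight-line distance to the origin), the standard conversion between the two representations is:

    x = r · cos γ · cos θ
    y = r · cos γ · sin θ
    z = r · sin γ

and, conversely,

    r = √(x² + y² + z²),  θ = atan2(y, x),  γ = arcsin(z / r).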
In one embodiment of the present disclosure, the above-described method of "obtaining the absolute position or relative position of each acoustic object" may include: acquiring the absolute or relative position of the acoustic object through a sensor or a combination of sensors; for example, a displacement sensor, a position sensor, an attitude sensor (such as a gyroscope or an ultrasonic range finder), a positioning sensor, a geomagnetic sensor, a direction sensor, or an accelerometer may be used. The distance between the acoustic object and the listener in the relative position may also be obtained through inertial navigation and initial alignment techniques. In another embodiment of the present disclosure, the absolute or relative position of each acoustic object may also be obtained from user input. In yet another embodiment of the present disclosure, it may be generated programmatically.
Further, in an embodiment of the present disclosure, the orientation information of the acoustic object may be absolute orientation information of the acoustic object (such as a fixed compass direction, e.g., facing due north). In another embodiment of the present disclosure, the orientation information of the acoustic object may specifically be relative orientation information, used to indicate the relative orientation between the acoustic object and the listening position; for example, the relative orientation information may be: the acoustic object is located 30° south-west of the listening position. The orientation information of the acoustic object may be acquired using any of the above sensors, obtained from user input, or generated programmatically.
In one embodiment of the present disclosure, the sound radiation range of the acoustic object may be a parameter describing the radiation characteristics of the acoustic object. In one embodiment of the present disclosure, the sound radiation range may indicate the sound radiation angle of the acoustic object; for example, the acoustic object may radiate sound within 90° directly in front of it, or may radiate sound over the full 360°. In another embodiment of the present disclosure, the sound radiation range may be the sound radiation shape of the acoustic object; for example, the acoustic object may radiate sound in a cardioid pattern, or in a figure-of-eight pattern. The sound radiation range of the acoustic object may likewise be acquired using any of the above sensors, obtained from user input, or generated programmatically.
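As an illustration of how a renderer might use the sound radiation shape, the following minimal C sketch models the two example patterns; the gain formulas and function names are assumptions for illustration, not formulas from the patent:

    #include <math.h>

    /* Hypothetical directivity gains derived from the sound radiation shape.
     * phi is the angle (in radians) between the acoustic object's orientation
     * and the direction from the object toward the listener. */
    static double cardioid_gain(double phi)
    {
        /* Cardioid: full gain on-axis, silence directly behind. */
        return 0.5 * (1.0 + cos(phi));
    }

    static double figure_of_eight_gain(double phi)
    {
        /* Figure-of-eight: full gain front and back, silence to the sides. */
        return fabs(cos(phi));
    }

A renderer could multiply the object's signal by such a gain before spatialization, which is one way the auditory difference between actual orientations could be simulated.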
In addition, in one embodiment of the present disclosure, the metadata of each frame of audio data may further include at least one of the following:
sound source size of the sound object;
the width of the acoustic object;
the height of the acoustic object;
a spatial state of the acoustic object, the spatial state including moving or stationary;
the type of acoustic object (e.g., speech, music, etc.).
The sound source size of the sound object, the width of the sound object, the height of the sound object and the space state of the sound object can be acquired by any one of the sensors, or acquired based on user input or generated based on a program.
It should further be noted that, in one embodiment of the present disclosure, each item of content in the metadata has a corresponding flag bit indicating whether that item has changed relative to the same item in the metadata of the previous frame of audio data. For example, the azimuth in the metadata has a corresponding azimuth flag bit: if the azimuth in the metadata of the current frame is unchanged relative to the azimuth in the metadata of the previous frame, the azimuth flag bit may be set to a first value (e.g., 1); otherwise it may be set to a second value (e.g., 0). In one embodiment of the present disclosure, if part of the content in the metadata of the current frame is unchanged relative to the metadata of the previous frame, that unchanged content may be omitted from the metadata of the current frame, and the corresponding content in the metadata of the previous frame may be multiplexed directly. This reduces the data volume and transmission bandwidth of the metadata, improves compression and coding efficiency, and does not affect the final decoding and rendering effect.
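As an illustration of this flag-bit mechanism, the following is a minimal C sketch; it is not taken from the patent, and every struct, flag, and field name is a hypothetical stand-in:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical per-frame metadata items; the field names are illustrative. */
    typedef struct {
        float azimuth;    /* horizontal direction angle (degrees) */
        float elevation;  /* vertical direction angle (degrees)   */
        float distance;   /* distance to the coordinate origin    */
    } ObjectMetadata;

    /* One flag bit per metadata item: set = unchanged vs. the previous frame. */
    enum {
        FLAG_AZIMUTH_UNCHANGED   = 1u << 0,
        FLAG_ELEVATION_UNCHANGED = 1u << 1,
        FLAG_DISTANCE_UNCHANGED  = 1u << 2,
    };

    /* Rebuild the current frame's metadata: items whose flag bit is set are
     * multiplexed (copied) from the previous frame, so only the changed items
     * need to be carried in the current frame's payload. */
    void apply_metadata_flags(ObjectMetadata *cur, const ObjectMetadata *prev,
                              uint32_t flags, const float *payload)
    {
        size_t i = 0;
        cur->azimuth   = (flags & FLAG_AZIMUTH_UNCHANGED)   ? prev->azimuth   : payload[i++];
        cur->elevation = (flags & FLAG_ELEVATION_UNCHANGED) ? prev->elevation : payload[i++];
        cur->distance  = (flags & FLAG_DISTANCE_UNCHANGED)  ? prev->distance  : payload[i++];
    }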
Step 102, obtaining an object audio signal based on metadata of the audio data.
The specific method of obtaining the object audio signal based on the metadata of the audio data will be described in detail in the following embodiments.
In summary, in the audio processing method provided by this embodiment of the present disclosure, the encoding device determines the metadata of each frame of audio data and obtains the object audio signal based on that metadata, with the beneficial effects described in the Disclosure of Invention above: absolute position information carried in the metadata of one frame (such as the first frame) can be multiplexed by the other frames, reducing the data volume and transmission bandwidth of the metadata and improving coding efficiency without affecting the final decoding and rendering effect, while the orientation information and sound radiation range allow the renderer to simulate the auditory differences produced by different actual orientations of the acoustic object.
Fig. 2 is a flowchart of an audio processing method provided in an embodiment of the disclosure, where the method is performed by an encoding device, and as shown in fig. 2, the audio processing method may include the following steps:
step 201, determining environmental space information of an acoustic object.
Wherein, in one embodiment of the present disclosure, when the acoustic object is located in a room, the ambient spatial information includes at least one of:
room size;
room wall type;
wall reflection coefficient;
room type (e.g., big room, small room, meeting room, auditorium, lobby, etc.);
reverberation time.
Wherein the ambient space information may be obtained using any of the above sensors or based on user input or generated based on a program.
It should be noted that, when the absolute coordinate system or the relative coordinate system is established later, the absolute coordinate system or the relative coordinate system may be established based on the environmental spatial information.
Step 202, determining basic information of the acoustic object.
Wherein, in one embodiment of the present disclosure, the basic information of the acoustic object may include at least one of:
number of acoustic objects;
sampling rate of sound source of the sound object;
the sound source bit width of the sound object;
the frame length of each frame of audio data.
Wherein the basic information of the acoustic object can be obtained by any of the above sensors or obtained based on user input or generated based on a program.
Step 203, sampling audio data of the acoustic object in units of frames.
In one embodiment of the present disclosure, a sound collection device (such as a microphone) may be used to sample the audio data of the acoustic object in units of frames, and all sampling points contained in the current frame may be stored as PCM (Pulse Code Modulation) data.
Tables 1 and 2 are schematic tables of the storage syntax of audio data provided in embodiments of the present disclosure.
TABLE 1 syntax of object audio data (Low latency mode)
TABLE 2 syntax of object raw pcm samples
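The contents of Tables 1 and 2 are not reproduced here. As a rough, hypothetical illustration of how the per-frame PCM samples of multiple acoustic objects might be laid out, a C sketch follows; none of these names come from the patent's actual syntax tables:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical frame store: 'frame_length' samples per object per frame,
     * 16-bit PCM, objects interleaved sample by sample. */
    typedef struct {
        uint32_t num_objects;   /* number of acoustic objects         */
        uint32_t frame_length;  /* samples per frame (per object)     */
        int16_t *pcm;           /* num_objects * frame_length samples */
    } ObjectAudioFrame;

    /* Fetch sample n of one object from the interleaved frame buffer. */
    static inline int16_t frame_sample(const ObjectAudioFrame *f,
                                       uint32_t object, uint32_t n)
    {
        return f->pcm[(size_t)n * f->num_objects + object];
    }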
Step 204, determining metadata of each frame of audio data.
The detailed description of step 204 may be described with reference to the above embodiments, which are not repeated herein.
Step 205, obtaining an object audio signal based on metadata of the audio data.
In one embodiment of the present disclosure, a method of obtaining an object audio signal based on the metadata of the audio data may include the following steps:
and step 1, storing the environment space information of the sound object and the basic information of the sound object as a header file.
Table 3 is a schematic table of the storage syntax of the header file provided in an embodiment of the disclosure.
TABLE 3 syntax of object Audio File header
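Table 3's contents are likewise not reproduced here. A hedged C sketch of the kind of information the header file is described as carrying (the environmental space information from step 201 plus the basic information from step 202; every name below is an assumption, not the patent's actual header syntax):

    #include <stdint.h>

    typedef struct {
        /* environmental space information (acoustic object in a room) */
        float    room_size[3];     /* width, depth, height in meters       */
        uint8_t  wall_type;        /* enumerated room wall type            */
        float    wall_reflection;  /* wall reflection coefficient          */
        uint8_t  room_type;        /* e.g., small room, meeting room, hall */
        float    reverb_time_s;    /* reverberation time in seconds        */
        /* basic information of the acoustic objects */
        uint32_t num_objects;      /* number of acoustic objects           */
        uint32_t sample_rate_hz;   /* sound source sampling rate           */
        uint8_t  bit_width;        /* sound source bit width               */
        uint32_t frame_length;     /* samples per frame of audio data      */
    } ObjectAudioHeader;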
Step 2, storing the metadata of each frame of audio data together with that frame of audio data as an object audio data packet.
Table 4 is a schematic table of the storage syntax of an object audio data packet provided in an embodiment of the present disclosure.
TABLE 4 syntax of object audio data packets
Step 3, splicing the header file and the object audio data packets to obtain at least one object audio signal.
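A sketch of how steps 1-3 fit together, reusing the hypothetical ObjectAudioHeader, ObjectMetadata, and ObjectAudioFrame types from the sketches above; the file layout and function are assumptions for illustration:

    #include <stdint.h>
    #include <stdio.h>

    /* Write one object audio signal: the header file first, then each
     * frame's object audio data packet (per-frame metadata followed by
     * that frame's PCM data). */
    int write_object_audio_signal(FILE *out, const ObjectAudioHeader *hdr,
                                  const ObjectMetadata *metadata,
                                  const ObjectAudioFrame *frames,
                                  uint32_t num_frames)
    {
        if (fwrite(hdr, sizeof *hdr, 1, out) != 1)
            return -1;
        for (uint32_t i = 0; i < num_frames; i++) {
            size_t samples = (size_t)hdr->num_objects * hdr->frame_length;
            if (fwrite(&metadata[i], sizeof metadata[i], 1, out) != 1)
                return -1;
            if (fwrite(frames[i].pcm, sizeof(int16_t), samples, out) != samples)
                return -1;
        }
        return 0;
    }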
In one embodiment of the present disclosure, after obtaining the object audio signal, the encoding device may save or transmit the object audio signal as required, or may encode it into another format before saving or transmitting it.
In summary, the audio processing method of this embodiment achieves the same beneficial effects as described for the embodiment of fig. 1: reduced metadata volume and transmission bandwidth, improved coding efficiency, and a rendering effect improved by the orientation information and sound radiation range.
Fig. 3a is a flowchart of an audio processing method provided in an embodiment of the disclosure, where the method is performed by an encoding device, and as shown in fig. 3a, the audio processing method may include the following steps:
step 301a, determining that the metadata needs to include absolute position information or relative position information.
In one embodiment of the present disclosure, whether absolute position information or relative position information needs to be included in the metadata is determined mainly based on the application scenario, the characteristics of the acoustic object, the degree of simplification of the subsequent rendering flow, and the like.
Specifically, when a first preset condition is met, it is determined that the metadata needs to contain absolute position information; when a second preset condition is met, it is determined that the metadata needs to contain relative position information.
Wherein the first preset condition may include at least one of:
the absolute position of the acoustic object is unchanged;
the data amount when the absolute position information is included in the metadata is smaller than or equal to the data amount when the relative position information is included in the metadata;
the rendering flow required when absolute position information is included in the metadata is simpler than that required when relative position information is included in the metadata.
The second preset condition may include at least one of:
the relative position of the acoustic object is unchanged;
the data amount when the absolute position information is included in the metadata is greater than or equal to the data amount when the relative position information is included in the metadata;
the rendering flow required when relative position information is included in the metadata is simpler than that required when absolute position information is included in the metadata.
That is, whether absolute position information or relative position information needs to be included in the metadata of consecutive frames of audio data can be determined by checking whether the absolute position or the relative position of the acoustic object stays unchanged. If the absolute position of the acoustic object is unchanged across the metadata of consecutive frames, it is determined that the metadata should include absolute position information. In that case, because the absolute position does not change, only the metadata of the first frame of the consecutive frames needs to include the absolute position information of the acoustic object, and the other frames can multiplex the absolute position information contained in the metadata of that first frame. Likewise, if the relative position of the acoustic object is unchanged across the metadata of consecutive frames, it is determined that the metadata should include relative position information; only the metadata of the first frame needs to include the relative position information, and the other frames can multiplex it. This reduces the data volume and transmission bandwidth of the metadata, improves compression and coding efficiency, and does not affect the final decoding and rendering effect.
In addition, in one embodiment of the present disclosure, the degree of simplification of the subsequent rendering flow may also be considered, and whichever of the absolute position information or the relative position information simplifies the subsequent rendering flow is selected, improving the efficiency of subsequent rendering. For example, in a 6-degrees-of-freedom scene, where the listener can rotate and translate in three dimensions, using absolute position information facilitates the processing of the scene and simplifies the rendering flow.
Therefore, in the embodiment of the disclosure, whether the metadata needs to include absolute position information or relative position information is comprehensively considered from multiple dimensions (such as lower data volume and simpler rendering process, etc.), so that the data volume of the metadata is reduced, the subsequent rendering process is simplified, and the rendering efficiency is improved.
It should be noted that the above determination logic for deciding whether absolute position information or relative position information is included in the metadata is only one example of the present disclosure; other similar determination logic also falls within the scope of the present disclosure.
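A condensed C sketch of this decision logic follows; the predicate fields distill the preset conditions above and are hypothetical, not the patent's actual criteria:

    #include <stdbool.h>

    typedef enum { POSITION_MODE_ABSOLUTE, POSITION_MODE_RELATIVE } PositionMode;

    /* Hypothetical scene properties distilled from the first and second
     * preset conditions described above. */
    typedef struct {
        bool absolute_position_unchanged;  /* across consecutive frames */
        bool relative_position_unchanged;
        bool absolute_simplifies_render;   /* e.g., a 6-DoF scene       */
    } SceneProperties;

    /* Decide which kind of position information the metadata carries. */
    PositionMode choose_position_mode(const SceneProperties *p)
    {
        if (p->absolute_position_unchanged || p->absolute_simplifies_render)
            return POSITION_MODE_ABSOLUTE;   /* first preset condition  */
        if (p->relative_position_unchanged)
            return POSITION_MODE_RELATIVE;   /* second preset condition */
        /* Otherwise either mode is workable; default to relative. */
        return POSITION_MODE_RELATIVE;
    }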
Step 302a, in response to determining that the metadata needs to include absolute position information, the metadata includes absolute position information.
Table 5 is a schematic table of the storage syntax of metadata containing absolute position information, provided in an embodiment of the present disclosure.
TABLE 5 syntax of object metadata samples (absolute coordinate mode)
Step 303a, obtaining an object audio signal based on metadata of the audio data.
The detailed descriptions of steps 302a-303a may be described with reference to the above embodiments, and the embodiments of the disclosure are not repeated herein.
In summary, this embodiment, in which the metadata contains absolute position information, achieves the beneficial effects described above: the absolute position information carried in the metadata of one frame (such as the first frame) can be multiplexed across frames, reducing the metadata's data volume and transmission bandwidth and improving coding efficiency without affecting the final decoding and rendering effect.
Fig. 3b is a flowchart of an audio processing method provided in an embodiment of the disclosure, where the method is performed by an encoding device, and as shown in fig. 3b, the audio processing method may include the following steps:
step 301b, determining metadata of each frame of audio data, wherein the metadata contains absolute position information.
Step 302b, obtaining the object audio signal based on the metadata of the audio data.
The detailed descriptions of the steps 301b-302b may be described with reference to the above embodiments, and the embodiments of the disclosure are not repeated herein.
In summary, this embodiment likewise allows the absolute position information carried in the metadata of one frame to be multiplexed by the other frames, with the beneficial effects on metadata size, transmission bandwidth, and coding efficiency described above.
Fig. 4 is a flowchart of an audio processing method provided in an embodiment of the disclosure, where the method is performed by an encoding device, and as shown in fig. 4, the audio processing method may include the following steps:
step 401, determining that absolute position information or relative position information needs to be included in metadata.
Step 402, in response to determining that the metadata needs to include the relative position information, the metadata includes the relative position information.
Table 6 is a schematic table of the storage syntax of metadata containing relative position information, provided in an embodiment of the present disclosure.
TABLE 6 syntax of object metadata samples (relative coordinate mode)
Step 403, obtaining the object audio signal based on the metadata of the audio data.
The detailed descriptions of steps 401-403 may be described with reference to the above embodiments, and the embodiments of the disclosure are not repeated herein.
In summary, this embodiment, in which the metadata contains relative position information, achieves the beneficial effects described above: only the frames in which the position information changes need to carry it, reducing the metadata's data volume and transmission bandwidth and improving coding efficiency without affecting the final decoding and rendering effect.
As can be seen from the embodiments above, by supporting the coding of both relative position information and absolute position information, the present disclosure ensures that the most efficient spatial audio metadata scheme is achieved both in scenes where the relative position is unchanged and in scenes where the absolute position is unchanged.
Fig. 5 is a flowchart of an audio processing method provided in an embodiment of the disclosure, where the method is performed by an encoding device, and as shown in fig. 5, the audio processing method may include the following steps:
step 501, determining whether the acoustic object has an orientation.
In one embodiment of the present disclosure, if an acoustic object radiates sound uniformly in all directions, the acoustic object is considered to have no orientation; otherwise, the sound-emitting direction of the acoustic object is determined as its orientation.
Step 502, in response to the acoustic object having an orientation, include the orientation information of the acoustic object in the metadata, together with a marker indicating that orientation information is included in the metadata.
The detailed descriptions of steps 501-502 may be described with reference to the above embodiments, and the embodiments of the disclosure are not repeated herein.
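A minimal C sketch of steps 501-502 (and the no-orientation branch of step 602a below); the marker bit, struct, and function names are hypothetical:

    #include <stdbool.h>
    #include <stdint.h>

    #define FLAG_HAS_ORIENTATION (1u << 0)  /* hypothetical marker bit */

    /* Hypothetical orientation payload carried in the frame metadata. */
    typedef struct {
        float yaw;    /* horizontal facing angle (degrees) */
        float pitch;  /* vertical facing angle (degrees)   */
    } Orientation;

    /* If the acoustic object has an orientation, write the orientation
     * information into the metadata and set the marker indicating that
     * orientation information is included; otherwise (the object sounds
     * in all directions) leave the marker clear and include nothing. */
    void encode_orientation(uint32_t *metadata_flags,
                            Orientation *metadata_orientation,
                            bool has_orientation, const Orientation *src)
    {
        if (has_orientation) {
            *metadata_flags |= FLAG_HAS_ORIENTATION;
            *metadata_orientation = *src;
        }
    }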
In summary, the metadata in this embodiment of the present disclosure further includes the orientation information of the acoustic object, so that when the decoding device subsequently renders the object audio signal, it can render based on the orientation information, simulating the differences in auditory perception produced by different actual orientations of the acoustic object and thereby improving the rendering effect.
Fig. 6a is a flowchart of an audio processing method provided in an embodiment of the disclosure, where the method is performed by an encoding device, and as shown in fig. 6a, the audio processing method may include the following steps:
step 601a, determine if the acoustic object has an orientation.
Step 602a, in response to the acoustic object not having an orientation, does not include orientation information in the metadata.
The detailed descriptions of the steps 601a-602a may be described with reference to the above embodiments, and the embodiments of the disclosure are not repeated herein.
In summary, this embodiment achieves the beneficial effects described above: multiplexable position information in the metadata reduces data volume and transmission bandwidth and improves coding efficiency, while the orientation information and sound radiation range improve the rendering effect.
Fig. 6b is a flowchart of an audio processing method provided in an embodiment of the disclosure, where the method is performed by an encoding device, and as shown in fig. 6b, the audio processing method may include the following steps:
step 601b, determining metadata of each frame of audio data, wherein the metadata comprises at least one of absolute position information of an acoustic object, relative position information of the acoustic object, orientation information of the acoustic object and sounding radiation range of the acoustic object in the audio data.
Step 602b, obtaining an object audio signal based on metadata of the audio data.
For the detailed description of steps 601b-602b, reference may be made to the above embodiments; details are not repeated herein.
Step 603b, encoding the object audio signal.
Step 604b, transmitting the encoded signal to a decoding device.
In summary, the beneficial effects of the audio processing method provided by this embodiment are the same as those described above for the embodiment of fig. 6a, and are not repeated herein.
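As a concrete illustration of steps 601b-604b, the following toy pipeline strings the flow together in Python. It rests on stated assumptions: json and zlib merely stand in for the real object audio encoder, whose syntax is not fixed here, and send is a caller-supplied transport function.

```python
import json
import zlib
from typing import Callable, Iterable

def encode_and_send(frames: Iterable[dict], send: Callable[[bytes], None]) -> None:
    """Toy version of steps 601b-604b; json/zlib stand in for the real codec."""
    for frame in frames:
        metadata = frame["metadata"]                                  # step 601b
        object_audio_signal = {"metadata": metadata,                  # step 602b
                               "audio": frame["audio"]}
        encoded = zlib.compress(json.dumps(object_audio_signal).encode("utf-8"))  # step 603b
        send(encoded)                                                 # step 604b

# Example: one frame whose metadata carries orientation only.
encode_and_send(
    [{"metadata": {"orientation": [90.0, 0.0, 0.0]}, "audio": [0, 1, -1, 0]}],
    send=lambda payload: print(f"sending {len(payload)} bytes"),
)
```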
The above-described audio processing method is exemplified below.
Fig. 7 is a schematic diagram of an audio processing method according to an embodiment of the disclosure. As shown in fig. 7, a local multi-person conference scenario is taken as an example: the left room is the recording end and the right room is the playback end. The recording-end room contains multiple objects, namely object 1, object 2 and object 4, which are speaking, and object 3, which is not speaking. In this scenario all of these objects are treated as acoustic objects. Their voice data is captured by microphones, while positioning and attitude sensors (such as gyroscopes and ultrasonic rangefinders) capture each object's spatial information (relative position information or absolute position information) and orientation information. The audio data, spatial information and orientation information of each object are then encoded, transmitted and decoded; after rendering, a listener feels placed in the left conference scene and can perceive not only the directions and distances of object 1, object 2 and object 4, but also their orientations. In addition, object 3, which has no audio data, can still be encoded and transmitted as an acoustic object and regarded as a listener within the recording scene, so that the playback end can fully restore the real auditory experience of object 3, including the auditory changes caused by changes in object 3's position and head rotation (orientation change).
Fig. 8 is a schematic flowchart of an audio processing method according to an embodiment of the present disclosure. As shown in fig. 8, in a multi-person teleconference scenario, the far-end participants are on the left side: multiple participants are located in different places and different rooms, and each of them can be regarded as an acoustic object for object audio encoding. A displacement or position sensor and an attitude sensor can be used to obtain each object's spatial position change information and head orientation information, and a microphone captures the participant's voice signal as the object audio data; the spatial position information, the head orientation information and the object audio data are then used for encoding the object audio. For the near-end user (right side of fig. 8), after the multiple encoded remote object audio streams are obtained, they are decoded and rendered; combined with the near-end user's local spatial information, the user can perceive the sound of the multiple remote participants with a sense of direction that changes over time, as well as the auditory changes caused by the orientations of the remote participants.
Fig. 9a is a flowchart of an audio processing method provided in an embodiment of the disclosure, where the method is performed by a decoding device, and as shown in fig. 9a, the audio processing method may include the following steps:
step 901a, acquiring an encoded signal sent by an encoding device;
step 902a, decoding the encoded signal to obtain an object audio signal;
step 903a, determining metadata of the object audio signal, where the metadata includes at least one of absolute position information of the acoustic object, relative position information of the acoustic object, orientation information of the acoustic object, and sounding radiation range of the acoustic object;
step 904a, rendering the object audio signal based on the metadata.
Optionally, in one embodiment of the disclosure, the orientation information includes absolute orientation information and/or relative orientation information;
the relative orientation information is used to indicate a relative orientation between the acoustic object and a listening position.
Optionally, in one embodiment of the disclosure, the metadata further includes at least one of:
sound source size of the sound object;
the width of the acoustic object;
the height of the acoustic object;
a spatial state of an acoustic object, the spatial state comprising moving or stationary;
type of acoustic object.
Optionally, in one embodiment of the disclosure, the object audio signal includes a header file and an object audio data packet;
The header file includes environmental space information of the sound object and basic information of the sound object;
the object audio data packet includes metadata of audio data and audio data.
Optionally, in one embodiment of the disclosure, in response to the acoustic object being located in a room, the ambient space information includes at least one of the following (an illustrative use of these parameters is sketched after this list):
room size;
room wall type;
wall reflection coefficient;
a room type;
reverberation time.
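These environmental parameters are interrelated. For example, if the room size and an average wall absorption (derivable from the wall reflection coefficient) are available but the reverberation time is not, a renderer could estimate it with the classical Sabine formula RT60 = 0.161·V/(α·S). The helper below is a hypothetical illustration, not a computation prescribed by the embodiments of the present disclosure.

```python
def sabine_rt60(room_dims_m, avg_absorption):
    """Estimate reverberation time via Sabine: RT60 = 0.161 * V / (alpha * S).

    room_dims_m: (length, width, height) of a box-shaped room, in meters.
    avg_absorption: average absorption coefficient alpha in (0, 1].
    """
    lx, ly, lz = room_dims_m
    volume = lx * ly * lz                                # V, m^3
    surface = 2.0 * (lx * ly + ly * lz + lx * lz)        # S, m^2
    return 0.161 * volume / (avg_absorption * surface)   # RT60, seconds

# A 6 m x 4 m x 3 m room with alpha = 0.3 gives roughly 0.36 s.
print(round(sabine_rt60((6.0, 4.0, 3.0), 0.3), 2))
```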
Optionally, in one embodiment of the disclosure, the basic information of the acoustic object includes at least one of:
number of acoustic objects;
sampling rate of sound source of the sound object;
the sound source bit width of the sound object;
the frame length of each frame of audio data.
Optionally, in one embodiment of the disclosure, the rendering the object audio signal based on the metadata includes:
rendering the audio data based on the metadata and the header file.
For the detailed description of steps 901a-904a, reference may be made to the above embodiments; details are not repeated herein.
In summary, the beneficial effects of the audio processing method provided by this embodiment are the same as those described above for the embodiment of fig. 6a, and are not repeated herein.
Fig. 9b is a flowchart of an audio processing method provided in an embodiment of the disclosure, where the method is performed by a decoding device, and as shown in fig. 9b, the audio processing method may include the following steps:
step 901b, acquiring an encoded signal sent by an encoding device;
step 902b, decoding the encoded signal to obtain an object audio signal;
step 903b, determining metadata of the object audio signal, the metadata comprising absolute position information of the acoustic object;
step 904b, rendering the object audio signal based on the metadata.
In summary, the beneficial effects of the audio processing method provided by this embodiment are the same as those described above for the embodiment of fig. 6a, and are not repeated herein.
Fig. 9c is a flowchart of an audio processing method provided in an embodiment of the disclosure, where the method is performed by a decoding device, and as shown in fig. 9c, the audio processing method may include the following steps:
step 901c, acquiring an encoded signal sent by an encoding device;
step 902c, decoding the encoded signal to obtain an object audio signal;
step 903c, determining metadata of the object audio signal, the metadata comprising relative position information of the acoustic object;
step 904c, rendering the object audio signal based on the metadata.
In summary, the beneficial effects of the audio processing method provided by this embodiment are the same as those described above for the embodiment of fig. 6a, and are not repeated herein.
Fig. 9d is a flowchart of an audio processing method provided in an embodiment of the disclosure, where the method is performed by a decoding device, and as shown in fig. 9d, the audio processing method may include the following steps:
step 901d, acquiring an encoded signal sent by an encoding device;
step 902d, decoding the encoded signal to obtain an object audio signal;
step 903d, determining metadata of the object audio signal, where the metadata includes orientation information of the acoustic object and a flag, the flag being used to indicate that the metadata includes the orientation information;
step 904d, rendering the object audio signal based on the metadata.
In summary, the beneficial effects of the audio processing method provided by this embodiment are the same as those described above for the embodiment of fig. 6a, and are not repeated herein.
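The flag described in step 903d lends itself to a compact bit-field layout. The following sketch assumes one possible layout, in which a single flag byte announces which optional metadata fields follow and the orientation bit plays the role of the flag; neither the bit assignment nor the helper names are taken from the embodiments of the present disclosure.

```python
import struct
from typing import Dict, Tuple

# Hypothetical flag bits; the disclosure does not define this layout.
HAS_ABS_POS, HAS_REL_POS, HAS_ORIENT, HAS_RANGE = 0x01, 0x02, 0x04, 0x08

def parse_frame_metadata(buf: bytes) -> Tuple[Dict, bytes]:
    """Parse one frame's metadata: a flag byte, then only the fields flagged."""
    flags = buf[0]
    offset = 1
    meta: Dict = {}

    def take_floats(n: int):
        nonlocal offset
        values = struct.unpack_from(f"<{n}f", buf, offset)
        offset += 4 * n
        return values

    if flags & HAS_ABS_POS:
        meta["absolute_position"] = take_floats(3)
    if flags & HAS_REL_POS:
        meta["relative_position"] = take_floats(3)
    if flags & HAS_ORIENT:                      # the flag: orientation present
        meta["orientation"] = take_floats(3)
    if flags & HAS_RANGE:
        meta["radiation_range"] = take_floats(1)[0]
    return meta, buf[offset:]

# Round-trip example: orientation-only metadata (flag byte 0x04).
packed = struct.pack("<B3f", HAS_ORIENT, 90.0, 0.0, 0.0)
print(parse_frame_metadata(packed)[0])  # {'orientation': (90.0, 0.0, 0.0)}
```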
Fig. 9e is a flowchart of an audio processing method provided in an embodiment of the disclosure, where the method is performed by a decoding device, and as shown in fig. 9e, the audio processing method may include the following steps:
step 901e, acquiring an encoded signal sent by an encoding device;
step 902e, decoding the encoded signal to obtain an object audio signal;
step 903e, determining metadata of the object audio signal, where the metadata includes orientation information of the acoustic object and a flag, the flag being used to indicate that the metadata includes the orientation information;
step 904e, rendering the object audio signal based on the metadata.
In summary, the beneficial effects of the audio processing method provided by this embodiment are the same as those described above for the embodiment of fig. 6a, and are not repeated herein.
Fig. 9f is a schematic structural diagram of an audio processing apparatus according to an embodiment of the disclosure, where, as shown in fig. 9f, the apparatus may include:
a determining module 901f, configured to determine metadata of each frame of audio data, where the metadata includes at least one of absolute position information of an acoustic object in the audio data, relative position information of the acoustic object, orientation information of the acoustic object, and a sounding radiation range of the acoustic object;
a processing module 902f, configured to obtain an object audio signal based on metadata of the audio data.
In summary, the beneficial effects of the audio processing apparatus provided by this embodiment of the present disclosure are the same as those of the audio processing method described above, and are not repeated herein.
Optionally, in one embodiment of the disclosure, the determining module is further configured to:
determining whether absolute position information or relative position information needs to be included in the metadata;
wherein in response to determining that absolute position information needs to be included in the metadata, the absolute position information is included in the metadata;
in response to determining that relative position information needs to be included in the metadata, relative position information is included in the metadata, the relative position information being used to indicate a relative position between the acoustic object and a listener.
Optionally, in one embodiment of the disclosure, the determining module is further configured to:
determining whether the acoustic object has an orientation;
responsive to the acoustic object having an orientation, including orientation information of the acoustic object into the metadata, and including a marker in the metadata, the marker being for indicating that the orientation information is included in the metadata;
in response to the acoustic object not having an orientation, no orientation information is contained in the metadata.
Optionally, in one embodiment of the disclosure, the orientation information includes absolute orientation information and/or relative orientation information;
The relative orientation information is used to indicate a relative orientation between the acoustic object and the listener.
Optionally, in one embodiment of the disclosure, the metadata further includes at least one of:
sound source size of the sound object;
the width of the acoustic object;
the height of the acoustic object;
a spatial state of an acoustic object, the spatial state comprising moving or stationary;
type of acoustic object.
Optionally, in one embodiment of the disclosure, the apparatus is further configured to:
determining environmental spatial information of the acoustic object;
determining basic information of the acoustic object;
audio data of the acoustic object is sampled in units of frames.
Optionally, in one embodiment of the disclosure, in response to the acoustic object being located in a room, the ambient space information includes at least one of:
room size;
room wall type;
wall reflection coefficient;
a room type;
reverberation time.
Optionally, in one embodiment of the disclosure, the basic information of the acoustic object includes at least one of the following (an illustrative use of these fields is sketched after this list):
number of acoustic objects;
sampling rate of sound source of the sound object;
the sound source bit width of the sound object;
the frame length of each frame of audio data.
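The basic information listed above is what is needed to sample the audio data of the acoustic object in units of frames. The following sketch shows this framing step under simple assumptions (mono PCM, a single object); frame_audio is a hypothetical helper rather than a routine defined by the embodiments of the present disclosure.

```python
from typing import List, Sequence

def frame_audio(samples: Sequence[int], frame_length: int) -> List[List[int]]:
    """Split a mono PCM stream into consecutive frames of frame_length samples;
    the last frame is zero-padded so every frame has the same length."""
    frames = []
    for start in range(0, len(samples), frame_length):
        frame = list(samples[start:start + frame_length])
        frame.extend([0] * (frame_length - len(frame)))  # pad the tail frame
        frames.append(frame)
    return frames

# With a 48 kHz sampling rate, a 20 ms frame holds 960 samples.
sample_rate = 48000
frame_length = sample_rate * 20 // 1000
print(len(frame_audio([0] * 100000, frame_length)))  # 105 frames
```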
Optionally, in one embodiment of the disclosure, the processing module is further configured to perform the following (an illustrative sketch of this flow is given after this list):
storing the environmental space information of the sound object and the basic information of the sound object as a header file;
storing metadata of each frame of audio data and each frame of audio data as an object audio data packet;
and splicing the header file and the object audio data packet to obtain at least one object audio signal.
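A minimal sketch of this store-and-splice flow is given below. It assumes an invented byte layout (magic bytes and little-endian integers) that is not the actual bitstream of the embodiments of the present disclosure: the header file is written once, each frame becomes a length-prefixed object audio data packet, and the parts are spliced into one object audio signal.

```python
import struct
from typing import List

MAGIC = b"OBJA"  # invented marker for this toy layout

def pack_header(num_objects: int, sample_rate: int,
                bit_width: int, frame_length: int) -> bytes:
    """Header file: basic information of the acoustic objects (toy subset)."""
    return struct.pack("<4sIIII", MAGIC, num_objects, sample_rate,
                       bit_width, frame_length)

def pack_packet(metadata: bytes, audio: bytes) -> bytes:
    """Object audio data packet: one frame's metadata plus its audio samples,
    prefixed with the two lengths so a decoder can walk the stream."""
    return struct.pack("<II", len(metadata), len(audio)) + metadata + audio

def splice(header: bytes, packets: List[bytes]) -> bytes:
    """Splice the header file and the packets into one object audio signal."""
    return header + b"".join(packets)

signal = splice(pack_header(1, 48000, 16, 960),
                [pack_packet(b"\x04" + struct.pack("<3f", 90.0, 0.0, 0.0),
                             b"\x00" * 1920)])
print(len(signal))  # 20-byte header + 8 + 13 + 1920 = 1961 bytes
```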
Optionally, in one embodiment of the disclosure, the apparatus is further configured to:
encoding the object audio signal;
sending the encoded signal to a decoding device.
Fig. 9g is a schematic structural diagram of an audio processing apparatus according to an embodiment of the disclosure, where, as shown in fig. 9g, the apparatus may include:
an acquisition module 901g, configured to acquire an encoded signal sent by an encoding device;
a decoding module 902g, configured to decode the encoded signal to obtain an object audio signal;
a determining module 903g, configured to determine metadata of the object audio signal, where the metadata includes at least one of absolute position information of an acoustic object, relative position information of the acoustic object, orientation information of the acoustic object, and a sounding radiation range of the acoustic object;
a rendering module 904g, configured to render the object audio signal based on the metadata.
In summary, the beneficial effects of the audio processing apparatus provided by this embodiment of the present disclosure are the same as those of the audio processing method described above, and are not repeated herein.
Optionally, in one embodiment of the disclosure, the orientation information includes absolute orientation information and/or relative orientation information;
the relative orientation information is used to indicate a relative orientation between the acoustic object and a listening position.
Optionally, in one embodiment of the disclosure, the metadata further includes at least one of:
sound source size of the sound object;
the width of the acoustic object;
the height of the acoustic object;
a spatial state of an acoustic object, the spatial state comprising moving or stationary;
type of acoustic object.
Optionally, in one embodiment of the disclosure, the object audio signal includes a header file and an object audio data packet;
the header file includes environmental space information of the sound object and basic information of the sound object;
the object audio data packet includes metadata of audio data and audio data.
Optionally, in one embodiment of the disclosure, in response to the acoustic object being located in a room, the ambient space information includes at least one of:
room size;
room wall type;
wall reflection coefficient;
a room type;
reverberation time.
Optionally, in one embodiment of the disclosure, the basic information of the acoustic object includes at least one of:
Number of acoustic objects;
sampling rate of sound source of the sound object;
the sound source bit width of the sound object;
the frame length of each frame of audio data.
Optionally, in one embodiment of the disclosure, the rendering the object audio signal based on the metadata includes:
rendering the audio data based on the metadata and the header file (an illustrative rendering sketch is given below).
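To give a flavor of how the metadata and the header file could drive rendering, the sketch below applies two simple, commonly used effects: inverse-distance attenuation derived from the relative position, and a cardioid-style directivity gain derived from the orientation, so that an acoustic object facing away from the listening position sounds quieter. This is an assumed toy renderer, not the rendering algorithm of the embodiments of the present disclosure; render_frame and its parameters are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

def render_frame(audio: Sequence[float],
                 relative_position: Tuple[float, float, float],
                 facing_angle_deg: float) -> List[float]:
    """Toy mono renderer: gain = distance attenuation * cardioid directivity.

    relative_position: acoustic object's position relative to the listener (m).
    facing_angle_deg: angle between the object's facing direction and the
                      direction from the object to the listener (0 = facing).
    """
    x, y, z = relative_position
    distance = max(math.sqrt(x * x + y * y + z * z), 0.1)   # avoid divide-by-zero
    distance_gain = 1.0 / distance                           # inverse-distance law
    theta = math.radians(facing_angle_deg)
    directivity_gain = 0.5 * (1.0 + math.cos(theta))         # cardioid pattern
    gain = distance_gain * directivity_gain
    return [sample * gain for sample in audio]

# Same source, same distance: facing the listener vs. facing away.
print(render_frame([1.0], (2.0, 0.0, 0.0), 0.0))    # [0.5]  (facing the listener)
print(render_frame([1.0], (2.0, 0.0, 0.0), 180.0))  # [0.0]  (back of the source)
```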
Fig. 10 is a block diagram of a user equipment UE1000 provided in one embodiment of the present disclosure. For example, UE1000 may be a mobile phone, computer, digital broadcast terminal device, messaging device, game console, tablet device, medical device, fitness device, personal digital assistant, and the like.
Referring to fig. 10, the ue1000 may include at least one of the following components: a processing component 1002, a memory 1004, a power component 1006, a multimedia component 1008, an audio component 1010, an input/output (I/O) interface 1012, a sensor component 1013, and a communication component 1016.
The processing component 1002 generally controls overall operation of the UE1000, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 1002 can include at least one processor 1020 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 1002 can include at least one module that facilitates interaction between the processing component 1002 and other components. For example, the processing component 1002 can include a multimedia module to facilitate interaction between the multimedia component 1008 and the processing component 1002.
The memory 1004 is configured to store various types of data to support operations at the UE 1000. Examples of such data include instructions for any application or method operating on the UE1000, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 1004 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power supply component 1006 provides power to the various components of the UE 1000. The power supply component 1006 can include a power management system, at least one power supply, and other components associated with generating, managing, and distributing power for the UE 1000.
The multimedia component 1008 includes a screen providing an output interface between the UE 1000 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes at least one touch sensor to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or swipe action, but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 1008 includes a front-facing camera and/or a rear-facing camera. The front camera and/or the rear camera may receive external multimedia data when the UE 1000 is in an operation mode, such as a photographing mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component 1010 is configured to output and/or input audio signals. For example, the audio component 1010 includes a Microphone (MIC) configured to receive external audio signals when the UE1000 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in memory 1004 or transmitted via communication component 1016. In some embodiments, the audio component 1010 further comprises a speaker for outputting audio signals.
The I/O interface 1012 provides an interface between the processing assembly 1002 and peripheral interface modules, which may be a keyboard, click wheel, buttons, and the like. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor component 1013 includes at least one sensor for providing status assessments of various aspects of the UE 1000. For example, the sensor component 1013 may detect an on/off state of the UE 1000 and the relative positioning of components (such as the display and keypad of the UE 1000); the sensor component 1013 may also detect a change in position of the UE 1000 or of one of its components, the presence or absence of user contact with the UE 1000, the orientation or acceleration/deceleration of the UE 1000, and a change in temperature of the UE 1000. The sensor component 1013 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 1013 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 1013 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 1016 is configured to facilitate communication between the UE 1000 and other devices, either wired or wireless. The UE 1000 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 1016 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 1016 further includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the UE1000 may be implemented by at least one Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a controller, a microcontroller, a microprocessor, or other electronic components for performing the above-described methods.
Fig. 11 is a block diagram of a network side device 1100 provided by an embodiment of the disclosure. For example, the network side device 1100 may be provided as a server. Referring to fig. 11, the network side device 1100 includes a processing component 1122 that further includes at least one processor, and memory resources, represented by memory 1132, for storing instructions, such as application programs, executable by the processing component 1122. The application programs stored in memory 1132 may include one or more modules, each corresponding to a set of instructions. Further, the processing component 1122 is configured to execute instructions to perform any of the methods described above as applied to the network side device, e.g., as shown in fig. 1.
The network side device 1100 may also include a power component 1126 configured to perform power management of the network side device 1100, a wired or wireless network interface 1150 configured to connect the network side device 1100 to a network, and an input/output (I/O) interface 1158. The network side device 1100 may operate based on an operating system stored in the memory 1132, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In the embodiments provided in the present disclosure, the methods provided in the embodiments of the present disclosure are described from the perspectives of the network side device and the UE, respectively. In order to implement the functions in the methods provided by the embodiments of the present disclosure, the network side device and the UE may include a hardware structure and/or a software module, and implement the above functions in the form of a hardware structure, a software module, or a hardware structure plus a software module. Some of the above functions may be implemented in a hardware structure, a software module, or a combination of a hardware structure and a software module.
The embodiment of the disclosure provides a communication device. The communication device may include a transceiver module and a processing module. The transceiver module may include a transmitting module and/or a receiving module, where the transmitting module is configured to implement a transmitting function, the receiving module is configured to implement a receiving function, and the transceiver module may implement the transmitting function and/or the receiving function.
The communication device may be a terminal device (such as the terminal device in the foregoing method embodiment), or may be a device in the terminal device, or may be a device that can be used in a matching manner with the terminal device. Alternatively, the communication device may be a network device, a device in the network device, or a device that can be used in cooperation with the network device.
Another communication apparatus provided by an embodiment of the present disclosure. The communication device may be a network device, or may be a terminal device (such as the terminal device in the foregoing method embodiment), or may be a chip, a chip system, or a processor that supports the network device to implement the foregoing method, or may be a chip, a chip system, or a processor that supports the terminal device to implement the foregoing method. The device can be used for realizing the method described in the method embodiment, and can be particularly referred to the description in the method embodiment.
The communication device may include one or more processors. The processor may be a general purpose processor or a special purpose processor, etc. For example, a baseband processor or a central processing unit. The baseband processor may be used to process communication protocols and communication data, and the central processor may be used to control communication apparatuses (e.g., network side devices, baseband chips, terminal devices, terminal device chips, DUs or CUs, etc.), execute computer programs, and process data of the computer programs.
Optionally, the communication device may further include one or more memories, on which a computer program may be stored, and the processor executes the computer program, so that the communication device performs the method described in the above method embodiment. Optionally, the memory may further store data. The communication device and the memory may be provided separately or may be integrated.
Optionally, the communication device may further include a transceiver and an antenna. The transceiver may be referred to as a transceiver unit, a transceiver circuit, or the like, and is used to implement the transceiving function. The transceiver may include a receiver and a transmitter: the receiver may be referred to as a receiving machine or a receiving circuit, etc., for implementing the receiving function; the transmitter may be referred to as a transmitting machine or a transmitting circuit, etc., for implementing the transmitting function.
Optionally, one or more interface circuits may be included in the communication device. The interface circuit is used for receiving the code instruction and transmitting the code instruction to the processor. The processor executes the code instructions to cause the communication device to perform the method described in the method embodiments above.
When the communication device is a terminal device (such as the terminal device in the foregoing method embodiments), the processor is configured to perform the method shown in any of figures 1-4.
When the communication device is a network device, the transceiver is configured to perform the method shown in any of figures 5-7.
In one implementation, a transceiver for implementing the receive and transmit functions may be included in the processor. For example, the transceiver may be a transceiver circuit, or an interface circuit. The transceiver circuitry, interface or interface circuitry for implementing the receive and transmit functions may be separate or may be integrated. The transceiver circuit, interface or interface circuit may be used for reading and writing codes/data, or the transceiver circuit, interface or interface circuit may be used for transmitting or transferring signals.
In one implementation, a processor may have a computer program stored thereon, which, when executed on the processor, may cause a communication device to perform the method described in the method embodiments above. The computer program may be solidified in the processor, in which case the processor may be implemented in hardware.
In one implementation, a communication device may include circuitry that may implement the functions of transmitting or receiving or communicating in the foregoing method embodiments. The processors and transceivers described in this disclosure may be implemented on Integrated Circuits (ICs), analog ICs, radio frequency integrated circuits RFICs, mixed signal ICs, application specific integrated circuits (application specific integrated circuit, ASICs), printed circuit boards (printed circuit board, PCBs), electronic devices, and the like. The processor and transceiver may also be fabricated using a variety of IC process technologies such as complementary metal oxide semiconductor (complementary metal oxide semiconductor, CMOS), N-type metal oxide semiconductor (NMOS), P-type metal oxide semiconductor (positive channel metal oxide semiconductor, PMOS), bipolar junction transistor (bipolar junction transistor, BJT), bipolar CMOS (BiCMOS), silicon germanium (SiGe), gallium arsenide (GaAs), etc.
The communication apparatus described in the above embodiment may be a network device or a terminal device (such as the terminal device in the foregoing method embodiment), but the scope of the communication apparatus described in the present disclosure is not limited thereto, and the structure of the communication apparatus may not be limited. The communication means may be a stand-alone device or may be part of a larger device. For example, the communication device may be:
(1) A stand-alone integrated circuit IC, or chip, or a system-on-a-chip or subsystem;
(2) A set of one or more ICs, optionally including storage means for storing data, a computer program;
(3) An ASIC, such as a Modem (Modem);
(4) Modules that may be embedded within other devices;
(5) A receiver, a terminal device, an intelligent terminal device, a cellular phone, a wireless device, a handset, a mobile unit, a vehicle-mounted device, a network device, a cloud device, an artificial intelligent device, and the like;
(6) Others, and so on.
In the case where the communication device is a chip or a chip system, the chip includes a processor and an interface. There may be one or more processors, and there may be multiple interfaces.
Optionally, the chip further comprises a memory for storing the necessary computer programs and data.
Those of skill in the art will further appreciate that the various illustrative logical blocks (illustrative logical block) and steps (step) described in connection with the embodiments of the disclosure may be implemented by electronic hardware, computer software, or combinations of both. Whether such functionality is implemented as hardware or software depends upon the particular application and the design requirements of the overall system. Those skilled in the art may implement the described functionality in varying ways for each particular application, but such implementation should not be understood as going beyond the scope of the embodiments of the present disclosure.
The embodiments of the present disclosure also provide a communication system, where the system includes the communication apparatus serving as the terminal device (such as the terminal device in the foregoing method embodiments) and the communication apparatus serving as the network device in the foregoing embodiments.
The present disclosure also provides a readable storage medium having instructions stored thereon which, when executed by a computer, perform the functions of any of the method embodiments described above.
The present disclosure also provides a computer program product which, when executed by a computer, performs the functions of any of the method embodiments described above.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product comprises one or more computer programs. When the computer program is loaded and executed on a computer, the flow or functions described in accordance with the embodiments of the present disclosure are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer program may be stored in or transmitted from one computer readable storage medium to another, for example, by wired (e.g., coaxial cable, optical fiber, digital subscriber line (digital subscriber line, DSL)) or wireless (e.g., infrared, wireless, microwave, etc.) means from one website, computer, server, or data center. The computer readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains an integration of one or more available media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a high-density digital video disc (digital video disc, DVD)), or a semiconductor medium (e.g., a Solid State Disk (SSD)), or the like.
Those of ordinary skill in the art will appreciate that the various numbers such as "first" and "second" referred to in this disclosure are merely for ease of description, and are not intended to limit the scope of the embodiments of this disclosure or to indicate an ordering.
In the present disclosure, "at least one" may also be described as one or more, and "a plurality" may be two, three, four, or more, which is not limited by the present disclosure. In the embodiments of the present disclosure, for a technical feature, the technical features within it are distinguished by "first", "second", "third", "A", "B", "C", and "D", and the technical features so described carry no order of precedence or magnitude.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (25)

  1. An audio processing method, performed by an encoding device, the method comprising:
    determining metadata of each frame of audio data, wherein the metadata comprises at least one of absolute position information of an acoustic object, relative position information of the acoustic object, orientation information of the acoustic object and sounding radiation range of the acoustic object in the audio data;
    and obtaining an object audio signal based on the metadata of the audio data.
  2. The method of claim 1, wherein the determining metadata for each frame of audio data comprises:
    determining whether absolute position information or relative position information needs to be included in the metadata;
    wherein, in response to determining that absolute position information needs to be included in the metadata, the absolute position information is included in the metadata;
    in response to determining that relative position information needs to be included in the metadata, the relative position information is included in the metadata.
  3. The method of claim 1, wherein the determining metadata for each frame of audio data comprises:
    determining whether the acoustic object has an orientation;
    responsive to the acoustic object having an orientation, including orientation information of the acoustic object into the metadata, and including a marker in the metadata, the marker being for indicating that the orientation information is included in the metadata;
    in response to the acoustic object not having an orientation, no orientation information is contained in the metadata.
  4. The method of claim 3, wherein the orientation information comprises absolute orientation information and/or relative orientation information;
    the relative orientation information is used to indicate a relative orientation between the acoustic object and a listening position.
  5. The method of claim 1, wherein the metadata further comprises at least one of:
    a sound source size of the acoustic object;
    a width of the acoustic object;
    a height of the acoustic object;
    a spatial state of the acoustic object, the spatial state being moving or stationary; and
    a type of the acoustic object.
  6. The method of any one of claims 1-5, further comprising:
    determining environmental spatial information of the acoustic object;
    determining basic information of the acoustic object; and
    sampling audio data of the acoustic object in units of frames.
  7. The method of claim 6, wherein, in response to the acoustic object being located in a room, the environmental spatial information comprises at least one of:
    a room size;
    a room wall type;
    a wall reflection coefficient;
    a room type; and
    a reverberation time.
  8. The method of claim 6, wherein the basic information of the acoustic object comprises at least one of:
    a number of acoustic objects;
    a sound source sampling rate of the acoustic object;
    a sound source bit width of the acoustic object; and
    a frame length of each frame of audio data.
  9. The method of claim 6, wherein the obtaining an object audio signal based on the metadata of the audio data comprises:
    storing the environmental spatial information of the acoustic object and the basic information of the acoustic object as a header file;
    storing the metadata of each frame of audio data together with that frame of audio data as an object audio data packet; and
    splicing the header file and the object audio data packets to obtain at least one object audio signal.
  10. The method of claim 1, further comprising:
    encoding the object audio signal; and
    sending the encoded signal to a decoding device.
  11. An audio processing method, performed by a decoding device, the method comprising:
    acquiring an encoded signal sent by an encoding device;
    decoding the encoded signal to obtain an object audio signal;
    determining metadata of the object audio signal, wherein the metadata comprises at least one of: absolute position information of an acoustic object, relative position information of the acoustic object, orientation information of the acoustic object, and a sounding radiation range of the acoustic object;
    rendering the object audio signal based on the metadata.
  12. The method of claim 11, wherein the orientation information comprises absolute orientation information and/or relative orientation information; and
    the relative orientation information indicates a relative orientation between the acoustic object and a listening position.
  13. The method of claim 11, wherein the metadata further comprises at least one of:
    a sound source size of the acoustic object;
    a width of the acoustic object;
    a height of the acoustic object; and
    a spatial state of the acoustic object, the spatial state being moving or stationary.
  14. The method of claim 11, wherein the object audio signal comprises a header file and an object audio data packet;
    the header file includes environmental spatial information of the acoustic object and basic information of the acoustic object; and
    the object audio data packet includes audio data and metadata of the audio data.
  15. The method of claim 14, wherein, in response to the acoustic object being located in a room, the environmental spatial information comprises at least one of:
    a room size;
    a room wall type;
    a wall reflection coefficient;
    a room type; and
    a reverberation time.
  16. The method of claim 14, wherein the basic information of the acoustic object comprises at least one of:
    a number of acoustic objects;
    a sound source sampling rate of the acoustic object;
    a sound source bit width of the acoustic object; and
    a frame length of each frame of audio data.
  17. The method of any one of claims 14-16, wherein the rendering the object audio signal based on the metadata comprises:
    rendering the audio data based on the metadata and the header file.
  18. An audio processing apparatus, comprising:
    a determining module, configured to determine metadata of each frame of audio data, wherein the metadata comprises at least one of: absolute position information of an acoustic object in the audio data, relative position information of the acoustic object, orientation information of the acoustic object, and a sounding radiation range of the acoustic object; and
    a processing module, configured to obtain an object audio signal based on the metadata of the audio data.
  19. An audio processing apparatus, comprising:
    an acquiring module, configured to acquire an encoded signal sent by an encoding device;
    a decoding module, configured to decode the encoded signal to obtain an object audio signal;
    a determining module, configured to determine metadata of the object audio signal, wherein the metadata comprises at least one of: absolute position information of an acoustic object, relative position information of the acoustic object, orientation information of the acoustic object, and a sounding radiation range of the acoustic object; and
    a rendering module, configured to render the object audio signal based on the metadata.
  20. A communication device, comprising a processor and a memory, wherein the memory stores a computer program, and the processor executes the computer program stored in the memory to cause the device to perform the method of any one of claims 1 to 10.
  21. A communication device, comprising a processor and a memory, wherein the memory stores a computer program, and the processor executes the computer program stored in the memory to cause the device to perform the method of any one of claims 11 to 17.
  22. A communication device, comprising a processor and an interface circuit, wherein:
    the interface circuit is configured to receive code instructions and transmit the code instructions to the processor; and
    the processor is configured to execute the code instructions to perform the method of any one of claims 1 to 10.
  23. A communication device, comprising a processor and an interface circuit, wherein:
    the interface circuit is configured to receive code instructions and transmit the code instructions to the processor; and
    the processor is configured to execute the code instructions to perform the method of any one of claims 11 to 17.
  24. A computer-readable storage medium storing instructions that, when executed, cause the method of any one of claims 1 to 10 to be implemented.
  25. A computer-readable storage medium storing instructions that, when executed, cause the method of any one of claims 11 to 17 to be implemented.
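To make the data layout recited in claims 1-9, and consumed again on the decoding side in claims 11-17, easier to picture, the following minimal Python sketch models the per-frame metadata, the header file of environmental and basic information, the object audio data packets, and the splice into an object audio signal. The claims fix no field names, types, or bitstream syntax, so every identifier below (ObjectMetadata, HeaderFile, ObjectAudioPacket, build_object_audio_signal, render) is a hypothetical assumption, not the claimed format.

```python
# Illustrative sketch only: names and layout are assumptions, not the
# claimed encoding.
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]

@dataclass
class ObjectMetadata:
    """Per-frame metadata for one acoustic object (claims 1-5)."""
    absolute_position: Optional[Vec3] = None   # claim 2: either absolute ...
    relative_position: Optional[Vec3] = None   # ... or relative position
    has_orientation: bool = False              # claim 3: marker flag
    orientation: Optional[Vec3] = None         # claim 4: absolute and/or relative
    radiation_range: Optional[float] = None    # sounding radiation range
    source_size: Optional[float] = None        # claim 5: optional fields
    width: Optional[float] = None
    height: Optional[float] = None
    is_moving: Optional[bool] = None           # spatial state: moving/stationary

@dataclass
class HeaderFile:
    """Environmental spatial information plus basic information (claims 6-8)."""
    num_objects: int                           # basic information (claim 8)
    sampling_rate_hz: int
    bit_width: int
    frame_length: int                          # samples per frame
    room_size: Optional[Vec3] = None           # environmental information,
    wall_reflection: Optional[float] = None    # present when the object is
    reverberation_time_s: Optional[float] = None  # located in a room (claim 7)

@dataclass
class ObjectAudioPacket:
    """One frame of audio data stored together with its metadata (claim 9)."""
    metadata: ObjectMetadata
    samples: bytes                             # frame_length * bit_width / 8 bytes

def build_object_audio_signal(header: HeaderFile,
                              packets: List[ObjectAudioPacket]) -> list:
    """Splice the header file and the per-frame packets into one object
    audio signal, mirroring the last step of claim 9."""
    return [header, *packets]

def render(signal: list) -> None:
    """Decoder side (claims 11-17): read the header once, then hand each
    packet's metadata and samples to a renderer."""
    header, *packets = signal
    for packet in packets:
        # A real renderer would place the object using the position and
        # orientation fields and simulate the room described by the header.
        print(packet.metadata.absolute_position, len(packet.samples))
```

Under these assumptions, an encoder would emit one ObjectAudioPacket per frame and per acoustic object, and a decoder would parse the header once before rendering each packet; the actual bitstream syntax is left open by the claims.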
CN202280001320.1A 2022-05-05 2022-05-05 Audio processing method, device and storage medium Pending CN117581566A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/091052 WO2023212880A1 (en) 2022-05-05 2022-05-05 Audio processing method and apparatus, and storage medium

Publications (1)

Publication Number Publication Date
CN117581566A (en)

Family

ID=88646108

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280001320.1A Pending CN117581566A (en) 2022-05-05 2022-05-05 Audio processing method, device and storage medium

Country Status (2)

Country Link
CN (1) CN117581566A (en)
WO (1) WO2023212880A1 (en)

Also Published As

Publication number Publication date
WO2023212880A1 (en) 2023-11-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination