WO2023212880A1 - Audio processing method and apparatus, and storage medium - Google Patents

Audio processing method and apparatus, and storage medium

Info

Publication number
WO2023212880A1
Authority
WO
WIPO (PCT)
Prior art keywords
sound
metadata
audio data
sound object
information
Prior art date
Application number
PCT/CN2022/091052
Other languages
French (fr)
Chinese (zh)
Inventor
吕柱良
史润宇
吕雪洋
刘晗宇
Original Assignee
北京小米移动软件有限公司 (Beijing Xiaomi Mobile Software Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京小米移动软件有限公司 (Beijing Xiaomi Mobile Software Co., Ltd.)
Priority to CN202280001320.1A (published as CN117581566A)
Priority to PCT/CN2022/091052
Publication of WO2023212880A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16 Vocoder architecture
    • G10L19/18 Vocoders using multiple modes
    • G10L19/20 Vocoders using multiple modes using sound class specific coding, hybrid encoders or object based coding
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R3/00 Circuits for transducers, loudspeakers or microphones
    • H04R3/04 Circuits for transducers, loudspeakers or microphones for correcting frequency response
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S7/00 Indicating arrangements; Control arrangements, e.g. balance control

Definitions

  • the present disclosure relates to the field of communication technology, and in particular, to an audio processing method, apparatus, and device, and a storage medium.
  • when the encoding device collects audio data to produce an object audio signal, it includes relative position information between the sound object and the listener's listening position in the metadata of the object audio signal.
  • when the decoding device renders the object audio signal, it can render spatial audio based on the relative position information, so that the listener hears the sound coming from a specific direction, thereby giving the user a better three-dimensional and immersive spatial experience.
  • the audio processing method, apparatus, device, and storage medium proposed by the present disclosure are used to solve the technical problems of low coding efficiency and poor rendering effect of object audio signals in the related art.
  • the audio processing method proposed in one aspect of the present disclosure is applied to encoding equipment, including:
  • the metadata including at least one of the absolute position information of the sound object in the audio data, the relative position information of the sound object, the orientation information of the sound object, and the sound radiation range of the sound object;
  • An object audio signal is obtained based on the metadata of the audio data.
  • determining the metadata of each frame of audio data includes:
  • the relative position information is used to indicate the relative position between the sound object and the listening position of the listener.
  • determining the metadata of each frame of audio data includes:
  • the orientation information of the sound object is included in the metadata, and a mark is included in the metadata, the mark being used to indicate that the metadata includes the orientation information.
  • the orientation information includes absolute orientation information and/or relative orientation information
  • the relative orientation information is used to indicate the relative orientation between the sound object and the listening position.
  • the metadata further includes at least one of the following:
  • the spatial state of the sound object, which includes moving or stationary;
  • the type of the sound object.
  • the method further includes:
  • Audio data of the sound object is sampled in units of frames.
  • in response to the sound object being located in a room, the environmental spatial information includes at least one of the following:
  • the basic information of the sound object includes at least one of the following:
  • the sound source width of the sound object
  • the frame length of each frame of audio data.
  • obtaining an object audio signal based on the metadata of the audio data includes:
  • the header file and the object audio data packet are spliced to obtain at least one object audio signal.
  • the audio processing method proposed by another embodiment of the present disclosure is applied to decoding equipment, including:
  • the metadata including at least one of the absolute position information of the sound object, the relative position information of the sound object, the orientation information of the sound object, and the sound radiation range of the sound object;
  • the object audio signal is rendered based on the metadata.
  • the orientation information includes absolute orientation information and/or relative orientation information
  • the relative orientation information is used to indicate the relative orientation between the sound object and the listening position.
  • the metadata further includes at least one of the following:
  • the spatial state of the sound object, which includes moving or stationary;
  • the type of the sound object.
  • the object audio signal includes a header file and an object audio data packet
  • the header file includes environmental space information of the sound object and basic information of the sound object;
  • the object audio data packet includes audio data metadata and audio data.
  • in response to the sound object being located in a room, the environmental spatial information includes at least one of the following:
  • the basic information of the sound object includes at least one of the following:
  • the sound source width of the sound object
  • the frame length of each frame of audio data.
  • rendering the object audio signal based on the metadata includes:
  • the audio data is rendered based on the metadata and the header file.
  • the method further includes:
  • an audio processing device including:
  • a determining module, used to determine the metadata of each frame of audio data,
  • the metadata including at least one of the absolute position information of the sound object in the audio data, the relative position information of the sound object, the orientation information of the sound object, and the sound radiation range of the sound object;
  • a processing module configured to obtain an object audio signal based on the metadata of the audio data.
  • an audio processing device including:
  • an acquisition module, used to obtain the encoded signal sent by the encoding device;
  • a decoding module, used to decode the encoded signal to obtain the object audio signal;
  • a determining module, used to determine the metadata of the object audio signal, the metadata including at least one of the absolute position information of the sound object, the relative position information of the sound object, the orientation information of the sound object, and the sound radiation range of the sound object;
  • a rendering module configured to render the object audio signal based on the metadata.
  • the device includes a processor and a memory.
  • a computer program is stored in the memory.
  • the processor executes the computer program stored in the memory so that the device performs the method proposed in the above embodiment.
  • a communication device provided by another embodiment of the present disclosure includes: a processor and an interface circuit
  • the interface circuit is used to receive code instructions and transmit them to the processor
  • the processor is configured to run the code instructions to perform the method proposed in another embodiment.
  • a computer-readable storage medium provided by an embodiment of another aspect of the present disclosure is used to store instructions. When the instructions are executed, the method proposed by the embodiment of another aspect is implemented.
  • in the embodiment of the present disclosure, the encoding device determines the metadata of each frame of audio data, where the metadata includes at least one of the absolute position information of the sound object in the audio data, the relative position information of the sound object, the orientation information of the sound object, and the sound radiation range of the sound object; the encoding device then obtains the object audio signal based on the metadata of the audio data.
  • the metadata may include the absolute position information of the sound object. Based on this, when recording audio data or producing an object audio signal, if the absolute positions of multiple sound objects are fixed but the listening position keeps moving, the metadata can include the absolute position information of the sound object.
  • in this case, the metadata of a certain frame of audio data (such as the first frame) may include only the absolute position information of the sound object.
  • the audio data of the other frames can reuse the absolute position information in that frame, without requiring the metadata of every frame of audio data to include it, thereby reducing the amount of metadata and the transmission bandwidth, improving coding efficiency, and ensuring that the renderer can subsequently render the position of the sound object correctly and provide a correct spatial audio perception.
  • in addition, the metadata in the embodiment of the present disclosure may include the orientation information of the sound object and the sound radiation range of the sound object, so that when the decoding device subsequently renders the object audio signal, it can render based on the orientation information and the sound radiation range, simulating the hearing differences caused by the actual orientation of the sounding object and improving the rendering effect.
  • Figure 1 is a schematic flowchart of an audio processing method provided by an embodiment of the present disclosure;
  • Figure 2 is a schematic flowchart of an audio processing method provided by another embodiment of the present disclosure;
  • Figures 3a-3b are schematic flowcharts of an audio processing method provided by yet another embodiment of the present disclosure;
  • Figure 4 is a schematic flowchart of an audio processing method provided by yet another embodiment of the present disclosure;
  • Figure 5 is a schematic flowchart of an audio processing method provided by yet another embodiment of the present disclosure;
  • Figures 6a-6b are schematic flowcharts of an audio processing method provided by yet another embodiment of the present disclosure;
  • Figure 7 is a schematic flowchart of an audio processing method provided by an embodiment of the present disclosure;
  • Figure 8 is a schematic flowchart of an audio processing method provided by an embodiment of the present disclosure;
  • Figures 9a-9e are schematic flowcharts of an audio processing method provided by an embodiment of the present disclosure;
  • Figure 9f is a schematic structural diagram of an audio processing device provided by an embodiment of the present disclosure;
  • Figure 9g is a schematic structural diagram of an audio processing device provided by an embodiment of the present disclosure;
  • Figure 10 is a block diagram of a user equipment provided by an embodiment of the present disclosure;
  • Figure 11 is a block diagram of a network-side device provided by an embodiment of the present disclosure.
  • although the terms first, second, third, etc. may be used to describe various information in the embodiments of the present disclosure, the information should not be limited to these terms. These terms are only used to distinguish information of the same type from each other.
  • first information may also be called second information, and similarly, the second information may also be called first information.
  • the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
  • FIG. 1 is a schematic flow chart of an audio processing method provided by an embodiment of the present disclosure. The method is executed by an encoding device. As shown in Figure 1, the audio processing method may include the following steps:
  • Step 101 Determine metadata of each frame of audio data.
  • the metadata may include at least one of the absolute position information of the sound object in each frame of audio data, the relative position information of the sound object, the orientation information of the sound object, and the sound radiation range of the sound object.
  • the above-mentioned relative position information may be used to indicate the relative position between the sound object and the listening position of the listener.
  • the absolute position information and relative position information may specifically be mapping information of the absolute position or relative position of the sound object on the coordinate system.
  • the above-mentioned absolute position may be, for example, the longitude and latitude of the sound object, etc.; the above-mentioned relative position may be, for example, the distance, azimuth angle, pitch angle, etc. between the sound object and the listener.
  • the listening position of the listener can be any position, or the position of any sound object.
  • the method for determining the absolute position information of the sound object may include: first obtaining the absolute position of each sound object, and then establishing an absolute coordinate system.
  • the origin of the absolute coordinate system may be determined as any position, and the origin of the absolute coordinate system is fixed. Then, the absolute position of each sound object is mapped onto the absolute coordinate system to obtain the absolute position information of the sound object.
  • the absolute coordinate system may be a rectangular coordinate system
  • the absolute position information of the sound object may be (x, y, z), where x, y, and z respectively represent the position coordinates of the sound object on the x-axis (such as the axis in the front-back direction), y-axis (such as the axis in the left-right direction), and z-axis (such as the axis in the up-down direction) of the rectangular coordinate system.
  • the absolute coordinate system may be a spherical coordinate system
  • the absolute position information of the sound object may be (θ, φ, r), where θ, φ, and r respectively represent the horizontal direction angle of the sound object in the spherical coordinate system (i.e., the angle between the x-axis and the projection onto the horizontal plane of the line connecting the sound object and the origin), the vertical direction angle (i.e., the angle between the horizontal plane and the line connecting the sound object and the origin), and the straight-line distance of the sound object from the origin.
  • the method for determining the relative position information of the sound object may include: first obtaining the relative position between each sound object and the listening position of the listener, and then establishing a relative coordinate system.
  • the origin of the relative coordinate system is always the listening position, and when the listening position changes, the origin of the relative coordinate system changes with it. Afterwards, the relative position between each sound object and the listening position is mapped onto the relative coordinate system to obtain the relative position information of the sound object.
  • the relative coordinate system may be a rectangular coordinate system
  • the relative position information of the sound object may be (x, y, z), where x, y, and z respectively represent the position coordinates of the sound object on the x-axis (such as the axis in the front-back direction), y-axis (such as the axis in the left-right direction), and z-axis (such as the axis in the up-down direction) of the rectangular coordinate system.
  • the relative coordinate system may be a spherical coordinate system
  • the relative position information of the sound object may be (θ, φ, r), where θ, φ, and r respectively represent the horizontal direction angle of the sound object in the spherical coordinate system (i.e., the angle between the x-axis and the projection onto the horizontal plane of the line connecting the sound object and the origin), the vertical direction angle (i.e., the angle between the horizontal plane and the line connecting the sound object and the origin), and the straight-line distance of the sound object from the origin.
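  • as an illustration of the coordinate mappings above, the following minimal sketch maps an absolute position into the listener-centred relative coordinate system and converts rectangular coordinates to spherical ones; the function and field names are illustrative assumptions, not taken from the present disclosure:

```python
import math

def absolute_to_relative(obj_xyz, listener_xyz):
    """Map an absolute position into the listener-centred (relative)
    rectangular coordinate system: the origin follows the listening position."""
    return tuple(o - l for o, l in zip(obj_xyz, listener_xyz))

def rectangular_to_spherical(x, y, z):
    """Convert (x, y, z) to (theta, phi, r): horizontal direction angle against
    the x-axis, vertical direction angle against the horizontal plane, and the
    straight-line distance from the origin."""
    r = math.sqrt(x * x + y * y + z * z)
    theta = math.degrees(math.atan2(y, x))                # horizontal direction angle
    phi = math.degrees(math.asin(z / r)) if r else 0.0    # vertical direction angle
    return theta, phi, r

# Example: a sound object at (3, 4, 0) heard from a listener at the origin
rel = absolute_to_relative((3.0, 4.0, 0.0), (0.0, 0.0, 0.0))
print(rectangular_to_spherical(*rel))  # -> (~53.13, 0.0, 5.0)
```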
  • the above-mentioned method of obtaining the absolute position or relative position of each sound object may include: using a sensor or a combination of sensors, such as displacement sensors, position sensors, attitude sensors (such as gyroscopes and ultrasonic rangefinders), positioning sensors, geomagnetic sensors, direction sensors, and accelerometers, to obtain the absolute position or relative position of the sound object.
  • the distance between the sound object and the listener in the relative position can also be obtained through inertial navigation technology and initial alignment technology.
  • the absolute position or relative position of each sound object may also be obtained based on user input.
  • the absolute position or relative position of each sound object may also be generated based on a program.
  • the above-mentioned orientation information of the sound object may specifically be absolute orientation information of the sound object (such as facing true south or true north, etc.).
  • the orientation information of the sound object may also be relative orientation information of the sound object, and the relative orientation information may be used to indicate the relative orientation between the sound object and the listening position; for example, the relative orientation information can be: the sound object is located 30° south by west of the listening position.
  • the orientation information of the acoustic object can be obtained using any of the above sensors or obtained based on user input or generated based on a program.
  • the above-mentioned sound radiation range of the sound object may be a parameter used to describe the radiation characteristics of the sound object.
  • the sound radiation range of the sound object can be used to indicate the sound radiation angle of the sound object.
  • for example, the sound radiation range of the sound object can be: the sound object radiates sound 90° to the front; alternatively, the sound radiation range of the sound object can be: the sound object radiates sound over 360°.
  • the sound radiation range of the sound object may be the sound radiation shape of the sound object.
  • for example, the sound radiation range of the sound object may be: the sound object emits sound radiation according to a heart shape (cardioid); alternatively, the sound radiation range may be: the sound object's sound radiation follows a figure-8 shape.
  • the sound radiation range of the sound object can be obtained using any of the above sensors or obtained based on user input or generated based on a program.
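  • as a rough illustration of how a sound radiation range could be used, the sketch below evaluates a directivity gain for the shapes mentioned above; the function name and gain formulas (standard cardioid and figure-8 patterns) are illustrative assumptions, not the patent's definitions:

```python
import math

def radiation_gain(shape, angle_deg):
    """Gain heard at `angle_deg` off the sound object's facing direction."""
    a = math.radians(angle_deg)
    if shape == "omni":        # radiates sound over 360 degrees
        return 1.0
    if shape == "cardioid":    # heart-shaped radiation
        return 0.5 * (1.0 + math.cos(a))
    if shape == "figure8":     # figure-8 shaped radiation
        return abs(math.cos(a))
    raise ValueError(shape)

print(radiation_gain("cardioid", 0))    # 1.0: full level in front
print(radiation_gain("cardioid", 180))  # 0.0: silent behind the object
```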
  • the above-mentioned metadata of each frame of audio data may also include at least one of the following:
  • the spatial state of the sound object including moving or stationary;
  • Type of sound object (such as speech, music, etc.).
  • the size of the sound source of the sound object, the width of the sound object, the height of the sound object, and the spatial state of the sound object can also be obtained through any of the above sensors or obtained based on user input or generated based on a program.
  • each item of content in the metadata is stored with a corresponding flag bit, used to indicate whether the parameters of that content have changed relative to the corresponding content in the metadata of the previous frame of audio data.
  • for example, the azimuth angle in the metadata is stored with a corresponding azimuth angle flag: if the azimuth angle in the metadata of the current frame of audio data has not changed relative to the azimuth angle in the metadata of the previous frame of audio data, the azimuth angle flag can be set to a first value (such as 1); otherwise, it can be set to a second value (such as 0).
  • in this case, that content need not be included in the metadata of the current frame of audio data, and the corresponding content in the metadata of the previous frame of audio data can be directly reused, thereby reducing the data volume and transmission bandwidth of the metadata to a certain extent, reducing the data to be compressed, and improving encoding efficiency without affecting the final decoding and rendering effect.
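  • a minimal sketch of this flag-bit mechanism follows, assuming one flag per metadata field with 1 meaning "unchanged, reuse the previous frame" (matching the azimuth example above); the field names are illustrative:

```python
def encode_frame_metadata(current, previous):
    """Re-send a field only when it changed since the previous frame."""
    packet = {}
    for field, value in current.items():
        unchanged = previous is not None and previous.get(field) == value
        packet[field + "_flag"] = 1 if unchanged else 0
        if not unchanged:
            packet[field] = value  # only changed fields are carried
    return packet

def decode_frame_metadata(packet, previous):
    """Rebuild full metadata, reusing the previous frame where flagged."""
    decoded = {}
    for field in ("azimuth", "pitch", "distance"):
        if packet.get(field + "_flag") == 1:
            decoded[field] = previous[field]  # reuse previous frame's value
        else:
            decoded[field] = packet[field]
    return decoded

prev = {"azimuth": 30.0, "pitch": 0.0, "distance": 2.0}
cur = {"azimuth": 30.0, "pitch": 5.0, "distance": 2.0}
pkt = encode_frame_metadata(cur, prev)   # only "pitch" is carried
assert decode_frame_metadata(pkt, prev) == cur
```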
  • Step 102 Obtain the object audio signal based on the metadata of the audio data.
  • in the embodiment of the present disclosure, the encoding device determines the metadata of each frame of audio data, where the metadata includes at least one of the absolute position information of the sound object in the audio data, the relative position information of the sound object, the orientation information of the sound object, and the sound radiation range of the sound object; the encoding device then obtains the object audio signal based on the metadata of the audio data.
  • the metadata may include the absolute position information of the sound object. Based on this, when recording audio data or producing an object audio signal, if the absolute positions of multiple sound objects are fixed but the listening position keeps moving, the metadata can include the absolute position information of the sound object.
  • in this case, the metadata of a certain frame of audio data (such as the first frame) may include only the absolute position information of the sound object.
  • the audio data of the other frames can reuse the absolute position information in that frame, without requiring the metadata of every frame of audio data to include it, thereby reducing the amount of metadata and the transmission bandwidth, improving coding efficiency, and ensuring that the renderer can subsequently render the position of the sound object correctly and provide a correct spatial audio perception, without affecting the final decoding and rendering effect.
  • in addition, the metadata in the embodiment of the present disclosure may include the orientation information of the sound object and the sound radiation range of the sound object, so that when the decoding device subsequently renders the object audio signal, it can render based on the orientation information and the sound radiation range, simulating the hearing differences caused by the actual orientation of the sounding object and improving the rendering effect.
  • FIG. 2 is a schematic flowchart of an audio processing method provided by an embodiment of the present disclosure. The method is executed by an encoding device. As shown in Figure 2, the audio processing method may include the following steps:
  • Step 201 Determine the environmental space information of the sound object.
  • when the sound object is located in a room, the environmental spatial information includes at least one of the following:
  • Room type (such as large room, small room, conference room, auditorium, hall, etc.);
  • the environmental space information can be obtained using any of the above sensors or obtained based on user input or generated based on a program.
  • the absolute coordinate system or the relative coordinate system can be established based on the environmental space information.
  • Step 202 Determine the basic information of the sound object.
  • the basic information of the sound object may include at least one of the following:
  • the sound source width of the sound object
  • the frame length of each frame of audio data.
  • the basic information of the acoustic object can be obtained using any of the above-mentioned sensors or obtained based on user input or generated based on a program.
  • Step 203 Sample the audio data of the sound object in frame units.
  • a sound collection device (such as a microphone) can be used to sample the audio data of the sound object in frame units, and all sampling points included in the current frame can be saved as PCM (pulse code modulation) data.
  • Table 1 and Table 2 are schematic tables of the storage syntax of audio data provided by embodiments of the present disclosure.
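  • the following sketch illustrates frame-based sampling storage under the assumption of 16-bit little-endian PCM; the actual storage syntax is the one defined by Tables 1 and 2:

```python
import struct

def frames_from_samples(samples, frame_length):
    """Split a mono sample stream into frames of `frame_length` samples,
    each saved as a PCM16 payload."""
    for i in range(0, len(samples), frame_length):
        frame = samples[i:i + frame_length]
        yield struct.pack("<%dh" % len(frame), *frame)  # little-endian int16

pcm_frames = list(frames_from_samples([0, 512, -512, 1024], frame_length=2))
print(len(pcm_frames), len(pcm_frames[0]))  # 2 frames, 4 bytes each
```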
  • Step 204 Determine metadata of each frame of audio data.
  • for a detailed introduction to step 204, reference may be made to the description of the above embodiments, which will not be repeated here.
  • Step 205 Obtain the object audio signal based on the metadata of the audio data.
  • a method of obtaining an object audio signal based on the metadata of the audio data may include the following steps:
  • Step 1 Store the environmental space information of the sound object and the basic information of the sound object as a header file.
  • Table 3 is a schematic table of the storage syntax of the header file provided by the embodiment of the present disclosure.
  • Step 2 Store the metadata of each frame of audio data and each frame of audio data as an object audio data packet.
  • Table 4 is a schematic table of the storage syntax of the object audio data packet provided by the embodiment of the present disclosure.
  • Step 3 Splice the header file and the object audio data packet to obtain at least one object audio signal.
  • in the embodiment of the present disclosure, the encoding device can save or transmit the object audio signal as needed, or it can encode the object audio signal into other formats and then save or transmit it.
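  • the sketch below illustrates steps 1-3 under an assumed length-prefixed JSON container; the actual storage syntax of the header file and the object audio data packet is the one defined by Tables 3 and 4:

```python
import json
import struct

def _chunk(payload: bytes) -> bytes:
    return struct.pack("<I", len(payload)) + payload  # 4-byte length prefix

def build_object_audio_signal(env_info, basic_info, frames):
    """Splice a header file (environmental space information + basic
    information of the sound object) with per-frame object audio data
    packets (metadata + audio data) into one object audio signal."""
    header = json.dumps({"environment": env_info, "basic": basic_info}).encode()
    signal = _chunk(header)                      # header file first
    for metadata, audio in frames:               # then the object audio packets
        signal += _chunk(json.dumps(metadata).encode()) + _chunk(audio)
    return signal

signal = build_object_audio_signal(
    {"room_type": "conference room"},             # environmental space information
    {"source_width": 0.5, "frame_length": 1024},  # basic information
    [({"azimuth": 30.0}, b"\x00\x02"), ({"azimuth": 31.0}, b"\x00\x04")],
)
```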
  • in the embodiment of the present disclosure, the encoding device determines the metadata of each frame of audio data, where the metadata includes at least one of the absolute position information of the sound object in the audio data, the relative position information of the sound object, the orientation information of the sound object, and the sound radiation range of the sound object; the encoding device then obtains the object audio signal based on the metadata of the audio data.
  • the metadata may include the absolute position information of the sound object. Based on this, when recording audio data or producing an object audio signal, if the absolute positions of multiple sound objects are fixed but the listening position keeps moving, the metadata can include the absolute position information of the sound object.
  • in this case, the metadata of a certain frame of audio data (such as the first frame) may include only the absolute position information of the sound object.
  • the audio data of the other frames can reuse the absolute position information in that frame, without requiring the metadata of every frame of audio data to include it, thereby reducing the amount of metadata and the transmission bandwidth, improving coding efficiency, and ensuring that the renderer can subsequently render the position of the sound object correctly and provide a correct spatial audio perception, without affecting the final decoding and rendering effect.
  • in addition, the metadata in the embodiment of the present disclosure may include the orientation information of the sound object and the sound radiation range of the sound object, so that when the decoding device subsequently renders the object audio signal, it can render based on the orientation information and the sound radiation range, simulating the hearing differences caused by the actual orientation of the sounding object and improving the rendering effect.
  • FIG 3a is a schematic flowchart of an audio processing method provided by an embodiment of the present disclosure. The method is executed by an encoding device. As shown in Figure 3a, the audio processing method may include the following steps:
  • Step 301a Determine whether the metadata needs to contain absolute position information or relative position information.
  • whether the metadata needs to contain absolute position information or relative position information is determined mainly based on the characteristics of the application scene or sound object or the simplification of the subsequent rendering process.
  • in response to a first preset condition being met, it is determined that the metadata needs to contain absolute position information;
  • in response to a second preset condition being met, it is determined that the metadata needs to contain relative position information.
  • the first preset condition may include at least one of the following:
  • the amount of data when the metadata includes absolute position information is less than or equal to the amount of data when the metadata includes relative position information
  • the rendering process required when metadata includes absolute position information is simpler than when the metadata includes relative position information.
  • the second preset condition may include at least one of the following:
  • the amount of data when the metadata includes absolute position information is greater than or equal to the amount of data when the metadata includes relative position information
  • the rendering process required when metadata includes relative position information is simpler than when the metadata includes absolute position information.
  • in the embodiment of the present disclosure, whether the metadata needs to contain absolute position information or relative position information can be determined by judging whether the absolute position or relative position of the sound object remains unchanged in the metadata of consecutive frames of audio data. If the absolute position of the sound object does not change in the metadata of consecutive frames of audio data, it is determined that the metadata contains absolute position information. In this case, since the absolute position does not change, only the metadata of the first frame of audio data in the consecutive frames needs to contain the absolute position information of the sound object, and the audio data of the other frames in the consecutive frames can reuse the absolute position information contained in the metadata of that first frame.
  • similarly, if the relative position of the sound object does not change in the metadata of consecutive frames of audio data, only the metadata of the first frame of audio data in the consecutive frames needs to contain the relative position information of the sound object; the audio data of the other frames in the consecutive frames can reuse the relative position information contained in the metadata of that first frame, which can reduce the data volume and transmission bandwidth of the metadata to a certain extent, reduce the data to be compressed, and improve encoding efficiency without affecting the final decoding and rendering effect.
  • in addition, the simplification of the subsequent rendering process is also taken into consideration: whichever of absolute position information and relative position information can simplify the subsequent rendering process is selected for use, thereby improving the efficiency of the subsequent rendering process. For example, in a scene with 6 degrees of freedom, the listener can perform three-dimensional rotation and three-dimensional displacement; in this case, using absolute position information is more conducive to processing such scenes and simplifies the rendering process, as sketched below.
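  • the following sketch shows one possible form of this selection rule, keeping whichever representation stays constant across consecutive frames (an unchanged position can be sent once and reused) and defaulting to absolute positions for 6-degree-of-freedom scenes; it is an illustrative assumption, not the patent's exact decision procedure:

```python
def choose_position_mode(abs_positions, rel_positions):
    """abs_positions / rel_positions: per-frame position tuples for one object."""
    abs_static = all(p == abs_positions[0] for p in abs_positions)
    rel_static = all(p == rel_positions[0] for p in rel_positions)
    if abs_static and not rel_static:
        return "absolute"   # fixed objects, moving listener
    if rel_static and not abs_static:
        return "relative"   # object locked to the listener
    return "absolute"       # e.g. 6-DoF scenes favour absolute positions

mode = choose_position_mode(
    abs_positions=[(1, 0, 0)] * 3,
    rel_positions=[(1, 0, 0), (0, 1, 0), (-1, 0, 0)],
)
print(mode)  # "absolute": the listener moved, the object did not
```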
  • Step 302a In response to determining that the metadata needs to include absolute location information, make the metadata include the absolute location information.
  • Table 5 is a schematic table of storage syntax for metadata containing absolute position information provided by an embodiment of the present disclosure.
  • Step 303a Obtain the object audio signal based on the metadata of the audio data.
  • for steps 302a-303a, please refer to the description of the above embodiments; the embodiments of the present disclosure will not repeat them here.
  • in the embodiment of the present disclosure, the encoding device determines the metadata of each frame of audio data, where the metadata includes at least one of the absolute position information of the sound object in the audio data, the relative position information of the sound object, the orientation information of the sound object, and the sound radiation range of the sound object; the encoding device then obtains the object audio signal based on the metadata of the audio data.
  • the metadata may include the absolute position information of the sound object. Based on this, when recording audio data or producing an object audio signal, if the absolute positions of multiple sound objects are fixed but the listening position keeps moving, the metadata can include the absolute position information of the sound object.
  • in this case, the metadata of a certain frame of audio data (such as the first frame) may include only the absolute position information of the sound object.
  • the audio data of the other frames can reuse the absolute position information in that frame, without requiring the metadata of every frame of audio data to include it, thereby reducing the amount of metadata and the transmission bandwidth, improving coding efficiency, and ensuring that the renderer can subsequently render the position of the sound object correctly and provide a correct spatial audio perception, without affecting the final decoding and rendering effect.
  • in addition, the metadata in the embodiment of the present disclosure may include the orientation information of the sound object and the sound radiation range of the sound object, so that when the decoding device subsequently renders the object audio signal, it can render based on the orientation information and the sound radiation range, simulating the hearing differences caused by the actual orientation of the sounding object and improving the rendering effect.
  • FIG. 3b is a schematic flowchart of an audio processing method provided by an embodiment of the present disclosure. The method is executed by an encoding device. As shown in Figure 3b, the audio processing method may include the following steps:
  • Step 301b Determine the metadata of each frame of audio data, and the metadata includes absolute position information.
  • Step 302b Obtain the object audio signal based on the metadata of the audio data.
  • for steps 301b-302b, please refer to the description of the above embodiments; the embodiments of the present disclosure will not repeat them here.
  • in the embodiment of the present disclosure, the encoding device determines the metadata of each frame of audio data, where the metadata includes at least one of the absolute position information of the sound object in the audio data, the relative position information of the sound object, the orientation information of the sound object, and the sound radiation range of the sound object; the encoding device then obtains the object audio signal based on the metadata of the audio data.
  • the metadata may include the absolute position information of the sound object. Based on this, when recording audio data or producing an object audio signal, if the absolute positions of multiple sound objects are fixed but the listening position keeps moving, the metadata can include the absolute position information of the sound object.
  • in this case, the metadata of a certain frame of audio data (such as the first frame) may include only the absolute position information of the sound object.
  • the audio data of the other frames can reuse the absolute position information in that frame, without requiring the metadata of every frame of audio data to include it, thereby reducing the amount of metadata and the transmission bandwidth, improving coding efficiency, and ensuring that the renderer can subsequently render the position of the sound object correctly and provide a correct spatial audio perception, without affecting the final decoding and rendering effect.
  • FIG 4 is a schematic flowchart of an audio processing method provided by an embodiment of the present disclosure. The method is executed by an encoding device. As shown in Figure 4, the audio processing method may include the following steps:
  • Step 401 Determine whether the metadata needs to contain absolute position information or relative position information.
  • Step 402 In response to determining that the metadata needs to include relative position information, make the metadata include the relative position information.
  • Table 6 is a schematic table of storage syntax for metadata containing relative position information provided by an embodiment of the present disclosure.
  • Step 403 Obtain the object audio signal based on the metadata of the audio data.
  • in the embodiment of the present disclosure, the encoding device determines the metadata of each frame of audio data, where the metadata includes at least one of the absolute position information of the sound object in the audio data, the relative position information of the sound object, the orientation information of the sound object, and the sound radiation range of the sound object; the encoding device then obtains the object audio signal based on the metadata of the audio data.
  • the metadata may include the absolute position information of the sound object. Based on this, when recording audio data or producing an object audio signal, if the absolute positions of multiple sound objects are fixed but the listening position keeps moving, the metadata can include the absolute position information of the sound object.
  • in this case, the metadata of a certain frame of audio data (such as the first frame) may include only the absolute position information of the sound object.
  • the audio data of the other frames can reuse the absolute position information in that frame, without requiring the metadata of every frame of audio data to include it, thereby reducing the amount of metadata and the transmission bandwidth, improving coding efficiency, and ensuring that the renderer can subsequently render the position of the sound object correctly and provide a correct spatial audio perception, without affecting the final decoding and rendering effect.
  • in addition, the metadata in the embodiment of the present disclosure may include the orientation information of the sound object and the sound radiation range of the sound object, so that when the decoding device subsequently renders the object audio signal, it can render based on the orientation information and the sound radiation range, simulating the hearing differences caused by the actual orientation of the sounding object and improving the rendering effect.
  • FIG. 5 is a schematic flowchart of an audio processing method provided by an embodiment of the present disclosure. The method is executed by an encoding device. As shown in Figure 5, the audio processing method may include the following steps:
  • Step 501 Determine whether the sound object has a direction.
  • if the sound object emits sound in all directions, it is considered that the sound object has no orientation; otherwise, the sound emission direction of the sound object is determined as the orientation of the sound object.
  • Step 502 In response to the sound object having an orientation, include the orientation information of the sound object in the metadata, and include a mark in the metadata, the mark being used to indicate that the metadata includes the orientation information.
  • the metadata in the embodiment of the present disclosure also includes the orientation information of the sounding object.
  • when the decoding device subsequently renders the object audio signal, it can perform rendering based on the orientation information, simulating the hearing differences caused by the different actual orientations of the sounding object and improving the rendering effect.
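  • a minimal sketch of steps 501-502 follows, with illustrative field names for the mark and the orientation value (they are assumptions, not the patent's syntax):

```python
def add_orientation(metadata, emits_in_all_directions, orientation_deg=None):
    """Write orientation information into the metadata only when the sound
    object actually has an orientation, together with a mark telling the
    decoder that the orientation information is present."""
    if emits_in_all_directions:
        metadata["has_orientation"] = 0   # no orientation information stored
    else:
        metadata["has_orientation"] = 1   # mark: orientation info follows
        metadata["orientation"] = orientation_deg
    return metadata

print(add_orientation({}, emits_in_all_directions=False, orientation_deg=210.0))
# {'has_orientation': 1, 'orientation': 210.0}  (e.g. 30 degrees south by west)
```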
  • FIG. 6a is a schematic flowchart of an audio processing method provided by an embodiment of the present disclosure. The method is executed by an encoding device. As shown in Figure 6a, the audio processing method may include the following steps:
  • Step 601a Determine whether the sound object has a direction.
  • Step 602a In response to the sound object having no orientation, no orientation information is included in the metadata.
  • for steps 601a-602a, please refer to the description of the above embodiments; the embodiments of the present disclosure will not repeat them here.
  • in the embodiment of the present disclosure, the encoding device determines the metadata of each frame of audio data, where the metadata includes at least one of the absolute position information of the sound object in the audio data, the relative position information of the sound object, the orientation information of the sound object, and the sound radiation range of the sound object; the encoding device then obtains the object audio signal based on the metadata of the audio data.
  • the metadata may include the absolute position information of the sound object. Based on this, when recording audio data or producing an object audio signal, if the absolute positions of multiple sound objects are fixed but the listening position keeps moving, the metadata can include the absolute position information of the sound object.
  • in this case, the metadata of a certain frame of audio data (such as the first frame) may include only the absolute position information of the sound object.
  • the audio data of the other frames can reuse the absolute position information in that frame, without requiring the metadata of every frame of audio data to include it, thereby reducing the amount of metadata and the transmission bandwidth, improving coding efficiency, and ensuring that the renderer can subsequently render the position of the sound object correctly and provide a correct spatial audio perception, without affecting the final decoding and rendering effect.
  • in addition, the metadata in the embodiment of the present disclosure may include the orientation information of the sound object and the sound radiation range of the sound object, so that when the decoding device subsequently renders the object audio signal, it can render based on the orientation information and the sound radiation range, simulating the hearing differences caused by the actual orientation of the sounding object and improving the rendering effect.
  • FIG. 6b is a schematic flowchart of an audio processing method provided by an embodiment of the present disclosure. The method is executed by an encoding device. As shown in Figure 6b, the audio processing method may include the following steps:
  • Step 601b Determine the metadata of each frame of audio data.
  • the metadata includes at least one of the absolute position information of the sound object in the audio data, the relative position information of the sound object, the orientation information of the sound object, and the sound radiation range of the sound object.
  • Step 602b Obtain the object audio signal based on the metadata of the audio data.
  • for steps 601b-602b, please refer to the description of the above embodiments; the embodiments of the present disclosure will not repeat them here.
  • Step 603b Encode the object audio signal.
  • Step 604b Send the encoded signal to the decoding device.
  • in the embodiment of the present disclosure, the encoding device determines the metadata of each frame of audio data, where the metadata includes at least one of the absolute position information of the sound object in the audio data, the relative position information of the sound object, the orientation information of the sound object, and the sound radiation range of the sound object; the encoding device then obtains the object audio signal based on the metadata of the audio data.
  • the metadata may include the absolute position information of the sound object. Based on this, when recording audio data or producing an object audio signal, if the absolute positions of multiple sound objects are fixed but the listening position keeps moving, the metadata can include the absolute position information of the sound object.
  • in this case, the metadata of a certain frame of audio data (such as the first frame) may include only the absolute position information of the sound object.
  • the audio data of the other frames can reuse the absolute position information in that frame, without requiring the metadata of every frame of audio data to include it, thereby reducing the amount of metadata and the transmission bandwidth, improving coding efficiency, and ensuring that the renderer can subsequently render the position of the sound object correctly and provide a correct spatial audio perception, without affecting the final decoding and rendering effect.
  • in addition, the metadata in the embodiment of the present disclosure may include the orientation information of the sound object and the sound radiation range of the sound object, so that when the decoding device subsequently renders the object audio signal, it can render based on the orientation information and the sound radiation range, simulating the hearing differences caused by the actual orientation of the sounding object and improving the rendering effect.
  • FIG. 7 is a schematic flow chart of an audio processing method provided by an embodiment of the present disclosure.
  • a local multi-person conference scenario is taken as an example.
  • the room on the left is the recording end, and the room on the right is the playback end.
  • multiple objects in the scene are regarded as sound objects; their corresponding voice data is obtained through microphones, and positioning and attitude sensors, such as gyroscopes and ultrasonic rangefinders, are used to obtain the spatial information of each object (such as relative position information or absolute position information) and its orientation information.
  • after the audio data, spatial information, and orientation information of each object are encoded, transmitted, decoded, and rendered, the listener can feel as if he were in the conference scene on the left: he can not only perceive the direction and distance of object 1, object 2, and object 4, but also perceive the orientation of each object.
  • in addition, this solution can treat object 3, which has no audio data, as an audio object for encoding and transmission; it can be regarded as the listener in the recording scene. Through this solution, the playback end can completely restore the real listening experience of object 3, including the changes in hearing experience caused by changes in object 3's position and head rotation (changes in orientation).
  • Figure 8 is a schematic flow chart of an audio processing method provided by an embodiment of the present disclosure.
  • on the left side are remote participants; multiple participants are located in different locations and different rooms, and each can be regarded as a sound object for object audio coding.
  • displacement or position sensors and attitude sensors are used to obtain each object's spatial position change information and head orientation information, and a microphone is used to obtain the participant's voice signal as object audio data; the spatial position information, head orientation information, and object audio data are then used for object audio encoding.
  • after the near-end user (right side of Figure 8) obtains multiple encoded remote object audio streams and decodes and renders them, combined with the near-end user's local spatial information, the user can perceive multiple remote participants' voices with a sense of direction that changes over time, and can also perceive the hearing changes caused by each remote participant's orientation.
  • FIG 9a is a schematic flowchart of an audio processing method provided by an embodiment of the present disclosure. The method is executed by a decoding device. As shown in Figure 9a, the audio processing method may include the following steps:
  • Step 901a Obtain the encoded signal sent by the encoding device
  • Step 902a Decode the encoded signal to obtain the object audio signal
  • Step 903a Determine the metadata of the object audio signal, the metadata including at least one of the absolute position information of the sound object, the relative position information of the sound object, the orientation information of the sound object, and the sound radiation range of the sound object;
  • Step 904a Render the object audio signal based on the metadata.
  • the orientation information includes absolute orientation information and/or relative orientation information
  • the relative orientation information is used to indicate the relative orientation between the sound object and the listening position.
  • the metadata further includes at least one of the following:
  • the spatial state of the sound object, which includes moving or stationary;
  • the type of the sound object.
  • the object audio signal includes a header file and an object audio data packet
  • the header file includes environmental space information of the sound object and basic information of the sound object;
  • the object audio data packet includes audio data metadata and audio data.
  • in response to the sound object being located in a room, the environmental spatial information includes at least one of the following:
  • the basic information of the sound object includes at least one of the following:
  • the sound source width of the sound object
  • the frame length of each frame of audio data.
  • rendering the object audio signal based on the metadata includes:
  • the audio data is rendered based on the metadata and the header file.
  • for steps 901a-904a, please refer to the description of the above embodiments; the embodiments of the present disclosure will not repeat them here.
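  • for illustration only, the sketch below shows how relative position metadata (horizontal angle and distance) and a directivity gain could enter a very simple stereo rendering; a real renderer would use HRTFs or panning laws rather than this toy pan, and all names here are assumptions:

```python
import math

def render_frame(samples, theta_deg, r, directivity_gain=1.0):
    """Render one frame to stereo from a horizontal angle in [-90, 90]
    degrees (left to right) and a distance r, with a 1/r attenuation."""
    distance_gain = 1.0 / max(r, 1.0)             # simple distance attenuation
    pan = math.radians((theta_deg + 90.0) / 2.0)  # map [-90, 90] to [0, 90] deg
    left = [s * distance_gain * directivity_gain * math.cos(pan) for s in samples]
    right = [s * distance_gain * directivity_gain * math.sin(pan) for s in samples]
    return left, right

# A sound object 30 degrees to the right at 2 m, facing the listener
left, right = render_frame([0.5, -0.5], theta_deg=30.0, r=2.0)
```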
  • in the embodiment of the present disclosure, the encoding device determines the metadata of each frame of audio data, where the metadata includes at least one of the absolute position information of the sound object in the audio data, the relative position information of the sound object, the orientation information of the sound object, and the sound radiation range of the sound object; the encoding device then obtains the object audio signal based on the metadata of the audio data.
  • the metadata may include the absolute position information of the sound object. Based on this, when recording audio data or producing an object audio signal, if the absolute positions of multiple sound objects are fixed but the listening position keeps moving, the metadata can include the absolute position information of the sound object.
  • in this case, the metadata of a certain frame of audio data (such as the first frame) may include only the absolute position information of the sound object.
  • the audio data of the other frames can reuse the absolute position information in that frame, without requiring the metadata of every frame of audio data to include it, thereby reducing the amount of metadata and the transmission bandwidth, improving coding efficiency, and ensuring that the renderer can subsequently render the position of the sound object correctly and provide a correct spatial audio perception, without affecting the final decoding and rendering effect.
  • in addition, the metadata in the embodiment of the present disclosure may include the orientation information of the sound object and the sound radiation range of the sound object, so that when the decoding device subsequently renders the object audio signal, it can render based on the orientation information and the sound radiation range, simulating the hearing differences caused by the actual orientation of the sounding object and improving the rendering effect.
  • FIG. 9b is a schematic flowchart of an audio processing method provided by an embodiment of the present disclosure. The method is executed by a decoding device. As shown in Figure 9b, the audio processing method may include the following steps:
  • Step 901b Obtain the encoded signal sent by the encoding device
  • Step 902b Decode the encoded signal to obtain the object audio signal
  • Step 903b Determine metadata of the object audio signal, where the metadata includes absolute position information of the sound object;
  • Step 904b Render the object audio signal based on the metadata.
  • in the embodiment of the present disclosure, the encoding device determines the metadata of each frame of audio data, where the metadata includes at least one of the absolute position information of the sound object in the audio data, the relative position information of the sound object, the orientation information of the sound object, and the sound radiation range of the sound object; the encoding device then obtains the object audio signal based on the metadata of the audio data.
  • the metadata may include the absolute position information of the sound object. Based on this, when recording audio data or producing an object audio signal, if the absolute positions of multiple sound objects are fixed but the listening position keeps moving, the metadata can include the absolute position information of the sound object.
  • in this case, the metadata of a certain frame of audio data (such as the first frame) may include only the absolute position information of the sound object.
  • the audio data of the other frames can reuse the absolute position information in that frame, without requiring the metadata of every frame of audio data to include it, thereby reducing the amount of metadata and the transmission bandwidth, improving coding efficiency, and ensuring that the renderer can subsequently render the position of the sound object correctly and provide a correct spatial audio perception, without affecting the final decoding and rendering effect.
  • in addition, the metadata in the embodiment of the present disclosure may include the orientation information of the sound object and the sound radiation range of the sound object, so that when the decoding device subsequently renders the object audio signal, it can render based on the orientation information and the sound radiation range, simulating the hearing differences caused by the actual orientation of the sounding object and improving the rendering effect.
  • Figure 9c is a schematic flowchart of an audio processing method provided by an embodiment of the present disclosure. The method is executed by a decoding device. As shown in Figure 9c, the audio processing method may include the following steps:
• Step 901c: Obtain the encoded signal sent by the encoding device;
• Step 902c: Decode the encoded signal to obtain the object audio signal;
• Step 903c: Determine the metadata of the object audio signal, where the metadata includes the relative position information of the sound object;
• Step 904c: Render the object audio signal based on the metadata.
  • Figure 9d is a schematic flowchart of an audio processing method provided by an embodiment of the present disclosure. The method is executed by a decoding device. As shown in Figure 9d, the audio processing method may include the following steps:
• Step 901d: Obtain the encoded signal sent by the encoding device;
• Step 902d: Decode the encoded signal to obtain the object audio signal;
• Step 903d: Determine the metadata of the object audio signal, where the metadata includes the orientation information of the sound object and a tag, and the tag is used to indicate that the metadata includes orientation information;
• Step 904d: Render the object audio signal based on the metadata.
  • Figure 9e is a schematic flowchart of an audio processing method provided by an embodiment of the present disclosure. The method is executed by a decoding device. As shown in Figure 9e, the audio processing method may include the following steps:
• Step 901e: Obtain the encoded signal sent by the encoding device;
• Step 902e: Decode the encoded signal to obtain the object audio signal;
• Step 903e: Determine the metadata of the object audio signal, where the metadata includes the orientation information of the sound object and a tag, and the tag is used to indicate that the metadata includes orientation information;
• Step 904e: Render the object audio signal based on the metadata.
  • Figure 9f is a schematic structural diagram of an audio processing device provided by an embodiment of the present disclosure. As shown in Figure 9f, the device may include:
• Determining module 901f, configured to determine the metadata of each frame of audio data, where the metadata includes at least one of the absolute position information of the sound object in the audio data, the relative position information of the sound object, the orientation information of the sound object, and the sound radiation range of the sound object;
• Processing module 902f, configured to obtain an object audio signal based on the metadata of the audio data.
• The determining module is further configured to: determine whether the metadata needs to include absolute position information or relative position information; in response to determining that the metadata needs to include absolute position information, include the absolute position information in the metadata; and in response to determining that the metadata needs to include relative position information, include the relative position information in the metadata, where the relative position information is used to indicate the relative position between the sound object and the listening position of the listener.
• The determining module is further configured to: determine whether the sound object has an orientation; in response to the sound object having an orientation, include the orientation information of the sound object in the metadata and include a tag in the metadata, where the tag is used to indicate that the metadata includes orientation information; and in response to the sound object having no orientation, include no orientation information in the metadata.
• The orientation information includes absolute orientation information and/or relative orientation information, where the relative orientation information is used to indicate the relative orientation between the sound object and the listening position.
• The metadata further includes at least one of the following: the sound source size of the sound object; the width of the sound object; the height of the sound object; the spatial state of the sound object, where the spatial state includes moving or stationary; the type of the sound object.
• The device is further configured to: determine the environmental spatial information of the sound object; determine the basic information of the sound object; and sample the audio data of the sound object in units of frames.
• In response to the sound object being located in a room, the environmental spatial information includes at least one of the following: room size; room wall type; wall reflection coefficient; room type; reverberation time.
• The basic information of the sound object includes at least one of the following: the number of sound objects; the sampling rate of the sound source of the sound object; the bit width of the sound source of the sound object; the frame length of each frame of audio data.
• The processing module is further configured to: store the environmental spatial information of the sound object and the basic information of the sound object as a header file; store the metadata of each frame of audio data and that frame of audio data as one object audio data packet; and splice the header file and the object audio data packets to obtain at least one object audio signal.
• The method further includes: encoding the object audio signal, and sending the encoded signal to the decoding device.
  • Figure 9g is a schematic structural diagram of an audio processing device provided by an embodiment of the present disclosure. As shown in Figure 9g, the device may include:
• Acquisition module 901g, configured to acquire the encoded signal sent by the encoding device;
• Decoding module 902g, configured to decode the encoded signal to obtain the object audio signal;
• Determining module 903g, configured to determine the metadata of the object audio signal, where the metadata includes at least one of the absolute position information of the sound object, the relative position information of the sound object, the orientation information of the sound object, and the sound radiation range of the sound object;
• Rendering module 904g, configured to render the object audio signal based on the metadata.
• The orientation information includes absolute orientation information and/or relative orientation information, where the relative orientation information is used to indicate the relative orientation between the sound object and the listening position.
• The metadata further includes at least one of the following: the sound source size of the sound object; the width of the sound object; the height of the sound object; the spatial state of the sound object, where the spatial state includes moving or stationary; the type of the sound object.
• The object audio signal includes a header file and object audio data packets, where the header file includes the environmental spatial information of the sound object and the basic information of the sound object, and each object audio data packet includes the metadata of a frame of audio data and that frame of audio data.
• In response to the sound object being located in a room, the environmental spatial information includes at least one of the following: room size; room wall type; wall reflection coefficient; room type; reverberation time.
• The basic information of the sound object includes at least one of the following: the number of sound objects; the sampling rate of the sound source of the sound object; the bit width of the sound source of the sound object; the frame length of each frame of audio data.
• Rendering the object audio signal based on the metadata includes: rendering the audio data based on the metadata and the header file.
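To make the orientation/radiation-range idea concrete, here is a minimal Python sketch of a toy directivity model: a listener inside the source's radiation cone hears full level, and the level falls off linearly outside it. The cosine geometry and the linear falloff are assumptions made for this example; the disclosure does not specify a particular rendering formula:

```python
import math

def directivity_gain(source_pos, source_orientation, radiation_angle_deg, listener_pos):
    """Toy directivity model: full gain inside the radiation cone,
    linearly reduced gain outside it (illustrative only)."""
    to_listener = [l - s for l, s in zip(listener_pos, source_pos)]
    norm = math.sqrt(sum(c * c for c in to_listener)) or 1.0
    to_listener = [c / norm for c in to_listener]
    # Angle between the source's facing direction and the listener direction.
    cos_angle = sum(o * t for o, t in zip(source_orientation, to_listener))
    angle = math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))
    if angle <= radiation_angle_deg:
        return 1.0
    # Linear falloff from the cone edge to the rear of the source.
    return max(0.0, 1.0 - (angle - radiation_angle_deg) / (180.0 - radiation_angle_deg))

# A listener directly in front hears full level; directly behind, silence.
print(directivity_gain((0, 0, 0), (1, 0, 0), 60.0, (2, 0, 0)))   # 1.0
print(directivity_gain((0, 0, 0), (1, 0, 0), 60.0, (-2, 0, 0)))  # 0.0
```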
  • FIG 10 is a block diagram of a user equipment UE1000 provided by an embodiment of the present disclosure.
  • the UE1000 can be a mobile phone, a computer, a digital broadcast terminal device, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, etc.
• UE 1000 may include at least one of the following components: a processing component 1002, a memory 1004, a power supply component 1006, a multimedia component 1008, an audio component 1010, an input/output (I/O) interface 1012, a sensor component 1013, and a communication component 1016.
  • Processing component 1002 generally controls the overall operations of UE 1000, such as operations associated with display, phone calls, data communications, camera operations, and recording operations.
  • the processing component 1002 may include at least one processor 1020 to execute instructions to complete all or part of the steps of the above method. Additionally, processing component 1002 may include at least one module to facilitate interaction between processing component 1002 and other components. For example, processing component 1002 may include a multimedia module to facilitate interaction between multimedia component 1008 and processing component 1002.
  • Memory 1004 is configured to store various types of data to support operations at UE 1000. Examples of this data include instructions for any application or method operating on the UE1000, contact data, phonebook data, messages, pictures, videos, etc.
• Memory 1004 may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
  • Power supply component 1006 provides power to various components of UE 1000.
  • Power supply components 1006 may include a power management system, at least one power supply, and other components associated with generating, managing, and distributing power to UE 1000.
  • Multimedia component 1008 includes a screen that provides an output interface between the UE 1000 and the user.
  • the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user.
• The touch panel includes at least one touch sensor to sense touches, slides, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide operation, but also detect the duration and pressure related to the touch or slide operation.
  • multimedia component 1008 includes a front-facing camera and/or a rear-facing camera. When UE1000 is in an operating mode, such as shooting mode or video mode, the front camera and/or rear camera can receive external multimedia data.
• Each front-facing camera and rear-facing camera may be a fixed optical lens system or have focal length and optical zoom capability.
  • Audio component 1010 is configured to output and/or input audio signals.
  • audio component 1010 includes a microphone (MIC) configured to receive external audio signals when UE 1000 is in operating modes, such as call mode, recording mode, and voice recognition mode. The received audio signals may be further stored in memory 1004 or sent via communication component 1016 .
  • audio component 1010 also includes a speaker for outputting audio signals.
  • the I/O interface 1012 provides an interface between the processing component 1002 and a peripheral interface module.
  • the peripheral interface module may be a keyboard, a click wheel, a button, etc. These buttons may include, but are not limited to: Home button, Volume buttons, Start button, and Lock button.
  • the sensor component 1013 includes at least one sensor for providing various aspects of status assessment for the UE 1000 .
• For example, the sensor component 1013 can detect the open/closed state of the UE 1000 and the relative positioning of components (such as the display and keypad of the UE 1000); the sensor component 1013 can also detect a position change of the UE 1000 or of a component of the UE 1000, the presence or absence of user contact with the UE 1000, the orientation or acceleration/deceleration of the UE 1000, and temperature changes of the UE 1000.
  • Sensor assembly 1013 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact.
  • Sensor assembly 1013 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications.
  • the sensor component 1013 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
  • Communication component 1016 is configured to facilitate wired or wireless communication between UE 1000 and other devices.
  • UE1000 can access wireless networks based on communication standards, such as WiFi, 2G or 3G, or a combination thereof.
  • the communication component 1016 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel.
  • the communications component 1016 also includes a near field communications (NFC) module to facilitate short-range communications.
  • the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.
• UE 1000 may be implemented by at least one application specific integrated circuit (ASIC), digital signal processor (DSP), digital signal processing device (DSPD), programmable logic device (PLD), field programmable gate array (FPGA), controller, microcontroller, microprocessor, or other electronic component, for executing the above methods.
  • FIG. 11 is a block diagram of a network side device 1100 provided by an embodiment of the present disclosure.
  • the network side device 1100 may be provided as a network side device.
• The network side device 1100 includes a processing component 1110, which further includes at least one processor, and a memory resource represented by a memory 1132 for storing instructions, such as application programs, that can be executed by the processing component 1110.
  • An application stored in memory 1132 may include one or more modules, each of which corresponds to a set of instructions.
  • the processing component 1110 is configured to execute instructions to perform any of the foregoing methods applied to the network side device, for example, the method shown in FIG. 1 .
• The network side device 1100 may also include a power supply component 1126 configured to perform power management of the network side device 1100, a wired or wireless network interface 1150 configured to connect the network side device 1100 to a network, and an input/output (I/O) interface 1158.
• The network side device 1100 may operate based on an operating system stored in the memory 1132, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or similar.
  • the methods provided by the embodiments of the present disclosure are introduced from the perspectives of network side equipment and UE respectively.
• The network side device and the UE may include a hardware structure and/or a software module, and implement the above functions in the form of a hardware structure, a software module, or a hardware structure plus a software module.
  • a certain function among the above functions can be executed by a hardware structure, a software module, or a hardware structure plus a software module.
  • the communication device may include a transceiver module and a processing module.
  • the transceiver module may include a sending module and/or a receiving module.
  • the sending module is used to implement the sending function
  • the receiving module is used to implement the receiving function.
  • the transceiving module may implement the sending function and/or the receiving function.
  • the communication device may be a terminal device (such as the terminal device in the foregoing method embodiment), a device in the terminal device, or a device that can be used in conjunction with the terminal device.
  • the communication device may be a network device, a device in a network device, or a device that can be used in conjunction with the network device.
• The communication device may be a network device, or may be a terminal device (such as the terminal device in the foregoing method embodiment), or may be a chip, chip system, or processor that supports the network device in implementing the above method, or may be a chip, chip system, or processor that supports the terminal device in implementing the above method.
  • the device can be used to implement the method described in the above method embodiment. For details, please refer to the description in the above method embodiment.
  • a communications device may include one or more processors.
  • the processor may be a general-purpose processor or a special-purpose processor, etc.
  • it can be a baseband processor or a central processing unit.
  • the baseband processor can be used to process communication protocols and communication data
• The central processor can be used to control the communication device (such as a network side device, a baseband chip, a terminal device, a terminal device chip, a DU or a CU, etc.), execute a computer program, and process the data of the computer program.
  • the communication device may also include one or more memories, on which a computer program may be stored, and the processor executes the computer program, so that the communication device executes the method described in the above method embodiment.
  • data may also be stored in the memory.
  • the communication device and the memory can be provided separately or integrated together.
  • the communication device may also include a transceiver and an antenna.
  • the transceiver can be called a transceiver unit, a transceiver, or a transceiver circuit, etc., and is used to implement transceiver functions.
  • the transceiver can include a receiver and a transmitter.
  • the receiver can be called a receiver or a receiving circuit, etc., and is used to implement the receiving function;
  • the transmitter can be called a transmitter or a transmitting circuit, etc., and is used to implement the transmitting function.
  • the communication device may also include one or more interface circuits.
  • Interface circuitry is used to receive code instructions and transmit them to the processor.
  • the processor executes the code instructions to cause the communication device to perform the method described in the above method embodiment.
  • the communication device is a terminal device (such as the terminal device in the foregoing method embodiment): the processor is configured to execute the method shown in any one of Figures 1-4.
  • the communication device is a network device: a transceiver is used to perform the method shown in any one of Figures 5-7.
  • a transceiver for implementing receiving and transmitting functions may be included in the processor.
  • the transceiver can be a transceiver circuit, an interface, or an interface circuit.
  • the transceiver circuits, interfaces or interface circuits used to implement the receiving and transmitting functions can be separate or integrated together.
  • the above-mentioned transceiver circuit, interface or interface circuit can be used for reading and writing codes/data, or the above-mentioned transceiver circuit, interface or interface circuit can be used for signal transmission or transfer.
  • the processor may store a computer program, and the computer program runs on the processor, which can cause the communication device to perform the method described in the above method embodiment.
  • the computer program may be embedded in the processor, in which case the processor may be implemented in hardware.
  • the communication device may include a circuit, and the circuit may implement the functions of sending or receiving or communicating in the foregoing method embodiments.
• The processors and transceivers described in this disclosure can be implemented in integrated circuits (ICs), analog ICs, radio frequency integrated circuits (RFICs), mixed signal ICs, application specific integrated circuits (ASICs), printed circuit boards (PCBs), electronic equipment, etc.
• The processor and transceiver can also be manufactured using various IC process technologies, such as complementary metal oxide semiconductor (CMOS), n-type metal oxide semiconductor (NMOS), p-type metal oxide semiconductor (PMOS), bipolar junction transistor (BJT), bipolar CMOS (BiCMOS), silicon germanium (SiGe), gallium arsenide (GaAs), etc.
• The communication device described in the above embodiments may be a network device or a terminal device (such as the terminal device in the foregoing method embodiment), but the scope of the communication device described in the present disclosure is not limited thereto, and the structure of the communication device is not limited by the above embodiments.
  • the communication device may be a stand-alone device or may be part of a larger device.
• The communication device may be, for example, a collection of one or more ICs, and the IC collection may also include storage components for storing data and computer programs;
  • the communication device may be a chip or a system on a chip
  • the chip includes a processor and an interface.
  • the number of processors may be one or more, and the number of interfaces may be multiple.
  • the chip also includes a memory, which is used to store necessary computer programs and data.
• Embodiments of the present disclosure also provide a communication system.
  • the system includes a communication device as a terminal device in the foregoing embodiment (such as the first terminal device in the foregoing method embodiment) and a communication device as a network device.
  • the present disclosure also provides a readable storage medium on which instructions are stored, and when the instructions are executed by a computer, the functions of any of the above method embodiments are implemented.
  • the present disclosure also provides a computer program product, which, when executed by a computer, implements the functions of any of the above method embodiments.
• In the above embodiments, the methods may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
• When implemented in software, the methods may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer programs.
• When the computer program is loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the present disclosure are produced in whole or in part.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.
• The computer program may be stored in a computer-readable storage medium or transferred from one computer-readable storage medium to another; for example, the computer program may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (such as infrared, radio, or microwave) means.
• The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media.
• The available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., high-density digital video discs (DVDs)), or semiconductor media (e.g., solid state disks (SSDs)), etc.
• "At least one" in the present disclosure can also be described as "one or more", and "a plurality of" can be two, three, four, or more, which is not limited in the present disclosure.
• In the embodiments of the present disclosure, technical features are distinguished by "first", "second", "third", "A", "B", "C", "D", and so on; the technical features described by these terms have no order of precedence and no order of magnitude.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Stereophonic System (AREA)

Abstract

The present disclosure provides an audio processing method and apparatus, and a storage medium, which belong to the technical field of communications. The method comprises: determining metadata of each frame of audio data, wherein the metadata comprises at least one among absolute position information of an acoustic object, relative position information of the acoustic object, orientation information of the acoustic object, and a sound emission range of the acoustic object in the audio data; and acquiring an object audio signal on the basis of the metadata of the audio data. By means of using the method provided in the present disclosure, the data volume and transmission bandwidth of the metadata are reduced, the coding efficiency is increased, and it is ensured that a renderer can subsequently correctly render the orientation of an acoustic object and provide a correct spatial audio perception result without affecting a final decoding rendering effect. A hearing difference generated by different actual orientations of the acoustic object can be simulated, so that the rendering effect is improved.

Description

Audio processing method/device/equipment and storage medium

Technical Field
The present disclosure relates to the field of communication technology, and in particular, to an audio processing method/device/equipment and a storage medium.
Background
When the encoding device collects audio data to produce an object audio (Object Audio) signal, it includes the relative position information between the sound object and the listener's listening position in the metadata of the object audio signal. When the decoding device renders the object audio signal, it can render spatial audio based on the relative position information, so that the listener can hear sound coming from a specific direction, giving the user a better three-dimensional and spatially immersive experience.
However, in the related art, when recording audio data or producing an object audio signal, if the absolute positions of multiple sound objects are fixed but the listening position keeps moving, the relative position information between the sound objects and the listening position will be inconsistent from frame to frame, so the metadata of every frame of audio data must include the relative position information between the sound object and the listening position. This increases the data volume of the metadata, occupies transmission bandwidth, and lowers the coding efficiency of the object audio signal; for some application scenarios, including relative position information in the metadata also makes the subsequent rendering process more complicated and affects the rendering efficiency. Moreover, when decoding and rendering the object audio signal in the related art, the hearing differences produced by different actual orientations of the sound object cannot be simulated, which results in a poor rendering effect.
Summary
The audio processing method/device/equipment and storage medium proposed by the present disclosure are used to solve the technical problems of low coding efficiency and poor rendering effect of object audio signals in the related art.
The audio processing method proposed in an embodiment of one aspect of the present disclosure is applied to an encoding device and includes:
Determining the metadata of each frame of audio data, where the metadata includes at least one of the absolute position information of the sound object in the audio data, the relative position information of the sound object, the orientation information of the sound object, and the sound radiation range of the sound object;
Obtaining an object audio signal based on the metadata of the audio data.
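A minimal sketch of these two encoder-side steps, assuming hypothetical `analyze_frame` and `pack_signal` helpers (placeholders for the per-frame metadata determination and the header/packet packing described later; neither name comes from the disclosure):

```python
def encode_object_audio(frames_of_audio, analyze_frame, pack_signal):
    # Step 1: determine the metadata of each frame of audio data.
    metadata_per_frame = [analyze_frame(frame) for frame in frames_of_audio]
    # Step 2: obtain the object audio signal based on the metadata.
    return pack_signal(list(zip(metadata_per_frame, frames_of_audio)))
```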
Optionally, in one embodiment of the present disclosure, determining the metadata of each frame of audio data includes:
Determining whether the metadata needs to include absolute position information or relative position information;
Where, in response to determining that the metadata needs to include absolute position information, the absolute position information is included in the metadata;
And in response to determining that the metadata needs to include relative position information, the relative position information is included in the metadata, the relative position information being used to indicate the relative position between the sound object and the listening position of the listener.
Optionally, in one embodiment of the present disclosure, determining the metadata of each frame of audio data includes:
Determining whether the sound object has an orientation;
In response to the sound object having an orientation, including the orientation information of the sound object in the metadata, and including a tag in the metadata, the tag being used to indicate that the metadata includes orientation information;
In response to the sound object having no orientation, including no orientation information in the metadata.
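A minimal sketch of this conditional signalling, assuming a dict-based metadata container (the key names `has_orientation` and `orientation` are hypothetical, chosen only for the example):

```python
def build_orientation_metadata(sound_object):
    # Hypothetical encoder-side step: only write orientation fields when the
    # object actually has an orientation, plus a tag/flag telling the decoder
    # whether those fields are present.
    metadata = {}
    orientation = getattr(sound_object, "orientation", None)
    if orientation is not None:
        metadata["has_orientation"] = True   # the tag described above
        metadata["orientation"] = orientation
    else:
        metadata["has_orientation"] = False  # no orientation fields follow
    return metadata
```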
Optionally, in one embodiment of the present disclosure, the orientation information includes absolute orientation information and/or relative orientation information;
The relative orientation information is used to indicate the relative orientation between the sound object and the listening position.
Optionally, in one embodiment of the present disclosure, the metadata further includes at least one of the following:
The sound source size of the sound object;
The width of the sound object;
The height of the sound object;
The spatial state of the sound object, where the spatial state includes moving or stationary;
The type of the sound object.
Optionally, in one embodiment of the present disclosure, the method further includes:
Determining the environmental spatial information of the sound object;
Determining the basic information of the sound object;
Sampling the audio data of the sound object in units of frames.
Optionally, in one embodiment of the present disclosure, in response to the sound object being located in a room, the environmental spatial information includes at least one of the following:
Room size;
Room wall type;
Wall reflection coefficient;
Room type;
Reverberation time.
Optionally, in one embodiment of the present disclosure, the basic information of the sound object includes at least one of the following:
The number of sound objects;
The sampling rate of the sound source of the sound object;
The bit width of the sound source of the sound object;
The frame length of each frame of audio data.
Optionally, in one embodiment of the present disclosure, obtaining the object audio signal based on the metadata of the audio data includes:
Storing the environmental spatial information of the sound object and the basic information of the sound object as a header file;
Storing the metadata of each frame of audio data and that frame of audio data as one object audio data packet;
Splicing the header file and the object audio data packets to obtain at least one object audio signal.
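A minimal sketch of this packing, assuming a JSON-encoded header and length-prefixed packets; both serialization choices are assumptions made for the example, since the disclosure does not prescribe a byte layout:

```python
import json
import struct

def pack_object_audio_signal(env_info, basic_info, frames):
    """frames: iterable of (metadata: dict, audio_data: bytes) pairs."""
    # Header file: environmental spatial information + basic information.
    header = json.dumps({"environment": env_info, "basic": basic_info}).encode("utf-8")
    packets = bytearray()
    for metadata, audio_data in frames:
        meta_bytes = json.dumps(metadata).encode("utf-8")
        # One object audio data packet per frame:
        # metadata length, metadata, audio length, audio.
        packets += struct.pack("<I", len(meta_bytes)) + meta_bytes
        packets += struct.pack("<I", len(audio_data)) + audio_data
    # Splice the header file and the object audio data packets into one stream.
    return struct.pack("<I", len(header)) + header + bytes(packets)

signal = pack_object_audio_signal(
    {"room_size": [5.0, 4.0, 3.0]},
    {"num_objects": 1, "sample_rate": 48000, "bit_width": 16, "frame_length": 1024},
    [({"has_absolute_position": True, "position": [1.0, 2.0, 0.0]}, b"\x00" * 2048)],
)
```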
The audio processing method proposed in an embodiment of another aspect of the present disclosure is applied to a decoding device and includes:
Obtaining the encoded signal sent by the encoding device;
Decoding the encoded signal to obtain the object audio signal;
Determining the metadata of the object audio signal, where the metadata includes at least one of the absolute position information of the sound object, the relative position information of the sound object, the orientation information of the sound object, and the sound radiation range of the sound object;
Rendering the object audio signal based on the metadata.
Optionally, in one embodiment of the present disclosure, the orientation information includes absolute orientation information and/or relative orientation information;
The relative orientation information is used to indicate the relative orientation between the sound object and the listening position.
Optionally, in one embodiment of the present disclosure, the metadata further includes at least one of the following:
The sound source size of the sound object;
The width of the sound object;
The height of the sound object;
The spatial state of the sound object, where the spatial state includes moving or stationary;
The type of the sound object.
Optionally, in one embodiment of the present disclosure, the object audio signal includes a header file and object audio data packets;
The header file includes the environmental spatial information of the sound object and the basic information of the sound object;
The object audio data packet includes the metadata of the audio data and the audio data.
Optionally, in one embodiment of the present disclosure, in response to the sound object being located in a room, the environmental spatial information includes at least one of the following:
Room size;
Room wall type;
Wall reflection coefficient;
Room type;
Reverberation time.
Optionally, in one embodiment of the present disclosure, the basic information of the sound object includes at least one of the following:
The number of sound objects;
The sampling rate of the sound source of the sound object;
The bit width of the sound source of the sound object;
The frame length of each frame of audio data.
Optionally, in one embodiment of the present disclosure, rendering the object audio signal based on the metadata includes:
Rendering the audio data based on the metadata and the header file.
Optionally, in one embodiment of the present disclosure, the method further includes:
Encoding the object audio signal;
Sending the encoded signal to the decoding device.
An audio processing device proposed in an embodiment of yet another aspect of the present disclosure includes:
A determining module, configured to determine the metadata of each frame of audio data, where the metadata includes at least one of the absolute position information of the sound object in the audio data, the relative position information of the sound object, the orientation information of the sound object, and the sound radiation range of the sound object;
A processing module, configured to obtain an object audio signal based on the metadata of the audio data.
An audio processing device proposed in an embodiment of yet another aspect of the present disclosure includes:
An acquisition module, configured to acquire the encoded signal sent by the encoding device;
A decoding module, configured to decode the encoded signal to obtain the object audio signal;
A determining module, configured to determine the metadata of the object audio signal, where the metadata includes at least one of the absolute position information of the sound object, the relative position information of the sound object, the orientation information of the sound object, and the sound radiation range of the sound object;
A rendering module, configured to render the object audio signal based on the metadata.
A communication apparatus proposed in an embodiment of yet another aspect of the present disclosure includes a processor and a memory, where a computer program is stored in the memory, and the processor executes the computer program stored in the memory, so that the apparatus performs the method proposed in the embodiments of the above aspects.
A communication apparatus proposed in an embodiment of yet another aspect of the present disclosure includes: a processor and an interface circuit;
The interface circuit is configured to receive code instructions and transmit them to the processor;
The processor is configured to run the code instructions to perform the method proposed in the embodiments of another aspect.
A computer-readable storage medium proposed in an embodiment of yet another aspect of the present disclosure is used to store instructions, and when the instructions are executed, the method proposed in the embodiments of another aspect is implemented.
To sum up, in the audio processing method/device/equipment and storage medium provided by the embodiments of the present disclosure, the encoding device determines the metadata of each frame of audio data, where the metadata includes at least one of the absolute position information of the sound object in the audio data, the relative position information of the sound object, the orientation information of the sound object, and the sound radiation range of the sound object; the encoding device then obtains the object audio signal based on the metadata of the audio data. It can thus be seen that, in the embodiments of the present disclosure, the metadata may include the absolute position information of the sound objects. On this basis, when recording audio data or producing an object audio signal, if the absolute positions of multiple sound objects are fixed but the listening position keeps moving, the metadata can carry the absolute position information of the sound objects. In this case, since the absolute positions of the sound objects are fixed, only the metadata of a certain frame (such as the first frame) of audio data needs to include the absolute position information between the sound object and the listening position, and the audio data of the other frames can reuse that absolute position information, so that the metadata of every frame does not need to include it. This reduces the data volume of the metadata and the transmission bandwidth, improves the coding efficiency, and ensures that the renderer can subsequently render the position of the sound object correctly and provide a correct spatial audio perception result, without affecting the final decoding and rendering effect. In addition, the metadata in the embodiments of the present disclosure may also include the orientation information of the sound object and the sound radiation range of the sound object; when the object audio signal is subsequently rendered, the rendering can be based on the orientation information and the sound radiation range, thereby simulating the hearing differences produced by different actual orientations of the sound object and improving the rendering effect.
Brief Description of the Drawings
The above and/or additional aspects and advantages of the present disclosure will become apparent and readily understood from the following description of the embodiments in conjunction with the accompanying drawings, in which:
Figure 1 is a schematic flowchart of an audio processing method provided by an embodiment of the present disclosure;
Figure 2 is a schematic flowchart of an audio processing method provided by another embodiment of the present disclosure;
Figures 3a-3b are schematic flowcharts of an audio processing method provided by yet another embodiment of the present disclosure;
Figure 4 is a schematic flowchart of an audio processing method provided by yet another embodiment of the present disclosure;
Figure 5 is a schematic flowchart of an audio processing method provided by yet another embodiment of the present disclosure;
Figures 6a-6b are schematic flowcharts of an audio processing method provided by yet another embodiment of the present disclosure;
Figure 7 is a schematic flowchart of an audio processing method provided by an embodiment of the present disclosure;
Figure 8 is a schematic flowchart of an audio processing method provided by an embodiment of the present disclosure;
Figures 9a-9e are schematic flowcharts of an audio processing method provided by an embodiment of the present disclosure;
Figure 9f is a schematic structural diagram of an audio processing device provided by an embodiment of the present disclosure;
Figure 9g is a schematic structural diagram of an audio processing device provided by an embodiment of the present disclosure;
Figure 10 is a block diagram of a user equipment provided by an embodiment of the present disclosure;
Figure 11 is a block diagram of a network side device provided by an embodiment of the present disclosure.
Detailed Description
这里将详细地对示例性实施例进行说明,其示例表示在附图中。下面的描述涉及附图时,除非另有表示,不同附图中的相同数字表示相同或相似的要素。以下示例性实施例中所描述的实施方式并不代表与本公开实施例相一致的所有实施方式。相反,它们仅是与如所附权利要求书中所详述的、本公开实施例的一些方面相一致的装置和方法的例子。Exemplary embodiments will be described in detail herein, examples of which are illustrated in the accompanying drawings. When the following description refers to the drawings, the same numbers in different drawings refer to the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with embodiments of the present disclosure. Rather, they are merely examples of apparatus and methods consistent with aspects of embodiments of the present disclosure as detailed in the appended claims.
在本公开实施例使用的术语是仅仅出于描述特定实施例的目的,而非旨在限制本公开实施例。在本公开实施例和所附权利要求书中所使用的单数形式的“一种”和“该”也旨在包括多数形式,除非上下文清楚地表示其他含义。还应当理解,本文中使用的术语“和/或”是指并包含一个或多个相关联的列出项目的任何或所有可能组合。The terminology used in the embodiments of the present disclosure is for the purpose of describing specific embodiments only and is not intended to limit the embodiments of the present disclosure. As used in the embodiments of the present disclosure and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly dictates otherwise. It will also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in the embodiments of the present disclosure to describe various information, the information should not be limited to these terms. These terms are only used to distinguish information of the same type from each other. For example, without departing from the scope of the embodiments of the present disclosure, first information may also be called second information, and similarly, second information may also be called first information. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
The audio processing method/device/equipment and storage medium provided by the embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Figure 1 is a schematic flowchart of an audio processing method provided by an embodiment of the present disclosure. The method is executed by an encoding device. As shown in Figure 1, the audio processing method may include the following steps:
Step 101: Determine the metadata of each frame of audio data.
In one embodiment of the present disclosure, the metadata may include at least one of: the absolute position information of the sound objects in each frame of audio data, the relative position information of the sound objects, the orientation information of the sound objects, and the sound radiation range of the sound objects.
It should be noted that, in one embodiment of the present disclosure, the above relative position information may be used to indicate the relative position between the sound object and the listening position of the listener. In one embodiment of the present disclosure, the absolute position information and the relative position information may specifically be the mapping of the sound object's absolute or relative position onto a coordinate system. The absolute position may be, for example, the longitude and latitude of the sound object; the relative position may be, for example, the distance, azimuth angle, and pitch angle between the sound object and the listener. The listening position of the listener may be any position, and may also be the position of any sound object.
Specifically, in one embodiment of the present disclosure, the method for determining the absolute position information of the sound objects may include: first obtaining the absolute position of each sound object, then establishing an absolute coordinate system whose origin may be any position and is fixed, and then mapping the absolute position of each sound object into the absolute coordinate system to obtain the absolute position information of the sound objects. For example, in one embodiment of the present disclosure, the absolute coordinate system may be a rectangular coordinate system, and the absolute position information of a sound object may be (x, y, z), where x, y, and z respectively represent the position coordinates of the sound object on the x-axis (e.g., the front-back axis), the y-axis (e.g., the left-right axis), and the z-axis (e.g., the up-down axis) of the rectangular coordinate system. In another embodiment of the present disclosure, the absolute coordinate system may be a spherical coordinate system, and the absolute position information of a sound object may be (θ, γ, r), where θ, γ, and r respectively represent the horizontal angle of the sound object in the spherical coordinate system (i.e., the angle between the x-axis and the projection, onto the horizontal plane, of the line connecting the sound object and the origin), the vertical angle (i.e., the angle between the horizontal plane and the line connecting the sound object and the origin), and the straight-line distance from the sound object to the origin.
In another embodiment of the present disclosure, the method for determining the relative position information of the sound objects may include: first obtaining the position of each sound object relative to the listening position of the listener, then establishing a relative coordinate system whose origin is always the listening position, so that when the listening position changes, the origin of the relative coordinate system changes accordingly. The position of each sound object relative to the listening position is then mapped into the relative coordinate system to obtain the relative position information of the sound objects. For example, in one embodiment of the present disclosure, the relative coordinate system may be a rectangular coordinate system, and the relative position information of a sound object may be (x, y, z), where x, y, and z respectively represent the position coordinates of the sound object on the x-axis (e.g., the front-back axis), the y-axis (e.g., the left-right axis), and the z-axis (e.g., the up-down axis) of the rectangular coordinate system. In another embodiment of the present disclosure, the relative coordinate system may be a spherical coordinate system, and the relative position information of a sound object may be (θ, γ, r), where θ, γ, and r respectively represent the horizontal angle of the sound object in the spherical coordinate system (i.e., the angle between the x-axis and the projection, onto the horizontal plane, of the line connecting the sound object and the origin), the vertical angle (i.e., the angle between the horizontal plane and the line connecting the sound object and the origin), and the straight-line distance from the sound object to the origin.
The above (x, y, z) and (θ, γ, r) can be converted into each other using the following formulas.
$$x = r\cos\gamma\cos\theta,\qquad y = r\cos\gamma\sin\theta,\qquad z = r\sin\gamma$$
$$r = \sqrt{x^{2}+y^{2}+z^{2}},\qquad \theta = \arctan\left(\frac{y}{x}\right),\qquad \gamma = \arcsin\left(\frac{z}{r}\right)$$
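For illustration only (not part of the original disclosure), the conversion can be sketched in code; the function names are hypothetical, angles are in radians, and the angle conventions follow the definitions above (θ measured in the horizontal plane from the x-axis, γ measured from the horizontal plane):

```python
import math

def spherical_to_cartesian(theta, gamma, r):
    """(θ, γ, r) -> (x, y, z)."""
    x = r * math.cos(gamma) * math.cos(theta)
    y = r * math.cos(gamma) * math.sin(theta)
    z = r * math.sin(gamma)
    return x, y, z

def cartesian_to_spherical(x, y, z):
    """(x, y, z) -> (θ, γ, r); atan2 keeps the correct quadrant."""
    r = math.sqrt(x * x + y * y + z * z)
    theta = math.atan2(y, x)
    gamma = math.asin(z / r) if r > 0 else 0.0
    return theta, gamma, r
```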
In one embodiment of the present disclosure, the above method of "obtaining the absolute or relative position of each sound object" may include: using a sensor or a combination of sensors to obtain the absolute or relative position of the sound object, for example, displacement sensors, position sensors, attitude sensors (such as gyroscopes and ultrasonic rangefinders), positioning sensors, geomagnetic sensors, direction sensors, and accelerometers. The distance between the sound object and the listener in the relative position can also be obtained through inertial navigation and initial alignment techniques. In another embodiment of the present disclosure, the absolute or relative position of each sound object may also be obtained based on user input. In yet another embodiment of the present disclosure, the absolute or relative position of each sound object may also be generated by a program.
Furthermore, in one embodiment of the present disclosure, the above orientation information of the sound object may specifically be absolute orientation information of the sound object (such as facing due south or due north). In another embodiment of the present disclosure, the orientation information of the sound object may specifically be relative orientation information, which can be used to indicate the relative orientation between the sound object and the listening position; for example, the relative orientation information may be: the sound object is located 30° west of due south of the listening position. The orientation information of the sound object may be obtained using any of the above sensors, obtained based on user input, or generated by a program.
In one embodiment of the present disclosure, the above sound radiation range of the sound object may be a parameter used to describe the radiation characteristics of the sound object. In one embodiment of the present disclosure, the sound radiation range may be used to indicate the sound radiation angle of the sound object; for example, the sound radiation range may be: the sound object radiates sound over 90° directly ahead, or the sound object radiates sound over 360°. In another embodiment of the present disclosure, the sound radiation range may be the sound radiation shape of the sound object; for example, the sound radiation range may be: the sound object radiates sound in a cardioid pattern, or the sound object radiates sound in a figure-eight pattern. The sound radiation range of the sound object may be obtained using any of the above sensors, obtained based on user input, or generated by a program.
In addition, in one embodiment of the present disclosure, the metadata of each frame of audio data may further include at least one of the following:
the sound source size of the sound object;
the width of the sound object;
the height of the sound object;
the spatial state of the sound object, where the spatial state includes moving or stationary;
the type of the sound object (such as speech, music, etc.).
The sound source size, width, height, and spatial state of the sound object can likewise be obtained through any of the above sensors, obtained based on user input, or generated by a program.
It should also be noted that, in one embodiment of the present disclosure, each item of content in the metadata has a corresponding flag bit stored with it, which is used to indicate whether the parameter of that item has changed relative to the parameter of the same item in the metadata of the previous frame of audio data. For example, the azimuth angle in the metadata has a corresponding azimuth flag bit: if the azimuth angle in the metadata of the current frame of audio data has not changed relative to the azimuth angle in the metadata of the previous frame of audio data, the azimuth flag bit can be set to a first value (such as 1); otherwise, it can be set to a second value (such as 0). Furthermore, in one embodiment of the present disclosure, if part of the content in the metadata of the current frame of audio data has not changed relative to the metadata of the previous frame of audio data, the metadata of the current frame may omit that unchanged content and directly reuse the content in the metadata of the previous frame. This reduces the data volume and transmission bandwidth of the metadata to a certain extent, reduces the data to be compressed, and improves coding efficiency without affecting the final decoding and rendering effect.
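As an illustrative sketch only (field names and layout are assumptions, not the disclosure's actual syntax), the per-content flag bits and the reuse of the previous frame's metadata described above could look like this, with flag 1 meaning "unchanged, value omitted and reused" and 0 meaning "changed, value transmitted":

```python
# Hypothetical field set; the real metadata contents are defined by the disclosure.
FIELDS = ["azimuth", "pitch", "distance", "orientation", "radiation_range"]

def encode_frame_metadata(current, previous):
    """Build a frame's metadata packet, omitting fields unchanged since the previous frame."""
    packet = {"flags": {}, "values": {}}
    for field in FIELDS:
        unchanged = previous is not None and current.get(field) == previous.get(field)
        packet["flags"][field] = 1 if unchanged else 0  # 1: unchanged, 0: changed
        if not unchanged:
            packet["values"][field] = current.get(field)
    return packet

def decode_frame_metadata(packet, previous):
    """Restore full metadata, reusing the previous frame's values for unchanged fields."""
    restored = {}
    for field in FIELDS:
        if packet["flags"][field] == 1:
            restored[field] = previous[field]      # reuse: value was not transmitted
        else:
            restored[field] = packet["values"][field]
    return restored
```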
Step 102: Obtain the object audio signal based on the metadata of the audio data.
The specific method of "obtaining the object audio signal based on the metadata of the audio data" will be introduced in detail in subsequent embodiments.
To sum up, in the audio processing method provided by this embodiment of the present disclosure, the encoding device determines the metadata of each frame of audio data, where the metadata includes at least one of the absolute position information of the sound objects in the audio data, the relative position information of the sound objects, the orientation information of the sound objects, and the sound radiation range of the sound objects; the encoding device then obtains the object audio signal based on the metadata of the audio data. It follows that, in the embodiments of the present disclosure, the metadata may include the absolute position information of the sound objects. On this basis, when audio data is recorded or an object audio signal is produced, if the absolute positions of multiple sound objects are fixed while the listening position keeps moving, the metadata can be made to include the absolute position information of the sound objects. In this case, because the absolute positions of the sound objects are fixed, only the metadata of the audio data of a certain frame (such as the first frame) needs to include the absolute position information between the sound objects and the listening position, and the audio data of the other frames can reuse that information, instead of the metadata of every frame carrying it. This reduces the data volume and transmission bandwidth of the metadata, improves coding efficiency, ensures that the renderer can subsequently render the directions of the sound objects correctly, and provides correct spatial audio perception results without affecting the final decoding and rendering effect. In addition, the metadata in the embodiments of the present disclosure further includes the orientation information and the sound radiation range of the sound objects, so that when the object audio signal is subsequently rendered, rendering can be performed based on the orientation information and the sound radiation range to simulate the differences in perceived sound caused by different actual orientations of the sound object, thereby improving the rendering effect.
Figure 2 is a schematic flowchart of an audio processing method provided by an embodiment of the present disclosure. The method is executed by an encoding device. As shown in Figure 2, the audio processing method may include the following steps:
Step 201: Determine the environmental space information of the sound objects.
In one embodiment of the present disclosure, when the sound object is located in a room, the environmental space information includes at least one of the following:
room size;
room wall type;
wall reflection coefficient;
room type (such as large room, small room, conference room, auditorium, hall, etc.);
reverberation time.
The environmental space information can be obtained using any of the above sensors, obtained based on user input, or generated by a program.
It should be noted that when an absolute or relative coordinate system is subsequently established, it can be established based on this environmental space information.
Step 202: Determine the basic information of the sound objects.
In one embodiment of the present disclosure, the basic information of the sound objects may include at least one of the following:
the number of sound objects;
the sampling rate of the sound sources of the sound objects;
the bit width of the sound sources of the sound objects;
the frame length of each frame of audio data.
The basic information of the sound objects can be obtained using any of the above sensors, obtained based on user input, or generated by a program.
Step 203: Sample the audio data of the sound objects in units of frames.
In one embodiment of the present disclosure, a sound collection device (such as a microphone) can be used to sample the audio data of the sound objects in units of frames, and all sampling points included in the current frame can be saved as PCM (Pulse Code Modulation) data.
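As a minimal sketch (the frame length is a hypothetical value; in practice it comes from the basic information described above), sampling "in units of frames" amounts to slicing the captured PCM sample stream into fixed-length frames:

```python
FRAME_LEN = 960  # hypothetical frame length in samples, e.g. 20 ms at 48 kHz

def split_into_frames(pcm_samples):
    """Slice a mono PCM sample list into fixed-length frames, zero-padding the tail."""
    frames = []
    for start in range(0, len(pcm_samples), FRAME_LEN):
        frame = pcm_samples[start:start + FRAME_LEN]
        if len(frame) < FRAME_LEN:
            frame = frame + [0] * (FRAME_LEN - len(frame))
        frames.append(frame)
    return frames
```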
Table 1 and Table 2 are schematic tables of the storage syntax of the audio data provided by embodiments of the present disclosure.
Table 1—Syntax of object audio data (low-latency mode)
[Table content is provided as an image in the original publication.]
Table 2—Syntax of object raw PCM samples
[Table content is provided as an image in the original publication.]
Step 204: Determine the metadata of each frame of audio data.
For a detailed introduction to step 204, refer to the description of the above embodiments; details are not repeated here.
Step 205: Obtain the object audio signal based on the metadata of the audio data.
In one embodiment of the present disclosure, the method of obtaining the object audio signal based on the metadata of the audio data may include the following steps:
Step 1: Store the environmental space information of the sound objects and the basic information of the sound objects as a header file.
Table 3 is a schematic table of the storage syntax of the header file provided by an embodiment of the present disclosure.
Table 3—Syntax of the object audio file header
[Table content is provided as an image in the original publication.]
Step 2: Store the metadata of each frame of audio data and each frame of audio data as an object audio data packet.
Table 4 is a schematic table of the storage syntax of the object audio data packet provided by an embodiment of the present disclosure.
Table 4—Syntax of the object audio data packet
[Table content is provided as an image in the original publication.]
Step 3: Splice the header file and the object audio data packets to obtain at least one object audio signal.
In one embodiment of the present disclosure, after the object audio signal is obtained, the encoding device may save or transmit the object audio signal as needed, or may encode the object audio signal into another format before saving or transmitting it.
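A hypothetical sketch of steps 1 to 3 (JSON with length prefixes stands in for the table-defined syntax, which this sketch does not reproduce): the header file is written once, followed by one packet per frame carrying that frame's metadata and audio data:

```python
import json
import struct

def build_object_audio_signal(header, frames):
    """Splice a header file and per-frame (metadata, audio bytes) packets into one signal."""
    blob = bytearray()
    head = json.dumps(header).encode("utf-8")
    blob += struct.pack("<I", len(head)) + head    # step 1: header file
    for metadata, audio in frames:                 # step 2: one packet per frame
        meta = json.dumps(metadata).encode("utf-8")
        blob += struct.pack("<I", len(meta)) + meta
        blob += struct.pack("<I", len(audio)) + audio
    return bytes(blob)                             # step 3: spliced object audio signal
```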
To sum up, in the audio processing method provided by this embodiment of the present disclosure, the encoding device determines the metadata of each frame of audio data, where the metadata includes at least one of the absolute position information of the sound objects in the audio data, the relative position information of the sound objects, the orientation information of the sound objects, and the sound radiation range of the sound objects; the encoding device then obtains the object audio signal based on the metadata of the audio data. It follows that, in the embodiments of the present disclosure, the metadata may include the absolute position information of the sound objects. On this basis, when audio data is recorded or an object audio signal is produced, if the absolute positions of multiple sound objects are fixed while the listening position keeps moving, the metadata can be made to include the absolute position information of the sound objects. In this case, because the absolute positions of the sound objects are fixed, only the metadata of the audio data of a certain frame (such as the first frame) needs to include the absolute position information between the sound objects and the listening position, and the audio data of the other frames can reuse that information, instead of the metadata of every frame carrying it. This reduces the data volume and transmission bandwidth of the metadata, improves coding efficiency, ensures that the renderer can subsequently render the directions of the sound objects correctly, and provides correct spatial audio perception results without affecting the final decoding and rendering effect. In addition, the metadata in the embodiments of the present disclosure further includes the orientation information and the sound radiation range of the sound objects, so that when the object audio signal is subsequently rendered, rendering can be performed based on the orientation information and the sound radiation range to simulate the differences in perceived sound caused by different actual orientations of the sound object, thereby improving the rendering effect.
Figure 3a is a schematic flowchart of an audio processing method provided by an embodiment of the present disclosure. The method is executed by an encoding device. As shown in Figure 3a, the audio processing method may include the following steps:
Step 301a: Determine whether the metadata needs to contain absolute position information or relative position information.
In one embodiment of the present disclosure, whether the metadata needs to contain absolute position information or relative position information is determined mainly based on the application scenario, the characteristics of the sound objects, the degree of simplification of the subsequent rendering process, and the like.
Specifically, when a first preset condition is met, it is determined that the metadata needs to contain absolute position information; when a second preset condition is met, it is determined that the metadata needs to contain relative position information.
The first preset condition may include at least one of the following:
the absolute position of the sound object remains unchanged;
the data volume when the metadata includes absolute position information is less than or equal to the data volume when the metadata includes relative position information;
the rendering process required when the metadata includes absolute position information is simpler than the rendering process required when the metadata includes relative position information.
The second preset condition may include at least one of the following:
the relative position of the sound object remains unchanged;
the data volume when the metadata includes absolute position information is greater than or equal to the data volume when the metadata includes relative position information;
the rendering process required when the metadata includes relative position information is simpler than the rendering process required when the metadata includes absolute position information.
That is, whether the metadata needs to contain absolute position information or relative position information can be determined by judging whether the absolute position or the relative position of the sound object remains unchanged in the metadata of consecutive frames of audio data. If the absolute position of the sound object remains unchanged in the metadata of consecutive frames of audio data, it is determined that the metadata contains absolute position information. In this case, because the absolute position is unchanged, only the metadata of the audio data of the first of the consecutive frames needs to contain the absolute position information of the sound object, and the audio data of the other consecutive frames can reuse the absolute position information contained in the metadata of the first frame. Likewise, if the relative position of the sound object remains unchanged in the metadata of consecutive frames of audio data, it is determined that the metadata contains relative position information; in this case, only the metadata of the audio data of the first of the consecutive frames needs to contain the relative position information of the sound object, and the audio data of the other consecutive frames can reuse the relative position information contained in the metadata of the first frame. This reduces the data volume and transmission bandwidth of the metadata to a certain extent, reduces the data to be compressed, and improves coding efficiency without affecting the final decoding and rendering effect.
In addition, in one embodiment of the present disclosure, the degree of simplification of the subsequent rendering process is also considered: from the absolute position information and the relative position information, the one that makes the subsequent rendering process simpler is selected, thereby improving the efficiency of the subsequent rendering process. For example, in a six-degrees-of-freedom scenario, the listener can rotate and translate in three dimensions; in this case, using absolute position information is more conducive to processing such scenarios and simplifies the rendering process.
It follows that, in the embodiments of the present disclosure, whether the metadata needs to include absolute position information or relative position information is considered comprehensively from multiple dimensions (such as a lower data volume and a simpler rendering process), which not only reduces the data volume of the metadata but also simplifies the subsequent rendering process and improves rendering efficiency.
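The selection between the two preset conditions might be sketched as below; how each condition is measured, and the priority when both sets of conditions hold, are assumptions left open by the description above:

```python
def choose_position_mode(abs_pos_unchanged, rel_pos_unchanged,
                         abs_bytes, rel_bytes, abs_render_simpler):
    """Return 'absolute' or 'relative' per the first/second preset conditions."""
    # First preset condition: any of the listed clauses selects absolute mode.
    if abs_pos_unchanged or abs_bytes <= rel_bytes or abs_render_simpler:
        return "absolute"
    # Second preset condition: otherwise fall back to relative mode.
    return "relative"
```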
It should be noted that the above determination logic for "determining whether the metadata needs to contain absolute position information or relative position information" is only an example of the present disclosure; other content related or similar to the above determination logic also falls within the protection scope of the present disclosure.
Step 302a: In response to determining that the metadata needs to contain absolute position information, make the metadata contain the absolute position information.
Table 5 is a schematic table of the storage syntax of metadata containing absolute position information provided by an embodiment of the present disclosure.
Table 5—Syntax of object metadata samples (absolute coordinate mode)
[Table content is provided as an image in the original publication.]
Step 303a: Obtain the object audio signal based on the metadata of the audio data.
For a detailed introduction to steps 302a-303a, refer to the description of the above embodiments; details are not repeated here.
To sum up, in the audio processing method provided by this embodiment of the present disclosure, the encoding device determines the metadata of each frame of audio data, where the metadata includes at least one of the absolute position information of the sound objects in the audio data, the relative position information of the sound objects, the orientation information of the sound objects, and the sound radiation range of the sound objects; the encoding device then obtains the object audio signal based on the metadata of the audio data. It follows that, in the embodiments of the present disclosure, the metadata may include the absolute position information of the sound objects. On this basis, when audio data is recorded or an object audio signal is produced, if the absolute positions of multiple sound objects are fixed while the listening position keeps moving, the metadata can be made to include the absolute position information of the sound objects. In this case, because the absolute positions of the sound objects are fixed, only the metadata of the audio data of a certain frame (such as the first frame) needs to include the absolute position information between the sound objects and the listening position, and the audio data of the other frames can reuse that information, instead of the metadata of every frame carrying it. This reduces the data volume and transmission bandwidth of the metadata, improves coding efficiency, ensures that the renderer can subsequently render the directions of the sound objects correctly, and provides correct spatial audio perception results without affecting the final decoding and rendering effect. In addition, the metadata in the embodiments of the present disclosure further includes the orientation information and the sound radiation range of the sound objects, so that when the object audio signal is subsequently rendered, rendering can be performed based on the orientation information and the sound radiation range to simulate the differences in perceived sound caused by different actual orientations of the sound object, thereby improving the rendering effect.
Figure 3b is a schematic flowchart of an audio processing method provided by an embodiment of the present disclosure. The method is executed by an encoding device. As shown in Figure 3b, the audio processing method may include the following steps:
Step 301b: Determine the metadata of each frame of audio data, where the metadata contains absolute position information.
Step 302b: Obtain the object audio signal based on the metadata of the audio data.
For a detailed introduction to steps 301b-302b, refer to the description of the above embodiments; details are not repeated here.
To sum up, in the audio processing method provided by this embodiment of the present disclosure, the encoding device determines the metadata of each frame of audio data, where the metadata includes at least one of the absolute position information of the sound objects in the audio data, the relative position information of the sound objects, the orientation information of the sound objects, and the sound radiation range of the sound objects; the encoding device then obtains the object audio signal based on the metadata of the audio data. It follows that, in the embodiments of the present disclosure, the metadata may include the absolute position information of the sound objects. On this basis, when audio data is recorded or an object audio signal is produced, if the absolute positions of multiple sound objects are fixed while the listening position keeps moving, the metadata can be made to include the absolute position information of the sound objects. In this case, because the absolute positions of the sound objects are fixed, only the metadata of the audio data of a certain frame (such as the first frame) needs to include the absolute position information between the sound objects and the listening position, and the audio data of the other frames can reuse that information, instead of the metadata of every frame carrying it. This reduces the data volume and transmission bandwidth of the metadata, improves coding efficiency, ensures that the renderer can subsequently render the directions of the sound objects correctly, and provides correct spatial audio perception results without affecting the final decoding and rendering effect.
Figure 4 is a schematic flowchart of an audio processing method provided by an embodiment of the present disclosure. The method is executed by an encoding device. As shown in Figure 4, the audio processing method may include the following steps:
Step 401: Determine whether the metadata needs to contain absolute position information or relative position information.
Step 402: In response to determining that the metadata needs to contain relative position information, make the metadata contain the relative position information.
Table 6 is a schematic table of the storage syntax of metadata containing relative position information provided by an embodiment of the present disclosure.
Table 6—Syntax of object metadata samples (relative coordinate mode)
[Table content is provided as an image in the original publication.]
Step 403: Obtain the object audio signal based on the metadata of the audio data.
For a detailed introduction to steps 401-403, refer to the description of the above embodiments; details are not repeated here.
To sum up, in the audio processing method provided by this embodiment of the present disclosure, the encoding device determines the metadata of each frame of audio data, where the metadata includes at least one of the absolute position information of the sound objects in the audio data, the relative position information of the sound objects, the orientation information of the sound objects, and the sound radiation range of the sound objects; the encoding device then obtains the object audio signal based on the metadata of the audio data. It follows that, in the embodiments of the present disclosure, the metadata may include the absolute position information of the sound objects. On this basis, when audio data is recorded or an object audio signal is produced, if the absolute positions of multiple sound objects are fixed while the listening position keeps moving, the metadata can be made to include the absolute position information of the sound objects. In this case, because the absolute positions of the sound objects are fixed, only the metadata of the audio data of a certain frame (such as the first frame) needs to include the absolute position information between the sound objects and the listening position, and the audio data of the other frames can reuse that information, instead of the metadata of every frame carrying it. This reduces the data volume and transmission bandwidth of the metadata, improves coding efficiency, ensures that the renderer can subsequently render the directions of the sound objects correctly, and provides correct spatial audio perception results without affecting the final decoding and rendering effect. In addition, the metadata in the embodiments of the present disclosure further includes the orientation information and the sound radiation range of the sound objects, so that when the object audio signal is subsequently rendered, rendering can be performed based on the orientation information and the sound radiation range to simulate the differences in perceived sound caused by different actual orientations of the sound object, thereby improving the rendering effect.
Moreover, as can be seen from the embodiments of Figure 4 and Figure 5 above, by combining relative position information and absolute position information in encoding, the present disclosure can achieve the most efficient spatial audio metadata scheme both in scenarios where the relative positions remain unchanged and in scenarios where the absolute positions remain unchanged.
Figure 5 is a schematic flowchart of an audio processing method provided by an embodiment of the present disclosure. The method is executed by an encoding device. As shown in Figure 5, the audio processing method may include the following steps:
Step 501: Determine whether the sound object has an orientation.
In one embodiment of the present disclosure, if a sound object emits sound in all directions, the sound object is considered to have no orientation; otherwise, the sound-emitting direction of the sound object is determined as the orientation of the sound object.
Step 502: In response to the sound object having an orientation, include the orientation information of the sound object in the metadata, and include a marker in the metadata, where the marker is used to indicate that the metadata includes the orientation information.
For a detailed introduction to steps 501-502, refer to the description of the above embodiments; details are not repeated here.
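A minimal sketch of steps 501-502 under assumed field names: the marker is written for every object, and the orientation field itself is written only when the object has an orientation:

```python
def add_orientation(metadata, emits_in_all_directions, orientation=None):
    """Include orientation info and a presence marker only for directional sound objects."""
    if emits_in_all_directions:
        metadata["has_orientation"] = 0     # no orientation: field omitted
    else:
        metadata["has_orientation"] = 1     # marker: metadata includes orientation info
        metadata["orientation"] = orientation
    return metadata
```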
To sum up, the metadata in the embodiments of the present disclosure further includes the orientation information of the sound object, so that when the object audio signal is subsequently rendered, rendering can be performed based on the orientation information to simulate the differences in perceived sound caused by different actual orientations of the sound object, thereby improving the rendering effect.
Figure 6a is a schematic flowchart of an audio processing method provided by an embodiment of the present disclosure. The method is executed by an encoding device. As shown in Figure 6a, the audio processing method may include the following steps:
Step 601a: Determine whether the sound object has an orientation.
Step 602a: In response to the sound object having no orientation, do not include orientation information in the metadata.
For a detailed introduction to steps 601a-602a, refer to the description of the above embodiments; details are not repeated here.
To sum up, in the audio processing method provided by this embodiment of the present disclosure, the encoding device determines the metadata of each frame of audio data, where the metadata includes at least one of the absolute position information of the sound objects in the audio data, the relative position information of the sound objects, the orientation information of the sound objects, and the sound radiation range of the sound objects; the encoding device then obtains the object audio signal based on the metadata of the audio data. It follows that, in the embodiments of the present disclosure, the metadata may include the absolute position information of the sound objects. On this basis, when audio data is recorded or an object audio signal is produced, if the absolute positions of multiple sound objects are fixed while the listening position keeps moving, the metadata can be made to include the absolute position information of the sound objects. In this case, because the absolute positions of the sound objects are fixed, only the metadata of the audio data of a certain frame (such as the first frame) needs to include the absolute position information between the sound objects and the listening position, and the audio data of the other frames can reuse that information, instead of the metadata of every frame carrying it. This reduces the data volume and transmission bandwidth of the metadata, improves coding efficiency, ensures that the renderer can subsequently render the directions of the sound objects correctly, and provides correct spatial audio perception results without affecting the final decoding and rendering effect. In addition, the metadata in the embodiments of the present disclosure further includes the orientation information and the sound radiation range of the sound objects, so that when the object audio signal is subsequently rendered, rendering can be performed based on the orientation information and the sound radiation range to simulate the differences in perceived sound caused by different actual orientations of the sound object, thereby improving the rendering effect.
Figure 6b is a schematic flowchart of an audio processing method provided by an embodiment of the present disclosure. The method is executed by an encoding device. As shown in Figure 6b, the audio processing method may include the following steps:
Step 601b: Determine the metadata of each frame of audio data, where the metadata includes at least one of the absolute position information of the sound objects in the audio data, the relative position information of the sound objects, the orientation information of the sound objects, and the sound radiation range of the sound objects.
Step 602b: Obtain the object audio signal based on the metadata of the audio data.
For a detailed introduction to steps 601b-602b, refer to the description of the above embodiments; details are not repeated here.
Step 603b: Encode the object audio signal.
Step 604b: Send the encoded signal to a decoding device.
To sum up, in the audio processing method provided by this embodiment of the present disclosure, the encoding device determines the metadata of each frame of audio data, where the metadata includes at least one of the absolute position information of the sound objects in the audio data, the relative position information of the sound objects, the orientation information of the sound objects, and the sound radiation range of the sound objects; the encoding device then obtains the object audio signal based on the metadata of the audio data. It follows that, in the embodiments of the present disclosure, the metadata may include the absolute position information of the sound objects. On this basis, when audio data is recorded or an object audio signal is produced, if the absolute positions of multiple sound objects are fixed while the listening position keeps moving, the metadata can be made to include the absolute position information of the sound objects. In this case, because the absolute positions of the sound objects are fixed, only the metadata of the audio data of a certain frame (such as the first frame) needs to include the absolute position information between the sound objects and the listening position, and the audio data of the other frames can reuse that information, instead of the metadata of every frame carrying it. This reduces the data volume and transmission bandwidth of the metadata, improves coding efficiency, ensures that the renderer can subsequently render the directions of the sound objects correctly, and provides correct spatial audio perception results without affecting the final decoding and rendering effect. In addition, the metadata in the embodiments of the present disclosure further includes the orientation information and the sound radiation range of the sound objects, so that when the object audio signal is subsequently rendered, rendering can be performed based on the orientation information and the sound radiation range to simulate the differences in perceived sound caused by different actual orientations of the sound object, thereby improving the rendering effect.
The above audio processing method is illustrated below with examples.
Figure 7 is a schematic flowchart of an audio processing method provided by an embodiment of the present disclosure. As shown in Figure 7, a local multi-person conference scenario is taken as an example. The room on the left is the recording end, and the room on the right is the playback end. There are multiple objects in the recording-end room, including object 1, object 2, and object 4, who are speaking, and object 3, who is not speaking. All of these objects are regarded as sound objects in the scene: their corresponding speech data is captured through microphones, and positioning and attitude sensors such as gyroscopes and ultrasonic rangefinders are used to obtain the spatial information (such as relative position information or absolute position information) and orientation information of each object. After the audio data, spatial information, and orientation information of each object are encoded, transmitted, decoded, and rendered, the listener can feel as if they were in the conference scene on the left: they can perceive not only the directions and distances of object 1, object 2, and object 4, but also the orientations of the objects. In addition, this scheme can encode and transmit object 3, which has no audio data, as a sound object as well; it can be regarded as a listener in the recording scene. With this scheme, the playback end can fully reproduce the real listening experience of object 3, including the changes in the listening experience caused by changes in object 3's position and head rotation (orientation changes).
Figure 8 is a schematic flowchart of an audio processing method provided by an embodiment of the present disclosure. As shown in Figure 8, in a multi-person remote conference scenario, the far-end participants are on the left; multiple participants are located in different places and different rooms, and each can be regarded as a sound object for object audio coding. Displacement or position sensors and attitude sensors can be used to obtain each object's spatial position change information and head orientation information, and a microphone is used to obtain the participant's speech signal as the object audio data; the spatial position information, head orientation information, and object audio data are then used for object audio encoding. For the near-end user (the right side of Figure 8), after the multiple encoded remote object audio streams are obtained, decoded, and rendered in combination with the near-end user's local spatial information, the user can perceive the voices of the multiple far-end participants with a sense of direction that changes over time, and can also perceive the changes in the listening experience caused by the orientations of the far-end participants.
Figure 9a is a schematic flowchart of an audio processing method provided by an embodiment of the present disclosure. The method is performed by a decoding device. As shown in Figure 9a, the audio processing method may include the following steps:
Step 901a: obtaining the encoded signal sent by the encoding device;
Step 902a: decoding the encoded signal to obtain an object audio signal;
Step 903a: determining metadata of the object audio signal, the metadata including at least one of absolute position information of a sound object, relative position information of the sound object, orientation information of the sound object, and a sound radiation range of the sound object;
Step 904a: rendering the object audio signal based on the metadata.
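To make the decoding-side flow of steps 901a to 904a concrete, the following is a minimal Python sketch under stated assumptions: decode_bitstream and render_spatial are hypothetical placeholder callables, and the packet/metadata dictionary layout is invented for illustration; none of these names come from the disclosure itself.

```python
# Minimal sketch of the decoding-side flow (steps 901a-904a); illustrative only.
# decode_bitstream() and render_spatial() are hypothetical placeholder callables.

def process_encoded_signal(encoded_signal, decode_bitstream, render_spatial):
    # Step 902a: decode the encoded signal into an object audio signal.
    object_audio_signal = decode_bitstream(encoded_signal)

    rendered_frames = []
    for packet in object_audio_signal["packets"]:
        # Step 903a: read the per-frame metadata; any of the four fields may
        # be absent, but at least one is present.
        metadata = packet["metadata"]  # e.g. {"absolute_position": (x, y, z),
                                       #       "orientation": (...), ...}
        # Step 904a: render the frame's audio data based on its metadata.
        rendered_frames.append(render_spatial(packet["audio"], metadata))
    return rendered_frames
```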
Optionally, in an embodiment of the present disclosure, the orientation information includes absolute orientation information and/or relative orientation information;
the relative orientation information is used to indicate the relative orientation between the sound object and the listening position.
Optionally, in an embodiment of the present disclosure, the metadata further includes at least one of the following:
the sound source size of the sound object;
the width of the sound object;
the height of the sound object;
the spatial state of the sound object, the spatial state including moving or stationary;
the type of the sound object.
Optionally, in an embodiment of the present disclosure, the object audio signal includes a header file and object audio data packets;
the header file includes environmental space information of the sound object and basic information of the sound object;
each object audio data packet includes metadata of the audio data and the audio data (illustrative containers for this layout are sketched below).
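As one way such a layout could be represented in code, the following Python sketch defines plausible containers for the header file and the object audio data packets. All field names and default values are assumptions made for illustration; the disclosure fixes only the categories of information, not a concrete format.

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative containers for the object audio signal described above.
# All field names and defaults are assumptions, not part of the disclosure.

@dataclass
class HeaderFile:
    # Environmental space information (when the sound object is in a room).
    room_size: Optional[tuple] = None           # e.g. (w, d, h), units assumed
    wall_type: Optional[str] = None
    wall_reflection: Optional[float] = None
    room_type: Optional[str] = None
    reverberation_time: Optional[float] = None  # seconds, assumed
    # Basic information of the sound objects.
    num_objects: int = 1
    sample_rate: int = 48000                    # Hz, assumed default
    bit_depth: int = 16                         # bit width of the sound source
    frame_length: int = 1024                    # samples per frame, assumed

@dataclass
class ObjectAudioPacket:
    metadata: dict                              # per-frame metadata
    audio: bytes                                # one frame of audio data

@dataclass
class ObjectAudioSignal:
    header: HeaderFile
    packets: list = field(default_factory=list)
```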
Optionally, in an embodiment of the present disclosure, in response to the sound object being located in a room, the environmental space information includes at least one of the following:
the room size;
the room wall type;
the wall reflection coefficient;
the room type;
the reverberation time.
Optionally, in an embodiment of the present disclosure, the basic information of the sound object includes at least one of the following:
the number of sound objects;
the sampling rate of the sound source of the sound object;
the bit width of the sound source of the sound object;
the frame length of each frame of audio data.
Optionally, in an embodiment of the present disclosure, rendering the object audio signal based on the metadata includes:
rendering the audio data based on the metadata and the header file, as in the sketch that follows.
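A minimal sketch of such rendering, assuming per-frame metadata dictionaries, the HeaderFile container sketched earlier, and caller-supplied pan_to_position and apply_reverb callables (all hypothetical), might look as follows. The directivity model here is a deliberately crude stand-in.

```python
# Illustrative rendering of one frame from its metadata plus the header file.
# pan_to_position() and apply_reverb() are hypothetical placeholder callables.

def orientation_gain(angle_deg, radiation_range_deg):
    # Toy directivity model: full level inside the sound object's radiation
    # range, attenuated outside it. A real renderer would be far more refined.
    if radiation_range_deg is None:
        return 1.0
    return 1.0 if abs(angle_deg) <= radiation_range_deg / 2 else 0.5

def render_frame(audio, metadata, header, pan_to_position, apply_reverb):
    # Directional rendering from position metadata (absolute or relative).
    position = metadata.get("absolute_position") or metadata.get("relative_position")
    out = pan_to_position(audio, position)
    # Orientation and radiation range shape the perceived level.
    g = orientation_gain(metadata.get("orientation_deg", 0.0),
                         metadata.get("radiation_range_deg"))
    out = [sample * g for sample in out]   # audio assumed as float samples
    # The header file's room information drives environment simulation.
    if getattr(header, "reverberation_time", None):
        out = apply_reverb(out, header.reverberation_time)
    return out
```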
For a detailed description of steps 901a to 904a, reference may be made to the foregoing embodiments; it is not repeated here.
To sum up, in the audio processing method provided by the embodiments of the present disclosure, the encoding device determines metadata for each frame of audio data, the metadata including at least one of absolute position information of a sound object in the audio data, relative position information of the sound object, orientation information of the sound object, and a sound radiation range of the sound object; the encoding device then obtains an object audio signal based on the metadata of the audio data. Accordingly, in the embodiments of the present disclosure, the metadata may include absolute position information of the sound object. On this basis, when audio data is recorded or an object audio signal is produced, if the absolute positions of multiple sound objects are fixed while the listening position keeps moving, the metadata can carry the absolute position information of the sound objects. Because the absolute positions are fixed, only the metadata of a certain frame (for example, the first frame) needs to include the absolute position information between the sound object and the listening position, and the audio data of the other frames can reuse that information rather than every frame's metadata carrying it. This reduces the data volume of the metadata and the transmission bandwidth, improves the coding efficiency, and still ensures that the renderer can subsequently render the direction of the sound object correctly and provide a correct spatial audio perception result, without affecting the final decoded rendering effect. In addition, the metadata in the embodiments of the present disclosure also includes the orientation information of the sound object and the sound radiation range of the sound object, so that when the object audio signal is subsequently rendered, the rendering can be based on this orientation information and radiation range, simulating the differences in perception produced by different actual orientations of the sounding object and improving the rendering effect.
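The frame-to-frame reuse described above can be sketched as follows. The sketch assumes a simple convention, not mandated by the disclosure, in which a frame whose metadata omits the position field inherits the most recently signaled absolute position.

```python
# Illustrative decoder-side reuse of absolute position information.
# Assumed convention: a packet without "absolute_position" reuses the last
# position that was signaled (e.g. in the first frame).

def resolve_positions(packets):
    last_position = None
    resolved = []
    for packet in packets:
        meta = dict(packet["metadata"])
        if "absolute_position" in meta:
            last_position = meta["absolute_position"]
        elif last_position is not None:
            meta["absolute_position"] = last_position  # reuse earlier frame
        resolved.append({"metadata": meta, "audio": packet["audio"]})
    return resolved

# Example: only the first frame carries the position.
frames = [
    {"metadata": {"absolute_position": (1.0, 2.0, 0.0)}, "audio": b"..."},
    {"metadata": {}, "audio": b"..."},
]
assert resolve_positions(frames)[1]["metadata"]["absolute_position"] == (1.0, 2.0, 0.0)
```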
Figure 9b is a schematic flowchart of an audio processing method provided by an embodiment of the present disclosure. The method is performed by a decoding device. As shown in Figure 9b, the audio processing method may include the following steps:
Step 901b: obtaining the encoded signal sent by the encoding device;
Step 902b: decoding the encoded signal to obtain an object audio signal;
Step 903b: determining metadata of the object audio signal, the metadata including absolute position information of a sound object;
Step 904b: rendering the object audio signal based on the metadata.
In summary, the technical effects of this embodiment are the same as those set out above in connection with Figure 9a and are not repeated here.
Figure 9c is a schematic flowchart of an audio processing method provided by an embodiment of the present disclosure. The method is performed by a decoding device. As shown in Figure 9c, the audio processing method may include the following steps:
Step 901c: obtaining the encoded signal sent by the encoding device;
Step 902c: decoding the encoded signal to obtain an object audio signal;
Step 903c: determining metadata of the object audio signal, the metadata including relative position information of a sound object;
Step 904c: rendering the object audio signal based on the metadata.
In summary, the technical effects of this embodiment are the same as those set out above in connection with Figure 9a and are not repeated here.
Figure 9d is a schematic flowchart of an audio processing method provided by an embodiment of the present disclosure. The method is performed by a decoding device. As shown in Figure 9d, the audio processing method may include the following steps:
Step 901d: obtaining the encoded signal sent by the encoding device;
Step 902d: decoding the encoded signal to obtain an object audio signal;
Step 903d: determining metadata of the object audio signal, the metadata including orientation information of a sound object and a flag, the flag being used to indicate that the metadata includes orientation information;
Step 904d: rendering the object audio signal based on the metadata.
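On the decoding side, one way such a flag might be read is sketched below. The one-byte flag and signed-byte yaw/pitch/roll layout are assumptions for illustration; the disclosure specifies the flag's purpose, not its encoding.

```python
# Illustrative parsing of an orientation-presence flag (step 903d).
# The one-byte flag and signed-byte angle layout are assumptions.

def parse_frame_metadata(buf):
    meta = {}
    has_orientation = buf[0] & 0x01    # assumed: bit 0 marks orientation
    offset = 1
    if has_orientation:
        # Assumed fixed-point yaw/pitch/roll, one signed byte each.
        yaw, pitch, roll = (
            int.from_bytes(buf[offset + i:offset + i + 1], "big", signed=True)
            for i in range(3)
        )
        meta["orientation"] = (yaw, pitch, roll)
        offset += 3
    return meta, offset

meta, _ = parse_frame_metadata(bytes([0x01, 10, 0, 251]))
assert meta["orientation"] == (10, 0, -5)
```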
In summary, the technical effects of this embodiment are the same as those set out above in connection with Figure 9a and are not repeated here.
Figure 9e is a schematic flowchart of an audio processing method provided by an embodiment of the present disclosure. The method is performed by a decoding device. As shown in Figure 9e, the audio processing method may include the following steps:
Step 901e: obtaining the encoded signal sent by the encoding device;
Step 902e: decoding the encoded signal to obtain an object audio signal;
Step 903e: determining metadata of the object audio signal, the metadata including orientation information of a sound object and a flag, the flag being used to indicate that the metadata includes orientation information;
Step 904e: rendering the object audio signal based on the metadata.
In summary, the technical effects of this embodiment are the same as those set out above in connection with Figure 9a and are not repeated here.
Figure 9f is a schematic structural diagram of an audio processing apparatus provided by an embodiment of the present disclosure. As shown in Figure 9f, the apparatus may include:
a determining module 901f, configured to determine metadata of each frame of audio data, the metadata including at least one of absolute position information of a sound object in the audio data, relative position information of the sound object, orientation information of the sound object, and a sound radiation range of the sound object;
a processing module 902f, configured to obtain an object audio signal based on the metadata of the audio data.
In summary, the technical effects of this apparatus are the same as those set out above in connection with the audio processing method and are not repeated here.
Optionally, in an embodiment of the present disclosure, the determining module is further configured to:
determine whether the metadata needs to contain absolute position information or relative position information (a sketch of this choice follows the list);
wherein, in response to determining that the metadata needs to contain absolute position information, the metadata is caused to contain absolute position information;
in response to determining that the metadata needs to contain relative position information, the metadata is caused to contain relative position information, the relative position information being used to indicate the relative position between the sound object and the listener.
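A minimal sketch of this choice on the encoding side follows. The heuristic (fixed sound objects with a moving listening position favor absolute positions) is drawn from the summary above, while the function and field names are assumptions.

```python
# Illustrative encoder-side choice between absolute and relative position.
# Heuristic follows the discussion above: fixed sound objects with a moving
# listening position are best described by absolute positions.

def build_position_metadata(object_position, listener_position,
                            object_is_fixed, listener_is_moving):
    if object_is_fixed and listener_is_moving:
        # Absolute position: can be signaled once and reused by later frames.
        return {"absolute_position": object_position}
    # Otherwise signal the position of the object relative to the listener.
    rel = tuple(o - l for o, l in zip(object_position, listener_position))
    return {"relative_position": rel}

meta = build_position_metadata((3.0, 1.0, 0.0), (1.0, 1.0, 0.0), False, False)
assert meta == {"relative_position": (2.0, 0.0, 0.0)}
```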
Optionally, in an embodiment of the present disclosure, the determining module is further configured to:
determine whether the sound object has an orientation;
in response to the sound object having an orientation, include the orientation information of the sound object in the metadata and include a flag in the metadata, the flag being used to indicate that the metadata includes orientation information (an encoder-side sketch follows);
in response to the sound object having no orientation, include no orientation information in the metadata.
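The encoder-side counterpart of the decoding sketch after Figure 9d might look as follows, again assuming the same hypothetical one-byte flag and signed-byte angle layout.

```python
# Illustrative encoder-side flag handling: write a presence bit, then the
# orientation only when the object actually has one. The byte layout mirrors
# the assumed decoder sketch and is not part of the disclosure.

def pack_frame_metadata(orientation=None):
    if orientation is None:
        return bytes([0x00])                 # flag cleared: no orientation
    yaw, pitch, roll = orientation
    out = bytearray([0x01])                  # flag set: orientation follows
    for angle in (yaw, pitch, roll):
        out += int(angle).to_bytes(1, "big", signed=True)
    return bytes(out)

assert pack_frame_metadata((10, 0, -5)) == bytes([0x01, 10, 0, 251])
assert pack_frame_metadata() == bytes([0x00])
```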
Optionally, in an embodiment of the present disclosure, the orientation information includes absolute orientation information and/or relative orientation information;
the relative orientation information is used to indicate the relative orientation between the sound object and the listener.
Optionally, in an embodiment of the present disclosure, the metadata further includes at least one of the following:
the sound source size of the sound object;
the width of the sound object;
the height of the sound object;
the spatial state of the sound object, the spatial state including moving or stationary;
the type of the sound object.
Optionally, in an embodiment of the present disclosure, the apparatus is further configured to:
determine the environmental space information of the sound object;
determine the basic information of the sound object;
sample the audio data of the sound object in units of frames (a sketch of frame-based sampling follows).
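As a rough illustration of frame-based sampling, the following sketch splits a stream of samples into fixed-length frames using the frame length carried in the basic information; the zero-padding policy for the final frame is an assumption.

```python
# Illustrative frame-based sampling of a sound object's audio data.
# frame_length would come from the basic information in the header file.

def frames_of(samples, frame_length):
    # Yield successive fixed-length frames; the tail is zero-padded so every
    # frame has the same length (padding policy is an assumption).
    for start in range(0, len(samples), frame_length):
        frame = samples[start:start + frame_length]
        if len(frame) < frame_length:
            frame = frame + [0.0] * (frame_length - len(frame))
        yield frame

frames = list(frames_of([0.1, -0.2, 0.3, 0.05, -0.1], frame_length=4))
assert len(frames) == 2 and len(frames[1]) == 4
```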
Optionally, in an embodiment of the present disclosure, in response to the sound object being located in a room, the environmental space information includes at least one of the following:
the room size;
the room wall type;
the wall reflection coefficient;
the room type;
the reverberation time.
Optionally, in an embodiment of the present disclosure, the basic information of the sound object includes at least one of the following:
the number of sound objects;
the sampling rate of the sound source of the sound object;
the bit width of the sound source of the sound object;
the frame length of each frame of audio data.
Optionally, in an embodiment of the present disclosure, the processing module is further configured to:
store the environmental space information of the sound object and the basic information of the sound object as a header file;
store the metadata of each frame of audio data together with that frame of audio data as one object audio data packet;
splice the header file and the object audio data packets to obtain at least one object audio signal, as in the sketch that follows.
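The splicing step could be sketched as follows, reusing the packet notion from the earlier container sketch. The length-prefixed, JSON-based serialization is purely an assumption for illustration; the disclosure does not define a byte layout.

```python
import json

# Illustrative assembly of an object audio signal: header first, then one
# packet (metadata + audio) per frame. The serialization is an assumption.

def splice_object_audio_signal(header_dict, frames):
    header_bytes = json.dumps(header_dict).encode("utf-8")
    blob = len(header_bytes).to_bytes(4, "big") + header_bytes
    for metadata, audio in frames:
        meta_bytes = json.dumps(metadata).encode("utf-8")
        blob += len(meta_bytes).to_bytes(4, "big") + meta_bytes
        blob += len(audio).to_bytes(4, "big") + audio
    return blob

signal = splice_object_audio_signal(
    {"num_objects": 1, "sample_rate": 48000},
    [({"absolute_position": [1.0, 2.0, 0.0]}, b"\x00\x01")],
)
```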
Optionally, in an embodiment of the present disclosure, the method further includes:
encoding the object audio signal;
sending the encoded signal to a decoding device.
Figure 9g is a schematic structural diagram of an audio processing apparatus provided by an embodiment of the present disclosure. As shown in Figure 9g, the apparatus may include:
an obtaining module 901g, configured to obtain the encoded signal sent by the encoding device;
a decoding module 902g, configured to decode the encoded signal to obtain an object audio signal;
a determining module 903g, configured to determine metadata of the object audio signal, the metadata including at least one of absolute position information of a sound object, relative position information of the sound object, orientation information of the sound object, and a sound radiation range of the sound object;
a rendering module 904g, configured to render the object audio signal based on the metadata.
In summary, the technical effects of this apparatus are the same as those set out above in connection with the audio processing method and are not repeated here.
Optionally, in an embodiment of the present disclosure, the orientation information includes absolute orientation information and/or relative orientation information;
the relative orientation information is used to indicate the relative orientation between the sound object and the listening position.
Optionally, in an embodiment of the present disclosure, the metadata further includes at least one of the following:
the sound source size of the sound object;
the width of the sound object;
the height of the sound object;
the spatial state of the sound object, the spatial state including moving or stationary;
the type of the sound object.
Optionally, in an embodiment of the present disclosure, the object audio signal includes a header file and object audio data packets;
the header file includes environmental space information of the sound object and basic information of the sound object;
each object audio data packet includes metadata of the audio data and the audio data.
Optionally, in an embodiment of the present disclosure, in response to the sound object being located in a room, the environmental space information includes at least one of the following:
the room size;
the room wall type;
the wall reflection coefficient;
the room type;
the reverberation time.
Optionally, in an embodiment of the present disclosure, the basic information of the sound object includes at least one of the following:
the number of sound objects;
the sampling rate of the sound source of the sound object;
the bit width of the sound source of the sound object;
the frame length of each frame of audio data.
Optionally, in an embodiment of the present disclosure, rendering the object audio signal based on the metadata includes:
rendering the audio data based on the metadata and the header file.
Figure 10 is a block diagram of a user equipment UE1000 provided by an embodiment of the present disclosure. For example, the UE1000 may be a mobile phone, a computer, a digital broadcast terminal device, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or the like.
Referring to Figure 10, the UE1000 may include at least one of the following components: a processing component 1002, a memory 1004, a power supply component 1006, a multimedia component 1008, an audio component 1010, an input/output (I/O) interface 1012, a sensor component 1013 and a communication component 1016.
The processing component 1002 generally controls the overall operation of the UE1000, such as operations associated with display, telephone calls, data communication, camera operation and recording operation. The processing component 1002 may include at least one processor 1020 to execute instructions so as to complete all or part of the steps of the methods described above. In addition, the processing component 1002 may include at least one module that facilitates interaction between the processing component 1002 and other components. For example, the processing component 1002 may include a multimedia module to facilitate interaction between the multimedia component 1008 and the processing component 1002.
The memory 1004 is configured to store various types of data to support operation at the UE1000. Examples of such data include instructions for any application or method operated on the UE1000, contact data, phonebook data, messages, pictures, videos and the like. The memory 1004 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as a static random access memory (SRAM), an electrically erasable programmable read-only memory (EEPROM), an erasable programmable read-only memory (EPROM), a programmable read-only memory (PROM), a read-only memory (ROM), a magnetic memory, a flash memory, a magnetic disk or an optical disc.
The power supply component 1006 provides power for the various components of the UE1000. The power supply component 1006 may include a power management system, at least one power supply, and other components associated with generating, managing and distributing power for the UE1000.
The multimedia component 1008 includes a screen that provides an output interface between the UE1000 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes at least one touch sensor to sense touches, swipes and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or swipe action but also detect the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 1008 includes a front camera and/or a rear camera. When the UE1000 is in an operating mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capability.
The audio component 1010 is configured to output and/or input audio signals. For example, the audio component 1010 includes a microphone (MIC); when the UE1000 is in an operating mode, such as a call mode, a recording mode or a speech recognition mode, the microphone is configured to receive external audio signals. The received audio signals may be further stored in the memory 1004 or sent via the communication component 1016. In some embodiments, the audio component 1010 further includes a speaker for outputting audio signals.
The I/O interface 1012 provides an interface between the processing component 1002 and peripheral interface modules, which may be a keyboard, a click wheel, buttons and the like. These buttons may include, but are not limited to, a home button, volume buttons, a start button and a lock button.
The sensor component 1013 includes at least one sensor for providing status assessments of various aspects of the UE1000. For example, the sensor component 1013 may detect the on/off state of the UE1000 and the relative positioning of components (for example, the display and keypad of the UE1000); the sensor component 1013 may also detect a change in the position of the UE1000 or of a component of the UE1000, the presence or absence of user contact with the UE1000, the orientation or acceleration/deceleration of the UE1000, and changes in the temperature of the UE1000. The sensor component 1013 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 1013 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 1013 may further include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor or a temperature sensor.
The communication component 1016 is configured to facilitate wired or wireless communication between the UE1000 and other devices. The UE1000 can access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 1016 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 1016 further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.
In an exemplary embodiment, the UE1000 may be implemented by at least one application-specific integrated circuit (ASIC), digital signal processor (DSP), digital signal processing device (DSPD), programmable logic device (PLD), field programmable gate array (FPGA), controller, microcontroller, microprocessor or other electronic component, for executing the methods described above.
Figure 11 is a block diagram of a network-side device 1100 provided by an embodiment of the present disclosure. For example, the network-side device 1100 may be provided as a network-side device. Referring to Figure 11, the network-side device 1100 includes a processing component 1110, which further includes at least one processor, and memory resources represented by a memory 1132 for storing instructions, such as application programs, executable by the processing component 1110. An application program stored in the memory 1132 may include one or more modules, each corresponding to a set of instructions. In addition, the processing component 1110 is configured to execute instructions so as to perform any of the foregoing methods applied to the network-side device, for example the method shown in Figure 1.
The network-side device 1100 may further include a power supply component 1126 configured to perform power management of the network-side device 1100, a wired or wireless network interface 1150 configured to connect the network-side device 1100 to a network, and an input/output (I/O) interface 1158. The network-side device 1100 may operate based on an operating system stored in the memory 1132, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ or the like.
In the foregoing embodiments provided by the present disclosure, the methods provided by the embodiments of the present disclosure are described from the perspectives of the network-side device and the UE respectively. To implement the functions of the methods provided by the foregoing embodiments of the present disclosure, the network-side device and the UE may include a hardware structure and software modules, and implement the above functions in the form of a hardware structure, software modules, or a hardware structure plus software modules. A given one of the above functions may be executed by a hardware structure, a software module, or a hardware structure plus a software module.
An embodiment of the present disclosure provides a communication apparatus. The communication apparatus may include a transceiver module and a processing module. The transceiver module may include a sending module and/or a receiving module, the sending module being used to implement a sending function and the receiving module being used to implement a receiving function; the transceiver module may implement the sending function and/or the receiving function.
The communication apparatus may be a terminal device (such as the terminal device in the foregoing method embodiments), an apparatus within a terminal device, or an apparatus capable of being used in combination with a terminal device. Alternatively, the communication apparatus may be a network device, an apparatus within a network device, or an apparatus capable of being used in combination with a network device.
An embodiment of the present disclosure provides another communication apparatus. The communication apparatus may be a network device or a terminal device (such as the terminal device in the foregoing method embodiments), or it may be a chip, chip system or processor that supports the network device in implementing the above methods, or a chip, chip system or processor that supports the terminal device in implementing the above methods. The apparatus can be used to implement the methods described in the foregoing method embodiments; for details, reference may be made to the descriptions in those embodiments.
The communication apparatus may include one or more processors. A processor may be a general-purpose processor, a dedicated processor or the like, for example a baseband processor or a central processing unit. The baseband processor may be used to process communication protocols and communication data, and the central processing unit may be used to control the communication apparatus (for example, a network-side device, a baseband chip, a terminal device, a terminal device chip, a DU or a CU), execute computer programs and process data of the computer programs.
Optionally, the communication apparatus may further include one or more memories, on which a computer program may be stored; the processor executes the computer program so that the communication apparatus performs the methods described in the foregoing method embodiments. Optionally, data may also be stored in the memory. The communication apparatus and the memory may be provided separately or may be integrated together.
Optionally, the communication apparatus may further include a transceiver and an antenna. The transceiver may be called a transceiver unit, a transceiver or a transceiver circuit, and is used to implement transceiving functions. The transceiver may include a receiver and a transmitter; the receiver may be called a receiving machine or a receiving circuit and is used to implement a receiving function, and the transmitter may be called a transmitting machine or a transmitting circuit and is used to implement a transmitting function.
Optionally, the communication apparatus may further include one or more interface circuits. The interface circuits are used to receive code instructions and transmit them to the processor. The processor runs the code instructions so that the communication apparatus performs the methods described in the foregoing method embodiments.
Where the communication apparatus is a terminal device (such as the terminal device in the foregoing method embodiments), the processor is configured to execute the method shown in any one of Figures 1 to 4.
Where the communication apparatus is a network device, the transceiver is configured to execute the method shown in any one of Figures 5 to 7.
In one implementation, the processor may include a transceiver for implementing the receiving and sending functions. The transceiver may be, for example, a transceiver circuit, an interface or an interface circuit. The transceiver circuits, interfaces or interface circuits used to implement the receiving and sending functions may be separate or may be integrated together. The above transceiver circuit, interface or interface circuit may be used for reading and writing code/data, or may be used for signal transmission or transfer.
In one implementation, the processor may store a computer program which, when run on the processor, causes the communication apparatus to perform the methods described in the foregoing method embodiments. The computer program may be embedded in the processor, in which case the processor may be implemented by hardware.
In one implementation, the communication apparatus may include circuits that implement the sending, receiving or communicating functions of the foregoing method embodiments. The processors and transceivers described in the present disclosure may be implemented in integrated circuits (ICs), analog ICs, radio frequency integrated circuits (RFICs), mixed-signal ICs, application-specific integrated circuits (ASICs), printed circuit boards (PCBs), electronic devices and the like. The processors and transceivers may also be manufactured using various IC process technologies, such as complementary metal oxide semiconductor (CMOS), N-type metal oxide semiconductor (NMOS), P-type metal oxide semiconductor (PMOS), bipolar junction transistor (BJT), bipolar CMOS (BiCMOS), silicon germanium (SiGe), gallium arsenide (GaAs) and so on.
The communication apparatus described in the above embodiments may be a network device or a terminal device (such as the terminal device in the foregoing method embodiments), but the scope of the communication apparatus described in the present disclosure is not limited thereto, and the structure of the communication apparatus is not limited either. The communication apparatus may be a stand-alone device or may be part of a larger device. For example, the communication apparatus may be:
(1) a stand-alone integrated circuit (IC), a chip, or a chip system or subsystem;
(2) a collection of one or more ICs, where, optionally, the IC collection may also include storage components for storing data and computer programs;
(3) an ASIC, such as a modem;
(4) a module that can be embedded in other devices;
(5) a receiver, a terminal device, an intelligent terminal device, a cellular phone, a wireless device, a handheld device, a mobile unit, a vehicle-mounted device, a network device, a cloud device, an artificial intelligence device, or the like;
(6) others, and so on.
Where the communication apparatus may be a chip or a chip system, the chip includes a processor and an interface. There may be one or more processors, and there may be multiple interfaces.
Optionally, the chip further includes a memory, which is used to store necessary computer programs and data.
Those skilled in the art will also appreciate that the various illustrative logical blocks and steps listed in the embodiments of the present disclosure may be implemented by electronic hardware, computer software or a combination of both. Whether such functions are implemented by hardware or software depends on the specific application and the design requirements of the overall system. Those skilled in the art may use various methods to implement the described functions for each specific application, but such implementations should not be understood as going beyond the protection scope of the embodiments of the present disclosure.
An embodiment of the present disclosure further provides a system for determining a sidelink duration, the system including the communication apparatus serving as the terminal device (such as the first terminal device in the foregoing method embodiments) and the communication apparatus serving as the network device in the foregoing embodiments.
The present disclosure further provides a readable storage medium having instructions stored thereon which, when executed by a computer, implement the functions of any of the foregoing method embodiments.
The present disclosure further provides a computer program product which, when executed by a computer, implements the functions of any of the foregoing method embodiments.
In the above embodiments, the implementation may be entirely or partly by software, hardware, firmware or any combination thereof. When implemented by software, it may be implemented entirely or partly in the form of a computer program product. The computer program product includes one or more computer programs. When the computer programs are loaded and executed on a computer, the processes or functions described in the embodiments of the present disclosure are produced in whole or in part. The computer may be a general-purpose computer, a dedicated computer, a computer network or another programmable apparatus. The computer programs may be stored in a computer-readable storage medium or transferred from one computer-readable storage medium to another; for example, the computer programs may be transmitted from one website, computer, server or data center to another website, computer, server or data center in a wired manner (for example, via coaxial cable, optical fiber or digital subscriber line (DSL)) or a wireless manner (for example, via infrared, radio or microwave). The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or data center integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk or a magnetic tape), an optical medium (for example, a high-density digital video disc (DVD)) or a semiconductor medium (for example, a solid state disk (SSD)), and so on.
本领域普通技术人员可以理解:本公开中涉及的第一、第二等各种数字编号仅为描述方便进行的区分,并不用来限制本公开实施例的范围,也表示先后顺序。Those of ordinary skill in the art can understand that the first, second, and other numerical numbers involved in this disclosure are only for convenience of description and are not used to limit the scope of the embodiments of the disclosure, nor to indicate the order.
本公开中的至少一个还可以描述为一个或多个,多个可以是两个、三个、四个或者更多个,本公开不做限制。在本公开实施例中,对于一种技术特征,通过“第一”、“第二”、“第三”、“A”、“B”、“C”和“D”等区分该种技术特征中的技术特征,该“第一”、“第二”、“第三”、“A”、“B”、“C”和“D”描述的技术特征间无先后顺序或者大小顺序。At least one in the present disclosure can also be described as one or more, and the plurality can be two, three, four or more, and the present disclosure is not limited. In the embodiment of the present disclosure, for a technical feature, the technical feature is distinguished by “first”, “second”, “third”, “A”, “B”, “C” and “D” etc. The technical features described in "first", "second", "third", "A", "B", "C" and "D" are in no particular order or order.
本领域技术人员在考虑说明书及实践这里公开的发明后,将容易想到本发明的其它实施方案。本公开旨在涵盖本发明的任何变型、用途或者适应性变化,这些变型、用途或者适应性变化遵循本发明的一般性原理并包括本公开未公开的本技术领域中的公知常识或惯用技术手段。说明书和实施例仅被视为示例性的,本公开的真正范围和精神由下面的权利要求指出。Other embodiments of the invention will be readily apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. The present disclosure is intended to cover any variations, uses, or adaptations of the invention that follow the general principles of the invention and include common common sense or customary technical means in the technical field that are not disclosed in the present disclosure. . It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
应当理解的是,本公开并不局限于上面已经描述并在附图中示出的精确结构,并且可以在不脱离其范围进行各种修改和改变。本公开的范围仅由所附的权利要求来限制。It is to be understood that the present disclosure is not limited to the precise structures described above and illustrated in the accompanying drawings, and various modifications and changes may be made without departing from the scope thereof. The scope of the disclosure is limited only by the appended claims.

Claims (25)

  1. An audio processing method, characterized in that the method is performed by an encoding device and comprises:
    determining metadata of each frame of audio data, the metadata comprising at least one of: absolute position information of a sound object in the audio data, relative position information of the sound object, orientation information of the sound object, and a sound radiation range of the sound object;
    obtaining an object audio signal based on the metadata of the audio data.
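By way of illustration only, the per-frame metadata of claim 1 could be represented as follows in Python; the field names and types here are assumptions made for this sketch, not identifiers defined by the present disclosure.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class FrameMetadata:
    """Illustrative per-frame metadata for one sound object (claim 1 sketch)."""
    # Absolute position of the sound object in the scene, e.g. (x, y, z) in meters.
    absolute_position: Optional[Tuple[float, float, float]] = None
    # Position of the sound object relative to the listening position.
    relative_position: Optional[Tuple[float, float, float]] = None
    # Orientation of the sound object, e.g. (azimuth, elevation, roll) in degrees.
    orientation: Optional[Tuple[float, float, float]] = None
    # Sound radiation range of the object, e.g. an opening angle in degrees.
    radiation_range: Optional[float] = None

    def is_valid(self) -> bool:
        # Claim 1 requires at least one of the four kinds of information.
        return any(v is not None for v in (
            self.absolute_position, self.relative_position,
            self.orientation, self.radiation_range))
```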
  2. The method according to claim 1, characterized in that determining the metadata of each frame of audio data comprises:
    determining whether the metadata needs to contain absolute position information or relative position information;
    wherein, in response to determining that the metadata needs to contain absolute position information, the absolute position information is included in the metadata;
    in response to determining that the metadata needs to contain relative position information, the relative position information is included in the metadata.
  3. The method according to claim 1, characterized in that determining the metadata of each frame of audio data comprises:
    determining whether the sound object has an orientation;
    in response to the sound object having an orientation, including the orientation information of the sound object in the metadata, and including a flag in the metadata, the flag being used to indicate that the metadata contains orientation information;
    in response to the sound object having no orientation, including no orientation information in the metadata.
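A minimal sketch of the flag-gated carriage described in claim 3, reusing the FrameMetadata sketch above; the single flag byte and little-endian layout are illustrative assumptions, since the disclosure does not fix a serialization format at this point.

```python
import struct

def pack_orientation(meta: "FrameMetadata") -> bytes:
    """Pack an orientation flag plus the optional orientation (claim 3 sketch)."""
    if meta.orientation is not None:
        # Flag byte 0x01 signals that three orientation floats follow.
        return struct.pack("<B3f", 1, *meta.orientation)
    # Flag byte 0x00 signals that no orientation information is present.
    return struct.pack("<B", 0)
```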
  4. The method according to claim 3, characterized in that the orientation information comprises absolute orientation information and/or relative orientation information;
    the relative orientation information is used to indicate a relative orientation between the sound object and a listening position.
  5. The method according to claim 1, characterized in that the metadata further comprises at least one of the following:
    a sound source size of the sound object;
    a width of the sound object;
    a height of the sound object;
    a spatial state of the sound object, the spatial state comprising moving or stationary;
    a type of the sound object.
  6. The method according to any one of claims 1 to 5, characterized in that the method further comprises:
    determining environmental space information of the sound object;
    determining basic information of the sound object;
    sampling audio data of the sound object in units of frames.
  7. The method according to claim 6, characterized in that, in response to the sound object being located in a room, the environmental space information comprises at least one of the following:
    a room size;
    a room wall type;
    a wall reflection coefficient;
    a room type;
    a reverberation time.
  8. The method according to claim 6, characterized in that the basic information of the sound object comprises at least one of the following:
    a number of sound objects;
    a sampling rate of a sound source of the sound object;
    a bit width of the sound source of the sound object;
    a frame length of each frame of audio data.
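Claims 6 through 8 describe scene-level information determined once rather than per frame. A sketch of how these items might be grouped, assuming a one-to-one mapping from the listed items to fields (all names and defaults hypothetical):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class EnvironmentInfo:
    """Environmental space information when the object is in a room (claim 7 sketch)."""
    room_size: Optional[Tuple[float, float, float]] = None  # (w, d, h) in meters
    wall_type: Optional[str] = None                         # e.g. "concrete"
    wall_reflection: Optional[float] = None                 # reflection coefficient
    room_type: Optional[str] = None                         # e.g. "office"
    rt60: Optional[float] = None                            # reverberation time, s

@dataclass
class BasicInfo:
    """Basic information of the sound objects (claim 8 sketch)."""
    num_objects: int = 1
    sample_rate: int = 48000  # Hz, sampling rate of the sound source
    bit_depth: int = 16       # bit width of the sound source
    frame_length: int = 1024  # samples per frame of audio data
```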
  9. The method according to claim 6, characterized in that obtaining the object audio signal based on the metadata of the audio data comprises:
    storing the environmental space information of the sound object and the basic information of the sound object as a header file;
    storing the metadata of each frame of audio data together with that frame of audio data as one object audio data packet;
    splicing the header file and the object audio data packets to obtain at least one object audio signal.
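One way to read claim 9 is as a simple concatenated container: the header once, followed by one packet per frame, each packet carrying that frame's metadata and samples. The length-prefixed layout below is purely an assumption for illustration; the disclosure does not specify a byte format.

```python
import json
import struct

def splice_signal(header: dict, frames: list) -> bytes:
    """Concatenate a header and per-frame object audio packets (claim 9 sketch).

    `frames` is a list of (metadata_dict, pcm_bytes) tuples.
    Every field is length-prefixed so a decoder can walk the stream.
    """
    header_bytes = json.dumps(header).encode("utf-8")
    out = struct.pack("<I", len(header_bytes)) + header_bytes
    for metadata, pcm in frames:
        meta_bytes = json.dumps(metadata).encode("utf-8")
        out += struct.pack("<II", len(meta_bytes), len(pcm))
        out += meta_bytes + pcm
    return out
```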
  10. The method according to claim 1, characterized in that the method further comprises:
    encoding the object audio signal;
    sending the encoded signal to a decoding device.
  11. An audio processing method, characterized in that the method is performed by a decoding device and comprises:
    obtaining an encoded signal sent by an encoding device;
    decoding the encoded signal to obtain an object audio signal;
    determining metadata of the object audio signal, the metadata comprising at least one of: absolute position information of a sound object, relative position information of the sound object, orientation information of the sound object, and a sound radiation range of the sound object;
    rendering the object audio signal based on the metadata.
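On the decoding side, claim 11's "determining metadata of the object audio signal" amounts to walking the container back apart. A sketch matching the assumed layout above:

```python
import json
import struct

def parse_signal(blob: bytes):
    """Split a spliced object audio signal into header and frames (claim 11 sketch)."""
    offset = 0
    (header_len,) = struct.unpack_from("<I", blob, offset)
    offset += 4
    header = json.loads(blob[offset:offset + header_len])
    offset += header_len
    frames = []
    while offset < len(blob):
        # Each packet: metadata length, PCM length, then the two payloads.
        meta_len, pcm_len = struct.unpack_from("<II", blob, offset)
        offset += 8
        metadata = json.loads(blob[offset:offset + meta_len])
        offset += meta_len
        pcm = blob[offset:offset + pcm_len]
        offset += pcm_len
        frames.append((metadata, pcm))
    return header, frames
```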
  12. The method according to claim 11, characterized in that the orientation information comprises absolute orientation information and/or relative orientation information;
    the relative orientation information is used to indicate a relative orientation between the sound object and a listening position.
  13. The method according to claim 11, characterized in that the metadata further comprises at least one of the following:
    a sound source size of the sound object;
    a width of the sound object;
    a height of the sound object;
    a spatial state of the sound object, the spatial state comprising moving or stationary.
  14. The method according to claim 11, characterized in that the object audio signal comprises a header file and object audio data packets;
    the header file comprises environmental space information of the sound object and basic information of the sound object;
    each object audio data packet comprises metadata of audio data and the audio data.
  15. The method according to claim 14, characterized in that, in response to the sound object being located in a room, the environmental space information comprises at least one of the following:
    a room size;
    a room wall type;
    a wall reflection coefficient;
    a room type;
    a reverberation time.
  16. The method according to claim 14, characterized in that the basic information of the sound object comprises at least one of the following:
    a number of sound objects;
    a sampling rate of a sound source of the sound object;
    a bit width of the sound source of the sound object;
    a frame length of each frame of audio data.
  17. The method according to any one of claims 14 to 16, characterized in that rendering the object audio signal based on the metadata comprises:
    rendering the audio data based on the metadata and the header file.
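Claim 17 leaves the rendering method open. One common family of techniques is amplitude panning driven by the relative position metadata; the constant-power stereo sketch below is offered only as an assumed example, since real object renderers (HRTF, VBAP, room modeling from the header's environment information) are far richer.

```python
import math

def render_frame_stereo(samples, relative_position):
    """Pan a mono frame to stereo from the object's relative position (sketch).

    `relative_position` is (x, y, z) with +x to the listener's right
    and +y straight ahead. Uses constant-power panning.
    """
    x, y, _ = relative_position
    azimuth = math.atan2(x, max(y, 1e-9))               # 0 rad = straight ahead
    pan = max(-1.0, min(1.0, azimuth / (math.pi / 2)))  # clamp to [-1, 1]
    theta = (pan + 1.0) * math.pi / 4                   # map to [0, pi/2]
    left_gain, right_gain = math.cos(theta), math.sin(theta)
    left = [s * left_gain for s in samples]
    right = [s * right_gain for s in samples]
    return left, right
```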
  18. An audio processing apparatus, characterized by comprising:
    a determining module, configured to determine metadata of each frame of audio data, the metadata comprising at least one of: absolute position information of a sound object in the audio data, relative position information of the sound object, orientation information of the sound object, and a sound radiation range of the sound object;
    a processing module, configured to obtain an object audio signal based on the metadata of the audio data.
  19. An audio processing apparatus, characterized by comprising:
    an obtaining module, configured to obtain an encoded signal sent by an encoding device;
    a decoding module, configured to decode the encoded signal to obtain an object audio signal;
    a determining module, configured to determine metadata of the object audio signal, the metadata comprising at least one of: absolute position information of a sound object, relative position information of the sound object, orientation information of the sound object, and a sound radiation range of the sound object;
    a rendering module, configured to render the object audio signal based on the metadata.
  20. A communication apparatus, characterized in that the apparatus comprises a processor and a memory, wherein the memory stores a computer program, and the processor executes the computer program stored in the memory to cause the apparatus to perform the method according to any one of claims 1 to 10.
  21. A communication apparatus, characterized in that the apparatus comprises a processor and a memory, wherein the memory stores a computer program, and the processor executes the computer program stored in the memory to cause the apparatus to perform the method according to any one of claims 11 to 17.
  22. A communication apparatus, characterized by comprising a processor and an interface circuit, wherein:
    the interface circuit is configured to receive code instructions and transmit them to the processor;
    the processor is configured to run the code instructions to perform the method according to any one of claims 1 to 10.
  23. A communication apparatus, characterized by comprising a processor and an interface circuit, wherein:
    the interface circuit is configured to receive code instructions and transmit them to the processor;
    the processor is configured to run the code instructions to perform the method according to any one of claims 11 to 17.
  24. A computer-readable storage medium storing instructions which, when executed, cause the method according to any one of claims 1 to 10 to be implemented.
  25. A computer-readable storage medium storing instructions which, when executed, cause the method according to any one of claims 11 to 17 to be implemented.
PCT/CN2022/091052 2022-05-05 2022-05-05 Audio processing method and apparatus, and storage medium WO2023212880A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202280001320.1A CN117581566A (en) 2022-05-05 2022-05-05 Audio processing method, device and storage medium
PCT/CN2022/091052 WO2023212880A1 (en) 2022-05-05 2022-05-05 Audio processing method and apparatus, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/091052 WO2023212880A1 (en) 2022-05-05 2022-05-05 Audio processing method and apparatus, and storage medium

Publications (1)

Publication Number Publication Date
WO2023212880A1 true WO2023212880A1 (en) 2023-11-09

Family

ID=88646108

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/091052 WO2023212880A1 (en) 2022-05-05 2022-05-05 Audio processing method and apparatus, and storage medium

Country Status (2)

Country Link
CN (1) CN117581566A (en)
WO (1) WO2023212880A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103650539A (en) * 2011-07-01 2014-03-19 杜比实验室特许公司 System and method for adaptive audio signal generation, coding and rendering
US20160133261A1 (en) * 2013-05-31 2016-05-12 Sony Corporation Encoding device and method, decoding device and method, and program
US20180192186A1 (en) * 2015-07-02 2018-07-05 Dolby Laboratories Licensing Corporation Determining azimuth and elevation angles from stereo recordings
CN113905321A (en) * 2021-09-01 2022-01-07 赛因芯微(北京)电子科技有限公司 Object-based audio channel metadata and generation method, device and storage medium

Also Published As

Publication number Publication date
CN117581566A (en) 2024-02-20

Similar Documents

Publication Publication Date Title
WO2020244495A1 (en) Screen projection display method and electronic device
JP5882964B2 (en) Audio spatialization by camera
US10051368B2 (en) Mobile apparatus and control method thereof
WO2022135527A1 (en) Video recording method and electronic device
WO2022068613A1 (en) Audio processing method and electronic device
US20230026812A1 (en) Device Positioning Method and Related Apparatus
CN113921002A (en) Equipment control method and related device
CN113596241B (en) Sound processing method and device
WO2023212880A1 (en) Audio processing method and apparatus, and storage medium
CN114598984B (en) Stereo synthesis method and system
CN105577521B (en) Good friend's group technology, apparatus and system
CN116368460A (en) Audio processing method and device
CN114040319B (en) Method, device, equipment and medium for optimizing playback quality of terminal equipment
WO2022143310A1 (en) Double-channel screen projection method and electronic device
CN114667744B (en) Real-time communication method, device and system
CN116797767A (en) Augmented reality scene sharing method and electronic device
KR20230002968A (en) Bit allocation method and apparatus for audio signal
WO2023193148A1 (en) Audio playback method/apparatus/device, and storage medium
WO2023197646A1 (en) Audio signal processing method and electronic device
US20240080406A1 (en) Video Conference Calls
CN115552518B (en) Signal encoding and decoding method and device, user equipment, network side equipment and storage medium
WO2022206643A1 (en) Method for estimating angle of arrival of signal and electronic device
WO2023202445A1 (en) Demonstration system, method, graphical interface, and related apparatus
WO2023212879A1 (en) Object audio data generation method and apparatus, electronic device, and storage medium
CN116048241A (en) Prompting method, augmented reality device and medium

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 202280001320.1

Country of ref document: CN

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22940576

Country of ref document: EP

Kind code of ref document: A1