CN115696137A - Audio rendering method, device, medium and electronic equipment


Info

Publication number: CN115696137A
Application number: CN202211297361.8A
Authority: CN (China)
Prior art keywords: audio, field, rendering, information, indicating
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Inventor: Hu Ying (胡颖)
Current Assignee: Tencent Technology Shenzhen Co Ltd
Original Assignee: Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd; priority to CN202211297361.8A

Landscapes

  • Stereophonic System (AREA)

Abstract

The present application belongs to the field of audio and video technologies, and in particular relates to an audio rendering method, an audio rendering apparatus, a computer-readable medium, an electronic device, and a computer program product. The audio rendering method comprises the following steps: acquiring quantity information of receivers, the quantity information indicating the number of receivers receiving the audio data; when there are multiple receivers, configuring, in metadata according to the quantity information, multiple receiver fields corresponding respectively to the multiple receivers, the receiver fields comprising feature fields related to the position information of the receivers; and sending the metadata to an audio renderer, the audio renderer being configured to render the audio data differentially for the receivers according to their respective receiver fields. Embodiments of the present application can meet the requirements of diversified audio rendering scenarios.

Description

Audio rendering method, device, medium and electronic equipment
Technical Field
The present application belongs to the field of audio and video technologies, and in particular, to an audio rendering method, an audio rendering apparatus, a computer-readable medium, an electronic device, and a computer program product.
Background
Playing audio for a user in different scenes usually calls for different audio playback effects. For example, for the same audio source, a user should obtain different auditory effects at different positions, in different postures, or when using different playback devices. However, related audio rendering technologies cannot meet such diversified audio playback requirements in complex scenes.
Disclosure of Invention
The application provides an audio rendering method, an audio rendering device, a computer readable medium, an electronic device and a computer program product, aiming at improving the diversified rendering effect of audio.
Other features and advantages of the present application will be apparent from the following detailed description, or may be learned by practice of the application.
According to an aspect of an embodiment of the present application, there is provided an audio rendering method including:
acquiring quantity information of receivers, wherein the quantity information is used for indicating the quantity of the receivers receiving the audio data;
when the number of the receivers is multiple, configuring multiple receiver fields respectively corresponding to the multiple receivers in metadata according to the quantity information, wherein the receiver fields comprise feature fields related to the position information of the receivers;
and sending the metadata to an audio renderer, wherein the audio renderer is used for differentially rendering the audio data for the receivers according to the fields of the receivers respectively.
According to an aspect of an embodiment of the present application, there is provided an audio rendering apparatus including:
a first obtaining module configured to obtain quantity information of recipients, the quantity information indicating the number of recipients receiving the audio data;
a first assignment module configured to configure, when the number of the recipients is multiple, multiple recipient fields corresponding to the multiple recipients, respectively, in metadata according to the quantity information, the recipient fields including a feature field related to location information of the recipients;
a first sending module configured to send the metadata to an audio renderer for differentially rendering the audio data for the plurality of recipients according to the plurality of recipient fields, respectively.
According to an aspect of an embodiment of the present application, there is provided an audio rendering method including:
extracting a receiver field corresponding to a receiver receiving the audio data from the metadata, the receiver field including a feature field related to location information of the receiver;
and if the number of the receivers is multiple, differentially rendering the audio data for the receivers according to the fields of the receivers respectively.
According to an aspect of an embodiment of the present application, there is provided an audio rendering apparatus including:
a first extraction module configured to extract a receiver field corresponding to a receiver receiving audio data from metadata, the receiver field including a feature field related to location information of the receiver;
a first rendering module configured to render the audio data differentially for the plurality of recipients according to the plurality of recipient fields, respectively, if the number of recipients is multiple.
According to an aspect of an embodiment of the present application, there is provided an audio rendering method including:
acquiring equipment information of a receiver, wherein the equipment information is used for indicating audio playing equipment used by the receiver when receiving audio data;
assigning values to equipment information fields in the metadata according to the equipment information, wherein the equipment information fields are used for indicating the characteristics of the audio playing equipment influencing the audio data rendering effect;
assigning a value to a corresponding device identification field in the metadata according to the device information field, wherein the corresponding device identification field is a sub-element of a rendering information field, and the rendering information field is an element in the metadata for indicating the rendering effect of the audio data;
and sending the metadata to an audio renderer, wherein the audio renderer is used for rendering the audio data for the audio playing equipment according to the equipment information field.
According to an aspect of an embodiment of the present application, there is provided an audio rendering apparatus including:
the second acquisition module is configured to acquire equipment information of a receiving party, wherein the equipment information is used for indicating an audio playing device used by the receiving party when receiving audio data;
a second assignment module configured to assign, according to the device information, a device information field in metadata, where the device information field is used to indicate a feature of the audio playback device that affects an audio data rendering effect; assigning a value to a corresponding device identification field in the metadata according to the device information field, wherein the corresponding device identification field is a sub-element of a rendering information field, and the rendering information field is an element in the metadata for indicating the rendering effect of the audio data;
a second sending module configured to send the metadata to an audio renderer, where the audio renderer is configured to render the audio data for the audio playback device according to the device information field.
According to an aspect of an embodiment of the present application, there is provided an audio rendering method including:
extracting a corresponding device identification field from a rendering information field of metadata, wherein the rendering information field is an element of the metadata for indicating an audio data rendering effect;
extracting a device information field according to the corresponding device identification field, wherein the device information field is used for indicating the characteristics of the audio playing device influencing the audio data rendering effect;
and rendering the audio data for the audio playing equipment according to the equipment information field.
According to an aspect of an embodiment of the present application, there is provided an audio rendering apparatus including:
a second extraction module configured to extract a corresponding device identification field from a rendering information field of metadata, the rendering information field being an element of the metadata for indicating an audio data rendering effect; extracting a device information field according to the corresponding device identification field, wherein the device information field is used for indicating the characteristics of the audio playing device influencing the audio data rendering effect;
a second rendering module configured to render the audio data for the audio playback device according to the device information field.
According to an aspect of an embodiment of the present application, there is provided an audio rendering method including:
acquiring rendering type information of audio data, wherein the rendering type information is used for indicating a rendering type corresponding to each audio component in the audio data;
assigning a value to an audio component information field in the metadata according to the rendering type information, the audio component information field being used to indicate an audio component having a corresponding rendering type to be processed by an audio renderer;
sending the metadata to the audio renderer, the audio renderer being configured to render the audio component according to the rendering type.
According to an aspect of an embodiment of the present application, there is provided an audio rendering apparatus including:
the third obtaining module is configured to obtain rendering type information of the audio data, wherein the rendering type information is used for indicating a rendering type corresponding to each audio component in the audio data;
a third assignment module configured to assign an audio component information field in the metadata according to the rendering type information, the audio component information field indicating an audio component having a corresponding rendering type to be processed by the audio renderer;
a third sending module configured to send the metadata to the audio renderer for rendering the audio component according to the rendering type.
According to an aspect of an embodiment of the present application, there is provided an audio rendering method including:
extracting an audio component information field from the metadata, the audio component information field indicating an audio component having a corresponding rendering type processed by the audio renderer;
determining rendering types corresponding to all audio components in the audio data according to the audio component information fields;
rendering the audio component according to the rendering type.
According to an aspect of an embodiment of the present application, there is provided an audio rendering apparatus including:
a third extraction module configured to extract an audio component information field from the metadata, the audio component information field indicating an audio component having a corresponding rendering type processed by the audio renderer;
the third rendering module is configured to determine rendering types corresponding to the audio components in the audio data according to the audio component information fields; rendering the audio component according to the rendering type.
According to an aspect of an embodiment of the present application, there is provided an audio rendering method including:
acquiring attribute information of an occluding object located in the rendering scene of the audio data;
assigning a value to an occluding object information field in the metadata according to the attribute information of the occluding object, wherein the occluding object information field is used for indicating characteristics of the occluding object that produce an audio occlusion effect;
and sending the metadata to an audio renderer, wherein the audio renderer is used for rendering the audio data with the audio occlusion effect according to the occluding object information field.
According to an aspect of an embodiment of the present application, there is provided an audio rendering apparatus including:
a fourth obtaining module configured to obtain attribute information of an occluding object located in the rendering scene of the audio data;
a fourth assignment module configured to assign a value to an occluding object information field in the metadata according to the attribute information of the occluding object, the occluding object information field being used to indicate characteristics of the occluding object that produce an audio occlusion effect;
a fourth sending module configured to send the metadata to the audio renderer, the audio renderer being used for rendering the audio data with the audio occlusion effect according to the occluding object information field.
According to an aspect of an embodiment of the present application, there is provided an audio rendering method including:
extracting an occluding object information field from the metadata, the occluding object information field indicating characteristics of an occluding object that produces an audio occlusion effect;
and rendering the audio data with the audio occlusion effect according to the occluding object information field.
According to an aspect of an embodiment of the present application, there is provided an audio rendering apparatus including:
a fourth extraction module configured to extract, from the metadata, an occluding object information field indicating characteristics of an occluding object that produces an audio occlusion effect;
a fourth rendering module configured to render the audio data with the audio occlusion effect according to the occluding object information field.
According to an aspect of an embodiment of the present application, there is provided an audio rendering method including:
acquiring scene configuration information of a receiver, wherein the scene configuration information is configuration information influencing the rendering effect of the audio data in the rendering scene;
assigning a value to a feature field of the metadata according to the scene configuration information, wherein the feature field comprises at least one of a receiver field, a device information field, an audio component information field, or an occluding object information field; the receiver field includes a field related to position information of the receiver, the device information field is used for indicating characteristics of an audio playing device that influence the audio data rendering effect, the audio component information field is used for indicating an audio component with a corresponding rendering type to be processed by an audio renderer, and the occluding object information field is used for indicating characteristics of an occluding object that produces an audio occlusion effect;
and sending the metadata to an audio renderer, wherein the audio renderer is used for rendering the audio data according to the characteristic field.
According to an aspect of an embodiment of the present application, there is provided an audio rendering apparatus including:
a fifth obtaining module configured to obtain scene configuration information of a receiver, where the scene configuration information is configuration information that affects the audio data rendering effect in the rendering scene;
a fifth assignment module configured to assign a value to a feature field of the metadata according to the scene configuration information, where the feature field includes at least one of a receiver field, a device information field, an audio component information field, or an occluding object information field; the receiver field includes a field related to position information of the receiver, the device information field is used for indicating characteristics of an audio playing device that influence the audio data rendering effect, the audio component information field is used for indicating an audio component with a corresponding rendering type to be processed by an audio renderer, and the occluding object information field is used for indicating characteristics of an occluding object that produces an audio occlusion effect;
a fifth sending module configured to send the metadata to an audio renderer for rendering the audio data according to the feature field.
According to an aspect of an embodiment of the present application, there is provided an audio rendering method including:
extracting a feature field from the metadata, wherein the feature field comprises at least one of a receiver field, a device information field, an audio component information field, or an occluding object information field; the receiver field includes a field related to position information of the receiver, the device information field is used for indicating characteristics of an audio playing device that influence the audio data rendering effect, the audio component information field is used for indicating an audio component with a corresponding rendering type to be processed by an audio renderer, and the occluding object information field is used for indicating characteristics of an occluding object that produces an audio occlusion effect;
and rendering the audio data according to the characteristic field.
According to an aspect of an embodiment of the present application, there is provided an audio rendering apparatus including:
a fifth extraction module configured to extract a feature field from the metadata, the feature field including at least one of a receiver field, a device information field, an audio component information field, or an occluding object information field; the receiver field includes a field related to position information of the receiver, the device information field is used for indicating characteristics of an audio playing device that influence the audio data rendering effect, the audio component information field is used for indicating an audio component with a corresponding rendering type to be processed by an audio renderer, and the occluding object information field is used for indicating characteristics of an occluding object that produces an audio occlusion effect;
a fifth rendering module configured to render audio data according to the feature field.
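For illustration only, the combined feature field described in this aspect might be organized as in the following Python sketch; it is not taken from the patent, and all key names and values are hypothetical.

```python
# Hypothetical layout of the scene-configuration feature field: any subset
# of the four optional field types may be present.
feature_field = {
    "listener": {"listener_id": 1, "position": (0.0, 0.0, 0.0)},   # receiver field
    "device_info": {"device_type": "headphones"},                  # playback device traits
    "audio_component_info": {"component": "dialogue",
                             "rendering_type": "binaural"},        # per-component rendering type
    "occluding_object_info": {"shape": "box",
                              "absorption": 0.6},                  # audio occlusion traits
}
```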
According to an aspect of embodiments of the present application, there is provided a computer-readable medium on which a computer program is stored, the computer program, when executed by a processor, implementing an audio rendering method as in the above technical solution.
According to an aspect of an embodiment of the present application, there is provided an electronic apparatus including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the audio rendering method as in the above solution via execution of the executable instructions.
According to an aspect of embodiments of the present application, there is provided a computer program product comprising a computer program which, when executed by a processor, is adapted to perform the audio rendering method according to the above technical solution.
In the technical solution provided by the embodiments of the present application, multiple receiver fields corresponding to multiple receivers are configured in the metadata, so that personalized audio rendering can be performed for the multiple receivers according to their respective receiver fields, and the receivers can obtain different auditory effects from the same audio data, thereby meeting diversified audio rendering scene requirements.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and, together with the description, serve to explain the principles of the application. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
Fig. 1 schematically shows a schematic diagram of an exemplary system architecture to which the technical solution of the embodiments of the present application may be applied.
Fig. 2 schematically shows the placement of an audio-visual encoding device and an audio-visual decoding device in a streaming environment.
Fig. 3 schematically illustrates a system framework for implementing audio content expression in a virtual reality application scenario.
Fig. 4 is a flowchart illustrating steps of a method for audio rendering based on the number of receiving parties, performed by an audio capturing end in an embodiment of the present application.
Fig. 5 is a flowchart illustrating steps of audio rendering based on recipient degrees of freedom in one embodiment of the present application.
Fig. 6 is a flowchart illustrating steps of audio rendering based on a recipient identifier in one embodiment of the present application.
Fig. 7 is a flowchart illustrating steps of audio rendering based on coordinate system type in one embodiment of the present application.
Fig. 8 is a flowchart illustrating steps of audio rendering based on a co-located recipient in one embodiment of the present application.
Fig. 9 is a flowchart illustrating steps of a method for audio rendering based on the number of recipients performed by an audio sink according to an embodiment of the present application.
Fig. 10 is a flowchart illustrating steps of a method for audio rendering based on device information performed by an audio capturing end according to an embodiment of the present application.
Fig. 11 is a flowchart illustrating steps of a method for audio rendering based on device information performed by an audio sink according to an embodiment of the present application.
Fig. 12 is a flowchart illustrating steps of a method for audio rendering based on rendering type performed by an audio capturing end in an embodiment of the present application.
Fig. 13 is a flowchart illustrating steps of a method for audio rendering based on rendering type performed by an audio sink according to an embodiment of the present application.
Fig. 14 is a flowchart illustrating steps of a method performed by an audio capture end for audio rendering based on occlusion object information according to an embodiment of the present application.
Fig. 15 is a flowchart illustrating steps of a method performed by an audio capture end for audio rendering based on occlusion parameters according to an embodiment of the present application.
Fig. 16 is a flowchart illustrating steps of a method for audio rendering based on information of an occlusion object performed by an audio receiving end in an embodiment of the present application.
Fig. 17 is a flowchart illustrating steps of a method performed by an audio capturing end for audio rendering based on scene configuration information according to an embodiment of the present application.
Fig. 18 is a flowchart illustrating steps of a method for audio rendering based on scene configuration information performed by an audio sink according to an embodiment of the present application.
Fig. 19 schematically illustrates a block diagram of a computer system suitable for implementing an electronic device of an embodiment of the present application.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the subject matter of the present application can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the application.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
The embodiments of the present application involve data related to a user's location information, posture information, device information, and the like. When the embodiments of the present application are applied to specific products or technologies, user permission or consent needs to be obtained, and the collection, use, and processing of the related data need to comply with the relevant laws, regulations, and standards of the relevant countries and regions.
The related terms or abbreviations referred to in the embodiments of the present application are explained as follows.
Panoramic audio: omnidirectional (720-degree) audio in three-dimensional space, which gives the listener an immersive auditory experience.
Immersive audio: audio in three-dimensional space that creates a sense of immersion; the term is often used in immersive communication and in immersive systems as a way of expressing panoramic or three-dimensional audio.
Virtual reality audio: the virtual presentation of audio in reality; a rendering technology is used to play an audio signal back in virtual form over headphones (binaural) or loudspeakers, so as to achieve the perception of audio in a real space. It is commonly used in virtual reality devices and systems as a way of expressing three-dimensional audio.
Metadata: describes feature information related to the rendering scene and the audio content.
ADM (Audio Definition Model): an audio metadata standard for describing the components of an audio file.
HRTF (Head Related Transfer Function): frequency domain acoustic transfer function from the sound source to both ears in the free field case.
HRIR (Head Related Impulse Response): the impulse response from the sound source to both ears in the free-field case; an equivalent representation of the HRTF in the time domain.
HOA (Higher Order Ambisonics): a higher-order spherical harmonic signal.
DoF (Degree of Freedom): refers to the motions supported while a user views immersive media, which create the degrees of freedom for content interaction.
3DoF: three degrees of freedom, i.e., rotation of the user's head around the x, y, and z axes.
3DoF+: on the basis of the three rotational degrees of freedom, the user additionally has limited translational movement along the x, y, and z axes.
6DoF: on the basis of the three rotational degrees of freedom, the user additionally has free translational movement along the x, y, and z axes.
AVS: audio Video Coding Standard, chinese national Video Coding Standard AVS.
The system architecture applying the technical solution of the present application in different application scenarios is described below with reference to fig. 1 to 3. Fig. 1 shows a system architecture for performing audio/video interactive transmission among a plurality of terminal devices, for example, a scene related to an audio/video conference, an audio/video call, and the like; fig. 2 shows a system architecture for performing audio/video streaming transmission from a collecting end to a receiving end, for example, a scene related to live webcast, live television broadcast, and the like. Fig. 3 shows a system architecture for decoupled transmission of audio signals and metadata, for example relating to a virtual reality application scenario.
Fig. 1 schematically shows a schematic diagram of an exemplary system architecture to which the technical solution of the embodiments of the present application may be applied.
As shown in fig. 1, the system architecture 100 includes a plurality of end devices that may communicate with each other over, for example, a network 150. For example, the system architecture 100 may include a first end device 110 and a second end device 120 interconnected by a network 150. In the embodiment of fig. 1, the first terminal device 110 and the second terminal device 120 perform unidirectional data transmission.
For example, the first terminal device 110 may encode audio and video data (e.g., audio and video data streams collected by the terminal device 110) for transmission to the second terminal device 120 via the network 150, the encoded audio and video data being transmitted in one or more encoded audio and video streams, and the second terminal device 120 may receive the encoded audio and video data from the network 150, decode the encoded audio and video data to recover the audio and video data, and play or display content according to the recovered audio and video data.
In one embodiment of the present application, the system architecture 100 may include a third end device 130 and a fourth end device 140 that perform bi-directional transmission of encoded audiovisual data, such as may occur during an audiovisual conference. For bi-directional data transmission, each of the third and fourth end devices 130, 140 may encode audio-visual data (e.g., audio-visual data streams collected by the end devices) for transmission over the network 150 to the other of the third and fourth end devices 130, 140. Each of the third terminal device 130 and the fourth terminal device 140 may further receive encoded audio/video data transmitted by the other of the third terminal device 130 and the fourth terminal device 140, decode the encoded audio/video data to recover the audio/video data, and play or display content according to the recovered audio/video data.
In the embodiment of fig. 1, the first terminal device 110, the second terminal device 120, the third terminal device 130, and the fourth terminal device 140 may each be a server, a personal computer, or a smartphone, but the principles disclosed herein are not limited thereto. Embodiments disclosed herein are applicable to laptop computers, tablet computers, media players, and/or dedicated audio/video conferencing equipment. Network 150 represents any number of networks that communicate encoded audio/video data between the first terminal device 110, the second terminal device 120, the third terminal device 130, and the fourth terminal device 140, including, for example, wired and/or wireless communication networks. The communication network 150 may exchange data over circuit-switched and/or packet-switched channels. The network may include a telecommunications network, a local area network, a wide area network, and/or the internet. For the purposes of this application, the architecture and topology of the network 150 may be immaterial to the operation of the present disclosure, unless explained below.
In one embodiment of the present application, fig. 2 schematically illustrates the placement of an audio-video encoding device and an audio-video decoding device in a streaming environment. The subject matter disclosed herein is equally applicable to other audio-video enabled applications including, for example, audio-video conferencing, digital TV (television), storing compressed audio-video on digital media including CDs, DVDs, memory sticks, and the like.
The streaming system may include an acquisition subsystem 213, and the acquisition subsystem 213 may include an audio/video source 201, such as a microphone or a camera, that creates an uncompressed audio/video data stream 202. The audio/video data stream 202 is depicted as a bold line to emphasize its high data volume compared to the encoded audio/video data 204 (or encoded audio/video code stream 204). The audio/video data stream 202 can be processed by an electronic device 220, which comprises an audio/video encoding device 203 coupled to the audio/video source 201. The audio/video encoding device 203 may include hardware, software, or a combination thereof to implement or embody aspects of the disclosed subject matter as described in greater detail below. The encoded audio/video data 204 (or encoded audio/video code stream 204) is depicted as a thin line to emphasize its lower data volume, and may be stored on the streaming server 205 for future use. One or more streaming client subsystems, such as client subsystem 206 and client subsystem 208 in fig. 2, may access the streaming server 205 to retrieve copies 207 and 209 of the encoded audio/video data 204. The client subsystem 206 may include, for example, an audio/video decoding device 210 in an electronic device 230. The audio/video decoding device 210 decodes the incoming copy 207 of the encoded audio/video data and generates an output audio/video data stream 211 that may be presented on an output 212 (e.g., a speaker or display) or another presentation device. In some streaming systems, the encoded audio/video data 204 and its copies 207 and 209 (e.g., audio/video code streams) may be encoded according to an audio/video encoding/compression standard.
It should be noted that electronic devices 220 and 230 may include other components not shown in the figures. For example, the electronic device 220 may include an audiovisual decoding device, and the electronic device 230 may also include an audiovisual encoding device.
Fig. 3 schematically illustrates a system framework for implementing audio content expression in a virtual reality application scenario.
The virtual reality audio content expression broadly relates to metadata, a renderer, an audio encoder and an audio decoder, and the embodiment of the present application may adopt a manner in which the metadata, the renderer, the encoder and the decoder are logically separated from each other. When the method is used for local storage and production, only the renderer is needed to analyze the metadata, and the audio encoding and decoding process is not involved; when used for transmission (e.g., live or two-way communication), the transmission format of the metadata and audio streams needs to be defined.
As shown in fig. 3, in the virtual reality audio content expression framework, the acquisition end takes as input audio signals comprising channels, objects, HOA, or mixtures thereof, and generates metadata information according to the metadata definitions. Dynamic metadata may be transmitted along with the audio stream after encoding, and the specific encapsulation format is defined according to the transport protocol type adopted by the system layer. At the playback end, the renderer renders and outputs the decoded audio file according to the decoded metadata. The metadata and the audio codec are logically independent of each other, and the decoder and the renderer are decoupled. The renderers adopt a registration system (ID1: renderer based on binaural output; ID2: renderer based on speaker output; ID3: other mode; ID4: other mode), and every registered renderer supports the same set of metadata definitions.
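For illustration, the renderer registration system described above can be sketched as follows in Python; this sketch is not part of the patent, and the function and variable names are hypothetical.

```python
from typing import Callable, Dict

# Registry mapping renderer IDs to renderer implementations.
RENDERER_REGISTRY: Dict[int, Callable] = {}

def register_renderer(renderer_id: int):
    """Register a renderer under a numeric ID; every registered renderer
    must support the same set of metadata definitions."""
    def decorator(render_fn: Callable) -> Callable:
        RENDERER_REGISTRY[renderer_id] = render_fn
        return render_fn
    return decorator

@register_renderer(1)  # ID1: renderer based on binaural output
def render_binaural(audio, metadata):
    ...

@register_renderer(2)  # ID2: renderer based on speaker output
def render_speakers(audio, metadata):
    ...
```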
Taking the scene shown in fig. 3 as an example, executing the audio rendering method proposed by the embodiments of the present application may include the following steps (a code sketch follows the list).
1. The server generates a corresponding audio metadata file according to the position, device information, audio content, and other information fed back by the user.
2. The server sends the audio metadata file to the audio renderer.
a) If the renderer is a cloud renderer, the renderer is located on the server side.
b) If the renderer is a local renderer of the user, the renderer is located on the client side.
3. Based on the metadata file, the audio renderer extracts the position information and the device information as rendering-related input parameters, and computes the rendering effect for the corresponding user and device. It also extracts the object-related position information and the occlusion-algorithm parameters, and renders the corresponding sound occlusion effect.
4. The audio is presented on the corresponding device of the corresponding user.
a) If the renderer is a cloud renderer, the server sends the rendered audio to the client, and the client performs the presentation.
b) If the renderer is a local renderer of the user, the client presents the audio directly.
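The four steps above can be summarized in the following sketch, assuming hypothetical server/client objects and helper methods; where the rendering happens (cloud or local) decides who presents the result.

```python
def deliver_audio(server, client, audio_data, user_feedback):
    # Step 1: the server builds the audio metadata file from user feedback
    # (position, device information, audio content, ...).
    metadata = server.build_metadata(user_feedback)
    # Step 2: the metadata file goes to the renderer, which lives server-side
    # (cloud renderer) or client-side (local renderer).
    renderer = server.cloud_renderer or client.local_renderer
    # Step 3: the renderer extracts position/device fields plus occlusion
    # parameters from the metadata and computes the rendering effect.
    rendered = renderer.render(audio_data, metadata)
    # Step 4: presentation on the corresponding user's device.
    if renderer is server.cloud_renderer:
        client.present(server.push(rendered))  # 4a: cloud renders, client presents
    else:
        client.present(rendered)               # 4b: client renders and presents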
The following describes in detail the audio rendering methods involving a multi-user scene in the embodiments of the present application with reference to fig. 4 to 10. Figs. 4 to 8 show audio rendering methods involving a multi-user scene executed at an audio acquisition end, where the audio acquisition end may be, for example, the terminal device shown in fig. 1, the acquisition subsystem shown in fig. 2, or the acquisition end shown in fig. 3. Fig. 9 shows an audio rendering method involving a multi-user scene executed at an audio receiving end, where the audio receiving end may be, for example, the terminal device shown in fig. 1, the client subsystem shown in fig. 2, or the registered renderer shown in fig. 3.
Fig. 4 is a flowchart illustrating steps of a method for audio rendering based on the number of receiving parties, performed by an audio capturing end in an embodiment of the present application. As shown in fig. 4, the audio rendering method involving a multi-user scene performed at the audio capturing end includes the following steps S410 to S430.
S410: acquiring quantity information of the receivers, wherein the quantity information is used for indicating the quantity of the receivers receiving the audio data;
s420: when the number of the receivers is multiple, configuring multiple receiver fields respectively corresponding to the multiple receivers in the metadata according to the number information, wherein the receiver fields comprise characteristic fields related to the position information of the receivers;
s430: and sending the metadata to an audio renderer, wherein the audio renderer is used for differentially rendering the audio data for the plurality of receivers according to the plurality of receiver fields respectively.
According to the embodiment of the application, the plurality of receiver fields corresponding to the plurality of receivers are configured in the metadata, and personalized audio rendering can be respectively carried out on the plurality of receivers according to the plurality of receiver fields, so that the plurality of receivers can obtain different auditory effects from the same audio data, and diversified audio rendering scene requirements are met.
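As a rough sketch of steps S410 to S430 on the capture side, assuming a simple dict-based layout (the actual metadata is XML-like extension metadata, and the key names here are illustrative):

```python
def build_multi_listener_metadata(receivers):
    # S420: configure one receiver field per receiver in the metadata,
    # each carrying a position-related feature field.
    return {
        "multi_listener": 1 if len(receivers) > 1 else 0,
        "listeners": [
            {"listener_id": r["id"], "position": r["position"]}
            for r in receivers
        ],
    }

# S410: the quantity information is simply how many receivers will receive the audio.
metadata = build_multi_listener_metadata([
    {"id": 1, "position": (0.0, 0.0, 0.0)},
    {"id": 2, "position": (2.0, 0.0, 0.0)},
])
# S430: `metadata` would now be sent to the audio renderer, which renders
# the audio data differentially for each listed receiver.
```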
In an embodiment of the application, after the quantity information of the recipients is obtained, a value may be further assigned to a multi-recipient flag field in the metadata according to the quantity information, where the multi-recipient flag field is used to indicate that the number of the recipients is one or more.
In one embodiment of the present application, the multi-recipient flag field may be an element in the root metadata. As shown in table 1, a multi-receiver flag multi_listener exists in the VR extension root metadata implemented by applying the embodiments of the present application, and whether audio rendering is performed differently according to the number of receivers can be determined based on the value of this field. For example, when the multi-receiver flag multi_listener takes the value 1, the current renderer needs to perform personalized audio rendering for multiple receivers; when it takes the value 0, the current renderer does not need to perform personalized audio rendering for multiple receivers.
Table 1. Relevant specifications for the VR extension root metadata, including the multi-receiver flag field.
[Table shown as an image in the original publication; contents not reproduced here.]
In one embodiment of the application, the metadata includes a presentation information field, the presentation information field being an element related to the content of the audio data, and the multi-recipient flag field being an attribute or a sub-element of the presentation information field. As shown in table 1, a presentation information field presenceInfo, which is extended metadata related to the content of audio data, exists in the metadata.
In one embodiment of the present application, the multi-recipient flag field is an attribute of the presentation information field. As shown in table 2, a multi-receiver flag multi_listener exists among the attributes of the presentation information field presenceInfo, and whether audio rendering is performed differently according to the number of receivers can be determined based on the value of this field. For example, when the multi-receiver flag multi_listener takes the value 1, the current renderer needs to perform personalized audio rendering for multiple receivers; when it takes the value 0, the current renderer does not need to perform personalized audio rendering for multiple receivers.
Table 2. Relevant specifications for the attributes of the presenceInfo field.
[Table shown as an image in the original publication; contents not reproduced here.]
In one embodiment of the present application, the multi-recipient flag field is a sub-element of the presentation information field. As shown in table 3, a multi-receiver flag multi_listener exists among the sub-elements of the presentation information field presenceInfo, and whether audio rendering is performed differently according to the number of receivers can be determined based on the value of this field. For example, when the multi-receiver flag multi_listener takes the value 1, the current renderer needs to perform personalized audio rendering for multiple receivers; when it takes the value 0, the current renderer does not need to perform personalized audio rendering for multiple receivers.
Table 3. Relevant specifications for the sub-elements of the presenceInfo field.
[Table shown as an image in the original publication; contents not reproduced here.]
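The two placements described in tables 2 and 3 can be sketched with Python's xml.etree.ElementTree as follows; the element and attribute names follow the field names above, but the full extension schema is not reproduced in this text.

```python
import xml.etree.ElementTree as ET

# Variant 1 (table 2): multi_listener as an attribute of presenceInfo.
presence_attr = ET.Element("presenceInfo", attrib={"multi_listener": "1"})

# Variant 2 (table 3): multi_listener as a sub-element of presenceInfo.
presence_sub = ET.Element("presenceInfo")
ET.SubElement(presence_sub, "multi_listener").text = "1"  # 1: per-receiver rendering needed

print(ET.tostring(presence_attr), ET.tostring(presence_sub))
```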
In one embodiment of the application, a value may be assigned to the recipient degree-of-freedom field in the metadata before the metadata is sent to the audio renderer, so that the motions supported by the corresponding recipient when viewing immersive media, and the resulting degrees of freedom for content interaction, can be determined based on the recipient degree-of-freedom field.
FIG. 5 is a flow chart illustrating steps of audio rendering based on recipient degrees of freedom in one embodiment of the application. As shown in fig. 5, the method for audio rendering based on metadata carrying a receiver degree of freedom field includes the following steps S510 to S550.
S510: acquiring quantity information of the receivers, wherein the quantity information is used for indicating the quantity of the receivers receiving the audio data;
s520: when the number of the receivers is multiple, configuring multiple receiver fields respectively corresponding to the multiple receivers in the metadata according to the number information, wherein the receiver fields comprise characteristic fields related to the position information of the receivers;
s530: acquiring the degree of freedom information of a receiver, wherein the degree of freedom information is used for indicating the degree of freedom of the receiver when receiving audio;
s540: assigning a value to a receiver degree of freedom field in the metadata according to the degree of freedom information, wherein the receiver degree of freedom field is used for indicating that a receiver has three degrees of freedom or six degrees of freedom;
s550: and sending the metadata to an audio renderer, wherein the audio renderer is used for rendering the audio data for the plurality of receivers differentially according to the plurality of receiver fields and the receiver degree of freedom fields.
According to the embodiment of the application, the receiver freedom degree field is configured in the metadata, and the receiver freedom degree field can be used for marking the freedom degrees of different receivers, so that personalized audio rendering is performed on the receivers according to the freedom degrees of the different receivers, and the auditory effect meeting the requirement of the freedom degrees is obtained.
In one embodiment of the present application, the receiver degree-of-freedom field may be a sub-element of the receiver field; that is, in the receiver field corresponding to each receiver, the degree of freedom of that receiver may be indicated through the receiver degree-of-freedom field, so that audio data corresponding to different degrees of freedom is rendered for different receivers. Alternatively, the receiver degree-of-freedom field may be an element in the root metadata, indicating that multiple receivers have the same degree of freedom.
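A minimal sketch of branching on this field, using the values assigned to listener_DoF in table 4 below (0 for three degrees of freedom, 1 for six):

```python
def dof_for_listener(listener_field: dict) -> str:
    # listener_DoF = 1: head rotation around x/y/z plus free translation (6DoF);
    # listener_DoF = 0: head rotation around x/y/z only (3DoF).
    return "6DoF" if listener_field.get("listener_DoF", 0) == 1 else "3DoF"
```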
In one embodiment of the application, a value may be assigned to the recipient identifier field in the metadata prior to sending the metadata to the audio renderer, thereby distinguishing between different recipients receiving the audio data based on the recipient identifier field.
FIG. 6 is a flow diagram illustrating steps for audio rendering based on a recipient identifier in one embodiment of the application. As shown in fig. 6, the method for audio rendering based on metadata carrying a recipient identifier field includes the following steps S610 to S640.
S610: acquiring quantity information of the receivers, wherein the quantity information is used for indicating the quantity of the receivers receiving the audio data;
s620: when the number of the receivers is multiple, configuring multiple receiver fields respectively corresponding to the multiple receivers in the metadata according to the number information, wherein the receiver fields comprise characteristic fields related to the position information of the receivers;
s630: when the number of the receivers is multiple, assigning values to receiver identifier fields corresponding to the receivers in the metadata, wherein the receiver identifier fields are used for distinguishing different receivers;
s640: the metadata is sent to an audio renderer, which is configured to render audio data for the plurality of recipients differentially according to the plurality of recipient fields and the recipient identifier field, respectively.
According to the embodiment of the application, the receiver identifier field is configured in the metadata, so that different receivers can be identified by using the receiver identifier field, and therefore personalized audio rendering is performed for different receivers, and diversified auditory effects of different receivers under respective scene requirements are met.
In one embodiment of the present application, the receiver identifier field is a sub-element in the receiver field, that is, in the receiver field corresponding to each receiver, the identification that the receiver is distinguished from other receivers can be indicated by the receiver identifier field.
In one embodiment of the application, an assignment may be made to the coordinate system type field in the metadata prior to sending the metadata to the audio renderer, thereby determining the exact location of the different recipients receiving the audio data based on the coordinate system type field.
FIG. 7 is a flowchart illustrating the steps of audio rendering based on coordinate system type in one embodiment of the present application. As shown in fig. 7, the method for audio rendering based on metadata carrying a coordinate system type field includes the following steps S710 to S750.
S710: acquiring quantity information of the receivers, wherein the quantity information is used for indicating the quantity of the receivers receiving the audio data;
s720: when the number of the receivers is multiple, configuring multiple receiver fields respectively corresponding to the multiple receivers in the metadata according to the number information, wherein the receiver fields comprise characteristic fields related to the position information of the receivers;
s730: when the number of the receivers is multiple, respectively acquiring the position information of each receiver, wherein the position information is used for indicating the position coordinates of the receivers in the used coordinate system;
s740: assigning values to coordinate system type fields corresponding to all receivers in the metadata according to the position information, wherein the coordinate system type fields are used for indicating that the type of the coordinate system is a local coordinate system or a world coordinate system;
s750: and sending the metadata to an audio renderer, wherein the audio renderer is used for rendering the audio data for the plurality of receivers differentially according to the plurality of receiver fields and the coordinate system type field.
According to the embodiment of the application, the coordinate system type field is configured in the metadata, and the coordinate system type field can be utilized to identify the coordinate system type used by different receivers, so that the accurate position of each receiver is determined, and the receivers are subjected to personalized audio rendering according to the accurate positions, and the diversified auditory effects of the different receivers under the respective scene requirements are met.
In one embodiment of the present application, the coordinate system type field may be a sub-element of the receiver field; that is, in the receiver field corresponding to each receiver, the coordinate system type used by that receiver may be indicated through the coordinate system type field, so that audio data corresponding to different coordinate system types is rendered for different receivers. Alternatively, the coordinate system type field may be an element in the root metadata, indicating that the same coordinate system type is used by multiple receivers.
In one embodiment of the present application, when the coordinate system is a local coordinate system, an origin offset field in the metadata is assigned according to the location information, the origin offset field indicating an offset of a coordinate origin of the local coordinate system in the world coordinate system.
In one embodiment of the present application, the origin offset field includes offset component fields corresponding to respective coordinate axes, the offset component fields indicating components of the offset on the respective coordinate axes.
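A worked sketch of the conversion this enables, assuming the natural additive interpretation (the patent defines the offset fields; treating the world position as the local position plus the origin offset is an assumption):

```python
def local_to_world(local_pos, origin_offset):
    # origin_offset carries the origin_offset_x/_y/_z components: the offset
    # of the receiver's local origin within the world coordinate system.
    x, y, z = local_pos
    ox, oy, oz = origin_offset
    return (x + ox, y + oy, z + oz)

# e.g. a receiver at (1.0, 0.0, 2.0) in a local frame whose origin lies at
# (3.0, 4.0, 0.0) in world coordinates sits at (4.0, 4.0, 2.0) in world space.
```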
In an embodiment of the present application, after the position information of each receiver is obtained, a value may be assigned to the co-located receiver identifier field in the metadata according to whether the positions of the multiple receivers are the same. Based on this field, the same position information can be shared by co-located receivers, which reduces duplicated field data and the consumption of storage, computing, and network resources.
FIG. 8 is a flow chart illustrating the steps of audio rendering based on a co-located recipient in one embodiment of the present application. As shown in fig. 8, the method for audio rendering based on metadata carrying the co-location recipient identifier field includes the following steps S810 to S860.
S810: acquiring quantity information of the receivers, wherein the quantity information is used for indicating the quantity of the receivers receiving the audio data;
s820: when the number of the receivers is multiple, configuring multiple receiver fields respectively corresponding to the multiple receivers in the metadata according to the number information, wherein the receiver fields comprise characteristic fields related to the position information of the receivers;
s830: when the number of the receivers is multiple, respectively acquiring the position information of each receiver, wherein the position information is used for indicating the position coordinates of the receivers in the used coordinate system;
s840: determining whether all receivers are at the same position according to the position information;
s850: and if the first receiver and the second receiver are at the same position, assigning a position field corresponding to the first receiver according to the position information, and assigning a receiver identifier field at the same position corresponding to the second receiver, wherein the position field is used for indicating the position coordinate of the first receiver in a coordinate system, and the receiver identifier field at the same position is used for indicating the first receiver at the same position as the second receiver.
S860: the metadata is sent to an audio renderer, which is configured to render audio data for the plurality of recipients differentially according to the plurality of recipient fields, the location field, and the co-located recipient identifier field, respectively.
In one embodiment of the present application, the various fields associated with the recipient location information may be sub-elements of the receiver field listener in the metadata. As shown in table 4, a receiver degree-of-freedom field listener_DoF may be included among the sub-elements of the receiver field listener, and the degree of freedom of the corresponding receiver can be determined based on the value of this field. For example, when the field takes the value 0, the receiver only has three degrees of freedom; when the field takes the value 1, the receiver has six degrees of freedom.
A recipient identifier field listener_id may also be included among the sub-elements of the recipient field listener. This field may be used to distinguish between different recipients when the renderer needs to process user data of multiple recipients.
The sub-elements of the listener field may further include a co-located recipient identifier field equal_listener_id. When receiver A can be considered to have the same location information as receiver B, this field carries the identifier of the other, equivalent receiver.
A receiver coordinate system type field local_coordinates may also be included in the sub-elements of the receiver field listener, and the coordinate system used by the receivers is determined by its value. For example, a value of 0 indicates that the receivers use their respective local coordinate systems, and a value of 1 indicates that they use a common world coordinate system.
An origin offset field may also be included in the sub-elements of the receiver field listener, and it in turn may include offset component fields corresponding to the respective coordinate axes. The offset component fields origin_offset_x, origin_offset_y, and origin_offset_z represent the x-, y-, and z-axis components of the local coordinate system's origin offset: when the receivers use their respective local coordinate systems, each field indicates the offset of the corresponding receiver's local coordinate origin along that axis of the world coordinate system.
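For illustration, a minimal sketch of how a renderer might apply the origin offset fields when mapping a receiver's local coordinates into the world coordinate system, assuming the two systems differ only by translation (rotation between the systems is not described by these fields):

def local_to_world(pos_local, origin_offset):
    """Translate a receiver position from its local coordinate system into
    world coordinates using origin_offset_x/y/z (axes assumed aligned)."""
    return tuple(p + o for p, o in zip(pos_local, origin_offset))

# A receiver at (1, 0, 2) in a local system whose origin sits at (10, 5, 0)
# in the world system is at (11, 5, 2) in world coordinates.
print(local_to_world((1.0, 0.0, 2.0), (10.0, 5.0, 0.0)))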
TABLE 4 relevant specifications for sub-elements in the listener field
Sub-elements Description Norm
listener_DoF Degrees of freedom of the receiver 0: three degrees of freedom; 1: six degrees of freedom
listener_id Identifier distinguishing different recipients
equal_listener_id Identifier of a co-located, equivalent receiver
local_coordinates Coordinate system type used by the receivers 0: respective local coordinate systems; 1: common world coordinate system
origin_offset_x x-axis offset of the local coordinate system origin in the world coordinate system
origin_offset_y y-axis offset of the local coordinate system origin in the world coordinate system
origin_offset_z z-axis offset of the local coordinate system origin in the world coordinate system
Fig. 9 is a flowchart illustrating steps of a method for audio rendering based on the number of recipients performed by an audio sink according to an embodiment of the present application. As shown in fig. 9, the audio rendering method involving a multi-user scene performed at an audio receiving end includes the following steps S910 to S920.
S910: extracting a receiver field corresponding to a receiver receiving the audio data from the metadata, the receiver field including a feature field related to location information of the receiver;
S920: if the number of the receivers is multiple, differentially rendering the audio data for the receivers according to their respective receiver fields.
The details of the audio rendering method performed at the audio receiving end are the same as the audio rendering method performed at the audio capturing end in fig. 4 to 8, and are not described herein again.
In one embodiment of the present application, the metadata indicating how the audio data is rendered further includes fields other than those referred to in the above embodiments. Tables 5 to 20 below give an exemplary description of the relevant specifications of these fields. Here, instance corresponds to audioChannelFormat and aims to allow the metadata content to be modified without modifying the ADM.
TABLE 5 associated Specifications for attributes in the instance field
Properties Description Norm
id Uniquely identifying the instance INS_0001_0001
type Corresponding audio type code {0001,0003,0004}
typeLabel Corresponding audio type name {DirectSpeaker,Objects,HOA}
TABLE 6 associated Specifications for sub-elements in the instance field
Sub-elements Description Norm
audioChannelFormatRefID ID of the matching ADM element AC_00010001
unitInfo Corresponds to audioBlockFormat; there may be more than one
TABLE 7 relevant specifications for attributes in the unitInfo field
Properties Description Norm
id Uniquely identifies the unit UNI_00010001
start Start time of the unit; valid for the Objects type 00:00:00.00000
duration Duration of the unit; valid for the Objects type 00:00:00.00000
Table 8 relevant specifications for sub-elements in the unitInfo field (typeLabel == DirectSpeakers)
Sub-elements Description Norm
speakerLabel Loudspeaker layout Reference loudspeaker layout specification
gain Linear gain 0-16
Table 9 relevant specifications for sub-elements in the unitInfo field (typeLabel == Objects)
Sub-elements Description Norm
azimuth Horizontal angle, unit degree -180~180
elevation Height angle, unit degree -90~90
distance Distance in meters 0-50
gain Linear gain 0-16
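The azimuth/elevation/distance triple of Table 9 describes a spherical position; a renderer typically converts it to Cartesian coordinates before computing the spatial effect. A sketch under an assumed axis convention (x forward, y left, z up), which the tables here do not specify:

import math

def object_position(azimuth_deg, elevation_deg, distance_m):
    """Convert the Table 9 spherical triple to Cartesian coordinates.

    Axis convention assumed: x forward, y left, z up; other conventions
    only permute or negate the components.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    x = distance_m * math.cos(el) * math.cos(az)
    y = distance_m * math.cos(el) * math.sin(az)
    z = distance_m * math.sin(el)
    return (x, y, z)

# An object 90 degrees to the left at ear height, 2 m away.
print(object_position(azimuth_deg=90.0, elevation_deg=0.0, distance_m=2.0))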
Table 10 relevant specifications for sub-elements in the unitInfo field (typeLabel == HOA)
Sub-elements Description Norm
order Order corresponding to channel 0-7
degree Angle of corresponding channel -7~+7
normalization Normalization mode {0,1…}
gain Linear gain 0-16
TABLE 11 relevant specifications for sub-elements in the staticControl field
Sub-elements Description Norm
ambisonicOrder Order of spherical harmonic coding 1-7
acousticEnv Acoustic environment related
rendererInfo Rendering-related post-processing
TABLE 12 relevant specifications for sub-elements in the acousticEnv field
Sub-elements Description Norm
type Type of environmental acoustics {0,1,2}
typeLabel Environmental acoustics type label {Physical/Artificial/Sample}
earlyReflectionGain Early reflection gain [0.0-1.0]
lateReverbGain Late reverberation gain [0.0-1.0]
lowFreqProFlag Low-frequency separation processing 0/1; the low frequencies may optionally be processed with or without reverberation
convolutionReverbType Sampled reverberation type {0,1,2…}
surface Reflecting surface of the geometric space Supports space models composed of arbitrarily many reflecting surfaces
TABLE 13 relevant specifications for sub-elements in the surface field
(The table content appears only as an image in the original publication.)
TABLE 14 relevant specifications for sub-elements in the audioEffect field
Sub-elements Description Norm
EQ EQ post-processing
DRC DRC post-processing
Gain Gain post-processing
TABLE 15 relevant specifications for attributes in the EQ field
Properties Description Norm
index Representing the order of the audio effect links {0,1,2…}
TABLE 16 relevant specifications for attributes of the multi-segment item sub-elements in the EQ field
Properties Description Norm
type Filter type {0,1,2…}
typeLabel Filter type tag {lowpass,highpass,bandpass...}
frequency Cut-off frequency 20-16000Hz
gain Gain [-40,40]dB
Q Quality factor 0.1-12
TABLE 17 EQ type
type typeLabel
0 LowPass
1 HighPass
2 BandPass
3 BandReject
4 AllPass
5 LowShelving
6 HighShelving
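For illustration, one way an EQ item of type 0 (LowPass) could be realized is with a biquad filter from the widely used RBJ audio EQ cookbook; this is a sketch, and the sample rate is an assumption since it is not part of the EQ metadata:

import math

def lowpass_biquad(frequency_hz, q, sample_rate_hz=48000.0):
    """Biquad coefficients for an EQ item of type 0 (LowPass), following
    the RBJ audio EQ cookbook; returns (b, a) normalized so a[0] == 1."""
    w0 = 2.0 * math.pi * frequency_hz / sample_rate_hz
    alpha = math.sin(w0) / (2.0 * q)
    cos_w0 = math.cos(w0)
    b = [(1.0 - cos_w0) / 2.0, 1.0 - cos_w0, (1.0 - cos_w0) / 2.0]
    a = [1.0 + alpha, -2.0 * cos_w0, 1.0 - alpha]
    return [bi / a[0] for bi in b], [ai / a[0] for ai in a]

# A 1 kHz low-pass with the Table 16 quality factor set to 0.707.
b, a = lowpass_biquad(frequency_hz=1000.0, q=0.707)
print(b, a)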
TABLE 18 relevant specifications for attributes in the DRC field
Properties Description Norm
index Represents the order of the audio effect links {0,1,2…}
attackTime Attack time [0-100]ms
releaseTime Release time [50-300]ms
threshold Threshold [-80,10]dB
preGain Pre-gain [-10,10]dB
postGain Post-gain [0,20]dB
ratio Compression ratio 1-100
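The DRC attributes of Table 18 imply a static compression curve. The following sketch applies preGain, the threshold/ratio knee, and postGain to a level in dB; attack/release smoothing of the gain over time is deliberately omitted:

def drc_static_gain_db(level_db, threshold_db, ratio, pre_gain_db=0.0, post_gain_db=0.0):
    """Static compression curve: levels above the threshold are reduced by
    the compression ratio; attackTime/releaseTime smoothing is not modeled."""
    level = level_db + pre_gain_db
    if level > threshold_db:
        compressed = threshold_db + (level - threshold_db) / ratio
    else:
        compressed = level
    # Gain to apply to the signal so that `level` maps to `compressed`.
    return compressed - level + post_gain_db

# A -6 dB peak with threshold -20 dB and ratio 4:1 is attenuated by 10.5 dB.
print(drc_static_gain_db(level_db=-6.0, threshold_db=-20.0, ratio=4.0))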
TABLE 19 associated Specifications for attributes in the Gain field
Attribute Description Norm
index Representing the order of the audio effect links {0,1,2…}
gain Gain of -20~20dB
TABLE 20 relevant specifications for sub-elements in the dynamicControl field
(The table content appears only as an image in the original publication.)
The following provides a detailed description, with reference to fig. 10 to 11, of an audio rendering method involving diversified audio playback devices in the embodiments of the present application. Fig. 10 shows the method as performed at the audio capturing end, which may be, for example, the terminal device shown in fig. 1, the acquisition subsystem shown in fig. 2, or the capturing end shown in fig. 3. Fig. 11 shows the method as performed at the audio receiving end, which may be, for example, the terminal device shown in fig. 1, the client subsystem shown in fig. 2, or the registered renderer shown in fig. 3.
Fig. 10 is a flowchart illustrating steps of a method for audio rendering based on device information performed by an audio capturing end according to an embodiment of the present application. As shown in fig. 10, the audio rendering method involving a variety of audio playback devices, which is performed at the audio capturing end, includes the following steps S1010 to S1040.
S1010: acquiring equipment information of a receiver, wherein the equipment information is used for indicating audio playing equipment used by the receiver when the receiver receives audio data;
S1020: assigning values to a device information field in the metadata according to the device information, wherein the device information field is used for indicating the characteristics of the audio playing device that affect the audio data rendering effect;
S1030: assigning a value to a corresponding device identification field in the metadata according to the device information field, wherein the corresponding device identification field is a sub-element of the rendering information field, and the rendering information field is the element in the metadata used for indicating the audio data rendering effect;
S1040: sending the metadata to an audio renderer, wherein the audio renderer is configured to render the audio data for the audio playing device according to the device information field.
In this embodiment of the application, a corresponding device identification field and a device information field are configured in the metadata, and these fields identify the audio playing device of the receiver of the audio data, so that the audio data can be rendered, as indicated by these fields, into personalized rendering effects suited to different audio playing devices, meeting diversified audio rendering scene requirements.
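A minimal sketch of steps S1020 to S1030, serializing the Table 22 device fields and the Table 21 refDeviceID reference with Python's xml.etree.ElementTree; the enclosing element names and nesting are assumptions, since only the field names are specified here:

import xml.etree.ElementTree as ET

def build_device_metadata(device_id, device_type, device_pos, listener_id):
    """Assign the deviceInfo fields (S1020) and reference the device from
    rendererInfo via refDeviceID (S1030). Nesting under <vrExt> is assumed."""
    root = ET.Element("vrExt")

    device = ET.SubElement(root, "deviceInfo")
    ET.SubElement(device, "deviceID").text = str(device_id)
    ET.SubElement(device, "deviceType").text = str(device_type)  # 0: earphone, 1: loudspeaker
    ET.SubElement(device, "devicePos").text = str(device_pos)
    ET.SubElement(device, "refListenerID").text = str(listener_id)

    renderer = ET.SubElement(root, "rendererInfo")
    ET.SubElement(renderer, "refDeviceID").text = str(device_id)
    return ET.tostring(root, encoding="unicode")

# A left-ear earphone belonging to receiver 1.
print(build_device_metadata(device_id=100, device_type=0, device_pos=0, listener_id=1))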
In an embodiment of the present application, the corresponding device identification field is a sub-element of the rendering information field and may be assigned by directly referencing a sub-element of the device information field.
As shown in Table 21, the metadata implemented according to this embodiment contains a rendering information field rendererInfo, whose sub-elements include a corresponding device identification field refDeviceID. The refDeviceID sub-element describes the device identifier or device group identifier corresponding to the current rendering effect and takes the value of that identifier. As shown in Table 22, the metadata also contains a device information field deviceInfo, which provides a number of sub-elements related to the audio playing device.
Table 21 associated specification of sub-elements in rendering information field rendererInfo
Sub-elements Description Norm
refDeviceID Device identifier or device group identifier corresponding to the current rendering effect Takes the value of a device identifier or a device group identifier
Table 22 related specification of subelements in device information field deviceInfo
Sub-elements Description Norm
deviceID Identifier of the current device
deviceGroupID Identifier of the device group to which the current device belongs Devices in the same group share the same rendering effect
deviceName Readable device name
deviceType Type of the current device 0: earphone; 1: loudspeaker
devicePos Distribution position of the device relative to the recipient Earphone: 0 left ear, 1 right ear; loudspeaker: 0 middle, 1 front left, 2 front right, 3 back left, 4 back right
refListenerID Identifier of the recipient to which the current device belongs
In one embodiment of the application, the device information field includes a device identifier field or a device group identifier field: the device identifier field indicates an identifier distinguishing different audio playing devices, and the device group identifier field indicates the device group to which an audio playing device belongs, with devices belonging to the same group having the same audio data rendering effect. For example, the device identifier field deviceID shown in Table 22 indicates the identifier of the current device, and the device group identifier field deviceGroupID indicates the identifier of the device group to which the current device belongs. Devices belonging to the same device group have the same rendering effect, and device identifiers and device group identifiers do not overlap.
In one embodiment of the application, the device information field includes a device name field for indicating a readable name of the audio playing device, or a device type field for indicating the device type of the audio playing device. For example, the device name field deviceName shown in Table 22 represents a readable device name, and the device type field deviceType indicates the type of the current device: a value of 0 indicates an earphone, and a value of 1 indicates a loudspeaker.
In one embodiment of the application, the device information field includes a device location field for indicating a distribution location of the audio playback device with respect to the recipient. Such as the device location field devicePos shown in table 22.
In one embodiment of the present application, when the device type of the audio playing device is an earphone, the distribution position includes a left ear or a right ear; when the device type of the audio playing device is a speaker, the distribution position includes middle, front left, front right, back left, or back right. For example, when the audio playing device is an earphone, the devicePos value of 0 indicates a left-ear earphone; a value of 1 indicates a right ear headphone. When the audio playing device is a loudspeaker, the devicePos value is 0 to represent a middle loudspeaker; a value of 1 indicates a left front speaker; a value of 2 indicates a right front speaker; a value of 3 indicates a left rear speaker; a value of 4 indicates a rear right speaker.
In one embodiment of the application, the device information field includes a corresponding recipient identifier field for indicating the identifier of the recipient corresponding to the audio playing device. For example, the refListenerID field shown in Table 22 indicates the identifier of the recipient to which the current device belongs.
Fig. 11 is a flowchart illustrating steps of a method for audio rendering based on device information performed by an audio sink according to an embodiment of the present application. As shown in fig. 11, the audio rendering method involving a variety of audio playback devices performed at the audio sink includes the following steps S1110 to S1130.
S1110: extracting a corresponding device identification field from a rendering information field of the metadata, wherein the rendering information field is an element used for indicating an audio data rendering effect in the metadata;
S1120: extracting a device information field according to the corresponding device identification field, wherein the device information field is used for indicating the characteristics of the audio playing device that affect the audio data rendering effect;
S1130: rendering the audio data for the audio playing device according to the device information field.
The details of the scheme of the audio rendering method performed at the audio receiving end are the same as the audio rendering method performed at the audio capturing end in fig. 10, and are not described herein again.
A detailed description is given below, with reference to fig. 12 to 13, of an audio rendering method involving diversified rendering types in the embodiments of the present application. Fig. 12 shows the method as performed at the audio capturing end, which may be, for example, the terminal device shown in fig. 1, the acquisition subsystem shown in fig. 2, or the capturing end shown in fig. 3. Fig. 13 shows the method as performed at the audio receiving end, which may be, for example, the terminal device shown in fig. 1, the client subsystem shown in fig. 2, or the registered renderer shown in fig. 3.
Fig. 12 is a flowchart illustrating steps of a method for audio rendering based on rendering type performed by an audio capturing end in an embodiment of the present application. As shown in fig. 12, the audio rendering method involving diversified rendering types, which is performed at the audio capturing end, includes the following steps S1210 to S1230.
S1210: obtaining rendering type information of the audio data, wherein the rendering type information is used for indicating the rendering type corresponding to each audio component in the audio data;
S1220: assigning a value to an audio component information field in the metadata according to the rendering type information, the audio component information field indicating the audio components with corresponding rendering types to be processed by the audio renderer;
S1230: sending the metadata to an audio renderer, wherein the audio renderer is configured to render each audio component according to its rendering type.
In this embodiment of the application, an audio component information field is configured in the metadata, and the rendering type of each audio component can be identified from it, so that the audio data can be rendered, as indicated by this field, into personalized rendering effects suited to different rendering types, meeting diversified audio rendering scene requirements.
In one embodiment of the present application, the audio component information field may be an element in the root metadata. As shown in table 23, in VR extension root metadata implemented by an embodiment of the present application, there is an audio component information field componentsInfo, which may be used to indicate an audio component processed by an audio renderer.
Table 23 relevant specifications for the VR extension root metadata including the audio component information field
<vrExt> Description Norm
version Version number of the extended metadata 0.0.1
name Name of the extended metadata vrExt
level Priority of the extended metadata 1
presenceInfo Content-related part of the extended metadata
componentsInfo Audio components processed by the renderer
staticControl Static content of the extended metadata
dynamicControl Dynamic content of the extended metadata
In one embodiment of the application, the audio component information field includes a component type field for indicating whether the rendering type corresponding to the audio component is immersive audio rendering or non-immersive audio rendering. As shown in table 24, a component type field componentType is included in the sub-elements of the audio component information field componentsInfo, and the type of the audio component participating in rendering can be determined according to different values of the field. For example, a value of 0 indicates that the audio component is participating in immersive audio rendering; a value of 1 indicates that the audio component participates in non-immersive audio rendering, i.e., ordinary audio rendering.
Table 24 associated specification of subelements in audio component information field componentsInfo
Sub-elements Description Norm
componentType Rendering type in which the audio component participates 0: immersive audio rendering; 1: non-immersive (ordinary) audio rendering
componentID Identifier distinguishing different audio components
componentGroupID Component group to which the audio component belongs Components in the same group are used together as renderer input
refAudioFormatID Identifier of the corresponding audio format in the ADM
In one embodiment of the present application, the audio component information field includes a component identifier field or a component group identifier field: the component identifier field indicates an identifier distinguishing different audio components, and the component group identifier field indicates the component group to which the audio components belong, with the audio components of the same group used together as input parameters of the audio renderer. As shown in Table 24, the sub-elements of the audio component information field componentsInfo include a component identifier field componentID and a component group identifier field componentGroupID, and the audio components within the same group are collectively used as input to the renderer algorithm.
In one embodiment of the application, the audio component information field includes a corresponding audio format identifier field for indicating the identifier of the audio format corresponding to the audio component in the audio definition model. As shown in Table 24, the sub-elements of componentsInfo include a corresponding audio format identification field refAudioFormatID, which describes the identifier in the ADM to which the audio component corresponds.
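For illustration, a sketch of how a renderer might consume the componentsInfo fields of Table 24: components are partitioned by componentGroupID, each group is fed to the renderer together, and componentType selects immersive versus ordinary rendering. The record layout is hypothetical.

from collections import defaultdict

def partition_components(components):
    """Group components by componentGroupID; each group becomes one renderer
    input, and componentType (0: immersive, 1: ordinary) selects the path."""
    groups = defaultdict(list)
    for comp in components:
        groups[comp["componentGroupID"]].append(comp)
    for group_id, members in groups.items():
        mode = "immersive" if members[0]["componentType"] == 0 else "ordinary"
        channel_refs = [m["refAudioFormatID"] for m in members]
        print(f"group {group_id}: {mode} rendering of {channel_refs}")

partition_components([
    {"componentID": 1, "componentGroupID": 10, "componentType": 0, "refAudioFormatID": "AC_00010001"},
    {"componentID": 2, "componentGroupID": 20, "componentType": 1, "refAudioFormatID": "AC_00010002"},
])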
Fig. 13 is a flowchart illustrating steps of a method for audio rendering based on rendering type performed by an audio sink according to an embodiment of the present application. As shown in fig. 13, the audio rendering method involving diversified rendering types performed at the audio sink includes the following steps S1310 to S1330.
S1310: extracting an audio component information field from the metadata, the audio component information field indicating the audio components with corresponding rendering types processed by the audio renderer;
S1320: determining the rendering type corresponding to each audio component in the audio data according to the audio component information field;
S1330: rendering each audio component according to its rendering type.
The details of the scheme of the audio rendering method performed at the audio receiving end are the same as the audio rendering method performed at the audio capturing end in fig. 12, and are not described herein again.
The following provides a detailed description, with reference to fig. 14 to 16, of an audio rendering method involving diversified audio occlusion effects in the embodiments of the present application. Fig. 14 to 15 show the method as performed at the audio capturing end, which may be, for example, the terminal device shown in fig. 1, the acquisition subsystem shown in fig. 2, or the capturing end shown in fig. 3. Fig. 16 shows the method as performed at the audio receiving end, which may be, for example, the terminal device shown in fig. 1, the client subsystem shown in fig. 2, or the registered renderer shown in fig. 3.
Fig. 14 is a flowchart illustrating steps of a method for audio rendering based on information of an object to be occluded, executed by an audio capturing end in an embodiment of the present application. As shown in fig. 14, the audio rendering method involving diversified audio blocking effects performed at the audio capturing end includes the following steps S1410 to S1430.
S1410: acquiring attribute information of an occlusion object in a rendered scene of audio data;
S1420: assigning a value to an occluding object information field in the metadata according to the attribute information of the occluding object, wherein the occluding object information field is used for indicating the characteristics of the occluding object that produces the audio occlusion effect;
S1430: sending the metadata to an audio renderer, wherein the audio renderer is configured to render the audio data with the audio occlusion effect according to the occluding object information field.
In this embodiment of the application, an occluding object information field is configured in the metadata, and the occluding objects in the audio rendering scene can be identified from it, so that the audio data can be rendered, as indicated by this field, into personalized rendering effects with different audio occlusion effects, meeting diversified audio rendering scene requirements.
As shown in Table 25, the occluding object information field occlusionObjectInfo may include a plurality of sub-elements describing features of the occluding object.
Table 25 relevant specifications for sub-elements in the occluding object information field occlusionObjectInfo
Sub-elements Description Norm
objectID Identifier of the occluding object within the scene
pos_x x coordinate of the occluding object's position Any valid position in the three-dimensional coordinate system
pos_y y coordinate of the occluding object's position Any valid position in the three-dimensional coordinate system
pos_z z coordinate of the occluding object's position Any valid position in the three-dimensional coordinate system
boundingBox_x x coordinate of the occluding object's size Any valid value in the three-dimensional coordinate system
boundingBox_y y coordinate of the occluding object's size Any valid value in the three-dimensional coordinate system
boundingBox_z z coordinate of the occluding object's size Any valid value in the three-dimensional coordinate system
In one embodiment of the application, the occluding object information field includes occlusion position fields for indicating the position coordinates of the occluding object in the coordinate system. As shown in Table 25, the sub-elements of occlusionObjectInfo include the occlusion position fields pos_x, pos_y, and pos_z, representing the x, y, and z coordinates of the occluding object's position, respectively.
In one embodiment of the application, the occluding object information field includes occlusion size fields for indicating the size of the occluding object in the coordinate system. As shown in Table 25, the sub-elements of occlusionObjectInfo include the occlusion size fields boundingBox_x, boundingBox_y, and boundingBox_z, representing the x, y, and z extents of the occluding object's size, respectively.
In another embodiment of the present application, the occluding object information field includes vertex position fields for indicating the maximum and minimum position coordinates of the vertices of the occluding object; that is, the position and size of the occluding object may be determined from the minimum and maximum values of its vertex coordinates in the coordinate system. As shown in Table 26, the sub-elements of occlusionObjectInfo include the vertex position fields pos_x_min, pos_y_min, pos_z_min, pos_x_max, pos_y_max, and pos_z_max, representing the minimum and maximum x, y, and z coordinates of the occluding object's vertices, respectively.
Table 26 relevant specifications for sub-elements in the occluding object information field occlusionObjectInfo
Sub-elements Description Norm
objectID Identifier of the occluding object within the scene
pos_x_min Minimum x coordinate of the occluding object's vertices Any valid position in the three-dimensional coordinate system
pos_y_min Minimum y coordinate of the occluding object's vertices Any valid position in the three-dimensional coordinate system
pos_z_min Minimum z coordinate of the occluding object's vertices Any valid position in the three-dimensional coordinate system
pos_x_max Maximum x coordinate of the occluding object's vertices Any valid position in the three-dimensional coordinate system
pos_y_max Maximum y coordinate of the occluding object's vertices Any valid position in the three-dimensional coordinate system
pos_z_max Maximum z coordinate of the occluding object's vertices Any valid position in the three-dimensional coordinate system
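The equivalence between the two representations can be sketched as follows; taking the position as the box center is one plausible convention, since the text does not fix the reference point:

def box_from_vertices(mins, maxs):
    """Derive occluding-object position and size from the vertex coordinate
    extremes of Table 26; the position is taken as the box center, which is
    one plausible convention (the reference point is not specified here)."""
    pos = tuple((lo + hi) / 2.0 for lo, hi in zip(mins, maxs))
    bounding_box = tuple(hi - lo for lo, hi in zip(mins, maxs))
    return pos, bounding_box

# A 2 x 1 x 3 box anchored at the origin has its center at (1.0, 0.5, 1.5).
print(box_from_vertices(mins=(0.0, 0.0, 0.0), maxs=(2.0, 1.0, 3.0)))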
In one embodiment of the application, the occluding object information field includes an object identifier field for indicating an identifier distinguishing different occluding objects. As shown in Table 25 or Table 26, the sub-elements of occlusionObjectInfo include an object identifier field objectID indicating the identifier of an occluding object located within the rendered scene.
Fig. 15 is a flowchart illustrating steps of a method performed by an audio capture end for audio rendering based on occlusion parameters according to an embodiment of the present application. As shown in fig. 15, the method for audio rendering based on diversified occlusion parameters performed at the audio capturing end may include the following steps S1510 to S1540.
S1510: acquiring attribute information of an occlusion object in a rendered scene of audio data;
S1520: assigning values to the occluding object information field in the metadata according to the attribute information of the occluding objects, wherein the occluding object information field is used for indicating the characteristics of the occluding objects that produce the audio occlusion effect;
S1530: assigning values to occlusion parameter fields in the metadata according to the attribute information of the occluding objects, wherein the occlusion parameter fields are used for indicating parameters that affect the audio occlusion effect;
S1540: sending the metadata to an audio renderer, wherein the audio renderer is configured to render the audio data with the audio occlusion effect according to the occluding object information field and the occlusion parameter field.
In this embodiment of the application, an occluding object information field and an occlusion parameter field are configured in the metadata, and the occluding objects and occlusion parameters in the audio rendering scene can be identified from these fields, so that the audio data can be rendered, as indicated by them, into personalized rendering effects with different audio occlusion effects, meeting diversified audio rendering scene requirements.
As shown in Table 27, the occlusion parameter field occlusionParameter may include a plurality of sub-elements for describing the occlusion parameters.
Table 27 relevant specifications for sub-elements in the occlusion parameter field occlusionParameter
Sub-elements Description Norm
occlusionType Algorithm type used when rendering the occlusion effect 0: adjust the volume; 1: filter the high-frequency signal
occlusionIntensity Intensity of the occlusion effect
maxOcclusionIntensity Maximum occlusion intensity at which sound is considered completely occluded
refObjectID Identifier(s) of the occluding object(s) referenced when calculating the occlusion effect
In one embodiment of the present application, the occlusion parameter field includes an occlusion type field for indicating a type of algorithm used when rendering the audio occlusion effect.
In one embodiment of the present application, the algorithm types include achieving the audio occlusion effect by adjusting the volume, or by filtering out high-frequency signals.
As shown in Table 27, the sub-elements of the occlusion parameter field occlusionParameter include an occlusion type field occlusionType, whose value describes the algorithm type used when rendering the occlusion effect: a value of 0 indicates that the occlusion effect is achieved by adjusting the volume, and a value of 1 indicates that it is achieved by filtering the high-frequency signal.
In one embodiment of the present application, the occlusion parameter field includes an occlusion strength field for indicating the strength of the rendered audio occlusion effect. As shown in Table 27, the sub-elements of occlusionParameter include an occlusion intensity field occlusionIntensity, whose value describes the intensity of the occlusion effect.
In one embodiment of the present application, the occlusion parameter field includes a maximum occlusion strength field for indicating the maximum strength of the audio occlusion effect produced on the audio data, where the maximum strength may represent the strength at which the audio is completely occluded, or the strength at which a specified degree of occlusion is reached (e.g., the strength corresponding to a 90% occlusion effect). As shown in Table 27, the sub-elements of occlusionParameter include a maximum occlusion intensity field maxOcclusionIntensity; when the occlusion effect reaches the maximum occlusion intensity indicated by this field, the sound can be considered completely occluded. For example, if the maximum occlusion strength is 10 and the occlusion strength of a given occluding object is 5, the audio passing through that object can be considered attenuated by 50%.
In one embodiment of the present application, the occlusion parameter field includes a corresponding object identification field for indicating the identifier of one or more occluding objects referenced when rendering the audio occlusion effect. As shown in Table 27, the sub-elements of occlusionParameter include a corresponding object identification field refObjectID, which describes the occluding object identifier(s) referenced when calculating the occlusion effect and can directly index one or more occluding objects.
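Putting the occlusion parameters together, a sketch of the volume-based branch (occlusionType 0); the signal is a plain sample list, and the high-frequency filtering branch is left unimplemented because no filter design is specified in the metadata:

def apply_occlusion(samples, occlusion_intensity, max_occlusion_intensity, occlusion_type=0):
    """Attenuate audio by the ratio occlusionIntensity / maxOcclusionIntensity.

    occlusionType 0 adjusts the volume; type 1 would instead filter the
    high-frequency signal (the filter is not specified in the metadata).
    """
    occluded_ratio = occlusion_intensity / max_occlusion_intensity
    if occlusion_type == 0:
        # E.g. intensity 2 of a maximum 10 removes 20% of the sound.
        return [s * (1.0 - occluded_ratio) for s in samples]
    raise NotImplementedError("high-frequency filtering branch not sketched")

print(apply_occlusion([0.5, -0.25, 1.0], occlusion_intensity=2, max_occlusion_intensity=10))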
Fig. 16 is a flowchart illustrating steps of a method for audio rendering based on occlusion object information performed by an audio sink according to an embodiment of the present application. As shown in fig. 16, the audio rendering method involving diversified occlusion parameters, which is performed at the audio sink, includes the following steps S1610 to S1620.
S1610: extracting an occluding object information field from the metadata, the occluding object information field indicating a feature of an occluding object that produces an audio occlusion effect;
S1620: rendering the audio data with the audio occlusion effect according to the occluding object information field.
The details of the scheme of the audio rendering method performed at the audio receiving end are the same as the audio rendering method performed at the audio capturing end in fig. 14, and are not described herein again.
The following describes in detail, with reference to fig. 17 to 18, an audio rendering method involving diversified scene configurations in the embodiments of the present application. Fig. 17 shows the method as performed at the audio capturing end, which may be, for example, the terminal device shown in fig. 1, the acquisition subsystem shown in fig. 2, or the capturing end shown in fig. 3. Fig. 18 shows the method as performed at the audio receiving end, which may be, for example, the terminal device shown in fig. 1, the client subsystem shown in fig. 2, or the registered renderer shown in fig. 3.
Fig. 17 is a flowchart illustrating steps of a method performed by an audio capturing end for audio rendering based on scene configuration information according to an embodiment of the present application. As shown in fig. 17, the audio rendering method involving diversified scene configurations, which is performed at the audio capturing end, includes the following steps S1710 to S1730.
S1710: acquiring scene configuration information of a receiver, wherein the scene configuration information is configuration information influencing the rendering effect of audio data in a rendered scene;
S1720: assigning values to characteristic fields of the metadata according to the scene configuration information, wherein the characteristic fields include at least one of a receiver field, a device information field, an audio component information field, or an occluding object information field; the receiver field includes a field related to the position information of the receiver, the device information field indicates the characteristics of the audio playing device that affect the audio data rendering effect, the audio component information field indicates the audio components with corresponding rendering types processed by the audio renderer, and the occluding object information field indicates the characteristics of an occluding object that produces an audio occlusion effect;
S1730: sending the metadata to an audio renderer, which is configured to render the audio data according to the characteristic fields.
According to the embodiment of the application, the characteristic fields related to the scene configuration information are configured in the metadata, and the related fields can be used for identifying various scene configurations which affect audio rendering effects in the audio rendering scene, so that the audio data can obtain different audio rendering effects under the indication of the related fields, and diversified audio rendering scene requirements are met.
In some optional embodiments, any one of the above-mentioned feature fields may be configured in the metadata according to the scene configuration information of the receiving party, so as to achieve an audio rendering effect adapted to one scene configuration type. In other alternative embodiments, a combination of two or more of the above-mentioned feature fields may also be configured in the metadata to achieve an audio rendering effect that is adaptive to multiple scene configuration types.
For example, the receiver field and the device information field may be configured in the metadata at the same time, so as to meet the audio rendering scene requirement adapted to the position information of the receiver and the device characteristics of the audio playing device.
For another example, the device information field and the blocking object information field may be configured in the metadata at the same time, so as to meet the requirements of the audio rendering scene that are adapted to the device features of the audio playing device and the features of the blocking object.
Fig. 18 is a flowchart illustrating steps of a method for audio rendering based on scene configuration information performed by an audio sink according to an embodiment of the present application. As shown in fig. 18, the audio rendering method involving diversified scene configurations performed at the audio sink includes the following steps S1810 to S1820.
S1810: extracting a characteristic field from the metadata, wherein the characteristic field comprises at least one of a receiver field, a device information field, an audio component information field or an occlusion object information field; the receiver field includes a field related to position information of the receiver, the device information field indicating a characteristic of an audio playback device affecting an audio data rendering effect, the audio component information field indicating an audio component having a corresponding rendering type processed by the audio renderer, the occluding object information field indicating a characteristic of an occluding object that produces an audio occluding effect;
S1820: rendering the audio data according to the characteristic field.
The details of the scheme of the audio rendering method performed at the audio receiving end are the same as the audio rendering method performed at the audio capturing end in fig. 17, and are not described herein again.
In the audio rendering method related to diversified scene configurations in the embodiment of the present application, details of implementation related to the field of the receiving party may be described with reference to the above embodiments corresponding to fig. 4 to 9, and are not described herein again.
In the audio rendering method related to diversified scene configurations in the embodiment of the present application, details of implementation related to the device information field may be described with reference to the above embodiments corresponding to fig. 10 to 11, and are not described herein again.
In the audio rendering method related to diversified scene configurations in the embodiment of the present application, details of implementation related to the audio component information field may be described with reference to the above embodiments corresponding to fig. 12 to 13, and are not described herein again.
In the audio rendering method related to diversified scene configurations in the embodiment of the present application, details of implementation related to the information field of the blocking object may be described with reference to the above embodiments corresponding to fig. 14 to 16, and are not described herein again.
The following describes an audio rendering method provided by the embodiments of the present application with reference to two examples of specific application scenarios.
In an application scenario of the embodiment of the present application, a method for performing audio rendering based on relevant information such as user location, device, and content may include the following steps.
1. The server generates a corresponding audio metadata file according to the position, device information, and other data fed back by the user, where:
(The example metadata file appears only as an image in the original publication.)
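Since the example file itself is not reproduced, the following is a hypothetical reconstruction in the vrExt style described above, consistent with the scenario details given below (receiver 1 with three degrees of freedom, receivers 2 and 3 co-located with six degrees of freedom in a world coordinate system, a device group 100, and two audio components); the element nesting and exact XML syntax are assumptions:

import xml.etree.ElementTree as ET

# Hypothetical reconstruction of the metadata described in this scenario;
# the nesting is assumed, while the field names follow the tables above.
metadata_xml = """
<vrExt>
  <listener><listener_id>1</listener_id><listener_DoF>0</listener_DoF></listener>
  <listener><listener_id>2</listener_id><listener_DoF>1</listener_DoF>
    <local_coordinates>1</local_coordinates></listener>
  <listener><listener_id>3</listener_id><listener_DoF>1</listener_DoF>
    <equal_listener_id>2</equal_listener_id></listener>
  <deviceInfo><deviceGroupID>100</deviceGroupID></deviceInfo>
  <componentsInfo><componentID>1</componentID><componentType>0</componentType></componentsInfo>
  <componentsInfo><componentID>2</componentID><componentType>1</componentType></componentsInfo>
</vrExt>
"""

root = ET.fromstring(metadata_xml)
for listener in root.findall("listener"):
    dof = "6DoF" if listener.findtext("listener_DoF") == "1" else "3DoF"
    alias = listener.findtext("equal_listener_id")
    print(f"listener {listener.findtext('listener_id')}: {dof}"
          + (f", co-located with listener {alias}" if alias else ""))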
2. The server sends the audio metadata file to the audio renderer.
a) If the renderer is a cloud renderer, it is located on the server side.
b) If the renderer is the user's local renderer, it is located on the client side.
3. The audio renderer extracts the position information and device information from the metadata file as rendering-related input parameters and calculates the rendering effect for the corresponding user and device.
Receiver 1 has only three degrees of freedom and therefore carries only rotation-related information, without position information. Receivers 2 and 3 both have six degrees of freedom and therefore carry both rotation-related information and position information.
When the renderer renders according to the coordinate information of receivers 1, 2, and 3, no coordinate conversion is needed because the coordinates are given in a global coordinate system; and because receivers 2 and 3 are at equal positions, the three-dimensional spatial rendering effect need only be calculated from the coordinates of receiver 2.
For the rendering information, suppose the device ID corresponding to the calculated rendering effect is 100, and the device information list shows that this ID corresponds to a device group. This indicates that the rendering effect applies to all devices within device group 100, i.e., devices 1 and 2.
As for the components in the audio file, component 1 corresponds to audio channel 1 and component 2 corresponds to audio channel 2; component 1 participates in immersive rendering while component 2 does not. When performing audio rendering, the renderer therefore needs to acquire the data of audio channels 1 and 2 according to the metadata and process them separately.
4. The audio is presented on the corresponding device of the corresponding user.
a) If the renderer is a cloud renderer, the server sends the rendered audio to the client, and the client presents it.
b) If the renderer is the user's local renderer, the client presents the audio directly.
In another application scenario of the embodiment of the present application, a method for audio rendering based on an occlusion effect may include the following steps.
1. The server generates a corresponding audio metadata file according to the position, device information, and other data fed back by the user, where:
(The example metadata file appears only as an image in the original publication.)
2. The server sends the audio metadata file to the audio renderer.
a) If the renderer is a cloud renderer, it is located on the server side.
b) If the renderer is the user's local renderer, it is located on the client side.
3. The audio renderer extracts the receiver position information, the occluding object information, and the occlusion algorithm information from the metadata file as input parameters for rendering the occlusion effect.
From the position of listener1, the position information of the occluding objects, and the sound source position information already supported in the current standard, the degree of occlusion as the sound source in the scene propagates to the receiver can be calculated.
The proportion of sound that is occluded is then calculated from the occlusion effect intensity ratio occlusionIntensity/maxOcclusionIntensity of the occluding object. For example, if occlusionIntensity is 2 and maxOcclusionIntensity is 10, 20% of the sound is occluded.
Finally, since the occlusion effect algorithm type occlusionType = 0, the occlusion effect is presented by adjusting the volume of the sound source.
4. The audio is presented on the corresponding device of the corresponding user.
a) If the renderer is a cloud renderer, the server sends the rendered audio to the client, and the client presents it.
b) If the renderer is the user's local renderer, the client presents the audio directly.
As can be seen from the above application scenarios, the audio rendering method provided by the embodiments of the present application can perform differentiated audio rendering according to position information fed back by multiple users, distinguish 3DoF and 6DoF audio rendering scenes, define user devices more precisely, distinguish the audio components in the content that participate in immersive rendering from those that receive ordinary rendering, and provide diversified audio occlusion rendering effects, thereby supporting richer immersive audio application scenarios.
It should be noted that although the steps of the methods in this application are depicted in the drawings in a particular order, this does not require or imply that the steps must be performed in this particular order or that all of the depicted steps must be performed to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken into multiple step executions, etc.
The following describes embodiments of an apparatus of the present application, which may be used to perform the audio rendering method in the above embodiments of the present application.
In one embodiment of the present application, there is provided an audio rendering apparatus at an audio capturing end for audio rendering based on a number of recipients, the apparatus including:
a first obtaining module configured to obtain number information of recipients, the number information indicating the number of recipients receiving the audio data;
a first assignment module configured to configure, when the number of the recipients is multiple, multiple recipient fields corresponding to the multiple recipients, respectively, in metadata according to the number information, the recipient fields including a feature field related to location information of the recipients;
a first sending module configured to send the metadata to an audio renderer for differentially rendering the audio data for the plurality of recipients according to the plurality of recipient fields, respectively.
In an embodiment of the application, the first assigning module is further configured to assign a value to a multi-receiver flag field in the metadata according to the quantity information, where the multi-receiver flag field is used to indicate that the quantity of the receivers is one or more.
In one embodiment of the application, the metadata includes a presentation information field, the presentation information field being an element related to the content of the audio data, and the multi-recipient flag field being an attribute or a sub-element of the presentation information field.
In an embodiment of the application, the first obtaining module is further configured to obtain degree-of-freedom information of the receiving party, where the degree-of-freedom information is used to indicate a degree of freedom of the receiving party when receiving the audio; the first assignment module is further configured to assign a receiver degree of freedom field in the metadata according to the degree of freedom information, where the receiver degree of freedom field is used to indicate that the receiver has three degrees of freedom or six degrees of freedom.
In an embodiment of the application, the first assigning module is further configured to assign, when the number of the recipients is multiple, a recipient identifier field corresponding to each of the recipients in the metadata, where the recipient identifier field is used to distinguish different recipients.
In an embodiment of the application, the first obtaining module is further configured to, when the number of the receivers is multiple, respectively obtain location information of each receiver, where the location information is used to indicate a location coordinate of the receiver in a coordinate system used by the receiver; the first assignment module is further configured to assign, according to the location information, a coordinate system type field corresponding to each of the receivers in the metadata, where the coordinate system type field is used to indicate that the type of the coordinate system is a local coordinate system or a world coordinate system.
In an embodiment of the application, the first assignment module is further configured to assign an origin offset field in the metadata according to the location information when the coordinate system is a local coordinate system, the origin offset field indicating an offset of an origin of coordinates of the local coordinate system in the world coordinate system.
In one embodiment of the present application, the origin offset field includes an offset component field corresponding to each coordinate axis, and the offset component field is used for indicating a component of the offset on each coordinate axis.
In an embodiment of the application, the first assignment module is further configured to determine whether each of the receivers is in the same location according to the location information; and if the first receiver and the second receiver are at the same position, assigning a position field corresponding to the first receiver according to the position information, and assigning a same-position receiver identifier field corresponding to the second receiver, wherein the position field is used for indicating a position coordinate of the first receiver in a coordinate system, and the same-position receiver identifier field is used for indicating the first receiver at the same position as the second receiver.
In one embodiment of the present application, an audio rendering apparatus for audio rendering based on the number of recipients at an audio receiving end is provided, the apparatus comprising:
a first extraction module configured to extract a receiver field corresponding to a receiver receiving audio data from metadata, the receiver field including a feature field related to location information of the receiver;
a first rendering module configured to render the audio data differentially for the plurality of recipients according to the plurality of recipient fields, respectively, if the number of recipients is multiple.
In one embodiment of the present application, there is provided an audio rendering apparatus for audio rendering based on device information at an audio capturing end, the apparatus including:
the second acquisition module is configured to acquire equipment information of a receiving party, wherein the equipment information is used for indicating an audio playing device used by the receiving party when receiving audio data;
a second assignment module configured to assign a device information field in the metadata according to the device information, wherein the device information field is used for indicating a characteristic of the audio playing device which affects the audio data rendering effect; assigning a value to a corresponding device identification field in the metadata according to the device information field, wherein the corresponding device identification field is a sub-element of a rendering information field, and the rendering information field is an element in the metadata for indicating the rendering effect of the audio data;
a second sending module configured to send the metadata to an audio renderer, where the audio renderer is configured to render the audio data for the audio playback device according to the device information field.
In an embodiment of the application, the device information field includes a device identifier field or a device group identifier field, the device identifier field is used to indicate an identifier for distinguishing different audio playback devices, the device group identifier field is used to indicate a device group to which the audio playback devices belong, and audio playback devices belonging to the same device group have the same audio data rendering effect.
In an embodiment of the application, the device information field includes a device name field for indicating a readable name of the audio playback device or a device type field for indicating a device type of the audio playback device.
In one embodiment of the application, the device information field includes a device location field for indicating a distribution location of the audio playback device with respect to the recipient.
In an embodiment of the present application, when the device type of the audio playing device is an earphone, the distribution position includes a left ear or a right ear; when the device type of the audio playing device is a loudspeaker, the distribution positions comprise middle, left front, right front, left back or right back.
In one embodiment of the present application, the device information field includes a corresponding recipient identifier field for indicating an identifier of the recipient corresponding to the audio playback device.
In an embodiment of the present application, an audio rendering apparatus at an audio receiving end for performing audio rendering based on device information is provided, the apparatus including:
a second extraction module configured to extract a corresponding device identification field from a rendering information field of metadata, the rendering information field being an element of the metadata for indicating an audio data rendering effect; extracting a device information field according to the corresponding device identification field, wherein the device information field is used for indicating the characteristics of the audio playing device influencing the audio data rendering effect;
a second rendering module configured to render the audio data for the audio playback device according to the device information field.
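As a concrete illustration of the two ends of this exchange, the sketch below assigns illustrative device information fields at the capturing end and resolves them through the rendering information field at the receiving end. Every field name and the toy channel routing are assumptions for the example, not the patent's encoding.

```python
# All field names below are illustrative stand-ins, not a normative encoding.
METADATA = {
    "renderInfo": {
        # corresponding device identification field: points at a device entry
        "refDeviceId": "dev-01",
    },
    "devices": {
        "dev-01": {
            "deviceId": "dev-01",             # device identifier field
            "deviceGroupId": "headphones-a",  # device group identifier field
            "deviceName": "Living-room headphones",
            "deviceType": "headphone",        # headphone | loudspeaker
            "devicePosition": "left_ear",     # device position field
            "refReceiverId": "A",             # corresponding receiver identifier
        },
    },
}

def extract_device_field(metadata):
    """Receiving end: resolve the device info field via the rendering info field."""
    ref = metadata["renderInfo"]["refDeviceId"]
    return metadata["devices"][ref]

def render_for_device(samples, device):
    """Toy per-device rendering: route a headphone signal to one ear."""
    if device["deviceType"] == "headphone":
        left = device["devicePosition"] == "left_ear"
        return [(s, 0.0) if left else (0.0, s) for s in samples]
    return [(s, s) for s in samples]  # loudspeaker: same signal on both channels

device = extract_device_field(METADATA)
print(render_for_device([0.1, 0.2], device))
```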
In one embodiment of the present application, an audio rendering apparatus located at the audio capturing end and performing audio rendering based on a rendering type is provided, the apparatus including:
the third obtaining module is configured to obtain rendering type information of the audio data, wherein the rendering type information is used for indicating a rendering type corresponding to each audio component in the audio data;
a third assignment module configured to assign an audio component information field in the metadata according to the rendering type information, the audio component information field indicating an audio component having a corresponding rendering type to be processed by the audio renderer;
a third sending module configured to send the metadata to the audio renderer for rendering the audio component according to the rendering type.
In one embodiment of the application, the audio component information field comprises a component type field, and the component type field is used for indicating whether the rendering type corresponding to the audio component is immersive audio rendering or non-immersive audio rendering.
In one embodiment of the present application, the audio component information field includes a component identifier field or a component group identifier field, the component identifier field indicating an identifier for distinguishing different audio components; the component group identifier field is used for indicating the component group to which the audio components belong, and the audio components belonging to the same component group are jointly used as input parameters of the audio renderer.
In one embodiment of the application, the audio component information field includes a corresponding audio format identifier field for indicating an identifier of an audio format in the audio definition model corresponding to the audio component.
In an embodiment of the present application, an audio rendering apparatus at an audio sink for performing audio rendering based on rendering types is provided, the apparatus including:
a third extraction module configured to extract an audio component information field from the metadata, the audio component information field indicating an audio component having a corresponding rendering type processed by the audio renderer;
the third rendering module is configured to determine rendering types corresponding to the audio components in the audio data according to the audio component information fields; rendering the audio component according to the rendering type.
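A minimal sketch of this type-driven dispatch is given below: components sharing a group identifier are fed to the renderer together, and the component type field selects immersive or non-immersive handling. The field names, grouping keys, and placeholder render calls are assumptions, not the patent's encoding.

```python
from collections import defaultdict

# Illustrative audio component information fields; all names are assumptions.
COMPONENTS = [
    {"componentId": "c1", "componentGroupId": "g1", "componentType": "immersive"},
    {"componentId": "c2", "componentGroupId": "g1", "componentType": "immersive"},
    {"componentId": "c3", "componentGroupId": None, "componentType": "non_immersive"},
]

def group_components(components):
    """Components sharing a group identifier are fed to the renderer together."""
    groups = defaultdict(list)
    for comp in components:
        groups[comp["componentGroupId"] or comp["componentId"]].append(comp)
    return groups

def render_component_group(group):
    """Dispatch on the component type field of the group."""
    ids = [c["componentId"] for c in group]
    if group[0]["componentType"] == "immersive":
        return f"spatial (immersive) render of {ids}"
    return f"plain (non-immersive) render of {ids}"

for name, group in group_components(COMPONENTS).items():
    print(name, "->", render_component_group(group))
```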
In one embodiment of the present application, an audio rendering apparatus located at the audio capturing end and performing audio rendering based on an occluding object is provided, the apparatus including:
a fourth obtaining module configured to obtain attribute information of an occluding object located in a rendered scene of the audio data;
a fourth assignment module configured to assign a value to an occluding object information field in the metadata according to the attribute information of the occluding object, the occluding object information field being used to indicate a feature of the occluding object that produces an audio occlusion effect;
a fourth sending module configured to send the metadata to the audio renderer for rendering the audio data with the audio occlusion effect according to the occlusion object information field.
In an embodiment of the application, the occluding object information field comprises an occluding position field for indicating position coordinates of the occluding object in a coordinate system.
In one embodiment of the application, the occluding object information field includes an occluding size field for indicating a size coordinate of the occluding object in a coordinate system.
In one embodiment of the present application, the occluding object information field includes a vertex position field for indicating maximum and minimum position coordinates of a vertex on the occluding object.
In one embodiment of the application, the occluding object information field comprises an object identifier field for indicating an identifier distinguishing between different ones of the occluding objects.
In an embodiment of the application, the fourth assignment module is further configured to assign a value to an occlusion parameter field in the metadata according to the attribute information of the occluding object, where the occlusion parameter field is used to indicate a parameter affecting the audio occlusion effect.
In one embodiment of the present application, the occlusion parameter field includes an occlusion type field for indicating a type of algorithm used when rendering the audio occlusion effect.
In one embodiment of the present application, the type of algorithm includes achieving the audio occlusion effect by adjusting the volume, or by filtering out high-frequency signals.
In one embodiment of the present application, the occlusion parameter field includes an occlusion strength field for indicating a strength of rendering the audio occlusion effect.
In one embodiment of the application, the occlusion parameter field includes a maximum occlusion strength field for indicating a maximum strength of producing an audio occlusion effect on the audio data.
In one embodiment of the present application, the occlusion parameter field includes a corresponding object identification field for indicating an identifier of one or more occlusion objects referenced when rendering the audio occlusion effect.
In an embodiment of the present application, an audio rendering apparatus located at an audio receiving end and performing audio rendering based on an occluding object is provided, the apparatus including:
a fourth extraction module configured to extract, from the metadata, an occluding object information field for indicating a feature of an occluding object that produces an audio occlusion effect;
a fourth rendering module configured to render audio data having the audio occlusion effect according to the occlusion object information field.
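The two algorithm types named in the embodiments above can be sketched as follows; the occlusion parameter field names and the one-pole low-pass filter are illustrative assumptions standing in for whatever concrete renderer implements the effect.

```python
def apply_occlusion(samples, occlusion):
    """Render an audio occlusion effect per the occlusion parameter fields.
    Field names are illustrative, not the patent's encoding."""
    strength = min(occlusion["occlusionStrength"],
                   occlusion.get("maxOcclusionStrength", 1.0))
    if occlusion["occlusionType"] == "volume":
        # Algorithm type 1: achieve the occlusion effect by reducing volume.
        return [s * (1.0 - strength) for s in samples]
    # Algorithm type 2: filter out high-frequency content (one-pole low-pass).
    out, state = [], 0.0
    alpha = 1.0 - strength  # stronger occlusion -> heavier filtering
    for s in samples:
        state += alpha * (s - state)
        out.append(state)
    return out

occlusion = {"refObjectId": "wall-1",      # corresponding object identification
             "occlusionType": "lowpass",   # occlusion type field
             "occlusionStrength": 0.8,     # occlusion strength field
             "maxOcclusionStrength": 0.9}  # maximum occlusion strength field
print(apply_occlusion([1.0, -1.0, 1.0, -1.0], occlusion))
```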
In one embodiment of the present application, an audio rendering apparatus located at the audio capturing end and performing audio rendering based on diversified scene configurations is provided, the apparatus including:
a fifth obtaining module configured to obtain scene configuration information of a receiver, where the scene configuration information is configuration information that affects the audio data rendering effect in the rendered scene;
a fifth assignment module configured to assign a value to a feature field of the metadata according to the scene configuration information, where the feature field includes at least one of a receiver field, a device information field, an audio component information field, or an occluding object information field; the receiver field includes a field related to position information of the receiver, the device information field is used for indicating a characteristic of the audio playing device that affects the audio data rendering effect, the audio component information field is used for indicating an audio component with a corresponding rendering type to be processed by an audio renderer, and the occluding object information field is used for indicating a characteristic of an occluding object that produces an audio occlusion effect;
a fifth sending module configured to send the metadata to an audio renderer for rendering the audio data according to the feature field.
In an embodiment of the present application, an audio rendering apparatus at an audio receiving end for performing audio rendering based on diversified scene configurations is provided, the apparatus including:
a fifth extraction module configured to extract a feature field from the metadata, the feature field including at least one of a receiver field, a device information field, an audio component information field, or an occluding object information field; the receiver field includes a field related to position information of the receiver, the device information field is used for indicating a characteristic of the audio playing device that affects the audio data rendering effect, the audio component information field is used for indicating an audio component with a corresponding rendering type to be processed by an audio renderer, and the occluding object information field is used for indicating a characteristic of an occluding object that produces an audio occlusion effect;
a fifth rendering module configured to render audio data according to the feature field.
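Bringing the four feature fields together, a renderer can dispatch on whichever fields are present in the metadata, as in the hedged sketch below; the handler bodies are placeholders rather than the embodiments' actual rendering algorithms.

```python
# Placeholder handlers keyed by feature field; names and bodies are assumptions.
HANDLERS = {
    "receivers": lambda audio, v: f"per-receiver render for {len(v)} receivers",
    "deviceInfo": lambda audio, v: f"device-specific render ({v['deviceType']})",
    "audioComponents": lambda audio, v: f"type-aware render of {len(v)} components",
    "occludingObjects": lambda audio, v: f"occlusion render with {len(v)} objects",
}

def render_scene(audio, metadata):
    """Apply the handling for whichever feature fields the metadata carries."""
    return [handler(audio, metadata[field])
            for field, handler in HANDLERS.items()
            if field in metadata]  # at least one feature field is expected

metadata = {"receivers": [{"id": "A"}, {"id": "B"}],
            "occludingObjects": [{"objectId": "wall-1"}]}
print(render_scene([0.1, 0.2], metadata))
```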
The specific details of the audio rendering apparatus provided in the embodiments of the present application have been described in detail in the corresponding method embodiments, and are not described herein again.
Fig. 19 schematically shows a structural block diagram of a computer system of an electronic device for implementing the embodiment of the present application.
It should be noted that the computer system 1900 of the electronic device shown in fig. 19 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 19, the computer system 1900 includes a Central Processing Unit (CPU) 1901 that can perform various appropriate actions and processes in accordance with a program stored in a Read-Only Memory (ROM) 1902 or a program loaded from a storage section 1908 into a Random Access Memory (RAM) 1903. The random access memory 1903 also stores various programs and data necessary for system operation. The CPU 1901, the ROM 1902, and the RAM 1903 are connected to each other via a bus 1904. An input/output (I/O) interface 1905 is also connected to the bus 1904.
The following components are connected to the input/output interface 1905: an input section 1906 including a keyboard, a mouse, and the like; an output section 1907 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 1908 including a hard disk and the like; and a communication section 1909 including a network interface card such as a local area network card or a modem. The communication section 1909 performs communication processing via a network such as the Internet. A drive 1910 is also connected to the input/output interface 1905 as needed. A removable medium 1911, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1910 as needed, so that a computer program read therefrom can be installed into the storage section 1908 as needed.
In particular, according to embodiments of the present application, the processes described in the various method flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code for performing the method illustrated by the flowchart. In such embodiments, the computer program may be downloaded and installed from a network via the communication section 1909 and/or installed from the removable medium 1911. When the computer program is executed by the central processing unit 1901, it performs the various functions defined in the system of the present application.
It should be noted that the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash Memory, an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to embodiments of the application, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which can be a personal computer, a server, a touch terminal, a network device, etc.) to execute the method according to the embodiments of the present application.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains.
It will be understood that the present application is not limited to the precise arrangements that have been described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (49)

1. An audio rendering method, comprising:
acquiring quantity information of receivers, wherein the quantity information is used for indicating the quantity of the receivers receiving the audio data;
when the number of the receivers is multiple, configuring multiple receiver fields corresponding to the multiple receivers respectively in metadata according to the number information, wherein the receiver fields comprise characteristic fields related to the position information of the receivers;
and sending the metadata to an audio renderer, wherein the audio renderer is used for differentially rendering the audio data for the receivers according to the fields of the receivers respectively.
2. The audio rendering method of claim 1, wherein after obtaining the information of the number of recipients, the method further comprises:
and assigning a value to a multi-receiver flag field in the metadata according to the quantity information, wherein the multi-receiver flag field is used for indicating whether there is one receiver or multiple receivers.
3. The audio rendering method of claim 2, wherein the metadata comprises a presentation information field, wherein the presentation information field is an element related to the content of the audio data, and wherein the multi-receiver flag field is an attribute or a sub-element of the presentation information field.
4. The audio rendering method of any of claims 1-3, wherein prior to sending the metadata to an audio renderer, the method further comprises:
acquiring the degree of freedom information of the receiver, wherein the degree of freedom information is used for indicating the degree of freedom of the receiver when receiving the audio;
and assigning a value to a receiver degree of freedom field in the metadata according to the degree of freedom information, wherein the receiver degree of freedom field is used for indicating whether the receiver has three degrees of freedom or six degrees of freedom.
5. The audio rendering method of any of claims 1-3, wherein prior to sending the metadata to an audio renderer, the method further comprises:
and when the number of the receivers is multiple, assigning values to receiver identifier fields corresponding to the receivers in the metadata, wherein the receiver identifier fields are used for distinguishing different receivers.
6. The audio rendering method of any of claims 1-3, wherein prior to sending the metadata to an audio renderer, the method further comprises:
when the number of the receivers is multiple, respectively acquiring the position information of each receiver, wherein the position information is used for indicating the position coordinates of the receivers in the used coordinate system;
and assigning a value to a coordinate system type field corresponding to each receiver in the metadata according to the position information, wherein the coordinate system type field is used for indicating whether the coordinate system is a local coordinate system or a world coordinate system.
7. The audio rendering method of claim 6, wherein after the position information of each receiver is obtained, the method further comprises:
and when the coordinate system is a local coordinate system, assigning an origin offset field in the metadata according to the position information, wherein the origin offset field is used for indicating the offset of the origin of coordinates of the local coordinate system in the world coordinate system.
8. The audio rendering method of claim 7, wherein the origin offset field comprises an offset component field corresponding to each coordinate axis, the offset component field indicating a component of the offset on each of the coordinate axes.
9. The audio rendering method according to claim 6, wherein after the position information of each of the receivers is acquired, the method further comprises:
determining whether the receivers are at the same position according to the position information;
and if a first receiver and a second receiver are at the same position, assigning a value to a position field corresponding to the first receiver according to the position information, and assigning a value to a same-position receiver identifier field corresponding to the second receiver, wherein the position field is used for indicating the position coordinates of the first receiver in a coordinate system, and the same-position receiver identifier field is used for indicating that the first receiver is at the same position as the second receiver.
10. An audio rendering method, comprising:
extracting, from the metadata, a receiver field corresponding to a receiver receiving the audio data, the receiver field including a feature field related to location information of the receiver;
and if the number of the receivers is multiple, differentially rendering the audio data for the receivers according to the fields of the receivers respectively.
11. An audio rendering method, comprising:
acquiring device information of a receiver, wherein the device information is used for indicating an audio playback device used by the receiver when receiving audio data;
assigning a value to a device information field in the metadata according to the device information, wherein the device information field is used for indicating a characteristic of the audio playback device that affects the audio data rendering effect;
assigning a value to a corresponding device identification field in the metadata according to the device information field, wherein the corresponding device identification field is a sub-element of a rendering information field, and the rendering information field is an element in the metadata for indicating the rendering effect of the audio data;
and sending the metadata to an audio renderer, wherein the audio renderer is used for rendering the audio data for the audio playback device according to the device information field.
12. The audio rendering method according to claim 11, wherein the device information field includes a device identifier field indicating an identifier for distinguishing different audio playback devices or a device group identifier field indicating a device group to which the audio playback devices belong, and audio playback devices belonging to the same device group have the same audio data rendering effect.
13. The audio rendering method of claim 11, wherein the device information field comprises a device name field indicating a readable name of the audio playback device or a device type field indicating a device type of the audio playback device.
14. The audio rendering method of claim 13, wherein the device information field comprises a device location field indicating a distribution location of the audio playback device with respect to the receiving party.
15. The audio rendering method of claim 14, wherein when the device type of the audio playback device is a headphone, the distribution position includes the left ear or the right ear; when the device type of the audio playback device is a loudspeaker, the distribution position includes center, front left, front right, rear left, or rear right.
16. The audio rendering method of claim 11, wherein the device information field comprises a corresponding recipient identifier field indicating an identifier of the recipient corresponding to the audio playback device.
17. An audio rendering method, comprising:
extracting a corresponding device identification field from a rendering information field of metadata, wherein the rendering information field is an element of the metadata for indicating an audio data rendering effect;
extracting a device information field according to the corresponding device identification field, wherein the device information field is used for indicating the characteristics of the audio playing device influencing the rendering effect of the audio data;
and rendering the audio data for the audio playing device according to the device information field.
18. An audio rendering method, comprising:
acquiring rendering type information of audio data, wherein the rendering type information is used for indicating a rendering type corresponding to each audio component in the audio data;
assigning a value to an audio component information field in the metadata according to the rendering type information, the audio component information field being used to indicate an audio component having a corresponding rendering type to be processed by an audio renderer;
sending the metadata to the audio renderer, the audio renderer being configured to render the audio component according to the rendering type.
19. The audio rendering method of claim 18, wherein the audio component information field comprises a component type field, and wherein the component type field is used to indicate whether the corresponding rendering type of the audio component is immersive audio rendering or non-immersive audio rendering.
20. The audio rendering method according to claim 18, wherein the audio component information field includes a component identifier field or a component group identifier field, the component identifier field indicating an identifier for distinguishing different audio components; the component group identifier field is used for indicating the component group to which the audio components belong, and the audio components belonging to the same component group are jointly used as input parameters of the audio renderer.
21. The audio rendering method of claim 18, wherein the audio component information field comprises a corresponding audio format identifier field, and wherein the corresponding audio format identifier field is used to indicate an identifier of the audio format in the audio definition model corresponding to the audio component.
22. An audio rendering method, comprising:
extracting an audio component information field from the metadata, the audio component information field indicating an audio component having a corresponding rendering type processed by the audio renderer;
determining rendering types corresponding to all audio components in the audio data according to the audio component information fields;
rendering the audio component according to the rendering type.
23. An audio rendering method, comprising:
acquiring attribute information of an occluding object located in a rendered scene of audio data;
assigning a value to an occluding object information field in the metadata according to the attribute information of the occluding object, wherein the occluding object information field is used for indicating a feature of the occluding object that produces an audio occlusion effect;
and sending the metadata to an audio renderer, wherein the audio renderer is used for rendering the audio data with the audio occlusion effect according to the occluding object information field.
24. The audio rendering method of claim 23, wherein the occluding object information field comprises an occluding position field indicating position coordinates of the occluding object in a coordinate system.
25. The audio rendering method of claim 23 wherein the occluding object information field comprises an occluding size field indicating a size coordinate of the occluding object in a coordinate system.
26. The audio rendering method of claim 23, wherein the occluding object information field comprises a vertex position field indicating a maximum position coordinate and a minimum position coordinate of a vertex on the occluding object.
27. The audio rendering method of claim 23, wherein the occluding object information field comprises an object identifier field indicating an identifier for distinguishing between different occluding objects.
28. The audio rendering method of claim 23, wherein after obtaining attribute information of an occluding object located in a rendered scene of audio data, the method further comprises:
and assigning a value to an occlusion parameter field in the metadata according to the attribute information of the occluding object, wherein the occlusion parameter field is used for indicating a parameter affecting the audio occlusion effect.
29. The audio rendering method of claim 28 wherein the occlusion parameter field comprises an occlusion type field, the occlusion type field indicating a type of algorithm used in rendering the audio occlusion effect.
30. The audio rendering method of claim 29 wherein the type of algorithm comprises achieving the audio occlusion effect by adjusting volume or by filtering high frequency signals.
31. The audio rendering method of claim 28 wherein the occlusion parameter field comprises an occlusion strength field, the occlusion strength field indicating a strength of rendering the audio occlusion effect.
32. The audio rendering method of claim 28 wherein the occlusion parameter field comprises a maximum occlusion strength field indicating a maximum strength for producing an audio occlusion effect on the audio data.
33. The audio rendering method of claim 28 wherein the occlusion parameter field comprises a corresponding object identification field for indicating an identifier of one or more occluding objects referenced in rendering the audio occlusion effect.
34. An audio rendering method, comprising:
extracting an occluding object information field from the metadata, the occluding object information field indicating a feature of an occluding object that produces an audio occlusion effect;
and rendering the audio data with the audio occlusion effect according to the occlusion object information field.
35. An audio rendering method, comprising:
acquiring scene configuration information of a receiver, wherein the scene configuration information is configuration information influencing the rendering effect of audio data in a rendered scene;
assigning a value to a characteristic field of the metadata according to the scene configuration information, wherein the characteristic field comprises at least one of a receiver field, a device information field, an audio component information field, or an occluding object information field; the receiver field includes a field related to position information of the receiver, the device information field is used for indicating a characteristic of the audio playing device that affects the audio data rendering effect, the audio component information field is used for indicating an audio component with a corresponding rendering type to be processed by an audio renderer, and the occluding object information field is used for indicating a characteristic of an occluding object that produces an audio occlusion effect;
and sending the metadata to an audio renderer, wherein the audio renderer is used for rendering the audio data according to the characteristic field.
36. An audio rendering method, comprising:
extracting a characteristic field from the metadata, wherein the characteristic field comprises at least one of a receiver field, a device information field, an audio component information field, or an occluding object information field; the receiver field includes a field related to position information of the receiver, the device information field is used for indicating a characteristic of the audio playing device that affects the audio data rendering effect, the audio component information field is used for indicating an audio component with a corresponding rendering type to be processed by an audio renderer, and the occluding object information field is used for indicating a characteristic of an occluding object that produces an audio occlusion effect;
and rendering the audio data according to the characteristic field.
37. An audio rendering apparatus, comprising:
a first obtaining module configured to obtain number information of recipients, the number information indicating the number of recipients receiving the audio data;
a first assignment module configured to configure, when the number of the recipients is multiple, multiple recipient fields corresponding to the multiple recipients, respectively, in metadata according to the number information, the recipient fields including a feature field related to location information of the recipients;
a first sending module configured to send the metadata to an audio renderer for differentially rendering the audio data for the plurality of recipients according to the plurality of recipient fields, respectively.
38. An audio rendering apparatus, comprising:
a first extraction module configured to extract, from the metadata, a receiver field corresponding to a receiver receiving the audio data, the receiver field including a feature field related to location information of the receiver;
a first rendering module configured to differentially render the audio data for the plurality of receivers according to the plurality of receiver fields, respectively, when the number of receivers is multiple.
39. An audio rendering apparatus, comprising:
a second acquisition module configured to acquire device information of a receiver, wherein the device information is used for indicating the audio playing device used by the receiver when receiving audio data;
a second assignment module configured to assign a device information field in the metadata according to the device information, wherein the device information field is used for indicating a characteristic of the audio playing device which affects the audio data rendering effect; assigning a value to a corresponding device identification field in the metadata according to the device information field, wherein the corresponding device identification field is a sub-element of a rendering information field, and the rendering information field is an element in the metadata for indicating the rendering effect of the audio data;
a second sending module configured to send the metadata to an audio renderer, where the audio renderer is configured to render the audio data for the audio playback device according to the device information field.
40. An audio rendering apparatus, comprising:
a second extraction module configured to extract a corresponding device identification field from a rendering information field of metadata, the rendering information field being an element of the metadata for indicating an audio data rendering effect; extracting a device information field according to the corresponding device identification field, wherein the device information field is used for indicating the characteristics of the audio playing device influencing the rendering effect of the audio data;
a second rendering module configured to render the audio data for the audio playback device according to the device information field.
41. An audio rendering apparatus, comprising:
a third obtaining module configured to obtain rendering type information of audio data, where the rendering type information is used to indicate a rendering type corresponding to each audio component in the audio data;
a third assignment module configured to assign an audio component information field in the metadata according to the rendering type information, the audio component information field indicating an audio component having a corresponding rendering type to be processed by the audio renderer;
a third sending module configured to send the metadata to the audio renderer for rendering the audio component according to the rendering type.
42. An audio rendering apparatus, comprising:
a third extraction module configured to extract an audio component information field from the metadata, the audio component information field indicating an audio component having a corresponding rendering type processed by the audio renderer;
the third rendering module is configured to determine rendering types corresponding to the audio components in the audio data according to the audio component information fields; rendering the audio component according to the rendering type.
43. An audio rendering apparatus, comprising:
a fourth obtaining module configured to obtain attribute information of an occluding object located in a rendered scene of the audio data;
a fourth assignment module configured to assign a value to an occluding object information field in metadata according to the attribute information of the occluding object, the occluding object information field indicating a feature of the occluding object that produces an audio occlusion effect;
a fourth sending module configured to send the metadata to the audio renderer for rendering the audio data with the audio occlusion effect according to the occlusion object information field.
44. An audio rendering apparatus, comprising:
a fourth extraction module configured to extract, from the metadata, an occluding object information field for indicating a feature of an occluding object that produces an audio occlusion effect;
a fourth rendering module configured to render the audio data having the audio occlusion effect according to the occlusion object information field.
45. An audio rendering apparatus, comprising:
a fifth obtaining module, configured to obtain scene configuration information of a receiver, where the scene configuration information is configuration information that affects an audio data rendering effect in rendering a scene;
a fifth assignment module configured to assign a value to a feature field of the metadata according to the scene configuration information, where the feature field includes at least one of a receiver field, a device information field, an audio component information field, or an occluding object information field; the receiver field includes a field related to position information of the receiver, the device information field is used for indicating a characteristic of the audio playing device that affects the audio data rendering effect, the audio component information field is used for indicating an audio component with a corresponding rendering type to be processed by an audio renderer, and the occluding object information field is used for indicating a characteristic of an occluding object that produces an audio occlusion effect;
a fifth sending module configured to send the metadata to an audio renderer for rendering the audio data according to the feature field.
46. An audio rendering apparatus, comprising:
a fifth extraction module configured to extract a feature field from the metadata, the feature field including at least one of a receiver field, a device information field, an audio component information field, or an occluding object information field; the receiver field includes a field related to position information of the receiver, the device information field is used for indicating a characteristic of the audio playing device that affects the audio data rendering effect, the audio component information field is used for indicating an audio component with a corresponding rendering type to be processed by an audio renderer, and the occluding object information field is used for indicating a characteristic of an occluding object that produces an audio occlusion effect;
a fifth rendering module configured to render audio data according to the feature field.
47. A computer-readable medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the audio rendering method of any one of claims 1 to 36.
48. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to cause the electronic device to perform the audio rendering method of any of claims 1 to 36 via execution of the executable instructions.
49. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the audio rendering method of any of claims 1 to 36.
Priority Applications (1)

Application Number   Priority Date  Filing Date  Title
CN202211297361.8A    2022-10-21     2022-10-21   Audio rendering method, device, medium and electronic equipment
Publications (1)

Publication Number  Publication Date
CN115696137A        2023-02-03

Family ID: 85065926


Legal Events

Code  Description
PB01  Publication
SE01  Entry into force of request for substantive examination