CN115426613A - Method and device for rendering scene-based audio by using metadata - Google Patents

Method and device for rendering scene-based audio by using metadata

Info

Publication number
CN115426613A
Authority
CN
China
Prior art keywords
audio
scene
type
metadata
rendering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210912275.7A
Other languages
Chinese (zh)
Inventor
吴健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Saiyinxin Micro Beijing Electronic Technology Co ltd
Original Assignee
Saiyinxin Micro Beijing Electronic Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Saiyinxin Micro Beijing Electronic Technology Co ltd filed Critical Saiyinxin Micro Beijing Electronic Technology Co ltd
Priority to CN202210912275.7A priority Critical patent/CN115426613A/en
Publication of CN115426613A publication Critical patent/CN115426613A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30 Control circuits for electronic adaptation of the sound field
    • H04S7/302 Electronic adaptation of stereophonic sound system to listener position or orientation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S3/00 Systems employing more than two channels, e.g. quadraphonic
    • H04S3/006 Systems employing more than two channels, e.g. quadraphonic, in which a plurality of audio signals are transformed in a combination of audio signals and modulated signals, e.g. CD-4 systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Stereophonic System (AREA)

Abstract

The application provides a method and a device for rendering scene-based audio by using metadata. The method comprises: based on a pre-constructed audio model, saving the parameters of the audio model in respective data structures through type tags; extracting the necessary information in the audio block format and the general data collected in the additional data, introducing them through the type tag, and generating a metadata object of the scene type of the audio model; and introducing, through the type tag, the parameters of the audio model and the metadata object of the scene type stored in the respective data structures, to generate a rendering item of the scene type. In the method, each channel represents a loudspeaker-independent sound field rather than a separate loudspeaker, and the more channels are used, the higher the spatial resolution of the sound.

Description

Method and device for rendering scene-based audio by using metadata
Technical Field
The present application relates to the field of audio processing technologies, and in particular, to a method and an apparatus for rendering scene-based audio using metadata.
Background
With the development of technology, audio has become more and more complex. Early single-channel audio gave way to stereo, and the focus of the work shifted to handling the left and right channels correctly. With the arrival of surround sound, however, processing became genuinely complex. The surround 5.1 speaker system imposes an ordering constraint on a plurality of channels, and the surround 6.1, surround 7.1 and similar systems diversify audio processing further: the correct signal must be delivered to the appropriate speaker, and the channels are mutually dependent. Thus, as sound becomes more immersive and interactive, the complexity of audio processing also increases greatly.
Audio channels (or sound channels) are audio signals that are independent of each other and are captured or played back at different spatial positions when sound is recorded or played. The number of channels is the number of sound sources during recording, or the number of corresponding speakers during playback. For example, a surround 5.1 speaker system comprises audio signals for 6 different spatial positions, each separate audio signal driving the speaker at the corresponding spatial position; a surround 7.1 speaker system comprises audio signals for 8 different spatial positions, each individual audio signal driving the speaker at its corresponding position.
Therefore, the effect achieved by current loudspeaker systems depends on the number and the spatial positions of the loudspeakers. For example, a two-speaker stereo system cannot achieve the effect of a surround 5.1 speaker system.
Disclosure of Invention
The application aims to provide a method and a device for rendering scene-based audio by using metadata, so that the elements of an audio model are turned into corresponding structured data and the audio data can be rendered conveniently.
A first aspect of the present application provides a method for rendering scene-based audio using metadata, including:
based on a pre-constructed audio model, saving parameters of the audio model in respective data structures through type tags;
extracting necessary information in the audio block format and general data collected in the additional data, introducing the necessary information in the audio block format and the general data collected in the additional data through the type tag, and generating a metadata object of a scene type of the audio model;
introducing parameters of the audio model and metadata objects of the scene type stored in respective data structures through the type tags, and generating rendering items of the scene type; the rendering item of the scene type is used to indicate a set of audio channel formats.
A second aspect of the present application provides an apparatus for rendering scene-based audio using metadata, comprising:
the storage module is used for storing the parameters of the audio model in respective data structures through type tags based on the pre-constructed audio model;
the generating module is used for extracting necessary information in the audio block format and general data collected in the additional data, introducing the necessary information in the audio block format and the general data collected in the additional data through the type tag, and generating a metadata object of a scene type of the audio model;
the introduction generation module is used for introducing the parameters of the audio model and the metadata objects of the scene type stored in the respective data structures through the type tags to generate rendering items of the scene type; the rendering item of the scene type is used to indicate a set of audio channel formats.
A third aspect of the present application provides an electronic device comprising: a memory and one or more processors;
the memory is used for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for rendering scene-based audio using metadata as provided in any of the embodiments.
A fourth aspect of the present application provides a storage medium containing computer-executable instructions that when executed by a computer processor implement a method of rendering scene-based audio using metadata as provided in any embodiment.
From the above, the method for rendering scene-based audio by using metadata converts a set of audio signals carrying scene-type metadata into audio signals and metadata of different configurations, and can render the audio signals to all the speaker configurations specified in an advanced sound system. In this method, each channel represents a loudspeaker-independent sound field rather than a separate loudspeaker; the more channels are used, the higher the spatial resolution of the sound.
Drawings
FIG. 1 is a schematic diagram of an audio model provided in an embodiment of the present application;
FIG. 2 is a flowchart of a method for rendering scene-based audio using metadata in an embodiment of the present application;
FIG. 3 is another flowchart of a method for rendering scene-based audio using metadata in an embodiment of the present application;
FIG. 4 is another flowchart of a method for rendering scene-based audio using metadata in an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an apparatus for rendering scene-based audio by using metadata in an embodiment of the present application;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the application and are not limiting of the application. It should be further noted that, for the convenience of description, only some of the structures related to the present application are shown in the drawings, not all of the structures.
Fig. 1 is a schematic diagram of an audio model provided in an embodiment of the present application. The audio model comprises a content production part and a format production part;
wherein the content production section includes: an audio program element, an audio content element, an audio object element, and an audio track unique identification element;
the format production section includes: an audio pack format element, an audio channel format element, an audio stream format element, and an audio track format element;
the audio program element references at least one audio content element; the audio content element references at least one audio object element; the audio object element references the corresponding audio pack format element and the corresponding audio track unique identification element; the audio track unique identification element references the corresponding audio track format element and the corresponding audio pack format element;
the audio pack format element references at least one audio channel format element; the audio stream format element references the corresponding audio channel format element and the corresponding audio pack format element; the audio track format element and the corresponding audio stream format element reference each other;
the audio channel format elements include at least one audio block format element.
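These reference relationships can be pictured as plain data structures. The following is a minimal sketch in the same struct style as the implementation examples later in this description; all type and field names here are illustrative assumptions, not definitions taken from the patent:
#include <vector>
// Illustrative sketch of the reference structure of the audio model in Fig. 1.
struct AudioBlockFormat {};                        // time-bounded channel parameters
struct AudioChannelFormat {
    std::vector<AudioBlockFormat> blocks;          // includes at least one audio block format
};
struct AudioPackFormat {
    std::vector<AudioChannelFormat*> channels;     // references at least one channel format
};
struct AudioTrackFormat;                           // forward declaration (mutual reference)
struct AudioStreamFormat {
    AudioChannelFormat* channel = nullptr;         // corresponding audio channel format
    AudioPackFormat*    pack    = nullptr;         // corresponding audio pack format
    AudioTrackFormat*   track   = nullptr;         // mutual reference with the track format
};
struct AudioTrackFormat {
    AudioStreamFormat*  stream  = nullptr;         // mutual reference with the stream format
};
struct AudioTrackUID {
    AudioTrackFormat*   track   = nullptr;         // corresponding audio track format
    AudioPackFormat*    pack    = nullptr;         // corresponding audio pack format
};
struct AudioObject {
    AudioPackFormat*            pack = nullptr;    // corresponding audio pack format
    std::vector<AudioTrackUID*> track_uids;        // corresponding track unique identifications
};
struct AudioContent {
    std::vector<AudioObject*>   objects;           // references at least one audio object
};
struct AudioProgramme {
    std::vector<AudioContent*>  contents;          // references at least one audio content
};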
As shown in fig. 2, an embodiment of the present application provides a method for rendering scene-based audio by using metadata, including:
S201, based on a pre-constructed audio model, storing parameters of the audio model in respective data structures through type tags;
As shown in fig. 3, the saving of the parameters of the audio model in the respective data structures through type tags based on the pre-constructed audio model includes:
S301, merging the general data into the additional data; the general data includes a screen reference and a channel frequency, the screen reference (reference_screen) being the audio programme reference screen (audioProgrammeReferenceScreen) of the selected audio program (not specified when no program is selected).
The channel frequency (channel_frequency) is the channel frequency element of the selected audio channel format (audioChannelFormat).
Implementation code example: the code for this step appears only as an image (Figure BDA0003773745470000051) in the original publication and is not reproduced here.
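Since the code figure is lost, the following is a hypothetical sketch of what the merged additional-data structure of S301 could look like, consistent with the surrounding text; the Screen and Frequency types and all field names are assumptions, not the patent's code:
#include <optional>
using std::optional;

struct Screen {};        // hypothetical stand-in for an audioProgrammeReferenceScreen
struct Frequency {};     // hypothetical stand-in for a channel frequency element

// Hypothetical sketch only: general data merged into the additional data.
struct ExtraData {
    optional<Screen>    reference_screen;    // screen reference of the selected audio program
    optional<Frequency> channel_frequency;   // frequency element of the selected channel format
};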
S302, storing the importance data in an importance data structure; the importance data concerns the audio pack format (audioPackFormat) and the audio objects (audioObjects).
The importance data allows a processor to discard objects below a certain level of importance, where 10 is the highest level and 0 the lowest. Changing this parameter over consecutive blocks should be avoided. The parameter is very useful when the amount of metadata needs to be reduced, as it allows the possible compromises to be prioritized.
The audio object data includes an audio object start time (object_start), which is the start time of the last audio object on the path (not specified in the channel-only allocation mode), and an object duration (object_duration), which is the duration of the last audio object on the path (likewise not specified in the channel-only allocation mode). When importance data is used on an audio object, it can serve to delete less important sounds when the number of objects or tracks needs to be reduced. For example, some background sound effects may be discarded to ensure that the primary dialog objects remain.
When importance data is used on the audio pack format, it can serve to reduce the spatial audio quality. Nested audio pack formats may be used to exploit this functionality. For example, for an audio object with a main direct sound (in a high-importance parent audio pack format) and an additional reverb sound (in a low-importance sub audio pack format), the reverb may be discarded, preserving the main sound at a reduced quality.
Implementation code example:
struct ImportanceData {
    optional<int> audio_object;       // importance (0-10) of the audio object
    optional<int> audio_pack_format;  // importance (0-10) of the audio pack format
};
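To illustrate how a processor might act on these fields, here is a minimal, hypothetical filter sketch assuming the ImportanceData struct above; the function name and threshold logic are illustrative, not taken from the patent:
#include <optional>
using std::optional;

// Apply the 0-10 importance scale: an item is discarded if any stated
// importance falls below the threshold; an absent value means "no
// importance assigned" and never causes the item to be discarded.
bool passes_importance(const ImportanceData& imp, int threshold) {
    if (imp.audio_object && *imp.audio_object < threshold) return false;
    if (imp.audio_pack_format && *imp.audio_pack_format < threshold) return false;
    return true;
}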
As shown in fig. 4, the step of referencing the audio samples, encapsulating them in a track specification (TrackSpec) structure, and defining them as the audio sample source comprises:
S401, the direct track specification (DirectTrackSpec) specifies that the audio samples are to be read directly from a specified input track; or, alternatively,
S402, the silent track specification (SilentTrackSpec) specifies all audio samples as a preset threshold; specifically, all audio samples are set to zero.
Implementation code example:
struct TrackSpec {};

struct DirectTrackSpec : TrackSpec {
    int track_index;  // index of the input track to read samples from
};

struct SilentTrackSpec : TrackSpec {};
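As a sketch of how a renderer could resolve a track specification to a block of samples, consider the following; a virtual destructor is added so the derived specifications can be distinguished at run time, and all names and the resolution logic are assumptions rather than the patent's code:
#include <cstddef>
#include <vector>

struct TrackSpecBase { virtual ~TrackSpecBase() = default; };
struct DirectSpec : TrackSpecBase { int track_index = 0; };
struct SilentSpec : TrackSpecBase {};

// Hypothetical resolution: read the indexed input track, or produce zeros.
std::vector<float> resolve(const TrackSpecBase& spec,
                           const std::vector<std::vector<float>>& input_tracks,
                           std::size_t n_samples) {
    if (auto* direct = dynamic_cast<const DirectSpec*>(&spec))
        return input_tracks.at(direct->track_index);   // S401: read from the input track
    return std::vector<float>(n_samples, 0.0f);        // S402: all samples set to zero
}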
In the embodiments of the present application, the set of samples forms an audio track format element. This element describes the format of the data, allowing the renderer to decode the signal correctly. It is derived from an audio stream format element, which identifies the combination of audio tracks needed to decode the audio track data successfully.
S202, extracting necessary information in the audio block format and general data collected in the additional data, introducing the necessary information in the audio block format and the general data collected in the additional data through the type tag, and generating a metadata object of a scene type of the audio model;
the necessary information in the audio block format includes the order, degree, normalization scheme, screen indication, start time of the block, duration of the block, and general data collected in the additional data. In addition to the common audio block format (audioBlockFormat) attribute, the following sub-elements are defined for the type definition (typeDefinition) for the SCENE (SCENE). The order of (order) SCENE component (degree) indicates the normalization scheme of SCENE component (N3D, SN3D, fuMa). (screen indication, screenRef) indicates whether the component is screen-related (flag equal to 1) or not screen-related (flag equal to 0).
The start time (rtime) of the audio block format (audioBlockFormat) is relative to the start time of the audio object (audioObject). Thus, the first sample is found by adding the rtime to the start of the audio object, and the last sample by further adding the duration; this makes it possible to locate, in the file, the audio samples corresponding to a particular audio block format (audioBlockFormat).
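As a concrete reading of this rule: a block beginning at object start + rtime and lasting duration seconds occupies a contiguous range of samples in the file. A small hypothetical helper (all names assumed; times in seconds, sample rate fs in Hz):
#include <cstdint>

struct SampleRange { std::int64_t first; std::int64_t last; };

// Sample index range covered by an audioBlockFormat, given the owning
// audioObject's start time.
SampleRange block_sample_range(double object_start, double rtime,
                               double block_duration, double fs) {
    auto first = static_cast<std::int64_t>((object_start + rtime) * fs);
    auto last  = static_cast<std::int64_t>((object_start + rtime + block_duration) * fs) - 1;
    return {first, last};
}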
As sound becomes more immersive and interactive, the complexity of audio processing increases greatly. To handle all these additional channels and this complexity, each audio channel requires an explicit tag. These channel tags are metadata; audio becomes scene-based audio when scene-type metadata is attached to it. Binding the scene-type metadata to the audio it describes enables the audio to be rendered correctly. In normal operation, the encoded audio and the scene-type metadata may be transmitted over an audio channel either as a single data stream or as separate streams.
Implementation code example:
struct SCENETypeMetadata : TypeMetadata {
    vector<int> orders;              // order of each SCENE component
    vector<int> degrees;             // degree of each SCENE component
    optional<string> normalization;  // normalization scheme: N3D, SN3D or FuMa
    optional<float> nfcRefDist;      // near-field compensation reference distance
    bool screenRef;                  // whether the components are screen-related
    ExtraData extra_data;            // general data collected in the additional data
    optional<duration> rtime;        // start time of the block
    optional<duration> duration;     // duration of the block
};
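For example, a first-order scene in ACN channel order with SN3D normalization could be described as follows; this usage sketch assumes the struct above and standard ambisonics conventions, and is not a listing from the patent:
// Hypothetical usage sketch: first-order SCENE metadata, ACN order, SN3D.
SCENETypeMetadata meta;
meta.orders        = {0, 1, 1, 1};    // ACN channels: W, Y, Z, X
meta.degrees       = {0, -1, 0, 1};   // degree m of each component
meta.normalization = "SN3D";          // one of N3D, SN3D, FuMa
meta.screenRef     = false;           // components are not screen-related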
s203, introducing parameters of the audio model and metadata objects of the scene type stored in respective data structures through the type tags, and generating rendering items of the scene type; the rendering item of the scene type is used to indicate a set of audio channel formats. Each channel represents a sound field independent of a loudspeaker, rather than representing a separate loudspeaker. The more channels used, the higher the spatial resolution of the sound.
Scene-based audio comprises Ambisonics and Higher-Order Ambisonics (HOA). First-order Ambisonics (strictly speaking, order 0 plus order 1) consists of 4 component channels: the first represents the omnidirectional signal, and the next 3 components represent the X, Y and Z dimensions of the sound.
Since first-order Ambisonics does not provide good spatial resolution (the localization of sound is not very precise), higher orders (HOA) can be used to improve this. The second order adds 5 components to the first order, and the third order adds a further 7 (16 component channels in total).
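In general, an order-N scene uses (N+1)² component channels, which is where these counts come from; a one-line helper makes the rule explicit (the function name is an assumption):
// Component channels of an order-N ambisonic scene: (N + 1) squared.
// Order 1 -> 4 channels, order 2 -> 9 (4 + 5), order 3 -> 16 (9 + 7).
int hoa_channel_count(int order) { return (order + 1) * (order + 1); }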
The introducing, through the type tag, of the parameters of the audio model and the metadata objects of the scene type stored in the respective data structures, and the generating of the rendering item of the scene type, specifically include:
referencing the metadata object, the importance data stored in the importance data structure, and the audio samples and vectors input via the encapsulating track specification structure. In this embodiment, the audio sample vector refers to the "gain" vector associated with the audio samples; the gains of the audio samples are directional and are therefore described as vectors.
Each audio channel format (audioChannelFormat) of a scene type can be processed independently, with the rendering item (RenderingItem) containing a track specification (TrackSpec), which in embodiments of the present application may be either a direct track specification or a silent track specification.
Implementation code example:
struct SCENERenderingItem : RenderingItem {
    vector<TrackSpec> track_specs;       // one track specification per channel
    MetadataSource metadata_source;      // source of the scene-type metadata objects
    vector<ImportanceData> importances;  // importance data used for discarding decisions
};
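A hypothetical sketch of assembling such a rendering item for a first-order scene follows; it re-declares the structures in a self-contained, pointer-based variation (shared_ptr is used so that derived track specifications are not sliced when stored), and the assembly logic is an assumption, not the patent's code:
#include <memory>
#include <vector>
using std::vector;

struct TrackSpec { virtual ~TrackSpec() = default; };
struct DirectTrackSpec : TrackSpec { int track_index = 0; };
struct MetadataSource {};
struct ImportanceData {};

struct SCENERenderingItem {
    vector<std::shared_ptr<TrackSpec>> track_specs;  // shared_ptr avoids slicing derived specs
    MetadataSource metadata_source;
    vector<ImportanceData> importances;
};

// Build a rendering item with one direct track per first-order component.
SCENERenderingItem make_first_order_item() {
    SCENERenderingItem item;
    for (int i = 0; i < 4; ++i) {
        auto spec = std::make_shared<DirectTrackSpec>();
        spec->track_index = i;                       // one input track per component
        item.track_specs.push_back(spec);
    }
    return item;
}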
The present application provides a method for rendering scene-based audio using metadata that converts a set of audio signals carrying scene-type metadata into audio signals and metadata of different configurations, and is capable of rendering the audio signals to all the speaker configurations specified in an advanced sound system.
The generating of the corresponding rendering item by introducing, through the type tag, the metadata of the scene type of the audio signal stored in the respective data structures further includes: associating the metadata objects of the scene type with the audio sample sources and the general data collected in the additional data.
As shown in fig. 5, an embodiment of the present application provides an apparatus for rendering scene-based audio by using metadata, including:
a saving module 501, configured to save parameters of an audio model in respective data structures based on a pre-constructed audio model and through a type tag;
a generating module 502, configured to extract necessary information in the audio block format and general data collected in the additional data, introduce the necessary information in the extracted audio block format and general data collected in the additional data through the type tag, and generate a metadata object of a scene type of the audio model;
an import generation module 503, configured to import, through the type tag, the parameters of the audio model and the metadata object of the scene type stored in the respective data structure, and generate a rendering item of the scene type; the rendering item of the scene type is used to indicate a set of audio channel formats.
The parameters of the audio model include: all the parameters needed for the rendering item, and the metadata objects of the scene type used for saving a series of metadata objects.
The saving module 501 is used for merging the general data into the additional data, the general data comprising a screen reference and a channel frequency; or, alternatively,
for storing the importance data in an importance data structure, the importance data including audio objects and audio pack formats.
The saving module 501 is further used for reading audio samples directly from a designated input track; or, alternatively,
for specifying all audio samples as a preset threshold.
The import generation module 503 is used for referencing the metadata objects, the importance data stored in the importance data structure, and the audio samples input via the encapsulating track specification structure.
The import generation module 503 is further used for introducing the metadata of the scene type of the audio signal stored in the respective data structures; generating the corresponding rendering item further includes: associating the metadata objects of the scene type with the audio sample sources and the general data collected in the additional data.
The device for rendering the audio based on the scene by using the metadata provided by the embodiment of the application can execute the method for rendering the audio based on the scene by using the metadata provided by any embodiment of the application, and has corresponding functional modules and beneficial effects of the execution method.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 6, the electronic apparatus includes: a processor 610, a memory 620, an input device 630, and an output device 640. The number of the processors 610 in the electronic device may be one or more, and one processor 610 is taken as an example in fig. 6. The number of the memories 620 in the electronic device may be one or more, and one memory 620 is taken as an example in fig. 6. The processor 610, the memory 620, the input device 630 and the output device 640 of the electronic apparatus may be connected by a bus or other means, and the connection by the bus is exemplified in fig. 6. The electronic device can be a computer, a server and the like. In the embodiment of the present application, the electronic device is used as a server, and the server may be an independent server or a cluster server.
The memory 620, as a computer-readable storage medium, may be used to store software programs, computer-executable programs, and modules, such as program instructions/modules for rendering scene-based audio using metadata as described in any of the embodiments of the present application. The memory 620 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the device, and the like. Further, the memory 620 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the memory 620 can further include memory located remotely from the processor 610, which can be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 630 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the electronic device; it may also include a camera for acquiring images and a sound pickup device for acquiring audio data. The output device 640 may include an audio device such as a speaker. The specific composition of the input device 630 and the output device 640 can be set according to the actual situation.
The processor 610 executes the various functional applications of the device and performs data processing by running the software programs, instructions and modules stored in the memory 620, i.e., it implements the above method for rendering scene-based audio using metadata.
Embodiments of the present application also provide a storage medium containing computer-executable instructions which, when executed by a computer processor, implement the method for rendering scene-based audio using metadata as provided in any of the embodiments.
Of course, the storage medium provided in the embodiments of the present application contains computer-executable instructions, and the computer-executable instructions are not limited to the operations of the electronic method described above, and may also perform related operations in the electronic method provided in any embodiments of the present application, and have corresponding functions and advantages.
From the above description of the embodiments, it is obvious for those skilled in the art that the present application can be implemented by software and necessary general hardware, and certainly can be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a robot, a personal computer, a server, or a network device) to execute the electronic method according to any embodiment of the present application.
It should be noted that, in the above apparatus, each unit and each module included in the apparatus are merely divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only used for distinguishing one functional unit from another, and are not used for limiting the protection scope of the application.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following technologies, which are well known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the description herein, references to the description of the term "in one embodiment," "in another embodiment," "exemplary" or "in a particular embodiment" or the like are intended to mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although the present application has been described in detail above with reference to general description, specific embodiments and experiments, it will be apparent to those skilled in the art that some modifications or improvements may be made on the basis of the present application. Accordingly, such modifications and improvements are intended to be within the scope of this invention as claimed.

Claims (10)

1. A method for rendering scene-based audio using metadata, comprising:
based on a pre-constructed audio model, saving parameters of the audio model in respective data structures through type tags;
extracting necessary information in the audio block format and general data collected in the additional data, introducing the necessary information in the audio block format and the general data collected in the additional data through the type tag, and generating a metadata object of a scene type of the audio model;
introducing parameters of the audio model and metadata objects of the scene type stored in respective data structures through the type tags, and generating rendering items of the scene type; the rendering item of the scene type is used to indicate a set of audio channel formats.
2. The method of claim 1, wherein the saving parameters of the audio model in respective data structures based on a pre-constructed audio model and by type tags comprises:
merging the general data into additional data; the general data comprises a screen reference and a channel frequency; or, alternatively,
storing the importance data in an importance data structure; the importance data includes audio objects and audio pack formats.
3. The method of claim 1, wherein the saving parameters of the audio model in respective data structures based on a pre-constructed audio model and by type tags comprises: the audio samples are referenced and encapsulated in a track specification structure and defined as an audio sample source.
4. The method of claim 3, wherein the step of referencing and encapsulating the audio samples in a track specification structure and defining them as an audio sample source further comprises:
the direct track specification specifies that the audio samples are to be read directly from the specified input track; or, alternatively,
the silent track specification specifies that all audio samples are a preset threshold.
5. The method of any of claims 1, 3 or 4, wherein the necessary information in the audio block format comprises:
order, degree, normalization scheme, screen indication, starting time of a block, duration of a block.
6. The method according to any of claims 1, 2 or 3, wherein said introducing parameters of the audio model and metadata objects of the scene type stored in the respective data structure via the type tag, generating rendering items of the scene type specifically comprises:
referencing the metadata objects of the scene type, the importance data stored in the importance data structure, and the audio samples and vectors input via the encapsulating track specification structure.
7. The method of any of claims 1, 2 or 3, wherein said introducing metadata of scene types of said audio signals stored in respective data structures by said type tags, generating rendering items of scene types further comprises: associating metadata objects of the scene type with an audio sample source and general data collected in additional data.
8. An apparatus for rendering scene-based audio using metadata, comprising:
the storage module is used for storing the parameters of the audio model in respective data structures through type tags based on the pre-constructed audio model;
the generating module is used for extracting necessary information in the audio block format and general data collected in the additional data, introducing the necessary information in the audio block format and the general data collected in the additional data through the type tag, and generating a metadata object of a scene type of the audio model;
the introduction generation module is used for introducing the parameters of the audio model and the metadata objects of the scene type stored in the respective data structures through the type tags to generate rendering items of the scene type; the rendering item of the scene type is used to indicate a set of audio channel formats.
9. An electronic device, comprising: a memory and one or more processors;
the memory for storing one or more programs;
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
10. A storage medium containing computer-executable instructions for implementing the method of any one of claims 1-7 when executed by a computer processor.
CN202210912275.7A 2022-07-29 2022-07-29 Method and device for rendering scene-based audio by using metadata Pending CN115426613A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210912275.7A CN115426613A (en) 2022-07-29 2022-07-29 Method and device for rendering scene-based audio by using metadata

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210912275.7A CN115426613A (en) 2022-07-29 2022-07-29 Method and device for rendering scene-based audio by using metadata

Publications (1)

Publication Number Publication Date
CN115426613A 2022-12-02

Family

ID=84196535

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210912275.7A Pending CN115426613A (en) 2022-07-29 2022-07-29 Method and device for rendering scene-based audio by using metadata

Country Status (1)

Country Link
CN (1) CN115426613A (en)

Similar Documents

Publication Publication Date Title
JP2006081146A (en) System and method for embedding scene change information in video bit stream
WO2020155964A1 (en) Audio/video switching method and apparatus, and computer device and readable storage medium
CN107277691B (en) Multi-channel audio playing method and system based on cloud and audio gateway device
US20240185872A1 (en) Method and apparatus for decoding a bitstream including encoded higher order ambisonics representations
CN103237259A (en) Audio-channel processing device and audio-channel processing method for video
EP3693961B1 (en) Encoding device and method, decoding device and method, and program
JP4548226B2 (en) Data processing method, apparatus and program thereof
CN115426613A (en) Method and device for rendering scene-based audio by using metadata
CN115209310A (en) Method and device for rendering sound bed-based audio by using metadata
CN115426611A (en) Method and apparatus for rendering object-based audio using metadata
CN113905321A (en) Object-based audio channel metadata and generation method, device and storage medium
CN114512152A (en) Method, device and equipment for generating broadcast audio format file and storage medium
CN114051194A (en) Audio track metadata and generation method, electronic equipment and storage medium
CN114121036A (en) Audio track unique identification metadata and generation method, electronic device and storage medium
CN114023339A (en) Audio-bed-based audio packet format metadata and generation method, device and medium
CN114143695A (en) Audio stream metadata and generation method, electronic equipment and storage medium
CN115190412A (en) Method, device and equipment for generating internal data structure of renderer and storage medium
CN114203189A (en) Method, apparatus and medium for generating metadata based on binaural audio packet format
CN114530157A (en) Audio metadata channel allocation block generation method, apparatus, device and medium
CN114023340A (en) Object-based audio packet format metadata and generation method, apparatus, and medium
CN113923584A (en) Matrix-based audio channel metadata and generation method, equipment and storage medium
CN114203188A (en) Scene-based audio packet format metadata and generation method, device and storage medium
CN113938811A (en) Audio channel metadata based on sound bed, generation method, equipment and storage medium
CN115529548A (en) Speaker channel generation method and device, electronic device and medium
CN113889128A (en) Audio production model and generation method, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination