CN115209310A - Method and device for rendering sound bed-based audio by using metadata - Google Patents

Method and device for rendering sound bed-based audio by using metadata

Info

Publication number
CN115209310A
CN115209310A (Application No. CN202210634563.0A)
Authority
CN
China
Prior art keywords
audio
type
metadata
parameters
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210634563.0A
Other languages
Chinese (zh)
Inventor
吴健 (Wu Jian)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Saiyinxin Micro Beijing Electronic Technology Co ltd
Original Assignee
Saiyinxin Micro Beijing Electronic Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Saiyinxin Micro Beijing Electronic Technology Co ltd
Priority to CN202210634563.0A
Publication of CN115209310A
Legal status: Pending

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04RLOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R3/00Circuits for transducers, loudspeakers or microphones
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04RLOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R5/00Stereophonic arrangements
    • H04R5/04Circuit arrangements, e.g. for selective connection of amplifier inputs/outputs to loudspeakers, for loudspeaker detection, or for adaptation of settings to personal preferences or hearing impairments

Abstract

The present disclosure relates to a method and apparatus for rendering sound bed-based audio using metadata. The method includes: based on a pre-constructed audio model, saving the parameters of the audio model in respective data structures through type tags; introducing, through the type tags, the parameters of the audio model stored in the respective data structures to generate a sound bed type metadata object of the audio model; and introducing, through the type tags, the parameters of the audio model and the sound bed type metadata object stored in the respective data structures to generate a sound bed type rendering item. The present disclosure converts a set of audio signals with sound bed type metadata into different configurations of audio signals and metadata, enabling these signals to be rendered to all speaker configurations specified in an advanced sound system, and transmits per-channel audio signals directly to each speaker without any signal modification.

Description

Method and device for rendering sound bed-based audio by using metadata
Technical Field
The present disclosure relates to the field of audio processing technologies, and in particular to a method and an apparatus for rendering sound bed-based audio using metadata.
Background
With the development of technology, audio has become more and more complex. Early single-channel audio gave way to stereo, where the main concern was handling the left and right channels correctly. With the advent of surround sound, however, processing became more involved. The surround 5.1 speaker system imposes an ordering constraint on multiple channels, and the surround 6.1 and 7.1 speaker systems diversify audio processing further: the correct signal must reach the appropriate speaker for the channels to work together. Thus, as sound becomes more immersive and interactive, the complexity of audio processing also increases greatly.
An audio track (or channel) is an independent audio signal captured or played back at a particular spatial location. The number of channels equals the number of sound sources during recording, or the number of corresponding speakers during playback. For example, a surround 5.1 speaker system comprises audio signals at 6 different spatial locations, each separate signal driving a speaker at the corresponding location; a surround 7.1 speaker system comprises audio signals at 8 different spatial positions, each separate signal driving a speaker at the corresponding position.
The effect achievable by current loudspeaker systems therefore depends on the number and spatial positions of the loudspeakers. For example, a two-channel speaker system cannot reproduce the effect of a surround 5.1 speaker system.
Disclosure of Invention
The present disclosure aims to provide a method and an apparatus for rendering sound bed-based audio using metadata, so as to generate corresponding structured data from audio model elements and thereby facilitate rendering of the audio data.
A first aspect of the present disclosure provides a method of rendering sound bed-based audio using metadata, including:
based on a pre-constructed audio model, saving parameters of the audio model in respective data structures through type tags;
introducing the parameters of the audio model stored in the respective data structure through the type tag to generate a sound bed type metadata object of the audio model;
introducing parameters of the audio model and a sound bed type metadata object stored in respective data structures through the type tags, and generating a sound bed type rendering item; the sound bed type rendering item is used to indicate an individual audio channel format or a group of audio channel formats.
A second aspect of the present disclosure provides an apparatus for rendering sound bed-based audio using metadata, including:
the storage module is used for storing the parameters of the audio model in respective data structures through type tags based on the pre-constructed audio model;
the generation module is used for introducing the parameters of the audio model stored in the respective data structure through the type label to generate a sound bed type metadata object of the audio model;
the introduction generation module is used for introducing the parameters of the audio model and the sound bed type metadata object stored in the respective data structures through the type tags to generate a sound bed type rendering item; the sound bed type rendering item is used to indicate an individual audio channel format or a group of audio channel formats.
A third aspect of the present disclosure provides an electronic device, comprising: a memory and one or more processors;
the memory for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of rendering sound bed-based audio with metadata as provided by any of the embodiments.
A fourth aspect of the present disclosure provides a storage medium containing computer-executable instructions that when executed by a computer processor implement a method of rendering sound bed-based audio using metadata as provided in any embodiment.
From the above it can be seen that the disclosed method for rendering sound bed-based audio using metadata converts a set of audio signals with sound bed type metadata into different configurations of audio signals and metadata, and can render the audio signals to all speaker configurations specified in an advanced sound system. The disclosed method transmits per-channel audio signals directly to each speaker without any signal modification.
Drawings
FIG. 1 is a schematic diagram of an audio model provided in an embodiment of the present disclosure;
FIG. 2 is a flow diagram of a method of rendering bed-based audio using metadata in an embodiment of the present disclosure;
FIG. 3 is another flow diagram of a method of rendering sound bed based audio using metadata in an embodiment of the present disclosure;
FIG. 4 is another flow diagram of a method of rendering sound bed based audio using metadata in an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of an apparatus for rendering audio based on a sound bed using metadata according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Fig. 1 is a schematic diagram of an audio model provided in an embodiment of the present disclosure. The audio model comprises a content production part and a format production part;
wherein the content production section includes: an audio program element, an audio content element, an audio object element, and an audio track unique identification element;
the format making part includes: an audio packet format element, an audio channel format element, an audio stream format element, and an audio track format element;
the audio program element references at least one of the audio content elements; the audio content element references at least one audio object element; the audio object element references the corresponding audio package format element and the corresponding audio track unique identification element; the audio track unique identification element refers to the corresponding audio track format element and the corresponding audio package format element;
the audio package format element references at least one of the audio channel format elements; the audio stream format element references the corresponding audio channel format element and the corresponding audio packet format element; the audio track format element and the corresponding audio stream format element are mutually referenced;
the audio channel format element comprises at least one audio block format element.
As shown in fig. 2, an embodiment of the present disclosure provides a method for rendering sound bed-based audio by using metadata, including:
s201, based on a pre-constructed audio model, storing parameters of the audio model in respective data structures through type labels;
As shown in fig. 3, saving the parameters of the audio model in respective data structures through the type tags, based on the pre-constructed audio model, includes:
s301, merging the general data into extra data; the general data includes an audio object start time, which is a start time of the last audio object on the path (no importance is specified in the channel-only allocation mode), an object duration (object _ duration), which is a duration of the last audio object on the path (no importance is specified in the channel-only allocation mode), an object duration, and a channel frequency. The screen reference (reference _ screen) is an audio program screen reference (audio program referencescreen) for the selected audio program (unselected, i.e., not assigned importance).
The channel frequency (channel _ frequency) is a channel frequency element of the selected audio channel format (audioChannelFormat).
Implementation code example:
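The implementation code in the original publication appears only as an image. The following is a minimal C++ sketch of the extra data structure reconstructed from the fields described above; the field names follow the parenthesized names in the text where given (object_start is assumed by analogy), and the types Time, Screen and Frequency are placeholders assumed for illustration.
#include <optional>
using std::optional;

struct Time {};            // placeholder for a time/duration type
struct Screen {};          // placeholder for an audio program screen reference
using Frequency = float;   // placeholder for the channel frequency element

struct ExtraData {
    optional<Time> object_start;            // start time of the last audio object on the path
    optional<Time> object_duration;         // duration of the last audio object on the path
    optional<Screen> reference_screen;      // screen reference of the selected audio program
    optional<Frequency> channel_frequency;  // frequency element of the selected audioChannelFormat
};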
s302, storing important data in an important data structure; the important data includes the audio packet format (audiopack format) and the audio objects (audioObjects).
The importance data allows a processor to discard objects below a certain level of importance, where 10 is the highest level and 0 the lowest. Changing this parameter over consecutive blocks should be avoided. The parameter is particularly useful when the capacity of the metadata needs to be reduced, as it allows the possible compromises to be prioritized.
When importance data is used on an audio object, it can serve to delete less important sounds when the number of objects or tracks must be reduced. For example, some background sound effects may be discarded to ensure that the primary dialog object remains.
When importance data is used on an audio packet format, it can serve to reduce spatial audio quality. Nested audio packet formats can exploit this: for example, for an audio object with a main direct sound (in a high-importance parent audio packet format) and additional reverberant sound (in a low-importance sub audio packet format), the reverberation may be discarded, preserving the main sound at reduced quality.
Implementation code example:
struct ImportanceData {
    // Importance ranges from 0 (lowest) to 10 (highest); an unset value
    // means no importance has been assigned at that level.
    optional<int> audio_object;
    optional<int> audio_pack_format;
};
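As an illustration of how the importance rules above might be applied to this structure (a sketch only; should_discard is a hypothetical helper, not part of this disclosure):
// Returns true when the item should be dropped: an importance value is
// present and falls below the chosen threshold. Items with no importance
// assigned are always kept.
bool should_discard(const ImportanceData& imp, int threshold) {
    if (imp.audio_object && *imp.audio_object < threshold)
        return true;  // e.g. a low-priority background effect
    if (imp.audio_pack_format && *imp.audio_pack_format < threshold)
        return true;  // e.g. a low-importance reverberation sub-pack
    return false;
}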
The step of saving the parameters of the audio model in respective data structures through the type tags, based on the pre-constructed audio model, further comprises: referencing and encapsulating the audio samples in a track specification (TrackSpec) structure and defining them as an audio sample source, wherein:
the direct track specification (DirectTrackSpec) specifies that the audio samples should be read directly from the specified input track; or,
the silent track specification (SilentTrackSpec) specifies that all audio samples take a preset value; in particular, that all audio samples are zero.
Implementation code example:
struct TrackSpec {};

struct DirectTrackSpec : TrackSpec {
    int track_index;  // the input track from which audio samples are read
};

struct SilentTrackSpec : TrackSpec {};  // all samples are zero
Two further track specification types are provided to support typeDefinition == Bed: the matrix coefficient track specification (MatrixCoefficientTrackSpec) and the mix track specification (MixTrackSpec).
As shown in fig. 4, the step of referencing and encapsulating the audio samples in the track specification structure and defining them as an audio sample source further comprises:
S401, applying the parameters specified in the matrix audio block format coefficient elements to the audio samples of an input track; or,
S402, specifying that the audio samples from at least one input encapsulated in a track specification structure should be mixed together.
In an embodiment of the disclosure, a set of samples forms an audio track format element. It describes the format of the data, allowing the renderer to decode the signal correctly. It is derived from an audio stream format element, which identifies the combination of audio tracks needed to successfully decode the audio track data.
Implementation code example:
struct MatrixCoefficientTrackSpec : TrackSpec {
    TrackSpec input_track;          // source of the samples to be scaled
    MatrixCoefficient coefficient;  // parameters from the matrix audio block format coefficient element
};

struct MixTrackSpec : TrackSpec {
    vector<TrackSpec> input_tracks;  // inputs whose samples are mixed together
};
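Taken together, the four track specification types form a small expression tree over the input tracks, which a renderer can evaluate recursively. The following self-contained C++ sketch shows one possible evaluation routine; it is not defined by the disclosure, and it holds child specifications by shared_ptr, since storing the base TrackSpec by value, as in the pseudocode above, would slice off the derived type in real C++:
#include <cstddef>
#include <memory>
#include <vector>
using std::shared_ptr; using std::vector;

struct Spec { virtual ~Spec() = default; };                   // TrackSpec
struct Direct : Spec { int track_index; };                    // DirectTrackSpec
struct Silent : Spec {};                                      // SilentTrackSpec
struct Coeff : Spec { shared_ptr<Spec> input; float gain; };  // MatrixCoefficientTrackSpec
struct Mix : Spec { vector<shared_ptr<Spec>> inputs; };       // MixTrackSpec

// Evaluate one block of samples for a single bed channel.
vector<float> resolve(const Spec& spec,
                      const vector<vector<float>>& input_tracks,
                      size_t block_size) {
    if (auto d = dynamic_cast<const Direct*>(&spec))
        return input_tracks[d->track_index];      // read directly from the input track
    if (dynamic_cast<const Silent*>(&spec))
        return vector<float>(block_size, 0.0f);   // all samples zero
    if (auto c = dynamic_cast<const Coeff*>(&spec)) {
        auto out = resolve(*c->input, input_tracks, block_size);
        for (auto& s : out) s *= c->gain;         // apply the matrix coefficient
        return out;
    }
    const auto& m = dynamic_cast<const Mix&>(spec);
    vector<float> out(block_size, 0.0f);
    for (const auto& in : m.inputs) {             // mix all inputs together
        auto part = resolve(*in, input_tracks, block_size);
        for (size_t i = 0; i < block_size; ++i) out[i] += part[i];
    }
    return out;
}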
s202, introducing the parameters of the audio model stored in the respective data structure through the type tag to generate a sound bed type metadata object of the audio model;
The step of introducing, through the type tag, the parameters of the audio model stored in the respective data structures to generate the sound bed type metadata object of the audio model comprises:
referencing the audio block format, the list of audio packet formats containing the audio channel format, and the general data collected in the additional data.
As sound becomes more immersive and interactive, the complexity of audio processing also increases greatly. To handle all these new channels and their complexity, each audio channel requires an explicit label. These channel labels are metadata; when attached to audio, they turn it into sound bed type based audio. Binding the sound bed type metadata to the audio it describes enables the audio to be rendered correctly. In normal operation, the encoded audio and the sound bed type metadata may be transmitted over the audio channel as a single data stream or as separate streams.
Implementation code example:
struct BedTypeMetadata : TypeMetadata {
    AudioBlockFormatBed block_format;          // the referenced audio block format
    vector<AudioPackFormat> audioPackFormats;  // audio packet formats containing the audio channel format
    ExtraData extra_data;                      // general data collected in the additional data
};
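As a usage illustration (hypothetical only: front_left_block, pack_5_1 and extra stand for values produced during model construction and in step S301):
// Assemble the sound bed metadata for one channel of a 5.1 pack.
BedTypeMetadata meta;
meta.block_format = front_left_block;  // audioBlockFormat of the front-left channel
meta.audioPackFormats = {pack_5_1};    // the 5.1 audio packet format containing this channel
meta.extra_data = extra;               // general data merged in step S301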
s203, introducing the parameters of the audio model and the sound bed type metadata objects stored in the respective data structures through the type tags, and generating rendering items of the sound bed type; the bed-type rendering item is used to indicate an individual audio channel format or a group of audio channel formats.
Introducing, through the type tags, the parameters of the audio model and the sound bed type metadata object stored in the respective data structures to generate the sound bed type rendering item specifically includes:
referencing the metadata object, the importance data stored in the importance data structure, and the audio samples encapsulated in the track specification structure input.
Each audio channel format (audioChannelFormat) of the sound bed type can be processed independently; each rendering item (RenderingItem) therefore contains its own track specification (TrackSpec).
Implementation code example:
struct BedRenderingItem : RenderingItem {
    TrackSpec track_spec;            // where to read the audio samples for this channel
    MetadataSource metadata_source;  // source of the sound bed type metadata objects
    ImportanceData importance;       // importance data stored in the importance data structure
};
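Putting the pieces together, a hypothetical rendering loop might look as follows. None of these routine names are defined by the disclosure: resolve_track corresponds to the track specification evaluation sketched earlier, route_to_speaker stands for the direct per-channel output, and should_discard is the importance helper sketched above.
// For each sound bed rendering item: optionally drop low-importance items,
// read the channel's samples via its track specification, and send them,
// unmodified, to the speaker assigned to the item's audio channel format.
for (const BedRenderingItem& item : rendering_items) {
    if (should_discard(item.importance, importance_threshold))
        continue;  // skip items below the importance threshold
    vector<float> samples = resolve_track(item.track_spec, input_tracks);
    route_to_speaker(item, samples);  // direct per-channel transmission, no signal modification
}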
the present disclosure provides a method for rendering sound bed based audio using metadata, converting a set of audio signals with metadata of the sound bed type into different configurations of audio signals and metadata, capable of rendering the audio to all speaker configurations specified in an advanced sound system.
The present disclosure provides a method for rendering sound bed based audio using metadata, without any signal modification, to directly transmit per-channel audio signals to each speaker.
Generating a corresponding rendering item by introducing, through the type tag, the sound bed type metadata of the audio signal stored in the respective data structure further includes: associating the sound bed type metadata object with the audio sample source and the general data collected in the additional data.
In the channel allocation mode, the sound bed type metadata object is associated with the audio sample source together with the audio object start time, object duration, screen reference, and channel frequency.
As shown in fig. 5, an embodiment of the present disclosure provides an apparatus for rendering sound bed-based audio by using metadata, including:
a storage module 501, configured to store parameters of an audio model in respective data structures through type tags based on a pre-constructed audio model;
a generating module 502, configured to introduce, through the type tag, the parameters of the audio model stored in the respective data structure, and generate a sound bed type metadata object of the audio model;
an import generation module 503, configured to introduce, through the type tags, the parameters of the audio model and the sound bed type metadata object stored in the respective data structures, and to generate a sound bed type rendering item; the sound bed type rendering item is used to indicate an individual audio channel format or a group of audio channel formats.
The parameters of the audio model include: a metadata object for holding all parameters needed to render an item, and a sound bed type metadata source for holding a series of metadata objects.
The saving module 501 is configured to merge the general data into additional data, the general data including the audio object start time, object duration, screen reference, and channel frequency; or,
to store the importance data in an importance data structure, the importance data including the audio objects and audio packets; or,
to reference and encapsulate the audio samples in a track specification structure and define them as an audio sample source, by:
reading the audio samples directly from the specified input track; or,
specifying all audio samples as a preset value, in particular zero.
The saving module 501 is further configured to apply the parameters specified in the matrix audio block format coefficient elements to the audio samples of the input tracks; or,
to specify that the audio samples from at least one input encapsulated in a track specification structure should be mixed together.
The generating module 502 is configured to reference the audio block format, the list of audio packet formats containing the audio channel format, and the general data collected in the additional data.
The import generation module 503 is configured to reference the metadata object, the importance data stored in the importance data structure, and the audio samples encapsulated in the track specification structure input.
Generating a corresponding rendering item by introducing, through the type tag, the sound bed type metadata of the audio signal stored in the respective data structure further includes: associating the sound bed type metadata object with the audio sample source and the general data collected in the additional data.
The apparatus for rendering sound bed-based audio using metadata provided by the embodiments of the disclosure can execute the method for rendering sound bed-based audio using metadata provided by any embodiment of the disclosure, with the corresponding functional modules and the beneficial effects of the executed method.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. As shown in fig. 6, the electronic device includes: a processor 610, a memory 620, an input device 630, and an output device 640. The number of processors 610 in the electronic device may be one or more, and one processor 610 is taken as an example in fig. 6. The number of memories 620 in the electronic device may be one or more, and one memory 620 is taken as an example in fig. 6. The processor 610, the memory 620, the input device 630, and the output device 640 of the electronic device may be connected by a bus or other means; connection by a bus is taken as an example in fig. 6. The electronic device can be a computer, a server, and the like. The embodiments of the present disclosure are described in detail by taking a server as the electronic device; the server may be an independent server or a cluster server.
The memory 620, as a computer-readable storage medium, may be used to store software programs, computer-executable programs, and modules, such as program instructions/modules for rendering bed-based audio using metadata as described in any embodiment of the present disclosure. The memory 620 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the device, and the like. Further, the memory 620 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the memory 620 can further include memory located remotely from the processor 610, which can be connected to devices through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 630 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the electronic device; it may also include a camera for capturing images and a sound pickup device for capturing audio data. The output device 640 may include an audio device such as a speaker. It should be noted that the specific composition of the input device 630 and the output device 640 can be set according to the actual situation.
The processor 610 executes various functional applications of the device and performs data processing, i.e., rendering sound bed-based audio using metadata, by running the software programs, instructions, and modules stored in the memory 620.
Embodiments of the present disclosure also provide a storage medium containing computer-executable instructions that, when executed by a computer processor, implement the method of rendering sound bed-based audio using metadata as provided by any of the embodiments.
Of course, the storage medium provided by the embodiments of the present disclosure contains computer-executable instructions, and the computer-executable instructions are not limited to the operations of the electronic method described above, and may also perform related operations in the electronic method provided by any embodiments of the present disclosure, and have corresponding functions and advantages.
From the above description of the embodiments, it is obvious for a person skilled in the art that the present disclosure can be implemented by software and necessary general hardware, and certainly can be implemented by hardware, but in many cases, the former is a better embodiment. Based on such understanding, the technical solutions of the present disclosure may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a robot, a personal computer, a server, or a network device) to execute the electronic method according to any embodiment of the present disclosure.
It should be noted that, in the electronic device, the units and modules included in the electronic device are merely divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be realized; in addition, specific names of the functional units are only used for distinguishing one functional unit from another, and are not used for limiting the protection scope of the present disclosure.
It should be understood that portions of the present disclosure may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the description herein, references to the description of the term "in an embodiment," "in yet another embodiment," "exemplary" or "in a particular embodiment," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the disclosure. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although the present disclosure has been described in detail hereinabove by way of general description, specific embodiments and experiments, it will be apparent to those skilled in the art that certain modifications or improvements may be made thereto based on the present disclosure. Accordingly, such modifications and improvements are intended to be within the scope of this disclosure, as claimed.

Claims (10)

1. A method of rendering sound bed based audio using metadata, comprising:
based on a pre-constructed audio model, saving parameters of the audio model in respective data structures through type tags;
introducing the parameters of the audio model stored in the respective data structure through the type tag to generate a sound bed type metadata object of the audio model;
introducing parameters of the audio model and a sound bed type metadata object stored in respective data structures through the type tags, and generating a sound bed type rendering item; the sound bed type rendering item is used to indicate an individual audio channel format or a group of audio channel formats.
2. The method of claim 1, wherein the saving parameters of the audio model in respective data structures based on a pre-constructed audio model and by type tags comprises:
merging the general data into additional data; the general data includes an audio object start time, an object duration, a screen reference, and a channel frequency; or,
storing the importance data in an importance data structure; the importance data includes audio objects and audio packets.
3. The method according to claim 1, wherein the saving parameters of the audio model in respective data structures based on the pre-constructed audio model and through the type tags comprises: the audio samples are referenced and encapsulated in a soundtrack specification structure and defined as an audio sample source.
4. The method of claim 3, wherein the step of referencing and encapsulating the audio sample in a soundtrack specification structure and defining it as an audio sample source further comprises:
applying parameters specified in matrix audio block format coefficient elements to audio samples of an input track; or,
specifying that audio samples from at least one input encapsulated in a track specification structure should be mixed together.
5. The method according to any one of claims 1, 3 and 4, wherein said introducing the parameters of said audio model stored in the respective data structures through said type tag to generate a sound bed type metadata object of the audio model comprises:
referencing the audio block format, the list of audio packet formats containing the audio channel format, and the general data collected in the additional data.
6. The method according to any one of claims 1, 2 and 3, wherein said introducing the parameters of said audio model and the sound bed type metadata object stored in the respective data structures through said type tags to generate the sound bed type rendering item specifically comprises:
referencing the metadata object, the importance data stored in the importance data structure, and the audio samples encapsulated in the track specification structure input.
7. The method according to any one of claims 1, 2 and 3, wherein said introducing, through said type tag, the sound bed type metadata of said audio signal stored in the respective data structure to generate a corresponding rendering item further comprises: associating the sound bed type metadata object with the audio sample source and the general data collected in the additional data.
8. An apparatus for rendering sound bed based audio using metadata, comprising:
the storage module is used for storing the parameters of the audio model in respective data structures through type tags based on the pre-constructed audio model;
the generation module is used for introducing the parameters of the audio model stored in the respective data structure through the type label to generate a sound bed type metadata object of the audio model;
the introduction generation module is used for introducing the parameters of the audio model and the sound bed type metadata object stored in the respective data structures through the type tags to generate a sound bed type rendering item; the sound bed type rendering item is used to indicate an individual audio channel format or a group of audio channel formats.
9. An electronic device, comprising: a memory and one or more processors;
the memory to store one or more programs;
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
10. A storage medium containing computer-executable instructions for implementing the method of any one of claims 1-7 when executed by a computer processor.
CN202210634563.0A 2022-06-07 2022-06-07 Method and device for rendering sound bed-based audio by using metadata Pending CN115209310A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210634563.0A CN115209310A (en) 2022-06-07 2022-06-07 Method and device for rendering sound bed-based audio by using metadata

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210634563.0A CN115209310A (en) 2022-06-07 2022-06-07 Method and device for rendering sound bed-based audio by using metadata

Publications (1)

Publication Number Publication Date
CN115209310A true CN115209310A (en) 2022-10-18

Family

ID=83575670

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210634563.0A Pending CN115209310A (en) 2022-06-07 2022-06-07 Method and device for rendering sound bed-based audio by using metadata

Country Status (1)

Country Link
CN (1) CN115209310A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination