CN114203188A - Scene-based audio packet format metadata and generation method, device and storage medium - Google Patents


Info

Publication number
CN114203188A
CN114203188A (application CN202111306844.5A)
Authority
CN
China
Prior art keywords
audio
information
scene
audio packet
format
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111306844.5A
Other languages
Chinese (zh)
Inventor
吴健 (Wu Jian)
Current Assignee
Saiyinxin Micro Beijing Electronic Technology Co ltd
Original Assignee
Saiyinxin Micro Beijing Electronic Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Saiyinxin Micro Beijing Electronic Technology Co ltd filed Critical Saiyinxin Micro Beijing Electronic Technology Co ltd
Priority: CN202111306844.5A
Publication of CN114203188A
Legal status: Pending

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 — Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/008 — Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • H — ELECTRICITY
    • H04 — ELECTRIC COMMUNICATION TECHNIQUE
    • H04R — LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 5/00 — Stereophonic arrangements
    • H04R 5/02 — Spatial or constructional arrangements of loudspeakers
    • H04R 5/033 — Headphones for stereophonic communication


Abstract

The present disclosure relates to scene-based audio packet format metadata and a generation method, apparatus, and storage medium. In the audio packet format metadata, an attribute area contains the audio packet format identifier and audio packet format name of an audio packet, where the identifier includes information indicating that the audio type of the packet is the scene type. A sub-element area contains first reference information, second reference information, an absolute distance, and scene component description information: the first reference information includes the audio channel format information used at rendering time by the audio channels associated with the packet; the second reference information includes the audio packet format information used at rendering time by the audio packets associated with this packet; and the absolute distance is set to a preset invalid value representing that scene-type audio packets have no corresponding distance at rendering time. Three-dimensional sound can thus be reproduced over the channels at rendering time, improving the quality of the sound scene.

Description

Scene-based audio packet format metadata and generation method, device and storage medium
Technical Field
The present disclosure relates to the field of audio processing technologies, and in particular to scene-based audio packet format metadata and a generation method, device, and storage medium.
Background
As technology has developed, audio has become increasingly complex. Early mono audio gave way to stereo, and attention shifted to processing the left and right channels correctly. Processing grew more involved with the arrival of surround sound: a surround 5.1 speaker system imposes ordering constraints on multiple channels, and surround 6.1 and 7.1 systems diversify audio processing further, requiring the correct signal to reach the appropriate speaker for the channels to work together. As sound becomes more immersive and interactive, the complexity of audio processing increases greatly.
In scene-based audio, a sound scene is represented by a set of coefficient signals. These coefficient signals are the linear weights of spatially orthogonal basis functions, such as spherical harmonics or circular harmonics. The scene can then be reproduced by rendering these coefficient signals to a target speaker layout or to headphones. This separation of production and reproduction allows mixed program material to be created without knowledge of the number and positions of the target speakers. One example of scene-based audio is Higher-Order Ambisonics (HOA).
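The idea of coefficient signals as linear weights of spherical harmonics can be illustrated with a minimal first-order ambisonic encoder. This is an illustrative sketch, not part of the disclosure: the function name and the unscaled (SN3D-style, W-unnormalized) convention are assumptions made for the example.

```python
import math

def encode_foa(sample: float, azimuth: float, elevation: float) -> dict:
    """Encode one mono sample into first-order ambisonic (B-format)
    coefficient signals W, X, Y, Z using real spherical harmonics.
    Angles are in radians."""
    return {
        "W": sample,                                            # order 0: omnidirectional
        "X": sample * math.cos(elevation) * math.cos(azimuth),  # order 1: front-back
        "Y": sample * math.cos(elevation) * math.sin(azimuth),  # order 1: left-right
        "Z": sample * math.sin(elevation),                      # order 1: up-down
    }

# A source directly ahead (azimuth 0, elevation 0) excites only W and X.
coeffs = encode_foa(1.0, azimuth=0.0, elevation=0.0)
```

Rendering then mixes these four coefficient signals down to whatever speaker layout or headphone feed the playback side has, which is what keeps production independent of reproduction.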
To provide metadata capable of solving the above technical problems, the present disclosure provides scene-based audio packet format metadata and a generation method.
Disclosure of Invention
The present disclosure is directed to scene-based audio packet format metadata and a generation method, apparatus, and storage medium for solving one of the above technical problems. The disclosed metadata structure can place the compression codec in a user-selectable metadata extension structure independent of the content's audio metadata architecture, so producers, rights holders, and content operators can choose freely without affecting the integrity or sound quality of the delivered content.
To achieve the above object, a first aspect of the present disclosure provides scene-based audio packet format metadata, including: an attribute area, containing an audio packet format identifier and an audio packet format name of an audio packet, where the audio packet format identifier includes information indicating that the audio type of the audio packet is the scene type;
a sub-element region comprising: the method comprises the steps of obtaining first reference information, second reference information, absolute distance and scene component description information, wherein the first reference information comprises audio channel format information adopted by an audio channel related to an audio packet during rendering, the second reference information comprises audio packet format information adopted by the audio packet related to the audio packet during rendering, the absolute distance is indicated as a preset invalid value, the preset invalid value is used for representing that no corresponding distance exists in the audio packet of the scene type during rendering, and the scene component description information is used for describing information shared by a group of scene components.
To achieve the above object, a second aspect of the present disclosure provides a method for generating metadata based on a scene audio packet format, including:
generating metadata comprising the audio packet format as described in the first aspect.
To achieve the above object, a third aspect of the present disclosure provides an apparatus comprising: a memory and one or more processors;
the memory for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to generate metadata including the audio packet format of the first aspect.
To achieve the above object, a fourth aspect of the present disclosure provides a storage medium containing computer-executable instructions which, when executed by a computer processor, generate metadata including the audio packet format described in the first aspect.
As can be seen from the above, the disclosed audio packet format metadata includes: an attribute area containing the audio packet format identifier and audio packet format name of an audio packet, the identifier including information indicating that the audio type of the packet is the scene type; and a sub-element area containing first reference information, second reference information, an absolute distance, and scene component description information, where the first reference information includes the audio channel format information used at rendering time by the audio channels associated with the packet, the second reference information includes the audio packet format information used at rendering time by the audio packets associated with this packet, the absolute distance is set to a preset invalid value representing that scene-type audio packets have no corresponding distance at rendering time, and the scene component description information describes information shared by a group of scene components. In scene-based audio, a sound scene is represented by a set of coefficient signals that are the linear weights of spatially orthogonal basis functions (e.g., spherical or circular harmonics). The scene can then be reproduced by rendering these coefficient signals to a target speaker layout or to headphones. Three-dimensional sound can be reproduced over the channels, improving the quality of the sound scene.
Drawings
FIG. 1 is a schematic diagram of a three-dimensional acoustic audio production model provided in an embodiment of the present disclosure;
fig. 2 is a schematic structural diagram of metadata based on a scene audio packet format provided in embodiment 1 of the present disclosure;
fig. 3 is a flowchart of a method for generating metadata based on a scene audio packet format provided in embodiment 2 of the present disclosure;
fig. 4 is a schematic structural diagram of an electronic device provided in embodiment 3 of the present disclosure.
Detailed Description
The following examples are intended to illustrate the present disclosure, but are not intended to limit the scope of the present disclosure.
Metadata is information that describes the structural characteristics of data; the functions it supports include indicating storage location, recording historical data, and enabling resource lookup or file recording.
As shown in fig. 1, the three-dimensional sound audio production model consists of a set of production elements, each of which uses metadata to describe the structural characteristics of the data at the corresponding stage of audio production. The model comprises a content production part and a format production part.
The production elements of the content production section include: an audio program element, an audio content element, an audio object element, and a soundtrack unique identification element.
An audio program includes narration, sound effects, and background music; it references one or more audio contents that are combined to construct a complete audio program. The audio program element produces the audio program, and the metadata generated for it describes the structural characteristics of the audio program.
The audio content describes the content of one component of an audio program, such as background music, and relates that content to its format by referencing one or more audio objects. The audio content element produces the audio content, and the metadata generated for it describes the structural characteristics of the audio content.
Audio objects establish the relationship between content, format, and assets using soundtrack unique identification elements, and determine the unique identifications of the actual soundtracks. The audio object element produces audio objects, and the metadata generated for it describes the structural characteristics of the audio objects.
The soundtrack unique identification element produces a soundtrack unique identification, and the metadata generated for it describes the structural characteristics of the soundtrack unique identification.
The production elements of the format production part include: an audio packet format element, an audio channel format element, an audio stream format element, an audio track format element.
The audio packet format is the format used when audio objects and original audio data are packed by channel groups; an audio packet format may include nested audio packet formats. The audio packet format element produces the audio packet data. The audio packet data includes audio packet format metadata, which describes the structural characteristics of the audio packet format.
The audio channel format represents a single sequence of audio samples on which certain operations can be performed, such as moving a rendered object within a scene. An audio channel format may include nested audio channel formats. The audio channel format element produces the audio channel data. The audio channel data includes audio channel format metadata, which describes the structural characteristics of the audio channel format.
An audio stream is a combination of audio tracks needed to render a channel, an object, a higher-order ambisonics component, or a packet. The audio stream format establishes the relationship between a set of audio track formats and a set of audio channel formats or audio packet formats. The audio stream format element produces the audio stream data. The audio stream data includes audio stream format metadata, which describes the structural characteristics of the audio stream format.
The audio track format corresponds to a set of samples or data in a single audio track in the storage medium; it describes the original audio data and the decoded signal of the renderer. The audio track format is derived from an audio stream format and identifies the combination of audio tracks required to decode the audio track data successfully. The audio track format element produces the audio track data. The audio track data includes audio track format metadata, which describes the structural characteristics of the audio track format.
Each stage of the three-dimensional audio production model produces metadata that describes the characteristics of that stage.
After the audio channel data produced with the three-dimensional sound audio production model is transmitted to the far end, the far end renders the data stage by stage based on the metadata, restoring the produced sound scene.
Example 1
The present disclosure provides metadata in an audio packet format in a three-dimensional audio model and describes in detail.
In an audio packet format element of a three-dimensional sound audio production model, metadata of an audio object and audio stream data are divided into a plurality of data blocks, which are called audio packets, according to channels. These audio packets are transmitted along different paths in one or more networks for reassembly at the destination. The disclosed embodiment describes structural information of an audio packet format using metadata 100 based on a scene audio packet format.
As shown in fig. 2, the scene audio packet format-based metadata 100 includes a property area 110 and a sub-element area 120.
The attribute section 110 includes an audio packet format identifier 111 and an audio packet format name 112 of the audio packet.
The audio packet format identifier 111 includes information indicating that the audio type of the audio packet is the scene type. The audio packet formats of many scene types are already covered by common definitions and therefore need not be specified explicitly when generating three-dimensional sound audio model metadata.
In the disclosed embodiment, the audio types include: channel type, matrix type, object type, scene type, and binaural channel type. Audio channel data for each audio type is generated by the above-described three-dimensional acoustic audio production model.
For example, audio packet data of the channel type can be understood as original audio data packed into audio packet data with two audio channels; it can be played directly through channel-type speakers, or processed by a dedicated renderer to suit other playback types (such as channel-type 5.1 surround sound).
The audio packet format identifier 111 includes information indicating that the audio type of the audio packet is the scene type. For example, the identifier 111 takes the form "AP_y1y2y3y4x1x2x3x4", where "y1y2y3y4" is a string of numeric and/or symbolic characters indicating the audio type of the audio packet: a four-digit hexadecimal value such as "0004", a symbolic string such as "abcd", or a mixed string such as "ab01" can indicate that the audio type is the scene type.
Optionally, the audio packet format identifier 111 further includes information indicating a specific type of the advanced sound system used in audio program production. Continuing the example above, with the identifier 111 in the form "AP_y1y2y3y4x1x2x3x4", "x1x2x3x4" is a string of numeric characters, such as a four-digit hexadecimal value: the range "0001" through "0FFF" indicates a specific advanced sound system type specified in the ITU-R BS.2094 recommendation of the International Telecommunication Union (ITU), and the range "1000" through "FFFF" indicates a custom advanced sound system type. The audio packet format identifier is therefore unique and provides cross-reference association information for the production elements, which reduces information storage and improves data processing efficiency.
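The identifier scheme described above can be sketched as a small helper. The function names are illustrative, not part of the disclosure; the scene-type code "0004" and the BS.2094/custom ranges follow the example in the text.

```python
SCENE_TYPE = "0004"  # four-digit hex code for the scene audio type, per the example above

def make_pack_format_id(audio_type: str, system_type: str) -> str:
    """Compose an audio packet format identifier "AP_y1y2y3y4x1x2x3x4":
    y1..y4 encodes the audio type, x1..x4 the advanced sound system type."""
    if len(audio_type) != 4 or len(system_type) != 4:
        raise ValueError("both fields must be four characters")
    return f"AP_{audio_type}{system_type}"

def is_custom_system(pack_id: str) -> bool:
    """x1..x4 in "0001"-"0FFF" -> type specified in ITU-R BS.2094;
    "1000"-"FFFF" -> custom advanced sound system type."""
    return int(pack_id[-4:], 16) >= 0x1000

pack_id = make_pack_format_id(SCENE_TYPE, "0001")  # -> "AP_00040001"
```

Because each identifier is unique, production elements can cross-reference one another by identifier alone rather than repeating full descriptions.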
The general attributes of the audio packet format elements are shown in table 1 below,
TABLE 1
(Table 1 is reproduced as an image in the original publication; its text is not available.)
The sub-element region 120 includes: first reference information 121, second reference information 122, absolute distance 123, and scene component description information.
The first reference information 121 includes audio channel format information adopted by the audio channel associated with the audio packet at the time of rendering.
The second reference information 122 includes the audio packet format information used at rendering time by the audio packets associated with this audio packet.
The absolute distance 123 is set to a preset invalid value, which represents that scene-type audio packets have no corresponding distance at rendering time. For example, the preset invalid value is zero.
The scene component description information is used for describing information shared by a group of scene components.
The audio packet format for a scene type defines sub-elements as shown in table 2 below:
TABLE 2
(Table 2 is reproduced as an image in the original publication; its text is not available.)
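The attribute area and sub-element area described above can be sketched as a metadata fragment. The element and attribute names here are assumptions modeled on common audio-metadata conventions (the table in the original is only an image), and the identifier values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Attribute area: identifier, name, and scene-type label on the element itself.
pack = ET.Element("audioPackFormat", {
    "audioPackFormatID": "AP_00040001",   # "0004" marks the scene type
    "audioPackFormatName": "SceneBasedPack",
    "typeLabel": "0004",
})
# Sub-element area: the two references and the absolute distance.
ET.SubElement(pack, "audioChannelFormatIDRef").text = "AC_00040001"  # first reference info
ET.SubElement(pack, "audioPackFormatIDRef").text = "AP_00040002"     # second reference info
ET.SubElement(pack, "absoluteDistance").text = "0"                   # preset invalid value: no distance

xml_text = ET.tostring(pack, encoding="unicode")
```

A renderer consuming such a fragment would follow the channel-format references while treating the zero absolute distance as "no distance applies" for scene-type packets.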
Optionally, the attribute area 110 further includes a channel type tag indicating the audio channel used at rendering time by the downward-referenced audio channel format or audio packet format.
As shown in fig. 1, in the three-dimensional sound audio production model, the downwardly-referenced production element may be understood as a latter production element of the audio package format element.
If the production element following the audio packet format element is an audio channel format element, the attribute area 110 of the audio packet format metadata includes a channel type tag indicating the audio channel used at rendering time by the downward-referenced audio channel format. If the following production element is itself an audio packet format element (i.e., the produced audio packet format includes a nested audio packet format), the attribute area 110 includes a channel type tag indicating the audio channel used at rendering time by the downward-referenced audio packet format.
The channel type label of an audio channel characterizes the channel's type. For example, label "0001" denotes the channel type: when the audio and its descriptive metadata are played, each audio channel's data is output directly to the corresponding speaker. Label "0002" denotes the matrix type: each audio channel's data carries the rendering parameter values, set at production time, for matrix rendering. Label "0003" denotes the object type: at playback, the renderer turns the audio channel data into a perceivable effect object positioned in space. Label "0004" denotes the scene type: at playback, the audio channel data is rendered into scene audio based on environmental acoustics and higher-order ambisonics. Label "0005" denotes the binaural channel type: at playback, the audio channel data is delivered directly to the two ears (e.g., over headphones).
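The label-to-type mapping above can be written out as a lookup table. The short type names used here follow common audio-definition-model naming and are assumptions, not terms from the disclosure.

```python
# Channel type labels enumerated in the text; names are illustrative.
CHANNEL_TYPE_LABELS = {
    "0001": "DirectSpeakers",  # channel type: each channel feeds its speaker directly
    "0002": "Matrix",          # matrix type: channels carry matrix rendering parameters
    "0003": "Objects",         # object type: renderer positions effect objects in space
    "0004": "HOA",             # scene type: rendered from higher-order ambisonics
    "0005": "Binaural",        # binaural type: delivered directly to the two ears
}

def type_name(label: str) -> str:
    """Resolve a four-digit channel type label to its type name."""
    return CHANNEL_TYPE_LABELS.get(label, "Unknown")
```

Scene-type packets, the subject of this disclosure, carry label "0004".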
Optionally, the attribute area further includes an audio type indicating the audio channel used at rendering time by the upward-referenced audio object or audio packet format.
As shown in fig. 1, in the three-dimensional sound audio production model, the above-referenced production element may be understood as a previous production element of the audio package format element.
If the production element preceding the audio packet format element is an audio object element, the attribute area 110 of the audio packet format metadata includes an audio type indicating the audio channel used at rendering time by the audio object that references this audio packet format upward. If the preceding production element is itself an audio packet format element (i.e., the produced audio packet format includes a nested audio packet format), the attribute area 110 includes an audio type indicating the audio channel used at rendering time by the audio packet format that references this one upward. The audio types an audio channel can use include: the channel type, matrix type, object type, scene type, and binaural channel type.
Optionally, the attribute area 110 further includes importance information indicating the importance of the audio packet format metadata 100 in rendering.
Based on the importance information, higher-importance audio packet format metadata 100 can be rendered first, and lower-importance metadata 100 can even be discarded as needed, meeting the requirements of the rendering schedule.
Optionally, the scene component description information includes: standardized information 124, reference distance information 125, and screen related information 126; the standardized information 124 represents normalized information of the scene component content;
the reference distance information 125 represents a reference distance of the speaker setup for near field compensation; if nfcRefDist is undefined or has a value of 0, indicating that NFC is not required;
the screen related information 126 indicates the scene component content is related to the screen, either related (flag equal to 1) or unrelated (flag equal to 0).
The normalization information consists of degree, order, and normalization-value definitions, or of an equation. When no equation is present, the scene component is defined by its degree, order, and normalization value; likewise, each higher-order ambisonics audio-type component/signal is defined either by degree, order, and normalization value or by an equation. The purpose is to allow the description of custom or experimental scene components that cannot be described by the order, degree, and normalization parameters alone.
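The scene component description fields and the NFC rule above can be sketched as a small record type. All field names and the default "SN3D" normalization value are illustrative assumptions, not terms fixed by the disclosure.

```python
from dataclasses import dataclass

@dataclass
class SceneComponentDescription:
    """Scene component description info: normalization scheme, NFC
    reference distance, and screen relation (names assumed)."""
    normalization: str = "SN3D"  # degree/order/normalization-value scheme
    nfc_ref_dist: float = 0.0    # undefined or 0 -> near-field compensation not required
    screen_ref: int = 0          # 1 = screen-related, 0 = unrelated

    def needs_nfc(self) -> bool:
        # Mirrors the rule in the text: NFC applies only for a positive reference distance.
        return self.nfc_ref_dist > 0.0

desc = SceneComponentDescription()
```

With the default zero reference distance, a renderer would skip near-field compensation entirely.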
The embodiment of the present disclosure describes the scene-type audio packet format through the audio packet format metadata 100, enabling the reproduction of three-dimensional sound in the channels and thereby improving the quality of the sound scene.
Example 2
The present disclosure also provides a method embodiment corresponding to the embodiment above: a method for generating audio packet format metadata. Terms with the same names carry the same meanings and technical effects as in the embodiment above and are not repeated here.
As shown in fig. 3, the method for generating metadata based on a scene audio packet format includes the steps of:
step S210, generating metadata based on a scene audio packet format, where the metadata of the audio packet format includes:
the attribute zone comprises an audio packet format identifier and an audio packet format name of an audio packet, wherein the audio packet format identifier comprises information indicating that the audio type of the audio packet is a scene type;
a sub-element area, containing: first reference information, second reference information, an absolute distance, and scene component description information, where the first reference information includes the audio channel format information used at rendering time by the audio channels associated with the audio packet; the second reference information indicates a preset identifier representing that, for scene-type audio packets, the information in the second reference information is not applied at rendering time; the absolute distance is set to a preset invalid value representing that scene-type audio packets have no corresponding distance at rendering time; and the scene component description information describes information shared by a group of scene components.
Optionally, the attribute area further includes an audio type indicating the audio channel used at rendering time by the upward-referenced audio object or audio packet format.
Optionally, the attribute area further includes a channel type tag indicating the audio channel used at rendering time by the downward-referenced audio channel format or audio packet format.
Optionally, the attribute area further includes importance information indicating an importance of the metadata of the audio packet format in rendering.
Optionally, the scene component description information includes: standardized information, reference distance information, and screen-related information. The standardized information represents the normalization of the scene component content; the normalization information consists of degree, order, and normalization-value definitions, or of an equation.
The reference distance information represents the reference distance of the speaker setup used for near-field compensation (NFC); if nfcRefDist is undefined or has the value 0, NFC is not required.
the screen related information indicates a scene component content-to-screen correlation, which is related (flag equal to 1) or unrelated (flag equal to 0).
Optionally, the audio packet format identifier further includes information indicating a specific type of audio programming advanced sound system.
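Step S210 and the optional fields above can be sketched as a single generation function. This is a minimal sketch under assumed field names; the disclosure does not prescribe a concrete serialization, and the identifier values are hypothetical.

```python
def generate_scene_pack_metadata(name: str, channel_refs: list) -> dict:
    """Sketch of step S210: assemble scene-based audio packet format
    metadata as a plain dictionary (field names are illustrative)."""
    return {
        "attributes": {
            "audioPackFormatID": "AP_00040001",  # "0004" marks the scene type
            "audioPackFormatName": name,
            "typeLabel": "0004",
        },
        "subElements": {
            "audioChannelFormatIDRefs": channel_refs,  # first reference information
            "audioPackFormatIDRef": None,              # preset identifier: not applied for scene type
            "absoluteDistance": 0,                     # preset invalid value: no distance at rendering
            "sceneComponentDescription": {
                "normalization": "SN3D",  # assumed default scheme
                "nfcRefDist": 0,          # 0 -> NFC not required
                "screenRef": 0,           # 0 -> unrelated to the screen
            },
        },
    }

meta = generate_scene_pack_metadata("ScenePack", ["AC_00040001"])
```

A dictionary like this could then be serialized to whatever container format the production chain uses.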
The embodiment of the disclosure generates audio packet format metadata that describes the scene-type audio packet format, so that three-dimensional sound can be reproduced in the channels, improving the quality of sound scenes.
Example 3
Fig. 4 is a schematic structural diagram of an electronic device provided in embodiment 3 of the present disclosure. As shown in fig. 4, the electronic apparatus includes: a processor 30, a memory 31, an input device 32, and an output device 33. The number of the processors 30 in the electronic device may be one or more, and one processor 30 is taken as an example in fig. 4. The number of the memories 31 in the electronic device may be one or more, and one memory 31 is taken as an example in fig. 4. The processor 30, the memory 31, the input device 32 and the output device 33 of the electronic apparatus may be connected by a bus or other means, and fig. 4 illustrates the connection by a bus as an example. The electronic device can be a computer, a server and the like. The embodiment of the present disclosure describes in detail by taking an electronic device as a server, and the server may be an independent server or a cluster server.
Memory 31 is provided as a computer-readable storage medium that may be used to store software programs, computer-executable programs, and modules, such as program instructions/modules, that generate metadata in the form of audio packets, as described in any embodiment of the present disclosure. The memory 31 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the device, and the like. Further, the memory 31 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the memory 31 may further include memory located remotely from the processor 30, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 32 may be used to receive input numeric or character information and to generate key-signal inputs related to user settings and function control of the electronic device; it may also include a camera for capturing images and a sound pickup device for capturing audio data. The output device 33 may include audio equipment such as a speaker. The specific composition of the input device 32 and the output device 33 can be set according to actual conditions.
The processor 30 runs the software programs, instructions, and modules stored in the memory 31, thereby executing the various functional applications and data processing of the device, that is, generating the scene-based audio packet format metadata.
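The disclosure does not give code for the generation step, but the metadata structure of claim 1 (an attribute area with an identifier and name, plus a sub-element area with two kinds of reference information, an absolute distance set to a preset invalid value, and shared scene component description information) resembles the `audioPackFormat` element of the Audio Definition Model (ITU-R BS.2076). The sketch below builds such metadata as XML under that assumption; all element and attribute names, the type code "0004", and the invalid value -1.0 are illustrative choices, not the literal schema defined by this disclosure.

```python
# Minimal sketch of generating scene-based audio packet format metadata,
# assuming an XML serialization modeled on ADM (ITU-R BS.2076) conventions.
# Names and codes below are illustrative assumptions, not the patent's schema.
import xml.etree.ElementTree as ET

INVALID_ABSOLUTE_DISTANCE = -1.0  # preset invalid value: scene-type packs have no distance


def build_scene_pack_format(pack_id, pack_name, channel_format_refs, pack_format_refs):
    # Attribute area: identifier (encoding the scene audio type) and name.
    pack = ET.Element("audioPackFormat", {
        "audioPackFormatID": pack_id,
        "audioPackFormatName": pack_name,
        "typeLabel": "0004",        # assumed code for the scene (HOA) type
        "typeDefinition": "HOA",
    })
    # Sub-element area, first reference info: channel formats used at rendering.
    for ref in channel_format_refs:
        ET.SubElement(pack, "audioChannelFormatIDRef").text = ref
    # Sub-element area, second reference info: nested pack formats used at rendering.
    for ref in pack_format_refs:
        ET.SubElement(pack, "audioPackFormatIDRef").text = ref
    # Absolute distance indicated as the preset invalid value.
    ET.SubElement(pack, "absoluteDistance").text = str(INVALID_ABSOLUTE_DISTANCE)
    # Scene component description info shared by a group of scene components:
    ET.SubElement(pack, "normalization").text = "SN3D"  # standardized information
    ET.SubElement(pack, "nfcRefDist").text = "1.0"      # near-field compensation reference distance
    ET.SubElement(pack, "screenRef").text = "0"         # screen-related info: unrelated
    return pack


pack = build_scene_pack_format("AP_00040001", "SceneMix", ["AC_00040001"], [])
xml_text = ET.tostring(pack, encoding="unicode")
```

Under these assumptions, the generated `xml_text` carries every field named in claim 1 and can be embedded in a broadcast audio format file by the device of this embodiment.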
Example 4
Embodiment 4 of the present disclosure further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, generate the scene-based audio packet format metadata according to Embodiment 1.
Of course, in the storage medium containing computer-executable instructions provided by the embodiments of the present disclosure, the computer-executable instructions are not limited to the operations described above; they may also perform related operations in the generation method provided by any embodiment of the present disclosure, with the corresponding functions and advantages.
From the above description of the embodiments, it will be clear to those skilled in the art that the present disclosure can be implemented by software plus necessary general-purpose hardware, and certainly also by hardware alone, although in many cases the former is the better implementation. Based on such understanding, the technical solutions of the present disclosure may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a flash memory (FLASH), a hard disk, or an optical disk of a computer, and which includes several instructions for enabling a computer device (which may be a robot, a personal computer, a server, or a network device) to execute the generation method according to any embodiment of the present disclosure.
It should be noted that the units and modules included in the electronic device are merely divided according to functional logic, and the division is not limited thereto as long as the corresponding functions can be realized; in addition, the specific names of the functional units are only used to distinguish them from one another and are not intended to limit the protection scope of the present disclosure.
It should be understood that portions of the present disclosure may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented by software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or a combination of the following techniques known in the art may be used: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having appropriate combinational logic gates, programmable gate arrays (PGA), field-programmable gate arrays (FPGA), and the like.
In the description herein, references to the description of the term "in an embodiment," "in yet another embodiment," "exemplary" or "in a particular embodiment," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the disclosure. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although the present disclosure has been described in detail hereinabove through general descriptions, specific embodiments, and experiments, it will be apparent to those skilled in the art that modifications or improvements may be made on the basis of the present disclosure. Accordingly, all such modifications and improvements are intended to fall within the scope of the present disclosure as claimed.

Claims (10)

1. Scene-based audio packet format metadata, comprising: an attribute area, wherein the attribute area comprises an audio packet format identifier and an audio packet format name of an audio packet, and the audio packet format identifier comprises information indicating that the audio type of the audio packet is a scene type;
a sub-element area, comprising: first reference information, second reference information, an absolute distance, and scene component description information; wherein the first reference information comprises audio channel format information adopted, during rendering, by an audio channel associated with the audio packet; the second reference information comprises audio packet format information adopted, during rendering, by an audio packet associated with the audio packet; the absolute distance is indicated as a preset invalid value, the preset invalid value representing that the scene-type audio packet has no corresponding distance during rendering; and the scene component description information describes information shared by a group of scene components.
2. The scene-based audio packet format metadata according to claim 1, wherein the attribute area further comprises an audio type indicating the audio channel adopted when an audio object or an audio packet format refers upward to the audio packet during rendering.
3. The scene-based audio packet format metadata according to claim 1, wherein the attribute area further comprises a channel type tag indicating the audio channel adopted when the audio packet refers downward to an audio channel format or an audio packet format during rendering.
4. The scene-based audio packet format metadata according to claim 1, wherein the attribute area further comprises importance information indicating the importance of the audio packet format metadata during rendering.
5. The scene-based audio packet format metadata according to claim 1, wherein the scene component description information comprises: standardized information, reference distance information, and screen-related information; the standardized information represents the normalization applied to the scene component content;
the reference distance information represents the reference distance of the speaker setup used for near-field compensation;
and the screen-related information indicates whether the scene component content is related to the screen.
6. The scene-based audio packet format metadata according to claim 5, wherein the standardized information is defined by degrees, orders, and normalization values, or is defined by an equation.
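Claim 6 leaves the concrete normalization open. For illustration only, one common convention for scene-based (HOA) content defines a normalization value from the order and degree of each spherical-harmonic component; the SN3D formula sketched below is an assumed example and not a definition given by this disclosure.

```python
# Illustrative sketch only: SN3D normalization value computed from order and
# degree, one common convention for scene components. The claim itself does
# not fix this formula; it is an assumption for illustration.
import math


def sn3d_factor(order: int, degree: int) -> float:
    """Return the SN3D normalization value for order l and degree m."""
    l, m = order, abs(degree)
    if m > l:
        raise ValueError("degree must satisfy |m| <= order")
    delta = 1.0 if m == 0 else 0.0  # Kronecker delta for m == 0
    return math.sqrt((2.0 - delta) * math.factorial(l - m) / math.factorial(l + m))
```

Under this assumed convention, the standardized information of claim 5 could be expressed either as such an equation or as a table of precomputed values indexed by degree and order.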
7. The scene-based audio packet format metadata according to claim 1, wherein the audio packet format identifier further comprises information indicating a specific type of advanced sound system for the audio program.
8. A method for generating scene-based audio packet format metadata, comprising:
generating the scene-based audio packet format metadata according to any one of claims 1 to 7.
9. An apparatus, comprising: a memory and one or more processors;
the memory being configured to store one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to generate the scene-based audio packet format metadata according to any one of claims 1 to 7.
10. A storage medium containing computer-executable instructions which, when executed by a computer processor, generate the scene-based audio packet format metadata according to any one of claims 1 to 7.
CN202111306844.5A 2021-11-05 2021-11-05 Scene-based audio packet format metadata and generation method, device and storage medium Pending CN114203188A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111306844.5A CN114203188A (en) 2021-11-05 2021-11-05 Scene-based audio packet format metadata and generation method, device and storage medium

Publications (1)

Publication Number Publication Date
CN114203188A true CN114203188A (en) 2022-03-18

Family

ID=80646921

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111306844.5A Pending CN114203188A (en) 2021-11-05 2021-11-05 Scene-based audio packet format metadata and generation method, device and storage medium

Country Status (1)

Country Link
CN (1) CN114203188A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination