CN114023340A - Object-based audio packet format metadata and generation method, apparatus, and medium - Google Patents


Info

Publication number
CN114023340A
CN114023340A (application CN202111308430.6A)
Authority
CN
China
Prior art keywords
audio
audio packet
format
metadata
packet format
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111308430.6A
Other languages
Chinese (zh)
Inventor
吴健 (Wu Jian)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Saiyinxin Micro Beijing Electronic Technology Co ltd
Original Assignee
Saiyinxin Micro Beijing Electronic Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Saiyinxin Micro Beijing Electronic Technology Co ltd
Priority to CN202111308430.6A
Publication of CN114023340A

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/008 - Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04R - LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 5/00 - Stereophonic arrangements
    • H04R 2205/00 - Details of stereophonic arrangements covered by H04R 5/00 but not provided for in any of its subgroups
    • H04R 2205/026 - Single (sub)woofer with two or more satellite loudspeakers for mid- and high-frequency band reproduction driven via the (sub)woofer

Abstract

The present disclosure relates to object-based audio packet format metadata and a generation method, apparatus, and medium. The audio packet format metadata comprises: an attribute area, which includes an audio packet format identifier and an audio packet format name of the audio packet, wherein the audio packet format identifier includes information indicating that the audio type of the audio packet is the object type; and a sub-element area, which includes first reference information, second reference information, and an absolute distance, wherein the first reference information includes the audio channel format information used during rendering by the audio channels associated with the audio packet, the second reference information includes the audio packet format information used during rendering by the audio packets associated with this audio packet, and the absolute distance represents the distance from a preset central point of a perceivable effect object that the sound couples out into space during rendering to the origin of a preset coordinate system, the audio type of the effect object being the object type. Three-dimensional sound can thus be reproduced in space during rendering, improving the quality of the sound scene.

Description

Object-based audio packet format metadata and generation method, apparatus, and medium
Technical Field
The present disclosure relates to the field of audio processing technologies, and in particular to object-based audio packet format metadata and a generation method, apparatus, and medium.
Background
With the development of technology, audio has become increasingly complex. Early single-channel audio gave way to stereo, and the focus of the work shifted to correctly processing the left and right channels. With the arrival of surround sound, however, processing began to grow complicated. The surround 5.1 speaker system imposes an ordering constraint on multiple channels, and later systems such as surround 6.1 and surround 7.1 diversified audio processing further, requiring the correct signal to be delivered to the appropriate speaker so that the speakers work together as a whole. As sound becomes more immersive and interactive, the complexity of audio processing grows substantially.
Audio channels (also called sound channels) are mutually independent audio signals that are captured or played back at different spatial locations when sound is recorded or played. The number of channels equals the number of sound sources during recording, or the number of corresponding speakers during playback. For example, a surround 5.1 speaker system comprises audio signals at 6 different spatial locations, each separate audio signal driving a speaker at its corresponding spatial location; a surround 7.1 speaker system comprises audio signals at 8 different spatial positions, each separate audio signal driving a speaker at its corresponding spatial position.
The effect achievable by current loudspeaker systems therefore depends on the number and spatial positions of the loudspeakers. For example, a two-speaker system cannot achieve the effect of a surround 5.1 speaker system.
To address these technical problems, the present disclosure provides audio packet format metadata and a method for generating it.
Disclosure of Invention
An object of the present disclosure is to provide object-based audio packet format metadata and a generation method, apparatus, and medium that solve at least one of the above technical problems.
To achieve the above object, a first aspect of the present disclosure provides audio packet format metadata, comprising:
the attribute area comprises an audio packet format identifier and an audio packet format name of an audio packet, wherein the audio packet format identifier comprises information indicating that the audio type of the audio packet is an object type;
a sub-element area comprising first reference information, second reference information, and an absolute distance, wherein the first reference information comprises the audio channel format information used during rendering by the audio channels associated with the audio packet, the second reference information comprises the audio packet format information used during rendering by the audio packets associated with this audio packet, and the absolute distance represents the distance from a preset central point of a perceivable effect object that the sound couples out into space during rendering to the origin of a preset coordinate system, the audio type of the effect object being the object type.
To achieve the above object, a second aspect of the present disclosure provides a method for generating metadata in an audio packet format, including:
generating metadata comprising the audio packet format as described in the first aspect.
To achieve the above object, a third aspect of the present disclosure provides an electronic device, including: a memory and one or more processors;
the memory for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to generate metadata comprising the audio packet format described in the first aspect.
To achieve the above object, a fourth aspect of the present disclosure provides a storage medium containing computer-executable instructions which, when executed by a computer processor, generate metadata comprising the audio packet format described in the first aspect.
As can be seen from the above, the disclosed audio packet format metadata comprises: an attribute area, which includes an audio packet format identifier and an audio packet format name of the audio packet, wherein the audio packet format identifier includes information indicating that the audio type of the audio packet is the object type; and a sub-element area, which includes first reference information, second reference information, and an absolute distance, wherein the first reference information includes the audio channel format information used during rendering by the audio channels associated with the audio packet, the second reference information includes the audio packet format information used during rendering by the audio packets associated with this audio packet, and the absolute distance represents the distance from a preset central point of a perceivable effect object that the sound couples out into space during rendering to the origin of a preset coordinate system, the audio type of the effect object being the object type. The audio packet format metadata describes the absolute distance of each effect object in space, enabling the reproduction of three-dimensional sound in space and thereby improving the quality of the sound scene.
Drawings
FIG. 1 is a schematic diagram of a three-dimensional acoustic audio production model provided in an embodiment of the present disclosure;
fig. 2 is a schematic structural diagram of metadata in an audio packet format provided in embodiment 1 of the present disclosure;
fig. 3 is a flowchart of a method for generating metadata in an audio packet format according to embodiment 2 of the present disclosure;
fig. 4 is a schematic structural diagram of an electronic device provided in embodiment 3 of the present disclosure.
Detailed Description
The following examples are intended to illustrate the present disclosure, but are not intended to limit the scope of the present disclosure.
Metadata is information that describes the structural characteristics of data; the functions it supports include indicating storage locations, recording history, looking up resources, and documenting files.
As shown in fig. 1, the three-dimensional sound audio production model is composed of a set of production elements, each of which uses metadata to describe the structural characteristics of the data at the corresponding stage of audio production. The model includes a content production part and a format production part.
The production elements of the content production section include: an audio program element, an audio content element, an audio object element, and a soundtrack unique identification element.
An audio program includes narration, sound effects, and background music; it references one or more audio contents that are combined together to construct a complete audio program. The audio program element produces the audio program and generates metadata describing the structural characteristics of the audio program.
Audio content describes one component of an audio program, such as the background music, and relates that content to its format by referencing one or more audio objects. The audio content element produces the audio content and generates metadata describing the structural characteristics of the audio content.
An audio object uses soundtrack unique identification elements to establish the relationship between content, format, and assets, and to determine the soundtrack unique identifications of the actual soundtracks. The audio object element produces audio objects and generates metadata describing the structural characteristics of the audio objects.
The soundtrack unique identification element produces a soundtrack unique identification and generates metadata describing the structural characteristics of the soundtrack unique identification.
The production elements of the format production part include: an audio packet format element, an audio channel format element, an audio stream format element, an audio track format element.
The audio packet format is the format used when audio objects and original audio data are packed into channel groups; an audio packet format may contain nested audio packet formats. The audio packet format element produces the audio packet data. The audio packet data includes audio packet format metadata, which describes the structural characteristics of the audio packet format.
An audio channel format represents a single sequence of audio samples on which certain operations can be performed, such as moving a rendered object within a scene. An audio channel format may contain nested audio channel formats. The audio channel format element produces the audio channel data. The audio channel data includes audio channel format metadata, which describes the structural characteristics of the audio channel format.
An audio stream is a combination of audio tracks needed to render a channel, an object, higher-order ambient sound components, or a packet. The audio stream format establishes the relationship between a set of audio track formats and a set of audio channel formats or audio packet formats. The audio stream format element produces the audio stream data. The audio stream data includes audio stream format metadata, which describes the structural characteristics of the audio stream format.
An audio track format corresponds to a set of samples or data in a single audio track of the storage medium; it describes the format of the original audio data and of the decoded signals of the renderer. The audio track format is derived from the audio stream format and identifies the combination of audio tracks required to successfully decode the audio track data. The audio track format element produces the audio track data. The audio track data includes audio track format metadata, which describes the structural characteristics of the audio track format.
Each stage of the three-dimensional audio production model produces metadata that describes the characteristics of that stage.
After audio channel data produced with the three-dimensional sound audio production model is transmitted to a far end over a communication link, the far end renders the audio channel data stage by stage based on the metadata, restoring the produced sound scene.
Example 1
The present disclosure provides the audio packet format metadata of the three-dimensional sound audio production model and describes it in detail.
In the audio packet format element of the three-dimensional sound audio production model, the metadata of audio objects and the audio stream data are divided by channel into a number of data blocks called audio packets. These audio packets are transmitted along different paths in one or more networks and reassembled at the destination. The disclosed embodiment uses the audio packet format metadata 100 to describe the structural information of the audio packet format.
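The packetization just described, dividing per-channel audio data into blocks that can travel independently and be reassembled, can be sketched as follows; the function and field names are illustrative assumptions, not part of the disclosure.

```python
def packetize(channel_samples, packet_size):
    """Split each channel's sample sequence into fixed-size audio packets."""
    packets = []
    for channel_id, samples in channel_samples.items():
        for offset in range(0, len(samples), packet_size):
            packets.append({
                "channel": channel_id,   # each packet carries data of one channel only
                "offset": offset,        # position of this block within the channel
                "samples": samples[offset:offset + packet_size],
            })
    return packets

packets = packetize({"ch1": [0.0] * 10, "ch2": [0.1] * 10}, packet_size=4)
# Because every packet names its channel and offset, a receiver can
# reassemble each channel regardless of the network paths the packets took.
```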
As shown in fig. 2, the metadata 100 of the audio packet format includes a property area 110 and a sub-element area 120.
The attribute section 110 includes an audio packet format identifier 111 and an audio packet format name 112 of the audio packet.
The audio packet format identifier 111 includes information indicating that the audio type of the audio packet is an object type.
In the disclosed embodiment, the audio types include: bed type, matrix type, object type, scene type, and binaural channel type. Audio channel data for each audio type is generated by the above-described three-dimensional acoustic audio production model.
Object-type audio packet data can be understood as the original audio data of each audio channel with appended position information, together with metadata related to spatial and/or signal characteristics. A perceivable effect object that the sound couples out into space represents a single sound within the overall scene, and multiple distinct effect objects together build the complete sound scene. The audio type of an effect object is the object type. Some effect objects exist only for a limited time, while others can move around and change their characteristics over time.
The audio packet format identifier 111 includes information indicating that the audio type of the audio packet is the object type. For example, the identifier 111 takes the form "AP_y1y2y3y4x1x2x3x4", where "y1y2y3y4" is the information indicating that the audio type of the audio packet is the object type, expressed as numeric and/or symbolic characters; for instance, the four-digit hexadecimal numeric string "0003", the symbolic string "abcd", or the mixed string "ab03" can indicate that the audio type of the audio packet is the object type.
Optionally, the audio packet format identifier 111 further includes information indicating the specific type of advanced sound system used by the audio program. Continuing the example above, with the identifier 111 in the form "AP_y1y2y3y4x1x2x3x4", "x1x2x3x4" is a string of numeric characters, for example a four-digit hexadecimal number: the range "0001" through "0FFF" indicates a specific advanced sound system type specified in Recommendation ITU-R BS.2094 established by the International Telecommunication Union (ITU), while the range "1000" through "FFFF" indicates a custom advanced sound system type. The audio packet format identifier is therefore unique, and it provides cross-reference association information to the production elements; this reduces the amount of information stored and improves data processing efficiency.
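A minimal sketch of parsing an identifier of the "AP_y1y2y3y4x1x2x3x4" form described above. Only the object-type code "0003" and the "0001"-"0FFF" / "1000"-"FFFF" split are taken from the text; the rest of the type-code table and the helper itself are assumptions for illustration.

```python
import re

# Assumed code table; only "0003" (object) is confirmed by the text above.
AUDIO_TYPES = {"0001": "bed", "0002": "matrix", "0003": "object",
               "0004": "scene", "0005": "binaural"}

def parse_pack_format_id(identifier):
    match = re.fullmatch(r"AP_([0-9A-Fa-f]{4})([0-9A-Fa-f]{4})", identifier)
    if match is None:
        raise ValueError("not an audio packet format identifier: %s" % identifier)
    type_code, system_code = match.groups()
    # Per the description: 0001-0FFF are ITU-R BS.2094 defined systems,
    # 1000-FFFF are custom advanced sound system types.
    custom = int(system_code, 16) >= 0x1000
    return {"audio_type": AUDIO_TYPES.get(type_code, "unknown"),
            "custom_system": custom}

print(parse_pack_format_id("AP_00031001"))
```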
The general attributes of the audio packet format element are shown in Table 1.
TABLE 1
(The table content is provided only as an image in the original publication.)
The sub-element region 120 includes: first reference information 121, second reference information 122, and absolute distance 123.
The first reference information 121 includes the audio channel format information used during rendering by the audio channels associated with the audio packet.
The second reference information 122 includes the audio packet format information used during rendering by the audio packets associated with this audio packet.
Since audio packets are formed by grouping audio channels, each audio packet contains the original audio data of the same audio channel.
The audio type of the effect object is the object type.
The absolute distance 123 represents the distance from a preset central point of a perceivable effect object that sound couples out into space during rendering to the origin of a preset coordinate system. The preset coordinate system may be spherical or Cartesian.
If a transducer is used to measure the energy that sound couples out into space during rendering, the energy signal is strongest at the position of the effect object. Owing to the interaction of multi-channel sounds, the effect object can take on various shapes; the preset central point may be the geometric center of that shape, or the shape may be enclosed in a sphere whose center serves as the preset central point.
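The two conventions above for choosing the preset central point, and the absolute distance from that point to the coordinate origin, can be sketched with simple geometry helpers (Cartesian coordinates assumed; the functions are purely illustrative, not part of the disclosure):

```python
import math

def geometric_center(points):
    # Centroid of the sample points describing the effect object's shape.
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def bounding_sphere_center(points):
    # Center of the axis-aligned box enclosing the shape, used here as a
    # simple stand-in for the center of an enclosing sphere.
    return tuple((min(p[i] for p in points) + max(p[i] for p in points)) / 2
                 for i in range(3))

def absolute_distance(center, origin=(0.0, 0.0, 0.0)):
    # Distance from the preset central point to the coordinate origin.
    return math.dist(center, origin)

pts = [(1.0, 0.0, 0.0), (3.0, 0.0, 0.0), (2.0, 2.0, 0.0)]
print(absolute_distance(geometric_center(pts)))
```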
The sub-elements of the audio packet format element are shown in table 2.
TABLE 2
(The table content is provided only as an image in the original publication.)
Optionally, an absolute distance 123 of zero indicates that the effect object is absent.
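As a compact illustration, the attribute area 110 and sub-element area 120 described above can be modeled as a plain data structure. All field names here, and the reference identifier "AC_00031001", are assumptions for illustration rather than part of any published specification.

```python
from dataclasses import dataclass, field

@dataclass
class AudioPackFormatMetadata:
    # attribute area 110
    pack_format_id: str    # e.g. "AP_00031001"; "0003" marks the object type
    pack_format_name: str
    # sub-element area 120
    channel_format_refs: list = field(default_factory=list)  # first reference information 121
    pack_format_refs: list = field(default_factory=list)     # second reference information 122 (nested packs)
    absolute_distance: float = 0.0                           # absolute distance 123

    def has_effect_object(self):
        # Convention from the text: a zero absolute distance indicates
        # an absent effect object.
        return self.absolute_distance != 0.0

meta = AudioPackFormatMetadata("AP_00031001", "forest_birds",
                               channel_format_refs=["AC_00031001"],
                               absolute_distance=2.5)
print(meta.has_effect_object())  # True
```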
The metadata 100 of the audio packet format of the embodiment of the present disclosure describes the absolute distance 123 of each effect object in the space, which enables reproduction of three-dimensional sound in the space, thereby improving the quality of a sound scene.
Optionally, the attribute area 110 further includes a channel type label indicating the channel type adopted by the audio channels of the audio channel format or audio packet format referenced downward during rendering.
As shown in fig. 1, in the three-dimensional sound audio production model, the downward-referenced production element can be understood as the production element that follows the audio packet format element.
If the production element following the audio packet format element is an audio channel format element, the attribute area 110 of the audio packet format metadata includes a channel type label indicating the channel type adopted by the audio channels of the downward-referenced audio channel format during rendering. If the following production element is itself an audio packet format element (i.e., the produced audio packet format contains a nested audio packet format), the attribute area 110 includes a channel type label indicating the channel type adopted by the audio channels of the downward-referenced audio packet format during rendering.
The channel type label of an audio channel characterizes the channel type of that audio channel. For example, the channel type label "0001" represents the sound bed type, meaning that each audio channel's data is output directly to the corresponding speaker during playback; "0002" represents the matrix type, meaning the data carries the rendering parameter values used when each audio channel is rendered as a matrix; "0003" represents the object type, meaning that during playback the audio channel data couples out into space as a perceivable effect object; "0004" represents the scene type, meaning that during playback the audio channel data forms scene audio based on environmental acoustics and higher-order ambient sounds; "0005" represents the binaural channel type, meaning the audio channel data is played back through headphones.
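The channel type labels enumerated above can be collected into a simple lookup table. The label values follow the text; the helper itself is an illustrative assumption.

```python
# Channel type labels as enumerated in the description above.
CHANNEL_TYPE_LABELS = {
    "0001": "bed",       # routed directly to the matching loudspeaker
    "0002": "matrix",    # carries matrix rendering parameter values
    "0003": "object",    # perceivable effect object positioned in space
    "0004": "scene",     # scene audio from environmental acoustics and
                         # higher-order ambient sounds
    "0005": "binaural",  # played back over headphones
}

def channel_type(label):
    return CHANNEL_TYPE_LABELS.get(label, "unknown")

print(channel_type("0003"))
```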
Optionally, the attribute area 110 further includes an audio type indicating the audio type adopted by the audio channels of the audio object or audio packet format referenced upward during rendering.
As shown in fig. 1, in the three-dimensional sound audio production model, the upward-referenced production element can be understood as the production element that precedes the audio packet format element.
If the production element preceding the audio packet format element is an audio object element, the attribute area 110 of the audio packet format metadata includes an audio type indicating the audio type adopted by the audio channels of the audio object that references the audio packet format during rendering. If the preceding production element is itself an audio packet format element (i.e., the produced audio packet format is contained in a nested audio packet format), the attribute area 110 includes an audio type indicating the audio type adopted by the audio channels of the audio packet format that references this one upward during rendering. The audio types an audio channel can adopt include: the bed type, matrix type, object type, scene type, and binaural channel type.
Optionally, the attribute area 110 further includes importance information indicating the importance of the audio packet format metadata 100 during rendering.
Based on the importance information, high-importance audio packet format metadata 100 can be rendered first, and low-importance metadata 100 can even be discarded as needed to meet the requirements of the rendering schedule.
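A hedged sketch of the importance-driven selection just described: metadata entries are rendered in descending order of importance, and low-importance entries may be dropped when the renderer falls behind. The data layout, field names, and thresholds are assumptions for illustration.

```python
def select_for_rendering(pack_metadata, budget, min_importance=0):
    # Render the most important packs first; drop anything below the
    # importance threshold, then keep at most `budget` entries.
    ordered = sorted(pack_metadata, key=lambda m: m["importance"], reverse=True)
    kept = [m for m in ordered if m["importance"] >= min_importance]
    return kept[:budget]

packs = [{"id": "AP_00031001", "importance": 10},
         {"id": "AP_00031002", "importance": 2},
         {"id": "AP_00031003", "importance": 7}]
print([p["id"] for p in select_for_rendering(packs, budget=2)])
```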
The embodiment of the present disclosure describes the absolute distance 123 of each effect object in the space through the metadata 100 in the audio packet format, and can realize the reproduction of three-dimensional sound in the space, thereby improving the quality of a sound scene.
Example 2
The present disclosure also provides a method embodiment corresponding to the above embodiment: a method for generating audio packet format metadata. Terms with the same names have the same meanings and the same technical effects as in the above embodiment and are not explained again here.
As shown in fig. 3, a method for generating metadata in an audio packet format includes the steps of:
step S210, generating metadata in an audio packet format, where the metadata in the audio packet format includes:
the attribute area comprises an audio packet format identifier and an audio packet format name of an audio packet, wherein the audio packet format identifier comprises information indicating that the audio type of the audio packet is an object type;
a sub-element area comprising first reference information, second reference information, and an absolute distance, wherein the first reference information comprises the audio channel format information used during rendering by the audio channels associated with the audio packet, the second reference information comprises the audio packet format information used during rendering by the audio packets associated with this audio packet, and the absolute distance represents the distance from a preset central point of a perceivable effect object that the sound couples out into space during rendering to the origin of a preset coordinate system, the audio type of the effect object being the object type.
Optionally, the attribute area further includes an audio type indicating the audio type adopted by the audio channels of the audio object or audio packet format referenced upward during rendering.
Optionally, the attribute area further includes a channel type label indicating the channel type adopted by the audio channels of the audio channel format or audio packet format referenced downward during rendering.
Optionally, the attribute area further includes importance information indicating the importance of the audio packet format metadata during rendering.
Optionally, an absolute distance of zero indicates that the effect object is absent.
Optionally, the audio packet format identifier further includes information indicating the specific type of advanced sound system used by the audio program.
The embodiment of the present disclosure generates metadata in an audio packet format, which describes an absolute distance of each effect object in a space, and can realize reproduction of three-dimensional sound in the space, thereby improving the quality of a sound scene.
Example 3
Fig. 4 is a schematic structural diagram of an electronic device provided in Embodiment 3 of the present disclosure. As shown in fig. 4, the electronic device includes: a processor 30, a memory 31, an input device 32, and an output device 33. The electronic device may have one or more processors 30 and one or more memories 31; fig. 4 takes one of each as an example. The processor 30, memory 31, input device 32, and output device 33 of the electronic device may be connected by a bus or by other means; fig. 4 illustrates a bus connection. The electronic device can be a computer, a server, or the like. This embodiment takes a server as the example electronic device; the server may be a standalone server or a server cluster.
Memory 31 is provided as a computer-readable storage medium that may be used to store software programs, computer-executable programs, and modules, such as program instructions/modules, that generate metadata in the form of audio packets, as described in any embodiment of the present disclosure. The memory 31 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the device, and the like. Further, the memory 31 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the memory 31 may further include memory located remotely from the processor 30, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 32 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the electronic device; it may also include a camera for capturing images and a sound pickup device for capturing audio data. The output device 33 may include an audio device such as a speaker. The specific composition of the input device 32 and the output device 33 can be set according to actual conditions.
The processor 30 executes various functional applications of the device and data processing, i.e., generates metadata in the form of audio packets, by executing software programs, instructions, and modules stored in the memory 31.
Example 4
Embodiment 4 of the present disclosure also provides a storage medium containing computer-executable instructions which, when executed by a computer processor, generate metadata comprising the audio packet format described in Embodiment 1.
Of course, the computer-executable instructions contained in the storage medium provided by the embodiments of the present disclosure are not limited to the operations described above; they may also perform related operations in the method provided by any embodiment of the present disclosure, with the corresponding functions and advantages.
From the above description of the embodiments, it is clear to a person skilled in the art that the present disclosure can be implemented by software together with necessary general-purpose hardware, and certainly also by hardware alone, though in many cases the former is the better embodiment. Based on this understanding, the technical solutions of the present disclosure may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH memory (FLASH), a hard disk, or an optical disk of a computer, and which includes several instructions for enabling a computer device (which may be a robot, a personal computer, a server, or a network device) to execute the method according to any embodiment of the present disclosure.
It should be noted that the units and modules included in the above electronic device are divided merely according to functional logic; the division is not limited thereto as long as the corresponding functions can be realized. In addition, the specific names of the functional units are used only to distinguish them from one another and do not limit the protection scope of the present disclosure.
It should be understood that portions of the present disclosure may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, they may be implemented with any one or a combination of the following techniques known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGA), field-programmable gate arrays (FPGA), and the like.
In the description herein, references to the terms "in an embodiment," "in yet another embodiment," "exemplary," or "in a particular embodiment," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present disclosure. In this specification, such schematic expressions do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although the present disclosure has been described in detail above by way of general description, specific embodiments, and experiments, it will be apparent to those skilled in the art that modifications or improvements may be made on the basis of the present disclosure. Accordingly, such modifications and improvements are intended to fall within the scope of protection claimed by this disclosure.

Claims (9)

1. Audio packet format metadata, comprising:
an attribute area comprising an audio packet format identifier and an audio packet format name of an audio packet, wherein the audio packet format identifier comprises information indicating that the audio type of the audio packet is the object type; and
a sub-element area comprising first reference information, second reference information, and an absolute distance, wherein the first reference information comprises audio channel format information used during rendering by an audio channel associated with the audio packet, the second reference information comprises audio packet format information used during rendering by another audio packet associated with the audio packet, and the absolute distance represents the distance from a preset center point of an effect object, perceived in space when the sound is rendered, to the origin of a preset coordinate system, the audio type of the effect object being the object type.
2. The audio packet format metadata according to claim 1, wherein the attribute area further comprises an audio type indicating the type of audio channel in which the audio object or audio packet format referenced upward during rendering is located.
3. The audio packet format metadata according to claim 1, wherein the attribute area further comprises a channel type label indicating the type of audio channel in which the audio channel format or audio packet format referenced downward during rendering is located.
4. The audio packet format metadata according to claim 1, wherein the attribute area further comprises information indicating the importance of the audio packet format metadata during rendering.
5. The audio packet format metadata according to claim 1, wherein, when the absolute distance is zero, the effect object is not present.
6. The audio packet format metadata according to claim 1, wherein the audio packet format identifier further comprises information indicating a specific audio program type of an audio enhancement system.
7. A method for generating audio packet format metadata, comprising:
generating the audio packet format metadata according to any one of claims 1-6.
8. An electronic device, comprising: a memory and one or more processors;
the memory being configured to store one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to generate the audio packet format metadata according to any one of claims 1-6.
9. A storage medium containing computer-executable instructions which, when executed by a computer processor, generate the audio packet format metadata according to any one of claims 1 to 6.
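The structure recited in claim 1 can be illustrated with a minimal sketch. This is not the patent's actual schema: the field names, the ID prefix convention (`AP_0003` taken to mean "object type", by analogy with the Audio Definition Model), and the helper method are all assumptions introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AudioPackFormatMetadata:
    """Hypothetical layout of the object-based audio packet format metadata of claim 1."""
    # Attribute area
    audio_pack_format_id: str        # identifier; assumed to encode the audio type
    audio_pack_format_name: str
    # Sub-element area
    channel_format_refs: List[str] = field(default_factory=list)  # first reference information
    pack_format_refs: List[str] = field(default_factory=list)     # second reference information
    absolute_distance: float = 0.0   # distance from the effect object's center to the origin

    def is_object_type(self) -> bool:
        # Assumed convention: a type code embedded in the identifier marks the object type.
        return self.audio_pack_format_id.startswith("AP_0003")

    def has_effect_object(self) -> bool:
        # Per claim 5: a zero absolute distance means no effect object is present.
        return self.absolute_distance != 0.0

meta = AudioPackFormatMetadata(
    audio_pack_format_id="AP_00031001",
    audio_pack_format_name="DialogueObject",
    channel_format_refs=["AC_00031001"],
    absolute_distance=2.5,
)
```

Under these assumptions, a generator (claims 7-9) would populate the attribute area first, then attach the reference information and the absolute distance as sub-elements.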
CN202111308430.6A 2021-11-05 2021-11-05 Object-based audio packet format metadata and generation method, apparatus, and medium Pending CN114023340A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111308430.6A CN114023340A (en) 2021-11-05 2021-11-05 Object-based audio packet format metadata and generation method, apparatus, and medium


Publications (1)

Publication Number Publication Date
CN114023340A true CN114023340A (en) 2022-02-08

Family

ID=80061471

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111308430.6A Pending CN114023340A (en) 2021-11-05 2021-11-05 Object-based audio packet format metadata and generation method, apparatus, and medium

Country Status (1)

Country Link
CN (1) CN114023340A (en)

Similar Documents

Publication Publication Date Title
CN109410912B (en) Audio processing method and device, electronic equipment and computer readable storage medium
CN109165005B (en) Sound effect enhancement method and device, electronic equipment and storage medium
CN110191745B (en) Game streaming using spatial audio
CN104506920A (en) Method and device for playing omnimedia data information
CN113905321A (en) Object-based audio channel metadata and generation method, device and storage medium
CN114023340A (en) Object-based audio packet format metadata and generation method, apparatus, and medium
EP3824464A1 (en) Controlling audio focus for spatial audio processing
CN114339297B (en) Audio processing method, device, electronic equipment and computer readable storage medium
CN114203189A (en) Method, apparatus and medium for generating metadata based on binaural audio packet format
CN114023339A (en) Audio-bed-based audio packet format metadata and generation method, device and medium
CN114203188A (en) Scene-based audio packet format metadata and generation method, device and storage medium
CN114512152A (en) Method, device and equipment for generating broadcast audio format file and storage medium
CN114121036A (en) Audio track unique identification metadata and generation method, electronic device and storage medium
CN114203190A (en) Matrix-based audio packet format metadata and generation method, device and storage medium
CN114051194A (en) Audio track metadata and generation method, electronic equipment and storage medium
CN115190412A (en) Method, device and equipment for generating internal data structure of renderer and storage medium
CN114143695A (en) Audio stream metadata and generation method, electronic equipment and storage medium
CN115038029A (en) Rendering item processing method, device and equipment of audio renderer and storage medium
CN114530157A (en) Audio metadata channel allocation block generation method, apparatus, device and medium
CN113963724A (en) Audio content metadata and generation method, electronic device and storage medium
CN113923264A (en) Scene-based audio channel metadata and generation method, device and storage medium
CN115529548A (en) Speaker channel generation method and device, electronic device and medium
CN113990355A (en) Audio program metadata and generation method, electronic device and storage medium
CN114979935A (en) Object output rendering item determination method, device, equipment and storage medium
CN113905322A (en) Method, device and storage medium for generating metadata based on binaural audio channel

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination