CN114519121A

CN114519121A - Audio serial metadata block generation method, device, equipment and storage medium

Info

Publication number: CN114519121A
Application number: CN202111666363.5A
Authority: CN
Inventors: 吴健
Original assignee: Saiyinxin Micro Beijing Electronic Technology Co ltd
Current assignee: Saiyinxin Micro Beijing Electronic Technology Co ltd
Priority date: 2021-12-30
Filing date: 2021-12-30
Publication date: 2022-05-20

Abstract

The present disclosure relates to an audio serial metadata block generation method, apparatus, device, and storage medium, the method comprising: acquiring a header sub-element and an audio serial scalable block element of an audio serial metadata block; and placing the audio serial scalable block element after the header sub-element to generate the audio serial metadata block, which is used for storing and transmitting time-varying metadata; wherein each audio serial scalable block element corresponds to a unique number of audio samples from a group of neighboring audio serial scalable block elements, and an audio format extension tag sub-element of the audio serial scalable block element includes set audio model metadata. The audio serial metadata block is thereby generated in audio serial mode, implementing the workflow of audio model metadata.

Description

Audio serial metadata block generation method, device, equipment and storage medium

Technical Field

The present disclosure relates to the field of audio processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for generating an audio serial metadata block.

Background

With the development of technology, audio has become more and more complex. Early single-channel audio gave way to stereo, and the focus of work shifted to the correct handling of the left and right channels. Processing became more complex with the arrival of surround sound. The surround 5.1 speaker system imposes an ordering constraint on multiple channels, and the surround 6.1 speaker system, the surround 7.1 speaker system and the like further diversify audio processing: the correct signal must be delivered to the proper speaker so that the speakers work together to produce the intended effect. Thus, as sound becomes more immersive and interactive, the complexity of audio processing also increases greatly.

Sound channels (or audio channels) refer to audio signals that are independent of each other and that are captured or played back at different spatial locations when sound is recorded or played. The number of channels is the number of sound sources when recording sound, or the corresponding number of speakers when playing back sound. For example, in a surround 5.1 speaker system comprising audio signals at 6 different spatial positions, each separate audio signal is used to drive a speaker at the corresponding spatial position; in a surround 7.1 speaker system comprising audio signals at 8 different spatial positions, each separate audio signal is used to drive a speaker at the corresponding spatial position.

Therefore, the effect achieved by current loudspeaker systems depends on the number and spatial position of the loudspeakers. For example, a two-channel speaker system cannot achieve the effect of a surround 5.1 speaker system.

Disclosure of Invention

The present disclosure provides a method, an apparatus, a device, and a storage medium for generating an audio serial metadata block in an audio serial mode, so as to implement a workflow of sound model metadata.

A first aspect of the present disclosure provides an audio serial metadata block generation method, including:

acquiring a header sub-element and an audio serial scalable block element of an audio serial metadata block;

placing the audio serial scalable block element after the header sub-element to generate the audio serial metadata block, which is used for storing and transmitting time-varying metadata;

wherein each of the audio-serially scalable block elements corresponds to a unique number of audio samples from a group of neighboring audio-serially scalable block elements, and an audio-format-extension-tag sub-element of the audio-serially scalable block element includes set audio model metadata.

A second aspect of the present disclosure provides an audio serial metadata chunk generating apparatus, including:

an obtaining module, configured to obtain header sub-elements and audio serial scalable block elements of an audio serial metadata block;

an audio serial metadata block generation module, configured to place the header sub-element in front of the audio serial scalable block element, and then place the audio serial scalable block element behind the header sub-element, and generate the audio serial metadata block for storing and transmitting time-varying metadata;

A third aspect of the present disclosure provides an electronic device, comprising: a memory and one or more processors;

the memory for storing one or more programs;

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the audio serial metadata block generation method provided by any of the embodiments.

A fourth aspect of the present disclosure provides a storage medium containing computer-executable instructions that, when executed by a computer processor, implement the audio serial metadata block generation method provided in any embodiment.

As can be seen from the above, the audio serial metadata block generation method of the present disclosure generates an audio serial metadata block in audio serial mode; the metadata contained in the chunks of a broadcast audio format file can be used to describe the BW64 format in use, and compatibility is maintained as far as possible by using shared core elements.

Drawings

FIG. 1 is a schematic diagram of a three-dimensional acoustic audio model provided in an embodiment of the present disclosure;

FIG. 2 is a flowchart of a method for generating audio serial metadata blocks according to an embodiment of the present disclosure;

FIG. 3 is a block diagram illustrating an exemplary audio serial metadata block generator according to the present disclosure;

FIG. 4 is a schematic structural diagram of an electronic device in an embodiment of the present disclosure.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.

Examples

As shown in fig. 1, a three-dimensional acoustic audio model is composed of a set of elements, each element describing one stage of audio production, the three-dimensional acoustic audio model including a content part and a format part.

The content part comprises: an audio program element, an audio content element, an audio object element, and an audio track unique identification element; the format part includes: an audio packet format element, an audio channel format element, an audio stream format element, and an audio track format element;

the audio program element references at least one audio content element; the audio content element references at least one audio object element; the audio object element references the corresponding audio packet format element and the corresponding audio track unique identification element; the audio track unique identification element references the corresponding audio track format element and the corresponding audio packet format element;

the audio packet format element references at least one audio channel format element; the audio stream format element references the corresponding audio channel format element and the corresponding audio packet format element; the audio track format element and the corresponding audio stream format element reference each other. The reference relationships between elements are indicated by arrows in fig. 1.
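The reference relationships listed above can be pictured as a small directed graph. The following is a minimal sketch only: the short element names and the `reachable` helper are illustrative labels chosen here, not identifiers defined by the disclosure.

```python
# Reference relationships as listed in the text; the short names are
# illustrative labels, not identifiers defined by the disclosure.
REFERENCES = {
    "program": ["content"],
    "content": ["object"],
    "object": ["packet_format", "track_uid"],
    "track_uid": ["track_format", "packet_format"],
    "packet_format": ["channel_format"],
    "stream_format": ["channel_format", "packet_format", "track_format"],
    "track_format": ["stream_format"],
}


def reachable(start):
    """Return every element reachable from `start` by following references."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for target in REFERENCES.get(node, []):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


if __name__ == "__main__":
    print(sorted(reachable("program")))
```

Starting from the program element, every other element of the content part and the format part can be reached, which matches the description of the program as the top of the model.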

An audio program may include, but is not limited to, narration, sound effects, and background music. The audio program element is used to describe a program that includes at least one piece of content, and each audio content element describes the corresponding content of the audio program element. An audio program element may reference one or more audio content elements, which are grouped together to construct a complete audio program.

The audio content elements describe the content of a component of an audio program, such as background music, and relate the content to its format by reference to one or more audio object elements.

The audio object elements are used to establish the relationship between content, format, and other useful information, and to determine the audio track unique identifications of the actual audio tracks.

The format part includes: an audio packet format element, an audio channel format element, an audio stream format element, an audio track format element.

The audio packet format element may be configured to describe a format adopted when the audio object element and the original audio data are packed according to channel packets.

The audio channel format element may be used to represent a single sequence of audio samples and preset operations performed on it, such as movement of rendering objects in a scene. The audio channel format element may comprise at least one audio block format element. The audio block format elements may be considered to be sub-elements of the audio channel format elements, and therefore there is an inclusion relationship between the audio channel format elements and the audio block format elements.

An audio stream is a combination of the audio tracks needed to render a channel, an object, a higher-order Ambisonics component, or a packet. The audio stream format element is used to establish the relationship between a set of audio track format elements and a set of audio channel format elements, or between a set of audio track format elements and an audio packet format element.

The audio track format elements correspond to a set of samples or data in a single audio track, and are used to describe the format of the original audio data, and the decoded signals of the renderer, and also to identify the combination of audio tracks required to successfully decode the audio track data.

After the original audio data is produced through the three-dimensional sound audio model, synthesized audio data containing metadata is generated.

Metadata is information describing the characteristics of data; the functions supported by metadata include indicating storage location, recording history data, resource lookup, and file records.

After the synthesized audio data is transmitted to the far end by communication, the far end renders the synthesized audio data based on the metadata to restore the original sound scene.

The division between the content part, the format part, and the 64-bit Broadcast Wave (BW64) file is shown in fig. 1. The content part and the format part together constitute metadata in XML format, which is usually contained in an audio extensible markup language (axml) chunk of a BW64 file. The BW64 file portion at the bottom contains the channel allocation (chna) chunk, which is a look-up table for linking the metadata to the audio tracks in the file.
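To make the chunk layout concrete, the sketch below walks the top-level chunks of a RIFF-style byte string, where each chunk is a 4-byte identifier, a 32-bit little-endian size, and a payload padded to an even length. It is a simplified illustration: the function names are hypothetical, the payloads are dummy bytes, and a real BW64 file normally relies on a ds64 chunk for 64-bit sizes, which this toy example ignores.

```python
import struct


def make_chunk(ck_id, payload):
    """Serialize one chunk: identifier, 32-bit size, payload, optional pad byte."""
    pad = b"\x00" if len(payload) & 1 else b""
    return struct.pack("<4sI", ck_id, len(payload)) + payload + pad


def iter_chunks(data):
    """Yield (identifier, payload) for each top-level chunk after the 12-byte file header."""
    pos = 12  # skip file identifier, file size, and form type
    while pos + 8 <= len(data):
        ck_id, ck_size = struct.unpack_from("<4sI", data, pos)
        start = pos + 8
        yield ck_id.decode("ascii"), data[start:start + ck_size]
        pos = start + ck_size + (ck_size & 1)  # chunks are padded to even length


def make_file(chunks):
    """Build a toy BW64-style file from (identifier, payload) pairs."""
    body = b"".join(make_chunk(ck_id, payload) for ck_id, payload in chunks)
    return struct.pack("<4sI4s", b"BW64", 4 + len(body), b"WAVE") + body


if __name__ == "__main__":
    demo = make_file([(b"axml", b"<x/>"), (b"chna", b"abc"), (b"data", b"\x00\x00")])
    for ck_id, payload in iter_chunks(demo):
        print(ck_id, len(payload))
```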

The content part describes the technical content of the audio, for example whether it contains dialogue or a specific language, as well as loudness metadata. The format part describes the channel types of the audio tracks and how they are combined, for example the left and right channels of a stereo pair. The metadata of the content part is typically unique to the audio and the program, while the elements of the format part may be reused.

The present disclosure provides an audio serial metadata block generation method, as shown in fig. 2, the method including:

S210, acquiring a header sub-element and an audio serial scalable block element of the audio serial metadata block;

S220, placing the audio serial scalable block element after the header sub-element to generate the audio serial metadata block, which is used for storing and transmitting time-varying metadata.

Optionally, the audio serial metadata block is arranged together with other Resource Interchange File Format (RIFF) chunks in the same audio file, and the audio serial metadata block contains audio data conforming to an audio format file or associated with an audio data segment.

Optionally, the block format element in the obtained header sub-element is used to describe a specification of a block containing audio model metadata and a specification of a related audio block.

Optionally, the audio serial metadata block is completed by an alignment point table that allows selected sub-elements of the audio serial scalable block to be accessed by timestamp; the total length of the audio serial metadata block is not fixed.

Optionally, obtaining the elements of the audio serial metadata block includes:

obtaining attributes and elements of the audio serial metadata block, wherein the attributes of the audio serial metadata block include: a block identifier, a data size of a byte block, a compression method of a text, an audio serial scalable block low byte, an audio serial scalable block high byte, an audio serial scalable block valid entry number, an audio serial scalable block sub-block array, an alignment point valid entry number, and an alignment point array;

the data size of the byte block represents the size of the block in bytes, excluding the bytes used by the block identifier and by the data size of the byte block itself;

the compression method of the text represents the compression method applied to the Extensible Markup Language (XML) text;

the audio serial scalable block low byte represents the low-order part of the size of the audio serial scalable block table, that is, of the bytes occupied by the audio serial scalable block sub-block array field;

the audio serial scalable block high byte represents the high-order part of the size of the audio serial scalable block table, that is, of the bytes occupied by the audio serial scalable block sub-block array field.
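The header attributes above can be illustrated with a fixed-size binary record. This is a sketch under stated assumptions: the text does not give field widths, so the 32-bit and 16-bit widths, the little-endian byte order, and the function and key names below are all hypothetical; only the `sxml` identifier and the split of the table size into a low and a high part come from the description.

```python
import struct

# Assumed layout (not given in the text): 4-byte identifier, 32-bit data size,
# 16-bit compression method, 32-bit low and high halves of the sub-block table
# size, and a 32-bit count of valid sub-block entries, all little-endian.
HEADER = struct.Struct("<4sIHIII")


def pack_header(data_size, fmt_type, table_size, n_sub_blocks):
    """Serialize the header, splitting the table size into low and high halves."""
    low = table_size & 0xFFFFFFFF   # low half of the table size
    high = table_size >> 32         # high half of the table size
    return HEADER.pack(b"sxml", data_size, fmt_type, low, high, n_sub_blocks)


def unpack_header(raw):
    """Parse the header and recombine the two halves of the table size."""
    ck_id, data_size, fmt_type, low, high, n_sub_blocks = HEADER.unpack_from(raw)
    return {
        "id": ck_id,
        "data_size": data_size,
        "fmt_type": fmt_type,
        "table_size": (high << 32) | low,
        "n_sub_blocks": n_sub_blocks,
    }
```

Splitting the size into two halves lets a table larger than a single 32-bit field can express still be described, which is the apparent purpose of the separate low and high attributes.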

Optionally, the attributes of the sub-block array of audio serial scalable block elements include an audio serial scalable block data size, an audio sample number of a channel associated with the audio serial scalable block, extensible markup language data, a starting byte offset of the audio serial scalable block, a high byte of the starting byte offset of the alignment point, a low byte of the timestamp sample count, and a high byte of the timestamp sample count.

Optionally, the audio serial scalable block data size represents the size of the sub-block in bytes, excluding the bytes used by the data size itself and by the number of audio samples of the channel associated with the audio serial scalable block;

the extensible markup language data contains XML data that is either uncompressed or compressed by the method indicated by the compression method of the text;

the starting byte offset of the audio serial scalable block represents the alignment point as the offset, in bytes, of the beginning of the audio serial scalable block, excluding the bytes used by the block identifier and the data size of the byte block; the low byte of the timestamp sample count represents the timestamp of the alignment point, expressed in audio samples per channel and counted from the start of the data chunk of the Resource Interchange File Format (RIFF) file.
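Timestamp-based access through alignment points can be sketched as a binary search. The example assumes, hypothetically, that each alignment point is held in memory as a `(byte_offset, timestamp_samples)` pair and that the list is sorted by timestamp; neither the function name nor that representation comes from the disclosure.

```python
from bisect import bisect_right


def find_alignment_point(align_points, sample):
    """Return the last (byte_offset, timestamp) pair whose timestamp is <= sample.

    `align_points` is assumed to be sorted by timestamp. Returns None when the
    requested sample lies before the first alignment point.
    """
    stamps = [timestamp for _offset, timestamp in align_points]
    index = bisect_right(stamps, sample)
    if index == 0:
        return None
    return align_points[index - 1]


if __name__ == "__main__":
    points = [(0, 0), (120, 48000), (260, 96000)]
    print(find_alignment_point(points, 50000))
```

A reader that wants the metadata valid at a given sample position can seek straight to the returned byte offset instead of scanning every sub-block from the start.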

An audio serial scalable block (sxml chunk) element is used to transport time-varying metadata, for example a serial representation of audio model data (serial ADM). It contains associated serial audio metadata, comprising compressed or uncompressed XML that conforms to a preset XML 1.0 format or is associated with the audio data segments, and it may appear in any order relative to the other Resource Interchange File Format (RIFF) chunks in the same file. It consists of a header followed by sub-blocks (SubXMLChunk) carrying compressed or uncompressed XML data, as specified by the compression method type (fmtType). Each sub-block of XML data corresponds to a unique number of audio samples relative to the group of neighboring sub-blocks of XML data. The audio serial scalable block is completed by an optional alignment point table that allows sub-blocks of XML data to be selected by timestamp. The total length of the audio serial scalable block is not fixed, and a data size (ds64) chunk is used to accommodate the size field of the audio serial scalable block.
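The arrangement of sub-blocks and the derivation of alignment points can be sketched as follows. The field widths (two 32-bit little-endian integers ahead of the XML bytes), the uncompressed UTF-8 payload, and the function names are assumptions made for illustration; the disclosure's own tables are not reproduced here.

```python
import struct


def build_sub_block(xml_text, n_samples):
    """Serialize one sub-block: payload size, sample count, then the XML bytes.

    Assumed layout: the size field counts only the XML payload, which matches
    the description that it excludes the size and sample-count fields.
    """
    xml = xml_text.encode("utf-8")
    return struct.pack("<II", len(xml), n_samples) + xml


def build_sub_block_table(frames):
    """Concatenate sub-blocks and record one alignment point per sub-block.

    `frames` is a list of (xml_text, n_samples) pairs. Each alignment point is
    a (byte_offset, timestamp_samples) pair marking where a sub-block starts in
    the table and at which sample position its metadata begins to apply.
    """
    table, points, samples = b"", [], 0
    for xml_text, n_samples in frames:
        points.append((len(table), samples))
        table += build_sub_block(xml_text, n_samples)
        samples += n_samples
    return table, points


if __name__ == "__main__":
    table, points = build_sub_block_table([("<frame id='1'/>", 1024), ("<frame id='2'/>", 512)])
    print(len(table), points)
```

Because every sub-block carries its own sample count, the timestamps in the alignment table are simply running totals, and the table can be rebuilt from the sub-blocks if it is omitted.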

The audio serial scalable block element contains information as shown in table 1:

TABLE 1

The subblock array of audio serial scalable block elements contains information as shown in table 2:

TABLE 2

Fig. 3 is an audio serial metadata block generating apparatus according to an embodiment of the present disclosure, including:

an obtaining module 310, configured to obtain header sub-elements and audio serial scalable block elements of an audio serial metadata block;

an audio serial metadata block generation module 320, configured to place the audio serial scalable block element after the header sub-element to generate the audio serial metadata block for storing and transmitting time-varying metadata; wherein each audio serial scalable block element corresponds to a unique number of audio samples from a group of adjacent audio serial scalable block elements, and the audio format extension tag sub-element of the audio serial scalable block element comprises the set audio model metadata.

Optionally, the audio serial metadata block is arranged with other resource exchange file format blocks in the same audio file, and the audio serial metadata block contains audio data conforming to an audio format file or associated with an audio data segment.

Optionally, the obtaining module is specifically configured to:

Optionally, the attributes of the sub-block array of audio serial scalable block elements include an audio serial scalable block data size, an audio sample number of a channel associated with the audio serial scalable block, scalable markup language data, a start byte offset of the audio serial scalable block, a high byte of the start byte offset of the alignment point, a low byte of a timestamp sample number, and a high byte of the timestamp sample number.

Optionally, the apparatus further includes a representation module, configured such that the audio serial scalable block data size represents the size of the sub-block in bytes, excluding the bytes used by the data size itself and by the number of audio samples of the channel associated with the audio serial scalable block;

the extensible markup language data contains XML data that is either uncompressed or compressed by the method indicated by the compression method of the text;

the starting byte offset of the audio serial scalable block represents the alignment point as the offset, in bytes, of the beginning of the audio serial scalable block, excluding the bytes used by the block identifier and the data size of the byte block;

the low byte of the timestamp sample count represents the timestamp of the alignment point, expressed in audio samples per channel and counted from the start of the data chunk, used for storing audio samples, of the Resource Interchange File Format (RIFF) file.

The audio serial metadata block generation device provided by the embodiment of the invention can execute the audio serial metadata block generation method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.

Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. As shown in fig. 4, the electronic device includes: a processor 410, a memory 420, an input device 430, and an output device 440. The number of processors 410 in the electronic device may be one or more, and one processor 410 is taken as an example in fig. 4. The number of memories 420 in the electronic device may be one or more, and one memory 420 is taken as an example in fig. 4. The processor 410, the memory 420, the input device 430, and the output device 440 of the electronic device may be connected by a bus or other means, and fig. 4 illustrates the connection by a bus as an example. The electronic device can be a computer, a server and the like. The embodiment of the present disclosure is described in detail by taking a server as the electronic device, and the server may be an independent server or a cluster server.

The memory 420 serves as a computer-readable storage medium for storing software programs, computer-executable programs, and modules, such as program instructions/modules of the audio serial metadata block generation apparatus according to any embodiment of the present disclosure. The memory 420 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the device, and the like. Further, the memory 420 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, memory 420 may further include memory located remotely from processor 410, which may be connected to devices through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The input device 430 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the electronic device; it may also be a camera for acquiring images or a sound pickup device for acquiring audio data. The output device 440 may include an audio device such as a speaker. It should be noted that the specific composition of the input device 430 and the output device 440 can be set according to actual situations.

The processor 410 executes various functional applications of the device and data processing, i.e., implements an audio serial metadata block generation method, by executing software programs, instructions, and modules stored in the memory 420.

The embodiments of the present disclosure also provide a storage medium containing computer-executable instructions which, when executed by a computer processor, implement the audio serial metadata block generation method of any of the embodiments.

Of course, in the storage medium containing computer-executable instructions provided by the embodiments of the present disclosure, the computer-executable instructions are not limited to the operations of the method described above, and may also perform related operations in the method provided by any embodiment of the present disclosure, with the corresponding functions and advantages.

From the above description of the embodiments, it is obvious to a person skilled in the art that the present disclosure can be implemented by software together with the necessary general-purpose hardware, and certainly can also be implemented by hardware alone, but in many cases the former is the better embodiment. Based on such understanding, the technical solutions of the present disclosure may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and which includes several instructions for enabling a computer device (which may be a robot, a personal computer, a server, or a network device) to execute the method according to any embodiment of the present disclosure.

It should be noted that, in the electronic device, the units and modules included in the electronic device are merely divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be realized; in addition, specific names of the functional units are only used for distinguishing one functional unit from another, and are not used for limiting the protection scope of the present disclosure.

It should be understood that portions of the present disclosure may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.

In the description herein, references to the description of the term "in an embodiment," "in yet another embodiment," "exemplary" or "in a particular embodiment," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the disclosure. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although the present disclosure has been described in detail hereinabove by way of general description, specific embodiments and experiments, it will be apparent to those skilled in the art that certain modifications or improvements may be made thereto based on the present disclosure. Accordingly, such modifications and improvements are intended to be within the scope of this disclosure, as claimed.

Claims

1. The audio serial metadata block generation method is characterized by comprising the following steps:

2. The method of claim 1, wherein the audio serialized metadata block is arranged with other resource exchange file format blocks in the same audio file, the audio serialized metadata block comprising audio data conforming to an audio format file or associated with an audio data segment.

3. The method of claim 1, wherein the block format element in the obtained header sub-element is used to describe a specification of a block containing audio model metadata and a specification of a related audio block.

4. The method of claim 1, wherein the audio serialized metadata chunk is completed by an alignment point table that allows timestamp-based retrieval access to the selected audio serialized scalable chunk element; and the total length of the audio serialized metadata block is not fixed.

5. The method of claim 4, wherein obtaining the elements of the audio serial metadata chunk comprises:

the compression method of the text represents a compression method of an extensible markup language text;

the audio serial scalable block low byte represents a low byte size of an audio serial scalable block table, including bytes of a multi-audio serial scalable block element field;

the audio serial scalable block high byte represents a high byte size of an audio serial scalable block table, including bytes of a multi-audio serial scalable block element field.

6. The method of claim 4, wherein the attributes of the sub-block array of audio serial scalable block elements comprise an audio serial scalable block data size, an audio sample number of a channel associated with an audio serial scalable block, extensible markup language data, a starting byte offset of an audio serial scalable block, a high byte of a starting byte offset of an alignment point, a low byte of a timestamp sample count, and a high byte of a timestamp sample count.

7. The method of claim 6, wherein the audio serial scalable block data size represents the size of the sub-block in bytes, excluding the bytes used by the data size itself and by the number of audio samples of the channel associated with the audio serial scalable block;

8. An audio serial metadata block generating apparatus, comprising:

an audio-serial metadata block generation module for placing the header sub-element in front of the audio-serial scalable block element and then placing the audio-serial scalable block element behind the header sub-element to generate the audio-serial metadata block for storing and transmitting time-varying metadata;

9. An electronic device, comprising: a memory and one or more processors;

the memory to store one or more programs;

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.

10. A storage medium containing computer-executable instructions for implementing the method of any one of claims 1-7 when executed by a computer processor.