CN114363791A - Serial audio metadata generation method, device, equipment and storage medium - Google Patents


Info

Publication number
CN114363791A
CN114363791A (application CN202111424254.2A)
Authority
CN
China
Prior art keywords
audio
metadata
frame
serial
frames
Prior art date
Legal status
Pending
Application number
CN202111424254.2A
Other languages
Chinese (zh)
Inventor
吴健 (Wu Jian)
Current Assignee
Saiyinxin Micro Beijing Electronic Technology Co ltd
Original Assignee
Saiyinxin Micro Beijing Electronic Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Saiyinxin Micro Beijing Electronic Technology Co ltd
Priority application: CN202111424254.2A
Publication: CN114363791A
Legal status: Pending

Classifications

    • H04S 1/007 — Two-channel systems in which the audio signals are in digital form
    • G06F 16/61 — Indexing; data structures therefor; storage structures (information retrieval of audio data)
    • G06F 16/68 — Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G10L 19/008 — Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • H04S 7/305 — Electronic adaptation of stereophonic audio signals to reverberation of the listening space
    • H04S 2400/01 — Multi-channel sound reproduction with two speakers wherein the multi-channel information is substantially preserved
    • H04S 2400/15 — Aspects of sound capture and related signal processing for recording or reproduction
    • H04S 2420/01 — Enhancing the perception of the sound image or of the spatial distribution using head related transfer functions [HRTF's] or equivalents thereof, e.g. interaural time difference [ITD] or interaural level difference [ILD]


Abstract

The application relates to a method, an apparatus, a device and a storage medium for generating serial audio metadata. The method comprises: acquiring audio model metadata for generating serial audio metadata; distributing the audio model metadata into serial audio metadata frames of the serial audio metadata according to the start time, end time and duration of the audio model metadata; and configuring, for each serial audio metadata frame in which the audio model metadata is located, at least one of a start time, an end time and a duration within the frame. Model elements of an audio model are thereby converted into corresponding serial audio metadata, and an existing audio file is split into frames for real-time production and streaming audio applications, so that the frames can be transmitted in real time over a transport interface.

Description

Serial audio metadata generation method, device, equipment and storage medium
Technical Field
The present application relates to the field of audio processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for generating serial audio metadata.
Background
With the development of technology, audio has become more and more complex. Early single-channel audio gave way to stereo, shifting attention to the correct handling of the left and right channels. With the arrival of surround sound, the process became genuinely complex: the surround 5.1 speaker system imposes an ordering constraint on multiple channels, and the surround 6.1 and 7.1 speaker systems and their successors diversify audio processing further, with the correct signal delivered to the appropriate speaker so that the speakers work in concert. As sound becomes more immersive and interactive, the complexity of audio processing grows accordingly.
Audio channels (also called sound channels) are mutually independent audio signals that are captured or played back at different spatial locations when sound is recorded or reproduced. The number of channels equals the number of sound sources during recording, or the number of corresponding speakers during playback. For example, a surround 5.1 speaker system comprises audio signals at 6 different spatial locations, each separate audio signal driving the speaker at its corresponding location; a surround 7.1 speaker system comprises audio signals at 8 different spatial positions, each separate audio signal driving the speaker at its corresponding position.
The effect achievable by current loudspeaker systems therefore depends on the number and spatial positions of the loudspeakers. For example, a two-speaker system cannot reproduce the effect of a surround 5.1 speaker system.
Disclosure of Invention
The application aims to provide a method, an apparatus, a device and a storage medium for generating serial audio metadata, so that the audio model metadata workflow can be carried over serial audio.
A first aspect of the present application provides a method for generating serial audio metadata, including:
acquiring audio model metadata for generating serial audio metadata;
assigning the audio model metadata into serial audio metadata frames of the serial audio metadata;
configuring at least one of a start time, an end time and a duration of a preset element of the audio model metadata within a frame for each serial audio metadata frame in which the preset element is located.
A second aspect of the present application provides a serial audio metadata generation apparatus, including:
the acquisition module is used for acquiring audio model metadata used for generating serial audio metadata;
an allocation module for allocating the audio model metadata into serial audio metadata frames of the serial audio metadata;
and the time configuration module is used for configuring at least one of the start time, the end time and the duration of a preset element in a frame for each serial audio metadata frame in which the preset element of the audio model metadata is positioned.
A third aspect of the present application provides an electronic device comprising: a memory and one or more processors;
the memory for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the serial audio metadata generation method provided in any of the embodiments.
A fourth aspect of the present application provides a storage medium containing computer-executable instructions which, when executed by a computer processor, implement the serial audio metadata generation method provided in any of the embodiments.
In view of the above, the serial audio metadata generation method converts model elements of an audio model into corresponding serial audio metadata and splits an existing audio file into frames for real-time production and streaming audio applications, so that the frames can be transmitted in real time over a transport interface.
Drawings
FIG. 1 is a schematic diagram of a three-dimensional acoustic audio model provided in an embodiment of the present application;
FIG. 2 is a flow chart of a method of generating serial audio metadata in an embodiment of the present application;
fig. 3 is a schematic structural diagram of a serial audio metadata generation apparatus in an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device in an embodiment of the present application;
FIG. 5 is a structure of serial audio metadata in a complete frame stream according to an embodiment of the present application;
FIG. 6 is a structure of serial audio metadata in an inter-frame stream in an embodiment of the present application;
FIG. 7 is a structure of serial audio metadata in a mixed frame stream according to an embodiment of the present application;
FIG. 8 is a structure of serial audio metadata in a stream of split frames according to an embodiment of the present application;
FIG. 9 is a structure of serial audio metadata for introducing and modifying new elements in a real-time scenario in an embodiment of the present application;
FIG. 10 is a structure of serial audio metadata for processing a new element without sub-elements in a real-time scenario in an embodiment of the present application;
fig. 11 is a structure of serial audio metadata for modifying an existing element in a real-time scene in an embodiment of the present application.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Examples
As shown in fig. 1, a three-dimensional acoustic audio model is composed of a set of elements, each element describing one stage of audio, and includes a content production section and a format production section.
Wherein the content part comprises: an audio program element, an audio content element, an audio object element, and a soundtrack unique identification element; the format making part includes: an audio packet format element, an audio channel format element, an audio stream format element, and an audio track format element;
the audio program element references at least one of the audio content elements; the audio content element references at least one audio object element; the audio object element references the corresponding audio package format element and the corresponding audio track unique identification element; the audio track unique identification element refers to the corresponding audio track format element and the corresponding audio package format element;
the audio package format element references at least one of the audio channel format elements; the audio stream format element refers to the corresponding audio channel format element and the corresponding audio packet format element; the audio track format element and the corresponding audio stream format element are referenced to each other. The reference relationships between elements are indicated by arrows in fig. 1.
An audio program may include, but is not limited to, narration, sound effects and background music. Audio program elements describe a program; a program comprises at least one piece of content, each described by an audio content element belonging to the corresponding audio program element. An audio program element may reference one or more audio content elements, which are grouped together to construct a complete audio program.
The audio content elements describe the content of a component of an audio program, such as background music, and relate the content to its format by reference to one or more audio object elements.
The audio object elements are used to build content, format and valuable information and to determine the soundtrack unique identification of the actual soundtrack.
The format making part comprises: an audio packet format element, an audio channel format element, an audio stream format element, an audio track format element.
The audio packet format element may be configured to describe a format adopted when the audio object element and the original audio data are packed according to channel packets.
The audio channel format element may be used to represent a single sequence of audio samples and preset operations performed on it, such as movement of rendering objects in a scene. The audio channel format element may comprise at least one audio block format element. The audio block format elements may be considered to be sub-elements of the audio channel format elements, and therefore there is an inclusion relationship between the audio channel format elements and the audio block format elements.
An audio stream is a combination of audio tracks needed to render a channel, an object, a higher-order ambisonics component or a packet. The audio stream format element establishes the relationship between a set of audio track format elements and a set of audio channel format elements, or between a set of audio track formats and an audio packet format.
Audio track format elements correspond to a set of samples or data in a single audio track. They describe the format of the original audio data and the signals decoded by the renderer, and also identify the combination of audio tracks required to successfully decode the audio track data.
After the original audio data is produced through the three-dimensional acoustic audio model, synthesized audio data containing metadata is generated.
The Metadata (Metadata) is information describing characteristics of data, and functions supported by the Metadata include indicating a storage location, history data, resource lookup, or file record.
After the synthesized audio data is transmitted to the far end over a communication link, the far end parses the synthesized audio data based on the metadata and restores the original sound scene, or renders it into a new sound scene, in real time.
The division between content production, format production and the BW64 (Broadcast Wave 64-bit) file is shown in fig. 1. The content production part and the format production part together constitute metadata in XML format, which is typically contained in one chunk (the "axml" chunk) of the BW64 file. The BW64 file part at the bottom contains a "channel allocation" ("chna") chunk, a look-up table used to link the metadata to the audio tracks in the file.
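As an illustrative sketch (not part of the patent), the chunk layout just described can be located by walking the RIFF-style container that BW64 inherits from WAVE. The in-memory file below is synthetic and the chunk payloads are placeholders; only the "axml"/"chna" chunk IDs come from the text above:

```python
import io
import struct

def scan_chunks(f):
    """Walk a RIFF/BW64-style container and return (chunk_id, size, payload_offset) tuples."""
    riff_id, total_size, form_type = struct.unpack("<4sI4s", f.read(12))
    chunks = []
    while True:
        header = f.read(8)
        if len(header) < 8:
            break
        cid, size = struct.unpack("<4sI", header)
        chunks.append((cid.decode("ascii"), size, f.tell()))
        f.seek(size + (size & 1), io.SEEK_CUR)  # chunk bodies are word-aligned
    return chunks

# Synthetic container holding an 'axml' chunk (ADM XML) and a 'chna' chunk.
axml = b"<frame/>"
chna = b"\x00" * 8
body = (struct.pack("<4sI", b"axml", len(axml)) + axml +
        struct.pack("<4sI", b"chna", len(chna)) + chna)
data = struct.pack("<4sI4s", b"RIFF", 4 + len(body), b"WAVE") + body

ids = [cid for cid, _, _ in scan_chunks(io.BytesIO(data))]
print(ids)  # → ['axml', 'chna']
```

A real BW64 file additionally carries "fmt " and "data" chunks and 64-bit size extensions, which this sketch ignores.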
The content production part describes the technical content of the audio, e.g. whether it contains dialogue or a specific language, together with loudness metadata. The format production part describes the channel types of the audio tracks and how they are combined, e.g. the left and right channels of a stereo pair. The metadata of the content production part is typically unique to a given audio program, while the elements of the format production part may be reused.
The present application provides a method for generating serial audio metadata, as shown in fig. 2, the method includes:
s210, obtaining audio model metadata used for generating serial audio metadata;
s220, distributing the audio model metadata to serial audio metadata frames of the serial audio metadata;
and S230, configuring at least one of the start time, the end time and the duration of the preset element in the frame for each serial audio metadata frame where the preset element of the audio model metadata is located.
Optionally, the audio model metadata for generating serial audio metadata includes an audio program element, an audio content element, an audio object element, a soundtrack unique identification element, an audio packet format element, an audio channel format element, an audio stream format element, and an audio track format element, where the audio channel format element includes at least one audio block format element. The preset elements include the audio program element, the audio object element, and the audio block format element.
Optionally, the allocating the audio model metadata to the serial audio metadata frames of the serial audio metadata includes:
determining a starting serial audio metadata frame of the audio model metadata according to the starting time of the audio model metadata;
and determining continuous serial audio metadata frames of the audio model metadata according to the ending time and/or the duration of the audio model metadata.
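The two determination steps above can be sketched as follows. This is an illustrative (non-normative) helper, assuming fixed-length frames and that an element whose end time falls exactly on a frame boundary does not extend into the next frame:

```python
def frame_span(start, duration, frame_len):
    """Return the indices of the serial audio metadata frames an element spans.

    The first index is the starting serial audio metadata frame (from the
    element's start time); the remaining indices are its continuous frames
    (from the element's end time / duration).
    """
    first = int(start // frame_len)
    end = start + duration
    last = int(end // frame_len)
    if end % frame_len == 0:  # end instant on a boundary belongs to the previous frame
        last -= 1
    return list(range(first, last + 1))

print(frame_span(5.0, 3.5, 2.0))  # → [2, 3, 4]
print(frame_span(0.0, 2.0, 2.0))  # → [0]
```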
Optionally, configuring at least one of a start time, an end time, and a duration of a preset element in a frame for each serial audio metadata frame where the preset element of the audio model metadata is located includes:
within the starting serial audio metadata frame, configuring a start time of the preset element as a start time of the starting serial audio metadata frame, and configuring an end time and/or duration of the preset element as an end time and/or duration of the starting serial audio metadata frame.
Optionally, configuring at least one of a start time, an end time, and a duration of a preset element in a frame for each serial audio metadata frame in which the preset element of the audio model metadata is located, further includes:
if the preset element does not contain a sub-element in the initial serial audio metadata frame, configuring the start time of the preset element as the start time of the initial serial audio metadata frame in the continuous serial audio metadata frame of which the preset element does not contain the sub-element; within the continuous serial audio metadata frame for which the preset element comprises a sub-element, configuring a start time of the preset element to be a start time of the continuous serial audio metadata frame for which the first preset element comprises a sub-element;
configuring, within the persistent serial audio metadata frame, an end time of the preset element as an end time of the current persistent serial audio metadata frame, and/or configuring a duration of the preset element as a duration from a start time of the start serial audio metadata frame to an end time of the current persistent serial audio metadata frame.
Optionally, the types of the serial audio metadata frame include: header frames, complete frames, split frames, intermediate frames and all frames. After configuring at least one of a start time, an end time and a duration of a preset element in a frame for each serial audio metadata frame in which the preset element of the audio model metadata is located, the method further comprises:
forming a serial audio stream according to a plurality of serial audio metadata frames;
wherein the types of the serial audio stream include: a complete frame stream, an intermediate frame stream, a mixed frame stream, and a split frame stream;
the complete frame stream comprises a series of complete frames, the first frame being a complete frame, a header frame or an all frame;
the intermediate frame stream comprises a series of intermediate frames, the first frame being a complete frame, a header frame or an all frame;
the mixed frame stream comprises a series of intermediate frames and complete frames, the first frame being a complete frame, a header frame or an all frame;
the split frame stream comprises a series of split frames, the first frame being a complete frame, a split frame, a header frame or an all frame.
The audio model is an open, compatible, generic metadata model, but audio model metadata is designed for local file storage rather than for real-time production and streaming audio applications. When metadata must be transmitted remotely in real time together with digital audio, a serial audio metadata scheme is required that allows existing audio and its associated audio model metadata files to be sliced into frames and streamed.
A frame of serial audio metadata contains a set of audio model metadata describing the audio frames within a certain time period associated with the frame. The serial audio metadata has the same structure, attributes and elements as the audio model metadata, as well as additional attributes for specifying the frame format. The frames of serial audio metadata do not overlap and are linked to a specified start time and duration. Metadata contained in a frame of serial audio metadata is likely to be used to describe the audio itself over the duration of the frame.
The parent element of the serial audio metadata is the frame (frame) element, which comprises two sub-elements: the frame header (frameHeader) and the audio format extension (audioFormatExtended). The frame header in turn includes two sub-elements: the frame format (frameFormat) and the transport track format (transportTrackFormat).
The audio format extension includes 8 sub-elements: audio program (audioProgramme), audio content (audioContent), audio object (audioObject), audio track unique identifier (audioTrackUID), audio packet format (audioPackFormat), audio channel format (audioChannelFormat), audio stream format (audioStreamFormat) and audio track format (audioTrackFormat).
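The element hierarchy just described can be assembled with Python's standard ElementTree, as an illustrative sketch (not part of the patent). The attribute names on the frame format element are assumptions for illustration only:

```python
import xml.etree.ElementTree as ET

def build_frame(frame_id, start, duration, frame_type):
    """Build a skeletal serial audio metadata frame: a frame element with a
    frameHeader (frameFormat + transportTrackFormat) and an audioFormatExtended
    carrying the eight audio model sub-elements."""
    frame = ET.Element("frame")
    header = ET.SubElement(frame, "frameHeader")
    ET.SubElement(header, "frameFormat",
                  frameFormatID=frame_id, start=start,
                  duration=duration, type=frame_type)
    ET.SubElement(header, "transportTrackFormat")
    afe = ET.SubElement(frame, "audioFormatExtended")
    for tag in ("audioProgramme", "audioContent", "audioObject",
                "audioTrackUID", "audioPackFormat", "audioChannelFormat",
                "audioStreamFormat", "audioTrackFormat"):
        ET.SubElement(afe, tag)  # payload of each sub-element omitted here
    return frame

frame = ET.tostring(build_frame("FF_00000001", "00:00:00.00000",
                                "00:00:02.00000", "full"), encoding="unicode")
print(frame[:60])
```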
The audio model metadata consists of a content part (e.g. audio program elements) and a format part (e.g. audio channel format elements). Only three elements store time-related parameters: the audio program element, the audio object element and the audio block format element. In the content part, the start time, end time and duration of an audio program element or audio object element determine the timing of that element; these parameters are typically fixed. In the format part, all parameters in the audio block format elements are time-varying.
The audio model metadata can be divided into two groups: namely dynamic metadata (e.g., audio block format elements in an audio channel format element) and static metadata (e.g., audio program elements and audio content elements).
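A minimal sketch of this grouping (illustrative only; the tuple representation of elements is an assumption, not the patent's data model):

```python
def partition(elements):
    """Split a flat list of (tag, element_id) metadata entries into the two
    groups described above: audioBlockFormat entries are dynamic metadata,
    everything else is static metadata."""
    static, dynamic = [], []
    for tag, element_id in elements:
        (dynamic if tag == "audioBlockFormat" else static).append((tag, element_id))
    return static, dynamic

elements = [("audioProgramme", "APR_1001"),
            ("audioBlockFormat", "AB_00030001_01"),
            ("audioContent", "ACO_1001")]
static, dynamic = partition(elements)
print(len(static), len(dynamic))  # → 2 1
```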
A serial audio metadata frame consists of one or more metadata chunks.
The serial audio metadata frames are divided into five types:
1. "header": the first frame in a stream; contains all descriptors associated with the audio signal.
2. "complete": contains all descriptors associated with the audio signal.
3. "split": the metadata is divided into data blocks; the last data block contains the dynamic metadata, and the other data blocks each contain part of the static metadata.
4. "intermediate": contains only the descriptors that have changed since the previous frame.
5. "all": contains all descriptors of the entire audio program element (the entire XML code of the initial audio model).
The serial audio streams are of four types:
1. Full-frame (FF) stream: a series of "complete" frames, the first frame being "complete", "header" or "all".
2. Intermediate-frame (IF) stream: a series of "intermediate" frames, the first frame being "complete", "header" or "all".
3. Mixed-frame (MF) stream: a series of "intermediate" and "complete" frames, the first frame being "complete", "header" or "all".
4. Split-frame (DF) stream: a series of "split" frames, the first frame being "complete", "split", "header" or "all".
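The first-frame and body-frame rules listed above can be checked mechanically. The following is an illustrative sketch transcribing those rules as written (not a normative validator):

```python
FIRST_FRAME_RULES = {
    "FF": {"complete", "header", "all"},
    "IF": {"complete", "header", "all"},
    "MF": {"complete", "header", "all"},
    "DF": {"complete", "split", "header", "all"},
}
BODY_FRAME_RULES = {
    "FF": {"complete"},
    "IF": {"intermediate"},
    "MF": {"intermediate", "complete"},
    "DF": {"split"},
}

def valid_stream(stream_type, frame_types):
    """Check a sequence of frame-type tags against the stream-type rules above."""
    if not frame_types:
        return False
    head, rest = frame_types[0], frame_types[1:]
    return (head in FIRST_FRAME_RULES[stream_type] and
            all(t in BODY_FRAME_RULES[stream_type] for t in rest))

print(valid_stream("MF", ["header", "intermediate", "complete", "intermediate"]))  # → True
print(valid_stream("IF", ["intermediate", "intermediate"]))  # → False (bad first frame)
```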
By not repeatedly transmitting time-invariant metadata in each frame, the "split" and "intermediate" frames enable an efficient representation of serial audio metadata. The serial audio data stream is intended to support this efficient representation while providing random access when needed. The frame types used by each type of serial audio data stream are summarized in Table 1:
TABLE 1
(Table 1 is reproduced as an image in the original publication.)
Full Frame (FF) stream specification
In this case, the serial audio metadata is organized into "complete" frames, forming a full-frame (FF) stream whose basic structure is shown in fig. 5. An FF stream provides access at any audio frame and thus supports random access.
Intermediate Frame (IF) stream specification
The receiver only needs to receive the static audio model metadata once, and can therefore ignore repeated static audio model metadata even if the complete audio model metadata is transmitted repeatedly. Thus, when the broadcaster does not require random access, audio model metadata that has already been transmitted can be omitted. An "intermediate" frame may omit every element whose value has not changed relative to the previous frame, even if the element is classified as dynamic metadata. IF streams do not support random access; the structure of serial audio metadata in an Intermediate Frame (IF) stream is shown in fig. 6.
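The "only what changed" rule can be sketched as a dictionary diff. This is illustrative only; representing elements as an ID-to-value mapping is an assumption, not the patent's encoding:

```python
def intermediate_frame(prev, curr):
    """Given the previous and current full metadata sets as dicts mapping
    element ID to its serialized value, keep only the entries that are new
    or whose value changed - the content of an "intermediate" frame."""
    return {eid: val for eid, val in curr.items() if prev.get(eid) != val}

prev = {"APR_1001": "a", "AB_00030001_01": "pos=0"}
curr = {"APR_1001": "a", "AB_00030001_01": "pos=1", "AB_00030001_02": "pos=2"}
print(intermediate_frame(prev, curr))
# → {'AB_00030001_01': 'pos=1', 'AB_00030001_02': 'pos=2'}
```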
Mixed Frame (MF) stream specification
Both "complete" and "intermediate" frames may be used in a single stream, as shown in fig. 7. In this case the broadcaster is free to choose the time interval at which "complete" frames are transmitted. MF streams support random access with some delay: the receiver must wait for the next "complete" frame.
Split Frame (DF) stream specification
Because the size of a frame differs greatly depending on whether it is a "complete" or an "intermediate" frame, the data rate of an MF stream can vary widely. By dividing the static metadata into data blocks, the DF stream distributes the data more evenly across all frames (as shown in fig. 8).
In the example of fig. 8, the frame metadata, e.g. "FF_00000001", is divided into metadata blocks such as "FF_00000001_01", "FF_00000001_02" and "FF_00000001_03", which are transmitted within the same time instance. The metadata block "FF_0000000X_04" contains the dynamic metadata, while the blocks "FF_0000000X_01" to "FF_0000000X_03" contain the split static metadata. Since the metadata block "FF_00000002_01" carries the same static metadata as the corresponding blocks of other frames (e.g. "FF_00000003_01" and "FF_00000004_01"), "FF_00000003_01" and "FF_00000004_01" may be omitted.
In a DF stream, the last data block always contains the dynamic metadata, while all other data blocks contain static metadata. A DF stream supports random access with some delay: the receiver must wait until all the metadata blocks needed to reconstruct the complete static metadata set have been received, see fig. 8.
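The receiver-side wait can be sketched as follows. This is an illustrative model only; representing each frame as a pair of (static blocks, dynamic block) dicts is an assumption for the sketch:

```python
def reconstruct(frames, n_static_blocks):
    """Replay a DF stream: collect static blocks (which may be spread over
    several frames and omitted once sent) until all n_static_blocks have
    been seen, then merge them with the latest frame's dynamic block.
    Returns None if the static set never becomes complete."""
    static, result = {}, None
    for static_blocks, dynamic in frames:
        for block in static_blocks:
            static[block["index"]] = block["data"]
        if len(static) == n_static_blocks:  # random access now possible
            result = {}
            for i in sorted(static):
                result.update(static[i])
            result.update(dynamic)  # dynamic block always comes last
    return result

stream = [
    ([{"index": 1, "data": {"APR_1001": "prog"}}], {"AB_01": "t=0"}),
    ([{"index": 2, "data": {"ACO_1001": "content"}}], {"AB_01": "t=2"}),
]
print(reconstruct(stream, 2))
# → {'APR_1001': 'prog', 'ACO_1001': 'content', 'AB_01': 't=2'}
```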
Real-time serial audio metadata generation
The following illustrates how serial audio metadata is generated in a real-time environment. The example shows MF flows and FF flows, but similar procedures can be applied to other types of flows as well.
The serial audio metadata structure in the real-time scenario of fig. 9 illustrates how new elements are introduced and modified. Fig. 9 shows how an audio object (audioObject) "AO_1001" and a set of audio block formats (audioBlockFormat) "AB_00030001_NN" are initialized in a real-time scene:
When "AO_1001" first appears in "FF_00000003", its duration starts at 2 seconds (to match the length of the frame); the duration is then updated to 4 seconds and, in the next frame, to 6 seconds.
New audio block formats appear in "FF_00000003", "FF_00000004" and "FF_00000005", and some of their duration values are adjusted when the audio block format is used in frames after the first.
The audio model metadata regenerated on the right-hand side of fig. 9 shows how the elements appear after "FF_00000005" has been received, at which point the duration of "AO_1001" is 6 seconds.
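The growth of the duration (2 s → 4 s → 6 s) can be modelled with a toy replay. This sketch is illustrative only, assuming a fixed 2-second frame length and that each frame simply lists the object IDs it carries:

```python
def extend_durations(frames, frame_len=2.0):
    """Replay the fig. 9 behaviour: an element first seen in some frame gets
    duration == frame_len; every later frame that still carries it extends
    the duration by one more frame length."""
    durations = {}
    for frame in frames:
        for obj_id in frame:
            durations[obj_id] = durations.get(obj_id, 0.0) + frame_len
    return durations

# "AO_1001" is carried by FF_00000003, FF_00000004 and FF_00000005.
frames = [["AO_1001"], ["AO_1001"], ["AO_1001"]]
print(extend_durations(frames))  # → {'AO_1001': 6.0}
```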
The structure of the serial audio metadata in the real-time scenario of fig. 10 illustrates how a new element without sub-elements is processed. Fig. 10 shows a new audio object being introduced that contains no sub-elements in its first two frames:
Its start time is modified over successive frames until it is assigned sub-elements. In this case, when "FF_00000003" is reached, a new audio block format ("AB_00030001_01") appears, so the start time of "AO_1001" is fixed at 4 seconds and its duration grows in subsequent frames.
The structure of the serial audio metadata in the real-time scenario of fig. 11 illustrates how existing elements are modified. Fig. 11 shows how the end time of an audio program (audioProgramme) "APR_1001" is modified when a new frame ("FF_00000006") appears after the initial end time of "APR_1001" has passed:
The durations of "AO_1001" and "AB_00030001_04" are also modified in this new frame. As a result, the end time of "APR_1001" in the reconstructed audio model metadata is updated as well.
When reading a serial audio frame in which the properties of a particular metadata element have changed relative to the previous frame, the metadata element from the most recent frame must be used.
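This "most recent frame wins" rule is a simple fold over the frame sequence. Illustrative sketch only; the dict-of-dicts representation is an assumption:

```python
def current_view(frames):
    """Fold a sequence of frames (each a dict of element ID → properties)
    into the receiver's current view: when an element reappears with
    changed properties, the most recent frame wins."""
    view = {}
    for frame in frames:
        view.update(frame)
    return view

frames = [{"APR_1001": {"end": "10.0"}},
          {"APR_1001": {"end": "12.0"}}]
print(current_view(frames)["APR_1001"])  # → {'end': '12.0'}
```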
Structure of serial audio metadata frame
The serial audio metadata frame should consist of two parts: a header, containing the additional serial-audio elements that describe the frame specification, and an audio format extension, containing the specified metadata.
Structure of 'complete' frame
A "complete" frame should contain all 8 elements in the audio format extension, i.e. 8 elements of the audio model metadata.
Structure of 'intermediate' frame
An "intermediate" frame should only include elements whose values have changed when compared to the previous metadata frame. Among the audio model metadata elements, an audio program element, an audio object element, and a video block format element are used to define time information. The audio block format elements defined by the Bed type usually have time invariant metadata, whereas the audio block format elements defined by the Object type usually have time variant metadata.
An "intermediate" frame therefore typically consists of the audio block format elements within the audio channel format elements, and is mainly used for audio whose type is defined as "Objects".
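Building an "intermediate" frame amounts to diffing the current metadata against the previous frame, as sketched below (the function name, dictionary representation, and element IDs such as "AP_00010002" are illustrative assumptions):

```python
# Sketch: an "intermediate" frame carries only the elements whose values
# have changed since the previous frame. prev/curr map element IDs to
# their metadata values.
def intermediate_payload(prev, curr):
    return {eid: value for eid, value in curr.items()
            if prev.get(eid) != value}

prev = {"AB_00030001_01": {"azimuth": 0.0},  "AP_00010002": {"type": "Beds"}}
curr = {"AB_00030001_01": {"azimuth": 30.0}, "AP_00010002": {"type": "Beds"}}
```

Here only the time-variant "Objects"-style block format survives the diff; the unchanged "Beds"-style element is dropped, matching the behaviour described above.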
Structure of 'split' frame
A "split" frame contains metadata that is split into at least two data blocks. Each frame should carry at least one data block. Each data block should contain a subset of all the elements in a complete frame.
Since static metadata elements do not change in consecutive frames, it is not necessary to place them in every frame. The dynamic metadata elements that change in each frame shall be carried in the last block of the frame.
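A minimal sketch of this block layout follows (the function name and the two-argument split into static and dynamic element maps are assumptions made for illustration):

```python
# Sketch: a "split" frame spreads its metadata over at least two data
# blocks. Static elements fill the leading block(s); the dynamic elements
# that change every frame travel in the last block.
def split_frame(static_elements, dynamic_elements):
    blocks = [static_elements] if static_elements else []
    blocks.append(dynamic_elements)   # dynamic metadata goes last
    if len(blocks) < 2:               # a split frame has at least two blocks
        blocks.insert(0, {})
    return blocks
```

In frames where the static elements are omitted, an empty leading block keeps the minimum of two blocks per frame.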
Structure of 'header' frame
A "header" frame is a "complete" frame with the special function of indicating the start of a new audio program or the start of a new stream.
Structure of 'full' frame
The "full" frame shall contain all metadata for the entire audio program element. Thus, it may include metadata describing the audio in past and future frames as well as in the current frame.
The "full" type of frame is used when the metadata for the entire audio program element is known before the serial audio metadata is formed into a data stream. This type of frame should therefore be considered for pre-recorded programs or for live programs with completely static metadata.
Fig. 3 shows a serial audio metadata generation device according to an embodiment of the present application, which includes:
an obtaining module 310, configured to obtain audio model metadata for generating serial audio metadata;
an allocation module 320, configured to allocate the audio model metadata into serial audio metadata frames of the serial audio metadata;
a time configuration module 330, configured to configure, for each serial audio metadata frame in which a preset element of the audio model metadata is located, at least one of a start time, an end time, and a duration of the preset element within the frame.
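As a rough end-to-end illustration of how the three modules could cooperate, consider the following sketch. The class and method names, the dictionary representation of elements, and the fixed one-second frame grid are all assumptions; the patent does not prescribe an implementation:

```python
# Hypothetical sketch of the device in Fig. 3: acquire model metadata,
# allocate it to the frames it spans, then stamp per-frame timing.
class SerialMetadataGenerator:
    def __init__(self, frame_duration=1.0):
        self.frame_duration = frame_duration   # assumed fixed frame grid

    def acquire(self, source):
        """Obtaining module: collect the audio model metadata elements."""
        return list(source)

    def allocate(self, elements):
        """Allocation module: group element IDs by the frames they overlap."""
        frames = {}
        for e in elements:
            first = int(e["start"] // self.frame_duration)
            end = e["start"] + e["duration"]
            last = int((end - 1e-9) // self.frame_duration)
            for i in range(first, last + 1):
                frames.setdefault(i, []).append(e["id"])
        return frames

    def configure(self, frames):
        """Time configuration module: stamp frame-local start/end times."""
        d = self.frame_duration
        return {i: {"start": i * d, "end": (i + 1) * d, "ids": ids}
                for i, ids in frames.items()}

gen = SerialMetadataGenerator()
elements = [{"id": "AO_1001", "start": 0.5, "duration": 2.0}]
framed = gen.configure(gen.allocate(gen.acquire(elements)))
```

An element running from 0.5 s to 2.5 s thus lands in frames 0, 1, and 2 of the one-second grid, each stamped with that frame's start and end time.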
Optionally, the audio model metadata for generating serial audio metadata includes an audio program element, an audio content element, an audio object element, a soundtrack unique identification element, an audio packet format element, an audio channel format element, an audio stream format element, and an audio track format element, where the audio channel format element includes at least one audio block format element. The preset elements include the audio program element, the audio object element, and the audio block format element.
Optionally, the allocation module 320 is specifically configured to:
determining a starting serial audio metadata frame of the audio model metadata according to the starting time of the audio model metadata;
and determining continuing serial audio metadata frames of the audio model metadata according to the ending time and/or the duration of the audio model metadata.
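These two determinations can be sketched as simple index arithmetic on an assumed fixed frame grid (function names and the grid itself are illustrative, not prescribed by the patent):

```python
import math

def starting_frame(start, frame_duration):
    # the starting serial frame is the one whose interval contains `start`
    return int(start // frame_duration)

def continuing_frames(start, end, frame_duration):
    # continuing frames run from the frame after the starting frame up to
    # the frame whose interval contains the element's end time
    first = starting_frame(start, frame_duration) + 1
    last = int(math.ceil(end / frame_duration)) - 1
    return list(range(first, last + 1))
```

For example, an element starting at 0.5 s and ending at 2.5 s on a one-second grid has frame 0 as its starting frame and frames 1 and 2 as continuing frames.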
Optionally, the time configuration module 330 is specifically configured to:
within the starting serial audio metadata frame, configuring a start time of the preset element as a start time of the starting serial audio metadata frame, and configuring an end time and/or duration of the preset element as an end time and/or duration of the starting serial audio metadata frame.
Optionally, the time configuration module 330 is further specifically configured to:
if the preset element does not contain a sub-element in the starting serial audio metadata frame, configuring, within each continuing serial audio metadata frame in which the preset element still contains no sub-element, the start time of the preset element as the start time of the current continuing frame; and configuring, within the continuing serial audio metadata frames in which the preset element contains a sub-element, the start time of the preset element as the start time of the first continuing frame in which the preset element contains a sub-element;
configuring, within the continuing serial audio metadata frames, the end time of the preset element as the end time of the current continuing frame, and/or configuring the duration of the preset element as the duration from the start time of the starting serial audio metadata frame to the end time of the current continuing frame.
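The sliding-then-fixed start-time behaviour described above (and illustrated by "AO_1001" in the real-time scenario) can be sketched as follows; the tuple representation of a frame and the function name are assumptions:

```python
# Sketch: the element's start time slides forward with each frame until
# the first sub-element appears, at which point it is fixed; the end time
# tracks the current frame and the duration grows accordingly.
def element_times(frames):
    """frames: ordered list of (frame_start, frame_end, has_sub_elements)."""
    fixed_start = None
    out = []
    for f_start, f_end, has_sub in frames:
        if fixed_start is None and has_sub:
            fixed_start = f_start   # first sub-element pins the start time
        start = fixed_start if fixed_start is not None else f_start
        out.append({"start": start, "end": f_end,
                    "duration": f_end - start})
    return out

timeline = element_times([(0.0, 1.0, False), (1.0, 2.0, True), (2.0, 3.0, True)])
```

In this run the element's start slides from 0.0 to 1.0 while it is empty, is fixed at 1.0 when the first sub-element arrives, and its duration then grows frame by frame.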
Optionally, the types of the serial audio metadata frame include: header frames, complete frames, split frames, intermediate frames, and full frames. The serial audio metadata generation apparatus further includes:
a serial audio stream forming module, configured to form a serial audio stream according to a plurality of serial audio metadata frames after configuring at least one of a start time, an end time, and a duration of a preset element in a frame for each serial audio metadata frame where the preset element of the audio model metadata is located;
wherein the types of the serial audio stream include: a complete frame stream, an intermediate frame stream, a mixed frame stream, and a split frame stream;
the complete frame stream comprises a series of complete frames, the first frame being a complete frame, a header frame or a full frame;
the intermediate frame stream comprises a series of intermediate frames, the first frame being a complete frame, a header frame or a full frame;
the mixed frame stream comprises a series of intermediate frames and complete frames, the first frame being a complete frame, a header frame or a full frame;
the split frame stream comprises a series of split frames, the first frame being a complete frame, a split frame, a header frame, or a full frame.
The serial audio metadata generation device provided by this embodiment can execute the serial audio metadata generation method provided by any embodiment of the present application, and has the functional modules and beneficial effects corresponding to that method.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 4, the electronic device includes: a processor 410, a memory 420, an input device 430, and an output device 440. There may be one or more processors 410 in the electronic device; one processor 410 is taken as an example in fig. 4. Likewise, there may be one or more memories 420; one memory 420 is taken as an example in fig. 4. The processor 410, the memory 420, the input device 430, and the output device 440 of the electronic device may be connected by a bus or by other means; connection by a bus is taken as an example in fig. 4. The electronic device may be a computer, a server, or the like. In this embodiment, the electronic device is described as a server, which may be an independent server or a cluster server.
The memory 420, as a computer-readable storage medium, is used to store software programs, computer-executable programs, and modules, such as the program instructions/modules of the serial audio metadata generation apparatus according to any embodiment of the present application. The memory 420 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created according to the use of the device, and the like. Further, the memory 420 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, the memory 420 may further include memory located remotely from the processor 410, which may be connected to the device through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 430 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the electronic device; it may also be a camera for acquiring images or a sound pickup device for acquiring audio data. The output device 440 may include an audio device such as a speaker. The specific composition of the input device 430 and the output device 440 may be set according to the actual situation.
The processor 410 executes the software programs, instructions, and modules stored in the memory 420, thereby performing the various functional applications and data processing of the device, i.e., implementing the serial audio metadata generation method described above.
Embodiments of the present application further provide a storage medium containing computer-executable instructions which, when executed by a computer processor, perform the serial audio metadata generation method of any of the embodiments above.
Of course, in the storage medium containing computer-executable instructions provided by the embodiments of the present application, the computer-executable instructions are not limited to the operations of the method described above, and may also perform related operations in the method provided by any embodiment of the present application, with the corresponding functions and advantages.
From the above description of the embodiments, it will be clear to those skilled in the art that the present application can be implemented by software together with the necessary general-purpose hardware, and certainly also by hardware alone, although the former is in many cases the better embodiment. Based on such an understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a read-only memory (ROM), a random access memory (RAM), a flash memory (FLASH), a hard disk, or an optical disk of a computer, and which includes several instructions for enabling a computer device (which may be a robot, a personal computer, a server, or a network device) to execute the method according to any embodiment of the present application.
It should be noted that the units and modules included in the electronic device are merely divided according to functional logic, and the division is not limited thereto as long as the corresponding functions can be realized; in addition, the specific names of the functional units are only used to distinguish them from one another, and do not limit the protection scope of the present application.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the description herein, references to the description of the term "in an embodiment," "in another embodiment," "exemplary" or "in a particular embodiment," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although the present application has been described in detail above by way of general description, specific embodiments, and experiments, it will be apparent to those skilled in the art that modifications or improvements may be made on this basis. Accordingly, such modifications and improvements are intended to fall within the scope of protection claimed by the present application.

Claims (10)

1. A method of generating serial audio metadata, comprising:
acquiring audio model metadata for generating serial audio metadata;
assigning the audio model metadata into serial audio metadata frames of the serial audio metadata;
configuring at least one of a start time, an end time and a duration of a preset element of the audio model metadata within a frame for each serial audio metadata frame in which the preset element is located.
2. The method of claim 1, wherein the audio model metadata used to generate the serial audio metadata comprises audio program elements, audio content elements, audio object elements, soundtrack unique identification elements, audio packet format elements, audio channel format elements, audio stream format elements, and audio track format elements, wherein the audio channel format elements comprise at least one audio block format element.
3. The method of claim 2, wherein the preset elements include the audio program element, the audio object element, and the audio block format element.
4. The method of claim 1, wherein assigning the audio model metadata into a serial audio metadata frame of the serial audio metadata comprises:
determining a starting serial audio metadata frame of the audio model metadata according to the starting time of the audio model metadata;
and determining continuing serial audio metadata frames of the audio model metadata according to the ending time and/or the duration of the audio model metadata.
5. The method of claim 4, wherein configuring at least one of a start time, an end time, and a duration of a preset element of the audio model metadata for each of the serial audio metadata frames in which the preset element is located comprises:
within the starting serial audio metadata frame, configuring a start time of the preset element as a start time of the starting serial audio metadata frame, and configuring an end time and/or duration of the preset element as an end time and/or duration of the starting serial audio metadata frame.
6. The method of claim 5, wherein configuring at least one of a start time, an end time, and a duration of a preset element of the audio model metadata within a frame for each of the serial audio metadata frames in which the preset element is located, further comprises:
if the preset element does not contain a sub-element in the starting serial audio metadata frame, configuring, within each continuing serial audio metadata frame in which the preset element still contains no sub-element, the start time of the preset element as the start time of the current continuing frame; and configuring, within the continuing serial audio metadata frames in which the preset element contains a sub-element, the start time of the preset element as the start time of the first continuing frame in which the preset element contains a sub-element;
configuring, within the continuing serial audio metadata frames, the end time of the preset element as the end time of the current continuing frame, and/or configuring the duration of the preset element as the duration from the start time of the starting serial audio metadata frame to the end time of the current continuing frame.
7. The method of claim 6, wherein the type of the serial audio metadata frame comprises: header frames, complete frames, split frames, intermediate frames, and full frames; after configuring at least one of a start time, an end time and a duration of a preset element in a frame for each serial audio metadata frame in which the preset element of the audio model metadata is located, the method further comprises:
forming a serial audio stream according to a plurality of serial audio metadata frames; wherein the types of the serial audio stream include: a complete frame stream, an intermediate frame stream, a mixed frame stream, and a split frame stream;
the complete frame stream comprises a series of complete frames, the first frame being a complete frame, a header frame or a full frame;
the intermediate frame stream comprises a series of intermediate frames, the first frame being a complete frame, a header frame or a full frame;
the mixed frame stream comprises a series of intermediate frames and complete frames, the first frame being a complete frame, a header frame or a full frame;
the split frame stream comprises a series of split frames, the first frame being a complete frame, a split frame, a header frame, or a full frame.
8. A serial audio metadata generation apparatus, comprising:
the acquisition module is used for acquiring audio model metadata used for generating serial audio metadata;
an allocation module for allocating the audio model metadata into serial audio metadata frames of the serial audio metadata;
and the time configuration module is used for configuring at least one of the start time, the end time and the duration of a preset element in a frame for each serial audio metadata frame in which the preset element of the audio model metadata is positioned.
9. An electronic device, comprising: a memory and one or more processors;
the memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
10. A storage medium containing computer-executable instructions for implementing the method of any one of claims 1-7 when executed by a computer processor.
CN202111424254.2A 2021-11-26 2021-11-26 Serial audio metadata generation method, device, equipment and storage medium Pending CN114363791A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111424254.2A CN114363791A (en) 2021-11-26 2021-11-26 Serial audio metadata generation method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111424254.2A CN114363791A (en) 2021-11-26 2021-11-26 Serial audio metadata generation method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114363791A true CN114363791A (en) 2022-04-15

Family

ID=81097359

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111424254.2A Pending CN114363791A (en) 2021-11-26 2021-11-26 Serial audio metadata generation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114363791A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1761308A (en) * 2004-04-14 2006-04-19 微软公司 Digital media general basic stream
US20110013882A1 (en) * 2009-07-17 2011-01-20 Yoshiaki Kusunoki Video audio recording/playback apparatus and method
US20120197650A1 (en) * 2009-10-19 2012-08-02 Dolby International Ab Metadata time marking information for indicating a section of an audio object
CA2837893A1 (en) * 2011-07-01 2013-01-10 Dolby Laboratories Licensing Corporation System and method for adaptive audio signal generation, coding and rendering
DE202013006242U1 (en) * 2013-06-19 2013-08-01 Dolby Laboratories Licensing Corporation Audio decoder with program information metadata
EP2973551A2 (en) * 2013-05-24 2016-01-20 Dolby International AB Reconstruction of audio scenes from a downmix
US20210258632A1 (en) * 2018-06-28 2021-08-19 Dolby Laboratories Licensing Corporation Frame conversion for adaptive streaming alignment

Patent Citations (10)

Publication number Priority date Publication date Assignee Title
CN1761308A (en) * 2004-04-14 2006-04-19 微软公司 Digital media general basic stream
US20110013882A1 (en) * 2009-07-17 2011-01-20 Yoshiaki Kusunoki Video audio recording/playback apparatus and method
US20120197650A1 (en) * 2009-10-19 2012-08-02 Dolby International Ab Metadata time marking information for indicating a section of an audio object
CA2837893A1 (en) * 2011-07-01 2013-01-10 Dolby Laboratories Licensing Corporation System and method for adaptive audio signal generation, coding and rendering
CN103650539A (en) * 2011-07-01 2014-03-19 杜比实验室特许公司 System and method for adaptive audio signal generation, coding and rendering
EP2727383A2 (en) * 2011-07-01 2014-05-07 Dolby Laboratories Licensing Corporation System and method for adaptive audio signal generation, coding and rendering
EP2973551A2 (en) * 2013-05-24 2016-01-20 Dolby International AB Reconstruction of audio scenes from a downmix
DE202013006242U1 (en) * 2013-06-19 2013-08-01 Dolby Laboratories Licensing Corporation Audio decoder with program information metadata
US20160196830A1 (en) * 2013-06-19 2016-07-07 Dolby Laboratories Licensing Corporation Audio encoder and decoder with program information or substream structure metadata
US20210258632A1 (en) * 2018-06-28 2021-08-19 Dolby Laboratories Licensing Corporation Frame conversion for adaptive streaming alignment

Non-Patent Citations (2)

Title
International Telecommunication Union: "A serial representation of the Audio Definition Model", Recommendation ITU-R BS.2125-0, pages 1-44 *
International Telecommunication Union: "Audio Definition Model", Recommendation ITU-R BS.2076-1, pages 1-104 *

Similar Documents

Publication Publication Date Title
JP4135251B2 (en) Information processing device
JP7045266B2 (en) Acoustic signal auxiliary information conversion transmission device and program
JP2005039667A (en) System and method for transmitting and receiving data, data receiver and data transmitter
CN104506920A (en) Method and device for playing omnimedia data information
CN114512152A (en) Method, device and equipment for generating broadcast audio format file and storage medium
CN109218849B (en) Live data processing method, device, equipment and storage medium
CN114500475B (en) Network data transmission method, device and equipment based on real-time transmission protocol
CN114363791A (en) Serial audio metadata generation method, device, equipment and storage medium
CN113905321A (en) Object-based audio channel metadata and generation method, device and storage medium
CN114360556A (en) Serial audio metadata frame generation method, device, equipment and storage medium
CN114363790A (en) Method, apparatus, device and medium for generating metadata of serial audio block format
CN114363792A (en) Transmission audio track format serial metadata generation method, device, equipment and medium
CN114448955B (en) Digital audio network transmission method, device, equipment and storage medium
CN113889128A (en) Audio production model and generation method, electronic equipment and storage medium
CN114510598A (en) Method, device and equipment for generating audio metadata block and storage medium
CN114519121A (en) Audio serial metadata block generation method, device, equipment and storage medium
CN114121036A (en) Audio track unique identification metadata and generation method, electronic device and storage medium
CN114023339A (en) Audio-bed-based audio packet format metadata and generation method, device and medium
CN115190412A (en) Method, device and equipment for generating internal data structure of renderer and storage medium
CN113905322A (en) Method, device and storage medium for generating metadata based on binaural audio channel
CN114051194A (en) Audio track metadata and generation method, electronic equipment and storage medium
CN115529548A (en) Speaker channel generation method and device, electronic device and medium
CN114510212B (en) Data transmission method, device and equipment based on serial digital audio interface
CN114530157A (en) Audio metadata channel allocation block generation method, apparatus, device and medium
CN113923584A (en) Matrix-based audio channel metadata and generation method, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination