CN113963725A - Audio object metadata and generation method, electronic device, and storage medium


Info

Publication number: CN113963725A
Application number: CN202111102038.6A
Authority: CN (China)
Prior art keywords: audio, audio object, information, interaction, representing
Legal status: Pending
Other languages: Chinese (zh)
Inventor: 吴健
Current Assignee: Saiyinxin Micro Beijing Electronic Technology Co ltd
Original Assignee: Saiyinxin Micro Beijing Electronic Technology Co ltd
Priority date / Filing date: 2021-09-18
Publication date: 2022-01-21
Application filed by Saiyinxin Micro Beijing Electronic Technology Co ltd

Classifications

    • G11B - INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
        • G11B 20/10527 - Audio or video recording; Data buffering arrangements
        • G11B 20/12 - Formatting, e.g. arrangement of data blocks or words on the record carriers
        • G11B 2020/10546 - Audio or video recording specifically adapted for audio data
        • G11B 2020/10601 - Audio or video recording specifically adapted for recording or reproducing multichannel surround sound signals
        • G11B 2020/1291 - Formatting wherein the formatting serves a specific purpose
        • G11B 2020/1298 - Enhancement of the signal quality
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
        • G06F 16/683 - Retrieval characterised by using metadata automatically derived from the content
        • G06F 16/686 - Retrieval characterised by using manually generated metadata, e.g. tags, keywords, comments, title or artist information, time, location or usage information, user ratings

Abstract

The present disclosure relates to audio object metadata, a generation method, an electronic device, and a storage medium. The audio object metadata comprises: a property region including an audio object identification and an audio object name of an audio object, the audio object identification including information indicating a relationship between a plurality of audio objects; and a sub-element region used to represent the audio package format identification reference, audio object identification reference, audio complementary object identification reference, audio track unique identification reference, and audio object interaction information of the audio object metadata. The audio object metadata describes an audio object and its format, and can provide accurate audio object metadata to a renderer during audio playback, thereby improving the quality of the reproduced audio scene.

Description

Audio object metadata and generation method, electronic device, and storage medium
Technical Field
The present disclosure relates to the field of audio processing technologies, and in particular, to audio object metadata, a generation method, an electronic device, and a storage medium.
Background
With the development of technology, audio has become more and more complex. Early mono audio gave way to stereo, and production effort centered on handling the left and right channels correctly. Once surround sound appeared, the process grew more complex: a surround 5.1 speaker system imposes an ordering constraint on multiple channels, and surround 6.1, surround 7.1, and similar systems diversify audio processing further, requiring the correct signal to reach the proper speaker so that the speakers complement one another. As sound becomes more immersive and interactive, the complexity of audio processing increases greatly.
Audio channels (also called sound channels) are mutually independent audio signals that are captured or played back at different spatial positions when sound is recorded or reproduced. The number of channels is the number of sound sources during recording, or the number of corresponding speakers during playback. For example, a surround 5.1 speaker system comprises audio signals at 6 different spatial positions, each separate audio signal driving the speaker at its corresponding spatial position; a surround 7.1 speaker system comprises audio signals at 8 different spatial positions, each driving the speaker at its corresponding position.
Therefore, the effect a current loudspeaker system can achieve depends on the number and spatial positions of its loudspeakers. For example, a dual-speaker (stereo) system cannot reproduce the effect of a surround 5.1 speaker system.
The present disclosure provides audio object metadata and a generation method to supply metadata capable of solving the above technical problem.
Disclosure of Invention
The present disclosure is directed to audio object metadata, a generation method, an electronic device, and a storage medium, so as to solve at least one of the above technical problems.
To achieve the above object, a first aspect of the present disclosure provides audio object metadata, including:
a property region including an audio object identification and an audio object name of an audio object, the audio object identification including information indicating a relationship between a plurality of audio objects;
and a sub-element region used to represent the audio package format identification reference, audio object identification reference, audio complementary object identification reference, audio track unique identification reference, and audio object interaction information of the audio object metadata.
To achieve the above object, a second aspect of the present disclosure provides an audio object metadata generation method, including:
generating the audio object metadata as described in the first aspect.
To achieve the above object, a third aspect of the present disclosure provides an electronic device, including: a memory and one or more processors;
the memory for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors are caused to generate audio data comprising the audio object metadata as described in the first aspect.
To achieve the above object, a fourth aspect of the present disclosure provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to generate audio data comprising the audio object metadata as described in the first aspect.
From the above, the audio object metadata of the present disclosure includes: a property region including an audio object identification and an audio object name of an audio object, the audio object identification including information indicating a relationship between a plurality of audio objects; and a sub-element region used to represent the audio package format identification reference, audio object identification reference, audio complementary object identification reference, audio track unique identification reference, and audio object interaction information of the audio object metadata. The audio object metadata describes an audio object and its format, and can provide accurate audio object metadata to a renderer during audio playback, thereby improving the quality of the reproduced audio scene.
Drawings
Fig. 1 is a schematic diagram of a three-dimensional acoustic audio production model provided in embodiment 1 of the present disclosure;
fig. 2 is a schematic structural diagram of audio object metadata provided in embodiment 1 of the present disclosure;
fig. 3 is a flowchart of a method for generating audio object metadata provided in embodiment 2 of the present disclosure;
fig. 4 is a schematic structural diagram of an electronic device provided in embodiment 3 of the present disclosure.
Detailed Description
The following examples are intended to illustrate the present disclosure, but are not intended to limit the scope of the present disclosure.
Metadata is information that describes the structural characteristics of data. Functions supported by metadata include indicating storage locations, recording historical data, resource lookup, and file recording.
As shown in fig. 1, the three-dimensional acoustic audio production model is composed of a set of production elements, each of which uses metadata to describe the structural characteristics of the data at the corresponding stage of audio production. The model includes a content production section and a format production section.
The production elements of the content production section include: an audio program element, an audio content element, an audio object element, and a soundtrack unique identification element.
An audio program includes narration, sound effects, and background music; the audio program references one or more audio contents that are combined together to construct a complete audio program. In other words, the audio program element is used to produce an audio program, and metadata of the audio program is generated to describe the structural characteristics of the audio program.
The audio content describes the content of one component of an audio program, such as background music, and relates that content to its format by referencing one or more audio objects. In other words, the audio content element is used to produce audio content, and metadata of the audio content is generated to describe the structural characteristics of the audio content.
The audio objects are used to establish relationships among content, format, and asset information, and to determine the audio track unique identifications of the actual audio tracks. In other words, the audio object element is used to produce audio objects, and metadata of the audio objects is generated to describe the structural characteristics of the audio objects.
The audio track unique identification element is used to produce audio track unique identifications, and metadata of the audio track unique identification is generated to describe the structural characteristics of the audio track unique identification.
The production elements of the format production part include: an audio packet format element, an audio channel format element, an audio stream format element, an audio track format element.
The audio package format is the format adopted when metadata of audio objects and audio stream data are packed into channel packages; an audio package format may nest other audio package formats. The audio package format element is used to produce audio package data. The audio package data comprises metadata in the audio package format, and that metadata describes the structural characteristics of the audio package format.
The audio channel format represents a single sequence of audio samples on which certain operations may be performed, such as moving a rendered object within a scene. An audio channel format may nest other audio channel formats. The audio channel format element is used to produce audio channel data. The audio channel data comprises metadata in the audio channel format, which describes the structural characteristics of the audio channel format.
An audio stream is a combination of the audio tracks needed to render a channel, an object, a higher-order ambient sound component, or a package. The audio stream format is used to establish a relationship between a set of audio track formats and a set of audio channel formats or audio package formats. The audio stream format element is used to produce audio stream data. The audio stream data comprises metadata in the audio stream format, which describes the structural characteristics of the audio stream format.
The audio track format corresponds to a set of samples or data in a single audio track of the storage medium; it is used to describe the original audio data and the signals decoded for the renderer. The audio track format is derived from the original audio data and identifies the combination of audio tracks required to decode the audio track data successfully. The audio track format element is used to produce audio track data. The audio track data comprises metadata in the audio track format, which describes the structural characteristics of the audio track format.
Each stage of the three-dimensional audio production model produces metadata that describes the characteristics of that stage.
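For clarity, these reference relationships can be summarized as follows. This is an illustrative Python sketch using element names in the style of the audioObject element discussed below; the mapping structure is our own summary, not a format defined by this disclosure:
PRODUCTION_MODEL = {
    # Content production section
    "audioProgramme": ["audioContent"],        # a program references one or more contents
    "audioContent": ["audioObject"],           # content is related to its format via objects
    "audioObject": ["audioObject", "audioPackFormat", "audioTrackUID"],
    "audioTrackUID": [],                       # identifies an actual audio track
    # Format production section
    "audioPackFormat": ["audioPackFormat"],    # package formats may nest
    "audioChannelFormat": ["audioChannelFormat"],  # channel formats may nest
    "audioStreamFormat": ["audioTrackFormat", "audioChannelFormat", "audioPackFormat"],
    "audioTrackFormat": [],                    # describes the original audio data
}

def referenced_elements(element):
    # Return the element types a given production element may reference.
    return PRODUCTION_MODEL.get(element, [])

print(referenced_elements("audioObject"))  # ['audioObject', 'audioPackFormat', 'audioTrackUID']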
After the audio channel data produced on the basis of the three-dimensional audio production model is transmitted to a remote end by communication, the remote end renders the audio channel data stage by stage based on the metadata, restoring the produced sound scene.
Example 1
The present disclosure provides and describes in detail the audio object metadata in a three-dimensional acoustic audio production model.
An audio object (audioObject) is the connection between actual audio tracks and their format; the audio object element establishes relationships among content, format, and asset information. Audio objects may be nested in order to group other audio objects together. It follows that an audio object may be a combination of one or more audio objects.
As shown in fig. 2, the audio object metadata 100 includes a property area 110 and a sub-element area 120.
The property area 110 comprises an audio object identification 111 and an audio object name 112 of an audio object.
The audio object identification 111 comprises information indicating a relationship between a plurality of audio objects.
In the embodiment of the present disclosure, the audio object identifier 111 is a unique identifier for an audio object. For example, if the audio object named "object001" has the identifier "AO-1002", then the audio object can be retrieved through the identifier "AO-1002". The audio object identifier 111 can be described in a computer language as:
<audioObject audioObjectID="AO-1002">
This denotes an audio object whose audio object identification 111 is "AO-1002"; through this identifier the corresponding audio object can be obtained.
The audio object name 112 is the specific, human-readable name of an audio object. For example, an audio object may be named "object001" with the corresponding audio object identifier 111 "AO-1002", so the audio object named "object001" can be obtained through the identifier "AO-1002".
The relationship between the audio object identifier 111 and the audio object name 112 can be described in a computer language as:
<audioObject audioObjectID="AO-1002"
audioObjectName="object001">
specifically, the attribute region 110 further includes:
start time information 113 of an audio object, the start time information 113 relating to the playback start time of the audio program in the audio object; specifically, the format of the start time information 113 is "hh:mm:ss.zzzzz", where hh represents hours, mm represents minutes, ss represents seconds, and zzzzz represents fractions of a second, such as milliseconds or finer.
And/or duration information 114 of an audio object, the duration information 114 relating to the playback start time and end time of the audio program in the audio object;
and/or, audio object importance information 115 for describing an importance index of an audio object; different level representations may be used for the audio object importance information 115, such as setting the audio object importance information 115 to 0, representing least important, and setting the audio object importance information 115 to 10, representing most important.
And/or, user-object interaction information 116, the user-object interaction information 116 being used to characterize whether the user is allowed to interact with the object. The user-object interaction information 116 may be 0 or 1: a setting of 1 indicates that the user is allowed to interact with the object, and a setting of 0 indicates that the user is not allowed to interact with the object.
And/or, automatic ducking information 117, the automatic ducking information 117 being used to characterize whether automatic ducking of the object is disabled (the attribute is written disableDucking in the examples below). The automatic ducking information 117 may be 0 or 1: a setting of 1 indicates that automatic ducking is disabled, and a setting of 0 indicates that automatic ducking is allowed.
In particular, the start time information 113, the duration information 114, the audio object importance information 115, the user-object interaction information 116, and the automatic ducking information 117 are not mandatory parameters of the attribute region 110; the mandatory parameters of the attribute region 110 are the audio object identifier 111 and the audio object name 112. The attribute region 110 may include one or more of the optional parameters above, or none of them.
Specifically, the related parameter settings of the attribute area 110 can be described in a computer language as:
<audioObject audioObjectID="AO-1002"
audioObjectName="object001"
start="00:00:00.00000"
dialogue="1" importance="5" interact="1" disableDucking="0">
This is expressed as: the audio object identification 111 is "AO-1002", the audio object name 112 is "object001", the start time information 113 of the audio object is "00:00:00.00000", the dialogue attribute is "1", the audio object importance information 115 is "5", the user-object interaction information 116 is "1" (interaction between the user and the object is allowed), and the automatic ducking information 117 (disableDucking) is "0" (automatic ducking is not disabled).
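To make the attribute region concrete, the following is a minimal illustrative Python sketch (not part of the patent text) that assembles such an element with the standard library; the attribute names mirror the XML fragment above, while the helper function name is our own assumption:
import xml.etree.ElementTree as ET

def make_audio_object(object_id, name, start=None, dialogue=None,
                      importance=None, interact=None, disable_ducking=None):
    # Mandatory attribute-region parameters: identifier and name.
    attrs = {"audioObjectID": object_id, "audioObjectName": name}
    # Optional attribute-region parameters; omit any that are not needed.
    if start is not None:
        attrs["start"] = start                          # "hh:mm:ss.zzzzz"
    if dialogue is not None:
        attrs["dialogue"] = str(dialogue)
    if importance is not None:
        attrs["importance"] = str(importance)           # 0 (least) to 10 (most)
    if interact is not None:
        attrs["interact"] = str(interact)               # 1 = user interaction allowed
    if disable_ducking is not None:
        attrs["disableDucking"] = str(disable_ducking)  # 1 = automatic ducking disabled
    return ET.Element("audioObject", attrs)

obj = make_audio_object("AO-1002", "object001", start="00:00:00.00000",
                        dialogue=1, importance=5, interact=1, disable_ducking=0)
print(ET.tostring(obj, encoding="unicode"))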
The sub-element region 120 includes: audio package format identification reference information 121, the audio package format identification reference information 121 being used to reference an audio package format for format description. In this embodiment of the present disclosure, the format of the audio object is described by referencing audio package formats, and the number of such references may be 0, 1, or more;
and/or audio object identification reference information 122, the audio object identification reference information 122 being used to describe other referenced audio objects; the number of other referenced audio objects may be 0, 1, or more;
and/or audio complementary object identification reference information 123, the audio complementary object identification reference information 123 being used to describe audio objects complementary to this audio object. The audio complementary object identification reference information 123 references other audio objects that are complementary to the current object, in particular mutually exclusive ones; the number of complementary audio objects may be 0, 1, or more. Since each entry of the audio complementary object identification reference information 123 contains a reference to another audio object that complements the current one, the list can describe mutually exclusive objects, e.g., language tracks containing the same dialogue in different dubbed versions; for each set of mutually exclusive content, the audio complementary object identification reference information 123 contains only one corresponding audio object;
and/or audio track unique identification reference information 124, the audio track unique identification reference information 124 being used to describe references to audio track unique identifications. Specifically, when an audio file in BW64 format is used, the audio track unique identification references 124 are listed in the channel allocation block of the audio object. In particular, if the value of an audio track unique identification reference 124 is set to "000000", it does not reference any audio track in the file but instead introduces a virtual empty track. This is very important for audio objects in a multichannel format: by using a silent track instead of storing zero-valued samples, storage space in the file can be saved. The number of audio track unique identification references may be 0, 1, or more (a parsing sketch that handles this convention follows this list);
and/or audio object interaction information 125, the audio object interaction information 125 being used to characterize the specification of user interaction with the object; specifically, the number of such specifications may be 0 or 1.
It should be particularly noted that none of the modules included in the sub-element region 120 is mandatory in practical applications; one, several, or none of them may be present.
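As referenced above, the following illustrative Python sketch (the function name and structure are our own assumptions) shows how a reader of this metadata might resolve audio track unique identification references, treating the value "000000" as a virtual silent track:
import xml.etree.ElementTree as ET

def resolve_track_refs(audio_object):
    # Collect audioTrackIDRef values; "000000" marks a virtual empty track,
    # for which no zero-valued samples are stored in the file.
    resolved = []
    for ref in audio_object.findall("audioTrackIDRef"):
        resolved.append(None if ref.text == "000000" else ref.text)
    return resolved  # None entries are rendered as silence

obj = ET.fromstring(
    "<audioObject><audioTrackIDRef>ATU_00000003</audioTrackIDRef>"
    "<audioTrackIDRef>000000</audioTrackIDRef></audioObject>")
print(resolve_track_refs(obj))  # ['ATU_00000003', None]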
Specifically, in this embodiment of the present disclosure, the audio object interaction information includes:
audio object switching information for characterizing whether to allow an audio object to be turned on or off; specifically, whether to allow the user to turn on or off the audio object is implemented by setting the parameter value of the audio object switching information, for example, setting the parameter value of the audio object switching information to 1 indicates that the audio object is allowed to be turned on or off, and setting the parameter value of the audio object switching information to 0 indicates that the audio object is not allowed to be turned on or off.
Audio object gain information characterizing whether or not to allow a gain of an audio object to be changed; specifically, whether or not to allow the gain value of the audio object to be changed may be implemented by setting a parameter value of the audio object gain information, for example, setting the parameter value of the audio object gain information to 1, which indicates that the gain value of the audio object is allowed to be changed, and setting the parameter value of the audio object gain information to 0, which indicates that the gain value of the audio object is not allowed to be changed.
Audio object position information characterizing whether a user is allowed to alter the position of an audio object. Specifically, whether or not to allow modification of the position information of the audio block format set of the audio object may be achieved by setting a parameter value of the audio object position information, for example, setting the parameter value of the audio object position information to 1, indicating that modification of the position information of the audio block format set of the audio object is allowed, setting the parameter value of the audio object position information to 0, indicating that modification of the position information of the audio block format set of the audio object is not allowed.
In particular, if the audio object allows interaction, the user may modify the audio object within the interaction ranges by setting the corresponding parameters.
Specifically, in this embodiment of the present disclosure, the audio object gain information includes:
maximum value information (max) of audio gain characterizing a maximum gain factor allowing user gain interaction;
minimum value information (min) of audio gain characterizing a minimum gain factor allowing user gain interaction.
This can be described in a computer language as:
<audioObjectInteraction onOffInteract="1" gainInteract="1">
<gainInteractionRange bound="min">-20.0</gainInteractionRange>
<gainInteractionRange bound="max">3.0</gainInteractionRange>
</audioObjectInteraction>
This indicates that the audio object is allowed to be turned on or off, the gain value of the audio object is allowed to be changed, the maximum audio gain is 3.0, and the minimum audio gain is -20.0.
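As an illustration of how a renderer might honor this range, consider the following sketch; it is written under our own assumptions (the gain values are treated as offsets in dB, and the function name is hypothetical):
def apply_gain_interaction(requested_gain, gain_interact, gain_min=-20.0, gain_max=3.0):
    # If gainInteract is 0, user gain changes are not allowed at all.
    if not gain_interact:
        return 0.0  # no user offset; the authored gain is kept unchanged
    # Otherwise clamp the requested offset to the declared [min, max] range.
    return max(gain_min, min(gain_max, requested_gain))

print(apply_gain_interaction(5.0, True))    # 3.0  (clamped to the maximum)
print(apply_gain_interaction(-30.0, True))  # -20.0 (clamped to the minimum)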
Specifically, in this embodiment of the present disclosure, the audio object position information includes:
the audio object detection method comprises a polar coordinate parameter and a Cartesian coordinate parameter, wherein the polar coordinate parameter is used for representing the position information of the audio object in a polar coordinate mode, and the Cartesian coordinate parameter is used for representing the position information of the audio object in a Cartesian coordinate mode.
Specifically, in the embodiment of the present disclosure, the polar coordinate parameters include:
maximum azimuth information, used to characterize the maximum azimuth offset value allowed for user position interaction;
minimum azimuth information, used to characterize the minimum azimuth offset value allowed for user position interaction;
maximum elevation information, used to characterize the maximum elevation offset value allowed for user position interaction;
minimum elevation information, used to characterize the minimum elevation offset value allowed for user position interaction;
maximum distance information, used to characterize the maximum normalized distance value allowed for user position interaction;
minimum distance information, used to characterize the minimum normalized distance value allowed for user position interaction.
The polar parameters may be described in computer language as:
<audioObjectInteraction onOffInteract="1" positionInteract="1">
<positionInteractionRange bound="min" coordinate="azimuth">-50.0</positionInteractionRange>
<positionInteractionRange bound="max" coordinate="azimuth">50.0</positionInteractionRange>
<positionInteractionRange bound="min" coordinate="elevation">-20.0</positionInteractionRange>
<positionInteractionRange bound="max" coordinate="elevation">20.0</positionInteractionRange>
<positionInteractionRange bound="min" coordinate="distance">0.6</positionInteractionRange>
<positionInteractionRange bound="max" coordinate="distance">1.0</positionInteractionRange>
</audioObjectInteraction>
This represents a minimum azimuth offset of -50 degrees, a maximum azimuth offset of 50 degrees, a minimum elevation offset of -20 degrees, a maximum elevation offset of 20 degrees, a minimum normalized distance of 0.6, and a maximum normalized distance of 1.0.
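A short illustrative Python sketch of clamping user position offsets to these polar ranges (the names are our own; the values mirror the example above):
POLAR_RANGES = {
    "azimuth":   (-50.0, 50.0),  # degrees
    "elevation": (-20.0, 20.0),  # degrees
    "distance":  (0.6, 1.0),     # normalized distance
}

def clamp_polar(coordinate, value):
    # Limit a user-supplied offset to the [min, max] declared for its coordinate.
    lo, hi = POLAR_RANGES[coordinate]
    return max(lo, min(hi, value))

print(clamp_polar("azimuth", 80.0))  # 50.0
print(clamp_polar("distance", 0.3))  # 0.6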
Specifically, in the embodiment of the present disclosure, the cartesian coordinate parameters include:
maximum X-axis coordinate information, used to characterize the maximum X-axis offset value, in normalized units, allowed for user position interaction;
minimum X-axis coordinate information, used to characterize the minimum X-axis offset value, in normalized units, allowed for user position interaction;
maximum Y-axis coordinate information, used to characterize the maximum Y-axis offset value, in normalized units, allowed for user position interaction;
minimum Y-axis coordinate information, used to characterize the minimum Y-axis offset value, in normalized units, allowed for user position interaction;
maximum Z-axis coordinate information, used to characterize the maximum Z-axis offset value, in normalized units, allowed for user position interaction;
minimum Z-axis coordinate information, used to characterize the minimum Z-axis offset value, in normalized units, allowed for user position interaction.
The cartesian coordinate parameters may be described in computer language as:
<audioObjectInteraction onOffInteract="1" positionInteract="1">
<positionInteractionRange bound="min" coordinate="X">-0.5</positionInteractionRange>
<positionInteractionRange bound="max" coordinate="X">0.5</positionInteractionRange>
<positionInteractionRange bound="min" coordinate="Y">-0.2</positionInteractionRange>
<positionInteractionRange bound="max" coordinate="Y">0.2</positionInteractionRange>
<positionInteractionRange bound="min" coordinate="Z">0.0</positionInteractionRange>
<positionInteractionRange bound="max" coordinate="Z">1.0</positionInteractionRange>
</audioObjectInteraction>
This represents that the minimum X-axis offset allowed for user position interaction is -0.5 normalized units, the maximum X-axis offset is 0.5, the minimum Y-axis offset is -0.2, the maximum Y-axis offset is 0.2, the minimum Z-axis offset is 0.0, and the maximum Z-axis offset is 1.0.
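The following illustrative sketch (the function name is assumed; the element and attribute spellings follow this disclosure) shows how such positionInteractionRange elements could be read back into per-coordinate (min, max) pairs:
import xml.etree.ElementTree as ET

def read_position_ranges(interaction):
    # Gather positionInteractionRange children into {coordinate: (min, max)}.
    ranges = {}
    for r in interaction.findall("positionInteractionRange"):
        coord, bound = r.get("coordinate"), r.get("bound")
        lo, hi = ranges.get(coord, (None, None))
        if bound == "min":
            lo = float(r.text)
        elif bound == "max":
            hi = float(r.text)
        ranges[coord] = (lo, hi)
    return ranges

xml_src = ('<audioObjectInteraction onOffInteract="1" positionInteract="1">'
           '<positionInteractionRange bound="min" coordinate="X">-0.5</positionInteractionRange>'
           '<positionInteractionRange bound="max" coordinate="X">0.5</positionInteractionRange>'
           '</audioObjectInteraction>')
print(read_position_ranges(ET.fromstring(xml_src)))  # {'X': (-0.5, 0.5)}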
Specifically, in the embodiment of the present disclosure, the relevant parameters of the sub-element region 120 may be described in a computer language as:
<audioPackFormatIDRef>AP_00010001</audioPackFormatIDRef>
<audioComplementaryObjectRef>AO_1001</audioComplementaryObjectRef>
<audioTrackIDRef>ATU_00000003</audioTrackIDRef>
<audioObjectInteraction onOffInteract="1" gainInteract="1">
<gainInteractionRange bound="min">-20.0</gainInteractionRange>
<gainInteractionRange bound="max">3.0</gainInteractionRange>
</audioObjectInteraction>
This indicates that the referenced audio package format is "AP_00010001", the referenced audio complementary object is "AO_1001", the audio track unique identification is "ATU_00000003", the audio object is allowed to be turned on or off, the gain value of the audio object is allowed to be changed, the maximum audio gain is 3.0, and the minimum audio gain is -20.0.
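A complementary illustrative sketch (the dictionary keys and function name are our own) reading such a sub-element region back into a plain dictionary:
import xml.etree.ElementTree as ET

def parse_sub_elements(obj):
    # Collect the optional sub-element references and interaction flags.
    inter = obj.find("audioObjectInteraction")
    return {
        "pack_format_refs": [e.text for e in obj.findall("audioPackFormatIDRef")],
        "complementary_refs": [e.text for e in obj.findall("audioComplementaryObjectRef")],
        "track_refs": [e.text for e in obj.findall("audioTrackIDRef")],
        "on_off_interact": inter is not None and inter.get("onOffInteract") == "1",
        "gain_interact": inter is not None and inter.get("gainInteract") == "1",
    }

obj = ET.fromstring("<audioObject>"
                    "<audioPackFormatIDRef>AP_00010001</audioPackFormatIDRef>"
                    "</audioObject>")
print(parse_sub_elements(obj))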
The embodiment of the present disclosure describes an audio object and its format through the audio object metadata 100, and can provide accurate audio object data to the renderer during audio playback, thereby improving the quality of the reproduced audio scene.
Example 2
The present disclosure also provides an embodiment of a method for generating audio object metadata. This embodiment is similar to the above embodiment; terms with the same names have the same meanings and the same technical effects as in the above embodiment, so their explanations are not repeated here.
As shown in fig. 3, a method for generating audio object metadata includes the steps of:
step S210, generating audio object metadata, the audio object metadata including:
a property region including an audio object identification and an audio object name of an audio object, the audio object identification including information indicating a relationship between a plurality of audio objects;
and a sub-element region used to represent the audio package format identification reference, audio object identification reference, audio complementary object identification reference, audio track unique identification reference, and audio object interaction information of the audio object metadata.
Optionally, the attribute region further comprises: start time information of an audio object, the start time information relating to the playback start time of the audio program in the audio object; and/or duration information of an audio object, the duration information relating to the playback start time and end time of the audio program in the audio object; and/or audio object importance information, used to describe an importance index of the audio object; and/or user-object interaction information, used to characterize whether the user is allowed to interact with the object; and/or automatic ducking information, used to characterize whether automatic ducking of the object is disabled.
Optionally, the sub-element region includes: audio package format identification reference information, used to reference an audio package format identification for format description; and/or audio object identification reference information, used to reference other audio objects; and/or audio complementary object identification reference information, used to describe audio objects complementary to the audio object; and/or audio track unique identification reference information, used to describe references to audio track unique identifications; and/or audio object interaction information, used to characterize the specification of user interaction with the object.
Optionally, the audio object interaction information includes: audio object switching information for characterizing whether to allow an audio object to be turned on or off; audio object gain information characterizing whether or not to allow a gain of an audio object to be changed; audio object position information characterizing whether a user is allowed to alter the position of an audio object.
Optionally, the audio object gain information includes: maximum value information of the audio gain, wherein the maximum value information of the audio gain is used for representing a maximum gain factor allowing user gain interaction; minimum information of audio gain characterizing a minimum gain factor allowing user gain interaction.
Optionally, the audio object position information includes polar coordinate parameters and Cartesian coordinate parameters, where the polar coordinate parameters represent the position information of the audio object in polar coordinates, and the Cartesian coordinate parameters represent the position information of the audio object in Cartesian coordinates.
Optionally, the polar coordinate parameters include: maximum azimuth information, used to characterize the maximum azimuth offset value allowed for user position interaction; minimum azimuth information, used to characterize the minimum azimuth offset value allowed for user position interaction; maximum elevation information, used to characterize the maximum elevation offset value allowed for user position interaction; minimum elevation information, used to characterize the minimum elevation offset value allowed for user position interaction; maximum distance information, used to characterize the maximum normalized distance value allowed for user position interaction; minimum distance information, used to characterize the minimum normalized distance value allowed for user position interaction.
Optionally, the Cartesian coordinate parameters include: maximum X-axis coordinate information, used to characterize the maximum X-axis offset value, in normalized units, allowed for user position interaction; minimum X-axis coordinate information, used to characterize the minimum X-axis offset value, in normalized units, allowed for user position interaction; maximum Y-axis coordinate information, used to characterize the maximum Y-axis offset value, in normalized units, allowed for user position interaction; minimum Y-axis coordinate information, used to characterize the minimum Y-axis offset value, in normalized units, allowed for user position interaction; maximum Z-axis coordinate information, used to characterize the maximum Z-axis offset value, in normalized units, allowed for user position interaction; minimum Z-axis coordinate information, used to characterize the minimum Z-axis offset value, in normalized units, allowed for user position interaction.
The embodiment of the present disclosure generates audio object metadata, which describes the format in which the audio object is played, so that the audio content can be obtained correctly when the audio object is played, improving the quality of the audio playback scene.
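As an end-to-end illustration of step S210, the following Python sketch produces a complete audioObject metadata element (attribute region plus sub-element region) and serializes it; the identifiers are placeholders taken from the examples of embodiment 1, and the function name is our own:
import xml.etree.ElementTree as ET

def generate_audio_object_metadata():
    # Attribute region: mandatory identifier and name, plus optional attributes.
    obj = ET.Element("audioObject", {
        "audioObjectID": "AO-1002",
        "audioObjectName": "object001",
        "start": "00:00:00.00000",
        "importance": "5",
        "interact": "1",
        "disableDucking": "0",
    })
    # Sub-element region: format reference, complementary object, track UID.
    ET.SubElement(obj, "audioPackFormatIDRef").text = "AP_00010001"
    ET.SubElement(obj, "audioComplementaryObjectRef").text = "AO_1001"
    ET.SubElement(obj, "audioTrackIDRef").text = "ATU_00000003"
    inter = ET.SubElement(obj, "audioObjectInteraction",
                          {"onOffInteract": "1", "gainInteract": "1"})
    ET.SubElement(inter, "gainInteractionRange", {"bound": "min"}).text = "-20.0"
    ET.SubElement(inter, "gainInteractionRange", {"bound": "max"}).text = "3.0"
    return ET.tostring(obj, encoding="unicode")

print(generate_audio_object_metadata())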
Example 3
Fig. 4 is a schematic structural diagram of an electronic device provided in embodiment 3 of the present disclosure. As shown in fig. 4, the electronic device includes: a processor 30, a memory 31, an input device 32, and an output device 33. The number of processors 30 in the electronic device may be one or more; one processor 30 is taken as an example in fig. 4. The number of memories 31 in the electronic device may be one or more; one memory 31 is taken as an example in fig. 4. The processor 30, the memory 31, the input device 32, and the output device 33 of the electronic device may be connected by a bus or other means; connection by a bus is taken as an example in fig. 4. The electronic device may be a computer, a server, or the like. The embodiment of the present disclosure is described in detail taking a server as the electronic device; the server may be an independent server or a cluster server.
Memory 31 is provided as a computer-readable storage medium that may be used to store software programs, computer-executable programs, and modules, such as program instructions/modules for generating audio object metadata as described in any embodiment of the present disclosure. The memory 31 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the device, and the like. Further, the memory 31 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the memory 31 may further include memory located remotely from the processor 30, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 32 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the electronic device, and may also include a camera for capturing images and a sound pickup device for capturing audio data. The output device 33 may include an audio device such as a speaker. The specific composition of the input device 32 and the output device 33 can be set according to actual conditions.
The processor 30 executes various functional applications of the device and data processing, i.e., generates audio object metadata, by executing software programs, instructions, and modules stored in the memory 31.
Example 4
Embodiment 4 of the present disclosure also provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to generate audio object metadata as described in embodiment 1.
Of course, in the storage medium provided by the embodiments of the present disclosure, the computer-executable instructions are not limited to the operations described above; they may also perform related operations in the electronic method provided by any embodiment of the present disclosure, with corresponding functions and advantages.
From the above description of the embodiments, it is obvious for a person skilled in the art that the present disclosure can be implemented by software and necessary general hardware, and certainly can be implemented by hardware, but in many cases, the former is a better embodiment. Based on such understanding, the technical solutions of the present disclosure may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a robot, a personal computer, a server, or a network device) to execute the electronic method according to any embodiment of the present disclosure.
It should be noted that, in the electronic device, the units and modules included in the electronic device are merely divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be realized; in addition, specific names of the functional units are only used for distinguishing one functional unit from another, and are not used for limiting the protection scope of the present disclosure.
It should be understood that portions of the present disclosure may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the description herein, references to the description of the term "in an embodiment," "in yet another embodiment," "exemplary" or "in a particular embodiment," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the disclosure. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although the present disclosure has been described in detail hereinabove with respect to general description, specific embodiments and experiments, it will be apparent to those skilled in the art that some modifications or improvements may be made based on the present disclosure. Accordingly, such modifications and improvements are intended to be within the scope of this disclosure, as claimed.

Claims (11)

1. Audio object metadata, comprising:
a property region including an audio object identification and an audio object name of an audio object, the audio object identification including information indicating a relationship between a plurality of audio objects;
and a sub-element region used to represent the audio package format identification reference, audio object identification reference, audio complementary object identification reference, audio track unique identification reference, and audio object interaction information of the audio object metadata.
2. The audio object metadata according to claim 1, wherein the property area further comprises:
start time information of an audio object, the start time information relating to the playback start time of the audio program in the audio object;
and/or duration information of an audio object, the duration information relating to the playback start time and end time of the audio program in the audio object;
and/or audio object importance information, the audio object importance information being used to describe an importance index of the audio object;
and/or user-object interaction information, the user-object interaction information being used to characterize whether the user is allowed to interact with the object;
and/or automatic ducking information, the automatic ducking information being used to characterize whether automatic ducking of the object is disabled.
3. The audio object metadata according to claim 1, wherein the sub-element region comprises:
the audio package format identification quoting information is used for quoting the audio package format identification for format description;
and/or, the audio object identification reference information is used for describing other referenced audio objects;
and/or, audio complementary object identification reference information, the audio complementary object identification reference information being used to describe an audio object complementary to the audio object;
and/or, the unique identification reference information of the audio track is used for describing the unique identification reference of the audio track;
and/or audio object interaction information, wherein the audio object interaction information is used for representing the specification of the user interaction with the object.
4. The audio object metadata according to claim 3, wherein the audio object interaction information comprises:
audio object switching information for characterizing whether to allow an audio object to be turned on or off;
audio object gain information characterizing whether or not to allow a gain of an audio object to be changed;
audio object position information characterizing whether a user is allowed to alter the position of an audio object.
5. The audio object metadata according to claim 4, wherein the audio object gain information comprises:
maximum value information of the audio gain, wherein the maximum value information of the audio gain is used for representing a maximum gain factor allowing user gain interaction;
minimum information of audio gain characterizing a minimum gain factor allowing user gain interaction.
6. Audio object metadata according to claim 4, characterized in that said audio object position information comprises:
the audio object detection method comprises a polar coordinate parameter and a Cartesian coordinate parameter, wherein the polar coordinate parameter is used for representing the position information of the audio object in a polar coordinate mode, and the Cartesian coordinate parameter is used for representing the position information of the audio object in a Cartesian coordinate mode.
7. The audio object metadata according to claim 6, wherein the polar coordinate parameters comprise:
maximum azimuth information, used to characterize the maximum azimuth offset value allowed for user position interaction;
minimum azimuth information, used to characterize the minimum azimuth offset value allowed for user position interaction;
maximum elevation information, used to characterize the maximum elevation offset value allowed for user position interaction;
minimum elevation information, used to characterize the minimum elevation offset value allowed for user position interaction;
maximum distance information, used to characterize the maximum normalized distance value allowed for user position interaction;
minimum distance information, used to characterize the minimum normalized distance value allowed for user position interaction.
8. Audio object metadata according to claim 6, characterized in that said Cartesian coordinate parameters comprise:
maximum X-axis coordinate information, used to characterize the maximum X-axis offset value, in normalized units, allowed for user position interaction;
minimum X-axis coordinate information, used to characterize the minimum X-axis offset value, in normalized units, allowed for user position interaction;
maximum Y-axis coordinate information, used to characterize the maximum Y-axis offset value, in normalized units, allowed for user position interaction;
minimum Y-axis coordinate information, used to characterize the minimum Y-axis offset value, in normalized units, allowed for user position interaction;
maximum Z-axis coordinate information, used to characterize the maximum Z-axis offset value, in normalized units, allowed for user position interaction;
minimum Z-axis coordinate information, used to characterize the minimum Z-axis offset value, in normalized units, allowed for user position interaction.
9. A method for generating audio object metadata, comprising:
generating the audio object metadata according to any one of claims 1 to 8.
10. An electronic device, comprising: a memory and one or more processors;
the memory for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors are caused to generate the audio object metadata according to any one of claims 1 to 8.
11. A storage medium containing computer-executable instructions which, when executed by a computer processor, are used to generate the audio object metadata according to any one of claims 1 to 8.
CN202111102038.6A 2021-09-18 2021-09-18 Audio object metadata and generation method, electronic device, and storage medium Pending CN113963725A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202111102038.6A (publication CN113963725A) | 2021-09-18 | 2021-09-18 | Audio object metadata and generation method, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202111102038.6A (publication CN113963725A) | 2021-09-18 | 2021-09-18 | Audio object metadata and generation method, electronic device, and storage medium

Publications (1)

Publication Number | Publication Date
CN113963725A | 2022-01-21

Family

ID: 79461703

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202111102038.6A (Pending; publication CN113963725A) | Audio object metadata and generation method, electronic device, and storage medium | 2021-09-18 | 2021-09-18

Country Status (1)

Country Link
CN (1) CN113963725A (en)


Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008086030A (en) * 2002-04-12 2008-04-10 Mitsubishi Electric Corp Hint information description method
US20070233738A1 (en) * 2006-04-03 2007-10-04 Digitalsmiths Corporation Media access system
CN101908052A (en) * 2009-11-27 2010-12-08 新奥特(北京)视频技术有限公司 Making method and device of multimedia program
US20130080159A1 (en) * 2011-09-27 2013-03-28 Google Inc. Detection of creative works on broadcast media
CN106688251A (en) * 2014-07-31 2017-05-17 杜比实验室特许公司 Audio processing systems and methods
CN107004021A (en) * 2014-12-08 2017-08-01 微软技术许可有限责任公司 Recommended based on process content metadata tag generation
US20180098173A1 (en) * 2016-09-30 2018-04-05 Koninklijke Kpn N.V. Audio Object Processing Based on Spatial Listener Information
US20210050028A1 (en) * 2018-01-26 2021-02-18 Lg Electronics Inc. Method for transmitting and receiving audio data and apparatus therefor
US20190265943A1 (en) * 2018-02-23 2019-08-29 Bose Corporation Content based dynamic audio settings
CN112334973A (en) * 2018-07-19 2021-02-05 杜比国际公司 Method and system for creating object-based audio content
CN112825550A (en) * 2019-11-19 2021-05-21 萨基姆宽带联合股份公司 Decoder arrangement for generating commands for audio profiles to be applied
CN113377326A (en) * 2021-06-08 2021-09-10 广州博冠信息科技有限公司 Audio data processing method and device, terminal and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
International Telecommunication Union (ITU): "Audio Definition Model", Recommendation ITU-R BS.2076-1, pages 1-106 *

Similar Documents

Publication Publication Date Title
JP2008538675A (en) Media timeline processing infrastructure
JP2019165494A (en) Pair screen rendering of audio and audio encoding and decoding for rendering
US9818448B1 (en) Media editing with linked time-based metadata
CN113963725A (en) Audio object metadata and generation method, electronic device, and storage medium
CN112053699A (en) Method and device for processing game card voice change
CN113905321A (en) Object-based audio channel metadata and generation method, device and storage medium
CN113963724A (en) Audio content metadata and generation method, electronic device and storage medium
CN114203189A (en) Method, apparatus and medium for generating metadata based on binaural audio packet format
CN114023340A (en) Object-based audio packet format metadata and generation method, apparatus, and medium
CN114121036A (en) Audio track unique identification metadata and generation method, electronic device and storage medium
CN114023339A (en) Audio-bed-based audio packet format metadata and generation method, device and medium
CN115190412A (en) Method, device and equipment for generating internal data structure of renderer and storage medium
CN114203188A (en) Scene-based audio packet format metadata and generation method, device and storage medium
CN114051194A (en) Audio track metadata and generation method, electronic equipment and storage medium
CN113990355A (en) Audio program metadata and generation method, electronic device and storage medium
CN115038029A (en) Rendering item processing method, device and equipment of audio renderer and storage medium
CN115426612A (en) Metadata parsing method, device, equipment and medium for object renderer
CN114530157A (en) Audio metadata channel allocation block generation method, apparatus, device and medium
Franck et al. A system architecture for semantically informed rendering of object-based audio
CN115529548A (en) Speaker channel generation method and device, electronic device and medium
CN113889128A (en) Audio production model and generation method, electronic equipment and storage medium
CN113923264A (en) Scene-based audio channel metadata and generation method, device and storage medium
CN113905322A (en) Method, device and storage medium for generating metadata based on binaural audio channel
CN114143695A (en) Audio stream metadata and generation method, electronic equipment and storage medium
CN113938811A (en) Audio channel metadata based on sound bed, generation method, equipment and storage medium

Legal Events

Code | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination