TWI530941B - Method and system for interactive rendering of object-based audio - Google Patents

Method and system for interactive rendering of object-based audio

Info

Publication number
TWI530941B
TWI530941B (application TW103105464A)
Authority
TW
Taiwan
Prior art keywords
object
channels
audio
audio content
program
Prior art date
Application number
TW103105464A
Other languages
Chinese (zh)
Other versions
TW201445561A (en)
Inventor
Robert Andrew France
Thomas Ziegler
Sripal S Mehta
Prinyar Saungsomboon
Andrew Jonathan Dowell
Michael David Dwyer
Farhad Farahani
Nicolas R Tsingos
Freddie Sanchez
Original Assignee
Dolby Lab Licensing Corp
Dolby Int Ab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority to US201361807922P
Priority to US201361832397P
Application filed by Dolby Lab Licensing Corp and Dolby Int Ab
Publication of TW201445561A
Application granted
Publication of TWI530941B

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 - Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/008 - Multichannel audio signal coding or decoding, i.e. using interchannel correlation to reduce redundancies, e.g. joint-stereo, intensity-coding, matrixing
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/16 - Sound input; Sound output
    • G06F 3/165 - Management of the audio stream, e.g. setting of volume, audio stream path
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 - Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/04 - Speech or audio signal analysis-synthesis techniques for redundancy reduction, using predictive techniques
    • G10L 19/16 - Vocoder architecture
    • G10L 19/18 - Vocoders using multiple modes
    • G10L 19/20 - Vocoders using multiple modes using sound class specific coding, hybrid encoders or object based coding
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04S - STEREOPHONIC SYSTEMS
    • H04S 3/00 - Systems employing more than two channels, e.g. quadraphonic
    • H04S 3/008 - Systems employing more than two channels, e.g. quadraphonic, in which the audio signals are in digital form, i.e. employing more than two discrete digital channels, e.g. Dolby Digital, Digital Theatre Systems [DTS]
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04S - STEREOPHONIC SYSTEMS
    • H04S 7/00 - Indicating arrangements; Control arrangements, e.g. balance control
    • H04S 7/30 - Control circuits for electronic adaptation of the sound field
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 - Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/04 - Speech or audio signal analysis-synthesis techniques for redundancy reduction, using predictive techniques
    • G10L 19/16 - Vocoder architecture
    • G10L 19/167 - Audio streaming, i.e. formatting and decoding of an encoded audio signal representation into a data stream for transmission or storage purposes
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04S - STEREOPHONIC SYSTEMS
    • H04S 2400/00 - Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S 2400/01 - Multi-channel, i.e. more than two input channels, sound reproduction with two speakers wherein the multi-channel information is substantially preserved
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04S - STEREOPHONIC SYSTEMS
    • H04S 2400/00 - Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S 2400/13 - Aspects of volume control, not necessarily automatic, in stereophonic sound systems
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04S - STEREOPHONIC SYSTEMS
    • H04S 2400/00 - Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S 2400/15 - Aspects of sound capture and related signal processing for recording or reproduction
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04S - STEREOPHONIC SYSTEMS
    • H04S 2420/00 - Techniques used in stereophonic systems covered by H04S but not provided for in its groups
    • H04S 2420/03 - Application of parametric coding in stereophonic audio systems

Description

Method and system for interactive rendering of object-based audio

[Cross-Reference to Related Applications]

This application claims the benefit of the filing dates of U.S. Provisional Patent Application No. 61/807,922, filed April 3, 2013, and U.S. Provisional Patent Application No. 61/832,397, filed June 7, 2013, each of which is incorporated herein by reference.

The present invention relates to audio signal processing, and more particularly to encoding, decoding, and interactive rendering of audio data bitstreams which include audio content (typically indicative of speaker channels and at least one selectable audio object channel) and metadata which supports interactive rendering of the audio content. Some embodiments of the invention generate, decode, and/or render audio data in one of the formats known as Dolby Digital (AC-3), Dolby Digital Plus (Enhanced AC-3, or E-AC-3), or Dolby E.

Dolby, Dolby Digital, Dolby Digital Plus, and Dolby E are trademarks of Dolby Laboratories Licensing Corporation. Dolby Laboratories provides proprietary implementations of AC-3 and E-AC-3 known as Dolby Digital and Dolby Digital Plus, respectively.

A typical stream of audio data includes both audio content (e.g., one or more channels of audio content) and metadata indicative of at least one characteristic of the audio content. For example, in an AC-3 bitstream there are several audio metadata parameters that are specifically intended for use in changing the sound of the program delivered to a listening environment. One of the metadata parameters is the DIALNORM parameter, which indicates the mean level of dialog occurring in the audio program and is used to determine the audio playback signal level.

Although the invention is not limited to use with AC-3 bitstreams, E-AC-3 bitstreams, or Dolby E bitstreams, for convenience it will be described in embodiments in which such a bitstream, including loudness processing state metadata, is generated, decoded, or otherwise processed.

An AC-3 encoded bitstream comprises metadata and one to six channels of audio content. The audio content is audio data that has been compressed using perceptual audio coding. The metadata includes several audio metadata parameters that are intended for use in changing the sound of a program delivered to a listening environment. Details of AC-3 (also known as Dolby Digital) coding are well known and are set forth in many published references, including the ATSC Standard A52/A: Digital Audio Compression Standard (AC-3), Rev. A, Advanced Television Systems Committee, 20 August 2001.

Details of Dolby Digital Plus (E-AC-3) coding are set forth in "Introduction to Dolby Digital Plus, an Enhancement to the Dolby Digital Coding System," AES Convention Paper 6196, 117th AES Convention, October 28, 2004.

Details of Dolby E coding are set forth in "Efficient Bit Allocation, Quantization, and Coding in an Audio Distribution System," AES Preprint 5068, 107th AES Conference, August 1999, and in "Professional Audio Coder Optimized for Use with Video," AES Preprint 5033, 107th AES Conference, August 1999.

Each frame of an AC-3 encoded audio bitstream contains audio content and metadata for 1536 samples of digital audio. For a sampling rate of 48 kHz, this represents 32 milliseconds of digital audio, or a rate of 31.25 frames per second of audio.

Each frame of an E-AC-3 encoded audio bitstream contains audio content and metadata for 256, 512, 768, or 1536 samples of digital audio, depending on whether the frame contains one, two, three, or six blocks of audio data, respectively. For a sampling rate of 48 kHz, this represents 5.333, 10.667, 16, or 32 milliseconds of digital audio, respectively, or a rate of 187.5, 93.75, 62.5, or 31.25 frames per second of audio, respectively.
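
These durations and rates follow directly from the block and sample counts. As a check, the following minimal Python sketch (the names are ours, not from any Dolby specification) computes them for each allowed block count at 48 kHz.

SAMPLES_PER_BLOCK = 256   # audio samples per block
SAMPLE_RATE_HZ = 48_000

def frame_timing(blocks_per_frame: int) -> tuple[int, float, float]:
    """Return (samples, duration in ms, frames per second) for an
    E-AC-3 frame carrying the given number of audio blocks."""
    samples = blocks_per_frame * SAMPLES_PER_BLOCK
    duration_ms = 1000.0 * samples / SAMPLE_RATE_HZ
    frames_per_second = SAMPLE_RATE_HZ / samples
    return samples, duration_ms, frames_per_second

for blocks in (1, 2, 3, 6):
    samples, ms, fps = frame_timing(blocks)
    print(f"{blocks} block(s): {samples} samples, {ms:.3f} ms, {fps:.3f} frames/s")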

As shown in Figure 1, each AC-3 frame is divided into sections (segments), including: a Synchronization Information (SI) section which contains (as shown in Figure 2) a synchronization word (SW) and the first of two error correction words (CRC1); a Bitstream Information (BSI) section which contains most of the metadata; six Audio Blocks (AB0 to AB5) which contain data-compressed audio content (and can also include metadata); waste bits (W) comprising any unused bits left over after the audio content is compressed; an Auxiliary (AUX) information section which may contain more metadata; and the second of the two error correction words (CRC2).

As shown in Figure 4, each E-AC-3 frame is divided into sections (segments), including: a Synchronization Information (SI) section which contains (as shown in Figure 2) a synchronization word (SW); a Bitstream Information (BSI) section which contains most of the metadata; between one and six Audio Blocks (AB0 to AB5) which contain data-compressed audio content (and can also include metadata); waste bits (W) comprising any unused bits left over after the audio content is compressed; an Auxiliary (AUX) information section which may contain more metadata; and an error correction word (CRC).
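
For orientation, the frame layout just described can be summarized in code. The sketch below is a schematic Python model of the sections and their order only (field names are ours); the bit-exact syntax is defined in ATSC A/52.

from dataclasses import dataclass, field

@dataclass
class Ac3Frame:
    """Schematic layout of an AC-3 frame; an E-AC-3 frame differs mainly
    in carrying one to six audio blocks and a single CRC word."""
    sync_word: int                                    # SI: synchronization word (SW)
    crc1: int                                         # SI: first error correction word
    bsi: bytes                                        # BSI: most of the metadata
    audio_blocks: list = field(default_factory=list)  # AB0..AB5: compressed audio
    waste_bits: bytes = b""                           # W: unused bits after compression
    aux: bytes = b""                                  # AUX: optional extra metadata
    crc2: int = 0                                     # second error correction word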

In an AC-3 (or E-AC-3) bitstream, there are several audio metadata parameters that are specifically intended for use in changing the sound of the program delivered to a listening environment. One of the metadata parameters is the DIALNORM parameter, which is included in the BSI segment.

As shown in Figure 3, the BSI section of an AC-3 frame (or E-AC-3 frame) includes a five-bit parameter ("DIALNORM") indicating the DIALNORM value for the program. A five-bit parameter ("DIALNORM2") indicating the DIALNORM value for a second audio program carried in the same AC-3 frame is included if the audio coding mode ("acmod") of the frame is "0", indicating that a dual-mono or "1+1" channel configuration is in use.

The BSI section also includes a flag ("addbsie") indicating the presence (or absence) of additional bitstream information following the "addbsie" bit, a parameter ("addbsil") indicating the length of any additional bitstream information following the "addbsil" value, and up to 64 bits of additional bitstream information ("addbsi") following the "addbsil" value.
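
A minimal sketch of how a decoder might extract these BSI fields, assuming a bit reader already positioned at the dialnorm field; the real BSI syntax contains many intervening fields (acmod, mixing levels, etc., per ATSC A/52) omitted here, and the BitReader helper is our own.

class BitReader:
    """Read big-endian bit fields from a byte string."""
    def __init__(self, data: bytes):
        self.data, self.pos = data, 0
    def read(self, nbits: int) -> int:
        value = 0
        for _ in range(nbits):
            byte = self.data[self.pos // 8]
            value = (value << 1) | ((byte >> (7 - self.pos % 8)) & 1)
            self.pos += 1
        return value

def read_bsi_fields(r: BitReader, acmod: int) -> dict:
    """Simplified: assumes r is positioned at dialnorm."""
    fields = {"dialnorm": r.read(5)}     # 5-bit dialog level
    if acmod == 0:                       # dual-mono ("1+1") configuration
        fields["dialnorm2"] = r.read(5)
    if r.read(1):                        # addbsie: extra bitstream info present?
        addbsil = r.read(6)              # length code for the extra info
        fields["addbsi"] = bytes(r.read(8) for _ in range(addbsil + 1))
    return fields

r = BitReader(bytes([0b11111000, 0b00000000]))
print(read_bsi_fields(r, acmod=2))       # -> {'dialnorm': 31}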

The BSI section includes other metadata values not specifically shown in Figure 3.

It has been proposed to include other types of metadata in audio bitstreams. For example, methods and systems for generating, decoding, and processing audio bitstreams including metadata indicative of the processing state (e.g., loudness processing state) and characteristics (e.g., loudness) of audio content are described in PCT International Publication WO 2012/075246 A2, having international filing date December 1, 2011, and assigned to the assignee of the present application. That reference also describes adaptive processing of the audio content of a bitstream using the metadata, and verification of the loudness processing state and loudness of the audio content of a bitstream using the metadata.

Methods for generating and rendering object-based audio programs are known. During generation of such programs, it is typically assumed that the loudspeakers to be employed for rendering may be located anywhere in the playback environment, not necessarily in a (nominally) horizontal plane or in any other predetermined arrangement known at the time of program generation. Typically, metadata included in the program indicates rendering parameters for rendering at least one object of the program at an apparent spatial location, or along a trajectory (in a three-dimensional volume), e.g., using a three-dimensional array of speakers. For example, an object channel of the program may have corresponding metadata indicating a three-dimensional trajectory of apparent spatial positions at which the object (indicated by the object channel) is to be rendered. The trajectory may include a sequence of "floor" locations (in the plane of a subset of speakers assumed to be located on the floor, or in another horizontal plane, of the playback environment), and a sequence of "above-floor" locations (each determined by driving a subset of the speakers assumed to be located in at least one other horizontal plane of the playback environment). Examples of rendering of object-based audio programs are described, for example, in PCT International Application No. PCT/US2011/028783, published September 29, 2011, the disclosure of which is incorporated herein by reference.

In accordance with some embodiments of the present invention, an object-based audio program (generated in accordance with the invention) is rendered so as to provide an immersive, personalizable perception of the program's audio content. Typically, the content is indicative of the atmosphere at (i.e., the sound occurring at) a spectator event (e.g., a soccer or rugby game, a car or motorcycle race, or another sporting event), and/or live commentary on the spectator event. In some embodiments, the content is not indicative of atmosphere at, or live commentary on, a spectator event (e.g., in some embodiments the content is a scripted or movie program indicative of dialog in multiple selectable versions, and/or other audio content). In some embodiments, the audio content of the program is indicative of multiple audio object channels (e.g., indicative of user-selectable objects or object sets, and typically also a default set of objects which will be rendered in the absence of a user selection of objects) and at least one set (sometimes referred to herein as a "group") of speaker channels. The group of speaker channels may be a conventional mix (e.g., a 5.1 channel mix) of speaker channels of the type that might be included in a conventional broadcast program which does not include object channels.

In some embodiments, object-related metadata delivered by (i.e., as part of) the object-based audio program provides mixing interactivity (e.g., a large degree of mixing interactivity) on the playback side, including by allowing the end user to select a mix of the program's audio content for rendering, rather than merely allowing playback of a pre-mixed sound field. For example, the user may select among the rendering options provided by the metadata of a typical embodiment of the inventive program to select a subset of the available object channels for rendering, and optionally also the playback level of at least one audio object (sound source) indicated by an object channel to be rendered. The spatial location at which each selected sound source is rendered may be predetermined by metadata included in the program, but in some embodiments may be selected by the user (e.g., subject to predetermined rules or constraints). In some embodiments, metadata included in the program allows the user to select from a menu of rendering options (e.g., a small number of rendering options, such as a "home team crowd noise" object, a "home team crowd noise" and "home team commentary" object set, an "away team crowd noise" object, and an "away team crowd noise" and "away team commentary" object set). The menu may be presented to the user by a user interface of a controller. The controller is typically coupled (e.g., by a wireless link) to a set-top device (or other device, e.g., a TV, AVR, tablet, or phone) which is configured to decode and render (at least partially) the object-based program. In some other embodiments, metadata included in the program otherwise allows the user to select from a set of options as to which object(s) indicated by the object channels should be rendered, and as to how the objects to be rendered should be configured.
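
As an illustration of this playback-side mix selection, the sketch below filters a program's object channels down to the user's menu choices and attaches per-object gains before rendering. All names and the metadata shape are hypothetical, not an actual Dolby format.

from dataclasses import dataclass

@dataclass
class ObjectChannel:
    object_id: str          # persistent ID, e.g. "home_crowd"
    label: str              # menu label shown by the controller UI
    default_gain_db: float
    selectable: bool        # False => a default object, always rendered

def build_mix(objects: list[ObjectChannel],
              chosen_ids: set[str],
              gain_overrides: dict[str, float]) -> list[tuple[ObjectChannel, float]]:
    """Return (object, gain) pairs to mix with the speaker-channel group.
    Non-selectable (default) objects are always included."""
    mix = []
    for obj in objects:
        if (not obj.selectable) or obj.object_id in chosen_ids:
            gain = gain_overrides.get(obj.object_id, obj.default_gain_db)
            mix.append((obj, gain))
    return mix

program_objects = [
    ObjectChannel("commentary_main", "Main commentary", 0.0, True),
    ObjectChannel("home_crowd", "Home-team crowd noise", -3.0, True),
    ObjectChannel("away_crowd", "Away-team crowd noise", -3.0, True),
]
print(build_mix(program_objects, {"home_crowd"}, {"home_crowd": -1.5}))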

In one class of embodiments, the invention is a method of generating an object-based audio program (e.g., including generating the program by encoding audio content) such that the program may be rendered in a personalizable manner to provide an immersive perception of the program's audio content. Other embodiments include steps of delivering (e.g., broadcasting), decoding, and/or rendering such a program. Rendering of the audio objects indicated by (included in) the program can provide an immersive experience (e.g., when the playback system includes a three-dimensional array of speakers, or even when the playback system includes a nominally two-dimensional array of speakers).

Typically, the audio content of the program is indicative of multiple object channels (e.g., indicative of user-selectable objects, and typically also a default set of objects which will be rendered in the absence of a user selection) and a set ("group") of speaker channels. In some embodiments, the consumer uses a controller (implementing a user interface) to select object channel content of the program (and corresponding rendering parameters), but the controller does not provide an option for the user to select speaker channel content of the program (i.e., individual speaker channels of the group).

In some embodiments, the object-based audio program is an encoded (e.g., compressed) audio bitstream (sometimes referred to herein as a "main mix") indicative of at least some (i.e., at least a part) of the program's audio content (e.g., a group of speaker channels and at least some of the program's object channels) and object-related metadata, and optionally also at least one additional bitstream or file (sometimes referred to herein as a "side mix") indicative of some of the program's audio content (e.g., at least some of the object channels) and/or object-related metadata.

In some embodiments, the object-related metadata of the program includes persistent metadata (i.e., both persistent and non-persistent metadata). For example, the object-related metadata may include non-persistent metadata which can be changed at at least one point in the broadcast chain (from the content creation facility to the consumer's user interface), e.g., a default level and/or rendering position or trajectory for a user-selectable object, and persistent metadata which is intended not to be changeable (or cannot be changed) after initial generation of the program (typically in the content creation facility). Examples of persistent metadata include an object ID for each user-selectable object, or other object or set of objects, of the program, and synchronization words (e.g., time codes) indicative of the timing of each user-selectable object, or other object, relative to the audio content of the speaker channel group or other elements of the program. Persistent metadata is typically preserved throughout the broadcast chain from the content creation facility to the user interface, throughout the entire duration of a broadcast of the program, or even also during re-broadcasts of the program. In some embodiments, the audio content (and associated metadata) of at least one user-selectable object is sent in the main mix of the object-based audio program, and at least some persistent metadata (e.g., time codes), and optionally also the audio content (and associated metadata) of at least one other object, is sent in a side mix of the program.

Persistent metadata of some embodiments of the inventive object-based audio program is employed to preserve (e.g., even after broadcast of the program) a user-selected mix of object content and group (speaker channel) content. For example, this may provide the selected mix as a default mix each time the user watches a program of a particular type (e.g., any football game), or each time the user watches any program (of any type), until the user changes his/her selection. For example, during broadcast of a first program, the user may select a mix including an object having a persistent ID (e.g., an object identified as a "home team crowd noise" object), and thereafter, each time the user watches (and listens to) another program which includes an object having the same persistent ID, the playback system will automatically render the program with the same mix until the user changes the mix selection. In some embodiments of the inventive object-based audio program, persistent object-related metadata may cause the rendering of some objects to be mandatory during an entire program (e.g., even though the user desires to defeat such rendering).
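
One way a playback system might use persistent object IDs to carry a user's mix choice across programs, as described above, is sketched below; this is an illustrative preference cache of our own design, not behavior mandated by any bitstream format.

class MixPreferenceStore:
    """Remembers which persistent object IDs the user last selected,
    keyed by program category (e.g. 'football'), and reapplies that
    selection to any later program carrying the same IDs."""
    def __init__(self):
        self._prefs: dict[str, set[str]] = {}

    def remember(self, category: str, selected_ids: set[str]) -> None:
        self._prefs[category] = set(selected_ids)

    def selection_for(self, category: str, available_ids: set[str]) -> set[str]:
        # Reapply the remembered mix, restricted to the persistent IDs
        # that the new program actually carries.
        return self._prefs.get(category, set()) & available_ids

store = MixPreferenceStore()
store.remember("football", {"home_crowd", "commentary_main"})
# A later broadcast carrying the same persistent IDs gets the same mix:
print(store.selection_for("football", {"home_crowd", "away_crowd", "commentary_main"}))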

In some embodiments, object-related metadata provides a default mix of object content and group (speaker channel) content, with default rendering parameters (e.g., default spatial locations of the rendered objects).

In some embodiments, object-related metadata provides a set of selectable "preset" mixes of object and "group" speaker channel content, each preset mix having a predetermined set of rendering parameters (e.g., spatial locations of the rendered objects). These may be presented by a user interface of the playback system as a limited menu or palette of available mixes. Each preset mix (and/or each selectable object) may have a persistent ID (e.g., a name, label, or logo), and an indication of this ID may typically be displayed by the user interface of the playback system (e.g., on the screen of an iPad or other controller). For example, there may be a selectable "home team" mix with a persistent ID (e.g., a team logo), where the ID remains persistent regardless of changes made (e.g., by the broadcaster) to details of the audio content or non-persistent metadata of each object of the preset mix.

In some embodiments, object-related metadata of the program (or a pre-configuration of the playback or rendering system, not indicated by metadata delivered with the program) provides constraints or conditions on selectable mixes of object and group (speaker channel) content. For example, if digital rights management (DRM) is employed, a DRM hierarchy may be implemented to allow a consumer "tiered" access to a set of audio objects included in an object-based audio program. If the consumer pays more money (e.g., to the broadcaster), the consumer may be authorized to decode and select (and hear) more audio objects of the program. For another example, object-related metadata may provide constraints on user selection of objects (e.g., if both a "home team crowd noise" object and a "home team announcer" object are selected, the metadata ensures that these two objects are rendered with predetermined relative spatial locations). The constraints may also be determined (at least in part) by data regarding the playback system (e.g., user-entered data). For example, if the playback system is a stereo system (including only two speakers), the object processing subsystem of the system may be configured to prevent the user from selecting mixes (identified by object-related metadata) that cannot be rendered with adequate spatial resolution by only two speakers. For another example, some delivered objects may be removed from the category of selectable objects, as indicated by object-related metadata (and/or other data entered to the playback system), for legal (e.g., DRM) reasons or other reasons (e.g., based on the bandwidth of the delivery channel). The consumer may pay the content creator or broadcaster for more bandwidth, and as a result may be allowed to select from a larger menu of selectable objects and/or object/group mixes.
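
A sketch of how such constraints (speaker-count limits, DRM tiers) might be expressed and enforced by a playback system; the constraint fields and mix names are hypothetical.

from dataclasses import dataclass

@dataclass
class MixConstraints:
    min_speakers: int = 0       # mix needs at least this many speakers to resolve
    required_tier: int = 0      # DRM tier needed to unlock this mix

def allowed_mixes(mixes: dict[str, MixConstraints],
                  speaker_count: int, user_tier: int) -> list[str]:
    """Filter the selectable mixes down to those this playback system
    and this user's entitlement can actually render."""
    return [name for name, c in mixes.items()
            if speaker_count >= c.min_speakers and user_tier >= c.required_tier]

mixes = {
    "default":            MixConstraints(),
    "home_team_3d":       MixConstraints(min_speakers=5),   # too diffuse for stereo
    "premium_commentary": MixConstraints(required_tier=1),  # paid DRM tier
}
print(allowed_mixes(mixes, speaker_count=2, user_tier=0))   # -> ['default']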

In some embodiments, the invention implements rule-based object channel selection, in which at least one predetermined rule determines which object channel(s) of the object-based audio program are rendered (e.g., with a group of speaker channels). Typically, the user specifies at least one rule for object channel selection (e.g., by selecting from a menu of available rules presented by a user interface of the playback system controller), and the playback system applies each such rule to determine which object channel(s) of the object-based audio program should be included in the mix of channels to be rendered. The playback system may determine, from object-related metadata in the program, which object channel(s) of the program satisfy the predetermined rule(s).
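
The rule-based selection described above can be pictured as predicates evaluated against each object channel's metadata. The sketch below (metadata shape and rules are our own) keeps every object that satisfies all user-chosen rules.

from typing import Callable

ObjectMeta = dict                       # per-object metadata, e.g. {"type": "crowd"}
Rule = Callable[[ObjectMeta], bool]

def select_objects(objects: dict[str, ObjectMeta], rules: list[Rule]) -> list[str]:
    """Return IDs of object channels whose metadata satisfies every rule."""
    return [oid for oid, meta in objects.items()
            if all(rule(meta) for rule in rules)]

objects = {
    "home_crowd":      {"type": "crowd", "team": "home"},
    "away_crowd":      {"type": "crowd", "team": "away"},
    "commentary_main": {"type": "commentary", "language": "en"},
}
# User-chosen rule: "render any crowd noise associated with the home team"
rules: list[Rule] = [lambda m: m.get("type") == "crowd",
                     lambda m: m.get("team") == "home"]
print(select_objects(objects, rules))   # -> ['home_crowd']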

In some embodiments, the inventive object-based audio program comprises a set of bitstreams (multiple bitstreams, which may be referred to as "substreams") which are generated and transmitted in parallel. Typically, multiple decoders are employed to decode them (e.g., the program includes multiple E-AC-3 substreams and the playback system employs multiple E-AC-3 decoders to decode the substreams). Typically, each substream includes a different subset of the full set of object channels and the corresponding object-related metadata, and at least one substream includes a group of speaker channels. Each substream preferably includes synchronization words (e.g., time codes) to allow the substreams to be synchronized or time-aligned with each other. For example, in each substream, each container including object channel content and object-related metadata includes a unique ID or timestamp.

For another example, a set of N Dolby E bitstreams of the inventive type is generated and transmitted in parallel. Each such Dolby E bitstream comprises a sequence of bursts. Each burst may carry speaker channel audio content (a "group" of speaker channels) and a subset of the full set of the inventive object channels (which may be a large set) together with object-related metadata (i.e., each burst may indicate some object channels of the full set and the associated object-related metadata). Each bitstream in the set includes synchronization words (e.g., time codes) to allow the bitstreams in the set to be synchronized or time-aligned with each other. For example, in each bitstream, each container including object channel content and object-related metadata may include a unique ID or timestamp to allow the bitstreams in the set to be synchronized or time-aligned with each other.
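
The timestamp-based alignment of parallel substreams might look like the following sketch, which buffers containers per substream and releases a time-aligned set once every expected substream has delivered a container for the same timestamp; the container tuple shape is hypothetical.

from collections import defaultdict

def align_substreams(containers, substream_ids):
    """containers: iterable of (substream_id, timestamp, payload).
    Yields (timestamp, {substream_id: payload}) once every expected
    substream has delivered a container for that timestamp."""
    pending = defaultdict(dict)          # timestamp -> {substream_id: payload}
    expected = set(substream_ids)
    for sid, ts, payload in containers:
        pending[ts][sid] = payload
        if set(pending[ts]) == expected:
            yield ts, pending.pop(ts)

stream = [("bed", 0, "5.1 group"), ("objs", 0, "objects+metadata"),
          ("bed", 1, "5.1 group"), ("objs", 1, "objects+metadata")]
for ts, aligned in align_substreams(stream, {"bed", "objs"}):
    print(ts, aligned)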

Some embodiments of the invention (e.g., some embodiments of the inventive playback system) implement distributed rendering. For example, selected object channels (and corresponding object-related metadata) of a program are passed (with a decoded group of speaker channels) from a set-top device (STB) to a downstream device which is configured to render the mix of the object channels and the group of speaker channels (e.g., an AVR or a single-piece home theater). The STB may partially render the audio, and the downstream device may complete the rendering (e.g., by generating speaker feeds for driving a specific top tier of speakers (e.g., ceiling speakers) to place an audio object at a specific apparent source position, where the STB's output merely indicates that the object can be rendered in some unspecified way by some unspecified top tier of speakers). For example, the STB may have no knowledge of the specific organization of the speakers of the playback system, but the downstream device (e.g., AVR or single-piece home theater) may have such knowledge.

In some embodiments, the object-based audio program is or includes at least one AC-3 (or E-AC-3) bitstream, and each container of the program which includes object channel content (and/or object-related metadata) is included in the auxdata field (e.g., the AUX section shown in Figure 1 or Figure 4) at the end of a frame of the bitstream, or in a "skip fields" segment of the bitstream. In some such embodiments, each frame of the AC-3 or E-AC-3 bitstream includes one or two metadata containers. One container can be included in the Aux field of the frame, and another container can be included in the addbsi field of the frame. Each container has a core header and includes (or is associated with) one or more payloads. One such payload (of or associated with a container included in the Aux field) may be a set of audio samples of each of one or more of the inventive object channels (related to the group of speaker channels also indicated by the program) and the object-related metadata associated with each object channel. The core header of each container typically includes at least one ID value indicating the type of payload included in or associated with the container; substream association indications (indicating which substreams the core header is associated with); and protection bits. Typically, each payload has its own header (or "payload identifier"). Object-level metadata may be carried in each substream which is an object channel.
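
A schematic reading of such a container, with a core header carrying a payload-type ID, substream association, and protection bits, followed by one or more payloads each preceded by its own length, is sketched below. The byte layout is invented for illustration and is not the actual serialized container syntax.

import struct

def parse_container(buf: bytes) -> dict:
    """Parse one illustrative metadata container:
      core header: payload-type ID (1 byte), substream association (1 byte),
                   protection bits (2 bytes), payload count (1 byte)
      each payload: length (2 bytes, big-endian) + raw bytes
    This layout is hypothetical; the real container syntax differs."""
    type_id, assoc, protection, count = struct.unpack_from(">BBHB", buf, 0)
    offset, payloads = 5, []
    for _ in range(count):
        (length,) = struct.unpack_from(">H", buf, offset)
        offset += 2
        payloads.append(buf[offset:offset + length])
        offset += length
    return {"type_id": type_id, "substream_assoc": assoc,
            "protection": protection, "payloads": payloads}

# One container holding a single 4-byte payload of object audio samples:
raw = (struct.pack(">BBHB", 0x01, 0x00, 0xBEEF, 1)
       + struct.pack(">H", 4) + b"\x00\x01\x02\x03")
print(parse_container(raw))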

In other embodiments, the object-based audio program is or includes a bitstream which is not an AC-3 bitstream or an E-AC-3 bitstream. In some embodiments, the object-based audio program is or includes at least one Dolby E bitstream, and the object channel content and object-related metadata of the program (e.g., each container of the program which includes object channel content and/or object-related metadata) is included in bit locations of the Dolby E bitstream which conventionally do not carry useful information. Each burst of a Dolby E bitstream occupies a time period equivalent to that of a corresponding video frame. The object channels (and object-related metadata) may be included in the guard bands between Dolby E bursts and/or in unused bit locations within each of the data structures (each having the AES3 frame format) within each Dolby E burst. For example, each guard band consists of a sequence of segments (e.g., 100 segments), each of the first X segments (e.g., X = 20) of each guard band includes the object channels and object-related metadata, and each of the remaining segments of each guard band may include a guard band symbol. In some embodiments, the object channels and object-related metadata of a Dolby E bitstream are included in metadata containers. Each container has a core header and includes (or is associated with) one or more payloads. One such payload may be a set of audio samples of each of one or more of the inventive object channels (related to the group of speaker channels also indicated by the program) and the object-related metadata associated with each object channel. The core header of each container typically includes at least one ID value indicating the type of payload included in or associated with the container; substream association indications (indicating which substreams the core header is associated with); and protection bits. Typically, each payload has its own header (or "payload identifier"). Object-level metadata may be carried in each substream which is an object channel.

In some embodiments, a broadcast facility (e.g., an encoding system in such a facility) generates multiple audio representations (object-based audio programs) from captured sound (e.g., a 5.1 flattened mix, an international mix, and a domestic mix). For example, the group of speaker channels, and/or the menu of selectable objects (or of selectable or non-selectable rendering parameters for rendering and mixing the objects), may differ from program to program.

In some embodiments, the object-based audio program is decodable, and its speaker channel content is renderable, by a legacy decoder and legacy rendering system which is not configured to parse the inventive object channels and object-related metadata. In accordance with some embodiments of the invention, the same program may be rendered by a set-top device (or other decoding and rendering system, e.g., a TV, AVR, tablet, or phone) which is configured (in accordance with an embodiment of the invention) to parse the inventive object channels and object-related metadata and to render a mix of speaker channel and object channel content indicated by the program.

An object-based audio program generated (or transmitted, stored, buffered, decoded, rendered, or otherwise processed) in accordance with some embodiments of the invention includes at least one group of speaker channels, at least one object channel, and metadata indicative of a layered graph (sometimes referred to as a layered "mix graph") indicative of selectable mixes (e.g., all selectable mixes) of the speaker channels and the object channels. For example, the mix graph indicates each rule applicable to selection of subsets of the speaker and object channels. Typically, an encoded audio bitstream is indicative of at least some (i.e., at least a part) of the program's audio content (e.g., a group of speaker channels and at least some of the program's object channels) and the object-related metadata (including the metadata indicative of the mix graph), and optionally also at least one additional encoded audio bitstream or file is indicative of some of the program's audio content and/or object-related metadata.

The layered mix graph is indicative of nodes (each indicative of a selectable channel or channel group, or a category of selectable channels or channel groups) and connections between the nodes (e.g., control interfaces to the nodes and/or rules for selecting channels), and includes essential data (a "base" layer) and optional (i.e., optionally omitted) data (at least one "extension" layer). Typically, the layered mix graph is included in one of the encoded audio bitstreams indicative of the program, and can be assessed by graph traversal (implemented by the playback system) to determine a default mix of channels and the options for modifying the default mix.

Where the mix graph is representable as a tree graph, the base layer can be a branch (or two or more branches) of the tree graph, and each extension layer can be another branch (or another set of two or more branches) of the tree graph. For example, one branch of the tree graph (indicated by the base layer) may be indicative of selectable channels and channel groups available to all end users, and another branch of the tree graph (indicated by an extension layer) may be indicative of additional selectable channels and/or channel groups available only to some end users (e.g., such an extension layer may be provided only to end users authorized to use it).

Typically, the base layer contains (is indicative of) the graph structure and control interfaces to the nodes of the graph (e.g., panning and gain control interfaces). The base layer is necessary for mapping any user interaction to the decoding/rendering process.

Each extension layer contains (is indicative of) an extension to the base layer. The extensions are not immediately necessary for the decoder's mapping of user interaction, and hence can be transmitted at a slower rate and/or delayed, or omitted.
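
The base/extension layering can be modeled as below: the base layer alone yields a valid default mix, and an extension layer, when received, merely grafts additional selectable nodes onto the graph. Structure and names are our own, not the actual mix-graph encoding.

from dataclasses import dataclass, field

@dataclass
class MixNode:
    name: str
    default: bool = False          # part of the preset (default) mix?
    children: list = field(default_factory=list)

def default_mix(root: MixNode) -> list[str]:
    """Walk the graph and collect the preset mix. Only base-layer data is
    needed for this; extension layers add optional nodes and can arrive
    late, or be omitted, without breaking default decoding."""
    picked = [root.name] if root.default else []
    for child in root.children:
        picked.extend(default_mix(child))
    return picked

base = MixNode("program", children=[
    MixNode("speaker_group_5_1", default=True),
    MixNode("commentary_main", default=True),
    MixNode("home_crowd"),                       # selectable, off by default
])
# An extension layer, when received, grafts extra selectable nodes on:
base.children.append(MixNode("alt_language_commentary"))
print(default_mix(base))   # -> ['speaker_group_5_1', 'commentary_main']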

An object-based audio program generated (or transmitted, stored, buffered, decoded, rendered, or otherwise processed) in accordance with some embodiments of the invention includes at least one group of speaker channels, at least one object channel, and metadata indicative of a mix graph (which may or may not be a layered mix graph) indicative of selectable mixes (e.g., all selectable mixes) of the speaker channels and the object channels. An encoded audio bitstream (e.g., a Dolby E or E-AC-3 bitstream) is indicative of at least a portion of the program, and the metadata indicative of the mix graph (and typically also the selectable object and/or speaker channels) is included in every frame of the bitstream (or in each frame of a subset of the frames of the bitstream). For example, each frame may include at least one metadata segment and at least one audio data segment, and the mix graph may be included in at least one metadata segment of each frame. Each metadata segment (which may be referred to as a "container") may have a format which includes a metadata segment header (and optionally also other elements), and one or more metadata payloads following the metadata segment header. Each metadata payload is itself identified by a payload header. The mix graph, if present in a metadata segment, is included in one of the metadata payloads of the metadata segment.

An object-based audio program generated (or transmitted, stored, buffered, decoded, rendered, or otherwise processed) in accordance with some embodiments of the invention includes at least two groups of speaker channels, at least one object channel, and metadata indicative of a mix graph (which may or may not be a layered mix graph). The mix graph is indicative of selectable mixes (e.g., all selectable mixes) of the speaker channels and the object channels, and includes at least one "group mix" node. Each "group mix" node defines a predetermined mix of speaker channel groups, and thus indicates or implements a set of predetermined mixing rules (optionally with user-selectable parameters) for mixing the speaker channels of two or more speaker channel groups of the program.

In another class of embodiments, an object-based audio program generated (or transmitted, stored, buffered, decoded, rendered, or otherwise processed) in accordance with the invention comprises substreams, and the substreams are indicative of at least one group of speaker channels, at least one object channel, and object-related metadata. The object-related metadata includes "substream" metadata (indicative of the substream structure of the program and/or the manner in which the substreams should be decoded) and typically also a mix graph indicative of selectable mixes (e.g., all selectable mixes) of the speaker channels and the object channels. The substream metadata may indicate which substreams of the program should be decoded independently of other substreams of the program, and which substreams of the program should be decoded in association with at least one other substream of the program.

In an exemplary embodiment, the object-based audio program includes at least one group of speaker channels, at least one object channel, and metadata. The metadata includes "substream" metadata (indicative of the substream structure of the program's audio content and/or the manner in which the substreams of the program's audio content should be decoded) and typically also a mix graph indicative of selectable mixes of the speaker channels and the object channels. The audio program is associated with a football game. An encoded audio bitstream (e.g., an E-AC-3 bitstream) is indicative of the program's audio content and metadata. The program's audio content (and hence the bitstream) comprises at least two independent substreams. One independent substream is indicative of a 5.1 speaker channel group indicative of the crowd noise at the football game. Another independent substream is indicative of a 2.0 channel "Team A" group indicative of sound from the portion of the crowd biased toward one team ("Team A"), a 2.0 channel "Team B" group indicative of sound from the portion of the crowd biased toward the other team ("Team B"), and a monophonic object channel indicative of live commentary on the game. The substream metadata of the bitstream indicates that, during decoding, coupling should be "off" between each pair of the independent substreams (so that each independent substream is decoded independently of the other independent substream), and the substream metadata of the bitstream indicates, within each substream, the program channels for which coupling should be "on" (so that these channels are not decoded independently of each other) or "off" (so that these channels are decoded independently of each other). For example, the substream metadata indicates that coupling should be "on" internally within each of the two stereo speaker channel groups (the 2.0 channel "Team A" group and the 2.0 channel "Team B" group) of the second substream, but disabled across the speaker channel groups of the second substream and between the monophonic object channel and each of the speaker channel groups of the second substream (so that the monophonic object channel and the speaker channel groups are decoded independently of each other). Similarly, the substream metadata indicates that coupling should be "on" internally within the 5.1 speaker channel group of the first substream.
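
The substream metadata of this example program can be summarized by a structure like the following; it is an illustrative restatement of the on/off coupling decisions described above, not actual E-AC-3 substream syntax.

# Illustrative summary of the football-game example: which channel groups
# live in which independent substream, and where coupling is on or off.
program_substreams = [
    {
        "id": 0,
        "channel_groups": [{"name": "crowd_noise", "layout": "5.1",
                            "coupling": "on"}],   # couple within the 5.1 group
        "decode_independently": True,             # no coupling across substreams
    },
    {
        "id": 1,
        "channel_groups": [
            {"name": "team_A_crowd", "layout": "2.0", "coupling": "on"},
            {"name": "team_B_crowd", "layout": "2.0", "coupling": "on"},
            {"name": "commentary",   "layout": "mono object",
             "coupling": "off"},                  # decoded independently
        ],
        "decode_independently": True,
    },
]

for sub in program_substreams:
    names = ", ".join(g["name"] for g in sub["channel_groups"])
    print(f"substream {sub['id']}: {names}")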

Another aspect of the invention is an audio processing unit (APU) configured to perform any embodiment of the inventive method. In another class of embodiments, the invention is an APU including a buffer memory (buffer) which stores (e.g., in a non-transitory manner) at least one frame or other segment of an object-based audio program (including audio content of a group of speaker channels and of object channels, and object-related metadata) which has been generated by any embodiment of the inventive method. Examples of APUs include, but are not limited to, encoders (e.g., transcoders), decoders, codecs, pre-processing systems (pre-processors), post-processing systems (post-processors), audio bitstream processing systems, and combinations of such elements.

Aspects of the invention include a system or device configured (e.g., programmed) to perform any embodiment of the inventive method, and a computer readable medium (e.g., a disc) which stores (e.g., in a non-transitory manner) code for implementing any embodiment of the inventive method or steps thereof. For example, the inventive system can be or include a programmable general purpose processor, digital signal processor, or microprocessor, programmed with software or firmware and/or otherwise configured to perform any of a variety of operations on data, including an embodiment of the inventive method or steps thereof. Such a general purpose processor may be or include a computer system including an input device, a memory, and processing circuitry programmed (and/or otherwise configured) to perform an embodiment of the inventive method (or steps thereof) in response to data asserted thereto.

[Notation and Terminology]

Throughout this disclosure, including in the scope of the patent application, the expression "operating" on a signal or data (eg, filtering, scaling, converting, or applying gain to a signal or data) is used broadly to denote a pair of signals. Or the data, or the type of processing of the signal or material (for example, a pattern of signals that have been subjected to preliminary filtering or pre-processing prior to operation).

Throughout this disclosure, including the scope of the patent application, the term "system" is used broadly to mean a device, system, or subsystem. For example, a subsystem that implements a decoder may be referred to as a decoder system, and includes systems of such subsystems (eg, systems that generate X output signals in response to multiple inputs, where the subsystem produces M inputs and others The XM inputs are received from an external source) and may also be referred to as a decoder system.

Throughout this disclosure, including the scope of the patent application, the term "processor" is used broadly to mean (eg, in software or firmware) that is programmable or otherwise configurable for data (eg, A system or device that operates in audio, or video or other imaging material. Examples of processors include field programmable gate arrays (or other configurable integrated circuits or chipsets), digital signal processors that are programmed and/or otherwise configured to pipeline audio or other sound data, programmable universal A processor or computer, and a programmable microprocessor die or chipset.

Throughout this disclosure, including in the claims, the expression "audio video receiver" (or "AVR") denotes a receiver in a class of consumer electronics equipment used to control playback of audio and video content, for example in a home theater.

Throughout this disclosure, including in the claims, the expression "single-piece home theater" denotes a device which is a type of consumer electronics equipment (typically installed in a home theater system) and which includes at least one speaker (typically at least two speakers) and a subsystem for rendering audio for playback by each included speaker (or for playback by each included speaker and at least one additional speaker external to the single-piece home theater).

Throughout this disclosure, including the scope of the patent application, the terms "audio processor" and "audio processing unit" are used interchangeably and are used broadly to mean a system configured to process audio material. Examples of audio processing units include, but are not limited to, encoders (eg, transcoders), decoders, codecs, pre-processing systems, post-processing systems, and A bit stream processing system (sometimes called a bit stream processing tool).

Throughout this disclosure, including the scope of the patent application, the term "metadata" (for example, in the term "processing state metadata") refers to data that is separate and distinct from the corresponding audio material (including metadata). The audio content of the bit stream). The metadata is associated with the audio material and represents at least one feature or characteristic of the audio material (eg, which type(s) of audio material has been processed, or which type(s) should be processed, or the track of the object represented by the audio material). The contact time between the metadata and the audio data is synchronized. Thereby, the current (recently received or updated) metadata may indicate that the corresponding audio material has both the represented features and/or the results of the audio material processing of the indicated type.

Throughout this disclosure, the words "coupled" or "coupled" are used to mean a direct or indirect connection. Thus, if the first device is coupled to the second device, the connection can be through a direct connection, or through an indirect connection via other devices and connections.

Throughout this disclosure, including in the claims, the following expressions have the following definitions:

speaker and loudspeaker are used synonymously to denote any sound-emitting transducer. This definition includes loudspeakers implemented as multiple transducers (e.g., woofer and tweeter);

speaker feed: an audio signal to be applied directly to a loudspeaker, or an audio signal to be applied to an amplifier and loudspeaker in series;

channel (or "audio channel"): a monophonic audio signal. Such a signal can typically be rendered in such a way as to be equivalent to application of the signal directly to a loudspeaker at a desired or nominal position. The desired position can be static, as is typically the case with physical loudspeakers, or dynamic;

audio program: a set of one or more audio channels (at least one speaker channel and/or at least one object channel) and optionally also associated metadata (e.g., metadata that describes a desired spatial audio presentation);

speaker channel (or "speaker-feed channel"): an audio channel that is associated with a named loudspeaker (at a desired or nominal position), or with a named speaker zone within a defined speaker configuration. A speaker channel is rendered in such a way as to be equivalent to application of the audio signal directly to the named loudspeaker (at the desired or nominal position) or to a speaker in the named speaker zone;

object channel: an audio channel indicative of sound emitted by an audio source (sometimes referred to as an audio "object"). Typically, an object channel determines a parametric audio source description (e.g., metadata indicative of the parametric audio source description is included in or provided with the object channel). The source description may determine the sound emitted by the source (as a function of time), the apparent position (e.g., 3D spatial coordinates) of the source as a function of time, and optionally at least one additional parameter (e.g., apparent source size or width) characterizing the source;

object-based audio program: an audio program comprising a set of one or more object channels (and optionally also at least one speaker channel) and optionally also associated metadata (e.g., metadata indicative of a trajectory of an audio object which emits the sound indicated by an object channel, or metadata otherwise indicative of a desired spatial audio presentation of the sound indicated by an object channel, or metadata indicative of an identification of at least one audio object which is a source of the sound indicated by an object channel); and

render: the process of converting an audio program into one or more speaker feeds, or the process of converting an audio program into one or more speaker feeds and converting the speaker feed(s) to sound using one or more loudspeakers (in the latter case, rendering is sometimes referred to herein as rendering "by" the loudspeaker(s)). An audio channel can be trivially rendered ("at" a desired position) by applying the signal directly to a physical loudspeaker at the desired position, or one or more audio channels can be rendered using one of a variety of virtualization techniques designed to be substantially equivalent (for the listener) to such trivial rendering.
In this latter case, each audio channel may be converted into one or more speaker feeds to be applied to loudspeaker(s) at known locations (which are, in general, different from the desired position), such that sound emitted by the loudspeaker(s) in response to the feed(s) will be perceived as emitting from the desired position. Examples of such virtualization techniques include binaural rendering via headphones (e.g., using Dolby Headphone processing which simulates up to 7.1 channels of surround sound for the headphone wearer) and wave field synthesis.

1‧‧‧Capture unit

3‧‧‧Production unit

3A‧‧‧buffer

5‧‧‧Delivery subsystem

7‧‧‧Decoder

7A‧‧‧buffer

9‧‧‧ Object Processing Subsystem

10‧‧‧ Controller

11‧‧‧Rendering subsystem

50-53‧‧‧Transformation

B1-BN‧‧‧Substream

59‧‧‧Bitstream synchronization stage

T1-TN‧‧‧ sync word

A1-AN‧‧‧Speaker Channel Audio Content

60‧‧‧Decoder

61‧‧‧Decoder

63‧‧‧Decoder

66‧‧‧ metadata combiner

67‧‧‧Object processing and rendering subsystem

68‧‧‧ Controller

23‧‧‧ Controller

24‧‧‧Rendering subsystem

100‧‧‧ microphone

101‧‧‧ microphone

102‧‧‧ microphone

103‧‧‧Microphone

104‧‧‧Audio console

106‧‧‧ Object Processing Subsystem

108‧‧‧Embedded subsystem

110‧‧‧Distribution Encoder

N0-N6‧‧‧ object channel

200‧‧‧ object processor

202‧‧‧ Transcoder

204‧‧‧Decoder

210‧‧‧ metadata generation subsystem

211‧‧‧ Analog Subsystem

212‧‧‧Mezzanine encoder

213‧‧‧Mezzanine decoder

214‧‧‧Encoder

215‧‧‧Decoder

216‧‧‧Renderer

Figure 1 is a diagram of an AC-3 frame, including the segments into which it is divided.

Figure 2 is a diagram of the Synchronization Information (SI) segment of an AC-3 frame, including the segments into which it is divided.

Figure 3 is a diagram of the Bitstream Information (BSI) segment of an AC-3 frame, including the segments into which it is divided.

Figure 4 is a diagram of an E-AC-3 frame, including the segments into which it is divided.

Figure 5 is a block diagram of an embodiment of a system which may be configured to perform an embodiment of the inventive method.

Figure 6 is a block diagram of a playback system configured in accordance with an embodiment of the present invention.

Figure 7 is a block diagram of a playback system configured in accordance with another embodiment of the present invention.

Figure 8 is a block diagram of a broadcast system configured to generate an object-based audio program (and a corresponding video program) in accordance with an embodiment of the present invention.

Figure 9 is a diagram of relationships between object channels of an embodiment of the inventive program, indicating which subsets of the object channels are selectable by the user.

Figure 10 is a block diagram of a system embodying an embodiment of the present invention.

Figure 11 is a diagram of the content of an object-based audio program generated in accordance with an embodiment of the present invention.

In some embodiments, the invention is a method and system for delivering object-based audio for broadcast, including an improved rendering process in which a consumer can interactively control aspects of the rendered program, and typically also an improved live broadcast workflow and/or an improved post-production workflow.

Figure 5 is a block diagram of an example of an audio processing chain (audio data processing system), in which one or more of the elements of the system may be configured in accordance with an embodiment of the present invention. The system includes the following elements, coupled together as shown: capture unit 1, production unit 3 (which includes an encoding subsystem), delivery subsystem 5, decoder 7, object processing subsystem 9, controller 10, and rendering subsystem 11. In variations on the system shown, one or more of the elements are omitted, or additional audio data processing units are included. Typically, elements 7, 9, 10, and 11 are included in a playback system (e.g., the end user's home theater system).

Capture unit 1 is typically configured to generate PCM (time-domain) samples comprising audio content, and to output the PCM samples. The samples may be indicative of multiple streams of audio captured by microphones (e.g., at a sporting event or other spectator event). Production unit 3, typically operated by a broadcaster, is configured to accept the PCM samples as input and to output an object-based audio program indicative of the audio content. The program typically is or includes an encoded (e.g., compressed) audio bitstream indicative of at least some of the audio content (sometimes referred to herein as a "main mix"), and optionally also at least one additional bitstream or file indicative of some of the audio content (sometimes referred to herein as a "side mix"). The data of the encoded bitstream (and of each generated side mix, if any) which are indicative of the audio content are sometimes referred to herein as "audio data." If the encoding subsystem of production unit 3 is configured in accordance with a typical embodiment of the invention, the object-based audio program output from unit 3 is indicative of (i.e., includes) multiple speaker channels of audio data (a "group" of speaker channels), multiple object channels of audio data, and object-related metadata. The program may include a main mix which in turn includes audio content indicative of a group of speaker channels, audio content indicative of at least one user-selectable object channel (and optionally at least one other object channel), and object-related metadata associated with each object channel. The program may also include at least one side mix which includes audio content indicative of at least one other object channel (e.g., at least one user-selectable object channel) and/or object-related metadata. The object-related metadata of the program may include persistent metadata (to be described below). The program (e.g., its main mix) may be indicative of one speaker channel, or a group of speaker channels, or no speaker channels. For example, the main mix may be indicative of two or more groups of speaker channels (e.g., a 5.1 channel neutral crowd noise group, a 2.0 channel home team crowd noise group, and a 2.0 channel away team crowd noise group), including at least one user-selectable group (which may be selected using the same user interface employed for user selection of object channel content) and a default group (which will be rendered in the absence of a user selection of another group). The default group may be determined by data indicative of the configuration (e.g., the initial configuration) of the speaker set of the playback system, and optionally the user may select another group to be rendered in place of the default group.

The transmission subsystem 5 of Fig. 5 is configured to store and/or transmit (e.g., broadcast) the program generated by unit 3 (e.g., the main mix and each of its side mixes, if any side mixes are generated).

In some embodiments, subsystem 5 is implemented to deliver an object-based audio program in which the audio objects (and at least some of the corresponding object-related metadata) are transmitted through a broadcast system (in the program's main mix, indicated by an audio bitstream broadcast thereby), while at least some object-related metadata of the program (e.g., metadata indicating constraints on imaging or mixing of the program's object channels) and/or at least one object channel of the program (e.g., a "side mix" of the main mix) are delivered in another manner (e.g., the side mix is sent to a specific end user over an Internet Protocol or "IP" network). Alternatively, the end user's decoding and/or imaging system is preconfigured with at least some object-related metadata (e.g., metadata indicating constraints on imaging or mixing of audio objects of an embodiment of the inventive object-based audio program), and such object-related metadata is not broadcast or otherwise delivered (by subsystem 5) with the corresponding object channels (in the main mix or a side mix of the object-based audio program).

In some embodiments, the timing and synchronization of portions or elements of an object-based audio program which are delivered over separate paths (e.g., a main mix broadcast over a broadcast system, and related metadata sent as a side mix over an IP network) is provided by synchronization words (e.g., time codes) sent over all of the delivery paths (e.g., in the main mix and in each corresponding side mix).
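
As a rough illustration of this synchronization scheme, the sketch below pairs main-mix frames with side-mix frames that carry the same time code. The frame representation and time-code format are assumptions made for the example, not the program's actual syntax.

```python
def align_by_timecode(main_frames, side_frames):
    """Pair main-mix frames with side-mix frames sharing a time code.

    Each frame is assumed to be a (timecode, payload) tuple; frames whose
    time code appears on only one path are dropped (or could be buffered
    until the matching frame arrives over the slower path).
    """
    side_by_tc = {tc: payload for tc, payload in side_frames}
    aligned = []
    for tc, main_payload in main_frames:
        if tc in side_by_tc:
            aligned.append((tc, main_payload, side_by_tc[tc]))
    return aligned

# Example: the side mix (e.g. arriving over IP) lags by one frame.
main = [(100, "mainA"), (101, "mainB"), (102, "mainC")]
side = [(101, "metaB"), (102, "metaC"), (103, "metaD")]
print(align_by_timecode(main, side))  # frames 101 and 102 are paired
```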

Referring again to Fig. 5, decoder 7 accepts (receives or reads) the program (or at least one bitstream or other element of the program) delivered by transmission subsystem 5, and decodes the program (or each accepted element thereof). In some embodiments of the invention, the program includes a main mix (an encoded bitstream, e.g., an AC-3 or E-AC-3 encoded bitstream) and at least one side mix of the main mix, and decoder 7 receives and decodes the main mix (and optionally also at least one side mix). Optionally, at least one side mix of the program (e.g., an object channel) which does not need to be decoded is delivered by subsystem 5 directly to object processing subsystem 9. If decoder 7 is configured in accordance with a typical embodiment of the invention, the output of decoder 7 in typical operation includes the following: streams of audio samples indicating the program's speaker channel group, and corresponding streams of audio samples indicating object channels (e.g., user-selectable audio object channels) of the program, together with streams of object-related metadata.

The object processing subsystem 9 is coupled to receive (from decoder 7) the decoded speaker channels, object channels, and object-related metadata of the delivered program, and optionally also at least one side mix of the program (indicating at least one other object channel). For example, subsystem 9 may receive (from decoder 7) audio samples of the program's speaker channels and of at least one object channel of the program, and object-related metadata of the program, and may also receive (from transmission subsystem 5) audio samples of at least one other object channel which has not undergone decoding in decoder 7.

Subsystem 9 is coupled and configured to output to imaging subsystem 11 a selected subset of the full set of object channels indicated by the program, and corresponding object-related metadata. Subsystem 9 is typically also configured to pass through unchanged (to subsystem 11) the decoded speaker channels from decoder 7, and may be configured to process at least some of the object channels (and/or metadata) asserted to it, to generate the object channels and metadata it asserts to subsystem 11. The object channel selection performed by subsystem 9 is typically determined by user selections (as indicated by control data asserted to subsystem 9 from controller 10) and/or by rules (e.g., rules indicating conditions and/or constraints) which subsystem 9 has been programmed or otherwise configured to implement. Such rules may be determined by object-related metadata of the program, and/or by other data asserted to subsystem 9 (e.g., from controller 10 or another external source; for example, data indicating the capabilities and organization of the playback system's speaker array), and/or by preconfiguring (e.g., programming) subsystem 9. In some embodiments, controller 10 (via a user interface implemented by controller 10) provides (e.g., displays on a touch screen) to the user a menu or palette of selectable "preset" mixes of objects and "group" speaker channel content. The selectable preset mixes may be determined by object-related metadata of the program, and typically also by rules implemented by subsystem 9 (e.g., rules which subsystem 9 has been preconfigured to implement). The user selects from among the selectable mixes by entering commands to controller 10 (e.g., by actuating a touch screen thereof), and in response, controller 10 asserts corresponding control data to subsystem 9.
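
The selection logic described above (user control data from the controller, combined with rules determined by metadata or by preconfiguration of the subsystem) might be sketched as follows. The rule and object representations are hypothetical.

```python
def select_object_channels(available, control_data, rules):
    """Choose the subset of object channels to pass to the imaging stage.

    available    -- dict: object_id -> object channel data
    control_data -- object_ids the user selected via the controller
    rules        -- predicates the playback system is configured with,
                    e.g. drop objects the speaker array cannot image
    """
    selected = {}
    for oid in control_data:
        obj = available.get(oid)
        if obj is None:
            continue                      # not present in this program
        if all(rule(oid, obj) for rule in rules):
            selected[oid] = obj
    return selected

# Example rule: a stereo playback system refuses objects that need a
# larger speaker array (the "min_speakers" field is an assumption).
def stereo_capable(oid, obj):
    return obj.get("min_speakers", 2) <= 2

available = {"home_crowd": {"min_speakers": 2},
             "effects_4x": {"min_speakers": 5}}
print(select_object_channels(available, ["home_crowd", "effects_4x"],
                             [stereo_capable]))  # only home_crowd survives
```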

The imaging subsystem 11 of Fig. 5 is configured to image the audio content determined by the output of subsystem 9 for playback by the speakers (not shown) of the playback system. Subsystem 11 is configured to map the audio objects determined by the object channels selected by object processing subsystem 9 (e.g., preset objects, and/or user-selected objects which have been selected as a result of user interaction with controller 10) to the available speaker channels, using imaging parameters (e.g., user-selected and/or preset values of spatial position and level) output from subsystem 9 in association with each selected object. At least some of the imaging parameters are determined by the object-related metadata output from subsystem 9. Imaging subsystem 11 also receives the speaker channel group passed through by subsystem 9. Typically, subsystem 11 is an intelligent mixer, and is configured to determine speaker feeds for the available speakers, including by mapping one or more selected (e.g., preset-selected) objects to each of a number of individual speaker channels, and mixing the objects with the "group" audio content indicated by each corresponding speaker channel of the program's speaker channel group.

Figure 6 is a block diagram of an embodiment of the inventive playback system, which includes decoder 20, object processing subsystem 22, spatial imaging subsystem 24, controller 23 (which implements a user interface), and optionally also digital audio processing subsystems 25, 26, and 27, coupled as shown. In some implementations, elements 20, 22, 24, 25, 26, 27, 29, 31, and 33 of the Fig. 6 system are implemented as a set-top device.

In the Fig. 6 system, decoder 20 is configured to receive and decode an encoded signal indicating an object-based audio program (or the main mix of an object-based audio program). The program (e.g., the program's main mix) indicates audio content including at least two speaker channels (i.e., a "group" of at least two speaker channels). The program also indicates at least one user-selectable object channel (and optionally at least one other object channel), and object-related metadata corresponding to each object channel. Each object channel indicates an audio object, and thus object channels are sometimes referred to herein as "objects" for convenience. In one embodiment, the program is (or includes a main mix which is) an AC-3 or E-AC-3 bitstream indicating audio objects, object-related metadata, and a speaker channel group. Typically, the individual audio objects are either mono or stereo coded (i.e., each object channel indicates a left or right channel of an object, or a mono channel indicating an object), the group is a traditional 5.1 mix, and decoder 20 may be configured to decode up to 16 channels of audio content simultaneously (including the six speaker channels of the group, and up to ten object channels). The input E-AC-3 (or AC-3) bitstream may indicate more than ten audio objects, since not all of them need be decoded to achieve a specific mix.

In some embodiments of the inventive playback system, each frame of an input E-AC-3 (or AC-3) encoded bitstream includes one or two metadata "containers." The input bitstream indicates an object-based audio program, or the main mix of such a program, and the speaker channels of the program are organized as the audio content of a conventional E-AC-3 (or AC-3) bitstream. One container can be included in the Aux field of the frame, and another container can be included in the addbsi field of the frame. Each container has a core header and includes (or is associated with) one or more payloads. One such payload (of, or associated with, a container included in the Aux field) may be a set of audio samples of each of one or more of the inventive object channels (related to the speaker channel group which is also indicated by the program) and the object-related metadata associated with each object channel. In such a payload, the samples of some or all of the object channels (and associated metadata) may be organized as standard E-AC-3 (or AC-3) frames, or may be otherwise organized (e.g., they may be included in a side mix distinct from the E-AC-3 or AC-3 bitstream). Another example of such a payload (of, or associated with, a container included in the addbsi field or the Aux field) is a set of loudness processing state metadata associated with the audio content of the frame.

In some such embodiments, the decoder (e.g., decoder 20 of Fig. 6) would parse the core header of a container in the Aux field, and extract the inventive object channels and associated metadata from the container (e.g., from the Aux field of the AC-3 or E-AC-3 frame) and/or from the location (e.g., a side mix) indicated by the core header. After extracting the payload (object channels and associated metadata), the decoder would perform any necessary decoding on the extracted payload.

The core header of each container typically includes: at least one ID value indicating the type of payload(s) included in or associated with the container; substream association indications (indicating with which substreams the core header is associated); and protection bits. Such protection bits (which may consist of or include a hash-based message authentication code or "HMAC") would typically be useful for at least one of decryption, authentication, or validation of the object-related metadata and/or loudness processing state metadata (and optionally also other metadata) included in at least one payload included in or associated with the container, and/or of corresponding audio data included in the frame. Substreams may be located "in band" (in the E-AC-3 or AC-3 bitstream) or "out of band" (e.g., in a side mix bitstream separate from the E-AC-3 or AC-3 bitstream). One type of such payload is a set of audio samples of each of one or more of the inventive object channels (related to the speaker channel group which is also indicated by the program) and the object-related metadata associated with each object channel. Each object channel is a separate substream, and would typically be identified in the core header. Another type of payload is loudness processing state metadata.

Typically, each payload has its own header (or "payload identifier"). Object-level metadata may be carried in each substream which is an object channel. Program-level metadata may be included in the core header of the container and/or in the header of the payload which is a set of audio samples of one or more of the inventive object channels (and the metadata associated with each object channel).

In some embodiments, each container in the auxdata (or addbsi) field of a frame has a three-level structure: a high-level structure, including a flag indicating whether the auxdata (or addbsi) field includes metadata (where "metadata" in this context denotes the inventive object channels, the inventive object-related metadata, and any other audio content or metadata carried by the bitstream but not conventionally carried in a conventional E-AC-3 or AC-3 bitstream lacking containers of the described type), at least one ID value indicating what type(s) of metadata are present, and typically also a value indicating how many bits of metadata (e.g., of each type) are present (if any metadata are present); an intermediate-level structure, comprising a core element for each identified type of metadata (e.g., for each identified type of metadata, a core header, protection values, and payload ID and payload size values of the types noted above); and a low-level structure, comprising each payload for one core element, if at least one such payload is identified by the core element as being present. In this context, one example of such a "type" of metadata is the inventive object channel data and associated object-related metadata (i.e., a set of audio samples of each of one or more object channels (related to the speaker channel group which is also indicated by the program) and the metadata associated with each object channel). One example of such a payload is a set of audio samples of each of one or more object channels (related to the speaker channel group which is also indicated by the program) and the metadata associated with each object channel. Another example of such a payload is a payload comprising loudness processing state metadata ("LPSM"), sometimes referred to as an LPSM payload.

The data values in such a three-level structure can be nested. For example, the protection value(s) for a payload (e.g., an LPSM payload) identified by a core element can be included after the payload identified by the core element (and thus after the core header of the core element). In one example, a core header could identify a first payload (e.g., an LPSM payload) and another payload; the payload ID and payload size values for the first payload could follow the core header; the first payload itself could follow these ID and size values; the payload ID and payload size values for the second payload could follow the first payload; the second payload itself could follow these ID and size values; and the protection value(s) for either or both of the payloads (or for core element values and either or both of the payloads) could follow the last payload.
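
A parser for a container following this three-level layout might look like the sketch below. The byte widths and field ordering are invented for illustration; the actual E-AC-3 container syntax differs.

```python
import struct

def parse_container(buf):
    """Parse a toy container: flag and count, then (payload_id, size, data)
    records, then a trailing nested protection value.

    Assumed toy layout (not the real syntax):
      1 byte  metadata-present flag
      1 byte  number of payloads
      per payload: 1 byte payload ID, 2 bytes size (big endian), data
      2 bytes protection value after the last payload
    """
    flag, count = buf[0], buf[1]
    if not flag:
        return [], None
    offset, payloads = 2, []
    for _ in range(count):
        pid = buf[offset]
        size = struct.unpack_from(">H", buf, offset + 1)[0]
        data = buf[offset + 3 : offset + 3 + size]
        payloads.append((pid, data))
        offset += 3 + size
    protection = struct.unpack_from(">H", buf, offset)[0]  # nested last
    return payloads, protection

# Example: one object-channel payload (ID 0x01) and one LPSM payload (0x02).
raw = (bytes([1, 2, 0x01, 0, 3]) + b"abc"
       + bytes([0x02, 0, 2]) + b"xy" + b"\x12\x34")
print(parse_container(raw))  # ([(1, b'abc'), (2, b'xy')], 0x1234)
```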

Referring again to Fig. 6, the user employs controller 23 to select objects to be imaged (of those indicated by the object-based audio program). Controller 23 may be a handheld processing device (e.g., an iPad) programmed to implement a user interface (e.g., an iPad app) compatible with the other elements of the Fig. 6 system. The user interface may provide (e.g., display on a touch screen) to the user a menu or palette of selectable "preset" mixes of objects and "group" speaker channel content. The selectable preset mixes may be determined by object-related metadata of the program, and typically also by rules implemented by subsystem 22 (e.g., rules which subsystem 22 has been preconfigured to implement). The user selects from among the selectable mixes by entering commands to controller 23 (e.g., by actuating a touch screen thereof), and in response, controller 23 asserts corresponding control data to subsystem 22.

Decoder 20 decodes the speaker channels of the program's speaker channel group, and outputs the decoded speaker channels to subsystem 22. In response to the object-based audio program, and in response to control data from controller 23 indicating a selected subset of the program's full set of object channels to be imaged, decoder 20 decodes (if necessary) the selected object channels, and outputs to subsystem 22 the selected (e.g., decoded) object channels (each of which may be a pulse code modulated or "PCM" bitstream), and object-related metadata corresponding to the selected object channels.

The objects indicated by the decoded object channels typically are or include user-selectable audio objects. For example, as shown in Fig. 6, the decoder may extract: a 5.1 speaker channel group; an object channel ("Comment-1 mono") indicating commentary by an announcer from the home team's city; an object channel ("Comment-2 mono") indicating commentary by an announcer from the visiting team's city; an object channel ("Fans (home)") indicating crowd noise from fans of the home team present at a sporting event; left and right object channels ("Ball sound stereo") indicating the sound produced by a game ball as it is struck by sporting event participants; and four object channels ("Effects 4x mono") indicating special effects. Any of the "Comment-1 mono", "Comment-2 mono", "Fans (home)", "Ball sound stereo", and "Effects 4x mono" object channels may be selected (after undergoing any necessary decoding in decoder 20), and each selected one of them would be passed from subsystem 22 to imaging subsystem 24.

The inputs to object processing subsystem 22 are the decoded speaker channels, decoded object channels, and decoded object-related metadata from decoder 20, and optionally also external audio object channels asserted to the system (e.g., one or more side mixes of a program whose main mix is asserted to decoder 20). Examples of objects indicated by such external audio object channels include a local commentator (e.g., mono audio content delivered by a radio channel), a dialed-in Skype call, a Twitter feed (converted via a text-to-speech system, not shown in Fig. 6), and system sounds.

Subsystem 22 is configured to output a selected subset of the full set of object channels indicated by the program, and the program's corresponding object-related metadata. The object selection may be determined by user selections (as indicated by control data asserted to subsystem 22 from controller 23) and/or by rules (e.g., rules indicating conditions and/or constraints) which subsystem 22 has been programmed or otherwise configured to implement. Such rules may be determined by object-related metadata of the program, and/or by other data asserted to subsystem 22 (e.g., from controller 23 or another external source; for example, data indicating the capabilities and organization of the playback system's speaker array), and/or by preconfiguring (e.g., programming) subsystem 22. In some embodiments, the object-related metadata provides a set of selectable "preset" mixes of objects and "group" speaker channel content. Subsystem 22 typically passes through unchanged (to subsystem 24) the decoded speaker channels from decoder 20, and processes selected ones of the object channels asserted to it.

The object processing (including object selection) performed by subsystem 22 is typically controlled by control data from controller 23 and by object-related metadata from decoder 20 (and optionally also object-related metadata of side mixes asserted to subsystem 22 other than from decoder 20), and typically includes determination of a spatial position and a level for each selected object (regardless of whether the object selection results from user selection or from application of a rule). Typically, default spatial positions and default levels for imaging objects, and optionally also constraints on user selection of objects and of their spatial positions and levels, are included in the object-related metadata asserted (e.g., from decoder 20) to subsystem 22. Such constraints may indicate forbidden combinations of selected objects, or forbidden spatial positions at which selected objects may be imaged (e.g., to prevent selected objects from being imaged too close to one another). In addition, the loudness of individual selected objects is typically controlled by object processing subsystem 22 in response to control data entered using controller 23, and/or by default levels indicated by object-related metadata (e.g., from decoder 20), and/or by preconfiguration of subsystem 22.
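
A sketch of how such metadata-driven constraints might be enforced follows; the constraint encoding (level ranges and minimum-separation triples) is an assumption made for the example.

```python
import math

def enforce_constraints(selected, metadata):
    """Apply metadata-driven imaging constraints to selected objects.

    selected -- dict: object_id -> {"pos": (x, y, z), "level_db": float}
    metadata -- may clamp levels and pin minimum separations
    """
    # Clamp each object's level to the range allowed by its metadata.
    for oid, obj in selected.items():
        lo, hi = metadata.get(oid, {}).get("level_range_db", (-20.0, 6.0))
        obj["level_db"] = max(lo, min(hi, obj["level_db"]))

    # Enforce a minimum separation between pairs that must not be imaged
    # too close together (e.g. two announcer objects).
    for a, b, min_dist in metadata.get("min_separation", []):
        if a in selected and b in selected:
            pa, pb = selected[a]["pos"], selected[b]["pos"]
            dist = math.dist(pa, pb)
            if dist < min_dist:
                # Push object b away from a along the line between them.
                scale = min_dist / max(dist, 1e-9)
                selected[b]["pos"] = tuple(
                    pa[i] + (pb[i] - pa[i]) * scale for i in range(3))
    return selected

selected = {"comm1": {"pos": (0.0, 0.0, 0.0), "level_db": 10.0},
            "comm2": {"pos": (0.1, 0.0, 0.0), "level_db": 0.0}}
meta = {"comm1": {"level_range_db": (-20.0, 6.0)},
        "min_separation": [("comm1", "comm2", 1.0)]}
print(enforce_constraints(selected, meta))  # comm1 clamped, comm2 moved
```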

Typically, the decoding performed by decoder 20 includes extraction (from the input program) of metadata indicating the type of audio content of each object indicated by the program (e.g., the type of sporting event indicated by the program's audio content, and names or other identifying marks (e.g., team logos) of selectable and preset objects indicated by the program). Controller 23 and object processing subsystem 22 receive this metadata, or relevant information indicated by the metadata. Typically, controller 23 also receives (e.g., is programmed with) information regarding the playback capabilities of the user's audio system (e.g., the number of speakers, and an assumed placement or other assumed organization of the speakers).

The spatial imaging subsystem 24 of Fig. 6 (or subsystem 24 together with at least one downstream device or system) is configured to image the audio content output from subsystem 22 for playback by the speakers of the user's playback system. One or more of the optionally included digital audio processing subsystems 25, 26, and 27 may perform post-processing on the output of subsystem 24.

The spatial imaging subsystem 24 is configured to map the audio object channels selected by object processing subsystem 22 (e.g., preset-selected objects, and/or user-selected objects which have been selected as a result of user interaction using controller 23) to the available speaker channels, using the imaging parameters (e.g., user-selected and/or preset values of spatial position and level) output from subsystem 22 in association with each selected object. The spatial imaging subsystem 24 also receives the decoded speaker channel group passed through by subsystem 22. Typically, subsystem 24 is an intelligent mixer, and is configured to determine speaker feeds for the available speakers, including by mapping one, two, or more than two selected object channels to each of a number of individual speaker channels, and mixing the selected object channels with the "group" audio content indicated by each corresponding speaker channel of the program's speaker channel group.
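
A minimal version of such a mixer, with precomputed per-speaker gains standing in for a real panning law, could look like the following sketch; the gain computation is a placeholder, not the actual imaging algorithm.

```python
def mix_speaker_feeds(bed, objects, num_speakers):
    """Mix a decoded speaker channel group with selected object channels.

    bed     -- list of num_speakers PCM sample lists (the "group" content)
    objects -- list of (samples, gains) pairs, where gains has one linear
               gain per speaker (derived upstream from position metadata)
    """
    n = len(bed[0])
    feeds = [list(ch) for ch in bed]        # start from the bed content
    for samples, gains in objects:
        assert len(gains) == num_speakers
        for s in range(num_speakers):
            g = gains[s]
            if g == 0.0:
                continue
            for i in range(min(n, len(samples))):
                feeds[s][i] += g * samples[i]
    return feeds

# Example: a mono commentary object panned mostly to the center speaker
# of a hypothetical 3-speaker (L, C, R) layout.
bed = [[0.0] * 4 for _ in range(3)]
comm = ([0.5, 0.5, 0.5, 0.5], [0.2, 0.9, 0.2])
print(mix_speaker_feeds(bed, [comm], 3))
```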

The number of output speaker channels may vary between 2.0 and 7.1, and the speakers driven to image the selected audio object channels (in a mix with the "group" audio content) may be assumed to be located in a (nominal) horizontal plane of the playback environment. In such cases, imaging is performed such that the speakers can be driven to emit sound that will be perceived as emitting from distinct object positions in the plane of the speakers (i.e., one object position, or a sequence of object positions along a trajectory, for each selected or preset object), mixed with sound determined by the "group" audio content.

In some embodiments, the number of full-range speakers driven to image the audio may be any number in a wide range (not necessarily limited to the range from 2 to 7), and thus the number of output speaker channels is not limited to the range from 2.0 to 7.1.

In some embodiments, the speakers driven to image the audio may be assumed to be located anywhere in the playback environment, not only in a (nominal) horizontal plane. In some such cases, metadata included in the program indicates imaging parameters for imaging at least one object of the program at any apparent spatial location (in a three-dimensional volume), using a three-dimensional array of speakers. For example, an object channel may have corresponding metadata indicating a three-dimensional trajectory of apparent spatial positions at which the object (indicated by the object channel) is to be imaged. The trajectory may include a sequence of "floor" positions (in the plane of a subset of the speakers assumed to be located on the floor, or in another horizontal plane, of the playback environment), and a sequence of "above-floor" positions (each determined by driving a subset of the speakers assumed to be located in at least one other horizontal plane of the playback environment). In such cases, imaging can be performed in accordance with the invention such that the speakers can be driven to emit sound (determined by the relevant object channel) that will be perceived as emitting from a sequence of object positions in three-dimensional space which includes the trajectory, mixed with sound determined by the "group" audio content. Subsystem 24 may be configured to implement such imaging, or steps thereof, with the remaining steps of the imaging performed by a downstream system or device (e.g., imaging subsystem 35 of Fig. 6).
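
Imaging an object along such a trajectory amounts to interpolating its apparent position over time. The sketch below uses an invented keyframe format (time, position) and simple linear interpolation.

```python
def position_at(trajectory, t):
    """Linearly interpolate an object's 3-D position at time t.

    trajectory -- list of (time, (x, y, z)) keyframes, sorted by time,
                  e.g. mixing "floor" and "above-floor" positions
    """
    if t <= trajectory[0][0]:
        return trajectory[0][1]
    for (t0, p0), (t1, p1) in zip(trajectory, trajectory[1:]):
        if t0 <= t <= t1:
            a = (t - t0) / (t1 - t0)
            return tuple(p0[i] + a * (p1[i] - p0[i]) for i in range(3))
    return trajectory[-1][1]

# Example: an object rising from the floor plane (z = 0) to an
# above-floor position (z = 1) over two seconds.
traj = [(0.0, (0.0, 0.0, 0.0)), (2.0, (1.0, 0.5, 1.0))]
print(position_at(traj, 1.0))  # -> (0.5, 0.25, 0.5)
```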

Optionally, a digital audio processing (DAP) stage (e.g., one for each of a number of predetermined output speaker channel configurations) is coupled to the output of spatial imaging subsystem 24, to perform post-processing on the output of the spatial imaging subsystem. Examples of such processing include intelligent equalization or (in the case of a stereo output) speaker virtualization processing.

The output of the Fig. 6 system (e.g., the output of the spatial imaging subsystem, or of a DAP stage following the spatial imaging stage) may be PCM bitstreams (which determine speaker feeds for the available speakers). For example, in the case that the user's playback system includes a 7.1 speaker array, the system may output PCM bitstreams (generated in subsystem 24) which determine speaker feeds for the speakers of such an array, or a post-processed version of such bitstreams (generated in DAP 25). For another example, in the case that the user's playback system includes a 5.1 speaker array, the system may output PCM bitstreams (generated in subsystem 24) which determine speaker feeds for the speakers of such an array, or a post-processed version of such bitstreams (generated in DAP 26). For another example, in the case that the user's playback system includes only left and right speakers, the system may output PCM bitstreams (generated in subsystem 24) which determine speaker feeds for the left and right speakers, or a post-processed version of such bitstreams (generated in DAP 27).

The Fig. 6 system optionally also includes one or both of re-encoding subsystems 31 and 33. Re-encoding subsystem 31 is configured to re-encode the PCM bitstream (indicating feeds for a 7.1 speaker array) output from DAP 25 as an E-AC-3 encoded bitstream, and the resulting encoded (compressed) E-AC-3 bitstream may be output from the system. Re-encoding subsystem 33 is configured to re-encode the PCM bitstream (indicating feeds for a 5.1 speaker array) output from DAP 26 as an AC-3 or E-AC-3 encoded bitstream, and the resulting encoded (compressed) AC-3 or E-AC-3 bitstream may be output from the system.

The Fig. 6 system optionally also includes re-encoding (or formatting) subsystem 29, and downstream imaging subsystem 35 coupled to receive the output of subsystem 29. Subsystem 29 is coupled to receive data (output from subsystem 22) indicating the selected audio objects (or a preset mix of audio objects), the corresponding object-related metadata, and the speaker channel group, and is configured to re-encode (and/or format) such data for imaging by subsystem 35. Subsystem 35, which may be implemented in an AVR or a single-piece home theater (or in another system or device downstream from subsystem 29), is configured to generate speaker feeds (or bitstreams which determine speaker feeds) for the available playback speakers (speaker array 36), in response to the output of subsystem 29. For example, subsystem 29 may be configured to generate encoded audio by re-encoding the data indicating the selected (or preset) audio objects, the corresponding metadata, and the speaker channel group into a format suitable for imaging in subsystem 35, and to transmit the encoded audio (e.g., over an HDMI link) to subsystem 35. In response to the speaker feeds generated by subsystem 35 (or determined by the output of subsystem 35), the available speakers 36 would emit sound indicating a mix of the speaker channel group and the selected (or preset) objects, with the objects having apparent source positions determined by the object-related metadata of subsystem 29's output. When subsystems 29 and 35 are included, imaging subsystem 24 is optionally omitted from the system.

In some embodiments, the present invention is a distributed system for imaging object-based audio, in which a portion of the imaging (i.e., at least one step thereof; e.g., selection of the audio objects to be imaged and selection of imaging characteristics of each selected object, as performed by subsystem 22 and controller 23 of the Fig. 6 system) is implemented in a first subsystem (e.g., elements 20, 22, and 23 of Fig. 6, implemented in a set-top device, or in a set-top device and a handheld controller), and another portion of the imaging (e.g., immersive imaging, in which speaker feeds, or signals which determine speaker feeds, are generated in response to the output of the first subsystem) is implemented in a second subsystem (e.g., subsystem 35, implemented in an AVR or single-piece home theater). Some embodiments which provide distributed imaging also implement latency management, to account for the different times at which different subsystems perform portions of the audio imaging (and any processing of video corresponding to the imaged audio).

In some embodiments of the inventive playback system, each decoder and object processing subsystem (sometimes referred to together as a personalization engine) is implemented in a set-top device (STB). For example, elements 20 and 22 of Fig. 6, and/or all elements of the Fig. 7 system, may be implemented in an STB. In some embodiments of the inventive playback system, multiple imagings are performed on the output of the personalization engine, to ensure that all STB outputs (e.g., HDMI, S/PDIF, and the stereo analog outputs of the STB) are enabled. Optionally, selected object channels (and corresponding object-related metadata) are passed on from the STB (together with a decoded speaker channel group) to a downstream device (e.g., an AVR or single-piece home theater) configured to image the mix of the object channels and the speaker channel group.

In a class of embodiments, the inventive object-based audio program includes a set of bitstreams (multiple bitstreams, which may be referred to as "substreams") which are generated and transmitted in parallel. In some embodiments in this class, multiple decoders are employed to decode the content of the substreams (e.g., the program includes multiple E-AC-3 substreams, and the playback system employs multiple E-AC-3 decoders to decode the content of the substreams). Figure 7 is a block diagram of a playback system configured to decode and image an embodiment of the inventive object-based audio program which comprises multiple bitstreams delivered in parallel.

The playback system of Figure 7 is a variation on the Fig. 6 system, in which the object-based audio program includes multiple bitstreams (B1, B2, ..., BN, where N is some positive integer) which are transmitted in parallel to the playback system and received thereby. Each of the bitstreams ("substreams") B1, B2, ..., and BN is a serial bitstream which includes time codes or other synchronization words (referred to as "sync words" in Fig. 7, for convenience) to allow the substreams to be synchronized or time-aligned with each other. Each substream also includes a different subset of the full set of object channels and corresponding object-related metadata, and at least one of the substreams includes the speaker channel group. For example, in each of the substreams B1, B2, ..., BN, each container which includes object channel content and object-related metadata includes a unique ID or time stamp.

The Fig. 7 system includes N deformatters 50, 51, ..., 53, each coupled and configured to parse a different one of the input substreams, and to assert its metadata (including its sync words) and its audio content to bitstream synchronization stage 59.

Deformatter 50 is configured to parse substream B1, and to assert to bitstream synchronization stage 59 its sync words (T1), its other metadata and object channel content (M1) (including object-related metadata and at least one object channel of the program), and its speaker channel audio content (A1) (including at least one speaker channel of the program's group). Similarly, deformatter 51 is configured to parse substream B2, and to assert to bitstream synchronization stage 59 its sync words (T2), its other metadata and object channel content (M2) (including object-related metadata and at least one object channel of the program), and its speaker channel audio content (A2) (including at least one speaker channel of the program's group). Similarly, deformatter 53 is configured to parse substream BN, and to assert to bitstream synchronization stage 59 its sync words (TN), its other metadata and object channel content (MN) (including object-related metadata and at least one object channel of the program), and its speaker channel audio content (AN) (including at least one speaker channel of the program's group).

The bitstream synchronization stage 59 of the Fig. 7 system typically includes buffers for the audio content and metadata of substreams B1, B2, ..., BN, and a stream offset compensation element which is coupled and configured to use the sync words of each of the substreams to determine any misalignment of data in the input substreams (e.g., which may occur because each bitstream is typically carried over an independent interface and/or as a track within a media file, and tight synchronization between them may be lost in distribution/contribution). The stream offset compensation element of stage 59 is typically also configured to correct any determined misalignment by asserting appropriate control values to the buffers containing the audio data and metadata of the bitstreams, to cause time-aligned bits of the speaker channel audio data to be read from the buffers to decoders (including decoders 60, 61, and 63), each coupled to a corresponding one of the buffers, and to cause time-aligned bits of the object channel audio data and metadata to be read from the buffers to object data combining stage 66.
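
The offset-compensation step might be approximated as follows: locate each substream's first occurrence of a common sync word (time code) and discard the leading frames, so that all buffers begin at the same instant. The buffer layout here is hypothetical.

```python
def compensate_offsets(substreams):
    """Time-align parallel substreams using their sync words.

    substreams -- list of dicts with:
        "sync":  list of time codes, one per frame
        "audio": list of frames of speaker/object audio data
    Frames preceding the latest common starting time code are dropped.
    """
    common_start = max(s["sync"][0] for s in substreams)
    aligned = []
    for s in substreams:
        k = s["sync"].index(common_start)   # misalignment, in frames
        aligned.append({"sync": s["sync"][k:], "audio": s["audio"][k:]})
    return aligned

# Example: substream B2 arrived two frames late relative to B1.
b1 = {"sync": [10, 11, 12, 13], "audio": ["a", "b", "c", "d"]}
b2 = {"sync": [12, 13, 14, 15], "audio": ["w", "x", "y", "z"]}
print([s["sync"][0] for s in compensate_offsets([b1, b2])])  # [12, 12]
```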

Time-aligned bits of speaker channel audio content A1' from substream B1 are read from stage 59 to decoder 60, and time-aligned bits of object channel content and metadata M1' from substream B1 are read from stage 59 to metadata combiner 66. Decoder 60 is configured to perform decoding on the speaker channel audio data asserted to it, and to assert the resulting decoded speaker channel audio to object processing and imaging subsystem 67.

Similarly, time-aligned bits of speaker channel audio content A2' from substream B2 are read from stage 59 to decoder 61, and time-aligned bits of object channel content and metadata M2' from substream B2 are read from stage 59 to metadata combiner 66. Decoder 61 is configured to perform decoding on the speaker channel audio data asserted to it, and to assert the resulting decoded speaker channel audio to object processing and imaging subsystem 67.

Similarly, time-aligned bits of speaker channel audio content AN' from substream BN are read from stage 59 to decoder 63, and time-aligned bits of object channel content and metadata MN' from substream BN are read from stage 59 to metadata combiner 66. Decoder 63 is configured to perform decoding on the speaker channel audio data asserted to it, and to assert the resulting decoded speaker channel audio to object processing and imaging subsystem 67.

For example, each of substreams B1, B2, ..., BN may be an E-AC-3 substream, and each of decoders 60, 61, and 63, and any other decoder coupled in parallel with decoders 60, 61, and 63 between stage 59 and subsystem 67, may be an E-AC-3 decoder configured to decode the speaker channel content of one of the input E-AC-3 substreams.

The object data combiner 66 is configured to assert the time-aligned object channel data and metadata for all the object channels of the program, in an appropriate format, to object processing and imaging subsystem 67.

Subsystem 67 is coupled to the output of combiner 66 and to the outputs of decoders 60, 61, and 63 (and any other decoders coupled in parallel with decoders 60, 61, and 63 between subsystems 59 and 67), and controller 68 is coupled to subsystem 67. Subsystem 67 includes a subsystem configured to perform interactive processing on the outputs of combiner 66 and the decoders in response to control data from controller 68, in accordance with an embodiment of the invention (e.g., including the steps performed by subsystem 22 of the Fig. 6 system, or variations on such steps). Controller 68 may be configured to perform the operations which controller 23 of the Fig. 6 system is configured to perform in response to input from a user (or variations on such operations). Subsystem 67 also includes a subsystem configured to perform imaging on the speaker channel audio and object channel audio data asserted to it, in accordance with an embodiment of the invention (e.g., the operations performed by imaging subsystem 24, or by subsystems 24, 25, 26, 31, and 33, or by subsystems 24, 25, 26, 31, 33, 29, and 35 of the Fig. 6 system, or variations on such operations).

In one implementation of the Fig. 7 system, each of substreams B1, B2, ..., BN is a Dolby E bitstream. Each such Dolby E bitstream comprises a sequence of bursts. Each burst may carry speaker channel audio content (the "group" of speaker channels) and a subset of the full set of the inventive object channels (which may be a large set) together with object-related metadata (i.e., each burst may indicate some object channels of the full object channel set and corresponding object-related metadata). Each burst of a Dolby E bitstream typically occupies a time period equivalent to that of a corresponding video frame. Each Dolby E bitstream in the set includes synchronization words (e.g., time codes) to allow the bitstreams in the set to be synchronized or time-aligned with each other. For example, in each bitstream, each container including object channel content and object-related metadata can include a unique ID or time stamp to allow the bitstreams in the set to be synchronized or time-aligned with each other. In the noted implementation of the Fig. 7 system, each of deformatters 50, 51, and 53 (and any other deformatter coupled in parallel with deformatters 50, 51, and 53) is an SMPTE 337 deformatter, and each of decoders 60, 61, and 63, and any other decoder coupled in parallel with decoders 60, 61, and 63 to stage 59, may be a Dolby E decoder.

In some embodiments of the invention, the object-related metadata of an object-based audio program includes persistent metadata. For example, the object-related metadata included in the program input to subsystem 20 of the Fig. 6 system may include non-persistent metadata (e.g., a default level and/or an imaging position or trajectory for a user-selectable object), which can be changed at at least one point in the broadcast chain (from the content creation facility which generated the program to the user interface implemented by controller 23), and persistent metadata which, after initial generation of the program (typically, in a content creation facility), is not intended to be changeable (or cannot be changed). Examples of persistent metadata include: an object ID for each user-selectable object or other object or set of objects of the program; and time codes or other sync words indicating the timing of each user-selectable object, or other object, relative to the audio of the program's speaker channel group or to other elements of the program. Persistent metadata is typically preserved throughout the broadcast chain from content creation facility to user interface, throughout the entire duration of a broadcast of the program, or even also during rebroadcasts of the program. In some embodiments, the audio content (and associated metadata) of at least one user-selectable object is sent in a main mix of the object-based audio program, and at least some persistent metadata (e.g., time codes), and optionally also audio content (and associated metadata) of at least one other object, are sent in a side mix of the program.

In some embodiments of the inventive object-based audio program, persistent object-related metadata is used to preserve (e.g., even after broadcast of the program) a user-selected mix of object content and group (speaker channel) content. For example, this may provide the selected mix as a default mix whenever the user views a program of a specific type (e.g., any football game), or whenever the user views any program (of any type), until the user changes his/her selection. For example, during broadcast of a first program, the user may employ controller 23 (of the Fig. 6 system) to select an object including a persistent ID (e.g., an object identified by controller 23's user interface as a "home team crowd noise" object, where the persistent ID indicates "home team crowd noise"). Then, whenever the user views (and listens to) another program which includes an object having the same persistent ID, the playback system will automatically image the program with the same mix (i.e., the program's "home team crowd noise" object channel mixed with the program's speaker channel group), until the user changes the mix selection. In some embodiments of the inventive object-based audio program, persistent object-related metadata may cause imaging of some objects to be mandatory during an entire program (e.g., even though the user desires to defeat such imaging).
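
The persistent-ID mechanism might be realized in a playback system roughly as in the sketch below; the storage format and ID strings are assumptions.

```python
class MixPreference:
    """Remember a user's object/group mix selection across programs,
    keyed by persistent object IDs (e.g. "home_crowd")."""

    def __init__(self):
        self.preferred_ids = set()

    def user_selected(self, persistent_id):
        self.preferred_ids.add(persistent_id)

    def mix_for_program(self, program_object_ids):
        """Auto-select any object of a new program whose persistent ID
        matches a previously chosen one (until the user changes it)."""
        return [oid for oid in program_object_ids
                if oid in self.preferred_ids]

prefs = MixPreference()
prefs.user_selected("home_crowd")            # chosen during program 1
print(prefs.mix_for_program(["home_crowd", "away_crowd", "comm1"]))
# -> ['home_crowd']: the same mix is applied to program 2 automatically
```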

In some embodiments, the object-related metadata provides a preset mix of object content and group (speaker channel) content, with preset imaging parameters (e.g., default spatial positions of imaged objects). For example, the object-related metadata of a program input to subsystem 20 of the Fig. 6 system may indicate a preset mix of object content and group (speaker channel) content with preset imaging parameters, and subsystems 22 and 24 will cause the program to be imaged with the preset mix and the preset imaging parameters, unless the user employs controller 23 to select another mix of object content and group content and/or another set of imaging parameters.

In some embodiments, the object-related metadata provides a set of selectable "preset" mixes of objects and "group" speaker channel content, each preset mix having a predetermined set of imaging parameters (e.g., spatial positions of imaged objects). These may be presented by the playback system's user interface as a limited menu or palette of available mixes (e.g., a limited menu or palette displayed by controller 23 of the Fig. 6 system). Each preset mix (and/or each selectable object) may have a persistent ID (e.g., a name, a label, or a logo). Controller 23 (or the controller of another embodiment of the inventive playback system) can be configured to display an indication of such an ID (e.g., on the touch screen of the iPad implementation of controller 23). For example, there may be a selectable "home team" mix with a persistent ID (e.g., a team logo), and the ID remains persistent regardless of changes (e.g., made by the broadcaster) to the audio content or non-persistent metadata details of each object of the preset mix.

In some embodiments, the object-related metadata of a program (or a preconfiguration of the playback or imaging system, not indicated by metadata delivered with the program) provides constraints or conditions on selectable mixes of objects and group (speaker channel) content. For example, an implementation of the Fig. 6 system may implement digital rights management (DRM), and more specifically may implement a DRM hierarchy, to allow a user of the Fig. 6 system "tiered" access to a set of audio objects included in an object-based audio program. If the user (e.g., a customer associated with the playback system) pays more money (e.g., to the broadcaster), the user may be authorized to decode and select (and hear) more audio objects of the program.

For another example, object-related metadata may provide constraints on user selection of objects. An example of such a constraint is that, if a user employs controller 23 to select for imaging both a "home team crowd noise" object and a "home team announcer" object of a program (i.e., for inclusion in the mix determined by subsystem 24 of Fig. 6), the metadata included in the program may ensure that subsystem 24 causes the two selected objects to be imaged at predetermined relative spatial locations. Constraints may also be determined (at least in part) by data regarding the playback system (e.g., data entered by the user). For example, if the playback system is a stereo system (including only two speakers), the object processing subsystem 22 (and/or controller 23) of the Fig. 6 system may be configured to prevent user selection of mixes (identified by object-related metadata) that cannot be imaged with adequate spatial resolution by only two speakers. For another example, the object processing subsystem 22 (and/or controller 23) of the Fig. 6 system may remove some delivered objects from the category of selectable objects for legal (e.g., DRM) reasons, or for other reasons (e.g., based on the bandwidth of the delivery channel) indicated by object-related metadata (and/or by other data entered to the playback system). The user may pay the content creator or broadcaster for more bandwidth, and as a result the system (e.g., the object processing subsystem 22 and/or controller 23 of the Fig. 6 system) may allow the user to select from a larger menu of selectable objects and/or object/group mixes.
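
A tiered-access check of the kind described might be sketched as follows; the tier numbering, metadata fields, and bandwidth accounting are invented for the example.

```python
def selectable_objects(objects, user_tier, channel_bandwidth_kbps):
    """Filter the menu of selectable objects by DRM tier and bandwidth.

    objects -- dict: object_id -> {"tier": int, "bitrate_kbps": int}
    Objects above the user's paid tier, or which would exceed the
    delivery channel's bandwidth, are removed from the selectable set.
    """
    menu, budget = [], channel_bandwidth_kbps
    for oid, meta in sorted(objects.items(), key=lambda kv: kv[1]["tier"]):
        if meta["tier"] > user_tier:
            continue                      # not licensed at this tier
        if meta["bitrate_kbps"] > budget:
            continue                      # channel cannot carry it
        menu.append(oid)
        budget -= meta["bitrate_kbps"]
    return menu

objs = {"comm1": {"tier": 0, "bitrate_kbps": 64},
        "home_crowd": {"tier": 1, "bitrate_kbps": 128},
        "effects_4x": {"tier": 2, "bitrate_kbps": 256}}
print(selectable_objects(objs, user_tier=1, channel_bandwidth_kbps=200))
# -> ['comm1', 'home_crowd']; the tier-2 effects require a higher tier
```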

Some embodiments of the invention (e.g., implementations of the Fig. 6 playback system which include elements 29 and 35 described above) implement distributed imaging. For example, default or selected object channels (and corresponding object-related metadata) of a program are passed on (together with a decoded speaker channel group) from a set-top device (e.g., from subsystems 22 and 29 of an implementation of the Fig. 6 system) to a downstream device (e.g., subsystem 35 of Fig. 6, implemented in an AVR or single-piece home theater downstream of the set-top device (STB) in which subsystems 22 and 29 are implemented). The downstream device is configured to image the mix of the object channels and the speaker channel group. The STB may partially image the audio, and the downstream device may complete the imaging (e.g., by generating speaker feeds for driving a specific top-tier speaker (e.g., a ceiling speaker) to place an audio object at a specific apparent source position, where the STB's output merely indicates that the object may be imaged in some unspecified manner by some unspecified top-tier speaker). For example, the STB may have no knowledge of the specific organization of the playback system's speakers, while the downstream device (e.g., AVR or single-piece home theater) may have such knowledge.

In some embodiments, the object-based audio program (e.g., a program input to subsystem 20 of the Fig. 6 system, or to elements 50, 51, and 53 of the Fig. 7 system) is or includes at least one AC-3 (or E-AC-3) bitstream, and each container of the program which includes object channel content (and/or object-related metadata) is included in the auxdata field (e.g., the AUX segment shown in Fig. 1 or Fig. 4) at the end of a frame of the bitstream. In some such embodiments, each frame of the AC-3 or E-AC-3 bitstream includes one or two metadata containers. One container can be included in the Aux field of the frame, and another container can be included in the addbsi field of the frame. Each container has a core header and includes (or is associated with) one or more payloads. One such payload (of, or associated with, a container included in the Aux field) may be a set of audio samples of each of one or more of the inventive object channels (related to the speaker channel group which is also indicated by the program) and the object-related metadata associated with each object channel. The core header of each container typically includes at least one ID value indicating the type of payload(s) included in or associated with the container; substream association indications (indicating with which substreams the core header is associated); and protection bits. Typically, each payload has its own header (or "payload identifier"). Object-level metadata may be carried in each substream which is an object channel.

In other embodiments, the object-based audio program (e.g., a program input to subsystem 20 of the Fig. 6 system, or to elements 50, 51, and 53 of the Fig. 7 system) is or includes a bitstream which is not an AC-3 bitstream or an E-AC-3 bitstream. In some embodiments, the object-based audio program is or includes at least one Dolby E bitstream, and the object channel content and object-related metadata of the program (e.g., each container of the program which includes object channel content and/or object-related metadata) are included in bit locations of the Dolby E bitstream which conventionally do not carry useful information. Each burst of a Dolby E bitstream occupies a time period equivalent to that of a corresponding video frame. The object channels (and object-related metadata) may be included in the guard bands between Dolby E bursts, and/or in unused bit locations within each of the data structures (each having the AES3 frame format) within each Dolby E burst. For example, each guard band consists of a sequence of segments (e.g., 100 segments), each of the first X segments (e.g., X = 20) of each guard band includes the object channels and object-related metadata, and each of the remaining segments of each guard band may include a guard band symbol. In some embodiments, the object channels and object-related metadata of a Dolby E bitstream are included in metadata containers. Each container has a core header and includes (or is associated with) one or more payloads. One such payload may be a set of audio samples of each of one or more of the inventive object channels (related to the speaker channel group which is also indicated by the program) and the object-related metadata associated with each object channel. The core header of each container typically includes at least one ID value indicating the type of payload(s) included in or associated with the container; substream association indications (indicating with which substreams the core header is associated); and protection bits. Typically, each payload has its own header (or "payload identifier"). Object-level metadata may be carried in each substream which is an object channel.
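
Under the stated assumptions (100 segments per guard band, object data in the first X = 20 segments), extraction of the object data might look like the following sketch; the segment size is invented.

```python
def extract_guard_band_objects(guard_band, num_segments=100, x=20,
                               segment_size=48):
    """Pull object channel/metadata bytes out of a Dolby E guard band.

    Assumes the guard band is a flat byte string of num_segments
    fixed-size segments, with object data in the first x segments and
    guard band symbols in the rest (sizes here are illustrative only).
    """
    assert len(guard_band) == num_segments * segment_size
    object_data = guard_band[: x * segment_size]
    return object_data

# Example: a fake guard band whose first 20 segments carry 0xAB bytes.
gb = b"\xab" * (20 * 48) + b"\x00" * (80 * 48)
print(len(extract_guard_band_objects(gb)))  # -> 960 bytes of object data
```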

In some embodiments, the object-based audio program (e.g., a program input to subsystem 20 of the Fig. 6 system, or to elements 50, 51, and 53 of the Fig. 7 system) is decodable, and its speaker channel content is imageable, by a conventional decoder and a conventional imaging system (which is not configured to parse the inventive object channels and object-related metadata). The same program may be imaged in accordance with some embodiments of the invention by a set-top device (or other decoding and imaging system) which is configured (in accordance with an embodiment of the invention) to parse the inventive object channels and object-related metadata and to image a mix of the speaker channel content and the object channel content indicated by the program.

Some embodiments of the invention are intended to provide a personalized (and preferably immersive) audio experience for end consumers in response to broadcast programs, and/or to provide new methods for using metadata in a broadcast pipeline. Some embodiments improve microphone capture (e.g., stadium microphone capture) to generate audio programs which provide a more immersive experience for the end consumer, modify existing production, contribution, and distribution workflows to allow the object channels and metadata of the inventive object-based audio program to flow through the professional chain, and create a new playback pipeline (e.g., one implemented in a set-top device) which supports the inventive object channels and metadata as well as conventional broadcast audio (e.g., the speaker channel group included in some embodiments of the inventive broadcast audio program).

Figure 8 is a block diagram of a broadcast system configured to generate an object-based audio program (and a corresponding video program) for broadcast, in accordance with an embodiment of the invention. A set of X microphones (where X is an integer), including microphones 100, 101, 102, and 103 of the Fig. 8 system, is positioned to capture audio content to be included in the program, and their outputs are coupled to inputs of audio console 104.

In one class of embodiments, the program includes interactive audio content indicating the atmosphere at, and/or live commentary on, a spectator event (e.g., a soccer or rugby game, a car or motorcycle race, or another sporting event). In some embodiments, the audio content of the program indicates multiple audio objects (including user-selectable objects or sets of objects, and typically also a preset set of objects which will be imaged absent a user selection of objects) and a mix (or "group") of speaker channels of the program. The speaker channel group may be a conventional mix (e.g., a 5.1 channel mix) of speaker channels of a type that might be included in a conventional broadcast program which does not include an object channel.

A subset of the microphones (e.g., microphones 100 and 101, and optionally also other microphones whose outputs are coupled to audio console 104) is a conventional microphone array which, in operation, captures audio (to be encoded and delivered as the speaker channel group). In operation, another subset of the microphones (e.g., microphones 102 and 103, and optionally also other microphones whose outputs are coupled to audio console 104) captures audio (e.g., crowd noise and/or other "objects") to be encoded and delivered as object channels of the program. For example, the microphone array of the Fig. 8 system may include: at least one microphone (e.g., microphone 100) implemented as a soundfield microphone and permanently installed in a stadium (e.g., a soundfield microphone with a heater); at least one stereo microphone (e.g., microphone 102, implemented as a Sennheiser MKH416 microphone or another stereo microphone) pointed at the location of spectators supporting one team (e.g., the home team); and at least one other stereo microphone (e.g., microphone 103, implemented as a Sennheiser MKH416 microphone or another stereo microphone) pointed at the location of spectators supporting the other team (e.g., the away team).

The inventive broadcast system may include a mobile unit (which may be a truck, sometimes referred to as a "match truck"), located outside the stadium (or other event location), which is the first recipient of audio feeds from the microphones in the stadium (or other event location). The match truck generates the object-based audio program (to be broadcast), including by encoding audio content from some of the microphones for delivery as object channels of the program, generating corresponding object-related metadata (e.g., metadata indicating the spatial location at which each object should be imaged) and including such metadata in the program, and encoding audio content from some of the microphones for delivery as the speaker channel group of the program.

For example, in the Figure 8 system, console 104, object processing subsystem 106 (coupled to the outputs of console 104), embedded subsystem 108, and delivery encoder 110 may be installed in a match truck. The object-based audio program generated in subsystem 106 may be combined (e.g., in subsystem 108) with video content (e.g., from cameras located in the stadium) to produce a combined audio/video signal, which is then encoded (e.g., by encoder 110), thereby producing an encoded audio/video signal for broadcast (e.g., by transmission subsystem 5 of Figure 5). It should be understood that a playback system for decoding and imaging such an audio/video signal would include a subsystem (not specifically shown in the figures) for parsing the audio content and the video content of the transmitted audio/video signal, a subsystem for decoding and imaging the audio content in accordance with an embodiment of the present invention (e.g., a subsystem similar or identical to the Figure 6 system), and another subsystem (not specifically shown in the figures) for decoding and imaging the video content.

The audio outputs of console 104 may include: a 5.1 speaker channel group (labeled "5.1 neutral" in Figure 8) representing the sound captured at the sporting event; audio content of a stereo object channel (labeled "2.0 main") representing crowd noise from the home team's fans present at the event; audio content of a stereo object channel (labeled "2.0 guest") representing crowd noise from the visiting team's fans present at the event; audio content of an object channel (labeled "1.0 comm1") representing commentary by an announcer from the home team's city; audio content of an object channel (labeled "1.0 comm2") representing commentary by an announcer from the visiting team's city; and audio content of an object channel (labeled "1.0 kick") representing the sound produced when the game ball is struck by a participant in the sporting event.

The object processing subsystem 106 is configured to organize (e.g., group) the audio streams from console 104 into object channels (e.g., grouping the left and right audio streams labeled "2.0 guest" into an away-crowd-noise object channel) and/or object channel groups, to generate object-related metadata representing the object channels (and/or object channel groups), and to encode the object channels (and/or object channel groups), the object-related metadata, and the speaker channel group (determined from the audio streams from console 104) into an object-based audio program (e.g., an object-based audio program encoded as a Dolby E bitstream). Typically, subsystem 106 is also configured to image (and play on a set of studio monitor speakers) at least a selected subset of the object channels (and/or object channel groups) and the speaker channel group (including by using the object-related metadata to generate a mix representing the selected object channels and the speaker channels), so that the played-back sound can be monitored by the operators of console 104 and subsystem 106 (as indicated by the "monitor path" of Figure 8).

The interface between the output of subsystem 104 and the input of subsystem 106 may be a multi-channel audio digital interface ("MADI").

In operation, subsystem 108 of the Figure 8 system combines the object-based audio program generated in subsystem 106 with video content (e.g., from cameras located at the stadium) to produce a combined audio/video signal that is asserted to encoder 110. The interface between the output of subsystem 108 and the input of subsystem 110 may be a high-definition serial digital interface ("HD-SDI"). In operation, encoder 110 encodes the output of subsystem 108, thereby generating an encoded audio/video signal for broadcast (e.g., by transmission subsystem 5 of Figure 5).

In some embodiments, a broadcast facility (e.g., subsystems 106, 108, and 110 of the Figure 8 system) is configured to generate a plurality of object-based audio programs representing the captured sound (e.g., object-based audio programs represented by a plurality of encoded audio/video signals output from subsystem 110 of Figure 8). Examples of such object-based audio programs include a 5.1 flattened mix, an international mix, and a domestic mix. For example, all of the programs may include a common set of speaker channels, but the object channels of each program (and/or the menu of selectable object channels offered by each program, and/or the selectable or non-selectable imaging parameters for imaging and mixing the object channels) may differ from program to program.

In some embodiments, a facility of a broadcaster or other content creator (e.g., subsystems 106, 108, and 110 of the Figure 8 system) is configured to generate a single object-based audio program (i.e., a master) that can be imaged in any of a variety of different playback environments (e.g., a 5.1 channel domestic playback system, a 5.1 channel international playback system, and a stereo playback system). The master need not be mixed (e.g., downmixed) separately for broadcast to consumers in any particular environment.

As noted above, in some embodiments of the present invention, object-related metadata of a program (or a pre-configuration of the playback or imaging system, not indicated by metadata delivered with the program) provides restrictions or conditions on selectable mixes of objects and the group of speaker channels. For example, an implementation of the Figure 6 system may implement a DRM hierarchy to allow a user tiered access to the set of object channels included in an object-based audio program. If the user pays more money (e.g., to the broadcaster), the user may be authorized to decode, select, and image more of the program's object channels.

Restrictions and conditions on user selection of objects (or object groups) will be explained with reference to Figure 9. In Figure 9, the program "P0" includes seven object channels: an object channel "N0" representing neutral crowd noise, an object channel "N1" representing home-crowd noise, an object channel "N2" representing away-crowd noise, an object channel "N3" representing official commentary on the event (e.g., commentary by a commercial radio announcer), an object channel "N4" representing fan commentary on the event, an object channel "N5" representing public-address announcements at the event, and an object channel "N6" representing a news feed regarding the event (converted via a text-to-speech system).

Preset metadata included in program P0 indicates a preset object group (one or more "preset" objects) and a preset imaging parameter set (e.g., the spatial location of each preset object in the preset object group) to be included (by default) in the imaged mix of the "group" (speaker channel) content and object channel content represented by the program. For example, the preset object group may be a mix of object channel "N0" (representing neutral crowd noise), imaged in a diffuse manner (e.g., so as not to be perceived as emanating from any particular sound source location), and object channel "N3" (representing official commentary), imaged so as to be perceived as emanating from a sound source location directly in front of the listener (i.e., at an azimuth of 0 degrees relative to the listener).

The program P0 (of Figure 9) also includes metadata indicating a plurality of user-selectable preset mixes, each preset mix determined by a subset of the program's object channels and a corresponding imaging parameter set. The user-selectable preset mixes may be presented as a menu on a user interface of a controller of the playback system (e.g., a menu displayed by controller 23 of the Figure 6 system). For example, one such preset mix is a mix of object channel "N0" (representing neutral crowd noise), object channel "N1" (representing home-crowd noise), and object channel "N4" (representing fan commentary), imaged such that the channel N0 and N1 content in the mix is perceived as emanating from a sound source location directly behind the listener (i.e., at an azimuth of 180 degrees relative to the listener), with the level of the channel N1 content in the mix 3 dB lower than the level of the channel N0 content in the mix, and the channel N4 content in the mix imaged diffusely (e.g., so as not to be perceived as emanating from any particular sound source location).

A rule that the playback system may implement (e.g., the grouping rule "G" indicated in Figure 9, determined by metadata of the program) is that each user-selectable preset mix that includes content of at least one of object channels N0, N1, and N2 must include the content of object channel N0 alone, or the content of object channel N0 mixed with the content of at least one of object channels N1 and N2. A further rule that the playback system may implement (e.g., the conditional rule "C1" indicated in Figure 9, determined by metadata of the program) is that each user-selectable preset mix that includes content of object channel N0 mixed with content of at least one of object channels N1 and N2 must include the content of object channel N0 mixed with the content of object channel N1, or must include the content of object channel N0 mixed with the content of object channel N2.

A further rule that the playback system may implement (e.g., the conditional rule "C2" indicated in Figure 9, determined by metadata of the program) is that each user-selectable preset mix that includes content of at least one of object channels N3 and N4 must include either the content of object channel N3 alone or the content of object channel N4 alone.
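
To make the interplay of these rules concrete, the following minimal Python sketch (using the channel names N0-N4 from Figure 9; the function and its logic are illustrative assumptions, not part of the program syntax) checks a candidate preset mix against rules G, C1, and C2:

```python
# Hedged sketch: validating a user-selectable preset mix against the grouping
# rule G and conditional rules C1 and C2 described above.

def satisfies_rules(selected: set[str]) -> bool:
    """Return True if the selected object channels form a permissible mix."""
    # Rule G: N1 or N2 may only appear mixed with N0.
    if ({"N1", "N2"} & selected) and "N0" not in selected:
        return False
    # Rule C1: N0 is mixed with N1 or with N2, not with both at once.
    if {"N0", "N1", "N2"} <= selected:
        return False
    # Rule C2: at most one commentary channel (N3 or N4) per mix.
    if {"N3", "N4"} <= selected:
        return False
    return True

# Example: the default mix of neutral crowd noise plus official commentary.
assert satisfies_rules({"N0", "N3"})
assert not satisfies_rules({"N1", "N3"})          # violates rule G (no N0)
assert not satisfies_rules({"N0", "N3", "N4"})    # violates rule C2
```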

Some embodiments of the present invention implement conditional decoding (and/or imaging) of object channels of an object-based audio program. For example, the playback system may be configured to decode object channels conditionally, based on the playback environment or on the user's rights. For example, if a DRM hierarchy is implemented to allow consumers tiered access to the set of audio object channels included in an object-based audio program, the playback system may be automatically configured (by control bits included in metadata of the program) to prevent decoding and selection for imaging of some of the objects unless the playback system is notified that the user has satisfied at least one condition (e.g., has paid a specific amount of money to the content provider). For example, the user may need to purchase a license to listen to the "official commentary" object channel N3 of program P0 of Figure 9, and the playback system may implement the conditional rule "C2" indicated in Figure 9 so that object channel N3 cannot be selected unless the playback system is notified that the user of the playback system has purchased the necessary license.

For another example, the playback system may be automatically configured (by control bits included in metadata of the program, indicating a particular format of the available playback speaker array) to prevent decoding and selection for imaging of some objects if the playback speaker array does not meet a condition (e.g., the playback system may implement the conditional rule "C1" indicated in Figure 9 so that a preset mix of object channels N0 and N1 cannot be selected unless the playback system is notified that a 5.1 speaker array is available to image the selected content, but not if the only available speaker array is a 2.0 speaker array).

In some embodiments, the present invention implements rule-based object channel selection, in which at least one predetermined rule determines which object channel(s) of an object-based audio program are imaged (e.g., with the set of speaker channels). The user may also specify at least one rule for object channel selection (e.g., by selecting from a menu of available rules presented by a user interface of the playback system controller), and the playback system (e.g., object processing subsystem 22 of the Figure 6 system) may be configured to apply each such rule to determine which object channel(s) of the object-based audio program to be imaged should be included in the mix to be imaged (e.g., by subsystem 24, or subsystems 24 and 35, of the Figure 6 system). The playback system may determine from the object-related metadata in the program which object channel(s) of the program satisfy the predetermined rule(s).

For a simple example, consider the case in which the object-based audio program represents a sporting event. Instead of manipulating a controller (e.g., controller 23 of Figure 6) to make a static selection of a specific set of objects included in the program (e.g., radio commentary from a particular team, car, or bicycle), the user manipulates the controller to set a rule (e.g., automatically select for imaging whichever object channel represents the team, car, or bicycle that is winning or in first place). The playback system applies the rule to implement dynamic selection (during imaging of a single program, or a series of different programs) of a sequence of different subsets of the objects (object channels) included in the program (e.g., a first subset of objects representing one team, automatically followed by a second subset of objects representing a second team when the second team scores and thus becomes the currently winning team). Thus, in some such embodiments, real-time events steer or otherwise affect which object channels are included in the imaged mix. The playback system (e.g., object processing subsystem 22 of the Figure 6 system) may respond to metadata included in the program (e.g., metadata indicating that at least one corresponding object represents the currently winning team, such as the crowd noise of that team's fans or the commentary of the radio announcer associated with that team) by selecting which object channel(s) should be included in the mix of speaker channels and object channel(s) to be imaged. For example, the content creator may include (in the object-based audio program) metadata indicating the placement order (or other hierarchy) of each of at least some of the program's audio object channels (e.g., indicating which object channels correspond to the team or car currently in first place, which object channels correspond to the team or car currently in second place, and so on). The playback system may be configured to respond to such metadata by selecting and imaging only the object channel(s) that satisfy the user-specified rule (e.g., the object channel(s), indicated by the program's object-related metadata, corresponding to the team in "n"th place).
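
A hedged sketch of such rule-based selection follows; the field names ("name", "rank") and the table layout are hypothetical stand-ins for the placement-order metadata described above:

```python
# Illustrative sketch: rule-based object selection driven by in-program metadata
# that ranks object channels, e.g. by which team currently leads.

def select_by_rule(object_channels: list[dict], wanted_rank: int) -> list[str]:
    """Select the object channels whose metadata rank matches the user's rule,
    e.g. 'always render the crowd of the team currently in first place'."""
    return [ch["name"] for ch in object_channels if ch.get("rank") == wanted_rank]

# A table like this could be refreshed by the broadcaster as the score changes,
# so the selected subset follows events in real time.
channels = [
    {"name": "crowd_team_A", "rank": 1},   # team A currently winning
    {"name": "crowd_team_B", "rank": 2},
    {"name": "comm_team_A",  "rank": 1},
]
print(select_by_rule(channels, wanted_rank=1))  # ['crowd_team_A', 'comm_team_A']
```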

Examples of object-related metadata for the object channels of the object-based audio program of the present invention include (but are not limited to): metadata indicating detailed information about how to image an object channel; dynamic temporal metadata (e.g., indicating an object's panning trajectory, object size, gain, and so on); and metadata to be used by an AVR (or by another device or system downstream of the decoding and object processing subsystems of some implementations of the system of the present invention) to image an object channel (e.g., with knowledge of the organization of the available playback speaker array). Such metadata may specify restrictions on object position, gain, muting, or other imaging parameters, and/or restrictions on how objects may interact with other objects (e.g., restrictions on which additional objects may be selected given that a particular object is selected), and/or may specify preset objects and/or preset imaging parameters (to be used in the absence of user selection of other objects and/or imaging parameters).
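
For illustration only, one way to model these kinds of metadata in code is sketched below; every field name is an assumption, not the metadata syntax of the program:

```python
from dataclasses import dataclass, field

# Hypothetical model of the per-object metadata kinds listed above: rendering
# detail, a dynamic trajectory point, constraints, and defaults.
@dataclass
class ObjectMetadata:
    object_id: str
    position: tuple[float, float, float] = (0.0, 0.0, 0.0)  # x, y, z trajectory point
    size: float = 0.0                    # apparent source size
    gain_db: float = 0.0
    mute: bool = False
    max_gain_db: float | None = None     # constraint on user gain changes
    allowed_with: set[str] = field(default_factory=set)  # objects selectable alongside
    is_default: bool = False             # imaged when the user selects nothing
```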

In some embodiments, at least some object-related metadata (and optionally also at least some object channels) of the object-based audio program of the present invention is sent in a separate bitstream or other container (e.g., as a side mix for which the user may need to pay an additional fee to receive and/or use) from the program's speaker channel group and conventional metadata. Without access to the object-related metadata (or to the object-related metadata and object channels), the user can decode and image the speaker channel group, but cannot select audio objects of the program and cannot image the program's audio objects in a mix with the audio represented by the speaker channel group. Each frame of the object-based audio program of the present invention may include audio content of multiple object channels and corresponding object-related metadata.

An object-based audio program generated (or transmitted, stored, buffered, decoded, imaged, or otherwise processed) in accordance with some embodiments of the present invention includes at least one set of speaker channels, at least one object channel, and metadata representing a layered graph (sometimes referred to as a layered "mix graph") indicating selectable mixes (e.g., all selectable mixes) of the speaker channels and object channel(s). For example, the mix graph represents each rule applicable to selection of subsets of the speaker and object channels. Typically, an encoded audio bitstream represents at least some (i.e., at least a part) of the program's audio content (e.g., a set of speaker channels and at least some of the program's object channels) and object-related metadata (including the metadata representing the mix graph), and optionally also at least one additional encoded bitstream or file represents some of the program's audio content and/or object-related metadata.

The layered mix graph represents nodes (each representing a selectable channel or channel group, or a category of selectable channels or channel groups) and connections between nodes (e.g., control interfaces to the nodes and/or rules for selecting channels), and includes essential data (a "base" layer) and optional (i.e., omittable) data (at least one "extension" layer). Typically, the layered mix graph is included in one of the encoded audio bitstreams representing the program, and can be traversed (by a playback system, e.g., the end user's playback system) to determine the preset mix of channels and the options for modifying the preset mix.

Where the mix graph is representable as a tree diagram, the base layer can be a branch (or two or more branches) of the tree diagram, and each extension layer can be another branch (or another set of two or more branches) of the tree diagram. For example, one branch of the tree diagram (indicated by the base layer) may represent selectable channels and channel groups available to all end users, and another branch of the tree diagram (indicated by an extension layer) may represent additional selectable channels and/or channel groups available only to some end users (e.g., such an extension layer may be provided only to end users authorized to use it). Figure 9 is an example of a tree diagram including object channel nodes (e.g., nodes representing object channels N0, N1, N2, N3, N4, N5, and N6) and other elements of a mix graph.
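
The following Python sketch (node names and the authorization mechanism are assumptions) illustrates such a tree-shaped mix graph with a base layer available to all users and an extension layer offered only to authorized users:

```python
# Minimal sketch of a layered mix graph as a tree whose base-layer branch is
# available to all users and whose extension-layer branch is restricted.

class Node:
    def __init__(self, name, layer="base", children=()):
        self.name, self.layer, self.children = name, layer, list(children)

def selectable_channels(node, authorized_layers={"base"}):
    """Traverse the graph and collect channels the playback system may offer."""
    if node.layer not in authorized_layers:
        return []
    found = [node.name] if not node.children else []
    for child in node.children:
        found += selectable_channels(child, authorized_layers)
    return found

root = Node("P0", children=[
    Node("beds", children=[Node("5.1 neutral")]),
    Node("objects", children=[Node("N0"), Node("N3"),
                              Node("N6", layer="extension")]),  # restricted channel
])
print(selectable_channels(root))                          # base layer only
print(selectable_channels(root, {"base", "extension"}))   # with the extension layer
```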

Typically, the base layer contains (indicates) the graph structure and control interfaces to the nodes of the graph (e.g., panning and gain control interfaces). The base layer is necessary for mapping any user interaction to the decoding/imaging process.

Each extension layer contains (indicates) an extension to the base layer. The extensions are not immediately necessary for mapping user interaction to the decoding process, and can therefore be transmitted at a slower rate and/or with delay, or omitted.

In some embodiments, the base layer is included as metadata of an independent substream of the program (e.g., transmitted as metadata of the independent substream).

An object-based audio program generated (or transmitted, stored, buffered, decoded, imaged, or otherwise processed) in accordance with some embodiments of the present invention includes at least one set of speaker channels, at least one object channel, and metadata representing a mix graph (which may or may not be a layered mix graph) indicating selectable mixes (e.g., all selectable mixes) of the speaker channels and object channel(s). An encoded audio bitstream (e.g., a Dolby E or E-AC-3 bitstream) represents at least a portion of the program, and the metadata representing the mix graph (and typically also the selectable object and/or speaker channels) is included in every frame of the bitstream (or in each frame of a subset of the bitstream's frames). For example, each frame may include at least one metadata segment and at least one audio data segment, and the mix graph may be included in at least one metadata segment of each frame. Each metadata segment (which may be referred to as a "container") may have a format that includes a metadata segment header (and optionally also other elements), and one or more metadata payloads following the metadata segment header. Each metadata payload is itself identified by a payload header. The mix graph, if present in a metadata segment, is included in one of the metadata payloads of the metadata segment.
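
A minimal sketch of this container layout follows, assuming hypothetical dictionary-based structures rather than the actual E-AC-3 or Dolby E syntax; it locates a mix-graph payload within a frame's metadata segments:

```python
# Hedged sketch of the container layout described above: each frame carries
# metadata segments; each segment carries payloads; a mix-graph payload, when
# present, is one payload among them.

def find_mix_graph(frame: dict):
    """Walk a decoded frame's metadata segments and return the mix-graph payload."""
    for segment in frame["metadata_segments"]:      # each begins with a segment header
        for payload in segment["payloads"]:         # each identified by a payload header
            if payload["header"]["type"] == "mix_graph":
                return payload["data"]
    return None

frame = {"metadata_segments": [
    {"payloads": [{"header": {"type": "object_md"}, "data": {}},
                  {"header": {"type": "mix_graph"}, "data": {"nodes": ["N0", "N3"]}}]},
]}
print(find_mix_graph(frame))  # {'nodes': ['N0', 'N3']}
```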

In some embodiments, an object-based audio program generated (or transmitted, stored, buffered, decoded, imaged, or otherwise processed) in accordance with the present invention includes at least two sets of speaker channels, at least one object channel, and metadata representing a mix graph (which may or may not be a layered mix graph). The mix graph indicates selectable mixes of the speaker channels and object channel(s), and includes at least one "group mix" node. Each "group mix" node defines a predetermined mix of speaker channel groups, and thus represents or implements a set of predetermined mixing rules (optionally with user-selectable parameters) for mixing the speaker channels of two or more of the program's speaker channel groups.

Consider an example in which the audio program is associated with a football (soccer) match in a stadium between team A (the home team) and team B, and includes a 5.1 speaker channel group representing the whole crowd in the stadium (determined from microphone feeds), a stereo feed biased toward the team A portion of the crowd (i.e., audio captured from spectators seated in the section of the stadium predominantly occupied by team A's fans), and another stereo feed biased toward the team B portion of the crowd (i.e., audio captured from spectators seated in the section of the stadium predominantly occupied by team B's fans). It is possible to mix these three feeds (the 5.1 channel neutral group, the 2.0 channel "team A" group, and the 2.0 channel "team B" group) on a mixing console to generate four 5.1 speaker channel groups (which may be referred to as "fan zone" groups): unbiased, biased toward the home team (the neutral group mixed with the team A group), biased toward the away team (the neutral group mixed with the team B group), and opposite (the neutral group, mixed with the team A group panned to one side of the room and the team B group panned to the opposite side of the room). However, transmitting the four mixed 5.1 channel groups is expensive in terms of bit rate. Accordingly, an embodiment of the bitstream of the present invention includes metadata specifying group mixing rules to be implemented by a playback system (e.g., in an end user's home) in response to user mix selections (e.g., rules for mixing speaker channel groups to generate the four mixed 5.1 channel groups described above), as well as the speaker channel groups that can be mixed according to the rules (e.g., the original 5.1 channel group and the two biased stereo speaker channel groups). In response to a group mix node of the mix graph, the playback system may present to the user (e.g., via the user interface implemented by controller 23 of the Figure 6 system) the option to select one of the four mixed 5.1 channel groups described above. In response to user selection of a mixed 5.1 channel group, the playback system (e.g., subsystem 22 of the Figure 6 system) generates the selected mix using the (unmixed) speaker channel groups transmitted in the bitstream.

In some embodiments, the group mixing rules contemplate the following operations (which may have predetermined parameters or user-selectable parameters): rotating a group (i.e., panning a speaker channel group to the left, right, front, or back). For example, to create the "opposite" mix described above, the stereo team A group would be rotated to the left side of the playback speaker array (team A's L and R channels mapped to the playback system's L and Ls channels) and the stereo team B group would be rotated to the right side of the playback speaker array (team B's L and R channels mapped to the playback system's R and Rs channels). The user interface of the playback system may thus present to the end user a selection of one of the four "unbiased", "biased toward the home team", "biased toward the away team", and "opposite" group mixes, and when the user selects the "opposite" group mix, the playback system performs the appropriate group rotation during imaging of the "opposite" group mix; and ducking (i.e., attenuating) particular speaker channels (target channels) in a group mix (typically, to make headroom). For example, in the football match example above, the user interface of the playback system may present to the end user a selection of one of the four "unbiased", "biased toward the home team", "biased toward the away team", and "opposite" group mixes, and in response to the user selecting the "opposite" group mix, the playback system may duck (attenuate) each of the L, Ls, R, and Rs channels of the neutral 5.1 channel group by a predetermined amount (specified by metadata in the bitstream) before mixing the ducked 5.1 channel group with the stereo "team A" and "team B" groups to produce the "opposite" group mix, thereby achieving the target ducking during imaging of the "opposite" group mix.
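
The sketch below illustrates the "opposite" group mix just described, assuming hypothetical channel names and a 3 dB ducking amount (in practice the amount would be specified by metadata in the bitstream):

```python
# Illustrative sketch: rotate the team A stereo group to the left of the array,
# team B to the right, and duck the neutral bed to make headroom.

DUCK_DB = -3.0  # assumed; the actual amount would come from bitstream metadata

def db_to_lin(db):
    return 10.0 ** (db / 20.0)

def opposite_mix(neutral, team_a, team_b):
    """neutral: dict of 5.1 channel sample lists; team_a/team_b: dicts with 'L', 'R'."""
    duck = db_to_lin(DUCK_DB)
    out = dict(neutral)
    for ch in ("L", "Ls", "R", "Rs"):      # duck only the channels being mixed into
        out[ch] = [s * duck for s in neutral[ch]]
    out["L"]  = [a + b for a, b in zip(out["L"],  team_a["L"])]   # team A L -> L
    out["Ls"] = [a + b for a, b in zip(out["Ls"], team_a["R"])]   # team A R -> Ls
    out["R"]  = [a + b for a, b in zip(out["R"],  team_b["L"])]   # team B L -> R
    out["Rs"] = [a + b for a, b in zip(out["Rs"], team_b["R"])]   # team B R -> Rs
    return out
```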

In another class of embodiments, an object-based audio program generated (or transmitted, stored, buffered, decoded, imaged, or otherwise processed) in accordance with the present invention includes substreams, and the substreams represent at least one set of speaker channels, at least one object channel, and object-related metadata. The object-related metadata includes "substream" metadata (indicating the substream structure of the program and/or the manner in which the substreams should be decoded) and typically also a mix graph indicating selectable mixes (e.g., all selectable mixes) of the speaker channels and object channel(s). The substream metadata may indicate which substreams of the program should be decoded independently of other substreams of the program, and which substreams of the program should be decoded in association with at least one other substream of the program.

For example, in some embodiments, an encoded audio bitstream represents at least some (i.e., at least a part) of the program's audio content (e.g., at least one set of speaker channels and at least some of the program's object channels) and metadata (e.g., a mix graph and substream metadata, and optionally also other metadata), and at least one additional encoded audio bitstream (or file) represents some of the program's audio content and/or metadata. In the case that each bitstream is a Dolby E bitstream (or is encoded in a manner consistent with the SMPTE 337 format for carrying non-PCM data in an AES3 serial digital bitstream), the bitstreams may collectively represent a multiple of up to eight channels of audio content, with each bitstream carrying up to eight channels of audio data and typically also including metadata. Each bitstream can be regarded as a substream of a combined bitstream representing all the audio data and metadata carried by all the bitstreams.

For another example, in some embodiments, an encoded audio bitstream represents a plurality of substreams of metadata (e.g., a mix graph and substream metadata, and optionally also other object-related metadata) and of audio content of at least one audio program. Typically, each substream represents one or more channels of the program (and typically also metadata). In some cases, multiple substreams of the encoded bitstream represent the audio content of several audio programs, for example a "main" audio program (which may be a multichannel program) and at least one other audio program (e.g., a program of commentary on the main audio program).

An encoded audio bitstream representing at least one audio program necessarily includes at least one "independent" substream of audio content. The independent substream represents at least one channel of an audio program (e.g., the independent substream may represent the five full-range channels of a conventional 5.1 channel audio program). Herein, this audio program is referred to as the "main" program.

In some cases, an encoded audio bitstream represents two or more audio programs (a "main" program and at least one other audio program). In such cases, the bitstream includes two or more independent substreams: a first independent substream representing at least one channel of the main program, and at least one other independent substream representing at least one channel of another audio program (a program distinct from the main program). Each independent substream can be decoded independently, and a decoder may operate to decode only a subset (not all) of the independent substreams of the encoded bitstream.

Optionally, an encoded audio bitstream representing a main program (and optionally also at least one other audio program) includes at least one "dependent" substream of audio content. Each dependent substream is associated with one independent substream of the bitstream, and represents at least one additional channel of the program (e.g., the main program) whose content is represented by the associated independent substream (i.e., the dependent substream represents at least one channel of the program that is not represented by the associated independent substream, while the associated independent substream represents at least one channel of the program).

In an example of an encoded bitstream that includes an independent substream (representing at least one channel of a main program), the bitstream also includes a dependent substream (associated with the independent substream) representing one or more additional speaker channels of the main program. Such additional speaker channels are additional to the main program channels represented by the independent substream. For example, if the independent substream represents the standard-format left, right, center, left surround, and right surround full-range speaker channels of a 7.1 channel main program, the dependent substream may represent the two other full-range speaker channels of the main program.
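
A minimal sketch of this independent/dependent relationship, using the 7.1 example above (the substream identifiers and field names are assumptions), follows:

```python
# Hedged sketch: an independent substream carries the standard L/R/C/Ls/Rs
# full-range channels, and a dependent substream carries the program's two
# additional full-range channels.

INDEPENDENT = {"id": "I0", "channels": ["L", "R", "C", "Ls", "Rs"]}   # decodable alone
DEPENDENT   = {"id": "D0", "depends_on": "I0", "channels": ["Lrs", "Rrs"]}

def decodable(substream, decoded_ids):
    """A dependent substream can only be decoded with its independent substream."""
    return substream.get("depends_on") is None or substream["depends_on"] in decoded_ids

print(decodable(INDEPENDENT, set()))        # True
print(decodable(DEPENDENT, set()))          # False
print(decodable(DEPENDENT, {"I0"}))         # True
```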

In accordance with the E-AC-3 standard, a conventional E-AC-3 bitstream must represent at least one independent substream (e.g., a single AC-3 bitstream), and may represent up to eight independent substreams. Each independent substream of an E-AC-3 bitstream may be associated with up to eight dependent substreams.

In an exemplary embodiment (to be described with reference to Figure 11), an object-based audio program includes at least one set of speaker channels, at least one object channel, and metadata. The metadata includes "substream" metadata (indicating the substream structure of the program's audio content and/or the manner in which the substreams of the program's audio content should be decoded) and typically also a mix graph indicating selectable mixes of the speaker channels and object channel(s). The audio program is associated with a football match. An encoded audio bitstream (e.g., an E-AC-3 bitstream) represents the program's audio content and metadata. The program's audio content (and therefore the bitstream) includes four independent substreams, as shown in Figure 11. One independent substream (labeled substream "I0" in Figure 11) represents a 5.1 speaker channel group indicating the neutral crowd noise at the football match. Another independent substream (labeled substream "I1" in Figure 11) represents a 2.0 channel "team A" group ("Man crowd") indicating sound from the part of the match crowd favoring one team ("team A"), a 2.0 channel "team B" group ("Liv crowd") indicating sound from the part of the match crowd favoring the other team ("team B"), and a mono object channel ("Sky comm 1") indicating commentary on the match. A third independent substream (labeled substream "I2" in Figure 11) represents object channel audio content (labeled "2/0 kick") indicating the sound produced when the game ball is struck by a participant in the football match, and three object channels ("Sky comm 2", "Man comm", and "Liv comm") indicating different commentaries on the football match. A fourth independent substream (labeled substream "I3" in Figure 11) represents an object channel (labeled "PA") indicating sound produced by the stadium public address system at the football match, an object channel (labeled "Radio") indicating a radio broadcast of the football match, and an object channel (labeled "Score Refresh") indicating score updates during the football match.

In the example of Figure 11, substream I0 includes the mix graph for the program and metadata ("obj md") including at least some of the substream metadata and at least some of the object channel related metadata. Each of substreams I1, I2, and I3 includes metadata ("obj md") comprising at least some object channel related metadata and, optionally, at least some substream metadata.

In the example of Figure 11, the substream metadata of the bitstream indicates that, during decoding, coupling should be "off" between each pair of independent substreams (so that each independent substream is decoded independently of the other independent substreams), and the substream metadata of the bitstream indicates, for the program channels within each substream, whether coupling should be "on" (so that those channels are not decoded independently of one another) or "off" (so that those channels are decoded independently of one another). For example, the substream metadata indicates that coupling should be "on" within each of the two stereo speaker channel groups of substream I1 (the 2.0 channel "team A" group and the 2.0 channel "team B" group), but should be disabled across the speaker channel groups of substream I1 and between the mono object channel and each speaker channel group of substream I1 (so that the mono object channel and the speaker channel groups can be decoded independently of one another). Similarly, the substream metadata indicates that coupling should be "on" within the 5.1 speaker channel group of substream I0 (so that the channels of this speaker channel group are decoded in association with one another).
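
The sketch below models the coupling indications for substream I1 under assumed field names; it is illustrative only and does not reflect the actual substream metadata syntax:

```python
# Hedged sketch of the substream metadata in the Figure 11 example: coupling is
# "on" within each channel group and "off" between groups and between the
# groups and the mono object channel.

SUBSTREAM_I1_METADATA = {
    "groups": [
        {"name": "2.0 team A", "channels": ["A_L", "A_R"], "coupling": "on"},
        {"name": "2.0 team B", "channels": ["B_L", "B_R"], "coupling": "on"},
        {"name": "comm mono",  "channels": ["comm"],       "coupling": "off"},
    ],
    "cross_group_coupling": "off",   # groups decode independently of one another
}

def independently_decodable_units(md):
    """Each group whose cross-group coupling is off is its own decode unit."""
    assert md["cross_group_coupling"] == "off"
    return [g["channels"] for g in md["groups"]]

print(independently_decodable_units(SUBSTREAM_I1_METADATA))
```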

In some embodiments, speaker channels and object channels are included ("packed") within the substreams of an audio program in a manner appropriate to the program's mix graph. For example, if the mix graph is a tree, all channels of one branch of the graph may be included in one substream, and all channels of another branch of the graph may be included in another substream.

In one class of embodiments, the present invention is a method for generating an object-based audio program, the method including the steps of: determining a set of speaker channels representing audio content of a first subset of a set of audio signals representing captured audio content (e.g., the outputs of the microphones of the Figure 8 system, or the inputs to subsystem 210 of the Figure 10 system); determining a set of object channels representing audio content of a second subset of the set of audio signals; generating object-related metadata representing the object channels; and generating an object-based audio program such that the object-based audio program represents the speaker channel group, the object channels, and the object-related metadata, and is imageable to provide sound perceivable as a mix of first audio content represented by the speaker channel group and second audio content represented by a selected subset of the object channels, such that the second audio content is perceived as emanating from sound source locations determined by the selected subset of the object channels. Typically, at least some (i.e., at least a part) of the object-related metadata indicates an identification of each of at least some of the object channels, and/or at least some of the object-related metadata indicates a preset subset of the object channel group to be imaged in the absence of end-user selection of a subset of the object channel group. Some embodiments in this class also include the step of generating the set of audio signals, including by capturing the audio content (e.g., at a viewer event).

In another class of embodiments, the present invention is a method of imaging audio content determined by an object-based audio program, where the program represents a set of speaker channels, a set of object channels, and object-related metadata, the method including the steps of: (a) determining a selected subset of the object channel group; and (b) imaging audio content determined by the object-based audio program, including by determining a mix of first audio content represented by the speaker channel group and second audio content represented by the selected subset of the object channel group.

In some embodiments, the method is performed by a playback system including a set of speakers, and step (b) includes the step of generating, in response to the mix of the first audio content and the second audio content, speaker feeds for driving the speaker set to emit sound, where the sound includes object channel sound representing the second audio content, and the object channel sound is perceivable as emanating from apparent sound source locations determined by the selected subset of the object channels. The speaker channel group may include a speaker channel for each of the speakers in the speaker set.
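
By way of illustration, the following sketch generates speaker feeds from a speaker-channel bed and selected objects using a crude cosine panning law; the panning model, speaker layout, and function names are assumptions, not the imaging algorithm of the playback system:

```python
# Minimal sketch, assuming a 5-speaker layout and a simple cosine panning law.
# Real renderers use more elaborate panning laws and speaker configurations.

import math

SPEAKER_AZIMUTHS = {"L": 30.0, "R": -30.0, "C": 0.0, "Ls": 110.0, "Rs": -110.0}

def pan_gains(object_azimuth_deg):
    """Crude gain model: weight speakers by proximity to the object's azimuth."""
    weights = {name: max(0.0, math.cos(math.radians(object_azimuth_deg - az)))
               for name, az in SPEAKER_AZIMUTHS.items()}
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    return {name: w / norm for name, w in weights.items()}

def render(bed, objects):
    """bed: dict speaker->samples; objects: list of (samples, azimuth_deg)."""
    feeds = {name: list(samples) for name, samples in bed.items()}
    for samples, azimuth in objects:
        gains = pan_gains(azimuth)
        for name in feeds:
            g = gains[name]
            feeds[name] = [f + g * s for f, s in zip(feeds[name], samples)]
    return feeds
```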

Figure 10 is a block diagram of a system embodying an embodiment of the present invention.

The object processing system (object processor) 200 of the Figure 10 system includes a metadata generation subsystem 210, a mezzanine encoder 212, and an emulation subsystem 211, coupled as shown. The metadata generation subsystem 210 is coupled to receive captured audio streams (e.g., streams representing sound captured by microphones at a viewer event, and optionally also other audio streams), and is configured to organize (e.g., group) the audio streams into a set of speaker channels and a number of object channels and/or object channel groups. Subsystem 210 is also configured to generate object-related metadata representing the object channels (and/or object channel groups). Encoder 212 is configured to encode the object channels (and/or object channel groups), the object-related metadata, and the speaker channel group as a mezzanine-format object-based audio program (e.g., an object-based audio program encoded as a Dolby E bitstream).

The emulation subsystem 211 of object processor 200 is configured to image (and play on a set of studio monitor speakers) at least a selected subset of the object channels (and/or object channel groups) and the speaker channel group (including by using the object-related metadata to generate a mix representing the selected object channels and the speaker channels), so that the played-back sound can be monitored by the operator of subsystem 200.

The transcoder 202 of the Figure 10 system includes a mezzanine decoder subsystem (mezzanine decoder) 213 and an encoder 214, coupled as shown. Mezzanine decoder 213 is coupled and configured to receive and decode the mezzanine-format object-based audio program output from object processor 200. The decoded output of decoder 213 is re-encoded by encoder 214 into a format suitable for broadcast. In one embodiment, the encoded object-based audio program output from encoder 214 is an E-AC-3 bitstream (and thus encoder 214 is labeled "DD+ encoder" in Figure 10). In other embodiments, the encoded object-based audio program output from encoder 214 is an AC-3 bitstream, or has some other format. The object-based audio program output of transcoder 202 is broadcast (or otherwise delivered) to end users.

Decoder 204 is included in the playback system of one such end user. Decoder 204 includes a decoder 215 and an imaging subsystem (imager) 216, coupled as shown. Decoder 215 accepts (receives or reads) and decodes the object-based audio program delivered from transcoder 202. If decoder 215 is configured in accordance with an exemplary embodiment of the present invention, the outputs of decoder 215 in typical operation include: streams of audio samples representing the program's speaker channel group, and streams of audio samples representing the program's object channels (e.g., user-selectable audio object channels) together with corresponding streams of object-related metadata. In one embodiment, the encoded object-based audio program input to decoder 215 is an E-AC-3 bitstream, and thus decoder 215 is labeled "DD+ decoder" in Figure 10.

The imager 216 of decoder 204 includes an object processing subsystem coupled to receive (from decoder 215) the decoded speaker channels, object channels, and object-related metadata of the delivered program. Imager 216 also includes an imaging subsystem configured to image the audio content determined by the object processing subsystem, for playback by speakers (not shown) of the playback system.

Typically, the object processing subsystem of imager 216 is configured to output to the imaging subsystem of imager 216 a selected subset of the full set of object channels represented by the program, together with the corresponding object-related metadata. The object processing subsystem of imager 216 is typically also configured to pass the decoded speaker channels from decoder 215 through unchanged (to the imaging subsystem). In accordance with an embodiment of the present invention, the object channel selection performed by the object processing subsystem is determined, for example, by user selection and/or by rules (e.g., indicating conditions and/or restrictions) that imager 216 has been programmed or otherwise configured to implement.

Each of elements 200, 202, and 204 of Figure 10 (and each of elements 104, 106, 108, and 110 of Figure 8) may be implemented as a hardware system. The inputs of a hardware implementation of processor 200 (or processor 106) would typically be multichannel audio digital interface ("MADI") inputs. Typically, processor 106 of Figure 8, and each of encoders 212 and 214 of Figure 10, includes a frame buffer. Typically, the frame buffer is a buffer memory coupled to receive an encoded input audio bitstream, and in operation the buffer memory stores (e.g., in a non-transitory manner) at least one frame of the encoded audio bitstream, and a sequence of frames of the encoded audio bitstream is asserted from the buffer memory to a downstream device or system. Also typically, each of decoders 213 and 215 of Figure 10 includes a frame buffer. Typically, this frame buffer is a buffer memory coupled to receive an encoded input audio bitstream, and in operation the buffer memory stores (e.g., in a non-transitory manner) at least one frame of the encoded audio bitstream to be decoded by decoder 213 or 215.
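
A minimal sketch of such a frame buffer follows (a simple FIFO under assumed semantics, not a production design):

```python
# Illustrative sketch: the buffer stores at least one frame of the encoded
# bitstream and asserts frames in sequence to a downstream consumer.

from collections import deque

class FrameBuffer:
    def __init__(self, capacity_frames: int = 4):
        self._frames: deque[bytes] = deque(maxlen=capacity_frames)

    def push(self, frame: bytes) -> None:
        self._frames.append(frame)       # oldest frame is dropped when full

    def pop(self) -> bytes | None:
        return self._frames.popleft() if self._frames else None
```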

Any component or element of processor 106 of Figure 8 (or of subsystems 200, 202, and/or 204 of Figure 10) may be implemented in hardware, software, or a combination of hardware and software, as one or more processes and/or one or more circuits (e.g., ASICs, FPGAs, or other integrated circuits).

An aspect of the present invention is an audio processing unit (APU), configured to carry out any of the embodiments of the method of the present invention. Examples of APUs include, but are not limited to, encoders (eg, transcoders), decoders, codecs, pre-processing systems (pre-processors), post-processing systems (post-processors), audio bit stream processing systems, And a combination of such components.

In one class of embodiments, the present invention is an APU including a buffer memory (buffer) that stores (e.g., in a non-transitory manner) at least one frame or other segment of an object-based audio program (including audio content of a set of speaker channels and of object channels, and object-related metadata) that has been generated by any embodiment of the method of the present invention. For example, the filming unit 3 of Figure 5 may include a buffer 3A that stores (e.g., in a non-transitory manner) at least one frame or other segment of the object-based audio program generated by unit 3 (including audio content of a set of speaker channels and of object channels, and object-related metadata). For another example, the decoder 7 of Figure 5 may include a buffer 7A that stores (e.g., in a non-transitory manner) at least one frame or other segment of the object-based audio program delivered from subsystem 5 to decoder 7 (including audio content of a set of speaker channels and of object channels, and object-related metadata).

Embodiments of the present invention may be implemented in hardware, firmware, or software, or a combination thereof (e.g., as a programmable logic array). For example, all or some of the elements of the Figure 6 system, the Figure 8 system, or the Figure 10 system (e.g., all or some of elements 200, 202, and 204 of Figure 10) may be implemented in suitably programmed (or otherwise configured) hardware or firmware, e.g., as a programmed general-purpose processor, digital signal processor, or microprocessor. Unless otherwise stated, the algorithms or processes included as part of the invention are not inherently related to any particular computer or other apparatus. In particular, various general-purpose machines may be used with programs written in accordance with the teachings herein, or it may be more convenient to construct more specialized apparatus (e.g., integrated circuits) to perform the required method steps. Thus, the invention may be implemented in one or more computer programs executing on one or more programmable computer systems (e.g., an implementation of all or some of elements 20, 22, 24, 25, 26, 29, 31, and 35 of Figure 6), each comprising at least one processor, at least one data storage system (including volatile and non-volatile memory and/or storage elements), at least one input device or port, and at least one output device or port. Program code is applied to input data to perform the functions described herein and to generate output information. The output information is applied to one or more output devices in known fashion.

Each such program may be implemented in any desired computer language (including machine, assembly, or high-level procedural, logical, or object-oriented programming languages) to communicate with a computer system. In any case, the language may be a compiled or interpreted language.

For example, when implemented as sequences of computer software instructions, the various functions and steps of embodiments of the present invention may be implemented by sequences of multithreaded software instructions running on suitable digital signal processing hardware, in which case the various devices, steps, and functions of the embodiments may correspond to portions of the software instructions.

Each such computer program is preferably stored on, or downloaded to, a storage medium or device (e.g., solid-state memory or media, or magnetic or optical media) readable by a general-purpose or special-purpose programmable computer, for configuring and operating the computer when the storage medium or device is read by the computer system to perform the procedures described herein. The system of the present invention may also be implemented as a computer-readable storage medium configured with (i.e., storing) a computer program, where the storage medium so configured causes a computer system to operate in a specific and predefined manner to perform the functions described herein.

Some embodiments of the invention have been described. It will be appreciated that various modifications may be made without departing from the spirit and scope of the invention. Many modifications and variations of the present invention are possible in light of the above teaching. It will be appreciated that within the scope of the appended claims, the invention may be practiced otherwise than as specifically described herein.

1‧‧‧Capture unit

3‧‧‧ Filming unit

3A‧‧‧buffer

5‧‧‧Transfer subsystem

7‧‧‧Decoder

7A‧‧‧buffer

9‧‧‧ Object Processing Subsystem

10‧‧‧ Controller

11‧‧‧ imaging subsystem

Claims (93)

  1. A method for generating an object-based audio program representing at least one set of speaker channels and a set of object channels, wherein the set of speaker channels represents audio content of a first subset of a set of audio signals of captured audio content, the object channels represent audio content of a second subset of the set of audio signals, each of the object channels represents sound emitted by at least one audio object, and each of the speaker channels is associated with a designated speaker or a designated speaker zone within a defined speaker configuration, the method including the steps of: generating, in a filming system, object-related metadata indicative of at least one feature or characteristic of the object channels or of the imaging of the object channels; and generating, in the filming system, the object-based audio program such that the object-based audio program represents each of the set of speaker channels, the object channels, and the object-related metadata, and is imageable to provide sound perceivable as a mix of first audio content represented by the set of speaker channels and second audio content represented by a selected subset of the object channels, such that the second audio content is perceived as emanating from sound source locations determined by the selected subset of the object channels, wherein the object-related metadata is or includes restrictions or conditions on at least two selectable mixes for imaging, each of the mixes representing a different mix of the first audio content and the audio content of at least one of the object channels.
  2. The method of claim 1, wherein the object-related metadata is or includes metadata indicating a preset subset of the object channel group to be imaged in the absence of selection by an end user of a subset of the object channel group.
  3. The method of claim 1, wherein the object-related metadata is or includes metadata indicating which object channels of the object channel group satisfy at least one object selection rule.
  4. The method of claim 1, wherein the step of generating the object-based audio program is performed such that the object-based audio program includes an encoded audio bitstream and at least one side mix, the encoded audio bitstream representing at least audio content of the set of speaker channels and audio content of a first subset of the object channels and/or the object-related metadata, and the at least one side mix representing audio content of a second subset of the object channels and/or the object-related metadata.
  5. The method of claim 1, wherein the first audio content represents sound at a viewer event, and the audio content represented by at least one of the object channels of the selected subset of the object channels represents at least one of crowd noise at the viewer event or commentary on the viewer event.
  6. The method of claim 1, wherein the first audio content represents sound at a sporting event, and the audio content represented by one of the object channels of the selected subset of the object channels represents home-crowd noise or away-crowd noise at the sporting event.
  7. The method of claim 1, wherein the first audio content represents sound at a viewer event, and the audio content represented by one of the object channels of the selected subset of the object channels represents commentary on the viewer event.
  8. The method of claim 1, wherein the object-related metadata of the object-based audio program includes persistent metadata and non-persistent metadata, and wherein the method also includes the step of using a delivery system to deliver the object-based audio program to at least one playback system, wherein at least a portion of the non-persistent metadata is modified in the filming system, the delivery system, or the playback system during or before the step of delivering the object-based program, but the persistent metadata is preserved during and before the step of delivering the object-based program.
  9. The method of claim 8, wherein at least a portion of the persistent metadata represents synchronization words, the synchronization words indicating the timing of at least some audio content of the program relative to the timing of at least one other element of the program, wherein the object-based audio program is generated such that it includes an encoded audio bitstream and at least one side mix, the encoded audio bitstream representing at least audio content of the set of speaker channels and audio content of a first subset of the object channels and/or the object-related metadata, and the at least one side mix representing audio content of a second subset of the object channels and/or the object-related metadata, wherein a first subset of the synchronization words is included in the encoded audio bitstream, and a second subset of the synchronization words is included in the at least one side mix.
  10. The method of claim 1, wherein the object-based audio program includes an encoded bitstream comprising frames, the encoded bitstream being an AC-3 bitstream or an E-AC-3 bitstream, each of the frames of the encoded bitstream representing at least one data structure that is a container including some content of the object channels and some of the object-related metadata, and at least one such container is included in an auxdata field of each of the frames.
  11. The method of claim 1, wherein the object-based audio program includes an encoded bitstream comprising frames, the encoded bitstream being an AC-3 bitstream or an E-AC-3 bitstream, each of the frames of the encoded bitstream representing at least one data structure that is a container including some content of the object channels and some of the object-related metadata, and at least one such container is included in an addbsi field of each of the frames.
  12. The method of claim 1, wherein the object-based audio program is a Dolby E bitstream comprising a sequence of bursts and guard bands between the bursts, and each of the first X segments of each of at least some of the guard bands includes some content of the object channels and some of the object-related metadata, where X is a number.
  13. The method of claim 1, wherein at least some of the object-related metadata represents a layered mix graph indicating selectable mixes of the speaker channels and the object channels, and the layered mix graph includes a base layer of the object-related metadata and at least one extension layer of the object-related metadata.
  14. The method of claim 1, wherein at least some of the object-related metadata represents a mix graph indicating selectable mixes of the speaker channels and the object channels, the object-based audio program includes an encoded bitstream comprising frames, and each of the frames of the encoded bitstream includes object-related metadata representing the mix graph.
  15. The method of claim 1, including the step of determining at least two sets of speaker channels representing a subset of the set of audio signals, wherein at least some of the object-related metadata represents a mix graph indicating selectable mixes of the speaker channels and the object channels, and the mix graph includes at least one group mix node indicating a predetermined mix of the sets of speaker channels.
   16. The method of claim 1, wherein the object-based audio program comprises substreams, the substreams are bitstreams to be delivered in parallel, and at least some of the object-related metadata is substream metadata indicating at least one of the substream structure of the program or the manner in which the substreams should be decoded.
   17. A method for rendering audio content determined by an object-based audio program, wherein the program is indicative of at least one set of speaker channels, a set of object channels, and object-related metadata, wherein the method is performed by a playback system, each of the object channels is indicative of sound emitted by at least one audio object, and each of the speaker channels is associated with a designated speaker or a designated speaker zone within a defined speaker configuration, the method including the steps of: (a) determining a selected subset of the set of object channels; and (b) rendering audio content determined by the object-based audio program, including by determining a mix of first audio content indicated by the set of speaker channels and second audio content indicated by the selected subset of the set of object channels, wherein step (a) includes the step of providing a menu of selectable subsets of the object channels, or of selectable mixes of audio content, each of the selectable mixes indicating a different mix of audio content of the set of speaker channels and audio content of a subset of the set of object channels.
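The core of claim 17's step (b) is a bed-plus-objects mix. A minimal sketch using NumPy, where the per-object gain vectors stand in for whatever rendering parameters the object-related metadata or user selection actually supplies:

    # Minimal sketch of claim 17, steps (a)/(b): mix the speaker-channel bed
    # with a user-selected subset of object channels. Names and the per-object
    # gain model are illustrative assumptions.
    import numpy as np

    def render_mix(bed: np.ndarray,                 # shape (n_speakers, n_samples)
                   objects: dict[str, np.ndarray],  # name -> (n_samples,) mono
                   selected: list[str],
                   object_gains: dict[str, np.ndarray]) -> np.ndarray:
        """Return speaker feeds: the bed plus each selected object panned to
        the speakers by its gain vector of shape (n_speakers,)."""
        out = bed.copy()
        for name in selected:
            out += np.outer(object_gains[name], objects[name])
        return out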
   18. The method of claim 17, wherein step (b) includes the step of rendering the audio content conditioned by a user-specified gain for at least one object indicated by the selected subset of the set of object channels and a user-specified position, within the rendering environment, of at least one object indicated by the selected subset of the set of object channels.
   19. The method of claim 17, wherein the playback system includes a set of speakers, and wherein step (b) includes the step of generating, in response to the mix of the first audio content and the second audio content, speaker feeds for driving the set of speakers to emit sound, wherein the sound includes object channel sound indicative of the second audio content, and the object channel sound is perceivable as emitting from an apparent source position determined by the selected subset of the set of object channels.
   20. The method of claim 17, wherein step (a) includes the steps of: providing a menu of selectable subsets of the object channels; and determining the selected subset of the set of object channels by selecting one of the subsets of the object channels indicated by the menu.
   21. The method of claim 20, wherein the menu is presented by a user interface of a controller, the controller being coupled to a set-top device which is coupled to receive the object-based audio program and configured to perform step (b).
   22. The method of claim 17, wherein the first audio content is indicative of sound at a spectator event, and audio content indicated by at least one of the object channels of the selected subset of the set of object channels is indicative of at least one of crowd noise at the spectator event or commentary on the spectator event.
   23. The method of claim 17, wherein the first audio content is indicative of sound at a sporting event, and audio content indicated by one of the object channels of the selected subset of the set of object channels is indicative of home team crowd noise or away team crowd noise at the sporting event.
   24. The method of claim 17, wherein the first audio content is indicative of sound at a spectator event, and audio content indicated by one of the object channels of the selected subset of the set of object channels is indicative of commentary on the spectator event.
   25. The method of claim 17, wherein the object-based audio program includes an encoded audio bitstream and at least one side mix, the encoded audio bitstream being indicative of audio content of the set of speaker channels and of a first subset of the object channels and/or object-related metadata, and the at least one side mix being indicative of audio content of a second subset of the object channels and/or object-related metadata.
   26. The method of claim 17, wherein the object-related metadata is or includes metadata indicative of an identification of at least some of the object channels.
   27. The method of claim 17, wherein the object-related metadata is or includes metadata indicative of a default subset of the set of object channels to be rendered in the absence of selection, by an end user, of a subset of the set of object channels.
   28. The method of claim 17, wherein at least some of the object-related metadata are synchronization words indicating the timing of at least some audio content of the program relative to the timing of at least one other element of the program, the object-based audio program includes an encoded audio bitstream and at least one side mix, the encoded audio bitstream being indicative of audio content of at least the set of speaker channels and of a first subset of the object channels and/or object-related metadata, and the at least one side mix being indicative of audio content of a second subset of the object channels and/or object-related metadata, wherein a first subset of the synchronization words is included in the encoded audio bitstream and a second subset of the synchronization words is included in the at least one side mix, and wherein the method includes the step of using at least some of the synchronization words to time-align the encoded audio bitstream with the at least one side mix.
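Claim 28's final step can be sketched as follows, assuming the synchronization words have been extracted per frame from each stream as comparable tokens; the representation is illustrative:

    # Minimal sketch of claim 28: use synchronization words carried in both
    # the encoded main bitstream and a side mix to time-align the two.
    def align_offset(main_syncs: list[int], side_syncs: list[int]) -> int:
        """Return the frame offset to apply to the side mix so a shared sync
        word lines up with the same sync word in the main bitstream."""
        shared = set(main_syncs) & set(side_syncs)
        if not shared:
            raise ValueError("no common synchronization word")
        word = next(iter(shared))
        return main_syncs.index(word) - side_syncs.index(word)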
   29. The method of claim 17, wherein step (a) includes the steps of: providing a menu of selectable mixes of audio content, each of the selectable mixes indicating a different mix of audio content of the set of speaker channels and audio content of a subset of the set of object channels, wherein the object-related metadata is or includes metadata indicative of at least one constraint or condition determining which of the selectable mixes are included in the menu; and selecting one of the selectable mixes from the menu, thereby determining the selected subset of the set of object channels.
   30. The method of claim 29, wherein the object-related metadata is or includes metadata indicative of an identification of, and a relationship between, each of the object channels, and determines the at least one constraint or condition on which of the selectable mixes are included in the menu.
   31. The method of claim 17, wherein step (a) includes the steps of: providing a menu of selectable mixes of audio content, each of the selectable mixes indicating a different mix of audio content of the set of speaker channels and audio content of a subset of the set of object channels, wherein the playback system is preconfigured to determine at least one constraint or condition on which of the selectable mixes are included in the menu; and selecting one of the selectable mixes from the menu, thereby determining the selected subset of the set of object channels.
   32. The method of claim 17, also including the step of: (c) before performing step (a), determining at least one rule for object channel selection, and wherein step (a) includes the step of determining the selected subset of the set of object channels in accordance with the at least one rule.
   33. The method of claim 32, wherein step (c) includes the steps of: providing a menu of selectable rules for object channel selection; and selecting one of the selectable rules from the menu, thereby determining the at least one rule.
   34. The method of claim 32, wherein the object-related metadata is or includes metadata indicating which of the object channels of the set of object channels satisfy the at least one rule, and wherein step (a) includes the step of determining the selected subset of the set of object channels in response to the metadata indicating which of the object channels of the set of object channels satisfy the at least one rule.
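Claims 32-34 describe rule-driven selection. A sketch assuming metadata that lists, per object channel, the rules it satisfies (the key name is hypothetical):

    # Minimal sketch of claims 32-34: select object channels using metadata
    # that marks which channels satisfy a given selection rule.
    def select_by_rule(object_metadata: dict[str, dict], rule: str) -> list[str]:
        """Return the object channels whose metadata lists `rule` as
        satisfied, e.g. rule = "default_commentary" or "home_team_feed"."""
        return [name for name, md in object_metadata.items()
                if rule in md.get("satisfies_rules", [])]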
   35. The method of claim 17, wherein the object-based audio program includes a set of bitstreams, wherein steps (a) and (b) are performed by the playback system, and wherein the method includes the step of: (c) delivering the bitstreams of the object-based audio program to the playback system.
   36. The method of claim 35, wherein one of the bitstreams is indicative of a first subset of the set of object channels, and another one of the bitstreams is indicative of a second subset of the set of object channels.
   37. The method of claim 17, wherein the object-based audio program includes a set of bitstreams, wherein steps (a) and (b) are performed by the playback system, and wherein the method includes the step of: (c) before performing step (a), receiving the bitstreams of the object-based audio program in parallel at the playback system.
   38. The method of claim 37, wherein one of the bitstreams is indicative of a first subset of the set of object channels, and another one of the bitstreams is indicative of a second subset of the set of object channels.
   39. The method of claim 17, wherein the playback system includes a first subsystem and a second subsystem coupled downstream of the first subsystem, step (a) is performed in the first subsystem of the playback system, and step (b) is performed at least in part in the second subsystem of the playback system.
   40. The method of claim 39, wherein step (b) includes the steps of: determining, in the second subsystem of the playback system, the mix of the first audio content and the second audio content; and generating, in the second subsystem of the playback system, speaker feeds for driving a set of speakers of the playback system in response to the mix of the first audio content and the second audio content.
   41. The method of claim 39, wherein the first subsystem of the playback system is implemented in a set-top device, and the second subsystem of the playback system is implemented in a downstream device coupled to the set-top device.
   42. The method of claim 17, wherein the object-based audio program comprises an encoded bitstream comprising frames, the encoded bitstream being an AC-3 bitstream or an E-AC-3 bitstream, each of the frames of the encoded bitstream including at least one data structure which is a container including some content of the object channels and some of the object-related metadata, and at least one such container is included in an auxdata field of each of the frames.
   43. The method of claim 17, wherein the object-based audio program comprises an encoded bitstream comprising frames, the encoded bitstream being an AC-3 bitstream or an E-AC-3 bitstream, each of the frames of the encoded bitstream including at least one data structure which is a container including some content of the object channels and some of the object-related metadata, and at least one such container is included in an addbsi field of each of the frames.
   44. The method of claim 17, wherein the object-based audio program is a Dolby E bitstream comprising a sequence of bursts and guard bands between the bursts, each of the guard bands consists of a sequence of segments, and each of the first X segments of each of at least some of the guard bands includes some content of the object channels and some of the object-related metadata, where X is a number.
   45. A system for rendering audio content determined by an object-based audio program, wherein the program is indicative of at least one set of speaker channels, a set of object channels, and object-related metadata, the system including: a first subsystem coupled to receive the object-based audio program and configured to parse the speaker channels, the object channels, and the object-related metadata, and to determine a selected subset of the object channels; a rendering subsystem coupled to the first subsystem and configured to render audio content determined by the object-based audio program, including by determining a mix of first audio content indicated by the set of speaker channels and second audio content indicated by the selected subset of the object channels; and a controller coupled to the first subsystem, wherein the controller is configured to provide a menu of selectable subsets of the object channels, or of selectable mixes of audio content, each of the selectable mixes indicating a different mix of audio content of the set of speaker channels and audio content of a subset of the set of object channels.
   46. The system of claim 45, wherein the system includes or is configured to be coupled to a set of speakers, and the rendering subsystem is configured to generate speaker feeds in response to the mix of the first audio content and the second audio content, such that, when driven by the speaker feeds, the set of speakers emits sound including object channel sound indicative of the second audio content, the object channel sound being perceivable as emitting from an apparent source position determined by the selected subset of the object channels.
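Claim 46's "apparent source position" behavior is what a panner provides. An equal-power stereo pan is a stand-in sketch here; a real rendering subsystem would use whatever panner suits its speaker configuration:

    # Illustrative sketch of claim 46: pan an object so it is perceived at an
    # apparent source position between two speakers.
    import numpy as np

    def equal_power_pan(mono: np.ndarray, position: float) -> np.ndarray:
        """position in [-1, 1] (left..right) -> (2, n_samples) speaker feeds."""
        theta = (position + 1.0) * np.pi / 4.0   # map position to [0, pi/2]
        return np.vstack([np.cos(theta) * mono,  # left feed
                          np.sin(theta) * mono]) # right feed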
   47. The system of claim 45, also including a controller coupled to the first subsystem, wherein the controller is configured to provide a menu of selectable subsets of the object channels, and to determine the selected subset of the object channels in response to user selection of one of the subsets of the object channels indicated by the menu.
  48. The system of claim 47, wherein the controller is configured to implement a user interface for displaying the menu.
   49. The system of claim 47, wherein the first subsystem is implemented in a set-top device and the controller is coupled to the set-top device.
   50. The system of claim 45, also including a controller coupled to the first subsystem, wherein the controller is configured to provide a menu of selectable mixes of audio content, each of the selectable mixes indicating a different mix of audio content of the set of speaker channels and audio content of a subset of the object channels, wherein the object-related metadata is or includes metadata indicative of at least one constraint or condition determining which of the selectable mixes are included in the menu, and wherein the controller is configured to determine the selected subset of the object channels in response to user selection of one of the selectable mixes from the menu.
  51. The system of claim 45, wherein the controller is configured to implement a user interface for displaying the menu.
   52. The system of claim 45, wherein the first audio content is indicative of sound at a spectator event, and audio content indicated by at least one of the object channels of the selected subset of the object channels is indicative of at least one of crowd noise at the spectator event or commentary on the spectator event.
   53. The system of claim 45, wherein the first audio content is indicative of sound at a sporting event, and audio content indicated by one of the object channels of the selected subset of the object channels is indicative of home team crowd noise or away team crowd noise at the sporting event.
   54. The system of claim 45, wherein the first audio content is indicative of sound at a spectator event, and audio content indicated by one of the object channels of the selected subset of the object channels is indicative of commentary on the spectator event.
   55. The system of claim 45, wherein the object-based audio program includes an encoded audio bitstream and at least one side mix, the encoded audio bitstream being indicative of audio content of the set of speaker channels and of a first subset of the object channels and/or object-related metadata, and the at least one side mix being indicative of audio content of a second subset of the object channels and/or object-related metadata.
   56. The system of claim 45, wherein the object-related metadata is or includes metadata indicative of a default subset of the object channels to be rendered in the absence of selection, by an end user, of a subset of the object channels.
   57. The system of claim 45, wherein the object-related metadata is or includes synchronization words indicating the timing of at least some audio content of the program relative to the timing of at least one other element of the program, the object-based audio program includes an encoded audio bitstream and at least one side mix, the encoded audio bitstream being indicative of audio content of the set of speaker channels and of a first subset of the object channels and/or object-related metadata, and the at least one side mix being indicative of audio content of a second subset of the object channels and/or object-related metadata, wherein a first subset of the synchronization words is included in the encoded audio bitstream and a second subset of the synchronization words is included in the at least one side mix, and wherein the first subsystem is configured to use at least some of the synchronization words to time-align the encoded audio bitstream with the at least one side mix.
   58. The system of claim 45, also including a controller coupled to the first subsystem, wherein the controller is configured to provide a menu of selectable mixes of audio content, each of the selectable mixes indicating a different mix of audio content of the set of speaker channels and audio content of a subset of the object channels, wherein the first subsystem is preconfigured and/or the controller determines at least one constraint or condition on which of the selectable mixes are included in the menu, and wherein the controller is configured to determine the selected subset of the object channels in response to user selection of one of the selectable mixes from the menu.
   59. The system of claim 58, wherein the controller is configured to implement a user interface for displaying the menu.
   60. The system of claim 45, also including a controller coupled to the first subsystem, wherein the controller is configured to provide a menu of selectable rules for object channel selection, and wherein the controller is configured to configure the first subsystem to apply at least one rule for object channel selection in response to user selection of one of the selectable rules from the menu.
   61. The system of claim 60, wherein the object-related metadata is or includes metadata indicating which of the object channels of the set of object channels satisfy the at least one rule.
   62. The system of claim 45, wherein the rendering subsystem is implemented in a playback system including a first rendering subsystem and a second rendering subsystem coupled downstream of the first rendering subsystem.
   63. The system of claim 62, wherein the second rendering subsystem is configured to determine the mix of the first audio content and the second audio content, and to generate speaker feeds for driving a set of speakers in response to the mix of the first audio content and the second audio content.
   64. The system of claim 63, wherein the first rendering subsystem is implemented in a set-top device, and the second rendering subsystem is implemented in a downstream device coupled to the set-top device.
   65. The system of claim 45, wherein the object-based audio program comprises an encoded bitstream comprising frames, the encoded bitstream being an AC-3 bitstream or an E-AC-3 bitstream, each of the frames of the encoded bitstream including at least one data structure which is a container including some content of the object channels and some of the object-related metadata, and at least one such container is included in an auxdata field or an addbsi field of each of the frames.
   66. The system of claim 45, wherein the object-based audio program is a Dolby E bitstream comprising a sequence of bursts and guard bands between the bursts, each of the guard bands consists of a sequence of segments, and each of the first X segments of each of at least some of the guard bands includes some content of the object channels and some of the object-related metadata, where X is a number.
   67. The system of claim 45, wherein at least some of the object-related metadata indicates a layered mix graph, the layered mix graph indicating selectable mixes of the speaker channels and the object channels, and the layered mix graph includes a base layer of object-related metadata and at least one extension layer of object-related metadata.
   68. The system of claim 45, wherein at least some of the object-related metadata indicates a mix graph, the mix graph indicating selectable mixes of the speaker channels and the object channels, the object-based audio program comprises an encoded bitstream comprising frames, and each of the frames of the encoded bitstream includes object-related metadata indicating the mix graph.
   69. The system of claim 45, wherein the program is indicative of at least two sets of speaker channels, at least some of the object-related metadata indicates a mix graph indicating selectable mixes of the speaker channels and the object channels, and the mix graph includes at least one mix node indicating a predetermined mix of the sets of speaker channels.
   70. The system of claim 45, wherein the object-based audio program comprises substreams, the substreams are bitstreams to be delivered in parallel, and at least some of the object-related metadata is substream metadata indicating at least one of the substream structure of the program or the manner in which the substreams should be decoded.
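Claim 70's substream metadata can drive how a playback system plans decoding of the parallel bitstreams; the metadata fields assumed here ("role", "decode_order") are illustrative, not part of the claim:

    # Minimal sketch of claim 70: order parallel substreams for decoding
    # based on substream metadata.
    def decode_plan(substream_metadata: dict[str, dict]) -> list[str]:
        """Order substream IDs for decoding, e.g. the bed substream before
        the object-channel substreams that reference it."""
        return sorted(substream_metadata,
                      key=lambda sid: substream_metadata[sid]["decode_order"])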
   71. A system for generating an object-based audio program, including: a first subsystem configured to determine at least one set of speaker channels indicative of audio content of a first subset of a set of audio signals indicative of captured audio content, to determine a set of object channels indicative of audio content of a second subset of the set of audio signals, and to generate object-related metadata indicative of the object channels; and an encoding subsystem coupled to the first subsystem and configured to generate the object-based audio program such that the object-based audio program is indicative of the set of speaker channels, the object channels, and the object-related metadata, and is renderable to provide sound indicative of a mix of first audio content indicated by the set of speaker channels and second audio content indicated by a selected subset of the object channels, such that the second audio content is perceivable as emitting from apparent source positions determined by the selected subset of the object channels, wherein the object-related metadata is or includes metadata indicative of at least one constraint or condition on rendering of at least two selectable mixes, each of the selectable mixes indicating a different mix of the first audio content and audio content of at least one of the object channels.
   72. The system of claim 71, wherein the object-related metadata is or includes metadata indicative of a default subset of the set of object channels to be rendered in the absence of selection, by an end user, of a subset of the set of object channels.
   73. The system of claim 71, wherein the object-related metadata is or includes metadata indicating which object channels of the set of object channels satisfy at least one object selection rule.
   74. The system of claim 71, wherein the encoding subsystem is configured to generate the object-based audio program such that the object-based audio program includes an encoded audio bitstream and at least one side mix, the encoded audio bitstream being indicative of audio content of at least the set of speaker channels and of a first subset of the object channels and/or object-related metadata, and the at least one side mix being indicative of audio content of a second subset of the object channels and/or object-related metadata.
   75. The system of claim 71, wherein the first audio content is indicative of sound at a spectator event, and audio content indicated by at least one of the object channels of the selected subset of the object channels is indicative of at least one of crowd noise at the spectator event or commentary on the spectator event.
   76. The system of claim 71, wherein the first audio content is indicative of sound at a sporting event, and audio content indicated by one of the object channels of the selected subset of the object channels is indicative of home team crowd noise or away team crowd noise at the sporting event.
   77. The system of claim 71, wherein the first audio content is indicative of sound at a spectator event, and audio content indicated by one of the object channels of the selected subset of the object channels is indicative of commentary on the spectator event.
   78. The system of claim 71, wherein the object-related metadata of the object-based audio program includes durable metadata.
   79. The system of claim 71, wherein the object-based audio program comprises an encoded bitstream comprising frames, the encoded bitstream being an AC-3 bitstream or an E-AC-3 bitstream, each of the frames of the encoded bitstream including at least one data structure which is a container including some content of the object channels and some of the object-related metadata, and at least one such container is included in an auxdata field or an addbsi field of each of the frames.
   80. The system of claim 71, wherein the object-based audio program is a Dolby E bitstream comprising a sequence of bursts and guard bands between the bursts, each of the guard bands consists of a sequence of segments, and each of the first X segments of each of at least some of the guard bands includes some content of the object channels and some of the object-related metadata, where X is a number.
   81. An audio processing unit, including: a buffer memory; and at least one audio processing subsystem coupled to the buffer memory, wherein the buffer memory stores at least one segment of an object-based audio program, wherein the program is indicative of at least one set of speaker channels, a set of object channels, and object-related metadata, and is renderable to provide sound indicative of a mix of first audio content indicated by the set of speaker channels and second audio content indicated by a selected subset of the object channels, such that the second audio content is perceivable as emitting from apparent source positions determined by the selected subset of the object channels, wherein the object-related metadata is or includes metadata indicative of at least one constraint or condition on rendering of at least two selectable mixes, each of the selectable mixes indicating a different mix of the first audio content and audio content of at least one of the object channels, and each said segment includes data indicative of audio content of at least one of the set of speaker channels, data indicative of audio content of at least one of the object channels, and at least some of the object-related metadata.
   82. The unit of claim 81, wherein the object-based audio program comprises an encoded bitstream comprising frames, and each said segment is one of the frames.
   83. The unit of claim 82, wherein the encoded bitstream is an AC-3 bitstream or an E-AC-3 bitstream, each of the frames including at least one data structure which is a container including some content of at least one of the object channels and some of the object-related metadata, and at least one such container is included in an auxdata field or an addbsi field of each of the frames.
   84. The unit of claim 81, wherein the object-based audio program is a Dolby E bitstream comprising a sequence of bursts and guard bands between the bursts, each of the guard bands consists of a sequence of segments, and each of the first X segments of each of at least some of the guard bands includes some content of the object channels and some of the object-related metadata, where X is a number.
  85. The unit of claim 81, wherein the buffer memory stores the segment in a non-transitory manner.
  86. The unit of claim 81, wherein the audio processing subsystem is an encoder.
   87. The unit of claim 81, wherein the audio processing subsystem is configured to parse the speaker channels, the object channels, and the object-related metadata, and to determine a selected subset of the object channels.
   88. The unit of claim 81, wherein the audio processing subsystem is configured to render audio content determined by the object-based audio program, including by determining a mix of the first audio content indicated by the set of speaker channels and the second audio content indicated by the selected subset of the object channels.
  89. The unit of claim 81, wherein the audio processing unit is a digital signal processor.
   90. The unit of claim 81, wherein at least some of the object-related metadata indicates a layered mix graph, the layered mix graph indicating selectable mixes of the speaker channels and the object channels, and the layered mix graph includes a base layer of object-related metadata and at least one extension layer of object-related metadata.
   91. The unit of claim 81, wherein at least some of the object-related metadata indicates a mix graph indicating selectable mixes of the speaker channels and the object channels, and each said segment includes object-related metadata indicating the mix graph.
   92. The unit of claim 81, wherein the program is indicative of at least two sets of speaker channels, at least some of the object-related metadata indicates a mix graph indicating selectable mixes of the speaker channels and the object channels, and the mix graph includes at least one mix node indicating a predetermined mix of the sets of speaker channels.
   93. The unit of claim 81, wherein the object-based audio program comprises substreams, the substreams are bitstreams to be delivered in parallel, and at least some of the object-related metadata is substream metadata indicating at least one of the substream structure of the program or the manner in which the substreams should be decoded.