EP3843428A1 - Inter-channel audio feature measurement and display on a graphical user interface - Google Patents

Inter-channel audio feature measurement and display on a graphical user interface Download PDF

Info

Publication number
EP3843428A1
Authority
EP
European Patent Office
Prior art keywords
audio
channel
features
blocks
user interface
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP20214889.6A
Other languages
German (de)
English (en)
Inventor
Christopher Ryan Latina
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dolby Laboratories Licensing Corp
Original Assignee
Dolby Laboratories Licensing Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dolby Laboratories Licensing Corp filed Critical Dolby Laboratories Licensing Corp
Publication of EP3843428A1 publication Critical patent/EP3843428A1/fr
Pending legal-status Critical Current

Links

Images

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S7/00Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/40Visual indication of stereophonic sound image
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2400/00Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/13Aspects of volume control, not necessarily automatic, in stereophonic sound systems

Definitions

  • Embodiments of the present invention pertain generally to processing audio signals and pertain more specifically to inter-channel audio feature measurement and usages.
  • Audio processors may be used in end-to-end audio processing chains to deliver audio content to end user devices. Different audio processors may perform different or similar media processing operations to generate output audio content for rendering or reproduction with a variety of audio speaker configurations. The same input media data as received by the end-to-end audio processing chains may undergo different or similar audio sample data manipulations, conversions, and modifications to produce different quality levels in audio rendering or reproduction.
  • Some of these audio processing operations to varying extents may be prone to introducing artifacts, unintended results, delays, latency, channel mapping issues, dropouts, transmission errors, coding/quantization errors, or the like.
  • Example embodiments, which relate to inter-channel audio feature measurement and usages, are described herein.
  • Numerous specific details are set forth in order to provide a thorough understanding of example embodiments of the present invention. It will be apparent, however, that embodiments of the present invention may be practiced without these specific details. In other instances, well-known structures and devices are not described in exhaustive detail, in order to avoid unnecessarily occluding, obscuring, or obfuscating example embodiments of the present invention.
  • Audio signals as described herein may include, for example, audiovisual signals, media signals, etc.
  • Audio processors/codecs used to decode audio signals as described may include, but are not necessarily limited to only, any of: Dolby Digital Plus Joint Object Coding or DD+ JOC codecs, Dolby Atmos codecs, codecs processing stereo, 5.1 or 5.1.2 audio, and so on.
  • Audio features are determined/generated using decoded audio data from the audio signals.
  • Audio feature matrices with matrix elements representing/storing the audio features may be generated directly from the audio features, without needing to receive audio object metadata.
  • the audio feature matrices may be presented to users (e.g., audio engineers, audio professionals, audio authoring users, etc.).
  • User interface components used to present audio feature information in the audio feature matrices may be color coded. For example, different colors may be assigned based on different values or properties of audio features such as audio channels of audio data used to compute the audio features. These user interface components may also be graphically annotated. For example, the user interface components may comprise visually displayed arrows, audio speaker position indications, positive or negative correlation indications, decorrelation indications, etc.
  • Example authoring and analysis tools/systems implementing or performing audio feature extraction and usages as described herein may include, but are not necessarily limited to only, any of: digital audio workstations, Avid Pro Tools, Logic Pro X tools, Ableton Live tools, Steinberg Cubase tools, game design tools, virtual reality platforms, Unity and Unreal engines, cloud-based or web-based media processors such as Hybrik and/or Amazon Elemental, audio analysis or adaptive effects tools/systems incorporating these techniques in software, hardware and/or a combination of software and hardware, and so forth.
  • Audio features as determined/generated from the decoded audio data may be grouped into subsets.
  • Statistics such as average, value distribution, feature of features, etc., may be computed from the audio features or the subsets thereof and can be used to provide or convey overall measurements, indications and/or representations of audibly perceptible features/characteristics of the audio data in the audio signals.
  • audibly perceptible features/characteristics of audio data as described herein may include, but are not necessarily limited to only, any of: “envelopment” (e.g., visualizations, measurements, and/or representations of specific channels and directions around which sounds are moving, synchronized or unsynchronized attack times in various audio channels, etc.), “immersion”, “spatial velocity” (e.g., visualizations, measurements, and/or representations of spatial velocity of an audio object or a depicted sound source, etc.), “cohesion” (e.g., visualizations, measurements, and/or representations of correlation or decorrelation of audio content in various signals, etc.), etc.
  • Immersion or "immersive-ness" refers to the (e.g., visualized, measured and/or represented, etc.) presence of audio content or signal energy in surround channels in relation to or in comparison with the presence of audio content or signal energy in non-surround channels.
  • A level of immersion or immersive-ness may be a numeric value or a classification computed, determined and/or measured based on surround signal energy of a multi-channel audio signal in comparison with non-surround signal energy of the multi-channel audio signal.
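  • As a minimal illustrative sketch (not the patent's prescribed method), the following Python snippet estimates such a level as a ratio of surround/height signal energy to non-surround signal energy for one time block; the 5.1.2 channel ordering, the exclusion of the LFE channel and the ratio formula are assumptions made for illustration only.

```python
import numpy as np

# Assumed 5.1.2 channel order: L, R, C, LFE, Ls, Rs, Lh, Rh
SURROUND = [4, 5, 6, 7]       # surround + height channels
NON_SURROUND = [0, 1, 2]      # front channels (LFE excluded here by assumption)

def immersion_level(block, eps=1e-12):
    """Ratio of surround signal energy to non-surround signal energy for one
    time block of shape (num_channels, num_samples)."""
    energy = np.sum(block.astype(np.float64) ** 2, axis=1)   # per-channel energy
    return energy[SURROUND].sum() / (energy[NON_SURROUND].sum() + eps)

# Example: a 512-sample block of 8-channel audio
block = np.random.randn(8, 512)
print(f"immersion level ~ {immersion_level(block):.3f}")
```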
  • audibly perceptible features or characteristics can be visually perceived by users via visualization of audio features (e.g., correlation coefficients, spectral fluxes, inter-channel correlation coefficients, inter-channel spectral fluxes, etc.) in authoring tools/systems for the users to adaptively modulate spatial fields (e.g., to-be-rendered audio scenes, to-be-rendered sound fields, etc.) of audio signals to achieve respective target levels of immersion, to achieve target levels of animation, etc.
  • Animation may refer to a change of spatial position of an audio object (e.g., a character, a sound source, etc.).
  • A level of animation may be (e.g., numerically, discretely, etc.) characterized or determined, for example, based on a rate of spatial position change, a rate of change of that rate, etc.
  • a spatial sound field depicted in a multi-channel audio signal may be modulated through activating different speakers or different numbers of speakers in different zones of a rendering/listening environment, applying different dynamic range control (DRC) operations to different audio channels/speakers/objects, suppressing or enhancing loudness levels of different audio channels/speakers/objects, modulating spatial metadata parameters, and so forth.
  • A multi-channel audio signal may refer to an overall audio signal comprising a plurality of component audio signals respectively corresponding to a plurality of audio channels/speakers in a rendering/reproduction environment for reproducing/rendering the audio/sound represented in the plurality of component audio signals.
  • the multi-channel audio signal or a modified signal generated therefrom may be used to provide PCM data for driving transducers in audio speakers to produce spatial pressure waves of a spatial sound field represented in the multi-channel audio signal.
  • these features and characteristics can be used in real-time broadcasts and production studios to support creative (authoring) and validation operations.
  • Audio features extracted from query content of a query audio signal (e.g., a mono audio signal, a stereo audio signal, a multi-channel audio signal, etc.) can be compared with reference or curated audio features already extracted from audio contents stored in a content/template database to identify matches or similarities between the query audio content and any of the reference audio contents.
  • These audio features of the query content represent metrics or a query feature vector to be compared with corresponding audio features of reference audio contents to identify one or more matched reference audio contents, and to use the one or more matched reference audio contents to identify, infer, deduce or classify the query content's level(s) of immersion, "genre", “mix style,” etc., in real time broadcast applications as well as non-broadcast media production operations. Additionally, optionally or alternatively, the query content's levels of immersion, “genre”, “mix style,” etc., can be used in automated (e.g., fully, with user input, etc.) mixing operations to transfer an "immersive-ness", “genre”, “mix style,” etc., from a template to the query audio content.
  • For example, "immersive-ness", "genre", "mix style," etc., of matched reference audio contents can be used as template(s) or as look-up key(s) to look up or identify applicable template(s) of "immersive-ness", "genre", "mix style," etc., from a reference/template database that stores templates of "immersive-ness" styles.
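  • As a hedged illustration of such template matching (the database contents, feature dimensions and distance measure below are hypothetical), a query feature vector can be compared against curated reference feature vectors with a simple nearest-neighbor search:

```python
import numpy as np

# Hypothetical reference/template database: label -> curated feature vector
# (e.g., zone-mean inter-channel spectral fluxes, correlation summaries, ...).
TEMPLATES = {
    "highly immersive / cinematic mix": np.array([0.82, 0.75, 0.60]),
    "front-heavy dialog mix":           np.array([0.15, 0.10, 0.05]),
    "ambient / music mix":              np.array([0.45, 0.50, 0.40]),
}

def closest_template(query_features):
    """Return the template label whose feature vector is nearest (Euclidean) to the query."""
    best_label, best_dist = None, np.inf
    for label, ref in TEMPLATES.items():
        dist = float(np.linalg.norm(query_features - ref))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label, best_dist

query = np.array([0.78, 0.70, 0.55])     # features extracted from the query content
label, dist = closest_template(query)
print(f"closest template: {label} (distance {dist:.3f})")
```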
  • a system as described herein may use audio features or information derived therefrom to determine, infer and/or modify animation of (e.g., an array of, etc.) audio objects and depicted sound sources in a multi-channel audio signal as described herein. Animation styles from template audio content may also be transferred to, or implemented with, audio content in the multi-channel audio signal.
  • Audio features and information (e.g., immersive-ness, etc.) derived therefrom for a variety of audio contents in audio signals that have been processed by a system as described herein can be stored as new templates/references in a template library or data store, along with any user input and/or modifications made to the audio contents.
  • channel and/or inter-channel features and characteristics as described herein can be used with non-real-time offline processes in a quality control (QC) context, for example to support cloud-based or web-based render validation.
  • an audio render refers to audio data (e.g., PCM data, etc.) generated near or at the end of an end-to-end content delivery and/or consumption chain to (e.g., directly, etc.) drive audio speakers of a target audio speaker configuration/environment (e.g., a target audio channel configuration/environment, a target audio playback configuration or environment, etc.) for audio/sound reproduction or rendering.
  • Audibly perceptible artifacts between or among various audio encoders and renderers can be compared and detected by taking differences (including but not limited to differences or Euclidean distances of multi-dimensional data) between or among feature matrices of audio renders (e.g., at each block, etc.) and by detecting timbre or transient smearing (e.g., due to temporal errors and misalignments in different channels, etc.) in spatial audio productions or spatial audio renders.
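  • A minimal sketch of such a comparison, assuming per-block inter-channel feature matrices have already been computed for two renders of the same content (the array shapes and the fault-injection step below are illustrative assumptions):

```python
import numpy as np

def render_distances(features_a, features_b):
    """Per-block Euclidean distance between the feature matrices of two audio renders.

    features_a, features_b: arrays of shape (num_blocks, num_channels, num_channels),
    e.g. inter-channel spectral-flux matrices computed per block for each render.
    Large distances flag blocks where the renders diverge (potential artifacts,
    channel-mapping issues, transient smearing, ...)."""
    diff = features_a - features_b
    return np.sqrt(np.sum(diff ** 2, axis=(1, 2)))

# Example with two hypothetical 8-channel renders, 100 blocks each
rng = np.random.default_rng(0)
reference = rng.random((100, 8, 8))
under_test = reference + 0.01 * rng.standard_normal((100, 8, 8))
under_test[42] += 0.5                     # simulate a damaged block in the render under test

d = render_distances(reference, under_test)
print("most suspicious block:", int(np.argmax(d)))
```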
  • spatial information such as time-dependent positions, directions, trajectories, velocities, etc., of audio objects or depicted sound sources (e.g., a person, a car, an airplane, a crowd, etc.) in audio content of an input multi-channel audio signal may be identified using channel and/or inter-channel features and characteristics as described herein.
  • the spatial information of the depicted sound sources may be included in, or used to generate, spatial audio metadata to be transmitted with corresponding audio data (e.g., audio sample data, PCM data, etc.) in an output multi-channel audio signal.
  • The spatial audio metadata can be generated without needing to receive any original spatial audio metadata in the input or original multi-channel audio signal, and can be beneficially used in a game engine, in a media program, etc., to help a rendering system depict an audiovisual environment or scene in a relatively lively and efficient manner.
  • the spatial audio metadata may identify a specific spatial direction to which an audio object is moving at a given time, a specific position at which an audio object is located, whether an audio object is entering into or leaving from a zone, whether an audio object is rotating clockwise or counter-clockwise, which zone an audio object will be entering in future, and so forth.
  • the spatial audio metadata may be extracted by a recipient device from the output signal that has been added or encoded with the spatial audio metadata.
  • Audio objects or to-be-depicted sound sources can be rendered at relatively accurate spatial locations in games, movies, etc., even when the input or original multi-channel audio signal may not have audio metadata that includes spatial information of the depicted sound sources.
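  • As one hedged illustration of how directional information might be inferred from inter-channel features alone (a deliberately simple heuristic, not the patent's method), the largest off-diagonal element of an inter-channel flux matrix can be read as movement from the row channel toward the column channel:

```python
import numpy as np

CHANNELS = ["L", "R", "C", "LFE", "Ls", "Rs", "Lh", "Rh"]   # assumed 5.1.2 order

def dominant_movement(flux_matrix, channels=CHANNELS):
    """Heuristic: in a matrix whose rows index channels of the preceding block and
    whose columns index channels of the current block, the largest off-diagonal
    element is taken to indicate a sound source moving from the row channel to
    the column channel."""
    m = np.array(flux_matrix, dtype=float)
    np.fill_diagonal(m, -np.inf)            # ignore same-channel (channel-specific) fluxes
    frm, to = np.unravel_index(np.argmax(m), m.shape)
    return channels[frm], channels[to], float(flux_matrix[frm, to])

flux = np.abs(np.random.default_rng(1).standard_normal((8, 8)))
flux[0, 4] = 5.0                            # pretend there is a strong L -> Ls flux
src, dst, value = dominant_movement(flux)
print(f"object appears to move from {src} to {dst} (flux {value:.2f})")
```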
  • spatial information of audio objects or depicted sound sources generated/identified through channel and/or inter-channel features and characteristics as described herein can be extended to be used by audio production tools (e.g., open-source tools, etc.) for virtual reality (VR), spatialized music, and/or sound synthesis applications.
  • Example embodiments are directed to visualization of audio features and audio characteristics in audio content authoring and/or monitoring systems. A pair of sets of audio blocks is determined from a multi-channel audio signal; the pair comprises a first set of audio blocks for a first time block over a plurality of audio channels and a second set of audio blocks for a second time block over the plurality of audio channels. The first time block is different from the second time block. A set of audio features is generated from the first set of audio blocks and the second set of audio blocks in the pair of sets of audio blocks. The set of audio features includes one or more inter-channel audio features. The set of audio features is graphically presented to a user by way of a set of user interface components on a display page.
  • Each user interface component in the set of user interface components represents a respective audio feature in the set of audio features.
  • the respective audio feature is computed from a first audio block in the first set of audio blocks and a second audio block in the second set of audio blocks.
  • a specific perceptible audio characteristic is visually conveyed to the user using the set of user interface components on the display page dynamically updated with a plurality of sets of audio features computed from a plurality of pairs of sets of audio blocks of the multi-channel audio signal.
  • Each set of audio features in the plurality of sets of audio features is computed from a respective pair of sets of audio blocks in the plurality of pairs of sets of audio blocks.
  • mechanisms as described herein form a part of a media processing system, including but not limited to: an audiovisual device, a flat panel TV, a handheld device, game machine, television, home theater system, tablet, mobile device, laptop computer, netbook computer, cellular radiotelephone, electronic book reader, point of sale terminal, desktop computer, computer workstation, computer kiosk, various other kinds of terminals and media processors, etc.
  • FIG. 1 illustrates an example audio render analyzer 100 comprising an audio pre-processor 104, an audio feature extractor 106, an audio render modifier 108, a user interface 110, etc.
  • Some or all of these devices or components in the audio render analyzer 100 may be implemented with one or more computing devices and may be operatively linked with one another through local data connections, remote data connections, cloud-based or web-based network connections, etc.
  • the audio render analyzer 100 is configured to receive/collect one or more (e.g., mono, stereo, multi-channel, etc.) audio signals 102 (e.g., individually as multi-channel audio signals, as a part of an overall multi-channel audio signal, etc.) to be analyzed for quality, styles, zones of interests, etc., using channel-specific and/or inter-channel audio features extracted from audio data of the audio signals 102.
  • the audio signals 102 may or may not comprise spatial audio metadata that indicates spatial information (e.g., positions, orientations, velocities, trajectories, etc.) of audio objects or depicted sound sources represented in the audio data of the audio signals 102.
  • an audio signal from which audio content/sample data is decoded may represent DD+ or AC4 media/audio.
  • The audio content/sample data decoded from the DD+ or AC4 media/audio may be applied with specific dynamic range control operations corresponding to a specific speaker configuration/arrangement (e.g., actually, etc.) present in a rendering environment and routed to the specific speaker configuration/arrangement for sound rendering/production.
  • the audio data of the audio signals 102 is directly used for feature analysis.
  • the audio pre-processor 104 performs one or more audio processing operations on the audio data of the audio signals 102 to generate preprocessed audio data for feature analysis.
  • The audio data for feature analysis is used as input by the audio feature extractor 106 to extract or generate audio features including but not limited to: channel-specific audio features, inter-channel audio features, audio features further generated (e.g., averaging, different orders of derivatives of audio features, higher-order fluxes computed from lower-order fluxes, etc.) using some or all of the channel-specific audio features and/or the inter-channel audio features, spatial information of audio objects and/or depicted sound sources represented in the audio signals 102, and so forth.
  • the audio render analyzer 100 provides some or all of the audio features, the spatial information, etc., as output data 112 to one or more other devices and/or data stores operating in conjunction with the audio render analyzer 100.
  • the user interface 110 interacts with a user through one or more user interface pages (e.g., GUI display pages, etc.).
  • the user interface 110 can present, or cause displaying, user interface components depicting some or all of the output data 112 to the user through the user interface pages.
  • the user interface 110 can receive some, or all, of the user input 114 through the one or more user interfaces.
  • The audio render analyzer 100 receives user input 114 that provides feedback or changes to processes, algorithms, operational parameters, etc., that are used in analyzing or modifying the audio data of the audio signals 102.
  • Example user feedback may include, but is not necessarily limited to only, user input related to one or more audio processing operations of: enhancing or shrinking immersive-ness, enhancing or shrinking cross-channel coherence, modifying attack times in specific audio channels, modifying dynamic range control (DRC) operations in specific audio channels, transferring or infusing the audio signals with reference styles from templates, classifying audio content/scenes depicted in the audio signals, validating audio renders generated by different audio renderers/decoders, monitoring one or more listener-perceptible characteristics/qualities of the audio signals, etc.
  • the audio render analyzer 100 sends control data 216 to one or more processes, algorithms, operational parameters, etc., that are used in one or more audio processing operations to modify the audio signals 102 into one or more modified audio signals.
  • the audio render modifier 108 generates the control data 216 automatically, or programmatically with or without user input. Additionally, optionally or alternatively, the audio render modifier 108 may perform some or all of the processes, algorithms, operational parameters, etc., implemented with the audio processing operations to modify the audio signals 102 into the modified audio signals.
  • the modified audio signals may be used by a plurality of audio speakers for audio reproduction or rendering, in place of the audio signals 102.
  • Audio feature extraction (e.g., inter-channel feature extraction, etc.) as described herein may be performed using machine listening and/or computer audition techniques.
  • Machine listening and/or computer audition as described herein differs from other music information retrieval (MIR) approaches and is capable of being performed in real time to map extracted audio features to control signals for real-time (e.g., audio rendering, audio reproduction, etc.) systems.
  • Example audio features as described herein may include, but are not necessarily limited to only, instantaneous spectral features as follows:
  • x denotes a first multi-dimensional quantity generated from a first component audio signal of a first channel;
  • y denotes a second multi-dimensional quantity generated from a second component audio signal of a second channel;
  • N represents the dimensionality (e.g., the total number of dimensions in, etc.) of each of x and y; and
  • x_i (or y_i) represents the value in the i-th dimension of x (or y).
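  • This page does not reproduce the patent's equations, so the sketch below uses standard stand-ins consistent with the symbol definitions above: a Pearson-style inter-channel correlation coefficient and a squared-difference spectral flux over N-dimensional magnitude spectra. The window choice, FFT size and the exact flux/correlation formulas are assumptions for illustration.

```python
import numpy as np

def magnitude_spectrum(block):
    """Magnitude spectrum of one audio block (one channel, one time block)."""
    return np.abs(np.fft.rfft(block * np.hanning(len(block))))

def inter_channel_correlation(x, y):
    """Pearson correlation coefficient between two N-dimensional quantities x and y
    (e.g., magnitude spectra of blocks from two different channels)."""
    x = x - x.mean()
    y = y - y.mean()
    denom = np.sqrt(np.sum(x ** 2) * np.sum(y ** 2)) + 1e-12
    return float(np.sum(x * y) / denom)

def inter_channel_spectral_flux(x, y):
    """Sum over the N dimensions of squared differences between x (current block of
    one channel) and y (preceding block of another channel)."""
    return float(np.sum((x - y) ** 2))

# Example: current block of channel L vs. preceding block of channel Ls
block_len = 1024
x = magnitude_spectrum(np.random.randn(block_len))   # current block, channel L
y = magnitude_spectrum(np.random.randn(block_len))   # preceding block, channel Ls
print("inter-channel correlation:", inter_channel_correlation(x, y))
print("inter-channel spectral flux:", inter_channel_spectral_flux(x, y))
```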
  • Each distinct block/frame in the collection/sequence of blocks/frames and each distinct audio channel in the plurality of audio channels form a distinct combination of block/frame and audio channel, thereby giving rise to a plurality of distinct combinations of block/frame and audio channel.
  • the specific block/frame and the specific channel - in reference to which the audio feature is computed - forms only one distinct combination of block/frame and channel in a plurality of distinct combinations of block/frame and channel.
  • FIG. 2A illustrates an example audio feature matrix that may be used to reflect, represent, display and/or present directional and/or locational information of audio features and/or directional and/or locational information of audio channels.
  • the (two-dimensional) audio feature matrix comprises a plurality of matrix rows and a plurality of matrix columns.
  • the plurality of matrix columns represents a plurality of audio channels for a current set of audio blocks/frames derived from a plurality of component audio signals in a multi-channel audio signal
  • the plurality of matrix rows represents the plurality of audio channels for a preceding set of audio blocks/frames derived from the plurality of component audio signals.
  • a matrix element is indexed by the matrix column labeled with "L” and the matrix row labeled with "L-1", and used to store, reflect, represent, display and/or present an audio feature computed based at least in part on a current audio block/frame for the audio channel L as indicated by the matrix column "L” and a preceding audio block/frame for the audio channel L as indicated by the matrix row "L-1".
  • the audio channels indicated by the matrix column "L” and the matrix row “L-1" are the same
  • the audio feature represented/stored in the matrix element is a channel-specific audio feature such as a channel-specific spectral flux (denoted as "SF(L, L-1)").
  • Rather than calculating (e.g., only, etc.) fluxes from power of a previous block/frame of the same channel, a system as described herein generates the audio feature matrix to capture or compute fluxes from power of a previous block/frame of different channel(s) such as adjacent channel(s).
  • The fluxes of power across different channels are captured, computed and/or represented in the audio feature matrix to detect directionality of audio object (or depicted sound source) movement across various channels over time, with each matrix element of the audio feature matrix representing/displaying the directionality from a first channel/speaker at which the audio object (or depicted sound source) resides to a second channel/speaker to which the audio object (or depicted sound source) moves.
  • an algorithm implemented by a system as described herein may be reconfigured or redirected to capture or compute audio features based on per-bin fluxes (e.g., individual fluxes per frequency bin, etc.) to provide a more detailed frequency analysis (e.g., frequency-specific analysis, etc.) or a more detailed frequency visualization (e.g., frequency-specific visualization, etc.) of fluxes of spectral power across various channels/speakers, instead of or in addition to summing fluxes for all bins for each (e.g., FFT, etc.) block/frame to provide an overall analysis (e.g., non-frequency-specific analysis, etc.) or an overall visualization (e.g., non-frequency-specific visualization, etc.) of fluxes of spectral power across various channels/speakers.
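  • A minimal sketch of building such a matrix for one pair of block sets, with rows indexing channels of the preceding block set and columns indexing channels of the current block set as in FIG. 2A; setting per_bin=True keeps per-frequency-bin fluxes instead of summing them over all bins. Windowing and FFT details are assumptions.

```python
import numpy as np

def spectra(block_set):
    """Magnitude spectra for one set of audio blocks of shape (num_channels, num_samples)."""
    win = np.hanning(block_set.shape[1])
    return np.abs(np.fft.rfft(block_set * win, axis=1))

def feature_matrix(curr_blocks, prev_blocks, per_bin=False):
    """Inter-channel spectral-flux matrix: element [r, c] is the flux from the preceding
    block of channel r to the current block of channel c; diagonal elements are the
    channel-specific fluxes such as SF(L, L-1)."""
    X = spectra(curr_blocks)                      # (num_channels, num_bins), current set
    Y = spectra(prev_blocks)                      # (num_channels, num_bins), preceding set
    flux = (X[None, :, :] - Y[:, None, :]) ** 2   # (prev_ch, curr_ch, num_bins)
    return flux if per_bin else flux.sum(axis=2)

num_channels, block_len = 8, 1024
curr = np.random.randn(num_channels, block_len)
prev = np.random.randn(num_channels, block_len)
M = feature_matrix(curr, prev)                    # 8 x 8 matrix as in FIG. 2A
print(M.shape, "channel-specific flux of channel 0:", M[0, 0])
```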
  • The current audio signal portions may comprise a current set of audio blocks/frames, for a current time point or a current time block/window, in a plurality of audio channels represented in a multi-channel audio signal.
  • a current time point/block/window may refer to a time point/block/window (e.g., a current wall clock time point/block/window, etc.) at which the multi-channel audio signal is currently being rendered, is most recently rendered, is next to be rendered, etc. Additionally, optionally or alternatively, a current time point/block/window may refer to a time point/block/window (e.g., a logical time point/block/window represented in the multi-channel audio signal, etc.) at which a user or system is viewing, analyzing, or manipulating channel-specific or inter-channel characteristics, styles, immersive-ness, envelope, attack times, etc., of the multi-channel audio signal.
  • an audio feature as captured in an audio feature matrix element as described herein may be represented with a numeric indicator (e.g., a GUI component, etc.) on a GUI representation.
  • a numeric value indicated by the numeric indicator may be a value of the audio feature.
  • GUI components in a GUI representation as described herein may be interactive. For example, one or more table cells in a GUI representation of any of FIG. 3A through FIG. 3F may be clicked or selected to display or access options, popups, related GUI representations, etc.
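  • As an illustrative sketch of such a GUI representation (the plotting library, color map and labeling below are assumptions, not the patent's user interface), an audio feature matrix can be shown as a color-coded grid with a numeric indicator in each cell:

```python
import numpy as np
import matplotlib.pyplot as plt

channels = ["L", "R", "C", "LFE", "Ls", "Rs", "Lh", "Rh"]        # assumed 5.1.2 order
M = np.abs(np.random.default_rng(3).standard_normal((8, 8)))     # feature matrix to display

fig, ax = plt.subplots(figsize=(6, 6))
im = ax.imshow(M, cmap="viridis")                                # color encodes the feature value
ax.set_xticks(range(len(channels)))
ax.set_xticklabels(channels)                                     # columns: current-block channels
ax.set_yticks(range(len(channels)))
ax.set_yticklabels([f"{c}-1" for c in channels])                 # rows: preceding-block channels
for r in range(M.shape[0]):                                      # numeric indicator per cell
    for c in range(M.shape[1]):
        ax.text(c, r, f"{M[r, c]:.2f}", ha="center", va="center", fontsize=7, color="w")
fig.colorbar(im, ax=ax, label="inter-channel spectral flux")
ax.set_title("Audio feature matrix (current vs. preceding blocks)")
plt.show()
```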
  • FIG. 3G illustrates an example GUI representation that may be launched by a user - e.g., interacting with a GUI representation of one or more of FIG. 3A through FIG. 3F - to display (e.g., instantaneous, time-averaged, smoothened with a filter, etc.) inter-channel spectral fluxes across three zones in a 5.1.2 audio format.
  • These three zones include a left-only-right-only zone or LoRo (denoted as “Zone LR”) comprising the audio channels/speakers L and R; a left-surround-right-surround zone (denoted as “Zone LsRs”) comprising the audio channels/speakers Ls and Rs; and a height zone (denoted as "Zone Heights”) comprising the audio channels/speakers Lh and Rh.
  • the plots of FIG. 3G reflect or represent real-time and/or non-real-time numeric data for inter-channel spectral fluxes (denoted as "ICSF Mean Zone LR,” “ICSF Mean Zone LsRs” and “ICSF Mean Zone Heights,” respectively) in the three zones.
  • quantities such as "average fluxes” that are representative of average power may be used to characterize or constitute "mean zones” or quantities associated with “mean zones.”
  • Combinatorial higher-level features may be constructed from a matrix decomposition or representation of audio features, whether using simple averaging techniques or more complex statistical analysis based on speaker groups (e.g., location-dependent groups, etc.).
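  • A small sketch of such grouping, computing one zone-mean inter-channel spectral flux per zone from a feature matrix (channel ordering and zone membership are assumptions based on the 5.1.2 example above):

```python
import numpy as np

CHANNELS = ["L", "R", "C", "LFE", "Ls", "Rs", "Lh", "Rh"]   # assumed 5.1.2 order
ZONES = {
    "Zone LR":      ["L", "R"],
    "Zone LsRs":    ["Ls", "Rs"],
    "Zone Heights": ["Lh", "Rh"],
}

def zone_means(feature_matrix, channels=CHANNELS, zones=ZONES):
    """Average the feature values whose row and column channels both fall inside a
    zone, yielding one 'ICSF Mean' value per zone."""
    idx = {name: i for i, name in enumerate(channels)}
    means = {}
    for zone, members in zones.items():
        ids = [idx[m] for m in members]
        means[zone] = float(feature_matrix[np.ix_(ids, ids)].mean())
    return means

M = np.abs(np.random.default_rng(4).standard_normal((8, 8)))
for zone, value in zone_means(M).items():
    print(f"ICSF Mean {zone}: {value:.3f}")
```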
  • The user can cause per-channel loudness (e.g., volume, averaged amplitude of spatial pressure wave over time, etc.) to be modulated to achieve a target level of immersive-ness specifically adapted for a specific audible event, audio scene, audio portion, etc. Loudness modulation algorithms/methods/operations (e.g., DRC algorithms/methods/operations, etc.) may be adaptively modified or invoked based on user input to regulate loudness in individual channels to achieve the target level of immersive-ness.
  • loudness in one or more channels may be decreased, while loudness in one or more other channels may be increased.
  • Loudness of dialog may be enhanced to decrease immersive-ness, or shrunk to increase immersive-ness.
  • one or more audio objects or depicted sound sources may be removed or suppressed from a specific audible event, audio scene, audio portion, etc.
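  • As a hedged sketch of the per-channel loudness modulation described above (a single broadband gain on the surround/height channels stands in for the per-channel DRC/loudness processing a real system would use; channel ordering and the energy-ratio target are assumptions):

```python
import numpy as np

SURROUND = [4, 5, 6, 7]       # assumed Ls, Rs, Lh, Rh in a 5.1.2 layout
NON_SURROUND = [0, 1, 2]      # assumed L, R, C

def adjust_immersion(block, target_ratio, eps=1e-12):
    """Scale the surround/height channels of one block so that the ratio of surround
    to non-surround energy approaches target_ratio."""
    energy = np.sum(block ** 2, axis=1)
    current = energy[SURROUND].sum() / (energy[NON_SURROUND].sum() + eps)
    gain = np.sqrt(target_ratio / (current + eps))    # energy scales with gain squared
    out = block.copy()
    out[SURROUND] *= gain
    return out, current, gain

block = np.random.randn(8, 512)
modified, current_ratio, gain = adjust_immersion(block, target_ratio=1.5)
print(f"immersion ratio {current_ratio:.2f} -> target 1.50, surround gain {gain:.2f}")
```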
  • A GUI representation such as any of FIG. 3A through FIG. 3G may be used to convey or depict dynamic motions of audio objects or depicted sound sources represented in a multi-channel audio signal. For example, if all audio objects or depicted sound sources in the multi-channel audio signal collectively make a global spatial rotation, then all inter-channel fluxes and/or inter-channel correlation coefficients may indicate relatively high magnitude values. If an individual audio object or depicted sound source in the multi-channel audio signal makes an individual spatial rotation, then inter-channel fluxes and/or inter-channel correlation coefficients between two or more specific channels, but not necessarily other channels, may indicate relatively high magnitude values.
  • linear motions such as translations made by some or all depicted sound sources may be visualized through one or more GUI representations as described herein.
  • Through one or more GUI representations of audio features of the audio signal, the user can visualize, perceive and/or validate auditory characteristics or styles, or immersive-ness of audio content for dialogs, cartoons, games, etc.
  • Numeric values representing per-channel or per-channel-pair audio features in a specific subset (e.g., a surround zone, a height zone, etc.) of channels may be grouped into group values (e.g., average, mean, etc.).
  • the grouped audio feature values may be used to convey auditory characteristics or styles or immersive-ness of audio content represented in the specific subset of channels.
  • An envelope of an auditory event (e.g., an airplane flying overhead, etc.) may be visualized and monitored to determine whether audio signal activities as indicated by audio features show up coherently at the same time in the channels involved in depicting the auditory event or whether audio signal activities show up chaotically at different times in the channels involved in depicting the auditory event.
  • audio features extracted from these different audio renders or audio signals can be indicated through one or more GUI representations as described herein to determine whether any audio render or audio signal shows the best overall coherence/correlation, or whether any audio render or audio signal shows the most overall chaos, or whether any channel (e.g., attack times of an audio event such as a piano striking represented in different audio channels, etc.) should be further synchronized by a specific audio renderer/decoder (e.g., a cloud-based or web-based audio renderer/decoder, etc.).
  • “chaos” may refer to non-uniform perceptible movements and complex signal interactions (e.g., among audio objects and/or sound sources, among different audio channels, etc.) in a spatial audio scene.
  • a composite of two or more GUI representations may be displayed on a single display page.
  • one or more GUI frames/panes on the display page may be used to display one or more of GUI representations of FIG. 3A through FIG. 3F
  • one or more GUI frames/panes on the display page may be used to display one or more GUI representations of FIG. 3G or other types.
  • User interface control components may be rendered on the display page.
  • a user interface control component may be used by a user to cause audio processing operations to be performed to modify an audio signal.
  • Example audio processing operations effectuated through one or more user interface components may include, but are not necessarily limited to only, any of: enhancing or shrinking immersive-ness, synchronizing attack times in two or more channels, real-time editing of audio content, removing audio objects, changing DRC operations, adaptively modifying or transferring a style to a particular audio data portion, adding spatial metadata in a modified audio signal generated from the (input) audio signal, etc.
  • An audio signal received by a system as described herein may comprise audio data as PCM data.
  • An audio signal received by a system as described herein may comprise audio data as non-PCM data.
  • Some or all audio processing operations may operate on PCM data and/or non-PCM data. These audio processing operations may be used to generate auditory characteristics or styles or immersive-ness in accordance with specific artistic intent and/or to cause depiction of sound objects or auditory events to be relatively accurate. If deviations from artistic intent, inaccuracies in audio content envelopes or attack times, unintended chaos, predictable audible artifacts, etc., are detected through representations of audio features in GUI representations by a user or a system as described herein, corrective actions may be taken accordingly to address these issues.
  • FIG. 4 illustrates an example process flow according to an embodiment.
  • one or more computing devices or components may perform this process flow.
  • a system as described herein determines, from a multi-channel audio signal, a pair of sets of audio blocks.
  • the pair of sets of audio blocks comprises a first set of audio blocks for a first time block over a plurality of audio channels and a second set of audio blocks for a second time block over the plurality of audio channels.
  • the first time block is different from the second time block.
  • In block 404, the system generates a set of audio features from the first set of audio blocks and the second set of audio blocks in the pair of sets of audio blocks.
  • the set of audio features includes one or more inter-channel audio features.
  • the system graphically presents, to a user, the set of audio features with a set of user interface components on a display page.
  • Each user interface component in the set of user interface components represents a respective audio feature in the set of audio features.
  • the respective audio feature is computed from a first audio block in the first set of audio blocks and a second audio block in the second set of audio blocks.
  • the system causes a specific perceptible audio characteristic to be visually monitored by the user using the set of user interface components on the display page dynamically updated with a plurality of sets of audio features computed from a plurality of pairs of sets of audio blocks of the multi-channel audio signal.
  • Each set of audio features in the plurality of sets of audio features is computed from a respective pair of sets of audio blocks in the plurality of pairs of sets of audio blocks.
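  • A compact sketch tying the process flow together: it walks a multi-channel PCM signal block by block, computes a feature set for each pair of consecutive block sets, and hands it to a stubbed-out display update. The block length, feature choice and print-based "display" are illustrative assumptions.

```python
import numpy as np

def spectra(blocks):
    win = np.hanning(blocks.shape[1])
    return np.abs(np.fft.rfft(blocks * win, axis=1))

def feature_set(curr_blocks, prev_blocks):
    """Inter-channel spectral-flux matrix between the two sets of audio blocks."""
    X, Y = spectra(curr_blocks), spectra(prev_blocks)
    return ((X[None, :, :] - Y[:, None, :]) ** 2).sum(axis=2)

def monitor(multichannel_pcm, block_len=1024):
    """Determine pairs of block sets (cf. block 402), generate a set of audio features
    per pair (block 404), and update the display (blocks 406/408, stubbed as print)."""
    num_ch, num_samples = multichannel_pcm.shape
    prev = None
    for start in range(0, num_samples - block_len + 1, block_len):
        curr = multichannel_pcm[:, start:start + block_len]
        if prev is not None:
            features = feature_set(curr, prev)
            print(f"samples {start}-{start + block_len}: mean inter-channel flux "
                  f"{features.mean():.2f}")      # a GUI would refresh its components here
        prev = curr

monitor(np.random.randn(8, 8192))
```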
  • the first time block represents a current time block for which the set of audio features is computed; the second time block precedes the first time block.
  • the first time block represents a current time block for which the set of audio features is computed; the second time block succeeds the first time block.
  • the one or more inter-channel audio features represent one or more inter-channel correlation coefficients.
  • the specific perceptible audio characteristic to be visually monitored by the user represents a level of immersive-ness of the multi-channel audio signal for the first time block.
  • the level of immersive-ness of the multi-channel audio signal for the first time block is modified, based on user input provided by the user, in a modified multi-channel audio signal generated from the multi-channel audio signal.
  • the specific perceptible audio characteristic to be visually monitored by the user represents a level of animation of one or more audio objects represented in the multi-channel audio signal for the first time block.
  • one or more dynamic range control operations are caused by the user to be performed on one or more audio channels in the plurality of audio channels while the user is visually monitoring the specific perceptible audio characteristic represented by the set of user interface components dynamically updated with the plurality of sets of audio features.
  • colors of user interface components in the set of user interface components indicate respective audio channels of audio blocks used to compute audio features represented by the user interface components.
  • The multi-channel audio signal is an audio render generated by a cloud-based audio rendering system.
  • a non-transitory computer readable storage medium comprising software instructions, which when executed by one or more processors cause performance of any one of the methods as described herein. Note that, although separate embodiments are discussed herein, any combination of embodiments and/or partial embodiments discussed herein may be combined to form further embodiments.
  • the techniques described herein are implemented by one or more special-purpose computing devices.
  • the special-purpose computing devices may be hard-wired to perform the techniques, or may include digital electronic devices such as one or more application-specific integrated circuits (ASICs) or field programmable gate arrays (FPGAs) that are persistently programmed to perform the techniques, or may include one or more general purpose hardware processors programmed to perform the techniques pursuant to program instructions in firmware, memory, other storage, or a combination.
  • ASICs application-specific integrated circuits
  • FPGAs field programmable gate arrays
  • Such special-purpose computing devices may also combine custom hard-wired logic, ASICs, or FPGAs with custom programming to accomplish the techniques.
  • the special-purpose computing devices may be desktop computer systems, portable computer systems, handheld devices, networking devices or any other device that incorporates hard-wired and/or program logic to implement the techniques.
  • FIG. 5 is a block diagram that illustrates a computer system 500 upon which an embodiment of the invention may be implemented.
  • Computer system 500 includes a bus 502 or other communication mechanism for communicating information, and a hardware processor 504 coupled with bus 502 for processing information.
  • Hardware processor 504 may be, for example, a general-purpose microprocessor.
  • Computer system 500 also includes a main memory 506, such as a random-access memory (RAM) or other dynamic storage device, coupled to bus 502 for storing information and instructions to be executed by processor 504.
  • Main memory 506 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 504.
  • Such instructions when stored in non-transitory storage media accessible to processor 504, render computer system 500 into a special-purpose machine that is device-specific to perform the operations specified in the instructions.
  • Computer system 500 further includes a read-only memory (ROM) 508 or other static storage device coupled to bus 502 for storing static information and instructions for processor 504.
  • ROM read-only memory
  • a storage device 510 such as a magnetic disk or optical disk, is provided and coupled to bus 502 for storing information and instructions.
  • Computer system 500 may be coupled via bus 502 to a display 512, such as a liquid crystal display (LCD), for displaying information to a computer user.
  • a display 512 such as a liquid crystal display (LCD)
  • An input device 514 is coupled to bus 502 for communicating information and command selections to processor 504.
  • Another type of user input device is cursor control 516, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 504 and for controlling cursor movement on display 512.
  • This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), that allows the device to specify positions in a plane.
  • Computer system 500 may implement the techniques described herein using device-specific hard-wired logic, one or more ASICs or FPGAs, firmware and/or program logic which in combination with the computer system causes or programs computer system 500 to be a special-purpose machine. According to one embodiment, the techniques herein are performed by computer system 500 in response to processor 504 executing one or more sequences of one or more instructions contained in main memory 506. Such instructions may be read into main memory 506 from another storage medium, such as storage device 510. Execution of the sequences of instructions contained in main memory 506 causes processor 504 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions.
  • Non-volatile media includes, for example, optical or magnetic disks, such as storage device 510.
  • Volatile media includes dynamic memory, such as main memory 506.
  • Common forms of storage media include, for example, a floppy disk, a flexible disk, hard disk, solid state drive, magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, NVRAM, any other memory chip or cartridge.
  • Storage media is distinct from but may be used in conjunction with transmission media.
  • Transmission media participates in transferring information between storage media.
  • transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise bus 502.
  • transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications.
  • Various forms of media may be involved in carrying one or more sequences of one or more instructions to processor 504 for execution.
  • the instructions may initially be carried on a magnetic disk or solid-state drive of a remote computer.
  • the remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem.
  • a modem local to computer system 500 can receive the data on the telephone line and use an infra-red transmitter to convert the data to an infra-red signal.
  • An infra-red detector can receive the data carried in the infra-red signal and appropriate circuitry can place the data on bus 502.
  • Bus 502 carries the data to main memory 506, from which processor 504 retrieves and executes the instructions.
  • the instructions received by main memory 506 may optionally be stored on storage device 510 either before or after execution by processor 504.
  • Computer system 500 also includes a communication interface 518 coupled to bus 502.
  • Communication interface 518 provides a two-way data communication coupling to a network link 520 that is connected to a local network 522.
  • communication interface 518 may be an integrated services digital network (ISDN) card, cable modem, satellite modem, or a modem to provide a data communication connection to a corresponding type of telephone line.
  • ISDN integrated services digital network
  • communication interface 518 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN.
  • LAN local area network
  • Wireless links may also be implemented.
  • communication interface 518 sends and receives electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.
  • Network link 520 typically provides data communication through one or more networks to other data devices.
  • network link 520 may provide a connection through local network 522 to a host computer 524 or to data equipment operated by an Internet Service Provider (ISP) 526.
  • ISP 526 in turn provides data communication services through the world-wide packet data communication network now commonly referred to as the "Internet" 528.
  • Internet 528 uses electrical, electromagnetic or optical signals that carry digital data streams.
  • the signals through the various networks and the signals on network link 520 and through communication interface 518, which carry the digital data to and from computer system 500, are example forms of transmission media.
  • Computer system 500 can send messages and receive data, including program code, through the network(s), network link 520 and communication interface 518.
  • a server 530 might transmit a requested code for an application program through Internet 528, ISP 526, local network 522 and communication interface 518.
  • the received code may be executed by processor 504 as it is received, and/or stored in storage device 510, or other non-volatile storage for later execution.

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Stereophonic System (AREA)
EP20214889.6A 2019-12-23 2020-12-17 Inter-channel audio feature measurement and display on a graphical user interface Pending EP3843428A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201962952822P 2019-12-23 2019-12-23
EP19219223 2019-12-23

Publications (1)

Publication Number Publication Date
EP3843428A1 true EP3843428A1 (fr) 2021-06-30

Family

ID=73790030

Family Applications (1)

Application Number Title Priority Date Filing Date
EP20214889.6A Pending EP3843428A1 (fr) 2019-12-23 2020-12-17 Mesure de caractéristiques audio inter-canaux et affichage sur interface graphique d'utilisateur

Country Status (1)

Country Link
EP (1) EP3843428A1 (fr)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2146522A1 (fr) * 2008-07-17 2010-01-20 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus and method for generating audio output signals using object-based metadata

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2146522A1 (fr) * 2008-07-17 2010-01-20 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus and method for generating audio output signals using object-based metadata

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
LERCH, A.: "An Introduction to Audio Content Analysis: Applications in Signal Processing and Music Informatics", 2012, WILEY
PAUL BOERSMA ET AL: "Praat online manual: doing phonetics by computer; Draw as squares...", 19 March 1998 (1998-03-19), XP055678343, Retrieved from the Internet <URL:http://www.fon.hum.uva.nl/praat/manual/Matrix__Draw_as_squares___.html> [retrieved on 20200320] *
PAUL BOERSMA ET AL: "Praat online manual: doing phonetics by computer; Intro", 31 January 2011 (2011-01-31), XP055678317, Retrieved from the Internet <URL:http://www.fon.hum.uva.nl/praat/manual/Intro.html> [retrieved on 20200320] *
PAUL BOERSMA ET AL: "Praat online manual: doing phonetics by computer; Sound: To CrossCorrelationTable...", 12 February 2011 (2011-02-12), XP055678310, Retrieved from the Internet <URL:http://www.fon.hum.uva.nl/praat/manual/Sound__To_CrossCorrelationTable___.html> [retrieved on 20200320] *

Similar Documents

Publication Publication Date Title
EP3092642B1 Spatial error metric of audio content
EP3011762B1 Adaptive audio content generation
Cuevas-Rodríguez et al. 3D Tune-In Toolkit: An open-source library for real-time binaural spatialisation
US11972770B2 (en) Systems and methods for intelligent playback
US8699727B2 (en) Visually-assisted mixing of audio using a spectral analyzer
Jot et al. Augmented reality headphone environment rendering
US11269589B2 (en) Inter-channel audio feature measurement and usages
EP3332557B1 Processing object-based audio signals
CN105580070A Method for processing an audio signal in accordance with a room impulse response, signal processing unit, audio encoder, audio decoder, and binaural renderer
EP4121958A1 Reverberation rendering
EP3622730B1 Spatializing audio data based on analysis of incoming audio data
JP2022550372A Method and system for creating binaural immersive audio for audiovisual content
US10832700B2 (en) Sound file sound quality identification method and apparatus
US20240177697A1 (en) Audio data processing method and apparatus, computer device, and storage medium
CN109792582A Binaural rendering apparatus and method for playing back multiple audio sources
WO2006069248A2 Audio fidelity measurement device
CN110024421A Method and apparatus for adaptively controlling decorrelation filters
EP3843428A1 (fr) Mesure de caractéristiques audio inter-canaux et affichage sur interface graphique d&#39;utilisateur
US20230070037A1 (en) Method for processing audio signal and electronic device
Thery et al. Impact of the visual rendering system on subjective auralization assessment in VR
US20220295207A1 (en) Presentation independent mastering of audio content
Turchet et al. Immersive networked music performance systems: identifying latency factors
JP7490062B2 Method and apparatus for evaluating dialog intelligibility
CN115735365A Systems and methods for upmixing audiovisual data
Gal et al. YouMixIt: Your Music Deserves Spatial Treatment

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN PUBLISHED

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20220103

RBV Designated contracting states (corrected)

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

RAP3 Party data changed (applicant data changed or rights of an application transferred)

Owner name: DOLBY LABORATORIES LICENSING CORPORATION

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: EXAMINATION IS IN PROGRESS

17Q First examination report despatched

Effective date: 20230329

P01 Opt-out of the competence of the unified patent court (upc) registered

Effective date: 20230417