US20240119946A1 - Audio rendering system and method and electronic device - Google Patents

Audio rendering system and method and electronic device

Info

Publication number
US20240119946A1
Authority
US
United States
Prior art keywords
audio
signal
audio signal
specific
representation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
US18/541,665
Other languages
English (en)
Inventor
Junjie Shi
Chuanzeng Huang
Xuzhou YE
Zhengpu ZHANG
Derong Liu
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zitiao Network Technology Co Ltd
Original Assignee
Beijing Zitiao Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zitiao Network Technology Co Ltd filed Critical Beijing Zitiao Network Technology Co Ltd
Publication of US20240119946A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/008: Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • G10L 19/04: using predictive techniques
    • G10L 19/16: Vocoder architecture
    • G10L 19/18: Vocoders using multiple modes
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04S: STEREOPHONIC SYSTEMS
    • H04S 5/00: Pseudo-stereo systems, e.g. in which additional channel signals are derived from monophonic signals by means of phase shifting, time delay or reverberation

Definitions

  • the present disclosure relates to the technical field of audio signal processing, and in particular to an audio rendering system, an audio rendering method, an electronic apparatus and a non-transitory computer-readable storage medium.
  • Audio rendering refers to proper processing of sound signals from a sound source to provide a user with a desired listening experience, especially an immersive experience, in the user application scenario.
  • an excellent immersive audio system should provide listeners with the feeling of being immersed in a virtual environment.
  • an immersive sense by itself is not a sufficient condition for successful commercial deployment of virtual reality multimedia services.
  • an audio system should also provide content creation tools, content creation workflows, content distribution modes and platforms, and a set of rendering systems that are economically viable and easy to use for consumers and creators.
  • an audio rendering system including an audio signal encoding module configured to spatially encode an audio signal in a specific audio content format based on information related to metadata associated with the audio signal in the specific audio content format to obtain an encoded audio signal; and an audio signal decoding module configured to spatially decode the encoded audio signal to obtain a decoded audio signal for audio rendering.
  • an audio rendering method including an audio signal encoding step for spatially encoding an audio signal in a specific audio content format based on information related to metadata associated with the audio signal in the specific audio content format to obtain an encoded audio signal; and an audio signal decoding step for spatially decoding the encoded audio signal to obtain a decoded audio signal for audio rendering.
  • a chip including at least one processor and an interface, wherein the interface is used for providing computer-executable instructions for the at least one processor, and the at least one processor is used for executing the computer-executable instructions to implement the audio rendering method of any embodiment described in the present disclosure.
  • a computer program including instructions that, when executed by a processor, cause the processor to perform the audio rendering method of any embodiment described in the present disclosure.
  • an electronic apparatus including a memory; and a processor coupled to the memory, the processor can be configured to execute instructions stored in the memory so as to execute the audio rendering method of any embodiment described in the present disclosure.
  • a computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, causes implementation of the audio rendering method of any embodiment described in the present disclosure.
  • a computer program product including instructions which, when executed by a processor, implement the audio rendering method of any embodiment described in the present disclosure.
  • FIG. 1 shows a schematic diagram of some embodiments of an audio signal processing process
  • FIGS. 2 A and 2 B show schematic diagrams of some embodiments of the audio system architecture
  • FIG. 3 A shows a schematic diagram of a tetrahedral B-format microphone
  • FIG. 3 C shows a schematic diagram of a HOA microphone
  • FIG. 3 D shows a schematic diagram of an X-Y pair stereo microphone
  • FIG. 4 A shows a block diagram of an audio rendering system according to an embodiment of the present disclosure
  • FIG. 4 B shows a schematic conceptual diagram of an audio rendering process according to an embodiment of the present disclosure
  • FIGS. 4 C and 4 D show schematic diagrams of pre-processing operations in an audio rendering system according to an embodiment of the present disclosure
  • FIG. 4 E shows a block diagram of an audio signal encoding module according to an embodiment of the present disclosure
  • FIG. 4 F shows a flowchart of audio signal spatial encoding according to an embodiment of the present disclosure
  • FIG. 4 G shows a flowchart of an exemplary implementation of an audio rendering process according to an embodiment of the present disclosure
  • FIG. 4 H shows a schematic diagram of an exemplary implementation of an audio rendering process according to an embodiment of the present disclosure
  • FIG. 4 I shows a flowchart of an audio rendering method according to an embodiment of the present disclosure
  • FIG. 5 shows a block diagram of some embodiments of an electronic apparatus of the present disclosure
  • FIG. 6 shows a block diagram of other embodiments of the electronic apparatus of the present disclosure.
  • FIG. 7 shows a block diagram of some embodiments of a chip of the present disclosure.
  • the term “including” and its variants used in this disclosure means an open term including at least the following elements/features, but not excluding other elements/features, that is, “including but not limited to”.
  • the term “comprising” and its variants used in this disclosure mean an open term comprising at least the elements/features behind it, but not excluding other elements/features, that is, “comprising but not limited to”. Therefore, including is synonymous with comprising.
  • the term “based on” means “at least partially based on”.
  • references throughout this specification to “one embodiment”, “some embodiments” or “embodiments” mean that a particular feature, structure or characteristic described in connection with an embodiment can be included in at least one embodiment of the present disclosure.
  • the term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one other embodiment”; the term “some embodiments” means “at least some embodiments”.
  • the appearances of the phrases “in one embodiment”, “in some embodiments” or “in an embodiment” in various places throughout the specification do not necessarily all refer to the same embodiment, although they may.
  • the terms “first” and “second” mentioned in this disclosure are only used to distinguish different apparatuses, modules or units, and are not used to limit the order or interdependence of the functions performed by these apparatuses, modules or units. Unless otherwise specified, the concepts of “first” and “second” are not intended to imply that the objects so described must be in a given order in time, space, ranking or any other way.
  • FIG. 1 shows some conceptual diagrams of audio signal processing, especially from acquisition to rendering process/system.
  • after being collected, the audio signal can be processed or produced, and the processed/produced audio signal can then be distributed to a rendering end for rendering, so that it can be presented to the user in an appropriate form to satisfy the user experience.
  • an audio signal processing flow can be applied to various application scenarios, especially virtual reality audio content expression.
  • virtual reality audio content expression generally relates to metadata, renderer/rendering system, audio codec and the like, wherein the metadata, renderer/rendering system and audio codec can be logically separated from each other.
  • the renderer/rendering system can directly process metadata and audio signals, without audio encoding and decoding; in particular, the renderer/rendering system here can be used for audio content production.
  • the transmission format of metadata+audio stream can be set, and then the metadata and audio content can be transmitted to the renderer/rendering system through an intermediate process including encoding and decoding process, so as to be rendered to users.
  • input audio signal and metadata can be obtained from an acquisition end, wherein the input audio signal may include various appropriate forms, including, for example, channel, object, HOA or their mixed formats.
  • metadata can be of appropriate types, such as dynamic metadata and static metadata. Dynamic metadata can be transmitted together with the input audio signals in various appropriate ways; as an example, metadata information can be generated according to metadata definitions, the dynamic metadata can be transmitted along with the audio stream, and the specific package format can be defined according to the transmission protocol type adopted by the system layer.
  • metadata can also be directly transmitted to the playback end without further generating metadata information.
  • static metadata can be directly transmitted to the playback end, without going through the encoding and decoding process.
  • the input audio signal will be encoded, then transmitted to the playback end, and then decoded for playback to the user through a playback apparatus, such as a renderer.
  • the renderer renders and outputs the decoded audio file.
  • metadata and audio codec are independent from each other, and the decoder and renderer are decoupled therebetween.
  • the renderer can be configured with an identifier, that is, a renderer has a corresponding identifier, and different renderers may have different identifiers.
  • a registration mechanism can be adopted for renderers: the playback end is provided with multiple IDs, which respectively indicate the renderers/rendering systems that the playback end can support. For example, at least four IDs may be included: ID1 indicates a renderer based on binaural output, ID2 indicates a renderer based on speaker output, and ID3-ID4 indicate other types of renderers. The various renderers may share the same metadata definition, or they may support different metadata definitions, each renderer having a corresponding metadata definition. In the latter case, a specific metadata identifier can be used to indicate a specific metadata definition during transmission, so that each renderer has a corresponding metadata identifier and the playback end can select the corresponding renderer according to that identifier for audio signal playback (a minimal sketch of such a registry is given below).
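  • as an illustrative, non-limiting sketch of such a registration mechanism (the class names, ID strings and metadata IDs below are hypothetical; the disclosure only requires that each renderer have an identifier and an associated metadata definition):

    # Hypothetical registry mapping renderer IDs to renderer types and
    # metadata definitions; all names are illustrative only.
    RENDERER_REGISTRY = {}

    def register_renderer(renderer_id, renderer_cls, metadata_id):
        """Register a renderer under an ID together with its metadata definition."""
        RENDERER_REGISTRY[renderer_id] = (renderer_cls, metadata_id)

    def select_renderer(renderer_id):
        """At the playback end, pick the renderer indicated by the identifier."""
        renderer_cls, metadata_id = RENDERER_REGISTRY[renderer_id]
        return renderer_cls(), metadata_id

    class BinauralRenderer: ...   # ID1: renderer based on binaural output
    class SpeakerRenderer: ...    # ID2: renderer based on speaker output

    register_renderer("ID1", BinauralRenderer, metadata_id="MD_binaural")
    register_renderer("ID2", SpeakerRenderer, metadata_id="MD_speaker")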
  • FIGS. 2 A and 2 B show exemplary implementations of an audio system.
  • FIG. 2 A shows a schematic diagram of an exemplary architecture of an audio system according to some embodiments of the present disclosure.
  • an audio system may include, but not limited to, audio acquisition, audio content production, audio storage/distribution, and audio rendering.
  • FIG. 2 B shows exemplary implementations of various stages of an audio rendering process/system. It mainly shows the production and consumption stages in the audio system; an intermediate processing stage, such as compression, can optionally be included.
  • the production and consumption stages here may correspond to the exemplary implementations of the production and rendering stages shown in FIG. 2 A, respectively. The intermediate processing stage can be included in the distribution stage shown in FIG. 2 A.
  • in the audio acquisition stage, audio scenes are captured to acquire audio signals. Audio acquisition can be handled by appropriate audio acquisition means/systems/apparatuses, etc.
  • audio content formats can include at least one of the following three: scene-based audio representation, channel-based audio representation and object-based audio representation, and for each audio content format, corresponding or matching apparatuses and/or manners can be adopted for capturing.
  • a microphone array supporting a sphere configuration can be used to capture scene audio signals
  • one or more microphones that have been specially optimized can be used to record sound to capture audio signals.
  • audio acquisition may also include appropriate post-processing for the captured audio signals. Audio acquisition of various audio content formats will be exemplarily described below.
  • scene-based audio representation is a type of sound field representation which is extensible and independent of speakers; an example definition is given in ITU-R BS.2266-2.
  • the scene-based audio may be based on a set of orthogonal basis functions, such as spherical harmonics.
  • examples of scene-based audio formats used may include B-Format, First Order Ambisonics (FOA), High Order Ambisonics (HOA) and the like.
  • Ambisonics indicates an omnidirectional audio system, that is, it can include sound sources above and below the listener in addition to the horizontal plane.
  • the auditory scene of Ambisonics can be captured by using a first-order or higher-order Ambisonics microphone.
  • a scene-based audio representation may generally indicate an audio signal including HOA.
  • the B-format microphone or First Order Ambisonics (FOA) format may use the first four low-order spherical harmonics to represent a three-dimensional sound field with four signals W, X, Y and Z.
  • W is used to record the omnidirectional sound pressure
  • X is used to record the front/back sound pressure gradient at the acquisition position
  • Y is used to record the left/right sound pressure gradient at the acquisition position
  • Z is used to record the up/down sound pressure gradient at the acquisition position.
  • Such four signals can be generated by processing original signals of a so-called “tetrahedron” microphone.
  • the “tetrahedron” microphone can be composed of four microphones, arranged at the left-front-up (LFU), right-front-down (RFD), left-back-down (LBD) and right-back-up (RBU) positions, as shown in FIG. 3 A.
  • the B-format microphone array configuration can be deployed on a portable spherical audio and video acquisition apparatus, and original microphone signal components can be processed in real time to obtain W, X, Y and Z components.
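  • a minimal sketch of this A-format to B-format conversion (the idealized matrix only; a real converter would also apply capsule equalization filters, and axis conventions are assumed as X = front, Y = left, Z = up):

    import numpy as np

    def a_to_b_format(lfu, rfd, lbd, rbu):
        """Convert the four tetrahedral capsule signals (1-D arrays) to W/X/Y/Z."""
        w = 0.5 * (lfu + rfd + lbd + rbu)  # omnidirectional sound pressure
        x = 0.5 * (lfu + rfd - lbd - rbu)  # front/back pressure gradient
        y = 0.5 * (lfu - rfd + lbd - rbu)  # left/right pressure gradient
        z = 0.5 * (lfu - rfd - lbd + rbu)  # up/down pressure gradient
        return w, x, y, z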
  • horizontal-only B-format microphones can be used to capture auditory scenes and acquire audio.
  • some configurations may support only horizontal B-format, in which only the W, X and Y components are captured and no Z component is captured. Compared with the 3D audio functionality of FOA and HOA, horizontal-only B-format gives up the extra immersion provided by height information.
  • multiple formats for High-Order Ambisonics data exchange may be included.
  • the channel order, normalization and polarity of channels should be correctly defined.
  • auditory scenes can be captured by High-Order Ambisonics microphones.
  • the spatial resolution and listening area can be greatly enhanced by increasing the number of directional microphones, which can be realized by, for example, second-order, third-order, fourth-order, and higher-order Ambisonics systems (collectively referred to as HOA, Higher Order Ambisonics).
  • FIG. 3 C shows a HOA microphone.
  • the acquisition of channel-based audio representation generally includes audio acquisition using microphones and may also include performing channel-based post-processing.
  • a channel-based audio representation may generally indicate an audio signal including a channel.
  • such an acquisition system can use multiple microphones to capture sounds from different directions, or use coincident or spaced microphone arrays.
  • different channel-based formats can be created, ranging, for example, from recordings with the X-Y pair stereo microphone shown in FIG. 3 D to 8.0-channel contents recorded with a microphone array.
  • the microphone embedded in user equipment can also record channel-based audio formats, such as recording stereo with a mobile phone.
  • object-based audio representation can use a set of individual audio elements to represent a whole complex audio scene, each audio element including an audio waveform and a set of related parameters or metadata. The metadata can specify the motion and transformation of each audio element in the sound scene, so as to reproduce the audio scene as originally designed by the artist.
  • the experience provided by object-based audio usually exceeds that of general mono audio acquisition, which makes the audio more likely to meet the artistic intention of the producer.
  • an object-based audio representation may generally indicate an audio signal including an object.
  • the spatial accuracy of the object-based audio representation depends on the metadata and the rendering system; it is not directly related to the number of channels contained in the audio.
  • the acquisition of the object-based audio representation can be performed by using any appropriate acquisition equipment, such as a speaker, and can be processed appropriately.
  • mono audio tracks can be acquired and further processed based on metadata to obtain an object-based audio representation.
  • a sound object usually uses mono audio tracks recorded or generated with sound design. These mono audio tracks can serve as sound elements to be further processed in a tool such as a Digital Audio Workstation (DAW), for example using metadata to specify that the sound elements lie on a horizontal plane around the listener, or even anywhere in three-dimensional space. Therefore, a “track” in the DAW can correspond to an audio object.
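  • as a rough illustration, one such object-based audio element (a mono waveform plus the positional metadata a DAW track would carry) might be modeled as follows; the field names are illustrative, not taken from the disclosure:

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class AudioObject:
        samples: np.ndarray         # the mono sound element
        azimuth_deg: float = 0.0    # position on the horizontal plane around the listener
        elevation_deg: float = 0.0  # height, for placement anywhere in 3-D space
        gain: float = 1.0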
  • the audio acquisition system can generally consider the following factors and make corresponding optimization:
  • audio acquisition processing and various audio representations are only exemplary and not restrictive.
  • the audio representation can also be in other suitable forms that are known or will be known in the future, and can be acquired by appropriate means, as long as such audio representation can be acquired from the music scene and can be used for presentation to the user.
  • after the audio signal is acquired by the audio capture/acquisition system, it will be input to the production stage for audio content production.
  • the creator needs to have the ability to edit sound objects and generate metadata, and the aforementioned metadata generation operations can be performed here.
  • the producer's creation of audio contents can be realized in various appropriate ways.
  • the input audio data and audio metadata are received and processed, especially by authoring and metadata marking, to obtain production results.
  • the input of audio processing may include, but is not limited to, object-based audio signals, FOA (First-Order Ambisonics), HOA (Higher-Order Ambisonics), stereo, surround sound, etc.; in particular, the input may also include scene information, metadata, etc. associated with the input audio.
  • audio data is input to the audio track interface for processing, and audio metadata is processed via general audio metadata (such as the ADM extension).
  • standardization can also be carried out, especially on the results obtained by authoring and metadata marking.
  • in the audio content production process, the creator also needs to be able to monitor and modify the work in time.
  • an audio rendering system can be provided to provide the monitoring function of the scene.
  • the rendering system provided for the creator to perform monitoring should be the same as the rendering system provided to consumers, to ensure a consistent experience.
  • Audio contents with an appropriate audio production format can be obtained during or after the audio content production process.
  • the audio production format can be various appropriate formats.
  • the audio production format may be specified in ITU-R BS.2266-2.
  • ITU-R BS.2266-2 specifies channel-based, object-based and scene-based audio representations, as shown in Table 1 below. All the signal types in Table 1 can describe three-dimensional audio whose goal is to bring an immersive experience.
  • TABLE 1: Audio production formats
  • Channel-based audio: for example, a full mix or microphone array recording mix, or music for a specific speaker layout (e.g., stereo, 5.1, 7.1+4).
  • Object-based audio: an audio element with position metadata, rendered to the target speaker layout or headphones; for example, dialogue or a helicopter sound.
  • Scene-based audio: B-Format (first-order Ambisonics) or higher-order Ambisonics (HOA); for example, a crowd or ambient sound in motion.
  • the signal types shown in the table can be combined with audio metadata to control rendering.
  • the audio metadata includes at least one of the following:
  • Audio production can also be performed by any other appropriate means, any other appropriate apparatus, and can adopt any other appropriate audio production format, as long as the acquired audio signal can be processed for rendering.
  • the audio signal may be subject to further intermediate processing.
  • intermediate processing of audio signals may include storage and distribution of audio signals.
  • audio signals can be stored and distributed in an appropriate format, for example, can be stored in an audio storage format and distributed in an audio distribution format, respectively.
  • the audio storage format and audio distribution format can be of various appropriate forms. The following describes existing spatial audio formats or spatial audio exchange formats related to audio storage and/or audio distribution as an example.
  • an example can be a container format, such as an .mp4 container, which can accommodate spatial (scene-based) and non-diegetic audio.
  • this container format may include a Spatial Audio Box (SA3D), which contains information such as the Ambisonics type, order, channel ordering and normalization.
  • the container format may also include a Non-Diegetic Audio Box (SAND), which is used to represent audio (such as comments, stereo music, etc.) that should remain unchanged when the listener's head rotates.
  • here, ACN (Ambisonics Channel Number) and SN3D (Schmidt semi-normalization) denote the channel ordering and normalization conventions referred to below.
  • another example can be the Audio Definition Model (ADM).
  • the model can be divided into content part and format part.
  • the content part describes the contents contained in the audio, such as the language of the audio tracks (Chinese, English, Japanese, etc.) and loudness.
  • the format part contains the technical information needed for audio to be correctly decoded or rendered, such as the position coordinates of sound objects and the order of HOA components.
  • Recommendation ITU-R BS.2076-0 specifies a series of ADM elements, such as audioTrackFormat (which describes the data format), audioTrackUID (which uniquely identifies an audio track or asset within the audio scene recording), audioPackFormat (which groups audio channels), and so on.
  • ADM can be used for channel-based, object-based and scene-based audio.
  • AmbiX supports audio contents based on HOA scenes.
  • an AmbiX file contains linear PCM data with a word length of 16-, 24- or 32-bit fixed point, or 32-bit floating point, and can support all valid sampling rates in .caf (Apple's Core Audio Format).
  • AmbiX adopts ACN sorting and SN3D normalization, and supports HOA and mixed-order Ambisonics. As a popular format for exchanging Ambisonics contents, AmbiX is developing rapidly.
  • the intermediate processing of the audio signal may also include appropriate compression processing.
  • the produced audio content can be encoded/decoded to obtain a compression result, and then the compression result can be provided to the rendering side for rendering.
  • compression processing can help to reduce data transmission overhead and improve data transmission efficiency. Coding and decoding in compression can be realized by any suitable technology.
  • Audio intermediate processing can also include any other appropriate processing, and can also adopt any other appropriate format, as long as the processed audio signal can be effectively transmitted to the audio rendering end for rendering.
  • the audio transmission process also includes the transmission of metadata
  • the metadata can be of various appropriate forms and can be applied to all audio renderers/rendering systems, or can be correspondingly applied to each audio renderer/rendering system respectively.
  • such metadata may be called rendering-related metadata, and may include, for example, basic metadata and extended metadata; the basic metadata can be, for example, ADM basic metadata conforming to BS.2076.
  • ADM metadata describing the audio format can be given in the form of XML (Extensible Markup Language).
  • metadata can be appropriately controlled, such as under hierarchical control.
  • the metadata is mainly realized by XML encoding.
  • the metadata in XML format can be included in the “axml” or “bxml” chunk of an audio file in BW64 format for transmission; in the generated metadata, an “audio packet format identifier”, an “audio channel format identifier” and an “audio track unique identifier” can be provided to the BW64 file for linking the metadata with the actual audio tracks.
  • the basic elements of metadata may include, but are not limited to, at least one of the following: audio program, audio content, audio object, audio package format, audio channel format, audio stream format, audio track format, audio track unique identifier, audio block format, etc.
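  • purely as an illustration of the XML form described above, a minimal hand-written “axml” payload might look like the following; the IDs follow the BS.2076 naming pattern, but the values are invented for the example:

    # Hypothetical minimal ADM metadata for the "axml" chunk of a BW64 file.
    ADM_AXML = """<audioFormatExtended>
      <audioObject audioObjectID="AO_1001" audioObjectName="dialogue">
        <audioPackFormatIDRef>AP_00031001</audioPackFormatIDRef>
        <audioTrackUIDRef>ATU_00000001</audioTrackUIDRef>
      </audioObject>
    </audioFormatExtended>"""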
  • the extended metadata can be packaged in various appropriate forms, for example, it can be packaged in a similar way with the aforementioned basic metadata, and it can contain appropriate information, identifiers, and the like.
  • the audio rendering end/playback end can process the audio signal for playback/presentation to the user, in particular, the audio signal is rendered and presented to the user with a desired effect.
  • the processing at the audio rendering end may include processing the signal from the audio production stage before rendering. As an example, as shown in FIG. 2 B, according to the processing result at the production side, metadata is recovered and rendered by using the audio track interface and general audio metadata (such as the ADM extension); audio rendering is then performed on the results after metadata recovery, and the obtained results are input into an audio apparatus for consumption by consumers.
  • the processing at the audio rendering end may include various appropriate types of audio rendering.
  • corresponding audio rendering processing can be adopted.
  • input data at the audio rendering end can be composed of a renderer identifier, metadata and audio signals. The audio rendering end can select a corresponding renderer according to the transmitted renderer identifier, and the selected renderer can then read the corresponding metadata information and audio files, thereby performing audio playback.
  • the input data at the audio rendering end can take various appropriate forms, for example various appropriate packaging formats, such as a hierarchical format: metadata and audio files can be packaged in the inner layer, and the renderer identifier can be packaged in the outer layer.
  • the metadata and audio files may be in BW64 file format, and the outermost layer may package a renderer identifier, such as a renderer label, a renderer ID, and the like.
  • the audio rendering process may employ scene-based audio (SBA) rendering.
  • rendering can be generated adaptively, mainly for the application scene, independently of the capture or creation of the sound scene.
  • the rendering of a sound scene can usually be performed on a receiving apparatus, generating real or virtual speaker signals.
  • the sound field rotates according to the motion of the head.
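  • for first-order material, such head-tracked rotation reduces to mixing the X and Y components; a sketch follows (sign conventions differ between tools, so the direction of rotation here is an assumption):

    import numpy as np

    def rotate_foa_yaw(w, x, y, z, yaw_rad):
        """Rotate a first-order sound field about the vertical axis.

        W and Z are invariant under yaw; X (front) and Y (left) mix."""
        x_r = np.cos(yaw_rad) * x - np.sin(yaw_rad) * y
        y_r = np.sin(yaw_rad) * x + np.cos(yaw_rad) * y
        return w, x_r, y_r, z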
  • the audio rendering process may employ channel-based audio rendering.
  • each channel is associated with a corresponding speaker and can be presented through the corresponding speaker.
  • the position of the speaker is standardized in ITU-R BS.2051 or MPEG CICP, for example.
  • in a scenario of immersive audio, each speaker channel can be rendered to headphones as a virtual sound source in the scene; in other words, the audio signal of each channel can be rendered to the correct position in a virtual listening room according to the standard.
  • the most direct method is to filter the audio signal of each virtual sound source with the response function measured in a reference listening room.
  • the acoustic response functions can be measured by microphones placed in the ears of a person or a dummy head. They are called binaural room impulse responses (BRIRs).
  • this method can provide high audio quality and accurate positioning, but it has the disadvantage of high computational complexity, especially when the BRIRs to be rendered span many channels and are long. Therefore, alternative methods have been developed to reduce complexity while maintaining audio quality. Usually, these alternative methods involve a parametric model of the BRIR, for example using sparse filters or recursive filters.
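  • a sketch of the direct (non-parametric) method described above, assuming equal-length channel signals and one BRIR pair per channel:

    import numpy as np
    from scipy.signal import fftconvolve

    def render_channels_binaural(channels, brirs):
        """channels: list of 1-D arrays; brirs: list of (left_ir, right_ir) pairs."""
        left = sum(fftconvolve(ch, ir_l) for ch, (ir_l, _) in zip(channels, brirs))
        right = sum(fftconvolve(ch, ir_r) for ch, (_, ir_r) in zip(channels, brirs))
        return left, right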
  • the audio rendering process may employ object-based audio rendering.
  • audio rendering can be performed with the object and associated metadata in mind.
  • each object sound source is presented independently along with its metadata; the metadata describes the spatial properties of each sound source, such as position, direction and width. Using these properties, each sound source is rendered separately in the three-dimensional audio space around the listener.
  • speaker array rendering can use different types of speaker panning methods, such as vector base amplitude panning (VBAP), and the sound played by the speaker array can give the listener the impression that the object sound source is located at the specified position (a minimal sketch is given after this discussion).
  • there are also many different ways to render for headphones, such as using the HRTF (Head-Related Transfer Function) for the corresponding direction of each sound source to directly filter the sound source signal.
  • an indirect rendering method can be used to render the sound source to a virtual speaker array, and then perform binaural rendering for each virtual speaker.
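  • returning to the speaker-panning case above, a minimal two-dimensional VBAP sketch (one source, one speaker pair; real implementations pick the bracketing pair and handle 3-D speaker triplets):

    import numpy as np

    def vbap_pair_gains(p, l1, l2):
        """p, l1, l2: 2-D unit vectors (source and two speaker directions)."""
        L = np.column_stack([l1, l2])
        g = np.linalg.solve(L, p)      # solve p = g1*l1 + g2*l2
        return g / np.linalg.norm(g)   # constant-power normalization

    # Example: source at 15 degrees between speakers at +/-30 degrees.
    ang = np.deg2rad
    p  = np.array([np.cos(ang(15)),  np.sin(ang(15))])
    l1 = np.array([np.cos(ang(30)),  np.sin(ang(30))])
    l2 = np.array([np.cos(ang(-30)), np.sin(ang(-30))])
    g1, g2 = vbap_pair_gains(p, l1, l2)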
  • the present disclosure proposes audio rendering with good compatibility and high efficiency, which can be compatible with various input audio formats and various desired audio outputs, while ensuring the rendering effect and efficiency.
  • an audio signal in a common spatial format that can be used in the user application scenario can be obtained from the received input audio signals; that is, even if the received input audio signals contain or are audio representation signals in different formats, such signals can be transformed/encoded into an audio signal in the common spatial format. The audio signal in the common spatial format can then be decoded according to the type of playback apparatus in the user's listening environment, so as to obtain output audio particularly suitable for that playback apparatus. In this way, various input and output formats are accommodated, and for each input an output format suited to the playback apparatus in the user's listening environment can be obtained, thereby realizing an audio rendering system, and in turn an audio system, with good compatibility.
  • the present disclosure realizes improved audio rendering, especially improved immersive audio rendering.
  • FIG. 4 A shows a block diagram of some embodiments of an audio rendering system according to an embodiment of the present disclosure.
  • the audio rendering system 4 includes an acquisition module 41 configured to acquire an audio signal in a specific spatial format based on an input audio signal, where the audio signal in the specific spatial format may be an audio signal in a common spatial format, derived from various possible audio representation signals, for use in the user's application scenarios; and an audio signal decoding module 42 configured to spatially decode the encoded audio signal in the specific spatial format to obtain a decoded audio signal for audio rendering, so that audio can be presented/played back to a user based on the spatially decoded audio signal.
  • the audio signal in a specific spatial format can be called an intermediate audio signal during audio rendering, or an intermediate signal medium, which can have a common specific spatial format derivable from various input audio signals, for example, it can be any suitable spatial format as long as it can be supported by user application scenarios/user playback environments and is suitable for playback in the user playback environments.
  • the intermediate signal can be a kind of signal that is relatively independent of the sound source, and can be applied to different scenarios/apparatuses for playback according to different decoding methods, thereby improving the universality of the audio rendering system of the present application.
  • the audio signal in the specific spatial format can be an audio signal of Ambisonics type, and more specifically, the audio signal in the specific spatial format can be any one or more of FOA (First Order Ambisonics), HOA (Higher Order Ambisonics) and MOA (Mixed-Order Ambisonics).
  • the audio signal in the specific spatial format can be appropriately acquired based on the format of the input audio signal.
  • the input audio signal may be in a distributed spatial audio exchange format, which can be obtained from various audio content formats that have been acquired; spatial audio processing is then performed on such an input audio signal to obtain an audio signal in the specific spatial format.
  • the spatial audio processing may include appropriate processing on the input audio, especially including parsing, format conversion, information processing, encoding, etc., to obtain the audio signal in the specific spatial format.
  • the audio signal in the specific spatial format can be directly obtained from the input audio signal without at least some spatial audio processing.
  • the input audio signal may be in an appropriate format other than the spatial audio exchange format. In particular, the input audio signal may contain or directly be a signal in a specific audio content format, such as a specific audio representation signal, or may contain or directly be an audio signal in the specific spatial format. In such cases the input audio signal need not undergo at least some of the spatial audio processing: the above-mentioned spatial audio processing may be skipped entirely (no parsing, format conversion, information processing, encoding, etc.), or only part of it may be performed, for example only encoding, without parsing or format conversion, so that an audio signal in the specific spatial format is obtained.
  • the acquisition module 41 may include an audio signal encoding module 413 configured to spatially encode the audio signal in the specific audio content format, based on information related to metadata associated with that audio signal, to obtain an encoded audio signal.
  • the encoded audio signal may be included in an audio signal in the specific spatial format.
  • an audio signal in a specific audio content format may, for example, include a spatial audio signal with a specific spatial audio representation, in particular, the spatial audio signal can be at least one of a scene-based audio representation signal, a channel-based audio representation signal, and an object-based audio representation signal.
  • the audio signal encoding module 413 particularly encodes specific types of audio signals among the audio signals in the specific audio content format.
  • a specific type of audio signal is an audio signal that needs or is required to be spatially encoded in the audio rendering system, and may include at least one of a scene-based audio representation signal, an object-based audio representation signal, and specific channel (for example, narrative channel/track) signals in a channel-based audio representation signal.
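  • as one plausible realization of this spatial encoding step (the disclosure does not fix the equations), a mono object signal can be panned into first-order Ambisonics using the direction taken from its metadata; this is the standard FOA encoding in ACN order with SN3D normalization:

    import numpy as np

    def encode_object_foa(s, azimuth_rad, elevation_rad):
        """s: mono signal (1-D array). Returns channels [W, Y, Z, X] (ACN order, SN3D)."""
        w = s.copy()                                         # omnidirectional term
        y = s * np.sin(azimuth_rad) * np.cos(elevation_rad)
        z = s * np.sin(elevation_rad)
        x = s * np.cos(azimuth_rad) * np.cos(elevation_rad)
        return np.stack([w, y, z, x])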
  • the acquisition module 41 may include an audio signal acquisition module 411 configured to acquire an audio signal in the specific audio content format and metadata information associated with the audio signal.
  • the audio signal acquisition module may obtain the audio signal in the specific audio content format and its associated metadata information by parsing the input signal, or may receive the audio signal in the specific audio content format and its associated metadata information as direct input.
  • the acquisition module 41 may further include an audio information processing module 412 configured to extract audio parameters of an audio signal in the specific audio content format based on metadata associated with the audio signal in the specific audio content format, so that the audio signal encoding module may be further configured to spatially encode the audio signal in the specific audio content format based on at least one of the metadata associated with the audio signal and the audio parameters.
  • the audio information processing module can be called a scene information processor, which can provide the audio parameters extracted based on the metadata to the audio signal encoding module for encoding.
  • the audio information processing module is not necessary for the audio rendering of the present disclosure; for example, its information processing function may not be executed, or the module may be outside the audio rendering system, or included in other modules, such as the audio signal acquisition module or the audio signal encoding module, or its function may be realized by other modules. It is therefore indicated by a dotted line in the drawings.
  • the audio rendering system may include a signal adjustment module 43 configured to perform signal processing on the decoded audio signal.
  • the signal processing carried out by the signal adjustment module can be called signal post-processing, in particular post-processing of the decoded audio signal before it is played back by a playback apparatus. Therefore, the signal adjustment module can also be called a signal post-processing module.
  • the signal adjustment module 43 can be configured to adjust the decoded audio signal based on the characteristics of the playback apparatus in the user application scenario, so as to enable the adjusted audio signal to present a more appropriate acoustic experience when rendered by an audio rendering apparatus.
  • the audio signal adjustment module is likewise not necessary for the audio rendering of the present disclosure; for example, the signal adjustment function may not be performed, or the module may be outside the audio rendering system, or included in other modules, such as the audio signal decoding module, or its function may be realized by the decoding module. It is therefore indicated by a dotted line in the drawings.
  • the audio rendering system 4 may also include or be connected to an audio input port for receiving the input audio signal; the input audio signal may be distributed and transmitted to the audio rendering system within the audio system, as mentioned above, or may be directly input by the user at the user end or the consumer end, as described later. Additionally, the audio rendering system 4 may include or be connected to output apparatuses, such as audio presentation apparatuses and audio playback apparatuses, which can present the spatially decoded audio signals to users. According to some embodiments of the present disclosure, an audio presentation apparatus or audio playback apparatus may be any suitable audio apparatus, such as a speaker, a speaker array, a headphone, or any other suitable apparatus capable of presenting an audio signal to a user.
  • FIG. 4 B shows a schematic conceptual diagram of an audio rendering process according to an embodiment of the present disclosure, showing a flow of acquiring an output audio signal suitable for rendering in a user application scenario, especially for presentation/playback to a user through an apparatus in a playback environment, based on an input audio signal.
  • an audio signal in a specific spatial format that can be used for playback in the user application scenario can be obtained.
  • appropriate processing is performed to obtain the audio signal in the specific spatial format.
  • the input audio signal contains an audio signal in a spatial audio exchange format distributed to the audio rendering system
  • the input audio signal can be subjected to spatial audio processing to obtain the audio signal in the specific spatial format.
  • the spatial audio exchange format can be any known suitable format for audio signals during signal transmission, such as the audio distribution format in audio signal distribution mentioned above, which will not be described in detail here.
  • spatial audio processing may include at least one of parsing, format conversion, information processing, encoding, etc. that is performed on the input audio signal.
  • audio signals in various audio content formats can be obtained from input audio signals through audio parsing, and then the parsed signals can be encoded to obtain audio signals in spatial formats suitable for rendering in user application scenarios, that is, playback environments, for playback.
  • format conversion and signal information processing can optionally be performed before encoding. Therefore, an audio signal with a specific spatial audio representation can be obtained from the input audio signal, and the audio signal in the specific spatial format can be obtained based on the audio signal with the specific spatial audio representation.
  • an audio signal with a specific audio representation can be obtained from an input audio signal, such as at least one of a scene-based audio representation signal, an object-based audio representation signal, and a channel-based audio representation signal.
  • the input audio signal is an audio signal in a spatial audio exchange format
  • the input audio signal can be parsed to obtain a spatial audio signal with a specific spatial audio representation and the metadata information corresponding to that signal; for example, the spatial audio signal may be at least one of a scene-based audio representation signal, a channel-based audio representation signal and an object-based audio representation signal. Optionally, the spatial audio signal can be further converted into a predetermined format, for example a format pre-specified/predefined in the audio rendering system. Of course, this format conversion is not necessary.
  • audio processing can be performed based on the audio representation manner of the audio signal.
  • spatial audio coding may be performed on at least one of the scene-based audio representation signal, the object-based audio representation signal, and the narrative channel in the channel-based audio representation signal to obtain the audio signal with the specific spatial format. That is, even though the format/representation manner of the input audio signal may be different, the input audio signal can be converted into a common audio signal with a specific spatial format for decoding and rendering.
  • the spatial audio coding process can be performed based on information related to the metadata associated with the audio signals. The metadata-related information can include metadata of the audio signals obtained directly, for example derived from the input audio signals during parsing, and/or the corresponding audio parameters of the spatial audio signals obtained by information processing on that metadata, in which case the spatial audio coding process can be performed based on those audio parameters.
  • the input audio signal can be in any appropriate format other than the spatial audio exchange format, for example a specific spatial representation signal or even a signal already in the specific spatial format; in such a case, an audio signal in the specific spatial format can be obtained while at least some of the aforementioned spatial audio processing is skipped.
  • for example, the format conversion and coding can be performed directly, without the aforementioned audio parsing process; and when the input audio signal already has the predetermined format, the encoding process can be performed directly, without format conversion.
  • the input audio signal can directly be the audio signal in the specific spatial format
  • such an input audio signal can be transmitted directly/transparently to the audio signal spatial decoder, without undergoing spatial audio processing such as parsing, format conversion, information processing or encoding.
  • the input audio signal is a scene-based spatial audio representation signal
  • such an input audio signal can be directly transmitted to the spatial decoder as a signal in a specific spatial format without being subject to the aforementioned spatial audio processing.
  • in a case where the input audio signal is not a distributed audio signal in a spatial audio exchange format, for example when it is an audio signal with the aforementioned specific spatial audio representation or an audio signal in the specific spatial format, it can be directly input at the user end/consumer end, for example acquired directly from an application program interface (API) disposed in the rendering system.
  • if the input audio signal has a format specified by the system and a representation that the system can handle, it can be transmitted directly to the spatial coding processing module without the aforementioned parsing and transcoding.
  • in a case where the input audio signal is, for example, a non-narrative channel signal or a binaural signal after reverberation processing, the input audio signal can be transmitted directly to the spatial decoding module for decoding, without the aforementioned spatial audio encoding processing.
  • spatial decoding can be performed on the obtained audio signal in the specific spatial format. In particular, the obtained audio signal in the specific spatial format can be called the audio signal to be decoded, and the spatial decoding aims to convert the audio signal to be decoded into a format suitable for playback by a playback apparatus or rendering apparatus in the user application scenario, such as an audio playback environment or an audio rendering environment.
  • decoding can be performed according to an audio signal playback mode, which can be indicated in various appropriate ways, for example by an identifier, and communicated to the decoding module in various appropriate ways, for example along with the input audio signal, or input to the decoding module by other input apparatuses.
  • audio signal decoding can utilize a decoding manner corresponding to the playback apparatus in the user application scenario, in particular a decoding matrix, to decode the audio signal in the specific spatial format and transform the audio signal to be decoded into audio in an appropriate format.
  • audio signal decoding can also be performed in other appropriate ways, such as virtual signal decoding.
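  • one common way to build such a decoding matrix (a mode-matching sketch for the first-order case; actual decoders may add max-rE weighting, near-field compensation, etc.) is to take the pseudo-inverse of the spherical-harmonic matrix sampled at the loudspeaker directions:

    import numpy as np

    def sh_vector(az, el):
        """First-order real spherical harmonics at (az, el); ACN order, SN3D."""
        return np.array([1.0,
                         np.sin(az) * np.cos(el),   # Y
                         np.sin(el),                # Z
                         np.cos(az) * np.cos(el)])  # X

    def decoding_matrix(speaker_dirs):
        """speaker_dirs: list of (azimuth, elevation) in radians."""
        Y = np.column_stack([sh_vector(az, el) for az, el in speaker_dirs])
        return np.linalg.pinv(Y)  # speaker_feeds = D @ foa_channels

    # Example: square loudspeaker layout in the horizontal plane.
    dirs = [(np.deg2rad(a), 0.0) for a in (45, 135, -135, -45)]
    D = decoding_matrix(dirs)     # shape: (4 speakers, 4 FOA channels)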
  • the decoded output can be post-processed, in particular by signal adjustment, which adapts the spatially decoded audio signal to a specific playback apparatus in the user application scenario, especially its signal characteristics, so that the adjusted audio signal presents a more appropriate acoustic experience when rendered by an audio rendering apparatus.
  • the decoded audio signal or the adjusted audio signal can be presented to the user in a user application scenario, for example, in an audio playback environment through an audio rendering apparatus/audio playback apparatus, so as to meet the requirements of the user.
  • audio signal processing can be performed in blocks, and the block size can be set.
  • the block size can be preset and kept unchanged during processing.
  • the block size can be set when the audio rendering system is initialized.
  • the metadata can be parsed in blocks, and the information in the scenario can then be adjusted according to the metadata; such an operation can be included in the operation of the scene information processing module according to an embodiment of the present disclosure, for example.
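  • a minimal sketch of such block-wise processing, with the block size fixed at initialization (the function names are illustrative):

    import numpy as np

    BLOCK_SIZE = 1024  # set once when the rendering system is initialized

    def process_in_blocks(signal, process_block):
        """Apply process_block() to consecutive blocks of a 1-D signal."""
        out = []
        for start in range(0, len(signal), BLOCK_SIZE):
            out.append(process_block(signal[start:start + BLOCK_SIZE]))
        return np.concatenate(out)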
  • the signal suitable for rendering processing by an audio rendering system may be an audio signal in a specific audio content format.
  • an audio signal in a specific audio content format can be directly input into the audio rendering system, that is, it can serve directly as the input signal and thus be acquired directly.
  • an audio signal in a specific audio content format can be acquired from an audio signal input to an audio rendering system.
  • the input audio signal may be an audio signal in another format, such as a combined signal containing an audio signal in the specific audio content format, or a signal in some other format.
  • the audio signal in the specific audio content format can be acquired by parsing the input audio signal.
  • the input signal acquisition module can be called an audio signal parsing module, and the signal processing it performs can be called signal fore-processing, that is, processing performed before the audio signal is encoded.
  • FIGS. 4 C and 4 D illustrate exemplary processing of an audio signal parsing module according to an embodiment of the present disclosure.
  • audio signals may be input in different input formats; therefore, audio signal parsing can be performed before the audio rendering processing so as to be compatible with inputs in different formats.
  • audio signal parsing process can be considered as a fore-processing/preprocessing.
  • the audio signal parsing module can be configured to acquire, from an input audio signal, an audio signal in an audio content format compatible with the audio rendering system together with its associated metadata information. In particular, it can parse an input signal in an arbitrary spatial audio exchange format to acquire the audio signal in the compatible audio content format, which can include at least one of an object-based audio representation signal, a scene-based audio representation signal and a channel-based audio representation signal, and the associated metadata information.
  • FIG. 4 C shows the parsing process for signal input in any spatial audio exchange format.
  • the audio signal parsing module can further convert the acquired audio signal in the compatible audio content format so that it has a predetermined format, especially the predetermined format of the audio rendering system; for example, it can convert the signal into a format specified by the audio rendering system according to the signal format type.
  • the predetermined format may correspond to predetermined configuration parameters for the audio signal in the specific audio content format, so that the audio signal in the specific audio content format may be further converted to those predetermined configuration parameters in the audio signal parsing operation.
  • the signal parsing module is configured to convert the scene-based audio signals with different channel ordering and normalization coefficients into channel ordering and normalization coefficients specified by the audio rendering system.
  • any signal in spatial audio exchange format used for distribution can be divided into three types of signals through the input signal parser according to the signal representation manner of spatial audio, namely at least one of scene-based audio representation signal, channel-based audio representation signal and object-based audio representation signal, and the corresponding metadata of such signals.
  • the signal can be converted into a system-constrained format according to the format type in the fore-processing.
  • different channel ordering such as ACN (Ambisonics Channel Number), FuMa (Furse-Malham) and SID (Single Index Design) and different normalization coefficients (N3D, SN3D, FuMa) may be used in different data exchange formats. In this step, they can be converted into a certain agreed channel ordering and normalization coefficient, such as ACN+SN3D.
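As an illustration of this conversion step, the sketch below converts a first-order (FOA) signal from FuMa ordering with FuMa (maxN) normalization to the agreed ACN ordering with SN3D normalization; the √2 gain on W and the channel reordering hold for first order only, and the function name is illustrative.

```python
import numpy as np

# FuMa first-order ordering is (W, X, Y, Z); ACN ordering is (W, Y, Z, X).
FUMA_TO_ACN = [0, 2, 3, 1]
# For first order, maxN and SN3D agree on X/Y/Z; only W differs by sqrt(2).
FUMA_TO_SN3D_GAIN = np.array([np.sqrt(2.0), 1.0, 1.0, 1.0])

def fuma_to_acn_sn3d(foa: np.ndarray) -> np.ndarray:
    """foa: (4, num_samples) FOA signal in FuMa ordering/normalization."""
    scaled = foa * FUMA_TO_SN3D_GAIN[:, None]  # fix normalization first
    return scaled[FUMA_TO_ACN]                 # then reorder the channels
```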
  • in the case that the input audio signal is not a distributed signal in a spatial audio exchange format, it may not be necessary to perform at least some of the spatial audio processing on the input audio signal.
  • the input specific audio signal can be directly at least one of the above three signal representations, so that the above signal parsing process can be omitted, and the audio signal and its associated metadata can be directly transmitted to the audio signal encoding module.
  • FIG. 4 D illustrates processing for a specific audio signal input according to other embodiments of the present disclosure.
  • the input audio signal may even be an audio signal with the specific spatial format as mentioned above, and such an input audio signal can be transmitted directly/transparently to the audio signal decoding module, without performing the spatial audio processing including parsing, format conversion, audio coding, etc.
  • the audio rendering system may further include a specific audio input apparatus for directly receiving the input audio signal and transmitting it directly/transparently to an audio signal encoding module or an audio signal decoding module.
  • a specific input apparatus can be, for example, an application program interface (API); the format of the input audio signal it can receive is set in advance, for example corresponding to the specific spatial format mentioned above, such as at least one of the three signal representations mentioned above, so that when the input apparatus receives the input audio signal, the input audio signal can be transmitted directly/transparently without at least some of the spatial audio processing.
  • such a specific input apparatus can also be part of the audio signal acquisition operation/module, or even included in the audio signal parsing module.
  • the audio signal parsing module can be implemented in various appropriate ways.
  • the audio signal parsing module may include a parsing sub-module and a direct transmission sub-module.
  • the parsing sub-module may receive only audio signals in a spatial exchange format for audio parsing.
  • the direct transmission sub-module may receive an audio signal in a specific audio content format or a specific audio representation signal for direct transmission.
  • the audio rendering system can be set so that the audio signal parsing module receives two inputs, namely, an audio signal in a spatial exchange format and an audio signal in a specific audio content format or a specific audio representation signal.
  • the audio signal parsing module may include a judgment sub-module, a parsing sub-module and a direct transmission sub-module, so that the audio signal parsing module can receive any type of input signal and process it appropriately.
  • the judgment sub-module can judge the format/type of the input audio signal; if the input audio signal is judged to be an audio signal in a spatial audio exchange format, it is transferred to the parsing sub-module to perform the above-mentioned parsing operation; otherwise, the direct transmission sub-module can directly/transparently transmit the audio signal to the stages of format conversion, audio coding, audio decoding, etc., as described above.
  • the judgment sub-module can also be outside the audio signal parsing module. Audio signal judgment can be realized in various known and appropriate ways, which will not be described in detail here.
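A minimal sketch of this judgment-and-routing logic follows; the format enumeration and function names are hypothetical stand-ins for the sub-modules described above.

```python
from enum import Enum, auto

class InputFormat(Enum):
    SPATIAL_AUDIO_EXCHANGE = auto()  # distributed exchange format
    OBJECT_BASED = auto()            # already-supported representations
    SCENE_BASED = auto()
    CHANNEL_BASED = auto()

def route_input(signal, fmt, parse, transmit_directly):
    """Judge the input format and route the signal accordingly."""
    if fmt is InputFormat.SPATIAL_AUDIO_EXCHANGE:
        return parse(signal)             # -> parsing sub-module
    return transmit_directly(signal)     # -> direct transmission sub-module
```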
  • the audio rendering system may include an audio information processing module configured to acquire audio parameters of an audio signal in a specific audio content format based on metadata associated with the audio signal in the specific audio content format, in particular, acquire audio parameters based on metadata associated with the specific type of audio signal as metadata information that can be used for encoding.
  • the audio information processing module may be called a scene information processing module/processor, and the audio parameters acquired by the audio information processing module may be input to an audio signal encoding module, whereby the audio signal encoding module may be further configured to spatially encode the specific type of audio signal based on the audio parameters.
  • a specific type of audio signal may include the aforementioned audio signal in an audio content format compatible with the audio rendering system, such as at least one of the aforementioned scene-based audio representation signal, object-based audio representation signal, and channel-based audio representation signal, and especially at least one of specific types of channel signals in object-based audio representation signal, scene-based audio representation signal, and channel-based audio representation signal.
  • the specific type of channel signal may be called a first specific type of channel signal, which may include non-narrative sound channels/tracks in the channel-based audio representation signal.
  • the specific type of channel signal may also include narrative channels/tracks that need not be spatially encoded according to the application scenario.
  • the audio information processing module is further configured to acquire the audio parameters of the specific type of audio signal based on the audio content format of the specific type of audio signal, in particular, acquire the audio parameters based on the audio content format of the audio signal in an audio content format compatible with the audio rendering system derived from the input audio signal, for example, the audio parameters may be specific types of parameters corresponding to the audio content format respectively, as mentioned above.
  • when the audio signal is an object-based audio representation signal, the audio information processing module is configured to acquire spatial attribute information of the object-based audio representation signal as an audio parameter that can be used for spatial audio encoding processing.
  • the spatial attribute information of the audio signal includes azimuth information of each audio element in the coordinate system, or relative azimuth information of the sound source related to the audio signal relative to the listener.
  • the spatial attribute information of the audio signal further includes the distance information of each sound element of the audio signal in the coordinate system.
  • azimuth information of each sound element in the coordinate system, such as azimuth and elevation, optionally distance information, or relative azimuth information of each sound source relative to the listener's head, can be obtained.
  • when the audio signal is a scene-based audio representation signal, the audio information processing module is configured to acquire rotation information related to the audio signal based on metadata information associated with the audio signal, for spatial audio encoding processing.
  • the rotation information related to the audio signal includes at least one of the rotation information of the audio signal and the rotation information of the listener of the audio signal.
  • rotation information of scene audio and rotation information of listeners are read from metadata.
  • when the audio signal is a channel-based audio signal, the audio information processing module is configured to acquire audio parameters based on the channel track type of the audio signal.
  • the audio encoding process will mainly focus on specific types of channel-based audio signals that need to be spatially encoded, especially narrative channel tracks in channel-based audio signals, and the audio information processing module can be configured to split the audio representations of channels into audio elements by channel to convert them into metadata as audio parameters.
  • the narrative channel tracks in a channel-based audio signal may not be subject to spatial audio coding; for example, the spatial audio coding may be omitted depending on the specific application scenario, and such tracks may be transmitted directly to the decoding stage or further processed depending on the playback mode.
  • the audio representation of channels can be divided into audio elements, channel by channel, according to the standard definition of channels, and converted into metadata for processing.
  • spatial audio processing can also be omitted, and audio mixing can be performed for different playback modes in subsequent stages.
  • for non-narrative audio tracks, because there is no need for dynamic spatialization, they can be mixed for the different playback methods in the subsequent stages; that is to say, the non-narrative audio tracks are not processed by the audio information processing module (no spatial audio processing is performed on them), but can be transmitted directly/transparently, bypassing the audio information processing module.
  • FIG. 4 E shows a block diagram of some embodiments of an audio signal encoding module, wherein the audio signal encoding module may be configured to spatially encode an audio signal in a specific audio content format based on information related to metadata associated with the audio signal in the specific audio content format to obtain an encoded audio signal. Additionally, the audio signal encoding module can also be configured to acquire an audio signal in a specific audio content format and information related to associated metadata.
  • the audio signal encoding module may receive the audio signal and metadata related information, such as those generated by the aforementioned audio signal parsing module and audio signal processing module, via, for example, an input port/input apparatus.
  • the audio signal encoding module can realize the operations of the aforementioned audio signal acquisition module and/or audio signal processing module, and for example, it can include the aforementioned audio signal acquisition module and/or audio signal processing module to acquire the audio signal and metadata.
  • the audio signal coding module can also be called the audio signal spatial coding module/encoder.
  • FIG. 4 F shows a flowchart of some embodiments of an audio signal encoding operation, in which an audio signal in a specific audio content format and information related to metadata associated with the audio signal are acquired, and the audio signal in the specific audio content format is spatially encoded based on the information related to the metadata associated with it to obtain an encoded audio signal.
  • the acquired audio signal in the specific audio content format may be called an audio signal to be encoded.
  • the acquired audio signal may be a non-direct transmission/non-transparent transmission audio signal, and may have various audio content formats or audio representations, such as at least one of the three representations of audio signals mentioned above, or other suitable audio signals.
  • such an audio signal may be, for example, an object-based audio representation signal, a scene-based audio representation signal, or a signal that has been specified in advance to be encoded for a specific application scenario, such as a narrative channel track in the channel-based audio representation signal as mentioned above.
  • the obtained audio signal can be directly input (for example, a signal on which signal parsing is not performed, as mentioned above), or it can be extracted/parsed from the input audio signal, for example by the signal parsing module mentioned above. A signal that needs no audio encoding, such as a specific type of channel signal in the channel-based audio representation signal (which can be called the second specific type of channel signal here), for example the narrative channel tracks that are not specified to be encoded or the non-narrative channel tracks that themselves do not require encoding as mentioned above, will not be input into the audio signal encoding module; for example, it will be transmitted directly to the subsequent decoding module.
  • the specific spatial format can be a spatial format that can be supported by the audio rendering system, for example, it can be played back to users in different user application scenarios, such as different audio playback environments.
  • the encoded audio signal in a specific spatial format can be used as an intermediate signal medium, that is, it indicates that an intermediate signal with a common format is encoded from an input audio signal that may contain various spatial representations, and decoding processing is performed on the intermediate signal for rendering.
  • the encoded audio signal in the specific spatial format can be the audio signal in the specific spatial format as mentioned above, such as FOA, HOA, MOA, etc., which will not be described in detail here.
  • for audio signals that may have at least one of a variety of different spatial representations, spatial encoding can be performed to obtain encoded audio signals in a specific spatial format that can be used for playback in user application scenarios; that is, even though the audio signals may contain different content formats/audio representations, audio signals in a common spatial format can be obtained through encoding.
  • the encoded audio signal may be added to the intermediate signal, for example, encoded into the intermediate signal.
  • the encoded audio signal can also be transmitted directly/transparently to the spatial decoder, without needing to be added to the intermediate signal. In this way, the audio signal encoding module can be compatible with various types of input signals to obtain encoded audio signals in a common space format, so that audio rendering processing can be performed efficiently.
  • the audio signal encoding module can be realized in various appropriate ways, for example, it can include an acquisition unit and an encoding unit that respectively realize the above acquisition and encoding operations.
  • such a spatial encoder, acquisition unit and encoding unit can be implemented in various appropriate forms, such as software, hardware, firmware, etc., or any combination thereof.
  • the audio signal encoding module can be implemented to receive only the audio signal to be encoded, such as the audio signal to be encoded directly input or obtained from the audio signal parsing module. That is to say, the signal input to the audio signal encoding module must be encoded.
  • the acquisition unit can be implemented as a signal input interface, which can directly receive the audio signal to be encoded.
  • the audio signal encoding module can be implemented to receive audio signals or audio representation signals in various audio content formats.
  • the audio signal encoding module may also include a judgment unit, which can judge whether the audio signal received by the audio signal encoding module is an audio signal that needs to be encoded, and if it is judged to be an audio signal that needs to be encoded, the audio signal is transmitted to the acquisition unit and the encoding unit; and if it is judged to be an audio signal that does not need to be encoded, the audio signal is directly transmitted to the decoding module without audio encoding.
  • the judgment can be performed in various appropriate ways; for example, a comparison can be made with reference to the audio content format or audio signal representation manner, and when the format or representation manner of the input audio signal matches the format or representation manner of audio signals that need to be encoded, it is judged that the input audio signal needs to be encoded.
  • the judgment unit can also receive other reference information, such as application scenario information or rules specified in advance for a specific application scenario, and make a judgment based on this reference information; as mentioned above, when the rules specified in advance for a specific application scenario are known, the audio signals that need to be encoded can be selected from the audio signals according to the rules.
  • the judgment unit may also acquire an identifier related to the signal type, and judge whether the signal needs to be encoded according to the identifier related to the signal type.
  • the identifier may be in various suitable forms, such as a signal type identifier, and any other suitable indication information that can indicate the signal type.
  • information related to metadata associated with an audio signal may include metadata in an appropriate form and may depend on the signal type of the audio signal, and in particular, the metadata information may correspond to a signal representation of the signal.
  • for object-based signal representation, metadata information may be related to the attributes of the audio object, especially spatial attributes; for scene-based signal representation, metadata information can be related to the attributes of the scene; for channel-based signal representation, metadata information can be related to the attributes of the channel.
  • the audio signal is encoded according to the type of the audio signal, and in particular, the audio signal can be encoded based on information related to metadata corresponding to the type of the audio signal.
  • information related to metadata associated with an audio signal may include at least one of metadata associated with the audio signal and audio parameters of the audio signal derived based on the metadata.
  • the metadata related information may include metadata related to the audio signal, such as metadata acquired together with the audio signal, such as directly input or acquired through signal parsing.
  • the metadata-related information may also include the audio parameters of the audio signal derived based on the metadata, as described above for the operation of the information processing module.
  • metadata related information can be obtained through various appropriate ways.
  • metadata information can be obtained through signal parsing, or directly input, or obtained through specific processing.
  • the metadata related information can be the metadata associated with a specific audio representation signal obtained by parsing the distributed input signal in the spatial audio exchange format through the signal parsing process as described above.
  • the metadata-related information can be directly input when the audio signal is input, for example, in the case that the input audio signal can be directly input through the API without performing the aforementioned audio signal parsing, the metadata-related information can be input together with the audio signal when the audio signal is input, or can be input separately from the audio signal.
  • the metadata that is parsed from audio signal or the metadata that is directly input can be further processed, such as information processing, so that appropriate audio parameters/information can be obtained as metadata information for audio encoding.
  • the information processing may be called scene information processing, and in the information processing, processing may be performed based on metadata associated with an audio signal to obtain appropriate audio parameters/information.
  • signals in different formats can be extracted and corresponding audio parameters can be calculated, which can, for example, be related to the rendering application scenario.
  • scene information may be adjusted based on metadata, for example.
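As one concrete example of such information processing, the sketch below derives the relative azimuth, elevation and distance of a sound element from object and listener positions assumed to be carried in the metadata; the field names and coordinate conventions are assumptions, not the disclosure's metadata schema.

```python
import numpy as np

def object_params(obj_pos, listener_pos):
    """Derive azimuth/elevation/distance of a sound element relative to
    the listener from Cartesian positions carried in the metadata."""
    rel = np.asarray(obj_pos, float) - np.asarray(listener_pos, float)
    distance = float(np.linalg.norm(rel))
    azimuth = float(np.arctan2(rel[1], rel[0]))            # x-y plane angle
    elevation = float(np.arcsin(rel[2] / max(distance, 1e-9)))
    return {"azimuth": azimuth, "elevation": elevation, "distance": distance}
```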
  • an audio signal to be encoded will be encoded based on information related to metadata associated with the audio signal.
  • the audio signal to be encoded may include a specific type of audio signal in the audio signal in the specific audio content format, and the specific type of audio signal will be spatially encoded based on metadata related information associated with the specific type of audio signal to obtain an encoded audio signal in a specific spatial format.
  • Such encoding can be called spatial encoding.
  • the audio signal encoding module may be configured to weight the audio signal according to metadata information.
  • the audio signal encoding module may be configured to perform weighting according to weights in metadata.
  • This metadata can be associated with the audio signal to be encoded acquired by the audio signal encoding module, for example, with signals/audio representation signals in various audio content formats, as mentioned above.
  • the audio signal encoding module may be further configured to weight the acquired audio signal, especially the audio signal in a specific audio content format, based on metadata associated with the audio signal.
  • the audio signal encoding module can also be configured to further perform additional processing on the encoded audio signal, such as weighting, rotation, etc.
  • the audio signal encoding module may be configured to convert the audio signal in the specific audio content format into an audio signal in a specific spatial format, and then weight the obtained audio signal in the specific spatial format based on metadata, to obtain an intermediate signal.
  • the audio signal encoding module may be configured to further process the audio signal in the specific spatial format which is converted based on metadata, such as format conversion, rotation, etc.
  • the audio signal encoding module can be configured to convert the encoded or directly input audio signal in the specific spatial format to meet the constrained formats supported by the current system; for example, it can be converted in terms of channel arrangement method, normalization method, etc. to meet the requirements of the system.
  • when the audio signal in the specific audio content format is an object-based audio representation signal, the audio signal encoding module is configured to spatially encode the object-based audio representation signal based on the spatial attribute information of the object-based audio representation signal.
  • encoding can be performed by matrix multiplication.
  • the spatial attribute information of the object-based audio representation signal may include information related to spatial propagation of sound objects based on the audio signal, especially information related to spatial propagation paths of sound objects to listeners.
  • the information related to the spatial propagation paths of the sound objects to the listeners may include at least one of the propagation duration, propagation distance, azimuth information, path energy intensity and nodes along the way of each spatial propagation path from the sound object to the listeners.
  • the audio signal encoding module is configured to spatially encode an object-based audio signal according to at least one of a filter function and a spherical harmonic function, wherein the filter function can be a filter function for filtering the audio signal based on the path energy intensity of a spatial propagation path from a sound object in the audio signal to a listener, and the spherical harmonic function can be a spherical harmonic function based on the azimuth information of the spatial propagation path.
  • audio signal encoding may be performed based on a combination of both a filter function and a spherical harmonic function.
  • audio signal encoding can be performed based on the product of both the filter function and the spherical harmonic function.
  • the spatial audio encoding for the object-based audio signal may be further based on the delay of the sound object in spatial propagation, for example, may be based on the propagation duration of the spatial propagation path.
  • the filter function for filtering the audio signal based on the path energy intensity is a filter function for filtering an audio signal of a sound object before propagating along the spatial propagation path based on the path energy intensity of the path.
  • the audio signal of the sound object before propagating along the spatial propagation path may refer to an audio signal at a moment before the time required for the sound object to reach the listener along the spatial propagation path, for example, the audio signal of the sound object before the propagation duration.
  • the azimuth information of the spatial propagation path may include the direction angle of the spatial propagation path to the listener or the direction angle of the spatial propagation path relative to the coordinate system.
  • the spherical harmonic function based on the azimuth of the spatial propagation path can be any suitable form of spherical harmonic function.
  • the spatial audio coding of the object-based audio signal can further encode the audio signal by adopting at least one of a near-field compensation function and a diffusion function based on the length of the spatial propagation path of the sound object in the audio signal to the listener. For example, depending on the length of the spatial propagation path, at least one of the near-field compensation function and the diffusion function can be applied to the audio signal of the sound object for the propagation path to make appropriate audio signal compensation and enhance the effect.
  • spatial encoding of an object-based audio signal may be performed for one or more spatial propagation paths of a sound object to a listener, respectively. Particularly, when there is one spatial propagation path of the sound object to the listener, the spatial encoding of the object-based audio signal is performed for this spatial propagation path, while when there are multiple spatial propagation paths of the sound object to the listener, the spatial encoding can be performed for at least one or even all of the multiple spatial propagation paths.
  • each spatial propagation path of the sound object to the listener can be considered separately, and the audio signal corresponding to the spatial propagation path can be encoded accordingly, and then the encoding results for respective spatial propagation paths can be combined to obtain the encoding result for the sound object.
  • the spatial propagation path between the sound object and the listener can be determined in various appropriate ways, especially determined by the information processing module mentioned above by acquiring the spatial attribute information.
  • spatial encoding of an object-based audio signal may be performed for each of one or more sound objects contained in the audio signal, and the encoding process for each sound object may be performed as described above.
  • the audio signal encoding module is further configured to perform a weighted combination of the encoded signals of the respective object-based audio representation signals based on the weights for the sound objects defined in the metadata.
  • spatial encoding can be performed on the object-based audio representation signal based on the spatial-propagation-related information of the sound objects of the audio signal; for example, spatial encoding can be performed on the audio representation signal with respect to the spatial propagation path of each sound object as mentioned above, and then the encoded audio signals of the sound objects can be combined with weighting, using the weight for each sound object contained in the metadata associated with the audio representation signal.
  • the audio signal will be written into a delay apparatus. It can be known from the metadata information associated with the audio representation signal, especially the audio parameters obtained by the audio information processing module, that each sound object will have one or more propagation paths to the listener; according to the length of each path, the time t1 required for the sound to travel from the sound object to the listener can be calculated, so the audio signal s of the sound object before t1 can be obtained from the delayer of the audio object, and the audio signal can be filtered by a filtering function E based on the path energy intensity.
  • the azimuth information of the path can be acquired from the metadata information associated with the audio representation signal, especially the audio parameters obtained by the audio information processing module, and a specific function based on the azimuth angle, such as the spherical harmonic function Y for the corresponding channel, can be used, so that based on both, the audio signal can be encoded into an encoded signal, such as the HOA signal S_N.
  • letting N be the number of channels of the HOA signal, the HOA signal S_N obtained by the audio encoding processing can be expressed, for each channel n = 1, …, N and summing over the propagation paths p of the sound object, as:

    S_n(t) = Σ_p E_p( s(t − t_p) ) · Y_n(θ_p, φ_p)

    where s(t − t_p) is the object signal delayed by the propagation duration t_p of path p, E_p is the filter function based on the path energy intensity of path p, and Y_n(θ_p, φ_p) is the spherical harmonic function for channel n evaluated at the azimuth information of the path.
  • the orientation of the path relative to the coordinate system can also be used, instead of the direction to the listener, so that the target sound field signal can be obtained with multiplication by the rotation matrix in the subsequent step, as the encoded audio signal.
  • when the azimuth information of the path is the orientation of the path relative to the coordinate system, multiplication by the rotation matrix can be further performed on the basis of the above formula, to obtain the encoded HOA signal.
  • the encoding operation may be performed in the time domain or the frequency domain. Further, encoding can also be performed based on the distance of the spatial propagation path of the sound object to the listener, and in particular, at least one of near-field compensation function and source spread function can be further applied according to the distance of the path to enhance the effect.
  • the near-field compensation function and/or the source spread function can be further applied on the basis of the above-mentioned encoded HOA signal, and in particular, the near-field compensation function can be applied when the distance of the path is less than the threshold, and the source spread function can be applied when it is greater than the threshold, and vice versa, so as to further optimize the above-mentioned encoded HOA signal.
  • the HOA signals obtained after the signal conversion of each sound object can be superposed with weights according to the weight for the sound object defined in the metadata; that is, a weighted sum signal of all object-based audio signals can be obtained as an encoded signal, which can serve as an intermediate signal.
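Under the assumptions that the order is limited to first order (N = 4 channels, ACN/SN3D) and that the filter function E reduces to a per-path gain, the per-object encoding walk-through above can be sketched as follows; the function and dictionary key names are illustrative.

```python
import numpy as np

def sh_foa(az: float, el: float) -> np.ndarray:
    """Real first-order spherical harmonics, ACN order (W, Y, Z, X), SN3D."""
    ce = np.cos(el)
    return np.array([1.0, np.sin(az) * ce, np.sin(el), np.cos(az) * ce])

def encode_object(signal, paths, sample_rate, weight=1.0):
    """paths: list of dicts with keys 'delay_s', 'gain', 'az', 'el',
    one per propagation path from the sound object to the listener."""
    out = np.zeros((4, len(signal)))
    for p in paths:
        d = int(round(p["delay_s"] * sample_rate))   # delayer: s(t - t_p)
        delayed = np.concatenate([np.zeros(d), signal])[:len(signal)]
        # Per-path gain stands in for the filter function E; the spherical
        # harmonics Y encode the path azimuth/elevation into the channels.
        out += p["gain"] * np.outer(sh_foa(p["az"], p["el"]), delayed)
    return weight * out  # object weight from metadata; sum over objects later
```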
  • the spatial encoding of the object-based audio signal can also be performed based on reverberation information, and the obtained encoded signal can be directly transmitted to the spatial decoder for decoding, or can be added to the intermediate signal output by the encoder.
  • the audio signal encoding module is further configured to obtain reverberation parameter information and perform reverberation processing on the audio signal to acquire the reverberation related signal of the audio signal.
  • the spatial reverberation response of the scene can be obtained, and convolution of audio signal can be performed based on the spatial reverberation response to obtain a reverberation related signal of the audio signal.
  • the reverberation parameter information can be obtained in various appropriate ways, such as obtained from metadata information, obtained from the aforementioned information processing module, input by users or from other input apparatuses, and so on.
  • the spatial room reverberation response for a user application scenario may be generated, including but not limited to RIR (Room Impulse Response), ARIR (Ambisonics Room Impulse Response), BRIR (Binaural Room Impulse Response), and MO-BRIR (Multi-orientation Binaural Room Impulse Response).
  • a convolver can be added to the encoding module to process the audio signal.
  • the processing result may be intermediate signal (ARIR), omnidirectional signal (RIR) or binaural signal (BRIR, MO-BRIR), and the processing result may be added to the intermediate signal or transmitted transparently to the next step for corresponding playback decoding.
  • the information processor may also provide reverberation parameter information such as reverberation duration, and an artificial reverberation generator (for example, a feedback delay network) can be added to the encoding module to perform artificial reverberation processing, and the result can be output to an intermediate signal or transmitted transparently to a decoder for processing.
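Purely as an illustration of the feedback-delay-network idea mentioned above, the following minimal sketch uses four delay lines, a normalized Hadamard feedback matrix and per-line gains derived from an assumed reverberation duration (T60); all constants are illustrative, not values from the disclosure.

```python
import numpy as np

def fdn_reverb(x, sample_rate, t60=1.5, delays=(1031, 1327, 1523, 1871)):
    """Minimal 4-line feedback delay network; returns the wet signal."""
    n_lines = len(delays)
    # Orthogonal (energy-preserving) 4x4 Hadamard feedback matrix.
    H = np.array([[1, 1, 1, 1], [1, -1, 1, -1],
                  [1, 1, -1, -1], [1, -1, -1, 1]], float) / 2.0
    # Per-line gain giving roughly -60 dB decay after t60 seconds.
    g = np.array([10 ** (-3.0 * d / (t60 * sample_rate)) for d in delays])
    buffers = [np.zeros(d) for d in delays]
    idx = [0] * n_lines
    y = np.zeros(len(x))
    for n, sample in enumerate(x):
        outs = np.array([buffers[i][idx[i]] for i in range(n_lines)])
        y[n] = outs.sum() / n_lines
        feedback = H @ (g * outs)
        for i in range(n_lines):
            buffers[i][idx[i]] = sample + feedback[i]
            idx[i] = (idx[i] + 1) % len(buffers[i])
    return y
```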
  • when the audio signal in a specific audio content format is a scene-based audio representation signal, the audio signal encoding module is further configured to weight the scene-based audio representation signal based on weight information indicated by or contained in the metadata associated with the audio representation signal; in this way, the weighted signal can be used as an encoded audio signal for spatial decoding.
  • when the audio signal in a specific audio content format is a scene-based audio representation signal, the audio signal encoding module is further configured to perform a sound field rotation operation on the scene-based audio representation signal based on spatial rotation information indicated by or contained in the metadata associated with the audio representation signal; in this way, the rotated audio signal can be used as an encoded audio signal for spatial decoding.
  • since the scene audio signal itself is a FOA, HOA or MOA signal, it can be weighted directly according to the weight information in the metadata to obtain the desired intermediate signal.
  • the processing of sound field rotation can be performed in the encoding module according to different implementations.
  • the scene audio signal can be multiplied by parameters indicating the rotation characteristics of the sound field, such as that in forms of vectors, matrices and the like, so that the audio signal can be further processed.
  • this sound field rotation operation may also be performed in the decoding stage. In some implementations, the sound field rotation operation may be performed in one of the encoding and decoding stages, or in both.
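As a sketch of the sound field rotation discussed above, the following applies a pure yaw rotation to a first-order (ACN-ordered) signal by multiplying it with a rotation matrix; full HOA rotation uses order-dependent rotation matrices, so this first-order, yaw-only case is purely illustrative.

```python
import numpy as np

def foa_yaw_rotation_matrix(yaw: float) -> np.ndarray:
    """Yaw rotation for a first-order signal in ACN order (W, Y, Z, X)."""
    c, s = np.cos(yaw), np.sin(yaw)
    # W and Z are invariant under yaw; X and Y mix like a 2-D rotation.
    return np.array([[1, 0, 0, 0],    # W
                     [0, c, 0, s],    # Y' =  Y cos + X sin
                     [0, 0, 1, 0],    # Z
                     [0, -s, 0, c]])  # X' = -Y sin + X cos

def rotate_sound_field(foa: np.ndarray, yaw: float) -> np.ndarray:
    """foa: (4, num_samples) ACN/SN3D signal; returns the rotated signal."""
    return foa_yaw_rotation_matrix(yaw) @ foa
```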
  • when the audio signal in a specific audio content format is a channel-based audio representation signal, the audio signal encoding module is further configured to, when the channel-based audio representation signal needs to be converted, convert the channel-based audio representation signal that needs to be converted into an object-based audio representation signal and encode it.
  • the encoding operation here can be performed as described above for encoding an object-based audio representation signal.
  • the channel-based audio representation signal that needs to be converted may include a narrative channel track in the channel-based audio representation signal, and the audio signal encoding module is further configured to convert the audio representation signal corresponding to the narrative channel track into an object-based audio representation signal and encode it, as described above.
  • the audio representation signal corresponding to the narrative channel track can be split into audio elements by channel and converted into metadata for encoding.
  • when the audio signal in a specific audio content format is a channel-based audio representation signal that may not need spatial audio processing, especially spatial audio encoding, such a channel-based audio representation signal will be transmitted directly to the audio decoding module and processed in an appropriate manner for playback/rendering.
  • when, according to the needs of the scenario, no spatial audio processing is performed on the narrative channel track in the channel-based audio representation signal, for example when it is stipulated in advance that the narrative channel track does not need encoding processing, the narrative channel track can be transmitted directly to the decoding step.
  • the non-narrative channel track in the channel-based audio representation signal itself does not need spatial audio processing, so it can be directly transmitted to the decoding step.
  • the spatial encoding processing of channel-based audio representation signals can be performed based on a predetermined rule
  • the predetermined rule can be provided in an appropriate manner, and in particular can be specified in an information processing module.
  • audio encoding can be performed in an appropriate manner according to the specification.
  • the audio encoding mode can be a mode in which the audio representation signal is converted into an object-based audio representation for processing as described above, or it can be any other coding mode, such as a pre-agreed encoding mode for the channel-based audio signals.
  • the audio representation signal can be directly transmitted to the decoding module/stage, so that it can be processed for different playback modes.
  • audio decoding processing will be performed on such encoded audio signal or directly transmitted/transparently transmitted audio signal in order to obtain an audio signal suitable for playback/rendering in user application scenarios.
  • an encoded audio signal or a directly transmitted/transparently transmitted audio signal may be called a signal to be decoded, and may correspond to an audio signal in a specific spatial format as described above, or an intermediate signal.
  • the audio signal in the specific spatial format may be the aforementioned intermediate signal, or it may be an audio signal transmitted directly/transparently to the spatial decoder, including an uncoded audio signal, or an encoded audio signal that has been spatially encoded but not included in the intermediate signal, such as a non-narrative channel signal and a binaural signal after reverberation.
  • the audio decoding process may be performed by an audio signal decoding module.
  • the audio signal decoding module can decode the intermediate signal and the transparently transmitted signal for the playback apparatus according to the playback mode.
  • the audio signal to be decoded can be converted into a format suitable for playback through a playback apparatus in a user application scenario, such as an audio playback environment or an audio rendering environment.
  • the playback mode may be related to the configuration of the playback apparatus in the user application scenario. In particular, depending on the configuration information of the playback apparatus in the user application scenario, such as identifier, type and arrangement of the playback apparatus, a corresponding decoding mode can be adopted.
  • the decoded audio signal can be suitable for a specific type of playback environment, especially suitable for playback apparatuses in the playback environment, thus achieving compatibility with various types of playback environments.
  • the audio signal decoder can perform decoding according to information related to the type of the user application scenario; the information can be a type indicator of the user application scenario, for example a type indicator of a rendering apparatus/playback apparatus in the user application scenario, such as a renderer ID, so that decoding processing corresponding to the renderer ID can be performed to obtain an audio signal suitable for playback by the renderer.
  • as described above, each renderer ID can correspond to a specific renderer arrangement/playback scenario/playback apparatus arrangement, etc., so that an audio signal suitable for playback through the renderer arrangement/playback scenario/playback apparatus arrangement corresponding to the renderer ID can be decoded.
  • an audio signal decoder decodes an audio signal in a specific spatial format by using a decoding mode corresponding to a playback apparatus in a user application scenario.
  • the playback apparatus in the user application scenario may include a speaker array, which may correspond to a speaker playback/rendering scenario.
  • the audio signal decoder may decode the audio signal in the specific spatial format by using a decoding matrix corresponding to the speaker array in the user application scenario.
  • a user application scenario may correspond to a specific renderer ID, such as the aforementioned renderer ID2.
  • corresponding identifiers can be set according to the types of speaker arrays, so as to indicate the user application scenarios more accurately.
  • corresponding identifiers can be set for standard speaker arrays, custom speaker arrays, and so on.
  • the decoding matrix can be determined depending on the configuration information of the speaker array, such as the type and arrangement of the speaker array.
  • in the case that the playback apparatus in the user application scenario is a predetermined speaker array, the decoding matrix is a decoding matrix corresponding to the predetermined speaker array, either built into the audio signal decoder or received from the outside.
  • the decoding matrix can be a preset decoding matrix, which can be stored in the decoding module in advance, for example, can be stored in a database in association with/corresponding to the types of speaker arrays, or otherwise provided to the decoding module.
  • the decoding module can call a corresponding decoding matrix according to the known predetermined speaker array type to perform decoding processing.
  • the decoding matrix may be in various suitable forms, for example, it may contain gains, such as gain values of HOA tracks/channels to speakers, so that the gains can be directly applied to the HOA signals to generate output audio channels, so as to render the HOA signals into the speaker array.
  • the decoder will have built-in decoding matrix coefficients, and the playback signal L can be acquired by multiplying the intermediate signal with the decoding matrix:

    L = D · S_N

    where L is the speaker array signal, D is the decoding matrix, and S_N is the intermediate signal obtained as described above.
  • the directly/transparently transmitted audio signal can be converted into speaker array signals according to the definition of a standard speaker; for example, it can be multiplied by a decoding matrix as described above, or other suitable methods can be adopted, such as vector-base amplitude panning (VBAP) or the like.
  • in this case, the speaker manufacturer is required to provide a correspondingly designed decoding matrix.
  • the system provides a decoding matrix setting interface to receive the decoding matrix related parameters corresponding to the special speaker array, so that the received decoding matrix can be used for decoding processing, as described above.
  • the decoding matrix is a decoding matrix calculated according to the arrangement of the custom speaker array.
  • the decoding matrix is calculated according to the azimuth and pitch angles of each speaker or the three-dimensional coordinate values of the speaker in the speaker array.
  • the decoding module can calculate the decoding matrix according to the arrangement of the custom speakers, and the required inputs may be the azimuth and pitch angle of each speaker or the three-dimensional coordinate values of the speakers.
  • the calculation methods of speaker decoding matrix can be SAD (Sampling Ambisonics Decoder), MMD (Mode Matching Decoder), EPAD (Energy Preserved Ambisonics Decoder), AllRAD (All Round Ambisonics Decoder) and so on.
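As a sketch of how a decoding matrix can be computed from the azimuth and elevation of each speaker, the following SAD-like first-order example builds D by sampling the spherical harmonics at the speaker directions and then applies L = D · S_N; a production decoder would use one of the listed methods (SAD, MMD, EPAD, AllRAD) at the appropriate order, so this illustrates the data flow only.

```python
import numpy as np

def sh_foa(az: float, el: float) -> np.ndarray:
    """Real first-order spherical harmonics, ACN order (W, Y, Z, X), SN3D."""
    ce = np.cos(el)
    return np.array([1.0, np.sin(az) * ce, np.sin(el), np.cos(az) * ce])

def sampling_decoder(speaker_dirs):
    """speaker_dirs: list of (azimuth, elevation) pairs in radians.
    Returns D with shape (num_speakers, 4)."""
    Y = np.stack([sh_foa(az, el) for az, el in speaker_dirs])
    return Y / len(speaker_dirs)   # simple sampling-style normalization

def decode(intermediate, decoding_matrix):
    """L = D . S_N : intermediate is the (4, num_samples) HOA signal."""
    return decoding_matrix @ intermediate  # (num_speakers, num_samples)
```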
  • when the playback apparatus in the user application scenario is an earphone, the audio signal decoder can be configured to directly decode an audio signal to be decoded into a binaural signal as the decoded audio signal, or to perform speaker virtualization to obtain a decoded signal as the decoded audio signal.
  • a user application scenario may correspond to a specific renderer ID, such as the aforementioned renderer ID1.
  • the signal to be decoded can be directly decoded into a binaural signal.
  • the signal to be decoded can be decoded directly; for example, the HOA signal can be converted by determining a rotation matrix according to the listener's posture, and then the HOA channels/tracks can be adjusted, for example by convolution (e.g., convolution by means of a gain matrix, harmonic function, HRIR (head-related impulse response), spherical harmonic HRIR, etc., such as frequency-domain convolution), so that binaural signals can be obtained.
  • such a process can also be regarded as the HOA signal being directly multiplied by a decoding matrix, which may include a rotation matrix, gain matrix, harmonic function and so on.
  • typical methods may include LS (least squares), Magnitude LS, SPR (Spatial resampling) and so on.
  • transparently transmitted signals, which are usually binaural signals, are played back directly.
  • indirect rendering can also be performed; that is, decoding to a speaker array is performed first, and then HRTF convolution is performed according to the positions of the loudspeakers to virtualize them, so as to obtain the decoded signal.
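The direct binaural decoding path described above can be sketched as follows: an optional listener-posture rotation matrix is applied to the HOA signal, then each HOA channel is convolved with a spherical-harmonic-domain HRIR and summed per ear. The HRIR array `sh_hrirs` (assumed shape (N, 2, ir_length)) is an input from an HRIR database, not something specified by the disclosure.

```python
import numpy as np

def binaural_decode(hoa, sh_hrirs, rotation=None):
    """hoa: (N, num_samples); sh_hrirs: (N, 2, ir_len) SH-domain HRIRs."""
    if rotation is not None:          # rotation matrix from listener posture
        hoa = rotation @ hoa
    n = hoa.shape[1]
    out = np.zeros((2, n))
    for ch in range(hoa.shape[0]):    # per HOA channel/track
        for ear in range(2):          # left, right
            out[ear] += np.convolve(hoa[ch], sh_hrirs[ch, ear])[:n]
    return out                        # (2, num_samples) binaural signal
```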
  • in the audio decoding process, the audio signal to be decoded can also be processed based on metadata information associated with the audio signal to be decoded.
  • the audio signal to be decoded can be spatially transformed according to spatial transformation information in the metadata information, for example, when rotation is indicated in the metadata information, the sound field rotation operation can be performed on the audio representation signal to be decoded based on rotation information indicated in the metadata.
  • the intermediate signal can be multiplied with the rotation matrix as needed to obtain the rotated intermediate signal, so that the rotated intermediate signal can be decoded.
  • the spatially decoded audio signal can be adjusted for a specific playback apparatus in a user application scenario, aiming at enabling the adjusted audio signal to present a more appropriate acoustic experience when rendered by an audio rendering apparatus.
  • the adjustment of audio signals can mainly aim at eliminating possible inconsistencies between different playback types or different playback modes, so that the adjusted audio signals give a consistent playback experience when played back in application scenarios, improving the user's experience.
  • audio signal adjustment processing can be called post-processing; it refers to post-processing of the output signal obtained by audio decoding, and can therefore be called output signal post-processing.
  • the signal post-processing module is configured to perform at least one of frequency response compensation and dynamic range control on the decoded audio signal for a specific playback apparatus.
  • the post-processing module can perform post-processing adjustment on the output signal so as to present a consistent acoustic experience.
  • post-processing operations include, but are not limited to, frequency response compensation (EQ) and dynamic range control (DRC) for a specific apparatus.
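A minimal sketch of such post-processing follows, assuming the apparatus-specific tuning reduces to an FIR equalization filter plus a simple instantaneous feed-forward compressor; real EQ/DRC designs for a given playback apparatus (with attack/release smoothing, multiband processing, etc.) would be more elaborate.

```python
import numpy as np

def post_process(x, eq_taps, threshold_db=-12.0, ratio=4.0):
    """Apply FIR-based frequency response compensation, then a simple
    static dynamic-range-control gain curve (no attack/release)."""
    y = np.convolve(x, eq_taps)[:len(x)]           # EQ for the apparatus
    level_db = 20 * np.log10(np.maximum(np.abs(y), 1e-9))
    over = np.maximum(level_db - threshold_db, 0.0)
    gain_db = -over * (1.0 - 1.0 / ratio)          # compress above threshold
    return y * (10 ** (gain_db / 20.0))

# Example: identity EQ with mild compression.
# out = post_process(signal, eq_taps=np.array([1.0]))
```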
  • the audio information processing module, the audio signal encoding module, the signal spatial decoder and the output signal post-processing module described above can constitute the core rendering modules of the system, which are responsible for processing the signals in the three audio representation formats obtained through the pre-processing, together with their metadata, and playing them back through the playback apparatus in the user application environment.
  • each module of the audio rendering system described above is only a logical module classified according to the specific function it realizes, and is not used to limit the specific implementation, for example, it can be implemented in software, hardware or a combination of software and hardware.
  • the above modules can be realized as independent physical entities, or can also be realized by a single entity (for example, a processor (CPU or DSP, etc.), an integrated circuit, etc.), for example, an encoder, a decoder, etc. can adopt a chip (such as an integrated circuit module including a single wafer), a hardware component or a complete product.
  • the above-mentioned modules when indicated by dotted lines in the drawings, may indicate that these units may not actually exist, but the operations/functions they realize may be realized by other modules or systems or even the apparatus themselves that contain the modules.
  • at least one of the audio signal parsing module 411 , the information processing module 412 , and the audio signal encoding module 413 shown in FIG. 4 A may be located outside the acquisition module 41 and exist in the audio rendering system 4 , for example, between the acquisition module 41 and the decoder 42 , and sequentially process the input audio signal to obtain the audio signal to be processed by the decoder. It can even be located outside the audio rendering system.
  • the audio rendering system 4 may also include a memory, which may store various information generated by various modules included in the system and apparatus during operation, programs and data used for operation, data to be sent by the communication unit, and the like.
  • the memory may be a volatile memory and/or a nonvolatile memory.
  • the memory may include, but is not limited to, random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), read-only memory (ROM), and flash memory.
  • the memory may also be located outside the apparatus.
  • the audio rendering system 4 may also include other components not shown, such as an interface, a communication unit, and the like.
  • the interface and/or the communication unit may be used to receive an input audio signal to be rendered, and may also output the finally generated audio signal to a playback apparatus in a playback environment for playback.
  • the communication unit can be implemented in an appropriate manner known in the art, including, for example, communication components such as antenna arrays and/or radio frequency links, various types of interfaces, communication units, and the like, which will not be described in detail here.
  • the apparatus may also include other components not shown, such as radio frequency link, baseband processing unit, network interface, processor, controller, etc., which will not be described in detail here.
  • the audio rendering system mainly includes a rendering metadata system and a core rendering system; the metadata system contains control information describing the audio content and rendering techniques, such as whether the input form of the audio is single-channel, dual-channel, multi-channel, object-based or sound-field HOA, as well as position information of the dynamic sound source and the listener, and rendered acoustic environment information such as room shape, size, wall material, etc.
  • the core rendering system can perform corresponding rendering for playback apparatuses and environments according to different audio signal representations and metadata parsed from the metadata system.
  • the input audio signal is received, and then is parsed or transmitted directly according to the format of the input audio signal.
  • when the input audio signal is an input signal in any spatial audio exchange format, the input audio signal can be parsed to obtain an audio signal in a specific spatial audio representation, such as an object-based spatial audio representation signal, a scene-based spatial audio representation signal or a channel-based spatial audio representation signal, together with the associated metadata, and the parsing result is then passed to the subsequent processing stage.
  • when the input audio signal is directly an audio signal in a specific spatial audio representation, it can be transmitted directly to the subsequent processing stage without parsing.
  • such an audio signal can be transmitted directly to the audio coding stage, such as an object-based audio representation signal, a scene-based audio representation signal, and a narrative channel track that needs to be encoded in the channel-based audio representation signal.
  • when the audio signal in the specific spatial representation is of a type/format that need not be encoded, it can be transmitted directly to the audio decoding stage; for example, it can be a parsed non-narrative channel track in the channel-based audio representation, or a narrative channel track that need not be encoded.
  • information processing can be performed based on the obtained metadata, so as to extract and obtain audio parameters related to each audio signal, and such audio parameters can be used as metadata information.
  • the information processing here can be performed for either the parsed audio signal or the directly transmitted audio signal. Of course, as mentioned above, such information processing is optional and not necessary.
  • signal coding is performed on the audio signal with the specific spatial audio representation.
  • signal encoding can be performed on the audio signal with the specific spatial audio representation based on metadata information, and the obtained encoded audio signal is either transmitted directly to the subsequent audio decoding stage, or an intermediate signal can be obtained and then transmitted to the subsequent audio decoding stage.
  • when the audio signal with the specific spatial audio representation need not be encoded, such an audio signal can be transmitted directly to the audio decoding stage.
  • the received audio signal can be decoded to obtain an audio signal suitable for playback in the user application scenario as an output signal, and such an output signal can be presented to the user through an audio playback apparatus in the user application scenario, for example, an audio playback environment.
  • FIG. 4 I shows a flowchart of some embodiments of an audio rendering method according to the present disclosure.
  • in step S430 (also called the audio signal encoding step), an audio signal in a specific audio content format can be spatially encoded based on metadata information associated with the audio signal in the specific audio content format, to obtain an encoded audio signal.
  • in step S440 (also called the audio signal decoding step), the encoded audio signal in the specific spatial format can be spatially decoded to obtain a decoded audio signal for audio rendering.
  • the method 400 may further include step S410 (also called the audio signal acquisition step), in which an audio signal in a specific audio content format and metadata information associated with the audio signal can be acquired.
  • the audio signal acquisition step may further include parsing the input audio signal to obtain an audio signal conforming to a specific spatial audio representation, and performing format conversion on that audio signal to obtain the audio signal in the specific audio content format.
  • the method 400 may further include step S420 (also called the information processing step), in which audio parameters of the specific type of audio signal can be extracted based on metadata information associated with the specific type of audio signal.
  • the audio parameters of the specific type of audio signal can be further extracted based on the audio content format of the specific type of audio signal; accordingly, the audio signal encoding step may further include spatially encoding the specific type of audio signal based on these audio parameters (see the sketch below).
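  • a hypothetical sketch of such parameter extraction; the metadata keys and format names are invented for illustration only:

        def extract_audio_parameters(metadata, content_format):
            # Common parameters pulled from the associated metadata
            params = {"gain": metadata.get("gain", 1.0),
                      "position": metadata.get("position", (0.0, 0.0, 0.0))}
            # Extraction can additionally depend on the audio content format
            if content_format == "object":
                params["spread"] = metadata.get("spread", 0.0)  # object extent
            elif content_format == "HOA":
                params["order"] = metadata.get("order", 1)      # ambisonic order
            return params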
  • the audio signal in the specific spatial format may be further decoded based on a playback mode.
  • decoding can be performed by using a decoding method corresponding to the playback apparatus in the user application scenario, as sketched below.
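  • an illustrative sketch of playback-mode-dependent decoding; the mode names and the decoder placeholders are assumptions, not the disclosed methods:

        def binaural_decode(signal):
            return signal  # placeholder for, e.g., an HRTF-based rendering stage

        def speaker_decode(signal, n_speakers):
            return [signal] * n_speakers  # placeholder panning to speaker feeds

        def decode_for_mode(signal, playback_mode):
            if playback_mode == "headphones":
                return binaural_decode(signal)
            if playback_mode == "stereo_speakers":
                return speaker_decode(signal, n_speakers=2)
            return speaker_decode(signal, n_speakers=5)  # e.g. a 5-channel layout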
  • the method 400 may further include a signal input step, in which an input audio signal is received and transmitted directly to the audio signal encoding step if it is the specific type of audio signal among audio signals in the specific audio content format, or transmitted directly to the audio signal decoding step if it is an input audio signal in the specific audio content format but not the specific type of audio signal.
  • the method 400 may further include step S450 (also called the signal post-processing step), in which the decoded audio signal may be post-processed.
  • post-processing can be performed based on characteristics of the playback apparatus in the user application scenario; a minimal sketch follows.
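  • a minimal post-processing sketch keyed to playback-apparatus characteristics; the device table and limiter values are invented for illustration:

        import numpy as np

        # Per-device characteristics; names and values are illustrative only
        DEVICE_PROFILES = {"earbuds": {"peak_limit": 0.9},
                           "soundbar": {"peak_limit": 1.0}}

        def post_process(samples, device="earbuds"):
            # Simple peak limiter as one possible post-processing operation
            limit = DEVICE_PROFILES[device]["peak_limit"]
            return np.clip(np.asarray(samples, dtype=float), -limit, limit)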
  • the above-mentioned signal acquisition step, information processing step, signal input step and signal post-processing step are not necessarily included in the rendering method according to the present disclosure; that is, even without such steps, the method according to the present disclosure is still complete, can effectively solve the problems to be solved by the present disclosure, and achieves the advantageous effects.
  • these steps may be performed outside the method according to the present disclosure, with their results provided to the method of the present disclosure, or they may receive a result signal of the method of the present disclosure.
  • a signal acquisition step can be included in a signal encoding step
  • an information processing step and a signal input step can be included in a signal acquisition step
  • an information processing step can be included in a signal encoding step
  • a signal post-processing step can be included in a signal decoding step. Therefore, these steps are shown by dotted lines in the drawings.
  • the audio rendering method according to the present disclosure may further include other steps to realize the processing/operations in pre-processing, audio information processing, audio signal spatial coding, etc., which will not be described in detail here.
  • the audio rendering method according to the present disclosure and the steps therein can be executed by any suitable apparatus, such as a processor, an integrated circuit, or a chip (for example, by the aforementioned audio rendering system and the modules therein), and the method can also be embodied in computer programs, instructions, computer program media, computer program products, etc.
  • FIG. 5 shows a block diagram of an electronic apparatus according to some embodiments of the present disclosure.
  • the electronic apparatus 5 of this embodiment includes a memory 51 and a processor 52 coupled to the memory 51, and the processor 52 is configured to execute the audio signal encoding or decoding, or the audio signal rendering method, of any embodiment of the present disclosure based on instructions stored in the memory 51.
  • the memory 51 may include, for example, a system memory, a fixed non-volatile storage medium, and the like.
  • the system memory stores, for example, an operating system, application programs, a Boot Loader, a database and other programs.
  • FIG. 6 shows a structural schematic diagram of an electronic apparatus suitable for implementing an embodiment of the present disclosure.
  • the electronic apparatuses in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (Personal Digital Assistants), PADs (tablet computers), PMPs (Portable Multimedia Players) and vehicle-mounted terminals (such as vehicle-mounted navigation terminals), and fixed terminals such as digital TVs and desktop computers.
  • the electronic apparatus shown in FIG. 6 is just an example and should not limit the function and application scope of the embodiments of the present disclosure.
  • FIG. 6 shows a block diagram of other embodiments of the electronic apparatus of the present disclosure.
  • the electronic apparatus may include a processing apparatus (such as a central processing unit, a graphics processor, etc.) 601, which may perform various appropriate operations and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded into a random access memory (RAM) 603 from a storage apparatus 608.
  • the processing apparatus 601, the ROM 602 and the RAM 603 are connected to each other through a bus 604.
  • An input/output (I/O) interface 605 is also connected to the bus 604.
  • an input apparatus 606 including, for example, a touch screen, a touch pad, a keyboard, a mouse, an image sensor, a microphone, an accelerometer, a gyroscope, etc.
  • an output apparatus 607 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.
  • a storage apparatus 608 such as a magnetic tape, a hard disk, etc.
  • the communication apparatus 609 may allow the electronic apparatus to communicate with other apparatuses, wirelessly or by wire, to exchange data.
  • although FIG. 6 shows an electronic apparatus with various apparatuses, it should be understood that it is not required to implement or have all the apparatuses shown; more or fewer apparatuses may alternatively be implemented or provided.
  • an embodiment of the present disclosure includes a computer program product, which includes a computer program carried on a computer-readable medium, and the computer program contains program code for executing the method shown in the flowchart.
  • the computer program can be downloaded and installed from the network through the communication apparatus 609, or installed from the storage apparatus 608, or installed from the ROM 602.
  • when the computer program is executed by the processing apparatus 601, the above functions defined in the method of the embodiments of the present disclosure are performed.
  • a chip is provided, which comprises at least one processor and an interface, wherein the interface provides computer-executable instructions for the at least one processor, and the at least one processor executes the computer-executable instructions so as to realize the audio signal encoding or decoding or the audio signal rendering method of any one of the above embodiments.
  • FIG. 7 shows a block diagram of a chip capable of implementing some embodiments according to the present disclosure.
  • the processor 70 of the chip can be mounted on a Host CPU as a co-processor, and tasks are assigned by the Host CPU.
  • the core part of the processor 70 is an arithmetic circuit: the controller 704 controls the arithmetic circuit 703 to fetch data from the memory (a weight memory or an input memory) and perform operations.
  • the arithmetic circuit 703 internally includes a plurality of process engines (PE).
  • the arithmetic circuit 703 is a two-dimensional systolic array.
  • the arithmetic circuit 703 can also be a one-dimensional systolic array or any other electronic circuit capable of performing mathematical operations such as multiplication and addition.
  • the arithmetic circuit 703 is a general matrix processor.
  • the arithmetic circuit takes the data corresponding to the matrix B from the weight memory 702 and buffers it on each PE in the arithmetic circuit.
  • the operation circuit takes the data of matrix A from the input memory 701, performs a matrix operation with matrix B, and stores the partial or final result of the matrix in the accumulator 708.
  • the vector calculation unit 707 can further process the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, magnitude comparison, and so on.
  • the vector calculation unit 707 can store the processed output vector to a unified buffer 706.
  • the vector calculation unit 707 may apply a nonlinear function to the output of the operation circuit 703, such as a vector of accumulated values, to generate an activation value.
  • the vector calculation unit 707 generates normalized values, combined values, or both.
  • the processed output vector can be used as an activation input to the arithmetic circuit 703, for example, for use in a subsequent layer in a neural network; a functional sketch of this dataflow follows.
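  • a functional emulation of the dataflow described above, in Python/NumPy; it models the behaviour only and is not the hardware implementation (the ReLU nonlinearity is an assumed example):

        import numpy as np

        def npu_layer(A, B, accumulator=None):
            # Matrix A (from the input memory) times matrix B (from the
            # weight memory); partial results accumulate across calls.
            partial = A @ B
            acc = partial if accumulator is None else accumulator + partial
            # The vector calculation unit applies a nonlinear function to
            # the accumulated values to generate activation values.
            activation = np.maximum(acc, 0.0)
            return activation, acc

        A = np.random.rand(4, 8)    # data fetched from the input memory 701
        B = np.random.rand(8, 16)   # weights fetched from the weight memory 702
        activation, acc = npu_layer(A, B)  # activation can feed a subsequent layer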
  • the unified memory 706 may be used to store input data and output data.
  • a direct memory access controller (DMAC) 705 transfers the input data in the external memory to the input memory 701 and/or the unified memory 706, stores the weight data in the external memory into the weight memory 702, and stores the data in the unified memory 706 into the external memory.
  • the bus interface unit (BIU) 510 is used to realize interaction among the main CPU, the DMAC and the instruction fetch memory 709 through the bus.
  • An instruction fetch buffer 709 connected to the controller 704 is used for storing instructions used by the controller 704;
  • the controller 704 is used to call the instructions cached in the instruction fetch buffer 709 to control the working process of the operation accelerator.
  • the unified memory 706, the input memory 701, the weight memory 702 and the instruction fetch memory 709 are all on-chip memories, while the external memory is the memory outside the NPU; the external memory can be a Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), a High Bandwidth Memory (HBM), or another readable and writable memory.
  • a computer program is provided, which comprises instructions that, when executed by a processor, cause the processor to perform the audio signal processing of any one of the above embodiments, especially any processing in the audio signal rendering process.
  • the present disclosure can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects.
  • the above-mentioned embodiments can be implemented, in whole or in part, in the form of computer program products.
  • a computer program product includes one or more computer instructions or computer programs. When the computer instructions or computer programs are loaded or executed on a computer, the processes or functions according to the embodiments of the present application are generated in whole or in part.
  • the computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable apparatuses.
  • the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) in which computer-usable program codes are contained.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • Stereophonic System (AREA)
US18/541,665 2021-06-15 2023-12-15 Audio rendering system and method and electronic device Pending US20240119946A1 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
WOPCT/CN2021/100076 2021-06-15
CN2021100076 2021-06-15
PCT/CN2022/098882 WO2022262758A1 (fr) 2021-06-15 2022-06-15 Audio rendering system and method and electronic device

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/098882 Continuation WO2022262758A1 (fr) 2021-06-15 2022-06-15 Audio rendering system and method and electronic device

Publications (1)

Publication Number Publication Date
US20240119946A1 (en) 2024-04-11

Family

ID=84526847

Family Applications (1)

Application Number Title Priority Date Filing Date
US18/541,665 Pending US20240119946A1 (en) 2021-06-15 2023-12-15 Audio rendering system and method and electronic device

Country Status (3)

Country Link
US (1) US20240119946A1 (fr)
CN (1) CN117546236A (fr)
WO (1) WO2022262758A1 (fr)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106210990B (zh) * 2016-07-13 2018-08-10 北京时代拓灵科技有限公司 A panoramic sound audio processing method
US10123150B2 (en) * 2017-01-31 2018-11-06 Microsoft Technology Licensing, Llc Game streaming with spatial audio
US20200120438A1 (en) * 2018-10-10 2020-04-16 Qualcomm Incorporated Recursively defined audio metadata
EP3809709A1 (fr) * 2019-10-14 2021-04-21 Koninklijke Philips N.V. Appareil et procédé de codage audio

Also Published As

Publication number Publication date
WO2022262758A1 (fr) 2022-12-22
CN117546236A (zh) 2024-02-09

Similar Documents

Publication Publication Date Title
US10674262B2 (en) Merging audio signals with spatial metadata
RU2661775C2 (ru) Передача сигнальной информации рендеринга аудио в битовом потоке
US10477310B2 (en) Ambisonic signal generation for microphone arrays
JP2019533404A (ja) バイノーラルオーディオ信号処理方法及び装置
US11758349B2 (en) Spatial audio augmentation
US11429340B2 (en) Audio capture and rendering for extended reality experiences
CN114067810A (zh) 音频信号渲染方法和装置
US11122386B2 (en) Audio rendering for low frequency effects
KR20170015897A (ko) 고차 앰비소닉 오디오 렌더러들에 대한 희소성 정보의 획득
CN116569255A (zh) 用于六自由度应用的多个分布式流的矢量场插值
US11310616B2 (en) Method for outputting audio signal using scene orientation information in an audio decoder, and apparatus for outputting audio signal using the same
KR101941764B1 (ko) 고차 앰비소닉 오디오 렌더러들에 대한 대칭성 정보의 획득
US20240119946A1 (en) Audio rendering system and method and electronic device
US20240119945A1 (en) Audio rendering system and method, and electronic device
US20210092543A1 (en) 3d sound orientation adaptability
JP2023551016A (ja) オーディオ符号化及び復号方法並びに装置
TW202029185A (zh) 音訊資料之靈活渲染
CN114128312B (zh) 用于低频效果的音频渲染
CN114128312A (zh) 用于低频效果的音频渲染

Legal Events

Date Code Title Description
STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION