WO2021052050A1 - Immersive audio rendering method and system - Google Patents

Immersive audio rendering method and system

Info

Publication number
WO2021052050A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
gain
mixing
weight
channel
Prior art date
Application number
PCT/CN2020/107157
Other languages
English (en)
French (fr)
Inventor
孙学京
郭红阳
张兴涛
许春生
Original Assignee
南京拓灵智能科技有限公司
Priority date: 2019-09-17 (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2020-08-05
Publication date: 2021-03-25
Application filed by 南京拓灵智能科技有限公司 filed Critical 南京拓灵智能科技有限公司
Priority to KR1020207026992A (granted as KR102300177B1, ko)
Publication of WO2021052050A1 publication Critical patent/WO2021052050A1/zh

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/008 Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 3/00 Systems employing more than two channels, e.g. quadraphonic
    • H04S 3/008 Systems employing more than two channels, e.g. quadraphonic, in which the audio signals are in digital form, i.e. employing more than two discrete digital channels
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S 7/30 Control circuits for electronic adaptation of the sound field
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 2420/00 Techniques used in stereophonic systems covered by H04S but not provided for in its groups
    • H04S 2420/01 Enhancing the perception of the sound image or of the spatial distribution using head related transfer functions [HRTF's] or equivalents thereof, e.g. interaural time difference [ITD] or interaural level difference [ILD]

Definitions

  • This application relates to the technical field of audio data processing, and in particular to an immersive audio rendering method and system.
  • Immersive audio processing is mainly based on technologies such as channel-based audio (CBA), object-based audio (OBA), and Ambisonics scene-based audio (SBA), covering audio production, encoding and decoding, packaging, and rendering.
  • CBA channel-based audio
  • OBA object-based audio
  • SBA Ambisonics scene-based audio
  • Ambisonics uses spherical harmonic functions to record the sound field and drive the speakers. It has strict speaker layout requirements and can reconstruct the original sound field with high quality at the center of the speaker array. When rendering moving sound sources, HOA creates a more fluid and smooth listening experience.
  • HOA Higher Order Ambisonics
  • Vector Base Amplitude Panning (VBAP) is based on the sine law in three-dimensional space and uses three adjacent speakers to form a three-dimensional sound vector, without affecting the low-frequency interaural time difference (ITD) or high-frequency spectral cues, so sound localization in three-dimensional space is more precise.
  • ITD interaural time difference
  • Because the algorithm is simple, VBAP has become the most commonly used multi-channel 3D audio processing technology.
  • HOA uses an intermediate format to reconstruct a 3D sound field, but it is limited by the order used, which may cause a loss of high-frequency cues and reduce the accuracy of the listener's localization; VBAP, in turn, produces jumps when rendering moving sound sources, resulting in incoherent spatial sound.
  • the purpose of this application is to provide an immersive audio rendering method and system, which can more accurately locate the sound source position, so as to meet the requirements of on-site immersive production and playback in small and medium-sized venues.
  • the present application provides an immersive audio rendering method, the method includes:
  • for multiple channels of audio played by multiple speakers to be mixed, a first gain based on HOA and a second gain based on VBAP of each channel of the audio are obtained; a mixing weight is configured for each channel of the audio, and the weight coefficients of the first gain and the second gain are determined according to the mixing weight; the mixing gain of each channel of the audio is determined according to the first gain, the second gain and the respective weight coefficients, and the mixing process of the multiple channels of audio is completed through the mixing gain.
  • determining the weight coefficients of the first gain and the second gain according to the mixing weight includes:
  • the mixing weight is used as the weight coefficient of the first gain, and the difference between 1 and the mixing weight is used as the weight coefficient of the second gain.
  • the mixing gain of each speaker is determined according to the following formula:
  • g_mn(t) = w_n(t)g_HOAn(t) + (1 - w_n(t))g_VBAPn(t)
  • where g_mn(t) represents the mixing gain of the audio corresponding to the nth speaker, w_n(t) represents the mixing weight, g_HOAn(t) represents the first gain of the audio corresponding to the nth speaker, g_VBAPn(t) represents the second gain of the audio corresponding to the nth speaker, and t represents time.
  • configuring the mixing weight for each channel of the audio includes: obtaining audio training samples and training a neural network model on them; obtaining the input audio of the current speaker and extracting a multi-channel spectrogram of the input audio; and inputting the multi-channel spectrogram into the trained model, with the output of the trained model used as the mixing weight of the audio corresponding to the current speaker.
  • the abscissa of the multi-channel spectrogram is time, the ordinate is frequency, and the audio energy values are distinguished by color level.
  • the neural network is a multi-layer convolutional neural network together with a fully connected layer; the convolutional neural network has at least M layers, where M is a positive integer greater than or equal to 2, and is used to extract feature information from the multi-channel spectrogram
  • the convolutional layers and pooling layers in the convolutional neural network provide translation invariance with respect to the extracted feature information.
  • the method further includes:
  • according to the estimated weight predicted by the trained model and a predetermined actual weight, the model parameters are adjusted during training so that the difference between the estimated weight obtained after adjustment and the actual weight satisfies the error tolerance condition.
  • this application also provides an immersive audio rendering system, which includes:
  • a gain obtaining unit configured to obtain a first gain based on HOA and a second gain based on VBAP of each channel of the audio for multiple channels of audio played by multiple speakers to be mixed;
  • a weight coefficient determining unit configured to configure mixing weights for each channel of the audio, and determine the weight coefficients of the first gain and the second gain according to the mixing weight;
  • the mixing unit is configured to determine the mixing gain of each channel of the audio according to the first gain, the second gain, and respective weight coefficients, and complete the mixing process of the multiple channels of audio through the mixing gain.
  • the weight coefficient determining unit includes:
  • a training module for obtaining audio training samples, and training the audio training samples based on a neural network model
  • An extraction module for acquiring input audio and extracting a multi-channel spectrogram of the input audio
  • the weight determination module is configured to input the multi-channel spectrogram into the trained model, and use the output result of the trained model as the mixing weight of the audio corresponding to the current speaker.
  • the neural network is a multi-layer convolutional neural network and a fully connected layer, and the convolutional neural network has at least M layers, where M is a positive integer greater than or equal to 2.
  • this application proposes an immersive audio rendering method and system.
  • based on HOA and object-based audio technology, the optimal processing mode is adaptively selected according to the audio content and the audio is rendered accordingly; this method keeps the sound moving smoothly while locating the sound source position more precisely, so as to meet the needs of live immersive audio production and playback in small and medium-sized venues.
  • FIG. 1 is a diagram of the steps of an immersive audio rendering method in an embodiment of this application
  • FIG. 2 is a flowchart of determining the mixing weight by means of machine learning in an embodiment of this application
  • FIG. 3 is a schematic structural diagram of an immersive audio rendering system in an embodiment of this application.
  • This application provides an immersive audio rendering method. Please refer to FIG. 1.
  • the method includes:
  • S1: Obtain, for multiple channels of audio played by multiple speakers to be mixed, an HOA-based first gain and a VBAP-based second gain of each channel of the audio. S2: Configure a mixing weight for each channel of the audio, and determine the weight coefficients of the first gain and the second gain according to the mixing weight. S3: Determine the mixing gain of each channel of the audio according to the first gain, the second gain, and respective weight coefficients, and complete the mixing process of the multiple channels of audio through the mixing gain.
  • the mixing weight may be used as the weight coefficient of the first gain, and the difference between 1 and the mixing weight may be used as the weight coefficient of the second gain.
  • immersive audio rendering processing may be performed based on object audio technology and HOA technology, and weights may be set based on a rule-based gain generation method.
  • the gain based on HOA is g_HOAn(t), the gain based on VBAP is g_VBAPn(t), and the final mixed-mode gain is g_mn(t).
  • the mixing gain of each channel of the audio is determined according to the following formula:
  • g_mn(t) = w_n(t)g_HOAn(t) + (1 - w_n(t))g_VBAPn(t)
  • where g_mn(t) represents the mixing gain of the audio corresponding to the nth speaker, w_n(t) represents the mixing weight, g_HOAn(t) represents the first gain of the audio corresponding to the nth speaker, g_VBAPn(t) represents the second gain of the audio corresponding to the nth speaker, and t represents time.
  • when configuring the mixing weight for each channel of the audio, it can be judged whether the audio source is in a moving state, and different mixing-weight configuration modes can be selected adaptively according to the judgment result.
  • if the audio source is stationary, the mixing weight of the audio corresponding to the current speaker is configured as 0; if the audio source is in a moving state, the audio corresponding to the current speaker is configured with a mixing weight matching the moving speed.
  • specifically, for a stationary source w_n(t) is set to 0, while for a moving source the weight is set according to the moving speed; for example, if the speed is required to be less than v, w_n(t) is set to a value less than 0.5.
  • this embodiment is suitable for mixing scenarios in which whether the audio source moves, and at what speed, is known in advance or specified by the mixing engineer.
  • the immersive audio rendering processing is performed based on the object audio technology and the HOA technology, and the weight is determined in a data-driven manner.
  • the HOA-based gain is g_HOAn(t), the VBAP-based gain is g_VBAPn(t), and the final mixed-mode gain is g_mn(t).
  • the mixing gain of each channel of the audio is determined according to the following formula:
  • g_mn(t) = w_n(t)g_HOAn(t) + (1 - w_n(t))g_VBAPn(t)
  • where g_mn(t) represents the mixing gain of the audio corresponding to the nth speaker, w_n(t) represents the mixing weight, g_HOAn(t) represents the first gain of the audio corresponding to the nth speaker, g_VBAPn(t) represents the second gain of the audio corresponding to the nth speaker, and t represents time.
  • w_n(t) can be determined in a data-driven manner, for example by machine learning or by neural-network-based deep learning methods.
  • the neural network is constructed as follows: 1) the input is the audio spectrogram of each channel; 2) the hidden layers are a multi-layer convolutional neural network and a fully connected layer; 3) the output is the mixing weight w_n(t).
  • making a prediction with the neural network may include: obtaining audio training samples, and training on the audio training samples with a model consisting of a multi-layer convolutional neural network and a fully connected layer; obtaining the input audio and extracting a multi-channel spectrogram of the input audio; and inputting the multi-channel spectrogram into the trained model, using the output of the trained model as the mixing weight of the audio corresponding to the current speaker.
  • the abscissa of the spectrogram is time, the ordinate is frequency, and the value at each point is the audio energy of that frequency bin; since a two-dimensional plane is used to express three-dimensional information, the magnitude of the energy is expressed by color, and the darker the color, the stronger the audio energy at that point.
  • from the audio spectrogram, the frequency distribution of the audio can be analyzed, and from the multi-channel spectrogram, the movement trajectory of the sound source can be derived.
  • Convolutional neural networks have representation-learning capability and can extract high-order features from multi-channel spectrograms; the convolutional layers and pooling layers in the network provide translation invariance of the input features, that is, the ability to recognize similar features at different positions in space.
  • a neural network generally includes a training stage and a testing stage; the input is the multi-channel spectrogram and the output is the corresponding weight, the training loss function is set from the predetermined actual weight and the estimated weight, and the network parameters are adjusted continuously.
  • the estimated weight predicted by the trained model can be compared with the predetermined actual weight, and the parameters used during training can be adjusted according to the difference between the estimated weight and the actual weight, so that the difference between the estimated weight obtained after adjustment and the actual weight satisfies the error tolerance condition.
  • this embodiment is used when it is unknown whether the sound source moves and at what speed.
  • the system automatically matches the mixing weight according to the input audio for rendering processing.
  • this application also provides an immersive audio rendering system, which includes:
  • a gain obtaining unit configured to obtain a first gain based on HOA and a second gain based on VBAP of each channel of the audio for multiple channels of audio played by multiple speakers to be mixed;
  • a weight coefficient determining unit configured to configure mixing weights for each channel of the audio, and determine the weight coefficients of the first gain and the second gain according to the mixing weight;
  • the mixing unit is configured to determine the mixing gain of each channel of the audio according to the first gain, the second gain, and respective weight coefficients, and complete the mixing process of the multiple channels of audio through the mixing gain.
  • the weight coefficient determining unit includes:
  • a training module for obtaining audio training samples, and training the audio training samples based on a neural network model
  • an extraction module for obtaining input audio and extracting a multi-channel spectrogram of the input audio
  • the weight determination module is configured to input the multi-channel spectrogram into the trained model, and use the output result of the trained model as the mixing weight of the audio corresponding to the current speaker.
  • the neural network model is a multi-layer convolutional neural network and a fully connected layer, and the convolutional neural network has at least M layers, where M is a positive integer greater than or equal to 2.
  • this application proposes a method and system for immersive audio rendering.
  • based on HOA and object-based audio technology, the optimal processing mode is adaptively selected according to the audio content and the audio is rendered accordingly; this method keeps the sound moving smoothly while locating the sound source position more precisely, so as to meet the needs of live immersive audio production and playback in small and medium-sized venues.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Stereophonic System (AREA)

Abstract

An immersive audio rendering method and system. The method includes: for multiple channels of audio to be mixed and played by multiple speakers, obtaining an HOA-based first gain and a VBAP-based second gain of each channel of audio (S1); configuring a mixing weight for each channel of audio, and determining weight coefficients of the first gain and the second gain according to the mixing weight (S2); and determining the mixing gain of each channel of audio according to the first gain, the second gain and their respective weight coefficients, and completing the mixing of the multiple channels of audio through the mixing gain (S3). The method can locate the sound source position more accurately, so as to meet the needs of live immersive production and playback in small and medium-sized venues.

Description

Immersive audio rendering method and system
Technical Field
This application relates to the technical field of audio data processing, and in particular to an immersive audio rendering method and system.
Background
In recent years, with the continuous development of high-definition video, from 2K to 4K and even 8K, and with the development of virtual reality (VR) and AR, people's expectations for audio have risen accordingly. Listeners are no longer satisfied with the stereo, 5.1 and 7.1 sound that has been popular for many years, and have begun to pursue 3D or immersive sound with a stronger sense of immersion and realism. At present, immersive audio processing is mainly based on technologies such as channel-based audio (CBA), object-based audio (OBA) and Ambisonics scene-based audio (SBA), covering audio production, encoding and decoding, packaging and rendering.
Specifically, Ambisonics uses spherical harmonic functions to record the sound field and drive the speakers; it imposes strict requirements on the speaker layout and can reconstruct the original sound field with high quality at the center of the speaker array. When rendering moving sound sources, HOA (Higher Order Ambisonics) creates a more fluid and smooth listening experience.
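For orientation only, the snippet below sketches how first-order Ambisonics panning gains for a speaker layout might be computed in Python. It is not the patent's implementation: the patent does not specify the Ambisonics order, the normalization convention, or the decoder design, so the ACN/SN3D encoding, the sampling-style decoder and the square layout used here are all illustrative assumptions.

```python
import numpy as np

def sh_first_order(azimuth, elevation):
    """First-order spherical-harmonic vector [W, Y, Z, X] (ACN order, SN3D normalization)."""
    return np.array([
        1.0,                                  # W: omnidirectional component
        np.sin(azimuth) * np.cos(elevation),  # Y
        np.sin(elevation),                    # Z
        np.cos(azimuth) * np.cos(elevation),  # X
    ])

def hoa_speaker_gains(src_az, src_el, speaker_dirs):
    """Per-speaker gains (a stand-in for g_HOAn) via a basic sampling decoder:
    encode the source direction, then project it onto each speaker direction."""
    b = sh_first_order(src_az, src_el)
    gains = np.array([sh_first_order(az, el) @ b for az, el in speaker_dirs])
    gains = np.clip(gains, 0.0, None)         # drop negative (rear-lobe) contributions
    norm = np.linalg.norm(gains)
    return gains / norm if norm > 0 else gains

# Illustrative square layout at ear height: azimuths 45, 135, -135, -45 degrees
speakers = [(np.deg2rad(a), 0.0) for a in (45, 135, -135, -45)]
print(hoa_speaker_gains(np.deg2rad(30), 0.0, speakers))
```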
In addition, Vector Base Amplitude Panning (VBAP) is based on the sine law in three-dimensional space and uses three adjacent speakers to form a three-dimensional sound vector, without affecting the low-frequency interaural time difference (ITD) or the high-frequency spectral cues, so sound localization in three-dimensional space is more precise. Because the algorithm is simple, VBAP has become the most commonly used multi-channel 3D audio processing technology.
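Again for orientation only, the following is a minimal sketch of the standard three-speaker VBAP gain computation (Pulkki's formulation, which the patent builds on but does not restate); it assumes the triplet enclosing the source has already been selected, and the triplet and source direction shown are made-up values.

```python
import numpy as np

def vbap_gains(source_dir, speaker_dirs):
    """Gains (a stand-in for g_VBAPn) for one speaker triplet.
    source_dir: vector pointing toward the source; speaker_dirs: three unit vectors."""
    L = np.asarray(speaker_dirs, dtype=float)                    # 3x3, one speaker direction per row
    g = np.asarray(source_dir, dtype=float) @ np.linalg.inv(L)   # solve g @ L = p
    if np.any(g < 0):
        raise ValueError("source direction lies outside this speaker triplet")
    return g / np.linalg.norm(g)                                 # constant-power normalization

# Illustrative triplet: front-left, front-right, and an elevated front speaker
triplet = [[0.707, 0.707, 0.0], [0.707, -0.707, 0.0], [0.707, 0.0, 0.707]]
print(vbap_gains([0.9, 0.1, 0.3], triplet))
```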
However, existing immersive audio processing methods cannot meet the needs of live immersive production and playback in small and medium-sized venues. HOA reconstructs a 3D sound field through an intermediate format but is limited by the order used, which may cause a loss of high-frequency cues and thus reduce the accuracy of the listener's localization, while VBAP produces jumps when rendering moving sound sources, resulting in incoherent spatial sound.
Summary
The purpose of this application is to provide an immersive audio rendering method and system that can locate the sound source position more accurately, so as to meet the needs of live immersive production and playback in small and medium-sized venues.
To achieve the above purpose, the present application provides an immersive audio rendering method, the method comprising:
for multiple channels of audio to be mixed and played by multiple speakers, obtaining an HOA-based first gain and a VBAP-based second gain of each channel of the audio;
configuring a mixing weight for each channel of the audio, and determining weight coefficients of the first gain and the second gain according to the mixing weight;
determining the mixing gain of each channel of the audio according to the first gain, the second gain and their respective weight coefficients, and completing the mixing of the multiple channels of audio through the mixing gain.
Further, determining the weight coefficients of the first gain and the second gain according to the mixing weight comprises:
using the mixing weight as the weight coefficient of the first gain, and using the difference between 1 and the mixing weight as the weight coefficient of the second gain.
Further, the mixing gain of each speaker is determined according to the following formula:
g_mn(t) = w_n(t)g_HOAn(t) + (1 - w_n(t))g_VBAPn(t)
where g_mn(t) denotes the mixing gain of the audio corresponding to the n-th speaker, w_n(t) denotes the mixing weight, g_HOAn(t) denotes the first gain of the audio corresponding to the n-th speaker, g_VBAPn(t) denotes the second gain of the audio corresponding to the n-th speaker, and t denotes time.
Further, configuring a mixing weight for each channel of the audio comprises:
judging whether the audio source is in a moving state, and adaptively selecting different mixing-weight configuration modes according to the judgment result; if the audio source is stationary, the mixing weight of the audio corresponding to the current speaker is configured as 0; if the audio source is in a moving state, the audio corresponding to the current speaker is configured with a mixing weight matching the moving speed.
Further, configuring a mixing weight for each channel of the audio comprises:
obtaining audio training samples, and training on the audio training samples based on a neural network model;
obtaining the input audio of the current speaker, and extracting a multi-channel spectrogram of the input audio;
inputting the multi-channel spectrogram into the trained model, and using the output of the trained model as the mixing weight of the audio corresponding to the current speaker.
Further, the abscissa of the multi-channel spectrogram is time, the ordinate is frequency, and the audio energy values are distinguished by color level.
Further, the neural network is a multi-layer convolutional neural network together with a fully connected layer; the convolutional neural network has at least M layers, where M is a positive integer greater than or equal to 2, and is used to extract feature information from the multi-channel spectrogram, and the convolutional layers and pooling layers in the convolutional neural network provide translation invariance with respect to the feature information.
Further, after training on the audio training samples, the method further comprises:
adjusting the model parameters during training according to the estimated weight predicted by the trained model and a predetermined actual weight, so that the difference between the estimated weight obtained after adjustment and the actual weight satisfies the error tolerance condition.
To achieve the above purpose, this application also provides an immersive audio rendering system, the system comprising:
a gain obtaining unit, configured to obtain, for multiple channels of audio to be mixed and played by multiple speakers, an HOA-based first gain and a VBAP-based second gain of each channel of the audio;
a weight coefficient determining unit, configured to configure a mixing weight for each channel of the audio, and determine weight coefficients of the first gain and the second gain according to the mixing weight;
a mixing unit, configured to determine the mixing gain of each channel of the audio according to the first gain, the second gain and their respective weight coefficients, and complete the mixing of the multiple channels of audio through the mixing gain.
Further, the weight coefficient determining unit is configured to:
judge whether the audio source is in a moving state, and adaptively select different mixing-weight configuration modes according to the judgment result; if the audio source is stationary, the mixing weight of the audio corresponding to the current speaker is configured as 0; if the audio source is in a moving state, the audio corresponding to the current speaker is configured with a mixing weight matching the moving speed.
Further, the weight coefficient determining unit comprises:
a training module, configured to obtain audio training samples and train on the audio training samples based on a neural network model;
an extraction module, configured to obtain input audio and extract a multi-channel spectrogram of the input audio;
a weight determining module, configured to input the multi-channel spectrogram into the trained model and use the output of the trained model as the mixing weight of the audio corresponding to the current speaker.
Further, the neural network is a multi-layer convolutional neural network together with a fully connected layer, and the convolutional neural network has at least M layers, where M is a positive integer greater than or equal to 2.
As can be seen from the above, this application proposes an immersive audio rendering method and system: based on HOA and object-based audio technology, the optimal processing mode is adaptively selected according to the audio content and the audio is rendered accordingly. The method can locate the sound source position more precisely while keeping the sound moving smoothly, so as to meet the needs of live immersive audio production and playback in small and medium-sized venues.
Brief Description of the Drawings
FIG. 1 is a diagram of the steps of an immersive audio rendering method in an embodiment of this application;
FIG. 2 is a flowchart of determining the mixing weight by machine learning in an embodiment of this application;
FIG. 3 is a schematic structural diagram of an immersive audio rendering system in an embodiment of this application.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions in this application, the technical solutions in the embodiments of this application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of this application rather than all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of this application.
This application provides an immersive audio rendering method. Referring to FIG. 1, the method includes:
S1: For multiple channels of audio to be mixed and played by multiple speakers, obtain an HOA-based first gain and a VBAP-based second gain of each channel of the audio.
S2: Configure a mixing weight for each channel of the audio, and determine weight coefficients of the first gain and the second gain according to the mixing weight.
S3: Determine the mixing gain of each channel of the audio according to the first gain, the second gain and their respective weight coefficients, and complete the mixing of the multiple channels of audio through the mixing gain.
In one embodiment, the mixing weight may be used as the weight coefficient of the first gain, and the difference between 1 and the mixing weight may be used as the weight coefficient of the second gain.
Specifically, in one embodiment, immersive audio rendering may be performed based on object-audio technology and HOA technology, with the weight set by a rule-based gain generation method.
Assume there are N speakers. For the audio played by the n-th speaker, the HOA-based gain is g_HOAn(t), the VBAP-based gain is g_VBAPn(t), and the final mixed-mode gain is g_mn(t).
The mixing gain of each channel of the audio is determined according to the following formula:
g_mn(t) = w_n(t)g_HOAn(t) + (1 - w_n(t))g_VBAPn(t)
where g_mn(t) denotes the mixing gain of the audio corresponding to the n-th speaker, w_n(t) denotes the mixing weight, g_HOAn(t) denotes the first gain of the audio corresponding to the n-th speaker, g_VBAPn(t) denotes the second gain of the audio corresponding to the n-th speaker, and t denotes time.
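As an illustration of this blending formula only (the function and variable names below are ours, not the patent's), the per-speaker mix can be evaluated frame by frame as follows:

```python
import numpy as np

def mixed_gain(w_n, g_hoa_n, g_vbap_n):
    """g_mn(t) = w_n(t)*g_HOAn(t) + (1 - w_n(t))*g_VBAPn(t), per time frame t.
    Each argument is a scalar or an array over time for the n-th speaker."""
    w_n = np.clip(np.asarray(w_n, dtype=float), 0.0, 1.0)  # keep the mixing weight in [0, 1]
    return w_n * np.asarray(g_hoa_n) + (1.0 - w_n) * np.asarray(g_vbap_n)

# Example: over four frames, fade from pure VBAP (w = 0) toward an even blend (w = 0.5)
print(mixed_gain([0.0, 0.2, 0.4, 0.5], [0.8, 0.8, 0.8, 0.8], [0.5, 0.5, 0.5, 0.5]))
```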
In one embodiment, when configuring the mixing weight for each channel of the audio, it can be judged whether the audio source is in a moving state, and different mixing-weight configuration modes can be selected adaptively according to the judgment result. If the audio source is stationary, the mixing weight of the audio corresponding to the current speaker is configured as 0; if the audio source is in a moving state, the audio corresponding to the current speaker is configured with a mixing weight matching the moving speed. Specifically, for a stationary source w_n(t) is set to 0, while for a moving source the weight is set according to the moving speed; for example, if the speed is required to be less than v, w_n(t) is set to a value less than 0.5.
This embodiment is suitable for mixing scenarios in which whether the audio source moves, and at what speed, is known in advance or specified by the mixing engineer.
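The following is a minimal sketch of one such rule-based choice of w_n(t). The speed threshold v and the exact speed-to-weight mapping are illustrative assumptions; the patent only requires that a stationary source get w_n(t) = 0 and that a source moving slower than v get a weight below 0.5.

```python
def rule_based_weight(is_moving, speed=0.0, v=1.0):
    """Pick the mixing weight w_n(t) from motion information supplied by the mixing engineer."""
    if not is_moving:
        return 0.0                        # stationary source: pure VBAP (w_n(t) = 0)
    # moving source: weight grows with speed, staying below 0.5 while speed < v
    return min(0.5 * speed / v, 1.0)

print(rule_based_weight(False))               # 0.0
print(rule_based_weight(True, speed=0.4))     # 0.2, below 0.5 because speed < v
print(rule_based_weight(True, speed=3.0))     # 1.0, a fast source leans fully on HOA
```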
In another embodiment, immersive audio rendering is performed based on object-audio technology and HOA technology, and the weight is determined in a data-driven manner.
Likewise, assume there are N speakers. For the audio played by the n-th speaker, the HOA-based gain is g_HOAn(t), the VBAP-based gain is g_VBAPn(t), and the final mixed-mode gain is g_mn(t).
The mixing gain of each channel of the audio is determined according to the following formula:
g_mn(t) = w_n(t)g_HOAn(t) + (1 - w_n(t))g_VBAPn(t)
where g_mn(t) denotes the mixing gain of the audio corresponding to the n-th speaker, w_n(t) denotes the mixing weight, g_HOAn(t) denotes the first gain of the audio corresponding to the n-th speaker, g_VBAPn(t) denotes the second gain of the audio corresponding to the n-th speaker, and t denotes time.
Here, w_n(t) can be determined in a data-driven manner, for example by machine learning or by neural-network-based deep learning.
Specifically, the neural network is constructed as follows: 1) the input is the audio spectrogram of each channel; 2) the hidden layers are a multi-layer convolutional neural network and a fully connected layer; 3) the output is the mixing weight w_n(t).
Making a prediction with the neural network may include: obtaining audio training samples, and training on the audio training samples with a model consisting of a multi-layer convolutional neural network and a fully connected layer; obtaining the input audio and extracting a multi-channel spectrogram of the input audio; and inputting the multi-channel spectrogram into the trained model, with the output of the trained model used as the mixing weight of the audio corresponding to the current speaker.
Specifically, the abscissa of the spectrogram is time, the ordinate is frequency, and the value at each point is the audio energy at that frequency bin. Since a two-dimensional plane is used to express three-dimensional information, the magnitude of the energy is expressed by color: the darker the color, the stronger the audio energy at that point. From the audio spectrogram, the frequency distribution of the audio can be analyzed, and from the multi-channel spectrogram, the movement trajectory of the sound source can be derived.
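One plausible way to build such a multi-channel spectrogram is an STFT per channel, as sketched below; the frame length, hop size and dB scaling are our illustrative choices, not values given in the patent.

```python
import numpy as np
from scipy.signal import stft

def multichannel_spectrogram(audio, sample_rate, n_fft=1024, hop=512):
    """audio: array of shape (channels, samples).
    Returns log-energy spectrograms of shape (channels, freq_bins, frames)."""
    specs = []
    for channel in audio:
        _, _, Z = stft(channel, fs=sample_rate, nperseg=n_fft, noverlap=n_fft - hop)
        energy = np.abs(Z) ** 2                        # energy at each (frequency, time) point
        specs.append(10.0 * np.log10(energy + 1e-12))  # dB scale; rendered as color when plotted
    return np.stack(specs)

# Example: two channels of one second of noise at 16 kHz
audio = np.random.randn(2, 16000)
print(multichannel_spectrogram(audio, 16000).shape)    # (2, 513, number_of_frames)
```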
Referring to FIG. 2, convolutional neural networks have representation-learning capability and can extract high-order features from multi-channel spectrograms; the convolutional layers and pooling layers in the network provide translation invariance of the input features, that is, the ability to recognize similar features at different positions in space. A neural network generally comprises a training stage and a testing stage: the input is the multi-channel spectrogram, the output is the corresponding weight, the training loss function is set from the predetermined actual weight and the estimated weight, and the network parameters are adjusted continuously. In other words, the estimated weight predicted by the trained model can be compared with the predetermined actual weight, and the parameters used during training can be adjusted according to the difference between the estimated weight and the actual weight, so that the difference between the estimated weight obtained after adjustment and the actual weight satisfies the error tolerance condition.
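The PyTorch sketch below shows one possible shape for such a network and its training loop: at least two convolutional layers with pooling, a fully connected layer that outputs the weight, and a mean-squared-error loss between the estimated and the predetermined actual weights. The layer sizes, optimizer and learning rate are illustrative assumptions rather than values specified by the patent.

```python
import torch
import torch.nn as nn

class WeightEstimator(nn.Module):
    """Multi-layer CNN plus a fully connected layer: multi-channel spectrogram -> mixing weight."""
    def __init__(self, in_channels):
        super().__init__()
        self.features = nn.Sequential(   # M >= 2 convolutional layers with pooling
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, 1)       # fully connected layer producing w_n(t)

    def forward(self, spec):             # spec: (batch, channels, freq_bins, frames)
        x = self.features(spec).flatten(1)
        return torch.sigmoid(self.fc(x)).squeeze(1)   # keep the weight in (0, 1)

model = WeightEstimator(in_channels=2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                   # estimated weight vs. predetermined actual weight

specs = torch.randn(8, 2, 128, 64)       # stand-in batch of multi-channel spectrograms
actual_w = torch.rand(8)                 # predetermined actual weights for these samples
for _ in range(10):                      # adjust parameters until the error is acceptable
    optimizer.zero_grad()
    loss = loss_fn(model(specs), actual_w)
    loss.backward()
    optimizer.step()
```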
This embodiment is used when whether the sound source moves, and at what speed, is unknown; the system automatically matches the mixing weight according to the input audio and uses it for rendering.
Referring to FIG. 3, this application also provides an immersive audio rendering system, the system comprising:
a gain obtaining unit, configured to obtain, for multiple channels of audio to be mixed and played by multiple speakers, an HOA-based first gain and a VBAP-based second gain of each channel of the audio;
a weight coefficient determining unit, configured to configure a mixing weight for each channel of the audio, and determine weight coefficients of the first gain and the second gain according to the mixing weight;
a mixing unit, configured to determine the mixing gain of each channel of the audio according to the first gain, the second gain and their respective weight coefficients, and complete the mixing of the multiple channels of audio through the mixing gain.
In one embodiment, the weight coefficient determining unit is configured to:
judge whether the audio source is in a moving state, and adaptively select different mixing-weight configuration modes according to the judgment result; if the audio source is stationary, the mixing weight of the audio corresponding to the current speaker is configured as 0; if the audio source is in a moving state, the audio corresponding to the current speaker is configured with a mixing weight matching the moving speed.
In one embodiment, the weight coefficient determining unit comprises:
a training module, configured to obtain audio training samples and train on the audio training samples based on a neural network model;
an extraction module, configured to obtain input audio and extract a multi-channel spectrogram of the input audio;
a weight determining module, configured to input the multi-channel spectrogram into the trained model and use the output of the trained model as the mixing weight of the audio corresponding to the current speaker.
In one embodiment, the neural network model is a multi-layer convolutional neural network together with a fully connected layer, and the convolutional neural network has at least M layers, where M is a positive integer greater than or equal to 2.
As can be seen from the above, this application proposes an immersive audio rendering method and system: based on HOA and object-based audio technology, the optimal processing mode is adaptively selected according to the audio content and the audio is rendered accordingly. The method can locate the sound source position more precisely while keeping the sound moving smoothly, so as to meet the needs of live immersive audio production and playback in small and medium-sized venues.
The above description of various embodiments of this application is provided for the purpose of description to those skilled in the art. It is not intended to be exhaustive or to limit this application to a single disclosed embodiment. As noted above, various alternatives and variations of this application will be apparent to those skilled in the art to which the above techniques belong. Therefore, although some alternative embodiments have been discussed specifically, other embodiments will be apparent to, or relatively easily derived by, those skilled in the art. This application is intended to cover all alternatives, modifications and variations of this application discussed herein, as well as other embodiments falling within the spirit and scope of the above application.

Claims (10)

  1. An immersive audio rendering method, wherein the method comprises:
    for multiple channels of audio to be mixed and played by multiple speakers, obtaining an HOA-based first gain and a VBAP-based second gain of each channel of the audio;
    configuring a mixing weight for each channel of the audio, and determining weight coefficients of the first gain and the second gain according to the mixing weight;
    determining the mixing gain of each channel of the audio according to the first gain, the second gain and their respective weight coefficients, and completing the mixing of the multiple channels of audio through the mixing gain.
  2. The method according to claim 1, wherein determining the weight coefficients of the first gain and the second gain according to the mixing weight comprises:
    using the mixing weight as the weight coefficient of the first gain, and using the difference between 1 and the mixing weight as the weight coefficient of the second gain.
  3. The method according to claim 1, wherein the mixing gain of each channel of the audio is determined according to the following formula:
    g_mn(t) = w_n(t)g_HOAn(t) + (1 - w_n(t))g_VBAPn(t)
    where g_mn(t) denotes the mixing gain of the audio corresponding to the n-th speaker, w_n(t) denotes the mixing weight, g_HOAn(t) denotes the first gain of the audio corresponding to the n-th speaker, g_VBAPn(t) denotes the second gain of the audio corresponding to the n-th speaker, and t denotes time.
  4. The method according to claim 1, wherein configuring a mixing weight for each channel of the audio comprises:
    judging whether the audio source is in a moving state, and adaptively selecting different mixing-weight configuration modes according to the judgment result; if the audio source is stationary, the mixing weight of the audio corresponding to the current speaker is configured as 0; if the audio source is in a moving state, the audio corresponding to the current speaker is configured with a mixing weight matching the moving speed.
  5. The method according to claim 1, wherein configuring a mixing weight for each channel of the audio comprises:
    obtaining audio training samples, and training on the audio training samples based on a neural network model;
    obtaining input audio, and extracting a multi-channel spectrogram of the input audio;
    inputting the multi-channel spectrogram into the trained model, and using the output of the trained model as the mixing weight of the audio corresponding to the current speaker.
  6. The method according to claim 5, wherein the neural network model is a multi-layer convolutional neural network and a fully connected layer, and the convolutional neural network has at least M layers, where M is a positive integer greater than or equal to 2.
  7. An immersive audio rendering system, wherein the system comprises:
    a gain obtaining unit, configured to obtain, for multiple channels of audio to be mixed and played by multiple speakers, an HOA-based first gain and a VBAP-based second gain of each channel of the audio;
    a weight coefficient determining unit, configured to configure a mixing weight for each channel of the audio, and determine weight coefficients of the first gain and the second gain according to the mixing weight;
    a mixing unit, configured to determine the mixing gain of each channel of the audio according to the first gain, the second gain and their respective weight coefficients, and complete the mixing of the multiple channels of audio through the mixing gain.
  8. The system according to claim 7, wherein the weight coefficient determining unit is configured to:
    judge whether the audio source is in a moving state, and adaptively select different mixing-weight configuration modes according to the judgment result; if the audio source is stationary, the mixing weight of the audio corresponding to the current speaker is configured as 0; if the audio source is in a moving state, the audio corresponding to the current speaker is configured with a mixing weight matching the moving speed.
  9. The system according to claim 7, wherein the weight coefficient determining unit comprises:
    a training module, configured to obtain audio training samples and train on the audio training samples based on a neural network model;
    an extraction module, configured to obtain input audio and extract a multi-channel spectrogram of the input audio;
    a weight determining module, configured to input the multi-channel spectrogram into the trained model and use the output of the trained model as the mixing weight of the audio corresponding to the current speaker.
  10. The system according to claim 9, wherein the neural network model is a multi-layer convolutional neural network and a fully connected layer, and the convolutional neural network has at least M layers, where M is a positive integer greater than or equal to 2.
PCT/CN2020/107157 2019-09-17 2020-08-05 Immersive audio rendering method and system WO2021052050A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020207026992A KR102300177B1 (ko) 2019-09-17 2020-08-05 Immersive audio rendering method and system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910876818.2 2019-09-17
CN201910876818.2A CN110751956B (zh) 2019-09-17 2019-09-17 Immersive audio rendering method and system

Publications (1)

Publication Number Publication Date
WO2021052050A1 true WO2021052050A1 (zh) 2021-03-25

Family

ID=69276576

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/107157 WO2021052050A1 (zh) 2019-09-17 2020-08-05 一种沉浸式音频渲染方法及系统

Country Status (2)

Country Link
CN (1) CN110751956B (zh)
WO (1) WO2021052050A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110751956B (zh) * 2019-09-17 2022-04-26 北京时代拓灵科技有限公司 Immersive audio rendering method and system
CN111046218A (zh) * 2019-12-12 2020-04-21 洪泰智造(青岛)信息技术有限公司 Audio acquisition method, device and system based on lock-screen state
CN112351379B (zh) * 2020-10-28 2021-07-30 歌尔光学科技有限公司 Control method for an audio component, and smart head-mounted device
CN112616110A (zh) * 2020-12-01 2021-04-06 中国电影科学技术研究所 Spatial sound rendering method and device, and electronic apparatus
CN114023299A (zh) * 2021-10-29 2022-02-08 福建星网视易信息系统有限公司 Network chorus method and storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103188595A (zh) * 2011-12-31 2013-07-03 展讯通信(上海)有限公司 Method and system for processing multi-channel audio signals
US20140219455A1 (en) * 2013-02-07 2014-08-07 Qualcomm Incorporated Mapping virtual speakers to physical speakers
CN104244164A (zh) * 2013-06-18 2014-12-24 杜比实验室特许公司 Generating a surround stereo sound field
US20160134988A1 (en) * 2014-11-11 2016-05-12 Google Inc. 3d immersive spatial audio systems and methods
CN107342092A (zh) * 2017-05-08 2017-11-10 深圳市创锐实业有限公司 Mixing system and method for automatic gain allocation
CN107920303A (zh) * 2017-11-21 2018-04-17 北京时代拓灵科技有限公司 Audio acquisition method and device
US20190239015A1 (en) * 2018-02-01 2019-08-01 Qualcomm Incorporated Scalable unified audio renderer
CN110751956A (zh) * 2019-09-17 2020-02-04 北京时代拓灵科技有限公司 Immersive audio rendering method and system
CN111046218A (zh) * 2019-12-12 2020-04-21 洪泰智造(青岛)信息技术有限公司 Audio acquisition method, device and system based on lock-screen state

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009046460A2 (en) * 2007-10-04 2009-04-09 Creative Technology Ltd Phase-amplitude 3-d stereo encoder and decoder
ES2922639T3 (es) * 2010-08-27 2022-09-19 Sennheiser Electronic Gmbh & Co Kg Method and device for improved sound field reproduction of spatially encoded audio input signals
EP2875511B1 (en) * 2012-07-19 2018-02-21 Dolby International AB Audio coding for improving the rendering of multi-channel audio signals
EP2738962A1 (en) * 2012-11-29 2014-06-04 Thomson Licensing Method and apparatus for determining dominant sound source directions in a higher order ambisonics representation of a sound field
KR102213895B1 (ko) * 2013-01-15 2021-02-08 한국전자통신연구원 채널 신호를 처리하는 부호화/복호화 장치 및 방법
EP2765791A1 (en) * 2013-02-08 2014-08-13 Thomson Licensing Method and apparatus for determining directions of uncorrelated sound sources in a higher order ambisonics representation of a sound field
CN104967960B (zh) * 2015-03-25 2018-03-20 腾讯科技(深圳)有限公司 Voice data processing method, and voice data processing method and system for live game streaming
MC200186B1 (fr) * 2016-09-30 2017-10-18 Coronal Encoding Method for conversion, stereophonic encoding, decoding and transcoding of a three-dimensional audio signal
CN106960672B (zh) * 2017-03-30 2020-08-21 国家计算机网络与信息安全管理中心 Bandwidth extension method and device for stereo audio
CN109473117B (zh) * 2018-12-18 2022-07-05 广州市百果园信息技术有限公司 Audio special-effect superposition method and device, and terminal thereof

Also Published As

Publication number Publication date
CN110751956B (zh) 2022-04-26
CN110751956A (zh) 2020-02-04

Similar Documents

Publication Publication Date Title
WO2021052050A1 (zh) Immersive audio rendering method and system
US11681490B2 (en) Binaural rendering for headphones using metadata processing
TWI744341B (zh) 使用近場/遠場渲染之距離聲相偏移
KR101828138B1 (ko) 상이한 재생 라우드스피커 셋업에 대한 공간 오디오 신호의 세그먼트-와이즈 조정
CN104869524A (zh) 三维虚拟场景中的声音处理方法及装置
CN105075293A (zh) 音频设备及其音频提供方法
JP7142109B2 (ja) Signaling of spatial audio parameters
US11924627B2 (en) Ambience audio representation and associated rendering
US11611840B2 (en) Three-dimensional audio systems
CN105075294B (zh) 音频信号处理装置
CN105594227A (zh) 利用恒定功率成对平移的矩阵解码器
KR102300177B1 (ko) Immersive audio rendering method and system
CN115705839A (zh) Voice playback method and apparatus, computer device, and storage medium
US20230379648A1 (en) Audio signal isolation related to audio sources within an audio environment
Lv et al. A TCN-based primary ambient extraction in generating ambisonics audio from Panorama Video
Kim et al. Parameter-Based Multi-Channel Audio Panning for Multi-View Broadcasting Systems

Legal Events

Date Code Title Description
ENP Entry into the national phase (Ref document number: 20207026992; Country of ref document: KR; Kind code of ref document: A)
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20866487; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 20866487; Country of ref document: EP; Kind code of ref document: A1)