CN112866896B - Immersive audio upmixing method and system - Google Patents

Immersive audio upmixing method and system

Info

Publication number
CN112866896B
Authority
CN
China
Prior art keywords
signal
audio signal
sound
sound source
stereo
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110111130.2A
Other languages
Chinese (zh)
Other versions
CN112866896A (en)
Inventor
孙学京
李旭阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Tuoling Xinsheng Technology Co ltd
Original Assignee
Beijing Tuoling Xinsheng Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Tuoling Xinsheng Technology Co ltd filed Critical Beijing Tuoling Xinsheng Technology Co ltd
Priority to CN202110111130.2A
Publication of CN112866896A
Application granted
Publication of CN112866896B
Legal status: Active (current)

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30 Control circuits for electronic adaptation of the sound field
    • H04S7/305 Electronic adaptation of stereophonic audio signals to reverberation of the listening space
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/15 Aspects of sound capture and related signal processing for recording or reproduction

Abstract

The invention discloses an immersive audio upmixing method and system. An input stereo audio signal is acquired and separated into a sound source signal and an environment sound signal by a deep learning sound source separation model; the sound source signal is further separated into a center sound source signal and a bass signal; the environment sound signal is decorrelated to obtain a left surround sound audio signal and a right surround sound audio signal; finally, the center sound source signal, the bass signal, the left and right surround sound audio signals, and the input left and right channel audio signals are combined into a 5.1 channel audio signal. Because the method processes the input stereo audio signal in real time based on a neural network, the sound source and the environment sound can be effectively distinguished, a multi-channel audio signal is obtained, and the immersive effect is further improved.

Description

Immersive audio upmixing method and system
Technical Field
The invention relates to the technical field of sound processing, in particular to an immersive audio frequency upmixing method and system.
Background
In recent years, with the development of high-definition video from 2K to 4K and even 8K, and with the rise of virtual reality (VR) and augmented reality (AR), requirements for audio have also increased. Listeners are no longer satisfied with the stereo sound that has been popular for many years and instead pursue 3D or immersive sound with a stronger sense of presence and realism. Professional and home theaters typically have multiple speakers that can play 5.1/7.1 or higher channel counts of immersive audio, and automotive audio is also gradually transitioning to content with more than two channels.
Currently, upmix algorithms are used to process a stereo audio signal into surround sound channels such as center (C), left (L), right (R), left surround (LS), right surround (RS) and low-frequency effects (LFE). In a typical scheme, band-pass filtering (BPF) and low-pass filtering (LPF) are applied to a center signal to obtain the C audio signal and the LFE audio signal; a surround signal, derived by combining the input left channel audio signal and right channel audio signal, is delayed, low-pass filtered, and further decorrelated (for example, by phase inversion) to obtain the LS audio signal and the RS audio signal. In the prior art, the sound source and the environment sound cannot be well distinguished after upmixing, which greatly weakens the immersive effect of the multi-channel audio signal. A large amount of traditional stereo (two-channel) content already exists in the market, and how to make the latest immersive audio systems compatible with this content while making fuller use of the additional channels to render a better immersive effect is a pain point to be solved urgently.
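For concreteness, the following is a minimal sketch of such a conventional filter-based upmix chain. The sum/difference derivation of the center and surround signals, the filter cutoffs, and the delay length are illustrative assumptions, not values fixed by any particular prior-art system.

```python
import numpy as np
from scipy.signal import butter, lfilter

def classic_upmix(left, right, fs=48000):
    """Prior-art style 2-to-5.1 upmix using only filters, delay and
    phase inversion (no source/ambience separation)."""
    center = 0.5 * (left + right)        # sum signal feeds the center path
    surround = 0.5 * (left - right)      # difference signal feeds the surrounds

    # Band-pass the center for C, low-pass it for LFE (cutoffs illustrative).
    b_bp, a_bp = butter(2, [200 / (fs / 2), 8000 / (fs / 2)], btype="band")
    b_lp, a_lp = butter(4, 120 / (fs / 2), btype="low")
    c = lfilter(b_bp, a_bp, center)
    lfe = lfilter(b_lp, a_lp, center)

    # Delay + low-pass the surround, then decorrelate LS/RS by phase inversion.
    delay = int(0.012 * fs)              # ~12 ms, an illustrative value
    surround = np.concatenate([np.zeros(delay), surround[:-delay]])
    b_sl, a_sl = butter(4, 7000 / (fs / 2), btype="low")
    ls = lfilter(b_sl, a_sl, surround)
    rs = -ls                             # 180-degree inversion
    return left, right, c, lfe, ls, rs
```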
Disclosure of Invention
Therefore, the immersive audio upmixing method and system provided by the invention convert stereo audio into multi-channel audio with at least four channels, improving the overall listening experience and solving the problem that the sound source and the environment sound cannot be well distinguished, which weakens the immersive effect of a multi-channel audio signal.
In order to achieve the above purpose, the invention provides the following technical scheme: an immersive audio upmixing method comprising the steps of:
acquiring an input stereo audio signal, and separating the stereo audio signal into a sound source signal and an environment sound signal by adopting a deep learning sound source separation model;
separating the sound source signal into a center sound source signal and a bass signal by adopting a deep learning sound source separation model;
performing decorrelation processing on the environment sound signals by adopting a deep learning sound source separation model to obtain left surround sound audio signals and right surround sound audio signals;
and acquiring input left channel audio signals and right channel audio signals, and combining the center sound source signals, the bass signals, the left surround sound audio signals, the right surround sound audio signals, the left channel audio signals and the right channel audio signals to obtain 5.1 channel audio signals.
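As an illustration of this final merging step, the following sketch assembles the six signals into one 5.1 buffer; the channel ordering (L, R, C, LFE, LS, RS) and the function name are illustrative assumptions, since actual systems may use a different layout.

```python
import numpy as np

def merge_to_5_1(left, right, center, lfe, ls, rs):
    """Stack the six per-channel signals into one 5.1 buffer."""
    channels = [left, right, center, lfe, ls, rs]
    n = min(len(ch) for ch in channels)           # guard against length drift
    return np.stack([ch[:n] for ch in channels])  # shape: (6, n_samples)
```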
As a preferable scheme of the immersive audio upmixing method, further comprising separating the stereo audio signal into a sound source signal, an environmental sound signal, and decorrelation parameters using a deep learning sound source separation model;
and performing decorrelation processing on the environment sound signals and the decorrelation parameters by adopting a deep learning sound source separation model to obtain left surround sound audio signals and right surround sound audio signals.
As a preferable scheme of the immersive audio upmixing method, the deep learning sound source separation model adopts a U-Net structure; the U-Net structure comprises an encoder part and a decoder part; the encoder and the decoder are connected by a long short-term memory (LSTM) network, and finally a mask is output for sound separation.
As a preferable scheme of the immersive audio upmixing method, the U-Net structure includes a down-sampling process for concentrating the stereo audio signal information and an up-sampling process for restoring the stereo audio signal to its input (pixel) resolution;
in the U-Net structure, each down-sampling stage has a skip connection cascaded with the corresponding up-sampling stage.
As a preferred solution of the immersive audio upmixing method, the deep learning sound source separation model is trained directly on a stereo time domain audio signal;
or trained on a stereo frequency domain signal, wherein the stereo frequency domain signal comprises left channel real part information, left channel imaginary part information, right channel real part information and right channel imaginary part information;
or trained on stereo frequency domain parameters, wherein the frequency domain parameters comprise the energy ratio of the left channel and the right channel.
As a preferred scheme of the immersive audio upmixing method, mode detection is performed on the input stereo audio signal, and when the stereo audio signal is movie or television content, processing is performed in mode A:
the method comprises the steps of obtaining a sound source signal and an environment sound signal based on a deep learning sound source separation model, obtaining a middle sound source signal and a low sound signal according to the sound source signal, performing decorrelation according to the environment sound signal to obtain a left surround sound audio signal and a right surround sound audio signal, and finally combining the middle sound source signal, the low sound signal, the left surround sound audio signal, the right surround sound audio signal, the left sound channel audio signal and the right sound channel audio signal to obtain a 5.1 sound channel audio signal.
As a preferred scheme of the immersive audio upmixing method, the method performs mode detection on an input stereo audio signal, and when the stereo audio signal is music content, performs processing in a mode B:
determining a music style type of the stereo audio signal, setting the center sound source signal to silence according to the music style type, and performing decorrelation on the stereo audio signal to obtain a left surround sound audio signal and a right surround sound audio signal; and finally, combining the bass signal, the left surround sound audio signal, the right surround sound audio signal, the left channel audio signal and the right channel audio signal to obtain a 5.1 channel audio signal.
As a preferred scheme of the immersive audio upmixing method, the input stereo audio signal is processed directly with a deep learning sound source separation model to obtain a multi-channel audio signal;
according to the mode detection result, if the stereo audio signal is movie content, a plurality of output channels are predicted using a deep learning neural network method.
The present invention also provides an immersive audio upmixing system comprising:
a first processing module, used for acquiring an input stereo audio signal and separating the stereo audio signal into a sound source signal and an environment sound signal by adopting a deep learning sound source separation model;
a second processing module, used for separating the sound source signal into a center sound source signal and a bass signal by adopting a deep learning sound source separation model;
a third processing module, used for performing decorrelation processing on the environment sound signal by adopting a deep learning sound source separation model to obtain a left surround sound audio signal and a right surround sound audio signal;
and an audio merging module, used for acquiring the input left channel audio signal and right channel audio signal, and merging the center sound source signal, the bass signal, the left surround sound audio signal, the right surround sound audio signal, the left channel audio signal and the right channel audio signal to obtain a 5.1 channel audio signal.
As a preferred scheme of the immersive audio upmixing system, the first processing module is further configured to obtain an input stereo audio signal, and separate the stereo audio signal into a sound source signal, an environmental sound signal, and decorrelation parameters by using a deep learning sound source separation model;
and the third processing module is also used for performing decorrelation processing on the environment sound signals and the decorrelation parameters by adopting a deep learning sound source separation model to obtain left surround sound audio signals and right surround sound audio signals.
The invention has the following advantages: an input stereo audio signal is acquired and separated into a sound source signal and an environment sound signal by adopting a deep learning sound source separation model; the sound source signal is separated into a center sound source signal and a bass signal; the environment sound signal is decorrelated to obtain a left surround sound audio signal and a right surround sound audio signal; and the center sound source signal, the bass signal, the left surround sound audio signal, the right surround sound audio signal, and the input left channel audio signal and right channel audio signal are combined to obtain a 5.1 channel audio signal. Because the method processes the input stereo audio signal in real time based on a neural network, the sound source and the environment sound can be effectively distinguished, a multi-channel audio signal is obtained, and the immersive effect is further improved.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings in the following description are merely exemplary, and those of ordinary skill in the art can derive other embodiments from them without inventive effort.
The structures, proportions and sizes shown in this specification are only used to match the disclosed content for the understanding of those skilled in the art, and do not limit the conditions under which the invention can be implemented; any modification of structure, change of proportion or adjustment of size that does not affect the efficacy or purpose of the invention still falls within the scope of the technical content disclosed herein.
Fig. 1 is a flowchart of a first immersive audio upmixing method provided in an embodiment of the present invention;
fig. 2 is a flowchart of a second immersive audio upmixing method provided in the embodiments of the present invention;
fig. 3 is a flowchart of a third immersive audio upmixing method provided in the embodiments of the present invention;
fig. 4 is a flowchart of a fourth immersive audio upmixing method provided in the embodiments of the present invention;
fig. 5 is a first deep learning sound source separation model training processing framework according to an embodiment of the present invention;
fig. 6 is a training processing framework of a second deep learning sound source separation model according to an embodiment of the present invention;
fig. 7 is a schematic diagram of an immersive audio upmixing system provided in an embodiment of the present invention.
Detailed Description
The present invention is described in terms of specific embodiments, and other advantages and benefits of the present invention will become apparent to those skilled in the art from the following disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, there is provided an immersive audio upmixing method comprising the steps of:
acquiring an input stereo audio signal, and separating the stereo audio signal into a sound source signal and an environment sound signal by adopting a deep learning sound source separation model;
separating the sound source signal into a center sound source signal and a bass signal by adopting a deep learning sound source separation model;
performing decorrelation processing on the environment sound signal by adopting a deep learning sound source separation model to obtain a left surround sound audio signal and a right surround sound audio signal;
and acquiring input left channel audio signals and right channel audio signals, and combining the center sound source signals, the bass signals, the left surround sound audio signals, the right surround sound audio signals, the left channel audio signals and the right channel audio signals to obtain 5.1 channel audio signals.
Referring to fig. 2, in an embodiment, the immersive audio upmixing method further comprises separating the stereo audio signal into a sound source signal, an environment sound signal and decorrelation parameters by adopting a deep learning sound source separation model;
and performing decorrelation processing on the environment sound signals and the decorrelation parameters by adopting a deep learning sound source separation model to obtain left surround sound audio signals and right surround sound audio signals.
Specifically, the deep learning sound source separation model adopts a U-Net structure; the U-Net structure comprises an encoder part and a decoder part, connected by a long short-term memory (LSTM) network. The encoder (6 layers) and decoder (6 layers) parts are composed of convolutional layers, 12 layers in total. Finally, a mask is output for sound separation.
In this embodiment, the U-Net structure includes down-sampling processing, used to concentrate the stereo audio signal information, and up-sampling processing, used to restore the stereo audio signal to its input (pixel) resolution; each down-sampling stage has a skip connection cascaded with the corresponding up-sampling stage.
Specifically, the U-Net structure includes a down-sampling path that concentrates information and an up-sampling path that restores resolution. The model performs max-pooling down-sampling 6 times, extracting a feature map by convolution after each down-sampling, and then restores the input size through 6 up-sampling stages.
In addition, the U-Net employs skip connections: each down-sampling stage is cascaded with the corresponding up-sampling stage through a skip connection, and fusing features of different scales helps the up-sampling path restore detail. Shallow layers have a small down-sampling factor, so their feature maps retain fine detail; deep layers have a large down-sampling factor, so information is highly concentrated and spatial detail is lost, but this aids identification of the target regions (classification). When high-level and low-level features are fused, a very good separation effect can be achieved. A sketch of such a network follows.
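The following is a minimal PyTorch sketch of a separator of this kind, assuming a magnitude-spectrogram input; the layer widths, kernel sizes and the exact placement of the LSTM bottleneck are illustrative assumptions, and only 3 stages per side are shown instead of the 6 described above.

```python
import torch
import torch.nn as nn

class UNetLSTMSeparator(nn.Module):
    """Conv encoder -> LSTM bottleneck -> conv decoder with skip
    connections, ending in a sigmoid mask, as described above."""

    def __init__(self, in_ch=2, chs=(16, 32, 64)):
        super().__init__()
        self.encoders = nn.ModuleList()
        prev = in_ch
        for c in chs:
            self.encoders.append(nn.Sequential(
                nn.Conv2d(prev, c, 3, padding=1), nn.ReLU()))
            prev = c
        self.pool = nn.MaxPool2d(2)
        self.lstm = nn.LSTM(chs[-1], chs[-1], batch_first=True)
        rev = list(reversed(chs))                     # e.g. [64, 32, 16]
        self.ups, self.decoders = nn.ModuleList(), nn.ModuleList()
        for i, c in enumerate(rev):
            up_in = rev[i - 1] if i > 0 else chs[-1]
            self.ups.append(nn.ConvTranspose2d(up_in, c, 2, stride=2))
            self.decoders.append(nn.Sequential(
                nn.Conv2d(2 * c, c, 3, padding=1), nn.ReLU()))
        self.head = nn.Conv2d(chs[0], in_ch, 1)

    def forward(self, spec):                          # spec: (B, 2, F, T)
        skips, x = [], spec
        for enc in self.encoders:
            x = enc(x)
            skips.append(x)                           # pre-pool features
            x = self.pool(x)
        seq, _ = self.lstm(x.mean(dim=2).permute(0, 2, 1))  # over time axis
        x = x + seq.permute(0, 2, 1).unsqueeze(2)     # broadcast over freq
        for up, dec, skip in zip(self.ups, self.decoders, reversed(skips)):
            x = dec(torch.cat([up(x), skip], dim=1))  # skip-connection fusion
        return torch.sigmoid(self.head(x))            # mask in [0, 1]

# Masking separates the mixture: mask * spec estimates the sound source,
# (1 - mask) * spec estimates the environment sound.
spec = torch.randn(1, 2, 64, 128).abs()
mask = UNetLSTMSeparator()(spec)
source, ambience = mask * spec, (1 - mask) * spec
```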
In this embodiment, the deep learning sound source separation model is trained directly on stereo time-domain audio signals; or trained on stereo frequency-domain signals, where the stereo frequency-domain signal comprises left channel real part information, left channel imaginary part information, right channel real part information and right channel imaginary part information; or trained on stereo frequency-domain parameters, where the frequency-domain parameters comprise the energy ratio of the left and right channels.
Referring to fig. 5, the deep learning sound source separation model (NN) takes as training inputs the stereo audio signal together with the sound source signal, the environment sound signal and the decorrelation parameters; task1 is to reconstruct the sound source and the environment sound, task2 is to reconstruct the decorrelation parameters, and a loss function is set according to task1 and task2. In real-time processing, the sound source signal, the environment sound signal and the decorrelation parameters are obtained from the input stereo audio signal alone. A sketch of such a two-task loss follows.
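A minimal sketch of this two-task loss, assuming the model emits predictions for all three targets; the L1/MSE choices and the task weight w are illustrative assumptions, not fixed by the patent.

```python
import torch.nn.functional as F

def two_task_loss(pred_source, pred_ambience, pred_decorr,
                  true_source, true_ambience, true_decorr, w=0.5):
    """task1: reconstruct sound source + environment sound;
    task2: reconstruct the decorrelation parameters."""
    task1 = (F.l1_loss(pred_source, true_source) +
             F.l1_loss(pred_ambience, true_ambience))
    task2 = F.mse_loss(pred_decorr, true_decorr)
    return task1 + w * task2
```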
Referring to fig. 6, in this embodiment the deep learning sound source separation model (NN) takes as training inputs the stereo audio signal together with the sound source signal and the environment sound signal; it is trained to reconstruct the sound source and the environment sound, and the decorrelation parameters are calculated from the mask value. In real-time processing, the sound source signal, the environment sound signal and the decorrelation parameters are obtained from the input stereo audio signal. In this training process, no decorrelation parameters are required as ground truth (the labeled reference data against which supervised training is checked).
In this embodiment, there are many ways to decorrelate. The simplest is phase inversion: the environment signal A_de,ls is inverted by 180 degrees to generate the other environment signal A_de,rs. If this is taken as the most aggressive approach, then the least aggressive is to make the environment signal dual mono (a stereo pair composed of two identical channels), copying it into two identical signals A_dm,ls and A_dm,rs.
Specifically, the decorrelation algorithm may be controlled according to the mask value M in the [0,1] interval as follows:
A_ls = M * A_de,ls + (1 - M) * A_dm,ls
A_rs = M * A_de,rs + (1 - M) * A_dm,rs
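A minimal sketch of this mask-controlled decorrelation, assuming time-domain signals and a scalar or per-sample mask; the variable names mirror the formulas above.

```python
import numpy as np

def decorrelate(ambience, mask):
    """Blend between aggressive (phase-inverted) and mild (dual-mono)
    decorrelation of the environment signal, controlled by mask in [0, 1]."""
    a_de_ls, a_de_rs = ambience, -ambience   # aggressive: 180-degree inversion
    a_dm_ls = a_dm_rs = ambience             # mild: dual mono, identical channels
    a_ls = mask * a_de_ls + (1 - mask) * a_dm_ls
    a_rs = mask * a_de_rs + (1 - mask) * a_dm_rs
    return a_ls, a_rs

# mask = 1 gives fully phase-inverted surrounds; mask = 0 gives dual mono.
ls, rs = decorrelate(np.random.randn(48000), mask=0.7)
```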
In the specific implementation process, training can be performed directly on stereo time-domain audio signals, on stereo frequency-domain signals (left channel real part information, left channel imaginary part information, right channel real part information, right channel imaginary part information), or on stereo frequency-domain parameters (left-right channel energy ratio), as sketched below.
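A sketch of the three candidate input representations, assuming an STFT front end; the STFT parameters and the epsilon guard are illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft

def stereo_features(left, right, fs=48000, nperseg=1024):
    """Build the three training representations described above."""
    _, _, zl = stft(left, fs=fs, nperseg=nperseg)    # complex (F, T)
    _, _, zr = stft(right, fs=fs, nperseg=nperseg)

    time_domain = np.stack([left, right])            # option 1: raw waveforms
    freq_domain = np.stack([zl.real, zl.imag,        # option 2: complex STFT
                            zr.real, zr.imag])       # split into real/imag
    eps = 1e-12                                      # avoid division by zero
    energy_ratio = np.abs(zl) ** 2 / (np.abs(zr) ** 2 + eps)  # option 3
    return time_domain, freq_domain, energy_ratio
```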
Referring to fig. 3, in the present embodiment, mode detection is performed on an input stereo audio signal, and an up-mixing process is performed adaptively, so as to obtain a 5.1 channel audio signal.
Specifically, movie/music mode detection is performed on the input stereo signal; this may classify the entire content at once or classify it in real time frame by frame. When the stereo audio signal is movie content, processing is performed in mode A, shown in fig. 1: a sound source signal and an environment sound signal are obtained based on the deep learning sound source separation model, a center sound source signal and a bass signal are obtained according to the sound source signal, decorrelation is performed according to the environment sound signal to obtain a left surround sound audio signal and a right surround sound audio signal, and finally the center sound source signal, the bass signal, the left surround sound audio signal, the right surround sound audio signal, the left channel audio signal and the right channel audio signal are combined to obtain a 5.1 channel audio signal.
Specifically, mode detection is performed on an input stereo audio signal, and when the stereo audio signal is music content, processing is performed in a mode B:
determining the music style type of the stereo audio signal, setting the center sound source signal to silence according to the music style type, and performing decorrelation on the stereo audio signal to obtain a left surround sound audio signal and a right surround sound audio signal; and finally, combining the bass signal, the left surround sound audio signal, the right surround sound audio signal, the left channel audio signal and the right channel audio signal to obtain a 5.1 channel audio signal.
Referring to fig. 4, in the present embodiment, the input stereo audio signal is processed in real time based on a neural network to obtain a multi-channel audio signal: the input stereo audio signal is processed directly with a deep learning sound source separation model to obtain the multi-channel audio signal, and according to the mode detection result, if the stereo audio signal is movie content, the multiple output channels are predicted using a deep learning neural network method.
Specifically, a multi-channel audio signal is obtained directly from the stereo audio signal through deep learning sound source separation model processing. If the content is movie content according to the classification mode detection result, a plurality of output channels are predicted using a deep learning neural network method. Since the content may not be accurately classified within a short time, the classification can become more and more accurate as time passes; the output is therefore a weighted average of the mode A and mode B results, as sketched below.
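A minimal sketch of this weighted blending, assuming both mode outputs are available and a running classifier confidence p_movie; the linear crossfade is an illustrative assumption.

```python
def blend_modes(out_a, out_b, p_movie):
    """Weighted average of the mode A (movie) and mode B (music) outputs.

    out_a, out_b: arrays of shape (6, n_samples) holding the 5.1 channels;
    p_movie: confidence in [0, 1] that the content is movie material,
    which can be refined as classification improves over time."""
    return p_movie * out_a + (1.0 - p_movie) * out_b
```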
Specifically, in the training process, the inputs are a stereo audio signal and the corresponding multi-channel audio signal: a multi-channel audio signal is reconstructed from the input stereo audio signal, and a loss function is set between the reconstructed multi-channel signal and the original multi-channel audio signal. This embodiment can process the time-domain signal directly, or process the frequency-domain signal after a time-frequency transform. A sketch of such a training step follows.
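A minimal sketch of one such training step for the direct stereo-to-multichannel model; the L1 reconstruction loss and the model interface are illustrative assumptions.

```python
import torch.nn.functional as F

def training_step(model, stereo, multichannel, optimizer):
    """One step: reconstruct 5.1 from stereo, compare to the original.

    model maps (B, 2, n) stereo to (B, 6, n) multichannel audio."""
    optimizer.zero_grad()
    reconstructed = model(stereo)
    loss = F.l1_loss(reconstructed, multichannel)
    loss.backward()
    optimizer.step()
    return loss.item()
```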
Referring to fig. 7, the present invention also provides an immersive audio upmixing system comprising:
a first processing module 1, used for acquiring an input stereo audio signal and separating the stereo audio signal into a sound source signal and an environment sound signal by adopting a deep learning sound source separation model;
a second processing module 2, used for separating the sound source signal into a center sound source signal and a bass signal by adopting a deep learning sound source separation model;
a third processing module 3, used for performing decorrelation processing on the environment sound signal by adopting a deep learning sound source separation model to obtain a left surround sound audio signal and a right surround sound audio signal;
and an audio merging module 4, used for acquiring the input left channel audio signal and right channel audio signal, and merging the center sound source signal, the bass signal, the left surround sound audio signal, the right surround sound audio signal, the left channel audio signal and the right channel audio signal to obtain a 5.1 channel audio signal.
Specifically, the first processing module 1 is further configured to obtain an input stereo audio signal, and separate the stereo audio signal into a sound source signal, an environment sound signal, and decorrelation parameters by using a deep learning sound source separation model;
the third processing module 3 is further configured to perform decorrelation processing on the environmental sound signal and the decorrelation parameters by using a deep learning sound source separation model, so as to obtain a left surround sound audio signal and a right surround sound audio signal.
Specifically, the deep learning sound source separation model adopts a U-Net structure; the U-Net structure comprises an encoder part and a decoder part, connected by a long short-term memory (LSTM) network. The encoder (6 layers) and decoder (6 layers) parts are composed of convolutional layers, 12 layers in total. Finally, a mask is output for sound separation.
In this embodiment, the U-Net structure includes down-sampling processing, used to concentrate the stereo audio signal information, and up-sampling processing, used to restore the stereo audio signal to its input (pixel) resolution; each down-sampling stage has a skip connection cascaded with the corresponding up-sampling stage.
Specifically, the U-Net structure includes a down-sampling path that concentrates information and an up-sampling path that restores resolution. The model performs max-pooling down-sampling 6 times, extracting a feature map by convolution after each down-sampling, and then restores the input size through 6 up-sampling stages.
In addition, the U-Net employs skip connections: each down-sampling stage is cascaded with the corresponding up-sampling stage through a skip connection, and fusing features of different scales helps the up-sampling path restore detail. Shallow layers have a small down-sampling factor, so their feature maps retain fine detail; deep layers have a large down-sampling factor, so information is highly concentrated and spatial detail is lost, but this aids identification of the target regions (classification). When high-level and low-level features are fused, a very good separation effect can be achieved.
In this embodiment, the deep learning sound source separation model is trained directly on stereo time-domain audio signals; or trained on stereo frequency-domain signals, where the stereo frequency-domain signal comprises left channel real part information, left channel imaginary part information, right channel real part information and right channel imaginary part information; or trained on stereo frequency-domain parameters, where the frequency-domain parameters comprise the energy ratio of the left and right channels.
Referring to fig. 5, the deep learning sound source separation model (NN) takes as training inputs the stereo audio signal together with the sound source signal, the environment sound signal and the decorrelation parameters; task1 is to reconstruct the sound source and the environment sound, task2 is to reconstruct the decorrelation parameters, and a loss function is set according to task1 and task2. In real-time processing, the sound source signal, the environment sound signal and the decorrelation parameters are obtained from the input stereo audio signal alone.
Referring to fig. 6, in the present embodiment the deep learning sound source separation model (NN) takes as training inputs the stereo audio signal together with the sound source signal and the environment sound signal; it is trained to reconstruct the sound source and the environment sound, and the decorrelation parameters are calculated from the mask value. In real-time processing, the sound source signal, the environment sound signal and the decorrelation parameters are obtained from the input stereo audio signal. In this training process, no decorrelation parameters are required as ground truth (the labeled reference data against which supervised training is checked).
In this embodiment, there are many methods of decorrelation. The simplest is phase inversion: the environment signal A_de,ls is inverted by 180 degrees to generate the other environment signal A_de,rs. If this is taken as the most aggressive approach, then the least aggressive is to make the environment signal dual mono (a stereo pair composed of two identical channels), copying it into two identical signals A_dm,ls and A_dm,rs.
Specifically, the decorrelation algorithm may be controlled according to the mask value M in the [0,1] interval as follows:
A_ls = M * A_de,ls + (1 - M) * A_dm,ls
A_rs = M * A_de,rs + (1 - M) * A_dm,rs
In the specific implementation process, training can be performed directly on stereo time-domain audio signals, on stereo frequency-domain signals (left channel real part information, left channel imaginary part information, right channel real part information, right channel imaginary part information), or on stereo frequency-domain parameters (left-right channel energy ratio).
Referring to fig. 3, in the present embodiment, mode detection is performed on an input stereo audio signal, and an up-mixing process is performed adaptively, so as to obtain a 5.1 channel audio signal.
Specifically, movie/music mode detection is performed on the input stereo signal; this may classify the entire content at once or classify it in real time frame by frame. When the stereo audio signal is movie content, processing is performed in mode A, shown in fig. 1: a sound source signal and an environment sound signal are obtained based on the deep learning sound source separation model, a center sound source signal and a bass signal are obtained according to the sound source signal, decorrelation is performed according to the environment sound signal to obtain a left surround sound audio signal and a right surround sound audio signal, and finally the center sound source signal, the bass signal, the left surround sound audio signal, the right surround sound audio signal, the left channel audio signal and the right channel audio signal are combined to obtain a 5.1 channel audio signal.
Specifically, mode detection is performed on an input stereo audio signal, and when the stereo audio signal is music content, processing is performed in a mode B:
determining the music style type of the stereo audio signal, setting the center sound source signal to silence according to the music style type, and performing decorrelation on the stereo audio signal to obtain a left surround sound audio signal and a right surround sound audio signal; and finally, combining the bass signal, the left surround sound audio signal, the right surround sound audio signal, the left channel audio signal and the right channel audio signal to obtain a 5.1 channel audio signal.
Referring to fig. 4, in the present embodiment, the input stereo audio signal is processed in real time based on a neural network to obtain a multi-channel audio signal: the input stereo audio signal is processed directly with a deep learning sound source separation model to obtain the multi-channel audio signal, and according to the mode detection result, if the stereo audio signal is movie content, the multiple output channels are predicted using a deep learning neural network method.
Specifically, a multi-channel audio signal is obtained directly from the stereo audio signal through deep learning sound source separation model processing. If the content is movie content according to the classification mode detection result, a plurality of output channels are predicted using a deep learning neural network method. Since the content may not be accurately classified within a short time, the classification can become more and more accurate as time passes; the output is therefore a weighted average of the mode A and mode B results.
Specifically, in the training process, the inputs are a stereo audio signal and the corresponding multi-channel audio signal: a multi-channel audio signal is reconstructed from the input stereo audio signal, and a loss function is set between the reconstructed multi-channel signal and the original multi-channel audio signal. This embodiment can process the time-domain signal directly, or process the frequency-domain signal after a time-frequency transform.
In summary, an input stereo audio signal is acquired and separated into a sound source signal and an environment sound signal by adopting a deep learning sound source separation model; the sound source signal is separated into a center sound source signal and a bass signal; the environment sound signal is decorrelated to obtain a left surround sound audio signal and a right surround sound audio signal; and the center sound source signal, the bass signal, the left surround sound audio signal, the right surround sound audio signal, and the input left channel audio signal and right channel audio signal are combined to obtain a 5.1 channel audio signal. Because the method processes the input stereo audio signal in real time based on a neural network, the sound source and the environment sound can be effectively distinguished, a multi-channel audio signal is obtained, and the immersive effect is further improved.
Although the invention has been described in detail above with reference to a general description and specific examples, it will be apparent to one skilled in the art that modifications or improvements may be made thereto based on the invention. Accordingly, it is intended that all such modifications and alterations be included within the scope of this invention as defined in the appended claims.

Claims (7)

1. An immersive audio upmixing method comprising the steps of:
acquiring an input stereo audio signal, and separating the stereo audio signal into a sound source signal and an environment sound signal by adopting a deep learning sound source separation model;
separating the sound source signal into a center sound source signal and a bass signal by adopting a deep learning sound source separation model;
performing decorrelation processing on the environment sound signals by adopting a deep learning sound source separation model to obtain left surround sound audio signals and right surround sound audio signals;
acquiring input left channel audio signals and right channel audio signals, and combining the center sound source signals, the bass signals, the left surround sound audio signals, the right surround sound audio signals, the left channel audio signals and the right channel audio signals to obtain 5.1 channel audio signals;
the method also comprises the steps of separating the stereo audio signal into a sound source signal, an environment sound signal and decorrelation parameters by adopting a deep learning sound source separation model;
performing decorrelation processing on the environment sound signals and the decorrelation parameters by adopting a deep learning sound source separation model to obtain left surround sound audio signals and right surround sound audio signals;
the deep learning sound source separation model adopts a U-Net structure; the U-Net structure comprises an encoder part and a decoder part; the encoder and the decoder are connected by a long short-term memory (LSTM) network, and finally a mask is output for sound separation;
the U-Net structure comprises down-sampling processing, used to concentrate the stereo audio signal information, and up-sampling processing, used to restore the stereo audio signal to its input (pixel) resolution;
in the U-Net structure, each down-sampling stage has a skip connection cascaded with the corresponding up-sampling stage.
2. The immersive audio upmixing method of claim 1, wherein the deep learning sound source separation model is trained directly on a stereo time domain audio signal;
or training is carried out according to a stereo frequency domain signal, wherein the stereo frequency domain signal comprises left channel real part information, left channel imaginary part information, right channel real part information and right channel imaginary part information;
or training is carried out according to the frequency domain parameters of the stereo, wherein the frequency domain parameters of the stereo comprise the energy ratio of the left channel and the right channel.
3. The immersive audio upmixing method of claim 1, wherein mode detection is performed on the input stereo audio signal, and when the stereo audio signal is movie content, mode A is used for processing:
the method comprises the steps of obtaining a sound source signal and an environment sound signal based on a deep learning sound source separation model, obtaining a middle sound source signal and a low sound signal according to the sound source signal, performing decorrelation according to the environment sound signal to obtain a left surround sound audio signal and a right surround sound audio signal, and finally combining the middle sound source signal, the low sound signal, the left surround sound audio signal, the right surround sound audio signal, the left sound channel audio signal and the right sound channel audio signal to obtain a 5.1 sound channel audio signal.
4. An immersive audio upmixing method according to claim 3, wherein mode detection is performed on an input stereo audio signal, and when the stereo audio signal is music content, mode B processing is performed:
determining a music style type of the stereo audio signal, setting the center sound source signal to silence according to the music style type, and performing decorrelation on the stereo audio signal to obtain a left surround sound audio signal and a right surround sound audio signal; and finally, combining the bass signal, the left surround sound audio signal, the right surround sound audio signal, the left channel audio signal and the right channel audio signal to obtain a 5.1 channel audio signal.
5. The immersive audio upmixing method of claim 4, wherein for the input stereo audio signal, a deep learning source separation model is applied to directly obtain a multi-channel audio signal;
according to the mode detection result, if the stereo audio signal is movie content, a plurality of output channels are predicted using a deep learning neural network method.
6. An immersive audio upmixing system, comprising:
a first processing module, used for acquiring an input stereo audio signal and separating the stereo audio signal into a sound source signal and an environment sound signal by adopting a deep learning sound source separation model;
a second processing module, used for separating the sound source signal into a center sound source signal and a bass signal by adopting a deep learning sound source separation model;
a third processing module, used for performing decorrelation processing on the environment sound signal by adopting a deep learning sound source separation model to obtain a left surround sound audio signal and a right surround sound audio signal;
an audio merging module, used for acquiring an input left channel audio signal and an input right channel audio signal, and merging the center sound source signal, the bass signal, the left surround sound audio signal, the right surround sound audio signal, the left channel audio signal and the right channel audio signal to obtain a 5.1 channel audio signal;
separating the stereo audio signal into a sound source signal, an environment sound signal and decorrelation parameters by adopting a deep learning sound source separation model;
performing decorrelation processing on the environment sound signals and the decorrelation parameters by adopting a deep learning sound source separation model to obtain left surround sound audio signals and right surround sound audio signals;
the deep learning sound source separation model adopts a U-Net structure; the U-Net structure comprises an encoder part and a decoder part; the encoder and the decoder are connected by a long short-term memory (LSTM) network, and finally a mask is output for sound separation;
the U-Net structure comprises down-sampling processing, used to concentrate the stereo audio signal information, and up-sampling processing, used to restore the stereo audio signal to its input (pixel) resolution;
in the U-Net structure, each down-sampling stage has a skip connection cascaded with the corresponding up-sampling stage.
7. The immersive audio upmixing system of claim 6, wherein the first processing module is further configured to obtain an input stereo audio signal, and separate the stereo audio signal into a sound source signal, an ambient sound signal, and decorrelation parameters using a deep learning sound source separation model;
and the third processing module is also used for performing decorrelation processing on the environment sound signals and the decorrelation parameters by adopting a deep learning sound source separation model to obtain left surround sound audio signals and right surround sound audio signals.
CN202110111130.2A 2021-01-27 2021-01-27 Immersive audio upmixing method and system Active CN112866896B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110111130.2A CN112866896B (en) 2021-01-27 2021-01-27 Immersive audio upmixing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110111130.2A CN112866896B (en) 2021-01-27 2021-01-27 Immersive audio upmixing method and system

Publications (2)

Publication Number Publication Date
CN112866896A (en) 2021-05-28
CN112866896B (en) 2022-07-15

Family

ID=76009551

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110111130.2A Active CN112866896B (en) 2021-01-27 2021-01-27 Immersive audio upmixing method and system

Country Status (1)

Country Link
CN (1) CN112866896B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115691515A (en) * 2022-07-12 2023-02-03 南京拓灵智能科技有限公司 Audio coding and decoding method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018047643A1 (en) * 2016-09-09 2018-03-15 Sony Corporation Device and method for sound source separation, and program
CN111429939A (en) * 2020-02-20 2020-07-17 西安声联科技有限公司 Sound signal separation method of double sound sources and sound pickup

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5082327B2 (en) * 2006-08-09 2012-11-28 Sony Corporation Audio signal processing apparatus, audio signal processing method, and audio signal processing program
KR101567461B1 (en) * 2009-11-16 2015-11-09 삼성전자주식회사 Apparatus for generating multi-channel sound signal
US20150243289A1 (en) * 2012-09-14 2015-08-27 Dolby Laboratories Licensing Corporation Multi-Channel Audio Content Analysis Based Upmix Detection
WO2019099899A1 (en) * 2017-11-17 2019-05-23 Facebook, Inc. Analyzing spatially-sparse data based on submanifold sparse convolutional neural networks

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018047643A1 (en) * 2016-09-09 2018-03-15 Sony Corporation Device and method for sound source separation, and program
CN109661705A (en) * 2016-09-09 2019-04-19 Sony Corporation Sound source separating device and method and program
CN111429939A (en) * 2020-02-20 2020-07-17 西安声联科技有限公司 Sound signal separation method of double sound sources and sound pickup

Also Published As

Publication number Publication date
CN112866896A (en) 2021-05-28


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210824

Address after: Room 960A, floor 9, No. 11, Zhongguancun Street, Haidian District, Beijing 100190

Applicant after: Beijing Tuoling Xinsheng Technology Co.,Ltd.

Address before: Room F12, 14th floor, building B, latte City, 318 Yanta South Road, Qujiang New District, Xi'an City, Shaanxi Province, 710061

Applicant before: Xi'an times Tuoling Technology Co.,Ltd.

Applicant before: BEIJING TUOLING Inc.

GR01 Patent grant