CN112866896B - Immersive audio upmixing method and system - Google Patents

Immersive audio upmixing method and system

Info

Publication number
CN112866896B
Authority
CN
China
Prior art keywords
signal
audio signal
sound
sound source
stereo
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110111130.2A
Other languages
Chinese (zh)
Other versions
CN112866896A (en)
Inventor
孙学京
李旭阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Tuoling Xinsheng Technology Co ltd
Original Assignee
Beijing Tuoling Xinsheng Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Tuoling Xinsheng Technology Co ltd filed Critical Beijing Tuoling Xinsheng Technology Co ltd
Priority to CN202110111130.2A
Publication of CN112866896A
Application granted
Publication of CN112866896B
Legal status: Active (current)

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30 Control circuits for electronic adaptation of the sound field
    • H04S7/305 Electronic adaptation of stereophonic audio signals to reverberation of the listening space
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/15 Aspects of sound capture and related signal processing for recording or reproduction

Abstract

The invention discloses an immersive audio upmixing method and system. An input stereo audio signal is acquired and separated into a sound source signal and an environment sound signal by a deep learning sound source separation model; the sound source signal is further separated into a center sound source signal and a bass signal; the environment sound signal is decorrelated to obtain a left surround sound audio signal and a right surround sound audio signal; finally, the center sound source signal, the bass signal, the left and right surround sound audio signals, and the input left and right channel audio signals are combined into a 5.1 channel audio signal. Because the method processes the input stereo audio signal in real time based on a neural network, the sound source and the environment sound can be effectively distinguished, a multi-channel audio signal is obtained, and the immersive effect is further improved.

Description

Immersive audio upmixing method and system
Technical Field
The invention relates to the technical field of sound processing, in particular to an immersive audio frequency upmixing method and system.
Background
In recent years, with the development of high-definition video from 2K to 4K and even 8K, and with the rise of virtual reality (VR) and augmented reality (AR), requirements for audio have also increased. Listeners are no longer satisfied with the stereo sound that has been popular for many years and instead pursue 3D or immersive sound with a stronger sense of presence and realism. Professional and home theaters typically have multiple speakers that can play 5.1/7.1 or higher channel counts of immersive audio, and automotive audio is also gradually transitioning to content with more than two channels.
Currently, upmix algorithms are used to process a stereo audio signal into surround sound channels such as center (C), left (L), right (R), left surround (LS), right surround (RS) and low-frequency effects (LFE). In a typical scheme, band-pass filtering (BPF) and low-pass filtering (LPF) are applied to a center signal to obtain the C audio signal and the LFE audio signal; a surround signal, derived by combining the input left channel audio signal and right channel audio signal, is delayed, low-pass filtered, and further decorrelated (for example, by phase inversion) to obtain the LS audio signal and the RS audio signal. In the prior art, the sound source and the environment sound cannot be well distinguished after upmixing, which greatly weakens the immersive effect of the multi-channel audio signal. A large amount of traditional stereo (two-channel) content already exists in the market, and how to make the latest immersive audio systems compatible with this content while making fuller use of the additional channels to render a better immersive effect is a pain point to be solved urgently.
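For concreteness, the following is a minimal sketch of such a conventional filter-based upmix chain. The sum/difference derivation of the center and surround signals, the filter cutoffs, and the delay length are illustrative assumptions, not values fixed by any particular prior-art system.

```python
import numpy as np
from scipy.signal import butter, lfilter

def classic_upmix(left, right, fs=48000):
    """Prior-art style 2-to-5.1 upmix using only filters, delay and
    phase inversion (no source/ambience separation)."""
    center = 0.5 * (left + right)        # sum signal feeds the center path
    surround = 0.5 * (left - right)      # difference signal feeds the surrounds

    # Band-pass the center for C, low-pass it for LFE (cutoffs illustrative).
    b_bp, a_bp = butter(2, [200 / (fs / 2), 8000 / (fs / 2)], btype="band")
    b_lp, a_lp = butter(4, 120 / (fs / 2), btype="low")
    c = lfilter(b_bp, a_bp, center)
    lfe = lfilter(b_lp, a_lp, center)

    # Delay + low-pass the surround, then decorrelate LS/RS by phase inversion.
    delay = int(0.012 * fs)              # ~12 ms, an illustrative value
    surround = np.concatenate([np.zeros(delay), surround[:-delay]])
    b_sl, a_sl = butter(4, 7000 / (fs / 2), btype="low")
    ls = lfilter(b_sl, a_sl, surround)
    rs = -ls                             # 180-degree inversion
    return left, right, c, lfe, ls, rs
```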
Disclosure of Invention
Therefore, the immersive audio upmixing method and system provided by the invention convert stereo audio into multi-channel audio with at least four channels, improving the overall listening experience and solving the problem that the sound source and the environment sound cannot be well distinguished, which weakens the immersive effect of a multi-channel audio signal.
In order to achieve the above purpose, the invention provides the following technical scheme: an immersive audio upmixing method comprising the steps of:
acquiring an input stereo audio signal, and separating the stereo audio signal into a sound source signal and an environment sound signal by adopting a deep learning sound source separation model;
separating the sound source signal into a center sound source signal and a bass signal by adopting a deep learning sound source separation model;
performing decorrelation processing on the environment sound signals by adopting a deep learning sound source separation model to obtain left surround sound audio signals and right surround sound audio signals;
and acquiring input left channel audio signals and right channel audio signals, and combining the center sound source signals, the bass signals, the left surround sound audio signals, the right surround sound audio signals, the left channel audio signals and the right channel audio signals to obtain 5.1 channel audio signals.
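As an illustration of this final merging step, the following sketch assembles the six signals into one 5.1 buffer; the channel ordering (L, R, C, LFE, LS, RS) and the function name are illustrative assumptions, since actual systems may use a different layout.

```python
import numpy as np

def merge_to_5_1(left, right, center, lfe, ls, rs):
    """Stack the six per-channel signals into one 5.1 buffer."""
    channels = [left, right, center, lfe, ls, rs]
    n = min(len(ch) for ch in channels)           # guard against length drift
    return np.stack([ch[:n] for ch in channels])  # shape: (6, n_samples)
```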
As a preferable scheme of the immersive audio upmixing method, further comprising separating the stereo audio signal into a sound source signal, an environmental sound signal, and decorrelation parameters using a deep learning sound source separation model;
and performing decorrelation processing on the environment sound signals and the decorrelation parameters by adopting a deep learning sound source separation model to obtain left surround sound audio signals and right surround sound audio signals.
As a preferable scheme of the immersive audio upmixing method, the deep learning sound source separation model adopts a U-Net structure; the U-Net structure comprises an encoder part and a decoder part; the encoder and the decoder are connected by a long short-term memory (LSTM) network, and finally a mask is output for sound separation.
As a preferable scheme of the immersive audio upmixing method, the U-Net structure includes a down-sampling process for concentrating the stereo audio signal information and an up-sampling process for restoring the stereo audio signal to its input (pixel) resolution;
in the U-Net structure, each down-sampling stage has a skip connection cascaded with the corresponding up-sampling stage.
As a preferred solution of the immersive audio upmixing method, the deep learning sound source separation model is trained directly on a stereo time domain audio signal;
or trained on a stereo frequency domain signal, wherein the stereo frequency domain signal comprises left channel real part information, left channel imaginary part information, right channel real part information and right channel imaginary part information;
or trained on stereo frequency domain parameters, wherein the frequency domain parameters comprise the energy ratio of the left channel and the right channel.
As a preferred scheme of the immersive audio upmixing method, mode detection is performed on the input stereo audio signal, and when the stereo audio signal is movie or television content, processing is performed in mode A:
the method comprises the steps of obtaining a sound source signal and an environment sound signal based on a deep learning sound source separation model, obtaining a middle sound source signal and a low sound signal according to the sound source signal, performing decorrelation according to the environment sound signal to obtain a left surround sound audio signal and a right surround sound audio signal, and finally combining the middle sound source signal, the low sound signal, the left surround sound audio signal, the right surround sound audio signal, the left sound channel audio signal and the right sound channel audio signal to obtain a 5.1 sound channel audio signal.
As a preferred scheme of the immersive audio upmixing method, the method performs mode detection on an input stereo audio signal, and when the stereo audio signal is music content, performs processing in a mode B:
determining a music style type of the stereo audio signal, setting the center sound source signal to silence according to the music style type, and performing decorrelation on the stereo audio signal to obtain a left surround sound audio signal and a right surround sound audio signal; and finally, combining the bass signal, the left surround sound audio signal, the right surround sound audio signal, the left channel audio signal and the right channel audio signal to obtain a 5.1 channel audio signal.
As a preferred scheme of the immersive audio upmixing method, the input stereo audio signal is processed directly with a deep learning sound source separation model to obtain a multi-channel audio signal;
according to the mode detection result, if the stereo audio signal is movie content, a plurality of output channels are predicted using a deep learning neural network method.
The present invention also provides an immersive audio upmixing system comprising:
a first processing module, used for acquiring an input stereo audio signal and separating the stereo audio signal into a sound source signal and an environment sound signal by adopting a deep learning sound source separation model;
a second processing module, used for separating the sound source signal into a center sound source signal and a bass signal by adopting a deep learning sound source separation model;
a third processing module, used for performing decorrelation processing on the environment sound signal by adopting a deep learning sound source separation model to obtain a left surround sound audio signal and a right surround sound audio signal;
and an audio merging module, used for acquiring the input left channel audio signal and right channel audio signal, and merging the center sound source signal, the bass signal, the left surround sound audio signal, the right surround sound audio signal, the left channel audio signal and the right channel audio signal to obtain a 5.1 channel audio signal.
As a preferred scheme of the immersive audio upmixing system, the first processing module is further configured to obtain an input stereo audio signal, and separate the stereo audio signal into a sound source signal, an environmental sound signal, and decorrelation parameters by using a deep learning sound source separation model;
and the third processing module is also used for performing decorrelation processing on the environment sound signals and the decorrelation parameters by adopting a deep learning sound source separation model to obtain left surround sound audio signals and right surround sound audio signals.
The invention has the following advantages: an input stereo audio signal is acquired and separated into a sound source signal and an environment sound signal by adopting a deep learning sound source separation model; the sound source signal is separated into a center sound source signal and a bass signal; the environment sound signal is decorrelated to obtain a left surround sound audio signal and a right surround sound audio signal; and the center sound source signal, the bass signal, the left surround sound audio signal, the right surround sound audio signal, and the input left channel audio signal and right channel audio signal are combined to obtain a 5.1 channel audio signal. Because the method processes the input stereo audio signal in real time based on a neural network, the sound source and the environment sound can be effectively distinguished, a multi-channel audio signal is obtained, and the immersive effect is further improved.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings in the following description are merely exemplary, and those of ordinary skill in the art can derive other embodiments from them without inventive effort.
The structures, proportions and sizes shown in this specification are only used to match the disclosed content for the understanding of those skilled in the art, and do not limit the conditions under which the invention can be implemented; any modification of structure, change of proportion or adjustment of size that does not affect the efficacy or purpose of the invention still falls within the scope of the technical content disclosed herein.
Fig. 1 is a flowchart of a first immersive audio upmixing method provided in an embodiment of the present invention;
fig. 2 is a flowchart of a second immersive audio upmixing method provided in the embodiments of the present invention;
fig. 3 is a flowchart of a third immersive audio upmixing method provided in the embodiments of the present invention;
fig. 4 is a flowchart of a fourth immersive audio upmixing method provided in the embodiments of the present invention;
fig. 5 is a first deep learning sound source separation model training processing framework according to an embodiment of the present invention;
fig. 6 is a training processing framework of a second deep learning sound source separation model according to an embodiment of the present invention;
fig. 7 is a schematic diagram of an immersive audio upmixing system provided in an embodiment of the present invention.
Detailed Description
The present invention is described in terms of specific embodiments, and other advantages and benefits of the present invention will become apparent to those skilled in the art from the following disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, there is provided an immersive audio upmixing method comprising the steps of:
acquiring an input stereo audio signal, and separating the stereo audio signal into a sound source signal and an environment sound signal by adopting a deep learning sound source separation model;
separating the sound source signal into a center sound source signal and a bass signal by adopting a deep learning sound source separation model;
performing decorrelation processing on the environment sound signal by adopting a deep learning sound source separation model to obtain a left surround sound audio signal and a right surround sound audio signal;
and acquiring input left channel audio signals and right channel audio signals, and combining the center sound source signals, the bass signals, the left surround sound audio signals, the right surround sound audio signals, the left channel audio signals and the right channel audio signals to obtain 5.1 channel audio signals.
Referring to fig. 2, in an embodiment, the immersive audio upmixing method further comprises separating the stereo audio signal into a sound source signal, an environment sound signal and decorrelation parameters by adopting a deep learning sound source separation model;
and performing decorrelation processing on the environment sound signals and the decorrelation parameters by adopting a deep learning sound source separation model to obtain left surround sound audio signals and right surround sound audio signals.
Specifically, the deep learning sound source separation model adopts a U-Net structure; the U-Net structure comprises an encoder part and a decoder part, connected by a long short-term memory (LSTM) network. The encoder (6 layers) and decoder (6 layers) parts are composed of convolutional layers, 12 layers in total. Finally, a mask is output for sound separation.
In this embodiment, the U-Net structure includes down-sampling processing, used to concentrate the stereo audio signal information, and up-sampling processing, used to restore the stereo audio signal to its input (pixel) resolution; each down-sampling stage has a skip connection cascaded with the corresponding up-sampling stage.
Specifically, the U-Net structure includes a down-sampling path that concentrates information and an up-sampling path that restores resolution. The model performs max-pooling down-sampling 6 times, extracting a feature map by convolution after each down-sampling, and then restores the input size through 6 up-sampling stages.
In addition, the U-Net employs skip connections: each down-sampling stage is cascaded with the corresponding up-sampling stage through a skip connection, and fusing features of different scales helps the up-sampling path restore detail. Shallow layers have a small down-sampling factor, so their feature maps retain fine detail; deep layers have a large down-sampling factor, so information is highly concentrated and spatial detail is lost, but this aids identification of the target regions (classification). When high-level and low-level features are fused, a very good separation effect can be achieved. A sketch of such a network follows.
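The following is a minimal PyTorch sketch of a separator of this kind, assuming a magnitude-spectrogram input; the layer widths, kernel sizes and the exact placement of the LSTM bottleneck are illustrative assumptions, and only 3 stages per side are shown instead of the 6 described above.

```python
import torch
import torch.nn as nn

class UNetLSTMSeparator(nn.Module):
    """Conv encoder -> LSTM bottleneck -> conv decoder with skip
    connections, ending in a sigmoid mask, as described above."""

    def __init__(self, in_ch=2, chs=(16, 32, 64)):
        super().__init__()
        self.encoders = nn.ModuleList()
        prev = in_ch
        for c in chs:
            self.encoders.append(nn.Sequential(
                nn.Conv2d(prev, c, 3, padding=1), nn.ReLU()))
            prev = c
        self.pool = nn.MaxPool2d(2)
        self.lstm = nn.LSTM(chs[-1], chs[-1], batch_first=True)
        rev = list(reversed(chs))                     # e.g. [64, 32, 16]
        self.ups, self.decoders = nn.ModuleList(), nn.ModuleList()
        for i, c in enumerate(rev):
            up_in = rev[i - 1] if i > 0 else chs[-1]
            self.ups.append(nn.ConvTranspose2d(up_in, c, 2, stride=2))
            self.decoders.append(nn.Sequential(
                nn.Conv2d(2 * c, c, 3, padding=1), nn.ReLU()))
        self.head = nn.Conv2d(chs[0], in_ch, 1)

    def forward(self, spec):                          # spec: (B, 2, F, T)
        skips, x = [], spec
        for enc in self.encoders:
            x = enc(x)
            skips.append(x)                           # pre-pool features
            x = self.pool(x)
        seq, _ = self.lstm(x.mean(dim=2).permute(0, 2, 1))  # over time axis
        x = x + seq.permute(0, 2, 1).unsqueeze(2)     # broadcast over freq
        for up, dec, skip in zip(self.ups, self.decoders, reversed(skips)):
            x = dec(torch.cat([up(x), skip], dim=1))  # skip-connection fusion
        return torch.sigmoid(self.head(x))            # mask in [0, 1]

# Masking separates the mixture: mask * spec estimates the sound source,
# (1 - mask) * spec estimates the environment sound.
spec = torch.randn(1, 2, 64, 128).abs()
mask = UNetLSTMSeparator()(spec)
source, ambience = mask * spec, (1 - mask) * spec
```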
In this embodiment, the deep learning sound source separation model is trained directly on stereo time-domain audio signals; or trained on stereo frequency-domain signals, where the stereo frequency-domain signal comprises left channel real part information, left channel imaginary part information, right channel real part information and right channel imaginary part information; or trained on stereo frequency-domain parameters, where the frequency-domain parameters comprise the energy ratio of the left and right channels.
Referring to fig. 5, the deep learning sound source separation model (NN) takes as training inputs the stereo audio signal together with the sound source signal, the environment sound signal and the decorrelation parameters; task1 is to reconstruct the sound source and the environment sound, task2 is to reconstruct the decorrelation parameters, and a loss function is set according to task1 and task2. In real-time processing, the sound source signal, the environment sound signal and the decorrelation parameters are obtained from the input stereo audio signal alone. A sketch of such a two-task loss follows.
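A minimal sketch of this two-task loss, assuming the model emits predictions for all three targets; the L1/MSE choices and the task weight w are illustrative assumptions, not fixed by the patent.

```python
import torch.nn.functional as F

def two_task_loss(pred_source, pred_ambience, pred_decorr,
                  true_source, true_ambience, true_decorr, w=0.5):
    """task1: reconstruct sound source + environment sound;
    task2: reconstruct the decorrelation parameters."""
    task1 = (F.l1_loss(pred_source, true_source) +
             F.l1_loss(pred_ambience, true_ambience))
    task2 = F.mse_loss(pred_decorr, true_decorr)
    return task1 + w * task2
```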
Referring to fig. 6, in this embodiment the deep learning sound source separation model (NN) takes as training inputs the stereo audio signal together with the sound source signal and the environment sound signal; it is trained to reconstruct the sound source and the environment sound, and the decorrelation parameters are calculated from the mask value. In real-time processing, the sound source signal, the environment sound signal and the decorrelation parameters are obtained from the input stereo audio signal. In this training process, no decorrelation parameters are required as ground truth (the labeled reference data against which supervised training is checked).
In this embodiment, there are many ways to decorrelate. The simplest is phase inversion: the environment signal A_de,ls is inverted by 180 degrees to generate the other environment signal A_de,rs. If this is taken as the most aggressive approach, then the least aggressive is to make the environment signal dual mono (a stereo pair composed of two identical channels), copying it into two identical signals A_dm,ls and A_dm,rs.
Specifically, the decorrelation algorithm may be controlled according to the mask value M in the [0,1] interval as follows:
A_ls = M * A_de,ls + (1 - M) * A_dm,ls
A_rs = M * A_de,rs + (1 - M) * A_dm,rs
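A minimal sketch of this mask-controlled decorrelation, assuming time-domain signals and a scalar or per-sample mask; the variable names mirror the formulas above.

```python
import numpy as np

def decorrelate(ambience, mask):
    """Blend between aggressive (phase-inverted) and mild (dual-mono)
    decorrelation of the environment signal, controlled by mask in [0, 1]."""
    a_de_ls, a_de_rs = ambience, -ambience   # aggressive: 180-degree inversion
    a_dm_ls = a_dm_rs = ambience             # mild: dual mono, identical channels
    a_ls = mask * a_de_ls + (1 - mask) * a_dm_ls
    a_rs = mask * a_de_rs + (1 - mask) * a_dm_rs
    return a_ls, a_rs

# mask = 1 gives fully phase-inverted surrounds; mask = 0 gives dual mono.
ls, rs = decorrelate(np.random.randn(48000), mask=0.7)
```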
In the specific implementation process, training can be performed directly on stereo time-domain audio signals, on stereo frequency-domain signals (left channel real part information, left channel imaginary part information, right channel real part information, right channel imaginary part information), or on stereo frequency-domain parameters (left-right channel energy ratio), as sketched below.
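A sketch of the three candidate input representations, assuming an STFT front end; the STFT parameters and the epsilon guard are illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft

def stereo_features(left, right, fs=48000, nperseg=1024):
    """Build the three training representations described above."""
    _, _, zl = stft(left, fs=fs, nperseg=nperseg)    # complex (F, T)
    _, _, zr = stft(right, fs=fs, nperseg=nperseg)

    time_domain = np.stack([left, right])            # option 1: raw waveforms
    freq_domain = np.stack([zl.real, zl.imag,        # option 2: complex STFT
                            zr.real, zr.imag])       # split into real/imag
    eps = 1e-12                                      # avoid division by zero
    energy_ratio = np.abs(zl) ** 2 / (np.abs(zr) ** 2 + eps)  # option 3
    return time_domain, freq_domain, energy_ratio
```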
Referring to fig. 3, in the present embodiment, mode detection is performed on an input stereo audio signal, and an up-mixing process is performed adaptively, so as to obtain a 5.1 channel audio signal.
Specifically, movie/music mode detection is performed on the input stereo signal; this may classify the entire content at once or classify it in real time frame by frame. When the stereo audio signal is movie content, processing is performed in mode A, shown in fig. 1: a sound source signal and an environment sound signal are obtained based on the deep learning sound source separation model, a center sound source signal and a bass signal are obtained according to the sound source signal, decorrelation is performed according to the environment sound signal to obtain a left surround sound audio signal and a right surround sound audio signal, and finally the center sound source signal, the bass signal, the left surround sound audio signal, the right surround sound audio signal, the left channel audio signal and the right channel audio signal are combined to obtain a 5.1 channel audio signal.
Specifically, mode detection is performed on an input stereo audio signal, and when the stereo audio signal is music content, processing is performed in a mode B:
determining the music style type of the stereo audio signal, setting the center sound source signal to silence according to the music style type, and performing decorrelation on the stereo audio signal to obtain a left surround sound audio signal and a right surround sound audio signal; and finally, combining the bass signal, the left surround sound audio signal, the right surround sound audio signal, the left channel audio signal and the right channel audio signal to obtain a 5.1 channel audio signal.
Referring to fig. 4, in the present embodiment, the input stereo audio signal is processed in real time based on a neural network to obtain a multi-channel audio signal: the input stereo audio signal is processed directly with a deep learning sound source separation model to obtain the multi-channel audio signal, and according to the mode detection result, if the stereo audio signal is movie content, the multiple output channels are predicted using a deep learning neural network method.
Specifically, a multi-channel audio signal is obtained directly from the stereo audio signal through deep learning sound source separation model processing. If the content is movie content according to the classification mode detection result, a plurality of output channels are predicted using a deep learning neural network method. Since the content may not be accurately classified within a short time, the classification can become more and more accurate as time passes; the output is therefore a weighted average of the mode A and mode B results, as sketched below.
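A minimal sketch of this weighted blending, assuming both mode outputs are available and a running classifier confidence p_movie; the linear crossfade is an illustrative assumption.

```python
def blend_modes(out_a, out_b, p_movie):
    """Weighted average of the mode A (movie) and mode B (music) outputs.

    out_a, out_b: arrays of shape (6, n_samples) holding the 5.1 channels;
    p_movie: confidence in [0, 1] that the content is movie material,
    which can be refined as classification improves over time."""
    return p_movie * out_a + (1.0 - p_movie) * out_b
```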
Specifically, in the training process, the inputs are a stereo audio signal and the corresponding multi-channel audio signal: a multi-channel audio signal is reconstructed from the input stereo audio signal, and a loss function is set between the reconstructed multi-channel signal and the original multi-channel audio signal. This embodiment can process the time-domain signal directly, or process the frequency-domain signal after a time-frequency transform. A sketch of such a training step follows.
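A minimal sketch of one such training step for the direct stereo-to-multichannel model; the L1 reconstruction loss and the model interface are illustrative assumptions.

```python
import torch.nn.functional as F

def training_step(model, stereo, multichannel, optimizer):
    """One step: reconstruct 5.1 from stereo, compare to the original.

    model maps (B, 2, n) stereo to (B, 6, n) multichannel audio."""
    optimizer.zero_grad()
    reconstructed = model(stereo)
    loss = F.l1_loss(reconstructed, multichannel)
    loss.backward()
    optimizer.step()
    return loss.item()
```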
Referring to fig. 7, the present invention also provides an immersive audio upmixing system comprising:
a first processing module 1, used for acquiring an input stereo audio signal and separating the stereo audio signal into a sound source signal and an environment sound signal by adopting a deep learning sound source separation model;
a second processing module 2, used for separating the sound source signal into a center sound source signal and a bass signal by adopting a deep learning sound source separation model;
a third processing module 3, used for performing decorrelation processing on the environment sound signal by adopting a deep learning sound source separation model to obtain a left surround sound audio signal and a right surround sound audio signal;
and an audio merging module 4, used for acquiring the input left channel audio signal and right channel audio signal, and merging the center sound source signal, the bass signal, the left surround sound audio signal, the right surround sound audio signal, the left channel audio signal and the right channel audio signal to obtain a 5.1 channel audio signal.
Specifically, the first processing module 1 is further configured to obtain an input stereo audio signal, and separate the stereo audio signal into a sound source signal, an environment sound signal, and decorrelation parameters by using a deep learning sound source separation model;
the third processing module 3 is further configured to perform decorrelation processing on the environmental sound signal and the decorrelation parameters by using a deep learning sound source separation model, so as to obtain a left surround sound audio signal and a right surround sound audio signal.
Specifically, the deep learning sound source separation model adopts a U-Net structure; the U-Net structure comprises an encoder part and a decoder part, connected by a long short-term memory (LSTM) network. The encoder (6 layers) and decoder (6 layers) parts are composed of convolutional layers, 12 layers in total. Finally, a mask is output for sound separation.
In this embodiment, the U-Net structure includes down-sampling processing, used to concentrate the stereo audio signal information, and up-sampling processing, used to restore the stereo audio signal to its input (pixel) resolution; each down-sampling stage has a skip connection cascaded with the corresponding up-sampling stage.
Specifically, the U-Net structure includes a down-sampling path that concentrates information and an up-sampling path that restores resolution. The model performs max-pooling down-sampling 6 times, extracting a feature map by convolution after each down-sampling, and then restores the input size through 6 up-sampling stages.
In addition, the U-Net employs skip connections: each down-sampling stage is cascaded with the corresponding up-sampling stage through a skip connection, and fusing features of different scales helps the up-sampling path restore detail. Shallow layers have a small down-sampling factor, so their feature maps retain fine detail; deep layers have a large down-sampling factor, so information is highly concentrated and spatial detail is lost, but this aids identification of the target regions (classification). When high-level and low-level features are fused, a very good separation effect can be achieved.
In this embodiment, the deep learning sound source separation model is trained directly on stereo time-domain audio signals; or trained on stereo frequency-domain signals, where the stereo frequency-domain signal comprises left channel real part information, left channel imaginary part information, right channel real part information and right channel imaginary part information; or trained on stereo frequency-domain parameters, where the frequency-domain parameters comprise the energy ratio of the left and right channels.
Referring to fig. 5, the deep learning sound source separation model (NN) takes as training inputs the stereo audio signal together with the sound source signal, the environment sound signal and the decorrelation parameters; task1 is to reconstruct the sound source and the environment sound, task2 is to reconstruct the decorrelation parameters, and a loss function is set according to task1 and task2. In real-time processing, the sound source signal, the environment sound signal and the decorrelation parameters are obtained from the input stereo audio signal alone.
Referring to fig. 6, in the present embodiment the deep learning sound source separation model (NN) takes as training inputs the stereo audio signal together with the sound source signal and the environment sound signal; it is trained to reconstruct the sound source and the environment sound, and the decorrelation parameters are calculated from the mask value. In real-time processing, the sound source signal, the environment sound signal and the decorrelation parameters are obtained from the input stereo audio signal. In this training process, no decorrelation parameters are required as ground truth (the labeled reference data against which supervised training is checked).
In this embodiment, there are many methods of decorrelation. The simplest is phase inversion: the environment signal A_de,ls is inverted by 180 degrees to generate the other environment signal A_de,rs. If this is taken as the most aggressive approach, then the least aggressive is to make the environment signal dual mono (a stereo pair composed of two identical channels), copying it into two identical signals A_dm,ls and A_dm,rs.
Specifically, the decorrelation algorithm may be controlled according to the mask value M in the [0,1] interval as follows:
A_ls = M * A_de,ls + (1 - M) * A_dm,ls
A_rs = M * A_de,rs + (1 - M) * A_dm,rs
In the specific implementation process, training can be performed directly on stereo time-domain audio signals, on stereo frequency-domain signals (left channel real part information, left channel imaginary part information, right channel real part information, right channel imaginary part information), or on stereo frequency-domain parameters (left-right channel energy ratio).
Referring to fig. 3, in the present embodiment, mode detection is performed on an input stereo audio signal, and an up-mixing process is performed adaptively, so as to obtain a 5.1 channel audio signal.
Specifically, movie/music mode detection is performed on the input stereo signal; this may classify the entire content at once or classify it in real time frame by frame. When the stereo audio signal is movie content, processing is performed in mode A, shown in fig. 1: a sound source signal and an environment sound signal are obtained based on the deep learning sound source separation model, a center sound source signal and a bass signal are obtained according to the sound source signal, decorrelation is performed according to the environment sound signal to obtain a left surround sound audio signal and a right surround sound audio signal, and finally the center sound source signal, the bass signal, the left surround sound audio signal, the right surround sound audio signal, the left channel audio signal and the right channel audio signal are combined to obtain a 5.1 channel audio signal.
Specifically, mode detection is performed on an input stereo audio signal, and when the stereo audio signal is music content, processing is performed in a mode B:
determining the music style type of the stereo audio signal, setting the center sound source signal to silence according to the music style type, and performing decorrelation on the stereo audio signal to obtain a left surround sound audio signal and a right surround sound audio signal; and finally, combining the bass signal, the left surround sound audio signal, the right surround sound audio signal, the left channel audio signal and the right channel audio signal to obtain a 5.1 channel audio signal.
Referring to fig. 4, in the present embodiment, the input stereo audio signal is processed in real time based on a neural network to obtain a multi-channel audio signal: the input stereo audio signal is processed directly with a deep learning sound source separation model to obtain the multi-channel audio signal, and according to the mode detection result, if the stereo audio signal is movie content, the multiple output channels are predicted using a deep learning neural network method.
Specifically, a multi-channel audio signal is obtained directly from the stereo audio signal through deep learning sound source separation model processing. If the content is movie content according to the classification mode detection result, a plurality of output channels are predicted using a deep learning neural network method. Since the content may not be accurately classified within a short time, the classification can become more and more accurate as time passes; the output is therefore a weighted average of the mode A and mode B results.
Specifically, in the training process, the inputs are a stereo audio signal and the corresponding multi-channel audio signal: a multi-channel audio signal is reconstructed from the input stereo audio signal, and a loss function is set between the reconstructed multi-channel signal and the original multi-channel audio signal. This embodiment can process the time-domain signal directly, or process the frequency-domain signal after a time-frequency transform.
In summary, an input stereo audio signal is acquired and separated into a sound source signal and an environment sound signal by adopting a deep learning sound source separation model; the sound source signal is separated into a center sound source signal and a bass signal; the environment sound signal is decorrelated to obtain a left surround sound audio signal and a right surround sound audio signal; and the center sound source signal, the bass signal, the left surround sound audio signal, the right surround sound audio signal, and the input left channel audio signal and right channel audio signal are combined to obtain a 5.1 channel audio signal. Because the method processes the input stereo audio signal in real time based on a neural network, the sound source and the environment sound can be effectively distinguished, a multi-channel audio signal is obtained, and the immersive effect is further improved.
Although the invention has been described in detail above with reference to a general description and specific examples, it will be apparent to one skilled in the art that modifications or improvements may be made thereto based on the invention. Accordingly, it is intended that all such modifications and alterations be included within the scope of this invention as defined in the appended claims.

Claims (7)

1. An immersive audio upmixing method comprising the steps of:
acquiring an input stereo audio signal, and separating the stereo audio signal into a sound source signal and an environment sound signal by adopting a deep learning sound source separation model;
separating the sound source signal into a center sound source signal and a bass signal by adopting a deep learning sound source separation model;
performing decorrelation processing on the environment sound signals by adopting a deep learning sound source separation model to obtain left surround sound audio signals and right surround sound audio signals;
acquiring input left channel audio signals and right channel audio signals, and combining the center sound source signals, the bass signals, the left surround sound audio signals, the right surround sound audio signals, the left channel audio signals and the right channel audio signals to obtain 5.1 channel audio signals;
the method also comprises the steps of separating the stereo audio signal into a sound source signal, an environment sound signal and decorrelation parameters by adopting a deep learning sound source separation model;
performing decorrelation processing on the environment sound signals and the decorrelation parameters by adopting a deep learning sound source separation model to obtain left surround sound audio signals and right surround sound audio signals;
the deep learning sound source separation model adopts a U-Net structure; the U-Net structure comprises an encoder part and a decoder part; the encoder and the decoder are connected by a long short-term memory (LSTM) network, and finally a mask is output for sound separation;
the U-Net structure comprises down-sampling processing, used to concentrate the stereo audio signal information, and up-sampling processing, used to restore the stereo audio signal to its input (pixel) resolution;
in the U-Net structure, each down-sampling stage has a skip connection cascaded with the corresponding up-sampling stage.
2. The immersive audio upmixing method of claim 1, wherein the deep learning sound source separation model is trained directly on a stereo time domain audio signal;
or training is carried out according to a stereo frequency domain signal, wherein the stereo frequency domain signal comprises left channel real part information, left channel imaginary part information, right channel real part information and right channel imaginary part information;
or training is carried out according to the frequency domain parameters of the stereo, wherein the frequency domain parameters of the stereo comprise the energy ratio of the left channel and the right channel.
3. The immersive audio upmixing method of claim 1, wherein mode detection is performed on the input stereo audio signal, and when the stereo audio signal is movie content, mode A is used for processing:
the method comprises the steps of obtaining a sound source signal and an environment sound signal based on a deep learning sound source separation model, obtaining a middle sound source signal and a low sound signal according to the sound source signal, performing decorrelation according to the environment sound signal to obtain a left surround sound audio signal and a right surround sound audio signal, and finally combining the middle sound source signal, the low sound signal, the left surround sound audio signal, the right surround sound audio signal, the left sound channel audio signal and the right sound channel audio signal to obtain a 5.1 sound channel audio signal.
4. An immersive audio upmixing method according to claim 3, wherein mode detection is performed on an input stereo audio signal, and when the stereo audio signal is music content, mode B processing is performed:
determining a music style type of the stereo audio signal, setting the center sound source signal to silence according to the music style type, and performing decorrelation on the stereo audio signal to obtain a left surround sound audio signal and a right surround sound audio signal; and finally, combining the bass signal, the left surround sound audio signal, the right surround sound audio signal, the left channel audio signal and the right channel audio signal to obtain a 5.1 channel audio signal.
5. The immersive audio upmixing method of claim 4, wherein for the input stereo audio signal, a deep learning source separation model is applied to directly obtain a multi-channel audio signal;
according to the mode detection result, if the stereo audio signal is movie content, a plurality of output channels are predicted using a deep learning neural network method.
6. An immersive audio upmixing system, comprising:
a first processing module, used for acquiring an input stereo audio signal and separating the stereo audio signal into a sound source signal and an environment sound signal by adopting a deep learning sound source separation model;
a second processing module, used for separating the sound source signal into a center sound source signal and a bass signal by adopting a deep learning sound source separation model;
a third processing module, used for performing decorrelation processing on the environment sound signal by adopting a deep learning sound source separation model to obtain a left surround sound audio signal and a right surround sound audio signal;
an audio merging module, used for acquiring an input left channel audio signal and an input right channel audio signal, and merging the center sound source signal, the bass signal, the left surround sound audio signal, the right surround sound audio signal, the left channel audio signal and the right channel audio signal to obtain a 5.1 channel audio signal;
separating the stereo audio signal into a sound source signal, an environment sound signal and decorrelation parameters by adopting a deep learning sound source separation model;
performing decorrelation processing on the environment sound signals and the decorrelation parameters by adopting a deep learning sound source separation model to obtain left surround sound audio signals and right surround sound audio signals;
the deep learning sound source separation model adopts a U-Net structure; the U-Net structure comprises an encoder part and a decoder part; the encoder and the decoder are connected by a long short-term memory (LSTM) network, and finally a mask is output for sound separation;
the U-Net structure comprises down-sampling processing, used to concentrate the stereo audio signal information, and up-sampling processing, used to restore the stereo audio signal to its input (pixel) resolution;
in the U-Net structure, each down-sampling stage has a skip connection cascaded with the corresponding up-sampling stage.
7. The immersive audio upmixing system of claim 6, wherein the first processing module is further configured to obtain an input stereo audio signal, and separate the stereo audio signal into a sound source signal, an ambient sound signal, and decorrelation parameters using a deep learning sound source separation model;
and the third processing module is also used for performing decorrelation processing on the environment sound signals and the decorrelation parameters by adopting a deep learning sound source separation model to obtain left surround sound audio signals and right surround sound audio signals.
CN202110111130.2A 2021-01-27 2021-01-27 Immersive audio upmixing method and system Active CN112866896B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110111130.2A CN112866896B (en) 2021-01-27 2021-01-27 Immersive audio upmixing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110111130.2A CN112866896B (en) 2021-01-27 2021-01-27 Immersive audio upmixing method and system

Publications (2)

Publication Number Publication Date
CN112866896A (en) 2021-05-28
CN112866896B (en) 2022-07-15

Family

ID=76009551

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110111130.2A Active CN112866896B (en) 2021-01-27 2021-01-27 Immersive audio upmixing method and system

Country Status (1)

Country Link
CN (1) CN112866896B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115691515A (en) * 2022-07-12 2023-02-03 南京拓灵智能科技有限公司 Audio coding and decoding method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018047643A1 (en) * 2016-09-09 2018-03-15 Sony Corporation Device and method for sound source separation, and program
CN111429939A (en) * 2020-02-20 2020-07-17 西安声联科技有限公司 Sound signal separation method of double sound sources and sound pickup

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5082327B2 (en) * 2006-08-09 2012-11-28 Sony Corporation Audio signal processing apparatus, audio signal processing method, and audio signal processing program
KR101567461B1 (en) * 2009-11-16 2015-11-09 삼성전자주식회사 Apparatus for generating multi-channel sound signal
US20150243289A1 (en) * 2012-09-14 2015-08-27 Dolby Laboratories Licensing Corporation Multi-Channel Audio Content Analysis Based Upmix Detection
WO2019099899A1 (en) * 2017-11-17 2019-05-23 Facebook, Inc. Analyzing spatially-sparse data based on submanifold sparse convolutional neural networks

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018047643A1 (en) * 2016-09-09 2018-03-15 Sony Corporation Device and method for sound source separation, and program
CN109661705A (en) * 2016-09-09 2019-04-19 Sony Corporation Sound source separating device and method and program
CN111429939A (en) * 2020-02-20 2020-07-17 西安声联科技有限公司 Sound signal separation method of double sound sources and sound pickup

Also Published As

Publication number Publication date
CN112866896A (en) 2021-05-28


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210824

Address after: Room 960A, floor 9, No. 11, Zhongguancun Street, Haidian District, Beijing 100190

Applicant after: Beijing Tuoling Xinsheng Technology Co.,Ltd.

Address before: Room F12, 14th floor, building B, latte City, 318 Yanta South Road, Qujiang New District, Xi'an City, Shaanxi Province, 710061

Applicant before: Xi'an times Tuoling Technology Co.,Ltd.

Applicant before: BEIJING TUOLING Inc.

GR01 Patent grant