CN113055809A - 5.1 sound channel signal generation method, equipment and medium - Google Patents
- Publication number
- CN113055809A (application CN202110271369.6A)
- Authority
- CN
- China
- Prior art keywords
- signal
- accompaniment
- channel signal
- channel
- frequency spectrum
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S3/00—Systems employing more than two channels, e.g. quadraphonic
- H04S3/008—Systems employing more than two channels, e.g. quadraphonic in which the audio signals are in digital form, i.e. employing more than two discrete digital channels
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H1/00—Details of electrophonic musical instruments
- G10H1/0008—Associated control or indicating means
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2250/00—Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
- G10H2250/055—Filters for musical processing or musical effects; Filter responses, filter architecture, filter coefficients or control parameters therefor
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2250/00—Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
- G10H2250/311—Neural networks for electrophonic musical instruments or musical processing, e.g. for musical recognition or control, automatic composition or improvisation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S2400/00—Details of stereophonic systems covered by H04S but not provided for in its groups
- H04S2400/01—Multi-channel, i.e. more than two input channels, sound reproduction with two speakers wherein the multi-channel information is substantially preserved
Abstract
The application discloses a method, a device, and a medium for generating a 5.1-channel signal, comprising the following steps: separating the accompaniment signals in a stereo signal to obtain a first accompaniment signal in the left channel signal and a second accompaniment signal in the right channel signal; generating a left surround channel signal and a right surround channel signal of the 5.1-channel signal based on the first accompaniment signal and the second accompaniment signal, respectively; generating a center channel signal and a bass channel signal of the 5.1-channel signal based on the left channel signal and the right channel signal; and determining the left channel signal as the front left channel signal of the 5.1-channel signal and the right channel signal as the front right channel signal of the 5.1-channel signal. In this way, the left surround channel signal is generated from the accompaniment signal separated from the left channel of the stereo signal, and the right surround channel signal from the accompaniment signal separated from the right channel, so that the various instrument components of the stereo signal, as well as the uncorrelatedness of the left and right channel signals, are well preserved.
Description
Technical Field
The present application relates to the field of audio processing technologies, and in particular, to a method, a device, and a medium for generating a 5.1 channel signal.
Background
Currently, 5.1-channel audio is widely used in traditional and home cinemas because of its good auditory effect, yet for cost reasons most existing sound sources are stereo. To convert a stereo signal into a 5.1-channel signal, the prior art generally generates the left and right surround channels from the difference of the left and right channel signals. This has two drawbacks. On the one hand, taking the difference of the left and right channel signals cancels in-phase instrument sounds along with the vocals, so information is lost from the resulting surround channels. On the other hand, the difference signal is a single-channel signal, from which the two surround channels are then obtained by decorrelation; the resulting left and right surround channels remain correlated with each other, and the uncorrelatedness of the original stereo left and right channel signals is not preserved. In summary, in the process of implementing the present invention, the inventors found that the left and right surround channel information in a 5.1-channel signal generated from a stereo signal by the prior art is deficient, and that the uncorrelatedness of the original stereo left and right channel signals is not preserved.
Disclosure of Invention
In view of the above, an object of the present application is to provide a method, a device, and a medium for generating a 5.1-channel signal that better preserve the various instrument components of a stereo signal and the uncorrelatedness of the left and right channel signals. The specific scheme is as follows:
in a first aspect, the present application discloses a 5.1 channel signal generating method, including:
separating the accompaniment signals in the stereo signals to obtain a first accompaniment signal in the left channel signal and a second accompaniment signal in the right channel signal;
generating a left surround channel signal and a right surround channel signal of a 5.1 channel signal based on the first accompaniment signal and the second accompaniment signal, respectively;
generating a center channel signal and a bass channel signal of the 5.1 channel signal based on the left channel signal and the right channel signal;
determining the left channel signal as a left front channel signal of the 5.1 channel signal and determining the right channel signal as a right front channel signal of the 5.1 channel signal.
Optionally, the separating the accompaniment signals in the stereo signal to obtain a first accompaniment signal in the left channel signal and a second accompaniment signal in the right channel signal includes:
acquiring a target accompaniment separation model;
and separating the accompaniment signals in the stereo signals by using the target accompaniment separation model to obtain a first accompaniment signal in the left channel signal and a second accompaniment signal in the right channel signal.
Optionally, the obtaining the target accompaniment separation model includes:
acquiring a left channel frequency spectrum characteristic corresponding to the left channel signal and a right channel frequency spectrum characteristic corresponding to the right channel signal in each piece of stereo audio data in a stereo audio data set;
inputting the left channel frequency spectrum characteristic and the right channel frequency spectrum characteristic of each stereo audio data into a preset neural network model for training, and determining a corresponding target loss parameter in the training process; the preset neural network model at least comprises an accompaniment separation model;
and when the target loss parameters are converged, determining the current accompaniment separation model as a target accompaniment separation model.
Optionally, the obtaining a left channel spectrum feature corresponding to a left channel signal and a right channel spectrum feature corresponding to a right channel signal in each stereo audio data in the stereo audio data set includes:
performing subband decomposition on each stereo audio data in the stereo audio data set to obtain a plurality of first subband signals corresponding to a left channel signal and a plurality of second subband signals corresponding to a right channel signal in each stereo audio data;
obtaining a plurality of first sub-band frequency spectrums corresponding to a plurality of first sub-band signals to obtain the frequency spectrum characteristics of the left channel;
and acquiring a plurality of second sub-band frequency spectrums corresponding to the plurality of second sub-band signals to obtain the right channel frequency spectrum characteristic.
Optionally, the inputting the left channel spectrum feature and the right channel spectrum feature of each piece of stereo audio data into a preset neural network model for training, and determining a corresponding target loss parameter in a training process includes:
inputting the left channel frequency spectrum feature and the right channel frequency spectrum feature of each stereo audio data into an accompaniment separation model for training to obtain a first training frequency spectrum feature corresponding to the left channel frequency spectrum feature and a second training frequency spectrum feature corresponding to the right channel frequency spectrum feature;
determining a left channel accompaniment signal based on the first training spectral features and determining a right channel accompaniment signal based on the second training spectral features;
and determining a loss parameter for measuring the accompaniment loss based on the left channel accompaniment signal and the right channel accompaniment signal, and determining the loss parameter as a target loss parameter.
Optionally, the inputting the left channel spectrum feature and the right channel spectrum feature of each piece of stereo audio data into a preset neural network model for training, and determining a corresponding loss parameter in a training process includes:
inputting the left channel frequency spectrum feature and the right channel frequency spectrum feature of each stereo audio data into an accompaniment separation model for training to obtain a first training frequency spectrum feature corresponding to the left channel frequency spectrum feature and a second training frequency spectrum feature corresponding to the right channel frequency spectrum feature;
determining a left channel accompaniment signal based on the first training spectral features and determining a right channel accompaniment signal based on the second training spectral features;
determining a first loss parameter for measuring an accompaniment loss based on the left channel accompaniment signal and the right channel accompaniment signal;
inputting the left channel frequency spectrum feature and the right channel frequency spectrum feature of each stereo audio data into a human voice separation model for training to obtain a third training frequency spectrum feature corresponding to the left channel frequency spectrum feature and a fourth training frequency spectrum feature corresponding to the right channel frequency spectrum feature;
determining a left channel vocal signal based on the third training spectral feature and determining a right channel vocal signal based on the fourth training spectral feature;
determining a second loss parameter for measuring the voice loss based on the left channel voice signal and the right channel voice signal;
determining a third loss parameter for measuring the overall loss of stereo audio data based on the left channel accompaniment signal, the right channel accompaniment signal, the left channel vocal signal and the right channel vocal signal;
determining a sum of the first loss parameter, the second loss parameter, and the third loss parameter as a target loss parameter.
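A minimal sketch of this combined objective — assuming, for illustration, mean-squared-error losses and that accompaniment plus vocals should reconstruct the original mixture in each channel (the patent does not specify the loss functions, and `target_loss` is a hypothetical name) — might look like:

```python
import numpy as np

def mse(est, ref):
    """Mean-squared error between an estimated and a reference signal."""
    return float(np.mean((est - ref) ** 2))

def target_loss(acc_est_l, acc_est_r, voc_est_l, voc_est_r,
                acc_ref_l, acc_ref_r, voc_ref_l, voc_ref_r):
    """Sum of the three loss parameters described in the text."""
    # First loss parameter: accompaniment loss over both channels.
    l1 = mse(acc_est_l, acc_ref_l) + mse(acc_est_r, acc_ref_r)
    # Second loss parameter: vocal loss over both channels.
    l2 = mse(voc_est_l, voc_ref_l) + mse(voc_est_r, voc_ref_r)
    # Third loss parameter: overall loss — the separated accompaniment and
    # vocals together should reconstruct the original mixture per channel.
    l3 = (mse(acc_est_l + voc_est_l, acc_ref_l + voc_ref_l)
          + mse(acc_est_r + voc_est_r, acc_ref_r + voc_ref_r))
    return l1 + l2 + l3
```

With perfect estimates the target loss is zero, and any accompaniment/vocal confusion that still reconstructs the mixture is penalized only by the first two terms.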
Optionally, the method further includes:
and if the target loss parameter is not converged, updating the accompaniment separation model by using the first loss parameter and the third loss parameter, and updating the voice separation model by using the second loss parameter and the third loss parameter.
Optionally, the separating the accompaniment signals in the stereo signal by using the target accompaniment separation model to obtain a first accompaniment signal in the left channel signal and a second accompaniment signal in the right channel signal includes:
performing subband decomposition on the stereo signal to obtain a plurality of third subband signals corresponding to a left channel signal and a plurality of fourth subband signals corresponding to a right channel signal in the stereo signal;
acquiring a plurality of third sub-band frequency spectrums corresponding to the plurality of third sub-band signals and a plurality of fourth sub-band frequency spectrums corresponding to the plurality of fourth sub-band signals;
splicing a plurality of third sub-band frequency spectrums to obtain a first model input frequency spectrum characteristic, and splicing a plurality of fourth sub-band frequency spectrums to obtain a second model input frequency spectrum characteristic;
inputting the first model input spectrum feature and the second model input spectrum feature into the target accompaniment separation model to obtain a first model output spectrum feature corresponding to the first model input spectrum feature and a second model output spectrum feature corresponding to the second model input spectrum feature, both output by the target accompaniment separation model;
splitting the first model output spectral features into a plurality of first output sub-band spectral features and splitting the second model output spectral features into a plurality of second output sub-band spectral features;
determining a first accompaniment signal in a left channel signal using a plurality of said first output sub-band spectral features and determining a second accompaniment signal in a right channel signal using a plurality of said second output sub-band spectral features.
Optionally, the generating a left surround channel signal and a right surround channel signal of a 5.1 channel signal based on the first accompaniment signal and the second accompaniment signal, respectively, includes:
performing time delay processing on the first accompaniment signal and the second accompaniment signal respectively; wherein the delay time of the delay processing is not greater than a preset delay time threshold;
and respectively processing the delayed first accompaniment signal and the delayed second accompaniment signal by utilizing a band-pass filter to obtain a left surround sound channel signal and a right surround sound channel signal of the 5.1 sound channel signal.
In a second aspect, the present application discloses a 5.1 channel signal generating apparatus, comprising:
the accompaniment signal separation module is used for separating the accompaniment signals in the stereo signals to obtain a first accompaniment signal in the left channel signal and a second accompaniment signal in the right channel signal;
a surround channel signal generating module for generating a left surround channel signal and a right surround channel signal of a 5.1 channel signal based on the first accompaniment signal and the second accompaniment signal, respectively;
a center channel signal generating module, configured to generate a center channel signal of the 5.1 channel signal based on the left channel signal and the right channel signal;
a bass channel signal generating module, configured to generate a bass channel signal of the 5.1 channel signal based on the left channel signal and the right channel signal;
a front left channel signal determining module, configured to determine the left channel signal as the front left channel signal of the 5.1 channel signal; and
A right front channel signal determining module for determining the right channel signal as a right front channel signal of the 5.1 channel signal.
In a third aspect, the present application discloses an electronic device, comprising:
a memory for storing a computer program;
a processor for executing the computer program to implement the aforementioned 5.1-channel signal generation method.
In a fourth aspect, the present application discloses a computer-readable storage medium storing a computer program which, when executed by a processor, implements the aforementioned 5.1-channel signal generation method.
It can be seen that the present application separates the accompaniment signals in a stereo signal to obtain a first accompaniment signal in the left channel signal and a second accompaniment signal in the right channel signal; generates a left surround channel signal and a right surround channel signal of a 5.1-channel signal based on the first and second accompaniment signals, respectively; generates a center channel signal and a bass channel signal of the 5.1-channel signal based on the left and right channel signals; and determines the left channel signal as the front left channel signal and the right channel signal as the front right channel signal of the 5.1-channel signal. That is, the present application generates the left surround channel from the first accompaniment signal separated from the left channel of the stereo signal, and the right surround channel from the second accompaniment signal separated from the right channel, so that the various instrument components of the stereo signal and the uncorrelatedness of the left and right channel signals are well preserved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions of the prior art, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only embodiments of the present application; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a system framework for which the 5.1 channel signal generation scheme disclosed herein is applicable;
FIG. 2 is a flow chart of a method for generating a 5.1 channel signal according to the present disclosure;
FIG. 3 is a flow chart of a specific 5.1 channel signal generation method disclosed herein;
FIG. 4 is a flow chart of a specific 5.1 channel signal generation method disclosed herein;
FIG. 5 is a flow chart illustrating training of an embodiment of an accompaniment separation model according to the present disclosure;
FIG. 6 is a schematic diagram of a specific neural network model structure disclosed in the present application;
FIG. 7 is a schematic diagram of an exemplary training accompaniment separation model according to the present disclosure;
FIG. 8 is a flow chart illustrating training of an embodiment of an accompaniment separation model according to the present disclosure;
fig. 9 is a schematic structural diagram of a 5.1 channel signal generating apparatus disclosed in the present application;
fig. 10 is a block diagram of an electronic device disclosed in the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art from these embodiments without creative effort shall fall within the protection scope of the present application.
Currently, in order to convert a stereo signal into a 5.1-channel signal, the prior art generally generates the left and right surround channels from the difference of the left and right channel signals. On the one hand, taking this difference cancels in-phase instrument sounds along with the vocals, so information is lost from the resulting surround channels; on the other hand, the difference signal is a single-channel signal, from which the two surround channels are then obtained by decorrelation, yet the resulting left and right surround channels remain correlated with each other, and the uncorrelatedness of the original stereo left and right channel signals is not preserved. In order to overcome these technical problems, the present application provides a 5.1-channel signal generation scheme that better preserves the various instrument components of a stereo signal and the uncorrelatedness of the left and right channel signals.
In the 5.1-channel signal generation scheme of the present application, the system framework adopted may specifically refer to fig. 1 and may specifically include: a computer device 01 and a number of audio playback devices 02 that establish a communication connection with the computer device 01. The audio playback devices 02 may include speaker systems (such as a home theater), earphones, user terminals, and the like; the computer device 01 may be a PC, a server, etc.
In the present application, the computer device 01 may separate the accompaniment signals in a stereo signal to obtain a first accompaniment signal in the left channel signal and a second accompaniment signal in the right channel signal; generate a left surround channel signal and a right surround channel signal of a 5.1-channel signal based on the first and second accompaniment signals, respectively; generate a center channel signal and a bass channel signal of the 5.1-channel signal based on the left and right channel signals; and determine the left channel signal as the front left channel signal and the right channel signal as the front right channel signal of the 5.1-channel signal. The resulting 5.1-channel signal is then transmitted through a sound card and power amplifier to a speaker system comprising a 5.1 loudspeaker array for playback, or is first processed with a surround-sound effect for earphone or terminal-loudspeaker playback and then played through the earphone or terminal loudspeaker.
Referring to fig. 2, an embodiment of the present application discloses a 5.1 channel signal generating method, including:
step S11: the accompaniment signals in the stereo signals are separated to obtain a first accompaniment signal in the left channel signal and a second accompaniment signal in the right channel signal.
Step S12: generating a left surround channel signal and a right surround channel signal of a 5.1 channel signal based on the first accompaniment signal and the second accompaniment signal, respectively.
In a specific embodiment, the first accompaniment signal and the second accompaniment signal may be respectively subjected to a time-delay process; wherein the delay time of the delay processing is not greater than a preset delay time threshold; and respectively processing the delayed first accompaniment signal and the delayed second accompaniment signal by utilizing a band-pass filter to obtain a left surround sound channel signal and a right surround sound channel signal of the 5.1 sound channel signal.
Specifically, in this embodiment, the delay time may be 20 ms, the preset delay time threshold may be 50 ms, and the passband of the band-pass filter may be [100 Hz, 2000 Hz].
It should be noted that the surround channels simulate ambient reflected sound, which arrives delayed and with its high frequencies absorbed. The embodiment of the present application therefore delays the accompaniment signal by about 20 ms, and by no more than 50 ms, so that the surround channels lag the front left and right channels without producing an audible echo. The band-pass filter likewise simulates the reflected sound: surround channels contain no low-frequency components, so the low cut-off frequency may be set to 100 Hz, and the high cut-off frequency may be set to 2000 Hz to simulate high-frequency absorption. Of course, the passband of the band-pass filter can be adjusted to the actual situation; for example, if a song has prominent high-frequency content that would sound obtrusive during playback, the high cut-off frequency may be lowered to improve the listening effect.
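The delay-and-band-pass processing above can be sketched as follows. The patent only specifies the delay (about 20 ms, at most 50 ms) and the passband ([100 Hz, 2000 Hz]); the 4th-order Butterworth design and the function name `make_surround` are illustrative assumptions.

```python
import numpy as np
from scipy.signal import butter, lfilter

def make_surround(accompaniment, fs, delay_ms=20.0,
                  low_hz=100.0, high_hz=2000.0):
    """Delay an accompaniment channel and band-pass it to simulate the
    ambient reflected sound of a surround channel."""
    # Delay by prepending zeros; keep the delay below ~50 ms so the
    # surround channel lags the front channels without an audible echo.
    n_delay = int(round(fs * delay_ms / 1000.0))
    delayed = np.concatenate([np.zeros(n_delay), accompaniment])[:len(accompaniment)]
    # Band-pass: cut lows (<100 Hz, surround carries no low frequencies)
    # and highs (>2 kHz, simulating high-frequency absorption).
    b, a = butter(4, [low_hz, high_hz], btype="bandpass", fs=fs)
    return lfilter(b, a, delayed)
```

Applying `make_surround` to the first and second accompaniment signals yields the left and right surround channels, respectively.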
Step S13: generating a center channel signal and a bass channel signal of the 5.1 channel signal based on the left channel signal and the right channel signal.
In a specific embodiment, the left channel signal and the right channel signal may be summed to obtain a sum signal; the sum signal is attenuated by 3 dB to obtain the center channel signal, and attenuated by 6 dB and then input to a low-pass filter whose output is taken as the bass channel signal. The cut-off frequency of the low-pass filter may be about 100 Hz and can be adjusted according to the actual situation.
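A sketch of this center/bass derivation under the stated parameters (−3 dB, −6 dB, ~100 Hz low-pass); the 4th-order Butterworth design and the name `make_center_and_bass` are assumptions, not specified by the patent.

```python
import numpy as np
from scipy.signal import butter, lfilter

def make_center_and_bass(left, right, fs, lpf_cutoff_hz=100.0):
    """Derive the center and bass channels from the left/right signals."""
    mix = left + right                  # sum signal
    center = mix * 10 ** (-3 / 20)      # attenuate by 3 dB
    attenuated = mix * 10 ** (-6 / 20)  # attenuate by 6 dB
    # Low-pass at ~100 Hz keeps only the bass content for the bass channel.
    b, a = butter(4, lpf_cutoff_hz, btype="low", fs=fs)
    bass = lfilter(b, a, attenuated)
    return center, bass
```

The dB attenuations are implemented as amplitude factors 10^(−dB/20), i.e. about 0.708 for −3 dB and 0.501 for −6 dB.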
Step S14: determining the left channel signal as a left front channel signal of the 5.1 channel signal and determining the right channel signal as a right front channel signal of the 5.1 channel signal.
That is, in the embodiment of the present application, the left channel signal in the stereo signal is directly used as the left front channel signal of the 5.1 channel signal, and the right channel signal is used as the right front channel signal of the 5.1 channel signal.
For example, referring to fig. 3, fig. 3 is a flowchart of a specific method for generating a 5.1 channel signal according to an embodiment of the present disclosure.
It can be seen that, in the embodiments of the present application, the accompaniment signals in a stereo signal are separated to obtain a first accompaniment signal in the left channel signal and a second accompaniment signal in the right channel signal; a left surround channel signal and a right surround channel signal of a 5.1-channel signal are generated based on the first and second accompaniment signals, respectively; a center channel signal and a bass channel signal of the 5.1-channel signal are generated based on the left and right channel signals; the left channel signal is determined as the front left channel signal and the right channel signal as the front right channel signal of the 5.1-channel signal. That is, the present application generates the left surround channel from the first accompaniment signal separated from the left channel of the stereo signal, and the right surround channel from the second accompaniment signal separated from the right channel, so that the various instrument components of the stereo signal and the uncorrelatedness of the left and right channel signals are well preserved.
Referring to fig. 4, an embodiment of the present application discloses a specific 5.1 channel signal generating method, including:
step S21: and acquiring a target accompaniment separation model.
Referring to fig. 5, fig. 5 is a flowchart of training a specific accompaniment separation model disclosed in an embodiment of the present application; the training includes:
step S31: and acquiring a left channel frequency spectrum characteristic corresponding to a left channel signal and a right channel frequency spectrum characteristic corresponding to a right channel signal in each stereo audio data set.
In a specific implementation manner, in the embodiment of the present application, subband decomposition may be performed on each piece of stereo audio data in the stereo audio data set to obtain a plurality of first subband signals corresponding to the left channel signal and a plurality of second subband signals corresponding to the right channel signal in each piece of stereo audio data; a plurality of first subband spectra corresponding to the plurality of first subband signals are obtained to form the left channel frequency spectrum characteristic; and a plurality of second subband spectra corresponding to the plurality of second subband signals are obtained to form the right channel frequency spectrum characteristic.
The first subband signal and the second subband signal are subband signals obtained by performing subband decomposition on stereo audio data through an analysis filter.
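An illustrative analysis stage might look like the following. The patent's analysis filters are likely a QMF/polyphase filter bank, so this uniform Butterworth band-pass bank is only a stand-in, and the names `analysis_filterbank` and `subband_spectral_features` are hypothetical.

```python
import numpy as np
from scipy.signal import butter, sosfilt, stft

def analysis_filterbank(x, fs, n_bands=8):
    """Split x into n_bands uniform subband (time-domain) signals."""
    edges = np.linspace(0.0, fs / 2, n_bands + 1)
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Keep band edges strictly inside (0, fs/2) for butter().
        lo = max(lo, 1.0)
        hi = min(hi, fs / 2 - 1.0)
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        bands.append(sosfilt(sos, x))
    return bands

def subband_spectral_features(x, fs, n_bands=8, nperseg=256):
    """STFT magnitude of every subband, stacked into one feature array."""
    specs = [np.abs(stft(s, fs=fs, nperseg=nperseg)[2])
             for s in analysis_filterbank(x, fs, n_bands)]
    return np.concatenate(specs, axis=0)  # shape: (n_bands * bins, frames)
```

Running the left and right channel signals through `subband_spectral_features` would give the left and right channel spectral features, respectively.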
Step S32: inputting the left channel frequency spectrum characteristic and the right channel frequency spectrum characteristic of each stereo audio data into a preset neural network model for training, and determining a corresponding target loss parameter in the training process; the preset neural network model at least comprises an accompaniment separation model.
For example, referring to fig. 6, fig. 6 is a schematic structural diagram of a specific neural network model disclosed in an embodiment of the present application. The preset neural network model comprises an input convolutional layer (CNN Layer), an odd number of convolution blocks (CNN Block), a plurality of down-sampling layers (Down Sampling Layer), and a plurality of up-sampling layers (Up Sampling Layer). The input of a first preset convolution block among the odd number of convolution blocks includes both the output feature of a second preset convolution block and the feature obtained by processing that output through several convolution blocks, down-sampling layers, and up-sampling layers. That is, direct (skip) connections are added between the CNN blocks, so that the masks or spectral features output by the neural network model fuse information of different sizes and different levels. The input and output dimensions of the preset neural network are the same.
Of course, in other embodiments, neural network models of other structures may be used for training.
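The idea of direct connections between convolution blocks can be illustrated with a toy one-dimensional sketch; every function here is a simplified stand-in, not the patent's actual layers.

```python
import numpy as np

def conv_block(x):
    """Stand-in for a CNN block: a simple 3-tap moving average."""
    k = np.array([0.25, 0.5, 0.25])
    return np.convolve(x, k, mode="same")

def downsample(x):
    """Average-pool by a factor of 2."""
    return x.reshape(-1, 2).mean(axis=1)

def upsample(x):
    """Nearest-neighbour upsample by a factor of 2."""
    return np.repeat(x, 2)

def skip_connected_model(x):
    """U-Net-style flow with a direct (skip) connection: the final block's
    input fuses the early block's output with the feature produced by the
    down/up-sampling path, combining information of different scales.
    Input and output dimensions are the same."""
    early = conv_block(x)                            # first CNN block
    deep = upsample(conv_block(downsample(early)))   # multi-scale path
    return conv_block(early + deep)                  # fusion via direct connection
```

The fused feature keeps the input's dimensionality, matching the requirement that the preset network's input and output dimensions be the same.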
In a specific embodiment, the left channel spectrum feature and the right channel spectrum feature of each piece of stereo audio data may be input to an accompaniment separation model for training, so as to obtain a first training spectrum feature corresponding to the left channel spectrum feature and a second training spectrum feature corresponding to the right channel spectrum feature; determining a left channel accompaniment signal based on the first training spectral features and determining a right channel accompaniment signal based on the second training spectral features; and determining a loss parameter for measuring the accompaniment loss based on the left channel accompaniment signal and the right channel accompaniment signal, and determining the loss parameter as a target loss parameter.
In some embodiments, the network model training is performed based on a mapping method, and then the output of the neural network model is the training spectrum feature. In other embodiments, the network model training is performed based on a mask method, and the output of the neural network model is multiplied by the input to obtain the training spectrum features.
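The two output conventions can be sketched as follows. This is a minimal illustration with toy arrays; `training_features` is a hypothetical helper name, not from the patent:

```python
import numpy as np

def training_features(model_output, model_input, method="mask"):
    """Derive the training spectrum features from the network's raw output.

    mapping: the network output is used directly as the spectrum feature.
    mask:    the network output is a mask, multiplied element-wise with
             the input spectrum feature.
    """
    if method == "mapping":
        return model_output
    if method == "mask":
        return model_output * model_input
    raise ValueError(f"unknown method: {method!r}")

spec_in = np.array([4.0, 2.0, 0.0])   # toy input spectrum feature
mask = np.array([0.5, 1.0, 1.0])      # toy mask output by the network
masked = training_features(mask, spec_in, "mask")        # -> [2.0, 2.0, 0.0]
mapped = training_features(spec_in, spec_in, "mapping")  # output used directly
```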
For example, suppose the stereo audio data is audio data of 1 s duration. Eight subband signals of the left channel signal are obtained through analysis-filter decomposition; these subband signals are time domain signals. Short-time Fourier transform is performed on each of the 8 subband signals to obtain 8 corresponding subband spectrums, and the 8 subband spectrums are combined into one vector to obtain the left channel spectrum feature corresponding to the left channel signal; the right channel spectrum feature is obtained in the same way. The left channel spectrum feature and the right channel spectrum feature are input into the accompaniment separation model, which outputs a first training spectrum feature corresponding to the left channel spectrum feature and a second training spectrum feature corresponding to the right channel spectrum feature. Inverse short-time Fourier transform is then performed on each subband feature in the first and second training spectrum features; after the inverse transform, signal overlap-add is carried out according to the overlap parameters used during the short-time Fourier transform to obtain the time domain signal corresponding to each subband feature. The time domain signals are then up-sampled, and the separated left channel accompaniment signal and right channel accompaniment signal are obtained through a synthesis filter.
Step S33: and when the target loss parameters are converged, determining the current accompaniment separation model as a target accompaniment separation model.
For example, referring to fig. 7, fig. 7 is a schematic diagram of a specific training of an accompaniment separation model disclosed in the embodiment of the present application, wherein x(n) represents a left channel signal or a right channel signal, H1(z), …, HL(z) denote the analysis filters, x1(n), …, xL(n) denote the subband signals, and x1(Ln), …, xL(Ln) represent the subband signals after down-sampling. The down-sampled subband signals are subjected to STFT (short-time Fourier transform) to obtain the subband spectrums Y1(f/L, t), …, YL(f/L, t). The subband spectrums are spliced into one vector to obtain a multi-dimensional spectrum feature, which is input into the neural network model to obtain a training spectrum feature of the same dimensionality; the training spectrum feature is split back into subband features Y1'(f/L, t), …, YL'(f/L, t). ISTFT (inverse short-time Fourier transform) is then performed, followed by L-times up-sampling to obtain the subband signals y1,E(n), …, yL,E(n). F1(z), …, FL(z) denote the synthesis filters, and y(n) is the synthesized left channel accompaniment signal or right channel accompaniment signal.
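The STFT/overlap-add step at the heart of fig. 7 can be sketched as follows. This is a minimal illustration of one subband path, assuming a periodic Hann window at 50% overlap; the analysis/synthesis filter bank and the down/up-sampling stages are omitted:

```python
import numpy as np

def stft(x, n_fft=256, hop=128):
    """Windowed short-time Fourier transform of a real 1-D signal."""
    # periodic Hann window; with hop = n_fft/2 it satisfies the COLA condition
    w = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n_fft) / n_fft)
    frames = [x[i:i + n_fft] * w for i in range(0, len(x) - n_fft + 1, hop)]
    return np.stack([np.fft.rfft(f) for f in frames])

def istft(spec, n_fft=256, hop=128):
    """Inverse STFT via overlap-add with window-squared normalization."""
    w = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n_fft) / n_fft)
    n = hop * (len(spec) - 1) + n_fft
    y = np.zeros(n)
    wsum = np.zeros(n)
    for i, f in enumerate(spec):
        frame = np.fft.irfft(f, n_fft)
        y[i * hop:i * hop + n_fft] += frame * w   # overlap-add, same hop as stft
        wsum[i * hop:i * hop + n_fft] += w ** 2
    nz = wsum > 1e-8
    y[nz] /= wsum[nz]                             # undo the analysis/synthesis windows
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
y = istft(stft(x))        # interior samples reconstruct x exactly
```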
Step S22: and separating the accompaniment signals in the stereo signals by using the target accompaniment separation model to obtain a first accompaniment signal in the left channel signal and a second accompaniment signal in the right channel signal.
In a specific implementation manner, this embodiment may perform subband decomposition on the stereo signal to obtain a plurality of third subband signals corresponding to the left channel signal and a plurality of fourth subband signals corresponding to the right channel signal in the stereo signal; acquire a plurality of third sub-band frequency spectrums corresponding to the plurality of third sub-band signals and a plurality of fourth sub-band frequency spectrums corresponding to the plurality of fourth sub-band signals; splice the plurality of third sub-band frequency spectrums to obtain a first model input spectrum feature, and splice the plurality of fourth sub-band frequency spectrums to obtain a second model input spectrum feature; input the first model input spectrum feature and the second model input spectrum feature into the target accompaniment separation model to obtain a first model output spectrum feature corresponding to the first model input spectrum feature and a second model output spectrum feature corresponding to the second model input spectrum feature, both output by the target accompaniment separation model; split the first model output spectrum feature into a plurality of first output sub-band spectral features and the second model output spectrum feature into a plurality of second output sub-band spectral features; and determine a first accompaniment signal in the left channel signal using the plurality of first output sub-band spectral features and a second accompaniment signal in the right channel signal using the plurality of second output sub-band spectral features.
Specifically, short-time Fourier transform is performed on the down-sampled subband signals to obtain the plurality of third sub-band frequency spectrums corresponding to the plurality of third sub-band signals and the plurality of fourth sub-band frequency spectrums corresponding to the plurality of fourth sub-band signals. Short-time inverse Fourier transform is performed on the plurality of first output sub-band spectral features and the plurality of second output sub-band spectral features to obtain a plurality of first time domain signals corresponding to the first output sub-band spectral features and a plurality of second time domain signals corresponding to the second output sub-band spectral features; the first and second time domain signals are then up-sampled and passed through synthesis filters to obtain the corresponding first accompaniment signal and second accompaniment signal, respectively. The specific process can be referred to fig. 7.
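The splicing and splitting around the model call can be sketched as follows; the shapes are illustrative, and the identity assignment stands in for the actual target accompaniment separation model:

```python
import numpy as np

n_bands, frames, bins = 8, 10, 33   # e.g. 8 sub-bands, rfft of a 64-point frame
rng = np.random.default_rng(0)
subband_specs = [rng.standard_normal((frames, bins)) for _ in range(n_bands)]

# Splice the sub-band spectrums along the bin axis into one model input feature.
model_input = np.concatenate(subband_specs, axis=1)   # shape (frames, n_bands*bins)

# ... model_output = target_accompaniment_model(model_input) would go here ...
model_output = model_input                            # identity stand-in

# Split the model output back into per-sub-band spectral features
# (same dimensionality as the input, as the patent requires).
out_subbands = np.split(model_output, n_bands, axis=1)
```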
Step S23: generating a left surround channel signal and a right surround channel signal of a 5.1 channel signal based on the first accompaniment signal and the second accompaniment signal, respectively.
Step S24: generating a center channel signal and a bass channel signal of the 5.1 channel signal based on the left channel signal and the right channel signal.
Step S25: determining the left channel signal as a left front channel signal of the 5.1 channel signal and determining the right channel signal as a right front channel signal of the 5.1 channel signal.
For the specific processes of the steps S23 to S25, reference may be made to the corresponding contents disclosed in the foregoing embodiments, and details are not repeated here.
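For step S24, whose details the text defers to the foregoing embodiments, a common approach is to sum the front channels for the center channel and low-pass the same mix for the bass channel. The following sketch assumes a 0.5·(L+R) center downmix, a 120 Hz cutoff, and a windowed-sinc filter length — all assumed values, not fixed by the patent:

```python
import numpy as np

def center_and_bass(left, right, sr=44100, cutoff_hz=120, taps=255):
    """Hypothetical center/bass derivation: average the front channels for
    the center channel, then low-pass the same mix for the bass channel."""
    mid = 0.5 * (left + right)
    # Hamming-windowed sinc low-pass FIR, normalized to unit DC gain
    n = np.arange(taps) - (taps - 1) / 2
    h = np.sinc(2.0 * cutoff_hz / sr * n) * np.hamming(taps)
    h /= h.sum()
    bass = np.convolve(mid, h, mode="same")
    return mid, bass

# A constant (DC) input passes the low-pass unchanged; a 5 kHz tone is rejected.
mid_dc, bass_dc = center_and_bass(np.ones(2048), np.ones(2048))
tone = np.cos(2 * np.pi * 5000.0 * np.arange(4096) / 44100)
_, bass_hf = center_and_bass(tone, tone)
```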
It should be pointed out that performing subband decomposition on the stereo audio data and training on the resulting subband spectrum features has two benefits: on the one hand, it improves frequency resolution, so that richer details of the training data can be obtained during model training, which improves the accuracy of model feature extraction and, in turn, the model performance; on the other hand, the size of each subband spectrum is smaller, which improves the computational efficiency of model training.
Referring to fig. 8, an embodiment of the present application discloses a specific training method of an accompaniment separation model, including:
step S401: and acquiring a left channel frequency spectrum characteristic corresponding to a left channel signal and a right channel frequency spectrum characteristic corresponding to a right channel signal in each stereo audio data set.
For the specific process of the step S401, reference may be made to corresponding contents disclosed in the foregoing embodiments, and details are not repeated here.
Step S402: inputting the left channel frequency spectrum feature and the right channel frequency spectrum feature of each stereo audio data into an accompaniment separation model for training to obtain a first training frequency spectrum feature corresponding to the left channel frequency spectrum feature and a second training frequency spectrum feature corresponding to the right channel frequency spectrum feature.
Step S403: determining a left channel accompaniment signal based on the first training spectral features and determining a right channel accompaniment signal based on the second training spectral features.
Step S404: a first loss parameter for measuring an accompaniment loss is determined based on the left channel accompaniment signal and the right channel accompaniment signal.
Step S405: inputting the left channel frequency spectrum feature and the right channel frequency spectrum feature of each stereo audio data into a human voice separation model for training, and obtaining a third training frequency spectrum feature corresponding to the left channel frequency spectrum feature and a fourth training frequency spectrum feature corresponding to the right channel frequency spectrum feature.
In a specific embodiment, the above steps S402 and S405 may be performed simultaneously, that is, the present embodiment may input the left channel spectrum feature and the right channel spectrum feature into the accompaniment separation model and the vocal separation model for training simultaneously.
Step S406: determining a left channel vocal signal based on the third training spectral feature, and determining a right channel vocal signal based on the fourth training spectral feature.
Step S407: and determining a second loss parameter for measuring the voice loss based on the left channel voice signal and the right channel voice signal.
Step S408: and determining a third loss parameter for measuring the overall loss of stereo audio data based on the left channel accompaniment signal, the right channel accompaniment signal, the left channel vocal signal and the right channel vocal signal.
In a specific embodiment, the left channel accompaniment signal, the right channel accompaniment signal, the left channel vocal signal and the right channel vocal signal may be synthesized into a stereo audio, and the third loss parameter for measuring the overall loss of the stereo audio data may be determined based on the stereo audio.
Step S409: determining a sum of the first loss parameter, the second loss parameter, and the third loss parameter as a target loss parameter.
Step S410: and when the target loss parameters are converged, determining the current accompaniment separation model as a target accompaniment separation model.
In a specific embodiment, if the target loss parameter does not converge, the accompaniment separation model is updated using the first loss parameter and the third loss parameter, and the vocal separation model is updated using the second loss parameter and the third loss parameter.
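The combination of the three loss parameters in steps S404 to S409 can be sketched as follows. Mean-squared error is an assumed choice (the patent does not fix the concrete loss function), and the mixture for the third loss is formed here by simply summing accompaniment and vocals:

```python
import numpy as np

def target_loss(acc_pred, acc_true, voc_pred, voc_true):
    """Joint objective: accompaniment loss + vocal loss + overall mixture loss."""
    mse = lambda a, b: float(np.mean((a - b) ** 2))
    l1 = mse(acc_pred, acc_true)                        # first: accompaniment loss
    l2 = mse(voc_pred, voc_true)                        # second: vocal loss
    l3 = mse(acc_pred + voc_pred, acc_true + voc_true)  # third: overall mixture loss
    return l1 + l2 + l3, (l1, l2, l3)

acc = np.array([1.0, 2.0])
voc = np.array([0.5, -0.5])
perfect, _ = target_loss(acc, acc, voc, voc)        # perfect separation -> 0
biased, parts = target_loss(acc + 1.0, acc, voc, voc)
```

Before convergence, `l1 + l3` would drive the accompaniment separation model update and `l2 + l3` the vocal separation model update, matching the update rule above.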
Therefore, the accompaniment separation model and the voice separation model are trained jointly, the integral loss of stereo audio data is considered in the training process, and the integral performance of the accompaniment separation model and the voice separation model can be improved.
The technical scheme of the application is described below by taking a song playing process of a certain music client APP as an example.
The background server of the music client APP acquires the left channel spectrum feature corresponding to the left channel signal and the right channel spectrum feature corresponding to the right channel signal in each piece of stereo audio data in a stereo audio data set, then inputs the left channel spectrum feature and the right channel spectrum feature of each piece of stereo audio data into a preset neural network model for training, and determines a corresponding target loss parameter during training; the preset neural network model at least comprises an accompaniment separation model. When the target loss parameter converges, the current accompaniment separation model is determined as the target accompaniment separation model. The target accompaniment separation model is then used to separate the accompaniment signals in the stereo signal of each stereo sound source in the music library of the music client APP, obtaining a first accompaniment signal in the left channel signal and a second accompaniment signal in the right channel signal. A left surround channel signal and a right surround channel signal of the 5.1 channel signal are generated based on the first accompaniment signal and the second accompaniment signal, respectively; a center channel signal and a bass channel signal of the 5.1 channel signal are generated based on the left channel signal and the right channel signal; the left channel signal is determined as the left front channel signal of the 5.1 channel signal, and the right channel signal as the right front channel signal. In this way, each stereo sound source is converted into a 5.1 channel signal. When a user plays any song in the song library through the music client APP, the 5.1 channel signal corresponding to the song can be transmitted, through a sound card and a power amplifier, to a sound box comprising a 5.1 loudspeaker array for playback; alternatively, surround sound-effect processing for playback over an earphone or a terminal loudspeaker is applied to the 5.1 channel signal, which is then played through the earphone or the terminal loudspeaker.
Referring to fig. 9, an embodiment of the present application discloses a 5.1 channel signal generating apparatus, including:
an accompaniment signal separation module 11, configured to separate accompaniment signals in stereo signals to obtain a first accompaniment signal in a left channel signal and a second accompaniment signal in a right channel signal;
a surround channel signal generating module 12, configured to generate a left surround channel signal and a right surround channel signal of a 5.1 channel signal based on the first accompaniment signal and the second accompaniment signal, respectively;
a center channel signal generating module 13, configured to generate a center channel signal of the 5.1 channel signal based on the left channel signal and the right channel signal;
a bass channel signal generating module 14, configured to generate a bass channel signal of the 5.1 channel signal based on the left channel signal and the right channel signal;
a left front channel signal determining module 15, configured to determine the left channel signal as a left front channel signal of the 5.1 channel signal;
A right front channel signal determining module 16, configured to determine the right channel signal as a right front channel signal of the 5.1 channel signal.
It can be seen that, in the embodiments of the present application, accompaniment signals in stereo signals are separated to obtain a first accompaniment signal in a left channel signal and a second accompaniment signal in a right channel signal, then a left surround channel signal and a right surround channel signal of a 5.1 channel signal are generated based on the first accompaniment signal and the second accompaniment signal, respectively, a center channel signal and a bass channel signal of the 5.1 channel signal are generated based on the left channel signal and the right channel signal, the left channel signal is determined as a left front channel signal of the 5.1 channel signal, and the right channel signal is determined as a right front channel signal of the 5.1 channel signal. That is, the embodiment of the present application generates the left surround channel signal of the 5.1 channel signal based on the first accompaniment signal separated from the left channel signal of the stereo signal, and generates the right surround channel signal of the 5.1 channel signal based on the second accompaniment signal separated from the right channel signal, so that the various musical instrument components of the stereo signal and the mutual independence of the left channel signal and the right channel signal can be well preserved.
The accompaniment signal separation module 11 specifically includes:
the target accompaniment separation model acquisition sub-module is used for acquiring a target accompaniment separation model;
and the accompaniment signal separation submodule is used for separating the accompaniment signals in the stereo signals by utilizing the target accompaniment separation model.
In a specific embodiment, the target accompaniment separation model acquisition sub-module specifically includes:
the stereo audio data processing device comprises a spectrum characteristic acquisition unit, a processing unit and a processing unit, wherein the spectrum characteristic acquisition unit is used for acquiring a left channel spectrum characteristic corresponding to a left channel signal and a right channel spectrum characteristic corresponding to a right channel signal in each stereo audio data set;
the model training unit is used for inputting the left channel frequency spectrum characteristic and the right channel frequency spectrum characteristic of each stereo audio data into a preset neural network model for training and determining corresponding target loss parameters in the training process; the preset neural network model at least comprises an accompaniment separation model; and when the target loss parameters are converged, determining the current accompaniment separation model as a target accompaniment separation model.
In a specific embodiment, the spectrum feature obtaining unit is specifically configured to:
performing subband decomposition on each stereo audio data in the stereo audio data set to obtain a plurality of first subband signals corresponding to a left channel signal and a plurality of second subband signals corresponding to a right channel signal in each stereo audio data;
obtaining a plurality of first sub-band frequency spectrums corresponding to a plurality of first sub-band signals to obtain the frequency spectrum characteristics of the left channel;
and acquiring a plurality of second sub-band frequency spectrums corresponding to the plurality of second sub-band signals to obtain the right channel frequency spectrum characteristic.
In a specific embodiment, the model training unit is specifically configured to:
inputting the left channel frequency spectrum feature and the right channel frequency spectrum feature of each stereo audio data into an accompaniment separation model for training to obtain a first training frequency spectrum feature corresponding to the left channel frequency spectrum feature and a second training frequency spectrum feature corresponding to the right channel frequency spectrum feature;
determining a left channel accompaniment signal based on the first training spectral features and determining a right channel accompaniment signal based on the second training spectral features;
and determining a loss parameter for measuring the accompaniment loss based on the left channel accompaniment signal and the right channel accompaniment signal, and determining the loss parameter as a target loss parameter.
In another specific embodiment, the model training unit is specifically configured to:
inputting the left channel frequency spectrum feature and the right channel frequency spectrum feature of each stereo audio data into an accompaniment separation model for training to obtain a first training frequency spectrum feature corresponding to the left channel frequency spectrum feature and a second training frequency spectrum feature corresponding to the right channel frequency spectrum feature;
determining a left channel accompaniment signal based on the first training spectral features and determining a right channel accompaniment signal based on the second training spectral features;
determining a first loss parameter for measuring an accompaniment loss based on the left channel accompaniment signal and the right channel accompaniment signal;
inputting the left channel frequency spectrum feature and the right channel frequency spectrum feature of each stereo audio data into a human voice separation model for training to obtain a third training frequency spectrum feature corresponding to the left channel frequency spectrum feature and a fourth training frequency spectrum feature corresponding to the right channel frequency spectrum feature;
determining a left channel vocal signal based on the third training spectral feature and determining a right channel vocal signal based on the fourth training spectral feature;
determining a second loss parameter for measuring the voice loss based on the left channel voice signal and the right channel voice signal;
determining a third loss parameter for measuring the overall loss of stereo audio data based on the left channel accompaniment signal, the right channel accompaniment signal, the left channel vocal signal and the right channel vocal signal;
determining a sum of the first loss parameter, the second loss parameter, and the third loss parameter as a target loss parameter;
if the target loss parameter is not converged, updating an accompaniment separation model by using the first loss parameter and the third loss parameter, and updating a voice separation model by using the second loss parameter and the third loss parameter;
and when the target loss parameters are converged, determining the current accompaniment separation model as a target accompaniment separation model.
An accompaniment signal separation sub-module, specifically configured to:
performing subband decomposition on the stereo signal to obtain a plurality of third subband signals corresponding to a left channel signal and a plurality of fourth subband signals corresponding to a right channel signal in the stereo signal;
acquiring a plurality of third sub-band frequency spectrums corresponding to the plurality of third sub-band signals and a plurality of fourth sub-band frequency spectrums corresponding to the plurality of fourth sub-band signals;
splicing a plurality of third sub-band frequency spectrums to obtain a first model input frequency spectrum characteristic, and splicing a plurality of fourth sub-band frequency spectrums to obtain a second model input frequency spectrum characteristic;
inputting the first model input spectrum feature and the second model input spectrum feature into the target accompaniment separation model to obtain a first model output spectrum feature corresponding to the first model input spectrum feature and a second model output spectrum feature corresponding to the second model input spectrum feature, both output by the target accompaniment separation model;
splitting the first model output spectral features into a plurality of first output sub-band spectral features and splitting the second model output spectral features into a plurality of second output sub-band spectral features;
determining a first accompaniment signal in a left channel signal using a plurality of said first output sub-band spectral features and determining a second accompaniment signal in a right channel signal using a plurality of said second output sub-band spectral features.
The surround channel signal generating module 12 specifically includes:
a delay processing unit, configured to perform delay processing on the first accompaniment signal and the second accompaniment signal respectively; wherein the delay time of the delay processing is not more than a preset delay time threshold
And the filtering unit is used for respectively processing the delayed first accompaniment signal and the delayed second accompaniment signal by utilizing a band-pass filter to obtain a left surround sound channel signal and a right surround sound channel signal of the 5.1 sound channel signal.
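A sketch of this surround-channel path (delay, then band-pass) follows. The 10 ms delay, the 100–7000 Hz band edges, and the filter length are assumed illustrative values; the patent only requires the delay not to exceed a preset threshold:

```python
import numpy as np

def surround_channel(acc, sr=44100, delay_ms=10, band=(100.0, 7000.0), taps=511):
    """Delay the separated accompaniment signal, then band-pass filter it."""
    delay = int(sr * delay_ms / 1000)
    delayed = np.concatenate([np.zeros(delay), acc])[:len(acc)]
    # band-pass FIR = (low-pass at upper edge) - (low-pass at lower edge),
    # each a Hamming-windowed sinc normalized to unit DC gain
    n = np.arange(taps) - (taps - 1) / 2
    w = np.hamming(taps)
    def lowpass(fc):
        h = np.sinc(2.0 * fc / sr * n) * w
        return h / h.sum()
    h = lowpass(band[1]) - lowpass(band[0])
    return np.convolve(delayed, h, mode="same")

# DC is rejected by the band-pass; an in-band 1 kHz tone passes through.
dc_out = surround_channel(np.ones(4096))
tone = np.cos(2 * np.pi * 1000.0 * np.arange(8192) / 44100)
tone_out = surround_channel(tone)
```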
Further, the embodiment of the present application also provides an electronic device. FIG. 10 is a block diagram of an electronic device 20 according to an exemplary embodiment; nothing in the figure should be construed as limiting the scope of use of the present application.
Fig. 10 is a schematic structural diagram of an electronic device 20 according to an embodiment of the present disclosure. The electronic device 20 may specifically include: at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input output interface 25, and a communication bus 26. Wherein the memory 22 is used for storing a computer program, and the computer program is loaded and executed by the processor 21 to implement the relevant steps in the 5.1 channel signal generating method disclosed in any of the foregoing embodiments.
In this embodiment, the power supply 23 is configured to provide a working voltage for each hardware device on the electronic device 20; the communication interface 24 can create a data transmission channel between the electronic device 20 and an external device, and a communication protocol followed by the communication interface is any communication protocol applicable to the technical solution of the present application, and is not specifically limited herein; the input/output interface 25 is configured to obtain external input data or output data to the outside, and a specific interface type thereof may be selected according to specific application requirements, which is not specifically limited herein.
In addition, the memory 22, as a carrier for resource storage, may be a read-only memory, a random access memory, a magnetic disk, an optical disk, or the like; the resources stored thereon may include an operating system 221, a computer program 222, data 223, and the like, and the storage may be transient or permanent.
The operating system 221 is used for managing and controlling each hardware device and the computer program 222 on the electronic device 20, so as to realize the operation and processing of the processor 21 on the data 223 in the memory 22, and may be Windows Server, Netware, Unix, Linux, and the like. The computer program 222 may further include a computer program that can be used to perform other specific tasks in addition to the computer program that can be used to perform the 5.1-channel signal generation method by the electronic device 20 disclosed in any of the foregoing embodiments. Data 223 may include stereo signal data collected by electronic device 20, and the like.
Further, an embodiment of the present application also discloses a storage medium, in which a computer program is stored, and when the computer program is loaded and executed by a processor, the steps of the 5.1 channel signal generation method disclosed in any of the foregoing embodiments are implemented.
The embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The method, the apparatus and the medium for generating a 5.1 channel signal provided by the present application are described in detail above, and a specific example is applied in the present application to explain the principles and embodiments of the present application, and the description of the above embodiment is only used to help understand the method and the core idea of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.
Claims (11)
1. A method of generating a 5.1 channel signal, comprising:
separating the accompaniment signals in the stereo signals to obtain a first accompaniment signal in the left channel signal and a second accompaniment signal in the right channel signal;
generating a left surround channel signal and a right surround channel signal of a 5.1 channel signal based on the first accompaniment signal and the second accompaniment signal, respectively;
generating a center channel signal and a bass channel signal of the 5.1 channel signal based on the left channel signal and the right channel signal;
determining the left channel signal as a left front channel signal of the 5.1 channel signal and determining the right channel signal as a right front channel signal of the 5.1 channel signal.
2. The method of claim 1, wherein the separating the accompaniment signals in the stereo signal to obtain a first accompaniment signal in a left channel signal and a second accompaniment signal in a right channel signal comprises:
acquiring a target accompaniment separation model;
and separating the accompaniment signals in the stereo signals by using the target accompaniment separation model to obtain a first accompaniment signal in the left channel signal and a second accompaniment signal in the right channel signal.
3. The 5.1 channel signal generating method according to claim 2, wherein said obtaining a target accompaniment separation model includes:
acquiring a left channel frequency spectrum characteristic corresponding to a left channel signal and a right channel frequency spectrum characteristic corresponding to a right channel signal in each stereo audio data set;
inputting the left channel frequency spectrum characteristic and the right channel frequency spectrum characteristic of each stereo audio data into a preset neural network model for training, and determining a corresponding target loss parameter in the training process; the preset neural network model at least comprises an accompaniment separation model;
and when the target loss parameters are converged, determining the current accompaniment separation model as a target accompaniment separation model.
4. The method of claim 3, wherein acquiring the left channel spectral feature corresponding to the left channel signal and the right channel spectral feature corresponding to the right channel signal of each stereo audio data in the stereo audio data set comprises:
performing subband decomposition on each stereo audio data in the stereo audio data set to obtain a plurality of first subband signals corresponding to the left channel signal and a plurality of second subband signals corresponding to the right channel signal of each stereo audio data;
acquiring a plurality of first subband spectra corresponding to the plurality of first subband signals to obtain the left channel spectral feature;
and acquiring a plurality of second subband spectra corresponding to the plurality of second subband signals to obtain the right channel spectral feature.
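The per-channel pipeline of claim 4 (subband decomposition, then a spectrum per subband) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the Butterworth filterbank, the subband count, the sample rate, and the FFT size are all assumptions.

```python
import numpy as np
from scipy import signal

def subband_spectral_features(x, sr=44100, n_subbands=4, n_fft=1024):
    """Split one channel into subbands and take each subband's magnitude
    spectrum; the stack of per-subband spectra is the channel's spectral feature."""
    edges = np.linspace(0, sr / 2, n_subbands + 1)
    feats = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Butterworth band split (low-pass for the first band, high-pass for the last)
        if lo == 0:
            sos = signal.butter(4, hi, btype="lowpass", fs=sr, output="sos")
        elif hi >= sr / 2:
            sos = signal.butter(4, lo, btype="highpass", fs=sr, output="sos")
        else:
            sos = signal.butter(4, [lo, hi], btype="bandpass", fs=sr, output="sos")
        band = signal.sosfilt(sos, x)
        # Magnitude spectrogram of this subband signal
        _, _, spec = signal.stft(band, fs=sr, nperseg=n_fft)
        feats.append(np.abs(spec))
    return np.stack(feats)  # shape: (n_subbands, n_freq_bins, n_frames)
```

Applied to the left channel this yields the left channel spectral feature; the same call on the right channel yields the right channel spectral feature.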
5. The method of claim 3, wherein inputting the left channel spectral feature and the right channel spectral feature of each stereo audio data into the preset neural network model for training and determining the corresponding target loss parameter during training comprises:
inputting the left channel spectral feature and the right channel spectral feature of each stereo audio data into the accompaniment separation model for training to obtain a first training spectral feature corresponding to the left channel spectral feature and a second training spectral feature corresponding to the right channel spectral feature;
determining a left channel accompaniment signal based on the first training spectral feature and determining a right channel accompaniment signal based on the second training spectral feature;
and determining a loss parameter measuring the accompaniment loss based on the left channel accompaniment signal and the right channel accompaniment signal, and determining that loss parameter as the target loss parameter.
6. The method of claim 3, wherein inputting the left channel spectral feature and the right channel spectral feature of each stereo audio data into the preset neural network model for training and determining the corresponding target loss parameter during training comprises:
inputting the left channel spectral feature and the right channel spectral feature of each stereo audio data into the accompaniment separation model for training to obtain a first training spectral feature corresponding to the left channel spectral feature and a second training spectral feature corresponding to the right channel spectral feature;
determining a left channel accompaniment signal based on the first training spectral feature and determining a right channel accompaniment signal based on the second training spectral feature;
determining a first loss parameter measuring the accompaniment loss based on the left channel accompaniment signal and the right channel accompaniment signal;
inputting the left channel spectral feature and the right channel spectral feature of each stereo audio data into a vocal separation model for training to obtain a third training spectral feature corresponding to the left channel spectral feature and a fourth training spectral feature corresponding to the right channel spectral feature;
determining a left channel vocal signal based on the third training spectral feature and determining a right channel vocal signal based on the fourth training spectral feature;
determining a second loss parameter measuring the vocal loss based on the left channel vocal signal and the right channel vocal signal;
determining a third loss parameter measuring the overall loss of the stereo audio data based on the left channel accompaniment signal, the right channel accompaniment signal, the left channel vocal signal and the right channel vocal signal;
and determining the sum of the first loss parameter, the second loss parameter and the third loss parameter as the target loss parameter.
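The three-part objective of claim 6 can be sketched as follows. The claim only names the losses, not their form, so the choice of L1 distance and the use of the original mixture as the reference for the overall (third) loss are assumptions.

```python
import torch
import torch.nn.functional as F

def combined_loss(est_acc_l, est_acc_r, est_voc_l, est_voc_r,
                  ref_acc_l, ref_acc_r, ref_voc_l, ref_voc_r,
                  mix_l, mix_r):
    """Target loss = accompaniment loss + vocal loss + overall loss
    (L1 distances are an illustrative assumption)."""
    # First loss parameter: accompaniment error on both channels
    loss_acc = F.l1_loss(est_acc_l, ref_acc_l) + F.l1_loss(est_acc_r, ref_acc_r)
    # Second loss parameter: vocal error on both channels
    loss_voc = F.l1_loss(est_voc_l, ref_voc_l) + F.l1_loss(est_voc_r, ref_voc_r)
    # Third loss parameter: the separated parts should sum back to the mixture
    loss_mix = (F.l1_loss(est_acc_l + est_voc_l, mix_l)
                + F.l1_loss(est_acc_r + est_voc_r, mix_r))
    return loss_acc + loss_voc + loss_mix
```

Under claim 7's update rule, the accompaniment separation model would be updated from `loss_acc` and `loss_mix`, and the vocal separation model from `loss_voc` and `loss_mix`.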
7. The 5.1-channel signal generation method of claim 6, further comprising:
if the target loss parameter does not converge, updating the accompaniment separation model using the first loss parameter and the third loss parameter, and updating the vocal separation model using the second loss parameter and the third loss parameter.
8. The method of claim 2, wherein separating the accompaniment signals in the stereo signal using the target accompaniment separation model to obtain the first accompaniment signal in the left channel signal and the second accompaniment signal in the right channel signal comprises:
performing subband decomposition on the stereo signal to obtain a plurality of third subband signals corresponding to the left channel signal and a plurality of fourth subband signals corresponding to the right channel signal in the stereo signal;
acquiring a plurality of third subband spectra corresponding to the plurality of third subband signals and a plurality of fourth subband spectra corresponding to the plurality of fourth subband signals;
splicing the plurality of third subband spectra to obtain a first model input spectral feature, and splicing the plurality of fourth subband spectra to obtain a second model input spectral feature;
inputting the first model input spectral feature and the second model input spectral feature into the target accompaniment separation model to obtain, as model outputs, a first model output spectral feature corresponding to the first model input spectral feature and a second model output spectral feature corresponding to the second model input spectral feature;
splitting the first model output spectral feature into a plurality of first output subband spectral features and splitting the second model output spectral feature into a plurality of second output subband spectral features;
and determining the first accompaniment signal in the left channel signal using the plurality of first output subband spectral features and determining the second accompaniment signal in the right channel signal using the plurality of second output subband spectral features.
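The splice/split steps of claim 8 amount to concatenating the per-subband spectra into one model input and cutting the model output back into per-subband features. A minimal round-trip sketch, assuming a frequency-axis layout (the patent does not specify the concatenation axis):

```python
import numpy as np

def splice_subband_spectra(subband_specs):
    """Stack per-subband spectra along the frequency axis to form one
    model input spectral feature (axis choice is an assumption)."""
    return np.concatenate(subband_specs, axis=0)

def split_subband_spectra(model_output, n_subbands):
    """Inverse of splice: cut a model output spectral feature back into
    per-subband spectral features for synthesis of the accompaniment signal."""
    return np.split(model_output, n_subbands, axis=0)
```

With equal-sized subbands the split is the exact inverse of the splice, so the separation model can operate on the full-band feature while synthesis stays per-subband.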
9. The 5.1-channel signal generation method of any one of claims 1 to 8, wherein generating the left surround channel signal and the right surround channel signal of the 5.1-channel signal based on the first accompaniment signal and the second accompaniment signal, respectively, comprises:
performing delay processing on the first accompaniment signal and the second accompaniment signal respectively, wherein the delay of the delay processing is not greater than a preset delay-time threshold;
and processing the delayed first accompaniment signal and the delayed second accompaniment signal with a band-pass filter to obtain the left surround channel signal and the right surround channel signal of the 5.1-channel signal.
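Claim 9's surround-channel construction (a small capped delay followed by band-pass filtering) can be sketched as follows; the 20 ms threshold, the 100 Hz–7 kHz pass band, and the filter order are illustrative assumptions, as the claim fixes none of these values.

```python
import numpy as np
from scipy import signal

def make_surround(acc, sr=44100, delay_ms=15.0, band=(100.0, 7000.0)):
    """Delay one accompaniment channel (capped at a preset threshold) and
    band-pass filter it to build the corresponding surround channel."""
    delay_ms = min(delay_ms, 20.0)               # preset delay-time threshold (assumed 20 ms)
    n = int(sr * delay_ms / 1000.0)
    delayed = np.concatenate([np.zeros(n), acc])[: len(acc)]  # integer-sample delay
    sos = signal.butter(4, band, btype="bandpass", fs=sr, output="sos")
    return signal.sosfilt(sos, delayed)
```

Applying this to the first accompaniment signal gives the left surround channel signal; applying it to the second gives the right surround channel signal.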
10. An electronic device, comprising:
a memory for storing a computer program;
a processor for executing the computer program to implement the 5.1-channel signal generation method of any one of claims 1 to 9.
11. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the 5.1-channel signal generation method of any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110271369.6A CN113055809B (en) | 2021-03-12 | 2021-03-12 | 5.1 sound channel signal generation method, equipment and medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113055809A true CN113055809A (en) | 2021-06-29 |
CN113055809B CN113055809B (en) | 2023-02-28 |
Family
ID=76512349
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110271369.6A Active CN113055809B (en) | 2021-03-12 | 2021-03-12 | 5.1 sound channel signal generation method, equipment and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113055809B (en) |
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH1039881A (en) * | 1996-07-19 | 1998-02-13 | Yamaha Corp | Karaoke marking device |
CN101902679A (en) * | 2009-05-31 | 2010-12-01 | 比亚迪股份有限公司 | Processing method for simulating 5.1 sound-channel sound signal with stereo sound signal |
CN106792374A (en) * | 2017-01-10 | 2017-05-31 | 上海灿星文化传播有限公司 | A kind of music class reality TV show program surround sound audio preparation method |
US20190392802A1 (en) * | 2018-06-25 | 2019-12-26 | Casio Computer Co., Ltd. | Audio extraction apparatus, machine learning apparatus and audio reproduction apparatus |
CN110634501A (en) * | 2018-06-25 | 2019-12-31 | 卡西欧计算机株式会社 | Audio extraction device, machine training device, and karaoke device |
CN109801644A (en) * | 2018-12-20 | 2019-05-24 | 北京达佳互联信息技术有限公司 | Separation method, device, electronic equipment and the readable medium of mixed sound signal |
CN111667805A (en) * | 2019-03-05 | 2020-09-15 | 腾讯科技(深圳)有限公司 | Extraction method, device, equipment and medium of accompaniment music |
CN111916039A (en) * | 2019-05-08 | 2020-11-10 | 北京字节跳动网络技术有限公司 | Music file processing method, device, terminal and storage medium |
US20220028407A1 (en) * | 2019-05-08 | 2022-01-27 | Beijing Bytedance Network Technology Co., Ltd. | Method and device for processing music file, terminal and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Blauert | Communication acoustics | |
Begault et al. | 3-D sound for virtual reality and multimedia | |
CN106105269B (en) | Acoustic signal processing method and equipment | |
EP1741313B1 (en) | A method and system for sound source separation | |
JP2019204074A (en) | Speech dialogue method, apparatus and system | |
US11317233B2 (en) | Acoustic program, acoustic device, and acoustic system | |
CN102972047A (en) | Method and apparatus for reproducing stereophonic sound | |
EP1695335A1 (en) | Method for synthesizing acoustic spatialization | |
CN111863015A (en) | Audio processing method and device, electronic equipment and readable storage medium | |
CN111402910A (en) | Method and equipment for eliminating echo | |
US8401685B2 (en) | Method for reproducing an audio recording with the simulation of the acoustic characteristics of the recording condition | |
CN107395742A (en) | Network communication method and intelligent sound box based on intelligent sound box | |
WO2023221559A1 (en) | Karaoke audio processing method and apparatus, and computer-readable storage medium | |
CN105723459A (en) | Apparatus and method for improving a perception of sound signal | |
CN114299976A (en) | Audio data processing method and electronic equipment | |
Cecchi et al. | Low-complexity implementation of a real-time decorrelation algorithm for stereophonic acoustic echo cancellation | |
CN113643714A (en) | Audio processing method, device, storage medium and computer program | |
JP2024535951A (en) | Audio reproduction method, car audio system and storage medium | |
US20230254655A1 (en) | Signal processing apparatus and method, and program | |
Välimäki et al. | Spectral delay filters | |
CN113055809B (en) | 5.1 sound channel signal generation method, equipment and medium | |
US20220101821A1 (en) | Device, method and computer program for blind source separation and remixing | |
CN113921007B (en) | Method for improving far-field voice interaction performance and far-field voice interaction system | |
Kyriakakis et al. | Virtual microphones for multichannel audio applications | |
JP2015070291A (en) | Sound collection/emission device, sound source separation unit and sound source separation program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||