CN116758883A - Method for setting up mixed sound processing model, computer equipment and storage medium - Google Patents


Info

Publication number
CN116758883A
CN116758883A
Authority
CN
China
Prior art keywords
track
mixed audio
audio
target
accompaniment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310376950.3A
Other languages
Chinese (zh)
Inventor
江益靓
翁志强
姜涛
寇志娟
李革委
Current Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN202310376950.3A
Publication of CN116758883A
Legal status: Pending


Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H — ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00Details of electrophonic musical instruments
    • G10H1/36Accompaniment arrangements
    • G10H1/361Recording/reproducing of accompaniment for use with an external source, e.g. karaoke systems
    • G10H1/366Recording/reproducing of accompaniment for use with an external source, e.g. karaoke systems with means for modifying or correcting the external signal, e.g. pitch correction, reverberation, changing a singer's voice

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrophonic Musical Instruments (AREA)

Abstract

The application relates to a method for building a mixing processing model, a computer device, a storage medium and a computer program product. The method comprises the following steps: generating mixed audio from a human voice track and at least one accompaniment track; inputting the mixed audio into a track separation module to obtain a target human voice track and at least one target accompaniment track of the mixed audio; adjusting a first loss function of the mixing processing model according to the difference between the target human voice track and the human voice track and the difference between the at least one target accompaniment track and the at least one accompaniment track; inputting the target human voice track and the at least one target accompaniment track into a track mixing module to obtain target mixed audio; and adjusting a second loss function of the mixing processing model according to the difference between the label mixed audio corresponding to the mixed audio and the target mixed audio, the second loss function being used to make the target mixed audio approach the label mixed audio. By adopting this method, the mixing quality of a musical work can be improved.

Description

Method for building a mixing processing model, computer device and storage medium
Technical Field
The present application relates to the field of audio processing technology, and in particular, to a method for building a mixing processing model, a computer device, a storage medium, and a computer program product.
Background
In music production, professional mixing engineers integrate a singer's vocals and the sounds of various instruments into a stereo or mono track, so that the resulting musical work is more appealing and pleasant to listen to.
Traditional mixing techniques first require the vocal track and the instrument tracks to be recorded separately; a mixing engineer then adjusts and superimposes the audio signals on the different tracks in turn, drawing on professional experience, to obtain a professional musical work.
On singing platforms, most users lack an environment for recording independent tracks, so the musical works they record usually have the singing voice and the instrument sounds already mixed together. Traditional mixing techniques cannot remix such already-mixed works, and because users generally lack professional mixing skills, the mixing quality of these works is poor.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a method, a computer device, a computer-readable storage medium, and a computer program product for creating a mixing processing model capable of improving a mixing effect of a musical piece.
In a first aspect, the present application provides a method for building a mixing processing model. The method comprises the following steps:
generating mixed audio from a human voice track and at least one accompaniment track;
inputting the mixed audio into the track separation module to obtain a target human voice track and at least one target accompaniment track of the mixed audio;
adjusting a first loss function of the mixing processing model according to the difference between the target human voice track and the human voice track and the difference between the at least one target accompaniment track and the at least one accompaniment track; the first loss function is used to make the target human voice track and the at least one target accompaniment track approach the human voice track and the at least one accompaniment track, respectively;
inputting the target human voice track and the at least one target accompaniment track into the track mixing module to obtain target mixed audio;
and adjusting a second loss function of the mixing processing model according to the difference between the label mixed audio corresponding to the mixed audio and the target mixed audio; the second loss function is used to make the target mixed audio approach the label mixed audio.
In one embodiment, adjusting the first loss function of the mixing processing model according to the difference between the target human voice track and the human voice track and the difference between the at least one target accompaniment track and the at least one accompaniment track comprises:
performing signal-to-noise ratio processing on the difference between the target human voice track and the human voice track to obtain a human voice track signal-to-noise ratio between the target human voice track and the human voice track;
performing signal-to-noise ratio processing on the difference between the at least one target accompaniment track and the at least one accompaniment track to obtain an accompaniment track signal-to-noise ratio between them;
and adjusting the first loss function of the mixing processing model according to the fusion result of the human voice track signal-to-noise ratio and the accompaniment track signal-to-noise ratio.
In one embodiment, adjusting the second loss function of the mixing processing model according to the difference between the label mixed audio corresponding to the mixed audio and the target mixed audio includes:
determining an audio difference between the label mixed audio corresponding to the mixed audio and the target mixed audio;
performing signal-to-noise ratio processing on the audio difference and the label mixed audio to obtain a mixing signal-to-noise ratio between the target mixed audio and the label mixed audio;
and adjusting the second loss function of the mixing processing model according to the mixing signal-to-noise ratio.
In one embodiment, generating mixed audio from a human voice track and at least one accompaniment track includes:
superimposing the human voice track and the at least one accompaniment track to obtain the mixed audio.
In one embodiment, after adjusting the second loss function of the mixing processing model according to the difference between the label mixed audio corresponding to the mixed audio and the target mixed audio, the method further includes:
and updating the model parameters of the audio track separation module according to the first loss function, and updating the model parameters of the audio track mixing module according to the second loss function to obtain a trained mixing processing model.
In one embodiment, after updating the model parameters of the audio track separation module according to the first loss function and updating the model parameters of the audio track mixing module according to the second loss function, the method further includes:
acquiring original mixed audio;
inputting the original mixed audio into the trained mixing processing model to obtain target mixed audio of the original mixed audio; the mixing quality of the target mixed audio is higher than that of the original mixed audio.
In one embodiment, inputting the original mixed audio into the trained mixing processing model to obtain the target mixed audio of the original mixed audio includes:
inputting the original mixed audio into the track separation module in the trained mixing processing model to obtain a target human voice track, a drum track, a bass track and a piano track of the original mixed audio;
and inputting the target human voice track, the drum track, the bass track and the piano track into the track mixing module in the trained mixing processing model to obtain the target mixed audio of the original mixed audio.
In a second aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor which when executing the computer program performs the steps of:
generating mixed audio from a human voice track and at least one accompaniment track;
inputting the mixed audio into the track separation module to obtain a target human voice track and at least one target accompaniment track of the mixed audio;
adjusting a first loss function of the mixing processing model according to the difference between the target human voice track and the human voice track and the difference between the at least one target accompaniment track and the at least one accompaniment track; the first loss function is used to make the target human voice track and the at least one target accompaniment track approach the human voice track and the at least one accompaniment track, respectively;
inputting the target human voice track and the at least one target accompaniment track into the track mixing module to obtain target mixed audio;
and adjusting a second loss function of the mixing processing model according to the difference between the label mixed audio corresponding to the mixed audio and the target mixed audio; the second loss function is used to make the target mixed audio approach the label mixed audio.
In a third aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
generating mixed audio from a human voice track and at least one accompaniment track;
inputting the mixed audio into the track separation module to obtain a target human voice track and at least one target accompaniment track of the mixed audio;
adjusting a first loss function of the mixing processing model according to the difference between the target human voice track and the human voice track and the difference between the at least one target accompaniment track and the at least one accompaniment track; the first loss function is used to make the target human voice track and the at least one target accompaniment track approach the human voice track and the at least one accompaniment track, respectively;
inputting the target human voice track and the at least one target accompaniment track into the track mixing module to obtain target mixed audio;
and adjusting a second loss function of the mixing processing model according to the difference between the label mixed audio corresponding to the mixed audio and the target mixed audio; the second loss function is used to make the target mixed audio approach the label mixed audio.
In a fourth aspect, the application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of:
generating mixed audio from a human voice track and at least one accompaniment track;
inputting the mixed audio into the track separation module to obtain a target human voice track and at least one target accompaniment track of the mixed audio;
adjusting a first loss function of the mixing processing model according to the difference between the target human voice track and the human voice track and the difference between the at least one target accompaniment track and the at least one accompaniment track; the first loss function is used to make the target human voice track and the at least one target accompaniment track approach the human voice track and the at least one accompaniment track, respectively;
inputting the target human voice track and the at least one target accompaniment track into the track mixing module to obtain target mixed audio;
and adjusting a second loss function of the mixing processing model according to the difference between the label mixed audio corresponding to the mixed audio and the target mixed audio; the second loss function is used to make the target mixed audio approach the label mixed audio.
The above method, computer device, storage medium and computer program product build the mixing processing model by generating mixed audio from a human voice track and at least one accompaniment track; inputting the mixed audio into a track separation module to obtain a target human voice track and at least one target accompaniment track of the mixed audio; adjusting a first loss function of the mixing processing model according to the difference between the target human voice track and the human voice track and the difference between the at least one target accompaniment track and the at least one accompaniment track; inputting the target human voice track and the at least one target accompaniment track into a track mixing module to obtain target mixed audio; and adjusting a second loss function of the mixing processing model according to the difference between the label mixed audio corresponding to the mixed audio and the target mixed audio. With this method, the first loss function drives the target human voice track and the at least one target accompaniment track ever closer to the human voice track and the at least one accompaniment track, respectively, while the second loss function drives the target mixed audio ever closer to the label mixed audio. The track separation module of the mixing processing model can therefore separate the human voice track and the accompaniment tracks from mixed audio, which effectively improves the mixing quality of the target mixed audio output by the model, overcomes the inability of the prior art to remix already-mixed audio, and allows the mixing processing model to improve the mixing quality of an original musical work.
Drawings
FIG. 1 is a flow diagram of a method for building a mixing processing model in one embodiment;
FIG. 2 is a schematic diagram of a method for building a mixing processing model in an embodiment;
FIG. 3 is a schematic diagram of acquiring tag mixed audio in one embodiment;
FIG. 4 is a schematic diagram of acquiring mixed audio in one embodiment;
FIG. 5 is a flow chart illustrating steps of obtaining a target mixed audio of an original mixed audio in one embodiment;
FIG. 6 is a schematic diagram of an application of a training-completed mixing process model in one embodiment;
FIG. 7 is a flowchart of a method for building a mixing processing model in another embodiment;
FIG. 8 is a diagram of the internal structure of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In one embodiment, as shown in FIG. 1, a method for building a mixing processing model is provided, where the mixing processing model includes a track separation module and a track mixing module. The method is described here as applied to a terminal, but it can also be applied to a server, or to a system comprising a terminal and a server and implemented through their interaction. The terminal may be, but is not limited to, a personal computer, a notebook computer, a smartphone, a tablet computer, an Internet of Things device or a portable wearable device. The server may be implemented as a stand-alone server or as a server cluster composed of a plurality of servers. In this embodiment, the method includes the following steps:
Step S101, generating mixed audio from a human voice track and at least one accompaniment track.
Here, the human voice track is a track on which processed vocal audio, not yet mixed with other audio, can be displayed and adjusted independently. An accompaniment track is a track on which processed accompaniment audio, not yet mixed with other audio, can be displayed and adjusted independently. Mixed audio is audio data obtained by mixing a plurality of tracks (e.g., a human voice track and accompaniment tracks).
Specifically, FIG. 2 is a schematic diagram of the method for building a mixing processing model; as shown in FIG. 2, the terminal may fuse the human voice track and the at least one accompaniment track to obtain the mixed audio.
Step S102, inputting the mixed audio into the track separation module to obtain a target human voice track and at least one target accompaniment track of the mixed audio.
The track separation module separates input audio data into independent tracks. In practice, the track separation module may be implemented by non-negative matrix factorization (NMF), or by a neural network, e.g., a neural network with a U-Net structure or a convolutional structure.
Specifically, after obtaining the mixed audio, the terminal may input it into the track separation module, which performs track separation on the mixed audio: it extracts the vocal track from the mixed audio, giving the terminal the target human voice track, and likewise extracts at least one accompaniment track from the mixed audio, giving the terminal the at least one target accompaniment track.
Step S103, adjusting a first loss function of the mixing processing model according to the difference between the target human voice track and the human voice track and the difference between the at least one target accompaniment track and the at least one accompaniment track; the first loss function is used to make the target human voice track and the at least one target accompaniment track approach the human voice track and the at least one accompaniment track, respectively.
Specifically, the terminal may adjust the first loss function of the mixing processing model using these two differences, so that the target human voice track and the at least one target accompaniment track output by the track separation module continuously approach the human voice track and the at least one accompaniment track, respectively, thereby continuously improving the separation accuracy of the track separation module.
Step S104, inputting the target human voice track and the at least one target accompaniment track into the track mixing module to obtain target mixed audio.
The track mixing module mixes tracks. The target mixed audio is the audio data obtained by remixing the mixed audio.
Specifically, the terminal inputs the target human voice track and the at least one target accompaniment track obtained in step S102 into the track mixing module, which mixes them; that is, the track mixing module simulates a professional mixing engineer by applying multi-stage mixing operations, such as adjustment and superposition, to the vocal audio carried on the target human voice track and the accompaniment audio carried on the at least one target accompaniment track, so as to obtain the target mixed audio corresponding to these tracks.
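As a caricature of the adjust-and-superimpose behaviour just described, the sketch below applies a per-track gain (a hypothetical stand-in for whatever adjustments the learned track mixing module performs) and then sums the scaled tracks sample-wise. This is an illustrative sketch, not the patent's module; the function name and the gain representation are assumptions.

```python
def mix_tracks(tracks, gains):
    # Apply a (hypothetical, e.g. learned) gain to each track, then
    # superimpose the scaled tracks sample-wise.
    n = len(tracks[0])
    return [sum(g * t[i] for g, t in zip(gains, tracks)) for i in range(n)]
```

In a trained module the gains (and richer per-track processing such as equalisation or reverberation) would be produced by the network rather than supplied by hand.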
Step S105, adjusting a second loss function of the mixing processing model according to the difference between the label mixed audio corresponding to the mixed audio and the target mixed audio; the second loss function is used to make the target mixed audio approach the label mixed audio.
Specifically, after obtaining the target mixed audio, the terminal may adjust the second loss function of the mixing processing model using the difference between the target mixed audio and the label mixed audio, so that the target mixed audio output by the track mixing module continuously approaches the label mixed audio, continuously improving the mixing quality achieved by the track mixing module. Finally, the terminal builds the mixing processing model from the trained track separation module and track mixing module.
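The two-loss scheme above, in which the first loss function steers the track separation module and the second loss function steers the track mixing module, can be caricatured with a toy alternating gradient step. This is purely illustrative (finite-difference gradients on arbitrary scalar loss functions), not the patent's training procedure; all names and the learning-rate value are assumptions.

```python
def descend(params, loss_fn, lr=0.01, eps=1e-5):
    # One finite-difference gradient-descent step on a list of parameters.
    grads = []
    for i in range(len(params)):
        bumped = list(params)
        bumped[i] += eps
        grads.append((loss_fn(bumped) - loss_fn(params)) / eps)
    return [p - lr * g for p, g in zip(params, grads)]

def training_step(sep_params, mix_params, first_loss_fn, second_loss_fn):
    # The separation module's parameters follow the first loss function;
    # the mixing module's parameters follow the second loss function.
    return descend(sep_params, first_loss_fn), descend(mix_params, second_loss_fn)
```

In practice both modules would be neural networks trained with an autograd framework; the point here is only that each module is updated against its own loss.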
In this method for building a mixing processing model, mixed audio is generated from a human voice track and at least one accompaniment track; the mixed audio is input into a track separation module to obtain a target human voice track and at least one target accompaniment track of the mixed audio; a first loss function of the mixing processing model is adjusted according to the difference between the target human voice track and the human voice track and the difference between the at least one target accompaniment track and the at least one accompaniment track; the target human voice track and the at least one target accompaniment track are input into a track mixing module to obtain target mixed audio; and a second loss function of the mixing processing model is adjusted according to the difference between the label mixed audio corresponding to the mixed audio and the target mixed audio. With this method, the first loss function drives the target human voice track and the at least one target accompaniment track ever closer to the human voice track and the at least one accompaniment track, respectively, while the second loss function drives the target mixed audio ever closer to the label mixed audio. The track separation module of the mixing processing model can therefore separate the human voice track and the accompaniment tracks from mixed audio, which effectively improves the mixing quality of the target mixed audio output by the model, overcomes the inability of the prior art to remix already-mixed audio, and allows the mixing processing model to improve the mixing quality of an original musical work.
In one embodiment, step S103 adjusts the first loss function of the mixing processing model according to the difference between the target human voice track and the human voice track and the difference between the at least one target accompaniment track and the at least one accompaniment track, specifically as follows: performing signal-to-noise ratio processing on the difference between the target human voice track and the human voice track to obtain the human voice track signal-to-noise ratio between them; performing signal-to-noise ratio processing on the difference between the at least one target accompaniment track and the at least one accompaniment track to obtain the accompaniment track signal-to-noise ratio between them; and adjusting the first loss function of the mixing processing model according to the fusion result of the human voice track signal-to-noise ratio and the accompaniment track signal-to-noise ratio.
The human voice track signal-to-noise ratio is an index for evaluating the target human voice track separated from an input sound source (e.g., the mixed audio). The accompaniment track signal-to-noise ratio is an index for evaluating the at least one target accompaniment track separated from the input sound source. The first loss function of the mixing processing model may be an evaluation index from the field of sound source separation, such as SNR (signal-to-noise ratio), SI-SDR (scale-invariant signal-to-distortion ratio) or SDR (signal-to-distortion ratio).
Specifically, the terminal may perform the signal-to-noise ratio processing on the difference between the target human voice track and the human voice track by taking that difference as the denominator and the square of the human voice track as the numerator to obtain a ratio, and then taking the logarithm of the ratio, yielding the human voice track signal-to-noise ratio between the target human voice track and the human voice track. Similarly, the terminal may take the difference between the at least one target accompaniment track and the at least one accompaniment track as the denominator and the square of the at least one accompaniment track as the numerator, and take the logarithm of the resulting ratio, yielding the accompaniment track signal-to-noise ratio. To fuse the two, the terminal may average the human voice track signal-to-noise ratio and the accompaniment track signal-to-noise ratio and use the average as the first loss function of the mixing processing model; alternatively, the first loss function may be obtained by weighting the two signal-to-noise ratios according to their respective importance.
In this embodiment, the human voice track signal-to-noise ratio between the target human voice track and the human voice track is obtained by performing signal-to-noise ratio processing on their difference; the accompaniment track signal-to-noise ratio between the at least one target accompaniment track and the at least one accompaniment track is obtained in the same way; and the two are fused to obtain the first loss function of the mixing processing model, so that the first loss function can continuously optimize the track separation module, giving it better track separation performance.
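The signal-to-noise-ratio computation described above can be sketched concretely: an SNR in decibels (squared reference over squared difference, then a base-10 logarithm), with the vocal-track SNR and the accompaniment-track SNRs fused by a weighted average and negated so that better separation gives a lower loss. This is a minimal pure-Python sketch under stated assumptions, not the patent's implementation; the function names, the weighting scheme and the small epsilon guard are assumptions.

```python
import math

def snr_db(reference, estimate):
    # SNR = 10 * log10(||ref||^2 / ||ref - est||^2); epsilon guards
    # against a zero difference when the estimate is perfect.
    signal = sum(r * r for r in reference)
    noise = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10(signal / (noise + 1e-12))

def first_loss(vocal_ref, vocal_est, accomp_refs, accomp_ests, w_vocal=0.5):
    # Fuse the vocal-track SNR with the mean accompaniment-track SNR;
    # negate so that a higher SNR (better separation) gives a lower loss.
    vocal_snr = snr_db(vocal_ref, vocal_est)
    accomp_snr = sum(snr_db(r, e) for r, e in zip(accomp_refs, accomp_ests)) / len(accomp_refs)
    return -(w_vocal * vocal_snr + (1.0 - w_vocal) * accomp_snr)
```

A perfect separation drives the loss strongly negative, while a noisy one leaves it near zero, which is the gradient signal the track separation module would train against.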
In one embodiment, step S105 adjusts the second loss function of the mixing processing model according to the difference between the label mixed audio corresponding to the mixed audio and the target mixed audio, specifically as follows: determining the audio difference between the label mixed audio corresponding to the mixed audio and the target mixed audio; performing signal-to-noise ratio processing on the audio difference and the label mixed audio to obtain the mixing signal-to-noise ratio between the target mixed audio and the label mixed audio; and adjusting the second loss function of the mixing processing model according to the mixing signal-to-noise ratio.
The mixing signal-to-noise ratio is an index for evaluating the output target mixed audio. As before, the second loss function of the mixing processing model may be an evaluation index from the field of sound source separation, such as SNR (signal-to-noise ratio), SI-SDR (scale-invariant signal-to-distortion ratio) or SDR (signal-to-distortion ratio).
Specifically, the terminal calculates the mixing difference between the label mixed audio corresponding to the mixed audio and the target mixed audio, then performs signal-to-noise ratio processing on the mixing difference and the label mixed audio: it may take the mixing difference as the denominator and the square of the label mixed audio as the numerator to obtain a ratio, and then take the logarithm of the ratio, yielding the mixing signal-to-noise ratio between the target mixed audio and the label mixed audio. The terminal then adjusts (e.g., updates or replaces) the second loss function of the mixing processing model according to the mixing signal-to-noise ratio.
The label mixed audio is audio data obtained by professionally mixing the human voice track and the at least one accompaniment track. FIG. 3 is a schematic diagram of acquiring the label mixed audio: the terminal may mix the human voice track and the at least one accompaniment track to determine the label mixed audio of the mixed audio. Alternatively, a professional mixing engineer may manually mix the human voice track and the at least one accompaniment track via the terminal, and the resulting professional mix is used as the label mixed audio of the mixed audio. The mixing quality of the label mixed audio is significantly higher than that of the mixed audio.
In this embodiment, signal-to-noise ratio processing is performed on the label mixed audio and on the mixing difference between the label mixed audio corresponding to the mixed audio and the target mixed audio, to obtain the mixing signal-to-noise ratio between the target mixed audio and the label mixed audio; the mixing signal-to-noise ratio is then taken as the second loss function of the mixing processing model, so that the model parameters of the mixing processing model can be continuously optimized using the second loss function, giving the mixing processing model better mixing performance.
In one embodiment, step S101, generating mixed audio from the human voice track and at least one accompaniment track, specifically includes the following: performing track superposition processing on the human voice track and the at least one accompaniment track to obtain the mixed audio.
Specifically, fig. 4 is a schematic diagram of the principle of acquiring the mixed audio: the terminal may perform track superposition processing on the human voice track and the at least one accompaniment track to generate the mixed audio corresponding to them. The track superposition processing may fuse the human voice audio with the accompaniment audio on the at least one accompaniment track, so that the terminal obtains the mixed audio; alternatively, the terminal may record the audio signal produced by playing the human voice audio and the accompaniment audio on the at least one accompaniment track simultaneously, and thereby obtain the mixed audio.
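In the digital-fusion case, track superposition amounts to a sample-by-sample sum of the aligned waveforms. A minimal sketch, assuming tracks are plain sample lists (the function name and the shortest-track alignment rule are our own simplifications):

```python
def superpose_tracks(vocal, accompaniments):
    """Sum the vocal track with each accompaniment track sample-by-sample
    to form the naively mixed training input."""
    tracks = [vocal] + list(accompaniments)
    n = min(len(t) for t in tracks)          # align to the shortest track
    return [sum(t[i] for t in tracks) for i in range(n)]
```

In practice the sum would usually be normalised or clipped to the valid sample range, which is omitted here.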
In this embodiment, the mixed audio is obtained by performing track superposition processing on the human voice track and the at least one accompaniment track, so that mixed audio and label mixed audio with significantly different mixing effects are obtained. The mixed audio with the poorer mixing effect simulates a musical composition mixed by a non-professional user or by conventional software, while the label mixed audio with the better mixing effect simulates a high-quality musical composition professionally mixed by a mixing engineer. With the label mixed audio as the learning target, the target mixed audio output by the mixing processing model is driven ever closer to the mixing effect of the label mixed audio, improving the mixing effect of the mixing processing model on the mixed audio.
In one embodiment, after adjusting the second loss function of the mixing process model according to the difference between the label mixed audio corresponding to the mixed audio and the target mixed audio, the method further comprises: and updating the model parameters of the audio track separation module according to the first loss function, and updating the model parameters of the audio track mixing module according to the second loss function to obtain a trained mixing processing model.
Specifically, after the terminal obtains the first loss function and the second loss function of the mixing processing model, it can synchronously update the model parameters of the track separation module and of the track mixing module using the first and second loss functions respectively, until the mixing processing model converges; the terminal thereby obtains the trained mixing processing model.
In this embodiment, the model parameters of the track separation module and of the track mixing module in the mixing processing model are updated according to the first loss function and the second loss function to obtain the trained mixing processing model. The trained model thus learns not only from the human voice track and the at least one accompaniment track but also from the label mixed audio, so it can perform secondary mixing on already-mixed audio and further improve the mixing effect of the resulting target mixed audio.
In one embodiment, as shown in fig. 5, after updating the model parameters of the audio track separation module according to the first loss function and updating the model parameters of the audio track mixing module according to the second loss function, the method further includes:
Step S501, acquiring the original mixed audio.
The original mixed audio refers to audio data obtained by mixing a plurality of tracks. For example, the original mixed audio may be audio data obtained by a user simply mixing a singing performance, or a musical work recorded in a karaoke scene.
Specifically, the user may transmit the original mixed audio that needs to be remixed to the terminal, and the terminal receives it. The terminal may also obtain the original mixed audio that requires remixing from an audio library.
Step S502, inputting the original mixed audio into the trained mixing processing model to obtain target mixed audio of the original mixed audio; the mixing quality of the target mixed audio is higher than that of the original mixed audio.
Specifically, after obtaining the original mixed audio, the terminal can input it into the track separation module of the trained mixing processing model, which performs track separation processing on it to obtain a target human voice track and at least one target accompaniment track of the original mixed audio. The terminal then inputs the obtained target human voice track and the at least one target accompaniment track into the track mixing module of the trained mixing processing model, which mixes them to obtain the target mixed audio of the original mixed audio.
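The two-stage inference just described — separation followed by remixing — can be sketched as a small pipeline. The `separate` and `mix` callables are hypothetical stand-ins for the trained modules, not a real API from the patent:

```python
def remix(original_mix, separate, mix):
    """Two-stage inference of the trained model: first split the original
    mix into a vocal stem plus accompaniment stems, then remix them."""
    vocal, *accompaniments = separate(original_mix)   # track separation module
    return mix(vocal, accompaniments)                 # track mixing module
```

The point of the structure is that the user supplies only the finished mix; the intermediate stems never need to be recorded separately.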
Fig. 6 is an application schematic diagram of the trained mixing processing model. As shown in fig. 6, in practical application a user may send the audio data of an original musical composition (i.e. the original mixed audio) to the terminal; the terminal inputs the original mixed audio into the trained mixing processing model, which performs track separation processing and mixing processing on it, obtaining the target mixed audio of the original mixed audio without relying on the user to collect independent tracks.
In this embodiment, the original mixed audio is acquired and input into the trained mixing processing model to obtain its target mixed audio, overcoming the inability of the conventional technology to perform secondary mixing on already-mixed audio. The mixing effect of the original mixed audio is thereby improved; in particular, the mixing effect of an original musical composition can be significantly improved in scenarios lacking independent track acquisition capability and professional mixing capability.
In one embodiment, inputting the original mixed audio into the trained mixing processing model to obtain the target mixed audio of the original mixed audio specifically includes the following: inputting the original mixed audio into the track separation module of the trained mixing processing model to obtain a target human voice track, a drum track, a bass track and a piano track of the original mixed audio; and inputting the target human voice track, the drum track, the bass track and the piano track into the track mixing module of the trained mixing processing model to obtain the target mixed audio of the original mixed audio.
Specifically, the accompaniment track may be at least one of a drum track, a bass track and a piano track. When the accompaniment of the original mixed audio contains the audio of the drum, bass and piano instruments, the terminal inputs the original mixed audio into the track separation module of the trained mixing processing model, and the track separation module processes it to obtain the target human voice track, the drum track, the bass track and the piano track of the original mixed audio. The terminal then inputs these four tracks into the track mixing module, which mixes them, and the terminal obtains the target mixed audio of the original mixed audio.
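The four-stem case can be sketched with a stem dictionary and per-stem gains; the gains are a hypothetical stand-in for whatever weighting the track mixing module learns, and none of the names below come from the patent:

```python
STEMS = ("vocal", "drums", "bass", "piano")

def remix_four_stems(stems, gains):
    """Weighted remix of the four separated stems: each stem is scaled by
    its gain and the scaled stems are summed sample-by-sample."""
    n = min(len(stems[name]) for name in STEMS)   # align stem lengths
    return [sum(gains[name] * stems[name][i] for name in STEMS)
            for i in range(n)]
```

For instance, raising only the vocal gain models the common remixing goal of bringing the voice forward relative to the accompaniment.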
In this embodiment, even when the accompaniment comprises several kinds of tracks, the trained mixing processing model can still accurately separate and remix the tracks of the original mixed audio, effectively improving its mixing effect. Independent tracks of the human voice and of the various accompaniment instruments do not need to be collected separately, which improves the efficiency of mixing the original mixed audio as well as the mixing quality of the resulting target mixed audio.
In one embodiment, as shown in fig. 7, another method for building a mixing processing model is provided. Taking its application to a terminal as an example, the method includes the following steps:
Step S701, performing track superposition processing on the human voice track and at least one accompaniment track to obtain mixed audio.
Step S702, inputting the mixed audio to the track separation module to obtain a target human voice track and at least one target accompaniment track of the mixed audio.
Step S703, performing signal-to-noise ratio processing on the difference between the target human voice track and the human voice track to obtain the human voice track signal-to-noise ratio between the target human voice track and the human voice track.
Step S704, performing signal-to-noise ratio processing on the difference between the at least one target accompaniment track and the at least one accompaniment track to obtain the accompaniment track signal-to-noise ratio between them.
Step S705, the first loss function of the mixing processing model is adjusted according to the fusion result of the signal-to-noise ratio of the human voice track and the signal-to-noise ratio of the accompaniment track.
Wherein the first loss function is used to drive the target human voice track and the at least one target accompaniment track toward the human voice track and the at least one accompaniment track, respectively.
Step S706, inputting the target human voice track and the at least one target accompaniment track to the track mixing module to obtain target mixed audio.
Step S707, determining an audio difference between the label mixed audio corresponding to the mixed audio and the target mixed audio; and performing signal-to-noise ratio processing on the audio difference and the label mixed audio to obtain the mixing signal-to-noise ratio between the target mixed audio and the label mixed audio.
Step S708, the second loss function of the mixing processing model is adjusted according to the mixing signal-to-noise ratio.
Wherein the second loss function is used to drive the target mixed audio toward the label mixed audio.
Step S709, updating the model parameters of the audio track separation module according to the first loss function, and updating the model parameters of the audio track mixing module according to the second loss function to obtain a trained mixing processing model.
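The steps above can be caricatured as a tiny end-to-end training loop. This is purely a toy sketch: scalar weights stand in for the two modules, squared error stands in for the negative SNR losses, and central-difference numerical gradients replace backpropagation — none of it is the patent's actual implementation:

```python
def numeric_grad(f, x, h=1e-5):
    """Central-difference estimate of the gradient of a scalar function."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Toy one-sample setup: the mix is the sum of one vocal sample and one
# accompaniment sample, and the label mix is taken to equal that sum.
vocal, accomp = 0.8, 0.5
mix_sample = vocal + accomp
label_mix = mix_sample

w_sep, w_mix = 0.1, 0.1   # scalar stand-ins for the two modules' parameters

def sep_loss(w):
    # Separation stand-in (first loss): vocal_hat = w * mix; squared error
    # replaces the negative track signal-to-noise ratio.
    return (w * mix_sample - vocal) ** 2

def mix_loss(w):
    # Mixing stand-in (second loss): target_mix = w * mix; squared error
    # replaces the negative mixing signal-to-noise ratio.
    return (w * mix_sample - label_mix) ** 2

lr = 0.1
for _ in range(500):
    w_sep -= lr * numeric_grad(sep_loss, w_sep)   # S705/S709: separator update
    w_mix -= lr * numeric_grad(mix_loss, w_mix)   # S708/S709: mixer update
```

After convergence the separator weight recovers the vocal's share of the mix and the mixer weight approaches 1, mirroring how the two losses pull the two modules toward their respective targets.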
The method for building the mixing processing model has the following beneficial effects: the first loss function drives the target human voice track and the at least one target accompaniment track ever closer to the human voice track and the at least one accompaniment track respectively, while the second loss function drives the target mixed audio ever closer to the label mixed audio. The track separation module of the mixing processing model can thus separate the human voice track and the accompaniment tracks from the mixed audio, effectively improving the mixing effect of the target mixed audio output by the model, overcoming the inability of the prior art to perform secondary mixing on already-mixed audio, and enabling the model to improve the mixing effect of an original musical work.
To illustrate the method for building a mixing processing model provided by the embodiments of the present disclosure more clearly, a specific embodiment is described. The method can be applied to a terminal and specifically includes the following: after the terminal obtains the trained mixing processing model, a user can upload a recorded personal musical work to a karaoke platform. The terminal obtains the personal musical work from the karaoke platform and inputs it into the track separation module of the trained mixing processing model to obtain a target human voice track and at least one target accompaniment track of the personal musical work. The terminal then inputs the target human voice track and the at least one target accompaniment track into the track mixing module of the trained mixing processing model to obtain remixed target mixed audio of the personal musical work, and presents the target mixed audio to the user.
In this embodiment, the user's personal musical work is remixed through the trained mixing processing model, so that target mixed audio with a better mixing effect than the personal musical work is obtained. This improves the mixing effect on the original mixed audio without requiring the user to have independent track acquisition capability or professional mixing capability, greatly improving mixing efficiency.
It should be understood that, although the steps in the flowcharts of the embodiments described above are shown sequentially as indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited in execution order and may be performed in other orders. Moreover, at least some of the steps in these flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; their order of execution is not necessarily sequential, and they may be performed in turn or alternately with at least some of the other steps or stages.
In one embodiment, a computer device is provided, which may be a terminal, and whose internal structure may be as shown in fig. 8. The computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit and an input device. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface, the display unit and the input device are connected to the system bus through the input/output interface. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The input/output interface of the computer device is used to exchange information between the processor and external devices. The communication interface of the computer device is used for wired or wireless communication with an external terminal; the wireless mode can be realized through Wi-Fi, a mobile cellular network, NFC (near field communication) or other technologies. The computer program, when executed by the processor, implements a method for building a mixing processing model. The display unit of the computer device is used to form a visual picture and may be a display screen, a projection device or a virtual reality imaging device. The display screen may be a liquid crystal display screen or an electronic ink display screen; the input device of the computer device may be a touch layer covering the display screen, a key, a trackball or a touchpad arranged on the housing of the computer device, or an external keyboard, touchpad or mouse.
It will be appreciated by those skilled in the art that the structure shown in FIG. 8 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of the method embodiments described above.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, high density embedded nonvolatile Memory, resistive random access Memory (ReRAM), magnetic random access Memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric Memory (Ferroelectric Random Access Memory, FRAM), phase change Memory (Phase Change Memory, PCM), graphene Memory, and the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can be in the form of a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), and the like. The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processor referred to in the embodiments provided in the present application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like, but is not limited thereto.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The foregoing examples illustrate only a few embodiments of the application and are described in detail herein without thereby limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of the application should be assessed as that of the appended claims.

Claims (10)

1. A method for building a mixing processing model, wherein the mixing processing model comprises a track separation module and a track mixing module, the method comprising:
generating mixed audio from the human voice track and at least one accompaniment track;
inputting the mixed audio to the track separation module to obtain a target human voice track and at least one target accompaniment track of the mixed audio;
adjusting a first loss function of the mixing processing model according to the difference between the target human voice track and the human voice track and the difference between the at least one target accompaniment track and the at least one accompaniment track; the first loss function is used for driving the target human voice track and the at least one target accompaniment track close to the human voice track and the at least one accompaniment track, respectively;
inputting the target human voice track and the at least one target accompaniment track to the track mixing module to obtain target mixed audio;
adjusting a second loss function of the mixing processing model according to the difference between the label mixed audio corresponding to the mixed audio and the target mixed audio; the second loss function is used for driving the target mixed audio close to the label mixed audio.
2. The method of claim 1, wherein adjusting the first loss function of the mixing processing model according to the difference between the target human voice track and the human voice track and the difference between the at least one target accompaniment track and the at least one accompaniment track comprises:
Performing signal-to-noise ratio processing on the difference between the target human voice track and the human voice track to obtain a human voice track signal-to-noise ratio between the target human voice track and the human voice track;
performing signal-to-noise ratio processing on the difference between the at least one target accompaniment track and the at least one accompaniment track to obtain an accompaniment track signal-to-noise ratio between the at least one target accompaniment track and the at least one accompaniment track;
and adjusting the first loss function of the mixing processing model according to the fusion result of the human voice track signal-to-noise ratio and the accompaniment track signal-to-noise ratio.
3. The method of claim 1, wherein adjusting the second loss function of the mixing processing model according to the difference between the label mixed audio corresponding to the mixed audio and the target mixed audio comprises:
determining an audio difference between the label mixed audio corresponding to the mixed audio and the target mixed audio;
performing signal-to-noise ratio processing on the audio difference and the label mixed audio to obtain a mixing signal-to-noise ratio between the target mixed audio and the label mixed audio;
And adjusting a second loss function of the mixing processing model according to the mixing signal-to-noise ratio.
4. The method of claim 1, wherein generating mixed audio from the human voice track and the at least one accompaniment track comprises:
and carrying out sound track superposition processing on the human sound track and the at least one accompaniment sound track to obtain the mixed audio.
5. The method of claim 1, further comprising, after adjusting the second loss function of the mixing processing model according to the difference between the label mixed audio corresponding to the mixed audio and the target mixed audio:
and updating the model parameters of the audio track separation module according to the first loss function, and updating the model parameters of the audio track mixing module according to the second loss function to obtain a trained mixing processing model.
6. The method of claim 5, wherein after updating the model parameters of the track separation module according to the first loss function and updating the model parameters of the track mixing module according to the second loss function, the method further comprises:
Acquiring original mixed audio;
inputting the original mixed audio into the trained mixed audio processing model to obtain target mixed audio of the original mixed audio; the mixing quality of the target mixed audio is higher than that of the original mixed audio.
7. The method of claim 6, wherein inputting the original mixed audio into the trained mixing processing model to obtain the target mixed audio of the original mixed audio comprises:
inputting the original mixed audio to the track separation module in the trained mixing processing model to obtain a target human voice track, a drum track, a bass track and a piano track of the original mixed audio;
and inputting the target human voice track, the drum track, the bass track and the piano track into the track mixing module in the trained mixing processing model to obtain the target mixed audio of the original mixed audio.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 7 when the computer program is executed.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 7.
10. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 7.
CN202310376950.3A 2023-04-06 2023-04-06 Method for setting up mixed sound processing model, computer equipment and storage medium Pending CN116758883A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310376950.3A CN116758883A (en) 2023-04-06 2023-04-06 Method for setting up mixed sound processing model, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310376950.3A CN116758883A (en) 2023-04-06 2023-04-06 Method for setting up mixed sound processing model, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116758883A 2023-09-15

Family

ID=87952057

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310376950.3A Pending CN116758883A (en) 2023-04-06 2023-04-06 Method for setting up mixed sound processing model, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116758883A (en)

Similar Documents

Publication Publication Date Title
US11461388B2 (en) Generating a playlist
WO2019056628A1 (en) Generation of point of interest copy
US20230237980A1 (en) Hands-on artificial intelligence education service
JP2008090612A (en) Information processor and processing method, program and recording medium
WO2024021882A1 (en) Audio data processing method and apparatus, and computer device and storage medium
Oliveira et al. A musical system for emotional expression
WO2020015411A1 (en) Method and device for training adaptation level evaluation model, and method and device for evaluating adaptation level
CN115691544A (en) Training of virtual image mouth shape driving model and driving method, device and equipment thereof
CN110781835B (en) Data processing method and device, electronic equipment and storage medium
CN114882862A (en) Voice processing method and related equipment
CN111309966A (en) Audio matching method, device, equipment and storage medium
CN111369063B (en) Test paper model training method, test paper combining method and related device
Feng Design and research of music teaching system based on virtual reality system in the context of education informatization
CN111445922A (en) Audio matching method and device, computer equipment and storage medium
CN116758883A (en) Method for setting up mixed sound processing model, computer equipment and storage medium
WO2022264461A1 (en) Information processing system and information processing method
CN116821324A (en) Model training method and device, electronic equipment and storage medium
WO2023050232A1 (en) Asset value evaluation method and apparatus, model training method and apparatus, and readable storage medium
CN108229572A (en) A kind of parameter optimization method and computing device
Lionello et al. Interactive exploration of musical space with parametric t-SNE
WO2023231596A1 (en) Voice conversion model training method and apparatus, and voice conversion method and apparatus
CN116564348A (en) Song scoring information generating method, computer device and storage medium
CN116597797A (en) Song adaptation method, computer device and storage medium
CN116386667A (en) Record segment identification method, computer device and storage medium
CN116757216B (en) Small sample entity identification method and device based on cluster description and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination