CN110706679A - Audio processing method and electronic equipment

Audio processing method and electronic equipment

Info

Publication number
CN110706679A
Authority
CN
China
Prior art keywords
data
audio data
accompaniment
voice
song
Prior art date
Legal status
Granted
Application number
CN201910952208.6A
Other languages
Chinese (zh)
Other versions
CN110706679B (en)
Inventor
秦帅
Current Assignee
Vivo Mobile Communication Co Ltd
Original Assignee
Vivo Mobile Communication Co Ltd
Priority date
Filing date
Publication date
Application filed by Vivo Mobile Communication Co Ltd
Priority to CN201910952208.6A
Publication of CN110706679A
Application granted
Publication of CN110706679B
Status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 Details of electrophonic musical instruments
    • G10H1/0033 Recording/reproducing or transmission of music for electrophonic musical instruments
    • G10H1/36 Accompaniment arrangements
    • G10H1/361 Recording/reproducing of accompaniment for use with an external source, e.g. karaoke systems
    • G10H1/366 Recording/reproducing of accompaniment for use with an external source, e.g. karaoke systems, with means for modifying or correcting the external signal, e.g. pitch correction, reverberation, changing a singer's voice
    • G10H2250/00 Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
    • G10H2250/005 Algorithms for electrophonic musical instruments or musical processing, e.g. for automatic composition or resource allocation
    • G10H2250/295 Noise generation, its use, control or rejection for music processing

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Reverberation, Karaoke And Other Acoustics (AREA)

Abstract

The embodiment of the invention provides a song recording method and electronic equipment. The method comprises the following steps: acquiring first audio data; performing voice optimization processing on first voice data in the first audio data to obtain second voice data; and generating second audio data based on the second voice data. The first voice data in the first audio data of a song recorded by a user is thereby optimized, yielding a recording that has the second audio data and a better playback effect, without the user having to process the audio data of the recorded song with professional software.

Description

Audio processing method and electronic equipment
Technical Field
The embodiment of the invention relates to the technical field of audio, in particular to an audio processing method and electronic equipment.
Background
At present, many users like to record songs they sing themselves and share the recordings with friends and relatives. To give a recorded song a better playback effect, the user needs to process the audio data in the recording with professional audio processing software.
However, lacking music knowledge and sound-processing skills, users find it difficult to process the audio data of recorded songs with professional audio processing software, and therefore find it difficult to obtain recordings with a better playback effect.
Disclosure of Invention
The embodiment of the invention provides an audio processing method, aiming to solve the problem that it is difficult for a user to process the audio data of a recorded song with professional audio processing software and thus difficult to obtain a recording with a better playback effect.
In order to solve the technical problem, the invention is realized as follows:
in a first aspect, an embodiment of the present invention provides an audio processing method, including:
song first audio data recorded by user, voice data voice of song first audio data recorded by user, optimization processing of voice data of second voice data
Acquiring first audio data;
carrying out voice optimization processing on first voice data in the first audio data to obtain second voice data;
and generating second audio data based on the second voice data.
In some embodiments, the vocal optimization process comprises at least one of: adjusting audio frequency, adjusting volume, restoring plosive, adding mixed sound and filtering noise.
In some embodiments, before performing the human voice optimization processing on the first human voice data in the first audio data to obtain the second human voice data, the method further includes:
determining a voice optimization parameter through a preset voice optimization strategy model;
the voice optimization processing of the first voice data in the first audio data to obtain second voice data includes:
based on the voice optimization parameters, performing voice optimization processing on first voice data in the first audio data to obtain second voice data;
wherein the voice optimization parameters comprise at least one of: the types of items in the voice optimization processing, the execution order of the items in the voice optimization processing, and the processing duration of the voice optimization processing;
the voice optimization strategy model is obtained by training in a reinforcement learning mode, and training samples of the voice optimization strategy model comprise: the voice data with the voice data quality score smaller than a first score threshold value and the voice data with the voice data quality score larger than a second score threshold value are obtained, and the second score threshold value is larger than the first score threshold value.
In some embodiments, after the obtaining the first audio data, before performing a human voice optimization process on the first human voice data in the first audio data to obtain second human voice data, the method further includes:
determining whether the first audio data includes accompaniment data;
in a case where the first audio data includes accompaniment data, the first audio data is separated into first human voice data and first accompaniment data.
In some embodiments, said determining whether said first audio data includes accompaniment data comprises:
inputting the first audio data into an accompaniment music discrimination model to obtain a discrimination result output by the accompaniment music discrimination model, wherein the discrimination result indicates whether the first audio data comprises accompaniment data;
wherein, the accompaniment music discriminant model is trained in advance, and the training sample of the accompaniment music discriminant model comprises: audio data of a song including accompaniment music or audio data of a song not including accompaniment music.
In some embodiments, said separating the first audio data into first vocal data and first accompaniment data in the case where the first audio data includes accompaniment data comprises:
inputting the first audio data into a human voice accompaniment separation model to obtain a separation result output by the human voice accompaniment separation model, wherein the separation result comprises: the first vocal data and the first accompaniment data;
wherein, vocal accompaniment separation model is trained in advance, vocal accompaniment separation model's training sample includes: audio data of a song for training, vocal data among the audio data of the song for training, and accompaniment music data among the audio data of the song for training.
In some embodiments, the generating second audio data based on the second vocal data comprises:
fusing the second voice data and the separated first accompaniment data to obtain first fused audio data;
second audio data is generated based on the first fused audio data.
In some embodiments, the first audio data does not include accompaniment data;
generating second audio data based on the second human voice data, comprising:
and fusing the second voice data and preset second accompaniment data to obtain second fused audio data, and generating second audio data based on the second fused audio data.
In some embodiments, the preset second accompaniment data is generated by:
inputting the second voice data into an accompaniment music generation model to obtain accompaniment data output by the accompaniment music generation model;
the accompaniment data output by the accompaniment music generation model is used as the preset second accompaniment data;
wherein the accompaniment music generation model comprises a discrimination submodel and a generation submodel, the discrimination submodel and the generation submodel being jointly trained in advance using training samples of the accompaniment music generation model, and the training samples of the accompaniment music generation model comprise: voice data for training and the accompaniment music data corresponding to the voice data for training.
In some embodiments, the generating second audio data based on the first fused audio data comprises:
performing song optimization processing on the first fusion audio data to generate second audio data;
wherein the song optimization process comprises at least one of: alignment of vocal and accompaniment tracks, vocal volume adjustment and accompaniment volume adjustment, noise filtering and spectral smoothing.
In some embodiments, the generating second audio data based on the second fused audio data comprises:
performing song optimization processing on the second fusion audio data to generate second audio data;
wherein the song optimization process comprises at least one of: alignment of vocal and accompaniment tracks, vocal volume adjustment and accompaniment volume adjustment, noise filtering and spectral smoothing.
In some embodiments, further comprising:
determining song optimization parameters through a preset song optimization strategy model;
wherein the song optimization parameters include at least one of: the type of the items in the song optimization processing, the execution sequence of the items in the song optimization processing, and the processing duration of the song optimization processing;
wherein the song optimization strategy model is trained in a reinforcement learning mode in advance, and training samples of the song optimization strategy model comprise: audio data for songs having a song quality score less than a third score threshold, audio data for songs having a song quality score greater than a fourth score threshold, the fourth score threshold being greater than the third score threshold.
In a second aspect, an embodiment of the present invention further provides an electronic device, including:
an acquisition unit configured to acquire first audio data;
the optimization unit is configured to perform voice optimization processing on first voice data in the first audio data to obtain second voice data;
a generating unit configured to generate second audio data based on the second human voice data.
In some embodiments, the vocal optimization process comprises at least one of: adjusting audio frequency, adjusting volume, restoring plosive, adding mixed sound and filtering noise.
In some embodiments, the electronic device further comprises: a voice optimization policy determination unit configured to: determining a voice optimization parameter through a preset voice optimization strategy model; the optimization unit further comprises: a vocal optimization module configured to: based on the voice optimization parameters, performing voice optimization processing on first voice data in the first audio data to obtain second voice data;
wherein the voice optimization parameters comprise at least one of: the types of items in the voice optimization processing, the execution order of the items in the voice optimization processing, and the execution duration of the items in the voice optimization processing;
the voice optimization strategy model is obtained by training in a reinforcement learning mode, and training samples of the voice optimization strategy model comprise: the voice data with the voice data quality score smaller than a first score threshold value and the voice data with the voice data quality score larger than a second score threshold value are obtained, and the second score threshold value is larger than the first score threshold value.
In some embodiments, the electronic device further comprises: a separation module configured to: after first audio data are obtained, before voice optimization processing is carried out on first voice data in the first audio data to obtain second voice data, whether the first audio data contain accompaniment data is determined; in a case where the first audio data includes accompaniment data, the first audio data is separated into first human voice data and first accompaniment data.
In some embodiments, the separation module further comprises:
a discrimination sub-module configured to:
inputting the first audio data into an accompaniment music discrimination model to obtain a discrimination result output by the accompaniment music discrimination model, wherein the discrimination result indicates whether the first audio data comprises accompaniment data;
wherein, the music of accompanying discrimination model is trained in advance, and music of accompanying discrimination model's training sample includes: audio data of a song including accompaniment music or audio data of a song not including accompaniment music.
In some embodiments, the separation module further comprises: the separation result acquisition sub-module is configured to:
inputting the first audio data into a human voice accompaniment separation model to obtain a separation result output by the human voice accompaniment separation model, wherein the separation result comprises: the first vocal data and the first accompaniment data;
wherein, vocal accompaniment separation model is trained in advance, and vocal accompaniment separation model's training sample includes: audio data of a song for training, vocal data among the audio data of the song for training, and accompaniment music data among the audio data of the song for training.
In some embodiments, the generating unit further comprises:
a first processing module configured to: and performing fusion processing on the second voice data and the separated first accompaniment data to obtain first fusion audio data, and generating second audio data based on the first fusion audio data.
In some embodiments, the generating unit further comprises:
a second processing module configured to: and under the condition that the first audio data does not comprise accompaniment data, carrying out fusion processing on the second voice data and preset second accompaniment data to obtain second fusion audio data, and generating the second audio data based on the second fusion audio data.
In some embodiments, the electronic device further comprises:
a second accompaniment data generation unit configured to: inputting the second voice data into an accompaniment music generation model to obtain accompaniment data output by the accompaniment music generation model, and taking the accompaniment data output by the accompaniment music generation model as the preset second accompaniment data;
wherein the accompaniment music generation model comprises a discrimination submodel and a generation submodel, the discrimination submodel and the generation submodel being jointly trained in advance using training samples of the accompaniment music generation model, and the training samples of the accompaniment music generation model comprise: voice data for training and the accompaniment music data corresponding to the voice data for training.
In some embodiments, the first processing module comprises:
a first song optimization sub-module configured to: performing song optimization processing on the first fused audio data, wherein the song optimization processing comprises at least one of the following steps: alignment of vocal and accompaniment tracks, vocal volume adjustment and accompaniment volume adjustment, noise filtering and spectral smoothing.
In some embodiments, the second processing module comprises:
a second song optimization submodule configured to:
performing song optimization processing on the second fusion audio data to generate second audio data;
wherein the song optimization process comprises at least one of: alignment of vocal and accompaniment tracks, vocal volume adjustment and accompaniment volume adjustment, noise filtering and spectral smoothing.
In some embodiments, the electronic device further comprises:
a song optimization parameter determination unit configured to: determining song optimization parameters through a preset song optimization strategy model;
wherein the song optimization parameters include at least one of: the type of the items in the song optimization processing, the execution sequence of the items in the song optimization processing, and the execution duration of the items in the song optimization processing;
the song optimization strategy model is trained in a reinforcement learning mode in advance, and training samples of the song optimization strategy model comprise: audio data for songs having a song quality score less than a third score threshold, audio data for songs having a song quality score greater than a fourth score threshold, the fourth score threshold being greater than the third score threshold.
In the embodiment of the invention, first audio data is acquired; voice optimization processing is performed on first voice data in the first audio data to obtain second voice data; and second audio data is generated based on the second voice data. The first voice data in the first audio data of a song recorded by a user can thus be optimized to obtain a recording that has the second audio data and a better playback effect, without the user having to process the audio data of the recorded song with professional software.
Drawings
Fig. 1 shows a flowchart of an audio processing method provided by an embodiment of the present invention;
FIG. 2 shows a schematic flow chart of optimizing songs recorded by a user;
FIG. 3 is a block diagram of an electronic device according to an embodiment of the invention;
fig. 4 is a schematic diagram illustrating a hardware structure of a mobile terminal implementing the audio processing method according to the embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, a flowchart of an audio processing method according to an embodiment of the present invention is shown, where the method includes:
step 101, acquiring first audio data.
In the present invention, the audio data of the song recorded by the user may be referred to as first audio data. The song recorded by the user may be a song recorded while the user is singing. The song recorded by the user may also be a song recorded when the user sings with accompanying music.
Step 102, performing voice optimization processing on the first voice data in the first audio data to obtain second voice data.
In the present invention, the voice data in the first audio data may be referred to as first voice data. To perform voice optimization processing on the first voice data in the first audio data of a song recorded by a user, the first voice data in the first audio data may be determined first. Voice optimization processing is then performed on the first voice data, and the voice data obtained after this processing is used as the second voice data. For example, a parameter related to the frequency of the voice in the first voice data may be adjusted, a parameter related to the volume of the voice in the first voice data may be adjusted, and so on.
In some embodiments, the vocal optimization process includes at least one of: adjusting audio frequency, adjusting volume, restoring plosive, adding mixed sound and filtering noise.
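As a rough illustration only, two of these items, volume adjustment and noise filtering, could be sketched as follows; the function names, cutoff frequencies and target level are hypothetical and are not details specified by the invention.

import numpy as np
from scipy.signal import butter, sosfiltfilt

def adjust_volume(vocal, target_rms=0.1):
    # Hypothetical volume adjustment: scale the vocal so its RMS level matches a target value.
    rms = np.sqrt(np.mean(vocal ** 2)) + 1e-12
    return vocal * (target_rms / rms)

def filter_noise(vocal, sample_rate=44100, low_hz=80.0, high_hz=12000.0):
    # Hypothetical noise filtering: band-pass the vocal to suppress rumble and hiss.
    sos = butter(4, [low_hz, high_hz], btype="bandpass", fs=sample_rate, output="sos")
    return sosfiltfilt(sos, vocal)

# Applying the two items in sequence to a mono vocal track (stand-in data):
first_vocal_data = 0.01 * np.random.randn(44100)
second_vocal_data = filter_noise(adjust_volume(first_vocal_data))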
In some embodiments, before performing the voice optimization processing on the first voice data in the first audio data to obtain the second voice data, the method further includes: determining voice optimization parameters through a preset voice optimization strategy model. Performing the voice optimization processing on the first voice data in the first audio data to obtain the second voice data then includes: performing, based on the voice optimization parameters, voice optimization processing on the first voice data in the first audio data to obtain the second voice data. The voice optimization parameters include at least one of: the types of items in the voice optimization processing, the execution order of the items in the voice optimization processing, and the processing duration of the voice optimization processing. The voice optimization strategy model is trained in advance in a reinforcement learning manner, and its training samples include: voice data whose quality score is smaller than a first score threshold and voice data whose quality score is greater than a second score threshold, the second score threshold being greater than the first score threshold.
Determining, through the preset voice optimization strategy model, the types of items in the voice optimization parameters is equivalent to determining which of audio adjustment, volume adjustment, pop-sound restoration, mixing addition and noise filtering are included when the voice optimization processing is performed on the first voice data in the first audio data.
In the invention, the voice data with the voice data quality score smaller than the first score threshold value can be called low-quality voice data, and the voice data with the voice data quality score larger than the second score threshold value can be called high-quality voice data.
In the invention, the voice optimization strategy model can be trained in advance as follows: a voice optimization strategy model with initial model parameters is first created. The model is then trained in a reinforcement learning manner using low-quality voice data, i.e., voice data whose quality score is smaller than the first score threshold, and high-quality voice data, i.e., voice data whose quality score is greater than the second score threshold.
The basic principle of training the voice optimization strategy model by reinforcement learning is as follows. High-quality voice signals, i.e., voice data whose quality score is greater than the second score threshold, are used as reference voice signals. The voice optimization strategy model combines audio adjustment, volume adjustment, plosive restoration, mixing addition, noise filtering and the like into different candidate strategies. Low-quality voice signals, i.e., voice data whose quality score is smaller than the first score threshold, are processed with the different candidate strategies. The closer the audio characteristics of the processed signal are to those of the high-quality reference signal, the better the candidate strategy, and the greater the reward given to it. When a candidate strategy performs poorly, for example when processing a low-quality voice signal with it does not improve the relevant audio features, a penalty is given. After many rounds of training, the voice optimization strategy model can optimize an input voice signal using a strategy suited to that input.
In the training process, the voice data in a window of length T around each time instant t can be taken as the input segment V_l(t-T < τ < t+T). The problem can be expressed as P = (S, A), where S denotes the state space, specifically the voice-data space, and A is the space of selectable strategy actions, which may include, but is not limited to, audio adjustment, volume adjustment, plosive restoration, mixing addition, noise filtering, and so on; an action may also be null. The state transition function may be expressed as p: S × A → S, with transitions s_{i+1} = p(s_i, a_i), s_i ∈ S, a_i ∈ A, each yielding a corresponding reward R: S × A → ℝ; training therefore produces a sequence of states, actions and feedback {s_1, a_1, r_1, s_2, a_2, r_2, …, s_t, a_t, r_t}. Training the optimization strategy model by reinforcement learning amounts to continuously maximizing the total return through strategy changes. The return at the current time instant may be denoted as G(t) = Σ_{k≥0} λ^k r_{t+k}, where r denotes the reward feedback and λ is the discount factor. The value function may be expressed as V(s) = E[G(t) | s(t) = s], and the optimal optimization strategy model can be obtained by continuously improving the value function through policy iteration.
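A minimal tabular sketch of this reinforcement-learning loop is given below, assuming a hypothetical quality scorer (score_quality) and a hypothetical audio transform per action (apply_action); the action names, the state discretization and the hyper-parameters are illustrative assumptions, not the patented training procedure.

import random
from collections import defaultdict

ACTIONS = ["adjust_audio", "adjust_volume", "restore_plosive", "add_mixing", "filter_noise", "noop"]

def train_voice_policy(low_quality_clips, score_quality, apply_action,
                       episodes=1000, steps=5, lam=0.9, alpha=0.1, eps=0.2):
    # Tabular Q-learning over discretized quality-score states (illustrative only).
    # score_quality(clip) -> float in [0, 1], hypothetical scorer against high-quality references.
    # apply_action(clip, action) -> clip, hypothetical audio transform for one optimization item.
    q = defaultdict(float)                   # Q[(state, action)] values
    bucket = lambda score: int(score * 10)   # discretize the quality score into coarse states
    for _ in range(episodes):
        clip = random.choice(low_quality_clips)
        state = bucket(score_quality(clip))
        for _ in range(steps):
            action = (random.choice(ACTIONS) if random.random() < eps
                      else max(ACTIONS, key=lambda a: q[(state, a)]))
            new_clip = apply_action(clip, action)
            reward = score_quality(new_clip) - score_quality(clip)   # improvement rewarded, degradation penalized
            next_state = bucket(score_quality(new_clip))
            best_next = max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + lam * best_next - q[(state, action)])
            clip, state = new_clip, next_state
    return q

The strategy model described in the invention could take any reinforcement-learning form; the tabular version above is only the simplest way to show the reward/penalty loop over candidate optimization items.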
Alternatively, compared with acquiring a low-quality voice signal V_l, obtaining high-quality voice data V_h is relatively easy. Therefore, to obtain low-quality voice signals, various kinds of noise can be added to a high-quality voice signal, various spectral disturbances can be applied to it, and degradation techniques such as adding impulse signals can be used to generate the low-quality voice signal. The low-quality signal may be obtained by the formula V_l = f_preprocess(V_h), where f_preprocess is a combination of one or more degradation techniques.
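A possible sketch of such a degradation function f_preprocess, combining additive noise, a spectral disturbance and impulse clicks, is shown below; the noise levels and the number of impulses are arbitrary illustrative values.

import numpy as np

def f_preprocess(v_h, rng=None):
    # Degrade a high-quality voice signal V_h into a low-quality V_l (illustrative values only).
    rng = rng or np.random.default_rng()
    v_l = v_h + 0.02 * rng.standard_normal(len(v_h))             # additive broadband noise
    spectrum = np.fft.rfft(v_l)
    spectrum *= 1.0 + 0.1 * rng.standard_normal(len(spectrum))   # random spectral disturbance
    v_l = np.fft.irfft(spectrum, n=len(v_l))
    clicks = rng.choice(len(v_l), size=5, replace=False)
    v_l[clicks] += rng.uniform(0.5, 1.0, size=5)                 # added impulse ("impact") signals
    return v_l

# Each high-quality clip V_h yields one (V_l, V_h) training pair: V_l = f_preprocess(V_h).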
In some embodiments, after the first audio data is acquired and before the voice optimization processing is performed on the first voice data in the first audio data to obtain the second voice data, the method further includes: determining whether the first audio data includes accompaniment data; and, in a case where the first audio data includes accompaniment data, separating the first audio data into first voice data and first accompaniment data.
In some embodiments, determining whether the first audio data includes accompaniment data comprises: inputting the first audio data into an accompaniment music discrimination model to obtain a discrimination result output by the accompaniment music discrimination model, wherein the discrimination result indicates whether the first audio data comprises accompaniment data; wherein, the music of accompanying discrimination model is trained in advance, and music of accompanying discrimination model's training sample includes: audio data of a song including accompaniment music or audio data of a song not including accompaniment music.
In the present invention, the accompaniment music discrimination model may be trained in advance in the following manner. A training sample set of the accompaniment music discrimination model is acquired; each training sample is either audio data of a song that includes accompaniment music or audio data of a song that does not include accompaniment music. Each training sample is labeled, the labeling result being one of: includes accompaniment data, or does not include accompaniment data. The accompaniment music discrimination model is then trained in a supervised learning manner. During training, for each training sample, the prediction result of the accompaniment music discrimination model is obtained, the prediction result being one of: the audio data includes accompaniment data, or the audio data does not include accompaniment data. The degree of difference between the prediction result and the labeling result of the training sample is computed through a loss function, and the model parameters of the accompaniment music discrimination model are updated according to this degree of difference.
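A minimal supervised-training sketch in this spirit is shown below (using PyTorch); the network architecture, the feature representation and the hyper-parameters are assumptions for illustration, not details specified by the invention.

import torch
import torch.nn as nn

class AccompanimentDiscriminator(nn.Module):
    # Hypothetical classifier: input is a fixed-size feature vector (e.g. a flattened
    # spectrogram of the audio data); output is a logit for "includes accompaniment".
    def __init__(self, feature_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x):
        return self.net(x)

def train_discriminator(model, loader, epochs=10):
    # loader yields (features, label) pairs; label 1 = includes accompaniment, 0 = vocal only.
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()       # degree of difference between prediction and label
    for _ in range(epochs):
        for features, label in loader:
            opt.zero_grad()
            loss = loss_fn(model(features).squeeze(1), label.float())
            loss.backward()
            opt.step()                     # update model parameters according to that difference
    return model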
In some embodiments, in a case where the first audio data includes accompaniment data, separating the first audio data into first vocal data and first accompaniment data includes: inputting the first audio data into a human voice accompaniment separation model to obtain a separation result output by the human voice accompaniment separation model, wherein the separation result comprises: first vocal data and first accompaniment data; the vocal accompaniment separation model is trained in advance, and the training sample of the vocal accompaniment separation model comprises: audio data for a training song, vocal data in the audio data for a training song, and accompaniment music data in the audio data for a training song.
In the present invention, accompaniment data may also be referred to as accompaniment music data. The accompaniment data in the first audio data may be referred to as first accompaniment data.
In the present invention, in a case where the first audio data includes accompaniment data, the first audio data may be separated into first vocal data and first accompaniment data using a vocal accompaniment separation model.
In the invention, the vocal accompaniment separation model can be trained in advance as follows. A vocal accompaniment separation model with initial model parameters is created, and a training sample set of the vocal accompaniment separation model is acquired. Each training sample includes: audio data of a song for training, the vocal data in that audio data, and the accompaniment music data in that audio data. The vocal accompaniment separation model can then be trained in a deep learning manner using this training sample set.
In each training iteration, the audio data of a song for training is input into the vocal accompaniment separation model to obtain the vocal data and the accompaniment music data predicted by the model. A loss function then measures the difference between the predicted vocal data and the vocal data of the training song, and between the predicted accompaniment music data and the accompaniment music data of the training song, and the model parameters of the vocal accompaniment separation model are updated according to the computed difference.
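A deep-learning training sketch consistent with this description might look as follows (PyTorch); the frame-wise linear architecture and the use of a mean-squared-error loss are illustrative assumptions.

import torch
import torch.nn as nn

class VocalAccompanimentSeparator(nn.Module):
    # Hypothetical separator: maps a mixture spectrogram frame to a vocal estimate
    # and an accompaniment estimate of the same shape.
    def __init__(self, n_bins=513):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_bins, 512), nn.ReLU(),
                                     nn.Linear(512, 512), nn.ReLU())
        self.vocal_head = nn.Linear(512, n_bins)
        self.accomp_head = nn.Linear(512, n_bins)

    def forward(self, mixture):
        h = self.encoder(mixture)
        return self.vocal_head(h), self.accomp_head(h)

def train_separator(model, loader, epochs=10):
    # loader yields (mixture, vocal, accompaniment) spectrogram frames from the training songs.
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    mse = nn.MSELoss()
    for _ in range(epochs):
        for mixture, vocal, accomp in loader:
            pred_vocal, pred_accomp = model(mixture)
            # degree of difference between the predictions and the reference vocal/accompaniment
            loss = mse(pred_vocal, vocal) + mse(pred_accomp, accomp)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model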
Step 103, generating second audio data based on the second voice data.
In the present invention, the second audio data may include second vocal data obtained after performing vocal optimization processing on the first vocal data in the first audio data of the song recorded by the user, and original unprocessed data in the song recorded by the user.
In the present invention, after the second audio data is generated based on the second voice data, the second audio data may be used as the audio data of the recorded song with the better playing effect, and the recorded song with the better playing effect of the second audio data may be provided to the user.
In some embodiments, generating the second audio data based on the second vocal data comprises: fusing the second voice data and the separated first accompaniment data to obtain first fused audio data; second audio data is generated based on the first fused audio data.
In the present invention, when the second audio data is generated based on the second human voice data, the first fusion audio data may be directly used as the second audio data. In other words, the first merged audio data may be directly used as audio data for a recorded song provided to the user with a superior playing effect.
In the invention, the second voice data and the separated first accompaniment data can be fused in the following way to obtain the first fused audio data: the second voice data and the separated first accompaniment data are transformed into the frequency domain by FFT (fast Fourier transform), yielding the frequency-domain signal F_1 of the second voice data and the frequency-domain signal F_2 of the first accompaniment data; F_1 and F_2 are superposed to obtain the fused song signal F_all; an inverse transform is applied to the fused song signal F_all to obtain the corresponding time-domain signal T_all, and the obtained time-domain signal is used as the first fused audio data.
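A numpy sketch of this frequency-domain fusion, assuming two mono tracks at the same sample rate, could be:

import numpy as np

def fuse_vocal_and_accompaniment(vocal, accompaniment):
    # Fuse the second voice data and the accompaniment data in the frequency domain (sketch only).
    n = max(len(vocal), len(accompaniment))
    f1 = np.fft.rfft(vocal, n=n)           # frequency-domain signal F1 of the voice data
    f2 = np.fft.rfft(accompaniment, n=n)   # frequency-domain signal F2 of the accompaniment data
    f_all = f1 + f2                        # superpose F1 and F2 into the fused song signal F_all
    t_all = np.fft.irfft(f_all, n=n)       # inverse transform back to the time-domain signal T_all
    return t_all                           # T_all serves as the fused audio data

Because the Fourier transform is linear, this superposition is equivalent to adding the two time-domain signals directly; the frequency-domain form simply mirrors the description above.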
In some embodiments, the first audio data does not include accompaniment data; generating second audio data based on the second human voice data, including: and performing fusion processing on the second voice data and preset second accompaniment data to obtain second fusion audio data, and generating second audio data based on the second fusion audio data.
In the invention, under the condition that the first audio data does not include accompaniment data, the second voice data and the preset second accompaniment data can be fused to obtain second fused audio data. The preset second accompaniment data may be audio data of preset accompaniment music. And performing fusion processing on the second voice data and the preset second accompaniment data to obtain a second fusion audio data mode by referring to the mode for generating the first fusion audio data.
In the present invention, when the second audio data is generated based on the second fusion audio data, the second fusion audio data may be directly taken as the second audio data. In other words, the second merged audio data may be directly used as the audio data for the recorded song provided to the user with the superior playing effect.
In some embodiments, the preset second accompaniment data is generated by: inputting the second voice data into the accompaniment music generation model to obtain accompaniment data output by the accompaniment music generation model; the accompaniment data output by the accompaniment music generation model is used as preset second accompaniment data;
wherein the accompaniment music generation model comprises a discrimination submodel and a generation submodel, the discrimination submodel and the generation submodel being jointly trained in advance using training samples of the accompaniment music generation model, and the training samples of the accompaniment music generation model comprise: voice data for training and the accompaniment music data corresponding to the voice data for training.
In the present invention, when the first audio data does not include accompaniment data, preset second accompaniment data adapted to the second vocal data may be generated using the accompaniment music generation model, and the preset second accompaniment data generated using the accompaniment music generation model is closer to the real accompaniment music data.
Thereby, the recorded song provided to the user with the superior playing effect is made to include the accompaniment music having the second accompaniment data, and the accompaniment music is close to the real accompaniment music.
In the present invention, the discrimination submodel and the generation submodel in the accompaniment music generation model may be trained jointly in advance in the following manner: and acquiring a training sample set of the accompaniment music generation model. The training samples of the accompaniment music generation model include: the voice data and the accompaniment music data corresponding to the voice data. And the accompaniment music data corresponding to the voice data in the training sample is the accompaniment music data matched with the voice data in the training sample.
When the discrimination submodel and the generation submodel in the accompaniment music generation model are trained jointly, the discrimination submodel may be trained first, and when the discrimination submodel is trained, the generation submodel does not participate in the training. For example, in training the discrimination submodel, some transformations are performed on part of the accompaniment music data corresponding to all the vocal data used for training, so that the transformed accompaniment music data has some audio features of the generated signal. The discrimination submodel is trained by the untransformed accompaniment music data and the transformed accompaniment music data, and the trained discrimination submodel can be used to determine whether the accompaniment music data is generated.
When the discrimination submodel has been trained to convergence, it is fixed, and training of the generation submodel begins and continues until the generation submodel converges. While training the generation submodel, the discrimination submodel outputs a result indicating whether the accompaniment music data produced by the generation submodel is generated accompaniment music data; the relevant parameters of the generation submodel can be adjusted based on this output, so that over many training rounds the accompaniment music data output by the generation submodel becomes closer to real accompaniment music data. After training, the generation submodel can generate accompaniment music data adapted to the input voice data, and the generated accompaniment music data is close to real accompaniment music data.
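The two-stage joint training described above could be sketched roughly as follows (PyTorch); the transform used to imitate "generated" audio features, the model interfaces and the epoch counts are assumptions for illustration.

import torch
import torch.nn as nn

def transform(accomp):
    # Hypothetical transformation giving real accompaniment some audio features of a generated signal.
    return accomp + 0.05 * torch.randn_like(accomp)

def train_accompaniment_generation_model(generator, discriminator, loader,
                                          disc_epochs=5, gen_epochs=5):
    # generator(voice) -> accompaniment features; discriminator(accomp) -> logit for "real, not generated".
    bce = nn.BCEWithLogitsLoss()

    # Stage 1: train the discrimination submodel on untransformed vs. transformed accompaniment;
    # the generation submodel does not participate in this stage.
    d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
    for _ in range(disc_epochs):
        for _, real_accomp in loader:       # loader yields (voice_features, accompaniment_features)
            samples = torch.cat([real_accomp, transform(real_accomp)], dim=0)
            labels = torch.cat([torch.ones(len(real_accomp)), torch.zeros(len(real_accomp))])
            d_opt.zero_grad()
            bce(discriminator(samples).squeeze(1), labels).backward()
            d_opt.step()

    # Stage 2: fix the discrimination submodel, then train the generation submodel so that
    # its output is judged to be real rather than generated accompaniment.
    for p in discriminator.parameters():
        p.requires_grad_(False)
    g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
    for _ in range(gen_epochs):
        for voice, _ in loader:
            logits = discriminator(generator(voice))
            g_opt.zero_grad()
            bce(logits.squeeze(1), torch.ones(len(voice))).backward()
            g_opt.step()
    return generator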
In some embodiments, generating the second audio data based on the first fused audio data comprises: performing song optimization processing on the first fusion audio data to generate second audio data; wherein the song optimization process comprises at least one of: alignment of vocal and accompaniment tracks, vocal volume adjustment and accompaniment volume adjustment, noise filtering and spectral smoothing.
In the invention, when the second audio data is generated based on the first fusion audio data, one or more of the processes of aligning the voice and the accompaniment tracks, adjusting the voice volume and the accompaniment volume, filtering noise and smoothing spectrum can be carried out on the first fusion audio data, and the audio data obtained after the processes is used as the second audio data.
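Two of the listed items, track alignment and vocal/accompaniment volume adjustment, are easiest to illustrate on the two tracks just before they are mixed down; the cross-correlation window and the level ratio below are arbitrary illustrative values, not parameters given by the invention.

import numpy as np

def align_tracks(vocal, accomp, window=88200):
    # Estimate the lag between the vocal and accompaniment tracks by cross-correlation
    # over an initial window, then shift the vocal so the two tracks line up.
    a, v = accomp[:window], vocal[:window]
    corr = np.correlate(a, v, mode="full")
    lag = int(np.argmax(corr)) - (len(v) - 1)
    if lag > 0:
        vocal = np.concatenate([np.zeros(lag), vocal])   # vocal enters late: pad its start
    elif lag < 0:
        vocal = vocal[-lag:]                             # vocal enters early: trim its start
    n = min(len(vocal), len(accomp))
    return vocal[:n], accomp[:n]

def balance_volumes(vocal, accomp, vocal_to_accomp_ratio=1.5):
    # Scale the accompaniment so the vocal sits slightly above it (illustrative ratio).
    rms = lambda x: np.sqrt(np.mean(x ** 2)) + 1e-12
    return vocal, accomp * rms(vocal) / (rms(accomp) * vocal_to_accomp_ratio)

vocal = 0.1 * np.random.randn(44100)    # stand-ins for the optimized voice data
accomp = 0.3 * np.random.randn(44100)   # and the accompaniment data
vocal, accomp = balance_volumes(*align_tracks(vocal, accomp))
fused = vocal + accomp                  # fused audio after these optimization items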
In some embodiments, generating the second audio data based on the second fused audio data comprises: and performing song optimization processing on the second fusion audio data, wherein the song optimization processing comprises at least one of the following steps: alignment of vocal and accompaniment tracks, vocal volume adjustment and accompaniment volume adjustment, noise filtering and spectral smoothing.
In the invention, when the second audio data is generated based on the second fusion audio data, one or more of the processing of aligning the vocal track and the accompaniment track, adjusting the vocal volume and the accompaniment volume, filtering noise and smoothing spectrum can be carried out on the second fusion audio data, and the audio data obtained after the processing is taken as the second audio data.
In some embodiments, further comprising: determining song optimization parameters through a preset song optimization strategy model; wherein the song optimization parameters include at least one of: the type of the items in the song optimization processing, the execution sequence of the items in the song optimization processing and the processing duration of the song optimization processing; the song optimizing strategy model is trained in a reinforcement learning mode in advance, and training samples of the song optimizing strategy model comprise: audio data of songs having a song quality score less than a third score threshold, audio data of songs having a song quality score greater than a fourth score threshold, the fourth score threshold being greater than the third score threshold.
In the invention, determining, through the preset song optimization strategy model, the types of items in the song optimization parameters is equivalent to determining which of vocal and accompaniment track alignment, vocal volume and accompaniment volume adjustment, noise filtering and spectral smoothing are included when the song optimization processing is performed on the first fused audio data or the second fused audio data.
In the present invention, songs having a song quality score less than the third score threshold may be referred to as low quality songs, and songs having a song quality score greater than the fourth score threshold may be referred to as high quality songs.
In the invention, the song optimization strategy model can be trained in advance by adopting the following modes: and pre-creating a song optimization strategy model, wherein the created song optimization strategy model has initial model parameters. The song optimization strategy model may then be trained in a reinforcement learning manner using the audio data of the low-quality songs and the high-quality songs.
The basic principle of training the song optimization strategy model by reinforcement learning is as follows. The audio data of high-quality songs is used as reference audio data. The song optimization strategy model combines vocal and accompaniment track alignment, vocal volume and accompaniment volume adjustment, noise filtering, spectral smoothing and the like into different candidate strategies. The audio data of low-quality songs is processed with the different candidate strategies; the closer the audio characteristics of the processed audio data are to those of the high-quality reference audio data, the better the candidate strategy, and the greater the reward given to it. When a candidate strategy performs poorly, for example when processing the audio data of a low-quality song with it does not improve the relevant audio features, the candidate strategy is penalized. After many rounds of training, the song optimization strategy model can optimize the audio data of an input song using a strategy suited to that input.
Referring to fig. 2, a schematic flow chart of optimizing songs recorded by a user is shown.
Step 201, acquiring audio data of a song recorded by a user.
The song recorded by the user may be a song recorded while the user is singing. The song recorded by the user may also be a song recorded when the user sings with accompanying music. The audio data of the user-recorded song may also be referred to as first audio data.
At step 202, it is determined whether the song recorded by the user includes accompaniment music.
Whether the songs recorded by the user comprise accompaniment music is determined through the accompaniment music discrimination model. In other words, it is determined whether the first audio data includes accompaniment data.
Step 203 is performed if the user records a song that includes accompaniment music, and step 204 is performed if the user records a song that does not include accompaniment music.
In step 203, the first vocal data and the first accompaniment data are separated.
The accompaniment data in the first audio data may be referred to as first accompaniment data. The vocal data and the first accompaniment data are extracted from the first audio data, i.e., the audio data of the song recorded by the user, thereby separating the vocal data in the first audio data from the accompaniment data in the first audio data.
In other words, the first audio data is separated into the first personal sound data and the first accompaniment data. The first audio data can be input into the vocal accompaniment separation model, and the vocal data in the first audio data and the accompaniment data in the first audio data output by the vocal accompaniment separation model are obtained.
In step 204, second accompaniment data is generated.
The voice data obtained after performing voice optimization processing on voice data, i.e., first voice data, in the audio data of the song recorded by the user may be referred to as second voice data. In the case where the user recorded song does not include accompaniment music, accompaniment data adapted to the second vocal data may be generated. The accompaniment data adapted to the second vocal data may be referred to as second accompaniment data. The accompaniment data adapted to the second vocal data may be obtained by the accompaniment music generation model. The second vocal data may be input to the accompaniment music generation model to obtain accompaniment data output by the accompaniment music generation model, and the accompaniment data output by the accompaniment music generation model may be used as the second accompaniment data.
Step 205, intelligently optimizing the first voice data.
This intelligent optimization performs voice optimization processing on the first voice data to obtain the second voice data. At least one of audio adjustment, volume adjustment, pop-sound restoration, mixing addition and noise filtering can be performed on the first voice data, and the resulting voice data is used as the second voice data.
Step 206, fusing the second voice data and the accompaniment data.
And fusing the first accompaniment data separated in the step 203 or the second accompaniment data obtained in the step 204 with the second vocal data obtained in the step 205 to obtain first fused audio data or second fused audio data.
Step 207, intelligent optimization of the fused audio data.
At least one song optimization processing of the alignment of the voice and the accompaniment tracks, the adjustment of the voice and the accompaniment volume, the noise filtering and the spectrum smoothing can be carried out on the first fusion audio data or the second fusion audio data to obtain second audio data.
Step 208, providing the user with the song having the second audio data.
The second audio data is stored on the electronic device of the user, thereby providing the user with a song having the second audio data.
Referring to fig. 3, a block diagram of an electronic device according to an embodiment of the invention is shown. The electronic device includes: an acquisition unit 301, an optimization unit 302, and a generation unit 303.
The acquisition unit 301 is configured to acquire first audio data;
the optimization unit 302 is configured to perform voice optimization processing on first voice data in the first audio data to obtain second voice data;
the generating unit 303 is configured to generate second audio data based on the second human voice data.
In some embodiments, the vocal optimization process comprises at least one of: adjusting audio frequency, adjusting volume, restoring plosive, adding mixed sound and filtering noise.
In some embodiments, the electronic device further comprises: a voice optimization policy determination unit configured to: determining a voice optimization parameter through a preset voice optimization strategy model; the optimization unit further comprises: a vocal optimization module configured to: based on the voice optimization parameters, performing voice optimization processing on first voice data in the first audio data to obtain second voice data;
wherein the voice optimization parameters comprise at least one of: the types of items in the voice optimization processing, the execution order of the items in the voice optimization processing, and the execution duration of the items in the voice optimization processing;
the voice optimization strategy model is obtained by training in a reinforcement learning mode, and training samples of the voice optimization strategy model comprise: the voice data with the voice data quality score smaller than a first score threshold value and the voice data with the voice data quality score larger than a second score threshold value are obtained, and the second score threshold value is larger than the first score threshold value.
In some embodiments, the electronic device further comprises: a separation module configured to: after first audio data are obtained, before voice optimization processing is carried out on first voice data in the first audio data to obtain second voice data, whether the first audio data contain accompaniment data is determined; in a case where the first audio data includes accompaniment data, the first audio data is separated into first human voice data and first accompaniment data.
In some embodiments, the separation module further comprises:
a discrimination sub-module configured to:
inputting the first audio data into an accompaniment music discrimination model to obtain a discrimination result output by the accompaniment music discrimination model, wherein the discrimination result indicates whether the first audio data comprises accompaniment data;
wherein, the music of accompanying discrimination model is trained in advance, and music of accompanying discrimination model's training sample includes: audio data of a song including accompaniment music or audio data of a song not including accompaniment music.
In some embodiments, the separation module further comprises: a separation result acquisition sub-module configured to:
inputting the first audio data into a human voice accompaniment separation model to obtain a separation result output by the human voice accompaniment separation model, wherein the separation result comprises: the first vocal data and the first accompaniment data;
wherein, vocal accompaniment separation model is trained in advance, and vocal accompaniment separation model's training sample includes: audio data of a song for training, vocal data among the audio data of the song for training, and accompaniment music data among the audio data of the song for training.
In some embodiments, the generating unit 303 further comprises:
a first processing module configured to: and performing fusion processing on the second voice data and the separated first accompaniment data to obtain first fusion audio data, and generating second audio data based on the first fusion audio data.
In some embodiments, the generating unit 303 further comprises:
a second processing module configured to: and under the condition that the first audio data does not comprise accompaniment data, carrying out fusion processing on the second voice data and preset second accompaniment data to obtain second fusion audio data, and generating the second audio data based on the second fusion audio data.
In some embodiments, the electronic device further comprises:
a second accompaniment data generation unit configured to: inputting the second voice data into an accompaniment music generation model to obtain accompaniment data output by the accompaniment music generation model, and taking the accompaniment data output by the accompaniment music generation model as the preset second accompaniment data;
wherein the accompaniment music generation model comprises a discrimination submodel and a generation submodel, the discrimination submodel and the generation submodel being jointly trained in advance using training samples of the accompaniment music generation model, and the training samples of the accompaniment music generation model comprise: voice data for training and the accompaniment music data corresponding to the voice data for training.
In some embodiments, the first processing module comprises:
a first song optimization sub-module configured to: performing song optimization processing on the first fused audio data, wherein the song optimization processing comprises at least one of the following steps: alignment of vocal and accompaniment tracks, vocal volume adjustment and accompaniment volume adjustment, noise filtering and spectral smoothing.
In some embodiments, the second processing module comprises:
a second song optimization submodule configured to:
performing song optimization processing on the second fusion audio data to generate second audio data;
wherein the song optimization process comprises at least one of: alignment of vocal and accompaniment tracks, vocal volume adjustment and accompaniment volume adjustment, noise filtering and spectral smoothing.
In some embodiments, the electronic device further comprises:
a song optimization parameter determination unit configured to: determining song optimization parameters through a preset song optimization strategy model;
wherein the song optimization parameters include at least one of: the type of the items in the song optimization processing, the execution sequence of the items in the song optimization processing, and the execution duration of the items in the song optimization processing;
the song optimization strategy model is trained in advance in a reinforcement learning mode, and training samples of the song optimization strategy model comprise: audio data of songs having a song quality score less than a third score threshold and audio data of songs having a song quality score greater than a fourth score threshold, the fourth score threshold being greater than the third score threshold.
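For illustration, the sketch below shows how a policy could sample an ordered sequence of song optimization items and receive the change in a song quality score as its reward. The action list, the placeholder scorer, and the placeholder item implementations are all assumptions made only to keep the example runnable.

```python
import numpy as np

ITEMS = ["align_tracks", "adjust_vocal_volume", "adjust_accomp_volume",
         "filter_noise", "spectral_smooth"]

def apply_item(audio, item):
    # Placeholder: stands in for the real optimization items listed above.
    return audio * 0.99 if item == "filter_noise" else audio

def quality_score(audio):
    # Placeholder: stands in for a learned song quality scorer.
    return -float(np.mean(audio ** 2))

def run_episode(policy_logits, audio, rng, steps=3):
    """Sample an ordered sequence of optimization items; reward = quality improvement."""
    probs = np.exp(policy_logits) / np.sum(np.exp(policy_logits))
    start, chosen = quality_score(audio), []
    for _ in range(steps):
        item = rng.choice(ITEMS, p=probs)
        chosen.append(item)
        audio = apply_item(audio, item)
    return chosen, quality_score(audio) - start

rng = np.random.default_rng(0)
sequence, reward = run_episode(np.zeros(len(ITEMS)), rng.standard_normal(16000), rng)
# A policy-gradient update would then raise the probability of sequences with positive reward.
```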
The electronic device provided by the embodiment of the present invention can implement each process of the method embodiments shown in fig. 1 to fig. 2, and is not described herein again to avoid repetition.
In the embodiment of the present invention, the first voice data in the first audio data of a song recorded by the user is optimized, so that the second audio data provides a recorded song with a better playback effect, and the user does not need to use professional software to process the audio data of the recorded song.
Fig. 4 is a schematic diagram of a hardware structure of a mobile terminal implementing the audio processing method according to the embodiment of the present invention.
The mobile terminal 400 includes, but is not limited to: radio frequency unit 401, network module 402, audio output unit 403, input unit 404, sensor 405, display unit 406, user input unit 407, interface unit 408, memory 409, processor 410, and power supply 411. Those skilled in the art will appreciate that the mobile terminal architecture shown in fig. 4 is not intended to be limiting of mobile terminals, and that a mobile terminal may include more or fewer components than shown, or some components may be combined, or a different arrangement of components. In the embodiment of the present invention, the mobile terminal includes, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a palm computer, a vehicle-mounted terminal, a wearable device, a pedometer, and the like.
The processor 410 is configured to: obtain first audio data; perform voice optimization processing on first voice data in the first audio data to obtain second voice data; and generate second audio data based on the second voice data.
In the embodiment of the present invention, the first voice data in the first audio data of a song recorded by the user is optimized, so that the second audio data provides a recorded song with a better playback effect, and the user does not need to use professional software to process the audio data of the recorded song.
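Read end to end, the flow the processor 410 carries out can be sketched as below; the placeholder helpers only make the sequence of steps concrete and do not come from the patent.

```python
import numpy as np

def separate(first_audio):
    # Placeholder separation: treat the whole signal as vocal and the accompaniment as silence.
    return first_audio, np.zeros_like(first_audio)

def optimize_vocals(vocal):
    # Placeholder vocal optimization (volume, noise, and similar processing would go here).
    return vocal - np.mean(vocal)

def process_recording(first_audio, includes_accompaniment):
    if includes_accompaniment:
        first_vocal, first_accomp = separate(first_audio)        # split the first audio data
    else:
        first_vocal, first_accomp = first_audio, np.zeros_like(first_audio)
    second_vocal = optimize_vocals(first_vocal)                   # second voice data
    return second_vocal + first_accomp                            # fused into second audio data

second_audio = process_recording(np.random.randn(16000) * 0.1, includes_accompaniment=True)
```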
It should be understood that, in the embodiment of the present invention, the radio frequency unit 401 may be used to receive and send signals during a message transmission/reception process or a call process; specifically, it receives downlink data from a base station and forwards the data to the processor 410 for processing, and it sends uplink data to the base station. Typically, the radio frequency unit 401 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, and the like. In addition, the radio frequency unit 401 can also communicate with a network and other devices through a wireless communication system.
The mobile terminal provides the user with wireless broadband internet access through the network module 402, such as helping the user send and receive e-mails, browse web pages, and access streaming media.
The audio output unit 403 may convert audio data received by the radio frequency unit 401 or the network module 402 or stored in the memory 409 into an audio signal and output as sound. Also, the audio output unit 403 may also provide audio output related to a specific function performed by the mobile terminal 400 (e.g., a call signal reception sound, a message reception sound, etc.). The audio output unit 403 includes a speaker, a buzzer, a receiver, and the like.
The input unit 404 is used to receive audio or video signals. The input unit 404 may include a graphics processing unit (GPU) 4041 and a microphone 4042. The graphics processor 4041 processes image data of still pictures or video obtained by an image capturing apparatus (such as a camera) in a video capturing mode or an image capturing mode, and the processed image frames may be displayed on the display unit 406. The image frames processed by the graphics processor 4041 may be stored in the memory 409 (or another storage medium) or transmitted via the radio frequency unit 401 or the network module 402. The microphone 4042 can receive sound and process it into audio data. In the phone call mode, the processed audio data may be converted into a format that can be transmitted to a mobile communication base station via the radio frequency unit 401 and then output.
The mobile terminal 400 also includes at least one sensor 405, such as a light sensor, motion sensor, and other sensors. Specifically, the light sensor includes an ambient light sensor that can adjust the brightness of the display panel 4061 according to the brightness of ambient light, and a proximity sensor that can turn off the display panel 4061 and/or the backlight when the mobile terminal 400 is moved to the ear. As one of the motion sensors, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally three axes), detect the magnitude and direction of gravity when stationary, and can be used to identify the posture of the mobile terminal (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), and vibration identification related functions (such as pedometer, tapping); the sensors 405 may also include a fingerprint sensor, a pressure sensor, an iris sensor, a molecular sensor, a gyroscope, a barometer, a hygrometer, a thermometer, an infrared sensor, etc., which will not be described in detail herein.
The display unit 406 is used to display information input by the user or information provided to the user. The display unit 406 may include a display panel 4061, and the display panel 4061 may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like.
The user input unit 407 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the mobile terminal. Specifically, the user input unit 407 includes a touch panel 4071 and other input devices 4072. The touch panel 4071, also referred to as a touch screen, can collect touch operations performed by a user on or near it (for example, operations performed on or near the touch panel 4071 with a finger, a stylus, or any other suitable object or attachment). The touch panel 4071 may include two parts: a touch detection device and a touch controller. The touch detection device detects the position touched by the user, detects the signal produced by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, sends the coordinates to the processor 410, and receives and executes commands from the processor 410. In addition, the touch panel 4071 can be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave. Besides the touch panel 4071, the user input unit 407 may also include other input devices 4072, which may include, but are not limited to, a physical keyboard, function keys (such as volume control keys and switch keys), a trackball, a mouse, and a joystick, and are not described herein again.
Further, the touch panel 4071 can be overlaid on the display panel 4061, and when the touch panel 4071 detects a touch operation thereon or nearby, the touch operation is transmitted to the processor 410 to determine the type of the touch event, and then the processor 410 provides a corresponding visual output on the display panel 4061 according to the type of the touch event. Although the touch panel 4071 and the display panel 4061 are shown as two separate components to implement the input and output functions of the mobile terminal, in some embodiments, the touch panel 4071 and the display panel 4061 may be integrated to implement the input and output functions of the mobile terminal, which is not limited herein.
The interface unit 408 is an interface through which an external device is connected to the mobile terminal 400. For example, the external device may include a wired or wireless headset port, an external power supply (or battery charger) port, a wired or wireless data port, a memory card port, a port for connecting a device having an identification module, an audio input/output (I/O) port, a video I/O port, an earphone port, and the like. The interface unit 408 may be used to receive input (e.g., data information, power, etc.) from an external device and transmit the received input to at least one element within the mobile terminal 400 or may be used to transmit data between the mobile terminal 400 and an external device.
The memory 409 may be used to store software programs as well as various data. The memory 409 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. Further, the memory 409 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
The processor 410 is a control center of the mobile terminal, connects various parts of the entire mobile terminal using various interfaces and lines, and performs various functions of the mobile terminal and processes data by operating or executing software programs and/or modules stored in the memory 409 and calling data stored in the memory 409, thereby integrally monitoring the mobile terminal. Processor 410 may include at least one processing unit; preferably, the processor 410 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 410.
The mobile terminal 400 may further include a power supply 411 (e.g., a battery) for supplying power to various components, and preferably, the power supply 411 may be logically connected to the processor 410 through a power management system, so as to implement functions of managing charging, discharging, and power consumption through the power management system.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (13)

1. A method of audio processing, the method comprising:
acquiring first audio data;
carrying out voice optimization processing on first voice data in the first audio data to obtain second voice data;
and generating second audio data based on the second voice data.
2. The method of claim 1, wherein the vocal optimization processing comprises at least one of: adjusting audio frequency, adjusting volume, repairing plosives, adding reverberation, and filtering noise.
3. The method according to claim 2, wherein, before the human voice optimization processing is performed on the first human voice data in the first audio data to obtain the second human voice data, the method further comprises:
determining a voice optimization parameter through a preset voice optimization strategy model;
the voice optimization processing of the first voice data in the first audio data to obtain second voice data includes:
based on the voice optimization parameters, performing voice optimization processing on first voice data in the first audio data to obtain second voice data;
wherein the vocal optimization parameters comprise at least one of: the type of the items in the human voice optimization processing, the execution sequence of the items in the human voice optimization processing, and the processing duration of the human voice optimization processing;
the voice optimization strategy model is obtained by training in a reinforcement learning mode, and training samples of the voice optimization strategy model comprise: voice data with a voice data quality score smaller than a first score threshold and voice data with a voice data quality score larger than a second score threshold, the second score threshold being larger than the first score threshold.
4. The method according to claim 1, wherein, after the obtaining of the first audio data and before the human voice optimization processing is performed on the first human voice data in the first audio data to obtain the second human voice data, the method further comprises:
determining whether the first audio data includes accompaniment data;
in a case where the first audio data includes accompaniment data, the first audio data is separated into first human voice data and first accompaniment data.
5. The method of claim 4, wherein determining whether the first audio data includes accompaniment data comprises:
inputting the first audio data into an accompaniment music discrimination model to obtain a discrimination result output by the accompaniment music discrimination model, wherein the discrimination result indicates whether the first audio data comprises accompaniment data;
wherein the accompaniment music discrimination model is trained in advance, and training samples of the accompaniment music discrimination model comprise: audio data of a song including accompaniment music or audio data of a song not including accompaniment music.
6. The method of claim 4, wherein the separating the first audio data into first vocal data and first accompaniment data if the first audio data comprises accompaniment data comprises:
inputting the first audio data into a human voice accompaniment separation model to obtain a separation result output by the human voice accompaniment separation model, wherein the separation result comprises: the first vocal data and the first accompaniment data;
wherein the vocal accompaniment separation model is trained in advance, and training samples of the vocal accompaniment separation model comprise: audio data of a song used for training, the vocal data in the audio data of the song used for training, and the accompaniment music data in the audio data of the song used for training.
7. The method of claim 4, wherein generating second audio data based on the second human voice data comprises:
fusing the second voice data and the separated first accompaniment data to obtain first fused audio data;
and generating second audio data based on the first fused audio data.
8. The method of claim 1, wherein the first audio data does not include accompaniment data;
generating second audio data based on the second human voice data, comprising:
fusing the second voice data and preset second accompaniment data to obtain second fused audio data, and generating second audio data based on the second fused audio data.
9. The method of claim 8, wherein the preset second accompaniment data is generated by:
inputting the second voice data into an accompaniment music generation model to obtain accompaniment data output by the accompaniment music generation model;
the accompaniment data output by the accompaniment music generation model is used as the preset second accompaniment data;
wherein the accompaniment music generation model comprises a discrimination sub-model and a generation sub-model, the discrimination sub-model and the generation sub-model are jointly trained in advance by using training samples of the accompaniment music generation model, and the training samples of the accompaniment music generation model comprise: voice data used for training and accompaniment music data corresponding to the voice data used for training.
10. The method of claim 7, wherein generating second audio data based on the first fused audio data comprises:
performing song optimization processing on the first fused audio data to generate second audio data;
wherein the song optimization process comprises at least one of: alignment of vocal and accompaniment tracks, vocal volume adjustment and accompaniment volume adjustment, noise filtering and spectral smoothing.
11. The method of claim 8, wherein generating second audio data based on the second fused audio data comprises:
performing song optimization processing on the second fused audio data to generate second audio data;
wherein the song optimization process comprises at least one of: alignment of vocal and accompaniment tracks, vocal volume adjustment and accompaniment volume adjustment, noise filtering and spectral smoothing.
12. The method according to claim 10 or 11, characterized in that the method further comprises:
determining song optimization parameters through a preset song optimization strategy model;
wherein the song optimization parameters include at least one of: the type of the items in the song optimization processing, the execution sequence of the items in the song optimization processing, and the processing duration of the song optimization processing;
wherein the song optimization strategy model is trained in advance in a reinforcement learning mode, and training samples of the song optimization strategy model comprise: audio data of songs having a song quality score less than a third score threshold and audio data of songs having a song quality score greater than a fourth score threshold, the fourth score threshold being greater than the third score threshold.
13. An electronic device, characterized in that the electronic device comprises:
an acquisition unit configured to acquire first audio data;
an optimization unit configured to perform voice optimization processing on first voice data in the first audio data to obtain second voice data;
a generating unit configured to generate second audio data based on the second human voice data.
CN201910952208.6A 2019-09-30 2019-09-30 Audio processing method and electronic equipment Active CN110706679B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910952208.6A CN110706679B (en) 2019-09-30 2019-09-30 Audio processing method and electronic equipment

Publications (2)

Publication Number Publication Date
CN110706679A true CN110706679A (en) 2020-01-17
CN110706679B CN110706679B (en) 2022-03-29

Family

ID=69199106

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910952208.6A Active CN110706679B (en) 2019-09-30 2019-09-30 Audio processing method and electronic equipment

Country Status (1)

Country Link
CN (1) CN110706679B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150010162A1 (en) * 2009-03-17 2015-01-08 Continental Automotive Systems, Inc. Systems and methods for optimizing an audio communication system
CN102419976A (en) * 2011-12-02 2012-04-18 清华大学 Method for performing voice frequency indexing based on quantum learning optimization strategy
TW201337601A (en) * 2012-03-01 2013-09-16 Chi Mei Comm Systems Inc System and method for optimizing music
CN104753811A (en) * 2013-12-30 2015-07-01 中国移动通信集团公司 Streaming medium service optimizing method, device and system
CN105070283A (en) * 2015-08-27 2015-11-18 百度在线网络技术(北京)有限公司 Singing voice scoring method and apparatus
CN106375780A (en) * 2016-10-20 2017-02-01 腾讯音乐娱乐(深圳)有限公司 Method and apparatus for generating multimedia file
CN108461079A (en) * 2018-02-02 2018-08-28 福州大学 A kind of song synthetic method towards tone color conversion
CN108495182A (en) * 2018-03-23 2018-09-04 山西大学 A kind of audio quality self-adjusting control method
CN109785820A (en) * 2019-03-01 2019-05-21 腾讯音乐娱乐科技(深圳)有限公司 A kind of processing method, device and equipment
CN109873905A (en) * 2019-03-15 2019-06-11 广州酷狗计算机科技有限公司 Audio frequency playing method, audio synthetic method, device and storage medium
CN110211556A (en) * 2019-05-10 2019-09-06 北京字节跳动网络技术有限公司 Processing method, device, terminal and the storage medium of music file

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HENDRIK DECKER et al.: "Repair Checking by Integrity Checking", 2016 27th International Workshop on Database and Expert Systems Applications *
DENG Bo: "Application of pitch correction technology in live music performances", 《大舞台》 (Grand Stage) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111724757A (en) * 2020-06-29 2020-09-29 腾讯音乐娱乐科技(深圳)有限公司 Audio data processing method and related product
CN112637632A (en) * 2020-12-17 2021-04-09 北京达佳互联信息技术有限公司 Audio processing method and device, electronic equipment and storage medium
CN112637632B (en) * 2020-12-17 2023-04-07 北京达佳互联信息技术有限公司 Audio processing method and device, electronic equipment and storage medium
CN113689837A (en) * 2021-08-24 2021-11-23 北京百度网讯科技有限公司 Audio data processing method, device, equipment and storage medium
CN113689837B (en) * 2021-08-24 2023-08-29 北京百度网讯科技有限公司 Audio data processing method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN110706679B (en) 2022-03-29

Similar Documents

Publication Publication Date Title
CN110740259B (en) Video processing method and electronic equipment
CN109558512B (en) Audio-based personalized recommendation method and device and mobile terminal
CN110706679B (en) Audio processing method and electronic equipment
CN108668024B (en) Voice processing method and terminal
CN110097872B (en) Audio processing method and electronic equipment
CN110096611A (en) A kind of song recommendations method, mobile terminal and computer readable storage medium
CN111445927B (en) Audio processing method and electronic equipment
CN109885162B (en) Vibration method and mobile terminal
CN110544287B (en) Picture allocation processing method and electronic equipment
CN110855893A (en) Video shooting method and electronic equipment
CN107886969B (en) Audio playing method and audio playing device
CN110830368B (en) Instant messaging message sending method and electronic equipment
CN111491211B (en) Video processing method, video processing device and electronic equipment
CN109246474B (en) Video file editing method and mobile terminal
CN109922294B (en) Video processing method and mobile terminal
CN110808019A (en) Song generation method and electronic equipment
CN108763475B (en) Recording method, recording device and terminal equipment
CN110990679A (en) Information searching method and electronic equipment
CN112735388A (en) Network model training method, voice recognition processing method and related equipment
CN109949809B (en) Voice control method and terminal equipment
WO2022227507A1 (en) Wake-up degree recognition model training method and speech wake-up degree acquisition method
CN108306817B (en) Information positioning processing method and mobile terminal
CN111292727B (en) Voice recognition method and electronic equipment
CN115985309A (en) Voice recognition method and device, electronic equipment and storage medium
CN111416955B (en) Video call method and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant