CN110085251B - Human voice extraction method, human voice extraction device and related products - Google Patents


Info

Publication number
CN110085251B
Authority
CN
China
Prior art keywords
audio
voice
frame
human voice
spectrogram
Prior art date
Legal status
Active
Application number
CN201910343129.5A
Other languages
Chinese (zh)
Other versions
CN110085251A (en)
Inventor
王征韬
Current Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd filed Critical Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN201910343129.5A priority Critical patent/CN110085251B/en
Publication of CN110085251A publication Critical patent/CN110085251A/en
Application granted granted Critical
Publication of CN110085251B publication Critical patent/CN110085251B/en



Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/04Segmentation; Word boundary detection
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/04Training, enrolment or model building
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272Voice signal separating
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Telephonic Communication Services (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application provides a voice extraction method, which comprises the following steps: based on a human voice extraction model, performing human voice extraction on the mixed audio to obtain intermediate audio, wherein the intermediate audio comprises human voice audio frames and non-human voice audio frames; and filtering the non-human voice audio frames of the intermediate audio based on a human voice filtering model to obtain human voice audio. According to the embodiment of the application, pure voice audio can be extracted, and user experience is improved.

Description

Human voice extraction method, human voice extraction device and related products
Technical Field
The application relates to the field of electronic audio signal processing, and in particular to a human voice extraction method, a human voice extraction apparatus and related products.
Background
Human voice extraction is a widely researched audio processing technology, and many categories of human voice extraction algorithms exist. However, owing to limitations of the algorithms or of their training samples, no existing algorithm can extract human voice cleanly. For example, in the prior art the Hourglass model is used to extract vocals from mixed audio; although the extracted vocals are relatively clean and highly intelligible, instrumental passages such as preludes and interludes are mistakenly recognized as vocals and retained. Therefore, the prior art cannot extract completely pure human voice from mixed audio.
Disclosure of Invention
The embodiments of the present application provide a human voice extraction method, a human voice extraction apparatus and related products, so that pure human voice audio is obtained through two steps of human voice extraction and the misrecognition problem in conventional human voice extraction is solved.
In a first aspect, an embodiment of the present application provides a method for extracting human voice, including:
based on a human voice extraction model, performing human voice extraction on the mixed audio to obtain intermediate audio, wherein the intermediate audio comprises human voice audio frames and non-human voice audio frames;
and filtering the non-human voice audio frames of the intermediate audio based on a human voice filtering model to obtain human voice audio.
In a second aspect, an embodiment of the present application provides a human voice extracting apparatus, including:
the extraction unit is used for extracting the voice of the mixed audio based on the voice extraction model to obtain an intermediate audio, wherein the intermediate audio comprises a voice audio frame and a non-voice audio frame;
and the filtering unit is used for filtering the non-human voice audio frames of the intermediate audio based on the human voice filtering model to obtain the human voice audio.
In a third aspect, embodiments of the present application provide an electronic device, including a processor, a memory, a communication interface, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, and the program includes instructions for performing the steps in the method according to the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, which stores a computer program, where the computer program makes a computer execute the method according to the first aspect.
In a fifth aspect, embodiments of the present application provide a computer program product comprising a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to perform the method according to the first aspect.
The embodiment of the application has the following beneficial effects:
It can be seen that, in the embodiments of the present application, the human voice extraction model first extracts an intermediate audio, the intermediate audio is then input into the human voice filtering model to filter out its non-human voice audio frames, and pure human voice audio is obtained through these two extraction steps. This solves the problem that the prior art cannot extract pure human voice audio from mixed audio, and yields a better human voice extraction result.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic flow chart of a method for extracting human voice provided in an embodiment of the present application;
fig. 2A is a schematic flowchart of a method for obtaining training data according to an embodiment of the present disclosure;
fig. 2B is a schematic flowchart of another method for obtaining training data according to an embodiment of the present disclosure;
fig. 2C is a schematic diagram of an audio frame spectrogram according to an embodiment of the present disclosure;
fig. 3 is a schematic flow chart of another human voice extraction method provided in the embodiment of the present application;
fig. 4 is a network structure diagram of a human voice filtering model according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a human voice extraction device according to an embodiment of the present application;
fig. 6 is a block diagram illustrating functional units of a human voice extracting apparatus according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," "third," and "fourth," etc. in the description and claims of this application and in the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, result, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The human voice extraction apparatus in this application may include a smart phone (such as an Android phone, an iOS phone or a Windows phone), a tablet computer, a palmtop computer, a notebook computer, a Mobile Internet Device (MID), a wearable device and the like. These electronic devices are merely examples rather than an exhaustive list; the human voice extraction apparatus is not limited to the above forms and, in practical applications, may further include an intelligent vehicle-mounted terminal, computer equipment and the like.
Referring to fig. 1, fig. 1 is a diagram illustrating a method for extracting human voice according to an embodiment of the present application, where the method is applied to a human voice extracting apparatus, and the method includes steps 101 to 102:
step 101: the voice extraction device extracts voice from mixed audio based on the voice extraction model to obtain intermediate audio, wherein the intermediate audio comprises voice audio frames and non-voice audio frames.
Here, human voice extraction refers to separating recognizable human voice audio from mixed audio that contains both human voice and background music.
The human voice extraction model is an existing neural network model, for example the Hourglass model; the details of its extraction process are not repeated here. It should be noted that when the Hourglass model performs human voice extraction, its input is a single audio frame, and human voice is extracted from each audio frame independently. In other words, the Hourglass model extracts human voice from the mixed audio based only on local information, so instrumental passages such as preludes and interludes are mistakenly identified as human voice and retained in the finally extracted vocal audio, and pure vocal audio therefore cannot be obtained from the mixed audio.
Step 102: and the voice extraction device filters the non-voice audio frames of the intermediate audio based on the voice filtering model to obtain the voice audio.
The human voice filtering model is constructed based on a machine learning integration algorithm, where the machine learning integration algorithm may be the Viterbi algorithm or a conditional random field (CRF) algorithm; this application takes the Viterbi algorithm as the specific example.
The Viterbi algorithm is a dynamic programming algorithm used to find the hidden-state sequence (the Viterbi path) that is most likely to have produced an observed sequence of events; it is particularly used with Markov information sources and hidden Markov models to solve optimal-path problems. In this application, the human voice probability sequence is dynamically adjusted by the Viterbi algorithm to complete the construction of the human voice filtering model.
The construction process of the human voice filtering model includes: training the human voice filtering model based on the machine learning integration algorithm, training data, and a label sequence corresponding to the training data, where the training data and the label sequence are obtained by preprocessing an existing audio file. Because the input data of the human voice filtering model is an audio segment, the model has a larger receptive field and can use the global information of the intermediate audio, and can therefore filter out the non-human voice audio frames in the intermediate audio.
It can be seen that, in the embodiments of the present application, after the intermediate audio is extracted based on the human voice extraction model, the human voice filtering model, which has a larger receptive field, is used to filter out the non-human voice audio frames in the intermediate audio, so that pure human voice is extracted from the mixed audio, the extraction result is better, and user experience is improved.
The process of preprocessing an audio file to obtain training data and a tag sequence is described in detail below.
Referring to fig. 2A, fig. 2A is a schematic flowchart of a method for obtaining training data and a tag sequence according to an embodiment of the present application, where the method is applied to a human voice extraction device, and the method includes steps 201a to 205 a:
step 201 a: the voice extraction device extracts voice from the audio file based on the voice extraction model to obtain sample audio.
Optionally, the sample audio includes human voice audio frames and non-human voice audio frames, where the non-human voice audio frames are misrecognized audio frames of instrumental passages such as preludes and interludes.
Step 202 a: and the human voice extraction device performs framing processing on the sample audio to obtain N sample audio frames, wherein N is an integer greater than 1.
An audio signal is non-stationary over its whole duration and cannot be processed directly as a whole, so the sample audio is framed according to a preset window function and a preset step size to obtain N sample audio frames, and each sample audio frame is regarded as a stationary signal. To preserve the continuity of the audio signal, any two adjacent sample audio frames overlap. For example, with a preset window duration of 30 ms and a preset step size of 20 ms, any two adjacent sample audio frames overlap by 10 ms.
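As an illustrative sketch of this framing step (the Hann window choice and the function name are assumptions of this sketch rather than part of the embodiment; the 30 ms window and 20 ms step are only the example values above):

```python
import numpy as np

def frame_audio(samples, sr, win_ms=30, hop_ms=20):
    # Split a 1-D signal into overlapping frames and apply a window function.
    win = int(sr * win_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(samples) - win) // hop)
    frames = np.stack([samples[i * hop: i * hop + win] for i in range(n_frames)])
    return frames * np.hanning(win)   # adjacent frames overlap by win - hop samples
```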
Step 203 a: and the human voice extraction device performs short-time Fourier transform on each sample audio frame to obtain a spectrogram of each sample audio frame.
In some possible embodiments, the spectrogram can be an amplitude spectrum, a power spectrum (energy spectrum), or a log power spectrum. The present application specifically describes the amplitude spectrum as an example.
Step 204 a: the human voice extraction device obtains a first spectrogram of the audio file based on the spectrogram of each sample audio frame, and marks the first spectrogram as training data.
The first spectrogram is a matrix formed by the spectrum vectors of each sample audio frame, and the spectrum vector of each sample audio frame is a column vector formed by the amplitudes corresponding to the frequency points of each sample audio frame.
For example, referring to fig. 2C, fig. 2C is a spectrogram of the k-th sample audio frame, where 1 ≤ k ≤ N; f1, f2, f3, …, fm are the frequency points of the k-th sample audio frame in the frequency domain, and m is the number of frequency points of each sample audio frame in the frequency domain. The spectral vector corresponding to the spectrogram of the k-th sample audio frame is [Ak1, Ak2, Ak3, …, Akm]^T. In this way, N spectral vectors corresponding to the N sample audio frames are obtained, and the N spectral vectors are composed into the first spectrogram:
$$\begin{bmatrix} A_{11} & A_{21} & A_{31} & \cdots & A_{N1} \\ A_{12} & A_{22} & A_{32} & \cdots & A_{N2} \\ \vdots & \vdots & \vdots & & \vdots \\ A_{1m} & A_{2m} & A_{3m} & \cdots & A_{Nm} \end{bmatrix}$$
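Continuing the sketch above, the first spectrogram can be assembled by taking the magnitude of the short-time Fourier transform of each frame and stacking the resulting column vectors (a minimal illustration; production code would typically rely on an STFT helper such as librosa.stft instead):

```python
import numpy as np

def first_spectrogram(frames):
    # frames: (N, win) array of windowed sample audio frames.
    mags = np.abs(np.fft.rfft(frames, axis=1))   # (N, m) amplitude spectra
    return mags.T                                # (m, N): column k is the k-th spectral vector
```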
step 205 a: and the human voice extraction device obtains a label sequence corresponding to the training data based on the first spectrogram.
The label sequence is used for labeling frame attributes of the sample audio frame corresponding to each column vector in the training data, wherein the frame attributes comprise human voices and non-human voices. For example, the jth element in the tag sequence is used to label the frame attribute of the jth audio frame in the training data, j is greater than or equal to 1 and less than or equal to N, and j is an integer.
In some possible embodiments, the obtaining of the tag sequence corresponding to the training data based on the first spectrogram may be implemented by: determining a first frame sequence number corresponding to a mute audio frame in the first spectrogram based on a voice endpoint detection algorithm VAD; acquiring a lyric file corresponding to the audio file, and determining a second frame number corresponding to a human voice audio frame and a third frame number corresponding to a non-human voice audio frame in the first spectrogram based on the lyric file; and obtaining a label sequence based on the first frame sequence number, the second frame sequence number and the third frame sequence number.
In some possible embodiments, before determining the first frame number corresponding to the silence audio frame in the first spectrogram based on a voice endpoint detection algorithm VAD, the method further includes: and performing spectral subtraction noise reduction on the first spectrogram to filter out background noise in the first spectrogram, wherein the spectral subtraction noise reduction is in the prior art and is not repeated.
Specifically, the singing time periods containing lyrics and the non-singing time periods containing no lyrics are determined for the audio file based on the lyric file; all audio frames corresponding to a non-singing time period are non-human voice audio frames, and the audio frames corresponding to a singing time period at least include human voice audio frames. It can be understood that, within any singing time period, the singer may pause to breathe between two adjacent lines of lyrics, so silent sub-periods exist within the singing time period, that is, silent audio frames exist among the audio frames corresponding to that period. The voice endpoint detection algorithm (VAD) is therefore used to determine the silent time periods corresponding to the silent audio frames in the audio file. Each of these time periods is then compared with the time span of each sample audio frame to determine the period to which each sample audio frame belongs, and the frame attribute of each sample audio frame is determined accordingly, that is, the frame numbers of the human voice audio frames, the non-human voice audio frames and the silent audio frames are determined.
For example, suppose the frame duration is 30 ms and the step size is 10 ms, the label of a human voice audio frame is 1, the label of a non-human voice audio frame is 0, and the label of a silent audio frame is also 0. If 0-50 ms of the audio file is a non-human voice period, the 1st and 2nd audio frames are determined to be non-human voice audio frames and their labels in the training data are both 0; if 50-70 ms and 90-110 ms are human voice periods, the 3rd and 5th audio frames are human voice audio frames and the 4th audio frame is a silent audio frame, so the labels of the 3rd and 5th audio frames are both 1 and the label of the 4th audio frame is 0. The label sequence is therefore [0, 0, 1, 0, 1, …].
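A minimal sketch of turning such labeled time intervals into a per-frame label sequence (the helper name and the use of the frame center to decide interval membership are conventions of this sketch, not requirements of the embodiment):

```python
def label_frames(n_frames, hop_ms, win_ms, voice_spans, silence_spans):
    # Label each frame 1 (human voice) or 0 (non-human voice / silence)
    # according to where its center falls; spans are (start_ms, end_ms) pairs.
    labels = []
    for k in range(n_frames):
        center = k * hop_ms + win_ms / 2
        in_voice = any(s <= center < e for s, e in voice_spans)
        in_silence = any(s <= center < e for s, e in silence_spans)
        labels.append(1 if in_voice and not in_silence else 0)
    return labels
```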
Referring to fig. 2B, based on fig. 2A, fig. 2B is a schematic flowchart of another method for obtaining training data according to an embodiment of the present application, where the method is applied to a human voice extraction device, and the method includes steps 201B to 207B:
step 201 b: the voice extraction device extracts voice from the audio file based on the voice extraction model to obtain sample audio.
Step 202 b: and the human voice extraction device carries out framing processing on the sample audio to obtain N sample audio frames.
Step 203 b: and the human voice extraction device performs short-time Fourier transform on each sample audio frame to obtain a spectrogram of each sample audio frame.
Step 204 b: the human voice extraction device obtains a first spectrogram of the audio file based on the spectrogram of each sample audio frame.
Step 205 b: the human voice extraction device determines a first-order difference of corresponding elements of an ith column vector in the first spectrogram and an (i +1) th column vector in the first spectrogram to obtain a difference vector, and longitudinally splices the difference vector and the (i +1) th column vector to obtain a second spectrogram, wherein i is greater than or equal to 1 and is less than or equal to N, and i is an integer.
Since the first sample audio frame in the first spectrogram has no corresponding difference vector, a preset difference vector A0 = [A01, A02, …, A0m] is longitudinally spliced to it, where the preset difference vector may be a zero vector whose elements are all 0, a vector of preset elements, or the like; this is not uniquely limited in this application.
After longitudinally splicing the preset differential vector, the second spectrogram is:
$$\begin{bmatrix} A_{11} & A_{21} & \cdots & A_{N1} \\ \vdots & \vdots & & \vdots \\ A_{1m} & A_{2m} & \cdots & A_{Nm} \\ A_{01} & A_{21}-A_{11} & \cdots & A_{N1}-A_{(N-1)1} \\ \vdots & \vdots & & \vdots \\ A_{0m} & A_{2m}-A_{1m} & \cdots & A_{Nm}-A_{(N-1)m} \end{bmatrix}$$
optionally, a first-order difference between frame vectors of two adjacent sample audio frames is obtained, and the difference vector is longitudinally spliced with the first spectrogram, so that each column vector of the obtained second spectrogram contains audio information of a previous audio frame.
Step 206 b: and the human voice extraction device marks the second spectrogram as training data.
Step 207 b: and the human voice extraction device obtains a label sequence corresponding to the training data based on the first spectrogram.
Finally, the human voice filtering model is trained with the training data following existing practice, which is not described in detail here.
Referring to fig. 3, fig. 3 is a schematic flow chart of another method for extracting human voice according to the embodiment of the present application, where the method is applied to a human voice extracting apparatus, and the method includes steps 301 to 306:
step 301: the voice extraction device extracts voice from mixed audio based on the voice extraction model to obtain intermediate audio, wherein the intermediate audio comprises voice audio frames and non-voice audio frames.
Step 302: the human voice extracting device divides the intermediate audio frequency into a plurality of audio frequency segments, and any two adjacent audio frequency segments have overlapped audio frequency segments.
Optionally, the human voice extraction apparatus divides the intermediate audio into a plurality of audio segments according to a preset window function and a preset step size, where each audio segment includes at least one audio frame. For example, the intermediate audio may be divided into several audio segments according to a 10 s window function and a preset step size of 5 s, so that any two adjacent audio segments overlap by 5 s.
Step 303: and the voice extraction device inputs each voice frequency segment into the voice filtering model in sequence to obtain a first voice probability sequence of each voice frequency segment, wherein the first voice probability sequence is used for representing the probability that each voice frequency frame in each voice frequency segment is voice.
Step 304: and the voice extraction device determines the voice probability mean value of each audio frame in the overlapped audio segments based on the first voice probability sequence of each audio segment to obtain a second voice probability sequence of the intermediate audio.
Optionally, the first human voice probability sequence of each audio segment is determined based on the human voice filtering model. Because any two adjacent audio segments overlap, the human voice probability of each audio frame in the overlapping portion appears in both of the first human voice probability sequences of the two adjacent segments; the human voice probability of each such audio frame is therefore obtained by averaging the two values. These averaged probabilities, together with the probabilities of the audio frames in the non-overlapping portions, form the second human voice probability sequence of the intermediate audio, and each element of the second human voice probability sequence represents the probability that the corresponding audio frame of the intermediate audio is human voice.
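A sketch of merging the per-segment probability sequences by averaging over the overlapping frames (the function and argument names are illustrative):

```python
import numpy as np

def merge_segment_probs(seg_probs, seg_starts, n_frames):
    # seg_probs[i]: voice probabilities of segment i; seg_starts[i]: index of its
    # first frame within the intermediate audio.
    acc = np.zeros(n_frames)
    cnt = np.zeros(n_frames)
    for probs, start in zip(seg_probs, seg_starts):
        idx = np.arange(start, start + len(probs))
        acc[idx] += probs
        cnt[idx] += 1
    return acc / np.maximum(cnt, 1)   # second human voice probability sequence
```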
Step 305: and the voice extraction device determines a target voice probability sequence of the intermediate audio based on a Viterbi algorithm and the second voice probability sequence.
Optionally, the elements of the second human voice probability sequence are adjusted based on the Viterbi algorithm to obtain an optimal probability sequence, which is used as the target human voice probability sequence. That is, similarly to finding an optimal path, the hidden sequences corresponding to the second probability sequence are evaluated with the Viterbi algorithm, the likelihood of each hidden sequence is computed, and the optimal probability sequence is obtained.
For example, suppose the second human voice probability sequence is [0.0, 0.1, 0.1, 0.2, 0.3, 0.5, 0.8, 0.7, 0.1, 0.1, 0.6, 0.7, 0.8, …]. From this sequence, the 6th, 7th, 8th, 11th, 12th and 13th audio frames of the intermediate audio are likely human voice audio frames, while the 9th and 10th audio frames appear to be non-human voice audio frames. Since a speaker's voice starts and stops gradually, the human voice probability should also change gradually; a very high probability in one audio frame followed by a very low probability in the next does not match natural speech. It can therefore be concluded that the probabilities of the 9th and 10th audio frames are suspect and need to be dynamically adjusted to conform to the speaker's speech pattern.
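One way to realise this dynamic adjustment is a two-state Viterbi decoding over the frame-wise probabilities; the sketch below assumes a simple self-transition prior that penalises abrupt voice/non-voice flips (the transition value and the use of the model's output as emission probabilities are assumptions of this sketch, not parameters given in the embodiment):

```python
import numpy as np

def viterbi_smooth(voice_probs, stay=0.9):
    # States: 0 = non-voice, 1 = voice. Returns the most likely state sequence.
    p = np.asarray(voice_probs, dtype=float)
    trans = np.log(np.array([[stay, 1 - stay], [1 - stay, stay]]))
    emit = np.log(np.stack([1 - p, p], axis=1) + 1e-9)
    score = emit[0].copy()
    back = np.zeros((len(p), 2), dtype=int)
    for t in range(1, len(p)):
        cand = score[:, None] + trans          # cand[i, j]: best score ending in state j via i
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + emit[t]
    states = [int(score.argmax())]
    for t in range(len(p) - 1, 0, -1):
        states.append(int(back[t, states[-1]]))
    return states[::-1]                        # 1 marks frames kept as human voice
```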
Step 306: and the voice extraction device filters non-voice audio frames in the intermediate audio based on the target voice probability sequence to obtain voice audio, wherein the non-voice audio frames are audio frames corresponding to target elements in the target voice probability sequence in the intermediate audio, and the target elements are elements meeting preset conditions.
The element satisfying the preset condition may be an element greater than or equal to a threshold, where the threshold may be 0.5, 0.6,0.7, or other values.
It can be seen that, in the embodiment of the present application, after the intermediate audio is extracted based on the voice extraction model, the intermediate audio is segmented, the input data corresponding to each audio segment is determined, the input data is input to the voice filtering model to filter the non-voice audio frame of the intermediate audio, and a pure voice audio is obtained.
In some possible embodiments, the voice extraction method disclosed in the present application is applied to a voice filtering model as shown in fig. 4, the voice filtering model includes P identical network layers and a fully connected layer, wherein the P identical network layers are connected in a residual form, and each network layer includes: the device comprises a first convolution layer, a second convolution layer, an activation layer, a feature fusion layer and a feature superposition layer; the fully connected layer may be densely connected for multiple network layers.
First, the human voice extraction apparatus segments the intermediate audio into a plurality of audio segments and performs a short-time Fourier transform on each audio segment to obtain the spectrogram corresponding to that segment (which may be the first spectrogram or the second spectrogram); the input data corresponding to each audio segment is obtained from this spectrogram, following the same transformation used to obtain the training data, which is not repeated here. The input data is fed into the first of the P network layers of the human voice filtering model. Within each network layer, the first convolution layer performs a first convolution operation on the input data to obtain a first feature matrix; the second convolution layer performs a second convolution operation on the input data to obtain a second feature matrix; the activation layer performs nonlinear activation on the second feature matrix to obtain a third feature matrix; the feature fusion layer performs a cross multiplication operation on the first feature matrix and the third feature matrix to obtain a fourth feature matrix; and the feature superposition layer superposes the fourth feature matrix and the input data to obtain the output data of the network layer, which is used as the input data of the next network layer. After the P network layers, the target feature matrix of each audio segment is obtained. The fully connected layer then performs a fully connected operation on the target feature matrix to obtain a feature vector corresponding to each audio segment, and the feature vector is input into a softmax classifier to obtain the human voice probability sequence corresponding to that audio segment.
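The following PyTorch-style sketch illustrates one plausible reading of such a network layer, interpreting the "cross multiplication" of the feature fusion layer as an element-wise product and the "feature superposition" as a residual addition; the kernel size, the sigmoid activation, the number of layers and the per-frame softmax head are all assumptions of this sketch and do not reproduce the patented model:

```python
import torch
import torch.nn as nn

class FilterBlock(nn.Module):
    # One of the P identical network layers: two parallel convolutions, a nonlinear
    # activation on the second branch, element-wise fusion, and a residual addition.
    def __init__(self, channels, kernel=3):
        super().__init__()
        self.conv_a = nn.Conv1d(channels, channels, kernel, padding=kernel // 2)
        self.conv_b = nn.Conv1d(channels, channels, kernel, padding=kernel // 2)

    def forward(self, x):                       # x: (batch, channels, frames)
        gate = torch.sigmoid(self.conv_b(x))    # activation layer on the second branch
        fused = self.conv_a(x) * gate           # feature fusion layer
        return fused + x                        # feature superposition layer

class VoiceFilterSketch(nn.Module):
    def __init__(self, channels, depth=4):
        super().__init__()
        self.blocks = nn.Sequential(*[FilterBlock(channels) for _ in range(depth)])
        self.fc = nn.Linear(channels, 2)        # fully connected layer: voice / non-voice

    def forward(self, x):                       # x: (batch, channels, frames)
        h = self.blocks(x).transpose(1, 2)      # (batch, frames, channels)
        return torch.softmax(self.fc(h), dim=-1)[..., 1]   # per-frame voice probability
```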
It should be noted that fig. 4 is only one network structure of the human voice filtering model, and the present application only specifically describes the network structure as an example, and does not uniquely define the human voice filtering model.
The above description has introduced the solution of the embodiments of the present application mainly from the perspective of the method-side implementation process. It will be appreciated that, in order to carry out the functions described above, the human voice extraction apparatus comprises corresponding hardware structures and/or software modules for performing the respective functions. Those of skill in the art will readily appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or as combinations of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The embodiments of the present application may divide the human voice extraction apparatus into functional units according to the above method example; for example, each functional unit may correspond to one function, or two or more functions may be integrated into one processing unit. The integrated unit can be realized in the form of hardware or in the form of a software functional unit. It should be noted that the division of units in the embodiments of the present application is schematic and is only a logical function division; other division manners are possible in actual implementation.
In accordance with the embodiment of the voice extracting method described above, please refer to fig. 5, fig. 5 is a schematic structural diagram of a voice extracting apparatus 500 according to an embodiment of the present application, and as shown in fig. 5, the voice extracting apparatus 500 includes a processor, a memory, a communication interface, and one or more programs, where the one or more programs are different from the one or more application programs, and the one or more programs are stored in the memory and configured to be executed by the processor, and the program includes instructions for performing the following steps:
based on a human voice extraction model, performing human voice extraction on the mixed audio to obtain intermediate audio, wherein the intermediate audio comprises human voice audio frames and non-human voice audio frames;
and filtering the non-human voice audio frames of the intermediate audio based on a human voice filtering model to obtain human voice audio.
In a possible embodiment, the human voice filtering model is constructed based on a machine learning integration algorithm, and the program is further used for executing the following steps:
and preprocessing an audio file to obtain training data and a label sequence before filtering out the non-human voice audio frame of the intermediate audio based on a human voice filtering model, and performing optimization training on the human voice filtering model by using the training data and the label sequence.
In a possible embodiment, the program is specifically configured to execute the following steps in terms of preprocessing an audio file to obtain training data and a tag sequence:
carrying out voice extraction on the audio file based on the voice extraction model to obtain sample audio;
performing framing processing on the sample audio to obtain N sample audio frames, wherein N is an integer greater than 1;
carrying out short-time Fourier transform on each sample audio frame to obtain a spectrogram of each sample audio frame;
obtaining a first spectrogram of the audio file based on a spectrogram of each sample audio frame, wherein the first spectrogram is a matrix formed by spectral vectors of each sample audio frame, and the spectral vectors of each sample audio frame are column vectors formed by amplitudes corresponding to frequency points of each sample audio frame;
tagging the first spectrogram as training data;
and obtaining a label sequence corresponding to the training data based on the first spectrogram, wherein the label sequence is used for labeling frame attributes of sample audio frames corresponding to each column vector in the training data, and the frame attributes comprise human voices and non-human voices.
In a possible embodiment, before marking the first spectrogram as training data, the program is further for instructions to:
determining a first-order difference between corresponding elements of the ith column vector in the first spectrogram and the (i +1) th column vector in the first spectrogram to obtain a difference vector, wherein i is greater than or equal to 1 and less than or equal to N, and is an integer;
longitudinally splicing the difference vector and the (i +1) th column vector to obtain a second spectrogram;
the using the first spectrogram as training data comprises:
labeling the second spectrogram as training data.
In a possible embodiment, in terms of obtaining the label sequence corresponding to the training data based on the first spectrogram, the program is specifically configured to execute the following steps:
determining a first frame sequence number corresponding to a mute audio frame in the first spectrogram based on a voice endpoint detection algorithm;
acquiring a lyric file corresponding to the audio file, and determining a second frame number corresponding to a human voice audio frame and a third frame number corresponding to a non-human voice audio frame in the first spectrogram based on the lyric file;
and obtaining a label sequence based on the first frame sequence number, the second frame sequence number and the third frame sequence number.
In a possible embodiment, the program is specifically configured to execute the following steps in terms of filtering out non-human audio frames of the intermediate audio based on a human voice filtering model:
dividing the intermediate audio into a plurality of audio segments, wherein any two adjacent audio segments have overlapped audio segments;
inputting each audio segment into a human voice filtering model in sequence to obtain a first human voice probability sequence of each audio segment, wherein the first human voice probability sequence is used for representing the probability that each audio frame in each audio segment is human voice;
determining the average value of the voice probability of each audio frame in the overlapped audio segments based on the first voice probability sequence of each audio segment to obtain a second voice probability sequence of the intermediate audio;
determining a target human voice probability sequence of the intermediate audio based on a Viterbi algorithm and the second human voice probability sequence;
and filtering non-human voice audio frames in the intermediate audio based on the target human voice probability sequence to obtain human voice audio, wherein the non-human voice audio frames are audio frames corresponding to target elements in the target human voice probability sequence in the intermediate audio, and the target elements are elements meeting preset conditions.
Referring to fig. 6, fig. 6 shows a block diagram of a possible functional unit of the human voice extracting apparatus 600 according to the above embodiment, and the human voice extracting apparatus 600 includes: extraction unit 610, filter unit 620, wherein:
the extracting unit 610 is configured to perform voice extraction on the mixed audio based on a voice extraction model to obtain an intermediate audio, where the intermediate audio includes a voice audio frame and a non-voice audio frame;
and the filtering unit 620 is configured to filter the non-human voice audio frames of the intermediate audio based on the human voice filtering model to obtain the human voice audio.
In a possible embodiment, the human voice filtering model is constructed based on a machine learning integration algorithm, and the human voice extracting apparatus 600 further includes a training unit 630, and the training unit 630 is configured to: and preprocessing an audio file to obtain training data and a label sequence before filtering out the non-human voice audio frame of the intermediate audio based on a human voice filtering model, and performing optimization training on the human voice filtering model by using the training data and the label sequence.
In a possible embodiment, in terms of preprocessing the audio file to obtain training data and a tag sequence, the training unit 630 is specifically configured to: carrying out voice extraction on the audio file based on the voice extraction model to obtain sample audio; performing framing processing on the sample audio to obtain N sample audio frames, wherein N is an integer greater than 1; carrying out short-time Fourier transform on each sample audio frame to obtain a spectrogram of each sample audio frame; obtaining a first spectrogram of the audio file based on a spectrogram of each sample audio frame, wherein the first spectrogram is a matrix formed by spectral vectors of each sample audio frame, and the spectral vectors of each sample audio frame are column vectors formed by amplitudes corresponding to frequency points of each sample audio frame; tagging the first spectrogram as training data; and obtaining a label sequence corresponding to the training data based on the first spectrogram, wherein the label sequence is used for labeling frame attributes of sample audio frames corresponding to each column vector in the training data, and the frame attributes comprise human voices and non-human voices.
In a possible embodiment, before marking the first spectrogram as training data, the training unit 630 is further configured to: determining a first-order difference between corresponding elements of the ith column vector in the first spectrogram and the (i +1) th column vector in the first spectrogram to obtain a difference vector, wherein i is greater than or equal to 1 and less than or equal to N, and is an integer; longitudinally splicing the difference vector and the (i +1) th column vector to obtain a second spectrogram; in respect of using the first spectrogram as training data, the training unit 630 is specifically configured to: labeling the second spectrogram as training data.
In a possible embodiment, in terms of obtaining the label sequence corresponding to the training data based on the first spectrogram, the training unit 630 is specifically configured to: determining a first frame sequence number corresponding to a mute audio frame in the first spectrogram based on a voice endpoint detection algorithm; acquiring a lyric file corresponding to the audio file, and determining a second frame number corresponding to a human voice audio frame and a third frame number corresponding to a non-human voice audio frame in the first spectrogram based on the lyric file; and obtaining a label sequence based on the first frame sequence number, the second frame sequence number and the third frame sequence number.
In a possible embodiment, in terms of filtering out the non-human audio frames of the intermediate audio based on the human voice filtering model, the filtering unit 620 is specifically configured to: dividing the intermediate audio into a plurality of audio segments, wherein any two adjacent audio segments have overlapped audio segments; inputting each audio segment into a human voice filtering model in sequence to obtain a first human voice probability sequence of each audio segment, wherein the first human voice probability sequence is used for representing the probability that each audio frame in each audio segment is human voice; determining the average value of the voice probability of each audio frame in the overlapped audio segments based on the first voice probability sequence of each audio segment to obtain a second voice probability sequence of the intermediate audio; determining a target human voice probability sequence of the intermediate audio based on a Viterbi algorithm and the second human voice probability sequence; and filtering non-human voice audio frames in the intermediate audio based on the target human voice probability sequence to obtain human voice audio, wherein the non-human voice audio frames are audio frames corresponding to target elements in the target human voice probability sequence in the intermediate audio, and the target elements are elements meeting preset conditions.
Embodiments of the present application also provide a computer storage medium, where the computer storage medium stores a computer program, and the computer program is executed by a processor to implement part or all of the steps of any one of the voice extraction methods described in the above method embodiments.
Embodiments of the present application also provide a computer program product, which includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute part or all of the steps of any one of the voice extraction methods as described in the above method embodiments.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are exemplary embodiments and that the acts and modules referred to are not necessarily required in this application.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implementing, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of some interfaces, devices or units, and may be an electric or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may be implemented in the form of a software program module.
The integrated units, if implemented in the form of software program modules and sold or used as stand-alone products, may be stored in a computer readable memory. Based on such understanding, the technical solution of the present application may be substantially implemented or a part of or all or part of the technical solution contributing to the prior art may be embodied in the form of a software product stored in a memory, and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method described in the embodiments of the present application. And the aforementioned memory comprises: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable memory, which may include: flash Memory disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
The foregoing detailed description of the embodiments of the present application has been presented to illustrate the principles and implementations of the present application, and the above description of the embodiments is only provided to help understand the method and the core concept of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (7)

1. A human voice extraction method is characterized by comprising the following steps:
carrying out voice extraction on the audio file based on the voice extraction model to obtain sample audio;
performing framing processing on the sample audio to obtain N sample audio frames, wherein N is an integer greater than 1;
carrying out short-time Fourier transform on each sample audio frame to obtain a spectrogram of each sample audio frame;
obtaining a first spectrogram of the audio file based on a spectrogram of each sample audio frame, wherein the first spectrogram is a matrix formed by spectral vectors of each sample audio frame, and the spectral vectors of each sample audio frame are column vectors formed by amplitudes corresponding to frequency points of each sample audio frame;
tagging the first spectrogram as training data;
obtaining a label sequence corresponding to the training data based on the first spectrogram, wherein the label sequence is used for labeling frame attributes of sample audio frames corresponding to each column vector in the training data, and the frame attributes comprise human voices and non-human voices;
performing optimization training on a human voice filtering model by using the training data and the label sequence;
based on the voice extraction model, voice extraction is carried out on the mixed audio to obtain intermediate audio, and the intermediate audio comprises voice audio frames and non-voice audio frames;
and filtering the non-human voice audio frames of the intermediate audio based on the optimally trained human voice filtering model to obtain the human voice audio.
2. The method of claim 1, wherein prior to labeling the first spectrogram as training data, the method further comprises:
determining a first-order difference between corresponding elements of the ith column vector in the first spectrogram and the (i +1) th column vector in the first spectrogram to obtain a difference vector, wherein i is greater than or equal to 1 and less than or equal to N, and is an integer;
longitudinally splicing the difference vector and the (i +1) th column vector to obtain a second spectrogram;
the using the first spectrogram as training data comprises:
labeling the second spectrogram as training data.
3. The method according to claim 1 or 2, wherein the obtaining the label sequence corresponding to the training data based on the first spectrogram comprises:
determining a first frame sequence number corresponding to a mute audio frame in the first spectrogram based on a voice endpoint detection algorithm;
acquiring a lyric file corresponding to the audio file, and determining a second frame number corresponding to a human voice audio frame and a third frame number corresponding to a non-human voice audio frame in the first spectrogram based on the lyric file;
and obtaining a label sequence based on the first frame sequence number, the second frame sequence number and the third frame sequence number.
4. The method of claim 1, wherein filtering out non-human audio frames of the intermediate audio based on a human filtering model comprises:
dividing the intermediate audio into a plurality of audio segments, wherein any two adjacent audio segments have overlapped audio segments;
inputting each audio segment into a human voice filtering model in sequence to obtain a first human voice probability sequence of each audio segment, wherein the first human voice probability sequence is used for representing the probability that each audio frame in each audio segment is human voice;
determining the average value of the voice probability of each audio frame in the overlapped audio segments based on the first voice probability sequence of each audio segment to obtain a second voice probability sequence of the intermediate audio;
determining a target human voice probability sequence of the intermediate audio based on a Viterbi algorithm and the second human voice probability sequence;
and filtering non-human voice audio frames in the intermediate audio based on the target human voice probability sequence to obtain human voice audio, wherein the non-human voice audio frames are audio frames corresponding to target elements in the target human voice probability sequence in the intermediate audio, and the target elements are elements meeting preset conditions.
5. A human voice extraction device, characterized by comprising:
the training unit is used for performing human voice extraction on an audio file based on a human voice extraction model to obtain sample audio; performing framing processing on the sample audio to obtain N sample audio frames, wherein N is an integer greater than 1; performing a short-time Fourier transform on each sample audio frame to obtain a spectrogram of each sample audio frame; obtaining a first spectrogram of the audio file based on the spectrogram of each sample audio frame, wherein the first spectrogram is a matrix formed by the spectral vectors of the sample audio frames, and the spectral vector of each sample audio frame is a column vector formed by the amplitudes corresponding to the frequency points of that sample audio frame; taking the first spectrogram as training data; obtaining a label sequence corresponding to the training data based on the first spectrogram, wherein the label sequence is used for labeling a frame attribute of the sample audio frame corresponding to each column vector in the training data, and the frame attribute comprises human voice and non-human voice; and performing optimization training on a human voice filtering model by using the training data and the label sequence;
the extraction unit is used for performing human voice extraction on mixed audio based on the human voice extraction model to obtain an intermediate audio, wherein the intermediate audio comprises human voice audio frames and non-human voice audio frames;
and the filtering unit is used for filtering out the non-human voice audio frames of the intermediate audio based on the optimized human voice filtering model to obtain human voice audio.
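The framing and short-time Fourier transform steps performed by the training unit can be pictured with a small NumPy sketch; the frame length, hop size and window below are assumptions, since the claims do not fix them. Each sample audio frame yields a column vector of magnitudes at its frequency points, and the columns together form the first spectrogram:

```python
import numpy as np

rng = np.random.default_rng(0)
sample_rate = 16000
sample_audio = rng.standard_normal(sample_rate * 3)      # placeholder 3 s of sample audio

frame_len, hop = 1024, 512                               # assumed framing parameters
window = np.hanning(frame_len)

# Framing: N overlapping sample audio frames.
n = 1 + (len(sample_audio) - frame_len) // hop
frames = np.stack([sample_audio[i * hop:i * hop + frame_len] for i in range(n)])

# Short-time Fourier transform of each frame; keep the magnitude at each frequency point.
spectra = np.abs(np.fft.rfft(frames * window, axis=1))   # shape (N, frame_len // 2 + 1)

# First spectrogram: one column vector of magnitudes per sample audio frame.
first_spectrogram = spectra.T                            # shape (frame_len // 2 + 1, N)
print(first_spectrogram.shape)
```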
6. An electronic device comprising a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, the programs comprising instructions for performing the steps of the method of any of claims 1-4.
7. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the method according to any one of claims 1-4.
CN201910343129.5A 2019-04-26 2019-04-26 Human voice extraction method, human voice extraction device and related products Active CN110085251B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910343129.5A CN110085251B (en) 2019-04-26 2019-04-26 Human voice extraction method, human voice extraction device and related products

Publications (2)

Publication Number Publication Date
CN110085251A CN110085251A (en) 2019-08-02
CN110085251B true CN110085251B (en) 2021-06-25

Family

ID=67416989

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910343129.5A Active CN110085251B (en) 2019-04-26 2019-04-26 Human voice extraction method, human voice extraction device and related products

Country Status (1)

Country Link
CN (1) CN110085251B (en)

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110428853A (en) * 2019-08-30 2019-11-08 北京太极华保科技股份有限公司 Voice activity detection method, Voice activity detection device and electronic equipment
CN112687274A (en) * 2019-10-17 2021-04-20 北京猎户星空科技有限公司 Voice information processing method, device, equipment and medium
CN110718228B (en) * 2019-10-22 2022-04-12 中信银行股份有限公司 Voice separation method and device, electronic equipment and computer readable storage medium
CN110942776B (en) * 2019-10-31 2022-12-06 厦门快商通科技股份有限公司 Audio splicing prevention detection method and system based on GRU
CN110782907B (en) * 2019-11-06 2023-11-28 腾讯科技(深圳)有限公司 Voice signal transmitting method, device, equipment and readable storage medium
CN110689902B (en) * 2019-12-11 2020-07-14 北京影谱科技股份有限公司 Audio signal time sequence processing method, device and system based on neural network and computer readable storage medium
CN111145763A (en) * 2019-12-17 2020-05-12 厦门快商通科技股份有限公司 GRU-based voice recognition method and system in audio
CN113053401A (en) * 2019-12-26 2021-06-29 上海博泰悦臻电子设备制造有限公司 Audio acquisition method and related product
CN111276113B (en) * 2020-01-21 2023-10-17 北京永航科技有限公司 Method and device for generating key time data based on audio
CN111341341B (en) * 2020-02-11 2021-08-17 腾讯科技(深圳)有限公司 Training method of audio separation network, audio separation method, device and medium
CN111354378B (en) * 2020-02-12 2020-11-24 北京声智科技有限公司 Voice endpoint detection method, device, equipment and computer storage medium
CN111968623B (en) * 2020-08-19 2023-11-28 腾讯音乐娱乐科技(深圳)有限公司 Gas port position detection method and related equipment
CN112259119B (en) * 2020-10-19 2021-11-16 深圳市策慧科技有限公司 Music source separation method based on stacked hourglass network
CN112397073B (en) * 2020-11-04 2023-11-21 北京三快在线科技有限公司 Audio data processing method and device
CN112382310B (en) * 2020-11-12 2022-09-27 北京猿力未来科技有限公司 Human voice audio recording method and device
CN112270933B (en) * 2020-11-12 2024-03-12 北京猿力未来科技有限公司 Audio identification method and device
CN113113051A (en) * 2021-03-10 2021-07-13 深圳市声扬科技有限公司 Audio fingerprint extraction method and device, computer equipment and storage medium
CN113114417B (en) * 2021-03-30 2022-08-26 深圳市冠标科技发展有限公司 Audio transmission method and device, electronic equipment and storage medium
CN113257242A (en) * 2021-04-06 2021-08-13 杭州远传新业科技有限公司 Voice broadcast suspension method, device, equipment and medium in self-service voice service
CN113572908A (en) * 2021-06-16 2021-10-29 云茂互联智能科技(厦门)有限公司 Method, device and system for reducing noise in VoIP (Voice over Internet protocol) call
CN113242361B (en) * 2021-07-13 2021-09-24 腾讯科技(深圳)有限公司 Video processing method and device and computer readable storage medium
CN113724720B (en) * 2021-07-19 2023-07-11 电信科学技术第五研究所有限公司 Non-human voice filtering method based on neural network and MFCC (multiple frequency component carrier) in noisy environment
CN114203163A (en) * 2022-02-16 2022-03-18 荣耀终端有限公司 Audio signal processing method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101404160A (en) * 2008-11-21 2009-04-08 北京科技大学 Voice denoising method based on audio recognition
WO2014167570A1 (en) * 2013-04-10 2014-10-16 Technologies For Voice Interface System and method for extracting and using prosody features
CN105719657A (en) * 2016-02-23 2016-06-29 惠州市德赛西威汽车电子股份有限公司 Human voice extracting method and device based on microphone
CN108962277A (en) * 2018-07-20 2018-12-07 广州酷狗计算机科技有限公司 Speech signal separation method, apparatus, computer equipment and storage medium
CN109308901A (en) * 2018-09-29 2019-02-05 百度在线网络技术(北京)有限公司 Singer recognition method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006133284A (en) * 2004-11-02 2006-05-25 Kddi Corp Voice information extracting device

Also Published As

Publication number Publication date
CN110085251A (en) 2019-08-02

Similar Documents

Publication Publication Date Title
CN110085251B (en) Human voice extraction method, human voice extraction device and related products
CN111179975B (en) Voice endpoint detection method for emotion recognition, electronic device and storage medium
Sun et al. Weighted spectral features based on local Hu moments for speech emotion recognition
CN108198547B (en) Voice endpoint detection method and device, computer equipment and storage medium
CN108305643B (en) Method and device for determining emotion information
US20170140750A1 (en) Method and device for speech recognition
CN105551498A (en) Voice recognition method and device
CN111028845A (en) Multi-audio recognition method, device, equipment and readable storage medium
CN113823323B (en) Audio processing method and device based on convolutional neural network and related equipment
CN112927674B (en) Voice style migration method and device, readable medium and electronic equipment
CN112949708A (en) Emotion recognition method and device, computer equipment and storage medium
CN110136726A (en) A kind of estimation method, device, system and the storage medium of voice gender
CN113921022A (en) Audio signal separation method, device, storage medium and electronic equipment
CN113327586A (en) Voice recognition method and device, electronic equipment and storage medium
CN110992940B (en) Voice interaction method, device, equipment and computer-readable storage medium
Shah et al. Speech emotion recognition based on SVM using MATLAB
KR102018286B1 (en) Method and Apparatus for Removing Speech Components in Sound Source
KR102220964B1 (en) Method and device for audio recognition
CN114399995A (en) Method, device and equipment for training voice model and computer readable storage medium
CN111477248B (en) Audio noise detection method and device
CN114049875A (en) TTS (text to speech) broadcasting method, device, equipment and storage medium
Baghel et al. Shouted/normal speech classification using speech-specific features
CN112216285B (en) Multi-user session detection method, system, mobile terminal and storage medium
CN114783423A (en) Speech segmentation method and device based on speech rate adjustment, computer equipment and medium
Rahman et al. Continuous bangla speech segmentation, classification and feature extraction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant