CN110085251A - Voice extraction method, voice extraction apparatus and related products - Google Patents
- Publication number
- CN110085251A (application CN201910343129.5A)
- Authority
- CN
- China
- Prior art keywords
- audio
- voice
- frame
- audio frame
- spectrogram
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification
- G10L17/04—Training, enrolment or model building
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The embodiments of the present application provide a voice extraction method, comprising: performing voice extraction on mixed audio based on a voice extraction model to obtain intermediate audio, the intermediate audio comprising vocal audio frames and non-vocal audio frames; and filtering out the non-vocal audio frames of the intermediate audio based on a voice filtering model to obtain vocal audio. The embodiments of the present application can extract pure vocal audio and improve user experience.
Description
Technical field
The present application relates to the field of electronic audio signal processing, and in particular to a voice extraction method, a voice extraction apparatus and related products.
Background art
Voice extraction is a widely studied audio processing technique, and many classes of voice extraction algorithms exist. However, owing to limitations of the algorithms themselves or of their training samples, no existing voice extraction algorithm can extract clean vocals. For example, in the prior art, vocals are extracted from mixed audio by an Hourglass model; although the extracted vocal result is relatively complete and highly recognizable, parts of instrumental passages such as preludes and interludes are misidentified as vocals and retained. Thus, the prior art cannot extract completely pure vocals from mixed audio.
Summary of the invention
The embodiments of the present application provide a voice extraction method, a voice extraction apparatus and related products, so as to obtain pure vocal audio through two-step voice extraction and avoid the misidentification problem of existing voice extraction.
In a first aspect, an embodiment of the present application provides a voice extraction method, comprising:

performing voice extraction on mixed audio based on a voice extraction model to obtain intermediate audio, the intermediate audio comprising vocal audio frames and non-vocal audio frames; and

filtering out the non-vocal audio frames of the intermediate audio based on a voice filtering model to obtain vocal audio.
In a second aspect, an embodiment of the present application provides a voice extraction apparatus, comprising:

an extraction unit, configured to perform voice extraction on mixed audio based on a voice extraction model to obtain intermediate audio, the intermediate audio comprising vocal audio frames and non-vocal audio frames; and

a filtering unit, configured to filter out the non-vocal audio frames of the intermediate audio based on a voice filtering model to obtain vocal audio.
In a third aspect, an embodiment of the present application provides an electronic device comprising a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, and the programs include instructions for executing the steps of the method of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program, the computer program causing a computer to execute the method of the first aspect.
In a fifth aspect, an embodiment of the present application provides a computer program product comprising a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to execute the method of the first aspect.
Implementing the embodiments of the present application has the following beneficial effects:

As can be seen, in the embodiments of the present application, the intermediate audio extracted by the voice extraction model is input into the filtering model, and the non-vocal audio frames in the intermediate audio are filtered out. Through this two-step voice extraction, pure vocal audio is obtained, the prior-art problem of being unable to extract pure audio from mixed audio is solved, and the voice extraction effect is better.
Brief description of the drawings
In order to explain the technical solutions in the embodiments of the present application more clearly, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a schematic flowchart of a voice extraction method provided by an embodiment of the present application;

Fig. 2A is a schematic flowchart of a method for obtaining training data provided by an embodiment of the present application;

Fig. 2B is a schematic flowchart of another method for obtaining training data provided by an embodiment of the present application;

Fig. 2C is a schematic diagram of the spectrum of an audio frame provided by an embodiment of the present application;

Fig. 3 is a schematic flowchart of another voice extraction method provided by an embodiment of the present application;

Fig. 4 is a network structure of a voice filtering model provided by an embodiment of the present application;

Fig. 5 is a schematic structural diagram of a voice extraction apparatus provided by an embodiment of the present application;

Fig. 6 is a block diagram of the functional units of a voice extraction apparatus provided by an embodiment of the present application.
Detailed description of the embodiments
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are some, rather than all, of the embodiments of the present application. Based on the embodiments of the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
The terms "first", "second", "third", "fourth" and the like in the description, claims and drawings of the present application are used to distinguish different objects, not to describe a particular order. In addition, the terms "comprise" and "have" and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device containing a series of steps or units is not limited to the listed steps or units, but optionally further comprises steps or units that are not listed, or optionally further comprises other steps or units inherent to the process, method, product or device.
Reference herein to an "embodiment" means that a particular feature, result or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application. The appearances of this phrase in various places in the description do not necessarily all refer to the same embodiment, nor are they separate or alternative embodiments mutually exclusive of other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
The voice extraction apparatus in the present application may include a smartphone (such as an Android phone, an iOS phone or a Windows Phone phone), a tablet computer, a palmtop computer, a laptop, a mobile internet device (MID), a wearable device, and so on. The above electronic devices are merely examples and are not exhaustive; the apparatus includes but is not limited to the above electronic devices. In practical applications, the voice extraction apparatus is also not limited to the above implementation forms and may further include, for example, an intelligent vehicle-mounted terminal, a computer device, and the like.
Referring to Fig. 1, Fig. 1 shows a voice extraction method provided by an embodiment of the present application. The method is applied to a voice extraction apparatus and comprises steps 101-102:

Step 101: The voice extraction apparatus performs voice extraction on mixed audio based on a voice extraction model to obtain intermediate audio, the intermediate audio comprising vocal audio frames and non-vocal audio frames.

Here, voice extraction means separating recognizable vocal audio from mixed audio containing vocals and background instrumental accompaniment.
The voice extraction model is a neural network model of the prior art, for example an Hourglass model; the voice extraction process itself is not repeated here. It should be understood that when the Hourglass model performs voice extraction, its input data are individual audio frames, and vocals are extracted from each audio frame separately. Therefore, when the Hourglass model performs voice extraction on mixed audio, the extraction is based on local information of the mixed audio, so that instrumental passages such as parts of preludes and interludes are misidentified as vocals and extracted, and parts of such passages remain in the finally extracted vocal audio. Hence, pure vocal audio cannot be extracted from the mixed audio.
Step 102: The voice extraction apparatus filters out the non-vocal audio frames of the intermediate audio based on a voice filtering model to obtain vocal audio.

The voice filtering model is constructed based on a machine-learning integration algorithm, which may be the Viterbi algorithm or the conditional random field (CRF) algorithm; the present application takes the Viterbi algorithm as an example.

The Viterbi algorithm is a dynamic programming algorithm for finding the hidden-state sequence (the Viterbi path) most likely to have generated an observed event sequence. It is especially applied in the context of Markov information sources and hidden Markov models and is used to solve optimal-path problems. In the present application, the Viterbi algorithm dynamically adjusts the voice probability sequence, thereby completing the construction of the voice filtering model.
The construction process of the voice filtering model is as follows: the voice filtering model is trained in advance based on the machine-learning integration algorithm, training data, and a label sequence corresponding to the training data, where the training data and the label sequence are obtained by preprocessing existing audio files. Since the input data of the voice filtering model are audio segments, the model has a larger receptive field and can acquire the global information of the intermediate audio, so as to filter the non-vocal audio frames in the intermediate audio.
As can be seen, in this embodiment of the present application, after the intermediate audio is extracted by the voice extraction model, the voice filtering model with its larger receptive field filters the intermediate audio to remove the non-vocal audio frames, so that pure vocals are extracted from the mixed audio, the extracted vocal effect is better, and user experience is improved.
The process of preprocessing an audio file to obtain the training data and the label sequence is described in detail below.

Referring to Fig. 2A, Fig. 2A is a schematic flowchart of a method for obtaining training data and a label sequence provided by an embodiment of the present application. The method is applied to a voice extraction apparatus and comprises steps 201a-205a:
Step 201a: The voice extraction apparatus performs voice extraction on an audio file based on the voice extraction model to obtain sample audio.

Optionally, the sample audio includes vocal audio frames and non-vocal audio frames, where a non-vocal audio frame is an audio frame of a misidentified instrumental passage such as part of a prelude or an interlude.
Step 202a: The voice extraction apparatus performs framing on the sample audio to obtain N sample audio frames, N being an integer greater than 1.

Since an audio signal is non-stationary over its entire duration, it cannot be processed directly as a whole. The sample audio is therefore divided into N sample audio frames according to a preset window function and a preset step length, and each sample audio frame can be regarded as a stationary signal. To guarantee the continuity of the audio signal, any two adjacent sample audio frames overlap. For example, if the preset window length is 30 ms and the preset step length is 20 ms, any two adjacent sample audio frames have a 10 ms overlap.
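As an illustration, the overlapped framing described above can be sketched as follows. The 30 ms window and 20 ms step follow the example in the text; the 8 kHz sampling rate and the silent toy signal are assumptions made only for the sketch.

```python
def frame_signal(samples, sr, win_ms=30, hop_ms=20):
    """Split a sample sequence into overlapping frames.

    A 30 ms window with a 20 ms step gives adjacent frames a
    10 ms overlap, as in the example above.
    """
    win = int(sr * win_ms / 1000)   # samples per window
    hop = int(sr * hop_ms / 1000)   # samples per step
    return [samples[s:s + win]
            for s in range(0, len(samples) - win + 1, hop)]

# 100 ms of silence at 8 kHz -> 240-sample windows, 160-sample hop
frames = frame_signal([0.0] * 800, sr=8000)
print(len(frames), len(frames[0]))  # -> 4 240
```

Adjacent frames here share 80 samples, i.e. exactly the 10 ms overlap mentioned above.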
Step 203a: The voice extraction apparatus performs a short-time Fourier transform on each sample audio frame to obtain the spectrum of each sample audio frame.

In some possible embodiments, the spectrum may be an amplitude spectrum, a power spectrum (energy spectrum) or a log power spectrum. The present application takes the amplitude spectrum as an example.
Step 204a: The voice extraction apparatus obtains a first spectrogram of the audio file based on the spectrum of each sample audio frame, and labels the first spectrogram as training data.

The first spectrogram is a matrix composed of the spectral vectors of the sample audio frames, and the spectral vector of each sample audio frame is a column vector composed of the amplitudes corresponding to the frequency bins of that sample audio frame.

For example, referring to Fig. 2C, Fig. 2C is the spectrum of the k-th sample audio frame, 1 ≤ k ≤ N, where f1, f2, f3, …, fm are the frequency bins of the k-th sample audio frame in the frequency domain and m is the number of frequency bins per sample audio frame. The spectral vector corresponding to the spectrum of the k-th sample audio frame is [Ak1, Ak2, Ak3, …, Akm]^T. The N spectral vectors corresponding to the N sample audio frames are thus obtained, and the N spectral vectors can be assembled into the first spectrogram: the m × N matrix [v1, v2, …, vN] whose k-th column is the spectral vector vk = [Ak1, Ak2, …, Akm]^T.
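The construction of the spectral vectors and the first spectrogram can be sketched as below. The direct DFT is used only to keep the example self-contained; a real implementation would use an FFT-based short-time Fourier transform.

```python
import cmath

def magnitude_spectrum(frame):
    """Amplitude spectrum of one frame via a direct DFT.

    Returns [A_1, ..., A_m] for the m = n//2 + 1 non-redundant
    frequency bins -- the spectral vector of one sample audio frame.
    """
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * cmath.pi * k * i / n)
                    for i, x in enumerate(frame)))
            for k in range(n // 2 + 1)]

def first_spectrogram(frames):
    """Stack the per-frame spectral vectors as columns of an m x N matrix."""
    cols = [magnitude_spectrum(f) for f in frames]
    return [list(row) for row in zip(*cols)]  # row r = bin r over all N frames

spec = first_spectrogram([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
print(len(spec), len(spec[0]))  # -> 3 2  (m = 3 bins, N = 2 frames)
```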
Step 205a: The voice extraction apparatus obtains the label sequence corresponding to the training data based on the first spectrogram.

The label sequence marks the frame property of the sample audio frame corresponding to each column vector in the training data, the frame property being vocal or non-vocal. For example, the j-th element in the label sequence marks the frame property of the j-th audio frame in the training data, 1 ≤ j ≤ N, j being an integer.
In some possible embodiments, obtaining the label sequence corresponding to the training data based on the first spectrogram may be implemented as follows: determining first frame numbers corresponding to silent audio frames in the first spectrogram based on a voice activity detection (VAD) algorithm; acquiring the lyrics file corresponding to the audio file, and determining, based on the lyrics file, second frame numbers corresponding to vocal audio frames and third frame numbers corresponding to non-vocal audio frames in the first spectrogram; and obtaining the label sequence based on the first frame numbers, the second frame numbers and the third frame numbers.

In some possible embodiments, before the first frame numbers corresponding to the silent audio frames in the first spectrogram are determined based on the VAD algorithm, the method further comprises: performing spectral-subtraction noise reduction on the first spectrogram to filter out background noise in the first spectrogram. Spectral-subtraction noise reduction is prior art and is not repeated here.
Specifically, singing periods (containing lyrics) and non-singing periods (containing no lyrics) of the audio file are determined based on the lyrics file. All audio frames corresponding to the non-singing periods are non-vocal audio frames, and the audio frames corresponding to the singing periods include at least vocal audio frames. It should be understood that within any singing period there may be stages where the singer breathes between two adjacent lines of lyrics, so a singing period contains silent periods; that is, among the audio frames corresponding to a singing period there are silent audio frames. The silent periods corresponding to the silent audio frames in the audio file are therefore determined based on the VAD algorithm. Then each period is compared with the time span of each sample audio frame to obtain the period to which each sample audio frame belongs, and the frame property of each sample audio frame is determined from that period; that is, the frame numbers corresponding to vocal audio frames, to non-vocal audio frames, and to silent audio frames are determined.
For example, assume the frame length is 30 ms and the step length is 10 ms, and set the label corresponding to a vocal audio frame to 1, the label corresponding to a non-vocal audio frame to 0, and the label corresponding to a silent frame also to 0. If, for instance, 0-50 ms of the audio file belongs to a non-vocal audio period, it is determined that the 1st and 2nd audio frames are non-vocal audio frames, so the labels of the 1st and 2nd audio frames in the training data are 0. If 50-70 ms and 90-110 ms of the audio file belong to vocal audio periods, it is determined that the 3rd and 5th audio frames are vocal audio frames and the 4th frame is a silent audio frame, so the labels of the 3rd and 5th audio frames in the training data are 1 and the label of the 4th audio frame is 0, and so on. The label sequence is therefore [0, 0, 1, 0, 1, …].
Referring to Fig. 2B, which builds on Fig. 2A, Fig. 2B is a schematic flowchart of another method for obtaining training data provided by an embodiment of the present application. The method is applied to a voice extraction apparatus and comprises steps 201b-207b:
Step 201b: The voice extraction apparatus performs voice extraction on an audio file based on the voice extraction model to obtain sample audio.

Step 202b: The voice extraction apparatus performs framing on the sample audio to obtain N sample audio frames.

Step 203b: The voice extraction apparatus performs a short-time Fourier transform on each sample audio frame to obtain the spectrum of each sample audio frame.

Step 204b: The voice extraction apparatus obtains a first spectrogram of the audio file based on the spectrum of each sample audio frame.
Step 205b: The voice extraction apparatus determines the first-order difference between corresponding elements of the i-th column vector and the (i+1)-th column vector in the first spectrogram to obtain a difference vector, and vertically splices the difference vector with the (i+1)-th column vector to obtain a second spectrogram, 1 ≤ i < N, i being an integer.

Since the first sample audio frame in the first spectrogram has no corresponding difference vector, a preset difference vector A = [A01, A02, …, A0m] is vertically spliced with it, where the preset difference vector may be a zero vector whose elements are all 0, a vector of predicted elements, or the like; the present application does not impose a unique restriction.

After the preset difference vector is spliced, the second spectrogram is a 2m × N matrix whose i-th column is the difference vector for the i-th frame stacked on top of that frame's spectral vector.
Optionally, the first-order difference of the frame vectors of two adjacent sample audio frames is obtained, and the difference vectors are vertically spliced with the first spectrogram, so that each column vector of the resulting second spectrogram contains the audio information of two adjacent audio frames. Therefore, when the voice probability corresponding to each sample audio frame is calculated, the column vector corresponding to each sample audio frame carries this prior information from the preceding frame, which makes the calculated voice probability of the sample audio frame more accurate.
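The difference-and-splice construction of the second spectrogram can be sketched as follows; a zero preset difference vector is used for the first frame, which is one of the options mentioned above.

```python
def second_spectrogram(spec_cols):
    """Build second-spectrogram columns from per-frame spectral vectors.

    Each output column is [difference vector; spectral vector],
    i.e. 2m elements; the first frame gets an all-zero
    preset difference vector.
    """
    out = []
    for i, col in enumerate(spec_cols):
        if i == 0:
            diff = [0.0] * len(col)          # preset difference vector
        else:
            diff = [c - p for c, p in zip(col, spec_cols[i - 1])]
        out.append(diff + col)               # vertical splice -> 2m elements
    return out

cols = [[1.0, 2.0], [3.0, 5.0]]              # two frames, m = 2 bins
print(second_spectrogram(cols))
# -> [[0.0, 0.0, 1.0, 2.0], [2.0, 3.0, 3.0, 5.0]]
```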
Step 206b: The voice extraction apparatus labels the second spectrogram as training data.

Step 207b: The voice extraction apparatus obtains the label sequence corresponding to the training data based on the first spectrogram.
Finally, the voice filtering model is trained with the training data; this training is prior art and is not repeated here.
Referring to Fig. 3, Fig. 3 is a schematic flowchart of another voice extraction method provided by an embodiment of the present application. The method is applied to a voice extraction apparatus and comprises steps 301-306:

Step 301: The voice extraction apparatus performs voice extraction on mixed audio based on the voice extraction model to obtain intermediate audio, the intermediate audio comprising vocal audio frames and non-vocal audio frames.

Step 302: The voice extraction apparatus divides the intermediate audio into several audio segments, any two adjacent audio segments having an overlapping portion.
Optionally, the voice extraction apparatus divides the intermediate audio into several audio segments according to a preset window function and a preset step length, each audio segment containing at least one audio frame. For example, the intermediate audio may be divided into several audio segments with a 10 s window function and a 5 s step length, so that any two adjacent audio segments have a 5 s overlap.
Step 303: The voice extraction apparatus successively inputs each audio segment into the voice filtering model to obtain a first voice probability sequence of each audio segment, the first voice probability sequence indicating the probability that each audio frame in the audio segment is vocal.

Step 304: The voice extraction apparatus determines the expected voice probability of each audio frame in the overlapping portions based on the first voice probability sequences of the audio segments, and obtains a second voice probability sequence of the intermediate audio.

Optionally, the first voice probability sequence of each audio segment is determined based on the voice filtering model. Since two adjacent audio segments have an overlapping portion, the two first voice probability sequences corresponding to two adjacent audio segments both contain a voice probability for each audio frame in that overlapping portion; the voice probability of each audio frame in the overlapping portion is obtained by averaging. These averages, together with the voice probabilities corresponding to the non-overlapping portions, form the second voice probability sequence of the intermediate audio, each element of which indicates the probability that the corresponding audio frame in the intermediate audio is vocal.
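The averaging over overlapping segments can be sketched as below; the segment lengths, start offsets and probability values are made up for the illustration.

```python
def merge_segment_probs(seg_probs, seg_starts, n_frames):
    """Average per-frame voice probabilities over overlapping segments.

    seg_probs[s][j] is the model's probability that frame
    seg_starts[s] + j is vocal; frames covered by two segments
    receive the mean of both estimates.
    """
    total = [0.0] * n_frames
    count = [0] * n_frames
    for probs, start in zip(seg_probs, seg_starts):
        for j, p in enumerate(probs):
            total[start + j] += p
            count[start + j] += 1
    return [t / c for t, c in zip(total, count)]

# two 4-frame segments whose last/first two frames overlap
merged = merge_segment_probs([[0.2, 0.4, 0.6, 0.8],
                              [0.4, 0.6, 0.9, 1.0]], [0, 2], 6)
print(merged)  # frames 2 and 3 are averaged over both segments
```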
Step 305: The voice extraction apparatus determines a target voice probability sequence of the intermediate audio based on the Viterbi algorithm and the second voice probability sequence.

Optionally, the elements of the second voice probability sequence are adjusted based on the Viterbi algorithm to obtain an optimal probability sequence, which is taken as the target voice probability sequence. That is, in a manner similar to finding an optimal path, the hidden sequences corresponding to the second voice probability sequence are determined based on the Viterbi algorithm, the likelihood of each hidden sequence is obtained, and the optimal probability sequence is derived; the detailed process is prior art and is not described further.

For example, suppose the second voice probability sequence is [0.0, 0.1, 0.1, 0.2, 0.3, 0.5, 0.8, 0.7, 0.1, 0.1, 0.6, 0.7, 0.8, …]. From this sequence it can be seen that the 6th, 7th, 8th, 11th, 12th and 13th audio frames of the intermediate audio are probably vocal audio frames, while the 9th and 10th audio frames are probably non-vocal audio frames. Since a speaker's speech changes gradually, the voice probabilities should also change progressively; in general, the voice probability of one audio frame being very large while that of the next audio frame is very small does not conform to the speaker's speaking pattern. It can therefore be concluded that the voice probabilities corresponding to the 9th and 10th audio frames are problematic and need to be adjusted dynamically to conform to the speaker's speaking pattern.
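A toy version of this dynamic adjustment is sketched below as a two-state (vocal/non-vocal) Viterbi decode. The self-transition probability of 0.8, which penalises rapid vocal/non-vocal flips, and the probability values are assumptions for the illustration, not parameters fixed by the text.

```python
import math

def viterbi_smooth(voice_probs, stay=0.8):
    """Most likely vocal (1) / non-vocal (0) state path.

    The per-frame model outputs are used as emission probabilities;
    `stay` is an assumed self-transition probability that discourages
    isolated flips, mimicking the gradual-change rule in the text.
    """
    log = lambda p: math.log(max(p, 1e-10))
    trans = [[log(stay), log(1 - stay)],
             [log(1 - stay), log(stay)]]
    emit = lambda t, s: log(voice_probs[t] if s else 1 - voice_probs[t])
    score = [emit(0, 0), emit(0, 1)]
    back = []
    for t in range(1, len(voice_probs)):
        prev, score, ptr = score, [], []
        for s in (0, 1):
            best = max((0, 1), key=lambda u: prev[u] + trans[u][s])
            ptr.append(best)
            score.append(prev[best] + trans[best][s] + emit(t, s))
        back.append(ptr)
    path = [max((0, 1), key=lambda s: score[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# the isolated low-probability frame inside the vocal run stays vocal
path = viterbi_smooth([0.1, 0.1, 0.2, 0.9, 0.9, 0.2, 0.9, 0.9, 0.1, 0.1])
print(path)  # -> [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]
```

Note how the dip at the sixth frame (probability 0.2) is kept vocal because two extra state switches would cost more than the emission penalty, which is exactly the smoothing behaviour described above.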
Step 306: The voice extraction apparatus filters out the non-vocal audio frames in the intermediate audio based on the target voice probability sequence to obtain vocal audio, the non-vocal audio frames being the audio frames of the intermediate audio that correspond to target elements in the target voice probability sequence, a target element being an element meeting a preset condition.

The element meeting the preset condition may be an element below a threshold, and the threshold may be 0.5, 0.6, 0.7 or another value.
As can be seen, in this embodiment of the present application, after the intermediate audio is obtained by the voice extraction model, the intermediate audio is segmented, the input data corresponding to each audio segment are determined, and the input data are fed into the voice filtering model to filter out the non-vocal audio frames of the intermediate audio and obtain pure vocal audio. Since the input data of the voice filtering model are audio segments, whereas in the prior art the input data are single audio frames, the voice filtering model has a larger receptive field and filters non-vocal audio frames using the global information of the intermediate audio. Pure vocals are thus extracted from the mixed audio, the extracted vocal effect is better, and user experience is improved.
In some possible embodiments, voice extracting method disclosed in the present application is applied to voice mistake as shown in Figure 4
Model is filtered, which includes P identical network layers and full articulamentum, wherein the P identical network layers are with residual
Poor form connection, each network layer include: the first convolutional layer, the second convolutional layer, active coating, Fusion Features layer and feature superposition
Layer;The full articulamentum can intensively connect for multiple network layers.
First, the human voice extraction device segments the intermediate audio to obtain several audio segments. Then, a Short-Time Fourier Transform is applied to each audio segment to obtain the spectrogram corresponding to each audio segment (which may be the above first spectrogram or second spectrogram), and the input data corresponding to each audio segment is obtained based on the spectrogram; the specific conversion process follows the above process for obtaining training data and is not described in detail here. The input data is fed to the first of the P network layers of the voice filtering model. The first convolutional layer performs a first convolution operation on the input data to obtain a first feature matrix; the second convolutional layer performs a second convolution operation on the input data to obtain a second feature matrix; the activation layer performs nonlinear activation on the second feature matrix to obtain a third feature matrix; the feature fusion layer performs a cross-multiplication operation on the first feature matrix and the third feature matrix to obtain a fourth feature matrix; and the feature superposition layer superposes the fourth feature matrix and the input data to obtain the output data of the network layer. The output data serves as the input data of the next network layer, and after the P network layers, the target feature matrix of each audio segment is obtained. The fully connected layer performs a fully connected operation on the target feature matrix to obtain the feature vector corresponding to each audio segment, and the feature vector is input to a softmax classifier to obtain the voice probability sequence corresponding to each audio segment.
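The per-layer computation described above (two parallel convolutions, a nonlinearity on one branch, element-wise fusion, and a residual addition) can be sketched as follows. This is a minimal plain-Python illustration, not the patent's implementation: the activation function (sigmoid here), kernel sizes, 'same' zero-padding, and the interpretation of the cross-multiplication as an element-wise (gated) product are all assumptions, and every name is illustrative.

```python
import math

def conv1d(x, kernel):
    """'Same'-padded 1-D convolution over a list of floats (illustrative)."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(kernel[j] * padded[i + j] for j in range(len(kernel)))
            for i in range(len(x))]

def network_layer(x, k1, k2):
    """One network layer: first conv, activated second conv, fusion, residual add."""
    first = conv1d(x, k1)                                   # first feature matrix
    second = conv1d(x, k2)                                  # second feature matrix
    third = [1.0 / (1.0 + math.exp(-v)) for v in second]    # nonlinear activation (assumed sigmoid)
    fourth = [a * b for a, b in zip(first, third)]          # feature fusion (element-wise product)
    return [f + xi for f, xi in zip(fourth, x)]             # feature superposition (residual)
```

Stacking P such layers and following them with a fully connected layer and a softmax classifier would then yield the per-frame voice probability sequence described in the text.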
It should be noted that Fig. 4 shows only one possible network structure of the voice filtering model; the present application takes this network structure merely as an example and does not limit the voice filtering model to it.
The above mainly describes the solutions of the embodiments of the present application from the perspective of the method-side execution process. It can be understood that, in order to realize the above functions, the computing device includes hardware structures and/or software modules corresponding to each function. Those skilled in the art should readily appreciate that, in conjunction with the exemplary units and algorithm steps described in the embodiments disclosed herein, the present application can be implemented in hardware or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the specific application and design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each specific application, but such implementation should not be considered to be beyond the scope of the present application.
The embodiments of the present application may divide the computing device into functional units according to the above method examples; for example, each functional unit may correspond to one function, or two or more functions may be integrated into one processing unit. The above integrated unit may be implemented in the form of hardware or in the form of a software functional unit. It should be noted that the division of units in the embodiments of the present application is schematic and is merely a logical function division; other division manners are possible in actual implementation.
Consistent with the human voice extraction method embodiments described above, refer to Fig. 5, which is a schematic structural diagram of a human voice extraction device 500 provided by an embodiment of the present application. As shown in Fig. 5, the human voice extraction device 500 includes a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are different from the one or more application programs, are stored in the memory, and are configured to be executed by the processor. The programs include instructions for performing the following steps:
performing human voice extraction on mixed audio based on a voice extraction model to obtain intermediate audio, the intermediate audio including human voice audio frames and non-human-voice audio frames;
filtering out the non-human-voice audio frames of the intermediate audio based on a voice filtering model to obtain human voice audio.
In a possible embodiment, the voice filtering model is built based on a machine learning ensemble algorithm, and the above programs further include instructions for performing the following steps:
before filtering out the non-human-voice audio frames of the intermediate audio based on the voice filtering model, preprocessing an audio file to obtain training data and a label sequence, and using the training data and the label sequence to optimize and train the voice filtering model.
In a possible embodiment, in terms of preprocessing the audio file to obtain the training data and the label sequence, the above programs specifically include instructions for performing the following steps:
performing human voice extraction on the audio file based on the voice extraction model to obtain sample audio;
performing framing processing on the sample audio to obtain N sample audio frames, N being an integer greater than 1;
performing a Short-Time Fourier Transform on each sample audio frame to obtain the spectrum of each sample audio frame;
obtaining a first spectrogram of the audio file based on the spectrum of each sample audio frame, the first spectrogram being a matrix composed of the spectral vectors of the sample audio frames, the spectral vector of each sample audio frame being a column vector composed of the amplitudes corresponding to the frequency bins of that sample audio frame;
labeling the first spectrogram as training data;
obtaining the label sequence corresponding to the training data based on the first spectrogram, the label sequence being used to mark the frame attribute of the sample audio frame corresponding to each column vector in the training data, the frame attribute including human voice and non-human voice.
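The framing and spectrogram construction described above can be sketched as follows, using a plain-DFT magnitude spectrum per frame. The source does not specify the frame length, hop size, or window, so the rectangular window and dimensions used here are illustrative assumptions.

```python
import cmath

def first_spectrogram(samples, frame_len, hop):
    """Frame the sample audio, take each frame's magnitude spectrum, and
    assemble the column-per-frame matrix described above (illustrative)."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, hop)]

    def magnitude_spectrum(frame):
        n = len(frame)
        return [abs(sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                        for t in range(n)))
                for k in range(n // 2 + 1)]   # one amplitude per frequency bin

    spectra = [magnitude_spectrum(f) for f in frames]
    # Transpose so that each column is one audio frame's spectral vector.
    return [list(col) for col in zip(*spectra)]
```

In practice an FFT-based routine would be used instead of this direct DFT; the sketch only fixes the data layout (rows = frequency bins, columns = audio frames).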
In a possible embodiment, before the first spectrogram is labeled as training data, the above programs further include instructions for performing the following steps:
determining the first-order difference of the corresponding elements of the i-th column vector and the (i+1)-th column vector in the first spectrogram to obtain a difference vector, where 1 ≤ i ≤ N and i is an integer;
vertically concatenating the difference vector with the (i+1)-th column vector to obtain a second spectrogram.
Labeling the first spectrogram as training data then comprises:
labeling the second spectrogram as training data.
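The first-order-difference augmentation above can be sketched as follows. Since the last column has no successor, this sketch simply emits one fewer column; that edge handling is an assumption the source does not spell out.

```python
def second_spectrogram(first):
    """first: list of columns (per-frame spectral vectors) of the first spectrogram.
    Stack each adjacent-column difference on top of the (i+1)-th column."""
    cols = []
    for i in range(len(first) - 1):
        diff = [b - a for a, b in zip(first[i], first[i + 1])]  # first-order difference
        cols.append(diff + first[i + 1])                        # vertical concatenation
    return cols
```

Each resulting column thus carries both the frame's spectrum and how it changed from the previous frame, doubling the feature dimension.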
In a possible embodiment, in terms of obtaining the label sequence corresponding to the training data based on the first spectrogram, the above programs specifically include instructions for performing the following steps:
determining, based on a voice activity detection algorithm, the first frame numbers corresponding to silent audio frames in the first spectrogram;
obtaining the lyrics file corresponding to the audio file, and determining, based on the lyrics file, the second frame numbers corresponding to human voice audio frames and the third frame numbers corresponding to non-human-voice audio frames in the first spectrogram;
obtaining the label sequence based on the first frame numbers, the second frame numbers, and the third frame numbers.
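One way to combine the three frame-number sets into a per-frame label sequence is sketched below. The source does not give the exact combination rule, so the precedence used here (VAD-detected silence overrides lyric timing) is an assumption, and the function name is illustrative.

```python
def build_label_sequence(n_frames, silent_frames, vocal_frames):
    """silent_frames: frame indices flagged by the VAD (first frame numbers);
    vocal_frames: frame indices covered by lyric lines (second frame numbers).
    All remaining frames are treated as non-human voice (third frame numbers)."""
    return [1 if (i in vocal_frames and i not in silent_frames) else 0
            for i in range(n_frames)]
```

A lyric-covered frame that the VAD marks as silent (e.g. a pause inside a sung line) is thus labeled non-voice under this assumed precedence.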
In a possible embodiment, in terms of filtering out the non-human-voice audio frames of the intermediate audio based on the voice filtering model, the above programs specifically include instructions for performing the following steps:
segmenting the intermediate audio into several audio segments, any two adjacent audio segments having an overlapping portion;
inputting each audio segment in turn to the voice filtering model to obtain a first voice probability sequence of each audio segment, the first voice probability sequence indicating the probability that each audio frame in the audio segment is human voice;
determining the mean voice probability of each audio frame in the overlapping portions based on the first voice probability sequences of the audio segments, to obtain a second voice probability sequence of the intermediate audio;
determining a target voice probability sequence of the intermediate audio based on the Viterbi algorithm and the second voice probability sequence;
filtering out the non-human-voice audio frames in the intermediate audio based on the target voice probability sequence to obtain the human voice audio, the non-human-voice audio frames being the audio frames in the intermediate audio that correspond to target elements in the target voice probability sequence, the target elements being elements that satisfy a preset condition.
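The overlap averaging and Viterbi smoothing described above can be sketched as follows. This is an illustrative two-state (voice / non-voice) decoder with sticky transition probabilities; the actual segment length, overlap, transition matrix, and the preset condition on target elements are not specified in the source, so all parameters here are assumptions.

```python
import math

def merge_overlaps(segments, hop, total_frames):
    """Average per-frame voice probabilities over overlapping segments;
    segments[k] holds the probabilities for frames starting at k * hop."""
    sums = [0.0] * total_frames
    counts = [0] * total_frames
    for k, probs in enumerate(segments):
        for j, p in enumerate(probs):
            sums[k * hop + j] += p
            counts[k * hop + j] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

def viterbi_smooth(probs, stay=0.9):
    """Two-state Viterbi decode of the merged sequence:
    state 1 = voice (emits p), state 0 = non-voice (emits 1 - p)."""
    switch, eps = 1.0 - stay, 1e-9
    trans = [[math.log(stay), math.log(switch)],
             [math.log(switch), math.log(stay)]]
    v = [math.log(1.0 - probs[0] + eps), math.log(probs[0] + eps)]
    back = []
    for p in probs[1:]:
        emit = [math.log(1.0 - p + eps), math.log(p + eps)]
        ptrs, nv = [], []
        for s in (0, 1):
            best = max((0, 1), key=lambda q: v[q] + trans[q][s])
            ptrs.append(best)
            nv.append(v[best] + trans[best][s] + emit[s])
        back.append(ptrs)
        v = nv
    state = max((0, 1), key=lambda s: v[s])
    path = [state]
    for ptrs in reversed(back):
        state = ptrs[state]
        path.append(state)
    return path[::-1]
```

With sticky transitions, an isolated low-probability frame inside a voiced run is smoothed over rather than cut out, which is the practical benefit of the Viterbi pass over simple thresholding.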
Referring to Fig. 6, Fig. 6 shows a block diagram of a possible functional-unit composition of the human voice extraction device 600 involved in the above embodiments. The human voice extraction device 600 includes an extraction unit 610 and a filtering unit 620, wherein:
the extraction unit 610 is configured to perform human voice extraction on mixed audio based on a voice extraction model to obtain intermediate audio, the intermediate audio including human voice audio frames and non-human-voice audio frames;
the filtering unit 620 is configured to filter out the non-human-voice audio frames of the intermediate audio based on a voice filtering model to obtain human voice audio.
In a possible embodiment, the voice filtering model is built based on a machine learning ensemble algorithm, and the human voice extraction device 600 further includes a training unit 630. The training unit 630 is configured to: before the non-human-voice audio frames of the intermediate audio are filtered out based on the voice filtering model, preprocess an audio file to obtain training data and a label sequence, and use the training data and the label sequence to optimize and train the voice filtering model.
In a possible embodiment, in terms of preprocessing the audio file to obtain the training data and the label sequence, the training unit 630 is specifically configured to: perform human voice extraction on the audio file based on the voice extraction model to obtain sample audio; perform framing processing on the sample audio to obtain N sample audio frames, N being an integer greater than 1; perform a Short-Time Fourier Transform on each sample audio frame to obtain the spectrum of each sample audio frame; obtain a first spectrogram of the audio file based on the spectrum of each sample audio frame, the first spectrogram being a matrix composed of the spectral vectors of the sample audio frames, the spectral vector of each sample audio frame being a column vector composed of the amplitudes corresponding to the frequency bins of that sample audio frame; label the first spectrogram as training data; and obtain the label sequence corresponding to the training data based on the first spectrogram, the label sequence being used to mark the frame attribute of the sample audio frame corresponding to each column vector in the training data, the frame attribute including human voice and non-human voice.
In a possible embodiment, before the first spectrogram is labeled as training data, the training unit 630 is further configured to: determine the first-order difference of the corresponding elements of the i-th column vector and the (i+1)-th column vector in the first spectrogram to obtain a difference vector, where 1 ≤ i ≤ N and i is an integer; and vertically concatenate the difference vector with the (i+1)-th column vector to obtain a second spectrogram. In terms of labeling the first spectrogram as training data, the training unit 630 is specifically configured to: label the second spectrogram as training data.
In a possible embodiment, in terms of obtaining the label sequence corresponding to the training data based on the first spectrogram, the training unit 630 is specifically configured to: determine, based on a voice activity detection algorithm, the first frame numbers corresponding to silent audio frames in the first spectrogram; obtain the lyrics file corresponding to the audio file, and determine, based on the lyrics file, the second frame numbers corresponding to human voice audio frames and the third frame numbers corresponding to non-human-voice audio frames in the first spectrogram; and obtain the label sequence based on the first frame numbers, the second frame numbers, and the third frame numbers.
In a possible embodiment, in terms of filtering out the non-human-voice audio frames of the intermediate audio based on the voice filtering model, the filtering unit 620 is specifically configured to: segment the intermediate audio into several audio segments, any two adjacent audio segments having an overlapping portion; input each audio segment in turn to the voice filtering model to obtain a first voice probability sequence of each audio segment, the first voice probability sequence indicating the probability that each audio frame in the audio segment is human voice; determine the mean voice probability of each audio frame in the overlapping portions based on the first voice probability sequences of the audio segments, to obtain a second voice probability sequence of the intermediate audio; determine a target voice probability sequence of the intermediate audio based on the Viterbi algorithm and the second voice probability sequence; and filter out the non-human-voice audio frames in the intermediate audio based on the target voice probability sequence to obtain the human voice audio, the non-human-voice audio frames being the audio frames in the intermediate audio that correspond to target elements in the target voice probability sequence, the target elements being elements that satisfy a preset condition.
An embodiment of the present application also provides a computer storage medium. The computer-readable storage medium stores a computer program, and the computer program is executed by a processor to implement some or all of the steps of any human voice extraction method described in the above method embodiments.
An embodiment of the present application also provides a computer program product. The computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute some or all of the steps of any human voice extraction method described in the above method embodiments.
It should be noted that, for brevity, the foregoing method embodiments are described as a series of action combinations; however, those skilled in the art should understand that the present application is not limited by the described order of actions, because according to the present application some steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in this specification are optional embodiments, and the actions and modules involved are not necessarily required by the present application.
Each of the above embodiments has its own emphasis; for parts that are not described in detail in one embodiment, reference may be made to the related descriptions of the other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed device may be implemented in other ways. For example, the device embodiments described above are merely illustrative; for instance, the division of the units is only a logical function division, and other division manners are possible in actual implementation, e.g., multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may be electrical or in other forms.
The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware or in the form of a software program module.
If the integrated unit is implemented in the form of a software program module and sold or used as an independent product, it may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned memory includes various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable memory, and the memory may include a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.
The embodiments of the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the above descriptions of the embodiments are intended only to help understand the method of the present application and its core ideas. Meanwhile, a person skilled in the art may make changes to the specific implementations and application scope according to the ideas of the present application. In summary, the contents of this specification should not be construed as limiting the present application.
Claims (10)
1. A human voice extraction method, characterized by comprising:
performing human voice extraction on mixed audio based on a voice extraction model to obtain intermediate audio, the intermediate audio including human voice audio frames and non-human-voice audio frames;
filtering out the non-human-voice audio frames of the intermediate audio based on a voice filtering model to obtain human voice audio.
2. The method according to claim 1, characterized in that the voice filtering model is built based on a machine learning ensemble algorithm, and the method further comprises: before filtering out the non-human-voice audio frames of the intermediate audio based on the voice filtering model, preprocessing an audio file to obtain training data and a label sequence, and using the training data and the label sequence to optimize and train the voice filtering model.
3. The method according to claim 2, characterized in that the preprocessing of the audio file to obtain the training data and the label sequence comprises:
performing human voice extraction on the audio file based on the voice extraction model to obtain sample audio;
performing framing processing on the sample audio to obtain N sample audio frames, N being an integer greater than 1;
performing a Short-Time Fourier Transform on each sample audio frame to obtain the spectrum of each sample audio frame;
obtaining a first spectrogram of the audio file based on the spectrum of each sample audio frame, the first spectrogram being a matrix composed of the spectral vectors of the sample audio frames, the spectral vector of each sample audio frame being a column vector composed of the amplitudes corresponding to the frequency bins of that sample audio frame;
labeling the first spectrogram as training data;
obtaining the label sequence corresponding to the training data based on the first spectrogram, the label sequence being used to mark the frame attribute of the sample audio frame corresponding to each column vector in the training data, the frame attribute including human voice and non-human voice.
4. The method according to claim 3, characterized in that, before the first spectrogram is labeled as training data, the method further comprises:
determining the first-order difference of the corresponding elements of the i-th column vector and the (i+1)-th column vector in the first spectrogram to obtain a difference vector, where 1 ≤ i ≤ N and i is an integer;
vertically concatenating the difference vector with the (i+1)-th column vector to obtain a second spectrogram;
wherein the labeling of the first spectrogram as training data comprises:
labeling the second spectrogram as training data.
5. The method according to claim 3 or 4, characterized in that the obtaining of the label sequence corresponding to the training data based on the first spectrogram comprises:
determining, based on a voice activity detection algorithm, the first frame numbers corresponding to silent audio frames in the first spectrogram;
obtaining the lyrics file corresponding to the audio file, and determining, based on the lyrics file, the second frame numbers corresponding to human voice audio frames and the third frame numbers corresponding to non-human-voice audio frames in the first spectrogram;
obtaining the label sequence based on the first frame numbers, the second frame numbers, and the third frame numbers.
6. The method according to claim 1, characterized in that the filtering out of the non-human-voice audio frames of the intermediate audio based on the voice filtering model comprises:
segmenting the intermediate audio into several audio segments, any two adjacent audio segments having an overlapping portion;
inputting each audio segment in turn to the voice filtering model to obtain a first voice probability sequence of each audio segment, the first voice probability sequence indicating the probability that each audio frame in the audio segment is human voice;
determining the mean voice probability of each audio frame in the overlapping portions based on the first voice probability sequences of the audio segments, to obtain a second voice probability sequence of the intermediate audio;
determining a target voice probability sequence of the intermediate audio based on the Viterbi algorithm and the second voice probability sequence;
filtering out the non-human-voice audio frames in the intermediate audio based on the target voice probability sequence to obtain the human voice audio, the non-human-voice audio frames being the audio frames in the intermediate audio that correspond to target elements in the target voice probability sequence, the target elements being elements that satisfy a preset condition.
7. A human voice extraction device, characterized by comprising:
an extraction unit configured to perform human voice extraction on mixed audio based on a voice extraction model to obtain intermediate audio, the intermediate audio including human voice audio frames and non-human-voice audio frames;
a filtering unit configured to filter out the non-human-voice audio frames of the intermediate audio based on a voice filtering model to obtain human voice audio.
8. The device according to claim 7, characterized in that the device further comprises a training unit, wherein the training unit is configured to: before the non-human-voice audio frames of the intermediate audio are filtered out based on the voice filtering model, preprocess an audio file to obtain training data and a label sequence, and use the training data and the label sequence to optimize and train the voice filtering model.
9. An electronic device, characterized by comprising a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, the programs including instructions for performing the steps of the method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, the computer program being executed by a processor to implement the method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910343129.5A CN110085251B (en) | 2019-04-26 | 2019-04-26 | Human voice extraction method, human voice extraction device and related products |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110085251A true CN110085251A (en) | 2019-08-02 |
CN110085251B CN110085251B (en) | 2021-06-25 |
Family
ID=67416989
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910343129.5A Active CN110085251B (en) | 2019-04-26 | 2019-04-26 | Human voice extraction method, human voice extraction device and related products |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110085251B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2006133284A (en) * | 2004-11-02 | 2006-05-25 | Kddi Corp | Voice information extracting device |
CN101404160A (en) * | 2008-11-21 | 2009-04-08 | 北京科技大学 | Voice denoising method based on audio recognition |
WO2014167570A1 (en) * | 2013-04-10 | 2014-10-16 | Technologies For Voice Interface | System and method for extracting and using prosody features |
CN105719657A (en) * | 2016-02-23 | 2016-06-29 | 惠州市德赛西威汽车电子股份有限公司 | Human voice extracting method and device based on microphone |
CN108962277A (en) * | 2018-07-20 | 2018-12-07 | 广州酷狗计算机科技有限公司 | Speech signal separation method, apparatus, computer equipment and storage medium |
CN109308901A (en) * | 2018-09-29 | 2019-02-05 | 百度在线网络技术(北京)有限公司 | Chanteur's recognition methods and device |
2019-04-26: application CN201910343129.5A filed (CN); granted as CN110085251B, status Active.
Cited By (35)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110428853A (en) * | 2019-08-30 | 2019-11-08 | 北京太极华保科技股份有限公司 | Voice activity detection method, Voice activity detection device and electronic equipment |
CN112687274A (en) * | 2019-10-17 | 2021-04-20 | 北京猎户星空科技有限公司 | Voice information processing method, device, equipment and medium |
CN110718228A (en) * | 2019-10-22 | 2020-01-21 | 中信银行股份有限公司 | Voice separation method and device, electronic equipment and computer readable storage medium |
CN110942776B (en) * | 2019-10-31 | 2022-12-06 | 厦门快商通科技股份有限公司 | Audio splicing prevention detection method and system based on GRU |
CN110942776A (en) * | 2019-10-31 | 2020-03-31 | 厦门快商通科技股份有限公司 | Audio splicing prevention detection method and system based on GRU |
CN110782907A (en) * | 2019-11-06 | 2020-02-11 | 腾讯科技(深圳)有限公司 | Method, device and equipment for transmitting voice signal and readable storage medium |
CN110782907B (en) * | 2019-11-06 | 2023-11-28 | 腾讯科技(深圳)有限公司 | Voice signal transmitting method, device, equipment and readable storage medium |
WO2021115083A1 (en) * | 2019-12-11 | 2021-06-17 | 北京影谱科技股份有限公司 | Audio signal time sequence processing method, apparatus and system based on neural network, and computer-readable storage medium |
CN111145763A (en) * | 2019-12-17 | 2020-05-12 | 厦门快商通科技股份有限公司 | GRU-based voice recognition method and system in audio |
CN113053401A (en) * | 2019-12-26 | 2021-06-29 | 上海博泰悦臻电子设备制造有限公司 | Audio acquisition method and related product |
CN111276113A (en) * | 2020-01-21 | 2020-06-12 | 北京永航科技有限公司 | Method and device for generating key time data based on audio |
CN111276113B (en) * | 2020-01-21 | 2023-10-17 | 北京永航科技有限公司 | Method and device for generating key time data based on audio |
CN111341341A (en) * | 2020-02-11 | 2020-06-26 | 腾讯科技(深圳)有限公司 | Training method of audio separation network, audio separation method, device and medium |
CN111354378B (en) * | 2020-02-12 | 2020-11-24 | 北京声智科技有限公司 | Voice endpoint detection method, device, equipment and computer storage medium |
CN111354378A (en) * | 2020-02-12 | 2020-06-30 | 北京声智科技有限公司 | Voice endpoint detection method, device, equipment and computer storage medium |
CN111968623A (en) * | 2020-08-19 | 2020-11-20 | 腾讯音乐娱乐科技(深圳)有限公司 | Breathing-point position detection method and related equipment |
CN111968623B (en) * | 2020-08-19 | 2023-11-28 | 腾讯音乐娱乐科技(深圳)有限公司 | Breathing-point position detection method and related equipment |
CN112259119A (en) * | 2020-10-19 | 2021-01-22 | 成都明杰科技有限公司 | Music source separation method based on stacked hourglass network |
CN112397073A (en) * | 2020-11-04 | 2021-02-23 | 北京三快在线科技有限公司 | Audio data processing method and device |
CN112397073B (en) * | 2020-11-04 | 2023-11-21 | 北京三快在线科技有限公司 | Audio data processing method and device |
CN112270933B (en) * | 2020-11-12 | 2024-03-12 | 北京猿力未来科技有限公司 | Audio identification method and device |
CN112270933A (en) * | 2020-11-12 | 2021-01-26 | 北京猿力未来科技有限公司 | Audio identification method and device |
WO2022100691A1 (en) * | 2020-11-12 | 2022-05-19 | 北京猿力未来科技有限公司 | Audio recognition method and device |
WO2022100692A1 (en) * | 2020-11-12 | 2022-05-19 | 北京猿力未来科技有限公司 | Human voice audio recording method and apparatus |
CN112382310A (en) * | 2020-11-12 | 2021-02-19 | 北京猿力未来科技有限公司 | Human voice audio recording method and device |
CN112818163A (en) * | 2021-01-22 | 2021-05-18 | 惠州Tcl移动通信有限公司 | Song display processing method, device, terminal and medium based on mobile terminal |
CN113113051A (en) * | 2021-03-10 | 2021-07-13 | 深圳市声扬科技有限公司 | Audio fingerprint extraction method and device, computer equipment and storage medium |
CN113114417A (en) * | 2021-03-30 | 2021-07-13 | 深圳市冠标科技发展有限公司 | Audio transmission method and device, electronic equipment and storage medium |
CN113257242A (en) * | 2021-04-06 | 2021-08-13 | 杭州远传新业科技有限公司 | Voice broadcast suspension method, device, equipment and medium in self-service voice service |
CN113572908A (en) * | 2021-06-16 | 2021-10-29 | 云茂互联智能科技(厦门)有限公司 | Method, device and system for reducing noise in a VoIP (Voice over Internet Protocol) call |
CN113242361A (en) * | 2021-07-13 | 2021-08-10 | 腾讯科技(深圳)有限公司 | Video processing method and device and computer readable storage medium |
CN113242361B (en) * | 2021-07-13 | 2021-09-24 | 腾讯科技(深圳)有限公司 | Video processing method and device and computer readable storage medium |
CN113724720B (en) * | 2021-07-19 | 2023-07-11 | 电信科学技术第五研究所有限公司 | Non-human voice filtering method in noisy environments based on a neural network and MFCCs (Mel-frequency cepstral coefficients) |
CN113724720A (en) * | 2021-07-19 | 2021-11-30 | 电信科学技术第五研究所有限公司 | Non-human voice filtering method in noisy environments based on a neural network and MFCCs (Mel-frequency cepstral coefficients) |
CN114203163A (en) * | 2022-02-16 | 2022-03-18 | 荣耀终端有限公司 | Audio signal processing method and device |
Also Published As
Publication number | Publication date |
---|---|
CN110085251B (en) | 2021-06-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110085251A (en) | Voice extraction method, voice extraction device and related product | |
Zhang et al. | Denoising deep neural networks based voice activity detection | |
CN108109619A (en) | Auditory selection method and device based on memory and attention models | |
CN107578775A (en) | A multitask speech classification method based on deep neural networks | |
CN108847249A (en) | Voice conversion optimization method and system | |
CN108198569A (en) | An audio processing method, device, equipment and readable storage medium | |
CN107195296A (en) | A speech recognition method, device, terminal and system | |
CN110675891B (en) | Voice separation method and module based on multilayer attention mechanism | |
CN108986798B (en) | Processing method, device and equipment for voice data | |
CN113921022B (en) | Audio signal separation method, device, storage medium and electronic equipment | |
Huang et al. | Extraction of adaptive wavelet packet filter‐bank‐based acoustic feature for speech emotion recognition | |
CN106033669B (en) | Audio recognition method and device | |
CN107910008A (en) | A speech recognition method for personal devices based on multiple acoustic models | |
CN110268471A (en) | Method and device for ASR with embedded noise reduction | |
CN104952446A (en) | Digital building presentation system based on voice interaction | |
Bansal et al. | Phoneme based model for gender identification and adult-child classification | |
CN108766416A (en) | Speech recognition method and related product | |
KR102220964B1 (en) | Method and device for audio recognition | |
Mitra et al. | Speech inversion: Benefits of tract variables over pellet trajectories | |
CN106128472A (en) | Processing method and device for a singer's voice | |
Venkateswarlu et al. | Performance on speech enhancement objective quality measures using hybrid wavelet thresholding | |
Dai et al. | 2D Psychoacoustic modeling of equivalent masking for automatic speech recognition | |
CN111696524A (en) | Character-overlapping voice recognition method and system | |
Muni et al. | Deep learning techniques for speech emotion recognition | |
CN110189747A (en) | Voice signal recognition method, device and equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||