CN110085251A - Voice extraction method, voice extraction apparatus and related products - Google Patents
- Publication number
- CN110085251A (application CN201910343129.5A)
- Authority
- CN
- China
- Prior art keywords
- audio
- voice
- frame
- audio frame
- spectrogram
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification
- G10L17/04—Training, enrolment or model building
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The embodiments of the present application provide a voice extraction method, comprising: performing voice extraction on mixed audio based on a voice extraction model to obtain intermediate audio, the intermediate audio comprising vocal audio frames and non-vocal audio frames; and filtering out the non-vocal audio frames of the intermediate audio based on a voice filtering model to obtain vocal audio. The embodiments of the present application can extract pure vocal audio and improve user experience.
Description
Technical field
The present application relates to the field of electronic audio signal processing, and in particular to a voice extraction method, a voice extraction apparatus and related products.
Background art
Voice extraction is a widely studied audio processing technique, and many classes of voice extraction algorithms exist. However, owing to limitations of the algorithms themselves or of their training samples, no existing voice extraction algorithm can extract clean vocals. For example, in the prior art, vocals are extracted from mixed audio by an Hourglass model; although the extracted vocal result is relatively complete and highly recognizable, parts of instrumental passages such as preludes and interludes are misidentified as vocals and retained. Thus, the prior art cannot extract completely pure vocals from mixed audio.
Summary of the invention
The embodiments of the present application provide a voice extraction method, a voice extraction apparatus and related products, so as to obtain pure vocal audio through two-step voice extraction and avoid the misidentification problem of existing voice extraction.
In a first aspect, an embodiment of the present application provides a voice extraction method, comprising:

performing voice extraction on mixed audio based on a voice extraction model to obtain intermediate audio, the intermediate audio comprising vocal audio frames and non-vocal audio frames; and

filtering out the non-vocal audio frames of the intermediate audio based on a voice filtering model to obtain vocal audio.
In a second aspect, an embodiment of the present application provides a voice extraction apparatus, comprising:

an extraction unit, configured to perform voice extraction on mixed audio based on a voice extraction model to obtain intermediate audio, the intermediate audio comprising vocal audio frames and non-vocal audio frames; and

a filtering unit, configured to filter out the non-vocal audio frames of the intermediate audio based on a voice filtering model to obtain vocal audio.
In a third aspect, an embodiment of the present application provides an electronic device comprising a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, and the programs include instructions for executing the steps of the method of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program, the computer program causing a computer to execute the method of the first aspect.
In a fifth aspect, an embodiment of the present application provides a computer program product comprising a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to execute the method of the first aspect.
Implementing the embodiments of the present application has the following beneficial effects:

As can be seen, in the embodiments of the present application, the intermediate audio extracted by the voice extraction model is input into the filtering model, and the non-vocal audio frames in the intermediate audio are filtered out. Through this two-step voice extraction, pure vocal audio is obtained, the prior-art problem of being unable to extract pure audio from mixed audio is solved, and the voice extraction effect is better.
Brief description of the drawings
In order to explain the technical solutions in the embodiments of the present application more clearly, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a schematic flowchart of a voice extraction method provided by an embodiment of the present application;

Fig. 2A is a schematic flowchart of a method for obtaining training data provided by an embodiment of the present application;

Fig. 2B is a schematic flowchart of another method for obtaining training data provided by an embodiment of the present application;

Fig. 2C is a schematic diagram of the spectrum of an audio frame provided by an embodiment of the present application;

Fig. 3 is a schematic flowchart of another voice extraction method provided by an embodiment of the present application;

Fig. 4 is a network structure of a voice filtering model provided by an embodiment of the present application;

Fig. 5 is a schematic structural diagram of a voice extraction apparatus provided by an embodiment of the present application;

Fig. 6 is a block diagram of the functional units of a voice extraction apparatus provided by an embodiment of the present application.
Detailed description of the embodiments
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are some, rather than all, of the embodiments of the present application. Based on the embodiments of the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
The terms "first", "second", "third", "fourth" and the like in the description, claims and drawings of the present application are used to distinguish different objects, not to describe a particular order. In addition, the terms "comprise" and "have" and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device containing a series of steps or units is not limited to the listed steps or units, but optionally further comprises steps or units that are not listed, or optionally further comprises other steps or units inherent to the process, method, product or device.
Reference herein to an "embodiment" means that a particular feature, result or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application. The appearances of this phrase in various places in the description do not necessarily all refer to the same embodiment, nor are they separate or alternative embodiments mutually exclusive of other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
The voice extraction apparatus in the present application may include a smartphone (such as an Android phone, an iOS phone or a Windows Phone phone), a tablet computer, a palmtop computer, a laptop, a mobile internet device (MID), a wearable device, and so on. The above electronic devices are merely examples and are not exhaustive; the apparatus includes but is not limited to the above electronic devices. In practical applications, the voice extraction apparatus is also not limited to the above implementation forms and may further include, for example, an intelligent vehicle-mounted terminal, a computer device, and the like.
Referring to Fig. 1, Fig. 1 shows a voice extraction method provided by an embodiment of the present application. The method is applied to a voice extraction apparatus and comprises steps 101-102:

Step 101: The voice extraction apparatus performs voice extraction on mixed audio based on a voice extraction model to obtain intermediate audio, the intermediate audio comprising vocal audio frames and non-vocal audio frames.

Here, voice extraction means separating recognizable vocal audio from mixed audio containing vocals and background instrumental accompaniment.
The voice extraction model is a neural network model of the prior art, for example an Hourglass model; the voice extraction process itself is not repeated here. It should be understood that when the Hourglass model performs voice extraction, its input data are individual audio frames, and vocals are extracted from each audio frame separately. Therefore, when the Hourglass model performs voice extraction on mixed audio, the extraction is based on local information of the mixed audio, so that instrumental passages such as parts of preludes and interludes are misidentified as vocals and extracted, and parts of such passages remain in the finally extracted vocal audio. Hence, pure vocal audio cannot be extracted from the mixed audio.
Step 102: The voice extraction apparatus filters out the non-vocal audio frames of the intermediate audio based on a voice filtering model to obtain vocal audio.

The voice filtering model is constructed based on a machine-learning integration algorithm, which may be the Viterbi algorithm or the conditional random field (CRF) algorithm; the present application takes the Viterbi algorithm as an example.

The Viterbi algorithm is a dynamic programming algorithm for finding the hidden-state sequence (the Viterbi path) most likely to have generated an observed event sequence. It is especially applied in the context of Markov information sources and hidden Markov models and is used to solve optimal-path problems. In the present application, the Viterbi algorithm dynamically adjusts the voice probability sequence, thereby completing the construction of the voice filtering model.
The construction process of the voice filtering model is as follows: the voice filtering model is trained in advance based on the machine-learning integration algorithm, training data, and a label sequence corresponding to the training data, where the training data and the label sequence are obtained by preprocessing existing audio files. Since the input data of the voice filtering model are audio segments, the model has a larger receptive field and can acquire the global information of the intermediate audio, so as to filter the non-vocal audio frames in the intermediate audio.
As can be seen, in this embodiment of the present application, after the intermediate audio is extracted by the voice extraction model, the voice filtering model with its larger receptive field filters the intermediate audio to remove the non-vocal audio frames, so that pure vocals are extracted from the mixed audio, the extracted vocal effect is better, and user experience is improved.
The process of preprocessing an audio file to obtain the training data and the label sequence is described in detail below.

Referring to Fig. 2A, Fig. 2A is a schematic flowchart of a method for obtaining training data and a label sequence provided by an embodiment of the present application. The method is applied to a voice extraction apparatus and comprises steps 201a-205a:
Step 201a: The voice extraction apparatus performs voice extraction on an audio file based on the voice extraction model to obtain sample audio.

Optionally, the sample audio includes vocal audio frames and non-vocal audio frames, where a non-vocal audio frame is an audio frame of a misidentified instrumental passage such as part of a prelude or an interlude.
Step 202a: The voice extraction apparatus performs framing on the sample audio to obtain N sample audio frames, N being an integer greater than 1.

Since an audio signal is non-stationary over its entire duration, it cannot be processed directly as a whole. The sample audio is therefore divided into N sample audio frames according to a preset window function and a preset step length, and each sample audio frame can be regarded as a stationary signal. To guarantee the continuity of the audio signal, any two adjacent sample audio frames overlap. For example, if the preset window length is 30 ms and the preset step length is 20 ms, any two adjacent sample audio frames have a 10 ms overlap.
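As an illustration, the overlapped framing described above can be sketched as follows. The 30 ms window and 20 ms step follow the example in the text; the 8 kHz sampling rate and the silent toy signal are assumptions made only for the sketch.

```python
def frame_signal(samples, sr, win_ms=30, hop_ms=20):
    """Split a sample sequence into overlapping frames.

    A 30 ms window with a 20 ms step gives adjacent frames a
    10 ms overlap, as in the example above.
    """
    win = int(sr * win_ms / 1000)   # samples per window
    hop = int(sr * hop_ms / 1000)   # samples per step
    return [samples[s:s + win]
            for s in range(0, len(samples) - win + 1, hop)]

# 100 ms of silence at 8 kHz -> 240-sample windows, 160-sample hop
frames = frame_signal([0.0] * 800, sr=8000)
print(len(frames), len(frames[0]))  # -> 4 240
```

Adjacent frames here share 80 samples, i.e. exactly the 10 ms overlap mentioned above.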
Step 203a: The voice extraction apparatus performs a short-time Fourier transform on each sample audio frame to obtain the spectrum of each sample audio frame.

In some possible embodiments, the spectrum may be an amplitude spectrum, a power spectrum (energy spectrum) or a log power spectrum. The present application takes the amplitude spectrum as an example.
Step 204a: The voice extraction apparatus obtains a first spectrogram of the audio file based on the spectrum of each sample audio frame, and labels the first spectrogram as training data.

The first spectrogram is a matrix composed of the spectral vectors of the sample audio frames, and the spectral vector of each sample audio frame is a column vector composed of the amplitudes corresponding to the frequency bins of that sample audio frame.

For example, referring to Fig. 2C, Fig. 2C is the spectrum of the k-th sample audio frame, 1 ≤ k ≤ N, where f1, f2, f3, …, fm are the frequency bins of the k-th sample audio frame in the frequency domain and m is the number of frequency bins per sample audio frame. The spectral vector corresponding to the spectrum of the k-th sample audio frame is [Ak1, Ak2, Ak3, …, Akm]^T. The N spectral vectors corresponding to the N sample audio frames are thus obtained, and the N spectral vectors can be assembled into the first spectrogram: the m × N matrix [v1, v2, …, vN] whose k-th column is the spectral vector vk = [Ak1, Ak2, …, Akm]^T.
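The construction of the spectral vectors and the first spectrogram can be sketched as below. The direct DFT is used only to keep the example self-contained; a real implementation would use an FFT-based short-time Fourier transform.

```python
import cmath

def magnitude_spectrum(frame):
    """Amplitude spectrum of one frame via a direct DFT.

    Returns [A_1, ..., A_m] for the m = n//2 + 1 non-redundant
    frequency bins -- the spectral vector of one sample audio frame.
    """
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * cmath.pi * k * i / n)
                    for i, x in enumerate(frame)))
            for k in range(n // 2 + 1)]

def first_spectrogram(frames):
    """Stack the per-frame spectral vectors as columns of an m x N matrix."""
    cols = [magnitude_spectrum(f) for f in frames]
    return [list(row) for row in zip(*cols)]  # row r = bin r over all N frames

spec = first_spectrogram([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
print(len(spec), len(spec[0]))  # -> 3 2  (m = 3 bins, N = 2 frames)
```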
Step 205a: The voice extraction apparatus obtains the label sequence corresponding to the training data based on the first spectrogram.

The label sequence marks the frame property of the sample audio frame corresponding to each column vector in the training data, the frame property being vocal or non-vocal. For example, the j-th element in the label sequence marks the frame property of the j-th audio frame in the training data, 1 ≤ j ≤ N, j being an integer.
In some possible embodiments, obtaining the label sequence corresponding to the training data based on the first spectrogram may be implemented as follows: determining first frame numbers corresponding to silent audio frames in the first spectrogram based on a voice activity detection (VAD) algorithm; acquiring the lyrics file corresponding to the audio file, and determining, based on the lyrics file, second frame numbers corresponding to vocal audio frames and third frame numbers corresponding to non-vocal audio frames in the first spectrogram; and obtaining the label sequence based on the first frame numbers, the second frame numbers and the third frame numbers.

In some possible embodiments, before the first frame numbers corresponding to the silent audio frames in the first spectrogram are determined based on the VAD algorithm, the method further comprises: performing spectral-subtraction noise reduction on the first spectrogram to filter out background noise in the first spectrogram. Spectral-subtraction noise reduction is prior art and is not repeated here.
Specifically, singing periods (containing lyrics) and non-singing periods (containing no lyrics) of the audio file are determined based on the lyrics file. All audio frames corresponding to the non-singing periods are non-vocal audio frames, and the audio frames corresponding to the singing periods include at least vocal audio frames. It should be understood that within any singing period there may be stages where the singer breathes between two adjacent lines of lyrics, so a singing period contains silent periods; that is, among the audio frames corresponding to a singing period there are silent audio frames. The silent periods corresponding to the silent audio frames in the audio file are therefore determined based on the VAD algorithm. Then each period is compared with the time span of each sample audio frame to obtain the period to which each sample audio frame belongs, and the frame property of each sample audio frame is determined from that period; that is, the frame numbers corresponding to vocal audio frames, to non-vocal audio frames, and to silent audio frames are determined.
For example, assume the frame length is 30 ms and the step length is 10 ms, and set the label corresponding to a vocal audio frame to 1, the label corresponding to a non-vocal audio frame to 0, and the label corresponding to a silent frame also to 0. If, for instance, 0-50 ms of the audio file belongs to a non-vocal audio period, it is determined that the 1st and 2nd audio frames are non-vocal audio frames, so the labels of the 1st and 2nd audio frames in the training data are 0. If 50-70 ms and 90-110 ms of the audio file belong to vocal audio periods, it is determined that the 3rd and 5th audio frames are vocal audio frames and the 4th frame is a silent audio frame, so the labels of the 3rd and 5th audio frames in the training data are 1 and the label of the 4th audio frame is 0, and so on. The label sequence is therefore [0, 0, 1, 0, 1, …].
Referring to Fig. 2B, which builds on Fig. 2A, Fig. 2B is a schematic flowchart of another method for obtaining training data provided by an embodiment of the present application. The method is applied to a voice extraction apparatus and comprises steps 201b-207b:
Step 201b: The voice extraction apparatus performs voice extraction on an audio file based on the voice extraction model to obtain sample audio.

Step 202b: The voice extraction apparatus performs framing on the sample audio to obtain N sample audio frames.

Step 203b: The voice extraction apparatus performs a short-time Fourier transform on each sample audio frame to obtain the spectrum of each sample audio frame.

Step 204b: The voice extraction apparatus obtains a first spectrogram of the audio file based on the spectrum of each sample audio frame.
Step 205b: The voice extraction apparatus determines the first-order difference between corresponding elements of the i-th column vector and the (i+1)-th column vector in the first spectrogram to obtain a difference vector, and vertically splices the difference vector with the (i+1)-th column vector to obtain a second spectrogram, 1 ≤ i < N, i being an integer.

Since the first sample audio frame in the first spectrogram has no corresponding difference vector, a preset difference vector A = [A01, A02, …, A0m] is vertically spliced with it, where the preset difference vector may be a zero vector whose elements are all 0, a vector of predicted elements, or the like; the present application does not impose a unique restriction.

After the preset difference vector is spliced, the second spectrogram is a 2m × N matrix whose i-th column is the difference vector for the i-th frame stacked on top of that frame's spectral vector.
Optionally, the first-order difference of the frame vectors of two adjacent sample audio frames is obtained, and the difference vectors are vertically spliced with the first spectrogram, so that each column vector of the resulting second spectrogram contains the audio information of two adjacent audio frames. Therefore, when the voice probability corresponding to each sample audio frame is calculated, the column vector corresponding to each sample audio frame carries this prior information from the preceding frame, which makes the calculated voice probability of the sample audio frame more accurate.
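The difference-and-splice construction of the second spectrogram can be sketched as follows; a zero preset difference vector is used for the first frame, which is one of the options mentioned above.

```python
def second_spectrogram(spec_cols):
    """Build second-spectrogram columns from per-frame spectral vectors.

    Each output column is [difference vector; spectral vector],
    i.e. 2m elements; the first frame gets an all-zero
    preset difference vector.
    """
    out = []
    for i, col in enumerate(spec_cols):
        if i == 0:
            diff = [0.0] * len(col)          # preset difference vector
        else:
            diff = [c - p for c, p in zip(col, spec_cols[i - 1])]
        out.append(diff + col)               # vertical splice -> 2m elements
    return out

cols = [[1.0, 2.0], [3.0, 5.0]]              # two frames, m = 2 bins
print(second_spectrogram(cols))
# -> [[0.0, 0.0, 1.0, 2.0], [2.0, 3.0, 3.0, 5.0]]
```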
Step 206b: The voice extraction apparatus labels the second spectrogram as training data.

Step 207b: The voice extraction apparatus obtains the label sequence corresponding to the training data based on the first spectrogram.
Finally, the voice filtering model is trained with the training data; this training is prior art and is not repeated here.
Referring to Fig. 3, Fig. 3 is a schematic flowchart of another voice extraction method provided by an embodiment of the present application. The method is applied to a voice extraction apparatus and comprises steps 301-306:

Step 301: The voice extraction apparatus performs voice extraction on mixed audio based on the voice extraction model to obtain intermediate audio, the intermediate audio comprising vocal audio frames and non-vocal audio frames.

Step 302: The voice extraction apparatus divides the intermediate audio into several audio segments, any two adjacent audio segments having an overlapping portion.
Optionally, the voice extraction apparatus divides the intermediate audio into several audio segments according to a preset window function and a preset step length, each audio segment containing at least one audio frame. For example, the intermediate audio may be divided into several audio segments with a 10 s window function and a 5 s step length, so that any two adjacent audio segments have a 5 s overlap.
Step 303: The voice extraction apparatus successively inputs each audio segment into the voice filtering model to obtain a first voice probability sequence of each audio segment, the first voice probability sequence indicating the probability that each audio frame in the audio segment is vocal.

Step 304: The voice extraction apparatus determines the expected voice probability of each audio frame in the overlapping portions based on the first voice probability sequences of the audio segments, and obtains a second voice probability sequence of the intermediate audio.

Optionally, the first voice probability sequence of each audio segment is determined based on the voice filtering model. Since two adjacent audio segments have an overlapping portion, the two first voice probability sequences corresponding to two adjacent audio segments both contain a voice probability for each audio frame in that overlapping portion; the voice probability of each audio frame in the overlapping portion is obtained by averaging. These averages, together with the voice probabilities corresponding to the non-overlapping portions, form the second voice probability sequence of the intermediate audio, each element of which indicates the probability that the corresponding audio frame in the intermediate audio is vocal.
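The averaging over overlapping segments can be sketched as below; the segment lengths, start offsets and probability values are made up for the illustration.

```python
def merge_segment_probs(seg_probs, seg_starts, n_frames):
    """Average per-frame voice probabilities over overlapping segments.

    seg_probs[s][j] is the model's probability that frame
    seg_starts[s] + j is vocal; frames covered by two segments
    receive the mean of both estimates.
    """
    total = [0.0] * n_frames
    count = [0] * n_frames
    for probs, start in zip(seg_probs, seg_starts):
        for j, p in enumerate(probs):
            total[start + j] += p
            count[start + j] += 1
    return [t / c for t, c in zip(total, count)]

# two 4-frame segments whose last/first two frames overlap
merged = merge_segment_probs([[0.2, 0.4, 0.6, 0.8],
                              [0.4, 0.6, 0.9, 1.0]], [0, 2], 6)
print(merged)  # frames 2 and 3 are averaged over both segments
```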
Step 305: The voice extraction apparatus determines a target voice probability sequence of the intermediate audio based on the Viterbi algorithm and the second voice probability sequence.

Optionally, the elements of the second voice probability sequence are adjusted based on the Viterbi algorithm to obtain an optimal probability sequence, which is taken as the target voice probability sequence. That is, in a manner similar to finding an optimal path, the hidden sequences corresponding to the second voice probability sequence are determined based on the Viterbi algorithm, the likelihood of each hidden sequence is obtained, and the optimal probability sequence is derived; the detailed process is prior art and is not described further.

For example, suppose the second voice probability sequence is [0.0, 0.1, 0.1, 0.2, 0.3, 0.5, 0.8, 0.7, 0.1, 0.1, 0.6, 0.7, 0.8, …]. From this sequence it can be seen that the 6th, 7th, 8th, 11th, 12th and 13th audio frames of the intermediate audio are probably vocal audio frames, while the 9th and 10th audio frames are probably non-vocal audio frames. Since a speaker's speech changes gradually, the voice probabilities should also change progressively; in general, the voice probability of one audio frame being very large while that of the next audio frame is very small does not conform to the speaker's speaking pattern. It can therefore be concluded that the voice probabilities corresponding to the 9th and 10th audio frames are problematic and need to be adjusted dynamically to conform to the speaker's speaking pattern.
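A toy version of this dynamic adjustment is sketched below as a two-state (vocal/non-vocal) Viterbi decode. The self-transition probability of 0.8, which penalises rapid vocal/non-vocal flips, and the probability values are assumptions for the illustration, not parameters fixed by the text.

```python
import math

def viterbi_smooth(voice_probs, stay=0.8):
    """Most likely vocal (1) / non-vocal (0) state path.

    The per-frame model outputs are used as emission probabilities;
    `stay` is an assumed self-transition probability that discourages
    isolated flips, mimicking the gradual-change rule in the text.
    """
    log = lambda p: math.log(max(p, 1e-10))
    trans = [[log(stay), log(1 - stay)],
             [log(1 - stay), log(stay)]]
    emit = lambda t, s: log(voice_probs[t] if s else 1 - voice_probs[t])
    score = [emit(0, 0), emit(0, 1)]
    back = []
    for t in range(1, len(voice_probs)):
        prev, score, ptr = score, [], []
        for s in (0, 1):
            best = max((0, 1), key=lambda u: prev[u] + trans[u][s])
            ptr.append(best)
            score.append(prev[best] + trans[best][s] + emit(t, s))
        back.append(ptr)
    path = [max((0, 1), key=lambda s: score[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# the isolated low-probability frame inside the vocal run stays vocal
path = viterbi_smooth([0.1, 0.1, 0.2, 0.9, 0.9, 0.2, 0.9, 0.9, 0.1, 0.1])
print(path)  # -> [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]
```

Note how the dip at the sixth frame (probability 0.2) is kept vocal because two extra state switches would cost more than the emission penalty, which is exactly the smoothing behaviour described above.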
Step 306: The voice extraction apparatus filters out the non-vocal audio frames in the intermediate audio based on the target voice probability sequence to obtain vocal audio, the non-vocal audio frames being the audio frames of the intermediate audio that correspond to target elements in the target voice probability sequence, a target element being an element meeting a preset condition.

The element meeting the preset condition may be an element below a threshold, and the threshold may be 0.5, 0.6, 0.7 or another value.
As can be seen, in this embodiment of the present application, after the intermediate audio is obtained by the voice extraction model, the intermediate audio is segmented, the input data corresponding to each audio segment are determined, and the input data are fed into the voice filtering model to filter out the non-vocal audio frames of the intermediate audio and obtain pure vocal audio. Since the input data of the voice filtering model are audio segments, whereas in the prior art the input data are single audio frames, the voice filtering model has a larger receptive field and filters non-vocal audio frames using the global information of the intermediate audio. Pure vocals are thus extracted from the mixed audio, the extracted vocal effect is better, and user experience is improved.
In some possible embodiments, voice extracting method disclosed in the present application is applied to voice mistake as shown in Figure 4
Model is filtered, which includes P identical network layers and full articulamentum, wherein the P identical network layers are with residual
Poor form connection, each network layer include: the first convolutional layer, the second convolutional layer, active coating, Fusion Features layer and feature superposition
Layer;The full articulamentum can intensively connect for multiple network layers.
First, the human voice extraction device segments the intermediate audio to obtain several audio segments. Then, a Short-Time Fourier Transform is applied to each audio segment to obtain the spectrogram corresponding to each audio segment (which may be the above first spectrogram or second spectrogram), and the input data corresponding to each audio segment is obtained based on the spectrogram; the specific conversion process follows the above process for obtaining training data and is not described in detail here. The input data is fed to the first of the P network layers of the voice filtering model. The first convolutional layer performs a first convolution operation on the input data to obtain a first feature matrix; the second convolutional layer performs a second convolution operation on the input data to obtain a second feature matrix; the activation layer performs nonlinear activation on the second feature matrix to obtain a third feature matrix; the feature fusion layer performs a cross-multiplication operation on the first feature matrix and the third feature matrix to obtain a fourth feature matrix; and the feature superposition layer superposes the fourth feature matrix and the input data to obtain the output data of the network layer. The output data serves as the input data of the next network layer, and after the P network layers, the target feature matrix of each audio segment is obtained. The fully connected layer performs a fully connected operation on the target feature matrix to obtain the feature vector corresponding to each audio segment, and the feature vector is input to a softmax classifier to obtain the voice probability sequence corresponding to each audio segment.
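The per-layer computation described above (two parallel convolutions, a nonlinearity on one branch, element-wise fusion, and a residual addition) can be sketched as follows. This is a minimal plain-Python illustration, not the patent's implementation: the activation function (sigmoid here), kernel sizes, 'same' zero-padding, and the interpretation of the cross-multiplication as an element-wise (gated) product are all assumptions, and every name is illustrative.

```python
import math

def conv1d(x, kernel):
    """'Same'-padded 1-D convolution over a list of floats (illustrative)."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(kernel[j] * padded[i + j] for j in range(len(kernel)))
            for i in range(len(x))]

def network_layer(x, k1, k2):
    """One network layer: first conv, activated second conv, fusion, residual add."""
    first = conv1d(x, k1)                                   # first feature matrix
    second = conv1d(x, k2)                                  # second feature matrix
    third = [1.0 / (1.0 + math.exp(-v)) for v in second]    # nonlinear activation (assumed sigmoid)
    fourth = [a * b for a, b in zip(first, third)]          # feature fusion (element-wise product)
    return [f + xi for f, xi in zip(fourth, x)]             # feature superposition (residual)
```

Stacking P such layers and following them with a fully connected layer and a softmax classifier would then yield the per-frame voice probability sequence described in the text.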
It should be noted that Fig. 4 shows only one possible network structure of the voice filtering model; the present application takes this network structure merely as an example and does not limit the voice filtering model to it.
The above mainly describes the solutions of the embodiments of the present application from the perspective of the method-side execution process. It can be understood that, in order to realize the above functions, the computing device includes hardware structures and/or software modules corresponding to each function. Those skilled in the art should readily appreciate that, in conjunction with the exemplary units and algorithm steps described in the embodiments disclosed herein, the present application can be implemented in hardware or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the specific application and design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each specific application, but such implementation should not be considered to be beyond the scope of the present application.
The embodiments of the present application may divide the computing device into functional units according to the above method examples; for example, each functional unit may correspond to one function, or two or more functions may be integrated into one processing unit. The above integrated unit may be implemented in the form of hardware or in the form of a software functional unit. It should be noted that the division of units in the embodiments of the present application is schematic and is merely a logical function division; other division manners are possible in actual implementation.
Consistent with the human voice extraction method embodiments described above, refer to Fig. 5, which is a schematic structural diagram of a human voice extraction device 500 provided by an embodiment of the present application. As shown in Fig. 5, the human voice extraction device 500 includes a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are different from the one or more application programs, are stored in the memory, and are configured to be executed by the processor. The programs include instructions for performing the following steps:
performing human voice extraction on mixed audio based on a voice extraction model to obtain intermediate audio, the intermediate audio including human voice audio frames and non-human-voice audio frames;
filtering out the non-human-voice audio frames of the intermediate audio based on a voice filtering model to obtain human voice audio.
In a possible embodiment, the voice filtering model is built based on a machine learning ensemble algorithm, and the above programs further include instructions for performing the following steps:
before filtering out the non-human-voice audio frames of the intermediate audio based on the voice filtering model, preprocessing an audio file to obtain training data and a label sequence, and using the training data and the label sequence to optimize and train the voice filtering model.
In a possible embodiment, in terms of preprocessing the audio file to obtain the training data and the label sequence, the above programs specifically include instructions for performing the following steps:
performing human voice extraction on the audio file based on the voice extraction model to obtain sample audio;
performing framing processing on the sample audio to obtain N sample audio frames, N being an integer greater than 1;
performing a Short-Time Fourier Transform on each sample audio frame to obtain the spectrum of each sample audio frame;
obtaining a first spectrogram of the audio file based on the spectrum of each sample audio frame, the first spectrogram being a matrix composed of the spectral vectors of the sample audio frames, the spectral vector of each sample audio frame being a column vector composed of the amplitudes corresponding to the frequency bins of that sample audio frame;
labeling the first spectrogram as training data;
obtaining the label sequence corresponding to the training data based on the first spectrogram, the label sequence being used to mark the frame attribute of the sample audio frame corresponding to each column vector in the training data, the frame attribute including human voice and non-human voice.
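The framing and spectrogram construction described above can be sketched as follows, using a plain-DFT magnitude spectrum per frame. The source does not specify the frame length, hop size, or window, so the rectangular window and dimensions used here are illustrative assumptions.

```python
import cmath

def first_spectrogram(samples, frame_len, hop):
    """Frame the sample audio, take each frame's magnitude spectrum, and
    assemble the column-per-frame matrix described above (illustrative)."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, hop)]

    def magnitude_spectrum(frame):
        n = len(frame)
        return [abs(sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                        for t in range(n)))
                for k in range(n // 2 + 1)]   # one amplitude per frequency bin

    spectra = [magnitude_spectrum(f) for f in frames]
    # Transpose so that each column is one audio frame's spectral vector.
    return [list(col) for col in zip(*spectra)]
```

In practice an FFT-based routine would be used instead of this direct DFT; the sketch only fixes the data layout (rows = frequency bins, columns = audio frames).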
In a possible embodiment, before the first spectrogram is labeled as training data, the above programs further include instructions for performing the following steps:
determining the first-order difference of the corresponding elements of the i-th column vector and the (i+1)-th column vector in the first spectrogram to obtain a difference vector, where 1 ≤ i ≤ N and i is an integer;
vertically concatenating the difference vector with the (i+1)-th column vector to obtain a second spectrogram.
Labeling the first spectrogram as training data then comprises:
labeling the second spectrogram as training data.
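The first-order-difference augmentation above can be sketched as follows. Since the last column has no successor, this sketch simply emits one fewer column; that edge handling is an assumption the source does not spell out.

```python
def second_spectrogram(first):
    """first: list of columns (per-frame spectral vectors) of the first spectrogram.
    Stack each adjacent-column difference on top of the (i+1)-th column."""
    cols = []
    for i in range(len(first) - 1):
        diff = [b - a for a, b in zip(first[i], first[i + 1])]  # first-order difference
        cols.append(diff + first[i + 1])                        # vertical concatenation
    return cols
```

Each resulting column thus carries both the frame's spectrum and how it changed from the previous frame, doubling the feature dimension.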
In a possible embodiment, in terms of obtaining the label sequence corresponding to the training data based on the first spectrogram, the above programs specifically include instructions for performing the following steps:
determining, based on a voice activity detection algorithm, the first frame numbers corresponding to silent audio frames in the first spectrogram;
obtaining the lyrics file corresponding to the audio file, and determining, based on the lyrics file, the second frame numbers corresponding to human voice audio frames and the third frame numbers corresponding to non-human-voice audio frames in the first spectrogram;
obtaining the label sequence based on the first frame numbers, the second frame numbers, and the third frame numbers.
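One way to combine the three frame-number sets into a per-frame label sequence is sketched below. The source does not give the exact combination rule, so the precedence used here (VAD-detected silence overrides lyric timing) is an assumption, and the function name is illustrative.

```python
def build_label_sequence(n_frames, silent_frames, vocal_frames):
    """silent_frames: frame indices flagged by the VAD (first frame numbers);
    vocal_frames: frame indices covered by lyric lines (second frame numbers).
    All remaining frames are treated as non-human voice (third frame numbers)."""
    return [1 if (i in vocal_frames and i not in silent_frames) else 0
            for i in range(n_frames)]
```

A lyric-covered frame that the VAD marks as silent (e.g. a pause inside a sung line) is thus labeled non-voice under this assumed precedence.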
In a possible embodiment, in terms of filtering out the non-human-voice audio frames of the intermediate audio based on the voice filtering model, the above programs specifically include instructions for performing the following steps:
segmenting the intermediate audio into several audio segments, any two adjacent audio segments having an overlapping portion;
inputting each audio segment in turn to the voice filtering model to obtain a first voice probability sequence of each audio segment, the first voice probability sequence indicating the probability that each audio frame in the audio segment is human voice;
determining the mean voice probability of each audio frame in the overlapping portions based on the first voice probability sequences of the audio segments, to obtain a second voice probability sequence of the intermediate audio;
determining a target voice probability sequence of the intermediate audio based on the Viterbi algorithm and the second voice probability sequence;
filtering out the non-human-voice audio frames in the intermediate audio based on the target voice probability sequence to obtain the human voice audio, the non-human-voice audio frames being the audio frames in the intermediate audio that correspond to target elements in the target voice probability sequence, the target elements being elements that satisfy a preset condition.
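The overlap averaging and Viterbi smoothing described above can be sketched as follows. This is an illustrative two-state (voice / non-voice) decoder with sticky transition probabilities; the actual segment length, overlap, transition matrix, and the preset condition on target elements are not specified in the source, so all parameters here are assumptions.

```python
import math

def merge_overlaps(segments, hop, total_frames):
    """Average per-frame voice probabilities over overlapping segments;
    segments[k] holds the probabilities for frames starting at k * hop."""
    sums = [0.0] * total_frames
    counts = [0] * total_frames
    for k, probs in enumerate(segments):
        for j, p in enumerate(probs):
            sums[k * hop + j] += p
            counts[k * hop + j] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

def viterbi_smooth(probs, stay=0.9):
    """Two-state Viterbi decode of the merged sequence:
    state 1 = voice (emits p), state 0 = non-voice (emits 1 - p)."""
    switch, eps = 1.0 - stay, 1e-9
    trans = [[math.log(stay), math.log(switch)],
             [math.log(switch), math.log(stay)]]
    v = [math.log(1.0 - probs[0] + eps), math.log(probs[0] + eps)]
    back = []
    for p in probs[1:]:
        emit = [math.log(1.0 - p + eps), math.log(p + eps)]
        ptrs, nv = [], []
        for s in (0, 1):
            best = max((0, 1), key=lambda q: v[q] + trans[q][s])
            ptrs.append(best)
            nv.append(v[best] + trans[best][s] + emit[s])
        back.append(ptrs)
        v = nv
    state = max((0, 1), key=lambda s: v[s])
    path = [state]
    for ptrs in reversed(back):
        state = ptrs[state]
        path.append(state)
    return path[::-1]
```

With sticky transitions, an isolated low-probability frame inside a voiced run is smoothed over rather than cut out, which is the practical benefit of the Viterbi pass over simple thresholding.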
Referring to Fig. 6, Fig. 6 shows a block diagram of a possible functional-unit composition of the human voice extraction device 600 involved in the above embodiments. The human voice extraction device 600 includes an extraction unit 610 and a filtering unit 620, wherein:
the extraction unit 610 is configured to perform human voice extraction on mixed audio based on a voice extraction model to obtain intermediate audio, the intermediate audio including human voice audio frames and non-human-voice audio frames;
the filtering unit 620 is configured to filter out the non-human-voice audio frames of the intermediate audio based on a voice filtering model to obtain human voice audio.
In a possible embodiment, the voice filtering model is built based on a machine learning ensemble algorithm, and the human voice extraction device 600 further includes a training unit 630. The training unit 630 is configured to: before the non-human-voice audio frames of the intermediate audio are filtered out based on the voice filtering model, preprocess an audio file to obtain training data and a label sequence, and use the training data and the label sequence to optimize and train the voice filtering model.
In a possible embodiment, in terms of preprocessing the audio file to obtain the training data and the label sequence, the training unit 630 is specifically configured to: perform human voice extraction on the audio file based on the voice extraction model to obtain sample audio; perform framing processing on the sample audio to obtain N sample audio frames, N being an integer greater than 1; perform a Short-Time Fourier Transform on each sample audio frame to obtain the spectrum of each sample audio frame; obtain a first spectrogram of the audio file based on the spectrum of each sample audio frame, the first spectrogram being a matrix composed of the spectral vectors of the sample audio frames, the spectral vector of each sample audio frame being a column vector composed of the amplitudes corresponding to the frequency bins of that sample audio frame; label the first spectrogram as training data; and obtain the label sequence corresponding to the training data based on the first spectrogram, the label sequence being used to mark the frame attribute of the sample audio frame corresponding to each column vector in the training data, the frame attribute including human voice and non-human voice.
In a possible embodiment, before the first spectrogram is labeled as training data, the training unit 630 is further configured to: determine the first-order difference of the corresponding elements of the i-th column vector and the (i+1)-th column vector in the first spectrogram to obtain a difference vector, where 1 ≤ i ≤ N and i is an integer; and vertically concatenate the difference vector with the (i+1)-th column vector to obtain a second spectrogram. In terms of labeling the first spectrogram as training data, the training unit 630 is specifically configured to: label the second spectrogram as training data.
In a possible embodiment, in terms of obtaining the label sequence corresponding to the training data based on the first spectrogram, the training unit 630 is specifically configured to: determine, based on a voice activity detection algorithm, the first frame numbers corresponding to silent audio frames in the first spectrogram; obtain the lyrics file corresponding to the audio file, and determine, based on the lyrics file, the second frame numbers corresponding to human voice audio frames and the third frame numbers corresponding to non-human-voice audio frames in the first spectrogram; and obtain the label sequence based on the first frame numbers, the second frame numbers, and the third frame numbers.
In a possible embodiment, in terms of filtering out the non-human-voice audio frames of the intermediate audio based on the voice filtering model, the filtering unit 620 is specifically configured to: segment the intermediate audio into several audio segments, any two adjacent audio segments having an overlapping portion; input each audio segment in turn to the voice filtering model to obtain a first voice probability sequence of each audio segment, the first voice probability sequence indicating the probability that each audio frame in the audio segment is human voice; determine the mean voice probability of each audio frame in the overlapping portions based on the first voice probability sequences of the audio segments, to obtain a second voice probability sequence of the intermediate audio; determine a target voice probability sequence of the intermediate audio based on the Viterbi algorithm and the second voice probability sequence; and filter out the non-human-voice audio frames in the intermediate audio based on the target voice probability sequence to obtain the human voice audio, the non-human-voice audio frames being the audio frames in the intermediate audio that correspond to target elements in the target voice probability sequence, the target elements being elements that satisfy a preset condition.
An embodiment of the present application also provides a computer storage medium. The computer-readable storage medium stores a computer program, and the computer program is executed by a processor to implement some or all of the steps of any human voice extraction method described in the above method embodiments.
An embodiment of the present application also provides a computer program product. The computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute some or all of the steps of any human voice extraction method described in the above method embodiments.
It should be noted that, for brevity, the foregoing method embodiments are described as a series of action combinations; however, those skilled in the art should understand that the present application is not limited by the described order of actions, because according to the present application some steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in this specification are optional embodiments, and the actions and modules involved are not necessarily required by the present application.
Each of the above embodiments has its own emphasis; for parts that are not described in detail in one embodiment, reference may be made to the related descriptions of the other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed device may be implemented in other ways. For example, the device embodiments described above are merely illustrative; for instance, the division of the units is only a logical function division, and other division manners are possible in actual implementation, e.g., multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may be electrical or in other forms.
The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware or in the form of a software program module.
If the integrated unit is implemented in the form of a software program module and sold or used as an independent product, it may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned memory includes various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable memory, and the memory may include a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.
The embodiments of the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the above descriptions of the embodiments are intended only to help understand the method of the present application and its core ideas. Meanwhile, a person skilled in the art may make changes to the specific implementations and application scope according to the ideas of the present application. In summary, the contents of this specification should not be construed as limiting the present application.
Claims (10)
1. A human voice extraction method, characterized by comprising:
performing human voice extraction on mixed audio based on a voice extraction model to obtain intermediate audio, the intermediate audio including human voice audio frames and non-human-voice audio frames;
filtering out the non-human-voice audio frames of the intermediate audio based on a voice filtering model to obtain human voice audio.
2. The method according to claim 1, characterized in that the voice filtering model is built based on a machine learning ensemble algorithm, and the method further comprises: before filtering out the non-human-voice audio frames of the intermediate audio based on the voice filtering model, preprocessing an audio file to obtain training data and a label sequence, and using the training data and the label sequence to optimize and train the voice filtering model.
3. The method according to claim 2, characterized in that the preprocessing of the audio file to obtain the training data and the label sequence comprises:
performing human voice extraction on the audio file based on the voice extraction model to obtain sample audio;
performing framing processing on the sample audio to obtain N sample audio frames, N being an integer greater than 1;
performing a Short-Time Fourier Transform on each sample audio frame to obtain the spectrum of each sample audio frame;
obtaining a first spectrogram of the audio file based on the spectrum of each sample audio frame, the first spectrogram being a matrix composed of the spectral vectors of the sample audio frames, the spectral vector of each sample audio frame being a column vector composed of the amplitudes corresponding to the frequency bins of that sample audio frame;
labeling the first spectrogram as training data;
obtaining the label sequence corresponding to the training data based on the first spectrogram, the label sequence being used to mark the frame attribute of the sample audio frame corresponding to each column vector in the training data, the frame attribute including human voice and non-human voice.
4. The method according to claim 3, characterized in that, before the first spectrogram is labeled as training data, the method further comprises:
determining the first-order difference of the corresponding elements of the i-th column vector and the (i+1)-th column vector in the first spectrogram to obtain a difference vector, where 1 ≤ i ≤ N and i is an integer;
vertically concatenating the difference vector with the (i+1)-th column vector to obtain a second spectrogram;
wherein the labeling of the first spectrogram as training data comprises:
labeling the second spectrogram as training data.
5. The method according to claim 3 or 4, characterized in that the obtaining of the label sequence corresponding to the training data based on the first spectrogram comprises:
determining, based on a voice activity detection algorithm, the first frame numbers corresponding to silent audio frames in the first spectrogram;
obtaining the lyrics file corresponding to the audio file, and determining, based on the lyrics file, the second frame numbers corresponding to human voice audio frames and the third frame numbers corresponding to non-human-voice audio frames in the first spectrogram;
obtaining the label sequence based on the first frame numbers, the second frame numbers, and the third frame numbers.
6. The method according to claim 1, characterized in that the filtering out of the non-human-voice audio frames of the intermediate audio based on the voice filtering model comprises:
segmenting the intermediate audio into several audio segments, any two adjacent audio segments having an overlapping portion;
inputting each audio segment in turn to the voice filtering model to obtain a first voice probability sequence of each audio segment, the first voice probability sequence indicating the probability that each audio frame in the audio segment is human voice;
determining the mean voice probability of each audio frame in the overlapping portions based on the first voice probability sequences of the audio segments, to obtain a second voice probability sequence of the intermediate audio;
determining a target voice probability sequence of the intermediate audio based on the Viterbi algorithm and the second voice probability sequence;
filtering out the non-human-voice audio frames in the intermediate audio based on the target voice probability sequence to obtain the human voice audio, the non-human-voice audio frames being the audio frames in the intermediate audio that correspond to target elements in the target voice probability sequence, the target elements being elements that satisfy a preset condition.
7. A human voice extraction device, characterized by comprising:
an extraction unit configured to perform human voice extraction on mixed audio based on a voice extraction model to obtain intermediate audio, the intermediate audio including human voice audio frames and non-human-voice audio frames;
a filtering unit configured to filter out the non-human-voice audio frames of the intermediate audio based on a voice filtering model to obtain human voice audio.
8. The device according to claim 7, characterized in that the device further comprises a training unit, wherein the training unit is configured to: before the non-human-voice audio frames of the intermediate audio are filtered out based on the voice filtering model, preprocess an audio file to obtain training data and a label sequence, and use the training data and the label sequence to optimize and train the voice filtering model.
9. An electronic device, characterized by comprising a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, the programs including instructions for performing the steps of the method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, the computer program being executed by a processor to implement the method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910343129.5A CN110085251B (en) | 2019-04-26 | 2019-04-26 | Human voice extraction method, human voice extraction device and related products |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110085251A true CN110085251A (en) | 2019-08-02 |
CN110085251B CN110085251B (en) | 2021-06-25 |
Family
ID=67416989
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910343129.5A Active CN110085251B (en) | 2019-04-26 | 2019-04-26 | Human voice extraction method, human voice extraction device and related products |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110085251B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2006133284A (en) * | 2004-11-02 | 2006-05-25 | Kddi Corp | Voice information extracting device |
CN101404160A (en) * | 2008-11-21 | 2009-04-08 | 北京科技大学 | Voice denoising method based on audio recognition |
WO2014167570A1 (en) * | 2013-04-10 | 2014-10-16 | Technologies For Voice Interface | System and method for extracting and using prosody features |
CN105719657A (en) * | 2016-02-23 | 2016-06-29 | 惠州市德赛西威汽车电子股份有限公司 | Human voice extracting method and device based on microphone |
CN108962277A (en) * | 2018-07-20 | 2018-12-07 | 广州酷狗计算机科技有限公司 | Speech signal separation method, apparatus, computer equipment and storage medium |
CN109308901A (en) * | 2018-09-29 | 2019-02-05 | 百度在线网络技术(北京)有限公司 | Chanteur's recognition methods and device |
2019-04-26: application CN201910343129.5A filed (CN); granted as CN110085251B, status Active.
Cited By (35)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110428853A (en) * | 2019-08-30 | 2019-11-08 | 北京太极华保科技股份有限公司 | Voice activity detection method, Voice activity detection device and electronic equipment |
CN112687274A (en) * | 2019-10-17 | 2021-04-20 | 北京猎户星空科技有限公司 | Voice information processing method, device, equipment and medium |
CN110718228A (en) * | 2019-10-22 | 2020-01-21 | 中信银行股份有限公司 | Voice separation method and device, electronic equipment and computer readable storage medium |
CN110942776B (en) * | 2019-10-31 | 2022-12-06 | 厦门快商通科技股份有限公司 | Audio splicing prevention detection method and system based on GRU |
CN110942776A (en) * | 2019-10-31 | 2020-03-31 | 厦门快商通科技股份有限公司 | Audio splicing prevention detection method and system based on GRU |
CN110782907A (en) * | 2019-11-06 | 2020-02-11 | 腾讯科技(深圳)有限公司 | Method, device and equipment for transmitting voice signal and readable storage medium |
CN110782907B (en) * | 2019-11-06 | 2023-11-28 | 腾讯科技(深圳)有限公司 | Voice signal transmitting method, device, equipment and readable storage medium |
WO2021115083A1 (en) * | 2019-12-11 | 2021-06-17 | 北京影谱科技股份有限公司 | Audio signal time sequence processing method, apparatus and system based on neural network, and computer-readable storage medium |
CN111145763A (en) * | 2019-12-17 | 2020-05-12 | 厦门快商通科技股份有限公司 | GRU-based voice recognition method and system in audio |
CN113053401A (en) * | 2019-12-26 | 2021-06-29 | 上海博泰悦臻电子设备制造有限公司 | Audio acquisition method and related product |
CN111276113A (en) * | 2020-01-21 | 2020-06-12 | 北京永航科技有限公司 | Method and device for generating key time data based on audio |
CN111276113B (en) * | 2020-01-21 | 2023-10-17 | 北京永航科技有限公司 | Method and device for generating key time data based on audio |
CN111341341A (en) * | 2020-02-11 | 2020-06-26 | 腾讯科技(深圳)有限公司 | Training method of audio separation network, audio separation method, device and medium |
CN111354378B (en) * | 2020-02-12 | 2020-11-24 | 北京声智科技有限公司 | Voice endpoint detection method, device, equipment and computer storage medium |
CN111354378A (en) * | 2020-02-12 | 2020-06-30 | 北京声智科技有限公司 | Voice endpoint detection method, device, equipment and computer storage medium |
CN111968623A (en) * | 2020-08-19 | 2020-11-20 | 腾讯音乐娱乐科技(深圳)有限公司 | Breathing-point position detection method and related equipment |
CN111968623B (en) * | 2020-08-19 | 2023-11-28 | 腾讯音乐娱乐科技(深圳)有限公司 | Breathing-point position detection method and related equipment |
CN112259119A (en) * | 2020-10-19 | 2021-01-22 | 成都明杰科技有限公司 | Music source separation method based on stacked hourglass network |
CN112397073A (en) * | 2020-11-04 | 2021-02-23 | 北京三快在线科技有限公司 | Audio data processing method and device |
CN112397073B (en) * | 2020-11-04 | 2023-11-21 | 北京三快在线科技有限公司 | Audio data processing method and device |
CN112270933B (en) * | 2020-11-12 | 2024-03-12 | 北京猿力未来科技有限公司 | Audio identification method and device |
CN112270933A (en) * | 2020-11-12 | 2021-01-26 | 北京猿力未来科技有限公司 | Audio identification method and device |
WO2022100691A1 (en) * | 2020-11-12 | 2022-05-19 | 北京猿力未来科技有限公司 | Audio recognition method and device |
WO2022100692A1 (en) * | 2020-11-12 | 2022-05-19 | 北京猿力未来科技有限公司 | Human voice audio recording method and apparatus |
CN112382310A (en) * | 2020-11-12 | 2021-02-19 | 北京猿力未来科技有限公司 | Human voice audio recording method and device |
CN112818163A (en) * | 2021-01-22 | 2021-05-18 | 惠州Tcl移动通信有限公司 | Song display processing method, device, terminal and medium based on mobile terminal |
CN113113051A (en) * | 2021-03-10 | 2021-07-13 | 深圳市声扬科技有限公司 | Audio fingerprint extraction method and device, computer equipment and storage medium |
CN113114417A (en) * | 2021-03-30 | 2021-07-13 | 深圳市冠标科技发展有限公司 | Audio transmission method and device, electronic equipment and storage medium |
CN113257242A (en) * | 2021-04-06 | 2021-08-13 | 杭州远传新业科技有限公司 | Voice broadcast suspension method, device, equipment and medium in self-service voice service |
CN113572908A (en) * | 2021-06-16 | 2021-10-29 | 云茂互联智能科技(厦门)有限公司 | Method, device and system for reducing noise in a VoIP (Voice over Internet Protocol) call |
CN113242361A (en) * | 2021-07-13 | 2021-08-10 | 腾讯科技(深圳)有限公司 | Video processing method and device and computer readable storage medium |
CN113242361B (en) * | 2021-07-13 | 2021-09-24 | 腾讯科技(深圳)有限公司 | Video processing method and device and computer readable storage medium |
CN113724720B (en) * | 2021-07-19 | 2023-07-11 | 电信科学技术第五研究所有限公司 | Non-human voice filtering method in noisy environments based on a neural network and MFCCs (Mel-frequency cepstral coefficients) |
CN113724720A (en) * | 2021-07-19 | 2021-11-30 | 电信科学技术第五研究所有限公司 | Non-human voice filtering method in noisy environments based on a neural network and MFCCs (Mel-frequency cepstral coefficients) |
CN114203163A (en) * | 2022-02-16 | 2022-03-18 | 荣耀终端有限公司 | Audio signal processing method and device |
Also Published As
Publication number | Publication date |
---|---|
CN110085251B (en) | 2021-06-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110085251A (en) | Voice extraction method, voice extraction device and related product | |
Zhang et al. | Denoising deep neural networks based voice activity detection | |
CN108109619A (en) | Auditory selection method and device based on memory and attention models | |
CN107578775A (en) | A multitask speech classification method based on deep neural networks | |
CN108847249A (en) | Voice conversion optimization method and system | |
CN108198569A (en) | An audio processing method, device, equipment and readable storage medium | |
CN107195296A (en) | A speech recognition method, device, terminal and system | |
CN110675891B (en) | Voice separation method and module based on multilayer attention mechanism | |
CN108986798B (en) | Processing method, device and equipment for voice data | |
CN113921022B (en) | Audio signal separation method, device, storage medium and electronic equipment | |
Huang et al. | Extraction of adaptive wavelet packet filter‐bank‐based acoustic feature for speech emotion recognition | |
CN106033669B (en) | Audio recognition method and device | |
CN107910008A (en) | A speech recognition method for personal devices based on multiple acoustic models | |
CN110268471A (en) | Method and device for ASR with embedded noise reduction | |
CN104952446A (en) | Digital building presentation system based on voice interaction | |
Bansal et al. | Phoneme based model for gender identification and adult-child classification | |
CN108766416A (en) | Speech recognition method and related product | |
KR102220964B1 (en) | Method and device for audio recognition | |
Mitra et al. | Speech inversion: Benefits of tract variables over pellet trajectories | |
CN106128472A (en) | Processing method and device for a singer's voice | |
Venkateswarlu et al. | Performance on speech enhancement objective quality measures using hybrid wavelet thresholding | |
Dai et al. | 2D Psychoacoustic modeling of equivalent masking for automatic speech recognition | |
CN111696524A (en) | Character-overlapping voice recognition method and system | |
Muni et al. | Deep learning techniques for speech emotion recognition | |
CN110189747A (en) | Voice signal recognition method, device and equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||