CN110085251A - Human voice extraction method, human voice extraction apparatus and related products - Google Patents

Human voice extraction method, human voice extraction apparatus and related products

Info

Publication number
CN110085251A
CN110085251A (application number CN201910343129.5A)
Authority
CN
China
Prior art keywords
audio
voice
frame
audio frame
spectrogram
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910343129.5A
Other languages
Chinese (zh)
Other versions
CN110085251B (en)
Inventor
王征韬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology (Shenzhen) Co., Ltd.
Priority to CN201910343129.5A
Publication of CN110085251A
Application granted
Publication of CN110085251B
Active legal status
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/04 - Segmentation; Word boundary detection
    • G10L17/00 - Speaker identification or verification
    • G10L17/04 - Training, enrolment or model building
    • G10L21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0272 - Voice signal separating
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/03 - characterised by the type of extracted parameters
    • G10L25/18 - the extracted parameters being spectral information of each sub-band
    • G10L25/78 - Detection of presence or absence of voice signals
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Abstract

An embodiment of the present application provides a human voice extraction method, comprising: performing, based on a voice extraction model, vocal extraction on mixed audio to obtain intermediate audio, the intermediate audio including vocal audio frames and non-vocal audio frames; and filtering out, based on a voice filtering model, the non-vocal audio frames of the intermediate audio to obtain vocal audio. Embodiments of the present application can extract pure vocal audio and improve user experience.

Description

Human voice extraction method, human voice extraction apparatus and related products
Technical field
This application relates to the field of electronic audio signal processing, and in particular to a human voice extraction method, a human voice extraction apparatus, and related products.
Background
Human voice extraction is a widely studied audio-processing task, and many classes of extraction algorithms exist. However, owing to limitations of the algorithms themselves or of their training samples, no existing vocal extraction algorithm produces clean vocals. For example, when vocals are extracted from mixed audio with an Hourglass model, the extracted result is fairly complete and highly recognizable, but instrumental passages such as preludes and interludes are sometimes misidentified as vocals and retained. Existing techniques therefore cannot extract completely pure vocals from mixed audio.
Summary of the invention
Embodiments of the present application provide a human voice extraction method, a human voice extraction apparatus, and related products that obtain pure vocal audio through two-step vocal extraction, avoiding the misidentification that occurs in existing vocal extraction.
In a first aspect, an embodiment of the present application provides a human voice extraction method, comprising:
performing, based on a voice extraction model, vocal extraction on mixed audio to obtain intermediate audio, the intermediate audio including vocal audio frames and non-vocal audio frames; and
filtering out, based on a voice filtering model, the non-vocal audio frames of the intermediate audio to obtain vocal audio.
In a second aspect, an embodiment of the present application provides a human voice extraction apparatus, comprising:
an extraction unit, configured to perform vocal extraction on mixed audio based on a voice extraction model to obtain intermediate audio, the intermediate audio including vocal audio frames and non-vocal audio frames; and
a filtering unit, configured to filter out the non-vocal audio frames of the intermediate audio based on a voice filtering model to obtain vocal audio.
In a third aspect, an embodiment of the present application provides an electronic device comprising a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, the programs including instructions for performing the steps of the method of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program that causes a computer to perform the method of the first aspect.
In a fifth aspect, an embodiment of the present application provides a computer program product comprising a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to perform the method of the first aspect.
Implementing the embodiments of the present application yields the following beneficial effects:
As can be seen, in the embodiments of the present application the intermediate audio produced by the voice extraction model is fed to the filtering model, which filters the non-vocal audio frames from that intermediate audio. Pure vocal audio is obtained through this two-step vocal extraction, solving the prior-art problem that pure vocals cannot be extracted from mixed audio and improving the extraction result.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present application more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show some embodiments of the present application; a person of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a human voice extraction method provided by an embodiment of the present application;
Fig. 2A is a schematic flowchart of a method for obtaining training data provided by an embodiment of the present application;
Fig. 2B is a schematic flowchart of another method for obtaining training data provided by an embodiment of the present application;
Fig. 2C is a schematic diagram of the spectrum of an audio frame provided by an embodiment of the present application;
Fig. 3 is a schematic flowchart of another human voice extraction method provided by an embodiment of the present application;
Fig. 4 is a network structure of a voice filtering model provided by an embodiment of the present application;
Fig. 5 is a schematic structural diagram of a human voice extraction apparatus provided by an embodiment of the present application;
Fig. 6 is a block diagram of the functional units of a human voice extraction apparatus provided by an embodiment of the present application.
Detailed description
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the drawings in the embodiments of the present application. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
The terms "first", "second", "third", "fourth", and the like in the description, claims, and drawings of the present application are used to distinguish different objects, not to describe a particular order. Moreover, the terms "include" and "have" and any variants thereof are intended to cover a non-exclusive inclusion: a process, method, system, product, or device comprising a series of steps or units is not limited to the listed steps or units, but optionally further comprises unlisted steps or units, or optionally further comprises other steps or units inherent to the process, method, product, or device.
Reference herein to an "embodiment" means that a particular feature, result, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application. Occurrences of the phrase at various places in the description do not necessarily all refer to the same embodiment, nor to separate or alternative embodiments mutually exclusive of other embodiments. A person skilled in the art understands, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
The human voice extraction apparatus in the present application may include a smartphone (such as an Android phone, an iOS phone, or a Windows Phone), a tablet computer, a palmtop computer, a laptop, a mobile internet device (MID), or a wearable device. The above electronic devices are merely examples rather than an exhaustive list; in practical applications the human voice extraction apparatus is not limited to the above forms and may also include, for example, an intelligent in-vehicle terminal or computer equipment.
Referring to Fig. 1, Fig. 1 shows a human voice extraction method provided by an embodiment of the present application. The method is applied to a human voice extraction apparatus and comprises steps 101-102:
Step 101: the human voice extraction apparatus performs vocal extraction on mixed audio based on a voice extraction model to obtain intermediate audio, the intermediate audio including vocal audio frames and non-vocal audio frames.
Here, vocal extraction means separating recognizable vocal audio from mixed audio containing both vocals and background instrumental sound.
The voice extraction model is a prior-art neural network model, for example an Hourglass model, whose extraction process is not repeated here. It should be understood that when the Hourglass model performs vocal extraction its input is individual audio frames, and vocals are extracted from each frame separately, so the model works only on local information of the mixed audio. Instrumental passages such as preludes and interludes may consequently be misidentified as vocals and retained, leaving residual instrumental material in the extracted vocal audio; pure vocal audio therefore cannot be extracted from mixed audio this way.
Step 102: the human voice extraction apparatus filters out the non-vocal audio frames of the intermediate audio based on a voice filtering model to obtain vocal audio.
The voice filtering model is constructed based on a machine-learning ensemble algorithm, which may be the Viterbi algorithm or the conditional random field (CRF) algorithm; the present application takes the Viterbi algorithm as the example.
The Viterbi algorithm is a dynamic programming algorithm for finding the hidden-state sequence (the Viterbi path) most likely to have generated an observed sequence of events. It is applied especially in the context of Markov information sources and hidden Markov models to solve optimal-path problems. In the present application the Viterbi algorithm dynamically adjusts the vocal probability sequence, completing the construction of the voice filtering model.
The voice filtering model is constructed as follows: the model is pre-trained, based on the machine-learning ensemble algorithm, with training data and the label sequences corresponding to that training data, the training data and label sequences being obtained by preprocessing existing audio files. Since the input of the voice filtering model is an audio segment, the model has a larger receptive field and can access the global information of the intermediate audio, enabling it to filter out the non-vocal audio frames of the intermediate audio.
As can be seen, in this embodiment, after the voice extraction model produces the intermediate audio, the voice filtering model, with its larger receptive field, filters the intermediate audio to remove its non-vocal audio frames. Pure vocals are thus extracted from the mixed audio, the extracted vocal quality is better, and user experience is improved.
The process of preprocessing an audio file to obtain training data and label sequences is described in detail below.
Referring to Fig. 2A, Fig. 2A is a schematic flowchart of a method for obtaining training data and label sequences provided by an embodiment of the present application. The method is applied to a human voice extraction apparatus and comprises steps 201a-205a:
Step 201a: the human voice extraction apparatus performs vocal extraction on an audio file based on the voice extraction model to obtain sample audio.
Optionally, the sample audio includes vocal audio frames and non-vocal audio frames, the non-vocal audio frames being misidentified frames of instrumental passages such as preludes and interludes.
Step 202a: the human voice extraction apparatus frames the sample audio to obtain N sample audio frames, N being an integer greater than 1.
Because an audio signal is non-stationary over its whole duration and cannot be processed directly, the sample audio is framed according to a preset window function and a preset step size to obtain N sample audio frames. Each sample audio frame can be regarded as a stationary signal, and to preserve the continuity of the audio signal any two adjacent sample audio frames overlap. For example, with a window length of 30 ms and a step size of 20 ms, any two adjacent sample audio frames share a 10 ms overlap.
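Purely as an illustration and not part of the patent, a minimal NumPy sketch of this framing step might look as follows; the 16 kHz sampling rate is an assumed example, since the patent fixes only the 30 ms window and 20 ms step:

```python
import numpy as np

def frame_signal(samples, sr, win_ms=30, hop_ms=20):
    """Split a 1-D audio signal into overlapping frames.

    With a 30 ms window and a 20 ms hop, any two adjacent
    frames share a 10 ms overlap, as in the example above.
    """
    win = int(sr * win_ms / 1000)   # samples per frame
    hop = int(sr * hop_ms / 1000)   # samples per step
    n_frames = 1 + max(0, (len(samples) - win) // hop)
    return np.stack([samples[i * hop : i * hop + win]
                     for i in range(n_frames)])

# 1 second of (random) audio at an assumed 16 kHz sampling rate
frames = frame_signal(np.random.randn(16000), sr=16000)
print(frames.shape)  # (49, 480): 49 frames of 480 samples each
```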
Step 203a: the human voice extraction apparatus performs a short-time Fourier transform on each sample audio frame to obtain the spectrum of each sample audio frame.
In some possible embodiments, the spectrum may be a magnitude spectrum, a power (energy) spectrum, or a log-power spectrum. The present application takes the magnitude spectrum as the example.
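For illustration only, the following sketch computes such a magnitude spectrogram with a short-time Fourier transform; librosa is an assumed tool choice here, and the file name, FFT size, and hop length are hypothetical:

```python
import numpy as np
import librosa

# "sample_vocal.wav" is a hypothetical input file
y, sr = librosa.load("sample_vocal.wav", sr=None)
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=320, win_length=480))
# S has shape (m, N): one magnitude value per frequency bin (m bins)
# for each of the N frames -- each column is one frame's spectral vector
print(S.shape)
```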
Step 204a: the human voice extraction apparatus obtains the first spectrogram of the audio file based on the spectrum of each sample audio frame and labels the first spectrogram as training data.
The first spectrogram is the matrix formed by the spectral vectors of the sample audio frames, the spectral vector of each sample audio frame being the column vector formed by the magnitudes at the frame's frequency bins.
For example, referring to Fig. 2C, Fig. 2C shows the spectrum of the k-th sample audio frame, 1 ≤ k ≤ N, where f1, f2, f3, ..., fm are the frequency bins of the k-th sample audio frame in the frequency domain and m is the number of frequency bins per frame. The spectral vector corresponding to the spectrum of the k-th sample audio frame is xk = [Ak1, Ak2, Ak3, ..., Akm]^T, so the N spectral vectors of the N sample audio frames form the first spectrogram, the m × N matrix X1 = [x1, x2, ..., xN].
Step 205a: the human voice extraction apparatus obtains the label sequence corresponding to the training data based on the first spectrogram.
The label sequence marks the frame property of the sample audio frame corresponding to each column vector of the training data; the frame property is either vocal or non-vocal. For example, the j-th element of the label sequence marks the frame property of the j-th audio frame of the training data, 1 ≤ j ≤ N, j being an integer.
In some possible embodiments, obtaining the label sequence from the first spectrogram may proceed as follows: determine, based on a voice activity detection (VAD) algorithm, the first frame numbers corresponding to silent audio frames in the first spectrogram; obtain the lyrics file corresponding to the audio file and determine, based on the lyrics file, the second frame numbers corresponding to vocal audio frames and the third frame numbers corresponding to non-vocal audio frames in the first spectrogram; and obtain the label sequence from the first frame numbers, the second frame numbers, and the third frame numbers.
In some possible embodiments, before the first frame numbers of the silent audio frames are determined with the VAD algorithm, the method further comprises performing spectral-subtraction denoising on the first spectrogram to remove its background noise; spectral subtraction is prior art and is not repeated here.
Specifically, the singing periods (containing lyrics) and non-singing periods (containing no lyrics) of the audio file are determined from the lyrics file. All audio frames of the non-singing periods are non-vocal audio frames, while the audio frames of the singing periods at least include vocal audio frames. It should be appreciated that within any singing period the singer may breathe between two adjacent lyric lines, so singing periods contain silent intervals; that is, some frames of a singing period are silent audio frames. The VAD algorithm is therefore used to determine the silent intervals corresponding to the silent audio frames of the audio file. Each interval is then compared with the time span covered by each sample audio frame to find the interval to which each frame belongs, and the frame property of each sample audio frame is determined from that interval; that is, the frame numbers of the vocal audio frames, the non-vocal audio frames, and the silent audio frames are determined.
For example, with a frame length of 30 ms and a step size of 20 ms, let the label of a vocal audio frame be 1 and the labels of non-vocal and silent audio frames be 0. If 0-50 ms of the audio file belongs to a non-vocal period, frames 1 and 2 are non-vocal audio frames, so the labels of frames 1 and 2 of the training data are 0. If 50-70 ms and 90-110 ms of the audio file belong to vocal periods, frames 3 and 5 are vocal audio frames and frame 4 is a silent audio frame, so the labels of frames 3 and 5 are 1 and the label of frame 4 is 0. The label sequence is therefore [0, 0, 1, 0, 1, ...].
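A minimal sketch of deriving such a label sequence from the sung intervals (taken from the lyrics file) and the silent intervals (taken from VAD output); judging each frame by its midpoint is an assumption, since the patent does not specify how boundary frames are assigned:

```python
import numpy as np

def label_sequence(n_frames, win_ms, hop_ms, sung_spans, silent_spans):
    """Per-frame labels: 1 = vocal, 0 = non-vocal or silent.

    sung_spans and silent_spans are (start_ms, end_ms) pairs from
    the lyrics file and the VAD algorithm respectively; each frame
    is assigned by the interval containing its midpoint.
    """
    labels = np.zeros(n_frames, dtype=int)
    mids = np.arange(n_frames) * hop_ms + win_ms / 2
    for lo, hi in sung_spans:
        labels[(mids >= lo) & (mids < hi)] = 1
    for lo, hi in silent_spans:      # silence inside a sung span -> 0
        labels[(mids >= lo) & (mids < hi)] = 0
    return labels

# Mirrors the worked example: frames 3 and 5 vocal, frame 4 silent
print(label_sequence(5, 30, 20, [(50, 70), (90, 110)], [(70, 90)]))
# [0 0 1 0 1]
```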
Referring to Fig. 2B, which builds on Fig. 2A, Fig. 2B is a schematic flowchart of another method for obtaining training data provided by an embodiment of the present application. The method is applied to a human voice extraction apparatus and comprises steps 201b-207b:
Step 201b: the human voice extraction apparatus performs vocal extraction on an audio file based on the voice extraction model to obtain sample audio.
Step 202b: the human voice extraction apparatus frames the sample audio to obtain N sample audio frames.
Step 203b: the human voice extraction apparatus performs a short-time Fourier transform on each sample audio frame to obtain the spectrum of each sample audio frame.
Step 204b: the human voice extraction apparatus obtains the first spectrogram of the audio file based on the spectrum of each sample audio frame.
Step 205b: the human voice extraction apparatus determines the first-order difference between corresponding elements of the i-th column vector and the (i+1)-th column vector of the first spectrogram to obtain a difference vector, and vertically concatenates the difference vector with the (i+1)-th column vector to obtain the second spectrogram, 1 ≤ i < N, i being an integer.
Since the first sample audio frame of the first spectrogram has no corresponding difference vector, a preset difference vector A0 = [A01, A02, ..., A0m] is vertically concatenated with it. The preset difference vector may be an all-zero vector, a vector of preset elements, or the like; the present application imposes no unique restriction.
After the preset difference vector is concatenated, the second spectrogram is the 2m × N matrix whose columns stack each difference vector on the corresponding spectral vector: X2 = [[A0; x1], [x2 - x1; x2], ..., [xN - x(N-1); xN]].
Optionally, by taking the first-order difference of the frame vectors of adjacent sample audio frames and vertically concatenating the difference vectors with the first spectrogram, each column vector of the second spectrogram also carries the audio information of the preceding frame. When the vocal probability of each sample audio frame is later computed, this added prior information in each frame's column vector makes the computed vocal probability more accurate.
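As a sketch of this step under the all-zero padding choice (one of the options the text allows), the difference columns are stacked on top of the original spectrogram, doubling the number of rows:

```python
import numpy as np

def second_spectrogram(S):
    """Stack each frame's spectrum with its first-order difference.

    S is the m x N first spectrogram; column i+1 of the result
    concatenates (x_{i+1} - x_i) with x_{i+1}. The first frame gets
    an all-zero difference vector as its preset padding.
    """
    diff = np.diff(S, axis=1)                                  # m x (N-1)
    diff = np.concatenate([np.zeros((S.shape[0], 1)), diff], axis=1)
    return np.concatenate([diff, S], axis=0)                   # 2m x N

S = np.abs(np.random.randn(4, 6))       # toy 4-bin, 6-frame spectrogram
print(second_spectrogram(S).shape)      # (8, 6)
```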
Step 206b: the human voice extraction apparatus labels the second spectrogram as training data.
Step 207b: the human voice extraction apparatus obtains the label sequence corresponding to the training data based on the first spectrogram.
Finally, training the voice filtering model with the training data is prior art and is not repeated here.
Referring to Fig. 3, Fig. 3 is a schematic flowchart of another human voice extraction method provided by an embodiment of the present application. The method is applied to a human voice extraction apparatus and comprises steps 301-306:
Step 301: the human voice extraction apparatus performs vocal extraction on mixed audio based on a voice extraction model to obtain intermediate audio, the intermediate audio including vocal audio frames and non-vocal audio frames.
Step 302: the human voice extraction apparatus segments the intermediate audio into several audio segments, any two adjacent audio segments having an overlapping portion.
Optionally, the human voice extraction apparatus segments the intermediate audio into several audio segments according to a preset window function and a preset step size, each audio segment containing at least one audio frame. For example, with a 10 s window function and a 5 s step size, the intermediate audio is divided into segments such that any two adjacent segments share a 5 s overlap.
Step 303: the human voice extraction apparatus inputs the audio segments one by one into the voice filtering model to obtain the first vocal probability sequence of each audio segment; the first vocal probability sequence indicates, for each audio frame of the segment, the probability that the frame is vocal.
Step 304: the human voice extraction apparatus determines, based on the first vocal probability sequences of the audio segments, the mean vocal probability of each audio frame in the overlapping portions, obtaining the second vocal probability sequence of the intermediate audio.
Optionally, the first vocal probability sequence of each audio segment is determined with the voice filtering model. Because two adjacent segments overlap, their two first vocal probability sequences both contain a vocal probability for each audio frame of the overlap, and the vocal probability of each overlapped frame is obtained by averaging. These averaged probabilities, together with the probabilities of the non-overlapped frames, form the second vocal probability sequence of the intermediate audio; each element of the second vocal probability sequence indicates the probability that the corresponding audio frame of the intermediate audio is vocal.
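The following sketch shows the segmentation-and-averaging logic with a stand-in scorer in place of the trained filtering model; segment length and hop are given in frames, and mapping the 10 s window and 5 s step to frame counts would depend on the frame hop, so the numbers below are illustrative:

```python
import numpy as np

def averaged_probabilities(n_frames, seg_len, hop, score_fn):
    """Score overlapping segments and average the vocal probabilities
    of frames that fall into more than one segment.

    score_fn(start, end) stands in for the trained voice filtering
    model and must return one probability per frame in [start, end).
    """
    prob_sum = np.zeros(n_frames)
    count = np.zeros(n_frames)
    start = 0
    while start < n_frames:
        end = min(start + seg_len, n_frames)
        prob_sum[start:end] += score_fn(start, end)
        count[start:end] += 1
        if end == n_frames:
            break
        start += hop
    return prob_sum / count

# Toy scorer that rates every frame 0.5; real scores come from the model
p = averaged_probabilities(20, seg_len=10, hop=5,
                           score_fn=lambda s, e: np.full(e - s, 0.5))
print(p.shape)  # (20,) -- the second vocal probability sequence
```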
Step 305: the human voice extraction apparatus determines the target vocal probability sequence of the intermediate audio based on the Viterbi algorithm and the second vocal probability sequence.
Optionally, the elements of the second probability sequence are adjusted with the Viterbi algorithm to obtain an optimal probability sequence, which is taken as the target vocal probability sequence. In a manner analogous to finding an optimal path, the hidden sequence corresponding to the second probability sequence is determined with the Viterbi algorithm, the likelihood of each candidate hidden sequence is evaluated, and the optimal probability sequence results; the detailed process is prior art and is not described further.
For example, suppose the second vocal probability sequence is [0.0, 0.1, 0.1, 0.2, 0.3, 0.5, 0.8, 0.7, 0.1, 0.1, 0.6, 0.7, 0.8, ...]. From this sequence, audio frames 6, 7, 8, 11, 12, and 13 of the intermediate audio appear to be vocal frames, while frames 9 and 10 appear to be non-vocal. But a speaker's voicing changes gradually, so the vocal probability should also change gradually: in general, one frame's vocal probability is not very large while the next frame's is very small, as that does not match how a speaker actually speaks. The vocal probabilities of frames 9 and 10 are therefore suspect and require dynamic adjustment to conform to the speaker's behaviour.
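One common way to realize this dynamic adjustment is a two-state Viterbi decode in which sticky transitions penalize implausible vocal/non-vocal flicker. The sketch below makes that concrete; the transition probability p_stay is an assumed hyperparameter, not a value from the patent:

```python
import numpy as np

def viterbi_smooth(p_voice, p_stay=0.95):
    """Two-state Viterbi decode over a frame-level vocal-probability
    sequence. States: 0 = non-vocal, 1 = vocal. Emission likelihoods
    come from the model's probabilities; the sticky transition p_stay
    discourages rapid vocal/non-vocal switching.
    """
    p = np.asarray(p_voice, dtype=float)
    logp = np.log(np.clip(np.stack([1 - p, p]), 1e-9, 1.0))   # 2 x T
    logA = np.log(np.array([[p_stay, 1 - p_stay],
                            [1 - p_stay, p_stay]]))
    T = len(p)
    score = logp[:, 0].copy()
    back = np.zeros((2, T), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + logA        # cand[i, j]: prev i -> cur j
        back[:, t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + logp[:, t]
    states = np.zeros(T, dtype=int)
    states[-1] = score.argmax()
    for t in range(T - 1, 0, -1):           # backtrack the best path
        states[t - 1] = back[states[t], t]
    return states

p = [0.0, 0.1, 0.1, 0.2, 0.3, 0.5, 0.8, 0.7, 0.1, 0.1, 0.6, 0.7, 0.8]
print(viterbi_smooth(p))  # the brief dip at frames 9-10 is smoothed over
```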
Step 306: the human voice extraction apparatus filters out the non-vocal audio frames of the intermediate audio based on the target vocal probability sequence to obtain vocal audio. The non-vocal audio frames are the audio frames of the intermediate audio corresponding to target elements of the target vocal probability sequence, a target element being an element meeting a preset condition.
An element meeting the preset condition may be, for example, an element below a threshold; the threshold may be 0.5, 0.6, 0.7, or another value.
As can be seen, in this embodiment, after the voice extraction model produces the intermediate audio, the intermediate audio is segmented, the input data corresponding to each segment is determined, and that input data is fed to the voice filtering model to filter out the non-vocal audio frames of the intermediate audio and obtain pure vocal audio. Because the input of the voice filtering model is an audio segment, whereas in the prior art the input is a single audio frame, the model has a larger receptive field and filters non-vocal audio frames using the global information of the intermediate audio. Pure vocals are extracted from the mixed audio, the extracted vocal quality is better, and user experience is improved.
In some possible embodiments, the human voice extraction method disclosed in the present application is applied to the voice filtering model shown in Fig. 4. The voice filtering model includes P identical network layers and a fully connected layer, the P identical network layers being connected in residual form. Each network layer includes a first convolutional layer, a second convolutional layer, an activation layer, a feature fusion layer, and a feature superposition layer; the fully connected layer may densely connect multiple network layers.
First, the human voice extraction apparatus segments the intermediate audio into several audio segments and performs a short-time Fourier transform on each audio segment to obtain the spectrogram corresponding to each segment (which may be the first spectrogram or the second spectrogram described above). The input data corresponding to each segment is obtained from that spectrogram; the conversion process follows the procedure for obtaining training data described above and is not detailed here. The input data is fed to the first of the P network layers of the voice filtering model. The first convolutional layer performs a first convolution on the input data to obtain a first feature matrix. The second convolutional layer performs a second convolution on the input data to obtain a second feature matrix. The activation layer applies a nonlinear activation to the second feature matrix to obtain a third feature matrix. The feature fusion layer multiplies the first feature matrix and the third feature matrix to obtain a fourth feature matrix. The feature superposition layer superposes the fourth feature matrix on the input data to obtain the layer's output data, which serves as the input data of the next network layer; after the P network layers, the target feature matrix of each audio segment is obtained. The fully connected layer performs a full-connection operation on the target feature matrix to obtain the feature vector corresponding to each audio segment, and the feature vector is input to a softmax classifier to obtain the vocal probability sequence corresponding to each audio segment.
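Reading the feature fusion as an element-wise (gated) product, which is an assumption on our part, one network layer and the stacked model might be sketched in PyTorch as follows; the channel count, kernel size, number of layers P, and input bin count are illustrative choices, not values from the patent:

```python
import torch
import torch.nn as nn

class GatedResidualBlock(nn.Module):
    """One network layer as described: two parallel convolutions on the
    input, a nonlinear activation on the second branch, element-wise
    fusion of the two branches, and a residual (feature-superposition)
    connection back to the input."""
    def __init__(self, channels=64, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.conv_a = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.conv_b = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.act = nn.Sigmoid()

    def forward(self, x):                    # x: (batch, channels, frames)
        fused = self.conv_a(x) * self.act(self.conv_b(x))  # feature fusion
        return x + fused                     # feature superposition

class VoiceFilter(nn.Module):
    """P stacked blocks followed by a per-frame linear head and softmax,
    emitting a vocal probability for every frame of an audio segment."""
    def __init__(self, n_freq=513, channels=64, P=5):
        super().__init__()
        self.proj = nn.Conv1d(n_freq, channels, 1)   # map bins to channels
        self.blocks = nn.Sequential(*[GatedResidualBlock(channels)
                                      for _ in range(P)])
        self.head = nn.Linear(channels, 2)

    def forward(self, spec):                 # spec: (batch, n_freq, frames)
        h = self.blocks(self.proj(spec))
        logits = self.head(h.transpose(1, 2))          # (batch, frames, 2)
        return torch.softmax(logits, dim=-1)[..., 1]   # P(vocal) per frame

probs = VoiceFilter()(torch.randn(1, 513, 500))
print(probs.shape)  # torch.Size([1, 500])
```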
It should be noted that Fig. 4 is only one network structure for the voice filtering model; the present application uses it only as an example and does not uniquely limit the voice filtering model.
The above describes the solutions of the embodiments of the present application mainly from the perspective of the method execution process. It will be appreciated that, to realize the above functions, the apparatus implementing the method includes hardware structures and/or software modules corresponding to each function. Those skilled in the art will readily appreciate that the exemplary units and algorithm steps described in connection with the embodiments disclosed herein can be implemented in hardware or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application and the design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present application.
The embodiments of the present application may divide the apparatus into functional units according to the above method examples. For example, each function may be assigned its own functional unit, or two or more functions may be integrated into one processing unit. The integrated unit may be implemented in hardware or as a software functional unit. It should be noted that the division of units in the embodiments of the present application is schematic and is only a logical division of functions; other divisions are possible in actual implementation.
Consistent with the human voice extraction method embodiments shown above, referring to Fig. 5, Fig. 5 is a schematic structural diagram of a human voice extraction apparatus 500 provided by an embodiment of the present application. As shown in Fig. 5, the human voice extraction apparatus 500 includes a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are distinct from the one or more application programs, are stored in the memory, and are configured to be executed by the processor. The programs include instructions for performing the following steps:
performing, based on a voice extraction model, vocal extraction on mixed audio to obtain intermediate audio, the intermediate audio including vocal audio frames and non-vocal audio frames; and
filtering out, based on a voice filtering model, the non-vocal audio frames of the intermediate audio to obtain vocal audio.
In a possible embodiment, the voice filtering model is constructed based on a machine-learning ensemble algorithm, and the programs further include instructions for performing the following step:
before the non-vocal audio frames of the intermediate audio are filtered out based on the voice filtering model, preprocessing an audio file to obtain training data and a label sequence, and using the training data and the label sequence to perform optimization training on the voice filtering model.
In a possible embodiment, in preprocessing the audio file to obtain the training data and the label sequence, the programs are specifically for executing instructions for the following steps:
performing vocal extraction on an audio file based on the voice extraction model to obtain sample audio;
framing the sample audio to obtain N sample audio frames, N being an integer greater than 1;
performing a short-time Fourier transform on each sample audio frame to obtain the spectrum of each sample audio frame;
obtaining the first spectrogram of the audio file based on the spectrum of each sample audio frame, the first spectrogram being the matrix formed by the spectral vectors of the sample audio frames, and the spectral vector of each sample audio frame being the column vector formed by the magnitudes at the frame's frequency bins;
labeling the first spectrogram as training data; and
obtaining, based on the first spectrogram, the label sequence corresponding to the training data, the label sequence marking the frame property of the sample audio frame corresponding to each column vector of the training data, the frame property including vocal and non-vocal.
In a possible embodiment, before the first spectrogram is labeled as training data, the programs further include instructions for performing the following steps:
determining the first-order difference between corresponding elements of the i-th column vector and the (i+1)-th column vector of the first spectrogram to obtain a difference vector, 1 ≤ i < N, i being an integer; and
vertically concatenating the difference vector with the (i+1)-th column vector to obtain the second spectrogram;
and taking the first spectrogram as training data comprises:
labeling the second spectrogram as training data.
In a possible embodiment, in obtaining the label sequence corresponding to the training data based on the first spectrogram, the programs are specifically for executing instructions for the following steps:
determining, based on a voice activity detection algorithm, the first frame numbers corresponding to silent audio frames in the first spectrogram;
obtaining the lyrics file corresponding to the audio file, and determining, based on the lyrics file, the second frame numbers corresponding to vocal audio frames and the third frame numbers corresponding to non-vocal audio frames in the first spectrogram; and
obtaining the label sequence based on the first frame numbers, the second frame numbers, and the third frame numbers.
In a possible embodiment, in filtering out the non-vocal audio frames of the intermediate audio based on the voice filtering model, the programs are specifically for executing instructions for the following steps:
segmenting the intermediate audio into several audio segments, any two adjacent audio segments having an overlapping portion;
inputting the audio segments one by one into the voice filtering model to obtain the first vocal probability sequence of each audio segment, the first vocal probability sequence indicating, for each audio frame of the segment, the probability that the frame is vocal;
determining, based on the first vocal probability sequences of the audio segments, the mean vocal probability of each audio frame in the overlapping portions to obtain the second vocal probability sequence of the intermediate audio;
determining the target vocal probability sequence of the intermediate audio based on the Viterbi algorithm and the second vocal probability sequence; and
filtering out the non-vocal audio frames of the intermediate audio based on the target vocal probability sequence to obtain vocal audio, the non-vocal audio frames being the audio frames of the intermediate audio corresponding to target elements of the target vocal probability sequence, a target element being an element meeting a preset condition.
Referring to Fig. 6, Fig. 6 shows a possible block diagram of the functional units of the human voice extraction apparatus 600 involved in the above embodiments. The human voice extraction apparatus 600 includes an extraction unit 610 and a filtering unit 620, wherein:
the extraction unit 610 is configured to perform vocal extraction on mixed audio based on a voice extraction model to obtain intermediate audio, the intermediate audio including vocal audio frames and non-vocal audio frames; and
the filtering unit 620 is configured to filter out the non-vocal audio frames of the intermediate audio based on a voice filtering model to obtain vocal audio.
In a possible embodiment, the voice filtering model is constructed based on a machine-learning ensemble algorithm, and the human voice extraction apparatus 600 further includes a training unit 630, the training unit 630 being configured to: before the non-vocal audio frames of the intermediate audio are filtered out based on the voice filtering model, preprocess an audio file to obtain training data and a label sequence, and use the training data and the label sequence to perform optimization training on the voice filtering model.
In a possible embodiment, in preprocessing the audio file to obtain the training data and the label sequence, the training unit 630 is specifically configured to: perform vocal extraction on an audio file based on the voice extraction model to obtain sample audio; frame the sample audio to obtain N sample audio frames, N being an integer greater than 1; perform a short-time Fourier transform on each sample audio frame to obtain the spectrum of each sample audio frame; obtain, based on the spectrum of each sample audio frame, the first spectrogram of the audio file, the first spectrogram being the matrix formed by the spectral vectors of the sample audio frames, and the spectral vector of each sample audio frame being the column vector formed by the magnitudes at the frame's frequency bins; label the first spectrogram as training data; and obtain, based on the first spectrogram, the label sequence corresponding to the training data, the label sequence marking the frame property of the sample audio frame corresponding to each column vector of the training data, the frame property including vocal and non-vocal.
In a possible embodiment, before the first spectrogram is labeled as training data, the training unit 630 is further configured to: determine the first-order difference between corresponding elements of the i-th column vector and the (i+1)-th column vector of the first spectrogram to obtain a difference vector, 1 ≤ i < N, i being an integer; and vertically concatenate the difference vector with the (i+1)-th column vector to obtain the second spectrogram. In taking the first spectrogram as training data, the training unit 630 is specifically configured to label the second spectrogram as training data.
In a possible embodiment, in obtaining the label sequence corresponding to the training data based on the first spectrogram, the training unit 630 is specifically configured to: determine, based on a voice activity detection algorithm, the first frame numbers corresponding to silent audio frames in the first spectrogram; obtain the lyrics file corresponding to the audio file and determine, based on the lyrics file, the second frame numbers corresponding to vocal audio frames and the third frame numbers corresponding to non-vocal audio frames in the first spectrogram; and obtain the label sequence based on the first frame numbers, the second frame numbers, and the third frame numbers.
In a possible embodiment, in filtering out the non-vocal audio frames of the intermediate audio based on the voice filtering model, the filtering unit 620 is specifically configured to: segment the intermediate audio into several audio segments, any two adjacent audio segments having an overlapping portion; input the audio segments one by one into the voice filtering model to obtain the first vocal probability sequence of each audio segment, the first vocal probability sequence indicating, for each audio frame of the segment, the probability that the frame is vocal; determine, based on the first vocal probability sequences of the audio segments, the mean vocal probability of each audio frame in the overlapping portions to obtain the second vocal probability sequence of the intermediate audio; determine the target vocal probability sequence of the intermediate audio based on the Viterbi algorithm and the second vocal probability sequence; and filter out the non-vocal audio frames of the intermediate audio based on the target vocal probability sequence to obtain vocal audio, the non-vocal audio frames being the audio frames of the intermediate audio corresponding to target elements of the target vocal probability sequence, a target element being an element meeting a preset condition.
An embodiment of the present application also provides a computer storage medium. The computer-readable storage medium stores a computer program that is executed by a processor to implement some or all of the steps of any human voice extraction method recorded in the above method embodiments.
An embodiment of the present application also provides a computer program product. The computer program product includes a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to perform some or all of the steps of any human voice extraction method recorded in the above method embodiments.
It should be noted that, for brevity, the foregoing method embodiments are described as series of action combinations, but those skilled in the art should understand that the present application is not limited by the described order of actions, because according to the present application some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in this specification are alternative embodiments, and the actions and modules involved are not necessarily required by the present application.
In the above embodiments the description of each embodiment has its own emphasis; for parts not detailed in one embodiment, reference may be made to the relevant descriptions of other embodiments.
In the several embodiments provided herein, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division of the units is only a logical division of functions, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Moreover, the shown or discussed mutual couplings, direct couplings, or communication connections may be indirect couplings or communication connections through interfaces, apparatuses, or units, and may be electrical or of other forms.
The units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment's solution.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit may be implemented in hardware or as a software program module.
If the integrated unit is implemented as a software program module and sold or used as an independent product, it may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a memory and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods of the embodiments of the present application. The aforementioned memory includes media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), removable hard disk, magnetic disk, or optical disc.
A person of ordinary skill in the art will understand that all or some of the steps of the methods of the above embodiments may be completed by a program instructing relevant hardware; the program may be stored in a computer-readable memory, which may include a flash disk, read-only memory (ROM), random access memory (RAM), magnetic disk, optical disc, or the like.
The embodiments of the present application are described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the embodiments is only intended to help understand the methods and core ideas of the present application. Meanwhile, a person skilled in the art may, according to the ideas of the present application, make changes to the specific implementations and application scope. In summary, the contents of this specification should not be construed as limiting the present application.

Claims (10)

1. A human voice extraction method, characterized by comprising:
performing, based on a voice extraction model, vocal extraction on mixed audio to obtain intermediate audio, the intermediate audio including vocal audio frames and non-vocal audio frames; and
filtering out, based on a voice filtering model, the non-vocal audio frames of the intermediate audio to obtain vocal audio.
2. The method according to claim 1, characterized in that the voice filtering model is constructed based on a machine-learning ensemble algorithm, and the method further comprises: before the non-vocal audio frames of the intermediate audio are filtered out based on the voice filtering model, preprocessing an audio file to obtain training data and a label sequence, and using the training data and the label sequence to perform optimization training on the voice filtering model.
3. The method according to claim 2, characterized in that preprocessing the audio file to obtain the training data and the label sequence comprises:
performing vocal extraction on an audio file based on the voice extraction model to obtain sample audio;
framing the sample audio to obtain N sample audio frames, N being an integer greater than 1;
performing a short-time Fourier transform on each sample audio frame to obtain the spectrum of each sample audio frame;
obtaining the first spectrogram of the audio file based on the spectrum of each sample audio frame, the first spectrogram being the matrix formed by the spectral vectors of the sample audio frames, and the spectral vector of each sample audio frame being the column vector formed by the magnitudes at the frame's frequency bins;
labeling the first spectrogram as training data; and
obtaining, based on the first spectrogram, the label sequence corresponding to the training data, the label sequence marking the frame property of the sample audio frame corresponding to each column vector of the training data, the frame property including vocal and non-vocal.
4. The method according to claim 3, characterized in that before the first spectrogram is labeled as training data, the method further comprises:
determining the first-order difference between corresponding elements of the i-th column vector and the (i+1)-th column vector of the first spectrogram to obtain a difference vector, 1 ≤ i < N, i being an integer; and
vertically concatenating the difference vector with the (i+1)-th column vector to obtain the second spectrogram;
and taking the first spectrogram as training data comprises:
labeling the second spectrogram as training data.
5. The method according to claim 3 or 4, characterized in that obtaining, based on the first spectrogram, the label sequence corresponding to the training data comprises:
determining, based on a voice activity detection algorithm, the first frame numbers corresponding to silent audio frames in the first spectrogram;
obtaining the lyrics file corresponding to the audio file, and determining, based on the lyrics file, the second frame numbers corresponding to vocal audio frames and the third frame numbers corresponding to non-vocal audio frames in the first spectrogram; and
obtaining the label sequence based on the first frame numbers, the second frame numbers, and the third frame numbers.
6. The method according to claim 1, characterized in that filtering out, based on the voice filtering model, the non-vocal audio frames of the intermediate audio comprises:
segmenting the intermediate audio into several audio segments, any two adjacent audio segments having an overlapping portion;
inputting the audio segments one by one into the voice filtering model to obtain the first vocal probability sequence of each audio segment, the first vocal probability sequence indicating, for each audio frame of the segment, the probability that the frame is vocal;
determining, based on the first vocal probability sequences of the audio segments, the mean vocal probability of each audio frame in the overlapping portions to obtain the second vocal probability sequence of the intermediate audio;
determining the target vocal probability sequence of the intermediate audio based on the Viterbi algorithm and the second vocal probability sequence; and
filtering out the non-vocal audio frames of the intermediate audio based on the target vocal probability sequence to obtain vocal audio, the non-vocal audio frames being the audio frames of the intermediate audio corresponding to target elements of the target vocal probability sequence, a target element being an element meeting a preset condition.
7. a kind of voice extraction element characterized by comprising
Extraction unit carries out voice extraction to mixed audio, obtains intermediate audio for extracting model based on voice, it is described in Between audio include voice audio frame and inhuman sound audio frame;
Filter element filters out the inhuman sound audio frame of the intermediate audio, obtains voice sound for being based on voice filtering model Frequently.
8. device according to claim 7, which is characterized in that described device further includes training unit,
The training unit, it is right before the inhuman sound audio frame for filtering out the intermediate audio for being based on voice filtering model Audio file is pre-processed to obtain training data and sequence label, using the training data and the sequence label to institute It states voice filtering model and optimizes training.
9. a kind of electronic equipment, which is characterized in that including processor, memory, communication interface and one or more program, In, one or more of programs are stored in the memory, and are configured to be executed by the processor, described program Include the steps that requiring the instruction in any one of 1-7 method for perform claim.
10. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and the computer program is executed by a processor to implement the method according to any one of claims 1 to 7.
CN201910343129.5A 2019-04-26 2019-04-26 Human voice extraction method, human voice extraction device and related products Active CN110085251B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910343129.5A CN110085251B (en) 2019-04-26 2019-04-26 Human voice extraction method, human voice extraction device and related products

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910343129.5A CN110085251B (en) 2019-04-26 2019-04-26 Human voice extraction method, human voice extraction device and related products

Publications (2)

Publication Number Publication Date
CN110085251A 2019-08-02
CN110085251B 2021-06-25

Family

ID=67416989

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910343129.5A Active CN110085251B (en) 2019-04-26 2019-04-26 Human voice extraction method, human voice extraction device and related products

Country Status (1)

Country Link
CN (1) CN110085251B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006133284A (en) * 2004-11-02 2006-05-25 Kddi Corp Voice information extracting device
CN101404160A (en) * 2008-11-21 2009-04-08 北京科技大学 Voice denoising method based on audio recognition
WO2014167570A1 (en) * 2013-04-10 2014-10-16 Technologies For Voice Interface System and method for extracting and using prosody features
CN105719657A (en) * 2016-02-23 2016-06-29 惠州市德赛西威汽车电子股份有限公司 Human voice extracting method and device based on microphone
CN108962277A (en) * 2018-07-20 2018-12-07 广州酷狗计算机科技有限公司 Speech signal separation method, apparatus, computer equipment and storage medium
CN109308901A (en) * 2018-09-29 2019-02-05 百度在线网络技术(北京)有限公司 Chanteur's recognition methods and device

Cited By (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110428853A (en) * 2019-08-30 2019-11-08 北京太极华保科技股份有限公司 Voice activity detection method, Voice activity detection device and electronic equipment
CN112687274A (en) * 2019-10-17 2021-04-20 北京猎户星空科技有限公司 Voice information processing method, device, equipment and medium
CN110718228A (en) * 2019-10-22 2020-01-21 中信银行股份有限公司 Voice separation method and device, electronic equipment and computer readable storage medium
CN110942776B (en) * 2019-10-31 2022-12-06 厦门快商通科技股份有限公司 Audio splicing prevention detection method and system based on GRU
CN110942776A (en) * 2019-10-31 2020-03-31 厦门快商通科技股份有限公司 Audio splicing prevention detection method and system based on GRU
CN110782907A (en) * 2019-11-06 2020-02-11 腾讯科技(深圳)有限公司 Method, device and equipment for transmitting voice signal and readable storage medium
CN110782907B (en) * 2019-11-06 2023-11-28 腾讯科技(深圳)有限公司 Voice signal transmitting method, device, equipment and readable storage medium
WO2021115083A1 (en) * 2019-12-11 2021-06-17 北京影谱科技股份有限公司 Audio signal time sequence processing method, apparatus and system based on neural network, and computer-readable storage medium
CN111145763A (en) * 2019-12-17 2020-05-12 厦门快商通科技股份有限公司 GRU-based voice recognition method and system in audio
CN113053401A (en) * 2019-12-26 2021-06-29 上海博泰悦臻电子设备制造有限公司 Audio acquisition method and related product
CN111276113A (en) * 2020-01-21 2020-06-12 北京永航科技有限公司 Method and device for generating key time data based on audio
CN111276113B (en) * 2020-01-21 2023-10-17 北京永航科技有限公司 Method and device for generating key time data based on audio
CN111341341A (en) * 2020-02-11 2020-06-26 腾讯科技(深圳)有限公司 Training method of audio separation network, audio separation method, device and medium
CN111354378B (en) * 2020-02-12 2020-11-24 北京声智科技有限公司 Voice endpoint detection method, device, equipment and computer storage medium
CN111354378A (en) * 2020-02-12 2020-06-30 北京声智科技有限公司 Voice endpoint detection method, device, equipment and computer storage medium
CN111968623A (en) * 2020-08-19 2020-11-20 腾讯音乐娱乐科技(深圳)有限公司 Air port position detection method and related equipment
CN111968623B (en) * 2020-08-19 2023-11-28 腾讯音乐娱乐科技(深圳)有限公司 Gas port position detection method and related equipment
CN112259119A (en) * 2020-10-19 2021-01-22 成都明杰科技有限公司 Music source separation method based on stacked hourglass network
CN112397073A (en) * 2020-11-04 2021-02-23 北京三快在线科技有限公司 Audio data processing method and device
CN112397073B (en) * 2020-11-04 2023-11-21 北京三快在线科技有限公司 Audio data processing method and device
CN112270933B (en) * 2020-11-12 2024-03-12 北京猿力未来科技有限公司 Audio identification method and device
CN112270933A (en) * 2020-11-12 2021-01-26 北京猿力未来科技有限公司 Audio identification method and device
WO2022100691A1 (en) * 2020-11-12 2022-05-19 北京猿力未来科技有限公司 Audio recognition method and device
WO2022100692A1 (en) * 2020-11-12 2022-05-19 北京猿力未来科技有限公司 Human voice audio recording method and apparatus
CN112382310A (en) * 2020-11-12 2021-02-19 北京猿力未来科技有限公司 Human voice audio recording method and device
CN112818163A (en) * 2021-01-22 2021-05-18 惠州Tcl移动通信有限公司 Song display processing method, device, terminal and medium based on mobile terminal
CN113113051A (en) * 2021-03-10 2021-07-13 深圳市声扬科技有限公司 Audio fingerprint extraction method and device, computer equipment and storage medium
CN113114417A (en) * 2021-03-30 2021-07-13 深圳市冠标科技发展有限公司 Audio transmission method and device, electronic equipment and storage medium
CN113257242A (en) * 2021-04-06 2021-08-13 杭州远传新业科技有限公司 Voice broadcast suspension method, device, equipment and medium in self-service voice service
CN113572908A (en) * 2021-06-16 2021-10-29 云茂互联智能科技(厦门)有限公司 Method, device and system for reducing noise in VoIP (Voice over Internet protocol) call
CN113242361A (en) * 2021-07-13 2021-08-10 腾讯科技(深圳)有限公司 Video processing method and device and computer readable storage medium
CN113242361B (en) * 2021-07-13 2021-09-24 腾讯科技(深圳)有限公司 Video processing method and device and computer readable storage medium
CN113724720B (en) * 2021-07-19 2023-07-11 电信科学技术第五研究所有限公司 Non-human voice filtering method based on neural network and MFCC (multiple frequency component carrier) in noisy environment
CN113724720A (en) * 2021-07-19 2021-11-30 电信科学技术第五研究所有限公司 Non-human voice filtering method in noisy environment based on neural network and MFCC
CN114203163A (en) * 2022-02-16 2022-03-18 荣耀终端有限公司 Audio signal processing method and device

Also Published As

Publication number Publication date
CN110085251B (en) 2021-06-25

Similar Documents

Publication Publication Date Title
CN110085251A (en) Voice extracting method, voice extraction element and Related product
Zhang et al. Denoising deep neural networks based voice activity detection
CN108109619A (en) Sense of hearing selection method and device based on memory and attention model
CN107578775A (en) A kind of multitask method of speech classification based on deep neural network
CN108847249A (en) Sound converts optimization method and system
CN108198569A (en) A kind of audio-frequency processing method, device, equipment and readable storage medium storing program for executing
CN107195296A (en) A kind of audio recognition method, device, terminal and system
CN110675891B (en) Voice separation method and module based on multilayer attention mechanism
CN108986798B (en) Processing method, device and the equipment of voice data
CN113921022B (en) Audio signal separation method, device, storage medium and electronic equipment
Huang et al. Extraction of adaptive wavelet packet filter‐bank‐based acoustic feature for speech emotion recognition
CN106033669B (en) Audio recognition method and device
CN107910008A (en) A kind of audio recognition method based on more acoustic models for personal device
CN110268471A (en) The method and apparatus of ASR with embedded noise reduction
CN104952446A (en) Digital building presentation system based on voice interaction
Bansal et al. Phoneme based model for gender identification and adult-child classification
CN108766416A (en) Audio recognition method and Related product
KR102220964B1 (en) Method and device for audio recognition
Mitra et al. Speech inversion: Benefits of tract variables over pellet trajectories
CN106128472A (en) The processing method and processing device of singer's sound
Venkateswarlu et al. Performance on speech enhancement objective quality measures using hybrid wavelet thresholding
Dai et al. 2D Psychoacoustic modeling of equivalent masking for automatic speech recognition
CN111696524A (en) Character-overlapping voice recognition method and system
Muni et al. Deep learning techniques for speech emotion recognition
CN110189747A (en) Voice signal recognition methods, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant