CN108962229A - A single-channel, unsupervised target speaker voice extraction method - Google Patents
A single-channel, unsupervised target speaker voice extraction method
- Publication number: CN108962229A
- Application number: CN201810832080.5A
- Authority: CN (China)
- Prior art keywords: voice, teacher, classification, segments, voice segments
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L15/063: Training (creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
- G06F18/214: Generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06F18/23213: Non-hierarchical clustering using statistics or function optimisation with a fixed number of clusters, e.g. K-means clustering
- G10L15/005: Language recognition
- G10L2015/0631: Creating reference templates; clustering
- G10L2015/0635: Training: updating or merging of old and new templates; mean values; weighting
- G10L2015/0636: Threshold criteria for the updating
- G10L2015/0638: Interactive procedures
- G10L25/24: Speech or voice analysis in which the extracted parameters are the cepstrum
Abstract
An embodiment of the invention discloses a single-channel, unsupervised target speaker voice extraction method comprising a teacher-speech detection step and a teacher-speech model training step. The teacher-speech detection step includes: recording the classroom to obtain voice data; preprocessing the voice signal; voice segmentation and modeling, where the segmentation divides the classroom recording into equal-length segments, extracts the corresponding MFCC features for each segment, and builds a GMM for each segment from those features; and teacher-speech detection, which computes the similarity between the GMM of each segment outside the teacher class and the GGMM and labels segments below a set threshold as teacher speech, thereby obtaining the final teacher-speech class. The teacher-speech GGMM training step includes: clustering the voice data obtained in S3 to obtain an initial teacher-speech class, and training a GGMM from that initial class. The invention effectively improves the adaptability and intelligence of the system in practical applications and lays a foundation for subsequent applications and research.
Description
Technical field
The present invention relates to voice extraction methods, and more particularly to a single-channel, unsupervised target speaker voice extraction method for complex multi-speaker situations.
Background art
Education quality is the key guarantee of education at every level, and within it, improving teaching quality, classroom teaching in particular, is the top priority. The traditional approach, however, is evaluation based on manual (peer) classroom observation. Although such methods have some effect, they are neither universally practicable nor universally objective. The reason is that teaching administrators cannot possibly observe, evaluate, and advise on every class at every moment; attempting to do so would impose a heavy and unnecessary burden on teaching management. Moreover, because traditional observation and evaluation cannot follow the whole teaching process, it is difficult to evaluate a teacher's teaching quality objectively.
Information and intelligent technologies have already become an important foundation of social development. How to use and develop them to reform the traditional classroom, and to build efficient, automatic "intelligent perception" for classroom teaching, has naturally become a scientific problem of great research value.
To realize intelligent perception of classroom teaching, the first problem to be solved is the recognition and extraction of teacher speech.
At present, apart from supervised speaker recognition methods, the main unsupervised approach is speaker clustering, and teacher-speech recognition in classroom audio essentially belongs to this category. Research on speaker clustering falls mainly into four kinds: 1. hierarchical clustering; 2. K-means clustering; 3. spectral clustering; 4. affinity propagation clustering.
The article "Research and implementation of unsupervised speaker clustering methods" studies the computational efficiency of spectral clustering based on a feature similarity matrix and implements a spectral clustering algorithm that builds the model similarity matrix with adaptive Gaussian mixture models. Gaussian mixture models are first trained for the voice segments with the GMM-UBM-MAP technique: a universal background model (UBM) is trained offline, then adapted under the maximum a posteriori (MAP) criterion to obtain the target speaker's Gaussian mixture model (GMM). The GMM similarities are then computed to build a similarity matrix, features are extracted from the matrix for clustering, and the target speaker's speech portions are obtained.
The article "Multi-speaker recognition with improved speaker clustering initialization and GMM" extracts Mel-frequency cepstral coefficient (MFCC) features from the voice segments. In the training part, the initial classes are processed with the Bayesian information criterion (BIC) to obtain purer initial categories; the MFCC features are then clustered, and a GMM is trained for each class. In the recognition phase, speaker decisions are made with GMM-based speaker recognition.
Extracting teacher speech requires not only recognizing isolated teacher speech but also separating overlapped speech that contains the teacher's utterances. The purpose of speech separation is to isolate the speech of interest from multiple simultaneously sounding sources. Depending on the relationship between the received source signals and the acquired mixture, speech separation divides into multi-channel and single-channel separation. Single-channel separation needs only a single signal source; compared with multi-channel signals it is not only easier to acquire but also closer to reality, yet separating a single-channel signal is harder. Research on single-channel speech separation falls mainly into three kinds: 1. computational auditory scene analysis (CASA); 2. model-based methods; 3. time-frequency-distribution-based methods.
In the article "An Auditory Scene Analysis Approach to Monaural Speech Segregation", Hu and Wang propose a CASA-based speech separation framework. By simulating the basilar membrane of the human cochlea, the mixed signal is decomposed into a time-frequency representation from which the features needed for separation are extracted; auditory time-frequency segmentation then merges adjacent time-frequency units of the same source into auditory segments, the segments of each source are finally grouped, and the waveform of each source is synthesized to realize separation. Hu and Wang later made a series of improvements to the CASA system, including optimizations for separating voiced and unvoiced signals. The article "CNMF-based acoustic features for noise-robust ASR" points out that NMF is an unsupervised, dictionary-learning-based method that works well for separating various types of signals. The NMF algorithm performs purely additive operations, all factors after decomposition are nonnegative matrices, and it realizes matrix dimensionality reduction. As research has deepened, NMF has become fast and accurate and is well suited to large-scale data processing, so it has found wide application in many fields.
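The additive, nonnegative factorization described above can be illustrated with a minimal sketch. This is not the CNMF variant the patent uses, only plain NMF with the classic Lee-Seung multiplicative updates, which keep both factors nonnegative by construction:

```python
import numpy as np

def nmf(V, rank, n_iter=200, seed=0):
    """Basic NMF via Lee-Seung multiplicative updates (Frobenius norm).

    Factors a nonnegative (m x n) matrix V into W (m x rank) and
    H (rank x n), both nonnegative, so that V ~ W @ H.
    """
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, rank)) + 1e-3
    H = rng.random((rank, n)) + 1e-3
    eps = 1e-10  # guards against division by zero
    for _ in range(n_iter):
        # Multiplicative updates: factors stay nonnegative by construction.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

In separation applications the columns of W act as a learned spectral dictionary and the rows of H as its activations over time.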
The above prior art has the following deficiencies:
1. When performing unsupervised speaker clustering, hierarchical clustering uses whether the minimum inter-class distance exceeds a certain threshold as the criterion for ending clustering, and that threshold genuinely limits the effectiveness of the algorithm.
2. The GMM-UBM-MAP spectral clustering with a feature similarity matrix proposed in "Research and implementation of unsupervised speaker clustering methods" requires training GMMs of the voice signal and cannot achieve fully unsupervised speaker recognition. In addition, it requires the speaker sections in the voice to be detected to be relatively even in length, demands high "purity" of each speaker section, and adapts poorly to real situations of varied form.
3. The article "Multi-speaker recognition with improved speaker clustering initialization and GMM" clusters MFCC coefficients, which are extracted frame by frame; for longer recordings, such as a 40 min class, the computational load is very large and clustering accuracy cannot be well guaranteed.
4. The article "An Auditory Scene Analysis Approach to Monaural Speech Segregation" performs speech separation based on CASA by simulating the human ear, but the features of the ear model are hard to choose.
5. The article "CNMF-based acoustic features for noise-robust ASR" requires the training voice of the separated speech to be given in advance.
6. Noise remains in single-channel separation results, and the above separation methods rarely denoise the result further to purify the separated voice signal.
Summary of the invention
The technical problem to be solved by the embodiments of the invention is to provide a single-channel, unsupervised target speaker voice extraction method: to use and develop relevant information and intelligent technologies to acquire, analyze, process, and recognize the classroom voice signal, and, by constructing an adaptive, unsupervised intelligent method, to robustly detect and extract the teacher's speech portion from the classroom voice signal.
To solve the above technical problem, an embodiment of the invention provides a single-channel, unsupervised target speaker voice extraction method comprising a teacher-speech detection step and a teacher-speech GGMM (General Gaussian Mixture Model) training step.
The teacher-speech detection step comprises:
S1: recording the classroom to obtain voice data;
S2: preprocessing the voice signal;
S3: voice segmentation and modeling, where the segmentation divides the classroom recording into equal-length segments, then extracts the corresponding MFCC features for each segment and builds a GMM for each segment from those features;
S4: teacher-speech detection, which computes the similarity between the GMM of each segment outside the teacher class and the GGMM, sets an adaptive threshold, and labels segments below the threshold as teacher speech, thereby obtaining the final teacher-speech class.
The teacher-speech GGMM training step comprises:
S5: clustering the voice data obtained in S3 to obtain an initial teacher-speech class, and training a GGMM from that initial class.
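As a minimal sketch of the segmentation in S3 (the embodiment later uses 30 s pieces), a mono recording can be split into equal-length segments; the source does not say whether the final partial segment is kept, so keeping it is an assumption here:

```python
import numpy as np

def split_equal_segments(signal, sr, seg_seconds=30.0):
    """Split a mono recording into equal-length segments (step S3).

    seg_seconds = 30.0 matches the 30 s pieces used in the embodiment;
    keeping the final, possibly shorter, segment is an assumption.
    """
    seg_len = int(seg_seconds * sr)
    n_segments = int(np.ceil(len(signal) / seg_len))
    return [signal[i * seg_len:(i + 1) * seg_len] for i in range(n_segments)]
```

The quantity length(C) used later, the total number of segments of the classroom recording, is then simply the length of the returned list.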
Further, the clustering comprises the following steps:
S51: choose cluster center points;
S52: compute the distance of all samples to the center points and iterate until a preset stop condition is met;
S53: repeat S51 and S52 N times in total to obtain N candidate teacher-voice partitions, and select the most satisfactory partition under a set rule as the initial teacher voice;
S54: select several segments from that partition to train the GGMM, and compute the average in-class distance;
S55: according to the GGMM and the average distance, make a second judgment on the remaining speech segments, adding to the teacher class any sample whose distance is below a set threshold;
S56: output all teacher speech samples and write them to the database.
Further, S51 specifically comprises:
S511: randomly select one of all voice segments as the first center point;
S512: compute the GMM distance between each remaining segment and the first center, and select the segment with the maximum distance as the second center;
S513: successively compute the distance between each unselected segment and the chosen centers, and select the segment farthest from the centers as the next center;
S514: iterate until the number of centers reaches the specified number of classes.
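The center selection of S511-S514 can be sketched as a farthest-point traversal over a precomputed segment-to-segment GMM distance matrix. The source does not specify how "distance from the centers" aggregates once several centers are chosen; taking the distance to the nearest chosen center is an assumption here, and the random first pick of S511 is fixed to index 0 for reproducibility:

```python
import numpy as np

def farthest_point_centers(dist, k):
    """Pick k initial cluster centers from a symmetric distance matrix.

    dist[i, j] is the GMM-to-GMM distance between segments i and j.
    Index 0 stands in for the random first pick of S511; each next
    center is the segment farthest from the centers chosen so far
    (S512-S514, under the nearest-center aggregation assumption).
    """
    centers = [0]
    while len(centers) < k:
        # Distance of each segment to its nearest already-chosen center.
        d_to_centers = dist[:, centers].min(axis=1)
        d_to_centers[centers] = -1.0  # never re-pick a chosen center
        centers.append(int(np.argmax(d_to_centers)))
    return centers
```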
Further, S52 specifically comprises:
S521: compute the distance between each remaining GMM and the centers, and assign each GMM to its nearest center;
S522: update the centers: within each class, take the point with the smallest sum of distances to all points in the class as the new center;
S523: iterate until the preset stop condition is met or the predetermined number of iterations is reached.
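Steps S521-S523 amount to a k-medoids style assign/update loop on the same distance matrix, since the "new center" is the in-class point with the smallest sum of distances rather than a mean (a mean of GMMs is not defined). A minimal sketch, which does not handle the empty-class corner case:

```python
import numpy as np

def gmm_kmedoids(dist, centers, max_iter=20):
    """Assign/update loop of the GMM-Kmeans step (S521-S523).

    Works purely on the precomputed GMM distance matrix: each segment
    goes to its nearest center, and each new center is the member of
    its class with the smallest sum of in-class distances (a medoid).
    """
    centers = list(centers)
    for _ in range(max_iter):
        # S521: assign each segment to its nearest center.
        labels = np.asarray([int(np.argmin([dist[i, c] for c in centers]))
                             for i in range(dist.shape[0])])
        # S522: medoid update inside each class.
        new_centers = []
        for k in range(len(centers)):
            members = np.where(labels == k)[0]
            in_class = dist[np.ix_(members, members)].sum(axis=1)
            new_centers.append(int(members[np.argmin(in_class)]))
        if new_centers == centers:  # converged (S523)
            break
        centers = new_centers
    return labels, centers
```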
Further, S53 specifically comprises: the iterations yield N teacher category vectors; compute their pairwise similarities and take the vector whose summed similarity to the remaining N-1 vectors is maximal as the initial teacher class of the final clustering.
Further, S54 specifically comprises: randomly select a portion of the segments in the teacher class, where M is the number of voice segments the clustering placed in the teacher class; sampling only a portion reduces the time of training the GMM on all segments in the teacher class. N is a constant adapted to the size of M; in its defining formula, α is a time adjustment parameter that tunes the number of segments used for GMM training, length(C) is the total number of segments after the original classroom recording is split, and the coefficient 0.4·length(C) gives the minimum number of teacher voice segments.
Further, the S3 includes:
S31: overlapped speech detection, obtaining the overlapped voice segments in the classroom voice;
S32: judging whether the overlapped voice contains teacher speech;
S33: selecting the voice segments closest to the overlapped voice as training segments;
S34: designing a CNMF+JADE method to perform speech separation.
Further, the S31 includes: obtaining the overlapped voice segments by means of silent points, where silent frames are judged by setting an energy threshold. In the threshold's defining formula, E_i denotes the energy of the i-th speech frame, N is the total number of frames in the segment, r is a constant in the range (0, 1), and ⌈·⌉ denotes rounding up.
Further, the S32 includes: judging whether the overlapped voice contains the teacher by GMM similarity, the similarity being measured with an improved Bhattacharyya distance. The judgment compares disp(A, B), the distance between the GMMs of voice segments A and B, with an adaptive threshold t, where A denotes an overlapped segment and B a teacher segment. In the formula for t, p is an adjustment parameter with value in [0.5, 0.8], K is the number of student voice segments, S_i is the i-th student segment, and B is the teacher segment.
Further, the S33 includes: selecting the non-teacher voice segment closest to the overlapped voice and training the CNMF with it together with the teacher segment, the selection being

v_i = min(disp(A_i, S_j)), i = 1, 2, ..., N, j = 1, 2, ..., K

where A_i is the i-th overlapped segment and v_i the corresponding selected training segment.
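The selection rule v_i = min(disp(A_i, S_j)) reduces to a row-wise argmin over a distance matrix between overlapped segments and student segments. A minimal sketch, assuming the disp values are precomputed:

```python
import numpy as np

def pick_training_segments(disp_overlap_to_student):
    """For each overlapped segment A_i, pick the student segment S_j
    with the smallest model distance: v_i = argmin_j disp(A_i, S_j).

    disp_overlap_to_student is an (N_overlap x K_student) matrix of
    GMM distances; the returned indices select the CNMF training
    segments, one per overlapped segment.
    """
    return np.argmin(disp_overlap_to_student, axis=1)
```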
Implementing the embodiments of the invention has the following beneficial effects: facing the high complexity of classroom teaching (chiefly the diversity of classroom situations, of teachers' subjects, and of classroom organization), the invention proposes an unsupervised, adaptively robust teacher-speech detection and extraction method that effectively improves the adaptability and intelligence of the system in practical applications and lays a foundation for subsequent applications and research.
Brief description of the drawings
Fig. 1 is a schematic diagram of the framework and flow of the invention;
Fig. 2 is a flow diagram of the teacher-speech detection step;
Fig. 3 is a schematic diagram of the teacher-speech GGMM training step;
Fig. 4 is a flow diagram of the clustering algorithm;
Fig. 5 is a schematic diagram of the speech separation implementation steps;
Fig. 6 shows the speech enhancement implementation steps.
Specific embodiment
To make the objects, technical solutions, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings.
Referring to Fig. 1, a single-channel, unsupervised target speaker voice extraction method of the invention includes a teacher-speech detection step and a teacher-speech GGMM training step.
As shown in Fig. 2, teacher-speech detection includes the following steps:
S110, recording;
S120, speech signal preprocessing;
S130, voice segmentation and modeling;
S140, teacher-speech detection.
As shown in Fig. 3, the teacher-voice GGMM training unit includes the following steps:
S110, recording;
S120, speech signal preprocessing;
S130, voice segmentation and modeling;
S240, clustering.
In S110, the corresponding classroom voice data is obtained with a recording device.
In S120, the recorded classroom voice is preprocessed with common speech-preprocessing methods, including framing, windowing, and pre-emphasis.
In S130, the classroom voice is split into equal-length segments; the corresponding MFCC features are then extracted for each segment, and a GMM is built for each segment from those features. The GMM of each segment is then passed as input to the clustering of S240 to obtain the initial teacher-speech class, and a GGMM is trained from that initial class. In S140, the similarity between the GMM of each segment outside the teacher class and the GGMM is computed, an adaptive threshold is set, and segments below the threshold are labeled as teacher speech, giving the final teacher-speech class.
The clustering algorithm of S240 is shown in Fig. 4. Its specific embodiment includes the following steps.
S2401, choosing the initial center points:
1) randomly select one of all voice segments as the first center point;
2) compute the GMM distance between each remaining segment and the first center, and select the segment with the maximum distance as the second center;
3) successively compute the distance between each unselected segment and the chosen centers, and select the segment farthest from the centers as the next center;
4) iterate until the number of centers reaches the specified number of classes.
Compared with random center selection, this center-selection method yields a clear improvement in the accuracy of the final clustering. It may select an outlier as a center and so affect the clustering result, but in practice, owing to the stop condition of the GMM-Kmeans algorithm set in S2402 (3), clusterings with an outlier as a center are excluded during iteration, so choosing initial centers this way gives a stable clustering result.
The above procedure still needs a good measure of the distance between Gaussian mixture models. The dispersion of GMM A relative to GMM B is defined with W_Ai the weight of the i-th mixture component of GMM A, W_Bj the weight of the j-th mixture component of GMM B, and d_AB(i, j) the distance between the i-th Gaussian of GMM A and the j-th Gaussian of GMM B. Considering the computational load and the possibility that several Gaussians share the same mean vector, this embodiment uses the Mahalanobis distance as d_AB(i, j): for two multidimensional Gaussian distributions, μ_1, μ_2 denote their mean vectors and Σ_1, Σ_2 their covariance matrices. For symmetry, the final GMM distance formula symmetrizes the dispersion, with A and B denoting the two GMM models.
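A sketch of this distance under stated assumptions: the source omits the dispersion and Mahalanobis formulas, so this uses the common pooled-covariance Mahalanobis form between components, weights the component distances by W_Ai·W_Bj, and symmetrizes by averaging the two directed dispersions. Each of these three choices is an assumption, not the patent's exact definition:

```python
import numpy as np

def mahalanobis_gauss(mu1, cov1, mu2, cov2):
    """Mahalanobis-style distance between two Gaussian components.

    Assumed form: pooled covariance (Sigma1 + Sigma2) / 2, a common
    choice; the patent's exact formula is not reproduced in the source.
    """
    diff = np.asarray(mu1) - np.asarray(mu2)
    pooled = (np.asarray(cov1) + np.asarray(cov2)) / 2.0
    return float(np.sqrt(diff @ np.linalg.inv(pooled) @ diff))

def gmm_dispersion(weights_a, comps_a, weights_b, comps_b):
    """disp(A|B): component distances averaged with weights W_Ai*W_Bj
    (assumed aggregation; comps are lists of (mean, cov) pairs)."""
    total = 0.0
    for wa, (mu_a, cov_a) in zip(weights_a, comps_a):
        for wb, (mu_b, cov_b) in zip(weights_b, comps_b):
            total += wa * wb * mahalanobis_gauss(mu_a, cov_a, mu_b, cov_b)
    return total

def gmm_distance(weights_a, comps_a, weights_b, comps_b):
    """Symmetrized metric: average of the two directed dispersions."""
    d_ab = gmm_dispersion(weights_a, comps_a, weights_b, comps_b)
    d_ba = gmm_dispersion(weights_b, comps_b, weights_a, comps_a)
    return 0.5 * (d_ab + d_ba)
```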
S2402, compute the distance of all samples to the centers and iterate until the preset stop condition is met:
1) compute the distance between each remaining GMM and the centers, and assign each GMM to its nearest center;
2) update the centers: within each class, take the point with the smallest sum of distances to all points in the class as the new center;
3) iterate until the preset stop condition is met (output when, in the obtained clustering, the class with the most voice segments contains more than 40% of the total segments and more segments than the second-largest class) or the predetermined number of iterations is reached.
S2403, repeat S2401 and S2402 N times in total to obtain N candidate teacher-voice partitions, and select the most satisfactory partition under a set rule as the initial teacher voice.
The S2403 iterations yield N teacher category vectors; their pairwise similarities are computed, and the vector whose summed similarity to the remaining N-1 vectors is maximal is taken as the initial teacher class of the final clustering. Because the N teacher category vectors are not of uniform length, they must be processed to equal length before similarity is computed.
In this embodiment, the vectors are made equal in length by zero padding: the longest of the N teacher category vectors is denoted M, all vectors are extended to length M, and the shortfall is filled with 0 elements, that is:

M = max(length(T_1), length(T_2), ..., length(T_N))
T_i = [T_i, Append_i], i = 1, 2, ..., N
Append_i = zeros(1, M - length(T_i)), i = 1, 2, ..., N

where T_1, T_2, ..., T_N are the N teacher category vectors, M is the longest vector length, length(T) gives the length of vector T, Append_i is the all-zero vector appended to the i-th teacher category vector, and zeros(i, j) forms a zero vector of i rows and j columns.
In this embodiment, zero padding gives the teacher category vectors a unified length, after which the pairwise distances between vectors are computed. Because 0 elements have been artificially added, distance-based similarity measures such as the Euclidean distance would carry a large error; therefore the cosine similarity is chosen to measure the similarity between vectors. Cosine similarity expresses the similarity of two vectors by the cosine of the angle between them in the vector space: the closer the cosine is to 1, the closer the angle is to 0 degrees and the more similar the vectors. With a = (a_1, a_2, ..., a_N) and b = (b_1, b_2, ..., b_N) each an N-dimensional vector, the cosine similarity between a and b is defined as

cos(a, b) = (a · b) / (|a| |b|)
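The zero padding and cosine comparison above, together with the S53 rule of keeping the run most similar to the other N-1 runs, can be sketched as:

```python
import numpy as np

def pad_to_max(vectors):
    """Zero-pad all teacher category vectors to the longest length M:
    T_i = [T_i, Append_i], Append_i = zeros(1, M - length(T_i))."""
    M = max(len(v) for v in vectors)
    return [np.concatenate([np.asarray(v, float), np.zeros(M - len(v))])
            for v in vectors]

def cosine_similarity(a, b):
    """cos(a, b) = a.b / (|a||b|); closer to 1 means more similar."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_run(vectors):
    """Keep the run whose vector has the largest summed cosine
    similarity to the other N-1 runs (rule S53)."""
    padded = pad_to_max(vectors)
    scores = [sum(cosine_similarity(padded[i], padded[j])
                  for j in range(len(padded)) if j != i)
              for i in range(len(padded))]
    return int(np.argmax(scores))
```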
In S2404, segments are randomly selected from the teacher class, where M is the number of segments the clustering placed in the teacher class; random sampling reduces the time of training the GMM on all segments of the teacher class. N is a constant adapted to the size of M; in its defining formula, α is a time adjustment parameter that tunes the number of segments used for GMM training, and this embodiment takes α = 2. length(C) is the total number of segments after the original classroom recording is split into 30 s pieces, and the coefficient 0.4·length(C) gives the minimum number of teacher voice segments. The formula expresses that the larger the teacher class obtained by clustering, the smaller the proportion of it used for GMM training, so that the number of segments used for GMM training tends to be similar across different recordings.
The similarity threshold is set to S/γ, where S is the mean in-class similarity of the teacher-class segments and γ is an adaptive adjustment parameter that guarantees the integrity of the teacher class to the greatest extent. In its defining formula, β is an adjustment parameter in the range [0, 1], and this embodiment takes β = 1/5. S_max and S_min denote the maximum and minimum in-class similarity of the teacher class, length(C) is the total number of 30 s segments of the original classroom recording, and M is the number of segments in the teacher class. The formula expresses that the larger M is, the larger γ and hence the smaller the similarity threshold; and when the range of in-class similarity is larger, a smaller threshold is taken, so that the decision on whether the remaining segments are teacher speech is more accurate.
Through the GMM-Kmeans processing, a relatively stable teacher category vector is finally obtained. Compared in tests with manually divided classes, the obtained teacher class has high similarity to the manually labeled one; and compared with the result of clustering directly with an improved K-means, the GMM-Kmeans algorithm used in this embodiment improves clustering accuracy markedly.
After the teacher class is obtained, the silence and overlapped-speech portions are judged. Because the student class has no specific features and the number of students is unknown, the student class cannot be detected first. This embodiment detects the teacher class preferentially; after excluding the three parts above (teacher, silence, and overlapped-speech segments), the remaining segments are labeled as the student-speech class.
Referring to Fig. 5, the specific speech separation steps are:
S310, overlapped speech detection, obtaining the overlapped segments in the classroom voice;
S320, judging whether the overlapped voice contains teacher speech;
S330, selecting the voice segments closest to the overlapped voice as training segments;
S340, designing a CNMF+JADE method to perform speech separation.
In S310, the overlapped voice segments are obtained from silent points. It has been found that silent frames have lower energy than non-silent frames, so silent frames can be judged by setting an energy threshold. The energy threshold is defined as follows:
Wherein E_i denotes the energy of the i-th speech frame, N is the total number of frames in the voice segment, r is a constant in the range (0, 1), and ⌈·⌉ denotes rounding up.
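The threshold formula itself appears only as a figure in the original, so the sketch below assumes the threshold is r times the mean frame energy, with E_i, N and r as defined above; the function name and the exact threshold form are illustrative assumptions.

```python
import numpy as np

def silent_frame_mask(frames, r=0.3):
    """Flag silent frames by short-time energy. Assumption: the energy
    threshold (given only as an image in the source) is taken as
    r * mean frame energy, with r in (0, 1) as the text requires."""
    frames = np.asarray(frames, dtype=float)
    energy = np.sum(frames ** 2, axis=1)   # E_i: energy of the i-th frame
    threshold = r * energy.mean()
    return energy < threshold              # True where the frame is judged silent
```

A frame matrix of shape (N, samples_per_frame) goes in; a boolean mask of the N frames comes out, from which the number of silent frames per segment can be counted.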
Overlapped speech means that two or more people speak simultaneously in one segment of speech. In a real classroom, overlapped speech mainly appears when: (1) students hold group discussions; (2) several students answer the teacher's question at the same time. Overlapped voice segments behave differently from silent segments in terms of their silent frames. It has been found that, within a voice segment, the longer the silent duration, the lower the probability that the segment contains overlapped speech [56]. In connection with the problem handled in this embodiment, the potentially overlapped speech class can therefore be determined from the number of silent frames. The method for obtaining the potentially overlapped speech class is similar to that for obtaining the potentially silent class, as follows:
ClassOfOverlap_i = I(numberOfSilence_i < Threshold_o), i = 1, 2, ..., N
Wherein α′ is a constant used to obtain the overlapped-speech decision threshold Threshold_o; this embodiment takes α′ = 0.6. Segments whose number of silent frames is less than the threshold Threshold_o are regarded as potentially overlapped voice segments, from which the potentially overlapped speech class is obtained.
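The rule above can be sketched directly. The exact formula for Threshold_o is not reproduced in the text (only the constant α′ = 0.6 is given), so as an assumption the sketch takes Threshold_o as α′ times the mean silent-frame count over all segments.

```python
import numpy as np

def overlap_candidates(num_silent, alpha=0.6):
    """Mark potentially-overlapped segments: indicator
    I(numberOfSilence_i < Threshold_o). Assumption: Threshold_o is
    alpha' * mean silent-frame count (the source gives only alpha' = 0.6)."""
    num_silent = np.asarray(num_silent, dtype=float)
    threshold_o = alpha * num_silent.mean()
    return num_silent < threshold_o
```

Segments with few silent frames (speech with almost no pauses, typical of simultaneous speakers) are flagged; quiet, well-paused segments are not.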
S320 and S330 together form the speech separation front-end processing, which has two purposes: judging whether the overlapped speech contains the target speaker, and finding the voice segments (other than the target speaker's) closest to the overlapped speech to serve as CNMF training data. The present invention judges whether the overlapped speech contains the teacher based on GMM similarity. The similarity is calculated with an improved Bhattacharyya distance, and the decision rule is as follows:
Wherein disp(A, B) denotes the distance between the GMM models of voice segments A and B; A denotes an overlapped voice segment and B the teacher voice segment. t is an adaptive threshold, calculated as follows:
Wherein p is an adjustment parameter with a value in [0.5, 0.8], K is the number of student voice segments, S_i is the i-th student voice segment, and B is the teacher voice segment. An adaptive threshold computed from the student segments thus decides whether the overlapped speech contains the teacher.
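The closed form of disp(·,·) between full GMMs is not spelled out in the text; as a simplified stand-in, the sketch below uses the classical Bhattacharyya distance between two single Gaussians, and assumes the adaptive threshold t is p times the mean student-to-teacher distance (the actual threshold formula appears only as a figure). Both are labelled assumptions in the comments.

```python
import numpy as np

def bhattacharyya_gauss(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two Gaussians -- a single-component
    stand-in for the GMM-to-GMM distance disp(A, B) used in the text."""
    mu1, mu2 = np.asarray(mu1, float), np.asarray(mu2, float)
    cov1 = np.atleast_2d(cov1).astype(float)
    cov2 = np.atleast_2d(cov2).astype(float)
    cov = (cov1 + cov2) / 2.0
    diff = mu1 - mu2
    term1 = 0.125 * diff @ np.linalg.solve(cov, diff)
    term2 = 0.5 * np.log(np.linalg.det(cov) /
                         np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return term1 + term2

def contains_teacher(d_overlap_teacher, d_students_teacher, p=0.65):
    """Decide whether an overlapped segment contains the teacher.
    Assumption: t = p * mean student-to-teacher distance, with the
    adjustment parameter p in [0.5, 0.8] as the text states."""
    t = p * float(np.mean(d_students_teacher))
    return d_overlap_teacher < t
```

If the overlap-to-teacher distance falls below the student-derived threshold, the overlapped segment is judged to contain the teacher.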
The second task of the separation front end is to select the non-teacher voice segments closest to the overlapped voice segments for CNMF training; this step has a large impact on the subsequent speech separation. The present invention trains CNMF with the teacher voice segments together with the non-teacher voice segments closest to the overlapped speech, selected as follows:
v_i = argmin_j disp(A_i, S_j), i = 1, 2, ..., N, j = 1, 2, ..., K
Wherein A_i is the i-th overlapped voice segment and v_i is the corresponding selected training voice segment.
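Given an N × K matrix of precomputed distances disp(A_i, S_j), the selection is a row-wise argmin; a minimal sketch (the matrix layout is an assumption for illustration):

```python
import numpy as np

def pick_training_segments(dist_overlap_student):
    """For each overlapped segment A_i, pick the non-teacher segment S_j
    minimising disp(A_i, S_j): v_i = argmin_j disp(A_i, S_j).
    `dist_overlap_student` is an N x K matrix of precomputed distances."""
    return np.argmin(np.asarray(dist_overlap_student), axis=1)
```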
In S340, speech separation is performed on the overlapped speech containing the teacher. The present invention proposes a single-channel speech separation method fusing CNMF and JADE, in which JADE performs a secondary separation on the speech signal already separated by CNMF. The CNMF+JADE algorithm aims to obtain the separated speech signal of every speaker in the single-channel mixed speech; its steps are as follows:
Input: clean speech of the speakers to be separated t_1, t_2, ..., t_N; mixed speech to be trained o_1, o_2, ..., o_{N-1}; mixed speech to be separated O.
Output: separated speaker speech s_1, s_2, ..., s_N.
Step 1: select the target speaker t_1 and the corresponding mixed speech o_1 to train CNMF.
Step 2: separate the mixed speech O with CNMF, obtaining the estimated target speech and the estimated residual mixture.
Step 3: generate a random matrix R_1 and mix the two CNMF outputs to form the two-channel signal S_1.
Step 4: separate S_1 with JADE, obtaining s_1 and O_1.
Step 5: take O_1 as the mixed speech to be separated, t_2, ..., t_N as the clean speech of the speakers to be separated and o_2, ..., o_{N-1} as the mixed speech to be trained, and repeat Steps 1–5.
Step 6: obtain the separated speech s_1, s_2, ..., s_N.
In the above algorithm, t_1, t_2, ..., t_N denote the clean speech of the speakers contained in the mixed speech O, and N denotes the number of speakers in O. o_1, o_2, ..., o_{N-1} are the mixed speech obtained by successively removing the corresponding speaker from O, expressed as follows:
In realistic situations it is extremely difficult to obtain o_1, o_2, ..., o_{N-1}; therefore a randomly chosen non-target speaker's voice in the current mixed speech can be used as a substitute to train CNMF. Experiments verify that this substitution degrades performance slightly compared with the original CNMF+JADE, but it generalises the CNMF+JADE algorithm to more general situations.
The two-channel speech signal in Step 3 is generated as follows:
Wherein R_i is a 2 × 2 matrix.
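Step 3 can be sketched directly: stack the two CNMF outputs and mix them with a random 2 × 2 matrix R_i, producing the pseudo-two-channel signal that JADE then separates. JADE itself is not implemented here; the function name and random-matrix distribution are illustrative assumptions.

```python
import numpy as np

def make_two_channel(t_hat, o_hat, seed=0):
    """Step 3 of CNMF+JADE: mix the two CNMF outputs (estimated target
    speech and estimated residual mixture) with a random 2x2 matrix R_i
    to form a two-channel signal S_i for JADE. The uniform distribution
    of R_i is an assumption; the text only says 'random matrix'."""
    rng = np.random.default_rng(seed)
    R = rng.random((2, 2))                 # random 2x2 mixing matrix R_i
    sources = np.vstack([t_hat, o_hat])    # 2 x T source matrix
    return R @ sources                     # 2 x T two-channel signal S_i
```

Feeding an artificially re-mixed pair to JADE is what turns the single-channel problem into one a two-channel blind source separation algorithm can attack.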
As shown in Fig. 6, the specific speech enhancement steps are as follows:
S410: the speech enhancement data are the teacher speech after speech separation;
S420: adaptively judge the teacher speech after separation and choose suitable voice segments for enhancement;
S430: perform speech enhancement using the wavelet transform.
The wavelet transform has been a research hotspot in speech processing in recent years. Compared with traditional frequency-domain analysis methods such as the Fourier transform, the wavelet transform can describe the time and frequency behaviour of a signal simultaneously; it is a time-frequency analysis method featuring multi-resolution analysis, time-frequency localisation and flexible choice of the wavelet function. Its principle is explained below.
Let L²(R) be the space of square-integrable functions and ψ(t) ∈ L²(R). If its Fourier transform ψ̂(ω) satisfies the admissibility condition
∫_R |ψ̂(ω)|² / |ω| dω < ∞,
then ψ(t) is called a wavelet, or mother wavelet.
Scaling and translating the mother wavelet ψ(t) by a real pair (a, b), with a, b ∈ R and a ≠ 0, yields the family of functions
ψ_{a,b}(t) = |a|^(−1/2) ψ((t − b)/a).
This family is called the wavelet basis; a is the scale factor and b the shift factor, and ψ_{a,b} acts as a window function whose size is fixed but whose shape varies. This property gives the wavelet transform its multi-resolution character. The factor |a|^(−1/2) is the normalisation factor, whose role is to give the wavelet the same energy at different scales.
Signal processing in the wavelet domain is one of the main current means of speech signal processing. The multi-resolution, low-entropy and decorrelation properties of the wavelet transform give it great advantages in speech processing, and the large number of available wavelet bases can cope with different scenarios, so the wavelet transform is well suited to speech signal processing.
When using the wavelet transform for speech enhancement, the multi-resolution property is exploited: according to the different features that the wavelet coefficients of noise and of speech exhibit at different scales, corresponding rules are formulated to process the wavelet coefficients of the noisy signal.
The key steps of wavelet denoising are as follows:
Step 1: apply the wavelet transform to the noisy signal;
Step 2: denoise the wavelet coefficients at each scale;
Step 3: apply the inverse wavelet transform to the processed coefficients to obtain the enhanced reconstructed signal.
Wavelet denoising methods can be roughly divided into three types: denoising based on the wavelet modulus maxima principle; denoising using the correlation of wavelet coefficients across scales; and wavelet threshold denoising. This embodiment mainly uses the third type, wavelet threshold denoising.
Wavelet threshold denoising is one of the more commonly used denoising methods. Its basic process is as follows:
Step 1: select a suitable wavelet basis for the signal to be processed, determine a reasonable decomposition level and perform a multi-level decomposition of the noisy speech signal;
Step 2: select a suitable threshold for the decomposed wavelet coefficients at each scale and quantise them;
Step 3: perform wavelet reconstruction from the thresholded coefficients to obtain the enhanced speech signal.
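The three steps above can be sketched end to end. The sketch uses a plain Haar wavelet rather than whichever basis the embodiment actually selects, and a fixed soft threshold; both are simplifying assumptions for illustration.

```python
import numpy as np

def haar_denoise(x, lam, levels=2):
    """Minimal sketch of the three-step wavelet-threshold denoising
    pipeline with a Haar transform: decompose, soft-threshold the detail
    coefficients, reconstruct. Assumes len(x) is divisible by 2**levels."""
    s2 = np.sqrt(2.0)
    details, a = [], np.asarray(x, float)
    for _ in range(levels):                  # Step 1: forward Haar transform
        d = (a[0::2] - a[1::2]) / s2
        a = (a[0::2] + a[1::2]) / s2
        details.append(d)
    details = [np.sign(d) * np.maximum(np.abs(d) - lam, 0.0)  # Step 2: soft threshold
               for d in details]
    for d in reversed(details):              # Step 3: inverse transform
        out = np.empty(2 * len(a))
        out[0::2] = (a + d) / s2
        out[1::2] = (a - d) / s2
        a = out
    return a
```

With lam = 0 the pipeline reduces to forward plus inverse transform and reconstructs the input exactly, a useful sanity check before any thresholding is applied.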
The diversity of wavelet bases is one of the advantages of the wavelet transform for time-frequency analysis, so choosing a suitable wavelet function is essential. Studies have shown that, for speech signal processing, handling the transient variations of the speech signal favours wavelet basis functions with good smoothness and symmetry and lower vanishing moments.
The wavelet decomposition level has always drawn attention as a factor influencing the denoising effect of speech enhancement algorithms. As the decomposition level increases, the detail of the speech and noise signals becomes clearer, which favours denoising; but with further increases the speech energy becomes increasingly dispersed, causing distortion, and the algorithm also runs more and more slowly, while too few levels leave the signal and noise aliased so that the noise cannot be isolated. Through extensive experimental analysis, researchers have determined the most reasonable decomposition level, expressed in terms of the data length N, where ⌊·⌋ denotes rounding down.
In wavelet threshold denoising, the estimation of the threshold is one of the important factors determining the denoising effect. The wavelet transform decomposes the noisy speech signal into a high-frequency detail part and a low-frequency approximation part. The frequency of noise is usually higher, so noise energy concentrates mainly in the high-frequency wavelet coefficients, while speech energy concentrates mainly in the low-frequency part of the signal. Denoising can therefore be performed by setting a threshold and truncating the noise components below it; this threshold is exactly what wavelet threshold denoising studies.
Classical methods for selecting the wavelet denoising threshold include the following.
Uniform (universal) threshold method.
The uniform threshold estimate is derived from the minimum mean-square-error criterion and may be expressed as
λ = σ_n √(2 ln N)
where σ_n is the standard deviation of the noise and N is the signal length. The noise deviation is obtained by
σ_n = M_j / 0.6745
where M_j is the absolute median of the wavelet coefficients at each decomposition level and 0.6745 is an empirical value.
This method is simple to implement and works well for filtering Gaussian white noise, but since it depends on the speech length, the effect deteriorates when the amount of data is very large.
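The two formulas above combine into a few lines. The universal-threshold form λ = σ√(2 ln N) is the standard one consistent with the description; the σ_n = M_j/0.6745 estimator follows the text.

```python
import numpy as np

def universal_threshold(detail_coeffs, n):
    """Uniform (universal) threshold: lambda = sigma * sqrt(2 ln N),
    with the noise level estimated from the median absolute detail
    coefficient, sigma = M_j / 0.6745, as in the text."""
    sigma = np.median(np.abs(np.asarray(detail_coeffs, float))) / 0.6745
    return sigma * np.sqrt(2.0 * np.log(n))
```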
SUREShrink threshold [73]
SUREShrink threshold estimation is an adaptive threshold selection method and is an unbiased estimate of the optimal threshold. The threshold selection is defined by Stein's unbiased risk estimate,
SURE(t; Y) = N − 2·#{i : |Y_i| ≤ t} + Σ_i min(Y_i², t²),
and the threshold estimate is the value that minimises this risk function:
t* = argmin_t SURE(t; Y),
where N is the signal length and #{i : |Y_i| ≤ t} denotes the number of elements in the set {Y_i : |Y_i| ≤ t}.
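The risk function above is piecewise between the sorted |Y_i|, so it suffices to evaluate it at the candidate thresholds t = |Y_i| and take the minimiser; a sketch of that standard form:

```python
import numpy as np

def sure_threshold(coeffs):
    """SUREShrink sketch: pick the threshold minimising
    SURE(t) = N - 2*#{|Y_i| <= t} + sum_i min(Y_i^2, t^2),
    evaluated at the candidate thresholds t = |Y_i|."""
    y = np.abs(np.asarray(coeffs, float))
    n = len(y)
    candidates = np.sort(y)
    risks = [n - 2 * np.sum(y <= t) + np.sum(np.minimum(y ** 2, t ** 2))
             for t in candidates]
    return candidates[int(np.argmin(risks))]
```

For coefficients [0.1, 0.2, 5.0] the minimiser is 0.2: the two small (noise-like) coefficients fall below the threshold while the large one survives.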
Minimax threshold
The minimax threshold, also called the minimum-maximum threshold, yields an extremum of the minimum mean-square error. It is calculated as
λ = σ_n (0.3936 + 0.1829 log₂ N) for N > 32, and λ = 0 otherwise,
where N is the signal length.
In addition, the threshold function, like the threshold estimate, plays a vital role in the wavelet threshold denoising algorithm. Common threshold functions are as follows.
Hard threshold function
The hard threshold function is
ŵ_{j,k} = w_{j,k} if |w_{j,k}| ≥ λ, and ŵ_{j,k} = 0 otherwise,
where ŵ_{j,k} is the estimated wavelet coefficient, w_{j,k} the decomposed wavelet coefficient and λ the denoising threshold. The principle of hard-threshold denoising is to compare w_{j,k} with λ: coefficients below λ are zeroed and coefficients above λ are retained. Such processing may introduce oscillations into the reconstructed signal and degrade the denoising effect.
Soft threshold function
To eliminate the drawbacks of hard-threshold denoising, soft-threshold denoising was introduced, of the form
ŵ_{j,k} = sgn(w_{j,k})(|w_{j,k}| − λ) if |w_{j,k}| ≥ λ, and ŵ_{j,k} = 0 otherwise.
Compared with the hard threshold function, the soft threshold function improves the smoothness of the speech signal, but it likewise loses some features and causes a degree of distortion [74].
Semi-soft threshold function
To overcome the defects of the soft and hard threshold functions, scholars proposed the semi-soft threshold function:
ŵ_{j,k} = 0 if |w_{j,k}| ≤ λ₁; ŵ_{j,k} = sgn(w_{j,k})·λ₂(|w_{j,k}| − λ₁)/(λ₂ − λ₁) if λ₁ < |w_{j,k}| ≤ λ₂; ŵ_{j,k} = w_{j,k} if |w_{j,k}| > λ₂,
where λ₁ and λ₂ are the lower and upper thresholds respectively, with 0 < λ₁ < λ₂ and an empirical relation between them. The value of λ₁ is related to the speech: when unvoiced sounds dominate, λ₁ is smaller; when voiced sounds dominate, λ₁ is larger. By adjusting λ₁ and λ₂ this method combines the advantages of the soft and hard thresholds, but the two parameters increase the computational complexity of the algorithm.
Garrote threshold function
The garrote threshold function is
ŵ_{j,k} = w_{j,k} − λ²/w_{j,k} if |w_{j,k}| > λ, and ŵ_{j,k} = 0 otherwise.
This function incorporates the threshold into the threshold function itself, dynamically shrinking the wavelet coefficients that exceed the selected threshold.
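The four threshold functions named above have standard textbook forms; since the original formula images are not reproduced in the text, the sketch below implements those standard forms as an assumption.

```python
import numpy as np

def hard_threshold(w, lam):
    """Hard: keep coefficients with |w| >= lambda, zero the rest."""
    return np.where(np.abs(w) >= lam, w, 0.0)

def soft_threshold(w, lam):
    """Soft: shrink surviving coefficients toward zero by lambda."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def semisoft_threshold(w, lam1, lam2):
    """Semi-soft: zero below lam1, keep above lam2, interpolate between."""
    aw = np.abs(w)
    mid = np.sign(w) * lam2 * (aw - lam1) / (lam2 - lam1)
    return np.where(aw <= lam1, 0.0, np.where(aw > lam2, w, mid))

def garrote_threshold(w, lam):
    """Garrote: w - lambda^2 / w for |w| > lambda, else zero."""
    w = np.asarray(w, float)
    safe = np.where(w == 0, 1.0, w)           # avoid division by zero
    return np.where(np.abs(w) > lam, w - lam ** 2 / safe, 0.0)
```

Comparing them on the same coefficient shows the trade-off the text describes: hard keeps the value unchanged, soft shrinks it by λ, semi-soft and garrote sit in between.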
The present invention designs an adaptive method that analyses, before speech enhancement, the wavelet-transformed speech signal separated by CNMF+JADE, in the hope of performing speech enhancement selectively, i.e. automatically filtering out, before enhancement, those voice segments whose quality would decline after enhancement. Analysis of the separated signals and of the sound effect after the wavelet transform shows that when the distance between the separated voices is large, the wavelet-transform enhancement effect declines. Based on this finding, the present invention applies the following adaptive judgement before speech enhancement, for i = 1, 2, ..., N:
O_{i−1} = O_i + s_i + l
O_0 = O
O_N = s_N
Wherein s_i denotes the i-th teacher speech signal after CNMF+JADE decomposition, and O_i denotes the mixed speech signal after O_{i−1} has had s_i separated out by CNMF+JADE; the corresponding GMMs of s_i and O_i are used in the distance computation. l denotes the loss in the separation process, N the number of speakers contained in the mixed signal, disp(·) the GMM distance formula given above, and p a scale factor with value in [1, 1.2].
1. Based on the complex situation of classroom teaching, the present invention designs a teacher speech extraction method and derives the application category of the information-based classroom. It is not only an important component of the smart classroom (artificial intelligence + education) but a completely new embodiment of future education. To our knowledge, research of this type is currently scarce, and essentially no usable framework or theory has yet formed. The present invention takes a major step in smart classroom research and opens a new prospect for teaching methodology based on artificial intelligence.
2. The present invention identifies and extracts classroom teacher speech in a single-channel, adaptive, unsupervised manner. Compared with existing methods it requires no prior knowledge, and it adapts well to classroom speech of different forms, different lengths and different classroom environments. Moreover, the proposed method applies not only to classroom teaching but also to fields such as meetings, hearing aids and communications (for example, combining speech separation with a hearing aid gives the hearing aid more powerful signal processing and improves its voice quality; in mobile communications, applying speech separation at the device end suppresses non-target speakers and improves speech quality and intelligibility).
3. The present invention designs and implements an improved GMM-Kmeans clustering method that clusters with GMM models as features, preserving the original features to the greatest extent and improving clustering accuracy. Using GMMs as features and computing distances between them avoids directly processing deep levels of the speech signal, shortening the algorithm processing time and generally realising classroom speech recognition that is both accurate and fast.
4. On the basis of the GMM-Kmeans clustering algorithm, the influence of the environment is considered: based on the clustering result, suitable voice segments are adaptively chosen to construct the GGMM model, the similarity threshold is obtained adaptively, and teacher utterances are detected a second time, yielding an accurate teacher speech class. All thresholds are obtained adaptively from the classroom speech data by the designed formulas, without manual intervention, so the algorithm is highly robust to different classroom environments and classroom situations.
5. The present invention designs and implements a CNMF+JADE speech separation algorithm that applies JADE to the CNMF separation result for a secondary speech separation, effectively improving the separation result.
6. The present invention designs and implements an adaptive wavelet-transform speech enhancement method that adaptively judges the speech after CNMF+JADE separation and filters out the voice segments unsuited to further enhancement, denoising the speech signal purposefully.
The above discloses only a preferred embodiment of the present invention, which of course cannot limit the scope of the claims; therefore equivalent changes made in accordance with the claims of the present invention still fall within the scope of the present invention.
Claims (10)
1. A single-channel, unsupervised target speaker speech extraction method, characterized by comprising a teacher utterance detection step and a teacher utterance GGMM model training step;
the teacher utterance detection step comprises the following steps:
S1: recording the classroom to obtain voice data;
S2: performing speech signal processing;
S3: speech segmentation and modeling, the speech segmentation comprising dividing the classroom speech into segments of equal length, then extracting the corresponding MFCC features for each voice segment and constructing the GMM model of each voice segment based on the MFCC features;
S4: teacher speech detection, performing a similarity calculation between the GMM model of each voice segment outside the teacher utterance class and the GGMM, setting an adaptive threshold, and labelling the segments below the threshold as the class, thereby obtaining the final teacher utterance class;
the teacher utterance GGMM model training step comprises the following steps:
S5: performing clustering processing on the voice data obtained in S3, obtaining the initial teacher utterance class, and extracting the GGMM model based on the initial teacher utterance class.
2. The single-channel, unsupervised target speaker speech extraction method according to claim 1, characterized in that the clustering processing comprises the following steps:
S51: choosing cluster centre points;
S52: calculating the distances between all samples and the centre points, and iterating until a preset halt condition is met;
S53: executing steps S51 and S52 in a loop N times in total to obtain N teacher-speech partitions, and selecting the most satisfactory partition according to a set rule as the initial teacher speech;
S54: selecting several segments from the partition to train the GGMM model, and calculating the average within-class distance;
S55: according to the GGMM and the average distance, making a secondary judgement on the remaining speech sample segments: if the distance is less than a set threshold, the sample is added to the teacher class;
S56: outputting all teacher speech samples and writing them to the database.
3. The single-channel, unsupervised target speaker speech extraction method according to claim 2, characterized in that step S51 specifically comprises:
S511: randomly selecting one of all voice segments as the first centre point;
S512: calculating the GMM model distances between the remaining voice segments and the first centre point, and selecting the voice segment with the maximum distance as the second centre point;
S513: successively calculating the distances between the unselected voice segments and the centre points, and selecting the segment farthest from the centre points as the next centre point;
S514: iterating until the number of centre points reaches the specified number of classes.
4. The single-channel, unsupervised target speaker speech extraction method according to claim 3, characterized in that step S52 specifically comprises:
S521: calculating the distances between the remaining GMM models and the centre points, and assigning each GMM to the nearest centre point;
S522: updating the centre points, taking in each class the point with the smallest sum of distances to all points in the class as the new centre point;
S523: iterating until a preset stop condition is met or a predetermined number of iterations is reached.
5. The single-channel, unsupervised target speaker speech extraction method according to claim 4, characterized in that step S53 specifically comprises: iterating to obtain N teacher class vectors, performing similarity calculation, and taking the vector with the maximum sum of similarities to the remaining N−1 vectors as the initial teacher class of the final clustering.
6. The single-channel, unsupervised target speaker speech extraction method according to claim 5, characterized in that step S54 specifically comprises: randomly selecting segments from the teacher class, wherein M is the number of voice segments in the teacher class obtained by clustering; the purpose of the random selection is to reduce the time of GMM model training over all voice segments in the teacher class; N is a constant obtained adaptively from the size of M, obtained as follows:
wherein α is a time adjustment parameter for adjusting the number of voice segments used for GMM training, length(C) denotes the total number of voice segments obtained after the original classroom speech is segmented, and the coefficient 0.4·length(C) denotes the minimum number of teacher voice segments.
7. The single-channel, unsupervised target speaker speech extraction method according to claim 1, characterized in that S3 comprises:
S31: overlapped speech detection, obtaining the overlapped voice segments in the classroom speech;
S32: judging whether the overlapped speech contains the teacher's voice;
S33: selecting the voice segments closest to the overlapped speech as training voice segments;
S34: performing speech separation with the designed CNMF+JADE method.
8. The single-channel, unsupervised target speaker speech extraction method according to claim 7, characterized in that S31 comprises:
obtaining the overlapped voice segments using silent points, and judging silent frames by setting an energy threshold, the energy threshold being obtained as follows:
wherein E_i denotes the energy of the i-th speech frame, N is the total number of frames in the voice segment, r is a constant in the range (0, 1), and ⌈·⌉ denotes rounding up.
9. The single-channel, unsupervised target speaker speech extraction method according to claim 8, characterized in that S32 comprises: judging whether the overlapped speech contains the teacher using GMM similarity, the similarity being obtained with an improved Bhattacharyya distance, the decision rule being as follows:
wherein disp(A, B) denotes the distance between the GMM models of voice segments A and B, A denotes an overlapped voice segment, B is the teacher voice segment, and t is an adaptive threshold calculated as follows:
wherein p is an adjustment parameter with a value in [0.5, 0.8], K is the number of student voice segments, S_i is the i-th student voice segment, and B is the teacher voice segment.
10. The single-channel, unsupervised target speaker speech extraction method according to claim 9, characterized in that S33 comprises: selecting the non-teacher voice segments closest to the overlapped speech to train CNMF together with the teacher voice segments, the selection being:
v_i = argmin_j disp(A_i, S_j), i = 1, 2, ..., N, j = 1, 2, ..., K
wherein A_i is the i-th overlapped voice segment and v_i is the corresponding selected training voice segment.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810832080.5A CN108962229B (en) | 2018-07-26 | 2018-07-26 | Single-channel and unsupervised target speaker voice extraction method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108962229A true CN108962229A (en) | 2018-12-07 |
CN108962229B CN108962229B (en) | 2020-11-13 |
Family
ID=64464209
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810832080.5A Active CN108962229B (en) | 2018-07-26 | 2018-07-26 | Single-channel and unsupervised target speaker voice extraction method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108962229B (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050125223A1 (en) * | 2003-12-05 | 2005-06-09 | Ajay Divakaran | Audio-visual highlights detection using coupled hidden markov models |
CN101866421A (en) * | 2010-01-08 | 2010-10-20 | 苏州市职业大学 | Method for extracting characteristic of natural image based on dispersion-constrained non-negative sparse coding |
CN102568477A (en) * | 2010-12-29 | 2012-07-11 | 盛乐信息技术(上海)有限公司 | Semi-supervised pronunciation model modeling system and method |
CN102682760A (en) * | 2011-03-07 | 2012-09-19 | 株式会社理光 | Overlapped voice detection method and system |
CN103680517A (en) * | 2013-11-20 | 2014-03-26 | 华为技术有限公司 | Method, device and equipment for processing audio signals |
CN103854644A (en) * | 2012-12-05 | 2014-06-11 | 中国传媒大学 | Automatic duplicating method and device for single track polyphonic music signals |
CN104167208A (en) * | 2014-08-08 | 2014-11-26 | 中国科学院深圳先进技术研究院 | Speaker recognition method and device |
CN105096955A (en) * | 2015-09-06 | 2015-11-25 | 广东外语外贸大学 | Speaker rapid identification method and system based on growing and clustering algorithm of models |
CN105957537A (en) * | 2016-06-20 | 2016-09-21 | 安徽大学 | Voice denoising method and system based on L1/2 sparse constraint convolution non-negative matrix decomposition |
JP2018502319A (en) * | 2015-07-07 | 2018-01-25 | 三菱電機株式会社 | Method for distinguishing one or more components of a signal |
Non-Patent Citations (2)
Title |
---|
COLIN VAZ: "CNMF-based acoustic features for noise-robust ASR", 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) * |
LI Hao: "Single-channel speech separation based on deep learning", China Master's Theses Full-text Database, Information Science and Technology * |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109948144A (en) * | 2019-01-29 | 2019-06-28 | 汕头大学 | A method of the Teachers ' Talk Intelligent treatment based on classroom instruction situation |
CN109948144B (en) * | 2019-01-29 | 2022-12-06 | 汕头大学 | Teacher utterance intelligent processing method based on classroom teaching situation |
CN110544481A (en) * | 2019-08-27 | 2019-12-06 | 华中师范大学 | S-T classification method and device based on voiceprint recognition and equipment terminal |
CN110544482B (en) * | 2019-09-09 | 2021-11-12 | 北京中科智极科技有限公司 | Single-channel voice separation system |
CN110544482A (en) * | 2019-09-09 | 2019-12-06 | 极限元(杭州)智能科技股份有限公司 | Single-channel voice separation system |
CN110874879A (en) * | 2019-10-18 | 2020-03-10 | 平安科技(深圳)有限公司 | Elderly person registration method, device, equipment and storage medium based on voice recognition |
WO2021073161A1 (en) * | 2019-10-18 | 2021-04-22 | 平安科技(深圳)有限公司 | Elderly people registration method, apparatus and device based on voice recognition, and storage medium |
CN111179962A (en) * | 2020-01-02 | 2020-05-19 | 腾讯科技(深圳)有限公司 | Training method of voice separation model, voice separation method and device |
US20220012667A1 (en) * | 2020-07-13 | 2022-01-13 | Allstate Insurance Company | Intelligent prediction systems and methods for conversational outcome modeling frameworks for sales predictions |
US11829920B2 (en) * | 2020-07-13 | 2023-11-28 | Allstate Insurance Company | Intelligent prediction systems and methods for conversational outcome modeling frameworks for sales predictions |
CN112017685A (en) * | 2020-08-27 | 2020-12-01 | 北京字节跳动网络技术有限公司 | Voice generation method, device, equipment and computer readable medium |
CN112017685B (en) * | 2020-08-27 | 2023-12-22 | 抖音视界有限公司 | Speech generation method, device, equipment and computer readable medium |
CN117577124A (en) * | 2024-01-12 | 2024-02-20 | 京东城市(北京)数字科技有限公司 | Training method, device and equipment of audio noise reduction model based on knowledge distillation |
CN117577124B (en) * | 2024-01-12 | 2024-04-16 | 京东城市(北京)数字科技有限公司 | Training method, device and equipment of audio noise reduction model based on knowledge distillation |
Also Published As
Publication number | Publication date |
---|---|
CN108962229B (en) | 2020-11-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108962229A (en) | Single-channel, unsupervised target speaker voice extraction method | |
EP3292515B1 (en) | Method for distinguishing one or more components of signal | |
Sarikaya et al. | High resolution speech feature parametrization for monophone-based stressed speech recognition | |
CN106952643A (en) | Recording device clustering method based on Gaussian mean supervector and spectral clustering |
CN109326302A (en) | Speech enhancement method based on voiceprint comparison and generative adversarial networks |
CN108899047B (en) | Masking threshold estimation method, apparatus and storage medium for audio signals |
CN109410976A (en) | Speech enhancement method for binaural hearing aids based on binaural sound source localization and deep learning |
CN108922559A (en) | Recording terminal clustering method based on speech time-frequency transform features and integer linear programming |
CN101366078A (en) | Neural network classifier for separating audio sources from a monophonic audio signal | |
CN110197665B (en) | Voice separation and tracking method for public security criminal investigation monitoring | |
CN106328123B (en) | Method for recognizing whispered speech within a normal speech stream under small-database conditions |
CN108986798B (en) | Voice data processing method, device and equipment |
CN110111797A (en) | Speaker recognition method based on Gaussian supervector and deep neural network |
CN103985381A (en) | Audio indexing method based on parameter-fusion optimized decision |
CN109346084A (en) | Speaker recognition method based on deep stacked autoencoder network |
Do et al. | Speech source separation using variational autoencoder and bandpass filter | |
Fan et al. | Utterance-level permutation invariant training with discriminative learning for single channel speech separation | |
CN110136746B (en) | Method for identifying mobile phone source in additive noise environment based on fusion features | |
CN106875944A (en) | System for voice control of a home intelligent terminal |
Fan et al. | Deep attention fusion feature for speech separation with end-to-end post-filter method | |
Qiu et al. | Self-Supervised Learning Based Phone-Fortified Speech Enhancement. | |
CN116092512A (en) | Small sample voice separation method based on data generation | |
Wang et al. | Robust speech recognition from ratio masks | |
Ravindran et al. | Improving the noise-robustness of mel-frequency cepstral coefficients for speech processing | |
CN111091847A (en) | Speech separation method based on improved deep clustering |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||