CN108962229A - Single-channel, unsupervised target speaker voice extraction method - Google Patents

Single-channel, unsupervised target speaker voice extraction method Download PDF

Info

Publication number
CN108962229A
CN108962229A (application CN201810832080.5A)
Authority
CN
China
Prior art keywords: voice, teacher, classification, segments, voice segments
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810832080.5A
Other languages
Chinese (zh)
Other versions
CN108962229B (en)
Inventor
姜大志
陈逸飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shantou University
Original Assignee
Shantou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shantou University filed Critical Shantou University
Priority to CN201810832080.5A priority Critical patent/CN108962229B/en
Publication of CN108962229A publication Critical patent/CN108962229A/en
Application granted granted Critical
Publication of CN108962229B publication Critical patent/CN108962229B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/23 - Clustering techniques
    • G06F18/232 - Non-hierarchical techniques
    • G06F18/2321 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/005 - Language recognition
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L2015/0631 - Creating reference templates; Clustering
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L2015/0635 - Training updating or merging of old and new templates; Mean values; Weighting
    • G10L2015/0636 - Threshold criteria for the updating
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L2015/0638 - Interactive procedures
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum

Abstract

An embodiment of the invention discloses a single-channel, unsupervised target speaker voice extraction method including a teacher speech detection step and a teacher speech model training step. The teacher speech detection step includes: recording the classroom to obtain voice data; preprocessing the voice signal; voice segmentation and modeling, the segmentation dividing the classroom audio into equal-length segments, then extracting the corresponding MFCC features for each segment and building a GMM for each segment from those features; and teacher speech detection, which computes the similarity between the GMM of each segment outside the teacher class and the GGMM and labels the segments below a set threshold as teacher speech, thereby obtaining the final teacher speech class. The teacher speech GGMM model training step includes: clustering the voice data obtained in S3, obtaining the initial teacher speech class, and training the GGMM from the initial teacher speech class. The invention effectively improves the adaptability and intelligence of the system in practical applications and lays a foundation for subsequent applications and research.

Description

Single-channel, unsupervised target speaker voice extraction method
Technical field
The present invention relates to a voice extraction method, and more particularly to a single-channel, unsupervised target speaker voice extraction method for complex multi-speaker situations.
Background art
The quality of education is the key guarantee of education at every level, and within it, raising the quality of teaching, and of classroom teaching in particular, should be the top priority. The traditional practice relies on manual (peer) observation and evaluation. Although such methods have some effect, they are neither universally practicable nor universally objective, for the following reasons: first, teaching administrators cannot inspect every classroom at every moment, evaluate it and give advice, and attempting to do so would impose a heavy and unnecessary burden on teaching management; second, because traditional observation cannot follow the whole teaching process, it is difficult to evaluate a teacher's teaching quality objectively.
Information and intelligent technologies have already become an important foundation of social development. How to use and develop them to reform the traditional classroom, and to build efficient, automatic "intelligent perception" for classroom teaching, naturally becomes a scientific problem of great research value.
To realize "intelligent perception" for classroom teaching, the first problem to be solved is the recognition and acquisition of teacher speech.
At present, apart from supervised speaker recognition, the main unsupervised approach to speaker recognition is speaker clustering, and teacher speech recognition in classroom audio essentially belongs to this category. Research on speaker clustering follows four main lines: 1. hierarchical clustering; 2. K-means clustering; 3. spectral clustering; 4. affinity propagation clustering.
The article "Research and Implementation of Unsupervised Speaker Clustering Methods" studied the efficiency of spectral clustering based on a feature similarity matrix and realized a spectral clustering algorithm whose model similarity matrix is built from adaptive Gaussian mixture models. Gaussian mixture models are first trained from the voice segments with the GMM-UBM-MAP technique: a background model (UBM) is trained offline, then adapted under the maximum a posteriori criterion (MAP) to obtain the target speaker's Gaussian Mixture Model (GMM). The similarities of the GMMs are then computed to build a similarity matrix, from which features are extracted for clustering to obtain the target speaker's speech portion.
The article "Improved Speaker Clustering Initialization and GMM-based Multi-Speaker Recognition" extracts the Mel-frequency cepstral coefficient (MFCC) features of the voice segments; in the training part, the initial classes are processed with the Bayesian information criterion (BIC) to obtain purer initial categories, the MFCC features are then clustered, and a GMM is trained for each class; in the recognition phase, speaker decisions are made with GMM-based speaker recognition.
Extracting teacher speech requires not only recognizing isolated teacher speech but also separating the overlapped speech that contains the teacher. The purpose of speech separation is to isolate the speech of interest from several simultaneously active sources. According to the relation between the received mixture and the source signals, speech separation divides into multi-channel and single-channel separation. Single-channel separation needs only one signal source; compared with multi-channel signals it is easier to obtain and closer to real conditions, but separating a single-channel signal is harder. Research on single-channel speech separation follows three main lines: 1. computational auditory scene analysis; 2. model-based methods; 3. time-frequency-distribution-based methods.
In the article "An Auditory Scene Analysis Approach to Monaural Speech Segregation", Hu and Wang propose a CASA-based speech separation framework. By simulating the basilar membrane of the human cochlea, the mixed signal is decomposed into a time-frequency representation and the features needed for separation are extracted; auditory time-frequency segmentation merges adjacent time-frequency units belonging to the same source into auditory segments, the segments of the same source are then grouped, and waveform synthesis of that source finally realizes the separation. Hu and Wang later made a series of improvements to the CASA system, including optimizations for separating voiced and unvoiced signals. The article "CNMF-based acoustic features for noise-robust ASR" points out that NMF is an unsupervised, dictionary-learning-based method that works well on many kinds of signal separation. The NMF algorithm requires purely additive operations, all factor matrices after decomposition are nonnegative, and matrix dimensionality reduction can be realized. As research has deepened, NMF has acquired fast and accurate computation, is highly convenient for large-scale data processing, and has therefore found wide application in many fields.
The above prior art has the following deficiencies:
1. When hierarchical clustering is used for unsupervised speaker clustering and recognition, whether the minimum between-class distance exceeds a certain threshold serves as the criterion for ending the clustering, and that threshold really limits the effectiveness of the hierarchical clustering algorithm.
2. The GMM-UBM-MAP spectral clustering with a feature similarity matrix proposed in "Research and Implementation of Unsupervised Speaker Clustering Methods" needs to train the GMM of the voice signal and cannot achieve fully unsupervised speaker recognition. In addition, the method requires the speakers' sections in the audio to be of comparable length and demands high "purity" of each speaker section, so it adapts poorly to real situations of varied forms.
3. The article "Improved Speaker Clustering Initialization and GMM-based Multi-Speaker Recognition" clusters the MFCC coefficients, and MFCC features are extracted frame by frame; for longer audio, such as a 40 min classroom recording, the amount of computation is very large and the clustering accuracy cannot be well guaranteed.
4. The article "An Auditory Scene Analysis Approach to Monaural Speech Segregation" performs CASA-based separation by simulating the human ear, but the features of the ear model are difficult to choose.
5. The article "CNMF-based acoustic features for noise-robust ASR" needs the training speech of the voice to be separated to be given in advance.
6. The influence of noise still remains in single-channel separation results, and the above separation methods seldom denoise the separation result further to purify the separated voice signal.
Summary of the invention
The technical problem to be solved by the embodiments of the invention is to provide a single-channel, unsupervised target speaker voice extraction method: to use and develop relevant information and intelligent technical means to acquire, analyze, process and recognize the classroom voice signal, and to robustly detect and extract the teacher speech portion from the classroom voice signal with a constructed adaptive, unsupervised intelligent method.
To solve the above technical problem, an embodiment of the invention provides a single-channel, unsupervised target speaker voice extraction method, including a teacher speech detection step and a teacher speech GGMM (General Gaussian Mixture Model) model training step;
the teacher speech detection step including the following steps:
S1: recording the classroom to obtain voice data;
S2: preprocessing the voice signal;
S3: voice segmentation and modeling, the segmentation dividing the classroom audio into equal-length segments, then extracting the corresponding MFCC features for each segment and building a GMM for each segment from those features;
S4: teacher speech detection, computing the similarity between the GMM of each segment outside the teacher class and the GGMM, setting an adaptive threshold, and labeling the segments below the threshold as teacher speech, thereby obtaining the final teacher speech class;
the teacher speech GGMM model training step including the following step:
S5: clustering the voice data obtained in S3, obtaining the initial teacher speech class, and training the GGMM from the initial teacher speech class.
Further, the clustering comprises the following steps:
S51: choosing the cluster center points;
S52: computing the distances of all samples to the centers, and iterating until a preset halt condition is met;
S53: repeating S51 and S52 N times to obtain N candidate teacher partitions, and selecting the most satisfactory partition as the initial teacher speech according to a set rule;
S54: selecting several segments from that partition to train the GGMM, and computing the average within-class distance;
S55: making a second judgement on the remaining speech samples according to the GGMM and the average distance, and adding a sample to the teacher class when its distance is below a set threshold;
S56: outputting all teacher speech samples and writing them to a database.
Further, S51 specifically comprises:
S511: randomly selecting one of all voice segments as the first center;
S512: computing the GMM distances between the remaining segments and the first center, and selecting the farthest segment as the second center;
S513: successively computing the distances of the unselected segments to the centers, and selecting the segment farthest from the centers as the next center;
S514: iterating until the number of centers reaches the specified number of classes.
Further, S52 specifically comprises:
S521: computing the distances of the remaining GMMs to the centers, and assigning each GMM to its nearest center;
S522: updating the centers, taking in each class the member with the smallest sum of distances to all points of the class as the new center;
S523: iterating until a preset stop condition is met or a predetermined number of iterations is reached.
Further, S53 specifically comprises: the iterations yield N teacher class vectors, whose pairwise similarities are computed; the vector with the largest sum of similarities to the other N-1 vectors is taken as the initial teacher class finally obtained by the clustering.
Further, S54 specifically comprises: randomly selecting a number of segments, determined by M and N, from the teacher class, where M is the number of voice segments the clustering placed in the teacher class; the random sampling reduces the time of training the GMM on all segments of the teacher class, and N is a constant adapted to the size of M, obtained by a formula in which α is a time adjustment parameter used to adjust the number of segments entering GMM training, length(C) denotes the total number of segments obtained after segmenting the original classroom audio, and the term 0.4*length(C) is the minimum number of teacher voice segments.
Further, S3 comprises:
S31: overlap speech detection, obtaining the overlapped voice segments in the classroom audio;
S32: judging whether the overlapped speech contains teacher speech;
S33: selecting the voice segments closest to the overlapped speech as training segments;
S34: performing speech separation with the designed CNMF+JADE method.
Further, S31 comprises: obtaining the overlapped voice segments from silent points, where silent frames are judged against an energy threshold obtained by a formula in which E_i denotes the energy of the i-th speech frame, N is the total number of frames of the voice segment, r is a constant with range (0, 1), and ⌈·⌉ denotes rounding up.
Further, S32 comprises: judging whether the overlapped speech contains the teacher from GMM similarity, the similarity using an improved Bhattacharyya distance. In the decision, disp(A, B) denotes the distance between the GMMs of voice segments A and B, A being the overlapped segment and B the teacher segment, and disp(A, B) is compared with an adaptive threshold t computed by a formula in which p is an adjustment parameter with value in [0.5, 0.8], K is the number of student voice segments, S_i is the i-th student segment and B the teacher segment.
Further, S33 comprises: selecting the non-teacher voice segment closest to the overlapped speech to train CNMF together with the teacher segment, chosen as
v_i = argmin_{S_j} disp(A_i, S_j), i = 1, 2, ..., N, j = 1, 2, ..., K
where A_i is the i-th overlapped segment and v_i the correspondingly selected training segment.
The embodiments of the invention have the following beneficial effects: facing highly complex classroom teaching (mainly the diversity of classroom situations, of teachers' subjects and of classroom organization), the invention proposes an unsupervised, adaptively robust teacher speech detection and extraction method, effectively improving the adaptability and intelligence of the system in practical applications and laying a foundation for subsequent applications and research.
Description of the drawings
Fig. 1 is a schematic diagram of the overall framework flow of the invention;
Fig. 2 is a schematic flow diagram of the teacher speech detection step;
Fig. 3 is a schematic diagram of the teacher speech GGMM model training step;
Fig. 4 is a schematic flow diagram of the clustering algorithm;
Fig. 5 is a schematic diagram of the speech separation steps;
Fig. 6 shows the speech enhancement steps.
Specific embodiment
To make the objects, technical solutions and advantages of the present invention clearer, the invention is described below in further detail with reference to the accompanying drawings.
Referring to Fig. 1, a single-channel, unsupervised target speaker voice extraction method of the invention includes a teacher speech detection step and a teacher speech GGMM model training step.
As shown in Fig. 2, teacher speech detection includes the following steps:
S110, recording;
S120, speech signal preprocessing;
S130, voice segmentation and modeling;
S140, teacher speech detection.
As shown in Fig. 3, the teacher speech GGMM model training unit includes the following steps:
S110, recording;
S120, speech signal preprocessing;
S130, voice segmentation and modeling;
S240, clustering.
In S110, the corresponding classroom voice data is obtained with a recording device.
In S120, the recorded classroom audio is preprocessed with the common speech preprocessing methods, including framing, windowing and pre-emphasis.
In S130, the classroom audio is cut into equal-length segments, the corresponding MFCC features are extracted for each segment, and a GMM is built for each segment from its MFCC features. The GMM of each segment then serves as the input data of the clustering of S240, which yields the initial teacher speech class, from which the GGMM is trained. In S140, the similarity between the GMM of each segment outside the teacher class and the GGMM is computed, an adaptive threshold is set, and the segments below the threshold are labeled as teacher speech, yielding the final teacher speech class.
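For illustration only, the following Python sketch performs the segmentation and per-segment modeling just described, assuming the librosa and scikit-learn libraries; the 30 s segment length, 13 MFCC dimensions and 8 mixture components are illustrative choices rather than values fixed by this embodiment:

```python
import librosa
from sklearn.mixture import GaussianMixture

def segment_and_model(wav_path, seg_sec=30.0, n_mfcc=13, n_components=8):
    """Cut a classroom recording into equal-length segments and fit one GMM
    per segment on its MFCC frames (S130)."""
    y, sr = librosa.load(wav_path, sr=16000)
    seg_len = int(seg_sec * sr)
    models = []
    for start in range(0, len(y) - seg_len + 1, seg_len):
        seg = y[start:start + seg_len]
        # MFCC features: one n_mfcc-dimensional vector per frame
        mfcc = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=n_mfcc).T
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type='diag').fit(mfcc)
        models.append(gmm)
    return models
```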
The clustering algorithm of S240 is shown in Fig. 4.
The specific embodiment of the clustering algorithm includes the following steps:
S2401, initial center selection:
1) Randomly select one of all voice segments as the first center.
2) Compute the GMM distances between the remaining segments and the first center, and select the farthest segment as the second center.
3) Successively compute the distances of the unselected segments to the centers, and select the segment farthest from the centers as the next center.
4) Iterate until the number of centers reaches the specified number of classes.
Compared with random center selection, this center selection clearly raises the accuracy of the final clustering result. The scheme may select an outlier as a center and thereby affect the clustering; in practice, however, because of the stop condition of the GMM-Kmeans algorithm set in S2402 3), the clusters produced by an outlier center are excluded during the iterations, so choosing the initial centers by the above method yields stable clustering results.
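A minimal sketch of this farthest-point initialisation follows, assuming a gmm_distance(a, b) function such as the symmetric measure defined below; taking the maximum of the minimum distance to the already chosen centers is one reading of steps 2) and 3):

```python
import random

def choose_centers(gmms, k, gmm_distance):
    centers = [random.choice(gmms)]              # 1) random first center
    while len(centers) < k:                      # 4) repeat until k centers
        rest = [g for g in gmms if all(g is not c for c in centers)]
        # 2)/3) pick the segment farthest from its nearest chosen center
        nxt = max(rest, key=lambda g: min(gmm_distance(g, c) for c in centers))
        centers.append(nxt)
    return centers
```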
Measuring the distance between Gaussian mixture models well takes more than the above, so the dispersion of GMM A relative to GMM B is defined. In the dispersion formula, W_Ai denotes the weight of the i-th mixture component of GMM A, W_Bj the weight of the j-th mixture component of GMM B, and d_AB(i, j) the distance between the i-th Gaussian of A and the j-th Gaussian of B. Considering the computational cost, and the possibility that several Gaussians share the same mean vector, this embodiment selects the Mahalanobis distance as the calculation of d_AB(i, j); there, μ_1 and μ_2 denote the mean vectors of the two multidimensional Gaussian distributions and Σ_1 and Σ_2 their covariance matrices. For symmetry, the final GMM distance metric combines the dispersion of A relative to B with that of B relative to A, A and B denoting the two GMMs.
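Because the dispersion and distance formulas themselves are images in the source text, the sketch below is an assumption consistent with the symbols defined above: a weight-weighted double sum of per-component Mahalanobis-style distances, symmetrised over both directions, for scikit-learn GaussianMixture models with diagonal covariances:

```python
import numpy as np

def mahalanobis(mu1, cov1, mu2, cov2):
    # Mahalanobis-style distance between two Gaussians; pooling the two
    # covariance matrices is an assumption, the text names only the means
    # and covariances of the two distributions.
    d = mu1 - mu2
    pooled = (cov1 + cov2) / 2.0
    return float(np.sqrt(d @ np.linalg.inv(pooled) @ d))

def disp_one_sided(gmm_a, gmm_b):
    """Assumed dispersion of GMM A relative to GMM B: sum of
    W_Ai * W_Bj * d_AB(i, j) over all component pairs."""
    total = 0.0
    for wa, ma, ca in zip(gmm_a.weights_, gmm_a.means_, gmm_a.covariances_):
        for wb, mb, cb in zip(gmm_b.weights_, gmm_b.means_, gmm_b.covariances_):
            total += wa * wb * mahalanobis(ma, np.diag(ca), mb, np.diag(cb))
    return total

def gmm_distance(gmm_a, gmm_b):
    # Symmetrised as the text requires for the final metric.
    return 0.5 * (disp_one_sided(gmm_a, gmm_b) + disp_one_sided(gmm_b, gmm_a))
```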
S2402, compute the distances of all samples to the centers and iterate until a preset halt condition is met:
1) Compute the distances of the remaining GMMs to the centers, and assign each GMM to its nearest center.
2) Update the centers, taking in each class the member with the smallest sum of distances to all points of the class as the new center.
3) Iterate until the preset stop condition is met (output when the largest class of the clustering result contains more than 40% of all voice segments and more segments than the second-largest class) or a predetermined number of iterations is reached.
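The loop below sketches this assignment/update iteration together with the dominance stop rule just described; gmm_distance is the symmetric measure above:

```python
def gmm_kmeans(gmms, centers, gmm_distance, max_iter=20):
    clusters = []
    for _ in range(max_iter):
        # 1) assign every segment GMM to its nearest center
        clusters = [[] for _ in centers]
        for g in gmms:
            i = min(range(len(centers)),
                    key=lambda c: gmm_distance(g, centers[c]))
            clusters[i].append(g)
        # 2) new center = member minimising the sum of within-class distances
        centers = [min(cl, key=lambda g: sum(gmm_distance(g, o) for o in cl))
                   for cl in clusters if cl]
        sizes = sorted((len(cl) for cl in clusters), reverse=True)
        # 3) dominance stop rule: largest class > 40% of all segments and
        #    larger than the second-largest class
        if len(sizes) > 1 and sizes[0] > 0.4 * len(gmms) and sizes[0] > sizes[1]:
            break
    return clusters
```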
S2403, circulation execute S2401 step and S2402 is total to n times, can get n kind teacher voice division group, according to certain Rule selects the division group of maximum satisfaction as initial teacher's voice.
The S2403 iterations yield N teacher class vectors, whose pairwise similarities are computed; the vector with the largest sum of similarities to the other N-1 vectors is taken as the initial teacher class finally obtained by the clustering. Because the N teacher class vectors are not of one length, they must be processed to the same length before the similarity computation.
In this embodiment, zero padding makes the vector lengths equal.
The method takes the longest of the N teacher class vectors, denotes its length M, and expands all vectors to length M, replacing the missing part with 0 elements, that is:
M = max(length(T_1), length(T_2), ..., length(T_N))
T_i = [T_i, Append_i], i = 1, 2, ..., N
Append_i = zeros(1, M - length(T_i)), i = 1, 2, ..., N
where T_1, T_2, ..., T_N are the N teacher class vectors, M is the longest vector length, length(T) gives the length of vector T, Append_i is the all-zero vector appended to the i-th teacher class vector, and zeros(i, j) forms a 0-element vector of i rows and j columns.
In this embodiment, zero padding gives the teacher class vectors a uniform length, after which the pairwise distances between vectors are computed. Because 0 elements are added artificially, measuring similarity with distance-based methods such as the Euclidean distance would carry large error; cosine similarity is therefore selected here as the measure of similarity between vectors.
Cosine similarity expresses the similarity of two vectors by the cosine of their angle in vector space: the closer the cosine is to 1, the closer the angle is to 0 degrees and the more similar the vectors are.
The cosine similarity between vectors a and b is defined as
cos(a, b) = (a · b) / (||a|| ||b||)
where a = (a_1, a_2, ..., a_N) and b = (b_1, b_2, ..., b_N) each denote an N-dimensional vector.
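The selection of the final partition in S2403 can then be sketched as follows, padding the label vectors with zeros and scoring each against the others by cosine similarity:

```python
import numpy as np

def pick_final_partition(vectors):
    m = max(len(v) for v in vectors)
    padded = [np.pad(np.asarray(v, dtype=float), (0, m - len(v)))
              for v in vectors]
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    # score each candidate by its summed similarity to the other N-1 vectors
    scores = [sum(cos(a, b) for j, b in enumerate(padded) if j != i)
              for i, a in enumerate(padded)]
    return vectors[int(np.argmax(scores))]
```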
In S2404 a number of segments, determined by M and N, is randomly selected from the teacher class, where M is the number of voice segments the clustering placed in the teacher class; the random sampling reduces the time of training the GMM on all segments of the teacher class, and N is a constant adapted to the size of M. In its formula, α is a time adjustment parameter used to adjust the number of segments entering GMM training; this embodiment takes α = 2. length(C) denotes the total number of segments obtained after cutting the original classroom audio into 30 s pieces, and the term 0.4*length(C) is the minimum number of teacher voice segments. The formula expresses that the larger the clustered teacher class is, the smaller the proportion taken for GMM model training, so that the number of segments required tends to be similar whenever different speech undergoes GMM training.
The similarity threshold is set to S/γ, where S is the mean within-class similarity of the teacher class segments and γ an adaptive adjustment parameter that preserves the integrity of the teacher class as far as possible. In the formula for γ, β is an adjustment parameter with range [0, 1]; this embodiment takes β = 1/5. S_max and S_min denote the maximum and minimum within-class similarity of the teacher class, length(C) the total number of segments after the 30 s segmentation, and M the number of segments in the teacher class. The larger M is, the larger γ becomes, i.e. the smaller the similarity threshold is set; and when the range of the within-class similarity is wider, a smaller similarity threshold is taken, which makes the decision on whether a remaining segment is teacher speech more accurate.
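As a sketch of the GGMM training in S2404, assuming the GGMM is a single Gaussian mixture fitted on the pooled MFCC frames of a random subset of the clustered teacher segments; since the subset-size and S/γ threshold formulas are images in the source, the subset size is left as a plain parameter here:

```python
import random
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ggmm(teacher_segments_mfcc, subset_size, n_components=16):
    # randomly sample a subset of the teacher-class segments (S2404)
    subset = random.sample(teacher_segments_mfcc, subset_size)
    frames = np.vstack(subset)   # pool the MFCC frames of the chosen segments
    return GaussianMixture(n_components=n_components,
                           covariance_type='diag').fit(frames)
```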
Through the processing of the GMM-Kmeans algorithm, a relatively stable teacher class vector is finally obtained. Compared in tests with manually divided classes, the obtained teacher class has high similarity to the manually labeled one; and compared with the result of clustering directly with an improved K-means, the GMM-Kmeans algorithm used in this embodiment improves the clustering accuracy significantly.
After the teacher class is obtained, the silent and the overlapped speech portions are judged. Since the student class has no specific features and the number of students is unknown, the student class cannot be detected first. This embodiment detects the teacher class preferentially, then the classes containing silent and overlapped voice segments, and labels the remaining segments, after excluding those three parts, as the student speech class.
Referring to Fig. 5, the specific speech separation steps are:
S310, overlap speech detection, obtaining the overlapped voice segments in the classroom audio;
S320, judging whether the overlapped speech contains teacher speech;
S330, selecting the voice segments closest to the overlapped speech as training segments;
S340, performing speech separation with the designed CNMF+JADE method.
In S310 the overlapped segments are obtained from silent points. Research shows that silent frames have lower energy than non-silent frames, so silent frames can be judged against an energy threshold, defined by a formula in which E_i denotes the energy of the i-th speech frame, N is the total number of frames of the voice segment, r is a constant with range (0, 1), and ⌈·⌉ denotes rounding up.
Overlapped speech means that two or more people speak simultaneously within one segment. In a real classroom, overlapped speech mainly appears when 1. students hold group discussions, or 2. several students answer the teacher's question at once. Overlapped segments differ from silent segments in their proportion of silent frames: research shows that within a voice segment, the longer the silence lasts, the lower the probability that the segment contains overlapped speech [56]. For the problem handled by this embodiment, the potential overlap class can therefore be determined from the number of silent frames, similarly to how the potential silence class is obtained:
ClassOfOverlap_i = I(numberOfSilence_i < Threshold_o), i = 1, 2, ..., N
where α' is a constant used to obtain the overlap decision threshold Threshold_o; this embodiment takes α' = 0.6. Segments whose number of silent frames is below Threshold_o are regarded as potential overlap segments, from which the potential overlap class is obtained.
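A sketch of this overlap screening follows; the energy-threshold and Threshold_o formulas are images in the source, so both thresholds are passed in as parameters here:

```python
import numpy as np

def potential_overlap_segments(frame_energies_per_seg, energy_thr, overlap_thr):
    flags = []
    for energies in frame_energies_per_seg:
        # count the frames of the segment whose energy marks them as silent
        n_silent = int(np.sum(np.asarray(energies) < energy_thr))
        # few silent frames -> the segment potentially contains overlap
        flags.append(n_silent < overlap_thr)
    return flags
```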
S320 and S330 together are called the speech separation front end, which serves two purposes: judging whether the overlapped speech contains the target speaker, and finding the voice segment other than the target speaker's that is closest to the overlapped speech, to serve as CNMF training data. The invention judges whether the overlapped speech contains the teacher from GMM similarity, computed with an improved Bhattacharyya distance.
In the decision, disp(A, B) denotes the distance between the GMMs of voice segments A and B, A being the overlapped segment and B the teacher segment, and t is an adaptive threshold computed by a formula in which p is an adjustment parameter with value in [0.5, 0.8], K is the number of student voice segments, S_i is the i-th student segment and B the teacher segment. Computing with the student segments thus yields an adaptive threshold for judging whether the overlapped speech contains the teacher.
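The decision can be sketched as below; since the formula for t is an image in the source, the scaled mean distance between the student segments and the teacher model, with p in [0.5, 0.8], is an assumption:

```python
def contains_teacher(gmm_overlap, gmm_teacher, student_gmms, gmm_distance,
                     p=0.6):
    # adaptive threshold t from the K student segments (assumed scaled mean)
    t = p * sum(gmm_distance(s, gmm_teacher)
                for s in student_gmms) / len(student_gmms)
    return gmm_distance(gmm_overlap, gmm_teacher) < t
```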
The second task of the separation front end is selecting the non-teacher voice segment closest to the overlapped segment for CNMF training, which has considerable influence on the subsequent separation. The invention trains CNMF on the non-teacher segment closest to the overlapped speech together with the teacher segment, chosen as
v_i = argmin_{S_j} disp(A_i, S_j), i = 1, 2, ..., N, j = 1, 2, ..., K
where A_i is the i-th overlapped segment and v_i the correspondingly selected training segment.
S340 separates the overlapped speech that contains the teacher. The invention proposes a single-channel speech separation method fusing CNMF and JADE: JADE performs a second separation on the signal already separated by CNMF. The CNMF+JADE algorithm aims to obtain the separated signals of all speakers in a single-channel mixture and proceeds as follows:
Input: clean speech t_1, t_2, ..., t_N of the speakers to be separated; training mixtures o_1, o_2, ..., o_{N-1}; the mixture O to be separated.
Output: separated speaker signals s_1, s_2, ..., s_N.
Step 1: select the target speaker t_1 and the corresponding mixture o_1 to train CNMF.
Step 2: separate the mixture O into an estimated target component and an estimated residual component.
Step 3: generate a random matrix R_1 and mix the two estimates into a two-channel signal S_1.
Step 4: separate S_1 with JADE to obtain s_1 and O_1.
Step 5: taking O_1 as the mixture to be separated, t_2, ..., t_N as the clean speech to be separated and o_2, ..., o_{N-1} as the training mixtures, repeat Steps 1-5.
Step 6: obtain the separated signals s_1, s_2, ..., s_N.
In the above algorithm, t_1, t_2, ..., t_N denote the clean speech of the speakers contained in the mixture O, and N the number of speakers in O. o_1, o_2, ..., o_{N-1} are the mixtures obtained by successively removing the corresponding speaker from O.
Under realistic conditions o_1, o_2, ..., o_{N-1} are extremely difficult to obtain, so the voice of a non-target speaker randomly selected from the current mixture can be substituted to train CNMF. Experiments verify that this substitution performs slightly worse than the original CNMF+JADE, but it generalizes the CNMF+JADE algorithm to more general situations.
The two-channel signal of Step 3 is generated by mixing the two CNMF outputs through R_i, a 2 × 2 matrix.
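A heavily hedged sketch of one round of Steps 3-4 follows; jade() is an assumed ICA routine (for example a Python port of Cardoso's jadeR) that unmixes a two-channel signal into two sources, and the two inputs are the target and residual estimates produced by the preceding CNMF separation:

```python
import numpy as np

def jade_round(target_estimate, residual_estimate, jade):
    # Step 3: mix the two CNMF outputs through a random 2x2 matrix R
    r = np.random.rand(2, 2)
    stereo = r @ np.vstack([target_estimate, residual_estimate])
    # Step 4: JADE unmixes the artificial two-channel signal again,
    # yielding a cleaner target s1 and the remaining mixture O1
    s1, o1 = jade(stereo)
    return s1, o1
```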
As shown in Fig. 6, the specific speech enhancement steps are:
S410, taking the teacher speech after speech separation as the speech enhancement data;
S420, adaptively judging the separated teacher speech and choosing the suitable segments for enhancement;
S430, performing speech enhancement with the wavelet transform.
The wavelet transform has been a research hotspot in speech processing in recent years. Compared with traditional frequency-domain analysis such as the Fourier transform, the wavelet transform presents the time and frequency behavior of a signal simultaneously; it is a time-frequency analysis method featuring multiresolution analysis, localized time-frequency transformation and flexible choice of the wavelet function. Its principle is explained below.
Let L²(R) be the space of square-integrable functions, and let ψ(t) ∈ L²(R) with Fourier transform Ψ(ω). If Ψ(ω) satisfies the admissibility condition
∫_R |Ψ(ω)|² / |ω| dω < ∞,
then ψ(t) is called a wavelet, or mother wavelet. Scaling and translating the mother wavelet ψ by a real pair (a, b), where a, b ∈ R and a ≠ 0, yields the family of functions
ψ_{a,b}(t) = |a|^{-1/2} ψ((t - b) / a),
called the wavelet basis functions, where a is the scale factor, b the shift factor and ψ the window function, whose window size is fixed but whose shape is variable; this property gives the wavelet transform its multiresolution character. The factor |a|^{-1/2} is the normalization factor, whose effect is to give the wavelet the same energy at different scales.
Signal processing in the wavelet domain is one of the main means of current speech signal processing. The multiresolution, low-entropy and decorrelation characteristics of the wavelet transform give it great advantages in speech signal processing, and the large number of wavelet bases can cope with different scenes, so the wavelet transform is very suitable for speech signal processing.
When the wavelet transform is used for speech enhancement, the multiresolution analysis is exploited: according to the different features that the wavelet coefficients of noise and of speech show in the wavelet domain at different scales, corresponding rules are formulated to process the wavelet coefficients of the noisy signal.
The key steps of wavelet denoising are as follows:
Step 1: apply the wavelet transform to the noisy signal;
Step 2: denoise the wavelet coefficients at the different scales;
Step 3: apply the inverse wavelet transform to the processed wavelet coefficients to obtain the enhanced, reconstructed signal.
Wavelet denoising divides roughly into the following three types: denoising with the wavelet modulus maxima principle; denoising with the correlation of the wavelet coefficients across scales; and wavelet threshold denoising. This embodiment mainly uses the third, wavelet threshold denoising.
Wavelet threshold denoising is one of the more commonly used denoising methods; its basic process is as follows:
Step 1: select a suitable wavelet basis for the signal to be processed, determine a reasonable number of decomposition levels, and decompose the noisy speech signal into multiple levels;
Step 2: select suitable thresholds for the decomposed wavelet coefficients at the different scales and quantize them;
Step 3: perform wavelet reconstruction on the thresholded coefficients to obtain the enhanced speech signal.
The diversity of wavelet bases is one of the advantages of wavelet time-frequency analysis, so selecting a suitable wavelet function is essential. Studies show that in speech processing, to handle the transient variation of the speech signal better, wavelet bases with good smoothness, good symmetry and lower vanishing moments should be selected.
The number of decomposition levels has always drawn attention as a factor influencing the denoising effect of speech enhancement. As the level increases, the detail parts of the speech and noise signals become clearer, which favors denoising; but with more levels the speech energy disperses more and more, causing distortion, and the algorithm also runs slower and slower, while too few levels leave the signal and the noise aliased so that the noise cannot be isolated. Through extensive experimental analysis, researchers found the most reasonable decomposition level to be ⌊log₂ N⌋, where N is the data length and ⌊·⌋ denotes rounding down.
In wavelet threshold denoising, the threshold estimate is one of the decisive factors of the denoising effect. The wavelet transform decomposes the noisy signal into high-frequency detail and low-frequency approximation parts; the frequency of noise is usually higher, so noise energy concentrates mainly on the high-frequency wavelet coefficients, while speech energy concentrates mainly in the low-frequency part of the signal. A threshold can therefore be set and the noise components below it truncated; that threshold is exactly what wavelet threshold denoising studies.
The selection of the wavelet denoising threshold mainly includes the following classical methods.
Universal threshold method. The universal threshold estimate is derived from the minimum mean square error criterion and may be expressed as
λ = σ_n √(2 ln N)
where σ_n is the standard deviation of the noise and N the signal length. The noise deviation is obtained from
σ_n = M_j / 0.6745
where M_j is the absolute median of the wavelet coefficients of each decomposition level and 0.6745 is an empirical value.
This method is simple to realize and filters out Gaussian white noise well, but because it depends on the speech length, the effect degrades when the amount of data is very large.
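The three denoising steps with the universal threshold can be sketched with PyWavelets as follows; the db4 wavelet and 4 levels are illustrative choices:

```python
import numpy as np
import pywt

def wavelet_denoise(signal, wavelet='db4', level=4):
    coeffs = pywt.wavedec(signal, wavelet, level=level)   # Step 1: decompose
    # sigma_n = M_j / 0.6745, from the absolute median of the finest details
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    lam = sigma * np.sqrt(2 * np.log(len(signal)))        # universal threshold
    coeffs = [coeffs[0]] + [pywt.threshold(c, lam, mode='soft')
                            for c in coeffs[1:]]          # Step 2: shrink
    return pywt.waverec(coeffs, wavelet)                  # Step 3: reconstruct
```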
SUREShrink threshold [73]. SUREShrink threshold estimation is an adaptive threshold selection method and an unbiased estimate of the optimal threshold. The threshold is defined through a risk function, and the threshold estimate is the value that minimizes that risk once the signal length is substituted in; the resulting expression counts the number of elements of the set {Y : |Y_i| < t}.
Minimax threshold. The Minimaxi threshold, also called the minimax threshold, produces an extreme of the least mean square error. Its calculation depends only on the signal length N and is commonly given as λ = σ_n (0.3936 + 0.1829 log₂ N) for N > 32, and 0 otherwise.
In addition to the threshold estimate, the threshold function also plays a vital role in wavelet threshold denoising. The common threshold functions are as follows.
Hard threshold function. The hard threshold function is
ω̂_{j,k} = ω_{j,k} if |ω_{j,k}| ≥ λ, and 0 otherwise,
where ω̂_{j,k} is the estimated wavelet coefficient, ω_{j,k} the decomposition wavelet coefficient and λ the denoising threshold. Hard thresholding thus compares ω_{j,k} with λ, zeroing the coefficients below λ and retaining those above it; such processing may introduce oscillation into the reconstructed signal and harm the denoising effect.
Soft threshold function. To eliminate that influence of hard thresholding, soft thresholding was introduced:
ω̂_{j,k} = sgn(ω_{j,k}) (|ω_{j,k}| - λ) if |ω_{j,k}| ≥ λ, and 0 otherwise.
Compared with the hard threshold function, the soft threshold function strengthens the smoothness of the speech signal, but it likewise loses some features and causes a degree of distortion [74].
Semisoft threshold function. To overcome the defects of the soft and hard threshold functions, scholars proposed the semisoft threshold function
ω̂_{j,k} = ω_{j,k} for |ω_{j,k}| > λ_2; sgn(ω_{j,k}) λ_2 (|ω_{j,k}| - λ_1) / (λ_2 - λ_1) for λ_1 < |ω_{j,k}| ≤ λ_2; and 0 otherwise,
where λ_1 and λ_2 are the lower and upper thresholds with 0 < λ_1 < λ_2. As a rule of thumb the choice of λ_1 is related to the speech: when unvoiced sounds dominate, λ_1 takes a smaller value; when voiced sounds dominate, a larger one. Adjusting λ_1 and λ_2 lets the method combine the advantages of soft and hard thresholding, but the two parameters increase the computational complexity of the algorithm.
Garrote threshold function. The Garrote threshold function is
ω̂_{j,k} = ω_{j,k} - λ² / ω_{j,k} if |ω_{j,k}| > λ, and 0 otherwise;
it introduces the threshold into the shrinkage itself and dynamically rejects the wavelet coefficients exceeding the selected threshold.
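For reference, the four threshold functions read as follows in plain NumPy; the semisoft and Garrote forms follow the common textbook definitions, since the formula images of the source are not reproduced in its text:

```python
import numpy as np

def hard(w, lam):
    return np.where(np.abs(w) >= lam, w, 0.0)

def soft(w, lam):
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def semisoft(w, lam1, lam2):        # requires 0 < lam1 < lam2
    a = np.abs(w)
    mid = np.sign(w) * lam2 * (a - lam1) / (lam2 - lam1)
    return np.where(a > lam2, w, np.where(a > lam1, mid, 0.0))

def garrote(w, lam):
    with np.errstate(divide='ignore', invalid='ignore'):
        shrunk = w - lam ** 2 / w   # masked to 0 wherever |w| <= lam below
    return np.where(np.abs(w) > lam, shrunk, 0.0)
```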
Before speech enhancement the invention designs an adaptive, wavelet-transform-based analysis of the voice signals separated by CNMF+JADE, so that enhancement is applied selectively: the segments whose quality the enhancement might degrade are filtered out automatically in advance. Analysis of the signals after multiple stages of separation, and of the sound after the wavelet transform, shows that when the distance between the separated voices is large, the enhancement effect of the wavelet transform drops. Based on this finding, the invention makes the following adaptive judgement before enhancement, for i = 1, 2, ..., N:
O_{i-1} = O_i + s_i + l, O_0 = O, O_N = s_N
where s_i denotes the i-th teacher voice signal after the CNMF+JADE decomposition; O_i the mixture remaining after CNMF+JADE separates s_i from O_{i-1}; GMM_{s_i} and GMM_{O_i} the GMMs corresponding to s_i and O_i; l the loss of the separation process; N the number of speakers contained in the mixed signal; disp(·) the GMM distance formula above; and p a scaling factor with value in [1, 1.2].
1. Based on the complexity of classroom teaching, the invention designs a teacher voice extraction method and extends the application category of the information-based classroom; it is not only an important component of the smart classroom (artificial intelligence + education) but a completely new embodiment of future education. As far as the available literature shows, research of this type is currently scarce, and essentially no usable framework or theory has formed. The invention takes a major step in smart classroom research and opens a new view for AI-based instructional methodology.
2. The invention recognizes and extracts classroom teacher speech in a single-channel, adaptive, unsupervised manner. Compared with existing methods it needs no prior knowledge and adapts well to classroom audio of different forms, different lengths and different classroom environments. The proposed method applies not only to classroom teaching but also to fields such as meetings, hearing aids and communications (for example, combining speech separation with hearing aids gives them stronger signal processing and improves their voice quality; in mobile communications, speech separation at the device end suppresses non-target speakers and improves speech quality and intelligibility).
3. The invention designs and implements an improved GMM-Kmeans clustering method that clusters with the GMM as the feature, preserving the original features as far as possible and improving the clustering accuracy. Using the GMM as the feature for distance computation avoids directly processing long voice signals, shortening the algorithm processing time; overall, a classroom speech recognition of high accuracy and high speed is realized.
4. On top of the GMM-Kmeans clustering, the influence of the environment is considered: based on the clustering result, suitable voice segments are chosen adaptively to build the GGMM, the similarity threshold is obtained adaptively and teacher speech is detected a second time, yielding an accurate teacher speech class. All thresholds are obtained adaptively from the classroom voice data through the designed formulas, without manual interference, so the algorithm is highly robust to different classroom environments and situations.
5. The invention designs and implements a CNMF+JADE speech separation algorithm, applying JADE to the CNMF separation result for a second separation and effectively improving the separation result. 6. The invention designs and implements an adaptive wavelet-transform speech enhancement method that adaptively judges the voice after CNMF+JADE separation and filters out the segments unsuitable for enhancement, denoising the voice signal purposefully.
The above discloses only a preferred embodiment of the present invention, which of course cannot limit the scope of its claims; equivalent changes made in accordance with the claims of the present invention still fall within the scope of the invention.

Claims (10)

1. A single-channel, unsupervised target speaker voice extraction method, characterized by comprising a teacher speech detection step and a teacher speech GGMM model training step;
the teacher speech detection step comprising the following steps:
S1: recording the classroom to obtain voice data;
S2: preprocessing the voice signal;
S3: voice segmentation and modeling, the segmentation dividing the classroom audio into equal-length segments, then extracting the corresponding MFCC features for each segment and building a GMM for each segment from those features;
S4: teacher speech detection, computing the similarity between the GMM of each segment outside the teacher class and the GGMM, setting an adaptive threshold, and labeling the segments below the threshold as teacher speech, thereby obtaining the final teacher speech class;
the teacher speech GGMM model training step comprising the following step:
S5: clustering the voice data obtained in S3, obtaining the initial teacher speech class, and training the GGMM from the initial teacher speech class.
2. The single-channel, unsupervised target speaker voice extraction method according to claim 1, characterized in that the clustering comprises the following steps:
S51: choosing the cluster center points;
S52: computing the distances of all samples to the centers, and iterating until a preset halt condition is met;
S53: repeating S51 and S52 N times to obtain N candidate teacher partitions, and selecting the most satisfactory partition as the initial teacher speech according to a set rule;
S54: selecting several segments from that partition to train the GGMM, and computing the average within-class distance;
S55: making a second judgement on the remaining speech samples according to the GGMM and the average distance, and adding a sample to the teacher class when its distance is below a set threshold;
S56: outputting all teacher speech samples and writing them to a database.
3. The single-channel, unsupervised target speaker voice extraction method according to claim 2, characterized in that S51 specifically comprises:
S511: randomly selecting one of all voice segments as the first center;
S512: computing the GMM distances between the remaining segments and the first center, and selecting the farthest segment as the second center;
S513: successively computing the distances of the unselected segments to the centers, and selecting the segment farthest from the centers as the next center;
S514: iterating until the number of centers reaches the specified number of classes.
4. The single-channel, unsupervised target speaker voice extraction method according to claim 3, characterized in that S52 specifically comprises:
S521: computing the distances of the remaining GMMs to the centers, and assigning each GMM to its nearest center;
S522: updating the centers, taking in each class the member with the smallest sum of distances to all points of the class as the new center;
S523: iterating until a preset stop condition is met or a predetermined number of iterations is reached.
5. The single-channel, unsupervised target speaker voice extraction method according to claim 4, characterized in that S53 specifically comprises: the iterations yield N teacher class vectors, whose pairwise similarities are computed; the vector with the largest sum of similarities to the other N-1 vectors is taken as the initial teacher class finally obtained by the clustering.
6. The single-channel, unsupervised target speaker voice extraction method according to claim 5, characterized in that S54 specifically comprises: randomly selecting a number of segments, determined by M and N, from the teacher class, where M is the number of voice segments the clustering placed in the teacher class; the random sampling reduces the time of training the GMM on all segments of the teacher class, and N is a constant adapted to the size of M, obtained by a formula in which α is a time adjustment parameter used to adjust the number of segments entering GMM training, length(C) denotes the total number of segments obtained after segmenting the original classroom audio, and the term 0.4*length(C) is the minimum number of teacher voice segments.
7. The single-channel, unsupervised target speaker voice extraction method according to claim 1, wherein S3 comprises:
S31: performing overlapping-speech detection to obtain the overlapping voice segments in the classroom speech;
S32: judging whether the overlapping speech contains teacher speech;
S33: selecting the voice segments closest to the overlapping speech as training voice segments;
S34: performing speech separation with the designed CNMF+JADE method.
8. The single-channel, unsupervised target speaker voice extraction method according to claim 7, wherein S31 comprises:
obtaining the overlapping voice segments by means of silent points, where silent frames are judged by setting an energy threshold, the energy threshold being obtained as follows:
T = E(⌈r·N⌉)
where E_i denotes the energy of the i-th speech frame, E(k) denotes the k-th smallest of the N frame energies, N is the total number of frames in the voice segment, r is a constant in the range (0,1), and ⌈ ⌉ denotes rounding up.
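Since the threshold formula above is itself a reconstruction of a garbled span, the sketch below implements only that order-statistic reading: compute short-time frame energies, sort them, and take the ⌈r·N⌉-th smallest as the silence threshold. Frame and hop sizes are illustrative, not from the patent.

```python
import numpy as np

def frame_energies(signal, frame_len=400, hop=160):
    """Short-time energy per frame of a mono signal (roughly 25 ms frames
    with a 10 ms hop at 16 kHz)."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.array([float(np.sum(f ** 2)) for f in frames])

def silence_threshold(energies, r=0.3):
    """One reading of claim 8: the ceil(r*N)-th smallest of the N frame
    energies, with r in (0, 1); frames below it count as silent."""
    e = np.sort(np.asarray(energies, dtype=float))
    return e[int(np.ceil(r * e.size)) - 1]

# toy usage: mark silent frames in one second of noise
E = frame_energies(np.random.default_rng(2).normal(size=16000))
print(np.sum(E < silence_threshold(E, r=0.3)), "frames below threshold")
```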
9. The single-channel, unsupervised target speaker voice extraction method according to claim 8, wherein S32 comprises: judging whether the overlapping speech contains teacher speech by means of GMM similarity, the similarity being measured with an improved Bhattacharyya distance, and the judgement criterion being:
disp(A, B) < t
where disp(A, B) denotes the distance between the GMM models of voice segments A and B, A denotes an overlapping voice segment, B is a teacher voice segment, and t is an adaptive threshold calculated as follows:
t = p·(1/K)·Σ_{i=1..K} disp(S_i, B)
where p is an adjustment parameter with value between [0.5, 0.8], K is the number of voice segments in the student part, S_i is the i-th student voice segment, and B is the teacher voice segment.
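Both formulas in claim 9 are reconstructions of garbled spans, read here as the test disp(A, B) < t with t set to p times the average student-to-teacher model distance. A minimal sketch under that reading, with disp passed in (e.g. the Bhattacharyya stand-in sketched after claim 2):

```python
import numpy as np

def overlap_contains_teacher(disp, overlap_seg, teacher_seg, student_segs, p=0.6):
    """Reconstructed reading of claim 9: the overlap A is deemed to
    contain teacher speech when disp(A, B) falls below the adaptive
    threshold t = p * mean_i disp(S_i, B), with p in [0.5, 0.8]."""
    t = p * float(np.mean([disp(s, teacher_seg) for s in student_segs]))
    return disp(overlap_seg, teacher_seg) < t
```

Anchoring t to the student-to-teacher distances makes the test scale-free: the overlap counts as containing teacher speech only if it sits markedly closer to the teacher model than a typical student segment does.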
10. The single-channel, unsupervised target speaker voice extraction method according to claim 9, wherein S33 comprises: selecting the non-teacher voice segments closest to the overlapping speech, to be used together with the teacher voice segments to train the CNMF, the selection being made as follows:
v_i = argmin_{S_j} disp(A_i, S_j), i = 1, 2, ..., N, j = 1, 2, ..., K
where A_i is the i-th overlapping voice segment and v_i is the correspondingly selected i-th training voice segment.
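The selection in claim 10 is a per-overlap arg-min over the student segments. A minimal sketch, again with disp passed in:

```python
def nearest_student_segments(disp, overlap_segs, student_segs):
    """Sketch of claim 10: for each overlapping segment A_i, pick the
    student segment S_j minimising disp(A_i, S_j); the picked v_i then
    join the teacher segments as CNMF training material."""
    return [min(student_segs, key=lambda s: disp(a, s))
            for a in overlap_segs]
```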
CN201810832080.5A 2018-07-26 2018-07-26 Single-channel and unsupervised target speaker voice extraction method Active CN108962229B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810832080.5A CN108962229B (en) 2018-07-26 2018-07-26 Single-channel and unsupervised target speaker voice extraction method


Publications (2)

Publication Number Publication Date
CN108962229A (en) 2018-12-07
CN108962229B (en) 2020-11-13

Family

ID=64464209

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810832080.5A Active CN108962229B (en) 2018-07-26 2018-07-26 Single-channel and unsupervised target speaker voice extraction method

Country Status (1)

Country Link
CN (1) CN108962229B (en)



Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050125223A1 (en) * 2003-12-05 2005-06-09 Ajay Divakaran Audio-visual highlights detection using coupled hidden markov models
CN101866421A (en) * 2010-01-08 2010-10-20 苏州市职业大学 Method for extracting characteristic of natural image based on dispersion-constrained non-negative sparse coding
CN102568477A (en) * 2010-12-29 2012-07-11 盛乐信息技术(上海)有限公司 Semi-supervised pronunciation model modeling system and method
CN102682760A (en) * 2011-03-07 2012-09-19 株式会社理光 Overlapped voice detection method and system
CN103854644A (en) * 2012-12-05 2014-06-11 中国传媒大学 Automatic duplicating method and device for single track polyphonic music signals
CN103680517A (en) * 2013-11-20 2014-03-26 华为技术有限公司 Method, device and equipment for processing audio signals
CN104167208A (en) * 2014-08-08 2014-11-26 中国科学院深圳先进技术研究院 Speaker recognition method and device
JP2018502319A (en) * 2015-07-07 2018-01-25 三菱電機株式会社 Method for distinguishing one or more components of a signal
CN105096955A (en) * 2015-09-06 2015-11-25 广东外语外贸大学 Speaker rapid identification method and system based on growing and clustering algorithm of models
CN105957537A (en) * 2016-06-20 2016-09-21 安徽大学 Voice denoising method and system based on L1/2 sparse constraint convolution non-negative matrix decomposition

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Colin Vaz: "CNMF-based acoustic features for noise-robust ASR", 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) *
Li Hao: "Single-channel speech separation based on deep learning", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948144A (en) * 2019-01-29 2019-06-28 汕头大学 A method of the Teachers ' Talk Intelligent treatment based on classroom instruction situation
CN109948144B (en) * 2019-01-29 2022-12-06 汕头大学 Teacher utterance intelligent processing method based on classroom teaching situation
CN110544481A (en) * 2019-08-27 2019-12-06 华中师范大学 S-T classification method and device based on voiceprint recognition and equipment terminal
CN110544482B (en) * 2019-09-09 2021-11-12 北京中科智极科技有限公司 Single-channel voice separation system
CN110544482A (en) * 2019-09-09 2019-12-06 极限元(杭州)智能科技股份有限公司 single-channel voice separation system
CN110874879A (en) * 2019-10-18 2020-03-10 平安科技(深圳)有限公司 Old man registration method, device, equipment and storage medium based on voice recognition
WO2021073161A1 (en) * 2019-10-18 2021-04-22 平安科技(深圳)有限公司 Elderly people registration method, apparatus and device based on voice recognition, and storage medium
CN111179962A (en) * 2020-01-02 2020-05-19 腾讯科技(深圳)有限公司 Training method of voice separation model, voice separation method and device
US20220012667A1 (en) * 2020-07-13 2022-01-13 Allstate Insurance Company Intelligent prediction systems and methods for conversational outcome modeling frameworks for sales predictions
US11829920B2 (en) * 2020-07-13 2023-11-28 Allstate Insurance Company Intelligent prediction systems and methods for conversational outcome modeling frameworks for sales predictions
CN112017685A (en) * 2020-08-27 2020-12-01 北京字节跳动网络技术有限公司 Voice generation method, device, equipment and computer readable medium
CN112017685B (en) * 2020-08-27 2023-12-22 抖音视界有限公司 Speech generation method, device, equipment and computer readable medium
CN117577124A (en) * 2024-01-12 2024-02-20 京东城市(北京)数字科技有限公司 Training method, device and equipment of audio noise reduction model based on knowledge distillation
CN117577124B (en) * 2024-01-12 2024-04-16 京东城市(北京)数字科技有限公司 Training method, device and equipment of audio noise reduction model based on knowledge distillation

Also Published As

Publication number Publication date
CN108962229B (en) 2020-11-13

Similar Documents

Publication Publication Date Title
CN108962229A (en) A kind of target speaker&#39;s voice extraction method based on single channel, unsupervised formula
EP3292515B1 (en) Method for distinguishing one or more components of signal
Sarikaya et al. High resolution speech feature parametrization for monophone-based stressed speech recognition
CN106952643A (en) A kind of sound pick-up outfit clustering method based on Gaussian mean super vector and spectral clustering
CN109326302A (en) A kind of sound enhancement method comparing and generate confrontation network based on vocal print
CN108899047B (en) The masking threshold estimation method, apparatus and storage medium of audio signal
CN109410976A (en) Sound enhancement method based on binaural sound sources positioning and deep learning in binaural hearing aid
CN108922559A (en) Recording terminal clustering method based on voice time-frequency conversion feature and integral linear programming
CN101366078A (en) Neural network classifier for separating audio sources from a monophonic audio signal
CN110197665B (en) Voice separation and tracking method for public security criminal investigation monitoring
CN106328123B (en) Method for recognizing middle ear voice in normal voice stream under condition of small database
CN108986798B (en) Processing method, device and the equipment of voice data
CN110111797A (en) Method for distinguishing speek person based on Gauss super vector and deep neural network
CN103985381A (en) Voice frequency indexing method based on parameter fusion optimized decision
CN109346084A (en) Method for distinguishing speek person based on depth storehouse autoencoder network
Do et al. Speech source separation using variational autoencoder and bandpass filter
Fan et al. Utterance-level permutation invariant training with discriminative learning for single channel speech separation
CN110136746B (en) Method for identifying mobile phone source in additive noise environment based on fusion features
CN106875944A (en) A kind of system of Voice command home intelligent terminal
Fan et al. Deep attention fusion feature for speech separation with end-to-end post-filter method
Qiu et al. Self-Supervised Learning Based Phone-Fortified Speech Enhancement.
CN116092512A (en) Small sample voice separation method based on data generation
Wang et al. Robust speech recognition from ratio masks
Ravindran et al. Improving the noise-robustness of mel-frequency cepstral coefficients for speech processing
CN111091847A (en) Deep clustering voice separation method based on improvement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant