CN108962229A - A single-channel, unsupervised target speaker voice extraction method - Google Patents
A single-channel, unsupervised target speaker voice extraction method
- Publication number: CN108962229A
- Application number: CN201810832080.5A
- Authority: CN (China)
- Prior art keywords: voice, teacher, classification, segments, voice segments
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L15/063: Training (creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
- G06F18/214: Generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06F18/23213: Non-hierarchical clustering using statistics or function optimisation with a fixed number of clusters, e.g. K-means clustering
- G10L15/005: Language recognition
- G10L2015/0631: Creating reference templates; clustering
- G10L2015/0635: Training: updating or merging of old and new templates; mean values; weighting
- G10L2015/0636: Threshold criteria for the updating
- G10L2015/0638: Interactive procedures
- G10L25/24: Speech or voice analysis in which the extracted parameters are the cepstrum
Abstract
An embodiment of the invention discloses a single-channel, unsupervised target speaker voice extraction method comprising a teacher-speech detection step and a teacher-speech model training step. The teacher-speech detection step includes: recording the classroom to obtain voice data; preprocessing the voice signal; voice segmentation and modeling, where the segmentation divides the classroom recording into equal-length segments, extracts the corresponding MFCC features for each segment, and builds a GMM for each segment from those features; and teacher-speech detection, which computes the similarity between the GMM of each segment outside the teacher class and the GGMM and labels segments below a set threshold as teacher speech, thereby obtaining the final teacher-speech class. The teacher-speech GGMM training step includes: clustering the voice data obtained in S3 to obtain an initial teacher-speech class, and training a GGMM from that initial class. The invention effectively improves the adaptability and intelligence of the system in practical applications and lays a foundation for subsequent applications and research.
Description
Technical field
The present invention relates to voice extraction methods, and more particularly to a single-channel, unsupervised target speaker voice extraction method for complex multi-speaker situations.
Background art
Education quality is the key guarantee of education at every level, and within it, improving teaching quality, classroom teaching in particular, is the top priority. The traditional approach, however, is evaluation based on manual (peer) classroom observation. Although such methods have some effect, they are neither universally practicable nor universally objective. The reason is that teaching administrators cannot possibly observe, evaluate, and advise on every class at every moment; attempting to do so would impose a heavy and unnecessary burden on teaching management. Moreover, because traditional observation and evaluation cannot follow the whole teaching process, it is difficult to evaluate a teacher's teaching quality objectively.
Information and intelligent technologies have already become an important foundation of social development. How to use and develop them to reform the traditional classroom, and to build efficient, automatic "intelligent perception" for classroom teaching, has naturally become a scientific problem of great research value.
To realize intelligent perception of classroom teaching, the first problem to be solved is the recognition and extraction of teacher speech.
At present, apart from supervised speaker recognition methods, the main unsupervised approach is speaker clustering, and teacher-speech recognition in classroom audio essentially belongs to this category. Research on speaker clustering falls mainly into four kinds: 1. hierarchical clustering; 2. K-means clustering; 3. spectral clustering; 4. affinity propagation clustering.
The article "Research and implementation of unsupervised speaker clustering methods" studies the computational efficiency of spectral clustering based on a feature similarity matrix and implements a spectral clustering algorithm that builds the model similarity matrix with adaptive Gaussian mixture models. Gaussian mixture models are first trained for the voice segments with the GMM-UBM-MAP technique: a universal background model (UBM) is trained offline, then adapted under the maximum a posteriori (MAP) criterion to obtain the target speaker's Gaussian mixture model (GMM). The GMM similarities are then computed to build a similarity matrix, features are extracted from the matrix for clustering, and the target speaker's speech portions are obtained.
The article "Multi-speaker recognition with improved speaker clustering initialization and GMM" extracts Mel-frequency cepstral coefficient (MFCC) features from the voice segments. In the training part, the initial classes are processed with the Bayesian information criterion (BIC) to obtain purer initial categories; the MFCC features are then clustered, and a GMM is trained for each class. In the recognition phase, speaker decisions are made with GMM-based speaker recognition.
Extracting teacher speech requires not only recognizing isolated teacher speech but also separating overlapped speech that contains the teacher's utterances. The purpose of speech separation is to isolate the speech of interest from multiple simultaneously sounding sources. Depending on the relationship between the received source signals and the acquired mixture, speech separation divides into multi-channel and single-channel separation. Single-channel separation needs only a single signal source; compared with multi-channel signals it is not only easier to acquire but also closer to reality, yet separating a single-channel signal is harder. Research on single-channel speech separation falls mainly into three kinds: 1. computational auditory scene analysis (CASA); 2. model-based methods; 3. time-frequency-distribution-based methods.
In the article "An Auditory Scene Analysis Approach to Monaural Speech Segregation", Hu and Wang propose a CASA-based speech separation framework. By simulating the basilar membrane of the human cochlea, the mixed signal is decomposed into a time-frequency representation from which the features needed for separation are extracted; auditory time-frequency segmentation then merges adjacent time-frequency units of the same source into auditory segments, the segments of each source are finally grouped, and the waveform of each source is synthesized to realize separation. Hu and Wang later made a series of improvements to the CASA system, including optimizations for separating voiced and unvoiced signals. The article "CNMF-based acoustic features for noise-robust ASR" points out that NMF is an unsupervised, dictionary-learning-based method that works well for separating various types of signals. The NMF algorithm performs purely additive operations, all factors after decomposition are nonnegative matrices, and it realizes matrix dimensionality reduction. As research has deepened, NMF has become fast and accurate and is well suited to large-scale data processing, so it has found wide application in many fields.
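The additive, nonnegative factorization described above can be illustrated with a minimal sketch. This is not the CNMF variant the patent uses, only plain NMF with the classic Lee-Seung multiplicative updates, which keep both factors nonnegative by construction:

```python
import numpy as np

def nmf(V, rank, n_iter=200, seed=0):
    """Basic NMF via Lee-Seung multiplicative updates (Frobenius norm).

    Factors a nonnegative (m x n) matrix V into W (m x rank) and
    H (rank x n), both nonnegative, so that V ~ W @ H.
    """
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, rank)) + 1e-3
    H = rng.random((rank, n)) + 1e-3
    eps = 1e-10  # guards against division by zero
    for _ in range(n_iter):
        # Multiplicative updates: factors stay nonnegative by construction.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

In separation applications the columns of W act as a learned spectral dictionary and the rows of H as its activations over time.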
The above prior art has the following deficiencies:
1. When performing unsupervised speaker clustering, hierarchical clustering uses whether the minimum inter-class distance exceeds a certain threshold as the criterion for ending clustering, and that threshold genuinely limits the effectiveness of the algorithm.
2. The GMM-UBM-MAP spectral clustering with a feature similarity matrix proposed in "Research and implementation of unsupervised speaker clustering methods" requires training GMMs of the voice signal and cannot achieve fully unsupervised speaker recognition. In addition, it requires the speaker sections in the voice to be detected to be relatively even in length, demands high "purity" of each speaker section, and adapts poorly to real situations of varied form.
3. The article "Multi-speaker recognition with improved speaker clustering initialization and GMM" clusters MFCC coefficients, which are extracted frame by frame; for longer recordings, such as a 40 min class, the computational load is very large and clustering accuracy cannot be well guaranteed.
4. The article "An Auditory Scene Analysis Approach to Monaural Speech Segregation" performs speech separation based on CASA by simulating the human ear, but the features of the ear model are hard to choose.
5. The article "CNMF-based acoustic features for noise-robust ASR" requires the training voice of the separated speech to be given in advance.
6. Noise remains in single-channel separation results, and the above separation methods rarely denoise the result further to purify the separated voice signal.
Summary of the invention
The technical problem to be solved by the embodiments of the invention is to provide a single-channel, unsupervised target speaker voice extraction method: to use and develop relevant information and intelligent technologies to acquire, analyze, process, and recognize the classroom voice signal, and, by constructing an adaptive, unsupervised intelligent method, to robustly detect and extract the teacher's speech portion from the classroom voice signal.
To solve the above technical problem, an embodiment of the invention provides a single-channel, unsupervised target speaker voice extraction method comprising a teacher-speech detection step and a teacher-speech GGMM (General Gaussian Mixture Model) training step.
The teacher-speech detection step comprises:
S1: recording the classroom to obtain voice data;
S2: preprocessing the voice signal;
S3: voice segmentation and modeling, where the segmentation divides the classroom recording into equal-length segments, then extracts the corresponding MFCC features for each segment and builds a GMM for each segment from those features;
S4: teacher-speech detection, which computes the similarity between the GMM of each segment outside the teacher class and the GGMM, sets an adaptive threshold, and labels segments below the threshold as teacher speech, thereby obtaining the final teacher-speech class.
The teacher-speech GGMM training step comprises:
S5: clustering the voice data obtained in S3 to obtain an initial teacher-speech class, and training a GGMM from that initial class.
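As a minimal sketch of the segmentation in S3 (the embodiment later uses 30 s pieces), a mono recording can be split into equal-length segments; the source does not say whether the final partial segment is kept, so keeping it is an assumption here:

```python
import numpy as np

def split_equal_segments(signal, sr, seg_seconds=30.0):
    """Split a mono recording into equal-length segments (step S3).

    seg_seconds = 30.0 matches the 30 s pieces used in the embodiment;
    keeping the final, possibly shorter, segment is an assumption.
    """
    seg_len = int(seg_seconds * sr)
    n_segments = int(np.ceil(len(signal) / seg_len))
    return [signal[i * seg_len:(i + 1) * seg_len] for i in range(n_segments)]
```

The quantity length(C) used later, the total number of segments of the classroom recording, is then simply the length of the returned list.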
Further, the clustering comprises the following steps:
S51: choose cluster center points;
S52: compute the distance of all samples to the center points and iterate until a preset stop condition is met;
S53: repeat S51 and S52 N times in total to obtain N candidate teacher-voice partitions, and select the most satisfactory partition under a set rule as the initial teacher voice;
S54: select several segments from that partition to train the GGMM, and compute the average in-class distance;
S55: according to the GGMM and the average distance, make a second judgment on the remaining speech segments, adding to the teacher class any sample whose distance is below a set threshold;
S56: output all teacher speech samples and write them to the database.
Further, S51 specifically comprises:
S511: randomly select one of all voice segments as the first center point;
S512: compute the GMM distance between each remaining segment and the first center, and select the segment with the maximum distance as the second center;
S513: successively compute the distance between each unselected segment and the chosen centers, and select the segment farthest from the centers as the next center;
S514: iterate until the number of centers reaches the specified number of classes.
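The center selection of S511-S514 can be sketched as a farthest-point traversal over a precomputed segment-to-segment GMM distance matrix. The source does not specify how "distance from the centers" aggregates once several centers are chosen; taking the distance to the nearest chosen center is an assumption here, and the random first pick of S511 is fixed to index 0 for reproducibility:

```python
import numpy as np

def farthest_point_centers(dist, k):
    """Pick k initial cluster centers from a symmetric distance matrix.

    dist[i, j] is the GMM-to-GMM distance between segments i and j.
    Index 0 stands in for the random first pick of S511; each next
    center is the segment farthest from the centers chosen so far
    (S512-S514, under the nearest-center aggregation assumption).
    """
    centers = [0]
    while len(centers) < k:
        # Distance of each segment to its nearest already-chosen center.
        d_to_centers = dist[:, centers].min(axis=1)
        d_to_centers[centers] = -1.0  # never re-pick a chosen center
        centers.append(int(np.argmax(d_to_centers)))
    return centers
```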
Further, S52 specifically comprises:
S521: compute the distance between each remaining GMM and the centers, and assign each GMM to its nearest center;
S522: update the centers: within each class, take the point with the smallest sum of distances to all points in the class as the new center;
S523: iterate until the preset stop condition is met or the predetermined number of iterations is reached.
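Steps S521-S523 amount to a k-medoids style assign/update loop on the same distance matrix, since the "new center" is the in-class point with the smallest sum of distances rather than a mean (a mean of GMMs is not defined). A minimal sketch, which does not handle the empty-class corner case:

```python
import numpy as np

def gmm_kmedoids(dist, centers, max_iter=20):
    """Assign/update loop of the GMM-Kmeans step (S521-S523).

    Works purely on the precomputed GMM distance matrix: each segment
    goes to its nearest center, and each new center is the member of
    its class with the smallest sum of in-class distances (a medoid).
    """
    centers = list(centers)
    for _ in range(max_iter):
        # S521: assign each segment to its nearest center.
        labels = np.asarray([int(np.argmin([dist[i, c] for c in centers]))
                             for i in range(dist.shape[0])])
        # S522: medoid update inside each class.
        new_centers = []
        for k in range(len(centers)):
            members = np.where(labels == k)[0]
            in_class = dist[np.ix_(members, members)].sum(axis=1)
            new_centers.append(int(members[np.argmin(in_class)]))
        if new_centers == centers:  # converged (S523)
            break
        centers = new_centers
    return labels, centers
```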
Further, S53 specifically comprises: the iterations yield N teacher category vectors; compute their pairwise similarities and take the vector whose summed similarity to the remaining N-1 vectors is maximal as the initial teacher class of the final clustering.
Further, S54 specifically comprises: randomly select a portion of the segments in the teacher class, where M is the number of voice segments the clustering placed in the teacher class; sampling only a portion reduces the time of training the GMM on all segments in the teacher class. N is a constant adapted to the size of M; in its defining formula, α is a time adjustment parameter that tunes the number of segments used for GMM training, length(C) is the total number of segments after the original classroom recording is split, and the coefficient 0.4·length(C) gives the minimum number of teacher voice segments.
Further, the S3 includes:
S31: overlapped speech detection, obtaining the overlapped voice segments in the classroom voice;
S32: judging whether the overlapped voice contains teacher speech;
S33: selecting the voice segments closest to the overlapped voice as training segments;
S34: designing a CNMF+JADE method to perform speech separation.
Further, the S31 includes: obtaining the overlapped voice segments by means of silent points, where silent frames are judged by setting an energy threshold. In the threshold's defining formula, E_i denotes the energy of the i-th speech frame, N is the total number of frames in the segment, r is a constant in the range (0, 1), and ⌈·⌉ denotes rounding up.
Further, the S32 includes: judging whether the overlapped voice contains the teacher by GMM similarity, the similarity being measured with an improved Bhattacharyya distance. The judgment compares disp(A, B), the distance between the GMMs of voice segments A and B, with an adaptive threshold t, where A denotes an overlapped segment and B a teacher segment. In the formula for t, p is an adjustment parameter with value in [0.5, 0.8], K is the number of student voice segments, S_i is the i-th student segment, and B is the teacher segment.
Further, the S33 includes: selecting the non-teacher voice segment closest to the overlapped voice and training the CNMF with it together with the teacher segment, the selection being

v_i = min(disp(A_i, S_j)), i = 1, 2, ..., N, j = 1, 2, ..., K

where A_i is the i-th overlapped segment and v_i the corresponding selected training segment.
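The selection rule v_i = min(disp(A_i, S_j)) reduces to a row-wise argmin over a distance matrix between overlapped segments and student segments. A minimal sketch, assuming the disp values are precomputed:

```python
import numpy as np

def pick_training_segments(disp_overlap_to_student):
    """For each overlapped segment A_i, pick the student segment S_j
    with the smallest model distance: v_i = argmin_j disp(A_i, S_j).

    disp_overlap_to_student is an (N_overlap x K_student) matrix of
    GMM distances; the returned indices select the CNMF training
    segments, one per overlapped segment.
    """
    return np.argmin(disp_overlap_to_student, axis=1)
```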
Implementing the embodiments of the invention has the following beneficial effects: facing the high complexity of classroom teaching (chiefly the diversity of classroom situations, of teachers' subjects, and of classroom organization), the invention proposes an unsupervised, adaptively robust teacher-speech detection and extraction method that effectively improves the adaptability and intelligence of the system in practical applications and lays a foundation for subsequent applications and research.
Brief description of the drawings
Fig. 1 is a schematic diagram of the framework and flow of the invention;
Fig. 2 is a flow diagram of the teacher-speech detection step;
Fig. 3 is a schematic diagram of the teacher-speech GGMM training step;
Fig. 4 is a flow diagram of the clustering algorithm;
Fig. 5 is a schematic diagram of the speech separation implementation steps;
Fig. 6 shows the speech enhancement implementation steps.
Specific embodiment
To make the objects, technical solutions, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings.
Referring to Fig. 1, a single-channel, unsupervised target speaker voice extraction method of the invention includes a teacher-speech detection step and a teacher-speech GGMM training step.
As shown in Fig. 2, teacher-speech detection includes the following steps:
S110, recording;
S120, speech signal preprocessing;
S130, voice segmentation and modeling;
S140, teacher-speech detection.
As shown in Fig. 3, the teacher-voice GGMM training unit includes the following steps:
S110, recording;
S120, speech signal preprocessing;
S130, voice segmentation and modeling;
S240, clustering.
In S110, the corresponding classroom voice data is obtained with a recording device.
In S120, the recorded classroom voice is preprocessed with common speech-preprocessing methods, including framing, windowing, and pre-emphasis.
In S130, the classroom voice is split into equal-length segments; the corresponding MFCC features are then extracted for each segment, and a GMM is built for each segment from those features. The GMM of each segment is then passed as input to the clustering of S240 to obtain the initial teacher-speech class, and a GGMM is trained from that initial class. In S140, the similarity between the GMM of each segment outside the teacher class and the GGMM is computed, an adaptive threshold is set, and segments below the threshold are labeled as teacher speech, giving the final teacher-speech class.
The clustering algorithm of S240 is shown in Fig. 4. Its specific embodiment includes the following steps.
S2401, choosing the initial center points:
1) randomly select one of all voice segments as the first center point;
2) compute the GMM distance between each remaining segment and the first center, and select the segment with the maximum distance as the second center;
3) successively compute the distance between each unselected segment and the chosen centers, and select the segment farthest from the centers as the next center;
4) iterate until the number of centers reaches the specified number of classes.
Compared with random center selection, this center-selection method yields a clear improvement in the accuracy of the final clustering. It may select an outlier as a center and so affect the clustering result, but in practice, owing to the stop condition of the GMM-Kmeans algorithm set in S2402 (3), clusterings with an outlier as a center are excluded during iteration, so choosing initial centers this way gives a stable clustering result.
The above procedure still needs a good measure of the distance between Gaussian mixture models. The dispersion of GMM A relative to GMM B is defined with W_Ai the weight of the i-th mixture component of GMM A, W_Bj the weight of the j-th mixture component of GMM B, and d_AB(i, j) the distance between the i-th Gaussian of GMM A and the j-th Gaussian of GMM B. Considering the computational load and the possibility that several Gaussians share the same mean vector, this embodiment uses the Mahalanobis distance as d_AB(i, j): for two multidimensional Gaussian distributions, μ_1, μ_2 denote their mean vectors and Σ_1, Σ_2 their covariance matrices. For symmetry, the final GMM distance formula symmetrizes the dispersion, with A and B denoting the two GMM models.
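A sketch of this distance under stated assumptions: the source omits the dispersion and Mahalanobis formulas, so this uses the common pooled-covariance Mahalanobis form between components, weights the component distances by W_Ai·W_Bj, and symmetrizes by averaging the two directed dispersions. Each of these three choices is an assumption, not the patent's exact definition:

```python
import numpy as np

def mahalanobis_gauss(mu1, cov1, mu2, cov2):
    """Mahalanobis-style distance between two Gaussian components.

    Assumed form: pooled covariance (Sigma1 + Sigma2) / 2, a common
    choice; the patent's exact formula is not reproduced in the source.
    """
    diff = np.asarray(mu1) - np.asarray(mu2)
    pooled = (np.asarray(cov1) + np.asarray(cov2)) / 2.0
    return float(np.sqrt(diff @ np.linalg.inv(pooled) @ diff))

def gmm_dispersion(weights_a, comps_a, weights_b, comps_b):
    """disp(A|B): component distances averaged with weights W_Ai*W_Bj
    (assumed aggregation; comps are lists of (mean, cov) pairs)."""
    total = 0.0
    for wa, (mu_a, cov_a) in zip(weights_a, comps_a):
        for wb, (mu_b, cov_b) in zip(weights_b, comps_b):
            total += wa * wb * mahalanobis_gauss(mu_a, cov_a, mu_b, cov_b)
    return total

def gmm_distance(weights_a, comps_a, weights_b, comps_b):
    """Symmetrized metric: average of the two directed dispersions."""
    d_ab = gmm_dispersion(weights_a, comps_a, weights_b, comps_b)
    d_ba = gmm_dispersion(weights_b, comps_b, weights_a, comps_a)
    return 0.5 * (d_ab + d_ba)
```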
S2402, compute the distance of all samples to the centers and iterate until the preset stop condition is met:
1) compute the distance between each remaining GMM and the centers, and assign each GMM to its nearest center;
2) update the centers: within each class, take the point with the smallest sum of distances to all points in the class as the new center;
3) iterate until the preset stop condition is met (output when, in the obtained clustering, the class with the most voice segments contains more than 40% of the total segments and more segments than the second-largest class) or the predetermined number of iterations is reached.
S2403, repeat S2401 and S2402 N times in total to obtain N candidate teacher-voice partitions, and select the most satisfactory partition under a set rule as the initial teacher voice.
The S2403 iterations yield N teacher category vectors; their pairwise similarities are computed, and the vector whose summed similarity to the remaining N-1 vectors is maximal is taken as the initial teacher class of the final clustering. Because the N teacher category vectors are not of uniform length, they must be processed to equal length before similarity is computed.
In this embodiment, the vectors are made equal in length by zero padding: the longest of the N teacher category vectors is denoted M, all vectors are extended to length M, and the shortfall is filled with 0 elements, that is:

M = max(length(T_1), length(T_2), ..., length(T_N))
T_i = [T_i, Append_i], i = 1, 2, ..., N
Append_i = zeros(1, M - length(T_i)), i = 1, 2, ..., N

where T_1, T_2, ..., T_N are the N teacher category vectors, M is the longest vector length, length(T) gives the length of vector T, Append_i is the all-zero vector appended to the i-th teacher category vector, and zeros(i, j) forms a zero vector of i rows and j columns.
In this embodiment, zero padding gives the teacher category vectors a unified length, after which the pairwise distances between vectors are computed. Because 0 elements have been artificially added, distance-based similarity measures such as the Euclidean distance would carry a large error; therefore the cosine similarity is chosen to measure the similarity between vectors. Cosine similarity expresses the similarity of two vectors by the cosine of the angle between them in the vector space: the closer the cosine is to 1, the closer the angle is to 0 degrees and the more similar the vectors. With a = (a_1, a_2, ..., a_N) and b = (b_1, b_2, ..., b_N) each an N-dimensional vector, the cosine similarity between a and b is defined as

cos(a, b) = (a · b) / (|a| |b|)
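The zero padding and cosine comparison above, together with the S53 rule of keeping the run most similar to the other N-1 runs, can be sketched as:

```python
import numpy as np

def pad_to_max(vectors):
    """Zero-pad all teacher category vectors to the longest length M:
    T_i = [T_i, Append_i], Append_i = zeros(1, M - length(T_i))."""
    M = max(len(v) for v in vectors)
    return [np.concatenate([np.asarray(v, float), np.zeros(M - len(v))])
            for v in vectors]

def cosine_similarity(a, b):
    """cos(a, b) = a.b / (|a||b|); closer to 1 means more similar."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_run(vectors):
    """Keep the run whose vector has the largest summed cosine
    similarity to the other N-1 runs (rule S53)."""
    padded = pad_to_max(vectors)
    scores = [sum(cosine_similarity(padded[i], padded[j])
                  for j in range(len(padded)) if j != i)
              for i in range(len(padded))]
    return int(np.argmax(scores))
```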
In S2404, segments are randomly selected from the teacher class, where M is the number of segments the clustering placed in the teacher class; random sampling reduces the time of training the GMM on all segments of the teacher class. N is a constant adapted to the size of M; in its defining formula, α is a time adjustment parameter that tunes the number of segments used for GMM training, and this embodiment takes α = 2. length(C) is the total number of segments after the original classroom recording is split into 30 s pieces, and the coefficient 0.4·length(C) gives the minimum number of teacher voice segments. The formula expresses that the larger the teacher class obtained by clustering, the smaller the proportion of it used for GMM training, so that the number of segments used for GMM training tends to be similar across different recordings.
The similarity threshold is set to S/γ, where S is the mean in-class similarity of the teacher-class segments and γ is an adaptive adjustment parameter that guarantees the integrity of the teacher class to the greatest extent. In its defining formula, β is an adjustment parameter in the range [0, 1], and this embodiment takes β = 1/5. S_max and S_min denote the maximum and minimum in-class similarity of the teacher class, length(C) is the total number of 30 s segments of the original classroom recording, and M is the number of segments in the teacher class. The formula expresses that the larger M is, the larger γ and hence the smaller the similarity threshold; and when the range of in-class similarity is larger, a smaller threshold is taken, so that the decision on whether the remaining segments are teacher speech is more accurate.
Through the GMM-Kmeans processing, a relatively stable teacher category vector is finally obtained. Compared in tests with manually divided classes, the obtained teacher class has high similarity to the manually labeled one; and compared with the result of clustering directly with an improved K-means, the GMM-Kmeans algorithm used in this embodiment improves clustering accuracy markedly.
After the teacher class is obtained, the silence and overlapped-speech portions are judged. Because the student class has no specific features and the number of students is unknown, the student class cannot be detected first. This embodiment detects the teacher class preferentially; after excluding the three parts above (teacher, silence, and overlapped-speech segments), the remaining segments are labeled as the student-speech class.
Referring to Fig. 5, the specific speech separation steps are:
S310, overlapped speech detection, obtaining the overlapped segments in the classroom voice;
S320, judging whether the overlapped voice contains teacher speech;
S330, selecting the voice segments closest to the overlapped voice as training segments;
S340, designing a CNMF+JADE method to perform speech separation.
In S310, the overlapped voice segments are obtained from silent points. It has been found that silent frames have lower energy than non-silent frames, so silent frames can be judged by setting an energy threshold. The energy threshold is defined as follows:
Wherein E_i denotes the energy of the i-th speech frame, N is the total number of frames in the voice segment, r is a constant in the range (0, 1), and ⌈·⌉ denotes rounding up.
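The threshold formula itself appears only as a figure in the original, so the sketch below assumes the threshold is r times the mean frame energy, with E_i, N and r as defined above; the function name and the exact threshold form are illustrative assumptions.

```python
import numpy as np

def silent_frame_mask(frames, r=0.3):
    """Flag silent frames by short-time energy. Assumption: the energy
    threshold (given only as an image in the source) is taken as
    r * mean frame energy, with r in (0, 1) as the text requires."""
    frames = np.asarray(frames, dtype=float)
    energy = np.sum(frames ** 2, axis=1)   # E_i: energy of the i-th frame
    threshold = r * energy.mean()
    return energy < threshold              # True where the frame is judged silent
```

A frame matrix of shape (N, samples_per_frame) goes in; a boolean mask of the N frames comes out, from which the number of silent frames per segment can be counted.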
Overlapped speech means that two or more people speak simultaneously in one segment of speech. In a real classroom, overlapped speech mainly appears when: (1) students hold group discussions; (2) several students answer the teacher's question at the same time. Overlapped voice segments behave differently from silent segments in terms of their silent frames. It has been found that, within a voice segment, the longer the silent duration, the lower the probability that the segment contains overlapped speech [56]. In connection with the problem handled in this embodiment, the potentially overlapped speech class can therefore be determined from the number of silent frames. The method for obtaining the potentially overlapped speech class is similar to that for obtaining the potentially silent class, as follows:
ClassOfOverlap_i = I(numberOfSilence_i < Threshold_o), i = 1, 2, ..., N
Wherein α′ is a constant used to obtain the overlapped-speech decision threshold Threshold_o; this embodiment takes α′ = 0.6. Segments whose number of silent frames is less than the threshold Threshold_o are regarded as potentially overlapped voice segments, from which the potentially overlapped speech class is obtained.
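The rule above can be sketched directly. The exact formula for Threshold_o is not reproduced in the text (only the constant α′ = 0.6 is given), so as an assumption the sketch takes Threshold_o as α′ times the mean silent-frame count over all segments.

```python
import numpy as np

def overlap_candidates(num_silent, alpha=0.6):
    """Mark potentially-overlapped segments: indicator
    I(numberOfSilence_i < Threshold_o). Assumption: Threshold_o is
    alpha' * mean silent-frame count (the source gives only alpha' = 0.6)."""
    num_silent = np.asarray(num_silent, dtype=float)
    threshold_o = alpha * num_silent.mean()
    return num_silent < threshold_o
```

Segments with few silent frames (speech with almost no pauses, typical of simultaneous speakers) are flagged; quiet, well-paused segments are not.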
S320 and S330 together form the speech separation front-end processing, which has two purposes: judging whether the overlapped speech contains the target speaker, and finding the voice segments (other than the target speaker's) closest to the overlapped speech to serve as CNMF training data. The present invention judges whether the overlapped speech contains the teacher based on GMM similarity. The similarity is calculated with an improved Bhattacharyya distance, and the decision rule is as follows:
Wherein disp(A, B) denotes the distance between the GMM models of voice segments A and B; A denotes an overlapped voice segment and B the teacher voice segment. t is an adaptive threshold, calculated as follows:
Wherein p is an adjustment parameter with a value in [0.5, 0.8], K is the number of student voice segments, S_i is the i-th student voice segment, and B is the teacher voice segment. An adaptive threshold computed from the student segments thus decides whether the overlapped speech contains the teacher.
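The closed form of disp(·,·) between full GMMs is not spelled out in the text; as a simplified stand-in, the sketch below uses the classical Bhattacharyya distance between two single Gaussians, and assumes the adaptive threshold t is p times the mean student-to-teacher distance (the actual threshold formula appears only as a figure). Both are labelled assumptions in the comments.

```python
import numpy as np

def bhattacharyya_gauss(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two Gaussians -- a single-component
    stand-in for the GMM-to-GMM distance disp(A, B) used in the text."""
    mu1, mu2 = np.asarray(mu1, float), np.asarray(mu2, float)
    cov1 = np.atleast_2d(cov1).astype(float)
    cov2 = np.atleast_2d(cov2).astype(float)
    cov = (cov1 + cov2) / 2.0
    diff = mu1 - mu2
    term1 = 0.125 * diff @ np.linalg.solve(cov, diff)
    term2 = 0.5 * np.log(np.linalg.det(cov) /
                         np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return term1 + term2

def contains_teacher(d_overlap_teacher, d_students_teacher, p=0.65):
    """Decide whether an overlapped segment contains the teacher.
    Assumption: t = p * mean student-to-teacher distance, with the
    adjustment parameter p in [0.5, 0.8] as the text states."""
    t = p * float(np.mean(d_students_teacher))
    return d_overlap_teacher < t
```

If the overlap-to-teacher distance falls below the student-derived threshold, the overlapped segment is judged to contain the teacher.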
The second task of the separation front end is to select the non-teacher voice segments closest to the overlapped voice segments for CNMF training; this step has a large impact on the subsequent speech separation. The present invention trains CNMF with the teacher voice segments together with the non-teacher voice segments closest to the overlapped speech, selected as follows:
v_i = argmin_j disp(A_i, S_j), i = 1, 2, ..., N, j = 1, 2, ..., K
Wherein A_i is the i-th overlapped voice segment and v_i is the corresponding selected training voice segment.
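Given an N × K matrix of precomputed distances disp(A_i, S_j), the selection is a row-wise argmin; a minimal sketch (the matrix layout is an assumption for illustration):

```python
import numpy as np

def pick_training_segments(dist_overlap_student):
    """For each overlapped segment A_i, pick the non-teacher segment S_j
    minimising disp(A_i, S_j): v_i = argmin_j disp(A_i, S_j).
    `dist_overlap_student` is an N x K matrix of precomputed distances."""
    return np.argmin(np.asarray(dist_overlap_student), axis=1)
```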
In S340, speech separation is performed on the overlapped speech containing the teacher. The present invention proposes a single-channel speech separation method fusing CNMF and JADE, in which JADE performs a secondary separation on the speech signal already separated by CNMF. The CNMF+JADE algorithm aims to obtain the separated speech signal of every speaker in the single-channel mixed speech; its steps are as follows:
Input: clean speech of the speakers to be separated t_1, t_2, ..., t_N; mixed speech to be trained o_1, o_2, ..., o_{N-1}; mixed speech to be separated O.
Output: separated speaker speech s_1, s_2, ..., s_N.
Step 1: select the target speaker t_1 and the corresponding mixed speech o_1 to train CNMF.
Step 2: separate the mixed speech O with CNMF, obtaining the estimated target speech and the estimated residual mixture.
Step 3: generate a random matrix R_1 and mix the two CNMF outputs to form the two-channel signal S_1.
Step 4: separate S_1 with JADE, obtaining s_1 and O_1.
Step 5: take O_1 as the mixed speech to be separated, t_2, ..., t_N as the clean speech of the speakers to be separated and o_2, ..., o_{N-1} as the mixed speech to be trained, and repeat Steps 1–5.
Step 6: obtain the separated speech s_1, s_2, ..., s_N.
In the above algorithm, t_1, t_2, ..., t_N denote the clean speech of the speakers contained in the mixed speech O, and N denotes the number of speakers in O. o_1, o_2, ..., o_{N-1} are the mixed speech obtained by successively removing the corresponding speaker from O, expressed as follows:
In realistic situations it is extremely difficult to obtain o_1, o_2, ..., o_{N-1}; therefore a randomly chosen non-target speaker's voice in the current mixed speech can be used as a substitute to train CNMF. Experiments verify that this substitution degrades performance slightly compared with the original CNMF+JADE, but it generalises the CNMF+JADE algorithm to more general situations.
The two-channel speech signal in Step 3 is generated as follows:
Wherein R_i is a 2 × 2 matrix.
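Step 3 can be sketched directly: stack the two CNMF outputs and mix them with a random 2 × 2 matrix R_i, producing the pseudo-two-channel signal that JADE then separates. JADE itself is not implemented here; the function name and random-matrix distribution are illustrative assumptions.

```python
import numpy as np

def make_two_channel(t_hat, o_hat, seed=0):
    """Step 3 of CNMF+JADE: mix the two CNMF outputs (estimated target
    speech and estimated residual mixture) with a random 2x2 matrix R_i
    to form a two-channel signal S_i for JADE. The uniform distribution
    of R_i is an assumption; the text only says 'random matrix'."""
    rng = np.random.default_rng(seed)
    R = rng.random((2, 2))                 # random 2x2 mixing matrix R_i
    sources = np.vstack([t_hat, o_hat])    # 2 x T source matrix
    return R @ sources                     # 2 x T two-channel signal S_i
```

Feeding an artificially re-mixed pair to JADE is what turns the single-channel problem into one a two-channel blind source separation algorithm can attack.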
As shown in Fig. 6, the specific speech enhancement steps are as follows:
S410: the speech enhancement data are the teacher speech after speech separation;
S420: adaptively judge the teacher speech after separation and choose suitable voice segments for enhancement;
S430: perform speech enhancement using the wavelet transform.
The wavelet transform has been a research hotspot in speech processing in recent years. Compared with traditional frequency-domain analysis methods such as the Fourier transform, the wavelet transform can describe the time and frequency behaviour of a signal simultaneously; it is a time-frequency analysis method featuring multi-resolution analysis, time-frequency localisation and flexible choice of the wavelet function. Its principle is explained below.
Let L²(R) be the space of square-integrable functions and ψ(t) ∈ L²(R). If its Fourier transform ψ̂(ω) satisfies the admissibility condition
∫_R |ψ̂(ω)|² / |ω| dω < ∞,
then ψ(t) is called a wavelet, or mother wavelet.
Scaling and translating the mother wavelet ψ(t) by a real pair (a, b), with a, b ∈ R and a ≠ 0, yields the family of functions
ψ_{a,b}(t) = |a|^(−1/2) ψ((t − b)/a).
This family is called the wavelet basis; a is the scale factor and b the shift factor, and ψ_{a,b} acts as a window function whose size is fixed but whose shape varies. This property gives the wavelet transform its multi-resolution character. The factor |a|^(−1/2) is the normalisation factor, whose role is to give the wavelet the same energy at different scales.
Signal processing in the wavelet domain is one of the main current means of speech signal processing. The multi-resolution, low-entropy and decorrelation properties of the wavelet transform give it great advantages in speech processing, and the large number of available wavelet bases can cope with different scenarios, so the wavelet transform is well suited to speech signal processing.
When using the wavelet transform for speech enhancement, the multi-resolution property is exploited: according to the different features that the wavelet coefficients of noise and of speech exhibit at different scales, corresponding rules are formulated to process the wavelet coefficients of the noisy signal.
The key steps of wavelet denoising are as follows:
Step 1: apply the wavelet transform to the noisy signal;
Step 2: denoise the wavelet coefficients at each scale;
Step 3: apply the inverse wavelet transform to the processed coefficients to obtain the enhanced reconstructed signal.
Wavelet denoising methods can be roughly divided into three types: denoising based on the wavelet modulus maxima principle; denoising using the correlation of wavelet coefficients across scales; and wavelet threshold denoising. This embodiment mainly uses the third type, wavelet threshold denoising.
Wavelet threshold denoising is one of the more commonly used denoising methods. Its basic process is as follows:
Step 1: select a suitable wavelet basis for the signal to be processed, determine a reasonable decomposition level and perform a multi-level decomposition of the noisy speech signal;
Step 2: select a suitable threshold for the decomposed wavelet coefficients at each scale and quantise them;
Step 3: perform wavelet reconstruction from the thresholded coefficients to obtain the enhanced speech signal.
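The three steps above can be sketched end to end. The sketch uses a plain Haar wavelet rather than whichever basis the embodiment actually selects, and a fixed soft threshold; both are simplifying assumptions for illustration.

```python
import numpy as np

def haar_denoise(x, lam, levels=2):
    """Minimal sketch of the three-step wavelet-threshold denoising
    pipeline with a Haar transform: decompose, soft-threshold the detail
    coefficients, reconstruct. Assumes len(x) is divisible by 2**levels."""
    s2 = np.sqrt(2.0)
    details, a = [], np.asarray(x, float)
    for _ in range(levels):                  # Step 1: forward Haar transform
        d = (a[0::2] - a[1::2]) / s2
        a = (a[0::2] + a[1::2]) / s2
        details.append(d)
    details = [np.sign(d) * np.maximum(np.abs(d) - lam, 0.0)  # Step 2: soft threshold
               for d in details]
    for d in reversed(details):              # Step 3: inverse transform
        out = np.empty(2 * len(a))
        out[0::2] = (a + d) / s2
        out[1::2] = (a - d) / s2
        a = out
    return a
```

With lam = 0 the pipeline reduces to forward plus inverse transform and reconstructs the input exactly, a useful sanity check before any thresholding is applied.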
The diversity of wavelet bases is one of the advantages of the wavelet transform for time-frequency analysis, so choosing a suitable wavelet function is essential. Studies have shown that, for speech signal processing, handling the transient variations of the speech signal favours wavelet basis functions with good smoothness and symmetry and lower vanishing moments.
The wavelet decomposition level has always drawn attention as a factor influencing the denoising effect of speech enhancement algorithms. As the decomposition level increases, the detail of the speech and noise signals becomes clearer, which favours denoising; but with further increases the speech energy becomes increasingly dispersed, causing distortion, and the algorithm also runs more and more slowly, while too few levels leave the signal and noise aliased so that the noise cannot be isolated. Through extensive experimental analysis, researchers have determined the most reasonable decomposition level, expressed in terms of the data length N, where ⌊·⌋ denotes rounding down.
In wavelet threshold denoising, the estimation of the threshold is one of the important factors determining the denoising effect. The wavelet transform decomposes the noisy speech signal into a high-frequency detail part and a low-frequency approximation part. The frequency of noise is usually higher, so noise energy concentrates mainly in the high-frequency wavelet coefficients, while speech energy concentrates mainly in the low-frequency part of the signal. Denoising can therefore be performed by setting a threshold and truncating the noise components below it; this threshold is exactly what wavelet threshold denoising studies.
Classical methods for selecting the wavelet denoising threshold include the following.
Uniform (universal) threshold method.
The uniform threshold estimate is derived from the minimum mean-square-error criterion and may be expressed as
λ = σ_n √(2 ln N)
where σ_n is the standard deviation of the noise and N is the signal length. The noise deviation is obtained by
σ_n = M_j / 0.6745
where M_j is the absolute median of the wavelet coefficients at each decomposition level and 0.6745 is an empirical value.
This method is simple to implement and works well for filtering Gaussian white noise, but since it depends on the speech length, the effect deteriorates when the amount of data is very large.
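The two formulas above combine into a few lines. The universal-threshold form λ = σ√(2 ln N) is the standard one consistent with the description; the σ_n = M_j/0.6745 estimator follows the text.

```python
import numpy as np

def universal_threshold(detail_coeffs, n):
    """Uniform (universal) threshold: lambda = sigma * sqrt(2 ln N),
    with the noise level estimated from the median absolute detail
    coefficient, sigma = M_j / 0.6745, as in the text."""
    sigma = np.median(np.abs(np.asarray(detail_coeffs, float))) / 0.6745
    return sigma * np.sqrt(2.0 * np.log(n))
```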
SUREShrink threshold [73]
SUREShrink threshold estimation is an adaptive threshold selection method and is an unbiased estimate of the optimal threshold. The threshold selection is defined by Stein's unbiased risk estimate,
SURE(t; Y) = N − 2·#{i : |Y_i| ≤ t} + Σ_i min(Y_i², t²),
and the threshold estimate is the value that minimises this risk function:
t* = argmin_t SURE(t; Y),
where N is the signal length and #{i : |Y_i| ≤ t} denotes the number of elements in the set {Y_i : |Y_i| ≤ t}.
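The risk function above is piecewise between the sorted |Y_i|, so it suffices to evaluate it at the candidate thresholds t = |Y_i| and take the minimiser; a sketch of that standard form:

```python
import numpy as np

def sure_threshold(coeffs):
    """SUREShrink sketch: pick the threshold minimising
    SURE(t) = N - 2*#{|Y_i| <= t} + sum_i min(Y_i^2, t^2),
    evaluated at the candidate thresholds t = |Y_i|."""
    y = np.abs(np.asarray(coeffs, float))
    n = len(y)
    candidates = np.sort(y)
    risks = [n - 2 * np.sum(y <= t) + np.sum(np.minimum(y ** 2, t ** 2))
             for t in candidates]
    return candidates[int(np.argmin(risks))]
```

For coefficients [0.1, 0.2, 5.0] the minimiser is 0.2: the two small (noise-like) coefficients fall below the threshold while the large one survives.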
Minimax threshold
The minimax threshold, also called the minimum-maximum threshold, yields an extremum of the minimum mean-square error. It is calculated as
λ = σ_n (0.3936 + 0.1829 log₂ N) for N > 32, and λ = 0 otherwise,
where N is the signal length.
In addition, the threshold function, like the threshold estimate, plays a vital role in the wavelet threshold denoising algorithm. Common threshold functions are as follows.
Hard threshold function
The hard threshold function is
ŵ_{j,k} = w_{j,k} if |w_{j,k}| ≥ λ, and ŵ_{j,k} = 0 otherwise,
where ŵ_{j,k} is the estimated wavelet coefficient, w_{j,k} the decomposed wavelet coefficient and λ the denoising threshold. The principle of hard-threshold denoising is to compare w_{j,k} with λ: coefficients below λ are zeroed and coefficients above λ are retained. Such processing may introduce oscillations into the reconstructed signal and degrade the denoising effect.
Soft threshold function
To eliminate the drawbacks of hard-threshold denoising, soft-threshold denoising was introduced, of the form
ŵ_{j,k} = sgn(w_{j,k})(|w_{j,k}| − λ) if |w_{j,k}| ≥ λ, and ŵ_{j,k} = 0 otherwise.
Compared with the hard threshold function, the soft threshold function improves the smoothness of the speech signal, but it likewise loses some features and causes a degree of distortion [74].
Semi-soft threshold function
To overcome the defects of the soft and hard threshold functions, scholars proposed the semi-soft threshold function:
ŵ_{j,k} = 0 if |w_{j,k}| ≤ λ₁; ŵ_{j,k} = sgn(w_{j,k})·λ₂(|w_{j,k}| − λ₁)/(λ₂ − λ₁) if λ₁ < |w_{j,k}| ≤ λ₂; ŵ_{j,k} = w_{j,k} if |w_{j,k}| > λ₂,
where λ₁ and λ₂ are the lower and upper thresholds respectively, with 0 < λ₁ < λ₂ and an empirical relation between them. The value of λ₁ is related to the speech: when unvoiced sounds dominate, λ₁ is smaller; when voiced sounds dominate, λ₁ is larger. By adjusting λ₁ and λ₂ this method combines the advantages of the soft and hard thresholds, but the two parameters increase the computational complexity of the algorithm.
Garrote threshold function
The garrote threshold function is
ŵ_{j,k} = w_{j,k} − λ²/w_{j,k} if |w_{j,k}| > λ, and ŵ_{j,k} = 0 otherwise.
This function incorporates the threshold into the threshold function itself, dynamically shrinking the wavelet coefficients that exceed the selected threshold.
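The four threshold functions named above have standard textbook forms; since the original formula images are not reproduced in the text, the sketch below implements those standard forms as an assumption.

```python
import numpy as np

def hard_threshold(w, lam):
    """Hard: keep coefficients with |w| >= lambda, zero the rest."""
    return np.where(np.abs(w) >= lam, w, 0.0)

def soft_threshold(w, lam):
    """Soft: shrink surviving coefficients toward zero by lambda."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def semisoft_threshold(w, lam1, lam2):
    """Semi-soft: zero below lam1, keep above lam2, interpolate between."""
    aw = np.abs(w)
    mid = np.sign(w) * lam2 * (aw - lam1) / (lam2 - lam1)
    return np.where(aw <= lam1, 0.0, np.where(aw > lam2, w, mid))

def garrote_threshold(w, lam):
    """Garrote: w - lambda^2 / w for |w| > lambda, else zero."""
    w = np.asarray(w, float)
    safe = np.where(w == 0, 1.0, w)           # avoid division by zero
    return np.where(np.abs(w) > lam, w - lam ** 2 / safe, 0.0)
```

Comparing them on the same coefficient shows the trade-off the text describes: hard keeps the value unchanged, soft shrinks it by λ, semi-soft and garrote sit in between.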
The present invention designs an adaptive method that analyses, before speech enhancement, the wavelet-transformed speech signal separated by CNMF+JADE, in the hope of performing speech enhancement selectively, i.e. automatically filtering out, before enhancement, those voice segments whose quality would decline after enhancement. Analysis of the separated signals and of the sound effect after the wavelet transform shows that when the distance between the separated voices is large, the wavelet-transform enhancement effect declines. Based on this finding, the present invention applies the following adaptive judgement before speech enhancement, for i = 1, 2, ..., N:
O_{i−1} = O_i + s_i + l
O_0 = O
O_N = s_N
Wherein s_i denotes the i-th teacher speech signal after CNMF+JADE decomposition, and O_i denotes the mixed speech signal after O_{i−1} has had s_i separated out by CNMF+JADE; the corresponding GMMs of s_i and O_i are used in the distance computation. l denotes the loss in the separation process, N the number of speakers contained in the mixed signal, disp(·) the GMM distance formula given above, and p a scale factor with value in [1, 1.2].
1. Based on the complex situation of classroom teaching, the present invention designs a teacher speech extraction method and derives the application category of the information-based classroom. It is not only an important component of the smart classroom (artificial intelligence + education) but a completely new embodiment of future education. To our knowledge, research of this type is currently scarce, and essentially no usable framework or theory has yet formed. The present invention takes a major step in smart classroom research and opens a new prospect for teaching methodology based on artificial intelligence.
2. The present invention identifies and extracts classroom teacher speech in a single-channel, adaptive, unsupervised manner. Compared with existing methods it requires no prior knowledge, and it adapts well to classroom speech of different forms, different lengths and different classroom environments. Moreover, the proposed method applies not only to classroom teaching but also to fields such as meetings, hearing aids and communications (for example, combining speech separation with a hearing aid gives the hearing aid more powerful signal processing and improves its voice quality; in mobile communications, applying speech separation at the device end suppresses non-target speakers and improves speech quality and intelligibility).
3. The present invention designs and implements an improved GMM-Kmeans clustering method that clusters with GMM models as features, preserving the original features to the greatest extent and improving clustering accuracy. Using GMMs as features and computing distances between them avoids directly processing deep levels of the speech signal, shortening the algorithm processing time and generally realising classroom speech recognition that is both accurate and fast.
4. On the basis of the GMM-Kmeans clustering algorithm, the influence of the environment is considered: based on the clustering result, suitable voice segments are adaptively chosen to construct the GGMM model, the similarity threshold is obtained adaptively, and teacher utterances are detected a second time, yielding an accurate teacher speech class. All thresholds are obtained adaptively from the classroom speech data by the designed formulas, without manual intervention, so the algorithm is highly robust to different classroom environments and classroom situations.
5. The present invention designs and implements a CNMF+JADE speech separation algorithm that applies JADE to the CNMF separation result for a secondary speech separation, effectively improving the separation result.
6. The present invention designs and implements an adaptive wavelet-transform speech enhancement method that adaptively judges the speech after CNMF+JADE separation and filters out the voice segments unsuited to further enhancement, denoising the speech signal purposefully.
The above discloses only a preferred embodiment of the present invention, which of course cannot limit the scope of the claims; therefore equivalent changes made in accordance with the claims of the present invention still fall within the scope of the present invention.
Claims (10)
1. A single-channel, unsupervised target speaker speech extraction method, characterized by comprising a teacher utterance detection step and a teacher utterance GGMM model training step;
the teacher utterance detection step comprises the following steps:
S1: recording the classroom to obtain voice data;
S2: performing speech signal processing;
S3: speech segmentation and modeling, the speech segmentation comprising dividing the classroom speech into segments of equal length, then extracting the corresponding MFCC features for each voice segment and constructing the GMM model of each voice segment based on the MFCC features;
S4: teacher speech detection, performing a similarity calculation between the GMM model of each voice segment outside the teacher utterance class and the GGMM, setting an adaptive threshold, and labelling the segments below the threshold as the class, thereby obtaining the final teacher utterance class;
the teacher utterance GGMM model training step comprises the following steps:
S5: performing clustering processing on the voice data obtained in S3, obtaining the initial teacher utterance class, and extracting the GGMM model based on the initial teacher utterance class.
2. The single-channel, unsupervised target speaker speech extraction method according to claim 1, characterized in that the clustering processing comprises the following steps:
S51: choosing cluster centre points;
S52: calculating the distances between all samples and the centre points, and iterating until a preset halt condition is met;
S53: executing steps S51 and S52 in a loop N times in total to obtain N teacher-speech partitions, and selecting the most satisfactory partition according to a set rule as the initial teacher speech;
S54: selecting several segments from the partition to train the GGMM model, and calculating the average within-class distance;
S55: according to the GGMM and the average distance, making a secondary judgement on the remaining speech sample segments: if the distance is less than a set threshold, the sample is added to the teacher class;
S56: outputting all teacher speech samples and writing them to the database.
3. The single-channel, unsupervised target speaker speech extraction method according to claim 2, characterized in that step S51 specifically comprises:
S511: randomly selecting one of all voice segments as the first centre point;
S512: calculating the GMM model distances between the remaining voice segments and the first centre point, and selecting the voice segment with the maximum distance as the second centre point;
S513: successively calculating the distances between the unselected voice segments and the centre points, and selecting the segment farthest from the centre points as the next centre point;
S514: iterating until the number of centre points reaches the specified number of classes.
4. The single-channel, unsupervised target speaker speech extraction method according to claim 3, characterized in that step S52 specifically comprises:
S521: calculating the distances between the remaining GMM models and the centre points, and assigning each GMM to the nearest centre point;
S522: updating the centre points, taking in each class the point with the smallest sum of distances to all points in the class as the new centre point;
S523: iterating until a preset stop condition is met or a predetermined number of iterations is reached.
5. The single-channel, unsupervised target speaker speech extraction method according to claim 4, characterized in that step S53 specifically comprises: iterating to obtain N teacher class vectors, performing similarity calculation, and taking the vector with the maximum sum of similarities to the remaining N−1 vectors as the initial teacher class of the final clustering.
6. The single-channel, unsupervised target speaker speech extraction method according to claim 5, characterized in that step S54 specifically comprises: randomly selecting segments from the teacher class, wherein M is the number of voice segments in the teacher class obtained by clustering; the purpose of the random selection is to reduce the time of GMM model training over all voice segments in the teacher class; N is a constant obtained adaptively from the size of M, obtained as follows:
wherein α is a time adjustment parameter for adjusting the number of voice segments used for GMM training, length(C) denotes the total number of voice segments obtained after the original classroom speech is segmented, and the coefficient 0.4·length(C) denotes the minimum number of teacher voice segments.
7. The single-channel, unsupervised target speaker speech extraction method according to claim 1, characterized in that S3 comprises:
S31: overlapped speech detection, obtaining the overlapped voice segments in the classroom speech;
S32: judging whether the overlapped speech contains the teacher's voice;
S33: selecting the voice segments closest to the overlapped speech as training voice segments;
S34: performing speech separation with the designed CNMF+JADE method.
8. The single-channel, unsupervised target speaker speech extraction method according to claim 7, characterized in that S31 comprises:
obtaining the overlapped voice segments using silent points, and judging silent frames by setting an energy threshold, the energy threshold being obtained as follows:
wherein E_i denotes the energy of the i-th speech frame, N is the total number of frames in the voice segment, r is a constant in the range (0, 1), and ⌈·⌉ denotes rounding up.
9. The single-channel, unsupervised target speaker speech extraction method according to claim 8, characterized in that S32 comprises: judging whether the overlapped speech contains the teacher using GMM similarity, the similarity being obtained with an improved Bhattacharyya distance, the decision rule being as follows:
wherein disp(A, B) denotes the distance between the GMM models of voice segments A and B, A denotes an overlapped voice segment, B is the teacher voice segment, and t is an adaptive threshold calculated as follows:
wherein p is an adjustment parameter with a value in [0.5, 0.8], K is the number of student voice segments, S_i is the i-th student voice segment, and B is the teacher voice segment.
10. The single-channel, unsupervised target speaker speech extraction method according to claim 9, characterized in that S33 comprises: selecting the non-teacher voice segments closest to the overlapped speech to train CNMF together with the teacher voice segments, the selection being:
v_i = argmin_j disp(A_i, S_j), i = 1, 2, ..., N, j = 1, 2, ..., K
wherein A_i is the i-th overlapped voice segment and v_i is the corresponding selected training voice segment.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810832080.5A CN108962229B (en) | 2018-07-26 | 2018-07-26 | Single-channel and unsupervised target speaker voice extraction method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108962229A true CN108962229A (en) | 2018-12-07 |
CN108962229B CN108962229B (en) | 2020-11-13 |
Family
ID=64464209
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810832080.5A Active CN108962229B (en) | 2018-07-26 | 2018-07-26 | Single-channel and unsupervised target speaker voice extraction method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108962229B (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050125223A1 (en) * | 2003-12-05 | 2005-06-09 | Ajay Divakaran | Audio-visual highlights detection using coupled hidden markov models |
CN101866421A (en) * | 2010-01-08 | 2010-10-20 | 苏州市职业大学 | Method for extracting characteristic of natural image based on dispersion-constrained non-negative sparse coding |
CN102568477A (en) * | 2010-12-29 | 2012-07-11 | 盛乐信息技术(上海)有限公司 | Semi-supervised pronunciation model modeling system and method |
CN102682760A (en) * | 2011-03-07 | 2012-09-19 | 株式会社理光 | Overlapped voice detection method and system |
CN103680517A (en) * | 2013-11-20 | 2014-03-26 | 华为技术有限公司 | Method, device and equipment for processing audio signals |
CN103854644A (en) * | 2012-12-05 | 2014-06-11 | 中国传媒大学 | Automatic duplicating method and device for single track polyphonic music signals |
CN104167208A (en) * | 2014-08-08 | 2014-11-26 | 中国科学院深圳先进技术研究院 | Speaker recognition method and device |
CN105096955A (en) * | 2015-09-06 | 2015-11-25 | 广东外语外贸大学 | Speaker rapid identification method and system based on growing and clustering algorithm of models |
CN105957537A (en) * | 2016-06-20 | 2016-09-21 | 安徽大学 | Voice denoising method and system based on L1/2 sparse constraint convolution non-negative matrix decomposition |
JP2018502319A (en) * | 2015-07-07 | 2018-01-25 | 三菱電機株式会社 | Method for distinguishing one or more components of a signal |
Non-Patent Citations (2)
Title |
---|
COLIN VAZ: "CNMF-based acoustic features for noise-robust ASR", 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) * |
LI Hao: "Single-channel speech separation based on deep learning", China Master's Theses Full-text Database, Information Science and Technology * |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109948144A (en) * | 2019-01-29 | 2019-06-28 | 汕头大学 | A method of the Teachers ' Talk Intelligent treatment based on classroom instruction situation |
CN109948144B (en) * | 2019-01-29 | 2022-12-06 | 汕头大学 | Teacher utterance intelligent processing method based on classroom teaching situation |
CN110544481A (en) * | 2019-08-27 | 2019-12-06 | 华中师范大学 | S-T classification method and device based on voiceprint recognition and equipment terminal |
CN110544482B (en) * | 2019-09-09 | 2021-11-12 | 北京中科智极科技有限公司 | Single-channel voice separation system |
CN110544482A (en) * | 2019-09-09 | 2019-12-06 | 极限元(杭州)智能科技股份有限公司 | Single-channel voice separation system |
CN110874879A (en) * | 2019-10-18 | 2020-03-10 | 平安科技(深圳)有限公司 | Elderly person registration method, device, equipment and storage medium based on voice recognition |
WO2021073161A1 (en) * | 2019-10-18 | 2021-04-22 | 平安科技(深圳)有限公司 | Elderly people registration method, apparatus and device based on voice recognition, and storage medium |
CN111179962A (en) * | 2020-01-02 | 2020-05-19 | 腾讯科技(深圳)有限公司 | Training method of voice separation model, voice separation method and device |
US20220012667A1 (en) * | 2020-07-13 | 2022-01-13 | Allstate Insurance Company | Intelligent prediction systems and methods for conversational outcome modeling frameworks for sales predictions |
US11829920B2 (en) * | 2020-07-13 | 2023-11-28 | Allstate Insurance Company | Intelligent prediction systems and methods for conversational outcome modeling frameworks for sales predictions |
CN112017685A (en) * | 2020-08-27 | 2020-12-01 | 北京字节跳动网络技术有限公司 | Voice generation method, device, equipment and computer readable medium |
CN112017685B (en) * | 2020-08-27 | 2023-12-22 | 抖音视界有限公司 | Speech generation method, device, equipment and computer readable medium |
CN117577124A (en) * | 2024-01-12 | 2024-02-20 | 京东城市(北京)数字科技有限公司 | Training method, device and equipment of audio noise reduction model based on knowledge distillation |
CN117577124B (en) * | 2024-01-12 | 2024-04-16 | 京东城市(北京)数字科技有限公司 | Training method, device and equipment of audio noise reduction model based on knowledge distillation |
Also Published As
Publication number | Publication date |
---|---|
CN108962229B (en) | 2020-11-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108962229A (en) | Single-channel, unsupervised target speaker voice extraction method | |
EP3292515B1 (en) | Method for distinguishing one or more components of signal | |
Sarikaya et al. | High resolution speech feature parametrization for monophone-based stressed speech recognition | |
CN106952643A (en) | Recording device clustering method based on Gaussian mean supervector and spectral clustering |
CN109326302A (en) | Speech enhancement method based on voiceprint comparison and generative adversarial networks |
CN108899047B (en) | Masking threshold estimation method, apparatus and storage medium for audio signals |
CN109410976A (en) | Speech enhancement method for binaural hearing aids based on binaural sound source localization and deep learning |
CN108922559A (en) | Recording terminal clustering method based on speech time-frequency transform features and integer linear programming |
CN101366078A (en) | Neural network classifier for separating audio sources from a monophonic audio signal | |
CN110197665B (en) | Voice separation and tracking method for public security criminal investigation monitoring | |
CN106328123B (en) | Method for recognizing whispered speech within a normal speech stream under small-database conditions |
CN108986798B (en) | Voice data processing method, device and equipment |
CN110111797A (en) | Speaker recognition method based on Gaussian supervector and deep neural network |
CN103985381A (en) | Audio indexing method based on parameter-fusion optimized decision |
CN109346084A (en) | Speaker recognition method based on deep stacked autoencoder network |
Do et al. | Speech source separation using variational autoencoder and bandpass filter | |
Fan et al. | Utterance-level permutation invariant training with discriminative learning for single channel speech separation | |
CN110136746B (en) | Method for identifying mobile phone source in additive noise environment based on fusion features | |
CN106875944A (en) | System for voice control of a home intelligent terminal |
Fan et al. | Deep attention fusion feature for speech separation with end-to-end post-filter method | |
Qiu et al. | Self-Supervised Learning Based Phone-Fortified Speech Enhancement. | |
CN116092512A (en) | Small sample voice separation method based on data generation | |
Wang et al. | Robust speech recognition from ratio masks | |
Ravindran et al. | Improving the noise-robustness of mel-frequency cepstral coefficients for speech processing | |
CN111091847A (en) | Speech separation method based on improved deep clustering |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||