CN108962229B - Single-channel and unsupervised target speaker voice extraction method - Google Patents

Single-channel and unsupervised target speaker voice extraction method

Info

Publication number
CN108962229B
Authority
CN
China
Prior art keywords
voice
teacher
speech
gmm
category
Prior art date
Legal status
Active
Application number
CN201810832080.5A
Other languages
Chinese (zh)
Other versions
CN108962229A (en)
Inventor
姜大志
陈逸飞
Current Assignee
Shantou University
Original Assignee
Shantou University
Priority date
Filing date
Publication date
Application filed by Shantou University
Priority to CN201810832080.5A
Publication of CN108962229A
Application granted
Publication of CN108962229B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/005 Language recognition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0631 Creating reference templates; Clustering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0635 Training updating or merging of old and new templates; Mean values; Weighting
    • G10L2015/0636 Threshold criteria for the updating
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0638 Interactive procedures
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The embodiment of the invention discloses a single-channel, unsupervised target speaker voice extraction method, which comprises a teacher language detection step and a teacher language GGMM model training step. The teacher language detection step comprises: obtaining voice data from a classroom recording; carrying out voice signal processing; voice segmentation and modeling, in which the classroom voice is segmented into equal-length sections, corresponding MFCC (Mel-frequency cepstral coefficient) features are extracted for each section, and a GMM (Gaussian mixture model) is constructed for each section based on the MFCC features; and teacher voice detection, in which the similarity between the GMM model of each voice section outside the teacher utterance category and the GGMM is calculated and the voice sections below a set threshold are marked as the teacher utterance category, yielding the final teacher utterance category. The teacher language GGMM model training step comprises: clustering the segmented voice data obtained in step S3 to obtain an initial teacher utterance category, and extracting the GGMM model based on the initial teacher utterance category. The invention effectively improves the adaptability and intelligence of the system in practical application and lays a foundation for subsequent application and research.

Description

Single-channel and unsupervised target speaker voice extraction method
Technical Field
The invention relates to a voice extraction method, in particular to a target speaker voice extraction method based on a single channel and an unsupervised mode under a complex multi-speaker situation.
Background
Guaranteeing education quality is essential at every level of education, and improving teaching quality, especially classroom teaching quality, is central to that goal. The conventional approach relies on manual (peer) on-site observation and evaluation. Although it can play a certain role, it lacks general operability and objectivity: the education authorities cannot observe classes, make evaluations and give suggestions at all times and in all places, and attempting to do so would impose a heavy and unnecessary burden on teaching management. Moreover, traditional on-site observation cannot follow the whole teaching process, so the teaching quality of teachers is difficult to evaluate objectively.
Information and intelligent technologies have become important supports of social development; how to use and develop them to reform the traditional classroom and build efficient, automatic intelligent perception for classroom teaching has become a scientific problem of great research value.
To realize intelligent perception for classroom teaching, the first problem to be solved is teacher voice recognition and acquisition.
At present, besides supervised speaker recognition methods, unsupervised speaker recognition is mainly speaker clustering, and teacher speech recognition in classroom speech generally also belongs to this area. Research on speaker clustering mainly falls into four categories: 1. hierarchical clustering; 2. K-means clustering; 3. spectral clustering; 4. affinity propagation clustering.
In the article "Research and implementation of an unsupervised speaker clustering method", the running efficiency of a spectral clustering algorithm based on a feature similarity matrix is studied, and a spectral clustering algorithm that builds the model similarity matrix through an adaptive Gaussian mixture model is implemented. First, a Gaussian mixture model (GMM) of the target speaker is obtained by training on a speech segment with the GMM-UBM-MAP technique, i.e. a universal background model (UBM) is trained offline and adapted according to the maximum a posteriori (MAP) criterion. Then the similarity between GMM models is computed to build a similarity matrix, features are extracted from the matrix and clustered, and the speaking part of the target person is obtained.
In the article "Improved speaker cluster initialization and GMM multi-speaker recognition", Mel-frequency cepstral coefficient (MFCC) features of the speech segments are extracted; in the training part the Bayesian information criterion (BIC) is used to process the initial classes to obtain purer initial classes, a clustering algorithm is then applied to the MFCC features, GMM model features are trained for each class, and in the recognition stage speakers are judged with GMM-based speaker recognition.
For extracting the teacher's speech, besides recognizing the teacher speaking alone, speech separation of the overlapping speech containing the teacher's utterances is required. The purpose of speech separation is to separate the speech of interest from several simultaneously sounding sources. According to the relationship between the source signals and the collected mixed signal, speech separation is divided into multi-channel and single-channel separation. Single-channel speech needs only a single recording source, so it is easier to acquire than multi-channel signals and better matches actual conditions, but separation of a single-channel speech signal is more difficult. Research on single-channel speech separation falls mainly into 3 types: 1. based on computational auditory scene analysis (CASA); 2. model-based; 3. based on time-frequency distribution.
In the article "An Audio Scene Analysis Approach to Monaural Speech Segregation", Hu and Wang propose a CASA-based speech separation framework. By simulating the characteristics of the basilar membrane of the human cochlea, the mixed signal is decomposed into a time-frequency representation and the features required for speech separation are extracted for auditory time-frequency segmentation; adjacent time-frequency units of the same sound source are merged into auditory segments, the auditory segments of the same source are then merged, and speech separation is finally realized by waveform synthesis for each source. Later, Hu and Wang made a series of improvements to the CASA system, including optimizing the separation of voiced and unvoiced signals. The article "CNMF-based acoustical features for noise-robust ASR" indicates that NMF is an unsupervised dictionary-learning method that works well for various types of signal separation. The NMF algorithm requires purely additive operations, all components after decomposition are non-negative matrices, and it can perform matrix dimensionality reduction. With continued research the NMF algorithm has proven fast and accurate and very convenient for large-scale data, so it is widely applied in many fields.
In the above prior art, there are the following drawbacks:
1. In unsupervised speaker clustering recognition, whether the minimum inter-cluster distance exceeds a certain threshold is used as the criterion for ending clustering, so the effectiveness of the hierarchical clustering algorithm is limited by how that threshold is determined.
2. The article "Research and implementation of an unsupervised speaker clustering method" combines GMM-UBM-MAP with spectral clustering over a feature similarity matrix, which requires training GMM models of the speech signal and therefore cannot achieve completely unsupervised speaker recognition. In addition, the method requires the speakers' segments in the speech to be analysed to be fairly evenly distributed, places high demands on the "purity" of each speaker segment, and adapts poorly to real scenes of various forms.
3. The article "Improved speaker clustering initialization and GMM multi-speaker recognition" clusters the MFCC coefficients directly. Since MFCC features are extracted frame by frame, the computation is heavy for a long recording such as a 40 min classroom session, and the clustering accuracy cannot be well guaranteed.
4. The article "An Audio Scene Analysis Approach to Monaural Speech Segregation" performs speech separation based on CASA by simulating the human ear, but the human-ear characteristics to model are difficult to select.
5. The article "CNMF-based acoustical features for noise-robust ASR" requires that training speech of the separate sources is given beforehand.
6. The result of single-channel speech separation is still affected by noise, and existing speech separation methods rarely denoise the separation result further to purify the separated speech signal.
Disclosure of Invention
The technical problem to be solved by the embodiments of the present invention is to provide a single-channel, unsupervised target speaker voice extraction method. By utilizing and developing relevant information and intelligent technical means, classroom voice signals can be acquired, analyzed, processed and identified, and on the basis of a constructed adaptive, unsupervised intelligent method the teacher voice part can be robustly detected and extracted from the classroom voice signals.
In order to solve the above technical problems, an embodiment of the present invention provides a single-channel-based unsupervised target speaker speech extraction method, including a teacher language detection step and a teacher language GGMM (general Gaussian mixture model) model training step;
the teacher language detection step comprises the following steps:
s1: obtaining voice data for classroom recording;
s2: carrying out voice signal processing;
s3: voice segmentation and modeling, wherein the voice segmentation comprises equal-length segmentation of class voices, then extracting corresponding MFCC (Mel frequency cepstrum coefficient) features for each section of voice, and constructing a GMM (Gaussian mixture model) of each section of voice based on the MFCC features;
s4: the teacher voice detection is to calculate the similarity between the GMM model of each voice section outside the teacher utterance category and the GGMM, set an adaptive threshold value, and mark the voice sections smaller than the threshold value as the teacher utterance category, so as to obtain the final teacher utterance category;
the teacher language GGMM model training step comprises the following steps:
s5: clustering the voice data obtained in step S3; an initial teacher utterance class is obtained, and a GGMM model is extracted based on the initial teacher utterance class.
Further, the clustering process includes the steps of:
s51: selecting a clustering central point;
s52: calculating the distances between all samples and the central points, and iterating until a preset stopping condition is met;
s53: step S51 and step S52 are executed circularly for n times, n teacher voice division groups can be obtained, and the division group with the maximum satisfaction degree is selected as initial teacher voice according to a set rule;
s54: selecting a plurality of training GGMM models from the division group, and calculating the average distance in the class;
s55: performing secondary judgment on the remaining voice sample segments according to the GGMM and the average distance, and adding a sample to the teacher class if its distance is smaller than a set threshold value;
s56: and outputting all teacher voice samples and writing the teacher voice samples into a database.
Further, the step of S51 specifically includes:
s511: randomly selecting one from all the voice sections as a first central point;
s512: calculating the distance between the residual voice sections and the GMM model of the first central point, and selecting the voice section with the largest distance as a second central point;
s513: sequentially calculating the distance between the voice section which is not selected as the central point and the central point, and selecting the voice section which has the largest distance from the central point as the next central point;
s514: and iterating until the number of the central points reaches the number of the specified categories.
Further, the step of S52 specifically includes:
s521: calculating the distance between the remaining part of GMM models and the central point, and dividing each GMM into the nearest central points;
s522: updating the central point, and taking the point in each type with the minimum sum of distances to all points in the type as a new central point;
s523: and iterating until a preset stop condition is met or iterating for a specified number of times.
Further, the step of S53 specifically includes: performing similarity calculation on the N teacher category vectors obtained through the iterations, and taking the vector whose summed similarity to the remaining N-1 vectors is largest as the initial teacher category obtained by the final clustering.
Further, the step of S54 specifically includes: randomly selecting a certain number of speech segments from the teacher category (the number is given by the formula in the original figure), where M is the number of speech segments in the teacher category obtained by clustering; only a random subset is taken in order to reduce the time of GMM model training over all speech segments in the teacher category, and N is a constant obtained adaptively from the size of M, as follows (formula in the original figure): α is a time adjustment parameter for adjusting the number of speech segments used for GMM training, length(C) denotes the total number of speech segments obtained by segmenting the original classroom speech, and the coefficient 0.4 × length(C) represents the minimum number of teacher speech segments.
Further, the S3 includes:
s31: detecting the overlapped voice to obtain overlapped voice sections in the classroom voice;
s32: judging whether the overlapped voice contains teacher voice;
s33: selecting a voice segment closest to the overlapped voice as a training voice segment;
s34: and designing a CNMF + JADE method for voice separation.
Further, the S31 includes:
obtaining overlapping speech segments by using silence points; silent frames are determined by setting an energy threshold, which is obtained as follows (formula in the original figure): E_i denotes the energy of the i-th speech frame, N is the total number of frames in the speech segment, r is a constant in the range (0, 1), and the formula uses rounding up.
Further, the S32 includes: judging whether the overlapping speech contains the teacher by means of GMM similarity, where the similarity is measured with an improved Bhattacharyya distance and the judgment criterion is as follows (formula in the original figure): disp(A, B) denotes the distance between the GMM models of speech segments A and B, A denotes the overlapping speech segment, B denotes the teacher speech segment, and t is an adaptive threshold calculated as follows (formula in the original figure): p is an adjustment parameter taking values in [0.5, 0.8], K is the number of student speech segments, X_i is the i-th student speech segment, and B is the teacher speech segment.
Further, the S33 includes: selecting the non-teacher speech segment closest to the overlapping speech to train the CNMF together with the teacher speech segment, selected as follows:
v_i = min(disp(A_i, S_j)), i = 1, 2, ..., N, j = 1, 2, ..., K
where A_i is the i-th overlapping speech segment, S_j is the j-th non-teacher speech segment, and v_i is the correspondingly selected i-th training speech segment.
The embodiment of the invention has the following beneficial effects: the invention provides an unsupervised and self-adaptive robust teacher voice detection and extraction method for classroom teaching with high complexity (mainly comprising the diversity of classroom situations, the diversity of teacher subjects and the diversity of teacher classroom organizations), so that the adaptability and intelligence of the system in practical application are effectively improved, and a foundation is laid for subsequent application and research.
Drawings
FIG. 1 is a schematic diagram of the framework flow structure of the present invention;
FIG. 2 is a flow chart illustrating the teacher language detection step;
FIG. 3 is a schematic diagram of teacher language GGMM model training steps;
FIG. 4 is a flow chart illustrating the steps of a clustering algorithm;
FIG. 5 is a schematic diagram of a speech separation implementation step;
FIG. 6 is a schematic diagram of the speech enhancement implementation steps.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings.
Referring to fig. 1, the single-channel unsupervised target speaker voice extraction method includes a teacher language detection step and a teacher language GGMM model training step.
As shown in fig. 2, the teacher language detection comprises the following steps:
s110, recording;
s120, preprocessing a voice signal;
s130, voice segmentation and modeling;
and S140, teacher voice detection.
As shown in fig. 3, the teacher language GGMM model training comprises the following steps:
s110, recording;
s120, preprocessing a voice signal;
s130, voice segmentation and modeling;
and S240, clustering.
Wherein, the corresponding classroom speech data is obtained by using the sound recording apparatus in S110.
In S120, the classroom voices obtained by recording are preprocessed, which includes common voice preprocessing methods such as framing, windowing, pre-emphasis, and the like.
In S130, the classroom voices are divided into equal lengths, and then, a corresponding MFCC feature is extracted for each section of voice, and a GMM model of each section of voice is constructed based on the MFCC features. Then, the GMM models of the respective voices are used as input data of S240 to perform clustering operation, an initial teacher utterance class is obtained, and the GGMM model is extracted based on the initial teacher utterance class. And S140, carrying out similarity calculation on the GMM model and the GGMM of each voice outside the teacher utterance class, setting a self-adaptive threshold value, and marking the voice smaller than the threshold value as the teacher utterance class, thereby obtaining the final teacher utterance class.
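As an illustration of S110-S130 (and the shared preprocessing of S240), the sketch below cuts a recording into equal-length segments, extracts MFCC features and fits one GMM per segment. It is only a minimal sketch: the librosa and scikit-learn calls, the 16 kHz sample rate, 30 s segment length, 13 MFCC coefficients and 8 mixture components are assumptions for illustration, not values fixed by the patent.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def segment_and_model(wav_path, seg_len_s=30.0, n_mfcc=13, n_components=8):
    """Split a classroom recording into equal-length segments and fit one GMM per segment."""
    y, sr = librosa.load(wav_path, sr=16000)
    seg_len = int(seg_len_s * sr)
    segments = [y[i:i + seg_len] for i in range(0, len(y) - seg_len + 1, seg_len)]
    models = []
    for seg in segments:
        # Frame-level MFCC features, one row per frame
        mfcc = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=n_mfcc).T
        gmm = GaussianMixture(n_components=n_components, covariance_type='diag',
                              max_iter=200, random_state=0).fit(mfcc)
        models.append(gmm)
    return segments, models
```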
The clustering algorithm in S240 is shown in fig. 4.
The specific embodiment of the clustering algorithm comprises the following steps:
s2401, selecting an initial central point;
1) Randomly select one of all the speech segments as the first center point.
2) Compute the distance between the GMM model of each remaining speech segment and that of the first center point, and select the speech segment with the largest distance as the second center point.
3) For each speech segment not yet chosen as a center point, compute its distance to the existing center points, and select the speech segment farthest from them as the next center point.
4) Iterate until the number of center points reaches the specified number of categories.
Compared with random center-point selection, this selection method clearly improves the accuracy of the final clustering result. In practice, because of the stopping condition set in step 3) of S2402, clustering results obtained by taking an outlier as a center point are excluded during iteration, so selecting the initial center points in this way yields robust clustering results.
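A minimal sketch of this farthest-first initialization, assuming a precomputed pairwise GMM distance matrix D (the distance itself is defined below); using the minimum distance to the already-chosen centers and the tie-breaking are illustrative assumptions.

```python
import numpy as np

def init_centers(D, k, seed=0):
    """Farthest-first selection of k initial centers from an n x n GMM distance matrix D."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    centers = [int(rng.integers(n))]                 # 1) random first center
    while len(centers) < k:
        # Distance of every segment to its nearest already-chosen center
        d_nearest = D[:, centers].min(axis=1)
        d_nearest[centers] = -np.inf                 # never re-pick an existing center
        centers.append(int(np.argmax(d_nearest)))    # 2)-4) farthest segment becomes next center
    return centers
```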
The distance between Gaussian mixture models cannot be measured well by the method above alone, so the dispersion of GMM A and GMM B is defined (formula in the original figure) and referred to as the dispersion of GMM A relative to GMM B, where W_Ai denotes the weight of the i-th component of GMM A, W_Bj the weight of the j-th component of GMM B, and d_AB(i, j) the distance between the i-th Gaussian distribution of GMM A and the j-th Gaussian distribution of GMM B. Considering the computational cost, and the possibility that the mean vectors of several Gaussian distributions are exactly the same, this embodiment selects the Mahalanobis distance as the method for computing d_AB(i, j). The Mahalanobis distance between two multidimensional Gaussian distributions (formula in the original figure) is computed from their mean vectors μ1, μ2 and their covariance matrices Σ1, Σ2.
For symmetry, the final GMM distance metric combines the dispersion of A relative to B and the dispersion of B relative to A (formula in the original figure), where A and B denote the two GMM models.
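The patent gives the dispersion and the component distance only as figure formulas, so the sketch below is one plausible reading rather than the patent's exact definition: each component of A is matched to its closest component of B, weighted by A's mixture weights, the two directions are averaged, and a Mahalanobis-style distance between diagonal-covariance components is assumed.

```python
import numpy as np

def gauss_dist(mu1, var1, mu2, var2):
    # Mahalanobis-style distance between two diagonal-covariance Gaussians (assumed form)
    pooled = (var1 + var2) / 2.0
    diff = mu1 - mu2
    return float(diff @ (diff / pooled))

def dispersion(gmm_a, gmm_b):
    # For every component of A: distance to the closest component of B,
    # weighted by A's mixture weights (assumed weighting)
    total = 0.0
    for wa, ma, va in zip(gmm_a.weights_, gmm_a.means_, gmm_a.covariances_):
        total += wa * min(gauss_dist(ma, va, mb, vb)
                          for mb, vb in zip(gmm_b.means_, gmm_b.covariances_))
    return total

def gmm_distance(gmm_a, gmm_b):
    # Symmetrized distance disp(A, B)
    return 0.5 * (dispersion(gmm_a, gmm_b) + dispersion(gmm_b, gmm_a))
```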
S2402, calculating the distances between all samples and the center points, and iterating until a preset stopping condition is met;
1) Compute the distance between each remaining GMM model and the center points, and assign each GMM to the nearest center point.
2) Update the center points: within each class, take the point whose sum of distances to all other points in the class is smallest as the new center point.
3) Iterate until a preset stopping condition is met (output when the largest category in the clustering result contains more than 40% of all speech segments and more segments than the second largest category) or until a specified number of iterations is reached.
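A sketch of this assignment and medoid-style update loop over the precomputed GMM distance matrix; the 40% stopping rule follows the text, while the function layout and parameter names are illustrative.

```python
import numpy as np

def gmm_kmeans(D, centers, max_iter=50, min_share=0.4):
    """Medoid-style clustering over the precomputed GMM distance matrix D."""
    n = D.shape[0]
    labels = np.argmin(D[:, centers], axis=1)
    for _ in range(max_iter):
        labels = np.argmin(D[:, centers], axis=1)       # 1) assign to nearest center
        new_centers = []
        for c in range(len(centers)):                   # 2) medoid update per class
            members = np.where(labels == c)[0]
            if members.size == 0:
                new_centers.append(centers[c])
                continue
            sums = D[np.ix_(members, members)].sum(axis=1)
            new_centers.append(int(members[np.argmin(sums)]))
        sizes = np.sort(np.bincount(labels, minlength=len(centers)))[::-1]
        # 3) stop when the largest class holds >40% of the segments and beats the second largest
        if sizes[0] > min_share * n and sizes[0] > sizes[1]:
            break
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers
```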
S2403, steps S2401 and S2402 are executed cyclically n times to obtain n teacher speech partitions, and the partition with the greatest satisfaction is selected as the initial teacher speech according to a set rule.
Specifically, similarity calculation is performed on the N teacher category vectors obtained through the iterations, and the vector whose summed similarity to the remaining N-1 vectors is largest is taken as the initial teacher category produced by the final clustering. Because the N teacher category vectors do not all have the same length, they must be processed to a common length before the similarity calculation.
In the present embodiment, the zero padding method is used to equalize the vector lengths.
The longest of the N teacher category vectors is found and its length is recorded as M; all vectors are extended to length M, and the missing part is filled with zero elements, namely:
M = max(length(T_1), length(T_2), ..., length(T_N))
T_i = [T_i, Append_i], i = 1, 2, ..., N
Append_i = zeros(1, M - length(T_i)), i = 1, 2, ..., N
where T_1, T_2, ..., T_N are the N teacher category vectors, M is the longest vector length, length(T) denotes the length of vector T, Append_i is the zero-element vector appended to the i-th teacher category vector, and zeros(i, j) denotes a zero vector with i rows and j columns.
In this embodiment the teacher category vectors are brought to a uniform length by zero padding and the pairwise similarity is then computed. Because zero elements are added artificially, similarity measures based on inter-vector distance (for example the Euclidean distance) incur large errors, so cosine similarity is used instead to measure the similarity between vectors.
Cosine similarity expresses the similarity of two vectors by the cosine of the angle between them in vector space: the closer the cosine value is to 1, the closer the angle is to 0 degrees and the more similar the vectors are.
The cosine similarity between vectors a and b is defined as

cos(a, b) = (a · b) / (||a|| · ||b||) = (Σ_{i=1}^{N} a_i b_i) / (sqrt(Σ_{i=1}^{N} a_i^2) · sqrt(Σ_{i=1}^{N} b_i^2))

where a = (a_1, a_2, ..., a_N) and b = (b_1, b_2, ..., b_N) are N-dimensional vectors.
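A small sketch of the zero-padding and cosine-similarity selection just described; the helper name and the use of NumPy are illustrative assumptions.

```python
import numpy as np

def pick_teacher_class(teacher_vectors):
    """Zero-pad the candidate teacher-class vectors to a common length M and return the
    one whose summed cosine similarity to the other vectors is largest."""
    m = max(len(t) for t in teacher_vectors)
    padded = np.array([np.pad(np.asarray(t, dtype=float), (0, m - len(t)))
                       for t in teacher_vectors])
    unit = padded / np.clip(np.linalg.norm(padded, axis=1, keepdims=True), 1e-12, None)
    sim = unit @ unit.T                       # pairwise cosine similarities
    scores = sim.sum(axis=1) - 1.0            # drop the self-similarity term
    return teacher_vectors[int(np.argmax(scores))]
```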
S2404, randomly selecting teacher classes
The size of this random subset is given by the formula in the original figure; M is the number of speech segments in the teacher category obtained by clustering, and only a random subset is used in order to reduce the time of GMM model training over all speech segments in the teacher category. N is a constant obtained adaptively from the size of M (formula in the original figure), where α is a time adjustment parameter used to adjust the number of speech segments for GMM training; in this embodiment α is 2. length(C) denotes the total number of speech segments obtained after the original classroom speech is segmented into 30 s pieces, and the coefficient 0.4 × length(C) represents the minimum number of teacher speech segments. The expression means that the more teacher-class speech segments the clustering produces, the smaller the fraction of them used for GMM model training, so that the number of segments required for GMM training tends to be similar for different recordings.
A similarity threshold is then set to S/γ, where S is the mean of the similarities between the teacher-class speech segments and γ is an adaptive adjustment parameter that preserves the completeness of the teacher class as far as possible. It is obtained as follows (formula in the original figure): β is an adjustment parameter in the range [0, 1]; in this embodiment β is 1/5. S_max and S_min denote the maximum and minimum of the similarities between the teacher-class segments, length(C) is the total number of speech segments after the original classroom speech is segmented into 30 s pieces, and M is the number of speech segments in the teacher class. The expression indicates that the larger M is, the larger γ becomes, i.e. the smaller the similarity threshold is set; and when the range of similarities is larger, a smaller similarity threshold is taken, which makes the decision on whether a remaining segment is a teacher utterance more accurate.
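The sketch below illustrates the second-pass teacher detection with a GGMM and the adaptive threshold S/γ. Because the figure formulas for the random subset size and for γ are not reproduced here, the GGMM is simply trained on all clustered teacher segments and γ is passed in as a plain parameter; gmm_distance is the earlier sketch, and all parameter values are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def second_pass_teacher_detection(models, mfcc_per_segment, teacher_idx,
                                  gamma=4.0, n_components=16):
    """Train a GGMM on the clustered teacher segments and relabel any remaining segment
    whose distance to the GGMM is below the adaptive threshold S / gamma."""
    pooled = np.vstack([mfcc_per_segment[i] for i in teacher_idx])
    ggmm = GaussianMixture(n_components=n_components, covariance_type='diag',
                           max_iter=200, random_state=0).fit(pooled)
    # Mean pairwise distance S among the teacher segments
    pairs = [(i, j) for i in teacher_idx for j in teacher_idx if i < j]
    s_mean = np.mean([gmm_distance(models[i], models[j]) for i, j in pairs])
    threshold = s_mean / gamma
    teacher = set(teacher_idx)
    extra = [i for i in range(len(models))
             if i not in teacher and gmm_distance(models[i], ggmm) < threshold]
    return sorted(teacher | set(extra))
```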
Through the processing of the GMM-Kmeans algorithm, a relatively stable teacher class vector is finally obtained. In tests comparing it with manually divided classes, the obtained teacher class shows high similarity to the manually labelled teacher class, and compared with the result of clustering directly with an improved K-means, the GMM-Kmeans algorithm used in this embodiment clearly improves the clustering accuracy.
After the teacher category is obtained, the silent and overlapping speech portions are determined. Because the student category has no clear distinguishing feature and the number of students is unknown, the student category cannot be detected first. This embodiment therefore detects the teacher class, the silence class and the overlapping speech class first, and labels the remaining speech segments, i.e. those not included in these three parts, as the student utterance class.
Referring to fig. 5, the specific implementation steps of speech separation are as follows:
s310, detecting the overlapped voice to obtain the overlapped voice section in the classroom voice
S320, judging whether the overlapped voice contains teacher voice
S330, selecting the voice segment closest to the overlapped voice as a training voice segment
S340, designing a CNMF + JADE method for voice separation
In S310, overlapping speech segments are obtained on the basis of silence points. Silent frames have lower energy than non-silent frames, so silent frames can be determined by setting an energy threshold, defined as follows (formula in the original figure): E_i denotes the energy of the i-th speech frame, N is the total number of frames in the speech segment, r is a constant in the range (0, 1), and the formula uses rounding up.
Overlapping speech means that a segment contains two or more people speaking simultaneously. In a real classroom it mainly arises from 1. group discussion among students and 2. several students answering at the same time when the teacher asks a question. Overlapping speech segments differ from silence segments in how silent frames appear within them: the longer the total duration of silence in a segment, the lower the probability that the segment contains overlapping speech [56]. For the problem addressed in this embodiment, the potential overlapping speech class can therefore be determined from the number of silent frames, in a manner similar to obtaining the potential silence class (formulas in the original figures):

ClassOfOverlap_i = I(numberOfSilence_i < Threshold_o), i = 1, 2, ..., N

where α' is a constant used to obtain the threshold Threshold_o for judging the overlapping speech category; in this example α' is 0.6. Segments whose number of silent frames is smaller than Threshold_o are regarded as potential overlapping speech segments, from which the potential overlapping speech class is obtained.
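A sketch of the silence and overlap detection described above, assuming 25 ms frames with a 10 ms hop at 16 kHz. The silence energy threshold is read here as the r-quantile of the frame energies and the overlap threshold as α' times the mean silent-frame count; both readings of the figure formulas are assumptions.

```python
import numpy as np

def frame_energies(seg, frame_len=400, hop=160):
    starts = range(0, len(seg) - frame_len + 1, hop)
    return np.array([float(np.sum(seg[i:i + frame_len] ** 2)) for i in starts])

def silent_frame_count(seg, r=0.1):
    # Silence energy threshold taken as the r-quantile of the frame energies
    # (an assumed reading of the figure formula involving r and rounding up)
    e = frame_energies(seg)
    thresh = np.sort(e)[max(int(np.ceil(r * len(e))) - 1, 0)]
    return int(np.sum(e <= thresh)), len(e)

def potential_overlap_segments(segments, alpha=0.6):
    # Segments with fewer silent frames than alpha times the mean count are
    # flagged as potential overlapping speech (again an assumption about the figure)
    counts = np.array([silent_frame_count(s)[0] for s in segments])
    threshold_o = alpha * counts.mean()
    return [i for i, c in enumerate(counts) if c < threshold_o]
```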
S320 and S330 are collectively referred to as speech separation front-end processing, which has two purposes: judging whether the overlapping speech contains the target speaker, and finding the speech segment other than the target speaker that is closest to the overlapping speech to use as CNMF training data. The invention judges whether the overlapping speech contains the teacher on the basis of GMM similarity. The similarity is computed with an improved Bhattacharyya distance, and the judgment criterion is as follows (formula in the original figure): disp(A, B) denotes the distance between the GMM models of speech segments A and B, A denotes the overlapping speech segment, and B the teacher speech segment. t is an adaptive threshold, calculated as follows (formula in the original figure): p is an adjustment parameter taking values in [0.5, 0.8], K is the number of student speech segments, X_i is the i-th student speech segment, and B is the teacher speech segment. The adaptive threshold is computed from the student segments and used to decide whether the teacher is present in the overlapping speech.
The second task of the speech separation front-end is to select the non-teacher speech segment closest to the overlapping speech for CNMF training, which strongly affects the subsequent speech separation. The invention trains the CNMF with the teacher speech segment together with the non-teacher speech segment closest to the overlapping speech, selected as follows:

v_i = min(disp(A_i, S_j)), i = 1, 2, ..., N, j = 1, 2, ..., K

where A_i is the i-th overlapping speech segment, S_j is the j-th non-teacher speech segment, and v_i is the correspondingly selected i-th training speech segment.
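A sketch of the two front-end decisions, reusing the gmm_distance sketch from earlier. Modelling the adaptive threshold t as p times the mean student-to-teacher distance is an assumption about the figure formula; the returned index is that of the closest non-teacher segment.

```python
import numpy as np

def overlap_front_end(overlap_models, teacher_ggmm, student_models, p=0.65):
    """For each overlapping segment: decide whether it contains the teacher, and pick
    the closest non-teacher segment as CNMF training data."""
    # Adaptive threshold t sketched as p times the mean student-to-teacher distance
    t = p * np.mean([gmm_distance(x, teacher_ggmm) for x in student_models])
    results = []
    for a in overlap_models:
        contains_teacher = gmm_distance(a, teacher_ggmm) < t
        nearest = int(np.argmin([gmm_distance(a, s) for s in student_models]))
        results.append((contains_teacher, nearest))
    return results
```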
In S340, speech separation is performed on the overlapping speech containing the teacher. The invention provides a single-channel speech separation method that fuses CNMF and JADE: the speech signals separated by CNMF undergo a second separation based on JADE. The CNMF + JADE algorithm aims to obtain the separated speech signals of all speakers in single-channel mixed speech and comprises the following steps:
Input: clean speech t_1, t_2, ..., t_N of the speakers to be separated, mixed speech o_1, o_2, ..., o_{N-1} used for training, and the mixed speech O to be separated.
Output: separated speaker speech s_1, s_2, ..., s_N.
Step 1: select the target speaker t_1 and the corresponding mixed speech o_1 to train the CNMF.
Step 2: separate the mixed speech O with the trained CNMF to obtain an estimate of the target speech and an estimate of the remaining mixture (symbols shown in the original figures).
Step 3: generate a random matrix R_1 and mix the two CNMF outputs to form a two-channel speech signal S_1.
Step 4: separate S_1 with JADE to obtain s_1 and O_1.
Step 5: take O_1 as the mixed speech to be separated, t_2, ..., t_N as the speakers to be separated and o_2, ..., o_{N-1} as the training mixed speech, and repeat Step 1 to Step 5.
Step 6: obtain the separated speech s_1, s_2, ..., s_N.
In the above algorithm, t_1, t_2, ..., t_N denote the clean speech of the speakers contained in the mixed speech O, and N is the number of speakers in O. o_1, o_2, ..., o_{N-1} denote the mixed speech obtained by removing the corresponding speaker from O in turn (formula in the original figure). In practice o_1, o_2, ..., o_{N-1} are very difficult to obtain, so the CNMF is instead trained with the speech of a randomly selected non-target speaker from the current mixed speech as a surrogate. Experiments show that this slightly reduces the effect compared with the original CNMF + JADE, but it generalizes the CNMF + JADE algorithm to a more general situation.
The two-channel speech signal in Step 3 is generated by applying the matrix R_i to the two CNMF outputs (formula in the original figure), where R_i is a 2 × 2 matrix.
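A structural sketch of the CNMF + JADE loop. Neither CNMF separation nor JADE is available as a standard-library call, so cnmf_separate and jade_two_channel are user-supplied placeholders; only the peel-one-speaker-per-iteration control flow and the random 2 × 2 remixing follow the algorithm above.

```python
import numpy as np

def separate_all(mixture, train_targets, train_mixtures, cnmf_separate, jade_two_channel):
    """CNMF + JADE control flow: peel off one speaker per iteration.
    cnmf_separate(mix, target_train, other_train) -> (target_estimate, rest_estimate)
    jade_two_channel(two_channel) -> (separated_target, separated_rest)
    Both are user-supplied placeholders, not standard-library calls."""
    rng = np.random.default_rng(0)
    separated, residual = [], mixture
    for k, target_train in enumerate(train_targets):
        if k == len(train_targets) - 1:
            separated.append(residual)                  # the last speaker is what remains
            break
        s_hat, o_hat = cnmf_separate(residual, target_train, train_mixtures[k])
        R = rng.normal(size=(2, 2))                     # Step 3: random 2x2 remixing
        two_channel = R @ np.vstack([s_hat, o_hat])
        s_k, residual = jade_two_channel(two_channel)   # Step 4: JADE second-stage separation
        separated.append(s_k)
    return separated
```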
As shown in fig. 6, the specific speech enhancement implementation steps are as follows:
S410, the teacher speech obtained by speech separation is taken as the data to be enhanced;
S420, adaptive judgment is performed on the separated teacher speech, and suitable speech segments are selected for speech enhancement;
S430, speech enhancement is applied using the wavelet transform.
Wavelet transform is a hot research point in speech processing in recent years. Compared with the traditional frequency domain analysis method such as Fourier transform, the wavelet transform can simultaneously give the time domain state of the signal, and the time-frequency analysis method has the characteristics of multi-resolution analysis, time-frequency local transform, flexible selection of wavelet functions and the like. The principle of wavelet transform will be described below.
Let L²(R) denote the space of square-integrable functions and let ψ(t) ∈ L²(R). If its Fourier transform Ψ(ω) satisfies the admissibility condition

∫ |Ψ(ω)|² / |ω| dω < ∞,

then ψ(t) is called a basic wavelet or mother wavelet.
After the mother wavelet ψ(t) is scaled and translated by a real pair (a, b), where a, b ∈ R and a ≠ 0, a family of functions is obtained:

ψ_{a,b}(t) = |a|^(-1/2) ψ((t - b) / a).

This family is called the wavelet basis functions, where a is the scaling factor and b is the translation factor. ψ acts as a window function whose size is fixed but whose shape can change; this property gives the wavelet transform its multi-resolution analysis capability. The factor |a|^(-1/2) is the normalization factor; its effect is to give the wavelets the same energy at different scales.
Signal processing based on wavelet domain is one of the main means of speech signal processing at present. Based on the characteristics of multi-resolution, low entropy and decorrelation of wavelet transform, the method has great advantages in speech signal processing. A large number of wavelet bases can deal with different scenes, so the wavelet transformation is very suitable for processing voice signals.
When the wavelet transform is used for voice enhancement, the characteristic of multi-resolution analysis in the wavelet transform is used, and corresponding rules are formulated according to different characteristics of wavelet coefficients of noise and voice on wavelet domains with different scales, so that the processing of the wavelet coefficients of the noise signals is completed.
The wavelet transformation denoising method mainly comprises the following steps:
step 1: wavelet transformation of noisy signals
Step 2: denoising wavelet coefficient on different scales
Step 3: performing wavelet inverse transformation on the processed wavelet coefficient to obtain an enhanced reconstructed signal
The wavelet denoising method can be roughly classified into the following three types: noise is extracted by utilizing a wavelet transform modulus maximum principle; denoising by utilizing the correlation of the wavelet transform space coefficient; denoising by using a wavelet threshold. The present embodiment mainly uses a third wavelet threshold-based denoising method.
Wavelet threshold denoising is one of the commonly used denoising methods, and its basic process is as follows:
Step 1: select a proper wavelet basis according to the signal to be processed, determine a reasonable number of decomposition levels, and perform multi-level decomposition of the noisy speech signal.
Step 2: select suitable thresholds at the different scales for the decomposed wavelet coefficients and quantize the wavelet coefficients.
Step 3: perform wavelet reconstruction on the threshold-quantized result to obtain the enhanced speech signal.
The diversity of wavelet bases is one of the advantages of the wavelet transform in time-frequency analysis, so selecting a proper wavelet function is critical. Research shows that for speech signal processing, wavelet basis functions that are smooth, have good symmetry and a low vanishing moment should be chosen to handle the transient changes of speech signals.
The number of wavelet decomposition levels is also a recognized factor affecting the denoising performance of the speech enhancement algorithm. As the number of decomposition levels increases, the detail of the speech and noise components becomes clearer, which helps denoising; however, too many levels disperse the speech energy more and more, causing distortion and slowing down the algorithm, while too few levels leave the signal confused with the noise so that the noise cannot be separated. Through extensive research and experimental analysis, researchers take the most reasonable number of decomposition levels to be a function of the data length N (formula in the original figure), where the formula uses rounding down.
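A minimal PyWavelets sketch of the three-step threshold denoising process described above, assuming a db8 wavelet, a caller-chosen decomposition level and a caller-supplied threshold; none of these choices is prescribed by the text.

```python
import pywt

def wavelet_threshold_denoise(signal, threshold, wavelet='db8', level=4, mode='soft'):
    """Step 1: multi-level decomposition; Step 2: threshold the detail coefficients
    at every scale; Step 3: inverse transform to reconstruct the enhanced signal."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    coeffs = [coeffs[0]] + [pywt.threshold(c, threshold, mode=mode) for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)
```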
In the wavelet threshold denoising algorithm, the estimation of the threshold is one of the important factors determining the denoising effect. The wavelet transform decomposes the noisy speech signal into high-frequency detail parts and a low-frequency approximation part; since noise is usually of high frequency, the noise energy is concentrated mainly in the high-frequency wavelet coefficients, while the speech energy is concentrated mainly in the low-frequency part of the signal. By setting a threshold, the noise components smaller than the threshold can be cut off for denoising; choosing this threshold is the central problem studied in wavelet threshold denoising.
The selection of the wavelet threshold denoising threshold mainly follows these classical methods:
Unified (universal) threshold.
The unified threshold estimate is derived from a minimum mean square error criterion and can be expressed as

λ = σ_n · sqrt(2 ln N)

where σ_n is the standard deviation of the noise and N is the signal length. The noise level is obtained from

σ_n = M_j / 0.6745

where M_j is the absolute median of the wavelet coefficients of each decomposition layer and 0.6745 is an empirical value. The method is simple to implement and filters Gaussian white noise well, but because the threshold depends on the speech length its effect deteriorates when the data size is large.
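A sketch of the unified threshold as just described, estimating σ_n from the finest-scale detail coefficients of a single DWT; the choice of wavelet is an assumption.

```python
import numpy as np
import pywt

def universal_threshold(signal, wavelet='db8'):
    """Unified (universal) threshold: sigma_n * sqrt(2 ln N), with sigma_n estimated
    from the finest-scale detail coefficients as median(|d|) / 0.6745."""
    _, detail = pywt.dwt(signal, wavelet)
    sigma_n = np.median(np.abs(detail)) / 0.6745
    return sigma_n * np.sqrt(2.0 * np.log(len(signal)))
```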
SUREShrink threshold [73].
SUREShrink threshold estimation is an adaptive threshold selection method and gives an unbiased estimate of the optimal threshold. The threshold is chosen by minimizing a risk function defined on the wavelet coefficients (formulas in the original figures); the threshold estimate is the value t that minimizes this risk, in which one term counts the number of elements of the set { Y : |Y_i| < t }.
Minimax threshold.
The minimax threshold, also called the minimum-maximum threshold, minimizes the maximum mean square error. It is computed as a function of the signal length N (formula in the original figure).
In addition, the threshold function, like the threshold estimate, plays a crucial role in the wavelet threshold denoising algorithm. The commonly used threshold functions are as follows.
Hard threshold function.
The hard threshold function is

ŵ_{j,k} = ω_{j,k},  if |ω_{j,k}| ≥ λ;  ŵ_{j,k} = 0,  otherwise,

where ŵ_{j,k} is the estimated wavelet coefficient, ω_{j,k} is the decomposition wavelet coefficient, and λ is the denoising threshold. Hard thresholding compares ω_{j,k} with λ, sets coefficients smaller than λ to zero and keeps coefficients larger than λ unchanged; this discontinuity can introduce oscillations into the reconstructed signal and degrade the denoising effect.
Soft threshold function.
To eliminate this drawback of hard thresholding, the soft threshold function is introduced:

ŵ_{j,k} = sgn(ω_{j,k}) (|ω_{j,k}| - λ),  if |ω_{j,k}| ≥ λ;  ŵ_{j,k} = 0,  otherwise.

Compared with the hard threshold function, the soft threshold function improves the smoothness of the speech signal, but it also loses features to some extent and causes distortion [74].
Semi-soft threshold function.
To overcome the shortcomings of the soft and hard threshold functions, some researchers have proposed a semi-soft threshold function (formula in the original figure) with a lower threshold λ1 and an upper threshold λ2, 0 < λ1 < λ2, related empirically (relation in the original figure). The value of λ1 is related to the speech content: when there are more unvoiced sounds λ1 takes a smaller value, and when there are more voiced sounds λ1 takes a larger value. By adjusting λ1 and λ2 the method can combine the advantages of the soft and hard thresholds, but the two parameters increase the computational complexity of the algorithm.
Garrote threshold function.
The Garrote threshold function is expressed as

ŵ_{j,k} = ω_{j,k} - λ² / ω_{j,k},  if |ω_{j,k}| > λ;  ŵ_{j,k} = 0,  otherwise.

It keeps only the wavelet coefficients larger than the selected threshold, shrinking them dynamically, and sets the remaining coefficients to zero.
The invention designs a wavelet-transform-based adaptive method to analyse the speech signals after CNMF + JADE separation and thereby realize selective speech enhancement, i.e. speech segments whose quality would be degraded by the enhancement are automatically filtered out before speech enhancement. Analysis of many separated speech segments and of their quality after wavelet-transform enhancement shows that when the distance between the separated speech signals is large, wavelet-transform speech enhancement becomes less effective. Based on this finding, the invention makes the following adaptive judgment before speech enhancement (decision formula in the original figure), with

O_{i-1} = O_i + s_i + l, i = 1, 2, ..., N, O_0 = O, O_N = s_N,

where s_i denotes the i-th teacher speech signal obtained by CNMF + JADE decomposition, O_i denotes the mixed speech remaining after s_i is separated from the mixed speech O_{i-1} by CNMF + JADE, the GMMs corresponding to s_i and O_i are used in the judgment, l denotes the loss incurred during separation, N is the number of speakers contained in the mixed signal, disp(·) is the GMM distance formula presented above, and p is a scaling factor taking values in [1, 1.2].
1. The invention designs a teacher voice extraction method for the complex situation of classroom teaching and broadens the application scope of the informatized classroom; it is not only an important component of the intelligent classroom (artificial intelligence plus education) but also a new embodiment of future education. As far as the available data show, research of this type is currently very scarce and essentially no usable framework or theory has yet been formed. The invention can advance intelligent-classroom research considerably and open up a new field of education methodology based on artificial intelligence.
2. The invention recognizes and extracts classroom teacher speech in a single-channel, adaptive and unsupervised way. Compared with existing methods it needs no prior knowledge, and it adapts well to classroom speech of different forms and lengths and to different classroom environments. At the same time, the method presented here can be applied not only to classroom teaching but also to fields such as conferences, hearing aids and communications (for example, combining the speech separation technology with a hearing aid gives the hearing aid a stronger signal processing capability and improves its speech quality).
3. The invention designs and implements an improved GMM-Kmeans clustering method, which clusters with the GMM model as the feature, preserving the original characteristics to the greatest extent and improving the clustering accuracy. Using the GMM as the feature and computing distances between models avoids processing long speech signals directly, shortens the processing time of the algorithm, and achieves classroom speech recognition with both high accuracy and high speed.
4. On the basis of the GMM-Kmeans clustering algorithm and taking the influence of the environment into account, suitable speech segments are selected adaptively from the clustering result, a GGMM model is constructed, a similarity threshold is obtained adaptively, and teacher utterances are detected a second time, so that an accurate teacher speech class is obtained. All thresholds are obtained adaptively from the classroom speech data through designed formulas, without manual intervention, so the algorithm is highly robust to different classroom environments and classroom situations.
5. The invention designs and implements a CNMF + JADE speech separation algorithm, applying JADE to perform a second speech separation on the CNMF separation result, which effectively improves the separation result.
6. The invention designs and implements an adaptive wavelet-transform speech enhancement method, which performs adaptive judgment on the speech after CNMF + JADE separation, filters out the speech segments that are unsuitable for enhancement, and then denoises the remaining speech signals in a targeted way.
While the invention has been described in connection with what is presently considered to be the most practical and preferred embodiment, it is to be understood that the invention is not to be limited to the disclosed embodiment, but on the contrary, is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (9)

1. A target speaker voice extraction method based on a single channel and an unsupervised mode is characterized by comprising a teacher language detection step and a teacher language GGMM model training step;
the teacher language detection step comprises the following steps:
s1: obtaining voice data for classroom recording;
s2: carrying out voice signal processing;
s3: voice segmentation and modeling, wherein the voice segmentation comprises equal-length segmentation of class voices, then extracting corresponding MFCC (Mel frequency cepstrum coefficient) features for each section of voice, and constructing a GMM (Gaussian mixture model) of each section of voice based on the MFCC features;
s4: the teacher voice detection is to calculate the similarity between the GMM model of each voice section outside the teacher utterance category and the GGMM, set an adaptive threshold value, and mark the voice sections smaller than the threshold value as the teacher utterance category, so as to obtain the final teacher utterance category;
the teacher language GGMM model training step comprises the following steps:
s5: clustering the voice data obtained in step S3; obtaining an initial teacher utterance category, and extracting a GGMM model based on the initial teacher utterance category; the clustering process comprises the following steps:
s51: selecting a clustering central point;
s52: calculating the distances between all samples and the central points, and iterating until a preset stopping condition is met;
s53: step S51 and step S52 are executed circularly for n times, n teacher voice division groups can be obtained, and the division group with the maximum satisfaction degree is selected as initial teacher voice according to a set rule;
s54: selecting a plurality of training GGMM models from the division group, and calculating the average distance in the class;
s55: performing secondary judgment on the rest voice sample segments according to the GGMM and the average distance, and adding the samples into the class of teachers if the distance is smaller than a set threshold value;
s56: and outputting all teacher voice samples and writing the teacher voice samples into a database.
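The claim above describes the pipeline in prose; the sketch below shows one plausible realisation of step S3 (equal-length segmentation, MFCC extraction, one GMM per segment) using librosa and scikit-learn. Segment length, MFCC order and GMM size are illustrative choices, not values fixed by the claim.

# Sketch of equal-length segmentation, MFCC extraction and per-segment GMMs
# (step S3); parameter values are illustrative assumptions.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def segment_and_model(wav_path, seg_seconds=3.0, n_mfcc=13, n_components=8):
    y, sr = librosa.load(wav_path, sr=16000)
    seg_len = int(seg_seconds * sr)
    gmms, feats = [], []
    for start in range(0, len(y) - seg_len + 1, seg_len):
        seg = y[start:start + seg_len]
        # MFCC features: one n_mfcc-dimensional vector per frame (frames x coefficients).
        mfcc = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=n_mfcc).T
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        gmm.fit(mfcc)
        feats.append(mfcc)
        gmms.append(gmm)
    return gmms, feats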
2. The single-channel, unsupervised target-speaker speech extraction method of claim 1, wherein step S51 specifically comprises:
S511: randomly selecting one of all the speech segments as the first center point;
S512: calculating the distances between the GMM models of the remaining speech segments and that of the first center point, and selecting the speech segment with the largest distance as the second center point;
S513: successively calculating the distances between the speech segments not yet selected as center points and the existing center points, and selecting the speech segment farthest from the center points as the next center point;
S514: iterating until the number of center points reaches the specified number of categories.
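One way to realise the centre selection of S511–S514 is farthest-point initialisation over a precomputed matrix of pairwise GMM distances; the sketch below assumes such a symmetric distance matrix dist and a desired category count k.

# Farthest-point selection of k cluster centers from a pairwise distance matrix
# (S511-S514). 'dist' is assumed to be a symmetric (n x n) numpy array.
import numpy as np

def init_centers(dist, k, seed=0):
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    centers = [int(rng.integers(n))]           # S511: random first center
    while len(centers) < k:                    # S514: until k centers chosen
        # S512/S513: distance of every segment to its nearest chosen center.
        d_to_centers = dist[:, centers].min(axis=1)
        d_to_centers[centers] = -np.inf        # never re-pick an existing center
        centers.append(int(d_to_centers.argmax()))
    return centers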
3. The single-channel, unsupervised target-speaker speech extraction method of claim 2, wherein step S52 specifically comprises:
S521: calculating the distances between the remaining GMM models and the center points, and assigning each GMM to its nearest center point;
S522: updating the center points, taking as the new center point of each class the point whose sum of distances to all points in that class is minimal;
S523: iterating until a preset stopping condition is met or a specified number of iterations is reached.
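A matching sketch for S521–S523: assign each segment to its nearest centre and update each centre to the class medoid, again over the assumed precomputed distance matrix; the stopping rule (stable assignments or a maximum iteration count) is an assumption.

# K-medoids style iteration over GMM distances (S521-S523).
import numpy as np

def cluster(dist, centers, max_iter=50):
    labels = None
    for _ in range(max_iter):
        # S521: assign every segment to its nearest center.
        new_labels = dist[:, centers].argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                              # S523: stop when assignments are stable
        labels = new_labels
        # S522: new center = member minimising the sum of distances to its class.
        for c in range(len(centers)):
            members = np.flatnonzero(labels == c)
            if members.size:
                within = dist[np.ix_(members, members)].sum(axis=1)
                centers[c] = int(members[within.argmin()])
    return labels, centers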
4. The single-channel, unsupervised target-speaker speech extraction method of claim 3, wherein step S53 specifically comprises: performing similarity calculation on the N teacher-category vectors obtained by the iterations, and taking the vector whose sum of similarities with the remaining N-1 vectors is largest as the initial teacher category produced by the clustering.
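For the selection rule of S53, the sketch below assumes each clustering run produces a candidate teacher category encoded as a 0/1 membership vector and uses cosine similarity purely for illustration; the claim does not fix the similarity measure.

# Pick the candidate teacher category most consistent with the others (S53):
# the one whose summed similarity to the remaining candidates is largest.
import numpy as np

def pick_initial_teacher(candidates):
    """candidates: list of equal-length 0/1 membership vectors (assumption)."""
    C = np.asarray(candidates, dtype=float)
    # Cosine similarity between every pair of candidate vectors (illustrative).
    norms = np.linalg.norm(C, axis=1, keepdims=True)
    sim = (C @ C.T) / (norms @ norms.T + 1e-12)
    scores = sim.sum(axis=1) - 1.0             # exclude self-similarity
    return candidates[int(scores.argmax())]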
5. The single-channel, unsupervised target-speaker speech extraction method of claim 4, wherein step S54 specifically comprises: randomly selecting from the teacher category a number of speech segments given by the formula shown in Figure FDA0002685778340000022 (also Figure FDA0002685778340000023), where M is the number of speech segments in the teacher category obtained by clustering; the random selection reduces the time that training GMM models on all speech segments of the teacher category would require; N is a constant obtained adaptively from the size of M by the formula shown in Figure FDA0002685778340000021, wherein α is a time-adjustment parameter that regulates the number of speech segments used for GMM training, length(C) denotes the total number of speech segments obtained by segmenting the original classroom speech, and the coefficient 0.4 × length(C) represents the minimum number of teacher speech segments.
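Step S54 trains the group GMM (GGMM) on a random subset of the teacher segments. In the sketch below the subset-size rule is only a stand-in for the claim's formula, which appears as an image in the original; alpha plays the role of the time-adjustment parameter α, and 0.4 × total_segments is used as the stated minimum.

# Train a group GMM (GGMM) on a random subset of teacher segments (S54).
# The subset-size rule is an illustrative stand-in for the claim's formula.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ggmm(teacher_feats, total_segments, alpha=2.0, n_components=16, seed=0):
    """teacher_feats: list of (frames x mfcc) arrays for the M teacher segments."""
    rng = np.random.default_rng(seed)
    M = len(teacher_feats)
    floor = int(0.4 * total_segments)          # minimum teacher segments (claim 5)
    subset_size = min(M, max(int(M / alpha), floor))   # assumed rule, not the patent's
    idx = rng.choice(M, size=subset_size, replace=False)
    stacked = np.vstack([teacher_feats[i] for i in idx])
    ggmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    ggmm.fit(stacked)
    return ggmm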
6. The single-channel, unsupervised target-speaker speech extraction method of claim 1, wherein step S3 comprises:
S31: detecting overlapped speech to obtain the overlapped speech segments in the classroom speech;
S32: judging whether the overlapped speech contains teacher speech;
S33: selecting the speech segment closest to the overlapped speech as a training speech segment;
S34: performing speech separation with the designed CNMF + JADE method.
7. The single-channel, unsupervised target-speaker speech extraction method of claim 6, wherein step S31 comprises:
locating the overlapped speech segments by means of silence points, a silent frame being judged by setting an energy threshold, the energy threshold being obtained by the formula shown in Figure FDA0002685778340000031, wherein E_i denotes the energy of the i-th speech frame (Figure FDA0002685778340000032), N is the total number of frames in the speech segment, r is a constant in the range (0,1), and the symbol shown in Figure FDA0002685778340000033 denotes rounding up.
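For the silence test of S31, the sketch below computes per-frame energies and takes the ⌈N·r⌉-th smallest energy as the threshold, which is one plausible reading of the formula that is reproduced only as an image; frame length, hop and r are illustrative.

# Adaptive energy threshold for silence-frame detection (S31). The exact
# threshold formula is an image in the original; the sorted-energy rule used
# here is an assumption consistent with the stated symbols (E_i, N, r, ceil).
import numpy as np

def silence_mask(y, frame_len=400, hop=160, r=0.2):
    n_frames = max(1 + (len(y) - frame_len) // hop, 1)
    energies = np.array([
        np.sum(y[i * hop:i * hop + frame_len] ** 2) for i in range(n_frames)
    ])
    k = int(np.ceil(n_frames * r))             # ceil(N * r), treated as a 1-based rank
    threshold = np.sort(energies)[max(k - 1, 0)]
    return energies <= threshold               # True where the frame is silent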
8. The single-channel, unsupervised target-speaker speech extraction method of claim 7, wherein step S32 comprises: judging whether the overlapped speech contains the teacher by using GMM similarity, the similarity being judged with an improved Bhattacharyya distance according to the criterion shown in Figure FDA0002685778340000034, wherein disp(A, B) denotes the distance between the GMM models of speech segments A and B, A denotes the overlapped speech segment, B denotes the teacher speech segment, and t denotes an adaptive threshold calculated by the formula shown in Figure FDA0002685778340000035, wherein p is an adjustment parameter taking values in [0.5, 0.8], K is the number of student speech segments, X_i is the i-th student speech segment, and B is the teacher speech segment.
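A sketch of the S32 decision, assuming disp() is approximated by collapsing each fitted diagonal-covariance GMM to a single Gaussian and taking the Bhattacharyya distance between the two Gaussians, and assuming the adaptive threshold t = p × the average distance of the K student segments to the teacher segment; both are plausible readings, not the patent's exact formulas, which appear only as images.

# GMM-similarity test for overlapped speech (S32). disp() collapses each GMM
# to a single diagonal Gaussian and uses the Bhattacharyya distance between
# them -- an illustrative approximation, not the patent's exact measure.
import numpy as np

def collapse(gmm):
    """Moment-match a fitted sklearn GaussianMixture (diag covariances) to one Gaussian."""
    w = gmm.weights_[:, None]
    mean = (w * gmm.means_).sum(axis=0)
    second = (w * (gmm.covariances_ + gmm.means_ ** 2)).sum(axis=0)
    return mean, second - mean ** 2

def disp(gmm_a, gmm_b):
    m1, v1 = collapse(gmm_a)
    m2, v2 = collapse(gmm_b)
    v = 0.5 * (v1 + v2)
    term1 = 0.125 * np.sum((m1 - m2) ** 2 / v)
    term2 = 0.5 * (np.sum(np.log(v)) - 0.5 * (np.sum(np.log(v1)) + np.sum(np.log(v2))))
    return term1 + term2

def contains_teacher(overlap_gmm, teacher_gmm, student_gmms, p=0.6):
    # Assumed threshold rule: t = p * mean distance of student segments to teacher.
    t = p * np.mean([disp(g, teacher_gmm) for g in student_gmms])
    return disp(overlap_gmm, teacher_gmm) < t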
9. The single-channel, unsupervised target-speaker speech extraction method of claim 8, wherein step S33 comprises: selecting the non-teacher speech segment closest to the overlapped speech to train the CNMF together with the teacher speech segment, the selection being made as follows:
v_i = min(disp(A_i, S_j)), i = 1, 2, ..., N, j = 1, 2, ..., K
wherein A_i is the i-th overlapped speech segment, S_j is the j-th non-teacher speech segment, and v_i is the corresponding i-th selected training speech segment.
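The selection of S33 then reduces to an argmin over the non-teacher segments; disp is any GMM distance function, for example the sketch given for claim 8.

# For each overlapped segment, pick the closest non-teacher segment as the
# extra CNMF training segment (S33).
def select_training_segments(overlap_gmms, student_gmms, disp):
    """disp: distance function between two GMMs, e.g. the sketch for claim 8."""
    return [
        min(range(len(student_gmms)), key=lambda j: disp(a, student_gmms[j]))
        for a in overlap_gmms
    ]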
CN201810832080.5A 2018-07-26 2018-07-26 Single-channel and unsupervised target speaker voice extraction method Active CN108962229B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810832080.5A CN108962229B (en) 2018-07-26 2018-07-26 Single-channel and unsupervised target speaker voice extraction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810832080.5A CN108962229B (en) 2018-07-26 2018-07-26 Single-channel and unsupervised target speaker voice extraction method

Publications (2)

Publication Number Publication Date
CN108962229A CN108962229A (en) 2018-12-07
CN108962229B true CN108962229B (en) 2020-11-13

Family

ID=64464209

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810832080.5A Active CN108962229B (en) 2018-07-26 2018-07-26 Single-channel and unsupervised target speaker voice extraction method

Country Status (1)

Country Link
CN (1) CN108962229B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948144B (en) * 2019-01-29 2022-12-06 汕头大学 Teacher utterance intelligent processing method based on classroom teaching situation
CN110544481B (en) * 2019-08-27 2022-09-20 华中师范大学 S-T classification method and device based on voiceprint recognition and equipment terminal
CN110544482B (en) * 2019-09-09 2021-11-12 北京中科智极科技有限公司 Single-channel voice separation system
CN110874879A (en) * 2019-10-18 2020-03-10 平安科技(深圳)有限公司 Old man registration method, device, equipment and storage medium based on voice recognition
CN111179962B (en) * 2020-01-02 2022-09-27 腾讯科技(深圳)有限公司 Training method of voice separation model, voice separation method and device
US11829920B2 (en) * 2020-07-13 2023-11-28 Allstate Insurance Company Intelligent prediction systems and methods for conversational outcome modeling frameworks for sales predictions
CN112017685B (en) * 2020-08-27 2023-12-22 抖音视界有限公司 Speech generation method, device, equipment and computer readable medium
CN117577124B (en) * 2024-01-12 2024-04-16 京东城市(北京)数字科技有限公司 Training method, device and equipment of audio noise reduction model based on knowledge distillation

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101866421A (en) * 2010-01-08 2010-10-20 苏州市职业大学 Method for extracting characteristic of natural image based on dispersion-constrained non-negative sparse coding
CN102568477A (en) * 2010-12-29 2012-07-11 盛乐信息技术(上海)有限公司 Semi-supervised pronunciation model modeling system and method
CN102682760A (en) * 2011-03-07 2012-09-19 株式会社理光 Overlapped voice detection method and system
CN103680517A (en) * 2013-11-20 2014-03-26 华为技术有限公司 Method, device and equipment for processing audio signals
CN103854644A (en) * 2012-12-05 2014-06-11 中国传媒大学 Automatic duplicating method and device for single track polyphonic music signals
CN104167208A (en) * 2014-08-08 2014-11-26 中国科学院深圳先进技术研究院 Speaker recognition method and device
CN105096955A (en) * 2015-09-06 2015-11-25 广东外语外贸大学 Speaker rapid identification method and system based on growing and clustering algorithm of models
CN105957537A (en) * 2016-06-20 2016-09-21 安徽大学 Voice denoising method and system based on L1/2 sparse constraint convolution non-negative matrix decomposition
JP2018502319A (en) * 2015-07-07 2018-01-25 三菱電機株式会社 Method for distinguishing one or more components of a signal

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050125223A1 (en) * 2003-12-05 2005-06-09 Ajay Divakaran Audio-visual highlights detection using coupled hidden markov models

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101866421A (en) * 2010-01-08 2010-10-20 苏州市职业大学 Method for extracting characteristic of natural image based on dispersion-constrained non-negative sparse coding
CN102568477A (en) * 2010-12-29 2012-07-11 盛乐信息技术(上海)有限公司 Semi-supervised pronunciation model modeling system and method
CN102682760A (en) * 2011-03-07 2012-09-19 株式会社理光 Overlapped voice detection method and system
CN103854644A (en) * 2012-12-05 2014-06-11 中国传媒大学 Automatic duplicating method and device for single track polyphonic music signals
CN103680517A (en) * 2013-11-20 2014-03-26 华为技术有限公司 Method, device and equipment for processing audio signals
CN104167208A (en) * 2014-08-08 2014-11-26 中国科学院深圳先进技术研究院 Speaker recognition method and device
JP2018502319A (en) * 2015-07-07 2018-01-25 三菱電機株式会社 Method for distinguishing one or more components of a signal
CN105096955A (en) * 2015-09-06 2015-11-25 广东外语外贸大学 Speaker rapid identification method and system based on growing and clustering algorithm of models
CN105957537A (en) * 2016-06-20 2016-09-21 安徽大学 Voice denoising method and system based on L1/2 sparse constraint convolution non-negative matrix decomposition

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CNMF-based acoustic features for noise-robust ASR; Colin Vaz; 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2016-05-19; pp. 5735-5739 *
Single-channel speech separation based on deep learning; Li Hao; China Masters' Theses Full-text Database, Information Science and Technology; 2017-11-15; pp. I136-139 *

Also Published As

Publication number Publication date
CN108962229A (en) 2018-12-07

Similar Documents

Publication Publication Date Title
CN108962229B (en) Single-channel and unsupervised target speaker voice extraction method
Vasquez et al. Melnet: A generative model for audio in the frequency domain
Yu et al. Speech enhancement based on denoising autoencoder with multi-branched encoders
JP2002014692A (en) Device and method for generating acoustic model
CN106128477B (en) A kind of spoken identification correction system
CN108615533A (en) A kind of high-performance sound enhancement method based on deep learning
CN103985381A (en) Voice frequency indexing method based on parameter fusion optimized decision
Li et al. ILMSAF based speech enhancement with DNN and noise classification
Ismail et al. Mfcc-vq approach for qalqalahtajweed rule checking
Do et al. Speech source separation using variational autoencoder and bandpass filter
Vignolo et al. Feature optimisation for stress recognition in speech
CN116092512A (en) Small sample voice separation method based on data generation
Wang Supervised speech separation using deep neural networks
Ling An acoustic model for English speech recognition based on deep learning
Kamble et al. Emotion recognition for instantaneous Marathi spoken words
CN116347723A (en) Control system of sound control type wall switch with adjustable lamplight sample color
MY An improved feature extraction method for Malay vowel recognition based on spectrum delta
CN111833851B (en) Method for automatically learning and optimizing acoustic model
CN114678039A (en) Singing evaluation method based on deep learning
Hashemi et al. Persian music source separation in audio-visual data using deep learning
Kim Noise-Tolerant Self-Supervised Learning for Audio-Visual Voice Activity Detection.
Iswarya et al. Speech query recognition for Tamil language using wavelet and wavelet packets
Shahrul Azmi et al. Noise robustness of Spectrum Delta (SpD) features in Malay vowel recognition
Zhao et al. Enhancing audio perception in augmented reality: a dynamic vocal information processing framework
Lung Feature extracted from wavelet decomposition using biorthogonal Riesz basis for text-independent speaker recognition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant