CN108962229B - Single-channel and unsupervised target speaker voice extraction method - Google Patents

Single-channel and unsupervised target speaker voice extraction method

Info

Publication number
CN108962229B
Authority
CN
China
Prior art keywords
voice
teacher
speech
gmm
category
Prior art date
Legal status
Active
Application number
CN201810832080.5A
Other languages
Chinese (zh)
Other versions
CN108962229A (en)
Inventor
姜大志
陈逸飞
Current Assignee
Shantou University
Original Assignee
Shantou University
Priority date
Filing date
Publication date
Application filed by Shantou University
Priority to CN201810832080.5A
Publication of CN108962229A
Application granted
Publication of CN108962229B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/005 Language recognition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0631 Creating reference templates; Clustering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0635 Training updating or merging of old and new templates; Mean values; Weighting
    • G10L2015/0636 Threshold criteria for the updating
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0638 Interactive procedures
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The embodiment of the invention discloses a single-channel, unsupervised target speaker voice extraction method, which comprises a teacher language detection step and a teacher language GGMM model training step. The teacher language detection step comprises: obtaining voice data from a classroom recording; carrying out voice signal processing; voice segmentation and modeling, in which the classroom voice is segmented into equal-length sections, corresponding MFCC (Mel-frequency cepstral coefficient) features are extracted for each section, and a GMM (Gaussian mixture model) is constructed for each section based on the MFCC features; and teacher voice detection, in which the similarity between the GMM model of each voice section outside the teacher utterance category and the GGMM is calculated and the voice sections below a set threshold are marked as the teacher utterance category, yielding the final teacher utterance category. The teacher language GGMM model training step comprises: clustering the segmented voice data obtained in step S3 to obtain an initial teacher utterance category, and extracting the GGMM model based on the initial teacher utterance category. The invention effectively improves the adaptability and intelligence of the system in practical application and lays a foundation for subsequent application and research.

Description

Single-channel and unsupervised target speaker voice extraction method
Technical Field
The invention relates to a voice extraction method, in particular to a target speaker voice extraction method based on a single channel and an unsupervised mode under a complex multi-speaker situation.
Background
Guaranteeing education quality is essential at every level of education, and improving teaching quality, especially classroom teaching quality, is central to that goal. The conventional approach relies on manual (peer) on-site observation and evaluation. Although it can play a certain role, it lacks general operability and objectivity: the education authorities cannot observe classes, make evaluations and give suggestions at all times and in all places, and attempting to do so would impose a heavy and unnecessary burden on teaching management. Moreover, traditional on-site observation cannot follow the whole teaching process, so the teaching quality of teachers is difficult to evaluate objectively.
Information and intelligent technologies have become important supports of social development; how to use and develop them to reform the traditional classroom and build efficient, automatic intelligent perception for classroom teaching has become a scientific problem of great research value.
To realize intelligent perception for classroom teaching, the first problem to be solved is teacher voice recognition and acquisition.
At present, besides supervised speaker recognition methods, unsupervised speaker recognition is mainly speaker clustering, and teacher speech recognition in classroom speech generally also belongs to this area. Research on speaker clustering mainly falls into four categories: 1. hierarchical clustering; 2. K-means clustering; 3. spectral clustering; 4. affinity propagation clustering.
In the article "Research and implementation of an unsupervised speaker clustering method", the running efficiency of a spectral clustering algorithm based on a feature similarity matrix is studied, and a spectral clustering algorithm that builds the model similarity matrix through an adaptive Gaussian mixture model is implemented. First, a Gaussian mixture model (GMM) of the target speaker is obtained by training on a speech segment with the GMM-UBM-MAP technique, i.e. a universal background model (UBM) is trained offline and adapted according to the maximum a posteriori (MAP) criterion. Then the similarity between GMM models is computed to build a similarity matrix, features are extracted from the matrix and clustered, and the speaking part of the target person is obtained.
In the article "Improved speaker cluster initialization and GMM multi-speaker recognition", Mel-frequency cepstral coefficient (MFCC) features of the speech segments are extracted; in the training part the Bayesian information criterion (BIC) is used to process the initial classes to obtain purer initial classes, a clustering algorithm is then applied to the MFCC features, GMM model features are trained for each class, and in the recognition stage speakers are judged with GMM-based speaker recognition.
For extracting the teacher's speech, besides recognizing the teacher speaking alone, speech separation of the overlapping speech containing the teacher's utterances is required. The purpose of speech separation is to separate the speech of interest from several simultaneously sounding sources. According to the relationship between the source signals and the collected mixed signal, speech separation is divided into multi-channel and single-channel separation. Single-channel speech needs only a single recording source, so it is easier to acquire than multi-channel signals and better matches actual conditions, but separation of a single-channel speech signal is more difficult. Research on single-channel speech separation falls mainly into 3 types: 1. based on computational auditory scene analysis (CASA); 2. model-based; 3. based on time-frequency distribution.
In the article "An Audio Scene Analysis Approach to Monaural Speech Segregation", Hu and Wang propose a CASA-based speech separation framework. By simulating the characteristics of the basilar membrane of the human cochlea, the mixed signal is decomposed into a time-frequency representation and the features required for speech separation are extracted for auditory time-frequency segmentation; adjacent time-frequency units of the same sound source are merged into auditory segments, the auditory segments of the same source are then merged, and speech separation is finally realized by waveform synthesis for each source. Later, Hu and Wang made a series of improvements to the CASA system, including optimizing the separation of voiced and unvoiced signals. The article "CNMF-based acoustical features for noise-robust ASR" indicates that NMF is an unsupervised dictionary-learning method that works well for various types of signal separation. The NMF algorithm requires purely additive operations, all components after decomposition are non-negative matrices, and it can perform matrix dimensionality reduction. With continued research the NMF algorithm has proven fast and accurate and very convenient for large-scale data, so it is widely applied in many fields.
In the above prior art, there are the following drawbacks:
1. In unsupervised speaker clustering recognition, whether the minimum inter-cluster distance exceeds a certain threshold is used as the criterion for ending clustering, so the effectiveness of the hierarchical clustering algorithm is limited by how that threshold is determined.
2. The article "Research and implementation of an unsupervised speaker clustering method" combines GMM-UBM-MAP with spectral clustering over a feature similarity matrix, which requires training GMM models of the speech signal and therefore cannot achieve completely unsupervised speaker recognition. In addition, the method requires the speakers' segments in the speech to be analysed to be fairly evenly distributed, places high demands on the "purity" of each speaker segment, and adapts poorly to real scenes of various forms.
3. The article "Improved speaker clustering initialization and GMM multi-speaker recognition" clusters the MFCC coefficients directly. Since MFCC features are extracted frame by frame, the computation is heavy for a long recording such as a 40 min classroom session, and the clustering accuracy cannot be well guaranteed.
4. The article "An Audio Scene Analysis Approach to Monaural Speech Segregation" performs speech separation based on CASA by simulating the human ear, but the human-ear characteristics to model are difficult to select.
5. The article "CNMF-based acoustical features for noise-robust ASR" requires that training speech of the separate sources is given beforehand.
6. The result of single-channel speech separation is still affected by noise, and existing speech separation methods rarely denoise the separation result further to purify the separated speech signal.
Disclosure of Invention
The technical problem to be solved by the embodiments of the present invention is to provide a single-channel, unsupervised target speaker voice extraction method. By utilizing and developing relevant information and intelligent technical means, classroom voice signals can be acquired, analyzed, processed and identified, and on the basis of a constructed adaptive, unsupervised intelligent method the teacher voice part can be robustly detected and extracted from the classroom voice signals.
In order to solve the above technical problems, an embodiment of the present invention provides a single-channel-based unsupervised target speaker speech extraction method, including a teacher language detection step and a teacher language GGMM (general Gaussian mixture model) model training step;
the teacher language detection step comprises the following steps:
s1: obtaining voice data for classroom recording;
s2: carrying out voice signal processing;
s3: voice segmentation and modeling, wherein the voice segmentation comprises equal-length segmentation of class voices, then extracting corresponding MFCC (Mel frequency cepstrum coefficient) features for each section of voice, and constructing a GMM (Gaussian mixture model) of each section of voice based on the MFCC features;
s4: the teacher voice detection is to calculate the similarity between the GMM model of each voice section outside the teacher utterance category and the GGMM, set an adaptive threshold value, and mark the voice sections smaller than the threshold value as the teacher utterance category, so as to obtain the final teacher utterance category;
the teacher language GGMM model training step comprises the following steps:
s5: clustering the voice data obtained in step S3; an initial teacher utterance class is obtained, and a GGMM model is extracted based on the initial teacher utterance class.
Further, the clustering process includes the steps of:
s51: selecting a clustering central point;
s52: calculating the distances between all samples and the central points, and iterating until a preset stopping condition is met;
s53: step S51 and step S52 are executed circularly for n times, n teacher voice division groups can be obtained, and the division group with the maximum satisfaction degree is selected as initial teacher voice according to a set rule;
s54: selecting a plurality of training GGMM models from the division group, and calculating the average distance in the class;
s55: performing secondary judgment on the remaining voice sample segments according to the GGMM and the average distance, and adding a sample to the teacher class if its distance is smaller than a set threshold value;
s56: and outputting all teacher voice samples and writing the teacher voice samples into a database.
Further, the step of S51 specifically includes:
s511: randomly selecting one from all the voice sections as a first central point;
s512: calculating the distance between the residual voice sections and the GMM model of the first central point, and selecting the voice section with the largest distance as a second central point;
s513: sequentially calculating the distance between the voice section which is not selected as the central point and the central point, and selecting the voice section which has the largest distance from the central point as the next central point;
s514: and iterating until the number of the central points reaches the number of the specified categories.
Further, the step of S52 specifically includes:
s521: calculating the distance between the remaining part of GMM models and the central point, and dividing each GMM into the nearest central points;
s522: updating the central point, and taking the point in each type with the minimum sum of distances to all points in the type as a new central point;
s523: and iterating until a preset stop condition is met or iterating for a specified number of times.
Further, the step of S53 specifically includes: performing similarity calculation on the N teacher category vectors obtained through the iterations, and taking the vector whose summed similarity to the remaining N-1 vectors is largest as the initial teacher category obtained by the final clustering.
Further, the step of S54 specifically includes: randomly selecting a certain number of speech segments from the teacher category (the number is given by the formula in the original figure), where M is the number of speech segments in the teacher category obtained by clustering; only a random subset is taken in order to reduce the time of GMM model training over all speech segments in the teacher category, and N is a constant obtained adaptively from the size of M, as follows (formula in the original figure): α is a time adjustment parameter for adjusting the number of speech segments used for GMM training, length(C) denotes the total number of speech segments obtained by segmenting the original classroom speech, and the coefficient 0.4 × length(C) represents the minimum number of teacher speech segments.
Further, the S3 includes:
s31: detecting the overlapped voice to obtain overlapped voice sections in the classroom voice;
s32: judging whether the overlapped voice contains teacher voice;
s33: selecting a voice segment closest to the overlapped voice as a training voice segment;
s34: and designing a CNMF + JADE method for voice separation.
Further, the S31 includes:
obtaining overlapping speech segments by using silence points; silent frames are determined by setting an energy threshold, which is obtained as follows (formula in the original figure): E_i denotes the energy of the i-th speech frame, N is the total number of frames in the speech segment, r is a constant in the range (0, 1), and the formula uses rounding up.
Further, the S32 includes: judging whether the overlapping speech contains the teacher by means of GMM similarity, where the similarity is measured with an improved Bhattacharyya distance and the judgment criterion is as follows (formula in the original figure): disp(A, B) denotes the distance between the GMM models of speech segments A and B, A denotes the overlapping speech segment, B denotes the teacher speech segment, and t is an adaptive threshold calculated as follows (formula in the original figure): p is an adjustment parameter taking values in [0.5, 0.8], K is the number of student speech segments, X_i is the i-th student speech segment, and B is the teacher speech segment.
Further, the S33 includes: selecting the non-teacher speech segment closest to the overlapping speech to train the CNMF together with the teacher speech segment, selected as follows:
v_i = min(disp(A_i, S_j)), i = 1, 2, ..., N, j = 1, 2, ..., K
where A_i is the i-th overlapping speech segment, S_j is the j-th non-teacher speech segment, and v_i is the correspondingly selected i-th training speech segment.
The embodiment of the invention has the following beneficial effects: the invention provides an unsupervised and self-adaptive robust teacher voice detection and extraction method for classroom teaching with high complexity (mainly comprising the diversity of classroom situations, the diversity of teacher subjects and the diversity of teacher classroom organizations), so that the adaptability and intelligence of the system in practical application are effectively improved, and a foundation is laid for subsequent application and research.
Drawings
FIG. 1 is a schematic diagram of the framework flow structure of the present invention;
FIG. 2 is a flow chart illustrating the teacher language detection step;
FIG. 3 is a schematic diagram of teacher language GGMM model training steps;
FIG. 4 is a flow chart illustrating the steps of a clustering algorithm;
FIG. 5 is a schematic diagram of a speech separation implementation step;
FIG. 6 is a schematic diagram of the speech enhancement implementation steps.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings.
Referring to fig. 1, the single-channel unsupervised target speaker voice extraction method includes a teacher language detection step and a teacher language GGMM model training step.
As shown in fig. 2, the teacher language detection comprises the following steps:
s110, recording;
s120, preprocessing a voice signal;
s130, voice segmentation and modeling;
and S140, teacher voice detection.
As shown in fig. 3, the teacher language GGMM model training comprises the following steps:
s110, recording;
s120, preprocessing a voice signal;
s130, voice segmentation and modeling;
and S240, clustering.
Wherein, the corresponding classroom speech data is obtained by using the sound recording apparatus in S110.
In S120, the classroom voices obtained by recording are preprocessed, which includes common voice preprocessing methods such as framing, windowing, pre-emphasis, and the like.
In S130, the classroom voices are divided into equal lengths, and then, a corresponding MFCC feature is extracted for each section of voice, and a GMM model of each section of voice is constructed based on the MFCC features. Then, the GMM models of the respective voices are used as input data of S240 to perform clustering operation, an initial teacher utterance class is obtained, and the GGMM model is extracted based on the initial teacher utterance class. And S140, carrying out similarity calculation on the GMM model and the GGMM of each voice outside the teacher utterance class, setting a self-adaptive threshold value, and marking the voice smaller than the threshold value as the teacher utterance class, thereby obtaining the final teacher utterance class.
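As an illustration of S110-S130 (and the shared preprocessing of S240), the sketch below cuts a recording into equal-length segments, extracts MFCC features and fits one GMM per segment. It is only a minimal sketch: the librosa and scikit-learn calls, the 16 kHz sample rate, 30 s segment length, 13 MFCC coefficients and 8 mixture components are assumptions for illustration, not values fixed by the patent.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def segment_and_model(wav_path, seg_len_s=30.0, n_mfcc=13, n_components=8):
    """Split a classroom recording into equal-length segments and fit one GMM per segment."""
    y, sr = librosa.load(wav_path, sr=16000)
    seg_len = int(seg_len_s * sr)
    segments = [y[i:i + seg_len] for i in range(0, len(y) - seg_len + 1, seg_len)]
    models = []
    for seg in segments:
        # Frame-level MFCC features, one row per frame
        mfcc = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=n_mfcc).T
        gmm = GaussianMixture(n_components=n_components, covariance_type='diag',
                              max_iter=200, random_state=0).fit(mfcc)
        models.append(gmm)
    return segments, models
```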
The clustering algorithm in S240 is shown in fig. 4.
The specific embodiment of the clustering algorithm comprises the following steps:
s2401, selecting an initial central point;
1) Randomly select one of all the speech segments as the first center point.
2) Compute the distance between the GMM model of each remaining speech segment and that of the first center point, and select the speech segment with the largest distance as the second center point.
3) For each speech segment not yet chosen as a center point, compute its distance to the existing center points, and select the speech segment farthest from them as the next center point.
4) Iterate until the number of center points reaches the specified number of categories.
Compared with random center-point selection, this selection method clearly improves the accuracy of the final clustering result. In practice, because of the stopping condition set in step 3) of S2402, clustering results obtained by taking an outlier as a center point are excluded during iteration, so selecting the initial center points in this way yields robust clustering results.
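A minimal sketch of this farthest-first initialization, assuming a precomputed pairwise GMM distance matrix D (the distance itself is defined below); using the minimum distance to the already-chosen centers and the tie-breaking are illustrative assumptions.

```python
import numpy as np

def init_centers(D, k, seed=0):
    """Farthest-first selection of k initial centers from an n x n GMM distance matrix D."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    centers = [int(rng.integers(n))]                 # 1) random first center
    while len(centers) < k:
        # Distance of every segment to its nearest already-chosen center
        d_nearest = D[:, centers].min(axis=1)
        d_nearest[centers] = -np.inf                 # never re-pick an existing center
        centers.append(int(np.argmax(d_nearest)))    # 2)-4) farthest segment becomes next center
    return centers
```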
The distance between Gaussian mixture models cannot be measured well by the method above alone, so the dispersion of GMM A and GMM B is defined (formula in the original figure) and referred to as the dispersion of GMM A relative to GMM B, where W_Ai denotes the weight of the i-th component of GMM A, W_Bj the weight of the j-th component of GMM B, and d_AB(i, j) the distance between the i-th Gaussian distribution of GMM A and the j-th Gaussian distribution of GMM B. Considering the computational cost, and the possibility that the mean vectors of several Gaussian distributions are exactly the same, this embodiment selects the Mahalanobis distance as the method for computing d_AB(i, j). The Mahalanobis distance between two multidimensional Gaussian distributions (formula in the original figure) is computed from their mean vectors μ1, μ2 and their covariance matrices Σ1, Σ2.
For symmetry, the final GMM distance metric combines the dispersion of A relative to B and the dispersion of B relative to A (formula in the original figure), where A and B denote the two GMM models.
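The patent gives the dispersion and the component distance only as figure formulas, so the sketch below is one plausible reading rather than the patent's exact definition: each component of A is matched to its closest component of B, weighted by A's mixture weights, the two directions are averaged, and a Mahalanobis-style distance between diagonal-covariance components is assumed.

```python
import numpy as np

def gauss_dist(mu1, var1, mu2, var2):
    # Mahalanobis-style distance between two diagonal-covariance Gaussians (assumed form)
    pooled = (var1 + var2) / 2.0
    diff = mu1 - mu2
    return float(diff @ (diff / pooled))

def dispersion(gmm_a, gmm_b):
    # For every component of A: distance to the closest component of B,
    # weighted by A's mixture weights (assumed weighting)
    total = 0.0
    for wa, ma, va in zip(gmm_a.weights_, gmm_a.means_, gmm_a.covariances_):
        total += wa * min(gauss_dist(ma, va, mb, vb)
                          for mb, vb in zip(gmm_b.means_, gmm_b.covariances_))
    return total

def gmm_distance(gmm_a, gmm_b):
    # Symmetrized distance disp(A, B)
    return 0.5 * (dispersion(gmm_a, gmm_b) + dispersion(gmm_b, gmm_a))
```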
S2402, calculating the distances between all samples and the center points, and iterating until a preset stopping condition is met;
1) Compute the distance between each remaining GMM model and the center points, and assign each GMM to the nearest center point.
2) Update the center points: within each class, take the point whose sum of distances to all other points in the class is smallest as the new center point.
3) Iterate until a preset stopping condition is met (output when the largest category in the clustering result contains more than 40% of all speech segments and more segments than the second largest category) or until a specified number of iterations is reached.
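A sketch of this assignment and medoid-style update loop over the precomputed GMM distance matrix; the 40% stopping rule follows the text, while the function layout and parameter names are illustrative.

```python
import numpy as np

def gmm_kmeans(D, centers, max_iter=50, min_share=0.4):
    """Medoid-style clustering over the precomputed GMM distance matrix D."""
    n = D.shape[0]
    labels = np.argmin(D[:, centers], axis=1)
    for _ in range(max_iter):
        labels = np.argmin(D[:, centers], axis=1)       # 1) assign to nearest center
        new_centers = []
        for c in range(len(centers)):                   # 2) medoid update per class
            members = np.where(labels == c)[0]
            if members.size == 0:
                new_centers.append(centers[c])
                continue
            sums = D[np.ix_(members, members)].sum(axis=1)
            new_centers.append(int(members[np.argmin(sums)]))
        sizes = np.sort(np.bincount(labels, minlength=len(centers)))[::-1]
        # 3) stop when the largest class holds >40% of the segments and beats the second largest
        if sizes[0] > min_share * n and sizes[0] > sizes[1]:
            break
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers
```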
S2403, steps S2401 and S2402 are executed cyclically n times to obtain n teacher speech partitions, and the partition with the greatest satisfaction is selected as the initial teacher speech according to a set rule.
Specifically, similarity calculation is performed on the N teacher category vectors obtained through the iterations, and the vector whose summed similarity to the remaining N-1 vectors is largest is taken as the initial teacher category produced by the final clustering. Because the N teacher category vectors do not all have the same length, they must be processed to a common length before the similarity calculation.
In the present embodiment, the zero padding method is used to equalize the vector lengths.
The longest of the N teacher category vectors is found and its length is recorded as M; all vectors are extended to length M, and the missing part is filled with zero elements, namely:
M = max(length(T_1), length(T_2), ..., length(T_N))
T_i = [T_i, Append_i], i = 1, 2, ..., N
Append_i = zeros(1, M - length(T_i)), i = 1, 2, ..., N
where T_1, T_2, ..., T_N are the N teacher category vectors, M is the longest vector length, length(T) denotes the length of vector T, Append_i is the zero-element vector appended to the i-th teacher category vector, and zeros(i, j) denotes a zero vector with i rows and j columns.
In this embodiment the teacher category vectors are brought to a uniform length by zero padding and the pairwise similarity is then computed. Because zero elements are added artificially, similarity measures based on inter-vector distance (for example the Euclidean distance) incur large errors, so cosine similarity is used instead to measure the similarity between vectors.
Cosine similarity expresses the similarity of two vectors by the cosine of the angle between them in vector space: the closer the cosine value is to 1, the closer the angle is to 0 degrees and the more similar the vectors are.
The cosine similarity between vectors a and b is defined as

cos(a, b) = (a · b) / (||a|| · ||b||) = (Σ_{i=1}^{N} a_i b_i) / (sqrt(Σ_{i=1}^{N} a_i^2) · sqrt(Σ_{i=1}^{N} b_i^2))

where a = (a_1, a_2, ..., a_N) and b = (b_1, b_2, ..., b_N) are N-dimensional vectors.
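A small sketch of the zero-padding and cosine-similarity selection just described; the helper name and the use of NumPy are illustrative assumptions.

```python
import numpy as np

def pick_teacher_class(teacher_vectors):
    """Zero-pad the candidate teacher-class vectors to a common length M and return the
    one whose summed cosine similarity to the other vectors is largest."""
    m = max(len(t) for t in teacher_vectors)
    padded = np.array([np.pad(np.asarray(t, dtype=float), (0, m - len(t)))
                       for t in teacher_vectors])
    unit = padded / np.clip(np.linalg.norm(padded, axis=1, keepdims=True), 1e-12, None)
    sim = unit @ unit.T                       # pairwise cosine similarities
    scores = sim.sum(axis=1) - 1.0            # drop the self-similarity term
    return teacher_vectors[int(np.argmax(scores))]
```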
S2404, randomly selecting teacher classes
The size of this random subset is given by the formula in the original figure; M is the number of speech segments in the teacher category obtained by clustering, and only a random subset is used in order to reduce the time of GMM model training over all speech segments in the teacher category. N is a constant obtained adaptively from the size of M (formula in the original figure), where α is a time adjustment parameter used to adjust the number of speech segments for GMM training; in this embodiment α is 2. length(C) denotes the total number of speech segments obtained after the original classroom speech is segmented into 30 s pieces, and the coefficient 0.4 × length(C) represents the minimum number of teacher speech segments. The expression means that the more teacher-class speech segments the clustering produces, the smaller the fraction of them used for GMM model training, so that the number of segments required for GMM training tends to be similar for different recordings.
A similarity threshold is then set to S/γ, where S is the mean of the similarities between the teacher-class speech segments and γ is an adaptive adjustment parameter that preserves the completeness of the teacher class as far as possible. It is obtained as follows (formula in the original figure): β is an adjustment parameter in the range [0, 1]; in this embodiment β is 1/5. S_max and S_min denote the maximum and minimum of the similarities between the teacher-class segments, length(C) is the total number of speech segments after the original classroom speech is segmented into 30 s pieces, and M is the number of speech segments in the teacher class. The expression indicates that the larger M is, the larger γ becomes, i.e. the smaller the similarity threshold is set; and when the range of similarities is larger, a smaller similarity threshold is taken, which makes the decision on whether a remaining segment is a teacher utterance more accurate.
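The sketch below illustrates the second-pass teacher detection with a GGMM and the adaptive threshold S/γ. Because the figure formulas for the random subset size and for γ are not reproduced here, the GGMM is simply trained on all clustered teacher segments and γ is passed in as a plain parameter; gmm_distance is the earlier sketch, and all parameter values are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def second_pass_teacher_detection(models, mfcc_per_segment, teacher_idx,
                                  gamma=4.0, n_components=16):
    """Train a GGMM on the clustered teacher segments and relabel any remaining segment
    whose distance to the GGMM is below the adaptive threshold S / gamma."""
    pooled = np.vstack([mfcc_per_segment[i] for i in teacher_idx])
    ggmm = GaussianMixture(n_components=n_components, covariance_type='diag',
                           max_iter=200, random_state=0).fit(pooled)
    # Mean pairwise distance S among the teacher segments
    pairs = [(i, j) for i in teacher_idx for j in teacher_idx if i < j]
    s_mean = np.mean([gmm_distance(models[i], models[j]) for i, j in pairs])
    threshold = s_mean / gamma
    teacher = set(teacher_idx)
    extra = [i for i in range(len(models))
             if i not in teacher and gmm_distance(models[i], ggmm) < threshold]
    return sorted(teacher | set(extra))
```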
Through the processing of the GMM-Kmeans algorithm, a relatively stable teacher class vector is finally obtained. In tests comparing it with manually divided classes, the obtained teacher class shows high similarity to the manually labelled teacher class, and compared with the result of clustering directly with an improved K-means, the GMM-Kmeans algorithm used in this embodiment clearly improves the clustering accuracy.
After the teacher category is obtained, the silent and overlapping speech portions are determined. Because the student category has no clear distinguishing feature and the number of students is unknown, the student category cannot be detected first. This embodiment therefore detects the teacher class, the silence class and the overlapping speech class first, and labels the remaining speech segments, i.e. those not included in these three parts, as the student utterance class.
Referring to fig. 5, the specific implementation steps of speech separation are as follows:
s310, detecting the overlapped voice to obtain the overlapped voice section in the classroom voice
S320, judging whether the overlapped voice contains teacher voice
S330, selecting the voice segment closest to the overlapped voice as a training voice segment
S340, designing a CNMF + JADE method for voice separation
In S310, overlapping speech segments are obtained on the basis of silence points. Silent frames have lower energy than non-silent frames, so silent frames can be determined by setting an energy threshold, defined as follows (formula in the original figure): E_i denotes the energy of the i-th speech frame, N is the total number of frames in the speech segment, r is a constant in the range (0, 1), and the formula uses rounding up.
Overlapping speech means that a segment contains two or more people speaking simultaneously. In a real classroom it mainly arises from 1. group discussion among students and 2. several students answering at the same time when the teacher asks a question. Overlapping speech segments differ from silence segments in how silent frames appear within them: the longer the total duration of silence in a segment, the lower the probability that the segment contains overlapping speech [56]. For the problem addressed in this embodiment, the potential overlapping speech class can therefore be determined from the number of silent frames, in a manner similar to obtaining the potential silence class (formulas in the original figures):

ClassOfOverlap_i = I(numberOfSilence_i < Threshold_o), i = 1, 2, ..., N

where α' is a constant used to obtain the threshold Threshold_o for judging the overlapping speech category; in this example α' is 0.6. Segments whose number of silent frames is smaller than Threshold_o are regarded as potential overlapping speech segments, from which the potential overlapping speech class is obtained.
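A sketch of the silence and overlap detection described above, assuming 25 ms frames with a 10 ms hop at 16 kHz. The silence energy threshold is read here as the r-quantile of the frame energies and the overlap threshold as α' times the mean silent-frame count; both readings of the figure formulas are assumptions.

```python
import numpy as np

def frame_energies(seg, frame_len=400, hop=160):
    starts = range(0, len(seg) - frame_len + 1, hop)
    return np.array([float(np.sum(seg[i:i + frame_len] ** 2)) for i in starts])

def silent_frame_count(seg, r=0.1):
    # Silence energy threshold taken as the r-quantile of the frame energies
    # (an assumed reading of the figure formula involving r and rounding up)
    e = frame_energies(seg)
    thresh = np.sort(e)[max(int(np.ceil(r * len(e))) - 1, 0)]
    return int(np.sum(e <= thresh)), len(e)

def potential_overlap_segments(segments, alpha=0.6):
    # Segments with fewer silent frames than alpha times the mean count are
    # flagged as potential overlapping speech (again an assumption about the figure)
    counts = np.array([silent_frame_count(s)[0] for s in segments])
    threshold_o = alpha * counts.mean()
    return [i for i, c in enumerate(counts) if c < threshold_o]
```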
S320 and S330 are collectively referred to as speech separation front-end processing, which has two purposes: judging whether the overlapping speech contains the target speaker, and finding the speech segment other than the target speaker that is closest to the overlapping speech to use as CNMF training data. The invention judges whether the overlapping speech contains the teacher on the basis of GMM similarity. The similarity is computed with an improved Bhattacharyya distance, and the judgment criterion is as follows (formula in the original figure): disp(A, B) denotes the distance between the GMM models of speech segments A and B, A denotes the overlapping speech segment, and B the teacher speech segment. t is an adaptive threshold, calculated as follows (formula in the original figure): p is an adjustment parameter taking values in [0.5, 0.8], K is the number of student speech segments, X_i is the i-th student speech segment, and B is the teacher speech segment. The adaptive threshold is computed from the student segments and used to decide whether the teacher is present in the overlapping speech.
The second task of the speech separation front-end is to select the non-teacher speech segment closest to the overlapping speech for CNMF training, which strongly affects the subsequent speech separation. The invention trains the CNMF with the teacher speech segment together with the non-teacher speech segment closest to the overlapping speech, selected as follows:

v_i = min(disp(A_i, S_j)), i = 1, 2, ..., N, j = 1, 2, ..., K

where A_i is the i-th overlapping speech segment, S_j is the j-th non-teacher speech segment, and v_i is the correspondingly selected i-th training speech segment.
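A sketch of the two front-end decisions, reusing the gmm_distance sketch from earlier. Modelling the adaptive threshold t as p times the mean student-to-teacher distance is an assumption about the figure formula; the returned index is that of the closest non-teacher segment.

```python
import numpy as np

def overlap_front_end(overlap_models, teacher_ggmm, student_models, p=0.65):
    """For each overlapping segment: decide whether it contains the teacher, and pick
    the closest non-teacher segment as CNMF training data."""
    # Adaptive threshold t sketched as p times the mean student-to-teacher distance
    t = p * np.mean([gmm_distance(x, teacher_ggmm) for x in student_models])
    results = []
    for a in overlap_models:
        contains_teacher = gmm_distance(a, teacher_ggmm) < t
        nearest = int(np.argmin([gmm_distance(a, s) for s in student_models]))
        results.append((contains_teacher, nearest))
    return results
```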
In S340, speech separation is performed on the overlapping speech containing the teacher. The invention provides a single-channel speech separation method that fuses CNMF and JADE: the speech signals separated by CNMF undergo a second separation based on JADE. The CNMF + JADE algorithm aims to obtain the separated speech signals of all speakers in single-channel mixed speech and comprises the following steps:
Input: clean speech t_1, t_2, ..., t_N of the speakers to be separated, mixed speech o_1, o_2, ..., o_{N-1} used for training, and the mixed speech O to be separated.
Output: separated speaker speech s_1, s_2, ..., s_N.
Step 1: select the target speaker t_1 and the corresponding mixed speech o_1 to train the CNMF.
Step 2: separate the mixed speech O with the trained CNMF to obtain an estimate of the target speech and an estimate of the remaining mixture (symbols shown in the original figures).
Step 3: generate a random matrix R_1 and mix the two CNMF outputs to form a two-channel speech signal S_1.
Step 4: separate S_1 with JADE to obtain s_1 and O_1.
Step 5: take O_1 as the mixed speech to be separated, t_2, ..., t_N as the speakers to be separated and o_2, ..., o_{N-1} as the training mixed speech, and repeat Step 1 to Step 5.
Step 6: obtain the separated speech s_1, s_2, ..., s_N.
In the above algorithm, t_1, t_2, ..., t_N denote the clean speech of the speakers contained in the mixed speech O, and N is the number of speakers in O. o_1, o_2, ..., o_{N-1} denote the mixed speech obtained by removing the corresponding speaker from O in turn (formula in the original figure). In practice o_1, o_2, ..., o_{N-1} are very difficult to obtain, so the CNMF is instead trained with the speech of a randomly selected non-target speaker from the current mixed speech as a surrogate. Experiments show that this slightly reduces the effect compared with the original CNMF + JADE, but it generalizes the CNMF + JADE algorithm to a more general situation.
The two-channel speech signal in Step 3 is generated by applying the matrix R_i to the two CNMF outputs (formula in the original figure), where R_i is a 2 × 2 matrix.
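A structural sketch of the CNMF + JADE loop. Neither CNMF separation nor JADE is available as a standard-library call, so cnmf_separate and jade_two_channel are user-supplied placeholders; only the peel-one-speaker-per-iteration control flow and the random 2 × 2 remixing follow the algorithm above.

```python
import numpy as np

def separate_all(mixture, train_targets, train_mixtures, cnmf_separate, jade_two_channel):
    """CNMF + JADE control flow: peel off one speaker per iteration.
    cnmf_separate(mix, target_train, other_train) -> (target_estimate, rest_estimate)
    jade_two_channel(two_channel) -> (separated_target, separated_rest)
    Both are user-supplied placeholders, not standard-library calls."""
    rng = np.random.default_rng(0)
    separated, residual = [], mixture
    for k, target_train in enumerate(train_targets):
        if k == len(train_targets) - 1:
            separated.append(residual)                  # the last speaker is what remains
            break
        s_hat, o_hat = cnmf_separate(residual, target_train, train_mixtures[k])
        R = rng.normal(size=(2, 2))                     # Step 3: random 2x2 remixing
        two_channel = R @ np.vstack([s_hat, o_hat])
        s_k, residual = jade_two_channel(two_channel)   # Step 4: JADE second-stage separation
        separated.append(s_k)
    return separated
```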
As shown in fig. 6, the specific speech enhancement implementation steps are as follows:
S410, the teacher speech obtained by speech separation is taken as the data to be enhanced;
S420, adaptive judgment is performed on the separated teacher speech, and suitable speech segments are selected for speech enhancement;
S430, speech enhancement is applied using the wavelet transform.
Wavelet transform is a hot research point in speech processing in recent years. Compared with the traditional frequency domain analysis method such as Fourier transform, the wavelet transform can simultaneously give the time domain state of the signal, and the time-frequency analysis method has the characteristics of multi-resolution analysis, time-frequency local transform, flexible selection of wavelet functions and the like. The principle of wavelet transform will be described below.
Let L²(R) denote the space of square-integrable functions and let ψ(t) ∈ L²(R). If its Fourier transform Ψ(ω) satisfies the admissibility condition

∫ |Ψ(ω)|² / |ω| dω < ∞,

then ψ(t) is called a basic wavelet or mother wavelet.
After the mother wavelet ψ(t) is scaled and translated by a real pair (a, b), where a, b ∈ R and a ≠ 0, a family of functions is obtained:

ψ_{a,b}(t) = |a|^(-1/2) ψ((t - b) / a).

This family is called the wavelet basis functions, where a is the scaling factor and b is the translation factor. ψ acts as a window function whose size is fixed but whose shape can change; this property gives the wavelet transform its multi-resolution analysis capability. The factor |a|^(-1/2) is the normalization factor; its effect is to give the wavelets the same energy at different scales.
Signal processing based on wavelet domain is one of the main means of speech signal processing at present. Based on the characteristics of multi-resolution, low entropy and decorrelation of wavelet transform, the method has great advantages in speech signal processing. A large number of wavelet bases can deal with different scenes, so the wavelet transformation is very suitable for processing voice signals.
When the wavelet transform is used for voice enhancement, the characteristic of multi-resolution analysis in the wavelet transform is used, and corresponding rules are formulated according to different characteristics of wavelet coefficients of noise and voice on wavelet domains with different scales, so that the processing of the wavelet coefficients of the noise signals is completed.
The wavelet transformation denoising method mainly comprises the following steps:
step 1: wavelet transformation of noisy signals
Step 2: denoising wavelet coefficient on different scales
Step 3: performing wavelet inverse transformation on the processed wavelet coefficient to obtain an enhanced reconstructed signal
The wavelet denoising method can be roughly classified into the following three types: noise is extracted by utilizing a wavelet transform modulus maximum principle; denoising by utilizing the correlation of the wavelet transform space coefficient; denoising by using a wavelet threshold. The present embodiment mainly uses a third wavelet threshold-based denoising method.
Wavelet threshold denoising is one of the commonly used denoising methods, and its basic process is as follows:
Step 1: select a proper wavelet basis according to the signal to be processed, determine a reasonable number of decomposition levels, and perform multi-level decomposition of the noisy speech signal.
Step 2: select suitable thresholds at the different scales for the decomposed wavelet coefficients and quantize the wavelet coefficients.
Step 3: perform wavelet reconstruction on the threshold-quantized result to obtain the enhanced speech signal.
The diversity of wavelet bases is one of the advantages of the wavelet transform in time-frequency analysis, so selecting a proper wavelet function is critical. Research shows that for speech signal processing, wavelet basis functions that are smooth, have good symmetry and a low vanishing moment should be chosen to handle the transient changes of speech signals.
The number of wavelet decomposition levels is also a recognized factor affecting the denoising performance of the speech enhancement algorithm. As the number of decomposition levels increases, the detail of the speech and noise components becomes clearer, which helps denoising; however, too many levels disperse the speech energy more and more, causing distortion and slowing down the algorithm, while too few levels leave the signal confused with the noise so that the noise cannot be separated. Through extensive research and experimental analysis, researchers take the most reasonable number of decomposition levels to be a function of the data length N (formula in the original figure), where the formula uses rounding down.
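A minimal PyWavelets sketch of the three-step threshold denoising process described above, assuming a db8 wavelet, a caller-chosen decomposition level and a caller-supplied threshold; none of these choices is prescribed by the text.

```python
import pywt

def wavelet_threshold_denoise(signal, threshold, wavelet='db8', level=4, mode='soft'):
    """Step 1: multi-level decomposition; Step 2: threshold the detail coefficients
    at every scale; Step 3: inverse transform to reconstruct the enhanced signal."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    coeffs = [coeffs[0]] + [pywt.threshold(c, threshold, mode=mode) for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)
```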
In the wavelet threshold denoising algorithm, the estimation of the threshold is one of the important factors determining the denoising effect. The wavelet transform decomposes the noisy speech signal into high-frequency detail parts and a low-frequency approximation part; since noise is usually of high frequency, the noise energy is concentrated mainly in the high-frequency wavelet coefficients, while the speech energy is concentrated mainly in the low-frequency part of the signal. By setting a threshold, the noise components smaller than the threshold can be cut off for denoising; choosing this threshold is the central problem studied in wavelet threshold denoising.
The selection of the wavelet threshold denoising threshold mainly follows these classical methods:
Unified (universal) threshold.
The unified threshold estimate is derived from a minimum mean square error criterion and can be expressed as

λ = σ_n · sqrt(2 ln N)

where σ_n is the standard deviation of the noise and N is the signal length. The noise level is obtained from

σ_n = M_j / 0.6745

where M_j is the absolute median of the wavelet coefficients of each decomposition layer and 0.6745 is an empirical value. The method is simple to implement and filters Gaussian white noise well, but because the threshold depends on the speech length its effect deteriorates when the data size is large.
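A sketch of the unified threshold as just described, estimating σ_n from the finest-scale detail coefficients of a single DWT; the choice of wavelet is an assumption.

```python
import numpy as np
import pywt

def universal_threshold(signal, wavelet='db8'):
    """Unified (universal) threshold: sigma_n * sqrt(2 ln N), with sigma_n estimated
    from the finest-scale detail coefficients as median(|d|) / 0.6745."""
    _, detail = pywt.dwt(signal, wavelet)
    sigma_n = np.median(np.abs(detail)) / 0.6745
    return sigma_n * np.sqrt(2.0 * np.log(len(signal)))
```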
SUREShrink threshold [73].
SUREShrink threshold estimation is an adaptive threshold selection method and gives an unbiased estimate of the optimal threshold. The threshold is chosen by minimizing a risk function defined on the wavelet coefficients (formulas in the original figures); the threshold estimate is the value t that minimizes this risk, in which one term counts the number of elements of the set { Y : |Y_i| < t }.
Minimax threshold.
The minimax threshold, also called the minimum-maximum threshold, minimizes the maximum mean square error. It is computed as a function of the signal length N (formula in the original figure).
In addition, the threshold function, like the threshold estimate, plays a crucial role in the wavelet threshold denoising algorithm. The commonly used threshold functions are as follows.
Hard threshold function.
The hard threshold function is

ŵ_{j,k} = ω_{j,k},  if |ω_{j,k}| ≥ λ;  ŵ_{j,k} = 0,  otherwise,

where ŵ_{j,k} is the estimated wavelet coefficient, ω_{j,k} is the decomposition wavelet coefficient, and λ is the denoising threshold. Hard thresholding compares ω_{j,k} with λ, sets coefficients smaller than λ to zero and keeps coefficients larger than λ unchanged; this discontinuity can introduce oscillations into the reconstructed signal and degrade the denoising effect.
Soft threshold function.
To eliminate this drawback of hard thresholding, the soft threshold function is introduced:

ŵ_{j,k} = sgn(ω_{j,k}) (|ω_{j,k}| - λ),  if |ω_{j,k}| ≥ λ;  ŵ_{j,k} = 0,  otherwise.

Compared with the hard threshold function, the soft threshold function improves the smoothness of the speech signal, but it also loses features to some extent and causes distortion [74].
Semi-soft threshold function.
To overcome the shortcomings of the soft and hard threshold functions, some researchers have proposed a semi-soft threshold function (formula in the original figure) with a lower threshold λ1 and an upper threshold λ2, 0 < λ1 < λ2, related empirically (relation in the original figure). The value of λ1 is related to the speech content: when there are more unvoiced sounds λ1 takes a smaller value, and when there are more voiced sounds λ1 takes a larger value. By adjusting λ1 and λ2 the method can combine the advantages of the soft and hard thresholds, but the two parameters increase the computational complexity of the algorithm.
Garrote threshold function.
The Garrote threshold function is expressed as

ŵ_{j,k} = ω_{j,k} - λ² / ω_{j,k},  if |ω_{j,k}| > λ;  ŵ_{j,k} = 0,  otherwise.

It keeps only the wavelet coefficients larger than the selected threshold, shrinking them dynamically, and sets the remaining coefficients to zero.
The invention designs a wavelet-transform-based adaptive method to analyse the speech signals after CNMF + JADE separation and thereby realize selective speech enhancement, i.e. speech segments whose quality would be degraded by the enhancement are automatically filtered out before speech enhancement. Analysis of many separated speech segments and of their quality after wavelet-transform enhancement shows that when the distance between the separated speech signals is large, wavelet-transform speech enhancement becomes less effective. Based on this finding, the invention makes the following adaptive judgment before speech enhancement (decision formula in the original figure), with

O_{i-1} = O_i + s_i + l, i = 1, 2, ..., N, O_0 = O, O_N = s_N,

where s_i denotes the i-th teacher speech signal obtained by CNMF + JADE decomposition, O_i denotes the mixed speech remaining after s_i is separated from the mixed speech O_{i-1} by CNMF + JADE, the GMMs corresponding to s_i and O_i are used in the judgment, l denotes the loss incurred during separation, N is the number of speakers contained in the mixed signal, disp(·) is the GMM distance formula presented above, and p is a scaling factor taking values in [1, 1.2].
1. The invention designs a teacher voice extraction method for the complex situation of classroom teaching and broadens the application scope of the informatized classroom; it is not only an important component of the intelligent classroom (artificial intelligence plus education) but also a new embodiment of future education. As far as the available data show, research of this type is currently very scarce and essentially no usable framework or theory has yet been formed. The invention can advance intelligent-classroom research considerably and open up a new field of education methodology based on artificial intelligence.
2. The invention recognizes and extracts classroom teacher speech in a single-channel, adaptive and unsupervised way. Compared with existing methods it needs no prior knowledge, and it adapts well to classroom speech of different forms and lengths and to different classroom environments. At the same time, the method presented here can be applied not only to classroom teaching but also to fields such as conferences, hearing aids and communications (for example, combining the speech separation technology with a hearing aid gives the hearing aid a stronger signal processing capability and improves its speech quality).
3. The invention designs and implements an improved GMM-Kmeans clustering method, which clusters with the GMM model as the feature, preserving the original characteristics to the greatest extent and improving the clustering accuracy. Using the GMM as the feature and computing distances between models avoids processing long speech signals directly, shortens the processing time of the algorithm, and achieves classroom speech recognition with both high accuracy and high speed.
4. On the basis of the GMM-Kmeans clustering algorithm and taking the influence of the environment into account, suitable speech segments are selected adaptively from the clustering result, a GGMM model is constructed, a similarity threshold is obtained adaptively, and teacher utterances are detected a second time, so that an accurate teacher speech class is obtained. All thresholds are obtained adaptively from the classroom speech data through designed formulas, without manual intervention, so the algorithm is highly robust to different classroom environments and classroom situations.
5. The invention designs and implements a CNMF + JADE speech separation algorithm, applying JADE to perform a second speech separation on the CNMF separation result, which effectively improves the separation result.
6. The invention designs and implements an adaptive wavelet-transform speech enhancement method, which performs adaptive judgment on the speech after CNMF + JADE separation, filters out the speech segments that are unsuitable for enhancement, and then denoises the remaining speech signals in a targeted way.
While the invention has been described in connection with what is presently considered to be the most practical and preferred embodiment, it is to be understood that the invention is not to be limited to the disclosed embodiment, but on the contrary, is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (9)

1. A target speaker voice extraction method based on a single channel and an unsupervised mode is characterized by comprising a teacher language detection step and a teacher language GGMM model training step;
the teacher language detection step comprises the following steps:
s1: obtaining voice data for classroom recording;
s2: carrying out voice signal processing;
s3: voice segmentation and modeling, wherein the voice segmentation comprises equal-length segmentation of class voices, then extracting corresponding MFCC (Mel frequency cepstrum coefficient) features for each section of voice, and constructing a GMM (Gaussian mixture model) of each section of voice based on the MFCC features;
s4: the teacher voice detection is to calculate the similarity between the GMM model of each voice section outside the teacher utterance category and the GGMM, set an adaptive threshold value, and mark the voice sections smaller than the threshold value as the teacher utterance category, so as to obtain the final teacher utterance category;
the teacher language GGMM model training step comprises the following steps:
s5: clustering the voice data obtained in step S3; obtaining an initial teacher utterance category, and extracting a GGMM model based on the initial teacher utterance category; the clustering process comprises the following steps:
s51: selecting a clustering central point;
s52: calculating the distances between all samples and the central points, and iterating until a preset stopping condition is met;
s53: step S51 and step S52 are executed circularly for n times, n teacher voice division groups can be obtained, and the division group with the maximum satisfaction degree is selected as initial teacher voice according to a set rule;
s54: selecting a plurality of training GGMM models from the division group, and calculating the average distance in the class;
s55: performing secondary judgment on the rest voice sample segments according to the GGMM and the average distance, and adding the samples into the class of teachers if the distance is smaller than a set threshold value;
s56: and outputting all teacher voice samples and writing the teacher voice samples into a database.
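The claim above describes the pipeline in prose; the sketch below shows one plausible realisation of step S3 (equal-length segmentation, MFCC extraction, one GMM per segment) using librosa and scikit-learn. Segment length, MFCC order and GMM size are illustrative choices, not values fixed by the claim.

# Sketch of equal-length segmentation, MFCC extraction and per-segment GMMs
# (step S3); parameter values are illustrative assumptions.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def segment_and_model(wav_path, seg_seconds=3.0, n_mfcc=13, n_components=8):
    y, sr = librosa.load(wav_path, sr=16000)
    seg_len = int(seg_seconds * sr)
    gmms, feats = [], []
    for start in range(0, len(y) - seg_len + 1, seg_len):
        seg = y[start:start + seg_len]
        # MFCC features: one n_mfcc-dimensional vector per frame (frames x coefficients).
        mfcc = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=n_mfcc).T
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        gmm.fit(mfcc)
        feats.append(mfcc)
        gmms.append(gmm)
    return gmms, feats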
2. The single-channel, unsupervised target-speaker speech extraction method of claim 1, wherein step S51 specifically comprises:
S511: randomly selecting one of all the speech segments as the first center point;
S512: calculating the distances between the GMM models of the remaining speech segments and that of the first center point, and selecting the speech segment with the largest distance as the second center point;
S513: successively calculating the distances between the speech segments not yet selected as center points and the existing center points, and selecting the speech segment farthest from the center points as the next center point;
S514: iterating until the number of center points reaches the specified number of categories.
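One way to realise the centre selection of S511–S514 is farthest-point initialisation over a precomputed matrix of pairwise GMM distances; the sketch below assumes such a symmetric distance matrix dist and a desired category count k.

# Farthest-point selection of k cluster centers from a pairwise distance matrix
# (S511-S514). 'dist' is assumed to be a symmetric (n x n) numpy array.
import numpy as np

def init_centers(dist, k, seed=0):
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    centers = [int(rng.integers(n))]           # S511: random first center
    while len(centers) < k:                    # S514: until k centers chosen
        # S512/S513: distance of every segment to its nearest chosen center.
        d_to_centers = dist[:, centers].min(axis=1)
        d_to_centers[centers] = -np.inf        # never re-pick an existing center
        centers.append(int(d_to_centers.argmax()))
    return centers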
3. The single-channel, unsupervised target-speaker speech extraction method of claim 2, wherein step S52 specifically comprises:
S521: calculating the distances between the remaining GMM models and the center points, and assigning each GMM to its nearest center point;
S522: updating the center points, taking as the new center point of each class the point whose sum of distances to all points in that class is minimal;
S523: iterating until a preset stopping condition is met or a specified number of iterations is reached.
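A matching sketch for S521–S523: assign each segment to its nearest centre and update each centre to the class medoid, again over the assumed precomputed distance matrix; the stopping rule (stable assignments or a maximum iteration count) is an assumption.

# K-medoids style iteration over GMM distances (S521-S523).
import numpy as np

def cluster(dist, centers, max_iter=50):
    labels = None
    for _ in range(max_iter):
        # S521: assign every segment to its nearest center.
        new_labels = dist[:, centers].argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                              # S523: stop when assignments are stable
        labels = new_labels
        # S522: new center = member minimising the sum of distances to its class.
        for c in range(len(centers)):
            members = np.flatnonzero(labels == c)
            if members.size:
                within = dist[np.ix_(members, members)].sum(axis=1)
                centers[c] = int(members[within.argmin()])
    return labels, centers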
4. The single-channel, unsupervised target-speaker speech extraction method of claim 3, wherein step S53 specifically comprises: performing similarity calculation on the N teacher-category vectors obtained by the iterations, and taking the vector whose sum of similarities with the remaining N-1 vectors is largest as the initial teacher category produced by the clustering.
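For the selection rule of S53, the sketch below assumes each clustering run produces a candidate teacher category encoded as a 0/1 membership vector and uses cosine similarity purely for illustration; the claim does not fix the similarity measure.

# Pick the candidate teacher category most consistent with the others (S53):
# the one whose summed similarity to the remaining candidates is largest.
import numpy as np

def pick_initial_teacher(candidates):
    """candidates: list of equal-length 0/1 membership vectors (assumption)."""
    C = np.asarray(candidates, dtype=float)
    # Cosine similarity between every pair of candidate vectors (illustrative).
    norms = np.linalg.norm(C, axis=1, keepdims=True)
    sim = (C @ C.T) / (norms @ norms.T + 1e-12)
    scores = sim.sum(axis=1) - 1.0             # exclude self-similarity
    return candidates[int(scores.argmax())]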
5. The single-channel, unsupervised target-speaker speech extraction method of claim 4, wherein step S54 specifically comprises: randomly selecting from the teacher category a number of speech segments given by the formula shown in Figure FDA0002685778340000022 (also Figure FDA0002685778340000023), where M is the number of speech segments in the teacher category obtained by clustering; the random selection reduces the time that training GMM models on all speech segments of the teacher category would require; N is a constant obtained adaptively from the size of M by the formula shown in Figure FDA0002685778340000021, wherein α is a time-adjustment parameter that regulates the number of speech segments used for GMM training, length(C) denotes the total number of speech segments obtained by segmenting the original classroom speech, and the coefficient 0.4 × length(C) represents the minimum number of teacher speech segments.
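Step S54 trains the group GMM (GGMM) on a random subset of the teacher segments. In the sketch below the subset-size rule is only a stand-in for the claim's formula, which appears as an image in the original; alpha plays the role of the time-adjustment parameter α, and 0.4 × total_segments is used as the stated minimum.

# Train a group GMM (GGMM) on a random subset of teacher segments (S54).
# The subset-size rule is an illustrative stand-in for the claim's formula.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ggmm(teacher_feats, total_segments, alpha=2.0, n_components=16, seed=0):
    """teacher_feats: list of (frames x mfcc) arrays for the M teacher segments."""
    rng = np.random.default_rng(seed)
    M = len(teacher_feats)
    floor = int(0.4 * total_segments)          # minimum teacher segments (claim 5)
    subset_size = min(M, max(int(M / alpha), floor))   # assumed rule, not the patent's
    idx = rng.choice(M, size=subset_size, replace=False)
    stacked = np.vstack([teacher_feats[i] for i in idx])
    ggmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    ggmm.fit(stacked)
    return ggmm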
6. The single-channel, unsupervised target-speaker speech extraction method of claim 1, wherein step S3 comprises:
S31: detecting overlapped speech to obtain the overlapped speech segments in the classroom speech;
S32: judging whether the overlapped speech contains teacher speech;
S33: selecting the speech segment closest to the overlapped speech as a training speech segment;
S34: performing speech separation with the designed CNMF + JADE method.
7. The single-channel, unsupervised target-speaker speech extraction method of claim 6, wherein step S31 comprises:
locating the overlapped speech segments by means of silence points, a silent frame being judged by setting an energy threshold, the energy threshold being obtained by the formula shown in Figure FDA0002685778340000031, wherein E_i denotes the energy of the i-th speech frame (Figure FDA0002685778340000032), N is the total number of frames in the speech segment, r is a constant in the range (0,1), and the symbol shown in Figure FDA0002685778340000033 denotes rounding up.
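For the silence test of S31, the sketch below computes per-frame energies and takes the ⌈N·r⌉-th smallest energy as the threshold, which is one plausible reading of the formula that is reproduced only as an image; frame length, hop and r are illustrative.

# Adaptive energy threshold for silence-frame detection (S31). The exact
# threshold formula is an image in the original; the sorted-energy rule used
# here is an assumption consistent with the stated symbols (E_i, N, r, ceil).
import numpy as np

def silence_mask(y, frame_len=400, hop=160, r=0.2):
    n_frames = max(1 + (len(y) - frame_len) // hop, 1)
    energies = np.array([
        np.sum(y[i * hop:i * hop + frame_len] ** 2) for i in range(n_frames)
    ])
    k = int(np.ceil(n_frames * r))             # ceil(N * r), treated as a 1-based rank
    threshold = np.sort(energies)[max(k - 1, 0)]
    return energies <= threshold               # True where the frame is silent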
8. The single-channel, unsupervised target-speaker speech extraction method of claim 7, wherein step S32 comprises: judging whether the overlapped speech contains the teacher by using GMM similarity, the similarity being judged with an improved Bhattacharyya distance according to the criterion shown in Figure FDA0002685778340000034, wherein disp(A, B) denotes the distance between the GMM models of speech segments A and B, A denotes the overlapped speech segment, B denotes the teacher speech segment, and t denotes an adaptive threshold calculated by the formula shown in Figure FDA0002685778340000035, wherein p is an adjustment parameter taking values in [0.5, 0.8], K is the number of student speech segments, X_i is the i-th student speech segment, and B is the teacher speech segment.
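A sketch of the S32 decision, assuming disp() is approximated by collapsing each fitted diagonal-covariance GMM to a single Gaussian and taking the Bhattacharyya distance between the two Gaussians, and assuming the adaptive threshold t = p × the average distance of the K student segments to the teacher segment; both are plausible readings, not the patent's exact formulas, which appear only as images.

# GMM-similarity test for overlapped speech (S32). disp() collapses each GMM
# to a single diagonal Gaussian and uses the Bhattacharyya distance between
# them -- an illustrative approximation, not the patent's exact measure.
import numpy as np

def collapse(gmm):
    """Moment-match a fitted sklearn GaussianMixture (diag covariances) to one Gaussian."""
    w = gmm.weights_[:, None]
    mean = (w * gmm.means_).sum(axis=0)
    second = (w * (gmm.covariances_ + gmm.means_ ** 2)).sum(axis=0)
    return mean, second - mean ** 2

def disp(gmm_a, gmm_b):
    m1, v1 = collapse(gmm_a)
    m2, v2 = collapse(gmm_b)
    v = 0.5 * (v1 + v2)
    term1 = 0.125 * np.sum((m1 - m2) ** 2 / v)
    term2 = 0.5 * (np.sum(np.log(v)) - 0.5 * (np.sum(np.log(v1)) + np.sum(np.log(v2))))
    return term1 + term2

def contains_teacher(overlap_gmm, teacher_gmm, student_gmms, p=0.6):
    # Assumed threshold rule: t = p * mean distance of student segments to teacher.
    t = p * np.mean([disp(g, teacher_gmm) for g in student_gmms])
    return disp(overlap_gmm, teacher_gmm) < t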
9. The single-channel, unsupervised target-speaker speech extraction method of claim 8, wherein step S33 comprises: selecting the non-teacher speech segment closest to the overlapped speech to train the CNMF together with the teacher speech segment, the selection being made as follows:
v_i = min(disp(A_i, S_j)), i = 1, 2, ..., N, j = 1, 2, ..., K
wherein A_i is the i-th overlapped speech segment, S_j is the j-th non-teacher speech segment, and v_i is the corresponding i-th selected training speech segment.
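The selection of S33 then reduces to an argmin over the non-teacher segments; disp is any GMM distance function, for example the sketch given for claim 8.

# For each overlapped segment, pick the closest non-teacher segment as the
# extra CNMF training segment (S33).
def select_training_segments(overlap_gmms, student_gmms, disp):
    """disp: distance function between two GMMs, e.g. the sketch for claim 8."""
    return [
        min(range(len(student_gmms)), key=lambda j: disp(a, student_gmms[j]))
        for a in overlap_gmms
    ]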
CN201810832080.5A 2018-07-26 2018-07-26 Single-channel and unsupervised target speaker voice extraction method Active CN108962229B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810832080.5A CN108962229B (en) 2018-07-26 2018-07-26 Single-channel and unsupervised target speaker voice extraction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810832080.5A CN108962229B (en) 2018-07-26 2018-07-26 Single-channel and unsupervised target speaker voice extraction method

Publications (2)

Publication Number Publication Date
CN108962229A CN108962229A (en) 2018-12-07
CN108962229B true CN108962229B (en) 2020-11-13

Family

ID=64464209

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810832080.5A Active CN108962229B (en) 2018-07-26 2018-07-26 Single-channel and unsupervised target speaker voice extraction method

Country Status (1)

Country Link
CN (1) CN108962229B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948144B (en) * 2019-01-29 2022-12-06 汕头大学 Teacher utterance intelligent processing method based on classroom teaching situation
CN110544481B (en) * 2019-08-27 2022-09-20 华中师范大学 S-T classification method and device based on voiceprint recognition and equipment terminal
CN110544482B (en) * 2019-09-09 2021-11-12 北京中科智极科技有限公司 Single-channel voice separation system
CN110874879A (en) * 2019-10-18 2020-03-10 平安科技(深圳)有限公司 Old man registration method, device, equipment and storage medium based on voice recognition
CN111179962B (en) * 2020-01-02 2022-09-27 腾讯科技(深圳)有限公司 Training method of voice separation model, voice separation method and device
US11829920B2 (en) * 2020-07-13 2023-11-28 Allstate Insurance Company Intelligent prediction systems and methods for conversational outcome modeling frameworks for sales predictions
CN112017685B (en) * 2020-08-27 2023-12-22 抖音视界有限公司 Speech generation method, device, equipment and computer readable medium
CN117577124B (en) * 2024-01-12 2024-04-16 京东城市(北京)数字科技有限公司 Training method, device and equipment of audio noise reduction model based on knowledge distillation

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101866421A (en) * 2010-01-08 2010-10-20 苏州市职业大学 Method for extracting characteristic of natural image based on dispersion-constrained non-negative sparse coding
CN102568477A (en) * 2010-12-29 2012-07-11 盛乐信息技术(上海)有限公司 Semi-supervised pronunciation model modeling system and method
CN102682760A (en) * 2011-03-07 2012-09-19 株式会社理光 Overlapped voice detection method and system
CN103680517A (en) * 2013-11-20 2014-03-26 华为技术有限公司 Method, device and equipment for processing audio signals
CN103854644A (en) * 2012-12-05 2014-06-11 中国传媒大学 Automatic duplicating method and device for single track polyphonic music signals
CN104167208A (en) * 2014-08-08 2014-11-26 中国科学院深圳先进技术研究院 Speaker recognition method and device
CN105096955A (en) * 2015-09-06 2015-11-25 广东外语外贸大学 Speaker rapid identification method and system based on growing and clustering algorithm of models
CN105957537A (en) * 2016-06-20 2016-09-21 安徽大学 Voice denoising method and system based on L1/2 sparse constraint convolution non-negative matrix decomposition
JP2018502319A (en) * 2015-07-07 2018-01-25 三菱電機株式会社 Method for distinguishing one or more components of a signal

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050125223A1 (en) * 2003-12-05 2005-06-09 Ajay Divakaran Audio-visual highlights detection using coupled hidden markov models

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101866421A (en) * 2010-01-08 2010-10-20 苏州市职业大学 Method for extracting characteristic of natural image based on dispersion-constrained non-negative sparse coding
CN102568477A (en) * 2010-12-29 2012-07-11 盛乐信息技术(上海)有限公司 Semi-supervised pronunciation model modeling system and method
CN102682760A (en) * 2011-03-07 2012-09-19 株式会社理光 Overlapped voice detection method and system
CN103854644A (en) * 2012-12-05 2014-06-11 中国传媒大学 Automatic duplicating method and device for single track polyphonic music signals
CN103680517A (en) * 2013-11-20 2014-03-26 华为技术有限公司 Method, device and equipment for processing audio signals
CN104167208A (en) * 2014-08-08 2014-11-26 中国科学院深圳先进技术研究院 Speaker recognition method and device
JP2018502319A (en) * 2015-07-07 2018-01-25 三菱電機株式会社 Method for distinguishing one or more components of a signal
CN105096955A (en) * 2015-09-06 2015-11-25 广东外语外贸大学 Speaker rapid identification method and system based on growing and clustering algorithm of models
CN105957537A (en) * 2016-06-20 2016-09-21 安徽大学 Voice denoising method and system based on L1/2 sparse constraint convolution non-negative matrix decomposition

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CNMF-based acoustic features for noise-robust ASR; Colin Vaz; 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2016-05-19; pp. 5735-5739 *
Single-channel speech separation based on deep learning; Li Hao; China Masters' Theses Full-text Database, Information Science and Technology; 2017-11-15; pp. I136-139 *

Also Published As

Publication number Publication date
CN108962229A (en) 2018-12-07

Similar Documents

Publication Publication Date Title
CN108962229B (en) Single-channel and unsupervised target speaker voice extraction method
Vasquez et al. Melnet: A generative model for audio in the frequency domain
Yu et al. Speech enhancement based on denoising autoencoder with multi-branched encoders
JP2002014692A (en) Device and method for generating acoustic model
CN106128477B (en) A kind of spoken identification correction system
CN108615533A (en) A kind of high-performance sound enhancement method based on deep learning
CN103985381A (en) Voice frequency indexing method based on parameter fusion optimized decision
Li et al. ILMSAF based speech enhancement with DNN and noise classification
Ismail et al. Mfcc-vq approach for qalqalahtajweed rule checking
Do et al. Speech source separation using variational autoencoder and bandpass filter
Vignolo et al. Feature optimisation for stress recognition in speech
CN116092512A (en) Small sample voice separation method based on data generation
Wang Supervised speech separation using deep neural networks
Ling An acoustic model for English speech recognition based on deep learning
Kamble et al. Emotion recognition for instantaneous Marathi spoken words
CN116347723A (en) Control system of sound control type wall switch with adjustable lamplight sample color
MY An improved feature extraction method for Malay vowel recognition based on spectrum delta
CN111833851B (en) Method for automatically learning and optimizing acoustic model
CN114678039A (en) Singing evaluation method based on deep learning
Hashemi et al. Persian music source separation in audio-visual data using deep learning
Kim Noise-Tolerant Self-Supervised Learning for Audio-Visual Voice Activity Detection.
Iswarya et al. Speech query recognition for Tamil language using wavelet and wavelet packets
Shahrul Azmi et al. Noise robustness of Spectrum Delta (SpD) features in Malay vowel recognition
Zhao et al. Enhancing audio perception in augmented reality: a dynamic vocal information processing framework
Lung Feature extracted from wavelet decomposition using biorthogonal Riesz basis for text-independent speaker recognition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant