CN109545227B - Depth self-coding network-based speaker sex automatic identification method and system - Google Patents

Depth self-coding network-based speaker sex automatic identification method and system

Info

Publication number
CN109545227B
CN109545227B (application CN201810402685.0A)
Authority
CN
China
Prior art keywords
speaker
vector
ubm
steps
depth self
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810402685.0A
Other languages
Chinese (zh)
Other versions
CN109545227A (en)
Inventor
王志锋
段苏容
左明章
田元
闵秋莎
夏丹
叶俊民
陈迪
罗恒
姚璜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central China Normal University
Original Assignee
Central China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central China Normal University filed Critical Central China Normal University
Priority to CN201810402685.0A priority Critical patent/CN109545227B/en
Publication of CN109545227A publication Critical patent/CN109545227A/en
Application granted granted Critical
Publication of CN109545227B publication Critical patent/CN109545227B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 - Training, enrolment or model building
    • G10L17/18 - Artificial neural networks; Connectionist approaches
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Abstract

The invention belongs to the technical field of voiceprint recognition and discloses a method and a system for automatically recognizing the gender of a speaker based on a depth self-coding network. The method trains a UBM universal background model with voice signals unrelated to the registered speakers and channels; extracts an i-vector from the registration data; extracts an i-vector from the test data; trains a depth self-coding network; and performs pattern matching, recognition and model evaluation. The invention applies the depth self-coding network to speaker gender identification and uses its strong learning ability to characterize the speaker characteristics of different genders, which not only re-extracts the features but also reduces the feature dimension, thereby lowering the computational complexity of classification. The method can further be extended to speaker recognition, with the aim of improving the robustness of a speaker recognition system.

Description

Depth self-coding network-based speaker sex automatic identification method and system
Technical Field
The invention belongs to the technical field of voiceprint recognition, and particularly relates to a method and a system for automatically recognizing the gender of a speaker based on a depth self-coding network.
Background
Currently, the prior art commonly used in this field is as follows:
Speaker gender identification is a biometric authentication technique that automatically determines the sex of a speaker from the gender-specific speaker information contained in the voice signal, similar to speaker identification (voiceprint recognition). Deep learning simulates the layered structure by which the human brain processes information; in essence, it abstracts features layer by layer through non-linear transformations across multiple connected hidden layers, building a mapping from low-level features to high-level concepts, and therefore has strong learning ability. In the field of speech recognition in recent years, deep neural networks (DNN) have been successfully applied to acoustic modeling, leading to a milestone improvement in speech recognition performance. The speaker recognition field has also explored how to model speakers with DNNs, but because the set of speakers is not fixed and the training data for each speaker is relatively limited, the gains obtained have been very limited.
Currently, deep learning algorithms applied to the field of speaker recognition can be roughly divided into three categories: feature-extraction based, mapping based, and based on both feature extraction and mapping. The first category applies a deep learning algorithm to feature extraction in the speaker registration stage, and the recognition mapping is completed with a traditional speaker recognition method, such as a GMM, after feature extraction. The second category uses acoustic features extracted by conventional methods, such as MFCCs, as the input of a deep neural network, which serves as a classifier to complete the recognition mapping. The third category applies the deep neural network to both feature extraction and classification to complete the speaker recognition process. Among these, i-vector based methods achieve the better results: one approach applies a deep network to the i-vector extraction stage, while the other uses a deep network as a classifier to complete the final recognition after the i-vector has been extracted. The invention belongs to the latter and realizes voiceprint recognition based on an i-vector and a deep network classifier.
In summary, the problems of the prior art are:
in the prior art based on i-vector, a multi-application Deep Belief Network (DBN) is used for model construction, and a deep self-coding network (SAE) is not used for extracting characteristics of the i-vector again and finally completing classification recognition.
Significance of solving the technical problems:
the invention has the advantages that the i-vector-based deep self-coding network speaker sex identification system is realized, the voiceprint information of speakers with different sexes is further extracted by utilizing the characterization capability of the deep self-coding network, the feature dimension is reduced, the calculation complexity of a classification algorithm is reduced, and the method can be further popularized to the field of speaker identification.
The deep self-encoder is mainly used for data-transformation learning tasks; it is essentially a non-linear feature extraction model for unsupervised learning, and its learning process consists of two stages, unsupervised pre-training and supervised fine-tuning. The most basic self-coding network is a feed-forward neural network symmetric about the middle layer, comprising an input layer, an output layer and hidden layers; its goal is to make the output as close as possible to the input, so that each layer constitutes a feature representation of the input, which can be used to learn an identity mapping and extract unsupervised features. After multi-layer training, the self-encoder can extract essential features from the raw data; a neural network can then be built on these features, or a classifier such as an SVM or LR can be appended, so that classification can be carried out efficiently.
Disclosure of Invention
Aiming at the problems existing in the prior art, the invention provides a method and a system for automatically identifying the gender of a speaker based on a depth self-coding network.
The invention is realized in such a way that the speaker sex automatic identification method based on the depth self-coding network comprises the following steps:
In the training stage, the training-set voice signals are first preprocessed and Mel cepstrum coefficient features are extracted; a UBM universal background model is then trained with a large amount of voice data unrelated to any specific speaker or channel. An i-vector is extracted based on the UBM universal background model and the voice signal of the specific speaker; the self-coding network then re-extracts features from the i-vector and reduces its dimension, and the result is fed to a classifier (a neural network or another classification algorithm) to complete identification and classification.
In the test stage, the test voice signals undergo the same signal preprocessing, i-vector extraction and self-coding feature re-extraction as in the training stage and are classified by the trained classifier; the model is then evaluated with different criteria such as classification accuracy, AUC and MCC.
If the method is applied to speaker recognition, it only requires replacing the gender-labelled voice signals with the voice signals of a certain number of specific speakers and adjusting the deep network structure and evaluation criteria accordingly.
Further, the automatic speaker sex identification method based on the depth self-coding network specifically comprises the following steps:
step 1: training a UBM generic background model using speech signals unrelated to both registered speakers and channels;
step 2: extracting an i-vector of the registration data;
step 3: extracting an i-vector of test data;
step 4: training a depth self-coding network;
step 5: pattern matching and recognition, and model evaluation.
Further, the specific implementation of step 1 includes the following sub-steps:
step 1.1: preprocessing voice signals irrelevant to registered speakers and channels, including pre-emphasis, framing and windowing;
step 1.2: extracting Mel cepstrum coefficient features from the signals preprocessed in step 1.1;
step 1.3: performing global cepstral mean and variance normalization on the Mel cepstrum features obtained in step 1.2;
step 1.4: performing statistical modeling on the Mel cepstrum features obtained in step 1.3 with an N-component Gaussian mixture model, and obtaining, via the EM algorithm, a universal background model UBM with N Gaussian components, comprising the mean supervector, the weight and the covariance matrix of each Gaussian component.
Further, the specific implementation of step 2 includes the following sub-steps:
step 2.1: preprocessing the registered voice signals, including pre-emphasis, framing and windowing;
step 2.2: extracting Mel cepstrum coefficient features from the signals preprocessed in step 2.1;
step 2.3: performing global cepstral mean and variance normalization on the Mel cepstrum features obtained in step 2.2;
Step 2.4: and (3) calculating zero-order and first-order full statistics (Baum-Welch statistics) of each voice segment on each GMM mixed component of the UBM by using the characteristics obtained in the step 2.3 and the universal background model UBM obtained in the step 1:
Figure BDA0001646100150000041
Figure BDA0001646100150000042
wherein, the liquid crystal display device comprises a liquid crystal display device,
Figure BDA0001646100150000043
the zero order statistic and the first order statistic of the voice segment k on the c-th GMM mixed component are respectively represented; />
Figure BDA0001646100150000044
Representing the acoustic characteristics of the speech segment k at the time index t; />
Figure BDA0001646100150000045
Representing acoustic features->
Figure BDA0001646100150000046
Posterior probability for the c-th GMM mixture component;
step 2.5: obtaining the total variability subspace T by maximum likelihood estimation, using the full statistics obtained in step 2.4 and the mean supervector of the universal background model UBM obtained in step 1:
M = m + Tw
where M is the GMM mean supervector containing both speaker information and channel information; m is the UBM mean supervector, independent of both speaker and channel; and w is a low-dimensional vector containing only speaker information, i.e., the i-vector;
step 2.6: extracting the i-vector using the full statistics obtained in step 2.4, the mean supervector of the universal background model UBM obtained in step 1, and the total variability subspace T obtained in step 2.5.
Further, the specific implementation of step 3: i-vectors are extracted from the test data following the sub-steps of step 2.
Further, the specific implementation of step 4 includes the following sub-steps:
step 4.1: carrying out maximum and minimum normalization on the characteristics obtained in the step 2;
step 4.2: carrying out one-hot coding on the gender labels of all registered speakers;
step 4.3: constructing the depth self-coding network structure.
In step 4, after the trained depth self-coding network is obtained, the test-data features obtained in step 3 are used to automatically identify the speaker gender, and the model is evaluated with three indices: classification accuracy, AUC and MCC.
Another object of the present invention is to provide a speaker sex automatic identification control system based on a depth self-coding network.
In summary, the invention has the advantages and positive effects that:
in order to explore the application of the deep neural network in the voiceprint recognition field, the invention provides an automatic gender method of a speaker based on an i-vector and a deep self-coding network, the application of the deep self-coding network in the field is realized, and the method can be further popularized and applied to speaker recognition.
The invention applies the depth self-coding network to the voiceprint recognition field for the first time, re-extracts the characteristics by using the depth self-coding network, reduces the characteristic dimension, thereby reducing the calculation complexity of the classification algorithm, and is one-time exploration of the depth neural network in the field;
the invention utilizes the learning ability of the deep neural network to further extract the voiceprint information of speakers with different sexes, thereby improving the accuracy of the recognition system;
the method provided by the invention realizes the speaker sex classification accuracy of 98% on the experimental data set, wherein the AUC is about 0.995, the MCC is about 0.96, and the traditional speaker sex identification accuracy based on the fundamental frequency is only 85%.
The invention applies the depth self-coding network to the speaker identification, and uses the strong learning ability of the depth self-coding network to characterize the speaker characteristics of different sexes, thereby not only realizing the re-extraction of the characteristics, but also reducing the feature dimension, thereby reducing the complexity in classification operation. The method can also be used for speaker recognition in the later stage, and the robustness of the speaker recognition system is attempted to be improved.
Drawings
Fig. 1 is a flowchart of a method for automatically identifying the gender of a speaker based on a depth self-coding network according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the following examples in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The invention provides a speaker gender identification method based on an i-vector and a depth self-coding network. Firstly, the training-set voice signals are preprocessed and Mel cepstrum coefficient features are extracted; a UBM universal background model is then trained with a large amount of voice data unrelated to any specific speaker or channel. An i-vector is extracted based on the UBM and the Mel cepstrum coefficient features of the specific speaker, and the extracted i-vectors are used to train the depth self-coding network to realize male/female classification. In the test stage, the same signal preprocessing and i-vector extraction are performed, a trained deep network is used for classification, and the model performance is then evaluated with three criteria: classification accuracy, AUC and MCC. The invention applies the depth self-coding network to speaker gender identification and uses its strong learning ability to characterize the speaker characteristics of different genders, which not only re-extracts the features but also reduces the feature dimension, thereby lowering the complexity of the classification operation. The method can later also be used for speaker recognition, with the aim of improving the robustness of a speaker recognition system.
Referring to fig. 1, the method for automatically identifying the gender of the speaker based on the depth self-coding network provided by the embodiment of the invention comprises the following steps:
step 1: training a UBM generic background model using speech signals unrelated to both registered speakers and channels;
the specific implementation comprises the following substeps:
step 1.1: preprocessing voice signals irrelevant to registered speakers and channels, including pre-emphasis, framing and windowing;
step 1.2: extracting Mel cepstrum coefficient features from the signals preprocessed in step 1.1;
step 1.3: performing global cepstral mean and variance normalization on the Mel cepstrum coefficients obtained in step 1.2;
step 1.4: performing statistical modeling on the Mel cepstrum coefficients obtained in step 1.3 with an N-component Gaussian mixture model, and obtaining, via the EM algorithm, a universal background model UBM with N Gaussian components, comprising the mean supervector, the weight and the covariance matrix of each Gaussian component.
In this embodiment, the UBM universal background model is trained with training-set voice signals that are independent of both the registered speakers and the channels. The number of Gaussian mixture components in the UBM should be chosen according to the actual situation, balancing running speed against accuracy during training. It is also necessary to ensure that the training data are balanced, i.e., in this example, that the male and female proportions of the training set are balanced.
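For illustration only (not part of the patent text), step 1 could be sketched with common Python tooling as follows; librosa, numpy and scikit-learn are assumed, the function names are hypothetical, and the parameter values (12 MFCCs, 64 Gaussian components, 256-sample frames) follow the experimental settings described below.

```python
# Illustrative sketch of step 1: MFCC extraction, CMVN and GMM-UBM training.
# Assumes librosa, numpy and scikit-learn; not the patent's reference implementation.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def extract_mfcc(wav_path, n_mfcc=12, frame_length=256):
    """Pre-emphasis, framing/windowing and MFCC extraction for one utterance."""
    y, sr = librosa.load(wav_path, sr=None)
    y = librosa.effects.preemphasis(y)                        # pre-emphasis
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=frame_length, hop_length=frame_length // 2)
    return mfcc.T                                              # (frames, n_mfcc)

def cmvn(features):
    """Global cepstral mean and variance normalization."""
    return (features - features.mean(axis=0)) / (features.std(axis=0) + 1e-8)

def train_ubm(wav_paths, n_components=64):
    """Pool normalized MFCCs from speaker/channel-independent speech and fit a GMM-UBM via EM."""
    feats = np.vstack([cmvn(extract_mfcc(p)) for p in wav_paths])
    ubm = GaussianMixture(n_components=n_components, covariance_type='diag', max_iter=200)
    ubm.fit(feats)                                             # EM training
    return ubm                                                 # exposes means_, weights_, covariances_
```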
Step 2: extracting an i-vector of the registration data;
the specific implementation comprises the following substeps:
step 2.1: preprocessing the registered voice signal, including pre-emphasis, framing and windowing;
step 2.2: extracting Mel cepstrum coefficient features from the signals preprocessed in step 2.1;
step 2.3: performing global cepstral mean and variance normalization on the Mel cepstrum coefficients obtained in step 2.2;
Step 2.4: and (3) calculating zero-order and first-order full statistics (Baum-Welch statistics) of each voice segment on each GMM mixed component of the UBM by using the characteristics obtained in the step 2.3 and the universal background model UBM obtained in the step 1:
Figure BDA0001646100150000071
Figure BDA0001646100150000072
wherein, the liquid crystal display device comprises a liquid crystal display device,
Figure BDA0001646100150000073
the zero order statistic and the first order statistic of the voice segment k on the c-th GMM mixed component are respectively represented; />
Figure BDA0001646100150000074
Representing the acoustic characteristics of the speech segment k at the time index t; />
Figure BDA0001646100150000075
Representing acoustic features->
Figure BDA0001646100150000076
Posterior probability for the c-th GMM mixture component;
step 2.5: obtaining the total variability subspace T by maximum likelihood estimation, using the full statistics obtained in step 2.4 and the mean supervector of the universal background model UBM obtained in step 1:
M = m + Tw
where M is the GMM mean supervector containing both speaker information and channel information; m is the UBM mean supervector, independent of both speaker and channel; and w is a low-dimensional vector containing only speaker information, i.e., the i-vector;
step 2.6: extracting the i-vector using the full statistics obtained in step 2.4, the mean supervector of the universal background model UBM obtained in step 1, and the total variability subspace T obtained in step 2.5.
In this embodiment, feature extraction is performed on the voice of the registered speakers, who do not overlap with the speakers of the UBM training set; this ensures that the UBM fits the general distribution of human voice features, so that features not covered by the registered voice signals of a specific speaker can be approximated by the similar feature distributions in the UBM.
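For step 2.6, one commonly used closed form for the i-vector (the posterior mean of w given the statistics, the UBM and a trained total variability matrix T) is sketched below; this is the standard formulation from the i-vector literature, offered only as an assumption about how the extraction could be implemented, not as the patent's exact estimator.

```python
# Illustrative sketch: i-vector posterior mean from Baum-Welch statistics,
# the UBM parameters and a total variability matrix T (standard formulation).
import numpy as np

def extract_ivector(N, F, ubm_means, ubm_covs, T):
    """
    N: (C,) zero-order stats; F: (C, D) first-order stats;
    ubm_means: (C, D); ubm_covs: (C, D) diagonal covariances;
    T: (C*D, R) total variability matrix. Returns an R-dimensional i-vector.
    """
    C, D = ubm_means.shape
    R = T.shape[1]
    F_centered = (F - N[:, None] * ubm_means).reshape(C * D)  # centre stats on the UBM means
    sigma_inv = (1.0 / ubm_covs).reshape(C * D)               # diagonal Sigma^-1
    N_rep = np.repeat(N, D)                                   # expand N_c to every feature dimension
    # w = (I + T' Sigma^-1 N T)^-1 T' Sigma^-1 (F - N m)
    precision = np.eye(R) + T.T @ (T * (N_rep * sigma_inv)[:, None])
    w = np.linalg.solve(precision, T.T @ (sigma_inv * F_centered))
    return w
```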
Step 3: extracting the i-vectors of the test data. The specific implementation is as follows: i-vectors are extracted from the test data following the sub-steps of step 2.
In this embodiment, 9 sentences are taken from 10 sentences of voice samples of each registered speaker for training, and 1 sentence is used as a test sentence.
Step 4: training a depth self-coding network;
the specific implementation comprises the following substeps:
step 4.1: performing maximum and minimum normalization on the features obtained in the step 2, and scaling all feature data to a 0-1 interval in equal proportion;
x' = (x - x_min) / (x_max - x_min)
step 4.2: One-hot encoding is performed on the gender labels of all registered speakers. In this encoding, N states are represented with an N-bit independent state register in which exactly one bit is active at a time. For this data set there are only two classes or states, 0 and 1; after one-hot encoding they become two binary, mutually exclusive features, of which exactly one is activated for each sample.
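A minimal sketch of steps 4.1 and 4.2 (min-max normalization and one-hot encoding) is given below, using only numpy; the function names are hypothetical.

```python
# Illustrative sketch of steps 4.1 and 4.2: min-max normalization of the
# i-vectors and one-hot encoding of the gender labels (numpy only).
import numpy as np

def min_max_normalize(X):
    """Scale every feature dimension proportionally into the [0, 1] interval."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (X - x_min) / (x_max - x_min + 1e-8)

def one_hot(labels, num_classes=2):
    """labels: integer gender labels (male = 1, female = 0)."""
    encoded = np.zeros((len(labels), num_classes))
    encoded[np.arange(len(labels)), labels] = 1.0   # exactly one bit active per sample
    return encoded
```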
Step 4.3: constructing a depth self-coding network structure;
The self-coding network is an unsupervised learning algorithm that attempts to approximate the identity function h_{W,b}(x) ≈ x, so that the output is close to the input. It usually comprises two parts, an encoder and a decoder, which can be defined by the two transformations φ and ψ:
φ: X → F
ψ: F → X
(φ, ψ) = argmin_{φ,ψ} ||x - (ψ∘φ)(x)||^2
The encoding process maps the input x ∈ R^m to a hidden representation z = h(x) ∈ R^n, constructed as follows:
z = σ(Wx + b)
where σ is the activation function; in the nonlinear case a sigmoid or tanh function is usually taken. W ∈ R^{n×m} is the encoding weight matrix and b ∈ R^n the encoding bias vector.
The decoding process maps the hidden representation h(x) to the output layer to reconstruct the input x, restoring an x' as close as possible to the input x:
x' = σ'(W'z + b')
where σ' is an activation function with the same meaning as σ, W' ∈ R^{m×n} is the decoding weight matrix and b' ∈ R^m the decoding bias vector.
The reconstruction error is
L(x, x') = ||x - x'||^2 = ||x - σ'(W'(σ(Wx + b)) + b')||^2
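The encoder, decoder and reconstruction error defined above can be sketched as follows; this is a minimal numpy illustration with hypothetical names, not an optimized implementation.

```python
# Illustrative sketch of a single-hidden-layer autoencoder:
# z = sigma(W x + b), x' = sigma'(W' z + b'), L(x, x') = ||x - x'||^2.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class Autoencoder:
    def __init__(self, m, n, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(n, m))        # encoding weights, R^{n x m}
        self.b = np.zeros(n)                               # encoding bias,    R^n
        self.W_dec = rng.normal(scale=0.1, size=(m, n))    # decoding weights, R^{m x n}
        self.b_dec = np.zeros(m)                           # decoding bias,    R^m

    def encode(self, x):
        return sigmoid(self.W @ x + self.b)                # hidden representation z

    def decode(self, z):
        return sigmoid(self.W_dec @ z + self.b_dec)        # reconstruction x'

    def reconstruction_error(self, x):
        x_rec = self.decode(self.encode(x))
        return float(np.sum((x - x_rec) ** 2))             # squared reconstruction error
```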
The depth self-coding network is a self-coding network comprising multiple hidden layers and symmetric about the middle layer, with one input layer, 2r-1 hidden layers and one output layer. Let the input layer contain m neurons, x = (x_1, x_2, ..., x_m)^T ∈ R^m; let the k-th hidden layer contain n_k = n_{2r-k} neurons (k = 1, 2, ..., 2r-1), with corresponding hidden-layer vector h_k = (h_1^k, h_2^k, ..., h_{n_k}^k)^T ∈ R^{n_k}; and let the output layer be x' = (x_1', x_2', ..., x_m')^T ∈ R^m. The neuron activation outputs of the layers of the self-coding network can be expressed as:
h_1 = σ(W_1 x + b_1)
h_k = σ(W_k h_{k-1} + b_k), k = 2, ..., 2r-1
x' = σ'(W_{2r} h_{2r-1} + b_{2r})
where W_1 ∈ R^{n_1×m} is the weight matrix between the input layer and the 1st hidden layer, W_k ∈ R^{n_k×n_{k-1}} is the weight matrix between the (k-1)-th and the k-th hidden layers, W_{2r} ∈ R^{m×n_{2r-1}} is the weight matrix between the (2r-1)-th hidden layer and the output layer, and b_1, b_k, b_{2r} are the corresponding bias vectors.
The training process comprises two stages, unsupervised pre-training and supervised fine-tuning. From the input layer to the middle layer of the encoder, each pair of adjacent layers is regarded as a restricted Boltzmann machine (RBM); the output of each RBM serves as the input of the next one, and all RBMs are trained layer by layer with an unsupervised learning algorithm (such as the CD or PCD algorithm). Starting from the bottom RBM, the weight matrix W_1, visible-layer bias a_1 and hidden-layer bias b_1 are pre-trained; then, layer by layer, the (k-1)-th and k-th hidden layers are regarded as an RBM whose corresponding weight matrix W_k and biases a_k and b_k are pre-trained (1 < k ≤ r); finally, for r < k ≤ 2r, the pre-trained RBMs are stacked in reverse, directly constructing W_k = (W_{2r+1-k})^T and b_k = a_{2r+1-k}, which yields all initialization weights and biases of the self-encoder. In this training scheme, while the parameters of one layer are trained, the parameters of the other layers are kept fixed.
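As an illustrative sketch only, the layer-wise pre-training described above relies on contrastive-divergence updates of individual restricted Boltzmann machines; a single CD-1 update for a Bernoulli RBM is shown below with hypothetical names, as one possible realization of the CD algorithm mentioned in the text.

```python
# Illustrative sketch: one CD-1 update of a Bernoulli RBM, the building
# block of the layer-by-layer unsupervised pre-training described above.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_update(W, a, b, v0, lr=0.01, rng=None):
    """W: (n_hidden, n_visible); a: visible bias; b: hidden bias; v0: (n_visible,) input."""
    rng = rng or np.random.default_rng(0)
    p_h0 = sigmoid(W @ v0 + b)                              # positive phase
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)      # sample hidden units
    p_v1 = sigmoid(W.T @ h0 + a)                            # reconstruct the visible layer
    p_h1 = sigmoid(W @ p_v1 + b)                            # negative phase
    W += lr * (np.outer(p_h0, v0) - np.outer(p_h1, p_v1))   # approximate gradient step
    a += lr * (v0 - p_v1)
    b += lr * (p_h0 - p_h1)
    return W, a, b
```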
After the unsupervised pre-training is completed, all parameters of the network are fine-tuned with a supervised learning algorithm, generally the BP algorithm, or stochastic gradient descent, conjugate gradient descent and the like. The optimized objective function can be the squared reconstruction error
J = (1/N) Σ_{l=1}^{N} ||y_l - ŷ_l||^2
or the cross-entropy function
J = -(1/N) Σ_{l=1}^{N} [ y_l log ŷ_l + (1 - y_l) log(1 - ŷ_l) ]
where (x_l, y_l), 1 ≤ l ≤ N, are the N training samples, y_l is the desired output and ŷ_l is the actual output.
For classification purposes, a single-layer neural network or a multi-layer perceptron is added after the self-coding network: the decoding layers of the self-coding network are discarded, the output of the last encoding layer is taken as the input of the classification neural network, and the gradients of the classification error are back-propagated to the encoding layers.
In this embodiment, the depth self-coding network is a four-layer structure, in which two layers are self-coding layers and the other two are perceptron layers. The encoding layers of the self-coding network compress the original 400-dimensional input features to 40 dimensions, completing the feature re-extraction; the perceptron layers classify on the 40-dimensional features and finally output 2 labels. The self-coding layers use the mean squared error as the loss function and the perceptron layers use the cross entropy; in practice either loss may be chosen. An illustrative sketch of this structure is given below.
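A minimal PyTorch sketch of this four-layer structure follows; PyTorch is assumed, the intermediate layer width (200) is a hypothetical choice not specified in the text, and the pre-training stage is omitted.

```python
# Illustrative PyTorch sketch of the embodiment: two encoding layers that
# compress the 400-dimensional i-vector to 40 dimensions, followed by a
# two-layer perceptron classifier producing 2 gender labels.
import torch
import torch.nn as nn

class GenderNet(nn.Module):
    def __init__(self, in_dim=400, hidden=200, code_dim=40, n_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(                 # self-coding layers (decoder discarded)
            nn.Linear(in_dim, hidden), nn.Sigmoid(),
            nn.Linear(hidden, code_dim), nn.Sigmoid(),
        )
        self.classifier = nn.Sequential(              # perceptron layers on the 40-dim code
            nn.Linear(code_dim, 32), nn.ReLU(),
            nn.Linear(32, n_classes),
        )

    def forward(self, x):
        return self.classifier(self.encoder(x))

model = GenderNet()
criterion = nn.CrossEntropyLoss()                     # cross entropy on the classification side
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
```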
Step 5: pattern matching and recognition; after a trained network is obtained, the test i-vector obtained in the step 3 is used as the input of the depth self-coding network, so that the automatic classification of the gender of the speaker is realized, and the model is evaluated according to three indexes of classification accuracy, AUC and MCC.
AUC (Area Under Curve) is an evaluation index commonly used in classification models, defined as the area under the ROC curve. ROC curves are made based on the true class and predicted probability of the sample, with the x-axis representing false positive rate-FPR (False Positive Rate) and the y-axis representing true positive rate-TPR (True Positive Rate), defined as:
TPR = TP / (TP + FN),  FPR = FP / (FP + TN)
the classification Accuracy (ACC) is defined as:
ACC = (TP + TN) / (TP + TN + FP + FN)
where:
TP (true positives): positive samples predicted as the positive class;
FP (false positives): negative samples predicted as the positive class;
FN (false negatives): positive samples predicted as the negative class;
TN (true negatives): negative samples predicted as the negative class.
The ROC curve remains essentially unchanged when the distribution of positive and negative samples changes; for completely random classification the AUC is close to 0.5, and the closer the AUC is to 1, the better the model's predictions.
The MCC, i.e., the Matthews correlation coefficient, is another evaluation index suitable for binary classification models, defined as:
MCC = (TP × TN - FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
the MCC has a value in the range of [ -1,1], if mcc= -1 indicates a completely opposite prediction, mcc=0 indicates a random prediction, and mcc=1 indicates a perfect prediction, i.e. if MCC is closer to 1, the model prediction is better.
The invention is further described in connection with simulation experiments.
In the experiment, the TIMIT database is used. For the UBM training stage, 200 speakers (108 men and 72 women) are selected; each speaker contributes 10 utterances of 10 s of speech. 12th-order MFCC parameters are extracted with a frame length of 256, and after normalization a universal background model UBM with 64 Gaussian components is trained.
In the i-vector extraction stage, 77 men and 77 women (disjoint from the 200 speakers used to train the UBM) are selected; each speaker again contributes 10 utterances of 10 s. 12th-order MFCC parameters are extracted with a frame length of 256, the zero-order and first-order full statistics are computed against the trained UBM, the total variability subspace T of dimension 400 is then estimated, and i-vector extraction is completed based on the UBM and T, yielding 400-dimensional i-vectors.
In the neural network training stage, 9 of the 10 utterances of each speaker are used as the training set and 1 as the test set, and min-max normalization is applied to all features. The male label is set to 1 and the female label to 0, and the labels are one-hot encoded.
A three-layer network (one self-coding layer and two perceptron classification layers) and a four-layer network (two stacked self-coding layers and two perceptron classification layers) were constructed in this experiment. With 5000 training iterations for the self-coding part and 10000 for the classifier, the three-layer network achieves 96% classification accuracy, an AUC of 0.995 and an MCC of 0.9097. With 7000 and 25000 training iterations respectively, the four-layer network achieves 98% classification accuracy, an AUC of 0.9886 and an MCC of 0.961.
In the above embodiments, the invention may be implemented in whole or in part by software, hardware, firmware or any combination thereof. When implemented in whole or in part in software, it takes the form of a computer program product comprising one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server or data center to another by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD) or a semiconductor medium (e.g., a solid state disk (SSD)), etc.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.

Claims (4)

1. The automatic speaker sex identification method based on the depth self-coding network is characterized by comprising the following steps of:
firstly, preprocessing the training-set voice signals and extracting Mel cepstrum coefficient features, and then training a UBM universal background model with a large amount of voice data unrelated to any specific speaker or channel; extracting an i-vector based on the UBM universal background model and the voice signal of the specific speaker; training a self-encoder with the extracted i-vector as the input of the depth self-coding network, further refining the features, and finally realizing classification of the different speakers by a classifier;
in the test stage, the test voice signals undergo the same preprocessing and i-vector extraction as in the training stage, the trained depth self-coding network is used for feature extraction and classification, and the model is then evaluated with the classification accuracy, AUC and MCC;
the speaker sex automatic identification method based on the depth self-coding network specifically comprises the following steps:
step one: training a UBM generic background model using speech signals unrelated to both registered speakers and channels;
step two: extracting an i-vector of the registration data;
step three: extracting an i-vector of test data;
step four: training a depth self-coding network;
step five: pattern matching and recognition, and model evaluation;
step four, specifically comprising:
a): performing maximum-minimum normalization on the features obtained in step two, scaling all feature data proportionally into the interval [0, 1]:
x' = (x - x_min) / (x_max - x_min)
b): performing one-hot encoding on the gender labels of all registered speakers; the encoding method is: when N states are encoded, an N-bit independent state register is used in which only one bit is active at a time; this data set has only the two classes or states 0 and 1, which after one-hot encoding become 2 binary mutually exclusive features, of which only one is activated for each sample;
c): constructing the depth self-coding network structure; the depth self-coding network comprises an input layer, 2r-1 hidden layers and an output layer; the input layer contains m neurons, x = (x_1, x_2, ..., x_m)^T ∈ R^m; the k-th hidden layer contains n_k = n_{2r-k} neurons (k = 1, 2, ..., 2r-1), with hidden-layer vector h_k = (h_1^k, h_2^k, ..., h_{n_k}^k)^T ∈ R^{n_k}; the output layer is x' = (x_1', x_2', ..., x_m')^T ∈ R^m; the neuron activation output of each layer of the self-coding network is expressed as:
h_1 = σ(W_1 x + b_1)
h_k = σ(W_k h_{k-1} + b_k), k = 2, ..., 2r-1
x' = σ'(W_{2r} h_{2r-1} + b_{2r})
where W_1 ∈ R^{n_1×m} is the weight matrix between the input layer and the 1st hidden layer, W_k ∈ R^{n_k×n_{k-1}} is the weight matrix between the (k-1)-th and the k-th hidden layers, W_{2r} ∈ R^{m×n_{2r-1}} is the weight matrix between the (2r-1)-th hidden layer and the output layer, and b_1, b_k, b_{2r} are the corresponding bias vectors.
2. The method for automatically identifying the gender of a speaker based on a depth self-encoding network as claimed in claim 1, wherein the step one specifically comprises:
1): preprocessing voice signals irrelevant to registered speakers and channels, including pre-emphasis, framing and windowing;
2): extracting Mel cepstrum coefficient characteristics from the signals pretreated in the step 1);
3): carrying out global cepstrum mean and variance normalization on the Mel cepstrum features obtained in the step 2);
4): and (3) carrying out statistical modeling on the Mel cepstrum features obtained in the step (3) by using N mixed Gaussian models, and obtaining a universal background model UBM with N Gaussian components by using an EM algorithm, wherein the universal background model UBM comprises a mean value supervector, a weight and a Gaussian component covariance matrix of each Gaussian component.
3. The method for automatically identifying the gender of a speaker based on a depth self-encoding network as claimed in claim 1, wherein the step two specifically comprises:
a): preprocessing the registered voice signals, including pre-emphasis, framing and windowing;
b): extracting Mel cepstrum coefficient features from the signals preprocessed in step a);
c): performing global cepstral mean and variance normalization on the Mel cepstrum features obtained in step b);
d): calculating the zero-order and first-order full statistics of each voice segment on each GMM mixture component of the UBM, using the features obtained in step c) and the universal background model UBM obtained in step one:
N_c^(k) = Σ_t P(c | x_t^(k))
F_c^(k) = Σ_t P(c | x_t^(k)) x_t^(k)
where N_c^(k) and F_c^(k) respectively denote the zero-order and first-order statistics of voice segment k on the c-th GMM mixture component; x_t^(k) denotes the acoustic feature of voice segment k at time index t; and P(c | x_t^(k)) denotes the posterior probability of the c-th GMM mixture component given x_t^(k);
e): obtaining the total variability subspace T by maximum likelihood estimation, using the full statistics obtained in step d) and the mean supervector of the universal background model UBM obtained in step one:
M = m + Tw
where M is the GMM mean supervector containing both speaker information and channel information; m is the UBM mean supervector, independent of both speaker and channel; and w is a low-dimensional vector containing only speaker information, the i-vector;
f): extracting the i-vector using the full statistics obtained in step d), the mean supervector of the universal background model UBM obtained in step one, and the total variability subspace T obtained in step e).
4. A depth self-coding network-based speaker sex automatic recognition control system implementing the depth self-coding network-based speaker sex automatic identification method of claim 1.
CN201810402685.0A 2018-04-28 2018-04-28 Depth self-coding network-based speaker sex automatic identification method and system Active CN109545227B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810402685.0A CN109545227B (en) 2018-04-28 2018-04-28 Depth self-coding network-based speaker sex automatic identification method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810402685.0A CN109545227B (en) 2018-04-28 2018-04-28 Depth self-coding network-based speaker sex automatic identification method and system

Publications (2)

Publication Number Publication Date
CN109545227A CN109545227A (en) 2019-03-29
CN109545227B true CN109545227B (en) 2023-05-09

Family

ID=65830729

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810402685.0A Active CN109545227B (en) 2018-04-28 2018-04-28 Depth self-coding network-based speaker sex automatic identification method and system

Country Status (1)

Country Link
CN (1) CN109545227B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109920435B (en) * 2019-04-09 2021-04-06 厦门快商通信息咨询有限公司 Voiceprint recognition method and voiceprint recognition device
CN110187321B (en) * 2019-05-30 2022-07-22 电子科技大学 Radar radiation source characteristic parameter extraction method based on deep learning in complex environment
CN110136726A (en) * 2019-06-20 2019-08-16 厦门市美亚柏科信息股份有限公司 A kind of estimation method, device, system and the storage medium of voice gender
CN110427978B (en) * 2019-07-10 2022-01-11 清华大学 Variational self-encoder network model and device for small sample learning
CN112331181A (en) * 2019-07-30 2021-02-05 中国科学院声学研究所 Target speaker voice extraction method based on multi-speaker condition
CN110473557B (en) * 2019-08-22 2021-05-28 浙江树人学院(浙江树人大学) Speech signal coding and decoding method based on depth self-encoder
CN111161744B (en) * 2019-12-06 2023-04-28 华南理工大学 Speaker clustering method for simultaneously optimizing deep characterization learning and speaker identification estimation
CN111161713A (en) * 2019-12-20 2020-05-15 北京皮尔布莱尼软件有限公司 Voice gender identification method and device and computing equipment
CN111798875A (en) * 2020-07-21 2020-10-20 杭州芯声智能科技有限公司 VAD implementation method based on three-value quantization compression

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2579332A1 (en) * 2006-02-20 2007-08-20 Diaphonics, Inc. Method and system for detecting speaker change in a voice transaction
CN104732978A (en) * 2015-03-12 2015-06-24 上海交通大学 Text-dependent speaker recognition method based on joint deep learning
WO2017113680A1 (en) * 2015-12-30 2017-07-06 百度在线网络技术(北京)有限公司 Method and device for voiceprint authentication processing
CN107146615A (en) * 2017-05-16 2017-09-08 南京理工大学 Audio recognition method and system based on the secondary identification of Matching Model

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003036097A (en) * 2001-07-25 2003-02-07 Sony Corp Device and method for detecting and retrieving information
CN101833951B (en) * 2010-03-04 2011-11-09 清华大学 Multi-background modeling method for speaker recognition
US9502038B2 (en) * 2013-01-28 2016-11-22 Tencent Technology (Shenzhen) Company Limited Method and device for voiceprint recognition
CN107112025A (en) * 2014-09-12 2017-08-29 美商楼氏电子有限公司 System and method for recovering speech components
KR101843074B1 (en) * 2016-10-07 2018-03-28 서울대학교산학협력단 Speaker recognition feature extraction method and system using variational auto encoder
CN107784215B (en) * 2017-10-13 2018-10-26 上海交通大学 Audio unit based on intelligent terminal carries out the user authen method and system of labiomaney


Also Published As

Publication number Publication date
CN109545227A (en) 2019-03-29

Similar Documents

Publication Publication Date Title
CN109545227B (en) Depth self-coding network-based speaker sex automatic identification method and system
Mannepalli et al. A novel adaptive fractional deep belief networks for speaker emotion recognition
Liu et al. Deep feature for text-dependent speaker verification
CN110164452A (en) A kind of method of Application on Voiceprint Recognition, the method for model training and server
Mingote et al. Optimization of the area under the ROC curve using neural network supervectors for text-dependent speaker verification
US20210005183A1 (en) Orthogonally constrained multi-head attention for speech tasks
Bhardwaj et al. GFM-based methods for speaker identification
CN109637526A (en) The adaptive approach of DNN acoustic model based on personal identification feature
CN110246509B (en) Stack type denoising self-encoder and deep neural network structure for voice lie detection
CN113450806B (en) Training method of voice detection model, and related method, device and equipment
Ivanko et al. An experimental analysis of different approaches to audio–visual speech recognition and lip-reading
Azam et al. Speaker verification using adapted bounded Gaussian mixture model
Karthikeyan Adaptive boosted random forest-support vector machine based classification scheme for speaker identification
CN111462762B (en) Speaker vector regularization method and device, electronic equipment and storage medium
Jakubec et al. Deep speaker embeddings for Speaker Verification: Review and experimental comparison
CN112863521A (en) Speaker identification method based on mutual information estimation
CN111243621A (en) Construction method of GRU-SVM deep learning model for synthetic speech detection
Anand et al. Text-independent speaker recognition for Ambient Intelligence applications by using information set features
CN115101077A (en) Voiceprint detection model training method and voiceprint recognition method
CN112951270B (en) Voice fluency detection method and device and electronic equipment
Dennis et al. Generalized Hough transform for speech pattern classification
Moonasar et al. A committee of neural networks for automatic speaker recognition (ASR) systems
Bhardwaj et al. Identification of speech signal in moving objects using artificial neural network system
Sapijaszko Increasing accuracy performance through optimal feature extraction algorithms
US20240105206A1 (en) Seamless customization of machine learning models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant