CN109545227B - Depth self-coding network-based speaker sex automatic identification method and system - Google Patents

Depth self-coding network-based speaker sex automatic identification method and system

Info

Publication number
CN109545227B
CN109545227B (application CN201810402685.0A)
Authority
CN
China
Prior art keywords
speaker
vector
ubm
steps
depth self
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810402685.0A
Other languages
Chinese (zh)
Other versions
CN109545227A (en)
Inventor
王志锋
段苏容
左明章
田元
闵秋莎
夏丹
叶俊民
陈迪
罗恒
姚璜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central China Normal University
Original Assignee
Central China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central China Normal University filed Critical Central China Normal University
Priority to CN201810402685.0A priority Critical patent/CN109545227B/en
Publication of CN109545227A publication Critical patent/CN109545227A/en
Application granted granted Critical
Publication of CN109545227B publication Critical patent/CN109545227B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 - Training, enrolment or model building
    • G10L17/18 - Artificial neural networks; Connectionist approaches
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Abstract

The invention belongs to the technical field of voiceprint recognition and discloses a method and a system for automatically recognizing the gender of a speaker based on a depth self-coding network. The method trains a UBM universal background model with voice signals unrelated to the registered speakers and channels; extracts an i-vector from the registration data; extracts an i-vector from the test data; trains a depth self-coding network; and performs pattern matching, recognition and model evaluation. The invention applies the depth self-coding network to speaker gender identification and uses its strong learning ability to characterize the speaker characteristics of different genders, which not only re-extracts the features but also reduces the feature dimension, thereby lowering the computational complexity of classification. The method can further be extended to speaker recognition, with the aim of improving the robustness of a speaker recognition system.

Description

Depth self-coding network-based speaker sex automatic identification method and system
Technical Field
The invention belongs to the technical field of voiceprint recognition, and particularly relates to a method and a system for automatically recognizing the gender of a speaker based on a depth self-coding network.
Background
Currently, the prior art commonly used in this field is as follows:
Speaker gender identification is a biometric authentication technique that automatically determines the sex of a speaker from the gender-specific speaker information contained in the voice signal, similar to speaker identification (voiceprint recognition). Deep learning simulates the layered structure by which the human brain processes information; in essence, it abstracts features layer by layer through non-linear transformations across multiple connected hidden layers, building a mapping from low-level features to high-level concepts, and therefore has strong learning ability. In the field of speech recognition in recent years, deep neural networks (DNN) have been successfully applied to acoustic modeling, leading to a milestone improvement in speech recognition performance. The speaker recognition field has also explored how to model speakers with DNNs, but because the set of speakers is not fixed and the training data for each speaker is relatively limited, the gains obtained have been very limited.
Currently, deep learning algorithms applied to the field of speaker recognition can be roughly divided into three categories: feature-extraction based, mapping based, and based on both feature extraction and mapping. The first category applies a deep learning algorithm to feature extraction in the speaker registration stage, and the recognition mapping is completed with a traditional speaker recognition method, such as a GMM, after feature extraction. The second category uses acoustic features extracted by conventional methods, such as MFCCs, as the input of a deep neural network, which serves as a classifier to complete the recognition mapping. The third category applies the deep neural network to both feature extraction and classification to complete the speaker recognition process. Among these, i-vector based methods achieve the better results: one approach applies a deep network to the i-vector extraction stage, while the other uses a deep network as a classifier to complete the final recognition after the i-vector has been extracted. The invention belongs to the latter and realizes voiceprint recognition based on an i-vector and a deep network classifier.
In summary, the problems of the prior art are:
in the prior art based on i-vector, a multi-application Deep Belief Network (DBN) is used for model construction, and a deep self-coding network (SAE) is not used for extracting characteristics of the i-vector again and finally completing classification recognition.
Significance of solving the technical problems:
the invention has the advantages that the i-vector-based deep self-coding network speaker sex identification system is realized, the voiceprint information of speakers with different sexes is further extracted by utilizing the characterization capability of the deep self-coding network, the feature dimension is reduced, the calculation complexity of a classification algorithm is reduced, and the method can be further popularized to the field of speaker identification.
The deep self-encoder is mainly used for data-transformation learning tasks; it is essentially a non-linear feature extraction model for unsupervised learning, and its learning process consists of two stages, unsupervised pre-training and supervised fine-tuning. The most basic self-coding network is a feed-forward neural network symmetric about the middle layer, comprising an input layer, an output layer and hidden layers; its goal is to make the output as close as possible to the input, so that each layer constitutes a feature representation of the input, which can be used to learn an identity mapping and extract unsupervised features. After multi-layer training, the self-encoder can extract essential features from the raw data; a neural network can then be built on these features, or a classifier such as an SVM or LR can be appended, so that classification can be carried out efficiently.
Disclosure of Invention
Aiming at the problems existing in the prior art, the invention provides a method and a system for automatically identifying the gender of a speaker based on a depth self-coding network.
The invention is realized in such a way that the speaker sex automatic identification method based on the depth self-coding network comprises the following steps:
In the training stage, the training-set voice signals are first preprocessed and Mel cepstrum coefficient features are extracted; a UBM universal background model is then trained with a large amount of voice data unrelated to any specific speaker or channel. An i-vector is extracted based on the UBM universal background model and the voice signal of the specific speaker; the self-coding network then re-extracts features from the i-vector and reduces its dimension, and the result is fed to a classifier (a neural network or another classification algorithm) to complete identification and classification.
In the test stage, the test voice signals undergo the same signal preprocessing, i-vector extraction and self-coding feature re-extraction as in the training stage and are classified by the trained classifier; the model is then evaluated with different criteria such as classification accuracy, AUC and MCC.
If the method is applied to speaker recognition, it only requires replacing the gender-labelled voice signals with the voice signals of a certain number of specific speakers and adjusting the deep network structure and evaluation criteria accordingly.
Further, the automatic speaker sex identification method based on the depth self-coding network specifically comprises the following steps:
step 1: training a UBM generic background model using speech signals unrelated to both registered speakers and channels;
step 2: extracting an i-vector of the registration data;
step 3: extracting an i-vector of test data;
step 4: training a depth self-coding network;
step 5: pattern matching and recognition, and model evaluation.
Further, the specific implementation of step 1 includes the following sub-steps:
step 1.1: preprocessing voice signals irrelevant to registered speakers and channels, including pre-emphasis, framing and windowing;
step 1.2: extracting Mel cepstrum coefficient features from the signals preprocessed in step 1.1;
step 1.3: performing global cepstral mean and variance normalization on the Mel cepstrum features obtained in step 1.2;
step 1.4: performing statistical modeling on the Mel cepstrum features obtained in step 1.3 with an N-component Gaussian mixture model, and obtaining, via the EM algorithm, a universal background model UBM with N Gaussian components, comprising the mean supervector, the weight and the covariance matrix of each Gaussian component.
Further, the specific implementation of step 2 includes the following sub-steps:
step 2.1: preprocessing the registered voice signals, including pre-emphasis, framing and windowing;
step 2.2: extracting Mel cepstrum coefficient features from the signals preprocessed in step 2.1;
step 2.3: performing global cepstral mean and variance normalization on the Mel cepstrum features obtained in step 2.2;
Step 2.4: and (3) calculating zero-order and first-order full statistics (Baum-Welch statistics) of each voice segment on each GMM mixed component of the UBM by using the characteristics obtained in the step 2.3 and the universal background model UBM obtained in the step 1:
Figure BDA0001646100150000041
Figure BDA0001646100150000042
wherein, the liquid crystal display device comprises a liquid crystal display device,
Figure BDA0001646100150000043
the zero order statistic and the first order statistic of the voice segment k on the c-th GMM mixed component are respectively represented; />
Figure BDA0001646100150000044
Representing the acoustic characteristics of the speech segment k at the time index t; />
Figure BDA0001646100150000045
Representing acoustic features->
Figure BDA0001646100150000046
Posterior probability for the c-th GMM mixture component;
step 2.5: obtaining the total variability subspace T by maximum likelihood estimation, using the full statistics obtained in step 2.4 and the mean supervector of the universal background model UBM obtained in step 1:
M = m + Tw
where M is the GMM mean supervector containing both speaker information and channel information; m is the UBM mean supervector, independent of both speaker and channel; and w is a low-dimensional vector containing only speaker information, i.e., the i-vector;
step 2.6: extracting the i-vector using the full statistics obtained in step 2.4, the mean supervector of the universal background model UBM obtained in step 1, and the total variability subspace T obtained in step 2.5.
Further, the specific implementation of step 3: i-vectors are extracted from the test data following the sub-steps of step 2.
Further, the specific implementation of step 4 includes the following sub-steps:
step 4.1: carrying out maximum and minimum normalization on the characteristics obtained in the step 2;
step 4.2: carrying out one-hot coding on the gender labels of all registered speakers;
step 4.3: constructing the depth self-coding network structure.
In step 4, after the trained depth self-coding network is obtained, the test-data features obtained in step 3 are used to automatically identify the speaker gender, and the model is evaluated with three indices: classification accuracy, AUC and MCC.
Another object of the present invention is to provide a speaker sex automatic identification control system based on a depth self-coding network.
In summary, the invention has the advantages and positive effects that:
in order to explore the application of the deep neural network in the voiceprint recognition field, the invention provides an automatic gender method of a speaker based on an i-vector and a deep self-coding network, the application of the deep self-coding network in the field is realized, and the method can be further popularized and applied to speaker recognition.
The invention applies the depth self-coding network to the voiceprint recognition field for the first time, re-extracts the characteristics by using the depth self-coding network, reduces the characteristic dimension, thereby reducing the calculation complexity of the classification algorithm, and is one-time exploration of the depth neural network in the field;
the invention utilizes the learning ability of the deep neural network to further extract the voiceprint information of speakers with different sexes, thereby improving the accuracy of the recognition system;
the method provided by the invention realizes the speaker sex classification accuracy of 98% on the experimental data set, wherein the AUC is about 0.995, the MCC is about 0.96, and the traditional speaker sex identification accuracy based on the fundamental frequency is only 85%.
The invention applies the depth self-coding network to the speaker identification, and uses the strong learning ability of the depth self-coding network to characterize the speaker characteristics of different sexes, thereby not only realizing the re-extraction of the characteristics, but also reducing the feature dimension, thereby reducing the complexity in classification operation. The method can also be used for speaker recognition in the later stage, and the robustness of the speaker recognition system is attempted to be improved.
Drawings
Fig. 1 is a flowchart of a method for automatically identifying the gender of a speaker based on a depth self-coding network according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the following examples in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The invention provides a speaker gender identification method based on an i-vector and a depth self-coding network. Firstly, the training-set voice signals are preprocessed and Mel cepstrum coefficient features are extracted; a UBM universal background model is then trained with a large amount of voice data unrelated to any specific speaker or channel. An i-vector is extracted based on the UBM and the Mel cepstrum coefficient features of the specific speaker, and the extracted i-vectors are used to train the depth self-coding network to realize male/female classification. In the test stage, the same signal preprocessing and i-vector extraction are performed, a trained deep network is used for classification, and the model performance is then evaluated with three criteria: classification accuracy, AUC and MCC. The invention applies the depth self-coding network to speaker gender identification and uses its strong learning ability to characterize the speaker characteristics of different genders, which not only re-extracts the features but also reduces the feature dimension, thereby lowering the complexity of the classification operation. The method can later also be used for speaker recognition, with the aim of improving the robustness of a speaker recognition system.
Referring to fig. 1, the method for automatically identifying the gender of the speaker based on the depth self-coding network provided by the embodiment of the invention comprises the following steps:
step 1: training a UBM generic background model using speech signals unrelated to both registered speakers and channels;
the specific implementation comprises the following substeps:
step 1.1: preprocessing voice signals irrelevant to registered speakers and channels, including pre-emphasis, framing and windowing;
step 1.2: extracting Mel cepstrum coefficient features from the signals preprocessed in step 1.1;
step 1.3: performing global cepstral mean and variance normalization on the Mel cepstrum coefficients obtained in step 1.2;
step 1.4: performing statistical modeling on the Mel cepstrum coefficients obtained in step 1.3 with an N-component Gaussian mixture model, and obtaining, via the EM algorithm, a universal background model UBM with N Gaussian components, comprising the mean supervector, the weight and the covariance matrix of each Gaussian component.
In this embodiment, the UBM universal background model is trained with training-set voice signals that are independent of both the registered speakers and the channels. The number of Gaussian mixture components in the UBM should be chosen according to the actual situation, balancing running speed against accuracy during training. It is also necessary to ensure that the training data are balanced, i.e., in this example, that the male and female proportions of the training set are balanced.
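For illustration only (not part of the patent text), step 1 could be sketched with common Python tooling as follows; librosa, numpy and scikit-learn are assumed, the function names are hypothetical, and the parameter values (12 MFCCs, 64 Gaussian components, 256-sample frames) follow the experimental settings described below.

```python
# Illustrative sketch of step 1: MFCC extraction, CMVN and GMM-UBM training.
# Assumes librosa, numpy and scikit-learn; not the patent's reference implementation.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def extract_mfcc(wav_path, n_mfcc=12, frame_length=256):
    """Pre-emphasis, framing/windowing and MFCC extraction for one utterance."""
    y, sr = librosa.load(wav_path, sr=None)
    y = librosa.effects.preemphasis(y)                        # pre-emphasis
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=frame_length, hop_length=frame_length // 2)
    return mfcc.T                                              # (frames, n_mfcc)

def cmvn(features):
    """Global cepstral mean and variance normalization."""
    return (features - features.mean(axis=0)) / (features.std(axis=0) + 1e-8)

def train_ubm(wav_paths, n_components=64):
    """Pool normalized MFCCs from speaker/channel-independent speech and fit a GMM-UBM via EM."""
    feats = np.vstack([cmvn(extract_mfcc(p)) for p in wav_paths])
    ubm = GaussianMixture(n_components=n_components, covariance_type='diag', max_iter=200)
    ubm.fit(feats)                                             # EM training
    return ubm                                                 # exposes means_, weights_, covariances_
```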
Step 2: extracting an i-vector of the registration data;
the specific implementation comprises the following substeps:
step 2.1: preprocessing the registered voice signal, including pre-emphasis, framing and windowing;
step 2.2: extracting Mel cepstrum coefficient features from the signals preprocessed in step 2.1;
step 2.3: performing global cepstral mean and variance normalization on the Mel cepstrum coefficients obtained in step 2.2;
Step 2.4: and (3) calculating zero-order and first-order full statistics (Baum-Welch statistics) of each voice segment on each GMM mixed component of the UBM by using the characteristics obtained in the step 2.3 and the universal background model UBM obtained in the step 1:
Figure BDA0001646100150000071
Figure BDA0001646100150000072
wherein, the liquid crystal display device comprises a liquid crystal display device,
Figure BDA0001646100150000073
the zero order statistic and the first order statistic of the voice segment k on the c-th GMM mixed component are respectively represented; />
Figure BDA0001646100150000074
Representing the acoustic characteristics of the speech segment k at the time index t; />
Figure BDA0001646100150000075
Representing acoustic features->
Figure BDA0001646100150000076
Posterior probability for the c-th GMM mixture component;
step 2.5: obtaining the total variability subspace T by maximum likelihood estimation, using the full statistics obtained in step 2.4 and the mean supervector of the universal background model UBM obtained in step 1:
M = m + Tw
where M is the GMM mean supervector containing both speaker information and channel information; m is the UBM mean supervector, independent of both speaker and channel; and w is a low-dimensional vector containing only speaker information, i.e., the i-vector;
step 2.6: extracting the i-vector using the full statistics obtained in step 2.4, the mean supervector of the universal background model UBM obtained in step 1, and the total variability subspace T obtained in step 2.5.
In this embodiment, feature extraction is performed on the voice of the registered speakers, who do not overlap with the speakers of the UBM training set; this ensures that the UBM fits the general distribution of human voice features, so that features not covered by the registered voice signals of a specific speaker can be approximated by the similar feature distributions in the UBM.
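For step 2.6, one commonly used closed form for the i-vector (the posterior mean of w given the statistics, the UBM and a trained total variability matrix T) is sketched below; this is the standard formulation from the i-vector literature, offered only as an assumption about how the extraction could be implemented, not as the patent's exact estimator.

```python
# Illustrative sketch: i-vector posterior mean from Baum-Welch statistics,
# the UBM parameters and a total variability matrix T (standard formulation).
import numpy as np

def extract_ivector(N, F, ubm_means, ubm_covs, T):
    """
    N: (C,) zero-order stats; F: (C, D) first-order stats;
    ubm_means: (C, D); ubm_covs: (C, D) diagonal covariances;
    T: (C*D, R) total variability matrix. Returns an R-dimensional i-vector.
    """
    C, D = ubm_means.shape
    R = T.shape[1]
    F_centered = (F - N[:, None] * ubm_means).reshape(C * D)  # centre stats on the UBM means
    sigma_inv = (1.0 / ubm_covs).reshape(C * D)               # diagonal Sigma^-1
    N_rep = np.repeat(N, D)                                   # expand N_c to every feature dimension
    # w = (I + T' Sigma^-1 N T)^-1 T' Sigma^-1 (F - N m)
    precision = np.eye(R) + T.T @ (T * (N_rep * sigma_inv)[:, None])
    w = np.linalg.solve(precision, T.T @ (sigma_inv * F_centered))
    return w
```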
Step 3: extracting the i-vectors of the test data. The specific implementation is as follows: i-vectors are extracted from the test data following the sub-steps of step 2.
In this embodiment, 9 sentences are taken from 10 sentences of voice samples of each registered speaker for training, and 1 sentence is used as a test sentence.
Step 4: training a depth self-coding network;
the specific implementation comprises the following substeps:
step 4.1: performing maximum and minimum normalization on the features obtained in the step 2, and scaling all feature data to a 0-1 interval in equal proportion;
x' = (x - x_min) / (x_max - x_min)
step 4.2: One-hot encoding is performed on the gender labels of all registered speakers. In this encoding, N states are represented with an N-bit independent state register in which exactly one bit is active at a time. For this data set there are only two classes or states, 0 and 1; after one-hot encoding they become two binary, mutually exclusive features, of which exactly one is activated for each sample.
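A minimal sketch of steps 4.1 and 4.2 (min-max normalization and one-hot encoding) is given below, using only numpy; the function names are hypothetical.

```python
# Illustrative sketch of steps 4.1 and 4.2: min-max normalization of the
# i-vectors and one-hot encoding of the gender labels (numpy only).
import numpy as np

def min_max_normalize(X):
    """Scale every feature dimension proportionally into the [0, 1] interval."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (X - x_min) / (x_max - x_min + 1e-8)

def one_hot(labels, num_classes=2):
    """labels: integer gender labels (male = 1, female = 0)."""
    encoded = np.zeros((len(labels), num_classes))
    encoded[np.arange(len(labels)), labels] = 1.0   # exactly one bit active per sample
    return encoded
```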
Step 4.3: constructing a depth self-coding network structure;
The self-coding network is an unsupervised learning algorithm that attempts to approximate the identity function h_{W,b}(x) ≈ x, so that the output is close to the input. It usually comprises two parts, an encoder and a decoder, which can be defined by the two transformations φ and ψ:
φ: X → F
ψ: F → X
(φ, ψ) = argmin_{φ,ψ} ||x - (ψ∘φ)(x)||^2
The encoding process maps the input x ∈ R^m to a hidden representation z = h(x) ∈ R^n, constructed as follows:
z = σ(Wx + b)
where σ is the activation function; in the nonlinear case a sigmoid or tanh function is usually taken. W ∈ R^{n×m} is the encoding weight matrix and b ∈ R^n the encoding bias vector.
The decoding process maps the hidden representation h(x) to the output layer to reconstruct the input x, restoring an x' as close as possible to the input x:
x' = σ'(W'z + b')
where σ' is an activation function with the same meaning as σ, W' ∈ R^{m×n} is the decoding weight matrix and b' ∈ R^m the decoding bias vector.
The reconstruction error is
L(x, x') = ||x - x'||^2 = ||x - σ'(W'(σ(Wx + b)) + b')||^2
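The encoder, decoder and reconstruction error defined above can be sketched as follows; this is a minimal numpy illustration with hypothetical names, not an optimized implementation.

```python
# Illustrative sketch of a single-hidden-layer autoencoder:
# z = sigma(W x + b), x' = sigma'(W' z + b'), L(x, x') = ||x - x'||^2.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class Autoencoder:
    def __init__(self, m, n, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(n, m))        # encoding weights, R^{n x m}
        self.b = np.zeros(n)                               # encoding bias,    R^n
        self.W_dec = rng.normal(scale=0.1, size=(m, n))    # decoding weights, R^{m x n}
        self.b_dec = np.zeros(m)                           # decoding bias,    R^m

    def encode(self, x):
        return sigmoid(self.W @ x + self.b)                # hidden representation z

    def decode(self, z):
        return sigmoid(self.W_dec @ z + self.b_dec)        # reconstruction x'

    def reconstruction_error(self, x):
        x_rec = self.decode(self.encode(x))
        return float(np.sum((x - x_rec) ** 2))             # squared reconstruction error
```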
The depth self-coding network is a self-coding network comprising multiple hidden layers and symmetric about the middle layer, with one input layer, 2r-1 hidden layers and one output layer. Let the input layer contain m neurons, x = (x_1, x_2, ..., x_m)^T ∈ R^m; let the k-th hidden layer contain n_k = n_{2r-k} neurons (k = 1, 2, ..., 2r-1), with corresponding hidden-layer vector h_k = (h_1^k, h_2^k, ..., h_{n_k}^k)^T ∈ R^{n_k}; and let the output layer be x' = (x_1', x_2', ..., x_m')^T ∈ R^m. The neuron activation outputs of the layers of the self-coding network can be expressed as:
h_1 = σ(W_1 x + b_1)
h_k = σ(W_k h_{k-1} + b_k), k = 2, ..., 2r-1
x' = σ'(W_{2r} h_{2r-1} + b_{2r})
where W_1 ∈ R^{n_1×m} is the weight matrix between the input layer and the 1st hidden layer, W_k ∈ R^{n_k×n_{k-1}} is the weight matrix between the (k-1)-th and the k-th hidden layers, W_{2r} ∈ R^{m×n_{2r-1}} is the weight matrix between the (2r-1)-th hidden layer and the output layer, and b_1, b_k, b_{2r} are the corresponding bias vectors.
The training process comprises two stages, unsupervised pre-training and supervised fine-tuning. From the input layer to the middle layer of the encoder, each pair of adjacent layers is regarded as a restricted Boltzmann machine (RBM); the output of each RBM serves as the input of the next one, and all RBMs are trained layer by layer with an unsupervised learning algorithm (such as the CD or PCD algorithm). Starting from the bottom RBM, the weight matrix W_1, visible-layer bias a_1 and hidden-layer bias b_1 are pre-trained; then, layer by layer, the (k-1)-th and k-th hidden layers are regarded as an RBM whose corresponding weight matrix W_k and biases a_k and b_k are pre-trained (1 < k ≤ r); finally, for r < k ≤ 2r, the pre-trained RBMs are stacked in reverse, directly constructing W_k = (W_{2r+1-k})^T and b_k = a_{2r+1-k}, which yields all initialization weights and biases of the self-encoder. In this training scheme, while the parameters of one layer are trained, the parameters of the other layers are kept fixed.
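As an illustrative sketch only, the layer-wise pre-training described above relies on contrastive-divergence updates of individual restricted Boltzmann machines; a single CD-1 update for a Bernoulli RBM is shown below with hypothetical names, as one possible realization of the CD algorithm mentioned in the text.

```python
# Illustrative sketch: one CD-1 update of a Bernoulli RBM, the building
# block of the layer-by-layer unsupervised pre-training described above.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_update(W, a, b, v0, lr=0.01, rng=None):
    """W: (n_hidden, n_visible); a: visible bias; b: hidden bias; v0: (n_visible,) input."""
    rng = rng or np.random.default_rng(0)
    p_h0 = sigmoid(W @ v0 + b)                              # positive phase
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)      # sample hidden units
    p_v1 = sigmoid(W.T @ h0 + a)                            # reconstruct the visible layer
    p_h1 = sigmoid(W @ p_v1 + b)                            # negative phase
    W += lr * (np.outer(p_h0, v0) - np.outer(p_h1, p_v1))   # approximate gradient step
    a += lr * (v0 - p_v1)
    b += lr * (p_h0 - p_h1)
    return W, a, b
```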
After the unsupervised pre-training is completed, all parameters of the network are fine-tuned with a supervised learning algorithm, generally the BP algorithm, or stochastic gradient descent, conjugate gradient descent and the like. The optimized objective function can be the squared reconstruction error
J = (1/N) Σ_{l=1}^{N} ||y_l - ŷ_l||^2
or the cross-entropy function
J = -(1/N) Σ_{l=1}^{N} [ y_l log ŷ_l + (1 - y_l) log(1 - ŷ_l) ]
where (x_l, y_l), 1 ≤ l ≤ N, are the N training samples, y_l is the desired output and ŷ_l is the actual output.
For classification purposes, a single-layer neural network or a multi-layer perceptron is added after the self-coding network: the decoding layers of the self-coding network are discarded, the output of the last encoding layer is taken as the input of the classification neural network, and the gradients of the classification error are back-propagated to the encoding layers.
In this embodiment, the depth self-coding network is a four-layer structure, in which two layers are self-coding layers and the other two are perceptron layers. The encoding layers of the self-coding network compress the original 400-dimensional input features to 40 dimensions, completing the feature re-extraction; the perceptron layers classify on the 40-dimensional features and finally output 2 labels. The self-coding layers use the mean squared error as the loss function and the perceptron layers use the cross entropy; in practice either loss may be chosen. An illustrative sketch of this structure is given below.
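A minimal PyTorch sketch of this four-layer structure follows; PyTorch is assumed, the intermediate layer width (200) is a hypothetical choice not specified in the text, and the pre-training stage is omitted.

```python
# Illustrative PyTorch sketch of the embodiment: two encoding layers that
# compress the 400-dimensional i-vector to 40 dimensions, followed by a
# two-layer perceptron classifier producing 2 gender labels.
import torch
import torch.nn as nn

class GenderNet(nn.Module):
    def __init__(self, in_dim=400, hidden=200, code_dim=40, n_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(                 # self-coding layers (decoder discarded)
            nn.Linear(in_dim, hidden), nn.Sigmoid(),
            nn.Linear(hidden, code_dim), nn.Sigmoid(),
        )
        self.classifier = nn.Sequential(              # perceptron layers on the 40-dim code
            nn.Linear(code_dim, 32), nn.ReLU(),
            nn.Linear(32, n_classes),
        )

    def forward(self, x):
        return self.classifier(self.encoder(x))

model = GenderNet()
criterion = nn.CrossEntropyLoss()                     # cross entropy on the classification side
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
```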
Step 5: pattern matching and recognition; after a trained network is obtained, the test i-vector obtained in the step 3 is used as the input of the depth self-coding network, so that the automatic classification of the gender of the speaker is realized, and the model is evaluated according to three indexes of classification accuracy, AUC and MCC.
AUC (Area Under Curve) is an evaluation index commonly used in classification models, defined as the area under the ROC curve. ROC curves are made based on the true class and predicted probability of the sample, with the x-axis representing false positive rate-FPR (False Positive Rate) and the y-axis representing true positive rate-TPR (True Positive Rate), defined as:
TPR = TP / (TP + FN),  FPR = FP / (FP + TN)
the classification Accuracy (ACC) is defined as:
ACC = (TP + TN) / (TP + TN + FP + FN)
where:
TP (true positives): positive samples predicted as the positive class;
FP (false positives): negative samples predicted as the positive class;
FN (false negatives): positive samples predicted as the negative class;
TN (true negatives): negative samples predicted as the negative class.
The ROC curve remains essentially unchanged when the distribution of positive and negative samples changes; for completely random classification the AUC is close to 0.5, and the closer the AUC is to 1, the better the model's predictions.
The MCC, i.e., the Matthews correlation coefficient, is another evaluation index suitable for binary classification models, defined as:
MCC = (TP × TN - FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
the MCC has a value in the range of [ -1,1], if mcc= -1 indicates a completely opposite prediction, mcc=0 indicates a random prediction, and mcc=1 indicates a perfect prediction, i.e. if MCC is closer to 1, the model prediction is better.
The invention is further described in connection with simulation experiments.
In the experiment, the TIMIT database is used. For the UBM training stage, 200 speakers (108 men and 72 women) are selected; each speaker contributes 10 utterances of 10 s of speech. 12th-order MFCC parameters are extracted with a frame length of 256, and after normalization a universal background model UBM with 64 Gaussian components is trained.
In the i-vector extraction stage, 77 men and 77 women (disjoint from the 200 speakers used to train the UBM) are selected; each speaker again contributes 10 utterances of 10 s. 12th-order MFCC parameters are extracted with a frame length of 256, the zero-order and first-order full statistics are computed against the trained UBM, the total variability subspace T of dimension 400 is then estimated, and i-vector extraction is completed based on the UBM and T, yielding 400-dimensional i-vectors.
In the neural network training stage, 9 of the 10 utterances of each speaker are used as the training set and 1 as the test set, and min-max normalization is applied to all features. The male label is set to 1 and the female label to 0, and the labels are one-hot encoded.
A three-layer network (one self-coding layer and two perceptron classification layers) and a four-layer network (two stacked self-coding layers and two perceptron classification layers) were constructed in this experiment. With 5000 training iterations for the self-coding part and 10000 for the classifier, the three-layer network achieves 96% classification accuracy, an AUC of 0.995 and an MCC of 0.9097. With 7000 and 25000 training iterations respectively, the four-layer network achieves 98% classification accuracy, an AUC of 0.9886 and an MCC of 0.961.
In the above embodiments, the invention may be implemented in whole or in part by software, hardware, firmware or any combination thereof. When implemented in whole or in part in software, it takes the form of a computer program product comprising one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server or data center to another by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD) or a semiconductor medium (e.g., a solid state disk (SSD)), etc.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.

Claims (4)

1. The automatic speaker sex identification method based on the depth self-coding network is characterized by comprising the following steps of:
firstly, preprocessing the training-set voice signals and extracting Mel cepstrum coefficient features, and then training a UBM universal background model with a large amount of voice data unrelated to any specific speaker or channel; extracting an i-vector based on the UBM universal background model and the voice signal of the specific speaker; training a self-encoder with the extracted i-vector as the input of the depth self-coding network, further refining the features, and finally realizing classification of the different speakers by a classifier;
in the test stage, the test voice signals undergo the same preprocessing and i-vector extraction as in the training stage, the trained depth self-coding network is used for feature extraction and classification, and the model is then evaluated with the classification accuracy, AUC and MCC;
the speaker sex automatic identification method based on the depth self-coding network specifically comprises the following steps:
step one: training a UBM generic background model using speech signals unrelated to both registered speakers and channels;
step two: extracting an i-vector of the registration data;
step three: extracting an i-vector of test data;
step four: training a depth self-coding network;
step five: pattern matching and recognition, and model evaluation;
step four, specifically comprising:
a): performing maximum-minimum normalization on the features obtained in step two, scaling all feature data proportionally into the interval [0, 1]:
x' = (x - x_min) / (x_max - x_min)
b): performing one-hot encoding on the gender labels of all registered speakers; the encoding method is: when N states are encoded, an N-bit independent state register is used in which only one bit is active at a time; this data set has only the two classes or states 0 and 1, which after one-hot encoding become 2 binary mutually exclusive features, of which only one is activated for each sample;
c): constructing the depth self-coding network structure; the depth self-coding network comprises an input layer, 2r-1 hidden layers and an output layer; the input layer contains m neurons, x = (x_1, x_2, ..., x_m)^T ∈ R^m; the k-th hidden layer contains n_k = n_{2r-k} neurons (k = 1, 2, ..., 2r-1), with hidden-layer vector h_k = (h_1^k, h_2^k, ..., h_{n_k}^k)^T ∈ R^{n_k}; the output layer is x' = (x_1', x_2', ..., x_m')^T ∈ R^m; the neuron activation output of each layer of the self-coding network is expressed as:
h_1 = σ(W_1 x + b_1)
h_k = σ(W_k h_{k-1} + b_k), k = 2, ..., 2r-1
x' = σ'(W_{2r} h_{2r-1} + b_{2r})
where W_1 ∈ R^{n_1×m} is the weight matrix between the input layer and the 1st hidden layer, W_k ∈ R^{n_k×n_{k-1}} is the weight matrix between the (k-1)-th and the k-th hidden layers, W_{2r} ∈ R^{m×n_{2r-1}} is the weight matrix between the (2r-1)-th hidden layer and the output layer, and b_1, b_k, b_{2r} are the corresponding bias vectors.
2. The method for automatically identifying the gender of a speaker based on a depth self-encoding network as claimed in claim 1, wherein the step one specifically comprises:
1): preprocessing voice signals irrelevant to registered speakers and channels, including pre-emphasis, framing and windowing;
2): extracting Mel cepstrum coefficient characteristics from the signals pretreated in the step 1);
3): carrying out global cepstrum mean and variance normalization on the Mel cepstrum features obtained in the step 2);
4): and (3) carrying out statistical modeling on the Mel cepstrum features obtained in the step (3) by using N mixed Gaussian models, and obtaining a universal background model UBM with N Gaussian components by using an EM algorithm, wherein the universal background model UBM comprises a mean value supervector, a weight and a Gaussian component covariance matrix of each Gaussian component.
3. The method for automatically identifying the gender of a speaker based on a depth self-encoding network as claimed in claim 1, wherein the step two specifically comprises:
a): preprocessing the registered voice signals, including pre-emphasis, framing and windowing;
b): extracting Mel cepstrum coefficient features from the signals preprocessed in step a);
c): performing global cepstral mean and variance normalization on the Mel cepstrum features obtained in step b);
d): calculating the zero-order and first-order full statistics of each voice segment on each GMM mixture component of the UBM, using the features obtained in step c) and the universal background model UBM obtained in step one:
N_c^(k) = Σ_t P(c | x_t^(k))
F_c^(k) = Σ_t P(c | x_t^(k)) x_t^(k)
where N_c^(k) and F_c^(k) respectively denote the zero-order and first-order statistics of voice segment k on the c-th GMM mixture component; x_t^(k) denotes the acoustic feature of voice segment k at time index t; and P(c | x_t^(k)) denotes the posterior probability of the c-th GMM mixture component given x_t^(k);
e): obtaining the total variability subspace T by maximum likelihood estimation, using the full statistics obtained in step d) and the mean supervector of the universal background model UBM obtained in step one:
M = m + Tw
where M is the GMM mean supervector containing both speaker information and channel information; m is the UBM mean supervector, independent of both speaker and channel; and w is a low-dimensional vector containing only speaker information, the i-vector;
f): extracting the i-vector using the full statistics obtained in step d), the mean supervector of the universal background model UBM obtained in step one, and the total variability subspace T obtained in step e).
4. A depth self-coding network-based speaker sex automatic recognition control system implementing the depth self-coding network-based speaker sex automatic identification method of claim 1.
CN201810402685.0A 2018-04-28 2018-04-28 Depth self-coding network-based speaker sex automatic identification method and system Active CN109545227B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810402685.0A CN109545227B (en) 2018-04-28 2018-04-28 Depth self-coding network-based speaker sex automatic identification method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810402685.0A CN109545227B (en) 2018-04-28 2018-04-28 Depth self-coding network-based speaker sex automatic identification method and system

Publications (2)

Publication Number Publication Date
CN109545227A CN109545227A (en) 2019-03-29
CN109545227B true CN109545227B (en) 2023-05-09

Family

ID=65830729

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810402685.0A Active CN109545227B (en) 2018-04-28 2018-04-28 Depth self-coding network-based speaker sex automatic identification method and system

Country Status (1)

Country Link
CN (1) CN109545227B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109920435B (en) * 2019-04-09 2021-04-06 厦门快商通信息咨询有限公司 Voiceprint recognition method and voiceprint recognition device
CN110187321B (en) * 2019-05-30 2022-07-22 电子科技大学 Radar radiation source characteristic parameter extraction method based on deep learning in complex environment
CN110136726A (en) * 2019-06-20 2019-08-16 厦门市美亚柏科信息股份有限公司 A kind of estimation method, device, system and the storage medium of voice gender
CN110427978B (en) * 2019-07-10 2022-01-11 清华大学 Variational self-encoder network model and device for small sample learning
CN112331181A (en) * 2019-07-30 2021-02-05 中国科学院声学研究所 Target speaker voice extraction method based on multi-speaker condition
CN110473557B (en) * 2019-08-22 2021-05-28 浙江树人学院(浙江树人大学) Speech signal coding and decoding method based on depth self-encoder
CN111161744B (en) * 2019-12-06 2023-04-28 华南理工大学 Speaker clustering method for simultaneously optimizing deep characterization learning and speaker identification estimation
CN111161713A (en) * 2019-12-20 2020-05-15 北京皮尔布莱尼软件有限公司 Voice gender identification method and device and computing equipment
CN111798875A (en) * 2020-07-21 2020-10-20 杭州芯声智能科技有限公司 VAD implementation method based on three-value quantization compression

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2579332A1 (en) * 2006-02-20 2007-08-20 Diaphonics, Inc. Method and system for detecting speaker change in a voice transaction
CN104732978A (en) * 2015-03-12 2015-06-24 上海交通大学 Text-dependent speaker recognition method based on joint deep learning
WO2017113680A1 (en) * 2015-12-30 2017-07-06 百度在线网络技术(北京)有限公司 Method and device for voiceprint authentication processing
CN107146615A (en) * 2017-05-16 2017-09-08 南京理工大学 Audio recognition method and system based on the secondary identification of Matching Model

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003036097A (en) * 2001-07-25 2003-02-07 Sony Corp Device and method for detecting and retrieving information
CN101833951B (en) * 2010-03-04 2011-11-09 清华大学 Multi-background modeling method for speaker recognition
US9502038B2 (en) * 2013-01-28 2016-11-22 Tencent Technology (Shenzhen) Company Limited Method and device for voiceprint recognition
CN107112025A (en) * 2014-09-12 2017-08-29 美商楼氏电子有限公司 System and method for recovering speech components
KR101843074B1 (en) * 2016-10-07 2018-03-28 서울대학교산학협력단 Speaker recognition feature extraction method and system using variational auto encoder
CN107784215B (en) * 2017-10-13 2018-10-26 上海交通大学 Audio unit based on intelligent terminal carries out the user authen method and system of labiomaney


Also Published As

Publication number Publication date
CN109545227A (en) 2019-03-29

Similar Documents

Publication Publication Date Title
CN109545227B (en) Depth self-coding network-based speaker sex automatic identification method and system
Mannepalli et al. A novel adaptive fractional deep belief networks for speaker emotion recognition
Liu et al. Deep feature for text-dependent speaker verification
CN110164452A (en) A kind of method of Application on Voiceprint Recognition, the method for model training and server
Mingote et al. Optimization of the area under the ROC curve using neural network supervectors for text-dependent speaker verification
US20210005183A1 (en) Orthogonally constrained multi-head attention for speech tasks
Bhardwaj et al. GFM-based methods for speaker identification
CN109637526A (en) The adaptive approach of DNN acoustic model based on personal identification feature
CN110246509B (en) Stack type denoising self-encoder and deep neural network structure for voice lie detection
CN113450806B (en) Training method of voice detection model, and related method, device and equipment
Ivanko et al. An experimental analysis of different approaches to audio–visual speech recognition and lip-reading
Azam et al. Speaker verification using adapted bounded Gaussian mixture model
Karthikeyan Adaptive boosted random forest-support vector machine based classification scheme for speaker identification
CN111462762B (en) Speaker vector regularization method and device, electronic equipment and storage medium
Jakubec et al. Deep speaker embeddings for Speaker Verification: Review and experimental comparison
CN112863521A (en) Speaker identification method based on mutual information estimation
CN111243621A (en) Construction method of GRU-SVM deep learning model for synthetic speech detection
Anand et al. Text-independent speaker recognition for Ambient Intelligence applications by using information set features
CN115101077A (en) Voiceprint detection model training method and voiceprint recognition method
CN112951270B (en) Voice fluency detection method and device and electronic equipment
Dennis et al. Generalized Hough transform for speech pattern classification
Moonasar et al. A committee of neural networks for automatic speaker recognition (ASR) systems
Bhardwaj et al. Identification of speech signal in moving objects using artificial neural network system
Sapijaszko Increasing accuracy performance through optimal feature extraction algorithms
US20240105206A1 (en) Seamless customization of machine learning models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant