CN106898355A - A speaker recognition method based on two-pass modeling - Google Patents
A speaker recognition method based on two-pass modeling
- Publication number
- CN106898355A CN106898355A CN201710031899.7A CN201710031899A CN106898355A CN 106898355 A CN106898355 A CN 106898355A CN 201710031899 A CN201710031899 A CN 201710031899A CN 106898355 A CN106898355 A CN 106898355A
- Authority
- CN
- China
- Prior art keywords
- speech data
- training
- dnn model
- identified
- speaker
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 29
- 238000012549 training Methods 0.000 claims abstract description 135
- 239000000284 extract Substances 0.000 claims abstract description 16
- 239000006185 dispersion Substances 0.000 claims description 9
- 230000001149 cognitive effect Effects 0.000 claims description 2
- 238000010801 machine learning Methods 0.000 abstract description 2
- 238000003909 pattern recognition Methods 0.000 abstract description 2
- 238000005516 engineering process Methods 0.000 description 5
- 239000000203 mixture Substances 0.000 description 5
- 238000004891 communication Methods 0.000 description 2
- 238000009432 framing Methods 0.000 description 2
- 238000013528 artificial neural network Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000000605 extraction Methods 0.000 description 1
- 230000006870 function Effects 0.000 description 1
- 239000011159 matrix material Substances 0.000 description 1
- 230000008447 perception Effects 0.000 description 1
- 238000004088 simulation Methods 0.000 description 1
- 238000001228 spectrum Methods 0.000 description 1
- 230000007704 transition Effects 0.000 description 1
- 238000010200 validation analysis Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/18—Artificial neural networks; Connectionist approaches
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/06—Decision making techniques; Pattern matching strategies
- G10L17/14—Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
Landscapes
- Engineering & Computer Science (AREA)
- Acoustics & Sound (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Business, Economics & Management (AREA)
- Game Theory and Decision Science (AREA)
- Image Analysis (AREA)
Abstract
The present invention proposes a speaker recognition method based on two-pass modeling, belonging to the fields of voiceprint recognition, pattern recognition and machine learning. In the model training stage, the method obtains and pre-processes the training speech data of the speakers to be identified; trains a first DNN model on the training speech data; uses the first DNN model to recognize the training speech data and extract the easily confused speech data; and trains a second DNN model on the easily confused speech data. In the speaker recognition stage, speech data to be identified is obtained, pre-processed, and recognized with the first DNN model; if the recognition probability exceeds a set threshold, the speaker recognition result is output directly; otherwise the speech data is recognized a second time with the second DNN model to obtain the result. By building two DNN models, the invention accounts for both the coarse-grained and the fine-grained differences between speakers and effectively improves speaker recognition accuracy.
Description
Technical field
The invention belongs to the technical fields of voiceprint recognition, pattern recognition and machine learning, and in particular relates to a speaker recognition method based on two-pass modeling.
Background technology
Speaker recognition refers to identifying a speaker's identity from the speaker-related information contained in speech. With the rapid development of information and communication technology, speaker recognition has attracted growing attention and found wide application in many areas, such as identity verification, locating criminals over telephone channels, confirming identity from telephone recordings in court, tracking call audio, and controlling anti-theft door access. In Internet applications and the communications field, speaker recognition can be applied to voice dialing, telephone banking, telephone shopping, database access, information services, voice e-mail, security control, remote computer login, and similar areas.
Speaker recognition first pre-processes the speech data and extracts features. The most commonly used feature is the mel cepstrum, which is based on the theory of human auditory perception and is now widely applied to speaker recognition, language identification, continuous speech recognition, and other tasks. Mel cepstrum extraction first applies pre-emphasis, framing and windowing to the speech data; it then applies a fast Fourier transform (FFT) to each windowed frame to obtain its spectrum, filters the spectrum with a bank of mel-scale triangular window filters, and finally applies a discrete cosine transform to obtain the mel cepstrum features.
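As an illustration, the extraction pipeline just described can be sketched in Python with NumPy. The frame length, hop, filter count and FFT size below are common choices and are not values specified by the patent; the function name is likewise illustrative.

```python
import numpy as np

def mel_cepstrum(signal, sr=8000, frame_len=200, hop=80, n_filt=26, n_ceps=20, n_fft=256):
    """Sketch of the pipeline above: pre-emphasis -> framing + windowing
    -> FFT -> mel triangular filterbank -> log -> DCT."""
    # 1. Pre-emphasis: boost high frequencies.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2. Framing with overlap, then a Hamming window per frame.
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop:i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # 3. Magnitude spectrum via FFT.
    spec = np.abs(np.fft.rfft(frames, n_fft))
    # 4. Mel-scale triangular filterbank.
    def hz2mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel2hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz2mel(0.0), hz2mel(sr / 2.0), n_filt + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filt, n_fft // 2 + 1))
    for j in range(1, n_filt + 1):
        lo, c, hi = bins[j - 1], bins[j], bins[j + 1]
        for k in range(lo, c):
            fbank[j - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fbank[j - 1, k] = (hi - k) / max(hi - c, 1)
    log_energy = np.log(spec @ fbank.T + 1e-10)
    # 5. DCT-II of the log filterbank energies gives the mel cepstrum.
    idx = np.arange(n_filt)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * idx + 1) / (2.0 * n_filt)))
    return log_energy @ basis.T   # shape: (frames, n_ceps)
```

Applied to one second of 8 kHz speech, this yields one 20-dimensional cepstral vector per 10 ms frame.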
In recent years, speaker recognition models based on deep neural networks (DNNs) have received increasing attention. Compared with the traditional Gaussian mixture model (GMM), a DNN has stronger descriptive power and can better model extremely complex data distributions, so DNN-based systems obtain significant performance gains. A DNN model comprises three levels: an input layer, hidden layers, and an output layer. The input layer corresponds to the features of the speech data, its number of nodes determined by the feature dimension; the output layer corresponds to the probability of each speaker, its number of nodes determined by the total number of speakers to be recognized; the number and sizes of the hidden layers are set according to the application and engineering experience. DNN training performs unsupervised training first and supervised training afterwards. During unsupervised training, each pair of adjacent layers is treated as a restricted Boltzmann machine and trained layer by layer with the contrastive divergence (CD) algorithm. During supervised training, the DNN parameters obtained from unsupervised training serve as initial values, which are then fine-tuned with the back-propagation algorithm. To date, DNN-based speaker recognition methods have used only a single DNN model, but a single DNN can hardly model both the coarse-grained and the fine-grained differences between speakers at the same time. As a result, when a single DNN is used for speaker recognition, some utterances are easy to discriminate while others are easily confused.
Content of the invention
The purpose of the present invention is to overcome the shortcomings of the prior art by proposing a speaker recognition method based on two-pass modeling. By building two DNN models, the invention accounts for both the coarse-grained and the fine-grained differences between speakers and can effectively improve speaker recognition accuracy.
A speaker recognition method based on two-pass modeling is divided into two stages: a model training stage and a speaker recognition stage. In the model training stage, the training speech data of all speakers to be identified is obtained and pre-processed; a first DNN model is trained on the training speech data; the first DNN model is then used to recognize the training speech data and extract the easily confused speech data; and a second DNN model is trained on the easily confused speech data. In the speaker recognition stage, speech data to be identified is obtained, pre-processed, and recognized with the first DNN model; if the recognition probability exceeds the decision threshold, the speaker corresponding to the maximum output probability in the recognition result is taken as the speaker of the speech data; otherwise the speech data is recognized a second time with the second DNN model, and the speaker corresponding to its maximum output probability is taken as the speaker. The method comprises the following steps:
1) Model training stage, comprising the following steps:
1-1) Obtain the training speech data of all speakers to be identified; the speaker of each training utterance is known. Pre-process the obtained training speech data, extract the mel cepstrum features of all training speech data, and compute the first- and second-order derivatives of the mel cepstrum features, for 60 dimensions in total;
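The 60-dimensional feature of step 1-1), the mel cepstrum plus its first- and second-order time derivatives, can be sketched as follows. The patent does not specify the derivative estimator, so a simple numerical gradient stands in here; the function name is illustrative.

```python
import numpy as np

def add_deltas(ceps):
    """Append first- and second-order time derivatives to a (frames x 20)
    mel-cepstrum matrix, giving the 60-dimensional feature used by both
    DNN models."""
    delta = np.gradient(ceps, axis=0)       # first-order derivative over frames
    delta2 = np.gradient(delta, axis=0)     # second-order derivative
    return np.concatenate([ceps, delta, delta2], axis=1)  # (frames x 60)
```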
1-2) Build and train the first DNN model, comprising the following steps:
1-2-1) Set the number of layers and nodes of the first DNN model.
The first DNN model is divided into three levels: input layer, hidden layers, and output layer. The input layer corresponds to the training-speech features: since the feature obtained in step 1-1), the mel cepstrum with its first- and second-order derivatives, has 60 dimensions in total, the input layer is set to 60 nodes. The output layer corresponds to the probability of each speaker; its number of nodes equals the number of all speakers to be identified, each node's output being the probability of one speaker. The hidden layers automatically extract features at different levels; the number of nodes in each hidden layer is the dimension of the features that layer extracts;
1-2-2) Train the first DNN model to obtain its parameters.
Train the first DNN model on the mel cepstrum features and their first- and second-order derivatives of the training speech data of all speakers to be identified. The model parameters comprise the connection weights between adjacent layers and the bias of each node. Unsupervised training is performed first, followed by supervised training. During unsupervised training, each pair of adjacent layers in the first DNN model is treated as a restricted Boltzmann machine, and all restricted Boltzmann machines are trained layer by layer with the contrastive divergence algorithm to obtain initial values of the first DNN model's parameters. During supervised training, starting from the initial values obtained by unsupervised training, the parameters are fine-tuned with the back-propagation algorithm to obtain the final parameters of the first DNN model;
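The unsupervised pre-training step can be illustrated with a single contrastive-divergence (CD-1) update for one restricted Boltzmann machine. The Bernoulli units, learning rate and function name below are illustrative assumptions; the patent only names the CD algorithm.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v_data, W, b_vis, b_hid, lr=0.01, rng=None):
    """One CD-1 parameter update for a Bernoulli RBM, the building block
    of the layer-wise unsupervised pre-training described above."""
    rng = rng or np.random.default_rng(0)
    # Positive phase: hidden probabilities driven by the data.
    h_prob = sigmoid(v_data @ W + b_hid)
    h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
    # Negative phase: one Gibbs step back to a reconstruction.
    v_recon = sigmoid(h_sample @ W.T + b_vis)
    h_recon = sigmoid(v_recon @ W + b_hid)
    # CD-1 gradient approximation: <v h>_data - <v h>_recon.
    n = v_data.shape[0]
    W = W + lr * (v_data.T @ h_prob - v_recon.T @ h_recon) / n
    b_vis = b_vis + lr * (v_data - v_recon).mean(axis=0)
    b_hid = b_hid + lr * (h_prob - h_recon).mean(axis=0)
    return W, b_vis, b_hid
```

In a full run, many such updates are applied to each layer pair before its weights are frozen as initial values for back-propagation.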
1-3) Extract the easily confused speech data.
Use the first DNN model trained in step 1-2) to recognize the training speech data of all speakers to be identified, and set a threshold. If, in the recognition result for a training utterance, the probability assigned to that utterance's true speaker is below the threshold, the utterance is poorly discriminated; it is extracted as easily confused speech data and used to train the second DNN model. If the probability is greater than or equal to the threshold, the utterance is easy to distinguish and is not treated as easily confused speech data;
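Step 1-3) amounts to a simple filter over the first model's outputs. A sketch, assuming the recognition probabilities and the known true-speaker labels are available as arrays; the function name and its default threshold (0.85, from the embodiment) are illustrative:

```python
import numpy as np

def split_confusable(probs, labels, threshold=0.85):
    """probs:  (utterances x speakers) first-DNN output probabilities.
    labels:   true speaker index of each utterance (known for training data).
    Returns indices of easily confused utterances (true-speaker probability
    below the threshold) and of easily distinguished ones."""
    true_probs = probs[np.arange(len(labels)), labels]
    confusable = np.where(true_probs < threshold)[0]   # used to train the second DNN
    clear = np.where(true_probs >= threshold)[0]       # easily distinguished
    return confusable, clear
```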
1-4) Build and train the second DNN model, comprising the following steps:
1-4-1) Set the number of layers and nodes of the second DNN model.
The second DNN model is likewise divided into three levels: input layer, hidden layers, and output layer. The input layer corresponds to the training-speech features: the mel cepstrum with its first- and second-order derivatives of the easily confused speech data extracted in step 1-3), 60 dimensions in total, so the input layer is set to 60 nodes. The output layer corresponds to the probability of each speaker; its number of nodes is the number of speakers represented in the easily confused speech data, each node's output being the probability of one speaker. The hidden layers automatically extract features at different levels; the number of nodes in each hidden layer is the dimension of the features that layer extracts;
1-4-2) Train the second DNN model to obtain its parameters.
Train the second DNN model on the mel cepstrum features and their first- and second-order derivatives of the easily confused speech data obtained in step 1-3). The model parameters comprise the connection weights between adjacent layers and the bias of each node. Unsupervised training is performed first, followed by supervised training: during unsupervised training, each pair of adjacent layers of the second DNN model is treated as a restricted Boltzmann machine, and all restricted Boltzmann machines are trained layer by layer with the contrastive divergence algorithm to obtain initial values of the second DNN model's parameters; during supervised training, starting from these initial values, the parameters are fine-tuned with the back-propagation algorithm to obtain the final parameters of the second DNN model;
2) Speaker recognition stage, comprising the following steps:
2-1) Obtain a piece of speech data to be identified from one of the speakers to be recognized, pre-process it, and extract its mel cepstrum features with their first- and second-order derivatives, 60 dimensions in total;
2-2) Input the 60-dimensional features obtained in step 2-1) into the first DNN model obtained in step 1-2) for recognition; the output layer gives the recognition result for the speech data, i.e. the probability that it belongs to each speaker in the training speech data, each output node corresponding to one speaker's probability;
2-3) Set a decision threshold and check whether any probability in the recognition result of step 2-2) exceeds it. If so, the speaker corresponding to the maximum output probability of the first DNN model is taken as the speaker of the speech data, and recognition ends; if not, go to step 2-4);
2-4) If no probability in the recognition result of step 2-3) exceeds the decision threshold, recognize the speech data a second time with the second DNN model; the speaker corresponding to the maximum output probability of the second DNN model is taken as the speaker of the speech data, and recognition ends.
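The two-pass decision of steps 2-2) to 2-4) can be sketched as follows. Here dnn1 and dnn2 are hypothetical stand-ins for the trained models, taken as callables mapping a 60-dimensional feature to a probability vector over speakers; the function name and default threshold are illustrative.

```python
import numpy as np

def two_pass_identify(feature, dnn1, dnn2, threshold=0.85):
    """Two-pass recognition: accept the first DNN's answer if it is
    confident enough, otherwise fall back to the second DNN."""
    p1 = dnn1(feature)
    if p1.max() > threshold:       # first pass clears the decision threshold
        return int(np.argmax(p1))
    p2 = dnn2(feature)             # second, fine-grained pass
    return int(np.argmax(p2))
```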
Features and beneficial effects of the present invention:
Compared with the prior art, the first DNN model of the invention models the coarse-grained differences between speakers, while the second DNN model models the fine-grained differences. The method increases the discriminability of easily confused speech data across speakers and has good system stability; by accounting for both coarse-grained and fine-grained differences, it can improve the accuracy of speaker recognition.
Brief description of the drawings
Fig. 1 is the flow chart of the method of the invention.
Fig. 2 shows the structure of the first DNN model in the embodiment of the invention.
Fig. 3 shows the structure of the second DNN model in the embodiment of the invention.
Specific embodiment
The speaker recognition method based on two-pass modeling proposed by the present invention is further described below with reference to the drawings and a specific embodiment.
The method is divided into a model training stage and a speaker recognition stage. In the model training stage, the training speech data of all speakers to be identified is obtained and pre-processed; a first DNN model is trained on the training speech data; the first DNN model is used to recognize the training speech data and extract the easily confused speech data; and a second DNN model is trained on the easily confused speech data. In the speaker recognition stage, speech data to be identified is obtained and pre-processed, then recognized with the first DNN model; if the recognition probability exceeds the decision threshold, the speaker corresponding to the maximum output probability in the recognition result is taken as the speaker of the speech data; otherwise a second recognition is performed with the second DNN model, and the speaker corresponding to its maximum output probability is taken as the speaker. The flow chart of the method is shown in Fig. 1; it comprises the following steps:
1) Model training stage, comprising the following steps:
1-1) Obtain the training speech data of all speakers to be identified; the speaker of each training utterance is known, and the data may be obtained by live recording or from telephone recordings. Pre-process the obtained training speech data, extract the mel cepstrum features of all training speech data, and compute their first- and second-order derivatives, 60 dimensions in total;
1-2) Build and train the first DNN model, comprising the following steps:
1-2-1) Set the number of layers and nodes of the first DNN model.
The first DNN model is divided into input layer, hidden layers, and output layer. The input layer corresponds to the training-speech features: the mel cepstrum with its first- and second-order derivatives obtained in step 1-1), 60 dimensions in total, so the input layer is set to 60 nodes. The output layer corresponds to the probability of each speaker; its number of nodes is the number of all speakers to be identified, each node's output being the probability of one speaker. The hidden layers mainly serve to automatically extract features at different levels; their number and sizes are set according to need and experience, typically 3-5 hidden layers. The number of nodes in each hidden layer is the dimension of the features that layer extracts; the middle hidden layer is typically set to 300-500 nodes, and the other hidden layers to 1000-2000 nodes each;
1-2-2) Train the first DNN model to obtain its parameters.
Train the first DNN model on the mel cepstrum features and their first- and second-order derivatives of the training speech data of all speakers to be identified. The model parameters comprise the connection weights between adjacent layers and the bias of each node. Unsupervised training is performed first, followed by supervised training: during unsupervised training, each pair of adjacent layers of the first DNN model is treated as a restricted Boltzmann machine, and all restricted Boltzmann machines are trained layer by layer with the contrastive divergence (CD) algorithm to obtain initial values of the first DNN model's parameters; during supervised training, starting from these initial values, the parameters are fine-tuned with the back-propagation algorithm to obtain the final parameters of the first DNN model;
1-3) Extract the easily confused speech data.
Use the first DNN model trained in step 1-2) to recognize the training speech data of all speakers to be identified, and set a threshold, empirically in the range 0.7-0.9. If, in the recognition result for a training utterance, the probability assigned to its true speaker is below the threshold, the utterance is poorly discriminated and is extracted as easily confused speech data for training the second DNN model; if the probability is greater than or equal to the threshold, the utterance is easy to distinguish and is not treated as easily confused speech data;
1-4) Build and train the second DNN model, comprising the following steps:
1-4-1) Set the number of layers and nodes of the second DNN model.
The second DNN model is divided into input layer, hidden layers, and output layer. The input layer corresponds to the training-speech features: the mel cepstrum with its first- and second-order derivatives of the easily confused speech data extracted in step 1-3), 60 dimensions in total, so the input layer is set to 60 nodes. The output layer corresponds to the probability of each speaker; its number of nodes is the number of speakers represented in the easily confused speech data, each node's output being the probability of one speaker. The number of hidden layers is typically set to 3-5; the middle hidden layer is typically set to 300-500 nodes, and the other hidden layers to 1000-2000 nodes each;
1-4-2) Train the second DNN model to obtain its parameters.
Train the second DNN model on the mel cepstrum features and their first- and second-order derivatives of the easily confused speech data obtained in step 1-3). The model parameters comprise the connection weights between adjacent layers and the bias of each node. Unsupervised training is performed first, followed by supervised training: during unsupervised training, each pair of adjacent layers of the second DNN model is treated as a restricted Boltzmann machine, and all restricted Boltzmann machines are trained layer by layer with the contrastive divergence (CD) algorithm to obtain initial values of the second DNN model's parameters; during supervised training, starting from these initial values, the parameters are fine-tuned with the back-propagation algorithm to obtain the final parameters of the second DNN model;
2) Speaker recognition stage, comprising the following steps:
2-1) Obtain a piece of speech data to be identified from one of the speakers to be recognized, pre-process it, and extract its mel cepstrum features with their first- and second-order derivatives, 60 dimensions in total;
2-2) Input the 60-dimensional features of the speech data from step 2-1) into the first DNN model obtained in step 1-2) for recognition; the output layer gives the recognition result, i.e. the probability that the speech data belongs to each speaker in the training speech data, each output node corresponding to one speaker's probability;
2-3) Set a decision threshold and check whether any probability in the recognition result of step 2-2) exceeds it. If so, the speaker corresponding to the maximum output probability of the first DNN model is taken as the speaker of the speech data, and recognition ends; if not, go to step 2-4);
2-4) If no probability in the recognition result of step 2-3) exceeds the decision threshold, recognize the speech data a second time with the second DNN model; the speaker corresponding to the maximum output probability of the second DNN model is taken as the speaker of the speech data, and recognition ends.
The method of the invention is further described below with a specific embodiment. It should be noted that the embodiment described below is only one embodiment of the invention, not all embodiments. Based on the embodiments of the invention, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the scope of protection of the invention.
In this embodiment 800 speakers need to be recognized. The steps are as follows:
1) Model training stage, comprising the following steps:
1-1) Obtain the training speech data of the 800 speakers to be recognized; the speaker of each training utterance is known, and the data is obtained from telephone recordings. Pre-process the obtained training speech data (i.e. the telephone recordings), extract the mel cepstrum features of all training speech data, and compute their first- and second-order derivatives, 60 dimensions in total;
1-2) Build and train the first DNN model, comprising the following steps:
1-2-1) Set the number of layers and nodes of the first DNN model.
The structure of the first DNN model is shown in Fig. 2. It has 7 layers in total: layer 1 is the input layer, layers 2-6 are hidden layers (5 in all), and layer 7 is the output layer. The crossing lines represent the connections between nodes: adjacent layers of the first DNN model are fully connected, while nodes within a layer are unconnected. The input layer corresponds to the training-speech features, in this embodiment the mel cepstrum with its first- and second-order derivatives obtained in step 1-1), 60 dimensions in total, so the number of input nodes is N1 = 60. The output layer corresponds to the probability of each speaker; its number of nodes equals the number of speakers to be recognized, N7 = 800, each node's output being the probability of one speaker. The hidden layers mainly serve to automatically extract features at different levels; in this embodiment 5 hidden layers are used (3-5 is typical), the extracted features transitioning gradually from low-level abstractions at layer 2 to high-level abstractions at layer 6. The number of nodes in each hidden layer is the dimension of the features that layer extracts; in this embodiment the middle hidden layer (layer 4) is set to N4 = 400 nodes (300-500 is typical), and the remaining hidden layers N2, N3, N5 and N6 are set to 1024 nodes each (1000-2000 is typical).
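The layer sizes of this embodiment (N1 = 60, hidden layers of 1024, 1024, 400, 1024, 1024 nodes, N7 = 800) can be written down directly. In the sketch below the random initialization merely stands in for the RBM pre-training, and the sigmoid hidden units and softmax output are assumptions, since the patent does not name the activation functions.

```python
import numpy as np

def build_dnn(layer_sizes, rng=None):
    """Fully connected network: weight matrix W[i] of shape N_i x N_{i+1}
    between adjacent layers, plus a bias per non-input node, matching the
    parameterisation in the text."""
    rng = rng or np.random.default_rng(0)
    W = [rng.standard_normal((a, b)) * 0.01
         for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
    b = [np.zeros(n) for n in layer_sizes[1:]]
    return W, b

def forward(x, W, b):
    """Forward pass: sigmoid hidden layers, softmax over the speakers."""
    for Wi, bi in zip(W[:-1], b[:-1]):
        x = 1.0 / (1.0 + np.exp(-(x @ Wi + bi)))
    logits = x @ W[-1] + b[-1]
    e = np.exp(logits - logits.max())
    return e / e.sum()

sizes = [60, 1024, 1024, 400, 1024, 1024, 800]   # N1..N7 of the embodiment
W, b = build_dnn(sizes)
probs = forward(np.zeros(60), W, b)              # one probability per speaker
```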
1-2-2) Train the first DNN model to obtain its parameters.
Train the first DNN model on the mel cepstrum features and their first- and second-order derivatives of the training speech data of the 800 speakers to be identified. The model parameters comprise the connection weights between adjacent layers and the bias of each node. The weights are W_{i,i+1} (i = 1, ..., 6), where W_{i,i+1} is a matrix of N_i rows and N_{i+1} columns whose entry w^(i)_{m,n} is the weight of the connection between the m-th node of layer i and the n-th node of layer i+1 of the first DNN model. The biases are b^(j)_k, where b^(j)_k is the bias of the k-th node in layer j of the first DNN model.
Unsupervised training is performed first, followed by supervised training. During unsupervised training, each pair of adjacent layers in the first DNN model (layers 1 and 2, layers 2 and 3, ..., layers 6 and 7 in Fig. 2) is treated as a restricted Boltzmann machine, giving 6 restricted Boltzmann machines in all. They are trained one by one with the contrastive divergence (CD) algorithm: first the restricted Boltzmann machine formed by layers 1 and 2 is trained, giving the parameters B1, B2 and W12 of the first DNN model; then the one formed by layers 2 and 3 is trained, giving B3 and W23; and so on until all restricted Boltzmann machines have been trained and the initial values of the first DNN model's parameters are obtained. During supervised training, starting from the initial values obtained by unsupervised training, the parameters are fine-tuned with the back-propagation algorithm to obtain the final parameters of the first DNN model.
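The layer-by-layer training order just described (layers 1 and 2 first, then layers 2 and 3, and so on up the stack) can be sketched as a greedy loop. For brevity each RBM here receives only one simplified CD step, where a real run would iterate many; the function name and learning rate are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pretrain_stack(data, layer_sizes, lr=0.01, rng=None):
    """Greedy layer-wise pre-training: train the RBM for one layer pair on
    the activations passed up from below, then propagate those activations
    through the newly trained weights before moving to the next pair."""
    rng = rng or np.random.default_rng(0)
    weights, biases, v = [], [], data
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = rng.standard_normal((n_in, n_out)) * 0.01
        b = np.zeros(n_out)
        # One simplified CD step for this RBM (stand-in for a full CD run).
        h = sigmoid(v @ W + b)
        v_recon = sigmoid(h @ W.T)
        h_recon = sigmoid(v_recon @ W + b)
        W += lr * (v.T @ h - v_recon.T @ h_recon) / len(v)
        weights.append(W)
        biases.append(b)
        v = sigmoid(v @ W + b)   # activations feed the next RBM
    return weights, biases       # initial values for back-propagation
```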
1-3) Extract the easily confused speech data.
Use the first DNN model trained in step 1-2) to recognize all training speech data of the 800 speakers. The threshold is set to 0.85 (0.7-0.9 is typical). If, in the recognition result for a training utterance, the probability assigned to its true speaker is below the threshold, the utterance is poorly discriminated and is taken as easily confused speech data for training the second DNN model; if the probability is greater than or equal to the threshold, the utterance is easy to distinguish and is not treated as easily confused speech data;
1-4) Build and train the second DNN model, in the following steps:
1-4-1) Set the number of layers and nodes of the second DNN model;
The structure of the second DNN model is shown in Fig. 3. The second DNN model has five layers in total: layer 1 is the input layer, layers 2-4 are hidden layers (3 hidden layers in total), and layer 5 is the output layer. The crossed lines represent the connections between nodes; adjacent layers of the second DNN model are fully connected, and there are no connections between nodes within a layer. The input layer of the second DNN model corresponds to the speech-data features, in this embodiment the mel cepstrum features of the easily mixed speech data extracted in step 1-3) together with their first and second derivatives (60 dimensions in total), so the number of input nodes of the second DNN model is set to N1 = 60. The output layer corresponds to the probability of each speaker; the number of output nodes equals the number of speakers contained in the easily mixed speech data, which in this embodiment is also 800. Because the speaker corresponding to each training utterance is known, and the easily mixed speech data are extracted from the training data, the speaker of every frame of easily mixed speech data is known, so the number of speakers contained in the easily mixed speech data can be counted. The number of hidden layers is set to 3 (typically 3-5); the number of nodes of the middle hidden layer (layer 3) is set to N3 = 300 (typically 300-500), and the numbers of nodes of the remaining hidden layers, N2 and N4, are set to 1024 (typically 1000-2000).
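The layer sizes just set fix the shapes of the second DNN model's parameters: each W_{i,i+1} has N_i rows and N_{i+1} columns, and each layer j carries a bias vector B_j. A sketch of these shapes follows; the small random initialization is purely illustrative, since the patent obtains initial values by RBM pretraining:

```python
import numpy as np

# Layer sizes of the second DNN model as set above: 60-dim input,
# hidden layers of 1024, 300 and 1024 nodes, 800 output nodes.
layer_sizes = [60, 1024, 300, 1024, 800]

rng = np.random.default_rng(0)
weights = [rng.normal(0.0, 0.01, size=(n_in, n_out))     # W_{i,i+1}
           for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes]              # B_1 ... B_5
```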
1-4-2) Train the second DNN model to obtain its parameters;
Using the mel cepstrum features of the easily mixed speech data obtained in step 1-3), together with their first and second derivatives, the second DNN model is trained. The parameters of the second DNN model comprise the connection weights of adjacent layers W_{i,i+1} (i = 1, ..., 4) and the bias of each node B_j (j = 1, ..., 5). Unsupervised training is performed first, followed by supervised training. During unsupervised training, each pair of adjacent layers in the second DNN model (layers 1 and 2, layers 2 and 3, ..., layers 4 and 5 in Fig. 3) is treated as a restricted Boltzmann machine, giving 4 RBMs in total. These are trained one by one with the contrastive divergence (CD) algorithm: first the RBM formed by layers 1 and 2 is trained, yielding B1, B2 and W12 of the second DNN model's parameters; then the RBM formed by layers 2 and 3 is trained, yielding B3 and W23; all RBMs are trained in turn, yielding initial values for the second DNN model's parameters. During supervised training, these initial values obtained by unsupervised training are fine-tuned with the back-propagation algorithm, finally yielding the parameters of the second DNN model.
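The supervised fine-tuning applied to both DNN models can be sketched as one back-propagation step. Sigmoid hidden units, a softmax output layer and the cross-entropy loss are conventional choices assumed here for illustration (the patent does not fix them), and the input-layer bias is omitted for brevity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def backprop_step(x, y_onehot, Ws, bs, lr=0.1):
    """One gradient step on a batch. Ws[i] connects layer i+1 to layer
    i+2; bs[i] is the bias of layer i+2; y_onehot: one-hot speaker labels."""
    # Forward pass, keeping each layer's activations for the backward pass.
    acts = [x]
    for W, b in zip(Ws[:-1], bs[:-1]):
        acts.append(sigmoid(acts[-1] @ W + b))
    out = softmax(acts[-1] @ Ws[-1] + bs[-1])
    # Softmax + cross-entropy gives this simple output-layer error signal.
    delta = (out - y_onehot) / len(x)
    for i in range(len(Ws) - 1, -1, -1):
        grad_W = acts[i].T @ delta
        grad_b = delta.sum(axis=0)
        if i > 0:  # propagate the error through the sigmoid layer below
            delta = (delta @ Ws[i].T) * acts[i] * (1 - acts[i])
        Ws[i] -= lr * grad_W
        bs[i] -= lr * grad_b
    return Ws, bs
```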
The above steps constitute the model training stage; after the two DNN models have been obtained, speaker identification is carried out.
2) The speaker identification stage, comprising the following steps:
2-1) Obtain the speech data to be identified, belonging to one of the 800 speakers to be recognized. The speech data to be identified are also obtained through telephone recording, but their corresponding speaker is unknown and must be identified by the method proposed in the present invention; they are different utterances from the training speech data. The speech data to be identified are pre-processed, and their mel cepstrum features together with the first and second derivatives are extracted, 60 dimensions in total.
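The 60-dimensional feature (static mel cepstrum coefficients plus first and second derivatives) used throughout can be assembled as sketched below; 20 static coefficients and a +/-2 frame regression window are assumptions for illustration, since the patent only fixes the 60-dimension total:

```python
import numpy as np

def deltas(feat, width=2):
    """Regression-style time derivative of a (n_frames, n_coeff) feature
    matrix over a +/-width frame window (edge frames are padded)."""
    n = len(feat)
    padded = np.pad(feat, ((width, width), (0, 0)), mode="edge")
    denom = 2 * sum(k * k for k in range(1, width + 1))
    return sum(k * (padded[width + k:n + width + k]
                    - padded[width - k:n + width - k])
               for k in range(1, width + 1)) / denom

def stack_60dim(mfcc):
    """Concatenate the static coefficients with their first and second
    derivatives, giving the 60-dimensional per-frame feature."""
    d1 = deltas(mfcc)
    d2 = deltas(d1)
    return np.hstack([mfcc, d1, d2])
```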
2-2) The 60-dimensional features of the speech data to be identified obtained in step 2-1) are input to the first DNN model for recognition. The output layer outputs the recognition result of the speech data to be identified, i.e. the probabilities that this speech data corresponds to each of the 800 speakers; each output-layer node outputs the probability of one speaker, 800 outputs in total.
2-3) Set the decision threshold and judge whether any of the 800 probabilities in the recognition result exceeds the threshold 0.85. If so, the speaker corresponding to that probability is judged to be the speaker of this speech data to be identified, and identification ends; otherwise go to step 2-4);
2-4) If no probability in the recognition result of step 2-3) exceeds the decision threshold 0.85, this speech data to be identified is recognized a second time using the second DNN model. According to the recognition result of the second DNN model, the speaker corresponding to the maximum output probability is judged to be the speaker of this speech data to be identified, and identification ends.
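Steps 2-2) through 2-4) form a simple cascade, which can be sketched as follows; the two callables standing in for the trained DNN models are hypothetical:

```python
import numpy as np

def identify(features, first_dnn, second_dnn, threshold=0.85):
    """Return the decided speaker index for one utterance.

    first_dnn/second_dnn: callables mapping features to a per-speaker
    probability vector (stand-ins for the two trained DNN models)."""
    p1 = first_dnn(features)
    if p1.max() > threshold:          # confident first-stage decision
        return int(np.argmax(p1))
    p2 = second_dnn(features)         # fall back to the second model
    return int(np.argmax(p2))
```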
As one of ordinary skill in the art will appreciate, the speaker identification method described above may be implemented by a program, and the program may be stored in a computer-readable storage medium.
What is described above is only a specific embodiment of the present invention; obviously the scope of rights of the present invention cannot be limited thereby, and equivalent variations made according to the claims of the present invention still fall within the scope covered by the present invention.
Claims (2)
1. A speaker identification method based on two-stage modeling, characterized by being divided into a model training stage and a speaker identification stage. In the model training stage, the training speech data of all speakers to be identified are obtained and pre-processed; a first DNN model is trained from the training speech data; using the first DNN model, the training speech data are recognized and the easily mixed speech data are extracted; a second DNN model is trained from the easily mixed speech data. In the speaker identification stage, the speech data to be identified are obtained and pre-processed; the speech data to be identified are recognized using the first DNN model; if the recognition probability exceeds the decision threshold, the speaker corresponding to the maximum output probability in the recognition result is taken as the speaker of the speech data to be identified; otherwise the speech data to be identified are recognized a second time by the second DNN model, and the speaker corresponding to the maximum output probability in that recognition result is taken as the speaker of the speech data to be identified.
2. The method of claim 1, characterized in that the method comprises the following steps:
1) The model training stage, comprising the following steps:
1-1) Obtain the training speech data of all speakers to be identified, the speaker corresponding to each training utterance being known; pre-process the obtained training speech data, extract the mel cepstrum features of all training speech data, and compute the first and second derivatives of the mel cepstrum features, 60 dimensions in total;
1-2) Build and train the first DNN model, in the following steps:
1-2-1) Set the number of layers and nodes of the first DNN model;
The first DNN model is divided into three kinds of layers: input layer, hidden layers and output layer. The input layer corresponds to the features of the training speech data; the number of input nodes equals the dimensionality of the mel cepstrum features of the training speech data obtained in step 1-1) together with their first and second derivatives, 60 dimensions in total, so the number of input nodes is set to 60. The output layer corresponds to the probability of each speaker; the number of output nodes equals the number of speakers to be recognized, and the output of each node corresponds to the probability of one speaker. The hidden layers automatically extract features at different levels, and the number of nodes in each hidden layer represents the dimensionality of the features extracted by that layer;
1-2-2) Train the first DNN model to obtain its parameters;
Using the mel cepstrum features of the training speech data of all speakers to be identified, together with their first and second derivatives, the first DNN model is trained; the model parameters comprise the connection weights of adjacent layers and the bias of each node. Unsupervised training is performed first, followed by supervised training: during unsupervised training, each pair of adjacent layers in the first DNN model is treated as a restricted Boltzmann machine, and all restricted Boltzmann machines are trained in turn with the contrastive divergence algorithm, yielding initial values for the first DNN model's parameters; during supervised training, these initial values obtained by unsupervised training are fine-tuned with the back-propagation algorithm, finally yielding the parameters of the first DNN model;
1-3) Extract the easily mixed speech data;
Using the first DNN model trained in step 1-2), the training speech data of all speakers to be identified are recognized and a threshold is set. If, in the recognition result of a training utterance, the probability of that utterance's true speaker is below the threshold, the recognition result is poorly discriminative, and the utterance is extracted as easily mixed speech data for training the second DNN model; if the probability is greater than or equal to the threshold, the utterance is easily distinguishable and is not taken as easily mixed speech data;
1-4) Build and train the second DNN model, in the following steps:
1-4-1) Set the number of layers and nodes of the second DNN model;
The second DNN model is divided into three kinds of layers: input layer, hidden layers and output layer. The input layer corresponds to the features of the training speech data; the number of input nodes equals the dimensionality of the mel cepstrum features of the easily mixed speech data extracted in step 1-3) together with their first and second derivatives, 60 dimensions in total, so the number of input nodes is set to 60. The output layer corresponds to the probability of each speaker; the number of output nodes equals the number of speakers contained in the easily mixed speech data, and the output of each node corresponds to the probability of one speaker. The hidden layers automatically extract features at different levels, and the number of nodes in each hidden layer represents the dimensionality of the features extracted by that layer;
1-4-2) Train the second DNN model to obtain its parameters;
Using the mel cepstrum features of the easily mixed speech data obtained in step 1-3), together with their first and second derivatives, the second DNN model is trained; the model parameters comprise the connection weights of adjacent layers and the bias of each node. Unsupervised training is performed first, followed by supervised training: during unsupervised training, each pair of adjacent layers in the second DNN model is treated as a restricted Boltzmann machine, and all restricted Boltzmann machines are trained in turn with the contrastive divergence algorithm, yielding initial values for the second DNN model's parameters; during supervised training, these initial values obtained by unsupervised training are fine-tuned with the back-propagation algorithm, finally yielding the parameters of the second DNN model;
2) The speaker identification stage, comprising the following steps:
2-1) Obtain the speech data to be identified, belonging to one of the speakers to be recognized; pre-process the speech data to be identified, and extract their mel cepstrum features together with the first and second derivatives, 60 dimensions in total;
2-2) Input the 60-dimensional features of the speech data to be identified obtained in step 2-1) into the first DNN model obtained in step 1-2) for recognition; the output layer outputs the recognition result of the speech data to be identified, i.e. the probabilities that this speech data corresponds to each speaker in the training speech data, each output-layer node outputting the probability of one speaker;
2-3) Set the decision threshold and judge whether any probability in the recognition result of step 2-2) exceeds the decision threshold. If so, the speaker corresponding to the maximum output probability in the first DNN model's recognition result is the speaker of the speech data to be identified, and identification ends; if not, go to step 2-4);
2-4) If no probability in the recognition result of step 2-3) exceeds the decision threshold, the speech data to be identified are recognized a second time using the second DNN model; the speaker corresponding to the maximum output probability in the second DNN model's recognition result is the speaker of the speech data to be identified, and identification ends.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710031899.7A CN106898355B (en) | 2017-01-17 | 2017-01-17 | Speaker identification method based on secondary modeling |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106898355A true CN106898355A (en) | 2017-06-27 |
CN106898355B CN106898355B (en) | 2020-04-14 |
Family
ID=59198262
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710031899.7A Active CN106898355B (en) | 2017-01-17 | 2017-01-17 | Speaker identification method based on secondary modeling |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106898355B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1264887A (en) * | 2000-03-31 | 2000-08-30 | 清华大学 | Non-particular human speech recognition and prompt method based on special speech recognition chip |
CN1588536A (en) * | 2004-09-29 | 2005-03-02 | 上海交通大学 | State structure regulating method in sound identification |
CN101231848A (en) * | 2007-11-06 | 2008-07-30 | 安徽科大讯飞信息科技股份有限公司 | Method for performing pronunciation error detecting based on holding vector machine |
US20140074471A1 (en) * | 2012-09-10 | 2014-03-13 | Cisco Technology, Inc. | System and method for improving speaker segmentation and recognition accuracy in a media processing environment |
CN105761720A (en) * | 2016-04-19 | 2016-07-13 | 北京地平线机器人技术研发有限公司 | Interaction system based on voice attribute classification, and method thereof |
- 2017-01-17: CN201710031899.7A filed in China (granted as CN106898355B, status Active)
Non-Patent Citations (1)
Title |
---|
李敬阳等: "一种基于GMM-DNN的说话人确认方法", 《计算机应用与软件》 * |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107274890A (en) * | 2017-07-04 | 2017-10-20 | 清华大学 | Vocal print composes extracting method and device |
CN107274883A (en) * | 2017-07-04 | 2017-10-20 | 清华大学 | Voice signal reconstructing method and device |
CN107274883B (en) * | 2017-07-04 | 2020-06-02 | 清华大学 | Voice signal reconstruction method and device |
CN107274890B (en) * | 2017-07-04 | 2020-06-02 | 清华大学 | Voiceprint spectrum extraction method and device |
CN107610709A (en) * | 2017-08-01 | 2018-01-19 | 百度在线网络技术(北京)有限公司 | A kind of method and system for training Application on Voiceprint Recognition model |
CN108305615B (en) * | 2017-10-23 | 2020-06-16 | 腾讯科技(深圳)有限公司 | Object identification method and device, storage medium and terminal thereof |
CN109887511A (en) * | 2019-04-24 | 2019-06-14 | 武汉水象电子科技有限公司 | A kind of voice wake-up optimization method based on cascade DNN |
CN111883175A (en) * | 2020-06-09 | 2020-11-03 | 河北悦舒诚信息科技有限公司 | Voiceprint library-based oil station service quality improving method |
CN111883175B (en) * | 2020-06-09 | 2022-06-07 | 河北悦舒诚信息科技有限公司 | Voiceprint library-based oil station service quality improving method |
CN111724766A (en) * | 2020-06-29 | 2020-09-29 | 合肥讯飞数码科技有限公司 | Language identification method, related equipment and readable storage medium |
CN111724766B (en) * | 2020-06-29 | 2024-01-05 | 合肥讯飞数码科技有限公司 | Language identification method, related equipment and readable storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN106898355B (en) | 2020-04-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106898355A (en) | A kind of method for distinguishing speek person based on two modelings | |
CN112509564B (en) | End-to-end voice recognition method based on connection time sequence classification and self-attention mechanism | |
CN104732978B (en) | The relevant method for distinguishing speek person of text based on combined depth study | |
CN109036465B (en) | Speech emotion recognition method | |
CN107146601A (en) | A kind of rear end i vector Enhancement Methods for Speaker Recognition System | |
CN102509547B (en) | Method and system for voiceprint recognition based on vector quantization based | |
CN109119072A (en) | Civil aviaton's land sky call acoustic model construction method based on DNN-HMM | |
CN109389992A (en) | A kind of speech-emotion recognition method based on amplitude and phase information | |
CN108597496A (en) | Voice generation method and device based on generation type countermeasure network | |
CN108806667A (en) | The method for synchronously recognizing of voice and mood based on neural network | |
CN108564940A (en) | Audio recognition method, server and computer readable storage medium | |
CN107886957A (en) | Voice wake-up method and device combined with voiceprint recognition | |
CN108648759A (en) | A kind of method for recognizing sound-groove that text is unrelated | |
CN106952643A (en) | A kind of sound pick-up outfit clustering method based on Gaussian mean super vector and spectral clustering | |
CN106448684A (en) | Deep-belief-network-characteristic-vector-based channel-robust voiceprint recognition system | |
CN110310647A (en) | A kind of speech identity feature extractor, classifier training method and relevant device | |
CN102324232A (en) | Method for recognizing sound-groove and system based on gauss hybrid models | |
CN107731233A (en) | A kind of method for recognizing sound-groove based on RNN | |
CN104157290A (en) | Speaker recognition method based on depth learning | |
CN107767861A (en) | voice awakening method, system and intelligent terminal | |
CN103971690A (en) | Voiceprint recognition method and device | |
CN107068167A (en) | Merge speaker's cold symptoms recognition methods of a variety of end-to-end neural network structures | |
CN106297773A (en) | A kind of neutral net acoustic training model method | |
CN108172218A (en) | A kind of pronunciation modeling method and device | |
CN108986798B (en) | Processing method, device and the equipment of voice data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
TA01 | Transfer of patent application right |
Effective date of registration: 20181128 Address after: 100085 Beijing Haidian District Shangdi Information Industry Base Pioneer Road 1 B Block 2 Floor 2030 Applicant after: Beijing Huacong Zhijia Technology Co., Ltd. Address before: 100084 Tsinghua Yuan, Haidian District, Beijing, No. 1 Applicant before: Tsinghua University |
|
TA01 | Transfer of patent application right | ||
GR01 | Patent grant | ||
GR01 | Patent grant |