CN106898355A - A speaker recognition method based on two-pass modeling - Google Patents
A speaker recognition method based on two-pass modeling
- Publication number
- CN106898355A CN106898355A CN201710031899.7A CN201710031899A CN106898355A CN 106898355 A CN106898355 A CN 106898355A CN 201710031899 A CN201710031899 A CN 201710031899A CN 106898355 A CN106898355 A CN 106898355A
- Authority
- CN
- China
- Prior art keywords
- speech data
- training
- dnn model
- identified
- speaker
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 29
- 238000012549 training Methods 0.000 claims abstract description 135
- 239000000284 extract Substances 0.000 claims abstract description 16
- 239000006185 dispersion Substances 0.000 claims description 9
- 230000001149 cognitive effect Effects 0.000 claims description 2
- 238000010801 machine learning Methods 0.000 abstract description 2
- 238000003909 pattern recognition Methods 0.000 abstract description 2
- 238000005516 engineering process Methods 0.000 description 5
- 239000000203 mixture Substances 0.000 description 5
- 238000004891 communication Methods 0.000 description 2
- 238000009432 framing Methods 0.000 description 2
- 238000013528 artificial neural network Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000000605 extraction Methods 0.000 description 1
- 230000006870 function Effects 0.000 description 1
- 239000011159 matrix material Substances 0.000 description 1
- 230000008447 perception Effects 0.000 description 1
- 238000004088 simulation Methods 0.000 description 1
- 238000001228 spectrum Methods 0.000 description 1
- 230000007704 transition Effects 0.000 description 1
- 238000010200 validation analysis Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/18—Artificial neural networks; Connectionist approaches
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/06—Decision making techniques; Pattern matching strategies
- G10L17/14—Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
Landscapes
- Engineering & Computer Science (AREA)
- Acoustics & Sound (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Business, Economics & Management (AREA)
- Game Theory and Decision Science (AREA)
- Image Analysis (AREA)
Abstract
The present invention proposes a speaker recognition method based on two-pass modeling, belonging to the fields of voiceprint recognition, pattern recognition and machine learning. In the model training stage, the method obtains and pre-processes the training speech data of the speakers to be identified; trains a first DNN model on the training speech data; uses the first DNN model to recognize the training speech data and extract the easily confused speech data; and trains a second DNN model on the easily confused speech data. In the speaker recognition stage, speech data to be identified is obtained, pre-processed, and recognized with the first DNN model; if the recognition probability exceeds a set threshold, the speaker recognition result is output directly; otherwise the speech data is recognized a second time with the second DNN model to obtain the result. By building two DNN models, the invention accounts for both the coarse-grained and the fine-grained differences between speakers and effectively improves speaker recognition accuracy.
Description
Technical field
The invention belongs to the technical fields of voiceprint recognition, pattern recognition and machine learning, and in particular relates to a speaker recognition method based on two-pass modeling.
Background technology
Speaker recognition refers to identifying a speaker's identity from the speaker-related information contained in speech. With the rapid development of information and communication technology, speaker recognition has attracted growing attention and found wide application in many areas, such as identity verification, locating criminals over telephone channels, confirming identity from telephone recordings in court, tracking call audio, and controlling anti-theft door access. In Internet applications and the communications field, speaker recognition can be applied to voice dialing, telephone banking, telephone shopping, database access, information services, voice e-mail, security control, remote computer login, and similar areas.
Speaker recognition first pre-processes the speech data and extracts features. The most commonly used feature is the mel cepstrum, which is based on the theory of human auditory perception and is now widely applied to speaker recognition, language identification, continuous speech recognition, and other tasks. Mel cepstrum extraction first applies pre-emphasis, framing and windowing to the speech data; it then applies a fast Fourier transform (FFT) to each windowed frame to obtain its spectrum, filters the spectrum with a bank of mel-scale triangular window filters, and finally applies a discrete cosine transform to obtain the mel cepstrum features.
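As an illustration, the extraction pipeline just described can be sketched in Python with NumPy. The frame length, hop, filter count and FFT size below are common choices and are not values specified by the patent; the function name is likewise illustrative.

```python
import numpy as np

def mel_cepstrum(signal, sr=8000, frame_len=200, hop=80, n_filt=26, n_ceps=20, n_fft=256):
    """Sketch of the pipeline above: pre-emphasis -> framing + windowing
    -> FFT -> mel triangular filterbank -> log -> DCT."""
    # 1. Pre-emphasis: boost high frequencies.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2. Framing with overlap, then a Hamming window per frame.
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop:i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # 3. Magnitude spectrum via FFT.
    spec = np.abs(np.fft.rfft(frames, n_fft))
    # 4. Mel-scale triangular filterbank.
    def hz2mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel2hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz2mel(0.0), hz2mel(sr / 2.0), n_filt + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filt, n_fft // 2 + 1))
    for j in range(1, n_filt + 1):
        lo, c, hi = bins[j - 1], bins[j], bins[j + 1]
        for k in range(lo, c):
            fbank[j - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fbank[j - 1, k] = (hi - k) / max(hi - c, 1)
    log_energy = np.log(spec @ fbank.T + 1e-10)
    # 5. DCT-II of the log filterbank energies gives the mel cepstrum.
    idx = np.arange(n_filt)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * idx + 1) / (2.0 * n_filt)))
    return log_energy @ basis.T   # shape: (frames, n_ceps)
```

Applied to one second of 8 kHz speech, this yields one 20-dimensional cepstral vector per 10 ms frame.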
In recent years, speaker recognition models based on deep neural networks (DNNs) have received increasing attention. Compared with the traditional Gaussian mixture model (GMM), a DNN has stronger descriptive power and can better model extremely complex data distributions, so DNN-based systems obtain significant performance gains. A DNN model comprises three levels: an input layer, hidden layers, and an output layer. The input layer corresponds to the features of the speech data, its number of nodes determined by the feature dimension; the output layer corresponds to the probability of each speaker, its number of nodes determined by the total number of speakers to be recognized; the number and sizes of the hidden layers are set according to the application and engineering experience. DNN training performs unsupervised training first and supervised training afterwards. During unsupervised training, each pair of adjacent layers is treated as a restricted Boltzmann machine and trained layer by layer with the contrastive divergence (CD) algorithm. During supervised training, the DNN parameters obtained from unsupervised training serve as initial values, which are then fine-tuned with the back-propagation algorithm. To date, DNN-based speaker recognition methods have used only a single DNN model, but a single DNN can hardly model both the coarse-grained and the fine-grained differences between speakers at the same time. As a result, when a single DNN is used for speaker recognition, some utterances are easy to discriminate while others are easily confused.
Content of the invention
The purpose of the present invention is to overcome the shortcomings of the prior art by proposing a speaker recognition method based on two-pass modeling. By building two DNN models, the invention accounts for both the coarse-grained and the fine-grained differences between speakers and can effectively improve speaker recognition accuracy.
A speaker recognition method based on two-pass modeling is divided into two stages: a model training stage and a speaker recognition stage. In the model training stage, the training speech data of all speakers to be identified is obtained and pre-processed; a first DNN model is trained on the training speech data; the first DNN model is then used to recognize the training speech data and extract the easily confused speech data; and a second DNN model is trained on the easily confused speech data. In the speaker recognition stage, speech data to be identified is obtained, pre-processed, and recognized with the first DNN model; if the recognition probability exceeds the decision threshold, the speaker corresponding to the maximum output probability in the recognition result is taken as the speaker of the speech data; otherwise the speech data is recognized a second time with the second DNN model, and the speaker corresponding to its maximum output probability is taken as the speaker. The method comprises the following steps:
1) Model training stage, comprising the following steps:
1-1) Obtain the training speech data of all speakers to be identified; the speaker of each training utterance is known. Pre-process the obtained training speech data, extract the mel cepstrum features of all training speech data, and compute the first- and second-order derivatives of the mel cepstrum features, for 60 dimensions in total;
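The 60-dimensional feature of step 1-1), the mel cepstrum plus its first- and second-order time derivatives, can be sketched as follows. The patent does not specify the derivative estimator, so a simple numerical gradient stands in here; the function name is illustrative.

```python
import numpy as np

def add_deltas(ceps):
    """Append first- and second-order time derivatives to a (frames x 20)
    mel-cepstrum matrix, giving the 60-dimensional feature used by both
    DNN models."""
    delta = np.gradient(ceps, axis=0)       # first-order derivative over frames
    delta2 = np.gradient(delta, axis=0)     # second-order derivative
    return np.concatenate([ceps, delta, delta2], axis=1)  # (frames x 60)
```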
1-2) Build and train the first DNN model, comprising the following steps:
1-2-1) Set the number of layers and nodes of the first DNN model.
The first DNN model is divided into three levels: input layer, hidden layers, and output layer. The input layer corresponds to the training-speech features: since the feature obtained in step 1-1), the mel cepstrum with its first- and second-order derivatives, has 60 dimensions in total, the input layer is set to 60 nodes. The output layer corresponds to the probability of each speaker; its number of nodes equals the number of all speakers to be identified, each node's output being the probability of one speaker. The hidden layers automatically extract features at different levels; the number of nodes in each hidden layer is the dimension of the features that layer extracts;
1-2-2) Train the first DNN model to obtain its parameters.
Train the first DNN model on the mel cepstrum features and their first- and second-order derivatives of the training speech data of all speakers to be identified. The model parameters comprise the connection weights between adjacent layers and the bias of each node. Unsupervised training is performed first, followed by supervised training. During unsupervised training, each pair of adjacent layers in the first DNN model is treated as a restricted Boltzmann machine, and all restricted Boltzmann machines are trained layer by layer with the contrastive divergence algorithm to obtain initial values of the first DNN model's parameters. During supervised training, starting from the initial values obtained by unsupervised training, the parameters are fine-tuned with the back-propagation algorithm to obtain the final parameters of the first DNN model;
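The unsupervised pre-training step can be illustrated with a single contrastive-divergence (CD-1) update for one restricted Boltzmann machine. The Bernoulli units, learning rate and function name below are illustrative assumptions; the patent only names the CD algorithm.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v_data, W, b_vis, b_hid, lr=0.01, rng=None):
    """One CD-1 parameter update for a Bernoulli RBM, the building block
    of the layer-wise unsupervised pre-training described above."""
    rng = rng or np.random.default_rng(0)
    # Positive phase: hidden probabilities driven by the data.
    h_prob = sigmoid(v_data @ W + b_hid)
    h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
    # Negative phase: one Gibbs step back to a reconstruction.
    v_recon = sigmoid(h_sample @ W.T + b_vis)
    h_recon = sigmoid(v_recon @ W + b_hid)
    # CD-1 gradient approximation: <v h>_data - <v h>_recon.
    n = v_data.shape[0]
    W = W + lr * (v_data.T @ h_prob - v_recon.T @ h_recon) / n
    b_vis = b_vis + lr * (v_data - v_recon).mean(axis=0)
    b_hid = b_hid + lr * (h_prob - h_recon).mean(axis=0)
    return W, b_vis, b_hid
```

In a full run, many such updates are applied to each layer pair before its weights are frozen as initial values for back-propagation.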
1-3) Extract the easily confused speech data.
Use the first DNN model trained in step 1-2) to recognize the training speech data of all speakers to be identified, and set a threshold. If, in the recognition result for a training utterance, the probability assigned to that utterance's true speaker is below the threshold, the utterance is poorly discriminated; it is extracted as easily confused speech data and used to train the second DNN model. If the probability is greater than or equal to the threshold, the utterance is easy to distinguish and is not treated as easily confused speech data;
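Step 1-3) amounts to a simple filter over the first model's outputs. A sketch, assuming the recognition probabilities and the known true-speaker labels are available as arrays; the function name and its default threshold (0.85, from the embodiment) are illustrative:

```python
import numpy as np

def split_confusable(probs, labels, threshold=0.85):
    """probs:  (utterances x speakers) first-DNN output probabilities.
    labels:   true speaker index of each utterance (known for training data).
    Returns indices of easily confused utterances (true-speaker probability
    below the threshold) and of easily distinguished ones."""
    true_probs = probs[np.arange(len(labels)), labels]
    confusable = np.where(true_probs < threshold)[0]   # used to train the second DNN
    clear = np.where(true_probs >= threshold)[0]       # easily distinguished
    return confusable, clear
```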
1-4) Build and train the second DNN model, comprising the following steps:
1-4-1) Set the number of layers and nodes of the second DNN model.
The second DNN model is likewise divided into three levels: input layer, hidden layers, and output layer. The input layer corresponds to the training-speech features: the mel cepstrum with its first- and second-order derivatives of the easily confused speech data extracted in step 1-3), 60 dimensions in total, so the input layer is set to 60 nodes. The output layer corresponds to the probability of each speaker; its number of nodes is the number of speakers represented in the easily confused speech data, each node's output being the probability of one speaker. The hidden layers automatically extract features at different levels; the number of nodes in each hidden layer is the dimension of the features that layer extracts;
1-4-2) Train the second DNN model to obtain its parameters.
Train the second DNN model on the mel cepstrum features and their first- and second-order derivatives of the easily confused speech data obtained in step 1-3). The model parameters comprise the connection weights between adjacent layers and the bias of each node. Unsupervised training is performed first, followed by supervised training: during unsupervised training, each pair of adjacent layers of the second DNN model is treated as a restricted Boltzmann machine, and all restricted Boltzmann machines are trained layer by layer with the contrastive divergence algorithm to obtain initial values of the second DNN model's parameters; during supervised training, starting from these initial values, the parameters are fine-tuned with the back-propagation algorithm to obtain the final parameters of the second DNN model;
2) Speaker recognition stage, comprising the following steps:
2-1) Obtain a piece of speech data to be identified from one of the speakers to be recognized, pre-process it, and extract its mel cepstrum features with their first- and second-order derivatives, 60 dimensions in total;
2-2) Input the 60-dimensional features obtained in step 2-1) into the first DNN model obtained in step 1-2) for recognition; the output layer gives the recognition result for the speech data, i.e. the probability that it belongs to each speaker in the training speech data, each output node corresponding to one speaker's probability;
2-3) Set a decision threshold and check whether any probability in the recognition result of step 2-2) exceeds it. If so, the speaker corresponding to the maximum output probability of the first DNN model is taken as the speaker of the speech data, and recognition ends; if not, go to step 2-4);
2-4) If no probability in the recognition result of step 2-3) exceeds the decision threshold, recognize the speech data a second time with the second DNN model; the speaker corresponding to the maximum output probability of the second DNN model is taken as the speaker of the speech data, and recognition ends.
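The two-pass decision of steps 2-2) to 2-4) can be sketched as follows. Here dnn1 and dnn2 are hypothetical stand-ins for the trained models, taken as callables mapping a 60-dimensional feature to a probability vector over speakers; the function name and default threshold are illustrative.

```python
import numpy as np

def two_pass_identify(feature, dnn1, dnn2, threshold=0.85):
    """Two-pass recognition: accept the first DNN's answer if it is
    confident enough, otherwise fall back to the second DNN."""
    p1 = dnn1(feature)
    if p1.max() > threshold:       # first pass clears the decision threshold
        return int(np.argmax(p1))
    p2 = dnn2(feature)             # second, fine-grained pass
    return int(np.argmax(p2))
```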
Features and beneficial effects of the present invention:
Compared with the prior art, the first DNN model of the invention models the coarse-grained differences between speakers, while the second DNN model models the fine-grained differences. The method increases the discriminability of easily confused speech data across speakers and has good system stability; by accounting for both coarse-grained and fine-grained differences, it can improve the accuracy of speaker recognition.
Brief description of the drawings
Fig. 1 is the flow chart of the method of the invention.
Fig. 2 shows the structure of the first DNN model in the embodiment of the invention.
Fig. 3 shows the structure of the second DNN model in the embodiment of the invention.
Specific embodiment
The speaker recognition method based on two-pass modeling proposed by the present invention is further described below with reference to the drawings and a specific embodiment.
The method is divided into a model training stage and a speaker recognition stage. In the model training stage, the training speech data of all speakers to be identified is obtained and pre-processed; a first DNN model is trained on the training speech data; the first DNN model is used to recognize the training speech data and extract the easily confused speech data; and a second DNN model is trained on the easily confused speech data. In the speaker recognition stage, speech data to be identified is obtained and pre-processed, then recognized with the first DNN model; if the recognition probability exceeds the decision threshold, the speaker corresponding to the maximum output probability in the recognition result is taken as the speaker of the speech data; otherwise a second recognition is performed with the second DNN model, and the speaker corresponding to its maximum output probability is taken as the speaker. The flow chart of the method is shown in Fig. 1; it comprises the following steps:
1) Model training stage, comprising the following steps:
1-1) Obtain the training speech data of all speakers to be identified; the speaker of each training utterance is known, and the data may be obtained by live recording or from telephone recordings. Pre-process the obtained training speech data, extract the mel cepstrum features of all training speech data, and compute their first- and second-order derivatives, 60 dimensions in total;
1-2) Build and train the first DNN model, comprising the following steps:
1-2-1) Set the number of layers and nodes of the first DNN model.
The first DNN model is divided into input layer, hidden layers, and output layer. The input layer corresponds to the training-speech features: the mel cepstrum with its first- and second-order derivatives obtained in step 1-1), 60 dimensions in total, so the input layer is set to 60 nodes. The output layer corresponds to the probability of each speaker; its number of nodes is the number of all speakers to be identified, each node's output being the probability of one speaker. The hidden layers mainly serve to automatically extract features at different levels; their number and sizes are set according to need and experience, typically 3-5 hidden layers. The number of nodes in each hidden layer is the dimension of the features that layer extracts; the middle hidden layer is typically set to 300-500 nodes, and the other hidden layers to 1000-2000 nodes each;
1-2-2) Train the first DNN model to obtain its parameters.
Train the first DNN model on the mel cepstrum features and their first- and second-order derivatives of the training speech data of all speakers to be identified. The model parameters comprise the connection weights between adjacent layers and the bias of each node. Unsupervised training is performed first, followed by supervised training: during unsupervised training, each pair of adjacent layers of the first DNN model is treated as a restricted Boltzmann machine, and all restricted Boltzmann machines are trained layer by layer with the contrastive divergence (CD) algorithm to obtain initial values of the first DNN model's parameters; during supervised training, starting from these initial values, the parameters are fine-tuned with the back-propagation algorithm to obtain the final parameters of the first DNN model;
1-3) Extract the easily confused speech data.
Use the first DNN model trained in step 1-2) to recognize the training speech data of all speakers to be identified, and set a threshold, empirically in the range 0.7-0.9. If, in the recognition result for a training utterance, the probability assigned to its true speaker is below the threshold, the utterance is poorly discriminated and is extracted as easily confused speech data for training the second DNN model; if the probability is greater than or equal to the threshold, the utterance is easy to distinguish and is not treated as easily confused speech data;
1-4) Build and train the second DNN model, comprising the following steps:
1-4-1) Set the number of layers and nodes of the second DNN model.
The second DNN model is divided into input layer, hidden layers, and output layer. The input layer corresponds to the training-speech features: the mel cepstrum with its first- and second-order derivatives of the easily confused speech data extracted in step 1-3), 60 dimensions in total, so the input layer is set to 60 nodes. The output layer corresponds to the probability of each speaker; its number of nodes is the number of speakers represented in the easily confused speech data, each node's output being the probability of one speaker. The number of hidden layers is typically set to 3-5; the middle hidden layer is typically set to 300-500 nodes, and the other hidden layers to 1000-2000 nodes each;
1-4-2) Train the second DNN model to obtain its parameters.
Train the second DNN model on the mel cepstrum features and their first- and second-order derivatives of the easily confused speech data obtained in step 1-3). The model parameters comprise the connection weights between adjacent layers and the bias of each node. Unsupervised training is performed first, followed by supervised training: during unsupervised training, each pair of adjacent layers of the second DNN model is treated as a restricted Boltzmann machine, and all restricted Boltzmann machines are trained layer by layer with the contrastive divergence (CD) algorithm to obtain initial values of the second DNN model's parameters; during supervised training, starting from these initial values, the parameters are fine-tuned with the back-propagation algorithm to obtain the final parameters of the second DNN model;
2) Speaker recognition stage, comprising the following steps:
2-1) Obtain a piece of speech data to be identified from one of the speakers to be recognized, pre-process it, and extract its mel cepstrum features with their first- and second-order derivatives, 60 dimensions in total;
2-2) Input the 60-dimensional features of the speech data from step 2-1) into the first DNN model obtained in step 1-2) for recognition; the output layer gives the recognition result, i.e. the probability that the speech data belongs to each speaker in the training speech data, each output node corresponding to one speaker's probability;
2-3) Set a decision threshold and check whether any probability in the recognition result of step 2-2) exceeds it. If so, the speaker corresponding to the maximum output probability of the first DNN model is taken as the speaker of the speech data, and recognition ends; if not, go to step 2-4);
2-4) If no probability in the recognition result of step 2-3) exceeds the decision threshold, recognize the speech data a second time with the second DNN model; the speaker corresponding to the maximum output probability of the second DNN model is taken as the speaker of the speech data, and recognition ends.
The method of the invention is further described below with a specific embodiment. It should be noted that the embodiment described below is only one embodiment of the invention, not all embodiments. Based on the embodiments of the invention, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the scope of protection of the invention.
In this embodiment 800 speakers need to be recognized. The steps are as follows:
1) Model training stage, comprising the following steps:
1-1) Obtain the training speech data of the 800 speakers to be recognized; the speaker of each training utterance is known, and the data is obtained from telephone recordings. Pre-process the obtained training speech data (i.e. the telephone recordings), extract the mel cepstrum features of all training speech data, and compute their first- and second-order derivatives, 60 dimensions in total;
1-2) Build and train the first DNN model, comprising the following steps:
1-2-1) Set the number of layers and nodes of the first DNN model.
The structure of the first DNN model is shown in Fig. 2. It has 7 layers in total: layer 1 is the input layer, layers 2-6 are hidden layers (5 in all), and layer 7 is the output layer. The crossing lines represent the connections between nodes: adjacent layers of the first DNN model are fully connected, while nodes within a layer are unconnected. The input layer corresponds to the training-speech features, in this embodiment the mel cepstrum with its first- and second-order derivatives obtained in step 1-1), 60 dimensions in total, so the number of input nodes is N1 = 60. The output layer corresponds to the probability of each speaker; its number of nodes equals the number of speakers to be recognized, N7 = 800, each node's output being the probability of one speaker. The hidden layers mainly serve to automatically extract features at different levels; in this embodiment 5 hidden layers are used (3-5 is typical), the extracted features transitioning gradually from low-level abstractions at layer 2 to high-level abstractions at layer 6. The number of nodes in each hidden layer is the dimension of the features that layer extracts; in this embodiment the middle hidden layer (layer 4) is set to N4 = 400 nodes (300-500 is typical), and the remaining hidden layers N2, N3, N5 and N6 are set to 1024 nodes each (1000-2000 is typical).
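The layer sizes of this embodiment (N1 = 60, hidden layers of 1024, 1024, 400, 1024, 1024 nodes, N7 = 800) can be written down directly. In the sketch below the random initialization merely stands in for the RBM pre-training, and the sigmoid hidden units and softmax output are assumptions, since the patent does not name the activation functions.

```python
import numpy as np

def build_dnn(layer_sizes, rng=None):
    """Fully connected network: weight matrix W[i] of shape N_i x N_{i+1}
    between adjacent layers, plus a bias per non-input node, matching the
    parameterisation in the text."""
    rng = rng or np.random.default_rng(0)
    W = [rng.standard_normal((a, b)) * 0.01
         for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
    b = [np.zeros(n) for n in layer_sizes[1:]]
    return W, b

def forward(x, W, b):
    """Forward pass: sigmoid hidden layers, softmax over the speakers."""
    for Wi, bi in zip(W[:-1], b[:-1]):
        x = 1.0 / (1.0 + np.exp(-(x @ Wi + bi)))
    logits = x @ W[-1] + b[-1]
    e = np.exp(logits - logits.max())
    return e / e.sum()

sizes = [60, 1024, 1024, 400, 1024, 1024, 800]   # N1..N7 of the embodiment
W, b = build_dnn(sizes)
probs = forward(np.zeros(60), W, b)              # one probability per speaker
```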
1-2-2) Train the first DNN model to obtain its parameters.
Train the first DNN model on the mel cepstrum features and their first- and second-order derivatives of the training speech data of the 800 speakers to be identified. The model parameters comprise the connection weights between adjacent layers and the bias of each node. The weights are W_{i,i+1} (i = 1, ..., 6), where W_{i,i+1} is a matrix of N_i rows and N_{i+1} columns whose entry w^(i)_{m,n} is the weight of the connection between the m-th node of layer i and the n-th node of layer i+1 of the first DNN model. The biases are b^(j)_k, where b^(j)_k is the bias of the k-th node in layer j of the first DNN model.
Unsupervised training is performed first, followed by supervised training. During unsupervised training, each pair of adjacent layers in the first DNN model (layers 1 and 2, layers 2 and 3, ..., layers 6 and 7 in Fig. 2) is treated as a restricted Boltzmann machine, giving 6 restricted Boltzmann machines in all. They are trained one by one with the contrastive divergence (CD) algorithm: first the restricted Boltzmann machine formed by layers 1 and 2 is trained, giving the parameters B1, B2 and W12 of the first DNN model; then the one formed by layers 2 and 3 is trained, giving B3 and W23; and so on until all restricted Boltzmann machines have been trained and the initial values of the first DNN model's parameters are obtained. During supervised training, starting from the initial values obtained by unsupervised training, the parameters are fine-tuned with the back-propagation algorithm to obtain the final parameters of the first DNN model.
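The layer-by-layer training order just described (layers 1 and 2 first, then layers 2 and 3, and so on up the stack) can be sketched as a greedy loop. For brevity each RBM here receives only one simplified CD step, where a real run would iterate many; the function name and learning rate are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pretrain_stack(data, layer_sizes, lr=0.01, rng=None):
    """Greedy layer-wise pre-training: train the RBM for one layer pair on
    the activations passed up from below, then propagate those activations
    through the newly trained weights before moving to the next pair."""
    rng = rng or np.random.default_rng(0)
    weights, biases, v = [], [], data
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = rng.standard_normal((n_in, n_out)) * 0.01
        b = np.zeros(n_out)
        # One simplified CD step for this RBM (stand-in for a full CD run).
        h = sigmoid(v @ W + b)
        v_recon = sigmoid(h @ W.T)
        h_recon = sigmoid(v_recon @ W + b)
        W += lr * (v.T @ h - v_recon.T @ h_recon) / len(v)
        weights.append(W)
        biases.append(b)
        v = sigmoid(v @ W + b)   # activations feed the next RBM
    return weights, biases       # initial values for back-propagation
```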
1-3) Extract the easily confused speech data.
Use the first DNN model trained in step 1-2) to recognize all training speech data of the 800 speakers. The threshold is set to 0.85 (0.7-0.9 is typical). If, in the recognition result for a training utterance, the probability assigned to its true speaker is below the threshold, the utterance is poorly discriminated and is taken as easily confused speech data for training the second DNN model; if the probability is greater than or equal to the threshold, the utterance is easy to distinguish and is not treated as easily confused speech data;
1-4) Build and train the second DNN model, in the following steps:
1-4-1) Set the number of layers and nodes of the second DNN model;
The structure of the second DNN model is shown in Fig. 3. The second DNN model has five layers in total: layer 1 is the input layer, layers 2-4 are hidden layers (3 hidden layers in total), and layer 5 is the output layer. The crossed lines represent the connections between nodes; adjacent layers of the second DNN model are fully connected, and there are no connections between nodes within a layer. The input layer of the second DNN model corresponds to the speech-data features, in this embodiment the mel cepstrum features of the easily mixed speech data extracted in step 1-3) together with their first and second derivatives (60 dimensions in total), so the number of input nodes of the second DNN model is set to N1 = 60. The output layer corresponds to the probability of each speaker; the number of output nodes equals the number of speakers contained in the easily mixed speech data, which in this embodiment is also 800. Because the speaker corresponding to each training utterance is known, and the easily mixed speech data are extracted from the training data, the speaker of every frame of easily mixed speech data is known, so the number of speakers contained in the easily mixed speech data can be counted. The number of hidden layers is set to 3 (typically 3-5); the number of nodes of the middle hidden layer (layer 3) is set to N3 = 300 (typically 300-500), and the numbers of nodes of the remaining hidden layers, N2 and N4, are set to 1024 (typically 1000-2000).
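The layer sizes just set fix the shapes of the second DNN model's parameters: each W_{i,i+1} has N_i rows and N_{i+1} columns, and each layer j carries a bias vector B_j. A sketch of these shapes follows; the small random initialization is purely illustrative, since the patent obtains initial values by RBM pretraining:

```python
import numpy as np

# Layer sizes of the second DNN model as set above: 60-dim input,
# hidden layers of 1024, 300 and 1024 nodes, 800 output nodes.
layer_sizes = [60, 1024, 300, 1024, 800]

rng = np.random.default_rng(0)
weights = [rng.normal(0.0, 0.01, size=(n_in, n_out))     # W_{i,i+1}
           for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes]              # B_1 ... B_5
```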
1-4-2) Train the second DNN model to obtain its parameters;
Using the mel cepstrum features of the easily mixed speech data obtained in step 1-3), together with their first and second derivatives, the second DNN model is trained. The parameters of the second DNN model comprise the connection weights of adjacent layers W_{i,i+1} (i = 1, ..., 4) and the bias of each node B_j (j = 1, ..., 5). Unsupervised training is performed first, followed by supervised training. During unsupervised training, each pair of adjacent layers in the second DNN model (layers 1 and 2, layers 2 and 3, ..., layers 4 and 5 in Fig. 3) is treated as a restricted Boltzmann machine, giving 4 RBMs in total. These are trained one by one with the contrastive divergence (CD) algorithm: first the RBM formed by layers 1 and 2 is trained, yielding B1, B2 and W12 of the second DNN model's parameters; then the RBM formed by layers 2 and 3 is trained, yielding B3 and W23; all RBMs are trained in turn, yielding initial values for the second DNN model's parameters. During supervised training, these initial values obtained by unsupervised training are fine-tuned with the back-propagation algorithm, finally yielding the parameters of the second DNN model.
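The supervised fine-tuning applied to both DNN models can be sketched as one back-propagation step. Sigmoid hidden units, a softmax output layer and the cross-entropy loss are conventional choices assumed here for illustration (the patent does not fix them), and the input-layer bias is omitted for brevity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def backprop_step(x, y_onehot, Ws, bs, lr=0.1):
    """One gradient step on a batch. Ws[i] connects layer i+1 to layer
    i+2; bs[i] is the bias of layer i+2; y_onehot: one-hot speaker labels."""
    # Forward pass, keeping each layer's activations for the backward pass.
    acts = [x]
    for W, b in zip(Ws[:-1], bs[:-1]):
        acts.append(sigmoid(acts[-1] @ W + b))
    out = softmax(acts[-1] @ Ws[-1] + bs[-1])
    # Softmax + cross-entropy gives this simple output-layer error signal.
    delta = (out - y_onehot) / len(x)
    for i in range(len(Ws) - 1, -1, -1):
        grad_W = acts[i].T @ delta
        grad_b = delta.sum(axis=0)
        if i > 0:  # propagate the error through the sigmoid layer below
            delta = (delta @ Ws[i].T) * acts[i] * (1 - acts[i])
        Ws[i] -= lr * grad_W
        bs[i] -= lr * grad_b
    return Ws, bs
```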
The above steps constitute the model training stage; after the two DNN models have been obtained, speaker identification is carried out.
2) The speaker identification stage, comprising the following steps:
2-1) Obtain the speech data to be identified, belonging to one of the 800 speakers to be recognized. The speech data to be identified are also obtained through telephone recording, but their corresponding speaker is unknown and must be identified by the method proposed in the present invention; they are different utterances from the training speech data. The speech data to be identified are pre-processed, and their mel cepstrum features together with the first and second derivatives are extracted, 60 dimensions in total.
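The 60-dimensional feature (static mel cepstrum coefficients plus first and second derivatives) used throughout can be assembled as sketched below; 20 static coefficients and a +/-2 frame regression window are assumptions for illustration, since the patent only fixes the 60-dimension total:

```python
import numpy as np

def deltas(feat, width=2):
    """Regression-style time derivative of a (n_frames, n_coeff) feature
    matrix over a +/-width frame window (edge frames are padded)."""
    n = len(feat)
    padded = np.pad(feat, ((width, width), (0, 0)), mode="edge")
    denom = 2 * sum(k * k for k in range(1, width + 1))
    return sum(k * (padded[width + k:n + width + k]
                    - padded[width - k:n + width - k])
               for k in range(1, width + 1)) / denom

def stack_60dim(mfcc):
    """Concatenate the static coefficients with their first and second
    derivatives, giving the 60-dimensional per-frame feature."""
    d1 = deltas(mfcc)
    d2 = deltas(d1)
    return np.hstack([mfcc, d1, d2])
```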
2-2) The 60-dimensional features of the speech data to be identified obtained in step 2-1) are input to the first DNN model for recognition. The output layer outputs the recognition result of the speech data to be identified, i.e. the probabilities that this speech data corresponds to each of the 800 speakers; each output-layer node outputs the probability of one speaker, 800 outputs in total.
2-3) Set the decision threshold and judge whether any of the 800 probabilities in the recognition result exceeds the threshold 0.85. If so, the speaker corresponding to that probability is judged to be the speaker of this speech data to be identified, and identification ends; otherwise go to step 2-4);
2-4) If no probability in the recognition result of step 2-3) exceeds the decision threshold 0.85, this speech data to be identified is recognized a second time using the second DNN model. According to the recognition result of the second DNN model, the speaker corresponding to the maximum output probability is judged to be the speaker of this speech data to be identified, and identification ends.
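Steps 2-2) through 2-4) form a simple cascade, which can be sketched as follows; the two callables standing in for the trained DNN models are hypothetical:

```python
import numpy as np

def identify(features, first_dnn, second_dnn, threshold=0.85):
    """Return the decided speaker index for one utterance.

    first_dnn/second_dnn: callables mapping features to a per-speaker
    probability vector (stand-ins for the two trained DNN models)."""
    p1 = first_dnn(features)
    if p1.max() > threshold:          # confident first-stage decision
        return int(np.argmax(p1))
    p2 = second_dnn(features)         # fall back to the second model
    return int(np.argmax(p2))
```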
As one of ordinary skill in the art will appreciate, the speaker identification method described above may be implemented by a program, and the program may be stored in a computer-readable storage medium.
What is described above is only a specific embodiment of the present invention; obviously the scope of rights of the present invention cannot be limited thereby, and equivalent variations made according to the claims of the present invention still fall within the scope covered by the present invention.
Claims (2)
1. A speaker identification method based on two-stage modeling, characterized by being divided into a model training stage and a speaker identification stage. In the model training stage, the training speech data of all speakers to be identified are obtained and pre-processed; a first DNN model is trained from the training speech data; using the first DNN model, the training speech data are recognized and the easily mixed speech data are extracted; a second DNN model is trained from the easily mixed speech data. In the speaker identification stage, the speech data to be identified are obtained and pre-processed; the speech data to be identified are recognized using the first DNN model; if the recognition probability exceeds the decision threshold, the speaker corresponding to the maximum output probability in the recognition result is taken as the speaker of the speech data to be identified; otherwise the speech data to be identified are recognized a second time by the second DNN model, and the speaker corresponding to the maximum output probability in that recognition result is taken as the speaker of the speech data to be identified.
2. The method of claim 1, characterized in that the method comprises the following steps:
1) The model training stage, comprising the following steps:
1-1) Obtain the training speech data of all speakers to be identified, the speaker corresponding to each training utterance being known; pre-process the obtained training speech data, extract the mel cepstrum features of all training speech data, and compute the first and second derivatives of the mel cepstrum features, 60 dimensions in total;
1-2) Build and train the first DNN model, in the following steps:
1-2-1) Set the number of layers and nodes of the first DNN model;
The first DNN model is divided into three kinds of layers: input layer, hidden layers and output layer. The input layer corresponds to the features of the training speech data; the number of input nodes equals the dimensionality of the mel cepstrum features of the training speech data obtained in step 1-1) together with their first and second derivatives, 60 dimensions in total, so the number of input nodes is set to 60. The output layer corresponds to the probability of each speaker; the number of output nodes equals the number of speakers to be recognized, and the output of each node corresponds to the probability of one speaker. The hidden layers automatically extract features at different levels, and the number of nodes in each hidden layer represents the dimensionality of the features extracted by that layer;
1-2-2) Train the first DNN model to obtain its parameters;
Using the mel cepstrum features of the training speech data of all speakers to be identified, together with their first and second derivatives, the first DNN model is trained; the model parameters comprise the connection weights of adjacent layers and the bias of each node. Unsupervised training is performed first, followed by supervised training: during unsupervised training, each pair of adjacent layers in the first DNN model is treated as a restricted Boltzmann machine, and all restricted Boltzmann machines are trained in turn with the contrastive divergence algorithm, yielding initial values for the first DNN model's parameters; during supervised training, these initial values obtained by unsupervised training are fine-tuned with the back-propagation algorithm, finally yielding the parameters of the first DNN model;
1-3) Extract the easily mixed speech data;
Using the first DNN model trained in step 1-2), the training speech data of all speakers to be identified are recognized and a threshold is set. If, in the recognition result of a training utterance, the probability of that utterance's true speaker is below the threshold, the recognition result is poorly discriminative, and the utterance is extracted as easily mixed speech data for training the second DNN model; if the probability is greater than or equal to the threshold, the utterance is easily distinguishable and is not taken as easily mixed speech data;
1-4) Build and train the second DNN model, in the following steps:
1-4-1) Set the number of layers and nodes of the second DNN model;
The second DNN model is divided into three kinds of layers: input layer, hidden layers and output layer. The input layer corresponds to the features of the training speech data; the number of input nodes equals the dimensionality of the mel cepstrum features of the easily mixed speech data extracted in step 1-3) together with their first and second derivatives, 60 dimensions in total, so the number of input nodes is set to 60. The output layer corresponds to the probability of each speaker; the number of output nodes equals the number of speakers contained in the easily mixed speech data, and the output of each node corresponds to the probability of one speaker. The hidden layers automatically extract features at different levels, and the number of nodes in each hidden layer represents the dimensionality of the features extracted by that layer;
1-4-2) Train the second DNN model to obtain its parameters;
Using the mel cepstrum features of the easily mixed speech data obtained in step 1-3), together with their first and second derivatives, the second DNN model is trained; the model parameters comprise the connection weights of adjacent layers and the bias of each node. Unsupervised training is performed first, followed by supervised training: during unsupervised training, each pair of adjacent layers in the second DNN model is treated as a restricted Boltzmann machine, and all restricted Boltzmann machines are trained in turn with the contrastive divergence algorithm, yielding initial values for the second DNN model's parameters; during supervised training, these initial values obtained by unsupervised training are fine-tuned with the back-propagation algorithm, finally yielding the parameters of the second DNN model;
2) The speaker identification stage, comprising the following steps:
2-1) Obtain the speech data to be identified, belonging to one of the speakers to be recognized; pre-process the speech data to be identified, and extract their mel cepstrum features together with the first and second derivatives, 60 dimensions in total;
2-2) Input the 60-dimensional features of the speech data to be identified obtained in step 2-1) into the first DNN model obtained in step 1-2) for recognition; the output layer outputs the recognition result of the speech data to be identified, i.e. the probabilities that this speech data corresponds to each speaker in the training speech data, each output-layer node outputting the probability of one speaker;
2-3) Set the decision threshold and judge whether any probability in the recognition result of step 2-2) exceeds the decision threshold. If so, the speaker corresponding to the maximum output probability in the first DNN model's recognition result is the speaker of the speech data to be identified, and identification ends; if not, go to step 2-4);
2-4) If no probability in the recognition result of step 2-3) exceeds the decision threshold, the speech data to be identified are recognized a second time using the second DNN model; the speaker corresponding to the maximum output probability in the second DNN model's recognition result is the speaker of the speech data to be identified, and identification ends.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710031899.7A CN106898355B (en) | 2017-01-17 | 2017-01-17 | Speaker identification method based on secondary modeling |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106898355A true CN106898355A (en) | 2017-06-27 |
CN106898355B CN106898355B (en) | 2020-04-14 |
Family
ID=59198262
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710031899.7A Active CN106898355B (en) | 2017-01-17 | 2017-01-17 | Speaker identification method based on secondary modeling |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106898355B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1264887A (en) * | 2000-03-31 | 2000-08-30 | 清华大学 | Non-particular human speech recognition and prompt method based on special speech recognition chip |
CN1588536A (en) * | 2004-09-29 | 2005-03-02 | 上海交通大学 | State structure regulating method in sound identification |
CN101231848A (en) * | 2007-11-06 | 2008-07-30 | 安徽科大讯飞信息科技股份有限公司 | Method for performing pronunciation error detecting based on holding vector machine |
US20140074471A1 (en) * | 2012-09-10 | 2014-03-13 | Cisco Technology, Inc. | System and method for improving speaker segmentation and recognition accuracy in a media processing environment |
CN105761720A (en) * | 2016-04-19 | 2016-07-13 | 北京地平线机器人技术研发有限公司 | Interaction system based on voice attribute classification, and method thereof |
- 2017-01-17: CN201710031899.7A filed in China (granted as CN106898355B, status Active)
Non-Patent Citations (1)
Title |
---|
李敬阳等: "一种基于GMM-DNN的说话人确认方法", 《计算机应用与软件》 * |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107274890A (en) * | 2017-07-04 | 2017-10-20 | 清华大学 | Vocal print composes extracting method and device |
CN107274883A (en) * | 2017-07-04 | 2017-10-20 | 清华大学 | Voice signal reconstructing method and device |
CN107274883B (en) * | 2017-07-04 | 2020-06-02 | 清华大学 | Voice signal reconstruction method and device |
CN107274890B (en) * | 2017-07-04 | 2020-06-02 | 清华大学 | Voiceprint spectrum extraction method and device |
CN107610709A (en) * | 2017-08-01 | 2018-01-19 | 百度在线网络技术(北京)有限公司 | A kind of method and system for training Application on Voiceprint Recognition model |
CN108305615B (en) * | 2017-10-23 | 2020-06-16 | 腾讯科技(深圳)有限公司 | Object identification method and device, storage medium and terminal thereof |
CN109887511A (en) * | 2019-04-24 | 2019-06-14 | 武汉水象电子科技有限公司 | A kind of voice wake-up optimization method based on cascade DNN |
CN111883175A (en) * | 2020-06-09 | 2020-11-03 | 河北悦舒诚信息科技有限公司 | Voiceprint library-based oil station service quality improving method |
CN111883175B (en) * | 2020-06-09 | 2022-06-07 | 河北悦舒诚信息科技有限公司 | Voiceprint library-based oil station service quality improving method |
CN111724766A (en) * | 2020-06-29 | 2020-09-29 | 合肥讯飞数码科技有限公司 | Language identification method, related equipment and readable storage medium |
CN111724766B (en) * | 2020-06-29 | 2024-01-05 | 合肥讯飞数码科技有限公司 | Language identification method, related equipment and readable storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN106898355B (en) | 2020-04-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106898355A (en) | A kind of method for distinguishing speek person based on two modelings | |
CN112509564B (en) | End-to-end voice recognition method based on connection time sequence classification and self-attention mechanism | |
CN104732978B (en) | The relevant method for distinguishing speek person of text based on combined depth study | |
CN109036465B (en) | Speech emotion recognition method | |
CN107146601A (en) | A kind of rear end i vector Enhancement Methods for Speaker Recognition System | |
CN102509547B (en) | Method and system for voiceprint recognition based on vector quantization based | |
CN109119072A (en) | Civil aviaton's land sky call acoustic model construction method based on DNN-HMM | |
CN109389992A (en) | A kind of speech-emotion recognition method based on amplitude and phase information | |
CN108597496A (en) | Voice generation method and device based on generation type countermeasure network | |
CN108806667A (en) | The method for synchronously recognizing of voice and mood based on neural network | |
CN108564940A (en) | Audio recognition method, server and computer readable storage medium | |
CN107886957A (en) | Voice wake-up method and device combined with voiceprint recognition | |
CN108648759A (en) | A kind of method for recognizing sound-groove that text is unrelated | |
CN106952643A (en) | A kind of sound pick-up outfit clustering method based on Gaussian mean super vector and spectral clustering | |
CN106448684A (en) | Deep-belief-network-characteristic-vector-based channel-robust voiceprint recognition system | |
CN110310647A (en) | A kind of speech identity feature extractor, classifier training method and relevant device | |
CN102324232A (en) | Method for recognizing sound-groove and system based on gauss hybrid models | |
CN107731233A (en) | A kind of method for recognizing sound-groove based on RNN | |
CN104157290A (en) | Speaker recognition method based on depth learning | |
CN107767861A (en) | voice awakening method, system and intelligent terminal | |
CN103971690A (en) | Voiceprint recognition method and device | |
CN107068167A (en) | Merge speaker's cold symptoms recognition methods of a variety of end-to-end neural network structures | |
CN106297773A (en) | A kind of neutral net acoustic training model method | |
CN108172218A (en) | A kind of pronunciation modeling method and device | |
CN108986798B (en) | Processing method, device and the equipment of voice data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
TA01 | Transfer of patent application right |
Effective date of registration: 20181128 Address after: 100085 Beijing Haidian District Shangdi Information Industry Base Pioneer Road 1 B Block 2 Floor 2030 Applicant after: Beijing Huacong Zhijia Technology Co., Ltd. Address before: 100084 Tsinghua Yuan, Haidian District, Beijing, No. 1 Applicant before: Tsinghua University |
|
TA01 | Transfer of patent application right | ||
GR01 | Patent grant | ||
GR01 | Patent grant |