CN106898355A - A speaker recognition method based on two-stage modeling - Google Patents

A speaker recognition method based on two-stage modeling

Info

Publication number
CN106898355A
Authority
CN
China
Prior art keywords
speech data
training
DNN model
identified
speaker
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710031899.7A
Other languages
Chinese (zh)
Other versions
CN106898355B (en)
Inventor
何亮
陈仙红
徐灿
刘艺
田垚
刘加
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Huacong Zhijia Technology Co., Ltd.
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN201710031899.7A priority Critical patent/CN106898355B/en
Publication of CN106898355A publication Critical patent/CN106898355A/en
Application granted granted Critical
Publication of CN106898355B publication Critical patent/CN106898355B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G10L17/14 Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G10L17/18 Artificial neural networks; Connectionist approaches

Abstract

The present invention proposes a speaker recognition method based on two-stage modeling, belonging to the fields of voiceprint recognition, pattern recognition, and machine learning. In the model training stage, the method obtains and pre-processes the training speech data of the speakers to be identified; trains a first DNN model on the training speech data; uses the first DNN model to recognize the training speech data and extract the easily confusable speech data; and trains a second DNN model on the confusable speech data. In the speaker recognition stage, speech data to be identified is obtained, pre-processed, and recognized with the first DNN model; if the recognition probability exceeds a set threshold, the speaker recognition result is output. Otherwise, the second DNN model performs a second recognition of the speech data to obtain the result. By building two DNN models, the invention considers both the coarse-grained and fine-grained characteristics of speakers and effectively improves the accuracy of speaker recognition.

Description

A speaker recognition method based on two-stage modeling
Technical field
The invention belongs to the technical fields of voiceprint recognition, pattern recognition, and machine learning, and in particular relates to a speaker recognition method based on two-stage modeling.
Background technology
Speaker recognition refers to identifying a speaker's identity from the speaker-related information contained in speech. With the rapid development of information technology and communication technology, speaker recognition has received increasing attention and found wide application in many areas: identity verification, catching criminals over telephone channels, confirming identity from telephone recordings in court, call tracking, and anti-theft door access. In Internet applications and the communications field, speaker recognition technology can be applied to voice dialing, telephone banking, teleshopping, database access, information services, voice e-mail, security control, remote computer login, and similar areas.
Speaker recognition first pre-processes the speech data and extracts features. The most commonly used feature is the mel cepstral feature, which is based on the auditory perception characteristics of the human ear and is now widely used in speaker recognition, language identification, continuous speech recognition, and related tasks. Mel cepstral feature extraction first applies pre-emphasis, framing, and windowing to the speech data; then applies a fast Fourier transform (FFT) to the framed, windowed data to obtain the corresponding spectrum; filters the spectrum with a bank of triangular filters spaced on the mel frequency scale; and finally applies a discrete cosine transform to obtain the mel cepstral features.
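As a rough illustration of the 60-dimensional feature layout used throughout the method (static mel cepstral coefficients plus their first- and second-order derivatives), the sketch below appends delta and delta-delta features to a frame sequence of static features using a simple two-frame central difference. The split into 20 static coefficients and the central-difference scheme are assumptions for illustration; the patent only specifies the 60-dimensional total.

```python
import numpy as np

def add_deltas(mfcc):
    """Stack static mel cepstral features with first- and second-order
    differences.

    mfcc: (num_frames, 20) array of static features.
    Returns a (num_frames, 60) array: [static, delta, delta-delta].
    """
    # Simple central difference with edge padding; real front ends often
    # use a regression window over several frames instead.
    padded = np.pad(mfcc, ((1, 1), (0, 0)), mode="edge")
    delta = (padded[2:] - padded[:-2]) / 2.0
    padded_d = np.pad(delta, ((1, 1), (0, 0)), mode="edge")
    delta2 = (padded_d[2:] - padded_d[:-2]) / 2.0
    return np.hstack([mfcc, delta, delta2])

frames = np.random.randn(100, 20)   # stand-in for real mel cepstral frames
features = add_deltas(frames)
print(features.shape)               # (100, 60)
```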
In recent years, speaker recognition models based on deep neural networks (DNNs) have received increasing attention. Compared with the traditional Gaussian mixture model (GMM), a DNN has stronger descriptive power and can better model extremely complex data distributions, and DNN-based systems have achieved significant performance gains. A DNN model comprises three kinds of layers: input, hidden, and output. The input layer corresponds to the speech features, and its number of nodes depends on the feature dimension. The output layer corresponds to the probability of each speaker, and its number of nodes depends on the total number of speakers to be recognized. The number and sizes of the hidden layers are set according to the application and engineering experience. During DNN training, unsupervised training is performed first, followed by supervised training. In unsupervised training, each pair of adjacent layers is treated as a restricted Boltzmann machine and trained layer by layer with the contrastive divergence (CD) algorithm. In supervised training, the DNN parameters obtained from unsupervised training are used as initial values and then fine-tuned with the back-propagation algorithm. To date, DNN-based speaker recognition methods have used only a single DNN model, but a single DNN has difficulty modeling both the coarse-grained and fine-grained differences between speakers at the same time. As a result, when a single DNN is used for recognition, some utterances are easily distinguished while others are easily confused.
Content of the invention
The purpose of the present invention is to overcome the shortcomings of the prior art by proposing a speaker recognition method based on two-stage modeling. By building two DNN models, the invention considers both the coarse-grained and fine-grained characteristics of speakers and can effectively improve the accuracy of speaker recognition.
A speaker recognition method based on two-stage modeling is divided into two stages, a model training stage and a speaker recognition stage. In the model training stage, the training speech data of all speakers to be identified is obtained and pre-processed; a first DNN model is trained on the training speech data; the first DNN model is used to recognize the training speech data and extract the easily confusable speech data; and a second DNN model is trained on the confusable speech data. In the speaker recognition stage, speech data to be identified is obtained and pre-processed, then recognized with the first DNN model; if the recognition probability exceeds a decision threshold, the speaker corresponding to the maximum output probability in the recognition result is taken as the speaker of the utterance; otherwise the second DNN model performs a second recognition of the utterance, and the speaker corresponding to its maximum output probability is taken as the speaker. The method comprises the following steps:
1) Model training stage, comprising the following steps:
1-1) Obtain the training speech data of all speakers to be identified; the speaker of each training utterance is known. Pre-process the training speech data and extract the mel cepstral features of every utterance, together with their first- and second-order derivatives, 60 dimensions in total.
1-2) Build and train the first DNN model, comprising the following steps:
1-2-1) Set the number of layers and nodes of the first DNN model.
The first DNN model is divided into three levels: input layer, hidden layers, and output layer. The input layer corresponds to the features of the training speech data; since the features obtained in step 1-1), the mel cepstral features and their first- and second-order derivatives, have 60 dimensions in total, the number of input nodes is set to 60. The output layer corresponds to the probability of each speaker; its number of nodes equals the number of speakers to be identified, and each node outputs the probability of one speaker. The hidden layers automatically extract features at different levels; the number of nodes in each hidden layer is the dimension of the features that layer extracts.
1-2-2) Train the first DNN model to obtain its parameters.
The first DNN model is trained on the mel cepstral features and their first- and second-order derivatives of the training speech data of all speakers to be identified. The model parameters are the connection weights between adjacent layers and the bias of each node. Unsupervised training is performed first, followed by supervised training. During unsupervised training, each pair of adjacent layers of the first DNN model is treated as a restricted Boltzmann machine, and all restricted Boltzmann machines are trained layer by layer with the contrastive divergence algorithm to obtain initial values for the parameters of the first DNN model. During supervised training, these initial values are fine-tuned with the back-propagation algorithm to obtain the final parameters of the first DNN model.
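A minimal sketch of one contrastive divergence (CD-1) update for a single Bernoulli restricted Boltzmann machine, the building block of the unsupervised stage described above. The learning rate, the layer sizes, and the use of probabilities rather than binary samples on the reconstruction pass are illustrative assumptions, not values given in the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b_vis, b_hid, lr=0.01):
    """One CD-1 update for a Bernoulli RBM.

    v0:    (batch, n_vis) visible data
    W:     (n_vis, n_hid) connection weights
    b_vis: (n_vis,) visible biases; b_hid: (n_hid,) hidden biases
    Returns updated (W, b_vis, b_hid).
    """
    # Positive phase: hidden activations driven by the data.
    h0 = sigmoid(v0 @ W + b_hid)
    h0_sample = (rng.random(h0.shape) < h0).astype(float)
    # Negative phase: reconstruct the visible layer, then the hidden layer.
    v1 = sigmoid(h0_sample @ W.T + b_vis)
    h1 = sigmoid(v1 @ W + b_hid)
    # Update from the difference of data-driven and model-driven correlations.
    batch = v0.shape[0]
    W = W + lr * (v0.T @ h0 - v1.T @ h1) / batch
    b_vis = b_vis + lr * (v0 - v1).mean(axis=0)
    b_hid = b_hid + lr * (h0 - h1).mean(axis=0)
    return W, b_vis, b_hid

v = rng.random((8, 60))                      # a batch of 60-dim features
W = 0.01 * rng.standard_normal((60, 1024))
W, b_vis, b_hid = cd1_step(v, W, np.zeros(60), np.zeros(1024))
print(W.shape)                               # (60, 1024)
```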
1-3) Extract the easily confusable speech data.
The first DNN model trained in step 1-2) is used to recognize the training speech data of all speakers to be identified, and a threshold is set. If, in the recognition result for a training utterance, the probability assigned to the true speaker of that utterance is below the threshold, the recognition result for that utterance is poorly discriminated; the utterance is extracted as confusable speech data and used to train the second DNN model. If the probability is greater than or equal to the threshold, the utterance is easily distinguished and is not treated as confusable speech data.
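The extraction rule of step 1-3) can be sketched as a simple filter: an utterance is kept as confusable when the first model's probability for its true speaker falls below the threshold. The data layout (a list of pairs of the true speaker and a probability map) is a hypothetical stand-in for the model's actual output.

```python
def extract_confusable(utterances, threshold=0.85):
    """utterances: list of (true_speaker, probs) pairs, where probs maps
    speaker id -> probability from the first DNN model.
    Returns the subset whose true-speaker probability is below threshold."""
    confusable = []
    for true_speaker, probs in utterances:
        if probs[true_speaker] < threshold:
            confusable.append((true_speaker, probs))
    return confusable

data = [
    ("spk1", {"spk1": 0.95, "spk2": 0.05}),  # well discriminated: kept out
    ("spk2", {"spk1": 0.60, "spk2": 0.40}),  # true speaker below 0.85: confusable
]
print(len(extract_confusable(data)))          # 1
```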
1-4) Build and train the second DNN model, comprising the following steps:
1-4-1) Set the number of layers and nodes of the second DNN model.
The second DNN model is divided into three levels: input layer, hidden layers, and output layer. The input layer corresponds to the features of the training speech data; since the features of the confusable speech data extracted in step 1-3), the mel cepstral features and their first- and second-order derivatives, have 60 dimensions in total, the number of input nodes is set to 60. The output layer corresponds to the probability of each speaker; its number of nodes equals the number of speakers represented in the confusable speech data, and each node outputs the probability of one speaker. The hidden layers automatically extract features at different levels; the number of nodes in each hidden layer is the dimension of the features that layer extracts.
1-4-2) Train the second DNN model to obtain its parameters.
The second DNN model is trained on the mel cepstral features and their first- and second-order derivatives of the confusable speech data obtained in step 1-3). The model parameters are the connection weights between adjacent layers and the bias of each node. Unsupervised training is performed first, followed by supervised training. During unsupervised training, each pair of adjacent layers of the second DNN model is treated as a restricted Boltzmann machine, and all restricted Boltzmann machines are trained layer by layer with the contrastive divergence algorithm to obtain initial values for the parameters of the second DNN model. During supervised training, these initial values are fine-tuned with the back-propagation algorithm to obtain the final parameters of the second DNN model.
2) Speaker recognition stage, comprising the following steps:
2-1) Obtain speech data to be identified from one of the speakers to be recognized; pre-process it and extract its mel cepstral features and their first- and second-order derivatives, 60 dimensions in total.
2-2) Input the 60-dimensional features of the speech data obtained in step 2-1) into the first DNN model obtained in step 1-2). The output layer produces the recognition result of the speech data to be identified: each output node gives the probability that the utterance belongs to the corresponding speaker in the training speech data.
2-3) Set a decision threshold and check whether any probability in the recognition result of step 2-2) exceeds it. If so, the speaker corresponding to the maximum output probability of the first DNN model is the speaker of the utterance, and recognition ends; if not, go to step 2-4).
2-4) If no probability in the recognition result of step 2-3) exceeds the decision threshold, perform a second recognition of the utterance with the second DNN model; the speaker corresponding to the maximum output probability of the second DNN model is the speaker of the utterance, and recognition ends.
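The decision logic of steps 2-2) to 2-4) can be sketched as follows; `model1` and `model2` stand in for the two trained DNN models and are assumed, for illustration, to be callables returning a probability per speaker.

```python
def recognize(features, model1, model2, threshold=0.85):
    """Two-stage decision: trust the first model only when it is confident.

    model1 / model2: callables mapping a feature vector to a dict
    speaker id -> probability (hypothetical interface).
    """
    probs1 = model1(features)
    best1 = max(probs1, key=probs1.get)
    if probs1[best1] > threshold:
        return best1                   # first model is confident enough
    probs2 = model2(features)          # otherwise fall back to the second model
    return max(probs2, key=probs2.get)

# Toy stand-ins: model 1 is unsure, model 2 resolves the confusion.
m1 = lambda x: {"spk1": 0.55, "spk2": 0.45}
m2 = lambda x: {"spk1": 0.20, "spk2": 0.80}
print(recognize(None, m1, m2))         # spk2
```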
Features and beneficial effects of the present invention:
Compared with the prior art, the first DNN model of the invention models the coarse-grained differences between speakers, while the second DNN model models the fine-grained differences. The method increases the discriminability of confusable speech data across different speakers and has good system stability; by considering both coarse-grained and fine-grained characteristics, it can improve the accuracy of speaker recognition.
Brief description of the drawings
Fig. 1 is a flow chart of the method of the invention.
Fig. 2 shows the structure of the first DNN model in an embodiment of the invention.
Fig. 3 shows the structure of the second DNN model in an embodiment of the invention.
Specific embodiment
The speaker recognition method based on two-stage modeling proposed by the present invention is further described below with reference to the accompanying drawings and specific embodiments.
The speaker recognition method based on two-stage modeling proposed by the present invention is divided into two stages, a model training stage and a speaker recognition stage. In the model training stage, the training speech data of all speakers to be identified is obtained and pre-processed; a first DNN model is trained on the training speech data; the first DNN model is used to recognize the training speech data and extract the easily confusable speech data; and a second DNN model is trained on the confusable speech data. In the speaker recognition stage, speech data to be identified is obtained and pre-processed, then recognized with the first DNN model; if the recognition probability exceeds a decision threshold, the speaker corresponding to the maximum output probability in the recognition result is taken as the speaker of the utterance; otherwise the second DNN model performs a second recognition of the utterance, and the speaker corresponding to its maximum output probability is taken as the speaker. The flow chart of the method is shown in Fig. 1; it comprises the following steps:
1) Model training stage, comprising the following steps:
1-1) Obtain the training speech data of all speakers to be identified; the speaker of each training utterance is known, and the data may be obtained by live recording or from telephone recordings. Pre-process the training speech data and extract the mel cepstral features of every utterance, together with their first- and second-order derivatives, 60 dimensions in total.
1-2) Build and train the first DNN model, comprising the following steps:
1-2-1) Set the number of layers and nodes of the first DNN model.
The first DNN model is divided into three levels: input layer, hidden layers, and output layer. The input layer corresponds to the features of the training speech data; since the features obtained in step 1-1), the mel cepstral features and their first- and second-order derivatives, have 60 dimensions in total, the number of input nodes is set to 60. The output layer corresponds to the probability of each speaker; its number of nodes equals the number of speakers to be identified, and each node outputs the probability of one speaker. The hidden layers mainly extract features at different levels automatically; their number and sizes are set as needed from experience, typically 3 to 5 hidden layers. The number of nodes in each hidden layer is the dimension of the features that layer extracts; the middle hidden layer is usually set to 300 to 500 nodes, and the other hidden layers to 1000 to 2000 nodes.
1-2-2) Train the first DNN model to obtain its parameters.
The first DNN model is trained on the mel cepstral features and their first- and second-order derivatives of the training speech data of all speakers to be identified. The model parameters are the connection weights between adjacent layers and the bias of each node. Unsupervised training is performed first, followed by supervised training. During unsupervised training, each pair of adjacent layers of the first DNN model is treated as a restricted Boltzmann machine, and all restricted Boltzmann machines are trained layer by layer with the contrastive divergence (CD) algorithm to obtain initial values for the parameters of the first DNN model. During supervised training, these initial values are fine-tuned with the back-propagation algorithm to obtain the final parameters of the first DNN model.
1-3) Extract the easily confusable speech data.
The first DNN model trained in step 1-2) is used to recognize the training speech data of all speakers to be identified, and a threshold is set; based on experience, the threshold is typically set in the range 0.7 to 0.9. If, in the recognition result for a training utterance, the probability assigned to the true speaker of that utterance is below the threshold, the recognition result for that utterance is poorly discriminated; the utterance is extracted as confusable speech data and used to train the second DNN model. If the probability is greater than or equal to the threshold, the utterance is easily distinguished and is not treated as confusable speech data.
1-4) Build and train the second DNN model, comprising the following steps:
1-4-1) Set the number of layers and nodes of the second DNN model.
The second DNN model is divided into three levels: input layer, hidden layers, and output layer. The input layer corresponds to the features of the training speech data; since the features of the confusable speech data extracted in step 1-3), the mel cepstral features and their first- and second-order derivatives, have 60 dimensions in total, the number of input nodes is set to 60. The output layer corresponds to the probability of each speaker; its number of nodes equals the number of speakers represented in the confusable speech data, and each node outputs the probability of one speaker. The number of hidden layers is typically set to 3 to 5; the middle hidden layer is usually set to 300 to 500 nodes, and the other hidden layers to 1000 to 2000 nodes.
1-4-2) Train the second DNN model to obtain its parameters.
The second DNN model is trained on the mel cepstral features and their first- and second-order derivatives of the confusable speech data obtained in step 1-3). The model parameters are the connection weights between adjacent layers and the bias of each node. Unsupervised training is performed first, followed by supervised training. During unsupervised training, each pair of adjacent layers of the second DNN model is treated as a restricted Boltzmann machine, and all restricted Boltzmann machines are trained layer by layer with the contrastive divergence (CD) algorithm to obtain initial values for the parameters of the second DNN model. During supervised training, these initial values are fine-tuned with the back-propagation algorithm to obtain the final parameters of the second DNN model.
2) Speaker recognition stage, comprising the following steps:
2-1) Obtain speech data to be identified from one of the speakers to be recognized; pre-process it and extract its mel cepstral features and their first- and second-order derivatives, 60 dimensions in total.
2-2) Input the 60-dimensional features of the speech data obtained in step 2-1) into the first DNN model obtained in step 1-2). The output layer produces the recognition result of the speech data to be identified: each output node gives the probability that the utterance belongs to the corresponding speaker in the training speech data.
2-3) Set a decision threshold and check whether any probability in the recognition result of step 2-2) exceeds it. If so, the speaker corresponding to the maximum output probability of the first DNN model is the speaker of the utterance, and recognition ends; if not, go to step 2-4).
2-4) If no probability in the recognition result of step 2-3) exceeds the decision threshold, perform a second recognition of the utterance with the second DNN model; the speaker corresponding to the maximum output probability of the second DNN model is the speaker of the utterance, and recognition ends.
The method of the invention is further described below with reference to a specific embodiment. It should be noted that the embodiment described below is only one embodiment of the present invention, not all embodiments. All other embodiments obtained by those of ordinary skill in the art, based on the embodiments of the invention and without creative effort, fall within the scope of protection of the invention.
In this embodiment, 800 speakers need to be recognized; the steps are as follows:
1) Model training stage, comprising the following steps:
1-1) Obtain the training speech data of the 800 speakers to be recognized; the speaker of each training utterance is known, and the data is obtained from telephone recordings. Pre-process the training speech data (i.e., the telephone recordings), extract the mel cepstral features of every utterance, and compute their first- and second-order derivatives, 60 dimensions in total.
1-2) Build and train the first DNN model, comprising the following steps:
1-2-1) Set the number of layers and nodes of the first DNN model.
The structure of the first DNN model is shown in Fig. 2: it has 7 layers in total; layer 1 is the input layer, layers 2 to 6 are hidden layers (5 hidden layers in total), and layer 7 is the output layer. The crossing lines represent the connections between nodes: adjacent layers of the first DNN model are fully connected, and there are no connections between nodes within a layer. The input layer corresponds to the features of the training speech data; in this embodiment these are the mel cepstral features obtained in step 1-1) and their first- and second-order derivatives, 60 dimensions in total, so the number of input nodes is set to N1 = 60. The output layer corresponds to the probability of each speaker; its number of nodes equals the number of speakers to be recognized, N7 = 800, and each node outputs the probability of one speaker. The hidden layers mainly extract features at different levels automatically; in this embodiment 5 hidden layers are used (typically 3 to 5), and the features extracted by the hidden layers transition gradually from low-level abstractions at layer 2 to high-level abstractions at layer 6. The number of nodes in each hidden layer is the dimension of the features that layer extracts; in this embodiment the middle hidden layer (layer 4) is set to N4 = 400 nodes (typically 300 to 500), and the remaining hidden layers N2, N3, N5, and N6 are set to 1024 nodes each (typically 1000 to 2000).
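Under the layer sizes of this embodiment (60-1024-1024-400-1024-1024-800), a forward pass through the first DNN model can be sketched with sigmoid hidden layers and a softmax output. Using sigmoid activations matches the RBM pretraining, but is an assumption where the text does not name the activation functions, and the random weights are placeholders for trained parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
sizes = [60, 1024, 1024, 400, 1024, 1024, 800]   # N1..N7 from the embodiment

# W[i] connects layer i+1 to layer i+2; B[i] biases layer i+2.
W = [0.01 * rng.standard_normal((m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
B = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    """Map a 60-dim feature vector to an 800-way speaker distribution."""
    h = x
    for Wi, Bi in zip(W[:-1], B[:-1]):
        h = 1.0 / (1.0 + np.exp(-(h @ Wi + Bi)))  # sigmoid hidden layers
    logits = h @ W[-1] + B[-1]
    e = np.exp(logits - logits.max())              # numerically stable softmax
    return e / e.sum()

p = forward(rng.random(60))
print(p.shape)                                     # (800,)
```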
1-2-2) Train the first DNN model to obtain its parameters.
The first DNN model is trained on the mel cepstral features and their first- and second-order derivatives of the training speech data of the 800 speakers to be identified. The model parameters include the connection weights between adjacent layers, where W_{i,i+1} is an N_i × N_{i+1} matrix whose element w_{mn} is the connection weight between the m-th node of layer i and the n-th node of layer i+1 of the first DNN model, and the bias of each node, where b_{jk} denotes the bias of the k-th node in layer j of the first DNN model.
Unsupervised training is performed first, followed by supervised training. During unsupervised training, each pair of adjacent layers of the first DNN model, that is, layers 1 and 2, layers 2 and 3, ..., layers 6 and 7 in Fig. 2, is treated as a restricted Boltzmann machine, giving 6 restricted Boltzmann machines in total. They are trained one by one with the contrastive divergence (CD) algorithm: first the restricted Boltzmann machine formed by layers 1 and 2 is trained, yielding B1, B2, and W12 of the first DNN model's parameters; then the restricted Boltzmann machine formed by layers 2 and 3 is trained, yielding B3 and W23; all restricted Boltzmann machines are trained in turn, giving the initial values of the first DNN model's parameters. During supervised training, these initial values are fine-tuned with the back-propagation algorithm to obtain the final parameters of the first DNN model.
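The layer-by-layer order described above (train the layer-1/2 RBM, feed its hidden activations forward, train the layer-2/3 RBM, and so on) can be sketched as a loop. `train_rbm` is a hypothetical routine standing in for the CD training of one RBM; here it is stubbed to return random parameters of the right shape so the stacking logic itself can be shown.

```python
import numpy as np

rng = np.random.default_rng(2)

def train_rbm(data, n_hidden):
    """Stub for CD training of one RBM: returns (W, b_hid) and the
    hidden-layer activations used as input to the next RBM."""
    n_vis = data.shape[1]
    W = 0.01 * rng.standard_normal((n_vis, n_hidden))
    b_hid = np.zeros(n_hidden)
    hidden = 1.0 / (1.0 + np.exp(-(data @ W + b_hid)))
    return W, b_hid, hidden

def pretrain(features, sizes):
    """Greedy layer-wise pretraining over consecutive layer pairs."""
    weights, biases, layer_input = [], [], features
    for n_hidden in sizes[1:]:
        W, b_hid, layer_input = train_rbm(layer_input, n_hidden)
        weights.append(W)
        biases.append(b_hid)
    return weights, biases

sizes = [60, 1024, 1024, 400, 1024, 1024, 800]   # embodiment layer sizes
Ws, Bs = pretrain(rng.random((16, 60)), sizes)
print(len(Ws))                                    # 6 RBMs, as in the text
```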
1-3) Extract the easily confusable speech data.
All training speech data of the 800 speakers is recognized with the first DNN model trained in step 1-2). The threshold is set to 0.85 (typically 0.7 to 0.9). If, in the recognition result for a training utterance, the probability assigned to the true speaker of that utterance is below the threshold, the recognition result for that utterance is poorly discriminated, and the utterance is used as confusable speech data for training the second DNN model. If the probability is greater than or equal to the threshold, the utterance is easily distinguished and is not treated as confusable speech data.
1-4) Build and train the second DNN model, comprising the following steps:
1-4-1) Set the number of layers and nodes of the second DNN model.
The structure of the second DNN model is shown in Fig. 3: it has 5 layers in total; layer 1 is the input layer, layers 2 to 4 are hidden layers (3 hidden layers in total), and layer 5 is the output layer. The crossing lines represent the connections between nodes: adjacent layers of the second DNN model are fully connected, and there are no connections between nodes within a layer. The input layer corresponds to the speech features; in this embodiment these are the mel cepstral features of the confusable speech data extracted in step 1-3) and their first- and second-order derivatives, 60 dimensions in total, so the number of input nodes of the second DNN model is set to N1 = 60. The output layer corresponds to the probability of each speaker; its number of nodes equals the number of speakers represented in the confusable speech data, which in this embodiment is also 800. Because the speaker of every training utterance is known and the confusable speech data is extracted from the training data, the speaker of each frame of confusable data is known, so the number of speakers represented in the confusable speech data can be counted. The number of hidden layers is set to 3 (typically 3 to 5); the middle hidden layer (layer 3) is set to N3 = 300 nodes (typically 300 to 500), and the remaining hidden layers N2 and N4 are set to 1024 nodes (typically 1000 to 2000).
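The output-layer size of the second model follows from the confusable data itself: since each extracted frame keeps its known speaker label, counting the distinct labels gives the number of output nodes. A minimal sketch with hypothetical labels:

```python
def second_model_output_size(confusable_labels):
    """Number of distinct speakers represented in the confusable data,
    used as the output-layer size of the second DNN model."""
    return len(set(confusable_labels))

# Hypothetical frame-level speaker labels of the extracted confusable data.
labels = ["spk3", "spk7", "spk3", "spk42", "spk7"]
print(second_model_output_size(labels))   # 3
```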
1-4-2) Train the second DNN model to obtain its model parameters;
According to the mel-cepstral features and their first and second derivatives of the easily-confused speech data obtained in step 1-3), the second DNN model is trained. The parameters of the second DNN model include the connection weights of adjacent layers Wi,i+1 (i = 1, ..., 4) and the bias of each node Bj (j = 1, ..., 5). Unsupervised training is performed first, followed by supervised training. During unsupervised training, each pair of adjacent layers of the second DNN model (layers 1 and 2, layers 2 and 3, ..., layers 4 and 5 in Fig. 3) is treated as a restricted Boltzmann machine, four restricted Boltzmann machines in total, and they are trained layer by layer with the contrastive-divergence (CD) algorithm: first the restricted Boltzmann machine formed by layers 1 and 2 is trained, yielding B1, B2 and W12 of the second DNN model's parameters; then the restricted Boltzmann machine formed by layers 2 and 3 is trained, yielding B3 and W23; all restricted Boltzmann machines are trained in turn, yielding the initial values of the second DNN model's parameters. During supervised training, starting from the initial values obtained by unsupervised training, the parameters of the second DNN model are fine-tuned with the back-propagation algorithm, finally yielding the parameters of the second DNN model.
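The layer-wise pre-training described above can be sketched as follows (a minimal CD-1 update for one binary restricted Boltzmann machine; in the method each adjacent layer pair would be trained this way in turn before back-propagation fine-tuning; all names and the toy dimensions are assumptions, not the patent's code):

```python
import numpy as np

def cd1_step(v0, W, b_vis, b_hid, lr=0.1, rng=None):
    """One contrastive-divergence (CD-1) update for a binary RBM.

    v0: batch of visible vectors, shape (batch, n_vis)
    W:  weight matrix (n_vis, n_hid); b_vis, b_hid: bias vectors
    Updates W and the biases in place and returns them.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

    # positive phase: hidden probabilities given the data, then a sample
    h0 = sigmoid(v0 @ W + b_hid)
    h0_sample = (rng.random(h0.shape) < h0).astype(float)

    # negative phase: one Gibbs step down to the visible layer and back up
    v1 = sigmoid(h0_sample @ W.T + b_vis)
    h1 = sigmoid(v1 @ W + b_hid)

    n = v0.shape[0]
    W += lr * (v0.T @ h0 - v1.T @ h1) / n
    b_vis += lr * (v0 - v1).mean(axis=0)
    b_hid += lr * (h0 - h1).mean(axis=0)
    return W, b_vis, b_hid

# toy run: a 6-visible / 4-hidden RBM trained for a few CD-1 steps
rng = np.random.default_rng(1)
v = (rng.random((8, 6)) < 0.5).astype(float)
W = rng.standard_normal((6, 4)) * 0.1
b_v, b_h = np.zeros(6), np.zeros(4)
for _ in range(5):
    W, b_v, b_h = cd1_step(v, W, b_v, b_h, rng=rng)
print(W.shape)
```

After stacking and greedily training each RBM, its weights and hidden biases become the initial W and B of the corresponding DNN layer, which back-propagation then fine-tunes.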
The above steps constitute the model-training stage; after the two DNN models are obtained, speaker recognition is carried out.
2) The speaker-recognition stage, which specifically includes the following steps:
2-1) Obtain the speech data to be identified of one of the 800 speakers to be recognized. The speech data to be identified are also obtained by telephone recording, but the speaker corresponding to these speech data is unknown and must be identified by the method proposed by the present invention. The speech data to be identified are different utterances from the training speech data. The speech data to be identified are pre-processed, and their mel-cepstral features and first and second derivatives, 60 dimensions in total, are extracted.
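Appending the first and second derivatives to a 20-dimensional mel-cepstral matrix to obtain the 60-dimensional feature can be sketched as follows (an illustration using a simple numerical gradient over frames; the patent does not specify the exact delta formula, so this form is an assumption):

```python
import numpy as np

def add_deltas(cep):
    """Append first- and second-order time derivatives to cepstral features.

    cep: (num_frames, 20) mel-cepstral matrix; returns (num_frames, 60).
    A frame-wise numerical gradient stands in for the usual
    regression-based delta computation.
    """
    d1 = np.gradient(cep, axis=0)   # first derivative over time
    d2 = np.gradient(d1, axis=0)    # second derivative over time
    return np.hstack([cep, d1, d2])

frames = np.random.default_rng(0).standard_normal((100, 20))
print(add_deltas(frames).shape)  # -> (100, 60)
```

The same 60-dimensional feature layout is used for both training utterances and utterances to be identified, matching the input layer of both DNN models.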
2-2) The 60-dimensional features of the speech data to be identified obtained in step 2-1) are input into the first DNN model for recognition. The output layer outputs the recognition result of the speech data to be identified, i.e. the probabilities that these speech data correspond to each of the 800 speakers; each output-layer node outputs the probability of one speaker, 800 outputs in total.
2-3) Set the decision threshold and judge whether any of the 800 probabilities in the recognition result exceeds the threshold 0.85: if so, the speaker corresponding to that probability is judged to be the speaker of the speech data to be identified, and recognition ends; otherwise go to step 2-4);
2-4) If no probability in the recognition result of step 2-3) exceeds the decision threshold 0.85, the speech data to be identified are recognized a second time using the second DNN model. According to the recognition result of the second DNN model, the speaker corresponding to the maximum output probability is judged to be the speaker of the speech data to be identified, and recognition ends.
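The two-stage decision of steps 2-3) and 2-4) can be sketched as follows (an illustrative sketch; `identify`, its arguments, and the toy posteriors are hypothetical):

```python
import numpy as np

def identify(p_first, p_second_fn, threshold=0.85):
    """Two-stage decision: trust the first DNN when it is confident,
    otherwise fall back to the second (easily-confused-speaker) DNN.

    p_first:     posterior vector from the first DNN
    p_second_fn: callable returning the second DNN's posterior vector,
                 only invoked when the first model is unsure
    """
    if p_first.max() > threshold:
        return int(p_first.argmax())   # confident first-stage decision
    return int(p_second_fn().argmax())  # second-stage decision

# a confident first stage short-circuits the second model entirely
p1 = np.array([0.9, 0.05, 0.05])
print(identify(p1, lambda: np.array([0.1, 0.8, 0.1])))  # -> 0

# an unsure first stage defers to the second model
p1 = np.array([0.5, 0.3, 0.2])
print(identify(p1, lambda: np.array([0.1, 0.8, 0.1])))  # -> 1
```

Passing the second model as a callable mirrors the cascade: the more specialized model is evaluated only for the utterances the first model cannot separate.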
Those of ordinary skill in the art will appreciate that the above speaker-recognition method of the present invention can be implemented by a program, and the program can be stored in a computer-readable storage medium.
What has been described above is only a specific embodiment of the present invention, which obviously cannot be used to limit the scope of the rights of the present invention; therefore, equivalent variations made according to the claims of the present invention still fall within the scope covered by the present invention.

Claims (2)

1. A speaker-recognition method based on two-stage modeling, characterized in that the method is divided into two stages, a model-training stage and a speaker-recognition stage. In the model-training stage, the training speech data of all speakers to be recognized are obtained and pre-processed; a first DNN model is obtained by training on the training speech data; using the first DNN model, the training speech data are recognized and the easily-confused speech data are extracted; a second DNN model is obtained by training on the easily-confused speech data. In the speaker-recognition stage, the speech data to be identified are obtained and pre-processed; the speech data to be identified are recognized using the first DNN model; if the recognition probability exceeds the decision threshold, the speaker corresponding to the maximum output probability in the recognition result is taken as the speaker of the speech data to be identified; otherwise the speech data to be identified are recognized a second time by the second DNN model, and the speaker corresponding to the maximum output probability in that recognition result is taken as the speaker of the speech data to be identified.
2. the method for claim 1, it is characterised in that the method is comprised the following steps:
1) the model training stage;Specifically include following steps:
1-1) obtain the training speech data of all speakers to be identified, and speaking artificially corresponding to every training speech data It is known;Training speech data to obtaining is pre-processed, and extracts the corresponding mel cepstrum feature of all training speech datas, and Single order, the second dervative of mel cepstrum feature are calculated, totally 60 dimension;
1-2) Establish and train the first DNN model, which specifically includes the following steps:
1-2-1) Set the number of layers and nodes of the first DNN model;
The first DNN model is divided into three levels: an input layer, hidden layers and an output layer. The input layer corresponds to the features of the training speech data; the input features are the mel-cepstral features of the training speech data obtained in step 1-1) and their first and second derivatives, 60 dimensions in total, so the number of input-layer nodes is set to 60. The output layer corresponds to the probability of each speaker; the number of output-layer nodes is the number of speakers to be recognized, and the output of each node corresponds to the probability of one speaker. The hidden layers automatically extract features at different levels, and the number of nodes in each hidden layer is the dimensionality of the features extracted by that layer;
1-2-2) Train the first DNN model to obtain the parameters of the first DNN model;
According to the mel-cepstral features and their first and second derivatives of the training speech data of all speakers to be recognized, the first DNN model is trained; the model parameters include the connection weights of adjacent layers and the bias of each node. Unsupervised training is performed first, followed by supervised training: during unsupervised training, each pair of adjacent layers in the first DNN model is treated as a restricted Boltzmann machine, all restricted Boltzmann machines are trained in turn with the contrastive-divergence algorithm, and the initial values of the parameters of the first DNN model are obtained; during supervised training, starting from the initial values obtained by unsupervised training, the parameters of the first DNN model are fine-tuned with the back-propagation algorithm, finally yielding the parameters of the first DNN model;
1-3) Extract the easily-confused speech data;
According to the first DNN model obtained by training in step 1-2), the training speech data of all speakers to be recognized are recognized and a threshold is set. If, in the recognition result of a training utterance, the probability corresponding to the true speaker of that utterance is lower than the set threshold, the recognition result of that utterance is poorly discriminative, and the utterance is extracted as easily-confused speech data for training the second DNN model; if the probability is greater than or equal to the set threshold, the utterance is easily distinguished and is not taken as easily-confused speech data;
1-4) Establish and train the second DNN model, which specifically includes the following steps:
1-4-1) Set the number of layers and nodes of the second DNN model;
The second DNN model is divided into three levels: an input layer, hidden layers and an output layer. The input layer corresponds to the features of the speech data; the input features are the mel-cepstral features of the easily-confused speech data extracted in step 1-3) and their first and second derivatives, 60 dimensions in total, so the number of input-layer nodes is set to 60. The output layer corresponds to the probability of each speaker; the number of output-layer nodes is the number of speakers contained in the easily-confused speech data, and the output of each node corresponds to the probability of one speaker. The hidden layers automatically extract features at different levels, and the number of nodes in each hidden layer is the dimensionality of the features extracted by that layer;
1-4-2) Train the second DNN model to obtain the parameters of the second DNN model;
According to the mel-cepstral features and their first and second derivatives of the easily-confused speech data obtained in step 1-3), the second DNN model is trained; the model parameters include the connection weights of adjacent layers and the bias of each node. Unsupervised training is performed first, followed by supervised training: during unsupervised training, each pair of adjacent layers in the second DNN model is treated as a restricted Boltzmann machine, all restricted Boltzmann machines are trained in turn with the contrastive-divergence algorithm, and the initial values of the parameters of the second DNN model are obtained; during supervised training, starting from the initial values obtained by unsupervised training, the parameters of the second DNN model are fine-tuned with the back-propagation algorithm, finally yielding the parameters of the second DNN model;
2) The speaker-recognition stage, which specifically includes the following steps:
2-1) Obtain the speech data to be identified of one of the speakers to be recognized; pre-process the speech data to be identified, and extract their mel-cepstral features and first and second derivatives, 60 dimensions in total;
2-2) Input the 60-dimensional features of the speech data to be identified obtained in step 2-1) into the first DNN model obtained in step 1-2) for recognition; the output layer outputs the recognition result of the speech data to be identified, i.e. the probabilities that these speech data correspond to each speaker in the training speech data, each output-layer node outputting the probability of one speaker;
2-3) Set the decision threshold and judge whether any probability in the recognition result of step 2-2) exceeds the decision threshold: if so, the speaker corresponding to the maximum output probability in the recognition result of the first DNN model is the speaker of the speech data to be identified, and recognition ends; if not, go to step 2-4);
2-4) If no probability in the recognition result of step 2-3) exceeds the decision threshold, the speech data to be identified are recognized a second time using the second DNN model; the speaker corresponding to the maximum output probability in the recognition result of the second DNN model is the speaker of the speech data to be identified, and recognition ends.
CN201710031899.7A 2017-01-17 2017-01-17 Speaker identification method based on secondary modeling Active CN106898355B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710031899.7A CN106898355B (en) 2017-01-17 2017-01-17 Speaker identification method based on secondary modeling


Publications (2)

Publication Number Publication Date
CN106898355A true CN106898355A (en) 2017-06-27
CN106898355B CN106898355B (en) 2020-04-14

Family

ID=59198262

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710031899.7A Active CN106898355B (en) 2017-01-17 2017-01-17 Speaker identification method based on secondary modeling

Country Status (1)

Country Link
CN (1) CN106898355B (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1264887A (en) * 2000-03-31 2000-08-30 清华大学 Speaker-independent speech recognition and prompting method based on a dedicated speech-recognition chip
CN1588536A (en) * 2004-09-29 2005-03-02 上海交通大学 Method for adjusting state structure in speech recognition
CN101231848A (en) * 2007-11-06 2008-07-30 安徽科大讯飞信息科技股份有限公司 Pronunciation error detection method based on support vector machines
US20140074471A1 (en) * 2012-09-10 2014-03-13 Cisco Technology, Inc. System and method for improving speaker segmentation and recognition accuracy in a media processing environment
CN105761720A (en) * 2016-04-19 2016-07-13 北京地平线机器人技术研发有限公司 Interaction system and method based on voice attribute classification


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI JINGYANG et al.: "A speaker verification method based on GMM-DNN", Computer Applications and Software *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107274890A (en) * 2017-07-04 2017-10-20 清华大学 Voiceprint spectrum extraction method and device
CN107274883A (en) * 2017-07-04 2017-10-20 清华大学 Voice signal reconstruction method and device
CN107274883B (en) * 2017-07-04 2020-06-02 清华大学 Voice signal reconstruction method and device
CN107274890B (en) * 2017-07-04 2020-06-02 清华大学 Voiceprint spectrum extraction method and device
CN107610709A (en) * 2017-08-01 2018-01-19 百度在线网络技术(北京)有限公司 A method and system for training a voiceprint recognition model
CN108305615B (en) * 2017-10-23 2020-06-16 腾讯科技(深圳)有限公司 Object identification method and device, storage medium and terminal thereof
CN109887511A (en) * 2019-04-24 2019-06-14 武汉水象电子科技有限公司 A voice wake-up optimization method based on cascaded DNNs
CN111883175A (en) * 2020-06-09 2020-11-03 河北悦舒诚信息科技有限公司 Voiceprint-library-based method for improving gas station service quality
CN111883175B (en) * 2020-06-09 2022-06-07 河北悦舒诚信息科技有限公司 Voiceprint-library-based method for improving gas station service quality
CN111724766A (en) * 2020-06-29 2020-09-29 合肥讯飞数码科技有限公司 Language identification method, related device and readable storage medium
CN111724766B (en) * 2020-06-29 2024-01-05 合肥讯飞数码科技有限公司 Language identification method, related device and readable storage medium

Also Published As

Publication number Publication date
CN106898355B (en) 2020-04-14

Similar Documents

Publication Publication Date Title
CN106898355A A speaker-recognition method based on two-stage modeling
CN112509564B End-to-end speech recognition method based on connectionist temporal classification and a self-attention mechanism
CN104732978B Text-dependent speaker recognition method based on combined deep learning
CN109036465B Speech emotion recognition method
CN107146601A A back-end i-vector enhancement method for speaker recognition systems
CN102509547B Method and system for voiceprint recognition based on vector quantization
CN109119072A Civil-aviation air-ground communication acoustic model construction method based on DNN-HMM
CN109389992A A speech emotion recognition method based on amplitude and phase information
CN108597496A Speech generation method and device based on a generative adversarial network
CN108806667A Synchronous recognition method of speech and emotion based on a neural network
CN108564940A Speech recognition method, server and computer-readable storage medium
CN107886957A Voice wake-up method and device combined with voiceprint recognition
CN108648759A A text-independent voiceprint recognition method
CN106952643A A recording-device clustering method based on Gaussian mean supervectors and spectral clustering
CN106448684A Channel-robust voiceprint recognition system based on deep-belief-network feature vectors
CN110310647A A speech identity feature extractor, classifier training method and related device
CN102324232A Voiceprint recognition method and system based on Gaussian mixture models
CN107731233A A voiceprint recognition method based on RNN
CN104157290A Speaker recognition method based on deep learning
CN107767861A Voice wake-up method, system and intelligent terminal
CN103971690A Voiceprint recognition method and device
CN107068167A Speaker cold-symptom recognition method fusing multiple end-to-end neural network structures
CN106297773A A neural network acoustic model training method
CN108172218A A pronunciation modeling method and device
CN108986798B Method, device and equipment for processing speech data

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20181128

Address after: 100085 Beijing Haidian District Shangdi Information Industry Base Pioneer Road 1 B Block 2 Floor 2030

Applicant after: Beijing Huacong Zhijia Technology Co., Ltd.

Address before: 100084 Tsinghua Yuan, Haidian District, Beijing, No. 1

Applicant before: Tsinghua University

TA01 Transfer of patent application right
GR01 Patent grant
GR01 Patent grant