CN106898355B - Speaker identification method based on secondary modeling - Google Patents


Info

Publication number
CN106898355B
CN106898355B (application CN201710031899.7A)
Authority
CN
China
Prior art keywords
voice data
training
dnn model
speaker
recognized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710031899.7A
Other languages
Chinese (zh)
Other versions
CN106898355A (en)
Inventor
何亮
陈仙红
徐灿
刘艺
田垚
刘加
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Huacong Zhijia Technology Co., Ltd.
Original Assignee
Beijing Huacong Zhijia Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Huacong Zhijia Technology Co Ltd filed Critical Beijing Huacong Zhijia Technology Co Ltd
Priority to CN201710031899.7A priority Critical patent/CN106898355B/en
Publication of CN106898355A publication Critical patent/CN106898355A/en
Application granted granted Critical
Publication of CN106898355B publication Critical patent/CN106898355B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04: Training, enrolment or model building
    • G10L17/06: Decision making techniques; Pattern matching strategies
    • G10L17/14: Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G10L17/18: Artificial neural networks; Connectionist approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a speaker recognition method based on secondary modeling, belonging to the fields of voiceprint recognition, pattern recognition and machine learning. In the model training stage, training voice data of the speakers to be recognized is acquired and preprocessed; a first DNN model is trained on the training voice data; the training voice data is recognized with the first DNN model, and the confusing voice data is extracted; a second DNN model is trained on the confusing voice data. In the speaker recognition stage, voice data to be recognized is acquired and preprocessed; the voice data to be recognized is recognized with the first DNN model, and a speaker recognition result is obtained if the recognition probability exceeds a set threshold; otherwise, the voice data to be recognized is recognized a second time with the second DNN model to obtain the speaker recognition result. By establishing two DNN models, the invention accounts for both the macroscopic and the microscopic characteristics of speakers, effectively improving the accuracy of speaker identification.

Description

Speaker identification method based on secondary modeling
Technical Field
The invention belongs to the technical field of voiceprint recognition, pattern recognition and machine learning, and particularly relates to a speaker recognition method based on secondary modeling.
Background
Speaker recognition refers to identifying the identity of a speaker from speaker-related information contained in speech. With the rapid development of information and communication technology, speaker recognition technology is gaining increasing importance and is widely applied in many fields, for example identity authentication, apprehending criminals over telephone channels, confirming identity in court from telephone recordings, telephone voice tracking, and controlling anti-theft doors. Speaker recognition technology can be applied to voice dialing, telephone banking, telephone shopping, database access, information services, voice e-mail, security control, remote computer login, and other fields.
Speaker recognition first preprocesses the speech data to extract features. The most common feature is the Mel cepstrum feature, based on the theory of human auditory perception, and it is widely applied in speaker recognition, language recognition, continuous speech recognition and other tasks. Mel cepstrum feature extraction consists of pre-emphasizing, framing and windowing the voice data, applying a fast Fourier transform to the framed and windowed data to obtain the corresponding spectrum, filtering with a bank of triangular filters spaced on the Mel frequency scale, and finally applying a discrete cosine transform to obtain the Mel cepstrum features.
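The extraction pipeline described above can be sketched in plain NumPy. This is a simplified illustration: the sample rate, frame length, hop, filter count and the choice of 20 cepstral coefficients are assumptions, chosen so that appending first- and second-order derivatives yields the 60-dimensional feature used later in the text (real systems typically compute deltas with a regression window rather than `np.gradient`).

```python
import numpy as np

def mel_cepstrum(signal, sr=8000, n_fft=512, frame_len=200, hop=80,
                 n_filters=26, n_ceps=20):
    """Pre-emphasis -> framing + windowing -> FFT -> Mel triangular
    filter bank -> log -> DCT, following the steps in the text."""
    # 1) pre-emphasis
    emph = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2) framing and Hamming windowing
    n_frames = 1 + (len(emph) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emph[idx] * np.hamming(frame_len)
    # 3) power spectrum via fast Fourier transform
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 4) triangular filters spaced on the Mel frequency scale
    def hz2mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel2hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz2mel(0.0), hz2mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_energy = np.log(power @ fbank.T + 1e-10)
    # 5) discrete cosine transform (DCT-II) -> Mel cepstral coefficients
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps),
                                    (2 * n + 1) / (2.0 * n_filters)))
    return log_energy @ basis.T

# 1 second of a toy 300 Hz tone as stand-in speech
sig = np.sin(2 * np.pi * 300 * np.arange(8000) / 8000)
ceps = mel_cepstrum(sig)                       # (frames, 20) static features
feat = np.concatenate([ceps,                   # crude 1st/2nd derivatives
                       np.gradient(ceps, axis=0),
                       np.gradient(np.gradient(ceps, axis=0), axis=0)],
                      axis=1)                  # (frames, 60) total
```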
In recent years, speaker recognition models based on Deep Neural Networks (DNN) have received increasing attention. Compared with the traditional Gaussian Mixture Model (GMM), the DNN model has stronger descriptive power and can better model very complex data distributions, and DNN-based systems have achieved remarkable performance improvements. A DNN model contains three kinds of layers: an input layer, hidden layers, and an output layer. The input layer corresponds to the features of the voice data, and its number of nodes is determined by the dimension of those features; the output layer corresponds to the probability of each speaker, and its number of nodes is determined by the total number of speakers to be identified; the number of hidden layers and their node counts are chosen according to application requirements and engineering experience. DNN model training first performs unsupervised training and then supervised training. During unsupervised training, each pair of adjacent layers is treated as a restricted Boltzmann machine and trained layer by layer with the Contrastive Divergence (CD) algorithm. During supervised training, the DNN model parameters obtained from unsupervised training are used as initial values and fine-tuned with the back-propagation algorithm. To date, DNN-based speaker recognition methods have used only one DNN model, but it is difficult for a single DNN model to simultaneously model both the macroscopic and the microscopic differences between speakers. As a result, when one DNN model is used to identify speakers, some voices are easily distinguished while others are easily confused.
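To make the layer roles above concrete, here is a minimal forward pass for such a network. The sigmoid hidden activations and softmax output are common choices assumed here, not prescribed by the patent, and the weights are random rather than trained; the layer sizes mirror the embodiment described later (60-dimensional input, 800 speakers).

```python
import numpy as np

def dnn_forward(x, weights, biases):
    """Fully-connected DNN: sigmoid hidden layers, softmax output layer
    whose nodes each give the probability of one speaker."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = 1.0 / (1.0 + np.exp(-(h @ W + b)))   # sigmoid hidden layer
    z = h @ weights[-1] + biases[-1]
    e = np.exp(z - z.max())                      # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
sizes = [60, 1024, 400, 1024, 800]   # input -> hidden layers -> speakers
Ws = [rng.normal(0, 0.01, (a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(b) for b in sizes[1:]]
probs = dnn_forward(rng.normal(size=60), Ws, bs)   # one probability per speaker
```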
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a speaker identification method based on secondary modeling. According to the invention, two DNN models are established, and macroscopic characteristics and microscopic characteristics of the speaker are considered at the same time, so that the accuracy of speaker identification can be effectively improved.
A speaker recognition method based on secondary modeling is divided into a model training stage and a speaker recognition stage. In the model training stage, training voice data of all speakers to be recognized is acquired and preprocessed; a first DNN model is trained on the training voice data; the training voice data is recognized with the first DNN model, and the confusing voice data is extracted; a second DNN model is trained on the confusing voice data. In the speaker recognition stage, voice data to be recognized is acquired and preprocessed, then recognized with the first DNN model; if the recognition probability exceeds a decision threshold, the speaker corresponding to the maximum output probability in the recognition result is taken as the speaker of the voice data to be recognized; otherwise, the voice data to be recognized undergoes a second recognition with the second DNN model, and the speaker corresponding to the maximum output probability in that result is taken as the speaker. The method comprises the following steps:
1) a model training stage; the method specifically comprises the following steps:
1-1) acquiring training voice data of all speakers to be recognized, wherein the speaker corresponding to each piece of training voice data is known; preprocessing the acquired training voice data, extracting Mel cepstrum characteristics corresponding to all the training voice data, and calculating first and second derivatives of the Mel cepstrum characteristics, which are 60-dimensional in total;
1-2) establishing a first DNN model and training, and specifically comprising the following steps:
1-2-1) setting the layer number and the node number of a first DNN model;
the first DNN model is divided into an input layer, hidden layers and an output layer; the input layer corresponds to the features of the training voice data, namely the Mel cepstrum features obtained in step 1-1) and their first- and second-order derivatives, 60 dimensions in total, so the number of input-layer nodes is set to 60; the output layer corresponds to the probability of each speaker, its number of nodes is the number of all speakers to be identified, and each node outputs the probability of the corresponding speaker; the hidden layers serve to automatically extract features at different levels, and the number of nodes in each hidden layer represents the dimension of the features it extracts;
1-2-2) training a first DNN model to obtain a first DNN model parameter;
training the first DNN model according to the Mel cepstrum features and first- and second-order derivatives of the training voice data of all speakers to be recognized; the model parameters comprise the connection weights between adjacent layers and the bias of each node. Unsupervised training is performed first, then supervised training: during unsupervised training, every two adjacent layers in the first DNN model are treated as a restricted Boltzmann machine, and all restricted Boltzmann machines are trained in sequence with the contrastive divergence algorithm to obtain initial values of the first DNN model parameters; during supervised training, starting from the initial values obtained by unsupervised training, the first DNN model parameters are fine-tuned with the back-propagation algorithm to obtain the final first DNN model parameters;
1-3) extracting the confusing voice data;
recognizing the training voice data of all speakers to be recognized with the first DNN model obtained by training in step 1-2), and setting a threshold; if, in the recognition result of a piece of training voice data, the probability of the speaker corresponding to that data is smaller than the set threshold, the piece is poorly discriminated, and it is extracted as confusing voice data for training the second DNN model; if the probability is greater than or equal to the set threshold, the training voice data is easy to distinguish and is not used as confusing voice data;
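Step 1-3) amounts to a simple filter over the first model's recognition results: keep an utterance as confusing when the probability assigned to its true speaker falls below the threshold. A sketch with toy numbers follows; the 0.85 value is one choice within the 0.7-0.9 range given in the text.

```python
import numpy as np

def extract_confusing(probs, labels, threshold=0.85):
    """Return indices of utterances whose true-speaker probability under
    the first DNN model is below the threshold (the confusing data)."""
    true_prob = probs[np.arange(len(labels)), labels]
    return np.where(true_prob < threshold)[0]

# toy recognition results: 4 utterances, 3 speakers
probs = np.array([[0.9, 0.05, 0.05],    # confidently correct -> keep out
                  [0.4, 0.35, 0.25],    # ambiguous -> confusing
                  [0.1, 0.80, 0.10],    # just under threshold -> confusing
                  [0.3, 0.30, 0.40]])   # ambiguous -> confusing
labels = np.array([0, 0, 1, 2])         # true speaker of each utterance
conf_idx = extract_confusing(probs, labels)
```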
1-4) establishing a second DNN model and training, specifically comprising the following steps:
1-4-1) setting the layer number and the node number of the second DNN model;
the second DNN model is divided into an input layer, hidden layers and an output layer; the input layer corresponds to the features of the training voice data, namely the Mel cepstrum features of the confusing voice data extracted in step 1-3) and their first- and second-order derivatives, 60 dimensions in total, so the number of input-layer nodes is set to 60; the output layer corresponds to the probability of each speaker, its number of nodes is the number of speakers contained in the confusing voice data, and each node outputs the probability of the corresponding speaker; the hidden layers serve to automatically extract features at different levels, and the number of nodes in each hidden layer represents the dimension of the features it extracts;
1-4-2) training the second DNN model to obtain second DNN model parameters;
training the second DNN model according to the Mel cepstrum features and first- and second-order derivatives of the confusing voice data obtained in step 1-3); the model parameters comprise the connection weights between adjacent layers and the bias of each node. Unsupervised training is performed first, then supervised training: during unsupervised training, every two adjacent layers in the second DNN model are treated as a restricted Boltzmann machine, and all restricted Boltzmann machines are trained in sequence with the contrastive divergence algorithm to obtain initial values of the second DNN model parameters; during supervised training, starting from the initial values obtained by unsupervised training, the second DNN model parameters are fine-tuned with the back-propagation algorithm to obtain the final second DNN model parameters;
2) a speaker identification stage; the method specifically comprises the following steps:
2-1) acquiring voice data to be recognized of one of speakers to be recognized, preprocessing the voice data to be recognized, and extracting the Mel cepstrum characteristics and first and second derivatives of the voice data to be recognized, wherein the total dimension is 60;
2-2) inputting the 60-dimensional features of the voice data to be recognized obtained in step 2-1) into the first DNN model obtained in step 1-2) for recognition; the output layer outputs the recognition result of the voice data to be recognized, namely the probability that the voice data belongs to each speaker in the training voice data, with each output-layer node giving the probability of one speaker;
2-3) setting a judgment threshold value, and judging whether a result with the probability larger than the judgment threshold value exists in the identification result of the step 2-2): if so, the speaker corresponding to the maximum output probability in the first DNN model recognition result is the speaker of the voice data to be recognized, and the recognition is finished; if not, the step 2-4) is carried out;
2-4) if the recognition result in the step 2-3) does not have a result with the probability larger than the judgment threshold, performing secondary recognition on the voice data to be recognized by using a second DNN model; and the speaker corresponding to the maximum output probability in the second DNN model recognition result is the speaker of the voice data to be recognized, and the recognition is finished.
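The two-stage decision of steps 2-2) to 2-4) can be sketched as follows. The two models are stand-in callables returning per-speaker probabilities (any trained classifier would do), and the 0.85 threshold is an assumption taken from the embodiment later in the text.

```python
import numpy as np

def identify(features, dnn1, dnn2, threshold=0.85):
    """First recognize with dnn1; if no output probability exceeds the
    decision threshold, perform the secondary recognition with dnn2,
    which was trained on the confusing voice data."""
    p1 = dnn1(features)
    if p1.max() > threshold:
        return int(np.argmax(p1)), 1    # decided by the first DNN model
    p2 = dnn2(features)
    return int(np.argmax(p2)), 2        # decided by the second DNN model

dnn1 = lambda x: np.array([0.2, 0.5, 0.3])    # ambiguous first-stage result
dnn2 = lambda x: np.array([0.1, 0.85, 0.05])  # sharper second-stage result
spk, stage = identify(np.zeros(60), dnn1, dnn2)
```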
The invention has the characteristics and beneficial effects that:
Compared with the prior art, the first DNN model of the invention models the macroscopic features among speakers, while the second DNN model models the microscopic features among speakers. The method increases the distinguishability of confusing voice data from different speakers, has good system stability, accounts for macroscopic and microscopic characteristics simultaneously, and can improve the accuracy of speaker identification.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is a first DNN model structure diagram in the embodiment of the present invention.
Fig. 3 is a diagram of a second DNN model structure in the embodiment of the present invention.
Detailed Description
The speaker recognition method based on secondary modeling provided by the invention is further described in detail below with reference to the figures and specific embodiments.
The speaker recognition method based on secondary modeling of the invention is divided into a model training stage and a speaker recognition stage. In the model training stage, training voice data of all speakers to be recognized is acquired and preprocessed; a first DNN model is trained on the training voice data; the training voice data is recognized with the first DNN model, and the confusing voice data is extracted; a second DNN model is trained on the confusing voice data. In the speaker recognition stage, voice data to be recognized is acquired and preprocessed, then recognized with the first DNN model; if the recognition probability exceeds a decision threshold, the speaker corresponding to the maximum output probability in the recognition result is taken as the speaker of the voice data to be recognized; otherwise, the voice data to be recognized undergoes a second recognition with the second DNN model, and the speaker corresponding to the maximum output probability in that result is taken as the speaker. The flow chart of the method is shown in Fig. 1, and the method comprises the following steps:
1) a model training stage; the method specifically comprises the following steps:
1-1) acquiring training voice data of all speakers to be recognized, wherein the speaker corresponding to each piece of training voice data is known, and the acquisition mode can be field recording or telephone recording; preprocessing the acquired training voice data, extracting Mel cepstrum characteristics corresponding to all the training voice data, and calculating first and second derivatives of the Mel cepstrum characteristics, which are 60-dimensional in total;
1-2) establishing a first DNN model and training, and specifically comprising the following steps:
1-2-1) setting the layer number and the node number of a first DNN model;
the first DNN model is divided into an input layer, hidden layers and an output layer; the input layer corresponds to the features of the training voice data, namely the Mel cepstrum features obtained in step 1-1) and their first- and second-order derivatives, 60 dimensions in total, so the number of input-layer nodes is set to 60; the output layer corresponds to the probability of each speaker, its number of nodes is the number of all speakers to be identified, and each node outputs the probability of the corresponding speaker; the hidden layers mainly serve to automatically extract features at different levels, and their layer and node counts are set according to need and experience, with the number of hidden layers generally set to 3-5; the number of nodes in each hidden layer represents the dimension of the features it extracts, and the middle hidden layer is generally set to 300-500 nodes;
1-2-2) training a first DNN model to obtain a first DNN model parameter;
training the first DNN model according to the Mel cepstrum features and first- and second-order derivatives of the training voice data of all speakers to be recognized; the model parameters comprise the connection weights between adjacent layers and the bias of each node. Unsupervised training is performed first, then supervised training: during unsupervised training, every two adjacent layers in the first DNN model are treated as a restricted Boltzmann machine, and all restricted Boltzmann machines are trained in sequence with the Contrastive Divergence (CD) algorithm to obtain initial values of the first DNN model parameters; during supervised training, starting from the initial values obtained by unsupervised training, the first DNN model parameters are fine-tuned with the back-propagation algorithm to obtain the final first DNN model parameters;
1-3) extracting the confusing voice data;
according to the first DNN model obtained by training in step 1-2), recognizing the training voice data of all speakers to be recognized and setting a threshold; the threshold is generally set within the range 0.7-0.9 according to experience. If, in the recognition result of a piece of training voice data, the probability of the speaker corresponding to that data is smaller than the set threshold, the piece is poorly discriminated, and it is extracted as confusing voice data for training the second DNN model; if the probability is greater than or equal to the set threshold, the training voice data is easy to distinguish and is not used as confusing voice data;
1-4) establishing a second DNN model and training, specifically comprising the following steps:
1-4-1) setting the layer number and the node number of the second DNN model;
the second DNN model is divided into an input layer, hidden layers and an output layer; the input layer corresponds to the features of the training voice data, namely the 60-dimensional Mel cepstrum features and first- and second-order derivatives of the confusing voice data extracted in step 1-3), so the number of input-layer nodes is set to 60; the output layer corresponds to the probability of each speaker, its number of nodes is the number of speakers contained in the confusing voice data, and each node outputs the probability of the corresponding speaker; the number of hidden layers is generally set to 3-5, the middle hidden layer is generally set to 300-500 nodes, and the number of nodes in each hidden layer represents the dimension of the features it extracts;
1-4-2) training the second DNN model to obtain second DNN model parameters;
training the second DNN model according to the Mel cepstrum features and first- and second-order derivatives of the confusing voice data obtained in step 1-3); the model parameters comprise the connection weights between adjacent layers and the bias of each node. Unsupervised training is performed first, then supervised training: during unsupervised training, every two adjacent layers in the second DNN model are treated as a restricted Boltzmann machine, and all restricted Boltzmann machines are trained in sequence with the Contrastive Divergence (CD) algorithm to obtain initial values of the second DNN model parameters; during supervised training, starting from the initial values obtained by unsupervised training, the second DNN model parameters are fine-tuned with the back-propagation algorithm to obtain the final second DNN model parameters;
2) the speaker identification stage specifically comprises the following steps:
2-1) acquiring voice data to be recognized of one of speakers to be recognized, preprocessing the voice data to be recognized, and extracting the Mel cepstrum characteristics and first and second derivatives of the voice data to be recognized, wherein the total dimension is 60;
2-2) inputting the 60-dimensional features of the voice data to be recognized obtained in step 2-1) into the first DNN model obtained in step 1-2) for recognition; the output layer outputs the recognition result of the voice data to be recognized, namely the probability that the voice data belongs to each speaker in the training voice data, with each output-layer node giving the probability of one speaker;
2-3) setting a judgment threshold value, and judging whether a result with the probability larger than the judgment threshold value exists in the identification result of the step 2-2): if so, the speaker corresponding to the maximum output probability in the first DNN model recognition result is the speaker of the voice data to be recognized, and the recognition is finished; if not, the step 2-4) is carried out;
2-4) if the recognition result in the step 2-3) does not have a result with the probability larger than the judgment threshold, performing secondary recognition on the voice data to be recognized by using a second DNN model; and the speaker corresponding to the maximum output probability in the second DNN model recognition result is the speaker of the voice data to be recognized, and the recognition is finished.
The process of the present invention is further described in detail below with reference to a specific example. It should be noted that the embodiment described below is only one embodiment of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without inventive step, are within the scope of the present invention.
In this embodiment, 800 speakers need to be identified, and the specific steps are as follows:
1) the model training stage specifically comprises the following steps:
1-1) acquiring training voice data of 800 speakers to be identified, wherein the speaker corresponding to each piece of training voice data is known, and the acquisition mode is telephone recording; preprocessing the acquired training voice data (namely, the telephone recording), extracting Mel cepstrum characteristics corresponding to all the training voice data, and calculating first and second derivatives of the Mel cepstrum characteristics, which are 60-dimensional in total;
1-2) establishing a first DNN model and training, and specifically comprising the following steps:
1-2-1) setting the layer number and the node number of a first DNN model;
the first DNN model structure is shown in Fig. 2: the first DNN model has 7 layers in total, where layer 1 is the input layer, layers 2 to 6 are hidden layers (5 in total), and layer 7 is the output layer. The cross lines represent the connection relationships of the nodes: nodes in adjacent layers of the first DNN model are fully connected, and nodes within a layer are not connected. The input layer of the first DNN model corresponds to the features of the training voice data, namely the Mel cepstrum features obtained in step 1-1) and their first- and second-order derivatives, 60 dimensions in total, so the number of input-layer nodes is set to N1 = 60. The output layer corresponds to the probability of each speaker; its number of nodes equals the number of speakers to be identified, N7 = 800, and each node outputs the probability of the corresponding speaker. The hidden layers mainly serve to automatically extract features at different levels; the number of hidden layers is set to 5 (generally set to 3-5), and the features extracted by the hidden layers transition gradually from the low-level abstraction of layer 2 to the high-level abstraction of layer 6. The number of nodes in each hidden layer represents the dimension of the features it extracts; in this embodiment, the middle hidden layer (layer 4) is set to N4 = 400 nodes (generally set to 300-500), and the remaining hidden layers N2, N3, N5 and N6 are set to 1024 nodes each (typically on the order of 1000).
1-2-2) training a first DNN model to obtain a first DNN model parameter;
training the first DNN model according to the Mel cepstrum features and first- and second-order derivatives of the training voice data of the 800 speakers to be recognized. The model parameters comprise the connection weights between adjacent layers:

W_{i,i+1}, an N_i-row by N_{i+1}-column matrix whose element w_{mn}^{(i)} represents the connection weight between the m-th node in layer i and the n-th node in layer i+1 of the first DNN model;

and the bias of each node:

B_j = (b_1^{(j)}, ..., b_{N_j}^{(j)}), where b_k^{(j)} represents the bias of the k-th node in layer j of the first DNN model.
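As a sanity check on these dimensions, the full parameter set of the 7-layer embodiment network can be laid out as follows. The 0-based Python list layout (W[i] holding W_{i+1,i+2}, B[j] holding the bias vector of layer j+1) is an assumption for illustration.

```python
import numpy as np

# Layer sizes N1..N7 of the first DNN model in the embodiment
sizes = [60, 1024, 1024, 400, 1024, 1024, 800]

# Six weight matrices W_{1,2} .. W_{6,7}, each N_i x N_{i+1}
W = [np.zeros((sizes[i], sizes[i + 1])) for i in range(len(sizes) - 1)]

# One bias vector B_j per layer, of length N_j
B = [np.zeros(n) for n in sizes]
```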
Unsupervised training is performed first and then supervised training. During unsupervised training, every two adjacent layers in the first DNN model, namely layers 1 and 2, layers 2 and 3, ..., layers 6 and 7 in Fig. 2, are treated as a restricted Boltzmann machine, 6 in total. They are trained one by one with the Contrastive Divergence (CD) algorithm: first the restricted Boltzmann machine formed by layers 1 and 2 is trained to obtain the parameters B_1, B_2 and W_{1,2}; then the restricted Boltzmann machine formed by layers 2 and 3 is trained to obtain B_3 and W_{2,3}; and so on until all restricted Boltzmann machines have been trained in sequence, yielding the initial values of the first DNN model parameters. During supervised training, starting from the initial values obtained by unsupervised training, the first DNN model parameters are fine-tuned with the back-propagation algorithm to obtain the final first DNN model parameters.
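A minimal CD-1 update for one such restricted Boltzmann machine might look like the sketch below. Binary hidden units and a single training vector are assumptions for brevity; real pretraining uses mini-batches, momentum, weight decay and many epochs, and the learning rate here is arbitrary.

```python
import numpy as np

def cd1_step(v0, W, b_v, b_h, lr=0.1, rng=None):
    """One Contrastive Divergence (CD-1) update for an RBM made of two
    adjacent DNN layers: visible units v (lower layer), hidden units h
    (upper layer), weights W, biases b_v and b_h."""
    rng = rng or np.random.default_rng(0)
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    # positive phase: hidden activations driven by the data
    ph0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)   # sample binary h
    # negative phase: one Gibbs step down to visible and back up
    pv1 = sigmoid(h0 @ W.T + b_v)
    ph1 = sigmoid(pv1 @ W + b_h)
    # update from the difference between the two phases
    W += lr * (np.outer(v0, ph0) - np.outer(pv1, ph1))
    b_v += lr * (v0 - pv1)
    b_h += lr * (ph0 - ph1)
    return W, b_v, b_h

rng = np.random.default_rng(0)
W = rng.normal(0, 0.01, (60, 1024))   # layer-1 to layer-2 RBM: W_{1,2}
b_v, b_h = np.zeros(60), np.zeros(1024)   # biases B_1 and B_2
W, b_v, b_h = cd1_step(rng.random(60), W, b_v, b_h)
```

Training the RBM stack then just repeats this layer pair by layer pair, feeding each RBM's hidden activations to the next one as its visible data.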
1-3) extracting the confusing voice data;
all training voice data of the 800 speakers are recognized with the first DNN model obtained by the training in step 1-2). The threshold is set to 0.85 (generally set to 0.7-0.9). If, in the recognition result of a piece of training voice data, the probability of the speaker actually corresponding to that data is smaller than the set threshold, the data is poorly discriminated, and it is taken as confusing voice data for training the second DNN model; if the probability is greater than or equal to the set threshold, the training voice data is easy to distinguish and is not used as confusing voice data;
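The extraction rule above can be sketched as a simple filter. The function name is hypothetical; `probs[spk]` stands for the first DNN's output probability for the true speaker of each utterance.

```python
def extract_confusable(utterance_probs, labels, threshold=0.85):
    """Keep utterances whose true-speaker probability under the first
    DNN falls below the threshold; these become the confusing voice
    data used to train the second DNN model."""
    confusable = []
    for probs, spk in zip(utterance_probs, labels):
        if probs[spk] < threshold:
            confusable.append((probs, spk))
    return confusable
```

Utterances the first model already classifies confidently are discarded, so the second model concentrates its capacity on the hard cases.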
1-4) establishing a second DNN model and training, specifically comprising the following steps:
1-4-1) setting the layer number and the node number of the second DNN model;
the second DNN model structure is shown in fig. 3: the second DNN model has 5 layers in total, where layer 1 is the input layer, layers 2 to 4 are hidden layers (3 layers in total), and layer 5 is the output layer. The cross lines represent the connections between nodes: the nodes of any two adjacent layers of the second DNN model are fully connected, and nodes within the same layer are not connected. The input layer of the second DNN model corresponds to the features of the speech data, in this embodiment the mel cepstrum features of the confusing voice data extracted in step 1-3) and their first and second derivatives, 60 dimensions in total, so the number of input-layer nodes is set to N1 = 60. The output layer corresponds to the probability of each speaker; the number of output-layer nodes is the number of speakers contained in the confusing voice data, 800 in this embodiment. Because the speaker corresponding to each piece of training data is known and the confusing voice data is extracted from the training data, the speaker corresponding to each frame of confusing voice data is known, so the total number of speakers in the confusing voice data can be counted. The number of hidden layers is set to 3 (generally set to 3-5), the number of nodes of the middle hidden layer (layer 3) is set to N3 = 300, and N2 and N4 are set to 1024.
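Sizing the second model's output layer only requires counting the distinct speakers left in the confusing voice data, since every training utterance carries a speaker label. A one-line sketch (hypothetical function name):

```python
def count_speakers(confusable_labels):
    """The output-layer size of the second DNN equals the number of
    distinct speakers appearing in the confusing voice data."""
    return len(set(confusable_labels))
```

In this embodiment every one of the 800 speakers contributes some confusable speech, so the count is again 800; in general it may be smaller.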
1-4-2) training the second DNN model to obtain model parameters;
the second DNN model is trained on the mel cepstrum features of the confusing voice data obtained in step 1-3) and their first- and second-order derivatives. The parameters of the second DNN model comprise the connection weights of adjacent layers W_{i,i+1} (i = 1, …, 4) and the bias of each node B_j (j = 1, …, 5). Unsupervised training is performed first, followed by supervised training. During unsupervised training, every pair of adjacent layers in the second DNN model (layers 1 and 2, layers 2 and 3, …, layers 4 and 5 in fig. 3) is treated as a restricted Boltzmann machine, 4 in total, trained layer by layer with the contrastive divergence (CD) algorithm: first the restricted Boltzmann machine formed by layers 1 and 2 is trained, yielding B1, B2 and W12 of the second DNN model parameters; then the restricted Boltzmann machine formed by layers 2 and 3 is trained, yielding B3 and W23; all restricted Boltzmann machines are trained in sequence to obtain initial values of the second DNN model parameters. During supervised training, starting from the initial values obtained by unsupervised training, the second DNN model parameters are fine-tuned with the back-propagation algorithm to obtain the final second DNN model parameters.
The above steps constitute the model training stage; once the two DNN models are obtained, speaker recognition can be performed;
2) the speaker identification stage specifically comprises the following steps:
2-1) acquiring the voice data to be recognized of one of the 800 speakers to be recognized. The voice data to be recognized is also obtained by telephone recording, but the speaker corresponding to it is unknown and needs to be identified by the method provided by the invention. The voice data to be recognized and the training voice data are different utterances. The voice data to be recognized is preprocessed, and its mel cepstrum features and their first and second derivatives, 60 dimensions in total, are extracted.
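A sketch of the 60-dimensional feature construction used throughout the method. Assumptions for illustration: 20 static mel-cepstral coefficients per frame, and `np.gradient` central differences standing in for the regression-window delta formula common in practice.

```python
import numpy as np

def add_deltas(mfcc):
    """Append first- and second-order time derivatives to a (T, 20)
    mel-cepstrum matrix, giving the 60-dimensional per-frame feature
    fed to both DNN models."""
    d1 = np.gradient(mfcc, axis=0)   # first derivative over frames
    d2 = np.gradient(d1, axis=0)     # second derivative over frames
    return np.hstack([mfcc, d1, d2])
```

The same routine is applied to training speech, confusing speech, and speech to be recognized, so all three stages see identically shaped inputs.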
2-2) the 60-dimensional features of the voice data to be recognized obtained in step 2-1) are input into the first DNN model for recognition, and the output layer outputs the recognition result of the voice data to be recognized, namely the probabilities that the voice data corresponds to each of the 800 speakers; each node of the output layer outputs the probability of one speaker, 800 probabilities in total.
2-3) a decision threshold is set, and it is judged whether any of the 800 probabilities in the recognition result is greater than the threshold of 0.85: if so, the speaker corresponding to that probability is judged to be the speaker of the voice data to be recognized, and recognition ends; otherwise, go to step 2-4);
2-4) if no probability in the recognition result of step 2-3) is greater than the decision threshold of 0.85, the voice data to be recognized is recognized a second time with the second DNN model; according to the recognition result of the second DNN model, the speaker corresponding to the maximum output probability is judged to be the speaker of the voice data to be recognized, and recognition ends.
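The two-stage decision of steps 2-2) to 2-4) can be sketched as follows. The DNN models are passed in as callables that return per-speaker probability vectors; the names are illustrative.

```python
import numpy as np

def recognize(features, first_dnn, second_dnn, threshold=0.85):
    """Accept the first DNN's answer only when its top probability
    clears the threshold; otherwise fall back to the second DNN
    trained on the confusing voice data."""
    probs = first_dnn(features)
    if probs.max() > threshold:
        return int(probs.argmax())
    return int(second_dnn(features).argmax())
```

Easy utterances are resolved by the first model alone; only the ambiguous ones pay the cost of the second pass.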
As will be understood by those skilled in the art, the method of the present invention can be implemented by a program, and the program can be stored in a computer-readable storage medium.
While the invention has been described with reference to a specific embodiment, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the scope of the invention.

Claims (1)

1. A speaker recognition method based on secondary modeling, characterized by comprising a model training stage and a speaker recognition stage; in the model training stage, training voice data of all speakers to be recognized is acquired and preprocessed; a first DNN model is obtained by training on the training voice data; the training voice data is recognized with the first DNN model, and the confusing voice data is extracted; a second DNN model is obtained by training on the confusing voice data; in the speaker recognition stage, voice data to be recognized is acquired and preprocessed; the voice data to be recognized is recognized with the first DNN model, and if the recognition probability is greater than a decision threshold, the speaker corresponding to the maximum output probability in the recognition result is identified as the speaker of the voice data to be recognized; otherwise, the voice data to be recognized undergoes secondary recognition with the second DNN model, and the speaker corresponding to the maximum output probability in that recognition result is the speaker of the voice data to be recognized; the method comprises the following steps:
1) a model training stage; the method specifically comprises the following steps:
1-1) acquiring training voice data of all speakers to be recognized, wherein the speaker corresponding to each piece of training voice data is known; preprocessing the acquired training voice data, extracting Mel cepstrum characteristics corresponding to all the training voice data, and calculating first and second derivatives of the Mel cepstrum characteristics, which are 60-dimensional in total;
1-2) establishing a first DNN model and training, and specifically comprising the following steps:
1-2-1) setting the layer number and the node number of a first DNN model;
the first DNN model is divided into an input layer, hidden layers and an output layer; the input layer corresponds to the features of the training voice data, namely the mel cepstrum features of the training voice data obtained in step 1-1) and their first and second derivatives, 60 dimensions in total, so the number of input-layer nodes is set to 60; the output layer corresponds to the probability of each speaker, the number of output-layer nodes is the number of all speakers to be identified, and each node outputs the probability corresponding to one speaker; the hidden layers automatically extract features at different levels, and the number of nodes in each hidden layer is the dimension of the feature extracted by that layer;
1-2-2) training a first DNN model to obtain a first DNN model parameter;
the first DNN model is trained on the mel cepstrum features and first and second derivatives of the training voice data of all speakers to be recognized, the model parameters comprising the connection weights between adjacent layers and the bias of each node; unsupervised training is performed first and supervised training afterwards: during unsupervised training, every pair of adjacent layers in the first DNN model is treated as a restricted Boltzmann machine, and all restricted Boltzmann machines are trained in sequence with the contrastive divergence algorithm to obtain initial values of the first DNN model parameters; during supervised training, starting from the initial values obtained by unsupervised training, the first DNN model parameters are fine-tuned with the back-propagation algorithm to obtain the final first DNN model parameters;
1-3) extracting the confusing voice data;
training voice data of all speakers to be recognized is recognized with the first DNN model obtained by training in step 1-2), and a threshold is set; if, in the recognition result of a piece of training voice data, the probability of the speaker corresponding to that data is smaller than the set threshold, the data is poorly discriminated, and it is extracted as confusing voice data for training the second DNN model; if the probability is greater than or equal to the set threshold, the training voice data is easy to distinguish and is not used as confusing voice data;
1-4) establishing a second DNN model and training, specifically comprising the following steps:
1-4-1) setting the layer number and the node number of the second DNN model;
the second DNN model is divided into an input layer, hidden layers and an output layer; the input layer corresponds to the features of the confusing voice data, namely the mel cepstrum features of the confusing voice data extracted in step 1-3) and their first and second derivatives, 60 dimensions in total, so the number of input-layer nodes is set to 60; the output layer corresponds to the probability of each speaker, the number of output-layer nodes is the number of speakers contained in the confusing voice data, and each node outputs the probability corresponding to one speaker; the hidden layers automatically extract features at different levels, and the number of nodes in each hidden layer is the dimension of the feature extracted by that layer;
1-4-2) training the second DNN model to obtain second DNN model parameters;
the second DNN model is trained on the mel cepstrum features of the confusing voice data obtained in step 1-3) and their first- and second-order derivatives, the model parameters comprising the connection weights between adjacent layers and the bias of each node; unsupervised training is performed first and supervised training afterwards: during unsupervised training, every pair of adjacent layers in the second DNN model is treated as a restricted Boltzmann machine, and all restricted Boltzmann machines are trained in sequence with the contrastive divergence algorithm to obtain initial values of the second DNN model parameters; during supervised training, starting from the initial values obtained by unsupervised training, the second DNN model parameters are fine-tuned with the back-propagation algorithm to obtain the final second DNN model parameters;
2) a speaker identification stage; the method specifically comprises the following steps:
2-1) acquiring voice data to be recognized of one of speakers to be recognized, preprocessing the voice data to be recognized, and extracting the Mel cepstrum characteristics and first and second derivatives of the voice data to be recognized, wherein the total dimension is 60;
2-2) inputting the 60-dimensional characteristics of the voice data to be recognized obtained in the step 2-1) into the first DNN model obtained in the step 1-2) for recognition, and outputting the recognition result of the voice data to be recognized by an output layer, wherein the voice data corresponds to the probability of each speaker in the training voice data respectively, and the output of each node of the output layer corresponds to the probability of each speaker respectively;
2-3) setting a judgment threshold value, and judging whether a result with the probability larger than the judgment threshold value exists in the identification result of the step 2-2): if so, the speaker corresponding to the maximum output probability in the first DNN model recognition result is the speaker of the voice data to be recognized, and the recognition is finished; if not, the step 2-4) is carried out;
2-4) if the recognition result in the step 2-3) does not have a result with the probability larger than the judgment threshold, performing secondary recognition on the voice data to be recognized by using a second DNN model; and the speaker corresponding to the maximum output probability in the second DNN model recognition result is the speaker of the voice data to be recognized, and the recognition is finished.
CN201710031899.7A 2017-01-17 2017-01-17 Speaker identification method based on secondary modeling Active CN106898355B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710031899.7A CN106898355B (en) 2017-01-17 2017-01-17 Speaker identification method based on secondary modeling

Publications (2)

Publication Number Publication Date
CN106898355A CN106898355A (en) 2017-06-27
CN106898355B true CN106898355B (en) 2020-04-14

Family

ID=59198262

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710031899.7A Active CN106898355B (en) 2017-01-17 2017-01-17 Speaker identification method based on secondary modeling

Country Status (1)

Country Link
CN (1) CN106898355B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107274890B (en) * 2017-07-04 2020-06-02 清华大学 Voiceprint spectrum extraction method and device
CN107274883B (en) * 2017-07-04 2020-06-02 清华大学 Voice signal reconstruction method and device
CN107610709B (en) * 2017-08-01 2021-03-19 百度在线网络技术(北京)有限公司 Method and system for training voiceprint recognition model
CN108305615B (en) * 2017-10-23 2020-06-16 腾讯科技(深圳)有限公司 Object identification method and device, storage medium and terminal thereof
CN109887511A (en) * 2019-04-24 2019-06-14 武汉水象电子科技有限公司 A kind of voice wake-up optimization method based on cascade DNN
CN111883175B (en) * 2020-06-09 2022-06-07 河北悦舒诚信息科技有限公司 Voiceprint library-based oil station service quality improving method
CN111724766B (en) * 2020-06-29 2024-01-05 合肥讯飞数码科技有限公司 Language identification method, related equipment and readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1264887A (en) * 2000-03-31 2000-08-30 清华大学 Non-particular human speech recognition and prompt method based on special speech recognition chip
CN1588536A (en) * 2004-09-29 2005-03-02 上海交通大学 State structure regulating method in sound identification
CN101231848A (en) * 2007-11-06 2008-07-30 安徽科大讯飞信息科技股份有限公司 Method for performing pronunciation error detecting based on holding vector machine
CN105761720A (en) * 2016-04-19 2016-07-13 北京地平线机器人技术研发有限公司 Interaction system based on voice attribute classification, and method thereof

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9058806B2 (en) * 2012-09-10 2015-06-16 Cisco Technology, Inc. Speaker segmentation and recognition based on list of speakers

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A speaker verification method based on GMM-DNN; Li Jingyang et al.; Computer Applications and Software; 31 Dec. 2016; Vol. 33, No. 12; pp. 131-135 *


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20181128

Address after: 100085 Beijing Haidian District Shangdi Information Industry Base Pioneer Road 1 B Block 2 Floor 2030

Applicant after: Beijing Huacong Zhijia Technology Co., Ltd.

Address before: 100084 Tsinghua Yuan, Haidian District, Beijing, No. 1

Applicant before: Tsinghua University

GR01 Patent grant
GR01 Patent grant