CN106898355B - Speaker identification method based on secondary modeling - Google Patents


Info

Publication number
CN106898355B
CN106898355B (application CN201710031899.7A)
Authority
CN
China
Prior art keywords
voice data
training
dnn model
speaker
recognized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710031899.7A
Other languages
Chinese (zh)
Other versions
CN106898355A (en)
Inventor
何亮
陈仙红
徐灿
刘艺
田垚
刘加
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Huacong Zhijia Technology Co., Ltd.
Original Assignee
Beijing Huacong Zhijia Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Huacong Zhijia Technology Co Ltd filed Critical Beijing Huacong Zhijia Technology Co Ltd
Priority to CN201710031899.7A priority Critical patent/CN106898355B/en
Publication of CN106898355A publication Critical patent/CN106898355A/en
Application granted granted Critical
Publication of CN106898355B publication Critical patent/CN106898355B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04: Training, enrolment or model building
    • G10L17/06: Decision making techniques; Pattern matching strategies
    • G10L17/14: Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G10L17/18: Artificial neural networks; Connectionist approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a speaker recognition method based on secondary modeling, belonging to the fields of voiceprint recognition, pattern recognition and machine learning. In the model training stage, training voice data of the speakers to be recognized is acquired and preprocessed; a first DNN model is trained on the training voice data; the training voice data is recognized with the first DNN model, and the confusing voice data is extracted; a second DNN model is trained on the confusing voice data. In the speaker recognition stage, voice data to be recognized is acquired and preprocessed; the voice data to be recognized is recognized with the first DNN model, and a speaker recognition result is obtained if the recognition probability exceeds a set threshold; otherwise, the voice data to be recognized is recognized a second time with the second DNN model to obtain the speaker recognition result. By establishing two DNN models, the invention accounts for both the macroscopic and the microscopic characteristics of speakers, effectively improving the accuracy of speaker identification.

Description

Speaker identification method based on secondary modeling
Technical Field
The invention belongs to the technical field of voiceprint recognition, pattern recognition and machine learning, and particularly relates to a speaker recognition method based on secondary modeling.
Background
Speaker recognition refers to identifying the identity of a speaker from speaker-related information contained in speech. With the rapid development of information and communication technology, speaker recognition technology is gaining increasing importance and is widely applied in many fields, for example identity authentication, apprehending criminals over telephone channels, confirming identity in court from telephone recordings, telephone voice tracking, and controlling anti-theft doors. Speaker recognition technology can be applied to voice dialing, telephone banking, telephone shopping, database access, information services, voice e-mail, security control, remote computer login, and other fields.
Speaker recognition first preprocesses the speech data to extract features. The most common feature is the Mel cepstrum feature, based on the theory of human auditory perception, and it is widely applied in speaker recognition, language recognition, continuous speech recognition and other tasks. Mel cepstrum feature extraction consists of pre-emphasizing, framing and windowing the voice data, applying a fast Fourier transform to the framed and windowed data to obtain the corresponding spectrum, filtering with a bank of triangular filters spaced on the Mel frequency scale, and finally applying a discrete cosine transform to obtain the Mel cepstrum features.
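The extraction pipeline described above can be sketched in plain NumPy. This is a simplified illustration: the sample rate, frame length, hop, filter count and the choice of 20 cepstral coefficients are assumptions, chosen so that appending first- and second-order derivatives yields the 60-dimensional feature used later in the text (real systems typically compute deltas with a regression window rather than `np.gradient`).

```python
import numpy as np

def mel_cepstrum(signal, sr=8000, n_fft=512, frame_len=200, hop=80,
                 n_filters=26, n_ceps=20):
    """Pre-emphasis -> framing + windowing -> FFT -> Mel triangular
    filter bank -> log -> DCT, following the steps in the text."""
    # 1) pre-emphasis
    emph = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2) framing and Hamming windowing
    n_frames = 1 + (len(emph) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emph[idx] * np.hamming(frame_len)
    # 3) power spectrum via fast Fourier transform
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 4) triangular filters spaced on the Mel frequency scale
    def hz2mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel2hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz2mel(0.0), hz2mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_energy = np.log(power @ fbank.T + 1e-10)
    # 5) discrete cosine transform (DCT-II) -> Mel cepstral coefficients
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps),
                                    (2 * n + 1) / (2.0 * n_filters)))
    return log_energy @ basis.T

# 1 second of a toy 300 Hz tone as stand-in speech
sig = np.sin(2 * np.pi * 300 * np.arange(8000) / 8000)
ceps = mel_cepstrum(sig)                       # (frames, 20) static features
feat = np.concatenate([ceps,                   # crude 1st/2nd derivatives
                       np.gradient(ceps, axis=0),
                       np.gradient(np.gradient(ceps, axis=0), axis=0)],
                      axis=1)                  # (frames, 60) total
```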
In recent years, speaker recognition models based on Deep Neural Networks (DNN) have received increasing attention. Compared with the traditional Gaussian Mixture Model (GMM), the DNN model has stronger descriptive power and can better model very complex data distributions, and DNN-based systems have achieved remarkable performance improvements. A DNN model contains three kinds of layers: an input layer, hidden layers, and an output layer. The input layer corresponds to the features of the voice data, and its number of nodes is determined by the dimension of those features; the output layer corresponds to the probability of each speaker, and its number of nodes is determined by the total number of speakers to be identified; the number of hidden layers and their node counts are chosen according to application requirements and engineering experience. DNN model training first performs unsupervised training and then supervised training. During unsupervised training, each pair of adjacent layers is treated as a restricted Boltzmann machine and trained layer by layer with the Contrastive Divergence (CD) algorithm. During supervised training, the DNN model parameters obtained from unsupervised training are used as initial values and fine-tuned with the back-propagation algorithm. To date, DNN-based speaker recognition methods have used only one DNN model, but it is difficult for a single DNN model to simultaneously model both the macroscopic and the microscopic differences between speakers. As a result, when one DNN model is used to identify speakers, some voices are easily distinguished while others are easily confused.
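To make the layer roles above concrete, here is a minimal forward pass for such a network. The sigmoid hidden activations and softmax output are common choices assumed here, not prescribed by the patent, and the weights are random rather than trained; the layer sizes mirror the embodiment described later (60-dimensional input, 800 speakers).

```python
import numpy as np

def dnn_forward(x, weights, biases):
    """Fully-connected DNN: sigmoid hidden layers, softmax output layer
    whose nodes each give the probability of one speaker."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = 1.0 / (1.0 + np.exp(-(h @ W + b)))   # sigmoid hidden layer
    z = h @ weights[-1] + biases[-1]
    e = np.exp(z - z.max())                      # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
sizes = [60, 1024, 400, 1024, 800]   # input -> hidden layers -> speakers
Ws = [rng.normal(0, 0.01, (a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(b) for b in sizes[1:]]
probs = dnn_forward(rng.normal(size=60), Ws, bs)   # one probability per speaker
```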
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a speaker identification method based on secondary modeling. According to the invention, two DNN models are established, and macroscopic characteristics and microscopic characteristics of the speaker are considered at the same time, so that the accuracy of speaker identification can be effectively improved.
A speaker recognition method based on secondary modeling is divided into a model training stage and a speaker recognition stage. In the model training stage, training voice data of all speakers to be recognized is acquired and preprocessed; a first DNN model is trained on the training voice data; the training voice data is recognized with the first DNN model, and the confusing voice data is extracted; a second DNN model is trained on the confusing voice data. In the speaker recognition stage, voice data to be recognized is acquired and preprocessed, then recognized with the first DNN model; if the recognition probability exceeds a decision threshold, the speaker corresponding to the maximum output probability in the recognition result is taken as the speaker of the voice data to be recognized; otherwise, the voice data to be recognized undergoes a second recognition with the second DNN model, and the speaker corresponding to the maximum output probability in that result is taken as the speaker. The method comprises the following steps:
1) a model training stage; the method specifically comprises the following steps:
1-1) acquiring training voice data of all speakers to be recognized, wherein the speaker corresponding to each piece of training voice data is known; preprocessing the acquired training voice data, extracting Mel cepstrum characteristics corresponding to all the training voice data, and calculating first and second derivatives of the Mel cepstrum characteristics, which are 60-dimensional in total;
1-2) establishing a first DNN model and training, and specifically comprising the following steps:
1-2-1) setting the layer number and the node number of a first DNN model;
the first DNN model is divided into an input layer, hidden layers and an output layer; the input layer corresponds to the features of the training voice data, namely the Mel cepstrum features obtained in step 1-1) and their first- and second-order derivatives, 60 dimensions in total, so the number of input-layer nodes is set to 60; the output layer corresponds to the probability of each speaker, its number of nodes is the number of all speakers to be identified, and each node outputs the probability of the corresponding speaker; the hidden layers serve to automatically extract features at different levels, and the number of nodes in each hidden layer represents the dimension of the features it extracts;
1-2-2) training a first DNN model to obtain a first DNN model parameter;
training the first DNN model according to the Mel cepstrum features and first- and second-order derivatives of the training voice data of all speakers to be recognized; the model parameters comprise the connection weights between adjacent layers and the bias of each node. Unsupervised training is performed first, then supervised training: during unsupervised training, every two adjacent layers in the first DNN model are treated as a restricted Boltzmann machine, and all restricted Boltzmann machines are trained in sequence with the contrastive divergence algorithm to obtain initial values of the first DNN model parameters; during supervised training, starting from the initial values obtained by unsupervised training, the first DNN model parameters are fine-tuned with the back-propagation algorithm to obtain the final first DNN model parameters;
1-3) extracting the confusing voice data;
recognizing the training voice data of all speakers to be recognized with the first DNN model obtained by training in step 1-2), and setting a threshold; if, in the recognition result of a piece of training voice data, the probability of the speaker corresponding to that data is smaller than the set threshold, the piece is poorly discriminated, and it is extracted as confusing voice data for training the second DNN model; if the probability is greater than or equal to the set threshold, the training voice data is easy to distinguish and is not used as confusing voice data;
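Step 1-3) amounts to a simple filter over the first model's recognition results: keep an utterance as confusing when the probability assigned to its true speaker falls below the threshold. A sketch with toy numbers follows; the 0.85 value is one choice within the 0.7-0.9 range given in the text.

```python
import numpy as np

def extract_confusing(probs, labels, threshold=0.85):
    """Return indices of utterances whose true-speaker probability under
    the first DNN model is below the threshold (the confusing data)."""
    true_prob = probs[np.arange(len(labels)), labels]
    return np.where(true_prob < threshold)[0]

# toy recognition results: 4 utterances, 3 speakers
probs = np.array([[0.9, 0.05, 0.05],    # confidently correct -> keep out
                  [0.4, 0.35, 0.25],    # ambiguous -> confusing
                  [0.1, 0.80, 0.10],    # just under threshold -> confusing
                  [0.3, 0.30, 0.40]])   # ambiguous -> confusing
labels = np.array([0, 0, 1, 2])         # true speaker of each utterance
conf_idx = extract_confusing(probs, labels)
```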
1-4) establishing a second DNN model and training, specifically comprising the following steps:
1-4-1) setting the layer number and the node number of the second DNN model;
the second DNN model is divided into an input layer, hidden layers and an output layer; the input layer corresponds to the features of the training voice data, namely the Mel cepstrum features of the confusing voice data extracted in step 1-3) and their first- and second-order derivatives, 60 dimensions in total, so the number of input-layer nodes is set to 60; the output layer corresponds to the probability of each speaker, its number of nodes is the number of speakers contained in the confusing voice data, and each node outputs the probability of the corresponding speaker; the hidden layers serve to automatically extract features at different levels, and the number of nodes in each hidden layer represents the dimension of the features it extracts;
1-4-2) training the second DNN model to obtain second DNN model parameters;
training the second DNN model according to the Mel cepstrum features and first- and second-order derivatives of the confusing voice data obtained in step 1-3); the model parameters comprise the connection weights between adjacent layers and the bias of each node. Unsupervised training is performed first, then supervised training: during unsupervised training, every two adjacent layers in the second DNN model are treated as a restricted Boltzmann machine, and all restricted Boltzmann machines are trained in sequence with the contrastive divergence algorithm to obtain initial values of the second DNN model parameters; during supervised training, starting from the initial values obtained by unsupervised training, the second DNN model parameters are fine-tuned with the back-propagation algorithm to obtain the final second DNN model parameters;
2) a speaker identification stage; the method specifically comprises the following steps:
2-1) acquiring voice data to be recognized of one of speakers to be recognized, preprocessing the voice data to be recognized, and extracting the Mel cepstrum characteristics and first and second derivatives of the voice data to be recognized, wherein the total dimension is 60;
2-2) inputting the 60-dimensional features of the voice data to be recognized obtained in step 2-1) into the first DNN model obtained in step 1-2) for recognition; the output layer outputs the recognition result of the voice data to be recognized, namely the probability that the voice data belongs to each speaker in the training voice data, with each output-layer node giving the probability of one speaker;
2-3) setting a judgment threshold value, and judging whether a result with the probability larger than the judgment threshold value exists in the identification result of the step 2-2): if so, the speaker corresponding to the maximum output probability in the first DNN model recognition result is the speaker of the voice data to be recognized, and the recognition is finished; if not, the step 2-4) is carried out;
2-4) if the recognition result in the step 2-3) does not have a result with the probability larger than the judgment threshold, performing secondary recognition on the voice data to be recognized by using a second DNN model; and the speaker corresponding to the maximum output probability in the second DNN model recognition result is the speaker of the voice data to be recognized, and the recognition is finished.
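The two-stage decision of steps 2-2) to 2-4) can be sketched as follows. The two models are stand-in callables returning per-speaker probabilities (any trained classifier would do), and the 0.85 threshold is an assumption taken from the embodiment later in the text.

```python
import numpy as np

def identify(features, dnn1, dnn2, threshold=0.85):
    """First recognize with dnn1; if no output probability exceeds the
    decision threshold, perform the secondary recognition with dnn2,
    which was trained on the confusing voice data."""
    p1 = dnn1(features)
    if p1.max() > threshold:
        return int(np.argmax(p1)), 1    # decided by the first DNN model
    p2 = dnn2(features)
    return int(np.argmax(p2)), 2        # decided by the second DNN model

dnn1 = lambda x: np.array([0.2, 0.5, 0.3])    # ambiguous first-stage result
dnn2 = lambda x: np.array([0.1, 0.85, 0.05])  # sharper second-stage result
spk, stage = identify(np.zeros(60), dnn1, dnn2)
```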
The invention has the characteristics and beneficial effects that:
Compared with the prior art, the first DNN model of the invention models the macroscopic features among speakers, while the second DNN model models the microscopic features among speakers. The method increases the distinguishability of confusing voice data from different speakers, has good system stability, accounts for macroscopic and microscopic characteristics simultaneously, and can improve the accuracy of speaker identification.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is a first DNN model structure diagram in the embodiment of the present invention.
Fig. 3 is a diagram of a second DNN model structure in the embodiment of the present invention.
Detailed Description
The speaker recognition method based on secondary modeling provided by the invention is further described in detail below with reference to the figures and specific embodiments.
The speaker recognition method based on secondary modeling of the invention is divided into a model training stage and a speaker recognition stage. In the model training stage, training voice data of all speakers to be recognized is acquired and preprocessed; a first DNN model is trained on the training voice data; the training voice data is recognized with the first DNN model, and the confusing voice data is extracted; a second DNN model is trained on the confusing voice data. In the speaker recognition stage, voice data to be recognized is acquired and preprocessed, then recognized with the first DNN model; if the recognition probability exceeds a decision threshold, the speaker corresponding to the maximum output probability in the recognition result is taken as the speaker of the voice data to be recognized; otherwise, the voice data to be recognized undergoes a second recognition with the second DNN model, and the speaker corresponding to the maximum output probability in that result is taken as the speaker. The flow chart of the method is shown in Fig. 1, and the method comprises the following steps:
1) a model training stage; the method specifically comprises the following steps:
1-1) acquiring training voice data of all speakers to be recognized, wherein the speaker corresponding to each piece of training voice data is known, and the acquisition mode can be field recording or telephone recording; preprocessing the acquired training voice data, extracting Mel cepstrum characteristics corresponding to all the training voice data, and calculating first and second derivatives of the Mel cepstrum characteristics, which are 60-dimensional in total;
1-2) establishing a first DNN model and training, and specifically comprising the following steps:
1-2-1) setting the layer number and the node number of a first DNN model;
the first DNN model is divided into an input layer, hidden layers and an output layer; the input layer corresponds to the features of the training voice data, namely the Mel cepstrum features obtained in step 1-1) and their first- and second-order derivatives, 60 dimensions in total, so the number of input-layer nodes is set to 60; the output layer corresponds to the probability of each speaker, its number of nodes is the number of all speakers to be identified, and each node outputs the probability of the corresponding speaker; the hidden layers mainly serve to automatically extract features at different levels, and their layer and node counts are set according to need and experience, with the number of hidden layers generally set to 3-5; the number of nodes in each hidden layer represents the dimension of the features it extracts, and the middle hidden layer is generally set to 300-500 nodes;
1-2-2) training a first DNN model to obtain a first DNN model parameter;
training the first DNN model according to the Mel cepstrum features and first- and second-order derivatives of the training voice data of all speakers to be recognized; the model parameters comprise the connection weights between adjacent layers and the bias of each node. Unsupervised training is performed first, then supervised training: during unsupervised training, every two adjacent layers in the first DNN model are treated as a restricted Boltzmann machine, and all restricted Boltzmann machines are trained in sequence with the Contrastive Divergence (CD) algorithm to obtain initial values of the first DNN model parameters; during supervised training, starting from the initial values obtained by unsupervised training, the first DNN model parameters are fine-tuned with the back-propagation algorithm to obtain the final first DNN model parameters;
1-3) extracting the confusing voice data;
according to the first DNN model obtained by training in step 1-2), recognizing the training voice data of all speakers to be recognized and setting a threshold; the threshold is generally set within the range 0.7-0.9 according to experience. If, in the recognition result of a piece of training voice data, the probability of the speaker corresponding to that data is smaller than the set threshold, the piece is poorly discriminated, and it is extracted as confusing voice data for training the second DNN model; if the probability is greater than or equal to the set threshold, the training voice data is easy to distinguish and is not used as confusing voice data;
1-4) establishing a second DNN model and training, specifically comprising the following steps:
1-4-1) setting the layer number and the node number of the second DNN model;
the second DNN model is divided into an input layer, hidden layers and an output layer; the input layer corresponds to the features of the training voice data, namely the 60-dimensional Mel cepstrum features and first- and second-order derivatives of the confusing voice data extracted in step 1-3), so the number of input-layer nodes is set to 60; the output layer corresponds to the probability of each speaker, its number of nodes is the number of speakers contained in the confusing voice data, and each node outputs the probability of the corresponding speaker; the number of hidden layers is generally set to 3-5, the middle hidden layer is generally set to 300-500 nodes, and the number of nodes in each hidden layer represents the dimension of the features it extracts;
1-4-2) training the second DNN model to obtain second DNN model parameters;
training the second DNN model according to the Mel cepstrum features and first- and second-order derivatives of the confusing voice data obtained in step 1-3); the model parameters comprise the connection weights between adjacent layers and the bias of each node. Unsupervised training is performed first, then supervised training: during unsupervised training, every two adjacent layers in the second DNN model are treated as a restricted Boltzmann machine, and all restricted Boltzmann machines are trained in sequence with the Contrastive Divergence (CD) algorithm to obtain initial values of the second DNN model parameters; during supervised training, starting from the initial values obtained by unsupervised training, the second DNN model parameters are fine-tuned with the back-propagation algorithm to obtain the final second DNN model parameters;
2) the speaker identification stage specifically comprises the following steps:
2-1) acquiring voice data to be recognized of one of speakers to be recognized, preprocessing the voice data to be recognized, and extracting the Mel cepstrum characteristics and first and second derivatives of the voice data to be recognized, wherein the total dimension is 60;
2-2) inputting the 60-dimensional features of the voice data to be recognized obtained in step 2-1) into the first DNN model obtained in step 1-2) for recognition; the output layer outputs the recognition result of the voice data to be recognized, namely the probability that the voice data belongs to each speaker in the training voice data, with each output-layer node giving the probability of one speaker;
2-3) setting a judgment threshold value, and judging whether a result with the probability larger than the judgment threshold value exists in the identification result of the step 2-2): if so, the speaker corresponding to the maximum output probability in the first DNN model recognition result is the speaker of the voice data to be recognized, and the recognition is finished; if not, the step 2-4) is carried out;
2-4) if the recognition result in the step 2-3) does not have a result with the probability larger than the judgment threshold, performing secondary recognition on the voice data to be recognized by using a second DNN model; and the speaker corresponding to the maximum output probability in the second DNN model recognition result is the speaker of the voice data to be recognized, and the recognition is finished.
The process of the present invention is further described in detail below with reference to a specific example. It should be noted that the embodiment described below is only one embodiment of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without inventive step, are within the scope of the present invention.
In this embodiment, 800 speakers need to be identified, and the specific steps are as follows:
1) the model training stage specifically comprises the following steps:
1-1) acquiring training voice data of 800 speakers to be identified, wherein the speaker corresponding to each piece of training voice data is known, and the acquisition mode is telephone recording; preprocessing the acquired training voice data (namely, the telephone recording), extracting Mel cepstrum characteristics corresponding to all the training voice data, and calculating first and second derivatives of the Mel cepstrum characteristics, which are 60-dimensional in total;
1-2) establishing a first DNN model and training, and specifically comprising the following steps:
1-2-1) setting the layer number and the node number of a first DNN model;
the first DNN model structure is shown in Fig. 2: the first DNN model has 7 layers in total, where layer 1 is the input layer, layers 2 to 6 are hidden layers (5 in total), and layer 7 is the output layer. The cross lines represent the connection relationships of the nodes: nodes in adjacent layers of the first DNN model are fully connected, and nodes within a layer are not connected. The input layer of the first DNN model corresponds to the features of the training voice data, namely the Mel cepstrum features obtained in step 1-1) and their first- and second-order derivatives, 60 dimensions in total, so the number of input-layer nodes is set to N1 = 60. The output layer corresponds to the probability of each speaker; its number of nodes equals the number of speakers to be identified, N7 = 800, and each node outputs the probability of the corresponding speaker. The hidden layers mainly serve to automatically extract features at different levels; the number of hidden layers is set to 5 (generally set to 3-5), and the features extracted by the hidden layers transition gradually from the low-level abstraction of layer 2 to the high-level abstraction of layer 6. The number of nodes in each hidden layer represents the dimension of the features it extracts; in this embodiment, the middle hidden layer (layer 4) is set to N4 = 400 nodes (generally set to 300-500), and the remaining hidden layers N2, N3, N5 and N6 are set to 1024 nodes each (typically on the order of 1000).
1-2-2) training a first DNN model to obtain a first DNN model parameter;
training the first DNN model according to the Mel cepstrum features and first- and second-order derivatives of the training voice data of the 800 speakers to be recognized. The model parameters comprise the connection weights between adjacent layers:

W_{i,i+1}, an N_i-row by N_{i+1}-column matrix whose element w_{mn}^{(i)} represents the connection weight between the m-th node in layer i and the n-th node in layer i+1 of the first DNN model;

and the bias of each node:

B_j = (b_1^{(j)}, ..., b_{N_j}^{(j)}), where b_k^{(j)} represents the bias of the k-th node in layer j of the first DNN model.
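As a sanity check on these dimensions, the full parameter set of the 7-layer embodiment network can be laid out as follows. The 0-based Python list layout (W[i] holding W_{i+1,i+2}, B[j] holding the bias vector of layer j+1) is an assumption for illustration.

```python
import numpy as np

# Layer sizes N1..N7 of the first DNN model in the embodiment
sizes = [60, 1024, 1024, 400, 1024, 1024, 800]

# Six weight matrices W_{1,2} .. W_{6,7}, each N_i x N_{i+1}
W = [np.zeros((sizes[i], sizes[i + 1])) for i in range(len(sizes) - 1)]

# One bias vector B_j per layer, of length N_j
B = [np.zeros(n) for n in sizes]
```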
Unsupervised training is performed first and then supervised training. During unsupervised training, every two adjacent layers in the first DNN model, namely layers 1 and 2, layers 2 and 3, ..., layers 6 and 7 in Fig. 2, are treated as a restricted Boltzmann machine, 6 in total. They are trained one by one with the Contrastive Divergence (CD) algorithm: first the restricted Boltzmann machine formed by layers 1 and 2 is trained to obtain the parameters B_1, B_2 and W_{1,2}; then the restricted Boltzmann machine formed by layers 2 and 3 is trained to obtain B_3 and W_{2,3}; and so on until all restricted Boltzmann machines have been trained in sequence, yielding the initial values of the first DNN model parameters. During supervised training, starting from the initial values obtained by unsupervised training, the first DNN model parameters are fine-tuned with the back-propagation algorithm to obtain the final first DNN model parameters.
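A minimal CD-1 update for one such restricted Boltzmann machine might look like the sketch below. Binary hidden units and a single training vector are assumptions for brevity; real pretraining uses mini-batches, momentum, weight decay and many epochs, and the learning rate here is arbitrary.

```python
import numpy as np

def cd1_step(v0, W, b_v, b_h, lr=0.1, rng=None):
    """One Contrastive Divergence (CD-1) update for an RBM made of two
    adjacent DNN layers: visible units v (lower layer), hidden units h
    (upper layer), weights W, biases b_v and b_h."""
    rng = rng or np.random.default_rng(0)
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    # positive phase: hidden activations driven by the data
    ph0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)   # sample binary h
    # negative phase: one Gibbs step down to visible and back up
    pv1 = sigmoid(h0 @ W.T + b_v)
    ph1 = sigmoid(pv1 @ W + b_h)
    # update from the difference between the two phases
    W += lr * (np.outer(v0, ph0) - np.outer(pv1, ph1))
    b_v += lr * (v0 - pv1)
    b_h += lr * (ph0 - ph1)
    return W, b_v, b_h

rng = np.random.default_rng(0)
W = rng.normal(0, 0.01, (60, 1024))   # layer-1 to layer-2 RBM: W_{1,2}
b_v, b_h = np.zeros(60), np.zeros(1024)   # biases B_1 and B_2
W, b_v, b_h = cd1_step(rng.random(60), W, b_v, b_h)
```

Training the RBM stack then just repeats this layer pair by layer pair, feeding each RBM's hidden activations to the next one as its visible data.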
1-3) extracting the confusing voice data;
all training voice data of the 800 speakers are recognized with the first DNN model obtained by the training in step 1-2). The threshold is set to 0.85 (generally set to 0.7-0.9). If, in the recognition result of a piece of training voice data, the probability of the speaker actually corresponding to that data is smaller than the set threshold, the data is poorly discriminated, and it is taken as confusing voice data for training the second DNN model; if the probability is greater than or equal to the set threshold, the training voice data is easy to distinguish and is not used as confusing voice data;
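The extraction rule above can be sketched as a simple filter. The function name is hypothetical; `probs[spk]` stands for the first DNN's output probability for the true speaker of each utterance.

```python
def extract_confusable(utterance_probs, labels, threshold=0.85):
    """Keep utterances whose true-speaker probability under the first
    DNN falls below the threshold; these become the confusing voice
    data used to train the second DNN model."""
    confusable = []
    for probs, spk in zip(utterance_probs, labels):
        if probs[spk] < threshold:
            confusable.append((probs, spk))
    return confusable
```

Utterances the first model already classifies confidently are discarded, so the second model concentrates its capacity on the hard cases.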
1-4) establishing a second DNN model and training, specifically comprising the following steps:
1-4-1) setting the layer number and the node number of the second DNN model;
the second DNN model structure is shown in fig. 3: the second DNN model has 5 layers in total, where layer 1 is the input layer, layers 2 to 4 are hidden layers (3 layers in total), and layer 5 is the output layer. The cross lines represent the connections between nodes: the nodes of any two adjacent layers of the second DNN model are fully connected, and nodes within the same layer are not connected. The input layer of the second DNN model corresponds to the features of the speech data, in this embodiment the mel cepstrum features of the confusing voice data extracted in step 1-3) and their first and second derivatives, 60 dimensions in total, so the number of input-layer nodes is set to N1 = 60. The output layer corresponds to the probability of each speaker; the number of output-layer nodes is the number of speakers contained in the confusing voice data, 800 in this embodiment. Because the speaker corresponding to each piece of training data is known and the confusing voice data is extracted from the training data, the speaker corresponding to each frame of confusing voice data is known, so the total number of speakers in the confusing voice data can be counted. The number of hidden layers is set to 3 (generally set to 3-5), the number of nodes of the middle hidden layer (layer 3) is set to N3 = 300, and N2 and N4 are set to 1024.
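Sizing the second model's output layer only requires counting the distinct speakers left in the confusing voice data, since every training utterance carries a speaker label. A one-line sketch (hypothetical function name):

```python
def count_speakers(confusable_labels):
    """The output-layer size of the second DNN equals the number of
    distinct speakers appearing in the confusing voice data."""
    return len(set(confusable_labels))
```

In this embodiment every one of the 800 speakers contributes some confusable speech, so the count is again 800; in general it may be smaller.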
1-4-2) training the second DNN model to obtain model parameters;
the second DNN model is trained on the mel cepstrum features of the confusing voice data obtained in step 1-3) and their first- and second-order derivatives. The parameters of the second DNN model comprise the connection weights of adjacent layers W_{i,i+1} (i = 1, …, 4) and the bias of each node B_j (j = 1, …, 5). Unsupervised training is performed first, followed by supervised training. During unsupervised training, every pair of adjacent layers in the second DNN model (layers 1 and 2, layers 2 and 3, …, layers 4 and 5 in fig. 3) is treated as a restricted Boltzmann machine, 4 in total, trained layer by layer with the contrastive divergence (CD) algorithm: first the restricted Boltzmann machine formed by layers 1 and 2 is trained, yielding B1, B2 and W12 of the second DNN model parameters; then the restricted Boltzmann machine formed by layers 2 and 3 is trained, yielding B3 and W23; all restricted Boltzmann machines are trained in sequence to obtain initial values of the second DNN model parameters. During supervised training, starting from the initial values obtained by unsupervised training, the second DNN model parameters are fine-tuned with the back-propagation algorithm to obtain the final second DNN model parameters.
The above steps constitute the model training stage; once the two DNN models are obtained, speaker recognition can be performed;
2) the speaker identification stage specifically comprises the following steps:
2-1) acquiring the voice data to be recognized of one of the 800 speakers to be recognized. The voice data to be recognized is also obtained by telephone recording, but the speaker corresponding to it is unknown and needs to be identified by the method provided by the invention. The voice data to be recognized and the training voice data are different utterances. The voice data to be recognized is preprocessed, and its mel cepstrum features and their first and second derivatives, 60 dimensions in total, are extracted.
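A sketch of the 60-dimensional feature construction used throughout the method. Assumptions for illustration: 20 static mel-cepstral coefficients per frame, and `np.gradient` central differences standing in for the regression-window delta formula common in practice.

```python
import numpy as np

def add_deltas(mfcc):
    """Append first- and second-order time derivatives to a (T, 20)
    mel-cepstrum matrix, giving the 60-dimensional per-frame feature
    fed to both DNN models."""
    d1 = np.gradient(mfcc, axis=0)   # first derivative over frames
    d2 = np.gradient(d1, axis=0)     # second derivative over frames
    return np.hstack([mfcc, d1, d2])
```

The same routine is applied to training speech, confusing speech, and speech to be recognized, so all three stages see identically shaped inputs.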
2-2) the 60-dimensional features of the voice data to be recognized obtained in step 2-1) are input into the first DNN model for recognition, and the output layer outputs the recognition result of the voice data to be recognized, namely the probabilities that the voice data corresponds to each of the 800 speakers; each node of the output layer outputs the probability of one speaker, 800 probabilities in total.
2-3) a decision threshold is set, and it is judged whether any of the 800 probabilities in the recognition result is greater than the threshold of 0.85: if so, the speaker corresponding to that probability is judged to be the speaker of the voice data to be recognized, and recognition ends; otherwise, go to step 2-4);
2-4) if no probability in the recognition result of step 2-3) is greater than the decision threshold of 0.85, the voice data to be recognized is recognized a second time with the second DNN model; according to the recognition result of the second DNN model, the speaker corresponding to the maximum output probability is judged to be the speaker of the voice data to be recognized, and recognition ends.
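The two-stage decision of steps 2-2) to 2-4) can be sketched as follows. The DNN models are passed in as callables that return per-speaker probability vectors; the names are illustrative.

```python
import numpy as np

def recognize(features, first_dnn, second_dnn, threshold=0.85):
    """Accept the first DNN's answer only when its top probability
    clears the threshold; otherwise fall back to the second DNN
    trained on the confusing voice data."""
    probs = first_dnn(features)
    if probs.max() > threshold:
        return int(probs.argmax())
    return int(second_dnn(features).argmax())
```

Easy utterances are resolved by the first model alone; only the ambiguous ones pay the cost of the second pass.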
As will be understood by those skilled in the art, the method of the present invention can be implemented by a program, and the program can be stored in a computer-readable storage medium.
While the invention has been described with reference to a specific embodiment, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the scope of the invention.

Claims (1)

1. A speaker recognition method based on secondary modeling, characterized by comprising a model training stage and a speaker recognition stage; in the model training stage, training voice data of all speakers to be recognized is acquired and preprocessed; a first DNN model is obtained by training on the training voice data; the training voice data is recognized with the first DNN model, and the confusing voice data is extracted; a second DNN model is obtained by training on the confusing voice data; in the speaker recognition stage, voice data to be recognized is acquired and preprocessed; the voice data to be recognized is recognized with the first DNN model, and if the recognition probability is greater than a decision threshold, the speaker corresponding to the maximum output probability in the recognition result is identified as the speaker of the voice data to be recognized; otherwise, the voice data to be recognized undergoes secondary recognition with the second DNN model, and the speaker corresponding to the maximum output probability in that recognition result is the speaker of the voice data to be recognized; the method comprises the following steps:
1) a model training stage; the method specifically comprises the following steps:
1-1) acquiring training voice data of all speakers to be recognized, wherein the speaker corresponding to each piece of training voice data is known; preprocessing the acquired training voice data, extracting Mel cepstrum characteristics corresponding to all the training voice data, and calculating first and second derivatives of the Mel cepstrum characteristics, which are 60-dimensional in total;
1-2) establishing a first DNN model and training, and specifically comprising the following steps:
1-2-1) setting the layer number and the node number of a first DNN model;
the first DNN model is divided into an input layer, hidden layers and an output layer; the input layer corresponds to the features of the training voice data, namely the mel cepstrum features of the training voice data obtained in step 1-1) and their first and second derivatives, 60 dimensions in total, so the number of input-layer nodes is set to 60; the output layer corresponds to the probability of each speaker, the number of output-layer nodes is the number of all speakers to be identified, and each node outputs the probability corresponding to one speaker; the hidden layers automatically extract features at different levels, and the number of nodes in each hidden layer is the dimension of the feature extracted by that layer;
1-2-2) training a first DNN model to obtain a first DNN model parameter;
the first DNN model is trained on the mel cepstrum features and first and second derivatives of the training voice data of all speakers to be recognized, the model parameters comprising the connection weights between adjacent layers and the bias of each node; unsupervised training is performed first and supervised training afterwards: during unsupervised training, every pair of adjacent layers in the first DNN model is treated as a restricted Boltzmann machine, and all restricted Boltzmann machines are trained in sequence with the contrastive divergence algorithm to obtain initial values of the first DNN model parameters; during supervised training, starting from the initial values obtained by unsupervised training, the first DNN model parameters are fine-tuned with the back-propagation algorithm to obtain the final first DNN model parameters;
1-3) extracting the confusing voice data;
training voice data of all speakers to be recognized is recognized with the first DNN model obtained by training in step 1-2), and a threshold is set; if, in the recognition result of a piece of training voice data, the probability of the speaker corresponding to that data is smaller than the set threshold, the data is poorly discriminated, and it is extracted as confusing voice data for training the second DNN model; if the probability is greater than or equal to the set threshold, the training voice data is easy to distinguish and is not used as confusing voice data;
1-4) establishing a second DNN model and training, specifically comprising the following steps:
1-4-1) setting the layer number and the node number of the second DNN model;
the second DNN model is divided into an input layer, hidden layers and an output layer; the input layer corresponds to the features of the confusing voice data, namely the mel cepstrum features of the confusing voice data extracted in step 1-3) and their first and second derivatives, 60 dimensions in total, so the number of input-layer nodes is set to 60; the output layer corresponds to the probability of each speaker, the number of output-layer nodes is the number of speakers contained in the confusing voice data, and each node outputs the probability corresponding to one speaker; the hidden layers automatically extract features at different levels, and the number of nodes in each hidden layer is the dimension of the feature extracted by that layer;
1-4-2) training the second DNN model to obtain second DNN model parameters;
the second DNN model is trained on the mel cepstrum features of the confusing voice data obtained in step 1-3) and their first- and second-order derivatives, the model parameters comprising the connection weights between adjacent layers and the bias of each node; unsupervised training is performed first and supervised training afterwards: during unsupervised training, every pair of adjacent layers in the second DNN model is treated as a restricted Boltzmann machine, and all restricted Boltzmann machines are trained in sequence with the contrastive divergence algorithm to obtain initial values of the second DNN model parameters; during supervised training, starting from the initial values obtained by unsupervised training, the second DNN model parameters are fine-tuned with the back-propagation algorithm to obtain the final second DNN model parameters;
2) a speaker identification stage; the method specifically comprises the following steps:
2-1) acquiring voice data to be recognized of one of speakers to be recognized, preprocessing the voice data to be recognized, and extracting the Mel cepstrum characteristics and first and second derivatives of the voice data to be recognized, wherein the total dimension is 60;
2-2) inputting the 60-dimensional characteristics of the voice data to be recognized obtained in the step 2-1) into the first DNN model obtained in the step 1-2) for recognition, and outputting the recognition result of the voice data to be recognized by an output layer, wherein the voice data corresponds to the probability of each speaker in the training voice data respectively, and the output of each node of the output layer corresponds to the probability of each speaker respectively;
2-3) setting a judgment threshold value, and judging whether a result with the probability larger than the judgment threshold value exists in the identification result of the step 2-2): if so, the speaker corresponding to the maximum output probability in the first DNN model recognition result is the speaker of the voice data to be recognized, and the recognition is finished; if not, the step 2-4) is carried out;
2-4) if the recognition result in the step 2-3) does not have a result with the probability larger than the judgment threshold, performing secondary recognition on the voice data to be recognized by using a second DNN model; and the speaker corresponding to the maximum output probability in the second DNN model recognition result is the speaker of the voice data to be recognized, and the recognition is finished.
CN201710031899.7A 2017-01-17 2017-01-17 Speaker identification method based on secondary modeling Active CN106898355B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710031899.7A CN106898355B (en) 2017-01-17 2017-01-17 Speaker identification method based on secondary modeling

Publications (2)

Publication Number Publication Date
CN106898355A CN106898355A (en) 2017-06-27
CN106898355B true CN106898355B (en) 2020-04-14

Family

ID=59198262

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710031899.7A Active CN106898355B (en) 2017-01-17 2017-01-17 Speaker identification method based on secondary modeling

Country Status (1)

Country Link
CN (1) CN106898355B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107274890B (en) * 2017-07-04 2020-06-02 清华大学 Voiceprint spectrum extraction method and device
CN107274883B (en) * 2017-07-04 2020-06-02 清华大学 Voice signal reconstruction method and device
CN107610709B (en) * 2017-08-01 2021-03-19 百度在线网络技术(北京)有限公司 Method and system for training voiceprint recognition model
CN108305615B (en) * 2017-10-23 2020-06-16 腾讯科技(深圳)有限公司 Object identification method and device, storage medium and terminal thereof
CN109887511A (en) * 2019-04-24 2019-06-14 武汉水象电子科技有限公司 A kind of voice wake-up optimization method based on cascade DNN
CN111883175B (en) * 2020-06-09 2022-06-07 河北悦舒诚信息科技有限公司 Voiceprint library-based oil station service quality improving method
CN111724766B (en) * 2020-06-29 2024-01-05 合肥讯飞数码科技有限公司 Language identification method, related equipment and readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1264887A (en) * 2000-03-31 2000-08-30 清华大学 Non-particular human speech recognition and prompt method based on special speech recognition chip
CN1588536A (en) * 2004-09-29 2005-03-02 上海交通大学 State structure regulating method in sound identification
CN101231848A (en) * 2007-11-06 2008-07-30 安徽科大讯飞信息科技股份有限公司 Method for performing pronunciation error detecting based on holding vector machine
CN105761720A (en) * 2016-04-19 2016-07-13 北京地平线机器人技术研发有限公司 Interaction system based on voice attribute classification, and method thereof

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9058806B2 (en) * 2012-09-10 2015-06-16 Cisco Technology, Inc. Speaker segmentation and recognition based on list of speakers

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A speaker verification method based on GMM-DNN; Li Jingyang et al.; Computer Applications and Software; 31 Dec. 2016; Vol. 33, No. 12; pp. 131-135 *


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20181128

Address after: 100085 Beijing Haidian District Shangdi Information Industry Base Pioneer Road 1 B Block 2 Floor 2030

Applicant after: Beijing Huacong Zhijia Technology Co., Ltd.

Address before: 100084 Tsinghua Yuan, Haidian District, Beijing, No. 1

Applicant before: Tsinghua University

GR01 Patent grant
GR01 Patent grant