CN114863934A - Voiceprint recognition model construction method based on integrated convolutional neural network - Google Patents


Info

Publication number
CN114863934A
Authority
CN
China
Prior art keywords
network
voiceprint recognition
training
basic
output
Prior art date
Legal status
Pending
Application number
CN202210684227.7A
Other languages
Chinese (zh)
Inventor
Zhang Gexiang (张葛祥)
He Yao (何瑶)
Tang Gang (汤刚)
Yang Qiang (杨强)
Current Assignee
Chengdu University of Technology
Original Assignee
Chengdu University of Technology
Priority date
Filing date
Publication date
Application filed by Chengdu University of Technology
Priority to CN202210684227.7A
Publication of CN114863934A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 - Training, enrolment or model building
    • G10L17/18 - Artificial neural networks; Connectionist approaches


Abstract

The invention discloses a voiceprint recognition model construction method based on an integrated convolutional neural network. First, the base networks are trained independently, and training stops when each base network's recognition accuracy no longer improves. Then the weights are trained separately: the predicted outputs of the base networks serve as the input of a secondary network that learns the weights of a weighted average; the weights adapt during training, which improves voiceprint recognition accuracy. The predicted output of each base network is a feature vector whose length equals the number of speakers. The output of the secondary network is a vector whose length equals the number of base networks; its elements are the weighted-average weights. In the voiceprint recognition model constructed by the invention, the weighted-average weights adapt to changes in the base networks' outputs; compared with a single base network, the integrated model improves the accuracy of voiceprint recognition.

Description

Voiceprint recognition model construction method based on integrated convolutional neural network
Technical Field
The invention relates to the technical field of voiceprint recognition, in particular to a voiceprint recognition model construction method based on an integrated convolutional neural network.
Background
Voiceprint recognition has certain advantages over fingerprint recognition and face recognition. Unlike fingerprint recognition, it requires no direct contact with the subject; unlike face recognition, it does not require removing a mask, it offers a degree of privacy protection, and it is convenient to collect.
With the development of deep learning, neural networks have been applied in many fields, and combining a neural network with voiceprint recognition can improve recognition accuracy. However, a single neural network can hardly achieve ideal accuracy in feature extraction and feature fusion without human intervention, and different network models attend to different aspects of the same task, so the results they obtain differ.
In the field of voiceprint recognition, multiple different neural networks can be integrated to perform recognition, but in existing integration strategies the weights of the weighted-average method are determined manually or by trying values one by one; they cannot change dynamically in a self-adaptive manner, and the process is inefficient.
Disclosure of Invention
The invention aims to provide a voiceprint recognition model construction method based on an integrated convolutional neural network.
The technical scheme by which the invention achieves this purpose is as follows:
In the voiceprint recognition model construction method based on an integrated convolutional neural network, two or more different convolutional neural networks serve as base networks and a single-hidden-layer BP neural network serves as the secondary network. Step 1, train the base networks;
1.1, preprocess the speech training set and extract speech features;
1.2, input the speech features into each base network for training, and finish training when the voiceprint recognition accuracy of each base network no longer improves;
step 2, train the secondary network;
2.1, preprocess the speech training set and extract speech features;
2.2, input the speech features into each trained base network, and use the predicted output of each base network as the input of the secondary network to train the weighted-average weights; multiply the secondary network's output with the corresponding predicted output of each base network to obtain the final prediction; finish training when the voiceprint recognition accuracy no longer improves. The predicted output of each base network is a feature vector whose length equals the number of speakers; the output of the secondary network is a vector whose length equals the number of base networks, and its elements are the weighted-average weights.
In a preferred technical scheme, the base networks are any two or more of EfficientNet, ResNet, GoogLeNet, VGG, and AlexNet.
In a further preferred technical scheme, the base networks are EfficientNet, ResNet, and GoogLeNet.
In a preferred embodiment, the speech features are MFCC features, Fbank features, or LPCC features.
In the voiceprint recognition model based on an integrated convolutional neural network constructed as above, the weighted-average weights adapt to changes in the base networks' outputs; compared with a single base network, the integrated model improves the accuracy of voiceprint recognition.
Drawings
FIG. 1 is a schematic diagram of a voiceprint recognition model based on an integrated convolutional neural network according to an embodiment.
Detailed Description
The invention provides a voiceprint recognition model construction method based on an integrated convolutional neural network. First, the base networks are trained independently, and training stops when each base network's recognition accuracy no longer improves. Then the weights are trained separately: the predicted outputs of the base networks serve as the input of a secondary network that learns the weights of a weighted average; the weights adapt during training, which improves voiceprint recognition accuracy. The predicted output of each base network is a feature vector whose length equals the number of speakers. The output of the secondary network is a vector whose length equals the number of base networks; its elements are the weighted-average weights.
Embodiment:
As shown in FIG. 1, the method for constructing the voiceprint recognition model based on the integrated convolutional neural network includes the following steps:
1. Select three neural networks, EfficientNet, ResNet, and GoogLeNet, as the base networks of the integrated model. Other convolutional neural network models, such as VGG and AlexNet, can also serve as base networks; the number of base networks should be two or more.
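As a concrete illustration, the three base networks could be instantiated as in the following minimal sketch. It assumes PyTorch/torchvision reference models (the patent names the architectures but no framework), and the function name, model variants, and input-channel handling are illustrative assumptions:

```python
# A minimal sketch, assuming PyTorch/torchvision variants of the three base
# networks; the patent does not specify a framework or exact model variants.
import torch
from torchvision import models

def make_base_networks(num_speakers: int):
    # Each base network ends in a num_speakers-way classifier, so its
    # prediction is a feature vector whose length equals the number of speakers.
    # Note: these reference models expect 3-channel image-like inputs, so a
    # 1-channel MFCC map would be replicated across channels (an assumption).
    effnet = models.efficientnet_b0(num_classes=num_speakers)
    resnet = models.resnet18(num_classes=num_speakers)
    googlenet = models.googlenet(num_classes=num_speakers,
                                 aux_logits=False, init_weights=True)
    return [effnet, resnet, googlenet]

base_nets = make_base_networks(num_speakers=855)  # 855 speakers in the embodiment
```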
2. Load the original audio, preprocess the raw speech data, and extract MFCC (Mel-frequency cepstral coefficient) features. The specific steps of MFCC feature extraction are as follows:
A1: Read the speech data and preprocess it, mainly by sampling, framing, and windowing, to obtain the sequence of speech frames x(n);
A2: Transform each preprocessed speech frame x(n) into its spectrum X(k) by FFT:
$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi n k / N}, \qquad 0 \le k \le N-1$$
where x(n) is a frame of the input speech signal, X(k) is its frequency-domain response, and N is the number of sampling points in the frame;
A3: Obtain the energy distribution X(i,k) on the spectrum and take its squared modulus to obtain the spectral line energy E(i,k); pass E(i,k) through the Mel filter bank and compute each filter's energy; take the logarithmic energy S(m) output by each filter; finally, apply the DCT (discrete cosine transform) to obtain the MFCC characteristic parameters:
$$C_n = \sum_{m=1}^{M} S(m)\, \cos\!\left(\frac{\pi n\,(m - 0.5)}{M}\right), \qquad n = 1, 2, \ldots$$
cn is the characteristic parameter and M is the number of filters.
Other characteristic parameters, such as Fbank or LPCC features, can also be extracted as the speech features in this step.
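For illustration, steps A1 to A3 can be sketched with librosa, an assumed library choice; the sampling rate, FFT size, hop length, and number of coefficients below are illustrative values, not taken from the patent:

```python
# A sketch of the MFCC extraction in steps A1-A3, assuming librosa; all
# numeric parameters are illustrative.
import librosa

def extract_mfcc(path: str, sr: int = 16000, n_mfcc: int = 13):
    y, _ = librosa.load(path, sr=sr)               # A1: read and resample
    mfcc = librosa.feature.mfcc(y=y, sr=sr,        # A2-A3: framing/windowing, FFT,
                                n_mfcc=n_mfcc,     # Mel filtering, log energy, DCT
                                n_fft=512, hop_length=160)
    return mfcc                                    # shape: (n_mfcc, n_frames)
```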
3. Train each of the three base networks on the extracted MFCC features, and save each network when its recognition accuracy no longer improves.
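The criterion "save the network when its accuracy no longer improves" could be realized as early stopping with a patience counter; the sketch below assumes PyTorch, and the patience value, `eval_acc` callback, optimizer, and loss function are illustrative assumptions:

```python
# A sketch of step 3's stopping rule as early stopping with patience; the
# training details are assumptions, since the patent fixes only the criterion.
import torch

def train_until_plateau(model, train_loader, eval_acc, optimizer, loss_fn,
                        patience: int = 5, ckpt: str = "best.pt"):
    best, stale = 0.0, 0
    while stale < patience:
        model.train()
        for features, speaker_ids in train_loader:
            optimizer.zero_grad()
            loss_fn(model(features), speaker_ids).backward()
            optimizer.step()
        acc = eval_acc(model)                      # accuracy after each epoch
        if acc > best:
            best, stale = acc, 0
            torch.save(model.state_dict(), ckpt)   # keep the best network
        else:
            stale += 1                             # no improvement this epoch
    return best
```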
4. Load the three base network models saved in step 3, and use the base networks' predicted outputs as the input of the secondary network to train the weighted-average weights. The output of each base network is a feature vector whose length equals the number of speakers; the secondary network is a single-hidden-layer BP neural network whose number of hidden nodes can be varied. The weighted average is computed as:
$$H(x) = \sum_{i=1}^{T} w_i\, h_i(x)$$
where $H(x)$ is the integrated result, $T$ is the number of neural networks, $h_i$ is the i-th neural network, and $x$ is the input speech; $h_i(x)$, the output of $h_i$ on the input speech $x$, is a feature vector whose length equals the number of speakers; $w_i$, the weight of network $h_i$, is obtained from the secondary-network training.
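A minimal sketch of this secondary network is given below. It assumes PyTorch; the hidden width and the softmax normalization of the weights are illustrative assumptions, since the patent specifies only a single-hidden-layer BP network whose T outputs are the weighted-average weights:

```python
# A sketch of the single-hidden-layer BP secondary network; the hidden width
# and softmax normalisation are assumptions for illustration.
import torch
import torch.nn as nn

class WeightNet(nn.Module):
    """Maps the T base-network predictions to the T weighted-average weights w_i."""
    def __init__(self, num_nets: int, num_speakers: int, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(num_nets * num_speakers, hidden),  # single hidden layer
            nn.ReLU(),
            nn.Linear(hidden, num_nets),
            nn.Softmax(dim=-1),                          # assumption: weights sum to 1
        )

    def forward(self, base_outputs: torch.Tensor) -> torch.Tensor:
        # base_outputs: (batch, T, num_speakers), the h_i(x) stacked together
        b, t, s = base_outputs.shape
        w = self.mlp(base_outputs.reshape(b, t * s))         # (batch, T) weights
        return (w.unsqueeze(-1) * base_outputs).sum(dim=1)   # H(x) = sum_i w_i h_i(x)
```

Because the weights are produced by the secondary network from the base networks' outputs, they change adaptively with those outputs rather than being fixed by hand.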
5. Fuse the three saved base networks and the weights into an integrated model, train the integrated network model, and stop training when the recognition accuracy no longer improves. After the secondary network is trained, the final prediction is obtained by multiplying the base networks' predictions by the trained weights and taking the weighted average.
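Combining the pieces, inference with the fused model might look like the following hypothetical sketch, reusing `make_base_networks` and `WeightNet` from the earlier sketches:

```python
# A hypothetical end-to-end inference sketch for step 5; not the patent's
# literal implementation.
import torch

@torch.no_grad()
def ensemble_predict(base_nets, weight_net, features):
    for net in base_nets:
        net.eval()                                                   # inference mode
    outs = torch.stack([net(features) for net in base_nets], dim=1)  # (batch, T, speakers)
    fused = weight_net(outs)                                         # adaptive weighted average H(x)
    return fused.argmax(dim=-1)                                      # predicted speaker index
```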
The voiceprint recognition model constructed in this embodiment was tested on the Free ST Chinese Mandarin Corpus. The data set contains 855 speakers with 120 utterances per speaker; 90% of the data served as the training set and 10% as the test set. The recognition accuracies of the three base networks were 93.54%, 96.76%, and 94.49%, and the accuracy after integration was 97.02%.

Claims (4)

1. A voiceprint recognition model construction method based on an integrated convolutional neural network, characterized in that the voiceprint recognition model takes two or more different convolutional neural networks as base networks and a single-hidden-layer BP neural network as a secondary network;
step 1, train the base networks;
1.1, preprocess the speech training set and extract speech features;
1.2, input the speech features into each base network for training, and finish training when the voiceprint recognition accuracy of each base network no longer improves;
step 2, train the secondary network;
2.1, preprocess the speech training set and extract speech features;
2.2, input the speech features into each trained base network, and use the predicted output of each base network as the input of the secondary network to train the weighted-average weights; multiply the secondary network's output with the corresponding predicted output of each base network to obtain the final prediction; finish training when the voiceprint recognition accuracy no longer improves; the predicted output of each base network is a feature vector whose length equals the number of speakers; the output of the secondary network is a vector whose length equals the number of base networks, and its elements are the weighted-average weights.
2. The voiceprint recognition model construction method based on an integrated convolutional neural network of claim 1, wherein the base networks are any two or more of EfficientNet, ResNet, GoogLeNet, VGG, and AlexNet.
3. The voiceprint recognition model construction method based on an integrated convolutional neural network of claim 1, wherein the base networks are EfficientNet, ResNet, and GoogLeNet.
4. The voiceprint recognition model construction method based on an integrated convolutional neural network of claim 1, wherein the speech features are MFCC features, Fbank features, or LPCC features.
CN202210684227.7A 2022-06-17 2022-06-17 Voiceprint recognition model construction method based on integrated convolutional neural network Pending CN114863934A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210684227.7A CN114863934A (en) 2022-06-17 2022-06-17 Voiceprint recognition model construction method based on integrated convolutional neural network


Publications (1)

Publication Number Publication Date
CN114863934A (en) 2022-08-05

Family

ID=82624084

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210684227.7A Pending CN114863934A (en) 2022-06-17 2022-06-17 Voiceprint recognition model construction method based on integrated convolutional neural network

Country Status (1)

Country Link
CN (1) CN114863934A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102800316A (en) * 2012-08-30 2012-11-28 重庆大学 Optimal codebook design method for voiceprint recognition system based on nerve network
CN110459241A (en) * 2019-08-30 2019-11-15 厦门亿联网络技术股份有限公司 A kind of extracting method and system for phonetic feature
CN111723679A (en) * 2020-05-27 2020-09-29 上海五零盛同信息科技有限公司 Face and voiceprint authentication system and method based on deep migration learning
CN113506259A (en) * 2021-07-06 2021-10-15 长江大学 Image blur distinguishing method and system based on converged network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
韩志艳 (Han Zhiyan): "Research on Speech Recognition and Speech Visualization Technology" (《语音识别及语音可视化技术研究》), Northeastern University Press (东北大学出版社), pages 50-52 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination