CN114863934A - Voiceprint recognition model construction method based on integrated convolutional neural network - Google Patents


Info

Publication number
CN114863934A
Authority
CN
China
Prior art keywords
network
voiceprint recognition
training
basic
output
Prior art date
Legal status
Pending
Application number
CN202210684227.7A
Other languages
Chinese (zh)
Inventor
Zhang Gexiang (张葛祥)
He Yao (何瑶)
Tang Gang (汤刚)
Yang Qiang (杨强)
Current Assignee
Chengdu University of Technology
Original Assignee
Chengdu University of Technology
Priority date
Filing date
Publication date
Application filed by Chengdu University of Technology
Priority to CN202210684227.7A
Publication of CN114863934A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 - Training, enrolment or model building
    • G10L17/18 - Artificial neural networks; Connectionist approaches


Abstract

The invention discloses a voiceprint recognition model construction method based on an integrated convolutional neural network. First, the base networks are trained independently, and training stops when each base network's recognition accuracy no longer improves. Then the weights are trained separately: the predicted outputs of the base networks serve as the input of a secondary network that learns the weights of a weighted average; the weights adapt during training, which improves voiceprint recognition accuracy. The predicted output of each base network is a feature vector whose length equals the number of speakers. The output of the secondary network is a vector whose length equals the number of base networks; its elements are the weighted-average weights. In the voiceprint recognition model constructed by the invention, the weighted-average weights adapt to changes in the base networks' outputs; compared with a single base network, the integrated model improves the accuracy of voiceprint recognition.

Description

Voiceprint recognition model construction method based on integrated convolutional neural network
Technical Field
The invention relates to the technical field of voiceprint recognition, in particular to a voiceprint recognition model construction method based on an integrated convolutional neural network.
Background
Voiceprint recognition has certain advantages over fingerprint recognition and face recognition. Unlike fingerprint recognition, it requires no direct contact with the subject; unlike face recognition, it does not require removing a mask, it offers a degree of privacy protection, and it is convenient to collect.
With the development of deep learning, neural networks have been applied in many fields, and combining a neural network with voiceprint recognition can improve recognition accuracy. However, a single neural network can hardly achieve ideal accuracy in feature extraction and feature fusion without human intervention, and different network models attend to different aspects of the same task, so the results they obtain differ.
In the field of voiceprint recognition, multiple different neural networks can be integrated to perform recognition, but in existing integration strategies the weights of the weighted-average method are determined manually or by trying values one by one; they cannot change dynamically in a self-adaptive manner, and the process is inefficient.
Disclosure of Invention
The invention aims to provide a voiceprint recognition model construction method based on an integrated convolutional neural network.
The technical scheme by which the invention achieves this purpose is as follows:
In the voiceprint recognition model construction method based on an integrated convolutional neural network, two or more different convolutional neural networks serve as base networks and a single-hidden-layer BP neural network serves as the secondary network. Step 1, train the base networks;
1.1, preprocess the speech training set and extract speech features;
1.2, input the speech features into each base network for training, and finish training when the voiceprint recognition accuracy of each base network no longer improves;
step 2, train the secondary network;
2.1, preprocess the speech training set and extract speech features;
2.2, input the speech features into each trained base network, and use the predicted output of each base network as the input of the secondary network to train the weighted-average weights; multiply the secondary network's output with the corresponding predicted output of each base network to obtain the final prediction; finish training when the voiceprint recognition accuracy no longer improves. The predicted output of each base network is a feature vector whose length equals the number of speakers; the output of the secondary network is a vector whose length equals the number of base networks, and its elements are the weighted-average weights.
In a preferred technical scheme, the base networks are any two or more of EfficientNet, ResNet, GoogLeNet, VGG, and AlexNet.
In a further preferred technical scheme, the base networks are EfficientNet, ResNet, and GoogLeNet.
In a preferred embodiment, the speech features are MFCC features, Fbank features, or LPCC features.
In the voiceprint recognition model based on an integrated convolutional neural network constructed as above, the weighted-average weights adapt to changes in the base networks' outputs; compared with a single base network, the integrated model improves the accuracy of voiceprint recognition.
Drawings
FIG. 1 is a schematic diagram of a voiceprint recognition model based on an integrated convolutional neural network according to an embodiment.
Detailed Description
The invention provides a voiceprint recognition model construction method based on an integrated convolutional neural network. First, the base networks are trained independently, and training stops when each base network's recognition accuracy no longer improves. Then the weights are trained separately: the predicted outputs of the base networks serve as the input of a secondary network that learns the weights of a weighted average; the weights adapt during training, which improves voiceprint recognition accuracy. The predicted output of each base network is a feature vector whose length equals the number of speakers. The output of the secondary network is a vector whose length equals the number of base networks; its elements are the weighted-average weights.
Embodiment:
As shown in FIG. 1, the method for constructing the voiceprint recognition model based on the integrated convolutional neural network includes the following steps:
1. Select three neural networks, EfficientNet, ResNet, and GoogLeNet, as the base networks of the integrated model. Other convolutional neural network models, such as VGG and AlexNet, can also serve as base networks; the number of base networks should be two or more.
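As a concrete illustration, the three base networks could be instantiated as in the following minimal sketch. It assumes PyTorch/torchvision reference models (the patent names the architectures but no framework), and the function name, model variants, and input-channel handling are illustrative assumptions:

```python
# A minimal sketch, assuming PyTorch/torchvision variants of the three base
# networks; the patent does not specify a framework or exact model variants.
import torch
from torchvision import models

def make_base_networks(num_speakers: int):
    # Each base network ends in a num_speakers-way classifier, so its
    # prediction is a feature vector whose length equals the number of speakers.
    # Note: these reference models expect 3-channel image-like inputs, so a
    # 1-channel MFCC map would be replicated across channels (an assumption).
    effnet = models.efficientnet_b0(num_classes=num_speakers)
    resnet = models.resnet18(num_classes=num_speakers)
    googlenet = models.googlenet(num_classes=num_speakers,
                                 aux_logits=False, init_weights=True)
    return [effnet, resnet, googlenet]

base_nets = make_base_networks(num_speakers=855)  # 855 speakers in the embodiment
```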
2. Load the original audio, preprocess the raw speech data, and extract MFCC (Mel-frequency cepstral coefficient) features. The specific steps of MFCC feature extraction are as follows:
A1: Read the speech data and preprocess it, mainly by sampling, framing, and windowing, to obtain the sequence of speech frames x(n);
A2: Transform each preprocessed speech frame x(n) into its spectrum X(k) by FFT:
$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi n k / N}, \qquad 0 \le k \le N-1$$
where x(n) is a frame of the input speech signal, X(k) is its frequency-domain response, and N is the number of sampling points in the frame;
A3: Obtain the energy distribution X(i,k) on the spectrum and take its squared modulus to obtain the spectral line energy E(i,k); pass E(i,k) through the Mel filter bank and compute each filter's energy; take the logarithmic energy S(m) output by each filter; finally, apply the DCT (discrete cosine transform) to obtain the MFCC characteristic parameters:
$$C_n = \sum_{m=1}^{M} S(m)\, \cos\!\left(\frac{\pi n\,(m - 0.5)}{M}\right), \qquad n = 1, 2, \ldots$$
cn is the characteristic parameter and M is the number of filters.
Other characteristic parameters, such as Fbank or LPCC features, can also be extracted as the speech features in this step.
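For illustration, steps A1 to A3 can be sketched with librosa, an assumed library choice; the sampling rate, FFT size, hop length, and number of coefficients below are illustrative values, not taken from the patent:

```python
# A sketch of the MFCC extraction in steps A1-A3, assuming librosa; all
# numeric parameters are illustrative.
import librosa

def extract_mfcc(path: str, sr: int = 16000, n_mfcc: int = 13):
    y, _ = librosa.load(path, sr=sr)               # A1: read and resample
    mfcc = librosa.feature.mfcc(y=y, sr=sr,        # A2-A3: framing/windowing, FFT,
                                n_mfcc=n_mfcc,     # Mel filtering, log energy, DCT
                                n_fft=512, hop_length=160)
    return mfcc                                    # shape: (n_mfcc, n_frames)
```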
3. Train each of the three base networks on the extracted MFCC features, and save each network when its recognition accuracy no longer improves.
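The criterion "save the network when its accuracy no longer improves" could be realized as early stopping with a patience counter; the sketch below assumes PyTorch, and the patience value, `eval_acc` callback, optimizer, and loss function are illustrative assumptions:

```python
# A sketch of step 3's stopping rule as early stopping with patience; the
# training details are assumptions, since the patent fixes only the criterion.
import torch

def train_until_plateau(model, train_loader, eval_acc, optimizer, loss_fn,
                        patience: int = 5, ckpt: str = "best.pt"):
    best, stale = 0.0, 0
    while stale < patience:
        model.train()
        for features, speaker_ids in train_loader:
            optimizer.zero_grad()
            loss_fn(model(features), speaker_ids).backward()
            optimizer.step()
        acc = eval_acc(model)                      # accuracy after each epoch
        if acc > best:
            best, stale = acc, 0
            torch.save(model.state_dict(), ckpt)   # keep the best network
        else:
            stale += 1                             # no improvement this epoch
    return best
```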
4. Load the three base network models saved in step 3, and use the base networks' predicted outputs as the input of the secondary network to train the weighted-average weights. The output of each base network is a feature vector whose length equals the number of speakers; the secondary network is a single-hidden-layer BP neural network whose number of hidden nodes can be varied. The weighted average is computed as:
$$H(x) = \sum_{i=1}^{T} w_i\, h_i(x)$$
where $H(x)$ is the integrated result, $T$ is the number of neural networks, $h_i$ is the i-th neural network, and $x$ is the input speech; $h_i(x)$, the output of $h_i$ on the input speech $x$, is a feature vector whose length equals the number of speakers; $w_i$, the weight of network $h_i$, is obtained from the secondary-network training.
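A minimal sketch of this secondary network is given below. It assumes PyTorch; the hidden width and the softmax normalization of the weights are illustrative assumptions, since the patent specifies only a single-hidden-layer BP network whose T outputs are the weighted-average weights:

```python
# A sketch of the single-hidden-layer BP secondary network; the hidden width
# and softmax normalisation are assumptions for illustration.
import torch
import torch.nn as nn

class WeightNet(nn.Module):
    """Maps the T base-network predictions to the T weighted-average weights w_i."""
    def __init__(self, num_nets: int, num_speakers: int, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(num_nets * num_speakers, hidden),  # single hidden layer
            nn.ReLU(),
            nn.Linear(hidden, num_nets),
            nn.Softmax(dim=-1),                          # assumption: weights sum to 1
        )

    def forward(self, base_outputs: torch.Tensor) -> torch.Tensor:
        # base_outputs: (batch, T, num_speakers), the h_i(x) stacked together
        b, t, s = base_outputs.shape
        w = self.mlp(base_outputs.reshape(b, t * s))         # (batch, T) weights
        return (w.unsqueeze(-1) * base_outputs).sum(dim=1)   # H(x) = sum_i w_i h_i(x)
```

Because the weights are produced by the secondary network from the base networks' outputs, they change adaptively with those outputs rather than being fixed by hand.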
5. Fuse the three saved base networks and the weights into an integrated model, train the integrated network model, and stop training when the recognition accuracy no longer improves. After the secondary network is trained, the final prediction is obtained by multiplying the base networks' predictions by the trained weights and taking the weighted average.
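Combining the pieces, inference with the fused model might look like the following hypothetical sketch, reusing `make_base_networks` and `WeightNet` from the earlier sketches:

```python
# A hypothetical end-to-end inference sketch for step 5; not the patent's
# literal implementation.
import torch

@torch.no_grad()
def ensemble_predict(base_nets, weight_net, features):
    for net in base_nets:
        net.eval()                                                   # inference mode
    outs = torch.stack([net(features) for net in base_nets], dim=1)  # (batch, T, speakers)
    fused = weight_net(outs)                                         # adaptive weighted average H(x)
    return fused.argmax(dim=-1)                                      # predicted speaker index
```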
The voiceprint recognition model constructed in this embodiment was tested on the Free ST Chinese Mandarin Corpus. The data set contains 855 speakers with 120 utterances per speaker; 90% of the data served as the training set and 10% as the test set. The recognition accuracies of the three base networks were 93.54%, 96.76%, and 94.49%, and the accuracy after integration was 97.02%.

Claims (4)

1. A voiceprint recognition model construction method based on an integrated convolutional neural network, characterized in that the voiceprint recognition model takes two or more different convolutional neural networks as base networks and a single-hidden-layer BP neural network as a secondary network;
step 1, train the base networks;
1.1, preprocess the speech training set and extract speech features;
1.2, input the speech features into each base network for training, and finish training when the voiceprint recognition accuracy of each base network no longer improves;
step 2, train the secondary network;
2.1, preprocess the speech training set and extract speech features;
2.2, input the speech features into each trained base network, and use the predicted output of each base network as the input of the secondary network to train the weighted-average weights; multiply the secondary network's output with the corresponding predicted output of each base network to obtain the final prediction; finish training when the voiceprint recognition accuracy no longer improves; the predicted output of each base network is a feature vector whose length equals the number of speakers; the output of the secondary network is a vector whose length equals the number of base networks, and its elements are the weighted-average weights.
2. The voiceprint recognition model construction method based on an integrated convolutional neural network of claim 1, wherein the base networks are any two or more of EfficientNet, ResNet, GoogLeNet, VGG, and AlexNet.
3. The voiceprint recognition model construction method based on an integrated convolutional neural network of claim 1, wherein the base networks are EfficientNet, ResNet, and GoogLeNet.
4. The voiceprint recognition model construction method based on an integrated convolutional neural network of claim 1, wherein the speech features are MFCC features, Fbank features, or LPCC features.
CN202210684227.7A 2022-06-17 2022-06-17 Voiceprint recognition model construction method based on integrated convolutional neural network Pending CN114863934A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210684227.7A CN114863934A (en) 2022-06-17 2022-06-17 Voiceprint recognition model construction method based on integrated convolutional neural network


Publications (1)

Publication Number Publication Date
CN114863934A (en) 2022-08-05

Family

ID=82624084

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210684227.7A Pending CN114863934A (en) 2022-06-17 2022-06-17 Voiceprint recognition model construction method based on integrated convolutional neural network

Country Status (1)

Country Link
CN (1) CN114863934A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102800316A (en) * 2012-08-30 2012-11-28 重庆大学 Optimal codebook design method for voiceprint recognition system based on nerve network
CN110459241A (en) * 2019-08-30 2019-11-15 厦门亿联网络技术股份有限公司 A kind of extracting method and system for phonetic feature
CN111723679A (en) * 2020-05-27 2020-09-29 上海五零盛同信息科技有限公司 Face and voiceprint authentication system and method based on deep migration learning
CN113506259A (en) * 2021-07-06 2021-10-15 长江大学 Image blur distinguishing method and system based on converged network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
韩志艳 (Han Zhiyan): "Research on Speech Recognition and Speech Visualization Technology" (《语音识别及语音可视化技术研究》), Northeastern University Press (东北大学出版社), pages 50-52 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination