CN111310836B - Voiceprint recognition integrated model defending method and defending device based on spectrogram - Google Patents

Voiceprint recognition integrated model defending method and defending device based on spectrogram

Info

Publication number
CN111310836B
CN111310836B (application CN202010105807.7A)
Authority
CN
China
Prior art keywords
voiceprint recognition
spectrogram
sample
integrated model
voiceprint
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010105807.7A
Other languages
Chinese (zh)
Other versions
CN111310836A (en
Inventor
陈晋音
叶林辉
王雪柯
郑喆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202010105807.7A priority Critical patent/CN111310836B/en
Publication of CN111310836A publication Critical patent/CN111310836A/en
Application granted granted Critical
Publication of CN111310836B publication Critical patent/CN111310836B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/004Artificial life, i.e. computing arrangements simulating life
    • G06N3/006Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/04Training, enrolment or model building
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/06Decision making techniques; Pattern matching strategies

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Acoustics & Sound (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a spectrogram-based defense method for a voiceprint recognition integrated model, comprising the following steps: (1) collecting audio files and converting them into spectrograms, which serve as benign samples; (2) training a plurality of voiceprint recognition models with the benign samples to obtain a plurality of trained voiceprint recognition models; (3) integrating, by a voting mechanism, the better-performing models screened from the trained voiceprint recognition models to form a voiceprint recognition integrated model, and retraining the integrated model with benign samples; (4) adopting a cuckoo search algorithm to attack each voiceprint recognition model and generate adversarial samples; (5) retraining the voiceprint recognition integrated model obtained in step (3) with the adversarial samples and benign samples to obtain an attack-resistant voiceprint recognition integrated model; (6) performing defended recognition on the spectrogram corresponding to an audio file with the voiceprint recognition integrated model obtained in step (5).

Description

Voiceprint recognition integrated model defending method and defending device based on spectrogram
Technical Field
The invention belongs to the field of information security and, in particular, relates to a spectrogram-based defense method and defense device for a voiceprint recognition integrated model.
Background
Because the vocal organs — tongue, teeth, lungs and so on — differ greatly in size and shape from person to person, each person's speech, and hence its spectrogram, is different; in effect, every voice carries unique identity information, and voiceprint recognition uses this characteristic to identify a speaker. Voiceprint recognition is one of the biometric techniques and is divided into text-dependent and text-independent voiceprint recognition. Text-independent voiceprint recognition refers to a speaker recognition system that places no requirement on the spoken content, so the speaker may say anything. Text-dependent voiceprint recognition refers to a speaker recognition system that requires the user to pronounce pre-specified content. A text-dependent model cannot identify the user once the pronunciation deviates from the specified text, so its application range is narrow. A text-independent model places no requirement on the spoken content, is convenient to use and more widely applicable, but is harder to implement.
Deep neural networks can fully exploit the correlation among speech features by training on concatenated features of consecutive frames, greatly improving the recognition rate of voiceprint recognition systems. While improving accuracy and bringing convenience, deep-neural-network-based voiceprint recognition also introduces risks. Deep neural networks are vulnerable to adversarial attacks that add a fine perturbation to the input data: after obtaining the characteristics of a target speaker, an attacker can add a carefully computed perturbation to another speaker's audio so that the generated adversarial sample is misidentified as the target speaker by the voiceprint recognition model, posing serious safety hazards to voiceprint recognition systems and to personal and property security.
Existing voiceprint recognition attack methods are mainly divided into white-box and black-box attacks. In a white-box attack, the attacker knows the internal parameters of the model, computes the gradient of the model with respect to the noise through backpropagation, and iteratively optimizes the noise to be added, thereby generating an adversarial sample. In a black-box attack, the attacker does not know the model parameters and instead optimizes the required perturbation with optimization algorithms such as genetic algorithms or particle swarm optimization to generate an adversarial sample. Both kinds of attack can cause a voiceprint recognition system to misidentify the adversarial sample as the target speaker.
Disclosure of Invention
To address the low accuracy, poor robustness and susceptibility to adversarial-sample attacks of existing voiceprint recognition systems, the invention provides a spectrogram-based defense method and defense device for a voiceprint recognition integrated model.
The technical scheme of the invention is as follows:
a voiceprint recognition integrated model defending method based on a spectrogram comprises the following steps:
(1) Collecting an audio file, and converting the audio file into a spectrogram, wherein the spectrogram is used as a benign sample;
(2) Training a plurality of image recognition models with the benign samples so that they perform voiceprint recognition, thereby obtaining a plurality of trained image-based voiceprint recognition models;
(3) Integrating the plurality of image-based voiceprint recognition models trained in step (2) by a voting mechanism to form a voiceprint recognition integrated model, and retraining the integrated model with benign samples;
(4) Adopting a cuckoo search algorithm to attack each of the voiceprint recognition models, generating adversarial samples, and converting the adversarial samples into spectrograms that serve as malicious samples;
(5) Retraining the image-based voiceprint recognition integrated model obtained in step (3) with the malicious samples and benign samples to obtain a voiceprint recognition integrated model capable of resisting attacks;
(6) Performing defended recognition on the spectrogram corresponding to an audio file with the voiceprint recognition integrated model obtained in step (5).
Preferably, the specific steps of converting the audio file into a spectrogram are:
framing the audio, windowing each frame of voice signal, and performing short-time Fourier transform;
calculating the power spectrum of the short-time Fourier transform result and normalizing it to obtain the spectrogram; the spectrogram together with its corresponding speaker label forms a benign sample.
Preferably, the image recognition model adopts VGG16 or VGG19.
Preferably, the specific process of training the plurality of voiceprint recognition models by using the benign samples is as follows:
preprocessing each spectrogram and resizing it to 224 × 224 × 3 to obtain spectrogram samples;
for a spectrogram sample x_i whose output confidence through the voiceprint recognition model is y_ipre, using cross entropy as the loss function L(x_i) to optimize the parameters of the voiceprint recognition model:
L(x_i) = -[y_i log y_ipre + (1 - y_i) log(1 - y_ipre)]
and testing the accuracy of the trained voiceprint recognition model on the spectrograms of the test set; if the recognition accuracy does not meet the requirement, retraining the voiceprint recognition model until it does.
The specific process of the step (3) is as follows:
integrating a plurality of voiceprint recognition models based on images by utilizing a voting mechanism to obtain a voiceprint recognition integrated model;
before voting, converting the prediction confidence returned by each voiceprint recognition model into a predicted category, i.e., taking the category label with the highest confidence as that model's prediction result;
after every voiceprint recognition model has produced a prediction for a spectrogram sample, if a predicted category receives more than half of the models' votes, that category is the prediction result of the voiceprint recognition integrated model;
and training the voiceprint recognition integrated model with benign samples and testing it with the test set to refine it.
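The voting mechanism described above can be sketched as follows. This is an illustrative sketch, not the patent's code: the helper name `ensemble_predict` and the handling of the no-majority case are assumptions, since the method only specifies the more-than-half acceptance rule.

```python
import numpy as np

def ensemble_predict(confidences):
    """Majority-vote ensemble over per-model confidence vectors.

    confidences: list of 1-D arrays, one per voiceprint recognition model;
    each array holds that model's confidence for every speaker class.
    Returns the winning class index, or None when no class obtains more
    than half of the votes (the acceptance condition in the patent).
    """
    # Convert each model's confidences into a predicted class label.
    votes = [int(np.argmax(c)) for c in confidences]
    n_models = len(votes)
    counts = {}
    for v in votes:
        counts[v] = counts.get(v, 0) + 1
    best_class, best_count = max(counts.items(), key=lambda kv: kv[1])
    # Accept only if a strict majority of the models agree.
    return best_class if best_count > n_models / 2 else None
```

For example, with three models returning confidences for two speakers, two votes for the same speaker out of three constitute a majority and decide the integrated model's output.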
The device for defending the voiceprint recognition integrated model based on the spectrogram comprises a computer memory, a computer processor and a computer program which is stored in the computer memory and can be executed on the computer processor, wherein the computer processor realizes the defending method of the voiceprint recognition integrated model based on the spectrogram when executing the computer program.
In view of the possible weaknesses of voiceprint recognition systems and the limitations of existing attack methods, the invention studies converting speech into spectrograms and training image recognition models on the spectrograms to achieve voiceprint recognition. Integrating several trained image recognition models improves model accuracy while enabling the resulting voiceprint recognition model to resist adversarial samples, and adversarial training further strengthens the model's defense capability, thereby defending against both white-box and black-box attacks.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a method for defending a voiceprint recognition integrated model based on a spectrogram according to an embodiment;
FIG. 2 is a schematic diagram of a structure for obtaining an adversarial sample according to an embodiment;
FIG. 3 is a schematic diagram of retraining an integrated voiceprint recognition model provided by an embodiment.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the detailed description is presented by way of example only and is not intended to limit the scope of the invention.
Referring to fig. 1 to 3, the defending method of the voiceprint recognition integrated model based on the spectrogram provided by the embodiment includes the following steps:
1) A data set for voiceprint recognition model training was prepared, using the train-clean-100 subset of the LibriSpeech speech corpus. Each folder of train-clean-100 stores the audio of one speaker, so one folder corresponds to one speaker and the folder name effectively serves as the label;
2) The audio files in the folders are preprocessed and converted into spectrograms, which are stored in the corresponding folders; the file names are the class labels of the spectrograms, i.e., the identities of the speakers. The spectrogram data set is divided into a training set and a test set in a fixed ratio. The specific process is as follows:
step1: for each audio file x (n) in the train-claen-100 dataset, it is framed, each frame length being 25ms, during which time the speech signal is considered steady state. The windowing function of the audio signal after framing avoids high frequency part signal leakage. After framing and windowing, the voice signal is subjected to short-time Fourier transform:
where k ε {0,1, … N-1}, where N represents the number of samples contained in a frame of audio file and w (N-m) is a window function that slides along the time axis.
Step2: the power spectrum is obtained from X(n, k):
P(n, k) = |X(n, k)|² (2)
Step3: because the silent segments of speech contain substantial non-zero noise, the spectrogram is processed with max-min normalization. After normalization, the brightness and brightness distribution corresponding to the mean and variance of the spectrogram become more uniform. The normalization formula is:
G(a, b) = (P(a, b) - min P) / (max P - min P) (3)
In G(a, b), a denotes a time instant and b a frequency at time a; the magnitude of G(a, b) represents the energy of the audio component with frequency b at time a. A spectrogram can be drawn from G(a, b), with the energy of different frequency components at each time instant rendered in the same color but different shades.
Step4: the generated spectrograms are stored in folders according to their speakers, with the file name as the class label, i.e., the corresponding speaker; the generated spectrogram data set is divided into a training set and a test set in a fixed ratio.
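The Step1–Step4 conversion can be sketched as below. This is a minimal illustration, not the patent's implementation: the 10 ms hop length and the Hamming window are assumptions (the text fixes only the 25 ms frame length and that each frame is windowed), and the function name is hypothetical.

```python
import numpy as np

def spectrogram(x, sr=16000, frame_ms=25, hop_ms=10):
    """Convert a mono waveform into a max-min-normalized spectrogram.

    Framing (25 ms frames), windowing, short-time Fourier transform,
    power spectrum P = |X|^2, then max-min normalization so that
    G(a, b) lies in [0, 1]. Rows index time a, columns frequency b.
    """
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    assert len(x) >= frame_len, "signal shorter than one frame"
    window = np.hamming(frame_len)              # w(n - m) in equation (1)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop: i * hop + frame_len]
                       for i in range(n_frames)])
    X = np.fft.rfft(frames * window, axis=1)    # equation (1)
    P = np.abs(X) ** 2                          # equation (2)
    G = (P - P.min()) / (P.max() - P.min() + 1e-12)  # equation (3)
    return G
```

At a 16 kHz sampling rate a 25 ms frame holds 400 samples, so each time slice yields 201 frequency bins after the real FFT.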
3) Training a spectrogram-based voiceprint recognition model: a VGG16 model is trained on the generated spectrograms, with the file names as class labels, so that voiceprint recognition is achieved through image recognition. After training, the model is tested on the test set; if the recognition accuracy does not meet the requirement, training continues until it does. The specific steps are:
step1, preprocessing the image, and setting the size of the spectrogram to 224×224×3.
Step2, building a VGG16 model: an image recognition model based on a CNN structure, with 13 convolutional layers and 3 fully connected layers.
Step3, setting the relevant parameters and training. For a spectrogram sample x_i whose output confidence through the VGG16 model is y_ipre, cross entropy is used as the loss function:
L(x_i) = -[y_i log y_ipre + (1 - y_i) log(1 - y_ipre)] (4)
where y_i denotes the true label.
Step4, testing the accuracy of the recognition model on the test data set to ensure the preset recognition accuracy is reached; otherwise, the model structure and parameters are modified and the model is retrained.
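Equation (4) is the standard binary cross entropy. A minimal numpy version is sketched below; the helper name is hypothetical, and the clipping epsilon is an assumption added to guard against log(0).

```python
import numpy as np

def cross_entropy_loss(y_true, y_pre, eps=1e-12):
    """Binary cross entropy of equation (4):
    L(x_i) = -[y_i * log(y_ipre) + (1 - y_i) * log(1 - y_ipre)].

    y_true: real label y_i in {0, 1}; y_pre: model confidence in (0, 1).
    """
    y_pre = np.clip(y_pre, eps, 1.0 - eps)  # avoid log(0)
    return -(y_true * np.log(y_pre) + (1.0 - y_true) * np.log(1.0 - y_pre))
```

The loss is smallest when the confidence agrees with the label and grows as the prediction drifts away, which is what drives the parameter optimization in Step3.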
4) Replacing the model structure and repeating step 3), a plurality of spectrogram-based voiceprint recognition models with different structures are trained. After training, each image recognition model is tested on the test set; if the recognition accuracy does not meet the requirement, the model parameters are changed and training continues until every model meets the accuracy requirement. A plurality of spectrogram-based voiceprint recognition models are thereby obtained.
5) Integrating the obtained spectrogram-based voiceprint recognition models. The integrated model comprises several spectrogram-based voiceprint recognition models of different structures, and the outputs of the individual models are combined by voting. The integrated model is then trained again to further improve its recognition accuracy and robustness. The specific steps are:
step1: integrating the obtained voiceprint recognition models based on the spectrograms, and adopting a voting mechanism after integration.
Step2: before voting, converting the prediction confidence coefficient returned by each voiceprint recognition model into a prediction category, namely, using a category label corresponding to the highest confidence coefficient as a prediction result of the voiceprint recognition model.
Step3: after every model has produced its final prediction for the input sample x, if more than half of the models vote for a certain predicted class — i.e., if more than half of the outputs of the voiceprint recognition integrated model identify a spectrogram sample as speaker A — the audio corresponding to that sample is deemed to belong to speaker A;
Step4: the voiceprint recognition integrated model is trained with the train-clean-100 data set and tested with the test set, further improving its recognition accuracy and defense capability.
6) Attacking the spectrogram-based voiceprint recognition models: a cuckoo search algorithm is adopted to attack each spectrogram-based voiceprint recognition model obtained in step 4), iterating continuously to search for the optimal perturbation, which is superimposed on the original audio to generate an adversarial sample. The specific steps are:
step1: initializing a fitness function, and defining the fitness function as follows:
f=[y ti logy ipre +(1-y ti )log(1-y advipre )]+c·||x advi -x i,0 || 2 (5) (5) wherein x advi Representing challenge samples, x i,0 Representing the original audio, y ti A tag representing the targeted speaker,y advipre the output of the challenge sample is represented, where the difference between the challenge sample and the original sample is measured by an L2 function, the magnitude of this difference being controlled by a parameter c.
Step2: initialize the nests. With the number of nests set to G, random perturbations of the same size as the original audio are initialized and superimposed on the original audio to form the initial adversarial samples, i.e., the initial nests:
X = {x_1, x_2, …, x_G} (6)
Step3: new nests are obtained by Lévy flights, i.e., new adversarial samples are generated by the Lévy-flight update:
x_i = x_i + α · S · n (7)
where α is the step-size scale factor and n is an array of standard normally distributed random numbers with the same dimensions as x_i. S is the step length:
S = u / |v|^{1/β} (8)
where u and v are two Gaussian-distributed variables, u ~ N(0, σ²) and v ~ N(0, 1), β is a constant, and σ is calculated from:
σ = {Γ(1 + β) · sin(πβ/2) / [Γ((1 + β)/2) · β · 2^{(β-1)/2}]}^{1/β} (9)
where Γ is the gamma function.
Step4: calculate the fitness of each individual, denoted F = f_1, f_2, …, f_G, and find the best individual in the population, i.e., the one with the smallest fitness value, denoted F_global. If the number of iterations reaches the set maximum or the generated adversarial sample is classified into the target class, stop iterating and output the adversarial sample; otherwise repeat Step1–Step3 and continue the iterative optimization of the population. In this way adversarial samples can be generated under the different models.
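Steps Step1–Step4 can be sketched as a minimal cuckoo search. This is an illustrative sketch under assumptions: the fitness function is supplied by the caller (equation (5) requires querying the attacked model), the abandonment of the worst nests in standard cuckoo search is omitted, and all parameter values and names are placeholders.

```python
import numpy as np
from math import gamma, sin, pi

def levy_step(beta=1.5, rng=None):
    """Step length S of equations (7)-(9) via Mantegna's algorithm:
    S = u / |v|**(1/beta), with u ~ N(0, sigma^2) and v ~ N(0, 1)."""
    if rng is None:
        rng = np.random.default_rng()
    sigma = (gamma(1 + beta) * sin(pi * beta / 2)
             / (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0.0, sigma)
    v = rng.normal(0.0, 1.0)
    return u / abs(v) ** (1 / beta)

def cuckoo_search(x0, fitness, n_nests=10, alpha=0.01, max_iter=50, seed=0):
    """Minimal cuckoo-search sketch: nests are perturbations of the
    original input x0; a nest is replaced when a Levy-flight candidate
    lowers the caller-supplied fitness f (smaller is better)."""
    rng = np.random.default_rng(seed)
    nests = [x0 + rng.normal(0.0, 0.01, x0.shape) for _ in range(n_nests)]
    for _ in range(max_iter):
        for i in range(n_nests):
            # Equation (7): x_i <- x_i + alpha * S * n
            cand = nests[i] + alpha * levy_step(rng=rng) \
                   * rng.standard_normal(x0.shape)
            if fitness(cand) < fitness(nests[i]):
                nests[i] = cand
    return min(nests, key=fitness)   # F_global: smallest fitness value
```

In the patent's setting, `fitness` would implement equation (5) by querying the attacked voiceprint recognition model; the sketch below is exercised with a simple quadratic objective instead.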
7) Adversarially training the spectrogram-based voiceprint recognition integrated model: the adversarial samples generated in step 6) are converted into spectrograms and added to the training data set, and the spectrogram-based voiceprint recognition integrated model is retrained, improving its recognition accuracy and defense capability as well as the safety and stability of the voiceprint recognition model.
The embodiment also provides a defending device of the voiceprint recognition integrated model based on the spectrogram, which comprises a computer memory, a computer processor and a computer program stored in the computer memory and executable on the computer processor, wherein the computer processor realizes the defending method of the voiceprint recognition integrated model based on the spectrogram when executing the computer program.
The computer program stored in the computer memory of the defense device mainly implements the above spectrogram-based defense method for the voiceprint recognition integrated model; its effects correspond to those of the method and are not repeated here.
Against possible white-box or black-box attacks on voiceprint recognition systems, the invention converts the speech signal into a spectrogram and achieves voiceprint recognition with image recognition models; after several image recognition models are integrated, voiceprint recognition accuracy improves while the model gains the ability to defend against adversarial samples, realizing defense against both white-box and black-box attacks.
The foregoing detailed description covers only preferred embodiments of the invention and should not be taken as limiting its scope; any changes, additions, substitutions and equivalents made within the spirit of the invention are intended to fall within its protection scope.

Claims (4)

1. The voiceprint recognition integrated model defending method based on the spectrogram is characterized by comprising the following steps of:
(1) Collecting an audio file, and converting the audio file into a spectrogram, wherein the spectrogram is used as a benign sample;
(2) Training a plurality of image recognition models with benign samples so that they perform voiceprint recognition, thereby obtaining a plurality of image-based voiceprint recognition models;
(3) Integrating the plurality of image-based voiceprint recognition models trained in step (2) by a voting mechanism to form a voiceprint recognition integrated model, and retraining the integrated model with benign samples, specifically comprising: integrating the plurality of voiceprint recognition models by the voting mechanism to obtain the voiceprint recognition integrated model; before voting, converting the prediction confidence returned by each voiceprint recognition model into a predicted category, i.e., taking the category label with the highest confidence as that model's prediction result; after every voiceprint recognition model has produced a prediction for a spectrogram sample, if a predicted category receives more than half of the models' votes, that category is the prediction result of the voiceprint recognition integrated model; and training the integrated model with benign samples and testing it with a test set to refine it;
(4) Adopting a cuckoo search algorithm to attack each of the voiceprint recognition models, generating adversarial samples, and converting the adversarial samples into spectrograms that serve as malicious samples, specifically comprising:
(4-1) initializing a fitness function, defining the fitness function as follows:
f = [y_ti log y_ipre + (1 - y_ti) log(1 - y_advipre)] + c · ||x_advi - x_i,0||_2
where x_advi denotes the adversarial sample, x_i,0 the original audio, y_ti the label of the target speaker, and y_advipre the output for the adversarial sample; the difference between the adversarial sample and the original audio is measured by the L2 norm, whose magnitude is controlled by the parameter c, and y_ipre denotes the confidence output by the voiceprint recognition model;
(4-2) initializing the nests: with the number of nests set to G, random perturbations of the same size as the original audio are initialized and superimposed on the original audio to form the initial adversarial samples, i.e., the initial nests:
X = {x_1, x_2, …, x_G}
(4-3) obtaining new nests by Lévy flights, i.e., generating new adversarial samples by the Lévy-flight update:
x_i = x_i + α · S · n
where α is the step-size scale factor and n is an array of standard normally distributed random numbers with the same dimensions as x_i; S is the step length:
S = u / |v|^{1/β}
where u and v are two Gaussian-distributed variables, u ~ N(0, σ²) and v ~ N(0, 1), β is a constant, and σ is calculated from:
σ = {Γ(1 + β) · sin(πβ/2) / [Γ((1 + β)/2) · β · 2^{(β-1)/2}]}^{1/β}
where Γ is the gamma function;
(4-4) calculating the fitness of each individual, denoted F = f_1, f_2, …, f_G, and finding the best individual in the population, i.e., the one with the smallest fitness value, denoted F_global; if the number of iterations reaches the set maximum or the generated adversarial sample can be classified into the target class, stopping the iteration and outputting the adversarial sample; otherwise repeating steps (4-1)–(4-3) and continuing the iterative optimization of the population, thereby obtaining the adversarial samples generated under the different voiceprint recognition models;
(5) Retraining the image-based voiceprint recognition integrated model obtained in step (3) with the malicious samples and benign samples to obtain a voiceprint recognition integrated model capable of resisting attacks;
(6) Performing defended recognition on the spectrogram corresponding to an audio file with the voiceprint recognition integrated model obtained in step (5).
2. The method for defending a voiceprint recognition integrated model based on a spectrogram according to claim 1, wherein the specific steps of converting an audio file into the spectrogram are as follows:
framing the audio, windowing each frame of voice signal, and performing short-time Fourier transform;
calculating the power spectrum of the short-time Fourier transform result and normalizing it to obtain the spectrogram, the spectrogram together with its corresponding speaker forming a benign sample.
3. The method for defending a voiceprint recognition integrated model based on a spectrogram according to claim 1, wherein the image recognition model adopts VGG16 or VGG19.
4. A device for defending a voiceprint recognition integrated model based on a spectrogram, comprising a computer memory, a computer processor and a computer program stored in the computer memory and executable on the computer processor, wherein the computer processor implements the method for defending a voiceprint recognition integrated model based on a spectrogram according to any one of claims 1 to 3 when the computer program is executed.
CN202010105807.7A 2020-02-20 2020-02-20 Voiceprint recognition integrated model defending method and defending device based on spectrogram Active CN111310836B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010105807.7A CN111310836B (en) 2020-02-20 2020-02-20 Voiceprint recognition integrated model defending method and defending device based on spectrogram


Publications (2)

Publication Number Publication Date
CN111310836A CN111310836A (en) 2020-06-19
CN111310836B true CN111310836B (en) 2023-08-18

Family

ID=71162113

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010105807.7A Active CN111310836B (en) 2020-02-20 2020-02-20 Voiceprint recognition integrated model defending method and defending device based on spectrogram

Country Status (1)

Country Link
CN (1) CN111310836B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112420072B (en) * 2021-01-25 2021-04-27 北京远鉴信息技术有限公司 Method and device for generating spectrogram, electronic equipment and storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2014074732A (en) * 2012-10-02 2014-04-24 Nippon Hoso Kyokai <Nhk> Voice recognition device, error correction model learning method and program
CN107154258A (en) * 2017-04-10 2017-09-12 哈尔滨工程大学 Method for recognizing sound-groove based on negatively correlated incremental learning
US9824692B1 (en) * 2016-09-12 2017-11-21 Pindrop Security, Inc. End-to-end speaker recognition using deep neural network
CN108446765A (en) * 2018-02-11 2018-08-24 浙江工业大学 Multi-model composite defense method against adversarial attacks for deep learning
CN108711436A (en) * 2018-05-17 2018-10-26 哈尔滨工业大学 Replay attack detection method for speaker verification systems based on high-frequency and bottleneck features
CN109801636A (en) * 2019-01-29 2019-05-24 北京猎户星空科技有限公司 Training method and device for voiceprint recognition model, electronic equipment, and storage medium
CN110610708A (en) * 2019-08-31 2019-12-24 浙江工业大学 Voiceprint recognition attack defense method based on cuckoo search algorithm
CN110728993A (en) * 2019-10-29 2020-01-24 维沃移动通信有限公司 Voice change identification method and electronic equipment
CN110767216A (en) * 2019-09-10 2020-02-07 浙江工业大学 Voice recognition attack defense method based on PSO algorithm
CN110808033A (en) * 2019-09-25 2020-02-18 武汉科技大学 Audio classification method based on dual data enhancement strategy

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10657259B2 (en) * 2017-11-01 2020-05-19 International Business Machines Corporation Protecting cognitive systems from gradient based attacks through the use of deceiving gradients
CN108900725B (en) * 2018-05-29 2020-05-29 平安科技(深圳)有限公司 Voiceprint recognition method and device, terminal equipment and storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Jinyin Chen, et al. "Can Adversarial Network Attack be Defended?" arXiv, 2019, full text. *


Similar Documents

Publication Publication Date Title
Yu et al. Spoofing detection in automatic speaker verification systems using DNN classifiers and dynamic acoustic features
WO2018107810A1 (en) Voiceprint recognition method and apparatus, and electronic device and medium
US9368110B1 (en) Method for distinguishing components of an acoustic signal
CN101136199B (en) Voice data processing method and equipment
CN110610708B (en) Voiceprint recognition attack defense method based on cuckoo search algorithm
Samizade et al. Adversarial example detection by classification for deep speech recognition
CN104900235A (en) Voiceprint recognition method based on pitch period mixed characteristic parameters
CN111724770B (en) Audio keyword identification method for generating confrontation network based on deep convolution
Chen et al. Towards understanding and mitigating audio adversarial examples for speaker recognition
Yücesoy et al. Gender identification of a speaker using MFCC and GMM
CN106531174A (en) Animal sound recognition method based on wavelet packet decomposition and spectrogram features
CN112735435A (en) Voiceprint open set identification method with unknown class internal division capability
Nidhyananthan et al. Language and text-independent speaker identification system using GMM
Biagetti et al. Speaker identification in noisy conditions using short sequences of speech frames
CN111310836B (en) Voiceprint recognition integrated model defending method and defending device based on spectrogram
Zhang et al. A highly stealthy adaptive decay attack against speaker recognition
Lin et al. A multiscale chaotic feature extraction method for speaker recognition
Mallikarjunan et al. Text-independent speaker recognition in clean and noisy backgrounds using modified VQ-LBG algorithm
Herrera-Camacho et al. Design and testing of a corpus for forensic speaker recognition using MFCC, GMM and MLE
Panda et al. Study of speaker recognition systems
Bakır Automatic speaker gender identification for the German language
Mansour et al. A comparative study in emotional speaker recognition in noisy environment
Ghonem et al. Classification of stuttering events using i-vector
Srinivas LFBNN: robust and hybrid training algorithm to neural network for hybrid features-enabled speaker recognition system
Al-Noori et al. Robust speaker recognition in noisy conditions by means of online training with noise profiles

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant