CN113948067B - Method for repairing adversarial speech samples with high auditory fidelity - Google Patents
Method for repairing adversarial speech samples with high auditory fidelity
- Publication number
- CN113948067B (application CN202111170083.5A)
- Authority
- CN
- China
- Prior art keywords
- network
- sample
- rae
- audio
- voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/063—Training (under G10L15/06—Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L19/005—Correction of errors induced by the transmission channel, if related to the coding algorithm
- G10L25/18—Speech or voice analysis; the extracted parameters being spectral information of each sub-band
- G10L25/30—Speech or voice analysis; analysis technique using neural networks
- G10L2015/0635—Training: updating or merging of old and new templates; mean values; weighting
- G10L2015/223—Execution procedure of a spoken command
- H—ELECTRICITY; H04L—TRANSMISSION OF DIGITAL INFORMATION; H04L63/1441—Network security: countermeasures against malicious traffic
Abstract
The invention relates to a method for repairing adversarial speech samples with high auditory fidelity, in the technical field of artificial-intelligence security. The method comprises the following steps: constructing an adversarial-sample repair training data set; building an RAE network and setting network parameters; constructing a high-fidelity audio reconstruction loss, i.e. a high-fidelity strategy that improves on the plain signal mean square error; setting training parameters and training the network; and repairing adversarial samples with the trained RAE network, using a speech recognition model to judge whether the repair succeeded. Compared with conventional speech-signal restoration methods, the repaired audio generated by this algorithm has higher auditory fidelity and a higher repair success rate, and remains applicable to adversarial samples with a low signal-to-noise ratio.
Description
Technical Field
The invention relates to the technical field of artificial-intelligence security, and in particular to a method for repairing adversarial speech samples with high auditory fidelity.
Background
In recent years, adversarial attacks on speech recognition have become a new research focus in artificial intelligence: by adding weak noise that is hard for humans to perceive to the input audio, an attacker can induce a speech recognition algorithm to produce a wrong recognition result. Facing the threat of adversarial attacks on speech, researchers at home and abroad are actively exploring security defenses for artificial-intelligence algorithms. The main defense methods fall into three categories. First, adversarial training: the training data set is augmented with adversarial samples to improve the robustness of the intelligent algorithm; this requires generating a large number of adversarial samples and changing the parameters of the original algorithm. Second, adversarial-sample detection: a detection algorithm identifies adversarial samples in the input; this can only discover an adversarial attack, not respond to it effectively. Third, adversarial-sample repair: the adversarial attack effect is suppressed by filtering out the adversarial noise, so that the original intelligent algorithm recognizes the input correctly. Research on adversarial-sample repair is therefore of great practical significance for the safe and reliable application of speech recognition algorithms.
Zhejiang University of Technology, in its patent application (application No. 201911031043.5, publication No. CN110992934A), proposes a defense method against black-box attack models on speech recognition systems. Simulated environmental noise is first added to the original audio to mimic speech input in a real scene; the randomly noised audio forms a preliminary adversarial sample, which is then optimized by a genetic algorithm and gradient estimation to obtain an accurate adversarial sample. The original audio files and the adversarial samples are then mixed into a training data set for adversarial training, and the model is retrained, improving its recognition accuracy on adversarial samples and thereby its robustness against adversarial attack. The method still has the following defects: a large number of adversarial samples must be generated, and adversarial training on only one type of adversarial sample defends poorly against other attack methods; moreover, the original speech recognition algorithm must be retrained, which limits the method in practical applications.
Beijing University of Posts and Telecommunications, in its patent application (application No. 202010206879.0, publication No. CN111564154A), proposes a method and device for defending against adversarial sample attacks based on a speech-enhancement algorithm. The method first obtains the spectral features of the speech sample to be recognized; based on these features, it estimates the sample's noise spectrum with a spectral-subtraction method based on continuous minimum tracking combined with a log-MMSE (minimum mean square error) algorithm incorporating the speech presence probability; the estimated noise spectrum is then used to denoise the sample, and the denoised sample is recognized by a pre-trained speech recognition model. The method can increase the accuracy of speech recognition and improve the defense against adversarial sample attacks. It still has the following defect: it is essentially a conventional general-purpose denoising algorithm with no improvement targeted at adversarial noise, so its defense success rate is low.
Disclosure of Invention
Technical problem to be solved
The technical problem to be solved by the invention is how to design a method for repairing adversarial speech samples with high auditory fidelity that better suppresses the attack effect of adversarial samples, while ensuring that a clean sample, after compression and reconstruction, shows no great difference from the original sample in auditory perception.
(II) technical scheme
In order to solve the above technical problem, the invention provides a method for repairing adversarial speech samples with high auditory fidelity, comprising the following steps:
(1) constructing an adversarial-sample repair training data set;
(2) building an RAE network and setting network structure parameters;
(3) constructing a high-fidelity audio reconstruction loss;
(4) setting training parameters and training the RAE network based on steps 1, 2 and 3;
(5) repairing adversarial samples with the trained RAE network.
Preferably, in step 1, n pieces of original audio data are collected, the maximum sequence length of the audio data is determined, and each piece of audio data is zero-padded to the maximum length, yielding the adversarial-sample repair training data set.
Preferably, step 2 comprises:
(2a) setting network structure parameters, including an input layer, an encoder hidden layer based on a BiLSTM network, a latent variable layer, a decoder hidden layer based on a BiLSTM network, and an output layer;
(2b) building the RAE network by combining the structural characteristics of a BiLSTM network with those of an AE network; the RAE network structure is, in order: input layer → encoder hidden layer based on the BiLSTM network → latent variable layer → decoder hidden layer based on the BiLSTM network → output layer, yielding the initial RAE network.
Preferably, step 3 comprises:
(3a) using the input and reconstructed audio samples, calculating the mean square error according to the formula L_mse = (1/l)·Σ_{t=1}^{l} (x_t − x̂_t)², where l is the audio length, x_t denotes the original data at time t, and x̂_t denotes the reconstructed data at time t;
(3b) using the input audio samples, calculating a reconstruction-error weight for the audio data at each time instant according to the formula w_t = e^{−λ·x_t²}, where x_t² represents the speech-signal intensity at time t; the weight w_t decreases as x_t² increases, and λ > 0 is a scale parameter: the larger λ is, the faster w_t decreases as x_t² grows;
(3c) multiplying the mean square error by the reconstruction-error weight, i.e. calculating the high-fidelity audio reconstruction loss according to the formula L_hf = (1/l)·Σ_{t=1}^{l} e^{−λ·x_t²}·(x_t − x̂_t)².
Preferably, the high-fidelity audio reconstruction loss function expressed by the formula L_hf = (1/l)·Σ_{t=1}^{l} e^{−λ·x_t²}·(x_t − x̂_t)² is an adaptive weighted mean-square-error loss function that allows large reconstruction errors where the speech-signal intensity is large and limits the error where the speech-signal intensity is small.
Preferably, step 4 comprises:
(4a) setting training parameters, including the number of iteration rounds T, the mini-batch size s and the learning rate η, and selecting an optimization algorithm based on first-order gradients;
(4b) reading the adversarial-sample repair training data set and dividing the original audio data set into ⌈n/s⌉ mini-batches, where ⌈·⌉ denotes rounding up, yielding the preprocessed adversarial-sample repair training data set;
(4c) using the preprocessed adversarial-sample repair training data set, training the initial RAE network by back-propagation with the selected optimization algorithm, according to the high-fidelity audio reconstruction loss and the learning rate η: the ⌈n/s⌉ mini-batches are fed into the RAE network in turn; the sequence of forward propagation, backward propagation and weight update is repeated for T rounds with the selected optimization algorithm; and training stops after T·⌈n/s⌉ iterations in total, yielding the trained RAE network.
Preferably, the optimization algorithm is the Adam algorithm.
Preferably, step 5 comprises:
(5a) reading an adversarial sample, repairing it with the trained RAE network model, and outputting the repaired sample; the adversarial sample is audio data generated by a speech adversarial-sample generation algorithm, which induces the speech recognition model to produce a wrong recognition result by adding a perturbation to a clean sample;
(5b) inputting the adversarial sample and the repaired sample into a speech recognition model and observing the repair effect: if the repaired sample is recognized correctly, the repair succeeded; otherwise it failed.
Preferably, the repair effect covers two aspects: on one hand, the difference between the spectrograms of the adversarial sample and the repaired sample is observed; on the other hand, the results of feeding the adversarial sample and the repaired sample into the speech recognition model are compared.
The invention also provides application of the method in the technical field of artificial intelligence safety.
(III) advantageous effects
The method repairs adversarial samples with an RAE network and a high-fidelity audio reconstruction loss, and better suppresses the adversarial attack effect while ensuring that a clean sample, after compression and reconstruction, shows no great difference from the original sample in auditory perception.
The high-fidelity reconstruction-loss strategy is motivated by the observation that adversarial samples are constructed by adding weak perturbations in regions of large signal variance, which then mislead the recognition model. While the audio is reduced in dimension and expanded back in the encode-decode process, the loss function is controlled by the weights: regions of small signal intensity are dynamically allowed no error, while regions of large signal intensity are allowed reasonable errors, so that perturbations that do not belong to the original signal are discarded.
Experimental verification shows that, for speech adversarial samples generated in both black-box and white-box settings, the repaired samples produced by this algorithm are recognized correctly by the automatic voice-command recognition system for any given command. Compared with other speech adversarial-sample repair algorithms, under the same error rate between the adversarial-sample recognition result and the original label, the method achieves a higher repair success rate and auditory fidelity.
Drawings
FIG. 1 is a flow chart of the method of the invention;
FIG. 2 is a diagram of the RAE network architecture of the invention;
FIG. 3 is a spectrogram of the original audio;
FIG. 4 is a spectrogram of an adversarial sample;
FIG. 5 is a spectrogram of an audio repair sample generated by the invention.
Detailed Description
To make the objects, contents and advantages of the invention clearer, the embodiments of the invention are described in detail below with reference to the accompanying drawings and examples.
The invention provides a method for repairing adversarial speech samples with high auditory fidelity. It repairs adversarial samples with an RAE (Recurrent AutoEncoder) network and a high-fidelity audio reconstruction loss, and better suppresses the adversarial attack effect while ensuring that a clean sample, after compression and reconstruction, shows no great difference from the original sample in auditory perception.
The method uses the original audio to build a high-fidelity autoencoder, so that the compression and reconstruction process suppresses the adversarial attack effect of an adversarial sample; in addition, the high-fidelity audio reconstruction loss enhances the auditory similarity between the input audio and the reconstructed audio.
the overall process of the invention is shown in figure 1 and comprises the following steps:
step 1, constructing a confrontation sample restoration training data set.
4000 pieces of original audio data are collected, the format of the audio data is in a 'wav' format, the audio sampling frequency is 16kHZ, and the audio sampling time length is less than or equal to 1 second; the maximum sequence length of the audio data is calculated to be 16000, each piece of audio data is filled with zero to the maximum length, and a confrontation sample repair training data set is obtained and comprises eight types of English audio data consisting of down, go, left, off, on, right, stop and up.
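Step 1 above (collect clips, determine the maximum length, zero-pad at the tail) can be sketched as follows; the function name and the use of NumPy are illustrative assumptions, not taken from the patent:

```python
import numpy as np

def build_repair_dataset(clips, max_len=16000):
    """Zero-pad every clip at the tail so all share the maximum length
    (16000 samples = 1 s at 16 kHz, as in the text)."""
    padded = np.zeros((len(clips), max_len), dtype=np.float32)
    for i, clip in enumerate(clips):
        n = min(len(clip), max_len)
        padded[i, :n] = clip[:n]
    return padded

# two toy clips: one shorter than 1 s, one exactly 1 s
clips = [np.ones(12000, dtype=np.float32), np.ones(16000, dtype=np.float32)]
dataset = build_repair_dataset(clips)
print(dataset.shape)  # (2, 16000)
```

The padded array can then be reshaped to the 100 × 160 segment layout expected by the network input layer.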
Step 2: build the RAE network and set the network parameters, with reference to FIG. 2.
The network structure parameters are set as follows: the input layer has dimension 100 × 160, i.e. each piece of audio data is divided into 100 segments of 160 samples each; the encoder hidden layer based on a BiLSTM (Bi-directional Long Short-Term Memory) network has size 64; the latent variable layer has size 64; the decoder hidden layer based on a BiLSTM network has size 64; and the output layer has dimension 100 × 160. A BiLSTM network is a bidirectional recurrent neural network: the current output is influenced not only by past states but also by future states, and at each time t the input feeds two recurrent networks running in opposite directions that jointly determine the output.
The RAE network is built by combining the structural characteristics of the bidirectional recurrent BiLSTM network with those of the unsupervised AE autoencoder into a five-layer network. Its structure is, in order: input layer → encoder hidden layer based on the BiLSTM network → latent variable layer → decoder hidden layer based on the BiLSTM network → output layer, yielding the initial RAE network. The AE network is an autoencoder, an unsupervised learning model comprising an encoding process of the original data x from the input layer to the hidden layer and a decoding process from the hidden layer to the output layer, so that the distance between the original data x and the reconstructed data x̂ is the reconstruction error loss of x. The RAE network (RNN-AE network) is a recurrent autoencoder that uses a BiLSTM network in both encoder and decoder. The original audio data are input into the encoder; through BiLSTM learning, the encoder output Z_t at the current time is influenced by the current state h_t and the future state c_t, and the encoder output Z_t at the final time step enters the decoder; through BiLSTM learning again, the decoder output Z_t at the current time is influenced by the current state h_t and the future state c_t, and the decoder's final-time output is finally converted to the output format through a fully connected layer.
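A minimal PyTorch sketch of the structure described above (input 100 × 160, BiLSTM encoder and decoder with hidden size 64, latent size 64, fully connected output). The class and attribute names are illustrative, and replicating the final-time latent across the decoder's time steps is one plausible reading of the text, not the patent's exact implementation:

```python
import torch
import torch.nn as nn

class RAE(nn.Module):
    def __init__(self, seg_len=160, n_seg=100, hidden=64):
        super().__init__()
        # encoder hidden layer based on a BiLSTM network, size 64
        self.encoder = nn.LSTM(seg_len, hidden, bidirectional=True, batch_first=True)
        self.to_latent = nn.Linear(2 * hidden, hidden)  # latent variable layer, size 64
        # decoder hidden layer based on a BiLSTM network, size 64
        self.decoder = nn.LSTM(hidden, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, seg_len)       # fully connected output layer
        self.n_seg = n_seg

    def forward(self, x):                       # x: (batch, 100, 160)
        h, _ = self.encoder(x)                  # (batch, 100, 128)
        z = self.to_latent(h[:, -1, :])         # final-time latent, (batch, 64)
        z_seq = z.unsqueeze(1).repeat(1, self.n_seg, 1)  # replicate for the decoder
        d, _ = self.decoder(z_seq)              # (batch, 100, 128)
        return self.out(d)                      # reconstruction, (batch, 100, 160)

model = RAE()
x = torch.randn(8, 100, 160)                    # a batch of 8 clips, 100 segments x 160
print(model(x).shape)  # torch.Size([8, 100, 160])
```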
and 3, constructing high-fidelity audio reconstruction loss.
Using input and reconstructed audio samples, according to a formulaCalculating the mean square error, wherein t is 1, 2, 3 … l, l is 160 audio length, xtWhich represents the original data of the image data,representing the reconstructed data.
Using input audio samples, according to a formulaAnd calculating the reconstruction error weight of the audio data at each moment, wherein lambda is 2. Wherein x ist 2Representing the strength of the speech signal, the weight w follows xt 2Is increased and decreased, lambda is more than 0 as a scale parameter, the larger the lambda is, the weight w is along with xt 2The faster the increase and decrease;
multiplying the mean square error by the reconstruction error weight, i.e. according to the formulaAnd calculating to obtain the high-fidelity audio reconstruction loss. Wherein λ is a scale parameter, the higher the audio energy, the smaller the region weight, and the formulaThe expressed high-fidelity audio reconstruction loss function is a self-adaptive weighted mean square error loss function, allows a larger reconstruction error to be generated at a place with large voice signal intensity, and limits the error at a place with small voice signal intensity; due to the fact that weak disturbance of the countersample is often added to the area with large signal intensity, the method not only guarantees that the area with large intensity of the voice after reconstruction is allowed to have errors, but also guarantees that the signal intensity of the voice after reconstruction is smallWithout loss of area
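A NumPy sketch of the adaptive weighted loss, assuming the weight form w_t = e^{−λ·x_t²} implied by the description (the weight shrinks as x_t² grows, faster for larger λ), with λ = 2 as in the text; names are illustrative:

```python
import numpy as np

def high_fidelity_loss(x, x_hat, lam=2.0):
    """Adaptive weighted MSE: exp(-lam * x_t^2) discounts the penalty where
    the signal is strong and keeps it tight where the signal is weak."""
    w = np.exp(-lam * x ** 2)
    return float(np.mean(w * (x - x_hat) ** 2))

x = np.array([0.0, 0.1, 0.9])   # weak, weak, strong signal samples
x_hat = x + 0.1                 # identical absolute reconstruction error everywhere
# the weighting makes the loss smaller than the plain MSE, because the error
# at the strong-signal sample x = 0.9 is discounted
print(high_fidelity_loss(x, x_hat) < float(np.mean((x - x_hat) ** 2)))  # True
```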
And 4, setting training parameters and training a network.
The training parameters are set as follows: the number of iteration rounds T = 30, the mini-batch size s = 64, and the learning rate η = 1 × 10⁻⁴. An optimization algorithm based on first-order gradients is selected from SGD (Stochastic Gradient Descent), Momentum-SGD (SGD with momentum), RMSProp (Root Mean Square Propagation), Adam (Adaptive Moment Estimation), and the like; here the Adam algorithm is used with β₁ = 0.99, β₂ = 0.999.
The adversarial-sample repair training data set is read, and the eight classes of English audio data in the original audio data set are divided into 7 mini-batches, yielding the preprocessed adversarial-sample repair training data set.
Using the preprocessed adversarial-sample repair training data set, the 7 mini-batches of the eight-class English audio data are fed into the RAE network in turn; with the Adam algorithm, the sequence of forward propagation, backward propagation and weight update is repeated for 30 rounds, and training stops after 210 iterations in total, yielding the trained RAE network.
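The training set-up above (Adam with η = 1 × 10⁻⁴, mini-batches of 64, 30 rounds over 7 batches = 210 iterations) can be sketched as follows; to keep the sketch self-contained, the model is a stand-in linear autoencoder rather than the RAE:

```python
import torch
import torch.nn as nn

# stand-in linear autoencoder so the training sketch is self-contained
model = nn.Sequential(nn.Linear(160, 64), nn.Linear(64, 160))
# Adam with eta = 1e-4 and the beta values given in the text
opt = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.99, 0.999))

def high_fidelity_loss(x, x_hat, lam=2.0):
    # adaptive weighted MSE from step 3
    return torch.mean(torch.exp(-lam * x ** 2) * (x - x_hat) ** 2)

data = torch.randn(448, 160)        # 448 toy clips -> 7 mini-batches of s = 64
steps = 0
for epoch in range(30):             # T = 30 rounds
    for batch in data.split(64):
        opt.zero_grad()
        loss = high_fidelity_loss(batch, model(batch))  # forward propagation
        loss.backward()             # backward propagation
        opt.step()                  # weight update
        steps += 1
print(steps)  # 210 iterations in total, as in the text
```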
Step 5: repair the adversarial samples with the trained RAE model.
An adversarial sample is read, repaired with the trained RAE network model, and the repaired sample is output. The adversarial sample is audio data generated by a speech adversarial-sample generation algorithm, which induces the speech recognition model to produce a wrong recognition result by adding a weak perturbation to a clean sample. The adversarial audio data are in "wav" format with a sampling frequency of 16 kHz and a duration of at most 1 second, i.e. the read audio sequence has length at most 16000; sequences shorter than 1 second are zero-padded at the tail to length 16000.
The adversarial sample and the repaired sample are input into a speech recognition model and the repair effect is observed in two respects: on one hand, the difference between the spectrograms of the adversarial sample and the repaired sample; on the other hand, the recognition results of the two samples. If the repaired sample is recognized correctly, the repair succeeded; otherwise it failed.
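The repair-and-verify logic of step 5 can be sketched as below; `rae_repair` and `recognise` are illustrative stand-in stubs for the trained RAE network and the speech recognition model, not the patent's actual models:

```python
def rae_repair(audio):
    # stand-in for: zero-pad to 16000 samples and run the trained RAE
    return audio

def recognise(audio):
    # stand-in speech recognition model that always outputs the command "on"
    return "on"

def repair_succeeded(adversarial, true_label):
    """A repair succeeds iff the repaired sample is recognised as the
    correct (original) label."""
    repaired = rae_repair(adversarial)
    return recognise(repaired) == true_label

print(repair_succeeded([0.1, 0.2, 0.3], "on"))  # True with the stub recogniser
```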
The effects of the present invention will be further described with reference to the experimental results.
1. The experimental conditions of the invention are as follows:
The software platform used in this embodiment is the Windows 10 operating system with Spyder.
The hardware used in this embodiment is an Intel Core(TM) i7-9700K @ 3.60 GHz × 8 CPU and an Nvidia GeForce GTX 1080Ti GPU with 11 GB of video memory.
The Python version used in this example is 3.7.3; the libraries and versions used are pytorch 1.1.0, torchaudio 0.8.0 and numpy 1.21.0.
2. The experimental result analysis of the invention:
the experiment of the invention is that the voice confrontation sample restoration method with the characteristics of hearing high fidelity constructed by the invention is utilized to restore the confrontation sample, the restored sample is output, the difference of the acoustic spectrums of the confrontation sample and the restored sample is observed, and the confrontation sample and the restored sample are compared through a voice recognition model to obtain a recognition result.
The acoustic spectrum difference between the confrontation sample and the repaired sample, fig. 3 is an original audio spectrogram of on-type data, fig. 4 is a confrontation sample spectrogram of on-type data, fig. 5 is a sample spectrogram of on-type audio repaired generated by the method of the present invention, the similarity between the acoustic spectrums of fig. 3 and 5 is high, and a certain difference exists between fig. 4 and 5.
The success rate of repairing speech adversarial samples is calculated according to the formula P_z = m_z / n_z, where z indexes the speech class, n_z = 500 is the total number of adversarial samples in one class, and m_z is the number of successfully repaired adversarial samples; the repair success rate is computed separately for each speech class. The results are shown in Table 1.
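The statistic can be computed directly; per the description, n_z = 500 adversarial samples per class and m_z successful repairs:

```python
def repair_success_rate(m_z, n_z=500):
    """P_z = m_z / n_z: fraction of the n_z adversarial samples of class z
    that were repaired successfully."""
    return m_z / n_z

# e.g. 451 of 500 repaired corresponds to the reported 90.2%
print(repair_success_rate(451))  # 0.902
```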
TABLE 1 success rate results of confrontation sample repair under different audio classes
From the results in Table 1, looking at the repair success rate by audio class: across the eight classes of English audio (down, go, left, off, on, right, stop, up), the proposed method achieves a repair success rate above 80% on the PGD-attack and Deepfool-attack samples of seven classes, the exception being the stop class, whose four kinds of adversarial samples are repaired with success rates below 50%; the Deepfool-attack samples of the go class and the PGD-attack and Deepfool-attack samples of the on class reach 90.2%, 93.6% and 95% respectively. Looking at the repair success rate by attack type (PGD, Deepfool, SA and GA attacks): for all classes except the off-class audio, the repair success rate on PGD-, Deepfool- and GA-attack samples is higher than that on SA-attack samples.
For any given command, the repaired sample generated by the method is correctly recognized by the automatic voice-command recognition system. Compared with other speech adversarial-sample repair algorithms, under the same error rate between the adversarial-sample recognition result and the original label, the method achieves a higher repair success rate and auditory fidelity, and remains applicable to adversarial samples with a low signal-to-noise ratio.
The above is only a preferred embodiment of the invention. It should be noted that those skilled in the art can make several modifications and variations without departing from the technical principle of the invention, and such modifications and variations shall also fall within the protection scope of the invention.
Claims (4)
1. A method for repairing adversarial speech samples with high auditory fidelity, characterized by comprising the following steps:
(1) constructing an adversarial-sample repair training data set;
(2) building an RAE network and setting network structure parameters;
(3) constructing a high-fidelity audio reconstruction loss;
(4) setting training parameters and training the RAE network based on steps 1, 2 and 3;
(5) repairing adversarial samples with the trained RAE network;
in step 1, n pieces of original audio data are collected, the maximum sequence length of the audio data is determined, and each piece of audio data is zero-padded to the maximum length, yielding the adversarial-sample repair training data set;
step 2 comprises the following steps:
(2a) setting network structure parameters, including an input layer, an encoder hidden layer based on a BiLSTM network, a hidden-variable layer, a decoder hidden layer based on a BiLSTM network, and an output layer;
(2b) building the RAE network by combining the structural characteristics of the BiLSTM network and the AE network, the RAE network structure being, in order: input layer → BiLSTM-based encoder hidden layer → hidden-variable layer → BiLSTM-based decoder hidden layer → output layer, thereby obtaining an initial RAE network;
the AE network is an auto-encoder network, an unsupervised learning model comprising an encoding process of the original data x from the input layer to the hidden layer and a decoding process from the hidden layer to the output layer, such that the original data x and the reconstructed data x̂ are made as close as possible; the RAE network is an RNN-AE network, i.e. a recurrent auto-encoder network in which BiLSTM networks are used in both the encoder and the decoder; the original audio data is input into the encoder and learned by the BiLSTM network, so that the encoder output Z_t at the current moment is influenced by the current-time state h_t and the future-time state c_t; the encoder output Z at the final moment enters the decoder and is learned again by a BiLSTM network, the decoder output Z_t at the current moment likewise being influenced by the current-time state h_t and the future-time state c_t; finally, the decoder output is converted to the required output format through a fully connected layer;
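The encoder → hidden-variable → decoder data flow described above can be sketched in numpy. This is a toy illustration only: a plain tanh RNN cell stands in for the BiLSTM layers, the weights are random rather than trained, and all names and dimensions are assumptions rather than the patent's parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_layer(x, W_in, W_rec):
    """Run a simple tanh RNN over time (a minimal stand-in for the
    BiLSTM layers of the RAE); returns the hidden state at every step."""
    h = np.zeros(W_rec.shape[0])
    outs = []
    for x_t in x:
        h = np.tanh(W_in @ x_t + W_rec @ h)
        outs.append(h)
    return np.array(outs)

# toy dimensions: 1-D audio frames, hidden size 8, latent size 4
T, hidden, latent = 16, 8, 4
x = rng.standard_normal((T, 1))

# input layer -> encoder hidden layer -> hidden-variable layer
enc = rnn_layer(x, rng.standard_normal((hidden, 1)),
                rng.standard_normal((hidden, hidden)) * 0.1)
z = enc @ rng.standard_normal((hidden, latent))
# -> decoder hidden layer -> fully connected output layer
dec = rnn_layer(z, rng.standard_normal((hidden, latent)),
                rng.standard_normal((hidden, hidden)) * 0.1)
x_hat = dec @ rng.standard_normal((hidden, 1))
print(x_hat.shape)  # (16, 1): reconstruction has the input's shape
```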
step 3 comprises the following steps:
(3a) using the input and reconstructed audio samples, calculating the mean square error according to the formula MSE = (1/l)·Σ_{t=1}^{l} (x_t − x̂_t)², where l is the audio length, x_t denotes the original data at time t, and x̂_t denotes the reconstructed data at time t;
(3b) using the input audio samples, calculating a reconstruction error weight for the audio data at each moment according to the formula w_t = e^(−λ·x_t²), where x_t² represents the speech signal intensity at time t; the weight w_t decreases as x_t² increases, and λ > 0 is a scale parameter: the larger λ is, the faster w_t decreases as x_t² increases;
(3c) multiplying the squared error at each moment by its reconstruction error weight, i.e. calculating the high-fidelity audio reconstruction loss according to the formula L = (1/l)·Σ_{t=1}^{l} e^(−λ·x_t²)·(x_t − x̂_t)²;
the high-fidelity audio reconstruction loss function expressed by this formula is an adaptively weighted mean-square-error loss function: it allows large reconstruction errors where the speech signal intensity is large, and constrains the error where the speech signal intensity is small;
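A minimal numpy sketch of the adaptively weighted loss in step 3, assuming the exponential weight w_t = e^(−λ·x_t²); the surviving text describes the weight only qualitatively, so the exact functional form here is a reconstruction:

```python
import numpy as np

def hifi_reconstruction_loss(x, x_hat, lam=1.0):
    """Adaptively weighted MSE: the weight exp(-lam * x**2) shrinks where
    the signal intensity x_t**2 is large, so loud regions tolerate larger
    reconstruction error while quiet regions are reconstructed tightly."""
    w = np.exp(-lam * x ** 2)
    return np.mean(w * (x - x_hat) ** 2)

# the same absolute error of 0.1 per sample, in a quiet vs a loud region
quiet_err = hifi_reconstruction_loss(np.zeros(3), np.full(3, 0.1))
loud_err = hifi_reconstruction_loss(np.full(3, 2.0), np.full(3, 2.1))
print(quiet_err > loud_err)  # True
```

The identical 0.1 error costs more in the quiet region, matching the claim that errors are constrained where the signal intensity is small and tolerated where it is large.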
step 4 comprises the following steps:
(4a) setting training parameters, including the number of iteration rounds T, the mini-batch size s, and the learning rate η, and selecting an optimization algorithm based on first-order gradients;
(4b) reading the adversarial-sample repair training data set, dividing the original audio data set into ⌈n/s⌉ mini-batches, and preprocessing them to obtain the preprocessed adversarial-sample repair training data set, where ⌈·⌉ denotes rounding up;
(4c) using the preprocessed adversarial-sample repair training data set, training the initial RAE network by back propagation with the selected optimization algorithm, according to the high-fidelity audio reconstruction loss and the learning rate η: the ⌈n/s⌉ mini-batches are input into the RAE network in turn, and forward propagation, back propagation, and weight updating are repeated for T rounds with the selected optimization algorithm; training stops after a total of T·⌈n/s⌉ iterations, yielding the trained RAE network.
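The batch bookkeeping in step 4 (⌈n/s⌉ mini-batches per round, T·⌈n/s⌉ total iterations) amounts to the following, with illustrative values for n, s and T:

```python
import math

n, s, T = 1000, 64, 30                 # dataset size, mini-batch size, rounds (illustrative)
batches_per_epoch = math.ceil(n / s)   # ⌈n/s⌉ mini-batches per round
total_iterations = T * batches_per_epoch
print(batches_per_epoch, total_iterations)  # 16 480
```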
2. The method of claim 1, wherein the optimization algorithm is the Adam algorithm.
3. The method of claim 1, wherein step 5 comprises:
(5a) reading an adversarial sample, repairing it with the trained RAE network model, and outputting the repaired sample; the adversarial sample is audio data generated by a voice adversarial-sample generation algorithm, which induces the voice recognition model to produce an erroneous recognition result by adding a perturbation to a clean sample;
(5b) inputting the adversarial sample and the repaired sample into a voice recognition model and observing the repair effect of the adversarial sample; if the repaired sample is recognized correctly, the repair is successful; otherwise, the repair has failed.
4. The method of claim 3, wherein observing the repair effect of the adversarial sample specifically comprises observing the difference between the spectrogram of the adversarial sample and that of the repaired sample, and observing the recognition results of the voice recognition model on the adversarial sample and on the repaired sample.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111170083.5A CN113948067B (en) | 2021-10-08 | 2021-10-08 | Voice countercheck sample repairing method with hearing high fidelity characteristic |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113948067A CN113948067A (en) | 2022-01-18 |
CN113948067B true CN113948067B (en) | 2022-05-27 |
Family
ID=79330060
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111170083.5A Active CN113948067B (en) | 2021-10-08 | 2021-10-08 | Voice countercheck sample repairing method with hearing high fidelity characteristic |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113948067B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115424635B (en) * | 2022-11-03 | 2023-02-10 | 南京凯盛国际工程有限公司 | Cement plant equipment fault diagnosis method based on sound characteristics |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110767216A (en) * | 2019-09-10 | 2020-02-07 | 浙江工业大学 | Voice recognition attack defense method based on PSO algorithm |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110930976B (en) * | 2019-12-02 | 2022-04-15 | 北京声智科技有限公司 | Voice generation method and device |
US20210304736A1 (en) * | 2020-03-30 | 2021-09-30 | Nvidia Corporation | Media engagement through deep learning |
CN113223515B (en) * | 2021-04-01 | 2022-05-31 | 山东大学 | Automatic voice recognition method for anti-attack immunity |
CN113362822B (en) * | 2021-06-08 | 2022-09-30 | 北京计算机技术及应用研究所 | Black box voice confrontation sample generation method with auditory masking |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||