CN113948067B - Speech adversarial sample restoration method with auditory high-fidelity characteristics - Google Patents

Speech adversarial sample restoration method with auditory high-fidelity characteristics

Info

Publication number
CN113948067B
CN113948067B (application No. CN202111170083.5A, published as CN113948067A)
Authority
CN
China
Prior art keywords
network
sample
rae
audio
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111170083.5A
Other languages
Chinese (zh)
Other versions
CN113948067A (en)
Inventor
王斌
方永强
曾颖明
张箐碚
陈志浩
郭敏
童帅鑫
马晓军
桓琦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Computer Technology and Applications
Original Assignee
Beijing Institute of Computer Technology and Applications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Computer Technology and Applications
Priority to CN202111170083.5A
Publication of CN113948067A
Application granted
Publication of CN113948067B

Classifications

    • G10L15/063: Speech recognition; creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/22: Speech recognition; procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L19/005: Speech or audio coding or decoding; correction of errors induced by the transmission channel, if related to the coding algorithm
    • G10L25/18: Speech or voice analysis techniques; the extracted parameters being spectral information of each sub-band
    • G10L25/30: Speech or voice analysis techniques; analysis using neural networks
    • G10L2015/0635: Training; updating or merging of old and new templates; mean values; weighting
    • G10L2015/223: Execution procedure of a spoken command
    • H04L63/1441: Network security; countermeasures against malicious traffic

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention relates to a speech adversarial sample restoration method with auditory high-fidelity characteristics, in the technical field of artificial intelligence security. The method comprises the following steps: constructing an adversarial-sample restoration training data set; building an RAE network and setting the network parameters; constructing a high-fidelity audio reconstruction loss, i.e. a high-fidelity strategy that improves on the plain signal mean square error; setting the training parameters and training the network; and repairing adversarial samples with the trained RAE network, using a speech recognition model to judge whether the repair succeeded. Compared with conventional speech-signal restoration methods, the audio repair samples generated by the algorithm have higher auditory fidelity and a higher repair success rate, and the method remains applicable to adversarial samples at lower signal-to-noise ratios.

Description

Speech adversarial sample restoration method with auditory high-fidelity characteristics
Technical Field
The invention relates to the technical field of artificial intelligence security, and in particular to a speech adversarial sample restoration method with auditory high-fidelity characteristics.
Background
In recent years, adversarial attacks on speech recognition have become a new research focus in artificial intelligence: by adding weak noise that humans can hardly perceive to the input audio, an attacker can induce a speech recognition algorithm to produce a false recognition result. Facing the threat of speech adversarial attacks, scholars at home and abroad are actively exploring security defense technologies for artificial intelligence algorithms. The main defense methods fall into three categories. First, adversarial training: the training data set is augmented with adversarial samples to improve the robustness of the intelligent algorithm; this requires generating a large number of adversarial samples and changing the parameters of the original algorithm. Second, adversarial sample detection: adversarial samples in the input are identified by a detection algorithm; this can only discover adversarial attack behavior but cannot effectively respond to it. Third, adversarial sample restoration: the adversarial attack effect is suppressed by filtering out the adversarial noise, so that the original intelligent algorithm recognizes the input correctly. Research on adversarial sample restoration therefore has great practical significance for the safe and reliable application of speech recognition algorithms.
In patent application 201911031043.5 (publication No. CN110992934A), Zhejiang University of Technology proposed a defense method against black-box attack models on speech recognition systems. The method first adds simulated environmental noise to the original audio to mimic the speech input conditions of a real scene; after the noise is randomly added, a preliminary adversarial sample is formed, which is then optimized by a genetic algorithm and gradient estimation to obtain an accurate adversarial sample. The original audio files and the adversarial samples are then mixed into a training data set for adversarial training, and the model is retrained to improve its recognition accuracy on adversarial samples and thus its robustness against adversarial attacks. The method still has the following shortcomings: a large number of adversarial samples must be generated, and adversarial training on only one type of adversarial sample defends poorly against other attack methods; moreover, the original speech recognition algorithm must be retrained, which limits the method in practical applications.
In patent application 202010206879.0 (publication No. CN111564154A), Beijing University of Posts and Telecommunications proposed a defense method and device against adversarial sample attacks based on a speech enhancement algorithm. The method first obtains the spectral features of the speech sample to be recognized; based on these features, it estimates the noise spectrum using spectral subtraction with continuous minimum tracking and a log-MMSE (minimum mean square error) algorithm combined with the speech presence probability; the estimated noise spectrum is then used to denoise the sample, and the denoised sample is recognized by a pre-trained speech recognition model. The method can increase the accuracy of speech recognition and improve the defense against adversarial sample attacks. It still has the following shortcoming: it is essentially a conventional general-purpose noise reduction algorithm with no improvement targeted at adversarial noise, so its defense success rate is low.
Disclosure of Invention
(I) Technical problem to be solved
The technical problem to be solved by the invention is how to design a speech adversarial sample restoration method with auditory high-fidelity characteristics that effectively suppresses the attack effect of adversarial samples while ensuring that the compressed-and-reconstructed audio of a clean sample shows no great difference from the original sample in auditory perception.
(II) Technical solution
To solve this technical problem, the invention provides a speech adversarial sample restoration method with auditory high-fidelity characteristics, comprising the following steps:
(1) constructing an adversarial-sample restoration training data set;
(2) building an RAE network and setting the network structure parameters;
(3) constructing the high-fidelity audio reconstruction loss;
(4) setting the training parameters and training the RAE network based on steps 1, 2 and 3;
(5) repairing adversarial samples with the trained RAE network.
Preferably, in step 1, n pieces of original audio data are collected, the maximum sequence length of the audio data is determined, and each piece of audio data is zero-padded to the maximum length, obtaining the adversarial-sample restoration training data set.
Preferably, step 2 comprises:
(2a) setting the network structure parameters, comprising an input layer, a BiLSTM-based encoder hidden layer, a hidden-variable layer, a BiLSTM-based decoder hidden layer and an output layer;
(2b) building the RAE network by combining the structural characteristics of the BiLSTM network and the AE network, the RAE network structure being, in order: input layer → BiLSTM-based encoder hidden layer → hidden-variable layer → BiLSTM-based decoder hidden layer → output layer, thus obtaining the initial RAE network.
Preferably, step 3 comprises:
(3a) using the input and reconstructed audio samples, computing the mean square error according to the formula

$$\mathrm{MSE} = \frac{1}{l}\sum_{t=1}^{l}\left(x_t - \hat{x}_t\right)^2,$$

where $l$ is the audio length, $x_t$ denotes the raw data at time $t$, and $\hat{x}_t$ the reconstructed data at time $t$;
(3b) using the input audio samples, computing a reconstruction error weight for the audio data at each time instant according to the formula

$$w_t = e^{-\lambda x_t^2},$$

where $x_t^2$ represents the speech signal intensity at time $t$; the weight $w$ decreases as $x_t^2$ increases, and $\lambda > 0$ is a scale parameter: the larger $\lambda$ is, the faster $w$ decreases as $x_t^2$ grows;
(3c) multiplying the mean square error terms by the reconstruction error weights, i.e. computing the high-fidelity audio reconstruction loss according to the formula

$$L = \frac{1}{l}\sum_{t=1}^{l} e^{-\lambda x_t^2}\left(x_t - \hat{x}_t\right)^2.$$

Preferably, the high-fidelity audio reconstruction loss function expressed by this formula is an adaptively weighted mean square error loss function that allows large reconstruction errors where the speech signal intensity is large and limits the error where the speech signal intensity is small.
Preferably, step 4 comprises:
(4a) setting the training parameters, including the number of iteration rounds $T$, the mini-batch size $s$ and the learning rate $\eta$, and selecting an optimization algorithm based on first-order gradients;
(4b) reading the adversarial-sample restoration training data set and dividing the original audio data set into $\lceil n/s \rceil$ mini-batches, where $\lceil\cdot\rceil$ denotes rounding up, and preprocessing the mini-batch data sets to obtain the preprocessed adversarial-sample restoration training data set;
(4c) using the preprocessed adversarial-sample restoration training data set, performing back-propagation training of the initial RAE network with the selected optimization algorithm according to the high-fidelity audio reconstruction loss and the learning rate $\eta$: the $\lceil n/s \rceil$ mini-batches are fed into the RAE network in turn; with the selected optimization algorithm, the sequence forward propagation → back propagation → weight update is repeated for $T$ rounds, and training stops after a total of $T \cdot \lceil n/s \rceil$ iterations, yielding the trained RAE network.
Preferably, the optimization algorithm is the Adam algorithm.
Preferably, step 5 comprises:
(5a) reading an adversarial sample, repairing it with the trained RAE network model, and outputting the repaired sample, the adversarial sample being audio data generated by a speech adversarial-sample generation algorithm that induces a false recognition result from the speech recognition model by adding a perturbation to a clean sample;
(5b) inputting the adversarial sample and the repaired sample into a speech recognition model and observing the repair effect: if the repaired sample is recognized correctly, the repair succeeded; otherwise it failed.
Preferably, observing the repair effect specifically comprises two aspects: on the one hand, observing the difference between the spectrograms of the adversarial sample and the repaired sample; on the other hand, observing the recognition results of the adversarial sample and the repaired sample when input into the speech recognition model.
The invention also provides an application of the method in the technical field of artificial intelligence security.
(III) Advantageous effects
The method repairs adversarial samples using the RAE network and the high-fidelity audio reconstruction loss, and can effectively suppress the adversarial attack effect while ensuring that the compressed-and-reconstructed audio of a clean sample shows no great difference from the original sample in auditory perception.
The high-fidelity reconstruction-loss strategy is based on the observation that adversarial samples are constructed by adding weak perturbations in regions of large signal variance, which then mislead the recognition model. As the audio is reduced in dimension and restored to the audio dimension through the encoding-decoding process, the loss function is controlled by the weights so that regions of small signal intensity dynamically admit no error, while regions of large signal intensity are allowed a reasonable error, discarding perturbations that never belonged to the signal.
Experimental verification shows that, for speech adversarial samples generated in both black-box and white-box environments, whose recognition result on an automatic voice command recognition system is an arbitrary given command, the repaired samples generated by this algorithm achieve, compared with other speech adversarial-sample repair algorithms and under the same adversarial recognition error rate relative to the original labels, a higher repair success rate while retaining auditory fidelity.
Drawings
FIG. 1 is a flow chart of the method of the invention;
FIG. 2 is a diagram of the RAE network architecture of the invention;
FIG. 3 is the original audio spectrogram;
FIG. 4 is an adversarial sample spectrogram;
FIG. 5 is the spectrogram of an audio repair sample generated by the invention.
Detailed Description
In order to make the objects, contents and advantages of the present invention clearer, the embodiments of the present invention are described in detail below in conjunction with the accompanying drawings and examples.
The invention provides a speech adversarial sample restoration method with auditory high-fidelity characteristics, which repairs adversarial samples using an RAE (Recurrent AutoEncoder) network and a high-fidelity audio reconstruction loss, and can effectively suppress the adversarial attack effect while ensuring that the compressed-and-reconstructed audio of a clean sample shows no great difference from the original sample in auditory perception.
The method uses the original audio to construct a high-fidelity autoencoder, so that adversarial samples lose their attack effect during compression and reconstruction, and uses the high-fidelity audio reconstruction loss to enhance the auditory similarity between the input audio and the reconstructed audio.
The overall process of the invention is shown in FIG. 1 and comprises the following steps:
step 1, constructing a confrontation sample restoration training data set.
4000 pieces of original audio data are collected, the format of the audio data is in a 'wav' format, the audio sampling frequency is 16kHZ, and the audio sampling time length is less than or equal to 1 second; the maximum sequence length of the audio data is calculated to be 16000, each piece of audio data is filled with zero to the maximum length, and a confrontation sample repair training data set is obtained and comprises eight types of English audio data consisting of down, go, left, off, on, right, stop and up.
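A minimal sketch of this dataset-construction step (file paths, function names and the use of torchaudio are illustrative assumptions, not taken from the patent):

```python
import torch
import torchaudio

def build_repair_dataset(wav_paths, max_len=16000):
    """Load each 'wav' clip and zero-pad it at the end to max_len samples."""
    clips = []
    for path in wav_paths:
        wave, sr = torchaudio.load(path)   # wave: (channels, samples); sr expected 16000
        x = wave[0]                        # take the mono channel
        pad = max_len - x.shape[0]         # clips are at most 1 s long
        if pad > 0:
            x = torch.nn.functional.pad(x, (0, pad))  # trailing zeros
        clips.append(x[:max_len])
    return torch.stack(clips)              # shape: (n, 16000)
```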
Step 2: build the RAE network and set the network parameters, with reference to FIG. 2.
The network structure parameters are set as follows: the input layer has dimension 100 × 160, each piece of audio data being divided into 100 segments of 160 samples; the size of the BiLSTM-based (Bi-directional Long Short-Term Memory) encoder hidden layer is 64, the size of the hidden-variable layer is 64, the size of the BiLSTM-based decoder hidden layer is 64, and the output layer has dimension 100 × 160. The BiLSTM network is a two-layer (bidirectional) recurrent neural network: the current output is influenced not only by past states but also by future states, and at each time $t$ the input is provided to two recurrent networks running in opposite directions that jointly determine the output.
The RAE network is built by combining the BiLSTM bidirectional recurrent network with the unsupervised-learning structure of the AE autoencoder into a five-layer network, arranged in order as: input layer → BiLSTM-based encoder hidden layer → hidden-variable layer → BiLSTM-based decoder hidden layer → output layer, yielding the initial RAE network. The AE network is an autoencoder, an unsupervised learning model comprising an encoding process of the original data $x$ from the input layer to the hidden layer and a decoding process from the hidden layer to the output layer; the distance between the original data $x$ and the reconstructed data $\hat{x}$ is the reconstruction error loss of the data $x$. The RAE network (RNN-AE network) is a recurrent autoencoder that uses BiLSTM networks in the encoder and decoder: the original audio data are input into the encoder, and through BiLSTM learning the encoder output $Z_t$ at the current time is influenced by the current-time state $h_t$ and the future-time state $c_t$; the encoder output at the final time enters the decoder, where BiLSTM learning is applied again and the decoder output $Z_t$ at the current time is likewise influenced by $h_t$ and $c_t$; the decoder output at the final time is then converted to the output format by a fully connected layer.
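A minimal PyTorch sketch of the RAE described above, assuming the stated dimensions (100 × 160 input segments, BiLSTM hidden size 64, hidden-variable size 64). Layer names are illustrative, and the sketch keeps a latent per segment rather than routing only the final-time encoder output into the decoder:

```python
import torch
import torch.nn as nn

class RAE(nn.Module):
    """Recurrent autoencoder: BiLSTM encoder -> latent -> BiLSTM decoder."""
    def __init__(self, seg_len=160, hidden=64, latent=64):
        super().__init__()
        self.encoder = nn.LSTM(seg_len, hidden, batch_first=True,
                               bidirectional=True)
        self.to_latent = nn.Linear(2 * hidden, latent)   # hidden-variable layer
        self.decoder = nn.LSTM(latent, hidden, batch_first=True,
                               bidirectional=True)
        self.out = nn.Linear(2 * hidden, seg_len)        # output-format conversion

    def forward(self, x):                  # x: (batch, 100, 160)
        enc, _ = self.encoder(x)           # (batch, 100, 2 * hidden)
        z = self.to_latent(enc)            # (batch, 100, latent)
        dec, _ = self.decoder(z)
        return self.out(dec)               # reconstruction: (batch, 100, 160)
```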
and 3, constructing high-fidelity audio reconstruction loss.
Using input and reconstructed audio samples, according to a formula
Figure BDA0003292623550000081
Calculating the mean square error, wherein t is 1, 2, 3 … l, l is 160 audio length, xtWhich represents the original data of the image data,
Figure BDA0003292623550000085
representing the reconstructed data.
Using input audio samples, according to a formula
Figure BDA0003292623550000082
And calculating the reconstruction error weight of the audio data at each moment, wherein lambda is 2. Wherein x ist 2Representing the strength of the speech signal, the weight w follows xt 2Is increased and decreased, lambda is more than 0 as a scale parameter, the larger the lambda is, the weight w is along with xt 2The faster the increase and decrease;
multiplying the mean square error by the reconstruction error weight, i.e. according to the formula
Figure BDA0003292623550000083
And calculating to obtain the high-fidelity audio reconstruction loss. Wherein λ is a scale parameter, the higher the audio energy, the smaller the region weight, and the formula
Figure BDA0003292623550000084
The expressed high-fidelity audio reconstruction loss function is a self-adaptive weighted mean square error loss function, allows a larger reconstruction error to be generated at a place with large voice signal intensity, and limits the error at a place with small voice signal intensity; due to the fact that weak disturbance of the countersample is often added to the area with large signal intensity, the method not only guarantees that the area with large intensity of the voice after reconstruction is allowed to have errors, but also guarantees that the signal intensity of the voice after reconstruction is smallWithout loss of area
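A sketch of this loss, where the weight $w_t = e^{-\lambda x_t^2}$ is our reading of the garbled source formula (the description only fixes that the weight decreases in $x_t^2$ at a rate set by $\lambda$):

```python
import torch

def hifi_loss(x, x_hat, lam=2.0):
    """Adaptively weighted MSE: the weight shrinks where the signal is strong,
    so high-intensity regions may err while quiet regions are preserved."""
    w = torch.exp(-lam * x ** 2)            # w_t = exp(-lambda * x_t^2)
    return torch.mean(w * (x - x_hat) ** 2)
```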
Step 4: set the training parameters and train the network.
The training parameters are set as follows: number of iteration rounds T = 30, mini-batch size s = 64, and learning rate η = 1 × 10⁻⁴; an optimization algorithm is then selected. The optimizer is a first-order-gradient-based algorithm such as SGD (Stochastic Gradient Descent), Momentum-SGD (SGD with momentum), RMSProp (Root Mean Square Propagation) or Adam (Adaptive Moment Estimation); here Adam is used with β₁ = 0.99 and β₂ = 0.999.
The adversarial-sample restoration training data set is read, and each of the eight classes of English audio data in the original audio data set is divided into 7 mini-batches, yielding the preprocessed adversarial-sample restoration training data set.
Using the preprocessed adversarial-sample restoration training data set, for each of the eight classes of English audio the 7 mini-batches are fed into the RAE network in turn; with the Adam algorithm, the sequence forward propagation → back propagation → weight update is repeated for 30 rounds, and training stops after a total of 210 iterations, yielding the trained RAE network.
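A training-loop sketch under the stated hyper-parameters (T = 30, s = 64, η = 1 × 10⁻⁴, Adam with β₁ = 0.99, β₂ = 0.999), reusing the RAE and hifi_loss sketches above; the DataLoader-based batching is an illustrative assumption and does not reproduce the per-class 7-batch split:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def train_rae(model, clips, T=30, s=64, eta=1e-4, lam=2.0):
    loader = DataLoader(TensorDataset(clips), batch_size=s, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=eta, betas=(0.99, 0.999))
    for _ in range(T):                                # T rounds
        for (batch,) in loader:                       # ceil(n/s) mini-batches
            x = batch.view(batch.shape[0], 100, 160)  # segment each clip
            x_hat = model(x)                          # forward propagation
            loss = hifi_loss(x, x_hat, lam)
            opt.zero_grad()
            loss.backward()                           # back propagation
            opt.step()                                # weight update
    return model
```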
Step 5: repair adversarial samples with the trained RAE model.
An adversarial sample is read, repaired with the trained RAE network model, and the repaired sample is output. The adversarial sample is audio data generated by a speech adversarial-sample generation algorithm; it induces a false recognition result from the speech recognition model by adding a weak perturbation to a clean sample. The adversarial audio data are in "wav" format with a sampling frequency of 16 kHz and a duration of at most 1 second, i.e. the length of the read audio sequence is at most 16000 samples; sequences shorter than 1 second are zero-padded at the end to a length of 16000.
The adversarial sample and the repaired sample are input into a speech recognition model and the repair effect is observed, which specifically comprises two aspects: on the one hand, the difference between the spectrograms of the adversarial sample and the repaired sample is observed; on the other hand, the recognition results of the adversarial sample and the repaired sample are observed. If the repaired sample is recognized correctly, the repair succeeded; otherwise it failed.
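A sketch of this repair-and-check step; the recognizer `asr` (mapping a 16000-sample tensor to a command label) and the label comparison are illustrative assumptions:

```python
import torch

def repair_and_check(model, adv, label, asr):
    """Repair one adversarial clip (1-D tensor of 16000 samples) and test it."""
    model.eval()
    with torch.no_grad():
        repaired = model(adv.view(1, 100, 160)).reshape(-1)
    success = (asr(repaired) == label)   # repair succeeds if the repaired
    return repaired, success             # sample is recognized correctly
```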
The effects of the present invention will be further described with reference to the experimental results.
1. The experimental conditions of the invention are as follows:
the software platform used in this embodiment is: windows10 operating system and Spyder.
The hardware device used in this embodiment is an Intel Core (TM) i7-9700K @3.60GHz × 8, GPU Nvidia GeForce GTX 1080Ti, 11GB video memory.
The Python version used in this example is Python 3.7.3, and the library and the corresponding version used are respectively pytorch 1.1.0, torchaudio 0.8.0, numpy 1.21.0.
2. Analysis of the experimental results:
In the experiments, adversarial samples are repaired with the speech adversarial sample restoration method with auditory high-fidelity characteristics constructed by the invention, the repaired samples are output, the spectrogram difference between the adversarial sample and the repaired sample is observed, and the recognition results of the adversarial sample and the repaired sample are compared through a speech recognition model.
Regarding the spectrogram difference between the adversarial sample and the repaired sample: FIG. 3 is the original audio spectrogram of "on"-class data, FIG. 4 is the adversarial sample spectrogram of "on"-class data, and FIG. 5 is the spectrogram of the repaired "on"-class audio generated by the method of the invention; the spectrograms of FIG. 3 and FIG. 5 are highly similar, while a certain difference exists between FIG. 4 and FIG. 5.
The success rate of speech adversarial-sample repair is computed according to the formula

$$P_z = \frac{m_z}{n_z},$$

where $z = 1, 2, \ldots$ indexes the speech class, $n_z = 500$ is the total number of adversarial samples in one class, and $m_z$ is the number of successfully repaired adversarial samples; the repair success rate is counted separately for each speech class. The results are shown in Table 1.
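A trivial sketch of this statistic (names illustrative):

```python
def repair_success_rate(m_z, n_z=500):
    """P_z = m_z / n_z for one speech class."""
    return m_z / n_z
```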
TABLE 1. Success rates of adversarial-sample repair for the different audio classes and attack types (the table in the source is an image; its values are summarized in the analysis below).
From the results in Table 1, in terms of repair success rate across audio classes: over the repairs of adversarial samples for the eight English audio classes down, go, left, off, on, right, stop and up, the proposed speech adversarial sample restoration method with auditory high-fidelity characteristics achieves repair success rates above 80% on the PGD-attack and Deepfool-attack samples of seven of the eight classes; the exception is the stop class, whose success rates on the four types of adversarial samples fall below 50%. The success rates on the Deepfool-attack samples of the go class and on the PGD-attack and Deepfool-attack samples of the on class reach 90.2%, 93.6% and 95%, respectively. In terms of repair success rate across attack types (PGD, Deepfool, SA and GA attacks), the method repairs PGD-, Deepfool- and GA-attack samples at a higher success rate than SA-attack samples for all classes except the off-class audio.
For speech adversarial samples whose recognition result on the automatic voice command recognition system is an arbitrary given command, the repaired samples generated by the method are recognized correctly; compared with other speech adversarial-sample repair algorithms, under the same adversarial recognition error rate relative to the original labels, the method achieves a higher repair success rate, retains auditory fidelity, and remains applicable to adversarial samples at lower signal-to-noise ratios.
The above is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make several modifications and variations without departing from the technical principle of the present invention, and such modifications and variations should also be regarded as falling within the protection scope of the present invention.

Claims (4)

1. A speech adversarial sample restoration method with auditory high-fidelity characteristics, comprising the following steps:
(1) constructing an adversarial-sample restoration training data set;
(2) building an RAE network and setting the network structure parameters;
(3) constructing the high-fidelity audio reconstruction loss;
(4) setting the training parameters and training the RAE network based on steps 1, 2 and 3;
(5) repairing adversarial samples with the trained RAE network;
wherein in step 1, n pieces of original audio data are collected, the maximum sequence length of the audio data is determined, and each piece of audio data is zero-padded to the maximum length to obtain the adversarial-sample restoration training data set;
step 2 comprises the following steps:
(2a) setting the network structure parameters, comprising an input layer, a BiLSTM-based encoder hidden layer, a hidden-variable layer, a BiLSTM-based decoder hidden layer and an output layer;
(2b) building the RAE network by combining the structural characteristics of the BiLSTM network and the AE network, the RAE network structure being, in order: input layer → BiLSTM-based encoder hidden layer → hidden-variable layer → BiLSTM-based decoder hidden layer → output layer, thereby obtaining the initial RAE network;
the AE network being an autoencoder, an unsupervised learning model comprising an encoding process of the original data $x$ from the input layer to the hidden layer and a decoding process from the hidden layer to the output layer, the distance between the original data $x$ and the reconstructed data $\hat{x}$ being the reconstruction error loss of the data $x$; the RAE network being an RNN-AE network, a recurrent autoencoder that uses BiLSTM networks in the encoder and decoder: the original audio data are input into the encoder, and through BiLSTM learning the encoder output $Z_t$ at the current time is influenced by the current-time state $h_t$ and the future-time state $c_t$; the encoder output at the final time enters the decoder, where BiLSTM learning is applied again and the decoder output $Z_t$ at the current time is likewise influenced by $h_t$ and $c_t$; the decoder output at the final time is then converted to the output format by a fully connected layer;
step 3 comprises the following steps:
(3a) using the input and reconstructed audio samples, computing the mean square error according to the formula

$$\mathrm{MSE} = \frac{1}{l}\sum_{t=1}^{l}\left(x_t - \hat{x}_t\right)^2,$$

where $l$ is the audio length, $x_t$ denotes the raw data at time $t$, and $\hat{x}_t$ the reconstructed data at time $t$;
(3b) using the input audio samples, computing a reconstruction error weight for the audio data at each time instant according to the formula

$$w_t = e^{-\lambda x_t^2},$$

where $x_t^2$ represents the speech signal intensity at time $t$, the weight $w$ decreases as $x_t^2$ increases, and $\lambda > 0$ is a scale parameter: the larger $\lambda$ is, the faster $w$ decreases as $x_t^2$ grows;
(3c) multiplying the mean square error terms by the reconstruction error weights, i.e. computing the high-fidelity audio reconstruction loss according to the formula

$$L = \frac{1}{l}\sum_{t=1}^{l} e^{-\lambda x_t^2}\left(x_t - \hat{x}_t\right)^2;$$

the high-fidelity audio reconstruction loss function so expressed being an adaptively weighted mean square error loss function that allows large reconstruction errors where the speech signal intensity is large and limits the error where the speech signal intensity is small;
step 4 comprises the following steps:
(4a) setting the training parameters, including the number of iteration rounds $T$, the mini-batch size $s$ and the learning rate $\eta$, and selecting an optimization algorithm based on first-order gradients;
(4b) reading the adversarial-sample restoration training data set and dividing the original audio data set into $\lceil n/s \rceil$ mini-batches, where $\lceil\cdot\rceil$ denotes rounding up, and preprocessing the mini-batch data sets to obtain the preprocessed adversarial-sample restoration training data set;
(4c) using the preprocessed adversarial-sample restoration training data set, performing back-propagation training of the initial RAE network with the selected optimization algorithm according to the high-fidelity audio reconstruction loss and the learning rate $\eta$: the $\lceil n/s \rceil$ mini-batches are fed into the RAE network in turn; with the selected optimization algorithm, the sequence forward propagation → back propagation → weight update is repeated for $T$ rounds, and training stops after a total of $T \cdot \lceil n/s \rceil$ iterations, yielding the trained RAE network.
2. The method of claim 1, wherein the optimization algorithm is an Adam algorithm.
3. The method of claim 1, wherein step 5 comprises:
(5a) reading an adversarial sample, repairing it with the trained RAE network model, and outputting the repaired sample, the adversarial sample being audio data generated by a speech adversarial-sample generation algorithm that induces a false recognition result from the speech recognition model by adding a perturbation to a clean sample;
(5b) inputting the adversarial sample and the repaired sample into a speech recognition model and observing the repair effect: if the repaired sample is recognized correctly, the repair succeeded; otherwise it failed.
4. The method of claim 3, wherein observing the repair effect specifically comprises observing the difference between the spectrograms of the adversarial sample and the repaired sample, and observing the recognition results of the adversarial sample and the repaired sample when input into the speech recognition model.
CN202111170083.5A 2021-10-08 2021-10-08 Speech adversarial sample restoration method with auditory high-fidelity characteristics Active CN113948067B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111170083.5A CN113948067B (en) Speech adversarial sample restoration method with auditory high-fidelity characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111170083.5A CN113948067B (en) Speech adversarial sample restoration method with auditory high-fidelity characteristics

Publications (2)

Publication Number Publication Date
CN113948067A CN113948067A (en) 2022-01-18
CN113948067B true CN113948067B (en) 2022-05-27

Family

ID=79330060

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111170083.5A Active CN113948067B (en) Speech adversarial sample restoration method with auditory high-fidelity characteristics

Country Status (1)

Country Link
CN (1) CN113948067B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115424635B (en) * 2022-11-03 2023-02-10 南京凯盛国际工程有限公司 Cement plant equipment fault diagnosis method based on sound characteristics

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110767216A (en) * 2019-09-10 2020-02-07 浙江工业大学 Voice recognition attack defense method based on PSO algorithm

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110930976B (en) * 2019-12-02 2022-04-15 北京声智科技有限公司 Voice generation method and device
US20210304736A1 (en) * 2020-03-30 2021-09-30 Nvidia Corporation Media engagement through deep learning
CN113223515B (en) * 2021-04-01 2022-05-31 山东大学 Automatic voice recognition method for anti-attack immunity
CN113362822B (en) * 2021-06-08 2022-09-30 北京计算机技术及应用研究所 Black box voice confrontation sample generation method with auditory masking

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110767216A (en) * 2019-09-10 2020-02-07 浙江工业大学 Voice recognition attack defense method based on PSO algorithm

Also Published As

Publication number Publication date
CN113948067A (en) 2022-01-18


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant