CN113948067B - Method for repairing adversarial speech samples with high auditory fidelity - Google Patents
Method for repairing adversarial speech samples with high auditory fidelity
- Publication number
- CN113948067B (application CN202111170083.5A)
- Authority
- CN
- China
- Prior art keywords
- network
- sample
- rae
- audio
- voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/063—Training (under G10L15/06—Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L19/005—Correction of errors induced by the transmission channel, if related to the coding algorithm
- G10L25/18—Speech or voice analysis; the extracted parameters being spectral information of each sub-band
- G10L25/30—Speech or voice analysis; analysis technique using neural networks
- G10L2015/0635—Training: updating or merging of old and new templates; mean values; weighting
- G10L2015/223—Execution procedure of a spoken command
- H—ELECTRICITY; H04L—TRANSMISSION OF DIGITAL INFORMATION; H04L63/1441—Network security: countermeasures against malicious traffic
Abstract
The invention relates to a method for repairing adversarial speech samples with high auditory fidelity, in the technical field of artificial-intelligence security. The method comprises the following steps: constructing an adversarial-sample repair training data set; building an RAE network and setting network parameters; constructing a high-fidelity audio reconstruction loss, i.e. a high-fidelity strategy that improves on the plain signal mean square error; setting training parameters and training the network; and repairing adversarial samples with the trained RAE network, using a speech recognition model to judge whether the repair succeeded. Compared with conventional speech-signal restoration methods, the repaired audio generated by this algorithm has higher auditory fidelity and a higher repair success rate, and remains applicable to adversarial samples with a low signal-to-noise ratio.
Description
Technical Field
The invention relates to the technical field of artificial-intelligence security, and in particular to a method for repairing adversarial speech samples with high auditory fidelity.
Background
In recent years, adversarial attacks on speech recognition have become a new research focus in artificial intelligence: by adding weak noise that is hard for humans to perceive to the input audio, an attacker can induce a speech recognition algorithm to produce a wrong recognition result. Facing the threat of adversarial attacks on speech, researchers at home and abroad are actively exploring security defenses for artificial-intelligence algorithms. The main defense methods fall into three categories. First, adversarial training: the training data set is augmented with adversarial samples to improve the robustness of the intelligent algorithm; this requires generating a large number of adversarial samples and changing the parameters of the original algorithm. Second, adversarial-sample detection: a detection algorithm identifies adversarial samples in the input; this can only discover an adversarial attack, not respond to it effectively. Third, adversarial-sample repair: the adversarial attack effect is suppressed by filtering out the adversarial noise, so that the original intelligent algorithm recognizes the input correctly. Research on adversarial-sample repair is therefore of great practical significance for the safe and reliable application of speech recognition algorithms.
Zhejiang University of Technology, in its patent application (application No. 201911031043.5, publication No. CN110992934A), proposes a defense method against black-box attack models on speech recognition systems. Simulated environmental noise is first added to the original audio to mimic speech input in a real scene; the randomly noised audio forms a preliminary adversarial sample, which is then optimized by a genetic algorithm and gradient estimation to obtain an accurate adversarial sample. The original audio files and the adversarial samples are then mixed into a training data set for adversarial training, and the model is retrained, improving its recognition accuracy on adversarial samples and thereby its robustness against adversarial attack. The method still has the following defects: a large number of adversarial samples must be generated, and adversarial training on only one type of adversarial sample defends poorly against other attack methods; moreover, the original speech recognition algorithm must be retrained, which limits the method in practical applications.
Beijing University of Posts and Telecommunications, in its patent application (application No. 202010206879.0, publication No. CN111564154A), proposes a method and device for defending against adversarial sample attacks based on a speech-enhancement algorithm. The method first obtains the spectral features of the speech sample to be recognized; based on these features, it estimates the sample's noise spectrum with a spectral-subtraction method based on continuous minimum tracking combined with a log-MMSE (minimum mean square error) algorithm incorporating the speech presence probability; the estimated noise spectrum is then used to denoise the sample, and the denoised sample is recognized by a pre-trained speech recognition model. The method can increase the accuracy of speech recognition and improve the defense against adversarial sample attacks. It still has the following defect: it is essentially a conventional general-purpose denoising algorithm with no improvement targeted at adversarial noise, so its defense success rate is low.
Disclosure of Invention
Technical problem to be solved
The technical problem to be solved by the invention is how to design a method for repairing adversarial speech samples with high auditory fidelity that better suppresses the attack effect of adversarial samples, while ensuring that a clean sample, after compression and reconstruction, shows no great difference from the original sample in auditory perception.
(II) technical scheme
In order to solve the above technical problem, the invention provides a method for repairing adversarial speech samples with high auditory fidelity, comprising the following steps:
(1) constructing an adversarial-sample repair training data set;
(2) building an RAE network and setting network structure parameters;
(3) constructing a high-fidelity audio reconstruction loss;
(4) setting training parameters and training the RAE network based on steps 1, 2 and 3;
(5) repairing adversarial samples with the trained RAE network.
Preferably, in step 1, n pieces of original audio data are collected, the maximum sequence length of the audio data is determined, and each piece of audio data is zero-padded to the maximum length, yielding the adversarial-sample repair training data set.
Preferably, step 2 comprises:
(2a) setting network structure parameters, including an input layer, an encoder hidden layer based on a BiLSTM network, a latent variable layer, a decoder hidden layer based on a BiLSTM network, and an output layer;
(2b) building the RAE network by combining the structural characteristics of a BiLSTM network with those of an AE network; the RAE network structure is, in order: input layer → encoder hidden layer based on the BiLSTM network → latent variable layer → decoder hidden layer based on the BiLSTM network → output layer, yielding the initial RAE network.
Preferably, step 3 comprises:
(3a) using the input and reconstructed audio samples, calculating the mean square error according to the formula L_mse = (1/l)·Σ_{t=1}^{l} (x_t − x̂_t)², where l is the audio length, x_t denotes the original data at time t, and x̂_t denotes the reconstructed data at time t;
(3b) using the input audio samples, calculating a reconstruction-error weight for the audio data at each time instant according to the formula w_t = e^{−λ·x_t²}, where x_t² represents the speech-signal intensity at time t; the weight w_t decreases as x_t² increases, and λ > 0 is a scale parameter: the larger λ is, the faster w_t decreases as x_t² grows;
(3c) multiplying the mean square error by the reconstruction-error weight, i.e. calculating the high-fidelity audio reconstruction loss according to the formula L_hf = (1/l)·Σ_{t=1}^{l} e^{−λ·x_t²}·(x_t − x̂_t)².
Preferably, the high-fidelity audio reconstruction loss function expressed by the formula L_hf = (1/l)·Σ_{t=1}^{l} e^{−λ·x_t²}·(x_t − x̂_t)² is an adaptive weighted mean-square-error loss function that allows large reconstruction errors where the speech-signal intensity is large and limits the error where the speech-signal intensity is small.
Preferably, step 4 comprises:
(4a) setting training parameters, including the number of iteration rounds T, the mini-batch size s and the learning rate η, and selecting an optimization algorithm based on first-order gradients;
(4b) reading the adversarial-sample repair training data set and dividing the original audio data set into ⌈n/s⌉ mini-batches, where ⌈·⌉ denotes rounding up, yielding the preprocessed adversarial-sample repair training data set;
(4c) using the preprocessed adversarial-sample repair training data set, training the initial RAE network by back-propagation with the selected optimization algorithm, according to the high-fidelity audio reconstruction loss and the learning rate η: the ⌈n/s⌉ mini-batches are fed into the RAE network in turn; the sequence of forward propagation, backward propagation and weight update is repeated for T rounds with the selected optimization algorithm; and training stops after T·⌈n/s⌉ iterations in total, yielding the trained RAE network.
Preferably, the optimization algorithm is the Adam algorithm.
Preferably, step 5 comprises:
(5a) reading an adversarial sample, repairing it with the trained RAE network model, and outputting the repaired sample; the adversarial sample is audio data generated by a speech adversarial-sample generation algorithm, which induces the speech recognition model to produce a wrong recognition result by adding a perturbation to a clean sample;
(5b) inputting the adversarial sample and the repaired sample into a speech recognition model and observing the repair effect: if the repaired sample is recognized correctly, the repair succeeded; otherwise it failed.
Preferably, the repair effect covers two aspects: on one hand, the difference between the spectrograms of the adversarial sample and the repaired sample is observed; on the other hand, the results of feeding the adversarial sample and the repaired sample into the speech recognition model are compared.
The invention also provides application of the method in the technical field of artificial intelligence safety.
(III) advantageous effects
The method repairs adversarial samples with an RAE network and a high-fidelity audio reconstruction loss, and better suppresses the adversarial attack effect while ensuring that a clean sample, after compression and reconstruction, shows no great difference from the original sample in auditory perception.
The high-fidelity reconstruction-loss strategy is motivated by the observation that adversarial samples are constructed by adding weak perturbations in regions of large signal variance, which then mislead the recognition model. While the audio is reduced in dimension and expanded back in the encode-decode process, the loss function is controlled by the weights: regions of small signal intensity are dynamically allowed no error, while regions of large signal intensity are allowed reasonable errors, so that perturbations that do not belong to the original signal are discarded.
Experimental verification shows that, for speech adversarial samples generated in both black-box and white-box settings, the repaired samples produced by this algorithm are recognized correctly by the automatic voice-command recognition system for any given command. Compared with other speech adversarial-sample repair algorithms, under the same error rate between the adversarial-sample recognition result and the original label, the method achieves a higher repair success rate and auditory fidelity.
Drawings
FIG. 1 is a flow chart of the method of the invention;
FIG. 2 is a diagram of the RAE network architecture of the invention;
FIG. 3 is a spectrogram of the original audio;
FIG. 4 is a spectrogram of an adversarial sample;
FIG. 5 is a spectrogram of an audio repair sample generated by the invention.
Detailed Description
To make the objects, contents and advantages of the invention clearer, the embodiments of the invention are described in detail below with reference to the accompanying drawings and examples.
The invention provides a method for repairing adversarial speech samples with high auditory fidelity. It repairs adversarial samples with an RAE (Recurrent AutoEncoder) network and a high-fidelity audio reconstruction loss, and better suppresses the adversarial attack effect while ensuring that a clean sample, after compression and reconstruction, shows no great difference from the original sample in auditory perception.
The method uses the original audio to build a high-fidelity autoencoder, so that the compression and reconstruction process suppresses the adversarial attack effect of an adversarial sample; in addition, the high-fidelity audio reconstruction loss enhances the auditory similarity between the input audio and the reconstructed audio.
the overall process of the invention is shown in figure 1 and comprises the following steps:
step 1, constructing a confrontation sample restoration training data set.
4000 pieces of original audio data are collected, the format of the audio data is in a 'wav' format, the audio sampling frequency is 16kHZ, and the audio sampling time length is less than or equal to 1 second; the maximum sequence length of the audio data is calculated to be 16000, each piece of audio data is filled with zero to the maximum length, and a confrontation sample repair training data set is obtained and comprises eight types of English audio data consisting of down, go, left, off, on, right, stop and up.
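Step 1 above (collect clips, determine the maximum length, zero-pad at the tail) can be sketched as follows; the function name and the use of NumPy are illustrative assumptions, not taken from the patent:

```python
import numpy as np

def build_repair_dataset(clips, max_len=16000):
    """Zero-pad every clip at the tail so all share the maximum length
    (16000 samples = 1 s at 16 kHz, as in the text)."""
    padded = np.zeros((len(clips), max_len), dtype=np.float32)
    for i, clip in enumerate(clips):
        n = min(len(clip), max_len)
        padded[i, :n] = clip[:n]
    return padded

# two toy clips: one shorter than 1 s, one exactly 1 s
clips = [np.ones(12000, dtype=np.float32), np.ones(16000, dtype=np.float32)]
dataset = build_repair_dataset(clips)
print(dataset.shape)  # (2, 16000)
```

The padded array can then be reshaped to the 100 × 160 segment layout expected by the network input layer.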
Step 2: build the RAE network and set the network parameters, with reference to FIG. 2.
The network structure parameters are set as follows: the input layer has dimension 100 × 160, i.e. each piece of audio data is divided into 100 segments of 160 samples each; the encoder hidden layer based on a BiLSTM (Bi-directional Long Short-Term Memory) network has size 64; the latent variable layer has size 64; the decoder hidden layer based on a BiLSTM network has size 64; and the output layer has dimension 100 × 160. A BiLSTM network is a bidirectional recurrent neural network: the current output is influenced not only by past states but also by future states, and at each time t the input feeds two recurrent networks running in opposite directions that jointly determine the output.
The RAE network is built by combining the structural characteristics of the bidirectional recurrent BiLSTM network with those of the unsupervised AE autoencoder into a five-layer network. Its structure is, in order: input layer → encoder hidden layer based on the BiLSTM network → latent variable layer → decoder hidden layer based on the BiLSTM network → output layer, yielding the initial RAE network. The AE network is an autoencoder, an unsupervised learning model comprising an encoding process of the original data x from the input layer to the hidden layer and a decoding process from the hidden layer to the output layer, so that the distance between the original data x and the reconstructed data x̂ is the reconstruction error loss of x. The RAE network (RNN-AE network) is a recurrent autoencoder that uses a BiLSTM network in both encoder and decoder. The original audio data are input into the encoder; through BiLSTM learning, the encoder output Z_t at the current time is influenced by the current state h_t and the future state c_t, and the encoder output Z_t at the final time step enters the decoder; through BiLSTM learning again, the decoder output Z_t at the current time is influenced by the current state h_t and the future state c_t, and the decoder's final-time output is finally converted to the output format through a fully connected layer.
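A minimal PyTorch sketch of the structure described above (input 100 × 160, BiLSTM encoder and decoder with hidden size 64, latent size 64, fully connected output). The class and attribute names are illustrative, and replicating the final-time latent across the decoder's time steps is one plausible reading of the text, not the patent's exact implementation:

```python
import torch
import torch.nn as nn

class RAE(nn.Module):
    def __init__(self, seg_len=160, n_seg=100, hidden=64):
        super().__init__()
        # encoder hidden layer based on a BiLSTM network, size 64
        self.encoder = nn.LSTM(seg_len, hidden, bidirectional=True, batch_first=True)
        self.to_latent = nn.Linear(2 * hidden, hidden)  # latent variable layer, size 64
        # decoder hidden layer based on a BiLSTM network, size 64
        self.decoder = nn.LSTM(hidden, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, seg_len)       # fully connected output layer
        self.n_seg = n_seg

    def forward(self, x):                       # x: (batch, 100, 160)
        h, _ = self.encoder(x)                  # (batch, 100, 128)
        z = self.to_latent(h[:, -1, :])         # final-time latent, (batch, 64)
        z_seq = z.unsqueeze(1).repeat(1, self.n_seg, 1)  # replicate for the decoder
        d, _ = self.decoder(z_seq)              # (batch, 100, 128)
        return self.out(d)                      # reconstruction, (batch, 100, 160)

model = RAE()
x = torch.randn(8, 100, 160)                    # a batch of 8 clips, 100 segments x 160
print(model(x).shape)  # torch.Size([8, 100, 160])
```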
and 3, constructing high-fidelity audio reconstruction loss.
Using input and reconstructed audio samples, according to a formulaCalculating the mean square error, wherein t is 1, 2, 3 … l, l is 160 audio length, xtWhich represents the original data of the image data,representing the reconstructed data.
Using input audio samples, according to a formulaAnd calculating the reconstruction error weight of the audio data at each moment, wherein lambda is 2. Wherein x ist 2Representing the strength of the speech signal, the weight w follows xt 2Is increased and decreased, lambda is more than 0 as a scale parameter, the larger the lambda is, the weight w is along with xt 2The faster the increase and decrease;
multiplying the mean square error by the reconstruction error weight, i.e. according to the formulaAnd calculating to obtain the high-fidelity audio reconstruction loss. Wherein λ is a scale parameter, the higher the audio energy, the smaller the region weight, and the formulaThe expressed high-fidelity audio reconstruction loss function is a self-adaptive weighted mean square error loss function, allows a larger reconstruction error to be generated at a place with large voice signal intensity, and limits the error at a place with small voice signal intensity; due to the fact that weak disturbance of the countersample is often added to the area with large signal intensity, the method not only guarantees that the area with large intensity of the voice after reconstruction is allowed to have errors, but also guarantees that the signal intensity of the voice after reconstruction is smallWithout loss of area
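A NumPy sketch of the adaptive weighted loss, assuming the weight form w_t = e^{−λ·x_t²} implied by the description (the weight shrinks as x_t² grows, faster for larger λ), with λ = 2 as in the text; names are illustrative:

```python
import numpy as np

def high_fidelity_loss(x, x_hat, lam=2.0):
    """Adaptive weighted MSE: exp(-lam * x_t^2) discounts the penalty where
    the signal is strong and keeps it tight where the signal is weak."""
    w = np.exp(-lam * x ** 2)
    return float(np.mean(w * (x - x_hat) ** 2))

x = np.array([0.0, 0.1, 0.9])   # weak, weak, strong signal samples
x_hat = x + 0.1                 # identical absolute reconstruction error everywhere
# the weighting makes the loss smaller than the plain MSE, because the error
# at the strong-signal sample x = 0.9 is discounted
print(high_fidelity_loss(x, x_hat) < float(np.mean((x - x_hat) ** 2)))  # True
```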
And 4, setting training parameters and training a network.
The training parameters are set as follows: the number of iteration rounds T = 30, the mini-batch size s = 64, and the learning rate η = 1 × 10⁻⁴. An optimization algorithm based on first-order gradients is selected from SGD (Stochastic Gradient Descent), Momentum-SGD (SGD with momentum), RMSProp (Root Mean Square Propagation), Adam (Adaptive Moment Estimation), and the like; here the Adam algorithm is used with β₁ = 0.99, β₂ = 0.999.
The adversarial-sample repair training data set is read, and the eight classes of English audio data in the original audio data set are divided into 7 mini-batches, yielding the preprocessed adversarial-sample repair training data set.
Using the preprocessed adversarial-sample repair training data set, the 7 mini-batches of the eight-class English audio data are fed into the RAE network in turn; with the Adam algorithm, the sequence of forward propagation, backward propagation and weight update is repeated for 30 rounds, and training stops after 210 iterations in total, yielding the trained RAE network.
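The training set-up above (Adam with η = 1 × 10⁻⁴, mini-batches of 64, 30 rounds over 7 batches = 210 iterations) can be sketched as follows; to keep the sketch self-contained, the model is a stand-in linear autoencoder rather than the RAE:

```python
import torch
import torch.nn as nn

# stand-in linear autoencoder so the training sketch is self-contained
model = nn.Sequential(nn.Linear(160, 64), nn.Linear(64, 160))
# Adam with eta = 1e-4 and the beta values given in the text
opt = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.99, 0.999))

def high_fidelity_loss(x, x_hat, lam=2.0):
    # adaptive weighted MSE from step 3
    return torch.mean(torch.exp(-lam * x ** 2) * (x - x_hat) ** 2)

data = torch.randn(448, 160)        # 448 toy clips -> 7 mini-batches of s = 64
steps = 0
for epoch in range(30):             # T = 30 rounds
    for batch in data.split(64):
        opt.zero_grad()
        loss = high_fidelity_loss(batch, model(batch))  # forward propagation
        loss.backward()             # backward propagation
        opt.step()                  # weight update
        steps += 1
print(steps)  # 210 iterations in total, as in the text
```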
Step 5: repair the adversarial samples with the trained RAE model.
An adversarial sample is read, repaired with the trained RAE network model, and the repaired sample is output. The adversarial sample is audio data generated by a speech adversarial-sample generation algorithm, which induces the speech recognition model to produce a wrong recognition result by adding a weak perturbation to a clean sample. The adversarial audio data are in "wav" format with a sampling frequency of 16 kHz and a duration of at most 1 second, i.e. the read audio sequence has length at most 16000; sequences shorter than 1 second are zero-padded at the tail to length 16000.
The adversarial sample and the repaired sample are input into a speech recognition model and the repair effect is observed in two respects: on one hand, the difference between the spectrograms of the adversarial sample and the repaired sample; on the other hand, the recognition results of the two samples. If the repaired sample is recognized correctly, the repair succeeded; otherwise it failed.
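The repair-and-verify logic of step 5 can be sketched as below; `rae_repair` and `recognise` are illustrative stand-in stubs for the trained RAE network and the speech recognition model, not the patent's actual models:

```python
def rae_repair(audio):
    # stand-in for: zero-pad to 16000 samples and run the trained RAE
    return audio

def recognise(audio):
    # stand-in speech recognition model that always outputs the command "on"
    return "on"

def repair_succeeded(adversarial, true_label):
    """A repair succeeds iff the repaired sample is recognised as the
    correct (original) label."""
    repaired = rae_repair(adversarial)
    return recognise(repaired) == true_label

print(repair_succeeded([0.1, 0.2, 0.3], "on"))  # True with the stub recogniser
```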
The effects of the present invention will be further described with reference to the experimental results.
1. The experimental conditions of the invention are as follows:
The software platform used in this embodiment is the Windows 10 operating system with Spyder.
The hardware used in this embodiment is an Intel Core(TM) i7-9700K @ 3.60 GHz × 8 CPU and an Nvidia GeForce GTX 1080Ti GPU with 11 GB of video memory.
The Python version used in this example is 3.7.3; the libraries and versions used are pytorch 1.1.0, torchaudio 0.8.0 and numpy 1.21.0.
2. The experimental result analysis of the invention:
the experiment of the invention is that the voice confrontation sample restoration method with the characteristics of hearing high fidelity constructed by the invention is utilized to restore the confrontation sample, the restored sample is output, the difference of the acoustic spectrums of the confrontation sample and the restored sample is observed, and the confrontation sample and the restored sample are compared through a voice recognition model to obtain a recognition result.
The acoustic spectrum difference between the confrontation sample and the repaired sample, fig. 3 is an original audio spectrogram of on-type data, fig. 4 is a confrontation sample spectrogram of on-type data, fig. 5 is a sample spectrogram of on-type audio repaired generated by the method of the present invention, the similarity between the acoustic spectrums of fig. 3 and 5 is high, and a certain difference exists between fig. 4 and 5.
The success rate of repairing speech adversarial samples is calculated according to the formula P_z = m_z / n_z, where z indexes the speech class, n_z = 500 is the total number of adversarial samples in one class, and m_z is the number of successfully repaired adversarial samples; the repair success rate is computed separately for each speech class. The results are shown in Table 1.
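The statistic can be computed directly; per the description, n_z = 500 adversarial samples per class and m_z successful repairs:

```python
def repair_success_rate(m_z, n_z=500):
    """P_z = m_z / n_z: fraction of the n_z adversarial samples of class z
    that were repaired successfully."""
    return m_z / n_z

# e.g. 451 of 500 repaired corresponds to the reported 90.2%
print(repair_success_rate(451))  # 0.902
```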
TABLE 1 success rate results of confrontation sample repair under different audio classes
From the results in Table 1, looking at the repair success rate by audio class: across the eight classes of English audio (down, go, left, off, on, right, stop, up), the proposed method achieves a repair success rate above 80% on the PGD-attack and Deepfool-attack samples of seven classes, the exception being the stop class, whose four kinds of adversarial samples are repaired with success rates below 50%; the Deepfool-attack samples of the go class and the PGD-attack and Deepfool-attack samples of the on class reach 90.2%, 93.6% and 95% respectively. Looking at the repair success rate by attack type (PGD, Deepfool, SA and GA attacks): for all classes except the off-class audio, the repair success rate on PGD-, Deepfool- and GA-attack samples is higher than that on SA-attack samples.
For any given command, the repaired sample generated by the method is correctly recognized by the automatic voice-command recognition system. Compared with other speech adversarial-sample repair algorithms, under the same error rate between the adversarial-sample recognition result and the original label, the method achieves a higher repair success rate and auditory fidelity, and remains applicable to adversarial samples with a low signal-to-noise ratio.
The above is only a preferred embodiment of the invention. It should be noted that those skilled in the art can make several modifications and variations without departing from the technical principle of the invention, and such modifications and variations shall also fall within the protection scope of the invention.
Claims (4)
1. A method for repairing adversarial speech samples with high auditory fidelity, characterized by comprising the following steps:
(1) constructing an adversarial-sample repair training data set;
(2) building an RAE network and setting network structure parameters;
(3) constructing a high-fidelity audio reconstruction loss;
(4) setting training parameters and training the RAE network based on steps 1, 2 and 3;
(5) repairing adversarial samples with the trained RAE network;
in step 1, n pieces of original audio data are collected, the maximum sequence length of the audio data is determined, and each piece of audio data is zero-padded to the maximum length, yielding the adversarial-sample repair training data set;
step 2 comprises the following steps:
(2a) setting network structure parameters, including an input layer, an encoder hidden layer based on a BiLSTM network, a hidden-variable layer, a decoder hidden layer based on a BiLSTM network, and an output layer;
(2b) building the RAE network by combining the structural characteristics of the BiLSTM network and the AE network, the RAE network structure being, in order: input layer → BiLSTM-based encoder hidden layer → hidden-variable layer → BiLSTM-based decoder hidden layer → output layer, thereby obtaining an initial RAE network;
the AE network is an auto-encoder network, an unsupervised learning model comprising an encoding process of the original data x from the input layer to the hidden layer and a decoding process from the hidden layer to the output layer, such that the original data x and the reconstructed data x̂ are made as close as possible; the RAE network is an RNN-AE network, i.e. a recurrent auto-encoder network in which BiLSTM networks are used in both the encoder and the decoder; the original audio data is input into the encoder and learned by the BiLSTM network, so that the encoder output Z_t at the current moment is influenced by the current-time state h_t and the future-time state c_t; the encoder output Z at the final moment enters the decoder and is learned again by a BiLSTM network, the decoder output Z_t at the current moment likewise being influenced by the current-time state h_t and the future-time state c_t; finally, the decoder output is converted to the required output format through a fully connected layer;
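The encoder → hidden-variable → decoder data flow described above can be sketched in numpy. This is a toy illustration only: a plain tanh RNN cell stands in for the BiLSTM layers, the weights are random rather than trained, and all names and dimensions are assumptions rather than the patent's parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_layer(x, W_in, W_rec):
    """Run a simple tanh RNN over time (a minimal stand-in for the
    BiLSTM layers of the RAE); returns the hidden state at every step."""
    h = np.zeros(W_rec.shape[0])
    outs = []
    for x_t in x:
        h = np.tanh(W_in @ x_t + W_rec @ h)
        outs.append(h)
    return np.array(outs)

# toy dimensions: 1-D audio frames, hidden size 8, latent size 4
T, hidden, latent = 16, 8, 4
x = rng.standard_normal((T, 1))

# input layer -> encoder hidden layer -> hidden-variable layer
enc = rnn_layer(x, rng.standard_normal((hidden, 1)),
                rng.standard_normal((hidden, hidden)) * 0.1)
z = enc @ rng.standard_normal((hidden, latent))
# -> decoder hidden layer -> fully connected output layer
dec = rnn_layer(z, rng.standard_normal((hidden, latent)),
                rng.standard_normal((hidden, hidden)) * 0.1)
x_hat = dec @ rng.standard_normal((hidden, 1))
print(x_hat.shape)  # (16, 1): reconstruction has the input's shape
```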
step 3 comprises the following steps:
(3a) using the input and reconstructed audio samples, calculating the mean square error according to the formula MSE = (1/l)·Σ_{t=1}^{l} (x_t − x̂_t)², where l is the audio length, x_t denotes the original data at time t, and x̂_t denotes the reconstructed data at time t;
(3b) using the input audio samples, calculating a reconstruction error weight for the audio data at each moment according to the formula w_t = e^(−λ·x_t²), where x_t² represents the speech signal intensity at time t; the weight w_t decreases as x_t² increases, and λ > 0 is a scale parameter: the larger λ is, the faster w_t decreases as x_t² increases;
(3c) multiplying the squared error at each moment by its reconstruction error weight, i.e. calculating the high-fidelity audio reconstruction loss according to the formula L = (1/l)·Σ_{t=1}^{l} e^(−λ·x_t²)·(x_t − x̂_t)²;
the high-fidelity audio reconstruction loss function expressed by this formula is an adaptively weighted mean-square-error loss function: it allows large reconstruction errors where the speech signal intensity is large, and constrains the error where the speech signal intensity is small;
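A minimal numpy sketch of the adaptively weighted loss in step 3, assuming the exponential weight w_t = e^(−λ·x_t²); the surviving text describes the weight only qualitatively, so the exact functional form here is a reconstruction:

```python
import numpy as np

def hifi_reconstruction_loss(x, x_hat, lam=1.0):
    """Adaptively weighted MSE: the weight exp(-lam * x**2) shrinks where
    the signal intensity x_t**2 is large, so loud regions tolerate larger
    reconstruction error while quiet regions are reconstructed tightly."""
    w = np.exp(-lam * x ** 2)
    return np.mean(w * (x - x_hat) ** 2)

# the same absolute error of 0.1 per sample, in a quiet vs a loud region
quiet_err = hifi_reconstruction_loss(np.zeros(3), np.full(3, 0.1))
loud_err = hifi_reconstruction_loss(np.full(3, 2.0), np.full(3, 2.1))
print(quiet_err > loud_err)  # True
```

The identical 0.1 error costs more in the quiet region, matching the claim that errors are constrained where the signal intensity is small and tolerated where it is large.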
step 4 comprises the following steps:
(4a) setting training parameters, including the number of iteration rounds T, the mini-batch size s, and the learning rate η, and selecting an optimization algorithm based on first-order gradients;
(4b) reading the adversarial-sample repair training data set, dividing the original audio data set into ⌈n/s⌉ mini-batches, and preprocessing them to obtain the preprocessed adversarial-sample repair training data set, where ⌈·⌉ denotes rounding up;
(4c) using the preprocessed adversarial-sample repair training data set, training the initial RAE network by back propagation with the selected optimization algorithm, according to the high-fidelity audio reconstruction loss and the learning rate η: the ⌈n/s⌉ mini-batches are input into the RAE network in turn, and forward propagation, back propagation, and weight updating are repeated for T rounds with the selected optimization algorithm; training stops after a total of T·⌈n/s⌉ iterations, yielding the trained RAE network.
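The batch bookkeeping in step 4 (⌈n/s⌉ mini-batches per round, T·⌈n/s⌉ total iterations) amounts to the following, with illustrative values for n, s and T:

```python
import math

n, s, T = 1000, 64, 30                 # dataset size, mini-batch size, rounds (illustrative)
batches_per_epoch = math.ceil(n / s)   # ⌈n/s⌉ mini-batches per round
total_iterations = T * batches_per_epoch
print(batches_per_epoch, total_iterations)  # 16 480
```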
2. The method of claim 1, wherein the optimization algorithm is the Adam algorithm.
3. The method of claim 1, wherein step 5 comprises:
(5a) reading an adversarial sample, repairing it with the trained RAE network model, and outputting the repaired sample; the adversarial sample is audio data generated by a voice adversarial-sample generation algorithm, which induces the voice recognition model to produce an erroneous recognition result by adding a perturbation to a clean sample;
(5b) inputting the adversarial sample and the repaired sample into a voice recognition model and observing the repair effect of the adversarial sample; if the repaired sample is recognized correctly, the repair is successful; otherwise, the repair has failed.
4. The method of claim 3, wherein observing the repair effect of the adversarial sample specifically comprises observing the difference between the spectrogram of the adversarial sample and that of the repaired sample, and observing the recognition results of the voice recognition model on the adversarial sample and on the repaired sample.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111170083.5A CN113948067B (en) | 2021-10-08 | 2021-10-08 | Voice countercheck sample repairing method with hearing high fidelity characteristic |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113948067A CN113948067A (en) | 2022-01-18 |
CN113948067B true CN113948067B (en) | 2022-05-27 |
Family
ID=79330060
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111170083.5A Active CN113948067B (en) | 2021-10-08 | 2021-10-08 | Voice countercheck sample repairing method with hearing high fidelity characteristic |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113948067B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115424635B (en) * | 2022-11-03 | 2023-02-10 | 南京凯盛国际工程有限公司 | Cement plant equipment fault diagnosis method based on sound characteristics |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110767216A (en) * | 2019-09-10 | 2020-02-07 | 浙江工业大学 | Voice recognition attack defense method based on PSO algorithm |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110930976B (en) * | 2019-12-02 | 2022-04-15 | 北京声智科技有限公司 | Voice generation method and device |
US20210304736A1 (en) * | 2020-03-30 | 2021-09-30 | Nvidia Corporation | Media engagement through deep learning |
CN113223515B (en) * | 2021-04-01 | 2022-05-31 | 山东大学 | Automatic voice recognition method for anti-attack immunity |
CN113362822B (en) * | 2021-06-08 | 2022-09-30 | 北京计算机技术及应用研究所 | Black box voice confrontation sample generation method with auditory masking |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||