CN114664310B - Silent attack classification promotion method based on attention enhancement filtering - Google Patents
- Publication number
- CN114664310B (application CN202210194280.9A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G10L17/02 — Speaker identification or verification: preprocessing operations; pattern representation or modelling; feature selection or extraction
- G10L17/04 — Speaker identification or verification: training, enrolment or model building
- G10L17/06 — Speaker identification or verification: decision making techniques; pattern matching strategies
- G10L17/14 — Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
- G10L17/18 — Artificial neural networks; connectionist approaches
- G10L19/02 — Speech or audio coding/decoding using spectral analysis, e.g. transform vocoders or subband vocoders
- G10L19/26 — Pre-filtering or post-filtering
- G10L21/0232 — Noise filtering: processing in the frequency domain
- G10L25/45 — Speech or voice analysis characterised by the type of analysis window
- H04L63/1441 — Network security: countermeasures against malicious traffic
- Y02D30/70 — Reducing energy consumption in wireless communication networks
Abstract
The invention discloses a silent-attack classification improvement method based on attention-enhanced filtering, providing a speech-feature processing algorithm that quickly and effectively enlarges the difference between silent-attack audio and normal audible audio. The method amplifies this ear-imperceptible class of voice attack so that it can be detected, and can be deployed immediately on existing devices of various types. Because a single unified model is trained on device-independent normal audible audio, attack data can be detected without per-device training, enabling targeted defense of IoT smart-voice systems. By reducing the classifier's demand for labeled samples, the method makes unsupervised classification feasible and addresses a key weakness of existing attack-detection approaches: the audio features they rely on may not exist on every device, so a dedicated feature set, dataset, and model must be built for each device to be protected, at high cost.
Description
Technical Field
The invention belongs to the technical field of artificial-intelligence voice-assistant security, and in particular relates to a silent-attack classification improvement method based on attention-enhanced filtering.
Background
In the Internet-of-things era, many latent security risks have emerged. For smart voice-control systems, the most destructive and covert attack is the silent attack, also known as the Dolphin Attack (DA). It is an effective attack on speech-recognition systems whose essence is to exploit a nonlinearity vulnerability in microphone hardware. The popularity of voice assistants has exacerbated the threat of silent voice attacks, which can covertly control smart devices without user authorization. For example, an attacker may send a voice command that is imperceptible to the human ear to a smart speaker and have it unlock the home without the user noticing. The working principle of a typical silent voice attack is shown in Fig. 1.
First, the malicious voice command is amplitude-modulated onto an ultrasonic carrier (e.g., 25 kHz). After the microphone receives the modulated ultrasound, the nonlinear effect of the microphone demodulates the high-frequency input signal, and the recovered malicious command is passed by the microphone to the downstream speech-recognition algorithm. The nonlinear transfer function of the microphone can be expressed as:

s_out(t) = A_1 s_in(t) + A_2 s_in^2(t)

where s_in(t) and s_out(t) denote the input and output of the microphone, respectively. Because an attacker exploits this nonlinearity, the microphone inevitably recovers an audible voice command from the amplitude-modulated ultrasound. Finally, a low-pass filter removes the high-frequency ultrasonic carrier, leaving only the baseband command in the audio, which the voice assistant can recognize and execute. Since the modulated ultrasound is above 20 kHz, silent voice attacks are imperceptible to human users.
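This demodulation can be checked numerically. The sketch below is a non-authoritative illustration: the 400 Hz tone, 25 kHz carrier, coefficients A1 and A2, and the moving-average low-pass filter are all assumptions, not parameters from the patent. It modulates a baseband tone onto an ultrasonic carrier, applies the square-law nonlinearity, and low-pass filters the result; the surviving baseband correlates strongly with the original tone.

```python
import numpy as np

fs = 192_000                              # sample rate high enough for a 25 kHz carrier
t = np.arange(int(0.05 * fs)) / fs
v = np.sin(2 * np.pi * 400 * t)           # stand-in "voice command": a 400 Hz tone
carrier = np.cos(2 * np.pi * 25_000 * t)
s_in = (1.0 + 0.5 * v) * carrier          # amplitude-modulated ultrasound

# Square-law microphone nonlinearity: s_out = A1*s_in + A2*s_in^2
A1, A2 = 1.0, 0.5
s_mic = A1 * s_in + A2 * s_in ** 2

# Crude moving-average low-pass (~2 kHz first null) standing in for the real filter
kernel = np.ones(96) / 96
s_out = np.convolve(s_mic, kernel, mode="same")
s_out -= s_out.mean()                     # discard the demodulated DC offset

# The quadratic term demodulates the command: s_out should track v
corr = np.corrcoef(s_out, v)[0, 1]
```

The linear term A1·s_in stays at the carrier frequency and is removed by the filter; only the square-law term carries the command down to baseband.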
It is worth noting that, in the inventors' large-scale experiments, when the silent attack is applied to different devices, the clarity and timbre of the attack audio differ markedly from those of normal audio, while the quality of normal sound recorded by the various devices is almost identical. The underlying reason is the abnormal response of different microphones to high-frequency signals. As can be seen from Fig. 2, microphone manufacturers mainly tune the microphone frequency response to the PSTN standard in the 300-3400 Hz band, whereas the responses excited by a silent attack in the high-frequency band differ from device to device. In addition, ultrasound, being high-frequency mechanical vibration, causes abnormal responses in the microphone's internal resonance cavity and related structures.
Existing defenses fall largely into two categories: hardware-based and software-based. Hardware-based protection cannot be retrofitted to stock devices already on the market, and modification costs are high, so software-based protection has a clear advantage; the usual approach combines feature engineering with a supervised machine-learning classification model. However, because the spectral characteristics of attack audio and normal audio are close, such classification models require large amounts of labeled data to train, and silent-attack samples are hard to obtain owing to their complex generation process. Unsupervised deep-learning methods, in turn, depend on the two classes being clearly separable at the feature level. A feature-enhancement algorithm that can effectively amplify the difference between normal and attack audio at the feature-processing stage is therefore urgently needed.
Disclosure of Invention
To address the defects of the prior art, the invention provides a silent-attack classification improvement method based on attention-enhanced filtering, which amplifies the spectral-feature difference between normal audio and attack audio, thereby effectively reducing the classifier's demand for labeled samples and making unsupervised classification feasible.
The invention is realized by adopting the following technical scheme:
a silent attack classification promotion method based on attention enhancement filtering comprises the following steps:
step 1: noise perception: noise queues of the last five sampled audios are constructed, averaged to represent environmental noise, and updated by using a timer thread, so that the noise conditions can be reflected by the queues in real time; stopping noise perception when a voice command is detected;
step 2: noise removal: carrying out noise removal on the voice instruction section by adopting improved spectral subtraction;
and 3, step 3: removing a mute section; and (3) eliminating a mute section by adopting an adaptive threshold method: first, the amplitude of the audio signal is normalized to [ -1,1] using min-max normalization; secondly, recording the maximum energy frame as 0dB, and regarding the frame with energy lower than a specific threshold value as a mute section, wherein the range of the specific threshold value is-45 dB to-15 dB;
and 4, step 4: normalizing the voice length; designing the voice length to be 1.5 seconds, and filling the voice command with the duration less than 1.5 seconds by adopting a repeated filling method;
and 5: generating a spectrogram; carrying out feature selection on the voice command by adopting short-time Fourier transform to obtain audio spectrum features; and applying a hanning window to each frame of the audio spectrum;
step 6: long-term average normalization; averaging along the time axis of the logarithmic STFT spectrum, thereby obtaining a long-term average spectrum (LTAS):
where X (k, L) is the frequency spectrum of the signal X (n), k is the frequency index, L is the frame index, and L is the total number of frames;
and 7: attention statistical filtering;
step 7.1: carrying out significance statistics on each frequency sub-band according to speakers and voice contents aiming at the voice data set; specifically, the formula is shown as follows:
wherein Fratio is a one-dimensional vector composed of significance weights for each frequency sub-band,is the jth speech segment of speaker i, where i ∈ [1],j∈[1,...,N];
u i The feature average of all the voice segments of the speaker i is obtained, and u is the feature average of all the voice segments of the M speakers;
step 7.2: obtaining a weight coefficient vector of each voice command according to the significance of each sub-band of the normal voice command in the step 7.1, wherein the vector dimension is related to the shape of the long-term average spectrum; applying the weight coefficient vector to the weighting of different sub-bands of each input long-term average spectrum, making some frequency sub-bands more prominent while masking other sub-bands;
and 8: enhancement of a Mel inverse filter;
according to the analysis of a silent attack mechanism, the attack signal has stronger energy in a low frequency band lower than 100Hz and a high frequency band higher than 5 kHz; the frequency bands distinguished by the existence of the characteristics of the silent attack and the normal audio frequency are ear-insensitive frequency bands, and the ear-insensitive frequency bands are enhanced.
In the above technical solution, step 2 specifically comprises: setting a spectral floor βP_n(w); the actual spectrum P_s(w) and the estimated noise spectrum αP_n(w) are subtracted to obtain the spectrum D(w), and wherever the amplitude of D(w) falls below βP_n(w) the output is set uniformly to that floor value, reducing the "musical noise" easily produced by the original spectral subtraction method; the strength of this wide-band floor is adjusted through the value of β:

D(w) = P_s(w) - αP_n(w)
output(w) = D(w), if D(w) > βP_n(w); otherwise βP_n(w)

where α is the subtraction factor and β is the spectral-floor threshold parameter.
Further, in step 3, frames with energy below -35 dB are regarded as silence.
Further, step 8 specifically comprises: expanding the high band above 5 kHz by linear interpolation, compressing the mid band between 100 Hz and 5 kHz, and expanding the low band below 100 Hz of the input spectrogram by linear interpolation; the interpolated and expanded low-frequency spectrum is then repeated.
The invention has the beneficial effects that:
the invention provides a hidden audio classification promotion method based on attention enhancement filtering, which realizes the obvious amplification of the input voice characteristic difference on the voice characteristic level and greatly reduces the dependence of audio data on a label. For software-type dolphin sound protection, the method is usually composed of two parts, namely front-end feature processing and a classification algorithm. The feature vectors with high discrimination are obtained through the audio preprocessing, so that the limitation on the rear-end classification algorithm is greatly reduced, the algorithm can be adapted to simple and low-calculated-quantity models such as support vector machines, random forests and the like, the whole algorithm can be deployed on the Internet of things terminal, and the high calculation power of a cloud server and the cost of communication overhead are saved quickly and efficiently.
Drawings
FIG. 1 is a schematic diagram of a silent instruction attack exploiting a microphone nonlinear vulnerability attack;
FIG. 2 is a frequency response curve for six different models of microphones;
FIG. 3 is a flow chart of a silent attack classification promotion method based on attention-enhancing filtering;
FIG. 4 is a comparison graph of normal/attack speech spectra before and after long term normalization and attention statistics filtering;
FIG. 5 is a comparison graph of normal/attack speech spectra before and after full feature enhancement (long term normalization, attention statistics filtering, mel inverse filter);
FIG. 6 is the microphone replacement experiment (the microphone of a Samsung S20 attached to an OPPO Find X2);

FIG. 7 is a comparison graph of the attack audio spectrum after microphone replacement;

FIG. 8 is the microphone test board.
Detailed Description
For a better understanding of the technical solution, the invention is described in detail below with reference to the accompanying drawings and specific examples.
FIG. 3 is a flow chart of the method of the present invention, which mainly includes environmental noise perception, speech preprocessing, generating speech spectrogram, long-term normalized averaging, attention statistics filter design, and Mel inverse filter design.
It is first necessary to eliminate some of the interference factors of the voice command, such as environmental noise, speaking speed, etc., so that the subsequent spectral characteristics can represent important information.
Step 1: noise perception. Ultrasound is a high-frequency (above 20 kHz) mechanical wave, while smart-device microphones were designed for telephone communication: manufacturers tune their microphones to the PSTN standard, so the devices respond abnormally to high-frequency mechanical waves. Ambient noise is somewhat similar to the abnormal microphone response caused by ultrasound, so it is important to eliminate ambient noise while preserving the abnormal response pattern caused by a silent attack. The invention uses a simple but effective method that lets a device continuously "sense" its environment, referred to as "ambient-noise perception": a queue of the last five sampled audio snippets is maintained and averaged to represent the ambient noise, and a timer thread keeps the queue updated so that it reflects noise conditions in real time. When a voice command is detected, noise perception stops.
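The rolling noise estimate of step 1 can be sketched as a small fixed-length queue of magnitude spectra; the class name and FFT size below are illustrative assumptions, not details from the patent:

```python
from collections import deque

import numpy as np

class NoisePerception:
    """Rolling ambient-noise estimate built from the last five sampled snippets."""

    def __init__(self, n_fft=256, maxlen=5):
        self.n_fft = n_fft
        self.queue = deque(maxlen=maxlen)   # oldest snippet drops out automatically

    def update(self, snippet):
        """Called periodically (e.g. from a timer thread) with a fresh audio snippet."""
        self.queue.append(np.abs(np.fft.rfft(snippet, n=self.n_fft)))

    def noise_spectrum(self):
        """Average of the queued magnitude spectra: the current noise estimate."""
        return np.mean(self.queue, axis=0)
```

A `deque` with `maxlen=5` implements the "last five audios" behaviour directly: appending a sixth spectrum silently evicts the oldest.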
Step 2: noise removal. The voice-command segment is denoised with an improved spectral subtraction method. The original spectral subtraction easily produces "musical noise": it removes most of the noise, but where the subtraction result goes negative the residual peaks become pronounced. The improvement is to set a spectral floor βP_n(w): the actual spectrum P_s(w) and the estimated noise spectrum αP_n(w) are subtracted to obtain D(w), and wherever D(w) falls below βP_n(w) the output is set uniformly to that floor. The floor itself is effectively a wide-band noise, but its residual peaks are far less pronounced, which reduces the "musical noise" effect; its strength is adjusted through the value of β:

D(w) = P_s(w) - αP_n(w)
output(w) = D(w), if D(w) > βP_n(w); otherwise βP_n(w)

where α is the subtraction factor and β is the spectral-floor threshold parameter.
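A minimal sketch of the improved spectral subtraction with the β floor; the function name and the default α and β values are assumptions for illustration:

```python
import numpy as np

def spectral_subtract(P_s, P_n, alpha=2.0, beta=0.01):
    """Improved spectral subtraction with a spectral floor.

    P_s: magnitude spectrum of the noisy speech frame.
    P_n: estimated noise spectrum (e.g. from the noise-perception queue).
    alpha: over-subtraction factor; beta: spectral-floor parameter.
    Wherever P_s - alpha*P_n falls below beta*P_n, the output is clamped
    to beta*P_n, suppressing the "musical noise" residue.
    """
    D = P_s - alpha * P_n
    floor = beta * P_n
    return np.where(D > floor, D, floor)
```

Clamping to a small multiple of the noise spectrum, rather than to zero, is what keeps the residual peaks from standing out against silence.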
Step 3: silence removal. It is important to eliminate the influence of differing unvoiced segments, which stem from speaker habits such as speaking speed and semantic pauses. For example, when saying "OK Google", pause time and speech rate differ between young and old speakers; likewise, the pauses in "OK Google" and "light off" spoken by the same person differ. Eliminating pauses as far as possible lets the subsequent classification model process speech segments of comparable information content. An adaptive-threshold method removes the silence: first, the amplitude of the audio signal is normalized to [-1, 1] with min-max normalization; second, the maximum-energy frame is taken as 0 dB, and frames whose energy falls below a threshold are treated as silence carrying no speech content. In threshold experiments sweeping the range from -45 dB to -15 dB, a setting of -35 dB was found to give the best silence-removal capability.
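The adaptive-threshold silence removal might look as follows; the use of non-overlapping 256-sample frames and the helper's name are assumptions made for a compact sketch:

```python
import numpy as np

def remove_silence(audio, frame_len=256, threshold_db=-35.0):
    """Drop frames whose energy is below threshold_db relative to the
    loudest frame (taken as 0 dB), after min-max normalising to [-1, 1]."""
    lo, hi = audio.min(), audio.max()
    audio = 2.0 * (audio - lo) / (hi - lo) - 1.0
    n = len(audio) // frame_len                      # non-overlapping frames
    frames = audio[: n * frame_len].reshape(n, frame_len)
    energy = (frames ** 2).sum(axis=1)
    db = 10.0 * np.log10(energy / energy.max() + 1e-12)
    return frames[db >= threshold_db].reshape(-1)
```

Because the threshold is relative to the loudest frame of each recording, the cut adapts to recording level rather than relying on an absolute energy value.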
Step 4: voice-length normalization. Subsequent classifiers such as neural networks may require input features of fixed dimension, so the front-end feature-processing algorithm must be designed around a fixed speech length that can still represent a variety of voice commands; the inventors also aimed for a lightweight, fast system. Statistics over the open-source Fluent Speech Commands dataset show that a typical three-word voice command averages about 1.9 seconds; with unvoiced segments removed, the average drops to about 1.5 seconds, which also improves real-time performance. Voice commands shorter than 1.5 seconds are filled by repeated padding.
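A sketch of the repeat-padding length normalization; truncating over-long commands to the target length is an assumption the patent does not spell out:

```python
import numpy as np

def normalize_length(audio, sr=16000, target_s=1.5):
    """Pad a command shorter than 1.5 s by repeating it; longer
    commands are cut to the target length (an assumed behaviour)."""
    target = int(sr * target_s)
    if len(audio) < target:
        reps = int(np.ceil(target / len(audio)))
        audio = np.tile(audio, reps)
    return audio[:target]
```

Repeating the command, rather than zero-padding, keeps the padded region statistically similar to real speech, which matters for the long-term averaging in step 6.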
Step 5: spectrogram generation. Preliminary feature-selection experiments compared the short-time Fourier transform (STFT), Fbank, and MFCC; the STFT outperformed the other audio spectral features. The parameters are: sampling rate 16 kHz; 256 sampling points per frame; 64 sampling points between adjacent frames; and a Hanning window applied to each frame of the audio spectrum. A 1.5-second voice command thus yields an STFT spectrum of shape 128 × 376.
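Under the stated parameters, the 128 × 376 shape can be reproduced with a hand-rolled STFT. Centre padding and keeping 128 of the 129 one-sided FFT bins are assumptions made here to match the reported shape (with hop 64 at 16 kHz, a 1.5 s command gives 1 + 24000//64 = 376 frames):

```python
import numpy as np

def stft_spectrogram(audio, n_fft=256, hop=64):
    """Magnitude STFT with a Hann window per frame."""
    pad = n_fft // 2
    x = np.pad(audio, (pad, pad))                # centre padding (assumption)
    window = np.hanning(n_fft)
    n_frames = 1 + len(audio) // hop
    spec = np.stack(
        [np.abs(np.fft.rfft(x[i * hop:i * hop + n_fft] * window))
         for i in range(n_frames)],
        axis=1,
    )
    return spec[:128, :]                         # keep 128 of 129 bins (assumption)
```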
Step 6: long-term average normalization. Although the preprocessing stage already removes interference factors such as differing semantic pauses and speaking speeds by eliminating unvoiced segments, it does not resolve the mismatch of speech content and speaker. A long-term average normalization suppresses these factors: the logarithmic STFT spectrum is averaged along the time axis to obtain the long-term average spectrum (LTAS):

LTAS(k) = (1/L) Σ_{l=1..L} log|X(k, l)|

where X(k, l) is the spectrum of the signal x(n), k is the frequency index, l is the frame index, and L is the total number of frames.
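The LTAS computation is essentially a one-liner over the magnitude spectrogram; the small ε guard against log(0) is an implementation assumption:

```python
import numpy as np

def long_term_average_spectrum(spec, eps=1e-10):
    """Average the log-magnitude STFT spectrum along the time axis,
    giving one value per frequency bin (the LTAS)."""
    return np.mean(np.log(spec + eps), axis=1)
```

Averaging over time collapses the 128 × 376 spectrogram to a length-128 vector, discarding content- and speaker-dependent timing while keeping the per-band energy profile that distinguishes attack audio.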
Step 7: attention statistical filtering.
Step 7.1: computing Fratio. For the speech data set, this algorithm performs a significance statistic over each frequency subband with respect to speaker and speech content, as follows:

Fratio(k) = [ (1/M) Σ_{i=1..M} (u_i(k) - u(k))^2 ] / [ (1/(M·N)) Σ_{i=1..M} Σ_{j=1..N} (x_{i,j}(k) - u_i(k))^2 ]

where Fratio is a one-dimensional vector of significance weights, one per frequency subband; x_{i,j} is the j-th speech segment of speaker i, with i ∈ [1, ..., M] and j ∈ [1, ..., N]; u_i is the feature average of all speech segments of speaker i; and u is the feature average over the segments of all M speakers.
Step 7.2: the Fratio of step 7.1 is computed statistically over the long-term average spectra of a variety of normal audio. This yields the significance of each subband of normal voice commands, i.e., a weight-coefficient vector whose dimension matches the shape of the long-term average spectrum. In the testing phase it can be viewed as a filter: applied as subband weights to each input long-term average spectrum, it makes some frequency subbands more prominent while masking others. Because its design derives from statistics computed over normal audio, it is called "attention statistical filtering".
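Steps 7.1 and 7.2 can be sketched as a between-speaker/within-speaker variance ratio per subband, followed by elementwise weighting; the array layout and function names are assumptions:

```python
import numpy as np

def fratio_weights(segments):
    """Per-subband F-ratio from an array of shape (M, N, K):
    M speakers, N speech segments each, K frequency subbands
    (e.g. LTAS vectors).  Returns a length-K weight vector:
    between-speaker variance over within-speaker variance."""
    u_i = segments.mean(axis=1)                      # (M, K) per-speaker averages
    u = segments.mean(axis=(0, 1))                   # (K,)  global average
    between = ((u_i - u) ** 2).mean(axis=0)
    within = ((segments - u_i[:, None, :]) ** 2).mean(axis=(0, 1))
    return between / (within + 1e-12)

def apply_attention_filter(ltas, weights):
    """Weight each subband of an input LTAS, highlighting salient
    subbands and masking the rest."""
    return ltas * weights
```

Subbands whose averages separate speakers cleanly (large between-speaker spread, small within-speaker spread) receive large weights; subbands carrying no speaker information are driven toward zero.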
Step 8: Mel inverse-filter enhancement. The Mel filter bank, following human hearing, places denser and more heavily weighted triangular filters over the ear-sensitive low-to-high-frequency region. The inventors instead start from an analysis of the silent attack itself.

Step 8.1: silent-attack mechanism analysis. Through the nonlinearity of the microphone circuit, the dolphin-sound attack signal that finally enters the system is converted, after low-pass filtering, into:

s_attack(t) = A_2 v(t) + (A_2/2) v^2(t) + A_2/2

while for a normal speech control signal:

s_normal(t) = A_1 v(t) + A_2 v^2(t)
as can be seen from fig. 4, the attack signal has a dc component close to 0Hz compared with the normal signal, i.e. has stronger energy in the low frequency band. Furthermore, the silence attack is highlighted in its high frequency band compared to the normal speech spectrum, since it has been processed by long-term averaging, attention statistics filtering.
Step 8.2: unlike the Mel filter, which enhances the bands the human ear is sensitive to, the bands in which silent attacks differ from normal audio are the ear-insensitive ones; enhancing these bands is therefore called a Mel inverse filter. Concretely: the high band (above 5 kHz) of the input spectrogram is expanded by linear interpolation, the mid band (100 Hz to 5 kHz) is compressed, and the low band (below 100 Hz) is expanded by linear interpolation. In addition, the interpolated and expanded low-frequency spectrum is repeated, which raises the overall difference proportion and improves the classification effect of the model.
Step 9: as Fig. 5 shows, after the above processing steps the difference between the spectra of the audible and the inaudible voice command "OK Google" is significantly enlarged, which is precisely the intended effect of the filtering described above.
Furthermore, since silent voice attacks exploit the nonlinearity vulnerability of the microphone noted in the background art, the inventors performed reception experiments on multiple smart devices. As an example, the microphone of a Samsung S20 handset was transplanted into an OPPO Find X2, as shown in Fig. 6. After a handset's microphone is replaced, the recorded audio takes on the characteristics of the replacement microphone, as shown in Fig. 7. The features exhibited by a silent attack are therefore tied most strongly to the microphone, while the handset's operating system, software version, and so on are almost irrelevant.
A microphone test board was built, covering ten mainstream microphones on the market of three circuit types, as shown in fig. 8. In total, 4800 normal samples and 4800 attack samples were collected, 9600 in all. The audio was divided into a training set of 2400 recordings and a test set of 7200 recordings.
To verify the effect of the scheme, the inventor used an SVM classifier on the test data set. For each two-dimensional feature-enhanced spectrogram, averaging along the time axis first reduces it to a one-dimensional column vector. After this dimension reduction, the SVM learns from the training data and constructs a decision boundary, and the resulting model is then used to evaluate the test data. The final results are shown in Table 1.
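The time-axis averaging and classification stage described above can be sketched as follows; a nearest-centroid classifier stands in for the SVM so the sketch has no dependencies, and all spectrogram data is synthetic:

```python
import random

# Sketch of the classification stage. The text uses an SVM; a
# nearest-centroid classifier stands in here so the sketch is
# dependency-free. All spectrogram data below is synthetic.

def time_average(spec):
    """Reduce a 2-D spectrogram (rows = frequency bins, cols = frames)
    to a 1-D vector by averaging along the time axis."""
    return [sum(row) / len(row) for row in spec]

def centroid(vectors):
    return [sum(v[i] for v in vectors) / len(vectors)
            for i in range(len(vectors[0]))]

def classify(vec, c_normal, c_attack):
    d = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return "attack" if d(vec, c_attack) < d(vec, c_normal) else "normal"

rng = random.Random(0)
# Stand-in data: "attack" spectrograms carry extra band energy.
make = lambda bias: [[rng.gauss(bias, 0.1) for _ in range(16)] for _ in range(8)]
normal_feats = [time_average(make(0.0)) for _ in range(20)]
attack_feats = [time_average(make(1.0)) for _ in range(20)]

c_n, c_a = centroid(normal_feats), centroid(attack_feats)
print(classify(time_average(make(1.0)), c_n, c_a))  # prints: attack
```

In practice the decision boundary would be learned by the SVM over the per-sub-band averages rather than read off from class centroids, but the feature pipeline (2-D enhanced spectrogram, time-axis mean, 1-D vector, classifier) is the same.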
TABLE 1 test performance index results for ten microphones
Claims (4)
1. A silent attack classification promotion method based on attention enhancement filtering is characterized by comprising the following steps:
Step 1: noise perception: a noise queue of the last five sampled audio clips is constructed and averaged to represent the environmental noise, and a timer thread updates the queue so that it reflects the noise conditions in real time; noise perception stops when a voice command is detected;
Step 2: removing noise: carrying out noise removal on the voice instruction section by adopting an improved spectral subtraction method;
Step 3: removing silence segments; a silence segment is eliminated by an adaptive threshold method: first, the amplitude of the audio signal is normalized to [-1, 1] using min-max normalization; secondly, the maximum-energy frame is taken as 0 dB, and a frame with energy lower than a specific threshold is regarded as a silence segment, wherein the specific threshold ranges from -45 dB to -15 dB;
Step 4: normalizing the voice length; the voice length is fixed at 1.5 seconds, and a voice command shorter than 1.5 seconds is padded by repetition;
Step 5: generating a spectrogram; performing feature selection on the voice command by short-time Fourier transform to obtain audio spectrum features, and applying a Hanning window to each frame of the audio spectrum;
Step 6: long-term average normalization; averaging along the time axis of the logarithmic STFT spectrum to obtain the long-term average spectrum (LTAS):

LTAS(k) = (1/L) · Σ_{l=1}^{L} log|X(k, l)|

where X(k, l) is the frequency spectrum of the signal x(n), k is the frequency index, l is the frame index, and L is the total number of frames;
Step 7: attention statistical filtering;
Step 7.1: performing significance statistics on each frequency sub-band over the voice data set, across speakers and voice content; specifically, as shown in the following formula:

Fratio(k) = [ (1/M) Σ_{i=1}^{M} (u_i(k) − u(k))² ] / [ (1/(MN)) Σ_{i=1}^{M} Σ_{j=1}^{N} (x_{i,j}(k) − u_i(k))² ]

where Fratio is a one-dimensional vector composed of the significance weight of each frequency sub-band, x_{i,j} is the j-th voice segment of speaker i, with i ∈ [1,...,M] and j ∈ [1,...,N], u_i is the feature average of all voice segments of speaker i, and u is the feature average of all voice segments of the M speakers;
Step 7.2: obtaining a weight coefficient vector from the significance of each sub-band of the normal voice commands in step 7.1, the vector dimension matching the shape of the long-term average spectrum; the weight coefficient vector is applied to weight the different sub-bands of each input long-term average spectrum, making some frequency sub-bands more prominent while masking others;
Step 8: mel inverse filter enhancement;
according to the analysis of the silent attack mechanism, the attack signal has stronger energy in the low frequency band below 100 Hz and the high frequency band above 5 kHz; the bands in which a silent attack differs from normal audio are the bands to which the human ear is insensitive, and these insensitive bands are enhanced.
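Steps 6 and 7 above (long-term averaging and attention statistical filtering) can be sketched as follows; the F-ratio shown, between-speaker variance over within-speaker variance per sub-band, is the standard formulation assumed here, and the two-speaker data is synthetic:

```python
# Sketch of steps 6-7: long-term average spectrum (LTAS) followed by
# F-ratio significance weighting per frequency sub-band. The F-ratio
# form (between-speaker variance / within-speaker variance) is an
# assumption based on the standard formulation; data is synthetic.

def ltas(frames):
    """Average a list of per-frame log-spectra along the time axis:
    LTAS(k) = (1/L) * sum_l log|X(k, l)|."""
    L, K = len(frames), len(frames[0])
    return [sum(f[k] for f in frames) / L for k in range(K)]

def f_ratio(segments_by_speaker):
    """segments_by_speaker[i][j] = LTAS vector of speaker i's j-th segment."""
    M = len(segments_by_speaker)
    K = len(segments_by_speaker[0][0])
    u_i = [ltas(segs) for segs in segments_by_speaker]   # per-speaker means
    u = [sum(v[k] for v in u_i) / M for k in range(K)]   # global mean
    n_total = sum(len(segs) for segs in segments_by_speaker)
    weights = []
    for k in range(K):
        between = sum((u_i[i][k] - u[k]) ** 2 for i in range(M)) / M
        within = sum((seg[k] - u_i[i][k]) ** 2
                     for i in range(M)
                     for seg in segments_by_speaker[i]) / n_total
        weights.append(between / (within + 1e-12))
    return weights

# Two synthetic speakers: band 0 separates them, band 1 does not.
speakers = [[[0.0, 1.0], [0.1, 1.0]], [[1.0, 1.0], [0.9, 1.0]]]
w = f_ratio(speakers)
weighted_ltas = [x * wk for x, wk in zip(ltas(speakers[0]), w)]  # step 7.2
print(w[0] > w[1])  # the discriminative sub-band gets the larger weight
```

Applying `w` element-wise to each input LTAS, as in step 7.2, makes the sub-bands that best separate classes more prominent while masking the rest.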
2. The silence attack classification promotion method based on attention enhancement filtering according to claim 1, wherein the step 2 is specifically: a spectral lower limit βP_n(w) of the voice is set; the estimated noise spectrum αP_n(w) is subtracted from the actual spectrum P_s(w) to obtain the spectrum D(w); if the amplitude of D(w) is less than βP_n(w), it is uniformly set to this fixed value, thereby reducing the "musical noise" easily generated by the original spectral subtraction method; the noise intensity over the whole band is controlled by adjusting the value of β:

D(w) = P_s(w) − αP_n(w)

wherein α is the subtraction factor and β is the spectral lower-limit threshold parameter.
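A minimal sketch of this modified spectral subtraction, assuming power spectra stored as plain lists; the values of α and β are illustrative, not from the text:

```python
# Sketch of the modified spectral subtraction of claim 2. The alpha and
# beta values are illustrative assumptions; P_n would in practice be the
# average of the noise queue from step 1.

def spectral_subtract(P_s, P_n, alpha=2.0, beta=0.05):
    """D(w) = P_s(w) - alpha * P_n(w), floored at beta * P_n(w)
    to suppress the 'musical noise' of plain spectral subtraction."""
    out = []
    for ps, pn in zip(P_s, P_n):
        d = ps - alpha * pn
        out.append(d if d >= beta * pn else beta * pn)
    return out

P_s = [10.0, 4.0, 1.0]   # observed speech-plus-noise power spectrum
P_n = [1.0, 1.0, 1.0]    # estimated noise power spectrum
print(spectral_subtract(P_s, P_n))  # prints: [8.0, 2.0, 0.05]
```

Raising β raises the spectral floor and hence the residual noise level across the whole band, which is the tuning knob the claim describes.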
3. The method as claimed in claim 1, wherein in step 3, the frame with energy lower than-35 dB is considered as the silence segment.
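The adaptive-threshold silence removal of step 3, with the −35 dB threshold of this claim, can be sketched as follows; the frame length and the synthetic signal are illustrative assumptions:

```python
import math

# Sketch of silence removal (step 3 / claim 3): min-max normalise to
# [-1, 1], take the loudest frame as 0 dB, and drop frames more than
# 35 dB below it. Frame length and the test signal are assumptions.

def remove_silence(samples, frame_len=4, threshold_db=-35.0):
    lo, hi = min(samples), max(samples)
    norm = [2 * (s - lo) / (hi - lo) - 1 for s in samples]  # min-max to [-1, 1]
    frames = [norm[i:i + frame_len] for i in range(0, len(norm), frame_len)]
    energies = [sum(x * x for x in f) for f in frames]
    e_max = max(energies)                                   # this frame is 0 dB
    kept = [f for f, e in zip(frames, energies)
            if e > 0 and 10 * math.log10(e / e_max) > threshold_db]
    return [x for f in kept for x in f]

# Silence, a 4-sample burst, silence: only the burst frame survives.
sig = [0.0] * 8 + [0.9, -0.9, 0.8, -0.8] + [0.0] * 8
print(len(remove_silence(sig)))  # prints: 4
```

The surviving voice segment would then be repeat-padded to the fixed 1.5-second length of step 4 before the spectrogram is generated.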
4. The silence attack classification promotion method based on attention enhancement filtering according to claim 1, wherein the step 8 specifically comprises: performing linear interpolation expansion on the high frequency band above 5 kHz of the input spectrogram, compressing the middle frequency band between 100 Hz and 5 kHz, and performing linear interpolation expansion on the low frequency band below 100 Hz; and repeating the low-frequency spectrum after interpolation and expansion.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210194280.9A CN114664310B (en) | 2022-03-01 | 2022-03-01 | Silent attack classification promotion method based on attention enhancement filtering |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114664310A (en) | 2022-06-24 |
CN114664310B (en) | 2023-03-31 |
Family
ID=82027777
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210194280.9A Active CN114664310B (en) | 2022-03-01 | 2022-03-01 | Silent attack classification promotion method based on attention enhancement filtering |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114664310B (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106531172A (en) * | 2016-11-23 | 2017-03-22 | 湖北大学 | Speaker voice playback identification method and system based on environmental noise change detection |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7948938B2 (en) * | 2004-04-30 | 2011-05-24 | Research In Motion Limited | Wireless communication device with duress password protection and related method |
KR101460059B1 (en) * | 2007-12-17 | 2014-11-12 | 삼성전자주식회사 | Method and apparatus for detecting noise |
US9412381B2 (en) * | 2010-03-30 | 2016-08-09 | Ack3 Bionetics Private Ltd. | Integrated voice biometrics cloud security gateway |
WO2019173304A1 (en) * | 2018-03-05 | 2019-09-12 | The Trustees Of Indiana University | Method and system for enhancing security in a voice-controlled system |
US11457313B2 (en) * | 2018-09-07 | 2022-09-27 | Society of Cable Telecommunications Engineers, Inc. | Acoustic and visual enhancement methods for training and learning |
CN110085249B (en) * | 2019-05-09 | 2021-03-16 | 南京工程学院 | Single-channel speech enhancement method of recurrent neural network based on attention gating |
CN112116742B (en) * | 2020-08-07 | 2021-07-13 | 西安交通大学 | Identity authentication method, storage medium and equipment fusing multi-source sound production characteristics of user |
CN113192504B (en) * | 2021-04-29 | 2022-11-11 | 浙江大学 | Silent voice attack detection method based on domain adaptation |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10622009B1 (en) | Methods for detecting double-talk | |
US11488605B2 (en) | Method and apparatus for detecting spoofing conditions | |
US10504539B2 (en) | Voice activity detection systems and methods | |
US9305567B2 (en) | Systems and methods for audio signal processing | |
CN105513605A (en) | Voice enhancement system and method for cellphone microphone | |
CN112004177B (en) | Howling detection method, microphone volume adjustment method and storage medium | |
CN110120225A (en) | A kind of audio defeat system and method for the structure based on GRU network | |
Verteletskaya et al. | Noise reduction based on modified spectral subtraction method | |
CN113192504B (en) | Silent voice attack detection method based on domain adaptation | |
CN112466276A (en) | Speech synthesis system training method and device and readable storage medium | |
Alam et al. | Robust feature extraction for speech recognition by enhancing auditory spectrum | |
Mu et al. | MFCC as features for speaker classification using machine learning | |
Zhang et al. | A soft decision based noise cross power spectral density estimation for two-microphone speech enhancement systems | |
CN114664310B (en) | Silent attack classification promotion method based on attention enhancement filtering | |
CN112233657A (en) | Speech enhancement method based on low-frequency syllable recognition | |
Yu et al. | Text-Dependent Speech Enhancement for Small-Footprint Robust Keyword Detection. | |
WO2022068440A1 (en) | Howling suppression method and apparatus, computer device, and storage medium | |
Sanam et al. | A combination of semisoft and μ-law thresholding functions for enhancing noisy speech in wavelet packet domain | |
EP2063420A1 (en) | Method and assembly to enhance the intelligibility of speech | |
Darabian et al. | Improving the performance of MFCC for Persian robust speech recognition | |
Mehta et al. | Robust front-end and back-end processing for feature extraction for Hindi speech recognition | |
Yan et al. | Anti-noise power normalized cepstral coefficients for robust environmental sounds recognition in real noisy conditions | |
Butarbutar et al. | Adaptive Wiener Filtering Method for Noise Reduction in Speech Recognition System | |
Islam et al. | Modeling of teager energy operated perceptual wavelet packet coefficients with an Erlang-2 PDF for real time enhancement of noisy speech | |
CN112951259B (en) | Audio noise reduction method and device, electronic equipment and computer readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||