CN114664310B - Silent attack classification promotion method based on attention enhancement filtering - Google Patents
- Publication number
- CN114664310B (application CN202210194280.9A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G10L17/02 — Speaker identification or verification: preprocessing operations; pattern representation or modelling; feature selection or extraction
- G10L17/04 — Speaker identification or verification: training, enrolment or model building
- G10L17/06 — Speaker identification or verification: decision making techniques; pattern matching strategies
- G10L17/14 — Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
- G10L17/18 — Artificial neural networks; connectionist approaches
- G10L19/02 — Speech or audio coding/decoding using spectral analysis, e.g. transform vocoders or subband vocoders
- G10L19/26 — Pre-filtering or post-filtering
- G10L21/0232 — Noise filtering: processing in the frequency domain
- G10L25/45 — Speech or voice analysis characterised by the type of analysis window
- H04L63/1441 — Network security: countermeasures against malicious traffic
- Y02D30/70 — Reducing energy consumption in wireless communication networks
Abstract
The invention discloses a silent-attack classification improvement method based on attention-enhanced filtering, providing a speech-feature processing algorithm that quickly and effectively enlarges the difference between silent-attack audio and normal audible audio. The method amplifies this ear-imperceptible class of voice attack so that it can be detected, and can be deployed immediately on existing devices of various types. Because a single unified model is trained on device-independent normal audible audio, attack data can be detected without per-device training, enabling targeted defense of IoT smart-voice systems. By reducing the classifier's demand for labeled samples, the method makes unsupervised classification feasible and addresses a key weakness of existing attack-detection approaches: the audio features they rely on may not exist on every device, so a dedicated feature set, dataset, and model must be built for each device to be protected, at high cost.
Description
Technical Field
The invention belongs to the technical field of artificial-intelligence voice-assistant security, and in particular relates to a silent-attack classification improvement method based on attention-enhanced filtering.
Background
In the Internet-of-things era, many latent security risks have emerged. For smart voice-control systems, the most destructive and covert attack is the silent attack, also known as the Dolphin Attack (DA). It is an effective attack on speech-recognition systems whose essence is to exploit a nonlinearity vulnerability in microphone hardware. The popularity of voice assistants has exacerbated the threat of silent voice attacks, which can covertly control smart devices without user authorization. For example, an attacker may send a voice command that is imperceptible to the human ear to a smart speaker and have it unlock the home without the user noticing. The working principle of a typical silent voice attack is shown in Fig. 1.
First, the malicious voice command is amplitude-modulated onto an ultrasonic carrier (e.g., 25 kHz). After the microphone receives the modulated ultrasound, the nonlinear effect of the microphone demodulates the high-frequency input signal, and the recovered malicious command is passed by the microphone to the downstream speech-recognition algorithm. The nonlinear transfer function of the microphone can be expressed as:

s_out(t) = A_1 s_in(t) + A_2 s_in^2(t)

where s_in(t) and s_out(t) denote the input and output of the microphone, respectively. Because an attacker exploits this nonlinearity, the microphone inevitably recovers an audible voice command from the amplitude-modulated ultrasound. Finally, a low-pass filter removes the high-frequency ultrasonic carrier, leaving only the baseband command in the audio, which the voice assistant can recognize and execute. Since the modulated ultrasound is above 20 kHz, silent voice attacks are imperceptible to human users.
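This demodulation can be checked numerically. The sketch below is a non-authoritative illustration: the 400 Hz tone, 25 kHz carrier, coefficients A1 and A2, and the moving-average low-pass filter are all assumptions, not parameters from the patent. It modulates a baseband tone onto an ultrasonic carrier, applies the square-law nonlinearity, and low-pass filters the result; the surviving baseband correlates strongly with the original tone.

```python
import numpy as np

fs = 192_000                              # sample rate high enough for a 25 kHz carrier
t = np.arange(int(0.05 * fs)) / fs
v = np.sin(2 * np.pi * 400 * t)           # stand-in "voice command": a 400 Hz tone
carrier = np.cos(2 * np.pi * 25_000 * t)
s_in = (1.0 + 0.5 * v) * carrier          # amplitude-modulated ultrasound

# Square-law microphone nonlinearity: s_out = A1*s_in + A2*s_in^2
A1, A2 = 1.0, 0.5
s_mic = A1 * s_in + A2 * s_in ** 2

# Crude moving-average low-pass (~2 kHz first null) standing in for the real filter
kernel = np.ones(96) / 96
s_out = np.convolve(s_mic, kernel, mode="same")
s_out -= s_out.mean()                     # discard the demodulated DC offset

# The quadratic term demodulates the command: s_out should track v
corr = np.corrcoef(s_out, v)[0, 1]
```

The linear term A1·s_in stays at the carrier frequency and is removed by the filter; only the square-law term carries the command down to baseband.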
It is worth noting that, in the inventors' large-scale experiments, when the silent attack is applied to different devices, the clarity and timbre of the attack audio differ markedly from those of normal audio, while the quality of normal sound recorded by the various devices is almost identical. The underlying reason is the abnormal response of different microphones to high-frequency signals. As can be seen from Fig. 2, microphone manufacturers mainly tune the microphone frequency response to the PSTN standard in the 300-3400 Hz band, whereas the responses excited by a silent attack in the high-frequency band differ from device to device. In addition, ultrasound, being high-frequency mechanical vibration, causes abnormal responses in the microphone's internal resonance cavity and related structures.
Existing defenses fall largely into two categories: hardware-based and software-based. Hardware-based protection cannot be retrofitted to stock devices already on the market, and modification costs are high, so software-based protection has a clear advantage; the usual approach combines feature engineering with a supervised machine-learning classification model. However, because the spectral characteristics of attack audio and normal audio are close, such classification models require large amounts of labeled data to train, and silent-attack samples are hard to obtain owing to their complex generation process. Unsupervised deep-learning methods, in turn, depend on the two classes being clearly separable at the feature level. A feature-enhancement algorithm that can effectively amplify the difference between normal and attack audio at the feature-processing stage is therefore urgently needed.
Disclosure of Invention
To address the defects of the prior art, the invention provides a silent-attack classification improvement method based on attention-enhanced filtering, which amplifies the spectral-feature difference between normal audio and attack audio, thereby effectively reducing the classifier's demand for labeled samples and making unsupervised classification feasible.
The invention is realized by adopting the following technical scheme:
a silent attack classification promotion method based on attention enhancement filtering comprises the following steps:
step 1: noise perception: noise queues of the last five sampled audios are constructed, averaged to represent environmental noise, and updated by using a timer thread, so that the noise conditions can be reflected by the queues in real time; stopping noise perception when a voice command is detected;
step 2: noise removal: carrying out noise removal on the voice instruction section by adopting improved spectral subtraction;
and 3, step 3: removing a mute section; and (3) eliminating a mute section by adopting an adaptive threshold method: first, the amplitude of the audio signal is normalized to [ -1,1] using min-max normalization; secondly, recording the maximum energy frame as 0dB, and regarding the frame with energy lower than a specific threshold value as a mute section, wherein the range of the specific threshold value is-45 dB to-15 dB;
and 4, step 4: normalizing the voice length; designing the voice length to be 1.5 seconds, and filling the voice command with the duration less than 1.5 seconds by adopting a repeated filling method;
and 5: generating a spectrogram; carrying out feature selection on the voice command by adopting short-time Fourier transform to obtain audio spectrum features; and applying a hanning window to each frame of the audio spectrum;
step 6: long-term average normalization; averaging along the time axis of the logarithmic STFT spectrum, thereby obtaining a long-term average spectrum (LTAS):
where X (k, L) is the frequency spectrum of the signal X (n), k is the frequency index, L is the frame index, and L is the total number of frames;
and 7: attention statistical filtering;
step 7.1: carrying out significance statistics on each frequency sub-band according to speakers and voice contents aiming at the voice data set; specifically, the formula is shown as follows:
wherein Fratio is a one-dimensional vector composed of significance weights for each frequency sub-band,is the jth speech segment of speaker i, where i ∈ [1],j∈[1,...,N];
u i The feature average of all the voice segments of the speaker i is obtained, and u is the feature average of all the voice segments of the M speakers;
step 7.2: obtaining a weight coefficient vector of each voice command according to the significance of each sub-band of the normal voice command in the step 7.1, wherein the vector dimension is related to the shape of the long-term average spectrum; applying the weight coefficient vector to the weighting of different sub-bands of each input long-term average spectrum, making some frequency sub-bands more prominent while masking other sub-bands;
and 8: enhancement of a Mel inverse filter;
according to the analysis of a silent attack mechanism, the attack signal has stronger energy in a low frequency band lower than 100Hz and a high frequency band higher than 5 kHz; the frequency bands distinguished by the existence of the characteristics of the silent attack and the normal audio frequency are ear-insensitive frequency bands, and the ear-insensitive frequency bands are enhanced.
In the above technical solution, step 2 specifically comprises: setting a spectral floor βP_n(w); the actual spectrum P_s(w) and the estimated noise spectrum αP_n(w) are subtracted to obtain the spectrum D(w), and wherever the amplitude of D(w) falls below βP_n(w) the output is set uniformly to that floor value, reducing the "musical noise" easily produced by the original spectral subtraction method; the strength of this wide-band floor is adjusted through the value of β:

D(w) = P_s(w) - αP_n(w)
output(w) = D(w), if D(w) > βP_n(w); otherwise βP_n(w)

where α is the subtraction factor and β is the spectral-floor threshold parameter.
Further, in step 3, frames with energy below -35 dB are regarded as silence.
Further, step 8 specifically comprises: expanding the high band above 5 kHz by linear interpolation, compressing the mid band between 100 Hz and 5 kHz, and expanding the low band below 100 Hz of the input spectrogram by linear interpolation; the interpolated and expanded low-frequency spectrum is then repeated.
The invention has the beneficial effects that:
the invention provides a hidden audio classification promotion method based on attention enhancement filtering, which realizes the obvious amplification of the input voice characteristic difference on the voice characteristic level and greatly reduces the dependence of audio data on a label. For software-type dolphin sound protection, the method is usually composed of two parts, namely front-end feature processing and a classification algorithm. The feature vectors with high discrimination are obtained through the audio preprocessing, so that the limitation on the rear-end classification algorithm is greatly reduced, the algorithm can be adapted to simple and low-calculated-quantity models such as support vector machines, random forests and the like, the whole algorithm can be deployed on the Internet of things terminal, and the high calculation power of a cloud server and the cost of communication overhead are saved quickly and efficiently.
Drawings
FIG. 1 is a schematic diagram of a silent instruction attack exploiting a microphone nonlinear vulnerability attack;
FIG. 2 is a frequency response curve for six different models of microphones;
FIG. 3 is a flow chart of a silent attack classification promotion method based on attention-enhancing filtering;
FIG. 4 is a comparison graph of normal/attack speech spectra before and after long term normalization and attention statistics filtering;
FIG. 5 is a comparison graph of normal/attack speech spectra before and after full feature enhancement (long term normalization, attention statistics filtering, mel inverse filter);
FIG. 6 is the microphone replacement experiment (the microphone of a Samsung S20 attached to an OPPO Find X2);

FIG. 7 is a comparison graph of the attack audio spectrum after microphone replacement;

FIG. 8 is the microphone test board.
Detailed Description
For a better understanding of the technical solution, the invention is described in detail below with reference to the accompanying drawings and specific examples.
FIG. 3 is a flow chart of the method of the present invention, which mainly includes environmental noise perception, speech preprocessing, generating speech spectrogram, long-term normalized averaging, attention statistics filter design, and Mel inverse filter design.
It is first necessary to eliminate some of the interference factors of the voice command, such as environmental noise, speaking speed, etc., so that the subsequent spectral characteristics can represent important information.
Step 1: noise perception. Ultrasound is a high-frequency (above 20 kHz) mechanical wave, while smart-device microphones were designed for telephone communication: manufacturers tune their microphones to the PSTN standard, so the devices respond abnormally to high-frequency mechanical waves. Ambient noise is somewhat similar to the abnormal microphone response caused by ultrasound, so it is important to eliminate ambient noise while preserving the abnormal response pattern caused by a silent attack. The invention uses a simple but effective method that lets a device continuously "sense" its environment, referred to as "ambient-noise perception": a queue of the last five sampled audio snippets is maintained and averaged to represent the ambient noise, and a timer thread keeps the queue updated so that it reflects noise conditions in real time. When a voice command is detected, noise perception stops.
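The rolling noise estimate of step 1 can be sketched as a small fixed-length queue of magnitude spectra; the class name and FFT size below are illustrative assumptions, not details from the patent:

```python
from collections import deque

import numpy as np

class NoisePerception:
    """Rolling ambient-noise estimate built from the last five sampled snippets."""

    def __init__(self, n_fft=256, maxlen=5):
        self.n_fft = n_fft
        self.queue = deque(maxlen=maxlen)   # oldest snippet drops out automatically

    def update(self, snippet):
        """Called periodically (e.g. from a timer thread) with a fresh audio snippet."""
        self.queue.append(np.abs(np.fft.rfft(snippet, n=self.n_fft)))

    def noise_spectrum(self):
        """Average of the queued magnitude spectra: the current noise estimate."""
        return np.mean(self.queue, axis=0)
```

A `deque` with `maxlen=5` implements the "last five audios" behaviour directly: appending a sixth spectrum silently evicts the oldest.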
Step 2: noise removal. The voice-command segment is denoised with an improved spectral subtraction method. The original spectral subtraction easily produces "musical noise": it removes most of the noise, but where the subtraction result goes negative the residual peaks become pronounced. The improvement is to set a spectral floor βP_n(w): the actual spectrum P_s(w) and the estimated noise spectrum αP_n(w) are subtracted to obtain D(w), and wherever D(w) falls below βP_n(w) the output is set uniformly to that floor. The floor itself is effectively a wide-band noise, but its residual peaks are far less pronounced, which reduces the "musical noise" effect; its strength is adjusted through the value of β:

D(w) = P_s(w) - αP_n(w)
output(w) = D(w), if D(w) > βP_n(w); otherwise βP_n(w)

where α is the subtraction factor and β is the spectral-floor threshold parameter.
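A minimal sketch of the improved spectral subtraction with the β floor; the function name and the default α and β values are assumptions for illustration:

```python
import numpy as np

def spectral_subtract(P_s, P_n, alpha=2.0, beta=0.01):
    """Improved spectral subtraction with a spectral floor.

    P_s: magnitude spectrum of the noisy speech frame.
    P_n: estimated noise spectrum (e.g. from the noise-perception queue).
    alpha: over-subtraction factor; beta: spectral-floor parameter.
    Wherever P_s - alpha*P_n falls below beta*P_n, the output is clamped
    to beta*P_n, suppressing the "musical noise" residue.
    """
    D = P_s - alpha * P_n
    floor = beta * P_n
    return np.where(D > floor, D, floor)
```

Clamping to a small multiple of the noise spectrum, rather than to zero, is what keeps the residual peaks from standing out against silence.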
Step 3: silence removal. It is important to eliminate the influence of differing unvoiced segments, which stem from speaker habits such as speaking speed and semantic pauses. For example, when saying "OK Google", pause time and speech rate differ between young and old speakers; likewise, the pauses in "OK Google" and "light off" spoken by the same person differ. Eliminating pauses as far as possible lets the subsequent classification model process speech segments of comparable information content. An adaptive-threshold method removes the silence: first, the amplitude of the audio signal is normalized to [-1, 1] with min-max normalization; second, the maximum-energy frame is taken as 0 dB, and frames whose energy falls below a threshold are treated as silence carrying no speech content. In threshold experiments sweeping the range from -45 dB to -15 dB, a setting of -35 dB was found to give the best silence-removal capability.
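The adaptive-threshold silence removal might look as follows; the use of non-overlapping 256-sample frames and the helper's name are assumptions made for a compact sketch:

```python
import numpy as np

def remove_silence(audio, frame_len=256, threshold_db=-35.0):
    """Drop frames whose energy is below threshold_db relative to the
    loudest frame (taken as 0 dB), after min-max normalising to [-1, 1]."""
    lo, hi = audio.min(), audio.max()
    audio = 2.0 * (audio - lo) / (hi - lo) - 1.0
    n = len(audio) // frame_len                      # non-overlapping frames
    frames = audio[: n * frame_len].reshape(n, frame_len)
    energy = (frames ** 2).sum(axis=1)
    db = 10.0 * np.log10(energy / energy.max() + 1e-12)
    return frames[db >= threshold_db].reshape(-1)
```

Because the threshold is relative to the loudest frame of each recording, the cut adapts to recording level rather than relying on an absolute energy value.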
Step 4: voice-length normalization. Subsequent classifiers such as neural networks may require input features of fixed dimension, so the front-end feature-processing algorithm must be designed around a fixed speech length that can still represent a variety of voice commands; the inventors also aimed for a lightweight, fast system. Statistics over the open-source Fluent Speech Commands dataset show that a typical three-word voice command averages about 1.9 seconds; with unvoiced segments removed, the average drops to about 1.5 seconds, which also improves real-time performance. Voice commands shorter than 1.5 seconds are filled by repeated padding.
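A sketch of the repeat-padding length normalization; truncating over-long commands to the target length is an assumption the patent does not spell out:

```python
import numpy as np

def normalize_length(audio, sr=16000, target_s=1.5):
    """Pad a command shorter than 1.5 s by repeating it; longer
    commands are cut to the target length (an assumed behaviour)."""
    target = int(sr * target_s)
    if len(audio) < target:
        reps = int(np.ceil(target / len(audio)))
        audio = np.tile(audio, reps)
    return audio[:target]
```

Repeating the command, rather than zero-padding, keeps the padded region statistically similar to real speech, which matters for the long-term averaging in step 6.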
Step 5: spectrogram generation. Preliminary feature-selection experiments compared the short-time Fourier transform (STFT), Fbank, and MFCC; the STFT outperformed the other audio spectral features. The parameters are: sampling rate 16 kHz; 256 sampling points per frame; 64 sampling points between adjacent frames; and a Hanning window applied to each frame of the audio spectrum. A 1.5-second voice command thus yields an STFT spectrum of shape 128 × 376.
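Under the stated parameters, the 128 × 376 shape can be reproduced with a hand-rolled STFT. Centre padding and keeping 128 of the 129 one-sided FFT bins are assumptions made here to match the reported shape (with hop 64 at 16 kHz, a 1.5 s command gives 1 + 24000//64 = 376 frames):

```python
import numpy as np

def stft_spectrogram(audio, n_fft=256, hop=64):
    """Magnitude STFT with a Hann window per frame."""
    pad = n_fft // 2
    x = np.pad(audio, (pad, pad))                # centre padding (assumption)
    window = np.hanning(n_fft)
    n_frames = 1 + len(audio) // hop
    spec = np.stack(
        [np.abs(np.fft.rfft(x[i * hop:i * hop + n_fft] * window))
         for i in range(n_frames)],
        axis=1,
    )
    return spec[:128, :]                         # keep 128 of 129 bins (assumption)
```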
Step 6: long-term average normalization. Although the preprocessing stage already removes interference factors such as differing semantic pauses and speaking speeds by eliminating unvoiced segments, it does not resolve the mismatch of speech content and speaker. A long-term average normalization suppresses these factors: the logarithmic STFT spectrum is averaged along the time axis to obtain the long-term average spectrum (LTAS):

LTAS(k) = (1/L) Σ_{l=1..L} log|X(k, l)|

where X(k, l) is the spectrum of the signal x(n), k is the frequency index, l is the frame index, and L is the total number of frames.
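The LTAS computation is essentially a one-liner over the magnitude spectrogram; the small ε guard against log(0) is an implementation assumption:

```python
import numpy as np

def long_term_average_spectrum(spec, eps=1e-10):
    """Average the log-magnitude STFT spectrum along the time axis,
    giving one value per frequency bin (the LTAS)."""
    return np.mean(np.log(spec + eps), axis=1)
```

Averaging over time collapses the 128 × 376 spectrogram to a length-128 vector, discarding content- and speaker-dependent timing while keeping the per-band energy profile that distinguishes attack audio.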
Step 7: attention statistical filtering.
Step 7.1: computing Fratio. For the speech data set, this algorithm performs a significance statistic over each frequency subband with respect to speaker and speech content, as follows:

Fratio(k) = [ (1/M) Σ_{i=1..M} (u_i(k) - u(k))^2 ] / [ (1/(M·N)) Σ_{i=1..M} Σ_{j=1..N} (x_{i,j}(k) - u_i(k))^2 ]

where Fratio is a one-dimensional vector of significance weights, one per frequency subband; x_{i,j} is the j-th speech segment of speaker i, with i ∈ [1, ..., M] and j ∈ [1, ..., N]; u_i is the feature average of all speech segments of speaker i; and u is the feature average over the segments of all M speakers.
Step 7.2: the Fratio of step 7.1 is computed statistically over the long-term average spectra of a variety of normal audio. This yields the significance of each subband of normal voice commands, i.e., a weight-coefficient vector whose dimension matches the shape of the long-term average spectrum. In the testing phase it can be viewed as a filter: applied as subband weights to each input long-term average spectrum, it makes some frequency subbands more prominent while masking others. Because its design derives from statistics computed over normal audio, it is called "attention statistical filtering".
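Steps 7.1 and 7.2 can be sketched as a between-speaker/within-speaker variance ratio per subband, followed by elementwise weighting; the array layout and function names are assumptions:

```python
import numpy as np

def fratio_weights(segments):
    """Per-subband F-ratio from an array of shape (M, N, K):
    M speakers, N speech segments each, K frequency subbands
    (e.g. LTAS vectors).  Returns a length-K weight vector:
    between-speaker variance over within-speaker variance."""
    u_i = segments.mean(axis=1)                      # (M, K) per-speaker averages
    u = segments.mean(axis=(0, 1))                   # (K,)  global average
    between = ((u_i - u) ** 2).mean(axis=0)
    within = ((segments - u_i[:, None, :]) ** 2).mean(axis=(0, 1))
    return between / (within + 1e-12)

def apply_attention_filter(ltas, weights):
    """Weight each subband of an input LTAS, highlighting salient
    subbands and masking the rest."""
    return ltas * weights
```

Subbands whose averages separate speakers cleanly (large between-speaker spread, small within-speaker spread) receive large weights; subbands carrying no speaker information are driven toward zero.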
Step 8: Mel inverse-filter enhancement. The Mel filter bank, following human hearing, places denser and more heavily weighted triangular filters over the ear-sensitive low-to-high-frequency region. The inventors instead start from an analysis of the silent attack itself.

Step 8.1: silent-attack mechanism analysis. Through the nonlinearity of the microphone circuit, the dolphin-sound attack signal that finally enters the system is converted, after low-pass filtering, into:

s_attack(t) = A_2 v(t) + (A_2/2) v^2(t) + A_2/2

while for a normal speech control signal:

s_normal(t) = A_1 v(t) + A_2 v^2(t)
as can be seen from fig. 4, the attack signal has a dc component close to 0Hz compared with the normal signal, i.e. has stronger energy in the low frequency band. Furthermore, the silence attack is highlighted in its high frequency band compared to the normal speech spectrum, since it has been processed by long-term averaging, attention statistics filtering.
Step 8.2: unlike the Mel filter, which enhances the bands the human ear is sensitive to, the bands in which silent attacks differ from normal audio are the ear-insensitive ones; enhancing these bands is therefore called a Mel inverse filter. Concretely: the high band (above 5 kHz) of the input spectrogram is expanded by linear interpolation, the mid band (100 Hz to 5 kHz) is compressed, and the low band (below 100 Hz) is expanded by linear interpolation. In addition, the interpolated and expanded low-frequency spectrum is repeated, which raises the overall difference proportion and improves the classification effect of the model.
Step 9: as Fig. 5 shows, after the above processing steps the difference between the spectra of the audible and the inaudible voice command "OK Google" is significantly enlarged, which is precisely the intended effect of the filtering described above.
Furthermore, since silent voice attacks exploit the nonlinearity vulnerability of the microphone noted in the background art, the inventors performed reception experiments on multiple smart devices. As an example, the microphone of a Samsung S20 handset was transplanted into an OPPO Find X2, as shown in Fig. 6. After a handset's microphone is replaced, the recorded audio takes on the characteristics of the replacement microphone, as shown in Fig. 7. The features exhibited by a silent attack are therefore tied most strongly to the microphone, while the handset's operating system, software version, and so on are almost irrelevant.
A microphone test board was built, covering ten mainstream microphones on the market of three circuit types, as shown in fig. 8. In total, 4800 normal samples and 4800 attack samples were collected, 9600 in all. The audio was divided into a training set of 2400 recordings and a test set of 7200 recordings.
To verify the effect of the scheme, the inventor used an SVM classifier on the test data set. For each two-dimensional feature-enhanced spectrogram, averaging along the time axis first reduces it to a one-dimensional column vector. After this dimension reduction, the SVM learns from the training data and constructs a decision boundary, and the resulting model is then used to evaluate the test data. The final results are shown in Table 1.
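The time-axis averaging and classification stage described above can be sketched as follows; a nearest-centroid classifier stands in for the SVM so the sketch has no dependencies, and all spectrogram data is synthetic:

```python
import random

# Sketch of the classification stage. The text uses an SVM; a
# nearest-centroid classifier stands in here so the sketch is
# dependency-free. All spectrogram data below is synthetic.

def time_average(spec):
    """Reduce a 2-D spectrogram (rows = frequency bins, cols = frames)
    to a 1-D vector by averaging along the time axis."""
    return [sum(row) / len(row) for row in spec]

def centroid(vectors):
    return [sum(v[i] for v in vectors) / len(vectors)
            for i in range(len(vectors[0]))]

def classify(vec, c_normal, c_attack):
    d = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return "attack" if d(vec, c_attack) < d(vec, c_normal) else "normal"

rng = random.Random(0)
# Stand-in data: "attack" spectrograms carry extra band energy.
make = lambda bias: [[rng.gauss(bias, 0.1) for _ in range(16)] for _ in range(8)]
normal_feats = [time_average(make(0.0)) for _ in range(20)]
attack_feats = [time_average(make(1.0)) for _ in range(20)]

c_n, c_a = centroid(normal_feats), centroid(attack_feats)
print(classify(time_average(make(1.0)), c_n, c_a))  # prints: attack
```

In practice the decision boundary would be learned by the SVM over the per-sub-band averages rather than read off from class centroids, but the feature pipeline (2-D enhanced spectrogram, time-axis mean, 1-D vector, classifier) is the same.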
TABLE 1 test performance index results for ten microphones
Claims (4)
1. A silent attack classification promotion method based on attention enhancement filtering is characterized by comprising the following steps:
Step 1: noise perception: a noise queue of the last five sampled audio clips is constructed and averaged to represent the environmental noise, and a timer thread updates the queue so that it reflects the noise conditions in real time; noise perception stops when a voice command is detected;
Step 2: removing noise: carrying out noise removal on the voice instruction section by adopting an improved spectral subtraction method;
Step 3: removing silence segments; a silence segment is eliminated by an adaptive threshold method: first, the amplitude of the audio signal is normalized to [-1, 1] using min-max normalization; secondly, the maximum-energy frame is taken as 0 dB, and a frame with energy lower than a specific threshold is regarded as a silence segment, wherein the specific threshold ranges from -45 dB to -15 dB;
Step 4: normalizing the voice length; the voice length is fixed at 1.5 seconds, and a voice command shorter than 1.5 seconds is padded by repetition;
Step 5: generating a spectrogram; performing feature selection on the voice command by short-time Fourier transform to obtain audio spectrum features, and applying a Hanning window to each frame of the audio spectrum;
Step 6: long-term average normalization; averaging along the time axis of the logarithmic STFT spectrum to obtain the long-term average spectrum (LTAS):

LTAS(k) = (1/L) · Σ_{l=1}^{L} log|X(k, l)|

where X(k, l) is the frequency spectrum of the signal x(n), k is the frequency index, l is the frame index, and L is the total number of frames;
Step 7: attention statistical filtering;
Step 7.1: performing significance statistics on each frequency sub-band over the voice data set, across speakers and voice content; specifically, as shown in the following formula:

Fratio(k) = [ (1/M) Σ_{i=1}^{M} (u_i(k) − u(k))² ] / [ (1/(MN)) Σ_{i=1}^{M} Σ_{j=1}^{N} (x_{i,j}(k) − u_i(k))² ]

where Fratio is a one-dimensional vector composed of the significance weight of each frequency sub-band, x_{i,j} is the j-th voice segment of speaker i, with i ∈ [1,...,M] and j ∈ [1,...,N], u_i is the feature average of all voice segments of speaker i, and u is the feature average of all voice segments of the M speakers;
Step 7.2: obtaining a weight coefficient vector from the significance of each sub-band of the normal voice commands in step 7.1, the vector dimension matching the shape of the long-term average spectrum; the weight coefficient vector is applied to weight the different sub-bands of each input long-term average spectrum, making some frequency sub-bands more prominent while masking others;
Step 8: mel inverse filter enhancement;
according to the analysis of the silent attack mechanism, the attack signal has stronger energy in the low frequency band below 100 Hz and the high frequency band above 5 kHz; the bands in which a silent attack differs from normal audio are the bands to which the human ear is insensitive, and these insensitive bands are enhanced.
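Steps 6 and 7 above (long-term averaging and attention statistical filtering) can be sketched as follows; the F-ratio shown, between-speaker variance over within-speaker variance per sub-band, is the standard formulation assumed here, and the two-speaker data is synthetic:

```python
# Sketch of steps 6-7: long-term average spectrum (LTAS) followed by
# F-ratio significance weighting per frequency sub-band. The F-ratio
# form (between-speaker variance / within-speaker variance) is an
# assumption based on the standard formulation; data is synthetic.

def ltas(frames):
    """Average a list of per-frame log-spectra along the time axis:
    LTAS(k) = (1/L) * sum_l log|X(k, l)|."""
    L, K = len(frames), len(frames[0])
    return [sum(f[k] for f in frames) / L for k in range(K)]

def f_ratio(segments_by_speaker):
    """segments_by_speaker[i][j] = LTAS vector of speaker i's j-th segment."""
    M = len(segments_by_speaker)
    K = len(segments_by_speaker[0][0])
    u_i = [ltas(segs) for segs in segments_by_speaker]   # per-speaker means
    u = [sum(v[k] for v in u_i) / M for k in range(K)]   # global mean
    n_total = sum(len(segs) for segs in segments_by_speaker)
    weights = []
    for k in range(K):
        between = sum((u_i[i][k] - u[k]) ** 2 for i in range(M)) / M
        within = sum((seg[k] - u_i[i][k]) ** 2
                     for i in range(M)
                     for seg in segments_by_speaker[i]) / n_total
        weights.append(between / (within + 1e-12))
    return weights

# Two synthetic speakers: band 0 separates them, band 1 does not.
speakers = [[[0.0, 1.0], [0.1, 1.0]], [[1.0, 1.0], [0.9, 1.0]]]
w = f_ratio(speakers)
weighted_ltas = [x * wk for x, wk in zip(ltas(speakers[0]), w)]  # step 7.2
print(w[0] > w[1])  # the discriminative sub-band gets the larger weight
```

Applying `w` element-wise to each input LTAS, as in step 7.2, makes the sub-bands that best separate classes more prominent while masking the rest.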
2. The silence attack classification promotion method based on attention enhancement filtering according to claim 1, wherein the step 2 is specifically: a spectral lower limit βP_n(w) of the voice is set; the estimated noise spectrum αP_n(w) is subtracted from the actual spectrum P_s(w) to obtain the spectrum D(w); if the amplitude of D(w) is less than βP_n(w), it is uniformly set to this fixed value, thereby reducing the "musical noise" easily generated by the original spectral subtraction method; the noise intensity over the whole band is controlled by adjusting the value of β:

D(w) = P_s(w) − αP_n(w)

wherein α is the subtraction factor and β is the spectral lower-limit threshold parameter.
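A minimal sketch of this modified spectral subtraction, assuming power spectra stored as plain lists; the values of α and β are illustrative, not from the text:

```python
# Sketch of the modified spectral subtraction of claim 2. The alpha and
# beta values are illustrative assumptions; P_n would in practice be the
# average of the noise queue from step 1.

def spectral_subtract(P_s, P_n, alpha=2.0, beta=0.05):
    """D(w) = P_s(w) - alpha * P_n(w), floored at beta * P_n(w)
    to suppress the 'musical noise' of plain spectral subtraction."""
    out = []
    for ps, pn in zip(P_s, P_n):
        d = ps - alpha * pn
        out.append(d if d >= beta * pn else beta * pn)
    return out

P_s = [10.0, 4.0, 1.0]   # observed speech-plus-noise power spectrum
P_n = [1.0, 1.0, 1.0]    # estimated noise power spectrum
print(spectral_subtract(P_s, P_n))  # prints: [8.0, 2.0, 0.05]
```

Raising β raises the spectral floor and hence the residual noise level across the whole band, which is the tuning knob the claim describes.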
3. The method as claimed in claim 1, wherein in step 3, the frame with energy lower than-35 dB is considered as the silence segment.
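The adaptive-threshold silence removal of step 3, with the −35 dB threshold of this claim, can be sketched as follows; the frame length and the synthetic signal are illustrative assumptions:

```python
import math

# Sketch of silence removal (step 3 / claim 3): min-max normalise to
# [-1, 1], take the loudest frame as 0 dB, and drop frames more than
# 35 dB below it. Frame length and the test signal are assumptions.

def remove_silence(samples, frame_len=4, threshold_db=-35.0):
    lo, hi = min(samples), max(samples)
    norm = [2 * (s - lo) / (hi - lo) - 1 for s in samples]  # min-max to [-1, 1]
    frames = [norm[i:i + frame_len] for i in range(0, len(norm), frame_len)]
    energies = [sum(x * x for x in f) for f in frames]
    e_max = max(energies)                                   # this frame is 0 dB
    kept = [f for f, e in zip(frames, energies)
            if e > 0 and 10 * math.log10(e / e_max) > threshold_db]
    return [x for f in kept for x in f]

# Silence, a 4-sample burst, silence: only the burst frame survives.
sig = [0.0] * 8 + [0.9, -0.9, 0.8, -0.8] + [0.0] * 8
print(len(remove_silence(sig)))  # prints: 4
```

The surviving voice segment would then be repeat-padded to the fixed 1.5-second length of step 4 before the spectrogram is generated.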
4. The silence attack classification promotion method based on attention enhancement filtering according to claim 1, wherein the step 8 specifically comprises: performing linear interpolation expansion on the high frequency band above 5 kHz of the input spectrogram, compressing the middle frequency band between 100 Hz and 5 kHz, and performing linear interpolation expansion on the low frequency band below 100 Hz; and repeating the low-frequency spectrum after interpolation and expansion.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210194280.9A CN114664310B (en) | 2022-03-01 | 2022-03-01 | Silent attack classification promotion method based on attention enhancement filtering |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114664310A (en) | 2022-06-24 |
CN114664310B (en) | 2023-03-31 |
Family
ID=82027777
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210194280.9A Active CN114664310B (en) | 2022-03-01 | 2022-03-01 | Silent attack classification promotion method based on attention enhancement filtering |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114664310B (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106531172A (en) * | 2016-11-23 | 2017-03-22 | 湖北大学 | Speaker voice playback identification method and system based on environmental noise change detection |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7948938B2 (en) * | 2004-04-30 | 2011-05-24 | Research In Motion Limited | Wireless communication device with duress password protection and related method |
KR101460059B1 (en) * | 2007-12-17 | 2014-11-12 | 삼성전자주식회사 | Method and apparatus for detecting noise |
US9412381B2 (en) * | 2010-03-30 | 2016-08-09 | Ack3 Bionetics Private Ltd. | Integrated voice biometrics cloud security gateway |
WO2019173304A1 (en) * | 2018-03-05 | 2019-09-12 | The Trustees Of Indiana University | Method and system for enhancing security in a voice-controlled system |
US11457313B2 (en) * | 2018-09-07 | 2022-09-27 | Society of Cable Telecommunications Engineers, Inc. | Acoustic and visual enhancement methods for training and learning |
CN110085249B (en) * | 2019-05-09 | 2021-03-16 | 南京工程学院 | Single-channel speech enhancement method of recurrent neural network based on attention gating |
CN112116742B (en) * | 2020-08-07 | 2021-07-13 | 西安交通大学 | Identity authentication method, storage medium and equipment fusing multi-source sound production characteristics of user |
CN113192504B (en) * | 2021-04-29 | 2022-11-11 | 浙江大学 | Silent voice attack detection method based on domain adaptation |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10622009B1 (en) | Methods for detecting double-talk | |
US11488605B2 (en) | Method and apparatus for detecting spoofing conditions | |
US10504539B2 (en) | Voice activity detection systems and methods | |
US9305567B2 (en) | Systems and methods for audio signal processing | |
CN105513605A (en) | Voice enhancement system and method for cellphone microphone | |
CN112004177B (en) | Howling detection method, microphone volume adjustment method and storage medium | |
CN110120225A (en) | A kind of audio defeat system and method for the structure based on GRU network | |
Verteletskaya et al. | Noise reduction based on modified spectral subtraction method | |
CN113192504B (en) | Silent voice attack detection method based on domain adaptation | |
CN112466276A (en) | Speech synthesis system training method and device and readable storage medium | |
Alam et al. | Robust feature extraction for speech recognition by enhancing auditory spectrum | |
Mu et al. | MFCC as features for speaker classification using machine learning | |
Zhang et al. | A soft decision based noise cross power spectral density estimation for two-microphone speech enhancement systems | |
CN114664310B (en) | Silent attack classification promotion method based on attention enhancement filtering | |
CN112233657A (en) | Speech enhancement method based on low-frequency syllable recognition | |
Yu et al. | Text-Dependent Speech Enhancement for Small-Footprint Robust Keyword Detection. | |
WO2022068440A1 (en) | Howling suppression method and apparatus, computer device, and storage medium | |
Sanam et al. | A combination of semisoft and μ-law thresholding functions for enhancing noisy speech in wavelet packet domain | |
EP2063420A1 (en) | Method and assembly to enhance the intelligibility of speech | |
Darabian et al. | Improving the performance of MFCC for Persian robust speech recognition | |
Mehta et al. | Robust front-end and back-end processing for feature extraction for Hindi speech recognition | |
Yan et al. | Anti-noise power normalized cepstral coefficients for robust environmental sounds recognition in real noisy conditions | |
Butarbutar et al. | Adaptive Wiener Filtering Method for Noise Reduction in Speech Recognition System | |
Islam et al. | Modeling of teager energy operated perceptual wavelet packet coefficients with an Erlang-2 PDF for real time enhancement of noisy speech | |
CN112951259B (en) | Audio noise reduction method and device, electronic equipment and computer readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||