CN114664310B - Silent attack classification promotion method based on attention enhancement filtering - Google Patents

Silent attack classification promotion method based on attention enhancement filtering

Info

Publication number
CN114664310B
CN114664310B (application CN202210194280.9A)
Authority
CN
China
Prior art keywords
spectrum
noise
attack
voice
frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210194280.9A
Other languages
Chinese (zh)
Other versions
CN114664310A (en)
Inventor
徐文渊
李鑫锋
冀晓宇
闫琛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN202210194280.9A
Publication of CN114664310A
Application granted
Publication of CN114664310B
Legal status: Active
Anticipated expiration

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/04 Training, enrolment or model building
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G10L17/14 Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/18 Artificial neural networks; Connectionist approaches
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/26 Pre-filtering or post-filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/45 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/70 Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Quality & Reliability (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Soundproofing, Sound Blocking, And Sound Damping (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a silent attack classification promotion method based on attention enhancement filtering, and provides a speech feature processing algorithm that can rapidly and effectively enlarge the difference between silent-attack audio and normal audible audio. The method amplifies this ear-imperceptible voice attack so that it can be detected, and can be deployed immediately on various types of existing equipment. The invention uses device-independent, normal audible audio data to train a unified model and thereby detect attack data, which can subsequently be used for targeted defense of intelligent voice systems in the Internet of Things. The method reduces the classifier's demand for labeled samples, makes unsupervised classification algorithms feasible, and effectively solves the problem that the audio features used by existing attack detection methods may not exist on every device, which makes it costly to customize functions, data sets, and models for each device to be protected.

Description

Silent attack classification promotion method based on attention enhancement filtering
Technical Field
The invention belongs to the technical field of artificial-intelligence voice assistant security, and particularly relates to a silent attack classification promotion method based on attention enhancement filtering.
Background
In the era of the Internet of Things, many potential security hazards have gradually appeared. For intelligent voice control systems, the most destructive and covert of these is the silent Attack, also called the Dolphin Attack (DA). It is an effective attack on speech recognition systems whose essence is to exploit the nonlinear vulnerability of microphone hardware. The popularity of voice assistants has exacerbated the threat of silent voice attacks, which may secretly control smart devices without user authorization. For example, an attacker may send a voice command that is imperceptible to the human ear to a smart speaker and have it unlock the home without the user noticing. The working principle of a typical silent speech attack is shown in FIG. 1.
First, a malicious voice command is modulated onto an ultrasonic carrier (e.g., 25 kHz) by amplitude modulation. Next, after the microphone receives the modulated ultrasonic wave, the high-frequency input signal is demodulated due to the nonlinear effect of the microphone, and the demodulated malicious command is output by the microphone to the subsequent speech recognition algorithm. The nonlinear transfer function of the microphone is expressed as follows:
s_out(t) = A_1·s_in(t) + A_2·s_in²(t) (higher-order terms are negligible)
where s_in(t) and s_out(t) denote the input and output of the microphone, respectively. When an attacker exploits this nonlinear vulnerability, the microphone inevitably recovers the audible voice command from the amplitude-modulated ultrasound. Finally, a low-pass filter removes the high-frequency ultrasonic carrier, leaving only the baseband command in the audio, which can be recognized and executed by the voice assistant. Since the modulated ultrasound is above 20 kHz, silent speech attacks are imperceptible to human users.
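For intuition, this demodulation effect can be reproduced numerically. The following sketch passes an amplitude-modulated ultrasonic signal through the quadratic transfer function above and low-pass filters the result; the parameter values (25 kHz carrier, 400 Hz test tone, the coefficients A_1 and A_2, and the filter cutoff) are illustrative assumptions, not taken from the patent:

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 96_000                      # sampling rate high enough for the 25 kHz carrier
t = np.arange(0, 1.0, 1 / fs)

# Baseband "voice command" v(t): a 400 Hz tone stands in for speech.
v = 0.5 * np.sin(2 * np.pi * 400 * t)

# Amplitude-modulate v(t) onto a 25 kHz ultrasonic carrier: s_in = (1 + v) cos(2*pi*fc*t).
fc = 25_000
s_in = (1 + v) * np.cos(2 * np.pi * fc * t)

# Microphone nonlinearity: s_out = A1*s_in + A2*s_in^2 (A1, A2 are assumed values).
A1, A2 = 1.0, 0.1
s_out = A1 * s_in + A2 * s_in ** 2

# Low-pass filter at 8 kHz removes the carrier, leaving the demodulated baseband.
b, a = butter(4, 8_000 / (fs / 2))
baseband = filtfilt(b, a, s_out)

# The recovered baseband contains A2*v(t) plus a DC term and (A2/2)*v(t)^2,
# i.e. the "silent" command has become audible inside the microphone.
print(baseband[:5])
```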
It is worth noting that, according to the inventors' large-scale experiments, when the silent attack is applied to each device, the clarity, timbre, and so on of the attack audio differ obviously from those of normal audio, while the quality of normal sound recorded by each device is almost identical. The essential reason for this phenomenon is the abnormal response of different microphones to high-frequency signals. As can be seen from FIG. 2, each microphone manufacturer mainly tunes the frequency response of the microphone to the PSTN standard in the 300-3400 Hz band, while the frequency responses to silent attacks in the high-frequency band differ. In addition, ultrasound, as a form of high-frequency mechanical vibration, causes abnormal responses in the microphone's internal resonance cavity and other components.
Existing countermeasures fall largely into two broad categories: hardware-based and software-based protection. Hardware-based protection cannot be retrofitted to stock devices already on the market and its modification cost is high, so software-based protection has the greater advantage; the usual approach combines feature engineering with a supervised machine-learning classification model. However, the spectral characteristics of attack audio and normal audio are close to each other, so the classification model needs a large amount of labeled data for training, and silent-attack samples are difficult to obtain because their generation process is complex. Unsupervised deep-learning methods, for their part, depend on the two classes differing obviously at the feature level. A feature enhancement algorithm that can effectively amplify the difference between normal audio and attack audio in the feature processing stage is therefore urgently needed.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a silent attack classification promotion method based on attention enhancement filtering, which amplifies the spectral feature difference between normal audio and attack audio, thereby effectively reducing the classifier's demand for labeled samples and making unsupervised classification algorithms feasible.
The invention is realized by adopting the following technical scheme:
a silent attack classification promotion method based on attention enhancement filtering comprises the following steps:
step 1: noise perception: constructing a noise queue of the last five sampled audio clips, averaging it to represent the environmental noise, and updating it with a timer thread so that the queue reflects the noise conditions in real time; stopping noise perception when a voice command is detected;
step 2: noise removal: removing noise from the voice command segment using an improved spectral subtraction method;
step 3: silent-segment removal: eliminating silent segments using an adaptive threshold method: first, normalizing the amplitude of the audio signal to [-1, 1] using min-max normalization; second, recording the maximum-energy frame as 0 dB and regarding frames whose energy is below a specific threshold as silent segments, the specific threshold ranging from -45 dB to -15 dB;
step 4: voice length normalization: setting the designed voice length to 1.5 seconds and padding voice commands shorter than 1.5 seconds by repeated filling;
step 5: spectrogram generation: performing feature selection on the voice command using the short-time Fourier transform to obtain audio spectral features, and applying a Hanning window to each frame of the audio spectrum;
step 6: long-term average normalization: averaging along the time axis of the logarithmic STFT spectrum to obtain the long-term average spectrum (LTAS):
LTAS(k) = (1/L) · Σ_{l=1}^{L} log|X(k, l)|
where X(k, l) is the frequency spectrum of the signal x(n), k is the frequency index, l is the frame index, and L is the total number of frames;
step 7: attention statistical filtering;
step 7.1: for the speech data set, performing significance statistics for each frequency sub-band according to speaker and speech content, as shown in the following formula:
Fratio = [ (1/M) · Σ_{i=1}^{M} (u_i - u)² ] / [ (1/(M·N)) · Σ_{i=1}^{M} Σ_{j=1}^{N} (x_i^j - u_i)² ]
where Fratio is a one-dimensional vector composed of the significance weights of each frequency sub-band; x_i^j is the j-th speech segment of speaker i, with i ∈ [1, ..., M] and j ∈ [1, ..., N]; u_i is the feature average of all speech segments of speaker i, and u is the feature average of all speech segments of the M speakers;
step 7.2: obtaining a weight coefficient vector from the significance of each sub-band of the normal voice commands in step 7.1, the vector dimension being determined by the shape of the long-term average spectrum; applying the weight coefficient vector to weight the different sub-bands of each input long-term average spectrum, making some frequency sub-bands more prominent while masking others;
step 8: mel inverse filter enhancement;
according to the analysis of the silent attack mechanism, the attack signal has stronger energy in the low band below 100 Hz and the high band above 5 kHz; the bands in which silent-attack features differ from normal audio are the ear-insensitive bands, and these ear-insensitive bands are enhanced.
In the above technical solution, further, step 2 is specifically: setting a lower spectral limit βP_n(w) for the speech; subtracting the estimated noise spectrum αP_n(w) from the actual spectrum P_s(w) to obtain the spectrum D(w); wherever the amplitude of D(w) is less than βP_n(w), uniformly setting it to that fixed value, thereby reducing the "musical noise" easily generated by the original spectral subtraction method; and adjusting the intensity of this wideband noise floor through the value of β;
D(w) = P_s(w) - αP_n(w)
P̂(w) = D(w), if D(w) ≥ βP_n(w); P̂(w) = βP_n(w), otherwise
where α is the subtraction factor and β is the spectral floor threshold parameter.
Further, in step 3, frames with energy below -35 dB are regarded as silent segments.
Further, step 8 is specifically: performing linear-interpolation expansion on the high band above 5 kHz of the input spectrogram, compressing the middle band between 100 Hz and 5 kHz, and performing linear-interpolation expansion on the low band below 100 Hz; and repeating the interpolated and expanded low-frequency spectrum.
The invention has the beneficial effects that:
the invention provides a hidden audio classification promotion method based on attention enhancement filtering, which realizes the obvious amplification of the input voice characteristic difference on the voice characteristic level and greatly reduces the dependence of audio data on a label. For software-type dolphin sound protection, the method is usually composed of two parts, namely front-end feature processing and a classification algorithm. The feature vectors with high discrimination are obtained through the audio preprocessing, so that the limitation on the rear-end classification algorithm is greatly reduced, the algorithm can be adapted to simple and low-calculated-quantity models such as support vector machines, random forests and the like, the whole algorithm can be deployed on the Internet of things terminal, and the high calculation power of a cloud server and the cost of communication overhead are saved quickly and efficiently.
Drawings
FIG. 1 is a schematic diagram of a silent instruction attack exploiting a microphone nonlinear vulnerability attack;
FIG. 2 is a frequency response curve for six different models of microphones;
FIG. 3 is a flow chart of a silent attack classification promotion method based on attention-enhancing filtering;
FIG. 4 is a comparison graph of normal/attack speech spectra before and after long term normalization and attention statistics filtering;
FIG. 5 is a comparison graph of normal/attack speech spectra before and after full feature enhancement (long term normalization, attention statistics filtering, mel inverse filter);
FIG. 6 is the microphone replacement experiment (the microphone of a Samsung S20 attached to an OPPO Find X2);
FIG. 7 is a comparison graph of an attack audio spectrum after microphone modification;
FIG. 8 is the microphone test board.
Detailed Description
For a better understanding of the technical aspects of the present invention, reference will now be made in detail to the present invention with reference to the accompanying drawings and specific examples.
FIG. 3 is a flow chart of the method of the present invention, which mainly comprises environmental noise perception, speech preprocessing, spectrogram generation, long-term average normalization, attention statistical filter design, and mel inverse filter design.
It is first necessary to eliminate some of the interference factors of the voice command, such as environmental noise, speaking speed, etc., so that the subsequent spectral characteristics can represent important information.
Step 1: noise perception. Ultrasound is a high-frequency (above 20 kHz) mechanical wave, while the microphones of smart devices were designed for telephone communication: manufacturers tune their self-developed microphones to comply with the PSTN standard, so the microphones respond abnormally to high-frequency mechanical waves. Ambient noise is somewhat similar to the abnormal microphone response caused by ultrasound, so it is important to eliminate ambient noise while preserving the abnormal microphone response pattern caused by a silent attack. The present invention uses a simple but effective method that lets a device continuously "sense" its environment, a process referred to as "ambient noise perception": a noise queue of the last five sampled audio clips is constructed and averaged to represent the ambient noise, and the queue is updated by a timer thread so that it reflects the noise conditions in real time. When a voice command is detected, noise perception stops.
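A minimal sketch of this sensing loop follows; the five-clip queue and timer-thread update come from the text, while the 1-second update interval and the record_ambient_audio() capture stub are illustrative assumptions:

```python
import threading
from collections import deque

import numpy as np

noise_queue = deque(maxlen=5)          # keep only the last five sampled clips
voice_command_active = threading.Event()

def record_ambient_audio() -> np.ndarray:
    """Stub for the device's audio capture; returns one sampled clip."""
    return np.zeros(16_000)            # replace with a real 1-second recording

def ambient_noise_estimate() -> np.ndarray:
    """The averaged queue represents the current ambient noise."""
    return np.mean(np.stack(list(noise_queue)), axis=0)

def sense_noise():
    # Stop updating the queue as soon as a voice command is detected.
    if voice_command_active.is_set():
        return
    noise_queue.append(record_ambient_audio())
    # Re-arm the timer thread so the queue tracks the environment in real time.
    threading.Timer(1.0, sense_noise).start()

sense_noise()                          # start "ambient noise perception"
```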
Step 2: noise removal. The voice command segment is denoised with an improvement on spectral subtraction, since the original spectral subtraction method easily generates "musical noise". Here α is called the subtraction factor and β the spectral floor threshold parameter. A larger subtraction gives a stronger denoising effect and removes most of the noise, leaving fewer residuals, but wherever the subtraction result is negative, the negative excursion becomes larger. The improvement is to set a lower spectral limit βP_n(w) for the speech: the estimated noise spectrum αP_n(w) is subtracted from the actual spectrum P_s(w) to obtain the spectrum D(w), and wherever the amplitude of D(w) falls below βP_n(w) it is uniformly set to that fixed value. This lower limit is itself a wideband noise, but its residual peaks are less pronounced, which reduces the "musical noise" effect. The intensity of this wideband noise floor can be adjusted through the value of β.
D(w) = P_s(w) - αP_n(w)
P̂(w) = D(w), if D(w) ≥ βP_n(w); P̂(w) = βP_n(w), otherwise
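A sketch of this improved spectral subtraction under the definitions above (the α and β values are illustrative assumptions):

```python
import numpy as np

def spectral_subtract(P_s: np.ndarray, P_n: np.ndarray,
                      alpha: float = 2.0, beta: float = 0.05) -> np.ndarray:
    """Improved spectral subtraction with a spectral floor.

    P_s: power spectrum of the noisy speech frame.
    P_n: estimated noise spectrum (e.g., from the averaged noise queue).
    """
    D = P_s - alpha * P_n           # D(w) = P_s(w) - alpha * P_n(w)
    floor = beta * P_n              # lower limit beta * P_n(w)
    # Wherever D(w) falls below the floor, clamp it to the floor value,
    # which suppresses the "musical noise" of plain spectral subtraction.
    return np.where(D >= floor, D, floor)
```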
Step 3: silent-segment removal. It is important to eliminate the influence of the varying silent segments caused by speaker habits such as speaking speed and semantic pauses. For example, when saying "OK Google", pause time and speech rate differ between young and old speakers; likewise, the pause times of "OK Google" and "light off" spoken by the same person differ. Eliminating pauses as much as possible lets the subsequent classification model process speech segments of comparable information content. The silent segments are removed with an adaptive threshold method. First, the amplitude of the audio signal is normalized to [-1, 1] using min-max normalization. Second, the maximum-energy frame is recorded as 0 dB, and frames whose energy falls below a threshold are regarded as silent segments containing no speech content. Empirical threshold experiments over the range -45 dB to -15 dB showed that a setting of -35 dB gives good silence-removal capability.
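A sketch of the adaptive-threshold silence removal; the -35 dB threshold comes from the text, while the 256-sample frame length and the energy measure are illustrative assumptions:

```python
import numpy as np

def remove_silence(signal: np.ndarray, frame_len: int = 256,
                   threshold_db: float = -35.0) -> np.ndarray:
    # Min-max normalize the amplitude to [-1, 1].
    s = 2 * (signal - signal.min()) / (signal.max() - signal.min()) - 1

    n_frames = len(s) // frame_len
    frames = s[: n_frames * frame_len].reshape(n_frames, frame_len)

    # Frame energy in dB, with the maximum-energy frame as the 0 dB reference.
    energy = np.sum(frames ** 2, axis=1)
    energy_db = 10 * np.log10(energy / energy.max() + 1e-12)

    # Keep only frames above the threshold (-35 dB worked best in [-45, -15] dB).
    voiced = frames[energy_db > threshold_db]
    return voiced.reshape(-1)
```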
Step 4: voice length normalization. The subsequent classifier, e.g., a neural network, may require input features of fixed dimension, so the front-end feature processing must be designed around a fixed speech length that can still represent a variety of voice commands. The inventors also aimed at a lightweight and fast system. Statistics over the open-source Fluent Speech Commands data set show that a typical three-word voice command lasts about 1.9 seconds on average; with the silent segments deleted, the average duration is about 1.5 seconds, which also improves real-time performance. Voice commands shorter than 1.5 seconds are padded by repeated filling.
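A sketch of the repeated-filling length normalization (the 16 kHz sampling rate is taken from the STFT settings in step 5; truncating commands longer than 1.5 seconds is an assumed handling, since the text specifies only padding):

```python
import numpy as np

def normalize_length(signal: np.ndarray, fs: int = 16_000,
                     target_sec: float = 1.5) -> np.ndarray:
    target = int(fs * target_sec)
    if len(signal) >= target:
        return signal[:target]            # truncate longer commands (assumption)
    # Repeated filling: tile the command until it reaches 1.5 seconds.
    reps = int(np.ceil(target / len(signal)))
    return np.tile(signal, reps)[:target]
```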
Step 5: spectrogram generation. The inventors performed preliminary feature-selection experiments with the short-time Fourier transform (STFT), Fbank, and MFCC; the STFT performed better than the other audio spectral features. The parameters are: a sampling rate of 16 kHz; 256 sampling points per frame; 64 sampling points between adjacent frames; and a Hanning window applied to each frame of the audio spectrum. A 128 × 376 STFT spectrum is finally obtained from each 1.5-second voice command.
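These settings map onto a standard STFT call; the sketch below uses librosa (the choice of librosa is an assumption, and the Nyquist bin of the 129-bin output is dropped to match the stated 128 × 376 shape):

```python
import librosa
import numpy as np

def make_spectrogram(signal: np.ndarray, fs: int = 16_000) -> np.ndarray:
    # 256-point frames, 64-sample hop, Hanning window, as specified above.
    stft = librosa.stft(signal, n_fft=256, hop_length=64, window="hann")
    spec = np.abs(stft)          # magnitude spectrum, shape (129, 376) for 1.5 s
    return spec[:128, :]         # drop the Nyquist bin -> 128 x 376
```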
Step 6: long-term average normalization. Although the preprocessing stage already removes interfering factors such as different semantic pauses and speaking speeds by eliminating silent segments, it does not solve the mismatch of speech content and speaker. A long-term average normalization method is used to suppress these factors. Specifically, the method averages along the time axis of the logarithmic STFT spectrum, thereby obtaining the long-term average spectrum (LTAS):
LTAS(k) = (1/L) · Σ_{l=1}^{L} log|X(k, l)|
where X(k, l) is the frequency spectrum of the signal x(n), k is the frequency index, l is the frame index, and L is the total number of frames.
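Under this definition the LTAS of a 128 × 376 spectrogram collapses to a single 128-dimensional vector; a minimal sketch:

```python
import numpy as np

def long_term_average_spectrum(spec: np.ndarray) -> np.ndarray:
    """spec: magnitude STFT of shape (frequency bins, frames)."""
    log_spec = np.log(spec + 1e-10)      # logarithmic STFT spectrum
    return log_spec.mean(axis=1)         # average over the time axis -> LTAS(k)
```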
Step 7: attention statistical filtering.
Step 7.1: computing Fratio. This algorithm performs significance statistics for each frequency sub-band over a speech data set according to speaker and speech content, as shown in the following formula:
Fratio = [ (1/M) · Σ_{i=1}^{M} (u_i - u)² ] / [ (1/(M·N)) · Σ_{i=1}^{M} Σ_{j=1}^{N} (x_i^j - u_i)² ]
where Fratio is a one-dimensional vector composed of the significance weights of each frequency sub-band; x_i^j is the j-th speech segment of speaker i, with i ∈ [1, ..., M] and j ∈ [1, ..., N]; u_i is the feature average of all speech segments of speaker i, and u is the feature average of all speech segments of the M speakers.
Step 7.2: the Fratio of step 7.1 is computed statistically over the long-term average spectra of various normal audio clips. This yields the significance of each sub-band of normal voice commands, i.e., a weight coefficient vector whose dimension matches the shape of the long-term average spectrum. In the testing phase it can be viewed as a filter applied to weight the different sub-bands of each input long-term average spectrum, making some frequency sub-bands more prominent while masking others; because its design idea stems from computing statistical information over normal audio, it is called "attention statistical filtering". A sketch of both sub-steps follows.
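The sketch below computes the F-ratio weights and applies them as a filter, assuming the LTAS features are grouped per speaker; the array shapes and the normalization of the weights are illustrative assumptions:

```python
import numpy as np

def fratio_weights(ltas: np.ndarray) -> np.ndarray:
    """ltas: array of shape (M speakers, N segments, K frequency sub-bands)."""
    M, N, K = ltas.shape
    u_i = ltas.mean(axis=1)                  # per-speaker feature averages, (M, K)
    u = u_i.mean(axis=0)                     # global feature average, (K,)
    between = ((u_i - u) ** 2).mean(axis=0)  # variance of the speaker means
    within = ((ltas - u_i[:, None, :]) ** 2).mean(axis=(0, 1))
    return between / (within + 1e-10)        # Fratio per sub-band, shape (K,)

def attention_statistical_filter(ltas_vec: np.ndarray,
                                 weights: np.ndarray) -> np.ndarray:
    # Weight each sub-band of an input LTAS: salient sub-bands are emphasized,
    # the rest are masked.
    return ltas_vec * (weights / weights.max())
```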
Step 8: mel inverse filter enhancement. The mel filter, modeled on human hearing, places denser and more heavily weighted triangular filter banks in the ear-sensitive frequency regions from the low band to the high band. The inventors instead proceed from an analysis of the silent attack.
Step 8.1: silent attack mechanism analysis. The nonlinear action of the microphone circuit converts the received dolphin-sound attack signal into:
s_attack(t) = A_2/2 + A_2·v(t) + (A_2/2)·v²(t) (the baseband remaining after the low-pass filter removes the amplitude-modulated carrier terms)
For a normal speech control signal:
s_normal(t) = A_1·v(t) + A_2·v²(t)
As can be seen from FIG. 4, the attack signal has a DC component close to 0 Hz that the normal signal lacks, i.e., stronger energy in the low band. Furthermore, after the long-term averaging and attention statistical filtering, the silent attack also stands out in the high band compared with the normal speech spectrum.
Step 8.2: unlike the mel filter, which enhances the bands to which the human ear is sensitive, the bands in which silent-attack features differ from normal audio are the ear-insensitive bands; the enhancement of these bands is therefore called a mel inverse filter. The specific method is as follows:
the high frequency band (higher than 5 kHz) of the input spectrogram is subjected to linear interpolation expansion, the middle frequency band (100 Hz-5 kHz) is compressed, and the low frequency band (lower than 100 Hz) of the input spectrogram is subjected to linear interpolation expansion. In addition, the low-frequency spectrum after linear interpolation expansion is repeated, so that the integral difference proportion can be improved, and the model classification effect is better.
Step 9: as can be seen from FIG. 5, after the above processing steps, the difference between the spectra of the audible and inaudible versions of the voice command "OK Google" is significantly enlarged.
Furthermore, since the background art notes that silent voice attacks exploit the nonlinear vulnerability of the microphone, the inventors performed microphone-reception experiments on several smart devices, for example replacing the microphone of an OPPO Find X2 with that of a Samsung S20, as shown in FIG. 6. After a handset's microphone is replaced, the recorded audio exhibits the characteristics of the new microphone, as shown in FIG. 7. The features presented by the silent attack are therefore most closely tied to the microphone, while the handset's operating system, version, and so on are almost irrelevant.
A microphone test board was built covering ten mainstream microphones of three types on the market, as shown in FIG. 8. A total of 4,800 normal and 4,800 attack samples were collected, 9,600 in all, divided into a training set of 2,400 clips and a test set of 7,200 clips.
To verify the effect of the scheme, the inventors used an SVM classifier on the test data set. The two-dimensional feature-enhanced spectrogram is first averaged along the time axis, reducing it to a one-dimensional column vector. After this dimension reduction, the SVM learns the training data and constructs a decision boundary, and the resulting model is evaluated on the test data. The final results are shown in Table 1; a sketch of this back-end follows the table.
TABLE 1 test performance index results for ten microphones
(The table content is provided as an image in the original publication.)
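A sketch of this SVM back-end (scikit-learn and its default RBF kernel are assumptions; the patent specifies only an SVM classifier and the time-axis averaging):

```python
import numpy as np
from sklearn.svm import SVC

# X_train / X_test: enhanced spectrograms of shape (n_clips, bins, frames);
# y_train: 0 for normal audio, 1 for silent-attack audio.
def fit_detector(X_train: np.ndarray, y_train: np.ndarray) -> SVC:
    # Average over the time axis to reduce each clip to a 1-D feature vector.
    feats = X_train.mean(axis=2)
    clf = SVC()                    # kernel SVM constructs the decision boundary
    clf.fit(feats, y_train)
    return clf

def predict(clf: SVC, X_test: np.ndarray) -> np.ndarray:
    return clf.predict(X_test.mean(axis=2))
```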

Claims (4)

1. A silent attack classification promotion method based on attention enhancement filtering is characterized by comprising the following steps:
step 1: noise perception: constructing a noise queue of the last five sampled audio clips, averaging it to represent the environmental noise, and updating it with a timer thread so that the queue reflects the noise conditions in real time; stopping noise perception when a voice command is detected;
step 2: noise removal: removing noise from the voice command segment using an improved spectral subtraction method;
step 3: silent-segment removal: eliminating silent segments using an adaptive threshold method: first, normalizing the amplitude of the audio signal to [-1, 1] using min-max normalization; second, recording the maximum-energy frame as 0 dB and regarding frames whose energy is below a specific threshold as silent segments, the specific threshold ranging from -45 dB to -15 dB;
step 4: voice length normalization: setting the designed voice length to 1.5 seconds and padding voice commands shorter than 1.5 seconds by repeated filling;
step 5: spectrogram generation: performing feature selection on the voice command using the short-time Fourier transform to obtain audio spectral features, and applying a Hanning window to each frame of the audio spectrum;
step 6: long-term average normalization: averaging along the time axis of the logarithmic STFT spectrum to obtain the long-term average spectrum (LTAS):
LTAS(k) = (1/L) · Σ_{l=1}^{L} log|X(k, l)|
where X(k, l) is the frequency spectrum of the signal x(n), k is the frequency index, l is the frame index, and L is the total number of frames;
step 7: attention statistical filtering;
step 7.1: for the speech data set, performing significance statistics for each frequency sub-band according to speaker and speech content, as shown in the following formula:
Fratio = [ (1/M) · Σ_{i=1}^{M} (u_i - u)² ] / [ (1/(M·N)) · Σ_{i=1}^{M} Σ_{j=1}^{N} (x_i^j - u_i)² ]
where Fratio is a one-dimensional vector composed of the significance weights of each frequency sub-band; x_i^j is the j-th speech segment of speaker i, with i ∈ [1, ..., M] and j ∈ [1, ..., N]; u_i is the feature average of all speech segments of speaker i, and u is the feature average of all speech segments of the M speakers;
step 7.2: obtaining a weight coefficient vector from the significance of each sub-band of the normal voice commands in step 7.1, the vector dimension being determined by the shape of the long-term average spectrum; applying the weight coefficient vector to weight the different sub-bands of each input long-term average spectrum, making some frequency sub-bands more prominent while masking others;
step 8: mel inverse filter enhancement;
according to the analysis of the silent attack mechanism, the attack signal has stronger energy in the low band below 100 Hz and the high band above 5 kHz; the bands in which silent-attack features differ from normal audio are the ear-insensitive bands, and these ear-insensitive bands are enhanced.
2. The silent attack classification promotion method based on attention enhancement filtering according to claim 1, wherein step 2 is specifically: setting a lower spectral limit βP_n(w) for the speech; subtracting the estimated noise spectrum αP_n(w) from the actual spectrum P_s(w) to obtain the spectrum D(w); wherever the amplitude of D(w) is less than βP_n(w), uniformly setting it to that fixed value, thereby reducing the "musical noise" easily generated by the original spectral subtraction method; and adjusting the intensity of this wideband noise floor through the value of β;
D(w) = P_s(w) - αP_n(w)
P̂(w) = D(w), if D(w) ≥ βP_n(w); P̂(w) = βP_n(w), otherwise
where α is the subtraction factor and β is the spectral floor threshold parameter.
3. The silent attack classification promotion method based on attention enhancement filtering according to claim 1, wherein in step 3, frames with energy below -35 dB are regarded as silent segments.
4. The silent attack classification promotion method based on attention enhancement filtering according to claim 1, wherein step 8 is specifically: performing linear-interpolation expansion on the high band above 5 kHz of the input spectrogram, compressing the middle band between 100 Hz and 5 kHz, and performing linear-interpolation expansion on the low band below 100 Hz; and repeating the interpolated and expanded low-frequency spectrum.
CN202210194280.9A 2022-03-01 2022-03-01 Silent attack classification promotion method based on attention enhancement filtering Active CN114664310B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210194280.9A CN114664310B (en) 2022-03-01 2022-03-01 Silent attack classification promotion method based on attention enhancement filtering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210194280.9A CN114664310B (en) 2022-03-01 2022-03-01 Silent attack classification promotion method based on attention enhancement filtering

Publications (2)

Publication Number Publication Date
CN114664310A CN114664310A (en) 2022-06-24
CN114664310B (en) 2023-03-31

Family

ID=82027777

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210194280.9A Active CN114664310B (en) 2022-03-01 2022-03-01 Silent attack classification promotion method based on attention enhancement filtering

Country Status (1)

Country Link
CN (1) CN114664310B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106531172A (en) * 2016-11-23 2017-03-22 湖北大学 Speaker voice playback identification method and system based on environmental noise change detection

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7948938B2 (en) * 2004-04-30 2011-05-24 Research In Motion Limited Wireless communication device with duress password protection and related method
KR101460059B1 (en) * 2007-12-17 2014-11-12 삼성전자주식회사 Method and apparatus for detecting noise
US9412381B2 (en) * 2010-03-30 2016-08-09 Ack3 Bionetics Private Ltd. Integrated voice biometrics cloud security gateway
WO2019173304A1 (en) * 2018-03-05 2019-09-12 The Trustees Of Indiana University Method and system for enhancing security in a voice-controlled system
US11457313B2 (en) * 2018-09-07 2022-09-27 Society of Cable Telecommunications Engineers, Inc. Acoustic and visual enhancement methods for training and learning
CN110085249B (en) * 2019-05-09 2021-03-16 南京工程学院 Single-channel speech enhancement method of recurrent neural network based on attention gating
CN112116742B (en) * 2020-08-07 2021-07-13 西安交通大学 Identity authentication method, storage medium and equipment fusing multi-source sound production characteristics of user
CN113192504B (en) * 2021-04-29 2022-11-11 浙江大学 Silent voice attack detection method based on domain adaptation

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106531172A (en) * 2016-11-23 2017-03-22 湖北大学 Speaker voice playback identification method and system based on environmental noise change detection

Also Published As

Publication number Publication date
CN114664310A (en) 2022-06-24

Similar Documents

Publication Publication Date Title
US10622009B1 (en) Methods for detecting double-talk
US11488605B2 (en) Method and apparatus for detecting spoofing conditions
US10504539B2 (en) Voice activity detection systems and methods
US9305567B2 (en) Systems and methods for audio signal processing
CN105513605A (en) Voice enhancement system and method for cellphone microphone
CN112004177B (en) Howling detection method, microphone volume adjustment method and storage medium
CN110120225A (en) A kind of audio defeat system and method for the structure based on GRU network
Verteletskaya et al. Noise reduction based on modified spectral subtraction method
CN113192504B (en) Silent voice attack detection method based on domain adaptation
CN112466276A (en) Speech synthesis system training method and device and readable storage medium
Alam et al. Robust feature extraction for speech recognition by enhancing auditory spectrum
Mu et al. MFCC as features for speaker classification using machine learning
Zhang et al. A soft decision based noise cross power spectral density estimation for two-microphone speech enhancement systems
CN114664310B (en) Silent attack classification promotion method based on attention enhancement filtering
CN112233657A (en) Speech enhancement method based on low-frequency syllable recognition
Yu et al. Text-Dependent Speech Enhancement for Small-Footprint Robust Keyword Detection.
WO2022068440A1 (en) Howling suppression method and apparatus, computer device, and storage medium
Sanam et al. A combination of semisoft and μ-law thresholding functions for enhancing noisy speech in wavelet packet domain
EP2063420A1 (en) Method and assembly to enhance the intelligibility of speech
Darabian et al. Improving the performance of MFCC for Persian robust speech recognition
Mehta et al. Robust front-end and back-end processing for feature extraction for Hindi speech recognition
Yan et al. Anti-noise power normalized cepstral coefficients for robust environmental sounds recognition in real noisy conditions
Butarbutar et al. Adaptive Wiener Filtering Method for Noise Reduction in Speech Recognition System
Islam et al. Modeling of teager energy operated perceptual wavelet packet coefficients with an Erlang-2 PDF for real time enhancement of noisy speech
CN112951259B (en) Audio noise reduction method and device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant