CN114694649A - Universal directional voice confrontation sample generation method, system, medium and equipment

Info

Publication number: CN114694649A
Application number: CN202210296056.0A
Authority: CN
Inventors: 王宝旺; 丁菡; 赵衰; 翟临威; 王鸽; 惠维; 赵鲲; 赵季中
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2022-03-24
Filing date: 2022-03-24
Publication date: 2022-07-01

Abstract

The invention discloses a method, a system, a medium and equipment for generating a universal directional voice countermeasure sample. A target optimization loss function is designed to make the voice disturbance universal: the confidence of classification into the original correct class is minimized, and the confidence of classification into the target class is maximized. The decibel difference between the voice disturbance and the original voice is introduced into the loss function to limit the disturbance size, and the l_p norm of the voice disturbance is restricted to a specified spherical range. The disturbance is masked by daily noise and the psychoacoustic principle, and the influence of sound transmission in air is introduced when the voice disturbance is generated, so that the generated general voice disturbance remains applicable in the physical world. After the general voice disturbance of the invention is added to any original voice command data, a universal directional voice countermeasure sample is generated, and the voice data is wrongly identified as the specified target class by a voice command classifier based on a convolutional neural network. The method is of great significance for research on the robustness of deep neural networks.

Description

Universal directional voice confrontation sample generation method, system, medium and equipment

Technical Field

The invention belongs to the technical field of deep-learning-based security, and particularly relates to a method, a system, a medium and equipment for generating a universal directional voice countermeasure sample.

Background

In recent years, with the continuous improvement of deep neural networks, applications based on deep learning have emerged one after another, covering fields such as images, voice and text. However, recent studies have found that applications based on deep neural networks are prone to misidentifying countermeasure (adversarial) sample data. A countermeasure sample is data that a model misidentifies (misclassifies) after a small disturbance, difficult for human senses to detect, is added to the raw data.

Specifically, countermeasure samples are classified into white-box and black-box countermeasure samples according to whether the model structure of the network is known in advance, and into non-directional and directional countermeasure samples according to whether the erroneous recognition result (category) is specified. Countermeasure samples were first applied to network models in the image domain and were later proved feasible in the voice domain as well. Because the introduction of voice disturbance brings a certain amount of noise, and because sound propagation involves phenomena such as attenuation, the generated voice disturbance cannot be used directly in the physical world. To address the noise problem, existing methods generate a specific disturbance for each piece of original data, which is not very practical. Some methods can generate general disturbance, but they do not consider problems such as distortion during speech propagation in the real world, and they generate only non-directional disturbance, which is not practical enough and is easily perceived. To realize physical voice disturbance, existing methods only add random noise during training and do not consider the attenuation, distortion, reflection and similar phenomena that arise when voice travels through the air, so their robustness is weak.

Therefore, how to use existing data to generate a general disturbance that has as little noise as possible, remains feasible in the physical world and is also applicable to unseen data is a problem to be solved in the field of voice countermeasure samples at present.

Disclosure of Invention

The technical problem to be solved by the present invention is to provide, in view of the defects in the prior art, a method, a system, a medium and a device for generating a universal directional voice countermeasure sample. First, universal directional disturbance is generated by means of a self-defined loss function; second, a way of masking the generated universal directional disturbance is provided, so that it is less likely to be perceived by people; and finally, universal directional physical voice disturbance is realized.

The invention adopts the following technical scheme:

A method for generating a universal directional speech countermeasure sample comprises the following steps:

S1, obtaining a loss optimization function for generating general disturbance by optimizing the confidence coefficient vector output by the voice command classifier model; performing iterative update on the initial disturbance by utilizing back propagation according to the loss optimization function to obtain general directional disturbance, and adding the general directional disturbance to any original voice data to obtain a general directional voice countermeasure sample;

S2, initializing disturbance by adopting daily environmental noise, then obtaining a similarity function of the disturbance and the initial daily environmental noise by using a psychoacoustic principle, adding the similarity function into the loss optimization function obtained in the step S1 to obtain a new loss optimization function, obtaining general directional disturbance under the environmental noise covering through iterative training again, and adding the obtained general directional disturbance to original voice data to obtain a general directional voice countermeasure sample under the environmental noise covering;

S3, filtering the low-frequency and high-frequency parts in the initial daily environmental noise in the step S2 by using a band-pass filter to obtain distortion-free disturbance; then, applying room impulse response simulation to the distortion-free disturbance to obtain reverberation disturbance after reverberation and reflection under different room configurations; and introducing Gaussian white noise into the reverberation disturbance audio for simulating background noise in the physical world to obtain physical disturbance, taking the physical disturbance as an initial value of the disturbance, carrying out iterative training on the physical disturbance again based on the loss optimization function of the step S2 to obtain general directional physical disturbance under the covering of environmental noise, adding the general directional physical disturbance into the original data, and generating a general directional physical voice countermeasure sample.

Specifically, step S1 specifically includes:

S101, first selecting a voice command classifier as a reference model, and generating a general voice countermeasure sample based on the reference model; the voice command classifier covers 8 voice commands in total;

S102, obtaining a corresponding confidence loss optimization function according to the confidence vector output by the model, and simultaneously introducing the decibel difference to balance the disturbance size against the directional target recognition success rate, to obtain a total loss optimization function Loss_tar;

S103, using the loss optimization function Loss_tar of step S102, calculating the gradient through back propagation and the chain rule, fixing the parameters of the original model, and iteratively updating the initial general disturbance parameters with a defined learning rate.

Further, in step S102, the loss optimization function Loss_tar is as follows:

Loss_tar＝(L_tar1+L_tar2)/2+α·Diff

wherein L_tar1 is the first partial loss optimization function, L_tar2 is the second partial loss optimization function, α is the weight coefficient and Diff is the decibel difference.

Further, the loss L_tar1 that minimizes the confidence of the original correct class, the loss L_tar2 that maximizes the confidence of the target class, and the decibel difference Diff are calculated as follows:

L_tar1＝max(F_r(x+δ)-max(F_i(x+δ))，0)，i≠r

L_tar2＝max(max(F_j(x+δ))-F_t(x+δ)，-G)，j≠t

where x is the original speech data, δ is the general disturbance, F_m(·) is the confidence function for model recognition as class m, r is the correctly classified class, t is the directional target class, G is a hyperparameter, N is the total number of speech data, Decibel(δ) is the decibel value corresponding to the general disturbance, and Decibel(x_i) is the decibel value corresponding to the i-th voice data.
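The two confidence terms can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: the confidence vectors, the function names and the value G = 0.5 are hypothetical, and the confidence vector is assumed to be an ordinary list of per-class scores F_m(x+δ).

```python
from typing import Sequence


def l_tar1(conf: Sequence[float], r: int) -> float:
    """L_tar1 = max(F_r - max_{i != r} F_i, 0); zero once class r no longer leads."""
    best_other = max(c for i, c in enumerate(conf) if i != r)
    return max(conf[r] - best_other, 0.0)


def l_tar2(conf: Sequence[float], t: int, g: float) -> float:
    """L_tar2 = max(max_{j != t} F_j - F_t, -G); floors at -G once t leads by G."""
    best_other = max(c for j, c in enumerate(conf) if j != t)
    return max(best_other - conf[t], -g)


def confidence_loss(conf: Sequence[float], r: int, t: int, g: float) -> float:
    """The (L_tar1 + L_tar2) / 2 part of Loss_tar, without the decibel term."""
    return (l_tar1(conf, r) + l_tar2(conf, t, g)) / 2.0


# Hypothetical confidence vectors over 8 voice commands (illustrative values).
before = [0.70, 0.10, 0.05, 0.05, 0.04, 0.03, 0.02, 0.01]  # correct class 0 leads
after = [0.10, 0.75, 0.05, 0.04, 0.03, 0.01, 0.01, 0.01]   # target class 1 leads

print(confidence_loss(before, r=0, t=1, g=0.5))
print(confidence_loss(after, r=0, t=1, g=0.5))
```

With these illustrative vectors the combined term falls from about 0.6 to about -0.25: L_tar1 reaches zero once the correct class stops leading, and L_tar2 stops decreasing at -G, so the optimizer has no incentive to push the target margin beyond G.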

Specifically, in step S2, the step of calculating the power spectral density of the disturbance and the daily noise to obtain a corresponding similarity function specifically includes:

First, a short-time Fourier transform is performed on the audio x to obtain frequency-domain information, and the power spectral density p_x(i) is calculated; the power spectral density p_x(i) is then normalized to obtain the normalized power spectral density.

Finally, the normalized power spectral density of the difference between the general disturbance δ and the daily noise θ and the normalized power spectral density of the daily noise θ are calculated respectively, from which the similarity loss function is obtained; the disturbance is then iteratively trained to obtain the universal directional disturbance under environmental noise covering.

Further, the similarity loss function sim(δ-θ, θ) is determined by the following quantities: W, the Hanning window size; the normalized power spectral density of the difference between the general disturbance and the daily noise; and the normalized power spectral density of the daily noise.
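As an illustration of how such a similarity term might be evaluated, the sketch below compares two normalized power spectral densities bin by bin. The exact formula is not reproduced in this text, so the hinge max(·, 0) and the averaging over bins are assumptions of the sketch, and the function name and sample values are hypothetical.

```python
from typing import Sequence


def similarity_loss(psd_diff: Sequence[float], psd_noise: Sequence[float]) -> float:
    """Mean amount by which the PSD of (delta - theta) exceeds the PSD of theta.

    Both inputs are normalized power spectral densities in dB over the same
    frequency bins. ASSUMPTION: only bins where the difference signal is
    louder than the noise are penalized (hinge), averaged over all bins.
    """
    if len(psd_diff) != len(psd_noise):
        raise ValueError("PSD vectors must cover the same frequency bins")
    excess = [max(d - n, 0.0) for d, n in zip(psd_diff, psd_noise)]
    return sum(excess) / len(excess)


# Hypothetical normalized PSD values (dB) over four frequency bins.
noise = [60.0, 55.0, 40.0, 30.0]
hidden = [50.0, 50.0, 35.0, 20.0]   # below the noise in every bin
audible = [65.0, 55.0, 48.0, 30.0]  # exceeds the noise in two bins

print(similarity_loss(hidden, noise))
print(similarity_loss(audible, noise))
```

Under this assumed form, a difference signal that stays below the noise spectrum in every bin contributes no loss, while any excess is penalized in proportion to its size.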

Specifically, step S3 specifically includes:

S301, filtering out distorted signals below 50Hz in the general disturbance voice by using a band-pass filter, and filtering out signals above the 8kHz threshold in the disturbance voice data;

S302, processing the general disturbance voice from step S301 by using the room impulse response: according to the length, width and height (X, Y, Z) of different rooms, which follow a T distribution, the position (x_m, y_m, z_m) of the microphone, the position (x_s, y_s, z_s) of the loudspeaker and the reverberation time T₆₀, general disturbance voice under the room impulse response is generated, simulating reverberant voice signals under a variety of room configurations;

S303, adding Gaussian white noise to the general disturbance voice obtained in step S302 to obtain physical disturbance voice, using it as the initial value of the disturbance, and training according to the loss function to obtain general directional physical disturbance.
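Steps S301 to S303 can be pictured as a three-stage pipeline. The following standard-library sketch is illustrative only: the brick-wall DFT filter stands in for whatever band-pass filter is actually used, the impulse response is passed in as a plain list instead of being simulated from room dimensions and T₆₀, and all function names are hypothetical.

```python
import cmath
import math
import random
from typing import List, Sequence


def bandpass(signal: Sequence[float], sample_rate: float,
             low_hz: float = 50.0, high_hz: float = 8000.0) -> List[float]:
    """Brick-wall band-pass via a naive DFT (O(n^2), so only for short clips)."""
    n = len(signal)
    spectrum = [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)) for k in range(n)]
    for k in range(n):
        freq = min(k, n - k) * sample_rate / n  # bins above Nyquist mirror lower ones
        if not (low_hz <= freq <= high_hz):
            spectrum[k] = 0
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]


def reverberate(signal: Sequence[float], impulse_response: Sequence[float]) -> List[float]:
    """Convolve with a room impulse response, truncated to the input length."""
    out = [0.0] * len(signal)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            if i + j < len(out):
                out[i + j] += s * h
    return out


def add_white_noise(signal: Sequence[float], sigma: float, rng: random.Random) -> List[float]:
    """Add zero-mean Gaussian white noise with standard deviation sigma."""
    return [s + rng.gauss(0.0, sigma) for s in signal]


def physical_init(noise: Sequence[float], sample_rate: float,
                  impulse_response: Sequence[float], sigma: float,
                  seed: int = 0) -> List[float]:
    """Filter, reverberate, then add noise; the result seeds the disturbance."""
    rng = random.Random(seed)
    filtered = bandpass(noise, sample_rate)
    reverbed = reverberate(filtered, impulse_response)
    return add_white_noise(reverbed, sigma, rng)
```

In practice one would draw many impulse responses (one per simulated room, microphone and loudspeaker placement) and train against all of them; a toy response such as `[1.0, 0.0, 0.0, 0.5]`, a direct path plus one echo, is enough to exercise the code.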

In a second aspect, an embodiment of the present invention provides a universal directional speech countermeasure sample generation system, including:

the updating module is used for optimizing the confidence coefficient vector output by the voice command classifier model to obtain a loss optimization function for generating general disturbance; performing iterative update on the initial disturbance by utilizing back propagation according to a loss optimization function to obtain general directional disturbance, and adding the general directional disturbance to any original voice data to obtain a general directional voice countermeasure sample;

the training module is used for initializing disturbance by adopting daily environmental noise, then obtaining a similarity function of the disturbance and the initial daily environmental noise by utilizing a psychoacoustic principle, adding the similarity function into the loss optimization function obtained by the updating module to obtain a new loss optimization function, obtaining general directional disturbance under the environmental noise coverage by carrying out iterative training again, and adding the obtained general directional disturbance to the original voice data to obtain a general directional voice countermeasure sample under the environmental noise coverage;

the generating module is used for filtering low-frequency and high-frequency parts in the initial daily environmental noise in the training module by using a band-pass filter to obtain distortion-free disturbance; then, applying room impulse response simulation to the distortion-free disturbance to obtain reverberation disturbance after reverberation and reflection under different room configurations; and introducing Gaussian white noise into the reverberation disturbance audio for simulating background noise in the physical world to obtain physical disturbance, taking the physical disturbance as an initial value of the disturbance, carrying out iterative training on the physical disturbance again based on the loss optimization function of the training module to obtain general directional physical disturbance under the covering of environmental noise, and adding the general directional physical disturbance into original data to generate a general directional physical voice countermeasure sample.

In a third aspect, a computer device comprises a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the above-mentioned general directional speech confrontation sample generation method when executing the computer program.

In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, which includes a computer program, and when the computer program is executed by a processor, the computer program implements the steps of the above-mentioned method for generating a universal directional speech confrontation sample.

Compared with the prior art, the invention at least has the following beneficial effects:

According to the method for generating a universal directional voice countermeasure sample, the universal directional disturbance is generated with a self-defined target loss function, and the disturbance is doubly masked by daily environmental noise and the psychoacoustic principle, so that the generated universal disturbance is less likely to be perceived. Further, the changes that the adversarial disturbance undergoes while propagating through the air are introduced into the training process to generate a general directional physical disturbance. Specifically, distorted low-frequency voice signals are filtered out on the basis of an analysis of the frequency response curve of the equipment. Room impulse responses are used to generate reverberant disturbance voice data under different room configurations, which on the one hand expands the data set and on the other hand simulates the reverberation and reflection of sound for different rooms, loudspeaker positions and microphone positions, improving the robustness of the general disturbance. Gaussian white noise is then added to the general disturbance voice data after the room impulse response simulation to simulate real-world noise, and the general directional physical disturbance is generated in combination with the loss optimization function of the previous training process.

Furthermore, in order to enable the generated disturbance to be applied to unknown voice data, the invention adopts iterative training to generate general disturbance, and the generated disturbance can be added to any voice data, thereby realizing high success rate of directional target recognition.

Further, the proposed loss optimization function Loss_tar comprises two parts. One part continuously reduces the confidence of the original correct category until it no longer dominates, at which point the model misidentifies the voice command; the other part continuously increases the confidence of the target class, so that the voice command is recognized by the model as the specified target class after the general disturbance is added. Optimizing with both parts of the loss function combined makes the model not only misclassify but misclassify toward the specified target class.

Further, in order to generate as small a perturbation as possible, the present invention limits the generated general perturbation and adds the limit to the training process. Specifically, the maximum logarithmic value of the time-domain signal of the speech is used as the decibel value of the speech, and the decibel difference between the disturbance speech and the original speech is used as the limiting factor. Since a smaller decibel difference is better, it can be used directly as a part of the loss function.

Furthermore, in order to make the generated disturbance less noticeable, on one hand, the general disturbance is initialized by using the daily environmental noise; on the other hand, according to the psychoacoustic principle, an initial daily noise masking threshold is calculated, a difference speech signal gamma between the generated general disturbance with the ambient noise and the original ambient noise is calculated, short-time Fourier transform is performed on the speech signal gamma to convert the speech signal gamma into a frequency domain signal, the frequency domain signal is further subjected to logarithm to obtain the power spectral density of the speech signal gamma, and the power spectral density value can not be detected by human ears when the power spectral density value is within the daily noise masking threshold range.

Further, it is proposed to use a similarity loss function sim (δ - θ, θ) to measure whether the generated disturbance δ can be masked by the environmental noise θ. The masking threshold value of the normalized power spectral density of the environmental noise is obtained by calculating, so that the difference value between the generated disturbance and the environmental noise is in the range, and the normalized power spectral density difference value of the generated disturbance and the environmental noise is introduced into a loss function for optimization, thereby obtaining the universal disturbance under the environmental masking.

Furthermore, in order to solve the problem of equipment distortion, on one hand, a high-fidelity sound box can be selected as a loudspeaker for general voice disturbance, and the problem of distortion of equipment of a voice disturbance sender is reduced as much as possible. Aiming at the distortion problem of victim equipment, by analyzing the frequency response curve of a daily mobile phone microphone and the like, the phenomenon that the voice signal attenuation below 50Hz is more obvious can be easily found, so the voice signal below 50Hz is filtered by using a filter. On the other hand, according to the nyquist theorem, the voice signals with the frequency higher than half of the sampling rate can not be obtained, so that the threshold signals of the high-frequency part can be filtered out; because real world sounds can generate phenomena such as attenuation, reverberation, reflection and the like in the transmission process, in order to enable the generated general disturbance to be still effective in the physical world, the general disturbance is subjected to room impulse response simulation so as to be suitable for different room configurations; in addition, because some noise exists in the actual environment, Gaussian white noise is added in the general disturbance to simulate the actual noise, and the voice disturbance after the operation of the steps is used as the initial value of the disturbance for training, so that the voice general disturbance which is still feasible in the real physical world can be obtained.
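The high-frequency cutoff can be checked with a one-line calculation. Assuming recordings sampled at f_s = 16 kHz (an assumption made here for illustration; the text itself states only the 8kHz cutoff of step S301), the Nyquist limit coincides with that cutoff:

```latex
f_{\max} = \frac{f_s}{2} = \frac{16\ \text{kHz}}{2} = 8\ \text{kHz}
```

Under that assumption, content above 8kHz cannot be represented in the sampled signal, so removing it from the disturbance costs nothing.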

In conclusion, the method and the device can form security threats to voice assistants, voice interaction applications and the like, and effectively make up for the defect of the conventional universal directional physical voice disturbance.

The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.

Drawings

FIG. 1 is a diagram of an overall network structure of a method for generating a speech countermeasure sample based on general directional physical disturbance according to the present invention;

FIG. 2 is a graph of the recognition accuracy confusion matrix results of the voice command classifier;

FIG. 3 is a diagram of recognition accuracy confusion matrix results after the addition of a generic directional perturbation targeting a "down" voice command;

FIG. 4 is a diagram of 8 generated universal directional disturbance waveforms for speech;

FIG. 5 is a graph comparing the recognition accuracy and the success rate of the directional target recognition of the voice confrontation samples generated by using 8 general speech perturbations respectively;

FIG. 6 is a schematic diagram of a computer device according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In the description of the present invention, it should be understood that the terms "comprises" and/or "comprising" indicate the presence of the stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

It should be further understood that the term "and/or" as used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and including such combinations, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship.

It should be understood that although the terms first, second, third, etc. may be used to describe preset ranges, etc. in embodiments of the present invention, these preset ranges should not be limited to these terms. These terms are only used to distinguish preset ranges from each other. For example, the first preset range may also be referred to as a second preset range, and similarly, the second preset range may also be referred to as the first preset range, without departing from the scope of the embodiments of the present invention.

The word "if," as used herein, may be interpreted as "at the time of" or "when" or "in response to a determination" or "in response to a detection," depending on the context. Similarly, the phrases "if determined" or "if detected (a stated condition or event)" may be interpreted as "when determined" or "in response to a determination" or "when detected (a stated condition or event)" or "in response to a detection (a stated condition or event)", depending on the context.

Various structural schematics according to the disclosed embodiments of the invention are shown in the drawings. The figures are not drawn to scale, wherein certain details are exaggerated and possibly omitted for clarity of presentation. The shapes of the various regions, layers and their relative sizes, positional relationships are shown in the drawings as examples only, and in practice deviations due to manufacturing tolerances or technical limitations are possible, and a person skilled in the art may additionally design regions/layers with different shapes, sizes, relative positions, according to the actual needs.

Referring to FIG. 1, the present invention provides a method for generating a universal directional voice countermeasure sample, which includes a universal disturbance generating part, an environmental noise masking part, and a universal physical disturbance generating part. In the universal disturbance generating part, iterative training with a custom loss optimization function and a limit on the disturbance generates a universal directional disturbance; when the generated disturbance is superposed on an arbitrary segment of original voice data, the voice command classifier classifies the result into the specified incorrect category. In the environmental noise masking part, the universal disturbance is first initialized with daily noise, the psychoacoustic principle is then used to mask the generated voice disturbance under the daily environmental noise as far as possible, and a universal directional disturbance masked by daily environmental noise is finally generated. In the universal directional physical disturbance part, a band-pass filter, a room impulse response and added noise introduce the changes that the universal disturbance undergoes during air propagation into the iterative training process, so that the generated disturbance remains effective in the real world.

The invention discloses a method for generating a universal directional voice countermeasure sample, which comprises the following steps:

S101, the speech recognition model targeted by the present invention is selected as the victim model, namely the speech command classifier from the official PyTorch tutorial (Speech Command Classification with torchaudio), which classifies 8 speech commands; when the reference model is generated, the data are divided 8:2 into training data and test data.

S102, obtaining a corresponding confidence coefficient Loss optimization function according to the confidence coefficient vector output by the model, and simultaneously introducing the decibel difference to balance the relation between the disturbance size and the directional target recognition success rate to obtain a total Loss optimization function Loss_tar。

In order for the data to be recognized by the model as the directional target class, the model must on the one hand misclassify the generated voice countermeasure sample as far as possible, and on the other hand the misclassification must fall on the specified voice command. To this end, a two-part loss optimization function is designed:

L_tar1＝max(F_r(x+δ)-max(F_i(x+δ))，0)，i≠r

L_tar2＝max(max(F_j(x+δ))-F_t(x+δ)，-G)，j≠t

wherein F_m(·) is the confidence function with which the speech command classifier recognizes class m, r is the original correct class, t is the target class, x is the original speech command, δ is the general perturbation, and G is a hyperparameter. Minimizing L_tar1 ensures that the original correct class no longer has the highest confidence in the speech command classifier; minimizing L_tar2 increases the confidence with which the speech command classifier identifies the target class, so that it gradually becomes the class with the highest confidence.

In order to keep the generated disturbance as small as possible, the commonly used decibel difference is introduced to balance the disturbance size against the directional target recognition success rate, specifically as follows:

Decibel(v)＝max_k 20·log₁₀(|v_k|)，||δ||₂＜ε

where N is the total number of voice command data pieces and ε is the maximum spherical radius under the l₂ norm constraint on the disturbance. Since the decibel difference measures the relative loudness difference between the original speech data and the generated disturbance, and a smaller value is better, it is used directly as part of the loss function, so the overall loss function is as follows:

Loss_tar＝(L_tar1+L_tar2)/2+α·Diff

The larger α is, the less perceptible the general disturbance is, but the lower the directional target recognition success rate; conversely, the smaller α is, the higher the directional target recognition success rate of the general disturbance, but the greater the distortion.
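The decibel measure and the weighting by α can be made concrete with a short sketch. `decibel` follows the Decibel(v) definition given above; the form of `decibel_diff` is an assumption (the disturbance level minus the mean level of the N voice clips), since the formula for Diff is not reproduced in this text, and all function names are hypothetical.

```python
import math
from typing import Sequence


def decibel(v: Sequence[float]) -> float:
    """Decibel(v) = max_k 20*log10(|v_k|), i.e. the level of the loudest sample."""
    peak = max(abs(s) for s in v)
    if peak == 0.0:
        raise ValueError("decibel is undefined for an all-zero signal")
    return 20.0 * math.log10(peak)


def decibel_diff(delta: Sequence[float], voices: Sequence[Sequence[float]]) -> float:
    """ASSUMED form of Diff: Decibel(delta) minus the mean Decibel(x_i) over N clips."""
    mean_voice_db = sum(decibel(x) for x in voices) / len(voices)
    return decibel(delta) - mean_voice_db


def total_loss(l_tar1: float, l_tar2: float, diff: float, alpha: float) -> float:
    """Loss_tar = (L_tar1 + L_tar2) / 2 + alpha * Diff."""
    return (l_tar1 + l_tar2) / 2.0 + alpha * diff


# Hypothetical data: a quiet disturbance and two louder voice clips.
delta = [0.01, -0.02]
voices = [[0.5, -1.0], [0.2, 0.1]]
diff = decibel_diff(delta, voices)
for alpha in (0.001, 0.01, 0.1):
    print(alpha, total_loss(0.6, 0.6, diff, alpha))
```

Because Diff is negative whenever the disturbance is quieter than the speech, a larger α rewards quieter disturbances more strongly, which matches the trade-off described above.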

In order to generate a universal disturbance that deceives the voice command classifier, the parameters of the original model are fixed and only the initial universal disturbance parameters are updated iteratively. The gradient of the defined loss function is calculated through back propagation and the chain rule, and the general disturbance is iteratively updated with the defined learning rate lr.

Iterative training is carried out according to this process until the specified directional target recognition success rate is reached, thereby generating the general directional disturbance; adding the generated disturbance to any original voice command then achieves a high directional target recognition success rate.
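A single iteration of such an update might look like the sketch below. It assumes plain gradient descent on the disturbance followed by a projection back onto the l₂ ball of radius ε; both choices are assumptions, since the update formula is not reproduced in this text. The gradient is taken as given (in practice it would come from back propagation through the frozen classifier), and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def project_l2(delta: Sequence[float], eps: float) -> List[float]:
    """Scale delta back onto the l2 ball of radius eps if it lies outside."""
    norm = math.sqrt(sum(d * d for d in delta))
    if norm <= eps:
        return list(delta)
    return [d * eps / norm for d in delta]


def update_step(delta: Sequence[float], grad: Sequence[float],
                lr: float, eps: float) -> List[float]:
    """One descent step on the disturbance only; model weights are not touched."""
    stepped = [d - lr * g for d, g in zip(delta, grad)]
    return project_l2(stepped, eps)


# Hypothetical two-sample disturbance and gradient, for illustration.
delta = [0.0, 0.0]
for _ in range(3):
    delta = update_step(delta, grad=[1.0, -2.0], lr=0.1, eps=0.5)
print(delta)
```

The projection keeps the l₂ norm of the disturbance at or below ε after every step, which is one simple way to honor the norm constraint while the loss drives the classifier toward the target class.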

Minimizing L_tar1 greatly reduces the voice command classification accuracy; minimizing L_tar2 greatly increases the success rate of classifying voice commands into the specified class; combining the L_tar1 and L_tar2 loss functions achieves the generation of general directional disturbance.

Furthermore, in order to limit the generated disturbance, on the one hand the relative loudness, i.e. the difference between the loudness of the disturbance and the loudness of the original speech, is introduced into the loss optimization function; on the other hand, the l₂ norm of the disturbance is confined within a sphere of radius ε.

S2, initializing disturbance by adopting daily environmental noise, then obtaining a similarity function of the disturbance and the initial daily environmental noise by using a psychoacoustic principle, adding the similarity function into the loss optimization function obtained in the step S1 to obtain a new loss optimization function, obtaining general directional disturbance under the environmental noise covering through iterative training again, and adding the obtained general directional disturbance to original voice data to obtain a general directional voice countermeasure sample under the environmental noise covering.

The method randomly selects one common daily noise as the initial value of the general disturbance, so that the victim takes the generated disturbance for ordinary noise and a masking effect is achieved. However, iterative training distorts the environmental noise, and this distortion can make the disturbance perceptible to humans, so the disturbance is further constrained during training. The constraint relies on the psychoacoustic principle that a louder signal makes other signals at nearby frequencies difficult to perceive: by calculating the frequency masking threshold of the original environmental noise, the generated disturbance can be made less noticeable to humans.

First, a short-time Fourier transform is applied to the audio to obtain frequency-domain information; next, the power spectral density is calculated and normalized; finally, the normalized power spectral density of the difference between the general disturbance and the initial noise is compared with the normalized power spectral density of the initial noise to obtain a similarity function. Introducing this function into the training process makes the universal directional disturbance concealed; the method specifically comprises the following steps:

S201, performing a short-time Fourier transform on the voice signal to obtain frequency-domain information, and then calculating the corresponding power spectral density to express the frequency content numerically, wherein the specific calculation is as follows:

wherein, W is the window size,

is the normalized power spectral density.

In addition, two conditions must be met. The first is that it must be greater than the masking threshold in quiet, i.e. the frequency masking threshold within the range of human hearing (20 Hz to 16 kHz), as follows:

wherein f is the frequency. The second condition is that its amplitude must be the maximum within a range of 0.5 Bark (the range of human hearing is divided into 24 non-overlapping frequency bands, i.e. Barks).

S202, determining whether the difference between the general disturbance and the environmental noise lies within the frequency masking threshold range, specifically as follows:

where θ is the ambient noise.

That is, the average amount by which the normalized power spectral density of the difference exceeds the normalized power spectral density of the environmental noise is taken as the loss value and added to the loss function of step S1:

Loss＝Loss_tar+β·sim(δ-θ，θ)

Loss_tar＝(L_tar1+L_tar2)/2+α·Diff

Loss＝(L_tar1+L_tar2)/2+α·||δ||₂+β·sim(δ-θ，θ)

The other steps are the same as in step S1, and the universal directional disturbance under environmental noise covering is obtained through iterative training.
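A minimal sketch of the similarity term sim(δ-θ, θ) follows. The exact formulas are not reproduced in this text, so the details are assumptions: a Hann window of size W, a power spectral density in dB computed from the windowed FFT, a normalization that shifts the peak of the noise spectrum to 96 dB (a constant borrowed from common psychoacoustic models), and a loss equal to the mean positive excess of the residual's spectrum over the noise's spectrum. The names `psd_db` and `sim` and the window and hop sizes are hypothetical.

```python
import numpy as np

def psd_db(signal, window=512, hop=128):
    """Short-time Fourier transform with a Hann window, returned as PSD in dB."""
    hann = np.hanning(window)
    starts = range(0, len(signal) - window + 1, hop)
    frames = np.stack([signal[s:s + window] * hann for s in starts])
    spectrum = np.fft.rfft(frames, axis=1)
    return 10.0 * np.log10(np.abs(spectrum / window) ** 2 + 1e-20)

def sim(residual, theta, window=512, hop=128):
    """Mean amount by which the residual's normalized PSD exceeds the noise's."""
    p_theta = psd_db(theta, window, hop)
    p_res = psd_db(residual, window, hop)
    # Both spectra are normalized against the same reference (the noise peak).
    # The shared shift cancels in the difference; it is kept to mirror the steps.
    shift = 96.0 - p_theta.max()
    excess = (p_res + shift) - (p_theta + shift)
    return float(np.mean(np.maximum(excess, 0.0)))

rng = np.random.default_rng(0)
theta = rng.normal(size=4096)                 # stand-in for the daily noise
delta_same = theta.copy()                     # disturbance still equal to the noise
delta_loud = theta + 10.0 * rng.normal(size=4096)
sim_same = sim(delta_same - theta, theta)
sim_loud = sim(delta_loud - theta, theta)
print(sim_same, sim_loud)
```

With this form the term is zero at initialization, when δ equals θ, and grows only where the change made by training rises above the spectrum of the covering noise, which matches the intent of adding β·sim(δ-θ, θ) to the loss.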

Because the general disturbance obtained in step S1 or step S2 is only suitable for being added directly to the original voice data to generate a general voice countermeasure sample, its practicality is limited. The invention therefore further proposes a general directional physical disturbance: while the user is speaking, the general directional voice disturbance is played, so that the voice command classifier recognizes the command as the specified wrong class. To realize the general physical disturbance, voice signals altered by propagation in the physical world are introduced into the training process. Because voice propagating through the air is affected by several factors, including device distortion, reflection, reverberation and environmental noise, the changes caused by these factors are simulated step by step.

The method comprises the following steps:

Because every utterance in the Speech Commands data set used by the invention is sampled at 16 kHz, and according to the Nyquist theorem the information in the original signal can be completely captured only when the sampling rate is more than twice the highest frequency of the voice signal, only voice signals with frequencies below 8 kHz need to be considered.

In addition, experiments show that, when the victim device (such as a mobile phone) receives a frequency sweep signal (20 Hz to 20 kHz) through its microphone, the components below 50 Hz are attenuated most strongly. To reduce the influence of this attenuation, a band-pass filter is used to cut off the components below 50 Hz in the disturbance voice data.

δ_bf＝BF_50-8kHz(δ)

In addition, because the audio sampling rate of the data set used by the invention is 16 kHz, the Nyquist theorem means that voice signals can be completely preserved only up to 8 kHz, so components above the 8 kHz threshold in the disturbance voice data are filtered out.

In conclusion, the band-pass filter limits the frequency of the general disturbance to the range 50 Hz to 8 kHz, yielding the filtered general disturbance δ_bf and addressing the problem of device distortion.
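The filtering step δ_bf = BF_50-8kHz(δ) can be sketched as follows. The filter design is not specified in the text, so this sketch assumes a brick-wall filter applied in the FFT domain; the function name `bandpass_fft` is hypothetical. Note that for a signal already sampled at 16 kHz, 8 kHz is the Nyquist frequency, so the upper edge removes nothing further and the practical effect is the removal of content below 50 Hz.

```python
import numpy as np

def bandpass_fft(signal, fs=16000, low_hz=50.0, high_hz=8000.0):
    """Zero every FFT bin outside [low_hz, high_hz] and transform back."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    keep = (freqs >= low_hz) & (freqs <= high_hz)
    return np.fft.irfft(spectrum * keep, n=len(signal))

fs = 16000
t = np.arange(fs) / fs                       # one second of audio
rumble = np.sin(2 * np.pi * 20 * t)          # 20 Hz: below the pass band
tone = np.sin(2 * np.pi * 1000 * t)          # 1 kHz: inside the pass band
delta_bf = bandpass_fft(rumble + tone, fs)
print(np.max(np.abs(delta_bf - tone)))       # residual after removing the rumble
```

A brick-wall FFT filter is convenient inside a training loop because it is a linear operation on δ; a conventional IIR or FIR band-pass design would be an equally valid choice.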

In a real scenario, nothing is known about the room in which the smart device based on the voice command classifier is located. Even with the same disturbance, directional target recognition may therefore succeed in one room and fail in another. So that the generated universal disturbance can adapt to different rooms, the invention simulates the general disturbance with a room impulse response, whose main parameters are set as follows: the length, width and height of the room (X, Y, Z), the position of the microphone (x_m, y_m, z_m), the position of the loudspeaker (x_s, y_s, z_s), and the reverberation time T₆₀ (i.e. the time required for sound propagating in air to attenuate by 60 dB), where x_m, x_s < X; y_m, y_s < Y; z_m, z_s < Z.

The room impulse response h, obtained for a room configuration R sampled from the distribution T, is convolved with the general disturbance to obtain the voice disturbance data δ_rir after passing through room R.

Furthermore, using this particular room configuration will allow for better targeted object recognition if room configuration information is known.
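The reverberation step can be sketched as a convolution of the disturbance with a room impulse response h. How h is generated is not specified in the text. As a simplification, the sketch below does not model the room dimensions or the microphone and loudspeaker positions; it assumes a synthetic impulse response made of Gaussian noise whose amplitude decays by 60 dB over T₆₀. The sampling range for T₆₀ and the names `synthetic_rir` and `apply_room` are hypothetical.

```python
import numpy as np

def synthetic_rir(t60, fs=16000, rng=None):
    """Noise-based impulse response whose amplitude falls by 60 dB at t60 seconds."""
    rng = np.random.default_rng() if rng is None else rng
    n = int(t60 * fs)
    t = np.arange(n) / fs
    envelope = 10.0 ** (-3.0 * t / t60)      # 10**-3 in amplitude equals -60 dB
    h = rng.normal(size=n) * envelope
    h[0] = 1.0                               # direct path
    return h / np.linalg.norm(h)

def apply_room(delta, h):
    """delta_rir = delta convolved with h, truncated to the original length."""
    return np.convolve(delta, h)[: len(delta)]

rng = np.random.default_rng(0)
t60 = rng.uniform(0.2, 0.8)                  # hypothetical range of reverberation times
h = synthetic_rir(t60, rng=rng)
delta_bf = 0.01 * rng.normal(size=16000)     # stand-in for the filtered disturbance
delta_rir = apply_room(delta_bf, h)
print(len(h), delta_rir.shape)
```

Drawing a fresh impulse response at every training iteration is one way to make a single disturbance effective across rooms; when the room is known, its measured impulse response can be used in place of the sampled one.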

The addition of white gaussian noise to the disturbance is specifically as follows:

δ_an＝δ_rir+Noise

Similarly, if the ambient noise in the room where the voice command classifier device is located can be collected, using that noise achieves a higher directional target recognition success rate. This completes the simulation of air propagation; introducing the disturbance processed in this way into the training process yields the universal directional physical disturbance.
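The noise step δ_an = δ_rir + Noise can be sketched as below. The noise level is not stated in the text, so the sketch assumes it is set through a target signal-to-noise ratio; the parameter `snr_db` and the function name `add_white_noise` are hypothetical.

```python
import numpy as np

def add_white_noise(delta_rir, snr_db, rng=None):
    """Add zero-mean white Gaussian noise scaled to the requested SNR in dB."""
    rng = np.random.default_rng() if rng is None else rng
    signal_power = np.mean(delta_rir ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(scale=np.sqrt(noise_power), size=delta_rir.shape)
    return delta_rir + noise

fs = 16000
t = np.arange(fs) / fs
delta_rir = 0.01 * np.sin(2 * np.pi * 440 * t)   # stand-in for the reverberated disturbance
delta_an = add_white_noise(delta_rir, snr_db=20.0, rng=np.random.default_rng(0))
noise = delta_an - delta_rir
measured_snr = 10.0 * np.log10(np.mean(delta_rir ** 2) / np.mean(noise ** 2))
print(f"measured SNR: {measured_snr:.2f} dB")
```

If noise recorded in the target room is available, it can replace the Gaussian samples while keeping the same scaling.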

After the three steps, the obtained universal voice disturbance is used as an initial value of the disturbance, and training is carried out according to a designed loss function to obtain the universal directional physical disturbance.

The invention has the following characteristics:

universality: only one disturbance is trained and is added to any original voice data, and the disturbance can be recognized as a specified wrong voice category by a voice recognition system;

concealment: environmental noise is used as covering, and psychoacoustic principles are used for limitation, so that disturbing voice is less prone to being perceived by people;

physical properties: distortion problems, reverberation problems and noise problems caused by real world voice transmission are introduced into a training process of voice general disturbance by means of a band-pass filter, room impulse response and noise addition, so that the disturbance can be still feasible in the physical world.

In another embodiment of the present invention, a system for generating a generic directional speech countermeasure sample is provided, which can be used to implement the method for generating a generic directional speech countermeasure sample described above.

The updating module obtains a loss optimization function for generating general disturbance by optimizing the confidence coefficient vector output by the voice command classifier model; performing iterative update on the initial disturbance by utilizing back propagation according to a loss optimization function to obtain general directional disturbance, and adding the general directional disturbance to any original voice data to obtain a general directional voice countermeasure sample;

the training module is used for initializing disturbance by adopting daily environmental noise, then obtaining a similarity function of the disturbance and the initial daily environmental noise by utilizing a psychoacoustic principle, adding the similarity function into the loss optimization function obtained by the updating module to obtain a new loss optimization function, obtaining general directional disturbance under the environmental noise covering through iterative training again, and adding the obtained general directional disturbance to the original voice data to obtain a general directional voice countermeasure sample under the environmental noise covering;

the generating module is used for filtering low-frequency and high-frequency parts in the initial daily environmental noise in the training module by using a band-pass filter to obtain distortion-free disturbance; then, simulating by using room impulse response to the distortion-free disturbance to obtain reverberation disturbance after reverberation and reflection under different room configurations; and introducing Gaussian white noise into the reverberation disturbance audio frequency for simulating background noise in the physical world to obtain physical disturbance, taking the physical disturbance as an initial value of the disturbance, carrying out iterative training on the physical disturbance again based on a loss optimization function of a training module to obtain general directional physical disturbance under the covering of environmental noise, and adding the general directional physical disturbance into original data to generate a general directional physical voice countermeasure sample.

In yet another embodiment of the present invention, a terminal device is provided that includes a processor and a memory, the memory being used to store a computer program comprising program instructions and the processor being configured to execute the program instructions stored in the computer storage medium. The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. It is the computing core and control core of the terminal and is adapted to load and execute one or more instructions to implement a corresponding method flow or function. The processor of the embodiment of the invention can be used to carry out the method for generating a universal directional speech confrontation sample, which comprises the following steps:

obtaining a loss optimization function for generating general disturbance by optimizing the confidence coefficient vector output by the voice command classifier model; performing iterative update on the initial disturbance by utilizing back propagation according to a loss optimization function to obtain general directional disturbance, and adding the general directional disturbance to any original voice data to obtain a general directional voice countermeasure sample; initializing disturbance by adopting daily environmental noise, then obtaining a similarity function of the disturbance and the initial daily environmental noise by utilizing a psychoacoustic principle, adding the similarity function into a loss optimization function to obtain a new loss optimization function, obtaining general directional disturbance under the environmental noise coverage by carrying out iterative training again, and adding the obtained general directional disturbance to original voice data to obtain a general directional voice antagonistic sample under the environmental noise coverage; filtering low-frequency and high-frequency parts in the initial daily environmental noise by using a band-pass filter to obtain distortion-free disturbance; then, simulating by using room impulse response to the distortion-free disturbance to obtain reverberation disturbance after reverberation and reflection under different room configurations; and introducing Gaussian white noise into the reverberation disturbance audio frequency for simulating background noise in the physical world to obtain physical disturbance, taking the physical disturbance as an initial value of the disturbance, carrying out iterative training on the physical disturbance again based on a loss optimization function to obtain general directional physical disturbance under the condition of covering environmental noise, and adding the general directional physical disturbance into original data to generate a general directional physical voice countermeasure sample.

In still another embodiment of the present invention, the present invention further provides a storage medium, specifically a computer-readable storage medium (Memory), which is a Memory device in a terminal device and is used for storing programs and data. It is understood that the computer readable storage medium herein may include a built-in storage medium in the terminal device, and may also include an extended storage medium supported by the terminal device. The computer-readable storage medium provides a storage space storing an operating system of the terminal. Also, one or more instructions, which may be one or more computer programs (including program code), are stored in the memory space and are adapted to be loaded and executed by the processor. It should be noted that the computer-readable storage medium may be a high-speed RAM Memory, or may be a Non-Volatile Memory (Non-Volatile Memory), such as at least one disk Memory.

One or more instructions stored in the computer-readable storage medium may be loaded and executed by the processor to implement the corresponding steps of the method for generating a universal directional speech countermeasure sample in the above embodiments; one or more instructions in the computer-readable storage medium are loaded by the processor and perform the steps of:

obtaining a loss optimization function for generating general disturbance by optimizing the confidence coefficient vector output by the voice command classifier model; performing iterative update on the initial disturbance by using back propagation according to a loss optimization function to obtain a universal directional disturbance, and adding the universal directional disturbance to any original voice data to obtain a universal directional voice countermeasure sample; initializing disturbance by adopting daily environmental noise, then obtaining a similarity function of the disturbance and the initial daily environmental noise by utilizing a psychoacoustic principle, adding the similarity function into a loss optimization function to obtain a new loss optimization function, obtaining general directional disturbance under the environmental noise coverage by carrying out iterative training again, and adding the obtained general directional disturbance to original voice data to obtain a general directional voice antagonistic sample under the environmental noise coverage; filtering low-frequency and high-frequency parts in the initial daily environmental noise by using a band-pass filter to obtain distortion-free disturbance; then, simulating by using room impulse response to the distortion-free disturbance to obtain reverberation disturbance after reverberation and reflection under different room configurations; and introducing Gaussian white noise into the reverberation disturbance audio frequency for simulating background noise in the physical world to obtain physical disturbance, taking the physical disturbance as an initial value of the disturbance, carrying out iterative training on the physical disturbance again based on a loss optimization function to obtain general directional physical disturbance under the condition of covering environmental noise, and adding the general directional physical disturbance into original data to generate a general directional physical voice countermeasure sample.

Fig. 6 is a schematic diagram of a computer device provided by an embodiment of the present invention. The computer device 60 of this embodiment includes a processor 61, a memory 62, and a computer program 63 stored in the memory 62 and capable of running on the processor 61. When executed by the processor 61, the computer program 63 implements the method for generating a universal directional speech confrontation sample of the embodiments; to avoid repetition, the details are not repeated here. Alternatively, when executed by the processor 61, the computer program 63 implements the functions of each module/unit of the system for generating a universal directional speech confrontation sample of the embodiments, which are likewise not repeated here.

The computer device 60 may be a desktop computer, a notebook, a palmtop computer, a cloud server, or another computing device. The computer device 60 may include, but is not limited to, a processor 61 and a memory 62. Those skilled in the art will appreciate that fig. 6 is merely an example of the computer device 60 and does not limit it; the computer device 60 may include more or fewer components than shown, may combine some of the components, or may use different components. For example, the computer device may also include input/output devices, network access devices, buses, etc.

The Processor 61 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.

The memory 62 may be an internal storage unit of the computer device 60, such as a hard disk or an internal memory of the computer device 60. The memory 62 may also be an external storage device of the computer device 60, such as a plug-in hard disk provided on the computer device 60, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like.

Further, memory 62 may also include both internal and external storage devices for computer device 60. The memory 62 is used for storing computer programs and other programs and data required by the computer device. The memory 62 may also be used to temporarily store data that has been output or is to be output.

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 2, all the data obtained in step S101 were tested; the resulting confusion matrix is shown in fig. 2, and the overall recognition accuracy is 95.66%.

Referring to fig. 3, which shows the recognition accuracy confusion matrix: with "down" as the target class, adding the single generated disturbance to the original voice command data achieves a directional target recognition success rate of 7961/8000 (99.51%).

Referring to fig. 4, which shows the waveforms of the general disturbances under environmental noise masking generated with each of the 8 voice commands as the target class, the waveforms are all very similar, i.e. the general disturbance is successfully hidden under the environmental noise. The waveforms differ only in subtle ways, and it is by superimposing these differences on the raw data that the 8 countermeasure samples are generated, each of which the model identifies as the corresponding directional target class.

Referring to fig. 5, which shows the recognition accuracy and the directional target recognition success rate for the 8 general disturbances, the directional target recognition success rate exceeds 95% while the recognition accuracy stays at about 12%, so the general disturbances are successfully realized.

With the method for generating a universal directional voice countermeasure sample, a corresponding universal countermeasure sample can be generated for any model based on a convolutional neural network. In particular, the average signal-to-noise ratio of the general voice disturbance under environmental noise masking reaches nearly 15 dB, which is superior to related work. In addition, the general directional physical disturbance remains applicable in the real world: the average directional target recognition success rate reaches 93%, and the signal-to-noise ratio is still as high as 13.8 dB. The method and the device can therefore generate universal directional voice countermeasure samples with small disturbance and a high directional target recognition success rate.

In summary, the invention provides a method, a system, a medium and a device for generating a universal directional voice countermeasure sample. They address the fact that the prior art does not cover generating voice countermeasure samples based on universal directional physical disturbance under environmental noise masking, and they successfully cause a voice command classifier based on a convolutional neural network to misidentify voice command data as a specified target class. Such samples can pose security threats to voice assistants, voice interaction applications and the like, and can directly and seriously affect the safety of users' lives and property, especially in fields with high safety requirements that rely on automatic voice recognition (such as automatic driving and smart homes).

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The above-mentioned contents are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modification made on the basis of the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims

1. A method for generating a universal directional speech confrontation sample is characterized by comprising the following steps:

s2, initializing the disturbance by adopting the daily environmental noise, then obtaining a similarity function of the disturbance and the initial daily environmental noise by using a psychoacoustic principle, adding the similarity function into the loss optimization function obtained in the step S1 to obtain a new loss optimization function, obtaining general directional disturbance under the environmental noise covering through iterative training again, and adding the obtained general directional disturbance to the original voice data to obtain a general directional voice countermeasure sample under the environmental noise covering;

2. The method for generating generic directional speech confrontation samples according to claim 1, wherein the step S1 is specifically as follows:

3. The method for generating universal directional speech confrontation samples according to claim 2, wherein in step S102, the loss optimization function Loss_tar is as follows:

Loss_tar＝(L_tar1+L_tar2)/2+α·Diff

4. The method of claim 3, wherein the loss L_tar1 that minimizes the confidence of the original correct class, the loss L_tar2 that maximizes the confidence of the target class, and the Diff loss optimization function are calculated as follows:

L_tar1＝max(F_r(x+δ)-max(F_i(x+δ))，0)，i≠r

L_tar2＝max(max(F_j(x+δ))-F_t(x+δ)，-G)，j≠t

where x is the original speech data, δ is the general disturbance, F_m(·) is the confidence function with which the model identifies the input as class m, r is the correctly classified class, t is the directional target class, G is a hyper-parameter, N is the total number of voice data, Decibel(δ) is the decibel value corresponding to the general disturbance, and Decibel(x_i) is the decibel value corresponding to the i-th voice data.

5. The method for generating universal directional speech countermeasure samples according to claim 1, wherein in step S2, the power spectral density of the disturbance and the daily noise is calculated to obtain the corresponding similarity function specifically as:

Finally, the normalized power spectral density of the difference between the general disturbance δ and the daily noise θ and the normalized power spectral density of the daily noise θ are calculated respectively, the similarity loss function is then obtained, and the universal directional disturbance under environmental noise concealment is obtained by iteratively training the disturbance.

6. The method of generating generic directional speech confrontation samples according to claim 5, wherein the similarity loss function sim (δ - θ, θ) is as follows:

wherein W is the Hanning window size,

normalized power spectral density for everyday noise.

7. The method for generating generic directional speech confrontation samples according to claim 1, wherein the step S3 is specifically as follows:

s301, filtering out distortion signals lower than 50Hz in general disturbance voice by adopting a band-pass filter, and filtering out threshold signals higher than 8kHz in disturbance voice data;

8. A universal directional speech confrontation sample generation system, comprising:

9. A computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform any of the methods of claims 1-7.

10. A computing device, comprising:

one or more processors, memory, and one or more programs stored in the memory and configured for execution by the one or more processors, the one or more programs including instructions for performing any of the methods of claims 1-7.