CN114373467A - Adversarial audio sample generation method based on a three-population parallel genetic algorithm

Info

Publication number: CN114373467A
Application number: CN202210026272.3A
Authority: CN
Inventors: 徐东亮; 翟文升; 马骁; 刘志伟; 徐舜; 杨承林
Original assignee: Shandong University
Current assignee: Shandong University
Priority date: 2022-01-11
Filing date: 2022-01-11
Publication date: 2022-04-19

Abstract

The invention discloses a method for generating adversarial audio samples based on a three-population parallel genetic algorithm, which comprises the following steps: A: obtaining an original audio sample; B: obtaining the corresponding input samples, a main population and two auxiliary populations; C: calculating the fitness score of each individual; D: sorting all individuals in the main population and in each auxiliary population by fitness score from low to high; E: using a speech recognition model to classify the sorted individuals in the main population one by one; F: performing crossover, mutation and individual update on the main population and the auxiliary populations using the three-population parallel genetic algorithm, and then returning to step D. The method can obtain an optimal solution meeting the requirements through multiple iterations, solves the problems of the target network being unknown to the attacker and of errors introduced by Mel-frequency cepstral coefficient conversion, and has the advantages of high convergence speed, strong global search capability and high convergence efficiency.

Description

Adversarial audio sample generation method based on a three-population parallel genetic algorithm

Technical Field

The invention relates to the field of speech recognition, and in particular to a method for generating adversarial audio samples based on a three-population parallel genetic algorithm.

Background

With the success of deep learning models in speech recognition, automatic speech recognition and control systems, such as Amazon Alexa, the Google voice assistant, Apple Siri, Microsoft Cortana, iFLYTEK and other commercial products, are widely used in human-computer interaction and have succeeded in many fields such as mobile devices and smart homes. In particular, they play a key role in application scenarios with higher security requirements, such as autonomous driving and voiceprint identity authentication.

Recent studies have shown that neural networks are vulnerable to adversarial attacks. The same problem exists in speech recognition: an attacker adds a slight perturbation to the audio, which causes the neural network to output a markedly different result even though the human ear cannot perceive the perturbation. Training a model on generated adversarial samples can effectively improve the neural network's ability to resist malicious attacks. However, on the one hand, adversarial samples are more difficult to design for speech than for computer vision; on the other hand, research on adversarial samples in the speech field is still limited. At present, adversarial attack algorithms in the speech field are mostly designed on the basis of the optimization-based C&W attack algorithm. This approach usually requires large computing resources and time, which severely restricts the practicality of current speech adversarial attack algorithms.

Disclosure of Invention

The invention aims to provide a method for generating adversarial audio samples based on a three-population parallel genetic algorithm, which can obtain an optimal solution meeting the requirements through multiple iterations, solves the problems of the target network being unknown to the attacker and of errors introduced by Mel-frequency cepstral coefficient (MFCC) conversion, and has the advantages of high convergence speed, strong global search capability and high convergence efficiency.

The invention adopts the following technical scheme:

A method for generating adversarial audio samples based on a three-population parallel genetic algorithm comprises the following steps:

A: initializing each original voice file in the voice data set into an original audio sample in the form of a binary string; then proceeding to step B;

B: selecting an original audio sample, and adding Gaussian noise to the least significant bits of a random subset of the original audio sample, repeated N times, to obtain N corresponding generated audio samples, namely input samples; obtaining in this way a main population and two auxiliary populations for each original audio sample, each population consisting of N generated audio samples; then proceeding to step C;

C: taking each input sample as an individual and calculating the fitness score of each individual; then proceeding to step D; the fitness score of an individual is the Euclidean distance between the generated audio sample and the original audio sample from which it was derived;

D: sorting all individuals in the main population and in each auxiliary population by fitness score from low to high; then proceeding to step E;

E: classifying the sorted individuals in the main population one by one using a speech recognition model, wherein, during classification:

if a successfully attacking individual appears in the main population, classification stops and that individual is output directly as the final adversarial audio sample;

if no successfully attacking individual appears in the main population, it is determined whether the set number of iterations has been reached:

if the number of iterations has been reached, the process exits and reports that generation of the adversarial audio has failed;

if the number of iterations has not been reached, the method proceeds to step F;

F: dividing the individuals in the main population and the two auxiliary populations into elite individuals and non-elite individuals according to a set retention probability, and, after setting different genetic operation parameters for each population, performing genetic operations comprising crossover and mutation on the main population and the two auxiliary populations, wherein the elite individuals do not undergo the genetic operations; obtaining a main population and two auxiliary populations that have completed the genetic operations, together with the offspring individuals of each population, wherein the offspring individuals consist of elite individuals that have not undergone crossover and mutation and non-elite individuals that have; then calculating the fitness score of each offspring individual and sorting all offspring individuals in each population by fitness score from low to high; then selecting a number of optimal individuals from the two auxiliary populations according to a set optimal-individual selection threshold, adding them to the main population, and removing from the main population the same number of offspring individuals with the highest fitness scores, to obtain a main population that has completed the genetic operations and the individual update; and then returning to step D.

In step A, the speech recognition model is the official TensorFlow speech_commands model; the model is trained to recognize 10 groups of label-classified voice files, the label of each group being the corresponding English word, and each group comprising voice files in which different speakers pronounce that word.

In step B, the main population and the two auxiliary populations contain the same individuals and the same number of individuals, and the number of individuals in a single population is set to 20-40.

In step B, when Gaussian noise is added, each bit of the original audio sample in binary string form is traversed, and the currently traversed bit is flipped with a set conversion probability, from 1 to 0 or from 0 to 1.

Step F comprises the following specific steps:

F1: sorting the current-generation individuals in the main population and in each of the two auxiliary populations using elitism, and, according to a set retention probability, taking the first P individuals as elite individuals and the remaining N-P individuals as non-elite individuals; then proceeding to step F2;

F2: performing crossover on the main population and the two auxiliary populations of current-generation individuals using uniform crossover, wherein the elite individuals in the main population and the two auxiliary populations do not take part in the crossover; then proceeding to step F3;

F3: performing mutation on the main population and the two auxiliary populations that have completed crossover, wherein the elite individuals in the main population and the two auxiliary populations do not take part in the mutation; then proceeding to step F4;

F4: calculating the fitness score of each offspring individual in the main population and the two auxiliary populations that have completed the genetic operations, using the method in step D; then sorting all offspring individuals in the main population and in each of the two auxiliary populations by fitness score from low to high; then, according to a set optimal-individual selection threshold, taking the first L individuals in the ranking of each auxiliary population as optimal individuals and adding the resulting 2L optimal individuals to the main population that has completed the genetic operations; sorting all offspring individuals in the main population to which the 2L optimal individuals have been added by fitness score from low to high, and removing the 2L individuals ranked last, that is, those with the highest fitness scores; finally, taking the main population from which these 2L individuals have been removed as the main population that has completed the genetic operations and the individual update; and then returning to step D.

In step F, crossover is performed using uniform crossover; the genetic operations use a three-population parallel genetic algorithm; the main population and the two auxiliary populations are mutually independent, different populations perform different genetic operations, and no information is exchanged or transmitted among the three populations; the genetic operations are performed after different genetic operation parameters have been set for the main population and the two auxiliary populations.

In step F, the first auxiliary population is given a small mutation probability and a large crossover probability, that is, its mutation probability is the smallest of the three populations and its crossover probability is the largest of the three populations; the second auxiliary population is given a small crossover probability and a large mutation probability, that is, its crossover probability is the smallest of the three populations and its mutation probability is the largest of the three populations.

In step F2, when crossover is performed on a population, two current-generation individuals A and B are randomly selected from the non-elite individuals of the population as parents; each element of A and B is then traversed, and the elements at corresponding positions of A and B are exchanged according to the set crossover probability threshold of the population; this completes the crossover of the population and yields the current-generation non-elite individuals produced by crossover.

In step F3, when mutation is performed on a population, each element of the current-generation non-elite individuals produced by crossover is traversed, and some of these elements are mutated according to the set mutation probability threshold of the population; this completes the mutation of the population and yields the populations that have completed the genetic operations and the offspring individuals of each population; the offspring individuals in the main population and in the two auxiliary populations that have completed the genetic operations consist of elite individuals that have not undergone crossover and mutation and non-elite individuals that have.

The crossover probability threshold of the main population, M_C, satisfies 40% ≤ M_C ≤ 60%; the crossover probability threshold of the first auxiliary population, A_1C, satisfies 60% < A_1C ≤ 90%; the crossover probability threshold of the second auxiliary population, A_2C, satisfies 10% ≤ A_2C < 40%; here M stands for main, A stands for auxiliary, the subscripts 1 and 2 denote the first and second auxiliary populations, and the subscript C stands for crossover;

the mutation probability threshold of the main population, M_m, satisfies 0.0004 ≤ M_m ≤ 0.0006; the mutation probability threshold of the first auxiliary population, A_1m, satisfies 0.0001 ≤ A_1m < 0.0004; the mutation probability threshold of the second auxiliary population, A_2m, satisfies 0.0006 < A_2m ≤ 0.0009; here M stands for main, A stands for auxiliary, the subscripts 1 and 2 denote the first and second auxiliary populations, and the subscript m stands for mutation.

The method is based on a three-population parallel genetic algorithm. It can obtain an optimal solution meeting the requirements through multiple iterations, solves the problems of the target network being unknown to the attacker and of errors introduced by Mel-frequency cepstral coefficient (MFCC) conversion, has the advantages of high convergence speed, strong global search capability and high convergence efficiency, and reduces the computation and time required to generate adversarial samples. By running three populations in parallel, the invention addresses the problem that a traditional genetic algorithm, having only one population, struggles to maintain population diversity and therefore easily falls into a local optimum and fails to reach the global optimum.

Drawings

FIG. 1 is a schematic flow chart of the present invention.

Detailed Description

The invention is described in detail below with reference to the drawing and an embodiment:

As shown in FIG. 1, the method for generating adversarial audio samples based on a three-population parallel genetic algorithm of the present invention comprises the following steps:

In the present invention, the speech recognition model is the official TensorFlow speech_commands model. The model is trained to recognize 10 groups of label-classified voice files, wherein the label of each group is the corresponding English word, such as go or stop, and each group comprises voice files in which different speakers pronounce that word. In this embodiment, each group contains no fewer than 1700 voice files.

In this embodiment, one original audio sample obtained in step A is selected, and Gaussian noise is added to the least significant bits of a random subset of the original audio sample to obtain a generated audio sample corresponding to the original audio sample, that is, an input sample; Gaussian noise is then added to the original audio sample in the same way, N times in total, to obtain N generated audio samples corresponding to the original audio sample; a main population and two auxiliary populations are each generated from the population composed of the N generated audio samples, wherein the main population and the two auxiliary populations contain the same individuals and the same number of individuals, and the number of individuals in a single population is set to 20-40; in this way, a main population and two auxiliary populations are finally obtained for each original audio sample.

When Gaussian noise is added, each bit of the original audio sample in binary string form is traversed, and the currently traversed bit is flipped with a set conversion probability, from 1 to 0 or from 0 to 1; this finally yields a generated audio sample corresponding to the original audio sample. In this embodiment, the set conversion probability is 0.0001.
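The initialization described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the names flip_bits and init_population are hypothetical, an individual is modelled as a Python list of 0/1 values, the original audio is replaced by random stand-in bits, and the three populations are assumed to start as identical copies, which is one reading of "the same individuals and the same number of individuals".

```python
import random

def flip_bits(bits, flip_prob, rng):
    """Return a copy of `bits` in which each bit is flipped with probability `flip_prob`."""
    return [b ^ 1 if rng.random() < flip_prob else b for b in bits]

def init_population(original_bits, size, flip_prob, rng):
    """Build `size` generated samples by perturbing the same original sample."""
    return [flip_bits(original_bits, flip_prob, rng) for _ in range(size)]

rng = random.Random(0)
# Stand-in for an original audio sample in binary string form (assumption).
original = [rng.randint(0, 1) for _ in range(16000)]

# N = 30 lies in the stated 20-40 range; 0.0001 is the embodiment's conversion probability.
main_pop = init_population(original, 30, 0.0001, rng)
# Assumed reading: the auxiliary populations start as copies of the main population.
aux_pop_1 = [ind[:] for ind in main_pop]
aux_pop_2 = [ind[:] for ind in main_pop]
```

Copying each individual, rather than sharing list objects, keeps later genetic operations on one population from silently modifying another.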

C: taking each input sample as an individual and calculating the fitness score of each individual; then proceeding to step D;

In the invention, the Euclidean distance between the original audio sample and the generated audio sample is used as the fitness score of the individual. The larger the Euclidean distance between the original audio sample and the generated audio sample, the higher the fitness score of the individual and the lower the acoustic similarity between the two.

Suppose the main population and each auxiliary population contain N individuals; the lower an individual's fitness score, the higher its acoustic similarity to the original and the higher its rank.
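The fitness score and the ranking can be sketched as below. The text does not state whether the Euclidean distance is taken over the raw bits or over decoded sample values; this sketch assumes the bit vectors are compared directly, and the names fitness and rank_population are hypothetical.

```python
import math

def fitness(original_bits, candidate_bits):
    """Euclidean distance between two equal-length bit vectors (lower means more similar)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(original_bits, candidate_bits)))

def rank_population(original_bits, population):
    """Sort individuals by fitness score from low to high, as in step D."""
    return sorted(population, key=lambda ind: fitness(original_bits, ind))

original = [0, 0, 0, 0]
population = [[1, 1, 1, 1], [0, 0, 0, 0], [1, 0, 0, 0]]
ranked = rank_population(original, population)
```

For 0/1 vectors this distance is the square root of the number of differing bits, so the ranking is the same as a ranking by Hamming distance.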

if the number of iterations has not been reached, the method proceeds to step F;

In the invention, there are two attack modes, the untargeted attack and the targeted attack:

For an untargeted attack, the attack on an individual succeeds when the speech recognition model recognizes the generated audio sample corresponding to that individual as any label other than the original one.

For a targeted attack, the attack on an individual succeeds when the speech recognition model recognizes the generated audio sample corresponding to that individual as another, specified label.
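The two success criteria, and the in-order scan of the ranked main population in step E, can be sketched as follows. The classify argument is a hypothetical callable standing in for the speech recognition model; its interface (bits in, label out) is an assumption, as are the function names.

```python
def attack_succeeded(predicted_label, original_label, target_label=None):
    """Untargeted: any label other than the original. Targeted: exactly the specified label."""
    if target_label is None:
        return predicted_label != original_label
    return predicted_label == target_label

def first_successful(ranked_population, classify, original_label, target_label=None):
    """Scan individuals in rank order and return the first one that fools the model, else None."""
    for individual in ranked_population:
        if attack_succeeded(classify(individual), original_label, target_label):
            return individual
    return None

# Toy stand-in classifier (assumption): says "stop" once two or more bits are set.
def toy_classify(bits):
    return "stop" if sum(bits) >= 2 else "go"

ranked = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]]
winner = first_successful(ranked, toy_classify, "go")
```

Because the population is sorted by fitness from low to high, the first successful individual found is also the successful individual closest to the original audio.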

In the invention, crossover takes two current-generation individuals A and B in binary string form as parents and exchanges elements at corresponding positions of A and B to generate offspring individuals. Depending on how the exchange is carried out, crossover is classified as single-point crossover, multi-point crossover, uniform crossover and so on. The invention uses uniform crossover, in which each pair of corresponding bits of the two parents is exchanged with a certain probability. Mutation changes some of the elements of a current-generation individual in binary string form, after crossover has been performed, from 1 to 0 or from 0 to 1.
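A minimal sketch of the two operators on bit lists is given below. The function names and the use of Python's random.Random are illustrative assumptions; only the per-bit exchange and per-bit flip come from the text.

```python
import random

def uniform_crossover(parent_a, parent_b, swap_prob, rng):
    """Exchange each pair of corresponding bits with probability `swap_prob`."""
    child_a, child_b = parent_a[:], parent_b[:]
    for i in range(len(child_a)):
        if rng.random() < swap_prob:
            child_a[i], child_b[i] = child_b[i], child_a[i]
    return child_a, child_b

def mutate(bits, mutation_prob, rng):
    """Flip each bit independently with probability `mutation_prob`."""
    return [b ^ 1 if rng.random() < mutation_prob else b for b in bits]

rng = random.Random(0)
a = [0] * 8
b = [1] * 8
child_a, child_b = uniform_crossover(a, b, 0.5, rng)
mutated = mutate(child_a, 0.0005, rng)
```

Both functions return new lists and leave the parents untouched, so elite individuals passed alongside them cannot be modified by accident.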

A traditional genetic algorithm has only one population, and all genetic operators act on the individuals of that one population; population diversity is therefore difficult to maintain, and the search easily becomes trapped in a local optimum and fails to reach the global optimum. To solve this problem, the invention adopts a three-population parallel genetic algorithm: the main population and the two auxiliary populations are mutually independent, different populations perform different genetic operations, and no information is exchanged or transmitted among the three populations. The genetic operations are performed after different genetic operation parameters have been set for the main population and the two auxiliary populations.

In the invention, the first auxiliary population emphasizes global search capability and is given a small mutation probability and a large crossover probability, that is, its mutation probability is the smallest of the three populations and its crossover probability is the largest of the three populations; the second auxiliary population emphasizes local search capability and is given a small crossover probability and a large mutation probability, that is, its crossover probability is the smallest of the three populations and its mutation probability is the largest of the three populations.

In the invention, step F comprises the following specific steps:

F1: sorting the current-generation individuals in the main population and in each of the two auxiliary populations using elitism, and, according to a set retention probability, taking the first P individuals as elite individuals and the remaining N-P individuals as non-elite individuals, wherein P ranges from 2 to 4 and the set retention probability is 90%.

Elitism is an optimization of the basic genetic algorithm. Because the crossover and mutation operators act randomly, evolution may proceed in a good direction or in a bad one, so the population may lose its best individuals during evolution, which degrades its fitness. To prevent the best solutions found during evolution from being destroyed by crossover and mutation, the best solutions of each generation are copied unchanged into the next generation. The invention therefore uses elitism: with high probability it retains the several best elite individuals of the current generation, performs no operation on them, and adds them directly to the offspring population.
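The elite/non-elite split can be sketched as follows. The text does not say how the 90% retention probability is applied, so this sketch leaves it out and simply copies the first P ranked individuals unchanged; next_generation and the breed callable are hypothetical names, with breed standing in for the crossover and mutation of the non-elite individuals.

```python
def next_generation(ranked, elite_count, breed):
    """Copy the first `elite_count` individuals unchanged and breed only the rest.

    `ranked` must already be sorted by fitness score from low to high.
    `breed` takes the list of non-elite individuals and returns their offspring.
    """
    elite = [ind[:] for ind in ranked[:elite_count]]
    offspring = breed(ranked[elite_count:])
    return elite + offspring

# Toy breeding step (assumption): flip every bit of every non-elite individual.
def flip_all(pool):
    return [[b ^ 1 for b in ind] for ind in pool]

ranked = [[0, 0], [0, 1], [1, 0], [1, 1]]
new_population = next_generation(ranked, 2, flip_all)
```

Since the elite individuals pass through untouched, the best fitness score in a population cannot get worse from one generation to the next.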

When crossover is performed on a population, two current-generation individuals A and B are randomly selected from the non-elite individuals of the population as parents; each element of A and B is then traversed, and the elements at corresponding positions of A and B are exchanged according to the set crossover probability threshold of the population; this completes the crossover of the population and yields the current-generation non-elite individuals produced by crossover;

wherein the crossover probability threshold of the main population, M_C, satisfies 40% ≤ M_C ≤ 60%; the crossover probability threshold of the first auxiliary population, A_1C, satisfies 60% < A_1C ≤ 90%; the crossover probability threshold of the second auxiliary population, A_2C, satisfies 10% ≤ A_2C < 40%;

In this embodiment, the crossover probability thresholds M_C, A_1C and A_2C are set to 50%, 70% and 30%, respectively. Here M stands for main, A stands for auxiliary, the subscripts 1 and 2 denote the first and second auxiliary populations, and the subscript C stands for crossover;

When mutation is performed on a population, each element of the current-generation non-elite individuals produced by crossover is traversed, and some of these elements are mutated according to the set mutation probability threshold of the population; this completes the mutation of the population and yields the populations that have completed the genetic operations and the offspring individuals of each population; the offspring individuals in the main population and in the two auxiliary populations that have completed the genetic operations consist of elite individuals that have not undergone crossover and mutation and non-elite individuals that have;

wherein the mutation probability threshold of the main population, M_m, satisfies 0.0004 ≤ M_m ≤ 0.0006; during mutation, each element of the current-generation non-elite individuals of the main population produced by crossover mutates with probability M_m. The mutation probability threshold of the first auxiliary population, A_1m, satisfies 0.0001 ≤ A_1m < 0.0004; the mutation probability threshold of the second auxiliary population, A_2m, satisfies 0.0006 < A_2m ≤ 0.0009. Here M stands for main, A stands for auxiliary, the subscripts 1 and 2 denote the first and second auxiliary populations, and the subscript m stands for mutation;

In this embodiment, the mutation probability thresholds M_m, A_1m and A_2m are set to 0.0005, 0.0001 and 0.0009, respectively.
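The parameter settings of this embodiment, and a check against the stated ranges, can be written down as a small configuration sketch. The dictionary layout and the helper name params_in_range are illustrative assumptions; the numeric values and bounds are those given in the text.

```python
# Embodiment values: crossover and mutation probability thresholds per population.
PARAMS = {
    "main":  {"crossover": 0.50, "mutation": 0.0005},
    "aux_1": {"crossover": 0.70, "mutation": 0.0001},  # large crossover, small mutation
    "aux_2": {"crossover": 0.30, "mutation": 0.0009},  # small crossover, large mutation
}

def params_in_range(params):
    """Return True when all six thresholds fall inside the ranges stated in the text."""
    main, aux_1, aux_2 = params["main"], params["aux_1"], params["aux_2"]
    return (
        0.40 <= main["crossover"] <= 0.60
        and 0.60 < aux_1["crossover"] <= 0.90
        and 0.10 <= aux_2["crossover"] < 0.40
        and 0.0004 <= main["mutation"] <= 0.0006
        and 0.0001 <= aux_1["mutation"] < 0.0004
        and 0.0006 < aux_2["mutation"] <= 0.0009
    )

embodiment_ok = params_in_range(PARAMS)
```

The ranges do not overlap, so any in-range setting automatically gives the first auxiliary population the largest crossover and smallest mutation probability, and the second the reverse.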

F4: calculating the fitness score of each offspring individual in the main population and the two auxiliary populations that have completed the genetic operations, using the method in step D; then sorting all offspring individuals in the main population and in each of the two auxiliary populations by fitness score from low to high; then, according to a set optimal-individual selection threshold, taking the first L individuals in the ranking of each auxiliary population as optimal individuals and adding the resulting 2L optimal individuals to the main population that has completed the genetic operations; sorting all offspring individuals in the main population to which the 2L optimal individuals have been added by fitness score from low to high, and removing the 2L individuals ranked last, that is, those with the highest fitness scores; finally, taking the main population from which these 2L individuals have been removed as the main population that has completed the genetic operations and the individual update; and then returning to step D, wherein L is 1 or 2.
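The individual update of step F4 can be sketched as follows, assuming bit-list individuals and a fitness computed directly on the bits; migrate and fitness are hypothetical names.

```python
import math

def fitness(original_bits, candidate_bits):
    """Euclidean distance between two equal-length bit vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(original_bits, candidate_bits)))

def migrate(original_bits, main_pop, aux_pops, top_l):
    """Add the best `top_l` individuals of each auxiliary population to the main
    population, then drop the worst ones so the main population keeps its size."""
    def key(ind):
        return fitness(original_bits, ind)

    migrants = []
    for aux in aux_pops:
        migrants.extend(ind[:] for ind in sorted(aux, key=key)[:top_l])
    merged = sorted(main_pop + migrants, key=key)
    return merged[:len(main_pop)]

original = [0, 0, 0, 0]
main_pop = [[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 0]]
aux_1 = [[0, 0, 0, 0], [1, 1, 1, 1]]
aux_2 = [[1, 0, 0, 0], [1, 1, 1, 1]]
updated_main = migrate(original, main_pop, [aux_1, aux_2], top_l=1)
```

With two auxiliary populations, slicing the merged list back to the original size removes exactly the 2L individuals with the highest fitness scores.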

After returning to step D, all individuals in the main population and in each auxiliary population that have completed the genetic operations and the individual update are sorted by fitness score from low to high; the method then proceeds to step E;

In step E, the sorted individuals in the main population are classified one by one using the speech recognition model, wherein, during classification:

if the number of iterations has not been reached, the method proceeds to step F;

Steps D to F are repeated in this way until an individual that attacks successfully is found and output as the final adversarial audio sample, or until the number of iterations has been reached without finding such an individual, in which case the process exits and reports that generation of the adversarial audio has failed.
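The loop over steps D to F can be put together in one compact sketch. It is an illustration under several assumptions rather than the patented implementation: individuals are bit lists, fitness is computed on the bits, the attack is untargeted, the three populations start as identical copies, the retention probability is not modelled, each population needs at least two non-elite individuals, and classify is a hypothetical callable standing in for the speech recognition model.

```python
import math
import random

def fitness(original, cand):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(original, cand)))

def flip(bits, prob, rng):
    return [b ^ 1 if rng.random() < prob else b for b in bits]

def evolve(ranked, cx_prob, mut_prob, elite_count, rng):
    """F1-F3: keep the elite, then apply uniform crossover and mutation to the rest."""
    elite, pool = ranked[:elite_count], ranked[elite_count:]
    children = []
    while len(children) < len(pool):
        a, b = (p[:] for p in rng.sample(pool, 2))
        for i in range(len(a)):
            if rng.random() < cx_prob:
                a[i], b[i] = b[i], a[i]
        children += [flip(a, mut_prob, rng), flip(b, mut_prob, rng)]
    return [e[:] for e in elite] + children[:len(pool)]

def generate_adversarial(original, classify, original_label, size=20, max_iters=50,
                         elite_count=2, top_l=1, init_prob=0.0001, seed=0):
    rng = random.Random(seed)

    def key(ind):
        return fitness(original, ind)

    # (crossover, mutation) thresholds for main, first auxiliary, second auxiliary.
    params = [(0.5, 0.0005), (0.7, 0.0001), (0.3, 0.0009)]
    base = [flip(original, init_prob, rng) for _ in range(size)]       # step B
    pops = [[ind[:] for ind in base] for _ in params]
    for _ in range(max_iters):
        pops = [sorted(p, key=key) for p in pops]                      # steps C and D
        for ind in pops[0]:                                            # step E
            if classify(ind) != original_label:
                return ind
        pops = [evolve(p, cx, mu, elite_count, rng)                    # steps F1-F3
                for p, (cx, mu) in zip(pops, params)]
        migrants = [ind[:] for aux in pops[1:]
                    for ind in sorted(aux, key=key)[:top_l]]
        pops[0] = sorted(pops[0] + migrants, key=key)[:size]           # step F4
    return None  # iteration budget exhausted: generation failed

original = [0] * 32
result = generate_adversarial(original, lambda bits: "stop", "go", size=6, max_iters=3)
```

The function returns the first individual that the model labels differently from the original, or None when the iteration budget runs out, mirroring the two exits described above.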

Claims

1. A method for generating adversarial audio samples based on a three-population parallel genetic algorithm, characterized by comprising the following steps:

if the number of iterations has not been reached, the method proceeds to step F;

2. The method for generating adversarial audio samples based on a three-population parallel genetic algorithm according to claim 1, characterized in that: in step A, the speech recognition model is the official TensorFlow speech_commands model; the model is trained to recognize 10 groups of label-classified voice files, the label of each group being the corresponding English word, and each group comprising voice files in which different speakers pronounce that word.

3. The method for generating adversarial audio samples based on a three-population parallel genetic algorithm according to claim 1, characterized in that: in step B, the main population and the two auxiliary populations contain the same individuals and the same number of individuals, and the number of individuals in a single population is set to 20-40.

4. The method for generating adversarial audio samples based on a three-population parallel genetic algorithm according to claim 1, characterized in that: in step B, when Gaussian noise is added, each bit of the original audio sample in binary string form is traversed, and the currently traversed bit is flipped with a set conversion probability, from 1 to 0 or from 0 to 1.

5. The method for generating adversarial audio samples based on a three-population parallel genetic algorithm according to claim 1, wherein step F comprises the following specific steps:

6. The method for generating adversarial audio samples based on a three-population parallel genetic algorithm according to claim 1, characterized in that: in step F, crossover is performed using uniform crossover; the genetic operations use a three-population parallel genetic algorithm; the main population and the two auxiliary populations are mutually independent, different populations perform different genetic operations, and no information is exchanged or transmitted among the three populations; the genetic operations are performed after different genetic operation parameters have been set for the main population and the two auxiliary populations.

7. The method for generating adversarial audio samples based on a three-population parallel genetic algorithm according to claim 6, characterized in that: in step F, the first auxiliary population is given a small mutation probability and a large crossover probability, that is, its mutation probability is the smallest of the three populations and its crossover probability is the largest of the three populations; the second auxiliary population is given a small crossover probability and a large mutation probability, that is, its crossover probability is the smallest of the three populations and its mutation probability is the largest of the three populations.

8. The method for generating adversarial audio samples based on a three-population parallel genetic algorithm according to claim 5, characterized in that: in step F2, when crossover is performed on a population, two current-generation individuals A and B are randomly selected from the non-elite individuals of the population as parents; each element of A and B is then traversed, and the elements at corresponding positions of A and B are exchanged according to the set crossover probability threshold of the population; this completes the crossover of the population and yields the current-generation non-elite individuals produced by crossover.

9. The method for generating adversarial audio samples based on a three-population parallel genetic algorithm according to claim 5, characterized in that: in step F3, when mutation is performed on a population, each element of the current-generation non-elite individuals produced by crossover is traversed, and some of these elements are mutated according to the set mutation probability threshold of the population; this completes the mutation of the population and yields the populations that have completed the genetic operations and the offspring individuals of each population; the offspring individuals in the main population and in the two auxiliary populations that have completed the genetic operations consist of elite individuals that have not undergone crossover and mutation and non-elite individuals that have.

10. The method for generating adversarial audio samples based on a three-population parallel genetic algorithm according to claim 7, characterized in that: the crossover probability threshold of the main population, M_C, satisfies 40% ≤ M_C ≤ 60%; the crossover probability threshold of the first auxiliary population, A_1C, satisfies 60% < A_1C ≤ 90%; the crossover probability threshold of the second auxiliary population, A_2C, satisfies 10% ≤ A_2C < 40%; here M stands for main, A stands for auxiliary, the subscripts 1 and 2 denote the first and second auxiliary populations, and the subscript C stands for crossover;

the mutation probability threshold of the main population, M_m, satisfies 0.0004 ≤ M_m ≤ 0.0006; the mutation probability threshold of the first auxiliary population, A_1m, satisfies 0.0001 ≤ A_1m < 0.0004; the mutation probability threshold of the second auxiliary population, A_2m, satisfies 0.0006 < A_2m ≤ 0.0009; here M stands for main, A stands for auxiliary, the subscripts 1 and 2 denote the first and second auxiliary populations, and the subscript m stands for mutation.