CN116913259B - Voice recognition countermeasure method and device combined with gradient guidance - Google Patents


Info

Publication number
CN116913259B
CN116913259B (Application CN202311154761.8A)
Authority
CN
China
Prior art keywords
loss
challenge
sample
class classification
new
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311154761.8A
Other languages
Chinese (zh)
Other versions
CN116913259A (en)
Inventor
肖韬睿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 15 Research Institute
Original Assignee
CETC 15 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 15 Research Institute
Priority to CN202311154761.8A
Publication of CN116913259A
Application granted
Publication of CN116913259B


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/08: Speech classification or search
    • G10L15/10: Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L15/16: Speech classification or search using artificial neural networks
    • G10L2015/088: Word spotting
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a speech recognition adversarial defense method and device combined with gradient guidance, wherein the method comprises the following steps: calculating a loss function comprising a connectionist temporal classification (CTC) loss and an optimal transport (OT) loss, the CTC loss being calculated in a supervised scenario and the OT loss in an unsupervised scenario; calculating the cosine distance between samples; calculating the maximum loss based on the cosine distance and the CTC loss, and iteratively reducing the value of the CTC loss; generating new adversarial samples by combining the CTC loss and the OT loss, and adversarially training the speech recognition model f with the new adversarial samples. The application can obtain stronger adversarial samples, which aids adversarial training; meanwhile, gradient guidance is used at the output of the ASR model to defend against attacks aimed at the classifier, improving the robustness of the ASR model.

Description

Voice recognition countermeasure method and device combined with gradient guidance
Technical Field
The application belongs to the technical field of speech recognition, and particularly relates to a speech recognition adversarial defense method and device combined with gradient guidance.
Background
Automatic Speech Recognition (ASR) systems are weakly resistant to adversarial attack. Most adversarial attacks, such as the Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD), are supervised attacks. Their advantage is that they can reliably generate strong adversarial samples, but they do not take the relationships between samples into account and are prone to label leakage. Other, unsupervised adversarial sample generation methods, such as Feature Scattering (FS), have long training times and cannot reliably generate strong adversarial samples, so the security of voice interaction cannot be guaranteed and the defensive capability of the speech recognition system remains low.
In view of the above problems, the present application provides a speech recognition adversarial defense method and device combined with gradient guidance.
Disclosure of Invention
To remedy the defects of the prior art, the present application provides a speech recognition adversarial defense method combined with gradient guidance, aiming at the technical problems that prior-art methods have long training times, cannot consistently generate strong adversarial samples, cannot ensure the security of voice interaction, and leave the speech recognition system with low defensive capability.
The technical effects to be achieved by the application are realized by the following scheme:
In a first aspect, embodiments of the present application provide a speech recognition adversarial defense method combined with gradient guidance, comprising:
calculating a loss function comprising a connectionist temporal classification (CTC) loss and an optimal transport (OT) loss, wherein in a supervised scenario the CTC loss is denoted L_CTC(f(x), y), where x is the original audio input provided to the speech recognition model f and y is the corresponding transcription, and in an unsupervised scenario the OT loss is denoted L_OT and given by L_OT = min_T (T·B), where T is the coupling matrix that solves the optimal transport problem and B is the transportation cost matrix;
calculating the cosine distance between samples, the cosine distance representing the distance between the prediction f(x) for a clean sample and the prediction f(x̃) for an adversarial sample x̃;
calculating the maximum loss based on the cosine distance and the CTC loss, and iteratively reducing the value of the CTC loss;
generating a new adversarial sample by combining the CTC loss and the OT loss, wherein the loss function used to generate the new adversarial sample is L_new, calculated as:
L_new = L_CTC + β·L_OT
where β is a weighting factor;
and performing countermeasure training on the voice recognition model f by using the new countermeasure sample.
In some embodiments, the cosine distance is denoted C and calculated as:
C = 1 − (f(x)·f(x̃)) / (‖f(x)‖·‖f(x̃)‖)
where f(x) denotes the prediction for the clean sample, f(x̃) the prediction for the adversarial sample, and x is the original audio input provided to the speech recognition model f.
In some embodiments, generating a new adversarial sample by combining the CTC loss and the OT loss comprises:
iterating with the following update to generate a new adversarial sample, where the new adversarial sample is a mixed adversarial sample:
x̃_{t+1} = x̃_t + ε·sign(∇_x̃ L_new(x̃_t))
where x̃ denotes the adversarial sample for the original audio input x, t denotes the iteration number, and ε is the step size.
In some embodiments, the value of β is 1, balancing the CTC loss and the OT loss.
In some embodiments, the method further comprises:
the gradient ∇_{e_i} g(e)_k is calculated to identify the word most important to the classification, where e_i is the embedding obtained by converting word x_i, g(·) is the upper layer that predicts from the word embeddings, the output of g(·) is a probability distribution over all classes, and g(·)_k denotes the probability of the k-th class;
the importance weight for each of the words is calculated by the following formula:
,
wherein e i Representing the sum of the word embedding, the location embedding and the tag type embedding.
In some embodiments, the method further comprises:
taking the w_i as weights, randomly sampling m = ⌈ρ·n⌉ positions in the sentence, where ρ is the masking ratio and n is the number of words; the positions are sampled as a sequence z ~ Cat(softmax(w/α)), where Cat denotes a categorical distribution and α is a hyper-parameter; the position sequence is replaced with a special mask placeholder, and the most likely sentence is estimated as x̂ = BERT(x̃_masked), where BERT(·) is the BERT language model and n denotes the number of words.
In a second aspect, embodiments of the present application provide a speech recognition adversarial defense device combined with gradient guidance, comprising:
a first calculation module for calculating a loss function comprising a connectionist temporal classification (CTC) loss and an optimal transport (OT) loss, wherein in a supervised scenario the CTC loss is denoted L_CTC(f(x), y), where x is the original audio input provided to the speech recognition model f and y is the corresponding transcription, and in an unsupervised scenario the OT loss is denoted L_OT and given by L_OT = min_T (T·B), where T is the coupling matrix that solves the optimal transport problem and B is the transportation cost matrix;
a second calculation module for calculating the cosine distance between samples, the cosine distance representing the distance between the prediction f(x) for a clean sample and the prediction f(x̃) for an adversarial sample;
an iteration module for calculating the maximum loss based on the cosine distance and the CTC loss, and reducing the value of the CTC loss through iteration;
a generation module for generating a new adversarial sample by combining the CTC loss and the OT loss, wherein the loss function used to generate the new adversarial sample is L_new, calculated as:
L_new = L_CTC + β·L_OT
where β is a weighting factor;
and a training module for adversarially training the speech recognition model f with the new adversarial sample.
In some embodiments, the cosine distance is denoted C and calculated as:
C = 1 − (f(x)·f(x̃)) / (‖f(x)‖·‖f(x̃)‖)
where f(x) denotes the prediction for the clean sample, f(x̃) the prediction for the adversarial sample, and x is the original audio input provided to the speech recognition model f.
In a third aspect, an embodiment of the present application provides an electronic device, comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing any of the above methods when executing the computer program.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium storing one or more programs executable by one or more processors to implement any of the above methods.
The speech recognition adversarial defense method combined with gradient guidance provided by the embodiments of the application realizes a method with both supervised and unsupervised capabilities; combining the two yields stronger adversarial samples, which aids adversarial training. Meanwhile, gradient guidance is used at the output of the ASR model to defend against attacks aimed at the classifier, improving the robustness of the ASR model, ensuring the security of voice interaction, and raising the defensive capability of the speech recognition system.
Drawings
In order to more clearly illustrate the embodiments of the application or the prior art solutions, the drawings which are used in the description of the embodiments or the prior art will be briefly described below, it being obvious that the drawings in the description below are only some of the embodiments described in the present application, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a speech recognition adversarial defense method combined with gradient guidance according to an embodiment of the present application;
FIG. 2 is a schematic illustration of GGAD adversarial training in the speech recognition adversarial defense method combined with gradient guidance according to an embodiment of the present application;
FIG. 3 is a schematic diagram of the speech recognition model in the speech recognition adversarial defense method combined with gradient guidance according to an embodiment of the present application;
fig. 4 is a schematic block diagram of an electronic device in an embodiment of the application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be clearly and completely described below with reference to specific embodiments and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
It is to be noted that unless otherwise defined, technical or scientific terms used in one or more embodiments of the present application should be taken in a general sense as understood by one of ordinary skill in the art to which the present application belongs. The use of the terms "first," "second," and the like in one or more embodiments of the present application does not denote any order, quantity, or importance, but rather the terms "first," "second," and the like are used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that elements or items preceding the word are included in the element or item listed after the word and equivalents thereof, but does not exclude other elements or items. The terms "connected" or "connected," and the like, are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", etc. are used merely to indicate relative positional relationships, which may also be changed when the absolute position of the object to be described is changed.
In the related art, the paper "Training augmentation with adversarial examples for robust speech recognition" uses adversarial training based on the Fast Gradient Sign Method (FGSM) to train models, studying input-gradient regularization as a route to adversarial robustness. The method trains a differentiable model (e.g., a deep neural network) by penalizing the gradient of the loss function with respect to the input. The results show that this approach can produce very good robustness against attacks, but it almost doubles the training complexity of the network, and its performance was not demonstrated across various adversarial scenarios, especially black-box ones.
The application aims to realize a method with both supervised and unsupervised capabilities, so that stronger adversarial samples can be obtained, aiding adversarial training; meanwhile, gradient guidance is used at the output of the ASR model to defend against attacks aimed at the classifier, improving the robustness of the ASR model.
Various non-limiting embodiments of the present application are described in detail below with reference to the attached drawing figures.
First, the speech recognition adversarial defense method combined with gradient guidance according to the present application will be described in detail with reference to fig. 1.
As shown in fig. 1, an embodiment of the present application provides a speech recognition adversarial defense method combined with gradient guidance, comprising:
s101: calculating a loss function comprising connection timing class classification loss and optimal transport loss, wherein in a supervised scenario, L CTC (f (x), y) represents the connection timing class classification penalty, where x is the original audio input provided to the speech recognition model f, y is the corresponding transcription, and in an unsupervised scenario, the optimal transport penalty is L OT Represented by L, where OT =min T (T.B), T is the correlation matrix that solves the most transportation problem, B is the transportation cost matrix;
s102: calculating a cosine distance between samples, the cosine distance representing a prediction of a clean sample f (x) and an countermeasure sampleIs a predicted distance between the predictions of (2);
s103: calculating maximum loss based on the cosine distance and the connection time sequence class classification loss, and iteratively reducing the value of the connection time sequence class classification loss;
s104: generating a new challenge sample using the connection timing class classification loss and the optimal transport loss in combination, wherein the loss function involved in generating the new challenge sample is L new The L is new The calculation mode of (2) is as follows:
L new =L CTC +βL OT (1)
where β is a weighting factor;
s105: and performing countermeasure training on the voice recognition model f by using the new countermeasure sample.
Specifically, S101 comprises the following steps:
Step 11: the CTC loss is calculated using a supervised method.
The adversarial sample is generated by using the gradient information of the loss value with respect to the input data. In a supervised scenario, cross-entropy (for classification), connectionist temporal classification (for speech recognition models), and the like are used. The connectionist temporal classification (CTC) loss between the original label and the model prediction can be defined as follows:
L_CTC(f(x), y) (2)
where x is the audio input provided to the speech recognition model f and y is the corresponding transcription. This is a supervised loss function that uses the original labels of the data; supervised adversarial sample generation techniques maximize this function.
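The CTC loss of formula (2) is normally computed by a framework primitive; a minimal numpy sketch of the CTC forward (alpha) recursion, assuming per-frame log-probabilities, is shown below. The function name and layout are illustrative, not part of the patent.

```python
import numpy as np

def ctc_loss(log_probs, target, blank=0):
    """CTC negative log-likelihood via the forward (alpha) recursion.
    log_probs: (T, V) per-frame log-probabilities; target: label ids, no blanks."""
    def logsumexp(vals):
        vals = [v for v in vals if v != -np.inf]
        if not vals:
            return -np.inf
        m = max(vals)
        return m + np.log(sum(np.exp(v - m) for v in vals))

    T = log_probs.shape[0]
    ext = [blank]                              # target with blanks interleaved
    for t in target:
        ext += [t, blank]
    S = len(ext)
    alpha = np.full((T, S), -np.inf)
    alpha[0, 0] = log_probs[0, blank]
    if S > 1:
        alpha[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            cands = [alpha[t - 1, s]]          # stay on the same extended symbol
            if s > 0:
                cands.append(alpha[t - 1, s - 1])
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[t - 1, s - 2])
            alpha[t, s] = logsumexp(cands) + log_probs[t, ext[s]]
    tails = [alpha[T - 1, S - 1]] + ([alpha[T - 1, S - 2]] if S > 1 else [])
    return -logsumexp(tails)
```

For a vocabulary {blank, 'a'} with two uniform frames (probability 0.5 each), the paths collapsing to 'a' carry total probability 0.75, so the loss equals −log 0.75.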
Step 12: the OT loss is calculated using an unsupervised method.
No labels are used in the unsupervised approach; instead, the difference between clean-sample predictions and adversarial-sample predictions is used. The unsupervised loss is based on Optimal Transport (OT) theory; the adversarial samples are initialized with random noise, and the OT distance can be expressed as:
L_OT = min_T (T·B) (3)
where T is the coupling matrix that helps solve the OT problem and B is the transportation cost matrix.
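The minimization of formula (3) can be approximated with Sinkhorn-Knopp iterations. This is an entropy-regularized stand-in for the exact OT solver; the patent does not specify which solver is used, so the function below is an illustrative assumption.

```python
import numpy as np

def sinkhorn(a, b, B, reg=0.05, n_iter=200):
    """Entropy-regularized OT: approximately solves min_T (T . B)
    subject to T @ 1 = a and T.T @ 1 = b."""
    K = np.exp(-B / reg)                  # Gibbs kernel derived from the cost matrix B
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)                 # alternately rescale to match the two marginals
        u = a / (K @ v)
    T = u[:, None] * K * v[None, :]       # transport plan
    return T, float((T * B).sum())        # plan and its transport cost
```

With uniform marginals and a cost matrix that is zero on the diagonal, the plan concentrates on the diagonal and the cost approaches zero.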
In some embodiments, the cosine distance is denoted C and calculated as:
C = 1 − (f(x)·f(x̃)) / (‖f(x)‖·‖f(x̃)‖) (4)
where f(x) denotes the prediction for the clean sample, f(x̃) the prediction for the adversarial sample x̃, and x is the original audio input provided to the speech recognition model f.
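The cosine distance of formula (4) can be sketched directly, assuming the two predictions are flattened to vectors:

```python
import numpy as np

def cosine_distance(p_clean, p_adv):
    """Cosine distance between the clean-sample prediction f(x)
    and the adversarial-sample prediction f(x~)."""
    num = float(np.dot(p_clean, p_adv))
    den = float(np.linalg.norm(p_clean) * np.linalg.norm(p_adv))
    return 1.0 - num / den
```

Identical directions give distance 0; orthogonal predictions give distance 1.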
In some embodiments, the maximum loss is calculated by combining the above steps, and the CTC loss is reduced through iteration. Finally, the application combines the supervised and unsupervised losses and uses them to generate new adversarial samples, which are then used for adversarial training to increase the robustness of the speech recognition model.
In some embodiments, generating a new adversarial sample by combining the CTC loss and the OT loss comprises:
iterating with the following update to generate a new adversarial sample, where the new adversarial sample is a mixed adversarial sample:
x̃_{t+1} = x̃_t + ε·sign(∇_x̃ L_new(x̃_t)) (5)
where x̃ denotes the adversarial sample for the original audio input x, t denotes the iteration number, and ε is the step size.
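Assuming the iteration of formula (5) is a gradient-sign ascent on L_new, it can be sketched on a toy differentiable stand-in. The quadratic L_new below replaces the real combined loss L_CTC + β·L_OT, and the model f(x) = Wx is illustrative only.

```python
import numpy as np

W = np.array([[1.0, -0.5], [0.3, 2.0]])   # stand-in differentiable "model" f(x) = W x
y = np.array([0.0, 0.0])                  # target prediction

def L_new(x):
    """Stand-in for the combined loss; the real L_new = L_CTC + beta * L_OT."""
    d = W @ x - y
    return float(d @ d)

def grad_L_new(x):
    return 2.0 * W.T @ (W @ x - y)

def generate_adversarial(x0, eps=0.01, steps=10):
    """Iterative update of formula (5): x~_{t+1} = x~_t + eps * sign(grad L_new)."""
    x = x0.copy()
    for _ in range(steps):
        x = x + eps * np.sign(grad_L_new(x))
    return x

x0 = np.array([0.5, -0.2])
x_adv = generate_adversarial(x0)
```

Each sign step increases the loss while keeping the per-step perturbation bounded by ε, so the total perturbation is bounded by steps·ε.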
In some embodiments, the value of β is 1, balancing the CTC loss and the OT loss; this value of β is exemplary, and other values known to those skilled in the art may be used without limitation.
In some embodiments, the method further comprises:
the gradient is calculated with the following formula to identify the word most important to the classification:
∇_{e_i} g(e)_k (6)
where e_i is the embedding obtained by converting word x_i, g(·) is the upper layer that predicts from the word embeddings, the output of g(·) is a probability distribution over all classes, and g(·)_k denotes the probability of the k-th class;
specifically, the importance of the word is estimated using the gradient of the classifier; in a white-box environment where an attacker has full access to the classifier, the gradient is used directly to pick candidates, whereas in a black-box environment, the gradient is approximated by comparing the output of the classifier to whether there is a word. Assuming that the classifier can be fully accessed when constructing the defenses, formula (6) is directly employed to calculate the gradient to identify the word most important to the classification.
The importance weight of each word is calculated by the following formula:
w_i = ‖∇_{e_i} g(e)_k‖ (7)
where e_i denotes the sum of the word embedding, the position embedding and the token-type embedding.
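Formulas (6) and (7), the gradient of the k-th class probability with respect to each word embedding and its norm as the importance weight, can be sketched on a toy classifier. The tanh pooling layer, the weight matrix W, and the function names here are illustrative stand-ins, not the patent's model.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy stand-in "upper layer" g: class probabilities from the sum of
# tanh-squashed word embeddings (2 classes, embedding dimension 2).
W = np.array([[1.0, -1.0], [-1.0, 1.0]])

def g(embeddings):
    h = np.tanh(embeddings).sum(axis=0)   # pool the word embeddings
    return softmax(W @ h)                 # probability distribution over classes

def importance_weights(embeddings, k):
    """w_i = || d g(.)_k / d e_i ||: gradient norm of the k-th class
    probability with respect to each word embedding."""
    p = g(embeddings)
    onehot = (np.arange(len(p)) == k).astype(float)
    dp_dh = W.T @ (p[k] * (onehot - p))   # softmax + linear backward pass
    weights = []
    for e_i in embeddings:
        de = 1.0 - np.tanh(e_i) ** 2      # elementwise tanh derivative
        weights.append(np.linalg.norm(dp_dh * de))
    return np.array(weights)
```

A word whose embedding sits in the unsaturated region of tanh receives a larger gradient norm, and hence a larger importance weight, than a saturated one.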
The sentence is randomly rewritten several times using a masked language model. After the importance weight of each word is calculated, the important words must be replaced to defend against the attack. If the importance weights were thresholded and the selected words masked and replaced, all important words might be masked, making the sentence generated by the model semantically different from the original sentence. To solve this problem, GGAD uses a random replacement method.
Specifically, as shown in fig. 2, the GGAD adversarial training model takes a sound recording as input and calculates the CTC loss and the OT loss. Both losses depend on the result of preprocessing the clean data and passing it through the speech recognition model; the OT loss additionally depends on the result of preprocessing the perturbed data and passing it through the speech recognition model. The maximum loss and the minimum loss are then calculated, and the perturbed data are updated in the process.
Illustratively, as shown in fig. 3, the speech recognition model involved in the present application may comprise, in order, an input, a convolutional layer, a ResCNN, a fully connected layer, a BiRNN, a classifier, and an output; the BiRNN comprises layer normalization, a gated recurrent unit and a dropout layer, and the classifier comprises a fully connected layer.
In some embodiments, the method further comprises:
taking the w_i as weights, randomly sampling m = ⌈ρ·n⌉ positions in the sentence, where ρ is the masking ratio and n is the number of words; the positions are sampled as a sequence z ~ Cat(softmax(w/α)), where Cat denotes a categorical distribution and α is a hyper-parameter; the position sequence is replaced with a special mask placeholder, and the most likely sentence is estimated as:
x̂ = BERT(x̃_masked) (8)
where BERT(·) is the BERT language model and n denotes the number of words.
Specifically, the method takes the w_i as weights and randomly samples m = ⌈ρ·n⌉ positions in the sentence, where ρ is the masking ratio. The positions are sampled as z ~ Cat(softmax(w/α)), where Cat denotes a categorical distribution and α is a hyper-parameter; these positions are then replaced with special mask placeholders, and the most likely sentence is estimated with the BERT language model by evaluating formula (8). In the rewritten sentence all n words are generated by the BERT language model, so although only m words are masked, more than m words of the sentence may still be replaced. Different mask positions may result in different rewrites.
To make the classifier more stable, the method generates λ sentences for each adversarial sentence by selecting different mask positions, and then takes the majority of the λ predicted sentences as the prediction result for the original input audio.
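The masked-rewrite procedure, sampling mask positions from Cat(softmax(w/α)) and taking a majority vote over the λ rewrites, can be sketched with the standard library. The BERT rewriting step itself is omitted, and `mask_positions` and `majority_vote` are illustrative names, not the patent's API.

```python
import math
import random
from collections import Counter

def mask_positions(weights, rho, alpha=1.0, seed=0):
    """Sample m = max(1, round(rho*n)) distinct positions ~ Cat(softmax(w/alpha))."""
    rng = random.Random(seed)
    n = len(weights)
    m = max(1, round(rho * n))
    mx = max(w / alpha for w in weights)
    exps = [math.exp(w / alpha - mx) for w in weights]   # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    positions = set()
    while len(positions) < m:                # resample until m distinct positions
        r, acc = rng.random(), 0.0
        for i, p in enumerate(probs):
            acc += p
            if r <= acc:
                positions.add(i)
                break
        else:
            positions.add(n - 1)             # guard against floating-point round-off
    return sorted(positions)

def majority_vote(sentences):
    """Take the most common of the lambda rewritten candidate sentences."""
    return Counter(sentences).most_common(1)[0][0]
```

With a strongly dominant importance weight, the sampled mask position lands on that word with high probability, and the vote returns the most frequent rewrite.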
According to the method, the character error rate (CER) of GGAD in a white-box environment is reduced by at least 1% and at most 12% compared with other defense models; the word error rate (WER) of GGAD in a white-box environment is reduced by at least 4% and at most 40% compared with other defense models; and the WER of GGAD in a black-box environment is reduced by at least 4% and at most 29% compared with other defense models.
Adversarial samples based on audio data are difficult to counter, and effective defense strategies must be formulated to protect deep learning models from adversarial attacks. The application discusses a defense method based on adversarial training that generates adversarial samples in a novel manner; the generated samples combine the capabilities of supervised and unsupervised approaches. Experiments comparing the application with other popular defense methods show that it defends better against both white-box and black-box attacks.
The embodiment of the application provides a speech recognition adversarial defense device combined with gradient guidance, comprising:
a first calculation module for calculating a loss function comprising a connectionist temporal classification (CTC) loss and an optimal transport (OT) loss, wherein in a supervised scenario the CTC loss is denoted L_CTC(f(x), y), where x is the original audio input provided to the speech recognition model f and y is the corresponding transcription, and in an unsupervised scenario the OT loss is denoted L_OT and given by L_OT = min_T (T·B), where T is the coupling matrix that solves the optimal transport problem and B is the transportation cost matrix;
a second calculation module for calculating the cosine distance between samples, the cosine distance representing the distance between the prediction f(x) for a clean sample and the prediction f(x̃) for an adversarial sample;
an iteration module for calculating the maximum loss based on the cosine distance and the CTC loss, and reducing the value of the CTC loss through iteration;
a generation module for generating a new adversarial sample by combining the CTC loss and the OT loss, wherein the loss function used to generate the new adversarial sample is L_new, calculated as:
L_new = L_CTC + β·L_OT
where β is a weighting factor;
and a training module for adversarially training the speech recognition model f with the new adversarial sample.
In some embodiments, the cosine distance is denoted C and calculated as:
C = 1 − (f(x)·f(x̃)) / (‖f(x)‖·‖f(x̃)‖)
where f(x) denotes the prediction for the clean sample, f(x̃) the prediction for the adversarial sample, and x is the original audio input provided to the speech recognition model f.
The present application uses a Gradient-Guided Adversarial Defense (GGAD) method: first, supervised and unsupervised adversarial training methods are combined to generate new adversarial samples; second, at the output, the gradient norm is used to estimate and rewrite the words most important to the classification. This addresses the weak adversarial robustness of the ASR system and improves the robustness of the ASR model.
It should be noted that the method according to one or more embodiments of the present application may be performed by a single device, such as a computer or server. The method of the embodiment can also be applied to a distributed scene, and is completed by mutually matching a plurality of devices. In the case of such a distributed scenario, one of the devices may perform only one or more steps of the methods of one or more embodiments of the present application, the devices interacting with each other to accomplish the methods.
It should be noted that the foregoing describes specific embodiments of the present application. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Based on the same inventive concept, the application also discloses an electronic device corresponding to the method of any embodiment;
specifically, fig. 4 shows a schematic hardware structure of an electronic device combined with the gradient-guided voice recognition countermeasure method according to the present embodiment, where the device may include: processor 410, memory 420, input/output interface 430, communication interface 440, and bus 450. Wherein processor 410, memory 420, input/output interface 430 and communication interface 440 are communicatively coupled to each other within the device via bus 450.
The processor 410 may be implemented by a general-purpose CPU (Central Processing Unit ), a microprocessor, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits, etc. for executing relevant programs to implement the technical solutions provided by the embodiments of the present application.
The Memory 420 may be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory ), static storage device, dynamic storage device, or the like. Memory 420 may store an operating system and other application programs, and when implementing the techniques provided by embodiments of the present application by software or firmware, the associated program code is stored in memory 420 and invoked for execution by processor 410.
The input/output interface 430 is used to connect an input/output module to realize information input and output. The input/output module may be configured as a component within the device (not shown in the figure) or may be external to the device to provide the corresponding functionality. The input devices may include a keyboard, a mouse, a touch screen, a microphone, and various types of sensors, and the output devices may include a display, a speaker, a vibrator, indicator lights, and the like.
The communication interface 440 is used to connect a communication module (not shown) to enable communication interaction between the device and other devices. The communication module may communicate in a wired manner (e.g., USB or network cable) or in a wireless manner (e.g., mobile network, Wi-Fi, or Bluetooth).
Bus 450 includes a path to transfer information between components of the device (e.g., processor 410, memory 420, input/output interface 430, and communication interface 440).
It should be noted that although the above device only shows the processor 410, the memory 420, the input/output interface 430, the communication interface 440, and the bus 450, in the implementation, the device may further include other components necessary to achieve normal operation. Furthermore, it will be understood by those skilled in the art that the above-described apparatus may include only the components necessary for implementing the embodiments of the present application, and not all the components shown in the drawings.
The electronic device of the foregoing embodiments is configured to implement the corresponding gradient-guided speech recognition adversarial defense method of any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiments, which are not repeated here.
Based on the same inventive concept, and corresponding to the method of any of the above embodiments, one or more embodiments of the present application also provide a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the gradient-guided speech recognition adversarial defense method described in any of the embodiments above.
The computer-readable media of the present embodiments include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
The storage medium of the above embodiment stores computer instructions for causing a computer to perform the gradient-guided speech recognition countermeasure method described in any one of the above embodiments, and has the beneficial effects of the corresponding method embodiments, which are not repeated here.
Those of ordinary skill in the art will appreciate that the discussion of any embodiment above is merely exemplary and is not intended to suggest that the scope of the application (including the claims) is limited to these examples. Within the spirit of the application, features of the above embodiments, or of different embodiments, may be combined, steps may be implemented in any order, and there are many other variations of the different aspects of one or more embodiments of the application as described above, which are not provided in detail for the sake of brevity.
Additionally, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown within the provided figures, in order to simplify the illustration and discussion, and so as not to obscure one or more embodiments of the application. Furthermore, the apparatus may be shown in block diagram form in order to avoid obscuring the embodiment(s) of the present application, and also in view of the fact that specifics with respect to implementation of such block diagram apparatus are highly dependent upon the platform on which the embodiment(s) of the present application are to be implemented (i.e., such specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the application, it should be apparent to one skilled in the art that one or more embodiments of the application can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative in nature and not as restrictive.
While the application has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of those embodiments will be apparent to those skilled in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the embodiments discussed.
The present application is intended to embrace all such alternatives, modifications and variances which fall within the broad scope of the appended claims. Accordingly, any omissions, modifications, equivalents, improvements and others which are within the spirit and principle of the one or more embodiments of the application are intended to be included within the scope of the application.

Claims (6)

1. A gradient-guided speech recognition adversarial defense method, the method comprising:
calculating a loss function comprising a connectionist temporal classification (CTC) loss and an optimal transport loss, wherein in a supervised scenario the CTC loss is denoted by L_CTC(f(x), y), where x is the original audio input provided to the speech recognition model f and y is the corresponding transcription, and in an unsupervised scenario the optimal transport loss is denoted by L_OT, with L_OT = min_T ⟨T, B⟩, where T is the transport matrix solving the optimal transport problem and B is the transport cost matrix;
calculating the cosine distance C between samples, which represents the distance between the prediction f(x) for the clean sample and the prediction for the adversarial sample;
generating a new adversarial sample using the CTC loss and the optimal transport loss in combination, wherein the loss function used to generate the new adversarial sample is L_new, calculated as:

L_new = L_CTC(f(x), y) + β·L_OT

where β is a weighting factor;
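As a rough illustration of how the combined loss L_new = L_CTC + β·L_OT might be assembled, the sketch below pairs a frame-wise negative log-likelihood (a simplified stand-in for the true CTC loss, which requires the full CTC forward algorithm) with an entropic Sinkhorn approximation of the optimal transport term min_T ⟨T, B⟩. All function and parameter names are illustrative, not taken from the patent:

```python
import numpy as np

def sinkhorn_ot(B, reg=0.05, n_iters=500):
    """Approximate L_OT = min_T <T, B> with uniform marginals (entropic OT)."""
    n, m = B.shape
    a = np.full(n, 1.0 / n)   # uniform source marginal
    b = np.full(m, 1.0 / m)   # uniform target marginal
    K = np.exp(-B / reg)      # Gibbs kernel of the cost matrix
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):  # alternating marginal-matching updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    T = u[:, None] * K * v[None, :]   # approximate transport plan
    return float(np.sum(T * B))       # transport cost <T, B>

def combined_loss(frame_logprobs, targets, B, beta=1.0):
    """L_new = supervised term + beta * L_OT.

    A frame-wise negative log-likelihood stands in for the CTC loss here,
    purely to keep the sketch self-contained.
    """
    nll = -np.mean([frame_logprobs[t, y] for t, y in enumerate(targets)])
    return nll + beta * sinkhorn_ot(B)
```

With β = 1 (as in claim 4 below) the two terms contribute equally; in practice β would be tuned so neither loss dominates.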
performing adversarial training on the speech recognition model f using the new adversarial sample;
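The adversarial training loop itself is not spelled out in the claim. A minimal sketch of the usual pattern (regenerate an adversarial sample at each step, then update the model on it) is shown below on a toy logistic model, with a one-step sign-gradient attack standing in for the audio attack; every name and value here is illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, eps):
    """One-step attack: move x along the sign of the input gradient
    of the logistic loss (a lightweight stand-in for an ASR attack)."""
    grad_x = (sigmoid(w @ x) - y) * w
    return x + eps * np.sign(grad_x)

def adversarial_train(X, Y, eps=0.1, lr=0.5, epochs=100):
    """Train on adversarial samples regenerated at every update."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, y in zip(X, Y):
            x_adv = fgsm(x, y, w, eps)            # attack current model
            w -= lr * (sigmoid(w @ x_adv) - y) * x_adv  # train on the attack
    return w
```

The same loop shape applies to the speech model: the inner attack would use L_new instead of the logistic loss.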
calculating the gradient using the following formula to identify the words most important to the classification:

f(x) = argmax_k g(E(x))_k

where E(x) = e_1, …, e_l converts each word x_i into an embedding e_i, g(·) is the upper layer that predicts from the word embeddings, the output of g(·) is a probability distribution over all classes, and g(·)_k denotes the probability of the k-th class;
the importance weight for each of the words is calculated by the following formula:
wherein e i Representing a sum of the word embedding, the location embedding and the tag type embedding;
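The importance-weight formula itself is not legible in this extraction. One common choice consistent with the gradient-guidance idea is the norm of the loss gradient with respect to each embedding row, sketched below with finite differences; this is purely illustrative and not necessarily the patent's exact formula:

```python
import numpy as np

def importance_weights(E, loss_fn, h=1e-5):
    """w_i = ||dL/de_i||: per-word gradient norm over the embedding rows,
    estimated by forward finite differences (illustrative choice)."""
    l, d = E.shape
    base = loss_fn(E)
    w = np.zeros(l)
    for i in range(l):
        g = np.zeros(d)
        for j in range(d):
            E_pert = E.copy()
            E_pert[i, j] += h            # nudge one embedding coordinate
            g[j] = (loss_fn(E_pert) - base) / h
        w[i] = np.linalg.norm(g)         # importance of word i
    return w
```

In a real model the gradient would come from autodiff rather than finite differences; the ranking of words is what matters for the masking step that follows.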
and using the w_i as weights, randomly sampling positions in the sentence, where γ is the masking ratio and l denotes the number of words; sampling the positions to obtain a position sequence, where Cat denotes a multinomial distribution and α is a hyper-parameter; replacing the position sequence with a special mask placeholder; and estimating the most likely sentence as:
where BERT(x) is the BERT language model and l denotes the number of words.
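The sampling step above (draw roughly γ·l positions, with probability weighted by the importance weights, then mask them) can be sketched as follows; `alpha` plays the role of the hyper-parameter α, and the names are illustrative:

```python
import numpy as np

def sample_mask_positions(weights, gamma, alpha=1.0, rng=None):
    """Sample ceil(gamma * l) positions to mask, without replacement,
    with probability proportional to w_i ** alpha (the Cat(...) step)."""
    if rng is None:
        rng = np.random.default_rng(0)
    w = np.asarray(weights, dtype=float) ** alpha  # temper by alpha
    p = w / w.sum()                                # normalize to a distribution
    l = len(w)
    k = int(np.ceil(gamma * l))                    # number of masked words
    return sorted(rng.choice(l, size=k, replace=False, p=p).tolist())
```

The selected positions would then be replaced by the mask placeholder and handed to BERT(x) to estimate the most likely sentence.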
2. The method of claim 1, wherein the formula for calculating C is as follows:

where f(x) denotes the prediction for the clean sample, f(x̃) denotes the prediction for the adversarial sample x̃, and x is the original audio input provided to the speech recognition model f.
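The exact formula for C is not legible in this extraction; a common cosine-distance definition consistent with the description (one minus the cosine similarity of the two prediction vectors) would look like:

```python
import numpy as np

def cosine_distance(p_clean, p_adv):
    """C = 1 - cos(f(x), f(x_adv)): 0 for identical predictions,
    1 for orthogonal ones, up to 2 for opposite ones."""
    p_clean = np.asarray(p_clean, dtype=float)
    p_adv = np.asarray(p_adv, dtype=float)
    cos = p_clean @ p_adv / (np.linalg.norm(p_clean) * np.linalg.norm(p_adv))
    return float(1.0 - cos)
```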
3. The gradient-guided speech recognition adversarial defense method of claim 1 or 2, wherein generating a new adversarial sample using the CTC loss and the optimal transport loss in combination comprises:
iterating with the following formula to generate a new adversarial sample, wherein the new adversarial sample is a mixed adversarial sample:

where x̃ denotes the adversarial sample of the original audio input x, and t denotes the number of iterations.
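The iteration formula itself is not legible here, but the described loop matches the familiar projected sign-gradient pattern, sketched below for a generic gradient function; `grad_fn`, `eps`, `alpha`, and `steps` are illustrative names and values, not taken from the patent:

```python
import numpy as np

def pgd_iterate(x, grad_fn, eps=0.3, alpha=0.05, steps=10):
    """x_{t+1} = clip_eps(x_t + alpha * sign(grad L_new(x_t))):
    repeated sign-gradient ascent on the loss, projected back into
    an eps-ball around the original input x."""
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))  # ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)         # project to eps-ball
    return x_adv
```

In the method above, `grad_fn` would be the gradient of L_new with respect to the audio input.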
4. The gradient-guided speech recognition adversarial defense method of claim 3, wherein β is set to 1 to balance the CTC loss and the optimal transport loss.
5. An electronic device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the method according to any one of claims 1 to 4 when executing the computer program.
6. A computer readable storage medium storing one or more programs executable by one or more processors to implement the method of any of claims 1-4.
CN202311154761.8A 2023-09-08 2023-09-08 Voice recognition countermeasure method and device combined with gradient guidance Active CN116913259B (en)

Publications (2)

Publication Number | Publication Date
CN116913259A (en) | 2023-10-20
CN116913259B (en) | 2023-12-15


