CN116913259B - Voice recognition countermeasure method and device combined with gradient guidance - Google Patents


Info

Publication number
CN116913259B
CN116913259B (Application CN202311154761.8A)
Authority
CN
China
Prior art keywords
loss
challenge
sample
class classification
new
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311154761.8A
Other languages
Chinese (zh)
Other versions
CN116913259A (en)
Inventor
肖韬睿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 15 Research Institute
Original Assignee
CETC 15 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 15 Research Institute
Priority to CN202311154761.8A
Publication of CN116913259A
Application granted
Publication of CN116913259B


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/08: Speech classification or search
    • G10L15/10: Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L15/16: Speech classification or search using artificial neural networks
    • G10L2015/088: Word spotting
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a speech recognition adversarial defense method and device combined with gradient guidance, wherein the method comprises the following steps: calculating a loss function comprising a connectionist temporal classification (CTC) loss and an optimal transport (OT) loss, the CTC loss being calculated in a supervised scenario and the OT loss in an unsupervised scenario; calculating the cosine distance between samples; calculating the maximum loss based on the cosine distance and the CTC loss, and iteratively reducing the value of the CTC loss; generating new adversarial samples by combining the CTC loss and the OT loss, and adversarially training the speech recognition model f with the new adversarial samples. The application can obtain stronger adversarial samples, which aids adversarial training; meanwhile, gradient guidance is used at the output of the ASR model to defend against attacks aimed at the classifier, improving the robustness of the ASR model.

Description

Voice recognition countermeasure method and device combined with gradient guidance
Technical Field
The application belongs to the technical field of speech recognition, and particularly relates to a speech recognition adversarial defense method and device combined with gradient guidance.
Background
Automatic Speech Recognition (ASR) systems are weakly resistant to adversarial attack. Most adversarial attacks, such as the Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD), are supervised attacks. Their advantage is that they can reliably generate strong adversarial samples, but they do not take the relationships between samples into account and are prone to label leakage. Other, unsupervised adversarial sample generation methods, such as Feature Scattering (FS), have long training times and cannot reliably generate strong adversarial samples, so the security of voice interaction cannot be guaranteed and the defensive capability of the speech recognition system remains low.
In view of the above problems, the present application provides a speech recognition adversarial defense method and device combined with gradient guidance.
Disclosure of Invention
To remedy the defects of the prior art, the present application provides a speech recognition adversarial defense method combined with gradient guidance, aiming at the technical problems that prior-art methods have long training times, cannot consistently generate strong adversarial samples, cannot ensure the security of voice interaction, and leave the speech recognition system with low defensive capability.
The technical effects to be achieved by the application are realized by the following scheme:
In a first aspect, embodiments of the present application provide a speech recognition adversarial defense method combined with gradient guidance, comprising:
calculating a loss function comprising a connectionist temporal classification (CTC) loss and an optimal transport (OT) loss, wherein in a supervised scenario the CTC loss is denoted L_CTC(f(x), y), where x is the original audio input provided to the speech recognition model f and y is the corresponding transcription, and in an unsupervised scenario the OT loss is denoted L_OT and given by L_OT = min_T (T·B), where T is the coupling matrix that solves the optimal transport problem and B is the transportation cost matrix;
calculating the cosine distance between samples, the cosine distance representing the distance between the prediction f(x) for a clean sample and the prediction f(x̃) for an adversarial sample x̃;
calculating the maximum loss based on the cosine distance and the CTC loss, and iteratively reducing the value of the CTC loss;
generating a new adversarial sample by combining the CTC loss and the OT loss, wherein the loss function used to generate the new adversarial sample is L_new, calculated as:
L_new = L_CTC + β·L_OT
where β is a weighting factor;
and performing countermeasure training on the voice recognition model f by using the new countermeasure sample.
In some embodiments, the cosine distance is denoted C and calculated as:
C = 1 − (f(x)·f(x̃)) / (‖f(x)‖·‖f(x̃)‖)
where f(x) denotes the prediction for the clean sample, f(x̃) the prediction for the adversarial sample, and x is the original audio input provided to the speech recognition model f.
In some embodiments, generating a new adversarial sample by combining the CTC loss and the OT loss comprises:
iterating with the following update to generate a new adversarial sample, where the new adversarial sample is a mixed adversarial sample:
x̃_{t+1} = x̃_t + ε·sign(∇_x̃ L_new(x̃_t))
where x̃ denotes the adversarial sample for the original audio input x, t denotes the iteration number, and ε is the step size.
In some embodiments, the value of β is 1, balancing the CTC loss and the OT loss.
In some embodiments, the method further comprises:
the gradient ∇_{e_i} g(e)_k is calculated to identify the word most important to the classification, where e_i is the embedding obtained by converting word x_i, g(·) is the upper layer that predicts from the word embeddings, the output of g(·) is a probability distribution over all classes, and g(·)_k denotes the probability of the k-th class;
the importance weight for each of the words is calculated by the following formula:
,
wherein e i Representing the sum of the word embedding, the location embedding and the tag type embedding.
In some embodiments, the method further comprises:
taking the w_i as weights, randomly sampling m = ⌈ρ·n⌉ positions in the sentence, where ρ is the masking ratio and n is the number of words; the positions are sampled as a sequence z ~ Cat(softmax(w/α)), where Cat denotes a categorical distribution and α is a hyper-parameter; the position sequence is replaced with a special mask placeholder, and the most likely sentence is estimated as x̂ = BERT(x̃_masked), where BERT(·) is the BERT language model and n denotes the number of words.
In a second aspect, embodiments of the present application provide a speech recognition adversarial defense device combined with gradient guidance, comprising:
a first calculation module for calculating a loss function comprising a connectionist temporal classification (CTC) loss and an optimal transport (OT) loss, wherein in a supervised scenario the CTC loss is denoted L_CTC(f(x), y), where x is the original audio input provided to the speech recognition model f and y is the corresponding transcription, and in an unsupervised scenario the OT loss is denoted L_OT and given by L_OT = min_T (T·B), where T is the coupling matrix that solves the optimal transport problem and B is the transportation cost matrix;
a second calculation module for calculating the cosine distance between samples, the cosine distance representing the distance between the prediction f(x) for a clean sample and the prediction f(x̃) for an adversarial sample;
an iteration module for calculating the maximum loss based on the cosine distance and the CTC loss, and reducing the value of the CTC loss through iteration;
a generation module for generating a new adversarial sample by combining the CTC loss and the OT loss, wherein the loss function used to generate the new adversarial sample is L_new, calculated as:
L_new = L_CTC + β·L_OT
where β is a weighting factor;
and a training module for adversarially training the speech recognition model f with the new adversarial sample.
In some embodiments, the cosine distance is denoted C and calculated as:
C = 1 − (f(x)·f(x̃)) / (‖f(x)‖·‖f(x̃)‖)
where f(x) denotes the prediction for the clean sample, f(x̃) the prediction for the adversarial sample, and x is the original audio input provided to the speech recognition model f.
In a third aspect, an embodiment of the present application provides an electronic device, comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing any of the above methods when executing the computer program.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium storing one or more programs executable by one or more processors to implement any of the above methods.
The speech recognition adversarial defense method combined with gradient guidance provided by the embodiments of the application realizes a method with both supervised and unsupervised capabilities; combining the two yields stronger adversarial samples, which aids adversarial training. Meanwhile, gradient guidance is used at the output of the ASR model to defend against attacks aimed at the classifier, improving the robustness of the ASR model, ensuring the security of voice interaction, and raising the defensive capability of the speech recognition system.
Drawings
In order to more clearly illustrate the embodiments of the application or the prior art solutions, the drawings which are used in the description of the embodiments or the prior art will be briefly described below, it being obvious that the drawings in the description below are only some of the embodiments described in the present application, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a speech recognition adversarial defense method combined with gradient guidance according to an embodiment of the present application;
FIG. 2 is a schematic illustration of GGAD adversarial training in the speech recognition adversarial defense method combined with gradient guidance according to an embodiment of the present application;
FIG. 3 is a schematic diagram of the speech recognition model in the speech recognition adversarial defense method combined with gradient guidance according to an embodiment of the present application;
fig. 4 is a schematic block diagram of an electronic device in an embodiment of the application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be clearly and completely described below with reference to specific embodiments and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
It is to be noted that unless otherwise defined, technical or scientific terms used in one or more embodiments of the present application should be taken in a general sense as understood by one of ordinary skill in the art to which the present application belongs. The use of the terms "first," "second," and the like in one or more embodiments of the present application does not denote any order, quantity, or importance, but rather the terms "first," "second," and the like are used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that elements or items preceding the word are included in the element or item listed after the word and equivalents thereof, but does not exclude other elements or items. The terms "connected" or "connected," and the like, are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", etc. are used merely to indicate relative positional relationships, which may also be changed when the absolute position of the object to be described is changed.
In the related art, the paper "Training augmentation with adversarial examples for robust speech recognition" uses adversarial training based on the Fast Gradient Sign Method (FGSM) to train models, studying input-gradient regularization as a route to adversarial robustness. The method trains a differentiable model (e.g., a deep neural network) by penalizing the gradient of the loss function with respect to the input. The results show that this approach can produce very good robustness against attacks, but it almost doubles the training complexity of the network, and its performance was not demonstrated across various adversarial scenarios, especially black-box ones.
The application aims to realize a method with both supervised and unsupervised capabilities, so that stronger adversarial samples can be obtained, aiding adversarial training; meanwhile, gradient guidance is used at the output of the ASR model to defend against attacks aimed at the classifier, improving the robustness of the ASR model.
Various non-limiting embodiments of the present application are described in detail below with reference to the attached drawing figures.
First, the speech recognition adversarial defense method combined with gradient guidance according to the present application will be described in detail with reference to fig. 1.
As shown in fig. 1, an embodiment of the present application provides a speech recognition adversarial defense method combined with gradient guidance, comprising:
s101: calculating a loss function comprising connection timing class classification loss and optimal transport loss, wherein in a supervised scenario, L CTC (f (x), y) represents the connection timing class classification penalty, where x is the original audio input provided to the speech recognition model f, y is the corresponding transcription, and in an unsupervised scenario, the optimal transport penalty is L OT Represented by L, where OT =min T (T.B), T is the correlation matrix that solves the most transportation problem, B is the transportation cost matrix;
s102: calculating a cosine distance between samples, the cosine distance representing a prediction of a clean sample f (x) and an countermeasure sampleIs a predicted distance between the predictions of (2);
s103: calculating maximum loss based on the cosine distance and the connection time sequence class classification loss, and iteratively reducing the value of the connection time sequence class classification loss;
s104: generating a new challenge sample using the connection timing class classification loss and the optimal transport loss in combination, wherein the loss function involved in generating the new challenge sample is L new The L is new The calculation mode of (2) is as follows:
L new =L CTC +βL OT (1)
where β is a weighting factor;
s105: and performing countermeasure training on the voice recognition model f by using the new countermeasure sample.
Specifically, S101 comprises the following steps:
Step 11: the CTC loss is calculated using a supervised method.
The adversarial sample is generated by using the gradient information of the loss value with respect to the input data. In a supervised scenario, cross-entropy (for classification), connectionist temporal classification (for speech recognition models), and the like are used. The connectionist temporal classification (CTC) loss between the original label and the model prediction can be defined as follows:
L_CTC(f(x), y) (2)
where x is the audio input provided to the speech recognition model f and y is the corresponding transcription. This is a supervised loss function that uses the original labels of the data; supervised adversarial sample generation techniques maximize this function.
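The CTC loss of formula (2) is normally computed by a framework primitive; a minimal numpy sketch of the CTC forward (alpha) recursion, assuming per-frame log-probabilities, is shown below. The function name and layout are illustrative, not part of the patent.

```python
import numpy as np

def ctc_loss(log_probs, target, blank=0):
    """CTC negative log-likelihood via the forward (alpha) recursion.
    log_probs: (T, V) per-frame log-probabilities; target: label ids, no blanks."""
    def logsumexp(vals):
        vals = [v for v in vals if v != -np.inf]
        if not vals:
            return -np.inf
        m = max(vals)
        return m + np.log(sum(np.exp(v - m) for v in vals))

    T = log_probs.shape[0]
    ext = [blank]                              # target with blanks interleaved
    for t in target:
        ext += [t, blank]
    S = len(ext)
    alpha = np.full((T, S), -np.inf)
    alpha[0, 0] = log_probs[0, blank]
    if S > 1:
        alpha[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            cands = [alpha[t - 1, s]]          # stay on the same extended symbol
            if s > 0:
                cands.append(alpha[t - 1, s - 1])
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[t - 1, s - 2])
            alpha[t, s] = logsumexp(cands) + log_probs[t, ext[s]]
    tails = [alpha[T - 1, S - 1]] + ([alpha[T - 1, S - 2]] if S > 1 else [])
    return -logsumexp(tails)
```

For a vocabulary {blank, 'a'} with two uniform frames (probability 0.5 each), the paths collapsing to 'a' carry total probability 0.75, so the loss equals −log 0.75.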
Step 12: the OT loss is calculated using an unsupervised method.
No labels are used in the unsupervised approach; instead, the difference between clean-sample predictions and adversarial-sample predictions is used. The unsupervised loss is based on Optimal Transport (OT) theory; the adversarial samples are initialized with random noise, and the OT distance can be expressed as:
L_OT = min_T (T·B) (3)
where T is the coupling matrix that helps solve the OT problem and B is the transportation cost matrix.
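The minimization of formula (3) can be approximated with Sinkhorn-Knopp iterations. This is an entropy-regularized stand-in for the exact OT solver; the patent does not specify which solver is used, so the function below is an illustrative assumption.

```python
import numpy as np

def sinkhorn(a, b, B, reg=0.05, n_iter=200):
    """Entropy-regularized OT: approximately solves min_T (T . B)
    subject to T @ 1 = a and T.T @ 1 = b."""
    K = np.exp(-B / reg)                  # Gibbs kernel derived from the cost matrix B
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)                 # alternately rescale to match the two marginals
        u = a / (K @ v)
    T = u[:, None] * K * v[None, :]       # transport plan
    return T, float((T * B).sum())        # plan and its transport cost
```

With uniform marginals and a cost matrix that is zero on the diagonal, the plan concentrates on the diagonal and the cost approaches zero.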
In some embodiments, the cosine distance is denoted C and calculated as:
C = 1 − (f(x)·f(x̃)) / (‖f(x)‖·‖f(x̃)‖) (4)
where f(x) denotes the prediction for the clean sample, f(x̃) the prediction for the adversarial sample x̃, and x is the original audio input provided to the speech recognition model f.
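The cosine distance of formula (4) can be sketched directly, assuming the two predictions are flattened to vectors:

```python
import numpy as np

def cosine_distance(p_clean, p_adv):
    """Cosine distance between the clean-sample prediction f(x)
    and the adversarial-sample prediction f(x~)."""
    num = float(np.dot(p_clean, p_adv))
    den = float(np.linalg.norm(p_clean) * np.linalg.norm(p_adv))
    return 1.0 - num / den
```

Identical directions give distance 0; orthogonal predictions give distance 1.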
In some embodiments, the maximum loss is calculated by combining the above steps, and the CTC loss is reduced through iteration. Finally, the application combines the supervised and unsupervised losses and uses them to generate new adversarial samples, which are then used for adversarial training to increase the robustness of the speech recognition model.
In some embodiments, generating a new adversarial sample by combining the CTC loss and the OT loss comprises:
iterating with the following update to generate a new adversarial sample, where the new adversarial sample is a mixed adversarial sample:
x̃_{t+1} = x̃_t + ε·sign(∇_x̃ L_new(x̃_t)) (5)
where x̃ denotes the adversarial sample for the original audio input x, t denotes the iteration number, and ε is the step size.
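Assuming the iteration of formula (5) is a gradient-sign ascent on L_new, it can be sketched on a toy differentiable stand-in. The quadratic L_new below replaces the real combined loss L_CTC + β·L_OT, and the model f(x) = Wx is illustrative only.

```python
import numpy as np

W = np.array([[1.0, -0.5], [0.3, 2.0]])   # stand-in differentiable "model" f(x) = W x
y = np.array([0.0, 0.0])                  # target prediction

def L_new(x):
    """Stand-in for the combined loss; the real L_new = L_CTC + beta * L_OT."""
    d = W @ x - y
    return float(d @ d)

def grad_L_new(x):
    return 2.0 * W.T @ (W @ x - y)

def generate_adversarial(x0, eps=0.01, steps=10):
    """Iterative update of formula (5): x~_{t+1} = x~_t + eps * sign(grad L_new)."""
    x = x0.copy()
    for _ in range(steps):
        x = x + eps * np.sign(grad_L_new(x))
    return x

x0 = np.array([0.5, -0.2])
x_adv = generate_adversarial(x0)
```

Each sign step increases the loss while keeping the per-step perturbation bounded by ε, so the total perturbation is bounded by steps·ε.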
In some embodiments, the value of β is 1, balancing the CTC loss and the OT loss; this value of β is exemplary, and other values known to those skilled in the art may be used without limitation.
In some embodiments, the method further comprises:
the gradient is calculated with the following formula to identify the word most important to the classification:
∇_{e_i} g(e)_k (6)
where e_i is the embedding obtained by converting word x_i, g(·) is the upper layer that predicts from the word embeddings, the output of g(·) is a probability distribution over all classes, and g(·)_k denotes the probability of the k-th class;
specifically, the importance of the word is estimated using the gradient of the classifier; in a white-box environment where an attacker has full access to the classifier, the gradient is used directly to pick candidates, whereas in a black-box environment, the gradient is approximated by comparing the output of the classifier to whether there is a word. Assuming that the classifier can be fully accessed when constructing the defenses, formula (6) is directly employed to calculate the gradient to identify the word most important to the classification.
The importance weight of each word is calculated by the following formula:
w_i = ‖∇_{e_i} g(e)_k‖ (7)
where e_i denotes the sum of the word embedding, the position embedding and the token-type embedding.
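Formulas (6) and (7), the gradient of the k-th class probability with respect to each word embedding and its norm as the importance weight, can be sketched on a toy classifier. The tanh pooling layer, the weight matrix W, and the function names here are illustrative stand-ins, not the patent's model.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy stand-in "upper layer" g: class probabilities from the sum of
# tanh-squashed word embeddings (2 classes, embedding dimension 2).
W = np.array([[1.0, -1.0], [-1.0, 1.0]])

def g(embeddings):
    h = np.tanh(embeddings).sum(axis=0)   # pool the word embeddings
    return softmax(W @ h)                 # probability distribution over classes

def importance_weights(embeddings, k):
    """w_i = || d g(.)_k / d e_i ||: gradient norm of the k-th class
    probability with respect to each word embedding."""
    p = g(embeddings)
    onehot = (np.arange(len(p)) == k).astype(float)
    dp_dh = W.T @ (p[k] * (onehot - p))   # softmax + linear backward pass
    weights = []
    for e_i in embeddings:
        de = 1.0 - np.tanh(e_i) ** 2      # elementwise tanh derivative
        weights.append(np.linalg.norm(dp_dh * de))
    return np.array(weights)
```

A word whose embedding sits in the unsaturated region of tanh receives a larger gradient norm, and hence a larger importance weight, than a saturated one.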
The sentence is randomly rewritten several times using a masked language model. After the importance weight of each word is calculated, the important words must be replaced to defend against the attack. If the importance weights were thresholded and the selected words masked and replaced, all important words might be masked, making the sentence generated by the model semantically different from the original sentence. To solve this problem, GGAD uses a random replacement method.
Specifically, as shown in fig. 2, the GGAD adversarial training model takes a sound recording as input and calculates the CTC loss and the OT loss. Both losses depend on the result of preprocessing the clean data and passing it through the speech recognition model; the OT loss additionally depends on the result of preprocessing the perturbed data and passing it through the speech recognition model. The maximum loss and the minimum loss are then calculated, and the perturbed data are updated in the process.
Illustratively, as shown in fig. 3, the speech recognition model involved in the present application may comprise, in order, an input, a convolutional layer, a ResCNN, a fully connected layer, a BiRNN, a classifier, and an output; the BiRNN comprises layer normalization, a gated recurrent unit and a dropout layer, and the classifier comprises a fully connected layer.
In some embodiments, the method further comprises:
taking the w_i as weights, randomly sampling m = ⌈ρ·n⌉ positions in the sentence, where ρ is the masking ratio and n is the number of words; the positions are sampled as a sequence z ~ Cat(softmax(w/α)), where Cat denotes a categorical distribution and α is a hyper-parameter; the position sequence is replaced with a special mask placeholder, and the most likely sentence is estimated as:
x̂ = BERT(x̃_masked) (8)
where BERT(·) is the BERT language model and n denotes the number of words.
Specifically, the method takes the w_i as weights and randomly samples m = ⌈ρ·n⌉ positions in the sentence, where ρ is the masking ratio. The positions are sampled as z ~ Cat(softmax(w/α)), where Cat denotes a categorical distribution and α is a hyper-parameter; these positions are then replaced with special mask placeholders, and the most likely sentence is estimated with the BERT language model by evaluating formula (8). In the rewritten sentence all n words are generated by the BERT language model, so although only m words are masked, more than m words of the sentence may still be replaced. Different mask positions may result in different rewrites.
To make the classifier more stable, the method generates λ sentences for each adversarial sentence by selecting different mask positions, and then takes the majority of the λ predicted sentences as the prediction result for the original input audio.
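The masked-rewrite procedure, sampling mask positions from Cat(softmax(w/α)) and taking a majority vote over the λ rewrites, can be sketched with the standard library. The BERT rewriting step itself is omitted, and `mask_positions` and `majority_vote` are illustrative names, not the patent's API.

```python
import math
import random
from collections import Counter

def mask_positions(weights, rho, alpha=1.0, seed=0):
    """Sample m = max(1, round(rho*n)) distinct positions ~ Cat(softmax(w/alpha))."""
    rng = random.Random(seed)
    n = len(weights)
    m = max(1, round(rho * n))
    mx = max(w / alpha for w in weights)
    exps = [math.exp(w / alpha - mx) for w in weights]   # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    positions = set()
    while len(positions) < m:                # resample until m distinct positions
        r, acc = rng.random(), 0.0
        for i, p in enumerate(probs):
            acc += p
            if r <= acc:
                positions.add(i)
                break
        else:
            positions.add(n - 1)             # guard against floating-point round-off
    return sorted(positions)

def majority_vote(sentences):
    """Take the most common of the lambda rewritten candidate sentences."""
    return Counter(sentences).most_common(1)[0][0]
```

With a strongly dominant importance weight, the sampled mask position lands on that word with high probability, and the vote returns the most frequent rewrite.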
According to the method, the character error rate (CER) of GGAD in a white-box environment is reduced by at least 1% and at most 12% compared with other defense models; the word error rate (WER) of GGAD in a white-box environment is reduced by at least 4% and at most 40% compared with other defense models; and the WER of GGAD in a black-box environment is reduced by at least 4% and at most 29% compared with other defense models.
Adversarial samples based on audio data are difficult to counter, and effective defense strategies must be formulated to protect deep learning models from adversarial attacks. The application discusses a defense method based on adversarial training that generates adversarial samples in a novel manner; the generated samples combine the capabilities of supervised and unsupervised approaches. Experiments comparing the application with other popular defense methods show that it defends better against both white-box and black-box attacks.
The embodiment of the application provides a speech recognition adversarial defense device combined with gradient guidance, comprising:
a first calculation module for calculating a loss function comprising a connectionist temporal classification (CTC) loss and an optimal transport (OT) loss, wherein in a supervised scenario the CTC loss is denoted L_CTC(f(x), y), where x is the original audio input provided to the speech recognition model f and y is the corresponding transcription, and in an unsupervised scenario the OT loss is denoted L_OT and given by L_OT = min_T (T·B), where T is the coupling matrix that solves the optimal transport problem and B is the transportation cost matrix;
a second calculation module for calculating the cosine distance between samples, the cosine distance representing the distance between the prediction f(x) for a clean sample and the prediction f(x̃) for an adversarial sample;
an iteration module for calculating the maximum loss based on the cosine distance and the CTC loss, and reducing the value of the CTC loss through iteration;
a generation module for generating a new adversarial sample by combining the CTC loss and the OT loss, wherein the loss function used to generate the new adversarial sample is L_new, calculated as:
L_new = L_CTC + β·L_OT
where β is a weighting factor;
and a training module for adversarially training the speech recognition model f with the new adversarial sample.
In some embodiments, the cosine distance is denoted C and calculated as:
C = 1 − (f(x)·f(x̃)) / (‖f(x)‖·‖f(x̃)‖)
where f(x) denotes the prediction for the clean sample, f(x̃) the prediction for the adversarial sample, and x is the original audio input provided to the speech recognition model f.
The present application uses a Gradient-Guided Adversarial Defense (GGAD) method: first, supervised and unsupervised adversarial training methods are combined to generate new adversarial samples; second, at the output, the gradient norm is used to estimate and rewrite the words most important to the classification. This addresses the weak adversarial robustness of the ASR system and improves the robustness of the ASR model.
It should be noted that the method according to one or more embodiments of the present application may be performed by a single device, such as a computer or server. The method of the embodiment can also be applied to a distributed scene, and is completed by mutually matching a plurality of devices. In the case of such a distributed scenario, one of the devices may perform only one or more steps of the methods of one or more embodiments of the present application, the devices interacting with each other to accomplish the methods.
It should be noted that the foregoing describes specific embodiments of the present application. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Based on the same inventive concept, the application also discloses an electronic device corresponding to the method of any embodiment;
specifically, fig. 4 shows a schematic hardware structure of an electronic device combined with the gradient-guided voice recognition countermeasure method according to the present embodiment, where the device may include: processor 410, memory 420, input/output interface 430, communication interface 440, and bus 450. Wherein processor 410, memory 420, input/output interface 430 and communication interface 440 are communicatively coupled to each other within the device via bus 450.
The processor 410 may be implemented by a general-purpose CPU (Central Processing Unit ), a microprocessor, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits, etc. for executing relevant programs to implement the technical solutions provided by the embodiments of the present application.
The Memory 420 may be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory ), static storage device, dynamic storage device, or the like. Memory 420 may store an operating system and other application programs, and when implementing the techniques provided by embodiments of the present application by software or firmware, the associated program code is stored in memory 420 and invoked for execution by processor 410.
The input/output interface 430 is used to connect an input/output module to realize information input and output. The input/output module may be configured as a component within the device (not shown in the figure) or may be external to the device to provide the corresponding functionality. The input devices may include a keyboard, a mouse, a touch screen, a microphone, and various types of sensors, and the output devices may include a display, a speaker, a vibrator, indicator lights, and the like.
The communication interface 440 is used to connect a communication module (not shown) to enable communication interaction between the device and other devices. The communication module may communicate in a wired manner (e.g., USB or network cable) or in a wireless manner (e.g., mobile network, Wi-Fi, or Bluetooth).
Bus 450 includes a path to transfer information between components of the device (e.g., processor 410, memory 420, input/output interface 430, and communication interface 440).
It should be noted that although the above device only shows the processor 410, the memory 420, the input/output interface 430, the communication interface 440, and the bus 450, in the implementation, the device may further include other components necessary to achieve normal operation. Furthermore, it will be understood by those skilled in the art that the above-described apparatus may include only the components necessary for implementing the embodiments of the present application, and not all the components shown in the drawings.
The electronic device of the foregoing embodiments is configured to implement the corresponding gradient-guided speech recognition adversarial defense method of any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiments, which are not repeated here.
Based on the same inventive concept, and corresponding to the method of any of the above embodiments, one or more embodiments of the present application also provide a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the gradient-guided speech recognition adversarial defense method described in any of the embodiments above.
The computer-readable media of the present embodiments include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
The storage medium of the above embodiment stores computer instructions for causing a computer to perform the gradient-guided speech recognition countermeasure method described in any one of the above embodiments, and has the beneficial effects of the corresponding method embodiments, which are not repeated here.
Those of ordinary skill in the art will appreciate that the discussion of any embodiment above is merely exemplary and is not intended to suggest that the scope of the application (including the claims) is limited to these examples. Within the spirit of the application, features of the above embodiments, or of different embodiments, may be combined, steps may be implemented in any order, and there are many other variations of the different aspects of one or more embodiments of the application as described above, which are not provided in detail for the sake of brevity.
Additionally, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown within the provided figures, in order to simplify the illustration and discussion, and so as not to obscure one or more embodiments of the application. Furthermore, the apparatus may be shown in block diagram form in order to avoid obscuring the embodiment(s) of the present application, and also in view of the fact that specifics with respect to implementation of such block diagram apparatus are highly dependent upon the platform on which the embodiment(s) of the present application are to be implemented (i.e., such specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the application, it should be apparent to one skilled in the art that one or more embodiments of the application can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative in nature and not as restrictive.
While the application has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of those embodiments will be apparent to those skilled in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the embodiments discussed.
The present application is intended to embrace all such alternatives, modifications and variances which fall within the broad scope of the appended claims. Accordingly, any omissions, modifications, equivalents, improvements and others which are within the spirit and principle of the one or more embodiments of the application are intended to be included within the scope of the application.

Claims (6)

1. A gradient-guided speech recognition adversarial defense method, the method comprising:
calculating a loss function comprising a connectionist temporal classification (CTC) loss and an optimal transport loss, wherein in a supervised scenario the CTC loss is denoted by L_CTC(f(x), y), where x is the original audio input provided to the speech recognition model f and y is the corresponding transcription, and in an unsupervised scenario the optimal transport loss is denoted by L_OT, with L_OT = min_T ⟨T, B⟩, where T is the transport matrix solving the optimal transport problem and B is the transport cost matrix;
calculating the cosine distance C between samples, which represents the distance between the prediction f(x) for the clean sample and the prediction for the adversarial sample;
generating a new adversarial sample using the CTC loss and the optimal transport loss in combination, wherein the loss function used to generate the new adversarial sample is L_new, calculated as:

L_new = L_CTC(f(x), y) + β·L_OT

where β is a weighting factor;
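As a rough illustration of how the combined loss L_new = L_CTC + β·L_OT might be assembled, the sketch below pairs a frame-wise negative log-likelihood (a simplified stand-in for the true CTC loss, which requires the full CTC forward algorithm) with an entropic Sinkhorn approximation of the optimal transport term min_T ⟨T, B⟩. All function and parameter names are illustrative, not taken from the patent:

```python
import numpy as np

def sinkhorn_ot(B, reg=0.05, n_iters=500):
    """Approximate L_OT = min_T <T, B> with uniform marginals (entropic OT)."""
    n, m = B.shape
    a = np.full(n, 1.0 / n)   # uniform source marginal
    b = np.full(m, 1.0 / m)   # uniform target marginal
    K = np.exp(-B / reg)      # Gibbs kernel of the cost matrix
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):  # alternating marginal-matching updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    T = u[:, None] * K * v[None, :]   # approximate transport plan
    return float(np.sum(T * B))       # transport cost <T, B>

def combined_loss(frame_logprobs, targets, B, beta=1.0):
    """L_new = supervised term + beta * L_OT.

    A frame-wise negative log-likelihood stands in for the CTC loss here,
    purely to keep the sketch self-contained.
    """
    nll = -np.mean([frame_logprobs[t, y] for t, y in enumerate(targets)])
    return nll + beta * sinkhorn_ot(B)
```

With β = 1 (as in claim 4 below) the two terms contribute equally; in practice β would be tuned so neither loss dominates.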
performing adversarial training on the speech recognition model f using the new adversarial sample;
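The adversarial training loop itself is not spelled out in the claim. A minimal sketch of the usual pattern (regenerate an adversarial sample at each step, then update the model on it) is shown below on a toy logistic model, with a one-step sign-gradient attack standing in for the audio attack; every name and value here is illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, eps):
    """One-step attack: move x along the sign of the input gradient
    of the logistic loss (a lightweight stand-in for an ASR attack)."""
    grad_x = (sigmoid(w @ x) - y) * w
    return x + eps * np.sign(grad_x)

def adversarial_train(X, Y, eps=0.1, lr=0.5, epochs=100):
    """Train on adversarial samples regenerated at every update."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, y in zip(X, Y):
            x_adv = fgsm(x, y, w, eps)            # attack current model
            w -= lr * (sigmoid(w @ x_adv) - y) * x_adv  # train on the attack
    return w
```

The same loop shape applies to the speech model: the inner attack would use L_new instead of the logistic loss.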
calculating the gradient using the following formula to identify the words most important to the classification:

f(x) = argmax_k g(E(x))_k

where E(x) = e_1, …, e_l converts each word x_i into an embedding e_i, g(·) is the upper layer that predicts from the word embeddings, the output of g(·) is a probability distribution over all classes, and g(·)_k denotes the probability of the k-th class;
the importance weight for each of the words is calculated by the following formula:
wherein e i Representing a sum of the word embedding, the location embedding and the tag type embedding;
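The importance-weight formula itself is not legible in this extraction. One common choice consistent with the gradient-guidance idea is the norm of the loss gradient with respect to each embedding row, sketched below with finite differences; this is purely illustrative and not necessarily the patent's exact formula:

```python
import numpy as np

def importance_weights(E, loss_fn, h=1e-5):
    """w_i = ||dL/de_i||: per-word gradient norm over the embedding rows,
    estimated by forward finite differences (illustrative choice)."""
    l, d = E.shape
    base = loss_fn(E)
    w = np.zeros(l)
    for i in range(l):
        g = np.zeros(d)
        for j in range(d):
            E_pert = E.copy()
            E_pert[i, j] += h            # nudge one embedding coordinate
            g[j] = (loss_fn(E_pert) - base) / h
        w[i] = np.linalg.norm(g)         # importance of word i
    return w
```

In a real model the gradient would come from autodiff rather than finite differences; the ranking of words is what matters for the masking step that follows.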
and using the w_i as weights, randomly sampling positions in the sentence, where γ is the masking ratio and l denotes the number of words; sampling the positions to obtain a position sequence, where Cat denotes a multinomial distribution and α is a hyper-parameter; replacing the position sequence with a special mask placeholder; and estimating the most likely sentence as:
where BERT(x) is the BERT language model and l denotes the number of words.
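The sampling step above (draw roughly γ·l positions, with probability weighted by the importance weights, then mask them) can be sketched as follows; `alpha` plays the role of the hyper-parameter α, and the names are illustrative:

```python
import numpy as np

def sample_mask_positions(weights, gamma, alpha=1.0, rng=None):
    """Sample ceil(gamma * l) positions to mask, without replacement,
    with probability proportional to w_i ** alpha (the Cat(...) step)."""
    if rng is None:
        rng = np.random.default_rng(0)
    w = np.asarray(weights, dtype=float) ** alpha  # temper by alpha
    p = w / w.sum()                                # normalize to a distribution
    l = len(w)
    k = int(np.ceil(gamma * l))                    # number of masked words
    return sorted(rng.choice(l, size=k, replace=False, p=p).tolist())
```

The selected positions would then be replaced by the mask placeholder and handed to BERT(x) to estimate the most likely sentence.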
2. The method of claim 1, wherein the formula for calculating C is as follows:

where f(x) denotes the prediction for the clean sample, f(x̃) denotes the prediction for the adversarial sample x̃, and x is the original audio input provided to the speech recognition model f.
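The exact formula for C is not legible in this extraction; a common cosine-distance definition consistent with the description (one minus the cosine similarity of the two prediction vectors) would look like:

```python
import numpy as np

def cosine_distance(p_clean, p_adv):
    """C = 1 - cos(f(x), f(x_adv)): 0 for identical predictions,
    1 for orthogonal ones, up to 2 for opposite ones."""
    p_clean = np.asarray(p_clean, dtype=float)
    p_adv = np.asarray(p_adv, dtype=float)
    cos = p_clean @ p_adv / (np.linalg.norm(p_clean) * np.linalg.norm(p_adv))
    return float(1.0 - cos)
```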
3. The gradient-guided speech recognition adversarial defense method of claim 1 or 2, wherein generating a new adversarial sample using the CTC loss and the optimal transport loss in combination comprises:
iterating with the following formula to generate a new adversarial sample, wherein the new adversarial sample is a mixed adversarial sample:

where x̃ denotes the adversarial sample of the original audio input x, and t denotes the number of iterations.
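The iteration formula itself is not legible here, but the described loop matches the familiar projected sign-gradient pattern, sketched below for a generic gradient function; `grad_fn`, `eps`, `alpha`, and `steps` are illustrative names and values, not taken from the patent:

```python
import numpy as np

def pgd_iterate(x, grad_fn, eps=0.3, alpha=0.05, steps=10):
    """x_{t+1} = clip_eps(x_t + alpha * sign(grad L_new(x_t))):
    repeated sign-gradient ascent on the loss, projected back into
    an eps-ball around the original input x."""
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))  # ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)         # project to eps-ball
    return x_adv
```

In the method above, `grad_fn` would be the gradient of L_new with respect to the audio input.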
4. The gradient-guided speech recognition adversarial defense method of claim 3, wherein β is set to 1 to balance the CTC loss and the optimal transport loss.
5. An electronic device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the method according to any one of claims 1 to 4 when executing the computer program.
6. A computer readable storage medium storing one or more programs executable by one or more processors to implement the method of any of claims 1-4.
CN202311154761.8A 2023-09-08 2023-09-08 Voice recognition countermeasure method and device combined with gradient guidance Active CN116913259B (en)

Publications (2)

Publication Number | Publication Date
CN116913259A (en) | 2023-10-20
CN116913259B (en) | 2023-12-15


