WO2022252039A1 - Method and apparatus for adversarial attacking in deep reinforcement learning - Google Patents

Method and apparatus for adversarial attacking in deep reinforcement learning

Info

Publication number
WO2022252039A1
Authority
WO
WIPO (PCT)
Prior art keywords
policy
adversarial
state
reinforcement learning
environment
Prior art date
Application number
PCT/CN2021/097350
Other languages
English (en)
French (fr)
Inventor
Benyou QIAO
Hang SU
Jun Zhu
Bo Zhang
Ze CHENG
Yunjia WANG
Original Assignee
Robert Bosch Gmbh
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Robert Bosch Gmbh, Tsinghua University filed Critical Robert Bosch Gmbh
Priority to PCT/CN2021/097350 priority Critical patent/WO2022252039A1/en
Priority to CN202180098787.8A priority patent/CN117441168A/zh
Publication of WO2022252039A1 publication Critical patent/WO2022252039A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/12Computing arrangements based on biological models using genetic models
    • G06N3/126Evolutionary algorithms, e.g. genetic algorithms or genetic programming
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/01Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/01Probabilistic graphical models, e.g. probabilistic networks

Definitions

  • the present disclosure relates generally to machine learning, and more particularly, to adversarial attacking techniques for improving security in deep reinforcement learning (DRL) .
  • DRL: deep reinforcement learning
  • RL: reinforcement learning
  • AGI: artificial general intelligence
  • a method for adversarial attacking in deep reinforcement learning comprises: determining a function space of adversarial attacks with at least an adversarial noise level; determining a deceptive policy for deceiving an agent in the deep reinforcement learning based on an attacking target; obtaining an adversarial function from the function space by minimizing a difference between an attacked policy and the deceptive policy; and perturbing a state of an environment observed by the agent in the deep reinforcement learning based on the obtained adversarial function.
  • an apparatus for adversarial attacking in deep reinforcement learning comprises a memory and at least one processor coupled to the memory.
  • the at least one processor is configured to: determine a function space of adversarial attacks with at least an adversarial noise level; determine a deceptive policy for deceiving an agent in the deep reinforcement learning based on an attacking target; obtain an adversarial function from the function space by minimizing a difference between an attacked policy and the deceptive policy; and perturb a state of an environment observed by the agent in the deep reinforcement learning based on the obtained adversarial function.
  • a computer readable medium storing computer code for adversarial attacking in deep reinforcement learning.
  • the computer code when executed by a processor causes the processor to determine a function space of adversarial attacks with at least an adversarial noise level; determine a deceptive policy for deceiving an agent in the deep reinforcement learning based on an attacking target; obtain an adversarial function from the function space by minimizing a difference between an attacked policy and the deceptive policy; and perturb a state of an environment observed by the agent in the deep reinforcement learning based on the obtained adversarial function.
  • a computer program product for adversarial attacking in deep reinforcement learning comprises processor executable computer code for determining a function space of adversarial attacks with at least an adversarial noise level; determining a deceptive policy for deceiving an agent in the deep reinforcement learning based on an attacking target; obtaining an adversarial function from the function space by minimizing a difference between an attacked policy and the deceptive policy; and perturbing a state of an environment observed by the agent in the deep reinforcement learning based on the obtained adversarial function.
  • FIG. 1 illustrates a block diagram of an exemplary reinforcement learning system in accordance with one aspect of the present disclosure.
  • FIG. 2 illustrates an example of a high-risk, high-reward state in a Grid World based on reinforcement learning in accordance with one aspect of the present disclosure.
  • FIG. 3 illustrates a block diagram of an exemplary attacked reinforcement learning system in accordance with one aspect of the present disclosure.
  • FIG. 4 illustrates an exemplary relationship between attacked policy sets of different function spaces in accordance with one aspect of the present disclosure.
  • FIG. 5 illustrates a flow chart of a method for adversarial attacking in reinforcement learning in accordance with one aspect of the present disclosure.
  • FIG. 6 illustrates a block diagram of an apparatus for adversarial attacking in reinforcement learning in accordance with one aspect of the present disclosure.
  • Reinforcement learning is an area of machine learning concerned with how an agent ought to take actions in an environment so as to maximize cumulative reward.
  • the agent is placed in the environment. Taking a chess game as an example, the agent may be a player and the environment may be the chessboard. At any time, the environment is in a certain state, which is one of a set of possible states. In this example, the state refers to the layout of the chess pieces on the chessboard.
  • the agent can make a set of possible actions (legal moves of the chess pieces) according to a policy.
  • the policy may map a state of the environment to a certain action, and therefore determines the agent's behavior at a specific state.
  • the state of the environment may change accordingly, and the environment may give the agent a reward.
  • the reward only reflects an instant return at a certain state, while a long-term reward may be represented by a value function.
  • the value function may represent a sum of current rewards and subsequent rewards.
  • the goal of the agent may be to maximize the long-term total reward, such as winning the chess game.
  • the amount of the reward reflects the quality of the action.
  • the reward signal may be the main basis for changing the policy. If the action selected by a policy of the agent is of low return, then in the future, the policy may be changed to select other actions.
  • the behavior of the environment may be imitated by a model.
  • the model can predict the next state and the next reward. In reality, the model may or may not exist.
  • in reinforcement learning, when such a model exists, it is called model-based reinforcement learning; when it does not, it is called model-free reinforcement learning.
  • FIG. 1 illustrates a block diagram of an exemplary reinforcement learning system in accordance with one aspect of the present disclosure.
  • the reinforcement learning system 100 shown in FIG. 1 may comprise the environment 110 and the agent 120.
  • the agent 120 may be an industrial robot or a self-driving vehicle, and the environment 110 may be a particular environment in which the industrial robot or the self-driving vehicle works.
  • the agent 120 may interact with the environment 110. For example, at state s_t of the environment 110, the agent 120 may get a reward r_t for a previously performed action, and may perform an action a_t based on a policy π.
  • the environment 110 may then change into a new state s_{t+1} and give a reward r_{t+1} to the agent 120. Then, the agent 120 may proceed to interact with the environment 110 in this way, trying to maximize a total reward at least over a period of time.
  • the cyclical process in which the agent takes an action to change its state to obtain a reward and interacts with the environment may be represented by a Markov decision process (MDP) .
  • Markov decision process may comprise a finite number of states and actions.
  • the agent observes a state and executes an action, which incurs intermediate rewards to be maximized (or, in the inverse scenario, costs to be minimized) .
  • the reward and the successor state may depend only on the current state and the chosen action.
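  • By way of illustration only, the interaction loop described above may be sketched in Python as follows. The sketch assumes a Gym-style environment exposing reset() and step() and a policy callable mapping an observed state to an action; these names and interfaces are illustrative assumptions rather than part of the present disclosure.

      def run_episode(env, policy, gamma=0.99, max_steps=1000):
          """Roll out one episode and return the discounted total reward."""
          state = env.reset()
          total_reward, discount = 0.0, 1.0
          for _ in range(max_steps):
              action = policy(state)                     # the policy maps the observed state to an action
              state, reward, done, _ = env.step(action)  # the environment returns the next state and reward
              total_reward += discount * reward
              discount *= gamma                          # apply the discount factor
              if done:
                  break
          return total_reward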
  • DRL policies are also vulnerable to adversarial attacks. Recent research has revealed the potential for the vulnerabilities where an adversary may access the inputs to an RL system and implement malicious attacks to deceive a deep policy.
  • a DRL agent may take suboptimal action such that the performance of a trained agent may be degraded.
  • DRL model not only inherits the fundamental security issues in other machine learning techniques, but also possesses its unique issues.
  • an adversary may attack a DRL system by perturbing observation received by an agent in DRL.
  • the observation may refer to the observation of an agent on the state of environment in a DRL system, and may comprise information related to the state of environment.
  • the state of the environment, i.e., the current Markov state, sometimes may not be obtained directly.
  • the information of the environment may be obtained by observing the environment.
  • the state of the environment may be represented by the observation, i.e., the observed state.
  • SA-MDP: state-adversarial Markov decision process
  • the adversary only perturbs the observation of the agent’s state.
  • any attacker g: S → F(S) can perturb the state s observed by the agent to g(s), where F(S) is the set of all distributions on S.
  • the mapping B corresponds to the capability of the adversary: the attacker may only perturb a state s into the set B(s), where B(s) is usually a small set around the state s.
  • P: S × A → F(S) is a transition function, R is a reward function, and γ is a discount factor.
  • An agent acts following a policy π: S → F(A), π ∈ Π, where F(A) is the set of all distributions on A and Π is the policy set.
  • the adversary aims to minimize the expected total reward of π by applying the perturbations.
  • under an attacker g, the agent takes the action as a_t ∼ π(a | g(s_t)). With the notation of the attacked policy π_g, the goal of SA-MDP may be to minimize the expected total reward, i.e., to find min_g R(π_g), where R(π_g) is the expected total reward obtained under the attacked policy π_g.
  • SA-MDP shows the effectiveness of adversarial attack in DRL by misleading the agent to take a sub-optimal action.
  • the adversarial attack according to SA-MDP misleads the agent to a sub-optimal action, which may not minimize the expected reward of the attacked policy, especially in an environment with a high-risk, high-reward state.
  • although some other works provide a heuristic adversarial attack on a subset of time steps guided by the value function or the Q-value function, or mislead the agent to a predefined state by leveraging the Q-value function, there still exists a gap between an attacked policy and a target policy.
  • FIG. 2 illustrates an example of a high-risk, high-reward state in a Grid World based on reinforcement learning in accordance with one aspect of the present disclosure.
  • FIG. 2 uses a 4 ⁇ 4 Grid World to illustrate an example of a simple finite MDP.
  • the cells of the grid correspond to the states of the environment. Therefore, the environment may have 16 states.
  • 4 actions are possible: up, down, left, and right, which deterministically cause the agent to move one cell in the respective direction on the grid. Actions that would take the agent off the grid leave its location unchanged, but also result in a reward of -1. Other actions result in a reward of 0, except those that move the agent to states 210 and 220.
  • the agent may gain a reward of -1 when moved to state 210, and may gain a reward of +1 when moved to state 220.
  • the state 230 may be a high-risk, high-reward state, because the optimal action a* at this state may be moving right and a policy with the adversarial reward may also be moving right. With an appropriate noise level, an adversary that merely misleads the agent to a different action will thereby protect the victim from receiving the minimum reward, i.e., from reaching the state 210.
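  • For illustration, the Grid World dynamics described above may be sketched as follows. The coordinates chosen below for the -1 cell (state 210) and the +1 cell (state 220) are placeholders, since the exact cell positions are given by the figure rather than by the text.

      MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
      BAD_CELL, GOOD_CELL = (1, 3), (2, 3)   # stand-ins for states 210 (-1) and 220 (+1)

      def grid_step(cell, action):
          """Deterministic one-cell move on the 4x4 grid, returning (next_cell, reward)."""
          dr, dc = MOVES[action]
          nr, nc = cell[0] + dr, cell[1] + dc
          if not (0 <= nr < 4 and 0 <= nc < 4):   # off-grid: position unchanged, reward -1
              return cell, -1.0
          if (nr, nc) == BAD_CELL:
              return (nr, nc), -1.0
          if (nr, nc) == GOOD_CELL:
              return (nr, nc), +1.0
          return (nr, nc), 0.0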
  • the present disclosure follows the setting of state-adversarial Markov decision process framework and reformulates the SA-MDP in function spaces to better research the adversarial attacks in a unified framework.
  • FIG. 3 illustrates a block diagram of an exemplary attacked reinforcement learning system in accordance with one aspect of the present disclosure.
  • an RL system 300 comprising an environment 310 and an agent 320 is attacked by an attacker 330 with an adversarial function h.
  • the environment 310 and the agent 320 in RL system 300 may be equivalent to the environment 110 and the agent 120 in the RL system 100 as described above with reference to FIG. 1.
  • the attacker 330 may perturb the observations to be received by the agent 320. For example, at state s_t of the environment 310, the attacker 330 may apply an adversarial function h to the observation state s_t. Then, the observation state received by the agent 320, which may also be called the victim, may be perturbed to h(s_t). In this case, given a pre-trained policy π, which may be the same as the policy in FIG. 1, the victim agent 320 may perform an action a_t ∼ π(a | h(s_t)) based on the input observation h(s_t) and the policy π, yielding a sub-optimal action. Therefore, an attacked agent 340 may be recognized as an agent with an adversarial policy π_h.
  • an observation state s_t is input to the attacked agent 340, and the attacked agent 340 may consequentially behave as a_t ∼ π_h(a | s_t).
  • the environment 310 may then change into a new state s_{t+1} and give a reward r_{t+1}. It can be recognized that, since the action a_t in FIG. 3 may be different from the action a_t in FIG. 1, the consequent state s_{t+1} and reward r_{t+1} in FIG. 3 may also be different from the state s_{t+1} and reward r_{t+1} in FIG. 1.
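  • As a minimal sketch, the relationship π_h(a | s) = π(a | h(s)) may be expressed as a wrapper around the victim policy; victim_policy and adversarial_fn are hypothetical callables used only for illustration.

      def make_attacked_policy(victim_policy, adversarial_fn):
          """Attacked policy pi_h: the victim policy pi is unchanged, but it is evaluated
          on the perturbed observation h(s) instead of the true state s."""
          def attacked_policy(state):
              perturbed = adversarial_fn(state)   # h(s_t), constrained to the adversarial noise level
              return victim_policy(perturbed)     # a_t ~ pi(a | h(s_t)), i.e. pi_h(a | s_t)
          return attacked_policy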
  • an adversary of the present disclosure aims at minimizing the expected total reward of the victim policy π by applying to each state s an adversarial function h from an adversarial function space H = {h: S → S | ‖h(s) − s‖_p ≤ ε for all s ∈ S}, where the constant ε is the level of the adversarial noise that measures the ability of the adversary, and p refers to the L_p norm and can be any real number from 1 to infinity (including infinity).
  • the problem (1) may be reformulated as finding the optimal function h* ∈ H.
  • An attacked policy with an adversarial function h may be denoted by π_h: π_h(a | s) ≜ π(a | h(s)), i.e., the attacked agent with the attacked policy π_h(a | s) may behave substantially the same as the victim agent with the victim policy π(a | s) when the observed state s is perturbed to h(s). Therefore, the problem of finding the optimal function h* may be written as h* = arg min_{h ∈ H} R(π_h) = arg min_{h ∈ H} E[Σ_t γ^t r_t], where:
  • r_t is the reward at time step t,
  • γ^t is the corresponding discount factor, and
  • R(π_h) is the expected total reward of the attacked policy π_h.
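  • As an illustration of the constraint that defines the function space H, a perturbed observation may be projected back into the L_p ball of radius ε around the original observation. The sketch below covers only p = 2 and p = ∞ and is an editorial example, not part of the claimed method.

      import numpy as np

      def project_to_ball(perturbed, original, eps, p=np.inf):
          """Project a perturbed observation back into the L_p ball of radius eps
          around the original observation (only p = 2 and p = inf are sketched)."""
          delta = perturbed - original
          if p == np.inf:
              delta = np.clip(delta, -eps, eps)
          elif p == 2:
              norm = np.linalg.norm(delta)
              if norm > eps:
                  delta = delta * (eps / norm)
          else:
              raise NotImplementedError("only p = 2 and p = inf are sketched here")
          return original + delta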
  • the present disclosure provides three other function spaces to represent different types of adversaries: the first type of adversary misleads the agent to a sub-optimal action, the second type of adversary alternatively misleads the agent to the sub-optimal action or keeps the original action, and the third type of adversary lures the agent to a target trajectory or a given malicious policy.
  • the third adversary is generally much stronger with an appropriate noise level. It should be noted that the present disclosure will not be limited to these function spaces, and may also be applied to other types of alternative function spaces without departing from the spirit of the disclosure.
  • the present disclosure provides a function space H_0 to represent the adversarial functions h_0 ∈ H which generate the perturbed state at each state s by an attack algorithm.
  • the attack algorithm aims at finding an adversarial example within the adversarial noise level to maximize the distance between the victim policy π and the attacked policy π_h, i.e., to maximize the distance between the distributions π(· | s) and π(· | h_0(s)) subject to ‖h_0(s) − s‖_p ≤ ε.
  • the algorithm corresponds to an un-targeted attack for generating adversarial examples in supervised learning, and is usually implemented with the Fast Gradient Sign Method (FGSM).
  • FGSM: Fast Gradient Sign Method
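  • As a minimal sketch only, an FGSM-style un-targeted perturbation for the function space H_0 may look as follows, assuming a discrete-action victim policy network that outputs action logits and an L_∞ noise level; policy_net, state and eps are illustrative names. The single gradient step pushes the observation away from the action the victim would have taken, thereby (approximately) maximizing the distance between π(· | s) and π(· | h_0(s)).

      import torch
      import torch.nn.functional as F

      def fgsm_untargeted(policy_net, state, eps):
          """One-step FGSM sketch for H_0: move the observation in the direction that
          reduces the probability of the victim's original action."""
          with torch.no_grad():
              clean_action = policy_net(state).argmax(dim=-1)    # action the victim takes at s
          perturbed = state.clone().detach().requires_grad_(True)
          loss = F.cross_entropy(policy_net(perturbed), clean_action)
          loss.backward()
          # gradient ascent on the loss, clipped to the L_inf noise level eps
          return (state + eps * perturbed.grad.sign()).detach()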
  • the present disclosure provides a function space H_1 to represent the adversarial functions h_1 ∈ H which, at each state, either generate the perturbed state following the attack algorithm of the function space H_0 or keep the original state (and hence the original action), i.e., h_1(s) ∈ {h_0(s), s}.
  • the present disclosure further provides a function space H_2 to represent the adversarial functions h_2 ∈ H which generate the perturbed state at each state s by a targeted attack algorithm.
  • the targeted attack algorithm aims at finding an adversarial example within the adversarial noise level to minimize the distance between the target policy π′ and the attacked policy π_h, i.e., to minimize the distance between the distributions π′(· | s) and π(· | h_2(s)) subject to ‖h_2(s) − s‖_p ≤ ε, with the target policy π′ taken from Π_adv.
  • Π_adv is a policy set that is accessible to the adversary.
  • Π_adv denotes the knowledge of the adversary, e.g., the optimal policy π* ∈ Π_adv when the environment is accessible to the adversary.
  • the adversaries may provide different settings of the target policy π′ in different implementations. Without the limitation of the noise level, there always exists an adversary h_2 ∈ H_2 that minimizes the expected reward of the attacked policy by leveraging the trajectory with the minimum reward.
  • the target policy π′ at state s may be chosen by estimating the expected reward of the attacked policy with an action-value function Q(s, a).
  • the expected reward of the attacked policy may be estimated in a heuristic way.
  • for example, the target policy π′ satisfies π′(s) = arg min_{a ∈ A} Q(s, a), i.e., it deterministically selects the action with the minimum Q-value.
  • the optimal h makes the attacked policy π_h take the action arg min_{a ∈ A} Q(s, a).
  • the adversary with the target policy π′ optimizes the problem (3).
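  • A minimal sketch of this heuristic choice of the target policy is given below, assuming access to an action-value network q_net that outputs one Q-value per action; the names are illustrative only.

      import torch

      def q_based_target_action(q_net, state):
          """Heuristic sketch: the adversary's target is the action with the minimum
          estimated Q-value at the observed state (a deterministic target policy pi')."""
          with torch.no_grad():
              return q_net(state).argmin(dim=-1)   # arg min_a Q(s, a)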
  • although an adversary may hold this kind of estimation, it is more reasonable to estimate the expected reward of the attacked policy π_h itself. Therefore, a two-stage optimization method is provided below to solve this problem.
  • the present disclosure provides an adversarial attack to get the sub-optimal policy in a function space by a two-stage optimization, in the scenario where an adversary can manipulate the observations such that the victim is misled towards an adversarial reward (e.g., the opposite of the environmental reward) consequentially.
  • the original problem (2) may be difficult to solve, since the attacker is required to infer the environmental dynamics, and the exploration mechanism in DRL inevitably leads to a shift in the distribution of the states.
  • in the first stage, we may obtain a deceptive policy that explores the dynamics of the environment and discovers the “bad case” by redesigning the reward obtained by the adversarial agent.
  • in the second stage, we may manipulate the victim’s observations such that its behavior will imitate the behavior induced by the deceptive policy, which may lead the victim astray from the right trajectory.
  • the present disclosure may attempt to find a solution in the function space H_2.
  • a set of deceptive policies for deceiving a DRL model is denoted by Π_d.
  • the set of deceptive policies minimizes the total reward on the MDP, i.e., Π_d = arg min_{π ∈ Π} R(π).
  • An adversary can interact with the environment and learn a deceptive policy π⁻ ∈ Π_adv ∩ Π_d. The deceptive policy can also be specified if the adversary has some expert knowledge which can help the adversary to minimize the victim’s reward. Since the deceptive policy set Π_d is a subset of Π, the search space can be reduced by only considering the policies that can flip the reward signal for the victim, achieving a more efficient optimization.
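  • One possible realization of the reward flipping mentioned above is sketched below, assuming the classic 4-tuple Gym step interface; the wrapper name and the choice of training algorithm are illustrative assumptions.

      import gym

      class FlippedRewardEnv(gym.Wrapper):
          """Stage-one sketch: train the deceptive policy on the same environment but with
          the reward sign flipped, so that maximizing its return minimizes the victim's return."""
          def step(self, action):
              obs, reward, done, info = self.env.step(action)
              return obs, -reward, done, info

      # The deceptive policy pi^- may then be trained with any standard algorithm
      # (e.g. PPO or DQN) on FlippedRewardEnv(original_env), or specified from expert knowledge.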
  • in the second stage, the adversary may solve the problem (4) of finding an adversarial function h ∈ H that minimizes the expected total variation distance between the attacked policy and the deceptive policy over the visited states, e.g., min_{h ∈ H} E_{s ∼ d_π} [D_TV(π_h(s) ∥ π⁻(s))], where D_TV(· ∥ ·) is the total variation (TV) distance between two policy distributions.
  • π_h(s) and π⁻(s) are simplified notations for π_h(a | s) and π⁻(a | s), which are the distributions over the action a given a state s under the policies π_h and π⁻, respectively.
  • d_π is the distribution over future states visited under the corresponding policy.
  • the KL-divergence can be used to upper bound the TV distance.
  • the problem (4) may be reformulated with a new objective (5) based on the KL-divergence, i.e., minimizing the expected KL-divergence between the deceptive policy π⁻ and the attacked policy π_h over the visited states.
  • the problem (5) may, for example, be further reformulated as a per-state problem (6): for each state s, find a perturbed observation within the adversarial noise level that minimizes the KL-divergence between the deceptive policy π⁻(s) and the victim policy evaluated at the perturbed observation.
  • the solution of the problem (6) may belong to the function space H_2, and the target policy is π⁻.
  • the adversary may independently add a perturbation to each state s and treat π⁻(s) as a target distribution.
  • the Projected Gradient Descent (PGD) method for targeted attacks may be used, and the adversary may iteratively update the perturbed observation by taking a gradient step on the objective of the problem (6) and projecting the result back into the allowed perturbation set.
  • how the perturbation is computed depends on the type of norm. For the commonly used l_2-norm, the perturbation can be computed by applying PGD to the negative loss of equation (6), i.e., by a gradient step followed by a projection onto the l_2-ball of radius ε around the original observation.
  • a step size is used to control the distance between the resultant and the original observations.
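  • A minimal PGD sketch of this second stage is given below, assuming discrete-action victim and deceptive policy networks that output logits; the function and argument names, the step size and the iteration count are illustrative and would be tuned per application.

      import torch
      import torch.nn.functional as F

      def pgd_targeted(victim_net, deceptive_net, state, eps, step_size=0.01, n_steps=10):
          """PGD sketch for stage two: perturb the observation so that the victim's action
          distribution at the perturbed observation moves toward the deceptive policy's
          distribution at the true state, within an l_2 ball of radius eps."""
          with torch.no_grad():
              target = F.softmax(deceptive_net(state), dim=-1)          # pi^-(. | s), held fixed
          perturbed = state.clone().detach()
          for _ in range(n_steps):
              perturbed.requires_grad_(True)
              log_probs = F.log_softmax(victim_net(perturbed), dim=-1)  # log pi(. | s_hat)
              loss = F.kl_div(log_probs, target, reduction="batchmean") # KL(pi^- || pi(. | s_hat))
              grad = torch.autograd.grad(loss, perturbed)[0]
              with torch.no_grad():
                  perturbed = perturbed - step_size * grad              # gradient step lowering the KL
                  delta = perturbed - state
                  norm = delta.norm(p=2)
                  if norm > eps:                                        # project back onto the l_2 ball
                      delta = delta * (eps / norm)
                  perturbed = state + delta
          return perturbed.detach()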
  • FIG. 5 illustrates a flow chart of a method 500 for adversarial attacking in reinforcement learning in accordance with one aspect of the present disclosure.
  • the method 500 may be used for testing or evaluating the security of a policy of an agent in reinforcement learning by performing improved adversarial attacks.
  • the policy may be trained by various reinforcement learning algorithms.
  • the agent in the deep reinforcement learning may comprise an industrial robot or a self-driving vehicle.
  • the method 500 may determine a function space of adversarial attacks with at least an adversarial noise level.
  • the function space may be determined as comprising all possible adversarial functions which can generate perturbed states within an adversarial noise level around an original state, such as the function space H.
  • the adversarial noise level may be configured as different values, such as 0.005 or 0.0005, based on different application requirement and/or performance requirement.
  • the function space may comprise a set of adversarial functions which generate a perturbed state at each state of the environment by an algorithm aiming at finding an adversarial example with the adversarial noise level to maximize a difference between an attacked policy and a victim policy, such as the function space H_0.
  • the function space may comprise a set of adversarial functions which keep an original state at one or more states of the environment and generate a perturbed state at each of the other states of the environment by an algorithm aiming at finding an adversarial example with the adversarial noise level to maximize a difference between an attacked policy and a victim policy, such as the function space H_1.
  • the function space may comprise a set of adversarial functions which generate a perturbed state at each state of the environment by an algorithm aiming at finding an adversarial example with the adversarial noise level to minimize a difference between an attacked policy and a target policy, such as the function space H_2.
  • the difference between the attacked policy and the victim policy or target policy may comprise total variation distance or Kullback-Leibler divergence between distributions of these policies.
  • the function space may be determined based at least in part on an attacking target.
  • the method 500 may determine a deceptive policy for deceiving an agent in the deep reinforcement learning based on an attacking target.
  • the deceptive policy may be a policy that the adversary intends to mislead the victim agent into imitating. In other words, the victim agent’s behavior will imitate the behavior induced by the deceptive policy after being attacked.
  • the attacking target may comprise minimizing a total reward in the deep reinforcement learning. In other examples, the attacking target may comprise discounting the total reward to a certain extent, such as 10%, 25%, 50%, etc., as compared to the total reward under the victim policy. The attacking target may also comprise reducing the total reward below a certain threshold.
  • the deceptive policy may be trained by interacting with the environment.
  • the deceptive policy may be trained by an adversary with existing reinforcement learning algorithms, such as, Proximal Policy Optimization (PPO) and Deep Q Network (DQN) .
  • PPO: Proximal Policy Optimization
  • DQN: Deep Q Network
  • Another type of deceptive policy may be specified by an expert’s policy with domain knowledge which helps to reduce the reward.
  • An expert’s policy is a deterministic policy and may reduce the adversarial attack to a one-stage attack.
  • the deceptive policy may belong to an intersection of a set of policies which are accessible to an adversary and a set of policies which minimize the total reward in the deep reinforcement learning.
  • the method 500 may obtain an adversarial function from the function space by minimizing a difference between an attacked policy and the deceptive policy.
  • the difference between the attacked policy and the deceptive policy comprises Total Variation (TV) distance or Kullback-Leibler (KL) divergence between distributions of the attacked policy and the deceptive policy.
  • the expected TV-distance can be bounded by the KL-divergence.
  • the adversarial function may be obtained by optimizing one of the equations (4) - (6) as described above.
  • the method 500 may perturb a state of an environment observed by the agent in the deep reinforcement learning based on the obtained adversarial function.
  • the perturbed state of the environment may be generated based on the obtained adversarial function by a projected gradient descent optimization method or a fast gradient sign method.
  • the method 500 may significantly reduce the reward compared with existing attacking methods, and can provide a better understanding of the vulnerabilities of deep reinforcement learning.
  • FIG. 6 illustrates a block diagram of an apparatus 600 for adversarial attacking in reinforcement learning in accordance with one aspect of the present disclosure.
  • the apparatus 600 may comprise a memory 610 and at least one processor 620.
  • the processor 620 may be coupled to the memory 610 and configured to perform the method 500 described above with reference to FIG. 5.
  • the processor 620 may be a general-purpose processor, or may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, multiple microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
  • the memory 610 may store the input data, output data, data generated by processor 620, and/or instructions executed by processor 620.
  • a computer program product for adversarial attacking in reinforcement learning may comprise processor executable computer code for performing the method 500 described above with reference to FIG. 5.
  • a computer readable medium may store computer code for adversarial attacking in reinforcement learning, and the computer code, when executed by a processor, may cause the processor to perform the method 500 described above with reference to FIG. 5.
  • Computer-readable media includes both non-transitory computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. Any connection may be properly termed as a computer-readable medium. Other embodiments and implementations are within the scope of the disclosure.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
PCT/CN2021/097350 2021-05-31 2021-05-31 Method and apparatus for adversarial attacking in deep reinforcement learning WO2022252039A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2021/097350 WO2022252039A1 (en) 2021-05-31 2021-05-31 Method and apparatus for adversarial attacking in deep reinforcement learning
CN202180098787.8A CN117441168A (zh) 2021-05-31 2021-05-31 用于深度强化学习中的对抗性攻击的方法和装置

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/097350 WO2022252039A1 (en) 2021-05-31 2021-05-31 Method and apparatus for adversarial attacking in deep reinforcement learning

Publications (1)

Publication Number Publication Date
WO2022252039A1 true WO2022252039A1 (en) 2022-12-08

Family

ID=84323840

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/097350 WO2022252039A1 (en) 2021-05-31 2021-05-31 Method and apparatus for adversarial attacking in deep reinforcement learning

Country Status (2)

Country Link
CN (1) CN117441168A (zh)
WO (1) WO2022252039A1 (zh)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108322349A (zh) * 2018-02-11 2018-07-24 Zhejiang University of Technology Deep learning adversarial attack defense method based on generative adversarial networks
CN110020593A (zh) * 2019-02-03 2019-07-16 Tsinghua University Information processing method and apparatus, medium, and computing device
CN109902018A (zh) * 2019-03-08 2019-06-18 Tongji University Method for acquiring test cases for an intelligent driving system
US20200368906A1 (en) 2019-05-20 2020-11-26 Nvidia Corporation Autonomous vehicle simulation using machine learning
CN111325324A (zh) * 2020-02-20 2020-06-23 Zhejiang University of Science and Technology Deep learning adversarial example generation method based on a second-order method
CN112052456A (zh) * 2020-08-31 2020-12-08 Zhejiang University of Technology Multi-agent-based deep reinforcement learning policy optimization defense method

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102596738B1 (ko) * 2022-12-23 2023-10-31 중앙대학교 산학협력단 강화학습을 이용한 적대적 공격에 대한 보행인식 시스템의 자율적 의사결정 프레임워크

Also Published As

Publication number Publication date
CN117441168A (zh) 2024-01-23

Similar Documents

Publication Publication Date Title
Lin et al. On the robustness of cooperative multi-agent reinforcement learning
Miehling et al. A POMDP approach to the dynamic defense of large-scale cyber networks
Kiourti et al. Trojdrl: evaluation of backdoor attacks on deep reinforcement learning
Stellios et al. Assessing IoT enabled cyber-physical attack paths against critical systems
Applebaum et al. Analysis of automated adversary emulation techniques
Feng et al. A stackelberg game and markov modeling of moving target defense
Musman et al. A game oriented approach to minimizing cybersecurity risk
Li et al. Optimal timing of moving target defense: A Stackelberg game model
JP2022013823A (ja) 人工ニューラルネットワークによって分散型ネットワークの健全性ステータス(health status)を予測するための方法
CN115580430A (zh) 一种基于深度强化学习的攻击树蜜罐部署防御方法与装置
WO2022252039A1 (en) Method and apparatus for adversarial attacking in deep reinforcement learning
CN116582349A (zh) 基于网络攻击图的攻击路径预测模型生成方法及装置
Zheng et al. The Stackelberg equilibrium for one-sided zero-sum partially observable stochastic games
Li et al. Robust moving target defense against unknown attacks: A meta-reinforcement learning approach
Li et al. Efficient computation of discounted asymmetric information zero-sum stochastic games
CN106411923B (zh) 基于本体建模的网络风险评估方法
Vejandla et al. Evolving gaming strategies for attacker-defender in a simulated network environment
CN107888588B (zh) 一种指定目标结点集合的k最大概率攻击路径求解方法
Majadas et al. Disturbing reinforcement learning agents with corrupted rewards
Panagiota et al. TrojDRL: Trojan attacks on deep reinforcement learning agents. In Proc. 57th ACM/IEEE Design Automation Conference (DAC), March 2020
US20230328094A1 (en) System and method for graphical reticulated attack vectors for internet of things aggregate security (gravitas)
CN115063652A (zh) 一种基于元学习的黑盒攻击方法、终端设备及存储介质
CN108377238B (zh) 基于攻防对抗的电力信息网络安全策略学习装置及方法
Novoa et al. A Game-Theoretic Two-Stage Stochastic Programing Model to Protect CPS against Attacks.
Gao et al. Cooperative Backdoor Attack in Decentralized Reinforcement Learning with Theoretical Guarantee

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21943416

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202180098787.8

Country of ref document: CN

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21943416

Country of ref document: EP

Kind code of ref document: A1