CN115941243A - Network virus propagation defense method, device and equipment based on reinforcement learning
- Publication number: CN115941243A
- Application number: CN202211240445.8A
- Authority: CN (China)
- Prior art keywords: reinforcement learning, training, server, model, defense
- Priority/filing date: 2022-10-11
- Publication date: 2023-04-07
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The invention provides a network virus propagation defense method, device and equipment based on reinforcement learning. A target network is abstracted into a two-dimensional space Graph form and stored as training data; a reinforcement learning training model is constructed and its training environment is defined, the training environment comprising training environment rules, server states, intruder attack rules and training end conditions; the training model is trained to obtain a reinforcement learning defense model; the defense model is deployed to a management server, the server states in the target network are input to it, and the servers in its output result are taken offline. In this way, provided that the virus attack can be detected, the server cluster is protected macroscopically by considering only the reaction speeds of the attacking and defending parties rather than defense or attack strength, so that the whole complex network keeps operating even when a small number of computers are invaded.
Description
Technical Field
The present invention relates generally to the field of network security, and more particularly, to a network virus propagation defense method, apparatus and device based on reinforcement learning.
Background
Reinforcement learning is a machine learning training method based on rewarding desirable behavior and/or penalizing undesirable behavior. In general, a reinforcement learning agent can sense and interpret the environment it is in and, by continually trying and learning from its errors, eventually arrive at a strategy that achieves its goal. As a hot field of artificial intelligence, reinforcement learning has excellent development prospects for preventing and dealing with hacker attacks.
Reinforcement learning can be used for hacker attack-and-defense simulation, but how to apply reinforcement learning algorithms to this field is still at an exploratory stage. The verified simulation experiments use a minimal simulated network, the "standard network", and simulate the virus defense-and-attack process as follows: the virus starts to intrude from a START point, successively attacks servers and seizes control of them, eventually compromising the server where the important asset is located and obtaining the target resource. The AI prevents the virus intrusion by adjusting the defense strength and detection strength of the different servers. This design is somewhat representative and demonstrates the feasibility of reinforcement learning on networks.
However, this design of defense strength and detection strength does not match the real world: as long as the same defense AI is installed on every server, the defense strength of all servers is identical and does not change; and a hacker attack on a server has only two outcomes, failure or success, with the attack always completed in an instant. In addition, the existing design has to account for attack intensity, system defense intensity and several other factors, so the logic of the defense process is complex.
Disclosure of Invention
According to the embodiment of the invention, a network virus propagation defense scheme based on reinforcement learning is provided. On the premise that virus attack can be detected, the server cluster is protected macroscopically only by considering the reaction speeds of the attacking and defending parties and not considering defense or attack strength, so that the whole complex network can still operate when a few computers in the complex network are invaded.
In a first aspect of the invention, a network virus propagation defense method based on reinforcement learning is provided. The method comprises the following steps:
abstracting a target network into a two-dimensional space Graph form and storing it as training data;
constructing a reinforcement learning training model, and defining a training environment of the reinforcement learning training model; the training environment of the reinforcement learning model comprises: training environment rules, server states, intruder attack rules and training end conditions;
training the reinforcement learning training model under the training environment of the reinforcement learning training model to obtain a reinforcement learning defense model;
and deploying the reinforcement learning defense model to a management server, inputting the server states in the target network to the reinforcement learning defense model on the management server, and taking the servers in the output result offline.
Further, the training environment rule includes:
randomly assigning a server to the hacker as an initial intrusion point; and
randomly distributing virtual assets on one or more servers in the target network.
Further, the server state includes: an invaded state, a non-invaded state and an offline state; wherein,
the invaded state means the server has been invaded by an intruder after the intruder attack rule is executed;
the offline state means the server is disconnected from the other servers in the network;
the non-invaded state means the server has neither been invaded by an intruder nor been taken offline.
Further, the intruder attack rules include:
from the initial intrusion point, invading adjacent servers step by step, a certain number of servers per step;
the connected invaded servers form an intrusion area, and the number of servers invaded in each step is positively correlated with the boundary length of the intrusion area; the boundary length of the intrusion area is the number of non-invaded servers that can be invaded from the intrusion area;
when a server is invaded, a reward is obtained, and the reward is negative.
Further, the training end condition is as follows:
the server where the virtual asset is located is invaded by an invader, or the boundary length of the invaded area is 0.
Further, the method further comprises:
screening networks according to network complexity, and taking a network whose complexity is greater than a preset complexity threshold as the target network.
Further, the method further comprises: storing the target network in two-dimensional space Graph form through an adjacency matrix or an adjacency list;
calculating the memory space occupied by the adjacency matrix; if this space is not larger than the remaining free memory of the system, storing the target network in Graph form through the adjacency matrix; otherwise, storing it through the adjacency list.
In a second aspect of the present invention, a network virus propagation defense device based on reinforcement learning is provided. The device includes:
the abstract storage module is used for abstracting the target network into a two-dimensional space Graph form and storing it as training data;
the model construction module is used for constructing a reinforcement learning training model and defining the training environment of the reinforcement learning training model; the training environment of the reinforcement learning model comprises: training environment rules, server states, intruder attack rules and training end conditions;
the model training module is used for training the reinforcement learning training model under the training environment of the reinforcement learning training model to obtain a reinforcement learning defense model;
and the deployment defense module is used for deploying the reinforcement learning defense model to a management server, where the server states in the target network are input to the reinforcement learning defense model and the servers in the output result are taken offline.
In a third aspect of the invention, an electronic device is provided. The electronic device comprises at least one processor and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect of the invention.
In a fourth aspect of the invention, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of the first aspect of the invention.
It should be understood that the statements herein reciting aspects are not intended to limit the critical or essential features of any embodiment of the invention, nor are they intended to limit the scope of the invention. Other features of the present invention will become apparent from the following description.
Drawings
The above and other features, advantages and aspects of various embodiments of the present invention will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, like or similar reference characters designate like or similar elements, and wherein:
FIG. 1 shows a schematic diagram of network complexity versus cost;
FIG. 2 is a flow chart of a reinforcement learning-based network virus propagation defense method according to an embodiment of the present invention;
FIG. 3 shows a schematic diagram of a network abstraction into a two-dimensional spatial Graph form according to an embodiment of the invention;
FIG. 4 is a block diagram of a reinforcement learning-based network virus propagation defense apparatus according to an embodiment of the present invention;
FIG. 5 illustrates a block diagram of an exemplary electronic device capable of implementing embodiments of the present invention;
the numeral 500 denotes an electronic device, 501 denotes a CPU, 502 denotes a ROM, 503 denotes a RAM, 504 denotes a bus, 505 denotes an I/O interface, 506 denotes an input unit, 507 denotes an output unit, 508 denotes a storage unit, and 509 denotes a communication unit.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
In addition, the term "and/or" herein is only one kind of association relationship describing an associated object, and means that there may be three kinds of relationships, for example, a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship.
On the premise of detecting virus attack, the invention only considers the reaction speed of the attacking and defending parties and does not consider defense or attack strength, and macroscopically protects the server group, so that the whole complex network can still operate when a few computers are invaded.
The main advantage of reinforcement learning is that it can find the optimal solution to a problem in a complex environment. The "standard network", however, has too few servers and too simple a structure to prove whether reinforcement learning remains effective on complex network structures, and it contains no nodes that use technologies such as honeypots. A network that is too simple is therefore not suited to the embodiments of the present invention; the invention is applicable to networks with a certain complexity.
For a given network, defense can be implemented either by programming a state machine or by using a neural network.
As shown in fig. 1, when defending with a state machine, the programming cost is proportional to the complexity of the network: the more complex the network structure, the more complex the state machine and the higher the programming cost. When defending with a neural network, the cost does not change with the complexity of the network structure. The two cost curves therefore intersect at a point A, and for complexity greater than A the present invention is applicable.
First, it is determined whether the network has sufficient complexity: networks are screened according to network complexity, and a network whose complexity is greater than a preset complexity threshold is taken as the target network. The complexity threshold is the complexity at point A. Of course, the embodiments of the present invention may also be applied to networks whose complexity is below point A, but for those a state machine is more cost-effective and faster.
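As an illustration only, the screening step might look like the following minimal sketch; the complexity metric and the threshold value standing in for point A are assumptions made for the example, not part of the invention.

```python
# Sketch: screen candidate networks by complexity (hypothetical metric).
# The threshold plays the role of point A in Fig. 1; both the metric and
# the threshold value are illustrative assumptions.
def network_complexity(adjacency: dict) -> float:
    """Simple stand-in metric: number of servers plus number of connections."""
    nodes = len(adjacency)
    edges = sum(len(neighbors) for neighbors in adjacency.values()) // 2
    return nodes + edges

def select_target_networks(networks: dict, threshold: float) -> dict:
    """Keep only the networks whose complexity exceeds the preset threshold."""
    return {name: adj for name, adj in networks.items()
            if network_complexity(adj) > threshold}
```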
For the target network after complexity screening, the following method may be performed.
Fig. 2 shows a flowchart of a network virus propagation defense method based on reinforcement learning according to an embodiment of the present invention.
The method comprises the following steps:
s201, abstracting the target network into a two-dimensional space Graph form to be used as training data to be stored.
In reality, servers are connected with each other to transmit information, the location of the servers may be anywhere, one server may be connected with several other servers, in order to analyze the relationship between the servers, it is necessary to abstract the servers into a two-dimensional space Graph form, each server serves as a point, and the connection between the servers is regarded as a line, as shown in fig. 3. Wherein, A, B, C, D, E are respectively a server, forming a point in two-dimensional space.
As an embodiment of the invention, the target network in the form of a two-dimensional space Graph is stored by means of an adjacency matrix or an adjacency table. The method specifically comprises the following steps:
calculating the memory space occupied by the adjacent matrix, and if the memory space is not larger than the rest free memory space of the system, storing a target network in a two-dimensional space Graph form through the adjacent matrix; otherwise, storing the target network in the form of the two-dimensional space Graph through the adjacency list. The memory space required by the adjacency matrix may be calculated, and if the remaining free memory space of the system is greater than or equal to the required memory, the memory space is "sufficient".
The space required for storage increases with the complexity of the network. Meanwhile, the storage space of the adjacency matrix is proportional to the square of the number of servers, and the storage space of the adjacency list is proportional to the number of connections between the servers. Therefore, for a sparse matrix, the occupied space of the adjacent matrix is larger than that of the adjacent table, especially for an irregular network structure, the volume of the adjacent matrix is much larger than that of the adjacent table, but the adjacent matrix is more suitable for being input into a neural network for calculation in theory, and the adjacent table needs certain processing; therefore, if the memory space is enough, the adjacent matrix is used, and the training process is more convenient; the adjacency list is used when the memory is limited.
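A minimal sketch of this storage choice is given below; the one-byte-per-cell size estimate and passing the free memory in as a parameter are assumptions made for illustration.

```python
# Sketch: choose between adjacency matrix and adjacency list depending on
# available memory. The ~1 byte per matrix cell estimate is an assumption.
def build_graph_storage(edges, num_servers, free_memory_bytes):
    """edges: iterable of (u, v) server-index pairs; returns (kind, structure)."""
    matrix_bytes = num_servers * num_servers  # dense matrix, ~1 byte per cell
    if matrix_bytes <= free_memory_bytes:
        matrix = [[0] * num_servers for _ in range(num_servers)]
        for u, v in edges:
            matrix[u][v] = matrix[v][u] = 1
        return "adjacency_matrix", matrix
    adjacency = {i: [] for i in range(num_servers)}
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)
    return "adjacency_list", adjacency
```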
S202, constructing a reinforcement learning training model, and defining a training environment of the reinforcement learning training model; the training environment of the reinforcement learning model comprises: training environment rules, server states, intruder attack rules, and training end conditions.
Reinforcement Learning (RL) is a field of machine learning primarily concerned with how an agent should take actions in an environment so as to maximize cumulative reward. Reinforcement learning is one of the three basic paradigms of machine learning, alongside supervised learning and unsupervised learning. It differs from supervised learning in that it needs neither labeled input-output pairs nor explicit correction of suboptimal behavior; instead, its focus is on finding a balance between exploration (of the unknown) and exploitation (of prior knowledge). Partially supervised RL algorithms combine the advantages of supervised and RL methods. The environment is typically formulated as a Markov Decision Process (MDP), because many reinforcement learning algorithms for this setting use dynamic programming techniques. The main difference between classical dynamic programming methods and reinforcement learning algorithms is that the latter do not assume knowledge of an exact mathematical model of the MDP and target large MDPs for which exact methods become infeasible.
Because of its generality, reinforcement learning is studied in many disciplines, such as game theory, control theory, operations research, information theory, simulation-based optimization, multi-agent systems, swarm intelligence, and statistics. In the operations research and control literature, reinforcement learning is referred to as approximate dynamic programming, or neuro-dynamic programming. The problems of interest in reinforcement learning have also been studied in optimal control theory, which is mainly concerned with the existence and characterization of optimal solutions and with algorithms for computing them exactly, and less with learning or approximation, especially in the absence of a mathematical model of the environment. In economics and game theory, reinforcement learning can be used to explain how equilibrium arises under bounded rationality.
The data of the reinforcement learning process take the form of a Markov chain: a loop of state-action-reward-state. By granting the AI rewards for different state-action pairs, the AI automatically optimizes its behavior. The goal of reinforcement learning is for the agent to learn an optimal or near-optimal policy that maximizes the "reward function", or another user-provided reinforcement signal, accumulated from the instant rewards. This is similar to processes that occur in animal psychology.
As an embodiment of the invention, the training is performed using a DQN reinforcement learning algorithm.
DQN algorithm:
reinforcement learning algorithms can be divided into three major categories: value based, policy based and operator critic. Value-based algorithms, represented by DQN, are common, which have only one value function network and no policy network, and operator-critical algorithms, represented by DDPG, TRPO, which have both value function and policy networks.
DQN algorithm principle:
DQN reinforcement learning is an iterative process; each iteration solves two problems: evaluating the value function for the given policy, and updating the policy according to that value function.
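For illustration, a minimal sketch of a DQN-style value network and one temporal-difference update step is given below (PyTorch). The architecture, the hyperparameters, and the state encoding (one state value per server as input, one Q-value per candidate "take server i offline" action as output) are assumptions made for the example, not prescribed by the invention.

```python
# Sketch of a DQN-style value network and one TD update step (PyTorch).
# Architecture, hyperparameters and state encoding are illustrative assumptions.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, num_servers: int):
        super().__init__()
        # Input: one state value per server; output: one Q-value per
        # candidate action (take server i offline).
        self.net = nn.Sequential(
            nn.Linear(num_servers, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, num_servers),
        )

    def forward(self, state):
        return self.net(state)

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One TD update: Q(s,a) is regressed toward r + gamma * max_a' Q_target(s',a')."""
    # states/next_states: float tensors, actions: int64 tensor,
    # rewards/dones: float tensors (done flag as 0.0 or 1.0).
    states, actions, rewards, next_states, dones = batch
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * next_q * (1.0 - dones)
    loss = nn.functional.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice the batch would be sampled from an experience replay buffer and the target network synchronized with the Q-network periodically; both details are omitted from this sketch.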
As an embodiment of the present invention, training is performed using an A3C reinforcement learning algorithm.
The full name of the A3C algorithm is the Asynchronous Advantage Actor-Critic algorithm, a policy gradient algorithm in the field of reinforcement learning. In A3C, the critic learns while multiple actors train in parallel and periodically synchronize their global parameters; the gradients are computed in the manner of parallel stochastic gradient descent (SGD).
If the network is small, the value-based DQN reinforcement learning algorithm can be used, which saves time. If the network is large, the policy-based A3C reinforcement learning algorithm avoids the huge action space, and the more complex the network, the more time is saved.
As an embodiment of the present invention, the training environment rule includes:
randomly assigning a server to the hacker as the initial intrusion point; and
randomly distributing virtual assets across one or more servers in the target network.
The important assets are generally distributed to one server or a few servers for storage, and are set in advance.
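A minimal sketch of an environment reset implementing these rules follows; the number of asset servers and the choice to exclude the entry point from the asset placement are assumptions made for the example.

```python
# Sketch: environment reset under the training environment rules. Randomly
# pick the hacker's initial intrusion point and randomly place the virtual
# assets on one or more other servers (counts here are assumptions).
import random

def reset_environment(num_servers: int, num_asset_servers: int = 1):
    states = ["not_invaded"] * num_servers        # per-server state
    entry_point = random.randrange(num_servers)   # hacker's initial intrusion point
    states[entry_point] = "invaded"
    candidates = [i for i in range(num_servers) if i != entry_point]
    asset_servers = set(random.sample(candidates, num_asset_servers))
    return states, entry_point, asset_servers
```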
As an embodiment of the present invention, the server state includes: an invaded state, a non-invaded state and an offline state; wherein,
the invaded state means the server has been invaded by an intruder after the intruder attack rule is executed; a server in the invaded state can no longer perform offline defense.
The offline state means the server is disconnected from the other servers in the network; a server in the offline state can no longer be invaded by an intruder.
The non-invaded state means the server has neither been invaded by an intruder nor been taken offline. A server in the non-invaded state can therefore still be invaded by an intruder or be taken offline by the AI for defense.
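These three states and the actions still permitted in each can be encoded, for example, as in the following sketch; the names are illustrative only.

```python
# Sketch: the three server states and which actions remain possible in each.
from enum import Enum

class ServerState(Enum):
    NOT_INVADED = 0   # may still be invaded, and may still be taken offline by the AI
    INVADED = 1       # captured by the intruder; offline defense is no longer possible
    OFFLINE = 2       # disconnected from the network; can no longer be invaded

def can_be_invaded(state: ServerState) -> bool:
    return state is ServerState.NOT_INVADED

def can_go_offline(state: ServerState) -> bool:
    return state is ServerState.NOT_INVADED
```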
As an embodiment of the present invention, the rule of attack by intruders includes:
1) From the initial intrusion point, adjacent servers are invaded step by step, a certain number of servers per step. The initial intrusion point is randomly assigned by the training environment rules; random intrusion simulates the reality that a hacker cannot determine in advance which server contains the important assets.
2) The connected invaded servers form an intrusion area, and the number of servers invaded in each step is positively correlated with the boundary length of the intrusion area; the boundary length of the intrusion area is the number of non-invaded servers that can be invaded from the intrusion area.
Since each step of the intrusion targets servers connected to servers invaded in earlier steps, the invaded servers form an intrusion area made up of multiple invaded servers. The servers the intruder may invade next are determined by the outer servers of the current intrusion area, i.e., the intrusion targets are chosen from the non-invaded servers connected to the outer servers of the current intrusion area. As the intruder invades more servers in each step, the number of servers in the intrusion area grows and so does its boundary length; the more servers are invaded per step, the longer the boundary becomes, i.e., the number of servers invaded per step is positively correlated with the boundary length of the intrusion area. The boundary length of the intrusion area is the number of non-invaded servers that can be invaded from the intrusion area.
3) When a server is hacked, a reward is obtained and the reward is negative.
In this embodiment, the AI is trained by receiving rewards, which may be positive, negative or 0; a negative reward represents a penalty. The reward received is proportional to the number of servers invaded in each step. Specifically, the neural network is trained on a GPU using floating-point computation, in which the absolute values of the parameters are kept no greater than 1, so the reward value for an AI action error can be set to -1.
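In this sketch, one intruder step computes the boundary of the intrusion area, invades a number of boundary servers that grows with the boundary length, and returns a negative reward; the fraction of the boundary invaded per step and the -1 reward per invaded server are assumptions for the example.

```python
# Sketch: one intruder step. The boundary of the intrusion area is the set of
# non-invaded servers adjacent to an invaded server; the number of servers
# invaded per step grows with the boundary length (the fraction is an assumption).
import random

def boundary(adjacency, states):
    """Non-invaded servers reachable from an invaded server."""
    frontier = set()
    for server, state in enumerate(states):
        if state == "invaded":
            frontier.update(n for n in adjacency[server] if states[n] == "not_invaded")
    return sorted(frontier)

def intruder_step(adjacency, states, invade_fraction=0.5):
    """Invade part of the boundary; reward is -1 per newly invaded server."""
    frontier = boundary(adjacency, states)
    if not frontier:
        return 0, 0
    num_to_invade = max(1, int(len(frontier) * invade_fraction))
    newly_invaded = random.sample(frontier, min(num_to_invade, len(frontier)))
    for server in newly_invaded:
        states[server] = "invaded"
    reward = -len(newly_invaded)  # negative reward for each server lost
    return reward, len(frontier)
```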
As an embodiment of the present invention, the training end condition is:
the server where the virtual assets are located is invaded by an invader, or the boundary length of the invaded area is 0.
In this embodiment, two training end conditions are specified, which are:
1) The server where the virtual assets are located is invaded by an invader, namely, important assets in the server are stolen by the invader.
2) The boundary length of the intrusion area is 0, i.e., every server the intruder could reach next has already been invaded or taken offline, so the intruder cannot perform the next intrusion.
By specifying two types of training end conditions, the AI can be caused to end the training by the training end conditions.
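For illustration, the two end conditions can be checked as in the following sketch, which reuses the `boundary` helper from the intruder-step sketch above; the state and data-structure names are assumptions.

```python
# Sketch: the two training-end conditions described above.
def episode_done(states, asset_servers, adjacency):
    asset_stolen = any(states[s] == "invaded" for s in asset_servers)
    boundary_empty = len(boundary(adjacency, states)) == 0  # intruder cannot move on
    return asset_stolen or boundary_empty
```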
S203, training the reinforcement learning training model under the training environment of the reinforcement learning training model to obtain a reinforcement learning defense model.
In this embodiment, the two-dimensional space Graph form abstracted from the target network is used as training data, and the reinforcement learning training model is trained in the defined training environment according to the reinforcement learning algorithm. As training progresses, the probability that important assets are stolen keeps decreasing, the probability that all servers are invaded keeps decreasing, the point at which the intruder can no longer take the next intrusion step keeps moving earlier, and ultimately the number of invaded servers keeps decreasing.
For example, take time as one axis. In the first training episode, the intruder may be unable to take the next intrusion step only after 100 steps, i.e., interception is completed. In the second, the intruder may be stopped after 50 steps; in the third, after 20 steps. The time is the number of steps needed to complete the interception, and the smaller the number of steps, the earlier the interception. In principle, the fewer steps needed to complete the interception, the more servers remain in the non-invaded state.
During training, the number of servers invaded by the intruder and the number of servers taken offline by the AI in each step are both controllable. For example, if in each step the intruder invades 2 servers and the AI takes 3 servers offline, the speed ratio between the intruder and the AI is 2:3. The number of servers each side acts on per step represents its reaction speed.
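An illustrative outer training loop with such a speed ratio is sketched below; the `env` and `agent` interfaces, the action-selection call and the specific speeds are assumptions, and `agent.learn()` could be, for example, the DQN update sketched earlier.

```python
# Sketch: outer training loop. Per step the AI takes `defender_speed` servers
# offline and the intruder then invades up to `attacker_speed` boundary servers;
# only this speed ratio matters, not attack or defense strength.
def train(env, agent, episodes=1000, attacker_speed=2, defender_speed=3):
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Defender: the agent proposes `defender_speed` servers to take offline.
            actions = agent.select_actions(state, k=defender_speed)
            # Attacker: the environment lets the intruder invade `attacker_speed`
            # boundary servers and returns the (negative) reward.
            next_state, reward, done = env.step(actions, attacker_speed)
            agent.store(state, actions, reward, next_state, done)
            agent.learn()   # e.g. the DQN update sketched earlier
            state = next_state
```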
In the embodiment of the invention, the server cluster is protected macroscopically by considering only the reaction speeds of the attacking and defending parties and not defense or attack strength, which makes the method suitable for virus attack and defense on modern complex networks.
As an embodiment of the present invention, after training ends and the reinforcement learning defense model is obtained, the model's parameters in the simulation environment still need to be tested, such as the probability of successful interception and the number of servers remaining in the non-invaded state after a successful interception. Whether the reinforcement learning defense model meets the requirements is determined from these test parameters. Interception means that all servers connected to the invaded servers are in the offline state, so that the intruder cannot carry out the next intrusion.
S204, deploying the reinforcement learning defense model to a management server, inputting the server states of the target network to the model on the management server, and taking the servers in the output result offline; that is, the reinforcement learning defense model starts running as the operational defense program.
The management server does not participate in the attack and defense process, cannot be attacked by an invader, and cannot be offline by AI.
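As an illustration only, the run-time defense loop on the management server might look like the sketch below; the monitoring and offline-control interfaces (`read_server_states`, `take_offline`) and the polling interval are hypothetical placeholders, not interfaces defined by the invention.

```python
# Sketch: run-time defense loop on the management server. `read_server_states`
# and `take_offline` are hypothetical placeholders for the operator's tooling;
# `model.predict` stands for the trained reinforcement learning defense model.
import time

def run_defense(model, read_server_states, take_offline, poll_seconds=1.0):
    while True:
        states = read_server_states()               # current state of every server
        servers_to_offline = model.predict(states)  # servers the model says to cut off
        for server in servers_to_offline:
            take_offline(server)                    # disconnect the server from the network
        time.sleep(poll_seconds)
```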
According to the embodiment of the invention, reinforcement learning can be applied to virus attack and defense strategy calculation in the complex network, and on the premise of detecting virus attack, the server group is protected macroscopically by only considering the reaction speeds of the attack and defense parties and not considering defense or attack strength, so that the method is suitable for virus attack and defense of the modern complex network.
The embodiment of the invention considers the response speed of invasion and defense and the randomness of the invasion in the process of simulating virus attack and defense; the condition that the whole complex network can still operate when a small part of computers are invaded is considered; more complex networks are simulated, and the operation of the whole network is maintained as much as possible.
The invention can be extended to server clusters with non-matrix network structures and to the multi-agent field.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are exemplary embodiments and that the acts and modules illustrated are not necessarily required to practice the invention.
The above is a description of method embodiments, and the embodiments of the present invention are further described below by way of apparatus embodiments.
As shown in fig. 4, the apparatus 400 includes:
an abstract storage module 410, configured to abstract the target network into a two-dimensional space Graph form as training data for storage;
the model construction module 420 is configured to construct a reinforcement learning training model, and define a training environment of the reinforcement learning training model; the training environment of the reinforcement learning model comprises: training environment rules, server states, intruder attack rules and training end conditions;
the model training module 430 is configured to train the reinforcement learning training model in a training environment of the reinforcement learning training model to obtain a reinforcement learning defense model;
the deployment defense module 440 is configured to deploy the reinforcement learning defense model to a management server, where the server states in the target network are input to the reinforcement learning defense model and the servers in the output result are taken offline.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the described module may refer to the corresponding process in the foregoing method embodiment, and is not described herein again.
In the technical solution of the invention, the acquisition, storage and application of the personal information of the users involved all comply with the relevant laws and regulations and do not violate public order and good morals.
The invention also provides an electronic device and a readable storage medium according to the embodiment of the invention.
FIG. 5 shows a schematic block diagram of an electronic device 500 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
The device 500 comprises a computing unit 501 which may perform various suitable actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 502 or a computer program loaded from a storage unit 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the device 500 can also be stored. The calculation unit 501, the ROM 502, and the RAM 503 are connected to each other by a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
A number of components in the device 500 are connected to the I/O interface 505, including: an input unit 506 such as a keyboard, a mouse, or the like; an output unit 507 such as various types of displays, speakers, and the like; a storage unit 508, such as a magnetic disk, optical disk, or the like; and a communication unit 509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the device 500 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 501 may be a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The calculation unit 501 executes the respective methods and processes described above, such as the methods S201 to S204. For example, in some embodiments, methods S201-S204 may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into the RAM 503 and executed by the computing unit 501, one or more steps of the methods S201 to S204 described above may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured to perform the methods S201-S204 in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present invention may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server with a combined blockchain.
It should be understood that various forms of the flows shown above, reordering, adding or deleting steps, may be used. For example, the steps described in the present invention may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solution of the present invention can be achieved.
The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. A network virus propagation defense method based on reinforcement learning is characterized by comprising the following steps:
abstracting a target network into a two-dimensional space Graph form and storing it as training data;
constructing a reinforcement learning training model, and defining a training environment of the reinforcement learning training model; the training environment of the reinforcement learning model comprises: training environment rules, server states, intruder attack rules and training end conditions;
training the reinforcement learning training model under the training environment of the reinforcement learning training model to obtain a reinforcement learning defense model;
deploying the reinforcement learning defense model to a management server, inputting the server states in the target network to the reinforcement learning defense model on the management server, and taking the servers in the output result offline.
2. The method of claim 1, wherein the training environment rules comprise:
randomly assigning a server to the hacker as an initial intrusion point; and
randomly distributing virtual assets on one or more servers in the target network.
3. The method of claim 1, wherein the server state comprises: an invaded state, a non-invaded state and an offline state; wherein,
the invaded state means the server has been invaded by an intruder after the intruder attack rule is executed;
the offline state means the server is disconnected from the other servers in the network;
the non-invaded state means the server has neither been invaded by an intruder nor been taken offline.
4. The method of claim 1 or 3, wherein the intruder attack rules comprise:
from the initial intrusion point, invading adjacent servers step by step, a certain number of servers per step;
the connected invaded servers form an intrusion area, and the number of servers invaded in each step is positively correlated with the boundary length of the intrusion area; the boundary length of the intrusion area is the number of non-invaded servers that can be invaded from the intrusion area;
when a server is invaded, a reward is obtained, and the reward is negative.
5. The method of claim 4, wherein the training end condition is:
the server where the virtual asset is located is invaded by an invader, or the boundary length of the invaded area is 0.
6. The method of claim 1, further comprising:
screening networks according to network complexity, and taking a network whose complexity is greater than a preset complexity threshold as the target network.
7. The method of claim 1, further comprising: storing the target network in two-dimensional space Graph form through an adjacency matrix or an adjacency list;
calculating the memory space occupied by the adjacency matrix; if this space is not larger than the remaining free memory of the system, storing the target network in Graph form through the adjacency matrix; otherwise, storing it through the adjacency list.
8. A network virus propagation defense device based on reinforcement learning is characterized by comprising:
the abstract storage module is used for abstracting the target network into a two-dimensional space Graph form and storing it as training data;
the model building module is used for building a reinforcement learning training model and defining a training environment of the reinforcement learning training model; the training environment of the reinforcement learning model comprises: training environment rules, server states, intruder attack rules and training end conditions;
the model training module is used for training the reinforcement learning training model under the training environment of the reinforcement learning training model to obtain a reinforcement learning defense model;
and the deployment defense module is used for deploying the reinforcement learning defense model to a management server, where the server states in the target network are input to the reinforcement learning defense model and the servers in the output result are taken offline.
9. An electronic device comprising at least one processor; and
a memory communicatively coupled to the at least one processor; characterized in that,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
10. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-7.