CN114124784B - Intelligent routing decision protection method and system based on vertical federation

Info

Publication number
CN114124784B (application CN202210096691.4A)
Authority
CN
China
Prior art keywords
training, state data, model, client
Legal status
Active
Application number
CN202210096691.4A
Other languages
Chinese (zh)
Other versions
CN114124784A
Inventor
杨林 (Yang Lin)
高先明 (Gao Xianming)
冯涛 (Feng Tao)
张京京 (Zhang Jingjing)
陶沛琳 (Tao Peilin)
王雯 (Wang Wen)
Current Assignee
Institute of Network Engineering, Institute of Systems Engineering, Academy of Military Sciences
Original Assignee
Institute of Network Engineering, Institute of Systems Engineering, Academy of Military Sciences
Application filed by Institute of Network Engineering, Institute of Systems Engineering, Academy of Military Sciences
Priority to CN202210096691.4A
Publication of CN114124784A
Application granted
Publication of CN114124784B

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00 Routing or path finding of packets in data switching networks
    • H04L45/02 Topology update or discovery
    • H04L45/08 Learning-based routing, e.g. using neural networks or artificial intelligence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning


Abstract

The invention provides an intelligent routing decision protection method and system based on vertical federation. The method comprises the following steps: step S1, obtaining, through sampling, sampling state data of an agent in an application scene, dividing the sampling state data into N groups of sampling sub-state data and sending them to N clients respectively, where N ≥ 2 and N is a positive integer; step S2, each of the N clients generating feature data of the sampling sub-state data by using a constructed client model based on the received sampling sub-state data and sending the feature data to a server side; and step S3, the server side generating, by using a constructed server-side model, a routing decision for the overall task of the agent based on the N groups of feature data received from the N clients.

Description

Intelligent routing decision protection method and system based on vertical federation
Technical Field
The invention belongs to the field of data processing for intelligent routing, and particularly relates to an intelligent routing decision protection method and system based on vertical federation.
Background
Against the background that the objects connected by network systems have grown massive in number and their interconnection relations have become complicated, traditional routing decision methods based on manual configuration cannot produce an optimal routing decision within a limited time, which has prompted researchers to introduce artificial intelligence algorithms into the intelligent routing decision process. With the successful application of deep reinforcement learning in fields such as robot control, game playing, computer vision and autonomous driving, researchers have applied deep reinforcement learning to the field of intelligent routing decision, improving network traffic scheduling efficiency, the rationality of network resource allocation, and other aspects.
Although deep reinforcement learning can effectively improve the level of routing decision, its training process is easily attacked: the training-set data can be made abnormal, which in turn distorts the judgments or action selections the intelligent routing agent learns during training, until it finally learns actions in the direction of failure. In the field of security protection for intelligent routing decision models, model protection techniques oriented to deep reinforcement learning have made little new progress, and how to protect the security of intelligent routing decision models has become an important challenge in the field of security applications.
Disclosure of Invention
In order to solve the above technical problems, the invention provides an intelligent routing decision protection scheme based on vertical federation, which aims to protect a routing decision model based on deep reinforcement learning from being influenced by decision vulnerabilities or malicious attacks.
The first aspect of the invention discloses an intelligent routing decision protection method based on vertical federation. The method comprises the following steps:
step S1, obtaining, through sampling, sampling state data of an agent in an application scene, dividing the sampling state data into N groups of sampling sub-state data and sending them to N clients respectively, wherein N ≥ 2 and N is a positive integer;
step S2, each of the N clients generating feature data of the sampling sub-state data by using a constructed client model based on the received sampling sub-state data and sending the feature data to a server side;
and step S3, the server side generating, by using a constructed server-side model, a routing decision for the overall task of the agent based on the N groups of feature data received from the N clients.
According to the method of the first aspect of the present invention, in step S2, the N constructed client models have the same model structure; each client model includes two client submodels, the client submodels likewise have the same model structure, and each client submodel includes two fully-connected layers and two activation function layers.
According to the method of the first aspect of the present invention, in step S3, the server side performs splicing processing on the N groups of feature data received from the N clients to obtain complete feature data, and the server-side model generates a routing decision for the overall task of the agent from the complete feature data, where the server-side model includes a fully-connected layer and a Tanh activation function layer.
According to the method of the first aspect of the present invention, before the step S1 to the step S3, the method further comprises: step S0, pre-training the server-side model and the N client-side models, where the pre-training specifically includes:
step S0-1, acquiring training state data of the agent in the application scene through pre-sampling, wherein the training state data is divided into N groups of training sub-state data; adding interference noise representing a malicious attack to the k-th group of training sub-state data among the N groups, and then sending the k-th group and the other N-1 groups of training sub-state data to the N clients respectively, wherein 1 ≤ k ≤ N and k is a positive integer;
step S0-2, each of the N clients generates training feature data of the training sub-state data by using the client model based on the received training sub-state data, and sends the training feature data to the server;
step S0-3, the server side generates a routing decision for a training task of the agent based on the N groups of received training feature data from the N clients by using the server side model;
s0-4, acquiring a real decision of a training task of the agent, and calculating a loss function based on a routing decision of the training task and the real decision of the training task;
step S0-5, the loss function is fed back to the N clients, the N clients repeat the steps S0-1 to S0-4 after receiving the loss function until the calculated loss function is lower than a threshold, and then execute the steps S1 to S3 using the pre-trained server-side model and the N client-side models.
According to the method of the first aspect of the present invention, in said step S0-4:
the loss function is expressed using the following formula:

$$L(\theta) = L_{act}(\theta) + L_{dis}(\theta)$$

wherein $L_{act}(\theta)$ represents the loss function of the action network in the client model, $L_{dis}(\theta)$ represents the loss function of the discriminant network in the client model, and $\theta$ represents the model parameters of the client model;

the loss function of the action network is:

$$L_{act}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(\frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\,\hat{A}_t,\ \operatorname{clip}\left(\frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)},\, 1-\varepsilon,\, 1+\varepsilon\right)\hat{A}_t^{old}\right)\right]$$

wherein $\pi_{\theta}(a_t \mid s_t)$ represents the state transition probability of the action network, $\pi_{\theta_{old}}(a_t \mid s_t)$ represents the previous state transition probability of the action network, $\theta$ represents the current model parameters of the client model, $\theta_{old}$ represents the previous model parameters of the client model, $\operatorname{clip}(\cdot)$ represents the intercept (clipping) function that keeps values within the range $[1-\varepsilon,\, 1+\varepsilon]$, $\varepsilon$ represents a hyper-parameter, $\hat{A}_t$ represents the estimated advantage at time step $t$, and $\hat{A}_t^{old}$ represents the estimated advantage at time step $t$ under the previous model parameters of the client model;

the loss function of the discriminant network is:

$$L_{dis} = \mathbb{E}\left[\left(V_{tgt}(s,a) - V_{pred}(s,a)\right)^2\right]$$

wherein $V_{tgt}$ is the target value function, $V_{pred}$ is the predicted value, $s$ and $a$ respectively represent the state and the action, and $\gamma$ and $\lambda$ represent hyper-parameters.
According to the method of the first aspect of the invention, when the sampling state data and the training state data are obtained, a near-end policy optimization algorithm is adopted to collect the states, actions and reward values at a plurality of moments, specifically: at the first moment, the agent obtains state data from a simulation environment of the application scene, the action network makes a corresponding action based on the state data, and the judgment network gives a reward value for the action made by the action network; at every other moment, the state, action and reward value of that moment are acquired in the same way.
The second aspect of the invention discloses an intelligent routing decision protection system based on vertical federation. The system comprises:
the state sampling module is configured to acquire sampling state data of the agent in an application scene through sampling, the sampling state data is divided into N groups of sampling sub-state data and is respectively sent to N clients, and N is greater than or equal to 2 and is a positive integer;
the characteristic generating module is configured to generate characteristic data of the sampling sub-state data by utilizing the constructed client model based on the sampling sub-state data received by each of the N clients and send the characteristic data to the server;
a routing decision module configured to generate, by using the constructed server-side model, a routing decision for the overall task of the agent based on the N sets of feature data received by the server side from the N clients.
According to the system of the second aspect of the present invention, the N constructed client models have the same model structure, each client model includes two client submodels, each client submodel also has the same model structure, and each client submodel includes two fully-connected layers and two activation function layers.
According to the system of the second aspect of the present invention, the server side performs splicing processing on the N groups of feature data received from the N clients to obtain complete feature data and generates a routing decision for the overall task of the agent from the complete feature data, where the server-side model includes a fully-connected layer and a Tanh activation function layer.
According to the system of the second aspect of the invention, the system further comprises: a preprocessing module configured to pre-train the server-side model and the N client models, where the pre-training specifically includes:
acquiring training state data of the agent in the application scene through pre-sampling, wherein the training state data is divided into N groups of training sub-state data; adding interference noise representing a malicious attack to the k-th group of training sub-state data among the N groups, and then sending the k-th group and the other N-1 groups of training sub-state data to the N clients respectively, wherein 1 ≤ k ≤ N and k is a positive integer;
each client side in the N client sides generates training characteristic data of the training sub-state data by utilizing the client side model based on the received training sub-state data and sends the training characteristic data to the server side;
the server side generates a routing decision aiming at a training task of the agent based on N groups of training characteristic data received from the N clients by utilizing the server side model;
acquiring a real decision of a training task of the agent, and calculating a loss function based on a routing decision of the training task and the real decision of the training task;
and feeding the loss function back to the N clients, and repeating the steps after the N clients receive the loss function until the calculated loss function is lower than a threshold value.
According to the system of the second aspect of the invention, the loss function is expressed by the following formula:
$$L(\theta) = L_{act}(\theta) + L_{dis}(\theta)$$

wherein $L_{act}(\theta)$ represents the loss function of the action network in the client model, $L_{dis}(\theta)$ represents the loss function of the discriminant network in the client model, and $\theta$ represents the model parameters of the client model;

the loss function of the action network is:

$$L_{act}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(\frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\,\hat{A}_t,\ \operatorname{clip}\left(\frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)},\, 1-\varepsilon,\, 1+\varepsilon\right)\hat{A}_t^{old}\right)\right]$$

wherein $\pi_{\theta}(a_t \mid s_t)$ represents the state transition probability of the action network, $\pi_{\theta_{old}}(a_t \mid s_t)$ represents the previous state transition probability of the action network, $\theta$ represents the current model parameters of the client model, $\theta_{old}$ represents the previous model parameters of the client model, $\operatorname{clip}(\cdot)$ represents the intercept (clipping) function that keeps values within the range $[1-\varepsilon,\, 1+\varepsilon]$, $\varepsilon$ represents a hyper-parameter, $\hat{A}_t$ represents the estimated advantage at time step $t$, and $\hat{A}_t^{old}$ represents the estimated advantage at time step $t$ under the previous model parameters of the client model;

the loss function of the discriminant network is:

$$L_{dis} = \mathbb{E}\left[\left(V_{tgt}(s,a) - V_{pred}(s,a)\right)^2\right]$$

wherein $V_{tgt}$ is the target value function, $V_{pred}$ is the predicted value, $s$ and $a$ respectively represent the state and the action, and $\gamma$ and $\lambda$ represent hyper-parameters.
According to the system of the second aspect of the invention, when the sampling state data and the training state data are obtained, a near-end policy optimization algorithm is adopted to collect the states, actions and reward values at a plurality of moments, specifically: at the first moment, the agent obtains state data from a simulation environment of the application scene, the action network makes a corresponding action based on the state data, and the judgment network gives a reward value for the action made by the action network; at every other moment, the state, action and reward value of that moment are acquired in the same way.
The third aspect of the invention discloses an electronic device. The electronic device comprises a memory and a processor, the memory storing a computer program; when executing the computer program, the processor implements the steps of the intelligent routing decision protection method based on vertical federation in the first aspect of the present invention.
The fourth aspect of the invention discloses a computer-readable storage medium. The computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the intelligent routing decision protection method based on vertical federation of the first aspect of the present invention.
In summary, the technical scheme of the invention draws on the vertical federation model and its data protection function: a reinforcement learning framework based on vertical federation is designed, training of the model is split between local clients and a server side, the number of clients is arbitrary, and different clients train on different feature data, while the data uploaded to the server side contains only features. An attacker is thus confounded: even if it obtains the input and output of some client, it cannot reconstruct an equivalent of the overall policy model, because the input features are split across different clients for training. With the invention, an attacker can hardly steal the complete training task of the intelligent routing decision and cannot steal the whole intelligent routing decision model, thereby achieving the purpose of protecting the intelligent routing decision model.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the description in the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a flowchart of an intelligent routing decision protection method based on vertical federation according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a vertical federation architecture according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a near-end policy optimization algorithm according to an embodiment of the present invention;
FIG. 4 is a block diagram of a vertical federation based intelligent routing decision protection system in accordance with an embodiment of the present invention;
fig. 5 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The first aspect of the invention discloses an intelligent routing decision protection method based on vertical federation. Fig. 1 is a flowchart of an intelligent routing decision protection method based on vertical federation according to an embodiment of the present invention; as shown in fig. 1, the method includes: step S1, obtaining, through sampling, sampling state data of an agent in an application scene, dividing the sampling state data into N groups of sampling sub-state data and sending them to N clients respectively, where N ≥ 2 and N is a positive integer; step S2, each of the N clients generating feature data of the sampling sub-state data by using a constructed client model based on the received sampling sub-state data and sending the feature data to a server side; and step S3, the server side generating, by using a constructed server-side model, a routing decision for the overall task of the agent based on the N groups of feature data received from the N clients.
FIG. 2 is a schematic diagram of a vertical federation architecture according to an embodiment of the present invention; as shown in FIG. 2, the solid lines represent forward propagation and the dashed lines represent backward propagation, and the simulation environment may be any of various reinforcement learning scenarios.
In step S1, sampling status data of the agent in the application scenario is obtained through sampling, the sampling status data is divided into N groups of sampling sub-status data, and the N groups of sampling sub-status data are respectively sent to N clients, where N is greater than or equal to 2 and is a positive integer.
In step S2, each of the N clients generates feature data of the sampled sub-state data by using the constructed client model based on the received sampled sub-state data, and sends the feature data to the server.
In some embodiments, in the step S2, the N client models are constructed to have the same model structure, each client model includes two client submodels, each client submodel also has the same model structure, and each client submodel includes two fully-connected layers and two activation function layers.
Specifically, a traditional deep reinforcement learning model is divided into a plurality of clients and a server side; here two clients are designed. The sampled state is split and distributed to the clients so that the clients hold data with different characteristics, establishing a vertical federated environment. After receiving the data, a client performs data processing (feature extraction) locally; the feature extraction may adopt principal component analysis, multidimensional scaling analysis or the like, and the resulting features are output through the client model and sent to the server side. Then the client models and the server-side model are built: each client model has a consistent structure and consists of two submodels; the submodels have consistent structures, each comprising two fully-connected layers and two activation function layers; and the server-side model comprises a fully-connected layer.
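To make this structure concrete, the following is a minimal PyTorch sketch of one possible client model consistent with the description above; the class names, hidden width, and the choice of ReLU activations are illustrative assumptions, since the patent fixes only the layer counts:

```python
import torch
import torch.nn as nn

class ClientSubModel(nn.Module):
    """One client submodel: two fully-connected layers, each followed by
    an activation function layer, as specified above."""
    def __init__(self, in_dim: int, hidden_dim: int, feat_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, feat_dim),
            nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class ClientModel(nn.Module):
    """A client model consisting of two structurally identical submodels,
    interpreted here as the action-network branch and the
    discriminant-network branch of the local feature extractor."""
    def __init__(self, in_dim: int, hidden_dim: int = 64, feat_dim: int = 32):
        super().__init__()
        self.action_branch = ClientSubModel(in_dim, hidden_dim, feat_dim)
        self.discriminant_branch = ClientSubModel(in_dim, hidden_dim, feat_dim)

    def forward(self, sub_state: torch.Tensor):
        # Each branch maps the local sub-state to a feature vector that
        # is uploaded to the server side.
        return self.action_branch(sub_state), self.discriminant_branch(sub_state)
```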
In step S3, the server generates a routing decision for the overall task of the agent based on the N sets of feature data received from the N clients using the constructed server model.
In some embodiments, in step S3, the server side performs splicing processing on the N groups of feature data received from the N clients to obtain complete feature data, and the server-side model generates a routing decision for the overall task of the agent from the complete feature data, where the server-side model includes a fully-connected layer and a Tanh activation function layer.
Specifically, the feature information output by each client model is aggregated at the server side, where the aggregator performs a splicing operation on the features transmitted by the clients. The features output by the local models are uploaded to the server side; the server side aggregates the data using the aggregator and then feeds it into the server-side model for processing, so as to generate a routing decision for the overall task of the agent.
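A matching sketch of the server side follows, assuming the aggregator simply concatenates (splices) the uploaded features before the stated fully-connected layer and Tanh activation; the action dimensionality is an assumption:

```python
import torch
import torch.nn as nn

class ServerModel(nn.Module):
    """Server-side model: splice the N clients' feature vectors into the
    complete feature data, then apply one fully-connected layer and a
    Tanh activation to produce the routing decision."""
    def __init__(self, n_clients: int, feat_dim: int, action_dim: int):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(n_clients * feat_dim, action_dim),
            nn.Tanh(),
        )

    def forward(self, client_features: list[torch.Tensor]) -> torch.Tensor:
        # The aggregator performs the splicing (concatenation) operation
        # on the features uploaded by the clients.
        full_features = torch.cat(client_features, dim=-1)
        return self.head(full_features)
```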
In some embodiments, before the step S1 to the step S3, the method further comprises: step S0, pre-training the server-side model and the N client-side models, where the pre-training specifically includes:
step S0-1, acquiring training state data of the agent in the application scene through pre-sampling, wherein the training state data is divided into N groups of training sub-state data; adding interference noise representing a malicious attack to the k-th group of training sub-state data among the N groups, and then sending the k-th group and the other N-1 groups of training sub-state data to the N clients respectively, wherein 1 ≤ k ≤ N and k is a positive integer;
step S0-2, each of the N clients generates training feature data of the training sub-state data by using the client model based on the received training sub-state data, and sends the training feature data to the server;
step S0-3, the server side generates a routing decision for a training task of the agent based on the N groups of received training feature data from the N clients by using the server side model;
s0-4, acquiring a real decision of a training task of the agent, and calculating a loss function based on a routing decision of the training task and the real decision of the training task;
step S0-5, the loss function is fed back to the N clients, the N clients repeat the steps S0-1 to S0-4 after receiving the loss function until the calculated loss function is lower than a threshold, and then execute the steps S1 to S3 using the pre-trained server-side model and the N client-side models.
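As a schematic illustration of one pre-training round (steps S0-1 to S0-5), the following sketch reuses the hypothetical ClientModel and ServerModel classes from later in this description; the Gaussian noise model, the MSE loss standing in for the composite loss of step S0-4, and the optimizer handling are all illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def pretrain_round(state, real_decision, client_models, server_model,
                   optimizers, k: int, noise_std: float = 0.1) -> float:
    """One pre-training round: split the training state into N sub-states,
    add interference noise (representing a malicious attack) to the k-th
    group, run the vertical forward pass, and feed the loss back to all
    participants."""
    n = len(client_models)
    sub_states = list(torch.chunk(state, n, dim=-1))           # step S0-1: split
    sub_states[k] = sub_states[k] + noise_std * torch.randn_like(sub_states[k])
    features = [model(sub)[0]                                  # step S0-2: local features
                for model, sub in zip(client_models, sub_states)]
    decision = server_model(features)                          # step S0-3: routing decision
    # Step S0-4: compare against the real decision of the training task;
    # MSE is a stand-in for the composite action/discriminant loss.
    loss = F.mse_loss(decision, real_decision)
    for opt in optimizers:                                     # step S0-5: feed back
        opt.zero_grad()
    loss.backward()
    for opt in optimizers:
        opt.step()
    return loss.item()
```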
In some embodiments, in said step S0-4:
the loss function is expressed using the following formula:

$$L(\theta) = L_{act}(\theta) + L_{dis}(\theta)$$

wherein $L_{act}(\theta)$ represents the loss function of the action network in the client model, $L_{dis}(\theta)$ represents the loss function of the discriminant network in the client model, and $\theta$ represents the model parameters of the client model;

the loss function of the action network is:

$$L_{act}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(\frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\,\hat{A}_t,\ \operatorname{clip}\left(\frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)},\, 1-\varepsilon,\, 1+\varepsilon\right)\hat{A}_t^{old}\right)\right]$$

wherein $\pi_{\theta}(a_t \mid s_t)$ represents the state transition probability of the action network, $\pi_{\theta_{old}}(a_t \mid s_t)$ represents the previous state transition probability of the action network, $\theta$ represents the current model parameters of the client model, $\theta_{old}$ represents the previous model parameters of the client model, $\operatorname{clip}(\cdot)$ represents the intercept (clipping) function that keeps values within the range $[1-\varepsilon,\, 1+\varepsilon]$, $\varepsilon$ represents a hyper-parameter, $\hat{A}_t$ represents the estimated advantage at time step $t$, and $\hat{A}_t^{old}$ represents the estimated advantage at time step $t$ under the previous model parameters of the client model;

the loss function of the discriminant network is:

$$L_{dis} = \mathbb{E}\left[\left(V_{tgt}(s,a) - V_{pred}(s,a)\right)^2\right]$$

wherein $V_{tgt}$ is the target value function, $V_{pred}$ is the predicted value, $s$ and $a$ respectively represent the state and the action, and $\gamma$ and $\lambda$ represent hyper-parameters.
Specifically, consider the attacks that may exist in the testing stage: the trained models are distributed in various places and are difficult to manipulate simultaneously, so even if an attacker obtains one of the client models and adds noise to its input through various attack strategies, such an operation can hardly exert a large influence on the overall task. Accordingly, in the training phase, interference noise characterizing a malicious attack is added at one of the clients; in other embodiments, such interference noise may also be added at more than one client.
Each client model also updates its model parameters using the loss fed back by the server-side model. Although the training loss function of the server-side model is similar to that of a near-end Policy Optimization (PPO) model, the network models differ: here, both the action network and the evaluation network on the server side are constructed from one fully-connected layer plus a Tanh activation function.
In some embodiments, when the sampling state data and the training state data are obtained, a near-end policy optimization algorithm is adopted to collect the states, actions and reward values at a plurality of moments, specifically: at the first moment, the agent obtains state data from a simulation environment of the application scene, the action network makes a corresponding action based on the state data, and the judgment network gives a reward value for the action made by the action network; at every other moment, the state, action and reward value of that moment are acquired in the same way.
Specifically, taking PPO as an example, an observation data set is generated. FIG. 3 is a schematic structural diagram of the near-end policy optimization algorithm (PPO) according to an embodiment of the present invention; as shown in fig. 3, reinforcement learning mainly optimizes decisions continuously by observing the surrounding environment, taking optimal actions, and obtaining feedback. State, action and reward value tuples $\{(s_t, a_t, r_t)\}_{t=1}^{N}$ are collected from the training scene at N moments, and this data set is taken as the sample set to be trained. The target model is a deep reinforcement learning (DRL) model based on the PPO algorithm, and attack defense is performed on the basis of this model; the DRL model based on the PPO algorithm is shown in fig. 2. The model decision process is described by the tuple $(S, A, P, R, \gamma)$, in which $S$ is a finite set of states, $A$ is a finite set of actions, $P$ is the state transition probability, $R$ is the reward function, and $\gamma$ is the discount factor used to calculate the long-term cumulative return. During DRL model training the agent needs to interact with the environment continuously: in the current state $S_t$ the agent takes an action $A_t$ according to the learned policy, and at the same time the environment feeds back a reward value $R_{t+1}$ to the agent to evaluate the quality of the current action. PPO uses importance sampling to solve the problem that sampling from a target distribution directly may be difficult: samples are instead drawn from another distribution that is easy to sample. When PPO combines importance sampling with the action-discrimination (actor-critic) framework, the agent consists of two parts: an action part, responsible for interacting with the environment to collect samples, and a discrimination part, responsible for judging the quality of actions.
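A minimal sketch of this sampling loop under the vertical split follows, assuming a Gymnasium-style environment interface and the hypothetical models sketched above:

```python
import torch

def collect_rollout(env, client_models, server_model, horizon: int = 128):
    """Collect (state, action, reward) tuples for `horizon` time steps:
    the agent reads the state from the simulation environment, the action
    network produces an action, and the environment returns the reward
    value used by the judgment network to evaluate that action."""
    trajectory = []
    state, _ = env.reset()
    for _ in range(horizon):
        s = torch.as_tensor(state, dtype=torch.float32)
        sub_states = torch.chunk(s, len(client_models), dim=-1)
        features = [m(sub)[0] for m, sub in zip(client_models, sub_states)]
        action = server_model(features)
        next_state, reward, terminated, truncated, _ = env.step(action.detach().numpy())
        trajectory.append((state, action.detach(), reward))
        state = next_state
        if terminated or truncated:
            state, _ = env.reset()
    return trajectory
```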
Updating the action network: the action network can be updated using the PPO gradient update formula:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(\frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\,\hat{A}_t,\ \operatorname{clip}\left(\frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)},\, 1-\varepsilon,\, 1+\varepsilon\right)\hat{A}_t\right)\right]$$

wherein $\theta$ is the policy parameter, $\hat{\mathbb{E}}_t$ denotes the empirical expectation over time steps, $\pi_{\theta}(a_t \mid s_t)$ is the state transition probability of the action network to be trained, $\pi_{\theta_{old}}(a_t \mid s_t)$ is the state transition probability of the old action network, $\varepsilon$ is a hyper-parameter, usually taking the value 0.1 or 0.2, and $\hat{A}_t$ is the estimated advantage at time step $t$. The advantage function is calculated as:

$$\hat{A}_t = \delta_t + (\gamma\lambda)\,\delta_{t+1} + \cdots + (\gamma\lambda)^{T-t+1}\,\delta_{T-1}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

wherein $V(s_t)$ is the state value function obtained by the judgment network at time $t$, and $r_t$ is the reward value at time $t$.
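Rendered as code, the clipped update and the advantage computation above might look as follows; this is a sketch of the standard PPO formulation, with tensor shapes and the default hyper-parameter values as assumptions:

```python
import torch

def gae_advantages(rewards, values, gamma: float = 0.99, lam: float = 0.95):
    """Generalized advantage estimation matching the formula above:
    A_t = delta_t + (gamma*lam)*delta_{t+1} + ..., with
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    T = len(rewards)
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

def ppo_clip_loss(logp_new, logp_old, advantages, eps: float = 0.2):
    """Clipped surrogate objective: the ratio pi_theta / pi_theta_old is
    intercepted to [1 - eps, 1 + eps]; eps typically takes 0.1 or 0.2."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Minimise the negative of the surrogate objective.
    return -torch.min(unclipped, clipped).mean()
```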
Updating the discriminant network: the other part of the PPO model that needs updating is the discriminant network, whose loss function is calculated as:

$$L_{dis} = \mathbb{E}\left[\left(V_{tgt}(s,a) - V_{pred}(s,a)\right)^2\right]$$

wherein $V_{tgt}$ is the target value function, $V_{pred}$ is the predicted value, and $s$ and $a$ are the state and the action respectively; the network parameters are updated by back-propagating this loss function.
Client model stealing attack in the training phase (or testing phase)
In order to improve the effect of model stealing, the model structure of the stealing model is chosen as a DQN with the same input as the target model.
(1) Stealing datasets
In the testing stage, the trained deep reinforcement learning model is used as the target model, the sampled state-action pairs are used as the stealing data set, and these pairs serve as training samples for an equivalent model.
(2) Training equivalent models
On the basis of the stolen data, an equivalent policy is trained by imitation learning. In the training of the imitation policy, the generator G is replaced with an action network; the output action of the generator G and the state are input to the discriminator in pairs and compared with the expert data, and the discriminator's judgment $D(s,a)$ of the generator's output serves as the reward value guiding the policy learning of imitation learning. Thus, the discriminator loss function in imitation learning can be expressed as:

$$L_D = \mathbb{E}_{\pi_E}\left[\log D(s,a)\right] + \mathbb{E}_{\pi_{\theta}}\left[\log\left(1 - D(s,a)\right)\right]$$

wherein $\pi_{\theta}$ denotes the policy resulting from imitation learning and $\pi_E$ denotes the sampled expert policy; the first term $\mathbb{E}_{\pi_E}[\log D(s,a)]$ represents the discriminator's judgment of the real data, and the second term $\mathbb{E}_{\pi_{\theta}}[\log(1 - D(s,a))]$ its judgment of the generated data. Through this max-min game process, G and D are optimized cyclically and alternately to train the required action network and discriminant network.
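A sketch of this discriminator objective using binary cross-entropy follows, assuming D(s, a) outputs the probability that a state-action pair comes from the expert data:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_expert: torch.Tensor, d_generated: torch.Tensor) -> torch.Tensor:
    """Imitation-learning discriminator loss: the first term judges the
    expert (real) state-action pairs, the second term the pairs produced
    by the generator G."""
    real_term = F.binary_cross_entropy(d_expert, torch.ones_like(d_expert))
    fake_term = F.binary_cross_entropy(d_generated, torch.zeros_like(d_generated))
    return real_term + fake_term

def imitation_reward(d_generated: torch.Tensor) -> torch.Tensor:
    """The discriminator's judgment of the generator's output serves as
    the reward value guiding the imitation policy."""
    return torch.log(d_generated.clamp_min(1e-8))
```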
In the training process, the parameters of the discriminant network and the action network are updated backwards by minimizing a loss function through gradient derivation, where the loss function is:

$$L = L_D - \lambda_H\,H(\pi_{\theta})$$

wherein $H(\pi_{\theta})$ is the entropy of the imitation policy $\pi_{\theta}$, controlled by the constant $\lambda_H$ as a policy regularization term in the loss function; the trained equivalent model is then used to generate adversarial samples to attack the target model.
Analysis of defense feasibility
Federated learning aims to build a federated learning model based on distributed data sets. During model training, model-related information can be exchanged between the parties, but the raw data cannot; this exchange does not expose any protected private portion of the data at any site. The trained federated learning model can be deployed at each participant of the federated learning system, or shared among multiple parties, so that private information is protected. Based on the characteristic that the clients' data features have low overlap, the vertical federation uploads only the model-processed features to the server side, so the model and the data privacy are well protected. Model protection is thereby improved: an attacker that merely approximates a single client model cannot learn an approximation of the overall policy and cannot acquire the overall task, and an attack on a single client model cannot greatly influence the overall task.
Specific examples
Assume a vertical federated scenario in which the original input is $x$, and there are two clients whose data are $x_1$ and $x_2$ respectively, with no feature overlap between $x_1$ and $x_2$. Let the client models be $f_1$ and $f_2$ and the server-side model be $g$. A model attacker attacks at the client: supposing the attacker can obtain the data model of one of the clients and interferes with the input of that client model through various strategies, the perturbed model executes as:

$$y = g\left(f_1(x_1 + \delta) \oplus f_2(x_2)\right)$$

wherein $\delta$ is the noise, $x_1$ and $x_2$ are the inputs of the two clients respectively, and $\oplus$ is the feature connection (splicing) operation. The input change of the client model at this time is $\delta$. Obviously, if $x$ has dimension $a$, then $x_1$ and $x_2$ each have dimension $a/2$: if $a$ is small enough, the noise $\delta$ has a large influence, while if $a$ is larger than a certain threshold, the influence on the whole is small. And if there are $n$ client models, each has an input dimension of $a/n$; if $n$ is large enough, one client being disturbed by noise does not have a significant impact on the overall task. Therefore, the larger the input feature dimension of the model and the smaller each client's share of it, the stronger the defense capability.
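The dimension argument can be checked numerically; the following toy sketch (all sizes and the noise scale are illustrative) perturbs one client's a/n-dimensional share of an a-dimensional input and reports the relative change of the spliced input:

```python
import numpy as np

def relative_perturbation(a: int, n: int, noise_std: float = 0.5,
                          trials: int = 1000) -> float:
    """Average ratio ||delta|| / ||x|| when only one of n clients,
    holding a/n of the a input dimensions, is attacked with noise."""
    rng = np.random.default_rng(0)
    ratios = []
    share = a // n
    for _ in range(trials):
        x = rng.standard_normal(a)
        delta = np.zeros(a)
        delta[:share] = noise_std * rng.standard_normal(share)  # attack one client
        ratios.append(np.linalg.norm(delta) / np.linalg.norm(x))
    return float(np.mean(ratios))

# More clients means a smaller share per client, hence a smaller relative impact.
for n in (2, 4, 8):
    print(n, round(relative_perturbation(a=64, n=n), 4))
```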
The second aspect of the invention discloses an intelligent routing decision protection system based on vertical federation. FIG. 4 is a block diagram of an intelligent routing decision protection system based on vertical federation according to an embodiment of the present invention; as shown in fig. 4, the system 400 includes:
the state sampling module 401 is configured to obtain sampling state data of an agent in an application scene through sampling, wherein the sampling state data is divided into N groups of sampling sub-state data and respectively sent to N clients, and N is greater than or equal to 2 and is a positive integer;
a feature generation module 402, configured to generate feature data of the sampled sub-state data by using the constructed client model based on the sampled sub-state data received by each of the N clients, and send the feature data to the server;
a routing decision module 403 configured to generate, by using the constructed server-side model, a routing decision for the overall task of the agent based on the N sets of feature data received by the server side from the N clients.
According to the system of the second aspect of the present invention, the N constructed client models have the same model structure, each client model includes two client submodels, each client submodel also has the same model structure, and each client submodel includes two fully-connected layers and two activation function layers.
According to the system of the second aspect of the present invention, the server side performs splicing processing on the N groups of feature data received from the N clients to obtain complete feature data and generates a routing decision for the overall task of the agent from the complete feature data, where the server-side model includes a fully-connected layer and a Tanh activation function layer.
According to the system of the second aspect of the invention, the system further comprises: a preprocessing module 404 configured to pre-train the server-side model and the N client models, where the pre-training specifically includes:
acquiring training state data of the agent in the application scene through pre-sampling, wherein the training state data is divided into N groups of training sub-state data; adding interference noise representing a malicious attack to the k-th group of training sub-state data among the N groups, and then sending the k-th group and the other N-1 groups of training sub-state data to the N clients respectively, wherein 1 ≤ k ≤ N and k is a positive integer;
each client side in the N client sides generates training characteristic data of the training sub-state data by utilizing the client side model based on the received training sub-state data and sends the training characteristic data to the server side;
the server side generates a routing decision aiming at a training task of the agent based on N groups of training characteristic data received from the N clients by utilizing the server side model;
acquiring a real decision of a training task of the agent, and calculating a loss function based on a routing decision of the training task and the real decision of the training task;
and feeding the loss function back to the N clients, and repeating the steps after the N clients receive the loss function until the calculated loss function is lower than a threshold value.
According to the system of the second aspect of the invention, the loss function is expressed by the following formula:
$$L(\theta) = L_{act}(\theta) + L_{dis}(\theta)$$

wherein $L_{act}(\theta)$ represents the loss function of the action network in the client model, $L_{dis}(\theta)$ represents the loss function of the discriminant network in the client model, and $\theta$ represents the model parameters of the client model;

the loss function of the action network is:

$$L_{act}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(\frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\,\hat{A}_t,\ \operatorname{clip}\left(\frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)},\, 1-\varepsilon,\, 1+\varepsilon\right)\hat{A}_t^{old}\right)\right]$$

wherein $\pi_{\theta}(a_t \mid s_t)$ represents the state transition probability of the action network, $\pi_{\theta_{old}}(a_t \mid s_t)$ represents the previous state transition probability of the action network, $\theta$ represents the current model parameters of the client model, $\theta_{old}$ represents the previous model parameters of the client model, $\operatorname{clip}(\cdot)$ represents the intercept (clipping) function that keeps values within the range $[1-\varepsilon,\, 1+\varepsilon]$, $\varepsilon$ represents a hyper-parameter, $\hat{A}_t$ represents the estimated advantage at time step $t$, and $\hat{A}_t^{old}$ represents the estimated advantage at time step $t$ under the previous model parameters of the client model;

the loss function of the discriminant network is:

$$L_{dis} = \mathbb{E}\left[\left(V_{tgt}(s,a) - V_{pred}(s,a)\right)^2\right]$$

wherein $V_{tgt}$ is the target value function, $V_{pred}$ is the predicted value, $s$ and $a$ respectively represent the state and the action, and $\gamma$ and $\lambda$ represent hyper-parameters.
According to the system of the second aspect of the invention, when the sampling state data and the training state data are obtained, a near-end policy optimization algorithm is adopted to collect the states, actions and reward values at a plurality of moments, specifically: at the first moment, the agent obtains state data from a simulation environment of the application scene, the action network makes a corresponding action based on the state data, and the judgment network gives a reward value for the action made by the action network; at every other moment, the state, action and reward value of that moment are acquired in the same way.
The third aspect of the invention discloses an electronic device. The electronic device comprises a memory and a processor, the memory storing a computer program; when executing the computer program, the processor implements the steps of the intelligent routing decision protection method based on vertical federation in the first aspect of the present invention.
FIG. 5 is a block diagram of an electronic device according to an embodiment of the invention; as shown in fig. 5, the electronic apparatus includes a processor, a memory, a communication interface, a display screen, and an input device connected through a system bus. Wherein the processor of the electronic device is configured to provide computing and control capabilities. The memory of the electronic equipment comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The communication interface of the electronic device is used for carrying out wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, an operator network, Near Field Communication (NFC) or other technologies. The display screen of the electronic equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the electronic equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the electronic equipment, an external keyboard, a touch pad or a mouse and the like.
It will be understood by those skilled in the art that the structure shown in fig. 5 is only a partial block diagram related to the technical solution of the present disclosure, and does not constitute a limitation of the electronic device to which the solution of the present application is applied, and a specific electronic device may include more or less components than those shown in the drawings, or combine some components, or have a different arrangement of components.
A fourth aspect of the invention discloses a computer-readable storage medium. The computer readable storage medium stores thereon a computer program, which when executed by a processor implements the steps of the vertical federation-based intelligent route decision protection method of the first aspect of the present invention.
In summary, the technical scheme of the invention draws on the vertical federation model and its data protection function: a reinforcement learning framework based on vertical federation is designed, training of the model is split between local clients and a server side, the number of clients is arbitrary, and different clients train on different feature data, while the data uploaded to the server side contains only features. An attacker is thus confounded: even if it obtains the input and output of some client, it cannot reconstruct an equivalent of the overall policy model, because the input features are split across different clients for training. With the invention, an attacker can hardly steal the complete training task of the intelligent routing decision and cannot steal the whole intelligent routing decision model, thereby achieving the purpose of protecting the intelligent routing decision model.
In the deep reinforcement learning training process, a trained model carries great potential safety hazards: the model and its data are easily exploited maliciously by an attacker, who can train an equivalent model from the input states and output actions and generate malicious samples to influence the decisions of the target agent. In view of this, and drawing on the vertical federation model and its data protection function, a reinforcement learning framework based on vertical federation is designed: training of the model is split between local clients and a server side, the number of clients is arbitrary, different clients train on different feature data, and the data uploaded to the server side contains only features, so that an attacker, even holding the input and output of some client, cannot reconstruct an equivalent overall policy model, thereby achieving the functions of model and data protection.
The invention has the following beneficial effects: a deep reinforcement learning model protection method based on vertical federation is provided against poisoning of deep reinforcement learning models; it protects not only the model but also the data; the input state is split in the reinforcement learning training process so that the clients hold data with different feature distributions, which protects both the data and the model; and the method has good applicability, can effectively detect model poisoning, and does not affect the execution of normal policies.
It should be noted that the technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of technical features contains no contradiction, it should be considered within the scope of this description. The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they are not to be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, and these fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. An intelligent routing decision protection method based on vertical federation, characterized in that the method comprises the following steps:
step S1, obtaining, through sampling, sampling state data of an agent in an application scene, dividing the sampling state data into N groups of sampling sub-state data and sending them to N clients respectively, wherein N ≥ 2 and N is a positive integer;
step S2, each of the N clients generating feature data of the sampling sub-state data by using a constructed client model based on the received sampling sub-state data and sending the feature data to a server side;
and step S3, the server side generating, by using a constructed server-side model, a routing decision for the overall task of the agent based on the N groups of feature data received from the N clients.
2. The method according to claim 1, wherein in step S2, the N constructed client models have the same model structure, each client model includes two client submodels, each client submodel likewise has the same model structure, and each client submodel includes two fully-connected layers and two activation function layers.
3. The method according to claim 2, wherein in step S3, the server side splices the N sets of feature data received from the N clients to obtain complete feature data, and the server side model generates a routing decision for an overall task of the agent according to the complete feature data, and the server side model includes a full connection layer and a Tanh activation function layer.
4. The method according to claim 3, wherein before the step S1-S3, the method further comprises: step S0, pre-training the server-side model and the N client-side models, where the pre-training specifically includes:
step S0-1, acquiring training state data of the agent in the application scene through pre-sampling, wherein the training state data is divided into N groups of training sub-state data; adding interference noise representing a malicious attack to the k-th group of training sub-state data among the N groups, and then sending the k-th group and the other N-1 groups of training sub-state data to the N clients respectively, wherein 1 ≤ k ≤ N and k is a positive integer;
step S0-2, each of the N clients generates training feature data of the training sub-state data by using the client model based on the received training sub-state data, and sends the training feature data to the server;
step S0-3, the server side generates a routing decision for a training task of the agent based on the N groups of received training feature data from the N clients by using the server side model;
s0-4, acquiring a real decision of a training task of the agent, and calculating a loss function based on a routing decision of the training task and the real decision of the training task;
step S0-5, the loss function is fed back to the N clients, the N clients repeat the steps S0-1 to S0-4 after receiving the loss function until the calculated loss function is lower than a threshold, and then execute the steps S1 to S3 using the pre-trained server-side model and the N client-side models.
5. The method for intelligent vertical federation-based routing decision protection according to claim 4, wherein in the step S0-4:
the loss function is expressed using the following formula:
Figure DEST_PATH_IMAGE001
wherein the content of the first and second substances,
Figure 419964DEST_PATH_IMAGE002
a loss function representing a network of actions in the client model,
Figure DEST_PATH_IMAGE003
a loss function representing a discriminative network in the client model,
Figure 976585DEST_PATH_IMAGE004
current model parameters representing the client model;
the loss function of the action network is:

$$L^{a}(\theta) = \hat{\mathbb{E}}_{t}\!\left[\min\!\left(\frac{\pi_{\theta}(a_{t}\mid s_{t})}{\pi_{\theta_{\mathrm{old}}}(a_{t}\mid s_{t})}\,\hat{A}_{t},\ \operatorname{clip}\!\left(\frac{\pi_{\theta}(a_{t}\mid s_{t})}{\pi_{\theta_{\mathrm{old}}}(a_{t}\mid s_{t})},\,1-\epsilon,\,1+\epsilon\right)\hat{A}_{t}^{\mathrm{old}}\right)\right]$$

where $\pi_{\theta}$ denotes the state transition probability of the action network, $\pi_{\theta_{\mathrm{old}}}$ denotes the previous state transition probability of the action network, $\theta$ denotes the current model parameters of the client model, $\theta_{\mathrm{old}}$ denotes the previous model parameters of the client model, $\operatorname{clip}(\cdot,\,1-\epsilon,\,1+\epsilon)$ denotes the intercept (clipping) function, which restricts its argument to values within the range $[1-\epsilon,\,1+\epsilon]$, $\epsilon$ denotes a hyper-parameter, $\hat{A}_{t}$ denotes the estimated advantage at time step $t$, and $\hat{A}_{t}^{\mathrm{old}}$ denotes the estimated advantage at time step $t$ under the previous model parameters of the client model;
the loss function of the discrimination network is:

$$L^{c}(\theta) = \mathbb{E}\!\left[\left(y(s, a) - \hat{y}(s, a)\right)^{2}\right]$$

where $y$ is the target value function, $\hat{y}$ is the predicted value, $s$ and $a$ respectively denote the state and the action, and $\gamma$ and $\lambda$ denote hyper-parameters used in computing the target value.
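Reading claim 5 as a PPO-style objective, the two loss terms could be computed as in the following sketch; the use of a single advantage tensor (computed under the previous model parameters) for both the clipped and unclipped terms, the default value of ε, and the plain sum of the two terms are assumptions:

```python
import torch

def actor_loss(ratio, advantages, eps=0.2):
    """Clipped surrogate loss L^a(theta) of the action network.

    ratio      -- pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t), per time step
    advantages -- estimated advantage A_hat_t
    eps        -- the clipping hyper-parameter epsilon
    """
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Negated because optimizers minimize, while the surrogate is maximized.
    return -torch.mean(torch.min(unclipped, clipped))

def critic_loss(target_values, predicted_values):
    """Discrimination-network loss L^c(theta): mean squared error between
    the target value y(s, a) and the predicted value y_hat(s, a)."""
    return torch.mean((target_values - predicted_values) ** 2)

def total_loss(ratio, advantages, target_values, predicted_values):
    # Total client loss of claim 5 (simple sum assumed): L = L^a + L^c.
    return actor_loss(ratio, advantages) + critic_loss(target_values, predicted_values)
```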
6. The method according to claim 5, wherein a proximal policy optimization (PPO) algorithm is adopted to collect the state s, the action a, and the reward value at multiple time steps when acquiring the sampling state data and the training state data, specifically: at the first time step, the agent obtains state data from a simulation environment of the application scene, the action network makes a corresponding action based on the state data, and the discrimination network gives a reward value for the action made by the action network; at each subsequent time step, the state, action, and reward value are acquired in the same way.
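A minimal collection loop matching claim 6 might look as follows; the `env.reset`/`env.step` interface and the discrimination network taking (state, action) as input are assumed for illustration:

```python
import torch

def collect_trajectory(env, action_net, discrim_net, n_steps):
    """Collect (state, action, reward) triples over multiple time steps."""
    trajectory = []
    state = env.reset()  # first time step: state from the simulation environment
    for _ in range(n_steps):
        with torch.no_grad():
            action = action_net(state)           # action network acts on the state
            reward = discrim_net(state, action)  # discrimination network scores the action
        trajectory.append((state, action, reward))
        state = env.step(action)                 # advance the simulation (assumed API)
    return trajectory
```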
7. A vertical federation-based intelligent routing decision protection system, the system comprising:
a state sampling module configured to acquire sampling state data of the agent in an application scene through sampling, the sampling state data being divided into N groups of sampling sub-state data that are sent to N clients respectively, where N is a positive integer and N ≥ 2;
a feature generation module configured to cause each of the N clients to generate, by using the constructed client model, feature data from the received sampling sub-state data and to send the feature data to the server side; and
a routing decision module configured to cause the server side to generate, by using the constructed server-side model, a routing decision for the overall task of the agent based on the N groups of feature data received from the N clients.
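The vertical (feature-wise) split performed by the state sampling module can be illustrated in a few lines; the equal-width split is an assumption, since in a vertical-federation setting each client would typically own a fixed subset of the state features:

```python
import numpy as np

state = np.random.rand(64)             # sampled agent state (illustrative size)
N = 4                                  # number of clients
sub_states = np.array_split(state, N)  # N groups of sampling sub-state data
# sub_states[i] is sent to client i; no single client sees the complete state.
```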
8. The vertical federation-based intelligent routing decision protection system according to claim 7, further comprising a preprocessing module configured to pre-train the server-side model and the N client models, the pre-training specifically comprising:
acquiring training state data of the agent in an application scene through pre-sampling, dividing the training state data into N groups of training sub-state data, adding interference noise representing a malicious attack to the k-th group of the N groups of training sub-state data, and then sending the k-th group and the other N-1 groups of training sub-state data to the N clients respectively, where k is a positive integer and 1 ≤ k ≤ N;
each of the N clients generating training feature data from the received training sub-state data by using the client model and sending the training feature data to the server side;
the server side generating, by using the server-side model, a routing decision for a training task of the agent based on the N groups of training feature data received from the N clients;
acquiring a real decision of the training task of the agent, and calculating a loss function based on the routing decision of the training task and the real decision of the training task;
and feeding the loss function back to the N clients; after receiving the loss function, the N clients repeat the above steps until the calculated loss function falls below a threshold.
9. An electronic device, comprising a memory and a processor, wherein the memory stores a computer program, and the processor, when executing the computer program, implements the steps of the vertical federation-based intelligent routing decision protection method of any one of claims 1 to 6.
10. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the vertical federation-based intelligent routing decision protection method of any one of claims 1 to 6.
CN202210096691.4A 2022-01-27 2022-01-27 Intelligent routing decision protection method and system based on vertical federation Active CN114124784B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210096691.4A CN114124784B (en) 2022-01-27 2022-01-27 Intelligent routing decision protection method and system based on vertical federation

Publications (2)

Publication Number Publication Date
CN114124784A (en) 2022-03-01
CN114124784B (en) 2022-04-12

Family

ID=80361987

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210096691.4A Active CN114124784B (en) 2022-01-27 2022-01-27 Intelligent routing decision protection method and system based on vertical federation

Country Status (1)

Country Link
CN (1) CN114124784B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109802981A (en) * 2017-11-17 2019-05-24 车伯乐(北京)信息科技有限公司 A kind of configuration method of global data, apparatus and system
CN111786713A (en) * 2020-06-04 2020-10-16 大连理工大学 Unmanned aerial vehicle network hovering position optimization method based on multi-agent deep reinforcement learning
CN112182982A (en) * 2020-10-27 2021-01-05 北京百度网讯科技有限公司 Multi-party combined modeling method, device, equipment and storage medium
CN113191484A (en) * 2021-04-25 2021-07-30 清华大学 Federal learning client intelligent selection method and system based on deep reinforcement learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1887484B1 (en) * 2002-11-06 2009-10-14 Tellique Kommunikationstechnik GmbH Method for pre-transmission of structured data sets between a client device and a server device


Similar Documents

Publication Publication Date Title
CN113408743B (en) Method and device for generating federal model, electronic equipment and storage medium
CN113609521B (en) Federated learning privacy protection method and system based on countermeasure training
Ma et al. On safeguarding privacy and security in the framework of federated learning
CN111461226A (en) Countermeasure sample generation method, device, terminal and readable storage medium
CN112884131A (en) Deep reinforcement learning strategy optimization defense method and device based on simulation learning
CN110852448A (en) Cooperative intelligent agent learning method based on multi-intelligent agent reinforcement learning
CN112884130A (en) SeqGAN-based deep reinforcement learning data enhanced defense method and device
CN113645197B (en) Decentralized federal learning method, device and system
CN113255936A (en) Deep reinforcement learning strategy protection defense method and device based on simulation learning and attention mechanism
CN113077052A (en) Reinforced learning method, device, equipment and medium for sparse reward environment
CN111625820A (en) Federal defense method based on AIoT-oriented security
CN112396187A (en) Multi-agent reinforcement learning method based on dynamic collaborative map
Xiao et al. Network security situation prediction method based on MEA-BP
CN112600794A (en) Method for detecting GAN attack in combined deep learning
CN115208604B (en) AMI network intrusion detection method, device and medium
CN116861239A (en) Federal learning method and system
CN107347064B (en) Cloud computing platform situation prediction method based on neural network algorithm
CN115481441A (en) Difference privacy protection method and device for federal learning
CN117235742B (en) Intelligent penetration test method and system based on deep reinforcement learning
CN111091102B (en) Video analysis device, server, system and method for protecting identity privacy
CN114124784B (en) Intelligent routing decision protection method and system based on vertical federation
CN115001937B (en) Smart city Internet of things-oriented fault prediction method and device
Gao et al. Multi-source feedback based light-weight trust mechanism for edge computing
Wang et al. Deep reinforcement learning for joint sensor scheduling and power allocation under DoS attack
CN116957067B (en) Reinforced federal learning method and device for public safety event prediction model

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant