CN116681787A - Deep reinforcement learning-based graph coloring method - Google Patents
- Publication number
- CN116681787A (application number CN202310643641.8A)
- Authority
- CN
- China
- Prior art keywords
- agent
- coloring
- graph coloring
- graph
- reinforcement learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T11/00—2D [Two Dimensional] image generation
- G06T11/001—Texturing; Colouring; Generation of texture or colour
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/092—Reinforcement learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/90—Determination of colour characteristics
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
A graph coloring method based on deep reinforcement learning: a graph coloring environment is created; a message-passing neural network is used as the agent for deep reinforcement learning; the agent explores randomly in the graph coloring environment and stores the data generated at each step in an experience replay pool; transition information is randomly sampled from the experience replay pool for value updates, the current state is input into the agent, which outputs a Q-value for each action, the vertex corresponding to the maximum Q-value is taken as the optimal action, a reward value is calculated, and the agent is trained; the trained agent then colors the graph and outputs a coloring scheme and a color number. The application converts the graph coloring problem into a unidirectional Markov chain model so that the agent does not fall into cyclic trajectories during exploration, designs a strategy that increases the number of colors on demand while the solution is constructed incrementally so that no prior knowledge of the color number is required, and finally designs a reward function that guides the agent to learn the optimal coloring strategy.
Description
Technical Field
The application belongs to the field of deep learning, and particularly relates to a graph coloring method based on deep reinforcement learning.
Background
The graph coloring problem (GCP) is a well-known combinatorial optimization problem in the field of operations research. The goal is to color the nodes of a graph with as few colors as possible while ensuring that adjacent nodes do not share the same color. GCP has many practical applications, such as timetabling, fault diagnosis, mobile radio frequency allocation, and register allocation. Traditional methods for solving the graph coloring problem mainly comprise exact algorithms, approximation algorithms, and heuristic algorithms. Exact algorithms, such as branch-and-bound or other enumeration methods, can produce optimal solutions for small-scale problems; however, due to their non-polynomial time complexity, they become intractable for large-scale problems. Approximation algorithms improve computational efficiency and come with theoretical guarantees, but they cannot guarantee the minimum color number in scenarios that require high-quality solutions. Heuristic algorithms, such as tabu search, can find good solutions in an acceptable time, but designing them requires extensive domain knowledge. Furthermore, these algorithms are iterative by nature and must search anew for each new instance, so they are not applicable in practical scenarios with time constraints and a need for flexibility. It is therefore crucial to find a graph coloring method that can meet the need for fast solutions and generalizes well.
Disclosure of Invention
The application aims to provide a graph coloring method based on deep reinforcement learning for coloring nodes of a graph.
The technical scheme of the application is as follows:
a graph coloring method based on deep reinforcement learning is characterized in that,
creating a graph coloring environment; using a message-passing neural network as the agent for deep reinforcement learning;
the agent explores randomly in the graph coloring environment using an epsilon-greedy strategy, and the data generated at each step are stored in an experience replay pool;
when the data in the experience replay pool reach the quantity M, transition information of the generated data is randomly sampled from the experience replay pool for value updates; the current state is input into the agent, which outputs a Q-value for each action; the vertex corresponding to the maximum Q-value is taken as the optimal action; a reward value is calculated and the agent is trained;
and coloring the graph by using the trained agent, and giving a coloring scheme and a color number.
Further, the creating a graph coloring environment specifically includes:
the graph coloring problem is converted into a unidirectional Markov chain model, potential energy relations among various states are constructed, and then a strategy of increasing the number of colors according to the need is adopted to create the graph coloring environment.
Further, a reward function with an invalid-action penalty (IAP) is designed to guide the agent to learn the optimal coloring strategy.
Further, the design principle of the IAP specifically includes:
The IAP should take precedence over other penalties. Because the goal of the agent is to maximize the return, if the reward for increasing the color number is less than the reward for an invalid action, the agent may select the invalid action instead of the action that increases the color number, and may therefore fail to complete the graph coloring task within a limited number of steps before its exploration stops.
The IAP should remain on a scale consistent with the main-objective reward throughout; an excessive penalty interferes with the agent's learning of the main objective.
Further, the reward function formula is:
wherein χ(s_t) is the color number of state s_t, the set symbol in the formula represents the set of invalid actions, and a represents the currently executed action, i.e., selecting an already-colored vertex.
Further, the strategy of increasing the number of colors on demand is specifically as follows:
The number of colors is initialized to 1, and during the incremental construction of the solution, the number of colors is increased whenever the existing colors cannot satisfy the constraint that adjacent vertices have different colors.
Further, the agent is trained.
The application has the technical effects that:
the application provides a graph coloring method based on deep reinforcement learning. The graph coloring problem is converted into a unidirectional Markov chain model, so that an intelligent agent cannot fall into a circulation track in the exploration process, and potential energy relations among all states are constructed. The strategy of increasing the number of colors as needed is designed in the process of incrementally constructing the solution so that a priori knowledge of the number of colors is not required. Finally, a reward function with IAP is designed to guide the agent to learn the optimal coloring strategy.
Drawings
The accompanying drawings illustrate various embodiments by way of example and not by way of limitation, and together with the description and claims serve to explain the embodiments of the application. Wherever possible, the same reference numbers are used throughout the drawings to refer to the same or like parts. Such embodiments are illustrative and are not intended to be exhaustive or to limit the present apparatus or method.
FIG. 1 illustrates a schematic diagram of a deep reinforcement learning-based graph coloring algorithm framework of the present application;
FIG. 2 shows a schematic diagram of the unidirectional Markov chain model of the graph coloring process of the present application.
Detailed Description
It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.
A graph coloring method based on deep reinforcement learning,
first, the graphics shading problem GCP is converted to a unidirectional Markov chain model, which can simplify the complexity and ease of handling of the graphics shading process. In one aspect, the application provides that no matter what decision is performed, each state will return to the previous state, which allows the agent to not get into the loop track during the exploration process. On the other hand, based on the change in solution subset, the relationship of potential energy between states on the Markov chain is given. It was then demonstrated that when the return value was maximum, all vertices were colored and the number of colors was minimal.
The initial state s(1, 0) indicates that the graph coloring task starts from a blank graph with a color number of 1. A state s(χ, k) records the number of colors χ in the current state and the number k of vertices that have been colored (k ≤ n). λ_(χ,k) indicates that state s(χ, k) has been visited λ_(χ,k) times. When all vertices are colored (k = n), the end state is reached. Apart from the initial state and the end state, each state has three transition paths: (1) coloring an uncolored vertex and increasing the color number, i.e., s(χ, k) → s(χ+1, k+1); (2) coloring an uncolored vertex without increasing the color number, i.e., s(χ, k) → s(χ, k+1); (3) selecting an already-colored vertex, in which case the state remains unchanged, i.e., s(χ, k) → s(χ, k).
The application then provides a deep reinforcement learning framework that adopts an MPNN as the deep reinforcement learning agent and aggregates the relevant information among the vertices of the graph. In particular, a reward function based on the unidirectional Markov chain is designed to guide the agent toward states of increasing potential energy, so that the agent learns how to reach the state of maximum potential energy in a limited time. On this basis, a GCP-oriented Q-learning algorithm is provided to train the agent. In the proposed algorithm, the strategy of increasing the number of colors on demand removes the need for prior knowledge of the optimal number of colors: the number of colors is initialized to 1, and it is increased whenever the existing colors cannot satisfy the constraint that adjacent vertices have different colors during the incremental construction of the solution.
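As an illustration only (not part of the original disclosure), the unidirectional transitions and the on-demand color-increase strategy described above could be realized roughly as in the following sketch; the class and method names (GraphColoringEnv, reset, step) and the use of a 0/1 adjacency matrix are assumptions made for this example.

```python
# Illustrative sketch only: a minimal environment realizing the three transition
# paths and the on-demand color increase. Names and interfaces are assumptions.
import numpy as np


class GraphColoringEnv:
    def __init__(self, adjacency):
        self.adjacency = np.asarray(adjacency)   # n x n 0/1 adjacency matrix
        self.n = self.adjacency.shape[0]
        self.reset()

    def reset(self):
        self.colors = np.full(self.n, -1)        # -1 marks an uncolored vertex
        self.num_colors = 1                      # the color number chi starts at 1
        return self._state()

    def _state(self):
        return self.colors.copy(), self.num_colors

    def step(self, v):
        """Color vertex v; return (next_state, done, valid)."""
        if self.colors[v] != -1:
            # Path (3): invalid action, the state stays at s(chi, k).
            return self._state(), False, False
        neighbor_colors = {int(self.colors[u])
                           for u in np.flatnonzero(self.adjacency[v])
                           if self.colors[u] != -1}
        free = [c for c in range(self.num_colors) if c not in neighbor_colors]
        if free:
            # Path (2): color v without a new color, s(chi, k) -> s(chi, k+1).
            self.colors[v] = free[0]
        else:
            # Path (1): increase the color number on demand, s(chi, k) -> s(chi+1, k+1).
            self.colors[v] = self.num_colors
            self.num_colors += 1
        done = bool((self.colors != -1).all())    # end state: all vertices colored (k = n)
        return self._state(), done, True


# Example: a single edge forces the color number to grow to 2.
env = GraphColoringEnv([[0, 1], [1, 0]])
state, done, valid = env.step(0)
state, done, valid = env.step(1)
```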
The MPNN is used as the graph neural network framework; in the graph coloring problem it can effectively propagate and aggregate vertex information such as colors and degrees over the graph. The application designs an MPNN architecture for the graph coloring problem, whose message-passing function is as follows:
wherein θ_1 is the neural network parameter of the k-th message-passing layer, N(v) is the set of vertices adjacent to vertex v, w_uv is the weight of the edge connecting vertex v and vertex u, and ζ_v is the adjacent-edge feature of vertex v; the remaining symbol in the formula denotes the feature of vertex v in the k-th round of message passing.
The update function is as follows:
wherein θ_2 is the neural network parameter of the k-th vertex-update layer, and the square brackets denote the concatenation (splicing) operation.
The output function of the readout phase is as follows:
wherein θ_3 and θ_4 are the parameters of the two-layer neural network in the readout phase, and V is the set of vertices.
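As an illustration only, the three stages above (message passing, vertex update, readout) could be sketched in PyTorch roughly as follows; the layer sizes, the activation functions, and the exact construction of the vertex features are assumptions, since the formulas above are given only symbolically.

```python
# Illustrative sketch only: a schematic message-passing network that outputs one
# Q-value per vertex. Dimensions and layer choices are assumptions.
import torch
import torch.nn as nn


class MPNNAgent(nn.Module):
    def __init__(self, feat_dim, hidden_dim=64, rounds=3):
        super().__init__()
        self.rounds = rounds
        self.msg = nn.Linear(feat_dim, hidden_dim)               # message function (theta_1)
        self.upd = nn.Linear(feat_dim + hidden_dim, feat_dim)    # vertex update (theta_2)
        self.readout = nn.Sequential(                            # readout MLP (theta_3, theta_4)
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1))

    def forward(self, h, adj):
        """h: (n, feat_dim) vertex features; adj: (n, n) weighted adjacency matrix (w_uv)."""
        for _ in range(self.rounds):
            # Message stage: aggregate transformed neighbor features, weighted by w_uv.
            m = adj @ torch.relu(self.msg(h))
            # Update stage: concatenate the old vertex feature with the aggregated message.
            h = torch.relu(self.upd(torch.cat([h, m], dim=1)))
        # Readout stage: one Q-value per vertex, i.e., per coloring action.
        return self.readout(h).squeeze(-1)


# Example: Q-values for a 4-vertex graph with 3-dimensional vertex features.
adj = torch.tensor([[0, 1, 1, 0],
                    [1, 0, 1, 0],
                    [1, 1, 0, 1],
                    [0, 0, 1, 0]], dtype=torch.float32)
h = torch.rand(4, 3)
agent = MPNNAgent(feat_dim=3)
q_values = agent(h, adj)   # shape (4,), one Q-value per vertex
```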
Finally, the application provides a specific implementation of the reward function. To prevent the state from remaining stationary, an IAP is added to the reward function to mask invalid actions generated during the decision process (a vertex that has already been colored cannot be selected for coloring again). Because the magnitude of an IAP is difficult to set and it may conflict with the main-objective reward, a poorly chosen IAP often leads to bad results. The application therefore proposes design principles for an IAP oriented to the GCP: (1) the IAP should take precedence over other penalties; because the goal of the agent is to maximize the return (the sum of all rewards), if the reward for increasing the color number is less than the reward for an invalid action, the agent may select the invalid action instead of the action that increases the color number, and may therefore fail to complete the graph coloring task within a limited number of steps before its exploration stops; (2) the IAP should remain on a scale consistent with the main-objective reward throughout; an excessive penalty interferes with the agent's learning of the main objective, so the IAP must stay coordinated with the graph coloring reward. The reward function formula is as follows:
wherein χ(s_t) is the color number of state s_t, the set symbol in the formula represents the set of invalid actions, and a represents the currently executed action (i.e., selecting an already-colored vertex).
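As an illustration only, one possible reward that respects the two design principles above is sketched below; the constants and the functional form are assumptions made for this example and do not reproduce the formula of the application.

```python
# Illustrative sketch only: a reward with an invalid-action penalty (IAP).
# The constants are assumptions: the IAP is strictly more negative than the
# color-increase penalty (principle 1) while staying on a comparable scale
# (principle 2).
def reward(chi_before, chi_after, action_is_invalid,
           iap=-2.0, color_penalty=-1.0):
    if action_is_invalid:          # an already-colored vertex was selected
        return iap
    if chi_after > chi_before:     # the coloring step forced a new color
        return color_penalty
    return 0.0                     # vertex colored without adding a color
```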
First, a graph coloring environment is created according to the Markov chain model and the on-demand color-increase strategy. Coloring actions are performed: vertices are colored according to the proposed on-demand color-increase strategy, a reward value is calculated with the proposed IAP reward function, and finally the vertex states (such as vertex colors, the available colors of each vertex, and the current color number) are updated.
Second, the agent explores randomly on randomly generated graphs with an epsilon-greedy strategy, and the data generated at each step, a tuple (s, a, r, s′) (where s is the current state, a the action, r the reward for executing action a in the current state, and s′ the next state), are stored in the experience replay pool.
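As an illustration only, the epsilon-greedy exploration and the experience replay pool could be sketched as follows; the buffer size, the epsilon value, the way a state is packed into the stored tuple, and the helper build_features are assumptions made for this example.

```python
# Illustrative sketch only: epsilon-greedy action selection and storage of
# (s, a, r, s') transitions in an experience replay pool.
import random
from collections import deque

replay_pool = deque(maxlen=100_000)   # experience replay pool (capacity is an assumption)


def select_action(q_values, uncolored, epsilon):
    """With probability epsilon pick a random uncolored vertex, otherwise argmax Q."""
    if random.random() < epsilon:
        return random.choice(uncolored)
    return max(uncolored, key=lambda v: float(q_values[v]))


# Inside the exploration loop, one step would look roughly like:
#   q = agent(h, adj)                                   # Q-value per vertex
#   a = select_action(q, uncolored_vertices, epsilon)
#   (colors, chi), done, valid = env.step(a)
#   h_next = build_features(colors, chi, adj)           # hypothetical feature builder
#   replay_pool.append(((h, adj), a, r, (h_next, adj), done))
```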
Third, when the data in the experience replay pool reach a certain amount, transition tuples (s, a, r, s′) are randomly sampled from the pool for value updates: the current state s is input to the agent, which outputs a Q-value for each action; the vertex corresponding to the largest Q-value, i.e., argmax Q(s, A) (where A is the action set), is taken as the optimal action a*; a reward value is calculated and the agent is trained.
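As an illustration only, the sampling-and-update step could be sketched as a standard DQN-style update; the batch size, the discount factor, and the omission of a separate target network are simplifications assumed here and are not specified by the application.

```python
# Illustrative sketch only: a DQN-style update over a random mini-batch of
# transitions stored as ((h, adj), a, r, (h_next, adj_next), done).
import random
import torch


def update(agent, optimizer, replay_pool, batch_size=32, gamma=0.99):
    if len(replay_pool) < batch_size:
        return
    batch = random.sample(list(replay_pool), batch_size)
    losses = []
    for (s, a, r, s_next, done) in batch:
        h, adj = s
        h_next, adj_next = s_next
        q_sa = agent(h, adj)[a]                       # Q(s, a) for the taken action
        with torch.no_grad():
            target = torch.tensor(float(r))
            if not done:
                target = target + gamma * agent(h_next, adj_next).max()
        losses.append((q_sa - target) ** 2)
    loss = torch.stack(losses).mean()                 # mean squared TD error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```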
Fourth, the graph is colored with the trained agent, which outputs a coloring scheme and a color number.
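As an illustration only, inference with the trained agent could be sketched as follows; build_features is a hypothetical helper for assembling vertex features (colors, degrees, etc.), and all other names are carried over from the earlier sketches rather than from the application itself.

```python
# Illustrative sketch only: greedy inference that repeatedly colors the
# uncolored vertex with the largest Q-value until the graph is fully colored.
import torch


def color_graph(agent, env, build_features):
    colors, num_colors = env.reset()
    done = False
    while not done:
        h = build_features(colors, num_colors, env.adjacency)   # hypothetical helper
        adj = torch.as_tensor(env.adjacency, dtype=torch.float32)
        with torch.no_grad():
            q = agent(h, adj)
        uncolored = [v for v in range(env.n) if colors[v] == -1]
        a = max(uncolored, key=lambda v: float(q[v]))            # argmax Q over valid actions
        (colors, num_colors), done, _ = env.step(a)
    return colors, num_colors                                    # coloring scheme and color number
```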
The foregoing is only a preferred embodiment of the present application, but the scope of the present application is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art according to the technical solution of the present application and its inventive concept, within the scope disclosed by the present application, shall be covered by the scope of the present application.
Claims (6)
1. A graph coloring method based on deep reinforcement learning is characterized in that,
creating a graph coloring environment; using a message-passing neural network as the agent for deep reinforcement learning;
the agent explores randomly in the graph coloring environment using an epsilon-greedy strategy, and the data generated at each step are stored in an experience replay pool;
when the data in the experience replay pool reach a quantity M, M being a system parameter, transition information of the generated data is randomly sampled from the experience replay pool for value updates; the current state is input into the agent, which outputs a Q-value for each action; the vertex corresponding to the maximum Q-value is taken as the optimal action; a reward value is calculated and the agent is trained;
and coloring the graph by using the trained agent, and giving a coloring scheme and a color number.
2. The deep reinforcement learning-based graph coloring method according to claim 1, wherein the creating a graph coloring environment is specifically:
the graph coloring problem is converted into a unidirectional Markov chain model, potential energy relations among various states are constructed, and then a strategy of increasing the number of colors according to the need is adopted to create the graph coloring environment.
3. The deep reinforcement learning-based graph coloring method according to claim 1, wherein a reward function with an invalid-action penalty (IAP) is designed to guide the agent to learn the optimal coloring strategy.
4. The deep reinforcement learning-based graph coloring method according to claim 3, wherein the IAP design principle is specifically as follows:
the IAP should take precedence over other penalties; because the goal of the agent is to maximize the return, if the reward for increasing the color number is less than the reward for an invalid action, the agent may select the invalid action instead of the action that increases the color number, and may therefore fail to complete the graph coloring task within a limited number of steps before its exploration stops;
the IAP should remain on a scale consistent with the main-objective reward throughout; an excessive penalty interferes with the agent's learning of the main objective.
5. The deep reinforcement learning based graph coloring method of claim 3, wherein the reward function formula is:
wherein χ(s_t) is the color number of state s_t, the set symbol in the formula represents the set of invalid actions, and a represents the currently executed action, i.e., selecting an already-colored vertex.
6. The deep reinforcement learning-based graph coloring method according to claim 2, wherein the strategy of increasing the number of colors on demand is specifically as follows:
the number of colors is initialized to 1, and during the incremental construction of the solution, the number of colors is increased whenever the existing colors cannot satisfy the constraint that adjacent vertices have different colors.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310643641.8A CN116681787A (en) | 2023-06-01 | 2023-06-01 | Deep reinforcement learning-based graph coloring method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310643641.8A CN116681787A (en) | 2023-06-01 | 2023-06-01 | Deep reinforcement learning-based graph coloring method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116681787A true CN116681787A (en) | 2023-09-01 |
Family
ID=87780399
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310643641.8A Pending CN116681787A (en) | 2023-06-01 | 2023-06-01 | Deep reinforcement learning-based graph coloring method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116681787A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118196439A (en) * | 2024-05-20 | 2024-06-14 | 山东浪潮科学研究院有限公司 | Certificate photo color auditing method based on visual language model and multiple agents |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |