CN116681787A - Deep reinforcement learning-based graph coloring method - Google Patents
- Publication number
- CN116681787A (application number CN202310643641.8A)
- Authority
- CN
- China
- Prior art keywords
- agent
- coloring
- graph coloring
- graph
- reinforcement learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T11/00—2D [Two Dimensional] image generation
- G06T11/001—Texturing; Colouring; Generation of texture or colour
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/092—Reinforcement learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/90—Determination of colour characteristics
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
A graph coloring method based on deep reinforcement learning: a graph coloring environment is created; a message-passing neural network is used as the agent for deep reinforcement learning; the agent explores randomly in the graph coloring environment and stores the data generated at each step in an experience replay pool; transition information is randomly sampled from the experience replay pool for value updates, the current state is input into the agent, which outputs a Q-value for each action, the vertex corresponding to the maximum Q-value is taken as the optimal action, a reward value is calculated, and the agent is trained; the trained agent then colors the graph and outputs a coloring scheme and a color number. The application converts the graph coloring problem into a unidirectional Markov chain model so that the agent does not fall into cyclic trajectories during exploration, designs a strategy that increases the number of colors on demand while the solution is constructed incrementally so that no prior knowledge of the color number is required, and finally designs a reward function that guides the agent to learn the optimal coloring strategy.
Description
Technical Field
The application belongs to the field of deep learning, and particularly relates to a graph coloring method based on deep reinforcement learning.
Background
The graph coloring problem (GCP) is a well-known combinatorial optimization problem in the field of operations research. The goal is to color the nodes of a graph with as few colors as possible while ensuring that adjacent nodes do not share the same color. GCP has many practical applications, such as timetabling, fault diagnosis, mobile radio frequency allocation, and register allocation. Traditional methods for solving the graph coloring problem mainly comprise exact algorithms, approximation algorithms, and heuristic algorithms. Exact algorithms, such as branch-and-bound or other enumeration methods, can produce optimal solutions for small-scale problems; however, due to their non-polynomial time complexity, they become intractable for large-scale problems. Approximation algorithms improve computational efficiency and come with theoretical guarantees, but they cannot guarantee the minimum color number in scenarios that require high-quality solutions. Heuristic algorithms, such as tabu search, can find good solutions in an acceptable time, but designing them requires extensive domain knowledge. Furthermore, these algorithms are iterative by nature and must search anew for each new instance, so they are not applicable in practical scenarios with time constraints and a need for flexibility. It is therefore crucial to find a graph coloring method that can meet the need for fast solutions and generalizes well.
Disclosure of Invention
The application aims to provide a graph coloring method based on deep reinforcement learning for coloring nodes of a graph.
The technical scheme of the application is as follows:
a graph coloring method based on deep reinforcement learning is characterized in that,
creating a graph coloring environment; using a message-passing neural network as the agent for deep reinforcement learning;
the agent explores randomly in the graph coloring environment using an epsilon-greedy strategy, and the data generated at each step are stored in an experience replay pool;
when the data in the experience replay pool reach the quantity M, transition information of the generated data is randomly sampled from the experience replay pool for value updates; the current state is input into the agent, which outputs a Q-value for each action; the vertex corresponding to the maximum Q-value is taken as the optimal action; a reward value is calculated and the agent is trained;
and coloring the graph by using the trained agent, and giving a coloring scheme and a color number.
Further, the creating a graph coloring environment specifically includes:
the graph coloring problem is converted into a unidirectional Markov chain model, potential energy relations among various states are constructed, and then a strategy of increasing the number of colors according to the need is adopted to create the graph coloring environment.
Further, a reward function with an invalid-action penalty (IAP) is designed to guide the agent to learn the optimal coloring strategy.
Further, the design principle of the IAP specifically includes:
The IAP should take precedence over other penalties. Because the goal of the agent is to maximize the return, if the reward for increasing the color number is less than the reward for an invalid action, the agent may select the invalid action instead of the action that increases the color number, and may therefore fail to complete the graph coloring task within a limited number of steps before its exploration stops.
The IAP should remain on a scale consistent with the main-objective reward throughout; an excessive penalty interferes with the agent's learning of the main objective.
Further, the reward function formula is:
wherein χ(s_t) is the color number of state s_t, the set symbol in the formula represents the set of invalid actions, and a represents the currently executed action, i.e., selecting an already-colored vertex.
Further, the strategy of increasing the number of colors on demand is specifically as follows:
The number of colors is initialized to 1, and during the incremental construction of the solution, the number of colors is increased whenever the existing colors cannot satisfy the constraint that adjacent vertices have different colors.
Further, the agent is trained.
The application has the technical effects that:
the application provides a graph coloring method based on deep reinforcement learning. The graph coloring problem is converted into a unidirectional Markov chain model, so that an intelligent agent cannot fall into a circulation track in the exploration process, and potential energy relations among all states are constructed. The strategy of increasing the number of colors as needed is designed in the process of incrementally constructing the solution so that a priori knowledge of the number of colors is not required. Finally, a reward function with IAP is designed to guide the agent to learn the optimal coloring strategy.
Drawings
The accompanying drawings illustrate various embodiments by way of example and not by way of limitation, and together with the description and claims serve to explain the embodiments of the application. Wherever possible, the same reference numbers are used throughout the drawings to refer to the same or like parts. Such embodiments are illustrative and are not intended to be exhaustive or to limit the present apparatus or method.
FIG. 1 illustrates a schematic diagram of a deep reinforcement learning-based graph coloring algorithm framework of the present application;
FIG. 2 shows a schematic diagram of the unidirectional Markov chain model of the graph coloring process of the present application.
Detailed Description
It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.
A graph coloring method based on deep reinforcement learning,
first, the graphics shading problem GCP is converted to a unidirectional Markov chain model, which can simplify the complexity and ease of handling of the graphics shading process. In one aspect, the application provides that no matter what decision is performed, each state will return to the previous state, which allows the agent to not get into the loop track during the exploration process. On the other hand, based on the change in solution subset, the relationship of potential energy between states on the Markov chain is given. It was then demonstrated that when the return value was maximum, all vertices were colored and the number of colors was minimal.
The initial state s(1, 0) indicates that the graph coloring task starts from a blank graph with a color number of 1. A state s(χ, k) records the number of colors χ in the current state and the number k of vertices that have been colored (k ≤ n). λ_(χ,k) indicates that state s(χ, k) has been visited λ_(χ,k) times. When all vertices are colored (k = n), the end state is reached. Apart from the initial state and the end state, each state has three transition paths: (1) coloring an uncolored vertex and increasing the color number, i.e., s(χ, k) → s(χ+1, k+1); (2) coloring an uncolored vertex without increasing the color number, i.e., s(χ, k) → s(χ, k+1); (3) selecting an already-colored vertex, in which case the state remains unchanged, i.e., s(χ, k) → s(χ, k).
The application then provides a deep reinforcement learning framework that adopts an MPNN as the deep reinforcement learning agent and aggregates the relevant information among the vertices of the graph. In particular, a reward function based on the unidirectional Markov chain is designed to guide the agent toward states of increasing potential energy, so that the agent learns how to reach the state of maximum potential energy in a limited time. On this basis, a GCP-oriented Q-learning algorithm is provided to train the agent. In the proposed algorithm, the strategy of increasing the number of colors on demand removes the need for prior knowledge of the optimal number of colors: the number of colors is initialized to 1, and it is increased whenever the existing colors cannot satisfy the constraint that adjacent vertices have different colors during the incremental construction of the solution.
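As an illustration only (not part of the original disclosure), the unidirectional transitions and the on-demand color-increase strategy described above could be realized roughly as in the following sketch; the class and method names (GraphColoringEnv, reset, step) and the use of a 0/1 adjacency matrix are assumptions made for this example.

```python
# Illustrative sketch only: a minimal environment realizing the three transition
# paths and the on-demand color increase. Names and interfaces are assumptions.
import numpy as np


class GraphColoringEnv:
    def __init__(self, adjacency):
        self.adjacency = np.asarray(adjacency)   # n x n 0/1 adjacency matrix
        self.n = self.adjacency.shape[0]
        self.reset()

    def reset(self):
        self.colors = np.full(self.n, -1)        # -1 marks an uncolored vertex
        self.num_colors = 1                      # the color number chi starts at 1
        return self._state()

    def _state(self):
        return self.colors.copy(), self.num_colors

    def step(self, v):
        """Color vertex v; return (next_state, done, valid)."""
        if self.colors[v] != -1:
            # Path (3): invalid action, the state stays at s(chi, k).
            return self._state(), False, False
        neighbor_colors = {int(self.colors[u])
                           for u in np.flatnonzero(self.adjacency[v])
                           if self.colors[u] != -1}
        free = [c for c in range(self.num_colors) if c not in neighbor_colors]
        if free:
            # Path (2): color v without a new color, s(chi, k) -> s(chi, k+1).
            self.colors[v] = free[0]
        else:
            # Path (1): increase the color number on demand, s(chi, k) -> s(chi+1, k+1).
            self.colors[v] = self.num_colors
            self.num_colors += 1
        done = bool((self.colors != -1).all())    # end state: all vertices colored (k = n)
        return self._state(), done, True


# Example: a single edge forces the color number to grow to 2.
env = GraphColoringEnv([[0, 1], [1, 0]])
state, done, valid = env.step(0)
state, done, valid = env.step(1)
```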
The MPNN is used as the graph neural network framework; in the graph coloring problem it can effectively propagate and aggregate vertex information such as colors and degrees over the graph. The application designs an MPNN architecture for the graph coloring problem, whose message-passing function is as follows:
wherein θ_1 is the neural network parameter of the k-th message-passing layer, N(v) is the set of vertices adjacent to vertex v, w_uv is the weight of the edge connecting vertex v and vertex u, and ζ_v is the adjacent-edge feature of vertex v; the remaining symbol in the formula denotes the feature of vertex v in the k-th round of message passing.
The update function is as follows:
wherein θ_2 is the neural network parameter of the k-th vertex-update layer, and the square brackets denote the concatenation (splicing) operation.
The output function of the readout phase is as follows:
wherein θ_3 and θ_4 are the parameters of the two-layer neural network in the readout phase, and V is the set of vertices.
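As an illustration only, the three stages above (message passing, vertex update, readout) could be sketched in PyTorch roughly as follows; the layer sizes, the activation functions, and the exact construction of the vertex features are assumptions, since the formulas above are given only symbolically.

```python
# Illustrative sketch only: a schematic message-passing network that outputs one
# Q-value per vertex. Dimensions and layer choices are assumptions.
import torch
import torch.nn as nn


class MPNNAgent(nn.Module):
    def __init__(self, feat_dim, hidden_dim=64, rounds=3):
        super().__init__()
        self.rounds = rounds
        self.msg = nn.Linear(feat_dim, hidden_dim)               # message function (theta_1)
        self.upd = nn.Linear(feat_dim + hidden_dim, feat_dim)    # vertex update (theta_2)
        self.readout = nn.Sequential(                            # readout MLP (theta_3, theta_4)
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1))

    def forward(self, h, adj):
        """h: (n, feat_dim) vertex features; adj: (n, n) weighted adjacency matrix (w_uv)."""
        for _ in range(self.rounds):
            # Message stage: aggregate transformed neighbor features, weighted by w_uv.
            m = adj @ torch.relu(self.msg(h))
            # Update stage: concatenate the old vertex feature with the aggregated message.
            h = torch.relu(self.upd(torch.cat([h, m], dim=1)))
        # Readout stage: one Q-value per vertex, i.e., per coloring action.
        return self.readout(h).squeeze(-1)


# Example: Q-values for a 4-vertex graph with 3-dimensional vertex features.
adj = torch.tensor([[0, 1, 1, 0],
                    [1, 0, 1, 0],
                    [1, 1, 0, 1],
                    [0, 0, 1, 0]], dtype=torch.float32)
h = torch.rand(4, 3)
agent = MPNNAgent(feat_dim=3)
q_values = agent(h, adj)   # shape (4,), one Q-value per vertex
```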
Finally, the application provides a specific implementation of the reward function. To prevent the state from remaining stationary, an IAP is added to the reward function to mask invalid actions generated during the decision process (a vertex that has already been colored cannot be selected for coloring again). Because the magnitude of an IAP is difficult to set and it may conflict with the main-objective reward, a poorly chosen IAP often leads to bad results. The application therefore proposes design principles for an IAP oriented to the GCP: (1) the IAP should take precedence over other penalties; because the goal of the agent is to maximize the return (the sum of all rewards), if the reward for increasing the color number is less than the reward for an invalid action, the agent may select the invalid action instead of the action that increases the color number, and may therefore fail to complete the graph coloring task within a limited number of steps before its exploration stops; (2) the IAP should remain on a scale consistent with the main-objective reward throughout; an excessive penalty interferes with the agent's learning of the main objective, so the IAP must stay coordinated with the graph coloring reward. The reward function formula is as follows:
wherein χ(s_t) is the color number of state s_t, the set symbol in the formula represents the set of invalid actions, and a represents the currently executed action (i.e., selecting an already-colored vertex).
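As an illustration only, one possible reward that respects the two design principles above is sketched below; the constants and the functional form are assumptions made for this example and do not reproduce the formula of the application.

```python
# Illustrative sketch only: a reward with an invalid-action penalty (IAP).
# The constants are assumptions: the IAP is strictly more negative than the
# color-increase penalty (principle 1) while staying on a comparable scale
# (principle 2).
def reward(chi_before, chi_after, action_is_invalid,
           iap=-2.0, color_penalty=-1.0):
    if action_is_invalid:          # an already-colored vertex was selected
        return iap
    if chi_after > chi_before:     # the coloring step forced a new color
        return color_penalty
    return 0.0                     # vertex colored without adding a color
```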
First, a graph coloring environment is created according to the Markov chain model and the on-demand color-increase strategy. Coloring actions are performed: vertices are colored according to the proposed on-demand color-increase strategy, a reward value is calculated with the proposed IAP reward function, and finally the vertex states (such as vertex colors, the available colors of each vertex, and the current color number) are updated.
Second, the agent explores randomly on randomly generated graphs with an epsilon-greedy strategy, and the data generated at each step, a tuple (s, a, r, s′) (where s is the current state, a the action, r the reward for executing action a in the current state, and s′ the next state), are stored in the experience replay pool.
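As an illustration only, the epsilon-greedy exploration and the experience replay pool could be sketched as follows; the buffer size, the epsilon value, the way a state is packed into the stored tuple, and the helper build_features are assumptions made for this example.

```python
# Illustrative sketch only: epsilon-greedy action selection and storage of
# (s, a, r, s') transitions in an experience replay pool.
import random
from collections import deque

replay_pool = deque(maxlen=100_000)   # experience replay pool (capacity is an assumption)


def select_action(q_values, uncolored, epsilon):
    """With probability epsilon pick a random uncolored vertex, otherwise argmax Q."""
    if random.random() < epsilon:
        return random.choice(uncolored)
    return max(uncolored, key=lambda v: float(q_values[v]))


# Inside the exploration loop, one step would look roughly like:
#   q = agent(h, adj)                                   # Q-value per vertex
#   a = select_action(q, uncolored_vertices, epsilon)
#   (colors, chi), done, valid = env.step(a)
#   h_next = build_features(colors, chi, adj)           # hypothetical feature builder
#   replay_pool.append(((h, adj), a, r, (h_next, adj), done))
```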
Third, when the data in the experience replay pool reach a certain amount, transition tuples (s, a, r, s′) are randomly sampled from the pool for value updates: the current state s is input to the agent, which outputs a Q-value for each action; the vertex corresponding to the largest Q-value, i.e., argmax Q(s, A) (where A is the action set), is taken as the optimal action a*; a reward value is calculated and the agent is trained.
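As an illustration only, the sampling-and-update step could be sketched as a standard DQN-style update; the batch size, the discount factor, and the omission of a separate target network are simplifications assumed here and are not specified by the application.

```python
# Illustrative sketch only: a DQN-style update over a random mini-batch of
# transitions stored as ((h, adj), a, r, (h_next, adj_next), done).
import random
import torch


def update(agent, optimizer, replay_pool, batch_size=32, gamma=0.99):
    if len(replay_pool) < batch_size:
        return
    batch = random.sample(list(replay_pool), batch_size)
    losses = []
    for (s, a, r, s_next, done) in batch:
        h, adj = s
        h_next, adj_next = s_next
        q_sa = agent(h, adj)[a]                       # Q(s, a) for the taken action
        with torch.no_grad():
            target = torch.tensor(float(r))
            if not done:
                target = target + gamma * agent(h_next, adj_next).max()
        losses.append((q_sa - target) ** 2)
    loss = torch.stack(losses).mean()                 # mean squared TD error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```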
Fourth, the graph is colored with the trained agent, which outputs a coloring scheme and a color number.
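As an illustration only, inference with the trained agent could be sketched as follows; build_features is a hypothetical helper for assembling vertex features (colors, degrees, etc.), and all other names are carried over from the earlier sketches rather than from the application itself.

```python
# Illustrative sketch only: greedy inference that repeatedly colors the
# uncolored vertex with the largest Q-value until the graph is fully colored.
import torch


def color_graph(agent, env, build_features):
    colors, num_colors = env.reset()
    done = False
    while not done:
        h = build_features(colors, num_colors, env.adjacency)   # hypothetical helper
        adj = torch.as_tensor(env.adjacency, dtype=torch.float32)
        with torch.no_grad():
            q = agent(h, adj)
        uncolored = [v for v in range(env.n) if colors[v] == -1]
        a = max(uncolored, key=lambda v: float(q[v]))            # argmax Q over valid actions
        (colors, num_colors), done, _ = env.step(a)
    return colors, num_colors                                    # coloring scheme and color number
```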
The foregoing is only a preferred embodiment of the present application, but the scope of the present application is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art according to the technical solution of the present application and its inventive concept, within the scope disclosed by the present application, shall be covered by the scope of the present application.
Claims (6)
1. A graph coloring method based on deep reinforcement learning is characterized in that,
creating a graph coloring environment; using a message-passing neural network as the agent for deep reinforcement learning;
the agent explores randomly in the graph coloring environment using an epsilon-greedy strategy, and the data generated at each step are stored in an experience replay pool;
when the data in the experience replay pool reach a quantity M, M being a system parameter, transition information of the generated data is randomly sampled from the experience replay pool for value updates; the current state is input into the agent, which outputs a Q-value for each action; the vertex corresponding to the maximum Q-value is taken as the optimal action; a reward value is calculated and the agent is trained;
and coloring the graph by using the trained agent, and giving a coloring scheme and a color number.
2. The deep reinforcement learning-based graph coloring method according to claim 1, wherein the creating a graph coloring environment is specifically:
the graph coloring problem is converted into a unidirectional Markov chain model, potential energy relations among various states are constructed, and then a strategy of increasing the number of colors according to the need is adopted to create the graph coloring environment.
3. The deep reinforcement learning-based graph coloring method according to claim 1, wherein a reward function with an invalid-action penalty (IAP) is designed to guide the agent to learn the optimal coloring strategy.
4. The deep reinforcement learning-based graph coloring method according to claim 3, wherein the IAP design principle is specifically as follows:
the IAP should take precedence over other penalties; because the goal of the agent is to maximize the return, if the reward for increasing the color number is less than the reward for an invalid action, the agent may select the invalid action instead of the action that increases the color number, and may therefore fail to complete the graph coloring task within a limited number of steps before its exploration stops;
the IAP should remain on a scale consistent with the main-objective reward throughout; an excessive penalty interferes with the agent's learning of the main objective.
5. The deep reinforcement learning based graph coloring method of claim 3, wherein the reward function formula is:
wherein χ(s_t) is the color number of state s_t, the set symbol in the formula represents the set of invalid actions, and a represents the currently executed action, i.e., selecting an already-colored vertex.
6. The deep reinforcement learning-based graph coloring method according to claim 2, wherein the strategy of increasing the number of colors on demand is specifically as follows:
the number of colors is initialized to 1, and during the incremental construction of the solution, the number of colors is increased whenever the existing colors cannot satisfy the constraint that adjacent vertices have different colors.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310643641.8A CN116681787A (en) | 2023-06-01 | 2023-06-01 | Deep reinforcement learning-based graph coloring method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310643641.8A CN116681787A (en) | 2023-06-01 | 2023-06-01 | Deep reinforcement learning-based graph coloring method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116681787A true CN116681787A (en) | 2023-09-01 |
Family
ID=87780399
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310643641.8A Pending CN116681787A (en) | 2023-06-01 | 2023-06-01 | Deep reinforcement learning-based graph coloring method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116681787A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118196439A (en) * | 2024-05-20 | 2024-06-14 | 山东浪潮科学研究院有限公司 | Certificate photo color auditing method based on visual language model and multiple agents |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |