CN114118374A - Multi-agent reinforcement learning method and system based on hierarchical consistency learning - Google Patents

Multi-agent reinforcement learning method and system based on hierarchical consistency learning

Info

Publication number
CN114118374A
CN114118374A (application CN202111427417.2A)
Authority
CN
China
Prior art keywords
team
individual
network
intention
loss
Prior art date
Legal status
Pending
Application number
CN202111427417.2A
Other languages
Chinese (zh)
Inventor
金博
李文浩
王祥丰
朱骏
Current Assignee
East China Normal University
Original Assignee
East China Normal University
Priority date
Filing date
Publication date
Application filed by East China Normal University filed Critical East China Normal University
Priority to CN202111427417.2A priority Critical patent/CN114118374A/en
Publication of CN114118374A publication Critical patent/CN114118374A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

A multi-agent reinforcement learning method and system based on hierarchical consistency learning combines a deep set network with a variational auto-encoder and introduces them into multi-agent reinforcement learning to regularize the group behavior of a sorting robot cluster in a large-scale automatic sorting task. The method introduces an auxiliary unsupervised learning task and a self-supervised learning task to learn team intention representations and individual intention representations efficiently, and applies a diversity constraint to team intentions and a consistency constraint to individual intentions, thereby ensuring close cooperation within teams and diverse exploration across teams and effectively improving the exploration efficiency and cooperation efficiency of a large-scale multi-agent system when completing a cooperative task.

Description

Multi-agent reinforcement learning method and system based on hierarchical consistency learning
Technical Field
The invention relates to a technology in the field of warehousing automation, and in particular to a multi-agent reinforcement learning method and system based on hierarchical consistency learning, which are used to solve large-scale automatic sorting tasks of sorting robot clusters in the field of warehousing automation.
Background
Most existing multi-agent reinforcement learning algorithms follow the centralized-training, decentralized-execution framework. In the centralized training phase, the agents learn decentralized policies by sharing local observations, parameters or gradients. However, existing algorithms model the agents as independent individuals during centralized training; although the agents can cooperate through explicit or implicit communication, it is difficult to learn an effective communication protocol through end-to-end training, which prevents efficient exploration and cooperation in a large-scale multi-agent system. A sorting robot cluster that must cooperatively complete a sorting task constitutes a typical large-scale multi-agent system. The behavior of an agent is determined by team goals and by its awareness of the environment; diverse team goals and consistent environmental awareness are therefore the key to efficient exploration and collaboration. To design a multi-agent reinforcement learning algorithm with efficient exploration and collaboration, two problems are central: how to generate diversity among teams and how to cooperate closely within a team.
Disclosure of Invention
Aiming at the shortcomings of the prior art in solving automatic sorting tasks in the field of warehousing automation, the invention provides a multi-agent reinforcement learning method and system based on hierarchical consistency learning. It introduces an auxiliary unsupervised learning task and a self-supervised learning task to acquire team intention representations and individual intention representations efficiently, and ensures close cooperation within teams and diverse exploration across teams by applying a diversity constraint to team intentions and a consistency constraint to individual intentions, thereby effectively improving the exploration efficiency and cooperation efficiency of a large-scale multi-agent system when completing a cooperative task.
The invention is realized by the following technical scheme:
the invention relates to a multi-wisdom reinforcement learning method based on hierarchical consistency learning, which is characterized in that a team intention network, an individual intention network and a multi-wisdom decision network are constructed and randomly initialized; clustering and grouping all the agents by using a hierarchical clustering algorithm to form a team; the unsupervised comparison loss, the self-supervised regression loss, the consistency loss, the reconstruction loss and the decision loss are respectively calculated through the team intention network, the individual intention network and the multi-intelligence decision network and are comprehensively obtained to obtain the overall loss, the team intention network, the individual intention network and the multi-intelligence decision network are optimized according to the overall loss, the sorting robot cluster dynamically groups the team intention network, the individual intention network and the multi-intelligence decision network in the process of completing the large-scale sorting task, and the large-scale sorting task is decomposed into the small-scale subtasks of welfare. The intelligent agents in the teams can realize closer cooperation, and the intelligent agents among the teams can show the diversity of sorting strategies, so that each subtask can be effectively completed, and finally, the higher task completion degree can be quickly achieved in large-scale automatic sorting tasks.
The team intention network comprises a state encoder, a deep set network and an intention encoder, wherein: the state encoder receives the local observation and the global state information of each agent as input and generates a local state code for each agent; the deep set network receives the local state codes of all agents in a team, applies a further linear encoding and nonlinear mapping to each agent's local state code, and then averages them to obtain the team code of the whole team; the intention encoder receives the team code and outputs the team intention after another linear encoding and nonlinear mapping.
The state encoder is a residual neural network (ResNet), which may be implemented as in, but is not limited to, Kaiming He et al., "Deep Residual Learning for Image Recognition", CVPR (2016).
The deep set network is a fully connected neural network with a nonlinear activation layer, which may be implemented as in, but is not limited to, Zaheer, Manzil et al., "Deep Sets", NeurIPS (2017).
The intention encoder is likewise a fully connected neural network with a nonlinear activation layer.
The individual intention network comprises an individual encoder, a graph convolution network and an individual intention variational encoder, wherein: the individual encoder receives the local observation of each agent and outputs the individual code of that agent; the graph convolution network treats all agents in a team as a fully connected undirected graph whose nodes represent the agents and whose node codes are the individual codes output by the individual encoder, and performs a graph convolution over these codes to obtain the individual perception of each agent; the individual intention variational encoder takes as input the concatenation of the team intention output by the team intention network and the individual perception output by the graph convolution network, and outputs the individual intention of each agent.
The individual encoder is a residual neural network (ResNet), which may be implemented as in, but is not limited to, Kaiming He et al., "Deep Residual Learning for Image Recognition", CVPR (2016).
The graph convolution network is a graph convolutional neural network, which may be implemented as in, but is not limited to, Kipf, Thomas N. and Max Welling, "Semi-Supervised Classification with Graph Convolutional Networks", ICLR (2017).
The individual intention variational encoder is a variational auto-encoder, which may be implemented using, but is not limited to, the technique in Diederik P. Kingma and Max Welling, "Auto-Encoding Variational Bayes", ICLR (2014).
The multi-agent decision network generates decision actions for all agents in each team according to the joint observation of all agents in that team; it is a residual neural network (ResNet), which may be implemented as in, but is not limited to, Kaiming He et al., "Deep Residual Learning for Image Recognition", CVPR (2016).
The invention also relates to a multi-agent reinforcement learning system based on hierarchical consistency, used for large-scale automatic sorting tasks of sorting robot clusters in the field of warehousing automation and implementing the above method, comprising: an initialization unit, a heuristic grouping unit, a team intention unit, an individual intention unit, a multi-agent decision unit and a model optimization unit, wherein: the initialization unit constructs and randomly initializes the team intention network, the individual intention network and the multi-agent decision network; the heuristic grouping unit clusters all agents into teams with a hierarchical clustering algorithm; the team intention unit uses the team intention network to generate a team intention for each team from the joint observation of all agents in that team, and uses the network to compute the unsupervised contrastive loss from the different team intentions and the self-supervised regression loss from the joint observations; the individual intention unit uses the individual intention network to generate individual intentions for all agents, and uses the network to compute the consistency loss from the hierarchical consistency constraint and the reconstruction loss from the variational auto-encoder loss; the multi-agent decision unit uses the multi-agent decision network to generate decision actions for all agents from the team intention of each team and the individual intentions of its members, interacts with the environment to obtain each agent's reward feedback, and uses the network to compute the decision loss from the rewards; the model optimization unit sums the unsupervised contrastive loss, self-supervised regression loss, consistency loss, reconstruction loss and decision loss with weights to obtain the overall loss and optimizes the team intention network, the individual intention network and the multi-agent decision network accordingly.
Technical effects
The invention combines a deep set network with a variational auto-encoder and introduces them into multi-agent reinforcement learning to regularize the group behavior of a sorting robot cluster in a large-scale automatic sorting task: the deep set network and the individual intention variational encoder are combined with the unsupervised contrastive loss and the consistency loss, so that agent behavior across teams shows sufficient diversity while behavior within a team shows sufficient collaboration, ultimately improving both the time efficiency and the completion degree of the automatic sorting task. Existing methods, by contrast, address the two problems separately, so the automatic sorting policy learned by the sorting robots cannot satisfy the diversity and collaboration requirements at the same time, which harms time efficiency and task completion degree.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of a team intent network in an embodiment;
FIG. 3 is a schematic diagram of an individual intention network in an embodiment;
FIG. 4 is a logic diagram of the calculation of the unsupervised contrastive loss and the self-supervised regression loss in the embodiment;
FIG. 5 is a logic diagram of the calculation of consistency loss and reconstruction loss in the embodiment;
FIG. 6 is a schematic diagram of the collaborative sorting task in embodiment 1;
FIG. 7 is a schematic diagram of the collaborative sorting task with interfering robots in embodiment 2;
FIG. 8 is a diagram of a multi-agent reinforcement learning system based on hierarchical consistency learning according to an embodiment.
Detailed Description
Example 1
This embodiment relates to a specific implementation of the above multi-agent reinforcement learning method based on hierarchical consistency learning in a large-scale automatic sorting scene for a sorting robot cluster in the field of warehousing automation, as shown in fig. 6.
In the sorting task of this embodiment, 12 sorting robots start in the center area of a square map and 32 shelves are uniformly distributed over its four corner areas. A sorting robot obtains a reward by navigating to a shelf. The local observation of each sorting robot covers all information in the square area of radius 7 unit lengths centered on itself, i.e. the two-dimensional coordinates of the other sorting robots and of the shelves within that area. The decision actions are move up, move down, move left, move right, stay in place and lift a shelf, and each movement covers one unit length. A sorting robot can perform the lift action only after reaching a shelf position. Sorting robots are penalized for colliding with each other and for performing the lift action at a position without a shelf. To encourage the robots to finish the sorting task faster, each movement action also incurs a small penalty. The agents need to learn to group automatically and decompose the large-scale sorting task reasonably, so that each team can move to a different shelf area and sort the shelves there.
As shown in fig. 1, the multi-agent reinforcement learning method based on hierarchical consistency of this embodiment comprises:
S110, a team intention network, an individual intention network, a multi-agent decision network and a joint decision evaluation function are constructed and randomly initialized; all network/function parameters are initialized with the Xavier method (which may be implemented as in, but is not limited to, Glorot, Xavier and Yoshua Bengio, "Understanding the difficulty of training deep feedforward neural networks", AISTATS (2010)).
Table 1 lists the network parameters of the team intention network and the individual intention network (the table is reproduced as an image in the original publication).
As shown in FIG. 2, the team intention network comprises a state encoder, a deep set network and an intention encoder. The state encoder first concatenates each sorting robot's 14×14 two-dimensional local observation matrix with the 14×14 two-dimensional global state matrix to obtain a 14×14×2 feature map, upsamples it to 224×224×2, and convolves it with 64 convolution kernels of size 7×7 and stride 2 to obtain a 112×112×64 feature map; after 3×3 max pooling with stride 2 and two convolutions with 64 3×3 kernels, a 56×56×64 feature map is obtained; after two convolutions with 128 3×3 kernels, two with 256 3×3 kernels and two with 512 3×3 kernels, a 7×7×512 feature map is obtained, and 7×7 average pooling finally yields a 512-dimensional state code. The deep set network then takes the state codes of all agents in the team as input: each state code passes through one fully connected layer and a ReLU activation to obtain a 64-dimensional hidden feature per agent, and all hidden features within the team are averaged to obtain a 64-dimensional team code. Finally, the intention encoder takes the 64-dimensional team code and the flattened 196-dimensional (14×14) global state as input and, after one fully connected layer and a ReLU activation, produces the 32-dimensional team intention of each team.
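A minimal sketch of this team intention pathway is given below, assuming a PyTorch implementation; the class name TeamIntentionNet is a placeholder, and nn.Identity stands in for the ResNet-style state encoder described above, so this is an illustration of the deep-set aggregation rather than the patent's actual code.

```python
import torch
import torch.nn as nn

class TeamIntentionNet(nn.Module):
    """Sketch: per-agent state codes, permutation-invariant deep-set pooling,
    then an intention encoder. Dimensions follow the embodiment (512-d state
    code, 64-d team code, 32-d team intention, 196-d flattened global state)."""

    def __init__(self, state_dim=512, hidden_dim=64, intent_dim=32, global_dim=196):
        super().__init__()
        self.state_encoder = nn.Identity()  # placeholder for the ResNet state encoder
        # deep set branch: per-agent linear + ReLU, then a mean over the team
        self.phi = nn.Sequential(nn.Linear(state_dim, hidden_dim), nn.ReLU())
        # intention encoder: team code concatenated with the flattened global state
        self.intent_encoder = nn.Sequential(
            nn.Linear(hidden_dim + global_dim, intent_dim), nn.ReLU())

    def forward(self, state_codes, global_state):
        # state_codes: (n_agents_in_team, 512); global_state: (196,)
        h = self.phi(self.state_encoder(state_codes))      # (n, 64)
        team_code = h.mean(dim=0)                           # (64,), permutation-invariant
        team_intent = self.intent_encoder(
            torch.cat([team_code, global_state], dim=-1))   # (32,)
        return team_intent
```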
As shown in fig. 3, the individual intention network comprises an individual encoder, a graph convolution network and an individual intention variational encoder. The individual encoder first upsamples each sorting robot's 14×14 two-dimensional local observation matrix to 224×224 and convolves it with 64 convolution kernels of size 7×7 and stride 2 to obtain a 112×112×64 feature map; after 3×3 max pooling with stride 2 and two convolutions with 64 3×3 kernels, a 56×56×64 feature map is obtained; after two convolutions with 128 3×3 kernels, two with 256 3×3 kernels and two with 512 3×3 kernels, a 7×7×512 feature map is obtained, and 7×7 average pooling finally yields a 512-dimensional individual code. The graph convolution network then takes the 512-dimensional individual codes of all agents in the team as input and, after a fully connected layer, a ReLU activation and a graph convolution layer, produces a 64-dimensional individual perception code per agent. Finally, the individual intention variational encoder takes the 64-dimensional individual perception code of each agent in the team as input: on one hand, the team intention and the individual perception code are concatenated and passed through a fully connected layer and a ReLU activation to obtain the 64-dimensional individual intention; on the other hand, the individual perception code is passed through an individual perception decoder and an individual decoder whose structures mirror the individual encoder and the graph convolution network, yielding a 224×224 reconstructed local feature, which is downsampled to the 14×14 reconstructed local observation.
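The following sketch, again assuming PyTorch, illustrates the individual intention pathway; the graph convolution over a fully connected team graph is reduced to a mean aggregation, the Gaussian reparameterization in the variational head is an assumed standard form, and all class and attribute names are placeholders.

```python
import torch
import torch.nn as nn

class IndividualIntentionNet(nn.Module):
    """Sketch: individual encoding, one graph-convolution step over a fully
    connected team graph, then a variational encoder fusing the 32-d team
    intention with the 64-d individual perception into a 64-d individual intention."""

    def __init__(self, obs_dim=512, percept_dim=64, team_dim=32, intent_dim=64):
        super().__init__()
        self.individual_encoder = nn.Identity()  # stands in for the ResNet individual encoder
        self.pre_gcn = nn.Sequential(nn.Linear(obs_dim, percept_dim), nn.ReLU())
        self.gcn_weight = nn.Linear(percept_dim, percept_dim)
        # variational heads (mean / log-variance) -- an assumed Gaussian posterior
        self.mu = nn.Linear(team_dim + percept_dim, intent_dim)
        self.logvar = nn.Linear(team_dim + percept_dim, intent_dim)

    def forward(self, individual_codes, team_intent):
        # individual_codes: (n_agents, 512); team_intent: (32,)
        h = self.pre_gcn(self.individual_encoder(individual_codes))   # (n, 64)
        # fully connected graph => aggregation is a mean over all team members
        agg = h.mean(dim=0, keepdim=True).expand_as(h)
        percept = torch.relu(self.gcn_weight(agg + h))                # (n, 64) individual perception
        fused = torch.cat([team_intent.expand(h.size(0), -1), percept], dim=-1)
        mu, logvar = self.mu(fused), self.logvar(fused)
        intent = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return intent, mu, logvar, percept
```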
The multi-agent decision network first concatenates each sorting robot's 64-dimensional individual intention with the 32-dimensional team intention of the whole team to obtain a 96-dimensional feature vector, and finally, after one fully connected layer, outputs the 6-dimensional decision action for each agent.
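A small sketch of this decision head, under the same PyTorch assumption (the variable names are placeholders), is:

```python
import torch
import torch.nn as nn

# The 64-d individual intention and the 32-d team intention are concatenated
# into a 96-d vector and mapped to the 6 discrete actions
# (up, down, left, right, stay, lift), as in the embodiment.
decision_head = nn.Sequential(nn.Linear(96, 6))

individual_intent = torch.randn(64)
team_intent = torch.randn(32)
q_values = decision_head(torch.cat([individual_intent, team_intent]))  # (6,)
action = q_values.argmax().item()                                      # greedy action index
```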
S120, all agents are clustered into teams with a hierarchical clustering algorithm;
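The grouping step can be sketched with SciPy's agglomerative (hierarchical) clustering over the robots' two-dimensional coordinates; the number of teams is treated here as a free parameter, since the excerpt leaves the stopping criterion to the heuristic grouping unit.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def group_agents(positions, n_teams):
    """Group agents into teams by hierarchical clustering over 2-D coordinates.
    `n_teams` is an assumed parameter, not specified in this excerpt."""
    Z = linkage(positions, method="average")              # bottom-up merge tree
    labels = fcluster(Z, t=n_teams, criterion="maxclust") # cut into n_teams clusters
    return labels                                          # team id per agent

positions = np.random.rand(12, 2) * 14                     # e.g. 12 robots on a 14x14 map
team_ids = group_agents(positions, n_teams=4)
```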
S130, as shown in FIG. 4, team intentions are generated for each team through the team intention network from the joint observation of all agents in that team, the corresponding unsupervised contrastive loss is calculated, and the self-supervised regression loss is calculated from the joint observations of all agents in each team.
To obtain the team intention, the local observation and the global state information of each agent are concatenated and fed into the state encoder, which generates the local state code of each agent as described above; the local state codes are then fed into the deep set network, which applies the nonlinear mapping described above to each agent's local state code and averages them into the team code of the whole team; finally, the team code passes through the intention encoder to yield the team intention c_t^k of team k at time t.
The unsupervised contrastive loss (given as an equation image in the original) is a margin-based loss with a positive real hyper-parameter m: the intentions of teams k and u carrying the same label are pulled together, while the intentions of teams k and v carrying different labels are pushed apart; the teams u and v are obtained by randomly sampling over all weakly connected components. Specifically, for a team k obtained by the method at time t, the spatial positions of all agents in the team are averaged to obtain the team's spatial center of gravity, which, combined with the current time stamp t, gives the spatio-temporal feature representation of team k at time t. The Euclidean distance between every pair of teams is then computed from these spatio-temporal features. All teams are modeled as a weighted undirected graph whose nodes are the teams, with an edge between every pair of nodes weighted by the inter-team Euclidean distance, and the graph is pruned by the following rule: an edge between two teams is kept only if one of them is the nearest team of the other. The method then uses the classical Tarjan algorithm from graph theory to find all weakly connected components of the pruned graph, numbers each component as a label, and builds a labeled training set pairing the team intention generated by the team intention network with the number of the weakly connected component the team belongs to, from which the unsupervised contrastive loss is calculated.
The self-supervised regression loss (also given as an equation image in the original) is computed with a team intention decoder f_ξ(·), which receives the team intention together with the global state information and predicts the joint observation of all agents in the corresponding team at the next time step; the regression loss measures the error of this prediction.
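The labeling and the contrastive term can be sketched as follows. This is a heavily hedged illustration: the exact loss formula appears only as an image in the original, so the margin form below is an assumption consistent with the description, and SciPy's connected_components is used in place of the Tarjan algorithm named in the text.

```python
import numpy as np
import torch
import torch.nn.functional as F
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def team_labels(team_positions, t):
    """Label teams via weakly connected components of the nearest-team graph.
    team_positions: list of (n_k, 2) arrays of member coordinates at step t."""
    feats = np.array([[*p.mean(axis=0), t] for p in team_positions])  # spatio-temporal centroid
    d = np.linalg.norm(feats[:, None] - feats[None, :], axis=-1)      # pairwise Euclidean distance
    np.fill_diagonal(d, np.inf)
    adj = np.zeros_like(d)
    adj[np.arange(len(d)), d.argmin(axis=1)] = 1                      # keep only nearest-team edges
    _, labels = connected_components(csr_matrix(adj), connection="weak")
    return labels

def contrastive_loss(c_k, c_u, c_v, m=1.0):
    """Assumed margin form: c_k and c_u share a label, c_k and c_v do not."""
    pos = F.mse_loss(c_k, c_u)                          # pull same-label intentions together
    neg = F.relu(m - torch.norm(c_k - c_v)) ** 2        # push different-label intentions apart by m
    return pos + neg
```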
S140, as shown in fig. 5, generating individual intents for all agents through the individual intention network, calculating a consistency loss according to the hierarchical consistency constraint, and calculating a reconstruction loss according to the loss of the variational auto-encoder.
To obtain the individual intention, an individual encoder g_φ encodes the local observation of each agent i in team k into an individual code as described above. The agents in the team are then modeled as a fully connected undirected graph, and a graph neural network g_ψ performs a graph convolution over the individual codes of all agents in the team to obtain each agent's individual perception. Finally, the individual intention variational encoder concatenates the team intention with the individual perception and computes each agent's individual intention from it (the corresponding equations appear as images in the original).
The consistency loss and the reconstruction loss are summed (their formulas likewise appear as equation images in the original): the consistency loss is taken over agents i, j belonging to the same team k, i.e. over different agents of the same team, and the reconstruction loss compares the reconstructed local observation q produced by the variational auto-encoder with the original local observation, following the computation flow described above.
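Since the formulas are only given as images, the sketch below shows one plausible reading, stated as an assumption: the consistency loss penalizes disagreement between the individual intentions of same-team agents, and the reconstruction loss is the usual variational auto-encoder objective (reconstruction error plus a KL term to a unit Gaussian prior).

```python
import torch
import torch.nn.functional as F

def consistency_loss(intents):
    """Assumed form: mean pairwise squared distance between the individual
    intentions of agents in the same team (encourages intra-team agreement)."""
    n = intents.size(0)
    diffs = intents.unsqueeze(0) - intents.unsqueeze(1)        # (n, n, d)
    return (diffs ** 2).sum(-1).mean() if n > 1 else intents.new_zeros(())

def reconstruction_loss(recon_obs, obs, mu, logvar):
    """Assumed standard VAE objective: reconstruction error + KL divergence."""
    rec = F.mse_loss(recon_obs, obs)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl
```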
S150, decision actions are generated for all agents through the multi-agent decision network, as described above, from the team intention of each team and the individual intentions of all agents in that team; the agents interact with the environment to obtain their respective reward feedback, and the decision loss is calculated through the QMIX algorithm from the reward obtained by each agent via the multi-agent decision network;
the reward comprises: external rewards and internal rewards.
The external reward is defined as follows: agent i, i.e. a sorting robot, receives a reward of -0.01 for each movement, a reward of 5 for successfully lifting a shelf, and a reward of -0.1 for performing the lift action at a position without a shelf.
The internal reward (given as an equation image in the original) is computed from the team intentions of two different teams.
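The internal-reward formula itself is only given as an image; the tiny sketch below illustrates just one plausible reading, explicitly an assumption, in which a team is rewarded for keeping its intention far from those of other teams (inter-team diversity).

```python
import torch

def internal_reward(own_intent, other_intents):
    """Assumed diversity bonus: mean distance between this team's intention and
    the intentions of all other teams (the patent's exact form is not given here)."""
    dists = [torch.norm(own_intent - c) for c in other_intents]
    return torch.stack(dists).mean()
```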
The decision loss (given as equation images in the original) combines the local decision loss of each agent with the joint decision loss of all agents; the relation between the local and the global terms is balanced by a positive real hyper-parameter λ_QMIX = 0.01, and γ = 0.99 is a discount factor between 0 and 1 that balances short-term and long-term returns.
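Since QMIX is named in the text, a generic QMIX-style temporal-difference sketch is given below; the mixing network is abstracted away, and which side of the combination carries λ_QMIX is an assumption, as only the value 0.01 is stated in this excerpt.

```python
import torch
import torch.nn.functional as F

def td_target(reward, next_q_values, gamma=0.99, done=False):
    """One-step TD target: r + gamma * max_a' Q(s', a')."""
    return reward + (0.0 if done else gamma * next_q_values.max())

def decision_loss(local_q, local_targets, joint_q, joint_target,
                  lambda_qmix=0.01, gamma=0.99):
    """Assumed combination of per-agent (local) TD losses and the QMIX-mixed
    joint TD loss, balanced by lambda_qmix as in the embodiment."""
    local_loss = F.mse_loss(local_q, local_targets)
    joint_loss = F.mse_loss(joint_q, joint_target)
    return joint_loss + lambda_qmix * local_loss
```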
S160, the unsupervised contrastive loss, self-supervised regression loss, consistency loss, reconstruction loss and decision loss are summed with weights to obtain the overall loss, and the team intention network, the individual intention network and the multi-agent decision network are optimized accordingly; specifically, all network parameters are optimized with the Adam optimizer using a learning rate of 0.0001, a batch size of 512, 500 training rounds, at most 250 iterations per round, and a parameter update frequency of 4 updates per batch of data.
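One optimization step can be sketched as follows, assuming PyTorch; the individual loss weights are placeholders, since the excerpt only states that the five losses are weighted and summed, and the single linear layer stands in for the union of the three networks' parameters.

```python
import torch
import torch.nn as nn

def total_loss(losses, weights):
    """Weighted sum of the five losses; the weights here are placeholders."""
    return sum(weights[name] * value for name, value in losses.items())

# stand-in for the parameters of the team-intention, individual-intention
# and decision networks, which would be gathered from the models above
model = nn.Linear(96, 6)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # learning rate from the embodiment

losses = {"contrast": torch.tensor(0.1), "regress": torch.tensor(0.2),
          "consist": torch.tensor(0.05), "recon": torch.tensor(0.3),
          "decision": (model(torch.randn(96)) ** 2).mean()}  # dummy values for illustration
weights = {name: 1.0 for name in losses}                     # placeholder weights

loss = total_loss(losses, weights)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```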
The training set is generated as follows: at each time step of each training round, the sorting robots obtain their decision actions from the team intention network, the individual intention network and the multi-agent decision network as described above and execute them in the simulation environment. After the environment receives the joint decision of all sorting robots, it updates the state of each sorting robot, i.e. its two-dimensional coordinates and the shelf states, and computes a reward for each sorting robot. Each time step thus yields a (joint observation, joint decision, reward) triple, which is placed into an experience replay buffer to form the training data set. In this embodiment, the experience replay buffer holds 1,000,000 entries.
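A minimal replay-buffer sketch matching these triples and the stated capacity (class and method names are placeholders) is:

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (joint_observation, joint_decision, reward) triples; capacity
    1,000,000 as in the embodiment."""
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, joint_obs, joint_action, reward):
        self.buffer.append((joint_obs, joint_action, reward))

    def sample(self, batch_size=512):               # batch size from the embodiment
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```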
Over ten experiments initialized with different random seeds, this embodiment reaches a cumulative reward of 767 ± 74; in all simulation experiments it reaches 100% task completion, i.e. all 32 shelves are sorted.
Example 2
Building on embodiment 1, this embodiment introduces disturbance factors such as observation noise, decision delay and target misidentification into the sorting environment; these factors affect the behavior of the sorting robots. The embodiment models these real disturbance factors abstractly by introducing additional interfering agents. Specifically, compared with embodiment 1, the automatic sorting task handled by this embodiment, shown in fig. 7, comprises 16 interfering robots, 16 sorting robots with a higher movement speed, and 32 shelves. A sorting robot is rewarded only by reaching a shelf position and lifting the shelf, while an interfering robot is rewarded by reaching the position of a sorting robot. All agents can move up, move down, move left, move right and stay in place; the sorting robots additionally have the lift action. An interfering robot moves 1 unit length per step, whereas a sorting robot moves 2 unit lengths per step. Sorting robots are penalized when colliding with each other, and interfering robots are likewise penalized when colliding with each other. Since the sorting robots move faster than the interfering robots, they need to learn to avoid the interfering robots while learning to decompose the large-scale task by grouping. Apart from these settings, this embodiment is identical to embodiment 1.
For the external reward of the environment, no robot receives an extra penalty for moving; a sorting robot receives a penalty of -1 when caught by an interfering robot, and the interfering robot correspondingly receives a reward of +1; a sorting robot that successfully lifts a shelf receives a reward of +5, and, as before, performing the lift action at a position without a shelf incurs a penalty of -0.1.
In this embodiment, the interfering robots are trained with the same algorithm as the sorting robots. As training proceeds, the interfering robots become increasingly capable, which represents increasingly strong disturbances in the real environment, so the sorting robots must learn better cooperation strategies during training to withstand them.
Over ten experiments initialized with different random seeds, a cumulative reward of 620 ± 44 is reached, and in all simulation experiments a task completion degree of 100% is finally achieved, i.e. all 32 shelves are sorted despite the presence of the interfering robots.
Example 3
As shown in fig. 8, the multi-agent reinforcement learning system based on hierarchical consistency for implementing the above method comprises: an initialization unit 510, a heuristic grouping unit 520, a team intention unit 530, an individual intention unit 540, a multi-agent decision unit 550 and a model optimization unit 560, wherein: the initialization unit 510 constructs and randomly initializes the team intention network, the individual intention network and the multi-agent decision network; the heuristic grouping unit 520 clusters all agents into teams with a hierarchical clustering algorithm; the team intention unit 530 uses the team intention network to generate a team intention for each team from the joint observation of all agents in that team, and uses the network to compute the unsupervised contrastive loss from the different team intentions and the self-supervised regression loss from the joint observations; the individual intention unit 540 uses the individual intention network to generate individual intentions for all agents, and uses the network to compute the consistency loss from the hierarchical consistency constraint and the reconstruction loss from the variational auto-encoder loss; the multi-agent decision unit 550 uses the multi-agent decision network to generate decision actions for all agents from the team intention of each team and the individual intentions of its members, interacts with the environment to obtain each agent's reward feedback, and uses the network to compute the decision loss from the rewards; the model optimization unit 560 sums the unsupervised contrastive loss, self-supervised regression loss, consistency loss, reconstruction loss and decision loss with weights to obtain the overall loss and optimizes the team intention network, the individual intention network and the multi-agent decision network accordingly.
The initialization unit 510 comprises a team intention network initialization unit, an individual intention network initialization unit and a multi-agent decision network initialization unit, which initialize the parameters of the team intention network, the individual intention network and the multi-agent decision network, respectively, with the Xavier method according to each network's architecture.
The heuristic grouping unit clusters all sorting robots into teams with a hierarchical clustering algorithm according to their two-dimensional coordinates and assigns a team ID to each sorting robot.
The team intention unit uses the team intention network to generate a team intention for each team from the joint observation of all agents in that team, computes the unsupervised contrastive loss from the obtained team intentions, and computes the self-supervised regression loss from the joint observations of all agents in each team.
In practical experiments under the environment settings of embodiment 1 and embodiment 2, all network parameters are optimized with the Adam optimizer using a learning rate of 0.0001, a batch size of 512, 500 training rounds, at most 250 iterations per round, and a parameter update frequency of 4 updates per batch of data. Over ten experiments initialized with different random seeds, the setting of embodiment 1 reaches a cumulative reward of 767 ± 74 and 100% task completion in all simulation experiments, i.e. all 32 shelves are sorted; the setting of embodiment 2 reaches a cumulative reward of 620 ± 44 and finally 100% task completion in all simulation experiments, i.e. all 32 shelves are sorted despite the presence of the interfering robots.
Compared with the prior art, the invention improves on existing multi-agent reinforcement learning methods in the scale of sorting robots and shelves that can be handled, in the time efficiency of completing the sorting task, and in the degree of task completion.
The foregoing embodiments may be modified in many different ways by those skilled in the art without departing from the spirit and scope of the invention, which is defined by the appended claims and all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims (9)

1. A multi-agent reinforcement learning method based on hierarchical consistency learning, characterized in that a team intention network, an individual intention network and a multi-agent decision network are constructed and randomly initialized; all agents are clustered into teams with a hierarchical clustering algorithm; an unsupervised contrastive loss, a self-supervised regression loss, a consistency loss, a reconstruction loss and a decision loss are computed through the team intention network, the individual intention network and the multi-agent decision network and combined into an overall loss, according to which the team intention network, the individual intention network and the multi-agent decision network are optimized; the sorting robot cluster regroups dynamically while completing the large-scale sorting task and decomposes it into several smaller-scale subtasks, so that agents within a team cooperate more closely while agents across teams exhibit diverse sorting strategies, each subtask is completed effectively, and a high task completion degree is reached quickly in large-scale automatic sorting tasks.
2. The multi-agent reinforcement learning method based on hierarchical consistency learning according to claim 1, characterized in that the team intention network comprises a state encoder, a deep set network and an intention encoder, wherein: the state encoder receives the local observation and the global state information of each agent as input and generates a local state code for each agent; the deep set network receives the local state codes of all agents in a team, applies a further linear encoding and nonlinear mapping to each agent's local state code, and then averages them to obtain the team code of the whole team; the intention encoder receives the team code and outputs the team intention after another linear encoding and nonlinear mapping;
the individual intention network comprises an individual encoder, a graph convolution network and an individual intention variational encoder, wherein: the individual encoder receives the local observation of each agent and outputs the individual code of that agent; the graph convolution network treats all agents in a team as a fully connected undirected graph whose nodes represent the agents and whose node codes are the individual codes output by the individual encoder, and performs a graph convolution over these codes to obtain the individual perception of each agent; the individual intention variational encoder takes as input the concatenation of the team intention output by the team intention network and the individual perception output by the graph convolution network, and outputs the individual intention of each agent.
3. The multi-agent reinforcement learning method based on hierarchical consistency learning according to claim 1 or 2, characterized in that the team intention network comprises a state encoder, a deep set network and an intention encoder, wherein: the state encoder first concatenates each sorting robot's 14×14 two-dimensional local observation matrix with the 14×14 two-dimensional global state matrix to obtain a 14×14×2 feature map, upsamples it to 224×224×2, and convolves it with 64 convolution kernels of size 7×7 and stride 2 to obtain a 112×112×64 feature map; after 3×3 max pooling with stride 2 and two convolutions with 64 3×3 kernels, a 56×56×64 feature map is obtained; after two convolutions with 128 3×3 kernels, two with 256 3×3 kernels and two with 512 3×3 kernels, a 7×7×512 feature map is obtained, and 7×7 average pooling finally yields a 512-dimensional state code; the deep set network then takes the state codes of all agents in the team as input: each state code passes through one fully connected layer and a ReLU activation to obtain a 64-dimensional hidden feature per agent, and all hidden features within the team are averaged to obtain a 64-dimensional team code; finally, the intention encoder takes the 64-dimensional team code and the flattened 196-dimensional (14×14) global state as input and, after one fully connected layer and a ReLU activation, produces the 32-dimensional team intention of each team;
the individual intention network comprises an individual encoder, a graph convolution network and an individual intention variational encoder, wherein: the individual encoder first upsamples each sorting robot's 14×14 two-dimensional local observation matrix to 224×224 and convolves it with 64 convolution kernels of size 7×7 and stride 2 to obtain a 112×112×64 feature map; after 3×3 max pooling with stride 2 and two convolutions with 64 3×3 kernels, a 56×56×64 feature map is obtained; after two convolutions with 128 3×3 kernels, two with 256 3×3 kernels and two with 512 3×3 kernels, a 7×7×512 feature map is obtained, and 7×7 average pooling finally yields a 512-dimensional individual code; the graph convolution network then takes the 512-dimensional individual codes of all agents in the team as input and, after a fully connected layer, a ReLU activation and a graph convolution layer, produces a 64-dimensional individual perception code per agent; finally, the individual intention variational encoder takes the 64-dimensional individual perception code of each agent in the team as input: on one hand, the team intention and the individual perception code are concatenated and passed through a fully connected layer and a ReLU activation to obtain the 64-dimensional individual intention; on the other hand, the individual perception code is passed through an individual perception decoder and an individual decoder whose structures mirror the individual encoder and the graph convolution network, yielding a 224×224 reconstructed local feature, which is downsampled to the 14×14 reconstructed local observation.
4. The multi-agent reinforcement learning method based on hierarchical consistency learning according to any one of claims 1 to 3, characterized in that the individual intention is obtained as follows: an individual encoder g_φ encodes the local observation of each agent i in team k into an individual code as described above; the agents in the team are then modeled as a fully connected undirected graph, and a graph neural network g_ψ performs a graph convolution over the individual codes of all agents in the team to obtain each agent's individual perception; finally, the individual intention variational encoder concatenates the team intention with the individual perception and computes each agent's individual intention from it (the corresponding equations appear as images in the original).
5. The multi-agent reinforcement learning method based on hierarchical consistency learning according to claim 1, characterized in that the consistency loss and the reconstruction loss are summed (their formulas appear as equation images in the original), where the consistency loss is taken over agents i, j belonging to the same team k, i.e. over different agents of the same team, and q is the reconstructed local observation obtained by the variational auto-encoder, computed as described above.
6. A multi-agent reinforcement learning system based on hierarchical consistency for a large-scale automatic sorting task of a sorting robot cluster in the field of warehousing automation, implementing the method of any one of claims 1 to 5, characterized by comprising: an initialization unit, a heuristic grouping unit, a team intention unit, an individual intention unit, a multi-agent decision unit and a model optimization unit, wherein: the initialization unit constructs and randomly initializes the team intention network, the individual intention network and the multi-agent decision network; the heuristic grouping unit clusters all agents into teams with a hierarchical clustering algorithm; the team intention unit uses the team intention network to generate a team intention for each team from the joint observation of all agents in that team, and uses the network to compute the unsupervised contrastive loss from the different team intentions and the self-supervised regression loss from the joint observations; the individual intention unit uses the individual intention network to generate individual intentions for all agents, and uses the network to compute the consistency loss from the hierarchical consistency constraint and the reconstruction loss from the variational auto-encoder loss; the multi-agent decision unit uses the multi-agent decision network to generate decision actions for all agents from the team intention of each team and the individual intentions of its members, interacts with the environment to obtain each agent's reward feedback, and uses the network to compute the decision loss from the rewards; the model optimization unit sums the unsupervised contrastive loss, self-supervised regression loss, consistency loss, reconstruction loss and decision loss with weights to obtain the overall loss and optimizes the team intention network, the individual intention network and the multi-agent decision network accordingly.
7. The system according to claim 6, characterized in that the initialization unit comprises a team intention network initialization unit, an individual intention network initialization unit and a multi-agent decision network initialization unit, which initialize the parameters of the team intention network, the individual intention network and the multi-agent decision network, respectively, with the Xavier method according to each network's architecture.
8. The system according to claim 6, characterized in that the heuristic grouping unit clusters all sorting robots into teams with a hierarchical clustering algorithm according to their two-dimensional coordinates and assigns a team ID to each sorting robot.
9. The system according to claim 6, characterized in that the team intention unit uses the team intention network to generate a team intention for each team from the joint observation of all agents in that team, computes the unsupervised contrastive loss from the generated team intentions, and computes the self-supervised regression loss from the joint observations of all agents in each team.
CN202111427417.2A 2021-11-29 2021-11-29 Multi-agent reinforcement learning method and system based on hierarchical consistency learning Pending CN114118374A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111427417.2A CN114118374A (en) Multi-agent reinforcement learning method and system based on hierarchical consistency learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111427417.2A CN114118374A (en) Multi-agent reinforcement learning method and system based on hierarchical consistency learning

Publications (1)

Publication Number Publication Date
CN114118374A true CN114118374A (en) 2022-03-01

Family

ID=80370530

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111427417.2A Pending CN114118374A (en) Multi-agent reinforcement learning method and system based on hierarchical consistency learning

Country Status (1)

Country Link
CN (1) CN114118374A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116252306A (en) * 2023-05-10 2023-06-13 中国空气动力研究与发展中心设备设计与测试技术研究所 Object ordering method, device and storage medium based on hierarchical reinforcement learning
CN116252306B (en) * 2023-05-10 2023-07-11 中国空气动力研究与发展中心设备设计与测试技术研究所 Object ordering method, device and storage medium based on hierarchical reinforcement learning

Similar Documents

Publication Publication Date Title
Jaafra et al. Reinforcement learning for neural architecture search: A review
Li et al. A survey of learning-based intelligent optimization algorithms
Li et al. A survey on firefly algorithms
Yesil et al. Fuzzy cognitive maps learning using artificial bee colony optimization
US10776691B1 (en) System and method for optimizing indirect encodings in the learning of mappings
CN112132263A (en) Multi-agent autonomous navigation method based on reinforcement learning
CN107253195A (en) A kind of carrying machine human arm manipulation ADAPTIVE MIXED study mapping intelligent control method and system
Reinhart Autonomous exploration of motor skills by skill babbling
Ranjan et al. A novel and efficient classifier using spiking neural network
CN114510012A (en) Unmanned cluster evolution system and method based on meta-action sequence reinforcement learning
Jaafra et al. A review of meta-reinforcement learning for deep neural networks architecture search
CN111178545A (en) Dynamic reinforcement learning decision training system
Zhao et al. Aspw-drl: assembly sequence planning for workpieces via a deep reinforcement learning approach
Belmonte-Baeza et al. Meta reinforcement learning for optimal design of legged robots
CN114118374A (en) Multi-agent reinforcement learning method and system based on hierarchical consistency learning
CN114143882A (en) Multi-intelligence system self-organizing method and system based on reinforced organization control
Menon et al. On the role of hyperdimensional computing for behavioral prioritization in reactive robot navigation tasks
Wu et al. Multi-agent collaborative learning with relational graph reasoning in adversarial environments
Seyde et al. Strength through diversity: Robust behavior learning via mixture policies
Alet et al. Robotic gripper design with evolutionary strategies and graph element networks
García-Vico et al. A preliminary analysis on software frameworks for the development of spiking neural networks
Golluccio et al. Objects relocation in clutter with robot manipulators via tree-based q-learning algorithm: Analysis and experiments
CN115327926A (en) Multi-agent dynamic coverage control method and system based on deep reinforcement learning
Chen et al. XCS with opponent modelling for concurrent reinforcement learners
Marzi et al. Feudal graph reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination