CN112884126B - Deep reinforcement learning network system - Google Patents


Info

Publication number
CN112884126B
CN112884126B (application CN202110218003.2A)
Authority
CN
China
Prior art keywords
loading
box
boxes
layer
container
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110218003.2A
Other languages
Chinese (zh)
Other versions
CN112884126A (en)
Inventor
张经纬
资斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Lan Pangzi Machine Intelligence Co ltd
Original Assignee
Shenzhen Lan Pangzi Machine Intelligence Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Lan Pangzi Machine Intelligence Co ltd
Priority to CN202110218003.2A
Publication of CN112884126A
Application granted
Publication of CN112884126B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/08Logistics, e.g. warehousing, loading or distribution; Inventory or stock management
    • G06Q10/083Shipping

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Business, Economics & Management (AREA)
  • Economics (AREA)
  • Development Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Probability & Statistics with Applications (AREA)
  • Container Filling Or Packaging Operations (AREA)

Abstract

The application discloses a deep reinforcement learning network system comprising an embedding layer, a sequencing layer, and a loading layer. The embedding layer maps the boxing information of the boxes to be boxed into feature vectors; the sequencing layer orders a plurality of unordered boxes to be boxed; and the loading layer determines the loading position and orientation of each box according to the loading order output by the sequencing layer. The boxing problem is thereby decomposed into two steps and solved with a hierarchical boxing model: 1. a sequence decision problem: given a set of unordered boxes to be loaded, deep reinforcement learning determines their loading order. 2. a loading problem: based on the loading order given by the sequence decision step, deep reinforcement learning determines the loading position and orientation of each box. Decomposing the boxing problem in this way greatly reduces the overall action space to be searched, and with it the learning and search difficulty of the agent.

Description

Deep reinforcement learning network system
Technical Field
The application relates to the technical field of artificial intelligence and boxing, in particular to a deep reinforcement learning network system and a storage medium.
Background
Boxing is currently performed, for the most part, empirically by hand. To improve the rationality and efficiency of boxing, intelligent algorithms may be employed to assist.
The boxing problem is a classical academic problem and has wide commercial application value. In the field of logistics, there is often the problem of requiring a specified series of goods to be loaded into a specified container, such as a car, for transport.
The vast majority of existing algorithms can be divided into two categories: (1) heuristic search algorithms that use manually formulated rules; and (2) optimization algorithms, such as genetic algorithms and deep learning algorithms, that treat boxing as a nonlinear optimization problem.
However, the fundamental drawback of the first class of heuristic-based search methods is that their results rely on artificially specified heuristic rules: the results are good when the rules apply, but otherwise it is difficult to obtain a usable scheme. Most boxing scenarios have complex constraints of their own, which make it difficult to find a set of universally applicable rules. Moreover, the rules must be readjusted each time the scenario changes significantly, which limits the versatility of the algorithm itself.
Within the second category, nonlinear optimization by genetic algorithms and deep reinforcement learning, most existing boxing algorithms based on deep reinforcement learning can themselves be divided into two types: (1) deep reinforcement learning determines the box loading order while a conventional heuristic algorithm calculates the loading positions; (2) the box loading order is given by another algorithm or fixed by experimental constraints, and deep reinforcement learning determines only the loading positions.
The first type of existing deep-reinforcement-learning boxing algorithm has two main defects: (1) because a conventional heuristic algorithm calculates the loading positions and the space utilization of the loading result serves as the reinforcement-learning reward, the chosen heuristic becomes the bottleneck of the whole boxing algorithm; since different heuristics are tuned only for specific datasets, this approach is difficult to transfer smoothly across boxing scenarios. (2) Conventional heuristic algorithms are difficult to parallelize on a GPU, so training the deep reinforcement learning model requires a great deal of time and computing resources.
The main disadvantages of the second type of method are: (1) real boxing problems generally have no fixed loading order, so such methods have limited applicability; (2) the loading order of the boxes has a great impact on the final loading result, so such algorithms find it hard to optimize the final loading rate; (3) because box loading is subject to strict spatial restrictions, a large number of calculations are required.
Accordingly, it is desirable to provide a deep reinforcement learning network system that improves the packing versatility and computational efficiency.
Disclosure of Invention
In order to overcome the problems in the related art, the application provides a deep reinforcement learning network system and a storage medium, which aim to improve the packing universality and the computing efficiency.
The technical scheme for solving the above technical problems is as follows: a deep reinforcement learning network system for use in intelligent boxing, comprising an embedding layer that includes: a box embedding model for mapping the boxing information of a plurality of unordered boxes to be boxed into its representation, box embedding, in a high-dimensional space; and a boundary embedding model for mapping the loading information of a container into its representation, frontier embedding, in a high-dimensional space.
Preferably, the deep reinforcement learning network system further includes a sequencing layer for ordering the plurality of unordered boxes to be boxed. The sequencing layer is provided with a sequence decision model that outputs the probabilities of the boxes to be boxed according to the representation box embedding of their boxing information and the representation frontier embedding of the container's loading information.
Preferably, the deep reinforcement learning network system further comprises a loading layer for determining the loading position and orientation of each box to be boxed according to the loading order output by the sequencing layer.
Preferably, the sequence decision model selects one box to be boxed according to the probabilities of the plurality of boxes to be boxed, and the boxing information of the selected box is input into the loading model of the loading layer.
Preferably, the sequence decision model is further configured to derive the representation selected box embedding of the selected box to be boxed based on a self-attention mechanism.
Preferably, the loading model is configured to output the loading position and orientation of the selected box in the container according to the representation box embedding of the boxing information of the unselected boxes, the representation selected box embedding of the selected box, and the representation frontier embedding of the container's loading information.
Preferably, the sequence decision model is further configured to output the probabilities of the unselected boxes to be boxed according to the representation box embedding of their boxing information and the representation frontier embedding of the container's loading information.
Preferably, the sequence decision model employs a pointer network based on a self-attention mechanism.
Preferably, the sequence decision model trains the self-attention-based pointer network with a policy gradient algorithm.
Preferably, the structure of the box embedding model, in the processing order of its computing units, is: a first layer-normalization unit for normalizing the boxing information of the boxes to be boxed; a self-attention unit for assigning weights to the plurality of unordered boxes according to their sizes and the current state inside the container; a second layer-normalization unit for normalizing the output of the self-attention unit; and a multi-layer perceptron unit for mapping the output of the second layer-normalization unit.
Preferably, the self-attention mechanism unit comprises at least three self-attention layers.
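The unit order just described (layer normalization, self-attention, layer normalization, multi-layer perceptron) can be sketched minimally in numpy. Everything below is illustrative: a single attention head stands in for the stacked self-attention layers, the normalization has no learned scale or shift, and all weight matrices and sizes are arbitrary.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each box's feature vector (minimal form, no learned parameters).
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def self_attention(x):
    # Single-head scaled dot-product self-attention over the set of boxes.
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)               # pairwise box-to-box weights
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)          # row-wise softmax
    return w @ x

def mlp(x, w1, w2):
    # Two-layer perceptron mapping into the embedding space.
    return np.maximum(x @ w1, 0) @ w2

def box_embedding(boxes, w1, w2):
    # LayerNorm -> self-attention -> LayerNorm -> MLP, in the described order.
    h = layer_norm(boxes)
    h = self_attention(h)
    h = layer_norm(h)
    return mlp(h, w1, w2)

rng = np.random.default_rng(0)
boxes = rng.uniform(1, 10, size=(5, 3))         # 5 boxes, (length, width, height)
w1 = rng.normal(size=(3, 16))
w2 = rng.normal(size=(16, 8))
emb = box_embedding(boxes, w1, w2)
print(emb.shape)  # (5, 8)
```

One row of `emb` per box; a trained model would stack at least three attention layers as stated above.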
A second aspect of the present application provides a computer readable storage medium storing a computer program which when executed by a processor implements a method as described above.
The application provides a deep reinforcement learning network system and a storage medium. The system is provided with an embedding layer, a sequencing layer, and a loading layer: the embedding layer maps the boxing information of the boxes to be boxed into feature vectors, the sequencing layer orders the plurality of unordered boxes, and the loading layer determines the loading position and orientation of each box according to the loading order output by the sequencing layer. The boxing problem is thereby decomposed into two steps and solved with a hierarchical boxing model: 1. a sequence decision problem: given a set of unordered boxes to be loaded, deep reinforcement learning determines their loading order. 2. a loading problem: based on the loading order given by the sequence decision step, deep reinforcement learning determines the loading position and orientation of each box. This decomposition greatly reduces the overall action space to be searched, turning the product of the sequence action space and the loading action space into the sum of the two, and greatly reduces the learning and search difficulty of the agent. The two decision networks of the invention (the sequence decision network of the sequence decision problem and the loading network of the loading problem) use the same set of high-dimensional representations as input, which ensures consistent convergence of the neural networks. Meanwhile, the algorithm of the invention makes full use of GPU parallelization, solving the problem of overly long training time.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The foregoing and other objects, features and advantages of the application will be apparent from the following more particular descriptions of exemplary embodiments of the application as illustrated in the accompanying drawings wherein like reference numbers generally represent like parts throughout the exemplary embodiments of the application.
FIG. 1 is a flow diagram of a boxing method shown in an embodiment of the present application;
FIG. 2 is a schematic illustration of a boxing model shown in an embodiment of the present application;
FIG. 3 is another flow diagram of a boxing method that is illustrated in embodiments of the present application;
FIG. 4 is a schematic view of the structure of the container shown in one state in an embodiment of the present application;
FIG. 5 is a schematic diagram of a deep reinforcement learning network system shown in an embodiment of the present application;
FIG. 6 is a schematic diagram of a box embedding model according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a boundary embedding model according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device shown in an embodiment of the present application;
FIGS. 9 a-9 d are process state diagrams of the method of boxing as illustrated in the embodiments herein;
Fig. 10a to 10b are further process state diagrams of the boxing method shown in the embodiment of the present application.
Detailed Description
Preferred embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The terminology used in the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the present application. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms "first," "second," "third," etc. may be used herein to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first message may also be referred to as a second message, and similarly, a second message may also be referred to as a first message, without departing from the scope of the present application. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present application, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
The following describes the technical scheme of the embodiments of the present application in detail with reference to the accompanying drawings.
Referring to fig. 1, fig. 1 is a schematic flow chart of a boxing method according to a first embodiment of the present application, as shown in fig. 1, the method includes the following steps:
Step S1: obtaining the loading information of the container and the boxing information of a plurality of unordered boxes to be boxed.
Specifically, the container is used for loading the boxes to be boxed; a specified series of goods is loaded into a specified container, such as a car carriage. The loading information of the container includes its length, width, and height and its loading state, that is, information on the articles already inside, such as corner information or information on the boxes already placed. The boxing information of a box to be boxed includes its length, width, and height, the ID (identity) of the goods, weight, volume, whether the goods must be placed face up, whether its load-bearing capacity is limited, and the like. In this embodiment, the plurality of boxes to be boxed is unordered.
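The loading and boxing information enumerated above can be collected into simple record types; all field names below are illustrative, not prescribed by the application:

```python
from dataclasses import dataclass, field

@dataclass
class Box:
    # Boxing information of one box to be boxed.
    box_id: str
    length: float
    width: float
    height: float
    weight: float
    this_side_up: bool = False     # must be placed with its top face upward
    load_limited: bool = False     # load-bearing capacity is limited

@dataclass
class Container:
    # Loading information of the container.
    length: float
    width: float
    height: float
    placed: list = field(default_factory=list)  # boxes already inside, with positions

    @property
    def volume(self):
        return self.length * self.width * self.height

container = Container(length=12.0, width=2.4, height=2.6)
pending = [Box("A", 1.2, 0.8, 1.0, 50.0),
           Box("B", 0.6, 0.6, 0.6, 10.0, this_side_up=True)]
print(round(container.volume, 2))  # 74.88
```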
Step S2: determining the loading order of the plurality of boxes to be boxed through a boxing model, according to the loading information of the container and the boxing information of the boxes to be boxed.
Step S3: determining the loading position and orientation of each box to be boxed through the boxing model, according to the loading order.
Fig. 2 is a schematic view of a boxing model shown in the first embodiment of the present application.
Referring to fig. 2, in the present embodiment, the boxing model includes a first neural network, a second neural network, a sequence decision network, and a loading network. The sequence decision network is trained through a deep reinforcement learning algorithm, so that the loading order of the plurality of unordered boxes to be boxed is determined by the trained sequence decision network; the loading network is likewise trained through a deep reinforcement learning algorithm, so that the loading position and orientation of each box to be boxed in the container are determined by the trained loading network.
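The data flow among the networks of fig. 2 can be sketched as follows. The random projections, vector sizes, and dot-product scorer below are placeholders: the first and second neural networks are the trained embedding models, and the real sequence decision network is a pointer network, not a fixed dot product.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-ins for the two trained embedding networks (random projections here).
W_frontier = rng.normal(size=(4, 8))   # first network: container state -> frontier embedding
W_box = rng.normal(size=(3, 8))        # second network: box sizes -> box embeddings

def sequence_decision(box_emb, frontier):
    # Score each remaining box against the frontier, softmax into probabilities.
    scores = box_emb @ frontier
    p = np.exp(scores - scores.max())
    return p / p.sum()

container_state = rng.uniform(size=4)      # toy loading-information vector
boxes = rng.uniform(1, 5, size=(3, 3))     # three boxes, (l, w, h)
frontier = container_state @ W_frontier
box_emb = boxes @ W_box
probs = sequence_decision(box_emb, frontier)
print(probs.shape)  # (3,)
```

Both downstream networks consume the same pair of representations (`box_emb`, `frontier`), which is the shared-input property the disclosure emphasizes.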
Fig. 3 is another flow chart of the boxing method shown in the first embodiment of the present application.
Referring to fig. 3, determining the loading order of the boxes to be boxed through the boxing model, according to the loading information of the container and the boxing information of the boxes, specifically includes the following steps:
step S201, performing data mapping on the loading information of the container through a first neural network, to obtain a representation frontier embedding of the loading information of the container in a high-order space.
Specifically, the loading information of the container is expressed in a matrix manner, the data mapping is performed through a first neural network, and the representation frontier embedding of the loading information of the container in a high-level space is obtained, the loading information is the case condition of the loaded container and the remaining unloaded space condition of the container, the representation frontier embedding represents the boundary of the container, as shown in fig. 4, fig. 4 is a schematic structural diagram of the container in one state, and the dotted line in fig. 4 represents the boundary of the container. The boxes are typically placed against the boundaries of the containers.
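One common concrete form for such boundary (frontier) information, assumed here rather than specified by the text, is a height map over the container floor grid: each cell records how high the stack at that cell currently reaches, and the surface of nonzero heights traces the loading boundary.

```python
import numpy as np

def frontier_matrix(width_cells, length_cells, placements):
    # Height-map encoding of the loading frontier: cell (i, j) holds the top
    # height of whatever is stacked there; zeros mark unloaded floor.
    h = np.zeros((length_cells, width_cells))
    for (x, y, dx, dy, top) in placements:
        # (x, y) is the footprint origin, (dx, dy) its extent, top its height.
        h[x:x+dx, y:y+dy] = np.maximum(h[x:x+dx, y:y+dy], top)
    return h

# One box with a 2x3-cell footprint and height 4 placed at the container origin.
f = frontier_matrix(width_cells=5, length_cells=6,
                    placements=[(0, 0, 2, 3, 4.0)])
print(f[0, 0], f[5, 4])  # 4.0 0.0
```

The matrix itself is what the first neural network would map into the frontier embedding.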
Step S202, performing data mapping on the boxing information of the plurality of boxes to be boxed through the second neural network to obtain the representation box embedding of the boxing information in a high-dimensional space.
The length, width, and height information of the boxes to be boxed (the box sizes) and the loading state of the target container (the frontier) are mapped by two different neural networks (the second neural network, the box embedding model, and the first neural network, the frontier embedding model) to obtain their representations in a high-dimensional space, namely box embedding and frontier embedding.
In this embodiment, after step S1 obtains the loading information of the container and the boxing information of the unordered boxes to be boxed, steps S201 and S202 are performed simultaneously.
Step S203, inputting the representation frontier embedding of the loading information and the representation box embedding of the boxing information of the plurality of boxes to be boxed into the sequence decision network of the boxing model; the sequence decision network outputs the probabilities of the boxes to be boxed.
Each probability is the probability that the corresponding box is selected for loading into the container at the current state step.
Step 204, the sequence decision network selects one box to be boxed according to the probabilities of the plurality of boxes to be boxed.
In this embodiment, the sequence decision network selects the box with the highest probability, and the boxing information of that box is input into the loading network of the boxing model, so that the loading network loads the selected box into the container.
In step 2041, the sequence decision network obtains the representation selected box embedding of the selected box to be boxed via a self-attention mechanism.
In one embodiment, after step 204, the method further includes:
step 2042, inputting the representation box multiplexing of the boxing information of the unselected multiple boxes to be boxed and the representation frontier embedding of the loading information into the sequence decision network again, wherein the sequence decision network outputs probabilities of the unselected multiple boxes to be boxed. Wherein the probability at which the first to-be-palletized box has been selected in step 204 is the probability that a plurality of to-be-palletized boxes that have not been selected are selected to be palletized into the container in the current state step.
The sequence decision network outputs the probabilities of the unselected boxes and again selects one of them, i.e., the next box to load is chosen from those not yet selected. After steps 204, 2041, and 2042 are repeated, the sequence decision network has output the loading order of the plurality of boxes to be boxed.
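The selection loop of steps 204, 2041, and 2042 reduces to the following sketch, where a volume-based scorer stands in for the trained sequence decision network (the real network would recompute probabilities from the embeddings at every round):

```python
def loading_order(boxes, score):
    # Greedy sequencing loop: at each step score the remaining boxes,
    # pick the most probable one, remove it, and repeat until none remain.
    remaining = list(boxes)
    order = []
    while remaining:
        probs = score(remaining)   # stand-in for the sequence decision network
        best = max(range(len(remaining)), key=probs.__getitem__)
        order.append(remaining.pop(best))
    return order

# Illustrative scorer: prefer larger boxes first (volume as a proxy score).
volumes = {"A": 8.0, "B": 27.0, "C": 1.0}
score = lambda names: [volumes[n] for n in names]
print(loading_order(list(volumes), score))  # ['B', 'A', 'C']
```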
In one embodiment, determining the loading position and orientation of a box to be boxed according to the loading order through the boxing model specifically includes:
inputting simultaneously the representation frontier embedding of the loading information, the representation box embedding of the boxing information of the unselected boxes, and the representation selected box embedding of the selected box into the loading network of the boxing model, to obtain the loading position and orientation of the selected box in the container.
The loading position and orientation of the selected box in the container are described by probabilities, where each probability is the probability that the box is loaded at one position and orientation of the container; in this embodiment, the loading network loads the box at the position and orientation with the highest probability. The container is meshed into a grid, and for the selected box the loading network outputs y = n × t probabilities of placements along the container's width direction, where n is the number of grid cells in the width direction and t is the dimensionality of the box to be boxed; the box may be 2-dimensional or 3-dimensional. Assuming a two-dimensional box, after the loading network outputs the probabilities over the width direction, it determines the position of the selected box along the container's length direction from the loading information, thereby obtaining the position and orientation of the selected box in the container. After the selected box is loaded, the loading information of the container is updated, and the corresponding embedding representation is updated as well.
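As an illustration of decoding the y = n × t output described above, the sketch below reads t as a number of candidate orientations per width cell; this is an interpretation for the sake of the example, since the text calls t the dimensionality of the box.

```python
import numpy as np

def decode_placement(logits, n_width_cells, n_orientations):
    # The loading head emits n * t scores (width cell x orientation);
    # the argmax of the softmax picks the placement with highest probability.
    assert logits.size == n_width_cells * n_orientations
    p = np.exp(logits - logits.max())
    p /= p.sum()
    flat = int(p.argmax())
    return flat // n_orientations, flat % n_orientations  # (width cell, orientation)

# Six illustrative scores for n = 3 width cells and t = 2 orientations.
logits = np.array([0.1, 2.0, -1.0, 0.5, 0.3, 0.0])
cell, ori = decode_placement(logits, n_width_cells=3, n_orientations=2)
print(cell, ori)  # 0 1
```

The length-direction coordinate would then be filled in from the container's loading information, as the text describes.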
Specifically, in step S203 the sequence decision network outputs the probabilities of the boxes to be boxed, and in step S204 the boxing information of the box with the largest probability is input into the loading network, so that the loading network loads that box into the container. The remaining boxes not yet selected are input into the sequence decision network of the boxing model again, and the network outputs their probabilities anew. The operations of steps S203 and S204 are repeated until all of the unordered boxes have been selected by the loading network and loaded into the container. Finally, the sequence decision network has output the loading order of the boxes, and the loading network has output their loading positions and orientations in the container.
In one embodiment, the sequence decision network of the boxing model adopts a pointer network structure based on a self-attention mechanism. As a pointer network, it outputs a probability distribution: for each box to be boxed in the current state, the probability that the box is selected for loading into the container.
Specifically, the representation frontier embedding of the container's loading information and the representation box embedding of the boxing information of the boxes to be boxed are input into the sequence decision network of the boxing model; outputting the probabilities of the boxes specifically further includes the following steps:
step 2031, performing weight distribution on the unordered boxes to be packaged according to the current state in the container and the sizes of the boxes to be packaged, so as to obtain the probability of each box to be packaged. The weight process of each box to be packaged is as follows: 1. and training the sequence decision network by adopting a strategy gradient algorithm to obtain a trained self-attention mechanism-based point network. 2. And inputting the length, width and height of the target container and the length, width and height of a plurality of to-be-packaged boxes into a trained self-attention mechanism-based point network to obtain the weight of each to-be-packaged box.
In step 2032, the sequence decision network selects one of the boxes to be boxed according to the probability that each box is loaded into the container.
Specifically, in this embodiment, the sequence decision network of the boxing model reads the boxing information of the box with the largest probability, and that boxing information is input into the loading network of the boxing model, so that the loading network loads the selected box into the container.
In one embodiment, determining the loading position and orientation of the to-be-loaded box body according to the loading sequence through the boxing model specifically comprises the following steps:
step 301, inputting the representation frontier embedding of the loading information of the container, the representation box packing of the boxing information of the unselected multiple to-be-boxed bodies and the representation selected box embedding of the selected to-be-boxed body into a loading network in the boxing model at the same time, so as to obtain the loading position and orientation of the selected to-be-boxed body in the container.
Specifically, after the sequence decision network outputs the probabilities of the boxes in step S203, the representation selected box embedding of the selected box, the representation frontier embedding of the container's loading information, and the representation box embedding of the remaining unselected boxes are input simultaneously into the loading network of the boxing model, and the loading network outputs the loading position and orientation of the selected box in the container. Each time the sequence decision network inputs the boxing information of the box with the largest probability into the loading network in step S204, step S301 is repeated, until the loading network has output the loading positions and orientations of all the boxes. The loading order output by the sequence decision network and the loading positions and orientations output by the loading network are then input together into the policy gradient algorithm to update the network parameters of the boxing model.
Fig. 9a to 9d are process state diagrams of the boxing method according to the first embodiment of the present application.
Referring to fig. 9a to 9d, the sequence decision network calculates probabilities for a plurality of unordered to-be-boxed bodies, where each probability is the probability that the corresponding to-be-boxed body is selected to be loaded into a target container. Suppose three boxes A, B, C are to be loaded into a target container F. The first time, the three boxes A, B, C are input simultaneously into the sequence decision network as one group of data; the sequence decision network calculates a probability of 0.5 for box A (the probability that box A is selected to be loaded into the container is 0.5), a probability of 0.2 for box B, and a probability of 0.3 for box C, so the sequence decision network selects box A to be loaded into the target container F first, and the loading network outputs the loading position and orientation of box A in the target container F. The second time, the remaining boxes B, C are input simultaneously into the sequence decision network as one group of data; the sequence decision network calculates a probability of 0.7 for B and 0.3 for C, selects box B to be loaded into the target container F, and the loading network outputs the loading position and orientation of box B in the target container F. The third time, the sequence decision network selects box C to be loaded into the target container F, and the loading network outputs the loading position and orientation of box C in the target container F. In this way, the sequence decision network outputs the loading sequence of the plurality of to-be-boxed bodies, and the loading network outputs their loading positions and orientations according to the loading sequence.
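The three rounds of selection in this worked example reduce to repeated argmax over a shrinking candidate set. The probability values below are the ones given in the text (the round-3 value of 1.0 is an assumption, since only one box remains); the networks that produce them are not shown:

```python
# Hard-coded per-round probabilities from the worked example above.
rounds = [
    {"A": 0.5, "B": 0.2, "C": 0.3},  # round 1: A has the largest probability
    {"B": 0.7, "C": 0.3},            # round 2: B has the largest probability
    {"C": 1.0},                      # round 3: only C remains (assumed value)
]
# The loading sequence is the per-round argmax over the remaining boxes.
loading_order = [max(p, key=p.get) for p in rounds]
print(loading_order)  # ['A', 'B', 'C']
```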
Fig. 10a to 10b are further process state diagrams of the boxing method shown in the first embodiment of the present application.
Referring to fig. 10a to 10b, in the present embodiment, four groups of boxes A1, A2, A3, A4, each containing 3 boxes, are to be loaded into four target containers F1, F2, F3, F4 respectively; in step S2, the four groups A1, A2, A3, A4 are processed in parallel by the deep reinforcement learning algorithm. Assume the algorithm calculates a probability of 0.5 for box B1 in group A1, 0.2 for C1, and 0.3 for D1; 0.6 for box B2 in group A2, 0.3 for C2, and 0.1 for D2; 0.6 for box B3 in group A3, 0.3 for C3, and 0.1 for D3; and 0.6 for box B4 in group A4, 0.3 for C4, and 0.1 for D4. In step S203, the sequence decision network simultaneously selects the box with the highest probability in each group, that is, boxes B1, B2, B3 and B4, and inputs them into step S3. In step S301, the loading network computes boxes B1, B2, B3 and B4 simultaneously, that is, it outputs the loading positions and orientations of boxes B1, B2, B3 and B4 in target containers F1, F2, F3 and F4 by parallel operation. Thus, when the physical conditions and shape constraints of the to-be-boxed bodies are evaluated for multiple target containers, fully parallelized matrix operations compute the placement positions and check the space limitations of the boxes, greatly improving calculation speed and boxing efficiency.
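The parallel per-group selection of step S203 can be illustrated with a single vectorized argmax over a probability matrix; the values are the ones given above, and NumPy stands in here for the GPU matrix operations the text refers to:

```python
import numpy as np

# One row of probabilities per group; columns are boxes B*, C*, D*.
probs = np.array([
    [0.5, 0.2, 0.3],   # group A1: boxes B1, C1, D1
    [0.6, 0.3, 0.1],   # group A2: boxes B2, C2, D2
    [0.6, 0.3, 0.1],   # group A3: boxes B3, C3, D3
    [0.6, 0.3, 0.1],   # group A4: boxes B4, C4, D4
])
# A single vectorized argmax selects the highest-probability box per group.
chosen = probs.argmax(axis=1)
# chosen == [0, 0, 0, 0], i.e. boxes B1, B2, B3, B4 are selected together.
```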
In one embodiment, the structure of the first neural network is as follows in the processing order of the computing unit:
a first fully connected layer for mapping the loading information of the container to a feature space; in this embodiment, the first fully connected layer maps the loading information of the container into a 512-dimensional feature vector;
a first normalization layer for normalizing the loading information of the container;
a first activation function layer, the output of the first normalization layer being used as the input of the first activation function layer;
a second fully connected layer for mapping the output of the first activation function layer to another feature space; in this embodiment, the second fully connected layer transforms the 512-dimensional feature vector into a 128-dimensional feature vector;
a second normalization layer for normalizing the output of the first activation function layer;
and a second activation function layer, the output of the second normalization layer being used as the input of the second activation function layer. In this embodiment, the activation function is a ReLU activation function.
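A minimal numpy sketch of this six-layer stack, assuming the stated 512- and 128-dimensional feature vectors; the weights are random placeholders and the input dimension is an assumption for illustration only:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize a feature vector to zero mean and unit variance.
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
d_in = 6                                    # assumed input size (e.g. container dims + frontier stats)
W1 = rng.standard_normal((d_in, 512)) * 0.05  # first fully connected layer → 512-d
W2 = rng.standard_normal((512, 128)) * 0.05   # second fully connected layer → 128-d

def boundary_embed(x):
    h = relu(layer_norm(x @ W1))            # FC → first normalization → first ReLU
    return relu(layer_norm(h @ W2))         # FC → second normalization → second ReLU

emb = boundary_embed(rng.standard_normal(d_in))   # a 128-dimensional embedding
```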
In one embodiment, the structure of the second neural network is as follows in the processing order of the computing units:
the first layer normalization unit is used for normalizing the boxing information of the box to be boxed;
The self-attention mechanism unit is used for distributing weights to a plurality of unordered boxes to be packaged according to the sizes of the boxes to be packaged and the current state in the container;
a second-layer normalization unit for normalizing the output of the self-attention mechanism unit;
and the multi-layer perceptron unit is used for performing data mapping on the output of the second-layer normalization unit.
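A minimal numpy sketch of this four-unit stack (layer normalization, self-attention over the unordered set of boxes, a second layer normalization, and a perceptron stage); the residual connection and all weights are illustrative assumptions, not trained parameters:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
d = 16                                     # assumed feature width per box
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
Wm = rng.standard_normal((d, d)) * 0.1     # single-layer perceptron stand-in

def box_embed(boxes):                      # boxes: (n_boxes, d) feature rows
    x = layer_norm(boxes)                  # first layer normalization unit
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(d))   # weight each box against the others
    x = layer_norm(x + attn @ v)           # second layer normalization (residual assumed)
    return np.maximum(x @ Wm, 0.0)         # perceptron unit with ReLU

out = box_embed(rng.standard_normal((5, d)))   # embeddings for 5 boxes
```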
In this embodiment, the input of the boxing algorithm based on the deep reinforcement learning is:
(1) Length, width and height of target container
(2) Length, width and height of box to be packaged
The output of the algorithm is:
(1) And the location and Orientation (Orientation) information of each box in the container.
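The listed inputs and outputs could be carried by records such as the following; the field names are hypothetical, chosen for illustration rather than taken from the patent:

```python
from dataclasses import dataclass

@dataclass
class Box:
    """Length, width, height of a container or a to-be-boxed body."""
    length: float
    width: float
    height: float

@dataclass
class Placement:
    """Output per box: its location and orientation inside the container."""
    x: float          # position along the container length
    y: float          # position along the container width
    z: float          # position along the container height
    orientation: int  # index into the box's allowed axis-aligned rotations

container = Box(12.0, 2.4, 2.6)            # illustrative container dimensions
result = Placement(0.0, 0.0, 0.0, orientation=0)
```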
In this embodiment, the boxing problem is solved by decomposing it into two steps and using multiple agents (multi-agents): one is a sequence decision network and the other is a loading network. Regarding the sequence decision network: given a series of unordered to-be-loaded boxes, deep reinforcement learning determines their loading sequence. Regarding the loading network: given the box loading sequence, deep reinforcement learning determines the loading position and orientation of each box. The boxing method adopts fully parallelized matrix operations to compute box placement positions and check space limitations; every step of this embodiment is based on exact numerical calculation, so no additional algorithm is needed to verify the feasibility of the loading results.
In this embodiment, the loading information of a container and the boxing information of a plurality of unordered to-be-boxed bodies are obtained; the loading sequence of the plurality of to-be-boxed bodies is determined through a boxing model according to the loading information of the container and the boxing information of the to-be-boxed bodies; and the loading position and orientation of each to-be-boxed body are determined through the boxing model according to the loading sequence. Decomposing the boxing problem in this way greatly reduces the overall action space that must be searched, turning the product of the sequence action space and the loading action space into their sum, and greatly reduces the learning and search difficulty of the agent. The two decision networks of this embodiment (the sequence decision network and the loading network) use the same set of high-dimensional characterizations as input, which guarantees the consistency of neural network convergence. Meanwhile, the boxing algorithm of this embodiment makes full use of GPU parallelization, which solves the problem of excessively long network training time.
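The claimed product-to-sum reduction of the action space can be checked with back-of-envelope arithmetic; the sizes below are illustrative assumptions, not figures from the patent:

```python
# With S candidate sequencing actions and P candidate placement actions per
# step, a joint decision explores S * P combinations, while the decomposed
# two-step decision explores only S + P.
S, P = 20, 600            # illustrative action-space sizes (assumed)
joint = S * P             # single monolithic action space
decomposed = S + P        # sequence step followed by loading step
print(joint, decomposed)  # 12000 vs 620
```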
Referring to fig. 5, fig. 5 is a schematic diagram of a deep reinforcement learning network system according to a second embodiment of the present application.
The deep reinforcement learning network system is applied to intelligent boxing and comprises, in the processing order of the computing units, an embedding layer, a sorting layer and a loading layer. The embedding layer is used for mapping the boxing information of the to-be-boxed bodies into feature vectors, the sorting layer is used for ordering a plurality of unordered to-be-boxed bodies, and the loading layer is used for determining the loading position and orientation of each to-be-boxed body according to the loading sequence output by the sorting layer.
Specifically, in this embodiment, the sequence decision model is trained by a deep reinforcement learning algorithm to obtain a trained sequence decision network, so that the loading sequence of a plurality of unordered boxes to be loaded is determined by the sequence decision network; and training the loading model through a deep reinforcement learning algorithm to obtain a trained loading network, so that the loading position and the loading direction of the body to be boxed in the container are determined through the loading network.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a box embedding model according to a second embodiment of the present application, where the embedding layer includes the box embedding model and the boundary embedding model. The box embedding model is used for performing data mapping on the boxing information of a plurality of unordered to-be-boxed bodies to obtain the representation box packing of the boxing information of the unordered to-be-boxed bodies in a high-order space. The structure of the box embedding model, in the processing order of the computing units, is: a first layer normalization unit, a self-attention mechanism unit, a second layer normalization unit and a multi-layer perceptron unit. The first layer normalization unit is used for normalizing the boxing information of the to-be-boxed bodies. The self-attention mechanism unit is used for distributing weights to the plurality of unordered to-be-boxed bodies according to their sizes and the current state in the container. The second layer normalization unit is configured to normalize the output of the self-attention mechanism unit. The multi-layer perceptron unit is used for performing data mapping on the output of the second layer normalization unit.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a boundary embedding model according to a second embodiment of the present application, where the boundary embedding model is used for performing data mapping on container accommodation information to obtain a representation frontier embedding of the container accommodation information in a high-order space. The structure of the boundary embedding model is respectively as follows according to the processing sequence of the computing unit: the system comprises a first full-connection layer, a first layer normalization layer, a first activation function layer, a second full-connection layer, a second layer normalization layer and a second activation function layer. The first normalization layer is used for normalizing the accommodation information of the container. The output of the first normalization layer serves as an input to the first activation function layer. The second normalization layer is used for normalizing the output of the first activation function layer. The output of the second normalization layer serves as the input of the second activation function layer.
In one embodiment, the self-attention mechanism unit includes at least three self-attention layers.
In one embodiment, the sorting layer is provided with a sequence decision model, and the sequence decision model is used for outputting probabilities of a plurality of to-be-packaged boxes according to characterization box packing of the boxing information of the to-be-packaged boxes and characterization frontier embedding of the containing information of the containers. The probability is the probability that the container to be packaged is selected to be packaged in the container in the current state.
In one embodiment, the loading layer is provided with a loading model, and the sequence decision model is used for selecting and inputting the boxing information of one to-be-boxed body into the loading model according to the probabilities of a plurality of to-be-boxed bodies.
Specifically, in this embodiment, according to the probabilities of the plurality of to-be-packaged boxes, the sequence decision model reads the packaging information of the to-be-packaged box with the largest probability, and the packaging information of the to-be-packaged box with the largest probability is input into the loading model, so that the loading model selects the to-be-packaged box to be packaged into the container.
The sequence decision model is also used to derive a representation selected box embedding of the selected casing to be filled based on a self-attention mechanism.
Specifically, the sequence decision model adopts a network based on a self-attention mechanism, so that it outputs the representation selected box embedding of the selected to-be-boxed body. The sequence decision model adopts a pointer network, and the pointer network obtains its prediction by outputting a probability distribution: the probability the sequence decision model outputs for each to-be-boxed body is the probability that this body is selected to be loaded into the container. In this embodiment, the loading model loads the to-be-boxed body with the highest probability into the container.
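The pointer-network output described here, a probability distribution over the boxes that have not yet been selected, can be sketched as a masked softmax; the scores below are placeholders for what the attention decoder would produce:

```python
import numpy as np

def pointer_probs(scores, selected_mask):
    """Pointer-network-style output: a probability distribution over the
    remaining (unselected) boxes. Already-selected boxes are masked out.
    Illustrative sketch; `scores` stands in for the decoder's attention
    logits."""
    s = np.where(selected_mask, -np.inf, scores)  # exclude selected boxes
    e = np.exp(s - s.max())                       # stable softmax numerator
    return e / e.sum()

scores = np.array([1.2, 0.3, 0.7, 0.5])
mask = np.array([False, True, False, False])      # box 1 is already packed
p = pointer_probs(scores, mask)
# p[1] is 0, the remaining entries sum to 1, and box 0 has the highest
# probability, so it would be loaded next.
```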
The sequence decision model trains the self-attention-based pointer network using a policy gradient algorithm.
In one embodiment, the loading model is configured to output a loading position and an orientation of the selected to-be-loaded container in the container according to a representation box packing of the non-selected to-be-loaded container information, a representation selected box embedding of the selected to-be-loaded container, and a representation frontier embedding of the container accommodation information.
Specifically, after the sequence decision model obtains the probabilities of the plurality of to-be-boxed bodies, the representation selected box embedding of the selected to-be-boxed body, the representation frontier embedding of the loading information of the container, and the representation box packing of the remaining unselected to-be-boxed bodies are simultaneously input into the loading model, and the loading model finally outputs the loading position and orientation of the selected to-be-boxed body in the container. The loading position and orientation of the selected to-be-boxed body are described by probabilities, where each probability is the probability that the to-be-boxed body is loaded at one position and orientation in the container; in this embodiment, the loading network places the to-be-boxed body at the position and orientation with the highest probability. In this embodiment, the container is discretized into a grid, and for the selected to-be-boxed body the loading network outputs y = n × t probabilities along the container width direction, where n is the number of grid cells along the container width and t is the dimension of the to-be-boxed body, which may be 2-dimensional or 3-dimensional. Assuming the box is two-dimensional, after the loading network outputs the probabilities of the selected to-be-boxed body being placed along the container width direction, it determines the position of the box along the container length direction according to the loading information of the container, thereby obtaining the position and orientation of the selected to-be-boxed body in the container.
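Under the grid scheme described above, the loading network's flat output of y = n × t scores must be decoded back into an orientation and a width cell; the particular index layout below (orientation-major) is an assumption for illustration:

```python
def decode_placement(index, n):
    """Decode a flat action index into (orientation, width_cell).

    Assumes an orientation-major layout: indices 0..n-1 are orientation 0,
    n..2n-1 are orientation 1, and so on. This layout is an assumption,
    not specified by the text.
    """
    orientation, width_cell = divmod(index, n)
    return orientation, width_cell

n, t = 10, 2                   # 10 width cells, 2 orientations for a 2-D box
y = n * t                      # 20 candidate placements scored per box
assert decode_placement(13, n) == (1, 3)   # orientation 1, width cell 3
```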
After the first selected to-be-boxed body is removed and loaded into the container for the first time, the sequence decision model outputs the probabilities of the remaining unselected to-be-boxed bodies and selects a second to-be-boxed body from them, and the loading model outputs the loading position and orientation of the second selected to-be-boxed body in the container. This cycle repeats until the sequence decision model has output the loading sequence of the plurality of to-be-boxed bodies and the loading model has output their loading positions and orientations, whereupon the parameters of the deep reinforcement learning network system are updated.
Referring to fig. 9a to 9d, the sequence decision model calculates probabilities of a plurality of unordered to-be-packaged boxes, where the probabilities are probabilities that the to-be-packaged boxes are selected to be packaged in a target container. A, B, C three boxes are arranged to be filled into a target container F, A, B, C three boxes are used as a group of data for the first time and are simultaneously input into the sequence decision model, the probability of the box A is calculated to be 0.5 by the sequence decision model, the probability of the box B is calculated to be 0.2, the probability of the box C is calculated to be 0.3, the sequence decision model selects the box A to be filled into the target container F for the first time, and the loading model outputs the loading position and the loading orientation of the box A in the target container F; inputting the remaining B, C boxes into the sequence decision model again and simultaneously as a group of data for the second time, wherein the sequence decision model calculates that the probability B is 0.7 and the probability C is 0.3, the sequence decision model selects the box B for the second time to be loaded into the target container F, and the loading model outputs the loading position and the loading direction of the box B in the target container F; and the sequence decision model selects the box C for the third time to be loaded into the target container F, and the loading model outputs the loading position and the orientation of the box C in the target container F, so that the loading sequence of the sequence decision model outputting a plurality of boxes to be loaded is obtained, and the loading model outputs the loading positions and the orientations of the boxes to be loaded according to the loading sequence.
In the deep reinforcement learning network system of this embodiment, the boxing problem is decomposed into a sequence decision problem and a loading problem, and two policy networks are trained simultaneously on the same set of high-dimensional characterizations: the policy network for sequence decision regards the loading policy as fixed, and its learning goal is to find the optimal boxing sequence for the given loading policy; the loading policy network regards the sequence policy as fixed, and its learning goal is to find the optimal loading policy for the given sequence. The decomposition of the boxing problem greatly reduces the overall action space that must be searched, turning the product of the sequence action space and the loading action space into their sum, and greatly reduces the learning and search difficulty of the agent. The two decision networks (the sequence decision network and the loading network) use the same set of high-dimensional characterizations as input, which guarantees the consistency of neural network convergence. Meanwhile, since the algorithm makes full use of GPU parallelization, the problem of excessively long training time is solved.
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Referring to fig. 8, an electronic device 400 includes a memory 410 and a processor 420.
The processor 420 may be a central processing unit (Central Processing Unit, CPU), but may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
Memory 410 may include various types of storage units, such as system memory, read-only memory (ROM), and persistent storage. The ROM may store static data or instructions required by the processor 420 or other modules of the computer. The persistent storage may be a readable and writable, non-volatile storage device that does not lose stored instructions and data even after the computer is powered down. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) is employed as the persistent storage. In other embodiments, the persistent storage may be a removable storage device (e.g., a diskette or optical drive). The system memory may be a read-write memory device or a volatile read-write memory device, such as dynamic random access memory. The system memory may store instructions and data required by some or all of the processors at runtime. Furthermore, memory 410 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory); magnetic disks and/or optical disks may also be employed. In some embodiments, memory 410 may include readable and/or writable removable storage devices, such as compact discs (CDs), digital versatile discs (e.g., DVD-ROMs, dual-layer DVD-ROMs), read-only Blu-ray discs, super-density discs, flash memory cards (e.g., SD cards, mini SD cards, micro-SD cards, etc.), magnetic floppy disks, and the like. The computer-readable storage medium does not contain carrier waves or transient electronic signals transmitted wirelessly or by wire.
The memory 410 has stored thereon executable code that, when processed by the processor 420, can cause the processor 420 to perform some or all of the methods described above.
The aspects of the present application have been described in detail hereinabove with reference to the accompanying drawings. In the foregoing embodiments, the descriptions of the embodiments are focused on, and for those portions of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments. Those skilled in the art will also appreciate that the acts and modules referred to in the specification are not necessarily required in the present application. In addition, it can be understood that the steps in the method of the embodiment of the present application may be sequentially adjusted, combined and pruned according to actual needs, and the modules in the apparatus of the embodiment of the present application may be combined, divided and pruned according to actual needs.
Furthermore, the method according to the present application may also be implemented as a computer program or computer program product comprising computer program code instructions for performing part or all of the steps of the above-described method of the present application.
Alternatively, the present application may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) that, when executed by a processor of an electronic device (or electronic device, server, etc.), causes the processor to perform some or all of the steps of the above-described methods according to the present application.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the application herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The embodiments of the present application have been described above, the foregoing description is exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the improvement of technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. A deep reinforcement learning network system is applied to intelligent boxing and is characterized in that,
the deep reinforcement learning network system includes an embedded layer including:
the box embedding model is used for carrying out data mapping on the encasement information of a plurality of unordered boxes to be encased to obtain a representation box packing of the encasement information of the unordered boxes to be encased in a high-order space;
the boundary embedding model is used for carrying out data mapping on the accommodation information of a container to obtain a representation frontier embedding of the accommodation information of the container in a high-order space;
The structure of the box body embedded model is as follows according to the processing sequence of the computing unit:
the first layer normalization unit is used for normalizing the boxing information of the box to be boxed;
the self-attention mechanism unit is used for distributing weights to a plurality of unordered boxes to be packaged according to the sizes of the boxes to be packaged and the current state in the container;
a second-layer normalization unit for normalizing the output of the self-attention mechanism unit;
and the multi-layer perceptron unit is used for performing data mapping on the output of the second-layer normalization unit.
2. The deep reinforcement learning network system of claim 1, wherein,
the deep reinforcement learning network system further includes:
the sorting layer is used for sorting a plurality of unordered boxes to be packaged;
the sequencing layer is provided with a sequence decision model which is used for outputting the probabilities of a plurality of boxes to be packaged according to the representation box packing of the box packaging information of the boxes to be packaged and the representation frontier embedding of the containing information of the containers.
3. The deep reinforcement learning network system of claim 2, wherein,
The deep reinforcement learning network system further comprises a loading layer, wherein the loading layer is used for determining the loading position and the loading direction of the to-be-loaded boxes according to the loading sequence of the to-be-loaded boxes output by the sequencing layer.
4. The deep reinforcement learning network system of claim 3, wherein,
the sequence decision model is used for selecting one box to be packaged according to the probability of a plurality of boxes to be packaged;
and inputting the selected boxing information of the to-be-boxed body into a loading model of the loading layer.
5. The deep reinforcement learning network system of claim 4, wherein,
the sequence decision model is also used to derive a representation selected box embedding of the selected casing to be filled based on a self-attention mechanism.
6. The deep reinforcement learning network system of claim 5, wherein,
the loading model is used for outputting the loading position and the orientation of the selected box to be loaded in the container according to the representation box packing of the non-selected box to be loaded information, the representation selected box embedding of the selected box to be loaded and the representation frontier embedding of the containing information of the container.
7. The deep reinforcement learning network system of claim 6, wherein,
the sequence decision model is further configured to output probabilities of the unselected multiple boxes to be packed according to the representation box packing of the packing information of the unselected multiple boxes to be packed and the representation frontier embedding of the containing information of the container.
8. The deep reinforcement learning network system of claim 5, wherein,
the sequence decision model adopts a pointer network based on a self-attention mechanism.
9. The deep reinforcement learning network system of claim 6, wherein,
the sequence decision model adopts a policy gradient algorithm to train the pointer network based on the self-attention mechanism.
10. The deep reinforcement learning network system of claim 1, wherein,
the self-attention mechanism unit includes at least three self-attention layers.
CN202110218003.2A 2021-02-26 2021-02-26 Deep reinforcement learning network system Active CN112884126B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110218003.2A CN112884126B (en) 2021-02-26 2021-02-26 Deep reinforcement learning network system

Publications (2)

Publication Number Publication Date
CN112884126A CN112884126A (en) 2021-06-01
CN112884126B true CN112884126B (en) 2024-03-08

Family

ID=76054723

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110218003.2A Active CN112884126B (en) 2021-02-26 2021-02-26 Deep reinforcement learning network system

Country Status (1)

Country Link
CN (1) CN112884126B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115081936B (en) * 2022-07-21 2022-11-18 之江实验室 Method and device for scheduling observation tasks of multiple remote sensing satellites under emergency condition
CN115913989B (en) * 2022-11-08 2023-09-19 广州鲁邦通物联网科技股份有限公司 Resource protection method of cloud management platform

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109071114A (en) * 2017-09-05 2018-12-21 深圳蓝胖子机器人有限公司 Method, apparatus, and device with storage function for automatic loading and unloading of goods
KR102063562B1 (en) * 2019-10-28 2020-01-08 인천대학교 산학협력단 Graphic processing apparatus to support the creation of high quality mesh structures through the learning of pointer networks and operating method thereof
CN111062204A (en) * 2019-12-13 2020-04-24 智慧神州(北京)科技有限公司 Method and device for identifying text punctuation mark use error based on machine learning
CN111695700A (en) * 2020-06-16 2020-09-22 华东师范大学 Boxing method based on deep reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Multi-Source Pointer Network for Product Title Summarization; Fei Sun; arXiv; 1-10 *
Applications of Reinforcement Learning in Operations Research: Research Progress and Prospects; 《运筹与管理》 (Operations Research and Management Science); 2020; 227-239 *

Also Published As

Publication number Publication date
CN112884126A (en) 2021-06-01

Similar Documents

Publication Publication Date Title
US10885434B2 (en) Alternative loop limits for accessing data in multi-dimensional tensors
CN112884126B (en) Deep reinforcement learning network system
US8554702B2 (en) Framework for optimized packing of items into a container
CN110210685B (en) Logistics boxing method and device
Dkhil et al. Multi-objective optimization of the integrated problem of location assignment and straddle carrier scheduling in maritime container terminal at import
CN109324827A (en) Access preamble and end of data
CN112070444A (en) Box type recommendation method and device and computer storage medium
EP3907705A1 (en) Apparatus for determining arrangement of objects in space and method thereof
Maglić et al. Optimization of container relocation operations in port container terminals
CN114372496A (en) Hardware accelerator with analog content addressable memory (A-CAM) for decision tree computation
KR20220013896A (en) Method and apparatus for determining the neural network architecture of a processor
US8682808B2 (en) Method and apparatus for processing logistic information
CN112884410B (en) Boxing method, electronic device and storage medium
CN107203865B (en) Order distribution method and device
Alinaghian et al. A novel mathematical model for cross dock open-close vehicle routing problem with splitting
US9031890B2 (en) Inverse function method of boolean satisfiability (SAT)
Shakeri et al. An efficient incremental evaluation function for optimizing truck scheduling in a resource-constrained crossdock using metaheuristics
US20220261644A1 (en) Method and apparatus for generating task plan based on neural network
CN114612030A (en) Cabin allocation optimization method and device, computer equipment and storage medium
Yurtseven et al. Optimization of container stowage using simulated annealing and genetic algorithms
CN118153726A (en) Three-dimensional boxing method and system and electronic equipment
CN113537885B (en) Method and system for delivering package by combining truck and unmanned aerial vehicle for delivery
JP7399568B2 (en) reasoning device
Sridhar et al. Multi objective optimization of heterogeneous bin packing using adaptive genetic approach
US11928469B2 (en) Apparatus and method with neural network operation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant