CN116703104A - Material box robot order picking method and device based on decision-making big model - Google Patents

Material box robot order picking method and device based on decision-making big model

Info

Publication number
CN116703104A
CN116703104A
Authority
CN
China
Prior art keywords
target
information
robot
picking
prediction model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310722981.XA
Other languages
Chinese (zh)
Inventor
蒋大培
尚磊
付樟华
陈雅莉
王宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chinese University of Hong Kong Shenzhen
Hai Robotics Co Ltd
Original Assignee
Chinese University of Hong Kong Shenzhen
Hai Robotics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chinese University of Hong Kong Shenzhen and Hai Robotics Co Ltd
Priority to CN202310722981.XA
Publication of CN116703104A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0631Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q10/06311Scheduling, planning or task assignment for a person or group
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/27Regression, e.g. linear or logistic regression
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0631Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q10/06313Resource planning in a project environment
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0633Workflow analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/10Office automation; Time management
    • G06Q10/103Workflow collaboration or project management
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/06Buying, selling or leasing transactions
    • G06Q30/0601Electronic shopping [e-shopping]
    • G06Q30/0633Lists, e.g. purchase orders, compilation or processing
    • G06Q30/0635Processing of requisition or of purchase orders
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02PCLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • General Physics & Mathematics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Economics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Operations Research (AREA)
  • General Engineering & Computer Science (AREA)
  • Development Economics (AREA)
  • Tourism & Hospitality (AREA)
  • Game Theory and Decision Science (AREA)
  • Educational Administration (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

An embodiment of the application discloses a method and a device for picking orders with a bin robot based on a decision-making big model. The method comprises the following steps: acquiring target feature information; acquiring a pre-trained target action prediction model, wherein the target action prediction model is obtained by training a sequence model, a picking training sample of the target action prediction model comprises at least state information and real actions matched with the state information, and the state information comprises at least robot state information, picking station state information, order information and inventory information; and inputting the target robot information, the target picking station information, the target order information and the target inventory information into the target action prediction model to obtain a target action instruction for the robot. In the application, the target action prediction model outputs the target action instruction based on the input target feature information, and the robot executes the issued target action instruction to carry bins.

Description

Material box robot order picking method and device based on decision-making big model
Technical Field
The embodiments of the application relate to the technical field of order picking, and in particular to a method and a device for picking orders with a bin robot based on a decision-making big model.
Background
With the development of the logistics industry, demands on warehouse automation keep rising. Warehouse automation must meet customers' requirements for high timeliness and high safety, while also coping with growing storage pressure and the business fluctuations caused by diverse and complex sales channels. Applying bin robots in warehousing improves storage space utilization, order response efficiency, and the overall level of warehouse automation.
In an intelligent automated warehouse picking system, when a new order is received, the system needs to select, from the existing inventory, several bins that can satisfy the order requirements; a robot then delivers these bins to a picking station; finally, an operator picks from the bins the goods required by the order.
In the related art, to improve robot handling efficiency, the picking system batches orders: according to a preset rule, some orders are grouped into the same batch, the products of each batch are picked together, and the sequence of the batches is determined, so as to reduce the number of picks and the walking distance during picking and thereby improve picking efficiency.
However, when a consolidated batch arrives at the picking station, an additional order-splitting operation is often required, which increases labor and site costs. Moreover, a poor order-batching strategy may also reduce order picking efficiency.
Disclosure of Invention
The present application aims to solve at least one of the technical problems in the prior art. The application therefore provides a bin robot order picking method and device based on a decision-making big model, which can ensure order picking efficiency.
In a first aspect, an embodiment of the present application provides a method for picking orders with a bin robot based on a decision-making big model, comprising: acquiring target feature information, wherein the target feature information comprises at least target robot information, target picking station information, target order information and target inventory information; acquiring a pre-trained target action prediction model, wherein the target action prediction model is obtained by training a sequence model, a picking training sample of the target action prediction model comprises at least state information and real actions matched with the state information, and the state information comprises at least robot state information, picking station state information, order information and inventory information; and inputting the target robot information, the target picking station information, the target order information and the target inventory information into the target action prediction model to obtain a target action instruction for the robot.
According to some embodiments of the invention, the target robot information includes robot state information and robot historical action information.
According to some embodiments of the invention, the training step of the target action prediction model comprises: acquiring a sequence model to be trained; acquiring a picking training sample, wherein the picking training sample comprises at least state information and real actions matched with the state information, and the state information comprises robot state information, picking station state information, order information and inventory information; and inputting the picking training sample into the sequence model to be trained and, through autoregressive learning, obtaining the target action prediction model when the predicted actions obtained for all picking training samples are better than or equal to the corresponding real actions.
According to some embodiments of the present invention, inputting the picking training samples into the sequence model to be trained and obtaining the target action prediction model through autoregressive learning comprises: time-position coding the picking training samples (S, A, R, T) to obtain a continuous sequence τ = (S_1, A_1, R_1; S_2, A_2, R_2; …; S_K, A_K, R_K), where τ denotes a sequence of K steps, S in each step denotes state information, A denotes the real action corresponding to S, R denotes the real order-return quantity corresponding to A, and T denotes time and position; and inputting the continuous sequence into the sequence model to be trained and, through autoregressive learning, obtaining the target action prediction model when the predicted actions obtained for all picking training samples are better than or equal to the corresponding real actions.
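The time-position coding described above, which interleaves states, actions and returns into one continuous sequence with a step index, can be sketched as follows. This is an illustrative reconstruction only, not the patent's implementation; the names `Step` and `encode_trajectory` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Step:
    state: List[float]   # state information S (robot, picking station, order, inventory)
    action: int          # real action A matched with S
    reward: float        # real order-return quantity R corresponding to A

def encode_trajectory(steps: List[Step]) -> List[Tuple]:
    """Interleave (S_t, A_t, R_t) triples with a time-position index t, producing
    the continuous sequence tau = (S_1, A_1, R_1; ...; S_K, A_K, R_K)."""
    sequence = []
    for t, step in enumerate(steps, start=1):
        sequence.append(("S", t, step.state))
        sequence.append(("A", t, step.action))
        sequence.append(("R", t, step.reward))
    return sequence

# Two picking steps become a sequence of 2 * 3 = 6 position-tagged tokens.
tau = encode_trajectory([Step([0.1, 0.2], 3, 1.0), Step([0.3, 0.4], 1, 0.5)])
```

A sequence model trained autoregressively on such sequences predicts the next action token conditioned on everything that came before it.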
According to some embodiments of the invention, acquiring the sequence model to be trained includes: obtaining the previous-generation target action prediction model and using it as the sequence model to be trained.
According to some embodiments of the invention, the picking training sample comprises at least historical state information that has not participated in training; or, the picking training sample comprises at least historical state information that has not participated in training together with state information obtained by random sampling.
According to some embodiments of the invention, after obtaining the target action prediction model, the method comprises: acquiring a picking test sample, wherein the picking test sample comprises at least state information and real actions matched with the state information, and the state information comprises robot state information, picking station state information, order information and inventory information; inputting the picking test sample into the target action prediction model to obtain target predicted actions; and judging, based on a comparison between the target predicted actions and the real actions, whether the target action prediction model meets the release requirements, and if not, retraining to acquire a new target action prediction model.
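The release check above can be sketched as a simple comparison over a held-out picking test sample. The match criterion and the 0.95 threshold are assumptions for illustration; the patent only states that the judgment is based on comparing predicted and real actions.

```python
from typing import Sequence

def meets_release_requirement(predicted: Sequence, real: Sequence,
                              threshold: float = 0.95) -> bool:
    """Compare target predicted actions with the real actions of a picking test
    sample; the model is releasable if the match rate reaches the threshold
    (threshold value is an assumed example, not from the patent)."""
    if not real:
        return False
    matches = sum(1 for p, r in zip(predicted, real) if p == r)
    return matches / len(real) >= threshold

# A model failing this check would be retrained to obtain a new one.
```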
According to some embodiments of the invention, after inputting the target robot information, the target picking station information, the target order information and the target inventory information into the target action prediction model and obtaining the target action instruction for the robot, the method comprises: sending the target action instruction to the robot so that the robot carries a bin according to the target action instruction, wherein the robot receives only one action instruction at a time, and a plurality of action instructions form one work task of the robot.
According to some embodiments of the invention, after inputting the target robot information, the target picking station information, the target order information and the target inventory information into the target action prediction model and obtaining the target action instruction for the robot, the method comprises: determining, based on current environment information, whether to re-predict the target action instruction, wherein the current environment information comprises at least current robot information, current inventory information and current picking station information; and, if the target action instruction is to be re-predicted, re-inputting the target order information corresponding to the target action instruction into the target action prediction model to obtain a new target action instruction.
In a second aspect, an embodiment of the present application provides an apparatus, including a training unit and an action issuing unit, where:
the training unit comprises a storage module and a training module, wherein the storage module is used for storing state information and the matched real actions, the state information comprising robot state information, picking station state information, order information and inventory information; and the training module uses at least part of the data in the storage module as picking training samples to train and obtain the target action prediction model;
the action issuing unit comprises a prediction module and an issuing module, wherein the prediction module is used for calling the target action prediction model and acquiring target feature information, the target feature information comprising at least target robot information, target picking station information, target order information and target inventory information; the prediction module inputs the target robot information, the target picking station information, the target order information and the target inventory information into the target action prediction model to obtain a target action instruction; and the issuing module is used for issuing the target action instruction to the robot.
According to some embodiments of the application, the storage module comprises: a historical action set module for collecting historical real actions of the robot that have not participated in training, together with the state information corresponding to those real actions, as picking training samples; and a random sampling action set module for randomly acquiring state information and the correspondingly generated real actions as picking training samples.
According to some embodiments of the application, the action issuing unit further comprises a monitoring module that determines, based on current environment information, whether to re-predict the target action instruction, wherein the current environment information comprises at least current robot information, current inventory information and current picking station information; if the target action instruction is to be re-predicted, the monitoring module re-inputs the order information corresponding to the target action instruction into the target action prediction model to acquire a new target action instruction.
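The monitoring module's decision could look like the sketch below: re-predict when the tracked environment has drifted since the instruction was issued. The drift criterion (any change in the three tracked fields) is an assumption; the patent does not specify how the comparison is made.

```python
def should_repredict(env_at_issue: dict, env_now: dict) -> bool:
    """Monitoring-module sketch: trigger re-prediction of the target action
    instruction if any tracked part of the environment (robot, inventory,
    picking station) has changed since the instruction was issued."""
    tracked = ("robot", "inventory", "picking_station")
    return any(env_at_issue.get(k) != env_now.get(k) for k in tracked)
```

When this returns True, the order information behind the instruction is fed back into the prediction model to obtain a new instruction.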
As can be seen from the above technical solutions, the embodiments of the present application have the following advantages: the application uses state information and the corresponding real actions as picking training samples to train and obtain a target action prediction model, which is obtained by training a sequence model. When a new order is received, the picking system inputs the target robot information, target picking station information, target order information and target inventory information into the target action prediction model, the target order information including the information of the new order; the model outputs a target action instruction, which is issued to a group of robots; the robots carry the bins that meet the order requirements from storage to the same picking station, where staff pick the goods required by the new order from each bin.
Therefore, unlike the prior art, in which consolidated orders must be split again at the picking station to preserve the efficiency of bin selection by the robots, the application uses the target action prediction model to output target action instructions based on the input target feature information; the robots carry bins automatically according to the issued action instructions, the bin handling efficiency is higher, and labor costs are saved.
In addition, during training the target action prediction model can inherit historically better actions, such as robot collision avoidance and optimal path planning, so that a robot can better adapt to its environment and adjust its actions in time while moving.
Drawings
The application is further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a flow chart of a bin robotic order picking method in an embodiment of the application;
FIG. 2 is a schematic diagram of the training process of the target action prediction model in an embodiment of the application;
FIG. 3 is a schematic diagram of the detection flow of the target action prediction model in an embodiment of the application;
FIG. 4 is a schematic diagram of the iterative process of the target action prediction model in an embodiment of the application;
FIG. 5 is a schematic diagram of the structure of the device in an embodiment of the application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Referring to FIG. 1, a flow diagram of a bin robot order picking method based on a decision-making big model is provided in an embodiment of the present application. The method includes steps S101 to S104. Wherein:
S101, acquiring target feature information, wherein the target feature information comprises at least target robot information, target picking station information, target order information and target inventory information.
The target robot information may be understood as current and past change information of the robot, including: the carrying capacity of the robot, the boxing capacity of the robot, the current and past positions of the robot, the action currently being performed by the robot, actions the robot has completed in the past, the current and past working states of the robot, and the like. The target picking station information may be understood as current and past change information of the picking station: the working mode of the picking station, current and past working-state changes, current and past available-slot changes, order information currently and previously executed at the picking station, and the like. The target order information may be understood as current and past order information, including orders currently generated or generated in the past, and orders currently being executed or completed in the past. The target inventory information may be understood as current and past inventory information, including the current and past position trajectories of the bins, and the current and past changes in SKU type and SKU quantity in each bin.
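The four groups of target feature information enumerated above can be pictured as one input record for the prediction model. The structure below is a hypothetical sketch for illustration; the field names and grouping are assumptions, not the patent's data format.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class TargetFeatures:
    """The four groups of target feature information fed to the model."""
    robot: Dict[str, object] = field(default_factory=dict)            # capacities, positions, actions, states
    picking_station: Dict[str, object] = field(default_factory=dict)  # mode, state, available slots, executed orders
    orders: List[dict] = field(default_factory=list)                  # current and past order information
    inventory: Dict[str, object] = field(default_factory=dict)        # bin positions, SKU types and quantities

def as_model_input(f: TargetFeatures) -> dict:
    """Flatten the four feature groups into one record for the prediction model."""
    return {"robot": f.robot, "picking_station": f.picking_station,
            "orders": f.orders, "inventory": f.inventory}
```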
S102, acquiring a pre-trained target action prediction model, wherein the target action prediction model is obtained by training a sequence model, a picking training sample of the target action prediction model comprises at least state information and real actions matched with the state information, and the state information comprises at least robot state information, picking station state information, order information and inventory information.
The robot state information may be understood as robot data change information, including: robot carrying capacity, robot boxing capacity, robot position, actions being performed or completed by the robot, robot working state, and the like. Picking station state information may be understood as picking station data change information, including: picking station working mode, working state, available-slot information, order execution information, and the like. Order information may be understood as order data change information, including order requirements, order completion status, and the like. Inventory information may be understood as inventory data change information, including bin position, bin SKU type, and bin SKU quantity. The real actions are each matched with corresponding state information; in other words, each item of state information gives rise to a corresponding real action.
It will be appreciated that, as stated above, the picking training samples contain various kinds of picking-related feature information. By using these types of information as training samples, the trained target action prediction model accumulates rich experience, so that after a target action instruction output by the model is issued to a robot, the robot can complete bin handling or other work flexibly and efficiently.
Secondly, the target action prediction model is obtained by training a sequence model. Specifically, the robot's historical trajectory of actions is a sequence with a time attribute; considering only the optimal action at a single time node yields a locally optimal action at the current moment, and accumulating multiple locally optimal actions does not produce a globally better solution. The application uses a sequence model to consider the whole sequence, flexibly capturing global and local relations, and can thus obtain globally better target action instructions.
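The global-versus-local point above can be illustrated by conditioning the action choice on the entire encoded history rather than only the latest state. The sketch below is purely illustrative: `score` stands in for the trained sequence model, and all names are hypothetical.

```python
def predict_next_action(score, history, candidate_actions):
    """Choose the action whose score, conditioned on the FULL history sequence,
    is highest -- the whole trajectory is considered, not only the last state."""
    return max(candidate_actions, key=lambda a: score(history, a))

# Toy scoring function standing in for the trained sequence model: it reads
# the whole history (here, just its length) when ranking candidate actions.
toy_score = lambda history, action: action + len(history)
```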
It should be noted that step S102 is not limited to be completed after step S101, and step S102 may be completed simultaneously with step S101, or step S102 may be completed before step S101.
S103, inputting the target robot information, the target picking station information, the target order information and the target inventory information into the target action prediction model to obtain a target action instruction for the robot.
In a specific implementation, when a new order is received, the picking system inputs the target robot information, target picking station information, target order information and target inventory information (or the corresponding data change information) into the target action prediction model, and the target action prediction model outputs a target action instruction.
The target action instruction is issued to a group of robots and comprises at least the following decisions: an inventory allocation decision, a robot allocation decision, a picking station allocation decision and a path planning decision. The inventory allocation decision determines which bins in the warehouse are selected to meet the order requirements; the robot allocation decision determines how target instructions are issued to the individual robots of the group; the picking station allocation decision determines at which picking station the order is picked; and the path planning decision determines how a robot walks to carry bins from the warehouse to the picking station, or from the picking station back to the warehouse, to complete a job.
It should be noted that, besides the inventory allocation, robot allocation, picking station allocation and path planning decisions, the target action instruction may also include other instructions, such as a robot charging instruction or a robot stopping instruction.
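A target action instruction bundling the four decisions named above might be structured as follows. This is a minimal hypothetical sketch; the field names and types are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class ActionInstruction:
    """One target action instruction bundling the four decisions."""
    bin_ids: List[str]           # inventory allocation: which bins meet the order
    robot_id: str                # robot allocation: which robot executes this step
    picking_station_id: str      # picking station allocation: where the order is picked
    path: List[Tuple[int, int]]  # path planning: waypoints from shelf to station
    extra: Optional[str] = None  # other instructions, e.g. "charge" or "stop"

instr = ActionInstruction(["bin-07"], "robot-3", "station-A", [(0, 0), (4, 2)])
```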
S104, sending the target action instruction to the robot to control the robot to carry the bin according to the target action instruction, wherein the robot receives only one action instruction at a time, and a plurality of action instructions form one work task of the robot.
The robot completes a work task under the sequential control of action instructions. A work task may be understood as a single transfer of goods by the robot, for example carrying a bin from the warehouse to a picking station, or carrying a bin from a picking station back to the warehouse. More specifically, the robot accepts only one action instruction at a time; after completing one action instruction, the robot acquires and executes the next, and by continuously completing a plurality of action instructions the robot completes one work task. While performing a work task, the robot may carry one or more bins simultaneously, and an action instruction may cover a period of action or an instantaneous action.
In a specific implementation, when target action instructions are distributed to the robots, a new action instruction may be sent to an idle robot, which then completes the bin handling according to it; an action instruction being executed by a robot may be modified to change the loading position, unloading position or assigned order of its work task; an action instruction being executed may be cancelled; and other action instructions are possible, which the application does not enumerate one by one.
It can be understood that the picking method of the application adopts single-step decision making: the robot receives only one instruction of a work task at a time. This makes the decision process more flexible; when the state changes significantly, the robot's next action can be adjusted at any time, improving overall working efficiency. For example, a robot may be suddenly redirected to a new bin while carrying one; a carried bin may be reassigned to an urgently inserted order before unloading; or, if a picking station cannot work in time due to task backlog or faults, the robot may be redirected to another picking station for the picking operation before unloading.
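The single-step decision loop described above can be sketched as follows: the robot executes exactly one action instruction at a time, and the instruction source is free to re-plan between steps. All names are hypothetical.

```python
def run_work_task(execute, next_instruction):
    """Single-step decision loop: the robot accepts one action instruction at a
    time; the source may re-plan between steps, and a None instruction ends the
    work task (e.g. the bin has been returned to the warehouse)."""
    completed = []
    while True:
        instruction = next_instruction()
        if instruction is None:
            break
        execute(instruction)          # the robot performs exactly this one action
        completed.append(instruction)
    return completed
```

Because each step re-queries the source, a redirected bin, a changed unloading position or a reassigned picking station simply shows up as a different next instruction.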
In summary, in steps S101 to S104, the present application trains a target action prediction model using state information and the corresponding real actions as picking training samples; the target action prediction model is obtained by training a sequence model. When a new order is received, the picking system inputs the target robot information, target picking station information, target order information, and target inventory information into the target action prediction model; the target order information includes the information of the new order. The model outputs target action instructions, which are issued to a group of robots; the robots carry bins that satisfy the order from storage to the same picking station, where staff pick the goods required by the new order from each bin.
Therefore, compared with the prior art, in which an additional order-splitting operation must be performed on combined orders to ensure the efficiency with which robots fetch bins, the present application uses the target action prediction model to output target action instructions from the input target feature information; the robots carry bins automatically according to the issued instructions, so bin-handling efficiency is higher and labor cost is saved.
In addition, during training, the target action prediction model can inherit historically good behaviors, such as robot collision avoidance and optimal path planning, so that the robots adapt better to the environment and adjust their actions in time while moving.
In some embodiments, the target robot information includes robot state information and robot historical motion information.
Robot state information may be understood as current and recent information about the robot, including its carrying capacity, its bin-holding capacity, its current position, the action it is currently performing, its current working state, and the like. Robot historical motion information may be understood as information about the robot's past motions, that is, actions the robot has performed and its past working states; more specifically, actions the robot has already completed constitute its historical motion information.
It can be understood that when the target action prediction model outputs a target action instruction based on the target feature information, the inclusion of the robot's historical motion information in the target robot information allows the model to infer the robot's next target action instruction from one or more preceding consecutive historical motions, which better ensures the accuracy of the target action instruction. For example, suppose a sequence model is used to translate the Chinese sentence "我是中国人" ("I am Chinese"). Split into words it becomes ['I', 'am', 'Chinese']; without the order, the model cannot tell "I am Chinese" from "Chinese is me". Similarly, if the robot sees only the current action "move one meter to the left", the model may not know what the robot needs to do; but given the ordered sequence "move 1 m left -> lift bin A -> move 1 m right -> move 2 m forward -> put bin A down", the robot's whole motion trajectory is clear. Through such sequence modeling, the target action prediction model can output better target action instructions and control the robot to transport bins efficiently.
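The role of ordering can be illustrated with a toy sketch (the action names and grid coordinates here are hypothetical illustrations, not the patent's actual instruction format): replaying the same actions in sequence yields a well-defined trajectory, which is exactly the information an unordered set of actions loses.

```python
def replay(actions, start=(0, 0)):
    """Replay an ordered action sequence on a 2-D grid; non-move actions
    (e.g. lifting or dropping a bin) leave the position unchanged."""
    moves = {"left": (-1, 0), "right": (1, 0), "forward": (0, 1)}
    x, y = start
    path = [start]
    for act in actions:
        dx, dy = moves.get(act, (0, 0))
        x, y = x + dx, y + dy
        path.append((x, y))
    return path

# The ordered sequence from the text: the final position is well defined
# only because the order of the actions is preserved.
seq = ["left", "lift_bin_A", "right", "forward", "forward", "drop_bin_A"]
print(replay(seq)[-1])  # -> (0, 2)
```

Permuting the same actions would generally produce a different path, which is why the model is fed whole ordered sequences rather than isolated actions.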
In one possible embodiment, referring to fig. 1 and 2, the training step of the target motion prediction model includes:
S201, acquiring a sequence model to be trained.
S202, acquiring a picking training sample, wherein the picking training sample at least comprises state information and real actions matched with the state information, and the state information comprises robot state information, picking platform state information, order information and inventory information.
Step S201 need not be completed before step S202; the two steps may be performed simultaneously, or step S201 may be completed after step S202.
S203, inputting the picking training samples into the sequence model to be trained and, through autoregressive learning, obtaining the target action prediction model when all picking training samples yield predicted actions that are better than or equal to the corresponding real actions.
Here, "the output predicted action is better than or equal to the real action" means that the return of the output predicted action is greater than or equal to the return of the real action; the return may be understood as the number of orders completed by the corresponding action, so a larger return means more completed orders.
In a specific implementation, the picking training samples may be input into the sequence model to be trained in batches, that is, multiple sets of data are input into the sequence model to be trained, and each set of picking training samples, after being fed into the model, yields a set of predicted robot actions. The output predicted actions are compared with the corresponding real actions by comparing the order returns under the different actions. For example, if the order return obtained by an output predicted action is much smaller than that of the corresponding real action, the sequence model to be trained is not yet satisfactory and must be trained by back-propagation; training is repeated until the order return obtainable by the robot's predicted actions is no lower than the order return of the real actions in the input data, at which point training of the sequence model is complete and the target action prediction model is obtained.
More specifically, the picking system has a training module 150 and a prediction module 130; the training module 150 traverses the picking training samples and trains on them one by one. Each sample input into the prediction module 130 yields a predicted action; the mean square error between the predicted action and the input real action is computed and back-propagated, training the model until the target action prediction model is obtained.
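The train-until-the-predicted-return-matches loop described above can be sketched as follows (the callback names `predict`, `update`, and `reward_of` are placeholders for illustration, not the APIs of modules 130/150):

```python
def train_until_match(samples, predict, update, reward_of, max_epochs=100):
    """Repeat training passes until every predicted action earns a return
    at least equal to the return of the matched real action."""
    for _ in range(max_epochs):
        all_ok = True
        for state, real_action in samples:
            predicted = predict(state)
            if reward_of(predicted) < reward_of(real_action):
                # analogous to back-propagating the error toward the real action
                update(state, real_action)
                all_ok = False
        if all_ok:
            return True  # stopping criterion met: predictions match or beat real returns
    return False
```

A trivial tabular `predict`/`update` pair is enough to exercise the loop; in the patent the role of `update` is played by back-propagation through the sequence model.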
Further, step S203 includes the steps of:
Time-position coding is performed on the picking training samples (S, A, R, T) to obtain a continuous sequence:

τ = (S_1, A_1, R_1; S_2, A_2, R_2; …; S_K, A_K, R_K);

wherein τ denotes the sequence of K consecutive triples; in each triple, S denotes the state information, A denotes the real action corresponding to S, R denotes the real order return corresponding to A, and T denotes time and position.
And inputting the continuous sequence into a sequence model to be trained, and obtaining a target action prediction model when the predicted actions obtained by training all the picking training samples are better than or equal to the corresponding real actions through autoregressive learning training.
Specifically, the prediction module 130 encodes the picking training samples: each return-to-go, state, and action is passed through a neural network for its modality to extract a corresponding linear embedding, and a time-position code is added to each embedding, yielding a series of tokens with time-position codes. These encoded tokens are then input into the sequence model to be trained (such as a GPT-style causal transformer), which predicts actions autoregressively using a causal self-attention mask and finally returns the predicted actions. Through this autoregressive training, when the output predicted actions are better than or equivalent to the real actions, the target action prediction model is obtained.
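Before the embeddings are computed, the trajectory must be interleaved into one ordered token stream with a timestep attached to each token; a minimal sketch of that interleaving is below (the tuple layout and tags "R"/"S"/"A" are assumptions for illustration, standing in for the return, state, and action embedding slots):

```python
def build_token_sequence(trajectory):
    """Interleave (state, action, return) triples into one ordered token
    stream, tagging each token with its timestep as the time-position code."""
    tokens = []
    for t, (state, action, ret) in enumerate(trajectory, start=1):
        tokens.append(("R", t, ret))     # return token for step t
        tokens.append(("S", t, state))   # state token for step t
        tokens.append(("A", t, action))  # action token for step t
    return tokens
```

A causal transformer consuming this stream can then attend only to tokens at earlier positions when predicting each "A" token, which is what the causal self-attention mask enforces.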
In some embodiments, to continuously upgrade the target action prediction model, newly collected picking training samples are used when updating it: the previous-generation target action prediction model serves as the sequence model to be trained, and the newly acquired picking training samples are input into it, so that the previous-generation model is trained into a new-generation target action prediction model. Referring to fig. 4, for example, during training of the first-generation target action prediction model, the corresponding data set is used as the first-generation picking training samples, which are input into the original sequence model to obtain the first-generation model; during training of the second-generation model, the corresponding data in the data set are used as the second-generation picking training samples, which are input into the first-generation model to obtain the second-generation model; and by continually updating the target action prediction model in this way, the N-th generation target action prediction model is obtained.
It can be appreciated that, in the scheme of the present application, the target action prediction model is continuously and iteratively updated and keeps learning from experience. As the model learns new, good experience, its environmental applicability increases, so the iteratively updated model can infer more reasonable target action instructions and ensure that robots quickly transport the bins corresponding to orders. For example, when new goods are added to the warehouse and a new order arrives, the target action prediction model can be iteratively updated so that the newly generated model learns the new experience and outputs better target action instructions for the order.
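The generational update loop just described can be sketched as follows (the function names and the idea of keeping a `lineage` list are illustrative assumptions; the patent only specifies that each generation is trained starting from the previous one):

```python
def train_generations(initial_model, datasets, train):
    """Each generation starts from the previous generation's model and is
    trained on that generation's freshly selected picking samples."""
    model = initial_model
    lineage = [model]
    for samples in datasets:  # generation 1..N picking training samples
        model = train(model, samples)
        lineage.append(model)
    return model, lineage
```

With a toy `train` that just accumulates sample sums, two generations starting from model 0 produce the lineage [0, 3, 6] for datasets [[1, 2], [3]].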
In some embodiments, the picking training samples include at least historical state information that has not participated in training, together with randomly sampled state information.
Historical state information that has not participated in training means that the same picking training sample cannot serve as a training sample for two generations of the target action prediction model. Randomly sampled state information means state information selected at random, with a real action generated correspondingly from the randomly selected state.
Specifically, in the iterative updating of the target action prediction model, starting from the second-generation update, when picking training samples are selected, not only must typical picking training samples be chosen from the data set, but samples that have already participated in training an earlier generation of the model must be carefully avoided, so that the selected samples more readily train a target action prediction model with rich experience.
If, when selecting training samples, the picking system chooses only typical samples that have not participated in training, the trained target action prediction model easily falls into some local optimum. For this reason, random state information may also be selected; the randomly selected state information is paired with a correspondingly generated real action. For example, the randomly selected state information may be input into the target action prediction model, and the model's output used as the real action. The randomly selected state information and the generated real actions are then used as picking training samples: they are input into the sequence model to be trained, which generates corresponding predicted actions, and the sequence model is trained by comparing the predicted actions with the real actions, yielding the target action prediction model. It can be understood that because the picking system can randomly select state information, generate the corresponding real actions, and use them as picking training samples, it has exploration capability; the resulting target action prediction model therefore covers a wider range of situations, improving its environmental applicability.
For example, training only on historical experience data may lead to a kind of solidified thinking; adding some randomness, i.e., some unexplored states, expands the solution space. Suppose we train a model F(x): if, according to historical experience data, samples of x fall with high probability in [0, 10], the trained model may only be a local optimum within [0, 10]. If random numbers are used instead of relying on the actual sampling probability, x may fall outside [0, 10], say x = 100; the exploration space then expands from [0, 10] to [0, 100], and the model's predictions over [0, 100] are better. In the bin robot order picking method, this corresponds to a bin in some remote corner: according to previous experience data, an action transporting that bin might never be generated, but if such an action is generated at random, a better solution may emerge in the overall scheme evaluation, and the prediction effect of the trained target action prediction model is also better.
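A simple way to mix historical samples with randomly drawn, unexplored states, in the spirit of the example above, is sketched below (the function name and `explore_ratio` parameter are hypothetical, introduced only for illustration):

```python
import random

def sample_training_states(history, state_space, explore_ratio=0.1, rng=None):
    """Combine historical states with a small share of randomly drawn states,
    so the training set also covers regions absent from past experience."""
    rng = rng or random.Random(0)  # seeded here only for reproducibility
    n_random = max(1, int(len(history) * explore_ratio))
    explored = [rng.choice(state_space) for _ in range(n_random)]
    return list(history) + explored
```

Here `state_space` stands for the pool of possible states (e.g. including bins in remote corners); even a 10% admixture of random states widens the model's coverage beyond the historical distribution.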
In some embodiments, due to an abnormality or other problem in at least one of the picking station, the robot, or the warehouse, the robot may fail to execute the command normally after the target action instruction output by the target action prediction model has been assigned to it. For this reason, after the target action instruction is obtained, the method further includes:
Determining whether to re-predict the target action instruction based on the current environmental information; the current environment information at least comprises current robot information, current inventory information and current picking station information.
And if the target action instruction is predicted again, re-inputting target order information corresponding to the target action instruction into a target action prediction model, and obtaining a new target action instruction.
Current robot information may be understood as robot state information, for example, whether the robot is operating normally. Current inventory information may be understood as storage information of the warehouse, for example, whether new containers have been added to the warehouse. Current picking station information may be understood as picking station state information, for example, whether a picking station is operating normally.
In a specific implementation, when the picking system accepts a new order, the information corresponding to the new order is input into the target action prediction model together with the other corresponding information, such as the current robot information, current picking station information, and current inventory information. Meanwhile, the picking system is further provided with a monitoring module, which acquires the input environment information, i.e., the current robot information, current inventory information, and current picking station information, and judges whether a dispatched action instruction still meets the current application requirements. If it does not, the order information corresponding to the action instruction already assigned to the robot must be input into the target action prediction model again; the model outputs a new target action instruction for that order, which is assigned to the corresponding robot again. If the old and new action instructions are the same, the robot keeps executing the previous instruction; if they differ, the robot stops the previous instruction and executes the new one. Of course, it is also possible that the robot's previous instruction is stopped without a new one being assigned.
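The monitoring module's decision logic described above can be sketched as follows (the callback names `still_valid`, `predict`, and `order_of` are illustrative assumptions, not the patent's monitoring module API):

```python
def monitor_and_redispatch(instruction, env_info, still_valid, predict, order_of):
    """If the current environment no longer satisfies a dispatched instruction,
    re-run the prediction model on the same order; otherwise keep it."""
    if still_valid(instruction, env_info):
        return instruction  # robot keeps executing the previous instruction
    # Re-input the order behind the stale instruction; the new prediction may
    # equal the old instruction, in which case the robot simply continues.
    return predict(order_of(instruction), env_info)
```

In practice `env_info` would bundle the current robot, inventory, and picking station information, and `predict` would wrap a call to the target action prediction model.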
In some embodiments, referring to fig. 1 and 3, after the target action prediction model is obtained in step S103, it needs to be tested; only after the test is passed is the model allowed to be published, and the published model can then be used in step S104. If the target action prediction model fails the test, picking training samples must be re-acquired and a new target action prediction model obtained by training. The test specifically includes the following steps:
s301, acquiring a picking test sample, wherein the picking test sample at least comprises state information and real actions matched with the state information, and the state information comprises robot state information, picking station state information, order information and inventory information.
When the picking test sample is selected, it should differ from the picking training samples, so that after being input it can better test the target action prediction model.
S302, inputting the picking test sample into a target action prediction model to obtain a target prediction action;
S303, judging whether the target action prediction model meets the publishing requirements based on the comparison between the target predicted action and the real action, and retraining to obtain the target action prediction model if it does not meet the publishing requirements.
In practical application, in step S302, when invoking the target action prediction model for action prediction and testing, the test module 140 of the picking system needs to manually set a target return, i.e., the number of orders to be completed within K steps; it may generally be set slightly higher than the best recorded return (highest order completion efficiency) in the historical data set. With the target return as R̂_1 and the initial state as S_1, we input (R̂_1, S_1) into the prediction main module and obtain the first predicted action Â_1. After this action is performed, the environment feeds back the next state S_2 and an immediate reward r. Subtracting the immediate reward r (i.e., the number of completed orders) from R̂_1 gives the target return for the next moment, R̂_2 (i.e., the number of orders that subsequent actions still need to complete). The accumulated data (R̂_1, S_1, Â_1, R̂_2, S_2) are again input into the prediction main module to obtain the next predicted action Â_2, and so on. Finally, a group of predicted action sequences (Â_1, Â_2, …) is obtained.
The states (S_1, S_2, …) in the picking test sample correspond to a set of real action sequences (A_1, A_2, …). The target predicted action sequence (Â_1, Â_2, …) is compared with the real action sequence (A_1, A_2, …); if the real action sequence (A_1, A_2, …) is significantly better than the target predicted action sequence (Â_1, Â_2, …), the trained target action prediction model does not meet the publishing requirement, and a new generation of the target action prediction model must be obtained by retraining.
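The test rollout of steps S302–S303, where the remaining target return is decremented by each immediate reward, can be sketched as follows (the function names and the toy environment in the test are assumptions; `model` stands in for the prediction main module):

```python
def rollout(model, env_reset, env_step, target_return, k_steps):
    """Evaluation rollout: feed (remaining target return, state) prefixes to
    the model, and subtract each immediate reward (completed orders) from
    the target return before the next step."""
    remaining, state = target_return, env_reset()
    history, actions = [], []
    for _ in range(k_steps):
        history.append((remaining, state))
        action = model(history)  # predicted from the whole (R, S) prefix
        actions.append(action)
        state, reward = env_step(action)
        remaining -= reward      # returns-to-go for the next step
    return actions
```

The resulting action sequence plays the role of (Â_1, Â_2, …) above and is then compared against the real sequence (A_1, A_2, …) from the test sample.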
In a second aspect, the present invention discloses an apparatus, comprising a training unit 100 and an action issuing unit 200, wherein:
the training unit 100 includes a storage module 110 and a training module 120, wherein the storage module 110 is used for storing state information and matched real actions, and the state information includes robot state information, picking station state information, order information and inventory information; the training module 120 takes at least part of the data in the storage module 110 as a picking training sample to train and obtain the target action prediction model.
The action issuing unit 200 includes a prediction module 210 and an issuing module 230, the prediction module 210 is configured to call a target action prediction model and obtain target feature information, the target feature information at least includes target robot information, target sorting table information, target order information and target inventory information, the prediction module 210 inputs the target robot information, the target sorting table information, the target order information and the target inventory information into the target action prediction model to obtain a target action instruction, and the issuing module 230 is configured to issue the target action instruction to the robot.
By providing the storage module 110 and the training module 120, the state information and the corresponding real actions are used as picking training samples to train the target action prediction model, which is obtained by training a sequence model. When a new order is received, the picking system inputs the target robot information, target picking station information, target order information, and target inventory information into the target action prediction model; the target order information includes the information of the new order. The model outputs target action instructions, which are issued to a group of robots; the robots carry bins that satisfy the order from storage to the same picking station, where staff pick the goods required by the new order from each bin. Therefore, compared with the prior art, in which an additional order-splitting operation is performed after orders are combined, the efficiency with which robots fetch bins is ensured: the target action prediction model outputs target action instructions from the input target feature information, the robots carry bins automatically according to the issued instructions, bin-handling efficiency is higher, and labor cost is saved. In addition, during training, the target action prediction model can inherit historically good behaviors, such as collision avoidance and optimal path planning, so that the robots adapt well to the environment and adjust their actions in time while moving.
In some embodiments, the storage module 110 includes a historical action set module 111 and a random sampling action set module 112, where the historical action set module 111 is configured to collect historical real actions of the robot that do not participate in training and status information corresponding to the real actions, and one part of the historical real actions can be used as a picking training sample, and the other part of the historical real actions can be used as a picking test sample; the random sampling action set module 112 is configured to randomly acquire status information and correspondingly generate real actions to be used as a picking training sample.
In some embodiments, the action issuing unit 200 further includes a monitoring module 220, the monitoring module 220 determining whether to re-predict the target action instruction based on current environmental information, wherein the current environmental information includes at least current robot information, current inventory information, and current pick-station information; if the target action command is predicted again, the monitoring module 220 inputs the order information corresponding to the target action command into the target action prediction model again, and obtains a new target action command.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application may be embodied essentially, or the part contributing to the prior art, or all or part of the technical solution, in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory, a random access memory, a magnetic disk, or an optical disk.

Claims (12)

1. A bin robot order picking method based on a decision-making large model, comprising:
acquiring target feature information, wherein the target feature information at least comprises target robot information, target sorting table information, target order information and target inventory information;
acquiring a target action prediction model which is trained in advance, wherein the target action prediction model is obtained by training a sequence model, a picking training sample of the target action prediction model at least comprises state information and real actions matched with the state information, and the state information at least comprises robot state information, picking platform state information, order information and inventory information;
and inputting the target robot information, the target picking station information, the target order information and the target inventory information into the target action prediction model to obtain a target action instruction of the robot.
2. The method of bin robot order picking based on decision-making large models of claim 1, wherein the target robot information includes robot state information and robot historical motion information.
3. The method for bin robotic order picking based on decision-making large model of claim 1, wherein the training step of the target action prediction model comprises:
Acquiring a sequence model to be trained;
acquiring a picking training sample, wherein the picking training sample at least comprises state information and real actions matched with the state information, and the state information comprises robot state information, picking station state information, order information and inventory information;
inputting the picking training samples into the sequence model to be trained, and obtaining the target action prediction model when the predicted actions obtained by training all the picking training samples are better than or equal to the corresponding real actions through autoregressive learning training.
4. A bin robot order picking method based on a decision-making big model according to claim 3, wherein said inputting the picking training samples into the sequence model to be trained, through autoregressive learning training, when all the picking training samples train to obtain a predicted motion that is better than or equal to the corresponding real motion, to obtain the target motion prediction model, comprises:
performing time-position coding on the picking training samples (S, A, R, T) to obtain a continuous sequence:

τ = (S_1, A_1, R_1; S_2, A_2, R_2; …; S_K, A_K, R_K); wherein τ denotes the sequence of K consecutive triples, S in each triple denotes state information, A denotes the real action corresponding to S, R denotes the real order return corresponding to A, and T denotes time and position;
Inputting the continuous sequence into the sequence model to be trained, and obtaining the target action prediction model when the predicted actions obtained by training all the picking training samples are better than or equal to the corresponding real actions through autoregressive learning training.
5. A method of bin robotic order picking based on decision-making large models as claimed in claim 3, wherein said obtaining a sequence model to be trained comprises:
and obtaining a target action prediction model of the previous generation and taking the target action prediction model as a sequence model to be trained.
6. A bin robot order picking method based on a decision-making large model as recited in claim 3, wherein said picking training samples include at least historical state information that has not participated in training; or, said picking training samples include at least historical state information that has not participated in training and randomly sampled state information.
7. A bin robot order picking method based on a decision-making large model according to claim 3, wherein after said obtaining of the target action prediction model, the method further comprises:
acquiring a picking test sample, wherein the picking test sample at least comprises state information and real actions matched with the state information, and the state information comprises robot state information, picking station state information, order information and inventory information;
Inputting the picking test sample into the target action prediction model to obtain a target predicted action;
and judging whether the target action prediction model meets the publishing requirements or not based on the comparison between the target prediction action and the real action, and retraining to acquire a new target action prediction model if the target action prediction model does not meet the publishing requirements.
8. The bin robot order picking method based on a decision-making large model according to claim 1, wherein after the target robot information, the target picking station information, the target order information and the target inventory information are input into the target action prediction model to obtain the target action instruction of the robot, the method further comprises:
transmitting the target action instruction to the robot so that the robot carries a bin according to the target action instruction; the robot only receives one action instruction at a time, and a plurality of action instructions form one working task instruction of the robot.
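The one-instruction-at-a-time dispatch in claim 8 can be sketched with an assumed minimal interface (the `Robot.receive` API and instruction names below are hypothetical): the robot accepts a single action instruction per send, and a work task is simply the ordered sequence of such instructions.

```python
class Robot:
    """Toy stand-in for the bin robot's instruction endpoint."""
    def __init__(self):
        self.executed = []

    def receive(self, instruction):
        # only one action instruction is accepted per call
        self.executed.append(instruction)

def run_task(robot, task_instructions):
    """A work task instruction is a sequence of single action instructions,
    delivered to the robot one at a time."""
    for instruction in task_instructions:
        robot.receive(instruction)
```

Dispatching per-action rather than per-task is what lets the monitor in claim 9 re-predict between actions when the environment changes.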
9. The method for picking orders of a bin robot based on a decision-making large model according to claim 1, wherein after said inputting the target robot information, the target picking station information, the target order information and the target inventory information into the target action prediction model to obtain a target action instruction of the robot, the method further comprises:
determining whether to re-predict the target action instruction based on current environment information, wherein the current environment information comprises at least current robot information, current inventory information and current picking station information;
and if the target action instruction is to be re-predicted, re-inputting the target order information corresponding to the target action instruction into the target action prediction model to obtain a new target action instruction.
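The monitoring step in claim 9 reduces to a small decision: keep the instruction if the environment still matches what it was predicted under, otherwise re-run the prediction with the same target order information. The function below is a hypothetical sketch; the equality test stands in for whatever staleness check a real deployment would use.

```python
def maybe_repredict(model, instruction, order_info, predicted_env, current_env):
    """Re-predict the action instruction only if the environment
    (robot / inventory / picking-station state) has changed."""
    if current_env == predicted_env:
        return instruction                               # still valid: keep it
    return model.predict(order_info, current_env)        # re-predict under new env
```
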
10. An apparatus, comprising a training unit and an action issuing unit, wherein:
the training unit comprises a storage module and a training module, wherein the storage module is used for storing state information and the matched real actions, and the state information comprises robot state information, picking station state information, order information and inventory information; the training module uses at least part of the data in the storage module as picking training samples, and trains to obtain a target action prediction model;
the action issuing unit comprises a prediction module and an issuing module, wherein the prediction module is used for calling the target action prediction model and acquiring target characteristic information, the target characteristic information comprising at least target robot information, target picking station information, target order information and target inventory information; the prediction module inputs the target robot information, the target picking station information, the target order information and the target inventory information into the target action prediction model to obtain a target action instruction, and the issuing module is used for issuing the target action instruction to the robot.
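The two-unit device of claim 10 can be outlined structurally as below. The class and method names are assumptions for illustration; each class simply groups the modules the claim names.

```python
class TrainingUnit:
    """Storage module + training module from claim 10."""
    def __init__(self):
        self.samples = []                      # stored (state_info, real_action) pairs

    def store(self, state_info, real_action):  # storage module
        self.samples.append((state_info, real_action))

    def train(self, fit):                      # training module
        # uses at least part of the stored data as picking training samples
        return fit(self.samples)               # -> target action prediction model

class ActionIssuingUnit:
    """Prediction module + issuing module from claim 10."""
    def __init__(self, model, send):
        self.model = model                     # target action prediction model
        self.send = send                       # delivers an instruction to the robot

    def issue(self, target_features):
        instruction = self.model(target_features)  # prediction module
        self.send(instruction)                     # issuing module
        return instruction
```
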
11. The apparatus of claim 10, wherein the storage module comprises:
the historical action set module, which is used for collecting historical real actions of the robot that have not participated in training and the state information corresponding to the real actions, as the picking training samples;
and the random sampling action set module, which is used for randomly acquiring state information and correspondingly generating real actions, as the picking training samples.
12. The apparatus of claim 10, wherein the action issuing unit further comprises a monitoring module that determines whether to re-predict the target action instruction based on current environmental information, wherein the current environmental information includes at least current robot information, current inventory information, and current pick station information; and if the target action instruction is predicted again, the monitoring module re-inputs order information corresponding to the target action instruction into the target action prediction model to acquire a new target action instruction.
CN202310722981.XA 2023-06-16 2023-06-16 Material box robot order picking method and device based on decision-making big model Pending CN116703104A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310722981.XA CN116703104A (en) 2023-06-16 2023-06-16 Material box robot order picking method and device based on decision-making big model

Publications (1)

Publication Number Publication Date
CN116703104A true CN116703104A (en) 2023-09-05

Family

ID=87833713

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310722981.XA Pending CN116703104A (en) 2023-06-16 2023-06-16 Material box robot order picking method and device based on decision-making big model

Country Status (1)

Country Link
CN (1) CN116703104A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117657666A (en) * 2023-12-14 2024-03-08 浙江惠利玛产业互联网有限公司 Image data management method and system applied to shoe product
CN117657666B (en) * 2023-12-14 2024-05-03 浙江惠利玛产业互联网有限公司 Image data management method and system applied to shoe product

Similar Documents

Publication Publication Date Title
Wang et al. A proactive material handling method for CPS enabled shop-floor
CN109375601B (en) Pipeline planning method and equipment based on data-driven modeling and simulation optimization
Yoshitake et al. New automated guided vehicle system using real-time holonic scheduling for warehouse picking
Zaeh et al. A holistic approach for the cognitive control of production systems
CN107036618A (en) A kind of AGV paths planning methods based on shortest path depth optimization algorithm
JP2019200792A (en) Method for operating robot in multi-agent system, robot, and multi-agent system
CN112147960B (en) Optimized scheduling method and device for flexible manufacturing system
US20090222123A1 (en) Method and system for scheduling a set of events in real time
CN116703104A (en) Material box robot order picking method and device based on decision-making big model
CN111882215A (en) Individual customized flexible job shop scheduling method containing AGV
CN114595607A (en) Digital twin textile sliver barrel conveying method and system
Blum et al. Towards a data-oriented optimization of manufacturing processes
CN116664053B (en) Commodity inventory management method
CN113759835B (en) Vehicle scheduling method, device, equipment and storage medium
CN113506048A (en) Flexible job shop scheduling method
CN113687651A (en) Path planning method and device for delivering vehicles according to needs
Jawahar et al. Optimal random storage allocation for an AS/RS in an FMS
US20240043214A1 (en) Industrial internet of things for intelligent three-dimensional warehouse, controlling methods and storage medium thererof
CN114399247A (en) Task allocation method, electronic device, storage medium, and computer program product
Mahadevan et al. Buffer levels and choice of material handling device in flexible manufacturing systems
Mareddy et al. Simultaneous Scheduling of Machines, Tool Transporter and Tools in a Multi Machine Flexible Manufacturing System Without Tool Delay Using Crow Search Algorithm.
CN113689140A (en) Method and device for task scheduling
CN112561166A (en) Vehicle scheduling method and device, storage medium and electronic device
Ng et al. SFlex-WMS: a novel multi-expert system for flexible logistics and warehouse operation in the context of Industry 4.0
Makris et al. Cooperating Robots for Smart and Autonomous Intralogistics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination