CN117993683A - Automatic reinforcement learning scheduling method, system, equipment and medium - Google Patents

Automatic reinforcement learning scheduling method, system, equipment and medium

Info

Publication number
CN117993683A
Authority
CN
China
Prior art keywords
production
packaging
plan
automatic
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410286480.6A
Other languages
Chinese (zh)
Inventor
王勇
杨骁
王唯鉴
梁娇
吕宗喆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beizisuo Beijing Technology Development Co ltd
Original Assignee
Beizisuo Beijing Technology Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beizisuo Beijing Technology Development Co ltd filed Critical Beizisuo Beijing Technology Development Co ltd
Priority to CN202410286480.6A
Publication of CN117993683A
Legal status: Pending

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application relates to the technical field of chemical fiber production, and in particular to an automatic reinforcement learning scheduling method, system, equipment and medium. An issued production plan and the production sequence data corresponding to that plan are obtained; the issued production plan and the corresponding production sequence data are input into a preset digital-twin-based automatic scheduling model to obtain the results of automatic packaging scheduling of the chemical fiber package products. According to the application, the digital-twin-based automatic scheduling model automatically generates the next optimal packaging plan in real time as soon as the current packaging plan finishes, so that a truly fully automatic scheduling process is realized, the operating efficiency of the packaging line is improved, and labor cost is reduced.

Description

Automatic reinforcement learning scheduling method, system, equipment and medium
Technical Field
The invention relates to the technical field of chemical fiber production, and in particular to an automatic reinforcement learning scheduling method, system, equipment and medium.
Background
Chemical fiber is an important raw material in the textile industry and is closely tied to everyday life. The products sold by chemical fiber enterprises are POY, FDY and DTY yarn packages; after a package is produced by the winding machine, it passes through doffing, physical property inspection, bagging, packaging, palletizing, warehousing and sales before reaching a textile mill for weaving.
In recent years, with rising labor costs and advances in automation technology, chemical fiber production workshops have shifted from traditional, purely manual operation to fully automatic production, using advanced control systems, information systems, sensors, robots and automation equipment so that the whole process from production to sale of the packages requires no manual handling.
Automatic scheduling means that the production execution system automatically executes the outbound tasks of the yarn carts, one after another, in the time order of the packaging plans. The two most important fields of a packaging plan are the lot number and the number of packages required, from which the number of yarn carts required can be determined.
A worker will usually take the lot number with the largest number of yarn carts in the temporary store as the next packaging plan, because the more yarn carts a packaging plan contains, the fewer batch changes occur on the packaging line and the higher the automation efficiency.
In addition, it should be noted that the yarn cart temporary store is a buffer with limited capacity, and a lot cannot be left un-dispatched indefinitely merely because it currently has few carts; otherwise the yarn of that lot can never be packaged, which affects sales. Therefore, when a packaging plan is formulated manually, not only the cart count of the current lot number but also the outbound turnover rate of carts of every lot number should be considered. In practice, a manually determined packaging plan can rely only on experience, making it difficult to determine an optimal packaging plan.
That is, in the prior art the packaging plan is determined manually by experience alone, the optimal packaging plan is difficult to determine, and packaging efficiency is affected to a certain extent.
Disclosure of Invention
In view of the above, the present invention aims to provide an automatic reinforcement learning scheduling method, system, equipment and medium, so as to solve the prior art problem that a manually determined packaging plan relies on experience alone, making the optimal packaging plan difficult to determine and affecting packaging efficiency to a certain extent.
According to a first aspect of the embodiments of the present invention, there is provided an automatic reinforcement learning scheduling method, including:
obtaining an issued production plan and the production sequence data corresponding to the production plan; and
inputting the issued production plan and the corresponding production sequence data into a preset digital-twin-based automatic scheduling model to obtain the results of automatic packaging scheduling of the chemical fiber package products.
Further, the step of inputting the issued production plan and the corresponding production sequence data into the preset digital-twin-based automatic scheduling model to obtain the results of automatic packaging scheduling of the chemical fiber package products includes:
judging whether real-time operation data of the pre-acquired yarn cart temporary store complies with a first preset rule, and if so, further judging whether the current production plan complies with a second preset rule;
if the second preset rule is not complied with, inputting the production sequence data corresponding to the issued production plan and the real-time operation data of the temporary store into a preset trained scheduler, which outputs a packaging plan;
executing the packaging plan with a preset packaging line model to obtain the results of automatic packaging scheduling of the chemical fiber package products;
if the second preset rule is complied with, finishing the current packaging plan;
if the first preset rule is not complied with, arranging the next yarn cart to come online according to the current packaging plan;
wherein the first preset rule includes: the number of yarn carts of the current lot number in the temporary store has reached a preset number;
and the second preset rule includes: the completion of the currently generated plan has reached a preset value.
Further, the digital-twin-based automatic scheduling model includes:
a doffing workshop model, constructed to receive the production plan and the production sequence data corresponding to the production plan;
a yarn cart temporary store model, used to acquire production data and packaging data in real time, update its own current stock state in real time according to production and packaging conditions, and provide real-time stock information to the trainer during the scheduler training stage; and
a packaging line model, used to execute the received packaging plan, obtain the results of automatic packaging scheduling of the chemical fiber package products, and push those results to the scheduler.
Further, the digital-twin-based automatic scheduling model also includes a scheduler;
the first part of the scheduler is a packaging plan value estimation module, adopting a two-layer neural network model with two hidden layers in the middle of the network, used to output the value corresponding to each lot number of the chemical fiber package products;
the second part of the scheduler is an execution module, used to select, according to the values corresponding to the different actions, the lot number with the highest expected benefit to formulate the packaging plan.
Further, the scheduler is trained with a value-function-based reinforcement learning method.
According to a second aspect of the embodiments of the present invention, there is provided an automatic reinforcement learning scheduling system, including:
an acquisition module, used to obtain an issued production plan and the production sequence data corresponding to the production plan; and
an execution module, used to input the issued production plan and the corresponding production sequence data into a preset digital-twin-based automatic scheduling model to obtain the results of automatic packaging scheduling of the chemical fiber package products.
According to a third aspect of the embodiments of the present invention, there is provided an automatic reinforcement learning scheduling device, the device including:
a memory having an executable program stored thereon; and
a processor for executing the executable program in the memory to implement the steps of any of the methods described above.
According to a fourth aspect of the embodiments of the present invention, there is provided a computer-readable storage medium storing computer instructions for causing a computer to perform the steps of any one of the methods described above.
The technical solution provided by the embodiments of the invention can have the following beneficial effects:
The technical solution obtains an issued production plan and the production sequence data corresponding to it, and inputs them into a preset digital-twin-based automatic scheduling model to obtain the results of automatic packaging scheduling of the chemical fiber package products. By establishing a digital twin simulation environment and combining it with a reinforcement learning method, the scheduling model automatically generates the next optimal packaging plan in real time as soon as the current packaging plan finishes; packaging plans can be generated in simulation, and the expected benefits of executing different packaging plans under different environment information can be updated through interaction with the environment. A truly fully automatic scheduling process is thus realized, the operating efficiency of the packaging line is improved, and labor cost is reduced. It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a flow chart of an automatic scheduling method based on reinforcement learning, according to an exemplary embodiment;
FIG. 2 is a schematic diagram of the composition of an automatic scheduling system based on reinforcement learning, according to an exemplary embodiment;
FIG. 3 is a diagram of the training flow of the value estimation module of the scheduler in reinforcement-learning-based automatic scheduling, according to an exemplary embodiment;
FIG. 4 is a schematic diagram of the composition of an automatic scheduling system based on reinforcement learning, according to an exemplary embodiment;
FIG. 5 is a schematic diagram of the composition of an automatic scheduling device based on reinforcement learning, according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. Where the following description refers to the accompanying drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the invention; rather, they are merely examples of apparatus and methods consistent with aspects of the invention as recited in the appended claims.
Example 1
Referring to FIG. 1, FIG. 1 is a schematic flow chart of an automatic scheduling method based on reinforcement learning according to an exemplary embodiment. The method includes:
S1, obtaining an issued production plan and the production sequence data corresponding to the production plan;
S2, inputting the issued production plan and the corresponding production sequence data into a preset digital-twin-based automatic scheduling model to obtain the results of automatic packaging scheduling of the chemical fiber package products.
In one embodiment, referring to FIG. 2, the step of inputting the issued production plan and the corresponding production sequence data into the preset digital-twin-based automatic scheduling model to obtain the results of automatic packaging scheduling of the chemical fiber package products includes:
judging whether real-time operation data of the pre-acquired yarn cart temporary store complies with a first preset rule, and if so, further judging whether the current production plan complies with a second preset rule;
if the second preset rule is not complied with, inputting the production sequence data corresponding to the issued production plan and the real-time operation data of the temporary store into a preset trained scheduler, which outputs a packaging plan;
executing the packaging plan with a preset packaging line model to obtain the results of automatic packaging scheduling of the chemical fiber package products;
if the second preset rule is complied with, finishing the current packaging plan;
if the first preset rule is not complied with, arranging the next yarn cart to come online according to the current packaging plan;
wherein the first preset rule includes: the number of yarn carts of the current lot number in the temporary store has reached a preset number; and the second preset rule includes: the completion of the currently generated plan has reached a preset value.
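Read as pseudocode, the branch structure above amounts to the following minimal Python sketch. It is illustrative only: the object and method names (temp_store, production_plan, scheduler, packaging_line) and the concrete threshold values are assumptions, since the patent leaves the preset number and preset value abstract.

```python
def dispatch_step(temp_store, production_plan, scheduler, packaging_line,
                  preset_cart_number=0, preset_completion=1.0):
    """One pass of the rule-based dispatch loop described above (illustrative)."""
    current_lot = production_plan.current_lot
    # First preset rule: carts of the current lot left in the temporary store
    # have reached the preset number (assumed here to be zero, i.e. the current
    # packaging plan is fully online).
    if temp_store.cart_count(current_lot) <= preset_cart_number:
        # Second preset rule: completion of the current plan has reached the
        # preset value (assumed here to mean the plan is fully completed).
        if production_plan.completion() >= preset_completion:
            packaging_line.finish_current_plan()      # finish the current packaging plan
        else:
            # Otherwise ask the trained scheduler for the next packaging plan.
            state = (production_plan.sequence_data(), temp_store.snapshot())
            packaging_line.execute(scheduler.decide(state))
    else:
        # First rule not met: keep bringing carts online for the current plan.
        temp_store.send_next_cart(current_lot)
```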
In a specific implementation, the digital-twin-based automatic scheduling model includes:
a doffing workshop model, constructed to receive the production plan and the production sequence data corresponding to the production plan;
a yarn cart temporary store model, used to acquire production data and packaging data in real time, update its own current stock state in real time according to production and packaging conditions, and provide real-time stock information to the trainer during the scheduler training stage; and
a packaging line model, used to execute the received packaging plan, obtain the results of automatic packaging scheduling of the chemical fiber package products, and push those results to the scheduler.
After any packaging plan is issued, the packaging line model automatically selects the carts of that lot from the yarn cart temporary store and sends them to the packaging line through conveying mechanisms such as shuttle cars and turntables; a series of packaging and conveying equipment then completes the packaging of that lot of products. After packaging is completed, the packaging line model sends packaging completion information to the scheduler, which helps the scheduler issue a new decision at the moment packaging finishes and perform its own optimization training according to the actual packaging efficiency.
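The three components can be pictured as the following minimal Python sketch; all class and method names are illustrative assumptions rather than identifiers from the patent, and the conveying mechanics are reduced to counter updates.

```python
from collections import Counter

class DoffingWorkshopModel:
    """Receives the production plan and replays its production sequence data."""
    def __init__(self, production_sequence):
        self.sequence = list(production_sequence)    # e.g. ["A", "B", "A", ...]

    def produce_next(self):
        # Emit the lot number of the next cart coming off the doffing line.
        return self.sequence.pop(0) if self.sequence else None

class TemporaryStoreModel:
    """Tracks per-lot cart stock in real time for the trainer and scheduler."""
    def __init__(self):
        self.stock = Counter()                       # lot number -> cart count

    def put(self, lot):
        self.stock[lot] += 1                         # a produced cart arrives

    def take(self, lot):
        self.stock[lot] -= 1                         # a cart leaves for packaging

    def snapshot(self):
        return dict(self.stock)                      # real-time stock information

class PackagingLineModel:
    """Executes a packaging plan and pushes completion info to the scheduler."""
    def __init__(self, store):
        self.store = store
        self.batch_changes = 0

    def execute(self, lot, scheduler=None):
        self.batch_changes += 1                      # switching lots = one batch change
        while self.store.stock[lot] > 0:             # feed every cart of this lot
            self.store.take(lot)
        if scheduler is not None:
            scheduler.on_packaging_complete(lot)     # completion pushed back
```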
In one embodiment, to achieve automatic issuing of packaging plans, an automatic scheduling scheduler is trained that can analyze the optimal packaging plan based on the production plan and real-time production conditions, combined with the current dynamic inventory of the temporary store. Compared with the manual scheduling currently adopted, the scheduler has two main advantages: 1. scheduling can be performed automatically as soon as the current packaging plan finishes, achieving full automation from the doffing line to the packaging line and reducing labor cost; 2. the scheduler can learn from historical experience through simulated interaction with the digital twin environment, and the trained scheduler can evaluate different packaging plans in any production scenario, computing at decision time the influence that each candidate plan (i.e., bringing carts of each lot number in the temporary store online) would have on batch changing, and selecting the lot number with the smallest influence.
In the embodiment of the application, the automatic scheduling scheduler can select a designated lot number from the current temporary store stock to generate a packaging plan, according to the dynamic information of the yarn cart temporary store, the current remaining production plan, and the production plan of the next period. Selecting, from this information, the packaging plan most favorable to overall packaging efficiency is difficult by human experience alone, so a simulation environment for scheduler training and optimization is created by constructing a digital twin model, and an adaptive optimization method is selected for training the automatic packaging scheduler of the industrial yarn system. Preferably, a value-function-based reinforcement learning method is selected.
In a specific implementation, the algorithm operates on the following principle: in an actual scenario, after each packaging plan is formulated, the packaging line packages yarn of the planned lot from the stock in the temporary store until the current packaging plan is completed. At that moment, the automatic scheduler automatically generates a new packaging plan using the current production and packaging information, and the cycle repeats, realizing automatic issuing of packaging plans.
In a specific implementation, the steps for realizing automatic issuing of packaging plans are as follows:
1. The moment at which the scheduler runs: the currently issued packaging plan has come entirely online. For example, when the current packaging plan is lot A and all yarn carts storing lot-A industrial yarn in the temporary store have come online, the scheduler at that moment (called the decision moment) issues the next packaging plan according to the dynamic information.
2. Data received by the scheduler (corresponding to the environment information in reinforcement learning): at the decision moment, the information the scheduler can acquire includes the stock condition of the current yarn cart temporary store, i.e., the number of carts of each lot, and the current remaining production plan, i.e., the number of spindles of each lot of industrial yarn still to be produced.
3. Training of the scheduler: the automatic scheduling scheduler, as part of the digital twin model of the chemical fiber production line, is trained using historical production line data and simulated interaction with the twin model. The trained model can estimate, from the real-time inventory information and the remaining production plan, the influence of packaging plans for different lot numbers on the number of batch changes. At the decision moment, the scheduler selects, according to the potential influence on batch changing of the candidate plans for the lots currently in the temporary store, the lot number most helpful for reducing batch changes, and generates a new packaging plan.
4. The new packaging plan (e.g., lot B) is issued, the carts holding lot B in the temporary store are fed to the packaging line, and step 1 is repeated; a sketch of the observation the scheduler receives at the decision moment follows this list.
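At the decision moment the scheduler therefore needs only two per-lot quantities: the remaining production plan and the temporary store stock. A sketch of assembling that observation into the 2N-dimensional state described below (function and argument names are assumptions):

```python
import numpy as np

def build_state(remaining_plan, store_stock, lot_numbers):
    """Assemble the 2N observation: for each of the N lot numbers, the number
    of spindles still to be produced and the current cart stock."""
    plan_part = [remaining_plan.get(lot, 0) for lot in lot_numbers]
    stock_part = [store_stock.get(lot, 0) for lot in lot_numbers]
    return np.asarray(plan_part + stock_part, dtype=np.float32)   # shape (2N,)
```

With four lot numbers A, B, C and D, for instance, the state is an 8-dimensional vector.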
In one embodiment, the first part of the scheduler is a packaging plan value estimation module, adopting a two-layer neural network model with two hidden layers in the middle of the network, used to output the value corresponding to each lot number of the chemical fiber package products;
the second part of the scheduler is an execution module, used to select, according to the values corresponding to the different actions, the lot number with the highest expected benefit to formulate the packaging plan.
In a specific implementation, the first part of the scheduler is the packaging plan value estimation module Q, which adopts a two-layer neural network model. The input s is a tensor of size 2N, where N is the total number of lot numbers; that is, for each lot number the input contains the remaining packaging plan count and the temporary store count. The network contains two hidden layers and ends with an output layer of size N, which outputs an N-dimensional vector representing the value corresponding to each lot number (this can be understood as follows: if that lot number in the store is chosen to come online now, the weighted sum of the expected time until the next batch change and the time intervals of the future batch changes quantifies the impact of the current packaging plan on overall batch changing).
The scheduler then selects the lot number with the highest expected benefit (i.e., the value of the action) to formulate the packaging plan, based on the values of the different actions (the action space consists of all lot numbers contained in the current temporary store).
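A minimal PyTorch sketch of such a value estimation module and its greedy execution step; the patent does not give the hidden layer sizes, so the width below is an assumption, and lot numbers are assumed to be indexed 0..N-1:

```python
import torch
import torch.nn as nn

class PackagingPlanValueNet(nn.Module):
    """Two-hidden-layer network: 2N-dimensional state in, one value per lot out."""
    def __init__(self, n_lots, hidden=64):            # hidden width is an assumption
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * n_lots, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_lots),                # one value per lot number
        )

    def forward(self, state):
        return self.net(state)

def select_lot(q_net, state, available_lots):
    """Execution module: pick the available lot with the highest estimated value."""
    with torch.no_grad():
        values = q_net(torch.as_tensor(state).unsqueeze(0)).squeeze(0)
    # Restrict the argmax to the action space: lots currently in the store.
    return max(available_lots, key=lambda lot: values[lot].item())
```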
The scheduler is trained with a value-function-based reinforcement learning method.
In one embodiment, the steps for training the reinforcement learning algorithm in combination with the digital twin model (i.e., the digital-twin-based automatic scheduling model) are as follows:
1. Determine the training episode: each training episode runs from the beginning of one production plan until all yarn of that production plan has reached the end of the packaging line.
2. At the beginning of training, the scheduler randomly selects an action as the initial training action.
3. Interaction with the digital twin model during training: when the production task is issued, the digital twin model starts to operate, and the stock information of the temporary store is updated along with the historical production process data. At the initial moment the packaging line starts running, and the scheduler generates a packaging scheme (i.e., selects the optimal lot number, from all lots contained in the temporary store, to be brought online next) from the real-time data (the current remaining production plan information and the current temporary store stock). The packaging line is then fed at a fixed speed until no carts of the current lot number remain in the temporary store, after which the scheduler generates a new packaging scheme; this repeats until the current production task is fully executed, at which point the packaging line counts the number of batch changes in the process.
4. In the actual production process there is process data, such as "a cart of one lot is produced at a certain moment, a cart of another lot at the next discrete moment", which is input as front-end information to the temporary store. The scheduler, however, does not acquire this data in advance; at the moment a decision is needed it can only acquire the data that has already been produced.
5. The model evaluates the state-action value of the current scene as follows: suppose that at time t1 the scheduler selects lot C from the temporary store stock to bring online (at this moment the store holds the four lots A, B, C and D), and that by time t2 all carts of lot C in the store have come online (during this interval some lot-C yarn may also be produced, beyond what was previously stocked in the store). Then, in state s, the single-step benefit of selecting the lot-C action is r = t2 - t1, which can be understood as the length of time during which no batch change occurs. The latter part of the value formula, Q(s, a) = r + γ · max_a′ Q(s′, a′), is the value (expected benefit) of taking the current optimal action in the next state, computed iteratively from the results of simulated interaction with the environment. The idea behind this way of calculating benefit is to make the overall batch-change intervals as large as possible (reducing the number of batch changes from a global perspective).
After the i-th packaging plan is issued, the reward function is calculated as r_i = t_{i+1} - t_i, where t_{i+1} is the next batch-change time and t_i is the current batch-change time. For example, if the current batch change occurs at minute 100 and the next at minute 160, the reward for the plan in between is 60.
In implementation, referring to FIG. 3, FIG. 3 shows the training flow of the value estimation module Q of the scheduler. The specific training process is described as follows:
Imitation learning experience pool: built from pre-obtained tuple data (s, a, r, s′), collected by letting a plain greedy policy interact with the digital twin model (whenever the current packaging plan finishes, the lot number with the largest cart count in the temporary store forms the new plan). That is, the current overall production plan is taken as the input of the digital twin model, the model starts simulated production and stores carts into the yarn cart temporary store model, the temporary store model updates its stock and forms a packaging plan according to the greedy rule, and carts of the selected lot number then enter the packaging line digital twin model for simulated packaging. In this process, each time the temporary store generates a new packaging plan, a new tuple (s, a, r, s′) is generated.
E: the reinforcement learning experience pool, composed of tuple data (s, a, r, s′) obtained through simulated interaction between the agent (which can be understood as the decision maker that automatically generates a new packaging plan at the moment the current packaging plan completes) and the digital twin line; that is, each time the agent generates a new packaging plan, a new tuple (s, a, r, s′) is obtained.
Episode: one round of the reinforcement learning training process; in this patent, an episode runs from the moment the production model starts a production plan until all finished products produced have been fed to the packaging line for packaging.
Number of training episodes: set according to training conditions and the scale of the historical data, generally not lower than 10000.
Input s: a tensor of size 2N (N is the total number of lot numbers; i.e., for each lot number the input contains the remaining packaging plan count and the temporary store count). s0 can be understood as the state at the very start of production, i.e., the state information generated by the digital twin model from the real-time production conditions when the first packaging plan is generated.
ε: the probability of selecting a random lot number as the plan, generally set to 0.1, used to jump out of local optima. a: the agent's action, concretely the automatic packaging scheduler generating a new packaging plan. argmax_a Q(s, a): among all actions available at the current moment (i.e., the action space), select the one with the maximum Q value; concretely, select from all lot numbers in the current temporary store the lot number with the least influence on increasing the number of batch changes.
r: the single-step reinforcement learning benefit, calculated as r_i = t_{i+1} - t_i, where t_{i+1} is the next batch-change time and t_i is the current batch-change time.
s can be understood as the state information at the current batch-change moment, and s′ as the state information at the next batch-change moment.
Sampling scale: how much data is drawn from E at one time for a training update of the value estimation network model.
Optimization target: y = r + γ · max_a′ Q(s′, a′), where γ is the decay factor, typically 0.9, used to balance current and future benefits, and max_a′ Q(s′, a′) is the expected benefit the agent can obtain after taking the optimal action in the next state, computed with the current value estimation model. When, in the next state, the current production plan is fully online (no new packaging plan can be generated), the optimization target y is only the single-step benefit r.
Finally, the deep network updates its parameters by gradient descent on the difference between y_i, the expected output of the value estimation network, and Q(s_i, a_i), the actual output of the value estimation network when the i-th batch-change action is performed; updating with this difference allows the influence of different actions on the overall batch-change count in different states to be estimated ever more accurately.
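Putting these quantities together, a condensed sketch of the value-function training loop: ε-greedy interaction with the twin environment, replay sampling of B transitions, and a gradient step toward the target y = r + γ · max_a′ Q(s′, a′). The gym-like env interface is an assumption, select_lot reuses the earlier sketch, and the imitation-learning pool is omitted for brevity.

```python
import random
import torch
import torch.nn.functional as F

def train(q_net, env, episodes=10000, eps=0.1, gamma=0.9, batch_size=32, lr=1e-3):
    """Value-function RL against the digital twin, as outlined above. `env` is
    an assumed wrapper: reset() -> s0; step(a) -> (s', r, done), where the
    reward r is the interval until the next batch change."""
    opt = torch.optim.Adam(q_net.parameters(), lr=lr)
    replay = []                                  # experience pool E of (s, a, r, s', done)
    for _ in range(episodes):                    # one episode = one full production plan
        s, done = env.reset(), False
        while not done:
            lots = env.available_lots()          # action space: lots now in the store
            if random.random() < eps:            # epsilon-greedy, to escape local optima
                a = random.choice(lots)
            else:
                a = select_lot(q_net, s, lots)   # greedy w.r.t. current value estimates
            s2, r, done = env.step(a)
            replay.append((s, a, r, s2, done))
            s = s2
            if len(replay) >= batch_size:        # draw B transitions and update
                batch = random.sample(replay, batch_size)
                states = torch.stack([torch.as_tensor(b[0]) for b in batch])
                actions = torch.tensor([b[1] for b in batch])
                rewards = torch.tensor([b[2] for b in batch], dtype=torch.float32)
                nexts = torch.stack([torch.as_tensor(b[3]) for b in batch])
                dones = torch.tensor([float(b[4]) for b in batch])
                q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
                with torch.no_grad():            # y = r + gamma * max_a' Q(s', a')
                    y = rewards + gamma * (1.0 - dones) * q_net(nexts).max(1).values
                loss = F.mse_loss(q_sa, y)       # gradient descent on (y - Q(s, a))^2
                opt.zero_grad()
                loss.backward()
                opt.step()
```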
In the actual training process, the algorithm needs actual production data from multiple production plans (this data contains production pattern information for the different yarn types and helps the deep network discover latent patterns, thereby guiding the model, in real time, toward decisions more favorable to reducing batch changes and improving packaging efficiency).
By the above method, the most suitable packaging plans in different situations can be analyzed according to the influence of different packaging plans on the actual number of batch changes, improving product flow and reducing batch changing.
According to the application, by establishing a digital twin model from the doffing equipment to the packaging line and combining it with historical packaging line data, a scheduler usable for automatic scheduling (automatic generation of packaging plans) can be trained without interfering with actual production, completing the automatic formulation of packaging plans in dynamic scenarios. The method can learn latent patterns of the production and packaging process through simulated interaction with the digital twin model, thereby generating packaging schemes more conducive to packaging efficiency, improving the operating efficiency of the packaging line and reducing labor cost.
Referring to FIG. 4, FIG. 4 is a schematic diagram of an automatic scheduling system based on reinforcement learning according to an exemplary embodiment. The system includes:
an acquisition module 41, configured to obtain an issued production plan and the production sequence data corresponding to the production plan; and
an execution module 42, configured to input the issued production plan and the corresponding production sequence data into a preset digital-twin-based automatic scheduling model to obtain the results of automatic packaging scheduling of the chemical fiber package products.
Referring to FIG. 5, FIG. 5 is a schematic diagram of an automatic scheduling device based on reinforcement learning according to an exemplary embodiment. The device includes:
a memory 51 on which an executable program is stored; and
a processor 52 for executing the executable program in the memory 51 to implement the steps of any of the methods described above.
Furthermore, the present application provides a computer-readable storage medium storing computer instructions for causing a computer to perform the steps of any one of the methods described above. The storage medium may be a magnetic disk, an optical disc, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory, a Hard Disk Drive (HDD), a Solid-State Drive (SSD), or the like; the storage medium may also comprise a combination of the above kinds of memories.
It is to be understood that the same or similar parts in the above embodiments may be referred to each other, and that in some embodiments, the same or similar parts in other embodiments may be referred to.
It should be noted that in the description of the present invention, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Furthermore, in the description of the present invention, unless otherwise indicated, the meaning of "plurality" means at least two.
Any process or method description in a flow chart or otherwise described herein may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process; and the scope of the preferred embodiments of the present invention includes further implementations in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order depending on the functionality involved, as would be understood by those reasonably skilled in the art.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented by software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, Programmable Gate Arrays (PGAs), Field-Programmable Gate Arrays (FPGAs), and the like.
Those of ordinary skill in the art will appreciate that all or a portion of the steps carried out in the method of the above-described embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, and where the program, when executed, includes one or a combination of the steps of the method embodiments.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product.
The above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, or the like.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the invention.

Claims (8)

1. An automatic reinforcement learning scheduling method, comprising:
obtaining an issued production plan and the production sequence data corresponding to the production plan; and
inputting the issued production plan and the corresponding production sequence data into a preset digital-twin-based automatic scheduling model to obtain the results of automatic packaging scheduling of the chemical fiber package products.
2. The method according to claim 1, wherein the step of inputting the issued production plan and the corresponding production sequence data into the preset digital-twin-based automatic scheduling model to obtain the results of automatic packaging scheduling of the chemical fiber package products comprises:
judging whether real-time operation data of the pre-acquired yarn cart temporary store complies with a first preset rule, and if so, further judging whether the current production plan complies with a second preset rule;
if the second preset rule is not complied with, inputting the production sequence data corresponding to the issued production plan and the real-time operation data of the temporary store into a preset trained scheduler, which outputs a packaging plan;
executing the packaging plan with a preset packaging line model to obtain the results of automatic packaging scheduling of the chemical fiber package products;
if the second preset rule is complied with, finishing the current packaging plan; and
if the first preset rule is not complied with, arranging the next yarn cart to come online according to the current packaging plan;
wherein the first preset rule includes: the number of yarn carts of the current lot number in the temporary store has reached a preset number; and the second preset rule includes: the completion of the currently generated plan has reached a preset value.
3. The method of claim 1, wherein the digital-twin-based automatic scheduling model comprises:
a doffing workshop model, constructed to receive the production plan and the production sequence data corresponding to the production plan;
a yarn cart temporary store model, used to acquire production data and packaging data in real time, update its own current stock state in real time according to production and packaging conditions, and provide real-time stock information to the trainer during the scheduler training stage; and
a packaging line model, used to execute the received packaging plan, obtain the results of automatic packaging scheduling of the chemical fiber package products, and push those results to the scheduler.
4. The method of claim 3, wherein the digital-twin-based automatic scheduling model further comprises a scheduler;
the first part of the scheduler is a packaging plan value estimation module, adopting a two-layer neural network model with two hidden layers in the middle of the network, used to output the value corresponding to each lot number of the chemical fiber package products; and
the second part of the scheduler is an execution module, used to select, according to the values corresponding to the different actions, the lot number with the highest expected benefit to formulate the packaging plan.
5. The method of claim 4, wherein the scheduler is trained with a value-function-based reinforcement learning method.
6. An automatic reinforcement learning scheduling system, the system comprising:
an acquisition module, used to obtain an issued production plan and the production sequence data corresponding to the production plan; and
an execution module, used to input the issued production plan and the corresponding production sequence data into a preset digital-twin-based automatic scheduling model to obtain the results of automatic packaging scheduling of the chemical fiber package products.
7. An automatic reinforcement learning scheduling apparatus, the apparatus comprising:
A memory having an executable program stored thereon;
A processor for executing the executable program in the memory to implement the steps of the method of any one of claims 1-5.
8. A computer readable storage medium having stored thereon computer instructions for causing a computer to perform the steps of the method according to any one of claims 1-5.


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination