CN117807403A - Steel turning control method and device based on behavior cloning, medium and computer equipment - Google Patents


Info

Publication number: CN117807403A
Authority: CN (China)
Prior art keywords: steel, data, data set, turning, reward
Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Application number: CN202410224883.8A
Other languages: Chinese (zh)
Other versions: CN117807403B (en)
Inventors: 何纯玉, 段席兆, 薛松, 矫志杰, 吴志强, 赵忠
Original assignee: 东北大学 (Northeastern University)
Application filed by 东北大学 (Northeastern University). Priority: CN202410224883.8A. Application granted; granted publication CN117807403B. Current legal status: Active.

Landscapes

  • Feedback Control In General (AREA)

Abstract

The invention relates to the technical field of rolling automation and discloses a steel turning control method, device, medium, and computer equipment based on behavior cloning. The method preprocesses and screens sample data of steel turning operations collected during the turning process, then uses a behavior cloning algorithm to train on a large amount of offline steel turning experience data. An optimal steel turning control strategy can thus be extracted from the experience data without any interaction between the agent and the real environment, meeting production requirements.

Description

Steel turning control method and device based on behavior cloning, medium and computer equipment
Technical Field
The invention relates to the technical field of rolling automation, and in particular to a steel turning control method, device, medium, and computer equipment based on behavior cloning.
Background
Steel turning is an important link in medium-plate rolling production, so reducing the turning time and improving the stability of the turning process are key topics in the intelligent control of the rolling process. In actual production, turning efficiency is affected by field equipment and the environment, and there is no unified operating specification or control logic, which makes it difficult to control the steel turning process directly with a mechanism model.
With the continued development of imitation learning and reinforcement learning in the control field, more and more engineering tasks attempt to use reinforcement learning algorithms to select optimal control strategies. For sequential decision tasks, reinforcement learning can interact with the environment in real time and account for long-term rewards; its adaptive, self-learning ability gives it a clear advantage over traditional control methods. In an automatic steel turning scenario, the ideal approach would be to build a high-precision environment simulator in which an agent can explore freely, train a behavior policy from scratch with reinforcement learning, and deploy the trained result to the production environment to realize automatic steel turning control. In practice, however, a simulation environment is difficult to make fully consistent with the real one, so a policy trained in simulation may not be applicable in actual control, and may even contain dangerous operations that threaten the safety of industrial production.
Disclosure of Invention
In view of the above, the present application provides a steel turning control method, device, medium, and computer equipment based on behavior cloning, aiming to solve the technical problem that, in the prior art, a steel turning control strategy trained in a simulation environment carries risk and cannot be applied in the actual control process.
According to a first aspect of the present invention, there is provided a steel turning control method based on behavior cloning, the method comprising:
collecting steel turning operation sample data, where the sample data comprises billet state data and a corresponding steel turning operation action sequence;
constructing a reward function based on preset parameters of the steel turning target, calculating a comprehensive reward value of the steel turning operation action sequence with the reward function, and constructing a steel turning raw data set from the sample data and the comprehensive reward value;
performing iterative computation on the raw data set with a preset reinforcement learning model to obtain the expected return corresponding to the sample data, and constructing a steel turning training data set from the raw data set and the expected returns;
performing correlation analysis on the training data set with the Pearson correlation method to obtain a correlation analysis result, and screening that result with expected return as the filter condition to obtain expert example data;
and performing offline supervised training on the expert example data with a behavior cloning algorithm to obtain a steel turning control strategy.
Preferably, the steel turning operation action sequence comprises a steel turning time and a steel turning angle, and the reward function comprises a steel turning angle reward function and a steel turning time reward function; constructing the reward function based on preset parameters of the steel turning target and calculating the comprehensive reward value of the action sequence comprises the following steps:
obtaining the preset parameters of the steel turning target, which comprise a target turning angle, an upper limit and a lower limit of the angle reward, and a time reward parameter;
constructing the steel turning angle reward function from the target angle and the angle reward upper and lower limits, inputting the turning angle into it, and calculating the angle reward value;
constructing the steel turning time reward function from the time reward parameter, inputting the turning time into it, and calculating the time reward value;
determining a first preset weight for the angle reward value and a second preset weight for the time reward value, and calculating a first product between the first weight and the angle reward value and a second product between the second weight and the time reward value;
and summing the first product and the second product to obtain the comprehensive reward value of the steel turning operation action sequence.
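The weighted-sum computation in the steps above can be sketched as follows. The patent gives no explicit functional forms, so the linear fall-off of the angle reward, the hyperbolic decay of the time reward, and every name and default value below are illustrative assumptions, not the patented formulas:

```python
def angle_reward(angle, target_angle, r_max, r_min, tol=5.0):
    """Assumed form: reward peaks at r_max on target and falls linearly,
    clipped to the preset upper/lower bounds [r_min, r_max]."""
    deviation = abs(angle - target_angle)
    r = r_max - (r_max - r_min) * (deviation / tol)
    return max(r_min, min(r_max, r))

def time_reward(turn_time, time_param):
    """Assumed form: shorter turning time earns a higher reward;
    time_param sets the decay scale."""
    return time_param / (time_param + turn_time)

def composite_reward(angle, turn_time, target_angle=90.0,
                     r_max=1.0, r_min=-1.0, time_param=10.0,
                     w_angle=0.7, w_time=0.3):
    """Comprehensive reward: first product (weight * angle reward)
    plus second product (weight * time reward)."""
    return (w_angle * angle_reward(angle, target_angle, r_max, r_min)
            + w_time * time_reward(turn_time, time_param))
```

A fast, on-target turn (90° in near-zero time with the defaults above) scores the maximum composite reward, while overshoot or a slow turn lowers it.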
Preferably, the steel turning operation action sequence comprises a roller table set speed; performing iterative computation on the raw data set with a preset reinforcement learning model to obtain the expected returns, and constructing the training data set from the raw data set and the expected returns, comprises the following steps:
determining a preset reinforcement learning model, setting the billet state data as the model's input state and the roller table set speed as its output action;
acquiring the comprehensive reward value of the operation sequence, setting it as the expected return of the final input state in the model, and iteratively calculating the expected returns of all other input states by propagating backwards through the trajectory;
and constructing the steel turning training data set from the raw data set and the expected returns of all input states, where the training data set contains multiple training records, each comprising the billet state data of the current input state, the roller table set speed, the instant reward value, the expected return, and the billet state data of the next input state.
Preferably, setting the comprehensive reward value as the expected return of the final input state and iteratively calculating the expected returns of all other input states by backward propagation comprises:
determining the instant rewards of all input states other than the final one, and constructing an expected return calculation function with a preset discount factor, in which the expected return of the current state equals the instant reward of the current state plus the product of the discount factor and the expected return of the next state;
and feeding the expected return of the final state into this function, calculating the expected returns of all other states one by one.
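The backward calculation these steps describe can be sketched as a short routine; the function and variable names are assumptions for illustration, not identifiers from the patent:

```python
def expected_returns(instant_rewards, composite_reward, gamma=0.95):
    """instant_rewards: per-step rewards for every state except the final one.

    The final state's expected return is set to the trajectory's composite
    reward; each earlier return is then r_t + gamma * G_{t+1}, computed by
    walking the trajectory backwards.
    """
    returns = [composite_reward]          # expected return of the final state
    g = composite_reward
    for r in reversed(instant_rewards):   # propagate backwards through time
        g = r + gamma * g
        returns.append(g)
    returns.reverse()                     # restore chronological order
    return returns
```

For example, with instant rewards [1.0, 0.5], a composite reward of 2.0, and gamma = 0.5, the returns come out chronologically as [1.75, 1.5, 2.0].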
Preferably, performing the Pearson correlation analysis on the steel turning training data set and screening the result with expected return as the filter condition to obtain expert example data comprises:
traversing the training data set to obtain the billet state data of each training record;
calculating the correlation coefficient between every two billet state data in the training data set with the Pearson method, and selecting a preset number of coefficients in descending order of absolute value;
determining the two billet state data corresponding to each selected coefficient, acquiring the training records for those state data, and constructing a valid data set;
and acquiring the expected return of each record in the valid data set, sorting all expected returns in descending order, and selecting the records corresponding to a preset number of the largest expected returns as the expert example data.
Preferably, calculating the correlation coefficient between any two billet state data in the training data set with the Pearson method comprises:
acquiring the multiple observation values of each billet state datum in the training data set, calculating the mean of those observations, and selecting a designated observation value among them;
and, for any two billet state data in the training data set, substituting the mean of each datum's observations and its designated observation value into the Pearson correlation coefficient formula to obtain the correlation coefficient between the two.
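A minimal sketch of this two-stage screening, under the assumption that each billet state is a list of numeric observations; the selection counts (`top_pairs`, `top_returns`) and all names are illustrative, not from the patent:

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def select_expert_examples(samples, top_pairs=2, top_returns=2):
    """samples: list of dicts with 'state' (observations) and 'return'."""
    # 1. correlation coefficient between every pair of billet-state series,
    #    keeping the pairs with the largest absolute coefficients
    pairs = combinations(range(len(samples)), 2)
    scored = sorted(pairs,
                    key=lambda ij: abs(pearson(samples[ij[0]]['state'],
                                               samples[ij[1]]['state'])),
                    reverse=True)[:top_pairs]
    # 2. valid data set = samples appearing in the strongest pairs
    valid = {i for ij in scored for i in ij}
    # 3. expert examples = valid samples with the highest expected returns
    ranked = sorted(valid, key=lambda i: samples[i]['return'], reverse=True)
    return [samples[i] for i in ranked[:top_returns]]
```

The correlation filter first discards weakly related samples, and only then does the expected-return ranking pick the expert examples, matching the order of the steps above.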
Preferably, performing offline supervised training on the expert example data with a behavior cloning algorithm to obtain the steel turning control strategy comprises:
extracting training samples from the expert example data, and acquiring each sample's input state and expert action label;
constructing a behavior cloning network based on the behavior cloning algorithm, feeding the input state of the training sample into the network, and obtaining the network's output action;
calculating the error loss between the expert action label and the network's output action with a preset loss function, and updating the network parameters with gradient descent according to that loss;
and obtaining, from the updated behavior cloning network, the steel turning control strategy corresponding to the expert example data.
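The supervised loop these steps describe might look like the toy sketch below, in which the "behavior cloning network" is reduced to a single linear layer trained with mean-squared-error gradient descent on (state, expert action) pairs; a real implementation would use a deeper network and the patent's own loss function, but the shape of the loop is the same:

```python
import numpy as np

def behavior_clone(states, expert_actions, lr=0.1, epochs=500):
    """states: (n, d) array of input states;
    expert_actions: (n,) array of expert action labels."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=states.shape[1])
    b = 0.0
    for _ in range(epochs):
        pred = states @ w + b             # network output action
        err = pred - expert_actions       # error vs the expert label
        # gradient of the MSE loss, then a gradient-descent update
        w -= lr * states.T @ err / len(states)
        b -= lr * err.mean()
    return w, b

# usage: clone a hypothetical expert policy action = 2*s1 - s2 + 1
states = np.random.default_rng(1).normal(size=(200, 2))
actions = 2 * states[:, 0] - states[:, 1] + 1
w, b = behavior_clone(states, actions)
```

Because the expert data here are noiseless and linear, the cloned parameters converge close to the expert's (w ≈ [2, -1], b ≈ 1); on real turning data the fit would only approximate the operators' behavior.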
According to a second aspect of the present invention, there is provided a steel turning control device based on behavior cloning, comprising:
a data acquisition module for collecting steel turning operation sample data, where the sample data comprises billet state data and a corresponding steel turning operation action sequence;
a data set construction module for constructing a reward function based on preset parameters of the steel turning target, calculating the comprehensive reward value of the action sequence with it, and constructing the steel turning raw data set from the sample data and the comprehensive reward value;
a data preprocessing module for performing iterative computation on the raw data set with a preset reinforcement learning model to obtain the expected returns, and constructing the steel turning training data set from the raw data set and the expected returns;
a data screening module for performing Pearson correlation analysis on the training data set and screening the analysis result with expected return as the filter condition to obtain expert example data;
and a strategy output module for performing offline supervised training on the expert example data with a behavior cloning algorithm to obtain the steel turning control strategy.
According to a third aspect of the present invention, there is provided a storage medium storing a computer program which, when executed by a processor, implements the behavior-cloning-based steel turning control method described above.
According to a fourth aspect of the present invention, there is provided computer equipment comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the behavior-cloning-based steel turning control method described above when executing the program.
In the behavior-cloning-based steel turning control method, device, medium, and computer equipment of the present application, steel turning operation sample data are first collected, comprising billet state data and corresponding steel turning operation action sequences. A reward function is then constructed from the preset parameters of the steel turning target, the comprehensive reward value of each action sequence is calculated with it, and a steel turning raw data set is built from the sample data and the reward values. A preset reinforcement learning model iterates over the raw data set to obtain the expected return of each sample, from which a steel turning training data set is built. Pearson correlation analysis is applied to the training data set, and the result is screened with expected return as the filter condition to obtain expert example data. Finally, offline supervised training on the expert example data with a behavior cloning algorithm yields the steel turning control strategy.
The reward function evaluates the merit of each turning action sequence, quantitatively measuring the combined effect of different turning operations, and the resulting raw data set provides the basis for subsequent model training. Iterating over the raw data set with the reinforcement learning model gradually improves the accuracy and reliability of the expected-return computation. Pearson correlation analysis screens out the data samples with high correlation and high expected return, improving the effectiveness and generalization of model training while reducing noise and irrelevant information. Finally, offline supervised training with the behavior cloning algorithm transfers operating experience into a machine learning model, improving its effect and reliability. By preprocessing and screening the sample data collected during the turning process and training on a large amount of offline turning experience data with a behavior cloning algorithm, an intelligent steel turning control strategy can be obtained from the experience data without any interaction between the agent and the real environment, and intelligent adjustment of the turning strategy meets the production-site requirement for rapid steel turning.
The foregoing is only an overview of the technical solutions of the present application. So that its technical means may be understood more clearly, and so that the above and other objects, features, and advantages become more apparent, a detailed description of the application follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiments of the invention and together with the description serve to explain the invention and do not constitute a limitation on the invention. In the drawings:
fig. 1 shows a schematic flow diagram of a behavior-cloning-based steel turning control method according to an embodiment of the present invention;
fig. 2 shows a schematic diagram of the behavior cloning network structure in a behavior-cloning-based steel turning control method according to an embodiment of the present invention;
fig. 3 shows a schematic diagram of a steel turning strategy setting curve in a behavior-cloning-based steel turning control method according to an embodiment of the present invention;
fig. 4 shows a schematic flow diagram of another behavior-cloning-based steel turning control method according to an embodiment of the present invention;
fig. 5 shows a schematic structural diagram of a behavior-cloning-based steel turning control device according to an embodiment of the present invention;
Fig. 6 shows a schematic device structure of a computer device according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The embodiment of the application provides a steel turning control method based on behavior cloning, which, as shown in fig. 1, comprises the following steps:
101. Collect steel turning operation sample data, where the sample data comprises billet state data and a corresponding steel turning operation action sequence.
Specifically, the corresponding billet state data and steel turning operation action sequence mean that, during the turning process, the billet's state data and the corresponding turning actions are recorded in order and a correspondence between them is established. The billet state data may include features such as billet temperature, size, and shape, obtained from sensors, monitoring equipment, or other measurement means and recorded digitally or otherwise; the action sequence comprises the various operation actions performed during turning. The correspondence records which specific operation was executed at which time point. Recording it establishes the temporal association between billet states and turning actions, which helps analyze and understand the relationship between different states and operations and supports subsequent data analysis, model construction, and operation optimization.
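One possible way to store the timestamped correspondence just described, sketched with Python dataclasses; every field name and value here is an illustrative assumption, not taken from the patent:

```python
from dataclasses import dataclass, field

@dataclass
class StepRecord:
    time_s: float          # time point within the turning pass
    temperature_c: float   # billet state features read from sensors
    length_mm: float
    width_mm: float
    roller_speed: float    # the operation action taken in this state

@dataclass
class TurningTrajectory:
    steps: list = field(default_factory=list)

    def record(self, time_s, temperature_c, length_mm, width_mm, roller_speed):
        """Append one (state, action) pair in chronological order."""
        self.steps.append(StepRecord(time_s, temperature_c,
                                     length_mm, width_mm, roller_speed))

# usage: two recorded steps of a hypothetical turning pass
traj = TurningTrajectory()
traj.record(0.0, 1050.0, 3200.0, 1800.0, 1.2)
traj.record(0.5, 1048.0, 3200.0, 1800.0, 1.6)
```

Keeping the state and the action in one ordered record is what later lets the method attach rewards and expected returns to each step of the trajectory.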
In this embodiment, the collected sample data reflect actual steel turning operations and provide the basic data for building a machine learning model, so the model can be trained and its ability to predict and optimize turning operations improved. The collected sample data also support subsequent decisions: analysis and understanding of these data can provide accurate information and guidance for decisions in the turning process, helping to form reasonable strategies and plans and further improving the efficiency and quality of turning operations.
102. Construct a reward function based on the preset parameters of the steel turning target, calculate the comprehensive reward value of the steel turning operation action sequence with it, and construct a steel turning raw data set from the sample data and the comprehensive reward value.
Specifically, building the reward function from the preset parameters of the steel turning target lets it weigh several indicators at once and makes the optimization objective of the turning operation explicit. By evaluating action sequences, the reward function quantitatively measures the merit of different operations, providing a clear target and direction for the subsequent optimization; this form of evaluation is objective and grounded in real data.
In this embodiment, the comprehensive reward value of each action sequence is calculated with the reward function, and the raw data set is built from the sample data and the reward values. Because it pairs each operation sample with its comprehensive reward value, the raw data set provides representative and reliable samples for the subsequent machine learning algorithm.
103. Perform iterative computation on the raw data set with a preset reinforcement learning model to obtain the expected return of each operation sample, and construct a steel turning training data set from the raw data set and the expected returns.
Specifically, reinforcement learning is a machine learning method for solving decision problems in environments with explicit goals; an agent observes the state of the environment and selects actions so as to maximize a cumulative reward. The expected return is the expected cumulative reward used in reinforcement learning to evaluate an action sequence or policy. It is obtained by computing the reward of the possible actions in each state together with their future cumulative effect, and can be defined as the expected sum of rewards along the state sequence under a specific policy, measuring the long-run performance of that policy. In reinforcement learning, the expected return serves as the evaluation criterion that guides the agent to optimize its policy and behavior toward maximum cumulative reward.
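The discounted return this paragraph describes can be written compactly; here $G_t$ is the expected return of step $t$, $r_t$ its instant reward, and $\gamma$ the discount factor (conventional reinforcement-learning notation, not symbols from the patent text):

```latex
% Each step's return is its instant reward plus the discounted return
% of the successor step; the recursion telescopes into a discounted
% sum of all future rewards up to the final step T.
G_t = r_t + \gamma\, G_{t+1}
    = \sum_{k=0}^{T-t} \gamma^{k}\, r_{t+k},
\qquad 0 < \gamma \le 1
```

In the method above, $G_T$ at the final step is set to the trajectory's comprehensive reward value, and the recursion is evaluated backwards from there.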
In this embodiment, iterating over the raw data set with the preset reinforcement learning model lets the effect of different operation samples be judged by their expected returns, so better operation strategies can be found. Because different samples carry different expected returns, the values computed by the model can also support individualized decisions on the operation samples. Computing the expected return quantitatively evaluates the quality and effect of the sample data, which helps identify and discount poor operating behavior and prevents similar errors from recurring in production. From the raw data set and the expected returns, a training data set is built containing the operation samples and their expected return values; it serves as the training sample set for parameter tuning and optimization of the model, further improving the accuracy and effect of steel turning operations and achieving better turning results.
104. Perform correlation analysis on the training data set with the Pearson correlation method to obtain a correlation analysis result, and screen that result with expected return as the filter condition to obtain expert example data.
Specifically, Pearson correlation analysis is a statistical method for evaluating the linear correlation between two continuous variables; the Pearson correlation coefficient expresses both the strength and the direction of the correlation. The coefficient is a value between -1 and 1 measuring the degree of linear correlation between two variables: the closer it is to 1 or -1, the stronger the correlation; the closer it is to 0, the weaker the linear relationship between the two variables.
In this embodiment, the Pearson method computes the correlations among the features of the training data set, revealing the latent patterns and associations between different features. With expected return as the filter condition, features highly correlated with expected return are selected on the basis of the analysis results, reducing the dimension of the feature space and improving the efficiency and accuracy of modeling. Through this analysis and screening, the features and samples most strongly related to expected return are extracted from the training data set to form the expert example data. These data, consisting of the screened high-correlation samples, can serve as a benchmark of expert experience and a reliable reference for model training and validation; using them for training improves the model's ability to predict expected returns and the effect of turning operations.
105. Perform offline supervised training on the expert example data with a behavior cloning algorithm to obtain the steel turning control strategy.
Specifically, behavior cloning is a machine learning method that derives the behavior policy of a model or agent from expert example data. Its basic idea is to replicate a strategy by observing and learning the expert's behavior and then apply it to similar tasks or environments; it is widely used in supervised learning problems, especially in simulation, robot control, and autonomous driving. Behavior cloning is simple to use, fast to train, and able to reproduce expert behavior accurately.
In this embodiment, the behavior cloning algorithm uses the expert example data as training data to build the model, so a high-quality control strategy can be established from accumulated experience, improving the performance and quality of turning operations: the model learns the experience implicit in the expert examples and can reproduce, or even surpass, the expert's results in practice. Because it learns and replicates the operating strategy in the expert data, behavior cloning can be applied directly to actual turning control, avoiding trial-and-error from scratch and the time and resource costs it entails; an effective control strategy can be built quickly from existing expert knowledge and experience, lowering the risk and cost of system improvement. By imitating the operations in the expert data, the model can stay close to expert-level performance, which helps solve complex turning control problems. Since the generated control strategy is built on existing expert examples, the decision process is more interpretable: the model can explain the basis and reasons for its decisions, which matters greatly for decision support and fault diagnosis in turning operations. Finally, the generated strategy can serve as an initial solution for further optimization and iteration; comparing it with actual operating data allows it to be adjusted and refined so that it better fits the real environment and requirements, continuously improving the performance and effect of the steel turning control strategy.
According to the steel turning control method and device, the medium and the computer equipment based on behavior cloning, steel turning operation sample data is first collected, the sample data comprising billet state data and steel turning operation action sequences with corresponding relations. A reward function is then constructed based on preset parameters of the steel turning target, the comprehensive reward values of the steel turning operation action sequences are calculated with the reward function, and a steel turning original data set is constructed from the steel turning operation sample data and the comprehensive reward values. Iterative computation is performed on the steel turning original data set with a preset reinforcement learning model to obtain the expected returns corresponding to the steel turning operation sample data, and a steel turning training data set is constructed from the steel turning original data set and the expected returns. Correlation analysis is then performed on the steel turning training data set with the Pearson correlation analysis method to obtain correlation analysis results, which are screened with the expected return as the screening condition to obtain expert example data. Finally, the behavior cloning algorithm is used to perform offline supervised training based on the expert example data, yielding the steel turning control strategy.
The method utilizes the reward function to evaluate the merits of the steel turning operation action sequences, so that the comprehensive effects of different steel turning operations can be evaluated quantitatively, and the steel turning original data set constructed from them provides the data basis for subsequent model training. Iterative computation on the steel turning original data set with the reinforcement learning model allows the model to be optimized gradually, improving the accuracy and reliability of the expected return calculation. Correlation analysis with the Pearson correlation analysis method screens out data samples with high correlation and higher expected return from the correlation analysis results, improving the effect and generalization capability of model training and reducing noise and irrelevant information in the training process. Finally, offline supervised training with the behavior cloning algorithm yields the steel turning control strategy, transferring experience to the machine learning model and improving its effect and reliability. By preprocessing and screening the steel turning operation sample data collected in the steel turning process and training on a large amount of offline steel turning experience data with the behavior cloning algorithm, an intelligent steel turning control strategy contained in the experience data can be obtained without interaction between an agent and the real environment, and the requirement for quick steel turning on the production site is met through intelligent adjustment of the steel turning strategy.
The embodiment of the application provides another behavior clone-based steel transformation control method, as shown in fig. 4, which comprises the following steps:
201. and collecting billet state data and a steel turning operation action sequence with corresponding relations, constructing a reward function, calculating a comprehensive reward value of the steel turning operation action sequence, and generating a steel turning original data set.
Specifically, the steel turning target preset parameters are first obtained, including the steel turning target angle, the steel turning angle reward upper limit value, the steel turning angle reward lower limit value and the steel turning time reward parameter. A steel turning angle reward function is then constructed based on the steel turning target angle, the steel turning angle reward upper limit value and the steel turning angle reward lower limit value, and the steel turning angle is input into it to calculate the steel turning angle reward value. A steel turning time reward function is constructed based on the steel turning time reward parameter, and the steel turning time is input into it to calculate the steel turning time reward value. Finally, a first preset weight corresponding to the steel turning angle reward value and a second preset weight corresponding to the steel turning time reward value are determined, the first product between the first preset weight and the steel turning angle reward value and the second product between the second preset weight and the steel turning time reward value are calculated, and the first and second products are summed to obtain the comprehensive reward value of the steel turning operation action sequence.
In the embodiment of the application, the steel turning system collects manual steel turning operation sample data during production at very short time intervals (20 ms) through CCD cameras installed near the steel turning roller ways before and after the rolling mill. The manual steel turning operation sample data comprises the state information of each billet and the steel turning operation action sequence, wherein the billet state information comprises the billet size, the billet angle and the roller way feedback speed, and the steel turning operation action sequence comprises the roller way set speed, the steel turning time and the steel turning angle, finally forming groups of state-action sequences. When the turning of each billet ends, the steel turning time and steel turning angle in the steel turning operation action sequence are weighted to give a comprehensive reward value: the shorter the steel turning time and the closer the steel turning angle is to 90 degrees, the higher the comprehensive reward value.
The reward function in this application is specifically as follows. Let A be the steel turning angle at the end of turning and T the steel turning target angle, T = 90°. The steel turning angle reward function is then: angle reward = (1 − |A − T| / T) × (REWARD_max − REWARD_min) + REWARD_min, where REWARD_max is the maximum reward value, taken as 100, and REWARD_min is the minimum reward value, taken as 0. With the steel turning time reward parameter set to 300000, the steel turning time reward function = 300000 / t, where t is the steel turning time in ms. Finally the steel turning comprehensive reward value is calculated with the first and second preset weights both set to 0.5: comprehensive reward value = 0.5 × angle reward + 0.5 × time reward. The comprehensive reward value of each group of state-action sequences in the manual steel turning operation sample data is then calculated to generate the original steel turning data set. Specifically, taking as an example a turning that ends with a steel turning angle of 90° after a steel turning time of 3000 ms, the reward values are calculated as follows: angle reward value = (1 − |90 − 90| / 90) × (100 − 0) + 0 = 100; time reward value = 300000 / 3000 = 100; comprehensive reward value = 0.5 × 100 + 0.5 × 100 = 100.
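The composite reward above can be sketched as follows. This is a minimal illustration of the formulas in the text; the function and parameter names (`angle_reward`, `time_reward`, `reward_max`, etc.) are chosen for readability and are not from the patent itself.

```python
# Sketch of the composite steel-turning reward: angle reward decays linearly
# with the distance from the 90-degree target, time reward is 300000 / t,
# and the two are combined with equal 0.5 weights, as in the worked example.

def angle_reward(angle_deg, target=90.0, reward_max=100.0, reward_min=0.0):
    """Maximal at the target angle, decaying linearly with |A - T|."""
    return (1.0 - abs(angle_deg - target) / target) * (reward_max - reward_min) + reward_min

def time_reward(turn_time_ms, time_param=300000.0):
    """Shorter turning time gives a larger reward (300000 / t)."""
    return time_param / turn_time_ms

def composite_reward(angle_deg, turn_time_ms, w_angle=0.5, w_time=0.5):
    return w_angle * angle_reward(angle_deg) + w_time * time_reward(turn_time_ms)

print(composite_reward(90.0, 3000.0))  # reproduces the worked example: 100.0
```

A run with a less accurate turn, e.g. `composite_reward(45.0, 6000.0)`, yields a correspondingly lower value, which is what lets the reward rank state-action sequences.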
202. And iterating the steel conversion original data set by using the reinforcement learning model, calculating to obtain expected returns corresponding to each input state in the steel conversion operation sample data, and constructing a steel conversion training data set.
Specifically, a preset reinforcement learning model is first determined, the billet state data is set as the input state of the reinforcement learning model, and the roller way set speed is set as the output action of the reinforcement learning model. The comprehensive reward value of the steel turning operation action sequence is then obtained and set as the expected return of the final input state in the reinforcement learning model, and the expected returns corresponding to all input states other than the final input state are calculated iteratively in a back propagation manner. Finally, a steel turning training data set is constructed from the steel turning original data set and the expected returns corresponding to all input states, the steel turning training data set comprising a plurality of pieces of training data, each piece comprising the billet state data in the current input state, the roller way set speed, the instant reward value and the expected return in the current input state, and the billet state data in the next input state.
In the embodiment of the application, a preset reinforcement learning model is determined, then the billet state data in the steel turning original data set is set as the input state of the reinforcement learning model, specifically taking the billet size, billet angle and roller way feedback speed as the state, and the roller way set speed is set as the output action of the reinforcement learning model, that is, the roller way set speed is taken as the action. A delayed reward is given in the last step of reinforcement learning in the steel turning process, that is, the comprehensive reward value calculated in step 201 is set as the expected return of the final input state in the reinforcement learning model, and the expected return of each input state before the final input state is calculated as the value of that state using a single-step back propagation method, forming a steel turning training data set D = {(s_i, a_i, r_i, s_{i+1}, Q_i)}, where s_i is the billet state data in the current input state; a_i is the roller way set speed in the current input state; r_i is the instant reward value of the current input state; s_{i+1} is the billet state data in the next input state after the current input state; and Q_i is the expected return of the current input state.
Calculating the expected returns corresponding to all input states other than the final input state in the reinforcement learning model specifically comprises determining the instant rewards of all input states other than the final input state, and constructing an expected return calculation function based on a preset discount factor, in which the expected return of the current input state equals the product of the discount factor and the expected return of the next input state, plus the instant reward of the current input state. The expected return of the final input state is input into the expected return calculation function, and the expected returns of all other input states are calculated one by one.
In the embodiment of the present application, the expected return calculation function is: Q_t = r_t + γ · Q_{t+1}, where Q_t is the sum of rewards from the state s_t at time t up to the termination state, called the expected return; r_t is the instant reward; and γ is a preset discount factor. Specifically, the present application sets the preset discount factor γ = 0.95. Because reinforcement learning here is a delayed-reward process, the instant rewards in all other input states are 0, i.e. r_t = 0 for every non-final state. With the expected return of the final input state set to 100, sequential calculation gives 0 + 0.95 × 100 = 95 for the preceding state, and so on. By this reverse iteration, the reward value of the final input state is transferred back to each previous input state in turn to obtain its expected return, and each expected return obtained by the calculation is taken as the value Q_t of the corresponding current input state.
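The single-step backward iteration above can be sketched as a short loop. This is an illustrative reconstruction of the recursion Q_t = r_t + γ·Q_{t+1} using the γ = 0.95 and terminal reward 100 from the worked example; the function name and episode layout are assumptions, not from the patent.

```python
# Propagate the delayed composite reward backwards through one episode:
# Q_t = r_t + gamma * Q_{t+1}, iterating from the final step to the first.

def backward_returns(immediate_rewards, gamma=0.95):
    """immediate_rewards: r_0..r_T for one episode; returns Q_0..Q_T."""
    returns = [0.0] * len(immediate_rewards)
    running = 0.0
    for t in range(len(immediate_rewards) - 1, -1, -1):
        running = immediate_rewards[t] + gamma * running
        returns[t] = running
    return returns

# Delayed-reward episode: only the final step is rewarded (with 100).
rewards = [0.0, 0.0, 0.0, 100.0]
print(backward_returns(rewards))  # approximately [85.7375, 90.25, 95.0, 100.0]
```

The final state keeps its composite reward of 100, the one before it gets 95, and so on, matching the sequence described in the text.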
203. And carrying out correlation analysis on the steel transformation training data set by using a Pearson correlation analysis method, and screening expert example data from correlation analysis results according to expected returns.
Specifically, the steel turning training data set is first traversed to obtain the billet state data of each piece of training data. The correlation coefficient between any two pieces of billet state data in the steel turning training data set is then calculated using the Pearson correlation analysis method, and a preset number of correlation coefficients are selected in descending order of absolute value. The two pieces of billet state data corresponding to each selected correlation coefficient are determined, the training data corresponding to that billet state data is obtained, and an effective data set is constructed. Finally, the expected return corresponding to each piece of training data in the effective data set is obtained, all the expected returns are sorted in descending order, and the training data corresponding to a preset number of expected returns is selected as the expert example data.
In the embodiment of the present application, the steel turning training data set is traversed, a threshold is set, and similar state data sets are extracted based on the Pearson correlation analysis method: with the threshold set to N, the first N groups of states with the largest correlation according to the Pearson correlation coefficients are selected and marked as the effective data set. The effective data set is then sorted by the Q values, and only the states and actions with the largest Q values are retained. The resulting "state-action" pairs form the expert example data for offline imitation learning.
Specifically, the plurality of observations of each piece of billet state data in the steel turning training data set is obtained and the average value of those observations is calculated; then any two pieces of billet state data in the steel turning training data set are selected, and their observations and average values are substituted into the Pearson correlation coefficient calculation formula to obtain the correlation coefficient between the two pieces of billet state data.
In the embodiment of the present application, the Pearson correlation coefficient calculation formula is as follows:

r_AB = Σ_i (A_i − Ā)(B_i − B̄) / sqrt( Σ_i (A_i − Ā)² · Σ_i (B_i − B̄)² )

wherein r_AB represents the Pearson correlation coefficient, A_i and B_i are the observations of the two pieces of billet state data, and Ā and B̄ are the averages of the observations of the two pieces of billet state data.
Specifically, the specific flow of this embodiment of this step in the present application is:
firstly, the steel turning training data set is traversed, the threshold is set to N = 5, and the closest 5 groups of state data sets are extracted based on the Pearson correlation analysis method. A given State_A is a 5-dimensional vector, denoted A = [A1, A2, A3, A4, A5], where Ai represents the i-th observation of State_A; each State_B in the data set is likewise represented as a 5-dimensional vector B = [B1, B2, B3, B4, B5], where Bi represents the i-th observation of State_B. The Pearson correlation coefficient r_AB between State_A and State_B is calculated using the Pearson correlation coefficient calculation formula. The previous two steps are repeated to calculate the correlation coefficient between State_A and every State_B in the steel turning training data set, giving a group of correlation coefficients r_AB. The absolute values of all the correlation coefficients are compared, and the State_B entries corresponding to the first 5 groups of data with the largest absolute correlation coefficients are selected. The corresponding states and actions in these 5 groups of data are sorted by their Q values, retaining the states and actions with the largest Q values. Finally, all retained "state-action" pairs are extracted in order as expert example data, where each "state-action" pair serves as a training sample, with the state vector as the feature vector and the action as the label, supporting the subsequent supervised behavior clone imitation learning.
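The screening flow above can be sketched in a few lines. This is a hedged illustration, not the patent's implementation: the names (`pearson`, `screen_expert_data`, the `(state, action, q)` sample layout) and the small `n_corr`/`n_best` defaults are assumptions made for the example.

```python
# Pearson-based screening: rank samples by |correlation| of their state with a
# reference state, keep the most correlated ones, then keep those with the
# largest expected return Q as expert example data.
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - mx) ** 2 for xi in x) * sum((yi - my) ** 2 for yi in y))
    return num / den

def screen_expert_data(state_a, samples, n_corr=5, n_best=2):
    """samples: list of (state, action, q) tuples. Returns the n_best
    highest-Q samples among the n_corr most state-correlated samples."""
    ranked = sorted(samples, key=lambda s: abs(pearson(state_a, s[0])), reverse=True)
    valid = ranked[:n_corr]                       # the "effective data set"
    return sorted(valid, key=lambda s: s[2], reverse=True)[:n_best]
```

Note that ranking by absolute correlation first and only then by Q mirrors the two-stage selection in the text: similarity filtering, then expected-return filtering.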
204. And performing offline supervised training based on expert example data by using a behavior cloning algorithm to obtain a steel conversion control strategy.
Specifically, firstly, a training sample is extracted from expert example data, the input state of the training sample and an expert action label are obtained, then, a behavior cloning network is constructed based on a behavior cloning algorithm, the input state of the training sample is input into the behavior cloning network, the output action of the behavior cloning network is obtained, then, error loss between the expert action label and the output action of the behavior cloning network is calculated by using a preset loss function, network parameters of the behavior cloning network are updated by using a gradient descent algorithm according to the error loss, and finally, a steel conversion control strategy corresponding to the expert example data is obtained based on the updated behavior cloning network.
In the embodiment of the application, training is performed on the screened expert example data by using a behavior cloning method, states are used as input states, actions are used as output actions, and learning targets are as follows:
θ* = arg min_θ E_{(s,a)∈B} [ L(π_θ(s), a) ]

where L is the loss function under the corresponding supervised learning framework, B is the behavior clone data set, π_θ is the policy network, θ is a parameter of the behavior clone network, θ* is the final optimized parameter of the behavior clone network, s is the input state, a is the output action, and E_{(s,a)∈B} is the expectation over all samples of the behavior clone data set.
The loss function uses the root mean square error (RMSE), which evaluates the degree of variation of the data: the smaller the value, the higher the accuracy of the prediction model. The formula is as follows:

RMSE = sqrt( (1/m) Σ_{i=1}^{m} (ŷ_i − y_i)² )

where m is the number of data set samples, and ŷ_i and y_i are respectively the predicted value and the true value of the i-th sample. Training finally yields the behavior clone network parameters, and the behavior clone network output is the optimal strategy corresponding to the different steel turning states.
Substituting the root mean square error (RMSE), i.e. the loss function, into the learning target formula according to the two formulas above gives the final objective of the behavior clone network:

θ* = arg min_θ E_{(s,a)∈B} sqrt( (1/m) Σ_{i=1}^{m} (π_θ(s_i) − a_i)² )
specifically, using a behavior cloning algorithm, offline supervised training is performed based on the screened expert example data, as shown in fig. 2, and the input of the behavior cloning network is a billet 5-dimensional parameter vector: length, width, thickness, real-time angle and roller feedback speed of billet; the output is a 1-dimensional vector: the set speed of the odd rollers (the set speed of the even rollers is equal to the set speed of the even rollers in size and opposite in direction) in the process of rotating the steel by the roller way. The behavioral clone network comprises a three-layer fully-connected network, the activation function uses a Relu activation function, the first fully-connected layer: length 256; second full tie layer: the length is 256. Training by using the screened expert sample data as a sample, performing root mean square error loss on action labels in the training sample and output values of an initial behavior clone network, training the initial behavior clone network according to an error back-propagation and gradient descent algorithm, and fitting behavior clone network parameters through root mean square error
Establishing a mapping f from state to action: s is(s)and a, obtaining a final behavior clone network, wherein the trained network can perform actions similar to the actions of an expert according to the state. As shown in FIG. 2, the steel switching strategy output after the behavior clone network training is the optimal experience strategy with the largest rewarding value in operation experience, namely the shortest steel switching time and the highest steel switching efficiency, and is used for actual steel switching control. Fig. 3 shows control strategies output by the behavior cloning network for steel billets, and from the graph, it can be seen that the steel turning system can cope with different steel billet states, give an optimal strategy in time, and quickly finish steel turning operation.
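The supervised "state → action" fit can be sketched as follows. The patent's network is a three-layer fully-connected network with ReLU; as a simplified, framework-free stand-in, the sketch below trains a linear policy by gradient descent on the squared error. All data is synthetic, and the linear expert rule is an assumption made purely for illustration.

```python
# Behavior-cloning regression sketch: fit a policy to expert (state, action)
# pairs by stochastic gradient descent on the squared prediction error.
import random

def train_bc(states, actions, lr=0.05, epochs=1000):
    """states: list of 5-dim feature vectors; actions: expert speed labels.
    Returns the fitted linear policy parameters (w, b)."""
    dim = len(states[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for s, a in zip(states, actions):
            err = sum(wi * si for wi, si in zip(w, s)) + b - a
            for i in range(dim):
                w[i] -= lr * err * s[i]   # gradient step on 0.5 * err**2
            b -= lr * err
    return w, b

def predict(w, b, s):
    return sum(wi * si for wi, si in zip(w, s)) + b

# Synthetic demo: 5-dim billet states in [0, 1]; the expert action is an
# assumed linear rule, so the clone can reproduce the labels after training.
random.seed(0)
states = [[random.random() for _ in range(5)] for _ in range(50)]
actions = [2 * s[0] - s[1] + 0.5 * s[4] + 1.0 for s in states]
w, b = train_bc(states, actions)
```

In the patent's setting the linear map would be replaced by the three-layer ReLU network and the per-sample squared error by the batch RMSE, but the training loop (forward pass, loss, gradient step) has the same shape.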
According to the steel turning control method and device based on behavior cloning, as shown in fig. 4, billet state data and steel turning operation action sequences with corresponding relations are first collected, a reward function is constructed, the comprehensive reward values of the steel turning operation action sequences are calculated, and a steel turning original data set is generated. The steel turning original data set is then iterated with the reinforcement learning model, the expected return corresponding to each input state in the steel turning operation sample data is calculated, and a steel turning training data set is constructed. Correlation analysis is then performed on the steel turning training data set with the Pearson correlation analysis method, and expert example data is screened out of the correlation analysis results according to the expected returns. Finally, a behavior cloning algorithm is used for offline supervised training based on the expert example data to obtain the steel turning control strategy. By collecting data with corresponding relations, constructing a reward function, generating a data set, training with the reinforcement learning model and performing Pearson correlation analysis, the method can improve data quality, define target standards, and optimize model performance and generalization capability, which helps generate a more accurate and reliable steel turning control strategy.
Further, as a specific implementation of the method of fig. 1, an embodiment of the present application provides a steel transformation control device based on behavioral cloning, as shown in fig. 5, where the device includes: a data acquisition module 301, a data set construction module 302, a data preprocessing module 303, a data screening module 304 and a policy output module 305.
The data acquisition module 301 is configured to acquire steel conversion operation sample data, where the steel conversion operation sample data includes billet state data and a steel conversion operation action sequence with a corresponding relationship;
the data set construction module 302 is configured to construct a reward function based on a preset parameter of the steel turning target, calculate a comprehensive reward value of the steel turning operation action sequence by using the reward function, and construct a steel turning original data set according to the steel turning operation sample data and the comprehensive reward value;
the data preprocessing module 303 is configured to perform iterative computation on the steel conversion original data set by using a preset reinforcement learning model, obtain expected returns corresponding to the steel conversion operation sample data, and construct a steel conversion training data set according to the steel conversion original data set and the expected returns;
the data screening module 304 is configured to perform correlation analysis on the steel transformation training data set by using a pearson correlation analysis method to obtain a correlation analysis result, and screen the correlation analysis result with the expected return as a screening condition to obtain expert example data;
And the strategy output module 305 is used for performing offline supervised training based on expert example data by utilizing a behavior cloning algorithm to obtain a steel conversion control strategy.
In a specific application scenario, the data set construction module 302 may be specifically configured to obtain preset parameters of a steel turning target, where the preset parameters of the steel turning target include a steel turning target angle, a steel turning angle rewarding upper limit value, a steel turning angle rewarding lower limit value, and a steel turning time rewarding parameter; constructing a steel turning angle rewarding function based on the steel turning target angle, the steel turning angle rewarding upper limit value and the steel turning angle rewarding lower limit value, inputting the steel turning angle into the steel turning angle rewarding function, and calculating to obtain a steel turning angle rewarding value; constructing a steel turning time rewarding function based on the steel turning time rewarding parameter, inputting the steel turning time into the steel turning time rewarding function, and calculating to obtain a steel turning time rewarding value; determining a first preset weight corresponding to the steel turning angle rewarding value and a second preset weight corresponding to the steel turning time rewarding value, and calculating a first product between the first preset weight and the steel turning angle rewarding value and a second product between the second preset weight and the steel turning time rewarding value; and summing the first product and the second product to obtain the comprehensive rewarding value of the steel turning operation action sequence.
In a specific application scenario, the data preprocessing module 303 may be specifically configured to determine a preset reinforcement learning model, set the billet state data as the input state of the reinforcement learning model, and set the roller way set speed as its output action; obtain the comprehensive reward value of the steel turning operation action sequence, set the comprehensive reward value as the expected return of the final input state in the reinforcement learning model, and iteratively calculate the expected returns corresponding to all input states other than the final input state in a back propagation manner; and construct a steel turning training data set from the steel turning original data set and the expected returns corresponding to all input states, wherein the steel turning training data set comprises a plurality of pieces of training data, each comprising the billet state data in the current input state, the roller way set speed, the instant reward value and the expected return in the current input state, and the billet state data in the next input state.
In a specific application scenario, the data preprocessing module 303 may be further configured to determine instant prize values of all input states except for the final input state in the reinforcement learning model, and construct a desired return calculation function based on a preset discount factor, where in the desired return calculation function, a desired return of a current input state is equal to a sum of a product of the discount factor and a desired return of a next input state to the current input state and the instant prize value of the current input state; the expected returns of the final input states are input into an expected return calculation function, and the expected returns of all the other input states except the final input state are calculated one by one.
In a specific application scenario, the data screening module 304 may be specifically configured to traverse the steel transformation training data set to obtain billet state data of each piece of training data in the steel transformation training data set; calculating correlation coefficients between any two billet state data in the steel transformation training data set by using a Pearson correlation analysis method, and selecting a preset number of correlation coefficients from large to small based on the absolute values of all the correlation coefficients; determining two billet state data corresponding to each selected correlation coefficient, acquiring training data corresponding to the billet state data, and constructing an effective data set; and acquiring expected returns corresponding to each training data in the effective data set, sequencing all the expected returns from large to small, and selecting training data corresponding to a preset number of expected returns as expert example data.
In a specific application scenario, the data screening module 304 may be further configured to obtain a plurality of observations of each billet state data in the steel transformation training data set, calculate an average value of the plurality of observations, and select a specified observation from the plurality of observations; selecting any two pieces of steel billet state data in the steel conversion training data set, substituting an average value of a plurality of observation values of each piece of steel billet state data and a specified observation value into a pearson correlation coefficient calculation formula to calculate, and obtaining a correlation coefficient between the two pieces of steel billet state data.
In a specific application scenario, the policy output module 305 may be specifically configured to extract a training sample from expert example data, and obtain an input state of the training sample and an expert action tag; based on a behavior cloning algorithm, constructing a behavior cloning network, inputting an input state of a training sample into the behavior cloning network, and obtaining an output action of the behavior cloning network; calculating error loss between the expert action label and the output action of the behavior clone network by using a preset loss function, and updating network parameters of the behavior clone network by using a gradient descent algorithm according to the error loss; and based on the updated behavior clone network, acquiring a steel conversion control strategy corresponding to expert example data.
It should be noted that, other corresponding descriptions of each functional unit related to the behavioural clone-based steel transfer control device provided in this embodiment may refer to corresponding descriptions in fig. 1 and fig. 4, and are not repeated here.
Based on the above method as shown in fig. 1, correspondingly, the present embodiment further provides a storage medium, on which a computer program is stored, which when executed by a processor, implements the above behavioural clone based steel conversion control method.
Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, a USB disk or a mobile hard disk) and includes several instructions that cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the behavior clone-based steel turning control method in each implementation scenario of the present application.
Based on the method shown in fig. 1 and fig. 4 and the embodiment of the steel conversion control device based on behavior cloning shown in fig. 5, in order to achieve the above object, as shown in fig. 6, the embodiment further provides an entity device of steel conversion control based on behavior cloning, where the device includes a communication bus, a processor, a memory, a communication interface, and may further include an input/output interface and a display device, where each functional unit may complete communication with each other through the bus. The memory stores a computer program and a processor for executing the program stored in the memory, and executing the steel transfer control method based on behavior cloning in the above embodiment.
Optionally, the physical device may further include a user interface, a network interface, a camera, radio-frequency (RF) circuitry, sensors, audio circuitry, a Wi-Fi module, and the like. The user interface may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and may optionally also include a USB interface, a card-reader interface, and the like. The network interface may optionally include a standard wired interface, a wireless interface (e.g., a Wi-Fi interface), and the like.
It will be appreciated by those skilled in the art that the structure of the physical device for behavior cloning-based steel turning control provided in this embodiment does not limit the device, which may include more or fewer components, combine certain components, or arrange the components differently.
The storage medium may also include an operating system and a network communication module. The operating system is a program that manages the hardware and software resources of the physical device and supports the operation of the information processing program and other software and/or programs. The network communication module is used to implement communication among the components within the storage medium, as well as communication with other hardware and software in the information processing device.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present application may be implemented by software plus a necessary general-purpose hardware platform, or by hardware. By applying this technical scheme, steel turning operation sample data are first collected, the sample data comprising billet state data and steel turning operation action sequences in corresponding relation. A reward function is then constructed based on preset parameters of the steel turning target, the comprehensive reward value of each steel turning operation action sequence is calculated with the reward function, and a steel turning original data set is constructed from the sample data and the comprehensive reward values. A preset reinforcement learning model performs iterative computation on the original data set to obtain the expected return corresponding to each operation sample, and a steel turning training data set is constructed from the original data set and the expected returns. Next, a Pearson correlation analysis is performed on the training data set to obtain a correlation analysis result, and the result is screened with the expected return as the screening condition to obtain expert example data. Finally, offline supervised training is performed on the expert example data with a behavior cloning algorithm to obtain a steel turning control strategy.
The reward function evaluates the quality of each steel turning operation action sequence, allowing the comprehensive effect of different steel turning operations to be assessed quantitatively; the resulting steel turning original data set then provides the data basis for subsequent model training. Iterative computation on the original data set with the reinforcement learning model gradually optimizes the model and improves the accuracy and reliability of the expected-return computation. The Pearson correlation analysis screens out data samples with high correlation and high expected return, improving the effect and generalization capability of model training while reducing noise and irrelevant information. Finally, offline supervised training with a behavior cloning algorithm yields a steel turning control strategy, transferring expert experience and knowledge to the machine learning model and improving its effectiveness and reliability. By preprocessing and screening the steel turning operation sample data gathered during the steel turning process and training on a large amount of offline steel turning experience data with a behavior cloning algorithm, an intelligent steel turning control strategy can be obtained from the experience data without the agent interacting with the real environment, and intelligent adjustment of the steel turning strategy meets the production site's need for fast steel turning.
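To make the flow above concrete, the toy sketch below scores a few hypothetical operation sequences with a reward function and keeps the best-scoring half as expert examples, mirroring the reward-then-screen steps; the reward form, weights, and numbers are invented purely for illustration and are not taken from the patent.

```python
import numpy as np

# Illustrative samples only: each is (final_angle_error_deg, turn_time_s).
samples = [(1.0, 6.0), (4.0, 5.0), (0.5, 9.0), (2.0, 7.0)]

def composite_reward(angle_err, turn_time, w1=0.7, w2=0.3):
    # reward decays with angle error and with elapsed turning time
    return w1 * np.exp(-angle_err) + w2 * np.exp(-0.1 * turn_time)

rewards = [composite_reward(a, t) for a, t in samples]
# keep the best-scoring half as "expert" examples (the screening step)
order = np.argsort(rewards)[::-1]
expert = [samples[i] for i in order[: len(samples) // 2]]
```

Here the accurate, reasonably fast turns survive the screen, while the large-error sequence is dropped.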
Those skilled in the art will appreciate that the drawings are merely schematic illustrations of a preferred implementation scenario, and that the modules or flows in the drawings are not necessarily required to practice the present application. Those skilled in the art will also appreciate that the modules of an apparatus in an implementation scenario may be distributed among the apparatuses of that scenario as described, or may, with corresponding changes, be located in one or more apparatuses different from that scenario. The modules of an implementation scenario may be combined into one module or further split into a plurality of sub-modules.
The foregoing serial numbers of the present application are for description only and do not represent the relative merits of the implementation scenarios. The foregoing disclosure is merely a few specific implementations of the present application; the present application is not limited thereto, and any variation conceivable by a person skilled in the art shall fall within its protection scope.

Claims (10)

1. A behavior cloning-based steel turning control method, the method comprising:
collecting steel turning operation sample data, wherein the steel turning operation sample data comprises billet state data and a steel turning operation action sequence with corresponding relations;
constructing a reward function based on preset parameters of a steel turning target, calculating a comprehensive reward value of the steel turning operation action sequence by using the reward function, and constructing a steel turning original data set according to the steel turning operation sample data and the comprehensive reward value;
performing iterative computation on the steel turning original data set by using a preset reinforcement learning model to obtain expected returns corresponding to the steel turning operation sample data, and constructing a steel turning training data set according to the steel turning original data set and the expected returns;
performing correlation analysis on the steel turning training data set by using a Pearson correlation analysis method to obtain a correlation analysis result, and screening the correlation analysis result by taking the expected return as a screening condition to obtain expert example data;
and performing offline supervised training based on the expert example data by using a behavior cloning algorithm to obtain a steel turning control strategy.
2. The method of claim 1, wherein the steel turning operation action sequence includes a steel turning time and a steel turning angle, and the reward function includes a steel turning angle reward function and a steel turning time reward function; and constructing a reward function based on preset parameters of the steel turning target and calculating a comprehensive reward value of the steel turning operation action sequence by using the reward function comprises:
obtaining preset parameters of the steel turning target, wherein the preset parameters include a steel turning target angle, a steel turning angle reward upper limit value, a steel turning angle reward lower limit value, and a steel turning time reward parameter;
constructing the steel turning angle reward function based on the steel turning target angle, the steel turning angle reward upper limit value, and the steel turning angle reward lower limit value, inputting the steel turning angle into the steel turning angle reward function, and calculating a steel turning angle reward value;
constructing the steel turning time reward function based on the steel turning time reward parameter, inputting the steel turning time into the steel turning time reward function, and calculating a steel turning time reward value;
determining a first preset weight corresponding to the steel turning angle reward value and a second preset weight corresponding to the steel turning time reward value, and calculating a first product of the first preset weight and the steel turning angle reward value and a second product of the second preset weight and the steel turning time reward value;
and summing the first product and the second product to obtain the comprehensive reward value of the steel turning operation action sequence.
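A minimal numeric sketch of the reward construction in this claim follows. The target angle, reward limits, time-reward parameter, and weights are assumed values, and the functional forms (a clipped linear angle reward and a linear time penalty) are chosen purely for illustration, since the claim does not fix them.

```python
def angle_reward(angle, target=90.0, r_max=1.0, r_min=-1.0, scale=0.1):
    """Reward peaks at r_max when the turn hits the target angle and is
    clipped at r_min as the deviation grows (all limits are assumed)."""
    return max(r_min, r_max - scale * abs(angle - target))

def time_reward(turn_time, c=0.05):
    """Shorter turns score higher; c is an assumed time-reward parameter."""
    return -c * turn_time

def composite_reward(angle, turn_time, w_angle=0.8, w_time=0.2):
    """Weighted sum of the two reward terms, as in the claim."""
    return w_angle * angle_reward(angle) + w_time * time_reward(turn_time)
```

A perfect instantaneous 90° turn scores the full angle weight, while a 30° overshoot is clipped at the lower limit.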
3. The method of claim 1, wherein the steel turning operation action sequence comprises a roller table set speed; and performing iterative computation on the steel turning original data set by using a preset reinforcement learning model to obtain expected returns corresponding to the steel turning operation sample data, and constructing a steel turning training data set according to the steel turning original data set and the expected returns, comprises:
determining a preset reinforcement learning model, setting the billet state data as the input state of the reinforcement learning model, and setting the roller table set speed as the output action of the reinforcement learning model;
acquiring the comprehensive reward value of the steel turning operation action sequence, setting the comprehensive reward value as the expected return of the final input state in the reinforcement learning model, and iteratively calculating, in a back-propagating manner, the expected returns corresponding to all input states other than the final input state;
and constructing a steel turning training data set according to the steel turning original data set and the expected returns corresponding to all input states, wherein the steel turning training data set comprises a plurality of pieces of training data, and each piece of training data comprises the billet state data of the current input state, the roller table set speed, the instant reward value, the expected return, and the billet state data of the next input state following the current input state.
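The training-data record enumerated in this claim can be pictured as a simple five-field structure; the field names below are illustrative, not taken from the patent.

```python
from dataclasses import dataclass

@dataclass
class Transition:
    state: list            # billet state data of the current input state
    roller_speed: float    # roller table set speed (the output action)
    reward: float          # instant reward value
    expected_return: float # discounted return computed for this state
    next_state: list       # billet state data of the next input state
```

One record per input state, so a full steel turning sequence flattens into a list of such transitions.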
4. The method of claim 3, wherein setting the comprehensive reward value as the expected return of the final input state in the reinforcement learning model and iteratively calculating, in a back-propagating manner, the expected returns of all input states other than the final input state comprises:
determining the instant rewards of all input states other than the final input state in the reinforcement learning model, and constructing an expected-return calculation function based on a preset discount factor, wherein in the expected-return calculation function, the expected return of the current input state equals the instant reward of the current input state plus the product of the discount factor and the expected return of the input state next after the current input state;
and inputting the expected return of the final input state into the expected-return calculation function, and calculating one by one the expected returns of all input states other than the final input state.
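The backward recursion in this claim — the expected return of the current state equals its instant reward plus the discount factor times the return of the next state, seeded with the comprehensive reward at the final state — can be sketched as follows (the discount factor value is an assumption):

```python
def expected_returns(instant_rewards, final_return, gamma=0.9):
    """Propagate returns backward: G_t = r_t + gamma * G_{t+1}, with the
    final state's return fixed to the sequence's comprehensive reward.
    instant_rewards holds the rewards of the non-final states, in order."""
    returns = [final_return]
    for r in reversed(instant_rewards):
        returns.append(r + gamma * returns[-1])
    returns.reverse()
    return returns
```

For two non-final rewards [1.0, 2.0], a final return of 10.0, and gamma = 0.5, the recursion yields [4.5, 7.0, 10.0].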
5. The method of claim 1, wherein performing correlation analysis on the steel turning training data set by using a Pearson correlation analysis method to obtain a correlation analysis result, and screening the correlation analysis result by taking the expected return as a screening condition to obtain expert example data, comprises:
traversing the steel turning training data set to obtain the billet state data of each piece of training data in the steel turning training data set;
calculating, by using a Pearson correlation analysis method, the correlation coefficient between any two pieces of billet state data in the steel turning training data set, and selecting a preset number of correlation coefficients in descending order of the absolute values of all the correlation coefficients;
determining the two pieces of billet state data corresponding to each selected correlation coefficient, acquiring the training data corresponding to those billet state data, and constructing an effective data set;
and acquiring the expected return corresponding to each piece of training data in the effective data set, sorting all the expected returns in descending order, and selecting the training data corresponding to a preset number of expected returns as the expert example data.
6. The method of claim 5, wherein calculating, by using a Pearson correlation analysis method, the correlation coefficient between any two pieces of billet state data in the steel turning training data set comprises:
acquiring a plurality of observation values of each piece of billet state data in the steel turning training data set, calculating the average value of the plurality of observation values, and selecting a designated observation value from the plurality of observation values;
and selecting any two pieces of billet state data in the steel turning training data set, and substituting the average value of the plurality of observation values of each piece of billet state data and the designated observation value into the Pearson correlation coefficient calculation formula to obtain the correlation coefficient between the two pieces of billet state data.
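This claim relies on the standard Pearson correlation coefficient. Since the exact combination of mean values and the designated observation is not fully specified in the claim, the sketch below gives the textbook sample formula computed over two paired series of observation values.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation coefficient between two equal-length
    series of observation values."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

Perfectly linearly related series give +1 or -1, which is why the screening step ranks by absolute value.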
7. The method of claim 1, wherein performing offline supervised training based on the expert example data by using a behavior cloning algorithm to obtain a steel turning control strategy comprises:
extracting training samples from the expert example data, and acquiring the input states and the expert action labels of the training samples;
constructing a behavior cloning network based on a behavior cloning algorithm, inputting the input state of a training sample into the behavior cloning network, and obtaining the output action of the behavior cloning network;
calculating the error loss between the expert action label and the output action of the behavior cloning network by using a preset loss function, and updating the network parameters of the behavior cloning network by using a gradient descent algorithm according to the error loss;
and acquiring, based on the updated behavior cloning network, the steel turning control strategy corresponding to the expert example data.
8. A behavior cloning-based steel turning control device, the device comprising:
a data acquisition module, used for collecting steel turning operation sample data, wherein the steel turning operation sample data comprises billet state data and steel turning operation action sequences in corresponding relation;
a data set construction module, used for constructing a reward function based on preset parameters of a steel turning target, calculating a comprehensive reward value of the steel turning operation action sequence by using the reward function, and constructing a steel turning original data set according to the steel turning operation sample data and the comprehensive reward value;
a data preprocessing module, used for performing iterative computation on the steel turning original data set by using a preset reinforcement learning model to obtain expected returns corresponding to the steel turning operation sample data, and constructing a steel turning training data set according to the steel turning original data set and the expected returns;
a data screening module, used for performing correlation analysis on the steel turning training data set by using a Pearson correlation analysis method to obtain a correlation analysis result, and screening the correlation analysis result by taking the expected return as a screening condition to obtain expert example data;
and a policy output module, used for performing offline supervised training based on the expert example data by using a behavior cloning algorithm to obtain a steel turning control strategy.
9. A storage medium having stored thereon a computer program, which when executed by a processor, implements the steps of the method of any of claims 1 to 7.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the computer program when executed by the processor implements the steps of the method according to any one of claims 1 to 7.
CN202410224883.8A 2024-02-29 2024-02-29 Steel transformation control method and device based on behavior cloning, medium and computer equipment Active CN117807403B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410224883.8A CN117807403B (en) 2024-02-29 2024-02-29 Steel transformation control method and device based on behavior cloning, medium and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410224883.8A CN117807403B (en) 2024-02-29 2024-02-29 Steel transformation control method and device based on behavior cloning, medium and computer equipment

Publications (2)

Publication Number Publication Date
CN117807403A true CN117807403A (en) 2024-04-02
CN117807403B CN117807403B (en) 2024-05-10

Family

ID=90428177

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410224883.8A Active CN117807403B (en) 2024-02-29 2024-02-29 Steel transformation control method and device based on behavior cloning, medium and computer equipment

Country Status (1)

Country Link
CN (1) CN117807403B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111882030A (en) * 2020-06-29 2020-11-03 武汉钢铁有限公司 Ingot adding strategy method based on deep reinforcement learning
CN113688977A (en) * 2021-08-30 2021-11-23 浙江大学 Confrontation task oriented man-machine symbiosis reinforcement learning method and device, computing equipment and storage medium
US20220243568A1 (en) * 2021-02-02 2022-08-04 Saudi Arabian Oil Company Method and system for autonomous flow rate control in hydraulic stimulation operations
CN115062761A (en) * 2022-06-08 2022-09-16 北京航空航天大学 Weapon force behavior decision model accelerated construction method based on off-line training combination
CN116274925A (en) * 2023-02-22 2023-06-23 中国重型机械研究院股份公司 Continuous casting automatic casting control method based on deep reinforcement learning
CN117010474A (en) * 2022-11-02 2023-11-07 腾讯科技(深圳)有限公司 Model deployment method, device, equipment and storage medium
CN117223011A (en) * 2021-05-28 2023-12-12 渊慧科技有限公司 Multi-objective reinforcement learning using weighted strategy projection
CN117539209A (en) * 2024-01-09 2024-02-09 东北大学 Steel conversion control method, device, computer equipment and computer readable storage medium


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
KSENIA KONYUSHKOVA ET AL: "Semi-supervised reward learning for offline reinforcement learning", arXiv:2012.06899v1, 12 December 2020 (2020-12-12), pages 1-12 *
ZHIJIE JIAO ET AL: "Digital Model of Plan View Pattern Control for Plate Mills Based on Machine Vision and the DBO-RBF Algorithm", MDPI, 12 January 2024 (2024-01-12), pages 1-22 *
FU Yupeng: "Fixed-wing aircraft attitude controller based on imitation reinforcement learning", Journal of Naval Aviation University, 31 December 2022 (2022-12-31), pages 393-399 *
JIAO Zhijie et al.: "Development progress and application of high-precision intelligent control systems for plate rolling", Steel Rolling (《轧钢》), 31 December 2022 (2022-12-31), pages 52-59 *

Also Published As

Publication number Publication date
CN117807403B (en) 2024-05-10

Similar Documents

Publication Publication Date Title
Gao et al. Application of deep q-network in portfolio management
Bernal et al. Financial market time series prediction with recurrent neural networks
Lu et al. Generalized radial basis function neural network based on an improved dynamic particle swarm optimization and AdaBoost algorithm
CN107292344B (en) Robot real-time control method based on environment interaction
CN106575382A (en) Inverse reinforcement learning by density ratio estimation
Alfred A genetic-based backpropagation neural network for forecasting in time-series data
CN115906954A (en) Multivariate time sequence prediction method and device based on graph neural network
Felix et al. Analysis of training parameters in the ANN learning process to mapping the concrete carbonation depth
CN112989711B (en) Aureomycin fermentation process soft measurement modeling method based on semi-supervised ensemble learning
CN111030889A (en) Network traffic prediction method based on GRU model
CN108460462A (en) A kind of Interval neural networks learning method based on interval parameter optimization
Alfred et al. A performance comparison of statistical and machine learning techniques in learning time series data
Kusiak et al. Effective strategies of metamodelling of industrial metallurgical processes
Aksöz et al. An interactive structural optimization of space frame structures using machine learning
CN113012766A (en) Self-adaptive soft measurement modeling method based on online selective integration
Chen et al. A hybrid model combining mechanism with semi-supervised learning and its application for temperature prediction in roller hearth kiln
Possas et al. Online bayessim for combined simulator parameter inference and policy improvement
CN114330815A (en) Ultra-short-term wind power prediction method and system based on improved GOA (generic object oriented architecture) optimized LSTM (least Square TM)
CN117807403B (en) Steel transformation control method and device based on behavior cloning, medium and computer equipment
CN116303786B (en) Block chain financial big data management system based on multidimensional data fusion algorithm
JP3056324B2 (en) Learning / recalling method and device
CN111291020A (en) Dynamic process soft measurement modeling method based on local weighted linear dynamic system
Tran-Quang et al. Aquaculture environment prediction based on improved lstm deep learning model
Guo et al. Modelling for multi-phase batch processes using steady state identification and deep recurrent neural network
Salmeron et al. Complexity in forecasting and predictive models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant