WO2023137631A1

WO2023137631A1 - Method and apparatus for data driven control optimization for production quality improvement

Info

Publication number: WO2023137631A1
Application number: PCT/CN2022/072774
Authority: WO
Inventors: Cheng FENG; Jinyan GUAN; Jie Huang
Original assignee: Siemens Aktiengesellschaft; Siemens Ltd., China
Priority date: 2022-01-19
Filing date: 2022-01-19
Publication date: 2023-07-27

Abstract

A method, apparatus and computer-readable medium for data driven control optimization for production quality improvement. The method (300) includes: collecting (S301) a historical dataset from a production unit, including control parameters, conditional parameters and control result variables; training (S302) multiple models capturing the probabilistic distribution of the control result variables given the control parameters and conditional parameters; computing (S303) the divergence between the predicted distributions of the trained multiple models; and acquiring (S304) a control policy which outputs the optimal control parameters that maximize the expected reward, wherein the expected reward is rescaled by the risk of over-confident recommendations.

Description

Method and apparatus for data driven control optimization for production quality improvement

Technical Field

The present invention relates to data-driven optimization techniques, and more particularly to a method, apparatus and computer-readable storage medium for data driven control optimization for production quality improvement.

Background Art

With the advance of machine learning techniques, data-driven methods are considered a promising approach to optimizing the control parameters for improving production quality in both process and discrete manufacturing. However, some data-driven methods can cause unexpected errors in practice.

Summary of the Invention

The present disclosure outlines a major technical problem that significantly obstructs the practical use of data-driven control optimization solutions in real-world production lines: data-driven solutions are often risk-insensitive, meaning that they can provide over-confident control recommendations that cause unexpected errors in practice.

To solve this technical problem, a risk-sensitive framework for learning data-driven control policies for production quality improvement is proposed, which can significantly enhance the reliability and usability of data-driven control optimization solutions in production.

Embodiments of the present disclosure include methods, apparatuses, systems and computer-readable storage media.

According to a first aspect of the present disclosure, a method for data driven control optimization for production quality improvement is presented, including following steps:

- collecting historical dataset from a production unit, wherein a dataset includes control parameters, conditional parameters and control result variables;

- training multiple models with same structure, same hyperparameters and randomized initialization of weights, wherein each model captures the probabilistic distribution of the control result variables given the control parameters and conditional parameters;

- computing divergence between the predicted distributions of the trained multiple models;

- acquiring a control policy which outputs the optimal control parameters at an arbitrary time point that maximize the expected reward at a finite time horizon given observed conditional parameters, wherein the expected reward is rescaled by risk of over-confident recommendations of the control parameters, the larger the divergence the larger the risk.

According to a second aspect of the present disclosure, an apparatus for data driven control optimization for production quality improvement is presented including modules to execute the method according to the first aspect of the present disclosure.

According to a third aspect of the present disclosure, an apparatus for data driven control optimization for production quality improvement is presented. The apparatus includes at least one processor and at least one memory coupled to the at least one processor, and is configured to execute the method according to the first aspect of the present disclosure.

According to a fourth aspect of the present disclosure, a computer-readable medium for data driven control optimization for production quality improvement is presented. The computer-readable medium stores computer-executable instructions, wherein the computer-executable instructions, when executed, cause at least one processor to execute the method according to the first aspect of the present disclosure.

The solution presented in the disclosure is risk-sensitive: it can significantly reduce the chance of recommending over-confident control parameters in practice. As a result, the reliability of the developed data-driven control policy for production quality improvement is enhanced.

On one hand, multiple models can be used to fit probabilistic distribution of the control result variables given the control parameters and conditional parameters by which the uncertainties within the production process and the correlations between control result variables can be well captured; on the other hand, a risk factor is added to penalize control parameters with high risk of giving over-confident recommendations to regions of the production process that are not sufficiently reflected in the historical datasets.

Brief Description of the Drawings

The above-mentioned attributes and other features and advantages of the present technique and the manner of attaining them will become more apparent and the present technique itself will be better understood by reference to the following description of embodiments of the present technique taken in conjunction with the accompanying drawings, wherein:

FIG. 1 depicts a flow diagram of a method in accordance with embodiment 1 of the present disclosure.

FIG. 2 depicts how to train multiple CGANs in embodiments of the present disclosure.

FIG. 3 depicts a flow diagram of a method in accordance with embodiment 2 of the present disclosure.

FIG. 4 depicts structure of an apparatus in accordance with embodiment 3 of the present disclosure.

Reference Numbers:

100, a method according to embodiment 1 of the present disclosure

S101～S102, steps of method 100

300, a method according to embodiment 2 of the present disclosure

S301～S304, steps of method 300

10, an apparatus according to embodiment of the present disclosure

101, at least one processor

102, at least one memory

103, an I/O port

20, a control optimization program

201, collection module

202, training module

203, computing module

204, acquisition module

30, historical dataset

40, multiple models

Detailed Description of Example Embodiments

Hereinafter, above-mentioned and other features of the present technique are described in detail. Various embodiments are described with reference to the drawing, where like reference numerals are used to refer to like elements throughout. In the following description, for purpose of explanation, numerous specific details are set forth in order to provide a thorough understanding of one or more embodiments. It may be noted that the illustrated embodiments are intended to explain, and not to limit the invention. It may be evident that such embodiments may be practiced without these specific details.

When introducing elements of various embodiments of the present disclosure, the articles “a” , “an” , “the” and “said” are intended to mean that there are one or more of the elements. The terms “comprising” , “including” and “having” are intended to be inclusive and mean that there may be additional elements other than the listed elements.

Problem of current data-driven control optimization

The problem of data-driven control optimization for production quality improvement can be formulated as follows:

Let:

x denote the set of control parameters, which are related to production quality,

c denote the set of conditional parameters, which can only be measured but are uncontrollable,

y denote a set of control result variables which determine the production quality.

Furthermore, there is a hidden probabilistic model:

P (y|x, c) , which governs the distribution of the control result variables given specific control parameters under specific conditional parameters.

Moreover, there also exists a reward function:

R (y) , which defines a scalar reward value given the control result variables of the production.

Then, given a historical dataset { (x, c, y)₁, (x, c, y)₂, ... }, the target is to learn a control policy π(x_t|c_t) which outputs the optimal control parameters at an arbitrary time t that maximize the expected reward over a finite time horizon T given the observed conditional parameters. The learning objective of the control policy can be formulated as follows:

$$\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{x_{t+l} \sim \pi(\cdot \mid c_{t+l}),\; y_{t+l} \sim P(y \mid x_{t+l}, c_{t+l})} \left[ \sum_{l=0}^{T} \gamma^{l} R(y_{t+l}) \right] \tag{1}$$

where γ ∈ (0, 1) is a discount factor for future rewards. Formula (1) can be applied in process manufacturing.
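As a small illustration of the quantity being maximized, the sketch below computes the discounted sum of rewards over a finite horizon for one trajectory. The function name and the example reward sequence are illustrative assumptions, not part of the disclosure.

```python
def discounted_return(rewards, gamma):
    """Return sum over l of gamma**l * rewards[l], for l = 0..T.

    `rewards` holds R(y_t), R(y_{t+1}), ..., R(y_{t+T}) for one trajectory;
    `gamma` is the discount factor in (0, 1).
    """
    total = 0.0
    for step, reward in enumerate(rewards):
        total += (gamma ** step) * reward
    return total


# Hypothetical rewards for three consecutive time points (T = 2).
example_return = discounted_return([1.0, -1.0, 1.0], gamma=0.5)
print(example_return)  # 1.0 - 0.5 + 0.25 = 0.75
```

The control policy would be chosen to maximize the expectation of this quantity over the trajectories it induces.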

It is worth noting that conditional parameters are commonly treated as independent between productions in discrete manufacturing; thus T=0 can be set in those cases, and the learning objective simplifies as follows:

$$\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{x_{t} \sim \pi(\cdot \mid c_{t}),\; y_{t} \sim P(y \mid x_{t}, c_{t})} \left[ R(y_{t}) \right] \tag{2}$$

Formula (2) can be applied in discrete manufacturing. To develop a data-driven control optimization solution, the following two steps can be taken:

1) Learning the probabilistic model P(y|x, c) from the historical dataset; the probabilistic model is typically fitted by machine learning models such as Gaussian processes or Bayesian neural networks.

2) Learning the optimal control policy in Equation (1) or (2) via optimization algorithms such as model-based offline reinforcement learning or Bayesian optimization.

A major problem with the above approach is its risk insensitivity. Specifically, since the learned control policy only tries to maximize the expected reward in Equation (1) or (2), it may output over-confident control strategies that cause errors in practice. Extensive research and testing indicate that this phenomenon is mainly caused by the distribution shift problem: the real dynamics of the production process may not follow the same distribution as the historical dataset, and the learned control policy tends to give over-confident recommendations in regions of the real dynamics of the production process that are not sufficiently reflected in the historical dataset.

Embodiment 1 of the present disclosure

To solve the above-mentioned risk-insensitivity problem of current optimization solutions, two measures are taken:

On one hand, an ensemble model of conditional generative adversarial networks (CGANs) can be used to fit the probabilistic model P (y|x, c) by which the uncertainties within the production process and the correlations between control result variables can be well captured.

On the other hand, a risk factor can be added to Equations (1) and (2) to penalize control parameters with a high risk of giving over-confident recommendations in regions of the production process that are not sufficiently reflected in the historical dataset.

Assume that a historical dataset { (x, c, y)₁, (x, c, y)₂, ..., (x, c, y)_K } has been collected, in which each tuple (x, c, y)_t consists of the observed control parameters, conditional parameters and control result variables (quality attributes) at a discrete time point t.

Now referring to FIG. 1 and FIG. 2, the method 100 for data driven control optimization for production quality improvement can include following steps:

S101: Learning the probabilistic model P(y|x, c) by an ensemble of CGANs from the historical dataset. The structure of one of the CGANs is shown in FIG. 2.

M CGAN models (also called CGANs) can be trained to learn the probabilistic model P(y|x, c). Each CGAN can be trained with randomized initialization of weights, optionally with the same structure and hyperparameters, such as the number of layers of the neural network and the number of neurons in each layer, which are set by users initially. More specifically, as shown in FIG. 2, each CGAN consists of two sub-models, the generator (G) and the discriminator (D):

Generator (G) :

The input of the generator can include a white noise vector (z) consisting of randomly generated floating-point numbers between 0 and 1, and a condition vector which is the combination of the control parameters (x) and the conditional parameters (c). The output of the generator is the generated control results of production, ŷ.

We define the generator as a function

$$\hat{y} = G(z, x, c)$$

The objective of the generator is to generate simulated quality attributes that cannot be distinguished from real ones by the discriminator.

Discriminator (D) :

The discriminator can also have two inputs. The first is a condition vector identical to its counterpart in the generator. The second is either a generated control result variable sample (ŷ) or a real control result variable sample (y). The objective of the discriminator is to distinguish between real and generated control result variable samples.
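The inputs and outputs of the two sub-models can be sketched as follows. This is a minimal NumPy illustration of the forward passes only, with randomly initialized weights and no adversarial training. The helper names, the hidden layer width and the noise dimension are assumptions made for this sketch, while the 4 control parameters, 2 conditional parameters and 7 control result variables follow the worm gear example in this description.

```python
import numpy as np


def init_mlp(rng, sizes):
    """Randomly initialized weights and zero biases for a small dense network."""
    return [(rng.normal(0.0, 0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]


def mlp_forward(params, inputs):
    """Forward pass with tanh on hidden layers and a linear output layer."""
    hidden = inputs
    for k, (weights, bias) in enumerate(params):
        hidden = hidden @ weights + bias
        if k < len(params) - 1:
            hidden = np.tanh(hidden)
    return hidden


def generator(params, z, x, c):
    """G(z, x, c): noise plus condition vector (x, c) -> generated control results."""
    return mlp_forward(params, np.concatenate([z, x, c], axis=-1))


def discriminator(params, x, c, y):
    """D(x, c, y): condition vector plus a sample -> probability that y is real."""
    logit = mlp_forward(params, np.concatenate([x, c, y], axis=-1))
    return 1.0 / (1.0 + np.exp(-logit))


rng = np.random.default_rng(0)
n_z, n_x, n_c, n_y = 8, 4, 2, 7  # n_z is an assumed noise dimension
g_params = init_mlp(rng, [n_z + n_x + n_c, 16, n_y])
d_params = init_mlp(rng, [n_x + n_c + n_y, 16, 1])

batch = 5
z = rng.uniform(0.0, 1.0, size=(batch, n_z))  # white noise in [0, 1)
x = rng.uniform(0.0, 1.0, size=(batch, n_x))  # control parameters
c = rng.uniform(0.0, 1.0, size=(batch, n_c))  # conditional parameters
y_fake = generator(g_params, z, x, c)         # generated control results
p_real = discriminator(d_params, x, c, y_fake)
```

In an actual CGAN the two networks would be trained against each other; the sketch only shows which vectors each sub-model consumes and produces.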

After the M CGANs are trained, the following Monte Carlo approximation can be used to evaluate the expected reward under given control parameters and conditional parameters:

$$\mathbb{E}\left[ R(y) \mid x, c \right] \approx \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} R\left( G_i(z^{j}, x, c) \right) \tag{3}$$

where N is an integer that is large in comparison with the number of control result variables and can be set according to the available computing power: the greater the computing power, the larger N (the ratio of N to the number of control result variables can be, e.g., 100 or 500). z^j ~ P(z), where P(z) is a multivariate uniform distribution between 0 and 1 with a diagonal covariance matrix, and G_i denotes the i-th CGAN in the ensemble.
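The Monte Carlo evaluation described above can be sketched as follows. The toy generators and the reward function are hypothetical stand-ins for trained CGAN generators and the production reward; only the averaging over M generators and N noise samples reflects the text.

```python
import numpy as np


def expected_reward(generators, reward_fn, x, c, n_samples, noise_dim, rng):
    """Monte Carlo estimate of E[R(y) | x, c] averaged over an ensemble of generators."""
    # One batch of noise vectors z^j ~ U(0, 1), shared by all generators.
    z = rng.uniform(0.0, 1.0, size=(n_samples, noise_dim))
    xs = np.tile(x, (n_samples, 1))  # repeat the control parameters for every sample
    cs = np.tile(c, (n_samples, 1))  # repeat the conditional parameters likewise
    per_model = [np.mean(reward_fn(g(z, xs, cs))) for g in generators]
    return float(np.mean(per_model))


def make_toy_generator(offset):
    """Hypothetical stand-in for a trained generator G_i(z, x, c)."""
    def generate(z, x, c):
        return x.sum(axis=1, keepdims=True) + offset + 0.1 * (z[:, :1] - 0.5)
    return generate


def toy_reward(y):
    """+1 for a qualified sample, -1 otherwise; the tolerance band is an assumption."""
    return np.where(np.abs(y[:, 0] - 1.0) < 0.5, 1.0, -1.0)


rng = np.random.default_rng(0)
generators = [make_toy_generator(0.0), make_toy_generator(0.1)]
estimate = expected_reward(generators, toy_reward, x=np.array([0.5, 0.5]),
                           c=np.array([0.3]), n_samples=1000, noise_dim=4, rng=rng)
```

With these toy generators every generated sample falls inside the assumed tolerance band, so the estimate equals the reward of a qualified production.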

S102: Learning a risk-sensitive control policy from the historical dataset.

To reduce the risk of learning an over-confident control policy, the risk of making an over-confident recommendation of control parameters can be modeled. We noticed that over-confident recommendations mainly occur when the learned probabilistic models (i.e., the CGANs) have limited knowledge of regions of the production process that are not sufficiently reflected in the historical dataset, where they tend to take random guesses. Because of this random nature of over-confident recommendations, we propose that the risk be modeled by computing the divergence between the predicted distributions of the CGANs. Concretely, let μ_i and Σ_i denote the empirical mean and covariance of the control result variables generated by the i-th CGAN. First, assuming the predicted distributions are normal, the squared Hellinger distance between the predicted distributions of a pair of CGANs can be computed as follows:

$$H^{2}(P_i, P_j) = 1 - \frac{\det(\Sigma_i)^{1/4} \det(\Sigma_j)^{1/4}}{\det\left( \frac{\Sigma_i + \Sigma_j}{2} \right)^{1/2}} \exp\left( -\frac{1}{8} (\mu_i - \mu_j)^{\top} \left( \frac{\Sigma_i + \Sigma_j}{2} \right)^{-1} (\mu_i - \mu_j) \right) \tag{4}$$

Then the risk of making an over-confident recommendation can be calculated as the mean of the squared Hellinger distances between the predicted distributions of the M CGANs:

$$r = \frac{2}{M(M-1)} \sum_{i=1}^{M-1} \sum_{j=i+1}^{M} H^{2}(P_i, P_j) \tag{5}$$

where r ∈ [0, 1].
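A minimal sketch of this risk computation is given below, using the standard closed form of the squared Hellinger distance between two multivariate normal distributions. Averaging over all distinct pairs of models is an assumption of this sketch, and the function names are illustrative.

```python
import itertools

import numpy as np


def squared_hellinger_gaussian(mu_i, cov_i, mu_j, cov_j):
    """Squared Hellinger distance between N(mu_i, cov_i) and N(mu_j, cov_j)."""
    cov_avg = 0.5 * (cov_i + cov_j)
    diff = mu_i - mu_j
    # Work with log-determinants for numerical stability
    # (covariances are assumed to be positive definite).
    _, logdet_i = np.linalg.slogdet(cov_i)
    _, logdet_j = np.linalg.slogdet(cov_j)
    _, logdet_avg = np.linalg.slogdet(cov_avg)
    log_coeff = 0.25 * logdet_i + 0.25 * logdet_j - 0.5 * logdet_avg
    mahalanobis = diff @ np.linalg.solve(cov_avg, diff)
    return 1.0 - np.exp(log_coeff - 0.125 * mahalanobis)


def ensemble_risk(samples_per_model):
    """Mean squared Hellinger distance over all distinct pairs of models.

    `samples_per_model` is a list of arrays of shape (N, d), one per model,
    holding the control result variables generated for the same (x, c).
    """
    stats = [(s.mean(axis=0), np.atleast_2d(np.cov(s, rowvar=False)))
             for s in samples_per_model]
    distances = [squared_hellinger_gaussian(*a, *b)
                 for a, b in itertools.combinations(stats, 2)]
    return float(np.mean(distances))


rng = np.random.default_rng(0)
# Three hypothetical models that roughly agree: the risk should be small.
agreeing = [rng.normal(0.0, 1.0, size=(2000, 2)) for _ in range(3)]
# Three hypothetical models that disagree on the mean: the risk should be large.
disagreeing = [rng.normal(shift, 1.0, size=(2000, 2)) for shift in (0.0, 5.0, 10.0)]
low_risk = ensemble_risk(agreeing)
high_risk = ensemble_risk(disagreeing)
```

When the models agree, the pairwise distances stay near 0; when they diverge, the distances approach 1, which is what marks the corresponding control parameters as risky.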

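Lastly, a risk-sensitive optimization objective can be set for learning the optimal control policy (π) as follows:

$$\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{x_{t+l} \sim \pi(\cdot \mid c_{t+l}),\; y_{t+l} \sim P(y \mid x_{t+l}, c_{t+l})} \left[ \sum_{l=0}^{T} \gamma^{l} (1 - r_{t+l}) R(y_{t+l}) \right] \tag{6}$$

Specifically, the reward is rescaled by a factor of (1 − r_{t+l}) to reduce the risk of over-confident recommendations. A larger value of (1 − r_{t+l}) indicates a lower likelihood of making an over-confident recommendation. Similarly, when T=0, the objective function simplifies as follows:

$$\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{x_{t} \sim \pi(\cdot \mid c_{t}),\; y_{t} \sim P(y \mid x_{t}, c_{t})} \left[ (1 - r_{t}) R(y_{t}) \right] \tag{7}$$

The risk-sensitive control policy can be learned with the objective function in Equation (6) via offline reinforcement learning, or with the objective function in Equation (7) via Bayesian optimization.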
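For the T=0 case, the effect of the rescaling can be illustrated with a small sketch. It replaces Bayesian optimization with an exhaustive search over a finite set of candidate control parameters, which is a simplification made here for clarity. The candidate values, expected rewards and risks are made-up numbers.

```python
import numpy as np


def select_control(candidates, expected_rewards, risks):
    """Pick the candidate maximizing (1 - risk) * expected reward."""
    scores = (1.0 - risks) * expected_rewards
    best = int(np.argmax(scores))
    return best, candidates[best], float(scores[best])


# Hypothetical candidates: each row is one setting of the control parameters.
candidates = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]])
expected_rewards = np.array([0.9, 0.6, 0.2])  # estimated from the model ensemble
risks = np.array([0.8, 0.1, 0.0])             # ensemble divergence per candidate

risk_insensitive_choice = int(np.argmax(expected_rewards))
risk_sensitive_choice, chosen_x, chosen_score = select_control(
    candidates, expected_rewards, risks)
```

Here the risk-insensitive choice is the first candidate, whose high predicted reward comes with strong disagreement between the models, while the rescaled objective prefers the second candidate.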

Experiments

Here, taking a worm gear production line as an example, the risk-sensitive solution presented in the above-mentioned method 100 is applied to learn a control policy for quality improvement.

Specifically, a historical dataset of about 15000 production records is collected from the production line. Each production record consists of 2 conditional parameters, 4 control parameters and 7 control result variables. The reward for a qualified production is set to 1; otherwise the reward is set to -1.

We compare our learned policy with two baseline policies:

1) the log policy in the historical dataset;

2) a risk-insensitive data-driven policy as described with formulas (1) and (2) .

To compare the performance of different control policies, the ideal way is to run the policies and observe the rewards; however, doing so is too costly or practically impossible. Thus, we propose to evaluate the policies on the historical dataset using a doubly robust estimator of the kind commonly used for off-policy evaluation. Specifically, we choose the switch doubly robust estimator originally proposed in the following paper to calculate the expected reward of a control policy using the following formula:

Wang, Yu-Xiang, Alekh Agarwal, and Miroslav Dudík. "Optimal and adaptive off-policy evaluation in contextual bandits." International Conference on Machine Learning. PMLR, 2017.

where P(x_t|c_t) is the estimated probability of choosing control parameters x_t given conditional parameters c_t in the historical dataset, R_t is the logged reward in the historical dataset, R̂(x_t, c_t) is the estimated reward for given control parameters, and σ is a predefined threshold for switching between importance sampling and the direct method when estimating the expected reward. Intuitively, the above evaluation method penalizes policies that have a relatively low chance of recommending control parameters with good rewards in the historical dataset, or a relatively high chance of recommending control parameters with bad rewards in the historical dataset. Conversely, it favors policies that have a relatively high chance of recommending control parameters with good rewards in the historical dataset, or a relatively low chance of recommending control parameters with bad rewards in the historical dataset. In addition, when the importance weight is larger than the threshold σ, it estimates the expected reward using direct estimates. As a result, it can estimate the expected reward of a new policy from a historical dataset with low bias and variance.
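The switching behaviour can be sketched as below. The code follows one common form of the switch doubly robust estimator for a finite set of candidate controls, which is an assumption of this sketch and may differ in detail from the formula used in the disclosure. All argument names and the toy data are illustrative.

```python
import numpy as np


def switch_dr(target_probs, logging_probs, actions, rewards, reward_model, threshold):
    """Switch doubly robust estimate of a policy's expected reward.

    target_probs:  (n, k) probability the evaluated policy gives each of k candidate controls
    logging_probs: (n,)   estimated probability of the logged control in the historical data
    actions:       (n,)   index of the logged control in each record
    rewards:       (n,)   logged rewards
    reward_model:  (n, k) estimated reward of each candidate control in each record
    threshold:     importance weights above this value fall back to the direct method
    """
    rows = np.arange(len(actions))
    weights = target_probs[rows, actions] / logging_probs
    # Direct-method term: model-based reward averaged under the evaluated policy.
    direct = np.sum(target_probs * reward_model, axis=1)
    # Importance-weighted correction, applied only where the weight is small enough.
    residual = rewards - reward_model[rows, actions]
    correction = np.where(weights <= threshold, weights * residual, 0.0)
    return float(np.mean(direct + correction))


# Toy data: two records, two candidate controls.
target_probs = np.array([[1.0, 0.0], [0.0, 1.0]])
logging_probs = np.array([0.5, 0.5])
actions = np.array([0, 1])
rewards = np.array([1.0, 1.0])
perfect_model = np.array([[1.0, -1.0], [-1.0, 1.0]])
uninformative_model = np.zeros((2, 2))

value_perfect = switch_dr(target_probs, logging_probs, actions, rewards,
                          perfect_model, threshold=10.0)
value_switched = switch_dr(target_probs, logging_probs, actions, rewards,
                           uninformative_model, threshold=1.0)
```

In the second call every importance weight (2.0) exceeds the threshold, so the estimate relies on the reward model alone.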

We show the expected reward of different policies using the switch doubly robust estimator in the following table:

As can be seen, the risk-sensitive data-driven policy learned using our proposed method significantly outperforms the risk-insensitive counterpart. Moreover, it also effectively improves the expected reward compared with the log policy.

Embodiment 2 of the present disclosure

More generally, the method 300 as shown in FIG. 3 is presented, which can include the following steps:

- S301: collecting historical dataset from a production unit, wherein a dataset includes control parameters, conditional parameters and control result variables.

Here, a production unit can be defined according to the actual situation, such as a factory, a production line, or part of a production line. The historical dataset of the production unit can be collected for subsequent model training and control parameter optimization.

The dataset can include above-mentioned x (the set of control parameters) , c (the set of conditional parameters) and y (the set of control result variables) .

- S302: training multiple models with same structure, same hyperparameters and randomized initialization of weights, wherein each model captures the probabilistic distribution of the control result variables given the control parameters and conditional parameters.

The multiple models can be CGANs as shown in FIG. 2 and described above. The multiple models have the same structure and the same hyperparameters, but are initialized with randomized weights, so that the trained models differ. The divergence between the multiple trained models can be used to measure the risk of over-confident recommendations of control parameters and to rescale the expected reward when acquiring the control policy for the optimal control parameters, so that the risk can be reduced.
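The idea of sharing structure and hyperparameters while varying only the weight initialization can be sketched as follows. Only the initialization is shown and training is omitted. The layer sizes and the seeding scheme are assumptions of this sketch.

```python
import numpy as np


def init_model(seed, layer_sizes):
    """Weight matrices for one model; the seed alone determines the initial weights."""
    rng = np.random.default_rng(seed)
    return [rng.normal(0.0, 0.1, size=(m, n))
            for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]


def build_ensemble(n_models, layer_sizes, base_seed=0):
    """Same structure for every model, a different random initialization for each."""
    return [init_model(base_seed + i, layer_sizes) for i in range(n_models)]


ensemble = build_ensemble(n_models=5, layer_sizes=[14, 32, 7])
```

Because the models start from different weights, they can end up with different predictions in regions with little training data, which is the divergence measured in step S303.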

- S303: computing divergence between the predicted distributions of the trained multiple models.

For example, the divergence between 2 models can be calculated by above-mentioned formula (4) .

- S304: acquiring a control policy which outputs the optimal control parameters at an arbitrary time point that maximize the expected reward at a finite time horizon given observed conditional parameters, wherein the expected reward is rescaled by risk of over-confident recommendations of the control parameters, the larger the divergence the larger the risk.

For example, the risk can be calculated as mean of squared Hellinger distance of the predicted distributions of the multiple models (as shown in above-mentioned formula (5) ) .

For the calculation process, refer to the above-mentioned formulas (6) and (7).

Embodiment 3 of the present disclosure

FIG. 4 depicts a block diagram of an apparatus for data driven control optimization for production quality improvement in accordance with one embodiment of the present disclosure. The apparatus 10 for data driven control optimization for production quality improvement presented in the present disclosure can be implemented as a network of computer processors, to execute the above-mentioned method 100 or 300 for data driven control optimization for production quality improvement presented in the present disclosure. The apparatus 10 can also be a single computer, as shown in FIG. 4, including at least one memory 102, which includes a computer-readable medium, such as a random access memory (RAM). The apparatus 10 also includes at least one processor 101, coupled with the at least one memory 102. Computer-executable instructions are stored in the at least one memory 102, and when executed by the at least one processor 101, can cause the at least one processor 101 to perform the steps described herein. The at least one processor 101 may include a microprocessor, an application specific integrated circuit (ASIC), a digital signal processor (DSP), a central processing unit (CPU), a graphics processing unit (GPU), state machines, etc. Embodiments of the computer-readable medium include, but are not limited to, a floppy disk, CD-ROM, magnetic disk, memory chip, ROM, RAM, an ASIC, a configured processor, all optical media, all magnetic tape or other magnetic media, or any other medium from which a computer processor can read instructions. Also, various other forms of computer-readable medium may transmit or carry instructions to a computer, including a router, private or public network, or other transmission device or channel, both wired and wireless. The instructions may include code from any computer-programming language, including, for example, C, C++, C#, Visual Basic, Java, and JavaScript.

The at least one memory 102 shown in FIG. 4 can contain a control optimization program 20 which, when executed by the at least one processor 101, causes the at least one processor 101 to execute the method 100 or 300 for data driven control optimization for production quality improvement presented in the present disclosure. The historical dataset 30 and related data or parameters of the multiple models 40 can also be stored in the at least one memory 102. The historical dataset 30 can be received via an I/O port 103 of the apparatus 10.

The control optimization program 20 can include:

- a collection module 201, configured to collect historical dataset from a production unit, wherein a dataset includes control parameters, conditional parameters and control result variables;

- a training module 202, configured to train multiple models with same structure, same hyperparameters and randomized initialization of weights, wherein each model captures the probabilistic distribution of the control result variables given the control parameters and conditional parameters;

- a computing module 203, configured to compute divergence between the predicted distributions of the trained multiple models;

- an acquisition module 204, configured to acquire a control policy which outputs the optimal control parameters at an arbitrary time point that maximize the expected reward at a finite time horizon given observed conditional parameters, wherein the expected reward is rescaled by risk of over-confident recommendations of the control parameters, the larger the divergence the larger the risk.

Optionally, the multiple models (40) are CGANs.

Optionally, the risk is mean of squared Hellinger distances of the predicted distributions of the multiple models.
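The way the four modules hand results to one another can be sketched as a simple pipeline. The class below is a hypothetical illustration of that data flow, not the disclosed implementation. Each module is passed in as a plain callable, and the toy callables at the bottom exist only to make the sketch runnable.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence


@dataclass
class ControlOptimizationProgram:
    collect: Callable[[], Any]                                      # collection module 201
    train: Callable[[Any], Sequence[Any]]                           # training module 202
    compute_divergence: Callable[[Sequence[Any]], Callable]         # computing module 203
    acquire_policy: Callable[[Sequence[Any], Callable], Callable]   # acquisition module 204

    def run(self) -> Callable:
        dataset = self.collect()                    # historical dataset
        models = self.train(dataset)                # multiple trained models
        risk_fn = self.compute_divergence(models)   # risk from model divergence
        return self.acquire_policy(models, risk_fn) # risk-sensitive control policy


# Toy stand-ins for the four modules.
program = ControlOptimizationProgram(
    collect=lambda: [1.0, 2.0, 3.0],
    train=lambda data: [sum(data)],
    compute_divergence=lambda models: (lambda c: 0.0),
    acquire_policy=lambda models, risk_fn: (lambda c: models[0] * (1.0 - risk_fn(c))),
)
policy = program.run()
```

Keeping the modules as injected callables is one possible design choice; it makes it easy to swap, for example, the model family used by the training module without touching the rest of the pipeline.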

A computer-readable medium is also provided in the present disclosure, storing computer-executable instructions, which upon execution by a computer, enables the computer to execute any of the methods presented in this disclosure.

A computer program is also provided which, when executed by at least one processor, performs any of the methods presented in this disclosure.

While the present technique has been described in detail with reference to certain embodiments, it should be appreciated that the present technique is not limited to those precise embodiments. Rather, in view of the present disclosure which describes exemplary modes for practicing the invention, many modifications and variations would present themselves to those skilled in the art without departing from the scope and spirit of this invention. The scope of the invention is, therefore, indicated by the following claims rather than by the foregoing description. All changes, modifications, and variations coming within the meaning and range of equivalency of the claims are to be considered within their scope.

Claims

1. A method (300) for data driven control optimization for production quality improvement, comprising following steps:

- collecting (S301) historical dataset from a production unit, wherein a dataset includes control parameters, conditional parameters and control result variables;

- training (S302) multiple models with same structure, same hyperparameters and randomized initialization of weights, wherein each model captures the probabilistic distribution of the control result variables given the control parameters and conditional parameters;

- computing (S303) divergence between the predicted distributions of the trained multiple models;

- acquiring (S304) a control policy which outputs the optimal control parameters at an arbitrary time point that maximize the expected reward at a finite time horizon given observed conditional parameters, wherein the expected reward is rescaled by risk of over-confident recommendations of the control parameters, the larger the divergence, the larger the risk and more reward will be penalized.

2. The method according to claim 1, wherein the multiple models (40) are CGANs.

3. The method according to claim 1, wherein the risk is mean of squared Hellinger distances of the predicted distributions of the multiple models.

4. An apparatus (10) for data driven control optimization for production quality improvement, comprising:

- a collection module (201), configured to collect historical dataset from a production unit, wherein a dataset includes control parameters, conditional parameters and control result variables;

- a training module (202), configured to train multiple models with same structure, same hyperparameters and randomized initialization of weights, wherein each model captures the probabilistic distribution of the control result variables given the control parameters and conditional parameters;

- a computing module (203), configured to compute divergence between the predicted distributions of the trained multiple models;

- an acquisition module (204), configured to acquire a control policy which outputs the optimal control parameters at an arbitrary time point that maximize the expected reward at a finite time horizon given observed conditional parameters, wherein the expected reward is rescaled by risk of over-confident recommendations of the control parameters, the larger the divergence the larger the risk and more reward will be penalized.

5. The apparatus according to claim 4, wherein the multiple models (40) are CGANs.

6. The apparatus according to claim 4, wherein the risk is mean of squared Hellinger distances of the predicted distributions of the multiple models.

7. An apparatus (10) for data driven control optimization for production quality improvement, comprising:

- at least one processor (101);

- at least one memory (102), coupled to the at least one processor (101), configured to execute method according to any of claims 1～3.

8. A computer-readable medium for data driven control optimization for production quality improvement, storing computer-executable instructions, wherein the computer-executable instructions when executed cause at least one processor to execute method according to any of claims 1～3.