CN107609582B

CN107609582B - Multi-dimensional parameter identification method and device

Info

Publication number: CN107609582B
Application number: CN201710772738.3A
Authority: CN
Inventors: 萧伟; 刘雪松; 凌娅; 陈勇; 王振中; 姜晓红; 毕宇安; 李页瑞; 包乐伟; 章晨峰; 王磊; 陈永杰; 杜定益
Original assignee: Jiangsu Kanion Pharmaceutical Co Ltd; Zhejiang University ZJU
Current assignee: Jiangsu Kanion Pharmaceutical Co Ltd; Zhejiang University ZJU
Priority date: 2017-08-31
Filing date: 2017-08-31
Publication date: 2023-10-10
Anticipated expiration: 2037-08-31
Also published as: CN107609582A

Abstract

The invention discloses a multi-dimensional parameter identification method and device. The identification method comprises the following steps: collecting a plurality of training samples to form a training sample set, wherein each training sample comprises a plurality of process parameters, each process parameter has corresponding attribute parameters and categories, and there are multiple combinations of attribute parameters and categories; acquiring a distribution transfer information value of the training sample set according to the categories in the training sample set; obtaining the information gain of each process parameter according to the distribution transfer information value; selecting the process parameter with the maximum information gain as a splitting node, and establishing a decision tree; and carrying out category identification on new data according to the decision tree. The invention takes the information gain as the basis for establishing the decision tree model, which improves the accuracy of the model, identifies the data category effectively and accurately, and provides a reliable basis for intelligent parameter feedback.

Description

Multi-dimensional parameter identification method and device

Technical Field

The invention relates to the field of process knowledge systems, in particular to a multi-dimensional parameter identification method and device.

Background

In a process knowledge system (Process Knowledge System, PKS for short), intelligent feedback of parameters mainly includes real-time monitoring of parameter data, and if the parameters are out of range, warning is issued.

In the prior art, one-dimensional data is usually monitored and fed back. For two-dimensional or multi-dimensional data, parameter feedback presupposes that the multi-dimensional parameters are first identified, so that the thresholds of such data can be judged and monitored.

In the prior art, a PKS only monitors one-dimensional data. The scope of feedback is narrow, and problems existing in a system are difficult to reflect accurately and comprehensively, so the early-warning efficiency of the system is low and manufacturing efficiency and quality are affected.

Disclosure of Invention

In order to solve the problems in the prior art, the invention provides a multi-dimensional parameter identification method and a device, and the technical scheme is as follows:

in one aspect, the present invention provides a multidimensional parameter identification method, including:

collecting a plurality of training samples to form a training sample set, wherein each training sample comprises a plurality of process parameters, each process parameter has corresponding attribute parameters and categories, and there are multiple combinations of attribute parameters and categories;

acquiring a distribution transfer information value of the training sample set according to the category in the training sample set;

according to the distribution transfer information value, obtaining the information gain of each process parameter;

selecting the process parameter with the maximum information gain as a splitting node, and establishing a decision tree;

and carrying out category identification on the new data according to the decision tree.

Further, the obtaining the distribution transfer information value of the training sample set includes: the distribution transfer information value is obtained by the following calculation formula:

$$\mathrm{info}(S) = -\sum_{i} P_i \log_2 P_i$$

wherein info(S) is the distribution transfer information value of the training sample set, and $P_i$ is the probability that a training sample belongs to the i-th class.

Further, the obtaining the information gain of each process parameter includes: the information gain is obtained by the following calculation formula:

$$\mathrm{gain}(S, A) = \mathrm{info}(S) - \sum_{v \in \mathrm{Varies}(A)} \frac{|S_v|}{|S|}\,\mathrm{info}(S_v)$$

wherein info(S) is the distribution transfer information value of the training sample set, info($S_v$) is the distribution transfer information value of a certain attribute parameter v, $\frac{|S_v|}{|S|}$ is the probability of that attribute parameter (the fraction of training samples in which it occurs), and Varies(A) is the set of attribute parameters of process parameter A.

Further, before the training sample set is composed, the method further comprises verifying the validity of each training sample, comprising: detecting whether the process parameters and the category in each training sample are complete, and if not, rejecting the training sample; and/or

after the training sample set is composed, the method further comprises verifying the validity of the training sample set, comprising: detecting whether the attribute parameters corresponding to the process parameters are complete, and if not, judging that the training sample set is invalid.

Further, the process parameter with the maximum information gain is selected as a split node, and the decision tree establishment comprises the following steps:

taking the process parameter with the maximum information gain as a root node and the corresponding attribute parameter as a first branch node;

judging whether the categories corresponding to the attribute parameters are consistent; if so, taking the category as a leaf node; otherwise, taking the process parameter with the next-largest information gain as a second branch node, and repeating the above steps until a category is obtained as the leaf node.

Further, if the root node and each branch node are the same but the leaf nodes differ, the number of occurrences of each leaf node under that root node and those branch nodes is counted in the training sample set; if the counts are equal, either leaf node is discarded at random, otherwise the leaf node with the lower count is discarded.

Further, the process parameters in the same training sample are key process parameters obtained by screening according to key quality attributes, wherein the key quality attributes are attribute parameters selected according to working sections in a process knowledge system.

In another aspect, the present invention provides a multidimensional parameter identification apparatus, comprising:

the data acquisition module is used for acquiring a plurality of training samples to form a training sample set, wherein each training sample comprises a plurality of process parameters, each process parameter has corresponding attribute parameters and categories, and there are multiple combinations of attribute parameters and categories;

the distribution transfer module is used for acquiring a distribution transfer information value of the training sample set according to the category in the training sample set;

the gain module is used for acquiring the information gain of each process parameter according to the distribution transfer information value;

the decision tree module is used for selecting the process parameter with the maximum information gain as a splitting node and establishing a decision tree;

and the identification module is used for carrying out category identification on the new data according to the decision tree.

Further, the apparatus further comprises:

a first verification module for verifying the validity of training samples prior to composing a training sample set, comprising: detecting whether the process parameters and the categories in each training sample are complete, and if not, rejecting the training samples; and/or

A second checking module, configured to check validity of the training sample set after composing the training sample set, includes: and detecting whether attribute parameters corresponding to the process parameters are complete, and if not, judging that the training sample set is invalid.

Further, the distribution transfer module obtains the distribution transfer information value through the following calculation formula:

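$$\mathrm{info}(S) = -\sum_{i} P_i \log_2 P_i$$

wherein info(S) is the distribution transfer information value of the training sample set, and $P_i$ is the probability that a training sample belongs to the i-th class;

the gain module obtains the information gain through the following calculation formula:

$$\mathrm{gain}(S, A) = \mathrm{info}(S) - \sum_{v \in \mathrm{Varies}(A)} \frac{|S_v|}{|S|}\,\mathrm{info}(S_v)$$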

The technical scheme provided by the invention has the following beneficial effects:

1) The information gain is used as a selection standard of the splitting nodes of the decision tree, so that the accuracy of the decision tree is improved;

2) And establishing a decision tree model to realize accurate classification of the multidimensional data.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flowchart of a multi-dimensional parameter identification method provided by an embodiment of the present invention;

FIG. 2 is a flow chart of a method for creating a decision tree provided by an embodiment of the present invention;

FIG. 3 is a schematic diagram of a decision tree according to an embodiment of the present invention;

FIG. 4 is a flowchart of a training sample verification method provided by an embodiment of the present invention;

FIG. 5 is a block diagram of a multidimensional parameter identification apparatus according to an embodiment of the present invention.

Detailed Description

In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, apparatus, article, or device that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or device.

Example 1

In one embodiment of the present invention, a multi-dimensional parameter identification method is provided, referring to fig. 1, the method includes the following steps:

S1, collecting a plurality of training samples to form a training sample set.

Specifically, each training sample includes a plurality of process parameters, each process parameter having a corresponding attribute parameter and category, and the attribute parameters and categories can be combined in multiple ways. For example, suppose that in a certain working section the production process depends on three process parameters: temperature (T), pressure (P), and humidity (PH). The attribute parameters are high (H), medium (N), and low (L), and the categories are excellent, good, and poor. A training sample can then look as follows: T (H), P (N), PH (N), category good. A plurality of training samples in the same or different states form a training sample set, as shown in Table 1 below.

TABLE 1

No.	T	P	PH	Category
1	H	N	H	Good
2	L	H	N	Good
3	N	L	L	Excellent
4	H	N	N	Poor
5	N	N	H	Good
6	L	N	N	Excellent
7	L	H	L	Excellent
8	N	H	H	Poor
9	H	L	N	Good

S2, acquiring a distribution transfer information value of the training sample set according to the categories in the training sample set.

Let S be a training sample set containing samples of n classes, denoted C1, C2, …, Cn. If a given probability distribution P = (P1, P2, …, Pn) gives the probability of each class Ci, the amount of information conveyed by this distribution is referred to as the information entropy of S.

The obtaining the distribution transfer information value of the training sample set includes: the distribution transfer information value is obtained by the following calculation formula:

$$\mathrm{info}(S) = -\sum_{i=1}^{n} P_i \log_2 P_i$$

Taking Table 1 as an example, the training sample set includes 9 training samples: three with category excellent (serial numbers 3, 6, and 7), four with category good, and two with category poor.
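As a worked instance of the entropy formula, the class counts of Table 1 (3 excellent, 4 good, and 2 poor out of 9 samples) can be substituted directly. The rounded intermediate values below are computed here for illustration and are not taken from the original text:

```latex
\mathrm{info}(S)
  = -\left(\tfrac{3}{9}\log_2\tfrac{3}{9}
         + \tfrac{4}{9}\log_2\tfrac{4}{9}
         + \tfrac{2}{9}\log_2\tfrac{2}{9}\right)
  \approx 0.528 + 0.520 + 0.482
  \approx 1.53
```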

S3, acquiring the information gain of each process parameter according to the distribution transfer information value.

In particular, the information gain measures the expected reduction in information entropy.

The obtaining the information gain of each process parameter includes: the information gain is obtained by the following calculation formula:

$$\mathrm{gain}(S, A) = \mathrm{info}(S) - \sum_{v \in \mathrm{Varies}(A)} \frac{|S_v|}{|S|}\,\mathrm{info}(S_v)$$

Taking the data in Table 1 as an example with the temperature T, info(T_H) is computed from the three samples whose temperature is H (two good and one poor):

$$\mathrm{info}(T_H) = -\left(\tfrac{2}{3}\log_2\tfrac{2}{3} + \tfrac{1}{3}\log_2\tfrac{1}{3}\right) \approx 0.918$$

In the same way, info(T_N) and info(T_L) are obtained, and from them gain(S, T). gain(S, P) and gain(S, PH) can be obtained by the same method.
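The calculation in steps S2 and S3 can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the names `SAMPLES`, `PARAMS`, `info`, and `gain` are hypothetical, and the data are the Table 1 rows as transcribed in this document.

```python
from collections import Counter
from math import log2

# Table 1 as transcribed in this document: (T, P, PH, category).
SAMPLES = [
    ("H", "N", "H", "good"),
    ("L", "H", "N", "good"),
    ("N", "L", "L", "excellent"),
    ("H", "N", "N", "poor"),
    ("N", "N", "H", "good"),
    ("L", "N", "N", "excellent"),
    ("L", "H", "L", "excellent"),
    ("N", "H", "H", "poor"),
    ("H", "L", "N", "good"),
]
PARAMS = {"T": 0, "P": 1, "PH": 2}  # column index of each process parameter


def info(samples):
    """Distribution transfer information value (entropy) of the categories."""
    counts = Counter(row[-1] for row in samples)
    total = len(samples)
    return -sum(c / total * log2(c / total) for c in counts.values())


def gain(samples, column):
    """Information gain obtained by splitting the samples on one column."""
    total = len(samples)
    remainder = 0.0
    for value in {row[column] for row in samples}:
        subset = [row for row in samples if row[column] == value]
        remainder += len(subset) / total * info(subset)
    return info(samples) - remainder


gains = {name: gain(SAMPLES, col) for name, col in PARAMS.items()}
split_parameter = max(gains, key=gains.get)  # parameter with the largest gain

if __name__ == "__main__":
    for name, value in gains.items():
        print(f"gain(S, {name}) = {value:.4f}")
    print("split on:", split_parameter)
```

With the Table 1 values as transcribed here, the sketch gives gain(S, T) ≈ 0.390, gain(S, P) ≈ 0.113, and gain(S, PH) ≈ 0.558, so it would select PH. The ordering used in the text's illustration is stated as a hypothetical and need not coincide with these numbers.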

S4, selecting the process parameter with the maximum information gain as a splitting node, and establishing a decision tree.

If the calculation gives gain(S, T) > gain(S, PH) > gain(S, P), the temperature T is used as the split node to build the model.

S5, carrying out category identification on the new data according to the decision tree.

In the embodiment of the invention, the decision tree model is established with the information gain as the basis for selecting split nodes, which helps identify and classify two-dimensional (multidimensional) data accurately and provides a reliable basis for intelligent feedback of parameters. Such data arise in the following scenario: the process parameters in the same training sample are key process parameters obtained by screening according to key quality attributes, where the key quality attributes are attribute parameters selected according to working sections in a process knowledge system. When the same key quality attribute has two or more key process parameters, the parameter category can be identified by the method of this embodiment.

Example 2

In one embodiment of the present invention, a method of building a decision tree is provided, see fig. 2, comprising the steps of:

S21, taking the process parameter with the maximum information gain as a root node;

S22, taking the corresponding attribute parameter as a first branch node;

S23, judging whether the categories corresponding to the attribute parameters are consistent; if so, executing S24; otherwise, executing S25;

S24, taking the category as a leaf node;

S25, taking the process parameter with the next-largest information gain as a second branch node, and repeating the above steps until a category is obtained as a leaf node.

Taking the data in Table 1 as an example, if, through calculation, gain (S, T) > gain (S, PH) > gain (S, P), the decision tree established according to steps S21-S25 is shown in FIG. 3.

According to the decision tree shown in FIG. 3, the category of new data can be judged. For example, if the new data is T (L) P (H) PH (L), then, although the new data does not appear in Table 1, its category can be identified as excellent according to the decision tree shown in FIG. 3; if the new data is T (N) P (H) PH (H), its category is judged to be poor, and this is fed back to the user in time or a warning is issued.
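Steps S21 to S25 and the lookup of new data can be sketched as follows. This is an illustrative reading only: the nested-dictionary representation and the names `build_tree` and `classify` are assumptions, the parameter order T, PH, P is the ordering the text assumes rather than a computed one, and the resulting structure is not claimed to match FIG. 3.

```python
FIELDS = ("T", "P", "PH", "category")
ROWS = [  # Table 1 as transcribed in this document
    ("H", "N", "H", "good"),
    ("L", "H", "N", "good"),
    ("N", "L", "L", "excellent"),
    ("H", "N", "N", "poor"),
    ("N", "N", "H", "good"),
    ("L", "N", "N", "excellent"),
    ("L", "H", "L", "excellent"),
    ("N", "H", "H", "poor"),
    ("H", "L", "N", "good"),
]
SAMPLES = [dict(zip(FIELDS, row)) for row in ROWS]


def build_tree(samples, order):
    """Split on the parameters in `order` until each branch has one category."""
    categories = {s["category"] for s in samples}
    if len(categories) == 1:  # S23/S24: categories agree, so emit a leaf
        return categories.pop()
    if not order:  # identical parameters but different categories
        raise ValueError("conflicting samples must be resolved first")
    parameter, rest = order[0], order[1:]  # S25: next parameter in the order
    branches = {}
    for value in sorted({s[parameter] for s in samples}):
        subset = [s for s in samples if s[parameter] == value]
        branches[value] = build_tree(subset, rest)
    return {"parameter": parameter, "branches": branches}


def classify(tree, record):
    """Follow the branches for `record`; return None if no branch matches."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(record[tree["parameter"]])
    return tree


# Ordering assumed in the text: gain(S, T) > gain(S, PH) > gain(S, P).
TREE = build_tree(SAMPLES, ["T", "PH", "P"])

if __name__ == "__main__":
    print(classify(TREE, {"T": "L", "P": "H", "PH": "L"}))
    print(classify(TREE, {"T": "N", "P": "H", "PH": "H"}))
```

Under these assumptions the two records from the text are classified as excellent and poor respectively. A record whose attribute combination has no branch in the tree yields `None`, which a caller would have to handle separately.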

If the root node and each branch node are the same but the leaf nodes differ, the number of occurrences of each leaf node under that root node and those branch nodes is counted in the training sample set; if the counts are equal, either leaf node is discarded at random, otherwise the leaf node with the lower count is discarded. A specific example: suppose a training sample numbered 10 is added to Table 1, T (N) P (H) PH (H) with category good, which clearly conflicts with the category of sample 8 in Table 1. In this case, the numbers of samples supporting each of the two categories are counted and the category with more samples is kept; if the numbers are equal, either category is selected at random.
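The conflict rule just described amounts to a majority vote with a random tie-break. A minimal sketch, assuming the categories of all samples that share the same parameter values have already been gathered into a list (the function name `resolve_conflict` and the optional `rng` argument are hypothetical):

```python
import random
from collections import Counter


def resolve_conflict(categories, rng=random):
    """Pick one category for samples that share identical parameter values."""
    counts = Counter(categories)
    best = max(counts.values())
    winners = sorted(c for c, n in counts.items() if n == best)
    # A unique maximum is kept; on a tie one candidate is chosen at random.
    return winners[0] if len(winners) == 1 else rng.choice(winners)


if __name__ == "__main__":
    # Sample 8 says "poor"; two hypothetical extra samples say "good".
    print(resolve_conflict(["poor", "good", "good"]))
    # One vote each: the result is either category.
    print(resolve_conflict(["poor", "good"]))
```

Passing a seeded `random.Random` instance as `rng` makes the tie-break reproducible, which can be convenient when the tree is rebuilt for testing.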

Example 3

In one embodiment of the present invention, a training sample verification method is provided, see fig. 4, including the following procedures:

S31, detecting whether the process parameters and the category in each training sample are complete; if not, rejecting the training sample, and if so, executing S32.

S32, adding the training sample to the training sample set, whereby the number of samples in the training sample set increases by 1;

S33, judging whether the number of samples in the training sample set has reached the target number; if so, executing S34, otherwise repeating S31-S32.

S34, detecting whether the attribute parameters corresponding to the process parameters are complete, and if not, judging that the training sample set is invalid.

To ensure that the established decision tree model is complete, the validity of each training sample needs to be checked before the training sample set is formed, and the validity of the whole training sample set needs to be checked after it is formed; otherwise, the decision tree may be unable to yield a category leaf node in some cases.
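The two checks can be sketched as below. The text does not define "complete" precisely, so this sketch makes two labeled assumptions: a sample is complete when every process parameter and the category are present and non-empty, and a sample set is valid when every process parameter takes each of the attribute parameters H, N, and L at least once. All names (`REQUIRED_PARAMETERS`, `ATTRIBUTE_LEVELS`, `collect`, and so on) are hypothetical.

```python
REQUIRED_PARAMETERS = ("T", "P", "PH")  # assumed process parameter names
ATTRIBUTE_LEVELS = {"H", "N", "L"}      # assumed attribute parameters


def sample_is_complete(sample):
    """S31: every process parameter and the category must be present."""
    keys = REQUIRED_PARAMETERS + ("category",)
    return all(sample.get(k) not in (None, "") for k in keys)


def collect(candidates, target_size):
    """S31-S33: keep complete samples until the target count is reached."""
    accepted = []
    for sample in candidates:
        if sample_is_complete(sample):
            accepted.append(sample)
        if len(accepted) >= target_size:
            break
    return accepted


def sample_set_is_valid(samples):
    """S34 (assumed reading): each parameter shows every attribute level."""
    return all(
        {s[p] for s in samples} >= ATTRIBUTE_LEVELS for p in REQUIRED_PARAMETERS
    )


if __name__ == "__main__":
    candidates = [
        {"T": "H", "P": "N", "PH": "H", "category": "good"},
        {"T": "L", "P": "H", "PH": "N"},  # rejected: category missing
        {"T": "N", "P": "L", "PH": "L", "category": "excellent"},
        {"T": "L", "P": "H", "PH": "N", "category": "good"},
    ]
    kept = collect(candidates, target_size=3)
    print(len(kept), sample_set_is_valid(kept))
```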

Example 4

In an embodiment of the present invention, a multi-dimensional parameter identification apparatus is provided, referring to fig. 5, including the following modules:

the data collection module 510 is configured to collect a plurality of training samples to form a training sample set, where each training sample includes a plurality of process parameters, and each process parameter has a corresponding attribute parameter and category, where the attribute parameter and category are combined in a plurality of ways;

the distribution transfer module 520 is configured to obtain a distribution transfer information value of the training sample set according to the category in the training sample set;

a gain module 530, configured to obtain the information gain of each process parameter according to the distribution transfer information value;

the decision tree module 540 is configured to select a process parameter with the maximum information gain as a split node, and establish a decision tree;

and the identification module 550 is used for carrying out category identification on the new data according to the decision tree.

Further, the identifying means further comprises a first checking module 511 for checking the validity of the training samples before forming the training sample set, including: detecting whether the process parameters and the categories in each training sample are complete, and if not, rejecting the training samples; and/or

A second checking module 512, configured to check validity of the training sample set after composing the training sample set, includes: and detecting whether attribute parameters corresponding to the process parameters are complete, and if not, judging that the training sample set is invalid.

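Wherein, the distribution transfer module 520 obtains the distribution transfer information value by the following calculation formula:

$$\mathrm{info}(S) = -\sum_{i} P_i \log_2 P_i$$

the gain module 530 obtains the information gain by the following calculation formula:

$$\mathrm{gain}(S, A) = \mathrm{info}(S) - \sum_{v \in \mathrm{Varies}(A)} \frac{|S_v|}{|S|}\,\mathrm{info}(S_v)$$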

It should be noted that, when the multi-dimensional parameter identification device provided in the above embodiment performs intelligent parameter identification, the division into the functional modules described above is only an example. In practical applications, the functions can be allocated to different functional modules as needed; that is, the internal structure of the multi-dimensional parameter identification device can be divided into different functional modules to complete all or part of the functions described above. In addition, the multidimensional parameter identification apparatus provided in this embodiment and the multidimensional parameter identification method provided in the foregoing embodiments belong to the same concept; for the detailed implementation of the apparatus, refer to the method embodiments, which are not repeated here.

The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims

1. A method for comprehensively detecting temperature, pressure and humidity of multidimensional parameters in a PKS production section, which is characterized by comprising the following steps:

collecting a plurality of training samples to form a training sample set, wherein each training sample comprises a plurality of process parameters, each process parameter has corresponding attribute parameters and categories, the process parameters in the same training sample are key process parameters obtained by screening according to key quality attributes, and the key quality attributes are attribute parameters selected according to working sections in a process knowledge system;

the process parameters comprise temperature, pressure and humidity, there are multiple combinations of attribute parameters and categories, the attribute parameters of the temperature, the pressure and the humidity comprise three levels of high, medium and low, and the categories comprise excellent, good and poor, assigned correspondingly to the temperature, the pressure and the humidity under their respective attribute parameter conditions;

selecting the process parameter with the maximum information gain as a splitting node, establishing a decision tree, wherein the process for establishing the decision tree comprises the following steps: selecting the process parameter with the maximum information gain in the training sample set as a root node, and taking attribute parameters of the process parameter of the root node in the training sample set as primary branch nodes respectively; selecting a process parameter with the second-highest information gain in a training sample set as a secondary branch node, and searching attribute parameters of the process parameter of the secondary branch node under the attribute parameters of the primary branch node in the training sample set to respectively serve as tertiary branch nodes; according to each training sample in the training sample set, selecting a process parameter with the smallest gain after part or all of the three-level branch nodes as a four-level branch node, and searching attribute parameters of the process parameter of the four-level branch node in the training sample set as five-level branch nodes respectively; if the categories from the root node to the tertiary branch node can be uniquely determined in a training sample library, setting the corresponding category as a leaf node behind the tertiary branch node, otherwise, setting the corresponding category as a leaf node behind the five-stage branch node according to the training sample library; the process of establishing the decision tree does not carry out discrete segmentation operation on any range of high, medium and low continuous values of each attribute parameter;

and identifying the new data according to the decision tree, wherein the new data comprises temperature, pressure and humidity parameters and corresponding new attribute parameters thereof, further identifying and obtaining the category corresponding to the attribute parameters of the new data, finding a first-stage branch node according to the corresponding attribute parameters of the process parameters of the root node in the new data, finding a third-stage branch node according to the corresponding attribute parameters of the second-stage branch node in the new data, identifying the category of the new data as the category corresponding to the directly connected leaf node if the third-stage branch node is directly connected with the leaf node in the decision tree, otherwise, finding a fifth-stage branch node and the corresponding leaf node according to the corresponding attribute parameters of the fourth-stage branch node in the new data, identifying the category of the new data as the category corresponding to the leaf node after the fifth-stage branch node, and feeding back or giving a warning to a user if the identified category is poor.

2. The method of claim 1, wherein the obtaining the distribution transfer information value of the training sample set includes: the distribution transfer information value is obtained by the following calculation formula:

$$\mathrm{info}(S) = -\sum_{i} P_i \log_2 P_i$$

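3. The method of claim 1, wherein the obtaining the information gain for each process parameter comprises: the information gain is obtained by the following calculation formula:

$$\mathrm{gain}(S, A) = \mathrm{info}(S) - \sum_{v \in \mathrm{Varies}(A)} \frac{|S_v|}{|S|}\,\mathrm{info}(S_v)$$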

4. The method of claim 1, wherein the composing the training sample set is preceded by: verifying the validity of the training sample, comprising: detecting whether the process parameters and the categories in each training sample are complete, and if not, rejecting the training samples; and/or

5. The method of claim 1, wherein if the root node and each branch node are the same and the leaf nodes are different, counting the number of leaf nodes under the root node and each branch node in the training sample set, if the counted number is consistent, randomly discarding any one of the leaf nodes, otherwise discarding the leaf node with the lower number.

6. A comprehensive detection device for temperature, pressure and humidity of multidimensional parameters in a PKS production section, which is characterized by comprising the following modules:

the data acquisition module is used for acquiring a plurality of training samples to form a training sample set, each training sample comprises a plurality of process parameters, each process parameter has corresponding attribute parameters and categories, wherein the process parameters in the same training sample are key process parameters obtained by screening according to key quality attributes, and the key quality attributes are attribute parameters selected according to working sections in a process knowledge system;

the decision tree module is used for selecting the process parameter with the maximum information gain as a splitting node, establishing a decision tree, and the process for establishing the decision tree comprises the following steps: selecting the process parameter with the maximum information gain in the training sample set as a root node, and taking attribute parameters of the process parameter of the root node in the training sample set as primary branch nodes respectively; selecting a process parameter with the second-highest information gain in a training sample set as a secondary branch node, and searching attribute parameters of the process parameter of the secondary branch node under the attribute parameters of the primary branch node in the training sample set to respectively serve as tertiary branch nodes; according to each training sample in the training sample set, selecting a process parameter with the smallest gain after part or all of the three-level branch nodes as a four-level branch node, and searching attribute parameters of the process parameter of the four-level branch node in the training sample set as five-level branch nodes respectively; if the categories from the root node to the tertiary branch node can be uniquely determined in a training sample library, setting the corresponding category as a leaf node behind the tertiary branch node, otherwise, setting the corresponding category as a leaf node behind the five-stage branch node according to the training sample library; the process of establishing the decision tree does not carry out discrete segmentation operation on any range of high, medium and low continuous values of each attribute parameter;

the identification module is used for carrying out category identification on new data according to the decision tree, wherein the new data comprises temperature, pressure and humidity parameters and corresponding new attribute parameters thereof, further identification is carried out to obtain categories corresponding to the attribute parameters of the new data, the identification module comprises the steps of finding a first-level branch node according to the corresponding attribute parameters of the process parameters of the root node in the new data, finding a third-level branch node according to the corresponding attribute parameters of the process parameters of the second-level branch node in the new data, if the third-level branch node is directly connected with a leaf node in the decision tree, identifying the categories of the new data as the categories corresponding to the directly connected leaf node, otherwise, finding a fifth-level branch node and the corresponding leaf node according to the corresponding attribute parameters of the fourth-level branch node in the new data, identifying the categories corresponding to the leaf node after the fifth-level branch node, and if the identified categories are poor, feeding back or giving a warning to a user.

7. The apparatus as recited in claim 6, further comprising:

8. The apparatus of claim 6, wherein the distribution transfer module obtains the distribution transfer information value by the following calculation formula:

$$\mathrm{info}(S) = -\sum_{i} P_i \log_2 P_i$$

wherein info(S) is the distribution transfer information value of the training sample set, info($S_v$) is the distribution transfer information value of a certain attribute parameter, $\frac{|S_v|}{|S|}$ is the probability that the certain attribute parameter is in a certain category, and Varies(A) is the set of attribute parameters.