CN109101626A - High-dimensional data key feature extraction method based on an improved minimum spanning tree - Google Patents

High-dimensional data key feature extraction method based on an improved minimum spanning tree Download PDF

Info

Publication number
CN109101626A
CN109101626A
Authority
CN
China
Prior art keywords
attribute
characteristic
data
value
spanning tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810917990.3A
Other languages
Chinese (zh)
Inventor
刘斌
黄卫华
王昳晗
蒋峥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Science and Engineering WUSE
Wuhan University of Science and Technology WHUST
Original Assignee
Wuhan University of Science and Engineering WUSE
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University of Science and Engineering WUSE filed Critical Wuhan University of Science and Engineering WUSE
Priority to CN201810917990.3A priority Critical patent/CN109101626A/en
Publication of CN109101626A publication Critical patent/CN109101626A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a high-dimensional data key feature extraction method based on an improved minimum spanning tree, comprising: step 1, preprocessing the hot-rolled strip data, including the steps of data cleaning, data integration, and discretization of continuous attributes; step 2, removal of irrelevant features; step 3, construction of the minimum spanning tree; and step 4, partitioning the minimum spanning tree, removing redundant attributes, and extracting the key feature variables. The present invention effectively avoids failure of the de-redundancy operation caused by single-point multi-divergence, and significantly improves the efficiency of extracting the key feature attributes that influence the finishing temperature, thereby improving the modeling accuracy of the hot-rolling finishing temperature and the reliability of rolling control.

Description

High-dimensional data key feature extraction method based on an improved minimum spanning tree
Technical field
The present invention relates to a high-dimensional data key feature extraction method based on an improved minimum spanning tree, and belongs to the field of high-dimensional data mining.
Background technique
In modern industrial production there is a large class of production objects with complex industrial characteristics: they generally exhibit drastic changes in operating conditions, strong nonlinearity, tight coupling, and time-varying parameters, and their mathematical models are difficult to describe accurately. Existing control methods cannot adapt to frequently changing operating conditions and rely excessively on the accuracy of the plant model, which easily leads to low control precision and poor tracking of the setpoint signal, so they cannot fully meet the demands of modern industrial production. At the same time, a large amount of industrial process data is accumulated during production; this data hides very rich information and knowledge and can directly reflect the relationship between each variable and the prediction target. Advances in computer technology have made collecting such process data increasingly easy, but the complexity of the operating conditions makes the database ever larger, and ultimately, under the influence of the "curse of dimensionality", mining high-dimensional data becomes extraordinarily difficult.
Hot-rolled strip is one of the most important steel products. Since the finishing temperature largely determines the mechanical and structural properties of the strip product, the prediction and control of the finishing temperature has always been a research focus of the steel industry. Because of the harsh environment of the entire hot-rolling production area, temperature instruments cannot be installed at every point and the strip temperature is difficult to measure continuously; the production process involves numerous process parameters whose relationships with the finishing temperature are extremely complex. Existing process models struggle to adapt to frequently changing operating conditions, and the large number of influencing factors is clearly unfavorable to improving the accuracy of hot-rolled temperature prediction models. Therefore, before modeling the finishing-mill data, it is necessary to analyze the degree to which each factor influences the finishing temperature, reduce the redundancy between feature data, and effectively extract the key feature variables, so as to improve the prediction accuracy of the model while reducing its complexity.
Regarding high-dimensional data key feature extraction methods for complex industrial objects, a literature search found the following related patent: invention patent application No. 201610298079X, "Feature selection method and device for high-dimensional data", published on September 28, 2016. It provides a feature selection method and device for high-dimensional data, introducing the maximal information coefficient (Maximal Information Coefficient, MIC) into feature selection, evaluating features based on MIC, and selecting features according to the effective values produced by the evaluation.
However, the above patent has two major defects: (1) it does not solve the problem of preprocessing high-dimensional sample data; when the sample data contain missing values and outliers, those samples most likely cannot be used directly for data mining and analysis, which reduces the analyzability of the data; (2) it does not take the application background of the data set into account; the designed sample-data processing algorithm needs to be analyzed and adjusted according to the data set required by the actual situation before satisfactory feature extraction results can be obtained.
Summary of the invention
The purpose of the present invention is to provide a high-dimensional data key feature extraction method based on an improved minimum spanning tree, so as to solve the above problems.
The present invention adopts the following technical solution:
A high-dimensional data key feature extraction method based on an improved minimum spanning tree, characterized by comprising:
Step 1: preprocess the hot-rolled strip data, including the steps of data cleaning, data integration, and discretization of continuous attributes.
Data cleaning: the operation of searching for and deleting the outliers in the hot-rolling process data.
Data integration: merge the data in multiple data sources and store them in a data set with a consistent structure.
Discretization of continuous attributes: discretize the finishing temperature with a nonlinear partitioning method.
Step 2: removal of irrelevant features.
If X and Y are discrete random variables, the symmetric uncertainty is:

$SU(X, Y) = \dfrac{2\,Gain(X \mid Y)}{H(X) + H(Y)}$

where H(X) is the entropy of the discrete random variable X; assuming p(x) is the prior probability of each value of X, then:

$H(X) = -\sum_{x} p(x) \log_2 p(x)$

For a data set D with a feature set F = {F_1, F_2, ..., F_m} of m feature attributes and a target attribute C, in order to identify the feature attributes in the feature set irrelevant to the target attribute, the value SU(F_i, C) between each feature attribute F_i (1 ≤ i ≤ m) and the target attribute C is computed first. If the SU(F_i, C) value of a feature attribute F_i is greater than a predefined relevance threshold θ, it is considered a feature attribute relevant to the target attribute; the feature attributes meeting this condition are extracted to form a new feature attribute subset F' = {F'_1, F'_2, ..., F'_k} (k ≤ m).
Step 3: construct the minimum spanning tree.
First, the following definitions and decision conditions are given.
Definition 1: SU(F_i, C) is the correlation between feature attribute F_i ∈ F and target attribute C. If SU(F_i, C) is greater than the predefined threshold θ, F_i is considered a feature attribute relevant to target attribute C.
Definition 2: SU(F_i, F_j) is the correlation between a pair of attributes F_i and F_j (F_i, F_j ∈ F ∧ i ≠ j).
Definition 3: suppose S = {F_1, F_2, ..., F_k} (k < |F|) is a feature cluster. If there exists F_j ∈ S such that, for every F_i ∈ S (i ≠ j), the condition [SU(F_j, C) ≥ SU(F_i, C)] ∧ [SU(F_i, F_j) > SU(F_i, C)] always holds, then F_i is considered redundant with respect to F_j.
Definition 4: a feature attribute F_i ∈ S = {F_1, F_2, ..., F_k} (k < |F|) is regarded as the representative feature of S if and only if $F_i = \arg\max_{F_j \in S} SU(F_j, C)$; that is, the feature attribute possessing the largest SU(F_j, C) value in feature set S serves as the representative feature of this feature cluster.
In Definitions 1 to 4, ∧ denotes logical AND, and |F| denotes the number of attributes contained in feature set F. According to the above, the process of feature subset selection, i.e., of key feature extraction, can be converted into identifying and retaining the feature attributes satisfying the condition SU(F_i, C) ≥ θ in Definition 1, and then selecting the representative feature within each feature cluster.
Decision condition 1: suppose F'_i (i ∈ [1, k]) is a feature attribute in feature attribute set F'. If the symmetric uncertainty SU(F'_i, F'_j) between it and another feature attribute F'_j (j ∈ [1, k] ∧ i ≠ j) in the feature set is the smallest among the symmetric uncertainty values involving F'_j, then F'_i is considered the minimal-redundancy feature attribute of F'_j within the entire feature attribute set.
Decision condition 2: if feature attribute F'_i (i ∈ [1, k]) is the minimal-redundancy feature attribute of at least δ × k feature attributes in the set, this attribute is considered to cause a single-point multi-divergence structure during minimum spanning tree construction, where δ ∈ (0, 1) is a predefined value.
Then, based on decision conditions 1 and 2, the specific steps of the minimum spanning tree construction module of the present invention are as follows:
(1) For feature attribute set F' = {F'_1, F'_2, ..., F'_k}, take the correlation measure SU(F'_i, C) (i ∈ [1, k]) between feature attribute F'_i and target attribute C as the weight of the corresponding node of the connected graph, and the correlation measure SU(F'_i, F'_j) (i, j ∈ [1, k] ∧ i ≠ j) between feature attributes as the weight of the corresponding edge, and construct a connected undirected graph G. Build a minimum spanning tree of G with the classical Prim algorithm, which keeps all nodes connected while minimizing the total weight of the edges in the tree.
(2) Define a variable n with initial value 0. For a feature attribute F'_i (i ∈ [1, k]) in F', apply decision condition 1 against every other feature attribute in F' in turn; each time F'_i is the minimal-redundancy attribute of another feature attribute, add 1 to n.
(3) Evaluate the n value of F'_i against decision condition 2. If n ≥ δ × k, extract F'_i separately from the connected graph G and add it to the final feature attribute subset; otherwise (n < δ × k), do not process F'_i, and perform steps (2) and (3) on another feature attribute F'_j (j ∈ [1, k] ∧ i ≠ j) that has not been evaluated yet.
After all feature attributes in F' have been evaluated, the construction of the minimum spanning tree is complete.
Step 4: partition the minimum spanning tree, remove redundant attributes, and extract the key feature variables.
After the construction of the minimum spanning tree, analyze every edge E = {(F'_i, F'_j) | F'_i, F'_j ∈ F' ∧ i, j ∈ [1, k] ∧ i ≠ j} of the minimum spanning tree. If [SU(F'_i, F'_j) < SU(F'_i, C)] ∧ [SU(F'_i, F'_j) < SU(F'_j, C)] holds, i.e., the weight SU(F'_i, F'_j) of the edge is smaller than the correlation measures SU(F'_i, C) and SU(F'_j, C) between the nodes at its two ends and the target attribute, then the edge is removed from the minimum spanning tree.
During this process, each edge removal splits one tree into two subtrees. After all removal operations finish, for each subtree with node set V(T), every pair of nodes F'_i, F'_j ∈ V(T) fails the removal condition, i.e., satisfies [SU(F'_i, F'_j) ≥ SU(F'_i, C)] ∨ [SU(F'_i, F'_j) ≥ SU(F'_j, C)]; according to Definition 3, all feature attributes in the feature attribute set corresponding to node set V(T) are judged to be mutually redundant with respect to target attribute C.
After the cutting of the tree is complete, a forest containing multiple trees is obtained. Suppose the resulting forest contains n trees in total. For each subtree T_i ∈ {T_1, T_2, ..., T_n}, all feature attributes contained in the tree are mutually redundant; according to Definition 4, extract from each tree the feature attribute with the largest SU(F_j, C) value as the representative feature attribute. This yields a feature set containing n key feature attributes in which no mutually redundant attributes remain, completing the removal of redundant attributes.
Further, the high-dimensional data key feature extraction method based on an improved minimum spanning tree of the present invention may also have the following feature:
In step 1, the method of searching for outliers is: since each attribute of the data has a reasonable range in the actual production process, outliers are searched using an upper-and-lower-bound method.
Further, the method of the present invention may also have the following feature: in step 1, after data integration is completed, each record in the data table takes the strip number as its unique index and contains all the process information collected while the strip is rolled through each stand, including the finishing temperature, inter-stand cooling water, reduction, and strip width and thickness.
Further, the method of the present invention may also have the following feature:
In step 1, the finishing temperature is taken as the target attribute, and the other attributes in the data set are taken as feature attributes.
The numerical range of the finishing temperature is divided into five regions symmetric about the finishing temperature target value T_0: (0, T_0 - 4α), (T_0 - 4α, T_0 - α), (T_0 - α, T_0 + α), (T_0 + α, T_0 + 4α), and (T_0 + 4α, +∞).
The feature attributes are discretized using the minimum description length algorithm.
First, define the class entropy of sample set S as

$Ent(S) = -\sum_{i=1}^{K} P(C_i, S) \log_2 P(C_i, S)$

where S is the data set, K is the number of classes {C_1, ..., C_K} in target attribute C, and P(C_i, S) is the proportion of samples in S belonging to class C_i.
Then define the entropy of the data set after partitioning as

$E(A, T; S) = \dfrac{|S_1|}{|S|} Ent(S_1) + \dfrac{|S_2|}{|S|} Ent(S_2)$

where |S| is the number of samples in data set S, A is a feature attribute in the data set, T is the cut point under evaluation, and S_1, S_2 are the two data sets after data set S is partitioned.
The information gain corresponding to the determined cut point is then

$Gain(A, T; S) = Ent(S) - E(A, T; S)$

According to the minimum description length criterion, formula (4) is obtained:

$Gain(A, T; S) > \dfrac{\log_2(N - 1)}{N} + \dfrac{\Delta(A, T; S)}{N} \quad (4)$

where N = |S|, Δ(A, T; S) = log_2(3^K - 2) - [K·Ent(S) - K_1·Ent(S_1) - K_2·Ent(S_2)], K is the number of classes contained in the original data set, and K_1, K_2 are the numbers of classes contained in the two subsets S_1 and S_2, respectively.
Formula (4) is the decision condition of the MDL algorithm for cut points, called the MDLPC criterion. Each value of attribute A is evaluated according to the MDLPC criterion, and the values satisfying the condition are cut points.
Advantageous effects of the invention
The present invention proposes a high-dimensional data key feature extraction method based on an improved minimum spanning tree and applies it to the key feature extraction process for the finishing temperature. Missing values and outliers in the high-dimensional data are preprocessed, and preprocessing operations such as data integration and discretization of continuous attributes improve the analyzability of the data. Considering that actual finishing-mill data are typical high-dimensional data that easily cause single-point multi-divergence in the minimum spanning tree, decision conditions are provided to extract in advance the feature attributes that would cause single-point multi-divergence structures, so that they are not added to the minimum spanning tree construction. This effectively avoids failure of the de-redundancy operation caused by single-point multi-divergence and significantly improves the efficiency of extracting the key feature attributes that influence the finishing temperature, thereby improving the modeling accuracy of the hot-rolling finishing temperature and the reliability of rolling control.
The present invention therefore helps realize high accuracy of the target prediction model and the effectiveness of multi-objective optimal control strategies, and is of great significance for improving product quality.
Brief description of the drawings
Fig. 1 is a schematic diagram of the discretization of the finishing temperature;
Fig. 2 is a schematic diagram of the minimum spanning tree of the present invention;
Fig. 3 shows the number of attributes in the feature subset for different values of θ;
Fig. 4 shows the total redundancy R_sum of the feature subset for different values of θ.
Specific embodiments
A specific embodiment of the present invention is described below with reference to the drawings.
<Embodiment one>
The high-dimensional data key feature extraction method based on an improved minimum spanning tree comprises the following steps:
Step 1: preprocess the hot-rolled strip data, including data cleaning, data integration, and discretization of continuous attributes.
1. Data cleaning
For the outliers in the hot-rolling process data: since each attribute of the data has its own reasonable range in the actual production process, outliers are searched using an upper-and-lower-bound method. That is, reasonable upper and lower bounds are set for the value of each feature attribute according to prior knowledge of the actual production process, and data outside this reasonable range are regarded as outliers. Since the data samples containing missing values and outliers account for a very small proportion of the data set, and the data of individual strips are mutually independent, these abnormal strip data samples are simply deleted. A minimal sketch of this step is given below.
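The following pandas sketch illustrates the upper-and-lower-bound search; the column names and bound values are illustrative assumptions, not figures from the patent.

```python
import pandas as pd

# Hypothetical reasonable ranges for a few process attributes; real limits
# would come from prior knowledge of the finish-rolling process.
BOUNDS = {
    "finishing_temp": (780.0, 920.0),  # degrees Celsius
    "screw_down":     (0.0, 60.0),     # stand reduction, mm
    "exit_thick":     (1.0, 25.0),     # strip exit thickness, mm
}

def drop_out_of_range(df: pd.DataFrame, bounds: dict) -> pd.DataFrame:
    """Delete whole strip records whose values fall outside the reasonable
    range of any attribute (upper-and-lower-bound search), together with
    records containing missing values."""
    mask = pd.Series(True, index=df.index)
    for col, (lo, hi) in bounds.items():
        mask &= df[col].between(lo, hi)
    # Strip records are mutually independent, so abnormal samples are
    # simply deleted rather than imputed.
    return df[mask].dropna()
```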
2. data integration
Data in multiple data sources are merged, the process being stored in a consistent data set of structure.Course of hot rolling Initial data, which is concentrated, contains two tables of data, is section actual value table and rack actual value table respectively.In conjunction with finish rolling production technology mistake Journey, every strip are divided into leading portion, middle section, this three sections of back segment during the rolling process, and need to be by the rolling of 7 racks in total. Since the plate state of section strip steel middle in the operation of rolling is more stable compared to other both ends, the data information of strip is more represented Property, select strip Mid-Section Data as data to be analyzed.
By data integration, multiple process data tables are integrated into a new data table.Each data in tables of data, with Band grade of steel is unique index, rolls all procedural informations collected, including finish to gauge temperature by 7 racks including the strip Degree, rack water, drafts, strip width thickness etc..Part attribute is discrete variable, and value is discrete magnitude, and most attributes are Continuous variable.
The finishing temperature is taken as the target attribute, and the other attributes in the data set are taken as feature attributes.
3. Discretization of continuous attributes
The target attribute finishing temperature is a continuous process variable, and according to the data classification requirements its values need to be discretized. A nonlinear partitioning method is used to discretize the finishing temperature. Based on the distribution of the finishing temperature in the actual finishing-mill data, the numerical range of the finishing temperature is divided into five regions symmetric about the finishing temperature target value. To make the prediction of the finishing temperature more accurate, the interval containing the finishing temperature target value is appropriately narrowed when the five regions are divided; the discretization scheme is shown in Fig. 1.
Here T_0 is the finishing temperature target value and α is the temperature deviation; the values of these two variables depend on the actual finish-rolling process.
After the discretization of the target attribute finishing temperature, its value is divided into several discrete values, as sketched below.
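The sketch below illustrates the five-region partition of Fig. 1, assuming illustrative values for T_0 and α (the patent leaves both to the actual finish-rolling process):

```python
import numpy as np
import pandas as pd

def discretize_finishing_temp(temps, T0=860.0, alpha=5.0):
    """Map finishing temperatures onto the five symmetric regions
    (0, T0-4a), (T0-4a, T0-a), (T0-a, T0+a), (T0+a, T0+4a), (T0+4a, inf).
    T0 and alpha here are illustrative, not values from the patent."""
    edges = [0.0, T0 - 4 * alpha, T0 - alpha,
             T0 + alpha, T0 + 4 * alpha, np.inf]
    return pd.cut(temps, bins=edges, labels=[0, 1, 2, 3, 4])

# e.g. discretize_finishing_temp([838.0, 859.2, 883.5]) -> classes 0, 2, 4
```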
For the feature attributes in the data set, the minimum description length (Minimum Description Length, MDL) algorithm is used for discretization.
First, define the class entropy of sample set S as

$Ent(S) = -\sum_{i=1}^{K} P(C_i, S) \log_2 P(C_i, S) \quad (1)$

where S is the data set, K is the number of classes {C_1, ..., C_K} in target attribute C, and P(C_i, S) is the proportion of samples in S belonging to class C_i.
Then define the entropy of the data set after partitioning as

$E(A, T; S) = \dfrac{|S_1|}{|S|} Ent(S_1) + \dfrac{|S_2|}{|S|} Ent(S_2) \quad (2)$

where |S| is the number of samples in data set S, A is a feature attribute in the data set, T is the cut point under evaluation, and S_1, S_2 are the two data sets after data set S is partitioned.
The information gain corresponding to the determined cut point is then

$Gain(A, T; S) = Ent(S) - E(A, T; S) \quad (3)$

According to the minimum description length criterion:

$Gain(A, T; S) > \dfrac{\log_2(N - 1)}{N} + \dfrac{\Delta(A, T; S)}{N} \quad (4)$

where N = |S|, Δ(A, T; S) = log_2(3^K - 2) - [K·Ent(S) - K_1·Ent(S_1) - K_2·Ent(S_2)], K is the number of classes contained in the original data set, and K_1, K_2 are the numbers of classes contained in the two subsets S_1 and S_2, respectively.
Formula (4) is the decision condition of the MDL algorithm for cut points, called the MDLPC criterion. Each value of attribute A is evaluated according to the MDLPC criterion, and the values satisfying the condition are cut points.
Compared with traditional discretization methods, which split a continuous attribute by binary splitting in iterative loops, the MDL algorithm can split a continuous attribute into multiple discrete regions within one discretization pass, which reduces the computational complexity of the algorithm. The present invention therefore uses the MDL algorithm to discretize the feature attributes in the actual finishing-mill data. A sketch of the criterion follows.
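Below is a minimal sketch of the MDLPC acceptance test of formulas (1)-(4) for a single candidate cut point, assuming the standard Fayyad-Irani form of the criterion; the function names are our own.

```python
import numpy as np
from collections import Counter

def ent(labels):
    """Class entropy Ent(S) = -sum P(Ci,S) * log2 P(Ci,S), formula (1)."""
    n = len(labels)
    return -sum((c / n) * np.log2(c / n) for c in Counter(labels).values())

def mdlpc_accepts(values, labels, t):
    """Check the MDLPC criterion (4) for cut point t of attribute A:
    Gain(A,T;S) > log2(N-1)/N + Delta(A,T;S)/N."""
    values = np.asarray(values)
    labels = np.asarray(labels)
    left, right = labels[values <= t], labels[values > t]
    n = len(labels)
    e_s, e1, e2 = ent(labels), ent(left), ent(right)
    # Formulas (2) and (3): split entropy and information gain.
    gain = e_s - (len(left) / n) * e1 - (len(right) / n) * e2
    k, k1, k2 = (len(set(x)) for x in (labels, left, right))
    delta = np.log2(3 ** k - 2) - (k * e_s - k1 * e1 - k2 * e2)
    return gain > np.log2(n - 1) / n + delta / n
```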
Step 2: removal of irrelevant features
Before describing the method for removing irrelevant feature attributes, the definitions of information gain and symmetric uncertainty need to be introduced.
Information gain: let the information gain Gain(X|Y) denote the information increment of X given Y; then

$Gain(X \mid Y) = H(X) - H(X \mid Y) \quad (5)$

where H(X|Y) is the conditional entropy, i.e., the entropy of random variable X given the value of random variable Y. Assuming p(x) is the prior probability of each value of X and p(x|y) the posterior probability of X given the value of Y, then

$H(X \mid Y) = -\sum_{y} p(y) \sum_{x} p(x \mid y) \log_2 p(x \mid y) \quad (6)$

Information gain is a symmetric measure, which guarantees that the order of X and Y does not affect the computed result, and also makes the symmetric uncertainty symmetric for a pair of attributes.
Symmetric uncertainty: let X and Y be discrete random variables; the symmetric uncertainty (Symmetric Uncertainty, SU) is

$SU(X, Y) = \dfrac{2\,Gain(X \mid Y)}{H(X) + H(Y)} \quad (7)$

where H(X) is the entropy of the discrete random variable X; assuming p(x) is the prior probability of each value of X, then

$H(X) = -\sum_{x} p(x) \log_2 p(x) \quad (8)$

Symmetric uncertainty normalizes the value of information gain into the interval [0, 1]: SU(X, Y) = 1 means that the value of either variable completely predicts the value of the other, while SU(X, Y) = 0 means the two variables are completely uncorrelated. These quantities can be computed as sketched below.
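A self-contained Python sketch of formulas (5)-(8) for discrete samples (function names are our own):

```python
import numpy as np
from collections import Counter

def entropy(x):
    """H(X) = -sum p(x) log2 p(x) over the observed values of X."""
    n = len(x)
    return -sum((c / n) * np.log2(c / n) for c in Counter(x).values())

def cond_entropy(x, y):
    """H(X|Y) = -sum_y p(y) sum_x p(x|y) log2 p(x|y)."""
    n = len(y)
    h = 0.0
    for yv, cy in Counter(y).items():
        xs = [xi for xi, yi in zip(x, y) if yi == yv]
        h += (cy / n) * entropy(xs)
    return h

def su(x, y):
    """Symmetric uncertainty SU(X,Y) = 2*Gain(X|Y) / (H(X)+H(Y))."""
    hx, hy = entropy(x), entropy(y)
    gain = hx - cond_entropy(x, y)
    return 2.0 * gain / (hx + hy) if hx + hy > 0 else 0.0

# Toy usage: 1.0 means perfect mutual prediction, 0.0 means uncorrelated.
print(su([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 1, 1]))
```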
For a data set D containing m feature attributes F = {F_1, F_2, ..., F_m} and a target attribute C, in order to identify the irrelevant attributes in the attribute set, the value SU(F_i, C) between each feature attribute F_i (1 ≤ i ≤ m) and the target attribute C is computed first. If the SU(F_i, C) value of a feature attribute F_i is greater than a predefined relevance threshold θ, it is considered a feature attribute relevant to the target attribute; the feature attributes meeting this condition are extracted to form a new feature attribute subset F' = {F'_1, F'_2, ..., F'_k} (k ≤ m).
All features in the new feature attribute subset are then attributes with a certain correlation with the target attribute, which guarantees that all attributes irrelevant to the target attribute are removed, as in the filtering sketch below.
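A minimal filtering sketch, assuming the SU values have already been computed as above; the attribute names, SU numbers, and θ are illustrative only:

```python
# Precomputed SU(Fi, C) values for each candidate feature attribute
# (names and numbers here are illustrative, not from the patent).
su_with_target = {
    "SCREW_DOWN_3":  0.41,  # reduction of stand 3
    "SPEED_7":       0.38,  # rolling speed of stand 7
    "EXIT_THICK_5":  0.22,  # exit thickness of stand 5
    "STAND_WATER_1": 0.02,  # inter-stand cooling water of stand 1
}
theta = 0.10  # predefined relevance threshold

# Keep only the attributes relevant to the target attribute C.
f_prime = [f for f, s in su_with_target.items() if s > theta]
# -> ['SCREW_DOWN_3', 'SPEED_7', 'EXIT_THICK_5']
```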
Step 3: construct the minimum spanning tree
First, the following definitions and decision conditions are given.
Definition 1: SU(F_i, C) is the correlation between feature attribute F_i ∈ F and target attribute C. If SU(F_i, C) is greater than the predefined threshold θ, F_i is considered a feature attribute relevant to target attribute C.
Definition 2: SU(F_i, F_j) is the correlation between a pair of attributes F_i and F_j (F_i, F_j ∈ F ∧ i ≠ j).
Definition 3: suppose S = {F_1, F_2, ..., F_k} (k < |F|) is a feature cluster. If there exists F_j ∈ S such that, for every F_i ∈ S (i ≠ j), the condition [SU(F_j, C) ≥ SU(F_i, C)] ∧ [SU(F_i, F_j) > SU(F_i, C)] always holds, then F_i is considered redundant with respect to F_j.
Definition 4: a feature attribute F_i ∈ S = {F_1, F_2, ..., F_k} (k < |F|) is regarded as the representative feature of S if and only if $F_i = \arg\max_{F_j \in S} SU(F_j, C)$; that is, the feature attribute possessing the largest SU(F_j, C) value in feature set S serves as the representative feature of this feature cluster.
In Definitions 1 to 4, ∧ denotes logical AND, and |F| denotes the number of attributes contained in feature set F. According to the above, the process of feature subset selection, i.e., of key feature extraction, can be converted into identifying and retaining the feature attributes satisfying the condition SU(F_i, C) ≥ θ in Definition 1, and then selecting the representative feature within each feature cluster.
Decision condition 1: suppose F'_i (i ∈ [1, k]) is a feature attribute in feature attribute set F'. If the symmetric uncertainty SU(F'_i, F'_j) between it and another feature attribute F'_j (j ∈ [1, k] ∧ i ≠ j) in the feature set is the smallest among the symmetric uncertainty values involving F'_j, then F'_i is considered the minimal-redundancy feature attribute of F'_j within the entire feature attribute set.
Decision condition 2: if feature attribute F'_i (i ∈ [1, k]) is the minimal-redundancy feature attribute of at least δ × k feature attributes in the set (where δ ∈ (0, 1) is a predefined value), this attribute is considered to cause a single-point multi-divergence structure during minimum spanning tree construction.
Then, based on decision conditions 1 and 2, the specific steps of the minimum spanning tree construction module of the present invention are as follows:
(1) For feature attribute set F' = {F'_1, F'_2, ..., F'_k}, take the correlation measure SU(F'_i, C) (i ∈ [1, k]) between feature attribute F'_i and target attribute C as the weight of the corresponding node of the connected graph, and the correlation measure SU(F'_i, F'_j) (i, j ∈ [1, k] ∧ i ≠ j) between feature attributes as the weight of the corresponding edge, and construct a connected undirected graph G. Build a minimum spanning tree of G with the classical Prim algorithm, which keeps all nodes connected while minimizing the total weight of the edges in the tree.
(2) Define a variable n with initial value 0. For a feature attribute F'_i (i ∈ [1, k]) in F', apply decision condition 1 against every other feature attribute in F' in turn; each time F'_i is the minimal-redundancy attribute of another feature attribute, add 1 to n.
(3) Evaluate the n value of F'_i against decision condition 2. If n ≥ δ × k, extract F'_i separately from the connected graph G and add it to the final feature attribute subset; otherwise (n < δ × k), do not process F'_i, and perform steps (2) and (3) on another feature attribute F'_j (j ∈ [1, k] ∧ i ≠ j) that has not been evaluated yet.
After all feature attributes in F' have been evaluated, the construction of the minimum spanning tree is complete. A sketch of this construction is given below.
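The sketch below is one possible reading of steps (1)-(3), assuming the attributes promoted by decision condition 2 are removed before the tree is built; the su_ff/su_fc dictionaries of pairwise and target SU values are our own names, and Prim's algorithm comes from the networkx library.

```python
import itertools
import networkx as nx

def build_pruned_mst(features, su_ff, su_fc, delta=0.3):
    """Improved MST construction (sketch).

    features: list of feature attribute names (the set F')
    su_ff:    dict mapping frozenset({Fi, Fj}) -> SU(Fi, Fj)
    su_fc:    dict mapping Fi -> SU(Fi, C)
    delta:    the predefined value of decision condition 2 (illustrative)
    """
    k = len(features)
    # Decision condition 1: F'i is the minimal-redundancy attribute of F'j
    # when SU(F'i, F'j) is the smallest SU value involving F'j.
    n_count = dict.fromkeys(features, 0)
    for fj in features:
        others = [fi for fi in features if fi != fj]
        fi_min = min(others, key=lambda fi: su_ff[frozenset((fi, fj))])
        n_count[fi_min] += 1
    # Decision condition 2: attributes that would cause single-point
    # multi-divergence go straight into the final subset.
    promoted = [f for f in features if n_count[f] >= delta * k]
    rest = [f for f in features if f not in promoted]
    graph = nx.Graph()
    for f in rest:
        graph.add_node(f, weight=su_fc[f])  # node weight SU(F'i, C)
    for fi, fj in itertools.combinations(rest, 2):
        graph.add_edge(fi, fj, weight=su_ff[frozenset((fi, fj))])
    mst = nx.minimum_spanning_tree(graph, algorithm="prim")
    return mst, promoted
```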
Step 4: partition the minimum spanning tree, remove redundant attributes, and extract the key feature variables.
After the construction of the minimum spanning tree, analyze every edge E = {(F'_i, F'_j) | F'_i, F'_j ∈ F' ∧ i, j ∈ [1, k] ∧ i ≠ j} of the minimum spanning tree. If

[SU(F'_i, F'_j) < SU(F'_i, C)] ∧ [SU(F'_i, F'_j) < SU(F'_j, C)]    (9)

holds, i.e., the weight SU(F'_i, F'_j) of the edge is smaller than the correlation measures SU(F'_i, C) and SU(F'_j, C) between the nodes at its two ends and the target attribute, then this edge is removed from the minimum spanning tree.
During this process, each edge removal splits one tree into two subtrees. After all removal operations finish, for each subtree with node set V(T), it can be found that every pair of nodes F'_i, F'_j ∈ V(T) fails condition (9), i.e., satisfies [SU(F'_i, F'_j) ≥ SU(F'_i, C)] ∨ [SU(F'_i, F'_j) ≥ SU(F'_j, C)]. According to Definition 3 above, all feature attributes in the feature attribute set corresponding to node set V(T) are considered mutually redundant with respect to target attribute C.
After the cutting of the tree is complete, a forest containing multiple trees is obtained. Suppose the resulting forest contains n trees in total; for each subtree T_i ∈ {T_1, T_2, ..., T_n}, all feature attributes contained in the tree are mutually redundant. According to Definition 4, it is only necessary to extract from each tree the feature attribute with the largest SU(F_j, C) value as the representative feature attribute. A feature set containing n key feature attributes is thus obtained; clearly no mutually redundant attributes remain in this feature set, which completes the removal of redundant attributes. A sketch of this partitioning step follows.
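A sketch of the partitioning step, under the same assumptions as the construction sketch above (su_fc maps each attribute to SU(F, C)):

```python
import networkx as nx

def extract_representatives(mst: nx.Graph, su_fc: dict) -> list:
    """Partition the MST and keep one representative per subtree.

    An edge (F'i, F'j) is deleted when its weight SU(F'i, F'j) is smaller
    than both SU(F'i, C) and SU(F'j, C) (condition (9)); each remaining
    connected component is a cluster of mutually redundant attributes,
    represented by its attribute with the largest SU(F, C)."""
    forest = mst.copy()
    for fi, fj, w in list(mst.edges(data="weight")):
        if w < su_fc[fi] and w < su_fc[fj]:
            forest.remove_edge(fi, fj)
    return [max(component, key=su_fc.get)
            for component in nx.connected_components(forest)]
```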
<Embodiment two>
The algorithm flow of the present invention is as follows:
Step 1: initialization; define the input and output data sets, and complete data cleaning, data integration, and attribute discretization;
Step 2: remove the irrelevant attributes;
Step 3: construct the minimum spanning tree;
Step 4: complete the partitioning of the minimum spanning tree, and select the key (representative) features based on the correlation measure between attributes.
First, the following variable assumptions are made: let TR be the intermediate variable for computing the symmetric uncertainty between a feature attribute F_i and the target attribute C, and FC the intermediate variable for computing the symmetric uncertainty between two feature attributes F_i and F_j; for each subtree T_i, record the feature attribute with the largest symmetric uncertainty value with target attribute C; and let k be the number of nodes in the connected graph G. Based on the above variable assumptions, the detailed flow of the key feature extraction method proposed in the present invention is as follows:
The experimental data come from the production of a certain strip product at a domestic hot-rolling mill, and feature subset extraction is performed with the finishing temperature as the target attribute. 500 data samples were randomly extracted from the preprocessed data table as experimental data; based on hot-rolling process knowledge, obviously redundant and irrelevant feature attributes had already been eliminated from this table, which contains 57 feature attributes and one target attribute in total. The feature set after removal of irrelevant attributes is shown in Table 1. The relevance threshold θ is taken as the SU(F_i, C) (i ∈ [1, m]) value at a given rank of the SU values sorted over the entire feature set; with the number of feature attributes m = 57, θ is taken as the 13th-ranked SU value.
After feature subset extraction was performed on this data table with the common FAST algorithm, the extracted key feature subset is as shown in Table 2. It can be seen that the feature subset extracted by the common FAST (Fast clustering-bAsed feature Selection AlgoriThm) algorithm contains 13 feature attributes, among them the reductions of 7 stands, the exit thicknesses of 5 stands, and the rolling speed of one stand. This shows that the de-redundancy operation of the FAST algorithm fails because of the appearance of the single-point multi-divergence problem, resulting in low key feature extraction efficiency.
Table 1: Feature set after removal of irrelevant attributes
In Table 1, SCREW_DOWN denotes the reduction, SPEED denotes the rolling speed, and EXIT_THICK denotes the strip exit thickness at a stand.
Table 2: Feature subset extracted by the FAST algorithm
Feature subset extraction with the method of this patent proceeds as follows:
First, the predefined relevance threshold θ used in the irrelevant-attribute removal process of the two algorithms is taken to be the same as the threshold proposed in FAST, and the parameter δ in the single-point multi-divergence decision condition of this patent's algorithm takes a predefined fixed value.
Second, the minimum spanning tree (including its subtrees) constructed by this patent's algorithm is shown in Fig. 2.
Third, with symmetric uncertainty as the correlation measure between attributes, the extracted key features are shown in Table 3.
Table 3: Feature subset extracted by the algorithm provided by this patent
In the table, SCREW_DOWN denotes the reduction, SPEED denotes the rolling speed, and EXIT_THICK denotes the strip exit thickness at a stand.
Finally, an overall redundancy comparison is made for the following three cases: the original feature attribute set, the subset extracted by the original FAST algorithm, and the subset extracted by the algorithm of this patent; the results are shown in Table 4. Suppose the feature attribute set corresponding to data set D is F = {F_1, F_2, ..., F_m}; the overall redundancy R_sum of this feature set is defined in terms of the pairwise symmetric uncertainties, where SU(F_a, F_b) is the symmetric uncertainty value between feature attributes F_a and F_b.
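The patent's formula for R_sum did not survive extraction; the following sketch assumes the simplest form, a sum of SU over all unordered attribute pairs, and should be read only as an illustration of the comparison.

```python
import itertools

def total_redundancy(features, su_ff):
    """Assumed form of the overall redundancy R_sum: the sum of
    SU(Fa, Fb) over all unordered attribute pairs. The exact
    normalization used in the patent is not recoverable from the text."""
    return sum(su_ff[frozenset(pair)]
               for pair in itertools.combinations(features, 2))
```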
Table 4: Overall redundancy of the feature sets in the three cases
After the key feature extraction method provided by this patent is applied to the finishing-mill data with the finishing temperature as the target attribute, it can be seen from Table 3 that the feature subset extracted by this patent's algorithm contains only 2 feature attributes: the rolling speed of one stand and the reduction of one stand. According to the analysis of the factors influencing the finishing temperature, rolling speed, reduction, and exit thickness are all important influencing factors of the finishing temperature; this shows that both the FAST algorithm and the algorithm provided by this patent can correctly extract the key feature attributes that influence the finishing temperature. It can be seen from Table 4 that the algorithm provided by this patent outperforms the original FAST algorithm in the overall redundancy of the feature subset.
The effect of the relevance threshold θ on feature subset extraction is shown in Fig. 3, and the variation of the overall redundancy R_sum of the feature subset extracted by the algorithm provided by this patent is shown in Fig. 4. It can be seen from Fig. 4 that when the number of attributes in the feature subset is 1, the total redundancy of the improved FAST algorithm provided by this patent is 0; when θ is taken beyond the 28th ranked value, the overall redundancy of the feature subset starts to decrease because of the increase of irrelevant attributes. This shows that, in key feature extraction from actual finishing-mill data, the algorithm provided by this patent can remove the attributes that are irrelevant to the finishing temperature and mutually redundant in a single extraction pass, meeting the requirement of reducing the overall redundancy of the original feature set.

Claims (4)

1. A high-dimensional data key feature extraction method based on an improved minimum spanning tree, characterized by comprising:
Step 1: preprocess the hot-rolled strip data, including the steps of data cleaning, data integration, and discretization of continuous attributes,
data cleaning being the operation of searching for and deleting the outliers in the hot-rolling process data,
data integration merging the data in multiple data sources and storing them in a data set with a consistent structure,
discretization of continuous attributes discretizing the finishing temperature with a nonlinear partitioning method;
Step 2: removal of irrelevant features,
wherein, if X and Y are discrete random variables, the symmetric uncertainty is
$SU(X, Y) = \dfrac{2\,Gain(X \mid Y)}{H(X) + H(Y)}$
where H(X) is the entropy of the discrete random variable X, and, assuming p(x) is the prior probability of each value of X,
$H(X) = -\sum_{x} p(x) \log_2 p(x)$
and wherein, for a data set D with a feature set F = {F_1, F_2, ..., F_m} of m feature attributes and a target attribute C, in order to identify the feature attributes in the feature set irrelevant to the target attribute, the value SU(F_i, C) between each feature attribute F_i (1 ≤ i ≤ m) and the target attribute C is computed first; if the SU(F_i, C) value of a feature attribute F_i is greater than a predefined relevance threshold θ, it is considered a feature attribute relevant to the target attribute, and the feature attributes meeting this condition are extracted to form a new feature attribute subset F' = {F'_1, F'_2, ..., F'_k} (k ≤ m);
Step 3: construct the minimum spanning tree,
first giving the following definitions and decision conditions:
Definition 1: SU(F_i, C) is the correlation between feature attribute F_i ∈ F and target attribute C; if SU(F_i, C) is greater than the predefined threshold θ, F_i is considered a feature attribute relevant to target attribute C;
Definition 2: SU(F_i, F_j) is the correlation between a pair of attributes F_i and F_j (F_i, F_j ∈ F ∧ i ≠ j);
Definition 3: suppose S = {F_1, F_2, ..., F_k} (k < |F|) is a feature cluster; if there exists F_j ∈ S such that, for every F_i ∈ S (i ≠ j), the condition [SU(F_j, C) ≥ SU(F_i, C)] ∧ [SU(F_i, F_j) > SU(F_i, C)] always holds, then F_i is considered redundant with respect to F_j;
Definition 4: a feature attribute F_i ∈ S = {F_1, F_2, ..., F_k} (k < |F|) is regarded as the representative feature of S if and only if $F_i = \arg\max_{F_j \in S} SU(F_j, C)$, i.e., the feature attribute possessing the largest SU(F_j, C) value in feature set S serves as the representative feature of this feature cluster;
in Definitions 1 to 4, ∧ denotes logical AND and |F| denotes the number of attributes contained in feature set F; according to the above, the process of feature subset selection, i.e., of key feature extraction, can be converted into identifying and retaining the feature attributes satisfying the condition SU(F_i, C) ≥ θ in Definition 1, and then selecting the representative feature within each feature cluster;
Decision condition 1: suppose F'_i (i ∈ [1, k]) is a feature attribute in feature attribute set F'; if the symmetric uncertainty SU(F'_i, F'_j) between it and another feature attribute F'_j (j ∈ [1, k] ∧ i ≠ j) in the feature set is the smallest among the symmetric uncertainty values involving F'_j, then F'_i is considered the minimal-redundancy feature attribute of F'_j within the entire feature attribute set;
Decision condition 2: if feature attribute F'_i (i ∈ [1, k]) is the minimal-redundancy feature attribute of at least δ × k feature attributes in the feature attribute set, this attribute is considered to cause a single-point multi-divergence structure during minimum spanning tree construction, where δ ∈ (0, 1) is a predefined value;
then, based on decision conditions 1 and 2, the specific steps of the minimum spanning tree construction module of the present invention are as follows:
(1) for feature attribute set F' = {F'_1, F'_2, ..., F'_k}, take the correlation measure SU(F'_i, C) (i ∈ [1, k]) between feature attribute F'_i and target attribute C as the weight of the corresponding node of the connected graph, and the correlation measure SU(F'_i, F'_j) (i, j ∈ [1, k] ∧ i ≠ j) between feature attributes as the weight of the corresponding edge, and construct a connected undirected graph G; build a minimum spanning tree of G with the classical Prim algorithm, which keeps all nodes connected while minimizing the total weight of the edges in the tree;
(2) define a variable n with initial value 0; for a feature attribute F'_i (i ∈ [1, k]) in F', apply decision condition 1 against every other feature attribute in F' in turn, and each time F'_i is the minimal-redundancy attribute of another feature attribute, add 1 to n;
(3) evaluate the n value of F'_i against decision condition 2: if n ≥ δ × k, extract F'_i separately from the connected graph G and add it to the final feature attribute subset; otherwise (n < δ × k), do not process F'_i, and perform steps (2) and (3) on another feature attribute F'_j (j ∈ [1, k] ∧ i ≠ j) that has not been evaluated yet;
after all feature attributes in F' have been evaluated, the construction of the minimum spanning tree is complete;
Step 4: partition the minimum spanning tree, remove redundant attributes, and extract the key feature variables,
wherein, after the construction of the minimum spanning tree, every edge E = {(F'_i, F'_j) | F'_i, F'_j ∈ F' ∧ i, j ∈ [1, k] ∧ i ≠ j} of the minimum spanning tree is analyzed; if [SU(F'_i, F'_j) < SU(F'_i, C)] ∧ [SU(F'_i, F'_j) < SU(F'_j, C)] holds, i.e., the weight SU(F'_i, F'_j) of the edge is smaller than the correlation measures SU(F'_i, C) and SU(F'_j, C) between the nodes at its two ends and the target attribute, then the edge is removed from the minimum spanning tree;
during this process, each edge removal splits one tree into two subtrees; after all removal operations finish, for each subtree with node set V(T), it can be found that every pair of nodes F'_i, F'_j ∈ V(T) fails the removal condition, i.e., satisfies [SU(F'_i, F'_j) ≥ SU(F'_i, C)] ∨ [SU(F'_i, F'_j) ≥ SU(F'_j, C)]; according to Definition 3, all feature attributes in the feature attribute set corresponding to node set V(T) are judged to be mutually redundant with respect to target attribute C;
after the cutting of the tree is complete, a forest containing multiple trees is obtained; suppose the resulting forest contains n trees in total; for each subtree T_i ∈ {T_1, T_2, ..., T_n}, all feature attributes contained in the tree are mutually redundant, and according to Definition 4, the feature attribute with the largest SU(F_j, C) value is extracted from each tree as the representative feature attribute; a feature set containing n key feature attributes is obtained, in which no mutually redundant attributes remain, completing the removal of redundant attributes.
2. The high-dimensional data key feature extraction method based on an improved minimum spanning tree according to claim 1, characterized in that:
in step 1, the method of searching for outliers is: since each attribute of the data has a reasonable range in the actual production process, outliers are searched using an upper-and-lower-bound method.
3. The high-dimensional data key feature extraction method based on an improved minimum spanning tree according to claim 1, characterized in that:
in step 1, after data integration is completed, each record in the data table takes the strip number as its unique index and contains all the process information collected while the strip is rolled through each stand, including the finishing temperature, inter-stand cooling water, reduction, and strip width and thickness.
4. The high-dimensional data key feature extraction method based on an improved minimum spanning tree according to claim 3, characterized in that:
in step 1, the finishing temperature is taken as the target attribute, and the other attributes in the data set are taken as feature attributes;
the numerical range of the finishing temperature is divided into five regions symmetric about the finishing temperature target value T_0: (0, T_0 - 4α), (T_0 - 4α, T_0 - α), (T_0 - α, T_0 + α), (T_0 + α, T_0 + 4α), and (T_0 + 4α, +∞);
the feature attributes are discretized using the minimum description length algorithm:
first, define the class entropy of sample set S as
$Ent(S) = -\sum_{i=1}^{K} P(C_i, S) \log_2 P(C_i, S)$
where S is the data set, K is the number of classes {C_1, ..., C_K} in target attribute C, and P(C_i, S) is the proportion of samples in S belonging to class C_i;
then define the entropy of the data set after partitioning as
$E(A, T; S) = \dfrac{|S_1|}{|S|} Ent(S_1) + \dfrac{|S_2|}{|S|} Ent(S_2)$
where |S| is the number of samples in data set S, A is a feature attribute in the data set, T is the cut point under evaluation, and S_1, S_2 are the two data sets after data set S is partitioned;
the information gain corresponding to the determined cut point is then
$Gain(A, T; S) = Ent(S) - E(A, T; S)$
and according to the minimum description length criterion, formula (4) is obtained:
$Gain(A, T; S) > \dfrac{\log_2(N - 1)}{N} + \dfrac{\Delta(A, T; S)}{N} \quad (4)$
where N = |S|, Δ(A, T; S) = log_2(3^K - 2) - [K·Ent(S) - K_1·Ent(S_1) - K_2·Ent(S_2)], K is the number of classes contained in the original data set, and K_1, K_2 are the numbers of classes contained in the two subsets S_1 and S_2, respectively;
formula (4) is the decision condition of the MDL algorithm for cut points, called the MDLPC criterion; each value of attribute A is evaluated according to the MDLPC criterion, and the values satisfying the condition are cut points.
CN201810917990.3A 2018-08-13 2018-08-13 High-dimensional data key feature extraction method based on an improved minimum spanning tree Pending CN109101626A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810917990.3A CN109101626A (en) 2018-08-13 2018-08-13 High-dimensional data key feature extraction method based on an improved minimum spanning tree

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810917990.3A CN109101626A (en) 2018-08-13 2018-08-13 High-dimensional data key feature extraction method based on an improved minimum spanning tree

Publications (1)

Publication Number Publication Date
CN109101626A true CN109101626A (en) 2018-12-28

Family

ID=64849686

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810917990.3A Pending CN109101626A (en) 2018-08-13 2018-08-13 High-dimensional data key feature extraction method based on an improved minimum spanning tree

Country Status (1)

Country Link
CN (1) CN109101626A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116809652A (en) * 2023-03-28 2023-09-29 材谷金带(佛山)金属复合材料有限公司 Abnormality analysis method and system for hot rolling mill control system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105975589A (en) * 2016-05-06 2016-09-28 哈尔滨理工大学 Feature selection method and device of high-dimension data
CN106570178A (en) * 2016-11-10 2017-04-19 重庆邮电大学 High-dimension text data characteristic selection method based on graph clustering
CN107273909A (en) * 2016-04-08 2017-10-20 上海市玻森数据科技有限公司 The sorting algorithm of high dimensional data

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107273909A (en) * 2016-04-08 2017-10-20 上海市玻森数据科技有限公司 The sorting algorithm of high dimensional data
CN105975589A (en) * 2016-05-06 2016-09-28 哈尔滨理工大学 Feature selection method and device of high-dimension data
CN106570178A (en) * 2016-11-10 2017-04-19 重庆邮电大学 High-dimension text data characteristic selection method based on graph clustering

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王昳晗: "An improved MST key feature extraction method and its application in finishing-temperature modeling", Wanfang Academic Journal Database *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116809652A (en) * 2023-03-28 2023-09-29 材谷金带(佛山)金属复合材料有限公司 Abnormality analysis method and system for hot rolling mill control system
CN116809652B (en) * 2023-03-28 2024-04-26 材谷金带(佛山)金属复合材料有限公司 Abnormality analysis method and system for hot rolling mill control system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20181228)