CN109101626A - High-dimensional data key feature extraction method based on an improved minimum spanning tree - Google Patents

High-dimensional data key feature extraction method based on an improved minimum spanning tree Download PDF

Info

Publication number
CN109101626A
CN109101626A
Authority
CN
China
Prior art keywords
attribute
characteristic
data
value
spanning tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810917990.3A
Other languages
Chinese (zh)
Inventor
刘斌
黄卫华
王昳晗
蒋峥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Science and Engineering WUSE
Wuhan University of Science and Technology WHUST
Original Assignee
Wuhan University of Science and Engineering WUSE
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University of Science and Engineering WUSE filed Critical Wuhan University of Science and Engineering WUSE
Priority to CN201810917990.3A priority Critical patent/CN109101626A/en
Publication of CN109101626A publication Critical patent/CN109101626A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a high-dimensional data key feature extraction method based on an improved minimum spanning tree, comprising: step 1, preprocessing the hot-rolled strip data, including the steps of data cleaning, data integration, and discretization of continuous attributes; step 2, removal of irrelevant features; step 3, construction of the minimum spanning tree; and step 4, partitioning the minimum spanning tree, removing redundant attributes, and extracting the key feature variables. The present invention effectively avoids failure of the de-redundancy operation caused by single-point multi-divergence, and significantly improves the efficiency of extracting the key feature attributes that influence the finishing temperature, thereby improving the modeling accuracy of the hot-rolling finishing temperature and the reliability of rolling control.

Description

High-dimensional data key feature extraction method based on an improved minimum spanning tree
Technical field
The present invention relates to a high-dimensional data key feature extraction method based on an improved minimum spanning tree, and belongs to the field of high-dimensional data mining.
Background technique
In modern industrial production there is a large class of production objects with complex industrial characteristics: they generally exhibit drastic changes in operating conditions, strong nonlinearity, tight coupling, and time-varying parameters, and their mathematical models are difficult to describe accurately. Existing control methods cannot adapt to frequently changing operating conditions and rely excessively on the accuracy of the plant model, which easily leads to low control precision and poor tracking of the setpoint signal, so they cannot fully meet the demands of modern industrial production. At the same time, a large amount of industrial process data is accumulated during production; this data hides very rich information and knowledge and can directly reflect the relationship between each variable and the prediction target. Advances in computer technology have made collecting such process data increasingly easy, but the complexity of the operating conditions makes the database ever larger, and ultimately, under the influence of the "curse of dimensionality", mining high-dimensional data becomes extraordinarily difficult.
Hot-rolled strip is one of the most important steel products. Since the finishing temperature largely determines the mechanical and structural properties of the strip product, the prediction and control of the finishing temperature has always been a research focus of the steel industry. Because of the harsh environment of the entire hot-rolling production area, temperature instruments cannot be installed at every point and the strip temperature is difficult to measure continuously; the production process involves numerous process parameters whose relationships with the finishing temperature are extremely complex. Existing process models struggle to adapt to frequently changing operating conditions, and the large number of influencing factors is clearly unfavorable to improving the accuracy of hot-rolled temperature prediction models. Therefore, before modeling the finishing-mill data, it is necessary to analyze the degree to which each factor influences the finishing temperature, reduce the redundancy between feature data, and effectively extract the key feature variables, so as to improve the prediction accuracy of the model while reducing its complexity.
Regarding high-dimensional data key feature extraction methods for complex industrial objects, a literature search found the following related patent: invention patent application No. 201610298079X, "Feature selection method and device for high-dimensional data", published on September 28, 2016. It provides a feature selection method and device for high-dimensional data, introducing the maximal information coefficient (Maximal Information Coefficient, MIC) into feature selection, evaluating features based on MIC, and selecting features according to the effective values produced by the evaluation.
However, the above patent has two major defects: (1) it does not solve the problem of preprocessing high-dimensional sample data; when the sample data contain missing values and outliers, those samples most likely cannot be used directly for data mining and analysis, which reduces the analyzability of the data; (2) it does not take the application background of the data set into account; the designed sample-data processing algorithm needs to be analyzed and adjusted according to the data set required by the actual situation before satisfactory feature extraction results can be obtained.
Summary of the invention
The purpose of the present invention is to provide a high-dimensional data key feature extraction method based on an improved minimum spanning tree, so as to solve the above problems.
The present invention adopts the following technical solution:
A high-dimensional data key feature extraction method based on an improved minimum spanning tree, characterized by comprising:
Step 1: preprocess the hot-rolled strip data, including the steps of data cleaning, data integration, and discretization of continuous attributes.
Data cleaning: the operation of searching for and deleting the outliers in the hot-rolling process data.
Data integration: merge the data in multiple data sources and store them in a data set with a consistent structure.
Discretization of continuous attributes: discretize the finishing temperature with a nonlinear partitioning method.
Step 2: removal of irrelevant features.
If X and Y are discrete random variables, the symmetric uncertainty is:

$SU(X, Y) = \dfrac{2\,Gain(X \mid Y)}{H(X) + H(Y)}$

where H(X) is the entropy of the discrete random variable X; assuming p(x) is the prior probability of each value of X, then:

$H(X) = -\sum_{x} p(x) \log_2 p(x)$

For a data set D with a feature set F = {F_1, F_2, ..., F_m} of m feature attributes and a target attribute C, in order to identify the feature attributes in the feature set irrelevant to the target attribute, the value SU(F_i, C) between each feature attribute F_i (1 ≤ i ≤ m) and the target attribute C is computed first. If the SU(F_i, C) value of a feature attribute F_i is greater than a predefined relevance threshold θ, it is considered a feature attribute relevant to the target attribute; the feature attributes meeting this condition are extracted to form a new feature attribute subset F' = {F'_1, F'_2, ..., F'_k} (k ≤ m).
Step 3: construct the minimum spanning tree.
First, the following definitions and decision conditions are given.
Definition 1: SU(F_i, C) is the correlation between feature attribute F_i ∈ F and target attribute C. If SU(F_i, C) is greater than the predefined threshold θ, F_i is considered a feature attribute relevant to target attribute C.
Definition 2: SU(F_i, F_j) is the correlation between a pair of attributes F_i and F_j (F_i, F_j ∈ F ∧ i ≠ j).
Definition 3: suppose S = {F_1, F_2, ..., F_k} (k < |F|) is a feature cluster. If there exists F_j ∈ S such that, for every F_i ∈ S (i ≠ j), the condition [SU(F_j, C) ≥ SU(F_i, C)] ∧ [SU(F_i, F_j) > SU(F_i, C)] always holds, then F_i is considered redundant with respect to F_j.
Definition 4: a feature attribute F_i ∈ S = {F_1, F_2, ..., F_k} (k < |F|) is regarded as the representative feature of S if and only if $F_i = \arg\max_{F_j \in S} SU(F_j, C)$; that is, the feature attribute possessing the largest SU(F_j, C) value in feature set S serves as the representative feature of this feature cluster.
In Definitions 1 to 4, ∧ denotes logical AND, and |F| denotes the number of attributes contained in feature set F. According to the above, the process of feature subset selection, i.e., of key feature extraction, can be converted into identifying and retaining the feature attributes satisfying the condition SU(F_i, C) ≥ θ in Definition 1, and then selecting the representative feature within each feature cluster.
Decision condition 1: suppose F'_i (i ∈ [1, k]) is a feature attribute in feature attribute set F'. If the symmetric uncertainty SU(F'_i, F'_j) between it and another feature attribute F'_j (j ∈ [1, k] ∧ i ≠ j) in the feature set is the smallest among the symmetric uncertainty values involving F'_j, then F'_i is considered the minimal-redundancy feature attribute of F'_j within the entire feature attribute set.
Decision condition 2: if feature attribute F'_i (i ∈ [1, k]) is the minimal-redundancy feature attribute of at least δ × k feature attributes in the set, this attribute is considered to cause a single-point multi-divergence structure during minimum spanning tree construction, where δ ∈ (0, 1) is a predefined value.
Then, based on decision conditions 1 and 2, the specific steps of the minimum spanning tree construction module of the present invention are as follows:
(1) For feature attribute set F' = {F'_1, F'_2, ..., F'_k}, take the correlation measure SU(F'_i, C) (i ∈ [1, k]) between feature attribute F'_i and target attribute C as the weight of the corresponding node of the connected graph, and the correlation measure SU(F'_i, F'_j) (i, j ∈ [1, k] ∧ i ≠ j) between feature attributes as the weight of the corresponding edge, and construct a connected undirected graph G. Build a minimum spanning tree of G with the classical Prim algorithm, which keeps all nodes connected while minimizing the total weight of the edges in the tree.
(2) Define a variable n with initial value 0. For a feature attribute F'_i (i ∈ [1, k]) in F', apply decision condition 1 against every other feature attribute in F' in turn; each time F'_i is the minimal-redundancy attribute of another feature attribute, add 1 to n.
(3) Evaluate the n value of F'_i against decision condition 2. If n ≥ δ × k, extract F'_i separately from the connected graph G and add it to the final feature attribute subset; otherwise (n < δ × k), do not process F'_i, and perform steps (2) and (3) on another feature attribute F'_j (j ∈ [1, k] ∧ i ≠ j) that has not been evaluated yet.
After all feature attributes in F' have been evaluated, the construction of the minimum spanning tree is complete.
Step 4: partition the minimum spanning tree, remove redundant attributes, and extract the key feature variables.
After the construction of the minimum spanning tree, analyze every edge E = {(F'_i, F'_j) | F'_i, F'_j ∈ F' ∧ i, j ∈ [1, k] ∧ i ≠ j} of the minimum spanning tree. If [SU(F'_i, F'_j) < SU(F'_i, C)] ∧ [SU(F'_i, F'_j) < SU(F'_j, C)] holds, i.e., the weight SU(F'_i, F'_j) of the edge is smaller than the correlation measures SU(F'_i, C) and SU(F'_j, C) between the nodes at its two ends and the target attribute, then the edge is removed from the minimum spanning tree.
During this process, each edge removal splits one tree into two subtrees. After all removal operations finish, for each subtree with node set V(T), every pair of nodes F'_i, F'_j ∈ V(T) fails the removal condition, i.e., satisfies [SU(F'_i, F'_j) ≥ SU(F'_i, C)] ∨ [SU(F'_i, F'_j) ≥ SU(F'_j, C)]; according to Definition 3, all feature attributes in the feature attribute set corresponding to node set V(T) are judged to be mutually redundant with respect to target attribute C.
After the cutting of the tree is complete, a forest containing multiple trees is obtained. Suppose the resulting forest contains n trees in total. For each subtree T_i ∈ {T_1, T_2, ..., T_n}, all feature attributes contained in the tree are mutually redundant; according to Definition 4, extract from each tree the feature attribute with the largest SU(F_j, C) value as the representative feature attribute. This yields a feature set containing n key feature attributes in which no mutually redundant attributes remain, completing the removal of redundant attributes.
Further, the high-dimensional data key feature extraction method based on an improved minimum spanning tree of the present invention may also have the following feature:
In step 1, the method of searching for outliers is: since each attribute of the data has a reasonable range in the actual production process, outliers are searched using an upper-and-lower-bound method.
Further, the method of the present invention may also have the following feature: in step 1, after data integration is completed, each record in the data table takes the strip number as its unique index and contains all the process information collected while the strip is rolled through each stand, including the finishing temperature, inter-stand cooling water, reduction, and strip width and thickness.
Further, the method of the present invention may also have the following feature:
In step 1, the finishing temperature is taken as the target attribute, and the other attributes in the data set are taken as feature attributes.
The numerical range of the finishing temperature is divided into five regions symmetric about the finishing temperature target value T_0: (0, T_0 - 4α), (T_0 - 4α, T_0 - α), (T_0 - α, T_0 + α), (T_0 + α, T_0 + 4α), and (T_0 + 4α, +∞).
The feature attributes are discretized using the minimum description length algorithm.
First, define the class entropy of sample set S as

$Ent(S) = -\sum_{i=1}^{K} P(C_i, S) \log_2 P(C_i, S)$

where S is the data set, K is the number of classes {C_1, ..., C_K} in target attribute C, and P(C_i, S) is the proportion of samples in S belonging to class C_i.
Then define the entropy of the data set after partitioning as

$E(A, T; S) = \dfrac{|S_1|}{|S|} Ent(S_1) + \dfrac{|S_2|}{|S|} Ent(S_2)$

where |S| is the number of samples in data set S, A is a feature attribute in the data set, T is the cut point under evaluation, and S_1, S_2 are the two data sets after data set S is partitioned.
The information gain corresponding to the determined cut point is then

$Gain(A, T; S) = Ent(S) - E(A, T; S)$

According to the minimum description length criterion, formula (4) is obtained:

$Gain(A, T; S) > \dfrac{\log_2(N - 1)}{N} + \dfrac{\Delta(A, T; S)}{N} \quad (4)$

where N = |S|, Δ(A, T; S) = log_2(3^K - 2) - [K·Ent(S) - K_1·Ent(S_1) - K_2·Ent(S_2)], K is the number of classes contained in the original data set, and K_1, K_2 are the numbers of classes contained in the two subsets S_1 and S_2, respectively.
Formula (4) is the decision condition of the MDL algorithm for cut points, called the MDLPC criterion. Each value of attribute A is evaluated according to the MDLPC criterion, and the values satisfying the condition are cut points.
Advantageous effects of the invention
The present invention proposes a high-dimensional data key feature extraction method based on an improved minimum spanning tree and applies it to the key feature extraction process for the finishing temperature. Missing values and outliers in the high-dimensional data are preprocessed, and preprocessing operations such as data integration and discretization of continuous attributes improve the analyzability of the data. Considering that actual finishing-mill data are typical high-dimensional data that easily cause single-point multi-divergence in the minimum spanning tree, decision conditions are provided to extract in advance the feature attributes that would cause single-point multi-divergence structures, so that they are not added to the minimum spanning tree construction. This effectively avoids failure of the de-redundancy operation caused by single-point multi-divergence and significantly improves the efficiency of extracting the key feature attributes that influence the finishing temperature, thereby improving the modeling accuracy of the hot-rolling finishing temperature and the reliability of rolling control.
The present invention therefore helps realize high accuracy of the target prediction model and the effectiveness of multi-objective optimal control strategies, and is of great significance for improving product quality.
Brief description of the drawings
Fig. 1 is a schematic diagram of the discretization of the finishing temperature;
Fig. 2 is a schematic diagram of the minimum spanning tree of the present invention;
Fig. 3 shows the number of attributes in the feature subset for different values of θ;
Fig. 4 shows the total redundancy R_sum of the feature subset for different values of θ.
Specific embodiments
A specific embodiment of the present invention is described below with reference to the drawings.
<Embodiment one>
The high-dimensional data key feature extraction method based on an improved minimum spanning tree comprises the following steps:
Step 1: preprocess the hot-rolled strip data, including data cleaning, data integration, and discretization of continuous attributes.
1. Data cleaning
For the outliers in the hot-rolling process data: since each attribute of the data has its own reasonable range in the actual production process, outliers are searched using an upper-and-lower-bound method. That is, reasonable upper and lower bounds are set for the value of each feature attribute according to prior knowledge of the actual production process, and data outside this reasonable range are regarded as outliers. Since the data samples containing missing values and outliers account for a very small proportion of the data set, and the data of individual strips are mutually independent, these abnormal strip data samples are simply deleted. A minimal sketch of this step is given below.
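The following pandas sketch illustrates the upper-and-lower-bound search; the column names and bound values are illustrative assumptions, not figures from the patent.

```python
import pandas as pd

# Hypothetical reasonable ranges for a few process attributes; real limits
# would come from prior knowledge of the finish-rolling process.
BOUNDS = {
    "finishing_temp": (780.0, 920.0),  # degrees Celsius
    "screw_down":     (0.0, 60.0),     # stand reduction, mm
    "exit_thick":     (1.0, 25.0),     # strip exit thickness, mm
}

def drop_out_of_range(df: pd.DataFrame, bounds: dict) -> pd.DataFrame:
    """Delete whole strip records whose values fall outside the reasonable
    range of any attribute (upper-and-lower-bound search), together with
    records containing missing values."""
    mask = pd.Series(True, index=df.index)
    for col, (lo, hi) in bounds.items():
        mask &= df[col].between(lo, hi)
    # Strip records are mutually independent, so abnormal samples are
    # simply deleted rather than imputed.
    return df[mask].dropna()
```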
2. data integration
Data in multiple data sources are merged, the process being stored in a consistent data set of structure.Course of hot rolling Initial data, which is concentrated, contains two tables of data, is section actual value table and rack actual value table respectively.In conjunction with finish rolling production technology mistake Journey, every strip are divided into leading portion, middle section, this three sections of back segment during the rolling process, and need to be by the rolling of 7 racks in total. Since the plate state of section strip steel middle in the operation of rolling is more stable compared to other both ends, the data information of strip is more represented Property, select strip Mid-Section Data as data to be analyzed.
By data integration, multiple process data tables are integrated into a new data table.Each data in tables of data, with Band grade of steel is unique index, rolls all procedural informations collected, including finish to gauge temperature by 7 racks including the strip Degree, rack water, drafts, strip width thickness etc..Part attribute is discrete variable, and value is discrete magnitude, and most attributes are Continuous variable.
The finishing temperature is taken as the target attribute, and the other attributes in the data set are taken as feature attributes.
3. Discretization of continuous attributes
The target attribute finishing temperature is a continuous process variable, and according to the data classification requirements its values need to be discretized. A nonlinear partitioning method is used to discretize the finishing temperature. Based on the distribution of the finishing temperature in the actual finishing-mill data, the numerical range of the finishing temperature is divided into five regions symmetric about the finishing temperature target value. To make the prediction of the finishing temperature more accurate, the interval containing the finishing temperature target value is appropriately narrowed when the five regions are divided; the discretization scheme is shown in Fig. 1.
Here T_0 is the finishing temperature target value and α is the temperature deviation; the values of these two variables depend on the actual finish-rolling process.
After the discretization of the target attribute finishing temperature, its value is divided into several discrete values, as sketched below.
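The sketch below illustrates the five-region partition of Fig. 1, assuming illustrative values for T_0 and α (the patent leaves both to the actual finish-rolling process):

```python
import numpy as np
import pandas as pd

def discretize_finishing_temp(temps, T0=860.0, alpha=5.0):
    """Map finishing temperatures onto the five symmetric regions
    (0, T0-4a), (T0-4a, T0-a), (T0-a, T0+a), (T0+a, T0+4a), (T0+4a, inf).
    T0 and alpha here are illustrative, not values from the patent."""
    edges = [0.0, T0 - 4 * alpha, T0 - alpha,
             T0 + alpha, T0 + 4 * alpha, np.inf]
    return pd.cut(temps, bins=edges, labels=[0, 1, 2, 3, 4])

# e.g. discretize_finishing_temp([838.0, 859.2, 883.5]) -> classes 0, 2, 4
```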
For the feature attributes in the data set, the minimum description length (Minimum Description Length, MDL) algorithm is used for discretization.
First, define the class entropy of sample set S as

$Ent(S) = -\sum_{i=1}^{K} P(C_i, S) \log_2 P(C_i, S) \quad (1)$

where S is the data set, K is the number of classes {C_1, ..., C_K} in target attribute C, and P(C_i, S) is the proportion of samples in S belonging to class C_i.
Then define the entropy of the data set after partitioning as

$E(A, T; S) = \dfrac{|S_1|}{|S|} Ent(S_1) + \dfrac{|S_2|}{|S|} Ent(S_2) \quad (2)$

where |S| is the number of samples in data set S, A is a feature attribute in the data set, T is the cut point under evaluation, and S_1, S_2 are the two data sets after data set S is partitioned.
The information gain corresponding to the determined cut point is then

$Gain(A, T; S) = Ent(S) - E(A, T; S) \quad (3)$

According to the minimum description length criterion:

$Gain(A, T; S) > \dfrac{\log_2(N - 1)}{N} + \dfrac{\Delta(A, T; S)}{N} \quad (4)$

where N = |S|, Δ(A, T; S) = log_2(3^K - 2) - [K·Ent(S) - K_1·Ent(S_1) - K_2·Ent(S_2)], K is the number of classes contained in the original data set, and K_1, K_2 are the numbers of classes contained in the two subsets S_1 and S_2, respectively.
Formula (4) is the decision condition of the MDL algorithm for cut points, called the MDLPC criterion. Each value of attribute A is evaluated according to the MDLPC criterion, and the values satisfying the condition are cut points.
Compared with traditional discretization methods, which split a continuous attribute by binary splitting in iterative loops, the MDL algorithm can split a continuous attribute into multiple discrete regions within one discretization pass, which reduces the computational complexity of the algorithm. The present invention therefore uses the MDL algorithm to discretize the feature attributes in the actual finishing-mill data. A sketch of the criterion follows.
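Below is a minimal sketch of the MDLPC acceptance test of formulas (1)-(4) for a single candidate cut point, assuming the standard Fayyad-Irani form of the criterion; the function names are our own.

```python
import numpy as np
from collections import Counter

def ent(labels):
    """Class entropy Ent(S) = -sum P(Ci,S) * log2 P(Ci,S), formula (1)."""
    n = len(labels)
    return -sum((c / n) * np.log2(c / n) for c in Counter(labels).values())

def mdlpc_accepts(values, labels, t):
    """Check the MDLPC criterion (4) for cut point t of attribute A:
    Gain(A,T;S) > log2(N-1)/N + Delta(A,T;S)/N."""
    values = np.asarray(values)
    labels = np.asarray(labels)
    left, right = labels[values <= t], labels[values > t]
    n = len(labels)
    e_s, e1, e2 = ent(labels), ent(left), ent(right)
    # Formulas (2) and (3): split entropy and information gain.
    gain = e_s - (len(left) / n) * e1 - (len(right) / n) * e2
    k, k1, k2 = (len(set(x)) for x in (labels, left, right))
    delta = np.log2(3 ** k - 2) - (k * e_s - k1 * e1 - k2 * e2)
    return gain > np.log2(n - 1) / n + delta / n
```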
Step 2: removal of irrelevant features
Before describing the method for removing irrelevant feature attributes, the definitions of information gain and symmetric uncertainty need to be introduced.
Information gain: let the information gain Gain(X|Y) denote the information increment of X given Y; then

$Gain(X \mid Y) = H(X) - H(X \mid Y) \quad (5)$

where H(X|Y) is the conditional entropy, i.e., the entropy of random variable X given the value of random variable Y. Assuming p(x) is the prior probability of each value of X and p(x|y) the posterior probability of X given the value of Y, then

$H(X \mid Y) = -\sum_{y} p(y) \sum_{x} p(x \mid y) \log_2 p(x \mid y) \quad (6)$

Information gain is a symmetric measure, which guarantees that the order of X and Y does not affect the computed result, and also makes the symmetric uncertainty symmetric for a pair of attributes.
Symmetric uncertainty: let X and Y be discrete random variables; the symmetric uncertainty (Symmetric Uncertainty, SU) is

$SU(X, Y) = \dfrac{2\,Gain(X \mid Y)}{H(X) + H(Y)} \quad (7)$

where H(X) is the entropy of the discrete random variable X; assuming p(x) is the prior probability of each value of X, then

$H(X) = -\sum_{x} p(x) \log_2 p(x) \quad (8)$

Symmetric uncertainty normalizes the value of information gain into the interval [0, 1]: SU(X, Y) = 1 means that the value of either variable completely predicts the value of the other, while SU(X, Y) = 0 means the two variables are completely uncorrelated. These quantities can be computed as sketched below.
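A self-contained Python sketch of formulas (5)-(8) for discrete samples (function names are our own):

```python
import numpy as np
from collections import Counter

def entropy(x):
    """H(X) = -sum p(x) log2 p(x) over the observed values of X."""
    n = len(x)
    return -sum((c / n) * np.log2(c / n) for c in Counter(x).values())

def cond_entropy(x, y):
    """H(X|Y) = -sum_y p(y) sum_x p(x|y) log2 p(x|y)."""
    n = len(y)
    h = 0.0
    for yv, cy in Counter(y).items():
        xs = [xi for xi, yi in zip(x, y) if yi == yv]
        h += (cy / n) * entropy(xs)
    return h

def su(x, y):
    """Symmetric uncertainty SU(X,Y) = 2*Gain(X|Y) / (H(X)+H(Y))."""
    hx, hy = entropy(x), entropy(y)
    gain = hx - cond_entropy(x, y)
    return 2.0 * gain / (hx + hy) if hx + hy > 0 else 0.0

# Toy usage: 1.0 means perfect mutual prediction, 0.0 means uncorrelated.
print(su([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 1, 1]))
```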
For a data set D containing m feature attributes F = {F_1, F_2, ..., F_m} and a target attribute C, in order to identify the irrelevant attributes in the attribute set, the value SU(F_i, C) between each feature attribute F_i (1 ≤ i ≤ m) and the target attribute C is computed first. If the SU(F_i, C) value of a feature attribute F_i is greater than a predefined relevance threshold θ, it is considered a feature attribute relevant to the target attribute; the feature attributes meeting this condition are extracted to form a new feature attribute subset F' = {F'_1, F'_2, ..., F'_k} (k ≤ m).
All features in the new feature attribute subset are then attributes with a certain correlation with the target attribute, which guarantees that all attributes irrelevant to the target attribute are removed, as in the filtering sketch below.
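A minimal filtering sketch, assuming the SU values have already been computed as above; the attribute names, SU numbers, and θ are illustrative only:

```python
# Precomputed SU(Fi, C) values for each candidate feature attribute
# (names and numbers here are illustrative, not from the patent).
su_with_target = {
    "SCREW_DOWN_3":  0.41,  # reduction of stand 3
    "SPEED_7":       0.38,  # rolling speed of stand 7
    "EXIT_THICK_5":  0.22,  # exit thickness of stand 5
    "STAND_WATER_1": 0.02,  # inter-stand cooling water of stand 1
}
theta = 0.10  # predefined relevance threshold

# Keep only the attributes relevant to the target attribute C.
f_prime = [f for f, s in su_with_target.items() if s > theta]
# -> ['SCREW_DOWN_3', 'SPEED_7', 'EXIT_THICK_5']
```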
Step 3: construct the minimum spanning tree
First, the following definitions and decision conditions are given.
Definition 1: SU(F_i, C) is the correlation between feature attribute F_i ∈ F and target attribute C. If SU(F_i, C) is greater than the predefined threshold θ, F_i is considered a feature attribute relevant to target attribute C.
Definition 2: SU(F_i, F_j) is the correlation between a pair of attributes F_i and F_j (F_i, F_j ∈ F ∧ i ≠ j).
Definition 3: suppose S = {F_1, F_2, ..., F_k} (k < |F|) is a feature cluster. If there exists F_j ∈ S such that, for every F_i ∈ S (i ≠ j), the condition [SU(F_j, C) ≥ SU(F_i, C)] ∧ [SU(F_i, F_j) > SU(F_i, C)] always holds, then F_i is considered redundant with respect to F_j.
Definition 4: a feature attribute F_i ∈ S = {F_1, F_2, ..., F_k} (k < |F|) is regarded as the representative feature of S if and only if $F_i = \arg\max_{F_j \in S} SU(F_j, C)$; that is, the feature attribute possessing the largest SU(F_j, C) value in feature set S serves as the representative feature of this feature cluster.
In Definitions 1 to 4, ∧ denotes logical AND, and |F| denotes the number of attributes contained in feature set F. According to the above, the process of feature subset selection, i.e., of key feature extraction, can be converted into identifying and retaining the feature attributes satisfying the condition SU(F_i, C) ≥ θ in Definition 1, and then selecting the representative feature within each feature cluster.
Decision condition 1: suppose F'_i (i ∈ [1, k]) is a feature attribute in feature attribute set F'. If the symmetric uncertainty SU(F'_i, F'_j) between it and another feature attribute F'_j (j ∈ [1, k] ∧ i ≠ j) in the feature set is the smallest among the symmetric uncertainty values involving F'_j, then F'_i is considered the minimal-redundancy feature attribute of F'_j within the entire feature attribute set.
Decision condition 2: if feature attribute F'_i (i ∈ [1, k]) is the minimal-redundancy feature attribute of at least δ × k feature attributes in the set (where δ ∈ (0, 1) is a predefined value), this attribute is considered to cause a single-point multi-divergence structure during minimum spanning tree construction.
Then, based on decision conditions 1 and 2, the specific steps of the minimum spanning tree construction module of the present invention are as follows:
(1) For feature attribute set F' = {F'_1, F'_2, ..., F'_k}, take the correlation measure SU(F'_i, C) (i ∈ [1, k]) between feature attribute F'_i and target attribute C as the weight of the corresponding node of the connected graph, and the correlation measure SU(F'_i, F'_j) (i, j ∈ [1, k] ∧ i ≠ j) between feature attributes as the weight of the corresponding edge, and construct a connected undirected graph G. Build a minimum spanning tree of G with the classical Prim algorithm, which keeps all nodes connected while minimizing the total weight of the edges in the tree.
(2) Define a variable n with initial value 0. For a feature attribute F'_i (i ∈ [1, k]) in F', apply decision condition 1 against every other feature attribute in F' in turn; each time F'_i is the minimal-redundancy attribute of another feature attribute, add 1 to n.
(3) Evaluate the n value of F'_i against decision condition 2. If n ≥ δ × k, extract F'_i separately from the connected graph G and add it to the final feature attribute subset; otherwise (n < δ × k), do not process F'_i, and perform steps (2) and (3) on another feature attribute F'_j (j ∈ [1, k] ∧ i ≠ j) that has not been evaluated yet.
After all feature attributes in F' have been evaluated, the construction of the minimum spanning tree is complete. A sketch of this construction is given below.
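The sketch below is one possible reading of steps (1)-(3), assuming the attributes promoted by decision condition 2 are removed before the tree is built; the su_ff/su_fc dictionaries of pairwise and target SU values are our own names, and Prim's algorithm comes from the networkx library.

```python
import itertools
import networkx as nx

def build_pruned_mst(features, su_ff, su_fc, delta=0.3):
    """Improved MST construction (sketch).

    features: list of feature attribute names (the set F')
    su_ff:    dict mapping frozenset({Fi, Fj}) -> SU(Fi, Fj)
    su_fc:    dict mapping Fi -> SU(Fi, C)
    delta:    the predefined value of decision condition 2 (illustrative)
    """
    k = len(features)
    # Decision condition 1: F'i is the minimal-redundancy attribute of F'j
    # when SU(F'i, F'j) is the smallest SU value involving F'j.
    n_count = dict.fromkeys(features, 0)
    for fj in features:
        others = [fi for fi in features if fi != fj]
        fi_min = min(others, key=lambda fi: su_ff[frozenset((fi, fj))])
        n_count[fi_min] += 1
    # Decision condition 2: attributes that would cause single-point
    # multi-divergence go straight into the final subset.
    promoted = [f for f in features if n_count[f] >= delta * k]
    rest = [f for f in features if f not in promoted]
    graph = nx.Graph()
    for f in rest:
        graph.add_node(f, weight=su_fc[f])  # node weight SU(F'i, C)
    for fi, fj in itertools.combinations(rest, 2):
        graph.add_edge(fi, fj, weight=su_ff[frozenset((fi, fj))])
    mst = nx.minimum_spanning_tree(graph, algorithm="prim")
    return mst, promoted
```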
Step 4: partition the minimum spanning tree, remove redundant attributes, and extract the key feature variables.
After the construction of the minimum spanning tree, analyze every edge E = {(F'_i, F'_j) | F'_i, F'_j ∈ F' ∧ i, j ∈ [1, k] ∧ i ≠ j} of the minimum spanning tree. If

[SU(F'_i, F'_j) < SU(F'_i, C)] ∧ [SU(F'_i, F'_j) < SU(F'_j, C)]    (9)

holds, i.e., the weight SU(F'_i, F'_j) of the edge is smaller than the correlation measures SU(F'_i, C) and SU(F'_j, C) between the nodes at its two ends and the target attribute, then this edge is removed from the minimum spanning tree.
During this process, each edge removal splits one tree into two subtrees. After all removal operations finish, for each subtree with node set V(T), it can be found that every pair of nodes F'_i, F'_j ∈ V(T) fails condition (9), i.e., satisfies [SU(F'_i, F'_j) ≥ SU(F'_i, C)] ∨ [SU(F'_i, F'_j) ≥ SU(F'_j, C)]. According to Definition 3 above, all feature attributes in the feature attribute set corresponding to node set V(T) are considered mutually redundant with respect to target attribute C.
After the cutting of the tree is complete, a forest containing multiple trees is obtained. Suppose the resulting forest contains n trees in total; for each subtree T_i ∈ {T_1, T_2, ..., T_n}, all feature attributes contained in the tree are mutually redundant. According to Definition 4, it is only necessary to extract from each tree the feature attribute with the largest SU(F_j, C) value as the representative feature attribute. A feature set containing n key feature attributes is thus obtained; clearly no mutually redundant attributes remain in this feature set, which completes the removal of redundant attributes. A sketch of this partitioning step follows.
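A sketch of the partitioning step, under the same assumptions as the construction sketch above (su_fc maps each attribute to SU(F, C)):

```python
import networkx as nx

def extract_representatives(mst: nx.Graph, su_fc: dict) -> list:
    """Partition the MST and keep one representative per subtree.

    An edge (F'i, F'j) is deleted when its weight SU(F'i, F'j) is smaller
    than both SU(F'i, C) and SU(F'j, C) (condition (9)); each remaining
    connected component is a cluster of mutually redundant attributes,
    represented by its attribute with the largest SU(F, C)."""
    forest = mst.copy()
    for fi, fj, w in list(mst.edges(data="weight")):
        if w < su_fc[fi] and w < su_fc[fj]:
            forest.remove_edge(fi, fj)
    return [max(component, key=su_fc.get)
            for component in nx.connected_components(forest)]
```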
<Embodiment two>
The algorithm flow of the present invention is as follows:
Step 1: initialization; define the input and output data sets, and complete data cleaning, data integration, and attribute discretization;
Step 2: remove the irrelevant attributes;
Step 3: construct the minimum spanning tree;
Step 4: complete the partitioning of the minimum spanning tree, and select the key (representative) features based on the correlation measure between attributes.
First, the following variable assumptions are made: let TR be the intermediate variable for computing the symmetric uncertainty between a feature attribute F_i and the target attribute C, and FC the intermediate variable for computing the symmetric uncertainty between two feature attributes F_i and F_j; for each subtree T_i, record the feature attribute with the largest symmetric uncertainty value with target attribute C; and let k be the number of nodes in the connected graph G. Based on the above variable assumptions, the detailed flow of the key feature extraction method proposed in the present invention is as follows:
The experimental data come from the production of a certain strip product at a domestic hot-rolling mill, and feature subset extraction is performed with the finishing temperature as the target attribute. 500 data samples were randomly extracted from the preprocessed data table as experimental data; based on hot-rolling process knowledge, obviously redundant and irrelevant feature attributes had already been eliminated from this table, which contains 57 feature attributes and one target attribute in total. The feature set after removal of irrelevant attributes is shown in Table 1. The relevance threshold θ is taken as the SU(F_i, C) (i ∈ [1, m]) value at a given rank of the SU values sorted over the entire feature set; with the number of feature attributes m = 57, θ is taken as the 13th-ranked SU value.
After feature subset extraction was performed on this data table with the common FAST algorithm, the extracted key feature subset is as shown in Table 2. It can be seen that the feature subset extracted by the common FAST (Fast clustering-bAsed feature Selection AlgoriThm) algorithm contains 13 feature attributes, among them the reductions of 7 stands, the exit thicknesses of 5 stands, and the rolling speed of one stand. This shows that the de-redundancy operation of the FAST algorithm fails because of the appearance of the single-point multi-divergence problem, resulting in low key feature extraction efficiency.
Table 1: Feature set after removal of irrelevant attributes
In Table 1, SCREW_DOWN denotes the reduction, SPEED denotes the rolling speed, and EXIT_THICK denotes the strip exit thickness at a stand.
Table 2: Feature subset extracted by the FAST algorithm
Feature subset extraction with the method of this patent proceeds as follows:
First, the predefined relevance threshold θ used in the irrelevant-attribute removal process of the two algorithms is taken to be the same as the threshold proposed in FAST, and the parameter δ in the single-point multi-divergence decision condition of this patent's algorithm takes a predefined fixed value.
Second, the minimum spanning tree (including its subtrees) constructed by this patent's algorithm is shown in Fig. 2.
Third, with symmetric uncertainty as the correlation measure between attributes, the extracted key features are shown in Table 3.
Table 3: Feature subset extracted by the algorithm provided by this patent
In the table, SCREW_DOWN denotes the reduction, SPEED denotes the rolling speed, and EXIT_THICK denotes the strip exit thickness at a stand.
Finally, an overall redundancy comparison is made for the following three cases: the original feature attribute set, the subset extracted by the original FAST algorithm, and the subset extracted by the algorithm of this patent; the results are shown in Table 4. Suppose the feature attribute set corresponding to data set D is F = {F_1, F_2, ..., F_m}; the overall redundancy R_sum of this feature set is defined in terms of the pairwise symmetric uncertainties, where SU(F_a, F_b) is the symmetric uncertainty value between feature attributes F_a and F_b.
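The patent's formula for R_sum did not survive extraction; the following sketch assumes the simplest form, a sum of SU over all unordered attribute pairs, and should be read only as an illustration of the comparison.

```python
import itertools

def total_redundancy(features, su_ff):
    """Assumed form of the overall redundancy R_sum: the sum of
    SU(Fa, Fb) over all unordered attribute pairs. The exact
    normalization used in the patent is not recoverable from the text."""
    return sum(su_ff[frozenset(pair)]
               for pair in itertools.combinations(features, 2))
```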
Table 4: Overall redundancy of the feature sets in the three cases
After the key feature extraction method provided by this patent is applied to the finishing-mill data with the finishing temperature as the target attribute, it can be seen from Table 3 that the feature subset extracted by this patent's algorithm contains only 2 feature attributes: the rolling speed of one stand and the reduction of one stand. According to the analysis of the factors influencing the finishing temperature, rolling speed, reduction, and exit thickness are all important influencing factors of the finishing temperature; this shows that both the FAST algorithm and the algorithm provided by this patent can correctly extract the key feature attributes that influence the finishing temperature. It can be seen from Table 4 that the algorithm provided by this patent outperforms the original FAST algorithm in the overall redundancy of the feature subset.
The effect of the relevance threshold θ on feature subset extraction is shown in Fig. 3, and the variation of the overall redundancy R_sum of the feature subset extracted by the algorithm provided by this patent is shown in Fig. 4. It can be seen from Fig. 4 that when the number of attributes in the feature subset is 1, the total redundancy of the improved FAST algorithm provided by this patent is 0; when θ is taken beyond the 28th ranked value, the overall redundancy of the feature subset starts to decrease because of the increase of irrelevant attributes. This shows that, in key feature extraction from actual finishing-mill data, the algorithm provided by this patent can remove the attributes that are irrelevant to the finishing temperature and mutually redundant in a single extraction pass, meeting the requirement of reducing the overall redundancy of the original feature set.

Claims (4)

1. A high-dimensional data key feature extraction method based on an improved minimum spanning tree, characterized by comprising:
Step 1: preprocess the hot-rolled strip data, including the steps of data cleaning, data integration, and discretization of continuous attributes,
data cleaning being the operation of searching for and deleting the outliers in the hot-rolling process data,
data integration merging the data in multiple data sources and storing them in a data set with a consistent structure,
discretization of continuous attributes discretizing the finishing temperature with a nonlinear partitioning method;
Step 2: removal of irrelevant features,
wherein, if X and Y are discrete random variables, the symmetric uncertainty is
$SU(X, Y) = \dfrac{2\,Gain(X \mid Y)}{H(X) + H(Y)}$
where H(X) is the entropy of the discrete random variable X, and, assuming p(x) is the prior probability of each value of X,
$H(X) = -\sum_{x} p(x) \log_2 p(x)$
and wherein, for a data set D with a feature set F = {F_1, F_2, ..., F_m} of m feature attributes and a target attribute C, in order to identify the feature attributes in the feature set irrelevant to the target attribute, the value SU(F_i, C) between each feature attribute F_i (1 ≤ i ≤ m) and the target attribute C is computed first; if the SU(F_i, C) value of a feature attribute F_i is greater than a predefined relevance threshold θ, it is considered a feature attribute relevant to the target attribute, and the feature attributes meeting this condition are extracted to form a new feature attribute subset F' = {F'_1, F'_2, ..., F'_k} (k ≤ m);
Step 3: construct the minimum spanning tree,
first giving the following definitions and decision conditions:
Definition 1: SU(F_i, C) is the correlation between feature attribute F_i ∈ F and target attribute C; if SU(F_i, C) is greater than the predefined threshold θ, F_i is considered a feature attribute relevant to target attribute C;
Definition 2: SU(F_i, F_j) is the correlation between a pair of attributes F_i and F_j (F_i, F_j ∈ F ∧ i ≠ j);
Definition 3: suppose S = {F_1, F_2, ..., F_k} (k < |F|) is a feature cluster; if there exists F_j ∈ S such that, for every F_i ∈ S (i ≠ j), the condition [SU(F_j, C) ≥ SU(F_i, C)] ∧ [SU(F_i, F_j) > SU(F_i, C)] always holds, then F_i is considered redundant with respect to F_j;
Definition 4: a feature attribute F_i ∈ S = {F_1, F_2, ..., F_k} (k < |F|) is regarded as the representative feature of S if and only if $F_i = \arg\max_{F_j \in S} SU(F_j, C)$, i.e., the feature attribute possessing the largest SU(F_j, C) value in feature set S serves as the representative feature of this feature cluster;
in Definitions 1 to 4, ∧ denotes logical AND and |F| denotes the number of attributes contained in feature set F; according to the above, the process of feature subset selection, i.e., of key feature extraction, can be converted into identifying and retaining the feature attributes satisfying the condition SU(F_i, C) ≥ θ in Definition 1, and then selecting the representative feature within each feature cluster;
Decision condition 1: suppose F'_i (i ∈ [1, k]) is a feature attribute in feature attribute set F'; if the symmetric uncertainty SU(F'_i, F'_j) between it and another feature attribute F'_j (j ∈ [1, k] ∧ i ≠ j) in the feature set is the smallest among the symmetric uncertainty values involving F'_j, then F'_i is considered the minimal-redundancy feature attribute of F'_j within the entire feature attribute set;
Decision condition 2: if feature attribute F'_i (i ∈ [1, k]) is the minimal-redundancy feature attribute of at least δ × k feature attributes in the feature attribute set, this attribute is considered to cause a single-point multi-divergence structure during minimum spanning tree construction, where δ ∈ (0, 1) is a predefined value;
then, based on decision conditions 1 and 2, the specific steps of the minimum spanning tree construction module of the present invention are as follows:
(1) for feature attribute set F' = {F'_1, F'_2, ..., F'_k}, take the correlation measure SU(F'_i, C) (i ∈ [1, k]) between feature attribute F'_i and target attribute C as the weight of the corresponding node of the connected graph, and the correlation measure SU(F'_i, F'_j) (i, j ∈ [1, k] ∧ i ≠ j) between feature attributes as the weight of the corresponding edge, and construct a connected undirected graph G; build a minimum spanning tree of G with the classical Prim algorithm, which keeps all nodes connected while minimizing the total weight of the edges in the tree;
(2) define a variable n with initial value 0; for a feature attribute F'_i (i ∈ [1, k]) in F', apply decision condition 1 against every other feature attribute in F' in turn, and each time F'_i is the minimal-redundancy attribute of another feature attribute, add 1 to n;
(3) evaluate the n value of F'_i against decision condition 2: if n ≥ δ × k, extract F'_i separately from the connected graph G and add it to the final feature attribute subset; otherwise (n < δ × k), do not process F'_i, and perform steps (2) and (3) on another feature attribute F'_j (j ∈ [1, k] ∧ i ≠ j) that has not been evaluated yet;
after all feature attributes in F' have been evaluated, the construction of the minimum spanning tree is complete;
Step 4: partition the minimum spanning tree, remove redundant attributes, and extract the key feature variables,
wherein, after the construction of the minimum spanning tree, every edge E = {(F'_i, F'_j) | F'_i, F'_j ∈ F' ∧ i, j ∈ [1, k] ∧ i ≠ j} of the minimum spanning tree is analyzed; if [SU(F'_i, F'_j) < SU(F'_i, C)] ∧ [SU(F'_i, F'_j) < SU(F'_j, C)] holds, i.e., the weight SU(F'_i, F'_j) of the edge is smaller than the correlation measures SU(F'_i, C) and SU(F'_j, C) between the nodes at its two ends and the target attribute, then the edge is removed from the minimum spanning tree;
during this process, each edge removal splits one tree into two subtrees; after all removal operations finish, for each subtree with node set V(T), it can be found that every pair of nodes F'_i, F'_j ∈ V(T) fails the removal condition, i.e., satisfies [SU(F'_i, F'_j) ≥ SU(F'_i, C)] ∨ [SU(F'_i, F'_j) ≥ SU(F'_j, C)]; according to Definition 3, all feature attributes in the feature attribute set corresponding to node set V(T) are judged to be mutually redundant with respect to target attribute C;
after the cutting of the tree is complete, a forest containing multiple trees is obtained; suppose the resulting forest contains n trees in total; for each subtree T_i ∈ {T_1, T_2, ..., T_n}, all feature attributes contained in the tree are mutually redundant, and according to Definition 4, the feature attribute with the largest SU(F_j, C) value is extracted from each tree as the representative feature attribute; a feature set containing n key feature attributes is obtained, in which no mutually redundant attributes remain, completing the removal of redundant attributes.
2. The high-dimensional data key feature extraction method based on an improved minimum spanning tree according to claim 1, characterized in that:
in step 1, the method of searching for outliers is: since each attribute of the data has a reasonable range in the actual production process, outliers are searched using an upper-and-lower-bound method.
3. The high-dimensional data key feature extraction method based on an improved minimum spanning tree according to claim 1, characterized in that:
in step 1, after data integration is completed, each record in the data table takes the strip number as its unique index and contains all the process information collected while the strip is rolled through each stand, including the finishing temperature, inter-stand cooling water, reduction, and strip width and thickness.
4. The high-dimensional data key feature extraction method based on an improved minimum spanning tree according to claim 3, characterized in that:
in step 1, the finishing temperature is taken as the target attribute, and the other attributes in the data set are taken as feature attributes;
the numerical range of the finishing temperature is divided into five regions symmetric about the finishing temperature target value T_0: (0, T_0 - 4α), (T_0 - 4α, T_0 - α), (T_0 - α, T_0 + α), (T_0 + α, T_0 + 4α), and (T_0 + 4α, +∞);
the feature attributes are discretized using the minimum description length algorithm:
first, define the class entropy of sample set S as
$Ent(S) = -\sum_{i=1}^{K} P(C_i, S) \log_2 P(C_i, S)$
where S is the data set, K is the number of classes {C_1, ..., C_K} in target attribute C, and P(C_i, S) is the proportion of samples in S belonging to class C_i;
then define the entropy of the data set after partitioning as
$E(A, T; S) = \dfrac{|S_1|}{|S|} Ent(S_1) + \dfrac{|S_2|}{|S|} Ent(S_2)$
where |S| is the number of samples in data set S, A is a feature attribute in the data set, T is the cut point under evaluation, and S_1, S_2 are the two data sets after data set S is partitioned;
the information gain corresponding to the determined cut point is then
$Gain(A, T; S) = Ent(S) - E(A, T; S)$
and according to the minimum description length criterion, formula (4) is obtained:
$Gain(A, T; S) > \dfrac{\log_2(N - 1)}{N} + \dfrac{\Delta(A, T; S)}{N} \quad (4)$
where N = |S|, Δ(A, T; S) = log_2(3^K - 2) - [K·Ent(S) - K_1·Ent(S_1) - K_2·Ent(S_2)], K is the number of classes contained in the original data set, and K_1, K_2 are the numbers of classes contained in the two subsets S_1 and S_2, respectively;
formula (4) is the decision condition of the MDL algorithm for cut points, called the MDLPC criterion; each value of attribute A is evaluated according to the MDLPC criterion, and the values satisfying the condition are cut points.
CN201810917990.3A 2018-08-13 2018-08-13 High-dimensional data key feature extraction method based on an improved minimum spanning tree Pending CN109101626A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810917990.3A CN109101626A (en) 2018-08-13 2018-08-13 High-dimensional data key feature extraction method based on an improved minimum spanning tree

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810917990.3A CN109101626A (en) 2018-08-13 2018-08-13 High-dimensional data key feature extraction method based on an improved minimum spanning tree

Publications (1)

Publication Number Publication Date
CN109101626A true CN109101626A (en) 2018-12-28

Family

ID=64849686

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810917990.3A Pending CN109101626A (en) 2018-08-13 2018-08-13 High-dimensional data key feature extraction method based on an improved minimum spanning tree

Country Status (1)

Country Link
CN (1) CN109101626A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116809652A (en) * 2023-03-28 2023-09-29 材谷金带(佛山)金属复合材料有限公司 Abnormality analysis method and system for hot rolling mill control system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105975589A (en) * 2016-05-06 2016-09-28 哈尔滨理工大学 Feature selection method and device of high-dimension data
CN106570178A (en) * 2016-11-10 2017-04-19 重庆邮电大学 High-dimension text data characteristic selection method based on graph clustering
CN107273909A (en) * 2016-04-08 2017-10-20 上海市玻森数据科技有限公司 The sorting algorithm of high dimensional data

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107273909A (en) * 2016-04-08 2017-10-20 上海市玻森数据科技有限公司 The sorting algorithm of high dimensional data
CN105975589A (en) * 2016-05-06 2016-09-28 哈尔滨理工大学 Feature selection method and device of high-dimension data
CN106570178A (en) * 2016-11-10 2017-04-19 重庆邮电大学 High-dimension text data characteristic selection method based on graph clustering

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王昳晗: "An improved MST key feature extraction method and its application in finishing-temperature modeling", Wanfang Academic Journal Database *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116809652A (en) * 2023-03-28 2023-09-29 材谷金带(佛山)金属复合材料有限公司 Abnormality analysis method and system for hot rolling mill control system
CN116809652B (en) * 2023-03-28 2024-04-26 材谷金带(佛山)金属复合材料有限公司 Abnormality analysis method and system for hot rolling mill control system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20181228)