CN109101626A - High-dimensional data key feature extraction method based on an improved minimum spanning tree - Google Patents
- Publication number
- CN109101626A (application No. CN201810917990.3A)
- Authority
- CN
- China
- Prior art keywords
- attribute
- characteristic
- data
- value
- spanning tree
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a high-dimensional data key feature extraction method based on an improved minimum spanning tree, comprising: Step 1, preprocessing the hot-rolled strip data, including the steps of data cleaning, data integration and discretization of continuous attributes; Step 2, removal of irrelevant features; Step 3, construction of the minimum spanning tree; Step 4, partitioning the minimum spanning tree, removing redundant attributes and extracting the key feature variables. The present invention effectively avoids the failure of redundancy removal caused by single-point over-divergence, and significantly improves the efficiency of extracting the key feature attributes that influence the finishing temperature, thereby improving the accuracy of hot-rolling finishing-temperature modeling and the reliability of rolling control.
Description
Technical field
The present invention relates to a high-dimensional data key feature extraction method based on an improved minimum spanning tree, and belongs to the field of high-dimensional data mining.
Background technique
In modern industrial production there exists a large class of production objects with complex industrial characteristics: they generally exhibit drastic changes in operating conditions, strong nonlinearity, tight coupling and time-varying parameters, and their mathematical models are difficult to describe accurately. Existing control methods cannot adapt to frequently changing operating conditions and rely excessively on the accuracy of the plant model, which easily leads to low control precision and poor tracking of the setpoint signal, so they cannot fully meet the demands of modern industrial production. On the other hand, a large amount of process data is accumulated during industrial production; very rich information and knowledge is hidden in this data, which can intuitively reflect the relationship between each variable and the prediction target. Although advances in computer technology have made the collection of such process data increasingly easy, the complexity of the operating conditions makes the database ever larger, and under the influence of the "curse of dimensionality" the mining of high-dimensional data finally becomes exceptionally difficult.
Hot-rolled strip is a highly important category among steel products. Since the finishing temperature largely determines the mechanical and structural properties of the strip product, the prediction and control of the finishing temperature have always been a research focus of the steel industry. Because the environment of the entire hot-rolling production area is harsh, temperature instruments cannot be installed at every point and the strip temperature is difficult to detect continuously; moreover, the production process involves numerous technological parameters whose relationships with the finishing temperature are extremely complex. Existing process models have difficulty adapting to frequently changing operating conditions, and the large number of influencing factors is clearly unfavorable to improving the accuracy of the finishing-temperature prediction model. Therefore, before the finish-rolling process data are modeled, the degree of influence of each factor on the finishing temperature needs to be analyzed, the redundancy between feature data reduced and the key feature variables effectively extracted, so that the prediction accuracy of the model can be improved while its complexity is reduced.
At present, concerning high-dimensional data key feature extraction methods for complex industrial objects, a literature search finds the related patent: the invention patent with application No. 201610298079X, "A feature selection method and device for high-dimensional data", published on September 28, 2016. It provides a feature selection method and device for high-dimensional data that introduce the maximal information coefficient (Maximal Information Coefficient, MIC) into feature selection, evaluate the features based on MIC, and select features according to the effective values produced by the evaluation.
However, the above patent has two major defects: 1. it does not solve the problem of preprocessing high-dimensional sample data; when the sample data contains missing values and outliers, it most likely cannot be used directly in the data mining and analysis process, which reduces the analyzability of the data; 2. it does not consider the application background of the data set; the designed sample-data processing algorithm needs to be analyzed and adjusted according to the data set actually at hand before a satisfactory feature extraction result can be obtained.
Summary of the invention
The purpose of the present invention is to provide a high-dimensional data key feature extraction method based on an improved minimum spanning tree, so as to solve the above problems.
To this end, the present invention adopts the following technical solution:
A high-dimensional data key feature extraction method based on an improved minimum spanning tree, characterized by comprising:
Step 1: preprocess the hot-rolled strip data, comprising the steps of data cleaning, data integration and discretization of continuous attributes, wherein
data cleaning is the operation of searching for and deleting the outliers in the hot-rolling process data,
data integration merges the data in multiple data sources and stores them in a data set of consistent structure,
discretization of continuous attributes applies a nonlinear partitioning method to discretize the finishing temperature;
Step 2: removal of irrelevant features.
Let X and Y be discrete random variables; their symmetric uncertainty is:
SU(X, Y) = 2·Gain(X|Y) / (H(X) + H(Y))
where H(X) is the entropy of the discrete random variable X; letting p(x) be the prior probability of each value of X, H(X) is:
H(X) = -Σx p(x)·log2 p(x)
For a data set D containing a feature set F = {F1, F2, ..., Fm} of m feature attributes and a target attribute C, in order to identify the attributes in the feature set that are irrelevant to the target attribute, the value SU(Fi, C) between each feature attribute Fi (1 ≤ i ≤ m) and the target attribute C is first computed. If the SU(Fi, C) value of a feature attribute Fi (1 ≤ i ≤ m) is greater than a predefined relevance threshold θ, it is considered a feature attribute relevant to the target attribute; the feature attributes satisfying this condition are extracted to form a new feature attribute subset F' = {F'1, F'2, ..., F'k} (k ≤ m);
Step 3: construction of the minimum spanning tree.
First, the following definitions and decision conditions are given.
Definition 1: SU(Fi, C) is the correlation between a feature attribute Fi ∈ F and the target attribute C. If SU(Fi, C) is greater than the predefined threshold θ, Fi is considered a feature attribute relevant to the target attribute C.
Definition 2: SU(Fi, Fj) is the correlation between a pair of attributes Fi and Fj (Fi, Fj ∈ F ∧ i ≠ j).
Definition 3: Let S = {F1, F2, ..., Fk} (k < |F|) be a feature cluster. If there exists Fj ∈ S such that for every Fi ∈ S (i ≠ j) the condition [SU(Fj, C) ≥ SU(Fi, C)] ∧ [SU(Fi, Fj) > SU(Fi, C)] always holds, then Fi is considered redundant with respect to Fj.
Definition 4: A feature attribute Fi ∈ S = {F1, F2, ..., Fk} (k < |F|) is regarded as the representative feature of S if and only if Fi = argmax Fj∈S SU(Fj, C); that is, only the feature attribute possessing the maximum SU(Fj, C) value in the feature cluster S can serve as the representative feature of this cluster.
In Definitions 1 to 4, ∧ denotes the logical AND and |F| denotes the number of attributes contained in the feature set F. Under these definitions, the process of feature subset selection, namely key feature extraction, can be converted into identifying and retaining the feature attributes that satisfy the condition SU(Fi, C) ≥ θ of Definition 1, and then selecting the representative feature of each feature cluster.
Decision condition 1: Let F'i (i ∈ [1, k]) be a feature attribute in the feature attribute set F'. If its symmetric uncertainty SU(F'i, F'j) with another feature attribute F'j (j ∈ [1, k] ∧ i ≠ j) is the minimum among all symmetric uncertainty values involving F'j, then F'i is the minimum-redundancy feature attribute of F'j in the entire feature attribute set.
Decision condition 2: If a feature attribute F'i (i ∈ [1, k]) is the minimum-redundancy feature attribute of at least δ·k feature attributes in the set, where δ ∈ (0, 1) is a predefined value, this attribute is considered to cause a single-point multi-divergence structure during the construction of the minimum spanning tree.
Then, based on the above decision conditions 1 and 2, the specific steps of the minimum spanning tree construction module of the present invention are as follows:
(1) For the feature attribute set F' = {F'1, F'2, ..., F'k}, take the correlation measure SU(F'i, C) (i ∈ [1, k]) between each feature attribute F'i (1 ≤ i ≤ k) and the target attribute C as the weight of a node in a connected graph, and the correlation measure SU(F'i, F'j) (i, j ∈ [1, k] ∧ i ≠ j) between feature attributes as the weight of an edge, thereby constructing a connected undirected graph G. A minimum spanning tree of G is then built with the classical Prim algorithm, which both keeps all nodes connected and minimizes the sum of the edge weights in the tree.
(2) Define a variable n with initial value 0. Compare a feature attribute F'i (i ∈ [1, k]) in F' with the other feature attributes in F' one by one according to decision condition 1, and add 1 to n each time F'i is found to be the minimum-redundancy attribute of another feature attribute.
(3) Judge the n value of feature attribute F'i according to decision condition 2. If n ≥ δ·k, extract F'i from the connected graph G separately and add it to the final feature attribute subset; otherwise, if n < δ·k, do not process F'i, and execute steps (2) and (3) for another feature attribute F'j (j ∈ [1, k] ∧ i ≠ j) that has not yet been judged.
After all feature attributes in F' have been judged, the construction of the minimum spanning tree is finished.
Step 4: partition the minimum spanning tree, remove the redundant attributes and extract the key feature variables.
After the construction of the minimum spanning tree ends, any edge E = {(F'i, F'j) | F'i, F'j ∈ F' ∧ i, j ∈ [1, k] ∧ i ≠ j} of the minimum spanning tree is analyzed. If [SU(F'i, F'j) < SU(F'i, C)] ∧ [SU(F'i, F'j) < SU(F'j, C)] holds, i.e. the weight SU(F'i, F'j) of this edge is smaller than the correlation measures SU(F'i, C) and SU(F'j, C) between the nodes at its two ends and the target attribute, this edge is removed from the minimum spanning tree.
During this removal process, each edge-removal operation splits one tree into two subtrees. After all splitting operations end, for each subtree with node set V(T) it can be found that every pair of nodes F'i, F'j ∈ V(T) always satisfies the condition [SU(F'i, F'j) < SU(F'i, C)] ∧ [SU(F'i, F'j) < SU(F'j, C)]; according to Definition 3, all feature attributes in the feature attribute set corresponding to the node set V(T) are mutually redundant with respect to the target attribute C.
After the splitting of the tree is completed, a forest containing multiple trees is obtained. Assuming that the finally obtained forest contains n trees in total, for each subtree Ti ∈ {T1, T2, ..., Tn} all the feature attributes contained in the tree are mutually redundant; according to Definition 4, the feature attribute with the maximum SU(Fj, C) value is extracted from each tree as its representative feature attribute. A feature set containing n key feature attributes is thus obtained, in which no mutually redundant attributes exist, completing the removal of the redundant attributes.
Further, the high-dimensional data key feature extraction method based on an improved minimum spanning tree of the present invention may also have the following feature:
In Step 1, the method of searching for outliers is: since each attribute in the data has a reasonable range in the actual production process, the outliers are searched for with an upper-and-lower-bound search method.
Further, the high-dimensional data key feature extraction method based on an improved minimum spanning tree of the present invention may also have the following feature: in Step 1, after the data integration is completed, each record in the data table, indexed uniquely by the strip number, contains all the process information collected while the strip is rolled by each stand, including the finishing temperature, stand cooling water, reduction, and strip width and thickness.
Further, the high-dimensional data key feature extraction method based on an improved minimum spanning tree of the present invention may also have the following feature:
In Step 1, the finishing temperature is taken as the target attribute and the other attributes in the data set as feature attributes. The numerical interval of the finishing temperature is divided into five regions symmetric about the finishing-temperature target value T0, namely:
(0, T0 - 4α), (T0 - 4α, T0 - α), (T0 - α, T0 + α), (T0 + α, T0 + 4α), (T0 + 4α, +∞).
For the feature attributes, discretization is carried out with the minimum description length algorithm.
First, the class entropy of a sample set S is defined as:
Ent(S) = -Σ (i = 1..K) P(Ci, S)·log2 P(Ci, S)    (1)
where S is the data set, K is the number of classes {C1, ..., CK} contained in the target attribute C, and P(Ci, S) is the proportion of samples in S belonging to class Ci.
Then, the class information entropy of the data set after partitioning is defined as:
E(A, T; S) = (|S1|/|S|)·Ent(S1) + (|S2|/|S|)·Ent(S2)    (2)
where |S| is the number of samples contained in the data set S, A is a feature attribute of the data set, T is a candidate cut point of attribute A, and S1, S2 are the two data sets after the data set S is partitioned.
The information gain corresponding to the cut point determined in this way is:
Gain(A, T; S) = Ent(S) - E(A, T; S)    (3)
According to the minimum description length criterion, a cut point is accepted when:
Gain(A, T; S) > log2(|S| - 1)/|S| + Δ(A, T; S)/|S|    (4)
where Δ(A, T; S) = log2(3^K - 2) - [K·Ent(S) - K1·Ent(S1) - K2·Ent(S2)], K is the number of classes contained in the original data set and K1, K2 are the numbers of classes contained in the two subsets S1, S2 respectively.
Formula (4) is exactly the decision condition of the MDL algorithm for separating cut points, referred to as the MDLPC criterion. Each value of attribute A is judged according to the MDLPC criterion, and the values satisfying the condition are taken as cut points.
Advantageous effects of the invention
The present invention proposes a high-dimensional data key feature extraction method based on an improved minimum spanning tree and applies it to the key feature extraction process for the finishing temperature. The missing values and outliers in the high-dimensional data are preprocessed, and preprocessing operations such as data integration and discretization of continuous attributes improve the analyzability of the data. Considering that the actual finish-rolling process data are typical high-dimensional data prone to single-point multi-divergence in the minimum spanning tree, decision conditions are provided to extract in advance the feature attributes that would cause single-point multi-divergence structures, so that they do not enter the construction of the minimum spanning tree. This effectively avoids the failure of redundancy removal caused by single-point multi-divergence and significantly improves the efficiency of extracting the key feature attributes that influence the finishing temperature, thereby improving the accuracy of hot-rolling finishing-temperature modeling and the reliability of rolling control.
Therefore, the present invention helps to realize high-accuracy object prediction models and effective multi-objective optimal control strategies, and is of great significance for improving product quality.
Detailed description of the invention
Fig. 1 is a schematic diagram of the discretization of the finishing temperature;
Fig. 2 is a schematic diagram of the minimum spanning tree of the present invention;
Fig. 3 shows the number of attributes in the feature subset when θ takes different values;
Fig. 4 shows the total redundancy Rsum of the feature subset when θ takes different values.
Specific embodiment
A specific embodiment of the present invention is described below with reference to the accompanying drawings.
<embodiment one>
The high-dimensional data key feature extraction method based on an improved minimum spanning tree comprises the following steps:
Step 1: preprocess the hot-rolled strip data, comprising data cleaning, data integration and discretization of continuous attributes.
1. Data cleaning
For the outliers in the hot-rolling process data, since each attribute in the data has its own reasonable range in the actual production process, the outliers are searched for with an upper-and-lower-bound search method. That is, according to prior knowledge of the actual production process, reasonable upper and lower bounds are set for the value of each feature attribute, and data beyond this reasonable range are regarded as outliers. Since the data samples containing missing values and outliers account for a very small proportion of the data set, and the data of different strips are mutually independent, these abnormal strip data samples are simply deleted.
2. Data integration
Data integration is the process of merging data from multiple data sources and storing them in a data set of consistent structure. The raw hot-rolling process data set contains two data tables, a section actual-value table and a stand actual-value table. In the finish-rolling production process, each strip is divided into a head section, a middle section and a tail section during rolling, and passes through seven stands in total. Since the plate shape of the middle section of the strip is more stable during rolling than that of the two ends, its data are more representative of the strip, so the middle-section data are selected as the data to be analyzed.
Through data integration, the multiple process data tables are integrated into one new data table. Each record in the data table, indexed uniquely by the strip number, contains all the process information collected while the strip is rolled by the seven stands, including the finishing temperature, stand cooling water, reduction, strip width and thickness, etc. Some attributes are discrete variables with discrete values, while most attributes are continuous variables.
The finishing temperature is taken as the target attribute, and the other attributes in the data set as feature attributes.
4. Attribute discretization
The target attribute finishing temperature is a continuous process variable and, according to the data classification requirement, its values need to be discretized. A nonlinear partitioning method is adopted to discretize the finishing temperature. According to the distribution of the finishing temperature in the actual finish-rolling process data, the numerical interval of the finishing temperature is divided into five regions symmetric about the finishing-temperature target value. In order to make the prediction of the finishing temperature more accurate, the interval containing the finishing-temperature target value is appropriately narrowed when these five regions are divided; the discretization scheme is shown in Fig. 1.
Here T0 is the finishing-temperature target value and α is the temperature deviation; the values of these two variables depend on the actual finish-rolling process.
After the discretization of the target attribute finishing temperature, the values of the target attribute are divided into several discrete values.
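The five-region partition of Fig. 1 can be sketched as a simple mapping; T0 and α below are hypothetical placeholder values, since the real ones depend on the actual finish-rolling process.

```python
# Sketch of the nonlinear five-region discretization of the finishing
# temperature: (0, T0-4a), (T0-4a, T0-a), (T0-a, T0+a), (T0+a, T0+4a),
# (T0+4a, +inf). T0 and ALPHA are illustrative assumptions.
def discretize_finishing_temp(t, t0, alpha):
    """Map a finishing-temperature value t to its region index 0..4."""
    edges = [t0 - 4 * alpha, t0 - alpha, t0 + alpha, t0 + 4 * alpha]
    for i, edge in enumerate(edges):
        if t < edge:
            return i        # regions 0..3, narrow region 2 around the target
    return 4                # (T0 + 4a, +inf)

T0, ALPHA = 870.0, 5.0      # hypothetical target value and deviation
```

Note the middle region (T0 - α, T0 + α) is deliberately narrow, matching the "appropriate narrowing" of the target-value interval described above.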
For the feature attributes in the data set, discretization is carried out with the minimum description length (Minimum Description Length, MDL) algorithm.
First, the class entropy of a sample set S is defined as:
Ent(S) = -Σ (i = 1..K) P(Ci, S)·log2 P(Ci, S)    (1)
where S is the data set, K is the number of classes {C1, ..., CK} contained in the target attribute C, and P(Ci, S) is the proportion of samples in S belonging to class Ci.
Then, the class information entropy of the data set after partitioning is defined as:
E(A, T; S) = (|S1|/|S|)·Ent(S1) + (|S2|/|S|)·Ent(S2)    (2)
where |S| is the number of samples contained in the data set S, A is a feature attribute of the data set, T is a candidate cut point of attribute A, and S1, S2 are the two data sets after the data set S is partitioned.
The information gain corresponding to the cut point determined in this way is:
Gain(A, T; S) = Ent(S) - E(A, T; S)    (3)
According to the minimum description length criterion, a cut point is accepted when:
Gain(A, T; S) > log2(|S| - 1)/|S| + Δ(A, T; S)/|S|    (4)
where Δ(A, T; S) = log2(3^K - 2) - [K·Ent(S) - K1·Ent(S1) - K2·Ent(S2)], K is the number of classes contained in the original data set and K1, K2 are the numbers of classes contained in the two subsets S1, S2 respectively.
Formula (4) is exactly the decision condition of the MDL algorithm for separating cut points, referred to as the MDLPC criterion. Each value of attribute A is judged according to the MDLPC criterion, and the values satisfying the condition are taken as cut points.
Compared with traditional discretization methods that iteratively bisect the continuous attribute, the MDL algorithm can divide a continuous attribute into multiple discrete regions in one discretization pass, which reduces the computational complexity of the algorithm. Therefore, the present invention discretizes the feature attributes in the actual finish-rolling process data with the MDL algorithm.
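The class entropy, partition entropy, information gain and MDLPC acceptance test above can be sketched as follows; the sketch assumes the samples are already sorted by the value of attribute A, so a candidate cut point is simply an index into the class-label list.

```python
import math
from collections import Counter

# Sketch of the MDLPC criterion. `labels` holds the discrete target-attribute
# class of each sample in S, sorted by the value of feature attribute A;
# `split` is the index of a candidate cut point.
def ent(labels):
    """Class entropy Ent(S) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def mdlpc_accepts(labels, split):
    """True if the cut point passes the MDLPC decision condition."""
    s1, s2 = labels[:split], labels[split:]
    n = len(labels)
    e_split = len(s1) / n * ent(s1) + len(s2) / n * ent(s2)
    gain = ent(labels) - e_split
    k, k1, k2 = (len(set(x)) for x in (labels, s1, s2))
    delta = math.log2(3 ** k - 2) - (k * ent(labels) - k1 * ent(s1) - k2 * ent(s2))
    return gain > math.log2(n - 1) / n + delta / n
```

A cut that perfectly separates two classes is accepted, while a cut through class-balanced noise is rejected, which is exactly the behavior the criterion is meant to provide.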
Step 2: removal of irrelevant features
Before the method for removing irrelevant feature attributes is described, the definitions of information gain and symmetric uncertainty need to be introduced.
Information gain: let the information gain Gain(X|Y) denote the information increment of X given Y; then
Gain(X|Y) = H(X) - H(X|Y)    (5)
where H(X|Y) is the conditional entropy, representing the entropy of the random variable X when the value of the random variable Y is known. Letting p(x) be the prior probability of each value of X and p(x|y) the posterior probability of each value of X given the value of Y, then
H(X|Y) = -Σy p(y) Σx p(x|y)·log2 p(x|y)    (6)
Information gain is a symmetric measure, which guarantees that the order of X and Y does not affect the calculated result, and also makes the symmetric uncertainty symmetric for a pair of attributes.
Symmetric uncertainty: let X and Y be discrete random variables; the symmetric uncertainty (Symmetric Uncertainty, SU) is
SU(X, Y) = 2·Gain(X|Y) / (H(X) + H(Y))    (7)
where H(X) is the entropy of the discrete random variable X; letting p(x) be the prior probability of each value of X,
H(X) = -Σx p(x)·log2 p(x)    (8)
The symmetric uncertainty normalizes the value of the information gain into the interval [0, 1]: SU(X, Y) = 1 represents that the value of either variable can completely predict the value of the other, while SU(X, Y) = 0 represents that the two variables are completely uncorrelated.
For a data set D containing m feature attributes F = {F1, F2, ..., Fm} and a target attribute C, in order to identify the irrelevant attributes in the attribute set, the value SU(Fi, C) between each feature attribute Fi (1 ≤ i ≤ m) and the target attribute C is first computed. If the SU(Fi, C) value of a feature attribute Fi (1 ≤ i ≤ m) is greater than a predefined relevance threshold θ, it is considered a feature attribute relevant to the target attribute; the feature attributes satisfying this condition are extracted to form a new feature attribute subset F' = {F'1, F'2, ..., F'k} (k ≤ m).
All features in the new feature attribute set now have a certain correlation with the target attribute, which guarantees that all attributes irrelevant to the target attribute are removed.
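The symmetric-uncertainty computation and the threshold filtering above can be sketched as follows; the gain is computed via the equivalent joint-entropy form Gain(X|Y) = H(X) + H(Y) - H(X, Y), which equals H(X) - H(X|Y).

```python
import math
from collections import Counter

# Sketch of symmetric uncertainty and irrelevant-feature removal.
# Attributes are assumed already discretized: xs and ys are equal-length
# lists of discrete values.
def entropy(xs):
    n = len(xs)
    return -sum(c / n * math.log2(c / n) for c in Counter(xs).values())

def su(xs, ys):
    """Symmetric uncertainty SU(X, Y) = 2*Gain(X|Y) / (H(X) + H(Y))."""
    joint = entropy(list(zip(xs, ys)))           # H(X, Y)
    gain = entropy(xs) + entropy(ys) - joint     # = H(X) - H(X|Y)
    denom = entropy(xs) + entropy(ys)
    return 2 * gain / denom if denom else 0.0

def remove_irrelevant(features, target, theta):
    """Keep only the feature attributes with SU(Fi, C) > theta."""
    return {name: col for name, col in features.items()
            if su(col, target) > theta}
```

For example, a feature identical to the target gives SU = 1, a feature independent of it gives SU = 0, and only the former survives a threshold of θ = 0.5.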
Step 3: construction of the minimum spanning tree
First, the following definitions and decision conditions are given.
Definition 1: SU(Fi, C) is the correlation between a feature attribute Fi ∈ F and the target attribute C. If SU(Fi, C) is greater than the predefined threshold θ, Fi is considered a feature attribute relevant to the target attribute C.
Definition 2: SU(Fi, Fj) is the correlation between a pair of attributes Fi and Fj (Fi, Fj ∈ F ∧ i ≠ j).
Definition 3: Let S = {F1, F2, ..., Fk} (k < |F|) be a feature cluster. If there exists Fj ∈ S such that for every Fi ∈ S (i ≠ j) the condition [SU(Fj, C) ≥ SU(Fi, C)] ∧ [SU(Fi, Fj) > SU(Fi, C)] always holds, then Fi is considered redundant with respect to Fj.
Definition 4: A feature attribute Fi ∈ S = {F1, F2, ..., Fk} (k < |F|) is regarded as the representative feature of S if and only if Fi = argmax Fj∈S SU(Fj, C); that is, the feature attribute possessing the maximum SU(Fj, C) value in the feature cluster S can serve as the representative feature of this cluster.
In Definitions 1 to 4, ∧ denotes the logical AND and |F| denotes the number of attributes contained in the feature set F. Under these definitions, the process of feature subset selection, namely key feature extraction, can be converted into identifying and retaining the feature attributes that satisfy the condition SU(Fi, C) ≥ θ of Definition 1, and then selecting the representative feature of each feature cluster.
Decision condition 1: Let F'i (i ∈ [1, k]) be a feature attribute in the feature attribute set F'. If its symmetric uncertainty SU(F'i, F'j) with another feature attribute F'j (j ∈ [1, k] ∧ i ≠ j) is the minimum among all symmetric uncertainty values involving F'j, then F'i is the minimum-redundancy feature attribute of F'j in the entire feature attribute set.
Decision condition 2: If a feature attribute F'i (i ∈ [1, k]) is the minimum-redundancy feature attribute of at least δ·k feature attributes in the set, where δ ∈ (0, 1) is a predefined value, this attribute is considered to cause a single-point multi-divergence structure during the construction of the minimum spanning tree.
Then, based on the above decision conditions 1 and 2, the specific steps of the minimum spanning tree construction module of the present invention are as follows:
(1) For the feature attribute set F' = {F'1, F'2, ..., F'k}, take the correlation measure SU(F'i, C) (i ∈ [1, k]) between each feature attribute F'i (1 ≤ i ≤ k) and the target attribute C as the weight of a node in a connected graph, and the correlation measure SU(F'i, F'j) (i, j ∈ [1, k] ∧ i ≠ j) between feature attributes as the weight of an edge, thereby constructing a connected undirected graph G. A minimum spanning tree of G is then built with the classical Prim algorithm, which both keeps all nodes connected and minimizes the sum of the edge weights in the tree.
(2) Define a variable n with initial value 0. Compare a feature attribute F'i (i ∈ [1, k]) in F' with the other feature attributes in F' one by one according to decision condition 1, and add 1 to n each time F'i is found to be the minimum-redundancy attribute of another feature attribute.
(3) Judge the n value of feature attribute F'i according to decision condition 2. If n ≥ δ·k, extract F'i from the connected graph G separately and add it to the final feature attribute subset; otherwise, if n < δ·k, do not process F'i, and execute steps (2) and (3) for another feature attribute F'j (j ∈ [1, k] ∧ i ≠ j) that has not yet been judged.
After all feature attributes in F' have been judged, the construction of the minimum spanning tree is finished.
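Steps (1) to (3) above can be sketched as follows. The SU values are supplied as a precomputed symmetric table; δ and the toy values are illustrative assumptions.

```python
# Sketch of the improved MST construction. `su` is a precomputed symmetric
# table of SU(F'i, F'j) values; `delta` is the predefined ratio of decision
# condition 2.
def split_divergent(names, su, delta):
    """Decision conditions 1 and 2: count, for each attribute, how many other
    attributes it is the minimum-redundancy attribute of; attributes whose
    count reaches delta*k are set aside for the final subset directly."""
    k = len(names)
    n_count = {f: 0 for f in names}
    for fj in names:
        # decision condition 1: the attribute minimising SU(., F'j)
        fi = min((f for f in names if f != fj), key=lambda f: su[f][fj])
        n_count[fi] += 1
    divergent = [f for f in names if n_count[f] >= delta * k]
    rest = [f for f in names if n_count[f] < delta * k]
    return divergent, rest

def prim_mst(names, su):
    """Classical Prim algorithm on the complete graph whose edge weights are
    the SU values; returns the tree edges in insertion order."""
    in_tree, edges = {names[0]}, []
    while len(in_tree) < len(names):
        u, v = min(((a, b) for a in in_tree for b in names if b not in in_tree),
                   key=lambda e: su[e[0]][e[1]])
        in_tree.add(v)
        edges.append((u, v))
    return edges

# Toy symmetric SU table for three retained feature attributes.
su = {"a": {"b": 0.9, "c": 0.1},
      "b": {"a": 0.9, "c": 0.2},
      "c": {"a": 0.1, "b": 0.2}}
```

In the toy table, "c" is the minimum-redundancy attribute of both "a" and "b", so with δ = 0.5 it is extracted separately before the tree is built, which is precisely the single-point-divergence safeguard; Prim then spans the remaining graph.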
Step 4: partition the minimum spanning tree, remove the redundant attributes and extract the key feature variables.
After the construction of the minimum spanning tree ends, any edge E = {(F'i, F'j) | F'i, F'j ∈ F' ∧ i, j ∈ [1, k] ∧ i ≠ j} of the minimum spanning tree is analyzed. If
[SU(F'i, F'j) < SU(F'i, C)] ∧ [SU(F'i, F'j) < SU(F'j, C)]    (9)
holds, i.e. the weight SU(F'i, F'j) of this edge is smaller than the correlation measures SU(F'i, C) and SU(F'j, C) between the nodes at its two ends and the target attribute, this edge is removed from the minimum spanning tree.
During this removal process, each edge-removal operation splits one tree into two subtrees. After all splitting operations end, for each subtree with node set V(T) it can be found that every pair of nodes F'i, F'j ∈ V(T) always satisfies the condition [SU(F'i, F'j) < SU(F'i, C)] ∧ [SU(F'i, F'j) < SU(F'j, C)]. According to Definition 3 above, all feature attributes in the feature attribute set corresponding to the node set V(T) are mutually redundant with respect to the target attribute C.
After the splitting of the tree is completed, a forest containing multiple trees is obtained. Assuming that the finally obtained forest contains n trees in total, for each subtree Ti ∈ {T1, T2, ..., Tn} all the feature attributes contained in the tree are mutually redundant. According to Definition 4, it is only necessary to extract the feature attribute with the maximum SU(Fj, C) value in each tree as its representative feature attribute. A feature set containing n key feature attributes is thus obtained; obviously no mutually redundant attributes exist in this feature set, which completes the removal of the redundant attributes.
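The edge removal and representative selection of Step 4 can be sketched as follows; the toy SU values are illustrative assumptions.

```python
# Sketch of Step 4: remove the MST edges whose weight is below both
# endpoint-to-target correlations, then take from each resulting subtree the
# feature with the largest SU to the target as its representative.
# su_f: inter-feature SU table; su_c: SU(F'i, C) per feature.
def extract_key_features(nodes, edges, su_f, su_c):
    kept = [(u, v) for (u, v) in edges
            if not (su_f[u][v] < su_c[u] and su_f[u][v] < su_c[v])]
    # connected components of the forest left after edge removal
    comp = {f: {f} for f in nodes}
    for u, v in kept:
        merged = comp[u] | comp[v]
        for f in merged:
            comp[f] = merged
    subtrees = {frozenset(s) for s in comp.values()}
    # each subtree's representative is its argmax of SU(Fj, C)
    return sorted(max(tree, key=lambda f: su_c[f]) for tree in subtrees)

# Toy MST over three features: edge (a, b) is weak on both ends and is cut,
# edge (b, c) survives, so {a} and {b, c} become separate clusters.
su_c = {"a": 0.8, "b": 0.3, "c": 0.7}
su_f = {"a": {"b": 0.1, "c": 0.5},
        "b": {"a": 0.1, "c": 0.6},
        "c": {"a": 0.5, "b": 0.6}}
keys = extract_key_features(["a", "b", "c"], [("a", "b"), ("b", "c")], su_f, su_c)
```

The naive component merge is quadratic but adequate for a sketch; a union-find structure would be the idiomatic choice at scale.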
<embodiment two>
The algorithm flow of the invention is as follows:
Step 1: initialization, defining the input and output data sets and completing data cleaning, data integration, and attribute discretization;
Step 2: remove the uncorrelated attributes;
Step 3: construct the minimum spanning tree;
Step 4: complete the segmentation of the minimum spanning tree and select the key (representative) features based on the correlation measure between attributes.
The following variable assumptions are first made: TR is the intermediate variable used when computing the symmetric uncertainty between a characteristic attribute Fi and the objective attribute C; FC is the intermediate variable used when computing the symmetric uncertainty between two characteristic attributes Fi and Fj; for each subtree Ti, the characteristic attribute with the largest symmetric-uncertainty value with respect to the objective attribute C is recorded as its representative; and k is the number of nodes in the connected graph G. Based on these variable assumptions, the detailed procedure of the key-feature extraction method proposed in the present invention is as follows.
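As a reference point for the correlation measure used throughout, a minimal sketch of computing symmetric uncertainty from two discrete sample sequences (Python, toy data), using SU(X, Y) = 2 · IG(X|Y) / (H(X) + H(Y)) with IG(X|Y) = H(X) + H(Y) − H(X, Y):

```python
from collections import Counter
from math import log2

def entropy(xs):
    """Shannon entropy H(X) of a discrete sample sequence, in bits."""
    n = len(xs)
    return -sum(c / n * log2(c / n) for c in Counter(xs).values())

def symmetric_uncertainty(xs, ys):
    """SU(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y)) with
    IG(X|Y) = H(X) + H(Y) - H(X, Y)."""
    hx, hy = entropy(xs), entropy(ys)
    hxy = entropy(list(zip(xs, ys)))  # joint entropy H(X, Y)
    if hx + hy == 0:  # both sequences are constant
        return 1.0
    return 2 * (hx + hy - hxy) / (hx + hy)

# Toy samples: y repeats x exactly, so the two attributes are fully
# redundant and SU is 1; independent attributes give SU = 0.
su_same = symmetric_uncertainty([0, 0, 1, 1, 0, 1], [0, 0, 1, 1, 0, 1])
su_indep = symmetric_uncertainty([0, 1, 0, 1], [0, 0, 1, 1])
```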
Production data recorded while a domestic hot rolling mill produced a certain strip product were used as experimental data, and feature-subset extraction was performed with the finishing temperature as the objective attribute. 500 data samples were randomly drawn from a pre-processed data table as experimental data; obviously redundant and uncorrelated characteristic attributes had already been eliminated from this table according to hot-rolling process knowledge, so the table contains 57 characteristic attributes and one objective attribute in total. The attribute set after removal of the uncorrelated attributes is shown in Table 1. The relevance threshold θ is taken as the SU value ranked ⌊√m · lg m⌋-th when the SU(Fi, C) (i ∈ [1, m]) values of the entire attribute set are sorted in descending order; with the number of characteristic attributes m = 57, θ is therefore the 13th-largest SU value.
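Assuming the FAST convention that θ is the SU value at descending rank ⌊√m · lg m⌋ (base-10 logarithm), which evaluates to 13 for m = 57 and so matches the text above, the threshold selection can be sketched as:

```python
import math
import random

def relevance_threshold(su_values):
    """theta = the SU value at descending rank floor(sqrt(m) * lg m),
    where m is the number of features (rank rule assumed from FAST)."""
    m = len(su_values)
    rank = math.floor(math.sqrt(m) * math.log10(m))
    return sorted(su_values, reverse=True)[rank - 1]

# For m = 57 the rank is floor(7.55 * 1.76) = 13, i.e. the relevance
# threshold is the 13th-largest SU(Fi, C) value, as stated above.
random.seed(0)
su_values = [random.random() for _ in range(57)]  # placeholder SU values
theta = relevance_threshold(su_values)
```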
After feature-subset extraction was performed on this data table with the common FAST algorithm, the extracted key feature subset is shown in Table 2. It can be seen that the feature subset extracted by the common FAST (Fast clustering-bAsed feature Selection algoriThm) algorithm contains 13 characteristic attributes, among which are the screw-down amounts of 7 stands, the exit thicknesses of 5 stands, and the rolling speed of one stand. This shows that the de-redundancy operation of the FAST algorithm fails owing to the occurrence of the single-point multi-divergence problem, resulting in low key-feature extraction efficiency.
Table 1 Attribute set after removal of the uncorrelated attributes
In Table 1, SCREW_DOWN denotes the screw-down amount, SPEED denotes the rolling speed, and EXIT_THICK denotes the strip exit thickness at a stand.
Table 2 Feature subset extracted by the FAST algorithm
The extraction of the feature subset using the method of this patent proceeds as follows.
First, the predefined relevance threshold θ used in the uncorrelated-attribute removal step of both algorithms is taken to be the same as the threshold proposed in FAST, and the parameter δ in the single-point multi-divergence decision condition of the present algorithm is set to a fixed predefined value.
Secondly, the minimum spanning tree (including its subtrees) constructed by the algorithm of this patent is shown in Figure 2.
Thirdly, using symmetric uncertainty as the correlation measure between attributes, the extracted key feature subset is shown in Table 3.
Table 3 Feature subset extracted by the algorithm of this patent
In the table, SCREW_DOWN denotes the screw-down amount, SPEED denotes the rolling speed, and EXIT_THICK denotes the strip exit thickness at a stand.
Finally, the overall redundancy is compared for the following three cases: the original characteristic attribute set, the subset extracted by the original FAST algorithm, and the subset extracted by the algorithm of this patent; the comparison is shown in Table 4. Suppose the characteristic attribute set corresponding to data set D is F = {F1, F2, ..., Fm}; then the overall redundancy of this attribute set is

Rsum = Σ(a=1..m−1) Σ(b=a+1..m) SU(Fa, Fb)

where SU(Fa, Fb) is the symmetric-uncertainty value between characteristic attributes Fa and Fb.
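The overall-redundancy comparison can be sketched as follows (Python; the pairwise-summation form of Rsum is an assumption consistent with the description above, and the SU values are hypothetical):

```python
from itertools import combinations

def overall_redundancy(features, su_pair):
    """R_sum: total symmetric uncertainty over all unordered feature
    pairs (assumed form of the overall-redundancy measure)."""
    return sum(su_pair[frozenset(p)] for p in combinations(features, 2))

# Hypothetical pairwise SU values for a three-feature set.
su_pair = {frozenset(("F1", "F2")): 0.10,
           frozenset(("F1", "F3")): 0.40,
           frozenset(("F2", "F3")): 0.25}
r_sum = overall_redundancy(["F1", "F2", "F3"], su_pair)  # 0.10+0.40+0.25
```

Dropping a redundant feature removes all of its pairwise terms from the sum, which is why a smaller extracted subset tends to have a lower Rsum.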
Table 4 Overall redundancy of the feature sets in the three cases
After the key-feature extraction method provided by this patent was applied to the finishing-mill data with the finishing temperature as the objective attribute, it can be seen from Table 3 that the feature subset extracted by the algorithm of this patent contains only 2 characteristic attributes: the rolling speed of one stand and the screw-down amount of one stand. According to the analysis of the factors influencing the finishing temperature, the rolling speed, the screw-down amount, and the exit thickness are all important influencing factors; this shows that both the FAST algorithm and the algorithm provided by this patent can correctly extract the key characteristic attributes that influence the finishing temperature. As can be seen from Table 4, the algorithm provided by this patent outperforms the original FAST algorithm in terms of the overall redundancy of the extracted feature subset.
The influence of the relevance threshold θ on the feature-subset extraction effect is shown in Figure 3, and the corresponding variation of the overall redundancy Rsum of the feature subset extracted by the algorithm of this patent is shown in Figure 4. From Figure 4 it can be seen that when the number of attributes in the feature subset is 1, the overall redundancy of the improved FAST algorithm provided by this patent is 0; when θ is greater than 28, the overall redundancy of the feature subset begins to decrease owing to the increase in uncorrelated attributes. This shows that, in practical finishing-mill key-feature data extraction, the algorithm provided by this patent removes the attributes that are uncorrelated with the finishing temperature and those that are mutually redundant in a single extraction pass, and satisfies the requirement of reducing the overall redundancy of the original feature set.
Claims (4)
1. A high-dimensional-data key-feature extraction method based on an improved minimum spanning tree, characterized by comprising:
Step 1: pre-processing the hot-rolled strip data, comprising the steps of data cleaning, data integration, and continuous-attribute discretization, wherein
data cleaning is the operation of searching for and deleting abnormal values in the hot-rolling process data,
data integration merges the data from multiple data sources and stores them in a data set with a consistent structure,
and continuous-attribute discretization adopts a non-linear division method to discretize the finishing temperature;
Step 2: removal of uncorrelated features,
if X and Y are discrete random variables, the symmetric uncertainty is

SU(X, Y) = 2 · IG(X|Y) / (H(X) + H(Y))

where H(X) is the entropy of the discrete random variable X and IG(X|Y) = H(X) − H(X|Y) is the information gain; assuming p(x) is the prior probability of each value of X, then

H(X) = −Σx p(x) · log2 p(x)

for a data set D with an attribute set F = {F1, F2, ..., Fm} containing m characteristic attributes and an objective attribute C, in order to identify the characteristic attributes in the attribute set that are uncorrelated with the objective attribute, the value SU(Fi, C) between each characteristic attribute Fi (1 ≤ i ≤ m) and the objective attribute C is first computed; if the SU(Fi, C) value of a characteristic attribute Fi (1 ≤ i ≤ m) is greater than a predefined relevance threshold θ, it is regarded as a characteristic attribute correlated with the objective attribute, and the characteristic attributes satisfying this condition are extracted to form a new characteristic attribute subset F' = {F'1, F'2, ..., F'k} (k ≤ m);
Step 3: building the minimum spanning tree,
first, the following definitions and decision conditions are given:
Definition 1: SU(Fi, C) is the correlation between a characteristic attribute Fi ∈ F and the objective attribute C; if SU(Fi, C) is larger than the predefined threshold θ, then Fi is regarded as a characteristic attribute correlated with the objective attribute C,
Definition 2: SU(Fi, Fj) is the correlation between a pair of attributes Fi and Fj (Fi, Fj ∈ F ∧ i ≠ j),
Definition 3: suppose S = {F1, F2, ..., Fi, ..., Fk} (k < |F|) is a feature cluster; if there exists Fj ∈ S such that, for any Fi ∈ S (i ≠ j), the condition [SU(Fj, C) ≥ SU(Fi, C)] ∧ [SU(Fi, Fj) > SU(Fi, C)] always holds, then Fi is regarded as redundant with respect to Fj,
Definition 4: a characteristic attribute Fi ∈ S = {F1, F2, ..., Fk} (k < |F|) is regarded as the representative feature of S if and only if Fi = argmax(Fj ∈ S) SU(Fj, C), i.e., the characteristic attribute possessing the largest SU(Fj, C) value in the feature set S serves as the representative feature of this feature cluster,
in Definitions 1 to 4, ∧ denotes logical AND and |F| denotes the number of attributes contained in the attribute set F; according to the definitions given above, the process of feature-subset selection, namely the process of key-feature extraction, can be converted into identifying and retaining the characteristic attributes that satisfy the condition SU(Fi, C) ≥ θ in Definition 1 and then selecting the representative feature in each feature cluster,
Decision condition 1: suppose F'i (i ∈ [1, k]) is a characteristic attribute in the characteristic attribute set F'; if the symmetric uncertainty SU(F'i, F'j) between F'i and another characteristic attribute F'j (j ∈ [1, k] ∧ i ≠ j) is the smallest among all symmetric-uncertainty values involving F'j, then F'i is regarded as the minimal-redundancy characteristic attribute of F'j in the entire characteristic attribute set,
Decision condition 2: if a characteristic attribute F'i (i ∈ [1, k]) is the minimal-redundancy characteristic attribute of at least δ × k characteristic attributes in the set, then this attribute is considered to lead to a single-point multi-divergence structure during the construction of the minimum spanning tree, where δ ∈ (0, 1) is a predefined value,
then, based on the above decision conditions 1 and 2, the specific steps of the minimum-spanning-tree construction module in the present invention are as follows:
(1) for the characteristic attribute set F' = {F'1, F'2, ..., F'k}, take the correlation measure SU(F'i, C) (i ∈ [1, k]) between each characteristic attribute F'i (1 ≤ i ≤ k) and the objective attribute C as the weight of the corresponding node in the connected graph, and the correlation measure SU(F'i, F'j) (i, j ∈ [1, k] ∧ i ≠ j) between characteristic attributes as the weight of the corresponding edge, so as to construct a connected undirected graph G; build a minimum spanning tree on G using the classical Prim algorithm, so that all nodes remain connected while the sum of the edge weights in the tree is minimal,
(2) define a variable n with an initial value of 0; compare a characteristic attribute F'i (i ∈ [1, k]) in F' with each of the other characteristic attributes in F' in turn according to decision condition 1, and add 1 to n each time F'i is found to be the minimal-redundancy attribute of another characteristic attribute,
(3) evaluate the n value of F'i according to decision condition 2: if n ≥ δ × k, extract F'i from the connected graph G on its own and add it to the final characteristic attribute subset; otherwise (n < δ × k), perform no processing on F'i and execute steps (2) and (3) for another characteristic attribute F'j (j ∈ [1, k] ∧ i ≠ j) that has not yet been examined,
once all characteristic attributes in F' have been examined, the construction of the minimum spanning tree is complete;
Step 4: splitting the minimum spanning tree, removing redundant attributes, and extracting the key characteristic variables,
after the construction process of the minimum spanning tree ends, each edge E = {(F'i, F'j) | F'i, F'j ∈ F' ∧ i, j ∈ [1, k] ∧ i ≠ j} in the minimum spanning tree is examined; if [SU(F'i, F'j) < SU(F'i, C)] ∧ [SU(F'i, F'j) < SU(F'j, C)] holds, i.e., the weight SU(F'i, F'j) of the edge is smaller than both correlation measures SU(F'i, C) and SU(F'j, C) between the nodes at the two ends of the edge and the objective attribute, then the edge is removed from the minimum spanning tree,
during the removal of these edges, each cutting of one edge splits one tree into two subtrees; after all cutting operations end, for each subtree with node set V(T), it can be found that every pair of nodes F'i, F'j ∈ V(T) left connected no longer satisfies the condition [SU(F'i, F'j) < SU(F'i, C)] ∧ [SU(F'i, F'j) < SU(F'j, C)]; according to Definition 3, all characteristic attributes in the characteristic attribute set corresponding to the node set V(T) are judged to be mutually redundant with respect to the objective attribute C;
after the cutting of the tree is complete, a forest containing multiple trees is obtained; supposing the finally obtained forest contains n trees in total, then for each subtree Ti ∈ {T1, T2, ..., Tn}, all characteristic attributes contained in the tree are mutually redundant; according to Definition 4, the characteristic attribute with the largest SU(Fj, C) value in each tree is extracted as the representative characteristic attribute, and an attribute set containing n key characteristic attributes is obtained in which no mutually redundant attributes remain, thereby completing the removal of the redundant attributes.
2. The high-dimensional-data key-feature extraction method based on an improved minimum spanning tree according to claim 1, characterized in that:
in step 1, the method for searching for abnormal values is: according to the reasonable range of each attribute of the data in the actual production process, abnormal values are searched for using an upper-and-lower-bound search method.
3. The high-dimensional-data key-feature extraction method based on an improved minimum spanning tree according to claim 1, characterized in that:
in step 1, after data integration is completed, each record in the data table takes the strip steel grade as its unique index and includes all the process information collected while the strip passes through each stand, including the finishing temperature, the inter-stand cooling water, the screw-down amount, and the width and thickness of the strip.
4. The high-dimensional-data key-feature extraction method based on an improved minimum spanning tree according to claim 3, characterized in that:
in step 1, the finishing temperature is taken as the objective attribute and the other attributes in the data set as characteristic attributes,
the numerical interval of the finishing temperature is divided into five regions symmetric about the finishing-temperature target value T0, namely (0, T0 − 4α), (T0 − 4α, T0 − α), (T0 − α, T0 + α), (T0 + α, T0 + 4α), (T0 + 4α, +∞),
and the characteristic attributes are discretized using the minimum description length (MDL) algorithm,
first, the class entropy of a sample set S is defined as

Ent(S) = −Σ(i=1..K) P(Ci, S) · log2 P(Ci, S)   (1)

wherein, S --- the data set;
K --- the number of classes {C1, ..., CK} contained in the objective attribute C;
P(Ci, S) --- the proportion of samples in S belonging to class Ci,
then, the entropy of the data set after a split is defined as

E(A, T; S) = (|S1| / |S|) · Ent(S1) + (|S2| / |S|) · Ent(S2)   (2)

wherein, |S| --- the number of samples contained in data set S;
A --- a characteristic attribute in the data set;
T --- a cut value taken from the objective attribute (finishing temperature);
S1, S2 --- the two data sets obtained after data set S is split,
the information gain corresponding to the determined cut point is then

Gain(A, T; S) = Ent(S) − E(A, T; S)   (3)
according to the minimum-description-length criterion, a cut point is accepted when

Gain(A, T; S) > log2(|S| − 1) / |S| + Δ(A, T; S) / |S|   (4)

wherein Δ(A, T; S) = log2(3^K − 2) − [K · Ent(S) − K1 · Ent(S1) − K2 · Ent(S2)], K is the number of classes contained in the original data set, and K1, K2 are respectively the numbers of classes contained in the two subsets S1 and S2,
formula (4) is the decision condition of the MDL algorithm for separate division points, referred to as the MDLPC criterion; each value of attribute A is evaluated according to the MDLPC criterion, and the values satisfying the condition are taken as separate division points.
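The MDLPC acceptance test of formula (4) can be sketched as follows (Python, toy labels; the entropy terms follow the standard Fayyad-Irani formulation, which the Δ(A, T; S) definition in claim 4 matches):

```python
from collections import Counter
from math import log2

def class_entropy(labels):
    """Ent(S): class entropy of a labelled sample set, in bits."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def mdlpc_accepts(left, right):
    """Formula (4): decide whether the cut producing the class-label
    lists `left` and `right` is a separate division point."""
    s = left + right
    n = len(s)
    ent_s = class_entropy(s)
    gain = ent_s - (len(left) / n * class_entropy(left)
                    + len(right) / n * class_entropy(right))
    k, k1, k2 = len(set(s)), len(set(left)), len(set(right))
    delta = (log2(3 ** k - 2)
             - (k * ent_s - k1 * class_entropy(left)
                - k2 * class_entropy(right)))
    return gain > log2(n - 1) / n + delta / n

# A cut that separates the two classes perfectly is accepted; a cut that
# leaves both classes evenly mixed on both sides is not.
good_cut = mdlpc_accepts([0] * 20, [1] * 20)
bad_cut = mdlpc_accepts([0, 1] * 10, [0, 1] * 10)
```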
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810917990.3A CN109101626A (en) | 2018-08-13 | 2018-08-13 | Based on the high dimensional data critical characteristic extraction method for improving minimum spanning tree |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109101626A true CN109101626A (en) | 2018-12-28 |
Family
ID=64849686
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105975589A (en) * | 2016-05-06 | 2016-09-28 | 哈尔滨理工大学 | Feature selection method and device of high-dimension data |
CN106570178A (en) * | 2016-11-10 | 2017-04-19 | 重庆邮电大学 | High-dimension text data characteristic selection method based on graph clustering |
CN107273909A (en) * | 2016-04-08 | 2017-10-20 | 上海市玻森数据科技有限公司 | The sorting algorithm of high dimensional data |
Non-Patent Citations (1)
Title |
---|
Wang Yihan: "An Improved MST Key Feature Extraction Method and Its Application in Finishing Temperature Modeling", Wanfang Academic Journal Database *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116809652A (en) * | 2023-03-28 | 2023-09-29 | 材谷金带(佛山)金属复合材料有限公司 | Abnormality analysis method and system for hot rolling mill control system |
CN116809652B (en) * | 2023-03-28 | 2024-04-26 | 材谷金带(佛山)金属复合材料有限公司 | Abnormality analysis method and system for hot rolling mill control system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20181228 |