CN112685690A - Large-scale feature selection method based on rough hypercube under cloud platform - Google Patents


Info

Publication number
CN112685690A
CN112685690A (application CN202011561665.1A)
Authority
CN
China
Prior art keywords
feature
hypercube
matrix
subset
cloud platform
Prior art date
Legal status
Pending
Application number
CN202011561665.1A
Other languages
Chinese (zh)
Inventor
王思朝
罗川
马磊
曹潜
张展云
Current Assignee
Sichuan University
Original Assignee
Sichuan University
Priority date: 2020-12-25
Filing date: 2020-12-25
Publication date: 2021-04-20
Application filed by Sichuan University
Priority to CN202011561665.1A
Publication of CN112685690A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a large-scale feature selection method based on a rough hypercube under a cloud platform. The invention mainly provides a cloud-platform-oriented representation method for the hypercube equivalent partition matrix together with the corresponding feature metrics, designs a cache-update-filter acceleration mechanism, and proposes a rough-hypercube-based large-scale feature selection method under the cloud platform. The disclosed method not only effectively processes massive continuous data while ensuring the quality of feature selection and eliminating redundant features, but also exhibits good scalability for clusters and data volumes of different scales.

Description

Large-scale feature selection method based on rough hypercube under cloud platform
Technical Field
The invention relates to the technical field of large-scale data feature selection, and in particular to a large-scale feature selection method based on a rough hypercube under a cloud platform.
Background
Owing to the rapid development of computer and Internet technologies, both the speed at which data is generated and the scale at which it is stored keep increasing in industries such as the military, finance, and communications. Meanwhile, data is no longer limited to discrete features; continuous features are increasingly common, especially in fields such as energy, weather, and remote sensing. High-dimensional data not only increases computational complexity but also easily causes machine learning algorithms to overfit, thereby degrading their learning performance. Feature selection keeps the performance of the learning model stable while identifying the features relevant to data analysis and eliminating redundant features as much as possible, which is the main reason it is popular in fields such as pattern recognition and machine learning.
Cloud computing, a form of distributed computing, breaks through the resource limitations of a single machine and, by building computer clusters, provides a good solution for large-scale data computation. The common approaches to the continuous large-scale feature selection problem are currently: 1) first discretize the continuous features in the data set, using methods such as equal-width, equal-frequency, or optimized discretization, and then perform feature selection on the discretized data by combining the cloud platform with distributed computing techniques and the Pawlak rough set model; although the equivalence relation in the Pawlak rough set model is well suited to distributed computation, discretization causes information loss and thus degrades the quality of the selected features; 2) choose a rough set model suitable for continuous features, mainly the neighborhood rough set or the fuzzy rough set, and parallelize it, for example by hashing, to fit the cloud computing paradigm; although this avoids the information loss caused by discretization, it cannot efficiently handle continuous big data feature selection owing to a limitation of these models, namely that computing neighborhood relations and similarity matrices involves global communication.
Disclosure of Invention
Aiming at the above defects in the prior art, the invention provides a large-scale feature selection method based on a rough hypercube under a cloud platform, which solves the problem of continuous big data feature selection under a cloud platform.
In order to achieve the above purpose, the invention adopts the following technical solution: a large-scale feature selection method based on a rough hypercube under a cloud platform, comprising the following steps:
S1, initializing the weight parameters ω and λ and the expected number d of selected features;
S2, initializing the selected feature set S and the candidate feature subset C;
S3, reading the data set, computing the value range matrix on the cloud platform in a data-parallel, distributed manner, and computing in a distributed manner, according to the value range matrix, the hypercube equivalent partition matrices obtained by decomposing and reconstructing the hypercube equivalent partition matrix of the features;
S4, based on the decomposed and reconstructed hypercube equivalent partition matrices, computing in a data-parallel, distributed manner the relevance between each feature and the decision attribute, adding the most relevant feature to the selected feature set S, and deleting it from the candidate feature subset C;
S5, when $|S| < d$ and $C \ne \emptyset$, entering step S6; otherwise, outputting the feature set S;
S6, based on the decomposed and reconstructed hypercube equivalent partition matrices and in combination with the acceleration method of the cache-update-filter mechanism, computing on the cloud platform in a data-parallel, distributed manner the dependency and average importance of each candidate feature with respect to the selected feature set S, and deleting a candidate feature from the candidate feature subset C if the dependency does not change after that candidate feature is added to the selected feature set S;
S7, computing the metric function value of each candidate feature according to the weight parameters ω and λ, adding the candidate feature with the maximum value to the selected feature set S, and deleting it from the candidate feature subset C.
Further: the value range matrix in step S3 is calculated as follows: given a decision table $\langle U, C \cup D \rangle$, where $U = \{x_1, x_2, \ldots, x_n\}$ denotes a set of $n$ samples, $U_p \subseteq U$ $(1 \le p \le q)$, $U = \bigcup_{p=1}^{q} U_p$, and $U_i \cap U_j = \emptyset$ for $i \ne j$, i.e., the set $U$ consists of $q$ disjoint subsets $U_p$; $C = \{A_1, A_2, \ldots, A_m\}$ denotes the set of $m$ condition features, $D$ denotes the set of decision attributes, and $U/D = \{\beta_1, \beta_2, \ldots, \beta_c\}$ denotes the set of $c$ decision classes; $LU(C) = [(L_{ij}, U_{ij})]_{c \times m}$ denotes the value range matrix, where
$L_{ij} = \min_{x \in \beta_i} f(x, A_j), \qquad U_{ij} = \max_{x \in \beta_i} f(x, A_j)$,
$f(x, A_j)$ being the value of sample $x$ on feature $A_j$; $L_{ij}$ denotes the minimum value on feature $A_j$ among all samples belonging to decision class $\beta_i$, and $U_{ij}$ denotes the corresponding maximum value.
Further: the decomposed and reconstructed hypercube equivalent partition matrix in step S3 is
$H(A_k, U_p) = [h_{ij}(A_k, U_p)]_{c \times |U_p|}$, where $h_{ij}(A_k, U_p) = 1$ if $f(x_j, A_k) \in [L_{ik}, U_{ik}]$ and $h_{ij}(A_k, U_p) = 0$ otherwise;
in the above formula, $H(A_k, U_p)$ is the hypercube equivalent partition matrix of the subset $U_p \subseteq U$ on feature $A_k$ $(A_k \in C)$, and the interval $[L_{ik}, U_{ik}]$ is the value range on feature $A_k$ of all samples belonging to decision class $\beta_i$.
Further: the relevance $J_{relev}(A_k)$ in step S4 is calculated as
$J_{relev}(A_k) = 1 - \frac{1}{|U|} \sum_{p=1}^{q} \sum_{j=1}^{|U_p|} v_j(A_k, U_p)$;
in the above formula, the confusion vector value $v_j(A_k, U_p) = \min\{1, \sum_{i=1}^{c} h_{ij}(A_k, U_p) - 1\}$ indicates whether sample $x_j$ in subset $U_p$ belongs to only one class on feature $A_k$, i.e., to the positive region: a value of 0 indicates that $x_j$ belongs to exactly one class, while a value of 1 indicates that $x_j$ belongs to multiple classes and is a misclassified sample; $u = |U_p|$ denotes the number of samples in subset $U_p$.
Further: the dependency $J_{depen}(S)$ in step S6 is calculated as
$J_{depen}(S) = 1 - \frac{1}{|U|} \sum_{p=1}^{q} \sum_{j=1}^{|U_p|} v_j(S, U_p)$;
in the above formula, $v_j(S, U_p) = \min\{1, \sum_{i=1}^{c} h_{ij}(S, U_p) - 1\}$ is the confusion vector value of sample $x_j$ in subset $U_p$ on the feature set $S$, where $h_{ij}(S, U_p) = \min_{A_k \in S} h_{ij}(A_k, U_p)$ is the element in row $i$ and column $j$ of the hypercube equivalent partition matrix $H(S, U_p) = \bigcap_{A_k \in S} H(A_k, U_p)$ of subset $U_p$ on the feature set $S$;
the average importance $J_{avgsig}(A_k, S)$ is
$J_{avgsig}(A_k, S) = \frac{1}{|S|} \sum_{A_i \in S} J_{sig}(A_k, A_i)$;
in the above formula, $J_{sig}(A_k, A_i) = J_{depen}(\{A_k, A_i\}) - J_{depen}(\{A_i\})$ represents the attribute importance of feature $A_k$ with respect to feature $A_i$, computed from $v_j(\{A_k, A_i\}, U_p)$, the confusion vector value of sample $x_j$ in subset $U_p$ on the feature set $\{A_k, A_i\}$.
Further: the caching-updating-filtering method of the dependency in the step S6 includes: h (S- { A) calculated in the S-th feature selection processs}∪{Ak},Up) The matrix is cached in the cluster by summing it with H (A)s,Up) Calculating J in the s +1 th characteristic selection process by updating corresponding element phase-and modedepen(S∪{Ak}) the desired H (S.U.S. { A ]k},Up) Matrix, and finally filtered out ands(ii) related data;
wherein the subset UpIn the candidate feature
Figure BDA0002859534810000049
Supercube equivalent partitioning matrix H (S { U { A) }k},Up) The calculation formula of (2) is as follows:
H(S∪{Ak},Up)=H(S-{As}∪{Ak},Up)∩H(As,Up)
feature set S { [ A ]kDependency J between } and decision attribute Ddepen(S∪{Ak}) is:
Figure BDA0002859534810000047
the caching-updating-filtering method of the average importance degree comprises the following steps: in the s-th feature selection process, feature A is selectedkIn the subset UpUpper relative feature set S- { AsMean importance of } of
Figure BDA0002859534810000048
Caching the data into a cluster, and passing through a sum characteristic AkRelative to AsDegree of importance of
Figure BDA0002859534810000051
By adding, the feature A can be updatedkIn the subset UpAverage importance of the upper phase with respect to the feature set S | Javgsig(AkS), and finally filtering out the product As(ii) related data;
wherein, characteristic AkAverage importance J with respect to feature set Savgsig(AkAnd S) is as follows:
Figure BDA0002859534810000052
further: a metric function value J (a) of each candidate feature in the step S7kAnd S) is as follows:
J(Ak,S)=ωJrelev(Ak)+λ(1-ω)[Jdepen({Ak}∪S)-Jdepen(S)]+
(1-ω)(1-λ)Javgsig(Ak,S)。
the invention has the beneficial effects that: the method combines the characteristics of cloud platform distributed computing and the advantages of the rough hypercube in processing continuous data, thereby solving the problem of selecting the characteristics of massive continuous data, such as energy, climate and other large-scale data sets, and being applicable to the relevant fields of pattern recognition and machine learning. The invention mainly provides a representation method and a feature measurement standard of a hypercube equivalent partition matrix facing a cloud platform, designs a cache-update-filtering acceleration mechanism and provides a rough hypercube-based big data feature selection method under the cloud platform. The big data feature selection method disclosed by the invention can not only effectively process massive continuous data while ensuring the feature selection quality and eliminating redundant features, but also show good expandability and scalability in the face of clusters and data volumes of different scales.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The following description of the embodiments of the present invention is provided to facilitate understanding by those skilled in the art. It should be understood, however, that the invention is not limited to the scope of these embodiments; to those skilled in the art, various changes are possible without departing from the spirit and scope of the invention as defined by the appended claims, and all matter produced using the inventive concept falls within the scope of protection.
The invention provides an acceleration method based on a cache-update-filter mechanism and a rough-hypercube-based big data feature selection method, aiming to offer a solution to the continuous big data feature selection problem under a cloud platform. The invention is mainly embodied in the following four aspects:
1. Cloud-platform-oriented representation method for the rough hypercube model
The rough hypercube model makes full use of the class information of the samples to granulate data in a supervised manner, and the granular information is represented by the hypercube equivalent partition matrix of each feature. This granulation mechanism and equivalence-matrix representation make the rough hypercube model superior to the Pawlak rough set model, the neighborhood rough set model, and the fuzzy rough set model in both the quality and the efficiency of feature selection. However, the original representation of a feature's hypercube equivalent partition matrix still involves global communication over the sample space and is not suitable for direct construction on a cloud platform. To address this problem, the invention decomposes and reconstructs the representation matrix of each feature and proposes a new representation of the hypercube equivalent partition matrix based on objects or sample subsets. The innovation of this new representation is that it avoids the global communication problem and allows the hypercube equivalent partition matrix of each sample or subset to be computed independently and in parallel on each node or data block of the cloud platform, matching the divide-and-conquer philosophy of cloud computing and thus effectively improving the efficiency of big data processing on a cloud platform.
2. Feature metrics based on the decomposed and reconstructed hypercube equivalent partition matrix
The feature metric is the key to determining the quality of feature selection. On the basis of the hypercube equivalent partition matrix of the features, the rough hypercube model provides three concepts: the relevance between a feature and the decision attribute, the dependency between a feature subset and the decision attribute, and the attribute importance among the selected features; these three are assigned different weights and considered jointly as the feature metric. Big data processed in a cloud environment has not only a large number of samples but also a high sample dimension. Since the feature-wise hypercube equivalent partition matrix is not directly applicable on a cloud platform, the invention takes the decomposed and reconstructed hypercube equivalent partition matrix as the basis and further provides a parallel computation method for the feature metrics (relevance, dependency, and attribute importance) in a distributed environment, which reduces the complexity of evaluating features on big data.
3. A cache-update-filter acceleration mechanism
In a big data environment, for iterative computation the cloud platform provides data persistence methods that cache intermediate data in the memory or on the disks of the cluster, such as the persistence operators in Spark, so as to reduce repeated computation and improve processing speed. The invention analyzes in depth the computation of the dependency and the attribute importance during iterative feature selection and, combined with the persistence methods provided by the cloud platform, proposes a cache-update-filter acceleration mechanism that speeds up big data feature selection; as feature selection proceeds, each successive selection round becomes faster.
4. Rough-hypercube-based big data feature selection method
For big data feature selection under the cloud platform, a cloud-platform-oriented representation of the rough hypercube is provided. Specifically, by decomposing and reconstructing the hypercube equivalent partition matrix of the features and combining the characteristics of cloud platform distributed computing, the invention provides two representation methods of the hypercube equivalent partition matrix, based on samples and on subsets respectively. The method comprises the following steps:
As shown in FIG. 1, the large-scale feature selection method based on a rough hypercube under a cloud platform includes the following steps:
S1, initializing the weight parameters ω and λ and the expected number d of selected features;
S2, initializing the selected feature set S and the candidate feature subset C;
S3, reading the data set, computing the value range matrix on the cloud platform in a data-parallel manner, and decomposing and reconstructing the hypercube equivalent partition matrix of the features according to the value range matrix to obtain the subset-wise hypercube equivalent partition matrices;
the value domain matrix calculation method comprises the following steps: given a decision table<U, C U.D >, wherein U ═ x1,x2,…,xnDenotes a set of n samples,
Figure BDA0002859534810000081
and is
Figure BDA0002859534810000082
The representation set U may consist of q disjoint subsets UiComposition is carried out; c ═ A1,A2,…,AmDenotes a set of m condition features, denotes a set of decision attributes, U/D ═ β12,…,βcRepresents a set of c decision categories; by LU (C) [ (L)ij,Uij)]Represents a value domain matrix, wherein
Figure BDA0002859534810000083
LijIndicates all belongings to decision class βiIs characterized by the feature AjMinimum value ofijIndicates all belongings to decision class βiIs characterized by the feature AjThe maximum value of (c).
The present embodiment assumes that large data sets are stored in the distributed file system HDFS on a Hadoop platform. First, the HDFS files of the data set are read into a Spark cluster; then, under each feature, the feature values of all samples belonging to each decision class are aggregated, and the minimum and maximum values are obtained by comparison, yielding the value range matrix. Finally, the value range matrix data is collected to the Driver node, converted into a two-dimensional array, and broadcast to each computing node of the cluster. This process consists, in order, of flatMap, reduceByKey, collect, and broadcast operations, as sketched below.
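A minimal Spark (Scala) sketch of this stage follows. It is illustrative only and not the patent's code: the HDFS path, the assumed input layout (comma-separated feature values followed by the class label), and identifiers such as ValueRangeMatrix and luBc are assumptions.

```scala
import org.apache.spark.sql.SparkSession

object ValueRangeMatrix {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ValueRangeMatrix").getOrCreate()
    val sc = spark.sparkContext

    // Assumed input: one sample per line, m feature values, class label last.
    val data = sc.textFile("hdfs:///data/dataset.csv").map { line =>
      val parts = line.split(',')
      (parts.last, parts.dropRight(1).map(_.toDouble)) // (class label, features)
    }

    // flatMap: emit ((class, featureIndex), (value, value)) pairs;
    // reduceByKey: keep the running (min, max) per (class, feature) key.
    val luEntries = data
      .flatMap { case (cls, feats) =>
        feats.zipWithIndex.map { case (v, j) => ((cls, j), (v, v)) }
      }
      .reduceByKey { case ((l1, u1), (l2, u2)) =>
        (math.min(l1, l2), math.max(u1, u2))
      }

    // collect: bring the small c x m value range matrix LU(C) to the Driver;
    // broadcast: ship it to every computing node for the next stage.
    val lu = luEntries.collect().toMap
    val luBc = sc.broadcast(lu)
  }
}
```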
The decomposed and reconstructed hypercube equivalent partition matrix is
$H(A_k, U_p) = [h_{ij}(A_k, U_p)]_{c \times |U_p|}$, where $h_{ij}(A_k, U_p) = 1$ if $f(x_j, A_k) \in [L_{ik}, U_{ik}]$ and $h_{ij}(A_k, U_p) = 0$ otherwise.
In the above formula, $H(A_k, U_p)$ is the hypercube equivalent partition matrix of the subset $U_p \subseteq U$ on feature $A_k$ $(A_k \in C)$, and the interval $[L_{ik}, U_{ik}]$ is the value range on feature $A_k$ of all samples belonging to decision class $\beta_i$. The present embodiment adopts the subset-based representation obtained by decomposition and reconstruction. Here the subset $U_p$ can be understood as a partition of an RDD in Spark. According to the broadcast value range matrix, the samples in each partition, i.e., in each subset $U_p$, are converted into the hypercube equivalent partition matrix $H(A_k, U_p)$ of each feature $A_k$ on that subset, and the resulting RDD is cached on the Spark cluster. This process consists, in order, of mapPartitions and persist operations; a sketch follows.
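Continuing the sketch above, the conversion of each partition into per-feature matrices might look as follows; it assumes the broadcast value range matrix was reshaped into a c x m array luRangesBc with luRangesBc.value(i)(k) = (L_ik, U_ik) and decision classes indexed from 0 (all names are illustrative assumptions).

```scala
import org.apache.spark.storage.StorageLevel

// For each RDD partition U_p, build the c x |U_p| matrix H(A_k, U_p) for every
// feature A_k, emitting one record per feature: (k, (partitionId, matrix)).
val hMatrices = data.mapPartitionsWithIndex { case (p, samples) =>
  val lu = luRangesBc.value      // assumed Array[Array[(Double, Double)]], c x m
  val xs = samples.toArray       // materialize the subset U_p
  val c = lu.length
  val m = lu.head.length
  (0 until m).iterator.map { k =>
    val h = Array.tabulate(c, xs.length) { (i, j) =>
      val (lo, hi) = lu(i)(k)
      val v = xs(j)._2(k)        // f(x_j, A_k)
      if (v >= lo && v <= hi) 1 else 0
    }
    (k, (p, h))
  }
}
// persist: cache the matrices for reuse across the iterative selection rounds.
hMatrices.persist(StorageLevel.MEMORY_AND_DISK)
```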
S4, calculating the correlation degree between each feature and the decision attribute in a data parallel mode based on the hypercube equivalent partition matrix, selecting the most relevant feature to add into the selected feature set S, and deleting the feature from the candidate feature subset C;
equivalently dividing matrix H (A) based on hypercubek,Up) The confusion vector of (a) is defined as: v (A)k,Up)=[v1(Ak,Up),v2(Ak,Up),...,vu(Ak,Up)];
Wherein,
Figure BDA0002859534810000094
wherein j is more than or equal to 1 and less than or equal to u. According to the analysis, if the sample xjIn the feature AkBelonging to only one category, i.e. to the lower approximation set, then vj(Ak,Up) Equal to 0. Otherwise, it belongs to the boundary field, vj(Ak,Up) Is 1.
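For example (an illustrative case, not taken from the patent text): with $c = 3$ decision classes, if column $j$ of $H(A_k, U_p)$ is $(1, 0, 0)^\top$, then $\sum_i h_{ij} = 1$ and $v_j = \min\{1, 1 - 1\} = 0$, so $x_j$ falls inside the value range of exactly one class and lies in the positive region; if the column is $(1, 1, 0)^\top$, then $v_j = \min\{1, 2 - 1\} = 1$ and $x_j$ falls in the overlap of two class ranges, i.e., it is a boundary (misclassified) sample.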
The relevance $J_{relev}(A_k)$ is calculated as
$J_{relev}(A_k) = 1 - \frac{1}{|U|} \sum_{p=1}^{q} \sum_{j=1}^{|U_p|} v_j(A_k, U_p)$.
In the above formula, the confusion vector value $v_j(A_k, U_p)$ indicates whether sample $x_j$ in subset $U_p$ belongs to only one class on feature $A_k$, i.e., to the positive region: a value of 0 indicates that $x_j$ belongs to exactly one class, while a value of 1 indicates that $x_j$ belongs to multiple classes and is a misclassified sample; $u = |U_p|$ denotes the number of samples in subset $U_p$.
The above analysis shows that the relevance of feature $A_k$ can be computed from the corresponding hypercube equivalent partition matrix $H(A_k, U_p)$ on each partition. In this embodiment, the partial sum $\sum_{j=1}^{|U_p|} v_j(A_k, U_p)$ is computed for each record of the RDD cached in step S3; then, with the feature index as the key, the partial sums are aggregated to obtain the relevance value of each feature, and the results are collected to the Driver node, where the feature with the maximum relevance is found, deleted from the set C, and added to S. This process consists, in order, of map, reduceByKey, and collect operations; a sketch follows.
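A sketch of this relevance stage, continuing the ones above (hMatrices and data are reused from the previous sketches; everything else is an illustrative assumption):

```scala
// Relevance J_relev(A_k) = 1 - (1/|U|) * (sum over partitions and samples of
// the confusion values), computed from the cached per-partition matrices.
val nTotal = data.count() // |U|

val relevance: Map[Int, Double] = hMatrices
  .map { case (k, (_, h)) =>
    // Confusion value of column j: min(1, sum_i h_ij - 1).
    val conf = (0 until h.head.length)
      .map(j => math.min(1, h.map(_(j)).sum - 1)).sum
    (k, conf.toDouble)
  }
  .reduceByKey(_ + _)                // aggregate partial sums per feature index
  .mapValues(s => 1.0 - s / nTotal)  // J_relev(A_k)
  .collect().toMap

// Greedy first pick: the most relevant feature joins S.
val best = relevance.maxBy(_._2)._1
```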
S5 when | S<d and
Figure BDA0002859534810000107
if so, entering the step S6, otherwise, outputting a feature set S;
S6, based on the hypercube equivalent partition matrices and in combination with the acceleration method of the cache-update-filter mechanism, the dependency and average importance of each candidate feature with respect to the selected feature set S are computed on the cloud platform in a data-parallel manner; if the dependency does not change after a candidate feature is added to the selected feature set S, that candidate feature is deleted from the candidate feature subset C.
The dependency $J_{depen}(S)$ is calculated as
$J_{depen}(S) = 1 - \frac{1}{|U|} \sum_{p=1}^{q} \sum_{j=1}^{|U_p|} v_j(S, U_p)$.
In the above formula, $v_j(S, U_p) = \min\{1, \sum_{i=1}^{c} h_{ij}(S, U_p) - 1\}$ is the confusion vector value of sample $x_j$ in subset $U_p$ on the feature set $S$, where $h_{ij}(S, U_p) = \min_{A_k \in S} h_{ij}(A_k, U_p)$ is the element in row $i$ and column $j$ of the hypercube equivalent partition matrix $H(S, U_p) = \bigcap_{A_k \in S} H(A_k, U_p)$ of subset $U_p$ on the feature set $S$.
The average importance $J_{avgsig}(A_k, S)$ is
$J_{avgsig}(A_k, S) = \frac{1}{|S|} \sum_{A_i \in S} J_{sig}(A_k, A_i)$.
In the above formula, $J_{sig}(A_k, A_i) = J_{depen}(\{A_k, A_i\}) - J_{depen}(\{A_i\})$ represents the attribute importance of feature $A_k$ with respect to feature $A_i$; it is computed from $v_j(\{A_k, A_i\}, U_p)$, the confusion vector value of sample $x_j$ in subset $U_p$ on the feature set $\{A_k, A_i\}$.
The cache-update-filter method for the dependency is as follows: the matrix $H(S - \{A_s\} \cup \{A_k\}, U_p)$ computed in the $s$-th feature selection round is cached in the cluster; the matrix $H(S \cup \{A_k\}, U_p)$ required for computing $J_{depen}(S \cup \{A_k\})$ in the $(s+1)$-th round is obtained by updating it through an element-wise AND with $H(A_s, U_p)$; finally, the data related to $A_s$ is filtered out.
Given the selected feature set $S$, let $A_s \in S$ be the feature selected in the $s$-th feature selection round ($s = |S|$). The hypercube equivalent partition matrix of subset $U_p$ on a candidate feature $A_k \in C - S$ is calculated as
$H(S \cup \{A_k\}, U_p) = H(S - \{A_s\} \cup \{A_k\}, U_p) \cap H(A_s, U_p)$.
The dependency $J_{depen}(S \cup \{A_k\})$ between the feature set $S \cup \{A_k\}$ and the decision attribute $D$ is
$J_{depen}(S \cup \{A_k\}) = 1 - \frac{1}{|U|} \sum_{p=1}^{q} \sum_{j=1}^{|U_p|} v_j(S \cup \{A_k\}, U_p)$.
The cache-update-filter method for the average importance is as follows: in the $s$-th feature selection round, the accumulated importance $\sum_{A_i \in S - \{A_s\}} J_{sig}(A_k, A_i)$ of feature $A_k$ on subset $U_p$ relative to the feature set $S - \{A_s\}$ is cached in the cluster; by adding the importance $J_{sig}(A_k, A_s)$ of feature $A_k$ relative to $A_s$, the average importance $J_{avgsig}(A_k, S)$ of $A_k$ relative to the feature set $S$ can be updated; finally, the data related to $A_s$ is filtered out.
Given the selected feature set $S$ and $A_s \in S$ as above, the average importance $J_{avgsig}(A_k, S)$ of feature $A_k$ with respect to $S$ is
$J_{avgsig}(A_k, S) = \frac{1}{|S|}\Big[\sum_{A_i \in S - \{A_s\}} J_{sig}(A_k, A_i) + J_{sig}(A_k, A_s)\Big]$.
based on the cache data, calculating the dependency and average importance of each candidate feature relative to the selected feature set S, and collecting and storing the candidate features in a Driver node. This process is, in order, mapPartitions, persistence, redeByKey, and collectThe process is described in detail below.
And S7, calculating the metric function value of each candidate feature according to the weight parameters omega and lambda, selecting the candidate feature with the maximum value, adding the candidate feature into the selected feature set S, and deleting the feature from the candidate feature subset C.
The metric function value $J(A_k, S)$ of each candidate feature is
$J(A_k, S) = \omega J_{relev}(A_k) + \lambda(1-\omega)\big[J_{depen}(\{A_k\} \cup S) - J_{depen}(S)\big] + (1-\omega)(1-\lambda) J_{avgsig}(A_k, S)$.
A Driver-side sketch of this greedy selection step follows.
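The sketch below combines the three metrics on the Driver; relevance, depen, avgsig (per-feature maps), depenS = J_depen(S), and the feature count m are assumed from the earlier sketches, while omega, lambda, candidates, and selected are illustrative names:

```scala
import scala.collection.mutable

val (omega, lambda) = (0.5, 0.5)                   // assumed weight parameters
val candidates = mutable.Set[Int]((0 until m): _*) // candidate subset C
val selected = mutable.ArrayBuffer[Int]()          // selected feature set S

// J(A_k, S) = w*J_relev + l(1-w)*[J_depen(S u {A_k}) - J_depen(S)]
//             + (1-w)(1-l)*J_avgsig, with the metric maps assumed computed.
def score(k: Int): Double =
  omega * relevance(k) +
    lambda * (1 - omega) * (depen(k) - depenS) +
    (1 - omega) * (1 - lambda) * avgsig(k)

val next = candidates.maxBy(score)  // candidate with the maximum metric value
selected += next                    // add it to the selected feature set S
candidates -= next                  // delete it from the candidate subset C
```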
the invention discloses a rough hypercube-based big data feature selection method under a cloud platform, which combines the characteristics of cloud platform distributed computing and the advantages of a rough hypercube in processing continuous data, thereby solving the problem of feature selection of massive continuous data, such as large data sets of energy, climate and the like, and being applicable to the fields related to pattern recognition and machine learning. The invention mainly provides a representation method and a feature measurement standard of a hypercube equivalent partition matrix facing a cloud platform, designs a cache-update-filtering acceleration mechanism and provides a rough hypercube-based big data feature selection method under the cloud platform. The big data feature selection method disclosed by the invention can not only effectively process massive continuous data while ensuring the feature selection quality and eliminating redundant features, but also show good expandability and scalability in the face of clusters and data volumes of different scales.

Claims (7)

1. A large-scale feature selection method based on a rough hypercube under a cloud platform, characterized by comprising the following steps:
S1, initializing the weight parameters ω and λ and the expected number d of selected features;
S2, initializing the selected feature set S and the candidate feature subset C;
S3, reading the data set, computing the value range matrix on the cloud platform in a data-parallel, distributed manner, and computing in a distributed manner, according to the value range matrix, the hypercube equivalent partition matrices obtained by decomposing and reconstructing the hypercube equivalent partition matrix of the features;
S4, based on the decomposed and reconstructed hypercube equivalent partition matrices, computing in a data-parallel, distributed manner the relevance between each feature and the decision attribute, adding the most relevant feature to the selected feature set S, and deleting it from the candidate feature subset C;
S5, when $|S| < d$ and $C \ne \emptyset$, entering step S6; otherwise, outputting the feature set S;
S6, based on the decomposed and reconstructed hypercube equivalent partition matrices and in combination with the acceleration method of the cache-update-filter mechanism, computing on the cloud platform in a data-parallel, distributed manner the dependency and average importance of each candidate feature with respect to the selected feature set S, and deleting a candidate feature from the candidate feature subset C if the dependency does not change after that candidate feature is added to the selected feature set S;
S7, computing the metric function value of each candidate feature according to the weight parameters ω and λ, adding the candidate feature with the maximum value to the selected feature set S, and deleting it from the candidate feature subset C.
2. The large-scale feature selection method based on a rough hypercube under a cloud platform according to claim 1, characterized in that the value range matrix in step S3 is calculated as follows: given a decision table $\langle U, C \cup D \rangle$, where $U = \{x_1, x_2, \ldots, x_n\}$ denotes a set of $n$ samples, $U_p \subseteq U$ $(1 \le p \le q)$, $U = \bigcup_{p=1}^{q} U_p$, and $U_i \cap U_j = \emptyset$ for $i \ne j$, i.e., the set $U$ consists of $q$ disjoint subsets $U_p$; $C = \{A_1, A_2, \ldots, A_m\}$ denotes the set of $m$ condition features, $D$ denotes the set of decision attributes, and $U/D = \{\beta_1, \beta_2, \ldots, \beta_c\}$ denotes the set of $c$ decision classes; $LU(C) = [(L_{ij}, U_{ij})]_{c \times m}$ denotes the value range matrix, where
$L_{ij} = \min_{x \in \beta_i} f(x, A_j), \qquad U_{ij} = \max_{x \in \beta_i} f(x, A_j)$,
$f(x, A_j)$ being the value of sample $x$ on feature $A_j$; $L_{ij}$ denotes the minimum value on feature $A_j$ among all samples belonging to decision class $\beta_i$, and $U_{ij}$ denotes the corresponding maximum value.
3. The large-scale feature selection method based on a rough hypercube under a cloud platform according to claim 2, characterized in that the hypercube equivalent partition matrix of the features in step S3 is
$H(A_k) = [h_{ij}(A_k)]_{c \times n}$, where $h_{ij}(A_k) = 1$ if $f(x_j, A_k) \in [L_{ik}, U_{ik}]$ and $h_{ij}(A_k) = 0$ otherwise;
in the above formula, $H(A_k)$ is the hypercube equivalent partition matrix of feature $A_k$, and the interval $[L_{ik}, U_{ik}]$ is the value range on feature $A_k$ of all samples belonging to decision class $\beta_i$;
the cloud-platform-oriented hypercube equivalent partition matrix obtained by matrix decomposition and reconstruction in step S3 is
$H(A_k, U_p) = [h_{ij}(A_k, U_p)]_{c \times |U_p|}$, where $h_{ij}(A_k, U_p) = 1$ if $f(x_j, A_k) \in [L_{ik}, U_{ik}]$ and $h_{ij}(A_k, U_p) = 0$ otherwise;
in the above formula, $H(A_k, U_p)$ is the hypercube equivalent partition matrix of the subset $U_p \subseteq U$ on feature $A_k$ $(A_k \in C)$, and the interval $[L_{ik}, U_{ik}]$ is the value range on feature $A_k$ of all samples belonging to decision class $\beta_i$.
4. The large-scale feature selection method based on a rough hypercube under a cloud platform according to claim 3, characterized in that the relevance $J_{relev}(A_k)$ in step S4 is calculated as
$J_{relev}(A_k) = 1 - \frac{1}{|U|} \sum_{p=1}^{q} \sum_{j=1}^{|U_p|} v_j(A_k, U_p)$;
in the above formula, the confusion vector value $v_j(A_k, U_p) = \min\{1, \sum_{i=1}^{c} h_{ij}(A_k, U_p) - 1\}$ indicates whether sample $x_j$ in subset $U_p$ belongs to only one class on feature $A_k$, i.e., to the positive region: a value of 0 indicates that $x_j$ belongs to exactly one class, and a value of 1 indicates that $x_j$ belongs to multiple classes and is a misclassified sample; $u = |U_p|$ denotes the number of samples in subset $U_p$.
5. The large-scale feature selection method based on a rough hypercube under a cloud platform according to claim 4, characterized in that the dependency $J_{depen}(S)$ in step S6 is calculated as
$J_{depen}(S) = 1 - \frac{1}{|U|} \sum_{p=1}^{q} \sum_{j=1}^{|U_p|} v_j(S, U_p)$;
in the above formula, $v_j(S, U_p) = \min\{1, \sum_{i=1}^{c} h_{ij}(S, U_p) - 1\}$ is the confusion vector value of sample $x_j$ in subset $U_p$ on the feature set $S$, where $h_{ij}(S, U_p) = \min_{A_k \in S} h_{ij}(A_k, U_p)$ is the element in row $i$ and column $j$ of the hypercube equivalent partition matrix $H(S, U_p) = \bigcap_{A_k \in S} H(A_k, U_p)$ of subset $U_p$ on the feature set $S$;
the average importance $J_{avgsig}(A_k, S)$ is
$J_{avgsig}(A_k, S) = \frac{1}{|S|} \sum_{A_i \in S} J_{sig}(A_k, A_i)$;
in the above formula, $J_{sig}(A_k, A_i) = J_{depen}(\{A_k, A_i\}) - J_{depen}(\{A_i\})$ represents the attribute importance of feature $A_k$ with respect to feature $A_i$, computed from $v_j(\{A_k, A_i\}, U_p)$, the confusion vector value of sample $x_j$ in subset $U_p$ on the feature set $\{A_k, A_i\}$.
6. The large-scale feature selection method based on a rough hypercube under a cloud platform according to claim 5, characterized in that the cache-update-filter method for the dependency in step S6 is as follows: the matrix $H(S - \{A_s\} \cup \{A_k\}, U_p)$ computed in the $s$-th feature selection round is cached in the cluster; the matrix $H(S \cup \{A_k\}, U_p)$ required for computing $J_{depen}(S \cup \{A_k\})$ in the $(s+1)$-th round is obtained by updating it through an element-wise AND with $H(A_s, U_p)$; finally, the data related to $A_s$ is filtered out;
wherein the hypercube equivalent partition matrix of subset $U_p$ on a candidate feature $A_k \in C - S$ relative to the selected feature set $S$ is calculated as
$H(S \cup \{A_k\}, U_p) = H(S - \{A_s\} \cup \{A_k\}, U_p) \cap H(A_s, U_p)$;
the dependency $J_{depen}(S \cup \{A_k\})$ between the feature set $S \cup \{A_k\}$ and the decision attribute $D$ is
$J_{depen}(S \cup \{A_k\}) = 1 - \frac{1}{|U|} \sum_{p=1}^{q} \sum_{j=1}^{|U_p|} v_j(S \cup \{A_k\}, U_p)$;
the cache-update-filter method for the average importance is as follows: in the $s$-th feature selection round, the accumulated importance $\sum_{A_i \in S - \{A_s\}} J_{sig}(A_k, A_i)$ of feature $A_k$ on subset $U_p$ relative to the feature set $S - \{A_s\}$ is cached in the cluster; by adding the importance $J_{sig}(A_k, A_s)$ of feature $A_k$ relative to $A_s$, the average importance $J_{avgsig}(A_k, S)$ of $A_k$ relative to the feature set $S$ can be updated; finally, the data related to $A_s$ is filtered out;
wherein the average importance $J_{avgsig}(A_k, S)$ of feature $A_k$ with respect to the feature set $S$ is
$J_{avgsig}(A_k, S) = \frac{1}{|S|}\Big[\sum_{A_i \in S - \{A_s\}} J_{sig}(A_k, A_i) + J_{sig}(A_k, A_s)\Big]$.
7. The large-scale feature selection method based on a rough hypercube under a cloud platform according to claim 6, characterized in that the metric function value $J(A_k, S)$ of each candidate feature in step S7 is
$J(A_k, S) = \omega J_{relev}(A_k) + \lambda(1-\omega)\big[J_{depen}(\{A_k\} \cup S) - J_{depen}(S)\big] + (1-\omega)(1-\lambda) J_{avgsig}(A_k, S)$.
CN202011561665.1A, filed 2020-12-25, priority 2020-12-25: Large-scale feature selection method based on rough hypercube under cloud platform (published as CN112685690A, pending).

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011561665.1A 2020-12-25 2020-12-25 Large-scale feature selection method based on rough hypercube under cloud platform

Publications (1)

Publication Number Publication Date
CN112685690A 2021-04-20

Family

ID=75451880

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011561665.1A Large-scale feature selection method based on rough hypercube under cloud platform 2020-12-25 2020-12-25

Country Status (1)

Country Link
CN: CN112685690A

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117829198A (en) * 2024-01-03 2024-04-05 南通大学 High-dimensional massive parallel attribute reduction method for guiding rough hypercube by informative



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210420)