CN112685690A - Large-scale feature selection method based on rough hypercube under cloud platform - Google Patents


Info

Publication number
CN112685690A
CN112685690A (application CN202011561665.1A)
Authority
CN
China
Prior art keywords
feature
hypercube
matrix
subset
cloud platform
Prior art date
Legal status
Pending
Application number
CN202011561665.1A
Other languages
Chinese (zh)
Inventor
王思朝
罗川
马磊
曹潜
张展云
Current Assignee
Sichuan University
Original Assignee
Sichuan University
Priority date: 2020-12-25
Filing date: 2020-12-25
Publication date: 2021-04-20
Application filed by Sichuan University
Priority to CN202011561665.1A
Publication of CN112685690A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a large-scale feature selection method based on a rough hypercube under a cloud platform. The invention mainly provides a cloud-platform-oriented representation method for the hypercube equivalent partition matrix together with the corresponding feature metrics, designs a cache-update-filter acceleration mechanism, and proposes a rough-hypercube-based large-scale feature selection method under the cloud platform. The disclosed method not only effectively processes massive continuous data while ensuring the quality of feature selection and eliminating redundant features, but also exhibits good scalability for clusters and data volumes of different scales.

Description

Large-scale feature selection method based on rough hypercube under cloud platform
Technical Field
The invention relates to the technical field of large-scale data feature selection, and in particular to a large-scale feature selection method based on a rough hypercube under a cloud platform.
Background
Owing to the rapid development of computer and Internet technologies, both the speed at which data is generated and the scale at which it is stored keep increasing in industries such as the military, finance, and communications. Meanwhile, data is no longer limited to discrete features; continuous features are increasingly common, especially in fields such as energy, weather, and remote sensing. High-dimensional data not only increases computational complexity but also easily causes machine learning algorithms to overfit, thereby degrading their learning performance. Feature selection keeps the performance of the learning model stable while identifying the features relevant to data analysis and eliminating redundant features as much as possible, which is the main reason it is popular in fields such as pattern recognition and machine learning.
Cloud computing, a form of distributed computing, breaks through the resource limitations of a single machine and, by building computer clusters, provides a good solution for large-scale data computation. The common approaches to the continuous large-scale feature selection problem are currently: 1) first discretize the continuous features in the data set, using methods such as equal-width, equal-frequency, or optimized discretization, and then perform feature selection on the discretized data by combining the cloud platform with distributed computing techniques and the Pawlak rough set model; although the equivalence relation in the Pawlak rough set model is well suited to distributed computation, discretization causes information loss and thus degrades the quality of the selected features; 2) choose a rough set model suitable for continuous features, mainly the neighborhood rough set or the fuzzy rough set, and parallelize it, for example by hashing, to fit the cloud computing paradigm; although this avoids the information loss caused by discretization, it cannot efficiently handle continuous big data feature selection owing to a limitation of these models, namely that computing neighborhood relations and similarity matrices involves global communication.
Disclosure of Invention
Aiming at the above defects in the prior art, the invention provides a large-scale feature selection method based on a rough hypercube under a cloud platform, which solves the problem of continuous big data feature selection under a cloud platform.
In order to achieve the above purpose, the invention adopts the following technical solution: a large-scale feature selection method based on a rough hypercube under a cloud platform, comprising the following steps:
S1, initializing the weight parameters ω and λ and the expected number d of selected features;
S2, initializing the selected feature set S and the candidate feature subset C;
S3, reading the data set, computing the value range matrix on the cloud platform in a data-parallel, distributed manner, and computing in a distributed manner, according to the value range matrix, the hypercube equivalent partition matrices obtained by decomposing and reconstructing the hypercube equivalent partition matrix of the features;
S4, based on the decomposed and reconstructed hypercube equivalent partition matrices, computing in a data-parallel, distributed manner the relevance between each feature and the decision attribute, adding the most relevant feature to the selected feature set S, and deleting it from the candidate feature subset C;
S5, when $|S| < d$ and $C \ne \emptyset$, entering step S6; otherwise, outputting the feature set S;
S6, based on the decomposed and reconstructed hypercube equivalent partition matrices and in combination with the acceleration method of the cache-update-filter mechanism, computing on the cloud platform in a data-parallel, distributed manner the dependency and average importance of each candidate feature with respect to the selected feature set S, and deleting a candidate feature from the candidate feature subset C if the dependency does not change after that candidate feature is added to the selected feature set S;
S7, computing the metric function value of each candidate feature according to the weight parameters ω and λ, adding the candidate feature with the maximum value to the selected feature set S, and deleting it from the candidate feature subset C.
Further: the value range matrix in step S3 is calculated as follows: given a decision table $\langle U, C \cup D \rangle$, where $U = \{x_1, x_2, \ldots, x_n\}$ denotes a set of $n$ samples, $U_p \subseteq U$ $(1 \le p \le q)$, $U = \bigcup_{p=1}^{q} U_p$, and $U_i \cap U_j = \emptyset$ for $i \ne j$, i.e., the set $U$ consists of $q$ disjoint subsets $U_p$; $C = \{A_1, A_2, \ldots, A_m\}$ denotes the set of $m$ condition features, $D$ denotes the set of decision attributes, and $U/D = \{\beta_1, \beta_2, \ldots, \beta_c\}$ denotes the set of $c$ decision classes; $LU(C) = [(L_{ij}, U_{ij})]_{c \times m}$ denotes the value range matrix, where
$L_{ij} = \min_{x \in \beta_i} f(x, A_j), \qquad U_{ij} = \max_{x \in \beta_i} f(x, A_j)$,
$f(x, A_j)$ being the value of sample $x$ on feature $A_j$; $L_{ij}$ denotes the minimum value on feature $A_j$ among all samples belonging to decision class $\beta_i$, and $U_{ij}$ denotes the corresponding maximum value.
Further: the decomposed and reconstructed hypercube equivalent partition matrix in step S3 is
$H(A_k, U_p) = [h_{ij}(A_k, U_p)]_{c \times |U_p|}$, where $h_{ij}(A_k, U_p) = 1$ if $f(x_j, A_k) \in [L_{ik}, U_{ik}]$ and $h_{ij}(A_k, U_p) = 0$ otherwise;
in the above formula, $H(A_k, U_p)$ is the hypercube equivalent partition matrix of the subset $U_p \subseteq U$ on feature $A_k$ $(A_k \in C)$, and the interval $[L_{ik}, U_{ik}]$ is the value range on feature $A_k$ of all samples belonging to decision class $\beta_i$.
Further: the relevance $J_{relev}(A_k)$ in step S4 is calculated as
$J_{relev}(A_k) = 1 - \frac{1}{|U|} \sum_{p=1}^{q} \sum_{j=1}^{|U_p|} v_j(A_k, U_p)$;
in the above formula, the confusion vector value $v_j(A_k, U_p) = \min\{1, \sum_{i=1}^{c} h_{ij}(A_k, U_p) - 1\}$ indicates whether sample $x_j$ in subset $U_p$ belongs to only one class on feature $A_k$, i.e., to the positive region: a value of 0 indicates that $x_j$ belongs to exactly one class, while a value of 1 indicates that $x_j$ belongs to multiple classes and is a misclassified sample; $u = |U_p|$ denotes the number of samples in subset $U_p$.
Further: the dependency $J_{depen}(S)$ in step S6 is calculated as
$J_{depen}(S) = 1 - \frac{1}{|U|} \sum_{p=1}^{q} \sum_{j=1}^{|U_p|} v_j(S, U_p)$;
in the above formula, $v_j(S, U_p) = \min\{1, \sum_{i=1}^{c} h_{ij}(S, U_p) - 1\}$ is the confusion vector value of sample $x_j$ in subset $U_p$ on the feature set $S$, where $h_{ij}(S, U_p) = \min_{A_k \in S} h_{ij}(A_k, U_p)$ is the element in row $i$ and column $j$ of the hypercube equivalent partition matrix $H(S, U_p) = \bigcap_{A_k \in S} H(A_k, U_p)$ of subset $U_p$ on the feature set $S$;
the average importance $J_{avgsig}(A_k, S)$ is
$J_{avgsig}(A_k, S) = \frac{1}{|S|} \sum_{A_i \in S} J_{sig}(A_k, A_i)$;
in the above formula, $J_{sig}(A_k, A_i) = J_{depen}(\{A_k, A_i\}) - J_{depen}(\{A_i\})$ represents the attribute importance of feature $A_k$ with respect to feature $A_i$, computed from $v_j(\{A_k, A_i\}, U_p)$, the confusion vector value of sample $x_j$ in subset $U_p$ on the feature set $\{A_k, A_i\}$.
Further: the caching-updating-filtering method of the dependency in the step S6 includes: h (S- { A) calculated in the S-th feature selection processs}∪{Ak},Up) The matrix is cached in the cluster by summing it with H (A)s,Up) Calculating J in the s +1 th characteristic selection process by updating corresponding element phase-and modedepen(S∪{Ak}) the desired H (S.U.S. { A ]k},Up) Matrix, and finally filtered out ands(ii) related data;
wherein the subset UpIn the candidate feature
Figure BDA0002859534810000049
Supercube equivalent partitioning matrix H (S { U { A) }k},Up) The calculation formula of (2) is as follows:
H(S∪{Ak},Up)=H(S-{As}∪{Ak},Up)∩H(As,Up)
feature set S { [ A ]kDependency J between } and decision attribute Ddepen(S∪{Ak}) is:
Figure BDA0002859534810000047
the caching-updating-filtering method of the average importance degree comprises the following steps: in the s-th feature selection process, feature A is selectedkIn the subset UpUpper relative feature set S- { AsMean importance of } of
Figure BDA0002859534810000048
Caching the data into a cluster, and passing through a sum characteristic AkRelative to AsDegree of importance of
Figure BDA0002859534810000051
By adding, the feature A can be updatedkIn the subset UpAverage importance of the upper phase with respect to the feature set S | Javgsig(AkS), and finally filtering out the product As(ii) related data;
wherein, characteristic AkAverage importance J with respect to feature set Savgsig(AkAnd S) is as follows:
Figure BDA0002859534810000052
further: a metric function value J (a) of each candidate feature in the step S7kAnd S) is as follows:
J(Ak,S)=ωJrelev(Ak)+λ(1-ω)[Jdepen({Ak}∪S)-Jdepen(S)]+
(1-ω)(1-λ)Javgsig(Ak,S)。
the invention has the beneficial effects that: the method combines the characteristics of cloud platform distributed computing and the advantages of the rough hypercube in processing continuous data, thereby solving the problem of selecting the characteristics of massive continuous data, such as energy, climate and other large-scale data sets, and being applicable to the relevant fields of pattern recognition and machine learning. The invention mainly provides a representation method and a feature measurement standard of a hypercube equivalent partition matrix facing a cloud platform, designs a cache-update-filtering acceleration mechanism and provides a rough hypercube-based big data feature selection method under the cloud platform. The big data feature selection method disclosed by the invention can not only effectively process massive continuous data while ensuring the feature selection quality and eliminating redundant features, but also show good expandability and scalability in the face of clusters and data volumes of different scales.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The following description of the embodiments of the present invention is provided to facilitate understanding by those skilled in the art. It should be understood, however, that the invention is not limited to the scope of these embodiments; to those skilled in the art, various changes are possible without departing from the spirit and scope of the invention as defined by the appended claims, and all matter produced using the inventive concept falls within the scope of protection.
The invention provides an acceleration method based on a cache-update-filter mechanism and a rough-hypercube-based big data feature selection method, aiming to offer a solution to the continuous big data feature selection problem under a cloud platform. The invention is mainly embodied in the following four aspects:
1. Cloud-platform-oriented representation method for the rough hypercube model
The rough hypercube model makes full use of the class information of the samples to granulate data in a supervised manner, and the granular information is represented by the hypercube equivalent partition matrix of each feature. This granulation mechanism and equivalence-matrix representation make the rough hypercube model superior to the Pawlak rough set model, the neighborhood rough set model, and the fuzzy rough set model in both the quality and the efficiency of feature selection. However, the original representation of a feature's hypercube equivalent partition matrix still involves global communication over the sample space and is not suitable for direct construction on a cloud platform. To address this problem, the invention decomposes and reconstructs the representation matrix of each feature and proposes a new representation of the hypercube equivalent partition matrix based on objects or sample subsets. The innovation of this new representation is that it avoids the global communication problem and allows the hypercube equivalent partition matrix of each sample or subset to be computed independently and in parallel on each node or data block of the cloud platform, matching the divide-and-conquer philosophy of cloud computing and thus effectively improving the efficiency of big data processing on a cloud platform.
2. Feature metrics based on the decomposed and reconstructed hypercube equivalent partition matrix
The feature metric is the key to determining the quality of feature selection. On the basis of the hypercube equivalent partition matrix of the features, the rough hypercube model provides three concepts: the relevance between a feature and the decision attribute, the dependency between a feature subset and the decision attribute, and the attribute importance among the selected features; these three are assigned different weights and considered jointly as the feature metric. Big data processed in a cloud environment has not only a large number of samples but also a high sample dimension. Since the feature-wise hypercube equivalent partition matrix is not directly applicable on a cloud platform, the invention takes the decomposed and reconstructed hypercube equivalent partition matrix as the basis and further provides a parallel computation method for the feature metrics (relevance, dependency, and attribute importance) in a distributed environment, which reduces the complexity of evaluating features on big data.
3. A cache-update-filter acceleration mechanism
In a big data environment, for iterative computation the cloud platform provides data persistence methods that cache intermediate data in the memory or on the disks of the cluster, such as the persistence operators in Spark, so as to reduce repeated computation and improve processing speed. The invention analyzes in depth the computation of the dependency and the attribute importance during iterative feature selection and, combined with the persistence methods provided by the cloud platform, proposes a cache-update-filter acceleration mechanism that speeds up big data feature selection; as feature selection proceeds, each successive selection round becomes faster.
4. Rough-hypercube-based big data feature selection method
For big data feature selection under the cloud platform, a cloud-platform-oriented representation of the rough hypercube is provided. Specifically, by decomposing and reconstructing the hypercube equivalent partition matrix of the features and combining the characteristics of cloud platform distributed computing, the invention provides two representation methods of the hypercube equivalent partition matrix, based on samples and on subsets respectively. The method comprises the following steps:
As shown in FIG. 1, the large-scale feature selection method based on a rough hypercube under a cloud platform includes the following steps:
S1, initializing the weight parameters ω and λ and the expected number d of selected features;
S2, initializing the selected feature set S and the candidate feature subset C;
S3, reading the data set, computing the value range matrix on the cloud platform in a data-parallel manner, and decomposing and reconstructing the hypercube equivalent partition matrix of the features according to the value range matrix to obtain the subset-wise hypercube equivalent partition matrices;
the value domain matrix calculation method comprises the following steps: given a decision table<U, C U.D >, wherein U ═ x1,x2,…,xnDenotes a set of n samples,
Figure BDA0002859534810000081
and is
Figure BDA0002859534810000082
The representation set U may consist of q disjoint subsets UiComposition is carried out; c ═ A1,A2,…,AmDenotes a set of m condition features, denotes a set of decision attributes, U/D ═ β12,…,βcRepresents a set of c decision categories; by LU (C) [ (L)ij,Uij)]Represents a value domain matrix, wherein
Figure BDA0002859534810000083
LijIndicates all belongings to decision class βiIs characterized by the feature AjMinimum value ofijIndicates all belongings to decision class βiIs characterized by the feature AjThe maximum value of (c).
The present embodiment assumes that large data sets are stored in the distributed file system HDFS on a Hadoop platform. First, the HDFS files of the data set are read into a Spark cluster; then, under each feature, the feature values of all samples belonging to each decision class are aggregated, and the minimum and maximum values are obtained by comparison, yielding the value range matrix. Finally, the value range matrix data is collected to the Driver node, converted into a two-dimensional array, and broadcast to each computing node of the cluster. This process consists, in order, of flatMap, reduceByKey, collect, and broadcast operations, as sketched below.
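A minimal Spark (Scala) sketch of this stage follows. It is illustrative only and not the patent's code: the HDFS path, the assumed input layout (comma-separated feature values followed by the class label), and identifiers such as ValueRangeMatrix and luBc are assumptions.

```scala
import org.apache.spark.sql.SparkSession

object ValueRangeMatrix {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ValueRangeMatrix").getOrCreate()
    val sc = spark.sparkContext

    // Assumed input: one sample per line, m feature values, class label last.
    val data = sc.textFile("hdfs:///data/dataset.csv").map { line =>
      val parts = line.split(',')
      (parts.last, parts.dropRight(1).map(_.toDouble)) // (class label, features)
    }

    // flatMap: emit ((class, featureIndex), (value, value)) pairs;
    // reduceByKey: keep the running (min, max) per (class, feature) key.
    val luEntries = data
      .flatMap { case (cls, feats) =>
        feats.zipWithIndex.map { case (v, j) => ((cls, j), (v, v)) }
      }
      .reduceByKey { case ((l1, u1), (l2, u2)) =>
        (math.min(l1, l2), math.max(u1, u2))
      }

    // collect: bring the small c x m value range matrix LU(C) to the Driver;
    // broadcast: ship it to every computing node for the next stage.
    val lu = luEntries.collect().toMap
    val luBc = sc.broadcast(lu)
  }
}
```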
The decomposed and reconstructed hypercube equivalent partition matrix is
$H(A_k, U_p) = [h_{ij}(A_k, U_p)]_{c \times |U_p|}$, where $h_{ij}(A_k, U_p) = 1$ if $f(x_j, A_k) \in [L_{ik}, U_{ik}]$ and $h_{ij}(A_k, U_p) = 0$ otherwise.
In the above formula, $H(A_k, U_p)$ is the hypercube equivalent partition matrix of the subset $U_p \subseteq U$ on feature $A_k$ $(A_k \in C)$, and the interval $[L_{ik}, U_{ik}]$ is the value range on feature $A_k$ of all samples belonging to decision class $\beta_i$. The present embodiment adopts the subset-based representation obtained by decomposition and reconstruction. Here the subset $U_p$ can be understood as a partition of an RDD in Spark. According to the broadcast value range matrix, the samples in each partition, i.e., in each subset $U_p$, are converted into the hypercube equivalent partition matrix $H(A_k, U_p)$ of each feature $A_k$ on that subset, and the resulting RDD is cached on the Spark cluster. This process consists, in order, of mapPartitions and persist operations; a sketch follows.
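Continuing the sketch above, the conversion of each partition into per-feature matrices might look as follows; it assumes the broadcast value range matrix was reshaped into a c x m array luRangesBc with luRangesBc.value(i)(k) = (L_ik, U_ik) and decision classes indexed from 0 (all names are illustrative assumptions).

```scala
import org.apache.spark.storage.StorageLevel

// For each RDD partition U_p, build the c x |U_p| matrix H(A_k, U_p) for every
// feature A_k, emitting one record per feature: (k, (partitionId, matrix)).
val hMatrices = data.mapPartitionsWithIndex { case (p, samples) =>
  val lu = luRangesBc.value      // assumed Array[Array[(Double, Double)]], c x m
  val xs = samples.toArray       // materialize the subset U_p
  val c = lu.length
  val m = lu.head.length
  (0 until m).iterator.map { k =>
    val h = Array.tabulate(c, xs.length) { (i, j) =>
      val (lo, hi) = lu(i)(k)
      val v = xs(j)._2(k)        // f(x_j, A_k)
      if (v >= lo && v <= hi) 1 else 0
    }
    (k, (p, h))
  }
}
// persist: cache the matrices for reuse across the iterative selection rounds.
hMatrices.persist(StorageLevel.MEMORY_AND_DISK)
```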
S4, calculating the correlation degree between each feature and the decision attribute in a data parallel mode based on the hypercube equivalent partition matrix, selecting the most relevant feature to add into the selected feature set S, and deleting the feature from the candidate feature subset C;
equivalently dividing matrix H (A) based on hypercubek,Up) The confusion vector of (a) is defined as: v (A)k,Up)=[v1(Ak,Up),v2(Ak,Up),...,vu(Ak,Up)];
Wherein,
Figure BDA0002859534810000094
wherein j is more than or equal to 1 and less than or equal to u. According to the analysis, if the sample xjIn the feature AkBelonging to only one category, i.e. to the lower approximation set, then vj(Ak,Up) Equal to 0. Otherwise, it belongs to the boundary field, vj(Ak,Up) Is 1.
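For example (an illustrative case, not taken from the patent text): with $c = 3$ decision classes, if column $j$ of $H(A_k, U_p)$ is $(1, 0, 0)^\top$, then $\sum_i h_{ij} = 1$ and $v_j = \min\{1, 1 - 1\} = 0$, so $x_j$ falls inside the value range of exactly one class and lies in the positive region; if the column is $(1, 1, 0)^\top$, then $v_j = \min\{1, 2 - 1\} = 1$ and $x_j$ falls in the overlap of two class ranges, i.e., it is a boundary (misclassified) sample.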
The relevance $J_{relev}(A_k)$ is calculated as
$J_{relev}(A_k) = 1 - \frac{1}{|U|} \sum_{p=1}^{q} \sum_{j=1}^{|U_p|} v_j(A_k, U_p)$.
In the above formula, the confusion vector value $v_j(A_k, U_p)$ indicates whether sample $x_j$ in subset $U_p$ belongs to only one class on feature $A_k$, i.e., to the positive region: a value of 0 indicates that $x_j$ belongs to exactly one class, while a value of 1 indicates that $x_j$ belongs to multiple classes and is a misclassified sample; $u = |U_p|$ denotes the number of samples in subset $U_p$.
The above analysis shows that the relevance of feature $A_k$ can be computed from the corresponding hypercube equivalent partition matrix $H(A_k, U_p)$ on each partition. In this embodiment, the partial sum $\sum_{j=1}^{|U_p|} v_j(A_k, U_p)$ is computed for each record of the RDD cached in step S3; then, with the feature index as the key, the partial sums are aggregated to obtain the relevance value of each feature, and the results are collected to the Driver node, where the feature with the maximum relevance is found, deleted from the set C, and added to S. This process consists, in order, of map, reduceByKey, and collect operations; a sketch follows.
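A sketch of this relevance stage, continuing the ones above (hMatrices and data are reused from the previous sketches; everything else is an illustrative assumption):

```scala
// Relevance J_relev(A_k) = 1 - (1/|U|) * (sum over partitions and samples of
// the confusion values), computed from the cached per-partition matrices.
val nTotal = data.count() // |U|

val relevance: Map[Int, Double] = hMatrices
  .map { case (k, (_, h)) =>
    // Confusion value of column j: min(1, sum_i h_ij - 1).
    val conf = (0 until h.head.length)
      .map(j => math.min(1, h.map(_(j)).sum - 1)).sum
    (k, conf.toDouble)
  }
  .reduceByKey(_ + _)                // aggregate partial sums per feature index
  .mapValues(s => 1.0 - s / nTotal)  // J_relev(A_k)
  .collect().toMap

// Greedy first pick: the most relevant feature joins S.
val best = relevance.maxBy(_._2)._1
```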
S5 when | S<d and
Figure BDA0002859534810000107
if so, entering the step S6, otherwise, outputting a feature set S;
S6, based on the hypercube equivalent partition matrices and in combination with the acceleration method of the cache-update-filter mechanism, the dependency and average importance of each candidate feature with respect to the selected feature set S are computed on the cloud platform in a data-parallel manner; if the dependency does not change after a candidate feature is added to the selected feature set S, that candidate feature is deleted from the candidate feature subset C.
The dependency $J_{depen}(S)$ is calculated as
$J_{depen}(S) = 1 - \frac{1}{|U|} \sum_{p=1}^{q} \sum_{j=1}^{|U_p|} v_j(S, U_p)$.
In the above formula, $v_j(S, U_p) = \min\{1, \sum_{i=1}^{c} h_{ij}(S, U_p) - 1\}$ is the confusion vector value of sample $x_j$ in subset $U_p$ on the feature set $S$, where $h_{ij}(S, U_p) = \min_{A_k \in S} h_{ij}(A_k, U_p)$ is the element in row $i$ and column $j$ of the hypercube equivalent partition matrix $H(S, U_p) = \bigcap_{A_k \in S} H(A_k, U_p)$ of subset $U_p$ on the feature set $S$.
The average importance $J_{avgsig}(A_k, S)$ is
$J_{avgsig}(A_k, S) = \frac{1}{|S|} \sum_{A_i \in S} J_{sig}(A_k, A_i)$.
In the above formula, $J_{sig}(A_k, A_i) = J_{depen}(\{A_k, A_i\}) - J_{depen}(\{A_i\})$ represents the attribute importance of feature $A_k$ with respect to feature $A_i$; it is computed from $v_j(\{A_k, A_i\}, U_p)$, the confusion vector value of sample $x_j$ in subset $U_p$ on the feature set $\{A_k, A_i\}$.
The cache-update-filter method for the dependency is as follows: the matrix $H(S - \{A_s\} \cup \{A_k\}, U_p)$ computed in the $s$-th feature selection round is cached in the cluster; the matrix $H(S \cup \{A_k\}, U_p)$ required for computing $J_{depen}(S \cup \{A_k\})$ in the $(s+1)$-th round is obtained by updating it through an element-wise AND with $H(A_s, U_p)$; finally, the data related to $A_s$ is filtered out.
Given the selected feature set $S$, let $A_s \in S$ be the feature selected in the $s$-th feature selection round ($s = |S|$). The hypercube equivalent partition matrix of subset $U_p$ on a candidate feature $A_k \in C - S$ is calculated as
$H(S \cup \{A_k\}, U_p) = H(S - \{A_s\} \cup \{A_k\}, U_p) \cap H(A_s, U_p)$.
The dependency $J_{depen}(S \cup \{A_k\})$ between the feature set $S \cup \{A_k\}$ and the decision attribute $D$ is
$J_{depen}(S \cup \{A_k\}) = 1 - \frac{1}{|U|} \sum_{p=1}^{q} \sum_{j=1}^{|U_p|} v_j(S \cup \{A_k\}, U_p)$.
The cache-update-filter method for the average importance is as follows: in the $s$-th feature selection round, the accumulated importance $\sum_{A_i \in S - \{A_s\}} J_{sig}(A_k, A_i)$ of feature $A_k$ on subset $U_p$ relative to the feature set $S - \{A_s\}$ is cached in the cluster; by adding the importance $J_{sig}(A_k, A_s)$ of feature $A_k$ relative to $A_s$, the average importance $J_{avgsig}(A_k, S)$ of $A_k$ relative to the feature set $S$ can be updated; finally, the data related to $A_s$ is filtered out.
Given the selected feature set $S$ and $A_s \in S$ as above, the average importance $J_{avgsig}(A_k, S)$ of feature $A_k$ with respect to $S$ is
$J_{avgsig}(A_k, S) = \frac{1}{|S|}\Big[\sum_{A_i \in S - \{A_s\}} J_{sig}(A_k, A_i) + J_{sig}(A_k, A_s)\Big]$.
based on the cache data, calculating the dependency and average importance of each candidate feature relative to the selected feature set S, and collecting and storing the candidate features in a Driver node. This process is, in order, mapPartitions, persistence, redeByKey, and collectThe process is described in detail below.
And S7, calculating the metric function value of each candidate feature according to the weight parameters omega and lambda, selecting the candidate feature with the maximum value, adding the candidate feature into the selected feature set S, and deleting the feature from the candidate feature subset C.
The metric function value $J(A_k, S)$ of each candidate feature is
$J(A_k, S) = \omega J_{relev}(A_k) + \lambda(1-\omega)\big[J_{depen}(\{A_k\} \cup S) - J_{depen}(S)\big] + (1-\omega)(1-\lambda) J_{avgsig}(A_k, S)$.
A Driver-side sketch of this greedy selection step follows.
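The sketch below combines the three metrics on the Driver; relevance, depen, avgsig (per-feature maps), depenS = J_depen(S), and the feature count m are assumed from the earlier sketches, while omega, lambda, candidates, and selected are illustrative names:

```scala
import scala.collection.mutable

val (omega, lambda) = (0.5, 0.5)                   // assumed weight parameters
val candidates = mutable.Set[Int]((0 until m): _*) // candidate subset C
val selected = mutable.ArrayBuffer[Int]()          // selected feature set S

// J(A_k, S) = w*J_relev + l(1-w)*[J_depen(S u {A_k}) - J_depen(S)]
//             + (1-w)(1-l)*J_avgsig, with the metric maps assumed computed.
def score(k: Int): Double =
  omega * relevance(k) +
    lambda * (1 - omega) * (depen(k) - depenS) +
    (1 - omega) * (1 - lambda) * avgsig(k)

val next = candidates.maxBy(score)  // candidate with the maximum metric value
selected += next                    // add it to the selected feature set S
candidates -= next                  // delete it from the candidate subset C
```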
the invention discloses a rough hypercube-based big data feature selection method under a cloud platform, which combines the characteristics of cloud platform distributed computing and the advantages of a rough hypercube in processing continuous data, thereby solving the problem of feature selection of massive continuous data, such as large data sets of energy, climate and the like, and being applicable to the fields related to pattern recognition and machine learning. The invention mainly provides a representation method and a feature measurement standard of a hypercube equivalent partition matrix facing a cloud platform, designs a cache-update-filtering acceleration mechanism and provides a rough hypercube-based big data feature selection method under the cloud platform. The big data feature selection method disclosed by the invention can not only effectively process massive continuous data while ensuring the feature selection quality and eliminating redundant features, but also show good expandability and scalability in the face of clusters and data volumes of different scales.

Claims (7)

1. A large-scale feature selection method based on a rough hypercube under a cloud platform, characterized by comprising the following steps:
S1, initializing the weight parameters ω and λ and the expected number d of selected features;
S2, initializing the selected feature set S and the candidate feature subset C;
S3, reading the data set, computing the value range matrix on the cloud platform in a data-parallel, distributed manner, and computing in a distributed manner, according to the value range matrix, the hypercube equivalent partition matrices obtained by decomposing and reconstructing the hypercube equivalent partition matrix of the features;
S4, based on the decomposed and reconstructed hypercube equivalent partition matrices, computing in a data-parallel, distributed manner the relevance between each feature and the decision attribute, adding the most relevant feature to the selected feature set S, and deleting it from the candidate feature subset C;
S5, when $|S| < d$ and $C \ne \emptyset$, entering step S6; otherwise, outputting the feature set S;
S6, based on the decomposed and reconstructed hypercube equivalent partition matrices and in combination with the acceleration method of the cache-update-filter mechanism, computing on the cloud platform in a data-parallel, distributed manner the dependency and average importance of each candidate feature with respect to the selected feature set S, and deleting a candidate feature from the candidate feature subset C if the dependency does not change after that candidate feature is added to the selected feature set S;
S7, computing the metric function value of each candidate feature according to the weight parameters ω and λ, adding the candidate feature with the maximum value to the selected feature set S, and deleting it from the candidate feature subset C.
2. The large-scale feature selection method based on a rough hypercube under a cloud platform according to claim 1, characterized in that the value range matrix in step S3 is calculated as follows: given a decision table $\langle U, C \cup D \rangle$, where $U = \{x_1, x_2, \ldots, x_n\}$ denotes a set of $n$ samples, $U_p \subseteq U$ $(1 \le p \le q)$, $U = \bigcup_{p=1}^{q} U_p$, and $U_i \cap U_j = \emptyset$ for $i \ne j$, i.e., the set $U$ consists of $q$ disjoint subsets $U_p$; $C = \{A_1, A_2, \ldots, A_m\}$ denotes the set of $m$ condition features, $D$ denotes the set of decision attributes, and $U/D = \{\beta_1, \beta_2, \ldots, \beta_c\}$ denotes the set of $c$ decision classes; $LU(C) = [(L_{ij}, U_{ij})]_{c \times m}$ denotes the value range matrix, where
$L_{ij} = \min_{x \in \beta_i} f(x, A_j), \qquad U_{ij} = \max_{x \in \beta_i} f(x, A_j)$,
$f(x, A_j)$ being the value of sample $x$ on feature $A_j$; $L_{ij}$ denotes the minimum value on feature $A_j$ among all samples belonging to decision class $\beta_i$, and $U_{ij}$ denotes the corresponding maximum value.
3. The large-scale feature selection method based on a rough hypercube under a cloud platform according to claim 2, characterized in that the hypercube equivalent partition matrix of the features in step S3 is
$H(A_k) = [h_{ij}(A_k)]_{c \times n}$, where $h_{ij}(A_k) = 1$ if $f(x_j, A_k) \in [L_{ik}, U_{ik}]$ and $h_{ij}(A_k) = 0$ otherwise;
in the above formula, $H(A_k)$ is the hypercube equivalent partition matrix of feature $A_k$, and the interval $[L_{ik}, U_{ik}]$ is the value range on feature $A_k$ of all samples belonging to decision class $\beta_i$;
the cloud-platform-oriented hypercube equivalent partition matrix obtained by matrix decomposition and reconstruction in step S3 is
$H(A_k, U_p) = [h_{ij}(A_k, U_p)]_{c \times |U_p|}$, where $h_{ij}(A_k, U_p) = 1$ if $f(x_j, A_k) \in [L_{ik}, U_{ik}]$ and $h_{ij}(A_k, U_p) = 0$ otherwise;
in the above formula, $H(A_k, U_p)$ is the hypercube equivalent partition matrix of the subset $U_p \subseteq U$ on feature $A_k$ $(A_k \in C)$, and the interval $[L_{ik}, U_{ik}]$ is the value range on feature $A_k$ of all samples belonging to decision class $\beta_i$.
4. The large-scale feature selection method based on a rough hypercube under a cloud platform according to claim 3, characterized in that the relevance $J_{relev}(A_k)$ in step S4 is calculated as
$J_{relev}(A_k) = 1 - \frac{1}{|U|} \sum_{p=1}^{q} \sum_{j=1}^{|U_p|} v_j(A_k, U_p)$;
in the above formula, the confusion vector value $v_j(A_k, U_p) = \min\{1, \sum_{i=1}^{c} h_{ij}(A_k, U_p) - 1\}$ indicates whether sample $x_j$ in subset $U_p$ belongs to only one class on feature $A_k$, i.e., to the positive region: a value of 0 indicates that $x_j$ belongs to exactly one class, and a value of 1 indicates that $x_j$ belongs to multiple classes and is a misclassified sample; $u = |U_p|$ denotes the number of samples in subset $U_p$.
5. The large-scale feature selection method based on a rough hypercube under a cloud platform according to claim 4, characterized in that the dependency $J_{depen}(S)$ in step S6 is calculated as
$J_{depen}(S) = 1 - \frac{1}{|U|} \sum_{p=1}^{q} \sum_{j=1}^{|U_p|} v_j(S, U_p)$;
in the above formula, $v_j(S, U_p) = \min\{1, \sum_{i=1}^{c} h_{ij}(S, U_p) - 1\}$ is the confusion vector value of sample $x_j$ in subset $U_p$ on the feature set $S$, where $h_{ij}(S, U_p) = \min_{A_k \in S} h_{ij}(A_k, U_p)$ is the element in row $i$ and column $j$ of the hypercube equivalent partition matrix $H(S, U_p) = \bigcap_{A_k \in S} H(A_k, U_p)$ of subset $U_p$ on the feature set $S$;
the average importance $J_{avgsig}(A_k, S)$ is
$J_{avgsig}(A_k, S) = \frac{1}{|S|} \sum_{A_i \in S} J_{sig}(A_k, A_i)$;
in the above formula, $J_{sig}(A_k, A_i) = J_{depen}(\{A_k, A_i\}) - J_{depen}(\{A_i\})$ represents the attribute importance of feature $A_k$ with respect to feature $A_i$, computed from $v_j(\{A_k, A_i\}, U_p)$, the confusion vector value of sample $x_j$ in subset $U_p$ on the feature set $\{A_k, A_i\}$.
6. The large-scale feature selection method based on a rough hypercube under a cloud platform according to claim 5, characterized in that the cache-update-filter method for the dependency in step S6 is as follows: the matrix $H(S - \{A_s\} \cup \{A_k\}, U_p)$ computed in the $s$-th feature selection round is cached in the cluster; the matrix $H(S \cup \{A_k\}, U_p)$ required for computing $J_{depen}(S \cup \{A_k\})$ in the $(s+1)$-th round is obtained by updating it through an element-wise AND with $H(A_s, U_p)$; finally, the data related to $A_s$ is filtered out;
wherein the hypercube equivalent partition matrix of subset $U_p$ on a candidate feature $A_k \in C - S$ relative to the selected feature set $S$ is calculated as
$H(S \cup \{A_k\}, U_p) = H(S - \{A_s\} \cup \{A_k\}, U_p) \cap H(A_s, U_p)$;
the dependency $J_{depen}(S \cup \{A_k\})$ between the feature set $S \cup \{A_k\}$ and the decision attribute $D$ is
$J_{depen}(S \cup \{A_k\}) = 1 - \frac{1}{|U|} \sum_{p=1}^{q} \sum_{j=1}^{|U_p|} v_j(S \cup \{A_k\}, U_p)$;
the cache-update-filter method for the average importance is as follows: in the $s$-th feature selection round, the accumulated importance $\sum_{A_i \in S - \{A_s\}} J_{sig}(A_k, A_i)$ of feature $A_k$ on subset $U_p$ relative to the feature set $S - \{A_s\}$ is cached in the cluster; by adding the importance $J_{sig}(A_k, A_s)$ of feature $A_k$ relative to $A_s$, the average importance $J_{avgsig}(A_k, S)$ of $A_k$ relative to the feature set $S$ can be updated; finally, the data related to $A_s$ is filtered out;
wherein the average importance $J_{avgsig}(A_k, S)$ of feature $A_k$ with respect to the feature set $S$ is
$J_{avgsig}(A_k, S) = \frac{1}{|S|}\Big[\sum_{A_i \in S - \{A_s\}} J_{sig}(A_k, A_i) + J_{sig}(A_k, A_s)\Big]$.
7. The large-scale feature selection method based on a rough hypercube under a cloud platform according to claim 6, characterized in that the metric function value $J(A_k, S)$ of each candidate feature in step S7 is
$J(A_k, S) = \omega J_{relev}(A_k) + \lambda(1-\omega)\big[J_{depen}(\{A_k\} \cup S) - J_{depen}(S)\big] + (1-\omega)(1-\lambda) J_{avgsig}(A_k, S)$.
CN202011561665.1A, filed 2020-12-25, priority 2020-12-25: Large-scale feature selection method based on rough hypercube under cloud platform (published as CN112685690A, pending).

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011561665.1A 2020-12-25 2020-12-25 Large-scale feature selection method based on rough hypercube under cloud platform

Publications (1)

Publication Number Publication Date
CN112685690A 2021-04-20

Family

ID=75451880

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011561665.1A Large-scale feature selection method based on rough hypercube under cloud platform 2020-12-25 2020-12-25

Country Status (1)

Country Link
CN: CN112685690A

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117829198A (en) * 2024-01-03 2024-04-05 南通大学 High-dimensional massive parallel attribute reduction method for guiding rough hypercube by informative



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210420)