CN112685690A - Large-scale feature selection method based on rough hypercube under cloud platform - Google Patents
- Publication number
- CN112685690A (application CN202011561665.1A)
- Authority
- CN
- China
- Prior art keywords
- feature
- hypercube
- matrix
- subset
- cloud platform
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses a large-scale feature selection method based on a rough hypercube under a cloud platform. The invention mainly provides a cloud-platform-oriented representation method and feature measurement criteria for the hypercube equivalent partition matrix, designs a cache-update-filter acceleration mechanism, and on this basis provides a large-scale feature selection method based on the rough hypercube under the cloud platform. The disclosed method can effectively process massive continuous data while ensuring the quality of feature selection and eliminating redundant features, and it also shows good scalability and extensibility for clusters and data volumes of different scales.
Description
Technical Field
The invention relates to the technical field of large-scale data feature selection, in particular to a large-scale feature selection method based on a rough hypercube under a cloud platform.
Background
With the rapid development of computer and internet technologies, both the rate at which data are generated and the scale at which they are stored keep increasing in industries such as the military, finance, and communications. Meanwhile, data are no longer limited to discrete features; continuous features are increasingly common, especially in fields such as energy, weather, and remote sensing. High-dimensional data not only increase the complexity of computation but also easily cause machine learning algorithms to overfit, thereby degrading their learning performance. Feature selection keeps the performance of the learning model stable while identifying the features relevant to data analysis and eliminating redundant features as far as possible, which is the main reason feature selection is popular in fields such as pattern recognition and machine learning.
Cloud computing, a form of distributed computing, breaks through the resource limitations of a single computer and, by building computer clusters, provides a good solution for large-scale data computation. The current common approaches to continuous large-scale feature selection are therefore: 1) first discretize the continuous features in the dataset, using methods such as equal-width discretization, equal-frequency discretization, or optimized discretization, and then perform feature selection on the discretized data by combining the cloud platform with distributed computing techniques and the Pawlak rough set model. Although the equivalence relation in the Pawlak rough set model is well suited to distributed computation, discretization causes information loss and thus degrades the quality of the selected features. 2) Choose a rough set model suited to continuous features, mainly the neighborhood rough set or the fuzzy rough set, and then parallelize the model via hashing and similar methods to fit the cloud computing paradigm. Although this avoids the information loss caused by discretization, the limitations of these models, namely that computing neighborhood relations and similarity matrices involves global communication, prevent them from efficiently handling continuous big data feature selection.
Disclosure of Invention
Aiming at the above defects in the prior art, the large-scale feature selection method based on the rough hypercube under the cloud platform provided by the invention solves the problem of feature selection on large-scale continuous data under a cloud platform.
In order to achieve the purpose of the invention, the invention adopts the following technical scheme: a large-scale feature selection method based on a rough hypercube under a cloud platform, comprising the following steps:

S1, initializing the weight parameters ω and λ and the preset number d of features to be selected;

S2, initializing the selected feature set S and the candidate feature subset C;

S3, reading the data set, computing the value domain matrix on the cloud platform in a data-parallel, distributed manner, and, according to the value domain matrix, computing in a distributed manner the hypercube equivalent partition matrices obtained by decomposing and reconstructing the hypercube equivalent partition matrix of each feature;

S4, based on the decomposed and reconstructed hypercube equivalent partition matrices, computing in a data-parallel, distributed manner the relevance between each feature and the decision attribute, selecting the most relevant feature, adding it to the selected feature set S, and deleting it from the candidate feature subset C;

S6, based on the decomposed and reconstructed hypercube equivalent partition matrices and combined with the acceleration method of the cache-update-filter mechanism, computing on the cloud platform in a data-parallel, distributed manner the dependency and average importance of each candidate feature with respect to the selected feature set S, and, if the dependency does not change after a candidate feature is added to the selected feature set S, deleting that candidate feature from the candidate feature subset C;

S7, computing the metric function value of each candidate feature according to the weight parameters ω and λ, selecting the candidate feature with the maximum value, adding it to the selected feature set S, and deleting it from the candidate feature subset C.
Further: the value domain matrix in step S3 is calculated as follows. Given a decision table $\langle U, C\cup D\rangle$, where $U=\{x_1,x_2,\ldots,x_n\}$ denotes a set of $n$ samples, $U=\bigcup_{p=1}^{q}U_p$ with $U_i\cap U_j=\emptyset$ ($i\neq j$), i.e., the set $U$ consists of $q$ disjoint subsets $U_p$; $C=\{A_1,A_2,\ldots,A_m\}$ denotes a set of $m$ condition features, $D$ denotes the set of decision attributes, and $U/D=\{\beta_1,\beta_2,\ldots,\beta_c\}$ denotes the set of $c$ decision classes; $LU(C)=[(L_{ij},U_{ij})]_{c\times m}$ denotes the value domain matrix, where $L_{ij}$ is the minimum and $U_{ij}$ the maximum value of feature $A_j$ over all samples belonging to decision class $\beta_i$.
Further: the hypercube equivalent partition matrix decomposed and reconstructed in step S3 is:

$$H(A_k,U_p)=\left[h_{ij}(A_k,U_p)\right]_{c\times u},\qquad h_{ij}(A_k,U_p)=\begin{cases}1, & f(x_j,A_k)\in[L_{ik},U_{ik}]\\ 0, & \text{otherwise}\end{cases}$$

In the above formula, $H(A_k,U_p)$ is the hypercube equivalent partition matrix of the subset $U_p\subseteq U$ on feature $A_k$ ($A_k\in C$), $u=|U_p|$, $f(x_j,A_k)$ is the value of sample $x_j$ on $A_k$, and the interval $[L_{ik},U_{ik}]$ is the value range of feature $A_k$ over all samples belonging to decision class $\beta_i$.
Further: the relevance $J_{relev}(A_k)$ in step S4 is calculated as:

$$J_{relev}(A_k)=1-\frac{1}{n}\sum_{p=1}^{q}\sum_{j=1}^{u}v_j(A_k,U_p)$$

In the above formula, the confusion-vector value $v_j(A_k,U_p)$ indicates whether sample $x_j$ in subset $U_p$ belongs to only one class on feature $A_k$, i.e., to the positive region: a value of 0 means $x_j$ belongs to only one class, while a value of 1 means $x_j$ belongs to multiple classes and is a misclassified sample; $u=|U_p|$ is the number of samples in subset $U_p$.
Further: the dependency $J_{depen}(S)$ in step S6 is calculated as:

$$J_{depen}(S)=1-\frac{1}{n}\sum_{p=1}^{q}\sum_{j=1}^{u}v_j(S,U_p)$$

In the above formula, $v_j(S,U_p)=\min\{1,\sum_{i=1}^{c}h_{ij}(S,U_p)-1\}$ is the confusion-vector value of sample $x_j$ in subset $U_p$ on the feature set $S$, where $h_{ij}(S,U_p)$ is the element in row $i$ and column $j$ of the hypercube equivalent partition matrix $H(S,U_p)$ of subset $U_p$ on feature set $S$;

the average importance $J_{avgsig}(A_k,S)$ is calculated as:

$$J_{avgsig}(A_k,S)=\frac{1}{|S|}\sum_{A_i\in S}\sigma_{\{A_k,A_i\}}(A_k),\qquad \sigma_{\{A_k,A_i\}}(A_k)=\frac{1}{n}\sum_{p=1}^{q}\sum_{j=1}^{u}\left[v_j(A_i,U_p)-v_j(\{A_k,A_i\},U_p)\right]$$

In the above formula, $\sigma_{\{A_k,A_i\}}(A_k)$ denotes the attribute importance of feature $A_k$ with respect to feature $A_i$, and $v_j(\{A_k,A_i\},U_p)$ is the confusion-vector value of sample $x_j$ in subset $U_p$ on the feature set $\{A_k,A_i\}$.
Further: the cache-update-filter method for the dependency in step S6 is: the matrices $H(S-\{A_s\}\cup\{A_k\},U_p)$ computed in the $s$-th feature selection round are cached in the cluster; by updating them through an element-wise AND with $H(A_s,U_p)$, the matrices $H(S\cup\{A_k\},U_p)$ needed to compute $J_{depen}(S\cup\{A_k\})$ in the $(s+1)$-th feature selection round are obtained; finally, the data related to $A_s$ are filtered out;

wherein the hypercube equivalent partition matrix $H(S\cup\{A_k\},U_p)$ of subset $U_p$ on the selected feature set $S$ together with a candidate feature $A_k\notin S$ is calculated as:

$$H(S\cup\{A_k\},U_p)=H(S-\{A_s\}\cup\{A_k\},U_p)\cap H(A_s,U_p)$$

and the dependency $J_{depen}(S\cup\{A_k\})$ between the feature set $S\cup\{A_k\}$ and the decision attribute $D$ is:

$$J_{depen}(S\cup\{A_k\})=1-\frac{1}{n}\sum_{p=1}^{q}\sum_{j=1}^{u}v_j(S\cup\{A_k\},U_p)$$

The cache-update-filter method for the average importance is: in the $s$-th feature selection round, the importance $\sum_{A_i\in S-\{A_s\}}\sigma_{\{A_k,A_i\}}(A_k)$ of feature $A_k$ on subset $U_p$ relative to the feature set $S-\{A_s\}$ is cached in the cluster; by adding the importance $\sigma_{\{A_k,A_s\}}(A_k)$ of $A_k$ relative to $A_s$, the average importance $J_{avgsig}(A_k,S)$ of $A_k$ relative to the feature set $S$ is updated; finally, the data related to $A_s$ are filtered out;

wherein the average importance $J_{avgsig}(A_k,S)$ of feature $A_k$ with respect to the feature set $S$ is:

$$J_{avgsig}(A_k,S)=\frac{1}{|S|}\left[\sum_{A_i\in S-\{A_s\}}\sigma_{\{A_k,A_i\}}(A_k)+\sigma_{\{A_k,A_s\}}(A_k)\right]$$
Further: the metric function value $J(A_k,S)$ of each candidate feature in step S7 is:
$$J(A_k,S)=\omega J_{relev}(A_k)+\lambda(1-\omega)\left[J_{depen}(\{A_k\}\cup S)-J_{depen}(S)\right]+(1-\omega)(1-\lambda)J_{avgsig}(A_k,S)$$
The invention has the beneficial effects that: the method combines the characteristics of cloud platform distributed computing with the advantages of the rough hypercube in processing continuous data, thereby solving the problem of feature selection on massive continuous data, such as large-scale energy and climate datasets, and is applicable to related fields such as pattern recognition and machine learning. The invention mainly provides a cloud-platform-oriented representation method and feature measurement criteria for the hypercube equivalent partition matrix, designs a cache-update-filter acceleration mechanism, and provides a rough-hypercube-based big data feature selection method under the cloud platform. The disclosed method can effectively process massive continuous data while ensuring the quality of feature selection and eliminating redundant features, and shows good scalability and extensibility in the face of clusters and data volumes of different scales.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The following description of embodiments of the invention is provided to facilitate understanding by those skilled in the art, but it should be understood that the invention is not limited to the scope of these embodiments. To those skilled in the art, various changes are possible without departing from the spirit and scope of the invention as defined by the appended claims, and everything produced using the inventive concept falls under the protection of the invention.
The invention provides an acceleration method based on a cache-update-filter mechanism and a big data feature selection method based on the rough hypercube, aiming to offer a solution to the problem of continuous big data feature selection under a cloud platform. The invention is mainly embodied in the following four aspects:
1. Cloud-platform-oriented rough hypercube model representation method
The rough hypercube model makes full use of the class information of the samples to granulate the data in a supervised manner, and the granular information is represented by the hypercube equivalent partition matrix of a feature. This granulation mechanism and equivalent-matrix representation make the rough hypercube model superior to the Pawlak, neighborhood, and fuzzy rough set models in both the quality and the efficiency of feature selection. However, the original representation of the feature's hypercube equivalent partition matrix still involves global communication over the sample space and is not suitable for direct construction on a cloud platform. To address this, the invention decomposes and reconstructs the feature's representation matrix and proposes a new representation of the hypercube equivalent partition matrix based on objects or sample subsets. The innovation of the new representation is that it avoids the global communication problem and allows the hypercube equivalent partition matrices of samples or subsets to be computed independently and in parallel on each node or data block of the cloud platform, matching the "divide and conquer" philosophy of the cloud platform and thus effectively improving the efficiency of big data processing.
2. Feature measurement criteria based on the decomposed and reconstructed hypercube equivalent partition matrix
The feature measure is the key to the quality of feature selection. On the basis of the feature's hypercube equivalent partition matrix, the rough hypercube model provides three concepts: the relevance between a feature and the decision attribute, the dependency between a feature subset and the decision attribute, and the attribute importance among the selected features; weighting the three differently yields a comprehensive feature measure. As is well known, big data in a cloud environment have not only a large number of samples but also high dimensionality. Since the original hypercube equivalent partition matrix is not applicable on a cloud platform, the invention takes the decomposed and reconstructed matrix as the basis and further provides a parallelized computation method for the feature measures (relevance, dependency, and attribute importance) in a distributed environment, which reduces the complexity of evaluating features on big data.
3. An acceleration method based on a cache-update-filter mechanism
In a big data environment, for iterative computation the cloud platform provides data persistence methods that cache intermediate data in the memory or disks of the cluster, such as the persist operator in Spark, so as to reduce repeated computation and improve processing speed. The invention analyzes in depth the computation of the dependency and the attribute importance during iterative feature selection and, combined with the persistence methods provided by the cloud platform, proposes a cache-update-filter acceleration mechanism, thereby speeding up big data feature selection; as the selection proceeds, each subsequent feature is selected faster and faster.
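For illustration, the following minimal PySpark sketch shows the cache-update-filter pattern built from Spark's persistence operators; the RDD contents and names here are placeholders, not the patent's actual data structures.

```python
# Minimal sketch of the cache-update-filter pattern with Spark persistence
# operators; the RDD contents are placeholders for illustration only.
from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="CacheUpdateFilterDemo")

# Cache: persist the intermediate result of the current selection round.
cached = sc.parallelize(range(1000)).persist(StorageLevel.MEMORY_AND_DISK)

# Update: derive the next round's intermediate result from the cached one
# instead of recomputing it from the raw input.
updated = cached.map(lambda x: x + 1).persist(StorageLevel.MEMORY_AND_DISK)
updated.count()     # materialize the new cache before releasing the old one

# Filter: release the stale data of the previous round.
cached.unpersist()
```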
4. Rough-hypercube-based big data feature selection method
The big data feature selection method under the cloud platform relies on the cloud-platform-oriented rough hypercube representation: by decomposing and reconstructing the feature's hypercube equivalent partition matrix in combination with the characteristics of cloud platform distributed computing, the invention provides two representations of the hypercube equivalent partition matrix, based respectively on samples and on subsets. The method comprises the following steps:
As shown in FIG. 1, a large-scale feature selection method based on a rough hypercube under a cloud platform comprises the following steps:
S1, initializing the weight parameters ω and λ and the preset number d of features to be selected;

S2, initializing the selected feature set S and the candidate feature subset C;

S3, reading the data set, computing the value domain matrix on the cloud platform in a data-parallel manner, and decomposing and reconstructing the rough hypercube matrix of each feature according to the value domain matrix to obtain the hypercube equivalent partition matrices;
the value domain matrix calculation method comprises the following steps: given a decision table<U, C U.D >, wherein U ═ x1,x2,…,xnDenotes a set of n samples,and isThe representation set U may consist of q disjoint subsets UiComposition is carried out; c ═ A1,A2,…,AmDenotes a set of m condition features, denotes a set of decision attributes, U/D ═ β1,β2,…,βcRepresents a set of c decision categories; by LU (C) [ (L)ij,Uij)]Represents a value domain matrix, whereinLijIndicates all belongings to decision class βiIs characterized by the feature AjMinimum value ofijIndicates all belongings to decision class βiIs characterized by the feature AjThe maximum value of (c).
This embodiment assumes that the large dataset is stored in the distributed file system HDFS on a Hadoop platform. First, the HDFS files of the dataset are read into a Spark cluster; then, for each decision class and each feature, the feature values of all samples are aggregated and compared to obtain the minimum and maximum values, i.e., the value domain matrix. Finally, the value domain matrix data are collected to the Driver node, converted into a two-dimensional array, and broadcast to every computing node of the cluster. This process is, in order, flatMap, reduceByKey, collect, and broadcast.
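Under stated assumptions, a PySpark sketch of this value domain matrix pipeline could look as follows; the CSV record format "f1,f2,...,fm,label" and names such as `lines`, `lu_matrix`, and `lu_bc` are hypothetical, not fixed by the patent.

```python
# Sketch of the value domain matrix computation (step S3): per decision
# class and feature, aggregate the minimum and maximum feature values.
from pyspark import SparkContext

sc = SparkContext(appName="RoughHypercubeFS")
lines = sc.textFile("hdfs:///data/large_dataset.csv")   # hypothetical path

def class_feature_pairs(line):
    *features, label = line.split(",")
    # Emit ((decision class, feature index), (min candidate, max candidate)).
    for j, v in enumerate(features):
        v = float(v)
        yield ((label, j), (v, v))

lu = (lines.flatMap(class_feature_pairs)
           .reduceByKey(lambda a, b: (min(a[0], b[0]), max(a[1], b[1])))
           .collect())                     # only c x m entries: small

lu_matrix = dict(lu)                       # {(beta_i, j): (L_ij, U_ij)}
lu_bc = sc.broadcast(lu_matrix)            # ship to every computing node
```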
The decomposed and reconstructed hypercube equivalent partition matrix is:

$$H(A_k,U_p)=\left[h_{ij}(A_k,U_p)\right]_{c\times u},\qquad h_{ij}(A_k,U_p)=\begin{cases}1, & f(x_j,A_k)\in[L_{ik},U_{ik}]\\ 0, & \text{otherwise}\end{cases}$$

In the above formula, $H(A_k,U_p)$ is the hypercube equivalent partition matrix of the subset $U_p\subseteq U$ on feature $A_k$ ($A_k\in C$), and the interval $[L_{ik},U_{ik}]$ is the value range of feature $A_k$ over all samples belonging to decision class $\beta_i$. This embodiment adopts the subset-based representation obtained by decomposition and reconstruction; the subset $U_p$ can be understood as one partition of an RDD in Spark. Using the broadcast value domain matrix, the samples of each partition, i.e., each subset $U_p$, are converted into the hypercube equivalent partition matrix $H(A_k,U_p)$ of every feature $A_k$ on that subset, and the resulting RDD is cached on the Spark cluster. This process is, in order, mapPartitions and persist.
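A sketch of the per-partition construction of H(A_k, U_p) follows, where one RDD partition plays the role of a subset U_p; it reuses the hypothetical `sc`, `lines`, `lu_matrix`, and `lu_bc` from the previous sketch, and `classes_bc` is likewise an assumed broadcast variable.

```python
# Sketch of building H(A_k, U_p) per partition with mapPartitions + persist.
import numpy as np

classes = sorted({beta for (beta, _) in lu_matrix})   # the c decision classes
classes_bc = sc.broadcast(classes)

def build_H(partition):
    """Yield (k, H(A_k, U_p)) for every feature A_k on this subset U_p,
    with h_ij = 1 iff f(x_j, A_k) lies inside the interval [L_ik, U_ik]."""
    rows = [line.split(",") for line in partition]
    if not rows:
        return
    lu, cls = lu_bc.value, classes_bc.value
    m = len(rows[0]) - 1                              # condition features
    for k in range(m):
        H = np.zeros((len(cls), len(rows)), dtype=np.uint8)
        for j, row in enumerate(rows):
            v = float(row[k])
            for i, beta in enumerate(cls):
                L, U = lu[(beta, k)]
                if L <= v <= U:
                    H[i, j] = 1
        yield (k, H)

H_rdd = lines.mapPartitions(build_H).persist()   # cached for later rounds
```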
S4, calculating the relevance between each feature and the decision attribute in a data-parallel manner based on the hypercube equivalent partition matrices, selecting the most relevant feature to add to the selected feature set S, and deleting the feature from the candidate feature subset C;
The confusion vector based on the hypercube equivalent partition matrix $H(A_k,U_p)$ is defined as $V(A_k,U_p)=[v_1(A_k,U_p),v_2(A_k,U_p),\ldots,v_u(A_k,U_p)]$, where

$$v_j(A_k,U_p)=\min\Big\{1,\sum_{i=1}^{c}h_{ij}(A_k,U_p)-1\Big\},\qquad 1\le j\le u.$$

By this definition, if sample $x_j$ belongs to only one class on feature $A_k$, i.e., to the lower approximation set, then $v_j(A_k,U_p)$ equals 0; otherwise it belongs to the boundary region and $v_j(A_k,U_p)$ is 1.
The relevance $J_{relev}(A_k)$ is calculated as:

$$J_{relev}(A_k)=1-\frac{1}{n}\sum_{p=1}^{q}\sum_{j=1}^{u}v_j(A_k,U_p)$$

In the above formula, the confusion-vector value $v_j(A_k,U_p)$ indicates whether sample $x_j$ in subset $U_p$ belongs to only one class on feature $A_k$, i.e., to the positive region: a value of 0 means $x_j$ belongs to only one class, while a value of 1 means $x_j$ belongs to multiple classes and is a misclassified sample; $u=|U_p|$ is the number of samples in subset $U_p$.
The above analysis shows that the relevance of feature $A_k$ can be computed from the corresponding hypercube equivalent partition matrices $H(A_k,U_p)$ on the partitions. In this embodiment, the partial sum $\sum_{j=1}^{u}v_j(A_k,U_p)$ is calculated for each record of the RDD cached in step S3; then, with the feature index as key, the partial sums are aggregated to obtain, for each feature, the total $\sum_{p=1}^{q}\sum_{j=1}^{u}v_j(A_k,U_p)$. The results are collected to the Driver node, where the relevance values are stored; the feature with the maximum relevance is found, deleted from set C, and added to S. This process is, in order, map, reduceByKey, and collect.
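The following sketch illustrates this relevance step, assuming the hypothetical `H_rdd` of (feature index, H(A_k, U_p)) pairs built above; `n` is the total sample count.

```python
# Sketch of the distributed relevance computation (step S4).
def partial_confusion_sum(pair):
    k, H = pair
    # v_j = min(1, sum_i h_ij - 1): 0 if x_j lies in exactly one class
    # hypercube on A_k (positive region), 1 if it lies in several.
    v = np.minimum(1, H.sum(axis=0).astype(int) - 1)
    return (k, int(v.sum()))

n = lines.count()
partials = (H_rdd.map(partial_confusion_sum)
                 .reduceByKey(lambda a, b: a + b)    # key: feature index
                 .collect())                         # to the Driver node

relev = {k: 1.0 - s / n for k, s in partials}        # J_relev(A_k) per feature
best = max(relev, key=relev.get)                     # most relevant feature
```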
S6, calculating the dependency and average importance of each candidate feature with respect to the selected feature set S in a data-parallel manner on the cloud platform, based on the hypercube equivalent partition matrices and combined with the acceleration method of the cache-update-filter mechanism; if the dependency does not change after a candidate feature is added to the selected feature set S, that candidate feature is deleted from the candidate feature subset C;
The dependency $J_{depen}(S)$ is calculated as:

$$J_{depen}(S)=1-\frac{1}{n}\sum_{p=1}^{q}\sum_{j=1}^{u}v_j(S,U_p)$$

In the above formula, $v_j(S,U_p)=\min\{1,\sum_{i=1}^{c}h_{ij}(S,U_p)-1\}$ is the confusion-vector value of sample $x_j$ in subset $U_p$ on the feature set $S$, where $h_{ij}(S,U_p)$ is the element in row $i$ and column $j$ of the hypercube equivalent partition matrix $H(S,U_p)$ of subset $U_p$ on feature set $S$;
The average importance $J_{avgsig}(A_k,S)$ is calculated as:

$$J_{avgsig}(A_k,S)=\frac{1}{|S|}\sum_{A_i\in S}\sigma_{\{A_k,A_i\}}(A_k),\qquad \sigma_{\{A_k,A_i\}}(A_k)=\frac{1}{n}\sum_{p=1}^{q}\sum_{j=1}^{u}\left[v_j(A_i,U_p)-v_j(\{A_k,A_i\},U_p)\right]$$

In the above formula, $\sigma_{\{A_k,A_i\}}(A_k)$ denotes the attribute importance of feature $A_k$ with respect to feature $A_i$, and $v_j(\{A_k,A_i\},U_p)$ is the confusion-vector value of sample $x_j$ in subset $U_p$ on the feature set $\{A_k,A_i\}$.
The cache-update-filter method for the dependency is: the matrices $H(S-\{A_s\}\cup\{A_k\},U_p)$ computed in the $s$-th feature selection round are cached in the cluster; by updating them through an element-wise AND with $H(A_s,U_p)$, the matrices $H(S\cup\{A_k\},U_p)$ needed to compute $J_{depen}(S\cup\{A_k\})$ in the $(s+1)$-th feature selection round are obtained; finally, the data related to $A_s$ are filtered out.
given a selected feature set S, AsE.s is the feature selected in the S-th feature selection process (S ═ S |), and the subset UpIn the candidate featureSupercube equivalent partitioning matrix H (S { U { A) }k},Up) The calculation formula of (2) is as follows:
H(S∪{Ak},Up)=H(S-{As}∪{Ak},Up)∩H(As,Up)
feature set S { [ A ]kDependency J between } and decision attribute Ddepen(S∪{Ak}) is:
The cache-update-filter method for the average importance is: in the $s$-th feature selection round, the importance $\sum_{A_i\in S-\{A_s\}}\sigma_{\{A_k,A_i\}}(A_k)$ of feature $A_k$ on subset $U_p$ relative to the feature set $S-\{A_s\}$ is cached in the cluster; by adding the importance $\sigma_{\{A_k,A_s\}}(A_k)$ of $A_k$ relative to $A_s$, the average importance $J_{avgsig}(A_k,S)$ of $A_k$ relative to the feature set $S$ is updated; finally, the data related to $A_s$ are filtered out.

Given the selected feature set $S$ with $A_s\in S$ the feature selected in the $s$-th round ($s=|S|$), the average importance $J_{avgsig}(A_k,S)$ of feature $A_k$ with respect to the feature set $S$ is:

$$J_{avgsig}(A_k,S)=\frac{1}{|S|}\left[\sum_{A_i\in S-\{A_s\}}\sigma_{\{A_k,A_i\}}(A_k)+\sigma_{\{A_k,A_s\}}(A_k)\right]$$
based on the cache data, calculating the dependency and average importance of each candidate feature relative to the selected feature set S, and collecting and storing the candidate features in a Driver node. This process is, in order, mapPartitions, persistence, redeByKey, and collectThe process is described in detail below.
S7, calculating the metric function value of each candidate feature according to the weight parameters ω and λ, selecting the candidate feature with the maximum value, adding it to the selected feature set S, and deleting it from the candidate feature subset C.
The metric function value $J(A_k,S)$ of each candidate feature is:

$$J(A_k,S)=\omega J_{relev}(A_k)+\lambda(1-\omega)\left[J_{depen}(\{A_k\}\cup S)-J_{depen}(S)\right]+(1-\omega)(1-\lambda)J_{avgsig}(A_k,S)$$
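A small sketch of this final scoring step on the Driver follows; `relev`, `J_depen`, `J_avgsig`, `J_depen_S`, `candidates`, and `S` are assumed to hold the values gathered in the preceding steps, and the weight values are illustrative.

```python
# Sketch of the metric J(A_k, S) and the greedy pick of step S7.
omega, lam = 0.5, 0.5       # illustrative weights, not fixed by the patent

def metric(k):
    return (omega * relev[k]
            + lam * (1 - omega) * (J_depen[k] - J_depen_S)
            + (1 - omega) * (1 - lam) * J_avgsig[k])

best = max(candidates, key=metric)   # candidate with the maximum J(A_k, S)
S.append(best)                       # add to the selected feature set S
candidates.remove(best)              # delete from the candidate subset C
```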
The invention discloses a rough-hypercube-based big data feature selection method under a cloud platform, which combines the characteristics of cloud platform distributed computing with the advantages of the rough hypercube in processing continuous data, thereby solving the problem of feature selection on massive continuous data, such as large-scale energy and climate datasets, and is applicable to related fields such as pattern recognition and machine learning. The invention mainly provides a cloud-platform-oriented representation method and feature measurement criteria for the hypercube equivalent partition matrix, designs a cache-update-filter acceleration mechanism, and provides a rough-hypercube-based big data feature selection method under the cloud platform. The disclosed method can effectively process massive continuous data while ensuring the quality of feature selection and eliminating redundant features, and shows good scalability and extensibility in the face of clusters and data volumes of different scales.
Claims (7)
1. A large-scale feature selection method based on a rough hypercube under a cloud platform is characterized by comprising the following steps:
S1, initializing the weight parameters ω and λ and the preset number d of features to be selected;

S2, initializing the selected feature set S and the candidate feature subset C;

S3, reading the data set, computing the value domain matrix on the cloud platform in a data-parallel, distributed manner, and, according to the value domain matrix, computing in a distributed manner the hypercube equivalent partition matrices obtained by decomposing and reconstructing the hypercube equivalent partition matrix of each feature;

S4, based on the decomposed and reconstructed hypercube equivalent partition matrices, computing in a data-parallel, distributed manner the relevance between each feature and the decision attribute, selecting the most relevant feature, adding it to the selected feature set S, and deleting it from the candidate feature subset C;

S6, based on the decomposed and reconstructed hypercube equivalent partition matrices and combined with the acceleration method of the cache-update-filter mechanism, computing on the cloud platform in a data-parallel, distributed manner the dependency and average importance of each candidate feature with respect to the selected feature set S, and, if the dependency does not change after a candidate feature is added to the selected feature set S, deleting that candidate feature from the candidate feature subset C;

S7, computing the metric function value of each candidate feature according to the weight parameters ω and λ, selecting the candidate feature with the maximum value, adding it to the selected feature set S, and deleting it from the candidate feature subset C.
2. The method for large-scale feature selection based on the rough hypercube under the cloud platform as claimed in claim 1, wherein the value domain matrix in step S3 is calculated as follows: given a decision table $\langle U, C\cup D\rangle$, where $U=\{x_1,x_2,\ldots,x_n\}$ denotes a set of $n$ samples, $U=\bigcup_{p=1}^{q}U_p$ with $U_i\cap U_j=\emptyset$ ($i\neq j$), i.e., the set $U$ consists of $q$ disjoint subsets $U_p$; $C=\{A_1,A_2,\ldots,A_m\}$ denotes a set of $m$ condition features, $D$ denotes the set of decision attributes, and $U/D=\{\beta_1,\beta_2,\ldots,\beta_c\}$ denotes the set of $c$ decision classes; $LU(C)=[(L_{ij},U_{ij})]_{c\times m}$ denotes the value domain matrix, where $L_{ij}$ is the minimum and $U_{ij}$ the maximum value of feature $A_j$ over all samples belonging to decision class $\beta_i$.
3. The method for large-scale feature selection based on the rough hypercube under the cloud platform as claimed in claim 2, wherein the hypercube equivalent partition matrix of a feature in step S3 is:

$$H(A_k)=\left[h_{ij}(A_k)\right]_{c\times n},\qquad h_{ij}(A_k)=\begin{cases}1, & f(x_j,A_k)\in[L_{ik},U_{ik}]\\ 0, & \text{otherwise}\end{cases}$$

In the above formula, $H(A_k)$ is the hypercube equivalent partition matrix of feature $A_k$, $f(x_j,A_k)$ is the value of sample $x_j$ on $A_k$, and the interval $[L_{ik},U_{ik}]$ is the value range of feature $A_k$ over all samples belonging to decision class $\beta_i$.

The cloud-platform-oriented hypercube equivalent partition matrix obtained by the matrix decomposition and reconstruction in step S3 is:

$$H(A_k,U_p)=\left[h_{ij}(A_k,U_p)\right]_{c\times u},\qquad u=|U_p|,$$

i.e., $H(A_k)$ is recovered by concatenating the subset matrices $H(A_k,U_p)$, $1\le p\le q$, column-wise.
4. The method for large-scale feature selection based on the rough hypercube under the cloud platform as claimed in claim 3, wherein the relevance $J_{relev}(A_k)$ in step S4 is calculated as:

$$J_{relev}(A_k)=1-\frac{1}{n}\sum_{p=1}^{q}\sum_{j=1}^{u}v_j(A_k,U_p)$$

In the above formula, the confusion-vector value $v_j(A_k,U_p)=\min\{1,\sum_{i=1}^{c}h_{ij}(A_k,U_p)-1\}$ indicates whether sample $x_j$ in subset $U_p$ belongs to only one class on feature $A_k$, i.e., to the positive region: a value of 0 means $x_j$ belongs to only one class, while a value of 1 means $x_j$ belongs to multiple classes and is a misclassified sample; $u=|U_p|$ is the number of samples in subset $U_p$.
5. The method for large-scale feature selection based on the rough hypercube under the cloud platform as claimed in claim 4, wherein the dependency $J_{depen}(S)$ in step S6 is calculated as:

$$J_{depen}(S)=1-\frac{1}{n}\sum_{p=1}^{q}\sum_{j=1}^{u}v_j(S,U_p)$$

In the above formula, $v_j(S,U_p)=\min\{1,\sum_{i=1}^{c}h_{ij}(S,U_p)-1\}$ is the confusion-vector value of sample $x_j$ in subset $U_p$ on the feature set $S$, where $h_{ij}(S,U_p)$ is the element in row $i$ and column $j$ of the hypercube equivalent partition matrix $H(S,U_p)$ of subset $U_p$ on feature set $S$;

the average importance $J_{avgsig}(A_k,S)$ is calculated as:

$$J_{avgsig}(A_k,S)=\frac{1}{|S|}\sum_{A_i\in S}\sigma_{\{A_k,A_i\}}(A_k),\qquad \sigma_{\{A_k,A_i\}}(A_k)=\frac{1}{n}\sum_{p=1}^{q}\sum_{j=1}^{u}\left[v_j(A_i,U_p)-v_j(\{A_k,A_i\},U_p)\right]$$

where $\sigma_{\{A_k,A_i\}}(A_k)$ denotes the attribute importance of feature $A_k$ with respect to feature $A_i$, and $v_j(\{A_k,A_i\},U_p)$ is the confusion-vector value of sample $x_j$ in subset $U_p$ on the feature set $\{A_k,A_i\}$.
6. The method for large-scale feature selection based on the rough hypercube under the cloud platform as claimed in claim 5, wherein the cache-update-filter method for the dependency in step S6 is: the matrices $H(S-\{A_s\}\cup\{A_k\},U_p)$ computed in the $s$-th feature selection round are cached in the cluster; by updating them through an element-wise AND with $H(A_s,U_p)$, the matrices $H(S\cup\{A_k\},U_p)$ needed to compute $J_{depen}(S\cup\{A_k\})$ in the $(s+1)$-th feature selection round are obtained; finally, the data related to $A_s$ are filtered out;

wherein the hypercube equivalent partition matrix $H(S\cup\{A_k\},U_p)$ of subset $U_p$ on the selected feature set $S$ together with a candidate feature $A_k\notin S$ is calculated as:

$$H(S\cup\{A_k\},U_p)=H(S-\{A_s\}\cup\{A_k\},U_p)\cap H(A_s,U_p)$$

and the dependency $J_{depen}(S\cup\{A_k\})$ between the feature set $S\cup\{A_k\}$ and the decision attribute $D$ is:

$$J_{depen}(S\cup\{A_k\})=1-\frac{1}{n}\sum_{p=1}^{q}\sum_{j=1}^{u}v_j(S\cup\{A_k\},U_p)$$

the cache-update-filter method for the average importance is: in the $s$-th feature selection round, the importance $\sum_{A_i\in S-\{A_s\}}\sigma_{\{A_k,A_i\}}(A_k)$ of feature $A_k$ on subset $U_p$ relative to the feature set $S-\{A_s\}$ is cached in the cluster; by adding the importance $\sigma_{\{A_k,A_s\}}(A_k)$ of $A_k$ relative to $A_s$, the average importance $J_{avgsig}(A_k,S)$ of $A_k$ relative to the feature set $S$ is updated; finally, the data related to $A_s$ are filtered out;

wherein the average importance $J_{avgsig}(A_k,S)$ of feature $A_k$ with respect to the feature set $S$ is:

$$J_{avgsig}(A_k,S)=\frac{1}{|S|}\left[\sum_{A_i\in S-\{A_s\}}\sigma_{\{A_k,A_i\}}(A_k)+\sigma_{\{A_k,A_s\}}(A_k)\right]$$
7. The method for large-scale feature selection based on the rough hypercube under the cloud platform as claimed in claim 6, wherein the metric function value $J(A_k,S)$ of each candidate feature in step S7 is:

$$J(A_k,S)=\omega J_{relev}(A_k)+\lambda(1-\omega)\left[J_{depen}(\{A_k\}\cup S)-J_{depen}(S)\right]+(1-\omega)(1-\lambda)J_{avgsig}(A_k,S)$$
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202011561665.1A | 2020-12-25 | 2020-12-25 | Large-scale feature selection method based on rough hypercube under cloud platform |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN112685690A | 2021-04-20 |
Family
- ID=75451880
Cited By (1)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117829198A | 2024-01-03 | 2024-04-05 | 南通大学 (Nantong University) | High-dimensional massive parallel attribute reduction method for guiding rough hypercube by informative |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20210420 |