CN113077843A - Distributed column subset selection method and system and leukemia gene information mining method - Google Patents


Info

Publication number
CN113077843A
Authority
CN
China
Prior art keywords
feature
subset
features
selection
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110350013.1A
Other languages
Chinese (zh)
Inventor
肖正 (Xiao Zheng)
魏鹏程 (Wei Pengcheng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Shaodong Intelligent Manufacturing Innovative Institute
Original Assignee
Hunan University
Shaodong Intelligent Manufacturing Innovative Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University and Shaodong Intelligent Manufacturing Innovative Institute
Priority to CN202110350013.1A
Publication of CN113077843A
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 35/00 ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
    • G16B 35/20 Screening of libraries
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H 50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining, for mining of medical data, e.g. analysing previous cases of other patients

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Public Health (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Molecular Biology (AREA)
  • Biochemistry (AREA)
  • Biomedical Technology (AREA)
  • Pathology (AREA)
  • Primary Health Care (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a distributed column subset selection method, which comprises the steps of: acquiring all features in a data set; processing the features and uniformly grouping them onto the computing nodes; executing a subset quality evaluation method on each computing node to obtain the target feature number of the corresponding feature subset; having each computing node perform its own feature selection calculation to obtain the features it selects; and summarizing the feature selection results of all computing nodes to obtain the finally selected features. The invention also discloses a system based on the distributed column subset selection method, and a leukemia gene information mining method based on the method and the system. The invention effectively avoids selecting redundant features within a subset and accelerates the feature selection process; because the selected features are directly summarized as the final selection result, the method achieves at least linear speedup in theory. The method has high accuracy, fast computation and good reliability, and simultaneously obtains gene features and their association with leukemia.

Description

Distributed column subset selection method and system and leukemia gene information mining method
Technical Field
The invention belongs to the field of big data processing, and particularly relates to a distributed column subset selection method and system and a leukemia gene information mining method.
Background
With emerging computing applications such as the Internet of Things, machine learning, computer vision and natural language processing, people frequently encounter high-dimensional data with massive numbers of samples and features. Processing such high-dimensional data requires substantial computing and memory resources, often more than a single machine can provide, and most features in the data may be useless or redundant. Selecting representative features from high-dimensional data and putting them at the service of computer applications has therefore become an urgent problem, and feature selection, as a technique for efficiently selecting representative features from an original feature set, has been a research focus in recent years. Meanwhile, mining the gene information of leukemia patients to obtain gene features and their association with leukemia is an important route toward treating the disease. However, because gene sequences are large in scale, complex in structure, and contain a large amount of redundant feature information, traditional single-machine feature selection algorithms cannot effectively mine the useful information the genes contain.
The Column Subset Selection (CSS) problem is a core sub-problem of feature selection research, and is also a constrained low-rank approximation problem. Specifically, CSS aims at finding a sub-matrix S containing at most k columns (i.e., features) of a matrix A, such that S contains as much of the information in A as possible. In the literature, the reconstruction error rate is typically used to measure this capability; in this context, features and columns are equivalent.
Unlike other low-rank approximation problems such as SVD and PCA, CSS is more flexible, more interpretable and more efficient. However, existing algorithms for the CSS problem are impractical on large-scale data sets. For example, the POCSS algorithm proposed by Feng et al. in 2019 achieves the lowest reconstruction error rate currently known, but requires a very large number of iterations and is therefore very time-consuming. Moreover, with the explosive growth of data volumes, studying distributed CSS algorithms has become particularly important.
The existing distributed CSS algorithms are two-stage algorithms. Specifically, in the distributed setting the goal becomes selecting k features from m feature subsets (m ≥ 2) obtained by partitioning the original data set. A two-stage distributed CSS algorithm first selects k features from each feature subset with a given algorithm, and then, in the second stage, selects k features from the resulting m × k features with the same algorithm as the final output.
This two-stage distributed CSS algorithm has three disadvantages:
1) after partitioning, the subsets contain different numbers of representative features; in some subsets most features are redundant, and features selected from such subsets may adversely affect the feature selection of the next stage, so it is not worthwhile to select k features from every subset;
2) the k most representative features (optimal features) of the data set are divided among different subsets, so each subset contains at most k optimal features and usually fewer; selecting k features from every subset is therefore unnecessary;
3) the two-stage algorithm does not achieve linear speedup in theory, so in practice its computation is very time-consuming and the algorithm is of limited practical use.
Furthermore, the two-stage algorithm assumes that all subsets have the same quality, which is the main cause of the above deficiencies. In practice the quality tends to vary between subsets, and ignoring this variation wastes time and resources and can even affect the final feature selection result. Existing distributed feature selection methods oriented to column subset selection therefore still suffer from low accuracy, slow computation and poor reliability.
Disclosure of Invention
One objective of the present invention is to provide a distributed column subset selection method which, by integrating a subset quality evaluation method into a distributed feature selection framework, avoids selecting redundant features within a subset and accelerates feature selection, with good scalability, fast computation and better reliability.
A second objective of the present invention is to provide a system based on the distributed column subset selection method.
A third objective of the present invention is to provide a leukemia gene information mining method based on the distributed column subset selection method and system.
The distributed column subset selection method disclosed by the invention comprises the following steps:
S1. acquiring all features in a data set;
S2. processing the features in the data set obtained in step S1, then uniformly grouping them onto the computing nodes;
S3. executing a subset quality evaluation method on each computing node, thereby calculating the target feature number of the corresponding feature subset;
S4. according to the target feature number of each computing node obtained in step S3, having each computing node perform its own feature selection calculation, thereby obtaining the features selected by each computing node;
S5. summarizing the feature selection results of the computing nodes obtained in step S4 to obtain the finally selected features.
Step S2 comprises: first converting the data in the data set into a two-dimensional matrix of features and feature values; then deleting features whose values are all empty or whose value variance is 0; then normalizing the remaining features with the L2 norm; and finally establishing grouping labels according to the number of computing nodes in the cluster and randomly assigning a label to each feature, thereby randomly dividing the features into the feature subsets of the different computing nodes.
The L2-norm normalization applied to each feature F is:

$\hat{F} = \dfrac{F}{\|F\|_2}$, with $\|F\|_2 = \sqrt{\textstyle\sum_{t=1}^{n} fv_t^2}$

where $fv_1, fv_2, \ldots, fv_n$ are the values that feature F can take, and $\|F\|_2$ denotes the L2 norm of feature F.
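A minimal sketch of this normalization (the function name and the NumPy column-matrix layout are illustrative; the zero-norm guard is an added assumption, since the patent's pipeline deletes constant features before this point):

```python
import numpy as np

def normalize_columns(A):
    """Scale every feature (column) of A to unit L2 norm.

    Columns whose norm is 0 are left unchanged to avoid division by zero;
    in the patent's pipeline such features are deleted beforehand anyway.
    """
    norms = np.linalg.norm(A, axis=0)          # ||F||_2 for each column
    safe = np.where(norms == 0.0, 1.0, norms)  # guard all-zero columns
    return A / safe

# two samples, two features: one informative column, one all-zero column
A = np.array([[3.0, 0.0],
              [4.0, 0.0]])
A_hat = normalize_columns(A)                   # first column becomes [0.6, 0.8]
```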
The subset quality evaluation method of step S3 measures the quality $SQ_i$ of a feature subset $V_i$ using information entropy. The feature information entropy $H(F)$ measures the amount of information contained in one feature F: the higher $H(F)$, the more information F contains. The feature set entropy is defined as:

$SQ_i = \sum_{j=1}^{N_i} H(F_j) = -\sum_{j=1}^{N_i} \sum_{t} p(fv_t)\log p(fv_t)$

where $N_i$ is the number of features contained in the feature subset $V_i$, $fv_t$ ranges over all possible values of feature $F_j$, and $p(fv_t) = \Pr(F_j = fv_t)$ is the probability mass function. The larger the value of $SQ_i$, the more information the feature subset $V_i$ contains and the more optimal features are likely to be distributed in $V_i$; hence the larger its feature number $k_i$.
In step S4, each computing node performs its own feature selection calculation according to the target feature number obtained in step S3: the higher the quality of a subset, the larger its feature number $k_i$, so that a higher-quality feature subset $V_i$ is assigned a larger number of features $k_i$ to select. The subset qualities $SQ_i$ are sorted in descending order, and the feature numbers $k_i$ of the first m-1 subsets in that order are calculated as

$k_i = \left\lceil k \cdot \dfrac{SQ_i}{\sum_{j=1}^{m} SQ_j} \right\rceil, \quad 1 \le i \le m-1$

where m is the number of computing nodes in the cluster, $\lceil\cdot\rceil$ denotes rounding up, and k is the total target feature number. Having obtained the feature numbers $k_i$ of the first m-1 subsets, the feature number of the last subset in descending order is

$k_m = k - \sum_{i=1}^{m-1} k_i$.
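This allocation rule can be sketched as follows (the function name is hypothetical, and the `min` guard is an added safety assumption so that repeated rounding-up can never exceed the budget k):

```python
import math

def allocate_targets(subset_qualities, k):
    """Split the total target k over the subsets in proportion to SQ_i:
    the first m-1 subsets (in descending quality order) each receive
    ceil(k * SQ_i / sum(SQ)), and the last subset receives the remainder,
    so the k_i always sum exactly to k."""
    m = len(subset_qualities)
    total = sum(subset_qualities)
    order = sorted(range(m), key=lambda i: subset_qualities[i], reverse=True)
    k_per_subset = [0] * m
    assigned = 0
    for rank, i in enumerate(order):
        if rank < m - 1:
            k_i = min(math.ceil(k * subset_qualities[i] / total), k - assigned)
            k_per_subset[i] = k_i
            assigned += k_i
        else:                        # last (lowest-quality) subset: remainder
            k_per_subset[i] = k - assigned
    return k_per_subset

k_per = allocate_targets([4.0, 3.0, 1.0], k=6)   # -> [3, 3, 0]
```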
In step S4, each computing node performs its own feature selection calculation, and does so using the POCSS algorithm.
The invention also discloses a system based on the distributed column subset selection method, comprising an acquisition module, a preprocessing module, an evaluation module, a selection module and an output module, connected in series in that order. The acquisition module acquires all features in the data set. The preprocessing module preprocesses the original data set: it is responsible for cleaning and normalizing the features, uniformly and randomly assigns grouping labels to the processed feature set according to the number of computing nodes in the cluster, and prepares the input for the next module. The evaluation module evaluates the quality of each feature subset and determines the subset's target feature number from that quality. The selection module runs the POCSS algorithm on each computing node according to the feature subset and target feature number, then summarizes the per-node results to obtain the finally selected features. The output module outputs the feature selection result.
The invention also discloses a leukemia gene information mining method based on the distributed column subset selection method and system, comprising the following steps:
B1. give the total feature selection number k;
B2. read the gene data set through the acquisition module and convert it into a two-dimensional matrix A (samples × features) composed of samples and features;
B3. perform feature cleaning and normalization on the matrix obtained in step B2 through the preprocessing module;
B4. form the data cleaned in step B3 into gene subsets $V_i$ according to the number of nodes in the cluster, and distribute them to the nodes;
B5. through the evaluation module, have each node calculate the quality $SQ_i$ of its assigned gene subset using the subset quality evaluation algorithm;
B6. calculate the number of features $k_i$ to be selected from each subset according to the subset qualities and the total target feature number;
B7. according to $k_i$, execute the POCSS algorithm to select $k_i$ features in each gene subset;
B8. summarize the selection results of all nodes to obtain the final k gene expressions most relevant to leukemia.
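Steps B1-B8 can be sketched end to end on a toy matrix. Everything below is illustrative, not the patent's implementation: the data, the function names, the base-2 entropy, and above all the per-node picker, which is a simple highest-entropy-first selection standing in for the POCSS run of step B7:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy H(F), in bits, of one feature's value distribution."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mine_genes(dataset, k, m):
    """Toy walk-through of steps B4-B8.

    dataset: list of (gene_name, values) columns; k: total picks;
    m: number of simulated nodes.
    """
    subsets = [dataset[i::m] for i in range(m)]              # B4: partition
    SQ = [sum(entropy(v) for _, v in s) for s in subsets]    # B5: qualities
    # B6: k_i proportional to SQ_i (ceiling for all but the last subset)
    order = sorted(range(m), key=lambda i: SQ[i], reverse=True)
    k_i, assigned = [0] * m, 0
    for rank, i in enumerate(order):
        k_i[i] = (k - assigned) if rank == m - 1 else min(
            math.ceil(k * SQ[i] / sum(SQ)), k - assigned)
        assigned += k_i[i]
    picks = []                                               # B7: per-node picks
    for s, ki in zip(subsets, k_i):
        ranked = sorted(s, key=lambda col: entropy(col[1]), reverse=True)
        picks.extend(name for name, _ in ranked[:ki])
    return picks                                             # B8: plain union

genes = [("g0", [0, 0, 0, 0]), ("g1", [0, 1, 0, 1]),
         ("g2", [1, 1, 1, 1]), ("g3", [0, 1, 2, 3])]
selected = mine_genes(genes, k=2, m=2)
```

On this toy data the constant columns g0 and g2 contribute no entropy, so the quality-based allocation sends both picks to the node holding g1 and g3.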
In the distributed column subset selection method and system and the leukemia gene information mining method of the invention, the subset quality evaluation method is integrated into the distributed feature selection framework, which effectively avoids selecting redundant features from the subsets and accelerates the feature selection process. Moreover, the features selected by the computing nodes are directly summarized as the final selection result, so the method achieves at least linear speedup in theory. Finally, the invention avoids selecting redundant features within a subset, accelerates feature selection, offers good scalability, high computation speed and better reliability, and can quickly mine the gene features of leukemia gene information and their association with leukemia.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention.
FIG. 2 is a schematic diagram of the system of the present invention.
Detailed Description
The two-stage algorithm assumes that all subsets have the same quality, which is the main cause of the deficiencies of the prior art. In practice, however, quality tends to differ between subsets, and ignoring this difference wastes time and resources and can even affect the final feature selection result. In this application, the number of optimal features a subset contains is used to measure its quality. Specifically, a data set necessarily contains k most representative features, called the k optimal features, and the more optimal features a subset contains, the higher its quality. For the CSS problem, the optimal solution $S_{OPT}$ is defined as the set containing the k optimal features; among all feature combinations, this combination has the strongest ability to fit the original data set.
FIG. 1 is a schematic flow chart of the method of the present invention: the invention provides a distributed column subset selection method, which comprises the following steps:
s1, acquiring all characteristics in a data set;
s2, processing the characteristics in the data set obtained in the step S1, and then uniformly grouping the characteristics to each computing node;
in this embodiment, the data set is obtained from a UCI database provided by the university of california, the european branch, which contains data sets with different scale feature numbers and sample numbers. The method comprises the steps of firstly converting data in a data set into a two-dimensional matrix formed by characteristics and characteristic values, then deleting the characteristics of which the characteristic values are all null and the value variance is 0, then carrying out normalization processing on the rest characteristics by using an L2 norm, and finally establishing grouping labels according to the number of computing nodes in a cluster and randomly distributing the labels for each characteristic.
The L2-norm normalization applied to each feature F is:

$\hat{F} = \dfrac{F}{\|F\|_2}$, with $\|F\|_2 = \sqrt{\textstyle\sum_{t=1}^{n} fv_t^2}$

where $fv_1, fv_2, \ldots, fv_n$ are the values that feature F can take, and $\|F\|_2$ denotes the L2 norm of feature F.
These steps form a distributed feature selection framework oriented to the CSS problem, specifically comprising:
A1. establish and initialize the distributed software and hardware running environment, and preload the resources the feature selection framework depends on, in preparation for running the framework;
A2. acquire all the features carrying the grouping labels assigned in step S2, group them by label into feature subsets $V_i$, where $1 \le i \le m$ and m is the number of computing nodes in the cluster, and distribute the groups to the corresponding computing nodes;
A3. for the feature subset $V_i$ obtained in step A2, run the subset quality evaluation method on each computing node to obtain the subset quality $SQ_i$ of $V_i$.
Specifically, the quality $SQ_i$ of the feature subset $V_i$ is measured using information entropy. The feature information entropy $H(F)$ measures the amount of information contained in one feature F: the higher $H(F)$, the more information F contains. The feature set entropy is defined as:

$SQ_i = \sum_{j=1}^{N_i} H(F_j) = -\sum_{j=1}^{N_i} \sum_{t} p(fv_t)\log p(fv_t)$

where $N_i$ is the number of features contained in the feature subset $V_i$, $fv_t$ ranges over all possible values of feature $F_j$, and $p(fv_t) = \Pr(F_j = fv_t)$ is the probability mass function. The larger $SQ_i$, the more information $V_i$ contains and the more optimal features are distributed in $V_i$; hence the larger $k_i$.
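The quality $SQ_i$ can be computed directly from the empirical value distribution of each feature; a short sketch (function names illustrative, log base 2 assumed):

```python
import math
from collections import Counter

def feature_entropy(values):
    """H(F): Shannon entropy of one feature's empirical value distribution."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def subset_quality(subset):
    """SQ_i: the sum of the entropies of all features in the subset V_i."""
    return sum(feature_entropy(col) for col in subset)

# a two-feature subset: one balanced binary feature (H = 1 bit), one constant
V_i = [[0, 1, 0, 1],
       [5, 5, 5, 5]]
SQ_i = subset_quality(V_i)     # 1.0 + 0.0 = 1.0
```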
This step has two advantages: first, the entropy-based subset quality evaluation method is computationally efficient and provides a fast assessment; second, the subset quality index is flexible and pluggable, and can be chosen according to the preferences of different users and applications.
A4. summarize the subset qualities $SQ_i$ of the feature subsets assigned to the computing nodes; according to the target feature number k and the quality of each feature subset $V_i$, determine the number $k_i$ of features to be selected from each subset, and distribute these numbers to the corresponding computing nodes.
The number $k_i$ is determined from the quality of each feature subset $V_i$ as follows: the higher the quality, the larger $k_i$, so that a high-quality feature subset $V_i$ is assigned a larger number of features $k_i$. The subset qualities $SQ_i$ are sorted in descending order, and the feature numbers $k_i$ of the first m-1 subsets in that order are calculated as

$k_i = \left\lceil k \cdot \dfrac{SQ_i}{\sum_{j=1}^{m} SQ_j} \right\rceil, \quad 1 \le i \le m-1$

where m is the number of computing nodes in the cluster and $\lceil\cdot\rceil$ denotes rounding up. Having obtained the feature numbers $k_i$ of the first m-1 subsets, the feature number of the last subset in descending order is

$k_m = k - \sum_{i=1}^{m-1} k_i$.
The advantage of this step is that, because the features are randomly assigned grouping labels in step S2, the optimal features contained in $S_{OPT}$ are scattered across the subsets, i.e., $k_i$ is usually much smaller than k. Compared with the two-stage CSS algorithm, which selects k features from every subset, the method of the invention only needs to select $k_i$ features, which effectively shortens the running time of the algorithm.
A5. each computing node, given the feature number $k_i$ obtained in step A4, runs the single-machine CSS algorithm on its feature subset $V_i$ to select $k_i$ features, forming a feature set $S_i$.
The single-machine CSS algorithm is specifically the POCSS algorithm, which solves the CSS problem using the idea of Pareto optimization. POCSS needs $2ek^2N$ iterations to guarantee that the obtained solution S is no worse than the lower bound $(1 - e^{-\gamma}) \cdot OPT$, the highest lower bound currently known, where e is Euler's number, $e \approx 2.71828$; k is the number of features to select; N is the total number of features in the data set; OPT is the objective value of the optimal solution, i.e., $f(S_{OPT})$; and $\gamma$ is the submodularity ratio. Compared with other CSS algorithms, POCSS achieves the lowest reconstruction error rate.
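POCSS itself maintains a Pareto archive and is not reproduced here; as a stand-in that only illustrates the objective a per-node solver optimizes, an entirely illustrative greedy column picker might look like this:

```python
import numpy as np

def greedy_css(A, k):
    """Greedy stand-in for a single-node CSS solver: at each round, add the
    column that most reduces the reconstruction error ||A - S S^+ A||_F^2.
    (POCSS reaches a stronger guarantee via Pareto-optimization search.)"""
    n = A.shape[1]
    chosen = []
    for _ in range(min(k, n)):
        best_j, best_err = None, float("inf")
        for j in range(n):
            if j in chosen:
                continue
            S = A[:, chosen + [j]]
            err = np.linalg.norm(A - S @ np.linalg.pinv(S) @ A, "fro") ** 2
            if err < best_err:
                best_j, best_err = j, err
        chosen.append(best_j)
    return chosen

A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
cols = greedy_css(A, 2)        # column 2 is picked first (largest error drop)
S = A[:, cols]
```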
The purpose of this step is to find, through the subset quality evaluation method, a reasonable target feature number $k_i$ for each feature subset $V_i$. Prior distributed feature selection algorithms select k features directly from each $V_i$, yet in practice each $V_i$ contains only some of the optimal features of $S_{OPT}$. If the number of optimal features of $S_{OPT}$ falling in each $V_i$, denoted $k_i^{*}$, were known, only $k_i^{*}$ features would need to be selected from $V_i$. This would greatly reduce the running time of the single-machine feature selection algorithm, especially for algorithms sensitive to the target selection number: POCSS needs $2ek^2N$ iterations to obtain a good feature selection result, so its iteration count grows quadratically in k and its running time rises sharply with it. However, obtaining $S_{OPT}$ and $k_i^{*}$ exactly is very difficult in general, so a heuristic can only find a relatively reasonable estimate of $k_i^{*}$ for each $V_i$, namely $k_i$. The invention therefore proposes the subset quality evaluation method: the higher the quality of $V_i$, i.e., the larger $SQ_i$, the more optimal features may be distributed in it, and the larger the corresponding $k_i$.
Compared with the single-machine POCSS algorithm, the improved distributed POCSS algorithm accelerates as follows.
For the CSS problem in the distributed scenario, the feature set is divided into m subsets $\{V_1, V_2, \ldots, V_m\}$ according to the number m of computing nodes in the cluster, and the goal is to select k features from these subsets to obtain a solution S that minimizes the objective function f(S). In the ideal case all feature subsets have equal quality, $SQ_1 = SQ_2 = \cdots = SQ_m$, so by the subset quality evaluation algorithm $k_i = k/m$ for every $V_i$. For POCSS, the number of iterations on subset $V_i$ is then $2e \cdot \frac{k}{m} \cdot \frac{k}{m} \cdot \frac{N}{m} = \frac{2ek^2N}{m^3}$ (the single-machine POCSS needs $2ek^2N$ iterations): the first k/m is the number of features to select in $V_i$, i.e., $k_i$; the second k/m is the size of the set POCSS uses to store selected features, at most k/m because $k_i = k/m$; and N/m is the size of each of the m equal subsets of the N total features. Ideally, therefore, applying the invention to POCSS yields an $m^3$-fold speedup, where m is the number of computing nodes in the cluster. In the worst case, one feature subset $V_i$ has far higher quality than the others, and the subset quality evaluation algorithm assigns all k features to $V_i$; the POCSS iteration count is then $2ek^2 \cdot \frac{N}{m}$, i.e., the invention still achieves linear speedup over the single-machine POCSS in theory. In summary, the theoretical speedup of the modified POCSS over the original lies between m and $m^3$.
This step has two advantages. First, even in the worst case the modified POCSS theoretically achieves linear speedup; since that worst case is extremely rare under uniform grouping, the speedup in practice is usually higher than linear, up to $m^3$-fold. Second, the feature selection algorithm in this step is pluggable: virtually any feature selection algorithm that runs on a single computing node can be integrated into the invention and accelerated by distributed computation.
A6. aggregate the feature sets $S_i$ obtained in step A5 into the final feature selection result S:

$S = \bigcup_{i=1}^{m} S_i$.
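Step A6 amounts to a plain union with no second selection stage; a trivial sketch (the gene names are made up):

```python
def aggregate(selected_per_node):
    """A6: the final result S is the union of the per-node feature sets S_i;
    no further CSS pass is run over the pooled selections."""
    final = []
    for S_i in selected_per_node:
        final.extend(S_i)
    return final

S = aggregate([["gene_12", "gene_87"], ["gene_301"], []])
```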
the step has the advantages that the selection results of all the subsets are directly collected and merged to be the final output result, the method is simple and efficient, the method is different from the existing two-stage CSS algorithm, and the final selection result can be obtained only by operating the single CSS algorithm again on the selection results of the collected subsets.
S3, executing a subset quality evaluation method on each computing node, and thus calculating to obtain the target feature number of the corresponding feature subset;
The subset quality evaluation method measures the quality $SQ_i$ of the feature subset $V_i$ using information entropy. The feature information entropy $H(F)$ measures the amount of information contained in one feature F: the higher $H(F)$, the more information F contains. The feature set entropy is defined as:

$SQ_i = \sum_{j=1}^{N_i} H(F_j) = -\sum_{j=1}^{N_i} \sum_{t} p(fv_t)\log p(fv_t)$

where $N_i$ is the number of features contained in $V_i$, $fv_t$ ranges over all possible values of feature $F_j$, and $p(fv_t) = \Pr(F_j = fv_t)$ is the probability mass function. The larger $SQ_i$, the more information $V_i$ contains and the more optimal features are distributed in $V_i$; hence the larger the feature number $k_i$.
s4, according to the feature subset target feature number of each computing node obtained in the step S3, each computing node performs respective feature selection calculation, and therefore the features selected by each computing node are obtained;
s5, summarizing the feature selection calculation results of the calculation nodes obtained in the step S4 to obtain the finally selected features
The CSS problem is a constrained low-rank approximation problem aimed at fitting an original matrix A by S, where S is a matrix composed of columns (features) selected from A, and the strength of the fitting ability of S is determined by A and SS+The variance of A means that the smaller the variance, the stronger the fitting ability.
The mathematical definition of the CSS problem is: given a matrix $A \in \mathbb{R}^{d \times n}$ and a positive integer $k \le n$, find a sub-matrix $S$ containing at most $k$ columns of $A$ such that

$$S = \mathop{\arg\min}_{|S| \le k}\ \left\| A - SS^{+}A \right\|_F^2$$

where $|S|$ is the number of columns of the matrix $S$, $S^{+}$ denotes the Moore-Penrose generalized inverse of $S$, and $\|\cdot\|_F$ is the Frobenius norm of a matrix.
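A minimal sketch of evaluating the CSS objective above with NumPy's pseudo-inverse (the function name `css_residual` is our own, not from the patent):

```python
import numpy as np

def css_residual(A, cols):
    """||A - S S^+ A||_F for the sub-matrix S formed by the given columns
    of A; a smaller residual means the selected columns fit A better."""
    S = A[:, cols]
    P = S @ np.linalg.pinv(S)      # projection onto the column space of S
    return np.linalg.norm(A - P @ A, 'fro')
```

Selecting all columns of a full-column-rank matrix drives the residual to zero, since S then spans the whole column space of A.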
The basic idea of the invention is, for the CSS problem, to integrate a subset quality evaluation method into a distributed feature selection framework to accelerate the feature selection process. First, the heuristic subset quality evaluation method determines the number of features to select in each subset, which effectively avoids selecting redundant features within a subset and speeds up feature selection. Second, because the CSS algorithm runs after the subset quality evaluation, the features selected from each subset can simply be aggregated to give the final selection result directly, so the method can in theory achieve at least linear speedup. Finally, depending on the preferences of users and applications, both the single-machine feature selection algorithm and the subset quality evaluation index can be swapped freely, yielding a flexible, pluggable distributed feature selection framework.
In existing experiments, the single-machine POCSS feature selection algorithm was embedded into the distributed feature selection framework and tested on small, medium and large data sets. The speedup on all data sets improved markedly, with the highest speedup reaching 3788; under ideal conditions the theoretical speedup reaches m³, where m is the number of computing nodes in the cluster. The main reason is that the number of features selected in each subset, as determined by the subset quality evaluation method, is usually less than k, which greatly reduces the number of iterations the POCSS algorithm requires and shortens the running time.
Fig. 2 is a schematic structural diagram of the system of the present invention. The invention also provides a system implementing the above distributed column subset selection method, comprising an acquisition module, a preprocessing module, an evaluation module, a selection module and an output module, connected in series in that order. The acquisition module acquires all the features in the data set. The preprocessing module preprocesses the original data set: it is responsible for cleaning and normalizing the features, uniformly and randomly assigns grouping labels to the processed feature set according to the number of computing nodes in the cluster, and prepares the input for the next module. The evaluation module evaluates the quality of each feature subset and, from each subset's quality, finds a reasonable number of target features to select for that subset. The selection module runs the single-machine CSS algorithm on each computing node according to the feature subsets and target feature numbers obtained upstream, then summarizes the per-node results to obtain the finally selected features. The output module outputs the feature selection result.
The following uses an embodiment to illustrate the advantages of the present invention.
In terms of hardware, 8 computing nodes are used, each equipped with a Xeon Gold 5118 CPU and 12 GB of memory. In terms of software, each node runs CentOS 7.7.1908 with the Hadoop 3.1.2 and Spark 2.4.5 distributed computing platforms; the method is implemented in Python 3.6.8, and the number of target features for the CSS problem is set to k = 50.
To demonstrate the effectiveness of the method and the improvement in acceleration, tests were run on several data sets, comparing the running time of the POCSS algorithm modified by the method against that of the single-machine POCSS algorithm. The speedup is computed as:

$$\text{Speedup} = \frac{T_{POCSS}}{T_f}$$

where $T_f$ is the feature selection running time of the invention, Speedup is the running-time speedup of the invention relative to the single-machine POCSS algorithm, and $T_{POCSS}$ is the running time of the single-machine POCSS algorithm.
The acceleration ratio evaluation results are as follows:
TABLE 1
[Table 1: per-data-set speedup evaluation results — original table image not recoverable]
From the speedup results in Table 1 above, it can be seen that because the invention selects only k_i features on each feature subset, its acceleration effect when applied to the POCSS algorithm is very pronounced. As the data set grows and the total number of features keeps increasing, the acceleration becomes ever more evident: the highest speedup is 447 on the Scene data set and 3788 on the sEMG data set.
The acceleration from applying the invention is remarkable for three reasons. First, because fewer features are selected on each feature subset, the modified POCSS algorithm theoretically speeds up by more than m and by up to m³; in the actual experiments, the subsets of the sEMG data set have approximately equal quality after partitioning, so nearly m³-fold acceleration is reached. Second, as the number of computing nodes increases, the feature subsets become smaller and smaller, the two-dimensional matrices they form shrink accordingly, and the time spent on matrix computation drops. Finally, data sets differ in sparsity: computation on a sparse matrix takes noticeably longer than on a dense one, and sEMG is sparse.
The invention can be widely applied to fields such as biological information mining and fast image compression; biological information mining is taken as the example below.
The invention takes a published leukemia data set as an example. The data set contains gene expression corresponding to acute lymphoblastic leukemia (ALL) and acute myelogenous leukemia (AML) samples from bone marrow and peripheral blood. It consists of 72 samples: 49 ALL samples and 23 AML samples, with the expression of 7,129 genes measured per sample.
The application process is as follows:
B1. acquiring all features in the data set; giving a total feature selection number k;
B2. processing the features in the data set obtained in step B1 and uniformly grouping them across the computing nodes; the acquisition module reads the gene data set and converts it into a two-dimensional matrix A (samples × features);
B3. performing feature cleaning and normalization on the matrix obtained in step B2 via the preprocessing module;
B4. executing the subset quality evaluation method on each computing node, thereby obtaining the target feature number of the corresponding feature subset; the data cleaned in step B3 are partitioned into gene subsets V_i according to the number of nodes in the cluster and distributed to the nodes;
B5. via the evaluation module, each node computes the quality SQ_i of its assigned gene subset using the subset quality evaluation algorithm;
B6. the number of features k_i to select from each subset is computed according to each subset's quality and the total target feature number;
B7. according to k_i, the POCSS algorithm is executed to select k_i features from each gene subset;
B8. the feature selection results of the computing nodes are summarized to obtain the finally selected features; specifically, the selection results of all nodes are combined to obtain the final k gene expressions most relevant to leukemia.
Step B3 includes: first, converting the data in the data set into a two-dimensional matrix of features and feature values; then deleting features whose values are all empty or whose variance is 0; then normalizing the remaining features with the L2 norm; and finally, establishing grouping labels according to the number of computing nodes in the cluster and randomly assigning a label to each feature, thereby randomly dividing the features into the feature subsets of the different computing nodes.
Each feature F is normalized by its L2 norm as follows:

$$fv_i' = \frac{fv_i}{\|F\|_2} = \frac{fv_i}{\sqrt{\sum_{j=1}^{n} fv_j^2}}$$

where $fv_1, fv_2, \dots, fv_n$ are the possible values of feature $F$ and $\|F\|_2$ denotes the L2 norm of $F$.
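The cleaning, L2 normalization, and random grouping of step B3 can be sketched as follows. This is a simplified single-machine illustration (the patent's implementation runs on Spark); the helper name `preprocess` is our own, and it assumes no partially-missing values remain after the all-empty columns are dropped:

```python
import numpy as np

def preprocess(A, m, seed=0):
    """Clean, normalize and randomly group the columns (features) of A.

    Drops all-empty (NaN) and zero-variance features, scales each
    remaining feature to unit L2 norm, then assigns each feature a random
    group label in {0, ..., m-1} (one group per computing node)."""
    keep = [j for j in range(A.shape[1])
            if not np.all(np.isnan(A[:, j])) and np.nanvar(A[:, j]) > 0]
    A = A[:, keep]
    A = A / np.linalg.norm(A, axis=0)              # fv' = fv / ||F||_2
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, m, size=A.shape[1])   # random group label per feature
    return [A[:, labels == g] for g in range(m)]   # one feature subset per node
```

Zero-variance columns are removed before normalization, so every surviving column has a nonzero norm and the division is safe.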
The subset quality evaluation method of step B5 specifically uses information entropy to measure the quality SQ_i of each feature subset V_i. The feature information entropy H(F) measures the amount of information contained in a feature F: the higher H(F), the more information F carries. The feature-set entropy is defined as:

$$SQ_i = \sum_{j=1}^{N_i} H(F_j) = -\sum_{j=1}^{N_i}\sum_{t} p(fv_t)\log p(fv_t)$$

where $N_i$ is the number of features contained in feature subset $V_i$, $fv_t$ ranges over all possible values of feature $F_j$, and $p(fv_t) = \Pr(F_j = fv_t)$ is the probability mass function. The larger the subset quality $SQ_i$, the more information the feature subset $V_i$ contains and the more of the optimal features lie in $V_i$; therefore the larger its feature number $k_i$.
In step B7, according to the per-subset target feature numbers obtained in step B4, each computing node performs its own feature selection calculation. Specifically, the higher a subset's quality, the larger its feature number k_i. To ensure that higher-quality feature subsets V_i are assigned larger feature numbers k_i, the subset qualities SQ_i are sorted in descending order, and the feature numbers k_i of the first m-1 subsets in that order are computed as

$$k_i = \left\lceil k \cdot \frac{SQ_i}{\sum_{j=1}^{m} SQ_j} \right\rceil, \quad 1 \le i \le m-1$$

where m is the number of computing nodes in the cluster, ⌈·⌉ denotes rounding up, and k is the total number of target features. After the feature numbers k_i of the first m-1 subsets are obtained, the feature number of the last subset in descending order is

$$k_m = k - \sum_{i=1}^{m-1} k_i.$$
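Assuming the reconstruction of the allocation formulas above (the original equation images are not recoverable), the computation of the k_i can be sketched as follows; the helper name `allocate_features` is hypothetical:

```python
import math

def allocate_features(k, qualities):
    """Split the total budget k across subsets in proportion to quality.

    Subsets are ranked by SQ_i in descending order; each of the first m-1
    gets k_i = ceil(k * SQ_i / sum_j SQ_j), and the last (lowest-quality)
    subset gets the remainder, so the k_i sum exactly to k."""
    order = sorted(range(len(qualities)), key=lambda i: -qualities[i])
    total = sum(qualities)
    ks = {}
    for i in order[:-1]:
        ks[i] = math.ceil(k * qualities[i] / total)
    ks[order[-1]] = k - sum(ks.values())           # remainder to last subset
    return [ks[i] for i in range(len(qualities))]
```

Because the first m-1 counts are rounded up, the last subset absorbs whatever is left of the budget, keeping the total at exactly k.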
Each computing node then performs its own feature selection calculation; specifically, each node runs the POCSS algorithm.
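The text does not spell out the POCSS algorithm itself. As a stand-in for the per-node selection step, a plain greedy forward selection that minimizes the Frobenius residual at each step can illustrate what each node computes (this is a generic CSS baseline, not the patent's POCSS):

```python
import numpy as np

def greedy_css(V, k_i):
    """Pick k_i columns of V greedily, at each step taking the column that
    most reduces the residual ||V - S S^+ V||_F.
    A stand-in for the per-node CSS step (the patent uses POCSS)."""
    chosen = []
    for _ in range(k_i):
        best, best_res = None, np.inf
        for j in range(V.shape[1]):
            if j in chosen:
                continue
            S = V[:, chosen + [j]]
            res = np.linalg.norm(V - S @ np.linalg.pinv(S) @ V, 'fro')
            if res < best_res:
                best, best_res = j, res
        chosen.append(best)
    return chosen
```

Each node would run this on its own subset V_i with its allocated budget k_i, and the selected column indices from all nodes are then merged into the final result.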
The invention also provides a system based on the above leukemia gene information mining method, comprising an acquisition module, a preprocessing module, an evaluation module, a selection module and an output module, connected in series in that order. The acquisition module acquires all the features in the data set; it reads the gene data set and converts it into a two-dimensional matrix A (samples × features). The preprocessing module preprocesses the original data set: it is responsible for cleaning and normalizing the features, uniformly and randomly assigns grouping labels to the processed feature set according to the number of computing nodes in the cluster, prepares the input for the next module, and distributes the resulting gene subsets V_i to the nodes. The evaluation module evaluates the quality of each feature subset and finds the target feature number for each subset from its quality; each node computes the quality SQ_i of its assigned gene subset using the subset quality evaluation algorithm. The selection module runs the POCSS algorithm on each computing node according to the feature subsets and target feature numbers, then summarizes the per-node results to obtain the finally selected features. The output module outputs the feature selection result: the final k gene expressions most relevant to leukemia.

Claims (8)

1. A distributed column subset selection method, comprising the following steps:
S1, acquiring all features in a data set;
S2, processing the features in the data set obtained in step S1, then uniformly grouping them across the computing nodes;
S3, executing a subset quality evaluation method on each computing node, thereby obtaining the target feature number of the corresponding feature subset;
S4, according to the per-subset target feature numbers obtained in step S3, each computing node performing its own feature selection calculation, thereby obtaining the features selected on each node;
and S5, summarizing the feature selection results of the computing nodes obtained in step S4 to obtain the finally selected features.
2. The distributed column subset selection method of claim 1, wherein step S2 comprises: first converting the data in the data set into a two-dimensional matrix of features and feature values; then deleting features whose values are all empty or whose variance is 0; then normalizing the remaining features with the L2 norm; and finally establishing grouping labels according to the number of computing nodes in the cluster and randomly assigning a label to each feature, thereby randomly dividing the features into the feature subsets of the different computing nodes.
3. The distributed column subset selection method of claim 2, wherein each feature F is normalized by its L2 norm as follows:

$$fv_i' = \frac{fv_i}{\|F\|_2} = \frac{fv_i}{\sqrt{\sum_{j=1}^{n} fv_j^2}}$$

where $fv_1, fv_2, \dots, fv_n$ are the possible values of feature $F$ and $\|F\|_2$ denotes the L2 norm of $F$.
4. The distributed column subset selection method of claim 3, wherein the subset quality evaluation method of step S3 uses information entropy to measure the quality SQ_i of each feature subset V_i; the feature information entropy H(F) measures the amount of information contained in a feature F: the higher H(F), the more information F carries; the feature-set entropy is defined as:

$$SQ_i = \sum_{j=1}^{N_i} H(F_j) = -\sum_{j=1}^{N_i}\sum_{t} p(fv_t)\log p(fv_t)$$

where $N_i$ is the number of features contained in feature subset $V_i$, $fv_t$ ranges over all possible values of feature $F_j$, and $p(fv_t) = \Pr(F_j = fv_t)$ is the probability mass function; the larger the subset quality $SQ_i$, the more information the feature subset $V_i$ contains and the more of the optimal features lie in $V_i$, and therefore the larger its feature number $k_i$.
5. The distributed column subset selection method according to claim 4, wherein in step S4 each computing node performs its own feature selection calculation according to the per-subset target feature numbers obtained in step S3; specifically, the higher a subset's quality, the larger its feature number k_i; to ensure that higher-quality feature subsets V_i are assigned larger feature numbers k_i, the subset qualities SQ_i are sorted in descending order and the feature numbers k_i of the first m-1 subsets in that order are computed as

$$k_i = \left\lceil k \cdot \frac{SQ_i}{\sum_{j=1}^{m} SQ_j} \right\rceil, \quad 1 \le i \le m-1$$

where m is the number of computing nodes in the cluster, ⌈·⌉ denotes rounding up, and k is the total number of target features; after the feature numbers k_i of the first m-1 subsets are obtained, the feature number of the last subset in descending order is

$$k_m = k - \sum_{i=1}^{m-1} k_i.$$
6. The distributed column subset selection method of claim 5, wherein in step S4 each computing node performs its own feature selection calculation using the POCSS algorithm.
7. A system based on the distributed column subset selection method of any one of claims 1 to 6, comprising an acquisition module, a preprocessing module, an evaluation module, a selection module and an output module, connected in series in that order; the acquisition module acquires all the features in the data set; the preprocessing module preprocesses the original data set, is responsible for cleaning and normalizing the features, uniformly and randomly assigns grouping labels to the processed feature set according to the number of computing nodes in the cluster, and prepares the input for the next module; the evaluation module evaluates the quality of each feature subset and finds the target feature number for each subset from its quality; the selection module runs the POCSS algorithm on each computing node according to the feature subsets and target feature numbers, then summarizes the per-node results to obtain the finally selected features; and the output module outputs the feature selection result.
8. A leukemia gene mining method based on the distributed column subset selection method and system of any one of claims 1 to 7, characterized by comprising the following steps:
B1. giving a total feature selection number k;
B2. reading the gene data set via the acquisition module and converting it into a two-dimensional matrix A (samples × features);
B3. performing feature cleaning and normalization on the matrix obtained in step B2 via the preprocessing module;
B4. partitioning the data cleaned in step B3 into gene subsets V_i according to the number of nodes in the cluster and distributing them to the nodes;
B5. via the evaluation module, each node computing the quality SQ_i of its assigned gene subset using the subset quality evaluation algorithm;
B6. computing the number of features k_i to select from each subset according to each subset's quality and the total target feature number;
B7. according to k_i, executing the POCSS algorithm to select k_i features from each gene subset;
B8. and summarizing the selection results of all nodes, thereby obtaining the final k gene expressions most relevant to leukemia.
CN202110350013.1A 2021-03-31 2021-03-31 Distributed column subset selection method and system and leukemia gene information mining method Pending CN113077843A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110350013.1A CN113077843A (en) 2021-03-31 2021-03-31 Distributed column subset selection method and system and leukemia gene information mining method


Publications (1)

Publication Number Publication Date
CN113077843A true CN113077843A (en) 2021-07-06

Family

ID=76614115


Country Status (1)

Country Link
CN (1) CN113077843A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116073836A (en) * 2023-03-14 2023-05-05 中南大学 Game data compression method based on column subset selection



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210706