CN113077843A - Distributed column subset selection method and system and leukemia gene information mining method - Google Patents


Info

Publication number
CN113077843A
Authority
CN
China
Prior art keywords
feature
subset
features
selection
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110350013.1A
Other languages
Chinese (zh)
Inventor
肖正 (Xiao Zheng)
魏鹏程 (Wei Pengcheng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Shaodong Intelligent Manufacturing Innovative Institute
Original Assignee
Hunan University
Shaodong Intelligent Manufacturing Innovative Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University and Shaodong Intelligent Manufacturing Innovative Institute
Priority to CN202110350013.1A
Publication of CN113077843A
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 35/00 ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
    • G16B 35/20 Screening of libraries
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H 50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining, for mining of medical data, e.g. analysing previous cases of other patients

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Public Health (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Molecular Biology (AREA)
  • Biochemistry (AREA)
  • Biomedical Technology (AREA)
  • Pathology (AREA)
  • Primary Health Care (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a distributed column subset selection method, which comprises the steps of: acquiring all features in a data set; processing the features and uniformly grouping them onto the computing nodes; executing a subset quality evaluation method on each computing node to obtain the target feature number of the corresponding feature subset; having each computing node perform its own feature selection calculation to obtain the features it selects; and summarizing the feature selection results of all computing nodes to obtain the finally selected features. The invention also discloses a system based on the distributed column subset selection method, and a leukemia gene information mining method based on the method and the system. The invention effectively avoids selecting redundant features within a subset and accelerates the feature selection process; because the selected features are directly summarized as the final selection result, the method achieves at least linear speedup in theory. The method has high accuracy, fast computation and good reliability, and simultaneously obtains gene features and their association with leukemia.

Description

Distributed column subset selection method and system and leukemia gene information mining method
Technical Field
The invention belongs to the field of big data processing, and particularly relates to a distributed column subset selection method and system and a leukemia gene information mining method.
Background
With emerging computing applications such as the Internet of Things, machine learning, computer vision and natural language processing, people frequently encounter high-dimensional data with massive numbers of samples and features. Processing such high-dimensional data requires substantial computing and memory resources, often more than a single machine can provide, and most features in the data may be useless or redundant. Selecting representative features from high-dimensional data and putting them at the service of computer applications has therefore become an urgent problem, and feature selection, as a technique for efficiently selecting representative features from an original feature set, has been a research focus in recent years. Meanwhile, mining the gene information of leukemia patients to obtain gene features and their association with leukemia is an important route toward treating the disease. However, because gene sequences are large in scale, complex in structure, and contain a large amount of redundant feature information, traditional single-machine feature selection algorithms cannot effectively mine the useful information the genes contain.
The Column Subset Selection (CSS) problem is a core sub-problem of feature selection research, and is also a constrained low-rank approximation problem. Specifically, CSS aims at finding a sub-matrix S containing at most k columns (i.e., features) of a matrix A, such that S contains as much of the information in A as possible. In the literature, the reconstruction error rate is typically used to measure this capability; in this context, features and columns are equivalent.
Unlike other low-rank approximation problems such as SVD and PCA, CSS is more flexible, more interpretable and more efficient. However, existing algorithms for the CSS problem are impractical on large-scale data sets. For example, the POCSS algorithm proposed by Feng et al. in 2019 achieves the lowest reconstruction error rate currently known, but requires a very large number of iterations and is therefore very time-consuming. Moreover, with the explosive growth of data volumes, studying distributed CSS algorithms has become particularly important.
The existing distributed CSS algorithms are two-stage algorithms. Specifically, in the distributed setting the goal becomes selecting k features from m feature subsets (m ≥ 2) obtained by partitioning the original data set. A two-stage distributed CSS algorithm first selects k features from each feature subset with a given algorithm, and then, in the second stage, selects k features from the resulting m × k features with the same algorithm as the final output.
This two-stage distributed CSS algorithm has three disadvantages:
1) after partitioning, the subsets contain different numbers of representative features; in some subsets most features are redundant, and features selected from such subsets may adversely affect the feature selection of the next stage, so it is not worthwhile to select k features from every subset;
2) the k most representative features (optimal features) of the data set are divided among different subsets, so each subset contains at most k optimal features and usually fewer; selecting k features from every subset is therefore unnecessary;
3) the two-stage algorithm does not achieve linear speedup in theory, so in practice its computation is very time-consuming and the algorithm is of limited practical use.
Furthermore, the two-stage algorithm assumes that all subsets have the same quality, which is the main cause of the above deficiencies. In practice the quality tends to vary between subsets, and ignoring this variation wastes time and resources and can even affect the final feature selection result. Existing distributed feature selection methods oriented to column subset selection therefore still suffer from low accuracy, slow computation and poor reliability.
Disclosure of Invention
One objective of the present invention is to provide a distributed column subset selection method which, by integrating a subset quality evaluation method into a distributed feature selection framework, avoids selecting redundant features within a subset and accelerates feature selection, with good scalability, fast computation and better reliability.
A second objective of the present invention is to provide a system based on the distributed column subset selection method.
A third objective of the present invention is to provide a leukemia gene information mining method based on the distributed column subset selection method and system.
The distributed column subset selection method disclosed by the invention comprises the following steps:
S1. acquiring all features in a data set;
S2. processing the features in the data set obtained in step S1, then uniformly grouping them onto the computing nodes;
S3. executing a subset quality evaluation method on each computing node, thereby calculating the target feature number of the corresponding feature subset;
S4. according to the target feature number of each computing node obtained in step S3, having each computing node perform its own feature selection calculation, thereby obtaining the features selected by each computing node;
S5. summarizing the feature selection results of the computing nodes obtained in step S4 to obtain the finally selected features.
Step S2 comprises: first converting the data in the data set into a two-dimensional matrix of features and feature values; then deleting features whose values are all empty or whose value variance is 0; then normalizing the remaining features with the L2 norm; and finally establishing grouping labels according to the number of computing nodes in the cluster and randomly assigning a label to each feature, thereby randomly dividing the features into the feature subsets of the different computing nodes.
The L2-norm normalization applied to each feature F is:

$\hat{F} = \dfrac{F}{\|F\|_2}$, with $\|F\|_2 = \sqrt{\textstyle\sum_{t=1}^{n} fv_t^2}$

where $fv_1, fv_2, \ldots, fv_n$ are the values that feature F can take, and $\|F\|_2$ denotes the L2 norm of feature F.
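A minimal sketch of this normalization (the function name and the NumPy column-matrix layout are illustrative; the zero-norm guard is an added assumption, since the patent's pipeline deletes constant features before this point):

```python
import numpy as np

def normalize_columns(A):
    """Scale every feature (column) of A to unit L2 norm.

    Columns whose norm is 0 are left unchanged to avoid division by zero;
    in the patent's pipeline such features are deleted beforehand anyway.
    """
    norms = np.linalg.norm(A, axis=0)          # ||F||_2 for each column
    safe = np.where(norms == 0.0, 1.0, norms)  # guard all-zero columns
    return A / safe

# two samples, two features: one informative column, one all-zero column
A = np.array([[3.0, 0.0],
              [4.0, 0.0]])
A_hat = normalize_columns(A)                   # first column becomes [0.6, 0.8]
```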
The subset quality evaluation method of step S3 measures the quality $SQ_i$ of a feature subset $V_i$ using information entropy. The feature information entropy $H(F)$ measures the amount of information contained in one feature F: the higher $H(F)$, the more information F contains. The feature set entropy is defined as:

$SQ_i = \sum_{j=1}^{N_i} H(F_j) = -\sum_{j=1}^{N_i} \sum_{t} p(fv_t)\log p(fv_t)$

where $N_i$ is the number of features contained in the feature subset $V_i$, $fv_t$ ranges over all possible values of feature $F_j$, and $p(fv_t) = \Pr(F_j = fv_t)$ is the probability mass function. The larger the value of $SQ_i$, the more information the feature subset $V_i$ contains and the more optimal features are likely to be distributed in $V_i$; hence the larger its feature number $k_i$.
In step S4, each computing node performs its own feature selection calculation according to the target feature number obtained in step S3: the higher the quality of a subset, the larger its feature number $k_i$, so that a higher-quality feature subset $V_i$ is assigned a larger number of features $k_i$ to select. The subset qualities $SQ_i$ are sorted in descending order, and the feature numbers $k_i$ of the first m-1 subsets in that order are calculated as

$k_i = \left\lceil k \cdot \dfrac{SQ_i}{\sum_{j=1}^{m} SQ_j} \right\rceil, \quad 1 \le i \le m-1$

where m is the number of computing nodes in the cluster, $\lceil\cdot\rceil$ denotes rounding up, and k is the total target feature number. Having obtained the feature numbers $k_i$ of the first m-1 subsets, the feature number of the last subset in descending order is

$k_m = k - \sum_{i=1}^{m-1} k_i$.
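This allocation rule can be sketched as follows (the function name is hypothetical, and the `min` guard is an added safety assumption so that repeated rounding-up can never exceed the budget k):

```python
import math

def allocate_targets(subset_qualities, k):
    """Split the total target k over the subsets in proportion to SQ_i:
    the first m-1 subsets (in descending quality order) each receive
    ceil(k * SQ_i / sum(SQ)), and the last subset receives the remainder,
    so the k_i always sum exactly to k."""
    m = len(subset_qualities)
    total = sum(subset_qualities)
    order = sorted(range(m), key=lambda i: subset_qualities[i], reverse=True)
    k_per_subset = [0] * m
    assigned = 0
    for rank, i in enumerate(order):
        if rank < m - 1:
            k_i = min(math.ceil(k * subset_qualities[i] / total), k - assigned)
            k_per_subset[i] = k_i
            assigned += k_i
        else:                        # last (lowest-quality) subset: remainder
            k_per_subset[i] = k - assigned
    return k_per_subset

k_per = allocate_targets([4.0, 3.0, 1.0], k=6)   # -> [3, 3, 0]
```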
In step S4, each computing node performs its own feature selection calculation, and does so using the POCSS algorithm.
The invention also discloses a system based on the distributed column subset selection method, comprising an acquisition module, a preprocessing module, an evaluation module, a selection module and an output module, connected in series in that order. The acquisition module acquires all features in the data set. The preprocessing module preprocesses the original data set: it is responsible for cleaning and normalizing the features, uniformly and randomly assigns grouping labels to the processed feature set according to the number of computing nodes in the cluster, and prepares the input for the next module. The evaluation module evaluates the quality of each feature subset and determines the subset's target feature number from that quality. The selection module runs the POCSS algorithm on each computing node according to the feature subset and target feature number, then summarizes the per-node results to obtain the finally selected features. The output module outputs the feature selection result.
The invention also discloses a leukemia gene information mining method based on the distributed column subset selection method and system, comprising the following steps:
B1. give the total feature selection number k;
B2. read the gene data set through the acquisition module and convert it into a two-dimensional matrix A (samples × features) composed of samples and features;
B3. perform feature cleaning and normalization on the matrix obtained in step B2 through the preprocessing module;
B4. form the data cleaned in step B3 into gene subsets $V_i$ according to the number of nodes in the cluster, and distribute them to the nodes;
B5. through the evaluation module, have each node calculate the quality $SQ_i$ of its assigned gene subset using the subset quality evaluation algorithm;
B6. calculate the number of features $k_i$ to be selected from each subset according to the subset qualities and the total target feature number;
B7. according to $k_i$, execute the POCSS algorithm to select $k_i$ features in each gene subset;
B8. summarize the selection results of all nodes to obtain the final k gene expressions most relevant to leukemia.
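Steps B1-B8 can be sketched end to end on a toy matrix. Everything below is illustrative, not the patent's implementation: the data, the function names, the base-2 entropy, and above all the per-node picker, which is a simple highest-entropy-first selection standing in for the POCSS run of step B7:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy H(F), in bits, of one feature's value distribution."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mine_genes(dataset, k, m):
    """Toy walk-through of steps B4-B8.

    dataset: list of (gene_name, values) columns; k: total picks;
    m: number of simulated nodes.
    """
    subsets = [dataset[i::m] for i in range(m)]              # B4: partition
    SQ = [sum(entropy(v) for _, v in s) for s in subsets]    # B5: qualities
    # B6: k_i proportional to SQ_i (ceiling for all but the last subset)
    order = sorted(range(m), key=lambda i: SQ[i], reverse=True)
    k_i, assigned = [0] * m, 0
    for rank, i in enumerate(order):
        k_i[i] = (k - assigned) if rank == m - 1 else min(
            math.ceil(k * SQ[i] / sum(SQ)), k - assigned)
        assigned += k_i[i]
    picks = []                                               # B7: per-node picks
    for s, ki in zip(subsets, k_i):
        ranked = sorted(s, key=lambda col: entropy(col[1]), reverse=True)
        picks.extend(name for name, _ in ranked[:ki])
    return picks                                             # B8: plain union

genes = [("g0", [0, 0, 0, 0]), ("g1", [0, 1, 0, 1]),
         ("g2", [1, 1, 1, 1]), ("g3", [0, 1, 2, 3])]
selected = mine_genes(genes, k=2, m=2)
```

On this toy data the constant columns g0 and g2 contribute no entropy, so the quality-based allocation sends both picks to the node holding g1 and g3.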
In the distributed column subset selection method and system and the leukemia gene information mining method of the invention, the subset quality evaluation method is integrated into the distributed feature selection framework, which effectively avoids selecting redundant features from the subsets and accelerates the feature selection process. Moreover, the features selected by the computing nodes are directly summarized as the final selection result, so the method achieves at least linear speedup in theory. Finally, the invention avoids selecting redundant features within a subset, accelerates feature selection, offers good scalability, high computation speed and better reliability, and can quickly mine the gene features of leukemia gene information and their association with leukemia.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention.
FIG. 2 is a schematic diagram of the system of the present invention.
Detailed Description
The two-stage algorithm assumes that all subsets have the same quality, which is the main cause of the deficiencies of the prior art. In practice, however, quality tends to differ between subsets, and ignoring this difference wastes time and resources and can even affect the final feature selection result. In this application, the number of optimal features a subset contains is used to measure its quality. Specifically, a data set necessarily contains k most representative features, called the k optimal features, and the more optimal features a subset contains, the higher its quality. For the CSS problem, the optimal solution $S_{OPT}$ is defined as the set containing the k optimal features; among all feature combinations, this combination has the strongest ability to fit the original data set.
FIG. 1 is a schematic flow chart of the method of the present invention: the invention provides a distributed column subset selection method, which comprises the following steps:
s1, acquiring all characteristics in a data set;
s2, processing the characteristics in the data set obtained in the step S1, and then uniformly grouping the characteristics to each computing node;
in this embodiment, the data set is obtained from a UCI database provided by the university of california, the european branch, which contains data sets with different scale feature numbers and sample numbers. The method comprises the steps of firstly converting data in a data set into a two-dimensional matrix formed by characteristics and characteristic values, then deleting the characteristics of which the characteristic values are all null and the value variance is 0, then carrying out normalization processing on the rest characteristics by using an L2 norm, and finally establishing grouping labels according to the number of computing nodes in a cluster and randomly distributing the labels for each characteristic.
The L2-norm normalization applied to each feature F is:

$\hat{F} = \dfrac{F}{\|F\|_2}$, with $\|F\|_2 = \sqrt{\textstyle\sum_{t=1}^{n} fv_t^2}$

where $fv_1, fv_2, \ldots, fv_n$ are the values that feature F can take, and $\|F\|_2$ denotes the L2 norm of feature F.
These steps form a distributed feature selection framework oriented to the CSS problem, specifically comprising:
A1. establish and initialize the distributed software and hardware running environment, and preload the resources the feature selection framework depends on, in preparation for running the framework;
A2. acquire all the features carrying the grouping labels assigned in step S2, group them by label into feature subsets $V_i$, where $1 \le i \le m$ and m is the number of computing nodes in the cluster, and distribute the groups to the corresponding computing nodes;
A3. for the feature subset $V_i$ obtained in step A2, run the subset quality evaluation method on each computing node to obtain the subset quality $SQ_i$ of $V_i$.
Specifically, the quality $SQ_i$ of the feature subset $V_i$ is measured using information entropy. The feature information entropy $H(F)$ measures the amount of information contained in one feature F: the higher $H(F)$, the more information F contains. The feature set entropy is defined as:

$SQ_i = \sum_{j=1}^{N_i} H(F_j) = -\sum_{j=1}^{N_i} \sum_{t} p(fv_t)\log p(fv_t)$

where $N_i$ is the number of features contained in the feature subset $V_i$, $fv_t$ ranges over all possible values of feature $F_j$, and $p(fv_t) = \Pr(F_j = fv_t)$ is the probability mass function. The larger $SQ_i$, the more information $V_i$ contains and the more optimal features are distributed in $V_i$; hence the larger $k_i$.
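The quality $SQ_i$ can be computed directly from the empirical value distribution of each feature; a short sketch (function names illustrative, log base 2 assumed):

```python
import math
from collections import Counter

def feature_entropy(values):
    """H(F): Shannon entropy of one feature's empirical value distribution."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def subset_quality(subset):
    """SQ_i: the sum of the entropies of all features in the subset V_i."""
    return sum(feature_entropy(col) for col in subset)

# a two-feature subset: one balanced binary feature (H = 1 bit), one constant
V_i = [[0, 1, 0, 1],
       [5, 5, 5, 5]]
SQ_i = subset_quality(V_i)     # 1.0 + 0.0 = 1.0
```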
This step has two advantages: first, the entropy-based subset quality evaluation method is computationally efficient and provides a fast assessment; second, the subset quality index is flexible and pluggable, and can be chosen according to the preferences of different users and applications.
A4. summarize the subset qualities $SQ_i$ of the feature subsets assigned to the computing nodes; according to the target feature number k and the quality of each feature subset $V_i$, determine the number $k_i$ of features to be selected from each subset, and distribute these numbers to the corresponding computing nodes.
The number $k_i$ is determined from the quality of each feature subset $V_i$ as follows: the higher the quality, the larger $k_i$, so that a high-quality feature subset $V_i$ is assigned a larger number of features $k_i$. The subset qualities $SQ_i$ are sorted in descending order, and the feature numbers $k_i$ of the first m-1 subsets in that order are calculated as

$k_i = \left\lceil k \cdot \dfrac{SQ_i}{\sum_{j=1}^{m} SQ_j} \right\rceil, \quad 1 \le i \le m-1$

where m is the number of computing nodes in the cluster and $\lceil\cdot\rceil$ denotes rounding up. Having obtained the feature numbers $k_i$ of the first m-1 subsets, the feature number of the last subset in descending order is

$k_m = k - \sum_{i=1}^{m-1} k_i$.
The advantage of this step is that, because the features are randomly assigned grouping labels in step S2, the optimal features contained in $S_{OPT}$ are scattered across the subsets, i.e., $k_i$ is usually much smaller than k. Compared with the two-stage CSS algorithm, which selects k features from every subset, the method of the invention only needs to select $k_i$ features, which effectively shortens the running time of the algorithm.
A5. each computing node, given the feature number $k_i$ obtained in step A4, runs the single-machine CSS algorithm on its feature subset $V_i$ to select $k_i$ features, forming a feature set $S_i$.
The single-machine CSS algorithm is specifically the POCSS algorithm, which solves the CSS problem using the idea of Pareto optimization. POCSS needs $2ek^2N$ iterations to guarantee that the obtained solution S is no worse than the lower bound $(1 - e^{-\gamma}) \cdot OPT$, the highest lower bound currently known, where e is Euler's number, $e \approx 2.71828$; k is the number of features to select; N is the total number of features in the data set; OPT is the objective value of the optimal solution, i.e., $f(S_{OPT})$; and $\gamma$ is the submodularity ratio. Compared with other CSS algorithms, POCSS achieves the lowest reconstruction error rate.
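POCSS itself maintains a Pareto archive and is not reproduced here; as a stand-in that only illustrates the objective a per-node solver optimizes, an entirely illustrative greedy column picker might look like this:

```python
import numpy as np

def greedy_css(A, k):
    """Greedy stand-in for a single-node CSS solver: at each round, add the
    column that most reduces the reconstruction error ||A - S S^+ A||_F^2.
    (POCSS reaches a stronger guarantee via Pareto-optimization search.)"""
    n = A.shape[1]
    chosen = []
    for _ in range(min(k, n)):
        best_j, best_err = None, float("inf")
        for j in range(n):
            if j in chosen:
                continue
            S = A[:, chosen + [j]]
            err = np.linalg.norm(A - S @ np.linalg.pinv(S) @ A, "fro") ** 2
            if err < best_err:
                best_j, best_err = j, err
        chosen.append(best_j)
    return chosen

A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
cols = greedy_css(A, 2)        # column 2 is picked first (largest error drop)
S = A[:, cols]
```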
The purpose of this step is to find, through the subset quality evaluation method, a reasonable target feature number $k_i$ for each feature subset $V_i$. Prior distributed feature selection algorithms select k features directly from each $V_i$, yet in practice each $V_i$ contains only some of the optimal features of $S_{OPT}$. If the number of optimal features of $S_{OPT}$ falling in each $V_i$, denoted $k_i^{*}$, were known, only $k_i^{*}$ features would need to be selected from $V_i$. This would greatly reduce the running time of the single-machine feature selection algorithm, especially for algorithms sensitive to the target selection number: POCSS needs $2ek^2N$ iterations to obtain a good feature selection result, so its iteration count grows quadratically in k and its running time rises sharply with it. However, obtaining $S_{OPT}$ and $k_i^{*}$ exactly is very difficult in general, so a heuristic can only find a relatively reasonable estimate of $k_i^{*}$ for each $V_i$, namely $k_i$. The invention therefore proposes the subset quality evaluation method: the higher the quality of $V_i$, i.e., the larger $SQ_i$, the more optimal features may be distributed in it, and the larger the corresponding $k_i$.
Compared with the single-machine POCSS algorithm, the improved distributed POCSS algorithm accelerates as follows.
For the CSS problem in the distributed scenario, the feature set is divided into m subsets $\{V_1, V_2, \ldots, V_m\}$ according to the number m of computing nodes in the cluster, and the goal is to select k features from these subsets to obtain a solution S that minimizes the objective function f(S). In the ideal case all feature subsets have equal quality, $SQ_1 = SQ_2 = \cdots = SQ_m$, so by the subset quality evaluation algorithm $k_i = k/m$ for every $V_i$. For POCSS, the number of iterations on subset $V_i$ is then $2e \cdot \frac{k}{m} \cdot \frac{k}{m} \cdot \frac{N}{m} = \frac{2ek^2N}{m^3}$ (the single-machine POCSS needs $2ek^2N$ iterations): the first k/m is the number of features to select in $V_i$, i.e., $k_i$; the second k/m is the size of the set POCSS uses to store selected features, at most k/m because $k_i = k/m$; and N/m is the size of each of the m equal subsets of the N total features. Ideally, therefore, applying the invention to POCSS yields an $m^3$-fold speedup, where m is the number of computing nodes in the cluster. In the worst case, one feature subset $V_i$ has far higher quality than the others, and the subset quality evaluation algorithm assigns all k features to $V_i$; the POCSS iteration count is then $2ek^2 \cdot \frac{N}{m}$, i.e., the invention still achieves linear speedup over the single-machine POCSS in theory. In summary, the theoretical speedup of the modified POCSS over the original lies between m and $m^3$.
This step has two advantages. First, even in the worst case the modified POCSS theoretically achieves linear speedup; since that worst case is extremely rare under uniform grouping, the speedup in practice is usually higher than linear, up to $m^3$-fold. Second, the feature selection algorithm in this step is pluggable: virtually any feature selection algorithm that runs on a single computing node can be integrated into the invention and accelerated by distributed computation.
A6. aggregate the feature sets $S_i$ obtained in step A5 into the final feature selection result S:

$S = \bigcup_{i=1}^{m} S_i$.
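Step A6 amounts to a plain union with no second selection stage; a trivial sketch (the gene names are made up):

```python
def aggregate(selected_per_node):
    """A6: the final result S is the union of the per-node feature sets S_i;
    no further CSS pass is run over the pooled selections."""
    final = []
    for S_i in selected_per_node:
        final.extend(S_i)
    return final

S = aggregate([["gene_12", "gene_87"], ["gene_301"], []])
```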
the step has the advantages that the selection results of all the subsets are directly collected and merged to be the final output result, the method is simple and efficient, the method is different from the existing two-stage CSS algorithm, and the final selection result can be obtained only by operating the single CSS algorithm again on the selection results of the collected subsets.
S3, executing a subset quality evaluation method on each computing node, and thus calculating to obtain the target feature number of the corresponding feature subset;
The subset quality evaluation method measures the quality $SQ_i$ of the feature subset $V_i$ using information entropy. The feature information entropy $H(F)$ measures the amount of information contained in one feature F: the higher $H(F)$, the more information F contains. The feature set entropy is defined as:

$SQ_i = \sum_{j=1}^{N_i} H(F_j) = -\sum_{j=1}^{N_i} \sum_{t} p(fv_t)\log p(fv_t)$

where $N_i$ is the number of features contained in $V_i$, $fv_t$ ranges over all possible values of feature $F_j$, and $p(fv_t) = \Pr(F_j = fv_t)$ is the probability mass function. The larger $SQ_i$, the more information $V_i$ contains and the more optimal features are distributed in $V_i$; hence the larger the feature number $k_i$.
s4, according to the feature subset target feature number of each computing node obtained in the step S3, each computing node performs respective feature selection calculation, and therefore the features selected by each computing node are obtained;
s5, summarizing the feature selection calculation results of the calculation nodes obtained in the step S4 to obtain the finally selected features
The CSS problem is a constrained low-rank approximation problem aimed at fitting an original matrix A by S, where S is a matrix composed of columns (features) selected from A, and the strength of the fitting ability of S is determined by A and SS+The variance of A means that the smaller the variance, the stronger the fitting ability.
The mathematical definition of the CSS problem is: given a matrix $A \in \mathbb{R}^{d \times n}$ and a positive integer $k \le n$, find a sub-matrix $S$ containing at most $k$ columns of $A$ such that

$$S = \mathop{\arg\min}_{|S| \le k}\ \left\| A - SS^{+}A \right\|_F^2$$

where $|S|$ is the number of columns of the matrix $S$, $S^{+}$ denotes the Moore-Penrose generalized inverse of $S$, and $\|\cdot\|_F$ is the Frobenius norm of a matrix.
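A minimal sketch of evaluating the CSS objective above with NumPy's pseudo-inverse (the function name `css_residual` is our own, not from the patent):

```python
import numpy as np

def css_residual(A, cols):
    """||A - S S^+ A||_F for the sub-matrix S formed by the given columns
    of A; a smaller residual means the selected columns fit A better."""
    S = A[:, cols]
    P = S @ np.linalg.pinv(S)      # projection onto the column space of S
    return np.linalg.norm(A - P @ A, 'fro')
```

Selecting all columns of a full-column-rank matrix drives the residual to zero, since S then spans the whole column space of A.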
The basic idea of the invention is, for the CSS problem, to integrate a subset quality evaluation method into a distributed feature selection framework to accelerate the feature selection process. First, the heuristic subset quality evaluation method determines the number of features to select in each subset, which effectively avoids selecting redundant features within a subset and speeds up feature selection. Second, because the CSS algorithm runs after the subset quality evaluation, the features selected from each subset can simply be aggregated to give the final selection result directly, so the method can in theory achieve at least linear speedup. Finally, depending on the preferences of users and applications, both the single-machine feature selection algorithm and the subset quality evaluation index can be swapped freely, yielding a flexible, pluggable distributed feature selection framework.
In existing experiments, the single-machine POCSS feature selection algorithm was embedded into the distributed feature selection framework and tested on small, medium and large data sets. The speedup on all data sets improved markedly, with the highest speedup reaching 3788; under ideal conditions the theoretical speedup reaches m³, where m is the number of computing nodes in the cluster. The main reason is that the number of features selected in each subset, as determined by the subset quality evaluation method, is usually less than k, which greatly reduces the number of iterations the POCSS algorithm requires and shortens the running time.
Fig. 2 is a schematic structural diagram of the system of the present invention. The invention also provides a system implementing the above distributed column subset selection method, comprising an acquisition module, a preprocessing module, an evaluation module, a selection module and an output module, connected in series in that order. The acquisition module acquires all the features in the data set. The preprocessing module preprocesses the original data set: it is responsible for cleaning and normalizing the features, uniformly and randomly assigns grouping labels to the processed feature set according to the number of computing nodes in the cluster, and prepares the input for the next module. The evaluation module evaluates the quality of each feature subset and, from each subset's quality, finds a reasonable number of target features to select for that subset. The selection module runs the single-machine CSS algorithm on each computing node according to the feature subsets and target feature numbers obtained upstream, then summarizes the per-node results to obtain the finally selected features. The output module outputs the feature selection result.
The following uses an embodiment to illustrate the advantages of the present invention.
In terms of hardware, 8 computing nodes are used, each equipped with a Xeon Gold 5118 CPU and 12 GB of memory. In terms of software, each node runs CentOS 7.7.1908 with the Hadoop 3.1.2 and Spark 2.4.5 distributed computing platforms; the method is implemented in Python 3.6.8, and the number of target features for the CSS problem is set to k = 50.
To demonstrate the effectiveness of the method and the improvement in acceleration, tests were run on several data sets, comparing the running time of the POCSS algorithm modified by the method against that of the single-machine POCSS algorithm. The speedup is computed as:

$$\text{Speedup} = \frac{T_{POCSS}}{T_f}$$

where $T_f$ is the feature selection running time of the invention, Speedup is the running-time speedup of the invention relative to the single-machine POCSS algorithm, and $T_{POCSS}$ is the running time of the single-machine POCSS algorithm.
The acceleration ratio evaluation results are as follows:
TABLE 1
[Table 1: per-data-set speedup evaluation results — original table image not recoverable]
From the speedup results in Table 1 above, it can be seen that because the invention selects only k_i features on each feature subset, its acceleration effect when applied to the POCSS algorithm is very pronounced. As the data set grows and the total number of features keeps increasing, the acceleration becomes ever more evident: the highest speedup is 447 on the Scene data set and 3788 on the sEMG data set.
The acceleration from applying the invention is remarkable for three reasons. First, because fewer features are selected on each feature subset, the modified POCSS algorithm theoretically speeds up by more than m and by up to m³; in the actual experiments, the subsets of the sEMG data set have approximately equal quality after partitioning, so nearly m³-fold acceleration is reached. Second, as the number of computing nodes increases, the feature subsets become smaller and smaller, the two-dimensional matrices they form shrink accordingly, and the time spent on matrix computation drops. Finally, data sets differ in sparsity: computation on a sparse matrix takes noticeably longer than on a dense one, and sEMG is sparse.
The invention can be widely applied to fields such as biological information mining and fast image compression; biological information mining is taken as the example below.
The invention takes a published leukemia data set as an example. The data set contains gene expression corresponding to acute lymphoblastic leukemia (ALL) and acute myelogenous leukemia (AML) samples from bone marrow and peripheral blood. It consists of 72 samples: 49 ALL samples and 23 AML samples, with the expression of 7,129 genes measured per sample.
The application process is as follows:
B1. acquiring all features in the data set; giving a total feature selection number k;
B2. processing the features in the data set obtained in step B1 and uniformly grouping them across the computing nodes; the acquisition module reads the gene data set and converts it into a two-dimensional matrix A (samples × features);
B3. performing feature cleaning and normalization on the matrix obtained in step B2 via the preprocessing module;
B4. executing the subset quality evaluation method on each computing node, thereby obtaining the target feature number of the corresponding feature subset; the data cleaned in step B3 are partitioned into gene subsets V_i according to the number of nodes in the cluster and distributed to the nodes;
B5. via the evaluation module, each node computes the quality SQ_i of its assigned gene subset using the subset quality evaluation algorithm;
B6. the number of features k_i to select from each subset is computed according to each subset's quality and the total target feature number;
B7. according to k_i, the POCSS algorithm is executed to select k_i features from each gene subset;
B8. the feature selection results of the computing nodes are summarized to obtain the finally selected features; specifically, the selection results of all nodes are combined to obtain the final k gene expressions most relevant to leukemia.
Step B3 includes: first, converting the data in the data set into a two-dimensional matrix of features and feature values; then deleting features whose values are all empty or whose variance is 0; then normalizing the remaining features with the L2 norm; and finally, establishing grouping labels according to the number of computing nodes in the cluster and randomly assigning a label to each feature, thereby randomly dividing the features into the feature subsets of the different computing nodes.
Each feature F is normalized by its L2 norm as follows:

$$fv_i' = \frac{fv_i}{\|F\|_2} = \frac{fv_i}{\sqrt{\sum_{j=1}^{n} fv_j^2}}$$

where $fv_1, fv_2, \dots, fv_n$ are the possible values of feature $F$ and $\|F\|_2$ denotes the L2 norm of $F$.
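The cleaning, L2 normalization, and random grouping of step B3 can be sketched as follows. This is a simplified single-machine illustration (the patent's implementation runs on Spark); the helper name `preprocess` is our own, and it assumes no partially-missing values remain after the all-empty columns are dropped:

```python
import numpy as np

def preprocess(A, m, seed=0):
    """Clean, normalize and randomly group the columns (features) of A.

    Drops all-empty (NaN) and zero-variance features, scales each
    remaining feature to unit L2 norm, then assigns each feature a random
    group label in {0, ..., m-1} (one group per computing node)."""
    keep = [j for j in range(A.shape[1])
            if not np.all(np.isnan(A[:, j])) and np.nanvar(A[:, j]) > 0]
    A = A[:, keep]
    A = A / np.linalg.norm(A, axis=0)              # fv' = fv / ||F||_2
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, m, size=A.shape[1])   # random group label per feature
    return [A[:, labels == g] for g in range(m)]   # one feature subset per node
```

Zero-variance columns are removed before normalization, so every surviving column has a nonzero norm and the division is safe.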
The subset quality evaluation method of step B5 specifically uses information entropy to measure the quality SQ_i of each feature subset V_i. The feature information entropy H(F) measures the amount of information contained in a feature F: the higher H(F), the more information F carries. The feature-set entropy is defined as:

$$SQ_i = \sum_{j=1}^{N_i} H(F_j) = -\sum_{j=1}^{N_i}\sum_{t} p(fv_t)\log p(fv_t)$$

where $N_i$ is the number of features contained in feature subset $V_i$, $fv_t$ ranges over all possible values of feature $F_j$, and $p(fv_t) = \Pr(F_j = fv_t)$ is the probability mass function. The larger the subset quality $SQ_i$, the more information the feature subset $V_i$ contains and the more of the optimal features lie in $V_i$; therefore the larger its feature number $k_i$.
In step B7, according to the per-subset target feature numbers obtained in step B4, each computing node performs its own feature selection calculation. Specifically, the higher a subset's quality, the larger its feature number k_i. To ensure that higher-quality feature subsets V_i are assigned larger feature numbers k_i, the subset qualities SQ_i are sorted in descending order, and the feature numbers k_i of the first m-1 subsets in that order are computed as

$$k_i = \left\lceil k \cdot \frac{SQ_i}{\sum_{j=1}^{m} SQ_j} \right\rceil, \quad 1 \le i \le m-1$$

where m is the number of computing nodes in the cluster, ⌈·⌉ denotes rounding up, and k is the total number of target features. After the feature numbers k_i of the first m-1 subsets are obtained, the feature number of the last subset in descending order is

$$k_m = k - \sum_{i=1}^{m-1} k_i.$$
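Assuming the reconstruction of the allocation formulas above (the original equation images are not recoverable), the computation of the k_i can be sketched as follows; the helper name `allocate_features` is hypothetical:

```python
import math

def allocate_features(k, qualities):
    """Split the total budget k across subsets in proportion to quality.

    Subsets are ranked by SQ_i in descending order; each of the first m-1
    gets k_i = ceil(k * SQ_i / sum_j SQ_j), and the last (lowest-quality)
    subset gets the remainder, so the k_i sum exactly to k."""
    order = sorted(range(len(qualities)), key=lambda i: -qualities[i])
    total = sum(qualities)
    ks = {}
    for i in order[:-1]:
        ks[i] = math.ceil(k * qualities[i] / total)
    ks[order[-1]] = k - sum(ks.values())           # remainder to last subset
    return [ks[i] for i in range(len(qualities))]
```

Because the first m-1 counts are rounded up, the last subset absorbs whatever is left of the budget, keeping the total at exactly k.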
Each computing node then performs its own feature selection calculation; specifically, each node runs the POCSS algorithm.
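The text does not spell out the POCSS algorithm itself. As a stand-in for the per-node selection step, a plain greedy forward selection that minimizes the Frobenius residual at each step can illustrate what each node computes (this is a generic CSS baseline, not the patent's POCSS):

```python
import numpy as np

def greedy_css(V, k_i):
    """Pick k_i columns of V greedily, at each step taking the column that
    most reduces the residual ||V - S S^+ V||_F.
    A stand-in for the per-node CSS step (the patent uses POCSS)."""
    chosen = []
    for _ in range(k_i):
        best, best_res = None, np.inf
        for j in range(V.shape[1]):
            if j in chosen:
                continue
            S = V[:, chosen + [j]]
            res = np.linalg.norm(V - S @ np.linalg.pinv(S) @ V, 'fro')
            if res < best_res:
                best, best_res = j, res
        chosen.append(best)
    return chosen
```

Each node would run this on its own subset V_i with its allocated budget k_i, and the selected column indices from all nodes are then merged into the final result.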
The invention also provides a system based on the above leukemia gene information mining method, comprising an acquisition module, a preprocessing module, an evaluation module, a selection module and an output module, connected in series in that order. The acquisition module acquires all the features in the data set; it reads the gene data set and converts it into a two-dimensional matrix A (samples × features). The preprocessing module preprocesses the original data set: it is responsible for cleaning and normalizing the features, uniformly and randomly assigns grouping labels to the processed feature set according to the number of computing nodes in the cluster, prepares the input for the next module, and distributes the resulting gene subsets V_i to the nodes. The evaluation module evaluates the quality of each feature subset and finds the target feature number for each subset from its quality; each node computes the quality SQ_i of its assigned gene subset using the subset quality evaluation algorithm. The selection module runs the POCSS algorithm on each computing node according to the feature subsets and target feature numbers, then summarizes the per-node results to obtain the finally selected features. The output module outputs the feature selection result: the final k gene expressions most relevant to leukemia.

Claims (8)

1. A distributed column subset selection method, comprising the following steps:
S1, acquiring all features in a data set;
S2, processing the features in the data set obtained in step S1, then uniformly grouping them across the computing nodes;
S3, executing a subset quality evaluation method on each computing node, thereby obtaining the target feature number of the corresponding feature subset;
S4, according to the per-subset target feature numbers obtained in step S3, each computing node performing its own feature selection calculation, thereby obtaining the features selected on each node;
and S5, summarizing the feature selection results of the computing nodes obtained in step S4 to obtain the finally selected features.
2. The distributed column subset selection method of claim 1, wherein step S2 comprises: first converting the data in the data set into a two-dimensional matrix of features and feature values; then deleting features whose values are all empty or whose variance is 0; then normalizing the remaining features with the L2 norm; and finally establishing grouping labels according to the number of computing nodes in the cluster and randomly assigning a label to each feature, thereby randomly dividing the features into the feature subsets of the different computing nodes.
3. The distributed column subset selection method of claim 2, wherein each feature F is normalized by its L2 norm as follows:

$$fv_i' = \frac{fv_i}{\|F\|_2} = \frac{fv_i}{\sqrt{\sum_{j=1}^{n} fv_j^2}}$$

where $fv_1, fv_2, \dots, fv_n$ are the possible values of feature $F$ and $\|F\|_2$ denotes the L2 norm of $F$.
4. The distributed column subset selection method of claim 3, wherein the subset quality evaluation method of step S3 uses information entropy to measure the quality SQ_i of each feature subset V_i; the feature information entropy H(F) measures the amount of information contained in a feature F: the higher H(F), the more information F carries; the feature-set entropy is defined as:

$$SQ_i = \sum_{j=1}^{N_i} H(F_j) = -\sum_{j=1}^{N_i}\sum_{t} p(fv_t)\log p(fv_t)$$

where $N_i$ is the number of features contained in feature subset $V_i$, $fv_t$ ranges over all possible values of feature $F_j$, and $p(fv_t) = \Pr(F_j = fv_t)$ is the probability mass function; the larger the subset quality $SQ_i$, the more information the feature subset $V_i$ contains and the more of the optimal features lie in $V_i$, and therefore the larger its feature number $k_i$.
5. The distributed column subset selection method according to claim 4, wherein in step S4 each computing node performs its own feature selection calculation according to the per-subset target feature numbers obtained in step S3; specifically, the higher a subset's quality, the larger its feature number k_i; to ensure that higher-quality feature subsets V_i are assigned larger feature numbers k_i, the subset qualities SQ_i are sorted in descending order and the feature numbers k_i of the first m-1 subsets in that order are computed as

$$k_i = \left\lceil k \cdot \frac{SQ_i}{\sum_{j=1}^{m} SQ_j} \right\rceil, \quad 1 \le i \le m-1$$

where m is the number of computing nodes in the cluster, ⌈·⌉ denotes rounding up, and k is the total number of target features; after the feature numbers k_i of the first m-1 subsets are obtained, the feature number of the last subset in descending order is

$$k_m = k - \sum_{i=1}^{m-1} k_i.$$
6. The distributed column subset selection method of claim 5, wherein in step S4 each computing node performs its own feature selection calculation using the POCSS algorithm.
7. A system based on the distributed column subset selection method of any one of claims 1 to 6, comprising an acquisition module, a preprocessing module, an evaluation module, a selection module and an output module, connected in series in that order; the acquisition module acquires all the features in the data set; the preprocessing module preprocesses the original data set, is responsible for cleaning and normalizing the features, uniformly and randomly assigns grouping labels to the processed feature set according to the number of computing nodes in the cluster, and prepares the input for the next module; the evaluation module evaluates the quality of each feature subset and finds the target feature number for each subset from its quality; the selection module runs the POCSS algorithm on each computing node according to the feature subsets and target feature numbers, then summarizes the per-node results to obtain the finally selected features; and the output module outputs the feature selection result.
8. A leukemia gene mining method based on the distributed column subset selection method and system of any one of claims 1 to 7, characterized by comprising the following steps:
B1. giving a total feature selection number k;
B2. reading the gene data set via the acquisition module and converting it into a two-dimensional matrix A (samples × features);
B3. performing feature cleaning and normalization on the matrix obtained in step B2 via the preprocessing module;
B4. partitioning the data cleaned in step B3 into gene subsets V_i according to the number of nodes in the cluster and distributing them to the nodes;
B5. via the evaluation module, each node computing the quality SQ_i of its assigned gene subset using the subset quality evaluation algorithm;
B6. computing the number of features k_i to select from each subset according to each subset's quality and the total target feature number;
B7. according to k_i, executing the POCSS algorithm to select k_i features from each gene subset;
B8. and summarizing the selection results of all nodes, thereby obtaining the final k gene expressions most relevant to leukemia.
CN202110350013.1A 2021-03-31 2021-03-31 Distributed column subset selection method and system and leukemia gene information mining method Pending CN113077843A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110350013.1A CN113077843A (en) 2021-03-31 2021-03-31 Distributed column subset selection method and system and leukemia gene information mining method


Publications (1)

Publication Number Publication Date
CN113077843A true CN113077843A (en) 2021-07-06

Family

ID=76614115


Country Status (1)

Country Link
CN (1) CN113077843A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116073836A (en) * 2023-03-14 2023-05-05 中南大学 Game data compression method based on column subset selection



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210706