CN111259933B - High-dimensional characteristic data classification method and system based on distributed parallel decision tree - Google Patents

High-dimensional characteristic data classification method and system based on distributed parallel decision tree

Info

Publication number
CN111259933B
CN111259933B (application CN202010022431.3A)
Authority
CN
China
Prior art keywords
feature
node
characteristic
data
decision tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010022431.3A
Other languages
Chinese (zh)
Other versions
CN111259933A (en)
Inventor
孙莹
庄福振
敖翔
何清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN202010022431.3A
Publication of CN111259933A
Application granted
Publication of CN111259933B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G06F18/24323 Tree-organised classifiers (under G06F18/00 Pattern recognition; G06F18/24 Classification techniques)
    • G06F16/2462 Approximate or statistical queries (under G06F16/245 Query processing; G06F16/2458 Special types of queries, e.g. statistical, fuzzy or distributed queries)
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; distributed database system architectures therefor
    • G06F16/285 Clustering or classification (under G06F16/28 Databases characterised by their database models, e.g. relational or object models)
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a high-dimensional characteristic data classification method and system based on a distributed parallel decision tree. A Spark-based parallel decision tree algorithm for high-dimensional feature data is implemented. The algorithm has a high degree of parallelism and can process large-scale data sets: it computes in parallel not only across nodes of the same layer of the decision tree but also at the feature level, which raises the degree of parallelism on high-dimensional data and effectively reduces the processing time of high-dimensional features.

Description

High-dimensional characteristic data classification method and system based on distributed parallel decision tree
Technical Field
The invention relates to the field of decision tree classification, and in particular to a high-dimensional feature data classification method and system based on a distributed parallel decision tree.
Background
The decision tree classification algorithm is an instance-based inductive learning method that extracts a tree-shaped classification model from a set of unordered training samples. Each non-leaf node in the tree records which feature is used to make a category decision, and each leaf node represents a final category. The path from the root node to a leaf node forms a classification rule. When a new sample is to be classified, it is simply tested at each branch node starting from the root and passed recursively down the corresponding subtree until it reaches a leaf node; the category represented by that leaf node is the predicted category of the sample. Quinlan proposed the well-known ID3 algorithm in 1986 and, building on it, the C4.5 algorithm in 1993.
A decision tree is built by top-down recursive construction, and the result is a binary or multi-way tree whose input is a set of training data with class labels. An internal (non-leaf) node of a binary tree is typically represented as a logical predicate of the form a = b, where a is an attribute and b is one of its values; the edges of the tree are the branch outcomes of the logical decision. In a binary decision tree the internal nodes are attributes, the edges are divisions over the attribute values, and every division produces only a left child and a right child, which helps reduce overfitting. To classify with a decision tree, a tree model is first built and refined from a training set; this is in effect a process of acquiring knowledge from data, i.e. machine learning. The generated decision tree is then used to classify input data: the attribute values of an input record are tested in turn starting from the root node until some leaf node is reached, which gives the record's category.
To meet the need of processing large-scale data sets, several parallel decision tree schemes have been proposed. In some industrial settings, specially designed decision tree algorithms target the characteristics of the data, such as algorithms that build several decision trees in parallel in time order and then aggregate them, or parallel decision tree algorithms that handle real-time streaming input and output. These algorithms mainly adapt decision trees to specific business scenarios rather than propose a general parallel decision tree algorithm.
General parallel decision tree algorithms parallelize the data processing of the traditional tree-building procedure: the overall flow is the same as in the serial algorithm, and the parallelism lies in dividing large-scale samples onto nodes in parallel and selecting the optimal division of each node in parallel, as in the MapReduce-based parallel decision tree algorithm and the layered-strategy parallel decision tree algorithm. Because these algorithms compute serially over the individual features, their efficiency drops as the data dimension grows. The invention therefore designs a Spark-based parallel binary decision tree algorithm that can process high-dimensional features effectively.
The largest computational cost of constructing a decision tree is computing and selecting the optimal splitting attribute, because every field must be considered when a node chooses its division so as to find the division with the maximum information gain; common measurement criteria include information entropy and the Gini index. Concretely, all samples on a node are first counted to obtain the information needed to compute the information gain of each field; for information entropy this is the number of samples of each category under each feature value. The information gains of all candidate divisions are then compared and the optimal division is selected.
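For illustration, the information-entropy criterion above can be written out in a few lines of Python (a minimal sketch; the function names are ours, not taken from the patent):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, left, right):
    """Gain of dividing `labels` into the child label lists left/right."""
    n = len(labels)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - children

parent = ["yes"] * 8 + ["no"] * 5
left, right = ["yes"] * 6 + ["no"] * 1, ["yes"] * 2 + ["no"] * 4
print(information_gain(parent, left, right))   # about 0.22 bits
```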
Much distributed decision tree work targets large-scale data. The MapReduce-based parallel decision tree algorithm divides samples onto nodes while counting them, and then selects splitting attributes in parallel across same-layer nodes; however, it builds a multi-way tree that turns every value of the dividing attribute into a child node, so it cannot handle high-dimensional feature data and may overfit or exhaust memory. Similarly, computing the optimal split points of same-layer nodes in parallel improves performance, but such algorithms still decide the optimal attribute serially, so they too lack parallelism when processing high-dimensional feature data. The Spark-based parallel decision tree algorithm uses a binary tree, counts samples in parallel, directly computes the statistics of each node, and then, like the MapReduce-based algorithm, selects the optimal division in parallel across nodes; although it can process higher-dimensional data with a low memory footprint, the information gains of all divisions of all features are computed serially inside each node, which becomes hugely time-consuming when the data dimension is very high.
Based on the above analysis, serial in-memory classification decision trees cannot process massive data. Existing distributed approaches greatly enlarge the scale of data that can be processed, but most MapReduce-based algorithms have low parallel efficiency and cannot produce a global classification model. Node-parallel decision tree algorithms do produce a global model, but they pay little attention to the characteristics of the data and either cannot handle high-dimensional features or take a long time to process them.
Disclosure of Invention
The invention aims to solve the problem that existing parallel decision tree algorithms cannot process high-dimensional feature data efficiently, and provides a parallel decision tree algorithm that is parallel at both the node level and the feature level.
Specifically, the invention provides a high-dimensional characteristic data classification method based on a distributed parallel decision tree, which is characterized by comprising the following steps:
step 1, obtaining training data comprising a plurality of sample high-dimensional feature data, wherein the sample high-dimensional feature data has corresponding label types, storing the training data in a distributed file system, carrying out parallel sampling statistics on samples of the training data on a distributed cluster to obtain feature distribution information on the training data, obtaining metadata for supporting decision tree calculation, and preprocessing continuous features;
step 2, allocating feature groups to the computing nodes of the distributed cluster by sampling and computation over the metadata, establishing the root node of the tree, and obtaining the initial information entropy of the root node by jointly counting the label category distribution of the samples over all working nodes of the distributed cluster;
step 3, counting, on each working node of the distributed cluster, the sample data stored there for all the sample high-dimensional feature data; obtaining the current tree node of each sample from its feature vector and the division rules of the decision tree; simultaneously counting the occurrences of the quadruples (node to which the sample belongs, feature, feature value, label); grouping and aggregating the quadruples by (node, feature group) on each working node; storing the resulting <(node, feature group), (feature, feature value, label) counts> key-value pairs in a distributed manner; and obtaining the information entropy of each feature value from these statistics;
step 4, sorting the feature values by the information entropy of their label distributions, first attributing the statistics of all labels to the right child, then traversing the feature values in order as left-child feature values, keeping at each step the division with the largest information gain to obtain <(node, feature group), optimal division> key-value pairs; aggregating the optimal divisions of the feature groups of the same node to obtain <node, optimal division> key-value pairs; and dividing each node according to its selected optimal division;
step 5, repeating steps 2 to 4 until all nodes in the decision tree are divided, saving the current decision tree as a classification model, and inputting the data to be classified into the classification model to obtain the category corresponding to the data to be classified.
The high-dimensional characteristic data classification method based on the distributed parallel decision tree is characterized in that the training data are text data or image data.
The high-dimensional characteristic data classification method based on the distributed parallel decision tree is characterized in that the step 2 comprises the following steps:
sorting the features by their number of feature values to obtain a sequence (f_1, v_1), (f_2, v_2), ..., (f_n, v_n), where v_i is the number of feature values of feature f_i, and binary-searching the minimum cap K on the per-group total of feature values such that the features can be packed (e.g. by a greedy or dynamic-programming algorithm) into at most G groups whose totals do not exceed K; the G groups obtained for the minimum K are the optimal feature grouping.
The high-dimensional characteristic data classification method based on the distributed parallel decision tree is characterized in that the preprocessing in step 1 comprises: sampling the continuous features, collecting the feature values of the sampled samples at the master node, counting the number of samples for each feature value, sorting all feature values by value to obtain a sequence, grouping the samples according to the preset maximum number of feature divisions, using each group as one bucket of the continuous feature, and taking the midpoint between the boundary values of two adjacent groups as a candidate division.
The high-dimensional characteristic data classification method based on the distributed parallel decision tree is characterized in that the metadata are obtained by statistics over the training data and comprise the number of features, the number of samples, the number of labels, the maximum number of feature divisions, the value ranges of the discrete features, the unordered discrete features, the maximum depth, the minimum number of samples per node, and the minimum information gain for splitting.
The invention also provides a high-dimensional characteristic data classification system based on the distributed parallel decision tree, which is characterized by comprising:
the method comprises the steps of 1, acquiring training data comprising a plurality of sample high-dimensional feature data, wherein the sample high-dimensional feature data has corresponding label types, storing the training data in a distributed file system, carrying out parallel sampling statistics on samples of the training data on a distributed cluster to acquire feature distribution information on the training data, acquiring metadata for supporting decision tree calculation, and preprocessing continuous features;
the module 2 distributes feature groups for all computing nodes in the distributed cluster by sampling and calculating the metadata, establishes a root node of a tree, and obtains the initial information entropy of the root node by combining label category distribution of statistical samples of all working nodes of the distributed cluster;
module 3, which, for all the sample high-dimensional feature data, counts the sample data stored on each working node of the distributed cluster, obtains the current tree node of each sample from its feature vector and the division rules of the decision tree, simultaneously counts the occurrences of the quadruples (node to which the sample belongs, feature, feature value, label), groups and aggregates the quadruples by (node, feature group) on each working node, stores the resulting <(node, feature group), (feature, feature value, label) counts> key-value pairs in a distributed manner, and obtains the information entropy of each feature value from these statistics;
module 4, which sorts the feature values by the information entropy of their label distributions, first attributes the statistics of all labels to the right child, then traverses the feature values in order as left-child feature values, keeping at each step the division with the largest information gain to obtain <(node, feature group), optimal division> key-value pairs, aggregates the optimal divisions of the feature groups of the same node to obtain <node, optimal division> key-value pairs, and divides each node according to its selected optimal division;
module 5, which repeats the processing of modules 2 to 4 until all nodes in the decision tree are divided, saves the current decision tree as a classification model, and inputs the data to be classified into the classification model to obtain the category corresponding to the data to be classified.
The high-dimensional characteristic data classification system based on the distributed parallel decision tree is characterized in that the training data are text data or image data.
The high-dimensional characteristic data classification system based on the distributed parallel decision tree is characterized in that the module 2 comprises:
sorting the features by their number of feature values to obtain a sequence (f_1, v_1), (f_2, v_2), ..., (f_n, v_n), where v_i is the number of feature values of feature f_i, and binary-searching the minimum cap K on the per-group total of feature values such that the features can be packed (e.g. by a greedy or dynamic-programming algorithm) into at most G groups whose totals do not exceed K; the G groups obtained for the minimum K are the optimal feature grouping.
The high-dimensional characteristic data classification system based on the distributed parallel decision tree is characterized in that the preprocessing in module 1 comprises: sampling the continuous features, collecting the feature values of the sampled samples at the master node, counting the number of samples for each feature value, sorting all feature values by value to obtain a sequence, grouping the samples according to the preset maximum number of feature divisions, using each group as one bucket of the continuous feature, and taking the midpoint between the boundary values of two adjacent groups as a candidate division.
The high-dimensional characteristic data classification system based on the distributed parallel decision tree is characterized in that the metadata are obtained by statistics over the training data and comprise the number of features, the number of samples, the number of labels, the maximum number of feature divisions, the value ranges of the discrete features, the unordered discrete features, the maximum depth, the minimum number of samples per node, and the minimum information gain for splitting.
Compared with the prior art, the invention has the following advantages:
compared with the prior art, the method improves the parallel efficiency of the decision tree model and is more efficient in processing large-scale high-dimensional data. The time for traversing all the division modes during optimal division selection can be effectively reduced. The method can better utilize the large-scale clusters under the condition of richer current computing resources, and is very suitable for the mainstream method of combining the features and then establishing the tree classification model in the current industry. The inventor performs experiments on 10000-dimensional real data on smaller-scale clusters, and compared with the most popular Spark parallel decision tree algorithm, the method can shorten the model building time by more than 30%, and can achieve better effects under the conditions of larger cluster specification and higher data dimension.
Drawings
FIG. 1 is a flow chart of decision tree construction in accordance with the present invention;
FIG. 2 is a schematic diagram of data transformation in Spark for the decision tree algorithm of the present invention.
Detailed Description
While doing large-scale data mining research, the inventors encountered data of very high dimensionality that existing decision tree algorithms handle poorly. Serial decision trees cannot process large-scale data at all, and existing parallel decision tree algorithms have a low degree of parallelism: even the fastest are parallel only at the node level, not in the optimal-feature-selection part. When the feature dimension is large and features take many values, a multi-way decision tree produces too many nodes, causing excessive memory use and overfitting, so a binary decision tree must be used; but a binary tree has to traverse all possible node divisions, compute the information gain of each, and pick the best, which increases the time consumed. Existing parallel decision tree algorithms do not account for this, because naturally occurring data rarely has an extremely large feature dimension. Industry, however, often generates new features by combining multi-dimensional features, which makes the final dimension grow exponentially; in that situation conventional parallel decision trees cannot efficiently screen the high-dimensional features and build the best classification model.

The inventors found that this defect can be remedied by parallelism at the feature level. Feature-level parallelism requires each Spark node to compute the optimal divisions of some features, and although this greatly reduces time in theory, the work cannot simply be scattered over the Spark nodes: the data-transfer time of shuffles between nodes and the waiting caused by unbalanced data distribution must be taken into account. Starting from the existing node-parallel decision tree, the inventors therefore studied the complexity of the feature division part and optimized the optimal feature selection. Observing that the complexity of feature division is determined by the total number of feature values, they designed a parallel decision tree algorithm that groups features evenly by their number of feature values and, from the sample division step onward, processes features in parallel as well as nodes, so as to increase the degree of parallelism while balancing Spark's other overheads.
The invention designs and implements a Spark-based parallel decision tree algorithm for high-dimensional feature data. The algorithm has a high degree of parallelism and can process large-scale data sets: it computes in parallel not only across same-layer nodes of the decision tree but also at the feature level, which raises the degree of parallelism on high-dimensional data and effectively reduces the time needed to process high-dimensional features.
The invention comprises the following key points:
and the key point 1 is designed to realize a parallel binary decision tree algorithm parallel in the characteristic dimension, so that the processing efficiency of the parallel decision tree algorithm on high-dimensional data is improved.
And 2, characteristic grouping parallelism and group size adjustment are realized according to cluster conditions, so that data transmission consumption and node operation time consumption are balanced, clusters are effectively utilized to the greatest extent, and parallel efficiency is improved.
And the key points 3 are used for balancing the operation amount of each node by grouping the characteristics according to the characteristic values, so that the idle time of the nodes in the cluster can be further reduced, and the operation time is reduced.
In order to make the above features and effects of the present invention more clearly understood, the following specific examples are given with reference to the accompanying drawings.
The decision tree algorithm is implemented in Spark as shown in FIG. 1. The basic flow is: first count the data to obtain information such as the feature distribution; then put all samples into the root node of the decision tree; compute statistics such as sample features and labels in parallel; compute the information gain of the candidate feature divisions from these statistics; select the optimal division; and divide the nodes. In this parallel decision tree construction the degree of parallelism is raised further: optimal division selection runs in parallel over all nodes of each layer of the tree, and by grouping the features the information gain computation is also parallel at the feature level. The data transformations in Spark are shown in FIG. 2; a serial sketch of the layer-wise flow follows, and the algorithm proceeds in the following steps:
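As a concrete reference for this layer-wise flow, the following self-contained Python sketch builds a tree in the same three repeated stages (node statistics, optimal division, node division). It is our own simplified serial version, not the patent's implementation: stage 1 runs distributed and partition-wise in the real algorithm, stage 2 runs in parallel over nodes and feature groups, and divisions here test a single feature value for equality rather than the value subsets used for discrete features (sketched under "Optimal division calculation" below).

```python
import math
from collections import Counter, defaultdict

def entropy(counts):
    """Shannon entropy (base 2) of a Counter of label counts."""
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values() if c)

def train_layerwise(samples, labels, min_gain=1e-9, max_depth=5):
    """samples: list of {feature: value} dicts; labels: list of classes."""
    tree = {0: {"depth": 0, "split": None, "label": None}}
    assign = [0] * len(samples)          # sample index -> node id
    frontier, next_id = [0], 1
    while frontier:
        # Stage 1: count (node, feature, value, label) quadruples and
        # per-node label totals in one pass over the data.
        quad, per_node = defaultdict(Counter), defaultdict(Counter)
        for nid, x, y in zip(assign, samples, labels):
            if nid in frontier:
                per_node[nid][y] += 1
                for f, v in x.items():
                    quad[(nid, f, v)][y] += 1
        # Stage 2: best single-value division per frontier node.
        new_frontier = []
        for nid in frontier:
            parent = per_node[nid]
            n, h0 = sum(parent.values()), entropy(parent)
            tree[nid]["label"] = parent.most_common(1)[0][0]  # leaf fallback
            best = None
            for (qn, f, v), left in quad.items():
                if qn != nid or sum(left.values()) == n:
                    continue                 # division must be proper
                right = parent - left
                h = (sum(left.values()) * entropy(left)
                     + sum(right.values()) * entropy(right)) / n
                if best is None or h0 - h > best[0]:
                    best = (h0 - h, f, v)
            if best and best[0] > min_gain and tree[nid]["depth"] < max_depth:
                tree[nid]["split"] = (best[1], best[2], next_id, next_id + 1)
                for cid in (next_id, next_id + 1):
                    tree[cid] = {"depth": tree[nid]["depth"] + 1,
                                 "split": None, "label": None}
                    new_frontier.append(cid)
                next_id += 2
        # Stage 3: move the samples of freshly divided nodes down.
        for i, x in enumerate(samples):
            sp = tree[assign[i]]["split"]
            if sp:
                f, v, lid, rid = sp
                assign[i] = lid if x[f] == v else rid
        frontier = new_frontier
    return tree

# e.g. train_layerwise([{"navel": "flat"}, {"navel": "sunken"}], ["no", "yes"])
```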
metadata (statistical information needed to be used in the calculation process, super parameters set by a user, and the like are stored in the metadata) is established. The metadata is obtained by statistics of input samples (the input samples comprise all training data, each sample comprises a sample number, all characteristics of the sample and sample category labels), and the metadata comprises a characteristic number, a sample number, a label number, a maximum characteristic division, a discrete characteristic value range (the number of the values of each characteristic), unordered discrete characteristics, a maximum depth, a node minimum sample number, split minimum information gain and the like.
Continuous feature preprocessing. The main idea of the invention is to group the features by their number of possible divisions (which determines the amount of computation, described in detail in the next section) so as to achieve parallel processing under high-dimensional features. Continuous features have too many possible divisions for the computation to be bounded in advance, so the invention preprocesses them and fixes their candidate divisions before the tree is built. For each continuous feature (a feature taking continuous values, e.g. the height of a chair may be 1 m, 1.2 m, ...), candidate divisions are selected first, in correspondence with discrete features (e.g. the colour of a chair may be yellow, red, ...). Specifically, the samples are first sampled, the feature values of the sampled samples (the values of all sampled samples on the feature to be preprocessed) are collected at the master node (driver), the number of samples per feature value is counted, and all feature values are sorted by value to obtain a sequence (a_1, count_1), (a_2, count_2), ..., (a_n, count_n). According to the initially chosen maximum number of feature divisions, the value range is cut into several groups with sample counts as equal as possible (e.g. if two divisions are required, the values may fall into the three sections 0-0.2, 0.3-0.4 and 0.7-1, with roughly equal numbers of samples in each); each group serves as one bucket (bin) of the continuous feature, and the midpoint between the boundary values of two adjacent groups is taken as a candidate division (here the two candidates 0.25 and 0.55).
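The bucketing just described can be sketched as follows (our own function name; the sample values are chosen to reproduce the 0.25 / 0.55 example above):

```python
from collections import Counter

def candidate_splits(values, max_bins):
    """Candidate split thresholds for one continuous feature, from a
    sample of its values: sort the distinct values with their counts,
    cut them into at most max_bins groups of roughly equal sample
    count, and take the midpoint between adjacent groups as a split."""
    pairs = sorted(Counter(values).items())      # (value, count), ascending
    total = sum(c for _, c in pairs)
    target = total / max_bins                    # desired samples per bucket
    splits, acc, last = [], 0, None
    for v, c in pairs:
        if last is not None and acc >= target * (len(splits) + 1):
            splits.append((last + v) / 2)        # midpoint between buckets
        acc += c
        last = v
    return splits[:max_bins - 1]

# Values spread over 0-0.2, 0.3-0.4 and 0.7-1 yield the two candidate
# divisions 0.25 and 0.55 from the example above.
print(candidate_splits([0.1, 0.15, 0.2, 0.3, 0.35, 0.4, 0.7, 0.8, 1.0], 3))
```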
Balanced feature grouping. To distribute the computation of optimal feature divisions over the working nodes (workers), the features must be divided into several groups, each worker computing the optimal divisions of one group. The time of this step is determined by the worker that works longest, and the complexity of the optimal-division computation is O(Σ_{i∈F} v_i), where F is the set of features in the group and v_i is the number of feature values (or bins) of feature i. To minimize the working time, the maximum per-group total of feature values must therefore be minimized. The features are sorted by their number of feature values to obtain a sequence (f_1, v_1), (f_2, v_2), ..., (f_n, v_n), where v denotes the number of feature values of each feature: for a discrete feature, the number of its possible values; for a continuous feature, the number of its candidate divisions. Given the preset number of groups G, a minimum value K is binary-searched such that the features can be packed into at most G groups whose per-group totals of feature values (i.e. of possible divisions) do not exceed K; for a fixed K, a concrete packing can be found with a greedy algorithm. The grouping corresponding to the minimum K is the optimal feature grouping. Under this grouping, the maximum computation load carried by any working node in the cluster is minimized and the load is balanced, which minimizes the time that workers which have finished their own tasks spend waiting for the others during parallel processing.
Node information statistics. Each sample is routed to its tree node according to its feature vector and the division rules of the decision tree built so far, and the feature-value and label distribution of every node is computed over all samples; concretely, the occurrences of the quadruples (node to which the sample belongs, feature, feature value, label) are counted. To reduce transmission cost, statistics are first computed within each partition of the parallel tasks and then grouped and aggregated by (node, feature group), yielding <(node, feature group), (feature, feature value, label) counts> key-value pairs on each working node.
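With the PySpark RDD API, this statistics stage might look as follows (a sketch under our own names, assuming a local PySpark installation: `node_of` stands for the current division rules and `group_of` for the feature grouping; `reduceByKey` combines inside each partition before shuffling, matching the per-partition pre-aggregation described above):

```python
from pyspark import SparkContext

def quad_statistics(samples, node_of, group_of):
    """samples: RDD of (feature_dict, label). Returns an RDD keyed by
    (node, feature group) holding ((feature, value, label), count) pairs."""
    quads = samples.flatMap(
        lambda s: [(((node_of(s[0]), group_of(f)), (f, v, s[1])), 1)
                   for f, v in s[0].items()])
    counted = quads.reduceByKey(lambda a, b: a + b)   # per-partition combine
    return (counted
            .map(lambda kv: (kv[0][0], (kv[0][1], kv[1])))
            .groupByKey())                            # gather per (node, group)

sc = SparkContext("local[*]", "quad-stats")
rdd = sc.parallelize([({"color": "green", "root": "curled"}, "yes"),
                      ({"color": "pale", "root": "stiff"}, "no")])
stats = quad_statistics(rdd, node_of=lambda x: 0, group_of=lambda f: 0)
print({k: sorted(v) for k, v in stats.collect()})
sc.stop()
```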
Optimal division calculation. For each (node, feature group), the optimal division of every feature is computed. For a continuous feature, all candidate divisions are traversed in order: the statistics of all labels are first attributed to the right child, the label counts belonging to the current division are moved into the left child at each step, and the information gain is computed from the label counts of the two children. For a discrete feature, the feature values are first sorted by the information entropy of their label distributions and then traversed in order as left-child feature values. At every traversal step the division with the largest information gain is kept, yielding <(node, feature group), optimal division> key-value pairs. The optimal divisions of the feature groups of the same node are then aggregated to obtain <node, optimal division> key-value pairs.
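For discrete features, the entropy-ordering scan just described can be sketched as follows (a minimal sketch with our own names; `value_stats` stands for the per-value label counts gathered in the statistics step). Sorting the values by their own label entropy reduces the search over value subsets to a linear scan over prefixes, which is what shortens the traversal of division modes:

```python
import math
from collections import Counter

def entropy(counts):
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values() if c)

def best_discrete_split(value_stats):
    """value_stats: {feature_value: Counter(label -> count)} for one
    discrete feature at one node. Sorts the values by the entropy of
    their label distribution, then scans growing prefixes as the left
    child; returns (best information gain, left value set)."""
    parent = Counter()
    for c in value_stats.values():
        parent.update(c)
    n, h_parent = sum(parent.values()), entropy(parent)
    ordered = sorted(value_stats.items(), key=lambda kv: entropy(kv[1]))
    left, best_gain, best_set = Counter(), -1.0, None
    for i, (v, c) in enumerate(ordered[:-1]):    # right child stays nonempty
        left.update(c)
        right = parent - left
        h = (sum(left.values()) * entropy(left)
             + sum(right.values()) * entropy(right)) / n
        if h_parent - h > best_gain:
            best_gain = h_parent - h
            best_set = {val for val, _ in ordered[:i + 1]}
    return best_gain, best_set

stats = {"a": Counter(yes=4), "b": Counter(no=3), "c": Counter(yes=1, no=1)}
print(best_discrete_split(stats))   # -> (0.59..., {'a'})
```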
Node division. Each node is divided into a left child and a right child according to its optimal division, the division information is written into the node, and the node division of the current layer ends.
Taking the watermelon dataset as an example:
numbering device Color Root base Knock sound Texture and method for producing the same Umbilical region Tactile sensation Good melon
1 Bluish green Crimping and shrinking Turbid sound Clear and clear Recess in the bottom of the container Hard slide Is that
2 Bluish green Stiffening member Crisp and clean Clear and clear Flat and flat Soft adhesive Whether or not
3 Black black Crimping and shrinking Turbid sound Clear and clear Recess in the bottom of the container Hard slide Is that
4 Light white Crimping and shrinking Turbid sound Blurring Flat and flat Soft adhesive Whether or not
5 Light white Crimping and shrinking Turbid sound Clear and clear Recess in the bottom of the container Hard slide Is that
6 Light white Slightly spiral wound Clunk and clunk with feeling of fullness Slightly paste Recess in the bottom of the container Hard slide Whether or not
7 Black black Slightly spiral wound Turbid sound Slightly paste Slightly concave Soft adhesive Is that
8 Light white Crimping and shrinking Turbid sound Blurring Flat and flat Hard slide Whether or not
9 Black black Slightly spiral wound Clunk and clunk with feeling of fullness Slightly paste Slightly concave Hard slide Whether or not
10 Black black Crimping and shrinking Clunk and clunk with feeling of fullness Clear and clear Recess in the bottom of the container Hard slide Is that
11 Light white Stiffening member Crisp and clean Blurring Flat and flat Hard slide Whether or not
12 Bluish green Crimping and shrinking Clunk and clunk with feeling of fullness Clear and clear Recess in the bottom of the container Hard slide Is that
13 Bluish green Slightly spiral wound Turbid sound Slightly paste Recess in the bottom of the container Hard slide Whether or not
14 Bluish green Slightly spiral wound Turbid sound Clear and clear Slightly concave Soft adhesive Is that
15 Black black Slightly spiral wound Turbid sound Clear and clear Slightly concave Soft adhesive Whether or not
16 Black black Slightly spiral wound Turbid sound Clear and clear Slightly concave Hard slide Is that
17 Bluish green Crimping and shrinking Clunk and clunk with feeling of fullness Slightly paste Slightly concave Hard slide Whether or not
Feature grouping

The features are divided into three groups. The counted numbers of feature values are:

color 3, root 3, knock 3, texture 3, navel 3, touch 2

Balancing by the number of feature values gives the groups (1: {color, root}, 2: {knock, texture}, 3: {navel, touch}), with per-group totals 6, 6 and 5, as shown below.
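This is exactly the grouping that the `group_features` sketch from the balanced-grouping section produces for these value counts:

```python
arity = {"color": 3, "root": 3, "knock": 3, "texture": 3, "navel": 3, "touch": 2}
print(group_features(arity, 3))
# -> [['color', 'root'], ['knock', 'texture'], ['navel', 'touch']]  (cap K = 6)
```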
Sample division statistics

The node to which each sample belongs is determined according to the existing rules of the decision tree.
Since no division has been made yet, all 17 samples are assigned to node 0 (the root); the table above simply gains a node column whose value is 0 for every sample.
The sample information of each partition is counted and then merged across partitions (the per-partition count tables are omitted here); merging yields the per-feature-value label counts below.
Division merging

First group:

| color | yes | no |
| dark | 4 | 2 |
| green | 3 | 3 |
| pale | 1 | 4 |

| root | yes | no |
| curled | 5 | 3 |
| slightly curled | 3 | 4 |
| stiff | 0 | 2 |

Optimal division: whether the root is stiff, entropy 0.89.

Second group:

| knock | yes | no |
| muffled | 2 | 3 |
| dull | 6 | 4 |
| crisp | 0 | 2 |

| texture | yes | no |
| clear | 7 | 2 |
| slightly blurry | 1 | 4 |
| blurry | 0 | 3 |

Optimal division: whether the texture is blurry, entropy 0.83.

Third group:

| navel | yes | no |
| sunken | 5 | 2 |
| slightly sunken | 3 | 3 |
| flat | 0 | 4 |

| touch | yes | no |
| hard-smooth | 6 | 6 |
| soft-sticky | 2 | 3 |

Optimal division: whether the navel is flat, entropy 0.76.
Node division: node 0 is divided into a left child and a right child, holding the samples with a flat navel and with a non-flat navel respectively.
The above process is repeated until all the divisions are completed.
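Classifying a new sample with the finished tree is then the root-to-leaf walk described in the background section; a minimal sketch (our own `Node` structure, with leaves carrying the majority label of their training samples):

```python
class Node:
    def __init__(self, label=None):
        self.split = None          # (feature, set of left-child values)
        self.left = self.right = None
        self.label = label         # majority class, used at leaves

def predict(node, sample):
    """Follow the branch whose value set contains the sample's feature
    value until a leaf is reached."""
    while node.split is not None:
        feature, left_values = node.split
        node = node.left if sample[feature] in left_values else node.right
    return node.label

# The tree after the first division above: node 0 splits on whether the
# navel is flat (flat -> left child, majority label "no", i.e. not a
# good melon; non-flat -> right child, majority label "yes").
root = Node()
root.split = ("navel", {"flat"})
root.left, root.right = Node("no"), Node("yes")
print(predict(root, {"navel": "flat"}))    # -> no
print(predict(root, {"navel": "sunken"}))  # -> yes
```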
The following is a system embodiment corresponding to the above method embodiment, and the two may be implemented in cooperation with each other. The related technical details mentioned in the above embodiment remain valid here and, to reduce repetition, are not repeated; conversely, the related technical details mentioned in this embodiment can also be applied to the above embodiment.
The invention also provides a high-dimensional characteristic data classification system based on the distributed parallel decision tree, which is characterized by comprising the following steps:
the method comprises the steps of 1, acquiring training data comprising a plurality of sample high-dimensional feature data, wherein the sample high-dimensional feature data has corresponding label types, storing the training data in a distributed file system, carrying out parallel sampling statistics on samples of the training data on a distributed cluster to acquire feature distribution information on the training data, acquiring metadata for supporting decision tree calculation, and preprocessing continuous features;
the module 2 distributes feature groups for all computing nodes in the distributed cluster by sampling and calculating the metadata, establishes a root node of a tree, and obtains the initial information entropy of the root node by combining label category distribution of statistical samples of all working nodes of the distributed cluster;
module 3, which, for all the sample high-dimensional feature data, counts the sample data stored on each working node of the distributed cluster, obtains the current tree node of each sample from its feature vector and the division rules of the decision tree, simultaneously counts the occurrences of the quadruples (node to which the sample belongs, feature, feature value, label), groups and aggregates the quadruples by (node, feature group) on each working node, stores the resulting <(node, feature group), (feature, feature value, label) counts> key-value pairs in a distributed manner, and obtains the information entropy of each feature value from these statistics;
module 4, which sorts the feature values by the information entropy of their label distributions, first attributes the statistics of all labels to the right child, then traverses the feature values in order as left-child feature values, keeping at each step the division with the largest information gain to obtain <(node, feature group), optimal division> key-value pairs, aggregates the optimal divisions of the feature groups of the same node to obtain <node, optimal division> key-value pairs, and divides each node according to its selected optimal division;
module 5, which repeats the processing of modules 2 to 4 until all nodes in the decision tree are divided, saves the current decision tree as a classification model, and inputs the data to be classified into the classification model to obtain the category corresponding to the data to be classified.
The high-dimensional characteristic data classification system based on the distributed parallel decision tree is characterized in that the training data are text data or image data.
The high-dimensional characteristic data classification system based on the distributed parallel decision tree is characterized in that the module 2 comprises:
sorting the features by their number of feature values to obtain a sequence (f_1, v_1), (f_2, v_2), ..., (f_n, v_n), where v_i is the number of feature values of feature f_i, and binary-searching the minimum cap K on the per-group total of feature values such that the features can be packed (e.g. by a greedy or dynamic-programming algorithm) into at most G groups whose totals do not exceed K; the G groups obtained for the minimum K are the optimal feature grouping.
The high-dimensional characteristic data classification system based on the distributed parallel decision tree is characterized in that the preprocessing in module 1 comprises: sampling the continuous features, collecting the feature values of the sampled samples at the master node, counting the number of samples for each feature value, sorting all feature values by value to obtain a sequence, grouping the samples according to the preset maximum number of feature divisions, using each group as one bucket of the continuous feature, and taking the midpoint between the boundary values of two adjacent groups as a candidate division.
The high-dimensional characteristic data classification system based on the distributed parallel decision tree is characterized in that the metadata are obtained by statistics over the training data and comprise the number of features, the number of samples, the number of labels, the maximum number of feature divisions, the value ranges of the discrete features, the unordered discrete features, the maximum depth, the minimum number of samples per node, and the minimum information gain for splitting.

Claims (10)

1. The high-dimensional characteristic data classification method based on the distributed parallel decision tree is characterized by comprising the following steps of:
step 1, obtaining training data comprising a plurality of sample high-dimensional feature data, wherein the sample high-dimensional feature data has corresponding label types, storing the training data in a distributed file system, carrying out parallel sampling statistics on samples of the training data on a distributed cluster to obtain feature distribution information on the training data, obtaining metadata for supporting decision tree calculation, and preprocessing continuous features;
step 2, allocating feature groups to the computing nodes of the distributed cluster by sampling and computation over the metadata, establishing the root node of the tree, and obtaining the initial information entropy of the root node by jointly counting the label category distribution of the samples over all working nodes of the distributed cluster;
step 3, counting, on each working node of the distributed cluster, the sample data stored there for all the sample high-dimensional feature data; obtaining the current tree node of each sample from its feature vector and the division rules of the decision tree; simultaneously counting the occurrences of the quadruples (node to which the sample belongs, feature, feature value, label); grouping and aggregating the quadruples by (node, feature group) on each working node; storing the resulting <(node, feature group), (feature, feature value, label) counts> key-value pairs in a distributed manner; and obtaining the information entropy of each feature value from these statistics;
step 4, sorting the feature values by the information entropy of their label distributions, first attributing the statistics of all labels to the right child, then traversing the feature values in order as left-child feature values, keeping at each step the division with the largest information gain to obtain <(node, feature group), optimal division> key-value pairs; aggregating the optimal divisions of the feature groups of the same node to obtain <node, optimal division> key-value pairs; and dividing each node according to its selected optimal division;
step 5, repeating steps 2 to 4 until all nodes in the decision tree are divided, saving the current decision tree as a classification model, and inputting the data to be classified into the classification model to obtain the category corresponding to the data to be classified.
2. The method of claim 1, wherein the training data is text data or image data.
3. The method for classifying high-dimensional feature data based on distributed parallel decision trees according to claim 1, wherein the step 2 comprises:
sorting the features by their number of feature values to obtain a sequence (f_1, v_1), (f_2, v_2), ..., (f_n, v_n), where v_i is the number of feature values of feature f_i, and binary-searching the minimum cap K on the per-group total of feature values such that the features can be packed (e.g. by a greedy or dynamic-programming algorithm) into at most G groups whose totals do not exceed K; the G groups obtained for the minimum K are the optimal feature grouping.
4. The method for classifying high-dimensional feature data based on a distributed parallel decision tree according to claim 1, wherein the preprocessing in step 1 comprises: sampling the continuous features, collecting the feature values of the sampled samples at the master node, counting the number of samples for each feature value, sorting all feature values by value to obtain a sequence, grouping the samples according to the preset maximum number of feature divisions, using each group as one bucket of the continuous feature, and taking the midpoint between the boundary values of two adjacent groups as a candidate division.
5. The method for classifying high-dimensional feature data based on a distributed parallel decision tree according to claim 1, wherein the metadata are obtained by statistics over the training data and comprise the number of features, the number of samples, the number of labels, the maximum number of feature divisions, the value ranges of the discrete features, the unordered discrete features, the maximum depth, the minimum number of samples per node, and the minimum information gain for splitting.
6. A distributed parallel decision tree-based high-dimensional feature data classification system, comprising:
module 1, which obtains training data comprising a plurality of samples of high-dimensional feature data, each with a corresponding label category, stores the training data in a distributed file system, performs parallel sampling statistics over the training data samples on a distributed cluster to obtain the feature distribution information of the training data, obtains metadata supporting the decision tree computation, and preprocesses the continuous features;
the module 2 distributes feature groups for all computing nodes in the distributed cluster by sampling and calculating the metadata, establishes a root node of a tree, and obtains the initial information entropy of the root node by combining label category distribution of statistical samples of all working nodes of the distributed cluster;
module 3, which, for all the sample high-dimensional feature data, counts the sample data stored on each working node of the distributed cluster, obtains the current tree node of each sample from its feature vector and the division rules of the decision tree, simultaneously counts the occurrences of the quadruples (node to which the sample belongs, feature, feature value, label), groups and aggregates the quadruples by (node, feature group) on each working node, stores the resulting <(node, feature group), (feature, feature value, label) counts> key-value pairs in a distributed manner, and obtains the information entropy of each feature value from these statistics;
module 4, which sorts the feature values by the information entropy of their label distributions, first attributes the statistics of all labels to the right child, then traverses the feature values in order as left-child feature values, keeping at each step the division with the largest information gain to obtain <(node, feature group), optimal division> key-value pairs, aggregates the optimal divisions of the feature groups of the same node to obtain <node, optimal division> key-value pairs, and divides each node according to its selected optimal division;
module 5, which repeats the processing of modules 2 to 4 until all nodes in the decision tree are divided, saves the current decision tree as a classification model, and inputs the data to be classified into the classification model to obtain the category corresponding to the data to be classified.
7. The distributed parallel decision tree based high-dimensional feature data classification system of claim 6, wherein the training data is text data or image data.
8. The distributed parallel decision tree based high-dimensional feature data classification system of claim 6, wherein the module 2 comprises:
sorting the features by their number of feature values to obtain a sequence (f_1, v_1), (f_2, v_2), ..., (f_n, v_n), where v_i is the number of feature values of feature f_i, and binary-searching the minimum cap K on the per-group total of feature values such that the features can be packed (e.g. by a greedy or dynamic-programming algorithm) into at most G groups whose totals do not exceed K; the G groups obtained for the minimum K are the optimal feature grouping.
9. The distributed parallel decision tree based high-dimensional feature data classification system of claim 6, wherein the preprocessing in module 1 comprises: sampling the continuous features, collecting the feature values of the sampled samples at the master node, counting the number of samples for each feature value, sorting all feature values by value to obtain a sequence, grouping the samples according to the preset maximum number of feature divisions, using each group as one bucket of the continuous feature, and taking the midpoint between the boundary values of two adjacent groups as a candidate division.
10. The distributed parallel decision tree based high-dimensional feature data classification system of claim 6, wherein the metadata are obtained by statistics over the training data and comprise the number of features, the number of samples, the number of labels, the maximum number of feature divisions, the value ranges of the discrete features, the unordered discrete features, the maximum depth, the minimum number of samples per node, and the minimum information gain for splitting.
CN202010022431.3A 2020-01-09 2020-01-09 High-dimensional characteristic data classification method and system based on distributed parallel decision tree Active CN111259933B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010022431.3A CN111259933B (en) 2020-01-09 2020-01-09 High-dimensional characteristic data classification method and system based on distributed parallel decision tree

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010022431.3A CN111259933B (en) 2020-01-09 2020-01-09 High-dimensional characteristic data classification method and system based on distributed parallel decision tree

Publications (2)

Publication Number Publication Date
CN111259933A CN111259933A (en) 2020-06-09
CN111259933B true CN111259933B (en) 2023-06-13

Family

ID=70950331

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010022431.3A Active CN111259933B (en) 2020-01-09 2020-01-09 High-dimensional characteristic data classification method and system based on distributed parallel decision tree

Country Status (1)

Country Link
CN (1) CN111259933B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113822432B * 2021-04-06 2024-02-06 JD Technology Holding Co., Ltd. Sample data processing method and device, electronic equipment and storage medium
CN113268505B * 2021-04-29 2021-11-30 Guangdong Ocean University Offline batch processing method and system for multi-source multi-mode ocean big data
CN114638309B * 2022-03-21 2024-04-09 Beijing Zuojiang Technology Co., Ltd. Information entropy-based HyperCuts decision tree policy set preprocessing method
CN115188381B * 2022-05-17 2023-10-24 Beike Zhaofang (Beijing) Technology Co., Ltd. Voice recognition result optimization method and device based on click ordering


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102214213A (en) * 2011-05-31 2011-10-12 中国科学院计算技术研究所 Method and system for classifying data by adopting decision tree
EP3133511A1 (en) * 2015-08-19 2017-02-22 Palantir Technologies, Inc. Systems and methods for automatic clustering and canonical designation of related data in various data structures
CN106056134A (en) * 2016-05-20 2016-10-26 重庆大学 Semi-supervised random forests classification method based on Spark
CN106228389A (en) * 2016-07-14 2016-12-14 武汉斗鱼网络科技有限公司 Network potential usage mining method and system based on random forests algorithm
WO2018014610A1 (en) * 2016-07-20 2018-01-25 武汉斗鱼网络科技有限公司 C4.5 decision tree algorithm-based specific user mining system and method therefor
CN106648654A (en) * 2016-12-20 2017-05-10 深圳先进技术研究院 Data sensing-based Spark configuration parameter automatic optimization method
CN107193967A (en) * 2017-05-25 2017-09-22 南开大学 A kind of multi-source heterogeneous industry field big data handles full link solution
CN108491226A (en) * 2018-02-05 2018-09-04 西安电子科技大学 Spark based on cluster scaling configures parameter automated tuning method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Wu Jinling; Zhang Haidong; Li Zhe; Shi Wei; Tian Xiaojun. Research on appearance quality detection of sunflower seeds based on computer vision. Hubei Agricultural Sciences, 2019, No. 23. *
Wang Rong. Attribute reduction algorithms for neighborhood rough sets and their application in classifiers. China Masters' Theses Full-text Database, Information Science and Technology, 2018. *
Zhao Jingxian; Ni Chunpeng; Zhan Yuanrui; Du Ziping. A combinatorial optimization decision tree algorithm for large-scale databases. Systems Engineering and Electronics, 2009, No. 3. *

Also Published As

Publication number Publication date
CN111259933A (en) 2020-06-09

Similar Documents

Publication Publication Date Title
CN111259933B (en) High-dimensional characteristic data classification method and system based on distributed parallel decision tree
CN102663100B (en) Two-stage hybrid particle swarm optimization clustering method
CN103218435B (en) Method and system for clustering Chinese text data
CN107291847A (en) A kind of large-scale data Distributed Cluster processing method based on MapReduce
CN106845536B (en) Parallel clustering method based on image scaling
CN107832456B (en) Parallel KNN text classification method based on critical value data division
Fu et al. An experimental evaluation of large scale GBDT systems
CN110737805B (en) Method and device for processing graph model data and terminal equipment
CN106228554A (en) Fuzzy coarse central coal dust image partition methods based on many attribute reductions
CN105808582A (en) Parallel generation method and device of decision tree on the basis of layered strategy
CN108764307A (en) The density peaks clustering method of natural arest neighbors optimization
CN106980639B (en) Short text data aggregation system and method
CN113010597A (en) Parallel association rule mining method for ocean big data
CN102141988A (en) Method, system and device for clustering data in data mining system
Zheng et al. k-dominant Skyline query algorithm for dynamic datasets
CN104794215A (en) Fast recursive clustering method suitable for large-scale data
CN108090514B (en) Infrared image identification method based on two-stage density clustering
CN111127184A (en) Distributed combined credit evaluation method
CN110609832A (en) Non-repeated sampling method for streaming data
Chen et al. Optimization Simulation of Big Data Analysis Model Based on K-means Algorithm
CN115688034B (en) Method for extracting and reducing mixed data of numerical value type and category type
CN115858629B (en) KNN query method based on learning index
CN109543711A (en) A kind of decision tree generation method based on ID3 algorithm
CN112948712B (en) Stackable community discovery method
CN116595102B (en) Big data management method and system for improving clustering algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant