CN111259933B - High-dimensional characteristic data classification method and system based on distributed parallel decision tree - Google Patents

High-dimensional characteristic data classification method and system based on distributed parallel decision tree

Info

Publication number
CN111259933B
CN111259933B (application CN202010022431.3A)
Authority
CN
China
Prior art keywords
feature
node
characteristic
data
decision tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010022431.3A
Other languages
Chinese (zh)
Other versions
CN111259933A (en)
Inventor
孙莹
庄福振
敖翔
何清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN202010022431.3A
Publication of CN111259933A
Application granted
Publication of CN111259933B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G06F18/24323 Tree-organised classifiers (under G06F18/00 Pattern recognition; G06F18/24 Classification techniques)
    • G06F16/2462 Approximate or statistical queries (under G06F16/245 Query processing; G06F16/2458 Special types of queries, e.g. statistical, fuzzy or distributed queries)
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; distributed database system architectures therefor
    • G06F16/285 Clustering or classification (under G06F16/28 Databases characterised by their database models, e.g. relational or object models)
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a high-dimensional characteristic data classification method and system based on a distributed parallel decision tree. A Spark-based parallel decision tree algorithm for high-dimensional feature data is implemented. The algorithm has a high degree of parallelism and can process large-scale data sets: it computes in parallel not only across nodes of the same layer of the decision tree but also at the feature level, which raises the degree of parallelism on high-dimensional data and effectively reduces the processing time of high-dimensional features.

Description

High-dimensional characteristic data classification method and system based on distributed parallel decision tree
Technical Field
The invention relates to the field of decision tree classification, and in particular to a high-dimensional feature data classification method and system based on a distributed parallel decision tree.
Background
The decision tree classification algorithm is an instance-based inductive learning method that extracts a tree-shaped classification model from a set of unordered training samples. Each non-leaf node in the tree records which feature is used to make a category decision, and each leaf node represents a final category. The path from the root node to a leaf node forms a classification rule. When a new sample is to be classified, it is simply tested at each branch node starting from the root and passed recursively down the corresponding subtree until it reaches a leaf node; the category represented by that leaf node is the predicted category of the sample. Quinlan proposed the well-known ID3 algorithm in 1986 and, building on it, the C4.5 algorithm in 1993.
A decision tree is built by top-down recursive construction, and the result is a binary or multi-way tree whose input is a set of training data with class labels. An internal (non-leaf) node of a binary tree is typically represented as a logical predicate of the form a = b, where a is an attribute and b is one of its values; the edges of the tree are the branch outcomes of the logical decision. In a binary decision tree the internal nodes are attributes, the edges are divisions over the attribute values, and every division produces only a left child and a right child, which helps reduce overfitting. To classify with a decision tree, a tree model is first built and refined from a training set; this is in effect a process of acquiring knowledge from data, i.e. machine learning. The generated decision tree is then used to classify input data: the attribute values of an input record are tested in turn starting from the root node until some leaf node is reached, which gives the record's category.
To meet the need of processing large-scale data sets, several parallel decision tree schemes have been proposed. In some industrial settings, specially designed decision tree algorithms target the characteristics of the data, such as algorithms that build several decision trees in parallel in time order and then aggregate them, or parallel decision tree algorithms that handle real-time streaming input and output. These algorithms mainly adapt decision trees to specific business scenarios rather than propose a general parallel decision tree algorithm.
General parallel decision tree algorithms parallelize the data processing of the traditional tree-building procedure: the overall flow is the same as in the serial algorithm, and the parallelism lies in dividing large-scale samples onto nodes in parallel and selecting the optimal division of each node in parallel, as in the MapReduce-based parallel decision tree algorithm and the layered-strategy parallel decision tree algorithm. Because these algorithms compute serially over the individual features, their efficiency drops as the data dimension grows. The invention therefore designs a Spark-based parallel binary decision tree algorithm that can process high-dimensional features effectively.
The largest computational cost of constructing a decision tree is computing and selecting the optimal splitting attribute, because every field must be considered when a node chooses its division so as to find the division with the maximum information gain; common measurement criteria include information entropy and the Gini index. Concretely, all samples on a node are first counted to obtain the information needed to compute the information gain of each field; for information entropy this is the number of samples of each category under each feature value. The information gains of all candidate divisions are then compared and the optimal division is selected.
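For illustration, the information-entropy criterion above can be written out in a few lines of Python (a minimal sketch; the function names are ours, not taken from the patent):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, left, right):
    """Gain of dividing `labels` into the child label lists left/right."""
    n = len(labels)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - children

parent = ["yes"] * 8 + ["no"] * 5
left, right = ["yes"] * 6 + ["no"] * 1, ["yes"] * 2 + ["no"] * 4
print(information_gain(parent, left, right))   # about 0.22 bits
```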
Much distributed decision tree work targets large-scale data. The MapReduce-based parallel decision tree algorithm divides samples onto nodes while counting them, and then selects splitting attributes in parallel across same-layer nodes; however, it builds a multi-way tree that turns every value of the dividing attribute into a child node, so it cannot handle high-dimensional feature data and may overfit or exhaust memory. Similarly, computing the optimal split points of same-layer nodes in parallel improves performance, but such algorithms still decide the optimal attribute serially, so they too lack parallelism when processing high-dimensional feature data. The Spark-based parallel decision tree algorithm uses a binary tree, counts samples in parallel, directly computes the statistics of each node, and then, like the MapReduce-based algorithm, selects the optimal division in parallel across nodes; although it can process higher-dimensional data with a low memory footprint, the information gains of all divisions of all features are computed serially inside each node, which becomes hugely time-consuming when the data dimension is very high.
Based on the above analysis, serial in-memory classification decision trees cannot process massive data. Existing distributed approaches greatly enlarge the scale of data that can be processed, but most MapReduce-based algorithms have low parallel efficiency and cannot produce a global classification model. Node-parallel decision tree algorithms do produce a global model, but they pay little attention to the characteristics of the data and either cannot handle high-dimensional features or take a long time to process them.
Disclosure of Invention
The invention aims to solve the problem that existing parallel decision tree algorithms cannot process high-dimensional feature data efficiently, and provides a parallel decision tree algorithm that is parallel at both the node level and the feature level.
Specifically, the invention provides a high-dimensional characteristic data classification method based on a distributed parallel decision tree, which is characterized by comprising the following steps:
step 1, obtaining training data comprising a plurality of sample high-dimensional feature data, wherein the sample high-dimensional feature data has corresponding label types, storing the training data in a distributed file system, carrying out parallel sampling statistics on samples of the training data on a distributed cluster to obtain feature distribution information on the training data, obtaining metadata for supporting decision tree calculation, and preprocessing continuous features;
step 2, allocating feature groups to the computing nodes of the distributed cluster by sampling and computation over the metadata, establishing the root node of the tree, and obtaining the initial information entropy of the root node by jointly counting the label category distribution of the samples over all working nodes of the distributed cluster;
step 3, counting, on each working node of the distributed cluster, the sample data stored there for all the sample high-dimensional feature data; obtaining the current tree node of each sample from its feature vector and the division rules of the decision tree; simultaneously counting the occurrences of the quadruples (node to which the sample belongs, feature, feature value, label); grouping and aggregating the quadruples by (node, feature group) on each working node; storing the resulting <(node, feature group), (feature, feature value, label) counts> key-value pairs in a distributed manner; and obtaining the information entropy of each feature value from these statistics;
step 4, sorting the feature values by the information entropy of their label distributions, first attributing the statistics of all labels to the right child, then traversing the feature values in order as left-child feature values, keeping at each step the division with the largest information gain to obtain <(node, feature group), optimal division> key-value pairs; aggregating the optimal divisions of the feature groups of the same node to obtain <node, optimal division> key-value pairs; and dividing each node according to its selected optimal division;
step 5, repeating steps 2 to 4 until all nodes in the decision tree are divided, saving the current decision tree as a classification model, and inputting the data to be classified into the classification model to obtain the category corresponding to the data to be classified.
The high-dimensional characteristic data classification method based on the distributed parallel decision tree is characterized in that the training data are text data or image data.
The high-dimensional characteristic data classification method based on the distributed parallel decision tree is characterized in that the step 2 comprises the following steps:
sorting the features by their number of feature values to obtain a sequence (f_1, v_1), (f_2, v_2), ..., (f_n, v_n), where v_i is the number of feature values of feature f_i, and binary-searching the minimum cap K on the per-group total of feature values such that the features can be packed (e.g. by a greedy or dynamic-programming algorithm) into at most G groups whose totals do not exceed K; the G groups obtained for the minimum K are the optimal feature grouping.
The high-dimensional characteristic data classification method based on the distributed parallel decision tree is characterized in that the preprocessing in step 1 comprises: sampling the continuous features, collecting the feature values of the sampled samples at the master node, counting the number of samples for each feature value, sorting all feature values by value to obtain a sequence, grouping the samples according to the preset maximum number of feature divisions, using each group as one bucket of the continuous feature, and taking the midpoint between the boundary values of two adjacent groups as a candidate division.
The high-dimensional characteristic data classification method based on the distributed parallel decision tree is characterized in that the metadata are obtained by statistics over the training data and comprise the number of features, the number of samples, the number of labels, the maximum number of feature divisions, the value ranges of the discrete features, the unordered discrete features, the maximum depth, the minimum number of samples per node, and the minimum information gain for splitting.
The invention also provides a high-dimensional characteristic data classification system based on the distributed parallel decision tree, which is characterized by comprising:
the method comprises the steps of 1, acquiring training data comprising a plurality of sample high-dimensional feature data, wherein the sample high-dimensional feature data has corresponding label types, storing the training data in a distributed file system, carrying out parallel sampling statistics on samples of the training data on a distributed cluster to acquire feature distribution information on the training data, acquiring metadata for supporting decision tree calculation, and preprocessing continuous features;
the module 2 distributes feature groups for all computing nodes in the distributed cluster by sampling and calculating the metadata, establishes a root node of a tree, and obtains the initial information entropy of the root node by combining label category distribution of statistical samples of all working nodes of the distributed cluster;
module 3, which, for all the sample high-dimensional feature data, counts the sample data stored on each working node of the distributed cluster, obtains the current tree node of each sample from its feature vector and the division rules of the decision tree, simultaneously counts the occurrences of the quadruples (node to which the sample belongs, feature, feature value, label), groups and aggregates the quadruples by (node, feature group) on each working node, stores the resulting <(node, feature group), (feature, feature value, label) counts> key-value pairs in a distributed manner, and obtains the information entropy of each feature value from these statistics;
module 4, which sorts the feature values by the information entropy of their label distributions, first attributes the statistics of all labels to the right child, then traverses the feature values in order as left-child feature values, keeping at each step the division with the largest information gain to obtain <(node, feature group), optimal division> key-value pairs, aggregates the optimal divisions of the feature groups of the same node to obtain <node, optimal division> key-value pairs, and divides each node according to its selected optimal division;
module 5, which repeats the processing of modules 2 to 4 until all nodes in the decision tree are divided, saves the current decision tree as a classification model, and inputs the data to be classified into the classification model to obtain the category corresponding to the data to be classified.
The high-dimensional characteristic data classification system based on the distributed parallel decision tree is characterized in that the training data are text data or image data.
The high-dimensional characteristic data classification system based on the distributed parallel decision tree is characterized in that the module 2 comprises:
sorting the features by their number of feature values to obtain a sequence (f_1, v_1), (f_2, v_2), ..., (f_n, v_n), where v_i is the number of feature values of feature f_i, and binary-searching the minimum cap K on the per-group total of feature values such that the features can be packed (e.g. by a greedy or dynamic-programming algorithm) into at most G groups whose totals do not exceed K; the G groups obtained for the minimum K are the optimal feature grouping.
The high-dimensional characteristic data classification system based on the distributed parallel decision tree is characterized in that the preprocessing in module 1 comprises: sampling the continuous features, collecting the feature values of the sampled samples at the master node, counting the number of samples for each feature value, sorting all feature values by value to obtain a sequence, grouping the samples according to the preset maximum number of feature divisions, using each group as one bucket of the continuous feature, and taking the midpoint between the boundary values of two adjacent groups as a candidate division.
The high-dimensional characteristic data classification system based on the distributed parallel decision tree is characterized in that the metadata are obtained by statistics over the training data and comprise the number of features, the number of samples, the number of labels, the maximum number of feature divisions, the value ranges of the discrete features, the unordered discrete features, the maximum depth, the minimum number of samples per node, and the minimum information gain for splitting.
Compared with the prior art, the invention has the following advantages:
compared with the prior art, the method improves the parallel efficiency of the decision tree model and is more efficient in processing large-scale high-dimensional data. The time for traversing all the division modes during optimal division selection can be effectively reduced. The method can better utilize the large-scale clusters under the condition of richer current computing resources, and is very suitable for the mainstream method of combining the features and then establishing the tree classification model in the current industry. The inventor performs experiments on 10000-dimensional real data on smaller-scale clusters, and compared with the most popular Spark parallel decision tree algorithm, the method can shorten the model building time by more than 30%, and can achieve better effects under the conditions of larger cluster specification and higher data dimension.
Drawings
FIG. 1 is a flow chart of decision tree construction in accordance with the present invention;
FIG. 2 is a schematic diagram of data transformation in Spark for the decision tree algorithm of the present invention.
Detailed Description
While doing large-scale data mining research, the inventors encountered data of very high dimensionality that existing decision tree algorithms handle poorly. Serial decision trees cannot process large-scale data at all, and existing parallel decision tree algorithms have a low degree of parallelism: even the fastest are parallel only at the node level, not in the optimal-feature-selection part. When the feature dimension is large and features take many values, a multi-way decision tree produces too many nodes, causing excessive memory use and overfitting, so a binary decision tree must be used; but a binary tree has to traverse all possible node divisions, compute the information gain of each, and pick the best, which increases the time consumed. Existing parallel decision tree algorithms do not account for this, because naturally occurring data rarely has an extremely large feature dimension. Industry, however, often generates new features by combining multi-dimensional features, which makes the final dimension grow exponentially; in that situation conventional parallel decision trees cannot efficiently screen the high-dimensional features and build the best classification model.

The inventors found that this defect can be remedied by parallelism at the feature level. Feature-level parallelism requires each Spark node to compute the optimal divisions of some features, and although this greatly reduces time in theory, the work cannot simply be scattered over the Spark nodes: the data-transfer time of shuffles between nodes and the waiting caused by unbalanced data distribution must be taken into account. Starting from the existing node-parallel decision tree, the inventors therefore studied the complexity of the feature division part and optimized the optimal feature selection. Observing that the complexity of feature division is determined by the total number of feature values, they designed a parallel decision tree algorithm that groups features evenly by their number of feature values and, from the sample division step onward, processes features in parallel as well as nodes, so as to increase the degree of parallelism while balancing Spark's other overheads.
The invention designs and implements a Spark-based parallel decision tree algorithm for high-dimensional feature data. The algorithm has a high degree of parallelism and can process large-scale data sets: it computes in parallel not only across same-layer nodes of the decision tree but also at the feature level, which raises the degree of parallelism on high-dimensional data and effectively reduces the time needed to process high-dimensional features.
The invention comprises the following key points:
and the key point 1 is designed to realize a parallel binary decision tree algorithm parallel in the characteristic dimension, so that the processing efficiency of the parallel decision tree algorithm on high-dimensional data is improved.
And 2, characteristic grouping parallelism and group size adjustment are realized according to cluster conditions, so that data transmission consumption and node operation time consumption are balanced, clusters are effectively utilized to the greatest extent, and parallel efficiency is improved.
And the key points 3 are used for balancing the operation amount of each node by grouping the characteristics according to the characteristic values, so that the idle time of the nodes in the cluster can be further reduced, and the operation time is reduced.
In order to make the above features and effects of the present invention more clearly understood, the following specific examples are given with reference to the accompanying drawings.
The decision tree algorithm is implemented in Spark as shown in FIG. 1. The basic flow is: first count the data to obtain information such as the feature distribution; then put all samples into the root node of the decision tree; compute statistics such as sample features and labels in parallel; compute the information gain of the candidate feature divisions from these statistics; select the optimal division; and divide the nodes. In this parallel decision tree construction the degree of parallelism is raised further: optimal division selection runs in parallel over all nodes of each layer of the tree, and by grouping the features the information gain computation is also parallel at the feature level. The data transformations in Spark are shown in FIG. 2; a serial sketch of the layer-wise flow follows, and the algorithm proceeds in the following steps:
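As a concrete reference for this layer-wise flow, the following self-contained Python sketch builds a tree in the same three repeated stages (node statistics, optimal division, node division). It is our own simplified serial version, not the patent's implementation: stage 1 runs distributed and partition-wise in the real algorithm, stage 2 runs in parallel over nodes and feature groups, and divisions here test a single feature value for equality rather than the value subsets used for discrete features (sketched under "Optimal division calculation" below).

```python
import math
from collections import Counter, defaultdict

def entropy(counts):
    """Shannon entropy (base 2) of a Counter of label counts."""
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values() if c)

def train_layerwise(samples, labels, min_gain=1e-9, max_depth=5):
    """samples: list of {feature: value} dicts; labels: list of classes."""
    tree = {0: {"depth": 0, "split": None, "label": None}}
    assign = [0] * len(samples)          # sample index -> node id
    frontier, next_id = [0], 1
    while frontier:
        # Stage 1: count (node, feature, value, label) quadruples and
        # per-node label totals in one pass over the data.
        quad, per_node = defaultdict(Counter), defaultdict(Counter)
        for nid, x, y in zip(assign, samples, labels):
            if nid in frontier:
                per_node[nid][y] += 1
                for f, v in x.items():
                    quad[(nid, f, v)][y] += 1
        # Stage 2: best single-value division per frontier node.
        new_frontier = []
        for nid in frontier:
            parent = per_node[nid]
            n, h0 = sum(parent.values()), entropy(parent)
            tree[nid]["label"] = parent.most_common(1)[0][0]  # leaf fallback
            best = None
            for (qn, f, v), left in quad.items():
                if qn != nid or sum(left.values()) == n:
                    continue                 # division must be proper
                right = parent - left
                h = (sum(left.values()) * entropy(left)
                     + sum(right.values()) * entropy(right)) / n
                if best is None or h0 - h > best[0]:
                    best = (h0 - h, f, v)
            if best and best[0] > min_gain and tree[nid]["depth"] < max_depth:
                tree[nid]["split"] = (best[1], best[2], next_id, next_id + 1)
                for cid in (next_id, next_id + 1):
                    tree[cid] = {"depth": tree[nid]["depth"] + 1,
                                 "split": None, "label": None}
                    new_frontier.append(cid)
                next_id += 2
        # Stage 3: move the samples of freshly divided nodes down.
        for i, x in enumerate(samples):
            sp = tree[assign[i]]["split"]
            if sp:
                f, v, lid, rid = sp
                assign[i] = lid if x[f] == v else rid
        frontier = new_frontier
    return tree

# e.g. train_layerwise([{"navel": "flat"}, {"navel": "sunken"}], ["no", "yes"])
```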
metadata (statistical information needed to be used in the calculation process, super parameters set by a user, and the like are stored in the metadata) is established. The metadata is obtained by statistics of input samples (the input samples comprise all training data, each sample comprises a sample number, all characteristics of the sample and sample category labels), and the metadata comprises a characteristic number, a sample number, a label number, a maximum characteristic division, a discrete characteristic value range (the number of the values of each characteristic), unordered discrete characteristics, a maximum depth, a node minimum sample number, split minimum information gain and the like.
Continuous feature preprocessing. The main idea of the invention is to group the features by their number of possible divisions (which determines the amount of computation, described in detail in the next section) so as to achieve parallel processing under high-dimensional features. Continuous features have too many possible divisions for the computation to be bounded in advance, so the invention preprocesses them and fixes their candidate divisions before the tree is built. For each continuous feature (a feature taking continuous values, e.g. the height of a chair may be 1 m, 1.2 m, ...), candidate divisions are selected first, in correspondence with discrete features (e.g. the colour of a chair may be yellow, red, ...). Specifically, the samples are first sampled, the feature values of the sampled samples (the values of all sampled samples on the feature to be preprocessed) are collected at the master node (driver), the number of samples per feature value is counted, and all feature values are sorted by value to obtain a sequence (a_1, count_1), (a_2, count_2), ..., (a_n, count_n). According to the initially chosen maximum number of feature divisions, the value range is cut into several groups with sample counts as equal as possible (e.g. if two divisions are required, the values may fall into the three sections 0-0.2, 0.3-0.4 and 0.7-1, with roughly equal numbers of samples in each); each group serves as one bucket (bin) of the continuous feature, and the midpoint between the boundary values of two adjacent groups is taken as a candidate division (here the two candidates 0.25 and 0.55).
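The bucketing just described can be sketched as follows (our own function name; the sample values are chosen to reproduce the 0.25 / 0.55 example above):

```python
from collections import Counter

def candidate_splits(values, max_bins):
    """Candidate split thresholds for one continuous feature, from a
    sample of its values: sort the distinct values with their counts,
    cut them into at most max_bins groups of roughly equal sample
    count, and take the midpoint between adjacent groups as a split."""
    pairs = sorted(Counter(values).items())      # (value, count), ascending
    total = sum(c for _, c in pairs)
    target = total / max_bins                    # desired samples per bucket
    splits, acc, last = [], 0, None
    for v, c in pairs:
        if last is not None and acc >= target * (len(splits) + 1):
            splits.append((last + v) / 2)        # midpoint between buckets
        acc += c
        last = v
    return splits[:max_bins - 1]

# Values spread over 0-0.2, 0.3-0.4 and 0.7-1 yield the two candidate
# divisions 0.25 and 0.55 from the example above.
print(candidate_splits([0.1, 0.15, 0.2, 0.3, 0.35, 0.4, 0.7, 0.8, 1.0], 3))
```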
Balanced feature grouping. To distribute the computation of optimal feature divisions over the working nodes (workers), the features must be divided into several groups, each worker computing the optimal divisions of one group. The time of this step is determined by the worker that works longest, and the complexity of the optimal-division computation is O(Σ_{i∈F} v_i), where F is the set of features in the group and v_i is the number of feature values (or bins) of feature i. To minimize the working time, the maximum per-group total of feature values must therefore be minimized. The features are sorted by their number of feature values to obtain a sequence (f_1, v_1), (f_2, v_2), ..., (f_n, v_n), where v denotes the number of feature values of each feature: for a discrete feature, the number of its possible values; for a continuous feature, the number of its candidate divisions. Given the preset number of groups G, a minimum value K is binary-searched such that the features can be packed into at most G groups whose per-group totals of feature values (i.e. of possible divisions) do not exceed K; for a fixed K, a concrete packing can be found with a greedy algorithm. The grouping corresponding to the minimum K is the optimal feature grouping. Under this grouping, the maximum computation load carried by any working node in the cluster is minimized and the load is balanced, which minimizes the time that workers which have finished their own tasks spend waiting for the others during parallel processing.
Node information statistics. Each sample is routed to its tree node according to its feature vector and the division rules of the decision tree built so far, and the feature-value and label distribution of every node is computed over all samples; concretely, the occurrences of the quadruples (node to which the sample belongs, feature, feature value, label) are counted. To reduce transmission cost, statistics are first computed within each partition of the parallel tasks and then grouped and aggregated by (node, feature group), yielding <(node, feature group), (feature, feature value, label) counts> key-value pairs on each working node.
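With the PySpark RDD API, this statistics stage might look as follows (a sketch under our own names, assuming a local PySpark installation: `node_of` stands for the current division rules and `group_of` for the feature grouping; `reduceByKey` combines inside each partition before shuffling, matching the per-partition pre-aggregation described above):

```python
from pyspark import SparkContext

def quad_statistics(samples, node_of, group_of):
    """samples: RDD of (feature_dict, label). Returns an RDD keyed by
    (node, feature group) holding ((feature, value, label), count) pairs."""
    quads = samples.flatMap(
        lambda s: [(((node_of(s[0]), group_of(f)), (f, v, s[1])), 1)
                   for f, v in s[0].items()])
    counted = quads.reduceByKey(lambda a, b: a + b)   # per-partition combine
    return (counted
            .map(lambda kv: (kv[0][0], (kv[0][1], kv[1])))
            .groupByKey())                            # gather per (node, group)

sc = SparkContext("local[*]", "quad-stats")
rdd = sc.parallelize([({"color": "green", "root": "curled"}, "yes"),
                      ({"color": "pale", "root": "stiff"}, "no")])
stats = quad_statistics(rdd, node_of=lambda x: 0, group_of=lambda f: 0)
print({k: sorted(v) for k, v in stats.collect()})
sc.stop()
```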
Optimal division calculation. For each (node, feature group), the optimal division of every feature is computed. For a continuous feature, all candidate divisions are traversed in order: the statistics of all labels are first attributed to the right child, the label counts belonging to the current division are moved into the left child at each step, and the information gain is computed from the label counts of the two children. For a discrete feature, the feature values are first sorted by the information entropy of their label distributions and then traversed in order as left-child feature values. At every traversal step the division with the largest information gain is kept, yielding <(node, feature group), optimal division> key-value pairs. The optimal divisions of the feature groups of the same node are then aggregated to obtain <node, optimal division> key-value pairs.
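For discrete features, the entropy-ordering scan just described can be sketched as follows (a minimal sketch with our own names; `value_stats` stands for the per-value label counts gathered in the statistics step). Sorting the values by their own label entropy reduces the search over value subsets to a linear scan over prefixes, which is what shortens the traversal of division modes:

```python
import math
from collections import Counter

def entropy(counts):
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values() if c)

def best_discrete_split(value_stats):
    """value_stats: {feature_value: Counter(label -> count)} for one
    discrete feature at one node. Sorts the values by the entropy of
    their label distribution, then scans growing prefixes as the left
    child; returns (best information gain, left value set)."""
    parent = Counter()
    for c in value_stats.values():
        parent.update(c)
    n, h_parent = sum(parent.values()), entropy(parent)
    ordered = sorted(value_stats.items(), key=lambda kv: entropy(kv[1]))
    left, best_gain, best_set = Counter(), -1.0, None
    for i, (v, c) in enumerate(ordered[:-1]):    # right child stays nonempty
        left.update(c)
        right = parent - left
        h = (sum(left.values()) * entropy(left)
             + sum(right.values()) * entropy(right)) / n
        if h_parent - h > best_gain:
            best_gain = h_parent - h
            best_set = {val for val, _ in ordered[:i + 1]}
    return best_gain, best_set

stats = {"a": Counter(yes=4), "b": Counter(no=3), "c": Counter(yes=1, no=1)}
print(best_discrete_split(stats))   # -> (0.59..., {'a'})
```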
Node division. Each node is divided into a left child and a right child according to its optimal division, the division information is written into the node, and the node division of the current layer ends.
Taking the watermelon dataset as an example:
numbering device Color Root base Knock sound Texture and method for producing the same Umbilical region Tactile sensation Good melon
1 Bluish green Crimping and shrinking Turbid sound Clear and clear Recess in the bottom of the container Hard slide Is that
2 Bluish green Stiffening member Crisp and clean Clear and clear Flat and flat Soft adhesive Whether or not
3 Black black Crimping and shrinking Turbid sound Clear and clear Recess in the bottom of the container Hard slide Is that
4 Light white Crimping and shrinking Turbid sound Blurring Flat and flat Soft adhesive Whether or not
5 Light white Crimping and shrinking Turbid sound Clear and clear Recess in the bottom of the container Hard slide Is that
6 Light white Slightly spiral wound Clunk and clunk with feeling of fullness Slightly paste Recess in the bottom of the container Hard slide Whether or not
7 Black black Slightly spiral wound Turbid sound Slightly paste Slightly concave Soft adhesive Is that
8 Light white Crimping and shrinking Turbid sound Blurring Flat and flat Hard slide Whether or not
9 Black black Slightly spiral wound Clunk and clunk with feeling of fullness Slightly paste Slightly concave Hard slide Whether or not
10 Black black Crimping and shrinking Clunk and clunk with feeling of fullness Clear and clear Recess in the bottom of the container Hard slide Is that
11 Light white Stiffening member Crisp and clean Blurring Flat and flat Hard slide Whether or not
12 Bluish green Crimping and shrinking Clunk and clunk with feeling of fullness Clear and clear Recess in the bottom of the container Hard slide Is that
13 Bluish green Slightly spiral wound Turbid sound Slightly paste Recess in the bottom of the container Hard slide Whether or not
14 Bluish green Slightly spiral wound Turbid sound Clear and clear Slightly concave Soft adhesive Is that
15 Black black Slightly spiral wound Turbid sound Clear and clear Slightly concave Soft adhesive Whether or not
16 Black black Slightly spiral wound Turbid sound Clear and clear Slightly concave Hard slide Is that
17 Bluish green Crimping and shrinking Clunk and clunk with feeling of fullness Slightly paste Slightly concave Hard slide Whether or not
Feature grouping

The features are divided into three groups. The counted numbers of feature values are:

color 3, root 3, knock 3, texture 3, navel 3, touch 2

Balancing by the number of feature values gives the groups (1: {color, root}, 2: {knock, texture}, 3: {navel, touch}), with per-group totals 6, 6 and 5, as shown below.
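This is exactly the grouping that the `group_features` sketch from the balanced-grouping section produces for these value counts:

```python
arity = {"color": 3, "root": 3, "knock": 3, "texture": 3, "navel": 3, "touch": 2}
print(group_features(arity, 3))
# -> [['color', 'root'], ['knock', 'texture'], ['navel', 'touch']]  (cap K = 6)
```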
Sample division statistics

The node to which each sample belongs is determined according to the existing rules of the decision tree.
Since no division has been made yet, all 17 samples are assigned to node 0 (the root); the table above simply gains a node column whose value is 0 for every sample.
The sample information of each partition is counted and then merged across partitions (the per-partition count tables are omitted here); merging yields the per-feature-value label counts below.
Division merging

First group:

| color | yes | no |
| dark | 4 | 2 |
| green | 3 | 3 |
| pale | 1 | 4 |

| root | yes | no |
| curled | 5 | 3 |
| slightly curled | 3 | 4 |
| stiff | 0 | 2 |

Optimal division: whether the root is stiff, entropy 0.89.

Second group:

| knock | yes | no |
| muffled | 2 | 3 |
| dull | 6 | 4 |
| crisp | 0 | 2 |

| texture | yes | no |
| clear | 7 | 2 |
| slightly blurry | 1 | 4 |
| blurry | 0 | 3 |

Optimal division: whether the texture is blurry, entropy 0.83.

Third group:

| navel | yes | no |
| sunken | 5 | 2 |
| slightly sunken | 3 | 3 |
| flat | 0 | 4 |

| touch | yes | no |
| hard-smooth | 6 | 6 |
| soft-sticky | 2 | 3 |

Optimal division: whether the navel is flat, entropy 0.76.
Node division: node 0 is divided into a left child and a right child, holding the samples with a flat navel and with a non-flat navel respectively.
The above process is repeated until all the divisions are completed.
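Classifying a new sample with the finished tree is then the root-to-leaf walk described in the background section; a minimal sketch (our own `Node` structure, with leaves carrying the majority label of their training samples):

```python
class Node:
    def __init__(self, label=None):
        self.split = None          # (feature, set of left-child values)
        self.left = self.right = None
        self.label = label         # majority class, used at leaves

def predict(node, sample):
    """Follow the branch whose value set contains the sample's feature
    value until a leaf is reached."""
    while node.split is not None:
        feature, left_values = node.split
        node = node.left if sample[feature] in left_values else node.right
    return node.label

# The tree after the first division above: node 0 splits on whether the
# navel is flat (flat -> left child, majority label "no", i.e. not a
# good melon; non-flat -> right child, majority label "yes").
root = Node()
root.split = ("navel", {"flat"})
root.left, root.right = Node("no"), Node("yes")
print(predict(root, {"navel": "flat"}))    # -> no
print(predict(root, {"navel": "sunken"}))  # -> yes
```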
The following is a system embodiment corresponding to the above method embodiment, and the two may be implemented in cooperation with each other. The related technical details mentioned in the above embodiment remain valid here and, to reduce repetition, are not repeated; conversely, the related technical details mentioned in this embodiment can also be applied to the above embodiment.
The invention also provides a high-dimensional characteristic data classification system based on the distributed parallel decision tree, which is characterized by comprising the following steps:
the method comprises the steps of 1, acquiring training data comprising a plurality of sample high-dimensional feature data, wherein the sample high-dimensional feature data has corresponding label types, storing the training data in a distributed file system, carrying out parallel sampling statistics on samples of the training data on a distributed cluster to acquire feature distribution information on the training data, acquiring metadata for supporting decision tree calculation, and preprocessing continuous features;
the module 2 distributes feature groups for all computing nodes in the distributed cluster by sampling and calculating the metadata, establishes a root node of a tree, and obtains the initial information entropy of the root node by combining label category distribution of statistical samples of all working nodes of the distributed cluster;
module 3, which, for all the sample high-dimensional feature data, counts the sample data stored on each working node of the distributed cluster, obtains the current tree node of each sample from its feature vector and the division rules of the decision tree, simultaneously counts the occurrences of the quadruples (node to which the sample belongs, feature, feature value, label), groups and aggregates the quadruples by (node, feature group) on each working node, stores the resulting <(node, feature group), (feature, feature value, label) counts> key-value pairs in a distributed manner, and obtains the information entropy of each feature value from these statistics;
module 4, which sorts the feature values by the information entropy of their label distributions, first attributes the statistics of all labels to the right child, then traverses the feature values in order as left-child feature values, keeping at each step the division with the largest information gain to obtain <(node, feature group), optimal division> key-value pairs, aggregates the optimal divisions of the feature groups of the same node to obtain <node, optimal division> key-value pairs, and divides each node according to its selected optimal division;
module 5, which repeats the processing of modules 2 to 4 until all nodes in the decision tree are divided, saves the current decision tree as a classification model, and inputs the data to be classified into the classification model to obtain the category corresponding to the data to be classified.
The high-dimensional characteristic data classification system based on the distributed parallel decision tree is characterized in that the training data are text data or image data.
The high-dimensional characteristic data classification system based on the distributed parallel decision tree is characterized in that the module 2 comprises:
sorting the features by their number of feature values to obtain a sequence (f_1, v_1), (f_2, v_2), ..., (f_n, v_n), where v_i is the number of feature values of feature f_i, and binary-searching the minimum cap K on the per-group total of feature values such that the features can be packed (e.g. by a greedy or dynamic-programming algorithm) into at most G groups whose totals do not exceed K; the G groups obtained for the minimum K are the optimal feature grouping.
The high-dimensional characteristic data classification system based on the distributed parallel decision tree is characterized in that the preprocessing in module 1 comprises: sampling the continuous features, collecting the feature values of the sampled samples at the master node, counting the number of samples for each feature value, sorting all feature values by value to obtain a sequence, grouping the samples according to the preset maximum number of feature divisions, using each group as one bucket of the continuous feature, and taking the midpoint between the boundary values of two adjacent groups as a candidate division.
The high-dimensional characteristic data classification system based on the distributed parallel decision tree is characterized in that the metadata are obtained by statistics over the training data and comprise the number of features, the number of samples, the number of labels, the maximum number of feature divisions, the value ranges of the discrete features, the unordered discrete features, the maximum depth, the minimum number of samples per node, and the minimum information gain for splitting.

Claims (10)

1. The high-dimensional characteristic data classification method based on the distributed parallel decision tree is characterized by comprising the following steps of:
step 1, obtaining training data comprising a plurality of sample high-dimensional feature data, wherein the sample high-dimensional feature data has corresponding label types, storing the training data in a distributed file system, carrying out parallel sampling statistics on samples of the training data on a distributed cluster to obtain feature distribution information on the training data, obtaining metadata for supporting decision tree calculation, and preprocessing continuous features;
step 2, allocating feature groups to the computing nodes of the distributed cluster by sampling and computation over the metadata, establishing the root node of the tree, and obtaining the initial information entropy of the root node by jointly counting the label category distribution of the samples over all working nodes of the distributed cluster;
step 3, counting, on each working node of the distributed cluster, the sample data stored there for all the sample high-dimensional feature data; obtaining the current tree node of each sample from its feature vector and the division rules of the decision tree; simultaneously counting the occurrences of the quadruples (node to which the sample belongs, feature, feature value, label); grouping and aggregating the quadruples by (node, feature group) on each working node; storing the resulting <(node, feature group), (feature, feature value, label) counts> key-value pairs in a distributed manner; and obtaining the information entropy of each feature value from these statistics;
step 4, sorting the feature values by the information entropy of their label distributions, first attributing the statistics of all labels to the right child, then traversing the feature values in order as left-child feature values, keeping at each step the division with the largest information gain to obtain <(node, feature group), optimal division> key-value pairs; aggregating the optimal divisions of the feature groups of the same node to obtain <node, optimal division> key-value pairs; and dividing each node according to its selected optimal division;
step 5, repeating steps 2 to 4 until all nodes in the decision tree are divided, saving the current decision tree as a classification model, and inputting the data to be classified into the classification model to obtain the category corresponding to the data to be classified.
2. The method of claim 1, wherein the training data is text data or image data.
3. The method for classifying high-dimensional feature data based on distributed parallel decision trees according to claim 1, wherein the step 2 comprises:
sorting the features by their number of feature values to obtain a sequence (f_1, v_1), (f_2, v_2), ..., (f_n, v_n), where v_i is the number of feature values of feature f_i, and binary-searching the minimum cap K on the per-group total of feature values such that the features can be packed (e.g. by a greedy or dynamic-programming algorithm) into at most G groups whose totals do not exceed K; the G groups obtained for the minimum K are the optimal feature grouping.
4. The method for classifying high-dimensional feature data based on a distributed parallel decision tree according to claim 1, wherein the preprocessing in step 1 comprises: sampling the continuous features, collecting the feature values of the sampled samples at the master node, counting the number of samples for each feature value, sorting all feature values by value to obtain a sequence, grouping the samples according to the preset maximum number of feature divisions, using each group as one bucket of the continuous feature, and taking the midpoint between the boundary values of two adjacent groups as a candidate division.
5. The method for classifying high-dimensional feature data based on a distributed parallel decision tree according to claim 1, wherein the metadata are obtained by statistics over the training data and comprise the number of features, the number of samples, the number of labels, the maximum number of feature divisions, the value ranges of the discrete features, the unordered discrete features, the maximum depth, the minimum number of samples per node, and the minimum information gain for splitting.
6. A distributed parallel decision tree-based high-dimensional feature data classification system, comprising:
module 1, which obtains training data comprising a plurality of samples of high-dimensional feature data, each with a corresponding label category, stores the training data in a distributed file system, performs parallel sampling statistics over the training data samples on a distributed cluster to obtain the feature distribution information of the training data, obtains metadata supporting the decision tree computation, and preprocesses the continuous features;
the module 2 distributes feature groups for all computing nodes in the distributed cluster by sampling and calculating the metadata, establishes a root node of a tree, and obtains the initial information entropy of the root node by combining label category distribution of statistical samples of all working nodes of the distributed cluster;
module 3, which, for all the sample high-dimensional feature data, counts the sample data stored on each working node of the distributed cluster, obtains the current tree node of each sample from its feature vector and the division rules of the decision tree, simultaneously counts the occurrences of the quadruples (node to which the sample belongs, feature, feature value, label), groups and aggregates the quadruples by (node, feature group) on each working node, stores the resulting <(node, feature group), (feature, feature value, label) counts> key-value pairs in a distributed manner, and obtains the information entropy of each feature value from these statistics;
module 4, which sorts the feature values by the information entropy of their label distributions, first attributes the statistics of all labels to the right child, then traverses the feature values in order as left-child feature values, keeping at each step the division with the largest information gain to obtain <(node, feature group), optimal division> key-value pairs, aggregates the optimal divisions of the feature groups of the same node to obtain <node, optimal division> key-value pairs, and divides each node according to its selected optimal division;
module 5, which repeats the processing of modules 2 to 4 until all nodes in the decision tree are divided, saves the current decision tree as a classification model, and inputs the data to be classified into the classification model to obtain the category corresponding to the data to be classified.
7. The distributed parallel decision tree based high-dimensional feature data classification system of claim 6, wherein the training data is text data or image data.
8. The distributed parallel decision tree based high-dimensional feature data classification system of claim 6, wherein the module 2 comprises:
sorting the features by their number of feature values to obtain a sequence (f_1, v_1), (f_2, v_2), ..., (f_n, v_n), where v_i is the number of feature values of feature f_i, and binary-searching the minimum cap K on the per-group total of feature values such that the features can be packed (e.g. by a greedy or dynamic-programming algorithm) into at most G groups whose totals do not exceed K; the G groups obtained for the minimum K are the optimal feature grouping.
9. The distributed parallel decision tree based high-dimensional feature data classification system of claim 6, wherein the preprocessing in module 1 comprises: sampling the continuous features, collecting the feature values of the sampled samples at the master node, counting the number of samples for each feature value, sorting all feature values by value to obtain a sequence, grouping the samples according to the preset maximum number of feature divisions, using each group as one bucket of the continuous feature, and taking the midpoint between the boundary values of two adjacent groups as a candidate division.
10. The distributed parallel decision tree based high-dimensional feature data classification system of claim 6, wherein the metadata are obtained by statistics over the training data and comprise the number of features, the number of samples, the number of labels, the maximum number of feature divisions, the value ranges of the discrete features, the unordered discrete features, the maximum depth, the minimum number of samples per node, and the minimum information gain for splitting.
CN202010022431.3A 2020-01-09 2020-01-09 High-dimensional characteristic data classification method and system based on distributed parallel decision tree Active CN111259933B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010022431.3A CN111259933B (en) 2020-01-09 2020-01-09 High-dimensional characteristic data classification method and system based on distributed parallel decision tree

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010022431.3A CN111259933B (en) 2020-01-09 2020-01-09 High-dimensional characteristic data classification method and system based on distributed parallel decision tree

Publications (2)

Publication Number Publication Date
CN111259933A CN111259933A (en) 2020-06-09
CN111259933B true CN111259933B (en) 2023-06-13

Family

ID=70950331

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010022431.3A Active CN111259933B (en) 2020-01-09 2020-01-09 High-dimensional characteristic data classification method and system based on distributed parallel decision tree

Country Status (1)

Country Link
CN (1) CN111259933B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113822432B * 2021-04-06 2024-02-06 JD Technology Holding Co., Ltd. Sample data processing method and device, electronic equipment and storage medium
CN113268505B * 2021-04-29 2021-11-30 Guangdong Ocean University Offline batch processing method and system for multi-source multi-mode ocean big data
CN114638309B * 2022-03-21 2024-04-09 Beijing Zuojiang Technology Co., Ltd. Information entropy-based HyperCuts decision tree policy set preprocessing method
CN115188381B * 2022-05-17 2023-10-24 Beike Zhaofang (Beijing) Technology Co., Ltd. Voice recognition result optimization method and device based on click ordering


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102214213A (en) * 2011-05-31 2011-10-12 中国科学院计算技术研究所 Method and system for classifying data by adopting decision tree
EP3133511A1 (en) * 2015-08-19 2017-02-22 Palantir Technologies, Inc. Systems and methods for automatic clustering and canonical designation of related data in various data structures
CN106056134A (en) * 2016-05-20 2016-10-26 重庆大学 Semi-supervised random forests classification method based on Spark
CN106228389A (en) * 2016-07-14 2016-12-14 武汉斗鱼网络科技有限公司 Network potential usage mining method and system based on random forests algorithm
WO2018014610A1 (en) * 2016-07-20 2018-01-25 武汉斗鱼网络科技有限公司 C4.5 decision tree algorithm-based specific user mining system and method therefor
CN106648654A (en) * 2016-12-20 2017-05-10 深圳先进技术研究院 Data sensing-based Spark configuration parameter automatic optimization method
CN107193967A (en) * 2017-05-25 2017-09-22 南开大学 A kind of multi-source heterogeneous industry field big data handles full link solution
CN108491226A (en) * 2018-02-05 2018-09-04 西安电子科技大学 Spark based on cluster scaling configures parameter automated tuning method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Wu Jinling; Zhang Haidong; Li Zhe; Shi Wei; Tian Xiaojun. Research on appearance quality detection of sunflower seeds based on computer vision. Hubei Agricultural Sciences, 2019, No. 23. *
Wang Rong. Attribute reduction algorithms for neighborhood rough sets and their application in classifiers. China Masters' Theses Full-text Database, Information Science and Technology, 2018. *
Zhao Jingxian; Ni Chunpeng; Zhan Yuanrui; Du Ziping. A combinatorial optimization decision tree algorithm for large-scale databases. Systems Engineering and Electronics, 2009, No. 3. *

Also Published As

Publication number Publication date
CN111259933A (en) 2020-06-09

Similar Documents

Publication Publication Date Title
CN111259933B (en) High-dimensional characteristic data classification method and system based on distributed parallel decision tree
CN102663100B (en) Two-stage hybrid particle swarm optimization clustering method
CN103218435B (en) Method and system for clustering Chinese text data
CN107291847A (en) A kind of large-scale data Distributed Cluster processing method based on MapReduce
CN106845536B (en) Parallel clustering method based on image scaling
CN107832456B (en) Parallel KNN text classification method based on critical value data division
Fu et al. An experimental evaluation of large scale GBDT systems
CN110737805B (en) Method and device for processing graph model data and terminal equipment
CN106228554A (en) Fuzzy coarse central coal dust image partition methods based on many attribute reductions
CN105808582A (en) Parallel generation method and device of decision tree on the basis of layered strategy
CN108764307A (en) The density peaks clustering method of natural arest neighbors optimization
CN106980639B (en) Short text data aggregation system and method
CN113010597A (en) Parallel association rule mining method for ocean big data
CN102141988A (en) Method, system and device for clustering data in data mining system
Zheng et al. k-dominant Skyline query algorithm for dynamic datasets
CN104794215A (en) Fast recursive clustering method suitable for large-scale data
CN108090514B (en) Infrared image identification method based on two-stage density clustering
CN111127184A (en) Distributed combined credit evaluation method
CN110609832A (en) Non-repeated sampling method for streaming data
Chen et al. Optimization Simulation of Big Data Analysis Model Based on K-means Algorithm
CN115688034B (en) Method for extracting and reducing mixed data of numerical value type and category type
CN115858629B (en) KNN query method based on learning index
CN109543711A (en) A kind of decision tree generation method based on ID3 algorithm
CN112948712B (en) Stackable community discovery method
CN116595102B (en) Big data management method and system for improving clustering algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant