CN110309882A - An effective method for handling mixed-type large-scale data in real life - Google Patents
Classifications
- G06F18/23: Pattern recognition; analysing; clustering techniques
- G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/28: Determining representative reference patterns, e.g. by averaging or distorting; generating dictionaries
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Complex Calculations (AREA)
Abstract
The invention discloses an effective method for handling mixed-type large-scale data in real life. Compared with existing large-scale data clustering techniques, the core algorithm of the method has the following notable advantages: a global sampling technique makes better use of the information of the whole original data set, yielding well-representative samples; a classification criterion is obtained by clustering these samples, which not only reduces the number of algorithm iterations but also achieves higher clustering precision; and a clustering-ensemble technique effectively improves the accuracy of the final partition, so that higher clustering precision can be obtained in practical applications.
Description
Technical field
The present invention relates to the field of advanced computing and data processing, and in particular to an effective method for handling mixed-type large-scale data in real life.
Background art
The data we face in real life are of many kinds, and classifying them effectively is of great importance. When real data are to be partitioned, a well-chosen method of cluster analysis can classify the data more effectively. Cluster analysis is an unsupervised technique whose goal is to assign data with high similarity to the same cluster according to some similarity measure, so that similarity within a cluster is as high as possible and similarity between clusters is as low as possible. The purpose of clustering is to discover the internal structure of the original data set and thereby support deeper correlation analysis of the original data. Traditional clustering algorithms such as K-prototypes are widely used because of their low computational complexity and ease of implementation, but the K-prototypes algorithm also has shortcomings. First, it selects its initial prototypes at random, which makes the results unstable and sensitive to the choice of initial prototypes. Second, the algorithm requires many iterations, so its running time becomes long when the data volume is large or the data dimensionality is high.
The rapid advance of information technology has brought sweeping changes to every aspect of human society, and behind these changes lies the generation and accumulation of massive data in all walks of life. The explosive growth of data volume and the increasing complexity of the data themselves pose challenges to cluster analysis: the traditional K-prototypes algorithm can no longer classify the increasingly complex large-scale data in real life quickly and effectively.
Summary of the invention
To overcome the shortcomings and deficiencies of the prior art, an effective method for handling mixed-type large-scale data in real life is provided which, when processing real-world mixed-type large-scale data sets, reduces complexity and shortens running time while achieving higher partitioning precision.
To achieve this object, the present invention provides an effective method for handling mixed-type large-scale data in real life, comprising the following steps:
Step 1: randomly choose μk initial points from the data set X containing n samples, where k is the number of classes of data set X and μ is a parameter;
Step 2: partition the data in data set X according to the nearest-distance principle, obtaining μk clusters;
Step 3: compute the reasonable number s of samples to be drawn; the formula for s is shown in Equation 1, where f is the fraction to be drawn, n is the number of data points in data set X, and n_i is the number of data points in cluster C_i; Equation 1 means that, with probability 1 - δ (0 < δ < 1), no fewer than f × n_i data points are drawn from C_i;
Step 4: from each cluster C_i of the above μk clusters, randomly draw samples with replacement at the ratio s/n; drawing five times in total yields the sample set S = {s_1, s_2, s_3, s_4, s_5};
Step 5: run the K-prototypes algorithm on the five drawn samples, obtaining five partitioning results;
Step 6: integrate the partitioning results of Step 5 according to the ensemble algorithm, obtaining k cluster centers.
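Equation 1 itself appears only as an image in the source and is not reproduced in the text. As a hedged reconstruction: the stated guarantee (drawing at least f × n_i points of cluster C_i with probability 1 - δ) matches the Chernoff-bound sample-size estimate popularized by the CURE clustering algorithm, which would read

```latex
s \;\ge\; f\,n \;+\; \frac{n}{n_i}\ln\frac{1}{\delta}
   \;+\; \frac{n}{n_i}\sqrt{\left(\ln\frac{1}{\delta}\right)^{2} + 2\,f\,n_i\,\ln\frac{1}{\delta}}
```

This is a plausible sketch consistent with the surrounding description, not the patent's verbatim formula.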
As a further improvement of the above scheme, the parameter μ in Step 1 is determined as follows:
Step 1.1: from the original data set X containing n samples, randomly select k initial points;
Step 1.2: partition the data in data set X according to the nearest-distance principle, obtaining k clusters;
Step 1.3: compute the reasonable sample number s to be drawn according to Equation 1;
Step 1.4: from each cluster C_i of the above k clusters, randomly draw samples with replacement at the ratio s/n; drawing five times in total yields the sample set S = {s_1, s_2, s_3, s_4, s_5};
Step 1.5: run the K-prototypes algorithm on the drawn sample set S = {s_1, s_2, s_3, s_4, s_5}, obtaining five partitioning results; assign the data of the original data set X to these five partitionings and compute the mean and variance of the error rate;
Step 1.6: take k+1 initial points and repeat Steps 1.1 to 1.5, again assigning the data of the original data set X to the five partitionings to obtain the mean and variance of the error rate; choose μ where the mean and variance of the error rate are lower.
As a further improvement of the above scheme, running the K-prototypes algorithm on the five drawn samples in Step 5 to obtain five partitioning results proceeds as follows:
Step 5.1: input the original data set X containing n samples and randomly select k initial points; let the original data set be X = {X_1, X_2, X_3, …, X_n}, with n data points in total, each X_i having m attributes, i.e. X_i = {x_i1, x_i2, x_i3, …, x_im}; let the initial prototype set formed by the k initial points be V = {V_1, V_2, V_3, …, V_k};
Step 5.2: according to the nearest-distance principle, assign each data point of the original data set X to its nearest cluster C_i and update the center of cluster C_i;
Step 5.3: compute the distance between each data point of the original data set X and the cluster centers obtained in Step 5.2; repartition the data according to the nearest-distance principle and obtain the new center of each cluster C_i;
Step 5.4: repeat Steps 5.2 to 5.3 until the cluster centers no longer change.
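The distance formulas in Steps 5.2 and 5.3 are likewise shown only as images. Below is a minimal, self-contained Python sketch of Steps 5.1 to 5.4, assuming the standard K-prototypes dissimilarity (squared Euclidean distance over numeric attributes plus γ times the number of categorical mismatches); function and parameter names are illustrative, not taken from the patent.

```python
import random

def kprototypes(data, num_idx, cat_idx, k, gamma=1.0, max_iter=100, seed=0):
    """Simplified K-prototypes (Steps 5.1-5.4): alternate between assigning
    records to the nearest prototype and recomputing prototypes."""
    rng = random.Random(seed)
    # Step 5.1: randomly select k initial prototypes V = {V_1, ..., V_k}
    prototypes = [list(p) for p in rng.sample(data, k)]

    def dist(x, v):
        # assumed dissimilarity: squared Euclidean on numeric attributes
        # plus gamma * (number of categorical mismatches)
        d = sum((x[j] - v[j]) ** 2 for j in num_idx)
        return d + gamma * sum(x[j] != v[j] for j in cat_idx)

    labels = None
    for _ in range(max_iter):
        # Steps 5.2-5.3: assign each record to its nearest prototype
        new_labels = [min(range(k), key=lambda c: dist(x, prototypes[c]))
                      for x in data]
        if new_labels == labels:
            break  # Step 5.4: stop when the partition no longer changes
        labels = new_labels
        # update prototypes: mean for numeric attributes, mode for categorical
        for c in range(k):
            members = [x for x, lab in zip(data, labels) if lab == c]
            if not members:
                continue
            for j in num_idx:
                prototypes[c][j] = sum(x[j] for x in members) / len(members)
            for j in cat_idx:
                vals = [x[j] for x in members]
                prototypes[c][j] = max(set(vals), key=vals.count)
    return labels, prototypes
```

For example, `kprototypes([[0.0, 'a'], [0.1, 'a'], [5.0, 'b'], [5.1, 'b']], num_idx=[0], cat_idx=[1], k=2)` separates the two mixed-type groups.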
As a further improvement of the above scheme, integrating the partitioning results of Step 5 according to the ensemble algorithm in Step 6 to obtain the final classification criterion proceeds as follows:
Step 6.1: input the five partitioning results obtained in Step 5 and compute the accuracy r_i of each partitioning result; for the original data set X = {X_1, X_2, X_3, …, X_n}, let the clusters obtained by clustering be C = {C_1, C_2, C_3, …, C_{k-1}, C_k} and the clusters given by the existing class labels of X be C′ = {C′_1, C′_2, C′_3, …, C′_k}; r_i is the fraction of data points assigned to the same class in both C and C′ relative to the total number of data points;
Step 6.2: compute the weight W_i of each partitioning result according to Equation 2;
Step 6.3: weight and integrate the centers of the clusters C_i of the five partitioning results according to Equation 3, obtaining the final classification criterion.
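Equations 2 and 3 are also shown only as images. Under the assumption that Equation 2 normalizes the accuracies into weights W_i = r_i / Σ_j r_j and Equation 3 takes the weighted average of aligned cluster centers, Steps 6.2 to 6.3 can be sketched as follows (names and both formulas are assumptions, not the patent's verbatim equations):

```python
def ensemble_centers(partitions, accuracies):
    """Weighted integration of cluster centers (Steps 6.2-6.3).

    partitions: list of center sets, each a list of k numeric center vectors,
    with clusters assumed already aligned across partitions;
    accuracies: the accuracy r_i of each partitioning result.
    """
    # assumed Equation 2: normalize the accuracies into weights W_i
    total = sum(accuracies)
    weights = [r / total for r in accuracies]
    k = len(partitions[0])
    dim = len(partitions[0][0])
    # assumed Equation 3: final center = weighted average of aligned centers
    final = [[sum(w * p[c][j] for w, p in zip(weights, partitions))
              for j in range(dim)]
             for c in range(k)]
    return weights, final
```

With two toy partitions of two 2-D centers each, `ensemble_centers(parts, [0.75, 0.25])` returns the normalized weights and the weighted centers.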
The beneficial effects of the present invention are as follows. Compared with existing large-scale data clustering techniques, the core algorithm of the effective method for handling mixed-type large-scale data in real life has notable advantages: a global sampling technique makes better use of the information of the whole original data set, yielding well-representative data; a classification criterion is obtained by clustering these data, which not only reduces the number of algorithm iterations but also achieves higher clustering precision; and a clustering-ensemble technique effectively improves the accuracy of the final partition, so that higher clustering precision can be obtained in practical applications.
Brief description of the drawings
Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings, in which:
Fig. 1 compares the mean and variance of the error rate output by the algorithm on different data sets when the parameter μ is determined, where (a) is the comparison on the Transfusion data set, (b) on the Banknote data set, and (c) on the HTRU2 data set.
Specific embodiments
To evaluate the method of the present invention, which can effectively handle mixed-type large-scale data in real life, 4 data sets were selected from the UCI repository and 8 data sets were generated artificially. From the UCI repository we chose the four data sets Transfusion, Banknote, HTRU2 and Activity Recognition. The Activity Recognition data set is a real benchmark for the field of activity recognition. The HTRU2 data set describes pulsar candidates collected during a high time-resolution universe survey; each record has 9 attributes. The Transfusion data set comes from the donor database of the Blood Transfusion Service Center in Hsinchu City, Taiwan. The data in Banknote were extracted from images of genuine and forged banknotes and have five attributes: the variance of the wavelet-transformed image, the skewness of the wavelet-transformed image, the curtosis of the wavelet-transformed image, the entropy of the image, and the class. These four data sets differ considerably in the number of instances, and HTRU2 and Activity Recognition have dimensionality h = 9, which is relatively high; data sets chosen in this way are well representative and reflect the performance of the algorithm on data sets of different dimensionalities and sizes. The artificial data sets were likewise generated to maximize the differences between them: Artificial data 1 to Artificial data 4 have h = 3 but differ considerably in the number of instances, and differ considerably in h from Artificial data 5 to Artificial data 8, so that the algorithm's performance can be tested more comprehensively. Detailed information on the data sets is given in Table 1.
Step 1: randomly choose μk initial points from the data sets of Table 1, where k is the number of classes of data set X and μ is a parameter; the specific procedure is given in Steps 1.1 to 1.6.
Step 1.1: from the original data sets Transfusion, Banknote and HTRU2, each containing n samples, randomly select k initial points; the specific value of k is the "class" column of Table 1, and the number of data points is the "instance" column of Table 1;
Step 1.2: partition the data in data set X according to the nearest-distance principle, obtaining k clusters;
Step 1.3: compute the reasonable number s of data points to be drawn according to Equation 1;
Step 1.4: from each cluster C_i of the above k clusters, randomly draw samples with replacement at the ratio s/n; drawing five times in total yields the sample set S = {s_1, s_2, s_3, s_4, s_5};
Step 1.5: run the K-prototypes algorithm on the drawn sample set S = {s_1, s_2, s_3, s_4, s_5}, obtaining five partitioning results; assign the data of the original data set X to these five partitionings and compute the mean and variance of the error rate.
Step 1.6: take k+1 initial points and repeat Steps 1.1 to 1.5, again obtaining the mean and variance of the error rate, and choose μ where the mean and variance of the error rate are lower. Plotting the different values of k on the horizontal axis and the mean E(1 - r_i) and variance V(1 - r_i) of the error rate on the vertical axis yields Fig. 1.
First k initial points are taken, then k+1, then k+2, and so on. Fig. 1(a) shows that for k = 5 the values of E(1 - r_i) and V(1 - r_i) are largest; for k = 6 the mean E(1 - r_i) and variance V(1 - r_i) of the partition error rate on the Transfusion data set are smallest, i.e. the partition is best; as k increases further, the mean and variance of the partition error rate change little. Weighing algorithm complexity against the mean and variance of the error rate gives μ = 6/2 = 3 on the Transfusion data set, i.e. μ = 3 is appropriate. Fig. 1(b) shows that for k = 9 the value of V(1 - r_i) on the Banknote data set is smallest but E(1 - r_i) is slightly higher; for k = 4 the mean E(1 - r_i) of the partition error rate is smallest but the variance V(1 - r_i) is larger; for k = 8 both E(1 - r_i) and V(1 - r_i) are small, so, weighing algorithm complexity against the mean and variance of the error rate, μ = 4 is appropriate on the Banknote data set. Fig. 1(c) shows that for k = 10 the value of E(1 - r_i) on the HTRU2 data set is smallest and V(1 - r_i) is also small, but k = 10 takes a long time; for k = 3 the output V(1 - r_i) is lowest but the mean E(1 - r_i) of the error rate is larger; for k = 8 E(1 - r_i) on the HTRU2 data set is smallest and V(1 - r_i) is small; weighing running time against the mean and variance of the error rate, μ = 4 is appropriate on the HTRU2 data set.
In summary, the present invention sets μ to 4.
Step 2: partition the data in X according to the nearest-distance principle, obtaining μk clusters;
Step 3: compute the reasonable number s of data points to be drawn according to Equation 1;
Step 4: from each cluster C_i of the above 4k clusters, randomly draw samples with replacement at the ratio s/n; drawing five times in total yields the sample set S = {s_1, s_2, s_3, s_4, s_5};
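Steps 3 and 4 can be sketched in Python as follows, assuming Equation 1 takes the Chernoff-bound form used by the CURE algorithm (an assumption; the formula image is not reproduced in the text) and that the s/n ratio is allocated proportionally across clusters; all names are illustrative.

```python
import math
import random

def chernoff_sample_size(n, n_i, f=0.1, delta=0.05):
    """Assumed form of Equation 1: smallest s such that a uniform draw of s
    points out of n contains at least f * n_i points of cluster C_i with
    probability 1 - delta (the bound used by the CURE algorithm)."""
    log_d = math.log(1.0 / delta)
    return math.ceil(f * n + (n / n_i) * log_d
                     + (n / n_i) * math.sqrt(log_d ** 2 + 2 * f * n_i * log_d))

def draw_sample_sets(clusters, n, s, trials=5, seed=0):
    """Step 4: from each cluster draw with replacement at the ratio s/n,
    repeated five times, giving S = {s_1, ..., s_5}."""
    rng = random.Random(seed)
    sample_sets = []
    for _ in range(trials):
        sample = []
        for cluster in clusters:
            take = max(1, round(len(cluster) * s / n))  # proportional share
            sample.extend(rng.choices(cluster, k=take))  # with replacement
        sample_sets.append(sample)
    return sample_sets
```

With n = 200 split into two clusters of 100, f = 0.1 and δ = 0.05, the assumed bound gives s = 43, and each of the five sample sets then contains about s data points drawn proportionally from the clusters.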
Step 5: run the K-prototypes algorithm on the five drawn sample sets S = {s_1, s_2, s_3, s_4, s_5}, obtaining five partitioning results; the K-prototypes procedure is given in Steps 5.1 to 5.4;
Step 5.1: input the original data set X containing n samples and randomly select k initial points; let the original data set be X = {X_1, X_2, X_3, …, X_n}, with n data points in total, each X_i having m attributes, i.e. X_i = {x_i1, x_i2, x_i3, …, x_im}; let the initial prototype set formed by the k initial points be V = {V_1, V_2, V_3, …, V_k};
Step 5.2: according to the nearest-distance principle, assign each data point of the original data set X to its nearest cluster C_i and update the center of cluster C_i;
Step 5.3: compute the distance between each data point of the original data set X and the cluster centers obtained in Step 5.2; repartition the data according to the nearest-distance principle and obtain the new center of each cluster C_i;
Step 5.4: repeat Steps 5.2 to 5.3 until the cluster centers no longer change.
Step 6: integrate the partitioning results of Step 5 according to the ensemble algorithm, obtaining k cluster centers; the ensemble algorithm proceeds as in Steps 6.1 to 6.3.
Step 6.1: input the five partitioning results obtained in Step 5 and compute the accuracy r_i of each partitioning result; for the original data set X = {X_1, X_2, X_3, …, X_n}, let the clusters obtained by clustering be C = {C_1, C_2, C_3, …, C_{k-1}, C_k} and the clusters given by the existing class labels of X be C′ = {C′_1, C′_2, C′_3, …, C′_k}; r_i is the fraction of data points assigned to the same class in both C and C′ relative to the total number of data points;
Step 6.2: compute the weight W_i of each partitioning result according to Equation 2;
Step 6.3: weight and integrate the five partitioning results' cluster centers according to Equation 3, obtaining the final classification criterion.
RI (Rand index) and algorithm running time T(s) are chosen as evaluation indices; the closer the RI value is to 1, the better the partition, and the shorter T(s), the more efficient the algorithm. The detailed comparison is given in Table 2, from which it can be seen that the RI value of the inventive algorithm is highest on all four data sets, indicating that the inventive algorithm partitions best. On the Transfusion and Banknote data sets the running time of the inventive algorithm is slightly longer than that of the K-prototypes algorithm, but on the HTRU2 and Activity Recognition data sets the inventive algorithm runs faster; the higher the dimensionality h and the more instances a data set has, the faster the inventive algorithm runs. On the 8 artificial data sets the RI value of the inventive algorithm is the highest. When the number of data points in an artificial data set is small, the inventive algorithm runs more slowly than the K-prototypes algorithm; when it is large, the inventive algorithm runs faster, and the more instances an artificial data set has, the more obvious the speed advantage of the inventive algorithm becomes; when the dimensionality h of the data set increases, the running time of the invention is far shorter than that of the K-prototypes algorithm.
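The RI metric used in the comparison above is the standard Rand index: the fraction of sample pairs on which the predicted partition and the true labels agree. A minimal implementation for reproducing such a comparison might look like this (illustrative, not the patent's own code):

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Rand index: fraction of point pairs that both partitions treat the
    same way (together in both, or separated in both)."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum((labels_true[i] == labels_true[j]) ==
                (labels_pred[i] == labels_pred[j])
                for i, j in pairs)
    return agree / len(pairs)
```

Note that `rand_index([0, 0, 1, 1], [1, 1, 0, 0])` is 1.0: the Rand index is invariant to relabeling of the clusters.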
Further, a real-life data set is chosen to verify the method of the present invention; the specific steps are as in Steps 1 to 6 below.
The Adult data set is taken to further verify the performance of the invention. The Adult data set is extracted from a census-bureau database and contains 32561 objects with 14 attributes each, of which 10 are categorical and 4 are numeric; the data are divided into two classes and are used to predict whether a person's annual income exceeds 50K.
Step 1: randomly choose 8 initial points from the Adult data set containing 32561 data points;
Step 2: partition the data in the Adult data set according to the nearest-distance principle, obtaining 8 clusters;
Step 3: compute the reasonable number s of data points to be drawn according to Equation 1;
Step 4: from each of the above 8 clusters, randomly draw data with replacement at the ratio s/n; drawing five times in total yields the data sets S = {s_1, s_2, s_3, s_4, s_5};
Step 5: run the K-prototypes algorithm on the five drawn data sets S = {s_1, s_2, s_3, s_4, s_5}, obtaining five partitioning results; the K-prototypes procedure is given in Steps 5.1 to 5.4:
Step 5.1: input the Adult data set containing 32561 data points and randomly select 2 initial points;
Step 5.2: according to the nearest-distance principle, assign each data point of the Adult data set to its nearest cluster C_i and update the center of cluster C_i;
Step 5.3: compute the distance between each data point of the Adult data set and the cluster centers obtained in Step 5.2; repartition the data according to the nearest-distance principle and obtain the new center of each cluster C_i;
Step 5.4: repeat Steps 5.2 to 5.3 until the cluster centers no longer change;
Step 6: integrate the five partitioning results of Step 5 according to the ensemble algorithm, obtaining the final classification criterion; the ensemble algorithm proceeds as in Steps 6.1 to 6.3.
Step 6.1: input the five partitioning results obtained in Step 5 and compute the accuracy r_i of each partitioning result; for the original data set X = {X_1, X_2, X_3, …, X_n}, let the clusters obtained by clustering be C = {C_1, C_2, C_3, …, C_{k-1}, C_k} and the clusters given by the existing class labels of X be C′ = {C′_1, C′_2, C′_3, …, C′_k}; r_i is the fraction of data points assigned to the same class in both C and C′ relative to the total number of data points;
Step 6.2: compute the weight W_i of each partitioning result according to Equation 2;
Step 6.3: weight and integrate the five partitioning results' cluster centers according to Equation 3, obtaining the final classification criterion.
The running time of the method of the present invention is computed, and the RI value of the invention's partition is computed according to the true labels in the Adult data set.
The comparison is shown in Table 3, from which it can be seen that the RI value of the inventive algorithm on the Adult data set is the highest and its time consumption is less than that of the K-prototypes algorithm, indicating the superior performance of the inventive algorithm.
Table 1: data set information
Table 2: comparison of results on different data sets
Table 2.1: comparison of results on the real data sets
Table 2.2: comparison of experimental results on the artificial data sets
Table 3: comparison of results on the real-life data set
The above embodiments are not limited to their own technical solutions; embodiments may be combined with each other to form new embodiments. The above embodiments are intended only to illustrate the technical solution of the present invention and not to limit it; any modification or equivalent replacement that does not depart from the spirit and scope of the present invention shall fall within the scope of the technical solution of the present invention.
Claims (4)
1. An effective method for handling mixed-type large-scale data in real life, characterized by comprising the following steps:
Step 1: randomly choose μk initial points from the data set X containing n samples, where k is the number of classes of data set X and μ is a parameter;
Step 2: partition the data in data set X according to the nearest-distance principle, obtaining μk clusters;
Step 3: compute the reasonable number s of samples to be drawn; the formula for s is shown in Equation 1, where f is the fraction to be drawn, n is the number of data points in data set X, and n_i is the number of data points in cluster C_i; Equation 1 means that, with probability 1 - δ (0 < δ < 1), no fewer than f × n_i data points are drawn from C_i;
Step 4: from each cluster C_i of the above μk clusters, randomly draw samples with replacement at the ratio s/n; drawing five times in total yields the sample set S = {s_1, s_2, s_3, s_4, s_5};
Step 5: run the K-prototypes algorithm on the five drawn samples, obtaining five partitioning results;
Step 6: integrate the partitioning results of Step 5 according to the ensemble algorithm, obtaining k cluster centers.
2. The effective method for handling mixed-type large-scale data in real life according to claim 1, characterized in that the parameter μ in Step 1 is determined as follows:
Step 1.1: from the original data set X containing n samples, randomly select k initial points;
Step 1.2: partition the data in data set X according to the nearest-distance principle, obtaining k clusters;
Step 1.3: compute the reasonable sample number s to be drawn according to Equation 1;
Step 1.4: from each cluster C_i of the above k clusters, randomly draw samples with replacement at the ratio s/n; drawing five times in total yields the sample set S = {s_1, s_2, s_3, s_4, s_5};
Step 1.5: run the K-prototypes algorithm on the drawn sample set S = {s_1, s_2, s_3, s_4, s_5}, obtaining five partitioning results; assign the data of the original data set X to these five partitionings and compute the mean and variance of the error rate;
Step 1.6: take k+1 initial points and repeat Steps 1.1 to 1.5, again assigning the data of the original data set X to the five partitionings to obtain the mean and variance of the error rate; choose μ where the mean and variance of the error rate are lower.
3. The effective method for handling mixed-type large-scale data in real life according to claim 1, characterized in that running the K-prototypes algorithm on the five drawn samples in Step 5 to obtain five partitioning results proceeds as follows:
Step 5.1: input the original data set X containing n samples and randomly select k initial points; let the original data set be X = {X_1, X_2, X_3, …, X_n}, with n data points in total, each X_i having m attributes, i.e. X_i = {x_i1, x_i2, x_i3, …, x_im}; let the initial prototype set formed by the k initial points be V = {V_1, V_2, V_3, …, V_k};
Step 5.2: according to the nearest-distance principle, assign each data point of the original data set X to its nearest cluster C_i and update the center of cluster C_i;
Step 5.3: compute the distance between each data point of the original data set X and the cluster centers obtained in Step 5.2; repartition the data according to the nearest-distance principle and obtain the new center of each cluster C_i;
Step 5.4: repeat Steps 5.2 to 5.3 until the cluster centers no longer change.
4. The effective method for handling mixed-type large-scale data in real life according to claim 1, characterized in that integrating the partitioning results of Step 5 according to the ensemble algorithm in Step 6 to obtain the final classification criterion proceeds as follows:
Step 6.1: input the five partitioning results obtained in Step 5 and compute the accuracy r_i of each partitioning result; for the original data set X = {X_1, X_2, X_3, …, X_n}, let the clusters obtained by clustering be C = {C_1, C_2, C_3, …, C_{k-1}, C_k} and the clusters given by the existing class labels of X be C′ = {C′_1, C′_2, C′_3, …, C′_k}; r_i is the fraction of data points assigned to the same class in both C and C′ relative to the total number of data points;
Step 6.2: compute the weight W_i of each partitioning result according to Equation 2;
Step 6.3: weight and integrate the centers of the clusters C_i of the five partitioning results according to Equation 3, obtaining the final classification criterion.
Priority application
- CN201910594183.7A: priority date 2019-07-03, filing date 2019-07-03, title: An effective method for handling mixed-type large-scale data in real life
Publication
- CN110309882A, published 2019-10-08
Family
- ID=68079667; 2019-07-03: patent application CN201910594183.7A filed (status: pending)
Cited by
- CN111738319A: priority date 2020-06-11, published 2020-10-02, assignee 佳都新太科技股份有限公司, title: Clustering result evaluation method and device based on large-scale samples
Legal events
- PB01: Publication
- SE01: Entry into force of request for substantive examination
- WD01: Invention patent application deemed withdrawn after publication (application publication date: 20191008)