CN110309882A - An effective method for processing mixed-type large-scale data in real life - Google Patents

An effective method for processing mixed-type large-scale data in real life

Info

Publication number
CN110309882A
CN110309882A CN201910594183.7A
Authority
CN
China
Prior art keywords
data
cluster
data set
obtains
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910594183.7A
Other languages
Chinese (zh)
Inventor
李顺勇
张钰嘉
张苗苗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanxi University
Original Assignee
Shanxi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanxi University
Priority to CN201910594183.7A
Publication of CN110309882A
Legal status: Pending

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/28 Determining representative reference patterns, e.g. by averaging or distorting; Generating dictionaries

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses an effective method for processing mixed-type large-scale data in real life. Compared with existing large-scale data clustering techniques, the core algorithm of the method has notable advantages: a global sampling technique makes better use of the information of the whole original data set and yields well-representative samples; a classification criterion is obtained by cluster analysis of these samples, which not only reduces the number of algorithm iterations but also achieves higher clustering precision; and a cluster-ensemble technique effectively improves the accuracy of the final partition, so higher clustering precision can be obtained in practical applications.

Description

An effective method for processing mixed-type large-scale data in real life
Technical field
The present invention relates to the field of advanced computing and data processing, and more particularly to an effective method for processing mixed-type large-scale data in real life.
Background technique
The data we face in real life are of many kinds, and classifying them effectively is of great importance. When real data are partitioned with an appropriate method, cluster analysis can classify the data effectively. Cluster analysis is an unsupervised technique whose goal is to assign data with high similarity, according to some similarity measure, to the same cluster, so that similarity within a cluster is as large as possible and similarity between clusters is small. The purpose of clustering is to discover the internal structure of the original data set and thus enable deeper correlation analysis of the original data. Traditional clustering algorithms such as the K-prototypes algorithm are widely used because of their low computational complexity and ease of implementation, but the K-prototypes algorithm also has shortcomings. First, the algorithm selects its initial prototypes at random, which makes the results unstable and sensitive to the choice of initial prototypes. Second, the algorithm requires many iterations, so its running time becomes long when the data volume is large or the data dimensionality is high.
The rapid progress of information technology has brought fast-moving change to every aspect of human society, and behind these changes lies the generation and accumulation of massive data in all walks of life. The volume of data is growing explosively; this growth, together with the increasing complexity of the data itself, poses challenges to cluster analysis, and the traditional K-prototypes algorithm cannot quickly and effectively classify the increasingly complex large-scale data encountered in real life.
Summary of the invention
To overcome the shortcomings and deficiencies of the prior art, an effective method for processing mixed-type large-scale data in real life is provided, so that when processing mixed-type large-scale data sets in real life the method not only reduces complexity and shortens running time but also achieves higher partitioning precision.
To achieve the object of the present invention, an effective method for processing mixed-type large-scale data in real life is provided, comprising the following steps:
Step 1: randomly select μk initial points from the data set X containing n samples, where k is the number of classes of data set X and μ is a parameter;
Step 2: partition the data in data set X according to the nearest-distance principle, obtaining μk clusters;
Step 3: calculate the reasonable number of samples s to be drawn; the formula for s is shown in Equation 1 (a plausible reconstruction is given after this step list);
In Equation 1, f is the fraction to be drawn, n is the number of data points in data set X, and n_i is the number of data points in cluster C_i; the equation states that, with probability 1-δ (0 < δ < 1), at least f × n_i data points are drawn from C_i;
Step 4: from each cluster C_i of the above μk clusters, randomly draw samples with replacement at the ratio s/n; draw five times in total, obtaining the sample set S = {s_1, s_2, s_3, s_4, s_5};
Step 5: run the K-prototypes algorithm on the five drawn samples, obtaining five partition results;
Step 6: integrate the partition results of step 5 according to the ensemble algorithm, obtaining k cluster centers.
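Equation 1 is printed as an image in the original publication and is not reproduced in this text. Based on the stated guarantee (drawing, with probability 1-δ, at least f × n_i points of cluster C_i), a plausible reconstruction is the Chernoff-bound minimum sample size used in CURE-style sampling; this form is an assumption, not a quotation of the patent:

s \ge f n + \frac{n}{n_i} \ln\frac{1}{\delta} + \frac{n}{n_i} \sqrt{\ln^{2}\frac{1}{\delta} + 2 f n_i \ln\frac{1}{\delta}} \quad \text{(Equation 1, assumed form)}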
As a further improvement of the foregoing solution, the parameter μ in step 1 is determined as follows:
Step 1.1: from the original data set X containing n samples, randomly select k initial points;
Step 1.2: partition the data in data set X according to the nearest-distance principle, obtaining k clusters;
Step 1.3: calculate the reasonable number of samples s to be drawn according to Equation 1;
Step 1.4: from each cluster C_i of the above k clusters, randomly draw samples with replacement at the ratio s/n; draw five times in total, obtaining the sample set S = {s_1, s_2, s_3, s_4, s_5};
Step 1.5: run the K-prototypes algorithm on the drawn sample set S = {s_1, s_2, s_3, s_4, s_5}, obtaining five partition results; assign the data of the original data set X to these five partitions and compute the mean and variance of the error rate;
Step 1.6: take k+1 initial points and repeat steps 1.1-1.5, again assigning the data of the original data set X to the five partitions to obtain the mean and variance of the error rate; select the μ for which both the mean and the variance of the error rate are low. The sampling of steps 1.3-1.4 is sketched below.
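The sampling of steps 1.3-1.4 (and of steps 3-4 in the main procedure) can be sketched in Python as follows. This is a minimal sketch assuming the Chernoff-bound form of Equation 1 given above; chernoff_sample_size and draw_five_samples are illustrative names, not names from the patent:

import numpy as np

def chernoff_sample_size(n, n_i, f=0.1, delta=0.05):
    # Smallest s such that, with probability 1 - delta, at least f * n_i
    # points of cluster C_i appear in the sample (assumed Equation 1).
    L = np.log(1.0 / delta)
    return int(np.ceil(f * n + (n / n_i) * L
                       + (n / n_i) * np.sqrt(L * L + 2.0 * f * n_i * L)))

def draw_five_samples(clusters, s, n, seed=None):
    # Steps 1.4 / 4: from every cluster draw, with replacement, an s/n
    # fraction of its points; repeat five times for S = {s1, ..., s5}.
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(5):
        parts = [c[rng.integers(0, len(c), size=max(1, round(len(c) * s / n)))]
                 for c in clusters]
        samples.append(np.vstack(parts))
    return samples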
As a further improvement of the foregoing solution, running the K-prototypes algorithm on the five drawn samples in step 5 to obtain five partition results specifically comprises:
Step 5.1: input the original data set X containing n samples and randomly select k initial points; let the original data set be X = {X_1, X_2, X_3, …, X_n}, where X contains n data points and each X_i has m attributes, i.e. X_i = {x_i1, x_i2, x_i3, …, x_im}; let the initial prototype set formed by the k initial points be V = {V_1, V_2, V_3, …, V_k};
Step 5.2: according to the nearest-distance principle, assign each data point of the original data set X to its nearest cluster C_i and update the center of C_i;
Step 5.3: calculate the distance between each data point of the original data set X and the cluster centers obtained in step 5.2, repartition the data according to the nearest-distance principle, and obtain the new center of each cluster C_i;
Step 5.4: repeat steps 5.2-5.3 until the cluster centers no longer change (one assignment-update round is sketched below).
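The nearest-distance rule itself appears only as an image in the patent. K-prototypes conventionally (Huang, 1998) measures mixed dissimilarity as the squared Euclidean distance over the numeric attributes plus a weight γ times the number of mismatched categorical attributes; the sketch below implements one assignment-update round (steps 5.2-5.3) under that assumption, with X_num/X_cat as numeric and categorical attribute arrays and V_num/V_cat as the prototype lists:

import numpy as np

def kproto_distance(x_num, x_cat, v_num, v_cat, gamma=1.0):
    # Huang-style mixed dissimilarity (assumed; the patent's rule is an image):
    # squared Euclidean on numeric attributes + gamma * categorical mismatches.
    return np.sum((x_num - v_num) ** 2) + gamma * np.sum(x_cat != v_cat)

def assign_and_update(X_num, X_cat, V_num, V_cat, gamma=1.0):
    # One round of steps 5.2-5.3: assign each point to its nearest prototype,
    # then recompute prototypes (numeric mean, categorical column-wise mode).
    k = len(V_num)
    labels = np.array([
        min(range(k), key=lambda j: kproto_distance(xn, xc, V_num[j], V_cat[j], gamma))
        for xn, xc in zip(X_num, X_cat)])
    for j in range(k):
        members = labels == j
        if members.any():
            V_num[j] = X_num[members].mean(axis=0)
            V_cat[j] = np.array([max(set(col), key=list(col).count)
                                 for col in X_cat[members].T])
    return labels, V_num, V_cat

Iterating assign_and_update until the labels stop changing reproduces step 5.4.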
As a further improvement of the foregoing solution, integrating the partition results of step 5 according to the ensemble algorithm in step 6 to obtain the final classification criterion specifically comprises:
Step 6.1: input the five partition results obtained in step 5 and calculate the accuracy r_i of each partition result; for the original data set X = {X_1, X_2, X_3, …, X_n}, let the clusters obtained by clustering be C = {C_1, C_2, C_3, …, C_{k-1}, C_k} and let the clusters given by the existing class labels of X be C' = {C'_1, C'_2, C'_3, …, C'_k}; r_i is the ratio of the number of data points assigned to the same class in both C and C' to the total number of data points;
Step 6.2: calculate the weight W_i of each partition result according to Equation 2;
Step 6.3: weight and integrate the cluster centers C_i of the five partition results according to Equation 3, obtaining the final classification criterion l.
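Equations 2 and 3 are likewise printed as images in the original publication. From the surrounding description (weights derived from the accuracies r_i, then a weighted combination of the five sets of cluster centers V^{(i)}), plausible reconstructions, stated here as assumptions, are:

W_i = \frac{r_i}{\sum_{j=1}^{5} r_j} \quad \text{(Equation 2, assumed form)}

l = \sum_{i=1}^{5} W_i V^{(i)} \quad \text{(Equation 3, assumed form)}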
The beneficial effects of the present invention are:
Compared with existing large-scale data clustering techniques, the core algorithm of the effective method for processing mixed-type large-scale data in real life of the present invention has notable advantages: the global sampling technique makes better use of the information of the whole original data set and yields well-representative data; a classification criterion is obtained by cluster analysis of these data, which not only reduces the number of algorithm iterations but also yields higher clustering precision; and the cluster-ensemble technique effectively improves the accuracy of the final partition, so higher clustering precision can be obtained in practical applications.
Detailed description of the invention
Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings, in which:
Fig. 1 compares the mean and variance of the error rate output by the algorithm on different data sets when the parameter μ is determined, where (a) is the comparison on the Transfusion data set, (b) the comparison on the Banknote data set, and (c) the comparison on the HTRU2 data set.
Specific embodiment
For the method of the present invention, which can effectively process mixed-type large-scale data in real life, 4 data sets were selected from the UCI database and 8 data sets were generated artificially. From the UCI database we chose the four data sets Transfusion, Banknote, HTRU2, and Activity Recognition. The Activity Recognition data set represents a real benchmark for the field of activity recognition. The HTRU2 data set describes pulsar candidates collected during the High Time Resolution Universe survey; each record has 9 attributes. The Transfusion data set comes from the donor database of the Blood Transfusion Service Center in Hsinchu City, Taiwan. The data in Banknote were extracted from images of genuine and forged banknotes and have five attributes: the variance of the wavelet-transformed image, the skewness of the wavelet-transformed image, the kurtosis of the wavelet-transformed image, the entropy of the image, and the class. These four data sets differ greatly in the number of instances, and the dimensionality h of the HTRU2 and Activity Recognition data sets is 9, which is relatively high, so the chosen data sets are representative and can better reflect the performance of the algorithm when processing data sets of different dimensionalities and sizes. The artificially generated data sets were likewise made as different from each other as possible: Artificial data 1 to Artificial data 4 have h = 3 but differ greatly in the number of instances, and their h differs greatly from that of Artificial data 5 to Artificial data 8, so the algorithm's performance can be tested more comprehensively. Detailed information about the data sets is given in Table 1.
Step 1: randomly select μk initial points from the data sets of Table 1, where k is the number of classes of data set X and μ is a parameter; the specific procedure is as shown in steps 1.1-1.6.
Step 1.1: from the original data sets Transfusion, Banknote, and HTRU2, each containing n samples, randomly select k initial points, where the value of k is the class count in Table 1 and the number of data points is the instance count in Table 1;
Step 1.2: partition the data in data set X according to the nearest-distance principle, obtaining k clusters;
Step 1.3: calculate the reasonable number of data points s to be drawn according to Equation 1;
Step 1.4: from each cluster C_i of the above k clusters, randomly draw samples with replacement at the ratio s/n; draw five times in total, obtaining the sample set S = {s_1, s_2, s_3, s_4, s_5};
Step 1.5: run the K-prototypes algorithm on the drawn sample set S = {s_1, s_2, s_3, s_4, s_5}, obtaining five partition results; assign the data of the original data set X to these five partitions and compute the mean and variance of the error rate.
Step 1.6: take k+1 initial points and repeat steps 1.1-1.5, again obtaining the mean and variance of the error rate; select the μ for which both the mean and the variance of the error rate are low. Plotting the mean E(1-r_i) and variance V(1-r_i) of the error rate on the vertical axis against the different values of k on the horizontal axis yields Fig. 1.
First k initial points are taken, then k+1, k+2, and so on. From Fig. 1(a) it can be seen that when k = 5 the values of E(1-r_i) and V(1-r_i) are largest; when k = 6 the mean E(1-r_i) and variance V(1-r_i) of the partition error rate on the Transfusion data set are smallest, i.e. the partition effect is best; as k increases further, the mean and variance of the partition error rate change little. Considering the algorithm complexity together with the mean and variance of the error rate, μ = 6/2 = 3 on the Transfusion data set, i.e. μ = 3 is appropriate. From Fig. 1(b) it can be seen that when k = 9 the value of V(1-r_i) on the Banknote data set is smallest but E(1-r_i) is slightly higher; when k = 4 the mean E(1-r_i) of the partition error rate is smallest but the variance V(1-r_i) is larger; when k = 8 the value of E(1-r_i) is low and V(1-r_i) is also small, so considering the algorithm complexity and the mean and variance of the error rate, μ = 4 is appropriate on the Banknote data set. From Fig. 1(c) it can be seen that when k = 10 the value of E(1-r_i) on the HTRU2 data set is smallest and V(1-r_i) is also small, but k = 10 takes a long time; when k = 3 the output value of V(1-r_i) is lowest but the mean E(1-r_i) of the error rate is larger; when k = 8 the value of E(1-r_i) on the HTRU2 data set is smallest and V(1-r_i) is small; considering the running time and the mean and variance of the error rate, μ = 4 is appropriate on the HTRU2 data set.
In conclusion, the present invention sets μ to 4.
Step 2: partition the data in X according to the nearest-distance principle, obtaining μk clusters;
Step 3: calculate the reasonable number of data points s to be drawn according to Equation 1;
Step 4: from each cluster C_i of the above 4k clusters, randomly draw samples with replacement at the ratio s/n; draw five times in total, obtaining the sample set S = {s_1, s_2, s_3, s_4, s_5};
Step 5: run the K-prototypes algorithm on the five drawn sample sets S = {s_1, s_2, s_3, s_4, s_5}, obtaining five partition results; the K-prototypes procedure is as shown in steps 5.1-5.4;
Step 5.1: input the original data set X containing n samples and randomly select k initial points; let the original data set be X = {X_1, X_2, X_3, …, X_n}, where X contains n data points and each X_i has m attributes, i.e. X_i = {x_i1, x_i2, x_i3, …, x_im}; let the initial prototype set formed by the k initial points be V = {V_1, V_2, V_3, …, V_k};
Step 5.2: according to the nearest-distance principle, assign each data point of the original data set X to its nearest cluster C_i and update the center of C_i;
Step 5.3: calculate the distance between each data point of the original data set X and the cluster centers obtained in step 5.2, repartition the data according to the nearest-distance principle, and obtain the new center of each cluster C_i;
Step 5.4: repeat steps 5.2-5.3 until the cluster centers no longer change.
Step 6: integrate the partition results of step 5 according to the ensemble algorithm, obtaining k cluster centers; the specific steps of the ensemble algorithm are steps 6.1-6.3, sketched in code after this list.
Step 6.1: input the five partition results obtained in step 5 and calculate the accuracy r_i of each partition result; for the original data set X = {X_1, X_2, X_3, …, X_n}, let the clusters obtained by clustering be C = {C_1, C_2, C_3, …, C_{k-1}, C_k} and let the clusters given by the existing class labels of X be C' = {C'_1, C'_2, C'_3, …, C'_k}; r_i is the ratio of the number of data points assigned to the same class in both C and C' to the total number of data points;
Step 6.2: calculate the weight W_i of each partition result according to Equation 2;
Step 6.3: weight and integrate the five partition results' cluster centers according to Equation 3, obtaining the final classification criterion l.
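Steps 6.2-6.3 can be sketched as follows, using the assumed forms of Equations 2 and 3 given earlier; the sketch also assumes the five partitions' clusters have already been matched into a common order, a detail the patent does not spell out:

import numpy as np

def weighted_center_ensemble(centers_list, accuracies):
    # centers_list: five (k, m) arrays of cluster centers, one per partition.
    # accuracies: the five accuracy values r_1 ... r_5 from step 6.1.
    W = np.asarray(accuracies, dtype=float)
    W = W / W.sum()                          # Equation 2 (assumed form)
    centers = np.asarray(centers_list)       # shape (5, k, m)
    return np.tensordot(W, centers, axes=1)  # Equation 3 (assumed form)

For example, weighted_center_ensemble([V1, V2, V3, V4, V5], [r1, r2, r3, r4, r5]) returns the final (k, m) array of centers.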
RI (Rand index) and the algorithm running time T(s) are chosen as evaluation indices: an RI value closer to 1 indicates a better partition, and a shorter T(s) indicates a more efficient algorithm (a short sketch of the RI computation is given after this paragraph). The detailed comparison results are shown in Table 2. From Table 2 it can be seen that the RI value of the inventive algorithm is highest on all four data sets, showing that it partitions best. On the Transfusion and Banknote data sets the running time of the inventive algorithm is slightly longer than that of the K-prototypes algorithm, but on the HTRU2 and Activity Recognition data sets the inventive algorithm runs faster; the higher the dimensionality h and the larger the instance count of a data set, the faster the inventive algorithm runs. On the 8 artificial data sets the RI value of the inventive algorithm is highest; when an artificial data set contains few data points the inventive algorithm runs more slowly than the K-prototypes algorithm, but when it contains many data points the inventive algorithm runs faster, and the more instances an artificial data set has, the more obvious the speed advantage of the inventive algorithm becomes; when the dimensionality h of the data set increases, the running time of the present invention is far shorter than that of the K-prototypes algorithm.
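For reference, the Rand index can be computed by a short pairwise count; this O(n²) sketch is illustrative and is not code from the patent:

from itertools import combinations

def rand_index(labels_true, labels_pred):
    # Fraction of point pairs treated consistently by the two partitions:
    # grouped together in both, or separated in both; 1.0 means identical.
    pairs = combinations(list(zip(labels_true, labels_pred)), 2)
    agree = sum((t1 == t2) == (p1 == p2) for (t1, p1), (t2, p2) in pairs)
    n = len(labels_true)
    return agree / (n * (n - 1) // 2)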
Further, a real-life data set is selected to verify the method of the present invention; the specific steps are as shown in steps 1-6.
The Adult data set is taken to further verify the performance of the invention. The Adult data set is extracted from a Census Bureau database; it contains 32561 objects with 14 attributes, of which 10 are categorical and 4 are numeric, divided into two classes, and is used to predict whether a person's annual income exceeds 50K.
Step 1: randomly select 8 initial points from the Adult data set containing 32561 data points;
Step 2: partition the data in the Adult data set according to the nearest-distance principle, obtaining 8 clusters;
Step 3: calculate the reasonable number of data points s to be drawn according to Equation 1;
Step 4: from each of the above 8 clusters, randomly draw data with replacement at the ratio s/n; draw five times in total, obtaining the data sets S = {s_1, s_2, s_3, s_4, s_5};
Step 5: run the K-prototypes algorithm on the five drawn data sets S = {s_1, s_2, s_3, s_4, s_5}, obtaining five partition results; the K-prototypes procedure is as shown in steps 5.1-5.4:
Step 5.1: input the Adult data set containing 32561 data points and randomly select 2 initial points;
Step 5.2: according to the nearest-distance principle, assign each data point of the Adult data set to its nearest cluster C_i and update the center of C_i;
Step 5.3: calculate the distance between each data point of the Adult data set and the cluster centers obtained in step 5.2, repartition the data according to the nearest-distance principle, and obtain the new center of each cluster C_i;
Step 5.4: repeat steps 5.2-5.3 until the cluster centers no longer change;
Step 6: integrate the five partition results of step 5 according to the ensemble algorithm, obtaining the final classification criterion; the specific steps of the ensemble algorithm are steps 6.1-6.3.
Step 6.1: input the five partition results obtained in step 5 and calculate the accuracy r_i of each partition result; for the original data set X = {X_1, X_2, X_3, …, X_n}, let the clusters obtained by clustering be C = {C_1, C_2, C_3, …, C_{k-1}, C_k} and let the clusters given by the existing class labels of X be C' = {C'_1, C'_2, C'_3, …, C'_k}; r_i is the ratio of the number of data points assigned to the same class in both C and C' to the total number of data points;
Step 6.2: calculate the weight W_i of each partition result according to Equation 2;
Step 6.3: weight and integrate the five partition results' cluster centers according to Equation 3, obtaining the final classification criterion l.
The running time of the method of the present invention is measured, and the RI value of the partition produced by the present invention is calculated against the true labels in the Adult data set;
The comparison results are shown in Table 3: the RI value of the inventive algorithm is highest on the Adult data set and its running time is shorter than that of the K-prototypes algorithm, demonstrating the superior performance of the inventive algorithm.
Table 1: data set information
Table 2: comparison of results on different data sets
Table 2.1: comparison of results on real data sets
Table 2.2: comparison of experimental results on artificial data sets
Table 3: comparison of results on the real-life data set
The above embodiments are not limited to their own technical solutions; embodiments may be combined with one another to form new embodiments. The above embodiments are intended only to illustrate the technical solution of the present invention and not to limit it; any modification or equivalent replacement that does not depart from the spirit and scope of the present invention shall fall within the scope of the technical solution of the present invention.

Claims (4)

1. An effective method for processing mixed-type large-scale data in real life, characterized by comprising the following steps:
Step 1: randomly select μk initial points from the data set X containing n samples, where k is the number of classes of data set X and μ is a parameter;
Step 2: partition the data in data set X according to the nearest-distance principle, obtaining μk clusters;
Step 3: calculate the reasonable number of samples s to be drawn; the formula for s is shown in Equation 1;
In Equation 1, f is the fraction to be drawn, n is the number of data points in data set X, and n_i is the number of data points in cluster C_i; the equation states that, with probability 1-δ (0 < δ < 1), at least f × n_i data points are drawn from C_i;
Step 4: from each cluster C_i of the above μk clusters, randomly draw samples with replacement at the ratio s/n; draw five times in total, obtaining the sample set S = {s_1, s_2, s_3, s_4, s_5};
Step 5: run the K-prototypes algorithm on the five drawn samples, obtaining five partition results;
Step 6: integrate the partition results of step 5 according to the ensemble algorithm, obtaining k cluster centers.
2. The effective method for processing mixed-type large-scale data in real life according to claim 1, characterized in that the parameter μ in step 1 is determined as follows:
Step 1.1: from the original data set X containing n samples, randomly select k initial points;
Step 1.2: partition the data in data set X according to the nearest-distance principle, obtaining k clusters;
Step 1.3: calculate the reasonable number of samples s to be drawn according to Equation 1;
Step 1.4: from each cluster C_i of the above k clusters, randomly draw samples with replacement at the ratio s/n; draw five times in total, obtaining the sample set S = {s_1, s_2, s_3, s_4, s_5};
Step 1.5: run the K-prototypes algorithm on the drawn sample set S = {s_1, s_2, s_3, s_4, s_5}, obtaining five partition results; assign the data of the original data set X to these five partitions and compute the mean and variance of the error rate;
Step 1.6: take k+1 initial points and repeat steps 1.1-1.5, again assigning the data of the original data set X to the five partitions to obtain the mean and variance of the error rate; select the μ for which both the mean and the variance of the error rate are low.
3. The effective method for processing mixed-type large-scale data in real life according to claim 1, characterized in that running the K-prototypes algorithm on the five drawn samples in step 5 to obtain five partition results specifically comprises:
Step 5.1: input the original data set X containing n samples and randomly select k initial points; let the original data set be X = {X_1, X_2, X_3, …, X_n}, where X contains n data points and each X_i has m attributes, i.e. X_i = {x_i1, x_i2, x_i3, …, x_im}; let the initial prototype set formed by the k initial points be V = {V_1, V_2, V_3, …, V_k};
Step 5.2: according to the nearest-distance principle, assign each data point of the original data set X to its nearest cluster C_i and update the center of C_i;
Step 5.3: calculate the distance between each data point of the original data set X and the cluster centers obtained in step 5.2, repartition the data according to the nearest-distance principle, and obtain the new center of each cluster C_i;
Step 5.4: repeat steps 5.2-5.3 until the cluster centers no longer change.
4. The effective method for processing mixed-type large-scale data in real life according to claim 1, characterized in that integrating the partition results of step 5 according to the ensemble algorithm in step 6 to obtain the final classification criterion specifically comprises:
Step 6.1: input the five partition results obtained in step 5 and calculate the accuracy r_i of each partition result; for the original data set X = {X_1, X_2, X_3, …, X_n}, let the clusters obtained by clustering be C = {C_1, C_2, C_3, …, C_{k-1}, C_k} and let the clusters given by the existing class labels of X be C' = {C'_1, C'_2, C'_3, …, C'_k}; r_i is the ratio of the number of data points assigned to the same class in both C and C' to the total number of data points;
Step 6.2: calculate the weight W_i of each partition result according to Equation 2;
Step 6.3: weight and integrate the cluster centers C_i of the five partition results according to Equation 3, obtaining the final classification criterion l.
CN201910594183.7A 2019-07-03 2019-07-03 An effective method for processing mixed-type large-scale data in real life Pending CN110309882A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910594183.7A CN110309882A (en) 2019-07-03 2019-07-03 An effective method for processing mixed-type large-scale data in real life

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910594183.7A CN110309882A (en) 2019-07-03 2019-07-03 An effective method for processing mixed-type large-scale data in real life

Publications (1)

Publication Number Publication Date
CN110309882A true CN110309882A (en) 2019-10-08

Family

ID=68079667

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910594183.7A Pending CN110309882A (en) An effective method for processing mixed-type large-scale data in real life

Country Status (1)

Country Link
CN (1) CN110309882A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111738319A (en) * 2020-06-11 2020-10-02 佳都新太科技股份有限公司 Clustering result evaluation method and device based on large-scale samples

Similar Documents

Publication Publication Date Title
Thenmozhi et al. Heart disease prediction using classification with different decision tree techniques
Genolini et al. KmL: k-means for longitudinal data
US6871201B2 (en) Method for building space-splitting decision tree
CN108846259A A gene classification method and system based on clustering and the random forest algorithm
CN107832456B (en) Parallel KNN text classification method based on critical value data division
US11971892B2 (en) Methods for stratified sampling-based query execution
Zhang et al. Novel density-based and hierarchical density-based clustering algorithms for uncertain data
CN106228554A Fuzzy rough set coal dust image segmentation methods based on multiple attribute reductions
CN110826618A (en) Personal credit risk assessment method based on random forest
Parashar et al. An efficient classification approach for data mining
Gunjan Instantaneous approach for evaluating the initial centers in the agricultural databases using K-means clustering algorithm
Evchenko et al. Frugal machine learning
CN109684477A A patent text feature extraction method and system
WO2012041861A2 (en) Computer-implemented method for analyzing multivariate data
Bruzzese et al. DESPOTA: DEndrogram slicing through a permutation test approach
Mandal et al. Unsupervised non-redundant feature selection: a graph-theoretic approach
Dahiya et al. A rank aggregation algorithm for ensemble of multiple feature selection techniques in credit risk evaluation
CN110309882A An effective method for processing mixed-type large-scale data in real life
Heckerman et al. An experimental comparison of several clustering and initialization methods
Chen et al. See more for scene: Pairwise consistency learning for scene classification
Pandeeswari et al. K-means clustering and Naïve Bayes classifier for categorization of diabetes patients
CN104468276A Network traffic identification method based on random sampling and multiple classifiers
Akyol Clustering hotels and analyzing the importance of their features by machine learning techniques
CN113221966A (en) Differential privacy decision tree construction method based on F _ Max attribute measurement
Al-Mhairat et al. Performance Evaluation of clustering Algorithms

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20191008