CN110309882A - An effective method for processing mixed-type large-scale data in real life - Google Patents

An effective method for processing mixed-type large-scale data in real life

Info

Publication number
CN110309882A
CN110309882A CN201910594183.7A
Authority
CN
China
Prior art keywords
data
cluster
data set
obtains
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910594183.7A
Other languages
Chinese (zh)
Inventor
李顺勇
张钰嘉
张苗苗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanxi University
Original Assignee
Shanxi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanxi University
Priority to CN201910594183.7A
Publication of CN110309882A
Legal status: Pending

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/28 Determining representative reference patterns, e.g. by averaging or distorting; Generating dictionaries

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses an effective method for processing mixed-type large-scale data in real life. Compared with existing large-scale data clustering techniques, the core algorithm of the method has notable advantages: a global sampling technique makes better use of the information of the whole original data set and yields well-representative samples; a classification criterion is obtained by cluster analysis of these samples, which not only reduces the number of algorithm iterations but also achieves higher clustering precision; and a cluster-ensemble technique effectively improves the accuracy of the final partition, so higher clustering precision can be obtained in practical applications.

Description

An effective method for processing mixed-type large-scale data in real life
Technical field
The present invention relates to the field of advanced computing and data processing, and more particularly to an effective method for processing mixed-type large-scale data in real life.
Background technique
The data we face in real life are of many kinds, and classifying them effectively is of great importance. When real data are partitioned with an appropriate method, cluster analysis can classify the data effectively. Cluster analysis is an unsupervised technique whose goal is to assign data with high similarity, according to some similarity measure, to the same cluster, so that similarity within a cluster is as large as possible and similarity between clusters is small. The purpose of clustering is to discover the internal structure of the original data set and thus enable deeper correlation analysis of the original data. Traditional clustering algorithms such as the K-prototypes algorithm are widely used because of their low computational complexity and ease of implementation, but the K-prototypes algorithm also has shortcomings. First, the algorithm selects its initial prototypes at random, which makes the results unstable and sensitive to the choice of initial prototypes. Second, the algorithm requires many iterations, so its running time becomes long when the data volume is large or the data dimensionality is high.
The rapid progress of information technology has brought fast-moving change to every aspect of human society, and behind these changes lies the generation and accumulation of massive data in all walks of life. The volume of data is growing explosively; this growth, together with the increasing complexity of the data itself, poses challenges to cluster analysis, and the traditional K-prototypes algorithm cannot quickly and effectively classify the increasingly complex large-scale data encountered in real life.
Summary of the invention
To overcome the shortcomings and deficiencies of the prior art, an effective method for processing mixed-type large-scale data in real life is provided, so that when processing mixed-type large-scale data sets in real life the method not only reduces complexity and shortens running time but also achieves higher partitioning precision.
To achieve the object of the present invention, an effective method for processing mixed-type large-scale data in real life is provided, comprising the following steps:
Step 1: randomly select μk initial points from the data set X containing n samples, where k is the number of classes of data set X and μ is a parameter;
Step 2: partition the data in data set X according to the nearest-distance principle, obtaining μk clusters;
Step 3: calculate the reasonable number of samples s to be drawn; the formula for s is shown in Equation 1 (a plausible reconstruction is given after this step list);
In Equation 1, f is the fraction to be drawn, n is the number of data points in data set X, and n_i is the number of data points in cluster C_i; the equation states that, with probability 1-δ (0 < δ < 1), at least f × n_i data points are drawn from C_i;
Step 4: from each cluster C_i of the above μk clusters, randomly draw samples with replacement at the ratio s/n; draw five times in total, obtaining the sample set S = {s_1, s_2, s_3, s_4, s_5};
Step 5: run the K-prototypes algorithm on the five drawn samples, obtaining five partition results;
Step 6: integrate the partition results of step 5 according to the ensemble algorithm, obtaining k cluster centers.
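Equation 1 is printed as an image in the original publication and is not reproduced in this text. Based on the stated guarantee (drawing, with probability 1-δ, at least f × n_i points of cluster C_i), a plausible reconstruction is the Chernoff-bound minimum sample size used in CURE-style sampling; this form is an assumption, not a quotation of the patent:

s \ge f n + \frac{n}{n_i} \ln\frac{1}{\delta} + \frac{n}{n_i} \sqrt{\ln^{2}\frac{1}{\delta} + 2 f n_i \ln\frac{1}{\delta}} \quad \text{(Equation 1, assumed form)}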
As a further improvement of the foregoing solution, the parameter μ in step 1 is determined as follows:
Step 1.1: from the original data set X containing n samples, randomly select k initial points;
Step 1.2: partition the data in data set X according to the nearest-distance principle, obtaining k clusters;
Step 1.3: calculate the reasonable number of samples s to be drawn according to Equation 1;
Step 1.4: from each cluster C_i of the above k clusters, randomly draw samples with replacement at the ratio s/n; draw five times in total, obtaining the sample set S = {s_1, s_2, s_3, s_4, s_5};
Step 1.5: run the K-prototypes algorithm on the drawn sample set S = {s_1, s_2, s_3, s_4, s_5}, obtaining five partition results; assign the data of the original data set X to these five partitions and compute the mean and variance of the error rate;
Step 1.6: take k+1 initial points and repeat steps 1.1-1.5, again assigning the data of the original data set X to the five partitions to obtain the mean and variance of the error rate; select the μ for which both the mean and the variance of the error rate are low. The sampling of steps 1.3-1.4 is sketched below.
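The sampling of steps 1.3-1.4 (and of steps 3-4 in the main procedure) can be sketched in Python as follows. This is a minimal sketch assuming the Chernoff-bound form of Equation 1 given above; chernoff_sample_size and draw_five_samples are illustrative names, not names from the patent:

import numpy as np

def chernoff_sample_size(n, n_i, f=0.1, delta=0.05):
    # Smallest s such that, with probability 1 - delta, at least f * n_i
    # points of cluster C_i appear in the sample (assumed Equation 1).
    L = np.log(1.0 / delta)
    return int(np.ceil(f * n + (n / n_i) * L
                       + (n / n_i) * np.sqrt(L * L + 2.0 * f * n_i * L)))

def draw_five_samples(clusters, s, n, seed=None):
    # Steps 1.4 / 4: from every cluster draw, with replacement, an s/n
    # fraction of its points; repeat five times for S = {s1, ..., s5}.
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(5):
        parts = [c[rng.integers(0, len(c), size=max(1, round(len(c) * s / n)))]
                 for c in clusters]
        samples.append(np.vstack(parts))
    return samples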
As a further improvement of the foregoing solution, running the K-prototypes algorithm on the five drawn samples in step 5 to obtain five partition results specifically comprises:
Step 5.1: input the original data set X containing n samples and randomly select k initial points; let the original data set be X = {X_1, X_2, X_3, …, X_n}, where X contains n data points and each X_i has m attributes, i.e. X_i = {x_i1, x_i2, x_i3, …, x_im}; let the initial prototype set formed by the k initial points be V = {V_1, V_2, V_3, …, V_k};
Step 5.2: according to the nearest-distance principle, assign each data point of the original data set X to its nearest cluster C_i and update the center of C_i;
Step 5.3: calculate the distance between each data point of the original data set X and the cluster centers obtained in step 5.2, repartition the data according to the nearest-distance principle, and obtain the new center of each cluster C_i;
Step 5.4: repeat steps 5.2-5.3 until the cluster centers no longer change (one assignment-update round is sketched below).
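The nearest-distance rule itself appears only as an image in the patent. K-prototypes conventionally (Huang, 1998) measures mixed dissimilarity as the squared Euclidean distance over the numeric attributes plus a weight γ times the number of mismatched categorical attributes; the sketch below implements one assignment-update round (steps 5.2-5.3) under that assumption, with X_num/X_cat as numeric and categorical attribute arrays and V_num/V_cat as the prototype lists:

import numpy as np

def kproto_distance(x_num, x_cat, v_num, v_cat, gamma=1.0):
    # Huang-style mixed dissimilarity (assumed; the patent's rule is an image):
    # squared Euclidean on numeric attributes + gamma * categorical mismatches.
    return np.sum((x_num - v_num) ** 2) + gamma * np.sum(x_cat != v_cat)

def assign_and_update(X_num, X_cat, V_num, V_cat, gamma=1.0):
    # One round of steps 5.2-5.3: assign each point to its nearest prototype,
    # then recompute prototypes (numeric mean, categorical column-wise mode).
    k = len(V_num)
    labels = np.array([
        min(range(k), key=lambda j: kproto_distance(xn, xc, V_num[j], V_cat[j], gamma))
        for xn, xc in zip(X_num, X_cat)])
    for j in range(k):
        members = labels == j
        if members.any():
            V_num[j] = X_num[members].mean(axis=0)
            V_cat[j] = np.array([max(set(col), key=list(col).count)
                                 for col in X_cat[members].T])
    return labels, V_num, V_cat

Iterating assign_and_update until the labels stop changing reproduces step 5.4.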
As a further improvement of the foregoing solution, integrating the partition results of step 5 according to the ensemble algorithm in step 6 to obtain the final classification criterion specifically comprises:
Step 6.1: input the five partition results obtained in step 5 and calculate the accuracy r_i of each partition result; for the original data set X = {X_1, X_2, X_3, …, X_n}, let the clusters obtained by clustering be C = {C_1, C_2, C_3, …, C_{k-1}, C_k} and let the clusters given by the existing class labels of X be C' = {C'_1, C'_2, C'_3, …, C'_k}; r_i is the ratio of the number of data points assigned to the same class in both C and C' to the total number of data points;
Step 6.2: calculate the weight W_i of each partition result according to Equation 2;
Step 6.3: weight and integrate the cluster centers C_i of the five partition results according to Equation 3, obtaining the final classification criterion l.
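Equations 2 and 3 are likewise printed as images in the original publication. From the surrounding description (weights derived from the accuracies r_i, then a weighted combination of the five sets of cluster centers V^{(i)}), plausible reconstructions, stated here as assumptions, are:

W_i = \frac{r_i}{\sum_{j=1}^{5} r_j} \quad \text{(Equation 2, assumed form)}

l = \sum_{i=1}^{5} W_i V^{(i)} \quad \text{(Equation 3, assumed form)}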
The beneficial effects of the present invention are:
Compared with existing large-scale data clustering techniques, the core algorithm of the effective method for processing mixed-type large-scale data in real life of the present invention has notable advantages: the global sampling technique makes better use of the information of the whole original data set and yields well-representative data; a classification criterion is obtained by cluster analysis of these data, which not only reduces the number of algorithm iterations but also yields higher clustering precision; and the cluster-ensemble technique effectively improves the accuracy of the final partition, so higher clustering precision can be obtained in practical applications.
Detailed description of the invention
Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings, in which:
Fig. 1 compares the mean and variance of the error rate output by the algorithm on different data sets when the parameter μ is determined, where (a) is the comparison on the Transfusion data set, (b) the comparison on the Banknote data set, and (c) the comparison on the HTRU2 data set.
Specific embodiment
For the method of the present invention, which can effectively process mixed-type large-scale data in real life, 4 data sets were selected from the UCI database and 8 data sets were generated artificially. From the UCI database we chose the four data sets Transfusion, Banknote, HTRU2, and Activity Recognition. The Activity Recognition data set represents a real benchmark for the field of activity recognition. The HTRU2 data set describes pulsar candidates collected during the High Time Resolution Universe survey; each record has 9 attributes. The Transfusion data set comes from the donor database of the Blood Transfusion Service Center in Hsinchu City, Taiwan. The data in Banknote were extracted from images of genuine and forged banknotes and have five attributes: the variance of the wavelet-transformed image, the skewness of the wavelet-transformed image, the kurtosis of the wavelet-transformed image, the entropy of the image, and the class. These four data sets differ greatly in the number of instances, and the dimensionality h of the HTRU2 and Activity Recognition data sets is 9, which is relatively high, so the chosen data sets are representative and can better reflect the performance of the algorithm when processing data sets of different dimensionalities and sizes. The artificially generated data sets were likewise made as different from each other as possible: Artificial data 1 to Artificial data 4 have h = 3 but differ greatly in the number of instances, and their h differs greatly from that of Artificial data 5 to Artificial data 8, so the algorithm's performance can be tested more comprehensively. Detailed information about the data sets is given in Table 1.
Step 1: randomly select μk initial points from the data sets of Table 1, where k is the number of classes of data set X and μ is a parameter; the specific procedure is as shown in steps 1.1-1.6.
Step 1.1: from the original data sets Transfusion, Banknote, and HTRU2, each containing n samples, randomly select k initial points, where the value of k is the class count in Table 1 and the number of data points is the instance count in Table 1;
Step 1.2: partition the data in data set X according to the nearest-distance principle, obtaining k clusters;
Step 1.3: calculate the reasonable number of data points s to be drawn according to Equation 1;
Step 1.4: from each cluster C_i of the above k clusters, randomly draw samples with replacement at the ratio s/n; draw five times in total, obtaining the sample set S = {s_1, s_2, s_3, s_4, s_5};
Step 1.5: run the K-prototypes algorithm on the drawn sample set S = {s_1, s_2, s_3, s_4, s_5}, obtaining five partition results; assign the data of the original data set X to these five partitions and compute the mean and variance of the error rate.
Step 1.6: take k+1 initial points and repeat steps 1.1-1.5, again obtaining the mean and variance of the error rate; select the μ for which both the mean and the variance of the error rate are low. Plotting the mean E(1-r_i) and variance V(1-r_i) of the error rate on the vertical axis against the different values of k on the horizontal axis yields Fig. 1.
First k initial points are taken, then k+1, k+2, and so on. From Fig. 1(a) it can be seen that when k = 5 the values of E(1-r_i) and V(1-r_i) are largest; when k = 6 the mean E(1-r_i) and variance V(1-r_i) of the partition error rate on the Transfusion data set are smallest, i.e. the partition effect is best; as k increases further, the mean and variance of the partition error rate change little. Considering the algorithm complexity together with the mean and variance of the error rate, μ = 6/2 = 3 on the Transfusion data set, i.e. μ = 3 is appropriate. From Fig. 1(b) it can be seen that when k = 9 the value of V(1-r_i) on the Banknote data set is smallest but E(1-r_i) is slightly higher; when k = 4 the mean E(1-r_i) of the partition error rate is smallest but the variance V(1-r_i) is larger; when k = 8 the value of E(1-r_i) is low and V(1-r_i) is also small, so considering the algorithm complexity and the mean and variance of the error rate, μ = 4 is appropriate on the Banknote data set. From Fig. 1(c) it can be seen that when k = 10 the value of E(1-r_i) on the HTRU2 data set is smallest and V(1-r_i) is also small, but k = 10 takes a long time; when k = 3 the output value of V(1-r_i) is lowest but the mean E(1-r_i) of the error rate is larger; when k = 8 the value of E(1-r_i) on the HTRU2 data set is smallest and V(1-r_i) is small; considering the running time and the mean and variance of the error rate, μ = 4 is appropriate on the HTRU2 data set.
In conclusion, the present invention sets μ to 4.
Step 2: partition the data in X according to the nearest-distance principle, obtaining μk clusters;
Step 3: calculate the reasonable number of data points s to be drawn according to Equation 1;
Step 4: from each cluster C_i of the above 4k clusters, randomly draw samples with replacement at the ratio s/n; draw five times in total, obtaining the sample set S = {s_1, s_2, s_3, s_4, s_5};
Step 5: run the K-prototypes algorithm on the five drawn sample sets S = {s_1, s_2, s_3, s_4, s_5}, obtaining five partition results; the K-prototypes procedure is as shown in steps 5.1-5.4;
Step 5.1: input the original data set X containing n samples and randomly select k initial points; let the original data set be X = {X_1, X_2, X_3, …, X_n}, where X contains n data points and each X_i has m attributes, i.e. X_i = {x_i1, x_i2, x_i3, …, x_im}; let the initial prototype set formed by the k initial points be V = {V_1, V_2, V_3, …, V_k};
Step 5.2: according to the nearest-distance principle, assign each data point of the original data set X to its nearest cluster C_i and update the center of C_i;
Step 5.3: calculate the distance between each data point of the original data set X and the cluster centers obtained in step 5.2, repartition the data according to the nearest-distance principle, and obtain the new center of each cluster C_i;
Step 5.4: repeat steps 5.2-5.3 until the cluster centers no longer change.
Step 6: integrate the partition results of step 5 according to the ensemble algorithm, obtaining k cluster centers; the specific steps of the ensemble algorithm are steps 6.1-6.3, sketched in code after this list.
Step 6.1: input the five partition results obtained in step 5 and calculate the accuracy r_i of each partition result; for the original data set X = {X_1, X_2, X_3, …, X_n}, let the clusters obtained by clustering be C = {C_1, C_2, C_3, …, C_{k-1}, C_k} and let the clusters given by the existing class labels of X be C' = {C'_1, C'_2, C'_3, …, C'_k}; r_i is the ratio of the number of data points assigned to the same class in both C and C' to the total number of data points;
Step 6.2: calculate the weight W_i of each partition result according to Equation 2;
Step 6.3: weight and integrate the five partition results' cluster centers according to Equation 3, obtaining the final classification criterion l.
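Steps 6.2-6.3 can be sketched as follows, using the assumed forms of Equations 2 and 3 given earlier; the sketch also assumes the five partitions' clusters have already been matched into a common order, a detail the patent does not spell out:

import numpy as np

def weighted_center_ensemble(centers_list, accuracies):
    # centers_list: five (k, m) arrays of cluster centers, one per partition.
    # accuracies: the five accuracy values r_1 ... r_5 from step 6.1.
    W = np.asarray(accuracies, dtype=float)
    W = W / W.sum()                          # Equation 2 (assumed form)
    centers = np.asarray(centers_list)       # shape (5, k, m)
    return np.tensordot(W, centers, axes=1)  # Equation 3 (assumed form)

For example, weighted_center_ensemble([V1, V2, V3, V4, V5], [r1, r2, r3, r4, r5]) returns the final (k, m) array of centers.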
RI (Rand index) and the algorithm running time T(s) are chosen as evaluation indices: an RI value closer to 1 indicates a better partition, and a shorter T(s) indicates a more efficient algorithm (a short sketch of the RI computation is given after this paragraph). The detailed comparison results are shown in Table 2. From Table 2 it can be seen that the RI value of the inventive algorithm is highest on all four data sets, showing that it partitions best. On the Transfusion and Banknote data sets the running time of the inventive algorithm is slightly longer than that of the K-prototypes algorithm, but on the HTRU2 and Activity Recognition data sets the inventive algorithm runs faster; the higher the dimensionality h and the larger the instance count of a data set, the faster the inventive algorithm runs. On the 8 artificial data sets the RI value of the inventive algorithm is highest; when an artificial data set contains few data points the inventive algorithm runs more slowly than the K-prototypes algorithm, but when it contains many data points the inventive algorithm runs faster, and the more instances an artificial data set has, the more obvious the speed advantage of the inventive algorithm becomes; when the dimensionality h of the data set increases, the running time of the present invention is far shorter than that of the K-prototypes algorithm.
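For reference, the Rand index can be computed by a short pairwise count; this O(n²) sketch is illustrative and is not code from the patent:

from itertools import combinations

def rand_index(labels_true, labels_pred):
    # Fraction of point pairs treated consistently by the two partitions:
    # grouped together in both, or separated in both; 1.0 means identical.
    pairs = combinations(list(zip(labels_true, labels_pred)), 2)
    agree = sum((t1 == t2) == (p1 == p2) for (t1, p1), (t2, p2) in pairs)
    n = len(labels_true)
    return agree / (n * (n - 1) // 2)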
Further, a real-life data set is selected to verify the method of the present invention; the specific steps are as shown in steps 1-6.
The Adult data set is taken to further verify the performance of the invention. The Adult data set is extracted from a Census Bureau database; it contains 32561 objects with 14 attributes, of which 10 are categorical and 4 are numeric, divided into two classes, and is used to predict whether a person's annual income exceeds 50K.
Step 1: randomly select 8 initial points from the Adult data set containing 32561 data points;
Step 2: partition the data in the Adult data set according to the nearest-distance principle, obtaining 8 clusters;
Step 3: calculate the reasonable number of data points s to be drawn according to Equation 1;
Step 4: from each of the above 8 clusters, randomly draw data with replacement at the ratio s/n; draw five times in total, obtaining the data sets S = {s_1, s_2, s_3, s_4, s_5};
Step 5: run the K-prototypes algorithm on the five drawn data sets S = {s_1, s_2, s_3, s_4, s_5}, obtaining five partition results; the K-prototypes procedure is as shown in steps 5.1-5.4:
Step 5.1: input the Adult data set containing 32561 data points and randomly select 2 initial points;
Step 5.2: according to the nearest-distance principle, assign each data point of the Adult data set to its nearest cluster C_i and update the center of C_i;
Step 5.3: calculate the distance between each data point of the Adult data set and the cluster centers obtained in step 5.2, repartition the data according to the nearest-distance principle, and obtain the new center of each cluster C_i;
Step 5.4: repeat steps 5.2-5.3 until the cluster centers no longer change;
Step 6: integrate the five partition results of step 5 according to the ensemble algorithm, obtaining the final classification criterion; the specific steps of the ensemble algorithm are steps 6.1-6.3.
Step 6.1: input the five partition results obtained in step 5 and calculate the accuracy r_i of each partition result; for the original data set X = {X_1, X_2, X_3, …, X_n}, let the clusters obtained by clustering be C = {C_1, C_2, C_3, …, C_{k-1}, C_k} and let the clusters given by the existing class labels of X be C' = {C'_1, C'_2, C'_3, …, C'_k}; r_i is the ratio of the number of data points assigned to the same class in both C and C' to the total number of data points;
Step 6.2: calculate the weight W_i of each partition result according to Equation 2;
Step 6.3: weight and integrate the five partition results' cluster centers according to Equation 3, obtaining the final classification criterion l.
The running time of the method of the present invention is measured, and the RI value of the partition produced by the present invention is calculated against the true labels in the Adult data set;
The comparison results are shown in Table 3: the RI value of the inventive algorithm is highest on the Adult data set and its running time is shorter than that of the K-prototypes algorithm, demonstrating the superior performance of the inventive algorithm.
Table 1: data set information
Table 2: comparison of results on different data sets
Table 2.1: comparison of results on real data sets
Table 2.2: comparison of experimental results on artificial data sets
Table 3: comparison of results on the real-life data set
The above embodiments are not limited to their own technical solutions; embodiments may be combined with one another to form new embodiments. The above embodiments are intended only to illustrate the technical solution of the present invention and not to limit it; any modification or equivalent replacement that does not depart from the spirit and scope of the present invention shall fall within the scope of the technical solution of the present invention.

Claims (4)

1. An effective method for processing mixed-type large-scale data in real life, characterized by comprising the following steps:
Step 1: randomly select μk initial points from the data set X containing n samples, where k is the number of classes of data set X and μ is a parameter;
Step 2: partition the data in data set X according to the nearest-distance principle, obtaining μk clusters;
Step 3: calculate the reasonable number of samples s to be drawn; the formula for s is shown in Equation 1;
In Equation 1, f is the fraction to be drawn, n is the number of data points in data set X, and n_i is the number of data points in cluster C_i; the equation states that, with probability 1-δ (0 < δ < 1), at least f × n_i data points are drawn from C_i;
Step 4: from each cluster C_i of the above μk clusters, randomly draw samples with replacement at the ratio s/n; draw five times in total, obtaining the sample set S = {s_1, s_2, s_3, s_4, s_5};
Step 5: run the K-prototypes algorithm on the five drawn samples, obtaining five partition results;
Step 6: integrate the partition results of step 5 according to the ensemble algorithm, obtaining k cluster centers.
2. The effective method for processing mixed-type large-scale data in real life according to claim 1, characterized in that the parameter μ in step 1 is determined as follows:
Step 1.1: from the original data set X containing n samples, randomly select k initial points;
Step 1.2: partition the data in data set X according to the nearest-distance principle, obtaining k clusters;
Step 1.3: calculate the reasonable number of samples s to be drawn according to Equation 1;
Step 1.4: from each cluster C_i of the above k clusters, randomly draw samples with replacement at the ratio s/n; draw five times in total, obtaining the sample set S = {s_1, s_2, s_3, s_4, s_5};
Step 1.5: run the K-prototypes algorithm on the drawn sample set S = {s_1, s_2, s_3, s_4, s_5}, obtaining five partition results; assign the data of the original data set X to these five partitions and compute the mean and variance of the error rate;
Step 1.6: take k+1 initial points and repeat steps 1.1-1.5, again assigning the data of the original data set X to the five partitions to obtain the mean and variance of the error rate; select the μ for which both the mean and the variance of the error rate are low.
3. The effective method for processing mixed-type large-scale data in real life according to claim 1, characterized in that running the K-prototypes algorithm on the five drawn samples in step 5 to obtain five partition results specifically comprises:
Step 5.1: input the original data set X containing n samples and randomly select k initial points; let the original data set be X = {X_1, X_2, X_3, …, X_n}, where X contains n data points and each X_i has m attributes, i.e. X_i = {x_i1, x_i2, x_i3, …, x_im}; let the initial prototype set formed by the k initial points be V = {V_1, V_2, V_3, …, V_k};
Step 5.2: according to the nearest-distance principle, assign each data point of the original data set X to its nearest cluster C_i and update the center of C_i;
Step 5.3: calculate the distance between each data point of the original data set X and the cluster centers obtained in step 5.2, repartition the data according to the nearest-distance principle, and obtain the new center of each cluster C_i;
Step 5.4: repeat steps 5.2-5.3 until the cluster centers no longer change.
4. The effective method for processing mixed-type large-scale data in real life according to claim 1, characterized in that integrating the partition results of step 5 according to the ensemble algorithm in step 6 to obtain the final classification criterion specifically comprises:
Step 6.1: input the five partition results obtained in step 5 and calculate the accuracy r_i of each partition result; for the original data set X = {X_1, X_2, X_3, …, X_n}, let the clusters obtained by clustering be C = {C_1, C_2, C_3, …, C_{k-1}, C_k} and let the clusters given by the existing class labels of X be C' = {C'_1, C'_2, C'_3, …, C'_k}; r_i is the ratio of the number of data points assigned to the same class in both C and C' to the total number of data points;
Step 6.2: calculate the weight W_i of each partition result according to Equation 2;
Step 6.3: weight and integrate the cluster centers C_i of the five partition results according to Equation 3, obtaining the final classification criterion l.
CN201910594183.7A 2019-07-03 2019-07-03 An effective method for processing mixed-type large-scale data in real life Pending CN110309882A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910594183.7A CN110309882A (en) 2019-07-03 2019-07-03 An effective method for processing mixed-type large-scale data in real life

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910594183.7A CN110309882A (en) 2019-07-03 2019-07-03 An effective method for processing mixed-type large-scale data in real life

Publications (1)

Publication Number Publication Date
CN110309882A true CN110309882A (en) 2019-10-08

Family

ID=68079667

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910594183.7A Pending CN110309882A (en) An effective method for processing mixed-type large-scale data in real life

Country Status (1)

Country Link
CN (1) CN110309882A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111738319A (en) * 2020-06-11 2020-10-02 佳都新太科技股份有限公司 Clustering result evaluation method and device based on large-scale samples

Similar Documents

Publication Publication Date Title
Thenmozhi et al. Heart disease prediction using classification with different decision tree techniques
Genolini et al. KmL: k-means for longitudinal data
US6871201B2 (en) Method for building space-splitting decision tree
CN108846259A A gene classification method and system based on clustering and the random forest algorithm
CN107832456B (en) Parallel KNN text classification method based on critical value data division
US11971892B2 (en) Methods for stratified sampling-based query execution
Zhang et al. Novel density-based and hierarchical density-based clustering algorithms for uncertain data
CN106228554A Fuzzy rough set coal dust image segmentation methods based on multiple attribute reductions
CN110826618A (en) Personal credit risk assessment method based on random forest
Parashar et al. An efficient classification approach for data mining
Gunjan Instantaneous approach for evaluating the initial centers in the agricultural databases using K-means clustering algorithm
Evchenko et al. Frugal machine learning
CN109684477A A patent text feature extraction method and system
WO2012041861A2 (en) Computer-implemented method for analyzing multivariate data
Bruzzese et al. DESPOTA: DEndrogram slicing through a permutation test approach
Mandal et al. Unsupervised non-redundant feature selection: a graph-theoretic approach
Dahiya et al. A rank aggregation algorithm for ensemble of multiple feature selection techniques in credit risk evaluation
CN110309882A An effective method for processing mixed-type large-scale data in real life
Heckerman et al. An experimental comparison of several clustering and initialization methods
Chen et al. See more for scene: Pairwise consistency learning for scene classification
Pandeeswari et al. K-means clustering and Naïve Bayes classifier for categorization of diabetes patients
CN104468276A Network traffic identification method based on random sampling and multiple classifiers
Akyol Clustering hotels and analyzing the importance of their features by machine learning techniques
CN113221966A (en) Differential privacy decision tree construction method based on F _ Max attribute measurement
Al-Mhairat et al. Performance Evaluation of clustering Algorithms

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20191008