CN109447103A - Big data classification method, apparatus, and device based on a hard clustering algorithm - Google Patents

Big data classification method, apparatus, and device based on a hard clustering algorithm

Info

Publication number
CN109447103A
CN109447103A (application CN201811044932.0A)
Authority
CN
China
Prior art keywords
data information
cluster
data
sample
cluster centre
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811044932.0A
Other languages
Chinese (zh)
Other versions
CN109447103B (en)
Inventor
金戈 (Jin Ge)
徐亮 (Xu Liang)
肖京 (Xiao Jing)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201811044932.0A
Publication of CN109447103A
Application granted
Publication of CN109447103B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques
    • G06F18/232: Non-hierarchical techniques
    • G06F18/2321: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213: Non-hierarchical techniques using statistics or function optimisation with fixed number of clusters, e.g. K-means clustering
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This application discloses a big data classification method, apparatus, and device based on a hard clustering algorithm. The method includes: acquiring data information and dividing it into N samples of data; performing a first hard clustering on each sample of data to determine N*K1 first cluster centers; performing a secondary hard clustering on the N*K1 first cluster centers to determine K2 secondary cluster centers; and, according to the K2 secondary cluster centers, dividing the data information into K2 classification items and storing each classification item with its corresponding data information in a database. With this scheme, the secondary cluster centers obtained are more accurate, so classification based on them works better: each resulting classification item has distinctly salient features, allowing users to clearly distinguish the classification items without confusing them.

Description

Big data classification method, apparatus, and device based on a hard clustering algorithm
Technical field
This application relates to the field of data analysis technology, and in particular to a big data classification method, apparatus, and device based on a hard clustering algorithm.
Background technique
As some companies develop ever more rapidly, their workforces grow accordingly. Companies with larger headcounts need to perform population analysis on their employees and divide them into categories.
At present, clustering algorithms are usually used to classify collected demographic data, identify the characteristics of different categories of personnel, and analyze the population according to the classification results. For example, this can help market analysts distinguish different consumer groups in a customer database and summarize the consumption patterns or habits of each class of consumer.
The most common clustering algorithm is the K-means algorithm. However, existing K-means implementations select cluster centers randomly; if the cluster centers are chosen poorly, the clustering effect is bad and the resulting classification is not accurate enough.
Summary of the invention
In view of this, the present application provides a big data classification method, apparatus, and device based on a hard clustering algorithm. Its main purpose is to solve the technical problem that existing K-means cluster centers are selected randomly, so that if the cluster centers are chosen poorly, the clustering effect is bad and the classification results are not accurate enough.
According to a first aspect of the application, a big data classification method based on a hard clustering algorithm is provided. The method includes the following steps:
acquiring data information and dividing it into N samples of data, N >= 1;
performing a first hard clustering on each sample of data to determine N*K1 first cluster centers, where K1 is the number of first cluster centers determined in the first hard clustering, K1 >= 1;
performing a secondary hard clustering on the N*K1 first cluster centers to determine K2 secondary cluster centers, where K2 is the number of secondary cluster centers determined in the secondary hard clustering, K2 >= 1;
according to the K2 secondary cluster centers, dividing the data information into K2 classification items, and storing each classification item with its corresponding data information in a database.
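The four method steps above can be sketched as a two-stage K-means pipeline. The following is a minimal, hypothetical Python illustration, not code from the patent: the function names, the tiny 1-D K-means, and the evenly-spaced initialization are my own simplifications. Each of the N samples is clustered into K1 centers, and the pooled N*K1 centers are clustered again into K2 secondary centers.

```python
def kmeans(points, k, iters=20):
    """Plain Lloyd's K-means on 1-D points; returns k centers."""
    centers = sorted(points)[::max(1, len(points) // k)][:k]  # spread-out seeds
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            groups[i].append(p)                 # assign to nearest center
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers

def two_stage_classify(data, n, k1, k2):
    """Steps of the first aspect: split into n samples, cluster each to k1
    centers, re-cluster the pooled n*k1 centers to k2 secondary centers,
    then assign every data item to its nearest secondary center."""
    size = -(-len(data) // n)                   # ceiling division: items per sample
    samples = [data[i:i + size] for i in range(0, len(data), size)]
    first_centers = [c for s in samples for c in kmeans(s, k1)]   # N*K1 centers
    secondary = kmeans(first_centers, k2)                          # K2 centers
    items = {i: [] for i in range(k2)}
    for p in data:
        i = min(range(k2), key=lambda j: abs(p - secondary[j]))
        items[i].append(p)                      # K2 classification items
    return secondary, items
```

In a real system the classification items would then be written to a database table keyed by item, as the last step describes.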
According to a second aspect of the application, a big data classification apparatus based on a hard clustering algorithm is provided. The apparatus includes:
an acquiring unit, configured to acquire data information and divide it into N samples of data, N >= 1;
a cluster analysis unit, configured to perform a first hard clustering on each sample of data to determine N*K1 first cluster centers, where K1 is the number of first cluster centers determined in the first hard clustering, K1 >= 1;
the cluster analysis unit is further configured to perform a secondary hard clustering on the N*K1 first cluster centers to determine K2 secondary cluster centers, where K2 is the number of secondary cluster centers determined in the secondary hard clustering, K2 >= 1;
a classification unit, configured to divide the data information into K2 classification items according to the K2 secondary cluster centers, and store each classification item with its corresponding data information in the database.
According to a third aspect of the application, a computer device is provided, including a memory and a processor, the memory storing a computer program; when the processor executes the computer program, the steps of the big data classification method based on a hard clustering algorithm described in the first aspect are implemented.
According to a fourth aspect of the application, a computer storage medium is provided, on which a computer program is stored; when the computer program is executed by a processor, the steps of the big data classification method based on a hard clustering algorithm described in the first aspect are implemented.
Through the above technical solution, the big data classification method, apparatus, and device based on a hard clustering algorithm provided by the present application divide a large volume of data information into N samples of data, perform a first cluster analysis on each sample using the hard clustering algorithm to obtain N*K1 first cluster centers, and then use the hard clustering algorithm again to cluster these N*K1 first cluster centers, obtaining K2 secondary cluster centers. The secondary cluster centers obtained in this way are more accurate, so classification based on them works better: each resulting classification item has distinctly salient features, allowing users to clearly distinguish the classification items without confusing them.
The above description is only an overview of the technical solution of the present application. To better understand its technical means so that it can be implemented according to the contents of the specification, and to make the above and other objects, features, and advantages of the application clearer and easier to understand, specific embodiments of the application are set forth below.
Detailed description of the invention
By reading the following detailed description of the preferred embodiments, various other advantages and benefits will become clear to those of ordinary skill in the art. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered limiting of the application. Throughout the drawings, the same reference numbers refer to the same parts. In the drawings:
Fig. 1 is a flowchart of an embodiment of the big data classification method based on a hard clustering algorithm of the present application;
Fig. 2 is a structural block diagram of an embodiment of the big data classification apparatus based on a hard clustering algorithm of the present application;
Fig. 3 is a structural schematic diagram of the computer device of the present application.
Specific embodiment
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and its scope fully conveyed to those skilled in the art.
An embodiment of the present application provides a big data classification method based on a hard clustering algorithm. By hard-clustering the data information twice or more, the cluster centers obtained are more accurate, so that the category division performed according to those cluster centers is more accurate.
As shown in Fig. 1, an embodiment of the present application provides a big data classification method based on a hard clustering algorithm, whose steps include:
Step 101: acquire data information and divide it into N samples of data, N >= 1.
In this step, the data information of company employees, corporate clients, or other groups is acquired. The data information includes gender, age, hobbies, height, weight, income, and so on. Cluster analysis is subsequently performed on this data information to divide it into corresponding categories.
Step 102: perform a first hard clustering (K-means) analysis on each sample of data to determine N*K1 first cluster centers, where K1 is the number of first cluster centers determined for each sample of data after the first hard clustering, K1 >= 1.
In this step, a corresponding numeric value is assigned to each item of data information.
For example, gender: the first digit is 1; the second digit is 1 for male and 2 for female;
age: the first digit is 2, and the age in years is appended directly after the first digit;
hobby: the first digit is 3, and a corresponding second digit is set for each different hobby;
height: the first digit is 4, and the height value (in cm) is appended after the first digit;
weight: the first digit is 5, and the weight value (in kg) is appended after the first digit;
income: the first digit is 6, and the income value (in a consistent unit) is appended after the first digit.
This data information is taken as the sample set and divided into N samples of data.
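The numeric coding scheme above can be sketched as follows. This is a minimal illustration only: the helper name `encode` and the concatenate-digits convention are assumptions inferred from the description, not code from the patent.

```python
# Field prefixes from the description: 1 gender, 2 age, 3 hobby,
# 4 height (cm), 5 weight (kg), 6 income (consistent unit).
PREFIX = {"gender": 1, "age": 2, "hobby": 3, "height": 4, "weight": 5, "income": 6}
GENDER = {"male": 1, "female": 2}   # second digit for the gender field

def encode(field, value):
    """Concatenate the field's first digit with the value's digits."""
    if field == "gender":
        value = GENDER[value]
    return int(str(PREFIX[field]) + str(value))
```

For example, a 30-year-old male who is 175 cm tall would contribute the values 11, 230, and 4175 to the sample.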
Then the entry time of each item of data information is obtained, and a coordinate system O1 is established with the entry time of the data information as the horizontal axis and the numeric value of the data information as the vertical axis. In coordinate system O1, a first cluster analysis is performed on each sample of data using the hard clustering (K-means) algorithm, yielding N*K1 first cluster centers.
Each first cluster center corresponds to an item of data information.
Step 103: perform a secondary hard clustering on the N*K1 first cluster centers to determine K2 secondary cluster centers, where K2 is the number of secondary cluster centers determined in the secondary hard clustering, K2 >= 1.
A coordinate system O2 is established anew, with the entry times of the data information corresponding to the N*K1 first cluster centers as the horizontal axis and their numeric values as the vertical axis. A secondary cluster analysis is performed on the N*K1 first cluster centers using the K-means algorithm to obtain K2 secondary cluster centers. Likewise, each secondary cluster center corresponds to an item of data information, obtained by the secondary cluster analysis from the data information corresponding to the first cluster centers.
Step 104: according to the K2 secondary cluster centers, divide the data information into K2 classification items, and store each classification item with its corresponding data information in the database.
In this step, the K2 secondary cluster centers obtained through the two rounds of K-means cluster analysis are matched against the coordinates of the data information in the coordinate system O1 established in step 102. Taking the secondary cluster centers as centers, the data information on coordinate system O1 is divided into K2 regions, each corresponding to a different classification item. The specific classification items are obtained from the final clustering result; for example, the resulting classification items might be: high professional ability, low professional ability, good at communicating, cheerful disposition, and so on.
Then, within each classification item, the data information is arranged in order of its entry time, associated with the personal information of each person, and stored as a list in the database. In this way, a user can look up the group belonging to the classification item they need from the database.
For example, the leadership of an insurance company may want to find the insurance agents of the high-professional-ability class among its agents in order to reward them. They can directly look up the high-professional-ability class in the database, retrieve the personal information of all insurance agents in that class, and present it to the leadership. This allows company leaders or others to summarize the characteristics of company staff or clients according to the divided categories.
Through the above technical solution, performing cluster analysis twice with the hard clustering algorithm yields more accurate cluster centers, so that classification according to those cluster centers works better and the differences between classification items are more obvious. Users can then analyze the characteristics of a population according to the classification results, or retrieve the group they need from the results.
Step 104 specifically includes:
Step 1041: judge whether the number of secondary cluster centers K2 is greater than or equal to a set threshold; if so, proceed to step 1042; if not, proceed to step 1043.
Step 1042: perform hard clustering on the K2 secondary cluster centers again, until the number of finally determined cluster centers is less than the set threshold.
Step 1043: according to the K2 secondary cluster centers, divide the data information into K2 classification items, and store each classification item with its corresponding data information in the database.
In the above technical solution, the user can set the threshold according to the actual situation (for example, 100), and the number of secondary cluster centers obtained by the above scheme is then judged against this threshold. If it is greater than or equal to the threshold, there are relatively too many classification items, the distinctions between them are not obvious, and the classification effect is poor; too many classification items also increase the search time when the user retrieves the items they need from the database. Therefore, the K2 secondary cluster centers must be cluster-analyzed again with the K-means algorithm, by a process similar to step 103 above, until the number of finally determined cluster centers is less than the set threshold.
Through the above technical solution, whether another round of cluster analysis is needed can be judged from the number of secondary cluster centers obtained after the secondary clustering. This guarantees that the number of classification items does not exceed the set threshold, so that the classification items have obvious distinguishing characteristics and are not confused with one another, improving the classification effect.
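The threshold check in steps 1041-1043 amounts to a loop that keeps re-clustering the current centers until their count drops below the threshold. A minimal sketch follows; the patent does not say how the next cluster count is chosen, so the halving rule here is purely an assumption, and `cluster_fn` stands in for any K-means step.

```python
def reduce_centers(centers, threshold, cluster_fn):
    """Re-cluster `centers` with `cluster_fn(points, k)` until fewer
    than `threshold` centers remain (steps 1041-1042)."""
    while len(centers) >= threshold:
        k = max(1, len(centers) // 2)   # assumed rule: halve the count each round
        centers = cluster_fn(centers, k)
    return centers
```

With `threshold` at, say, 100, the loop never runs when the secondary clustering already produced fewer than 100 centers, matching step 1043.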
Step 102 specifically includes:
Step 1021: determine K1 first initial cluster centers for each sample of data.
In this step, after the first initial cluster centers are determined, the position of each first initial cluster center is found in coordinate system O1 and marked, to facilitate positioning.
Step 1022: calculate the distance between each item of data information in each sample of data and the K1 first initial cluster centers.
In this step, the coordinates of the K1 first initial cluster centers and of the other data information in a given sample are determined in coordinate system O1, and the distance from each other item of data information P1 (x1, y1) to each first initial cluster center P2 (x2, y2) is calculated as d(P1, P2) = sqrt((x2 - x1)^2 + (y2 - y1)^2).
Step 1023: assign each item of data information to the first category corresponding to the first initial cluster center at the shortest distance; each sample of data thus yields K1 first categories and the data information corresponding to each first category.
In this step, if the distance between an item of data information and some first initial cluster center is the shortest, the data information is most similar to that first initial cluster center and the two are likely of one kind. The data information is therefore grouped into the first category corresponding to that first initial cluster center.
Step 1024: determine a first center point for each first category of each sample of data, and select the item of data information at the shortest distance from the first center point as the first cluster center; the N samples of data then correspond to N*K1 first cluster centers.
In this step, the cluster center of the data information corresponding to each first category obtained so far is not accurate, and a corresponding cluster center must be chosen anew. Therefore, the two items of data information farthest apart in each first category are found, and the midpoint of the line connecting these two items is taken as the first center point of the corresponding first category. Since the first center point may be a virtual point that corresponds to no data information, it cannot itself serve as the first cluster center; the item of data information at the shortest distance from the first center point is therefore selected as the first cluster center.
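Step 1024's refinement, taking the two farthest members of a category, using the midpoint of their connecting line as the (possibly virtual) first center point, then snapping to the nearest real data item, can be sketched like this. The helper name is hypothetical; points are 2-D, as in coordinate system O1.

```python
from itertools import combinations
from math import dist   # Euclidean distance (Python 3.8+)

def refined_center(category):
    """Return the member of `category` nearest the midpoint of its
    farthest-apart pair (step 1024)."""
    a, b = max(combinations(category, 2), key=lambda pair: dist(*pair))
    mid = ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)   # may be a virtual point
    return min(category, key=lambda p: dist(p, mid))
```

For the category {(0,0), (1,0), (4,0)}, the farthest pair is (0,0) and (4,0), the midpoint is (2,0), and the nearest real member, (1,0), becomes the first cluster center.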
In step 1021, the first initial cluster centers can be determined for each sample of data in two different ways.
First, if the user is unfamiliar with the data information and inexperienced, the following steps are taken:
Step 10211: randomly select an item of data information C1 from each sample of data, calculate the distance D(x) between C1 and each item of data information in the sample containing C1, and calculate each item's probability of serving as a first initial cluster center according to the formula P(x) = D(x)^2 / Σ D(x)^2.
Step 10212: take the K1 items of data information whose probability value is greater than or equal to a predetermined probability value as the first initial cluster centers.
Second, if the user understands the data information relatively well and can estimate the number of first initial cluster centers for the first clustering, the following steps are taken:
Step 10211': set a quantity K1 as the number of first initial cluster centers.
For example, when the user triggers the cluster-center-quantity setting button, a quantity input window pops up; the user only needs to enter what they consider a reasonable number in the window and click the confirm key.
Step 10212': randomly select K1 items of data information from each sample of data as the first initial cluster centers.
In the above technical solution, the user can select one of the above two ways of determining the first initial cluster centers according to their own actual situation, which makes the method convenient to use.
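Steps 10211-10212 describe a distance-weighted seeding reminiscent of K-means++. A minimal 1-D sketch follows; it is deterministic (the reference item C1 is passed in rather than drawn at random), the formula P(x) = D(x)^2 / ΣD(x)^2 is reconstructed from the description, and keeping the K1 most probable items stands in for the patent's "probability greater than or equal to a predetermined value" filter.

```python
def seed_centers(sample, c1, k1):
    """Weight each item by squared distance from c1 and return the k1
    items with the highest probability (steps 10211-10212)."""
    d2 = {x: (x - c1) ** 2 for x in sample}
    total = sum(d2.values())
    prob = {x: d2[x] / total for x in sample}   # P(x) = D(x)^2 / sum D(x)^2
    return sorted(sample, key=lambda x: prob[x], reverse=True)[:k1]
```

Items far from C1 get high probability, which spreads the initial centers out instead of letting a purely random draw land them close together.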
Step 103 specifically includes:
Step 1031: choose K2 of the N*K1 first cluster centers as second initial cluster centers.
In this step, the selection process for the second initial cluster centers is similar to that for the first initial cluster centers, specifically:
Using coordinate system O2, randomly select an item C2 from the N*K1 first cluster centers, calculate the distance D(x) between C2 and each first cluster center, and calculate each first cluster center's probability of serving as a second initial cluster center according to the formula P(x) = D(x)^2 / Σ D(x)^2. Take the K2 first cluster centers whose probability value is greater than or equal to the predetermined probability value as the second initial cluster centers.
Alternatively,
Set a quantity K2 as the number of second initial cluster centers, and randomly choose K2 items from the N*K1 first cluster centers in coordinate system O2 as the second initial cluster centers.
Step 1032: calculate the distance between each first cluster center and the K2 second initial cluster centers.
In this step, the coordinates of the K2 second initial cluster centers are first marked in coordinate system O2, and the distances are then calculated.
Step 1033: assign each first cluster center to the second category corresponding to the second initial cluster center at the shortest distance, where the K2 second initial cluster centers correspond to K2 second categories.
In this step, if the distance between a first cluster center and some second initial cluster center is the shortest, the first cluster center is most similar to that second initial cluster center and the two are likely of one kind. The first cluster center is therefore grouped into the second category corresponding to that second initial cluster center.
Step 1034: determine a second center point for each second category, and choose the first cluster center at the shortest distance from the second center point as the secondary cluster center, obtaining K2 secondary cluster centers.
In this step, the cluster center corresponding to each second category obtained so far is not accurate, and a corresponding cluster center must be chosen anew. Therefore, the two first cluster centers farthest apart in each second category are chosen, and the midpoint of the line connecting them is taken as the second center point of the corresponding second category. Since the second center point may be a virtual point, it cannot itself serve as the secondary cluster center; the first cluster center at the shortest distance from the second center point is therefore selected as the secondary cluster center.
Through the above technical solution, the secondary cluster centers are obtained by a second round of K-means cluster analysis on the basis of the first cluster centers obtained in the first clustering, so that the secondary cluster centers are more accurate and classification according to them works better.
Step 101 specifically includes:
Step 1011: obtain the total amount of data information and divide the data information evenly into N samples of data according to a predetermined quantity per sample, where the quantity of the last sample of data is less than or equal to the predetermined quantity.
Alternatively,
Step 1011': obtain the maximum value A and the minimum value B of the data information, divide the interval from B to A evenly into N equal parts to obtain N numeric ranges, and divide the data information into N samples of data according to the N numeric ranges.
Through the above technical solution: because the amount of data information is huge, cluster-analyzing all of it at once could crash the system. The data information can therefore be divided evenly by quantity, or evenly by numeric value; once it is divided into N samples of data, each sample can be cluster-analyzed separately, which effectively improves the effect of the cluster analysis.
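The two splitting strategies of steps 1011 and 1011' can be sketched as follows on 1-D values; the function names are my own, not from the patent.

```python
def split_by_count(data, per_sample):
    """Step 1011: fixed-size chunks; the last chunk may be smaller."""
    return [data[i:i + per_sample] for i in range(0, len(data), per_sample)]

def split_by_range(data, n):
    """Step 1011': divide [min, max] into n equal numeric ranges and
    bucket each value into its range."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n
    samples = [[] for _ in range(n)]
    for x in data:
        i = min(int((x - lo) / width), n - 1)   # clamp the max value into the last range
        samples[i].append(x)
    return samples
```

Splitting by count gives evenly sized workloads, while splitting by range keeps numerically similar items together before the first clustering pass.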
Step 104 specifically includes:
Step 1041: determine a corresponding classification item for each secondary cluster center.
Step 1042: calculate the distance between each item of data information and each secondary cluster center, and assign the data information to the classification item corresponding to the secondary cluster center at the shortest distance.
Step 1043: store the K2 classification items obtained and their corresponding data information in the database.
In the above technical solution, the user can name the classification item corresponding to each secondary cluster center according to their own practical experience, for example: high-experience employees, medium-experience employees, low-experience employees, and so on.
Then the K2 secondary cluster centers obtained in coordinate system O2 are transferred into coordinate system O1, and each secondary cluster center is marked in coordinate system O1. The distance between each item of data information and each secondary cluster center is calculated; the size of the distance determines the degree of correlation between each item of data information and the corresponding secondary cluster center, with a shorter distance proving a higher correlation. Each item of data information is therefore assigned to the classification item corresponding to the secondary cluster center at the shortest distance, completing the task of classifying the data information.
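This refinement of step 104 reduces to a nearest-center assignment. A minimal sketch follows (hypothetical names; each secondary cluster center carries the user-chosen classification-item label described above, and points are 2-D as in coordinate system O1).

```python
from math import dist   # Euclidean distance (Python 3.8+)

def classify(data, labeled_centers):
    """Assign each 2-D data item to the classification item of its
    nearest secondary cluster center (steps 1042-1043)."""
    items = {label: [] for label, _ in labeled_centers}
    for p in data:
        label = min(labeled_centers, key=lambda lc: dist(p, lc[1]))[0]
        items[label].append(p)
    return items
```

The resulting dictionary maps each classification item to its members, mirroring the list-per-item layout stored in the database.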
Through the technical solution of this embodiment, a large volume of data information is divided into N samples of data; a first cluster analysis is performed on each sample using the hard clustering algorithm, yielding N*K1 first cluster centers; then the hard clustering algorithm is used again to cluster these N*K1 first cluster centers, obtaining K2 secondary cluster centers. The secondary cluster centers obtained in this way are more accurate, so classification according to them works better: each resulting classification item has distinctly salient features, allowing users to clearly distinguish the classification items without confusing them.
The big data classification method based on a hard clustering algorithm of another embodiment of the present application includes the following steps:
One: obtain the sample
An insurance company that wants to classify its insurance agents needs to collect the agents' personal information data (that is, the data information), including gender, age, hobbies, height, weight, education level, volume of clients received, income, and so on. This personal data information is collected as the sample.
Two: first clustering
The above sample is divided evenly into N equal parts, and K-means clustering is performed on each of the N parts separately, obtaining N*K1 cluster centers.
The specific clustering process is as follows:
(1) Determine K1 initial cluster centers for each sample part. The value of K1 can be set in advance by the user, or determined as follows:
First randomly select an item of data information C1 for each sample part, calculate the distance D(x) between each other item of data information and C1, and calculate each other item's probability of serving as an initial cluster center, P(x) = D(x)^2 / Σ D(x)^2. Choose the K1 items of data information whose probability value is greater than the predetermined probability as the initial cluster centers.
(2) Calculate the distance from each data item in each sample part to the initial cluster centers, and assign the item to the category corresponding to the initial cluster center at the shortest distance. Using the K-means algorithm, recalculate the corresponding first cluster center from the data information of each category, and repeat this step until the first cluster centers obtained no longer change.
(3) Calculate the first cluster centers for all N sample parts in the manner of steps (1) and (2) above, obtaining N*K1 first cluster centers.
Three: secondary clustering
Taking the N*K1 first cluster centers as the sample, cluster again using the K-means algorithm.
(1) Set the number of secondary cluster centers to K2. The value of K2 can be set in advance by the user, or determined as follows:
From the N*K1 first cluster centers, first randomly select a first cluster center C2, calculate the distance D(x) between each other first cluster center in the sample and C2, and calculate each other first cluster center's probability of serving as an initial center for the secondary clustering, P(x) = D(x)^2 / Σ D(x)^2. Take the K2 items whose probability value exceeds the predetermined probability value as the initial centers of the secondary clustering.
(2) Calculate the distance from each data item to the secondary cluster centers, and assign each item to the category corresponding to the secondary cluster center at the shortest distance, obtaining the data information corresponding to K2 categories. Then, for each category, use K-means to recalculate the corresponding secondary cluster center, and repeat this step until the secondary cluster centers obtained no longer change, yielding the final K2 secondary cluster centers.
In addition, can also be carried out even more multiple three times if the quantity of K2 obtained secondary cluster centres is more Clustering recycles the finally obtained cluster centre after repeatedly clustering to be used to classify to data information.
In the multi-round clustering process above, the samples clustered in each round are the results of the previous round, so the cluster centers obtained are more accurate than those produced by only a single round of clustering.
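Under the same 1-D simplification, the whole two-stage flow described above (split the data into N parts, first clustering per part, secondary clustering of the pooled N*K1 first centers) might look like the following sketch. Random sampling stands in for the probability-based seeding, and all names are illustrative:

```python
import random

def lloyd(points, centers, iters=50):
    # Basic K-means refinement used by both stages (1-D records).
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            groups[min(range(len(centers)), key=lambda j: abs(p - centers[j]))].append(p)
        new = [sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)]
        if new == centers:
            return new
        centers = new
    return centers

def two_stage_cluster(data, n_parts, k1, k2):
    # Stage one: split the data evenly into N parts and cluster each part
    # to K1 first cluster centers.
    data = list(data)
    random.shuffle(data)
    size = (len(data) + n_parts - 1) // n_parts
    parts = [data[i:i + size] for i in range(0, len(data), size)]
    first_centers = []
    for part in parts:
        first_centers += lloyd(part, random.sample(part, min(k1, len(part))))
    # Stage two: cluster the pooled N*K1 first centers down to K2
    # secondary cluster centers.
    return lloyd(first_centers, random.sample(first_centers, min(k2, len(first_centers))))
```

Because stage two clusters the stage-one centers rather than the raw records, its input is both smaller and already denoised, which is the accuracy argument the paragraph above makes.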
4. Classifying the samples
Category division is performed on all collected data records using the K2 secondary cluster centers obtained, dividing the data information into K2 categories.
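The final assignment step can be sketched as follows (1-D records; categories are keyed by the index of the secondary cluster center, and the names are illustrative):

```python
def classify_records(records, secondary_centers):
    # Put every record into the category of its nearest secondary cluster center.
    groups = {i: [] for i in range(len(secondary_centers))}
    for x in records:
        nearest = min(range(len(secondary_centers)),
                      key=lambda i: abs(x - secondary_centers[i]))
        groups[nearest].append(x)
    return groups
```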
In this way, the leadership of the insurance company can analyze, from the data records of the insurance agents corresponding to each category, the characteristics of the agents in that category, and then better target the characteristics of each class of agents to plan suitable work for them.
In summary, by performing two or even more rounds of cluster analysis on the collected data information, this scheme obtains more accurate cluster centers, so that classification according to those cluster centers yields categories with high distinctiveness and an ideal clustering effect. The group in each category can thus embody its own characteristics. Furthermore, the known classification results can be checked against the differences the groups exhibit in their features, so as to sort out how an agent's features influence the grouping of the agent population.
Further, as a specific implementation of the method of Fig. 1, an embodiment of the present application provides a big data classification apparatus based on a hard clustering algorithm. As shown in Fig. 2, the apparatus includes: an acquiring unit 21, a cluster analysis unit 22, and a classification unit 23.
The acquiring unit 21 is configured to obtain data information and divide the data information into N parts of sample data, N >= 1;
The cluster analysis unit 22 is configured to perform first hard cluster analysis on each part of sample data and determine N*K1 first cluster centers, where K1 is the number of first cluster centers determined for each part of sample data after the first hard cluster analysis, K1 >= 1;
The cluster analysis unit 22 is further configured to perform secondary hard cluster analysis on the N*K1 first cluster centers and determine K2 secondary cluster centers, where K2 is the number of secondary cluster centers determined in the secondary hard cluster analysis, K2 >= 1;
The classification unit 23 is configured to divide the data information into K2 category items according to the K2 secondary cluster centers, and to store each category item together with its corresponding data information in a database.
In a specific embodiment, the classification unit 23 specifically includes:
a judging module configured to judge whether the number of secondary cluster centers K2 is greater than or equal to a set threshold; if the judgment result is yes, hard cluster analysis is performed again on the K2 secondary cluster centers until the number of final cluster centers determined is less than the set threshold; if the judgment result is no, the data information is divided into K2 category items according to the K2 secondary cluster centers, and each category item is stored together with its corresponding data information in the database.
In a specific embodiment, the cluster analysis unit 22 specifically includes:
a center determining module configured to determine K1 first initial cluster centers for each part of sample data;
a distance calculating module configured to compute the distances between the data records in each part of sample data and the K1 first initial cluster centers;
an assigning module configured to assign each data record to the first category corresponding to the nearest first initial cluster center, so that each part of sample data yields K1 first categories and the data records corresponding to each first category;
a selecting module configured to determine a first center point for each first category of each part of sample data, and to select the data record closest to the first center point as a first cluster center, so that the N parts of sample data correspond to N*K1 first cluster centers.
In a specific embodiment, the center determining module specifically includes:
a probability calculating module configured to randomly select a data record C1 from each part of sample data, compute the distance D(x) between each data record in that part of sample data and C1, and compute, according to the formula P(x) = D(x)^2 / Σ D(x)^2, the probability value of each data record being a first initial cluster center; the K1 data records whose probability value is greater than or equal to a predetermined probability value are taken as the first initial cluster centers;
alternatively,
a random module configured to set a quantity K1 as the number of first initial cluster centers, and to randomly select K1 data records from each part of sample data as the first initial cluster centers.
In a specific embodiment, the center determining module is further configured to select K2 centers from the N*K1 first cluster centers as second initial cluster centers;
the distance calculating module is further configured to compute the distance between each first cluster center and the K2 second initial cluster centers;
the assigning module is further configured to assign each first cluster center to the second category corresponding to the nearest second initial cluster center, where the K2 second initial cluster centers correspond to K2 second categories;
the selecting module is further configured to determine a second center point for each second category, and to select the first cluster center closest to the second center point as a secondary cluster center, obtaining K2 secondary cluster centers.
In a specific embodiment, the acquiring unit 21 is further configured to obtain the total number of data records and to divide the data information evenly, according to a predetermined quantity per part, into N parts of sample data, where the quantity of the last part of sample data is less than or equal to the predetermined quantity;
alternatively,
it is further configured to obtain the maximum value A and the minimum value B of the data information, divide the interval from A to B into N equal parts to obtain N value ranges, and divide the data information into N parts of sample data according to the N value ranges.
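The two partitioning alternatives above can be sketched as follows for 1-D records (function names are illustrative):

```python
def split_by_count(data, per_part):
    # Alternative 1: fixed-size parts; the last part may be smaller.
    return [data[i:i + per_part] for i in range(0, len(data), per_part)]

def split_by_range(data, n):
    # Alternative 2: divide the interval from the minimum B to the maximum A
    # into N equal value ranges and bucket each record by its range.
    lo, hi = min(data), max(data)
    width = (hi - lo) / n or 1.0  # guard against all-equal data
    parts = [[] for _ in range(n)]
    for x in data:
        parts[min(int((x - lo) / width), n - 1)].append(x)
    return parts
```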
In a specific embodiment, the classification unit 23 further specifically includes:
a category determining module configured to determine a corresponding category item for each secondary cluster center, compute the distance between each data record and each secondary cluster center, and assign the record to the category item corresponding to the nearest secondary cluster center;
a storage module configured to store the K2 category items obtained, together with their corresponding data information, in the database.
Based on the embodiments of the method shown in Fig. 1 and the apparatus shown in Fig. 2 above, in order to achieve the above objects, an embodiment of the present application further provides a computer device, as shown in Fig. 3, including a memory 32 and a processor 31 connected by a bus 33, where the memory 32 stores a computer program, and the processor 31, when executing the computer program, implements the big data classification method based on a hard clustering algorithm shown in Fig. 1.
Based on this understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a non-volatile memory (which may be a CD-ROM, a USB flash disk, a removable hard disk, etc.) and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the implementation scenarios of the present application.
Optionally, the device may further be connected to a user interface, a network interface, a camera, a radio frequency (RF) circuit, a sensor, an audio circuit, a Wi-Fi module, and so on. The user interface may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and the optional user interface may also include a USB interface, a card reader interface, etc. The network interface may optionally include a standard wired interface and a wireless interface (such as a Bluetooth interface or a Wi-Fi interface).
Based on the embodiments of the method shown in Fig. 1 and the apparatus shown in Fig. 2 above, correspondingly, an embodiment of the present application further provides a storage medium on which a computer program is stored, and the program, when executed by a processor, implements the above big data classification method based on a hard clustering algorithm shown in Fig. 1.
Those skilled in the art will understand that the structure of the computer device provided in this embodiment does not constitute a limitation on the physical device, which may include more or fewer components, combine certain components, or adopt a different arrangement of components.
The storage medium may also include an operating system and a network communication module. The operating system is a program that manages the hardware and software resources of the device, and supports the running of the information processing program and other software and/or programs. The network communication module is used to implement communication among the components inside the storage medium, and communication with other hardware and software in the device.
Through the above description of the embodiments, those skilled in the art can clearly understand that the present application may be implemented by software plus a necessary general-purpose hardware platform, or by hardware.
With the technical solution of the present application, a large amount of data information can be divided into N parts of sample data; first cluster analysis is then performed on each part of sample data using the hard clustering algorithm, obtaining N*K1 first cluster centers; the hard clustering algorithm is then used again to cluster these N*K1 first cluster centers, obtaining K2 secondary cluster centers. The secondary cluster centers obtained in this way are more accurate, so that classification according to the secondary cluster centers has a better effect, and each category item obtained has comparatively salient features, enabling users to better distinguish the category items without confusing them.
Those skilled in the art will appreciate that the accompanying drawings are only schematic diagrams of a preferred implementation scenario, and that the modules or processes in the drawings are not necessarily required for implementing the present application. Those skilled in the art will appreciate that the modules of the apparatus in an implementation scenario may be distributed in the apparatus of the implementation scenario as described, or may be correspondingly changed so as to be located in one or more apparatuses different from this implementation scenario. The modules of the above implementation scenario may be merged into one module, or may be further split into multiple sub-modules.
The above serial numbers of the present application are for description only and do not represent the superiority or inferiority of the implementation scenarios. What is disclosed above is only several specific implementation scenarios of the present application; the present application is, however, not limited thereto, and any variation that a person skilled in the art can conceive of shall fall within the protection scope of the present application.

Claims (10)

1. A big data classification method based on a hard clustering algorithm, characterized in that the method comprises the steps of:
obtaining data information, and dividing the data information into N parts of sample data, N >= 1;
performing first hard cluster analysis on each part of sample data to determine N*K1 first cluster centers, where K1 is the number of first cluster centers determined for each part of sample data after the first hard cluster analysis, K1 >= 1;
performing secondary hard cluster analysis on the N*K1 first cluster centers to determine K2 secondary cluster centers, where K2 is the number of secondary cluster centers determined in the secondary hard cluster analysis, K2 >= 1;
dividing the data information into K2 category items according to the K2 secondary cluster centers, and storing each category item together with its corresponding data information in a database.
2. The big data classification method according to claim 1, characterized in that dividing the data information into K2 category items according to the K2 secondary cluster centers, and storing each category item together with its corresponding data information in a database, specifically comprises:
judging whether the number of secondary cluster centers K2 is greater than or equal to a set threshold;
if the judgment result is yes, performing hard cluster analysis on the K2 secondary cluster centers again until the number of final cluster centers determined is less than the set threshold;
if the judgment result is no, dividing the data information into K2 category items according to the K2 secondary cluster centers, and storing each category item together with its corresponding data information in the database.
3. The big data classification method according to claim 1, characterized in that performing first hard cluster analysis on each part of sample data to determine N*K1 first cluster centers specifically comprises:
determining K1 first initial cluster centers for each part of sample data;
computing the distances between the data records in each part of sample data and the K1 first initial cluster centers;
assigning each data record to the first category corresponding to the nearest first initial cluster center, so that each part of sample data yields K1 first categories and the data records corresponding to each first category;
determining a first center point for each first category of each part of sample data, and selecting the data record closest to the first center point as a first cluster center, so that the N parts of sample data correspond to N*K1 first cluster centers.
4. The big data classification method according to claim 3, characterized in that determining K1 first initial cluster centers for each part of sample data specifically comprises:
randomly selecting a data record C1 from each part of sample data, computing the distance D(x) between each data record in that part of sample data and C1, and computing, according to the formula P(x) = D(x)^2 / Σ D(x)^2, the probability value of each data record being a first initial cluster center;
taking the K1 data records whose probability value is greater than or equal to a predetermined probability value as the first initial cluster centers;
alternatively,
setting a quantity K1 as the number of first initial cluster centers;
randomly selecting K1 data records from each part of sample data as the first initial cluster centers.
5. The big data classification method according to claim 1, characterized in that performing secondary hard cluster analysis on the N*K1 first cluster centers to determine K2 secondary cluster centers specifically comprises:
selecting K2 centers from the N*K1 first cluster centers as second initial cluster centers;
computing the distance between each first cluster center and the K2 second initial cluster centers;
assigning each first cluster center to the second category corresponding to the nearest second initial cluster center, where the K2 second initial cluster centers correspond to K2 second categories;
determining a second center point for each second category, and selecting the first cluster center closest to the second center point as a secondary cluster center, obtaining K2 secondary cluster centers.
6. The big data classification method according to claim 1, characterized in that obtaining data information, and dividing the data information into N parts of sample data, specifically comprises:
obtaining the total number of data records, and dividing the data information evenly, according to a predetermined quantity per part, into N parts of sample data, where the quantity of the last part of sample data is less than or equal to the predetermined quantity;
alternatively,
obtaining the maximum value A and the minimum value B of the data information, dividing the interval from A to B into N equal parts to obtain N value ranges, and dividing the data information into N parts of sample data according to the N value ranges.
7. The big data classification method according to claim 1, characterized in that dividing the data information into K2 category items according to the K2 secondary cluster centers, and storing each category item together with its corresponding data information in a database, specifically comprises:
determining a corresponding category item for each secondary cluster center;
computing the distance between the data information and each secondary cluster center, and assigning each data record to the category item corresponding to the nearest secondary cluster center;
storing the K2 category items obtained, together with their corresponding data information, in the database.
8. A big data classification apparatus based on a hard clustering algorithm, characterized in that the apparatus comprises:
an acquiring unit configured to obtain data information and divide the data information into N parts of sample data, N >= 1;
a cluster analysis unit configured to perform first hard cluster analysis on each part of sample data and determine N*K1 first cluster centers, where K1 is the number of first cluster centers determined for each part of sample data after the first hard cluster analysis, K1 >= 1;
the cluster analysis unit being further configured to perform secondary hard cluster analysis on the N*K1 first cluster centers and determine K2 secondary cluster centers, where K2 is the number of secondary cluster centers determined in the secondary hard cluster analysis, K2 >= 1;
a classification unit configured to divide the data information into K2 category items according to the K2 secondary cluster centers, and to store each category item together with its corresponding data information in a database.
9. A computer device, comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the big data classification method based on a hard clustering algorithm according to any one of claims 1 to 7.
10. A computer storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the big data classification method based on a hard clustering algorithm according to any one of claims 1 to 7.
CN201811044932.0A 2018-09-07 2018-09-07 Big data classification method, device and equipment based on hard clustering algorithm Active CN109447103B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811044932.0A CN109447103B (en) 2018-09-07 2018-09-07 Big data classification method, device and equipment based on hard clustering algorithm


Publications (2)

Publication Number Publication Date
CN109447103A true CN109447103A (en) 2019-03-08
CN109447103B CN109447103B (en) 2023-09-29

Family

ID=65530438

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811044932.0A Active CN109447103B (en) 2018-09-07 2018-09-07 Big data classification method, device and equipment based on hard clustering algorithm

Country Status (1)

Country Link
CN (1) CN109447103B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110134839A (en) * 2019-03-27 2019-08-16 平安科技(深圳)有限公司 Time series data characteristic processing method, apparatus and computer readable storage medium
CN110163241A (en) * 2019-03-18 2019-08-23 腾讯科技(深圳)有限公司 Data sample generation method, device, computer equipment and storage medium
CN110209260A (en) * 2019-04-26 2019-09-06 平安科技(深圳)有限公司 Power consumption method for detecting abnormality, device, equipment and computer readable storage medium
CN110367969A (en) * 2019-07-05 2019-10-25 复旦大学 A kind of improved electrocardiosignal K-Means Cluster

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130226922A1 (en) * 2012-02-28 2013-08-29 International Business Machines Corporation Identification of Complementary Data Objects
CN104615752A (en) * 2015-02-12 2015-05-13 北京嘀嘀无限科技发展有限公司 Information classification method and system
CN106650228A (en) * 2016-11-08 2017-05-10 浙江理工大学 Noise data removal method through improved k-means algorithm and implementation system
WO2017181660A1 (en) * 2016-04-21 2017-10-26 华为技术有限公司 K-means algorithm-based data clustering method and device
CN107367277A (en) * 2017-06-05 2017-11-21 南京邮电大学 Indoor location fingerprint positioning method based on secondary K Means clusters
CN107480708A (en) * 2017-07-31 2017-12-15 微梦创科网络科技(中国)有限公司 The clustering method and system of a kind of complex model
CN108154163A (en) * 2016-12-06 2018-06-12 北京京东尚科信息技术有限公司 Data processing method, data identification and learning method and its device


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI Jintao et al.: "Improvement based on the K-means clustering algorithm", Foreign Electronic Measurement Technology, vol. 36, no. 06, pages 9-13 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163241A (en) * 2019-03-18 2019-08-23 腾讯科技(深圳)有限公司 Data sample generation method, device, computer equipment and storage medium
CN110163241B (en) * 2019-03-18 2022-12-30 腾讯科技(深圳)有限公司 Data sample generation method and device, computer equipment and storage medium
CN110134839A (en) * 2019-03-27 2019-08-16 平安科技(深圳)有限公司 Time series data characteristic processing method, apparatus and computer readable storage medium
CN110134839B (en) * 2019-03-27 2023-06-06 平安科技(深圳)有限公司 Time sequence data characteristic processing method and device and computer readable storage medium
CN110209260A (en) * 2019-04-26 2019-09-06 平安科技(深圳)有限公司 Power consumption method for detecting abnormality, device, equipment and computer readable storage medium
CN110209260B (en) * 2019-04-26 2024-02-23 平安科技(深圳)有限公司 Power consumption abnormality detection method, device, equipment and computer readable storage medium
CN110367969A (en) * 2019-07-05 2019-10-25 复旦大学 A kind of improved electrocardiosignal K-Means Cluster

Also Published As

Publication number Publication date
CN109447103B (en) 2023-09-29

Similar Documents

Publication Publication Date Title
CN109447103A (en) A kind of big data classification method, device and equipment based on hard clustering algorithm
CN106355449B (en) User selection method and device
CN110377804A (en) Method for pushing, device, system and the storage medium of training course data
CN108197532A (en) The method, apparatus and computer installation of recognition of face
CN106817251B (en) Link prediction method and device based on node similarity
CN106506705A (en) Listener clustering method and device based on location-based service
CN110909222B (en) User portrait establishing method and device based on clustering, medium and electronic equipment
CN112488863B (en) Dangerous seed recommendation method and related equipment in user cold start scene
US20120109959A1 (en) Method and system for-clustering data arising from a database
CN110750697B (en) Merchant classification method, device, equipment and storage medium
CN109242012A (en) It is grouped inductive method and device, electronic device and computer readable storage medium
CN109543940B (en) Activity evaluation method, activity evaluation device, electronic equipment and storage medium
CN108038217B (en) Information recommendation method and device
CN110727859A (en) Recommendation information pushing method and device
Son et al. Top-k manhattan spatial skyline queries
CN109345201A (en) Human Resources Management Method, device, electronic equipment and storage medium
CN111460315A (en) Social portrait construction method, device and equipment and storage medium
CN111538909A (en) Information recommendation method and device
CN109272351B (en) Passenger flow line and passenger flow hot area determining method and device
CN112560105B (en) Joint modeling method and device for protecting multi-party data privacy
CN110737771B (en) Topic distribution method and device based on big data
CN104899232B (en) The method and apparatus of Cooperative Clustering
CN111563207B (en) Search result sorting method and device, storage medium and computer equipment
CN111400663B (en) Model training method, device, equipment and computer readable storage medium
US20160357708A1 (en) Data analysis method, data analysis apparatus, and recording medium having recorded program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant