CN112905863A - Automatic customer classification method based on K-Means clustering - Google Patents


Info

Publication number
CN112905863A
Authority
CN
China
Prior art keywords
sample
matrix
data
cluster
clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110293773.3A
Other languages
Chinese (zh)
Inventor
霍胜军
郑鑫
于德尚
徐楠楠
Current Assignee
Qingdao Mengdou Network Technology Co ltd
Original Assignee
Qingdao Mengdou Network Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Qingdao Mengdou Network Technology Co ltd filed Critical Qingdao Mengdou Network Technology Co ltd
Priority to CN202110293773.3A
Publication of CN112905863A
Legal status: Pending

Classifications

    • G06F16/906 (Physics; Computing; Electric digital data processing; Information retrieval, details of database functions): Clustering; Classification
    • G06F18/23213 (Pattern recognition; Analysing; Clustering techniques; Non-hierarchical techniques using statistics or function optimisation): fixed number of clusters, e.g. K-means clustering
    • G06Q30/0201 (ICT specially adapted for administrative, commercial, financial, managerial or supervisory purposes; Commerce; Marketing): Market modelling; Market analysis; Collecting market data

Abstract

The invention provides an automatic customer classification method based on K-Means clustering, comprising the following steps: initializing the platform customer data, where the data of all company customers on the platform form a data set matrix A whose columns correspond to the data dimensions and whose rows correspond to the customers, and normalizing A to obtain the normalized data set matrix B; classifying the customers automatically with K-Means clustering, determining the value of K with the contour (silhouette) coefficient method and then determining an initial cluster center and, from it, K cluster centers; assigning every sample in B to the nearest cluster set by the minimum-distance principle and taking the mean of all samples in each cluster as its new center; and repeating these steps until the cluster centers no longer change, which yields K clusters, the result of automatically classifying the customers. The method ensures the objectivity of the classification result and saves labor cost.

Description

Automatic customer classification method based on K-Means clustering
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a client automatic classification method based on K-Means clustering.
Background
In the field of education, Confucius put forward the concept of "teaching students in accordance with their aptitude", and personalized learning is an ideal that education pursues. For the service industry, personalized service is likewise an ideal to pursue.
With the rapid development of the internet, the network's strong interactivity and distributed character provide technical support for personalized services on an internet platform. An internet e-commerce platform database holds a large amount of customer data; using these data to classify customers makes it possible to provide more personalized services to different types of customers and to bring more benefit to the platform.
Directly classifying the platform's existing customers by hand is unrealistic and suffers mainly from the following problems:
(1) weak objectivity: different people mix personal subjective factors into the classification process, and the classification standard itself is limited and influenced by those subjective factors, so the resulting customer classification is not objective;
(2) wasted resources: classifying a large number of customers manually consumes a great deal of people's time and wastes the platform's human resources.
Moreover, the prior art lacks a suitable automatic classification method. In view of the above, it is desirable to provide an automatic classification method that solves these problems.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the problems described in the background section, the invention provides an automatic customer classification method based on K-Means clustering for internet e-commerce platforms. Taking the customers' autonomous behavior on the platform as the basis of classification, it classifies customers automatically by K-Means clustering, which ensures the objectivity of the classification result and releases the human resources otherwise occupied.
In order to solve the problems, the technical scheme adopted by the invention is as follows:
the automatic customer classification method based on K-Means clustering is characterized by comprising the following steps of:
initializing platform client data, forming a data set matrix A by data of all company clients on a platform, and normalizing the data set matrix A by using the column number in a data dimension corresponding matrix and the row number in a client number corresponding matrix to obtain a normalized data set matrix B;
the method for automatically classifying the clients by adopting a K-means clustering method comprises the following steps:
(1) determining a K value, namely the number of the types of clustering by adopting a contour coefficient method;
(2) determining an initial clustering center M1
(3) Determining K clustering centers; the K clustering centers are respectively marked as M1,M2,……,MK
(4) assigning every sample in the data set matrix B to the nearest cluster set by the minimum-distance principle, where the distance is the Euclidean distance:

d(Y_i, M_k) = \sqrt{ \sum_{j=1}^{m} (y_{ij} - M_{kj})^2 }

where Y_i is the i-th sample in data set B, M_k is the k-th cluster center, y_{ij} is the value of the j-th dimension of the i-th sample in B, and M_{kj} is the value of the j-th dimension of the k-th cluster center;
(5) taking the mean of all samples in each cluster as the new cluster center:

M_{kj} = \frac{1}{l} \sum_{Y_i \in C_k} y_{ij}, \quad j = 1, ..., m

where M_k is the new k-th cluster center, C_k is the current k-th cluster, l is the total number of samples in C_k, and \sum_{Y_i \in C_k} y_{ij} is the sum of the j-th dimension values of the samples in C_k;
(6) repeating steps (4) to (5) until the cluster centers no longer change;
(7) finishing with K clusters, which are the result of automatically classifying the customers.
Further, initializing the platform customer data to obtain the normalized data set matrix B comprises the following specific steps:
(1) defining data set A: the data of all company customers on the platform form data set A, an n × m matrix, where m is the number of data dimensions and corresponds to the number of columns, and n is the number of customers and corresponds to the number of rows. The data dimensions include, but are not limited to, the number of purchases, the number of supplies, the expenditure amount, the income amount and the customer enterprise size, and may be adjusted according to platform use and development.
(2) normalizing data set A: the values of every data dimension in A are normalized into the range [0, 1]; when a given data dimension is normalized, the normalization formula is

y_{ij} = \frac{x_{ij} - x_{j\min}}{x_{j\max} - x_{j\min}}

where y_{ij} is the normalized value of the original datum x_{ij}, x_{ij} is the original datum in row i, column j of A, x_{j\min} is the minimum of the j-th dimension data in A, and x_{j\max} is the maximum of the j-th dimension data in A;
(3) obtaining the normalized data set B: after all dimensions of A have been normalized they form the data set B, an n × m matrix in which every dimension's values lie in [0, 1]; one row of B, i.e. one sample, represents the data of one customer.
Further, determining the value of K by the contour coefficient method in step (1) comprises the following specific steps:
(1-1) calculating the intra-class dissimilarity a(y) of sample y:

a(y) = \frac{1}{t_k - 1} \sum_{p \in M_y, p \neq y} \| y - p \|

where a(y), called the intra-class dissimilarity of sample y, is the average distance from y to the other samples of the cluster M_y to which y currently belongs; the smaller the value, the more likely y belongs to that cluster. t_k is the total number of samples in M_y, p ∈ M_y denotes the other samples belonging to M_y, and \| y - p \| is the Euclidean distance from sample y to sample p;
(1-2) calculating the inter-class dissimilarity b(y) of sample y:
(1-2-1) calculating b_k(y):

b_k(y) = \frac{1}{t_{k'}} \sum_{p \in M_{y'}} \| y - p \|

where b_k(y), called the dissimilarity between sample y and cluster M_{y'}, is the average distance from y to all t_{k'} samples of another cluster M_{y'}, i.e. a cluster other than the one to which y currently belongs;
(1-2-2) forming the set of inter-class dissimilarities of sample y:

B(y) = { b_1(y), b_2(y), ..., b_k(y), ..., b_K(y) }

where the length of the set B(y) should be |B(y)| = K - 1, since the cluster containing y is excluded;
(1-2-3) obtaining b(y) of sample y:

b(y) = min B(y) = min{ b_1(y), b_2(y), ..., b_k(y), ..., b_K(y) }

where b(y) is the inter-class dissimilarity of sample y; the greater b(y), the greater the probability that y does not belong to any other class;
(1-3) calculating the contour coefficient of sample y:

s(y) = \frac{b(y) - a(y)}{\max\{ a(y), b(y) \}}

where s(y), the contour coefficient of sample y, lies in [-1, 1]: when s(y) approaches 1, the clustering of y is reasonable; when s(y) approaches -1, y should rather be clustered into another class; when s(y) approaches 0, y lies on the boundary between two classes;
(1-4) calculating the overall contour coefficient of the clustering:

S(M) = \frac{1}{n} \sum_{i=1}^{n} s(y_i)

where S(M) is the overall contour coefficient of the clustering result and measures how tightly the data are clustered, n is the total number of samples, and s(y_i) is the contour coefficient of the i-th sample;
(1-5) determining the value of K:
(1-5-1) calculating the overall contour coefficient of the clustering for each candidate K to form the set SK:

SK = { SK_2, SK_3, ..., SK_i, ..., SK_10 }, 2 ≤ i ≤ 10

where SK_i is the overall contour coefficient of the clustering when K = i;
(1-5-2) determining K:

S_k = max{ SK_2, SK_3, ..., SK_i, ..., SK_10 }

the value of i for which SK_i attains this maximum S_k is the finally determined K.
Further, determining the initial cluster center M_1 in step (2) comprises the following specific steps:
(2-1) calculating the distance from each sample to every other sample; sample Y_i and the other samples form the distance matrix D_i:

D_i = [ D_{i1}, D_{i2}, ..., D_{ij}, ... ]

where D_i is the matrix of distances from sample Y_i to the other samples, |D_i| = n - 1 is its number of elements, and D_{ij} is the distance between samples Y_i and Y_j, calculated as

D_{ij} = \sqrt{ \sum_{t=1}^{m} (y_{it} - y_{jt})^2 }
(2-2) sorting the distances of each sample's distance matrix in ascending order and taking the first r elements (which requires |D_i| ≥ r); the averages of these truncated distance matrices form the matrix R. The specific value of r has to be determined; it is provisionally set to 20 and may be adjusted to the actual situation in practical applications:

R = [ R_1, R_2, ..., R_i, ... ]

where |R| ≤ n, i.e. the number of elements of R is not more than n, and R_i is the mean of the first r elements of D_i after ascending sorting:

R_i = \frac{1}{r} \sum_{j=1}^{r} D_{i(j)}

where D_{i(j)} denotes the j-th smallest element of D_i.
(2-3) taking the sample whose subscript corresponds to the minimum element of R as the initial cluster center: if R_i is the minimum element of the matrix R, the corresponding sample Y_i becomes the initial cluster center M_1.
Further, determining the K cluster centers in step (3) comprises:
(3-1) determining the initial cluster center M_1; the specific steps are given in step (2);
(3-2) calculating the shortest distance from each sample to the existing cluster centers:

d_min = [ d_min^(1), d_min^(2), ..., d_min^(n) ], with d_min^(i) = \min_k d_{ik}

where d_min is the matrix of shortest sample-to-center distances, d_min^(i) is the shortest distance from sample Y_i to any cluster center, and d_{ik} is the distance between sample Y_i and cluster center M_k;
(3-3) calculating the sum of the shortest distances of all samples to the cluster centers:

d = \sum_{i=1}^{n} d_min^(i)

(3-4) choosing a value a at random with 0 < a ≤ d, then subtracting the samples' d_min^(i) values from a in turn, a ← a - d_min^(i), until a ≤ 0; the sample Y_i at which a drops to or below 0 is the next cluster center;
(3-5) repeating the steps (3-2) - (3-4) until K cluster centers are selected.
The technical scheme provided by the embodiments of the invention has at least the following beneficial effects:
(1) a more objective classification result: customers are classified automatically by K-Means clustering according to their autonomous behavior on the platform, without admixture of human subjective factors, which ensures the objectivity of the result;
(2) released human resources: the automatic classification releases the human resources that manual classification would occupy and at the same time markedly improves working efficiency;
(3) dynamic monitoring: the automatic approach can detect changes in a customer's classification in real time, and the platform can adopt different personalized services according to the classification and its changes, providing more satisfactory services to customers;
(4) automatic division of the platform's enterprise customers, which gives the platform a basis for more reasonable customer management, provides infrastructure for hierarchical customer management, fine-grained operation and precise marketing, points a direction for more precise customer acquisition, and improves the platform's service quality;
(5) customer classification over data dimensions defined without identity attributes, so the method can adapt to a variety of customer identities;
(6) for enterprise customers, the platform can provide more targeted and convenient services, saving the customers' time and improving the customer experience.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a flowchart of a K-Means clustering-based client automatic classification method according to an embodiment of the present invention.
Fig. 2 is a first diagram of the contour coefficients disclosed in the embodiment of the present invention.
Fig. 3 is a diagram showing a first clustering situation disclosed in the embodiment of the present invention.
Fig. 4 is a second diagram of the contour coefficients disclosed in the embodiment of the present invention.
Fig. 5 is a diagram of a clustering situation disclosed in the embodiment of the present invention.
Detailed Description of Embodiments
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As shown in Fig. 1, an embodiment of the present invention provides a method for automatically classifying customers based on K-Means clustering. Aimed at internet e-commerce platforms, the method takes the customers' autonomous behavior on the platform as the basis of classification and classifies them automatically by K-Means clustering. The method comprises the following steps: initializing the platform customer data, where the data of all company customers on the platform form a data set matrix A whose columns correspond to the data dimensions and whose rows correspond to the customers, and normalizing A to obtain the normalized data set matrix B; classifying the customers automatically with K-Means clustering, determining the value of K with the contour coefficient method and then determining an initial cluster center and K cluster centers; assigning every sample in B to the nearest cluster set by the minimum-distance principle and taking the mean of all samples in each cluster as its new center; and repeating these steps until the cluster centers no longer change, which yields K clusters, the result of automatically classifying the customers. The process of the invention is described in detail below.
Firstly, initializing platform client data:
1. A data set A is defined. The data of all company customers on the platform form data set A, an n × m matrix, where m is the number of data dimensions and corresponds to the number of columns, and n is the number of customers and corresponds to the number of rows. The data dimensions include the number of purchases, the number of supplies, the expenditure amount, the income amount, the customer enterprise size and so on (the list is not exhaustive and may be adjusted according to platform use and development).
2. Data set A is normalized. Because the values of different data dimensions differ greatly in scale, the values of the data dimensions in A are normalized into the range [0, 1]. When a given data dimension is normalized, the normalization formula is

y_{ij} = \frac{x_{ij} - x_{j\min}}{x_{j\max} - x_{j\min}}

where y_{ij} is the normalized value of the original datum x_{ij}, x_{ij} is the original datum in row i, column j of A, x_{j\min} is the minimum of the j-th dimension (i.e. j-th column) data in A, and x_{j\max} is the maximum of the j-th dimension (i.e. j-th column) data in A.
3. The normalized data set B is obtained. After all dimensions of A have been normalized they form the data set B, an n × m matrix in which every dimension's values lie in [0, 1]; one row of B, i.e. one sample, represents the data of one customer.
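The initialization just described can be sketched in a few lines of Python. This is an illustrative sketch only; the helper name `normalize` and the use of NumPy are my own choices, not part of the patent text:

```python
import numpy as np

def normalize(A):
    """Min-max normalize each column (data dimension) of A into [0, 1].

    A is the n x m data set matrix described above: n customers (rows),
    m data dimensions (columns). Returns the normalized data set B.
    """
    A = np.asarray(A, dtype=float)
    col_min = A.min(axis=0)                 # x_jmin for each dimension j
    col_max = A.max(axis=0)                 # x_jmax for each dimension j
    # guard against a constant column, which would give a zero denominator
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    return (A - col_min) / span

# three customers, two dimensions; column [10, 25, 15] maps to 0, 1 and 1/3
B = normalize([[10, 1], [25, 5], [15, 3]])
```

The constant-column guard is an addition of mine; the patent text does not address that case.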
Secondly, automatically classifying the clients by adopting a K-means clustering method
1. The value of K, i.e. the number of clusters, is determined by the contour coefficient method (the specific steps are given in section three).
2. The initial cluster center is determined (the specific steps are given in section four).
3. The K cluster centers are determined (the specific steps are given in section five); they are denoted M_1, M_2, ..., M_K.
4. Every sample in data set B is assigned to the nearest cluster set by the minimum-distance principle, where the distance is the Euclidean distance:

d(Y_i, M_k) = \sqrt{ \sum_{j=1}^{m} (y_{ij} - M_{kj})^2 }

where Y_i is the i-th sample in data set B, M_k is the k-th cluster center, y_{ij} is the value of the j-th dimension of the i-th sample in B, and M_{kj} is the value of the j-th dimension of the k-th cluster center.
5. The mean of all samples in each cluster is taken as the new cluster center:

M_{kj} = \frac{1}{l} \sum_{Y_i \in C_k} y_{ij}, \quad j = 1, ..., m

where M_k is the new k-th cluster center, C_k is the current k-th cluster, l is the total number of samples in C_k, and \sum_{Y_i \in C_k} y_{ij} is the sum of the j-th dimension values of the samples in C_k.
6. Steps 4-5 are repeated until the cluster centers no longer change.
7. This finishes with K clusters, which are the result of automatically classifying the customers.
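Steps 4 to 7 above form a plain K-Means loop, which can be sketched as follows. This is a minimal illustration assuming NumPy, with the centers supplied from the earlier steps; the function name `k_means` is mine:

```python
import numpy as np

def k_means(B, centers, max_iter=100):
    """Assign each sample to the nearest center (Euclidean distance),
    recompute centers as cluster means, and repeat until the centers
    stop changing, as in steps 4-7 above."""
    B = np.asarray(B, dtype=float)
    M = np.asarray(centers, dtype=float)
    for _ in range(max_iter):
        # distance of every sample to every center, shape (n, K)
        d = np.linalg.norm(B[:, None, :] - M[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # empty clusters keep their old center
        new_M = np.array([B[labels == k].mean(axis=0)
                          if np.any(labels == k) else M[k]
                          for k in range(len(M))])
        if np.allclose(new_M, M):
            break
        M = new_M
    return labels, M

# five 2-D samples with two hand-picked initial centers
labels, M = k_means([[0, 2], [0, 0], [1.5, 0], [5, 0], [5, 2]],
                    centers=[[0, 2], [0, 0]])
```

The convergence test `np.allclose` stands in for "the cluster centers no longer change"; an exact equality test would also work here since the means are computed deterministically.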
Thirdly, determining the K value by a contour coefficient method
1. Calculate the intra-class dissimilarity a(y) of sample y:

a(y) = \frac{1}{t_k - 1} \sum_{p \in M_y, p \neq y} \| y - p \|

a(y), called the intra-class dissimilarity of sample y, is the average distance from y to the other samples of the cluster M_y to which y currently belongs; the smaller the value, the more likely y belongs to that cluster. t_k is the total number of samples in M_y, p ∈ M_y denotes the other samples belonging to M_y, and \| y - p \| is the Euclidean distance from sample y to sample p.
2. Calculate the inter-class dissimilarity b(y) of sample y:
(1) calculate b_k(y):

b_k(y) = \frac{1}{t_{k'}} \sum_{p \in M_{y'}} \| y - p \|

where b_k(y), called the dissimilarity between sample y and cluster M_{y'}, is the average distance from y to all t_{k'} samples of another cluster M_{y'}, i.e. a cluster other than the one to which y currently belongs.
(2) Form the set of inter-class dissimilarities of sample y:

B(y) = { b_1(y), b_2(y), ..., b_k(y), ..., b_K(y) }

The length of the set B(y) should be |B(y)| = K - 1, since the cluster containing y is excluded.
(3) Obtain b(y) of sample y:

b(y) = min B(y) = min{ b_1(y), b_2(y), ..., b_k(y), ..., b_K(y) }

where b(y) is the inter-class dissimilarity of sample y; the larger b(y), the higher the probability that y does not belong to any other class.
3. Calculate the contour coefficient of sample y:

s(y) = \frac{b(y) - a(y)}{\max\{ a(y), b(y) \}}

where s(y) is the contour coefficient of sample y, with values in [-1, 1]. When s(y) approaches 1, the clustering of y is reasonable; when s(y) approaches -1, y should rather be clustered into another class; when s(y) approaches 0, y lies on the boundary between two classes.
4. Calculate the overall contour coefficient of the clustering:

S(M) = \frac{1}{n} \sum_{i=1}^{n} s(y_i)

where S(M) is the overall contour coefficient of the clustering result and measures how tightly the data are clustered; n is the total number of samples, and s(y_i) is the contour coefficient of the i-th sample.
5. Determine the value of K.
(1) Calculate the overall contour coefficient of the clustering for each candidate K to form the set SK:

SK = { SK_2, SK_3, ..., SK_i, ..., SK_10 }, 2 ≤ i ≤ 10

where SK_i is the overall contour coefficient of the clustering when K = i.
(2) Determine the value of K:

S_k = max{ SK_2, SK_3, ..., SK_i, ..., SK_10 }

The value of i for which SK_i attains this maximum S_k is the finally determined K.
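The overall contour coefficient defined above can be computed directly. The sketch below (my own `silhouette` helper, assuming NumPy, and assuming at least two clusters of size two or more) scores one clustering; evaluating K = 2..10 and keeping the K with the largest score then reproduces step 5:

```python
import numpy as np

def silhouette(B, labels):
    """Overall contour (silhouette) coefficient S(M) of a clustering:
    the mean of s(y) over all samples, per steps 1-4 above."""
    B = np.asarray(B, dtype=float)
    labels = np.asarray(labels)
    # pairwise Euclidean distance matrix, shape (n, n)
    D = np.linalg.norm(B[:, None, :] - B[None, :, :], axis=2)
    s = []
    for i in range(len(B)):
        same = (labels == labels[i])
        if same.sum() <= 1:
            s.append(0.0)        # common convention for singleton clusters
            continue
        a = D[i, same].sum() / (same.sum() - 1)          # intra-class a(y)
        b = min(D[i, labels == k].mean()                 # inter-class b(y)
                for k in set(labels) if k != labels[i])
        s.append((b - a) / max(a, b))
    return float(np.mean(s))
```

The zero score for singleton clusters is my own convention (matching common practice), since a(y) is undefined when a cluster contains only one sample; the patent text does not address that case.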
Four: determination of the initial cluster center M_1
1. Calculate the distance from each sample to every other sample; sample Y_i and the other samples form the distance matrix D_i:

D_i = [ D_{i1}, D_{i2}, ..., D_{ij}, ... ]

where D_i is the matrix of distances from sample Y_i to the other samples, |D_i| = n - 1 is its number of elements, and D_{ij} is the distance between samples Y_i and Y_j, calculated as

D_{ij} = \sqrt{ \sum_{t=1}^{m} (y_{it} - y_{jt})^2 }
2. Sort the distances of each sample's distance matrix in ascending order and take the first r elements (which requires |D_i| ≥ r); the averages of these truncated distance matrices form the matrix R (the specific value of r has to be determined; it is provisionally set to 20 and may be adjusted to the actual situation in practical applications).

R = [ R_1, R_2, ..., R_i, ... ]

where |R| ≤ n, i.e. the number of elements of R is not more than n, and R_i is the mean of the first r elements of D_i after ascending sorting:

R_i = \frac{1}{r} \sum_{j=1}^{r} D_{i(j)}

where D_{i(j)} denotes the j-th smallest element of D_i.
3. Take the sample whose subscript corresponds to the minimum element of R as the initial cluster center: if R_i is the minimum element of the matrix R, the corresponding sample Y_i becomes the initial cluster center M_1.
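A sketch of this density-style choice of M_1 (the function name `initial_center` is mine, and NumPy is assumed). For fewer than r + 1 samples it simply uses all available distances, a fallback of my own not discussed in the text:

```python
import numpy as np

def initial_center(B, r=20):
    """Pick the index of the initial cluster center M1: for each sample,
    average its r smallest distances to the other samples; the sample
    with the smallest average (i.e. in the densest region) wins.
    r = 20 follows the provisional value given in the text."""
    B = np.asarray(B, dtype=float)
    n = len(B)
    r = min(r, n - 1)
    D = np.linalg.norm(B[:, None, :] - B[None, :, :], axis=2)
    # for each row, drop the zero self-distance, sort, average the first r
    R = np.array([np.sort(np.delete(D[i], i))[:r].mean() for i in range(n)])
    return int(R.argmin())
```

On the five-point data of example two below, the densest sample is O_3, so this helper returns index 2.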
Five: determination of the K cluster centers
1. Determine the initial cluster center M_1 (see section four for the specific steps).
2. Calculate the shortest distance from each sample to the existing cluster centers:

d_min = [ d_min^(1), d_min^(2), ..., d_min^(n) ], with d_min^(i) = \min_k d_{ik}

where d_min is the matrix of shortest sample-to-center distances, d_min^(i) is the shortest distance from sample Y_i to any cluster center, and d_{ik} is the distance between sample Y_i and cluster center M_k.
3. Calculate the sum of the shortest distances of all samples to the cluster centers:

d = \sum_{i=1}^{n} d_min^(i)
4. Choose a value a at random with 0 < a ≤ d, then subtract the samples' d_min^(i) values from a in turn,

a ← a - d_min^(i)

until a ≤ 0; the sample Y_i at which a drops to or below 0 is the next cluster center.
5. And repeating the steps 2-4 until K clustering centers are selected.
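Steps 1 to 5 amount to a roulette-wheel selection over the shortest distances, much like K-means++ (which weights by squared distances instead). A sketch under my own naming, assuming NumPy, with the index of M_1 passed in from the previous section:

```python
import numpy as np

def choose_centers(B, K, m1=0, rng=None):
    """Select K center indices: start from the density-based M1 (index m1)
    and repeatedly draw the next center with probability proportional to
    each sample's shortest distance to the centers chosen so far."""
    rng = rng or np.random.default_rng(0)
    B = np.asarray(B, dtype=float)
    centers = [m1]
    while len(centers) < K:
        D = np.linalg.norm(B[:, None, :] - B[centers][None, :, :], axis=2)
        d_min = D.min(axis=1)          # shortest distance to any center
        a = rng.uniform(0, d_min.sum())
        for i, dm in enumerate(d_min):
            if dm == 0:
                continue               # already a center, never re-picked
            a -= dm
            if a <= 0:
                centers.append(i)
                break
    return centers
```

Skipping zero-distance samples guarantees the K chosen centers are distinct, which the roulette procedure in the text implies but does not state.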
Example one: normalization calculation example
Suppose the database contains the following group of customer records (the amounts involved in the transactions are in units of ten thousand). For convenience of illustration the entries of A are shown as integers (in actual application four decimal places are kept):

[The original shows the raw records and the resulting data set matrix A as images; the first dimension of A is X_1 = [10 25 15].]
the data set A is a 3 x 5 matrix and shows that 2 enterprise client sample data are provided, each sample comprises numerical information of 5 dimensions, dimension 1 shows the purchasing times, dimension 2 shows the supply times, dimension 3 shows the expenditure amount, dimension 4 shows the income amount, and dimension 5 shows the enterprise scale of the client enterprise.
An example of the normalization calculation for dimension 1:
For the first dimension, X_1 = [10 25 15], so x_{1min} = 10 and x_{1max} = 25:

y_{11} = (10 - 10) / (25 - 10) = 0
y_{21} = (25 - 10) / (25 - 10) = 1
y_{31} = (15 - 10) / (25 - 10) ≈ 0.3333

After normalization the first dimension is therefore [0 1 0.3333].
The normalization process is the same for the remaining dimensions.
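The dimension-1 numbers above can be checked with a couple of lines of plain Python (no library or patent API implied):

```python
# Check the worked normalization for X1 = [10, 25, 15]: min 10, max 25.
x = [10, 25, 15]
lo, hi = min(x), max(x)
y = [(v - lo) / (hi - lo) for v in x]
print([round(v, 4) for v in y])  # [0.0, 1.0, 0.3333]
```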
Example two, K-Means clustering algorithm calculation example
Assume the following data:
O_1(0, 2), O_2(0, 0), O_3(1.5, 0), O_4(5, 0), O_5(5, 2)
1. Select O_1(0, 2) and O_2(0, 0) as the initial cluster centers, i.e. M_1 = O_1 = (0, 2) and M_2 = O_2 = (0, 0).
2. For each object remaining, it is assigned to the closest class according to its distance from the respective cluster center.
For O_3:
d(M_1, O_3) = sqrt((0 - 1.5)^2 + (2 - 0)^2) = 2.5
d(M_2, O_3) = sqrt((0 - 1.5)^2 + (0 - 0)^2) = 1.5
Since d(M_2, O_3) ≤ d(M_1, O_3), O_3 is assigned to C_2.
For O_4:
d(M_1, O_4) = sqrt((0 - 5)^2 + (2 - 0)^2) = sqrt(29) ≈ 5.39
d(M_2, O_4) = sqrt((0 - 5)^2 + (0 - 0)^2) = 5
Since d(M_2, O_4) ≤ d(M_1, O_4), O_4 is assigned to C_2.
For O_5:
d(M_1, O_5) = sqrt((0 - 5)^2 + (2 - 2)^2) = 5
d(M_2, O_5) = sqrt((0 - 5)^2 + (0 - 2)^2) = sqrt(29) ≈ 5.39
Since d(M_1, O_5) ≤ d(M_2, O_5), O_5 is assigned to C_1.
Updating yields the new classes C_1 = {O_1, O_5} and C_2 = {O_2, O_3, O_4}, with centers M_1 = (0, 2) and M_2 = (0, 0). The squared-error criterion is computed with the individual errors:
E_1 = [(0-0)^2 + (2-2)^2] + [(0-5)^2 + (2-2)^2] = 25
E_2 = [(0-0)^2 + (0-0)^2] + [(0-1.5)^2 + (0-0)^2] + [(0-5)^2 + (0-0)^2] = 27.25
The overall squared error is E = E_1 + E_2 = 25 + 27.25 = 52.25.
3. Compute the new cluster centers:
M1 = ((0 + 5)/2, (2 + 2)/2) = (2.5, 2)
M2 = ((0 + 1.5 + 5)/3, (0 + 0 + 0)/3) = (2.17, 0)
Repeating steps 2 and 3: O1 is assigned to C1, O2 to C2, O3 to C2, O4 to C2, and O5 to C1. Updating again gives C1 = {O1, O5} and C2 = {O2, O3, O4}, with centers M1 = (2.5, 2) and M2 = (2.17, 0).
The individual variances are:
E1 = [(2.5 − 0)² + (2 − 2)²] + [(2.5 − 5)² + (2 − 2)²] = 12.5
E2 = [(2.17 − 0)² + (0 − 0)²] + [(2.17 − 1.5)² + (0 − 0)²] + [(2.17 − 5)² + (0 − 0)²] ≈ 13.17
The overall average error is E = E1 + E2 = 12.5 + 13.17 = 25.67.
After the first iteration the overall average error thus drops from 52.25 to 25.67, a significant reduction. Since the cluster memberships do not change between the two iterations, the centers no longer move, and the iteration stops.
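The two iterations worked through above can be reproduced with a short sketch (an illustration under our own naming, not the patent's implementation; it assumes no cluster ever becomes empty):

```python
import math

def kmeans(points, centers, iters=100):
    """Plain K-Means: assign each point to its nearest center, then
    recompute each center as the mean of its cluster; stop when the
    centers no longer move. Assumes no cluster becomes empty."""
    clusters = [[] for _ in centers]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            k = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[k].append(p)
        new = [tuple(sum(c) / len(grp) for c in zip(*grp)) for grp in clusters]
        if new == centers:
            break
        centers = new
    return centers, clusters

pts = [(0, 2), (0, 0), (1.5, 0), (5, 0), (5, 2)]
centers, clusters = kmeans(pts, [(0, 2), (0, 0)])
print(centers)  # final centers: (2.5, 2.0) and approximately (2.17, 0.0)
```

Starting from M1 = O1 and M2 = O2 it converges after two assignment passes, matching the example.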
Example three, determining the K value by the contour (silhouette) coefficient method
Example 1: according to the results after clustering in example two: k is 2, clustering result C1={O1,O5And C2={O2,O3,O4In which O is1(0,2),O2(0,0),O3(1.5,0),O4(5,0),O5(5,2)。
1. Compute the intra-class dissimilarity of a sample (taking O1 and O2 as examples):
a(O1) = d(O1, O5) = 5
a(O2) = (d(O2, O3) + d(O2, O4)) / 2 = (1.5 + 5) / 2 = 3.25
2. Compute the inter-class dissimilarity of a sample (taking O1 and O2 as examples):
Since K = 2 in this example, b(y) is simply the average distance from the sample to the single other cluster.
b(O1) = (d(O1, O2) + d(O1, O3) + d(O1, O4)) / 3 = (2 + 2.5 + 5.39) / 3 ≈ 3.30
b(O2) = (d(O2, O1) + d(O2, O5)) / 2 = (2 + 5.39) / 2 ≈ 3.69
3. Compute the contour coefficients of the samples (taking O1 and O2 as examples):
s(O1) = (b(O1) − a(O1)) / max{a(O1), b(O1)} = (3.30 − 5) / 5 ≈ −0.34
s(O2) = (b(O2) − a(O2)) / max{a(O2), b(O2)} = (3.69 − 3.25) / 3.69 ≈ 0.12
4. Compute the overall contour coefficient of the clustering when K = 2:
S = (s(O1) + s(O2) + s(O3) + s(O4) + s(O5)) / 5
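The dissimilarities and contour coefficients of O1 and O2 can be checked with a small sketch (our own illustration; the function name is not from the patent):

```python
import math

def contour_coefficient(point, own, others):
    """s(y) = (b - a) / max(a, b): a is the mean distance to the other
    samples of the point's own cluster, b the smallest mean distance
    to any other cluster."""
    a = sum(math.dist(point, p) for p in own if p != point) / (len(own) - 1)
    b = min(sum(math.dist(point, q) for q in c) / len(c) for c in others)
    return (b - a) / max(a, b)

C1 = [(0, 2), (5, 2)]
C2 = [(0, 0), (1.5, 0), (5, 0)]
s1 = contour_coefficient((0, 2), C1, [C2])
s2 = contour_coefficient((0, 0), C2, [C1])
print(round(s1, 3), round(s2, 3))
```

Here s(O1) comes out negative, i.e. O1 lies closer to the other cluster than to its own, while s(O2) is mildly positive.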
Contour coefficient diagrams:
Example 1:
O1(0, 2), O2(0, 0), O3(1.5, 0), O4(5, 0), O5(5, 2); the contour coefficient diagram of this example is shown in Fig. 2 (the data in this example are not normalized; only the contour coefficients are shown):
The horizontal axis represents the number of clusters K, and the vertical axis represents the overall contour coefficient. It can be seen that the overall contour coefficient is largest when K = 2, so K = 2 is taken.
The clustering at this time is shown in fig. 3.
Example 2:
O1(0, 2), O2(0, 0), O3(1.5, 0), O4(5, 0), O5(5, 2), O6(2, 3), O7(2, 2.5), O8(2, 3.5); the contour coefficient diagram of this example is shown in Fig. 4 (the data of this example are not normalized; only the contour coefficients are shown):
The horizontal axis represents the number of clusters K, and the vertical axis represents the overall contour coefficient. It can be seen that the overall contour coefficient is largest when K = 3, so K = 3 is taken.
The clustering at this time is shown in fig. 5.
It should be understood that the specific order or hierarchy of steps in the processes disclosed is an example of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not intended to be limited to the specific order or hierarchy presented.
In the foregoing detailed description, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the subject matter require more features than are expressly recited in each claim. Rather, as the following claims reflect, invention lies in less than all features of a single disclosed embodiment. Thus, the following claims are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate preferred embodiment of the invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. Of course, the processor and the storage medium may reside as discrete components in a user terminal.
For a software implementation, the techniques described herein may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. The software codes may be stored in memory units and executed by processors. The memory unit may be implemented within the processor or external to the processor, in which case it can be communicatively coupled to the processor via various means as is known in the art.
What has been described above includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, to the extent that the term "includes" is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term "comprising" as "comprising" is interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification of the claims is intended to mean a "non-exclusive or".

Claims (5)

1. An automatic customer classification method based on K-Means clustering, characterized by comprising the following steps:
initializing platform client data: the data of all company clients on the platform form a data set matrix A, in which the data dimensions correspond to the columns of the matrix and the clients correspond to the rows; normalizing the data set matrix A to obtain a normalized data set matrix B;
automatically classifying the clients by a K-Means clustering method, comprising the following steps:
(1) determining the K value, i.e. the number of clusters, by a contour coefficient method;
(2) determining an initial clustering center M1;
(3) determining K clustering centers, respectively denoted M1, M2, …, MK;
(4) Distributing all samples in the data set matrix B to the nearest cluster set according to the principle of minimum distance, wherein the distance is calculated by adopting Euclidean distance:
d(Yi, Mk) = √( Σj=1..m (yij − Mkj)² )
wherein Yi represents the ith sample in the data set B, Mk denotes the kth cluster center, yij represents the value of the jth dimension of the ith sample in the data set B, and Mkj represents the value of the jth dimension of the kth cluster center;
(5) taking the mean of all samples in each cluster as the new cluster center:
Mk = (1/l) · (Σ yi1, Σ yi2, …, Σ yim), the sums being taken over all samples in the current kth cluster
wherein Mk represents the new kth cluster center, l represents the total number of samples in the current kth cluster, and Σ yim represents the sum of the values of the mth dimension over the samples in the current kth cluster;
(6) repeating the steps (4) to (5) until the clustering center is not changed any more;
(7) and finishing to obtain K clusters, namely obtaining the result of automatically classifying the client.
2. The method for customer automatic classification based on K-Means clustering according to claim 1, wherein the initialization of platform customer data to obtain a normalized data set matrix B comprises the following specific steps:
(1) defining a data set A: the data of all company clients on the platform form the data set A; A is an n × m matrix, wherein m represents the number of data dimensions and corresponds to the number of columns of the matrix, and n represents the number of clients and corresponds to the number of rows; the data cover all company clients of the platform; the data dimensions include, but are not limited to, the number of purchases, the number of supplies, the expenditure amount, the income amount and the scale of the client enterprise, and can be adjusted appropriately according to platform use and development conditions;
(2) Normalizing the data set A, normalizing the numerical values of all data dimensions in the data set A to the range of [0,1], wherein when a certain data dimension is normalized, the normalization formula is as follows:
yij = (xij − xjmin) / (xjmax − xjmin)
wherein yij represents the value of the original datum xij after normalization; xij represents the original datum in row i, column j of A; xjmin represents the minimum value of the jth-dimension data in A; and xjmax represents the maximum value of the jth-dimension data in A;
(3) obtaining the normalized data set B: after all dimensions of the data set A are normalized, the data set B is formed; B is an n × m matrix, the values of all dimension data in B lie in the range [0, 1], and one piece of data in B, i.e. one sample, represents the data of one client.
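Claim 2's three steps amount to column-wise min-max scaling of the whole matrix; a compact sketch (illustrative only, with made-up data and our own function name):

```python
def normalize_matrix(A):
    """Column-wise min-max normalization: every dimension of the n x m
    matrix A is scaled into [0, 1], giving the data set matrix B."""
    cols = list(zip(*A))               # one tuple per dimension
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(x - lo[j]) / (hi[j] - lo[j]) for j, x in enumerate(row)]
            for row in A]

A = [[10, 4], [25, 2], [15, 6]]        # 3 clients, 2 dimensions (toy data)
B = normalize_matrix(A)
print(B)
```

Every entry of B lies in [0, 1], and each row of B is one normalized client sample.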
3. The method for customer automatic classification based on K-Means clustering as claimed in claim 1, wherein the step (1) of determining the K value by using a contour coefficient method comprises the following specific steps:
(1-1) calculating the intra-class dissimilarity a(y) of sample y:
a(y) = (1 / (tk − 1)) · Σ(p∈My, p≠y) ‖y − p‖
wherein a(y) represents the average distance from sample y to the other samples in the same cluster; My represents the cluster to which the current sample y belongs; the smaller the value, the more likely sample y belongs to this class, and a(y) is called the intra-class dissimilarity of sample y; tk represents the total number of samples in the cluster My; p ∈ My denotes a sample belonging to the cluster My; ‖y − p‖ represents the Euclidean distance from sample y to sample p;
(1-2) calculating the degree of inter-class dissimilarity b (y) of the sample y:
(1-2-1) calculation of bk(y):
bk(y) = (1 / |My′|) · Σ(p∈My′) ‖y − p‖
wherein bk(y) represents the average distance from sample y to all samples in another cluster; My′ denotes a cluster other than the one to which the current sample y belongs; bk(y) is called the dissimilarity between sample y and the class My′;
(1-2-2) set of inter-class dissimilarity of sample y:
B(y) = {b1(y), b2(y), …, bk(y), …, bK(y)}
wherein the length of the set B(y) is |B(y)| = K − 1, since the sample's own cluster is excluded;
(1-2-3) obtaining b (y) of sample y:
b(y) = min B(y) = min{b1(y), b2(y), …, bk(y), …, bK(y)}
wherein b(y) represents the inter-class dissimilarity of sample y; the larger b(y) is, the less likely sample y belongs to any other class;
(1-3) calculating the contour coefficient of the sample y:
s(y) = (b(y) − a(y)) / max{a(y), b(y)}
wherein s(y) represents the contour coefficient of sample y, with value range [−1, 1]; when s(y) approaches 1, the clustering of sample y is reasonable; when s(y) approaches −1, sample y should rather be assigned to another cluster; when s(y) approaches 0, sample y lies on the boundary between two clusters;
(1-4) calculating the overall contour coefficient of the cluster:
S(M) = (1/n) · Σi=1..n s(yi)
wherein S(M) represents the overall contour coefficient of the clustering result, which measures how tightly the data are clustered; n represents the total number of samples; and s(yi) represents the contour coefficient of the ith sample;
(1-5) determining the K value:
(1-5-1) calculating the overall contour coefficients of the clusterings obtained with different K values to form a matrix SK:
SK = {SK2, SK3, …, SKi, …, SK10}, 2 ≤ i ≤ 10
wherein SKi represents the overall contour coefficient of the clustering when K = i;
(1-5-2) determination of K value:
Sk = max{SK2, SK3, …, SKi, …, SK10}
wherein Sk is the largest overall contour coefficient, and the i of the SKi attaining this maximum is the finally determined value of K.
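Step (1-5) reduces to taking the K whose overall contour coefficient is largest; a minimal sketch (the scores below are made-up placeholders, not values from the patent):

```python
def choose_k(sk):
    """Return the K whose overall contour coefficient SK_K is largest."""
    return max(sk, key=sk.get)

# hypothetical overall contour coefficients for K = 2..5
scores = {2: 0.41, 3: 0.55, 4: 0.38, 5: 0.30}
print(choose_k(scores))  # -> 3
```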
4. The method for automatic customer classification based on K-Means clustering as claimed in claim 3, wherein the determination of the initial clustering center M1 in the step (2) comprises the following specific steps:
(2-1) calculating the distance between each sample and the other samples; the distances from sample Yi to the other samples form a matrix Di:
Di = [Di1, Di2, …, Dij, …]
wherein Di represents the matrix formed by the distances from sample Yi to the other samples; |Di| represents the number of elements of the matrix Di, which is n − 1; Dij represents the distance between sample Yi and sample Yj, calculated as:
Dij = √( Σd=1..m (yid − yjd)² )
(2-2) sorting the distances in each sample's distance matrix from small to large and taking the first r elements (which requires |Di| ≥ r); the average of the first r distances of each sample forms an element of a matrix R; the specific value of r remains to be determined and is provisionally set to 20, to be adjusted appropriately according to the actual situation in practical application;
R = [R1, R2, …, Ri, …]
wherein |R| ≤ n, i.e. the number of elements of the matrix R is not greater than n; Ri represents the average of the first r elements of the distance matrix Di after sorting from small to large, calculated as:
Ri = (1/r) · Σj=1..r Di(j), where Di(j) denotes the jth smallest element of Di
(2-3) taking the sample corresponding to the minimum element of the matrix R as the initial cluster center: if Ri is the minimum element of R, the corresponding sample Yi is taken as the initial cluster center M1 of the clustering.
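Steps (2-1)-(2-3) pick as the first center the sample whose r nearest neighbours are, on average, closest; a sketch (illustrative, with our own names and toy data):

```python
import math

def initial_center(points, r):
    """Return the sample whose mean distance to its r nearest
    neighbours is smallest (the densest sample), as in steps (2-1)-(2-3)."""
    def r_mean(p):
        dists = sorted(math.dist(p, q) for q in points if q is not p)
        return sum(dists[:r]) / r
    return min(points, key=r_mean)

pts = [(0, 0), (0.6, 0), (0.4, 0.3), (5, 5)]
print(initial_center(pts, 2))  # the sample in the densest region
```

Choosing the densest sample makes the first center robust to outliers such as (5, 5) above.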
5. The method for customer automatic classification based on K-Means clustering as claimed in claim 4, wherein the K clustering centers are determined in the step (3), and the specific steps comprise:
(3-1) determination of initial clustering center M1The concrete steps are shown in step (2);
(3-2) calculating the shortest distance between each sample and the existing cluster centers:
dmin = [dmin(1), dmin(2), …, dmin(n)]
dmin(i) = min{di1, di2, …, dik, …}
wherein dmin represents the matrix of the shortest distances between the samples and the cluster centers, dmin(i) represents the shortest distance from sample Yi to the existing cluster centers, and dik represents the distance between sample Yi and the cluster center Mk;
(3-3) calculating the sum of the shortest distances of all samples to the cluster centers:
d = Σi=1..n dmin(i)
(3-4) randomly selecting a value a with 0 < a ≤ d, then repeatedly subtracting from a the samples' shortest distances:
a = a − dmin(i), for i = 1, 2, …
until a ≤ 0; the sample Yi at which a first drops to ≤ 0 is taken as the next cluster center;
(3-5) repeating the steps (3-2) - (3-4) until K cluster centers are selected.
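Steps (3-2)-(3-4) are a roulette-wheel selection weighted by each sample's shortest distance to the existing centers (essentially the K-Means++ seeding rule); a sketch with our own names:

```python
import math
import random

def next_center(points, centers, rng=random):
    """Draw a in (0, d], where d is the sum of the samples' shortest
    distances to the existing centers, then subtract those distances
    one by one; the sample where a drops to <= 0 is the next center."""
    d_min = [min(math.dist(p, c) for c in centers) for p in points]
    a = rng.uniform(0, sum(d_min))
    for p, dm in zip(points, d_min):
        a -= dm
        if a <= 0:
            return p
    return points[-1]  # guard against floating-point leftovers

pts = [(0, 2), (0, 0), (1.5, 0), (5, 0), (5, 2)]
random.seed(0)          # fixed seed so the draw is reproducible
print(next_center(pts, [(0, 0)]))
```

Samples far from every existing center get a proportionally larger slice of (0, d] and are therefore more likely to be chosen, spreading the K centers apart.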
CN202110293773.3A 2021-03-19 2021-03-19 Automatic customer classification method based on K-Means clustering Pending CN112905863A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110293773.3A CN112905863A (en) 2021-03-19 2021-03-19 Automatic customer classification method based on K-Means clustering


Publications (1)

Publication Number Publication Date
CN112905863A true CN112905863A (en) 2021-06-04

Family

ID=76105486

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110293773.3A Pending CN112905863A (en) 2021-03-19 2021-03-19 Automatic customer classification method based on K-Means clustering

Country Status (1)

Country Link
CN (1) CN112905863A (en)


Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106228399A (en) * 2016-07-20 2016-12-14 福建工程学院 A kind of stock trader's customer risk preference categories method based on big data
CN107145895A (en) * 2017-03-13 2017-09-08 东方网力科技股份有限公司 Public security crime class case analysis method based on k means algorithms
CN108230029A (en) * 2017-12-29 2018-06-29 西南大学 Client trading behavior analysis method
CN108629375A (en) * 2018-05-08 2018-10-09 广东工业大学 Power customer sorting technique, system, terminal and computer readable storage medium
CN108681973A (en) * 2018-05-14 2018-10-19 广州供电局有限公司 Sorting technique, device, computer equipment and the storage medium of power consumer
CN108734217A (en) * 2018-05-22 2018-11-02 齐鲁工业大学 A kind of customer segmentation method and device based on clustering
CN109446185A (en) * 2018-08-29 2019-03-08 广西大学 Collaborative filtering missing data processing method based on user's cluster
CN109657712A (en) * 2018-12-11 2019-04-19 浙江工业大学 A kind of electric business food and drink data analysing method based on the improved K-Means algorithm of Spark
CN110085322A (en) * 2019-04-18 2019-08-02 岭南师范学院 A kind of improved method of k-means cluster diabetes Early-warning Model
CN112381248A (en) * 2020-11-27 2021-02-19 广东电网有限责任公司肇庆供电局 Power distribution network fault diagnosis method based on deep feature clustering and LSTM


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Yang Yifan, He Guoxian, Li Yongding: "K-means algorithm with optimized selection of initial cluster centers", Computer Knowledge and Technology *
Yang Junchuang, Zhao Chao: "A survey of K-Means clustering algorithms", Computer Engineering and Applications *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113643116A (en) * 2021-08-23 2021-11-12 中远海运科技(北京)有限公司 Method for classifying companies based on financial voucher data, computer readable medium
CN113643116B (en) * 2021-08-23 2023-10-27 中远海运科技(北京)有限公司 Company classification method based on financial evidence data and computer readable medium
CN115114770A (en) * 2022-06-06 2022-09-27 哈尔滨工业大学 Baseline self-adaptive auxiliary power device performance trend analysis method
CN115114770B (en) * 2022-06-06 2024-04-16 哈尔滨工业大学 Baseline self-adaptive auxiliary power device performance trend analysis method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20210604)