CN112905863A - Automatic customer classification method based on K-Means clustering - Google Patents


Info

Publication number
CN112905863A
Authority
CN
China
Prior art keywords
sample
matrix
data
cluster
clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110293773.3A
Other languages
Chinese (zh)
Inventor
霍胜军
郑鑫
于德尚
徐楠楠
Current Assignee
Qingdao Mengdou Network Technology Co ltd
Original Assignee
Qingdao Mengdou Network Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Qingdao Mengdou Network Technology Co ltd filed Critical Qingdao Mengdou Network Technology Co ltd
Priority to CN202110293773.3A
Publication of CN112905863A
Legal status: Pending

Classifications

    • G06F16/906 (Physics; Computing; Electric digital data processing; Information retrieval, details of database functions): Clustering; Classification
    • G06F18/23213 (Pattern recognition; Analysing; Clustering techniques; Non-hierarchical techniques using statistics or function optimisation): fixed number of clusters, e.g. K-means clustering
    • G06Q30/0201 (ICT specially adapted for administrative, commercial, financial, managerial or supervisory purposes; Commerce; Marketing): Market modelling; Market analysis; Collecting market data

Abstract

The invention provides an automatic customer classification method based on K-Means clustering, comprising the following steps: initializing the platform customer data, where the data of all company customers on the platform form a data set matrix A whose columns correspond to the data dimensions and whose rows correspond to the customers, and normalizing A to obtain the normalized data set matrix B; classifying the customers automatically with K-Means clustering, determining the value of K with the contour (silhouette) coefficient method and then determining an initial cluster center and, from it, K cluster centers; assigning every sample in B to the nearest cluster set by the minimum-distance principle and taking the mean of all samples in each cluster as its new center; and repeating these steps until the cluster centers no longer change, which yields K clusters, the result of automatically classifying the customers. The method ensures the objectivity of the classification result and saves labor cost.

Description

Automatic customer classification method based on K-Means clustering
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a client automatic classification method based on K-Means clustering.
Background
In the field of education, Confucius put forward the concept of "teaching students in accordance with their aptitude", and personalized learning is an ideal that education pursues. For the service industry, personalized service is likewise an ideal to pursue.
With the rapid development of the internet, the network's strong interactivity and distributed character provide technical support for personalized services on an internet platform. An internet e-commerce platform database holds a large amount of customer data; using these data to classify customers makes it possible to provide more personalized services to different types of customers and to bring more benefit to the platform.
Directly classifying the platform's existing customers by hand is unrealistic and suffers mainly from the following problems:
(1) weak objectivity: different people mix personal subjective factors into the classification process, and the classification standard itself is limited and influenced by those subjective factors, so the resulting customer classification is not objective;
(2) wasted resources: classifying a large number of customers manually consumes a great deal of people's time and wastes the platform's human resources.
Moreover, the prior art lacks a suitable automatic classification method. In view of the above, it is desirable to provide an automatic classification method that solves these problems.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the problems described in the background section, the invention provides an automatic customer classification method based on K-Means clustering for internet e-commerce platforms. Taking the customers' autonomous behavior on the platform as the basis of classification, it classifies customers automatically by K-Means clustering, which ensures the objectivity of the classification result and releases the human resources otherwise occupied.
In order to solve the problems, the technical scheme adopted by the invention is as follows:
the automatic customer classification method based on K-Means clustering is characterized by comprising the following steps of:
initializing platform client data, forming a data set matrix A by data of all company clients on a platform, and normalizing the data set matrix A by using the column number in a data dimension corresponding matrix and the row number in a client number corresponding matrix to obtain a normalized data set matrix B;
the method for automatically classifying the clients by adopting a K-means clustering method comprises the following steps:
(1) determining a K value, namely the number of the types of clustering by adopting a contour coefficient method;
(2) determining an initial clustering center M1
(3) Determining K clustering centers; the K clustering centers are respectively marked as M1,M2,……,MK
(4) assigning every sample in the data set matrix B to the nearest cluster set by the minimum-distance principle, where the distance is the Euclidean distance:

d(Y_i, M_k) = \sqrt{ \sum_{j=1}^{m} (y_{ij} - M_{kj})^2 }

where Y_i is the i-th sample in data set B, M_k is the k-th cluster center, y_{ij} is the value of the j-th dimension of the i-th sample in B, and M_{kj} is the value of the j-th dimension of the k-th cluster center;
(5) taking the mean of all samples in each cluster as the new cluster center:

M_{kj} = \frac{1}{l} \sum_{Y_i \in C_k} y_{ij}, \quad j = 1, ..., m

where M_k is the new k-th cluster center, C_k is the current k-th cluster, l is the total number of samples in C_k, and \sum_{Y_i \in C_k} y_{ij} is the sum of the j-th dimension values of the samples in C_k;
(6) repeating steps (4) to (5) until the cluster centers no longer change;
(7) finishing with K clusters, which are the result of automatically classifying the customers.
Further, initializing the platform customer data to obtain the normalized data set matrix B comprises the following specific steps:
(1) defining data set A: the data of all company customers on the platform form data set A, an n × m matrix, where m is the number of data dimensions and corresponds to the number of columns, and n is the number of customers and corresponds to the number of rows. The data dimensions include, but are not limited to, the number of purchases, the number of supplies, the expenditure amount, the income amount and the customer enterprise size, and may be adjusted according to platform use and development.
(2) normalizing data set A: the values of every data dimension in A are normalized into the range [0, 1]; when a given data dimension is normalized, the normalization formula is

y_{ij} = \frac{x_{ij} - x_{j\min}}{x_{j\max} - x_{j\min}}

where y_{ij} is the normalized value of the original datum x_{ij}, x_{ij} is the original datum in row i, column j of A, x_{j\min} is the minimum of the j-th dimension data in A, and x_{j\max} is the maximum of the j-th dimension data in A;
(3) obtaining the normalized data set B: after all dimensions of A have been normalized they form the data set B, an n × m matrix in which every dimension's values lie in [0, 1]; one row of B, i.e. one sample, represents the data of one customer.
Further, determining the value of K by the contour coefficient method in step (1) comprises the following specific steps:
(1-1) calculating the intra-class dissimilarity a(y) of sample y:

a(y) = \frac{1}{t_k - 1} \sum_{p \in M_y, p \neq y} \| y - p \|

where a(y), called the intra-class dissimilarity of sample y, is the average distance from y to the other samples of the cluster M_y to which y currently belongs; the smaller the value, the more likely y belongs to that cluster. t_k is the total number of samples in M_y, p ∈ M_y denotes the other samples belonging to M_y, and \| y - p \| is the Euclidean distance from sample y to sample p;
(1-2) calculating the inter-class dissimilarity b(y) of sample y:
(1-2-1) calculating b_k(y):

b_k(y) = \frac{1}{t_{k'}} \sum_{p \in M_{y'}} \| y - p \|

where b_k(y), called the dissimilarity between sample y and cluster M_{y'}, is the average distance from y to all t_{k'} samples of another cluster M_{y'}, i.e. a cluster other than the one to which y currently belongs;
(1-2-2) forming the set of inter-class dissimilarities of sample y:

B(y) = { b_1(y), b_2(y), ..., b_k(y), ..., b_K(y) }

where the length of the set B(y) should be |B(y)| = K - 1, since the cluster containing y is excluded;
(1-2-3) obtaining b(y) of sample y:

b(y) = min B(y) = min{ b_1(y), b_2(y), ..., b_k(y), ..., b_K(y) }

where b(y) is the inter-class dissimilarity of sample y; the greater b(y), the greater the probability that y does not belong to any other class;
(1-3) calculating the contour coefficient of sample y:

s(y) = \frac{b(y) - a(y)}{\max\{ a(y), b(y) \}}

where s(y), the contour coefficient of sample y, lies in [-1, 1]: when s(y) approaches 1, the clustering of y is reasonable; when s(y) approaches -1, y should rather be clustered into another class; when s(y) approaches 0, y lies on the boundary between two classes;
(1-4) calculating the overall contour coefficient of the clustering:

S(M) = \frac{1}{n} \sum_{i=1}^{n} s(y_i)

where S(M) is the overall contour coefficient of the clustering result and measures how tightly the data are clustered, n is the total number of samples, and s(y_i) is the contour coefficient of the i-th sample;
(1-5) determining the value of K:
(1-5-1) calculating the overall contour coefficient of the clustering for each candidate K to form the set SK:

SK = { SK_2, SK_3, ..., SK_i, ..., SK_10 }, 2 ≤ i ≤ 10

where SK_i is the overall contour coefficient of the clustering when K = i;
(1-5-2) determining K:

S_k = max{ SK_2, SK_3, ..., SK_i, ..., SK_10 }

the value of i for which SK_i attains this maximum S_k is the finally determined K.
Further, determining the initial cluster center M_1 in step (2) comprises the following specific steps:
(2-1) calculating the distance from each sample to every other sample; sample Y_i and the other samples form the distance matrix D_i:

D_i = [ D_{i1}, D_{i2}, ..., D_{ij}, ... ]

where D_i is the matrix of distances from sample Y_i to the other samples, |D_i| = n - 1 is its number of elements, and D_{ij} is the distance between samples Y_i and Y_j, calculated as

D_{ij} = \sqrt{ \sum_{t=1}^{m} (y_{it} - y_{jt})^2 }
(2-2) sorting the distances of each sample's distance matrix in ascending order and taking the first r elements (which requires |D_i| ≥ r); the averages of these truncated distance matrices form the matrix R. The specific value of r has to be determined; it is provisionally set to 20 and may be adjusted to the actual situation in practical applications:

R = [ R_1, R_2, ..., R_i, ... ]

where |R| ≤ n, i.e. the number of elements of R is not more than n, and R_i is the mean of the first r elements of D_i after ascending sorting:

R_i = \frac{1}{r} \sum_{j=1}^{r} D_{i(j)}

where D_{i(j)} denotes the j-th smallest element of D_i.
(2-3) taking the sample whose subscript corresponds to the minimum element of R as the initial cluster center: if R_i is the minimum element of the matrix R, the corresponding sample Y_i becomes the initial cluster center M_1.
Further, determining the K cluster centers in step (3) comprises:
(3-1) determining the initial cluster center M_1; the specific steps are given in step (2);
(3-2) calculating the shortest distance from each sample to the existing cluster centers:

d_min = [ d_min^(1), d_min^(2), ..., d_min^(n) ], with d_min^(i) = \min_k d_{ik}

where d_min is the matrix of shortest sample-to-center distances, d_min^(i) is the shortest distance from sample Y_i to any cluster center, and d_{ik} is the distance between sample Y_i and cluster center M_k;
(3-3) calculating the sum of the shortest distances of all samples to the cluster centers:

d = \sum_{i=1}^{n} d_min^(i)

(3-4) choosing a value a at random with 0 < a ≤ d, then subtracting the samples' d_min^(i) values from a in turn, a ← a - d_min^(i), until a ≤ 0; the sample Y_i at which a drops to or below 0 is the next cluster center;
(3-5) repeating the steps (3-2) - (3-4) until K cluster centers are selected.
The technical scheme provided by the embodiments of the invention has at least the following beneficial effects:
(1) a more objective classification result: customers are classified automatically by K-Means clustering according to their autonomous behavior on the platform, without admixture of human subjective factors, which ensures the objectivity of the result;
(2) released human resources: the automatic classification releases the human resources that manual classification would occupy and at the same time markedly improves working efficiency;
(3) dynamic monitoring: the automatic approach can detect changes in a customer's classification in real time, and the platform can adopt different personalized services according to the classification and its changes, providing more satisfactory services to customers;
(4) automatic division of the platform's enterprise customers, which gives the platform a basis for more reasonable customer management, provides infrastructure for hierarchical customer management, fine-grained operation and precise marketing, points a direction for more precise customer acquisition, and improves the platform's service quality;
(5) customer classification over data dimensions defined without identity attributes, so the method can adapt to a variety of customer identities;
(6) for enterprise customers, the platform can provide more targeted and convenient services, saving the customers' time and improving the customer experience.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a flowchart of a K-Means clustering-based client automatic classification method according to an embodiment of the present invention.
Fig. 2 is a first diagram of the contour coefficients disclosed in the embodiment of the present invention.
Fig. 3 is a diagram showing a first clustering situation disclosed in the embodiment of the present invention.
Fig. 4 is a second diagram of the contour coefficients disclosed in the embodiment of the present invention.
Fig. 5 is a diagram of a clustering situation disclosed in the embodiment of the present invention.
Detailed Description of Embodiments
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As shown in Fig. 1, an embodiment of the present invention provides a method for automatically classifying customers based on K-Means clustering. Aimed at internet e-commerce platforms, the method takes the customers' autonomous behavior on the platform as the basis of classification and classifies them automatically by K-Means clustering. The method comprises the following steps: initializing the platform customer data, where the data of all company customers on the platform form a data set matrix A whose columns correspond to the data dimensions and whose rows correspond to the customers, and normalizing A to obtain the normalized data set matrix B; classifying the customers automatically with K-Means clustering, determining the value of K with the contour coefficient method and then determining an initial cluster center and K cluster centers; assigning every sample in B to the nearest cluster set by the minimum-distance principle and taking the mean of all samples in each cluster as its new center; and repeating these steps until the cluster centers no longer change, which yields K clusters, the result of automatically classifying the customers. The process of the invention is described in detail below.
Firstly, initializing platform client data:
1. A data set A is defined. The data of all company customers on the platform form data set A, an n × m matrix, where m is the number of data dimensions and corresponds to the number of columns, and n is the number of customers and corresponds to the number of rows. The data dimensions include the number of purchases, the number of supplies, the expenditure amount, the income amount, the customer enterprise size and so on (the list is not exhaustive and may be adjusted according to platform use and development).
2. Data set A is normalized. Because the values of different data dimensions differ greatly in scale, the values of the data dimensions in A are normalized into the range [0, 1]. When a given data dimension is normalized, the normalization formula is

y_{ij} = \frac{x_{ij} - x_{j\min}}{x_{j\max} - x_{j\min}}

where y_{ij} is the normalized value of the original datum x_{ij}, x_{ij} is the original datum in row i, column j of A, x_{j\min} is the minimum of the j-th dimension (i.e. j-th column) data in A, and x_{j\max} is the maximum of the j-th dimension (i.e. j-th column) data in A.
3. The normalized data set B is obtained. After all dimensions of A have been normalized they form the data set B, an n × m matrix in which every dimension's values lie in [0, 1]; one row of B, i.e. one sample, represents the data of one customer.
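The initialization just described can be sketched in a few lines of Python. This is an illustrative sketch only; the helper name `normalize` and the use of NumPy are my own choices, not part of the patent text:

```python
import numpy as np

def normalize(A):
    """Min-max normalize each column (data dimension) of A into [0, 1].

    A is the n x m data set matrix described above: n customers (rows),
    m data dimensions (columns). Returns the normalized data set B.
    """
    A = np.asarray(A, dtype=float)
    col_min = A.min(axis=0)                 # x_jmin for each dimension j
    col_max = A.max(axis=0)                 # x_jmax for each dimension j
    # guard against a constant column, which would give a zero denominator
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    return (A - col_min) / span

# three customers, two dimensions; column [10, 25, 15] maps to 0, 1 and 1/3
B = normalize([[10, 1], [25, 5], [15, 3]])
```

The constant-column guard is an addition of mine; the patent text does not address that case.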
Secondly, automatically classifying the clients by adopting a K-means clustering method
1. The value of K, i.e. the number of clusters, is determined by the contour coefficient method (the specific steps are given in section three).
2. The initial cluster center is determined (the specific steps are given in section four).
3. The K cluster centers are determined (the specific steps are given in section five); they are denoted M_1, M_2, ..., M_K.
4. Every sample in data set B is assigned to the nearest cluster set by the minimum-distance principle, where the distance is the Euclidean distance:

d(Y_i, M_k) = \sqrt{ \sum_{j=1}^{m} (y_{ij} - M_{kj})^2 }

where Y_i is the i-th sample in data set B, M_k is the k-th cluster center, y_{ij} is the value of the j-th dimension of the i-th sample in B, and M_{kj} is the value of the j-th dimension of the k-th cluster center.
5. The mean of all samples in each cluster is taken as the new cluster center:

M_{kj} = \frac{1}{l} \sum_{Y_i \in C_k} y_{ij}, \quad j = 1, ..., m

where M_k is the new k-th cluster center, C_k is the current k-th cluster, l is the total number of samples in C_k, and \sum_{Y_i \in C_k} y_{ij} is the sum of the j-th dimension values of the samples in C_k.
6. Steps 4-5 are repeated until the cluster centers no longer change.
7. This finishes with K clusters, which are the result of automatically classifying the customers.
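Steps 4 to 7 above form a plain K-Means loop, which can be sketched as follows. This is a minimal illustration assuming NumPy, with the centers supplied from the earlier steps; the function name `k_means` is mine:

```python
import numpy as np

def k_means(B, centers, max_iter=100):
    """Assign each sample to the nearest center (Euclidean distance),
    recompute centers as cluster means, and repeat until the centers
    stop changing, as in steps 4-7 above."""
    B = np.asarray(B, dtype=float)
    M = np.asarray(centers, dtype=float)
    for _ in range(max_iter):
        # distance of every sample to every center, shape (n, K)
        d = np.linalg.norm(B[:, None, :] - M[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # empty clusters keep their old center
        new_M = np.array([B[labels == k].mean(axis=0)
                          if np.any(labels == k) else M[k]
                          for k in range(len(M))])
        if np.allclose(new_M, M):
            break
        M = new_M
    return labels, M

# five 2-D samples with two hand-picked initial centers
labels, M = k_means([[0, 2], [0, 0], [1.5, 0], [5, 0], [5, 2]],
                    centers=[[0, 2], [0, 0]])
```

The convergence test `np.allclose` stands in for "the cluster centers no longer change"; an exact equality test would also work here since the means are computed deterministically.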
Thirdly, determining the K value by a contour coefficient method
1. Calculate the intra-class dissimilarity a(y) of sample y:

a(y) = \frac{1}{t_k - 1} \sum_{p \in M_y, p \neq y} \| y - p \|

a(y), called the intra-class dissimilarity of sample y, is the average distance from y to the other samples of the cluster M_y to which y currently belongs; the smaller the value, the more likely y belongs to that cluster. t_k is the total number of samples in M_y, p ∈ M_y denotes the other samples belonging to M_y, and \| y - p \| is the Euclidean distance from sample y to sample p.
2. Calculate the inter-class dissimilarity b(y) of sample y:
(1) calculate b_k(y):

b_k(y) = \frac{1}{t_{k'}} \sum_{p \in M_{y'}} \| y - p \|

where b_k(y), called the dissimilarity between sample y and cluster M_{y'}, is the average distance from y to all t_{k'} samples of another cluster M_{y'}, i.e. a cluster other than the one to which y currently belongs.
(2) Form the set of inter-class dissimilarities of sample y:

B(y) = { b_1(y), b_2(y), ..., b_k(y), ..., b_K(y) }

The length of the set B(y) should be |B(y)| = K - 1, since the cluster containing y is excluded.
(3) Obtain b(y) of sample y:

b(y) = min B(y) = min{ b_1(y), b_2(y), ..., b_k(y), ..., b_K(y) }

where b(y) is the inter-class dissimilarity of sample y; the larger b(y), the higher the probability that y does not belong to any other class.
3. Calculate the contour coefficient of sample y:

s(y) = \frac{b(y) - a(y)}{\max\{ a(y), b(y) \}}

where s(y) is the contour coefficient of sample y, with values in [-1, 1]. When s(y) approaches 1, the clustering of y is reasonable; when s(y) approaches -1, y should rather be clustered into another class; when s(y) approaches 0, y lies on the boundary between two classes.
4. Calculate the overall contour coefficient of the clustering:

S(M) = \frac{1}{n} \sum_{i=1}^{n} s(y_i)

where S(M) is the overall contour coefficient of the clustering result and measures how tightly the data are clustered; n is the total number of samples, and s(y_i) is the contour coefficient of the i-th sample.
5. Determine the value of K.
(1) Calculate the overall contour coefficient of the clustering for each candidate K to form the set SK:

SK = { SK_2, SK_3, ..., SK_i, ..., SK_10 }, 2 ≤ i ≤ 10

where SK_i is the overall contour coefficient of the clustering when K = i.
(2) Determine the value of K:

S_k = max{ SK_2, SK_3, ..., SK_i, ..., SK_10 }

The value of i for which SK_i attains this maximum S_k is the finally determined K.
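The overall contour coefficient defined above can be computed directly. The sketch below (my own `silhouette` helper, assuming NumPy, and assuming at least two clusters of size two or more) scores one clustering; evaluating K = 2..10 and keeping the K with the largest score then reproduces step 5:

```python
import numpy as np

def silhouette(B, labels):
    """Overall contour (silhouette) coefficient S(M) of a clustering:
    the mean of s(y) over all samples, per steps 1-4 above."""
    B = np.asarray(B, dtype=float)
    labels = np.asarray(labels)
    # pairwise Euclidean distance matrix, shape (n, n)
    D = np.linalg.norm(B[:, None, :] - B[None, :, :], axis=2)
    s = []
    for i in range(len(B)):
        same = (labels == labels[i])
        if same.sum() <= 1:
            s.append(0.0)        # common convention for singleton clusters
            continue
        a = D[i, same].sum() / (same.sum() - 1)          # intra-class a(y)
        b = min(D[i, labels == k].mean()                 # inter-class b(y)
                for k in set(labels) if k != labels[i])
        s.append((b - a) / max(a, b))
    return float(np.mean(s))
```

The zero score for singleton clusters is my own convention (matching common practice), since a(y) is undefined when a cluster contains only one sample; the patent text does not address that case.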
Four: determination of the initial cluster center M_1
1. Calculate the distance from each sample to every other sample; sample Y_i and the other samples form the distance matrix D_i:

D_i = [ D_{i1}, D_{i2}, ..., D_{ij}, ... ]

where D_i is the matrix of distances from sample Y_i to the other samples, |D_i| = n - 1 is its number of elements, and D_{ij} is the distance between samples Y_i and Y_j, calculated as

D_{ij} = \sqrt{ \sum_{t=1}^{m} (y_{it} - y_{jt})^2 }
2. Sort the distances of each sample's distance matrix in ascending order and take the first r elements (which requires |D_i| ≥ r); the averages of these truncated distance matrices form the matrix R (the specific value of r has to be determined; it is provisionally set to 20 and may be adjusted to the actual situation in practical applications).

R = [ R_1, R_2, ..., R_i, ... ]

where |R| ≤ n, i.e. the number of elements of R is not more than n, and R_i is the mean of the first r elements of D_i after ascending sorting:

R_i = \frac{1}{r} \sum_{j=1}^{r} D_{i(j)}

where D_{i(j)} denotes the j-th smallest element of D_i.
3. Take the sample whose subscript corresponds to the minimum element of R as the initial cluster center: if R_i is the minimum element of the matrix R, the corresponding sample Y_i becomes the initial cluster center M_1.
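A sketch of this density-style choice of M_1 (the function name `initial_center` is mine, and NumPy is assumed). For fewer than r + 1 samples it simply uses all available distances, a fallback of my own not discussed in the text:

```python
import numpy as np

def initial_center(B, r=20):
    """Pick the index of the initial cluster center M1: for each sample,
    average its r smallest distances to the other samples; the sample
    with the smallest average (i.e. in the densest region) wins.
    r = 20 follows the provisional value given in the text."""
    B = np.asarray(B, dtype=float)
    n = len(B)
    r = min(r, n - 1)
    D = np.linalg.norm(B[:, None, :] - B[None, :, :], axis=2)
    # for each row, drop the zero self-distance, sort, average the first r
    R = np.array([np.sort(np.delete(D[i], i))[:r].mean() for i in range(n)])
    return int(R.argmin())
```

On the five-point data of example two below, the densest sample is O_3, so this helper returns index 2.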
Five: determination of the K cluster centers
1. Determine the initial cluster center M_1 (see section four for the specific steps).
2. Calculate the shortest distance from each sample to the existing cluster centers:

d_min = [ d_min^(1), d_min^(2), ..., d_min^(n) ], with d_min^(i) = \min_k d_{ik}

where d_min is the matrix of shortest sample-to-center distances, d_min^(i) is the shortest distance from sample Y_i to any cluster center, and d_{ik} is the distance between sample Y_i and cluster center M_k.
3. Calculate the sum of the shortest distances of all samples to the cluster centers:

d = \sum_{i=1}^{n} d_min^(i)
4. Choose a value a at random with 0 < a ≤ d, then subtract the samples' d_min^(i) values from a in turn,

a ← a - d_min^(i)

until a ≤ 0; the sample Y_i at which a drops to or below 0 is the next cluster center.
5. And repeating the steps 2-4 until K clustering centers are selected.
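Steps 1 to 5 amount to a roulette-wheel selection over the shortest distances, much like K-means++ (which weights by squared distances instead). A sketch under my own naming, assuming NumPy, with the index of M_1 passed in from the previous section:

```python
import numpy as np

def choose_centers(B, K, m1=0, rng=None):
    """Select K center indices: start from the density-based M1 (index m1)
    and repeatedly draw the next center with probability proportional to
    each sample's shortest distance to the centers chosen so far."""
    rng = rng or np.random.default_rng(0)
    B = np.asarray(B, dtype=float)
    centers = [m1]
    while len(centers) < K:
        D = np.linalg.norm(B[:, None, :] - B[centers][None, :, :], axis=2)
        d_min = D.min(axis=1)          # shortest distance to any center
        a = rng.uniform(0, d_min.sum())
        for i, dm in enumerate(d_min):
            if dm == 0:
                continue               # already a center, never re-picked
            a -= dm
            if a <= 0:
                centers.append(i)
                break
    return centers
```

Skipping zero-distance samples guarantees the K chosen centers are distinct, which the roulette procedure in the text implies but does not state.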
Example one: normalization calculation example
Suppose the database contains the following group of customer records (the amounts involved in the transactions are in units of ten thousand). For convenience of illustration the entries of A are shown as integers (in actual application four decimal places are kept):

[The original shows the raw records and the resulting data set matrix A as images; the first dimension of A is X_1 = [10 25 15].]
the data set A is a 3 x 5 matrix and shows that 2 enterprise client sample data are provided, each sample comprises numerical information of 5 dimensions, dimension 1 shows the purchasing times, dimension 2 shows the supply times, dimension 3 shows the expenditure amount, dimension 4 shows the income amount, and dimension 5 shows the enterprise scale of the client enterprise.
An example of the normalization calculation for dimension 1:
For the first dimension, X_1 = [10 25 15], so x_{1min} = 10 and x_{1max} = 25:

y_{11} = (10 - 10) / (25 - 10) = 0
y_{21} = (25 - 10) / (25 - 10) = 1
y_{31} = (15 - 10) / (25 - 10) ≈ 0.3333

After normalization the first dimension is therefore [0 1 0.3333].
The normalization process is the same for the remaining dimensions.
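The dimension-1 numbers above can be checked with a couple of lines of plain Python (no library or patent API implied):

```python
# Check the worked normalization for X1 = [10, 25, 15]: min 10, max 25.
x = [10, 25, 15]
lo, hi = min(x), max(x)
y = [(v - lo) / (hi - lo) for v in x]
print([round(v, 4) for v in y])  # [0.0, 1.0, 0.3333]
```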
Example two, K-Means clustering algorithm calculation example
Assume the following data:
O_1(0, 2), O_2(0, 0), O_3(1.5, 0), O_4(5, 0), O_5(5, 2)
1. Select O_1(0, 2) and O_2(0, 0) as the initial cluster centers, i.e. M_1 = O_1 = (0, 2) and M_2 = O_2 = (0, 0).
2. For each object remaining, it is assigned to the closest class according to its distance from the respective cluster center.
For O_3:
d(M_1, O_3) = sqrt((0 - 1.5)^2 + (2 - 0)^2) = 2.5
d(M_2, O_3) = sqrt((0 - 1.5)^2 + (0 - 0)^2) = 1.5
Since d(M_2, O_3) ≤ d(M_1, O_3), O_3 is assigned to C_2.
For O_4:
d(M_1, O_4) = sqrt((0 - 5)^2 + (2 - 0)^2) = sqrt(29) ≈ 5.39
d(M_2, O_4) = sqrt((0 - 5)^2 + (0 - 0)^2) = 5
Since d(M_2, O_4) ≤ d(M_1, O_4), O_4 is assigned to C_2.
For O_5:
d(M_1, O_5) = sqrt((0 - 5)^2 + (2 - 2)^2) = 5
d(M_2, O_5) = sqrt((0 - 5)^2 + (0 - 2)^2) = sqrt(29) ≈ 5.39
Since d(M_1, O_5) ≤ d(M_2, O_5), O_5 is assigned to C_1.
Updating yields the new classes C_1 = {O_1, O_5} and C_2 = {O_2, O_3, O_4}, with centers M_1 = (0, 2) and M_2 = (0, 0). The squared-error criterion is computed with the individual errors:
E_1 = [(0-0)^2 + (2-2)^2] + [(0-5)^2 + (2-2)^2] = 25
E_2 = [(0-0)^2 + (0-0)^2] + [(0-1.5)^2 + (0-0)^2] + [(0-5)^2 + (0-0)^2] = 27.25
The overall squared error is E = E_1 + E_2 = 25 + 27.25 = 52.25.
3. Compute the new cluster centers:
M1 = ((0 + 5)/2, (2 + 2)/2) = (2.5, 2)
M2 = ((0 + 1.5 + 5)/3, (0 + 0 + 0)/3) = (2.17, 0)
Repeating steps 2 and 3: O1 is assigned to C1, O2 to C2, O3 to C2, O4 to C2, and O5 to C1. Updating again gives C1 = {O1, O5} and C2 = {O2, O3, O4}, with centers M1 = (2.5, 2) and M2 = (2.17, 0).
The individual variances are:
E1 = [(2.5 − 0)² + (2 − 2)²] + [(2.5 − 5)² + (2 − 2)²] = 12.5
E2 = [(2.17 − 0)² + (0 − 0)²] + [(2.17 − 1.5)² + (0 − 0)²] + [(2.17 − 5)² + (0 − 0)²] ≈ 13.17
The overall average error is E = E1 + E2 = 12.5 + 13.17 = 25.67.
After the first iteration the overall average error thus drops from 52.25 to 25.67, a significant reduction. Since the cluster memberships do not change between the two iterations, the centers no longer move, and the iteration stops.
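The two iterations worked through above can be reproduced with a short sketch (an illustration under our own naming, not the patent's implementation; it assumes no cluster ever becomes empty):

```python
import math

def kmeans(points, centers, iters=100):
    """Plain K-Means: assign each point to its nearest center, then
    recompute each center as the mean of its cluster; stop when the
    centers no longer move. Assumes no cluster becomes empty."""
    clusters = [[] for _ in centers]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            k = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[k].append(p)
        new = [tuple(sum(c) / len(grp) for c in zip(*grp)) for grp in clusters]
        if new == centers:
            break
        centers = new
    return centers, clusters

pts = [(0, 2), (0, 0), (1.5, 0), (5, 0), (5, 2)]
centers, clusters = kmeans(pts, [(0, 2), (0, 0)])
print(centers)  # final centers: (2.5, 2.0) and approximately (2.17, 0.0)
```

Starting from M1 = O1 and M2 = O2 it converges after two assignment passes, matching the example.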
Example three, determining the K value by the contour (silhouette) coefficient method
Example 1: according to the results after clustering in example two: k is 2, clustering result C1={O1,O5And C2={O2,O3,O4In which O is1(0,2),O2(0,0),O3(1.5,0),O4(5,0),O5(5,2)。
1. Compute the intra-class dissimilarity of a sample (taking O1 and O2 as examples):
a(O1) = d(O1, O5) = 5
a(O2) = (d(O2, O3) + d(O2, O4)) / 2 = (1.5 + 5) / 2 = 3.25
2. Compute the inter-class dissimilarity of a sample (taking O1 and O2 as examples):
Since K = 2 in this example, b(y) is simply the average distance from the sample to the single other cluster.
b(O1) = (d(O1, O2) + d(O1, O3) + d(O1, O4)) / 3 = (2 + 2.5 + 5.39) / 3 ≈ 3.30
b(O2) = (d(O2, O1) + d(O2, O5)) / 2 = (2 + 5.39) / 2 ≈ 3.69
3. Compute the contour coefficients of the samples (taking O1 and O2 as examples):
s(O1) = (b(O1) − a(O1)) / max{a(O1), b(O1)} = (3.30 − 5) / 5 ≈ −0.34
s(O2) = (b(O2) − a(O2)) / max{a(O2), b(O2)} = (3.69 − 3.25) / 3.69 ≈ 0.12
4. Compute the overall contour coefficient of the clustering when K = 2:
S = (s(O1) + s(O2) + s(O3) + s(O4) + s(O5)) / 5
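The dissimilarities and contour coefficients of O1 and O2 can be checked with a small sketch (our own illustration; the function name is not from the patent):

```python
import math

def contour_coefficient(point, own, others):
    """s(y) = (b - a) / max(a, b): a is the mean distance to the other
    samples of the point's own cluster, b the smallest mean distance
    to any other cluster."""
    a = sum(math.dist(point, p) for p in own if p != point) / (len(own) - 1)
    b = min(sum(math.dist(point, q) for q in c) / len(c) for c in others)
    return (b - a) / max(a, b)

C1 = [(0, 2), (5, 2)]
C2 = [(0, 0), (1.5, 0), (5, 0)]
s1 = contour_coefficient((0, 2), C1, [C2])
s2 = contour_coefficient((0, 0), C2, [C1])
print(round(s1, 3), round(s2, 3))
```

Here s(O1) comes out negative, i.e. O1 lies closer to the other cluster than to its own, while s(O2) is mildly positive.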
Contour coefficient diagrams:
Example 1:
O1(0, 2), O2(0, 0), O3(1.5, 0), O4(5, 0), O5(5, 2); the contour coefficient diagram of this example is shown in Fig. 2 (the data in this example are not normalized; only the contour coefficients are shown):
The horizontal axis represents the number of clusters K, and the vertical axis represents the overall contour coefficient. It can be seen that the overall contour coefficient is largest when K = 2, so K = 2 is taken.
The clustering at this time is shown in fig. 3.
Example 2:
O1(0, 2), O2(0, 0), O3(1.5, 0), O4(5, 0), O5(5, 2), O6(2, 3), O7(2, 2.5), O8(2, 3.5); the contour coefficient diagram of this example is shown in Fig. 4 (the data of this example are not normalized; only the contour coefficients are shown):
The horizontal axis represents the number of clusters K, and the vertical axis represents the overall contour coefficient. It can be seen that the overall contour coefficient is largest when K = 3, so K = 3 is taken.
The clustering at this time is shown in fig. 5.
It should be understood that the specific order or hierarchy of steps in the processes disclosed is an example of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not intended to be limited to the specific order or hierarchy presented.
In the foregoing detailed description, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the subject matter require more features than are expressly recited in each claim. Rather, as the following claims reflect, invention lies in less than all features of a single disclosed embodiment. Thus, the following claims are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate preferred embodiment of the invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. Of course, the processor and the storage medium may reside as discrete components in a user terminal.
For a software implementation, the techniques described herein may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. The software codes may be stored in memory units and executed by processors. The memory unit may be implemented within the processor or external to the processor, in which case it can be communicatively coupled to the processor via various means as is known in the art.
What has been described above includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, to the extent that the term "includes" is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term "comprising" as "comprising" is interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification of the claims is intended to mean a "non-exclusive or".

Claims (5)

1. An automatic customer classification method based on K-Means clustering, characterized by comprising the following steps:
initializing platform client data: the data of all company clients on the platform form a data set matrix A, in which the data dimensions correspond to the columns of the matrix and the clients correspond to the rows; normalizing the data set matrix A to obtain a normalized data set matrix B;
automatically classifying the clients by a K-Means clustering method, comprising the following steps:
(1) determining the K value, i.e. the number of clusters, by a contour coefficient method;
(2) determining an initial clustering center M1;
(3) determining K clustering centers, respectively denoted M1, M2, …, MK;
(4) Distributing all samples in the data set matrix B to the nearest cluster set according to the principle of minimum distance, wherein the distance is calculated by adopting Euclidean distance:
d(Yi, Mk) = √( Σj=1..m (yij − Mkj)² )
wherein Yi represents the ith sample in the data set B, Mk denotes the kth cluster center, yij represents the value of the jth dimension of the ith sample in the data set B, and Mkj represents the value of the jth dimension of the kth cluster center;
(5) taking the mean of all samples in each cluster as the new cluster center:
Mk = (1/l) · (Σ yi1, Σ yi2, …, Σ yim), the sums being taken over all samples in the current kth cluster
wherein Mk represents the new kth cluster center, l represents the total number of samples in the current kth cluster, and Σ yim represents the sum of the values of the mth dimension over the samples in the current kth cluster;
(6) repeating the steps (4) to (5) until the clustering center is not changed any more;
(7) and finishing to obtain K clusters, namely obtaining the result of automatically classifying the client.
2. The method for customer automatic classification based on K-Means clustering according to claim 1, wherein the initialization of platform customer data to obtain a normalized data set matrix B comprises the following specific steps:
(1) defining a data set A: the data of all company clients on the platform form the data set A; A is an n × m matrix, wherein m represents the number of data dimensions and corresponds to the number of columns of the matrix, and n represents the number of clients and corresponds to the number of rows; the data cover all company clients of the platform; the data dimensions include, but are not limited to, the number of purchases, the number of supplies, the expenditure amount, the income amount and the scale of the client enterprise, and can be adjusted appropriately according to platform use and development conditions;
(2) Normalizing the data set A, normalizing the numerical values of all data dimensions in the data set A to the range of [0,1], wherein when a certain data dimension is normalized, the normalization formula is as follows:
yij = (xij − xjmin) / (xjmax − xjmin)
wherein yij represents the value of the original datum xij after normalization; xij represents the original datum in row i, column j of A; xjmin represents the minimum value of the jth-dimension data in A; and xjmax represents the maximum value of the jth-dimension data in A;
(3) obtaining the normalized data set B: after all dimensions of the data set A are normalized, the data set B is formed; B is an n × m matrix, the values of all dimension data in B lie in the range [0, 1], and one piece of data in B, i.e. one sample, represents the data of one client.
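Claim 2's three steps amount to column-wise min-max scaling of the whole matrix; a compact sketch (illustrative only, with made-up data and our own function name):

```python
def normalize_matrix(A):
    """Column-wise min-max normalization: every dimension of the n x m
    matrix A is scaled into [0, 1], giving the data set matrix B."""
    cols = list(zip(*A))               # one tuple per dimension
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(x - lo[j]) / (hi[j] - lo[j]) for j, x in enumerate(row)]
            for row in A]

A = [[10, 4], [25, 2], [15, 6]]        # 3 clients, 2 dimensions (toy data)
B = normalize_matrix(A)
print(B)
```

Every entry of B lies in [0, 1], and each row of B is one normalized client sample.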
3. The method for customer automatic classification based on K-Means clustering as claimed in claim 1, wherein the step (1) of determining the K value by using a contour coefficient method comprises the following specific steps:
(1-1) calculating the intra-class dissimilarity a(y) of sample y:
a(y) = (1 / (tk − 1)) · Σ(p∈My, p≠y) ‖y − p‖
wherein a(y) represents the average distance from sample y to the other samples in the same cluster; My represents the cluster to which the current sample y belongs; the smaller the value, the more likely sample y belongs to this class, and a(y) is called the intra-class dissimilarity of sample y; tk represents the total number of samples in the cluster My; p ∈ My denotes a sample belonging to the cluster My; ‖y − p‖ represents the Euclidean distance from sample y to sample p;
(1-2) calculating the degree of inter-class dissimilarity b (y) of the sample y:
(1-2-1) calculation of bk(y):
bk(y) = (1 / |My′|) · Σ(p∈My′) ‖y − p‖
wherein bk(y) represents the average distance from sample y to all samples in another cluster; My′ denotes a cluster other than the one to which the current sample y belongs; bk(y) is called the dissimilarity between sample y and the class My′;
(1-2-2) set of inter-class dissimilarity of sample y:
B(y) = {b1(y), b2(y), …, bk(y), …, bK(y)}
wherein the length of the set B(y) is |B(y)| = K − 1, since the sample's own cluster is excluded;
(1-2-3) obtaining b (y) of sample y:
b(y) = min B(y) = min{b1(y), b2(y), …, bk(y), …, bK(y)}
wherein b(y) represents the inter-class dissimilarity of sample y; the larger b(y) is, the less likely sample y belongs to any other class;
(1-3) calculating the contour coefficient of the sample y:
s(y) = (b(y) − a(y)) / max{a(y), b(y)}
wherein s(y) represents the contour coefficient of sample y, with value range [−1, 1]; when s(y) approaches 1, the clustering of sample y is reasonable; when s(y) approaches −1, sample y should rather be assigned to another cluster; when s(y) approaches 0, sample y lies on the boundary between two clusters;
(1-4) calculating the overall contour coefficient of the cluster:
S(M) = (1/n) · Σi=1..n s(yi)
wherein S(M) represents the overall contour coefficient of the clustering result, which measures how tightly the data are clustered; n represents the total number of samples; and s(yi) represents the contour coefficient of the ith sample;
(1-5) determining the K value:
(1-5-1) calculating the overall contour coefficients of the clusterings obtained with different K values to form a matrix SK:
SK = {SK2, SK3, …, SKi, …, SK10}, 2 ≤ i ≤ 10
wherein SKi represents the overall contour coefficient of the clustering when K = i;
(1-5-2) determination of K value:
Sk = max{SK2, SK3, …, SKi, …, SK10}
wherein Sk is the largest overall contour coefficient, and the i of the SKi attaining this maximum is the finally determined value of K.
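Step (1-5) reduces to taking the K whose overall contour coefficient is largest; a minimal sketch (the scores below are made-up placeholders, not values from the patent):

```python
def choose_k(sk):
    """Return the K whose overall contour coefficient SK_K is largest."""
    return max(sk, key=sk.get)

# hypothetical overall contour coefficients for K = 2..5
scores = {2: 0.41, 3: 0.55, 4: 0.38, 5: 0.30}
print(choose_k(scores))  # -> 3
```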
4. The method for automatic customer classification based on K-Means clustering as claimed in claim 3, wherein the determination of the initial clustering center M1 in the step (2) comprises the following specific steps:
(2-1) calculating the distance between each sample and the other samples; the distances from sample Yi to the other samples form a matrix Di:
Di = [Di1, Di2, …, Dij, …]
wherein Di represents the matrix formed by the distances from sample Yi to the other samples; |Di| represents the number of elements of the matrix Di, which is n − 1; Dij represents the distance between sample Yi and sample Yj, calculated as:
Dij = √( Σd=1..m (yid − yjd)² )
(2-2) sorting the distances in each sample's distance matrix from small to large and taking the first r elements (which requires |Di| ≥ r); the average of the first r distances of each sample forms an element of a matrix R; the specific value of r remains to be determined and is provisionally set to 20, to be adjusted appropriately according to the actual situation in practical application;
R = [R1, R2, …, Ri, …]
wherein |R| ≤ n, i.e. the number of elements of the matrix R is not greater than n; Ri represents the average of the first r elements of the distance matrix Di after sorting from small to large, calculated as:
Ri = (1/r) · Σj=1..r Di(j), where Di(j) denotes the jth smallest element of Di
(2-3) taking the sample corresponding to the minimum element of the matrix R as the initial cluster center: if Ri is the minimum element of R, the corresponding sample Yi is taken as the initial cluster center M1 of the clustering.
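Steps (2-1)-(2-3) pick as the first center the sample whose r nearest neighbours are, on average, closest; a sketch (illustrative, with our own names and toy data):

```python
import math

def initial_center(points, r):
    """Return the sample whose mean distance to its r nearest
    neighbours is smallest (the densest sample), as in steps (2-1)-(2-3)."""
    def r_mean(p):
        dists = sorted(math.dist(p, q) for q in points if q is not p)
        return sum(dists[:r]) / r
    return min(points, key=r_mean)

pts = [(0, 0), (0.6, 0), (0.4, 0.3), (5, 5)]
print(initial_center(pts, 2))  # the sample in the densest region
```

Choosing the densest sample makes the first center robust to outliers such as (5, 5) above.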
5. The method for customer automatic classification based on K-Means clustering as claimed in claim 4, wherein the K clustering centers are determined in the step (3), and the specific steps comprise:
(3-1) determination of initial clustering center M1The concrete steps are shown in step (2);
(3-2) calculating the shortest distance between each sample and the existing cluster centers:
dmin = [dmin(1), dmin(2), …, dmin(n)]
dmin(i) = min{di1, di2, …, dik, …}
wherein dmin represents the matrix of the shortest distances between the samples and the cluster centers, dmin(i) represents the shortest distance from sample Yi to the existing cluster centers, and dik represents the distance between sample Yi and the cluster center Mk;
(3-3) calculating the sum of the shortest distances of all samples to the cluster centers:
d = Σi=1..n dmin(i)
(3-4) randomly selecting a value a with 0 < a ≤ d, then repeatedly subtracting from a the samples' shortest distances:
a = a − dmin(i), for i = 1, 2, …
until a ≤ 0; the sample Yi at which a first drops to ≤ 0 is taken as the next cluster center;
(3-5) repeating the steps (3-2) - (3-4) until K cluster centers are selected.
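Steps (3-2)-(3-4) are a roulette-wheel selection weighted by each sample's shortest distance to the existing centers (essentially the K-Means++ seeding rule); a sketch with our own names:

```python
import math
import random

def next_center(points, centers, rng=random):
    """Draw a in (0, d], where d is the sum of the samples' shortest
    distances to the existing centers, then subtract those distances
    one by one; the sample where a drops to <= 0 is the next center."""
    d_min = [min(math.dist(p, c) for c in centers) for p in points]
    a = rng.uniform(0, sum(d_min))
    for p, dm in zip(points, d_min):
        a -= dm
        if a <= 0:
            return p
    return points[-1]  # guard against floating-point leftovers

pts = [(0, 2), (0, 0), (1.5, 0), (5, 0), (5, 2)]
random.seed(0)          # fixed seed so the draw is reproducible
print(next_center(pts, [(0, 0)]))
```

Samples far from every existing center get a proportionally larger slice of (0, d] and are therefore more likely to be chosen, spreading the K centers apart.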
CN202110293773.3A 2021-03-19 2021-03-19 Automatic customer classification method based on K-Means clustering Pending CN112905863A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110293773.3A CN112905863A (en) 2021-03-19 2021-03-19 Automatic customer classification method based on K-Means clustering


Publications (1)

Publication Number Publication Date
CN112905863A true CN112905863A (en) 2021-06-04

Family

ID=76105486

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110293773.3A Pending CN112905863A (en) 2021-03-19 2021-03-19 Automatic customer classification method based on K-Means clustering

Country Status (1)

Country Link
CN (1) CN112905863A (en)


Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106228399A (en) * 2016-07-20 2016-12-14 福建工程学院 A kind of stock trader's customer risk preference categories method based on big data
CN107145895A (en) * 2017-03-13 2017-09-08 东方网力科技股份有限公司 Public security crime class case analysis method based on k means algorithms
CN108230029A (en) * 2017-12-29 2018-06-29 西南大学 Client trading behavior analysis method
CN108629375A (en) * 2018-05-08 2018-10-09 广东工业大学 Power customer sorting technique, system, terminal and computer readable storage medium
CN108681973A (en) * 2018-05-14 2018-10-19 广州供电局有限公司 Sorting technique, device, computer equipment and the storage medium of power consumer
CN108734217A (en) * 2018-05-22 2018-11-02 齐鲁工业大学 A kind of customer segmentation method and device based on clustering
CN109446185A (en) * 2018-08-29 2019-03-08 广西大学 Collaborative filtering missing data processing method based on user's cluster
CN109657712A (en) * 2018-12-11 2019-04-19 浙江工业大学 A kind of electric business food and drink data analysing method based on the improved K-Means algorithm of Spark
CN110085322A (en) * 2019-04-18 2019-08-02 岭南师范学院 A kind of improved method of k-means cluster diabetes Early-warning Model
CN112381248A (en) * 2020-11-27 2021-02-19 广东电网有限责任公司肇庆供电局 Power distribution network fault diagnosis method based on deep feature clustering and LSTM


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Yang Yifan, He Guoxian, Li Yongding: "K-means algorithm with optimized selection of initial cluster centers", Computer Knowledge and Technology *
Yang Junchuang, Zhao Chao: "A survey of K-Means clustering algorithms", Computer Engineering and Applications *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113643116A (en) * 2021-08-23 2021-11-12 中远海运科技(北京)有限公司 Method for classifying companies based on financial voucher data, computer readable medium
CN113643116B (en) * 2021-08-23 2023-10-27 中远海运科技(北京)有限公司 Company classification method based on financial evidence data and computer readable medium
CN115114770A (en) * 2022-06-06 2022-09-27 哈尔滨工业大学 Baseline self-adaptive auxiliary power device performance trend analysis method
CN115114770B (en) * 2022-06-06 2024-04-16 哈尔滨工业大学 Baseline self-adaptive auxiliary power device performance trend analysis method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20210604)