CN117056761A - Customer subdivision method based on X-DBSCAN algorithm - Google Patents

Info

Publication number: CN117056761A
Authority: CN (China)
Legal status: Pending
Application number: CN202310985703.3A
Other languages: Chinese (zh)
Inventors: 殷丽凤, 刘震, 胡洪涛, 曲英伟, 孙晶华
Current Assignee: Dalian Jiaotong University
Original Assignee: Dalian Jiaotong University
Application filed by Dalian Jiaotong University
Priority to CN202310985703.3A
Publication of CN117056761A

Classifications

    • G06F18/2321 — Physics; Computing; Electric digital data processing; Pattern recognition; Clustering techniques; Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06Q30/0201 — ICT adapted for administrative, commercial or managerial purposes; Commerce; Marketing; Market modelling; Market analysis; Collecting market data
    • Y02D10/00 — Climate change mitigation technologies in ICT; Energy efficient computing


Abstract

The invention discloses a client subdivision method based on an X-DBSCAN algorithm, relating to the technical field of data mining and comprising the following steps: acquiring a client data set; generating a distance matrix from the client data set; obtaining a plurality of K-dist curves corresponding to different K values; obtaining a plurality of Eps parameters and a plurality of MinPts parameters from the K-dist curves; sequentially inputting the corresponding Eps and MinPts parameter pairs into the DBSCAN algorithm to obtain clustering results; judging the interval over which the cluster structure is stable from the cluster number of the clustering results, and searching for the optimal K value within that stable interval; generating the optimal Eps parameter and the optimal MinPts parameter from the optimal K value; and inputting the optimal Eps and MinPts parameters into the DBSCAN algorithm and subdividing the client data set according to the optimal clustering result to obtain a plurality of client groups. The algorithm improves the accuracy of the clustering result.

Description

Customer subdivision method based on X-DBSCAN algorithm
Technical Field
The invention relates to the technical field of data mining, in particular to a client subdivision method based on an X-DBSCAN algorithm.
Background
Customer subdivision is the classification of customers by the enterprise, under an explicit strategy, business model, and specific market, according to factors such as customer attributes, preferences, behavior, needs, and potential value. By gathering customers with the same properties together (the target customer group) through customer subdivision technology, the enterprise can provide customer portraits for the marketing team, understand the requirements and characteristics of different customer groups more clearly, offer customized solutions and personalized services to different customers, and formulate marketing strategies accordingly.
Because the individual variability of clients is large and a small portion of the data are outliers, the classification task is very difficult. Selecting an appropriate classification method according to the characteristics of the data, so as to improve classification accuracy, is therefore the key point of this task. In the field of machine learning, the density-based OPTICS and DBSCAN algorithms can realize client subdivision.
The traditional DBSCAN algorithm selects the Eps parameter from the K-dist curve sorted in descending order: the change-threshold point of the curve is chosen by manual interaction as the Eps parameter, the K value is generally fixed at 4, and the 4-dist curve is generally analyzed; locking the K value introduces a certain error in the generated parameters on some data sets. On a data set of smaller size, the 4-dist curve can basically reflect the characteristics of the data set and yield an appropriate Eps parameter. When processing a larger data set, however, the parameter K of the K-th nearest-neighbor distance curve becomes important. If K is chosen too small, the selected Eps value is small; in classes of relatively low density, the number of points in an object's Eps neighborhood then falls below MinPts, so those points are mistaken for boundary points and are not used to further expand the class, and the low-density class is finally split into several classes of similar properties. If K is chosen too large, the larger neighborhood radius causes different clusters to merge, which, contrary to the effect above, also harms the accuracy of the cluster number of the clustering result.
MinPts is traditionally selected as MinPts ≥ dim + 1, where dim is the dimension of the data to be clustered. Because the DBSCAN algorithm mainly targets two-dimensional data, MinPts is generally fixed at 4, which reduces the fit between the parameter and the data set: the parameter cannot change in step with the characteristics of the data set, so both the generated cluster number and the noise count carry errors.
The OPTICS algorithm does not explicitly generate clusters; rather, it computes an augmented cluster ordering for automatic, interactive cluster analysis. OPTICS was proposed to help the DBSCAN algorithm select proper parameters and to reduce sensitivity to the input parameters. At present, many scholars at home and abroad have studied and improved the DBSCAN algorithm. On adaptively selecting parameters using K-nearest-neighbor distances, Liu Peng et al. proposed the VDBSCAN algorithm, which selects parameter values at different densities from a K-dist graph and uses them to cluster classes of different densities, finally finding clusters of varying density. Li Wenjie et al. proposed the adaptive-parameter KANN-DBSCAN algorithm, which, based on a parameter-optimization strategy, adaptively determines the Eps and MinPts parameters by automatically searching for a stable interval of the cluster number of the clustering results, achieving a highly accurate clustering process, though with high time complexity. Wanji et al. proposed the MAD-DBSCAN algorithm, which uses the self-distribution characteristics of the data set after noise attenuation to generate candidate Eps and MinPts parameter lists, selects, within an interval of stable cluster number, the Eps and MinPts corresponding to the denoising level as initial density thresholds, obtains clustering results and noise data under those thresholds, repeats the same operation on the noise data until its quantity or the density thresholds no longer meet the conditions, and finally merges the clustering results under all density thresholds. The algorithm clusters well on data sets with uneven density distribution, but its time complexity is high.
Zhou Zhi et al. likewise proposed the adaptive parameter selection AF-DBSCAN algorithm, which generates a K-dist curve by analyzing the KNN distribution of the data, fits the curve with a polynomial based on mathematical-statistical rules, and finds the curve's inflection point to adaptively calculate the optimal global parameters Eps and MinPts; however, the algorithm fixes K = 4, which may limit parameter selection, and the adaptively selected parameters do not necessarily track the self-distribution characteristics of different data sets. On selecting parameters by combining algorithms, Wang Guang et al. proposed the adaptive-parameter KLS-DBSCAN algorithm, which determines the parameter range from the data distribution characteristics using kernel density estimation and mathematical expectation, computes a reasonable cluster number for the data set by analyzing local density characteristics, and finally determines the two optimal parameters Eps and MinPts using the silhouette coefficient; however, the algorithm's clustering performance on high-dimensional and multi-density data sets is mediocre. Gholizadeh, N. et al. proposed the K-DBSCAN algorithm, which, based on the idea of data partitioning, initially groups the data with the K-means++ algorithm, clusters each group separately with DBSCAN, and finally merges boundary clusters and integrates the clustering result. Avory Bryant et al. proposed the RNN-DBSCAN algorithm, which uses the reverse-nearest-neighbor count as an estimate of the observed density, improving the ability to process data sets with large density differences, so that the parameter selection problem is reduced to a single parameter (the choice of K nearest neighbors).
On improving the algorithm's structure, Chen Wenlong et al. partition the data set with a KD-tree to construct neighborhood object data sets, distinguishing noise points and core points in advance, which reduces the computation over noise data during clustering and effectively improves the algorithm's running efficiency, though memory usage becomes excessive as the data volume grows. Kim et al. proposed the AA-DBSCAN algorithm, which defines the density layers of a data set based on a new quadtree-based tree structure to cluster data sets of non-uniform density, but the algorithm still requires the relevant parameters as input.
In summary, prior work on adaptively selecting DBSCAN algorithm parameters based on a K-dist curve is scarce, and related research mostly analyzes a fixed K-dist curve (i.e., a fixed parameter K). On a client data set, the parameters so selected cannot adapt to the data set's own distribution characteristics, so the clustering effect is not ideal, client subdivision is disordered, and the corresponding client groups cannot be clearly determined.
Disclosure of Invention
The embodiment of the invention provides a client subdivision method based on an X-DBSCAN algorithm, which solves the problems in the prior art that parameters selected on a client data set cannot adapt to the data set's own distribution characteristics, so that the clustering effect is not ideal, client subdivision is disordered, and the corresponding client groups cannot be clearly determined.
The invention provides a client subdivision method based on an X-DBSCAN algorithm, which comprises the following steps:
acquiring a client data set;
the Euclidean distance among all the client data in the client data set is calculated, and a distance matrix is generated;
acquiring a plurality of K-dist curves corresponding to different K values based on a distance matrix;
obtaining inflection points of each K-dist curve to obtain a plurality of Eps parameters;
calculating a plurality of Eps parameters by adopting a mathematical expectation method to obtain a plurality of MinPts parameters;
sequentially inputting a plurality of corresponding Eps parameters and MinPts parameters into a DBSCAN algorithm, and clustering a data set to obtain a clustering result;
judging the stable interval of the cluster structure from the cluster number of the clustering results, and searching for the optimal K value within that stable interval;
generating an optimal Eps parameter and an optimal MinPts parameter based on the optimal K value;
inputting the optimal Eps parameter and the optimal MinPts parameter into a DBSCAN algorithm to obtain an optimal clustering result, and subdividing the client data set according to the optimal clustering result to obtain a plurality of client groups.
Preferably, the plurality of K-dist curves corresponding to different K values are obtained from the distance matrix by the following steps:
sorting the elements of the distance matrix in ascending order within each row;
sorting the K-th column of the sorted distance matrix in ascending order as the ordinate, with the data point index as the abscissa, thereby generating a plurality of K-dist curves, where 1 ≤ K ≤ n and n is the number of clients in the client data set.
Preferably, before the inflection point of each K-dist curve is obtained, a least square method is adopted to perform curve fitting on each K-dist curve.
Preferably, a maximum curvature method is adopted to find the point of maximum curvature in the abrupt-change region after each fitted K-dist curve has risen steadily, using the curvature formula:

k = |f''(x)| / (1 + (f'(x))²)^(3/2)

where k is the curvature of the fitted curve, f''(x) is its second derivative, and f'(x) is its first derivative.
Preferably, the ordinate corresponding to the inflection point of the K-dist curve is the Eps parameter.
Preferably, the plurality of MinPts parameters are obtained from the plurality of Eps parameters by a mathematical expectation method, comprising the following steps:
for each client in the client data set, counting in turn the clients contained in its neighborhood under each of the different Eps values;
calculating the mathematical expectation of the number of clients contained in the Eps neighborhood of each client, to obtain a plurality of MinPts pending parameters;
applying a noise reduction threshold to each MinPts pending parameter to obtain the plurality of MinPts parameters.
Preferably, the MinPts parameters are computed as follows:

MinPts_K = β · (1/n) · Σ_{i=1}^{n} P_i

where β is the noise reduction threshold, 0 ≤ β ≤ 1, P_i is the number of clients in the Eps neighborhood of the i-th client, n is the number of clients, and K indexes the corresponding K-dist curve.
Preferably, judging the stable interval of the cluster structure from the cluster number of the clustering results, and searching for the optimal K value within that stable interval, comprises the following steps:
setting Y, the required number of consecutive identical cluster numbers, according to the data set;
when the cluster number of the clustering results is the same Y times in succession, judging that the clustering results have stabilized, and selecting the current cluster number X as the optimal cluster number;
selecting the maximum K value for which the cluster number equals X as the current optimal K value;
when no cluster number occurs Y times in succession, searching instead for a cluster number that occurs Y−1 times in succession (with Y−1 ≥ 3).
Preferably, when no cluster number occurs three times in succession, a stable interval is defined as a range over which the cluster number fluctuates by at most 1; the maximum K value in the stable interval is taken as the optimal K value, and the cluster number of the clustering result corresponding to that maximum K value is the optimal cluster number.
Compared with the prior art, the invention has the beneficial effects that:
the invention provides an X-DBSCAN algorithm for adaptively determining algorithm parameters based on a K-dist graph, which combines the distribution characteristics of a data set, generates an Eps parameter list through a K-dist curve and a mathematical method, generates a corresponding MinPts parameter list by using a mathematical expectation method and a noise reduction threshold value, selects Eps and MinPts values corresponding to the maximum K value as optimal parameters under a cluster number change stability interval of a clustering result, and realizes the adaptive determination of algorithm parameters. The client data set is clustered through the algorithm, and the accuracy of a clustering result is improved.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a two-dimensional dataset of the present invention;
FIG. 2 is a graph of all K-dist plots corresponding to a two-dimensional dataset of the present invention;
FIG. 3 is a graph of K-dist (81-dist) of the present invention;
FIG. 4 is a graph of the fitted K-dist (81-dist) of the present invention;
FIG. 5 is a plot of the inflection point of the fitted K-dist (81-dist) curve of the present invention;
FIG. 6 is a graph of the relationship between the values of the Eps parameter and the K value according to the present invention;
FIG. 7 is a graph showing the relationship between the MinPts parameter values and K values according to the present invention;
FIG. 8 is a graph of cluster number versus K value for the clustering result of the present invention;
FIG. 9 is a graph of the relationship between the silhouette coefficient of the clustering result and the K value according to the present invention;
FIG. 10 is a graph of clustering results of a two-dimensional dataset of the present invention under an X-DBSCAN algorithm;
FIG. 11 is a two-dimensional artificial dataset of the present embodiment;
wherein a: aggregation dataset, b: compound dataset, c: jain dataset, d: flame dataset, e: an R15 dataset;
FIG. 12 is a graph showing the clustering effect of the X-DBSCAN algorithm on five data sets according to the present embodiment;
wherein a: aggregation dataset, b: compound dataset, c: jain dataset, d: flame dataset, e: an R15 dataset;
FIG. 13 is a graph showing the clustering effect of the KANN-DBSCAN algorithm on five data sets according to the present embodiment;
wherein a: aggregation dataset, b: compound dataset, c: jain dataset, d: flame dataset, e: an R15 dataset;
FIG. 14 is a graph showing the clustering effect of the AF-DBSCAN algorithm on five data sets according to the present embodiment;
wherein a: aggregation dataset, b: compound dataset, c: jain dataset, d: flame dataset, e: an R15 dataset;
FIG. 15 is a graph showing the clustering effect of the DBSCAN algorithm on five data sets according to the present embodiment;
wherein a: aggregation dataset, b: compound dataset, c: jain dataset, d: flame dataset, e: an R15 dataset;
FIG. 16 is a graph showing clustering results of annual revenue and consumption score data by the X-DBSCAN algorithm of the present embodiment;
FIG. 17 is a graph showing clustering results of annual revenue and consumption score data by the DBSCAN algorithm of the present embodiment;
FIG. 18 is a flow chart of a client subdivision method based on the X-DBSCAN algorithm of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 18, the invention provides a client subdivision method based on the X-DBSCAN algorithm. Parameter selection in the X-DBSCAN algorithm comprises adaptively generating the Eps parameter list, adaptively generating the MinPts parameter list, adaptively determining the optimal K value, and selecting and verifying the optimal parameters. The method specifically comprises the following steps:
the first step: a customer dataset is obtained.
Referring to fig. 1, a client data set D of 500 data objects in 5 categories is acquired.
The second step: the Euclidean distances between all client data in the data set are calculated, and a distance matrix is generated.
The distance matrix is as follows:

Dist_{n×n} = {dist(i, j) | 1 ≤ i ≤ n, 1 ≤ j ≤ n}    (1)

where Dist_{n×n} is the distance matrix, dist(i, j) is the distance between object i and object j in the data set, and n is the number of clients in the client data set D.
The third step: a plurality of K-dist curves corresponding to different K values are acquired based on the distance matrix.
Referring to FIG. 2, which shows all K-dist curves (1 ≤ K ≤ n) generated from the data set of FIG. 1: when different values of the parameter K are selected, the corresponding K-dist curves differ greatly, which leads to different inflection points being selected and hence different Eps parameters being generated; selecting an appropriate K value is therefore essential for determining the Eps parameter.
The elements of the distance matrix Dist_{n×n} are sorted in ascending order within each row; the first column of the sorted matrix is then all zeros, each element being the distance from a client datum to itself.
The K-th column of the sorted distance matrix, itself sorted in ascending order, is taken as the ordinate and the data point index as the abscissa, generating a plurality of K-dist curves, where 1 ≤ K ≤ n and n is the number of clients in the client data set. Each point on a K-dist curve is the distance from the x-th client in the client data set to its K-th nearest client; a K-dist curve is generated from these distances, different values of the parameter K generate different K-dist curves, and all K-dist curves together form the K-dist graph.
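As an illustrative sketch (not part of the patent text), the distance matrix of equation (1) and the K-dist curves of this step can be computed with NumPy; the toy Gaussian data stands in for the 500-point client data set D, and the 0-based column-indexing convention (column K = K-th nearest neighbour, the self-distance being column 0) is an assumption where the text is ambiguous.

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(size=(500, 2))          # toy stand-in for the client data set, n = 500

# Dist[i, j] = Euclidean distance between client i and client j  (Eq. 1)
diff = D[:, None, :] - D[None, :, :]
Dist = np.sqrt((diff ** 2).sum(axis=-1))

# Sort each row ascending; column 0 is then each point's distance to itself (0)
Dist_sorted = np.sort(Dist, axis=1)

def k_dist_curve(Dist_sorted, K):
    """K-dist curve: the K-th nearest-neighbour distances, sorted ascending."""
    return np.sort(Dist_sorted[:, K])  # 0-based column K = K-th neighbour

curve_4 = k_dist_curve(Dist_sorted, 4)
print(curve_4.shape)                   # one distance per client
```

Plotting `curve_4` against the point index reproduces one curve of the K-dist graph; looping K from 1 to n − 1 reproduces the whole graph of FIG. 2.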
Fourth step: and (5) performing curve fitting on each K-dist curve by adopting a least square method.
The 81-dist curve from FIG. 2 is shown in FIG. 3. Analyzing it, the curve is rough and individual points fluctuate strongly, which affects the selection of the inflection point, so the curve needs to be fitted. FIG. 4 shows the fitted 81-dist curve; compared with the original 81-dist curve, the rough points have been removed and the curve is smoother.
The least square method is calculated as follows:
the function most matched with the original data distribution is found by a method of minimizing the sum of squares of errors (in order to more accurately fit the K-dist curve, the algorithm fitting order is set to 15, and the least squares polynomial curve is fitted as follows:
f(x) = θ 01 x+θ 2 x 2 +…+θ 15 x 15 (3)
wherein θ j (j=1, 2,3,) 15 is the coefficients of the polynomial.
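A minimal sketch of the 15th-order least-squares fit of equation (3), assuming NumPy's `polyfit` as the fitting routine (the patent does not prescribe an implementation) and a synthetic stand-in for the K-dist curve; scaling the abscissa to [0, 1] before fitting is a practical addition for numerical stability, not something the text specifies.

```python
import numpy as np

# toy stand-in for a K-dist curve: noisy, monotonically rising values
x = np.arange(200, dtype=float)
rng = np.random.default_rng(1)
y = np.sort(np.log1p(x) + rng.normal(scale=0.05, size=x.size))

xs = x / x.max()                       # scale abscissa to [0, 1] for stability
coeffs = np.polyfit(xs, y, deg=15)     # theta_15, ..., theta_1, theta_0
fitted = np.polyval(coeffs, xs)        # f(x) as in Eq. (3)

residual = float(np.sqrt(np.mean((fitted - y) ** 2)))
print(f"RMS residual of the 15th-order fit: {residual:.4f}")
```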
Fifth step: and obtaining inflection points of each fitted K-dist curve by adopting a maximum curvature method to obtain a plurality of Eps parameters.
The point of maximum curvature in the abrupt-change region after the fitted, smoothed K-dist curve has risen steadily is calculated; this inflection point lies where the gentle curve turns steeply upward. The curvature is calculated as follows:

k = |f''(x)| / (1 + (f'(x))²)^(3/2)    (4)

where k is the curvature of the fitted curve, f''(x) is its second derivative, and f'(x) is its first derivative.
The distance value (ordinate) corresponding to the inflection point is taken as the Eps parameter Eps_K. The inflection point selected for the fitted 81-dist curve is shown in fig. 5; its ordinate is taken as the Eps parameter value for K = 81.
After all K-dist curves have been processed, an Eps parameter list is generated, as follows:

Eps_list = {Eps_K | 1 ≤ K ≤ n}    (5)

The above steps are executed for all K-dist curves (1 ≤ K ≤ n), and the Eps parameters obtained from each K-dist curve form the Eps parameter list Eps_list.
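The maximum-curvature selection of the inflection point, and hence of each Eps_K, can be sketched as follows for a single fitted polynomial. The toy cubic below stands in for a fitted K-dist curve and is not from the patent; evaluating the curvature of equation (4) on a grid and taking the argmax is one straightforward realization.

```python
import numpy as np

def knee_ordinate(coeffs, x):
    """Return (x*, f(x*)) at the point of maximum curvature of the
    polynomial with coefficients `coeffs` (highest degree first),
    using k = |f''| / (1 + f'^2)^(3/2) as in Eq. (4)."""
    d1 = np.polyder(coeffs, 1)
    d2 = np.polyder(coeffs, 2)
    f1 = np.polyval(d1, x)
    f2 = np.polyval(d2, x)
    curvature = np.abs(f2) / (1.0 + f1 ** 2) ** 1.5
    i = int(np.argmax(curvature))
    return x[i], float(np.polyval(coeffs, x[i]))

# toy fitted curve with a smooth "elbow": f(x) = x^3 on [0, 1]
coeffs = np.array([1.0, 0.0, 0.0, 0.0])
x = np.linspace(0.0, 1.0, 1001)
x_star, eps_candidate = knee_ordinate(coeffs, x)
print(x_star, eps_candidate)           # the ordinate is the Eps_K candidate
```

For f(x) = x³ the curvature maximum can be checked analytically at x = (1/45)^(1/4) ≈ 0.386, which the grid search recovers.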
The relationship between the generated Eps parameter values and the K value is shown in fig. 6. The Eps parameter value increases gradually as K increases. When K is about 100, the curve turns noticeably steep and the Eps parameter value rises sharply, meaning that the K-th nearest-neighbor client of each client datum in the client data set then belongs to another category; the range of the optimal parameter K for this client data set is therefore judged to lie between 1 and 100. (The Eps-versus-K curves of different client data sets differ, so this judgment of the optimal K range applies only to the client data set of fig. 1.)
Sixth step: and calculating a plurality of Eps parameters by adopting a mathematical expectation method to obtain a plurality of MinPts parameters.
For the generated Eps parameter list, for each Eps_K (1 ≤ K ≤ n), the number of clients contained in the Eps_K neighborhood of each object in the client data set D is calculated in turn; the mathematical expectation of these neighborhood counts is then taken as the MinPts pending parameter, and the noise reduction threshold is applied to it to generate the MinPts_K parameter, specifically:

MinPts_K = β · (1/n) · Σ_{i=1}^{n} P_i    (6)

where β is the noise reduction threshold, 0 ≤ β ≤ 1 (the algorithm of the invention sets β to 0.85), n is the total number of clients in the client data set D, and P_i is the number of clients in the Eps neighborhood of the i-th client. After the calculation has been completed for each Eps_K, a MinPts parameter list is generated, as follows:

MinPts_list = {MinPts_K | 1 ≤ K ≤ n}    (7)
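Equations (6)–(7) can be sketched as follows. The β = 0.85 threshold follows the text, while the toy data, the inclusion of the point itself in P_i, and rounding to an integer are assumptions where the text leaves details open.

```python
import numpy as np

def minpts_for_eps(Dist, eps, beta=0.85):
    """MinPts_K = beta * (1/n) * sum_i P_i (Eq. 6), with P_i the number of
    points in the Eps-neighbourhood of point i (self included here; the
    patent text leaves this detail open)."""
    P = (Dist <= eps).sum(axis=1)          # neighbourhood counts P_i
    return int(round(beta * P.mean()))

rng = np.random.default_rng(2)
D = rng.normal(size=(300, 2))              # toy stand-in for the client data
diff = D[:, None, :] - D[None, :, :]
Dist = np.sqrt((diff ** 2).sum(axis=-1))

# one MinPts value per candidate Eps, mirroring the MinPts list of Eq. (7)
minpts_list = [minpts_for_eps(Dist, eps) for eps in (0.2, 0.5, 1.0)]
print(minpts_list)
```

Because the neighbourhood counts grow pointwise with Eps, the resulting MinPts values are non-decreasing in Eps, matching the upward trend of FIG. 7.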
as shown in fig. 7, the relationship between the value of the MinPts parameter and the K value shows that the value of the MinPts parameter gradually increases with increasing K value, and the MinPts parameter shows a meandering upward trend.
Seventh step: and sequentially inputting a plurality of corresponding Eps parameters and MinPts parameters into a DBSCAN algorithm, and clustering the client data set.
First, the Eps_K parameters generated from the K-dist curves drawn for different parameters K form the Eps parameter list Eps_list. The different Eps_K (1 ≤ K ≤ n) in the parameter list Eps_list are selected in turn, together with the corresponding MinPts_K (1 ≤ K ≤ n) calculated by equation (6), and input into the DBSCAN algorithm to perform cluster analysis on the client data set D.
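A sketch of this step using scikit-learn's DBSCAN (the patent does not name an implementation); the parameter lists and the three-group client data below are toy stand-ins rather than values produced by the K-dist procedure.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(3)
D = np.vstack([rng.normal(loc=c, scale=0.1, size=(100, 2))
               for c in ((0, 0), (3, 3), (0, 3))])  # 3 toy client groups

# stand-in parameter lists; in the method these come from the K-dist curves
eps_list = [0.2, 0.3, 0.4]
minpts_list = [5, 8, 10]

cluster_counts = []
for eps, minpts in zip(eps_list, minpts_list):
    labels = DBSCAN(eps=eps, min_samples=minpts).fit_predict(D)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # drop noise
    cluster_counts.append(n_clusters)
print(cluster_counts)
```

The sequence of cluster counts recorded here is exactly the input to the stable-interval search of the next step.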
Eighth step: and judging a stable section of the clustering structure according to the number of the clusters of the clustering result, searching an optimal K value under the stable section of the number of the clusters of the clustering result, and generating an optimal Eps parameter and an optimal MinPts parameter based on the optimal K value.
Corresponding cluster numbers of the clustering results are obtained for different values of K (1 ≤ K ≤ n). When the cluster number is the same Y times in succession (Y is 5 in the algorithm of the invention), the clustering result is considered stable, and the current cluster number X is selected as the optimal cluster number; the DBSCAN algorithm continues to be executed on the client data set D, and the maximum K value (abscissa) at which the cluster number equals X is selected as the current optimal K value. In addition, if no cluster number occurs Y times in succession over the whole clustering result, a cluster number that occurs Y−1 times in succession is sought (with Y−1 ≥ 3); if no cluster number occurs three times in succession, a stable interval is defined as a range over which the cluster number fluctuates by at most 1, the maximum K value within the stable interval is taken as the optimal K value, and the cluster number of the clustering result corresponding to that maximum K value is the optimal cluster number. The Eps_K generated by the K-dist curve at the optimal K value is the optimal Eps parameter, and the MinPts_K corresponding to that Eps_K is the optimal MinPts parameter.
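The search for the stable interval and the optimal K value described above can be sketched as follows, with Y = 5; the fallback ordering (Y, then Y−1, down to 3) follows the text, while tie handling (always keeping the latest, i.e. largest-K, run) is an assumption where the text leaves details open.

```python
def optimal_k(cluster_counts, y=5, y_min=3):
    """cluster_counts[k] = cluster number obtained with the K-dist
    parameters for K = k + 1. Returns (best_K, stable_cluster_count):
    the largest K ending a run of `y` identical counts, falling back to
    shorter runs down to `y_min` as in the patent text."""
    for run_len in range(y, y_min - 1, -1):
        best = None
        for start in range(len(cluster_counts) - run_len + 1):
            window = cluster_counts[start:start + run_len]
            if len(set(window)) == 1:              # run of identical counts
                best = (start + run_len, window[0])  # K of the run's last element
        if best is not None:
            return best
    raise ValueError("no stable interval found")

counts = [1, 5, 5, 3, 3, 3, 3, 3, 2, 2]
print(optimal_k(counts))   # the run of five 3s ends at K = 8
```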
The above steps were applied to the client data set in fig. 1. The relationship between the cluster number of the clustering result and the K value is shown in fig. 8: when K equals 3 the cluster number enters the stable interval, and when K equals 81 the stable interval ends, so K = 81 is the optimal K value. The inflection point of the 81-dist curve corresponding to K = 81 gives the optimal Eps parameter, the optimal MinPts parameter is generated from it, and the computed values are optimal Eps = 3.019 and optimal MinPts = 77.
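The inflection-point selection described here, a least-squares fit of the K-dist curve followed by the maximum-curvature formula k = |f''(x)| / (1 + f'(x)^2)^(3/2), can be sketched as below. The polynomial degree and the function name `knee_eps` are illustrative choices, not the patent's.

```python
import numpy as np

def knee_eps(k_dist, degree=7):
    """Least-squares polynomial fit of a sorted K-dist curve, then the
    point of maximum curvature k = |f''| / (1 + f'^2)^1.5 (a sketch)."""
    x = np.arange(len(k_dist), dtype=float)
    f = np.poly1d(np.polyfit(x, np.asarray(k_dist, float), degree))
    f1, f2 = f.deriv(1), f.deriv(2)
    curvature = np.abs(f2(x)) / (1.0 + f1(x) ** 2) ** 1.5
    knee = int(np.argmax(curvature))   # abscissa of the inflection point
    return knee, float(f(knee))        # the Eps candidate is the ordinate

# sanity check on an exact quadratic: the curvature of y = x^2 peaks at x = 0
idx, eps0 = knee_eps(np.arange(10.0) ** 2, degree=5)
```

For a real K-dist curve the returned ordinate would serve as the Eps_K candidate for that K.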
The invention uses the silhouette coefficient to verify the clustering effect under the adaptively selected optimal parameters. The silhouette coefficient (Silhouette Coefficient) is an effective index for evaluating clustering quality: it measures how similar a sample point is to its own cluster compared with the other clusters, and describes how clearly the outline of each class is defined after clustering. It combines two factors, cohesion and separation. Cohesion reflects how tightly a sample point is bound to the other elements within its class; separation reflects how far the sample point is from the elements outside its class. The silhouette coefficient of a sample point is computed as follows:
S(i) = (b(i) − a(i)) / max{a(i), b(i)}
where a(i) denotes the cohesion of the sample point, i.e. the average distance from the i-th object to the other objects in its own cluster, and b(i) denotes its separation, i.e. the minimum average distance from the i-th object to the objects of each of the other clusters. The silhouette coefficient S takes values in [−1, 1]; the closer the value is to 1, the more reasonable the clustering effect.
Fig. 9 shows the relationship between the silhouette coefficient and the K value (for K greater than 89 the corresponding Eps and MinPts parameters yield a single cluster, which does not satisfy the basic condition for computing the silhouette coefficient, so the curve stops at K = 89). The figure shows that the clustering result under the adaptively selected optimal parameters is better: compared with the silhouette coefficient values of the other parameters, the optimal K value selected by the invention (K = 81) is effective, and the Eps and MinPts parameters corresponding to the optimal K value are optimal.
Ninth step: input the optimal Eps parameter and the optimal MinPts parameter into the DBSCAN algorithm to obtain the optimal clustering result. Subdivide the client data set according to the optimal clustering result to obtain a plurality of client groups, so that customized solutions and personalized services can be provided for different clients and marketing strategies formulated accordingly.
The clustering result is shown in fig. 10: a total of 5 clusters are generated, consistent with the class structure given by the client data set. This shows that the X-DBSCAN algorithm provided by the invention can adaptively generate appropriate Eps and MinPts parameters according to the characteristics of the data set, cluster the data set effectively, and divide each density region accurately.
For a two-dimensional data set, the time complexity of the conventional DBSCAN algorithm is O(n²), where n is the number of objects in the data set. With a KD-tree index structure, all points within a given distance of a specific point can be searched efficiently, reducing the time complexity to O(n log n). The X-DBSCAN algorithm iterates over the DBSCAN algorithm, with the number of iterations equal to the number of data set objects n, so the time complexity of the X-DBSCAN algorithm when using the KD-tree index structure is O(n² log n).
The space complexity of the DBSCAN algorithm is O(n²). The X-DBSCAN algorithm generates a distance matrix, an Eps parameter list and a MinPts parameter list during operation; the distance matrix and the MinPts parameter list need not be stored and are not counted in the space complexity, while the Eps parameter list has space complexity O(n). The space complexity of the X-DBSCAN algorithm is therefore O(n²) + O(n).
In summary, the complexity of the X-DBSCAN algorithm is slightly higher than that of the traditional DBSCAN algorithm, but the clustering accuracy is effectively improved, and the algorithm achieves a better clustering effect in general scenarios.
Example 1
The experiments were implemented in Python 3.9 and run under a 64-bit Windows 10 environment with the following hardware configuration: Intel Core i7-7500U CPU @ 2.70 GHz (dual core) and 8 GB of memory. To verify the clustering accuracy and effectiveness of the algorithm of the invention, experiments and cluster analysis were performed on two-dimensional artificial data sets of various shapes and on UCI real data sets.
Five two-dimensional artificial data sets were selected for cluster analysis, as shown in fig. 11: Aggregation contains 788 data objects in 7 classes, Compound contains 399 data objects in 6 classes, Jain contains 373 data objects in 2 classes, Flame contains 240 data objects in 2 classes, and R15 contains 600 data objects in 15 classes.
Four UCI real data sets are selected for cluster analysis, and specific data information is shown in table 1.
Table 1UCI real dataset
The X-DBSCAN, KANN-DBSCAN, AF-DBSCAN and DBSCAN algorithms were run on the five two-dimensional artificial data sets; the clustering effect graphs are shown in figs. 12-15.
It can be seen that the X-DBSCAN algorithm clusters the uniform-density data sets (Aggregation, Flame and R15) effectively: the cluster number is divided correctly, and the clustering result is consistent with the number of classes given by each data set. On the data sets with large density differences (Compound and Jain), the clustering results are reasonable and better than those of the other algorithms.
The experiments adopt the supervised F value, Adjusted Mutual Information (AMI) and Adjusted Rand Index (ARI) as external evaluation indices of the clustering algorithms. The F value combines precision and recall; its range is [0, 1], and the closer to 1, the better the clustering effect. The evaluation index values F, AMI and ARI obtained by the four clustering algorithms are shown in table 2.
Table 2 evaluation index values of four algorithms on five artificial two-dimensional data sets
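For concreteness, one common way to compute a supervised F value for clusterings is pair counting: a pair of points is a true positive if both the ground truth and the prediction put them in the same cluster. The patent does not state which F-value variant it uses, so the sketch below is an assumption, and the function name `pairwise_f` is hypothetical.

```python
from itertools import combinations

def pairwise_f(labels_true, labels_pred):
    """Pair-counting F value: precision and recall over point pairs
    grouped together (one common external clustering index)."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_t = labels_true[i] == labels_true[j]
        same_p = labels_pred[i] == labels_pred[j]
        tp += same_t and same_p          # pair together in both
        fp += (not same_t) and same_p    # together only in prediction
        fn += same_t and (not same_p)    # together only in ground truth
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

# label permutations do not matter, only the grouping does
f_perfect = pairwise_f([0, 0, 1, 1], [1, 1, 0, 0])
f_partial = pairwise_f([0, 0, 1, 1], [0, 1, 1, 1])
```

AMI and ARI are similarly standard indices available in common machine-learning libraries.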
In terms of the cluster number of the clustering results, the cluster numbers generated by the X-DBSCAN algorithm are the most accurate, except on the Compound and Jain data sets. Those data sets have uneven density distributions with large density differences, and the X-DBSCAN algorithm incurs some error when clustering data with uneven density, so the resulting cluster number is inconsistent with the number of classes and the clustering effect is poor; the traditional DBSCAN algorithm is mainly aimed at data sets with uniform density, and its poor clustering quality on data sets with large density differences is an inherent limitation. In terms of the F value, the X-DBSCAN algorithm is superior to the other comparison algorithms, and both the X-DBSCAN and KANN-DBSCAN algorithms cluster the various data sets well. The parameters of the AF-DBSCAN algorithm require subjective determination, so its clustering error is larger: because the K-dist curve parameter K is fixed to 4 and the MinPts parameter is fixed to 4, the selected Eps and MinPts parameters cannot adapt to different data sets, and the clustering effect is worse.
In terms of the AMI and ARI indices, the clustering results of the X-DBSCAN algorithm agree best with the ground truth, showing that the X-DBSCAN algorithm clusters two-dimensional artificial data sets effectively. On the Jain data set, the AMI and ARI values of the KANN-DBSCAN algorithm are slightly higher than those of X-DBSCAN because KANN-DBSCAN is more accurate on individual clusters; however, the cluster number produced by X-DBSCAN on Jain matches the number given by the data set, and its overall clustering accuracy is better than that of the KANN-DBSCAN algorithm.
Example 2
To verify the clustering quality of the X-DBSCAN algorithm on real data sets, the experiments adopt Accuracy (ACC), Adjusted Mutual Information (AMI) and Adjusted Rand Index (ARI) as evaluation indices. ACC ranges over [0, 1], AMI and ARI range over [−1, 1], and the closer a value is to 1, the better the clustering result matches the ground truth. The evaluation index values obtained by the four algorithms on the four UCI real data sets are shown in table 3.
Table 3 four algorithms evaluate index values on UCI real dataset
As can be seen from Table 3, the ACC value of the X-DBSCAN algorithm is the largest on the Seeds and Ecoli data sets, indicating that its accuracy is the highest there. On the Iris data set, the ACC value of the AF-DBSCAN algorithm is 0.866, the highest, while X-DBSCAN reaches 0.667, which is still relatively high. On the Wine data set, the ACC value of AF-DBSCAN is 0.609, the highest, while X-DBSCAN reaches 0.561, also relatively high.
Combining all the evaluation indices, the X-DBSCAN algorithm performs stably on high-dimensional data sets, clusters data sets of different dimensionalities well, and shows good robustness. The AF-DBSCAN algorithm clusters Iris and Wine well but performs unstably across data sets with different distributions and dimensionalities. Although the KANN-DBSCAN algorithm performs well on two-dimensional artificial data sets, its effect on the high-dimensional UCI real data sets is mediocre: on unevenly distributed high-dimensional data, the curve of cluster number versus K rarely enters the same stable interval three times in succession, so the choice of K strongly influences the result and the clustering error on high-dimensional data is large. The method of the invention selects the optimal K-dist curve according to the distribution characteristics of the data set itself for analysis and parameter determination, and adds a noise reduction threshold in the MinPts parameter generation process to reduce the influence of noise data on the clustering result; the redefined stable-interval selection strategy yields algorithm parameters that better match the data characteristics, improving both the clustering effect and the stability of the algorithm.
Example 3
By applying the X-DBSCAN algorithm to the actual mall customer data for verification, customers with the same properties are gathered together (target customer groups), so that customer portraits can be provided for mall marketing teams, the requirements and characteristics of different customer groups can be more clearly known, customized solutions and personalized services can be provided for different customers, and marketing strategies can be formulated accordingly.
The data of the invention are derived from a real mall customer data set: a supermarket shopping center collected, through customer membership cards, basic data on 200 customers such as customer ID, age, gender, annual income and spending score. The spending score is assigned to a customer by the mall on the basis of defined parameters, such as customer behaviour and purchase data. Part of the basic customer data is shown in table 4, the mean, minimum, percentile and maximum values of each attribute are shown in table 5, and the complete customer data are shown in table 6.
Table 4 store customer portion data
Table 5 customer data respective attribute value cases
Table 6 mall customer complete data
Data preprocessing is performed on the basic mall customer data. Data-processing work typically occupies more than 70% of the time of the whole project. Data processing involves many factors, including accuracy, integrity, consistency, timeliness, credibility and interpretability. Real data may contain a large number of missing values and noise, and may contain outliers caused by manual input errors, all of which are very unfavourable for training an algorithm model. Data cleaning handles the various kinds of dirty data appropriately, so as to obtain standard, clean and consistent data for data statistics, data mining and the like.
The invention performs cluster analysis on two groups of two-dimensional data, annual income versus spending score and age versus spending score, and groups clients with the same properties and consumption characteristics into specific consumer groups. The basic mall customer data are preprocessed as follows: the annual income and spending score data are extracted and normalized, and the age and spending score data are extracted and normalized.
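The extraction-and-normalization step can be sketched as follows. The patent does not specify which normalization it uses, so column-wise min-max scaling is assumed here, and the sample records and column layout are hypothetical illustrations, not the patent's data.

```python
import numpy as np

# Hypothetical mall-customer records: columns are
# (age, annual income in thousands, spending score)
data = np.array([
    [19, 15, 39],
    [21, 15, 81],
    [35, 120, 79],
    [45, 126, 28],
], dtype=float)

def min_max_normalize(X):
    """Column-wise min-max normalization to [0, 1]; constant columns
    are mapped to 0 to avoid division by zero."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi - lo == 0, 1.0, hi - lo)
    return (X - lo) / span

income_score = min_max_normalize(data[:, [1, 2]])  # annual income + score
age_score = min_max_normalize(data[:, [0, 2]])     # age + score
```

Each normalized two-column array would then be fed to the X-DBSCAN algorithm for the two cluster analyses described in the text.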
Clustering results of annual revenue and consumption score data by the X-DBSCAN algorithm presented in the present invention are shown in FIG. 16.
From the clustering result graph of fig. 16 it can be seen that the X-DBSCAN algorithm groups the annual income and spending score data into 5 clusters, each representing a customer population with a different income level and consumption characteristics. The 5 clusters can be described as follows: cluster (1) represents customers with moderate annual income and consumption level; cluster (2) represents customers with low annual income and a high consumption level; cluster (3) represents customers with low annual income and a low consumption level; cluster (4) represents customers with high annual income and a high consumption level; and cluster (5) represents customers with high annual income and a low consumption level. A few very small customer groups are also present: for example, customers with low annual income but a very high consumption level, who may be students; customers with very high annual income and a high consumption level, who may be elite enterprise customers; and customers with very high annual income but a low consumption level, who have great consumption potential and for whom the mall can set a corresponding marketing strategy to promote sales.
From the clustering result of the X-DBSCAN algorithm in fig. 16, the mall can clearly identify its main customer groups. Cluster (4) has higher income and consumption capacity; the mall needs to maintain the consumption stability of this group and take appropriate marketing campaigns to further promote its consumption. Cluster (2) is also a main customer group of the mall, but its income is low and its consumption may be unstable. Cluster (1) is a secondary customer group: it is large in number, with income and consumption capacity at a medium level, and as the mass customer group it requires corresponding strategies for guided consumption to promote sales. Cluster (5) is a customer group with relatively high consumption potential, whose consumption requirements the mall needs to mine. Cluster (3) is an unstable customer group with a likely high churn rate.
Comparing the clustering results of the X-DBSCAN algorithm and the DBSCAN algorithm. The clustering results of annual revenue and consumption score data by the DBSCAN algorithm are shown in FIG. 17.
As can be seen from figs. 16 and 17, the clustering result of the DBSCAN algorithm on the mall customer data set is not as good as that of the X-DBSCAN algorithm: the customer segments produced by DBSCAN are relatively cluttered, and the corresponding customer groups cannot be identified clearly. To compare the two algorithms further, the clustering performance and clustering effect of the X-DBSCAN and DBSCAN algorithms were verified with the silhouette coefficient; the specific index values are shown in table 7.
TABLE 7 Silhouette coefficient index value comparison table
As can be seen from Table 7, the X-DBSCAN algorithm achieves a higher silhouette coefficient on the mall customer data set than the DBSCAN algorithm, improving clustering performance and clustering stability. The X-DBSCAN algorithm therefore also performs well in customer segmentation applications.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (9)

1. A client subdivision method based on the X-DBSCAN algorithm, characterized by comprising the following steps:
acquiring a client data set;
calculating Euclidean distances among all client data in the client data set to generate a distance matrix;
acquiring a plurality of K-dist curves corresponding to different K values based on a distance matrix;
obtaining inflection points of each K-dist curve to obtain a plurality of Eps parameters;
calculating a plurality of Eps parameters by adopting a mathematical expectation method to obtain a plurality of MinPts parameters;
sequentially inputting a plurality of corresponding Eps parameters and MinPts parameters into a DBSCAN algorithm, and clustering a data set to obtain a clustering result;
judging a cluster structure stable interval according to the cluster number of the clustering result, and searching an optimal K value under the cluster number stable interval of the clustering result;
generating an optimal Eps parameter and an optimal MinPts parameter based on the optimal K value;
inputting the optimal Eps parameter and the optimal MinPts parameter into a DBSCAN algorithm to obtain an optimal clustering result, and subdividing the client data set according to the optimal clustering result to obtain a plurality of client groups.
2. The method for client subdivision based on the X-DBSCAN algorithm as claimed in claim 1, wherein the method for obtaining a plurality of K-dist curves corresponding to different K values based on the distance matrix comprises the steps of:
sorting the elements in the distance matrix according to ascending order;
the K column elements in the ordered distance matrix are ordered according to ascending order to be used as an ordinate, and the data quantity is used as an abscissa, so that a plurality of K-dist curves are generated; wherein K is more than or equal to 1 and less than or equal to n, and n is the number of clients in the client data set.
3. The method of claim 2, wherein a least square method is used to fit each K-dist curve before the inflection point of each K-dist curve is obtained.
4. The method for subdividing clients based on the X-DBSCAN algorithm as recited in claim 3, wherein a maximum curvature method is adopted to calculate a point with maximum curvature in an abrupt change area after each fitted K-dist curve steadily rises, and a curvature formula is as follows:
k = |f''(x)| / (1 + (f'(x))²)^(3/2)
where k is the curvature of the fitted curve, f''(x) is the second derivative of the fitted curve, and f'(x) is the first derivative of the fitted curve.
5. The method of claim 1, wherein the ordinate corresponding to the inflection point of the K-dist curve is an Eps parameter.
6. The method for client subdivision based on the X-DBSCAN algorithm as claimed in claim 1, wherein the mathematical expectation method is adopted to calculate a plurality of Eps parameters to obtain a plurality of MinPts parameters, comprising the steps of:
sequentially calculating clients contained in the neighborhood of a plurality of different Eps for each client in the client data set;
calculating mathematical expectations of the number of clients contained in the Eps neighborhood of each client to obtain a plurality of MinPts undetermined parameters;
and adding a noise reduction threshold value to each MinPts undetermined parameter to obtain a plurality of MinPts parameters.
7. The method of claim 6, wherein the MinPts parameters are as follows:
wherein β is the noise reduction threshold, 0 ≤ β ≤ 1, P_i is the number of clients in the Eps neighborhood of the i-th client, and K is a constant.
8. The method for subdividing clients based on the X-DBSCAN algorithm according to claim 1, wherein the method for determining the stable section of the clustering structure according to the number of clusters of the clustering result, searching the optimal K value under the stable section of the number of clusters of the clustering result, comprises the following steps:
setting a cluster number continuous identical value Y of a clustering result according to the data set;
when the cluster numbers of the clustering results are the same for Y times continuously, judging that the clustering results tend to be stable, and selecting the current cluster number X as the optimal cluster number;
when the cluster number of the clustering result no longer equals X, selecting the maximum K value at which the cluster number equals X as the current optimal K value;
when the cluster numbers of the clustering results are not identical Y times in succession, searching for a cluster number that is identical Y − 1 times in succession (Y − 1 ≥ 3).
9. The method for subdividing clients based on the X-DBSCAN algorithm according to claim 8, wherein when the cluster number of the clustering result clusters does not exist three times continuously and identically, a stable interval is defined as a range of fluctuation of the cluster number within 1, a maximum K value within the stable interval is used as an optimal K value, and the cluster number corresponding to the maximum K value is the optimal cluster number of the clustering result clusters.
CN202310985703.3A 2023-08-07 2023-08-07 Customer subdivision method based on X-DBSCAN algorithm Pending CN117056761A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310985703.3A CN117056761A (en) 2023-08-07 2023-08-07 Customer subdivision method based on X-DBSCAN algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310985703.3A CN117056761A (en) 2023-08-07 2023-08-07 Customer subdivision method based on X-DBSCAN algorithm

Publications (1)

Publication Number Publication Date
CN117056761A true CN117056761A (en) 2023-11-14

Family

ID=88661863

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310985703.3A Pending CN117056761A (en) 2023-08-07 2023-08-07 Customer subdivision method based on X-DBSCAN algorithm

Country Status (1)

Country Link
CN (1) CN117056761A (en)


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIFENG YIN et al.: "Improvement of DBSCAN Algorithm Based on K-Dist Graph for Adaptive Determining Parameters", Electronics, https://www.mdpi.com/2079-9292/12/15/3213, pages 1-25 *
SHI Haiyang et al.: "Customer segmentation based on clustering algorithms and its optimization", Computer Engineering and Design, pages 3282-3287 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117912712A (en) * 2024-03-20 2024-04-19 徕兄健康科技(威海)有限责任公司 Thyroid disease data intelligent management method and system based on big data
CN117912712B (en) * 2024-03-20 2024-05-28 徕兄健康科技(威海)有限责任公司 Thyroid disease data intelligent management method and system based on big data


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination