CN110866782A

CN110866782A - Customer classification method and system and electronic equipment

Info

Publication number: CN110866782A
Application number: CN201911078287.9A
Authority: CN
Inventors: 穆维松; 李玥; 冯建英; 田东; 褚晓泉
Original assignee: China Agricultural University
Current assignee: China Agricultural University
Priority date: 2019-11-06
Filing date: 2019-11-06
Publication date: 2020-03-06
Anticipated expiration: 2039-11-06
Also published as: CN110866782B

Abstract

The embodiment of the invention provides a customer classification method, a customer classification system and electronic equipment, wherein the method comprises the following steps: acquiring a mixed data sample set consisting of a plurality of mixed data samples; each mixed data sample consists of a numerical type variable sample and a classification type variable sample; carrying out feature extraction on the numerical variable sample to obtain a comprehensive evaluation factor; acquiring dissimilarity measurement values of the comprehensive evaluation factors and the classified variable samples; and optimizing the K-Means clustering algorithm of the dissimilarity metric value based on a particle swarm optimization algorithm, constructing a customer classification model, and further obtaining a customer classification result. The embodiment of the invention combines the K-Means clustering algorithm and the improved particle swarm optimization algorithm, considers various indexes influencing customer consumption, adopts the optimization clustering algorithm to perform clustering analysis on customer consumption data to obtain customer groups with different characteristics, has more reasonable and accurate analysis results, and is convenient for adopting different operation and customer service strategies for different groups.

Description

Customer classification method and system and electronic equipment

Technical Field

The embodiment of the invention belongs to the technical field of computer information processing, and particularly relates to a customer classification method, a customer classification system and electronic equipment.

Background

Retail customer classification refers to dividing customers into different customer groups by applying analytical techniques to factors such as customer attributes, purchase demands and values, in order to evaluate the consumption behavior of different customers and thereby identify high-value customers or tailor products and services to each group according to the classification results. Traditional methods of classifying retail customers are mainly empirical segmentation and mathematical statistics. Empirical segmentation is highly subjective, and the quality of a statistical classification depends to a great extent on the classification criteria chosen, so both methods are one-sided.

In the data mining technology, clustering analysis is used as an unsupervised learning method, which can classify data sets and mine valuable information from feature data of research objects, so that objects in classes are similar as much as possible, and objects between different classes are different as much as possible. Clustering analysis is a powerful information processing method, and has been widely used in various subject research fields, such as: customer classification, market research, data analysis, pattern recognition, machine learning, and the like.

At present, cluster analysis methods mainly include partitioning methods, hierarchical methods, density-based methods, grid-based methods and model-based methods. The K-Means clustering algorithm, proposed by MacQueen in 1967, is a partition-based clustering method, but it has the following defects: (1) when the dissimilarity between data objects is calculated, the algorithm can only process numerical data and is not suitable for categorical data, and it does not balance the importance of different features; (2) the random selection of initial values may make the clustering result highly uncertain, to the point that no solution is found or the clustering result falls into a local minimum.

In summary, the methods for classifying retail customers in the prior art have many defects, which result in the defects of inaccurate analysis result, unobvious classification effect, and the like.

Disclosure of Invention

The embodiment of the invention provides a customer classification method, a customer classification system and electronic equipment, which are used for solving or partially solving the defects in the prior art.

In a first aspect, an embodiment of the present invention provides a client classification method, including:

Step S1: acquiring a mixed data sample set consisting of a plurality of mixed data samples, wherein each mixed data sample consists of M numerical variable samples and N categorical variable samples.

Step S2: performing feature extraction on the M numerical variable samples to obtain T comprehensive evaluation factors.

Step S3: obtaining the dissimilarity metric values of the T comprehensive evaluation factors and the N categorical variable samples.

Step S4: optimizing, based on a particle swarm optimization algorithm, the K-Means clustering algorithm that uses the dissimilarity metric values, to construct a customer classification model.

Step S5: inputting the customer consumption data set into the customer classification model to obtain a customer classification result.

Further, in step S2, the performing feature extraction on the mixed type data sample set includes:

respectively carrying out digital standardized calculation on the M numerical variable samples in each mixed data sample to obtain a standardized matrix; obtaining a correlation matrix of the standardized matrix; based on a principal component analysis method, obtaining an initial common factor according to a correlation matrix;

establishing a regression equation which takes the initial common factor as a dependent variable and takes the numerical variable sample as an independent variable based on a regression method, and obtaining a factor score corresponding to the initial common factor; wherein the factor score is a weighted sum.

Further, after obtaining the initial common factor, the method further includes: performing orthogonal factor rotation or oblique factor rotation by the variance maximization method to obtain an orthogonal or oblique factor load matrix, and performing linear combination processing on the initial common factor using that load matrix.

Further, performing the digital standardization calculation on the M numerical variable samples in each mixed data sample includes:

$$l_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}, \quad i = 1, 2, \ldots, M; \; j = 1, 2, \ldots, P$$

wherein $x_{ij}$ is the numerical variable sample set; $\bar{x}_j$ is the mean value of the numerical variable samples; the denominator $s_j$ is the standard deviation of the numerical variable samples; M is the number of numerical variable samples; P is the number of observation indexes of each numerical variable sample; and $l_{ij}$ is the result of the digital standardization calculation on the j-th dimension of the i-th numerical variable sample.

Further, the obtaining of the correlation matrix of the normalized matrix includes:

wherein R is the correlation matrix, L is the standardized matrix, and L' is the transpose of L; l_M is the result of the digital standardization calculation for the M-th numerical variable sample.

Further, before step S2, the method further includes: and carrying out missing value processing and abnormal value processing on the mixed type data sample set.

Further, step S3 includes: dividing the N categorical variable samples into N₁ ordered categorical variable samples and N₂ nominal categorical variable samples;

respectively obtaining the dissimilarity metric values of the T comprehensive evaluation factors, the N₁ ordered categorical variable samples and the N₂ nominal categorical variable samples.

Further, respectively obtaining the dissimilarity metric values of the T comprehensive evaluation factors, the N₁ ordered categorical variable samples and the N₂ nominal categorical variable samples comprises:

performing numerical attribute normalization on the N₁ ordered categorical variable samples, converting each ordered categorical variable sample into an ordered numerical variable sample, and calculating the dissimilarity metric values of the N₁ ordered numerical variable samples and the T comprehensive evaluation factors using the following formula:

wherein d_m(x_i, x_j) is the dissimilarity metric value between variable sample x_i and variable sample x_j, M is the number of numerical attributes of each ordered numerical variable sample, x_im denotes the m-th dimension value of the i-th data, and x_jm denotes the m-th dimension value of the j-th data; calculating the dissimilarity metric values of the N₂ nominal categorical variable samples using the symmetric Hamming formula:

wherein n is the number of attributes of each nominal categorical variable sample, x_ip and x_jp respectively represent the categorical values of the p-th dimension attribute of the data objects, and δ(x_ip, x_jp) is the matching coefficient.

Further, if the nominal type classification variable sample is a binary nominal type classification variable sample, then:

Further, before step S3, the method further includes: determining the weight of each dissimilarity metric value based on an entropy weight method, which includes: normalizing each comprehensive evaluation factor and each ordered categorical variable sample using the dispersion (min-max) standardization method, mapping them to dimensionless values in the interval (0, 1); calculating the proportion (specific gravity) of each dimensionless value; acquiring the proportion of each nominal categorical variable sample; calculating the information entropy of each proportion; and determining the weight of each comprehensive evaluation factor and each categorical variable sample according to each information entropy.

Further, calculating the proportion (specific gravity) of each dimensionless value includes: obtaining the contribution degree of each dimensionless value using the following formula:

wherein p_ij is the contribution degree of the i-th data object x_i under the j-th attribute; and determining the proportion of each comprehensive evaluation factor based on the contribution degree.

Further, obtaining the proportion (specific gravity) of each nominal categorical variable sample includes: acquiring the number of occurrences of each nominal categorical variable sample and the total number of samples, and obtaining the proportion of each nominal categorical variable sample according to the following formula:

wherein r_ij represents the number of occurrences of the nominal categorical variable sample x_ij, and a is the total number of samples.

Further, the determining the weight of each comprehensive evaluation factor and each classification type variable sample according to each information entropy includes:

wherein w_i is the weight of the i-th information entropy, and m is the total number of information entropies.
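The entropy weight steps described above can be sketched as follows. The formulas themselves are not reproduced in this text, so the sketch assumes the commonly used entropy-weight definitions (stated in the docstring); the function name `entropy_weights` and the sample values are hypothetical.

```python
import math

def entropy_weights(columns):
    """columns[j] holds the non-negative, normalized values of attribute j.

    Assumed (textbook) entropy-weight formulas, not taken from the source:
      p_ij = x_ij / sum_i x_ij
      e_j  = -(1 / ln n) * sum_i p_ij * ln p_ij   (0 * ln 0 treated as 0)
      w_j  = (1 - e_j) / sum_k (1 - e_k)
    """
    n = len(columns[0])
    entropies = []
    for col in columns:
        total = sum(col)
        props = [v / total for v in col]
        e = -sum(p * math.log(p) for p in props if p > 0) / math.log(n)
        entropies.append(e)
    redundancy = [1.0 - e for e in entropies]
    s = sum(redundancy)  # zero only if every attribute is perfectly uniform
    return [r / s for r in redundancy]

# Hypothetical normalized attributes for 4 customers.
uniform = [0.5, 0.5, 0.5, 0.5]   # identical for all customers: no information
spread = [0.1, 0.2, 0.3, 0.4]    # differs between customers
weights = entropy_weights([uniform, spread])
```

Under these assumptions an attribute whose values are identical for every customer has maximal entropy and receives essentially zero weight, which is the intended effect of weighting by information content.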

Further, step S4 includes: based on the particle swarm optimization algorithm, updating the individual best, the global best, and the velocity and position of each particle according to the dissimilarity metric values, and performing out-of-bounds handling, until the optimal k cluster center points are obtained; and taking the k cluster center points as the initial cluster centers of the K-Means clustering algorithm to construct the customer classification model.

In a second aspect, an embodiment of the present invention provides a customer classification system, including a data acquisition device, a first data processing unit, a second data processing unit, a third data processing unit and a customer classification unit, wherein:

the data acquisition device is used for acquiring a mixed data sample set consisting of a plurality of mixed data samples, wherein each mixed data sample consists of M numerical variable samples and N categorical variable samples;

the first data processing unit is used for performing feature extraction on the M numerical variable samples to obtain T comprehensive evaluation factors;

the second data processing unit is used for obtaining the dissimilarity metric values of the T comprehensive evaluation factors and the N categorical variable samples;

the third data processing unit is used for optimizing, based on the particle swarm optimization algorithm, the K-Means clustering algorithm that uses the dissimilarity metric values, and constructing a customer classification model;

and the customer classification unit is used for inputting the customer consumption data set into the customer classification model to obtain a customer classification result.

In a third aspect, an embodiment of the present invention provides an electronic device, including but not limited to a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the client classification method when executing the program.

In a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the above-mentioned client classification method.

The customer classification method, the customer classification system and the electronic equipment provided by the embodiment of the invention combine a K-Means clustering algorithm and an improved particle swarm optimization algorithm, simultaneously consider various indexes influencing customer consumption, adopt the improved clustering algorithm to perform accurate clustering analysis on customer consumption data to obtain customer groups with different characteristics, and ensure that the clustering result analysis is more reasonable and clear, and more convenient to adopt different operation strategies and customer service strategies on different groups, thereby ensuring the leading position of retail market enterprises.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

Fig. 1 is a schematic flow chart of a customer classification method according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of a model for obtaining a comprehensive evaluation factor according to an embodiment of the present invention;

Fig. 3 is a schematic flowchart of another customer classification method according to an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of a customer classification system according to an embodiment of the present invention;

Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

With the progress of science and technology and the development of the Internet of Things, accurately determining how customers are classified and promptly formulating targeted product promotion and sales strategies are crucial to commercial success. This is particularly true for fresh products with demanding preservation requirements, such as fresh fruit and seafood: sales channels can be quickly broadened and sales efficiency reasonably improved according to the customer classification, which shortens the storage time of the products and indirectly improves economic benefit.

For convenience of description, all embodiments of the present invention take the classification of retail customers of fresh grapes as an example, but this is not to be regarded as limiting the protection scope of the embodiments of the present invention.

In classifying retail customers of fresh grapes, the most advanced prior-art approach applies cluster analysis through data mining technology. Cluster analysis, as an unsupervised learning method, can classify data sets and mine valuable information from the feature data of the research objects, so that objects within a class are as similar as possible and objects in different classes are as different as possible. The K-Means clustering algorithm is a classic partition-based clustering method and is particularly effective. It is easy to understand, simple to implement, converges quickly and can handle large data sets effectively; it has become the best-known and most commonly used clustering algorithm in the field of data mining and has already been applied to retail customer classification research. However, as the Background section indicates, the algorithm has several defects in the field of data mining, so the analysis results often fail to reach the desired precision and the classification results are inaccurate.

In view of the above, in order to effectively solve or partially solve the defects in the data mining in the prior art, an embodiment of the present invention provides a method for classifying customers, as shown in fig. 1, including, but not limited to, the following steps:

The invention uses a data set of fresh grape consumer behavior and preferences, and selects a subset of it as the customer classification variables for dividing customer groups. The sub-data set contains consumption value data collected as rating-scale questions, which are treated as numerical variables (though not strictly numerical), and demographic data collected as multiple-choice questions, which are treated as categorical variables.

Table 1 describes part of the feature data on the value placed on fresh grape purchasing behavior and on basic customer information. The data in the table can be gathered from online questionnaires, in-person questionnaires, feedback from individual sales outlets and the like. All the survey data obtained are tallied, and the factors that influence actual purchases of fresh grapes among different consumer groups and age levels, together with the strength of each factor's influence on completing the purchase, are considered comprehensively. It should be noted that Table 1 merely lists one kind of feature data provided by the embodiment of the present invention and is not to be considered a limitation on the protection scope of the embodiment of the present invention.

As shown in Table 1, the features comprise: consumer value view data, namely the items with feature names a-n, which are mainly used as numerical variables; and basic customer information data, which are mainly used as categorical variables.

TABLE 1

Table 2 lists the raw data of each numerical variable sample and each categorical variable sample in Table 1:

TABLE 2

As shown in Table 2, a total of 3230 mixed data samples are provided (only part of the content is shown in the table), wherein each mixed data sample comprises 14 numerical variable samples (the values corresponding to a-n) and 7 categorical variable samples (the values corresponding to sex, age, education level, occupation type, household size, average monthly household income and city tier).

Correlation analysis of the numerical variables shows obvious correlation among several groups of variables, which causes information overlap; some variables may contribute little while adding to the computational effort. To avoid the loss of clustering efficiency caused by overlapping index information among samples, the relatively large number of original numerical input variables (all of which are required to be continuous, since ordinal and categorical variables cannot be subjected to factor analysis) are processed by dimensionality reduction and transformed into a smaller set of mutually independent new input variables (the comprehensive evaluation factors).

As shown in fig. 2, by performing feature extraction on these variables, for example: the 3 different variables "package delicacy", "hygienic condition of purchase environment" and "presence or absence of package" can be unified into a comprehensive evaluation factor "package hygienic condition". By the method for extracting the features, the calculation workload can be effectively reduced, and the clustering algorithm efficiency is improved.

Further, the retail fresh grape purchase data have mixed-type characteristics (the questionnaire contains numerical data, ordered categorical data and nominal categorical data). Because the K-Means clustering algorithm can only process numerical data, the embodiment of the invention improves it by providing a method for measuring the dissimilarity of mixed attributes.

Firstly, all questionnaire data are processed, and 3230 fresh grape consumption samples of customers are finally obtained (namely, a mixed data sample set consisting of 3230 mixed data samples is obtained). Further, in the embodiment of the present invention, two characteristics of the customer are studied: consumer value view (numerical data) and basic customer information (ordered and unordered categorical data). As can be seen from Table 3, the mixed data contain attributes of two different types, categorical and numerical: there are 4 numerical attributes (namely the 4 comprehensive evaluation factors) and 7 categorical attributes, so each data object has 11 attributes in total.

Numerical variables, also called quantitative or continuous attribute variables, are scalars, i.e. numbers with no directional meaning, and are also called scale variables. Unlike scalar attributes, ordered categorical variables have a specific order among their attribute values: the class of a feature is determined according to some rule, and its values reflect an order relationship, so the relative order of the values matters while their actual magnitude does not. Applying the single distance formula commonly used in the ordinary K-Means clustering algorithm uniformly to all mixed data would clearly not fit these data relationships.

In embodiments of the present invention, the sample data objects comprise a mixture of numerical and categorical variable samples (where categorical variables can be further divided into ordered categorical and nominal categorical variables), and are therefore referred to as mixed data samples. One method of calculating the dissimilarity between objects described by mixed data is to group the attributes by type and treat each type differently. Following the analysis above, the numerical variables and the ordered categorical variables are both treated as scalars when calculating dissimilarity, and the nominal categorical variables are treated as a separate type. By obtaining the dissimilarity metric values of the comprehensive evaluation factors and of the categorical variable samples separately and then considering them together, the limitation that the K-Means clustering algorithm only handles numerical data is overcome, and the clustering effect is effectively improved.
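A minimal sketch of this grouping-by-type idea is given below. It assumes a Euclidean distance over the scalar part (comprehensive evaluation factors plus normalized ordered variables) and a mismatch count over the nominal part, combined by a weighted sum; the exact formulas and the way the two parts are combined are not reproduced in this text, so all function names, weights and sample values here are hypothetical.

```python
import math

def scalar_dissimilarity(a, b):
    """Euclidean distance over numeric factors and normalized ordered attributes (assumed form)."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def nominal_dissimilarity(a, b):
    """Hamming-style count: the matching coefficient is 0 on a match and 1 on a mismatch."""
    return sum(0 if u == v else 1 for u, v in zip(a, b))

def mixed_dissimilarity(x, y, w_scalar=1.0, w_nominal=1.0):
    """Weighted sum of both parts; x and y are (scalar_values, nominal_values) pairs."""
    return (w_scalar * scalar_dissimilarity(x[0], y[0])
            + w_nominal * nominal_dissimilarity(x[1], y[1]))

# Two hypothetical customers: three scalar attributes and two nominal attributes each.
customer_a = ([0.2, 0.9, 0.5], ["female", "tier-1"])
customer_b = ([0.2, 0.5, 0.8], ["male", "tier-1"])
d = mixed_dissimilarity(customer_a, customer_b)  # 0.5 (scalar part) + 1 (one nominal mismatch)
```

The weights `w_scalar` and `w_nominal` are placeholders for whatever weighting scheme is applied to the two parts, for example weights derived by an entropy weight method.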

TABLE 3

When the K-Means clustering algorithm is used, the initial cluster centers are determined first and are then optimized iteratively until they no longer change. The algorithm converges quickly and has strong local search capability, but because the cluster centers are initialized randomly, their initial positions are highly random, which makes the clustering result fluctuate and prone to falling into a local minimum. To overcome these defects, the embodiment of the invention uses the particle swarm optimization algorithm to optimize the K-Means clustering algorithm based on the dissimilarity metric values: the strong global search capability and fast convergence of particle swarm optimization are used to optimize the initial cluster centers of K-Means, eliminating the dependence of K-Means on the initial cluster centers and improving the accuracy and convergence of the clustering result.
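The two-stage idea (particle swarm search for initial centers, then K-Means refinement) can be sketched as follows. This is a plain textbook particle swarm, not the improved variant of the embodiment, whose details are not given here; Euclidean distance stands in for the mixed dissimilarity measure, and the inertia weight `w`, acceleration constants `c1` and `c2`, swarm size and all function names are hypothetical choices.

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def cost(centers, data):
    """Sum of each sample's dissimilarity to its nearest center."""
    return sum(min(dist(x, c) for c in centers) for x in data)

def pso_initial_centers(data, k, n_particles=15, n_iter=40, w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    lo = [min(x[d] for x in data) for d in range(dim)]
    hi = [max(x[d] for x in data) for d in range(dim)]
    # Each particle encodes k candidate centers, initialized from random samples.
    pos = [[list(rng.choice(data)) for _ in range(k)] for _ in range(n_particles)]
    vel = [[[0.0] * dim for _ in range(k)] for _ in range(n_particles)]
    pbest = [[c[:] for c in p] for p in pos]
    pbest_cost = [cost(p, data) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_cost[i])
    gbest, gbest_cost = [c[:] for c in pbest[g]], pbest_cost[g]
    for _ in range(n_iter):
        for i in range(n_particles):
            for j in range(k):
                for d in range(dim):
                    r1, r2 = rng.random(), rng.random()
                    vel[i][j][d] = (w * vel[i][j][d]
                                    + c1 * r1 * (pbest[i][j][d] - pos[i][j][d])
                                    + c2 * r2 * (gbest[j][d] - pos[i][j][d]))
                    # Out-of-bounds handling: clamp positions to the data range.
                    pos[i][j][d] = min(hi[d], max(lo[d], pos[i][j][d] + vel[i][j][d]))
            c = cost(pos[i], data)
            if c < pbest_cost[i]:
                pbest[i], pbest_cost[i] = [p[:] for p in pos[i]], c
                if c < gbest_cost:
                    gbest, gbest_cost = [p[:] for p in pos[i]], c
    return gbest

def assign(data, centers):
    return [min(range(len(centers)), key=lambda j: dist(x, centers[j])) for x in data]

def k_means(data, centers, n_iter=50):
    for _ in range(n_iter):
        labels = assign(data, centers)
        new_centers = []
        for j, old in enumerate(centers):
            members = [x for x, lab in zip(data, labels) if lab == j]
            # Keep the old center if a cluster happens to be empty.
            new_centers.append([sum(col) / len(members) for col in zip(*members)]
                               if members else list(old))
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(data, centers)

# Hypothetical two-group data set.
data = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
init = pso_initial_centers(data, k=2)
centers, labels = k_means(data, init)
```

Because the swarm keeps the best set of centers seen by any particle, K-Means starts from a point already selected by a global search rather than from a single random draw.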

The customer classification method provided by the embodiment of the invention combines the K-Means clustering algorithm and the improved particle swarm optimization algorithm, simultaneously considers various indexes influencing customer consumption, adopts the improved clustering algorithm to perform accurate clustering analysis on customer consumption data to obtain customer groups with different characteristics, and is more reasonable and clear in clustering result analysis, and more convenient for adopting different operation strategies and customer service strategies for different groups, thereby ensuring the leading position of retail market enterprises.

Based on the content of the foregoing embodiment, as an alternative embodiment, in step S2, the performing feature extraction on the hybrid data sample set includes, but is not limited to, the following steps:

S21: respectively performing the digital standardization calculation on the M numerical variable samples in each mixed data sample to obtain a standardized matrix.

S22: obtaining a correlation matrix of the standardized matrix.

S23: obtaining an initial common factor according to the correlation matrix based on a principal component analysis method.

S24: establishing, based on a regression method, a regression equation which takes the initial common factor as the dependent variable and the numerical variable samples as the independent variables, and obtaining a factor score corresponding to the initial common factor; wherein the factor score is a weighted sum.

According to the embodiment of the invention, M numerical variable samples in each mixed data sample are respectively converted into numerical factor scores by using a factor analysis method to research the internal dependency relationship among a plurality of variables, and a small number of T non-observable variables called comprehensive evaluation factors are used for representing a basic data structure, so that the comprehensive indexes can be objectively and effectively determined, and the obtained indexes have less information cross and strong comparability.

After the M numerical variable samples are subjected to the digital standardization calculation, the resulting original variables are denoted $Z_1, Z_2, \ldots, Z_M$ (M is the number of original attributes), and each can be represented as a linear combination of the T extracted factors $f_1, f_2, \ldots, f_T$ ($T < M$):

$$Z_i = a_{i1} f_1 + a_{i2} f_2 + \cdots + a_{iT} f_T + \varepsilon_i$$

wherein $i = 1, 2, \ldots, M$ and $j = 1, 2, \ldots, T$. The above formula can be expressed in matrix form as $Z = AF + \varepsilon$. Z is the observable M-dimensional variable vector, each component of which represents an index or variable; F is the common factor vector, each component of which represents a factor; the factors are mutually independent, non-observable theoretical variables, and the specific meaning of each common factor must be determined in light of the practical research; the matrix A is the factor load matrix, and $a_{ij}$ is the load of the i-th original variable on the j-th factor; ε is the non-observable special factor, representing the part of the original variable that cannot be explained by the common factors, and its mean value is 0.

Further, in an embodiment of the present invention, performing the digital standardization calculation on the M numerical variable samples in each mixed data sample includes:

$$l_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}, \quad i = 1, 2, \ldots, M; \; j = 1, 2, \ldots, P$$

wherein $x_{ij}$ is the numerical variable sample set; $\bar{x}_j$ is the mean value of the numerical variable samples; the denominator $s_j$ is the standard deviation of the numerical variable samples; M is the number of numerical variable samples; P is the number of observation indexes of each numerical variable sample; and $l_{ij}$ is the result of the digital standardization calculation on the j-th dimension of the i-th numerical variable sample.

Further, in step S22, the normalized matrix obtained in the previous step is transformed to obtain the correlation between the samples, and the correlation matrix is obtained, and the transformation method may be:

L＝[l₁,l₂,l₃…l_M]
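As a rough illustration of steps S21 and S22, the sketch below standardizes each column of a small sample matrix and then forms a correlation matrix. It assumes the sample standard deviation (divisor M - 1) and the normalization R = L'L / (M - 1); neither formula is reproduced in this text, so both are assumptions, and the function names `standardize` and `correlation_matrix` are hypothetical.

```python
import math

def standardize(samples):
    """Column-wise z-score: l_ij = (x_ij - mean_j) / std_j (sample std, an assumption)."""
    m = len(samples)
    p = len(samples[0])
    means = [sum(row[j] for row in samples) / m for j in range(p)]
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in samples) / (m - 1))
        for j in range(p)
    ]
    return [[(row[j] - means[j]) / stds[j] for j in range(p)] for row in samples]

def correlation_matrix(standardized):
    """R = L'L / (M - 1) for a standardized M x P matrix L (assumed normalization)."""
    m = len(standardized)
    p = len(standardized[0])
    return [
        [sum(row[a] * row[b] for row in standardized) / (m - 1) for b in range(p)]
        for a in range(p)
    ]

# Hypothetical ratings: 4 customers x 3 scale items.
X = [[1.0, 2.0, 5.0], [2.0, 4.0, 4.0], [3.0, 6.0, 2.0], [4.0, 8.0, 1.0]]
L = standardize(X)
R = correlation_matrix(L)
```

In this toy data the second item is exactly twice the first, so their correlation is 1, while the third item falls as the first rises and is strongly negatively correlated with it. Highly correlated items like these are what the later factor extraction merges into a single comprehensive evaluation factor.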

Further, the method for obtaining the initial common factor based on the principal component analysis method described in the above step S23 according to the correlation matrix includes, but is not limited to, the following steps:

given M variables, the principal components are solved from the correlation matrix to obtain T principal components, which are arranged in descending order and denoted Y₁, Y₂, ..., Y_T. Let

The following relationship then holds:

from the above formula, the load matrix A and a group of initial common factors F_i can be calculated, wherein:

in the load matrix A, γ₁≥γ₂≥…≥γ_p are the characteristic roots (eigenvalues) of the correlation matrix R described above, each with its corresponding orthonormalized eigenvector, and m<p.
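The extraction of a load matrix from the correlation matrix can be illustrated with the principal component method commonly used in factor analysis, in which the load of variable i on factor j is the square root of the j-th eigenvalue times the i-th component of the j-th unit eigenvector. That formula is an assumption here, since the original expressions are not reproduced in this text. The sketch uses power iteration with deflation so that it needs no external library; the function names and the example matrix are hypothetical.

```python
import math

def mat_vec(a, v):
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]

def top_eigenpairs(r, t, n_iter=500):
    """Return the t largest eigenvalues and unit eigenvectors of a symmetric
    positive semi-definite matrix, using power iteration with deflation."""
    n = len(r)
    a = [row[:] for row in r]
    pairs = []
    for _ in range(t):
        v = [1.0 / (i + 1) for i in range(n)]  # arbitrary non-zero start vector
        for _ in range(n_iter):
            w = mat_vec(a, v)
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(x * y for x, y in zip(v, mat_vec(a, v)))  # Rayleigh quotient
        pairs.append((lam, v))
        # Deflate: subtract the component just found before searching for the next.
        a = [[a[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return pairs

def factor_loadings(r, t):
    """Assumed load formula: a_ij = sqrt(eigenvalue_j) * eigenvector_j[i]."""
    pairs = top_eigenpairs(r, t)
    return [[math.sqrt(lam) * vec[i] for lam, vec in pairs] for i in range(len(r))]

# Hypothetical correlation matrix of two variables with correlation 0.5.
R = [[1.0, 0.5], [0.5, 1.0]]
eigenvalue, _ = top_eigenpairs(R, 1)[0]
A = factor_loadings(R, 1)
```

For this matrix the largest eigenvalue is 1.5 with eigenvector (1, 1)/√2, so both variables load equally on the single extracted factor. In practice a library eigen-solver would normally replace the hand-written power iteration.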

Further, in the above step S24, the method for calculating the factor score corresponding to the initial common factor includes, but is not limited to, the following steps:

establishing a regression equation with an initial common factor as a dependent variable and an original variable as an independent variable by adopting a regression method, wherein the factor score can be regarded as the weighted sum of all variables, the importance degree of the variables to the factors is represented by the weight coefficient, and the regression equation can be as follows:

F_j＝w_j1X₁+w_j2X₂+...+w_jMX_M,j＝1,2,...,T
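The weighted-sum form of the factor score can be illustrated directly. In the sketch below the score coefficients are made-up numbers chosen only to show the arithmetic; in the method they would come from the regression step, and the function name `factor_scores` is hypothetical.

```python
def factor_scores(weights, x):
    """F_j = w_j1*X_1 + w_j2*X_2 + ... + w_jM*X_M for each factor j."""
    return [sum(w_jk * x_k for w_jk, x_k in zip(w_row, x)) for w_row in weights]

# Hypothetical score coefficients for T = 2 factors over M = 3 standardized variables.
W = [[0.5, 0.5, 0.0],
     [0.0, 0.25, 0.75]]
x = [1.0, -1.0, 2.0]          # one customer's standardized values
scores = factor_scores(W, x)  # [0.0, 1.25]
```

Each row of `W` plays the role of one regression equation, so a customer's M standardized values are reduced to T factor scores.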

Further, based on the results of the above factor analysis, the 14 consumer value features can be reduced to 4 comprehensive evaluation indexes (comprehensive evaluation factors), namely a quality characteristic factor, a quality safety factor, a package hygiene factor and an external additional condition factor. Factor analysis requires that the factors finally obtained be independent of one another and uncorrelated.

Based on the content of the foregoing embodiment, as an alternative embodiment, after obtaining the initial common factor, the method further includes: and performing orthogonal factor rotation or skew factor rotation by adopting a variance maximization method to obtain an orthogonal or skew factor load matrix, and performing linear combination processing on the initial common factor by using the orthogonal or skew factor load matrix.

The linear combination of the correlation matrices in step S22 is performed to find a common factor with a more obvious practical significance, and the conversion method may be:

wherein F₁, F₂, …, F_M are the initial common factors, and F'₁, F'₂, …, F'_M are the new common factors after linear transformation.

Based on the content of the foregoing embodiment, as an alternative embodiment, before step S2, the method further includes: and carrying out missing value processing and abnormal value processing on the acquired mixed type data sample set.

According to the client classification method provided by the embodiment of the invention, the samples can be preprocessed before the obtained data set is clustered by using a related algorithm, because incomplete data or data without actual value for clustering may exist in original sample data. Through missing value processing and abnormal value processing, not only can the clustering effect be improved, but also the running time required by the algorithm can be reduced, and therefore the performance of the particle swarm clustering algorithm is improved.

The missing value processing may be as follows: based on observation of the sample data, samples that lack multiple feature attributes (more than 2) are first deleted, and the remaining missing values are then filled by interpolation, using the mode of the attribute values.
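The missing value processing described above can be sketched as follows. This is only an illustration under stated assumptions: missing values are marked with `None`, samples are plain lists, and the names `preprocess_missing` and `max_missing` are hypothetical.

```python
from collections import Counter

MISSING = None  # assumed marker for a missing attribute value


def preprocess_missing(samples, max_missing=2):
    """Drop samples with more than `max_missing` missing attributes,
    then fill the remaining gaps with the mode of each attribute."""
    kept = [list(s) for s in samples
            if sum(v is MISSING for v in s) <= max_missing]
    if not kept:
        return kept
    for j in range(len(kept[0])):
        observed = [s[j] for s in kept if s[j] is not MISSING]
        if not observed:
            continue  # no observed value to impute from
        mode = Counter(observed).most_common(1)[0][0]
        for s in kept:
            if s[j] is MISSING:
                s[j] = mode
    return kept


rows = [
    [1, 2, None, 4],
    [1, None, None, None],   # three gaps: this sample is deleted
    [1, 3, 5, 4],
    [2, 3, 5, None],
]
cleaned = preprocess_missing(rows)
```

The mode is computed after the deletion step, so heavily incomplete samples do not influence the values used for filling.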

The outlier processing may be as follows: in the embodiment of the invention, a consistency check is first performed on the data, and whether abnormal-value records are deleted is decided by setting a deletion threshold. When the proportion of abnormal-value records in the mixed data sample set is smaller than the deletion threshold, no operation is performed on them; when the proportion is larger than the deletion threshold, the final clustering result may be adversely affected, and the records may be deleted.

Based on the content of the foregoing embodiment, as an alternative embodiment, the foregoing step S3 includes, but is not limited to, the following steps: dividing the N categorical variable samples into N₁ ordered categorical variable samples and N₂ nominal categorical variable samples;

The categorical variable samples can be further divided into ordered categorical variable samples and nominal categorical variable samples, because categorical variables are distinguished according to their attributes. For example, as shown in table 1, the attribute values of an ordered categorical variable have a specific order: the rank of the feature is determined according to a certain rule, the value reflects an order relationship, the relative order of the values matters, and their actual magnitude does not. Examples are age (17-25, 26-35, 36-45, 46-55 and 56-75 years), education level (bachelor's degree and above, junior college, high school, junior middle school, primary school and below), household size (1, 2, 3, 4, 5, 6, 7, 8, 9), average monthly household income (2000 yuan or below, 2001-3000 yuan, 3001-5000 yuan, 5001-7000 yuan, 7001-10000 yuan, 10001-15000 yuan, above 15000 yuan), and city tier (first-tier, second-tier, third-tier, fourth-tier, fifth-tier).

The nominal categorical variables (such as gender and quality judgment variables) are non-numerical and are digitized for convenience of analysis; the numeric codes of these features carry neither quantitative meaning nor an order relationship, and merely represent different states.

In the embodiment of the invention, not only all mixed data samples are divided into numerical variable samples and classified variable samples, but also the classified variable samples are further divided into ordered classified variable samples and nominal classified variable samples according to different attributes of the samples, and dissimilarity measurement values are calculated by adopting different methods respectively according to the characteristics of different samples, so that the clustering accuracy is further improved.

Further, in the embodiment of the present invention, the dissimilarity metric values of the above T comprehensive evaluation factors, N₁ ordered categorical variable samples and N₂ nominal categorical variable samples are obtained respectively by methods including, but not limited to, the following:

wherein d_m(x_i, x_j) is the dissimilarity metric value between variable sample x_i and variable sample x_j, and each variable sample has m numerical attributes;

calculating the dissimilarity metric values of the N₂ nominal categorical variable samples using the symmetric Hamming formula:

Specifically, the dissimilarity metric value for numerical variable samples is calculated as follows. Numerical variables, also called quantitative or continuous attribute variables (scale variables), are scalars, i.e., numbers with no directional meaning. The Euclidean distance is the distance between two data objects in Euclidean space, and it is widely used to measure the dissimilarity of two scalar elements because it is intuitive and easy to interpret. Therefore, in the embodiment of the present invention, the Euclidean distance is adopted as the sample dissimilarity for numerical attributes. If a mixed data set contains M samples and each data object has n numerical attributes, the dissimilarity measure between data objects x_i and x_j (1 ≤ i, j ≤ M) is expressed as:

$$d(x_i, x_j) = \sqrt{\sum_{p=1}^{n} (x_{ip} - x_{jp})^2}$$

further, the dissimilarity metric value of the ordered variable samples is calculated as follows: for ordered variable samples, each value is generally assigned a number, called the rank of the value. For example, if a data attribute has M_f states, the ordered states define the sequence 1, 2, …, M_f. In the embodiment of the invention, the dissimilarity is calculated by converting the non-numerical ordered categorical attribute into a numerical attribute (its rank) and then using the rank-processed value in place of the original value as a scalar attribute. The data conversion formula may be (k-1)/(M_f-1), where k is the rank value, k ∈ {1, 2, …, M_f}, and M_f is the total number of attribute values, so that each ordered variable sample is eventually mapped to [0, 1], thereby achieving normalization. Finally, the normalized ordered variable samples are processed according to the dissimilarity metric value calculation method for numerical variable samples.
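The rank conversion (k-1)/(M_f-1) followed by the Euclidean distance can be sketched as below. The function names are illustrative, and the two example customers are hypothetical; only the state counts (5 age bands, 5 city tiers) are taken from the examples in the text.

```python
import math


def normalize_rank(k, m_f):
    """Map rank k in {1, ..., m_f} to [0, 1] via (k - 1) / (m_f - 1)."""
    if m_f < 2 or not 1 <= k <= m_f:
        raise ValueError("rank must satisfy 1 <= k <= m_f with m_f >= 2")
    return (k - 1) / (m_f - 1)


def euclidean(a, b):
    """Euclidean distance between two equally long numeric vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


# Each customer: [age band rank, city tier rank], both with 5 ordered states.
customer_a = [normalize_rank(1, 5), normalize_rank(5, 5)]
customer_b = [normalize_rank(3, 5), normalize_rank(5, 5)]
d = euclidean(customer_a, customer_b)
```

After the conversion every ordered attribute lies in [0, 1], so attributes with many states do not dominate those with few.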

Further, nominal categorical variables are themselves non-numerical; they are digitized for ease of analysis, and the numeric codes of these features carry neither quantitative meaning nor an order relationship, serving only to represent different states. The distance formulas for scalar variables therefore cannot be applied directly to nominal categorical variables. A nominal categorical variable may have two or more state values, such as gender (male, female) or occupation type (Party, government and public institution staff; company/enterprise staff; freelancers; farmers; education and research institution staff; students; unemployed/retired; others), and accordingly nominal variables may be divided into binary variables and multi-class variables.

The nominal binary variables include only the gender variable, which is a symmetric categorical variable, so the dissimilarity of this attribute is measured with the symmetric Hamming formula (in information coding, the number of bit positions in which two legal code words differ is called the code distance, also known as the Hamming distance). The dissimilarity can be calculated using the simple matching coefficient and is defined as follows:

in the formula, n is the number of categorical attributes; x_ip and x_jp respectively represent the values of data objects x_i and x_j on the p-th attribute. The more attributes on which x_ip and x_jp fail to match, the greater the dissimilarity between the two data objects.

Wherein the matching coefficient

Nominal multi-class variables are assigned values in the same way as binary variables.

The nominal multi-class variable is a generalization of the binary variable; its values (occupation type, for example) carry no numerical or ordinal meaning, and the numeric codes themselves have no significance beyond labeling the categories.
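A minimal sketch of a simple-matching dissimilarity for nominal attributes follows. It assumes the measure is the count of attributes on which two samples differ (no division by the number of attributes); the function name is hypothetical.

```python
def simple_matching_dissimilarity(a, b):
    """Count the nominal attributes on which two samples take different values."""
    if len(a) != len(b):
        raise ValueError("samples must have the same number of attributes")
    return sum(1 for p, q in zip(a, b) if p != q)


# Nominal attributes: (gender, occupation type); values are illustrative.
s1 = ("female", "student")
s2 = ("female", "farmer")
d_nominal = simple_matching_dissimilarity(s1, s2)
```

Because only equality is tested, the same function covers binary and multi-class nominal variables, and the numeric codes chosen for the categories do not affect the result.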

In summary, for any mixed data, it is assumed that the numerical attributes and the nominal categorical attributes are of equal importance. Suppose the sample data have g attributes, comprising g₁ numerical attribute variables (numerical variables and ordered categorical variables) and g₂ categorical attribute variables (nominal categorical variables), so that g = g₁ + g₂. If the dissimilarity of the numerical attribute data is measured with the Euclidean distance formula and the dissimilarity of the categorical attribute data is measured with the simple matching formula above, the dissimilarity measure of the mixed attribute data can be expressed as:
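One way to combine the two parts is sketched below. It is an assumption, not necessarily the exact formula of the embodiment: the mixed dissimilarity is taken as the Euclidean distance over the first g₁ (numeric, already scaled) attributes plus the mismatch count over the remaining g₂ nominal attributes. The name `mixed_dissimilarity` and the sample values are hypothetical.

```python
import math


def mixed_dissimilarity(a, b, g1):
    """Dissimilarity of two mixed samples.

    The first g1 attributes are numeric (assumed scaled to [0, 1]);
    the remaining attributes are nominal.
    """
    numeric = math.sqrt(sum((p - q) ** 2 for p, q in zip(a[:g1], b[:g1])))
    nominal = sum(1 for p, q in zip(a[g1:], b[g1:]) if p != q)
    return numeric + nominal


a = [0.0, 1.0, "male", "student"]
b = [0.6, 0.2, "male", "farmer"]
d_mixed = mixed_dissimilarity(a, b, g1=2)
```

Both parts are unweighted here, which corresponds to the equal-importance assumption stated above.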

based on the content of the foregoing embodiment, as an alternative embodiment, before the foregoing step S3, the method further includes: determining the weight of each dissimilarity metric value based on an entropy weight method, which comprises: normalizing the comprehensive evaluation factors and the ordered categorical variable samples by a dispersion (min-max) standardization method, so that each comprehensive evaluation factor and each ordered categorical variable sample is mapped to a dimensionless value in the range [0, 1]; calculating the specific gravity (proportion) of each dimensionless value; acquiring the specific gravity of each nominal categorical variable sample; calculating the information entropy of each specific gravity; and determining the weight of each comprehensive evaluation factor and each categorical variable sample according to the information entropies.

In the above formula for the dissimilarity of mixed data variables, it is assumed that every attribute variable is equally important (i.e., has the same weight). In actual data mining, however, the importance of each attribute is reflected in the measure of dissimilarity between objects in a data set, and if the importance of the attribute features is not considered when calculating the sample distance, the final clustering result is inaccurate. Appropriate weights are therefore assigned to the attributes, so that the dissimilarity among data objects is measured better and the clustering quality is improved.

It is therefore necessary to determine the weight according to the degree of importance of each evaluation index. In the embodiment of the invention, the characteristic weight of the mixed type data variable is balanced by adopting a weighting method (entropy weight method) based on the variable information entropy, and the defects of randomness, instability and the like of the mixed type variable clustering result can be effectively solved. And the weight coefficient of each attribute can be automatically determined, the subjective randomness of weight determination is avoided, and a basis is provided for multi-index comprehensive evaluation. The entropy weight method provided by the embodiment of the invention is an objective weighting algorithm, and the entropy weight of each index is calculated by using the information entropy according to the attribute difference degree of each index, and the weight of each index is corrected by the entropy weight, so that objective index weight is obtained. Compared with the subjective weighting method in the prior art, the method has higher precision and higher objectivity, can better explain the obtained weight, and can be used for any process needing to determine the weight.

The processing procedure for calculating the data characteristic weight based on the entropy weight method provided by the embodiment of the invention can be realized by the following steps:

(1) data normalization

Because the measurement standards and meanings of the numerical attributes (including numerical variables and ordered categorical variables) in the mixed data sample set are inconsistent, their value ranges differ greatly. If the Euclidean distance were used directly to calculate the dissimilarity, attributes with large values would dominate the dissimilarity among samples, so the attributes need to be standardized. According to the customer classification method provided by the embodiment of the invention, the original variable data are linearly transformed by a dispersion (min-max) standardization method (only numerical variables and ordered categorical variables are normalized; nominal categorical variables are not processed), so that the results are mapped to the interval [0, 1] and converted into dimensionless pure values, which makes indexes of different units or orders of magnitude convenient to compare and weight.

If the m (for example, m = 9, not including nominal categorical variables) numerical attributes of the original data samples (comprising the comprehensive evaluation factors and the ordered categorical variable samples) are X₁, X₂, …, X_m, with X_i, i = 1, 2, …, n, where n represents the number of sample data variables, the formula for normalizing the original data sample X_m is as follows:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

wherein x is an original value, x' is the corresponding normalized value, and x_max and x_min are respectively the maximum and minimum values of the numerical attribute in the original data samples.

(2) Calculating a mixed attribute weight

Calculating the attribute proportion of the numerical attribute (now a dimensionless value) of each normalized original data sample in the j-th dimension, where p_ij represents the contribution of the i-th data object x_i under the j-th attribute (for example, i = 1, 2, …, 3230 and j = 1, 2, …, 9 may be taken in the above-described embodiment):

further, the specific gravity of each nominal categorical attribute is calculated as follows:

$$p_{ij} = \frac{r_{ij}}{a}$$

wherein r_ij is the number of occurrences of the nominal categorical variable sample x_ij, and a is the total number of samples.

(3) Calculating the information entropy of each index

Definition according to information entropy

wherein the constant K = 1/ln n, and it is specified that p_ij ln p_ij = 0 when p_ij = 0; i = 1, 2, …, m; j = 1, 2, …, n. From this, the information entropy calculation formula of each index can be derived as follows:

wherein p_ij is the specific gravity of the i-th data object x_i under the j-th attribute, and H_i is the information entropy of the i-th specific gravity.

The customer classification method provided by the embodiment of the invention is characterized in that the provided mixed type data sample set comprises a plurality of numerical variable samples and a plurality of nominal variable samples, the characteristic information entropy is calculated by separating the numerical variable samples and the nominal variable samples, and finally, the relative characteristic weight is calculated respectively. The analysis of the sample is more practical, and the final detection and analysis result is more real.

(4) Determining the weights of the indexes

Calculating the entropy weight of each index by using the entropy value, thereby determining the objective weight w_i of the index, where 0 ≤ w_i ≤ 1 and

$$\sum_{i} w_i = 1$$

the calculation formula is as follows:
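Steps (2) to (4) can be sketched as below. The sketch assumes the commonly used entropy weight form: p_ij = x_ij / Σ_i x_ij, H_j = -K Σ_i p_ij ln p_ij with K = 1/ln n (n being the number of data objects), and w_j = (1 - H_j) / Σ_j (1 - H_j). These expressions, the function name and the input layout are assumptions and may differ in detail from the embodiment.

```python
import math


def entropy_weights(columns):
    """Return one weight per attribute.

    columns[j] lists the non-negative values of attribute j over all
    data objects (for example min-max scaled values).
    """
    n = len(columns[0])                 # number of data objects
    k = 1.0 / math.log(n)               # constant K = 1 / ln n
    redundancy = []
    for col in columns:
        total = sum(col)
        if total <= 0:
            raise ValueError("each attribute needs a positive column sum")
        props = [v / total for v in col]                       # p_ij
        h = -k * sum(p * math.log(p) for p in props if p > 0)  # 0 ln 0 := 0
        redundancy.append(1.0 - h)
    s = sum(redundancy)
    if s <= 0:
        raise ValueError("all attributes are uniform; weights are undefined")
    return [r / s for r in redundancy]


# Attribute 0 is identical for every object (carries no information);
# attribute 1 varies, so it should receive almost all of the weight.
weights = entropy_weights([[1, 1, 1, 1], [1, 0, 0, 0]])
```

An attribute whose values are spread evenly has entropy close to 1 and therefore a weight close to 0, which matches the idea that indexes with little variation contribute little to distinguishing samples.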

in summary, the dissimilarity degree measurement formula of the new mixed attribute data provided by the embodiment of the present invention is:

based on the content of the above embodiments, as an alternative embodiment, the step S4 includes but is not limited to:

updating the individual optimum, the global optimum, and the velocity and position of the particles through the dissimilarity metric values based on a particle swarm algorithm, and performing out-of-bounds handling, until the optimal k cluster center points are obtained;

and taking the K clustering central points as initial clustering centers of a K-Means clustering algorithm to construct a customer classification model.

When cluster analysis is performed with the K-Means clustering algorithm, the initial cluster centers are determined first, and the cluster centers are then optimized continuously during iteration until they no longer change. Although the K-Means clustering algorithm converges quickly and has strong local search capability, the cluster centers are initialized randomly, so their initial positions are highly random, the clustering results fluctuate considerably, and the algorithm easily falls into a local minimum.

In view of the above, the invention provides a particle swarm-based K-Means clustering algorithm, which optimizes an initial clustering center of the K-Means algorithm by using the strong global search capability and faster convergence rate of the particle swarm optimization algorithm, so as to eliminate the dependence of the K-Means on an initial value of the clustering center and improve the accuracy and convergence of a clustering result.
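To illustrate why the initial centers matter, the following sketch runs K-Means from centers supplied by the caller, which is where centers found by a particle swarm search would be plugged in. It is a simplification: only numeric attributes with the Euclidean distance are handled, and the function names and data are hypothetical.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(u, v)))


def nearest(x, centers):
    """Index of the center closest to x."""
    return min(range(len(centers)), key=lambda j: euclidean(x, centers[j]))


def k_means(samples, initial_centers, max_iter=100):
    centers = [list(map(float, c)) for c in initial_centers]
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for x in samples:
            clusters[nearest(x, centers)].append(x)
        # Recompute each center as the mean of its members; keep empty ones.
        updated = [
            [sum(col) / len(members) for col in zip(*members)] if members else old
            for members, old in zip(clusters, centers)
        ]
        if updated == centers:      # centers no longer change: converged
            break
        centers = updated
    labels = [nearest(x, centers) for x in samples]
    return centers, labels


samples = [[0, 0], [0, 1], [10, 10], [10, 11]]
centers, labels = k_means(samples, initial_centers=[[0, 0], [10, 10]])
```

With good initial centers the loop converges in very few iterations; with poor ones it may settle in a local minimum, which is the dependence the particle swarm step is meant to remove.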

The embodiment of the invention provides a particle swarm optimization algorithm-based method for optimizing a K-Means clustering algorithm of dissimilarity metric values, which specifically comprises the following steps:

firstly, searching for optimal K clustering center points by using a particle swarm algorithm, then determining a final clustering result by using a K-Means clustering algorithm, and finally outputting the result to further complete the construction of a customer classification model. Wherein, the position (location), velocity (velocity) and fitness value (fitness) of the particle are used to describe the information of the particle, and the structure of the particle i includes three parts, which can be expressed as:

particle(i)＝{location[],velocity[],fitness}

assuming that the number of mixed data samples is n, each mixed data sample has p characteristic attributes, the number of clusters is set to k, and the cluster centers are e_j (j = 1, 2, …, k), M particles are generated, and the position of each particle is composed of k cluster centers. The position coding structure of particle i is shown in the following formula, wherein i = 1, 2, …, M and M is the population size;

e_j (j = 1, 2, …, k, where k is the number of clusters) is the cluster center of the j-th class and is a p-dimensional vector; the position coding structural formula is:

particle(i).location[]＝[e₁,e₂,...,e_k]

further, the velocity of each particle is composed of k cluster-center velocities, and the velocity coding structural formula is as follows:

particle(i).velocity[]＝[v₁,v₂,...,v_k]
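The particle structure described above maps naturally onto a small record type. The sketch below is illustrative only: the class and function names are hypothetical, attributes are assumed to be scaled to [0, 1], and velocities are assumed to start at zero.

```python
import random
from dataclasses import dataclass


@dataclass
class Particle:
    location: list          # k cluster centers, each a p-dimensional list
    velocity: list          # k velocity vectors, each p-dimensional
    fitness: float = 0.0    # filled in once the particle is evaluated


def init_particle(k, p, rng, low=0.0, high=1.0):
    """Create a particle with k random p-dimensional centers and zero velocity."""
    location = [[rng.uniform(low, high) for _ in range(p)] for _ in range(k)]
    velocity = [[0.0] * p for _ in range(k)]
    return Particle(location, velocity)


rng = random.Random(0)      # fixed seed so the sketch is repeatable
swarm = [init_particle(k=3, p=4, rng=rng) for _ in range(5)]   # M = 5 particles
```

Each particle is thus one candidate set of k cluster centers, and the swarm of M particles explores M candidate solutions in parallel.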

further, the fitness value of the population needs to be calculated in each iteration; the fitness value of each particle is a real number and can be expressed by the following formula, wherein a is a positive constant and E is the intra-class aggregation degree. It can be seen that the smaller the intra-class aggregation degree (the final objective of the clustering in the embodiments of the present invention), the larger the fitness value of the particle (the final objective of the objective function provided in the embodiments of the present invention). The fitness value coding structural formula is:

in the iterative update and evolution process, the individual optimal solution f(p_i) of each particle represents the optimal fitness value experienced by that particle, and the global optimal solution f(p_g) of the whole particle swarm represents the optimal fitness value experienced by the population; the following can be obtained:

f(p_i)＝{location[],fitness}

f(p_g)＝{location[],fitness}

therefore, according to the definition of the particle swarm optimization algorithm, the position updating formula of the particle i is obtained as follows:

particle(i).location[]'＝w·particle(i).location[]+c₁r₁·(p_i-particle(i).location[]+ε)+c₂r₂·(p_g-particle(i).location[])
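For comparison, the sketch below shows one update in the conventional two-step particle swarm form: first the velocity, v' = w·v + c₁r₁(p_i - x) + c₂r₂(p_g - x), then the position, x' = x + v', followed by clamping as a simple out-of-bounds handling. This textbook form is swapped in as an assumption and is not claimed to be identical to the formula above; the parameter values and the flattened-list layout are also hypothetical.

```python
import random


def pso_step(x, v, p_best, g_best, rng, w=0.7, c1=1.5, c2=1.5,
             low=0.0, high=1.0):
    """One update of a flattened particle (k * p coordinates)."""
    new_x, new_v = [], []
    for xi, vi, pi, gi in zip(x, v, p_best, g_best):
        r1, r2 = rng.random(), rng.random()
        vi = w * vi + c1 * r1 * (pi - xi) + c2 * r2 * (gi - xi)
        xi = min(max(xi + vi, low), high)   # clamp out-of-range positions
        new_x.append(xi)
        new_v.append(vi)
    return new_x, new_v


rng = random.Random(42)
position, velocity = pso_step(
    x=[0.2, 0.9], v=[0.0, 0.0],
    p_best=[0.4, 0.8], g_best=[0.5, 0.5], rng=rng)
```

The inertia weight w balances exploration against convergence, while c₁ and c₂ control how strongly a particle is pulled toward its own best position and the swarm's best position.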

The particle swarm clustering algorithm is applied according to the particle swarm structure defined above. Once the global optimal solution (the cluster centers) is determined, the partition of the clusters is determined by the nearest neighbor rule, i.e., each data object is assigned to the class nearest to it. Thus, data object x_i (i = 1, 2, …, n) and the cluster center points satisfy the following equation:

the particle swarm optimization algorithm provided by the embodiment of the invention can adopt a real number coding mechanism, in the clustering center-based coding, the coding of each particle comprises a position vector, a velocity vector and a fitness value of the particle, and one coding corresponds to a feasible solution of the clustering problem. Assuming that the data object is p-dimensional attribute and the number of clusters is k, the encoding scheme of each particle i is as follows:

X₁₁ X₁₂ … X_1p | X₂₁ X₂₂ … X_2p | … | X_k1 X_k2 … X_kp | V₁₁ V₁₂ … V_1p | V₂₁ V₂₂ … V_2p | … | V_k1 V_k2 … V_kp | fitness(x)

wherein X_jp represents the p-th dimension attribute value of the center point of the j-th class (j = 1, 2, …, k). The front part of the coding structure represents the position of the particle and consists of k cluster centers, with dimension k × p; the middle part represents the velocity of the particle, also with dimension k × p; the final part is the fitness value of the particle, used to evaluate the quality of the particle position X_j.

Further, the fitness value of a particle reflects the degree of dissimilarity between the data objects within each class: a larger fitness value indicates that the data within each class are more tightly grouped, and a smaller dissimilarity value indicates a better clustering effect. Therefore, in the embodiment of the present invention, the objective of the improved particle swarm clustering algorithm is to search for the particle position at which the fitness of the particle is maximized; the cluster centers corresponding to that particle position are the global optimal position. The intra-class aggregation degree of a particle can be expressed as:

wherein E_j (j = 1, 2, …, k) denotes the j-th class, x denotes a data object in the sample data set, and X_j represents the cluster center of class E_j. The fitness function provided by the embodiment of the invention is as follows:

where a is a positive constant chosen as the case may be.
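A minimal sketch of the fitness evaluation follows. It assumes that the intra-class aggregation degree E is the sum, over all classes, of the distances from each member to its class center, and that the fitness is the reciprocal form a / E, which is consistent with a smaller E giving a larger fitness. Both expressions and all names are assumptions for illustration.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(u, v)))


def intra_class_aggregation(clusters, centers, dist):
    """E: sum over classes j of the distances from each member x to center X_j."""
    return sum(dist(x, centers[j])
               for j, members in enumerate(clusters)
               for x in members)


def fitness(clusters, centers, dist, a=1.0):
    """Assumed reciprocal fitness a / E; infinite when every member sits on its center."""
    e = intra_class_aggregation(clusters, centers, dist)
    return a / e if e > 0 else float("inf")


clusters = [[[0, 0], [0, 2]], [[5, 5]]]   # members of class 0 and class 1
centers = [[0, 1], [5, 5]]
E = intra_class_aggregation(clusters, centers, euclidean)
f = fitness(clusters, centers, euclidean)
```

Passing `dist` as a parameter lets the same functions be reused with a mixed-attribute dissimilarity measure in place of the Euclidean distance.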

The particle swarm optimized K-Means clustering algorithm provided by the embodiment of the invention adopts a non-iterative classification mode: after the optimal cluster centers are determined by the particle swarm, the distance between each data object x in the sample data set and each cluster center X_j (j = 1, 2, …, k) is calculated, and according to the minimum-distance principle x is assigned to the j-th class, whose center is nearest, and labeled j. If x and X_j satisfy the following distance formula, x belongs to class j:

the distance in this formula may be calculated with the dissimilarity measure of the mixed data samples described in step S3, so that all the sample data in the mixed data sample set are divided into classes and the customer classification model is constructed.
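The minimum-distance assignment can be sketched as follows. The distance function is passed in as a parameter so that a mixed-attribute dissimilarity could be substituted; the Euclidean distance and the sample values used here are only placeholders.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(u, v)))


def assign_to_nearest(samples, centers, dist):
    """Label each sample with the index j of its nearest cluster center."""
    labels = []
    for x in samples:
        distances = [dist(x, c) for c in centers]
        labels.append(distances.index(min(distances)))
    return labels


samples = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.1]]
centers = [[0.0, 0.0], [1.0, 1.0]]
labels = assign_to_nearest(samples, centers, euclidean)
```

If a sample is equally close to two centers, this sketch assigns it to the one with the smaller index; the text does not specify a tie-breaking rule.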

Fig. 3 is a schematic flow chart of another customer classification method according to an embodiment of the present invention, and as shown in fig. 3, when the customer classification model is used to classify customers, an internal operation mode includes:

when customer classification is carried out, the number of sample data in the input customer consumption data set and the number of clusters are first determined; the particle population is then initialized based on the particle swarm optimization algorithm, and the fitness of the current particles is calculated with the fitness function (which can be determined according to the calculation method of the dissimilarity metric value). Further, the individual optimum and the global optimum are updated according to the obtained particle fitness values, and the velocity and position of each particle are updated accordingly. It is then checked whether the convergence requirement is satisfied. If it is, the global extremum and the global optimal position are output, the sample data in the customer consumption data set are assigned to the cluster centers of the clustering algorithm according to the minimum-distance principle, and the classification result corresponding to each cluster center is output.

An embodiment of the present invention further provides a customer classification system, as shown in fig. 4, including but not limited to: a data acquisition device 1, a first data processing unit 2, a second data processing unit 3, a third data processing unit 4 and a customer classification unit 5, wherein:

a data acquisition means 1 for acquiring a hybrid data sample set composed of a plurality of hybrid data samples each composed of M numerical variable samples and N categorical variable samples; the first data processing unit 2 is used for performing feature extraction on the M numerical variable samples to obtain T comprehensive evaluation factors; the second data processing unit 3 is used for acquiring the dissimilarity metric values of the T comprehensive evaluation factors and the N categorical variable samples; the third data processing unit 4 is used for optimizing a K-Means clustering algorithm of the dissimilarity metric value based on a particle swarm optimization algorithm and constructing a customer classification model; and the client classification unit 5 is used for inputting the client consumption data set into the client classification model to obtain a client classification result.

The customer classification system provided by the embodiment of the invention combines the K-Means clustering algorithm and the improved particle swarm optimization algorithm, simultaneously considers various indexes influencing customer consumption, adopts the improved clustering algorithm to perform accurate clustering analysis on customer consumption data to obtain customer groups with different characteristics, and is more reasonable and clear in clustering result analysis, and more convenient for adopting different operation strategies and customer service strategies for different groups, thereby ensuring the leading position of retail market enterprises.

Fig. 5 illustrates a physical structure diagram of a server. As shown in fig. 5, the server may include: a processor 510, a communications interface 520, a memory 530 and a communication bus 540, wherein the processor 510, the communications interface 520 and the memory 530 communicate with each other via the communication bus 540. The processor 510 may call logic instructions in the memory 530 to perform a method comprising the following steps. Step S1: acquiring a mixed data sample set consisting of a plurality of mixed data samples, each mixed data sample consisting of M numerical variable samples and N categorical variable samples. Step S2: performing feature extraction on the M numerical variable samples to obtain T comprehensive evaluation factors. Step S3: obtaining dissimilarity metric values of the T comprehensive evaluation factors and the N categorical variable samples. Step S4: optimizing the K-Means clustering algorithm of the dissimilarity metric value based on a particle swarm optimization algorithm to construct a customer classification model. Step S5: inputting the customer consumption data set into the customer classification model to obtain a customer classification result.

Furthermore, the logic instructions in the memory 530 may be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

In another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the customer classification method provided in the foregoing embodiments, which for example includes the following steps. Step S1: acquiring a mixed data sample set consisting of a plurality of mixed data samples, each mixed data sample consisting of M numerical variable samples and N categorical variable samples. Step S2: performing feature extraction on the M numerical variable samples to obtain T comprehensive evaluation factors. Step S3: obtaining dissimilarity metric values of the T comprehensive evaluation factors and the N categorical variable samples. Step S4: optimizing the K-Means clustering algorithm of the dissimilarity metric value based on a particle swarm optimization algorithm to construct a customer classification model. Step S5: inputting the customer consumption data set into the customer classification model to obtain a customer classification result.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A customer categorization method, comprising:

step S1: acquiring a mixed data sample set consisting of a plurality of mixed data samples; each mixed data sample consists of M numerical variable samples and N classification variable samples;

step S2: performing feature extraction on the M numerical variable samples to obtain T comprehensive evaluation factors;

step S3: acquiring dissimilarity metric values of the T comprehensive evaluation factors and the N classified variable samples;

step S4: optimizing the K-Means clustering algorithm of the dissimilarity metric value based on a particle swarm optimization algorithm to construct a customer classification model;

step S5: and inputting the customer consumption data set into the customer classification model to obtain a customer classification result.

2. The customer classification method according to claim 1, wherein in the step S2, the performing feature extraction on the mixed type data sample set includes:

respectively performing numerical normalization calculation on the M numerical variable samples in each mixed data sample to obtain a standardized matrix;

obtaining a correlation matrix of the standardized matrix;

based on a principal component analysis method, obtaining an initial common factor according to the correlation matrix;

establishing a regression equation with the initial common factor as a dependent variable and the numerical variable sample as an independent variable based on a regression method, and acquiring a factor score corresponding to the initial common factor; wherein the factor score is a weighted sum.
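The factor-extraction steps recited above can be illustrated with a minimal sketch. It assumes z-score standardization and the textbook regression-method coefficients B = R⁻¹A; neither formula is given in this section, so both are assumptions. The function name `factor_scores` and its parameters are hypothetical.

```python
import numpy as np

def factor_scores(X, n_factors):
    """Sketch: standardize X (n samples x M variables) and compute factor scores."""
    X = np.asarray(X, dtype=float)
    # Assumed z-score standardization of each numerical variable.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    n = Z.shape[0]
    # Correlation matrix of the standardized matrix.
    R = (Z.T @ Z) / (n - 1)
    # Principal component extraction: eigen-decomposition of R,
    # keeping the n_factors largest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1][:n_factors]
    # Loadings of the initial common factors.
    loadings = eigvecs[:, order] * np.sqrt(eigvals[order])
    # Regression-method score coefficients (assumed form): solve R B = A.
    coef = np.linalg.solve(R, loadings)
    # Each factor score is a weighted sum of the standardized variables.
    return Z @ coef, loadings
```

With unrotated principal-component loadings, the resulting scores are uncorrelated and have unit variance, which is one way to sanity-check the output.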

3. The customer categorization method of claim 2, further comprising, after said obtaining the initial common factor: performing orthogonal factor rotation or oblique factor rotation by adopting a variance maximization (varimax) method to obtain an orthogonal or oblique factor load matrix, and performing linear combination processing on the initial common factor by using the orthogonal or oblique factor load matrix.

4. The customer classification method according to claim 2, wherein the performing the numerical normalization calculation on the M numerical variable samples in each of the mixed data samples respectively comprises:

wherein x_ij is a numerical variable sample set;

5. The customer categorization method of claim 4, wherein the obtaining of the correlation matrix of the standardized matrices comprises:

L＝[l₁,l₂,l₃…l_M]

wherein R is the correlation matrix, L is the standardized matrix, L' is the transposed matrix of L, and l_M is the result of the numerical normalization calculation of the M-th numerical variable sample.

6. The customer categorization method of claim 1, further comprising, before said step S2: and carrying out missing value processing and abnormal value processing on the mixed type data sample set.

7. The customer classification method according to claim 1, wherein the step S3 includes: dividing the N categorical variable samples into N₁ ordered categorical variable samples and N₂ nominal categorical variable samples;

8. The customer categorization method of claim 7, wherein the obtaining of the dissimilarity metric values of the T comprehensive evaluation factors, the N₁ ordered categorical variable samples and the N₂ nominal categorical variable samples, respectively, comprises:

performing numerical attribute normalization transformation on the N₁ ordered categorical variable samples, converting each ordered categorical variable sample into an ordered numerical variable sample, and calculating the dissimilarity metric values of the N₁ ordered numerical variable samples and the T comprehensive evaluation factors by using the following formula:

wherein d_m(x_i, x_j) is the dissimilarity metric value between variable sample x_i and variable sample x_j, m is the number of numerical attributes of each ordered numerical variable sample, x_im denotes the value of the i-th data object in dimension m, and x_jm denotes the value of the j-th data object in dimension m;

9. The customer classification method according to claim 8, wherein, if the nominal categorical variable samples are binary nominal categorical variable samples:

10. The customer classification method according to claim 7, further comprising, before the step S3:

determining the weight of each dissimilarity metric value based on an entropy weight method, which comprises: normalizing the comprehensive evaluation factors and the ordered categorical variable samples by using a dispersion (min-max) standardization method, so that each comprehensive evaluation factor and each ordered categorical variable sample is mapped to a dimensionless numerical value in the range [0, 1];

calculating a specific gravity of each of the dimensionless numbers;

obtaining a specific gravity of each nominal type categorical variable sample;

calculating the information entropy of each specific gravity;

and determining the weight of each comprehensive evaluation factor and each classification type variable sample according to each information entropy.
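The entropy weight steps recited above can be sketched as follows. The formulas themselves are not reproduced in this section, so the sketch uses the common textbook form of the entropy weight method as an assumption: proportion p_ij = y_ij / Σ_i y_ij, entropy e_j = −(1/ln n) Σ_i p_ij ln p_ij, and weight w_j = (1 − e_j) / Σ_j (1 − e_j). The function names are hypothetical, and every numeric attribute is assumed to take at least two distinct values.

```python
import numpy as np

def entropy_weights(X):
    """Sketch of entropy weights for an (n samples x d attributes) numeric matrix."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Dispersion (min-max) standardization to dimensionless values in [0, 1].
    Y = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    # Proportion ("specific gravity") of each value within its attribute.
    P = Y / Y.sum(axis=0)
    # Information entropy per attribute, treating 0 * ln(0) as 0.
    safe = np.where(P > 0, P, 1.0)
    E = -(P * np.log(safe)).sum(axis=0) / np.log(n)
    # Lower entropy means more discriminating power, hence a larger weight.
    D = 1.0 - E
    return D / D.sum()

def nominal_proportions(values):
    """Proportion of each nominal category: occurrence count / total samples."""
    cats, counts = np.unique(np.asarray(values), return_counts=True)
    return dict(zip(cats.tolist(), (counts / counts.sum()).tolist()))
```

How the nominal proportions feed into the entropy and weight formulas is not shown here, because the corresponding formulas are not available in this section.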

11. The customer categorization method of claim 10, wherein said calculating the specific gravity of each of said dimensionless numbers comprises:

the contribution of each dimensionless number is obtained using the following formula:

wherein p_ij is the degree of contribution of the i-th data object x_i under the j-th attribute;

determining a specific gravity of each of the comprehensive evaluation factors based on the contribution degree.

12. The customer classification method of claim 10, wherein the obtaining a specific gravity for each of the nominal classification variable samples comprises:

acquiring the occurrence times of each nominal type classification variable sample and the number of total samples, and according to the following formula:

obtaining a specific gravity of each nominal type categorical variable sample;

13. The customer classification method according to claim 12, wherein the calculation formula of the information entropy of each of the specific gravities is:

14. The customer categorization method of claim 12, wherein the determining a weight for each of the comprehensive evaluation factors and each of the categorical variable samples based on each of the information entropies comprises:

15. The customer classification method according to claim 1, wherein the step S4 includes:

updating the individual optimum, the global optimum, and the velocity and position of the particles through the dissimilarity metric value based on a particle swarm algorithm, and performing boundary-crossing processing until the optimal K clustering center points are obtained;

and taking the K clustering central points as initial clustering centers of a K-Means clustering algorithm to construct the customer classification model.
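A minimal sketch of this two-stage procedure is given below. It is illustrative only. Squared Euclidean distance stands in for the weighted mixed dissimilarity metric, a standard particle swarm update with fixed parameters `w`, `c1`, `c2` stands in for the improved particle swarm algorithm, and clipping to the data range stands in for the boundary-crossing processing. All function names and default values are assumptions.

```python
import numpy as np

def sq_dists(X, centers):
    """Squared distance from every sample to every center, shape (n, k)."""
    return ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)

def sse(centers, X):
    """Fitness: sum of squared distances to the nearest center."""
    return sq_dists(X, centers).min(axis=1).sum()

def pso_init_centers(X, k, n_particles=20, n_iter=50,
                     w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Each particle encodes k candidate centers drawn from the samples.
    pos = X[rng.integers(0, len(X), size=(n_particles, k))]  # (P, k, d)
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_fit = np.array([sse(p, X) for p in pos])
    g = pbest_fit.argmin()
    gbest, gbest_fit = pbest[g].copy(), pbest_fit[g]
    for _ in range(n_iter):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        # Velocity and position update.
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        # Boundary handling: clip positions back into the data range.
        pos = np.clip(pos + vel, lo, hi)
        fit = np.array([sse(p, X) for p in pos])
        # Update individual optima, then the global optimum.
        better = fit < pbest_fit
        pbest[better], pbest_fit[better] = pos[better], fit[better]
        g = pbest_fit.argmin()
        if pbest_fit[g] < gbest_fit:
            gbest, gbest_fit = pbest[g].copy(), pbest_fit[g]
    return gbest

def kmeans(X, centers, n_iter=100):
    """Plain K-Means started from the given initial centers."""
    centers = centers.copy()
    for _ in range(n_iter):
        labels = sq_dists(X, centers).argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(len(centers))])
        if np.allclose(new, centers):
            break
        centers = new
    # Final assignment against the returned centers.
    return sq_dists(X, centers).argmin(axis=1), centers
```

A typical call would be `labels, centers = kmeans(X, pso_init_centers(X, k))`, where `X` is a float array of shape (n, d). Seeding K-Means this way is intended to reduce its sensitivity to random initial centers.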

16. A customer classification system, comprising:

data acquisition means for acquiring a mixed data sample set composed of a plurality of mixed data samples, each of which is composed of M numerical variable samples and N categorical variable samples;

the first data processing unit is used for carrying out feature extraction on the M numerical variable samples to obtain T comprehensive evaluation factors;

the second data processing unit is used for acquiring the dissimilarity metric values of the T comprehensive evaluation factors and the N categorical variable samples;

the third data processing unit is used for optimizing the K-Means clustering algorithm of the dissimilarity metric value based on a particle swarm optimization algorithm and constructing a customer classification model;

and the customer classification unit is used for inputting the customer consumption data set into the customer classification model to obtain a customer classification result.

17. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the customer classification method according to any one of claims 1 to 15 are performed by the processor when executing the program.

18. A non-transitory computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, carries out the steps of the customer classification method according to any one of claims 1 to 15.