CN110866782A - Customer classification method and system and electronic equipment - Google Patents


Publication number
CN110866782A
Authority
CN
China
Prior art keywords
variable
samples
sample
numerical
customer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911078287.9A
Other languages
Chinese (zh)
Other versions
CN110866782B (en)
Inventor
穆维松
李玥
冯建英
田东
褚晓泉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Agricultural University
Original Assignee
China Agricultural University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Agricultural University
Priority to CN201911078287.9A
Publication of CN110866782A
Application granted
Publication of CN110866782B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00: Commerce
    • G06Q30/02: Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201: Market modelling; Market analysis; Collecting market data
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques
    • G06F18/232: Non-hierarchical techniques
    • G06F18/2321: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/004: Artificial life, i.e. computing arrangements simulating life
    • G06N3/006: Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Development Economics (AREA)
  • Strategic Management (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Accounting & Taxation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Finance (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Game Theory and Decision Science (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention provides a customer classification method, a customer classification system and electronic equipment. The method comprises: acquiring a mixed data sample set consisting of a plurality of mixed data samples, each mixed data sample consisting of numerical variable samples and categorical variable samples; performing feature extraction on the numerical variable samples to obtain comprehensive evaluation factors; acquiring dissimilarity metric values of the comprehensive evaluation factors and of the categorical variable samples; and optimizing, based on a particle swarm optimization algorithm, the K-Means clustering algorithm that uses the dissimilarity metric values, so as to construct a customer classification model and obtain a customer classification result. By combining the K-Means clustering algorithm with an improved particle swarm optimization algorithm and taking into account multiple indexes that influence customer consumption, the embodiment performs cluster analysis on customer consumption data to obtain customer groups with different characteristics. The analysis results are more reasonable and accurate, which makes it easier to adopt different operation and customer service strategies for different groups.

Description

Customer classification method and system and electronic equipment
Technical Field
The embodiment of the invention belongs to the technical field of computer information processing, and particularly relates to a customer classification method, a customer classification system and electronic equipment.
Background
Retail customer classification refers to dividing customers into different customer groups, using suitable analysis techniques, according to factors such as customer attributes, purchase demands and values, in order to evaluate the consumption behavior of different customers and thereby identify high-value customers or customize products and services according to the classification results. Traditional methods for classifying retail customers mainly comprise the experience-based segmentation method and the mathematical statistics method. The experience-based method is highly subjective, and the quality of classification by the statistical method depends to a great extent on the classification criteria, so both methods are one-sided.
In the data mining technology, clustering analysis is used as an unsupervised learning method, which can classify data sets and mine valuable information from feature data of research objects, so that objects in classes are similar as much as possible, and objects between different classes are different as much as possible. Clustering analysis is a powerful information processing method, and has been widely used in various subject research fields, such as: customer classification, market research, data analysis, pattern recognition, machine learning, and the like.
At present, cluster analysis methods mainly include partitioning methods, hierarchical methods, density-based methods, grid-based methods and model-based methods. The K-Means clustering algorithm, proposed by MacQueen in 1967, is a partition-based clustering method, but it has the following defects: (1) when calculating the dissimilarity between data objects, the algorithm can only process numerical data and is not suitable for categorical data, and it does not balance the importance of different features; (2) the random selection of initial values may make the clustering result highly uncertain, and the algorithm may fail to find a solution or fall into a local minimum.
In summary, the prior-art methods for classifying retail customers have many defects, which lead to inaccurate analysis results and indistinct classification effects.
Disclosure of Invention
The embodiment of the invention provides a customer classification method, a customer classification system and electronic equipment, which are used for solving or partially solving the defects in the prior art.
In a first aspect, an embodiment of the present invention provides a customer classification method, including:
Step S1: acquiring a mixed data sample set consisting of a plurality of mixed data samples, wherein each mixed data sample consists of M numerical variable samples and N categorical variable samples.
Step S2: performing feature extraction on the M numerical variable samples to obtain T comprehensive evaluation factors.
Step S3: obtaining the dissimilarity metric values of the T comprehensive evaluation factors and the N categorical variable samples.
Step S4: optimizing, based on a particle swarm optimization algorithm, the K-Means clustering algorithm that uses the dissimilarity metric values, to construct a customer classification model.
Step S5: inputting the customer consumption data set into the customer classification model to obtain a customer classification result.
Further, in step S2, performing feature extraction on the mixed data sample set includes:
performing numerical standardization on the M numerical variable samples in each mixed data sample to obtain a standardized matrix; obtaining a correlation matrix of the standardized matrix; obtaining initial common factors from the correlation matrix based on principal component analysis;
establishing, based on a regression method, a regression equation that takes the initial common factor as the dependent variable and the numerical variable samples as independent variables, and obtaining the factor score corresponding to the initial common factor, wherein the factor score is a weighted sum.
Further, after obtaining the initial common factors, the method further includes: performing orthogonal or oblique factor rotation using the variance maximization method to obtain an orthogonal or oblique factor loading matrix, and linearly combining the initial common factors using that factor loading matrix.
Further, the numerical standardization of the M numerical variable samples in each mixed data sample includes:
$$l_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}, \quad i = 1, 2, \dots, m; \; j = 1, 2, \dots, P$$
wherein $x_{ij}$ is the value of the j-th observation index of the i-th numerical variable sample; $\bar{x}_j$ is the mean of the j-th index over the numerical variable samples; the denominator $s_j$ is the corresponding standard deviation; m is the number of numerical variable samples; P is the number of observation indexes of each numerical variable sample; and $l_{ij}$ is the standardized value of the j-th dimension of the i-th data.
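The standardization step can be sketched in a few lines of Python. This is a minimal illustration only: the function name `standardize`, the example values, and the choice of the sample standard deviation (divisor m - 1) are assumptions that the source text does not fix.

```python
import math

def standardize(samples):
    """Column-wise z-score: l_ij = (x_ij - mean_j) / std_j."""
    m = len(samples)       # number of samples (rows)
    p = len(samples[0])    # number of observation indexes (columns)
    means = [sum(row[j] for row in samples) / m for j in range(p)]
    # Assumption: sample standard deviation with divisor m - 1.
    # A constant column would give std 0 and is not handled here.
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in samples) / (m - 1))
        for j in range(p)
    ]
    return [[(row[j] - means[j]) / stds[j] for j in range(p)] for row in samples]

# Hypothetical questionnaire scores: three respondents, two indexes.
L = standardize([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
```

After this step every column has mean 0 and unit standard deviation, so indexes measured on different scales become comparable.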
Further, obtaining the correlation matrix of the standardized matrix includes:
Figure BDA0002263169390000033
wherein R is the correlation matrix, L is the standardized matrix, and L' is the transpose of L, where $L_M$ is the result of the numerical standardization of the M-th numerical variable sample.
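One common way to obtain R from L and L' is R = L'L / (m - 1), which holds when each column of L was standardized with the sample standard deviation. The divisor is an assumption here, since the formula image is not reproduced in this text. The function name `correlation_matrix` is illustrative.

```python
def correlation_matrix(L):
    """R = L'L / (m - 1) for a column-standardized matrix L (list of rows)."""
    m = len(L)
    p = len(L[0])
    return [
        [sum(L[i][j] * L[i][k] for i in range(m)) / (m - 1) for k in range(p)]
        for j in range(p)
    ]

# Two perfectly anti-correlated, already standardized columns.
L = [[-1.0, 1.0], [0.0, 0.0], [1.0, -1.0]]
R = correlation_matrix(L)
```

The diagonal of R is 1 and the off-diagonal entries are the pairwise correlation coefficients, which is the input needed for extracting the initial common factors.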
Further, before step S2, the method further includes: and carrying out missing value processing and abnormal value processing on the mixed type data sample set.
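The source does not say how missing values and abnormal values are handled. As one possible sketch, missing entries could be filled with the column median and outliers clipped to the 1.5 × IQR fences. Both choices, and the names `impute_missing` and `clip_outliers`, are assumptions made for illustration.

```python
import statistics

def impute_missing(column):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in column if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in column]

def clip_outliers(column, k=1.5):
    """Clip values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(column, n=4)
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [min(max(v, lo), hi) for v in column]

# Hypothetical answers to one scale item; None marks a skipped question.
raw = [3, 4, None, 5, 4, 3, 50]
clean = clip_outliers(impute_missing(raw))
```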
Further, step S3 includes: dividing the N categorical variable samples into $N_1$ ordered categorical variable samples and $N_2$ nominal categorical variable samples;
obtaining the dissimilarity metric values of the T comprehensive evaluation factors, the $N_1$ ordered categorical variable samples and the $N_2$ nominal categorical variable samples, respectively.
Further, obtaining the dissimilarity metric values of the T comprehensive evaluation factors, the $N_1$ ordered categorical variable samples and the $N_2$ nominal categorical variable samples, respectively, comprises:
performing a numerical attribute normalization transformation on the $N_1$ ordered categorical variable samples to convert each of them into an ordered numerical variable sample, and calculating the dissimilarity metric value of the $N_1$ ordered numerical variable samples and the T comprehensive evaluation factors using the following formula:
Figure BDA0002263169390000034
wherein $d_m(x_i, x_j)$ is the dissimilarity metric value between variable sample $x_i$ and variable sample $x_j$, M is the number of numerical attributes of each ordered numerical variable sample, $x_{im}$ denotes the value of the i-th data in the m-th dimension, and $x_{jm}$ denotes the value of the j-th data in the m-th dimension; and calculating the dissimilarity metric value of the $N_2$ nominal categorical variable samples using the symmetric Hamming formula:
Figure BDA0002263169390000035
where n is the number of attributes of each nominal categorical variable sample, $x_{ip}$ and $x_{jp}$ are the categorical values of the p-th attribute of data objects $x_i$ and $x_j$ respectively, and $\delta(x_{ip}, x_{jp})$ is the matching coefficient.
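The two dissimilarity measures can be sketched as follows. Because the formula images are not reproduced here, the sketch assumes a Euclidean distance for the numerical part, a rank-based mapping of ordered categories onto [0, 1], and a matching coefficient of 0 for equal values and 1 otherwise. The function names and example values are illustrative.

```python
import math

def ordinal_to_numeric(rank, levels):
    """Map an ordered category rank (1..levels) onto [0, 1] (assumed mapping)."""
    return (rank - 1) / (levels - 1)

def numeric_dissimilarity(xi, xj):
    """Assumed Euclidean distance over factor scores and mapped ordinals."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

def nominal_dissimilarity(xi, xj):
    """Hamming distance: count attributes whose categories differ."""
    return sum(0 if a == b else 1 for a, b in zip(xi, xj))

# Two hypothetical customers: two factor scores plus one 5-level ordinal attribute,
# and two nominal attributes.
a_num = [0.5, -1.2, ordinal_to_numeric(2, 5)]
b_num = [0.5, 0.8, ordinal_to_numeric(4, 5)]
a_nom = ["female", "tier-1"]
b_nom = ["male", "tier-1"]

d_m = numeric_dissimilarity(a_num, b_num)
d_n = nominal_dissimilarity(a_nom, b_nom)
```

Keeping the two parts separate makes it possible to weight them differently before they are combined in the clustering step.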
Further, if the nominal categorical variable sample is a binary nominal categorical variable sample, then:
Figure BDA0002263169390000041
Further, before step S3, the method further includes: determining the weight of each dissimilarity metric value based on the entropy weight method, which comprises: normalizing each comprehensive evaluation factor and each ordered categorical variable sample using the min-max (dispersion) standardization method, mapping them to dimensionless values in the (0, 1) interval; calculating the proportion of each dimensionless value; obtaining the proportion of each nominal categorical variable sample; calculating the information entropy of each proportion; and determining the weight of each comprehensive evaluation factor and each categorical variable sample according to the information entropies.
Further, calculating the proportion of each dimensionless value includes: obtaining the contribution degree of each dimensionless value using the following formula:
Figure BDA0002263169390000042
wherein $p_{ij}$ is the contribution degree of the i-th data object $x_i$ under the j-th attribute; and determining the proportion of each comprehensive evaluation factor based on the contribution degree.
Further, obtaining the proportion of each nominal categorical variable sample includes: obtaining the number of occurrences of each nominal categorical variable sample and the total number of samples, and, according to the following formula:
Figure BDA0002263169390000043
obtaining the proportion of each nominal categorical variable sample; wherein $r_{ij}$ is the number of occurrences of the nominal categorical variable sample $x_{ij}$, and a is the total number of samples.
Further, determining the weight of each comprehensive evaluation factor and each categorical variable sample according to the information entropies includes:
Figure BDA0002263169390000044
wherein $w_i$ is the weight corresponding to the i-th information entropy, and m is the total number of information entropies.
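The entropy weight procedure described above can be sketched as follows, using the usual form of the method: proportions, then normalized Shannon entropy, then weights proportional to 1 - e. The exact formulas in the patent figures are not reproduced in this text, so the normalization of the entropy by the logarithm of the number of proportions, the function names and the example columns are assumptions.

```python
import math
from collections import Counter

def min_max(column):
    """Map a numeric column onto [0, 1] (assumes the column is not constant)."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

def numeric_proportions(column):
    """Share of each scaled value in the column total."""
    scaled = min_max(column)
    total = sum(scaled)
    return [v / total for v in scaled]

def nominal_proportions(column):
    """Occurrences of each category divided by the total number of samples."""
    counts = Counter(column)
    return [c / len(column) for c in counts.values()]

def entropy(props):
    """Normalized Shannon entropy; a zero proportion contributes nothing."""
    return -sum(p * math.log(p) for p in props if p > 0) / math.log(len(props))

def entropy_weights(entropies):
    """w_i = (1 - e_i) / (m - sum of all e), m being the number of entropies."""
    spare = [1.0 - e for e in entropies]
    total = sum(spare)
    return [s / total for s in spare]

# Hypothetical columns: a factor score, an ordinal attribute, a nominal attribute.
factor = [1.0, 2.0, 3.0, 4.0]
ordinal = [1, 1, 1, 4]
city = ["a", "a", "b", "b"]
entropies = [
    entropy(numeric_proportions(factor)),
    entropy(numeric_proportions(ordinal)),
    entropy(nominal_proportions(city)),
]
weights = entropy_weights(entropies)
```

Attributes whose proportions are spread evenly have entropy close to 1 and receive little weight, while attributes with concentrated proportions receive more.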
Further, step S4 includes: based on the particle swarm optimization algorithm, updating the individual best, the global best, and the velocity and position of each particle according to the dissimilarity metric values, and performing out-of-bounds handling, until the optimal K cluster center points are obtained; and taking the K cluster center points as the initial cluster centers of the K-Means clustering algorithm to construct the customer classification model.
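A simplified sketch of step S4 is given below. It is not the patented implementation: squared Euclidean distance on purely numerical data stands in for the mixed dissimilarity measure, out-of-bounds positions are simply clamped to the data range, and the function names and the parameters `w`, `c1`, `c2`, `n_particles` and `iters` are illustrative assumptions.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def split(position, k, dim):
    """Cut a flat particle position into k cluster centers."""
    return [position[c * dim:(c + 1) * dim] for c in range(k)]

def cost(centers, data):
    """Sum over all points of the distance to the nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in data)

def pso_centers(data, k, n_particles=15, iters=40, w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    lo = [min(p[d] for p in data) for d in range(dim)] * k
    hi = [max(p[d] for p in data) for d in range(dim)] * k
    # Each particle encodes k centers, initialized from k random data points.
    xs = [[v for p in rng.sample(data, k) for v in p] for _ in range(n_particles)]
    vs = [[0.0] * (k * dim) for _ in range(n_particles)]
    pbest = [x[:] for x in xs]
    pbest_cost = [cost(split(x, k, dim), data) for x in xs]
    g = min(range(n_particles), key=pbest_cost.__getitem__)
    gbest, gbest_cost = pbest[g][:], pbest_cost[g]
    for _ in range(iters):
        for i in range(n_particles):
            for j in range(k * dim):
                r1, r2 = rng.random(), rng.random()
                vs[i][j] = (w * vs[i][j]
                            + c1 * r1 * (pbest[i][j] - xs[i][j])
                            + c2 * r2 * (gbest[j] - xs[i][j]))
                # Out-of-bounds handling (assumed): clamp to the data range.
                xs[i][j] = min(max(xs[i][j] + vs[i][j], lo[j]), hi[j])
            c = cost(split(xs[i], k, dim), data)
            if c < pbest_cost[i]:                 # update individual best
                pbest[i], pbest_cost[i] = xs[i][:], c
                if c < gbest_cost:                # update global best
                    gbest, gbest_cost = xs[i][:], c
    return split(gbest, k, dim)

def k_means(data, centers, iters=50):
    """Plain K-Means started from the given centers."""
    def nearest(p):
        return min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in data:
            groups[nearest(p)].append(p)
        updated = [[sum(col) / len(g) for col in zip(*g)] if g else centers[c]
                   for c, g in enumerate(groups)]
        if updated == centers:
            break
        centers = updated
    return centers, [nearest(p) for p in data]

data = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
centers, labels = k_means(data, pso_centers(data, k=2))
```

The particle swarm only supplies the starting centers; the final assignment is still made by K-Means, which is how the method reduces the dependence on random initialization.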
In a second aspect, an embodiment of the present invention provides a customer classification system, including a data acquisition device, a first data processing unit, a second data processing unit, a third data processing unit and a customer classification unit, wherein:
the data acquisition device is used for acquiring a mixed data sample set consisting of a plurality of mixed data samples, wherein each mixed data sample consists of M numerical variable samples and N categorical variable samples;
the first data processing unit is used for performing feature extraction on the M numerical variable samples to obtain T comprehensive evaluation factors;
the second data processing unit is used for obtaining the dissimilarity metric values of the T comprehensive evaluation factors and the N categorical variable samples;
the third data processing unit is used for optimizing, based on the particle swarm optimization algorithm, the K-Means clustering algorithm that uses the dissimilarity metric values, and constructing a customer classification model;
the customer classification unit is used for inputting the customer consumption data set into the customer classification model to obtain a customer classification result.
In a third aspect, an embodiment of the present invention provides an electronic device, including but not limited to a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the customer classification method when executing the program.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the above-mentioned customer classification method.
The customer classification method, system and electronic equipment provided by the embodiments of the invention combine the K-Means clustering algorithm with an improved particle swarm optimization algorithm and take into account multiple indexes that influence customer consumption. The improved clustering algorithm performs cluster analysis on customer consumption data to obtain customer groups with different characteristics, so that the clustering results are more reasonable and clear and it is easier to adopt different operation and customer service strategies for different groups, which helps retail enterprises maintain a leading market position.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flow chart of a customer classification method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a model for obtaining a comprehensive evaluation factor according to an embodiment of the present invention;
Fig. 3 is a schematic flowchart of another customer classification method according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a customer classification system according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
With the progress of science and technology and the development of the Internet of Things, accurately determining how customers are classified and promptly formulating targeted product promotion and sales strategies are crucial to commercial success. This is particularly true for fresh products with high preservation requirements, such as fresh fruit and seafood: classifying customers makes it possible to broaden sales channels quickly and improve sales efficiency, which shortens product storage time and indirectly improves economic benefit.
For convenience of description, all embodiments of the present invention take the classification of customers of retail fresh grapes as an example; this is not to be construed as limiting the scope of the embodiments of the present invention.
In classifying customers of retail fresh grapes, the most advanced prior-art approach uses cluster analysis as a data mining technique. Cluster analysis, as an unsupervised learning method, can group data sets and mine valuable information from the feature data of the research objects, so that objects within a class are as similar as possible and objects in different classes are as different as possible. The K-Means clustering algorithm is a classic partition-based clustering algorithm and is particularly effective. It is easy to understand, simple to implement, fast to converge and able to process large data sets efficiently; it has become the best-known and most commonly used clustering algorithm in the field of data mining and has already been applied to research on retail customer classification. However, as pointed out in the Background, the algorithm has a number of defects, so that the analysis results often fail to reach the desired precision and the classification results are inaccurate.
In view of the above, in order to effectively solve or partially solve the defects in the data mining in the prior art, an embodiment of the present invention provides a method for classifying customers, as shown in fig. 1, including, but not limited to, the following steps:
Step S1: acquiring a mixed data sample set consisting of a plurality of mixed data samples, wherein each mixed data sample consists of M numerical variable samples and N categorical variable samples.
Step S2: performing feature extraction on the M numerical variable samples to obtain T comprehensive evaluation factors.
Step S3: obtaining the dissimilarity metric values of the T comprehensive evaluation factors and the N categorical variable samples.
Step S4: optimizing, based on a particle swarm optimization algorithm, the K-Means clustering algorithm that uses the dissimilarity metric values, to construct a customer classification model.
Step S5: inputting the customer consumption data set into the customer classification model to obtain a customer classification result.
The embodiment uses a data set of fresh grape consumer behavior and preferences, and selects part of it as the customer classification variables used to divide customers into groups. This sub-data set contains consumption value data collected as scale items, which are treated as numerical variables (although they are not strictly numerical), and demographic data collected as multiple-choice items, which are categorical variables.
Table 1 partially describes the feature data on fresh grape purchasing values and basic customer information. The data in the table can be collected through online questionnaires, in-person questionnaires, feedback from sales outlets, and the like. All survey data obtained are tabulated, comprehensively considering the factors that influence actual purchases of fresh grapes among different consumer groups and age groups, and how strongly each factor influences the completion of a purchase. It should be noted that Table 1 is only one list of feature data provided by the embodiment of the present invention and is not to be considered a limitation on the scope of the embodiment.
As shown in Table 1, the features comprise consumer value data, namely the items with feature names a-n, which are mainly used as numerical variables, and basic customer information data, which are mainly used as categorical variables.
TABLE 1
Figure BDA0002263169390000081
Table 2 lists raw data for each numerical variable sample and each categorical variable sample in Table 1:
TABLE 2
Figure BDA0002263169390000091
As shown in Table 2, there are 3230 mixed data samples in total (only part of the content is shown in the table), where each mixed data sample comprises 14 numerical variable samples (the values corresponding to a-n) and 7 categorical variable samples (the values corresponding to gender, age, education level, occupation, household size, average monthly household income and city tier).
Correlation analysis of the numerical variables shows clear correlation among several groups of variables, which causes information overlap; some variables contribute little and only add to the computational workload. To address the loss of clustering efficiency caused by overlapping index information among samples, the large number of original numerical input variables (all of which must be continuous; ordinal and categorical variables cannot be subjected to factor analysis) are processed using dimensionality reduction and transformed into a small number of mutually independent new input variables (the comprehensive evaluation factors).
As shown in Fig. 2, feature extraction is performed on these variables. For example, the three variables "package delicacy", "hygienic condition of purchase environment" and "presence or absence of package" can be merged into one comprehensive evaluation factor, "package hygienic condition". Feature extraction of this kind effectively reduces the computational workload and improves the efficiency of the clustering algorithm.
Further, the retail fresh grape purchase data are mixed-type data: the questionnaire contains numerical data, ordered categorical data and nominal categorical data. Because the K-Means clustering algorithm can only process numerical data, the embodiment of the invention improves on it by providing a dissimilarity measure for mixed attributes.
First, all questionnaire data are processed, finally yielding 3230 samples of customers' fresh grape consumption (that is, a mixed data sample set consisting of 3230 mixed data samples). Further, the embodiment of the present invention studies two characteristics of the customers: consumer values (numerical data) and basic customer information (ordered and unordered categorical data). As can be seen from Table 3, the mixed data have attributes of two different kinds, categorical and numerical: there are 4 numerical attributes (namely the 4 comprehensive evaluation factors) and 7 categorical attributes, so each data object has 11 attributes in total.
Numerical variables, also called quantitative variables, continuous attribute variables or scale variables, are scalars, i.e., numbers with no directional meaning. Unlike scalar attributes, ordered categorical variables have a specific order among their attribute values: the category of a feature is determined according to some rule, and the values reflect an order relationship, so their relative order matters but their actual magnitude does not. Applying the single distance formula commonly used in the ordinary K-Means clustering algorithm uniformly to all of the mixed data would clearly not respect these relationships among the data.
In embodiments of the present invention, the sample data objects contain a mixture of numerical variable samples and categorical variable samples (where categorical variables can be divided by nature into ordered categorical variables and nominal categorical variables), and are therefore referred to as mixed data samples. One way to calculate the dissimilarity between objects described by mixed data is to group the data by type and treat each type of attribute differently. Following the analysis above, the numerical variables and the ordered categorical variables are both treated as scalars when calculating dissimilarity, while the nominal categorical variables form a separate group with its own dissimilarity calculation. By obtaining the dissimilarity metric value of the comprehensive evaluation factors and that of the categorical variable samples separately, and then considering both together, the embodiment overcomes the limitation that the K-Means clustering algorithm handles only numerical data and is unsuitable for other data types, and effectively improves the clustering effect.
TABLE 3
Figure BDA0002263169390000101
When the K-Means clustering algorithm is used, the initial cluster centers are determined first and are then optimized iteratively until they no longer change. The algorithm converges quickly and has strong local search capability, but because the cluster centers are initialized randomly, their initial positions are highly random, which makes the clustering result fluctuate and liable to fall into a local minimum. To overcome these defects, the embodiment of the invention uses a particle swarm optimization algorithm to optimize the K-Means clustering algorithm that uses the dissimilarity metric values: the strong global search capability and fast convergence of particle swarm optimization are used to optimize the initial cluster centers of K-Means, which removes the dependence of K-Means on the initial cluster centers and improves the accuracy and convergence of the clustering result.
The customer classification method provided by the embodiment of the invention combines the K-Means clustering algorithm with an improved particle swarm optimization algorithm and takes into account multiple indexes that influence customer consumption. The improved clustering algorithm performs cluster analysis on customer consumption data to obtain customer groups with different characteristics, so that the clustering results are more reasonable and clear and it is easier to adopt different operation and customer service strategies for different groups, which helps retail enterprises maintain a leading market position.
Based on the content of the foregoing embodiment, as an alternative embodiment, in step S2, the performing feature extraction on the hybrid data sample set includes, but is not limited to, the following steps:
S21: performing numerical standardization on the M numerical variable samples in each mixed data sample to obtain a standardized matrix;
S22: obtaining the correlation matrix of the standardized matrix;
S23: obtaining initial common factors from the correlation matrix based on principal component analysis;
S24: establishing, by a regression method, a regression equation that takes each initial common factor as the dependent variable and the numerical variable samples as independent variables, and obtaining the factor score corresponding to each initial common factor, wherein the factor score is a weighted sum.
According to the embodiment of the invention, factor analysis is used to study the internal dependencies among the variables: the M numerical variable samples in each mixed data sample are converted into numerical factor scores, and a small number T of unobservable variables, called comprehensive evaluation factors, represent the basic data structure. The comprehensive indexes can thus be determined objectively and effectively, and the resulting indexes overlap little in information and are highly comparable.
After the M numerical variable samples are numerically standardized, the resulting original variables are denoted $Z_1, Z_2, \ldots, Z_M$ (M is the number of original attributes), and each can be represented as a linear combination of T (T < M) extracted factors $f_1, f_2, \ldots, f_T$:
$$Z_i = a_{i1} f_1 + a_{i2} f_2 + \cdots + a_{iT} f_T + \varepsilon_i$$
where i = 1, 2, …, M and j = 1, 2, …, T. In matrix form the above can be written as Z = AF + ε. Z is the observable M-dimensional variable vector, each component of which represents an index or variable. F is the common factor vector, each component of which represents a factor; the factors are mutually independent, unobservable theoretical variables, and the specific meaning of each common factor must be determined in light of the practical study. The matrix A is the factor load matrix, and $a_{ij}$ is the load of the i-th original variable on the j-th factor. ε is the unobservable special factor, representing the part of the original variables that cannot be explained by the common factors; its mean is 0.
Further, in an embodiment of the present invention, the numerical standardization of the M numerical variable samples in each mixed data sample is calculated as:
$$l_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}$$
where $x_{ij}$ is an element of the numerical variable sample set; $\bar{x}_j$ is the mean of the numerical variable sample; the denominator (written here as $s_j$) is the standard deviation of the numerical variable sample; M is the number of numerical variable samples; P is the number of observation indexes of each numerical variable sample; and $l_{ij}$ is the result of the numerical standardization of the i-th data object in the j-th dimension of the numerical variable samples.
Further, in step S22, the standardized matrix obtained in the previous step is transformed to obtain the correlations between the samples, i.e. the correlation matrix. The transformation may be:
Figure BDA0002263169390000124
L=[l1,l2,l3…lM]
where R is the correlation matrix, L is the standardized matrix, L' is the transpose of L, and $l_M$ is the result of the numerical standardization of the M-th numerical variable sample.
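Steps S21 and S22 can be sketched as follows. This is a minimal illustration in plain Python, not the patented implementation: the function names `standardize` and `correlation_matrix` are hypothetical, rows are taken to be data objects and columns to be numerical variables, and the sketch assumes the sample standard deviation (divisor n - 1) and the form R = L'L / (n - 1), because the exact expressions appear only in the figures.

```python
import math

def standardize(rows):
    """Column-wise z-score: l_ij = (x_ij - mean_j) / std_j.

    Assumption: sample standard deviation (divisor n - 1).
    A constant column (std = 0) is not handled in this sketch.
    """
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
        for j in range(m)
    ]
    return [[(r[j] - means[j]) / stds[j] for j in range(m)] for r in rows]

def correlation_matrix(z):
    """Assumed form R = Z'Z / (n - 1) for an already standardized matrix Z."""
    n, m = len(z), len(z[0])
    return [
        [sum(row[a] * row[b] for row in z) / (n - 1) for b in range(m)]
        for a in range(m)
    ]

# Illustrative data: four data objects, two numerical variables.
data = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2]]
Z = standardize(data)
R = correlation_matrix(Z)
```

With these conventions the diagonal of R is 1 and the off-diagonal entries are the ordinary Pearson correlations between variables.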
Further, obtaining the initial common factors from the correlation matrix based on principal component analysis, as described in step S23, includes, but is not limited to, the following steps:
Let there be M variables. The principal components are solved from the correlation matrix to obtain T principal components, which are arranged in descending order and denoted $Y_1, Y_2, \ldots, Y_T$. Let
Figure BDA0002263169390000131
The following relationship is given:
Figure BDA0002263169390000132
From the above formula, the load matrix A and a group of initial common factors $F_i$ can be calculated, wherein:
Figure BDA0002263169390000133
In the load matrix A, $\gamma_1 \ge \gamma_2 \ge \ldots \ge \gamma_p$ are the characteristic roots (eigenvalues) of the above correlation matrix R, each with a corresponding orthonormalized eigenvector, and m < p.
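As a rough illustration of step S23, the sketch below extracts the leading eigenpairs of a correlation matrix and scales each eigenvector by the square root of its eigenvalue, which is the textbook principal component solution for factor loadings. Whether the figures above use exactly this form is an assumption. The helper names (`leading_eigenpair`, `deflate`, `pc_loadings`) are hypothetical, and power iteration with deflation is used only to keep the example dependency-free; a linear algebra library would normally be preferred.

```python
import math

def leading_eigenpair(R, iters=500):
    """Power iteration for the dominant eigenpair of a symmetric matrix."""
    m = len(R)
    # Non-uniform start vector so it is unlikely to be orthogonal
    # to the dominant eigenvector.
    v = [float(j + 1) for j in range(m)]
    for _ in range(iters):
        w = [sum(R[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue for the converged unit vector.
    lam = sum(v[i] * sum(R[i][j] * v[j] for j in range(m)) for i in range(m))
    return lam, v

def deflate(R, lam, v):
    """Remove the found component: R - lam * v v'."""
    m = len(R)
    return [[R[i][j] - lam * v[i] * v[j] for j in range(m)] for i in range(m)]

def pc_loadings(R, T):
    """Return the T largest eigenvalues and the loading vectors sqrt(lam) * v."""
    eigenvalues, loadings = [], []
    current = [row[:] for row in R]
    for _ in range(T):
        lam, v = leading_eigenpair(current)
        lam = max(lam, 0.0)  # guard against tiny negative rounding error
        eigenvalues.append(lam)
        loadings.append([math.sqrt(lam) * x for x in v])
        current = deflate(current, lam, v)
    return eigenvalues, loadings

# Illustrative 2 x 2 correlation matrix with eigenvalues 1.8 and 0.2.
eigs, loads = pc_loadings([[1.0, 0.8], [0.8, 1.0]], T=2)
```

In practice only the first T < M components would be kept; the example keeps both so that the squared loadings of each variable can be seen to sum to its unit variance.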
Further, in step S24, calculating the factor score corresponding to each initial common factor includes, but is not limited to, the following steps:
A regression method is used to establish a regression equation with each initial common factor as the dependent variable and the original variables as independent variables. The factor score can be regarded as the weighted sum of all variables, where each weight coefficient represents the importance of the variable to the factor. The regression equation may be:
Fj=wj1X1+wj2X2+...+wjMXM,j=1,2,...,T
Further, based on the results of the above factor analysis, the first 14 consumer value features can be condensed into 4 comprehensive evaluation indexes (comprehensive evaluation factors): a quality characteristic factor, a quality safety factor, a package sanitary condition factor, and an external additional condition factor. Factor analysis requires that the factors finally obtained be mutually independent and uncorrelated.
Based on the content of the foregoing embodiment, as an alternative embodiment, after obtaining the initial common factors, the method further includes: performing orthogonal or oblique factor rotation using the variance maximization method to obtain an orthogonal or oblique factor load matrix, and linearly combining the initial common factors using that load matrix.
The correlation matrix in step S22 is linearly combined in order to find common factors with clearer practical meaning; the transformation may be:
Figure BDA0002263169390000141
where $F_1, F_2, \ldots, F_M$ are the initial common factors and $F'_1, F'_2, \ldots, F'_M$ are the new common factors after the linear transformation.
Based on the content of the foregoing embodiment, as an alternative embodiment, before step S2, the method further includes: and carrying out missing value processing and abnormal value processing on the acquired mixed type data sample set.
According to the client classification method provided by the embodiment of the invention, the samples can be preprocessed before the obtained data set is clustered by using a related algorithm, because incomplete data or data without actual value for clustering may exist in original sample data. Through missing value processing and abnormal value processing, not only can the clustering effect be improved, but also the running time required by the algorithm can be reduced, and therefore the performance of the particle swarm clustering algorithm is improved.
The missing value processing may be as follows: based on observation of the sample data, samples that lack multiple feature attributes (more than 2) are deleted first, and the remaining missing values are then filled by imputation, using the mode of the attribute's values.
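The missing value rule above can be sketched in a few lines. This is an illustrative assumption about data layout, not the patent's code: each sample is a list, a missing value is represented by `None`, and the function name `handle_missing` and its `max_missing` parameter are hypothetical.

```python
from collections import Counter

def handle_missing(samples, max_missing=2):
    """Drop samples with more than `max_missing` missing attributes,
    then fill the remaining gaps with the per-attribute mode."""
    kept = [s for s in samples if sum(v is None for v in s) <= max_missing]
    if not kept:
        return []
    n_cols = len(kept[0])
    modes = []
    for j in range(n_cols):
        observed = [s[j] for s in kept if s[j] is not None]
        # most_common(1) returns the most frequent observed value.
        modes.append(Counter(observed).most_common(1)[0][0] if observed else None)
    return [[modes[j] if v is None else v for j, v in enumerate(s)] for s in kept]

cleaned = handle_missing([
    ["a", 1, None, 2],
    ["a", None, None, None],  # three missing attributes: deleted
    ["b", 1, 3, 2],
    ["a", 2, 3, None],
])
```

Here the mode is computed after deletion, over the retained samples only; computing it before deletion would be an equally plausible reading of the text.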
The outlier processing may be as follows: in the embodiment of the invention, a consistency check is first performed on the data, and a deletion threshold is set to decide whether abnormal value records are deleted. When the proportion of abnormal value records in the mixed data sample set is smaller than the deletion threshold, the records may be left as they are; when the proportion is larger than the deletion threshold, the records may adversely affect the final clustering result and may be deleted.
Based on the content of the foregoing embodiment, as an alternative embodiment, the foregoing step S3 includes, but is not limited to, the following steps: dividing the N categorical variable samples into $N_1$ ordered categorical variable samples and $N_2$ nominal categorical variable samples;
respectively obtaining the dissimilarity metric values of the T comprehensive evaluation factors, the $N_1$ ordered categorical variable samples, and the $N_2$ nominal categorical variable samples.
Because categorical variable samples differ in their attributes, they can be further divided into ordered categorical variable samples and nominal categorical variable samples. For example, as shown in Table 1, the attribute values of an ordered categorical variable have a specific order: the rank of the feature is determined by some rule, the values reflect an order relationship, and their relative order matters while their actual magnitude does not. Examples are age (17-25 years, 26-35 years, 36-45 years, 46-55 years, 56-75 years), education level (bachelor's degree and above, junior college, high school, junior middle school, primary school and below), family size (1, 2, 3, 4, 5, 6, 7, 8, 9), average monthly household income (2000 yuan or below, 2001-3000 yuan, 3001-5000 yuan, 5001-7000 yuan, 7001-10000 yuan, 10001-15000 yuan, above 15000 yuan), and city tier (first-tier, second-tier, third-tier, fourth-tier, fifth-tier).
Nominal categorical variables (such as gender and quality judgment variables) are non-numerical. They are coded as numbers for convenience of analysis, but these numeric codes carry no quantitative meaning or order relationship and merely label the various states.
In the embodiment of the invention, not only all mixed data samples are divided into numerical variable samples and classified variable samples, but also the classified variable samples are further divided into ordered classified variable samples and nominal classified variable samples according to different attributes of the samples, and dissimilarity measurement values are calculated by adopting different methods respectively according to the characteristics of different samples, so that the clustering accuracy is further improved.
Further, in the embodiment of the present invention, obtaining the dissimilarity metric values of the T comprehensive evaluation factors, the $N_1$ ordered categorical variable samples, and the $N_2$ nominal categorical variable samples includes, but is not limited to, the following method:
performing numerical attribute normalization on the $N_1$ ordered categorical variable samples, converting each ordered categorical variable sample into an ordered numerical variable sample, and calculating the dissimilarity metric value of the $N_1$ ordered numerical variable samples and the T comprehensive evaluation factors using the following formula:
Figure BDA0002263169390000151
where $d_m(x_i, x_j)$ is the dissimilarity metric value between variable sample $x_i$ and variable sample $x_j$, and n is the number of numerical attributes of each variable sample;
calculating the dissimilarity metric value of the $N_2$ nominal categorical variable samples using the symmetric Hamming formula:
Figure BDA0002263169390000161
where n is the number of attributes of each nominal categorical variable sample, $x_{ip}$ and $x_{jp}$ are the category values of the two data objects in the p-th attribute dimension, and $\delta(x_{ip}, x_{jp})$ is the matching coefficient.
Specifically, the dissimilarity metric value for numerical variable samples is calculated as follows. Numerical variables, also called quantitative or continuous attribute variables, are scalars, i.e. numbers with no directional meaning, and are also called scale variables. The Euclidean distance is the geometric distance between two data objects in Euclidean space; it is widely used to measure the dissimilarity of two scalar elements because it is intuitive and highly interpretable. Therefore, in the embodiment of the present invention, the Euclidean distance is adopted as the sample dissimilarity for numerical attributes. If a mixed data set contains M samples and each data object has n numerical attributes, the dissimilarity measure between data objects $x_i$ and $x_j$ ($1 \le i, j \le M$) is expressed as:
Figure BDA0002263169390000162
Further, the dissimilarity metric value for ordered variable samples is calculated as follows. For ordered variable samples, each value is generally assigned a number, called the rank of the value. For example, if a data attribute has $M_f$ states, the ordered states define a sequence 1, 2, …, $M_f$. In the embodiment of the invention, the dissimilarity is calculated by first converting the non-numerical ordered categorical attribute into a numerical attribute (its rank), and then replacing the original value with the rank-processed value, which is treated as a scalar attribute. The conversion formula may be $(k - 1)/(M_f - 1)$, where k is the rank value, $k \in \{1, 2, \ldots, M_f\}$, and $M_f$ is the total number of attribute values, so that each ordered variable sample is finally mapped to [0, 1], achieving normalization. Finally, the normalized ordered variable samples are processed using the dissimilarity calculation method for numerical variable samples.
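The rank conversion $(k - 1)/(M_f - 1)$ followed by a Euclidean distance can be illustrated as below. The level lists, the dictionary layout of a customer record, and the function names are hypothetical choices made for this sketch; the direction in which each attribute is ranked is likewise an assumption.

```python
import math

def rank_normalize(value, ordered_levels):
    """Map an ordered category to [0, 1] via (k - 1) / (M_f - 1),
    where k is the 1-based rank of `value` in `ordered_levels`.
    Requires at least two levels (M_f > 1)."""
    k = ordered_levels.index(value) + 1
    m_f = len(ordered_levels)
    return (k - 1) / (m_f - 1)

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

# Hypothetical level orderings, lowest rank first.
AGE_BANDS = ["17-25", "26-35", "36-45", "46-55", "56-75"]
CITY_TIERS = ["fifth-tier", "fourth-tier", "third-tier", "second-tier", "first-tier"]

def encode(customer):
    """Turn the ordered attributes of one customer into normalized scalars."""
    return [
        rank_normalize(customer["age"], AGE_BANDS),
        rank_normalize(customer["city"], CITY_TIERS),
    ]

a = encode({"age": "17-25", "city": "first-tier"})
b = encode({"age": "36-45", "city": "third-tier"})
d = euclidean(a, b)
```

Because every ordered attribute ends up in [0, 1], no single attribute dominates the distance merely because it has more levels.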
Further, nominal categorical variables are themselves non-numerical. They are coded as numbers for ease of analysis, but these numeric codes have neither quantitative meaning nor an order relationship; they are simply numbers used to represent the various states. The distance formulas for scalar variables therefore cannot be applied directly to nominal categorical variables. A nominal categorical variable may have two or more state values, such as gender (male, female) or occupation type (staff of Party and government organs and public institutions, company/enterprise staff, freelancers, farmers, education and research institution staff, students, unemployed/retired, others), so nominal categorical variables can be divided into binary variables and multi-class variables.
The only nominal binary variable here is gender, which is a symmetric categorical variable, so the dissimilarity of this attribute is measured with the symmetric Hamming formula (in information coding, the number of bit positions at which two valid codewords differ is called the code distance, also known as the Hamming distance). The dissimilarity can be calculated with the simple matching coefficient and is defined as follows:
Figure BDA0002263169390000171
In the formula, n is the number of categorical attributes, and $x_{ip}$ and $x_{jp}$ are the values of the two data objects in the p-th attribute dimension. The more often $x_{ip}$ and $x_{jp}$ differ, the greater the dissimilarity between the two data objects.
where the matching coefficient is
$$\delta(x_{ip}, x_{jp}) = \begin{cases} 0, & x_{ip} = x_{jp} \\ 1, & x_{ip} \ne x_{jp} \end{cases}$$
Nominal multi-class variables are assigned in the same way as binary variables.
A nominal multi-class variable is a generalization of a binary variable. Its values have no numerical or ordinal meaning; for occupation type, for example, the numeric codes of the attribute serve only as labels.
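The simple matching (Hamming) dissimilarity for nominal attributes can be sketched as follows. It assumes the usual matching coefficient of 0 for equal values and 1 for different values; the function names are hypothetical.

```python
def matching_coefficient(a, b):
    """0 if the two category values are equal, 1 otherwise."""
    return 0 if a == b else 1

def hamming_dissimilarity(x_i, x_j):
    """Number of nominal attributes on which two data objects differ."""
    if len(x_i) != len(x_j):
        raise ValueError("both objects must have the same number of attributes")
    return sum(matching_coefficient(a, b) for a, b in zip(x_i, x_j))

# Two customers described by (gender, occupation type).
d_nominal = hamming_dissimilarity(["male", "student"], ["female", "student"])
```

The same function covers binary and multi-class nominal attributes, since only equality of the category labels is tested.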
In summary, for any mixed data, assume that numerical attributes and nominal categorical attributes are equally important, and that the sample data have g attribute variables, comprising $g_1$ numerical attribute variables (numerical variables and ordered categorical variables) and $g_2$ categorical variables (nominal categorical variables), with $g = g_1 + g_2$. If the dissimilarity of the numerical attribute data is measured mainly with the Euclidean distance formula and that of the categorical attribute data with the matching formula above, the dissimilarity measure of mixed attribute data can be expressed as:
Figure BDA0002263169390000173
Based on the content of the foregoing embodiment, as an alternative embodiment, before the foregoing step S3 the method further includes determining the weight of each dissimilarity metric value based on the entropy weight method, which comprises: normalizing each comprehensive evaluation factor and each ordered categorical variable sample by dispersion (min-max) standardization so that each is mapped to a dimensionless value in the interval [0, 1]; calculating the proportion of each dimensionless value; obtaining the proportion of each nominal categorical variable sample; calculating the information entropy of each proportion; and determining the weight of each comprehensive evaluation factor and each categorical variable sample from the information entropies.
The above dissimilarity formula for mixed data variables assumes that every attribute variable is equally important (i.e. has the same weight). In practical data mining, however, the importance of each attribute is reflected in the dissimilarity measure between objects in the data set, and if attribute importance is ignored when calculating sample distances, the final clustering result is inaccurate. Assigning appropriate weights to the attributes therefore measures the dissimilarity between data objects better and improves clustering quality.
It is therefore necessary to determine the weights according to the importance of each evaluation index. In the embodiment of the invention, a weighting method based on variable information entropy (the entropy weight method) is used to balance the feature weights of the mixed data variables, which can effectively address the randomness and instability of mixed-variable clustering results. The weight coefficient of each attribute is determined automatically, which avoids subjective arbitrariness in weight determination and provides a basis for multi-index comprehensive evaluation. The entropy weight method provided by the embodiment of the invention is an objective weighting algorithm: the entropy weight of each index is calculated from the information entropy according to the degree of variation of the index's attribute values, and the weight of each index is corrected by the entropy weight to obtain objective index weights. Compared with the subjective weighting methods of the prior art, this method is more precise and more objective, explains the resulting weights better, and can be used in any process that requires weights to be determined.
The processing procedure for calculating the data characteristic weight based on the entropy weight method provided by the embodiment of the invention can be realized by the following steps:
(1) data normalization
Because the numerical attributes (numerical variables and ordered categorical variables) in the mixed data sample set differ in measurement standard and meaning, their value ranges vary widely. If the Euclidean distance were used directly to calculate dissimilarity, attributes with large values would dominate the dissimilarity between samples, so the attributes must be standardized. In the customer classification method provided by the embodiment of the invention, the original variable data are linearly transformed by dispersion (min-max) standardization (only numerical variables and ordered categorical variables are normalized; nominal categorical variables are not processed), mapping the results to the interval [0, 1] as dimensionless values so that indexes of different units or orders of magnitude can be compared and weighted.
Suppose the m original data samples (for example, m = 9, not counting nominal categorical variables), which comprise the comprehensive evaluation factors and the ordered categorical variable samples, have numerical attributes $X_1, X_2, \ldots, X_m$, where the entries of each $X_i$ are indexed 1, 2, …, n and n represents the number of sample data variables. The original data sample $X_m$ is normalized by the following formula:
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
where $x_{\max}$ and $x_{\min}$ are the maximum and minimum values of the numerical attribute in the original data sample, respectively.
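Dispersion (min-max) standardization of a single numerical attribute can be sketched as below. The function name is hypothetical, and returning zeros for a constant column is a choice made for this example only; the source does not say how that case is handled.

```python
def min_max_normalize(column):
    """Map the values of one numerical attribute linearly onto [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption for this sketch: a constant attribute carries no
        # information, so every value is mapped to 0.0.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

income = [2000, 3000, 5000, 7000]
normalized_income = min_max_normalize(income)
```

Each attribute is normalized independently, so attributes measured in different units end up on the same dimensionless scale.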
(2) Calculating a mixed attribute weight
Calculate the proportion of the numerical attribute (now a dimensionless value) of each normalized original data sample in the j-th dimension of the data; $p_{ij}$ represents the contribution of the i-th data object $x_i$ under the j-th attribute (for example, in the above embodiment i = 1, 2, …, 3230 and j = 1, 2, …, 9):
Figure BDA0002263169390000192
Further, the proportion of each nominal categorical attribute is calculated as:
Figure BDA0002263169390000193
where $r_{ij}$ is the number of occurrences of the nominal categorical variable sample value $x_{ij}$, and a is the total number of samples.
(3) Calculating the information entropy of each index
According to the definition of information entropy:
Figure BDA0002263169390000194
where the constant K = 1/ln n, with the convention that $p_{ij} \ln p_{ij} = 0$ when $p_{ij} = 0$; i = 1, 2, …, m; j = 1, 2, …, n. From this, the information entropy of each index is calculated as:
Figure BDA0002263169390000195
where $p_{ij}$ is the proportion of the i-th data object $x_i$ under the j-th attribute, and $H_i$ is the information entropy of the i-th proportion.
In the customer classification method provided by the embodiment of the invention, the mixed data sample set contains multiple numerical variable samples and multiple nominal variable samples. The feature information entropy is calculated separately for the two kinds, and the relative feature weights are then calculated for each. This makes the analysis of the samples more practical and the final analysis result more realistic.
(4) Determining the weights of the indexes
The entropy weight of each index is calculated from its entropy value, which determines the objective weight $w_i$ of the index ($0 \le w_i \le 1$), and
Figure BDA0002263169390000201
the calculation formula is as follows:
Figure BDA0002263169390000202
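Steps (2) to (4) can be illustrated with the textbook form of the entropy weight method. Because the formulas above survive only as figures, the expressions used here are assumptions: $p_{ij} = x_{ij} / \sum_i x_{ij}$ for numerical attributes, $p = r / a$ for nominal categories, $H = -(1/\ln n) \sum p \ln p$, and $w_j = (1 - H_j) / \sum_j (1 - H_j)$. All function names are hypothetical.

```python
import math
from collections import Counter

def column_proportions(column):
    """p_ij = x_ij / sum_i x_ij for one normalized numerical attribute."""
    total = sum(column)
    if total == 0:
        raise ValueError("column sums to zero; proportions are undefined")
    return [x / total for x in column]

def nominal_proportions(values):
    """p = r / a: occurrences of each category over the total sample count."""
    a = len(values)
    return {category: r / a for category, r in Counter(values).items()}

def entropy(proportions, n_states):
    """H = -(1 / ln n) * sum(p ln p), with p ln p taken as 0 when p = 0."""
    k = 1.0 / math.log(n_states)
    return -k * sum(p * math.log(p) for p in proportions if p > 0)

def entropy_weights(entropies):
    """Assumed form w_j = (1 - H_j) / sum_j (1 - H_j); the weights sum to 1."""
    divergence = [1.0 - h for h in entropies]
    total = sum(divergence)
    return [d / total for d in divergence]

# One attribute that is identical for every object, one that is concentrated.
uniform = column_proportions([0.25, 0.25, 0.25, 0.25])
peaked = column_proportions([1.0, 0.0, 0.0, 0.0])
H = [entropy(uniform, 4), entropy(peaked, 4)]
w = entropy_weights(H)
gender_p = nominal_proportions(["male", "female", "male", "male"])
```

An attribute whose values are the same for all objects has entropy 1 and receives a weight near 0, while an attribute that discriminates strongly between objects receives most of the weight. Which value of n normalizes the entropy of a nominal attribute is not recoverable from the text, so it is left as the `n_states` parameter.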
In summary, the new dissimilarity measurement formula for mixed attribute data provided by the embodiment of the present invention is:
Figure BDA0002263169390000203
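One plausible reading of the weighted mixed dissimilarity is a weighted Euclidean term over the numerical attributes plus a weighted simple matching term over the nominal attributes. The exact formula is contained only in the figure, so the combination below is an assumption, as are the function name and the index-based argument layout.

```python
import math

def mixed_dissimilarity(x_i, x_j, numeric_idx, nominal_idx, weights):
    """Assumed form: sqrt(sum_p w_p (x_ip - x_jp)^2) over numerical attributes
    plus sum_p w_p * delta(x_ip, x_jp) over nominal attributes."""
    numeric_part = math.sqrt(
        sum(weights[p] * (x_i[p] - x_j[p]) ** 2 for p in numeric_idx)
    )
    nominal_part = sum(
        weights[p] * (0 if x_i[p] == x_j[p] else 1) for p in nominal_idx
    )
    return numeric_part + nominal_part

# Attributes 0 and 1 are normalized numerical values, attribute 2 is nominal.
u = [0.0, 1.0, "male"]
v = [0.5, 0.5, "female"]
weights = [0.4, 0.4, 0.2]  # illustrative entropy weights summing to 1
d_mixed = mixed_dissimilarity(u, v, [0, 1], [2], weights)
```

With all weights equal, this reduces to a scaled version of the unweighted mixed measure, so the weighted form only changes the relative influence of the attributes.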
based on the content of the above embodiments, as an alternative embodiment, the step S4 includes but is not limited to:
based on the particle swarm algorithm, updating the individual optimum, the global optimum, and the velocity and position of the particles using the dissimilarity metric values, and performing out-of-bounds handling, until the optimal k cluster center points are obtained;
taking the k cluster center points as the initial cluster centers of the K-Means clustering algorithm to construct the customer classification model.
When cluster analysis is performed with the K-Means clustering algorithm, the initial cluster centers are determined first and are then optimized iteratively until they no longer change. Although the K-Means clustering algorithm converges quickly and has strong local search capability, the cluster centers are initialized randomly, so their initial positions are highly random; the clustering result therefore fluctuates considerably and easily falls into a local minimum.
In view of the above, the invention provides a particle swarm-based K-Means clustering algorithm, which optimizes an initial clustering center of the K-Means algorithm by using the strong global search capability and faster convergence rate of the particle swarm optimization algorithm, so as to eliminate the dependence of the K-Means on an initial value of the clustering center and improve the accuracy and convergence of a clustering result.
The embodiment of the invention provides a particle swarm optimization algorithm-based method for optimizing a K-Means clustering algorithm of dissimilarity metric values, which specifically comprises the following steps:
First, the particle swarm algorithm searches for the optimal k cluster center points; then the K-Means clustering algorithm determines the final clustering result; finally, the result is output, which completes the construction of the customer classification model. The position (location), velocity (velocity), and fitness value (fitness) of a particle describe its information, so the structure of particle i has three parts and can be expressed as:
particle(i)={location[],velocity[],fitness}
Assume that the mixed data sample set has n samples, each with p feature attributes, that the number of clusters is k, and that the cluster centers are $e_j$ (j = 1, 2, …, k). M particles are generated; the position of each particle consists of k cluster centers, and the position coding structure of a particle is given by the following formula, where i = 1, 2, …, M and M is the population size:
Figure BDA0002263169390000211
where $e_j$ (j = 1, 2, …, k; k is the number of clusters) is the cluster center of the j-th cluster and is a p-dimensional vector, whose coding structure is:
Figure BDA0002263169390000212
Further, the velocity of each particle consists of k cluster-center velocities, and the velocity coding structure is:
particle(i).velocity[]=[v1,v2,...,vk]
Further, in each iteration the fitness values of the population must be calculated. The fitness value of each particle is a real number and can be expressed by the following formula, where a is a positive constant and E is the intra-class aggregation degree. The smaller the intra-class aggregation degree (the final objective of the clustering in the embodiments of the present invention), the larger the fitness value of the particle (the final objective of the objective function provided in the embodiments of the present invention). The fitness value coding structure is:
Figure BDA0002263169390000213
During iterative updating and evolution, the individual optimal solution $f(p_i)$ of each particle represents the best fitness value that particle has experienced, and the global optimal solution $f(p_g)$ of the whole particle swarm represents the best fitness value the population has experienced, which gives:
f(pi)={location[],fitness}
f(pg)={location[],fitness}
Therefore, according to the definition of the particle swarm optimization algorithm, the position update formula of particle i is:
particle(i).location[]'=w·particle(i).location[]+c1r1·(pi-particle(i).location[]+ε)+c2r2·(pg-particle(i).location[])
The particle swarm clustering algorithm is applied according to the particle swarm structure defined above. Once the global optimal solution (the cluster centers) is determined, the cluster partition is determined by the nearest-neighbor rule, i.e. each data object is assigned to the class nearest to it. Thus, data object $x_i$ (i = 1, 2, …, n) and its cluster center point satisfy the following equation:
Figure BDA0002263169390000221
The particle swarm optimization algorithm provided by the embodiment of the invention can adopt a real-number coding mechanism. In this cluster-center-based coding, the code of each particle comprises the particle's position vector, velocity vector, and fitness value, and one code corresponds to one feasible solution of the clustering problem. Assuming that each data object has p attribute dimensions and the number of clusters is k, the coding scheme of each particle i is as follows:
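$$X_{11}\,X_{12}\,\ldots\,X_{1p}\;\;X_{21}\,X_{22}\,\ldots\,X_{2p}\;\ldots\;X_{k1}\,X_{k2}\,\ldots\,X_{kp}\;\;V_{11}\,V_{12}\,\ldots\,V_{1p}\;\;V_{21}\,V_{22}\,\ldots\,V_{2p}\;\ldots\;V_{k1}\,V_{k2}\,\ldots\,V_{kp}\;\;\mathrm{fitness}(x)$$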
where $X_{jp}$ represents the p-th attribute value of the center point of the j-th class (j = 1, 2, …, k). The first part of the coding structure represents the position of the particle and consists of k cluster centers, with dimension k × p; the middle part represents the velocity of the particle, also with dimension k × p; the final part is the fitness value of the particle, which evaluates how good the particle position $X_j$ is.
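The particle coding described above maps naturally onto a small data structure. The sketch below is illustrative only: the class name `Particle`, the `encode` method, and the `decode` helper are hypothetical, and the flat list layout (k × p position values, then k × p velocity values, then the fitness) simply mirrors the coding scheme in the text.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Particle:
    location: List[List[float]]  # k cluster centers, each with p attributes
    velocity: List[List[float]]  # k center velocities, each with p components
    fitness: float = 0.0

    def encode(self) -> List[float]:
        """Flatten to [X11..Xkp, V11..Vkp, fitness]."""
        flat_location = [x for center in self.location for x in center]
        flat_velocity = [v for center in self.velocity for v in center]
        return flat_location + flat_velocity + [self.fitness]

def decode(code: List[float], k: int, p: int) -> Particle:
    """Rebuild a particle from its flat code of length 2 * k * p + 1."""
    location = [code[j * p:(j + 1) * p] for j in range(k)]
    offset = k * p
    velocity = [code[offset + j * p:offset + (j + 1) * p] for j in range(k)]
    return Particle(location, velocity, code[-1])

particle = Particle(
    location=[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],  # k = 3, p = 2
    velocity=[[0.0, 0.0] for _ in range(3)],
    fitness=1.25,
)
code = particle.encode()
```

Keeping the position as k separate centers, rather than one flat vector, makes the later distance calculations against data objects easier to read.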
Further, the fitness value of a particle reflects the dissimilarity between the data objects within each class: a larger fitness value indicates that the data within each class are more tightly grouped, and a smaller dissimilarity value indicates a better clustering effect. Therefore, in the embodiment of the present invention, the objective of the improved particle swarm clustering algorithm is to search for the particle position that maximizes the particle's fitness; the cluster centers corresponding to that position are the global optimal position. The intra-class aggregation degree of a particle can be expressed as:
Figure BDA0002263169390000222
where $E_j$ (j = 1, 2, …, k) denotes the j-th class, X denotes a data object in the sample data set, and $X_j$ represents the cluster center of class $E_j$. The fitness function provided by the embodiment of the invention is as follows:
Figure BDA0002263169390000223
where a is a positive constant chosen according to the specific situation.
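The intra-class aggregation degree and the fitness can be sketched as below. The formulas themselves are only in the figures, so two assumptions are made: E is the sum, over all classes, of the distances from each member to its class center, and the fitness is a / E, which matches the statement that a smaller E gives a larger fitness. The function names and the plain Euclidean distance used as the default are illustrative; the mixed dissimilarity measure could be passed in instead.

```python
import math

def euclidean(a, b):
    """Euclidean distance, used here as a stand-in dissimilarity measure."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def intra_class_aggregation(clusters, centers, dist=euclidean):
    """Assumed form: E = sum over classes j of sum over X in E_j of d(X, X_j)."""
    return sum(
        dist(x, centers[j])
        for j, members in enumerate(clusters)
        for x in members
    )

def fitness(clusters, centers, a=1.0, dist=euclidean):
    """Assumed form: fitness = a / E, with a > 0."""
    e = intra_class_aggregation(clusters, centers, dist)
    return math.inf if e == 0 else a / e

clusters = [[[0.0, 0.0], [0.0, 1.0]], [[4.0, 4.0], [4.0, 5.0]]]
centers = [[0.0, 0.5], [4.0, 4.5]]
E = intra_class_aggregation(clusters, centers)
f = fitness(clusters, centers)
```

Returning infinity when E is 0 is a guard added for this example; it only arises when every data object coincides with its class center.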
The particle-swarm-optimized K-Means clustering algorithm provided by the embodiment of the invention is designed to classify without iteration: after the optimal cluster centers have been determined by the particle swarm, the distance between each data object X in the sample data set and each cluster center $X_j$ (j = 1, 2, …, k) is calculated, X is assigned to the j-th class whose center is nearest according to the minimum-distance principle, and X is labeled j. If X and $X_j$ satisfy the following distance formula, X belongs to class j:
Figure BDA0002263169390000224
The distance in this formula may be calculated with the mixed data dissimilarity measure described in step S3, which completes the partition of all sample data in the mixed data sample set and thereby constructs the customer classification model.
Fig. 3 is a schematic flow chart of another customer classification method according to an embodiment of the present invention, and as shown in fig. 3, when the customer classification model is used to classify customers, an internal operation mode includes:
When customers are classified, the number of samples in the input customer consumption data set and the number of clusters are determined first. The particle population is then initialized based on the particle swarm optimization algorithm, and the fitness of the current particles is calculated with the fitness function (which can be determined from the calculation method of the dissimilarity metric value). Further, the individual optimum and the global optimum are updated according to the particle fitness values obtained, and the velocity and position of each particle are updated accordingly. Then, whether the convergence requirement is satisfied is checked on the basis of the calculated data. If it is satisfied, the global extremum and the global optimal position are output, the sample data in the customer consumption data set are assigned to the cluster centers of the clustering algorithm according to the minimum-distance principle, and the classification result corresponding to each cluster center is output.
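The flow of Fig. 3 can be condensed into a short end-to-end sketch. It is not the patented implementation. It uses the textbook two-step particle swarm update (velocity, then position) rather than the single-line formula given earlier, a fixed iteration count in place of a convergence test, clamping to the data range as the out-of-bounds handling, plain Euclidean distance in place of the weighted mixed dissimilarity, and fitness = a / E. All names and parameter values are hypothetical.

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def fitness(centers, points, a=1.0):
    """a / E, where E sums each point's distance to its nearest center."""
    e = sum(min(dist(x, c) for c in centers) for x in points)
    return a / e if e > 0 else float("inf")

def pso_centers(points, k, n_particles=15, iters=60, w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = random.Random(seed)
    p = len(points[0])
    lo = [min(x[d] for x in points) for d in range(p)]
    hi = [max(x[d] for x in points) for d in range(p)]
    # Each particle holds k centers, initialized on randomly chosen data points.
    pos = [[list(rng.choice(points)) for _ in range(k)] for _ in range(n_particles)]
    vel = [[[0.0] * p for _ in range(k)] for _ in range(n_particles)]
    best_pos = [[c[:] for c in part] for part in pos]
    best_fit = [fitness(part, points) for part in pos]
    g = max(range(n_particles), key=lambda i: best_fit[i])
    g_pos, g_fit = [c[:] for c in best_pos[g]], best_fit[g]
    for _ in range(iters):
        for i in range(n_particles):
            for j in range(k):
                for d in range(p):
                    r1, r2 = rng.random(), rng.random()
                    vel[i][j][d] = (w * vel[i][j][d]
                                    + c1 * r1 * (best_pos[i][j][d] - pos[i][j][d])
                                    + c2 * r2 * (g_pos[j][d] - pos[i][j][d]))
                    # Out-of-bounds handling: clamp to the observed data range.
                    pos[i][j][d] = max(lo[d], min(hi[d], pos[i][j][d] + vel[i][j][d]))
            f = fitness(pos[i], points)
            if f > best_fit[i]:  # update the individual optimum
                best_fit[i], best_pos[i] = f, [c[:] for c in pos[i]]
                if f > g_fit:    # update the global optimum
                    g_fit, g_pos = f, [c[:] for c in pos[i]]
    return g_pos

def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: dist(x, centers[j])) for x in points]

points = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
centers = pso_centers(points, k=2)
labels = assign(points, centers)
```

The returned centers could also be handed to an iterative K-Means refinement as its initial centers, which is the other use of the particle swarm result described in step S4.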
An embodiment of the present invention further provides a customer classification system, as shown in fig. 4, including but not limited to: a data acquisition device 1, a first data processing unit 2, a second data processing unit 3, a third data processing unit 4 and a customer classification unit 5, wherein:
the data acquisition device 1 is used for acquiring a mixed data sample set composed of a plurality of mixed data samples, each of which is composed of M numerical variable samples and N categorical variable samples; the first data processing unit 2 is used for performing feature extraction on the M numerical variable samples to obtain T comprehensive evaluation factors; the second data processing unit 3 is used for acquiring the dissimilarity metric values of the T comprehensive evaluation factors and the N categorical variable samples; the third data processing unit 4 is used for optimizing the K-Means clustering algorithm of the dissimilarity metric value based on a particle swarm optimization algorithm and constructing a customer classification model; and the customer classification unit 5 is used for inputting the customer consumption data set into the customer classification model to obtain a customer classification result.
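As a rough illustration of what the first data processing unit computes, the sketch below standardizes the numerical variables, forms their correlation matrix, extracts T factors by principal components, and scores them with the regression method. It is a sketch under assumptions: factor rotation is omitted, z-score standardization with the sample standard deviation is assumed, and the function name `extract_factors` is hypothetical.

```python
import numpy as np

def extract_factors(X, T):
    """Return (factor scores, factor loadings) for T principal-component factors."""
    X = np.asarray(X, dtype=float)
    # z-score standardization of every numerical variable (column)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # correlation matrix of the standardized data
    R = Z.T @ Z / (X.shape[0] - 1)
    # principal-component extraction: eigen-decomposition of R, largest first
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1][:T]
    lam, V = eigvals[order], eigvecs[:, order]
    loadings = V * np.sqrt(lam)
    # regression-method scores F = Z R^{-1} A; for unrotated
    # principal-component factors this reduces to Z V / sqrt(lam)
    scores = Z @ V / np.sqrt(lam)
    return scores, loadings

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))      # 50 samples, 4 numerical variables
scores, loadings = extract_factors(X_demo, T=2)
```

Each column of `scores` is a weighted sum of the standardized variables, and the resulting factors are uncorrelated with unit variance, which keeps them on a comparable scale before the dissimilarity calculation.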
The customer classification system provided by the embodiment of the invention combines the K-Means clustering algorithm with the improved particle swarm optimization algorithm and takes into account the various indexes that influence customer consumption. It uses the improved clustering algorithm to perform accurate cluster analysis on customer consumption data and obtain customer groups with different characteristics. The clustering results are more reasonable and clearer to analyze, which makes it easier to adopt different operation strategies and customer service strategies for different groups, thereby helping retail enterprises maintain a leading market position.
Fig. 5 illustrates a physical structure diagram of a server. As shown in Fig. 5, the server may include: a processor 510, a communications interface 520, a memory 530 and a communication bus 540, wherein the processor 510, the communications interface 520 and the memory 530 communicate with each other via the communication bus 540. The processor 510 may call logic instructions in the memory 530 to perform a method comprising the following steps. Step S1: acquiring a mixed data sample set consisting of a plurality of mixed data samples, each mixed data sample consisting of M numerical variable samples and N categorical variable samples. Step S2: performing feature extraction on the M numerical variable samples to obtain T comprehensive evaluation factors. Step S3: obtaining dissimilarity metric values of the T comprehensive evaluation factors and the N categorical variable samples. Step S4: optimizing the K-Means clustering algorithm of the dissimilarity metric value based on a particle swarm optimization algorithm to construct a customer classification model. Step S5: inputting the customer consumption data set into the customer classification model to obtain a customer classification result.
Furthermore, the logic instructions in the memory 530 may be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, performs the customer classification method provided in the foregoing embodiments. For example, the method includes the following steps. Step S1: acquiring a mixed data sample set consisting of a plurality of mixed data samples, each mixed data sample consisting of M numerical variable samples and N categorical variable samples. Step S2: performing feature extraction on the M numerical variable samples to obtain T comprehensive evaluation factors. Step S3: obtaining dissimilarity metric values of the T comprehensive evaluation factors and the N categorical variable samples. Step S4: optimizing the K-Means clustering algorithm of the dissimilarity metric value based on a particle swarm optimization algorithm to construct a customer classification model. Step S5: inputting the customer consumption data set into the customer classification model to obtain a customer classification result.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (18)

1. A customer categorization method, comprising:
step S1: acquiring a mixed data sample set consisting of a plurality of mixed data samples; each mixed data sample consists of M numerical variable samples and N classification variable samples;
step S2: performing feature extraction on the M numerical variable samples to obtain T comprehensive evaluation factors;
step S3: acquiring dissimilarity metric values of the T comprehensive evaluation factors and the N classified variable samples;
step S4: optimizing the K-Means clustering algorithm of the dissimilarity metric value based on a particle swarm optimization algorithm to construct a customer classification model;
step S5: and inputting the customer consumption data set into the customer classification model to obtain a customer classification result.
2. The customer classification method according to claim 1, wherein in the step S2, the performing feature extraction on the mixed type data sample set includes:
respectively carrying out digital standardization calculation on the M numerical variable samples in each mixed data sample to obtain a standardization matrix;
obtaining a correlation matrix of the standardized matrix;
based on a principal component analysis method, obtaining an initial common factor according to the correlation matrix;
establishing a regression equation with the initial common factor as a dependent variable and the numerical variable sample as an independent variable based on a regression method, and acquiring a factor score corresponding to the initial common factor; wherein the factor score is a weighted sum.
3. The customer categorization method of claim 2, further comprising, after said obtaining the initial common factor: and performing orthogonal factor rotation or skew factor rotation by adopting a variance maximization method to obtain an orthogonal or skew factor load matrix, and performing linear combination processing on the initial common factor by using the orthogonal or skew factor load matrix.
4. The customer classification method according to claim 2, wherein the performing the numerical normalization calculation on the M numerical variable samples in each of the mixed data samples respectively comprises:
Figure FDA0002263169380000021
wherein x_ij is the numerical variable sample set;
Figure FDA0002263169380000022
is the average value of the numerical variable samples; the denominator is the standard deviation of the numerical variable samples; M is the number of numerical variable samples, P is the number of observation indexes of each numerical variable sample, and l_ij is the result of the numerical standardization calculation for the j-th dimension of the i-th data.
5. The customer categorization method of claim 4, wherein the obtaining of the correlation matrix of the standardized matrices comprises:
Figure FDA0002263169380000023
L = [l_1, l_2, l_3, …, l_M]
wherein R is the correlation matrix, L is the standardized matrix, L' is the transpose of L, and l_M is the result of the numerical standardization calculation for the M-th numerical variable sample.
6. The customer categorization method of claim 1, further comprising, before said step S2: and carrying out missing value processing and abnormal value processing on the mixed type data sample set.
7. The customer classification method according to claim 1, wherein the step S3 includes: dividing the N categorical variable samples into N_1 ordered categorical variable samples and N_2 nominal categorical variable samples;
respectively obtaining the dissimilarity metric values of the T comprehensive evaluation factors, the N_1 ordered categorical variable samples and the N_2 nominal categorical variable samples.
8. The customer categorization method of claim 7, wherein respectively obtaining the dissimilarity metric values of the T comprehensive evaluation factors, the N_1 ordered categorical variable samples and the N_2 nominal categorical variable samples comprises:
performing numerical-attribute normalization on the N_1 ordered categorical variable samples, converting each ordered categorical variable sample into an ordered numerical variable sample, and calculating the dissimilarity metric values of the N_1 ordered numerical variable samples and the T comprehensive evaluation factors using the following formula:
Figure FDA0002263169380000024
wherein d_m(x_i, x_j) is the dissimilarity between variable sample x_i and variable sample x_j, m is the number of numerical attributes of each ordered numerical variable sample, x_im denotes the value of the m-th dimension of the i-th data, and x_jm denotes the value of the m-th dimension of the j-th data;
calculating the dissimilarity metric values of the N_2 nominal categorical variable samples using the symmetric Hamming formula:
Figure FDA0002263169380000031
where n is the number of numerical attributes of each nominal categorical variable sample, x_ip and x_jp respectively denote the categorical values of the p-th dimension attribute of the data objects, and δ(x_ip, x_jp) is the matching coefficient.
9. The customer classification method according to claim 8, characterized in that if the nominal categorical variable samples are binary nominal categorical variable samples:
Figure FDA0002263169380000032
10. The customer classification method according to claim 7, further comprising, before the step S3:
determining the weight of each dissimilarity metric value based on an entropy weight method, which comprises: normalizing each comprehensive evaluation factor and each ordered categorical variable sample using the dispersion (min-max) standardization method, and mapping each of them to a dimensionless value in the range [0, 1];
calculating a specific gravity of each of the dimensionless numbers;
obtaining a specific gravity of each nominal type categorical variable sample;
calculating the information entropy of each specific gravity;
and determining the weight of each comprehensive evaluation factor and each classification type variable sample according to each information entropy.
11. The customer categorization method of claim 10, wherein said calculating the specific gravity of each of said dimensionless numbers comprises:
the contribution of each dimensionless number is obtained using the following formula:
Figure FDA0002263169380000033
wherein p_ij is the contribution degree of the i-th data object x_i under the j-th attribute;
determining a specific gravity of each of the comprehensive evaluation factors based on the contribution degree.
12. The customer classification method of claim 10, wherein the obtaining a specific gravity for each of the nominal classification variable samples comprises:
acquiring the occurrence times of each nominal type classification variable sample and the number of total samples, and according to the following formula:
Figure FDA0002263169380000041
obtaining a specific gravity of each nominal type categorical variable sample;
wherein r_ij denotes the number of occurrences of the nominal categorical variable sample x_ij, and a is the total number of samples.
13. The customer classification method according to claim 12, wherein the calculation formula of the information entropy of each of the specific gravities is:
Figure FDA0002263169380000042
wherein p_ij is the specific gravity of the i-th data object x_i under the j-th attribute, and H_i is the information entropy of the i-th specific gravity.
14. The customer categorization method of claim 12 wherein the determining a weight for each of the composite valuation factors and each of the categorical variable samples based on each of the information entropies comprises:
Figure FDA0002263169380000043
wherein w_i is the weight of the i-th information entropy, and m is the total number of information entropies.
15. The customer classification method according to claim 1, wherein the step S4 includes:
updating the individual optimum, the global optimum, and the velocity and position of the particles by means of the dissimilarity metric value based on a particle swarm algorithm, and performing out-of-bounds processing until the optimal k cluster center points are obtained;
and taking the k cluster center points as the initial cluster centers of the K-Means clustering algorithm to construct the customer classification model.
16. A customer classification system, comprising:
data acquisition means for acquiring a mixed data sample set composed of a plurality of mixed data samples, each of which is composed of M numerical variable samples and N categorical variable samples;
the first data processing unit is used for carrying out feature extraction on the M numerical variable samples to obtain T comprehensive evaluation factors;
the second data processing unit is used for acquiring the dissimilarity metric values of the T comprehensive evaluation factors and the N categorical variable samples;
the third data processing unit is used for optimizing the K-Means clustering algorithm of the dissimilarity metric value based on a particle swarm optimization algorithm and constructing a customer classification model;
and the customer classification unit is used for inputting the customer consumption data set into the customer classification model to obtain a customer classification result.
17. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the customer classification method according to any one of claims 1 to 15 are performed by the processor when executing the program.
18. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the customer classification method according to any one of claims 1 to 15.
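For readers who want a concrete picture of the entropy weight steps recited in claims 10 to 14, the following sketch uses the textbook form of the method: min-max scaling to [0, 1], column-wise proportions, information entropy normalized by the logarithm of the sample count, and weights proportional to one minus the entropy. The claimed formulas appear only as images in this text, so this textbook form is an assumption, and the function name `entropy_weights` is hypothetical.

```python
import numpy as np

def entropy_weights(X):
    """Textbook entropy-weight calculation (assumed form, not the claimed formulas).

    X: array of shape (samples, attributes); every column must vary.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    span = X.max(axis=0) - X.min(axis=0)
    if np.any(span == 0):
        raise ValueError("constant columns carry no information to weight")
    Z = (X - X.min(axis=0)) / span               # min-max scaling to [0, 1]
    P = Z / Z.sum(axis=0)                        # proportion of each sample per attribute
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(P > 0, P * np.log(P), 0.0)   # treat 0 * log(0) as 0
    H = -terms.sum(axis=0) / np.log(n)           # information entropy in [0, 1]
    d = 1.0 - H                                  # degree of divergence
    return d / d.sum()                           # weights sum to 1

X_demo = np.array([[1.0, 10.0],
                   [2.0, 10.5],
                   [3.0, 11.0],
                   [10.0, 11.5]])
weights = entropy_weights(X_demo)
```

An attribute whose values are spread unevenly across samples has lower entropy and therefore receives a larger weight; in the toy data the first column, dominated by one large value, is weighted more heavily than the evenly spaced second column.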
CN201911078287.9A 2019-11-06 2019-11-06 Customer classification method and system and electronic equipment Active CN110866782B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911078287.9A CN110866782B (en) 2019-11-06 2019-11-06 Customer classification method and system and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911078287.9A CN110866782B (en) 2019-11-06 2019-11-06 Customer classification method and system and electronic equipment

Publications (2)

Publication Number Publication Date
CN110866782A true CN110866782A (en) 2020-03-06
CN110866782B CN110866782B (en) 2022-09-16

Family

ID=69653182

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911078287.9A Active CN110866782B (en) 2019-11-06 2019-11-06 Customer classification method and system and electronic equipment

Country Status (1)

Country Link
CN (1) CN110866782B (en)

Also Published As

Publication number Publication date
CN110866782B (en) 2022-09-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant