CN111339294B - Customer data classification method and device and electronic equipment - Google Patents

Publication number
CN111339294B
Authority
CN
China
Prior art keywords
attribute
data
attributes
client
type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010086453.6A
Other languages
Chinese (zh)
Other versions
CN111339294A (en
Inventor
井玉欣
陈永林
陈甜甜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Puxin Hengye Technology Development Beijing Co ltd
Original Assignee
Puxin Hengye Technology Development Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Puxin Hengye Technology Development Beijing Co ltd
Priority to CN202010086453.6A
Publication of CN111339294A
Application granted
Publication of CN111339294B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and a device for classifying customer data and electronic equipment, wherein the method comprises the following steps: acquiring client data comprising a plurality of client records; wherein the customer data includes a plurality of columns of attributes; respectively determining the attribute type of each column of attribute; wherein the attribute type is a classification type or a numerical value type; converting attribute values corresponding to the classified data attributes in the client data into attribute values corresponding to the numerical data attributes; and executing clustering operation on the client data to obtain a clustering result for representing client subdivision. The invention can convert the classified data attribute into the numerical data attribute and then execute the clustering operation, so that the classified data attribute and the numerical data attribute can be uniformly considered in the clustering operation, and the classification effect is better.

Description

Customer data classification method and device and electronic equipment
Technical Field
The application relates to the technical field of marketing and artificial intelligence, in particular to a client data classification method, a client data classification device and electronic equipment.
Background
Customer subdivision (customer segmentation) is essential for enterprises: it helps them understand their customers better, reduces costs effectively, and brings considerable benefits. Customer subdivision is foundational work; for customer relationship management in particular, accurate customer subdivision greatly improves marketing efficiency, and it also supports general marketing tactics, production operations, and even enterprise strategy.
At present, the customer subdivision mainly starts from distinguishing different demands of customers and different attribute characteristics of the customers, divides the whole market into a plurality of sub-markets requiring different products and different marketing combinations according to certain standards, selects certain target markets on the basis, and finally designs the whole activity process of corresponding marketing tools.
The object of the client subdivision is client data, which typically includes basic data, behavior data, etc., and may include both categorical data attributes (discrete type) and numerical data attributes (continuous type) as viewed in data type. For the client data, the attribute of the classified data comprises the attribute of sex, occupation, residence and the like of the client, and the value range of the attribute is discrete and limited; the numeric data attributes include income, login time length, consumption amount and the like of the client, and the numeric data attributes have a numeric range of continuous numeric intervals.
At present, when client data is classified, the data types of its attributes (classified data attributes and numerical data attributes) are not distinguished. Because classified data attributes have more pronounced characteristics, the clustering algorithm is biased toward them, so the final classification result focuses too much on the classified data attributes and neglects the numerical data attributes, and the classification effect is poor.
Disclosure of Invention
In view of this, the present application provides a method, an apparatus, and an electronic device for classifying customer data, which can convert a classified data attribute into a numeric data attribute, and then perform a clustering operation, so that the classified data attribute and the numeric data attribute can be considered in a balanced manner in the clustering operation, so that the classification effect is better.
In order to achieve the above object, the present invention provides the following technical features:
a method of classifying customer data, comprising:
acquiring client data comprising a plurality of client records; wherein the customer data includes a plurality of columns of attributes;
respectively determining the attribute type of each column of attribute; wherein the attribute type is a classification type or a numerical value type;
converting attribute values corresponding to the classified data attributes in the client data into attribute values corresponding to the numerical data attributes;
and executing clustering operation on the client data to obtain a clustering result for representing client subdivision.
Optionally, the determining the attribute type of each column of attributes includes:
judging whether the value type of the attribute values corresponding to a column of attributes is a continuous data type;
determining the column attribute as a classified data attribute if the data type is not continuous;
if the data type is continuous, counting the number of different attribute values in the column of attributes, and calculating the ratio of the number of different attribute values to the total number of attribute values;
judging whether the ratio is larger than a set threshold value or not;
if the ratio is greater than a set threshold, determining that the column attribute is a numerical data attribute;
and if the ratio is not greater than the set threshold, determining that the column attribute is a classified data attribute.
Optionally, the converting the attribute value corresponding to the classified data attribute in the client data into the attribute value corresponding to the numeric data attribute includes:
the following operations are executed for each column of classified data attributes in the client data:
grouping the client data according to different attribute values of the column of classified data attributes to obtain a plurality of groups corresponding to the attribute values one by one;
determining a target numerical data attribute which is most matched with the classified data attribute from the numerical data attributes of the client data;
for each group: calculating the average attribute value of the target numerical data attribute in the group, and using that average as the numerical attribute value that replaces the classified attribute value corresponding to the group.
Optionally, the determining, from the respective numeric data attributes of the client data, the target numeric data attribute that matches the classified data attribute best includes:
calculating the intra-group variance of each group aiming at each numerical data attribute of the client data, and summing the intra-group variances to obtain a variance sum corresponding to each numerical data attribute;
performing sorting operation on the variance sum corresponding to each numerical data attribute;
and determining the numerical data attribute with the smallest variance sum as the target numerical data attribute that best matches the classified data attribute.
Optionally, before the clustering operation is performed on the client data, removing outliers in the client data by adopting an isolated forest algorithm.
Optionally, the performing a clustering operation on the client data, and obtaining a clustering result for representing client subdivision includes:
performing pre-clustering on the client data by adopting a hierarchical clustering algorithm, and stopping pre-clustering when the number of output micro-clusters reaches a preset number;
calculating the center points of a preset number of micro clusters;
determining K center points from a preset number of center points to serve as initial center points of a K-means algorithm;
and performing a second clustering operation based on the K initial center points to obtain a clustering result for representing client subdivision.
Optionally, the determining K center points from the preset number of center points as initial center points of the K-means algorithm includes:
randomly selecting one of the preset number of center points as a first initial center point, and adding it to a set S;
for each remaining center point, calculating its nearest distance to the set S, and adding the remaining center point whose nearest distance is largest to the set S;
the above steps are repeated until the set S reaches K center points.
Optionally, after the obtaining the client data including the plurality of client records, the method further includes:
performing a data cleansing operation on the customer data set, the data cleansing operation including a missing value padding operation, an abnormal value processing operation, and a repeated data culling operation;
after the attribute type of each column of the attribute is determined, the method further comprises:
performing a decorrelation operation on the plurality of classified data attributes, and deleting the classified data attributes with high correlation;
performing a decorrelation operation on the plurality of numeric data attributes, deleting the numeric data attributes having high correlation, and performing a normalization processing operation on attribute values corresponding to the remaining numeric data attributes.
A customer data classification device comprising:
an acquisition unit configured to acquire client data including a plurality of client records; wherein the customer data includes a plurality of columns of attributes;
a determining unit, configured to determine attribute types of each column of attributes respectively; wherein the attribute type is a classification type or a numerical value type;
the conversion unit is used for converting the attribute value corresponding to the classified data attribute in the client data into the attribute value corresponding to the numerical data attribute;
and the clustering unit is used for executing clustering operation on the client data to obtain a clustering result for representing client subdivision.
An electronic device, comprising:
a memory for storing customer data comprising a plurality of customer records, wherein the customer data comprises a plurality of columns of attributes;
a processor for determining attribute types of each column of attributes respectively; wherein the attribute type is a classification type or a numerical value type; converting attribute values corresponding to the classified data attributes in the client data into attribute values corresponding to the numerical data attributes; and executing clustering operation on the client data to obtain a clustering result for representing client subdivision.
Through the technical means, the following beneficial effects can be realized:
after the client data is obtained, the invention determines the attribute type of each attribute in the client data and converts the attribute values corresponding to the classified data attributes into attribute values corresponding to numerical data attributes, so that all attributes of the client data become numerical data attributes. On that basis, a clustering operation is performed on the client data to obtain a clustering result representing client subdivision.
The invention converts the classified data attribute into the numerical data attribute and then executes the clustering operation, so that the classified data attribute and the numerical data attribute can be uniformly considered in the clustering operation, and the classification effect is better.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for classifying customer data according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a client data classification device according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
The method is applied to an electronic device, which may be an enterprise server, a cloud server, or any other device that classifies client data. The application architecture of the client data classification method is not limited.
The invention provides a client data classification method which is applied to electronic equipment. Referring to fig. 1, the method comprises the following steps:
step S101: acquiring client data comprising a plurality of client records; wherein the customer data includes a plurality of columns of attributes.
The electronic device may obtain the client data from an external database or from its own memory, depending on where the client data is stored in the application scenario.
Referring to Table 1, a schematic example of customer data for an enterprise is shown. Customer data from the table includes 10 customer records, the customer data including 4 columns of attributes: residence, industry, login duration, and investment amount.
TABLE 1
Step S102: respectively determining the attribute type of each column of attributes; wherein the attribute type is a classification type or a numerical type.
Among the plurality of attributes, some are of the classification type and others are of the numerical type. To facilitate automated processing, the present invention provides a way of determining the attribute type of an attribute automatically.
Step S1: judge whether the value type of the attribute values corresponding to a column of attributes is a continuous data type. If not, go to step S2; if yes, go to step S3.
The continuous data type may be integer or floating point type, and if the value type of the attribute value corresponding to an attribute is not integer or floating point type, the attribute may be determined to be a classified data attribute.
Step S2: if the data type is discontinuous, the column attribute is determined to be a classified data attribute.
Step S3: if the data type is continuous, count the number of different attribute values in the column of attributes, and calculate the ratio of the number of different attribute values to the total number of attribute values.
For example, taking the investment amount in table 1 as an example, the number of different attribute values is 9, and the total number of attribute values is 10, the ratio of the number of different attribute values to the total number of attribute values is 0.9.
Taking the residence in table 1 as an example, the number of different attribute values is 2, and the total number of attribute values is 10, and the ratio of the number of different attribute values to the total number of attribute values is 0.2.
Step S4: judging whether the ratio is larger than a set threshold value or not; if yes, go to step S5, otherwise go to step S6.
Step S5: if the ratio is greater than a set threshold, determining that the column attribute is a numerical data attribute;
step S6: and if the ratio is not greater than the set threshold, determining that the column attribute is a classified data attribute.
The threshold is typically set empirically. If an attribute is of the classification type, it usually takes only a limited number of distinct values, so the number of distinct values is small relative to the total number of attribute values. If an attribute is of the numerical type, almost every attribute value may be different, so the number of distinct attribute values divided by the total number of attribute values is close to 1.
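Steps S1 to S6 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `attribute_type`, the threshold value 0.5, and the sample columns `investment` and `level` are assumptions made for the example (Table 1 itself is not reproduced here).

```python
from numbers import Real

def attribute_type(values, threshold=0.5):
    """Return "numerical" or "classified" for one non-empty column of values.

    threshold is the empirically set ratio; 0.5 is an assumed value.
    """
    # Steps S1/S2: a column containing any non-numeric value is classified.
    # bool is excluded because Python treats it as a subclass of int.
    if not all(isinstance(v, Real) and not isinstance(v, bool) for v in values):
        return "classified"
    # Step S3: ratio of distinct values to the total number of values.
    ratio = len(set(values)) / len(values)
    # Steps S4-S6: compare the ratio with the set threshold.
    return "numerical" if ratio > threshold else "classified"

# Hypothetical columns with 10 values each.
investment = [10, 25, 25, 40, 55, 60, 72, 80, 95, 120]  # 9 distinct -> ratio 0.9
residence = ["Beijing"] * 6 + ["Shanghai"] * 4           # not numeric
level = [1, 1, 2, 2, 3, 3, 1, 2, 3, 1]                   # 3 distinct -> ratio 0.3

print(attribute_type(investment), attribute_type(residence), attribute_type(level))
```

The `level` column shows why the ratio test matters: its values are integers, yet it behaves like a classified attribute because only a few distinct codes occur.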
Step S103: and converting the attribute value corresponding to the classified data attribute in the client data into the attribute value corresponding to the numerical data attribute.
Optionally, to facilitate subsequent processing, the attribute values of the numerical data attributes may be standardized so that they share a common scale, which keeps the result of the algorithm reliable. A normalization algorithm such as the max-min method or the Z-score method may be used to perform the standardization.
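The two normalization methods named above can be sketched as follows. The function names are illustrative, and the handling of a constant column (mapping every value to 0.0) is an assumption, since the text does not address that case.

```python
from statistics import mean, pstdev

def min_max(values):
    """Max-min normalization: rescale a column to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant column (assumed convention)
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Z-score normalization: zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:                    # constant column (assumed convention)
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

print(min_max([2, 4, 6]))
print(z_score([1, 2, 3]))
```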
Referring to table 2, a schematic example of customer data after the normalization operation is performed on the basis of table 1.
TABLE 2
The following operations are executed for each column of classified data attributes in the client data:
step S1: and grouping the client data according to different attribute values of the column of the classified data attributes to obtain a plurality of groups corresponding to the attribute values one by one.
For example, the customer data is divided by residence into two groups: a Beijing group and a Shanghai group.
Step S2: from among the individual numeric data attributes of the customer data, a target numeric data attribute that best matches the categorized data attribute is determined.
There are several numerical data attributes, but each classified data attribute is converted with reference to only one of them. To obtain the best conversion result, the target numerical data attribute that best matches the classified data attribute is selected from among the numerical data attributes.
The method specifically comprises the following steps:
s21: and calculating the intra-group variance of each group aiming at each numerical data attribute of the client data, and summing the intra-group variances to obtain a variance sum corresponding to each numerical data attribute.
S22: performing sorting operation on the variance sum corresponding to each numerical data attribute;
S23: determining the numerical data attribute with the smallest variance sum as the target numerical data attribute that best matches the classified data attribute.
The numerical data attribute with the smallest variance sum is the one most closely associated with the classified data attribute, because its values fluctuate least within each group. The attribute values obtained by converting with this best-matching target numerical data attribute are therefore relatively stable.
Step S3: for each group, calculate the average attribute value of the target numerical data attribute in the group, and use that average as the numerical attribute value that replaces the classified attribute value corresponding to the group.
For ease of understanding, the process of converting a classified data attribute into a numerical data attribute is explained in detail below on the basis of Table 2:
first, the following definitions are given:
Assume that the client data $D$ contains $k$ customer records, each consisting of $p$ columns of classified data attributes and $q$ columns of standardized numerical data attributes, where:
$|D|$: the total number of customer records, i.e. $k$, which is 10 in the example of Table 2.
$D_i$: the $i$-th customer record, with $1 \le i \le k$. Example: $D_3$ is customer record 3 in Table 2.
$X_i$: the $i$-th column of classified data attributes, with $1 \le i \le p$. Example: $X_2$ is the "industry" attribute.
$IX_i$: the set of attribute values that occur for the classified data attribute $X_i$. Example: $IX_2$ = {"government", "IT", "finance"}.
$N_i$: the $i$-th column of numerical data attributes, with $1 \le i \le q$. Example: $N_1$ is the "login duration" attribute.
$d(X_i)$: the attribute value of the $i$-th column classified data attribute of customer record $d$, with $d \in D$ and $1 \le i \le p$.
Example: $d = D_3$.
$d(N_i)$: the value of the $i$-th column numerical data attribute of customer record $d$, with $d \in D$ and $1 \le i \le q$. Example: $d = D_3$.
$D(X_i = t)$: the set of customer records whose $i$-th column classified data attribute $X_i$ has the value $t$, with $t \in IX_i$.
Example: for $t$ = "Shanghai", $D(X_1 = t)$ is the set of the 4 customer records whose residence is Shanghai.
$|D(X_i = t)|$: the number of customer records whose attribute $X_i$ has the value $t$.
Example: for $t$ = "Shanghai", $|D(X_1 = t)| = 4$.
$V(t \mid X_i)$: the numerical value obtained by converting the attribute value $t$ of the classified data attribute $X_i$.
$SS(X_i, N_j)$: the sum of the intra-group variances of the $j$-th column numerical data attribute over the groups formed by the $i$-th column classified data attribute, with $1 \le i \le p$ and $1 \le j \le q$.
The following describes the specific conversion process:
a) Traverse all classified data attributes in the client data $D$ and, for each column of classified data attribute $X_i$ (for example $X_1$, the "residence" attribute), perform the following operations:
Group all customer records in the client data $D$ by their value of the classified attribute $X_i$; the groups are denoted $D(X_i = t_1)$, $D(X_i = t_2)$, ..., $D(X_i = t_m)$, where $t_1, \dots, t_m \in IX_i$.
Example: the records in Table 2 are divided into 2 groups:
$D(X_1 = \text{Beijing})$
$D(X_1 = \text{Shanghai})$
Traverse the numerical data attributes in the customer records and, for each numerical data attribute $N_j$, compute the sum of intra-group variances $SS(X_i, N_j)$ of $N_j$ over the groups formed by the attribute values of $X_i$.
For example, taking the numerical data attribute $N_1$, first calculate the average attribute value of $N_1$ in each group. Writing $d(N_j)$ for the value of attribute $N_j$ in record $d$, the average over a group $D^*$ is:
$$E(N_j, D^*) = \frac{1}{|D^*|} \sum_{d \in D^*} d(N_j), \qquad D^* = D(X_i = t_m),\ t_m \in IX_i$$
Example: the average login duration of the records in $D(X_1 = \text{Beijing})$ is:
$E(N_1, D(X_1 = \text{Beijing})) = (0.23 + 0.77 + 0.19 + 0.23 + 0.9 + 1)/6 = 0.55333$
The average login duration of the records in $D(X_1 = \text{Shanghai})$ is:
$E(N_1, D(X_1 = \text{Shanghai})) = (0 + 0.09 + 0.52 + 0.27)/4 = 0.22$
Then the intra-group variance of each group is calculated. It is the variance of the attribute values in the group about the group's average attribute value and reflects how much the attribute values in the group differ from one another:
$$\mathrm{Var}(N_j, D^*) = \frac{1}{|D^*|} \sum_{d \in D^*} \bigl(d(N_j) - E(N_j, D^*)\bigr)^2, \qquad D^* = D(X_i = t_m),\ t_m \in IX_i$$
The sum of intra-group variances is then $SS(X_i, N_j) = \sum_{t_m \in IX_i} \mathrm{Var}(N_j, D(X_i = t_m))$; in the example, $SS(X_1, N_1) = 0.11796 + 0.03945 = 0.15741$.
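The figures in this example can be checked with a few lines of Python. The sketch assumes that the intra-group variance is the population variance (divide by the group size), which is the reading that reproduces 0.11796 and 0.03945; the login duration values are the ones quoted in the example above.

```python
from statistics import mean, pvariance

# Standardized login durations per residence group, as quoted in the example.
groups = {
    "Beijing": [0.23, 0.77, 0.19, 0.23, 0.9, 1.0],
    "Shanghai": [0.0, 0.09, 0.52, 0.27],
}

means = {t: mean(v) for t, v in groups.items()}           # E(N_1, D(X_1 = t))
variances = {t: pvariance(v) for t, v in groups.items()}  # intra-group variance
ss = sum(variances.values())                              # SS(X_1, N_1)

for t in groups:
    print(t, round(means[t], 5), round(variances[t], 5))
print("SS =", round(ss, 5))
```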
Using the same procedure, the intra-group variance sums of all numerical data attributes are calculated in turn for the grouping obtained above: $SS(X_i, N_1)$, $SS(X_i, N_2)$, ..., $SS(X_i, N_q)$.
The smallest of these sums is denoted $SS(X_i, N_{min})$ and the corresponding numerical attribute $N_{min}$; $N_{min}$ is then the best reference attribute for converting $X_i$, i.e. the target numerical data attribute.
The classified attribute $X_i$ is converted using the numerical attribute $N_{min}$, turning each of its attribute values into a numerical value. That is, given an attribute value $t_m \in IX_i$, take the corresponding group $D(X_i = t_m)$ obtained above and the set of values of attribute $N_{min}$ in that group. The average of this set, $E(N_{min}, D(X_i = t_m))$, is used as the numerical attribute value corresponding to $t_m$, which realizes the conversion from the classified type to the numerical type. Namely:
$$V(t_m \mid X_i) = E(N_{min}, D^*), \qquad D^* = D(X_i = t_m),\ t_m \in IX_i$$
With this formula, every attribute value of the classified data attribute $X_i$ can be converted into a numerical value. Thus "Beijing" and "Shanghai" in the residence attribute are converted into continuous values. Example: $V(\text{Beijing} \mid X_1, N_1) = E(N_1, D(X_1 = \text{Beijing})) = 0.553$ and $V(\text{Shanghai} \mid X_1, N_1) = E(N_1, D(X_1 = \text{Shanghai})) = 0.22$.
Repeating the process of a), finding the best conversion reference numerical data attribute for each classified data attribute, converting the attribute value of the classified data attribute based on the best reference numerical data attribute, and finally converting all the classified data attributes into the attribute value of the numerical data attribute.
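The conversion of one classified data attribute can be sketched end to end as follows. This is an illustrative reading of the procedure, not the patented implementation: the function name `convert_classified`, the record layout (a list of dictionaries), and the "investment" values are assumptions; only the "login" values are taken from the example above, and the population variance is assumed.

```python
from collections import defaultdict
from statistics import mean, pvariance

def convert_classified(records, cat_attr, num_attrs):
    """Return (target numerical attribute, {classified value: converted value})."""
    # Group the records by the value of the classified attribute.
    groups = defaultdict(list)
    for rec in records:
        groups[rec[cat_attr]].append(rec)

    # Sum of intra-group (population) variances for one numerical attribute.
    def variance_sum(num_attr):
        return sum(pvariance([r[num_attr] for r in g]) for g in groups.values())

    # The attribute with the smallest variance sum is the conversion reference.
    target = min(num_attrs, key=variance_sum)
    # Each classified value is replaced by the group mean of the target attribute.
    mapping = {t: mean(r[target] for r in g) for t, g in groups.items()}
    return target, mapping

login = [0.23, 0.77, 0.19, 0.23, 0.9, 1.0, 0.0, 0.09, 0.52, 0.27]
investment = [0.1, 0.9, 0.2, 0.8, 0.0, 1.0, 0.0, 1.0, 0.5, 0.3]  # hypothetical
residence = ["Beijing"] * 6 + ["Shanghai"] * 4
records = [
    {"residence": r, "login": l, "investment": v}
    for r, l, v in zip(residence, login, investment)
]

target, mapping = convert_classified(records, "residence", ["login", "investment"])
print(target, {t: round(v, 3) for t, v in mapping.items()})
```

With these sample values the "login" attribute has the smaller variance sum, so "Beijing" and "Shanghai" are replaced by the group means of the login duration.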
Step S104: and executing clustering operation on the client data to obtain a clustering result for representing client subdivision.
A hierarchical clustering algorithm, or a K-means algorithm, or other clustering operations may be employed to cluster the customer data to obtain a clustering result representing customer segments.
Optionally, because each clustering algorithm has its own advantages and disadvantages, the present invention adopts a two-stage (secondary) clustering approach. To balance the amount of computation against the clustering effect, hierarchical clustering and K-means are combined.
A single clustering operation often does not perform well, so as a preferred solution the client data is clustered in two stages as follows.
Step S1: perform pre-clustering on the client data using a hierarchical clustering algorithm, and stop pre-clustering when the number of output micro-clusters reaches a preset number.
Step S2: calculate the center points of the preset number of micro-clusters.
Step S3: determine K center points from the preset number of center points to serve as the initial center points of the K-means algorithm, as follows:
randomly select one of the preset number of center points as the first initial center point, and add it to a set S;
for each remaining center point, calculate its nearest distance to the set S (the distance to the closest point already in S), and add the remaining center point whose nearest distance is largest to the set S;
repeat the above step until the set S contains K center points.
step S4: and performing a second clustering operation based on the K initial center points to obtain a clustering result for representing client subdivision.
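Steps S1 to S4 can be sketched in plain Python as shown below. Several details are assumptions because the text does not fix them: the pre-clustering merges the two clusters with the closest centroids (centroid linkage), distances are Euclidean, K-means runs for a fixed number of iterations, and K is assumed not to exceed the number of distinct micro-cluster centers. All function names and the sample points are hypothetical.

```python
import math
import random

def centroid(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def pre_cluster(points, m):
    """Step S1/S2: merge the two closest clusters until m micro-clusters remain,
    then return their center points (centroid linkage is an assumption)."""
    clusters = [[p] for p in points]
    while len(clusters) > m:
        cents = [centroid(c) for c in clusters]
        _, i, j = min(
            (math.dist(cents[a], cents[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return [centroid(c) for c in clusters]

def pick_initial_centers(centers, k, rng):
    """Step S3: start from a random center, then repeatedly add the center
    whose nearest distance to the chosen set S is largest."""
    chosen = [rng.choice(centers)]
    while len(chosen) < k:
        rest = [c for c in centers if c not in chosen]
        chosen.append(max(rest, key=lambda c: min(math.dist(c, s) for s in chosen)))
    return chosen

def assign(points, centers):
    return [min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            for p in points]

def k_means(points, centers, iterations=20):
    """Step S4: standard K-means iterations from the given initial centers."""
    for _ in range(iterations):
        labels = assign(points, centers)
        centers = [
            centroid([p for p, l in zip(points, labels) if l == i]) if i in labels else c
            for i, c in enumerate(centers)
        ]
    return assign(points, centers), centers

# Hypothetical 2-D client data with three well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
micro_centers = pre_cluster(points, m=5)
initial = pick_initial_centers(micro_centers, k=3, rng=random.Random(0))
labels, centers = k_means(points, initial)
print(labels)
```

Because the farthest-point rule spreads the initial centers across the micro-clusters, each of the three groups in the sample receives one initial center regardless of which center is drawn first.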
The merging rule of a hierarchical clustering algorithm is easy to define and requires no parameters to be set, so hierarchical clustering is used for pre-clustering in the initial stage; this reduces the amount of data handled by the subsequent clustering and gives a first indication of the structure of the data.
Because hierarchical clustering is computationally expensive and tends to form chain-shaped clusters, a stop condition is set: pre-clustering stops when the number of micro-clusters reaches the preset number. K-means is then used for the subsequent clustering.
K-means clustering is efficient and fast, but its result depends on the positions of the initial centers. Here the initial points are selected from the hierarchical clustering result so that they are neither too far apart nor too densely packed, which keeps the clustering from falling into a local optimum and makes the clustering effect more stable and effective.
Optionally, the number of clusters in the output result of the clustering operation is recorded as K, and the K value may be determined by the following method:
the user designates the number of clusters; after clustering, the result is observed and evaluated manually, and if it is unsatisfactory the number of clusters is adjusted and clustering is performed again, the process being repeated until a reasonable result is obtained; or
the optimal number of clusters K is evaluated automatically through an iterative clustering process: K takes each value in a given value range in turn, clustering is performed for each K to obtain an output result, the corresponding silhouette coefficient (Silhouette Coefficient) is calculated, and the K value with the largest silhouette coefficient is finally selected as the optimal value.
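The automatic selection of K can be illustrated with the short sketch below. It is only an illustration under stated assumptions: the helper names `silhouette` and `best_k`, the Euclidean distance, the score of 0 for single-member clusters and the sample labelings are introduced here and are not taken from the patent. The silhouette value of a point is (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its smallest mean distance to any other cluster.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient of a labeling (needs at least 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for single-member clusters
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


def best_k(points, labelings):
    """labelings maps each candidate K to the labels obtained for that K."""
    return max(labelings, key=lambda k: silhouette(points, labelings[k]))


# Made-up one-dimensional data with three obvious groups.
points = [(0.0,), (1.0,), (10.0,), (11.0,), (20.0,), (21.0,)]
labelings = {2: [0, 0, 1, 1, 1, 1], 3: [0, 0, 1, 1, 2, 2]}
chosen = best_k(points, labelings)
```

In practice the labelings would come from running the two-stage clustering once per candidate K; they are hard-coded here only to keep the sketch self-contained.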
Optionally, before the clustering operation is performed on the client data, outliers in the client data are removed by using an isolation forest algorithm.
Optionally, after obtaining the client data including the plurality of client records in step S101, the method further includes:
performing a data cleaning operation on the client data, wherein the data cleaning operation comprises a missing-value filling operation, an abnormal-value processing operation and a duplicate-data removal operation.
After the attribute type of each column of the attribute is determined, the method further comprises:
performing a decorrelation operation on the plurality of classified data attributes, and deleting the classified data attributes with high correlation; and performing a decorrelation operation on the plurality of numerical data attributes, and deleting the numerical data attributes with high correlation. This removes redundant attributes and thereby reduces the amount of subsequent computation.
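A minimal sketch of the decorrelation step for numerical data attributes is given below. The text does not specify the correlation measure or the cutoff, so the Pearson coefficient, the `threshold=0.9` value, the helper names `pearson` and `drop_correlated`, and the sample columns are all assumptions made for illustration. Classified data attributes would need a different association measure, which is not shown.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def drop_correlated(columns, threshold=0.9):
    """Keep a column only if |r| with every already-kept column is <= threshold."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) <= threshold for other in kept.values()):
            kept[name] = values
    return kept


# Made-up columns: "income_k" carries the same signal as "income", rescaled.
numeric = {
    "income": [30, 45, 60, 80, 120],
    "income_k": [0.030, 0.045, 0.060, 0.080, 0.120],
    "age": [52, 23, 41, 35, 29],
}
reduced = drop_correlated(numeric)
```

The greedy pass keeps the first column of each highly correlated pair, so the column order decides which of two redundant attributes survives.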
Through the technical means, the following beneficial effects can be realized:
After the client data is obtained, the invention determines the attribute type of each attribute in the client data and converts the attribute values corresponding to the classified data attributes into attribute values corresponding to numerical data attributes, so that all attributes of the client data become numerical data attributes. A clustering operation is then performed on the converted client data to obtain a clustering result for representing client subdivision.
The invention converts the classified data attribute into the numerical data attribute and then executes the clustering operation, so that the classified data attribute and the numerical data attribute can be uniformly considered in the clustering operation, and the classification effect is better.
Referring to fig. 2, the present invention provides a customer data classification apparatus, comprising:
an acquisition unit 21 for acquiring client data including a plurality of client records; wherein the customer data includes a plurality of columns of attributes;
a determining unit 22 for determining attribute types of each column of attributes, respectively; wherein the attribute type is a classification type or a numerical value type;
a conversion unit 23, configured to convert an attribute value corresponding to a classified data attribute in the client data into an attribute value corresponding to a numeric data attribute;
and the clustering unit 24 is used for performing clustering operation on the client data to obtain a clustering result for representing client subdivision.
For the specific implementation of the client data classification device, reference may be made to the specific implementation of the client data classification method, which is not repeated here.
Referring to fig. 3, the present invention provides an electronic device including:
a memory 31 for storing customer data comprising a plurality of customer records, wherein the customer data comprises a plurality of columns of attributes;
a processor 32 for determining the attribute type of each column of attributes, respectively; wherein the attribute type is a classification type or a numerical value type; converting attribute values corresponding to the classified data attributes in the client data into attribute values corresponding to the numerical data attributes; and executing clustering operation on the client data to obtain a clustering result for representing client subdivision.
For the specific implementation of the processor, reference may be made to the specific implementation of the client data classification method, which is not repeated here.
Through the technical means, the following beneficial effects can be realized:
After the client data is obtained, the invention determines the attribute type of each attribute in the client data and converts the attribute values corresponding to the classified data attributes into attribute values corresponding to numerical data attributes, so that all attributes of the client data become numerical data attributes. A clustering operation is then performed on the converted client data to obtain a clustering result for representing client subdivision.
The invention converts the classified data attribute into the numerical data attribute and then executes the clustering operation, so that the classified data attribute and the numerical data attribute can be uniformly considered in the clustering operation, and the classification effect is better.
The functions described in the method of this embodiment, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a storage medium readable by a computing device. Based on this understanding, the part of the embodiments of the present application that contributes over the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computing device (which may be a personal computer, a server, a mobile computing device, a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
In this specification, the embodiments are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (9)

1. A method of classifying customer data, comprising:
acquiring client data comprising a plurality of client records; wherein the customer data includes a plurality of columns of attributes;
respectively determining the attribute type of each column of attribute; wherein the attribute type is a classification type or a numerical value type;
the determining the attribute type of each column of attributes respectively comprises the following steps:
judging whether the data type of the attribute values corresponding to a column of attributes is a continuous data type or not;
determining the column attribute as a classified data attribute if the data type is not continuous;
if the data type is continuous, counting the number of distinct attribute values in the column of attributes, and calculating the ratio of the number of distinct attribute values to the total number of attribute values;
judging whether the ratio is larger than a set threshold value or not;
if the ratio is greater than a set threshold, determining that the column attribute is a numerical data attribute;
if the ratio is not greater than the set threshold, determining that the column attribute is a classified data attribute;
converting attribute values corresponding to the classified data attributes in the client data into attribute values corresponding to the numerical data attributes;
and executing clustering operation on the client data to obtain a clustering result for representing client subdivision.
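The attribute-type determination recited in claim 1 can be sketched as follows. This is a hedged illustration only: the function name `attribute_type`, the default `threshold=0.05`, the rule that a column is "continuous" when all of its values are Python numbers, and the sample columns are assumptions introduced here. The claim itself only requires a set threshold.

```python
def attribute_type(values, threshold=0.05):
    """Return 'numerical' or 'classified' for one non-empty column of values."""
    # Assumption: a column has a continuous data type when every value
    # is an int or float (booleans are treated as non-continuous).
    continuous = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    if not continuous:
        return "classified"
    # Ratio of the number of distinct values to the total number of values.
    ratio = len(set(values)) / len(values)
    return "numerical" if ratio > threshold else "classified"


# Made-up columns for illustration.
city = ["Beijing", "Shanghai", "Beijing"]
level = [1, 2, 3] * 40                      # numbers, but only 3 distinct values
balance = [100.5 + i for i in range(120)]   # every value is different

types = [attribute_type(col) for col in (city, level, balance)]
```

The `level` column shows why the ratio test matters: it is stored as numbers, yet its few distinct values make it behave like a classified data attribute.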
2. The method of claim 1, wherein converting the attribute value corresponding to the classified data attribute in the customer data into the attribute value corresponding to the numerical data attribute comprises:
the following operations are executed for each column of classified data attributes in the client data:
grouping the client data according to different attribute values of the column of classified data attributes to obtain a plurality of groups corresponding to the attribute values one by one;
determining a target numerical data attribute which is most matched with the classified data attribute from the numerical data attributes of the client data;
for each group: calculating the average attribute value of the target numerical data attribute within the group, determining the average attribute value as the attribute value corresponding to the group, and converting the attribute value of the classified data attribute into this attribute value of the numerical data attribute.
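The conversion recited in claim 2 amounts to replacing each classified value by a group average. The sketch below illustrates this under stated assumptions: the function name `mean_encode` and the sample `city` and `income` columns are hypothetical, and the target numerical data attribute is taken as already chosen.

```python
from collections import defaultdict


def mean_encode(categories, target):
    """Replace each classified value by the mean of `target` within its group."""
    groups = defaultdict(list)
    for cat, val in zip(categories, target):
        groups[cat].append(val)  # one group per distinct attribute value
    means = {cat: sum(vals) / len(vals) for cat, vals in groups.items()}
    return [means[cat] for cat in categories]


# Made-up data: "city" is the classified attribute, "income" the target attribute.
city = ["A", "B", "A", "C", "B"]
income = [10.0, 20.0, 30.0, 40.0, 60.0]
encoded = mean_encode(city, income)
```

Here group A averages 20.0 and groups B and C both average 40.0, so two different classified values can end up with the same numerical value when their groups behave alike on the target attribute.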
3. The method of claim 2, wherein said determining a target numeric data attribute that best matches the classified data attribute from among the numeric data attributes of the customer data comprises:
calculating the intra-group variance of each group aiming at each numerical data attribute of the client data, and summing the intra-group variances to obtain a variance sum corresponding to each numerical data attribute;
performing sorting operation on the variance sum corresponding to each numerical data attribute;
and determining the numerical data attribute with the smallest variance sum as the target numerical data attribute which best matches the classified data attribute.
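The matching rule of claim 3 can be sketched as below. Several points are assumptions made for illustration: the function name `best_target`, the use of the population variance (`statistics.pvariance`), the sample columns, and the premise that the numerical columns are already on comparable scales so that their variance sums can be compared directly.

```python
from collections import defaultdict
from statistics import pvariance


def best_target(categories, numeric_columns):
    """Name of the numerical column with the smallest sum of intra-group variances."""

    def variance_sum(values):
        groups = defaultdict(list)
        for cat, val in zip(categories, values):
            groups[cat].append(val)
        # Population variance per group (an assumption), summed over all groups.
        return sum(pvariance(vals) for vals in groups.values())

    # Taking the minimum is equivalent to sorting and picking the first entry.
    return min(numeric_columns, key=lambda name: variance_sum(numeric_columns[name]))


# Made-up data: "x" is nearly constant inside each group, "y" is not.
categories = ["A", "A", "B", "B"]
numeric_columns = {
    "x": [0.10, 0.12, 0.90, 0.88],
    "y": [0.10, 0.90, 0.20, 0.80],
}
target = best_target(categories, numeric_columns)
```

A small variance sum means the groups formed by the classified attribute are internally homogeneous on that numerical attribute, which is why its group averages are a reasonable numerical stand-in for the classified values.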
4. The method of claim 1, further comprising removing outliers in the customer data using an isolation forest algorithm prior to performing a clustering operation on the customer data.
5. The method of claim 4, wherein performing a clustering operation on the client data to obtain a clustering result representing client segments comprises:
performing pre-clustering on the client data by adopting a hierarchical clustering algorithm, and stopping pre-clustering when the number of output micro-clusters reaches a preset number;
calculating the center points of a preset number of micro clusters;
determining K center points from a preset number of center points to serve as initial center points of a K-means algorithm;
and performing a second clustering operation based on the K initial center points to obtain a clustering result for representing client subdivision.
6. The method of claim 5, wherein determining K center points from among a preset number of center points as initial center points of a K-means algorithm comprises:
randomly selecting one point from the preset number of center points as a first initial center point, and adding it to a set S;
calculating, for each remaining center point among the preset number of center points, its nearest distance to the set S, and adding the remaining center point with the largest nearest distance to the set S;
repeating the previous step until the set S contains K center points.
7. The method of claim 1, wherein,
after the obtaining the client data containing the plurality of client records, the method further comprises:
performing a data cleansing operation on the customer data set, the data cleansing operation including a missing value padding operation, an abnormal value processing operation, and a repeated data culling operation;
after the attribute type of each column of the attribute is determined, the method further comprises:
performing a decorrelation operation on the plurality of classified data attributes, and deleting the classified data attributes with high correlation;
performing a decorrelation operation on the plurality of numeric data attributes, deleting the numeric data attributes having high correlation, and performing a normalization processing operation on attribute values corresponding to the remaining numeric data attributes.
8. A customer data classification apparatus, comprising:
an acquisition unit configured to acquire client data including a plurality of client records; wherein the customer data includes a plurality of columns of attributes;
a determining unit, configured to determine attribute types of each column of attributes respectively; wherein the attribute type is a classification type or a numerical value type;
the determining the attribute type of each column of attributes respectively comprises the following steps:
judging whether the data type of the attribute values corresponding to a column of attributes is a continuous data type or not; determining the column attribute as a classified data attribute if the data type is not continuous; if the data type is continuous, counting the number of distinct attribute values in the column of attributes, and calculating the ratio of the number of distinct attribute values to the total number of attribute values; judging whether the ratio is larger than a set threshold value or not; if the ratio is greater than the set threshold, determining that the column attribute is a numerical data attribute; if the ratio is not greater than the set threshold, determining that the column attribute is a classified data attribute;
the conversion unit is used for converting the attribute value corresponding to the classified data attribute in the client data into the attribute value corresponding to the numerical data attribute;
and the clustering unit is used for executing clustering operation on the client data to obtain a clustering result for representing client subdivision.
9. An electronic device, comprising:
a memory for storing customer data comprising a plurality of customer records, wherein the customer data comprises a plurality of columns of attributes;
a processor for determining attribute types of each column of attributes respectively; the determining the attribute type of each column of attributes respectively comprises the following steps: judging whether the data type of the attribute values corresponding to a column of attributes is a continuous data type or not; determining the column attribute as a classified data attribute if the data type is not continuous; if the data type is continuous, counting the number of distinct attribute values in the column of attributes, and calculating the ratio of the number of distinct attribute values to the total number of attribute values; judging whether the ratio is larger than a set threshold value or not; if the ratio is greater than the set threshold, determining that the column attribute is a numerical data attribute; if the ratio is not greater than the set threshold, determining that the column attribute is a classified data attribute;
wherein the attribute type is a classification type or a numerical value type; converting attribute values corresponding to the classified data attributes in the client data into attribute values corresponding to the numerical data attributes; and executing clustering operation on the client data to obtain a clustering result for representing client subdivision.
CN202010086453.6A 2020-02-11 2020-02-11 Customer data classification method and device and electronic equipment Active CN111339294B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010086453.6A CN111339294B (en) 2020-02-11 2020-02-11 Customer data classification method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010086453.6A CN111339294B (en) 2020-02-11 2020-02-11 Customer data classification method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN111339294A CN111339294A (en) 2020-06-26
CN111339294B true CN111339294B (en) 2023-07-25

Family

ID=71181465

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010086453.6A Active CN111339294B (en) 2020-02-11 2020-02-11 Customer data classification method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN111339294B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111858732B (en) * 2020-07-14 2024-04-05 北京北大软件工程股份有限公司 Data fusion method and terminal

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108681973A (en) * 2018-05-14 2018-10-19 广州供电局有限公司 Sorting technique, device, computer equipment and the storage medium of power consumer

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8295575B2 (en) * 2007-10-29 2012-10-23 The Trustees of the University of PA. Computer assisted diagnosis (CAD) of cancer using multi-functional, multi-modal in-vivo magnetic resonance spectroscopy (MRS) and imaging (MRI)
CN104899331A (en) * 2015-06-24 2015-09-09 Tcl集团股份有限公司 Television used behavior data clustering method and device and Spark big data platform
US20170169447A1 (en) * 2015-12-09 2017-06-15 Oracle International Corporation System and method for segmenting customers with mixed attribute types using a targeted clustering approach
US10585864B2 (en) * 2016-11-11 2020-03-10 International Business Machines Corporation Computing the need for standardization of a set of values
CN107103336A (en) * 2017-04-28 2017-08-29 温州职业技术学院 A kind of mixed attributes data clustering method based on density peaks
KR20190094068A (en) * 2018-01-11 2019-08-12 한국전자통신연구원 Learning method of classifier for classifying behavior type of gamer in online game and apparatus comprising the classifier
CN108830765A (en) * 2018-04-18 2018-11-16 中国地质大学(武汉) A kind of checking method and system of pollution entering the water monitoring data
CN108734217A (en) * 2018-05-22 2018-11-02 齐鲁工业大学 A kind of customer segmentation method and device based on clustering
CN109189876B (en) * 2018-08-31 2021-09-10 深圳市元征科技股份有限公司 Data processing method and device
CN109919227A (en) * 2019-03-07 2019-06-21 重庆邮电大学 A kind of density peaks clustering method towards mixed attributes data set

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Fuzzy non-metric model for data with tolerance and its application to incomplete data clustering;Yasunori Endo 等;《2015 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE)》;全文 *

Also Published As

Publication number Publication date
CN111339294A (en) 2020-06-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant