CN111160382A - Effective method for processing classified data in real life - Google Patents

Effective method for processing classified data in real life

Publication number
CN111160382A
Authority
CN
China
Prior art keywords
class
data
algorithm
fuzzy
centers
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910935269.1A
Other languages
Chinese (zh)
Inventor
李顺勇
张苗苗
张钰嘉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanxi University
Original Assignee
Shanxi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanxi University
Priority to CN201910935269.1A
Publication of CN111160382A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an effective method for processing categorical (classification-type) data in real life, comprising the following steps: step 1, randomly selecting k initial points from a data set X containing n samples, where k is the number of classes of the data set X; step 2, calculating the distance between each object and the k initial points, and assigning each object to the class of its nearest initial point to obtain k clusters; step 3, calculating the membership degree of each object to the k centers, and updating the class centers of the k clusters with a heuristic update algorithm, the class centers being denoted w_h; step 4, repeating step 2 and step 3 until the class centers w_h no longer change. The core of the invention is the MD fuzzy k-modes clustering algorithm for categorical matrix object data. In the big data era, clustering multiple records with the MD fuzzy k-modes algorithm makes it easier to discover customers' consumption preferences and therefore to make more targeted recommendations.

Description

Effective method for processing classified data in real life
Technical Field
The invention relates to the field of advanced calculation and data processing, in particular to an effective method for processing classified data in real life.
Background
In data mining, the input to the algorithm is in most cases a data set X, also called a table or a matrix. In many practical applications, a database typically contains a plurality of tables, with one-to-one, one-to-many, and many-to-many relationships between the tables.
For example, a customer may purchase several products at once while shopping. An object described by multiple feature vectors is called a matrix object, and a data set composed of matrix objects is called a matrix object data set. Table 1 shows such data, taken from http://www.taobao.com. Table 1 has two parts: the left part describes basic information about the users, and the right part records each user's accesses to different brands at different points in time, where the attribute "access time" represents the time that the user accesses one brand on the same day. In database terms the left part is the main table and the right part is the detail table, and there is a typical one-to-many relationship between the two.
The data in Table 1 have the following characteristics. (1) Relevance: the data in the main table and the detail table may be related, and users of different genders or ages may have different preferences. For example, a 24-year-old female user accesses goods commonly used by most female users, such as JOSINE and WETHERM, whereas a 40-year-old female user accesses goods for both male and female use because she needs to care for her family. (2) One-to-many: each user in the main table corresponds to multiple records in the detail table, and the number of brands accessed tends to differ between users. For example, user 10944750 has 11 records, while user 8149250 has 4 records. (3) Mixedness: in most cases objects are described by categorical attributes together with numerical attributes. For example, "brand name" is a categorical attribute and "access time" is a numerical attribute. (4) Evolution: some attribute values may change over time. For example, a user may visit a brand repeatedly during one month but not visit it at all the next month. In other words, user behavior is a dynamic process that evolves over time.
As the detail table of Table 1 shows, each user accesses at least one brand, and a brand may be viewed by many users. A brand may also be accessed by the same user several times in one day, or several times over a few days. If a user visits a certain brand many times, he or she is probably interested in its goods. For example, user 10944750 accesses JOSINY in four consecutive months of the year, several times each month, so we can predict that this user likes JOSINE very much. The same user visits SEMIR only once, so we infer that the user likes it less than JOSINE.
Data such as that in Table 1 is very common in banking, insurance, telecommunications, retail and medical databases. It is therefore necessary to develop a method that finds groups of users with different behavior patterns from the detail table rather than from the main table, because behavioral analysis can help managers obtain more valuable decision information.
Cluster analysis is an unsupervised technique that groups data according to a similarity measure, so that data within a cluster are as similar as possible and data in different clusters are as dissimilar as possible. Conventional clustering algorithms generally cluster single-valued attribute data, but in many practical applications each object is described by several feature vectors. Processing such data with an existing clustering algorithm requires prior knowledge to select a single record per object, which discards much of the information, distorts the original data, and defeats the purpose of analyzing the data as a whole. Therefore, to discover customers' consumption preferences from multiple consumption records and make more targeted recommendations, a clustering algorithm for matrix object data is needed. At present there is relatively little research on clustering matrix object data, and many problems remain to be solved.
Today the growing volume and complexity of data pose problems for cluster analysis, and the traditional k-modes algorithm cannot quickly and effectively cluster the relatively complex categorical data found in everyday life.
Disclosure of Invention
To overcome the shortcomings of the prior art, an effective method for processing categorical data in real life is provided. Its core algorithm, the MD fuzzy k-modes clustering algorithm, can cluster multiple records per object and can quickly and effectively process categorical data sets with multiple attributes.
The invention provides an effective method for processing real-life categorical data, comprising the following steps:
step 1, randomly selecting k initial points from a data set X containing n samples, where k is the number of classes of the data set X;
step 2, calculating the distance between each object and the k initial points, and assigning each object to the class of its nearest initial point to obtain k clusters;
step 3, calculating the membership degree of each object to the k centers, and updating the class centers of the k clusters with a heuristic update algorithm, the class centers being denoted w_h;
step 4, repeating step 2 and step 3 until the class centers w_h no longer change.
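Steps 1 to 4 can be sketched as a plain loop. This is a minimal skeleton, not the patented implementation: the distance, membership and class-center update formulas appear only as figures in this text, so they are passed in as caller-supplied functions. The names `md_fuzzy_kmodes_skeleton`, `dist`, `membership`, `update_center`, `max_iter` and `seed` are hypothetical.

```python
import random

def md_fuzzy_kmodes_skeleton(objects, k, dist, membership, update_center,
                             max_iter=100, seed=0):
    """Steps 1-4 as a loop; dist, membership and update_center are supplied by the caller."""
    rng = random.Random(seed)
    # Step 1: randomly select k initial points from the data set.
    centers = rng.sample(objects, k)
    labels = []
    for _ in range(max_iter):
        # Step 2: distance from every object to every center, then nearest-center assignment.
        d = [[dist(x, c) for c in centers] for x in objects]
        labels = [row.index(min(row)) for row in d]
        # Step 3: membership of every object in every cluster, then new class centers.
        w = [membership(row) for row in d]
        new_centers = [
            update_center(objects, [w_i[h] for w_i in w], centers[h])
            for h in range(k)
        ]
        # Step 4: stop once the class centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels
```

The `max_iter` cap is an added safeguard (an assumption, not part of the four steps) so that the loop ends even if the centers keep oscillating.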
As a further improvement of the above scheme, the distance d(X_i, X_j) from each object to the k initial points in step 2 is calculated as follows:
Figure BDA0002221446940000031
wherein
Figure BDA0002221446940000032
Figure RE-GDA0002438919360000033
In formula (2),
Figure BDA0002221446940000041
describes the matrix object X_i on attribute A_s, and 1/2 is a normalization factor because 0 ≤ δ(X_is, X_js) ≤ 2. When
Figure BDA0002221446940000042
holds, δ(X_is, X_js) = 0; when
Figure BDA0002221446940000043
holds, δ(X_is, X_js) = 2.
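Because formulas (1) to (3) survive only as figure references, the following is a stand-in rather than the patented distance. It assumes that δ is the L1 gap between the relative frequencies of the attribute values of two matrix objects, which has the properties the text states: it lies between 0 and 2, equals 0 for identical value distributions, and equals 2 when the two objects share no value. The function names and the list-of-lists data layout are hypothetical.

```python
from collections import Counter

def rel_freq(values):
    """Relative frequency of each categorical value of one attribute of a matrix object."""
    counts = Counter(values)
    total = len(values)
    return {v: c / total for v, c in counts.items()}

def delta(x_s, y_s):
    """Assumed per-attribute dissimilarity: L1 gap between relative frequencies, in [0, 2]."""
    fx, fy = rel_freq(x_s), rel_freq(y_s)
    return sum(abs(fx.get(v, 0.0) - fy.get(v, 0.0)) for v in set(fx) | set(fy))

def distance(x, y):
    """Sum over attributes of delta / 2, so that each attribute contributes at most 1."""
    return sum(0.5 * delta(x_s, y_s) for x_s, y_s in zip(x, y))
```

Here a matrix object is a list with one entry per attribute, and each entry is the list of values the object takes on that attribute across its records, for example `[["JOSINY", "JOSINY", "SEMIR"], ["morning", "evening", "evening"]]`.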
As a further improvement of the above scheme, in step 3 the membership degree of each object to the k centers is calculated, and the class centers w_h of the k clusters are updated with a heuristic update algorithm, specifically as follows:
Figure BDA0002221446940000044
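The membership formula itself is only available as a figure. A minimal sketch of the standard fuzzy k-modes membership update is given here on the assumption that the patent's formula has the same form: the membership of an object in cluster l is the reciprocal of the sum over all clusters h of (d_l / d_h) raised to 1/(β - 1). The function name and the equal split between coinciding centers are illustrative choices, not taken from the source.

```python
def memberships(dists, beta=1.1):
    """Membership of one object in each cluster, given its distance to each center.

    Assumes the standard fuzzy k-modes update with fuzzy factor beta > 1.
    """
    zero = [h for h, d in enumerate(dists) if d == 0]
    if zero:
        # The object coincides with at least one center: share full membership among them.
        return [1.0 / len(zero) if h in zero else 0.0 for h in range(len(dists))]
    exponent = 1.0 / (beta - 1.0)
    return [1.0 / sum((d_l / d_h) ** exponent for d_h in dists) for d_l in dists]
```

The memberships of one object always sum to 1, and the nearest center receives the largest share.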
As a further improvement of the above scheme, the heuristic update algorithm in step 3 comprises the following steps:
step 3.1, analyzing formula 5: if Q_l minimizes formula 5, then Q_l is the class center of X,
Figure BDA0002221446940000045
where β is the fuzzy factor and w is the membership degree.
As a further improvement of the above scheme, in step 3.1 formula 5 is analyzed: if Q_l minimizes formula 5, then Q_l is the class center of X,
Figure BDA0002221446940000046
where β is the fuzzy factor and w is the membership degree; the analysis is given in formula 6,
Figure RE-GDA0002438919360000047
Here,
Figure RE-GDA0002438919360000048
is called the weight of the attribute value
Figure RE-GDA0002438919360000049
of X_i. Finding the class center Q_l means minimizing D(X, Q_l); according to Definition 1, it suffices to minimize
Figure RE-GDA00024389193600000410
Let attribute A_s have |V_s| attribute values. The number of attribute values in its class center Q_ls can then only range from 1 to |V_s|. Selecting u_s of the |V_s| values as the attribute values of Q_ls can be done in
Figure RE-GDA0002438919360000051
ways, where u_s also ranges from 1 to |V_s|. By the binomial theorem, the number of candidate class centers on attribute A_s is
Figure RE-GDA0002438919360000052
Each candidate must be traversed, computing
Figure RE-GDA0002438919360000053
and the results are then sorted in descending order to update the class center on attribute A_s. The class centers on the other attributes are updated in the same way.
The time complexity of the global class-center update algorithm is O(nmtk × 2^|V′|), where n denotes the number of objects, m the number of attributes, k the number of classes, t the number of iterations, and |V′| = max{|V_s|}. The time complexity of the global update therefore grows linearly with the number of objects, attributes, classes and iterations, and exponentially with the number of attribute values.
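The exponential term can be checked by enumerating the candidates directly. The sketch below assumes, consistent with the 2^|V′| factor, that a candidate class center on one attribute is any non-empty subset of that attribute's value domain; summing the binomial coefficients C(|V_s|, u_s) for u_s from 1 to |V_s| gives 2^|V_s| - 1. The function names are hypothetical.

```python
from itertools import combinations

def candidate_centers(values):
    """Yield every non-empty subset of one attribute's value domain."""
    vals = sorted(values)
    for u in range(1, len(vals) + 1):         # u: how many values the center keeps
        for combo in combinations(vals, u):   # C(len(vals), u) choices for this u
            yield combo

def count_candidates(n_values):
    """Closed form for the number of non-empty subsets of n_values items."""
    return 2 ** n_values - 1
```

With 20 attribute values there are already more than a million candidates per attribute and per cluster, which is why an exhaustive update quickly becomes impractical.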
When the matrix object data contain too many attribute values, the global class-center update requires too much computation and takes too long, so a heuristic class-center update algorithm is provided.
As a further improvement of the above scheme, formula 6 is analyzed as follows: to make
Figure BDA0002221446940000051
a minimum,
Figure BDA0002221446940000052
must be a maximum. Assume that in class l the value domain of attribute A_j is
Figure BDA0002221446940000053
Compute
Figure BDA0002221446940000054
and sort the results in descending order. If Q_lj contains r_j values, there are 3 cases:
1) when r_j = 1, if
Figure BDA0002221446940000055
where t denotes the index of the attribute value, then
Figure BDA0002221446940000056
If more than one maximum value exists, one is chosen at random using the randperm function in MATLAB;
2) when r_j = r′_j, all values are taken as the class center;
3) when r_j ≠ r′_j, there are the following 3 cases:
① if
Figure BDA0002221446940000057
the first r_j values are selected as the class center;
② if
Figure BDA0002221446940000058
then the first r_j - 1 values
Figure BDA0002221446940000059
are selected as part of Q_lj;
if
Figure BDA00022214469400000510
then
Figure BDA00022214469400000511
if
Figure BDA00022214469400000512
then
Figure BDA00022214469400000513
if
Figure BDA00022214469400000514
one is selected at random;
③ if
Figure BDA00022214469400000515
Figure BDA00022214469400000516
then the first (r_j - p′ - 1) attribute values are selected as part of the class center, denoted
Figure BDA00022214469400000517
and p′ + 1 of the next p′ + p + 1 attribute values are then selected as the remainder. Q_j is the set of all combinations of the remaining attribute values, and R_t is the frequency of each combination in Q_j. If
Figure BDA0002221446940000063
the r_j values are determined; let
Figure BDA0002221446940000062
The heuristic update algorithm thus gives the class center on attribute A_j; the class centers on the other attributes can be found in the same way.
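The conditions that separate the cases above are only available as figures, so the following sketch covers just the common core that the text does state: weight every attribute value, sort in descending order, keep the first r values, and break ties at random. It is not the full case analysis. The weight used here (membership raised to β times the relative frequency of the value within the object) is an assumption, and `heuristic_center`, `r` and `seed` are hypothetical names; Python's `random.shuffle` stands in for the MATLAB randperm call mentioned in the text.

```python
import random
from collections import defaultdict

def heuristic_center(attr_values, weights, r, beta=1.1, seed=0):
    """Pick r values for one attribute of a class center by descending weighted frequency.

    attr_values[i] is the non-empty list of values object i takes on this attribute;
    weights[i] is the membership of object i in the cluster.
    """
    rng = random.Random(seed)
    score = defaultdict(float)
    for vals, w in zip(attr_values, weights):
        for v in vals:
            # Each occurrence adds w**beta / len(vals), giving a weighted relative frequency.
            score[v] += (w ** beta) / len(vals)
    # Shuffle first: the stable sort below then leaves equal scores in random order.
    ranked = sorted(score)
    rng.shuffle(ranked)
    ranked.sort(key=lambda v: score[v], reverse=True)
    return ranked[:r]
```

Sorting once costs O(|V| log |V|) per attribute, in contrast to the exhaustive traversal of every subset of the value domain in the global update.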
The invention has the beneficial effects that:
Compared with existing clustering techniques for categorical data, the core algorithm has a notable advantage: it can cluster multiple records per object, which makes it easier to discover customers' consumption preferences and to make more targeted recommendations.
Drawings
The following detailed description of embodiments of the invention is provided in conjunction with the appended drawings, in which:
Fig. 1 shows the relationship between β and w on 4 data sets: Fig. 1.1 on the Market Basket data set, Fig. 1.2 on the Microsoft Web data set, Fig. 1.3 on the Musk data set, and Fig. 1.4 on the MovieLens data set.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings. The method comprises the following steps:
step 1, randomly selecting k initial points from a data set X containing n samples, where k is the number of classes of the data set X; the specific steps are shown in steps 1.1-1.2;
step 1.1, selecting 5 real data sets: Market Basket data, Microsoft Web data, Musk data, MovieLens data and the Alibaba data set. The Market Basket data contain the transaction records of 1001 customers, each record described by 4 attributes: user ID, transaction time, product name and product ID. The Microsoft Web data come from the UCI repository and record the web pages browsed by 32711 anonymous users during one week in January 1998, each user described by 2 attributes: user ID and web page ID. The Musk data also come from the UCI repository and comprise 92 objects, each described by 167 attributes. The MovieLens data are downloaded from the MovieLens website; only the rating data are used, which record 1000209 ratings of 3900 movies by 6040 viewers, each record described by 4 attributes: user ID, movie ID, user rating and time of submission. The Alibaba data set contains 182880 records of 884 users browsing certain brands, also described by 4 attributes. All 5 data sets are matrix object data sets. To improve the clustering effect, the invention preprocesses the attributes of each data set; the preprocessed data are shown in Table 2.
Step 1.2, defining the evaluation criteria. The proposed algorithm is evaluated with 5 indices: accuracy (AC), precision (PR), recall (RE), adjusted Rand index (ARI) and normalized mutual information (NMI). AC is the proportion of correctly classified objects; PR is the share of samples predicted to be positive that are truly positive; RE is the share of positive samples that are correctly predicted; ARI and NMI measure the agreement between two partitions of the data. The larger the values of AC, PR, RE, ARI and NMI, the closer the clustering result is to the true partition of the data set and the better the clustering effect. The calculation formulas are given as formulas 10-14:
Figure BDA0002221446940000071
Figure BDA0002221446940000072
Figure BDA0002221446940000073
Figure BDA0002221446940000074
Figure BDA0002221446940000075
where
Figure BDA0002221446940000076
and
Figure BDA0002221446940000077
are the numbers of objects in C_i and P_j respectively, n_ij = |C_i ∩ P_j|, and
Figure BDA0002221446940000078
is the number of objects correctly assigned to class i. The closer the clustering result is to the true partition of the data set, the larger the values of AC, PR, RE, ARI and NMI, with a maximum value of 1. In other words, the larger the values on the 5 evaluation indices, the better the clustering quality and the more effective the algorithm.
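Formulas 10-14 are only available as figures, so two of the indices are sketched here from their standard definitions in terms of the contingency counts n_ij. The ARI below is the usual pair-counting formula. The accuracy assumes that each cluster is matched to its majority true class, which is one common convention and may differ from the patent's formula 10. All function names are hypothetical.

```python
from collections import Counter
from math import comb

def contingency(pred, truth):
    """n_ij: number of objects placed in cluster i whose true class is j."""
    return Counter(zip(pred, truth))

def accuracy(pred, truth):
    """Share of objects that belong to the majority true class of their cluster (assumed mapping)."""
    best = Counter()
    for (cluster, _), n in contingency(pred, truth).items():
        best[cluster] = max(best[cluster], n)
    return sum(best.values()) / len(pred)

def adjusted_rand_index(pred, truth):
    """Standard adjusted Rand index computed from the contingency table."""
    sum_ij = sum(comb(n, 2) for n in contingency(pred, truth).values())
    sum_a = sum(comb(n, 2) for n in Counter(pred).values())
    sum_b = sum(comb(n, 2) for n in Counter(truth).values())
    expected = sum_a * sum_b / comb(len(pred), 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Degenerate case, e.g. both partitions put every object in one group.
        return 1.0
    return (sum_ij - expected) / (max_index - expected)
```

Both functions take two equal-length label lists with at least two objects, one label per object, and the label values themselves need not match between the clustering and the ground truth.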
Step 2, calculating the distance d(X_i, X_j) from each object to the k initial points, specifically:
Figure BDA0002221446940000081
wherein
Figure BDA0002221446940000082
Figure RE-GDA00024389193600000810
In formula (2),
Figure BDA0002221446940000084
describes the matrix object X_i on attribute A_s, and 1/2 is a normalization factor because 0 ≤ δ(X_is, X_js) ≤ 2. When
Figure BDA0002221446940000085
holds, δ(X_is, X_js) = 0; when
Figure BDA0002221446940000086
holds, δ(X_is, X_js) = 2.
Each object is assigned to the class of its nearest initial point, giving k clusters;
step 3, calculating the membership degree of each object to the k centers, and updating the class centers of the k clusters with a heuristic update algorithm, the class centers being denoted w_h, specifically:
Figure BDA0002221446940000087
The heuristic update algorithm comprises the following steps:
step 3.1, analyzing formula 5: if Q_l minimizes formula 5, then Q_l is the class center of X,
Figure BDA0002221446940000088
where β is the fuzzy factor and w is the membership degree; the analysis is given in formula 6,
Figure RE-GDA0002438919360000093
Here,
Figure RE-GDA0002438919360000094
is called the weight of the attribute value
Figure RE-GDA0002438919360000095
of X_i. Finding the class center Q_l means minimizing D(X, Q_l); according to Definition 1, it suffices to minimize
Figure RE-GDA0002438919360000096
Let attribute A_s have |V_s| attribute values. The number of attribute values in its class center Q_ls can then only range from 1 to |V_s|. Selecting u_s of the |V_s| values as the attribute values of Q_ls can be done in
Figure RE-GDA0002438919360000097
ways, where u_s also ranges from 1 to |V_s|. By the binomial theorem, the number of candidate class centers on attribute A_s is
Figure RE-GDA0002438919360000098
Each candidate must be traversed, computing
Figure RE-GDA0002438919360000099
and the results are then sorted in descending order to update the class center on attribute A_s. The class centers on the other attributes are updated in the same way.
The time complexity of the global class-center update algorithm is O(nmtk × 2^|V′|), where n denotes the number of objects, m the number of attributes, k the number of classes, t the number of iterations, and |V′| = max{|V_s|}. The time complexity of the global update therefore grows linearly with the number of objects, attributes, classes and iterations, and exponentially with the number of attribute values.
When the matrix object data contain too many attribute values, the global class-center update requires too much computation and takes too long, so a heuristic class-center update algorithm is provided. Formula 6 is analyzed as follows: to make
Figure BDA0002221446940000094
a minimum,
Figure BDA0002221446940000095
must be a maximum. Assume that in class l the value domain of attribute A_j is
Figure BDA0002221446940000096
Compute
Figure BDA0002221446940000097
and sort the results in descending order. If Q_lj contains r_j values, there are 3 cases:
1) when r_j = 1, if
Figure BDA0002221446940000098
where t denotes the index of the attribute value, then
Figure BDA0002221446940000099
If more than one maximum value exists, one is chosen at random using the randperm function in MATLAB;
2) when r_j = r′_j, all values are taken as the class center;
3) when r_j ≠ r′_j, there are the following 3 cases:
① if
Figure BDA00022214469400000910
the first r_j values are selected as the class center;
② if
Figure BDA0002221446940000101
then the first r_j - 1 values
Figure BDA0002221446940000102
are selected as part of Q_lj;
if
Figure BDA0002221446940000103
then
Figure BDA0002221446940000104
if
Figure BDA0002221446940000105
then
Figure BDA0002221446940000106
if
Figure BDA0002221446940000107
one is selected at random;
③ if
Figure BDA0002221446940000108
Figure BDA0002221446940000109
then the first (r_j - p′ - 1) attribute values are selected as part of the class center, denoted
Figure BDA00022214469400001010
and p′ + 1 of the next p′ + p + 1 attribute values are then selected as the remainder. Q_j is the set of all combinations of the remaining attribute values, and R_t is the frequency of each combination in Q_j. If
Figure BDA00022214469400001013
the r_j values are determined; let
Figure BDA00022214469400001012
The heuristic update algorithm thus gives the class center on attribute A_j; the class centers on the other attributes can be found in the same way.
Step 4, repeating step 2 and step 3 until the class centers w_h no longer change; the specific steps are shown in steps 4.1-4.4.
Step 4.1, evaluating the effectiveness of the heuristic class-center update algorithm. The class centers are updated with the heuristic algorithm (HAMF) and the global algorithm (GAMF) respectively, and the clustering results and running times are compared. Taking the Market Basket data as an example, each algorithm is run 10 times; the results are shown in Tables 3 and 4.
As Tables 3 and 4 show, updating the class centers with the global algorithm gives a better clustering result than the heuristic update algorithm, but takes as long as 96 h. The heuristic update algorithm takes only 160 s for a similar clustering result, so it is the more effective choice when clustering with the MD fuzzy k-modes algorithm.
Step 4.2, comparing the MD fuzzy k-modes algorithm with other algorithms. The MD fuzzy k-modes algorithm is compared with 4 algorithms: SV-k-modes, k-mw-modes, fuzzy k-modes and fuzzy SV-k-modes. For the comparison, the data must be converted into single-valued attribute form where an algorithm requires it, and into set-valued data set form for the SV-k-modes and fuzzy SV-k-modes algorithms. The SV-k-modes and k-mw-modes algorithms do not contain the fuzzy factor β, so when comparing against these two algorithms the invention sets β = 1.1 in the MD fuzzy k-modes algorithm.
Step 4.3, analyzing the relationship between β and w. Because the value of β directly affects the membership degree of a matrix object to each class, the relationship between the fuzzy factor β and the membership degree w needs to be analyzed. Because the data sets contain too many objects, the invention takes only the first 10 objects as study objects. After 30 runs, the experimental results on the 4 data sets (Market Basket data, Microsoft Web data, Musk data and MovieLens data) are shown in Fig. 1, where a red 'o' indicates that the matrix object is assigned to class 1, a blue '★' to class 2, a green '□' to class 3, and a purple '+' to class 4.
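The kind of relationship plotted in Fig. 1 can be illustrated for a single object by sweeping β. The sketch assumes the standard fuzzy k-modes membership form and strictly positive distances; the distances in the usage line are made-up illustration values, not experimental data, and `membership_curve` is a hypothetical name.

```python
def membership_curve(dists, betas):
    """Membership of one object in each cluster for several fuzzy factors beta > 1.

    Assumes every distance in dists is strictly positive.
    """
    curve = {}
    for beta in betas:
        exponent = 1.0 / (beta - 1.0)
        curve[beta] = [
            1.0 / sum((d_l / d_h) ** exponent for d_h in dists) for d_l in dists
        ]
    return curve

# Illustrative distances from one object to three class centers.
example = membership_curve([0.2, 0.4, 0.8], [1.1, 2.0, 5.0])
```

Under this assumed form, a β close to 1 gives almost all of the membership to the nearest class, while larger β values spread the membership toward an even 1/k per class.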
Step 4.4, selecting the five real data sets in Table 2 (Market Basket data, Microsoft Web data, Musk data, MovieLens data and Alibaba data), comparing the clustering effect of the algorithm with that of the K-modes and WK-modes algorithms, and verifying the effectiveness of the algorithm; the comparison results are shown in Table 7. The algorithm of the invention is clearly superior to the K-modes and WK-modes algorithms.
Through the above steps, the effectiveness of the MD fuzzy k-modes algorithm in terms of clustering effect is verified, and the relationship between the fuzzy factor β and the membership degree w is analyzed.
TABLE 1 Taobao true data set
Figure BDA0002221446940000131
TABLE 2 preprocessed data set
Figure BDA0002221446940000132
TABLE 3 comparison of results for updating class centers with GAMF and HAMF in the MD fuzzy k-modes algorithm
Figure BDA0002221446940000133
TABLE 4 run time for updating class centers with GAMF and HAMF in the MD fuzzy k-modes algorithm
Figure BDA0002221446940000134
Table 5 comparison of 3 algorithms on 5 data sets
Figure RE-GDA0002438919360000144
Table 6 comparison of 3 algorithms for different parameter values on 5 data sets
TABLE 6.1 comparison of 3 algorithms on Market Basket data set
Figure BDA0002221446940000142
Figure BDA0002221446940000151
TABLE 6.2 comparison of 3 algorithms on Microsoft Web data set
Figure BDA0002221446940000152
TABLE 6.3 comparison of 3 algorithms on Musk data set
Figure BDA0002221446940000161
Table 6.4 comparison of 3 algorithms on MovieLens data set
Figure BDA0002221446940000162
Figure BDA0002221446940000171
Table 7 clustering effect comparison of 3 algorithms on 5 data sets
Figure RE-GDA0002438919360000182

Claims (6)

1. An efficient method for processing real-life categorical data, comprising the following steps:
step 1, randomly selecting k initial points from a data set X containing n samples, where k is the number of classes of the data set X;
step 2, calculating the distance between each object and the k initial points, and assigning each object to the class of its nearest initial point to obtain k clusters;
step 3, calculating the membership degree of each object to the k centers, and updating the class centers of the k clusters with a heuristic update algorithm, the class centers being denoted w_k;
step 4, repeating step 2 and step 3 until the class centers w_k no longer change.
2. The efficient method for processing real-life categorical data according to claim 1, characterized in that the distance d(X_i, X_j) from each object to the k initial points in step 2 is calculated as follows:
Figure RE-FDA0002438919350000011
wherein
Figure RE-FDA0002438919350000012
Figure RE-FDA0002438919350000013
In the formula (2)
Figure RE-FDA0002438919350000014
Representing a matrix object XiAt attribute As1/2 is a normalization factor because 0 ≦ δ (X)is,Xjs) 2 or less when
Figure RE-FDA0002438919350000015
When, delta (X)is,Xjs) 0; when in use
Figure RE-FDA0002438919350000016
When, delta (X)is,Xjs)=2。
3. An efficient method of processing real-life typed data according to claim 1, characterized in that: in step 3, the membership degree of each object to the k centers is calculated, and the class centers w_k of the k clusters are updated by using the heuristic updating algorithm, specifically as follows:
Figure FDA0002221446930000021
4. An efficient method of processing real-life typed data according to claim 1, characterized in that: the heuristic updating algorithm in step 3 comprises the following specific step:
step 3.1, analyzing formula 5: if Q_l minimizes formula 5, then Q_l is the class center of X,
Figure FDA0002221446930000022
where β is the fuzzy factor and w is the membership degree.
5. An efficient method of processing real-life typed data according to claim 4, characterized in that: in step 3.1, formula 5 is analyzed: if Q_l minimizes formula 5, then Q_l is the class center of X,
Figure FDA0002221446930000023
wherein β is the fuzzy factor and w is the membership degree; the analysis is shown in formula 6,
Figure FDA0002221446930000024
where
Figure FDA0002221446930000025
is referred to as the weight of the attribute value
Figure FDA0002221446930000026
of X_i.
6. An efficient method of processing real-life typed data according to claim 5, characterized in that: formula 6 is analyzed as follows: to make
Figure FDA0002221446930000027
minimal,
Figure FDA0002221446930000028
must be maximal. Assume that the value range of attribute A_j in class l is
Figure FDA0002221446930000029
Compute
Figure FDA00022214469300000210
and sort the results in descending order. If Q_lj takes r_j values, there are 3 cases:
1) when r_j = 1, if
Figure FDA00022214469300000211
t = 2, …, r′_j, where t indexes the attribute values, then
Figure FDA0002221446930000031
if more than one maximum exists, one is chosen at random using the randperm function in a MATLAB program;
2) when r_j = r′_j, all values are taken as the class center;
3) when r_j ≠ r′_j, there are the following 3 cases:
① if
Figure FDA0002221446930000032
the first r_j values are selected as the class center;
② if
Figure FDA0002221446930000033
then the first r_j − 1 values
Figure FDA0002221446930000034
are selected as part of Q_lj;
if
Figure FDA0002221446930000035
then
Figure FDA0002221446930000036
if
Figure FDA0002221446930000037
then
Figure FDA0002221446930000038
if
Figure FDA0002221446930000039
one is selected at random;
③ if
Figure FDA00022214469300000310
Figure FDA00022214469300000311
then the first (r_j − p′ − 1) attribute values are selected as part of the class center, denoted as
Figure FDA00022214469300000312
and p′ + 1 of the next p′ + p + 1 attribute values are then selected as the remainder; Q_j is the set of all combinations of the remaining attribute values, and R_t is the frequency of each combination in Q_j; if
Figure FDA00022214469300000313
R_s ≠ R_t, then Q_lj = R_s ∪ Q′; with the r_j values determined, let
Figure FDA00022214469300000314
Based on the heuristic update algorithm, the class center for attribute A_j is thus obtained; similarly, the class centers for the other attributes can also be found.
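
The loop in claim 1 can be sketched in Python as follows. This is only an illustrative sketch under several assumptions, because the formula images of the claims are not reproduced in this text. The per-attribute dissimilarity delta is a frequency-based stand-in chosen because it satisfies the bounds stated in claim 2 (0 for identical value distributions, 2 for disjoint ones, halved in the distance). The membership formula is the standard fuzzy k-modes one, not necessarily the one in claim 3. The center update keeps the r attribute values with the largest membership-weighted frequency, which is a simplification and not the three-case heuristic of claim 6. All function and parameter names (rel_freq, delta, distance, memberships, update_center, md_fuzzy_k_modes, r, centers, seed) are hypothetical.

```python
import random
from collections import Counter


def rel_freq(values):
    """Relative frequency of each categorical value in one attribute of a matrix object."""
    counts = Counter(values)
    return {v: c / len(values) for v, c in counts.items()}


def delta(a, b):
    """ASSUMED dissimilarity in [0, 2]: 0 for identical distributions, 2 for disjoint ones."""
    fa, fb = rel_freq(a), rel_freq(b)
    return sum(abs(fa.get(v, 0.0) - fb.get(v, 0.0)) for v in set(fa) | set(fb))


def distance(x, y):
    """Sum over attributes of delta / 2; the 1/2 normalizes each term to [0, 1]."""
    return sum(0.5 * delta(xs, ys) for xs, ys in zip(x, y))


def memberships(x, centers, beta):
    """Standard fuzzy k-modes membership (an assumption); requires beta > 1."""
    d = [distance(x, c) for c in centers]
    if 0 in d:  # object coincides with a center: full membership there
        hit = d.index(0)
        return [1.0 if l == hit else 0.0 for l in range(len(d))]
    e = 1.0 / (beta - 1.0)
    return [1.0 / sum((dl / dh) ** e for dh in d) for dl in d]


def update_center(data, weights, beta, r):
    """SIMPLIFIED update: keep the r values with the largest weighted frequency per attribute."""
    center = []
    for s in range(len(data[0])):
        score = Counter()
        for x, w in zip(data, weights):
            for v, f in rel_freq(x[s]).items():
                score[v] += (w ** beta) * f
        # Ties are broken deterministically by the string form of the value.
        center.append(sorted(score, key=lambda v: (-score[v], str(v)))[:r])
    return center


def md_fuzzy_k_modes(data, k, beta=1.5, r=1, centers=None, max_iter=50, seed=0):
    """Steps 1 to 4 of claim 1: init, memberships, center update, repeat until stable."""
    if centers is None:  # step 1: random initial points
        centers = random.Random(seed).sample(data, k)
    centers = [[list(attr) for attr in c] for c in centers]
    for _ in range(max_iter):
        W = [memberships(x, centers, beta) for x in data]  # steps 2 and 3
        new_centers = [update_center(data, [w[l] for w in W], beta, r) for l in range(k)]
        if new_centers == centers:  # step 4: stop when the centers no longer change
            break
        centers = new_centers
    W = [memberships(x, centers, beta) for x in data]
    labels = [max(range(k), key=lambda l: w[l]) for w in W]
    return centers, labels
```

Each object here is a list of attributes, and each attribute is the list of categorical values that the matrix object takes on it, for example [["a", "a", "b"], ["x"]]. Passing explicit initial centers instead of relying on the random draw makes a run reproducible.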
CN201910935269.1A 2019-09-29 2019-09-29 Effective method for processing classified data in real life Pending CN111160382A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910935269.1A CN111160382A (en) 2019-09-29 2019-09-29 Effective method for processing classified data in real life

Publications (1)

Publication Number Publication Date
CN111160382A true CN111160382A (en) 2020-05-15

Family

ID=70555617

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910935269.1A Pending CN111160382A (en) 2019-09-29 2019-09-29 Effective method for processing classified data in real life

Country Status (1)

Country Link
CN (1) CN111160382A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107122793A (en) * 2017-03-23 2017-09-01 北京航空航天大学 A kind of improved global optimization k modes clustering methods
CN110083665A (en) * 2019-05-05 2019-08-02 贵州师范大学 Data classification method based on the detection of improved local outlier factor

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Li Shunyong et al.: "MD fuzzy k-modes clustering algorithm for categorical matrix-object data", Journal of Computer Research and Development (计算机研究与发展) *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200515