CN111160382A

CN111160382A - Effective method for processing classified data in real life

Info

Publication number: CN111160382A
Application number: CN201910935269.1A
Authority: CN
Inventors: 李顺勇; 张苗苗; 张钰嘉
Original assignee: Shanxi University
Current assignee: Shanxi University
Priority date: 2019-09-29
Filing date: 2019-09-29
Publication date: 2020-05-15

Abstract

The invention discloses an effective method for processing real-life classified (categorical) data, comprising the following steps: step 1, randomly selecting k initial points from a data set X containing n samples, where k is the number of classes of the data set X; step 2, calculating the distance between each object and the k initial points, and assigning each object to the class of the initial point nearest to it, thereby obtaining k clusters; step 3, calculating the membership degree of each object to the k centers, and updating the class centers of the k clusters with a heuristic updating algorithm, the class centers being denoted w_h; step 4, repeating step 2 and step 3 until the class centers w_h no longer change. The core of the invention is the MD fuzzy k-modes clustering algorithm for categorical matrix object data. In the big-data era, clustering multiple records with the MD fuzzy k-modes algorithm makes it easier to discover customers' consumption preferences, so that more targeted recommendations can be made.

Description

Effective method for processing classified data in real life

Technical Field

The invention relates to the field of advanced calculation and data processing, in particular to an effective method for processing classified data in real life.

Background

In data mining, the input to the algorithm is in most cases a data set X, also called a table or a matrix. In many practical applications, a database typically contains a plurality of tables, with one-to-one, one-to-many, and many-to-many relationships between the tables.

For example, a customer may purchase multiple products in a single shopping session. An object described by multiple feature vectors is referred to as a matrix object, and a data set composed of matrix objects is referred to as a matrix object data set. Table 1 shows data from http://www.taobao.com. Table 1 has two parts: the left part describes basic user information, and the right part records each user's visits to different brands at different points in time, where the attribute "access time" represents the time at which the user visited a brand on a given day. In database terms, the left part is the main table and the right part is the detail table, and there is a typical one-to-many relationship between the two parts.

The data in Table 1 have the following characteristics: (1) Relevance: the data in the main table and the detail table may be correlated, and users of different genders or ages may have different preferences. For example, a 24-year-old female user visits goods commonly used by most female users, such as JOSINE and WETHERM, whereas a 40-year-old female user visits merchandise for both men and women because she needs to care for her family. (2) One-to-many: each user in the main table corresponds to multiple records in the detail table. Furthermore, the number of brands visited tends to differ between users in Table 1. For example, user 10944750 has 11 records, while user 8149250 has 4 records. (3) Mixedness: in most cases, objects are described by categorical attributes together with numerical attributes. For example, in Table 1, "brand name" is a categorical attribute and "access time" is a numerical attribute. (4) Evolution: some attribute values may change over time. For example, a user may visit a brand repeatedly during one month but not visit it at all the next month. In other words, user behavior is a dynamic process that evolves over time.

As can be seen from the detail-table part of Table 1, each user visits at least one brand, and a brand may be viewed by many users. Furthermore, a user may visit a brand several times in one day, or several times over a few days. Obviously, if a user visits a certain brand many times, he or she may be interested in its goods. For example, user 10944750 visits JOSINE in four consecutive months of the year, several times per month, so we can predict that this user likes JOSINE very much. The same user visits SEMIR only once, so the user probably likes it less than JOSINE.

Data such as that shown in Table 1 is very common in banking, insurance, telecommunications, retail and medical databases. It is therefore necessary to develop a method that finds groups of users with different behavior patterns from the detail table rather than from the main table, because behavioral analysis can help managers obtain more valuable decision information.

Cluster analysis is an unsupervised technique that groups data by a given similarity measure, so that data within a cluster are as similar as possible and data in different clusters are as dissimilar as possible. Conventional clustering algorithms generally cluster single-valued attribute data, but in many practical applications each object is described by several feature vectors. Processing such data with an existing clustering algorithm requires prior knowledge to select a single record per object, which loses a great deal of information, destroys the integrity of the original data, and defeats the purpose of analyzing the data as a whole. Therefore, in order to discover customers' consumption preferences from multiple consumption records and thereby make more targeted recommendations, a clustering algorithm for matrix object data is needed. At present there is relatively little research on clustering algorithms for matrix object data, and many problems remain to be solved.

Nowadays, growing data volumes and data complexity pose no small problem for cluster analysis, and the traditional k-modes algorithm cannot classify the relatively complex categorical data found in everyday life quickly and effectively.

Disclosure of Invention

To overcome the defects and shortcomings of the prior art, an effective method for processing real-life classified (categorical) data is provided. The core algorithm of the method, the MD fuzzy k-modes clustering algorithm, can cluster multiple records per object and can process multi-attribute categorical data sets quickly and effectively.

The invention provides an effective method for processing real life classified data, which comprises the following steps:

step 1, randomly selecting k initial points from a data set X containing n samples, wherein k is the classification number of the data set X;

step 2, calculating the distance between each object and k initial points, and distributing the objects to the initial point class with the minimum distance to the objects to obtain k clusters;

step 3, calculating the membership degree of each object to k centers, and updating the class centers of the k clusters by using a heuristic updating algorithm, wherein the class centers are represented as w_h;

step 4, repeating step 2 and step 3 until the class centers w_h no longer change.

In a further refinement, the distance d(X_i, X_j) from each object to the k initial points in step 2 is calculated as follows:

[formula (1)]

where

[formula (2)]

In formula (2), [symbol] describes the matrix object X_i on attribute A_s, and 1/2 is a normalization factor because 0 ≤ δ(X_is, X_js) ≤ 2: when [condition], δ(X_is, X_js) = 0; when [condition], δ(X_is, X_js) = 2.
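Because the expressions for d and δ did not survive in this text, the following Python sketch is only an illustration under an assumed dissimilarity, not the formula of the invention. It assumes that δ is the sum of absolute differences between the relative frequencies of each attribute value in the two matrix objects. That choice lies in [0, 2], equals 0 when the two value distributions coincide and 2 when the objects share no value, which matches the bounds stated above. The function names `delta` and `distance`, the list-of-records layout and the toy records are all hypothetical.

```python
from collections import Counter


def delta(values_i, values_j):
    """Assumed per-attribute dissimilarity (not taken from the source):
    L1 distance between the relative-frequency distributions of two
    lists of categorical values. The result lies in [0, 2]."""
    freq_i, freq_j = Counter(values_i), Counter(values_j)
    n_i, n_j = len(values_i), len(values_j)
    domain = set(freq_i) | set(freq_j)
    # Counter returns 0 for values that do not occur in one of the lists
    return sum(abs(freq_i[v] / n_i - freq_j[v] / n_j) for v in domain)


def distance(obj_i, obj_j):
    """Assumed distance between two matrix objects.

    Each object is a non-empty list of records; each record is a tuple
    holding one categorical value per attribute. The factor 1/2 rescales
    every per-attribute term from [0, 2] to [0, 1].
    """
    cols_i = list(zip(*obj_i))  # values of obj_i grouped by attribute
    cols_j = list(zip(*obj_j))
    return sum(0.5 * delta(ci, cj) for ci, cj in zip(cols_i, cols_j))


# Toy data: two users described by (brand, category) records
user_a = [("JOSINE", "clothing"), ("JOSINE", "clothing"), ("SEMIR", "clothing")]
user_b = [("SEMIR", "clothing")]
print(distance(user_a, user_a))  # identical objects give zero distance
print(distance(user_a, user_b))
```

Grouping values by attribute before comparing them lets objects with different numbers of records be compared directly, which is the situation described for the one-to-many detail table.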

As a further improvement of the above scheme, in step 3 the membership degree of each object to the k centers is calculated and the class centers w_h of the k clusters are updated with a heuristic update algorithm, specifically as follows:

[formula]

As a further improvement of the above scheme, the heuristic update algorithm in step 3 specifically includes the following steps:

step 3.1, formula 5 is analyzed: if Q_l minimizes formula 5, then Q_l is the class center of X,

[formula 5]

where β is the fuzzy factor and w is the membership degree.
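The membership formula itself is not reproduced in this text. As a hedged illustration, the sketch below assumes the membership update commonly used in fuzzy k-modes style algorithms, in which the membership of an object in class l is 1 / Σ_h (d_l / d_h)^(1/(β-1)) for β > 1. The function name `memberships`, and the choice to share membership equally when an object coincides with several centers, are assumptions made for this example.

```python
def memberships(dists, beta=1.1):
    """Membership degrees of one object to every class center.

    dists[l] is the distance from the object to class center l and
    beta > 1 is the fuzzy factor. This is the standard fuzzy k-modes
    style update, assumed here because the source formula is missing.
    """
    zero = [l for l, d in enumerate(dists) if d == 0]
    if zero:
        # The object coincides with at least one center: give those
        # centers all of the membership (shared equally, an assumption).
        return [1.0 / len(zero) if l in zero else 0.0
                for l in range(len(dists))]
    exponent = 1.0 / (beta - 1.0)
    return [1.0 / sum((d_l / d_h) ** exponent for d_h in dists)
            for d_l in dists]


w = memberships([0.5, 1.0, 2.0], beta=1.1)
print(w, sum(w))
```

With β close to 1 the exponent is large, so nearly all membership goes to the nearest center; larger β values spread membership more evenly across the classes.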

As a further improvement of the above scheme, in said step 3.1 formula 5 is analyzed: if Q_l minimizes formula 5, then Q_l is the class center of X,

[formula 5]

where β is the fuzzy factor and w is the membership degree. The analysis process is shown in formula 6,

[formula 6]

Here, [symbol] is referred to as the weight of the attribute value [symbol] of X_i. Finding the class center Q_l means minimizing D(X, Q_l); according to definition 1, it suffices to minimize

[expression]

Let attribute A_s have |V^s| attribute values. The number of attribute values in its class center Q_ls can then only range from 1 to |V^s|. If u_s of the |V^s| values are chosen as the attribute values of Q_ls, there are C(|V^s|, u_s) possible cases. Since u_s also ranges from 1 to |V^s|, by the binomial theorem the class center on attribute A_s is selected from 2^|V^s| - 1 cases in total. In the method, each case needs to be traversed to compute

[expression]

and the results are then arranged in descending order to update the class center of attribute A_s. Class centers of the other attributes can be updated in the same way.

The time complexity of the global class-center update algorithm is O(nmtk × 2^|V′|), where n denotes the number of objects, m the number of attributes, k the number of classes, t the number of iterations, and |V′| = max{|V^s|} over all attributes. The time complexity of the global class-center update therefore grows linearly with the number of objects, attributes, classes and iterations, and exponentially with the number of attribute values.
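The exponential factor can be made concrete by enumerating the candidate class centers for one attribute. The short Python sketch below assumes only that a candidate center is any non-empty subset of the attribute's value domain; the helper name `candidate_centers` and the value labels are hypothetical.

```python
from itertools import combinations


def candidate_centers(values):
    """Yield every non-empty subset of an attribute's value domain,
    that is, every candidate a global update would have to evaluate."""
    for u in range(1, len(values) + 1):
        yield from combinations(values, u)


for size in (3, 5, 10):
    domain = [f"v{i}" for i in range(size)]
    count = sum(1 for _ in candidate_centers(domain))
    # the enumerated count matches the closed form 2^|V| - 1
    print(size, count, 2 ** size - 1)
```

Each additional attribute value doubles the number of candidates, which is why exhaustive traversal quickly becomes impractical.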

When the number of attribute values in the matrix object data is excessive, the calculation amount of the global updating class center is too large, and the consumed time is increased, so that the heuristic updating class center algorithm is provided.

As a further improvement of the above scheme, formula 6 is analyzed as follows. To make

[expression]

minimal, it is necessary to make

[expression]

maximal. Assume that the value range of attribute A_j in the l-th class is

[expression]

Compute

[expression]

and sort in descending order. If Q_lj takes r_j values, there are 3 cases:

1) when r_j = 1, if [condition], where t denotes the index of the attribute value, then [expression]; if more than one maximum exists, one is picked at random using the randperm function in MATLAB;

2) when r_j = r′_j, all values are taken as the class center;

3) when r_j ≠ r′_j, there are the following 3 cases:

① if [condition], the first r_j values are selected as the class center;

② if [condition], the first r_j - 1 values [expression] are selected as a part of Q_lj; if [condition], then [expression]; if [condition], then [expression]; if [condition], one is selected at random;

③ if [condition], the first (r_j - p′ - 1) attribute values are selected as a part of the class center, denoted [expression]; then p′ + 1 of the next p′ + p + 1 attribute values are selected as the remaining part. Q^j is the set of all combinations of the remaining attribute values, and R_t is the frequency of each combination in Q^j; if [condition], the r_j values are determined, and we set [expression].

With the heuristic update algorithm, the class center of attribute A_j is obtained; class centers of the other attributes can be found in the same way.
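Case 1) above selects a single attribute value with the largest score and breaks ties at random. A minimal Python sketch of that step follows. It uses `random.choice` as a stand-in for the MATLAB randperm call mentioned in the text, and the dictionary `weighted_freq`, mapping each attribute value to its score, is a hypothetical input, since the exact scoring expression is missing from this text.

```python
import random


def pick_single_value(weighted_freq, rng=random):
    """Return the attribute value with the largest score.

    weighted_freq maps attribute value -> score (hypothetical input).
    Ties for the maximum are broken uniformly at random.
    """
    best = max(weighted_freq.values())
    tied = [value for value, score in weighted_freq.items() if score == best]
    return rng.choice(tied)


print(pick_single_value({"JOSINE": 0.7, "SEMIR": 0.2, "WETHERM": 0.1}))
```

Passing a seeded `random.Random` instance as `rng` makes the tie-breaking reproducible across runs.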

The invention has the beneficial effects that:

compared with existing clustering techniques for classified (categorical) data, the core algorithm has the following notable advantage: it can cluster multiple records per object, which makes it easier to discover customers' consumption preferences and to make more targeted recommendations.

Drawings

The following detailed description of embodiments of the invention is provided in conjunction with the appended drawings, in which:

fig. 1 shows the relationship between β and w on 4 data sets: fig. 1.1 on the Market Basket data set, fig. 1.2 on the Microsoft Web data set, fig. 1.3 on the Musk data set, and fig. 1.4 on the MovieLens data set.

Detailed Description

The invention is described in further detail below with reference to the accompanying drawings. The method comprises the following steps:

step 1, randomly selecting k initial points from a data set X containing n samples, wherein k is the classification number of the data set X, and the specific steps are shown as steps 1.1-1.2;

step 1.1, 5 real data sets are selected: Market Basket data, Microsoft Web data, Musk data, MovieLens data and Alibaba data. The Market Basket data records the transactions of 1001 customers, and each record is described by 4 attributes: user ID, transaction time, product name and product ID. The Microsoft Web data comes from the UCI repository and records the web pages browsed by 32711 anonymous users during one week in January 1998; each user is described by 2 attributes: user ID and web page ID. The Musk data also comes from the UCI repository and comprises 92 objects, each described by 167 attributes. The MovieLens data is downloaded from the MovieLens website; only its rating data is used, which records 1000209 ratings of 3900 movies by 6040 viewers, each record being described by 4 attributes: user ID, movie ID, user rating and rating submission time. The Alibaba data set contains 182880 records of 884 users browsing certain brands, also described by 4 attributes. All 5 data sets are matrix object data sets. In order to improve the clustering effect, the invention preprocesses the attributes of each data set accordingly; the preprocessed data are shown in Table 2.

Step 1.2, defining the evaluation criteria. The effectiveness of the proposed algorithm is evaluated with 5 indexes: accuracy (AC), precision (PR), recall (RE), adjusted Rand index (ARI) and normalized mutual information (NMI). AC is the proportion of correctly classified objects; PR indicates how many of the samples predicted to be positive are truly positive; RE indicates how many of the positive examples are predicted correctly; ARI and NMI measure the degree of agreement between two data partitions. The larger the values of AC, PR, RE, ARI and NMI, the closer the clustering result is to the real partition of the data set and the better the clustering effect. The specific calculation formulas are shown as formulas 10-14;

where [symbol] and [symbol] are the numbers of objects in C_i and P_j respectively, n_ij = |C_i ∩ P_j|, and [symbol] denotes the number of objects correctly assigned to class i. The closer the clustering result is to the true partition of the data set, the larger the values of AC, PR, RE, ARI and NMI, with a maximum value of 1. In other words, the larger the values of the experimental results on the 5 evaluation criteria, the better the clustering quality and the more effective the algorithm.
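Formulas 10-14 are not reproduced in this text. As one hedged illustration of how the contingency counts n_ij = |C_i ∩ P_j| feed an evaluation index, the Python sketch below computes the adjusted Rand index with its standard textbook definition, which may differ in notation from the formula of the invention. It assumes C is the clustering result and P the true partition (the index is symmetric in the two), and the function name `adjusted_rand_index` is chosen for this example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index from the contingency counts n_ij.

    Standard definition, assumed here because the source formula is
    missing. Requires at least two objects.
    """
    n = len(labels_true)
    if n < 2 or n != len(labels_pred):
        raise ValueError("need two label lists of equal length >= 2")
    n_ij = Counter(zip(labels_pred, labels_true))  # |C_i ∩ P_j|
    a = Counter(labels_pred)   # cluster sizes |C_i|
    b = Counter(labels_true)   # class sizes |P_j|
    sum_ij = sum(comb(c, 2) for c in n_ij.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # degenerate case, e.g. both partitions put everything in one class
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition, relabeled
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

Because the index depends only on which objects are grouped together, renaming the cluster labels does not change its value.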

Step 2, calculating the distance d(X_i, X_j) from each object to the k initial points, specifically as follows:

[formula (1)]

where

[formula (2)]

In formula (2), when [condition], δ(X_is, X_js) = 0; when [condition], δ(X_is, X_js) = 2.

The objects are then assigned to the class of the initial point nearest to them, to obtain k clusters;

step 3, calculating the membership degree of each object to k centers, and updating the class centers of the k clusters by using a heuristic updating algorithm, wherein the class centers are represented as w_h, specifically as follows:

[formula]

The heuristic updating algorithm comprises the following specific steps:

[formula]

Here, [symbol] is referred to as the weight of the attribute value [symbol] of X_i.

In the method, each case needs to be traversed

[expression]

When the number of attribute values in the matrix object data is large, the global class-center update requires too much computation and takes too long, so a heuristic class-center update algorithm is provided here. Formula 6 is analyzed as follows. To make

[expression]

minimal, it is necessary to make

[expression]

maximal. Assume that the value range of attribute A_j in the l-th class is

[expression]

Compute

[expression]

and sort in descending order. If Q_lj takes r_j values, there are 3 cases:

1) when r_j = 1, if [condition], where t denotes the index of the attribute value, then [expression]; if more than one maximum exists, one is picked at random using the randperm function in MATLAB;

2) when r_j = r′_j, all values are taken as the class center;

3) when r_j ≠ r′_j, there are the following 3 cases:

① if [condition], the first r_j values are selected as the class center;

② if [condition], the first r_j - 1 values [expression] are selected as a part of Q_lj; if [condition], then [expression]; if [condition], then [expression]; if [condition], one is selected at random;

③ if [condition], the first (r_j - p′ - 1) attribute values are selected as a part of the class center, denoted [expression]; the r_j values are then determined, and we set [expression].

Step 4, repeating step 2 and step 3 until the class centers w_h no longer change; the specific steps are shown in steps 4.1-4.4.

Step 4.1, evaluating the effectiveness of the heuristic class-center updating algorithm. The class centers are updated with the heuristic algorithm (HAMF) and the global algorithm (GAMF) respectively, and the clustering results and running times are compared. Taking the Market Basket data as an example, each algorithm is run 10 times; the results are shown in Tables 3 and 4.

As can be seen from Tables 3 and 4, updating the class centers with the global algorithm gives a better clustering effect than the heuristic updating algorithm, but takes as long as 96 h. The heuristic updating algorithm takes only 160 s for a similar clustering effect, so the heuristic updating algorithm provided by the invention is more effective when clustering with the MD fuzzy k-modes algorithm.

Step 4.2, comparing the MD fuzzy k-modes algorithm with other algorithms. The 4 algorithms SV-k-modes, k-mw-modes, fuzzy k-modes and fuzzy SV-k-modes are compared with the MD fuzzy k-modes algorithm. For the comparison, the data must be converted into single-valued attribute form, and for the SV-k-modes and fuzzy SV-k-modes algorithms into set-valued data set form. When comparing with the SV-k-modes and k-mw-modes algorithms, which do not contain the fuzzy factor β, the invention sets the β value in the MD fuzzy k-modes algorithm to 1.1. When comparing with the fuzzy k-modes and fuzzy SV-k-modes algorithms, [text garbled in source].

Step 4.3, analyzing the relation between β and w. Because the value of β directly affects the membership degree of a matrix object to each class, it is necessary to analyze the relation between the fuzzy factor β and the membership degree w. Because the data sets contain too many objects, the invention takes only the first 10 objects as research objects. After 30 experiments, the results on the 4 data sets (Market Basket data, Microsoft Web data, Musk data and MovieLens data) are shown in FIG. 1, where a red 'o' indicates a matrix object assigned to the 1st class, a blue '★' the 2nd class, a green '□' the 3rd class, and a purple '+' the 4th class.

Step 4.4, the five real data sets in Table 2 (Market Basket data, Microsoft Web data, Musk data, MovieLens data and Alibaba data) are selected, and the clustering effect of the algorithm of the invention is compared with that of the K-modes and WK-modes algorithms to verify its effectiveness. The comparison result is shown in Table 7. It can be seen that the algorithm of the invention is clearly superior to the K-modes and WK-modes algorithms.

Through the above steps, the effectiveness of the MD fuzzy k-modes algorithm in terms of clustering effect is verified, and the relation between the fuzzy factor β and the membership degree w is analyzed.

TABLE 1 Taobao true data set

TABLE 2 preprocessed data set

TABLE 3 comparison of results for updating class centers with GAMF and HAMF in the MD fuzzy k-modes algorithm

TABLE 4 running time for updating class centers with GAMF and HAMF in the MD fuzzy k-modes algorithm

Table 5 comparison of 3 algorithms on 5 data sets

Table 6 comparison of 3 algorithms for different parameter values on 5 data sets

TABLE 6.1 comparison of 3 algorithms on Market Basket data set

TABLE 6.2 comparison of 3 algorithms on Microsoft Web data set

TABLE 6.3 comparison of 3 algorithms on Musk data set

Table 6.4 comparison of 3 algorithms on MovieLens data set

Table 7 clustering effect comparison of 3 algorithms on 5 data sets

Claims

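1. An efficient method for processing real-life typed data, comprising the steps of:

step 1, randomly selecting k initial points from a data set X containing n samples, wherein k is the classification number of the data set X;

step 2, calculating the distance between each object and k initial points, and distributing the objects to the initial point class with the minimum distance to the objects to obtain k clusters;

step 3, calculating the membership degree of each object to k centers, and updating the class centers of the k clusters by using a heuristic updating algorithm, wherein the class centers are represented as w_k;

step 4, repeating step 2 and step 3 until the class centers w_k no longer change.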

2. An efficient method of processing real-life typed data according to claim 1, characterized in that: the distance d(X_i, X_j) from each object to the k initial points in step 2 is calculated as follows:

[formula (1)]

where

[formula (2)]

In formula (2), [symbol] describes the matrix object X_i on attribute A_s, and 1/2 is a normalization factor because 0 ≤ δ(X_is, X_js) ≤ 2: when [condition], δ(X_is, X_js) = 0; when [condition], δ(X_is, X_js) = 2.

3. An efficient method of processing real-life typed data according to claim 1, characterized in that: in step 3 the membership degree of each object to the k centers is calculated and the class centers w_k of the k clusters are updated with a heuristic updating algorithm, specifically as follows:

[formula]

4. An efficient method of processing real-life typed data according to claim 1, characterized in that: the heuristic updating algorithm in step 3 comprises the following specific steps:

[formula 5]

where β is the fuzzy factor and w is the membership degree.

5. An efficient method of processing real-life typed data according to claim 4, characterized in that: in said step 3.1, formula 5 is analyzed: if Q_l minimizes formula 5, then Q_l is the class center of X,

[formula 6]

Here, [symbol] is referred to as the weight of the attribute value [symbol] of X_i.

6. An efficient method of processing real-life typed data according to claim 5, characterized in that: analysis of formula 6 was performed as follows: want to make