CN111160382A - Effective method for processing classified data in real life - Google Patents
Effective method for processing classified data in real life
- Publication number
- CN111160382A
- Authority
- CN
- China
- Prior art keywords
- class
- data
- algorithm
- fuzzy
- centers
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Artificial Intelligence (AREA)
- Probability & Statistics with Applications (AREA)
- Life Sciences & Earth Sciences (AREA)
- Fuzzy Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an effective method for processing classification-type data in real life, comprising the following steps: step 1, randomly selecting k initial points from a data set X containing n samples, where k is the number of classes in X; step 2, calculating the distance between each object and the k initial points, and assigning each object to the class of the nearest initial point to obtain k clusters; step 3, calculating the membership degree of each object to the k centers, and updating the class centers of the k clusters with a heuristic update algorithm, the class centers being denoted $w_h$; step 4, repeating step 2 and step 3 until the class centers $w_h$ no longer change. The core of the invention is the MD fuzzy k-modes clustering algorithm for classification-type matrix object data. In the big data era, clustering multiple records with the MD fuzzy k-modes algorithm makes it easier to find customers' consumption preferences, so that more targeted recommendations can be made.
Description
Technical Field
The invention relates to the field of advanced calculation and data processing, in particular to an effective method for processing classified data in real life.
Background
In data mining, the input to the algorithm is in most cases a data set X, also called a table or a matrix. In many practical applications, a database typically contains a plurality of tables, with one-to-one, one-to-many, and many-to-many relationships between the tables.
For example, a customer may purchase multiple products in a single shopping trip. An object described by multiple feature vectors is referred to as a matrix object, and a data set composed of matrix objects is referred to as a matrix object data set. Table 1 shows data from http://www.taobao.com. Table 1 has two parts: the left part describes basic information about the users, and the right part records each user's accesses to different brands at different points in time, where the attribute "access time" represents the number of times the user accessed a brand on the same day. In database terms, the left part is the main table and the right part is the detail table, and there is a typical one-to-many relationship between the two parts.
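The main-table/detail-table structure described above can be sketched as follows. This is a minimal illustration, not the patent's data format: the user IDs, field names and all values are hypothetical, and only the brand names are taken from the text.

```python
from collections import defaultdict

# Hypothetical main table: one row of basic information per user.
main_table = {
    "u1": {"gender": "F", "age": 24},
    "u2": {"gender": "F", "age": 40},
}

# Hypothetical detail table: many rows per user (user ID, brand, accesses that day).
detail_table = [
    ("u1", "JOSINE", 3),
    ("u1", "SEMIR", 1),
    ("u2", "JOSINE", 2),
]

def build_matrix_objects(detail_rows):
    """Group detail rows by user ID, giving one matrix object per user."""
    objects = defaultdict(list)
    for user_id, brand, accesses in detail_rows:
        objects[user_id].append((brand, accesses))
    return dict(objects)

matrix_objects = build_matrix_objects(detail_table)
for user_id, rows in matrix_objects.items():
    print(user_id, main_table[user_id], rows)
```

Each value of `matrix_objects` is a list of feature vectors for one user, which is the one-to-many relationship between the two parts of the table.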
The data in Table 1 have the following characteristics. (1) Relevance: the data in the main and detail tables may be correlated, since users of different genders or ages may have different preferences. For example, a 24-year-old female user accesses goods commonly used by most female users, such as JOSINE and WETHERM, whereas a 40-year-old female user accesses goods for both men and women because she needs to take care of her family. (2) One-to-many: each user in the main table corresponds to multiple records in the detail table, and the number of brands accessed tends to differ between users. For example, user 10944750 has 11 records, while user 8149250 has 4 records. (3) Mixedness: in most cases, objects are described by categorical attributes together with numerical attributes. For example, in the detail table, "brand name" is a categorical attribute and "access time" is a numerical attribute. (4) Evolution: some attribute values may change over time. For example, a user may visit a brand repeatedly during one month but not visit it at all the next month. In other words, user behavior is a dynamic process that evolves over time.
As can be seen from the detail table portion of Table 1, each user accesses at least one brand, and a brand may be viewed by many users. Furthermore, a user may access a brand several times in one day, or several times over a few days. If a user visits a certain brand many times, he or she is probably interested in its goods. For example, user 10944750 accessed JOSINE in four consecutive months of the year, with several accesses per month, so we can predict that this user likes JOSINE very much. The same user visited SEMIR only once, so the user probably prefers it less than JOSINE.
Data such as that shown in Table 1 is very common in banking, insurance, telecommunications, retail and medical databases. Because behavioral analysis can help managers obtain more valuable decision information, it is necessary to develop a method that finds groups of users with different behavior patterns from the detail table rather than from the main table.
Cluster analysis is an unsupervised technique that groups data according to a similarity measure, so that data within a cluster are as similar as possible and data in different clusters are as dissimilar as possible. Conventional clustering algorithms generally cluster data with single-valued attributes, but in many practical applications each object is described by multiple feature vectors. Processing such data with an existing clustering algorithm requires prior knowledge to select a single record per object, which loses a great deal of information, destroys the original character of the data, and defeats the purpose of analyzing the data as a whole. Therefore, in order to find customers' consumption preferences from multiple consumption records and make more targeted recommendations, it is necessary to study clustering algorithms for matrix object data. At present there is relatively little research on clustering matrix object data, and many problems remain to be solved.
Today, growing data volumes and increasing data complexity pose considerable problems for cluster analysis, and the traditional k-modes algorithm cannot quickly and effectively cluster the relatively complex classification-type data found in real life.
Disclosure of Invention
To overcome the defects and shortcomings of the prior art, an effective method for processing classification-type data in real life is provided. The core algorithm of the method, the MD fuzzy k-modes clustering algorithm, can cluster multiple records per object and can quickly and effectively process classification-type data sets with multiple attributes.
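The invention provides an effective method for processing classified data in real life, which comprises the following steps:
step 1, randomly selecting k initial points from a data set X containing n samples, where k is the number of classes in X;
step 2, calculating the distance between each object and the k initial points, and assigning each object to the class of the nearest initial point to obtain k clusters;
step 3, calculating the membership degree of each object to the k centers, and updating the class centers of the k clusters with a heuristic update algorithm, the class centers being denoted $w_h$;
step 4, repeating step 2 and step 3 until the class centers $w_h$ no longer change.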
As a further improvement of the above scheme, the distance $d(X_i, X_j)$ from each object to the k initial points in step 2 is calculated as follows:
In formula (2), $X_{is}$ denotes the value of matrix object $X_i$ on attribute $A_s$, and 1/2 is a normalization factor because $0 \le \delta(X_{is}, X_{js}) \le 2$: under the first condition […], $\delta(X_{is}, X_{js}) = 0$; under the second condition […], $\delta(X_{is}, X_{js}) = 2$.
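The role of the 1/2 normalization factor can be illustrated with a small sketch. Formula (2) itself is not available in this text, so the dissimilarity below is an assumption chosen only because it has the stated range: `delta` is taken to be the sum of absolute differences between the relative frequencies of the values two matrix objects take on one attribute, which lies in [0, 2]. The function names and the dict-of-lists object layout are hypothetical.

```python
from collections import Counter

def relative_frequencies(values):
    """Relative frequency of each categorical value in a non-empty list."""
    counts = Counter(values)
    total = len(values)
    return {v: c / total for v, c in counts.items()}

def delta(values_i, values_j):
    """Assumed dissimilarity on one attribute; ranges from 0 to 2."""
    fi = relative_frequencies(values_i)
    fj = relative_frequencies(values_j)
    return sum(abs(fi.get(v, 0.0) - fj.get(v, 0.0)) for v in set(fi) | set(fj))

def distance(obj_i, obj_j):
    """Sum over attributes of delta / 2, so each attribute contributes at most 1."""
    return sum(0.5 * delta(obj_i[a], obj_j[a]) for a in obj_i)

x1 = {"brand": ["JOSINE", "JOSINE", "SEMIR"]}
x2 = {"brand": ["JOSINE", "JOSINE", "SEMIR"]}
x3 = {"brand": ["WETHERM"]}
print(distance(x1, x2), distance(x1, x3))
```

Under this assumption, two objects with identical value distributions are at distance 0 and two objects with no value in common are at distance 1 per attribute, which matches the two extreme cases named in the text.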
As a further improvement of the above scheme, in step 3 the membership degree of each object to the k centers is calculated, and the class centers $w_h$ of the k clusters are updated with the heuristic update algorithm, as follows:
As a further improvement of the above scheme, the heuristic update algorithm in step 3 comprises the following specific steps:
step 3.1, analyze formula 5: if $Q_l$ minimizes formula 5, then $Q_l$ is a class center of X, where β is the fuzzy factor and w is the membership degree.
As a further improvement of the above scheme, in step 3.1 formula 5 is analyzed as follows: if $Q_l$ minimizes formula 5, then $Q_l$ is a class center of X, where β is the fuzzy factor and w is the membership degree; the analysis is shown in formula 6.
Here, […] is referred to as the weight of the attribute value […] of $X_i$. Finding the class center $Q_l$ means minimizing $D(X, Q_l)$; by Definition 1, it suffices to minimize […]. Let the number of values of attribute $A_s$ be $|V_s|$. The number of attribute values in the class center $Q_{ls}$ can then only range from 1 to $|V_s|$. Selecting $u_s$ of the $|V_s|$ values as the attribute values of $Q_{ls}$ gives $\binom{|V_s|}{u_s}$ cases, and $u_s$ also ranges from 1 to $|V_s|$. By the binomial theorem, the class center on attribute $A_s$ can therefore be chosen in $\sum_{u_s=1}^{|V_s|} \binom{|V_s|}{u_s} = 2^{|V_s|} - 1$ ways. Each case must be traversed and evaluated, and the results are then arranged in descending order to update the class center on attribute $A_s$. The class centers on the other attributes are updated in the same way.
The time complexity of the global class-center update algorithm is $O(nmtk \times 2^{|V'|})$, where n denotes the number of objects, m the number of attributes, k the number of classes, t the number of iterations, and $|V'| = \max\{|V_s| : 1 \le s \le m\}$. The time complexity of the global update therefore grows linearly with the number of objects, attributes, classes and iterations, and exponentially with the number of attribute values.
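The exponential term in this complexity comes from enumerating every non-empty subset of an attribute's values as a candidate class center. A short sketch makes the count concrete; the function name and the example domain are illustrative only.

```python
from itertools import combinations
from math import comb

def candidate_centers(values):
    """Yield every non-empty subset of an attribute's value domain."""
    ordered = sorted(values)
    for size in range(1, len(ordered) + 1):
        for subset in combinations(ordered, size):
            yield frozenset(subset)

domain = {"a", "b", "c", "d"}
count = sum(1 for _ in candidate_centers(domain))
by_formula = sum(comb(len(domain), u) for u in range(1, len(domain) + 1))
print(count, by_formula, 2 ** len(domain) - 1)
```

Each additional attribute value doubles the number of candidates (plus one), which is why a global search becomes impractical for attributes with many values.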
When the matrix object data has too many attribute values, the global class-center update requires too much computation and takes too long, so a heuristic class-center update algorithm is provided.
As a further improvement of the above scheme, formula 6 is analyzed as follows: to make […] minimal, […] must be maximal. Assume that in class l the value domain of attribute $A_j$ is […]. Compute […] and sort in descending order. If $Q_{lj}$ takes $r_j$ values, there are 3 cases:
1) when $r_j = 1$: if […] for $t = 2, \ldots, r'_j$, where t indexes the attribute values, then […]. If there is more than one maximum, one of them is chosen at random with the randperm function in MATLAB;
2) when $r_j = r'_j$, all values are taken as the class center;
3) when $r_j \ne r'_j$, there are the following 3 cases:
③ if […], then the first $(r_j - p' - 1)$ attribute values are selected as part of the class center, denoted $Q'$, and $p' + 1$ of the next $p' + p + 1$ attribute values are selected as the remainder. $Q_j$ is the set of all combinations of the remaining attribute values, and $R_t$ denotes a combination in $Q_j$ together with its frequency. If […], $R_s \ne R_t$, then $Q_{lj} = R_s \cup Q'$. With the $r_j$ values determined, let […]. Based on this heuristic update algorithm, the class center on attribute $A_j$ is obtained; the class centers on the other attributes can be found in the same way.
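The idea behind the heuristic (score each attribute value, sort in descending order, keep the best ones, and break ties at random) can be sketched as below. This is a simplified stand-in, not the three-case rule of the patent, whose conditions are not available in this text. The scoring by membership-weighted relative frequency, the parameter `r` and all names are assumptions made for illustration; `random.shuffle` plays the role that randperm plays in the MATLAB program.

```python
import random
from collections import defaultdict

def weighted_value_scores(objects, memberships, beta, attribute):
    """Score each value by the sum of w**beta times its relative frequency."""
    scores = defaultdict(float)
    for obj, w in zip(objects, memberships):
        values = obj[attribute]
        for v in values:
            scores[v] += (w ** beta) / len(values)
    return dict(scores)

def top_r_center(scores, r, rng=random):
    """Keep the r highest-scoring values; equal scores are ordered at random."""
    items = list(scores.items())
    rng.shuffle(items)  # random order first, so ties are broken randomly
    items.sort(key=lambda kv: kv[1], reverse=True)  # stable sort keeps tie order
    return {value for value, _ in items[:r]}

objects = [{"brand": ["a", "a", "b"]}, {"brand": ["a", "c"]}]
memberships = [0.9, 0.2]  # hypothetical membership degrees to one class
scores = weighted_value_scores(objects, memberships, beta=1.1, attribute="brand")
print(top_r_center(scores, 1), top_r_center(scores, 2))
```

Sorting once costs far less than traversing all subsets, which is the trade-off the heuristic makes against the global update.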
The invention has the beneficial effects that:
Compared with existing clustering techniques for classification-type data, the core algorithm has the following notable advantage: it can cluster multiple records per object, so that customers' consumption preferences can be found more easily and more targeted recommendations can be made.
Drawings
The following detailed description of embodiments of the invention is provided in conjunction with the appended drawings, in which:
Fig. 1 shows the relationship between β and w on 4 data sets: Fig. 1.1 on the Market Basket data set, Fig. 1.2 on the Microsoft Web data set, Fig. 1.3 on the Musk data set, and Fig. 1.4 on the MovieLens data set.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings,
and comprises the following steps:
step 1.1, five real data sets are selected: Market Basket data, Microsoft Web data, Musk data, MovieLens data and the Alibaba data set. The Market Basket data records the transactions of 1001 customers, each record being described by 4 attributes: user ID, transaction time, product name and product ID. The Microsoft Web data comes from the UCI repository and records the web pages browsed by 32711 anonymous users during one week in January 1998, each user being described by 2 attributes: user ID and web page ID. The Musk data also comes from the UCI repository and comprises 92 objects, each described by 167 attributes. The MovieLens data was downloaded from the MovieLens website; only the rating data is used, which records 1000209 ratings of 3900 movies by 6040 viewers, each record being described by 4 attributes: user ID, movie ID, user rating and rating submission time. The Alibaba data set contains 182880 records of 884 users browsing certain brands and is also described by 4 attributes. All 5 data sets are matrix object data sets. To improve the clustering effect, the invention preprocesses the attributes of each data set accordingly; the preprocessed data is shown in Table 2.
Step 1.2, defining the evaluation criteria. The effectiveness of the proposed algorithm is evaluated with 5 indexes: accuracy (AC), precision (PR), recall (RE), adjusted Rand index (ARI) and normalized mutual information (NMI). AC is the proportion of correctly classified objects; PR indicates how many of the samples predicted to be positive are truly positive; RE indicates how many of the positive samples are predicted correctly; ARI and NMI measure the degree of agreement between two data distributions. The larger the values of AC, PR, RE, ARI and NMI, the closer the clustering result is to the true partition of the data set and the better the clustering effect. The specific calculation formulas are given as formulas 10-14;
where […] and […] are the numbers of objects in $C_i$ and $P_j$ respectively, $n_{ij} = |C_i \cap P_j|$, and […] denotes the number of objects correctly assigned to class i. The closer the clustering result is to the true partition of the data set, the larger the values of AC, PR, RE, ARI and NMI, with a maximum value of 1. In other words, the larger the experimental result on the 5 evaluation criteria, the better the clustering quality and the more effective the algorithm.
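As an illustration of how such indexes are computed from the counts $n_{ij} = |C_i \cap P_j|$, the sketch below implements the adjusted Rand index in its standard textbook form. Formulas 10-14 are not available in this text, so this is not necessarily identical to the patent's formulation; the function names and label lists are hypothetical, and at least two objects are assumed.

```python
from collections import Counter
from math import comb

def contingency(pred, true):
    """n_ij: number of objects in predicted cluster i and true class j."""
    return Counter(zip(pred, true))

def adjusted_rand_index(pred, true):
    """Standard adjusted Rand index; assumes len(pred) == len(true) >= 2."""
    n = len(pred)
    table = contingency(pred, true)
    sum_cells = sum(comb(c, 2) for c in table.values())
    sum_rows = sum(comb(c, 2) for c in Counter(pred).values())
    sum_cols = sum(comb(c, 2) for c in Counter(true).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:  # degenerate partitions, e.g. one cluster on both sides
        return 1.0
    return (sum_cells - expected) / (maximum - expected)

print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

The index depends only on which objects are grouped together, not on the label names, so a clustering that matches the true partition up to relabeling scores 1.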
In formula (2), $X_{is}$ denotes the value of matrix object $X_i$ on attribute $A_s$, and 1/2 is a normalization factor because $0 \le \delta(X_{is}, X_{js}) \le 2$: under the first condition […], $\delta(X_{is}, X_{js}) = 0$; under the second condition […], $\delta(X_{is}, X_{js}) = 2$.
Each object is assigned to the class of the nearest initial point to obtain k clusters;
step 3.1, analyze formula 5: if $Q_l$ minimizes formula 5, then $Q_l$ is a class center of X, where β is the fuzzy factor and w is the membership degree; the analysis is shown in formula 6,
Here, […] is referred to as the weight of the attribute value […] of $X_i$. Finding the class center $Q_l$ means minimizing $D(X, Q_l)$; by Definition 1, it suffices to minimize […]. Let the number of values of attribute $A_s$ be $|V_s|$. The number of attribute values in the class center $Q_{ls}$ can then only range from 1 to $|V_s|$. Selecting $u_s$ of the $|V_s|$ values as the attribute values of $Q_{ls}$ gives $\binom{|V_s|}{u_s}$ cases, and $u_s$ also ranges from 1 to $|V_s|$. By the binomial theorem, the class center on attribute $A_s$ can therefore be chosen in $\sum_{u_s=1}^{|V_s|} \binom{|V_s|}{u_s} = 2^{|V_s|} - 1$ ways. Each case must be traversed and evaluated, and the results are then arranged in descending order to update the class center on attribute $A_s$. The class centers on the other attributes are updated in the same way.
The time complexity of the global class-center update algorithm is $O(nmtk \times 2^{|V'|})$, where n denotes the number of objects, m the number of attributes, k the number of classes, t the number of iterations, and $|V'| = \max\{|V_s| : 1 \le s \le m\}$. The time complexity of the global update therefore grows linearly with the number of objects, attributes, classes and iterations, and exponentially with the number of attribute values.
When the matrix object data has too many attribute values, the global class-center update requires too much computation and takes too long, so a heuristic class-center update algorithm is provided herein. Formula 6 is analyzed as follows: to make […] minimal, […] must be maximal. Assume that in class l the value domain of attribute $A_j$ is […]. Compute […] and sort in descending order. If $Q_{lj}$ takes $r_j$ values, there are 3 cases:
1) when $r_j = 1$: if […] for $t = 2, \ldots, r'_j$, where t indexes the attribute values, then […]. If there is more than one maximum, one of them is chosen at random with the randperm function in MATLAB;
2) when $r_j = r'_j$, all values are taken as the class center;
3) when $r_j \ne r'_j$, there are the following 3 cases:
③ if […], then the first $(r_j - p' - 1)$ attribute values are selected as part of the class center, denoted $Q'$, and $p' + 1$ of the next $p' + p + 1$ attribute values are selected as the remainder. $Q_j$ is the set of all combinations of the remaining attribute values, and $R_t$ denotes a combination in $Q_j$ together with its frequency. If […], $R_s \ne R_t$, then $Q_{lj} = R_s \cup Q'$. With the $r_j$ values determined, let […]. Based on this heuristic update algorithm, the class center on attribute $A_j$ is obtained; the class centers on the other attributes can be found in the same way.
Step 4.1, evaluating the effectiveness of the heuristic center update algorithm. The class centers are updated with the heuristic algorithm (HAMF) and the global algorithm (GAMF) respectively, and the experimental results and running times are compared. Taking the Market Basket data as an example, each algorithm is run 10 times; the results are shown in Tables 3 and 4.
As can be seen from Tables 3 and 4, updating the class centers with the global algorithm gives a better clustering effect than the heuristic update algorithm, but takes as long as 96 h. The heuristic update algorithm achieves a similar clustering effect in only 160 s, so the heuristic update algorithm provided by the invention is the more effective choice when clustering with the MD fuzzy k-modes algorithm.
Step 4.2, comparing the MD fuzzy k-modes algorithm with other algorithms. Four algorithms, SV-k-modes, k-mw-modes, fuzzy k-modes and fuzzy SV-k-modes, are compared with the MD fuzzy k-modes algorithm. Before comparison, the data must be converted into single-valued attribute form for the algorithms that handle only single-valued data, and into set-valued form for the SV-k-modes and fuzzy SV-k-modes algorithms. When comparing with the SV-k-modes and k-mw-modes algorithms, which do not contain the fuzzy factor β, the invention sets the value of β in the MD fuzzy k-modes algorithm to 1.1. The comparison results are given in Tables 5 and 6.
Step 4.3, analyzing the relation between β and w. Because the value of β directly affects the membership degree of a matrix object to each class, it is necessary to analyze the relation between the fuzzy factor β and the membership degree w. Because the data sets contain too many objects, the invention takes only the first 10 objects as research objects. After 30 runs of the experiment, the results on the 4 data sets (Market Basket data, Microsoft Web data, Musk data and MovieLens data) are shown in Fig. 1, where a red 'o' indicates that the matrix object is assigned to the 1st class, a blue '★' to the 2nd class, a green '□' to the 3rd class, and a purple '+' to the 4th class.
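The way β shapes the membership degrees can be illustrated with the membership update commonly used in fuzzy k-modes clustering. This is an assumption: the membership formula of the patent is not available in this text, so the sketch uses the standard form $w_l = 1 / \sum_h (d_l / d_h)^{1/(\beta - 1)}$ with β > 1. The function name and the distances are hypothetical.

```python
def memberships(distances, beta):
    """Membership of one object to each of k centers, given its distances (beta > 1)."""
    zero = [i for i, d in enumerate(distances) if d == 0]
    if zero:
        # The object coincides with a center: full membership to the first such center.
        return [1.0 if i == zero[0] else 0.0 for i in range(len(distances))]
    exponent = 1.0 / (beta - 1.0)
    return [
        1.0 / sum((d_l / d_h) ** exponent for d_h in distances)
        for d_l in distances
    ]

distances = [0.2, 0.4, 0.8]  # hypothetical distances to 3 class centers
for beta in (1.1, 1.5, 2.0, 3.0):
    print(beta, [round(w, 3) for w in memberships(distances, beta)])
```

With this form, a β close to 1 gives nearly hard assignments (the nearest center receives almost all of the membership), while larger β spreads the membership more evenly across classes.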
Step 4.4, the five real data sets in Table 2 (Market Basket data, Microsoft Web data, Musk data, MovieLens data and Alibaba data) are selected, and the clustering effect of the algorithm is compared with that of the K-modes and WK-modes algorithms to verify its effectiveness. The comparison results are shown in Table 7. It can be seen that the algorithm of the invention is clearly superior to the K-modes and WK-modes algorithms.
Through the above steps, the effectiveness of the MD fuzzy k-modes algorithm in terms of clustering effect is verified, and the relation between the fuzzy factor β and the membership degree w is analyzed.
TABLE 1 Taobao true data set
TABLE 2 preprocessed data set
TABLE 3 comparison of results for updating class centers with GAMF and HAMF in the MD fuzzy k-modes algorithm
TABLE 4 run time for updating class centers with GAMF and HAMF in the MD fuzzy k-modes algorithm
Table 5 comparison of 3 algorithms on 5 data sets
Table 6 comparison of 3 algorithms for different parameter values on 5 data sets
TABLE 6.1 comparison of 3 algorithms on Market Basket data set
TABLE 6.2 comparison of 3 algorithms on Microsoft Web data set
TABLE 6.3 comparison of 3 algorithms on Musk data set
Table 6.4 comparison of 3 algorithms on MovieLens data set
Table 7 clustering effect comparison of 3 algorithms on 5 data sets
Claims (6)
1. An effective method for processing classified data in real life, comprising the following steps:
step 1, randomly selecting k initial points from a data set X containing n samples, where k is the number of classes in X;
step 2, calculating the distance between each object and the k initial points, and assigning each object to the class of the nearest initial point to obtain k clusters;
step 3, calculating the membership degree of each object to the k centers, and updating the class centers of the k clusters with a heuristic update algorithm, the class centers being denoted $w_k$;
step 4, repeating step 2 and step 3 until the class centers $w_k$ no longer change.
2. The effective method for processing classified data in real life according to claim 1, characterized in that the distance $d(X_i, X_j)$ from each object to the k initial points in step 2 is calculated as follows:
3. The effective method for processing classified data in real life according to claim 1, characterized in that in step 3 the membership degree of each object to the k centers is calculated and the class centers $w_k$ of the k clusters are updated with the heuristic update algorithm as follows:
4. The effective method for processing classified data in real life according to claim 1, characterized in that the heuristic update algorithm in step 3 comprises the following specific steps:
step 3.1, analyze formula 5: if $Q_l$ minimizes formula 5, then $Q_l$ is a class center of X,
where β is the fuzzy factor and w is the membership degree.
5. The effective method for processing classified data in real life according to claim 4, characterized in that in step 3.1 formula 5 is analyzed as follows: if $Q_l$ minimizes formula 5, then $Q_l$ is a class center of X, where β is the fuzzy factor and w is the membership degree; the analysis is shown in formula 6,
6. The effective method for processing classified data in real life according to claim 5, characterized in that formula 6 is analyzed as follows: to make […] minimal, […] must be maximal. Assume that in class l the value domain of attribute $A_j$ is […]. Compute […] and sort in descending order. If $Q_{lj}$ takes $r_j$ values, there are 3 cases:
1) when $r_j = 1$: if […] for $t = 2, \ldots, r'_j$, where t indexes the attribute values, then […]. If there is more than one maximum, one of them is chosen at random with the randperm function in a MATLAB program;
2) when $r_j = r'_j$, all values are taken as the class center;
3) when $r_j \ne r'_j$, there are the following 3 cases:
③ if […], then the first $(r_j - p' - 1)$ attribute values are selected as part of the class center, denoted $Q'$, and $p' + 1$ of the next $p' + p + 1$ attribute values are selected as the remainder. $Q_j$ is the set of all combinations of the remaining attribute values, and $R_t$ denotes a combination in $Q_j$ together with its frequency. If […], $R_s \ne R_t$, then $Q_{lj} = R_s \cup Q'$. With the $r_j$ values determined, let […]. Based on this heuristic update algorithm, the class center on attribute $A_j$ is obtained; the class centers on the other attributes can be found in the same way.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910935269.1A CN111160382A (en) | 2019-09-29 | 2019-09-29 | Effective method for processing classified data in real life |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910935269.1A CN111160382A (en) | 2019-09-29 | 2019-09-29 | Effective method for processing classified data in real life |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111160382A true CN111160382A (en) | 2020-05-15 |
Family
ID=70555617
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910935269.1A Pending CN111160382A (en) | 2019-09-29 | 2019-09-29 | Effective method for processing classified data in real life |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111160382A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107122793A (en) * | 2017-03-23 | 2017-09-01 | Beihang University | An improved global optimization k-modes clustering method |
CN110083665A (en) * | 2019-05-05 | 2019-08-02 | Guizhou Normal University | Data classification method based on improved local outlier factor detection |
Non-Patent Citations (1)
Title |
---|
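Li Shunyong et al.: "MD fuzzy k-modes clustering algorithm based on classification-type matrix object data", Journal of Computer Research and Development * |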
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200515 |