CN111309777A - Report data mining method for improving association rule based on mutual exclusion expression - Google Patents
Info
- Publication number
- CN111309777A (application number CN202010050602.3A)
- Authority
- CN
- China
- Prior art keywords
- data
- item
- item set
- frequent
- sets
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2216/00—Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
- G06F2216/03—Data mining
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Fuzzy Systems (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A report data mining method with improved association rules based on mutual exclusion expression relates to knowledge discovery and data mining in the field of data science, and solves the problems of large memory consumption and low efficiency when a traditional association rule algorithm processes massive data. The method comprises the following steps: first, convert the data into transaction data based on a data threshold range and obtain a binary sparse matrix with grouping labels based on the data logic; second, obtain the collection of all frequent 1-item sets and remove the non-frequent item sets to obtain a new grouping result; third, perform a self-join iterative search on the frequent item sets, prune the candidate item sets, and iterate until no new frequent item set can be generated, thereby obtaining the association rule mining result. The basic idea of the invention is to convert structured data into transaction data, generate groups based on the mutual exclusion relationship, and then perform rule mining, thereby reducing the computation memory and improving the computation efficiency. The method has wide application scenarios and high social and economic value.
Description
Technical Field
The invention relates to a knowledge discovery and data mining method in the field of data science, in particular to an improved association rule report data mining method based on mutual exclusion expression.
Background
With the explosive growth of data volume in the information age, people have found that enormous value hides behind massive data. For structured data, traditional or modern data mining methods can already obtain good results. However, in association rule mining the structured data report must first be converted into transaction data (a data set that only indicates whether an event occurs, represented by the Boolean values True and False); as the data scale keeps growing, data mining methods that can extract valuable information face huge challenges on massive data. At present, dividing a structured data report into transaction data based on thresholds greatly increases the data dimensionality, the memory and computing power of a computer cannot meet the computing requirement, the computing efficiency is low, and the time cost of data mining rises sharply. The method provided by this patent reasonably reduces, based on mutual exclusion expression, the high-dimensional data obtained after the structured data report is converted into transaction data, so that the computation cost of the traditional association rule algorithm is reduced and the efficiency of association rule data mining is improved.
Data mining and analysis with traditional association rule algorithms has already produced research results at home and abroad, but for structured data reports converted into transaction data by threshold partitioning, the partitioning generates many data items with mutual exclusion relationships; the traditional algorithms are then inefficient and easily run out of computation memory, and improvement methods for this situation are still at an exploratory stage. Association rules are rules that reflect the interdependency and association between one thing and another, and are used to mine correlations between valuable data items from a large amount of data. The main existing association rule mining methods can be divided into three categories: methods based on Apriori, methods based on FP-growth, and methods based on Eclat. The Apriori algorithm uses two properties of frequent item sets to filter out a large number of irrelevant sets, thereby improving computation efficiency. The FP-growth algorithm compresses the data records by constructing a tree structure, so the records only need to be scanned twice when mining frequent item sets and no candidate item sets need to be generated, which improves computation efficiency. The Eclat algorithm introduces an inverted-index idea to convert the original horizontal database structure (items per transaction) into a vertical database structure, in which the items in the data are used as keys and the transaction IDs corresponding to each item are used as values; this generates an inverted table, speeds up the generation of frequent item sets, and, with depth-first search as the strategy, computes the support by intersection counting, improving computation efficiency.
The existing association rule algorithms are mining methods designed for small-scale data; when the data to be processed is converted from structured data into transaction data with mutual exclusion relationships, low computation efficiency and insufficient memory often result. In addition, improving the existing Apriori algorithm can optimize the efficiency of mining frequent item sets with association rules and thus reduce the time cost of the algorithm. Such improved methods can help people analyze association rules on big data, improve computation efficiency, reduce time cost, and obtain association rule analysis results more quickly.
Disclosure of Invention
Because the scale of the data mined with association rules keeps growing, the traditional association rule analysis and mining methods struggle with the insufficient computation memory and excessive computation time caused by structured data reports converted into transaction data by threshold partitioning. The purpose of the invention is to solve the low computation efficiency caused by too many mutually exclusive data items when the existing association rule mining methods process a structured data report converted into transaction data by threshold partitioning, and to provide an improved association rule mining method based on mutual exclusion expression, which can improve the computation efficiency and mine the implicit association relations among data indexes.
The purpose of the invention is achieved by the following technical scheme: first, converting the data to be processed into transaction data based on the logical relationship among the data and the data threshold range, analyzing the data information to obtain the data scale and matrix density, and grouping the data based on mathematical logic; then, scanning the data set for the first time, calculating the support value of each item, obtaining all frequent 1-item sets, and grouping the frequent item sets according to the grouping result of the previous step; finally, performing an iterative search on each group of data to obtain the frequent item sets, pruning the candidate sets obtained by the search, and iterating until no new frequent item set can be generated, thereby obtaining the association rule mining result.
The flow chart of the invention is shown in figure 1, and the specific steps are as follows:
Step one: converting the data to be processed into transaction data based on the logical relationship among the data and the data threshold range, analyzing the data information to obtain the data scale and matrix density, and grouping the item sets with mutual exclusion relationships in the data based on mathematical logic to obtain a binary sparse matrix with grouping labels, wherein the specific steps are as follows:
1) converting the data set to be mined into transaction data based on thresholds on the data values and recording it as the data set D; let I_n = {r_{n1}, r_{n2}, ..., r_{nm}} be a collection of m different items, where each r_{nκ} is called an item and the collection I_n is called an item set; the number of elements is called the length of the item set, and an item set of length k is called a "k-item set"; the data set D, containing n item sets, can then be expressed as:
D = {I_1, I_2, ..., I_n}
after conversion into transaction data, the value of any item r_{nκ} exists in only two cases: r_{nκ} = 1 if the corresponding event occurs (True), and r_{nκ} = 0 if it does not (False);
2) by transforming the data set D, a binary sparse matrix U can be obtained, where U can be expressed as:
U = [r_{11} r_{12} ... r_{1m}; r_{21} r_{22} ... r_{2m}; ...; r_{n1} r_{n2} ... r_{nm}]
where the first subscript n of r_{nm} denotes the corresponding n-th item set I_n and the second subscript m denotes the m-th element of the item set I_n; the binary sparse matrix U contains only the Boolean values 1 and 0, corresponding respectively to True and False in the transaction data;
3) performing a logical analysis on the item sets I_j in the data set D; if there are item sets with a mutual exclusion relationship, dividing them into one group, denoted Q_t:
Q_t = {I_a, I_b, ..., I_n}
note that the number of groups Q_t is determined by the number of mutually exclusive item sets in the data set D, and each item set exists in only one group Q_t, i.e., Q_i ∩ Q_j = ∅ for i ≠ j;
4) further, a binary sparse matrix U with a grouping label is obtained, which can be represented by Q as:
U = [Q_1 Q_2 ... Q_t]
wherein, U is a binary sparse matrix, i.e. a matrix only containing 0 and 1, and is composed of t grouping matrices Q.
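As a minimal illustration of step one, the following R sketch discretizes each numeric index into low/medium/high by tertiles, expands it into three mutually exclusive Boolean items, and records each triple as one group Q_t. The function name, the tertile thresholds and the column layout are assumptions made for illustration and are not part of the patent text:

```r
# Sketch only: names and tertile thresholds are illustrative assumptions.
to_boolean_groups <- function(df) {
  items  <- list()   # Boolean columns of the binary sparse matrix U
  groups <- list()   # mutually exclusive groups Q_1, ..., Q_t
  for (col in names(df)) {
    q   <- quantile(df[[col]], probs = c(1/3, 2/3), na.rm = TRUE)
    lev <- cut(df[[col]], breaks = c(-Inf, q[1], q[2], Inf),
               labels = c("low", "medium", "high"))
    for (lab in levels(lev)) {
      items[[paste(col, lab, sep = "_")]] <- (lev == lab)   # TRUE / FALSE item
    }
    groups[[col]] <- paste(col, levels(lev), sep = "_")     # one group Q_t per original index
  }
  list(U = as.data.frame(items, check.names = FALSE), Q = groups)
}
# Example call on the numeric indexes only (assumed layout: first column holds the school name):
# prep <- to_boolean_groups(report[ , -1]); U <- prep$U; Q <- prep$Q
```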
Step two: scanning the data set for the first time, calculating the support, confidence and lift of each item, obtaining the collection of all frequent 1-item sets, and removing the non-frequent item sets from the groups obtained in step one based on the mutual exclusion relationship so as to obtain a new grouping result, wherein the specific steps are as follows:
1) traversing the data set D and calculating, for each item set I_n, the number of records in which r_{nκ} = 1; based on a given minimum support S_0, if S ≥ S_0, the item set I_n is a frequent item set;
2) respectively calculating the support (Support), confidence (Confidence) and lift (Lift) of the generated frequent item sets:
Support(I_a → I_b) = Support(I_a ∪ I_b) = P(I_a ∪ I_b)
Confidence(I_a → I_b) = P(I_b | I_a) = Support(I_a ∪ I_b) / Support(I_a)
Lift(I_a → I_b) = Confidence(I_a → I_b) / Support(I_b)
where (I_a → I_b) represents only the correlation between the item set I_a and the item set I_b and has no meaning of mathematical calculation; the support is the number of occurrences of the corresponding item set divided by the total number of records; Support(I_a → I_b) is the percentage of transactions in the database D that contain I_a ∪ I_b, denoted Support(I_a ∪ I_b), I_a, I_b ∈ I; the significance of the confidence is the ratio of the number of simultaneous occurrences of the item sets I_a, I_b to the number of occurrences of the item set I_a, i.e., the probability that I_b occurs under the condition that I_a occurs; the significance of the lift is to measure the independence of the item set {I_a} and the item set {I_b}, reflecting how much the occurrence of the item set {I_a} changes the probability of occurrence of the item set {I_b}; in general, if the value is 1, the two are not correlated; if the value is less than 1, the two are mutually repulsive; and when the lift is greater than 1, the greater the lift, the higher the value of the association rule;
3) after the traversal, based on the minimum support threshold S_0, obtaining the collection of frequent 1-item sets, i.e., L_1 = {I_n | Support(I_n) ≥ S_0};
4) for the remaining non-frequent item sets L_1′, removing them from the groups Q obtained in step one based on the mutual exclusion relationship, so as to obtain a new grouping Q′ that excludes the non-frequent item sets:
Q′ = Q − L_1′
Q′ is the grouping, based on the mutual exclusion relationship, obtained after the frequent item sets are found for the first time, and it is used for extracting the candidate set of "2-item sets".
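A minimal R sketch of this first scan is given below, continuing the U (Boolean data frame) and Q (list of mutually exclusive groups) assumed in the sketch above; S0 and the helper names are illustrative, not fixed by the patent:

```r
# Sketch only: U and Q come from the preprocessing sketch above.
first_scan <- function(U, Q, S0 = 0.1) {
  support <- colMeans(U)                              # support = occurrences / total records
  L1      <- names(support)[support >= S0]            # frequent 1-item sets
  Qp      <- lapply(Q, function(g) intersect(g, L1))  # Q' = Q - L1' (drop non-frequent items)
  Qp      <- Qp[lengths(Qp) > 0]
  list(L1 = L1, Q_prime = Qp, support = support)
}

# Support, confidence and lift of a single rule Ia -> Ib over U:
rule_metrics <- function(U, a, b) {
  sAB <- mean(U[[a]] & U[[b]])                        # Support(Ia -> Ib) = P(Ia ∪ Ib)
  sA  <- mean(U[[a]])
  sB  <- mean(U[[b]])
  c(support = sAB, confidence = sAB / sA, lift = sAB / (sA * sB))
}
```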
Step three: obtaining the association rule mining result: extracting item sets from Q′ to generate the candidate set C_2 of "2-item sets", pruning it to obtain the frequent item set L_2, then performing a self-join iterative search on the frequent item sets starting from L_2, pruning the candidate sets obtained by the search, and iterating until no new frequent item set can be generated; the confidence and lift of all frequent item sets are then calculated to obtain the data index item sets with implicit association relations. The specific steps are as follows:
1) by permutation and combination, extracting two item sets in turn from Q′ to generate the candidate set C_2 of "2-item sets";
2) to obtain L_2, pruning the candidate item set C_2, i.e., computing the support of each candidate in C_2 and retaining those meeting the minimum support, thereby obtaining the frequent item set L_2;
3) to obtain L_κ (κ ≥ 3), joining L_{κ-1} with L_{κ-1} to produce the candidate set C_κ of "κ-item sets", written as:
L_κ = {l_1, l_2, ..., l_n}, l_i, l_j (1 ≤ i, j ≤ n), l_i, l_j ∈ L_κ
where l_i = {l_i(1), l_i(2), ..., l_i(m)}, and l_i(κ) (1 ≤ κ ≤ m) ∈ l_i is the κ-th item of l_i; the join L_{κ-1} ∞ L_{κ-1} is performed, where the symbol "∞" indicates that the two item sets are self-joined, i.e., different items are extracted from each item set in turn to generate a new item set; L_{κ-1} is joinable if their first (κ-2) items are the same, and joining l_1 and l_2 yields the item set {l_1(1), l_1(2), ..., l_1(κ-1), l_2(κ-1)};
4) C_κ is a superset of L_κ; in the same way, the support of each candidate item set in C_κ is calculated to obtain the frequent item set L_κ; if any (κ-1)-item subset of a candidate is not in L_{κ-1}, the new candidate cannot be a frequent item set either, so that item can be removed from C_κ;
5) the iteration is repeated to obtain the frequent item sets L_κ until L_κ = ∅ (no new frequent item set can be generated); the iteration then ends, the confidence and lift results of the frequent item sets from 2 to κ are obtained, and the data index item sets with implicit association relations are obtained.
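Under the same assumptions (U and the pruned grouping Q′ from the sketches above), the iterative search of step three can be sketched in R as follows; the helpers are illustrative, the self-join is the plain lexicographic join described above, and candidates whose items fall in one mutually exclusive group are skipped because such items can never co-occur:

```r
# Sketch only: item sets are sorted character vectors of column names of U;
# every item is assumed to belong to exactly one mutually exclusive group.
group_of <- function(item, Q) {
  for (g in names(Q)) if (item %in% Q[[g]]) return(g)
  NA_character_
}

support_of <- function(U, itemset) mean(Reduce(`&`, U[itemset]))

gen_candidates <- function(Lk, Q) {              # self-join of the current frequent item sets
  out <- list()
  n <- length(Lk)
  if (n < 2) return(out)
  for (i in 1:(n - 1)) for (j in (i + 1):n) {
    a <- Lk[[i]]; b <- Lk[[j]]; k <- length(a)
    if (identical(a[-k], b[-k])) {               # join only if all but the last item agree
      cand <- sort(union(a, b))
      grps <- sapply(cand, group_of, Q = Q)
      if (!anyDuplicated(grps)) out[[length(out) + 1]] <- cand  # skip mutually exclusive combinations
    }
  }
  unique(out)
}

mine_frequent <- function(U, Q, S0 = 0.1) {
  L <- as.list(names(U)[colMeans(U) >= S0])      # L1 as single-item sets
  all_frequent <- L
  repeat {
    C <- gen_candidates(L, Q)
    L <- Filter(function(s) support_of(U, s) >= S0, C)   # keep candidates meeting S0
    if (length(L) == 0) break                    # L_k empty: no new frequent item set
    all_frequent <- c(all_frequent, L)
  }
  all_frequent
}
```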
Compared with the prior art, the invention has the following advantages:
the invention adopts an association rule mining method, and converts numerical data into transaction data by dividing a data set based on a threshold value, thereby facilitating the development of association rule analysis and mining work and expanding the data range which can be analyzed by the association rule; the characteristic of the structured data with the mutually exclusive items is used for grouping the mutually exclusive item sets in the data preprocessing process, so that the data storage structure in the database is optimized, and the traversal efficiency is improved.
Meanwhile, the mining of association rules in structured data with a large number of mutual exclusion relationships is improved on the basis of the Apriori algorithm. In the first traversal of the frequent item sets, the method groups the data by mutually exclusive item sets, which reduces the computation needed to delete non-frequent item sets and effectively reduces the computation storage space; in the repeated iterations that search for frequent item sets and eliminate non-frequent item sets, the grouping by mutually exclusive item sets greatly reduces the generation of non-frequent item sets, so the tedious step of deleting non-frequent item sets is avoided, the computation efficiency of the algorithm is effectively improved, and a large amount of time cost is saved.
At present, some existing association rule mining methods are only suitable for analyzing and mining a small amount of transaction data; for massive transaction data, their low computing efficiency requires a large amount of time. For massive structured data with mutually exclusive item sets, the present method performs association rule mining and analysis with an improved algorithm that iteratively searches within groups of mutually exclusive item sets, thereby improving iterative computation efficiency, optimizing the memory space required for computation, greatly reducing the time cost, and quickly and efficiently meeting big data researchers' needs for association rule mining and analysis.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is an exemplary diagram of essential information of an embodiment data set.
FIG. 3 is an example diagram of the top 20 items by support.
FIG. 4 is an example scatter plot of support, confidence and lift.
FIG. 5 is an example parallel coordinates (paracoord) diagram of the association rules.
Detailed Description
The following describes the specific implementation of the invention with reference to college teaching quality data:
the college teaching quality data are structured data, and multi-dimensional data of college teaching quality are included. The implicit relevance exists among the multidimensional indexes such as the number scale of students, the quality of teacher teams, school handling conditions, school fund prizes and the like, and the association rule mining and analysis are needed to be carried out on the teaching quality data.
Execute step one: preprocess the structured data to be processed, shown in Table 1, by uniformly converting the data into transaction data in Boolean form and grouping the data based on the mutual exclusion relationships among the data indexes, as shown in Table 2, to obtain a binary sparse matrix with grouping labels.
For each index, an appropriate threshold is selected for the original structured data in Table 1 and the data are converted into the corresponding transaction data. The original data include 25 core indexes for 997 colleges, and each core index is divided into three levels (low, medium and high) according to thresholds, so the original structured data are expanded into a 997 × 75 binary sparse matrix. After conversion into the transaction data of Table 2, the basic information of the data set is checked by calling the summary function in the R language; the matrix density is 0.33, as shown in FIG. 2. The item support frequency plot can be viewed with the itemFrequencyPlot function in R, and the top-20 items by support are shown in FIG. 3.
Table 1 example table of data to be processed
School | Regular undergraduate students | Equivalent enrolled students | Full-time teachers | Physical fitness pass rate (%) |
Anhui University of Science and Technology | 24242 | 36520 | 1853 | 94.15 |
ZHEJIANG INTERNATIONAL STUDIES University | 24430 | 28531 | 1635 | 91.56 |
Heilongjiang foreign language | 7474 | 7988 | 752 | 69.90 |
Zhejiang Ocean University | 9576 | 10235 | 564 | 76.65 |
QILU NORMAL University | 11144 | 13564 | 1304 | 92.52 |
QINGDAO HUANGHAI University | 7400 | 7954 | 856 | 87.49 |
QILU INSTITUTE OF TECHNOLOGY | 5422 | 6523 | 785 | 88.48 |
JOURNAL OF JIANGXI PUBLIC SECURITY College | 29850 | 31256 | 2365 | 100.00 |
GUANGXI MEDICAL University | 8876 | 9520 | 981 | 65.22 |
GANNAN NORMAL University | 16778 | 18463 | 2028 | 99.90 |
BAOSHAN University | 22173 | 42979 | 1876 | 88.90 |
ZheJiang Chinese Medical University | 7047 | 8456 | 626 | 67.36 |
XINYU University | 4337 | 5236 | 666 | 98.20 |
Jilin University | 41344 | 91261 | 2869 | 77.63 |
CANGZHOU NORMAL University | 14533 | 15623 | 1562 | 82.56 |
NANCHANG INSTITUTE OF SCIENCE & TECHNOLOGY | 13340 | 15354 | 1111 | 98.25 |
GUANGDONG UNIVERSITY OF SCIENCE & TECHNOLOGY | 6283 | 7652 | 628 | 77.36 |
Table 2 transaction data example table
School | Equivalent enrolled students_low | Equivalent enrolled students_medium | Equivalent enrolled students_high | Regular undergraduate students_low |
Anhui University of Science and Technology | FALSE | FALSE | TRUE | FALSE |
ZHEJIANG INTERNATIONAL STUDIES University | FALSE | FALSE | TRUE | FALSE |
Heilongjiang foreign language | TRUE | FALSE | FALSE | TRUE |
Zhejiang Ocean University | FALSE | TRUE | FALSE | FALSE |
QILU NORMAL University | FALSE | TRUE | FALSE | FALSE |
QINGDAO HUANGHAI University | TRUE | FALSE | FALSE | TRUE |
QILU INSTITUTE OF TECHNOLOGY | TRUE | FALSE | FALSE | TRUE |
JOURNAL OF JIANGXI PUBLIC SECURITY College | FALSE | FALSE | TRUE | FALSE |
GUANGXI MEDICAL University | TRUE | FALSE | FALSE | TRUE |
GANNAN NORMAL University | FALSE | TRUE | FALSE | FALSE |
BAOSHAN University | FALSE | FALSE | TRUE | FALSE |
ZheJiang Chinese Medical University | TRUE | FALSE | FALSE | TRUE |
XINYU University | TRUE | FALSE | FALSE | TRUE |
Jilin University | FALSE | FALSE | TRUE | FALSE |
CANGZHOU NORMAL University | FALSE | TRUE | FALSE | FALSE |
NANCHANG INSTITUTE OF SCIENCE & TECHNOLOGY | FALSE | TRUE | FALSE | FALSE |
GUANGDONG UNIVERSITY OF SCIENCE & TECHNOLOGY | TRUE | FALSE | FALSE | TRUE |
The transaction data in Table 2, represented by the Boolean values TRUE and FALSE (i.e., 1 and 0), yield a binary sparse matrix. Because each index of the original structured data is expanded into three indexes (low, medium and high) that have an obvious mutual exclusion relationship, the low, medium and high indexes derived from each original index are placed in one group of the mutually exclusive item set Q. For example, equivalent enrolled students_low, equivalent enrolled students_medium and equivalent enrolled students_high have an obvious mutual exclusion relationship and are therefore placed in one group. Because there are 25 core indexes, the mutually exclusive item set Q contains 25 groups, each with 3 indexes. A binary sparse matrix with grouping labels is thus obtained, finishing the data preprocessing of step one.
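The checks described above (data-set summary and the top-20 item frequency plot) can be reproduced with the arules package in R; the following minimal sketch assumes the 997 × 75 Boolean table from step one is held in the data frame U, and the object names are illustrative:

```r
# Sketch only: U is assumed to be the 997 x 75 Boolean table from step one.
library(arules)
trans <- as(as.matrix(U), "transactions")   # Boolean matrix -> transaction data
summary(trans)                              # data scale and matrix density (about 0.33 here)
itemFrequencyPlot(trans, topN = 20)         # top-20 items by support (FIG. 3)
```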
Execute step two: scan the data set for the first time, calculate the support value of each item, obtain all frequent 1-item sets, and group the frequent item sets according to the grouping labels.
By scanning the obtained binary sparse matrix data set for the first time, the record count of each index can be obtained. Given the minimum support, the supports of the 75 indexes are calculated, and the indexes meeting the minimum support threshold form the frequent 1-item sets, denoted L_1. The support, confidence and lift of each item set can then be calculated. The support is the number of occurrences of the corresponding item set divided by the total number of records; Support(I_a → I_b) is the percentage of transactions in the database D that contain I_a ∪ I_b; the significance of the confidence is the ratio of the number of simultaneous occurrences of the item sets I_a, I_b to the number of occurrences of the item set I_a, i.e., the probability that I_b occurs under the condition that I_a occurs; the significance of the lift is to measure the independence of the item set {I_a} and the item set {I_b}, reflecting how much the occurrence of {I_a} changes the probability of occurrence of {I_b}. In general, if the value is 1, the two are not correlated; if less than 1, the two are mutually repulsive; and when the lift is greater than 1, the greater the lift, the higher the value of the association rule.
For the remaining non-frequent item sets L_1′, according to the grouping Q obtained in step one based on the mutual exclusion relationship, the non-frequent item sets L_1′ are removed from Q to obtain a new grouping Q′. The significance of this is to remove from the grouping labels the indexes that do not meet the minimum support threshold; because any item set generated later by joining such an index with other indexes still cannot meet the minimum support threshold, removing these indexes at this step effectively improves computation efficiency and releases part of the memory space, thereby optimizing the computer's computation process and saving time cost.
Execute step three: extract item sets from Q′ to generate the candidate set C_2 of "2-item sets", prune it to obtain the frequent item set L_2, then perform a self-join iterative search on the frequent item sets starting from L_2, prune the candidate sets obtained by the search, and iterate until no new frequent item set can be generated, thereby obtaining the association rule mining result.
To generate the candidate set C_2 of "2-item sets", two indexes are extracted, by permutation and combination, from any two of the groups obtained in step two, giving the candidate set C_2 of "2-item sets". For the candidate set C_2, the support is calculated and compared with the minimum support threshold; the item sets that meet the minimum support threshold are added to the frequent item set L_2, and the item sets that do not meet the minimum support threshold are deleted, thereby completing the pruning process.
Then, to obtain L_κ (κ ≥ 3), L_{κ-1} is joined with L_{κ-1} to produce the candidate set C_κ of "κ-item sets", where l_i = {l_i(1), l_i(2), ..., l_i(m)} and l_i(κ) (1 ≤ κ ≤ m) ∈ l_i is the κ-th item of l_i. The join L_{κ-1} ∞ L_{κ-1} is performed; if their first (κ-2) items are the same, L_{κ-1} is joinable, and joining l_1 and l_2 yields the item set {l_1(1), l_1(2), ..., l_1(κ-1), l_2(κ-1)}. C_κ is a superset of L_κ; in the same way, the support of each candidate item set in C_κ is calculated to obtain the frequent item set L_κ. If any (κ-1)-item subset of a candidate is not in L_{κ-1}, the new candidate cannot be a frequent item set either, so that item can be removed from C_κ. The iteration is repeated to obtain the frequent item sets L_κ until L_κ = ∅ (no new frequent item set can be generated); the iteration then ends, and the support, confidence and lift results of each frequent item set are obtained.
Association rules are mined from the data using the improved Apriori function, with the support parameter set to 0.1, the confidence set to 0.8, and the minimum number of items contained in an association rule set to 2; 738679 rules are obtained by mining. When the support is increased to 0.2, mining obtains 31869 rules, as shown in FIG. 2, and when the support is increased to 0.3, mining obtains 3519 rules.
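For reference, the corresponding call with the standard apriori function of the arules package, using the same parameters, is sketched below; the improved Apriori function of this patent is not part of arules, so this is only a baseline illustration with object names assumed from the sketch above:

```r
# Sketch only: baseline arules call with the parameters stated above.
rules <- apriori(trans, parameter = list(support = 0.1, confidence = 0.8, minlen = 2))
length(rules)                                  # number of rules mined
inspect(head(sort(rules, by = "lift"), 10))    # ten strongest rules by lift
```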
The mining result with support 0.2 and confidence 0.8 is selected for analysis. A scatter plot of support, confidence and lift for the mining result is shown in FIG. 4; with support 0.4 and confidence 0.8, the paracoord (parallel coordinates) diagram drawn is shown in FIG. 5, where each polyline represents an association rule and a darker color indicates a higher lift. With support 0.1 and confidence 0.8, a total of 4135 rules have a lift greater than 3; with support 0.2 and confidence 0.8, a total of 53 rules have a lift greater than 3.
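The scatter and paracoord views described above can be drawn with the arulesViz package; a minimal sketch follows, with the rules object assumed from the previous sketch and the filtering chosen only for illustration:

```r
# Sketch only: visualizations of the mined rules with arulesViz.
library(arulesViz)
plot(rules, method = "scatterplot", shading = "lift")             # support/confidence/lift scatter (FIG. 4)
plot(head(sort(rules, by = "lift"), 50), method = "paracoord")    # parallel-coordinates view (FIG. 5)
length(subset(rules, subset = lift > 3))                          # count of rules with lift > 3
```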
The association rule analysis figures generated in this embodiment and the improved association rule report data mining method based on mutual exclusion expression are implemented in the R language, version 3.4.3, but the implementation of this patent is not limited to R; programming implementations of the patented method in other languages and development environments are all within the protection scope of this patent.
Claims (4)
1. A report data mining method based on the improved association rule of mutual exclusion expression is characterized in that the method comprises the following steps:
step one: converting the data to be processed into transaction data based on the logical relationship among the data and the data threshold range, analyzing the data information to obtain the data scale and matrix density, and grouping the item sets with mutual exclusion relationships in the data based on mathematical logic to obtain a binary sparse matrix with grouping labels;
step two: scanning the data set, calculating the support, confidence and lift of each item to obtain the collection of all frequent 1-item sets, and removing the non-frequent item sets from the groups obtained in step one based on the mutual exclusion relationship, so as to obtain a new grouping result;
step three: extracting item sets from the mutually exclusive groups to generate the candidate set C_2 of "2-item sets", pruning it to obtain the frequent item set L_2, then performing a self-join iterative search on the frequent item sets starting from L_2, pruning the candidate sets obtained by the search, and iterating until no new frequent item set can be generated, thereby obtaining the association rule mining result.
2. The method for mining report data based on mutual exclusion expression improved association rules according to claim 1, wherein the step one specifically comprises:
converting data to be processed into transaction data based on a logic relationship between the data and a data threshold range, analyzing data information to obtain data scale and matrix density, and grouping item sets with a mutual exclusion relationship in the data based on mathematical logic to obtain a binary sparse matrix with a grouping label, wherein the specific steps are as follows:
1) converting the data set to be mined into transaction data based on thresholds on the data values and recording it as the data set D; let I_n = {r_{n1}, r_{n2}, ..., r_{nm}} be a collection of m different items, where each r_{nκ} is called an item and the collection I_n is called an item set; the number of elements is called the length of the item set, and an item set of length k is called a "k-item set"; the data set D, containing n item sets, can then be expressed as:
D = {I_1, I_2, ..., I_n}
after conversion into transaction data, the value of any item r_{nκ} exists in only two cases: r_{nκ} = 1 if the corresponding event occurs (True), and r_{nκ} = 0 if it does not (False);
2) by transforming the data set D, a binary sparse matrix U can be obtained, where U can be expressed as:
U = [r_{11} r_{12} ... r_{1m}; r_{21} r_{22} ... r_{2m}; ...; r_{n1} r_{n2} ... r_{nm}]
where the first subscript n of r_{nm} denotes the corresponding n-th item set I_n and the second subscript m denotes the m-th element of the item set I_n; the binary sparse matrix U contains only the Boolean values 1 and 0, corresponding respectively to True and False in the transaction data;
3) performing a logical analysis on the item sets I_j in the data set D; if item sets with a mutual exclusion relationship exist, dividing them into one group, denoted Q_t:
Q_t = {I_a, I_b, ..., I_n}
note that the number of groups Q_t is determined by the number of mutually exclusive item sets in the data set D, and each item set exists in only one group Q_t, i.e., Q_i ∩ Q_j = ∅ for i ≠ j;
4) further, a binary sparse matrix U with a grouping label is obtained, which can be represented by Q as:
U = [Q_1 Q_2 ... Q_t]
wherein, U is a binary sparse matrix, i.e. a matrix only containing 0 and 1, and is composed of t grouping matrices Q.
3. The method for mining report data based on mutual exclusion expression improved association rules according to claim 1, wherein the second step specifically comprises:
scanning the data set for the first time, calculating the support, confidence and lift of each item, obtaining the collection of all frequent 1-item sets, and removing the non-frequent item sets from the groups obtained in step one based on the mutual exclusion relationship to obtain a new grouping result, wherein the specific steps are as follows:
1) traversing the data set D and calculating, for each item set I_n, the number of records in which r_{nκ} = 1; based on a given minimum support S_0, if S ≥ S_0, the item set I_n is a frequent item set;
2) respectively calculating the support (Support), confidence (Confidence) and lift (Lift) of the generated frequent item sets:
Support(I_a → I_b) = Support(I_a ∪ I_b) = P(I_a ∪ I_b)
Confidence(I_a → I_b) = P(I_b | I_a) = Support(I_a ∪ I_b) / Support(I_a)
Lift(I_a → I_b) = Confidence(I_a → I_b) / Support(I_b)
where (I_a → I_b) represents only the correlation between the item set I_a and the item set I_b and has no meaning of mathematical calculation; the support is the number of occurrences of the corresponding item set divided by the total number of records; Support(I_a → I_b) is the percentage of transactions in the database D that contain I_a ∪ I_b, denoted Support(I_a ∪ I_b), I_a, I_b ∈ I; the significance of the confidence is the ratio of the number of simultaneous occurrences of the item sets I_a, I_b to the number of occurrences of the item set I_a, i.e., the probability that I_b occurs under the condition that I_a occurs; the significance of the lift is to measure the independence of the item set {I_a} and the item set {I_b}, reflecting how much the occurrence of the item set {I_a} changes the probability of occurrence of the item set {I_b}; in general, if the value is 1, the two are not correlated; if the value is less than 1, the two are mutually repulsive; and when the lift is greater than 1, the greater the lift, the higher the value of the association rule;
3) after the traversal, based on the minimum support threshold S_0, obtaining the collection of frequent 1-item sets, i.e., L_1 = {I_n | Support(I_n) ≥ S_0};
4) for the remaining non-frequent item sets L_1′, removing them from the groups Q obtained in step one based on the mutual exclusion relationship, so as to obtain a new grouping Q′ that excludes the non-frequent item sets:
Q′ = Q − L_1′
Q′ is the grouping, based on the mutual exclusion relationship, obtained after the frequent item sets are found for the first time, and it is used for extracting the candidate set of "2-item sets".
4. The method for mining report data based on mutual exclusion expression improved association rules according to claim 1, wherein the third step specifically comprises:
obtaining the association rule mining result: extracting item sets from Q′ to generate the candidate set C_2 of "2-item sets", pruning it to obtain the frequent item set L_2, then performing a self-join iterative search on the frequent item sets starting from L_2, pruning the candidate sets obtained by the search, and iterating until no new frequent item set can be generated; the confidence and lift of all frequent item sets are then calculated to obtain the data index item sets with implicit association relations, wherein the specific steps are as follows:
1) by permutation and combination, extracting two item sets in turn from Q′ to generate the candidate set C_2 of "2-item sets";
2) to obtain L_2, pruning the candidate item set C_2, i.e., computing the support of each candidate in C_2 and retaining those meeting the minimum support, thereby obtaining the frequent item set L_2;
3) to obtain L_κ (κ ≥ 3), joining L_{κ-1} with L_{κ-1} to produce the candidate set C_κ of "κ-item sets", written as:
L_κ = {l_1, l_2, ..., l_n}, l_i, l_j (1 ≤ i, j ≤ n), l_i, l_j ∈ L_κ
where l_i = {l_i(1), l_i(2), ..., l_i(m)}, and l_i(κ) (1 ≤ κ ≤ m) ∈ l_i is the κ-th item of l_i; the join L_{κ-1} ∞ L_{κ-1} is performed, where the symbol "∞" indicates that the two item sets are self-joined, i.e., different items are extracted from each item set in turn to generate a new item set; L_{κ-1} is joinable if their first (κ-2) items are the same, and joining l_1 and l_2 yields the item set {l_1(1), l_1(2), ..., l_1(κ-1), l_2(κ-1)};
4) C_κ is a superset of L_κ; in the same way, the support of each candidate item set in C_κ is calculated to obtain the frequent item set L_κ; if any (κ-1)-item subset of a candidate is not in L_{κ-1}, the new candidate cannot be a frequent item set either, so that item can be removed from C_κ;
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010050602.3A CN111309777A (en) | 2020-01-14 | 2020-01-14 | Report data mining method for improving association rule based on mutual exclusion expression |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010050602.3A CN111309777A (en) | 2020-01-14 | 2020-01-14 | Report data mining method for improving association rule based on mutual exclusion expression |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111309777A true CN111309777A (en) | 2020-06-19 |
Family
ID=71148851
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010050602.3A Pending CN111309777A (en) | 2020-01-14 | 2020-01-14 | Report data mining method for improving association rule based on mutual exclusion expression |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111309777A (en) |
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108664642A (en) * | 2018-05-16 | 2018-10-16 | 句容市茂润苗木有限公司 | Rules for Part of Speech Tagging automatic obtaining method based on Apriori algorithm |
Non-Patent Citations (3)
Title |
---|
ZHANG XIAO: "Data-mining-based analysis of water use in the iron and steel industry of Hebei Province", China Master's Theses Full-text Database, Information Science and Technology Series *
WANG WEI: "Research and improvement of the Apriori algorithm in association rules", Wanfang *
MA DONGLAI: "An improved Apriori algorithm method based on data attributes", Journal of Chinese Agricultural Mechanization *
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114116532A (en) * | 2020-08-31 | 2022-03-01 | 南京邮电大学 | Access mode self-learning based cache optimization method and access method |
CN112837148A (en) * | 2021-03-03 | 2021-05-25 | 中央财经大学 | Risk logical relationship quantitative analysis method fusing domain knowledge |
CN113282686A (en) * | 2021-06-03 | 2021-08-20 | 光大科技有限公司 | Method and device for determining association rule of unbalanced sample |
CN113282686B (en) * | 2021-06-03 | 2023-11-07 | 光大科技有限公司 | Association rule determining method and device for unbalanced sample |
CN114265886B (en) * | 2021-12-28 | 2024-04-30 | 航天科工智能运筹与信息安全研究院(武汉)有限公司 | Similarity model retrieval system based on improved Apriori algorithm |
CN114265886A (en) * | 2021-12-28 | 2022-04-01 | 航天科工智能运筹与信息安全研究院(武汉)有限公司 | Similar model retrieval system based on improved Apriori algorithm |
CN114839601A (en) * | 2022-07-04 | 2022-08-02 | 中国人民解放军国防科技大学 | Radar signal high-dimensional time sequence feature extraction method and device based on frequent item analysis |
CN114839601B (en) * | 2022-07-04 | 2022-09-16 | 中国人民解放军国防科技大学 | Radar signal high-dimensional time sequence feature extraction method and device based on frequent item analysis |
CN115543667A (en) * | 2022-09-19 | 2022-12-30 | 成都飞机工业(集团)有限责任公司 | Parameter relevance analysis method, device, equipment and medium for PIU subsystem |
CN115543667B (en) * | 2022-09-19 | 2024-04-16 | 成都飞机工业(集团)有限责任公司 | Parameter relevance analysis method, device, equipment and medium of PIU subsystem |
CN117272398B (en) * | 2023-11-23 | 2024-01-26 | 聊城金恒智慧城市运营有限公司 | Data mining safety protection method and system based on artificial intelligence |
CN117272398A (en) * | 2023-11-23 | 2023-12-22 | 聊城金恒智慧城市运营有限公司 | Data mining safety protection method and system based on artificial intelligence |
CN118245561A (en) * | 2024-05-21 | 2024-06-25 | 天工(天津)传媒科技有限公司 | Urban and rural planning and mapping result generation method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111309777A (en) | Report data mining method for improving association rule based on mutual exclusion expression | |
Bansal et al. | Improved k-mean clustering algorithm for prediction analysis using classification technique in data mining | |
CN111914558B (en) | Course knowledge relation extraction method and system based on sentence bag attention remote supervision | |
CN107766324B (en) | Text consistency analysis method based on deep neural network | |
CN103226554A (en) | Automatic stock matching and classifying method and system based on news data | |
CN112231477A (en) | Text classification method based on improved capsule network | |
CN114281809B (en) | Multi-source heterogeneous data cleaning method and device | |
CN113742396B (en) | Mining method and device for object learning behavior mode | |
CN106295690A (en) | Time series data clustering method based on Non-negative Matrix Factorization and system | |
CN116245107B (en) | Electric power audit text entity identification method, device, equipment and storage medium | |
CN113972010B (en) | Auxiliary disease reasoning system based on knowledge graph and self-adaptive mechanism | |
CN112257386B (en) | Method for generating scene space relation information layout in text-to-scene conversion | |
CN113011161A (en) | Method for extracting human and pattern association relation based on deep learning and pattern matching | |
CN116204673A (en) | Large-scale image retrieval hash method focusing on relationship among image blocks | |
CN103473308A (en) | High-dimensional multimedia data classifying method based on maximum margin tensor study | |
CN112489689B (en) | Cross-database voice emotion recognition method and device based on multi-scale difference countermeasure | |
CN109582743A (en) | A kind of data digging method for the attack of terrorism | |
CN111553442B (en) | Optimization method and system for classifier chain tag sequence | |
CN112434145A (en) | Picture-viewing poetry method based on image recognition and natural language processing | |
CN107633259A (en) | A kind of cross-module state learning method represented based on sparse dictionary | |
CN116805010A (en) | Multi-data chain integration and fusion knowledge graph construction method oriented to equipment manufacturing | |
CN115795037A (en) | Multi-label text classification method based on label perception | |
CN111275081A (en) | Method for realizing multi-source data link processing based on Bayesian probability model | |
Fazili et al. | Recent trends in dimension reduction methods | |
CN116578611B (en) | Knowledge management method and system for inoculated knowledge |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20200619 |