CN111309777A - Report data mining method for improving association rule based on mutual exclusion expression - Google Patents


Info

Publication number
CN111309777A
Authority
CN
China
Prior art keywords
data
item
item set
frequent
sets
Prior art date
Legal status
Pending
Application number
CN202010050602.3A
Other languages
Chinese (zh)
Inventor
沈毅
赵虹博
张淼
Current Assignee
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202010050602.3A priority Critical patent/CN111309777A/en
Publication of CN111309777A publication Critical patent/CN111309777A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval of structured data, e.g. relational data
    • G06F16/24: Querying
    • G06F16/245: Query processing
    • G06F16/2458: Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465: Query processing support for facilitating data mining operations in structured databases
    • G06F2216/00: Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03: Data mining


Abstract

A report data mining method with improved association rules based on mutual exclusion expression relates to knowledge discovery and data mining in the field of data science, and addresses the large memory consumption and low efficiency of traditional association rule algorithms when processing massive data. The method comprises the following steps: first, convert the data into transaction data based on data threshold ranges, and obtain a binary sparse matrix with grouping labels based on the data logic; second, acquire the set of frequent 1-item sets and remove the non-frequent item sets to obtain a new grouping result; third, perform self-join iterative search on the frequent item sets, cut the candidate item sets, and iterate until no new frequent item set can be generated, thereby obtaining the association rule mining result. The basic idea of the invention is to convert structured data into transaction data, generate groups based on the mutual exclusion relationships, and then perform rule mining, thereby reducing the required memory and improving computational efficiency. The application scenarios are wide, and the social and economic value is high.

Description

Report data mining method for improving association rule based on mutual exclusion expression
Technical Field
The invention relates to a knowledge discovery and data mining method in the field of data science, in particular to an improved association rule report data mining method based on mutual exclusion expression.
Background
With the explosive growth of data volume in the information age, enormous value has been found hidden behind massive data. For structured data, traditional or modern data mining means can obtain good results, but structured data reports are difficult to convert into transaction data (data sets that indicate only whether an event occurs, represented by the Boolean values True and False) for association rule mining, and as the data scale grows continuously, data mining methods that can extract valuable information face a huge challenge on massive data. At present, dividing a structured data report into transaction data based on thresholds greatly increases the data dimensionality; because the memory and computing power of a computer cannot meet the resulting computing requirements, efficiency is low and the time cost of data mining rises sharply. The method provided by this patent reasonably reduces the high-dimensional data produced when a structured data report is converted into transaction data, based on mutual exclusion expression, thereby lowering the computation cost of the traditional association rule algorithm and improving the efficiency of association rule data mining.
Data mining and analysis with traditional association rule algorithms has already produced research results at home and abroad, but for structured data reports converted into transaction data by threshold partitioning, the partitioning generates many data items with mutual exclusion relationships; the traditional algorithms are then inefficient and easily run out of memory, and improvement methods for this situation are still at an exploratory stage. Association rules are rules that reflect the interdependency and association between one thing and another, used to mine correlations between valuable data items in a large amount of data. Existing association rule mining methods fall into three main categories: methods based on Apriori, methods based on FP-growth, and methods based on Eclat. The Apriori algorithm uses two properties of frequent item sets to prune a large number of candidate item sets, improving computational efficiency. The FP-growth algorithm compresses the data records by constructing a tree structure, so the records need to be scanned only twice when mining frequent item sets and no candidate sets need to be generated, improving computational efficiency. The Eclat algorithm introduces an inverted-index idea to convert the original horizontal database structure (items listed per transaction) into a vertical one, with the items in the data as keys and the transaction IDs containing each item as values; this inverted table speeds up the generation of frequent item sets, and support is computed by intersection counting with depth-first search as the strategy, improving computational efficiency.
Existing association rule algorithms are mining methods for small-scale data; when the data to be processed are converted from structured data into transaction data with mutual exclusion relationships, low computational efficiency and insufficient memory often result. In addition, improving the existing Apriori algorithm can optimize the efficiency of mining frequent item sets, reducing the time cost of the algorithm. Such improved methods can help people analyze association rules in big data, improve computational efficiency, reduce time cost, and obtain association rule analysis results more quickly.
Disclosure of Invention
Because the scale of data mined with association rules keeps growing, traditional association rule mining methods struggle with the insufficient memory and excessive computation time caused by structured data reports converted into transaction data by threshold partitioning. The invention aims to solve the low computational efficiency caused by excessive mutually exclusive data items when existing association rule mining methods process such reports, and provides an improved association rule mining method based on mutual exclusion expression, which improves computational efficiency while mining the implicit association relationships among data indexes.
The purpose of the invention is realized by the following technical scheme: firstly, converting data to be processed into transaction data based on a logical relationship among the data and a data threshold range, analyzing data information to obtain data scale and matrix density, and grouping the data based on mathematical logic; then, reading and writing the data set for the first time, calculating the support degree value of each item, and grouping the frequent item sets according to the grouping result of the previous step by obtaining all sets with the frequent items of 1; and finally, respectively carrying out iterative search on each group of data to obtain a frequent item set, cutting the candidate set obtained by search, and iterating until a new frequent item set cannot be generated, thereby obtaining an association rule mining result.
The flow chart of the invention is shown in figure 1, and the specific steps are as follows:
the method comprises the following steps: firstly, converting data to be processed into transaction data based on a logic relationship between the data and a data threshold range, analyzing data information to obtain data scale and matrix density, and grouping an item set with a mutual exclusion relationship in the data based on mathematical logic to obtain a binary sparse matrix with a grouping label, wherein the specific steps are as follows:
1) Convert the data set to be mined into transaction data based on thresholds over the data values, and record it as data set D. Let I_n = {r_n1, r_n2, ..., r_nm} be a collection of m different items; each r_nμ is called an item, and the collection I_n is called an item set. The number of elements is called the length of the item set, and an item set of length k is called a "k-item set". The data set D, containing n item sets, can be expressed as:
D = {I_1, I_2, ..., I_n}
After conversion into transaction data, for any item r_κ, its value has only two cases:
r_κ = 1 if the corresponding event occurs (True); r_κ = 0 otherwise (False)
2) By transforming the data set D, a binary sparse matrix U can be obtained, where U can be expressed as:
U = [ r_11 r_12 ... r_1m
      r_21 r_22 ... r_2m
      ...
      r_n1 r_n2 ... r_nm ]
where the first subscript n of r_nm denotes the corresponding n-th item set I_n and the second subscript m denotes the m-th element of item set I_n; the binary sparse matrix U contains only the Boolean values 1 and 0, corresponding respectively to True and False in the transaction data;
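The threshold conversion in steps 1) and 2) can be sketched in Python (the patent's embodiment uses R; the column name and cut-off values below are illustrative assumptions, not from the patent):

```python
def to_transactions(records, thresholds):
    """records: list of dicts {index_name: numeric value};
    thresholds: {index_name: (low_cut, high_cut)}.
    Each numeric index expands into three mutually exclusive
    Boolean items: <name>_low, <name>_mid, <name>_high."""
    rows = []
    for rec in records:
        row = {}
        for name, value in rec.items():
            low_cut, high_cut = thresholds[name]
            row[f"{name}_low"] = value < low_cut
            row[f"{name}_mid"] = low_cut <= value < high_cut
            row[f"{name}_high"] = value >= high_cut
        rows.append(row)
    return rows

# Toy data: three records of one index, with assumed cut-offs 9000 / 15000.
data = [{"students": 24242}, {"students": 7474}, {"students": 11144}]
U = to_transactions(data, {"students": (9000, 15000)})
# In every row, exactly one of the three expanded items is True.
```

Exactly one of the three level items is True per record, which is the mutual exclusion that the later grouping exploits.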
3) Perform a logical analysis on the item sets in data set D. Item sets with a mutual exclusion relationship, i.e. events that cannot occur together,
P(I_a ∩ I_b) = 0
are divided into one group, denoted Q_t:
Q_t = {I_a, I_b, ..., I_n}
Note that the number of groups Q_t is determined by the number of mutually exclusive item sets in data set D, and each item set belongs to exactly one group Q_t:
Q_i ∩ Q_j = ∅ (i ≠ j),  Q_1 ∪ Q_2 ∪ ... ∪ Q_t = D
4) Further, a binary sparse matrix U with grouping labels is obtained, which can be expressed in terms of Q as:
U = [Q_1 Q_2 ... Q_t]
where U is a binary sparse matrix, i.e. a matrix containing only 0 and 1, composed of the t grouping sub-matrices Q_1, ..., Q_t.
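A minimal sketch of the grouping in steps 3) and 4), assuming items follow a `<index>_<level>` naming scheme (an illustrative assumption; the patent groups by the logical relation between data indexes):

```python
def group_mutex(item_names):
    """Items named <index>_low/_mid/_high share a base index and are
    mutually exclusive; each item lands in exactly one group Q_t."""
    groups = {}
    for item in item_names:
        base = item.rsplit("_", 1)[0]   # strip the level suffix
        groups.setdefault(base, []).append(item)
    return list(groups.values())

Q = group_mutex(["students_low", "students_mid", "students_high",
                 "teachers_low", "teachers_mid", "teachers_high"])
# Two groups of three mutually exclusive items each.
```

Each group corresponds to one sub-matrix Q_t of the labeled sparse matrix U.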
Step two: read the data set once, calculate the support, confidence and lift of each item, obtain the set of frequent 1-item sets, and remove the non-frequent item sets from the groups obtained in step one based on the mutual exclusion relationship, producing a new grouping result. The specific steps are as follows:
1) Traverse the data set D and, for each item set I_n, count the number of records in which r = 1. Given a minimum support S_0, if the support S satisfies S ≥ S_0, the item set I_n is a frequent item set;
2) respectively calculating the Support degree Support, the Confidence degree Confidence and the Lift degree Lift of the generated frequent item set:
Support(I_a) = P(I_a) = count(I_a) / |D|
Support(I_a → I_b) = Support(I_a ∪ I_b) = P(I_a ∪ I_b)
Confidence(I_a → I_b) = Support(I_a ∪ I_b) / Support(I_a) = P(I_b | I_a)
Lift(I_a → I_b) = Confidence(I_a → I_b) / Support(I_b)
where (I_a → I_b) expresses only the association from item set I_a to item set I_b and has no arithmetic meaning. The support is the number of occurrences of the corresponding item set divided by the total number of records; Support(I_a → I_b) is the percentage of transactions in data set D that contain I_a ∪ I_b, written Support(I_a ∪ I_b), with I_a, I_b ⊆ I. The confidence is the ratio of the number of records in which I_a and I_b occur together to the number in which I_a occurs, i.e. the probability of I_b occurring given I_a. The lift measures the independence of item sets I_a and I_b: it reflects how much the occurrence of I_a changes the probability of I_b occurring. A lift of 1 indicates no correlation, a lift below 1 indicates the two are repulsive, and the further the lift exceeds 1, the more valuable the association rule;
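The three measures above can be sketched directly over Boolean transaction rows; this is a generic implementation of the standard formulas, not code from the patent:

```python
def support(rows, items):
    """Fraction of transactions containing every item in `items`."""
    hits = sum(1 for row in rows if all(row.get(i, False) for i in items))
    return hits / len(rows)

def confidence(rows, a, b):
    """P(b | a) = Support(a ∪ b) / Support(a)."""
    return support(rows, a | b) / support(rows, a)

def lift(rows, a, b):
    """Confidence(a -> b) / Support(b); 1 means independence."""
    return confidence(rows, a, b) / support(rows, b)

rows = [{"x": True, "y": True}, {"x": True, "y": False},
        {"x": False, "y": True}, {"x": True, "y": True}]
# support({x}) = 3/4, support({x,y}) = 1/2,
# confidence(x -> y) = 2/3, lift(x -> y) = (2/3)/(3/4) = 8/9
```

With lift below 1, x and y here are (slightly) repulsive, matching the interpretation in the text.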
3) After the traversal, based on the minimum support threshold S_0, obtain the collection of frequent 1-item sets, denoted L_1:
L_1 = {I_n | Support(I_n) ≥ S_0}
4) For the remaining non-frequent item sets L_1', remove them from the groups Q obtained in step one based on the mutually exclusive events, giving a new grouping Q' that excludes the non-frequent item sets:
Q' = Q - L_1'
Q' is the grouping based on the mutual exclusion relationship after the first frequent-item-set pass, and is used to extract the candidate set of "2-item sets".
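The first pass of step two can be sketched as a single scan that builds L_1 and the pruned grouping Q'; item names and the threshold are illustrative:

```python
def first_pass(rows, groups, min_support):
    """One scan: count each item's occurrences, keep the frequent
    1-item sets L1, and drop non-frequent items from each mutually
    exclusive group, giving the pruned grouping Q'."""
    n = len(rows)
    counts = {}
    for row in rows:
        for item, present in row.items():
            if present:
                counts[item] = counts.get(item, 0) + 1
    L1 = {i for i, c in counts.items() if c / n >= min_support}
    pruned = [[i for i in g if i in L1] for g in groups]
    return L1, [g for g in pruned if g]   # drop emptied groups

rows = [{"a": True, "b": False, "c": True},
        {"a": True, "b": False, "c": False},
        {"a": False, "b": True, "c": True}]
L1, Qp = first_pass(rows, [["a", "b"], ["c"]], min_support=0.5)
```

Dropping "b" here before any joins is exactly the saving the text describes: no candidate containing a non-frequent item is ever generated.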
Step three: obtain the association rule mining result. Extract item sets from Q' to generate the candidate set C_2 of "2-item sets", prune it to obtain the frequent item set L_2, then perform self-join iterative search on the frequent item sets, cutting the candidate sets obtained at each step, and iterate until no new frequent item set can be generated. Calculate the confidence and lift of all frequent item sets to obtain the data index item sets with implicit association relationships. The specific steps are as follows:
1) By permutation and combination, extract two item sets in turn from different groups of Q' to generate the candidate set C_2 of "2-item sets":
C_2 = {{I_a, I_b} | I_a ∈ Q_i', I_b ∈ Q_j', i ≠ j}
2) To obtain L_2, prune the candidate set C_2: compute the support of each candidate in C_2 and keep those meeting the minimum support, obtaining the frequent item set L_2:
L_2 = {c ∈ C_2 | Support(c) ≥ S_0}
3) To obtain L_κ (κ ≥ 3), join L_{κ-1} with L_{κ-1} to produce the "κ-item set" candidate set C_κ, written:
L_{κ-1} = {l_1, l_2, ..., l_n}, l_i, l_j ∈ L_{κ-1} (1 ≤ i, j ≤ n)
where l_i = {l_i(1), l_i(2), ..., l_i(κ-1)} and l_i(μ) is the μ-th item of l_i. Perform the join L_{κ-1} ⋈ L_{κ-1}, where "⋈" denotes self-join: different items are extracted in turn from each pair of item sets to generate a new item set. Two members of L_{κ-1} are joinable if their first (κ-2) items are the same; joining l_1 and l_2 yields the item set {l_1(1), l_1(2), ..., l_1(κ-1), l_2(κ-1)};
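The self-join L_{κ-1} ⋈ L_{κ-1} described above can be sketched over sorted item tuples (a generic Apriori candidate-generation step, not the patent's R code):

```python
def self_join(L_prev):
    """L_prev: list of sorted tuples of items, all of length k-1.
    Two members are joinable when their first k-2 items agree;
    the union gives one k-item candidate."""
    candidates = set()
    for i, a in enumerate(L_prev):
        for b in L_prev[i + 1:]:
            if a[:-1] == b[:-1]:        # first k-2 items identical
                candidates.add(tuple(sorted(set(a) | set(b))))
    return sorted(candidates)

L2 = [("a", "b"), ("a", "c"), ("b", "c")]
C3 = self_join(L2)   # only ("a","b") and ("a","c") share a prefix
```

Only the pair with a common (κ-2)-prefix joins, so C_3 holds the single candidate ("a", "b", "c").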
4) C_κ is a superset of L_κ. As before, compute the support of each candidate item set in C_κ to obtain the frequent item set L_κ. If a (κ-1)-item subset s of a candidate c ∈ C_κ satisfies
s ∉ L_{κ-1}
then
c ∉ L_κ
i.e. the new candidate cannot be a frequent item set either, so it can be removed from C_κ;
5) Iterate repeatedly to obtain the frequent item sets L_κ until
L_κ = ∅
then end the iteration, obtaining the confidence and lift results of the frequent item sets from 2-item sets up to κ-item sets, and thereby the data index item sets with implicit association relationships.
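The whole step-three loop can be sketched compactly, combining the mutual-exclusion pruning with the classic Apriori subset pruning; this is an illustrative reconstruction, not the patent's implementation:

```python
from itertools import combinations

def mine(rows, groups, min_support):
    """rows: Boolean transaction dicts; groups: mutually exclusive
    item groups; returns all frequent item sets as sorted tuples."""
    n = len(rows)
    def sup(items):
        return sum(all(r.get(i, False) for i in items) for r in rows) / n
    group_of = {item: gi for gi, g in enumerate(groups) for item in g}
    L = [(item,) for g in groups for item in g if sup((item,)) >= min_support]
    all_frequent = list(L)
    while L:
        prev, C = set(L), set()
        for i, a in enumerate(L):
            for b in L[i + 1:]:
                if a[:-1] == b[:-1]:                     # self-join
                    c = tuple(sorted(set(a) | set(b)))
                    # mutex pruning: two levels of one index never co-occur
                    if len({group_of[x] for x in c}) < len(c):
                        continue
                    # Apriori pruning: every (k-1)-subset must be frequent
                    if all(s in prev for s in combinations(c, len(c) - 1)):
                        C.add(c)
        L = sorted(c for c in C if sup(c) >= min_support)
        all_frequent.extend(L)
    return all_frequent

rows = [{"a": True, "c": True}, {"a": True}, {"b": True, "c": True}]
freq = mine(rows, [["a", "b"], ["c"]], min_support=0.5)
```

The mutex check discards candidates before their support is ever counted, which is where the memory and time savings claimed by the method come from.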
Compared with the prior art, the invention has the following advantages:
the invention adopts an association rule mining method, and converts numerical data into transaction data by dividing a data set based on a threshold value, thereby facilitating the development of association rule analysis and mining work and expanding the data range which can be analyzed by the association rule; the characteristic of the structured data with the mutually exclusive items is used for grouping the mutually exclusive item sets in the data preprocessing process, so that the data storage structure in the database is optimized, and the traversal efficiency is improved.
Meanwhile, the mining of association rules in structured data with many mutual exclusion relationships is improved on the basis of the Apriori algorithm. During the first traversal for frequent item sets, grouping by mutually exclusive item sets reduces the computation needed to delete non-frequent item sets and effectively reduces the required storage. During the repeated iterations that search for frequent item sets and eliminate non-frequent ones, the grouping of mutually exclusive item sets greatly reduces the generation of non-frequent item sets, so the laborious step of deleting them is largely avoided, effectively improving the computational efficiency of the algorithm and saving a large amount of time.
At present, some existing association rule mining methods are only suitable for analyzing and mining a small amount of transaction data, and a large amount of time cost is needed for massive transaction data due to low computing efficiency. Aiming at massive structured data with a mutually exclusive item set, the association rule mining and analyzing method carries out association rule mining and analyzing on the data based on an improved algorithm of the group iterative search of the mutually exclusive item set, thereby improving the iterative computation efficiency, optimizing the memory space required by computation, greatly reducing the time cost, and quickly and efficiently meeting the requirements of association rule mining and analyzing of big data researchers.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is an exemplary diagram of essential information of an embodiment data set.
Fig. 3 is a diagram of an example of the first 20 degrees of support.
FIG. 4 is an exemplary graph of support, confidence and boost scatter.
Fig. 5 is an exemplary diagram of the association rule paracoord.
Detailed Description
The following describes the specific implementation of the invention with reference to college teaching quality data:
the college teaching quality data are structured data, and multi-dimensional data of college teaching quality are included. The implicit relevance exists among the multidimensional indexes such as the number scale of students, the quality of teacher teams, school handling conditions, school fund prizes and the like, and the association rule mining and analysis are needed to be carried out on the teaching quality data.
Execute step one: preprocess the structured data to be processed, shown in table 1, by uniformly converting the data into transaction data in Boolean form, then group the data based on the mutual exclusion relationships between the data indexes, as shown in table 2, obtaining a binary sparse matrix with grouping labels.
For each index, a suitable threshold is selected and the original structured data in table 1 are converted into the corresponding transaction data. The original data comprise 25 core indexes for 997 colleges, and each core index is divided into three levels (low, medium and high) according to thresholds, so the original structured data expand into a 997 × 75 binary sparse matrix. After conversion into the transaction data of table 2, the basic information of the data set is checked by calling the summary function in the R language; the matrix density is 0.33, as shown in fig. 2. The item support frequency plot can be viewed with the itemFrequencyPlot function in R, and the items with the top 20 support values are shown in FIG. 3.
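The matrix density mentioned above is simply the fraction of True entries; a toy check (the 997 × 75 shape and the 0.33 figure come from the text, while the small matrix below is illustrative):

```python
def density(matrix):
    """Fraction of nonzero (True) entries in a Boolean 0/1 matrix."""
    total = sum(len(row) for row in matrix)
    ones = sum(sum(1 for v in row if v) for row in matrix)
    return ones / total

M = [[1, 0, 0],
     [0, 1, 0],
     [0, 0, 1]]
# With one True per mutually exclusive triple, each row of the
# 997 x 75 matrix has 25 of 75 entries set, hence density ~ 0.33.
```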
Table 1 Example table of data to be processed
School  Undergraduate students  Converted (full-time-equivalent) students  Full-time teachers  Physical fitness pass rate (%)
Anhui University of Science and Technology 24242 36520 1853 94.15
ZHEJIANG INTERNATIONAL STUDIES University 24430 28531 1635 91.56
Heilongjiang foreign language 7474 7988 752 69.90
Zhejiang Ocean University 9576 10235 564 76.65
QILU NORMAL University 11144 13564 1304 92.52
QINGDAO HUANGHAI University 7400 7954 856 87.49
QILU INSTITUTE OF TECHNOLOGY 5422 6523 785 88.48
JOURNAL OF JIANGXI PUBLIC SECURITY College 29850 31256 2365 100.00
GUANGXI MEDICAL University 8876 9520 981 65.22
GANNAN NORMAL University 16778 18463 2028 99.90
BAOSHAN University 22173 42979 1876 88.90
ZheJiang Chinese Medical University 7047 8456 626 67.36
XINYU University 4337 5236 666 98.20
Jilin University 41344 91261 2869 77.63
CANGZHOU NORMAL University 14533 15623 1562 82.56
NANCHANG INSTITUTE OF SCIENCE & TECHNOLOGY 13340 15354 1111 98.25
GUANGDONG UNIVERSITY OF SCIENCE & TECHNOLOGY 6283 7652 628 77.36
Table 2 Transaction data example table
School  Converted students_low  Converted students_medium  Converted students_high  Undergraduate students_low
Anhui University of Science and Technology FALSE FALSE TRUE FALSE
ZHEJIANG INTERNATIONAL STUDIES University FALSE FALSE TRUE FALSE
Heilongjiang foreign language TRUE FALSE FALSE TRUE
Zhejiang Ocean University FALSE TRUE FALSE FALSE
QILU NORMAL University FALSE TRUE FALSE FALSE
QINGDAO HUANGHAI University TRUE FALSE FALSE TRUE
QILU INSTITUTE OF TECHNOLOGY TRUE FALSE FALSE TRUE
JOURNAL OF JIANGXI PUBLIC SECURITY College FALSE FALSE TRUE FALSE
GUANGXI MEDICAL University TRUE FALSE FALSE TRUE
GANNAN NORMAL University FALSE TRUE FALSE FALSE
BAOSHAN University FALSE FALSE TRUE FALSE
ZheJiang Chinese Medical University TRUE FALSE FALSE TRUE
XINYU University TRUE FALSE FALSE TRUE
Jilin University FALSE FALSE TRUE FALSE
CANGZHOU NORMAL University FALSE TRUE FALSE FALSE
NANCHANG INSTITUTE OF SCIENCE & TECHNOLOGY FALSE TRUE FALSE FALSE
GUANGDONG UNIVERSITY OF SCIENCE & TECHNOLOGY TRUE FALSE FALSE TRUE
The transaction data in table 2 are represented by the Boolean values 0 and 1, producing a binary sparse matrix. Because each index of the original structured data expands into three indexes (low, medium and high) that are obviously mutually exclusive, the low, medium and high indexes of each original index are placed in one group of the mutually exclusive item set Q. For example, converted students_low, converted students_medium and converted students_high have an obvious mutual exclusion relationship and are therefore placed in one group. Since there are 25 core indexes, the mutually exclusive item set Q comprises 25 groups of 3 indexes each. This yields the binary sparse matrix with grouping labels and completes the data preprocessing of step one.
Execute step two: read the data set once, calculate the support value of each item, obtain the set of frequent 1-item sets, and group the frequent item sets according to the grouping labels.
Reading the obtained binary sparse matrix once gives the record count of each index. Given the minimum support, the supports of the 75 indexes are calculated, and the indexes meeting the minimum support threshold form the set of frequent 1-item sets, denoted L_1. The support, confidence and lift of each item set can then be calculated. The support is the number of occurrences of the corresponding item set divided by the total number of records; Support(I_a → I_b) is the percentage of transactions in data set D containing I_a ∪ I_b. The confidence is the ratio of the number of records in which I_a and I_b occur together to the number in which I_a occurs, i.e. the probability of I_b occurring given I_a. The lift measures the independence of item sets I_a and I_b, reflecting how much the occurrence of I_a changes the probability of I_b occurring; a lift of 1 indicates no correlation, a lift below 1 indicates the two are repulsive, and the further the lift exceeds 1, the more valuable the association rule.
For the remaining non-frequent item sets L_1', remove them from the grouping Q obtained in step one, giving a new grouping Q'. The significance of this is to drop from the grouping labels the indexes that fail the minimum support threshold: any item set later generated by joining such an index with others still cannot meet the threshold, so removing these indexes at this step effectively improves computational efficiency and frees part of the memory, optimizing the computation and saving time.
Execute step three: extract item sets from Q' to generate the candidate set C_2 of "2-item sets", prune it to obtain the frequent item set L_2, then perform self-join iterative search on the frequent item sets, cut the candidate sets obtained by the search, and iterate until no new frequent item set can be generated, obtaining the association rule mining result.
To generate the candidate set C_2 of "2-item sets", arbitrarily extract two indexes from any two of the group sets obtained in step two by permutation and combination. For C_2, calculate the support of each candidate and test it against the minimum support threshold; item sets meeting the threshold are added to the frequent item set L_2, and those failing it are deleted, completing the pruning process.
Then, to obtain L_κ (κ ≥ 3), join L_{κ-1} with L_{κ-1} to produce the "κ-item set" candidate set C_κ, where l_i = {l_i(1), l_i(2), ..., l_i(κ-1)} and l_i(μ) is the μ-th item of l_i. Perform the join L_{κ-1} ⋈ L_{κ-1}; two members of L_{κ-1} are joinable if their first (κ-2) items are the same, and joining l_1 and l_2 yields the item set {l_1(1), l_1(2), ..., l_1(κ-1), l_2(κ-1)}. C_κ is a superset of L_κ; as before, compute the support of each candidate item set in C_κ to obtain the frequent item set L_κ. If a (κ-1)-item subset s of a candidate c ∈ C_κ satisfies
s ∉ L_{κ-1}
then
c ∉ L_κ
i.e. the new candidate cannot be a frequent item set either, so it is removed from C_κ. Iterate repeatedly to obtain the frequent item sets L_κ until
L_κ = ∅
then end the iteration, obtaining the support, confidence and lift results of each frequent item set.
Association rules are mined from the data with the improved Apriori function, with the support parameter set to 0.1, the confidence to 0.8, and the minimum number of items in a rule to 2; mining yields 738679 rules. Raising the support to 0.2 yields 31869 rules, as shown in FIG. 2, and raising it to 0.3 yields 3519 rules.
The mining result with support 0.2 and confidence 0.8 is selected for analysis. A scatter plot of support, confidence and lift for the mining result is shown in fig. 4; with support 0.4 and confidence 0.8, the paracoord plot in fig. 5 is drawn, where each polyline represents an association rule and deeper color indicates higher lift. With support 0.1 and confidence 0.8, 4135 rules in total have lift greater than 3; with support 0.2 and confidence 0.8, 53 rules in total have lift greater than 3.
The association rule analysis figures generated in this embodiment and the improved association rule report data mining method based on mutual exclusion expression are implemented in R version 3.4.3, but the implementation of this patent is not limited to R; implementations of the patented method in other languages and development environments all fall within the protection scope of this patent.

Claims (4)

1. A report data mining method with improved association rules based on mutual exclusion expression, characterized in that the method comprises the following steps:
the method comprises the following steps: converting data to be processed into transaction data based on the logic relationship among the data and the data threshold range, analyzing data information to obtain data scale and matrix density, and grouping item sets with mutual exclusion relationship in the data based on mathematical logic to obtain a binary sparse matrix with a grouping label;
step two: reading and writing the data set, calculating the support degree, the confidence degree and the promotion degree of each item to obtain a set with all frequent items being 1, and removing the non-frequent item set according to the grouping obtained based on the mutual exclusion relationship in the step one so as to obtain a new grouping result;
step three: extracting item sets from mutually exclusive groups respectively to generate a candidate set C of' 2 item sets2Pruning to obtain frequent item set L2Then, for frequent item set L2And carrying out self-connection iterative search on the frequent item set, cutting the candidate set obtained by search, and iterating until a new frequent item set cannot be generated, thereby obtaining an association rule mining result.
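As a hedged illustration of the grouping idea in step one (a sketch, not the patented implementation; the column names and matrix are hypothetical), two items can be treated as mutually exclusive when no record contains both, and columns can then be collected greedily into pairwise mutually exclusive groups:

```python
# Hypothetical one-hot report items: "low"/"mid"/"high" encodings of one
# indicator are mutually exclusive (never 1 together in the same record).
columns = ["low", "mid", "high", "alarm"]
U = [  # binary transaction matrix, rows = records
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 0],
]

def mutually_exclusive(i, j):
    """True when columns i and j are never 1 in the same record."""
    return all(not (row[i] and row[j]) for row in U)

# Greedily merge columns into groups of pairwise mutually exclusive items,
# so each column ends up in exactly one group (one possible grouping).
groups = []
for idx in range(len(columns)):
    for g in groups:
        if all(mutually_exclusive(idx, other) for other in g):
            g.append(idx)
            break
    else:
        groups.append([idx])

print([[columns[i] for i in g] for g in groups])
```

On this matrix the three level indicators fall into one group and "alarm" (which co-occurs with them) into another, which is the grouping-label structure the claim describes.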
2. The report data mining method with association rules improved by mutual exclusion expressions according to claim 1, characterized in that step one specifically comprises:
converting the data to be processed into transaction data based on the logical relationships among the data and the data threshold ranges, analyzing the data information to obtain the data scale and matrix density, and grouping the item sets with mutual exclusion relationships in the data based on mathematical logic to obtain a binary sparse matrix with grouping labels; the specific steps are as follows:
1) converting the data set to be mined into transaction data based on thresholds over the data values, and recording it as the data set D; let In = {rn1, rn2, ..., rnm} be a collection of m different items, where each r is called an item, the collection In is called an item set, the number of its elements is called the length of the item set, and an item set of length k is called a "k-item set"; the data set D, containing n item sets, can then be expressed as:
D = {I1, I2, ..., In}
after conversion into transaction data, for any item iκ, its value can only take one of two forms:
iκ = 1 (True), if the record satisfies the item;
iκ = 0 (False), otherwise;
2) by transforming the data set D, a binary sparse matrix U can be obtained, where U can be expressed as:
U = [ r11  r12  ...  r1m
      r21  r22  ...  r2m
      ...
      rn1  rn2  ...  rnm ]
wherein the first subscript n of rnm denotes the corresponding n-th item set In and the second subscript m denotes the m-th element of the item set In; the binary sparse matrix U contains only the Boolean values 1 and 0, corresponding respectively to True and False in the transaction data;
3) performing logical analysis on the item sets Ij in the data set D; if item sets with a mutual exclusion relationship exist, they are divided into one group, denoted Qt; mutual exclusion of Ia and Ib means that they never hold in the same record, i.e.
P(Ia ∧ Ib) = 0
and the group is
Qt = {Ia, Ib, ..., In}
note that the number of groups Qt is determined by the number of sets of mutually exclusive item sets in the data set D, and each item set exists in exactly one group Qt:
Qi ∩ Qj = ∅ (i ≠ j)
4) further, the binary sparse matrix U with grouping labels is obtained, which can be represented in terms of Q as:
U = [Q1 Q2 ... Qt]
wherein U is a binary sparse matrix, i.e. a matrix containing only 0 and 1, composed of the t grouping matrices Q.
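The four sub-steps above can be sketched as follows, assuming hypothetical field names and thresholds: each numeric report field is thresholded into Boolean items (the conversion of sub-step 1), giving the binary matrix U of sub-step 2 together with the data scale and density; items such as cpu_high/cpu_low are mutually exclusive by construction and would share one grouping label Qt:

```python
# Hypothetical raw report records and threshold rules (not from the patent).
records = [
    {"cpu": 0.93, "mem": 0.41},
    {"cpu": 0.35, "mem": 0.88},
    {"cpu": 0.97, "mem": 0.95},
]
# Each rule maps a field to a Boolean item; the *_high / *_low items for
# one field are mutually exclusive and would form one group Qt.
rules = {
    "cpu_high": lambda r: r["cpu"] >= 0.9,
    "cpu_low":  lambda r: r["cpu"] < 0.9,
    "mem_high": lambda r: r["mem"] >= 0.8,
    "mem_low":  lambda r: r["mem"] < 0.8,
}

items = list(rules)
# Binary sparse matrix U: one row per record, one 0/1 column per item.
U = [[int(rules[item](r)) for item in items] for r in records]

n_rows, n_cols = len(U), len(items)
density = sum(map(sum, U)) / (n_rows * n_cols)  # fraction of 1-entries
print(U)
print("scale:", n_rows, "x", n_cols, "density:", density)
```

With one-hot threshold encodings like these, each field contributes exactly one 1 per record, so the density is fixed by the number of fields; real report data with optional items would be sparser.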
3. The report data mining method with association rules improved by mutual exclusion expressions according to claim 1, characterized in that step two specifically comprises:
reading the data set for the first time, calculating the support, confidence and lift of each item, obtaining the set of all frequent 1-item sets, and removing the non-frequent item sets from the groups obtained from the mutual exclusion relationships in step one to obtain a new grouping result; the specific steps are as follows:
1) traversing the data set D and counting, for each item set In, the number of records in which it occurs (i.e. the records with r = 1); based on a given minimum support S0, if S ≥ S0, the item set In is a frequent item set;
2) respectively calculating the Support, Confidence and Lift of the generated frequent item sets:
Support(Ia) = count(Ia) / N
Support(Ia → Ib) = Support(Ia ∪ Ib) = P(Ia ∪ Ib)
Confidence(Ia → Ib) = Support(Ia ∪ Ib) / Support(Ia) = P(Ib | Ia)
Lift(Ia → Ib) = Confidence(Ia → Ib) / Support(Ib)
where count(Ia) is the number of records containing Ia and N is the total number of records;
wherein (Ia → Ib) expresses only the association between the item set Ia and the item set Ib and has no meaning as a mathematical operation; the support is the number of occurrences of the corresponding item set divided by the total number of records; Support(Ia → Ib) is the percentage of transactions in the database D containing Ia ∪ Ib, denoted Support(Ia ∪ Ib), with Ia, Ib ∈ I; the significance of the confidence lies in the ratio of the number of simultaneous occurrences of the item sets Ia, Ib to the number of occurrences of the item set Ia, i.e. the probability that Ib occurs given that Ia has occurred; the significance of the lift lies in measuring the independence of the item set {Ia} and the item set {Ib}: it reflects how much the occurrence of the item set {Ia} changes the probability of occurrence of the item set {Ib}; in general, a value of 1 indicates that the two are uncorrelated, a value less than 1 indicates that the two are mutually repulsive, and when the lift is greater than 1, the greater the lift, the more valuable the association rule;
3) after the traversal, based on the minimum support threshold S0, the collection of frequent 1-item sets is obtained and denoted L1:
L1 = {In | Support(In) ≥ S0}
4) for the remaining non-frequent item sets L1′, starting from the grouping Q obtained from the mutual exclusion relationships in step one, a new grouping Q′ is obtained by excluding the non-frequent item sets:
Q′ = Q − L1′
Q′ is the grouping based on the mutual exclusion relationships after the frequent item sets have been obtained for the first time, and is used for extracting the candidate set of 2-item sets.
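A minimal sketch of these sub-steps, with a hypothetical matrix and grouping: per-item support is counted in one pass, the frequent 1-item sets form L1, and removing the non-frequent items from the mutual-exclusion grouping yields Q′ (confidence and lift only become meaningful for rules between item sets and are omitted here):

```python
# Hypothetical binary matrix U and mutual-exclusion grouping Q from step one.
items = ["low", "mid", "high", "alarm"]
Q = [["low", "mid", "high"], ["alarm"]]
U = [
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
]
S0 = 0.3  # assumed minimum support threshold

# One pass over the rows: support of each 1-item set.
n = len(U)
support = {item: sum(row[i] for row in U) / n for i, item in enumerate(items)}

# L1: frequent 1-item sets; Q' = Q with the non-frequent items removed.
L1 = {item for item, s in support.items() if s >= S0}
Q_prime = [[item for item in group if item in L1] for group in Q]

print("support:", support)
print("L1:", sorted(L1))
print("Q':", Q_prime)
```

Here "mid" and "high" fall below S0 and are pruned from their group, so the later candidate generation never has to consider them.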
4. The report data mining method with association rules improved by mutual exclusion expressions according to claim 1, characterized in that step three specifically comprises:
obtaining the association rule mining result: extracting item sets from Q′ to generate the candidate set C2 of 2-item sets, pruning it to obtain the frequent item set L2, then performing a self-join iterative search on the frequent item sets, pruning the candidate sets obtained by the search, and iterating until no new frequent item set can be generated; the confidence and lift of all frequent item sets are calculated to obtain the data index item sets with implicit association relationships; the specific steps are as follows:
1) following the permutation-and-combination rule, extracting two item sets at a time from different groups of Q′ to generate the candidate set C2 of 2-item sets:
C2 = {{Ia, Ib} | Ia ∈ Qs′, Ib ∈ Qu′, s ≠ u}
2) to obtain L2, pruning the candidate set C2, i.e. computing the support of each candidate in C2 and retaining those meeting the minimum support, thereby obtaining the frequent item set L2:
L2 = {c ∈ C2 | Support(c) ≥ S0}
3) to obtain Lκ (κ ≥ 3), joining Lκ−1 with Lκ−1 to produce the candidate set Cκ of κ-item sets, written as:
Lκ = {l1, l2, ..., ln}, li, lj ∈ Lκ (1 ≤ i, j ≤ n)
where li = {li(1), li(2), ..., li(m)} and li(κ) (1 ≤ κ ≤ m) ∈ li is the κ-th entry of li; performing the join Lκ−1 ⋈ Lκ−1, where the symbol "⋈" denotes the self-join of the item sets, i.e. different items are extracted from each pair of item sets in turn to generate a new item set; two members of Lκ−1 are joinable if their first (κ−2) items are the same, and joining l1 and l2 yields the item set {l1(1), l1(2), ..., l1(κ−1), l2(κ−1)};
4) Cκ is a superset of Lκ; by the same reasoning, the support of each candidate item set in Cκ is computed to obtain the frequent item set Lκ; if a candidate c ∈ Cκ has a (κ−1)-subset s such that
s ∉ Lκ−1
then
c ∉ Lκ
i.e. the new candidate cannot be a frequent item either, so the item can be removed from Cκ;
5) iterating repeatedly to obtain the frequent item sets Lκ until
Lκ = ∅
at which point the iteration ends; the confidence and lift results of the frequent item sets from 2-item sets up to κ-item sets are obtained, giving the data index item sets with implicit association relationships.
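The whole of step three can be sketched as follows (hypothetical data; a simplified reading of the claim, not the patented implementation): candidate 2-item sets are drawn only across different mutual-exclusion groups, since a within-group pair can never co-occur and need not be counted, and the standard Apriori self-join on a shared (κ−2)-prefix with subset pruning then iterates until no new frequent item set appears:

```python
from itertools import combinations

# Hypothetical transactions and mutual-exclusion groups Q' after step two.
transactions = [{"a", "x", "p"}, {"a", "x"}, {"b", "x", "p"}, {"a", "p"}]
Q_prime = [["a", "b"], ["x"], ["p"]]  # "a" and "b" are mutually exclusive
S0 = 0.5
n = len(transactions)

def support(itemset):
    return sum(itemset <= t for t in transactions) / n

# C2: pairs drawn from two *different* groups only; within-group pairs are
# mutually exclusive and are skipped without ever counting them.
C2 = [frozenset({i, j})
      for g1, g2 in combinations(Q_prime, 2) for i in g1 for j in g2]
L = {2: {c for c in C2 if support(c) >= S0}}

k = 3
while True:
    prev = sorted(sorted(c) for c in L[k - 1])
    # Self-join: merge two (k-1)-item sets sharing their first k-2 items.
    Ck = {frozenset(p1) | frozenset(p2)
          for p1, p2 in combinations(prev, 2) if p1[:k - 2] == p2[:k - 2]}
    # Prune: every (k-1)-subset of a candidate must already be frequent.
    Ck = {c for c in Ck
          if all(frozenset(s) in L[k - 1] for s in combinations(c, k - 1))}
    Lk = {c for c in Ck if support(c) >= S0}
    if not Lk:  # no new frequent item set: iteration ends
        break
    L[k] = Lk
    k += 1

print({size: sorted(sorted(c) for c in v) for size, v in L.items()})
```

On this toy data the cross-group restriction already spares the pair {a, b}; the join produces the single 3-item candidate {a, p, x}, which fails the support threshold, so mining stops at the frequent 2-item sets.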
CN202010050602.3A 2020-01-14 2020-01-14 Report data mining method for improving association rule based on mutual exclusion expression Pending CN111309777A (en)


Publications (1)

Publication Number: CN111309777A; Publication Date: 2020-06-19

Family ID: 71148851



Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200619)