CN110222090A - A method for mining frequent item sets from massive data
- Publication number
- CN110222090A (application CN201910477465.9A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2216/00—Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
- G06F2216/03—Data mining
Abstract
The present invention provides a method for mining frequent item sets from massive data, comprising: mining an original transaction data set T_O with a frequent item set mining algorithm to obtain all local frequent item sets corresponding to T_O; scanning T_O to compute the support count on T_O of each local frequent item set obtained, filtering the local frequent item sets to keep those whose support is not less than ω, and writing each retained local frequent item set together with its computed support count into a file F_qf; and reading a newly added transaction data set T_Δ, judging whether T_Δ is empty, and then mining frequent item sets according to whether T_Δ is empty. The invention reuses the file F_qf, the set ST_CAD and the array cnt_Δ throughout the mining process, which reduces the computational overhead to some extent and can thereby improve the speed of frequent item set mining.
Description
Technical Field
The invention relates to the technical field of data mining, and in particular to a method for mining frequent item sets from massive data.
Background
Frequent item set mining has long been one of the most active areas in data mining. It is widely used in practice, for example in data mining, software error detection, spatio-temporal data analysis and biological analysis. Because of its practical significance, frequent item set mining has attracted wide attention.
In the field of data storage, data is typically stored in a read-only/append-only mode, and the total transaction data set can be divided into two parts: the original transaction data set and the newly added transaction data set. At a certain time, or when a certain condition is met, the data in the newly added transaction data set is merged into the original transaction data set, so the original transaction data set grows and the newly added transaction data set is emptied. Newly written data then goes into the newly added transaction data set; when the time or condition is met again, that data is merged into the original transaction data set once more, the newly added transaction data set again waits for new data, and so on. Therefore, under read-only/append-only storage, the total transaction data set always consists of the original transaction data set and the newly added transaction data set.
Over the years, researchers at home and abroad have proposed many related algorithms. Existing algorithms fall into two categories: algorithms based on candidate generation and algorithms based on pattern growth. A candidate-generation algorithm first generates a candidate set, then verifies the candidates by scanning the database and identifies the frequent item sets among them. Such algorithms also exploit anti-monotonicity to prune the search space. However, they require multiple passes over the database, which incurs high I/O overhead on large amounts of data. A pattern-growth algorithm does not generate candidate sets directly; instead it preserves the necessary information about the frequent item sets in the database by constructing a special tree-based data structure. With this data structure, frequent item sets can be computed efficiently. However, constructing the data structure is complex, and when massive data is processed the memory required usually exceeds the memory available, so the data structure cannot be built correctly in memory.
Therefore, the invention provides a method for mining frequent item sets from massive data, which is used for mining frequent item sets from massive data stored in a read-only/append-only mode.
Disclosure of Invention
In view of the above defects in the prior art, the invention provides a method for mining frequent item sets from massive data stored in a read-only/append-only mode, so as to improve the speed of mining frequent item sets from massive data.
The invention provides a method for mining frequent item sets from massive data, used to mine the frequent item sets that satisfy a global minimum support minsup on a total transaction data set T, where minsup is the preset minimum support on T;
the total transaction data set T comprises an original transaction data set T_O and a newly added transaction data set T_Δ;
The method for mining frequent item sets from massive data comprises the following steps:
mining the original transaction data set T_O with a frequent item set mining algorithm to obtain all local frequent item sets corresponding to T_O;
scanning the original transaction data set T_O, computing the support count on T_O of each local frequent item set obtained above, filtering the local frequent item sets according to a local minimum support ω to keep each local frequent item set whose support is not less than ω, and writing each retained local frequent item set together with its computed support count into a file F_qf;
reading the newly added transaction data set T_Δ and judging whether T_Δ is empty:
if yes, filtering the local frequent item sets in the file F_qf according to the number n of transactions in the total transaction data set T and the global minimum support minsup, and obtaining and outputting the local frequent item sets whose support count is not less than the global minimum support count n × minsup; the local frequent item sets output are all the frequent item sets that satisfy minsup on T;
if not, mining frequent item sets on the total transaction data set T with an incremental updating method;
wherein the local minimum support ω is the preset minimum support on the original transaction data set T_O, and ω is less than the global minimum support minsup.
Further, the incremental updating method comprises the following steps:
scanning the newly added transaction data set T_Δ, computing the support count in T_Δ of the item sets in T_Δ, storing each item set in T_Δ together with its computed support count in T_Δ into an array cnt_Δ, and recording the maximum support count in cnt_Δ as mas_Δ;
scanning the current file F_qf and, for each local frequent item set scanned from F_qf, judging whether it satisfies the preset global minimum support minsup on the total transaction data set T: if yes, outputting the currently scanned local frequent item set and recording it as a first frequent item set; if no, judging, based on mas_Δ, whether the currently scanned local frequent item set certainly cannot be a frequent item set satisfying minsup on T; if it cannot be so excluded, writing the currently scanned local frequent item set and its support count into a set ST_CAD;
based on the array cnt_Δ, computing the support in the newly added transaction data set T_Δ of each local frequent item set in ST_CAD and updating its support count on the total transaction data set T accordingly, to obtain an updated set ST_CAD;
traversing the updated set ST_CAD and judging whether each local frequent item set traversed is a frequent item set satisfying the preset global minimum support minsup on the total transaction data set T; if yes, outputting that local frequent item set;
then judging whether the expression (n_O × ω - 1) + mas_Δ < n × minsup holds:
if yes, the mining of frequent item sets is finished;
otherwise, continuing to mine new frequent item sets on the newly added transaction data set T_Δ and outputting the new frequent item sets mined;
wherein a new frequent item set has support greater than zero on the original transaction data set T_O, satisfies the global minimum support minsup on the total transaction data set T, and differs from all frequent item sets already output;
n_O is the number of transactions in the original transaction data set T_O.
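By way of illustration only, the three-way judgment applied to each record of F_qf and the termination test can be sketched in Python as follows. The function names `classify_fqf` and `can_stop`, and the representation of F_qf as a list of (item set, support count) pairs, are assumptions made for this sketch and are not part of the method as claimed. The comparison operator `<` in `can_stop` is inferred from the pruning rule based on mas_Δ.

```python
from typing import Dict, FrozenSet, List, Tuple

Itemset = FrozenSet[str]


def classify_fqf(
    fqf: List[Tuple[Itemset, int]], n: int, minsup: float, mas_delta: int
) -> Tuple[List[Itemset], Dict[Itemset, int]]:
    """Split the records of F_qf into frequent item sets and candidates (ST_CAD)."""
    threshold = n * minsup  # global minimum support count
    frequent: List[Itemset] = []
    st_cad: Dict[Itemset, int] = {}
    for itemset, sup in fqf:
        if sup >= threshold:
            frequent.append(itemset)        # already frequent on T
        elif sup + mas_delta < threshold:
            continue                        # cannot become frequent, dropped
        else:
            st_cad[itemset] = sup           # undecided, kept as a candidate
    return frequent, st_cad


def can_stop(n_o: int, omega: float, mas_delta: int, n: int, minsup: float) -> bool:
    """True when no item set outside F_qf can reach the global threshold."""
    return (n_o * omega - 1) + mas_delta < n * minsup
```

An item set absent from F_qf has a support count on T_O of at most n_O × ω - 1 and a support count on T_Δ of at most mas_Δ, which is why the sum of the two bounds decides whether further mining on T_Δ is needed.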
Further, mining new frequent item sets on the newly added transaction data set T_Δ and outputting the new frequent item sets mined comprises:
computing the minimum support minsup_Δ of the newly added transaction data set T_Δ by a formula (the formula is not reproduced in this text), where n_Δ is the number of transactions in T_Δ;
splitting the transactions in the newly added transaction data set T_Δ into a plurality of target transaction data sets;
mining local frequent item sets on each target transaction data set with the Eclat algorithm, to obtain all local frequent item sets of each target transaction data set that satisfy the computed minimum support minsup_Δ;
recording the set of all local frequent item sets of the target transaction data sets as LF_Δ; traversing LF_Δ and deleting the item sets that already appear in the file F_qf, to obtain a candidate set GF_Δ; and, based on the array cnt_Δ, storing in GF_Δ the support count in T_Δ of each local frequent item set in GF_Δ;
scanning the current original transaction data set T_O and adding to the support count of each local frequent item set in GF_Δ its support count on T_O, to obtain a new candidate set GF_Δ;
scanning the new candidate set GF_Δ and judging whether each local frequent item set scanned is a frequent item set satisfying the preset global minimum support minsup; if yes, outputting that local frequent item set.
Further, mining local frequent item sets on each target transaction data set with the Eclat algorithm, to obtain all local frequent item sets of each target transaction data set that satisfy the computed minimum support minsup_Δ, comprises:
P0, traversing each target transaction data set;
P1, for the currently traversed target transaction data set:
P11, computing the candidate frequent k-item sets of the currently traversed target transaction data set with the Eclat algorithm and, when a generated candidate frequent k-item set satisfies the computed minimum support minsup_Δ, recording it as a frequent k-item set and storing it in the set LF_{k,Δ}, where k ≥ 1;
P12, generating candidate frequent (k+1)-item sets by merging pairs of frequent k-item sets in LF_{k,Δ} and, when a generated candidate frequent (k+1)-item set satisfies the computed minimum support minsup_Δ, recording it as a frequent (k+1)-item set and storing it in the set LF_{k+1,Δ}, where the two frequent k-item sets merged have the same first k-1 items and different last items;
P13, repeating steps P11 to P12, increasing k by 1 each time, until no new candidate frequent item set can be generated for the currently traversed target transaction data set; then executing step P14;
P14, moving on to the next target transaction data set and repeating steps P11 to P13 until all target transaction data sets have been traversed, thereby obtaining all local frequent item sets of all target transaction data sets that satisfy the computed minimum support minsup_Δ.
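Steps P11 to P13 for a single target transaction data set can be sketched as follows. This is a minimal Eclat-style illustration, not the embodiment itself: the function name `eclat` is hypothetical, item sets are represented as sorted tuples, and the threshold is passed as an absolute count `min_count` (assumed here to correspond to minsup_Δ multiplied by the number of transactions in the target data set).

```python
from typing import Dict, List, Set, Tuple

Itemset = Tuple[str, ...]


def eclat(transactions: List[Set[str]], min_count: int) -> Dict[Itemset, int]:
    """Level-wise Eclat on a vertical layout (item set -> ids of supporting transactions)."""
    level: Dict[Itemset, Set[int]] = {}
    for tid, transaction in enumerate(transactions):
        for item in transaction:
            level.setdefault((item,), set()).add(tid)
    # P11 for k = 1: keep only the frequent 1-item sets.
    level = {s: tids for s, tids in level.items() if len(tids) >= min_count}

    frequent: Dict[Itemset, int] = {}
    while level:  # P13: stop when no new item set is generated
        frequent.update({s: len(tids) for s, tids in level.items()})
        keys = sorted(level)
        next_level: Dict[Itemset, Set[int]] = {}
        for i, a in enumerate(keys):
            for b in keys[i + 1:]:
                if a[:-1] != b[:-1]:
                    break  # sorted order keeps equal prefixes contiguous
                # P12: join two k-item sets that share their first k-1 items.
                tids = level[a] & level[b]
                if len(tids) >= min_count:
                    next_level[a + (b[-1],)] = tids
        level = next_level
    return frequent
```

The support count of a merged item set is obtained by intersecting the transaction-id sets of its two parents, so the target data set is read only once, when the vertical layout is built.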
Further, between step P11 and step P12 the method includes a step S: pruning the set LF_{k,Δ};
wherein pruning the set LF_{k,Δ} comprises:
grouping the frequent k-item sets in LF_{k,Δ} according to whether their first (k-1) items are the same, to obtain a corresponding number of item set groups, the frequent k-item sets in one item set group sharing the same first (k-1) items;
counting the number of frequent k-item sets in each item set group and judging whether the number equals 1: if yes, deleting that item set group and deleting from LF_{k,Δ} the item set identical to the frequent k-item set in that group;
for each item set group that remains, judging whether the union of any two frequent k-item sets in the group is contained in the file F_qf: if yes, deleting that item set group and deleting from LF_{k,Δ} the item sets identical to the frequent k-item sets in that group.
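The first pruning rule of step S can be sketched as follows. The sketch covers only the deletion of groups that contain a single frequent k-item set; the check against the file F_qf is left out. The function name `prune_singleton_groups` and the representation of item sets as sorted tuples are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Itemset = Tuple[str, ...]


def prune_singleton_groups(lf_k: List[Itemset]) -> List[Itemset]:
    """Drop every k-item set whose (k-1)-prefix group contains only that item set."""
    groups: Dict[Itemset, List[Itemset]] = defaultdict(list)
    for itemset in lf_k:
        groups[itemset[:-1]].append(itemset)  # group by the first k-1 items
    # A group of size 1 has no partner for the join in step P12.
    return [s for group in groups.values() if len(group) > 1 for s in group]
```

A k-item set that is alone in its group cannot take part in any merge of step P12, so removing it before the join reduces the work of that step.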
Further, before computing and updating the support counts of the local frequent item sets in ST_CAD based on the array cnt_Δ to obtain the updated set ST_CAD, the method also includes pruning the set ST_CAD;
pruning the set ST_CAD comprises a first-stage reduction step;
the first-stage reduction step comprises:
traversing the file F_qf and the array cnt_Δ and, for each 1-item set traversed in F_qf, computing the sum of its support count in F_qf and the support count of the same 1-item set in cnt_Δ; if the computed sum is smaller than the global minimum support count n × minsup, removing the corresponding item set traversed in F_qf from the set ST_CAD.
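One possible reading of the first-stage reduction is sketched below. It assumes that, when a 1-item set fails the combined threshold, every candidate in ST_CAD containing that item is removed, which follows from the anti-monotonicity of support; the text itself does not spell this out. The names `first_stage_prune`, `item_sup_o` (support counts of the 1-item sets recorded in F_qf) and the dictionary representations are hypothetical.

```python
from typing import Dict, FrozenSet

Itemset = FrozenSet[str]


def first_stage_prune(
    st_cad: Dict[Itemset, int],
    item_sup_o: Dict[str, int],
    cnt_delta: Dict[str, int],
    n: int,
    minsup: float,
) -> Dict[Itemset, int]:
    """Remove candidates containing an item that cannot be frequent on T."""
    threshold = n * minsup  # global minimum support count
    # Items whose total support (F_qf count plus cnt_delta count) stays below the threshold.
    weak = {
        item
        for item, sup in item_sup_o.items()
        if sup + cnt_delta.get(item, 0) < threshold
    }
    # Assumption: a candidate is dropped if it contains any such item.
    return {s: c for s, c in st_cad.items() if not (s & weak)}
```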
Further, pruning the set ST_CAD also comprises a second-stage reduction step;
the second-stage reduction step comprises:
building a PIP array;
traversing the set ST_CAD reduced by the first-stage reduction step and, for each local frequent item set in it, selecting the two items with the smallest support counts in that item set to form the item pair of that item set, and storing the item pair in the PIP array;
computing the support count in the newly added transaction data set T_Δ of each item pair in the PIP array;
comparing, for each item pair in the PIP array, the sum of the item pair's support count in T_Δ and the corresponding item set's support count in T_Δ with the global minimum support count n × minsup, and deleting from the set ST_CAD the local frequent item set corresponding to each item pair for which the sum is smaller than n × minsup.
Further, when the newly added transaction data set T_Δ is updated, the original transaction data set T_O is updated to the union of the original T_O and the original T_Δ, and the updated T_Δ is not empty, the method for mining frequent item sets from massive data further comprises an update mining step;
the update mining step comprises:
updating the number n of transactions in the total transaction data set T to the sum of the numbers of transactions in the original transaction data set T_O and the original newly added transaction data set T_Δ;
obtaining all the local frequent item sets of the original transaction data set T_O obtained above, computing the support count on T_O of each of them, and writing the results into a file F_qf,O;
based on the array cnt_Δ of the original newly added transaction data set T_Δ, adding to each local frequent item set in F_qf,O its support count in the original T_Δ; then filtering the local frequent item sets in F_qf,O according to the local minimum support ω, to obtain each local frequent item set whose support on the updated total transaction data set T is not less than ω;
then writing each local frequent item set so obtained, together with its support count in F_qf,O, into a new file F_qf;
obtaining the set LF_Δ of the original newly added transaction data set T_Δ and deleting from it the item sets present in the file F_qf,O, to obtain a new set LF_Δ;
based on the array cnt_Δ of the original newly added transaction data set T_Δ, writing into the new set LF_Δ the support count in the original T_Δ of every item set in the new set LF_Δ;
then filtering the item sets in the new set LF_Δ according to the local minimum support ω, obtaining the item sets whose support is not less than ω, and writing them, together with their support counts in the new set LF_Δ, into the new file F_qf;
thereafter replacing the original file F_qf with the new file F_qf, replacing the original transaction data set T_O with the union of the original T_O and the original T_Δ, replacing the original T_Δ with the updated T_Δ, and mining frequent item sets on the updated total transaction data set T with the incremental updating method, based on the number n of transactions in the updated T.
Further, filtering the local frequent item sets in the file F_qf according to the number n of transactions in the total transaction data set T and the global minimum support minsup, to obtain the local frequent item sets whose support count is not less than the global minimum support count n × minsup, specifically comprises:
scanning the file F_qf sequentially;
judging whether the support count of each local frequent item set scanned in F_qf is greater than or equal to the global minimum support count n × minsup:
if yes, the scanned local frequent item set is a local frequent item set whose support count is not less than the global minimum support count n × minsup.
Further, the step of scanning the current file F_qf, judging whether each scanned local frequent item set satisfies the preset global minimum support minsup on the total transaction data set T, outputting it as a first frequent item set if it does, and otherwise judging, based on mas_Δ, whether it certainly cannot be frequent on T and, if it cannot be so excluded, writing it with its support count into the set ST_CAD, specifically comprises:
scanning the file F_qf sequentially;
for each local frequent item set scanned in F_qf, judging whether its support count is greater than or equal to the global minimum support count n × minsup:
if yes, outputting the scanned local frequent item set, which is a frequent item set satisfying the preset global minimum support minsup on the total transaction data set T;
if no, judging whether the sum of the support count of the scanned local frequent item set in F_qf and mas_Δ is less than the global minimum support count n × minsup: if yes, the currently scanned local frequent item set is not a frequent item set satisfying minsup on T; otherwise, writing the currently scanned local frequent item set and its scanned support count into the set ST_CAD.
The invention has the beneficial effects that:
(1) The method uses a file F_qf, a set ST_CAD and an array cnt_Δ and reuses them throughout the mining process, which to some extent avoids the need to access the original transaction data set T_O and the newly added transaction data set T_Δ and reduces the computational cost, so that the speed of frequent item set mining can be improved to some extent.
(2) The method includes a pruning step for the set ST_CAD, so that ST_CAD is further reduced before it is used in subsequent computation, thereby reducing I/O overhead and computational overhead.
(3) The method provides a specific incremental update strategy that uses existing computed information, such as the array cnt_Δ and the set LF_Δ, to accelerate the update operation, thereby improving the performance and practicality of frequent item set mining on massive data.
In addition, the invention is reliable in design principle, simple in structure, and has very wide application prospects.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in describing them are briefly introduced below. Those skilled in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a schematic flow diagram of a method of one embodiment of the invention.
Detailed Description
In order that those skilled in the art may better understand the technical solution of the present invention, the technical solution in the embodiment of the present invention is described clearly and completely below with reference to the drawings. The described embodiment is only a part of the embodiments of the present invention, not all of them. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort fall within the protection scope of the present invention.
Example 1:
FIG. 1 is a schematic flow diagram of a method of one embodiment of the invention. The execution subject in FIG. 1 may be a computing node, a server, or an ordinary PC. The method is used to mine the frequent item sets that satisfy the global minimum support minsup on a total transaction data set T, where minsup is the preset minimum support on T and T comprises an original transaction data set T_O and a newly added transaction data set T_Δ.
Referring to fig. 1, the method for mining the mass data frequent item set includes:
Step 110, mining the original transaction data set T_O with a frequent item set mining algorithm to obtain all local frequent item sets corresponding to T_O.
In a specific implementation, the transactions of the original transaction data set T_O can be read sequentially and placed in a memory buffer; an existing frequent item set mining algorithm is then run on the data in the buffer to compute local frequent item sets, which are stored in a file F. The buffer is then emptied, the next transactions of T_O are read sequentially for the next iteration, and the local frequent item sets computed are again appended to the file F. This process is repeated until all transactions in T_O have been read, at which point all local frequent item sets corresponding to T_O have been generated and stored in the file F.
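The buffered pre-calculation can be sketched as follows. Several points are assumptions made for this sketch: the per-buffer threshold is taken to be ω times the buffer size (the text does not state which threshold the buffer-level mining uses), the file F is modelled as an in-memory set, and `naive_miner` is a deliberately simple stand-in for "an existing frequent item set mining algorithm" that enumerates item sets only up to a small length.

```python
from collections import Counter
from itertools import combinations, islice
from typing import Callable, Iterable, List, Set, Tuple

Itemset = Tuple[str, ...]


def naive_miner(buffer: List[Set[str]], min_count: float, max_len: int = 3) -> Set[Itemset]:
    """Stand-in miner: count every item set of up to max_len items in the buffer."""
    counts: Counter = Counter()
    for transaction in buffer:
        items = sorted(transaction)
        for k in range(1, min(max_len, len(items)) + 1):
            counts.update(combinations(items, k))
    return {s for s, c in counts.items() if c >= min_count}


def precompute(
    transactions: Iterable[Set[str]],
    buffer_size: int,
    omega: float,
    miner: Callable[[List[Set[str]], float], Set[Itemset]] = naive_miner,
) -> Set[Itemset]:
    """Read T_O buffer by buffer and collect the local frequent item sets."""
    source = iter(transactions)
    local: Set[Itemset] = set()  # plays the role of the file F
    while True:
        buffer = list(islice(source, buffer_size))  # fill the memory buffer
        if not buffer:
            break  # all transactions of T_O have been read
        local |= miner(buffer, omega * len(buffer))
    return local
```

Because only one buffer is held in memory at a time, the memory footprint is bounded by the buffer size rather than by the size of T_O.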
For convenience of description, the step corresponding to step 110 is referred to as a pre-calculation stage.
Step 120, scanning the original transaction data set T_O, computing the support count on T_O of each local frequent item set obtained in step 110, filtering the local frequent item sets according to the local minimum support ω to keep each local frequent item set whose support is not less than ω, and writing each retained local frequent item set together with its computed support count into the file F_qf.
In a specific implementation, all local frequent item sets in the file F are first read into memory; the original transaction data set T_O is then scanned sequentially to compute the support count of each local frequent item set held in memory; finally, the local frequent item sets in memory are filtered according to the local minimum support ω, and each local frequent item set whose support is not less than ω is stored in the file F_qf.
The structure of the file F_qf is denoted F_qf(IS, SUP), where IS denotes an item set and SUP denotes the support count of the item set IS; the local frequent item sets in the file F_qf are sorted in descending order of support count.
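The counting, filtering and sorting of step 120 can be sketched as follows. The function name `purify` is hypothetical, the file F_qf is modelled as a list of (IS, SUP) records, and ω is interpreted as a relative support, so the count threshold is taken to be ω multiplied by the number of transactions in T_O.

```python
from typing import Dict, Iterable, List, Set, Tuple

Itemset = Tuple[str, ...]


def purify(
    local_itemsets: Iterable[Itemset], t_o: Iterable[Set[str]], omega: float
) -> List[Tuple[Itemset, int]]:
    """Build the records of F_qf(IS, SUP) with one sequential scan of T_O."""
    counts: Dict[Itemset, int] = {s: 0 for s in local_itemsets}
    n_o = 0
    for transaction in t_o:
        n_o += 1
        for itemset in counts:
            if transaction.issuperset(itemset):
                counts[itemset] += 1
    # Keep item sets whose support is not less than omega (assumed relative to n_o).
    records = [(s, c) for s, c in counts.items() if c >= omega * n_o]
    records.sort(key=lambda rec: rec[1], reverse=True)  # descending support count
    return records
```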
For convenience of description, the step corresponding to step 120 will be referred to as the purification stage.
Step 130, reading the newly added transaction data set T_Δ and judging whether T_Δ is empty:
if yes, filtering the local frequent item sets in the file F_qf according to the number n of transactions in the total transaction data set T and the global minimum support minsup, and obtaining and outputting the local frequent item sets whose support count is not less than the global minimum support count n × minsup; the local frequent item sets output are all the frequent item sets that satisfy minsup on T;
if not, mining frequent item sets on the total transaction data set T with an incremental updating method;
wherein the local minimum support ω is the preset minimum support on the original transaction data set T_O, and ω is less than the global minimum support minsup.
Wherein filtering the local frequent item sets in the file F_qf according to the number n of transactions in the total transaction data set T and the global minimum support minsup, to obtain the local frequent item sets whose support count is not less than the global minimum support count n × minsup, specifically comprises the following steps:
sequentially scanning the file F_qf;
judging, for each scanned local frequent item set in the file F_qf, whether its support count is greater than or equal to the global minimum support count n × minsup:
if yes, the scanned local frequent item set is a local frequent item set whose support count is not less than n × minsup, that is, a frequent item set on the total transaction data set T.
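A minimal sketch of this filter is given below. The function name and the list-of-records representation of F_qf are assumptions. The early exit is not stated in the text; it is an optimization that follows from F_qf being sorted in descending order of support count.

```python
from typing import FrozenSet, List, Tuple

Record = Tuple[FrozenSet[str], int]      # (IS, SUP)


def globally_frequent(f_qf: List[Record], n: int, minsup: float) -> List[Record]:
    """Return the records of F_qf whose support count is >= n * minsup.

    Assumes f_qf is sorted by descending support count, so scanning can
    stop at the first record that falls below the threshold.
    """
    threshold = n * minsup
    result = []
    for itemset, sup in f_qf:
        if sup < threshold:
            break                        # every later record is smaller still
        result.append((itemset, sup))
    return result
```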
Wherein, when the newly added transaction data set T_Δ is read, a transaction in T_Δ is read first and the items in that transaction are then read; after all items in the transaction have been read, reading moves on to the next transaction in T_Δ. Here t_Δ denotes the transaction of T_Δ currently being read and i denotes an item in t_Δ; each time an item i is read from T_Δ, the count of i (initial value 0) is incremented by 1.
Preferably, the incremental updating method (that is, the step of mining the frequent item sets on the total transaction data set T by the incremental updating method) comprises the following steps:
scanning the newly added transaction data set T_Δ, calculating the support count, in T_Δ, of each item set in T_Δ, storing each item set in T_Δ together with its calculated support count in the array cnt_Δ, and recording the maximum support count in the array cnt_Δ as mas_Δ;
sequentially scanning the current file F_qf and, for each local frequent item set scanned from F_qf, judging whether it satisfies the preset global minimum support minsup on the total transaction data set T; if yes, outputting the currently scanned local frequent item set and recording it as a first frequent item set; if no, judging, based on mas_Δ, whether the currently scanned local frequent item set is certainly not a frequent item set satisfying minsup on the total transaction data set T: if it cannot be excluded in this way, writing the currently scanned local frequent item set and its support count into the set ST_CAD;
based on the array cnt_Δ, calculating, for each local frequent item set in the set ST_CAD, its support count on the total transaction data set T and updating the set accordingly, to obtain the updated set ST_CAD;
traversing the updated set ST_CAD and judging whether each traversed local frequent item set is a frequent item set satisfying the preset global minimum support minsup on the total transaction data set T; if yes, outputting that local frequent item set;
then judging whether the expression (n_O × ω − 1) + mas_Δ < n × minsup holds:
if yes, finishing the mining of the frequent item sets;
otherwise, continuing to mine new frequent item sets from the newly added transaction data set T_Δ and outputting the mined new frequent item sets;
wherein a new frequent item set has a support greater than zero on the original transaction data set T_O, satisfies the global minimum support minsup on the total transaction data set T, and differs from all the frequent item sets already output;
n_O is the number of transactions in the original transaction data set T_O.
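The counting pass over T_Δ and the termination test can be illustrated as below. This is a sketch under stated assumptions: cnt_Δ is simplified to a counter of single items (the text stores item sets), the comparison operator is taken to be "<" because the operator is not legible in the source text, and the helper names are hypothetical. The reasoning behind the test is that an item set absent from F_qf occurs fewer than n_O × ω times in T_O and at most mas_Δ times in T_Δ.

```python
from collections import Counter
from typing import Iterable, Set


def count_items(t_delta: Iterable[Set[str]]) -> Counter:
    """Build a simplified cnt_delta: item -> support count in T_delta."""
    cnt = Counter()
    for transaction in t_delta:          # read one transaction at a time
        for item in transaction:         # then each item of that transaction
            cnt[item] += 1               # counts start at 0 and grow by 1
    return cnt


def can_stop(n_o: int, omega: float, mas_delta: int,
             n: int, minsup: float) -> bool:
    """Assumed test: (n_O * omega - 1) + mas_delta < n * minsup."""
    return (n_o * omega - 1) + mas_delta < n * minsup


cnt_delta = count_items([{"a", "b"}, {"a"}])
mas_delta = max(cnt_delta.values())      # largest support count in cnt_delta
```

With n_O = 100, ω = 0.1, minsup = 0.2 and the two-transaction T_Δ above (n = 102), the left-hand side is 9 + 2 = 11, which is below 20.4, so no further mining of T_Δ would be needed.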
In this embodiment, the step of sequentially scanning the current file F_qf, outputting as first frequent item sets the scanned local frequent item sets that satisfy the preset global minimum support minsup on the total transaction data set T, and writing into the set ST_CAD, based on mas_Δ, the scanned local frequent item sets that cannot be excluded together with their support counts, specifically comprises the following steps:
sequentially scanning the file F_qf;
judging, for each scanned local frequent item set in the file F_qf, whether its support count is greater than or equal to the global minimum support count n × minsup:
if yes, outputting the scanned local frequent item set, the output local frequent item set being a frequent item set satisfying the preset global minimum support minsup on the total transaction data set T;
if not, judging whether the sum of the support count of the scanned local frequent item set in F_qf and mas_Δ is less than the global minimum support count n × minsup: if yes, the currently scanned local frequent item set is certainly not a frequent item set satisfying the preset global minimum support minsup on the total transaction data set T; otherwise, writing the currently scanned local frequent item set and its support count into the set ST_CAD.
Note that, in the present invention, the local frequent item sets in the file F_qf are divided into three parts: (1) item sets that certainly are frequent item sets of the total transaction data set T, (2) item sets that certainly are not frequent item sets of the total transaction data set T, and (3) item sets that may be frequent item sets of the total transaction data set T. Let t denote the currently read local frequent item set. If t satisfies the global minimum support minsup, t is a frequent item set of the total transaction data set T. If t still cannot satisfy the global minimum support minsup even on the assumption that every transaction in T_Δ contains t, then t is not a frequent item set of the total transaction data set T. In all other cases t may be a frequent item set of the total transaction data set T and requires further verification; the present invention saves such a t in the set ST_CAD. The frequent item sets of the total transaction data set T are thus mined class by class, which improves the mining efficiency to a certain extent.
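The three-way split can be sketched in a few lines. The function name and the in-memory containers are assumptions for the example; the two tests applied to each record are the ones described above (support count against n × minsup, then support count plus mas_Δ against n × minsup).

```python
from typing import Dict, FrozenSet, List, Tuple

Itemset = FrozenSet[str]


def classify(f_qf: List[Tuple[Itemset, int]], mas_delta: int,
             n: int, minsup: float):
    """Split F_qf into (certainly frequent, candidates ST_CAD, certainly not)."""
    threshold = n * minsup
    frequent: List[Itemset] = []
    st_cad: Dict[Itemset, int] = {}
    rejected: List[Itemset] = []
    for itemset, sup in f_qf:
        if sup >= threshold:
            frequent.append(itemset)     # already frequent on T
        elif sup + mas_delta < threshold:
            rejected.append(itemset)     # cannot reach the threshold via T_delta
        else:
            st_cad[itemset] = sup        # undecided: needs its count in T_delta
    return frequent, st_cad, rejected
```

For example, with n = 120, minsup = 0.2 (threshold 24) and mas_Δ = 5, supports of 30, 22 and 12 fall into the first, third and second class respectively.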
Preferably, in this embodiment, mining new frequent item sets from the newly added transaction data set T_Δ and outputting the mined new frequent item sets comprises the following steps:
calculating the minimum support minsup_Δ of the newly added transaction data set T_Δ by the formula [formula not reproduced in the source text], where n_Δ is the number of transactions in the newly added transaction data set T_Δ, n is the number of transactions in the total transaction data set T, and n_O is the number of transactions in the original transaction data set T_O;
splitting the transactions in the newly added transaction data set T_Δ into a plurality of target transaction data sets;
performing local frequent item set mining on each target transaction data set by the Eclat algorithm, to obtain all the local frequent item sets of each target transaction data set that satisfy the calculated minimum support minsup_Δ;
denoting the collection of all local frequent item sets of the target transaction data sets as LF_Δ, traversing LF_Δ and deleting from it the item sets that already appear in the file F_qf, to obtain a candidate set GF_Δ; and, based on the array cnt_Δ, adding to each local frequent item set in the candidate set GF_Δ its support count in the newly added transaction data set T_Δ and storing that count in the candidate set GF_Δ;
scanning the original transaction data set T_O and adding to each local frequent item set in the candidate set GF_Δ its support count on T_O, to obtain a new candidate set GF_Δ;
scanning the new candidate set GF_Δ and judging whether each scanned local frequent item set in it is a frequent item set satisfying the preset global minimum support minsup; if yes, outputting that local frequent item set.
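The candidate-set steps (from LF_Δ to the output of new frequent item sets) can be sketched as follows. It is an illustration under assumptions: LF_Δ is passed in as a mapping from item set to its support count in T_Δ (standing in for the lookup in cnt_Δ), F_qf is reduced to the set of its item sets, and the function name is hypothetical.

```python
from typing import Dict, FrozenSet, List, Set

Itemset = FrozenSet[str]


def mine_new_frequent(lf_delta: Dict[Itemset, int],
                      f_qf_itemsets: Set[Itemset],
                      t_o: List[Set[str]],
                      n: int, minsup: float) -> Dict[Itemset, int]:
    """Return new frequent item sets with their support counts on T."""
    # GF_delta: item sets of LF_delta not already recorded in F_qf,
    # initialised with their support counts in T_delta.
    gf_delta = {s: sup for s, sup in lf_delta.items()
                if s not in f_qf_itemsets}
    # One scan of T_O adds the support counts on the original data.
    for transaction in t_o:
        for s in gf_delta:
            if s <= transaction:
                gf_delta[s] += 1
    # Keep only the item sets that reach the global count threshold.
    threshold = n * minsup
    return {s: sup for s, sup in gf_delta.items() if sup >= threshold}
```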
In this embodiment, performing local frequent item set mining on each target transaction data set by the Eclat algorithm, to obtain all the local frequent item sets of each target transaction data set that satisfy the calculated minimum support minsup_Δ, comprises:
P0: traversing each target transaction data set;
P1: for the currently traversed target transaction data set:
P11: calculating, by the Eclat algorithm, the candidate frequent k-item sets of the currently traversed target transaction data set and, when a generated candidate frequent k-item set satisfies the calculated minimum support minsup_Δ, recording it as a frequent k-item set and storing it in the set LF_{k,Δ}, where k ≥ 1;
P12: generating a candidate frequent (k+1)-item set by merging two frequent k-item sets in LF_{k,Δ} and, when the generated candidate frequent (k+1)-item set satisfies the calculated minimum support minsup_Δ, recording it as a frequent (k+1)-item set and storing it in the set LF_{k+1,Δ}, wherein the first k−1 items of the two frequent k-item sets are the same and their last items are different;
P13: repeating steps P11 to P12, increasing k by 1 each time, until no new candidate frequent item set can be generated for the currently traversed target transaction data set; then executing step P14;
P14: continuing to traverse the next target transaction data set and repeating steps P11 to P13 until all target transaction data sets have been traversed, thereby obtaining all the local frequent item sets of each target transaction data set that satisfy the calculated minimum support minsup_Δ.
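Steps P11 to P13 for a single target transaction data set can be sketched as a level-wise Eclat over a vertical (item to transaction-id set) layout. This is a simplified illustration, not the patented procedure: item sets are represented as sorted tuples, the threshold is passed as an absolute count `min_count` rather than the ratio minsup_Δ, and the function name is an assumption.

```python
from typing import Dict, List, Set, Tuple

Itemset = Tuple[str, ...]                # items kept in sorted order


def eclat_levelwise(transactions: List[Set[str]],
                    min_count: int) -> Dict[Itemset, int]:
    """Return every item set with support count >= min_count."""
    # Vertical layout: 1-item set -> ids of the transactions containing it.
    tidsets: Dict[Itemset, Set[int]] = {}
    for tid, transaction in enumerate(transactions):
        for item in transaction:
            tidsets.setdefault((item,), set()).add(tid)
    level = {s: t for s, t in tidsets.items() if len(t) >= min_count}

    result: Dict[Itemset, int] = {}
    while level:                         # stops when no new item set is generated
        result.update({s: len(t) for s, t in level.items()})
        keys = sorted(level)
        next_level: Dict[Itemset, Set[int]] = {}
        for i, a in enumerate(keys):
            for b in keys[i + 1:]:
                if a[:-1] != b[:-1]:
                    break                # sorted keys: same-prefix sets are adjacent
                tids = level[a] & level[b]   # support by tid-set intersection
                if len(tids) >= min_count:
                    next_level[a + (b[-1],)] = tids
        level = next_level
    return result
```

Step P14 would then amount to calling this function once per target transaction data set and collecting the results into LF_Δ.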
Note that, referring to FIG. 1:
each "t" in the figure is the local frequent item set of the file F_qf currently scanned in the corresponding step, and each "t.SUP" is the support count of that local frequent item set t in the file F_qf;
"s" in the figure is the item set of the updated set ST_CAD currently scanned in the corresponding step, and "s.sup" is the support count of the item set s in the updated set ST_CAD;
"r" in the figure is the item set of the new candidate set GF_Δ currently scanned in the corresponding step, and "r.sup" is the support count of the item set r in the new candidate set GF_Δ.
In addition, it can be seen from FIG. 1 that:
when the newly added transaction data set T_Δ is judged to be empty, the item sets t output in FIG. 1 are all the frequent item sets on the total transaction data set T mined by the method of the present invention;
when the newly added transaction data set T_Δ is judged to be non-empty, the item sets t, the item sets s and the item sets r output in FIG. 1 together are all the frequent item sets on the total transaction data set T mined by the method of the present invention.
In addition, in a specific implementation of the present invention, the values of minsup and ω may be selected by a person skilled in the art according to empirical values; for example, minsup may be 0.2 and ω may be 0.1.
In summary, the method for mining frequent item sets of mass data provided by the invention uses the file F_qf, the set ST_CAD and the array cnt_Δ, and reuses them throughout the mining process. This avoids, to some extent, repeated scans of the original transaction data set T_O and the newly added transaction data set T_Δ and reduces the computational cost, so that the mining speed of the frequent item sets can be improved to a certain extent.
Example 2:
Compared with embodiment 1, the difference is that the method for mining frequent item sets of mass data described in embodiment 2 further includes, between step P11 and step P12, a step S of pruning the set LF_{k,Δ};
wherein the step of pruning the set LF_{k,Δ} comprises:
grouping the frequent k-item sets in the set LF_{k,Δ} according to whether their first (k−1) items are the same, to obtain a corresponding number of item set groups, the frequent k-item sets in the same item set group having the same first (k−1) items;
counting the number of frequent k-item sets in each item set group and judging whether the counted number is equal to 1: if yes, deleting the corresponding item set group (that is, the group in which the number of frequent k-item sets is equal to 1) and deleting from the set LF_{k,Δ} the item sets identical to the frequent k-item sets in that group;
judging, for each currently remaining item set group, whether the union of any two frequent k-item sets in the group is contained in the file F_qf: if yes, deleting the corresponding item set group and deleting from the set LF_{k,Δ} the item sets identical to the frequent k-item sets in that group.
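The pruning of LF_{k,Δ} can be sketched as below. Two points are assumptions rather than statements of the text: item sets are represented as sorted tuples so that the first k−1 items form a prefix, and "the union of any two frequent k-item sets" is read as "every pairwise union", meaning a group is dropped only when it can produce no (k+1)-item set that is missing from F_qf.

```python
from collections import defaultdict
from itertools import combinations
from typing import FrozenSet, List, Set, Tuple

Itemset = Tuple[str, ...]                # items kept in sorted order


def prune_level(lf_k: List[Itemset],
                f_qf_itemsets: Set[FrozenSet[str]]) -> List[Itemset]:
    """Drop the k-item sets that cannot contribute a new (k+1)-item set."""
    groups = defaultdict(list)
    for itemset in lf_k:
        groups[itemset[:-1]].append(itemset)     # group by first k-1 items

    kept: List[Itemset] = []
    for members in groups.values():
        if len(members) == 1:
            continue                     # a lone set has no merge partner
        unions = [frozenset(a) | frozenset(b)
                  for a, b in combinations(members, 2)]
        if all(u in f_qf_itemsets for u in unions):
            continue                     # every merge result is already in F_qf
        kept.extend(members)
    return kept
```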
In addition, in order to further increase the mining speed of the present invention, this embodiment further includes a step of pruning the set ST_CAD before calculating, based on the array cnt_Δ, the support count on the total transaction data set T of each local frequent item set in the set ST_CAD to obtain the updated set ST_CAD;
the step of pruning the set ST_CAD comprises a first-stage reduction step;
the first-stage reduction step comprises:
traversing the file F_qf and the array cnt_Δ and, for each traversed 1-item set in the file F_qf, calculating the sum of its support count in F_qf and the support count of the same 1-item set in the array cnt_Δ; if the calculated sum is smaller than the global minimum support count n × minsup, removing the traversed 1-item set from the set ST_CAD.
In addition, to further increase the mining speed of the present invention, the step of pruning the set ST_CAD further comprises a second-stage reduction step;
the second-stage reduction step comprises:
building a PIP array;
traversing the set ST_CAD reduced by the first-stage reduction step and, for each local frequent item set in it, selecting the two items with the smallest support counts in that item set to form the item pair of the item set, and storing the item pair in the PIP array;
calculating the support count of each item pair of the PIP array in the newly added transaction data set T_Δ;
comparing, for each item pair in the PIP array, the sum of the support count of the item pair in the newly added transaction data set T_Δ and the support count of the corresponding item set in the newly added transaction data set T_Δ with the global minimum support count n × minsup, and deleting from the set ST_CAD the local frequent item set corresponding to each item pair for which the sum is smaller than the global minimum support count n × minsup.
In summary, the method for mining frequent item sets of mass data provided by the invention further comprises a step of pruning the set ST_CAD, so that ST_CAD is further reduced before it is used in subsequent computations, thereby reducing I/O overhead and computational overhead.
Example 3:
Compared with embodiment 2, the difference is that, in the method for mining frequent item sets of mass data described in embodiment 3, when the newly added transaction data set T_Δ is updated, the original transaction data set T_O is updated to the sum of the original transaction data set T_O and the original newly added transaction data set T_Δ, and the updated newly added transaction data set T_Δ is not empty, the method further comprises a step of update mining.
Specifically, the step of update mining in this embodiment comprises:
updating the number n of transactions in the total transaction data set T to the sum of the numbers of transactions in the original transaction data set T_O and the original newly added transaction data set T_Δ;
obtaining all the local frequent item sets of the original transaction data set T_O obtained above, calculating the support count of each of them on the original transaction data set T_O, and writing them into the file F_{qf,O};
based on the array cnt_Δ of the original newly added transaction data set T_Δ, adding to each local frequent item set in the file F_{qf,O} its support count in the original newly added transaction data set T_Δ; then filtering the local frequent item sets in the file F_{qf,O} according to the local minimum support ω, to obtain each local frequent item set of the updated total transaction data set T whose support is not less than ω;
then writing each obtained local frequent item set whose support is not less than ω, together with its support count in the file F_{qf,O}, into a new file F_qf;
obtaining the set LF_Δ of the original newly added transaction data set T_Δ and deleting from it the item sets that appear in the file F_{qf,O}, to obtain a new set LF_Δ;
based on the array cnt_Δ of the original newly added transaction data set T_Δ, writing into the new set LF_Δ the support count of each of its item sets in the original newly added transaction data set T_Δ;
then filtering the item sets in the new set LF_Δ according to the local minimum support ω, obtaining the item sets whose support is not less than ω, and writing those item sets together with their support counts in the new set LF_Δ into the new file F_qf;
thereafter replacing the original file F_qf with the new file F_qf, replacing the original transaction data set T_O with the sum of the original transaction data set T_O and the original newly added transaction data set T_Δ, replacing the original newly added transaction data set T_Δ with the updated newly added transaction data set T_Δ, and mining the frequent item sets on the updated total transaction data set T by the incremental updating method based on the number n of transactions in the updated total transaction data set T.
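How the new file F_qf might be assembled from F_{qf,O}, LF_Δ and cnt_Δ is sketched below. Several details are assumptions made for the illustration: all structures are in-memory dictionaries, cnt_Δ is keyed by item set, and ω is converted into a count threshold ω × (n_O + n_Δ) over the merged original data. As in the text, item sets coming only from LF_Δ carry their count in T_Δ alone.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

Itemset = FrozenSet[str]


def rebuild_f_qf(f_qf_o: Dict[Itemset, int],
                 lf_delta: Iterable[Itemset],
                 cnt_delta: Dict[Itemset, int],
                 n_o: int, n_delta: int,
                 omega: float) -> List[Tuple[Itemset, int]]:
    """Build the new F_qf for the merged data set T_O + T_delta."""
    threshold = omega * (n_o + n_delta)  # assumed count form of omega
    # Item sets of F_{qf,O}: count on T_O plus count in the old T_delta.
    merged = {s: sup + cnt_delta.get(s, 0) for s, sup in f_qf_o.items()}
    # Item sets that occur only in LF_delta: count in the old T_delta.
    for s in lf_delta:
        if s not in f_qf_o:
            merged[s] = cnt_delta.get(s, 0)
    new_f_qf = [(s, sup) for s, sup in merged.items() if sup >= threshold]
    new_f_qf.sort(key=lambda rec: -rec[1])   # keep the descending order of F_qf
    return new_f_qf
```

The result would then take the place of the old F_qf before the incremental updating method is run against the next batch of newly added transactions.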
Therefore, the method for mining frequent item sets of mass data provided by the invention provides a specific incremental updating strategy and uses information that has already been calculated, such as the array cnt_Δ and the set LF_Δ, to accelerate the updating operation, thereby improving the performance and practicability of frequent item set mining on mass data.
It should be noted that the same and similar parts in the various embodiments in this specification may be referred to each other.
Although the present invention has been described in detail with reference to the drawings and in connection with the preferred embodiments, the present invention is not limited thereto. Those skilled in the art can make various equivalent modifications or substitutions to the embodiments of the present invention without departing from its spirit and scope, and such modifications or substitutions fall within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (10)
1. A method for mining frequent item sets of mass data, used for mining the frequent item sets that satisfy a global minimum support minsup in a total transaction data set T, wherein the global minimum support minsup is a preset minimum support on the total transaction data set T; characterized in that
the total transaction data set T comprises an original transaction data set T_O and a newly added transaction data set T_Δ;
the method for mining frequent item sets of mass data comprises the following steps:
mining the original transaction data set T_O by a frequent item set mining algorithm, to obtain all the local frequent item sets of the original transaction data set T_O;
scanning the original transaction data set T_O and calculating the support count, on T_O, of each local frequent item set obtained in the preceding step; filtering the obtained local frequent item sets according to the local minimum support ω to obtain each local frequent item set whose support is not less than ω; and writing each such local frequent item set, together with its calculated support count, into the file F_qf;
reading the newly added transaction data set T_Δ and judging whether T_Δ is empty:
if yes, filtering the local frequent item sets in the file F_qf according to the number n of transactions in the total transaction data set T and the global minimum support minsup, and obtaining and outputting the local frequent item sets whose support count is not less than the global minimum support count n × minsup, the output local frequent item sets being all the frequent item sets that satisfy the global minimum support minsup on the total transaction data set T;
if not, mining the frequent item sets on the total transaction data set T by an incremental updating method;
wherein the local minimum support ω is a preset minimum support on the original transaction data set T_O, and ω is less than the global minimum support minsup.
2. The method for mining frequent item sets of mass data according to claim 1, wherein the incremental updating method comprises the following steps:
scanning the newly added transaction data set T_Δ, calculating the support count, in T_Δ, of each item set in T_Δ, storing each item set in T_Δ together with its calculated support count in the array cnt_Δ, and recording the maximum support count in the array cnt_Δ as mas_Δ;
scanning the current file F_qf and, for each local frequent item set scanned from F_qf, judging whether it satisfies the preset global minimum support minsup on the total transaction data set T; if yes, outputting the currently scanned local frequent item set and recording it as a first frequent item set; if no, judging, based on mas_Δ, whether the currently scanned local frequent item set is certainly not a frequent item set satisfying minsup on the total transaction data set T: if it cannot be excluded in this way, writing the currently scanned local frequent item set and its support count into the set ST_CAD;
based on the array cnt_Δ, calculating, for each local frequent item set in the set ST_CAD, its support count on the total transaction data set T and updating the set accordingly, to obtain the updated set ST_CAD;
traversing the updated set ST_CAD and judging whether each traversed local frequent item set is a frequent item set satisfying the preset global minimum support minsup on the total transaction data set T; if yes, outputting that local frequent item set;
then judging whether the expression (n_O × ω − 1) + mas_Δ < n × minsup holds:
if yes, finishing the mining of the frequent item sets;
otherwise, continuing to mine new frequent item sets from the newly added transaction data set T_Δ and outputting the mined new frequent item sets;
wherein a new frequent item set has a support greater than zero on the original transaction data set T_O, satisfies the global minimum support minsup on the total transaction data set T, and differs from all the frequent item sets already output;
n_O is the number of transactions in the original transaction data set T_O.
3. The method for mining frequent item sets of mass data according to claim 2, wherein mining new frequent item sets from the newly added transaction data set T_Δ and outputting the mined new frequent item sets comprises the following steps:
calculating the minimum support minsup_Δ of the newly added transaction data set T_Δ by the formula [formula not reproduced in the source text], where n_Δ is the number of transactions in the newly added transaction data set T_Δ;
splitting the transactions in the newly added transaction data set T_Δ into a plurality of target transaction data sets;
performing local frequent item set mining on each target transaction data set by the Eclat algorithm, to obtain all the local frequent item sets of each target transaction data set that satisfy the calculated minimum support minsup_Δ;
denoting the collection of all local frequent item sets of the target transaction data sets as LF_Δ, traversing LF_Δ and deleting from it the item sets that already appear in the file F_qf, to obtain a candidate set GF_Δ; and, based on the array cnt_Δ, adding to each local frequent item set in the candidate set GF_Δ its support count in the newly added transaction data set T_Δ and storing that count in the candidate set GF_Δ;
scanning the original transaction data set T_O and adding to each local frequent item set in the candidate set GF_Δ its support count on T_O, to obtain a new candidate set GF_Δ;
scanning the new candidate set GF_Δ and judging whether each scanned local frequent item set in it is a frequent item set satisfying the preset global minimum support minsup; if yes, outputting that local frequent item set.
4. The method for mining frequent item sets of mass data according to claim 3, wherein performing local frequent item set mining on each target transaction data set by the Eclat algorithm, to obtain all the local frequent item sets of each target transaction data set that satisfy the calculated minimum support minsup_Δ, comprises:
P0: traversing each target transaction data set;
P1: for the currently traversed target transaction data set:
P11: calculating, by the Eclat algorithm, the candidate frequent k-item sets of the currently traversed target transaction data set and, when a generated candidate frequent k-item set satisfies the calculated minimum support minsup_Δ, recording it as a frequent k-item set and storing it in the set LF_{k,Δ}, where k ≥ 1;
P12: generating a candidate frequent (k+1)-item set by merging two frequent k-item sets in LF_{k,Δ} and, when the generated candidate frequent (k+1)-item set satisfies the calculated minimum support minsup_Δ, recording it as a frequent (k+1)-item set and storing it in the set LF_{k+1,Δ}, wherein the first k−1 items of the two frequent k-item sets are the same and their last items are different;
P13: repeating steps P11 to P12, increasing k by 1 each time, until no new candidate frequent item set can be generated for the currently traversed target transaction data set; then executing step P14;
P14: continuing to traverse the next target transaction data set and repeating steps P11 to P13 until all target transaction data sets have been traversed, thereby obtaining all the local frequent item sets of each target transaction data set that satisfy the calculated minimum support minsup_Δ.
5. The method for mining frequent item sets of mass data according to claim 4, further comprising, between step P11 and step P12, a step S of pruning the set LF_{k,Δ};
wherein the step of pruning the set LF_{k,Δ} comprises:
grouping the frequent k-item sets in the set LF_{k,Δ} according to whether their first (k−1) items are the same, to obtain a corresponding number of item set groups, the frequent k-item sets in the same item set group having the same first (k−1) items;
counting the number of frequent k-item sets in each item set group and judging whether the counted number is equal to 1: if yes, deleting the corresponding item set group (that is, the group in which the number of frequent k-item sets is equal to 1) and deleting from the set LF_{k,Δ} the item sets identical to the frequent k-item sets in that group;
judging, for each currently remaining item set group, whether the union of any two frequent k-item sets in the group is contained in the file F_qf: if yes, deleting the corresponding item set group and deleting from the set LF_{k,Δ} the item sets identical to the frequent k-item sets in that group.
6. The method for mining frequent item sets of mass data according to claim 2, further comprising a step of pruning the set ST_CAD before calculating, based on the array cnt_Δ, the support count on the total transaction data set T of each local frequent item set in the set ST_CAD to obtain the updated set ST_CAD;
the step of pruning the set ST_CAD comprises a first-stage reduction step;
the first-stage reduction step comprises:
traversing the file F_qf and the array cnt_Δ and, for each traversed 1-item set in the file F_qf, calculating the sum of its support count in F_qf and the support count of the same 1-item set in the array cnt_Δ; if the calculated sum is smaller than the global minimum support count n × minsup, removing the traversed 1-item set from the set ST_CAD.
7. The mass data frequent item set mining method according to claim 6, wherein the fine shearing of the set ST_CAD further comprises a second-stage reduction step;
the second-stage reduction step comprises:
building a PIP array;
traversing the set ST_CAD as reduced by the first-stage reduction step and, for each local frequent item set in it, selecting the two items of that item set with the smallest support counts to form the item pair of that item set, and storing the item pair in the PIP array;
calculating the support count of each item pair in the PIP array on the newly added transaction data set T_Δ;
comparing, for each item pair in the PIP array, the sum of the support count of the item pair on the newly added transaction data set T_Δ and the support count of the corresponding item set in the newly added transaction data set T_Δ with the global minimum support count n × minsup, and deleting from the set ST_CAD the local frequent item set corresponding to each item pair for which the sum is smaller than the global minimum support count n × minsup.
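The second-stage reduction can be sketched in Python as below. Several things here are assumptions rather than details from the claim: the name `second_stage_reduce`, the dictionary `item_support` that supplies per-item support counts for choosing the two least-supported items, and the choice to add the pair's count on T_Δ to the count stored alongside the item set in ST_CAD. Adjust the summed quantity if a different count is intended.

```python
def second_stage_reduce(st_cad, item_support, delta_transactions, n, minsup):
    """Second-stage reduction of ST_CAD using item pairs (hypothetical helper).

    st_cad:             dict mapping frozenset item sets to stored counts.
    item_support:       dict mapping single items to support counts (assumed
                        source for picking the two least-supported items).
    delta_transactions: iterable of transactions (sets of items) in T_delta.
    """
    threshold = n * minsup  # global minimum support count

    # Build the PIP structure: one item pair per item set with >= 2 items.
    pip = {}
    for itemset in st_cad:
        if len(itemset) < 2:
            continue
        a, b = sorted(itemset, key=lambda i: (item_support[i], i))[:2]
        pip[itemset] = frozenset((a, b))

    # Count each distinct pair on the newly added transactions.
    pair_counts = {pair: 0 for pair in set(pip.values())}
    for transaction in delta_transactions:
        items = set(transaction)
        for pair in pair_counts:
            if pair <= items:
                pair_counts[pair] += 1

    # Keep an item set unless its stored count plus its pair's count on
    # T_delta falls below the threshold (assumed reading of the claim).
    return {
        itemset: count
        for itemset, count in st_cad.items()
        if itemset not in pip or count + pair_counts[pip[itemset]] >= threshold
    }
```

Because an item set can occur in a transaction only if its item pair does, the pair's count on T_Δ is an upper bound on the item set's own count there, which is why the pair can be used for pruning.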
8. The mass data frequent item set mining method according to claim 1, 2, 3, 4, 5, 6 or 7, wherein, when the newly added transaction data set T_Δ is updated, the original transaction data set T_O is updated to the sum of the original transaction data set T_O and the original newly added transaction data set T_Δ, and the updated newly added transaction data set T_Δ is not empty, the mass data frequent item set mining method further comprises an update mining step;
the update mining step comprises the following steps:
updating the number n of transactions in the total transaction data set T to the sum of the numbers of transactions of the original transaction data set T_O and the original newly added transaction data set T_Δ;
obtaining all the local frequent item sets previously obtained for the original transaction data set T_O, calculating the support count of each of these local frequent item sets on the original transaction data set T_O, and writing them correspondingly into a file F_qf,O;
based on the array cnt_Δ corresponding to the original newly added transaction data set T_Δ, adding to the entries of the file F_qf,O the support count of each local frequent item set on the original newly added transaction data set T_Δ; then filtering the local frequent item sets in the file F_qf,O according to the local minimum support ω, to obtain each local frequent item set whose support on the updated total transaction data set T is not less than ω;
then writing each obtained local frequent item set whose support on the updated total transaction data set T is not less than ω, together with its support count in the file F_qf,O, into a new file F_qf;
obtaining the set LF_Δ corresponding to the original newly added transaction data set T_Δ, and deleting from the obtained set LF_Δ the item sets that exist in the file F_qf,O, to obtain a new set LF_Δ;
based on the array cnt_Δ corresponding to the original newly added transaction data set T_Δ, writing into the new set LF_Δ the support count of each of its item sets on the original newly added transaction data set T_Δ;
then filtering the item sets in the new set LF_Δ according to the local minimum support ω, obtaining the filtered item sets whose support is not less than ω, and writing these item sets, together with their support counts recorded in the new set LF_Δ, into the new file F_qf;
thereafter replacing the original file F_qf with the new file F_qf, replacing the original transaction data set T_O with the sum of the original transaction data set T_O and the original newly added transaction data set T_Δ, replacing the original newly added transaction data set T_Δ with the updated newly added transaction data set T_Δ, and mining frequent item sets on the updated total transaction data set T by the incremental updating method, based on the number n of transactions in the updated total transaction data set T.
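The file-rebuilding part of the update mining step can be sketched as follows. This is a loose illustration under stated assumptions: the name `rebuild_f_qf` is hypothetical, files and sets are modeled as dictionaries and Python sets, cnt_Δ is assumed to give the count of any item set on T_Δ, and ω is assumed to be a ratio applied to `n_total` for entries from F_qf,O and to `n_delta` for entries from LF_Δ. The claim does not say which transaction count ω is measured against.

```python
def rebuild_f_qf(f_qf_o, lf_delta, cnt_delta, n_total, n_delta, omega):
    """Build the new file F_qf during update mining (hypothetical helper).

    f_qf_o:    dict of item set -> support count on the original T_O.
    lf_delta:  set of item sets found for the original T_delta.
    cnt_delta: dict of item set -> support count on the original T_delta.
    omega:     local minimum support, assumed here to be a ratio.
    """
    new_f_qf = {}

    # Add the T_delta counts to the F_qf,O entries, then filter by omega.
    for itemset, count in f_qf_o.items():
        total = count + cnt_delta.get(itemset, 0)
        if total >= omega * n_total:  # assumed threshold base
            new_f_qf[itemset] = total

    # Handle item sets of LF_delta that are not already in F_qf,O.
    for itemset in lf_delta:
        if itemset in f_qf_o:
            continue
        count = cnt_delta.get(itemset, 0)
        if count >= omega * n_delta:  # assumed threshold base
            new_f_qf[itemset] = count
    return new_f_qf
```

After this step the claim replaces the old F_qf, T_O and T_Δ with their updated versions and reruns the incremental mining; that part is outside the sketch.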
9. The mass data frequent item set mining method according to claim 1, 2, 3, 4, 5, 6 or 7, wherein filtering the local frequent item sets in the file F_qf according to the number n of transactions in the total transaction data set T and the global minimum support minsup, to obtain the local frequent item sets whose support count is not less than the global minimum support count n × minsup, specifically comprises:
sequentially scanning the file F_qf;
for each local frequent item set scanned from the file F_qf, judging whether its support count is greater than or equal to the global minimum support count n × minsup;
if so, the scanned local frequent item set is a filtered local frequent item set whose support count is not less than the global minimum support count n × minsup.
10. The mass data frequent item set mining method according to claim 2, 3, 4, 5, 6 or 7, wherein the step of scanning the current file F_qf; judging, for each local frequent item set scanned from the current file F_qf, whether it satisfies the preset global minimum support minsup on the total transaction data set T; if so, outputting the currently scanned local frequent item set and recording the output local frequent item set as a first frequent item set; if not, judging, based on mas_Δ, whether the currently scanned local frequent item set is definitely not a frequent item set satisfying the preset global minimum support minsup on the total transaction data set T; and, if it is not definitely excluded, writing the currently scanned local frequent item set and its support count into the set ST_CAD, specifically comprises:
sequentially scanning the file F_qf;
for each local frequent item set scanned from the file F_qf, judging whether its support count is greater than or equal to the global minimum support count n × minsup;
if so, outputting the scanned local frequent item set, the output local frequent item set being a frequent item set satisfying the preset global minimum support minsup on the total transaction data set T;
if not, judging whether the sum of the support count of the scanned local frequent item set in the file F_qf and mas_Δ is less than the global minimum support count n × minsup; if the judgment result is yes, the currently scanned local frequent item set is definitely not a frequent item set satisfying the preset global minimum support minsup on the total transaction data set T; if the judgment result is no, writing the currently scanned local frequent item set and its scanned support count into the set ST_CAD.
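The three-way decision made while scanning F_qf can be sketched in Python as below. The function name `scan_f_qf` and the list-of-pairs model of the file are assumptions, and mas_Δ is treated simply as a number supplied by the caller, since its definition lies outside this section.

```python
def scan_f_qf(f_qf, n, minsup, mas_delta):
    """Classify the entries of F_qf in one sequential pass (hypothetical helper).

    f_qf:      iterable of (item set, support count) pairs, in file order.
    mas_delta: numeric bound named mas_delta in the claim; taken as given.
    Returns (first_frequent, st_cad).
    """
    threshold = n * minsup  # global minimum support count
    first_frequent = []     # item sets already known to be frequent on T
    st_cad = {}             # candidates that still need exact counting

    for itemset, count in f_qf:
        if count >= threshold:
            # Already meets the global threshold: output directly.
            first_frequent.append(itemset)
        elif count + mas_delta >= threshold:
            # Cannot be ruled out yet: keep as a candidate with its count.
            st_cad[itemset] = count
        # Otherwise the item set is certainly not frequent and is dropped.
    return first_frequent, st_cad
```

For example, with n = 10, minsup = 0.5 and mas_delta = 2, an entry with count 6 is output, an entry with count 3 becomes a candidate, and an entry with count 2 is discarded.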
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910477465.9A CN110222090A (en) | 2019-06-03 | 2019-06-03 | A kind of mass data Mining Frequent Itemsets |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110222090A (en) | 2019-09-10 |
Family
ID=67819051
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910477465.9A Pending CN110222090A (en) | 2019-06-03 | 2019-06-03 | A kind of mass data Mining Frequent Itemsets |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110222090A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103761236A (en) * | 2013-11-20 | 2014-04-30 | 同济大学 | Incremental frequent pattern increase data mining method |
CN107229751A (en) * | 2017-06-28 | 2017-10-03 | 济南大学 | A kind of concurrent incremental formula association rule mining method towards stream data |
- 2019-06-03: CN application CN201910477465.9A, patent/CN110222090A/en, active Pending
Non-Patent Citations (1)
Title |
---|
韩希先: "Efficiently Mining Frequent Itemsets on Massive Data", 《IEEE ACCESS》 * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113064934A (en) * | 2021-03-26 | 2021-07-02 | 安徽继远软件有限公司 | Fault association rule mining method and system for sensing layer of power sensor network |
CN113064934B (en) * | 2021-03-26 | 2023-12-08 | 安徽继远软件有限公司 | Power sensing network perception layer fault association rule mining method and system |
CN114004286A (en) * | 2021-10-19 | 2022-02-01 | 河海大学 | Multi-dimensional time sequence synchronization motif discovery method based on frequent item mining |
CN114004286B (en) * | 2021-10-19 | 2024-04-26 | 河海大学 | Multi-dimensional time sequence synchronization motif discovery method based on frequent item mining |
CN114691749A (en) * | 2022-05-11 | 2022-07-01 | 江苏大学 | Sliding window-based frequent item set parallel incremental mining method |
CN114691749B (en) * | 2022-05-11 | 2024-03-19 | 江苏大学 | Method for parallel incremental mining of frequent item sets based on sliding window |
CN115525695A (en) * | 2022-10-08 | 2022-12-27 | 广东工业大学 | Incremental frequent itemset mining method for internet financial real-time streaming data |
CN115473933A (en) * | 2022-10-10 | 2022-12-13 | 国网江苏省电力有限公司南通供电分公司 | Network system associated service discovery method based on frequent subgraph mining |
CN115473933B (en) * | 2022-10-10 | 2023-05-23 | 国网江苏省电力有限公司南通供电分公司 | Network system associated service discovery method based on frequent subgraph mining |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110222090A (en) | A kind of mass data Mining Frequent Itemsets | |
Sonthalia et al. | Tree! i am no tree! i am a low dimensional hyperbolic embedding | |
US9589045B2 (en) | Distributed clustering with outlier detection | |
CN104809244B (en) | Data digging method and device under a kind of big data environment | |
CN112925821B (en) | MapReduce-based parallel frequent item set incremental data mining method | |
CN110389950B (en) | Rapid running big data cleaning method | |
KR20100045682A (en) | Method and system of clustering for multi-dimensional data streams | |
CN111309976B (en) | GraphX data caching method for convergence graph application | |
CN107391621A (en) | A kind of parallel association rule increment updating method based on Spark | |
CN102207964B (en) | Real-time massive data index construction method and system | |
CN115168326A (en) | Hadoop big data platform distributed energy data cleaning method and system | |
CN111475837A (en) | Network big data privacy protection method | |
CN109739897A (en) | A kind of increment type Mining Frequent Itemsets based on Spark frame | |
CN108319728A (en) | A kind of frequent community search method and system based on k-star | |
Kim et al. | Efficient method for mining high utility occupancy patterns based on indexed list structure | |
Ahmed et al. | Efficient mining of weighted frequent patterns over data streams | |
CN108897820B (en) | Parallelization method of DENCLUE algorithm | |
CN114490835B (en) | High-utility item set mining method and device, electronic equipment and medium | |
CN111177190A (en) | Data processing method and device, electronic equipment and readable storage medium | |
CN110413602B (en) | Layered cleaning type big data cleaning method | |
Padillo et al. | Subgroup discovery on big data: Pruning the search space on exhaustive search algorithms | |
Rajendran et al. | Incremental MapReduce for K-medoids clustering of big time-series data | |
CN112231590A (en) | Content recommendation method, system, computer device and storage medium | |
Wang et al. | Novel algorithms for efficient mining of connected induced subgraphs of a given cardinality | |
Blunck et al. | In-place algorithms for computing (layers of) maxima |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190910 |