CN106815302A - A frequent itemset mining method applied to game item recommendation - Google Patents

A frequent itemset mining method applied to game item recommendation

Info

Publication number
CN106815302A
Authority
CN
China
Prior art keywords
frequent
list
item
code
last
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611144649.6A
Other languages
Chinese (zh)
Inventor
金海
张舫
张宇
廖小飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology
Priority to CN201611144649.6A
Publication of CN106815302A
Legal status: Pending

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083 Techniques for rebalancing the load in a distributed system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/50 Indexing scheme relating to G06F9/50
    • G06F2209/5019 Workload prediction

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a frequent itemset mining method and belongs to the technical field of data mining. The method first obtains the occurrence count of every item on MapReduce, sorts the items and filters them by a threshold to remove the items that do not qualify, and thereby obtains the F-List. The F-List is then partitioned to obtain the G-List. According to the G-List partition, the data is passed to the Mappers, which process it and pass it to the Reducers, where the mining stage of MapReduce is carried out. Mining first builds the PPCTree on each Reducer, derives the N-Lists from the PPCTree, and obtains the G-Subsume of the corresponding items on each Reducer; the final frequent itemsets are then obtained recursively from the N-Lists and the G-Subsume. The invention partitions the data reasonably according to load estimation, which ensures load balancing; by optimizing the recursive mining flow, it greatly reduces the mining time on dense datasets.

Description

A frequent itemset mining method applied to game item recommendation
Technical field
The invention belongs to the field of data mining and, more particularly, relates to a frequent itemset mining method.
Background art
Since its inception, data mining has aimed to discover valuable information hidden in data. It works with six kinds of patterns: classification, clustering, regression, association, sequence, and deviation patterns. Association pattern analysis is an important research direction, and frequent itemset mining is an important component of association rule mining algorithms. Frequent itemset mining algorithms can find useful rules in big data, and the approach applies to many fields, such as web log mining, retail, financial services (recommending financial products that may interest different types of customer groups), and game item recommendation. However, against the background of big data, traditional single-machine mining can no longer meet demand. Simply raising CPU speed and memory capacity is not only too costly but also unrealistic, because hardware development falls far short of the demand for computing speed. Parallel computation therefore becomes especially important, and improving or innovating data mining algorithms and combining them with a distributed computing model is currently a good alternative.
With the arrival of the networked information age, the online game industry has emerged. Online games are a fusion of culture, art, and high technology, and they offer a new form of entertainment and leisure. The online game industry is booming, its market keeps expanding, and online games are increasingly becoming the bellwether of the network economy. As the choice of games grows, players become more discerning, and only games that suit their players can last in the market. Data mining has attracted great attention in the game industry, mainly because massive amounts of data exist and can be widely used, and because there is an urgent need to convert that data into useful information and knowledge, so as to improve game quality and operating efficiency and to win more users for game operators. Data mining is used extensively in many industries, but the online game market has not yet been fully developed in this respect, and effective methods for processing game data are not yet clear.
Existing frequent itemset mining algorithms mainly have the following shortcomings:
1) the algorithms are too inefficient to produce a result within a limited time;
2) parallel algorithms cannot divide the load in a balanced way.
Summary of the invention
In view of the defects of the prior art and its urgent technical needs, the invention discloses a frequent itemset mining method that runs in parallel on the MapReduce platform. It partitions the data reasonably according to load estimation, which ensures load balancing, and by optimizing the recursive mining flow it greatly reduces the mining time on dense datasets, thereby solving the problems of low algorithm efficiency and load imbalance.
To achieve the above object, the invention provides:
A frequent itemset mining method, comprising the following steps:
(1) count the occurrences of each item in the original data with MapReduce;
(2) filter out the frequent 1-items according to each item's occurrence count, and sort them by occurrence count from high to low to form the F-List;
(3) group the items in the F-List according to a load-balancing principle to obtain the G-List, which contains each item and the number of the group it belongs to;
(4) the Mappers distribute the original data:
(4-1) reorder the items of every record of the original data according to the item order in the F-List;
(4-2) read items starting from the last item of each record and look up the item's group number gid in the G-List; then, with gid as the key and all items ranked before the item in the record as the value, form the key-value pair <key=gid, value=items> as the Mapper output; if the group number gid has already occurred, skip the item; continue with the preceding item in the same way until the record has been fully processed;
(5) the Reducers mine frequent itemsets from the key-value pairs output by the Mappers:
(5-1) according to the key=gid output by the Mappers, value=items is distributed to the corresponding Reducer, and the Reducer builds the PPCTree; the PPCTree is a tree structure in which each node has five attributes: name, support frequency, child nodes, pre-order traversal number pre, and post-order traversal number post;
(5-2) for each node N_i in the PPCTree, <N_i.pre, N_i.post, N_i.frequency> is called a PP-code; sorting the PP-codes in ascending order of pre yields the N-List of each frequent 1-item in the F-List;
(5-3) build the G-Subsume of the Reducer: G-Subsume(A) = {B ∈ I_1 | A.gid ∈ Reducer.gid, g(A) ⊆ g(B)}, where A and B denote two different frequent 1-items, A.gid denotes the group number of item A, Reducer.gid denotes the group number handled by the Reducer, g(X) denotes the set of IDs of the records that contain frequent 1-item X (X = A or B), and I_1 denotes the set of frequent 1-items;
(5-4) mine recursively, with the following sub-steps:
a) take the F-List as the initial recursion data of the first round and take its last item L; combine L with its G-Subsume(L) to generate frequent 2-itemsets, and write them to the result array Result;
b) take items X from the initial recursion data one by one from front to back, and compare the PP-codes of N_X, the N-List of X, with the PP-codes of N_Last, the N-List of L. If X is in G-Subsume(L), move on to the next item. Otherwise: when the pre of the current PP-code of N_X is less than the pre of the current PP-code of N_Last and its post is greater than the post of the current PP-code of N_Last, generate the candidate 2-itemset XL, add <N_X.PP-code.pre, N_X.PP-code.post, N_Last.PP-code.frequency> to N_XL, the N-List of XL, and advance to the next PP-code of N_Last; when the pre of the current PP-code of N_X is less than the pre of the current PP-code of N_Last and its post is less than the post of the current PP-code of N_Last, advance to the next PP-code of N_X; when the pre of the current PP-code of N_X is greater than the pre of the current PP-code of N_Last, advance to the next PP-code of N_Last; continue until the PP-codes of N_Last and N_X have all been traversed;
after the PP-codes of N_X have been traversed, if the sum of the supports of the PP-codes in the N-List of the resulting XL does not meet the threshold, delete XL; if it does, XL is a frequent 2-itemset;
c) continue with the next item of the initial recursion data and repeat step b) until all items before the last item L have been compared; this yields the frequent 2-itemsets with L as suffix and their N-Lists, which are written to the result array Result and whose N-Lists serve as the initial data for mining frequent 3-itemsets; the frequent 2-itemsets are also merged directly with G-Subsume(L) to obtain part of the frequent 3-itemsets with L as suffix, which are added to Result;
d) take the second-to-last item of the initial recursion data and repeat steps a), b), and c), and so on until every item in the initial recursion data has been processed; this yields all frequent 2-itemsets and part of the frequent 3-itemsets;
e) extract the frequent 2-itemsets that differ only in their prefix as the initial recursion data of the second round and, starting from the last one, process them in the same way as steps b)-d) to obtain all frequent 3-itemsets; combine each frequent 3-itemset whose suffix has a G-Subsume with that G-Subsume to obtain frequent 4-itemsets;
f) and so on, until the N-List comparison yields only a single frequent K-itemset, at which point the recursion ends;
(5-5) the Reducer outputs <key=item ∈ gid, value=Result>, which completes the whole frequent itemset mining process.
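The distribution in step (4) can be sketched in Python as follows. This is a minimal, single-process illustration and not the patented implementation: the function name `map_allocate` and the dictionaries `f_order` and `g_list` are hypothetical, items that are not in the F-List are assumed to be dropped, and the emitted value is assumed to be the prefix up to and including the item that triggered the output.

```python
def map_allocate(record, f_order, g_list):
    """Emit <gid, items> pairs for one record, as in steps (4-1) and (4-2).

    f_order: dict item -> rank in the F-List (0 = most frequent); hypothetical name.
    g_list:  dict item -> group number gid (the hash table of step (3)).
    """
    # (4-1) keep only frequent items (assumption) and reorder them by F-List rank.
    items = sorted((i for i in set(record) if i in f_order),
                   key=lambda i: f_order[i])
    emitted = set()
    output = []
    # (4-2) scan from the last item towards the first.
    for pos in range(len(items) - 1, -1, -1):
        gid = g_list[items[pos]]
        if gid in emitted:
            continue  # this group already received a longer prefix
        emitted.add(gid)
        # Assumption: the value is the prefix that ends at the current item.
        output.append((gid, items[:pos + 1]))
    return output


f_order = {"B": 0, "A": 1, "C": 2, "D": 3}
g_list = {"B": 0, "A": 0, "C": 1, "D": 1}
pairs = map_allocate(["A", "D", "B"], f_order, g_list)
```

Each group receives at most one prefix per record, so a Reducer sees every record that can contribute to itemsets ending in one of its own items.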
Further, step (1) is implemented as follows:
(1-1) shard the original database horizontally; each sub-file produced by the sharding is called a Block, and the Blocks are assigned to nodes in the cluster;
(1-2) each Block is the input of one Map function; for every item a_j in a record T_i of the Block, the Mapper outputs the key-value pair <key=a_j, value=1>;
(1-3) all key-value pairs with key=a_j are assigned to the same Reducer, so the input of the Reducer is <key=a_j, value={1,1,…,1}>; the Reducer sums once and outputs <key=a_j, value=sum{1,1,…,1}>.
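The counting pass (1-1) to (1-3), followed by the filter-and-sort of step (2), can be simulated in a single Python process as below. This is only a sketch: the function names are hypothetical, no Hadoop API is used, and breaking ties between equal counts by item name is an assumption that the text does not specify.

```python
from collections import defaultdict


def map_count(record):
    """Mapper: emit <key=item, value=1> for every item of one record."""
    return [(item, 1) for item in record]


def reduce_count(pairs):
    """Shuffle and Reducer: group the pairs by key and sum the values."""
    counts = defaultdict(int)
    for item, one in pairs:
        counts[item] += one
    return dict(counts)


def count_items(records):
    """Run the map and reduce phases over all records."""
    pairs = []
    for record in records:
        pairs.extend(map_count(record))
    return reduce_count(pairs)


def build_f_list(counts, min_support):
    """Step (2): drop items below the threshold, sort by count descending."""
    frequent = [(item, c) for item, c in counts.items() if c >= min_support]
    # Assumption: ties are broken by item name to keep the order deterministic.
    return sorted(frequent, key=lambda pair: (-pair[1], pair[0]))


records = [["A", "B", "C"], ["B", "C"], ["A", "B", "D"], ["B", "D"]]
counts = count_items(records)
f_list = build_f_list(counts, min_support=2)
```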
Further, the load-balancing principle in step (3) is: take the sequence number of each item in the F-List as its load value, and group the items in the F-List according to the load values.
Further, the G-List is stored in a hash table.
With the above scheme, the invention outperforms other parallel algorithm schemes and greatly improves performance in game data mining, specifically:
1) It uses N-Lists, which reduces complexity. General frequent itemset mining methods recurse over trees, which both takes up space and makes the recursion far more complex than in this method. The method also uses a distinctive comparison approach that does not compare every pair of PP-codes in the two N-Lists: comparing every PP-code of one N-List with every PP-code of the other has complexity O(mn), where m and n are the lengths of the two N-Lists, whereas this comparison approach has complexity of only O(m+n), which greatly reduces the recursion complexity;
2) It introduces a new concept, G-Subsume, for the parallel MapReduce setting. During frequent itemset mining, G-Subsume reduces the number of N-List merge comparisons, because itemsets are merged with the G-Subsume directly instead, which greatly improves mining efficiency;
3) In general, the G-List could be grouped by taking remainders, but some items recurse for a long time and others for a short time, so the wait for the final result is determined by the slowest item and the load is unbalanced. To balance the load, the invention estimates the load of each item in advance: in depth-first mode, the depth of the PPCTree affects the time of the pre-order and post-order traversals of the tree, and the greater the depth, the more time they take; the maximum path length in the PPCTree for each item equals the item's sequence number in the F-List, and the maximum length of the item's N-List equals the smaller of the item's support count and 2^n - 1, where n is the item's sequence number in the F-List. With these two rules the load of each item can be estimated easily, which realizes the load balancing of the invention.
Brief description of the drawings
Fig. 1 is a flow chart of the frequent itemset mining method of the invention;
Fig. 2 is a flow chart of the frequent itemset mining carried out by the Mappers and Reducers;
Fig. 3 shows the construction process of the PPCTree of the invention;
Fig. 4 is a flow chart of obtaining frequent 2-itemsets from frequent 1-itemsets during the recursive mining of the invention;
Fig. 5 is the schematic diagram of load balancing of the present invention;
Fig. 6 is the schematic diagram of MapReduce processes of the present invention.
Detailed description of the embodiments
To make the purpose, technical scheme, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it.
The terms used in the invention are explained first:
Frequent itemset: a collection of items is called an itemset; any itemset whose occurrence ratio reaches a given constant s is a frequent itemset.
Frequent K-itemset: an itemset that contains K items and is frequent is called a frequent K-itemset.
Support: the occurrence frequency of an itemset is the number of transactions that contain it, and is called the support of the itemset.
MapReduce: a programming model for parallel computation over large-scale datasets (larger than 1 TB). Its central ideas, "Map" and "Reduce", are borrowed from functional programming languages, along with features borrowed from vector programming languages. It makes it much easier for programmers who are not familiar with distributed parallel programming to run their programs on a distributed system. In current implementations, the user specifies a Map function that maps a set of key-value pairs to a new set of key-value pairs, and a concurrent Reduce function that ensures that all mapped key-value pairs sharing the same key are grouped together.
Fig. 1 shows the flow chart of the frequent itemset mining method of the invention. The method runs on the MapReduce platform. It first obtains the occurrence count of each item on MapReduce, sorts the items and filters them by a threshold to remove the items that do not qualify, and obtains the F-List. It then partitions the F-List to obtain the G-List. According to the G-List partition, the records are passed to the Mappers, which process them and pass each transaction to the Reducers, where the mining part of MapReduce is carried out. Mining first builds the PPCTree on each Reducer, derives the N-Lists from it, and obtains the G-Subsume of the corresponding items on each Reducer; the final frequent itemsets are then obtained recursively from the N-Lists and the G-Subsume.
More specifically, the frequent itemset mining method of the invention comprises the following steps:
(1) Count the occurrences of each item in the original data with MapReduce. The sub-steps are:
(1-1) shard the original database horizontally; each sub-file produced by the sharding is called a Block, and the Blocks are assigned to nodes in the cluster; the Hadoop platform performs this step automatically;
(1-2) each Block is the input of one Map function; the input key-value pair of the Mapper is <key, value=T_i>, where T_i denotes a record in the Block. For each item a_j in the record T_i, the Mapper outputs the key-value pair <key=a_j, value=1>;
(1-3) Reduce merges the key-value pairs from the Mappers. Specifically, all key-value pairs with key=a_j are assigned to the same Reducer, so the input of the Reducer is <key=a_j, value={1,1,…,1}>. The Reducer only needs to sum once and then outputs <key=a_j, value=sum{1,1,…,1}>;
(2) Filter out the frequent 1-items according to each item's occurrence count, and sort them by occurrence count from high to low to build the F-List, which contains the frequent 1-items and their occurrence counts. The sub-steps are:
(2-1) after the above operations complete, the output key-value pairs of the Reducers are stored on HDFS; read the result file from HDFS;
(2-2) sort and remove the items that do not qualify: sort in descending order of the value in each key-value pair and, according to the given threshold, remove the items below the threshold to obtain the F-List;
(3) Group the items in the F-List according to the load-balancing principle to obtain the G-List, which contains each item and the number of the group it belongs to. The sub-steps are:
(3-1) predict the load of each item in the F-List in advance, and partition the F-List according to the load-balancing principle;
(3-2) build the G-List from the F-List partition. Each G-List entry has two fields: the item and its group number gid. The G-List is stored in a hash table;
(4) The Mappers distribute the original data:
(4-1) reorder the items of every record according to the item order in the F-List;
(4-2) read items starting from the last item of each record and look up the item's group number gid in the G-List; then, with gid as the key and all items ranked before the item as the value, form the key-value pair <key=gid, value=items> as the Mapper output; if the group number gid has already occurred, skip the item; continue with the preceding item in the same way until the record has been fully processed;
(5) The Reducers mine frequent itemsets from the key-value pairs output by the Mappers:
(5-1) according to the key=gid output by the Mappers, value=items is distributed to the corresponding Reducer, and the Reducer builds the PPCTree; the PPCTree is a tree structure in which each node has five attributes: name, support frequency, child nodes, pre-order traversal number pre, and post-order traversal number post;
(5-2) for each node N_i in the PPCTree, <N_i.pre, N_i.post, N_i.frequency> is called a PP-code; sorting the PP-codes in ascending order of pre yields the N-List of each frequent 1-item in the F-List;
(5-3) build the G-Subsume of the Reducer. G-Subsume is a new concept proposed by the invention: G-Subsume(A) = {B ∈ I_1 | A.gid ∈ Reducer.gid, g(A) ⊆ g(B)}, where A and B denote two different frequent 1-items, A.gid denotes the group number of item A, and Reducer.gid denotes the group number handled by the Reducer. G-Subsume is defined only on frequent 1-itemsets; that is, it is computed for the frequent 1-itemsets that belong to the gid of each Reducer. The condition A.gid ∈ Reducer.gid means that the G-Subsume is sought only for the frequent 1-itemsets that correspond to Reducer.gid. Every record has an ID, and g(X) denotes the set of IDs of the records that contain item X; g(A) ⊆ g(B) means that every record that contains item A necessarily contains item B, while a record that contains item B does not necessarily contain item A. G-Subsume thus amounts to finding the set of ancestors of each frequent 1-itemset that corresponds to Reducer.gid. In the subsequent mining, if the G-Subsume of A is {A_1, A_2, …, A_m}, then the support of the combination of A with any of the 2^m - 1 non-empty subsets of that set equals the support of A. This property is used in the subsequent frequent itemset mining: if G-Subsume(A) = {B} and XA is a frequent itemset, then XBA must also be a frequent itemset.
(5-4) Mine recursively, with the following sub-steps:
a) take the F-List as the initial recursion data of the first round and start the recursion from N_Last, the N-List of the last item L. Combine L with its G-Subsume to generate frequent 2-itemsets and write them to the result array Result; these itemsets are not used as initial data for recursively mining frequent 3-itemsets and are only added to Result as result data;
b) taking the items X of the initial recursion data from front to back, compare the PP-codes of N_X, the N-List of item X, with the PP-codes of N_Last. If X is in the G-Subsume of L, move on to the next item. Otherwise: when the pre of the current PP-code of N_X is less than the pre of the current PP-code of N_Last and its post is greater than the post of the current PP-code of N_Last, add <N_X.PP-code.pre, N_X.PP-code.post, N_Last.PP-code.frequency> to a new N-List named XL and advance to the next PP-code of N_Last; when the pre of the current PP-code of N_X is less than the pre of the current PP-code of N_Last and its post is less than the post of the current PP-code of N_Last, advance to the next PP-code of N_X; when the pre of the current PP-code of N_X is greater than the pre of the current PP-code of N_Last, advance to the next PP-code of N_Last; continue until the PP-codes of N_Last and N_X have all been traversed. This approach reduces complexity: comparing every PP-code of N_Last with every PP-code of N_X has complexity O(mn), where m and n are the lengths of N_Last and N_X, whereas this approach has complexity of only O(m+n). After the PP-codes of N_X have been traversed, if the sum of the supports of the PP-codes in the N-List of the resulting XL does not meet the threshold, delete XL; if it does, XL is a frequent 2-itemset;
c) continue by comparing the PP-codes of the N-List of the next item with the PP-codes of N_Last, that is, repeat step b), until all items have been compared. This yields the set {AL, BL, …} of frequent 2-itemsets with the last item L as suffix together with the N-List of each; they are written to the result array Result, and their N-Lists serve as the initial data for mining frequent 3-itemsets. Because of the property introduced in (5-3), the frequent 2-itemsets are merged directly with the G-Subsume of L to obtain part of the frequent 3-itemsets with L as suffix, which are added to Result;
d) continue with the preceding item and perform operations a), b), and c), until all items have been processed; this yields all frequent 2-itemsets and part of the frequent 3-itemsets. As the steps above show, none of the frequent itemsets obtained by merging with a G-Subsume is used as initial data for the recursive mining; that is, the next step of mining frequent 3-itemsets does not use the frequent itemsets obtained by merging with a G-Subsume;
e) after the frequent 2-itemsets have been obtained, mine the frequent 3-itemsets. Only two itemsets that differ solely in their prefix can produce a frequent 3-itemset; for example, AX and BX can be tested for whether they produce a frequent 3-itemset. Extract the frequent 2-itemsets that differ only in prefix as the initial recursion data of the second round and, starting from the last one, compare it with the items before it, from front to back, in the same way as in steps b)-d), comparing the N-Lists of AX and BX. Repeating the comparison yields all frequent 3-itemsets; each frequent 3-itemset whose suffix has a G-Subsume is then combined with that G-Subsume to obtain frequent 4-itemsets;
f) and so on, until only one frequent K-itemset is obtained by N-List comparison (not counting the frequent K-itemsets obtained by merging with a G-Subsume), at which point the recursion ends;
(5-5) the Reducer outputs <key=item ∈ gid, value=Result>, which completes the whole frequent itemset mining process.
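The G-Subsume of step (5-3) can be illustrated with a brute-force Python sketch over record-ID sets. The function name `g_subsume` and its parameters are hypothetical, each Reducer is assumed to handle a single group number, and the quadratic scan is chosen for clarity rather than efficiency; the text does not describe how the sets g(X) are computed in practice.

```python
def g_subsume(records, frequent_items, g_list, reducer_gid):
    """Return {A: [B, ...]} with g(A) a subset of g(B), for items A of this group.

    records:        list of item lists; the list index serves as the record ID.
    frequent_items: frequent 1-items in F-List order.
    g_list:         dict item -> group number gid.
    reducer_gid:    group number handled by this Reducer (assumed single).
    """
    # g(X): set of IDs of the records that contain item X.
    g = {item: set() for item in frequent_items}
    for record_id, record in enumerate(records):
        for item in record:
            if item in g:
                g[item].add(record_id)

    result = {}
    for a in frequent_items:
        if g_list[a] != reducer_gid:
            continue  # only items whose gid belongs to this Reducer
        result[a] = [b for b in frequent_items if b != a and g[a] <= g[b]]
    return result


records = [["A", "B", "C"], ["B", "C"], ["A", "B", "D"], ["B", "D"]]
g_list = {"B": 0, "A": 0, "C": 1, "D": 1}
subsume_group1 = g_subsume(records, ["B", "A", "C", "D"], g_list, reducer_gid=1)
```

In this sample every record that contains C or D also contains B, so both C and D have B in their G-Subsume, and the itemsets BC and BD inherit the supports of C and D without any N-List comparison.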
This completes all steps of frequent itemset mining. The application of the invention is explained below, taking a game application as an example:
1) Group the game users. The invention draws a heat map from the number of heroes each user uses and the length of each user's hero-usage sequence, and then splits the user population accordingly.
2) Denoise the data. The original data contains much useless, interfering data, such as the hero that must be used in the first game, which carries no meaning. The data must be denoised to obtain the meaningful hero-usage sequences of the different user groups.
3) Apply the algorithm to these sequences to obtain the mining result, namely the frequent patterns of the users' hero-usage sequences, and use them to guide users step by step from the group that uses few heroes and has short hero-usage sequences into the group that uses many heroes and has long hero-usage sequences.
Fig. 2 shows the flow chart of the frequent itemset mining carried out by the Mappers and Reducers. In the Mapper, each record is first sorted according to the F-List order and then, according to the G-List partition, processed in a loop, and the results are passed to the Reducers. On each Reducer, the PPCTree is built first, then the N-Lists and the G-Subsume are derived, and finally the frequent itemsets are obtained by recursive mining.
Fig. 3 shows the construction process of the PPCTree in the invention, using the input in the figure as an example. The first record is inserted into the empty tree under the root node in the order A, B, C. The second record is B, C: B is looked up under the root node and not found, so a B node is created and inserted; C is then looked up under that B node and not found, so a C node is created and inserted. The third record is A, B, D: the A and B nodes are found, but no D node is found among the children of that B node, so a D node is created and inserted under it. The last record is B, D: the B node is found, but no D node is found among its children, so a D node is created and inserted under it, which completes the construction of the PPCTree.
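The construction described for Fig. 3, followed by the pre-order and post-order numbering that yields the N-Lists, can be sketched in Python as below. The class and function names are hypothetical, and starting both counters at 0 with the root included in the numbering is an assumption; the text does not fix the numbering origin.

```python
class Node:
    """PPCTree node: name, support frequency, children, pre and post numbers."""

    def __init__(self, name):
        self.name = name
        self.frequency = 0
        self.children = {}  # item name -> child Node
        self.pre = -1
        self.post = -1


def build_ppc_tree(records):
    """Insert every record along a path from the root, creating nodes on demand."""
    root = Node(None)
    for record in records:
        node = root
        for item in record:
            child = node.children.get(item)
            if child is None:
                child = Node(item)
                node.children[item] = child
            child.frequency += 1
            node = child
    return root


def build_n_lists(root):
    """Number the nodes and collect <pre, post, frequency> PP-codes per item."""
    n_lists = {}
    counter = {"pre": 0, "post": 0}

    def visit(node):
        node.pre = counter["pre"]
        counter["pre"] += 1
        for child in node.children.values():
            visit(child)
        node.post = counter["post"]
        counter["post"] += 1
        if node.name is not None:
            n_lists.setdefault(node.name, []).append(
                (node.pre, node.post, node.frequency))

    visit(root)
    for codes in n_lists.values():
        codes.sort()  # ascending order of pre
    return n_lists


# The four records described for Fig. 3.
records = [["A", "B", "C"], ["B", "C"], ["A", "B", "D"], ["B", "D"]]
n_lists = build_n_lists(build_ppc_tree(records))
```

The recursive traversal keeps the sketch short; a very deep tree would need an explicit stack instead.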
Fig. 4 shows the flow chart of obtaining frequent 2-itemsets from frequent 1-itemsets during the recursive mining of the invention. First, the last item L_n is merged with its G-Subsume to obtain part of the frequent 2-itemsets. Then, from front to back, each item L_x is checked against the G-Subsume of L_n: if the item is contained in the G-Subsume, move on to the next item; if not, compare N_x, the N-List of the item, with N_n, the N-List of L_n, PP-code by PP-code, according to the following rules. When the pre of the current PP-code of N_x is less than the pre of the current PP-code of N_n and its post is greater than the post of the current PP-code of N_n, add <N_x.PP-code.pre, N_x.PP-code.post, N_n.PP-code.frequency> to a new N-List named L_xL_n and advance to the next PP-code of N_n. When the pre of the current PP-code of N_x is less than the pre of the current PP-code of N_n and its post is less than the post of the current PP-code of N_n, advance to the next PP-code of N_x. When the pre of the current PP-code of N_x is greater than the pre of the current PP-code of N_n, advance to the next PP-code of N_n. Continue until the PP-codes of N_n and N_x have all been traversed. If the sum of the supports of the PP-codes in the N-List of the resulting L_xL_n does not meet the threshold, delete L_xL_n; if it does, L_xL_n is a frequent 2-itemset. Continue comparing the PP-codes of the N-List of the next item with those of N_n until all items have been compared, which yields the frequent 2-itemsets with the last item L_n as suffix. Then take the item before L_n and perform the same operations, until the second item has been processed, which yields all frequent 2-itemsets. The mining of the subsequent frequent k-itemsets is similar to that of the frequent 2-itemsets and is not elaborated further. Note that subsequent G-Subsume merging uses the G-Subsume of the suffix item of the frequent itemset, and that the frequent k-itemsets generated by merging with a G-Subsume need not serve as initial data for the frequent (k+1)-itemsets but only as Result; in the subsequent frequent k-itemset mining, N-Lists are compared only between two itemsets that differ solely in prefix, for example AX and BX.
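The comparison rules described for Fig. 4 amount to a linear two-pointer merge of two N-Lists. The Python sketch below follows the three rules as stated; the function names are hypothetical, PP-codes are represented as (pre, post, frequency) tuples, and the sample N-Lists are illustrative values for a small four-record dataset.

```python
def intersect_n_lists(n_x, n_last):
    """Merge the N-List of an earlier item X with the N-List of the suffix item L.

    Both lists hold (pre, post, frequency) tuples sorted by ascending pre.
    Runs in O(m + n) because each step advances one of the two pointers.
    """
    i = j = 0
    n_xl = []
    while i < len(n_x) and j < len(n_last):
        x_pre, x_post, _ = n_x[i]
        l_pre, l_post, l_freq = n_last[j]
        if x_pre < l_pre:
            if x_post > l_post:
                # The X node is an ancestor of the L node: record and advance L.
                n_xl.append((x_pre, x_post, l_freq))
                j += 1
            else:
                # The X node's subtree ends before the L node: advance X.
                i += 1
        else:
            # The X node comes after the L node in pre-order: advance L.
            j += 1
    return n_xl


def support(n_list):
    """Support of an itemset: sum of the frequencies of its PP-codes."""
    return sum(freq for _, _, freq in n_list)


n_b = [(2, 2, 2), (5, 6, 2)]
n_c = [(3, 0, 1), (6, 4, 1)]
n_bc = intersect_n_lists(n_b, n_c)
keep_bc = support(n_bc) >= 2  # compare against the support threshold
```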
Fig. 5 is a schematic diagram of load balancing in the present invention. Each item must be added to its corresponding group in the G-List, identified by a group number gid. Ordinarily the items could be grouped by taking a remainder, but some items take a long time to recurse and others a short time, so the final result has to wait for the slowest item and the load is unbalanced. To balance the load, the present invention adopts a load balancing strategy that estimates the load of each item in advance, on the following bases:
1) In depth-first mode, the depth of the PPC-tree affects the time needed for the pre-order and post-order traversals of the tree: the greater the depth, the more time is consumed;
2) When the N-Lists corresponding to two frequent itemsets are merged, the time complexity is the sum of the lengths of the two N-Lists;
3) The maximum path length in the PPC-tree for each item equals the item's sequence number in the F-List, and the maximum length of the N-List corresponding to the item equals the smaller of the item's support count and 2^(n-1), where n is the sequence number of the item in the F-List;
The load of each item is therefore estimated by its sequence number in the F-List. Once the loads have been estimated, the present invention achieves load balancing with a greedy algorithm: each item is added to the existing group whose total load is currently the smallest, until all items have been assigned.
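The greedy grouping can be sketched as follows. This is only an illustration under stated assumptions: the function name `build_g_list` is hypothetical, items are processed from the heaviest load to the lightest (the text does not fix the processing order), and ties between equally loaded groups go to the lowest group number.

```python
import heapq
from typing import Dict, List, Tuple


def build_g_list(f_list: List[str], num_groups: int) -> Tuple[Dict[str, int], List[int]]:
    """Assign every F-List item to the group with the smallest total load.

    The load of an item is its 1-based sequence number in the F-List.
    Returns the G-List (item -> gid) and the final load of each group.
    """
    # Min-heap of (current total load, gid); the lightest group is popped first.
    heap = [(0, gid) for gid in range(num_groups)]
    heapq.heapify(heap)
    g_list: Dict[str, int] = {}
    loads = [0] * num_groups
    # Assumption: heaviest items first, i.e. from the end of the F-List.
    for seq in range(len(f_list), 0, -1):
        item = f_list[seq - 1]
        load, gid = heapq.heappop(heap)
        g_list[item] = gid
        loads[gid] = load + seq
        heapq.heappush(heap, (loads[gid], gid))
    return g_list, loads


g_list, loads = build_g_list(["a", "b", "c", "d", "e", "f"], num_groups=3)
```

For six items and three groups this yields the groups {f, a}, {e, b} and {d, c}, each with a total load of 7, whereas grouping by remainder (sequence number mod 3) would give loads of 5, 7 and 9.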
Fig. 6 is a schematic diagram of the MapReduce process in the present invention. The present invention goes through two MapReduce passes. In the first MapReduce pass, Map outputs <key=item, value=1>, and the Reducer then adds up the values and outputs <key=item, value=sum{1,1,…,1}>. The second MapReduce pass performs the data mining: once the F-List and G-List have been obtained, mining is carried out using the N-List and Subsume structures to obtain the final result.
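The first pass is an ordinary counting job. The sketch below simulates it in plain Python without any Hadoop API; `map_count`, `reduce_count`, `run_count_job` and `build_f_list` are hypothetical names, the sample records are invented game-item purchases, and breaking ties alphabetically when sorting the F-List is an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_count(record: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map: emit <key=item, value=1> for every item of one record."""
    for item in record:
        yield item, 1


def reduce_count(item: str, ones: List[int]) -> Tuple[str, int]:
    """Reduce: emit <key=item, value=sum{1,1,...,1}>."""
    return item, sum(ones)


def run_count_job(records: List[List[str]]) -> Dict[str, int]:
    """Simulate map, shuffle (group by key) and reduce in a single process."""
    shuffled: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        for key, value in map_count(record):
            shuffled[key].append(value)
    return dict(reduce_count(key, values) for key, values in shuffled.items())


def build_f_list(counts: Dict[str, int], min_support: int) -> List[str]:
    """Keep items meeting the threshold, sorted by descending count."""
    frequent = [(item, n) for item, n in counts.items() if n >= min_support]
    # Assumption: ties are broken alphabetically to make the order deterministic.
    frequent.sort(key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in frequent]


records = [
    ["sword", "shield", "potion"],
    ["sword", "potion"],
    ["sword", "shield"],
    ["bow"],
]
counts = run_count_job(records)
f_list = build_f_list(counts, min_support=2)
```

Here `bow` occurs only once and is dropped, leaving an F-List that starts with `sword`.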
By adopting the above scheme, the present invention performs better than other parallel algorithm schemes and greatly improves mining performance on game data, specifically as follows:
1) N-List is used, which reduces complexity. General frequent itemset mining methods recurse over trees, which not only takes up space but also has a recursion complexity far exceeding that of this method. In addition, this method uses its own comparison method, which does not compare every PP-code in one N-List with every PP-code in the other: comparing all PP-codes of two N-Lists pairwise has complexity O(mn), where m and n are the lengths of the two N-Lists, whereas this comparison method has complexity of only O(m+n), which also greatly reduces the recursion complexity;
2) The new concept G-Subsume is used for parallelization with MapReduce. During frequent itemset mining, G-Subsume reduces the number of N-List merge comparisons, because itemsets are merged directly with the G-Subsume instead, which greatly improves mining efficiency;
3) Ordinarily the G-List could be grouped by taking a remainder, but some items take a long time to recurse and others a short time, so the final result has to wait for the slowest item and the load is unbalanced. To balance the load, the present invention estimates the load of each item in advance: in depth-first mode, the depth of the PPC-tree affects the time needed for the pre-order and post-order traversals of the tree, and the greater the depth, the more time is consumed; the maximum path length in the PPC-tree for each item equals the item's sequence number in the F-List, and the maximum length of the N-List corresponding to the item equals the smaller of the item's support count and 2^(n-1), where n is the sequence number of the item in the F-List. The load of each item can easily be estimated from these two rules, which realizes the load balancing of the present invention.
Those skilled in the art will readily understand that the foregoing is merely a preferred embodiment of the present invention and is not intended to limit the present invention; any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
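The bound on the N-List length used in the load estimate can be justified by a short counting argument. The sketch below assumes that n is the 1-based sequence number of an item in the F-List, that every path of the PPC-tree lists its items in F-List order, and that the bound is read as 2^(n-1); the symbols i_n and sup are introduced here for illustration only.

```latex
% Assumption: i_1, ..., i_n are the first n items of the F-List, sup(i_n) is the
% support count of i_n, and tree paths follow F-List order.
\begin{align*}
\text{ancestors of a node labelled } i_n &\subseteq \{i_1,\dots,i_{n-1}\}
  && \text{(paths follow F-List order)}\\
\#\{\text{nodes labelled } i_n\} &\le 2^{\,n-1}
  && \text{(distinct nodes have distinct ancestor sets)}\\
\#\{\text{nodes labelled } i_n\} &\le \operatorname{sup}(i_n)
  && \text{(each node has frequency} \ge 1\text{, and the frequencies sum to } \operatorname{sup}(i_n)\text{)}\\
\lvert \text{N-List}(i_n) \rvert &\le \min\bigl(\operatorname{sup}(i_n),\; 2^{\,n-1}\bigr)
\end{align*}
```

For example, an item with sequence number 3 and support count 10 would have at most min(10, 4) = 4 PP-codes in its N-List.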

Claims (4)

1. A frequent itemset mining method, characterized by comprising the following steps:
(1) counting the number of occurrences of each item in the raw data by MapReduce;
(2) filtering out the frequent 1-itemsets according to the number of occurrences of each item, and sorting the frequent 1-itemsets in descending order of occurrence count to form the F-List;
(3) grouping the items in the F-List according to a load balancing principle to obtain the G-List, which contains each item and the number of the group to which it belongs;
(4) the Mapper distributes the raw data:
(4-1) reordering the items of each record of the raw data according to the order of the items in the F-List;
(4-2) reading items one at a time starting from the last item of each record of the raw data, looking up the group number gid of the item in the G-List, then using gid as the key and all items ranked before the item in the record as the value to form the key-value pair <key=gid, value=items>, which is output by the Mapper; if the group number gid has already occurred, the item is ignored; the previous item is then taken and the same operation is performed, until the record has been processed;
(5) the Reducer performs frequent itemset mining on the key-value pairs output by the Mapper:
(5-1) according to the key=gid output by the Mapper, value=items is distributed to the corresponding Reducer, and the Reducer builds a PPC-tree; the PPC-tree is a tree structure in which each node has five attribute values: item name, support frequency, child nodes, pre-order traversal number pre and post-order traversal number post;
(5-2) for each node N_i in the PPC-tree, <Ni.pre, Ni.post, Ni.frequency> is called its PP-code; sorting the PP-codes in ascending order of pre yields the N-List of each frequent 1-itemset in the F-List;
(5-3) building the G-Subsume of the Reducer: [formula not reproduced in this text], where A and B denote two different frequent 1-itemsets, A.gid denotes the group number of item A, Reducer.gid denotes the group number corresponding to the Reducer, g(X) denotes the set of IDs of the records containing the frequent 1-itemset X, X = A or B, and I_1 denotes the set of frequent 1-itemsets;
(5-4) recursive mining, with the following sub-steps:
a) taking the F-List as the recursion initial data of the first round, taking the last item L in the F-List and combining it with its G-Subsume(L) to generate frequent 2-itemsets, which are written to the result array Result;
b) taking the items X one by one from front to back in the recursion initial data, and comparing the PP-codes of the N-List of X, N_X, with the PP-codes of the N-List of L, N_Last; if X is present in G-Subsume(L), continuing with the next item; otherwise:
when the pre of the PP-code of N_X is less than the pre of the PP-code of N_Last, and the post of the PP-code of N_X is greater than the post of the PP-code of N_Last, generating the frequent 2-itemset XL, adding <NX.PP-code.pre, NX.PP-code.post, NLast.PP-code.frequency> to the N-List of XL, N_XL, and moving N_Last to its next PP-code;
when the pre of the PP-code of N_X is less than the pre of the PP-code of N_Last, and the post of the PP-code of N_X is less than the post of the PP-code of N_Last, moving N_X to its next PP-code;
when the pre of the PP-code of N_X is greater than the pre of the PP-code of N_Last, moving N_Last to its next PP-code, until the PP-codes of N_Last and N_X have all been traversed;
after the PP-codes of N_X have been traversed, if the sum of the supports of the PP-codes in the N-List of the result XL does not meet the threshold, deleting XL; otherwise XL is a frequent 2-itemset;
c) continuing to take the next item from the recursion initial data and repeating step b) until all items before the last item L in the recursion initial data have been compared, which yields the frequent 2-itemsets with suffix L and their N-Lists; these are written to the result array Result, and their N-Lists serve as the initial data for frequent 3-itemset mining; the frequent 2-itemsets are merged directly with G-Subsume(L) to obtain part of the frequent 3-itemsets with suffix L, which are added to the array Result;
d) taking the second-to-last item in the recursion initial data and repeating steps a), b) and c), until all items in the recursion initial data have been processed, which yields all frequent 2-itemsets and part of the frequent 3-itemsets;
e) extracting the frequent 2-itemsets that differ only in their prefix as the recursion initial data of the second round and, starting from the last one, processing them in the same way as steps b)-d) to obtain all frequent 3-itemsets; the frequent 3-itemsets whose suffix has a G-Subsume are combined with that G-Subsume to obtain frequent 4-itemsets;
f) and so on, until the N-List comparison yields a single frequent K-itemset, at which point the recursion ends;
(5-5) the Reducer outputs <key=item∈gid, value=Result>, which completes the entire frequent itemset mining process.
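Steps (5-1) and (5-2) of claim 1 can be illustrated with a small single-process sketch. It is not the patented Reducer: the names `Node`, `build_ppc_tree` and `build_nlists` are hypothetical, the records are assumed to be already ordered by the F-List, and numbering the root as 0 in both traversals is an assumption.

```python
from collections import defaultdict
from itertools import count
from typing import Dict, List, Optional, Tuple


class Node:
    """PPC-tree node with the five attributes named in step (5-1)."""

    def __init__(self, name: Optional[str]) -> None:
        self.name = name                         # item name (None for the root)
        self.frequency = 0                       # support frequency
        self.children: Dict[str, "Node"] = {}    # child nodes keyed by item name
        self.pre = -1                            # pre-order traversal number
        self.post = -1                           # post-order traversal number


def build_ppc_tree(records: List[List[str]]) -> Node:
    """Insert F-List-ordered records, then assign pre and post numbers."""
    root = Node(None)
    for record in records:
        node = root
        for item in record:
            node = node.children.setdefault(item, Node(item))
            node.frequency += 1
    pre_counter, post_counter = count(), count()

    def number(node: Node) -> None:
        node.pre = next(pre_counter)    # numbered on the way down
        for child in node.children.values():
            number(child)
        node.post = next(post_counter)  # numbered on the way back up

    number(root)
    return root


def build_nlists(root: Node) -> Dict[str, List[Tuple[int, int, int]]]:
    """Collect <pre, post, frequency> per item, sorted by ascending pre."""
    nlists: Dict[str, List[Tuple[int, int, int]]] = defaultdict(list)

    def visit(node: Node) -> None:
        if node.name is not None:
            nlists[node.name].append((node.pre, node.post, node.frequency))
        for child in node.children.values():
            visit(child)

    visit(root)
    for codes in nlists.values():
        codes.sort()  # tuples sort by pre first
    return dict(nlists)


nlists = build_nlists(build_ppc_tree([["A", "B", "C"], ["A", "B"], ["A", "C"]]))
```

Item C appears in two branches of this tree, so its N-List holds two PP-codes, while A and B each hold one.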
2. The frequent itemset mining method according to claim 1, characterized in that step (1) is implemented as follows:
(1-1) the raw database is horizontally partitioned; each sub-file obtained by partitioning is called a Block, and the Blocks are assigned to nodes in the cluster;
(1-2) each Block serves as the input data of a Map function; for each item a_j in a record T_i of the Block, the Mapper outputs the key-value pair <key=aj, value=1>;
(1-3) all key-value pairs with key=aj are assigned to the same Reducer, so that the input of the Reducer is <key=aj, value={1,1,…,1}>; the Reducer performs one summation and outputs <key=aj, value=sum{1,1,…,1}>.
3. The frequent itemset mining method according to claim 1 or 2, characterized in that the load balancing principle in step (3) is: the sequence number of each item in the F-List is used as its load value, and the items in the F-List are grouped according to their load values.
4. The frequent itemset mining method according to claim 1 or 2, characterized in that the G-List is stored in a hash table.
CN201611144649.6A 2016-12-13 2016-12-13 A kind of Mining Frequent Itemsets for being applied to game item recommendation Pending CN106815302A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611144649.6A CN106815302A (en) 2016-12-13 2016-12-13 A kind of Mining Frequent Itemsets for being applied to game item recommendation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611144649.6A CN106815302A (en) 2016-12-13 2016-12-13 A kind of Mining Frequent Itemsets for being applied to game item recommendation

Publications (1)

Publication Number Publication Date
CN106815302A true CN106815302A (en) 2017-06-09

Family

ID=59109915

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611144649.6A Pending CN106815302A (en) 2016-12-13 2016-12-13 A kind of Mining Frequent Itemsets for being applied to game item recommendation

Country Status (1)

Country Link
CN (1) CN106815302A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108089853A (en) * 2017-12-29 2018-05-29 江苏名通信息科技有限公司 Parallel Misra-Gries methods based on Hadoop
CN108090800A (en) * 2017-11-27 2018-05-29 珠海金山网络游戏科技有限公司 A kind of game item method for pushing and device based on player's consumption potentiality
CN109002532A (en) * 2018-07-17 2018-12-14 电子科技大学 Behavior trend mining analysis method and system based on student data
CN111309786A (en) * 2020-02-20 2020-06-19 江西理工大学 Parallel frequent item set mining method based on MapReduce
CN111729301A (en) * 2020-06-15 2020-10-02 北京智明星通科技股份有限公司 Method and device for recommending props in breakthrough game and game terminal
CN112925821A (en) * 2021-02-07 2021-06-08 江西理工大学 MapReduce-based parallel frequent item set incremental data mining method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101042698A (en) * 2007-02-01 2007-09-26 江苏技术师范学院 Synthesis excavation method of related rule and metarule
CN104408127A (en) * 2014-11-27 2015-03-11 无锡市思库瑞科技信息有限公司 Maximal pattern mining method for uncertain data based on depth-first
CN106202575A (en) * 2016-08-22 2016-12-07 东南大学 A kind of distributed quick Mining Frequent Itemsets based on Apriori

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
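Bay Vo et al.: "Mining frequent itemsets using the N-list and subsume concepts", International Journal of Machine Learning and Cybernetics *
廖晶贵 (Liao Jinggui): "Research and Implementation of Big Data Association Rule Mining Algorithms Based on Hadoop", China Master's Theses Full-text Database, Information Science and Technology *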

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108090800A (en) * 2017-11-27 2018-05-29 珠海金山网络游戏科技有限公司 A kind of game item method for pushing and device based on player's consumption potentiality
CN108089853A (en) * 2017-12-29 2018-05-29 江苏名通信息科技有限公司 Parallel Misra-Gries methods based on Hadoop
CN108089853B (en) * 2017-12-29 2021-03-16 镇江多游网络科技有限公司 Hadoop-based parallel Misra-Gries method
CN109002532A (en) * 2018-07-17 2018-12-14 电子科技大学 Behavior trend mining analysis method and system based on student data
CN111309786A (en) * 2020-02-20 2020-06-19 江西理工大学 Parallel frequent item set mining method based on MapReduce
CN111309786B (en) * 2020-02-20 2023-09-15 韶关学院 Parallel frequent item set mining method based on MapReduce
CN111729301A (en) * 2020-06-15 2020-10-02 北京智明星通科技股份有限公司 Method and device for recommending props in breakthrough game and game terminal
CN112925821A (en) * 2021-02-07 2021-06-08 江西理工大学 MapReduce-based parallel frequent item set incremental data mining method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170609
