CN106815302A - A frequent itemset mining method applied to game item recommendation - Google Patents
- Publication number
- CN106815302A (application CN201611144649.6A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5083—Techniques for rebalancing the load in a distributed system
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/50—Indexing scheme relating to G06F9/50
- G06F2209/5019—Workload prediction
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Fuzzy Systems (AREA)
- Mathematical Physics (AREA)
- Probability & Statistics with Applications (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a frequent itemset mining method and belongs to the technical field of data mining. The method first counts the occurrences of each item on MapReduce, sorts the items and filters them against a threshold to remove infrequent items, and thereby obtains the F-List. The F-List is then partitioned to obtain the G-List. According to the G-List partition, the data are passed to the Mappers, and after Mapper processing they are passed to the Reducers, where the mining step of MapReduce is carried out. Mining first builds the PPC-tree on each Reducer, then derives the N-Lists from the PPC-tree and the G-Subsume of the corresponding items on each Reducer, and finally obtains the frequent itemsets recursively from the N-Lists and G-Subsume. The invention partitions the data according to a load estimate, which ensures load balancing, and optimizes the recursive mining flow, which greatly reduces mining time on dense datasets.
Description
Technical field
The invention belongs to the field of data mining and, more particularly, relates to a frequent itemset mining method.
Background
Since its inception, data mining has aimed at discovering valuable information hidden in data. Data mining has six kinds of patterns: classification, clustering, regression, association, sequence, and deviation patterns. Association pattern analysis is one of its important research directions, and frequent itemset mining is an important component of association rule mining algorithms. Frequent itemset mining algorithms can find useful rules in big data and can be applied in many fields, such as web log mining, commercial retail, finance (recommending financial services that different customer groups may be interested in), and the recommendation of in-game items. However, in the era of big data, traditional single-machine mining can no longer meet demand. Simply raising CPU speed and memory size is both too costly and unrealistic, because hardware development cannot keep pace with the demand for computing speed. Parallel computation therefore becomes particularly important, and improving or redesigning data mining algorithms and combining them with a distributed computing model is currently a good alternative.
With the arrival of the networked information age, the online game industry has emerged. Online games fuse culture, art, and high technology, and provide a new form of entertainment and leisure. The industry is flourishing and its market keeps expanding, so that online games are becoming a bellwether of the network economy. As the choice of games grows, players become more and more discerning, and only games that suit their players can last in the market. Data mining has attracted great attention in the game industry, mainly because large amounts of data are widely available and there is an urgent need to turn these data into useful information and knowledge, which can be used to improve game quality, improve operating efficiency, and win more users for game operators. Data mining is used extensively in many industries, but the online game market has not yet been fully developed in this respect, and effective methods for processing game data are not yet well established.
Existing frequent itemset mining algorithms mainly have the following shortcomings:
1) the algorithms are too inefficient to produce mining results within a limited time;
2) parallel algorithms cannot divide the load evenly.
Summary of the invention
In view of the defects of the prior art and its urgent technical needs, the invention discloses a frequent itemset mining method that runs in parallel on the MapReduce platform. It partitions the data according to a load estimate to ensure load balancing, and it optimizes the recursive mining flow to greatly reduce mining time on dense datasets, thereby solving the problems of low algorithm efficiency and load imbalance.
To achieve the above object, the invention provides the following steps:
A frequent itemset mining method, comprising the following steps:
(1) count the number of occurrences of each item in the raw data using MapReduce;
(2) filter out the frequent 1-itemsets according to the occurrence count of each item, and sort them by occurrence count from high to low to form the F-List;
(3) group the items in the F-List according to a load balancing principle to obtain the G-List, which contains each item and the number of the group it belongs to;
(4) the Mapper distributes the raw data:
(4-1) reorder the items of every record of the raw data according to the item order in the F-List;
(4-2) starting from the last item of each record, read an item, look up its group number gid in the G-List, and then, with gid as the key and all items ranked before that item in the record as the value, form the key-value pair <key=gid, value=items> as Mapper output; if the group number gid has already occurred for this record, skip the item; continue with the preceding item in the same way until the record has been fully processed;
(5) the Reducer performs frequent itemset mining on the key-value pairs output by the Mapper:
(5-1) according to the key=gid output by the Mapper, value=items is dispatched to the corresponding Reducer, which builds the PPC-tree; the PPC-tree is a tree structure in which each node has five attributes: name, support count frequency, child nodes, pre-order traversal number pre, and post-order traversal number post;
(5-2) for each node N_i in the PPC-tree, <N_i.pre, N_i.post, N_i.frequency> is called its PP-code; sorting the PP-codes in ascending order of pre yields the N-List of each frequent 1-itemset in the F-List;
(5-3) build the G-Subsume of the Reducer: G-Subsume(A) = {B ∈ I_1 | A ∈ I_1, A.gid ∈ Reducer.gid, g(A) ⊆ g(B)}, where A and B denote two different frequent 1-itemsets, A.gid denotes the group number of item A, Reducer.gid denotes the group number handled by the Reducer, g(X) denotes the set of IDs of the records that contain the frequent 1-itemset X (X = A or B), and I_1 denotes the set of frequent 1-itemsets;
(5-4) recursive mining, with the following sub-steps:
a) take the F-List as the initial recursion data of the first round, take the last item L of the F-List, combine L with its G-Subsume(L) to generate frequent 2-itemsets, and write them to the result array Result;
b) take the items X of the initial recursion data one by one from front to back, and compare the PP-codes of N_X, the N-List of X, with the PP-codes of N_Last, the N-List of L. If X is in G-Subsume(L), move on to the next item; otherwise: when the pre of the current PP-code of N_X is smaller than the pre of the current PP-code of N_Last and its post is larger than the post of the current PP-code of N_Last, generate the frequent 2-itemset XL, add <N_X.PP-code.pre, N_X.PP-code.post, N_Last.PP-code.frequency> to N_XL, the N-List of XL, and advance to the next PP-code of N_Last; when the pre of the current PP-code of N_X is smaller than that of N_Last and its post is also smaller than that of N_Last, advance to the next PP-code of N_X; when the pre of the current PP-code of N_X is larger than that of N_Last, advance to the next PP-code of N_Last; continue until the PP-codes of N_Last and N_X have all been traversed;
after the PP-codes of N_X have been traversed, if the sum of the supports of the PP-codes in the N-List of the resulting XL does not meet the threshold, delete XL; otherwise XL is a frequent 2-itemset;
c) take the next item from the initial recursion data and repeat step b) until all items before the last item L have been compared; this yields the frequent 2-itemsets with L as suffix and their N-Lists, which are written to the result array Result, with the N-Lists serving as the initial data for mining frequent 3-itemsets; the frequent 2-itemsets are also merged directly with G-Subsume(L) to obtain some of the frequent 3-itemsets with L as suffix, which are added to Result;
d) take the second-to-last item of the initial recursion data and repeat steps a), b), c), and so on until all items of the initial recursion data have been processed; this yields all frequent 2-itemsets and some of the frequent 3-itemsets;
e) extract the frequent 2-itemsets that differ only in their prefix as the initial recursion data of the second round and, starting from the last one, process them in the same way as steps b)-d) to obtain all frequent 3-itemsets; in addition, each frequent 3-itemset whose suffix item has a G-Subsume is combined with that G-Subsume to obtain frequent 4-itemsets;
f) and so on, until the N-List comparison yields only a single frequent K-itemset, at which point the recursion ends;
(5-5) the Reducer outputs <key=item ∈ gid, value=Result>, which completes the entire frequent itemset mining process.
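Step (4) above can be illustrated with a small single-process sketch. It is not the patent's implementation: the function name `group_mapper` and the sample F-List and G-List are hypothetical, infrequent items are assumed to be dropped during reordering, and the emitted prefix is assumed to include the item itself so that the Reducer for that group can mine it.

```python
def group_mapper(record, f_list_order, g_list):
    """Emit (gid, prefix) pairs for one record, following steps (4-1) and (4-2)."""
    rank = {item: i for i, item in enumerate(f_list_order)}
    # (4-1) keep only frequent items (assumption) and reorder them by F-List position
    items = sorted((x for x in set(record) if x in rank), key=rank.__getitem__)
    seen_gids = set()
    pairs = []
    # (4-2) walk from the last item towards the first
    for pos in range(len(items) - 1, -1, -1):
        gid = g_list[items[pos]]
        if gid in seen_gids:
            continue  # this group already received a longer prefix of the record
        seen_gids.add(gid)
        # Assumption: the prefix includes the item at `pos` itself
        pairs.append((gid, items[:pos + 1]))
    return pairs


# Hypothetical F-List order and G-List (item -> group number)
f_list_order = ["B", "A", "C", "D"]
g_list = {"B": 0, "A": 1, "C": 1, "D": 0}
pairs = group_mapper(["D", "A", "B"], f_list_order, g_list)
```

Each group receives at most one prefix per record, which is what keeps the amount of data shuffled to the Reducers bounded by the number of groups.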
Further, step (1) is implemented as follows:
(1-1) the raw database is horizontally partitioned; each sub-file obtained by partitioning is called a Block, and the Blocks are assigned to the nodes of the cluster;
(1-2) each Block serves as the input of one Map function; for every item a_j in a record T_i of the Block, the Mapper outputs the key-value pair <key=a_j, value=1>;
(1-3) all key-value pairs with key=a_j are sent to the same Reducer, so the input of the Reducer is <key=a_j, value={1,1,…,1}>; the Reducer performs a single summation and outputs <key=a_j, value=sum{1,1,…,1}>.
Further, the load balancing principle in step (3) is: take the sequence number of each item in the F-List as its load value, and group the items in the F-List according to their load values.
Further, the G-List is stored in a hash table.
By adopting the above scheme, the invention outperforms other parallel algorithm schemes and greatly improves mining performance on game data, specifically:
1) N-Lists are used, which reduces complexity. General frequent itemset mining methods recurse over trees, which both takes up space and makes the recursion far more complex than in this method. This method also uses a distinctive comparison scheme that does not compare every pair of PP-codes in two N-Lists. Comparing every PP-code of two N-Lists pairwise has complexity O(mn), where m and n are the lengths of the two N-Lists, whereas the comparison scheme used here has complexity of only O(m+n), which significantly reduces the cost of the recursion;
2) the new concept of G-Subsume is used to parallelize the method on MapReduce. During frequent itemset mining, G-Subsume reduces the number of N-List merge comparisons, because itemsets are merged directly with G-Subsume instead, which greatly improves mining efficiency;
3) in general, the G-List could be formed by grouping items by remainder (modulo), but some items recurse for a long time and others for a short time, so the final result has to wait for the slowest item, and the load is unbalanced. To balance the load, the invention estimates the load of each item in advance. Under depth-first traversal, the depth of the PPC-tree affects the time needed for the pre-order and post-order traversals of the tree: the deeper the tree, the longer they take. The maximum path length in the PPC-tree for an item equals its sequence number in the F-List, and the maximum length of the N-List of that item is the minimum of the item's support count and 2^(n-1), where n is the item's sequence number in the F-List. From these two rules the load of each item can be easily estimated, which makes the load balancing of the invention possible.
Brief description of the drawings
Fig. 1 is a flow chart of the frequent itemset mining method of the invention;
Fig. 2 is a flow chart of frequent itemset mining by the Mapper and the Reducer;
Fig. 3 shows the construction process of the PPC-tree of the invention;
Fig. 4 is a flow chart of obtaining frequent 2-itemsets from frequent 1-itemsets during recursive mining;
Fig. 5 is a schematic diagram of load balancing in the invention;
Fig. 6 is a schematic diagram of the MapReduce process of the invention.
Detailed description of the embodiments
To make the purpose, technical scheme, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it.
The terms used in the invention are explained first:
Frequent itemset: a set of items is called an itemset; any itemset whose occurrence ratio reaches a given constant s is a frequent itemset.
Frequent K-itemset: an itemset that contains K items and is frequent is called a frequent K-itemset.
Support: the occurrence frequency of an itemset, i.e. the number of transactions containing the itemset, is called the support of the itemset.
MapReduce: a programming model for parallel computation over large datasets (larger than 1 TB). Its central concepts, "Map" and "Reduce", are borrowed from functional programming languages, with features also borrowed from vector programming languages. It makes it much easier for programmers who are not familiar with distributed parallel programming to run their programs on a distributed system. In current implementations, the user specifies a Map function that maps a set of key-value pairs to a new set of key-value pairs, and a concurrent Reduce function that ensures that all mapped key-value pairs sharing the same key are grouped together.
Fig. 1 shows the flow chart of the frequent itemset mining method of the invention. The method runs on the MapReduce platform. It first counts the occurrences of each item on MapReduce, sorts the items and filters them against a threshold to remove infrequent items, and obtains the F-List. The F-List is then partitioned to obtain the G-List. According to the G-List partition, records are passed to the Mappers, which process them and pass each transaction to the Reducers, where the mining part of MapReduce is carried out. Mining first builds the PPC-tree on each Reducer, then derives the N-Lists from the PPC-tree and the G-Subsume of the corresponding items on each Reducer, and finally obtains the frequent itemsets recursively from the N-Lists and G-Subsume.
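The sort-and-filter stage that produces the F-List might look like the following minimal sketch. The function name `build_f_list`, the sample counts, and the tie-breaking by item name are assumptions made for illustration; the text only specifies a descending sort by occurrence count and the removal of items below the threshold.

```python
def build_f_list(counts, min_support):
    """Return [(item, count), ...] sorted by count from high to low."""
    # Remove items whose occurrence count is below the threshold
    frequent = [(item, n) for item, n in counts.items() if n >= min_support]
    # Descending by count; ties broken by item name for determinism (assumption)
    frequent.sort(key=lambda pair: (-pair[1], pair[0]))
    return frequent


# Hypothetical item counts, e.g. as produced by the first MapReduce pass
counts = {"A": 2, "B": 4, "C": 2, "D": 2, "E": 1}
f_list = build_f_list(counts, min_support=2)
```

The position of an item in `f_list` (starting from 1) is the sequence number that the later load estimate relies on.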
More specifically, the frequent itemset mining method of the invention proceeds through the following steps:
(1) Count the number of occurrences of each item in the raw data using MapReduce. The sub-steps are:
(1-1) the raw database is horizontally partitioned; each sub-file obtained by partitioning is called a Block, and the Blocks are assigned to the nodes of the cluster; this step is performed automatically by the Hadoop platform;
(1-2) each Block serves as the input of one Map function; the input key-value pair of the Mapper is <key, value=T_i>, where T_i denotes a record in the Block. For every item a_j in T_i, the Mapper outputs the key-value pair <key=a_j, value=1>;
(1-3) Reduce merges the key-value pairs from all Mappers. Specifically, all key-value pairs with key=a_j are sent to the same Reducer, so the input of the Reducer is <key=a_j, value={1,1,…,1}>. The Reducer only needs to perform a single summation and then outputs <key=a_j, value=sum{1,1,…,1}>;
(2) Filter out the frequent 1-itemsets according to the occurrence count of each item, and sort them by occurrence count from high to low to build the F-List, which contains the frequent 1-itemsets and their occurrence counts. The sub-steps are as follows:
(2-1) after the above operations, the key-value pairs output by the Reducer are stored on HDFS, and the result file is read from HDFS;
(2-2) sort and remove non-qualifying items: sort the key-value pairs in descending order of value and, according to the given threshold, remove the items below the threshold to obtain the F-List;
(3) Group the items in the F-List according to a load balancing principle to obtain the G-List, which contains each item and the number of the group it belongs to. The sub-steps are as follows:
(3-1) predict the load of each item in the F-List in advance, and partition the F-List according to the load balancing principle;
(3-2) build the G-List from the partition of the F-List. Each G-List entry has two fields: the item and its group number gid. The G-List is stored in a hash table;
(4) The Mapper distributes the raw data:
(4-1) reorder the items of every record according to the item order in the F-List;
(4-2) starting from the last item of each record, read an item, look up its group number gid in the G-List, and then, with gid as the key and all items ranked before that item as the value, form the key-value pair <key=gid, value=items> as Mapper output; if the group number gid has already occurred for this record, skip the item; continue with the preceding item in the same way until the record has been fully processed;
(5) The Reducer performs frequent itemset mining on the key-value pairs output by the Mapper:
(5-1) according to the key=gid output by the Mapper, value=items is dispatched to the corresponding Reducer, which builds the PPC-tree; the PPC-tree is a tree structure in which each node has five attributes: name, support count frequency, child nodes, pre-order traversal number pre, and post-order traversal number post;
(5-2) for each node N_i in the PPC-tree, <N_i.pre, N_i.post, N_i.frequency> is called its PP-code; sorting the PP-codes in ascending order of pre yields the N-List of each frequent 1-itemset in the F-List;
(5-3) build the G-Subsume of the Reducer. G-Subsume is a new concept proposed by the invention: G-Subsume(A) = {B ∈ I_1 | A ∈ I_1, A.gid ∈ Reducer.gid, g(A) ⊆ g(B)}, where A and B denote two different frequent 1-itemsets, A.gid denotes the group number of item A, and Reducer.gid denotes the group number handled by the Reducer. G-Subsume is defined only for frequent 1-itemsets, i.e. it is computed for all frequent 1-itemsets belonging to the gid handled by the Reducer; the condition A.gid ∈ Reducer.gid expresses that G-Subsume is computed only for the frequent 1-itemsets of Reducer.gid. Every record has an ID, and g(X) denotes the set of IDs of the records containing item X. The condition g(A) ⊆ g(B) means that every record containing item A necessarily contains item B, while a record containing item B does not necessarily contain item A. G-Subsume thus amounts to finding, for the frequent 1-itemsets of Reducer.gid, the set of their ancestors. In subsequent mining, if the G-Subsume of A is {A_1, A_2, …, A_m}, then the support of the combination of A with any of the 2^m - 1 non-empty subsets of this set equals the support of A. This property is used in the subsequent frequent itemset mining: if G-Subsume(A) = {B} and XA is a frequent itemset, then XBA must be a frequent itemset as well.
(5-4) Recursive mining, with the following sub-steps:
a) take the F-List as the initial recursion data of the first round and start the recursion from N_Last, the N-List of the last item L. Combine L with its G-Subsume to generate frequent 2-itemsets and write them to the result array Result; these itemsets are not used as initial data for recursively mining frequent 3-itemsets and are only added to Result as results;
b) going through the initial recursion data from front to back, compare the PP-codes of N_X, the N-List of item X, with the PP-codes of N_Last. If X is in the G-Subsume of L, move on to the next item; otherwise: when the pre of the current PP-code of N_X is smaller than the pre of the current PP-code of N_Last and its post is larger than the post of the current PP-code of N_Last, add <N_X.PP-code.pre, N_X.PP-code.post, N_Last.PP-code.frequency> to a new N-List named XL and advance to the next PP-code of N_Last; when the pre of the current PP-code of N_X is smaller than that of N_Last and its post is also smaller than that of N_Last, advance to the next PP-code of N_X; when the pre of the current PP-code of N_X is larger than that of N_Last, advance to the next PP-code of N_Last; continue until the PP-codes of N_Last and N_X have all been traversed. This scheme reduces complexity: comparing every element of N_Last with every element of N_X costs O(mn), where m and n are the lengths of N_Last and N_X, whereas this scheme costs only O(m+n). After the PP-codes of N_X have been traversed, if the sum of the supports of the PP-codes in the N-List of the resulting XL does not meet the threshold, delete XL; otherwise XL is a frequent 2-itemset;
c) continue by comparing the PP-codes of the N-List of the next item with the PP-codes of N_Last, i.e. repeat step b), until all items have been compared. This yields the set {AL, BL, …} of frequent 2-itemsets with L as suffix together with the N-List of each, which are written to the result array Result, with the N-Lists serving as the initial data for mining frequent 3-itemsets. Owing to the property described in (5-3), the frequent 2-itemsets are merged directly with the G-Subsume of L to obtain some of the frequent 3-itemsets with L as suffix, which are added to Result;
d) take the preceding item and perform operations a), b), c) above, and so on until all items have been processed; this yields all frequent 2-itemsets and some of the frequent 3-itemsets. As the above steps show, none of the frequent itemsets obtained by merging with G-Subsume is used as initial data for recursive mining; that is, the next step of mining frequent 3-itemsets does not use the frequent itemsets obtained by merging with G-Subsume;
e) after the frequent 2-itemsets have been obtained, frequent 3-itemsets are mined further. Only two 2-itemsets that differ only in their prefix can yield a frequent 3-itemset; for example, AX and BX can be tested to determine whether they yield a frequent 3-itemset. The frequent 2-itemsets that differ only in their prefix are extracted as the initial recursion data of the second round; starting from the last one, each is compared with the itemsets before it, from front to back, in the same way as in steps b)-d), by comparing the N-Lists of AX and BX. Repeating the comparison yields all frequent 3-itemsets; in addition, each frequent 3-itemset whose suffix item has a G-Subsume is combined with that G-Subsume to obtain frequent 4-itemsets;
f) and so on, until the N-List comparison yields only one frequent K-itemset (not counting the frequent K-itemsets obtained by merging with G-Subsume), at which point the recursion ends;
(5-5) the Reducer outputs <key=item ∈ gid, value=Result>, which completes the entire frequent itemset mining process.
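The definition of G-Subsume in step (5-3) can be made concrete with a small sketch. This is a simplified illustration, not the patent's implementation: it computes the record-ID sets g(X) directly from a list of records rather than from the PPC-tree, and the function name `g_subsume` and the sample data are hypothetical.

```python
def g_subsume(records, f_list_order, g_list, reducer_gid):
    """For each item A of the Reducer's group, list the items B with g(A) a subset of g(B)."""
    # g(X): set of record IDs that contain item X
    g = {item: set() for item in f_list_order}
    for rid, record in enumerate(records):
        for item in record:
            if item in g:
                g[item].add(rid)
    result = {}
    for a in f_list_order:
        if g_list[a] != reducer_gid:
            continue  # only items belonging to this Reducer's group
        # B qualifies when every record containing A also contains B
        result[a] = [b for b in f_list_order if b != a and g[a] <= g[b]]
    return result


# Hypothetical sample data
records = [["A", "B", "C"], ["B", "C"], ["A", "B", "D"], ["B", "D"]]
f_list_order = ["B", "A", "C", "D"]
g_list = {"B": 0, "A": 1, "C": 1, "D": 0}
subsume = g_subsume(records, f_list_order, g_list, reducer_gid=1)
```

In this sample every record that contains A or C also contains B, so AB and CB have the same support as A and C respectively and can be written to the result without any N-List comparison.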
This completes all steps of frequent itemset mining. The application of the invention is explained below using a game application as an example:
1) Group the game users. The invention draws a heat map from the number of heroes each user has used and the length of each user's hero-usage sequence, and segments the user population accordingly.
2) Denoise the data. The raw data contain much useless interfering data, such as the hero that must be used in a player's first game; such data carry no meaning and must be removed by denoising, so as to obtain meaningful hero-usage sequences for the different user groups.
3) Apply the algorithm to these sequences. The final mining result is the set of frequent patterns in users' hero-usage sequences, which is used to guide users step by step from the group with few heroes used and short hero-usage sequences to the group with many heroes used and long hero-usage sequences.
Fig. 2 shows the flow chart of frequent itemset mining by the Mapper and the Reducer. First, in the Mapper, each record is sorted according to the order of the F-List and then processed in a loop according to the G-List partition, and the results are passed to the Reducers. On each Reducer, the PPC-tree is built first, the N-Lists and G-Subsume are then derived from it, and the final frequent itemsets are obtained by recursive mining.
Fig. 3 shows the construction of the PPC-tree in the invention, taking the input shown in the figure as an example. The first record, A, B, C, is inserted item by item, in that order, into the empty tree below the root node. The second record is B, C: the children of the root are searched for a B node; none is found, so a B node is created and inserted; the children of B are then searched for a C node; none is found, so a C node is created and inserted. The third record is A, B, D: the A and B nodes are found, but B has no child D, so a D node is created and inserted under B. The last record is B, D: the B node is found, but it has no child D, so a D node is created and inserted under B. This completes the construction of the PPC-tree.
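The construction just described, together with the pre/post numbering and the N-Lists of steps (5-1) and (5-2), can be sketched as follows. The class and function names are hypothetical, the records are assumed to be already sorted in F-List order, and numbering the root as node 0 is an assumption of this sketch.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.frequency = 0
        self.children = {}  # item name -> child Node, in insertion order
        self.pre = -1
        self.post = -1


def build_ppc_tree(records):
    """Insert every record along a path from the root, then number the nodes."""
    root = Node(None)
    for record in records:
        node = root
        for item in record:
            if item not in node.children:
                node.children[item] = Node(item)  # create the missing node
            node = node.children[item]
            node.frequency += 1
    counters = {"pre": 0, "post": 0}

    def number(node):
        node.pre = counters["pre"]      # pre-order number on entry
        counters["pre"] += 1
        for child in node.children.values():
            number(child)
        node.post = counters["post"]    # post-order number on exit
        counters["post"] += 1

    number(root)
    return root


def build_n_lists(root):
    """Collect the PP-codes (pre, post, frequency) of each item in ascending pre order."""
    n_lists = {}

    def visit(node):
        if node.name is not None:
            n_lists.setdefault(node.name, []).append(
                (node.pre, node.post, node.frequency))
        for child in node.children.values():
            visit(child)

    visit(root)  # a pre-order walk, so each list is already sorted by pre
    return n_lists


records = [["A", "B", "C"], ["B", "C"], ["A", "B", "D"], ["B", "D"]]
root = build_ppc_tree(records)
n_lists = build_n_lists(root)
```

For these four records, item B appears on two branches of the tree, so its N-List holds two PP-codes, each with frequency 2.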
Fig. 4 shows the flow chart of obtaining frequent 2-itemsets from frequent 1-itemsets during recursive mining. First, the last item L_n is merged with its G-Subsume to obtain some of the frequent 2-itemsets. Then, from front to back, each item is checked against the G-Subsume of L_n: if the item is contained in it, move on to the next item; if not, compare the item's N-List with the N-List of L_n, PP-code by PP-code, according to the following rules. When the pre of the current PP-code of N_x is smaller than the pre of the current PP-code of N_n and its post is larger than the post of the current PP-code of N_n, add <N_x.PP-code.pre, N_x.PP-code.post, N_n.PP-code.frequency> to a new N-List named L_xL_n and advance to the next PP-code of N_n; when the pre of the current PP-code of N_x is smaller than that of N_n and its post is also smaller than that of N_n, advance to the next PP-code of N_x; when the pre of the current PP-code of N_x is larger than that of N_n, advance to the next PP-code of N_n; continue until the PP-codes of N_n and N_x have all been traversed. If the sum of the supports of the PP-codes in the resulting N-List does not meet the threshold, delete L_xL_n; otherwise L_xL_n is a frequent 2-itemset. Continue by comparing the PP-codes of the next item's N-List with those of N_n until all items have been compared, which yields the frequent 2-itemsets with L_n as suffix. Then take the item preceding L_n and perform the same operations, until the second item has been processed; this yields all frequent 2-itemsets. The mining of frequent k-itemsets is similar to that of frequent 2-itemsets and is not described further. Note that subsequent G-Subsume merges use the G-Subsume of the suffix item of the frequent itemset, and that the frequent k-itemsets generated by merging with G-Subsume are not used as initial data for frequent (k+1)-itemsets but only as results. In subsequent frequent k-itemset mining, N-Lists are compared only between two itemsets that differ only in their prefix, for example AX and BX.
Fig. 5 shows the schematic diagram of load balancing in the invention. Each item must be added to a group in the G-List, identified by its group number gid. In general, items could be grouped by remainder (modulo), but some items recurse for a long time and others for a short time, so the final result has to wait for the slowest item and the load is unbalanced. To balance the load, the invention adopts a load balancing strategy in which the load of each item is estimated in advance on the following grounds:
1) under depth-first traversal, the depth of the PPC-tree affects the time needed for the pre-order and post-order traversals of the tree: the deeper the tree, the longer they take;
2) when the N-Lists of two frequent itemsets are merged, the time complexity is the sum of the lengths of the two N-Lists;
3) the maximum path length in the PPC-tree for an item equals its sequence number in the F-List, and the maximum length of the N-List of that item is the minimum of the item's support count and 2^(n-1), where n is the item's sequence number in the F-List;
The load of each item is therefore estimated by its sequence number in the F-List. After the loads have been estimated, the invention uses a greedy algorithm to balance the load: each item is added to the group whose current total load is smallest, until all items have been assigned.
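A minimal sketch of this greedy assignment is shown below. The function name `build_g_list` is hypothetical, and processing the items from heaviest to lightest is an assumption of the sketch; the text only states that each item joins the group with the smallest current total load.

```python
import heapq


def build_g_list(f_list_order, num_groups):
    """Assign each item to the currently lightest group; returns item -> gid."""
    # Estimated load of an item = its 1-based sequence number in the F-List
    loads = [(rank, item) for rank, item in enumerate(f_list_order, start=1)]
    heap = [(0, gid) for gid in range(num_groups)]  # (total load, group number)
    heapq.heapify(heap)
    g_list = {}  # a dict is a hash table, matching how the G-List is stored
    # Assumption: place the heaviest items first
    for load, item in sorted(loads, reverse=True):
        total, gid = heapq.heappop(heap)
        g_list[item] = gid
        heapq.heappush(heap, (total + load, gid))
    return g_list


g_list = build_g_list(["B", "A", "C", "D"], num_groups=2)
group_loads = {}
for rank, item in enumerate(["B", "A", "C", "D"], start=1):
    group_loads[g_list[item]] = group_loads.get(g_list[item], 0) + rank
```

In this example the two groups end up with the same estimated load, whereas grouping by remainder would put the items with sequence numbers 2 and 4 together and leave the groups uneven.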
Fig. 6 is a schematic diagram of the MapReduce process in the present invention. The invention goes through two MapReduce passes. In the first MapReduce, Map outputs <key=item, value=1>, and the Reducer then adds up the values and outputs <key=item, value=sum{1,1,...,1}>. The second MapReduce performs the data mining: after the F-List and G-List have been obtained, mining is carried out using N-List and Subsume to obtain the final result.
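The first pass can be simulated in a single process as shown below. This is only a sketch of the key-value flow, not a Hadoop job: the shuffle is imitated with a dictionary, the function names (`map_count`, `reduce_count`, `first_pass`, `build_f_list`) are hypothetical, and the alphabetical tie-break when two items have the same count is an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_count(record: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map: emit <key=item, value=1> for every item of one data record."""
    for item in record:
        yield item, 1


def reduce_count(item: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: emit <key=item, value=sum{1,1,...,1}>."""
    return item, sum(values)


def first_pass(records: Iterable[Iterable[str]]) -> Dict[str, int]:
    """Run Map, an in-memory shuffle, and Reduce over all records."""
    shuffled: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        for key, value in map_count(record):
            shuffled[key].append(value)
    return dict(reduce_count(k, vs) for k, vs in shuffled.items())


def build_f_list(counts: Dict[str, int], min_support: int) -> List[str]:
    """Keep items meeting the threshold, sorted by count from high to low."""
    frequent = [(item, c) for item, c in counts.items() if c >= min_support]
    # Assumption: ties are broken alphabetically to make the order stable.
    frequent.sort(key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in frequent]


if __name__ == "__main__":
    data = [["sword", "potion"], ["sword", "shield"], ["sword", "potion", "gem"]]
    counts = first_pass(data)
    print(counts, build_f_list(counts, min_support=2))
```

The resulting F-List is the input to the grouping step and to the second pass.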
By adopting the above scheme, the present invention outperforms other parallel algorithm schemes and greatly improves performance in game data mining, specifically as follows:
1) N-List is used, which reduces complexity. General frequent itemset mining methods recurse over sets, which both takes up space and makes the recursion far more complex than in the present method. In addition, the present method uses a distinctive comparison approach that does not compare every PP-code of one N-List with every PP-code of the other. If every pair of PP-codes of two N-Lists were compared, the complexity would be O(mn), where m and n are the lengths of the two N-Lists, whereas the complexity of this comparison approach is only O(m+n), which also greatly reduces the complexity of the recursion;
2) the new concept G-Subsume is used for parallelization on MapReduce. During frequent itemset mining, G-Subsume reduces the number of N-List merge comparisons, because itemsets are instead merged directly with the G-Subsume, which greatly improves mining efficiency;
3) normally the G-List could be grouped by taking a remainder, but some items recurse for a long time and others for a short time, so the final result has to wait for the slowest item and the load also becomes unbalanced. To balance the load, the present invention estimates the load of each item in advance: in depth-first mode, the depth of the PPC-tree affects the time needed for pre-order and post-order traversal of the tree, and the greater the depth, the more time is consumed; the maximum path length of the PPC-tree containing an item equals the sequence number of that item in the F-List, and the maximum length of the N-List of that item equals the smaller of the support count of the item and 2^(n-1), where n is the sequence number of the item in the F-List. From these two rules the load of each item can be estimated easily, which realizes the load balancing of the present invention.
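The O(m+n) comparison mentioned in item 1) can be illustrated with a two-pointer pass over two N-Lists that are sorted by pre. The sketch below follows the three comparison cases given in claim 1, step b). The names `intersect_nlists` and `support` are hypothetical, and folding consecutive entries that share the same ancestor node into one PP-code is an assumption borrowed from common N-list practice; it does not change the support sum.

```python
from typing import List, Tuple

PPCode = Tuple[int, int, int]  # (pre, post, frequency)


def intersect_nlists(n_x: List[PPCode], n_last: List[PPCode]) -> List[PPCode]:
    """Build the N-List of itemset XL from N_X and N_Last in O(m+n) steps."""
    result: List[PPCode] = []
    i = j = 0
    while i < len(n_x) and j < len(n_last):
        x_pre, x_post, _ = n_x[i]
        l_pre, l_post, l_freq = n_last[j]
        if x_pre < l_pre and x_post > l_post:
            # The X node is an ancestor of the L node: keep X's pre/post
            # and L's frequency, then advance the N_Last pointer.
            if result and result[-1][0] == x_pre:
                # Assumption: merge entries that share the same X node.
                pre, post, freq = result[-1]
                result[-1] = (pre, post, freq + l_freq)
            else:
                result.append((x_pre, x_post, l_freq))
            j += 1
        elif x_pre < l_pre:
            # The X node lies entirely before the L node: advance N_X.
            i += 1
        else:
            # The X node starts after the L node: advance N_Last.
            j += 1
    return result


def support(n_list: List[PPCode]) -> int:
    """Support of an itemset = sum of the frequencies in its N-List."""
    return sum(freq for _, _, freq in n_list)


if __name__ == "__main__":
    n_b = [(2, 1, 3), (5, 5, 1)]
    n_c = [(3, 0, 2), (4, 2, 1), (6, 4, 1)]
    n_bc = intersect_nlists(n_b, n_c)
    print(n_bc, support(n_bc))
```

Each loop iteration advances one of the two pointers, so the number of iterations is at most m+n. The itemset XL is kept only if `support` of the result meets the minimum support threshold.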
Those skilled in the art will readily understand that the foregoing is merely a preferred embodiment of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
Claims (4)
1. A frequent itemset mining method, characterized by comprising the following steps:
(1) counting the number of occurrences of each item in the raw data by MapReduce;
(2) filtering out the frequent 1-itemsets according to the number of occurrences of each item, and sorting the frequent 1-itemsets by number of occurrences from high to low to form the F-List;
(3) grouping the items in the F-List according to a load-balancing principle to obtain a G-List containing each item and the number of the group to which it belongs;
(4) the Mapper distributes the raw data:
(4-1) the items of each raw data record are reordered according to the order of the items in the F-List;
(4-2) an item is read, starting from the last item of each raw data record, and the group number gid of the item is looked up in the G-List; gid is then used as the key, and all items ranked before the item in the record are used as the value, forming the key-value pair <key=gid, value=items> that the Mapper outputs; if the group number gid has already occurred, the item is skipped; the preceding item is then taken and the same operation is performed, until the data record has been fully processed;
(5) the Reducer performs frequent itemset mining on the key-value pairs output by the Mapper:
(5-1) each pair <key=gid, value=items> output by the Mapper is distributed to the corresponding Reducer according to its key, and the Reducer builds a PPC-tree; the PPC-tree is a tree structure in which each node has five attributes: name, support frequency, child nodes, pre-order traversal number pre, and post-order traversal number post;
(5-2) for each node N_i in the PPC-tree, <N_i.pre, N_i.post, N_i.frequency> is called its PP-code; the PP-codes are sorted in ascending order of pre to build the N-List of each frequent 1-itemset in the F-List;
(5-3) the G-Subsume of the Reducer is constructed: wherein A and B denote two different frequent 1-itemsets, A.gid denotes the group number of item A, Reducer.gid denotes the group number corresponding to the Reducer, g(X) denotes the set of IDs of the data records containing the frequent item X, X = A or B, and I_1 denotes the set of frequent 1-itemsets;
(5-4) recursive mining, with the following sub-steps:
a) the F-List is used as the recursion initial data of the first round; the last item L of the F-List is taken and combined with its G-Subsume(L) to generate frequent 2-itemsets, which are written to the result array Result;
b) items X are taken one by one, from front to back, from the recursion initial data, and the PP-codes of the N-List of X, denoted N_X, are compared with the PP-codes of the N-List of L, denoted N_Last; if X is present in G-Subsume(L), the next item is taken; otherwise:
when the pre of the current PP-code of N_X is less than the pre of the current PP-code of N_Last, and the post of the PP-code of N_X is greater than the post of the PP-code of N_Last, the frequent 2-itemset XL is generated, <N_X.PP-code.pre, N_X.PP-code.post, N_Last.PP-code.frequency> is added to N_XL, the N-List of the frequent 2-itemset XL, and N_Last moves on to its next PP-code;
when the pre of the PP-code of N_X is less than the pre of the PP-code of N_Last, and the post of the PP-code of N_X is less than the post of the PP-code of N_Last, N_X moves on to its next PP-code;
when the pre of the PP-code of N_X is greater than the pre of the PP-code of N_Last, N_Last moves on to its next PP-code; this continues until the PP-codes of N_Last and N_X have all been traversed;
after the PP-codes of N_X have been traversed, if the sum of the supports of the PP-codes in the resulting N-List of XL does not meet the threshold, XL is deleted; if it does, XL is a frequent 2-itemset;
c) the next item continues to be taken from the recursion initial data and step b) is repeated, until all items before the last item L in the recursion initial data have been compared, which gives the frequent 2-itemsets with the last item L as suffix together with their N-Lists; these are written to the result array Result, and their N-Lists are used as the initial data for mining frequent 3-itemsets; the frequent 2-itemsets are merged directly with G-Subsume(L) to obtain part of the frequent 3-itemsets with L as suffix, which are added to the array Result;
d) the second-to-last item in the recursion initial data is taken and steps a), b), c) above are repeated, until all items in the recursion initial data have been processed, which gives all frequent 2-itemsets and part of the frequent 3-itemsets;
e) the frequent 2-itemsets that differ only in their prefix are extracted as the recursion initial data of the second round and, starting from the last one, are processed in the same way as steps b) to d), which gives all frequent 3-itemsets; those frequent 3-itemsets whose suffix item has a G-Subsume are combined with that G-Subsume to obtain frequent 4-itemsets;
f) and so on, until N-List comparison yields a unique frequent K-itemset, at which point the recursion ends;
(5-5) the Reducer outputs <key=item ∈ gid, value=Result>, which completes the entire frequent itemset mining process.
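Steps (4-1), (4-2), (5-1) and (5-2) of claim 1 can be sketched in a single process as follows. This is a hedged illustration, not the claimed implementation: the function and class names are hypothetical, the emitted value is assumed to include the item itself together with the items ranked before it, the root node is assumed to take pre-order number 0, and children are visited in insertion order.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def mapper(record: List[str], f_list: List[str],
           g_list: Dict[str, int]) -> List[Tuple[int, List[str]]]:
    """Steps (4-1)/(4-2): reorder by F-List, emit one prefix per group."""
    rank = {item: pos for pos, item in enumerate(f_list)}
    items = sorted((x for x in set(record) if x in rank), key=rank.get)
    seen, out = set(), []
    for i in range(len(items) - 1, -1, -1):  # scan from the last item
        gid = g_list[items[i]]
        if gid not in seen:  # a gid that already occurred is skipped
            seen.add(gid)
            # Assumption: the value holds the item and all items before it.
            out.append((gid, items[:i + 1]))
    return out


class Node:
    """PPC-tree node: name, frequency, children, pre and post numbers."""

    def __init__(self, name: Optional[str]) -> None:
        self.name = name
        self.frequency = 0
        self.children: Dict[str, "Node"] = {}
        self.pre = self.post = -1


def build_ppc_tree(records: List[List[str]]) -> Node:
    """Step (5-1): insert the ordered records, then number the nodes."""
    root = Node(None)
    for record in records:
        node = root
        for item in record:
            node = node.children.setdefault(item, Node(item))
            node.frequency += 1
    counters = [0, 0]  # next pre-order number, next post-order number

    def number(node: Node) -> None:
        node.pre, counters[0] = counters[0], counters[0] + 1
        for child in node.children.values():
            number(child)
        node.post, counters[1] = counters[1], counters[1] + 1

    number(root)
    return root


def build_nlists(root: Node) -> Dict[str, List[Tuple[int, int, int]]]:
    """Step (5-2): collect <pre, post, frequency> per item, ascending pre."""
    nlists: Dict[str, List[Tuple[int, int, int]]] = defaultdict(list)

    def visit(node: Node) -> None:
        if node.name is not None:
            nlists[node.name].append((node.pre, node.post, node.frequency))
        for child in node.children.values():
            visit(child)

    visit(root)  # pre-order visit, so each list is already sorted by pre
    return dict(nlists)
```

Because the nodes are visited in pre-order, the PP-codes of each item are appended in ascending order of pre and no separate sort is needed in this sketch.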
2. The frequent itemset mining method according to claim 1, characterized in that step (1) is specifically implemented as follows:
(1-1) the raw database is partitioned horizontally; each sub-file obtained by partitioning is called a Block, and the Blocks are assigned to nodes in the cluster;
(1-2) each Block serves as the input data of a Map function; for each item a_j in a data record T_i of the Block, the Mapper outputs the key-value pair <key=a_j, value=1>;
(1-3) all key-value pairs with key=a_j are assigned to the same Reducer, so the input of the Reducer is <key=a_j, value={1,1,...,1}>; the Reducer performs one summation and outputs <key=a_j, value=sum{1,1,...,1}>.
3. Mining Frequent Itemsets according to claim 1 and 2, it is characterised in that load balancing in the step (3)
Principle is:Using sequence number every in F-list as load value, according to load value to the items packet in F-List.
4. Mining Frequent Itemsets according to claim 1 and 2, it is characterised in that the G-List uses Hash table
Storage.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611144649.6A CN106815302A (en) | 2016-12-13 | 2016-12-13 | A kind of Mining Frequent Itemsets for being applied to game item recommendation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106815302A (en) | 2017-06-09 |
Family
ID=59109915
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611144649.6A Pending CN106815302A (en) | 2016-12-13 | 2016-12-13 | A kind of Mining Frequent Itemsets for being applied to game item recommendation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106815302A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101042698A (en) * | 2007-02-01 | 2007-09-26 | 江苏技术师范学院 | Synthesis excavation method of related rule and metarule |
CN104408127A (en) * | 2014-11-27 | 2015-03-11 | 无锡市思库瑞科技信息有限公司 | Maximal pattern mining method for uncertain data based on depth-first |
CN106202575A (en) * | 2016-08-22 | 2016-12-07 | 东南大学 | A kind of distributed quick Mining Frequent Itemsets based on Apriori |
Non-Patent Citations (2)
Title |
---|
BAY VO et al.: "Mining frequent itemsets using the N-list and subsume concepts", 《INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS》 * |
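廖晶贵: "Research and Implementation of Hadoop-Based Big Data Association Rule Mining Algorithms", 《China Master's Theses Full-text Database, Information Science and Technology》 * |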
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108090800A (en) * | 2017-11-27 | 2018-05-29 | 珠海金山网络游戏科技有限公司 | A kind of game item method for pushing and device based on player's consumption potentiality |
CN108089853A (en) * | 2017-12-29 | 2018-05-29 | 江苏名通信息科技有限公司 | Parallel Misra-Gries methods based on Hadoop |
CN108089853B (en) * | 2017-12-29 | 2021-03-16 | 镇江多游网络科技有限公司 | Hadoop-based parallel Misra-Gries method |
CN109002532A (en) * | 2018-07-17 | 2018-12-14 | 电子科技大学 | Behavior trend mining analysis method and system based on student data |
CN111309786A (en) * | 2020-02-20 | 2020-06-19 | 江西理工大学 | Parallel frequent item set mining method based on MapReduce |
CN111309786B (en) * | 2020-02-20 | 2023-09-15 | 韶关学院 | Parallel frequent item set mining method based on MapReduce |
CN111729301A (en) * | 2020-06-15 | 2020-10-02 | 北京智明星通科技股份有限公司 | Method and device for recommending props in breakthrough game and game terminal |
CN112925821A (en) * | 2021-02-07 | 2021-06-08 | 江西理工大学 | MapReduce-based parallel frequent item set incremental data mining method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170609 |