CN115563192B

CN115563192B - Method for mining high-utility periodic frequent pattern applied to purchase pattern

Info

Publication number: CN115563192B
Application number: CN202211463101.3A
Authority: CN
Inventors: 张振洲; 陈建铭; 吴明泰; 吴祖扬
Original assignee: Shandong University of Science and Technology
Current assignee: Shandong University of Science and Technology
Priority date: 2022-11-22
Filing date: 2022-11-22
Publication date: 2023-03-10
Anticipated expiration: 2042-11-22
Also published as: CN115563192A

Abstract

The invention provides a method for mining high-utility periodic frequent patterns applied to purchasing patterns, which comprises the following steps: S1, inputting a database and five custom thresholds; S2, scanning the database to construct the HUPFPS-list of each 1-item set x, and judging whether the 1-item set is a high-utility periodic frequent pattern; S3, pruning the search space according to an upper bound value, and adding the HUPFPS-lists that meet the condition to a set; S4, intersecting and combining the pruned 1-item sets into 2-item sets, and judging whether each 2-item set is a HUPFPS; S5, recursively generating n-item sets from the HUPFPS-lists of the (n-1)-item sets until no n-item set can be extended, and outputting all high-utility periodic frequent item sets. The technical scheme of the invention overcomes the problems that most research on periodic patterns in the prior art mines a single sequence and does not consider the internal utility and external utility of the patterns.

Description

High-utility periodic frequent pattern mining method applied to purchase pattern

Technical Field

The invention relates to the technical field of data mining, in particular to a high-utility periodic frequent pattern mining method applied to a purchasing pattern.

Background

In recent years, high-utility periodic pattern mining has gradually become one of the popular directions in data mining, and many scholars have studied periodic pattern mining intensively. However, previous periodic pattern mining algorithms all mine a single time series, and they ignore the weight (value) and quantity information inherent in the data, so the mined patterns offer no advantage in profit or benefit. To meet the demand for benefit, High-Utility Pattern Mining (HUPM), which is tied to benefit, has become one of the research focuses of academia and industry in the data intelligence field. In utility pattern mining, a pattern can appear more than once in a given record, and the pattern itself can be assigned a weight, which better suits real-world applications. As periodic patterns have been studied in more depth, some variants of periodic patterns have taken the utility (profit) of the pattern into account. Later, an algorithm named PHUSPM was designed to mine high-utility periodic patterns in multiple symbol sequences; this algorithm treats the multiple sequences as one sequence and mines the periodic patterns with the same periodicity measure used for a single sequence.

In recent years, sequential pattern mining (SPM) has become one of the most popular pattern mining tasks. It generalizes the frequent item set mining problem and aims to find frequent subsequences in a sequence. Although many SPM algorithms have been proposed for practical applications, they have limitations: they do not consider the quantity of the items in a sequence or their unit profit, so they cannot be used to find high-utility patterns that appear often in the data. These factors matter in practice. For example, when a customer buys beer and fried chicken and then beef, the purchase pattern may generate high profit, with the beef accounting for a large share of the total profit; in practical applications it is more important to find the high-profit patterns that many customers buy periodically every week. Conventional periodic frequent pattern mining (PFPM) finds items that customers purchase regularly, but it cannot determine which of the frequently purchased items are more profitable, which greatly limits its effectiveness in practical applications such as product bundle recommendation. Another example is the regular occurrence of certain DNA molecules in a gene sequence: each DNA molecule differs in importance, which directly affects the expression of some external traits, and the key is to find the DNA molecules that occur frequently and play a major role. Most studies on periodic patterns mine a single sequence and do not consider the internal and external utilities of the patterns; therefore, a high-utility periodic frequent pattern mining method that can mine multiple sequences and considers internal and external utilities is needed.

Disclosure of Invention

The main object of the invention is to provide a high-utility periodic frequent pattern mining method applied to purchase patterns, so as to solve the problems that most research on periodic patterns in the prior art mines a single sequence and does not consider the internal utility and external utility of the patterns.

In order to achieve the above object, the present invention provides a high-utility periodic frequent pattern mining method applied to purchase patterns, comprising the following steps:

Step 1, inputting a database of the goods and quantities purchased by customers within a period of time, together with five thresholds customized by the merchant, namely a minimum support rate threshold minSupRa, a maximum periodicity threshold maxPr, a maximum standard deviation threshold maxStd, a minimum high-utility threshold minHuRa and a minimum sequence periodicity threshold minSeqRa;

Step 2, scanning the database to construct the HUPFPS-list of each 1-item set x, namely a data list formed by the purchase sequences of the users in which commodity x appears, the transactions in which it appears in chronological order, and the utility of the commodity, and judging whether the 1-item set x is a high-utility periodic frequent pattern (HUPFPS), which specifically comprises the following steps:

Step 2.1, scanning each sequence in the database and calculating the support rate supRa({x}, S), the maximum periodicity maxPer({x}, S), the utility ratio utiRa({x}, S) and the period standard deviation stanDev({x}, S) of the 1-item set x;

For a commodity x appearing in the purchase sequence S, if the purchase frequency of x is not less than the minimum purchase frequency ratio, i.e., supRa({x}, S) ≥ minSupRa, the time interval between two consecutive purchases of x does not exceed the maximum period threshold, i.e., maxPer({x}, S) ≤ maxPr, the purchase period of x is stable within a certain range, i.e., stanDev({x}, S) < maxStd, and the sales ratio of x in the customer's purchase sequence is not less than the merchant-defined threshold, i.e., utiRa({x}, S) ≥ minHuRa, then the 1-item set x is a high-utility periodic frequent pattern in the purchase sequence S of that customer, and the algorithm stores the sequences in which the 1-item set x satisfies these conditions in the set huPrSeq(x).

Step 2.2, calculating huSeqRa(x) from the set huPrSeq(x), and if huSeqRa(x) ≥ minSeqRa, outputting the 1-item set x as a high-utility periodic frequent pattern (HUPFPS);

Step 3, pruning the search space according to the upper bound value upSeqRa: adding the HUPFPS-list of each 1-item set that satisfies upSeqRa(x) ≥ minSeqRa to the set boundHUPFPS, and no longer extending item sets that do not satisfy the condition;

Step 4, using the set boundHUPFPS to intersect and merge the pruned 1-item sets into 2-item sets, namely combinations of the data of 2 commodities, constructing the HUPFPS-lists of the 2-item sets, storing the HUPFPS-list of each item set whose upSeqRa value is not less than minSeqRa in boundHUPFPS for a new iteration, and judging whether each 2-item set is a HUPFPS;

Step 5, recursively generating n-item sets from the HUPFPS-lists of the (n-1)-item sets until no n-item set can be extended, and outputting all high-utility periodic frequent item sets.

Further, the item set of one commodity is a 1-item set x and the item set of a plurality of commodities is an item set X; the set of all sequences in the database in which the item set X satisfies the support rate supRa(X, S) ≥ minSupRa, the maximum periodicity maxPer(X, S) ≤ maxPr and the utility ratio utiRa(X, S) ≥ minHuRa is recorded as huCand(X) = {S_1, ..., S_n}, the number of sequences in the set is recorded as |huCand(X)|, and the upper bound of the high-utility periodic sequence ratio of the item set X in the database is defined as upSeqRa(X) = |huCand(X)| / |D|.

Further, the support rate of the item set X in the sequence S is defined as supRa(X, S) = sup(X, S) / |S|, where |S| is the total number of transactions contained in the sequence S;

the number of transactions in the sequence S in which the item set X occurs is defined as sup(X, S) = |TR(X, S)|.

Further, let u (X, S) be the total utility of the item set X in a purchase sequence S, su (S) be the total utility of the sequence S, and the ratio thereof is defined as utiRa (X, S) = u (X, S)/su (S), where utiRa (X, S) is referred to as utility ratio.

The invention has the following advantages:

1. the method provided by the invention not only considers the frequency ratio of the mode in each sequence, but also considers the periodicity of the mode in each sequence and the utility ratio of the mode in the sequence.

2. In order to ensure the frequency of the periodic pattern in each sequence, the invention defines a new metric, namely the ratio of the support count of the periodic pattern in a sequence to the length of that sequence, so as to ensure that the output of the algorithm consists of high-utility periodic frequent patterns.

3. The invention provides a measure for mining high-utility periodic frequent patterns in a plurality of sequences, namely the high-utility periodic sequence ratio huSeqRa, whose purpose is to define high-utility periodic frequent patterns across a plurality of sequences.

4. The support counting method is improved to use a support ratio constraint; the internal utility and external utility of the items are considered on the basis of periodic pattern mining, and the high-utility ratio of a pattern in a sequence is defined in order to determine whether the pattern is high-utility in that sequence, which ensures the accuracy of the high-utility frequent patterns and effectively meets the mining requirement.

5. In order to reduce the search space and speed up the high-utility periodic frequent pattern mining algorithm HUPFPS, the invention provides a pruning strategy, namely defining an upper bound upSeqRa of the high-utility periodic sequence ratio, from which two pruning properties follow:

(1) The algorithm calculates the upSeqRa value of each 1-item set x to prune the search space, and stores the HUPFPS-list of each 1-item set with upSeqRa(x) ≥ minSeqRa in the set boundHUPFPS;

(2) The upper bound of the high-utility periodic sequence ratio of item set X in the database is defined as upSeqRa(X) = |huCand(X)| / |D|.

The result is an efficient algorithm, called the high-utility periodic frequent pattern mining algorithm HUPFPS, which constructs the HUPFPS-list structure through an intersection procedure, avoiding repeated scans of the database and improving running efficiency.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts. In the drawings:

FIG. 1 illustrates a flow chart of the high-utility periodic frequent pattern mining method applied to purchase patterns according to the present invention.

Detailed Description

The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The description first introduces the definitions of periodic patterns and utility patterns in a traditional single sequence, then extends them to a plurality of sequences, and finally proposes a pruning strategy for the search space and two new pruning properties. The definitions and theorems relating to the present invention are introduced below:

Definition 1: let I = {x_1, x_2, ..., x_m} be a set of m different items in the database. An item set X is a subset of I, denoted X ⊆ I. An item set X having k different items {i_1, i_2, ..., i_k} is called a k-item set; an item set with one item is written x and an item set with several items is written X. The database D consists of n sequences. A sequence S is an ordered list of transactions, denoted S = {T_1, T_2, ..., T_j}, where T_j represents a transaction in the sequence and j is the unique transaction identifier within the sequence.

Definition 2: each item in the database has a unit profit or other measure of value, denoted pl(i_m), which represents how important the item is to the user. The unit profits of the items are kept in a dedicated profit list, denoted profit = {pl(i_1), pl(i_2), ..., pl(i_m)}. The utility of item i_j in any transaction T_q of a sequence S_n is expressed as u(i_j, T_q, S_n) = q(i_j, T_q, S_n) × pl(i_j), where q(i_j, T_q, S_n) is the quantity of item i_j in transaction T_q of sequence S_n.
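As an illustration of Definition 2, the following minimal Python sketch computes the utility of an item in a transaction using the unit profits of Table 5. Representing a transaction as a dictionary from item to quantity, and the function names `item_utility` and `itemset_utility`, are assumptions of this sketch and not part of the described method. Summing the item utilities to obtain the utility of an item set is consistent with the values in Table 4.

```python
# Unit profits (external utilities) taken from Table 5 of this document.
PROFIT = {"a": 76, "b": 65, "c": 35, "d": 41, "e": 118}

def item_utility(item, transaction, profit=PROFIT):
    """u(i, T) = q(i, T) * pl(i); 0 if the item is absent from the transaction."""
    return transaction.get(item, 0) * profit[item]

def itemset_utility(itemset, transaction, profit=PROFIT):
    """Utility of an item set in one transaction; 0 unless every item is present."""
    if not all(i in transaction for i in itemset):
        return 0
    return sum(item_utility(i, transaction, profit) for i in itemset)

# First transaction of customer 1 in Table 1: (a:6, b:10, c:10)
t1 = {"a": 6, "b": 10, "c": 10}
print(item_utility("a", t1))             # 456
print(itemset_utility({"a", "d"}, t1))   # 0, because d is not in this transaction
```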

Definition 3: consider a sequence S_i and an item set X. The ordered list of the transactions in S_i that contain X is defined as TR(X, S_i) = <T_g(1), T_g(2), ..., T_g(k)> ⊆ S_i. Let T_g(z) and T_g(z+1) be two consecutive transactions of S_i in which X occurs. The period of two consecutive transactions containing X is per(T_g(z), T_g(z+1)) = g(z+1) - g(z). The periods of X in S_i are pr(X, S_i) = {per_1, per_2, ..., per_(k+1)}, where per_k = g(k) - g(k-1), g(k) is the TID of the transaction in which X appears, and g(0) = 0 and g(k+1) = |S_i| are specified, |S_i| being the length of the sequence.
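The period list of Definition 3 can be sketched as follows. The function name `periods` and the representation of TR(X, S) as a sorted list of transaction IDs are assumptions made for illustration; the boundary values g(0) = 0 and g(k+1) = |S| follow the definition.

```python
def periods(tids, seq_len):
    """Periods of an item set, given the sorted TIDs of the transactions
    that contain it and the sequence length |S|.

    Uses g(0) = 0 and g(k+1) = |S| as specified in Definition 3.
    """
    bounds = [0] + list(tids) + [seq_len]
    return [later - earlier for earlier, later in zip(bounds, bounds[1:])]

# Item set {a} occurs in transactions 1, 3, 4, 5 of sequence 1 (Table 2), |S| = 5.
print(periods([1, 3, 4, 5], 5))  # [1, 2, 1, 1, 0]
```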

Definition 4: the standard deviation of the periods of an item set X in the sequence S is denoted stanDev(X, S).

Definition 5: the maximum periodicity of an item set X in the sequence S is defined as maxPer(X, S) = max(pr(X, S)), i.e., the largest period of X in S.
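Definitions 4 and 5 can be illustrated on a given period list. The document does not state whether the standard deviation is the population or the sample form; this sketch assumes the population standard deviation (`statistics.pstdev`), and the function names are hypothetical.

```python
from statistics import pstdev

def max_per(pr):
    """maxPer: the largest period in the period list."""
    return max(pr)

def stan_dev(pr):
    """stanDev: population standard deviation of the periods (an assumption)."""
    return pstdev(pr)

# Periods of an item set occurring in transactions 2, 3, 4 of a 5-transaction
# sequence, with g(0) = 0 and g(k+1) = 5: [2-0, 3-2, 4-3, 5-4].
pr = [2, 1, 1, 1]
print(max_per(pr))             # 2
print(round(stan_dev(pr), 3))  # 0.433
```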

Definition 6: in a sequence S, one set of items X may appear in multiple transactions, and the number of transactions in the sequence S that contain the occurrence of X is defined as sup (X, S) = | TR (X, S) |.

Definition 7: the support rate of the item set X in the sequence S is defined as supRa (X, S) = sup (X, S)/| S |, where | S | is the total number of transactions contained in the sequence S.

Definition 8: let the total utility of item set X in sequence S_i be u(X, S_i), and let the total utility of sequence S_i be su(S_i). Their ratio is defined as utiRa(X, S_i) = u(X, S_i) / su(S_i), where utiRa(X, S_i) is referred to as the utility ratio.
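Definitions 6 to 8 can be checked against the first purchase sequence of Table 1. In the sketch below, the list-of-dictionaries representation of a sequence and the function names are assumptions; the quantities and unit profits are those of Tables 1 and 5.

```python
PROFIT = {"a": 76, "b": 65, "c": 35, "d": 41, "e": 118}  # Table 5

# Purchase sequence of customer 1 (Table 1); index 0 holds transaction 1.
S1 = [
    {"a": 6, "b": 10, "c": 10},
    {"b": 8, "c": 8, "d": 13},
    {"a": 5, "b": 6},
    {"a": 8, "b": 5, "e": 8},
    {"a": 4, "b": 7, "c": 6, "d": 10},
]

def occurrences(itemset, seq):
    """TR(X, S): 1-based TIDs of the transactions that contain every item of X."""
    return [tid for tid, t in enumerate(seq, start=1)
            if all(i in t for i in itemset)]

def sup_ra(itemset, seq):
    """supRa(X, S) = |TR(X, S)| / |S|."""
    return len(occurrences(itemset, seq)) / len(seq)

def seq_utility(seq):
    """su(S): total utility of all items in all transactions of S."""
    return sum(q * PROFIT[i] for t in seq for i, q in t.items())

def uti_ra(itemset, seq):
    """utiRa(X, S) = u(X, S) / su(S)."""
    u = sum(seq[tid - 1][i] * PROFIT[i]
            for tid in occurrences(itemset, seq) for i in itemset)
    return u / seq_utility(seq)

print(occurrences({"a"}, S1))  # [1, 3, 4, 5]
print(sup_ra({"a"}, S1))       # 0.8
print(seq_utility(S1))         # 6815
```

For item set {a} the utility in this sequence is 456 + 380 + 608 + 304 = 1748, so `uti_ra({"a"}, S1)` evaluates to 1748 / 6815, roughly 0.256.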

Definition 9: assume there are four user-defined thresholds, minSupRa, maxPr, maxStd and minHuRa. If an item set X satisfies supRa(X, S) ≥ minSupRa, maxPer(X, S) ≤ maxPr, stanDev(X, S) ≤ maxStd and utiRa(X, S) ≥ minHuRa in the sequence S, then X is defined to be high-utility periodic frequent in the sequence S. In the database D, the set of all sequences in which item set X satisfies these conditions is denoted huPrSeq(X) = {S | supRa(X, S) ≥ minSupRa ∧ maxPer(X, S) ≤ maxPr ∧ stanDev(X, S) ≤ maxStd ∧ utiRa(X, S) ≥ minHuRa ∧ S ∈ D}.
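The per-sequence test of Definition 9 amounts to a conjunction of four comparisons. The sketch below assumes the four measures have already been computed; the `Thresholds` class, the function name and the threshold values are hypothetical, since the document gives no numeric thresholds. Where the step descriptions write a strict inequality for the standard deviation, this sketch follows the non-strict form used in Definition 9.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Thresholds:
    min_sup_ra: float  # minSupRa
    max_pr: int        # maxPr
    max_std: float     # maxStd
    min_hu_ra: float   # minHuRa

def is_hupf_in_sequence(sup_ra, max_per, stan_dev, uti_ra, th):
    """True if an item set is high-utility periodic frequent in one sequence."""
    return (sup_ra >= th.min_sup_ra
            and max_per <= th.max_pr
            and stan_dev <= th.max_std
            and uti_ra >= th.min_hu_ra)

# Hypothetical thresholds, chosen only for illustration.
th = Thresholds(min_sup_ra=0.4, max_pr=3, max_std=1.0, min_hu_ra=0.2)

print(is_hupf_in_sequence(0.6, 2, 0.433, 0.32, th))  # True
print(is_hupf_in_sequence(0.2, 5, 2.5, 0.10, th))    # False
```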

Definition 11: in the database, if huSeqRa(X) ≥ minSeqRa, then item set X is a high-utility periodic frequent pattern (HUPFPS) in the database.

Definition 12: the set of all sequences in the database in which item set X satisfies supRa(X, S) ≥ minSupRa, maxPer(X, S) ≤ maxPr and utiRa(X, S) ≥ minHuRa is denoted huCand(X) = {S_1, ..., S_n}, and item set X is then called a utility periodic frequent candidate pattern. The number of sequences in the set is denoted |huCand(X)|, and the upper bound of huSeqRa(X) for item set X in the database is defined as upSeqRa(X) = |huCand(X)| / |D|.

Theorem 1: in the sequence database, the upSeqRa value of an item set X is not less than its huSeqRa value, i.e., upSeqRa(X) ≥ huSeqRa(X).
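One way to see why Theorem 1 holds is sketched below. It assumes that huSeqRa(X) is the ratio of the number of sequences in huPrSeq(X) to |D|, as computed in the detailed description of step 2; the document states the theorem without proof, so this is an informal argument rather than the inventors' own derivation. The sets huPrSeq(X) and huCand(X) differ only in the additional standard deviation condition.

```latex
\begin{aligned}
\mathrm{huPrSeq}(X)
  &= \{\, S \in \mathrm{huCand}(X) \mid \mathrm{stanDev}(X,S) \le \mathrm{maxStd} \,\}
  \subseteq \mathrm{huCand}(X) \\
\Rightarrow\quad
\mathrm{huSeqRa}(X)
  &= \frac{|\mathrm{huPrSeq}(X)|}{|D|}
  \le \frac{|\mathrm{huCand}(X)|}{|D|}
  = \mathrm{upSeqRa}(X)
\end{aligned}
```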

Theorem 2: in the database, for any two item sets X and XY, if XY is a superset of X, denoted X ⊆ XY, then upSeqRa(X) ≥ upSeqRa(XY).

Theorem 3: in a database, if upSeqRa(X) < minSeqRa for an item set X, then neither X nor any of its supersets is a HUPFPS.

The specific algorithm process in the present invention is described below with reference to fig. 1:

As shown in FIG. 1, the high-utility periodic frequent pattern mining method applied to purchase patterns comprises step 1, inputting a database of the goods and quantities purchased by customers within a period of time, together with five thresholds defined by the merchant, namely a minimum support rate threshold minSupRa, a maximum periodicity threshold maxPr, a maximum standard deviation threshold maxStd, a minimum high-utility threshold minHuRa and a minimum sequence periodicity threshold minSeqRa;

the algorithm finds all HUPFPS by depth-first search, taking as input one multi-sequence database and five custom thresholds.

Step 2, scanning the database to construct the HUPFPS-list of each 1-item set x, namely a data list formed by the purchase sequences of the users in which a given commodity appears, the transactions in which the commodity appears in chronological order, and the utility of the commodity, and judging whether the item set x is a high-utility periodic frequent pattern (HUPFPS).

Specifically, each sequence in the database is scanned and the support rate supRa({x}, S), the maximum periodicity maxPer({x}, S), the utility ratio utiRa({x}, S) and the period standard deviation stanDev({x}, S) of the 1-item set x are calculated;

For a commodity x appearing in the purchase sequence S, if the purchase frequency of x is not less than the minimum purchase frequency ratio, i.e., supRa({x}, S) ≥ minSupRa, the time interval between two consecutive purchases of x does not exceed the maximum period threshold, i.e., maxPer({x}, S) ≤ maxPr, the purchase period of x is stable within a certain range, i.e., stanDev({x}, S) < maxStd, and the sales ratio of x in the customer's purchase sequence is not less than the merchant-defined threshold, i.e., utiRa({x}, S) ≥ minHuRa, then the 1-item set x is a high-utility periodic frequent pattern in the purchase sequence S of that customer, and the algorithm stores the sequences in which the 1-item set x satisfies these conditions in the set huPrSeq(x).

The algorithm then divides the number of sequences in the set huPrSeq(x) by the total number of sequences |D| to obtain the high-utility periodic sequence ratio huSeqRa(x) of the 1-item set x; if this value is not less than minSeqRa, x is a high-utility periodic frequent item set.

In step 3, the search space is pruned according to the upper bound value upSeqRa: the HUPFPS-list of each 1-item set x that satisfies upSeqRa(x) ≥ minSeqRa is added to the set boundHUPFPS, and item sets that do not satisfy the condition are no longer extended.

Specifically, the algorithm computes the upSeqRa value of each 1-item set x to prune the search space and stores the HUPFPS-list of each 1-item set x with upSeqRa(x) ≥ minSeqRa in the set boundHUPFPS, with the HUPFPS-lists in the set sorted according to their upSeqRa values. The algorithm HUPFPS then calls a depth-first search procedure with boundHUPFPS, minSupRa, maxPr, maxStd, minSeqRa, minHuRa and the database as arguments, recursively searching for 2-item sets and larger patterns. This process only explores item sets whose upSeqRa value is not less than minSeqRa.
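The pruning and ordering of the 1-item sets can be sketched as a filter followed by a sort. The document does not say whether boundHUPFPS is sorted in ascending or descending order of upSeqRa; ascending order, ties broken alphabetically, is an assumption here, as are the function name and the upSeqRa values, which are invented for illustration only.

```python
def prune_and_sort(up_seq_ra, min_seq_ra):
    """Keep the items whose upSeqRa is at least minSeqRa, sorted by upSeqRa.

    Ascending order with alphabetical tie-breaking is an assumption.
    """
    kept = [(item, value) for item, value in up_seq_ra.items()
            if value >= min_seq_ra]
    kept.sort(key=lambda pair: (pair[1], pair[0]))
    return [item for item, _ in kept]

# Made-up upSeqRa values for five 1-item sets.
up = {"a": 1.0, "b": 0.75, "c": 0.25, "d": 1.0, "e": 0.0}
print(prune_and_sort(up, 0.5))  # ['b', 'a', 'd']
```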

Step 4, using the set boundHUPFPS to intersect and merge the pruned 1-item sets into 2-item sets, namely combinations of the data of 2 commodities, constructing the HUPFPS-lists of the 2-item sets, storing the HUPFPS-list of each item set whose upSeqRa value is not less than minSeqRa in boundHUPFPS for a new iteration, and judging whether each 2-item set is a HUPFPS.

Specifically, the search process takes as input an item set P, the custom thresholds minSupRa, maxPr, maxStd, minSeqRa and minHuRa, and the set boundHUPFPS. An extension of item set P is an item set obtained by appending an item z to P, denoted Pz. When the algorithm first invokes this search process, P is the empty set and the extensions of P are the 1-item sets. The search process executes a loop that combines each pair of extensions Px and Py of P into the HUPFPS-list of item set Pxy.

The algorithm constructs the HUPFPS-list of the extension Pxy from the HUPFPS-lists of Px and Py by an intersection procedure, without repeatedly scanning the database. The algorithm then scans the HUPFPS-list of Pxy to calculate huCand(Pxy) and upSeqRa(Pxy). If upSeqRa(Pxy) ≥ minSeqRa, item set Pxy and its supersets may be HUPFPS, and the HUPFPS-list of Pxy is added to the set boundHUPFPS, which stores the HUPFPS-lists of all extensions of Px whose upSeqRa values are not less than minSeqRa. The algorithm then calculates the value of huSeqRa(Pxy) and, if it is not less than minSeqRa, outputs Pxy as a HUPFPS.
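The intersection procedure can be sketched as follows, reproducing Table 4 from Tables 2 and 3. A HUPFPS-list is represented here as a dictionary mapping each sequence ID to a dictionary from transaction ID to utility; this layout and the function name `join_lists` are assumptions. The optional `lp` argument subtracts the utility of a shared prefix P, a convention borrowed from utility-list mining that the document does not spell out; it is not needed when joining 1-item sets.

```python
def join_lists(lx, ly, lp=None):
    """Build the list of Pxy from the lists of Px and Py without a database scan."""
    joined = {}
    for sid in sorted(lx.keys() & ly.keys()):
        common = sorted(lx[sid].keys() & ly[sid].keys())
        if not common:
            continue  # Px and Py never occur together in this sequence
        joined[sid] = {
            tid: lx[sid][tid] + ly[sid][tid] - (lp[sid][tid] if lp else 0)
            for tid in common
        }
    return joined

# HUPFPS-lists of {a} and {d}, copied from Tables 2 and 3.
L_A = {1: {1: 456, 3: 380, 4: 608, 5: 304},
       2: {2: 380, 3: 456, 4: 684, 5: 760},
       3: {2: 608, 3: 380, 5: 684},
       4: {1: 456, 2: 456, 3: 684}}
L_D = {1: {2: 533, 5: 410},
       2: {1: 574, 2: 123, 3: 328, 4: 615},
       3: {1: 410, 2: 164, 3: 492, 4: 492, 5: 492},
       4: {1: 574, 2: 328, 3: 246, 4: 369, 5: 328}}

L_AD = join_lists(L_A, L_D)
print(L_AD[1])  # {5: 714}
print(L_AD[2])  # {2: 503, 3: 784, 4: 1299}
```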

Finally, the algorithm recursively calls the search process to explore n-item sets; if the value of upSeqRa(Pxy) is less than minSeqRa, the item set Pxy and all of its supersets are pruned.
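The recursive structure of the search can be sketched with a deliberately simplified model. Here each item set carries only the set of sequence IDs in which it is a candidate, standing in for a full HUPFPS-list, and huSeqRa is taken to equal upSeqRa; both simplifications, the function name `search` and the input data are assumptions made to keep the sketch short, so it shows the control flow and the pruning rule rather than the complete method.

```python
def search(exts, min_seq_ra, n_sequences, out):
    """Depth-first exploration of item set extensions with upper-bound pruning.

    exts: list of (itemset, sids) pairs that share a common prefix.
    """
    for i, (px, sx) in enumerate(exts):
        next_exts = []
        for py, sy in exts[i + 1:]:
            pxy = px | py
            sxy = sx & sy                  # stand-in for the list intersection
            up_seq_ra = len(sxy) / n_sequences
            if up_seq_ra < min_seq_ra:
                continue                   # prune Pxy and all of its supersets
            next_exts.append((pxy, sxy))
            hu_seq_ra = up_seq_ra          # toy assumption: no stanDev filtering
            if hu_seq_ra >= min_seq_ra:
                out.append(pxy)
        if next_exts:
            search(next_exts, min_seq_ra, n_sequences, out)
    return out

# Made-up candidate sequence sets for three 1-item sets in a 4-sequence database.
singles = [(frozenset("a"), {1, 2, 3, 4}),
           (frozenset("b"), {1, 2}),
           (frozenset("d"), {2, 3, 4})]
found = search(singles, 0.5, 4, [])
print([sorted(p) for p in found])  # [['a', 'b'], ['a', 'd']]
```

In this run {a, b, d} and {b, d} are each supported by a single sequence (ratio 0.25), so they are pruned and never extended further.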

PREFERRED EMBODIMENTS

The sequence database sample in the preferred embodiment is shown in Table 1:

Table 1: sequence database sample

SID	Purchase sequence
1	(a:6, b:10, c:10), (b:8, c:8, d:13), (a:5, b:6), (a:8, b:5, e:8), (a:4, b:7, c:6, d:10)
2	(d:14), (a:5, b:8, c:3, d:3), (a:6, c:15, d:8), (a:9, b:9, d:15), (a:10, b:6, c:14, e:13)
3	(b:7, d:10), (a:8, d:4), (a:5, c:15, d:12), (b:3, d:12, e:3), (a:9, b:11, d:12)
4	(a:6, b:12, d:14), (a:6, b:2, d:8), (a:9, c:6, d:6), (b:2, d:9), (b:5, d:8, e:6)

The HUPFPS-list structure was constructed as shown in tables 2,3 and 4:

Table 2: HUPFPS-list of item set {a}

i-set	{a}
Sid-list	{1, 2, 3, 4}
Tran-list	[{1, 3, 4, 5}, {2, 3, 4, 5}, {2, 3, 5}, {1, 2, 3}]
Uti-list	[{456, 380, 608, 304}, {380, 456, 684, 760}, {608, 380, 684}, {456, 456, 684}]

Table 3: HUPFPS-list of item set {d}

i-set	{d}
Sid-list	{1, 2, 3, 4}
Tran-list	[{2, 5}, {1, 2, 3, 4}, {1, 2, 3, 4, 5}, {1, 2, 3, 4, 5}]
Uti-list	[{533, 410}, {574, 123, 328, 615}, {410, 164, 492, 492, 492}, {574, 328, 246, 369, 328}]

Table 4: HUPFPS-list of item set {a, d}

i-set	{a, d}
Sid-list	{1, 2, 3, 4}
Tran-list	[{5}, {2, 3, 4}, {2, 3, 5}, {1, 2, 3}]
Uti-list	[{714}, {503, 784, 1299}, {772, 872, 1176}, {1030, 784, 930}]

Table 5: external utility table

Item	a	b	c	d	e
External utility	76	65	35	41	118
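The construction of the HUPFPS-lists of the 1-item sets from Tables 1 and 5 can be sketched as a single pass over the database. The dictionary layout (item, then sequence ID, then transaction ID mapped to utility) and the function name `build_lists` are assumptions of this sketch; the resulting values for items a and d agree with Tables 2 and 3.

```python
PROFIT = {"a": 76, "b": 65, "c": 35, "d": 41, "e": 118}  # Table 5

# Table 1: sequence ID -> ordered list of transactions (item -> quantity).
DB = {
    1: [{"a": 6, "b": 10, "c": 10}, {"b": 8, "c": 8, "d": 13}, {"a": 5, "b": 6},
        {"a": 8, "b": 5, "e": 8}, {"a": 4, "b": 7, "c": 6, "d": 10}],
    2: [{"d": 14}, {"a": 5, "b": 8, "c": 3, "d": 3}, {"a": 6, "c": 15, "d": 8},
        {"a": 9, "b": 9, "d": 15}, {"a": 10, "b": 6, "c": 14, "e": 13}],
    3: [{"b": 7, "d": 10}, {"a": 8, "d": 4}, {"a": 5, "c": 15, "d": 12},
        {"b": 3, "d": 12, "e": 3}, {"a": 9, "b": 11, "d": 12}],
    4: [{"a": 6, "b": 12, "d": 14}, {"a": 6, "b": 2, "d": 8},
        {"a": 9, "c": 6, "d": 6}, {"b": 2, "d": 9}, {"b": 5, "d": 8, "e": 6}],
}

def build_lists(db, profit):
    """One scan of the database: item -> {sid -> {tid -> utility}}."""
    lists = {}
    for sid, seq in db.items():
        for tid, trans in enumerate(seq, start=1):
            for item, qty in trans.items():
                lists.setdefault(item, {}).setdefault(sid, {})[tid] = qty * profit[item]
    return lists

lists = build_lists(DB, PROFIT)
print(lists["a"][1])  # {1: 456, 3: 380, 4: 608, 5: 304}
print(lists["d"][1])  # {2: 533, 5: 410}
```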

First, from the parameter values of the 1-item sets, the algorithm determines that huSeqRa({a}) ≥ minSeqRa, upSeqRa({a}) ≥ minSeqRa, huSeqRa({d}) ≥ minSeqRa and upSeqRa({d}) ≥ minSeqRa. The item sets {a} and {d} are therefore high-utility periodic candidate patterns. The algorithm generates the HUPFPS-list of the 2-item set by intersecting and extending the Sid-list, Tran-list and Uti-list fields of the HUPFPS-lists of these 1-item sets, then calculates the parameter values of the 2-item set from its HUPFPS-list and judges whether the extended 2-item set is a HUPFPS, and so on until no larger item set can be generated.

Table 1 shows the times and quantities at which four customers purchased the items a, b, c, d and e. Taking the purchase sequence of the first customer (SID 1) as an example, (a:6, b:10, c:10), (b:8, c:8, d:13) means that the first customer purchased 6 units of a, 10 units of b and 10 units of c in the first purchase, 8 units of b, 8 units of c and 13 units of d in the second purchase, and so on.

In Table 2, the Sid-list {1, 2, 3, 4} indicates that the first, second, third and fourth customers all purchased commodity a. In the Tran-list [{1, 3, 4, 5}, {2, 3, 4, 5}, {2, 3, 5}, {1, 2, 3}], the set {1, 3, 4, 5} indicates that the first customer purchased commodity a in the first, third, fourth and fifth purchases, {2, 3, 4, 5} indicates that the second customer purchased commodity a in the second, third, fourth and fifth purchases, and so on.

In the Uti-list [{456, 380, 608, 304}, {380, 456, 684, 760}, {608, 380, 684}, {456, 456, 684}], the set {456, 380, 608, 304} holds the utilities of the first customer's purchases of commodity a: the first purchase of 6 units of a gives 6 × 76 = 456, the third purchase of 5 units gives 5 × 76 = 380, and so on.

In Table 4, the Uti-list [{714}, {503, 784, 1299}, {772, 872, 1176}, {1030, 784, 930}] is obtained with the external utility values of the commodities in Table 5; the set {714} is the utility 4 × 76 + 10 × 41 = 714 of the first customer's fifth purchase, in which 4 units of a and 10 units of d were bought together, and so on.

As can be seen from Table 2, in the HUPFPS-list of the pattern {a}, the Sid-list is {1, 2, 3, 4}, the Tran-list is ({1, 3, 4, 5}, {2, 3, 4, 5}, {2, 3, 5}, {1, 2, 3}), and the Uti-list is ({456, 380, 608, 304}, {380, 456, 684, 760}, {608, 380, 684}, {456, 456, 684}). As can be seen from Table 3, in the HUPFPS-list of the pattern {d}, the Sid-list is {1, 2, 3, 4}, the Tran-list is ({2, 5}, {1, 2, 3, 4}, {1, 2, 3, 4, 5}, {1, 2, 3, 4, 5}), and the Uti-list is ({533, 410}, {574, 123, 328, 615}, {410, 164, 492, 492, 492}, {574, 328, 246, 369, 328}).

The algorithm intersects and extends the HUPFPS-lists of patterns {a} and {d} to obtain the HUPFPS-list of pattern {a, d}, in which the Sid-list is {1, 2, 3, 4}, the Tran-list is ({5}, {2, 3, 4}, {2, 3, 5}, {1, 2, 3}), and the Uti-list is ({714}, {503, 784, 1299}, {772, 872, 1176}, {1030, 784, 930}). The algorithm calculates the parameter values from the fields of the HUPFPS-list of {a, d} and compares them with the custom thresholds to obtain the set huCand({a, d}) = {S2, S3, S4}, from which it calculates upSeqRa({a, d}) = |huCand({a, d})| / |D| = 3/4 = 0.75 ≥ minSeqRa. The pattern {a, d} and its supersets may therefore be HUPFPS, and the HUPFPS-list of {a, d} is added to the set boundHUPFPS so that 3-item sets can be extended from it.

Finally, the sequence set huPrSeq({a, d}) = {S2, S3, S4} is obtained from the parameter values, huSeqRa({a, d}) = 3/4 = 0.75 is calculated, and the algorithm outputs the 2-item set {a, d} as a HUPFPS. The algorithm HUPFPS then recursively calls the search procedure to explore larger n-item sets.
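The calculation for {a, d} can be retraced with the sketch below. The document does not state the threshold values used in this embodiment, so the thresholds here are hypothetical and were chosen only so that the result matches the sets reported above. The sequence utilities are sums of quantity times unit profit over Tables 1 and 5, and the population standard deviation is an assumption.

```python
from statistics import pstdev

# HUPFPS-list of {a, d} from Table 4: sid -> {tid -> utility}.
L_AD = {1: {5: 714},
        2: {2: 503, 3: 784, 4: 1299},
        3: {2: 772, 3: 872, 5: 1176},
        4: {1: 1030, 2: 784, 3: 930}}
SEQ_LEN = {1: 5, 2: 5, 3: 5, 4: 5}                # transactions per sequence
SEQ_UTIL = {1: 6815, 2: 8069, 3: 5966, 4: 5724}   # su(S), summed from Tables 1 and 5

# Hypothetical thresholds (not given in the document).
MIN_SUP_RA, MAX_PR, MAX_STD, MIN_HU_RA, MIN_SEQ_RA = 0.4, 3, 1.0, 0.2, 0.5

def measures(sid):
    """Return (supRa, maxPer, stanDev, utiRa) of {a, d} in one sequence."""
    tids = sorted(L_AD[sid])
    n = SEQ_LEN[sid]
    bounds = [0] + tids + [n]
    pr = [b - a for a, b in zip(bounds, bounds[1:])]
    return len(tids) / n, max(pr), pstdev(pr), sum(L_AD[sid].values()) / SEQ_UTIL[sid]

hu_cand, hu_pr_seq = set(), set()
for sid in L_AD:
    sup_ra, max_per, stan_dev, uti_ra = measures(sid)
    if sup_ra >= MIN_SUP_RA and max_per <= MAX_PR and uti_ra >= MIN_HU_RA:
        hu_cand.add(sid)                 # candidate: no stanDev condition
        if stan_dev <= MAX_STD:
            hu_pr_seq.add(sid)           # all four conditions hold

up_seq_ra = len(hu_cand) / len(SEQ_LEN)
hu_seq_ra = len(hu_pr_seq) / len(SEQ_LEN)
print(sorted(hu_cand), up_seq_ra)    # [2, 3, 4] 0.75
print(sorted(hu_pr_seq), hu_seq_ra)  # [2, 3, 4] 0.75
```

Under these assumed thresholds, sequence 1 is excluded because {a, d} occurs in only one of its five transactions (support rate 0.2), while both ratios reach 0.75 and exceed the assumed minSeqRa of 0.5.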

It is to be understood that the above description is not intended to limit the present invention, and the present invention is not limited to the above examples, and those skilled in the art may make various changes, modifications, additions and substitutions within the spirit and scope of the present invention.

Claims

1. A method for mining high-utility periodic frequent patterns in a purchase mode is characterized by comprising the following steps:

step 2.1, scanning each sequence in the database and calculating the support rate supRa({x}, S), the maximum periodicity maxPer({x}, S), the utility ratio utiRa({x}, S) and the period standard deviation stanDev({x}, S) of the 1-item set x;

for a commodity x appearing in the purchase sequence S, if the purchase frequency of the commodity x is not less than the minimum purchase frequency, namely supRa({x}, S) ≥ minSupRa, the time interval between two consecutive purchases of the commodity x does not exceed the maximum period threshold, namely maxPer({x}, S) ≤ maxPr, the purchase period of the commodity x is stable within a certain range, namely stanDev({x}, S) < maxStd, and the sales ratio utiRa({x}, S) of the commodity x in the purchase sequence of a customer is not less than the merchant-defined minimum high-utility threshold, namely utiRa({x}, S) ≥ minHuRa, then the 1-item set x is a high-utility periodic frequent pattern in the purchase sequence S of that customer, and the algorithm stores the sequences in which the 1-item set x satisfies these conditions in the set huPrSeq(x);

step 3, pruning the search space according to the upper bound value upSeqRa, adding the HUPFPS-list of each 1-item set x that satisfies upSeqRa(x) ≥ minSeqRa to the set boundHUPFPS, and no longer extending item sets that do not satisfy the condition;

step 4, using the set boundHUPFPS to intersect and merge the pruned 1-item sets into 2-item sets, namely combinations of the data of 2 commodities, constructing the HUPFPS-lists of the 2-item sets, storing the HUPFPS-list of each item set whose upSeqRa value is not less than minSeqRa in boundHUPFPS for a new iteration, and judging whether each 2-item set is a HUPFPS;

step 5, recursively generating n-item sets from the HUPFPS-lists of the (n-1)-item sets until no n-item set can be extended, and outputting all high-utility periodic frequent item sets.

2. The method as claimed in claim 1, wherein the item set of one commodity is a 1-item set x and the item set of multiple commodities is an item set X; the set of all sequences in the database in which the item set X satisfies the support rate supRa(X, S) ≥ minSupRa, the maximum periodicity maxPer(X, S) ≤ maxPr and the utility ratio utiRa(X, S) ≥ minHuRa is denoted huCand(X) = {S_1, ..., S_n}, the number of sequences in the set is recorded as |huCand(X)|, and the upper bound of the high-utility periodic sequence ratio of the item set X in the database is defined as upSeqRa(X) = |huCand(X)| / |D|.

3. The method of claim 2, wherein the support rate of item set X in sequence S is defined as supRa (X, S) = sup (X, S)/| S | where | S | is the total number of transactions contained in sequence S;

4. The method of claim 2, wherein the total utility of the item set X in a purchase sequence S is u (X, S), the total utility of the sequence S is su (S), and the ratio thereof is defined as utiRa (X, S) = u (X, S)/su (S), wherein utiRa (X, S) is called utility ratio.