CN115617881A

CN115617881A - Multi-sequence periodic frequent pattern mining method in uncertain transaction database

Info

Publication number: CN115617881A
Application number: CN202211635955.5A
Authority: CN
Inventors: 张振洲; 陈建铭; 甘文生
Original assignee: Shandong University of Science and Technology
Current assignee: Shandong University of Science and Technology
Priority date: 2022-12-20
Filing date: 2022-12-20
Publication date: 2023-01-17
Anticipated expiration: 2042-12-20
Also published as: CN115617881B

Abstract

The invention provides a method for mining a multi-sequence periodic frequent pattern in an uncertain transaction database, which specifically comprises the following steps: s1, inputting a database and four user-defined thresholds; s2, scanning the database to construct a UPFPS-list of the item set x 1, and judging whether the item set x is a periodic frequent item set or not; s3, pruning the search space according to the upper bound value, and adding the UPFPS-list meeting the conditions into the set; s4, intersecting and combining the 1 item sets after pruning into 2 item sets, constructing a UPFPS-list of the 2 item sets, and judging whether the 2 item sets are UPFPS; and S5, recursively circulating the n-1 item set until the n item set cannot be expanded, and outputting all UPFPSs in the uncertain database. The technical scheme of the invention solves the problem that the prior art can not carry out periodic frequent pattern mining on a plurality of sequences in an uncertain database.

Description

Multi-sequence periodic frequent pattern mining method in uncertain transaction database

Technical Field

The invention relates to the technical field of data mining, in particular to a multi-sequence periodic frequent pattern mining method in an uncertain transaction database.

Background

Data mining has become an important technological approach to the exploitation of excessive and complex data, and scientific research in this field of technology is also developing at an increasing rate. The association rule is also called frequent item set mining and is a hot direction in the field of data mining. The existing algorithms for frequent pattern mining are all in an accurate database to mine patterns which are useful for us, but in the real situation, the collected data is more or less lost or inaccurate due to various reasons, so that the collected data contains uncertainty. In real life, more data with uncertainty is mined, and a deterministic database has certain limitations in practical application. Mining frequently periodic patterns in the multi-sequence uncertainty database, wherein the periodic and frequent patterns are both satisfied, and the common periodic and frequent patterns are mined in the multi-sequence simultaneously, and the occurrence probability of the patterns also satisfies a threshold value set by a user. Meanwhile, how to measure the ratio of the patterns mined in a plurality of sequences in an uncertain database to the total database sequence number is a complicated problem.

In recent years, frequent Pattern Mining (FPM) was a popular data mining problem, and was originally intended to analyze transactions in large chain supermarkets to find products frequently purchased by customers, so as to make better sales decisions. Frequent pattern mining is an early and fundamental research direction in data mining, and with the continuous development of FPM, FPM has been applied in many fields today. Different methods have been developed to exploit meaningful patterns. At present, the frequent pattern mining based on the determined data is continuously explored, the mining period frequent pattern in the sequence also becomes an increasingly popular research direction, and the periodic and frequent occurrence of the research pattern along with the time has more practical application significance in real life. However, there are two limitations in the process of frequent pattern mining in the traditional mining cycle, the first limitation is that the item sets contained in things are all definite, while uncertain data in real life are ubiquitous, such as: sensor networks, military, economic, logistics, financial, biomedical, etc. The second limitation is that current algorithms that study periodic frequent patterns are usually directed to a single sequence, and it is more valuable to find a group of co-occurring periodic and frequent patterns in real life, such as: the method has more practical significance in website browsing click analysis, biomedical gene and market shopping basket research.

As the sequence pattern mining is also more popular, the time sequence between transactions is ignored in the database in the conventional FPM algorithm, and the mined patterns may be very frequent or have very high association between patterns, but do not reflect the chronological sequence between items or transactions. One of the most popular research tasks at present is Sequence Pattern Mining (SPM), in real life, sequences with respect to time are more emphasized in many fields, and the sequence pattern mining task is to dig out frequent sub-sequences from a group of sequences, which is more widely applied in reality. The data mining algorithm can mine several types of modes, and a frequent item set is a basic step of many important data mining tasks such as association rules, correlation analysis, sequence modes, weighting modes and the like and is also content of basic research in data mining. The conventional SPM algorithms have three limitations, firstly, the item sets contained in the data are all definite, but the uncertain data in real life are ubiquitous, such as: uncertain databases play an important role in many fields such as sensor networks, military, economy, logistics, finance, biomedicine and the like. But the uncertainty of the data has a great influence on data mining, so people pay more and more attention to mining in an uncertainty database. Second, SPM algorithms are used to find patterns that occur periodically in a sequence in a cyclic manner, but are an important research context in data mining. Thirdly, the SPM algorithm basically studies a single sequence, ignores the relevance among a plurality of sequences, and simultaneously mines and analyzes potential relations among a plurality of sequences, so that the SPM algorithm has great research value and significance.

Therefore, there is a need for a method of periodic frequent pattern mining for multiple sequences in an uncertain-transaction database that overcomes the above problems.

Disclosure of Invention

The invention mainly aims to provide a periodic frequent pattern mining method under multiple sequences in an uncertain transaction database, so as to solve the problem that the periodic frequent pattern mining of the multiple sequences in the uncertain database cannot be carried out in the prior art.

In order to achieve the aim, the invention provides a method for mining a multi-sequence periodic frequent pattern in an uncertain transaction database, which comprises the following steps:

s1, inputting an uncertain transaction database of a large number of customers within a period of time, customizing four thresholds by a merchant, wherein the four thresholds are a minimum support frequency minSup, a maximum periodicity maxPr, a periodicity standard deviation maxStd and a minimum expected support number minExpRa;

s2, scanning the database to construct a UPFPS-list of the item set x 1, namely constructing a data list UPFPS-list which is formed by the purchase probability of each item x and the transaction in which the certain item x appears in the purchase sequence of which users according to the time sequence, and judging whether the item set x is a periodic frequent item set UPFPS in the uncertain database, and the method specifically comprises the following steps:

s2.1, calculating the transaction number sup (x, S) of the commodity x appearing in the sequence S, calculating the maximum periodicity maxPeer (x, S) of the item set x, and the period standard deviation stanDev (x, S), then circulating the algorithm through each single item set x, and for the commodity x appearing in the purchase sequence S, if the purchase probability of the commodity x is greater than the minimum purchase frequency, namely sup (x, S) = minSup, the time interval between the two times of purchase of the commodity x does not exceed the maximum period threshold, namely maxPeer (x, S) < = maxPr, the purchase period of the commodity x is stable within a certain range, namely stanDev (x, S) < = maxStd, then 1 item set x is called as frequent periodically in the sequence S, and the algorithm stores the sequences of the item set x satisfying the condition into the set prSeq (x);

s2.2, if the expected periodic sequence is more than or equal to minExpRa than expRa (x), outputting a 1 item set x as a periodic frequent pattern item set UPFPS;

wherein the expected periodic sequence ratio of sequences for x in the database is defined as expRa (x) = expsu (x)/| D |, where the value of expsu (x) is calculated from the sequences in the set PrSeq (x), and | D | is the total number of sequences in the database of sequences;

s3, pruning the search space according to an upper bound value upExpRa, adding UPFPS-list of 1 item set x meeting the condition upExpRa (x) > = minExpRa into a set bound UPFPS, and not expanding any more if the condition is not met;

s4, intersecting and merging the trimmed 1 item set into 2 item sets by using a set bound UPFPS, namely combining data information of two commodities to construct a UPFPS-list of the 2 item set, storing the UPFPS-list of the item set which accords with an upper bound value of upExpRa (x) > = minExpRa into the bound UPFPS so as to carry out a new iteration, and judging whether the 2 item set is a periodic frequent item set UPFPS in an uncertain database;

and S5, recursively circulating the n-1 item set until the n item set cannot be expanded, and outputting all periodic frequent item sets UPFPS in the uncertain database.

Further, a term set of one commodity composition is 1 term set X, a term set of a plurality of commodities compositions is X, two user-specified thresholds maxPr, minSup are set, if sup (X, S) > = minSup, maxPer (X, S) < = maxPr, the term set X is a cycle-frequent candidate pattern in the sequence S, a sequence set in the sequence database with the term set X as the cycle-frequent candidate pattern is represented and defined as candSeq (X) = { S ^ X, S) > = minSup ^ maxPer (X, S) < = maxPr ^ pro (X, T, S) | = T ^ 0^ S ^ D }, and an upper bound of the value of the desired support sequence ratio prexa (X) of X in the database is defined as expa (X) =/exp (X) | D (X), where X is calculated from the sequence set X.

Further, in the indeterminate sequence database, the sequence S of the kth transaction of some goods X _i The probability value of the existence of item set X is defined and expressed as pro (X, T) _k , S _i ) ( 1<= k<= m, 1<=i<=|D|)。

Further, the expected support number expsoup (X) is the sum of the maximum probabilities that the item set X exists in the specified sequences in the uncertain database, and is calculated by the formula:

wherein P (X, S) _i ) The value of (A) indicates that the set of items X is present in the sequence S _i The maximum probability of transaction in (1) is

。

The invention has the following beneficial effects:

1. in order to measure the support number of the patterns in the uncertain database, a new support metric is defined, namely the maximum value of the transaction probability is found in each sequence, and a new period measure, namely an expected sequence period ratio (expRa), is defined in the uncertain database on the basis of the maximum value of the transaction probability, so that the mode with periodicity is found in a plurality of sequences;

2. in order to effectively reduce the search space, an upper bound upExpRa of the expRa measure is provided, and two new pruning characteristics are provided;

3. a novel UPFPS-list structure is provided to avoid repeated scanning of the database by the algorithm and improve the operation efficiency of the algorithm.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts. In the drawings:

FIG. 1 illustrates a flow chart of a method for periodic frequent pattern mining of multiple sequences in an uncertain transaction database according to the present invention.

Detailed Description

The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example one

The definition of periodic frequent patterns in an uncertain database is first introduced, and then the combination of periodic patterns and expected support numbers is extended to multiple sequences in an uncertain database. Finally, a strategy pruning of the search space and two new pruning properties are proposed.

Definition 1: i = { (I) ₁ ：P _i1 ), (i ₂ ：P _i2 ),...,(i _m ：P _im ) Is a set of different items (symbols or binary values) that appear in the transaction database, represented for each item appearing in pairs as (i: p _i ) Where i is an item and P _i (wherein P is _i ∈（0,1]) Is the probability value that item i exists. An Itemset containing K Items is called K-Itemset. A sequence S is an ordered set of lists S =<T ₁ ,T ₂ ,....T _m >And T is _j Is contained in I (1)<=j<=m),T _j Referred to as one transaction, each transaction contains a number of items (i: P) _i ) And j is transaction T _j Is called transaction T _j The TID of (1).

Definition 2: the item set composed of one commodity is 1 item set X, the item sets composed of a plurality of commodities are X, and in the uncertain sequence database, the sequence S of the kth transaction of the commodity X _i The probability value of the existence of item set X is defined and expressed as pro (X, T) _k , S _i ) ( 1<= k<= m, 1<=i<=|D|)。

Definition 3: consider a sequence S of rows _i A set of items X, sequence S _i An ordered transaction list containing a set of items X is defined as TR (X, S) =<T _g(1) , T _g(2) ,..., T _g(k) >Is contained in S _i。 Let T _g（z） And Tg _（z+1） Is a transaction in which item set X appears in sequence Si in two consecutive occurrences. The periodic calculation formula for two consecutive transactions containing item set X is per (T) _g(z), T _g(z+1) ) = g (z + 1) -g (z). Sequence S _i The period of the middle set X is pr (X, S) _i = per1, per2,., perk +1, where Perk = g (k) -g (k-1), g (k) is the TID of the transaction in which the set of items X occurs, and it is specified that g (0) =0 and g (k + 1) = | S _i L, where l S _i And | is the length of the sequence.

Definition 4: the standard deviation of the period of one set of terms X in the sequence S is denoted as stanDev (X, S).

Definition 5: the maximum periodicity of one set of entries X in the sequence S is defined as maxPer (X, S) = argmax (pr (X, S)).

Definition 6: in a sequence S, one set of items X may appear in multiple transactions, and the number of transactions in the sequence S that contain the occurrence of X is defined as sup (X, S) = | TR (X, S) |.

Definition 7: there are three user-specified thresholds maxPr, minSup, and maxStd. If (X, S) > = minSup, maxPeer (X, S) < = maxPr, pro (i, T, S) | 0, and stanDev (X, S) < = maxStd, then term set X is said to be periodic in sequence, and the set of sequences for which term set X satisfies the periodic frequent pattern is defined as PrSeq (X) = { S | sup (X, S) > = minSup ^ maxPeer (X, S) < = maxPr ^ stanDev (X, S) < ^ maxStd ^ pro (X, T, S) | =0^ T ^ S ^ D }.

Definition 8: the expected support number is the sum of the maximum probabilities that the item set X exists in the specified sequence in the uncertain database, and the calculation formula is as follows:

. Wherein P (x, S) _i ) Indicates that the set of items X exists in the sequence S _i Maximum probability of transaction in (1), formula is

。

Definition 10: in a sequence database, if a set of terms X satisfies that a periodically frequent sequence is PrSeq (X), then the desired support sequence ratio of terms set X is expRa (X) = expsu (X)/| D |, where the value of expsu (X) is calculated from the sequences in the set PrSeq (X), and | D | is the total number of sequences in the sequence database.

Definition 11: in an uncertain database, if the expRa (X) value of an item set X is not less than minExpRa, then this item set X is a periodic frequent pattern in an uncertain database.

Definition 12: two user-specified thresholds maxPr and minSup are set. If sup (X, S) > = minSup, maxPeer (X, S) < = maxPr, item set X is a cycle-frequent candidate pattern in the sequence. A set of sequences in the sequence database with the set X of entries as the periodic frequent candidate patterns is represented and defined as candSeq (X) = { S | sup (X, S) > = minSup ^ maxPeer (X, S) < = maxPr ^ pro (X, T, S) | =0^ T ^ S ^ D }, and the upper bound of the value of the desired support sequence ratio expRa (X) for X in the database is defined as expRa (X) = expPasp (X)/| D |, where the value of expPasp (X) is calculated from the sequences in the set candSeq (X).

Theorem 1: in the sequence database, upExpRa (X) > = expRa (X) for item set X.

Theorem 2: in the sequence database, for any two sets of items, if X \8838; XY, then upExpRa (XY) < = upExpRa (X).

Theorem 3: in a sequence database, if upExpRa (X) < = minExpRa for any item set X, then neither item set X nor its superset XY is UPFPS.

The specific algorithm process named MUPFPS of the first algorithm in the present invention is described below with reference to fig. 1:

a method for mining multiple sequences of periodic frequent patterns in an uncertain-transaction database as shown in fig. 1, comprising the following steps:

s2, scanning the database to construct a UPFPS-list of the 1 item set x, namely constructing a data list UPFPS-list which is formed by the purchase probability of each item x and the transaction in which a certain item x appears in the purchase sequence of several users according to the time sequence, and judging whether the 1 item set x is a periodic frequent item set UPFPS in the uncertain database, and the method specifically comprises the following steps:

s2.1, calculating the transaction number sup (x, S) of the commodity x appearing in the sequence S, calculating the maximum periodicity maxPeer (x, S) of the item set x, and the period standard deviation stanDev (x, S), then traversing each single item set x by the algorithm in a loop, and for the commodity x appearing in the purchase sequence S, if the purchase probability of the commodity x is greater than the minimum purchase frequency, namely sup (x, S) > = minSup, the time interval between the two times of purchase of the commodity x does not exceed the maximum period threshold, namely maxPeer (x, S) <= maxPr, and the purchase period of the commodity x is stable in a certain range, namely stanDev (x, S) < = maxStd, then 1 item set x is called to be frequent in the sequence S periodically, and the algorithm stores the sequences of the item set x satisfying the condition into a set prSeq (x);

s2.2, if the expected periodic sequence is more than or equal to minExpRa than expRa (x), outputting 1 item set x as a periodic frequent pattern item set UPFPS;

specifically, the value of upExpRa (x) is computed to prune the search space and UPFPS-list of term sets. If upExpRa (x) > = minExpRa, then the UPFPS-list of 1 item set x is added to the set bound UPFPS and the set is stored in ascending order according to the value of upExpRa.

The search process of the algorithm contains the following parameters: a multi-sequence uncertain database, UPFPS-list of the P extended item set, and a series of user-defined thresholds minSup, maxPr, maxStd, and minExpRa. The search process is implemented by merging the item set Px in UPPFPS-list with all the other extension item sets Py of P. The extension of item set P is the set of items obtained by appending item set z to P, denoted Pz. When this procedure is invoked for the first time, P is an empty set and P's extension is a single set of items. The search process executes a loop that combines each pair of extensions Px and Py of P to generate Pxy. The merging of the item set Px and the item set Py in the Px loop is to generate a new extended item set Pxy with length | Px | +1 according to the ascending order of the upExpRa values. The algorithm constructs UPPFPS-list of item sets Px and Py into UPFPS-list of Pxy through an intersection program on the premise of not repeatedly scanning the database. Then, UPFPS-list of Pxy is scanned to obtain a sequence satisfying candSeq (Pxy) condition, and upExpRa (Pxy) is further calculated. If upExpRa (Pxy) > = minExpRa, then Pxy and its superset are likely to be UPFPS, and Pxy's UPFPS-list is added to a set named ExtensionsOfPx, and the upExpRa (Pxy) of each item set in the set is not less than minExpRa, otherwise neither Pxy nor its superset are likely to be UPFPS. The UPFPS-list of all the extension item sets of Px of upExpRa > = minExpRa is then stored in the set, and if the value of expRa is finally calculated to be not less than the threshold value minExpRa, the output Pxy is UPFPS and the extension item Pxy is added to the extension set considered by the depth-first search. And finally, recursively calling the UPFPS-list which is judged to be the UPFPS by the algorithm, performing list cross construction, and continuously searching the expansion item set of the Px until all the expansion item sets of the Px are searched.

Because the metric upExpRa has strict anti-monotonicity, both the aperiodic frequent item set pattern set and its expanded item set in the uncertain database are deleted from the search space, because the algorithm uses the expRa values of the computed item sets in UPFPS-list, and the upExpRa value of the output item set is not less than the minExpRa, so the algorithm is complete and correct. The algorithm outputs all the UPFPS and their expRa as well as the serial and transaction numbers.

PREFERRED EMBODIMENTS

Table 1: sequence database sample

SID
	1.(a:0.66,e:0.68),(a:0.76,c:0.91,e:0.48),(a:0.6,c:0.75,e:0.58),(a:0.55,b:0.37,c:0.78),(a:0.48,b:0.7,e:0.66)
2.(a:0.76, c:0.48),(b:0.87),(b:0.65, c:0.46),(b:0.73, c:0.34),(a:0.52, e:0.33)
	3.(a:0.64,c:0.62),(a:0.33, b:0.51,e:0.48),(a:0.6, c:0.75, d:0.56),(a:0.49, c:0.34, e:0.88),(a:48, b:0.7)
4.(a:0.46,c:0.78,c:0.77,e:0.87),(a:0.66,b:0.97,e:0.38),(a:0.36,b:0.67,c:0.75),(a:0.45,c:0.37,c:0.78),(a:0.81,e:0.69)

Table 2: UPFPS-list of item set { a }

i-set {a}
	Sid-list {1,2,3,4}
Tran-list [{1,2,3,4,5},{1,5},{1,2,3,4,5},{1,2,3,4,5}]
	Pro-list[{0.66,0.76,0.6,0.55,0.48},{0.76,0.52},{0.64,0.33,0.6,0.49,0.48},{0.46,0.66,0.36,0.45,0.81}]

Table 3: UPFPS-list of item set { c }

i-set {c}
	Sid-list {1,2,3,4}
Tran-list [{2, 3, 4},{1, 3, 4},{1, 3, 4},{1, 3, 4}]
	Pro-list[{0.91,0.75,0.78},{0.48,0.46,0.34},{0.62,0.75,0.34},{0.77,0.75,0.78}]

Table 4: UPFPS-list of item set { a, c }

i-set {a,c}
	Sid-list {1,2,3,4}
Tran-list [{2,3,4},{1},{1,3,4},{1,3,4}]
	Pro-list [{0.69,0.45,0.42},{0.36},{0.39,0.45,0.16},{0.35,0.27,0.35}]

Table 1 is a database of tentative transactions for a large number of customers over a period of time, wherein the first row sequence represents a first probability of a first user purchasing item a of 0.66 and a probability of purchasing item e of 0.68. In Table 2, sid-list {1,2,3,4} represents the product a purchased by the first, second, third and fourth customers, and {1,2,3,4,5} in Tran-list [ {1,2,3,4,5}, {1,2,3,4,5} ] represents the product a purchased by the first customer for the first time, the second time, the third time, the fourth time and the fifth time, and so on.

Pro-list [ {0.66,0.76,0.6,0.55,0.48}, {0.66, 0.52}, {0.64,0.33,0.6,0.49,0.48}, {0.46,0.66,0.36,0.45,0.81} ] represents that the probability of purchasing commodity a for the first time is 0.66, the probability of purchasing commodity a for the second time is 0.76, and so on.

Pro-list [ {0.69,0.45,0.42}, {0.36}, {0.39,0.45,0.16}, {0.35,0.27,0.35} ] in Table 4 represents that the probability of the first customer purchasing both commodity a and commodity c at the same time is 0.76X 0.91 and approximately equal to 0.69, and so on.

Based on the above data and specific properties, the merchant gives a set of thresholds minSup = 2, maxStd = 1.2, maxPr = 3 and minExpRa = 0.35, respectively. The UPFPS-list of { a, c } for table three is obtained by intersecting the UPFPS-lists of the item sets { a } and { c } shown for table two and table three, respectively. The Sid-list of the set of terms { a } is {1,2,3,4}, the Sid-list of the set of terms { c } is {1,2,3,4}, and the Sid-list of the set of terms { a, c } is {1,2,3,4} after intersection. For sequences S in the database ₁ The Tran-list of the set of items { a } is {1,2,3,4,5}, and the set of items { c } is in the sequence S ₁ The Tran-list in (1) is {2,3,4}, and the Tran-list {2,3,4} of the term set { a, c } obtained by the intersection process is added to the Tran-list of { a, c }. Also, the Tran-list of other sequences of { a, c } is constructed in the same manner, pro-list of item set { a }: { ([0.66]， [0.76]，[0.6]，[0.55]，[0.48])， ([0.76]， [0.52])， ([0.64]， [0.33]， [0.6]，[0.49]， [0.48])，([0.46]，[0.66]，[0.36]，[0.45]，[0.81]) Pro-list of item set { c }: { ([0.91]，[0.75]，[0.78])，([0.48]，[046]，[0.34])，([0.62]，[0.75]，[0.34])，([0.77]，[0.75]，[0.78]) }. Pro-list of the set of terms { a, c } obtained by intersection: { ([0.69]，[0.45]，[0.42])，([0.36])，([0.39]， [0.45]，[0.16])， ([0.35]，[0.27]，[0.35]) }. The UPFPS-list of the set of terms { a, c } resulting from the intersection is shown in Table 4, where pr ({ a, c }, S) can be calculated from the UPFPS-list of the set of terms { a, c } ₁ ) = {2，1，1，1}，sup({a，c}，S ₁ ) =3 >= minsupe and maxPer ({ a, c }, S) ₁ ) = 2<= maxPr, so the set of terms { a, c } is in the sequence S ₁ Is a candidate periodic frequent pattern, and then the sequence S ₁ Added to the set candSeq (x), and then calculatedpr({a，c}，S ₁ ) = {1，4}，sup({a，c}，S ₂ ) =1 <= minsupp and maxPer ({ a, c }, S) ₂ ) = 4>= maxPr, so the set of terms { a, c } is in the sequence S ₂ Is not a candidate periodic frequent pattern, and then pr ({ a, c }, S) is calculated ₁ ) = {1，2，1，1}，sup(a，c，S ₃ ) =3 <= minsupe and maxPer ({ a, c }, S) ₃ ) = 2>= maxPr, so the set of terms { a, c } is in the sequence S ₃ Is a candidate periodic frequent pattern, and the sequence S ₃ Added to the set candSeq (x), the similarly known item sets { a, c } are in the sequence S ₄ Is also a candidate periodic frequent pattern, and also sequences S ₄ Added to the set candSeq (x). The sequence set candSeq ({ a, c }) obtained by the above calculation contains sequences { S } ₁ ，S ₃ ，S ₄ Using information of UPFPS-list of { a, c }, calculating exprup ({ a, c }) = 0.69 + 0.45 + 0.35 = 1.49, and then obtaining upExpRa ({ a, c }) = exprup ({ a, c })/| D | = 1.49/4 = 0.37>= minExpRa, so the item set { a, c } and its superset may be UPFPS. The algorithm finds sequences { S } that satisfy the condition of sequence set PrSeq ({ a, c }) ₁ ，S ₃ ，S ₄ Exptup ({ a, c }) = 0.69 + 0.45 + 0.35 = 1.49, and finally expRa ({ a, c }) = 1.49/4 = 0.37>= minExpRa, judge that the item set { a, c } is a UPFPS, and output it.

Example two

The second algorithm, which performs breadth-first search mining all the UPFPS, is called MUPFPS _ BFS, and takes as input the ambiguous multi-sequence database and a series of user-defined thresholds minSup, maxPr, maxStd, and minExpRa. And finishing the operation of the algorithm and outputting all UPFPSs.

The MUPFPS _ BFS algorithm firstly scans a database to create a UPFPS-list of a single item set, and then calculates through parameters in Sid-list, tran-list and Pro-list in the UPFPS-list to obtain whether each item set meets the following conditions: sup ({ x }, S) ≧ minSup, maxPeer ({ x }, S) ≦ maxPr and stanDev ({ x }, S) < maxStd, that is, satisfying these conditions means that term set x is periodic in sequence S. Then, whether the item set x has frequent periodicity in each sequence is checked, and for all sequence sets prSeq (x) meeting the frequent periodicity in the item set x, the algorithm calculates the expected support number of the item set by dividing the total number of sequences of the database by the sequences in the sets, so as to obtain expRa (x). If this value is not less than the user-defined minExpRa, then item set x is called a UPFPS and output.

In addition, upExpRa (x) is further computed and pruned for UPFPS-list of the search space and term set. Thus, all 1 item sets of which upExpRa (x) has a value not less than minExpRa are stored in a set boundUPFPS, and the clipped item sets are stored in the set boundUPFPS in order of the value of upExpRa (x). Finally, the algorithm searches and mines UPFPS through a loop to the item set in a breadth-first mode, and a recursive call item set expansion process in the loop generates a larger mode until a new item set cannot be generated. The item set extension process first calls the set boundUPFPS because the boundUPFPS contains all item sets with upExpRa (x) values no less than minExpRa. The item set expansion process combines any two 1 item sets to obtain a 2 item set, obtains a 2 item set of which the value of upExpRa (x) is not less than minExpRa, and then stores the 2 item set in a set bound PFPS. Meanwhile, calculating expRa (x) identifies and outputs UPFPS of 2 item sets. Generally, in the process of generating an item set each time a recursion is called, if a boundUPFPS set contains at least two item sets, a loop continuously calls the generation process, and two k item sets (k > = 1) are combined to generate a k +1 item set.

The item set extension procedure takes as input a set of UPFPS-list of k item sets (k > = 1), this set being defined as boundedupps, and a series of thresholds minspu, maxPr, maxStd, minExpRa, and finally the output is the UPFPS-list of the item set of length k +1, denoted as boundedupfps _ k +1. The algorithm first initializes a set boundUPFPS _ k +1 with length k +1, and stores all sets of k +1 items of upExpRa (x) not less than a custom threshold mineExpRa into the set boundUPFPS _ k +1. Putting all item sets in UPFPS-list in the set bound UPFPS into a variable item set, and then circularly combining the k item sets in sequence to generate a k +1 item set. If the item set Px and the item set Py with the length of k have prefixes with the common length of k-1 items, the item set Px and the item set Py are combined into an item set Pxy with the length of k +1, and the algorithm calls an intersection process to combine the UPFPS-list of the item set Px and the UPFPS-list of the Py to generate the UPFPS-list of a new item set Pxy. The algorithm will then determine if the set of terms Pxy and its superset are likely UPFPS. The algorithm uses the information in UPFPS-list of the item set Pxy to calculate sup (Pxy, S), maxPeer (Pxy, S), stanDev (Pxy, S) and Pro (Pxy, T, S) in the sequence of the item set Pxy. The algorithm calculates upExpRa (Pxy) from these values, and if upExpRa (Pxy) is not less than minExpRa, adds UPFPS-list of item set Pxy to the boundUPFPS _ k +1 set to produce a larger item set. Then, expRa (x) is calculated from UPFPS-list of Pxy, and if this value is not less than minExpRa, then output Pxy is a periodic frequent pattern in the uncertain database.

Finally, after combining all k item sets that can be combined, the algorithm adds bound UPFPS _ k +1 to bound UPFPS in preparation for the next algorithm call to generate the item set for k +2 item set generation.

It is to be understood that the above description is not intended to limit the present invention, and the present invention is not limited to the above examples, and those skilled in the art may make modifications, alterations, additions or substitutions within the spirit and scope of the present invention.

Claims

1. A multi-sequence periodic frequent pattern mining method in an uncertain transaction database is characterized by comprising the following steps:

s2, scanning the database to construct a UPFPS-list of the 1 item set x, namely constructing a data list UPFPS-list which is formed by the purchase probability of each commodity x and the transaction in which the certain commodity x appears in the purchase sequence of which users according to the time sequence, and judging whether the 1 item set x is a periodic frequent item set UPFPS in the uncertain database;

and S5, recursively circulating the n-1 item sets until the n item sets cannot be expanded, and outputting all periodic frequent item sets UPFPS in the uncertain database.

2. The method for mining the periodic frequent pattern of the multiple sequences in the uncertain transaction database according to claim 1, wherein the specific step of judging whether the item set x of step S2 is the periodic frequent item set UPFPS in the uncertain database is as follows:

wherein the expected periodic sequence ratio of sequences for x in the database is defined as expRa (x) = expsu (x)/| D |, where the value of expsu (x) is calculated from the sequences in the set PrSeq (x), and | D | is the total number of sequences in the database of sequences.

3. The method of claim 1, wherein the set of items consisting of one commodity is 1 item set X, the set of items consisting of a plurality of commodities is X, two user-specified thresholds maxPr, minsupe are provided, if sup (X, S) > =minsupe, maxPer (X, S) < = maxPr, then item set X is a frequent pattern in sequence S, the set of sequences in the sequence database with item set X as the frequent pattern is represented and defined as candSeq (X) = { S | X, S) = minsupor (X, S) > < maxPr = X, S) = pro (X, T, S) | = 0T ^ S ∈ D }, and the upper limit of the desired supported sequence ratio prexa (X) in the database is defined as exp ^ X (X) = exp ^ X, where exp X = exp ^ X is calculated according to the sequence X (= expupx) = exp X ^ expx.

4. The method of claim 3, wherein the k-th transaction sequence S of the commodity X in the uncertain sequence database is _i The probability value of the existence of item set X is defined and expressed as pro (X, T) _k , S _i ) ( 1<= k<= m, 1<=i<=|D|)。

5. The method for mining the periodic frequent pattern of multiple sequences in the uncertain data transaction database according to claim 3, wherein the expected support number expSup (X) is the sum of the maximum probabilities that the item set X exists in the specified sequences in the uncertain data transaction database, and the calculation formula is as follows:

。