CN110399406B

CN110399406B - Method, device and computer storage medium for mining global high utility sequence pattern

Info

Publication number: CN110399406B
Application number: CN201910692048.6A
Authority: CN
Inventors: 林浚玮; 李圆法; 陈伟; 王巨宏
Original assignee: Tencent Technology Shenzhen Co Ltd; Shenzhen Graduate School Harbin Institute of Technology
Current assignee: Tencent Technology Shenzhen Co Ltd; Shenzhen Graduate School Harbin Institute of Technology
Priority date: 2019-07-26
Filing date: 2019-07-26
Publication date: 2024-06-04
Anticipated expiration: 2039-07-26
Also published as: CN110399406A

Abstract

The present disclosure provides a method, apparatus, and computer-readable storage medium for mining global high utility sequence patterns. The method comprises the following steps: determining a first type of item in the sequence database, wherein the first type of item is an item with a global sequence weight utility value higher than a first threshold value; determining utility value linked lists of all sequences in a sequence database; mining at least one candidate global high utility sequence pattern from a sequence database and determining a first set according to the determined first class item, wherein the first set comprises the at least one candidate global high utility sequence pattern, the identification of sequences comprising the respective candidate global high utility sequence patterns, and utility values of the respective candidate global high utility sequence patterns in the respective sequences; and mining a global high utility sequence pattern from the at least one candidate global high utility sequence pattern according to the utility value linked list and the first set of each sequence.

Description

Method, device and computer storage medium for mining global high utility sequence pattern

Technical Field

The present disclosure relates to the field of data processing, and in particular, to a method, apparatus, and computer readable storage medium for mining global high utility sequence patterns.

Background

Sequence pattern mining is an important technique in the field of data mining and operates on sequence databases. A sequence database may include a plurality of sequences (which may also be referred to as transactions), where each sequence may include at least one item set (itemset), each item set includes at least one item, and the item sets are ordered. Taking the shopping data of a supermarket as an example, a user purchases commodity a and commodity b on the first day, commodity a and commodity c on the second day, and commodity b on the third day. The shopping data of the user during this period can be abstracted into a sequence: <[a b], [a c], [b]>, where a, b and c are items, the items within each [ ] form an item set, and a plurality of item sets arranged in order form a sequence. A high utility sequence pattern mining algorithm mines combinations of commodities, i.e., sequence patterns, whose utility values are above a preset threshold. A sequence pattern is an ordered arrangement of item sets.

When mining high utility patterns, searching for them by calculating the total utility value over the entire database is computationally expensive, and this is particularly so for high utility sequence patterns. High utility sequence pattern mining is therefore more complex than traditional high utility pattern mining and frequent sequence pattern mining. Current distributed and parallel pattern mining focuses on high utility pattern mining and frequent sequence pattern mining, which may be done on a Hadoop platform, for example. As yet there is no distributed and parallel high utility sequence pattern mining method.

Disclosure of Invention

To this end, the present disclosure provides a method, apparatus, and computer-readable storage medium for mining global high utility sequence patterns.

According to one aspect of the present disclosure, there is provided a method for mining a global high utility sequence pattern, comprising: determining a first type of item in the sequence database, wherein the first type of item is an item with a global sequence weight utility value higher than a first threshold value; determining a utility value linked list of each sequence in the sequence database; mining at least one candidate global high utility sequence pattern from the sequence database and determining a first set according to the determined first type of item, wherein the first set comprises the at least one candidate global high utility sequence pattern, the identification of sequences comprising each candidate global high utility sequence pattern and utility values of each candidate global high utility sequence pattern in the corresponding sequence; and mining a global high utility sequence pattern from the at least one candidate global high utility sequence pattern according to the utility value linked list of each sequence and the first set.

According to one example of the present disclosure, wherein said determining a first type of item in the sequence database comprises: determining a global sequence weight utility value of each item in the sequence database; and determining an item with a global sequence weight utility value higher than a first threshold as a first type item.

According to one example of the present disclosure, wherein determining a global sequence weight utility value for each item in the sequence database comprises: determining local sequence weight utility values of the item in each partition of the sequence database; and determining a global sequence weight utility value for the item based on the determined local sequence weight utility values.

According to one example of the present disclosure, the local sequence weight utility value of the item at each partition of the sequence database is determined from utility values of sequences comprising the item in the partition.

According to one example of the present disclosure, wherein determining a linked list of utility values for each sequence in the sequence database comprises: determining a utility value linked list of the sequence according to the utility value of each item in the sequence and the position of each item in the sequence.

According to one example of the present disclosure, wherein mining at least one candidate global high utility sequence pattern from the sequence database according to the determined first class of items comprises: mining local high utility sequence patterns from each partition of the sequence database according to the determined first class of items; and determining at least one candidate global high utility sequence pattern from the mined local high utility sequence patterns.

According to one example of the present disclosure, wherein mining the local high utility sequence patterns from each partition of the sequence database according to the determined first class of items comprises: calculating, for an item belonging to the first class of items in each sequence included in the partition, a utility value and a remaining utility value of the item in each sequence, wherein the remaining utility value of the item in a sequence is the sum of the utility values of all items in the sequence that follow the item; constructing a utility list of the item in each sequence; determining a utility value chain of the item according to the utility lists of the item in each sequence; and mining local high utility sequence patterns from the partition according to the utility value chains of the items in the partition.

According to one example of the present disclosure, wherein mining global high utility sequence patterns from the at least one candidate global high utility sequence pattern according to the utility value linked list of each sequence and the first set comprises: determining local utility values of each candidate global high utility sequence pattern according to the utility value linked list of each sequence and the first set; determining the global utility value of each candidate global high utility sequence pattern according to the local utility values of that candidate global high utility sequence pattern; and determining a sequence pattern having a global utility value greater than a first threshold as a global high utility sequence pattern.

According to one example of the present disclosure, the above method further comprises: dividing the sequences in the sequence database into a plurality of partitions according to a load balancing algorithm.

According to another aspect of the present disclosure, there is provided an apparatus for mining a global high utility sequence pattern, comprising: a first determining unit configured to determine a first type of item in the sequence database, wherein the first type of item is an item for which the global sequence weight utility value is higher than a first threshold value; the second determining unit is configured to determine utility value linked lists of all sequences in the sequence database; a first mining unit configured to mine at least one candidate global high utility sequence pattern from the sequence database according to the determined first class item and determine a first set, wherein the first set comprises the at least one candidate global high utility sequence pattern, an identification of a sequence comprising each candidate global high utility sequence pattern, and utility values of each candidate global high utility sequence pattern in a respective sequence; and a second mining unit configured to mine a global high utility sequence pattern from the at least one candidate global high utility sequence pattern according to a utility value linked list of each sequence and the first set.

According to one example of the present disclosure, wherein the first determining unit is configured to determine a global sequence weight utility value for each item in the sequence database; and determining an item with a global sequence weight utility value higher than a first threshold as a first type item.

According to one example of the present disclosure, wherein the first determining unit is configured to determine a local sequence weight utility value for each item at a respective partition of the sequence database; and determining a global sequence weight utility value for the item based on the determined local sequence weight utility values.

According to one example of the present disclosure, the local sequence weight utility value of the item at each partition of the sequence database is determined from utility values of sequences in the partition that include the item.

According to one example of the present disclosure, the second determining unit is configured to determine a utility value linked list of each sequence according to utility values of the respective items in the sequence and positions of the respective items in the sequence.

According to one example of the present disclosure, wherein the first mining unit is configured to mine local high utility sequence patterns from respective partitions of the sequence database according to the determined first class of items; and determining at least one candidate global high utility sequence pattern from the mined local high utility sequence patterns.

According to one example of the disclosure, the first mining unit is configured to calculate, for an item belonging to the first class of items in each sequence included in each partition of the sequence database, a utility value and a remaining utility value of the item in each sequence, wherein the remaining utility value of the item in a sequence is the sum of the utility values of all items in the sequence that follow the item; constructing a utility list of the item in each sequence; determining a utility value chain of the item according to the utility lists of the item in each sequence; and mining local high utility sequence patterns from the partition according to the utility value chains of the items in the partition.

According to one example of the present disclosure, the second mining unit is configured to determine local utility values of each candidate global high utility sequence pattern from the utility value linked list of each sequence and the first set; determining the global utility value of each candidate global high utility sequence pattern according to the local utility values of that candidate global high utility sequence pattern; and determining a sequence pattern having a global utility value greater than a first threshold as a global high utility sequence pattern.

According to one example of the present disclosure, the above apparatus further comprises a load distribution unit configured to divide the sequences in the sequence database into a plurality of partitions according to a load balancing algorithm.

According to another aspect of the present disclosure, there is provided an apparatus for mining a global high utility sequence pattern, comprising: a processor; and a memory, wherein the memory stores a computer executable program that, when executed by the processor, performs the method described above.

According to another aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon instructions which, when executed by a processor, cause the processor to perform the above-described method.

According to the method, the apparatus and the computer-readable storage medium for mining global high utility sequence patterns, the utility value linked list of each sequence in the sequence database and the first set are determined, and global high utility sequence patterns are mined according to these two data structures. This speeds up the calculation of global utility values in the sequence database, which accelerates mining and reduces the time complexity.

Drawings

The above and other objects, features and advantages of the present disclosure will become more apparent by describing in more detail embodiments thereof with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of embodiments of the disclosure, and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure, without limitation to the disclosure. In the drawings, like reference numerals generally refer to like parts or steps.

FIG. 1 is a schematic diagram of a system architecture for mining global high utility sequence patterns from a sequence database, according to an embodiment of the present disclosure.

FIG. 2 is a flow chart of a method for mining global high utility sequence patterns, according to an embodiment of the present disclosure.

Fig. 3 shows a schematic diagram of the utility list of item a in sequence s1.

Fig. 4 shows a schematic diagram of a utility value chain for item a.

FIG. 5 is a flowchart of a method of mining global high utility sequence patterns from at least one candidate global high utility sequence pattern, according to an embodiment of the present disclosure.

FIG. 6 is a schematic structural diagram of an apparatus for mining global high utility sequence patterns according to an embodiment of the present disclosure.

Fig. 7 illustrates an architecture of a computer device according to an embodiment of the present disclosure.

Detailed Description

In order to make the objects, technical solutions and advantages of the present disclosure more apparent, exemplary embodiments according to the present disclosure will be described in detail with reference to the accompanying drawings. In the drawings, like reference numerals refer to like elements throughout. It should be understood that: the embodiments described herein are merely illustrative and should not be construed as limiting the scope of the present disclosure.

In the present disclosure, when the utility value of a sequence pattern is high, for example, when the utility value of the sequence pattern is higher than a preset threshold, the sequence pattern may be referred to as a "high utility sequence pattern". That is, the "high utility sequence pattern" may be a sequence pattern having utility values higher than a preset threshold. The "preset threshold" herein may be fixed or may be changed as the application scenario of the mining algorithm changes.

The disclosure provides a distributed and parallel technical scheme for high utility sequence pattern mining. In the present disclosure, distributed and parallel high utility sequence pattern mining is implemented through a distributed computing framework based on the Hadoop platform. During mining, the utility value linked list of each sequence in the sequence database and a first set are used to store the information the mining process needs, so that the mining speed is increased and the time complexity is reduced. The "distributed computing framework" referred to herein may be a MapReduce framework, where Map maps a key-value pair to new key-value pairs, and Reduce merges the values that share the same key and maps them to a new key-value pair. In addition, a module performing the Map operation may be referred to as a Mapper, and a module performing the Reduce operation may be referred to as a Reducer.
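The Map and Reduce roles described above can be illustrated with a minimal in-memory sketch. It is not Hadoop code: `run_mapreduce`, `count_mapper` and `sum_reducer` are hypothetical names chosen for illustration, and the grouping step stands in for the shuffle that a real framework performs between Mappers and Reducers.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Simulate one MapReduce stage in memory (illustrative only)."""
    grouped = defaultdict(list)
    # Map: each input key-value pair yields zero or more new key-value pairs.
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            grouped[out_key].append(out_value)
    # Reduce: all values that share a key are merged into one new pair.
    result = {}
    for out_key, values in grouped.items():
        final_key, final_value = reducer(out_key, values)
        result[final_key] = final_value
    return result


def count_mapper(sid, items):
    # Emit (item, 1) for every item occurrence in the record.
    for item in items:
        yield item, 1


def sum_reducer(item, counts):
    return item, sum(counts)


counts = run_mapreduce(
    [(1, ["a", "b", "a"]), (2, ["b", "c"])], count_mapper, sum_reducer
)
print(counts)
```
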

First, a system architecture for mining global high utility sequence patterns (Global-High Utility Sequence Pattern, G-HUSP) from a sequence database according to an embodiment of the present disclosure is described with reference to FIG. 1. FIG. 1 is a schematic diagram of a system architecture for mining global high utility sequence patterns from a sequence database, according to an embodiment of the present disclosure. As shown in FIG. 1, the system architecture 100 may include three parts: an identification portion 110, a local mining portion 120, and an integration portion 130. The identification portion 110 may include a plurality of Mappers and a plurality of Reducers, such as n Mappers and n Reducers, where n is a positive integer. The identification portion 110 may be used to determine a first type of item in the sequence database, that is, an item whose global sequence weight utility value is above a first threshold. A first type of item is an item that is likely to form part of a high utility sequence pattern, and thus may also be referred to as a promising item. The local mining portion 120 may include a plurality of Mappers and a plurality of Reducers, for example, n Mappers and n Reducers, where n is a positive integer. The local mining portion 120 may be configured to mine local high utility sequence patterns (Local-High Utility Sequence Pattern, L-HUSP) from the sequence database according to the first type of items. Some of the local high utility sequence patterns may be global high utility sequence patterns and others may not, so the local high utility sequence patterns may serve as candidate global high utility sequence patterns.

Further, the local mining portion 120 may also be configured to determine a first set (which may be represented as sidset) that may include at least one candidate global high utility sequence pattern, the identification of the sequences that include each candidate global high utility sequence pattern, and the utility value of each candidate global high utility sequence pattern in the corresponding sequence. In addition, a utility value linked list may be determined for each sequence in the sequence database. The integration portion 130 may include a plurality of Mappers and a plurality of Reducers, such as n Mappers and n Reducers, where n is a positive integer. The integration portion 130 may be configured to mine global high utility sequence patterns from the at least one candidate global high utility sequence pattern based on the utility value linked list of each sequence and the first set. With the system architecture shown in FIG. 1, the utility value linked list of each sequence in the sequence database and the first set can be used to store the information the mining process needs, so that high utility sequence patterns are mined faster and the time complexity is reduced.

It should be appreciated that although a three-stage MapReduce is shown in FIG. 1, this is merely illustrative. There may also be fewer or more stages of MapReduce according to embodiments of the present disclosure. Within each stage, the number of Mappers and the number of Reducers may be the same or different. Likewise, the numbers of Mappers and/or Reducers may be the same or different across stages.

Furthermore, it should be understood that in this disclosure, "local" refers to one partition of the database, while "global" refers to the entire database. For example, a "local high utility sequence pattern" in the present disclosure may be a high utility sequence pattern mined from a partition of a database, i.e., a sequence pattern that is high utility for that partition; while a "global high utility sequence pattern" in the present disclosure may be a high utility sequence pattern mined from a plurality of local high utility sequence patterns, i.e., a sequence pattern that is high utility for the database as a whole. As another example, a "local sequence weight utility value" in this disclosure may be a utility value determined from the data in one partition of a database; while a "global sequence weight utility value" in this disclosure may be a utility value determined from all data in the database.

A flowchart of a method of mining a global high utility sequence pattern according to the system framework shown in fig. 1 will be described in detail below in connection with fig. 2. FIG. 2 is a flow chart of a method 200 for mining global high utility sequence patterns, according to an embodiment of the present disclosure. As shown in fig. 2, in step S201, a first type of item in the sequence database is determined, wherein the first type of item is an item for which the global sequence weight utility value (Global Sequence Weight Utility, GSWU) is above a first threshold.

In the present disclosure, the sequence database may include a plurality of sequences and identification information corresponding to the respective sequences. The sequences may be quantitative sequences. The identification information corresponding to each sequence may be referred to as a sequence identifier (sid). The sequence identifier of the l-th sequence may be represented by s_l, where l is a positive integer. Each sequence may include one or more item sets, each of which may include one or more items. Each item has an internal utility value and an external utility value. In a database of the transaction type, the internal utility value may be the transaction quantity of the item. In databases for other scenarios, the form of the internal utility value may be adjusted accordingly. The table that records the external utility values of the various items in the database may be referred to as an external utility value table. In a database of the transaction type, the external utility value table may be a profit table, i.e., the external utility value table may record the unit profit of each item in the database. In databases for other scenarios, the form of the external utility value table may be adjusted accordingly.

Table 1 below shows one example of a sequence database. As shown in Table 1, the sequence database is a database of the transaction type and includes five sequences, s₁ to s₅. Each sequence consists of the purchase lists of the same customer at different times, wherein each purchase list is an item set and the purchased commodities are items. For example, the sequence s₁ indicates that the customer first purchased 2 units of item a and 3 units of item c, then 3 units of item a, 1 unit of item b and 2 units of item c, then 4 units of item a, 5 units of item b and 4 units of item d, and finally 3 units of item e.

sid	Sequence
s₁	<[(a:2)(c:3)], [(a:3)(b:1)(c:2)], [(a:4)(b:5)(d:4)], [(e:3)]>
s₂	<[(a:1)(e:3)], [(a:5)(b:3)(d:2)], [(b:2)(c:1)(d:4)(e:3)]>
s₃	<[(e:2)], [(c:2)(d:3)], [(a:3)(e:3)], [(b:4)(d:5)]>
s₄	<[(b:2)(c:3)], [(a:5)(e:1)], [(b:4)(d:3)(e:5)]>
s₅	<[(a:4)(c:3)], [(a:2)(b:5)(c:2)(d:4)(e:3)]>

Table 1. Example of a sequence database

Table 2 below shows one example of an external utility value table. As shown in Table 2, the unit profit of commodity a is 5, that of commodity b is 3, that of commodity c is 4, that of commodity d is 2, that of commodity e is 1, and that of commodity f is 6.

Item	a	b	c	d	e	f
Unit profit	5	3	4	2	1	6

Table 2. Example of an external utility value table

In step S201, a global sequence weight utility value of each item in the sequence database may be determined, and an item whose global sequence weight utility value is higher than a first threshold value is determined as a first type item. Step S201 may be performed by the identification portion 110 described above (i.e., the first stage MapReduce).

The process of determining the global sequence weight utility value for each item in the sequence database will be described below. According to one example of the present disclosure, for each item in a sequence database, a local sequence weight utility value (Local Sequence Weight Utility, LSWU) for the item at each partition of the sequence database may be first determined, and then a global sequence weight utility value for the item is determined from the determined local sequence weight utility values.

For example, the sequence database may first be divided into a plurality of partitions, and the plurality of partitions may be assigned to a plurality of Mappers in the first stage MapReduce. For example, the sequence database may be divided into n partitions, and the k-th partition is assigned to Mapper k in the first stage MapReduce, where 1 ≤ k ≤ n and k is a positive integer.

Then, for each sequence in the k-th partition, Mapper k may determine a utility value for the sequence. For example, Mapper k may determine the utility value of the sequence according to conventional methods of calculating the utility value of a sequence. For example, the utility value of a sequence may be the sum of the utility values of the item sets that make up the sequence. In the present disclosure, the utility value of sequence s_l may be represented as u(s_l).
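As a worked illustration of the sequence utility, the sketch below computes u(s₁) from Tables 1 and 2. It assumes the conventional definition in which the utility of an item occurrence is its internal utility (quantity) multiplied by its external utility (unit profit); the text above does not spell this product out, and the names `PROFIT`, `S1`, `itemset_utility` and `sequence_utility` are hypothetical.

```python
# External utility values (unit profits) from Table 2.
PROFIT = {"a": 5, "b": 3, "c": 4, "d": 2, "e": 1, "f": 6}

# Sequence s1 from Table 1: a list of item sets, each a list of (item, quantity).
S1 = [
    [("a", 2), ("c", 3)],
    [("a", 3), ("b", 1), ("c", 2)],
    [("a", 4), ("b", 5), ("d", 4)],
    [("e", 3)],
]


def itemset_utility(itemset, profit):
    # Assumption: utility of an item occurrence = quantity * unit profit.
    return sum(qty * profit[item] for item, qty in itemset)


def sequence_utility(sequence, profit):
    # Utility of a sequence = sum of the utilities of its item sets.
    return sum(itemset_utility(itemset, profit) for itemset in sequence)


u_s1 = sequence_utility(S1, PROFIT)
print(u_s1)  # 22 + 26 + 43 + 3 = 94
```

Under the same assumption, the other sequences of Table 1 evaluate to u(s₂) = 67, u(s₃) = 56, u(s₄) = 67 and u(s₅) = 76, for a database total of 360.
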

Then, for each item in the sequence, Mapper k may generate a key-value pair composed of the item and the utility value of the sequence. For example, for item i in sequence s_l, Mapper k may generate the key-value pair (i, u(s_l)). It follows that the sequence identifier of a sequence and the content of the sequence may be input to Mapper k as a key-value pair, and Mapper k then outputs one or more new key-value pairs.
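A Mapper of this kind might be sketched as follows, using sequence s₂ from Table 1. The function name `swu_mapper` is hypothetical, the quantity-times-profit utility is the same assumption as in conventional high utility mining, and emitting one pair per distinct item (rather than per occurrence) is an interpretation of the description.

```python
PROFIT = {"a": 5, "b": 3, "c": 4, "d": 2, "e": 1, "f": 6}  # Table 2

# Sequence s2 from Table 1 as item sets of (item, quantity) pairs.
S2 = [
    [("a", 1), ("e", 3)],
    [("a", 5), ("b", 3), ("d", 2)],
    [("b", 2), ("c", 1), ("d", 4), ("e", 3)],
]


def swu_mapper(sid, sequence, profit):
    """Map (sid, sequence) to one (item, u(sequence)) pair per distinct item."""
    # Assumption: item occurrence utility = quantity * unit profit.
    utility = sum(
        qty * profit[item] for itemset in sequence for item, qty in itemset
    )
    distinct_items = sorted({item for itemset in sequence for item, _ in itemset})
    return [(item, utility) for item in distinct_items]


pairs = swu_mapper("s2", S2, PROFIT)
print(pairs)
```
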

Furthermore, because different sequences in a partition may contain the same item, the Mapper may generate multiple key-value pairs for the same item across these sequences. In this case, a combination module (which may be referred to as a combiner) may be configured for each Mapper to determine the local sequence weight utility value of each item in the partition. In particular, the local sequence weight utility value of the item at each partition of the sequence database may be determined from the utility values of the sequences in the partition that include the item. For example, the local sequence weight utility value of the item at each partition of the sequence database may be the sum of the utility values of the sequences comprising the item in that partition. In this way, the workload of the Reducer, which will be described below, can be reduced, thereby reducing communication cost and transmission time. For example, the local sequence weight utility value of item i at the k-th partition of the sequence database may be determined by the following equation (1):

$$\mathrm{LSWU}(i, D_k) = \sum_{s \in D_k \,\wedge\, i \in s} u(s) \qquad (1)$$

where i represents an item, D_k represents the k-th partition of the sequence database, s represents a sequence in D_k that includes the item, and u(s) represents the utility value of that sequence.

The process of determining the local sequence weight utility value of an item in a partition of a sequence database is described below with a specific example. Suppose the k-th partition of the sequence database includes sequences s₁ and s₂. Mapper k may determine the utility values of sequences s₁ and s₂ to be u(s₁) and u(s₂), respectively. Then, for each item in sequence s₁ (item a, item b, item c, item d, and item e), Mapper k may generate the key-value pairs (a, u(s₁)), (b, u(s₁)), (c, u(s₁)), (d, u(s₁)), and (e, u(s₁)). For each item in sequence s₂ (item a, item b, item c, item d, and item e), Mapper k may generate the key-value pairs (a, u(s₂)), (b, u(s₂)), (c, u(s₂)), (d, u(s₂)), and (e, u(s₂)). Thus, for item a, there are two key-value pairs, namely (a, u(s₁)) and (a, u(s₂)). The two key-value pairs of item a may also be denoted as (a, l_u), where l_u is a set that includes u(s₁) and u(s₂). The combination module then sums the elements in set l_u, i.e., u(s₁) + u(s₂), to obtain the local sequence weight utility value LSWU_a-k = u(s₁) + u(s₂) for item a at the k-th partition. Similarly, the local sequence weight utility values for item b, item c, item d, and item e at the k-th partition may be obtained.

It follows that the key-value pairs output by Mapper k can be used as input to the combination module corresponding to Mapper k, and the combination module generates new key-value pairs. Each new key-value pair may be composed of an item and the local sequence weight utility value of the item at the k-th partition. For example, for item i, the combination module corresponding to Mapper k may generate the key-value pair (i, LSWU_i-k). In the example where item i is item a, the combination module corresponding to Mapper k may output the key-value pair (a, LSWU_a-k).
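The combination step can be sketched as a per-partition sum over the Mapper output. The function name `combine_lswu` is hypothetical. The input pairs correspond to a partition holding s₁ and s₂, with u(s₁) = 94 and u(s₂) = 67 computed from Tables 1 and 2 under the assumption that an item occurrence's utility is quantity times unit profit.

```python
from collections import defaultdict


def combine_lswu(mapper_pairs):
    """Sum sequence utilities per item to get LSWU values for one partition."""
    totals = defaultdict(int)
    for item, seq_utility in mapper_pairs:
        totals[item] += seq_utility
    return dict(totals)


# Mapper k output for a partition containing s1 (utility 94) and s2 (utility 67);
# both sequences contain the items a, b, c, d and e.
pairs = [(item, 94) for item in "abcde"] + [(item, 67) for item in "abcde"]

lswu_k = combine_lswu(pairs)
print(lswu_k["a"])  # u(s1) + u(s2) = 161
```

Summing inside the partition means each Mapper forwards one pair per item instead of one pair per item per sequence, which is the reduction in Reducer workload the text describes.
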

In the above manner, the local sequence weight utility value of each item in the respective partitions of the sequence database may be determined. After determining the local sequence weight utility value of each item in the respective partitions of the sequence database, a global sequence weight utility value for the item may be determined from the determined local sequence weight utility values. For example, the sum of the local sequence weight utility values of an item over the respective partitions of the sequence database may be taken as the global sequence weight utility value of the item.

Specifically, the key-value pairs with the same key in the outputs of the plurality of combination modules may be input to one Reducer in the first stage MapReduce. That is, the key-value pairs corresponding to the same item among the outputs of the plurality of combination modules, for example, the key-value pairs (i, LSWU_i-k) corresponding to item i, are input to one Reducer. The Reducer may sum the local sequence weight utility values in these key-value pairs to obtain the global sequence weight utility value GSWU_i for item i. For example, the global sequence weight utility value of item i may be determined by the following equation (2):

$$\mathrm{GSWU}(i, D) = \sum_{D_k \subseteq D} \mathrm{LSWU}(i, D_k) \qquad (2)$$

where GSWU(i, D) represents the global sequence weight utility value of item i in sequence database D, D_k represents the k-th partition of the sequence database, and LSWU(i, D_k) represents the local sequence weight utility value of item i in the k-th partition of the sequence database.
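The Reducer side of equation (2) can be sketched as follows. The names `shuffle` and `reduce_gswu` are hypothetical, and the split of Table 1 into the two partitions {s₁, s₂} and {s₃, s₄, s₅} is an assumption made only for this illustration; the LSWU values 161 and 199 follow from the sequence utilities 94, 67, 56, 67 and 76 obtained with the quantity-times-profit assumption.

```python
from collections import defaultdict


def shuffle(combiner_outputs):
    """Group the LSWU values of each item across all partitions."""
    grouped = defaultdict(list)
    for partition_output in combiner_outputs:
        for item, lswu in partition_output.items():
            grouped[item].append(lswu)
    return grouped


def reduce_gswu(item, lswu_values):
    # GSWU of an item = sum of its LSWU values over the partitions.
    return item, sum(lswu_values)


# Assumed partitioning: D_1 = {s1, s2}, D_2 = {s3, s4, s5}.
partition_1 = {item: 94 + 67 for item in "abcde"}
partition_2 = {item: 56 + 67 + 76 for item in "abcde"}

grouped = shuffle([partition_1, partition_2])
gswu = dict(reduce_gswu(item, values) for item, values in grouped.items())
print(gswu["a"])  # 161 + 199 = 360
```

In this small example every item a to e occurs in every sequence, so each GSWU equals the total utility of the database; in a realistic database the values differ per item.
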

Thus far, the process of determining the global sequence weight utility value of each item in the sequence database has been described. After the global sequence weight utility value of each item has been determined, each Reducer in the first stage MapReduce may determine items whose global sequence weight utility values are greater than or equal to a first threshold as items of the first type, and discard items whose global sequence weight utility values are less than the first threshold. Each Reducer may output one or more new key-value pairs, where each new key-value pair may be composed of a first type item and the global sequence weight utility value of that item. For example, when item i is a first type item, a Reducer may output the key-value pair (i, GSWU_i).

The "first threshold" described herein may be determined based on the total utility value of the database and a threshold factor. For example, the "first threshold" may be a product of the total utility value of the database and a threshold factor. The total utility value of the database may be determined according to conventional methods of calculating the total utility value of the database. For example, the total utility value of the database may be the sum of the utility values of the individual transactions in the database. The total utility value of the database may be denoted as u (D). The threshold factor may be preset and may be expressed as δ. Thus, the first threshold may be expressed as δ×u (D).

Through step S201, the items that may constitute a high utility sequence pattern can be identified. Items that are not identified may be discarded and need not be considered further. As a result, the space searched for high utility sequence patterns is much smaller than the original search space, which speeds up the search and therefore the mining.
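The first stage described above (local sums per partition, a global sum per item, then filtering against δ×u(D)) can be sketched as follows. This is only a minimal in-memory illustration, not the MapReduce implementation of the disclosure: the function names `local_swu` and `first_type_items`, the representation of a sequence as a pair (items, sequence utility), and the toy data are all assumptions made for the example.

```python
from collections import defaultdict

def local_swu(partition):
    """Combiner-like step: LSWU of each item within one partition.

    `partition` is a list of (items, sequence_utility) pairs, an assumed
    stand-in for the sequences handled by one Mapper.
    """
    lswu = defaultdict(int)
    for items, seq_utility in partition:
        for item in set(items):  # a sequence counts once per distinct item
            lswu[item] += seq_utility
    return dict(lswu)

def first_type_items(partitions, delta):
    """Reducer-like step: sum LSWU into GSWU and keep GSWU >= delta * u(D)."""
    total_utility = sum(u for part in partitions for _, u in part)  # u(D)
    gswu = defaultdict(int)
    for part in partitions:
        for item, value in local_swu(part).items():
            gswu[item] += value
    threshold = delta * total_utility
    return {item: v for item, v in gswu.items() if v >= threshold}

# Hypothetical toy database: two partitions, u(D) = 100.
partitions = [
    [(["a", "b"], 30), (["a", "c"], 20)],
    [(["b", "d"], 10), (["a"], 40)],
]
result = first_type_items(partitions, delta=0.25)
print(result)  # items a and b survive; c and d are discarded
```

With δ = 0.25 the threshold is 25, so items c (GSWU 20) and d (GSWU 10) are discarded before any pattern mining takes place.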

Returning to FIG. 2, in step S202, a linked list of utility values for each sequence in the sequence database is determined. Step S202 may be performed before or after step S201, or may be performed in synchronization with step S201.

According to one example of the present disclosure, for a sequence in a sequence database, a linked list of utility values for the sequence may be determined based on the utility values for each item in the sequence and the position of each item in the sequence. The utility value of an item may be the product of the internal utility value and the external utility value of the item. The position of each item in the sequence may include an initial position of each item and a neighboring position, wherein the initial position of an item may be the position of the first occurrence of an item in the sequence and the neighboring position may be the position of the next occurrence of an item in the sequence. In addition, a utility value linked list of one sequence may include two rows, where the first row may be information about utility values and adjacent locations of individual items (may be simply referred to as Utility Position Information, UP information), and the second row may be information about initial locations of non-duplicate items in the sequence (may be simply referred to as a Header Table). The second row may include non-duplicate items and initial positions of respective non-duplicate items.

Table 3 below shows the utility value linked list of the sequence s₁ in Table 1. As shown in Table 3, the utility value linked list of sequence s₁ includes two rows: the first row shows the utility value and neighboring position of each item a, b, c, d, e in sequence s₁, and the second row shows the initial position of each item a, b, c, d, e in sequence s₁. Specifically, in the element (a, 10, 3) in the first row, "a" indicates the 1st item in sequence s₁, "10" indicates that the utility value of this item a in sequence s₁ is 10, and "3" indicates the position at which item a next appears in sequence s₁. In the element (c, 8, -) in the first row, "c" indicates the 5th item in sequence s₁, "8" indicates that the utility value of this item c in sequence s₁ is 8, and "-" indicates that item c has no next position in sequence s₁. In the element (a, 1) in the second row, "a" indicates the item a in sequence s₁, and "1" indicates the initial position of item a in sequence s₁.

Table 3. Example of the utility value linked list of sequence s₁

It will be appreciated that the utility value linked list of a sequence is formed by converting and expanding the sequence in the original database, and that it records both the information of the original database and the commonly needed information for the calculation. The utility value linked list speeds up the calculation of the utility value of a sequence pattern. This is because a target sequence pattern may have multiple matches in a single transaction, so computing the utility value of the sequence pattern in that transaction requires finding all matches and then taking the maximum utility value. Since the utility value linked list records the next position of each item in the transaction, the transaction does not need to be scanned multiple times: the maximum utility value of the sequence pattern in the transaction can be calculated simply by following the next positions of the items.
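The two rows of a utility value linked list can be built in one pass over a sequence, as the following minimal sketch suggests. The function name `build_utility_linked_list` and the flat (item, utility) input format are assumptions for illustration. Table 1 is not reproduced here, so the sample sequence is hypothetical: it was chosen so that the quoted elements (a, 10, 3), (c, 8, -) and (a, 1) come out, and the utilities of d and e are placeholders.

```python
def build_utility_linked_list(flat_sequence):
    """Return (up_info, header_table) for one sequence.

    `flat_sequence` is a list of (item, utility) pairs in positional order.
    Positions are 1-based, as in the text; None stands for "-" (no next
    occurrence of the item).
    """
    up_info = []     # first row: [item, utility, next position]
    header = {}      # second row: item -> initial position
    last_index = {}  # item -> index in up_info of its latest occurrence
    for pos, (item, utility) in enumerate(flat_sequence, start=1):
        up_info.append([item, utility, None])
        if item in last_index:
            # Link the previous occurrence of this item to the current position.
            up_info[last_index[item]][2] = pos
        else:
            header[item] = pos
        last_index[item] = len(up_info) - 1
    return [tuple(entry) for entry in up_info], header

# Hypothetical flattened sequence (assumed data, see the note above).
s1_flat = [("a", 10), ("c", 12), ("a", 15), ("b", 3), ("c", 8),
           ("a", 20), ("b", 15), ("d", 5), ("e", 6)]
up_info, header = build_utility_linked_list(s1_flat)
print(up_info[0], up_info[4], header["a"])
```

Following the third field of each entry visits every occurrence of an item without rescanning the sequence, which is the property the paragraph above relies on.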

Returning to FIG. 2, in step S203, at least one candidate global high utility sequence pattern is mined from the sequence database according to the determined first type items, and a first set is determined, wherein the first set comprises the at least one candidate global high utility sequence pattern, the identifications of the sequences that include each candidate global high utility sequence pattern, and the utility value of each candidate global high utility sequence pattern in the respective sequences. Step S203 may be performed by the local excavation portion 120 (i.e., the second stage MapReduce) described above.

According to one example of the present disclosure, before step S203 is performed, sequences in the sequence database may be allocated into a plurality of tasks (tasks). The number of tasks may be denoted as m, where m is a positive integer. For example, m may be a multiple of the number of mappers in the second stage MapReduce. In the following example, the present disclosure is described taking the example that m is equal to the number of mappers in the second stage MapReduce.

In this example, the sequences in the sequence database may be divided into multiple partitions according to a load balancing algorithm. For example, the sequences in the sequence database may be distributed among a plurality of tasks according to the load balancing algorithm. Specifically, for a sequence in the sequence database, the number (Num) of first type items that the sequence includes may be determined. Then, the task p with the smallest workload is selected from the plurality of tasks, the sequence is assigned to task p, and the workload of task p is updated according to the number of first type items included in the sequence. For example, the workload of the p-th task may be denoted WL_p; when a sequence is assigned to the task, the workload of the task is updated from WL_p to (WL_p + Num).

Further, in this example, the algorithm may initialize the workload of each task to 0. Thus, in the first iteration of the algorithm, since the workload of every task is 0, a task may be selected at random from the plurality of tasks for the sequence being processed, and the sequence may be assigned to that task. For example, the 1st task may be selected from the plurality of tasks and the sequence assigned to the 1st task.
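The load balancing procedure described above amounts to a greedy assignment, which might look like the following sketch. The function name `assign_sequences` and the toy data are hypothetical. Two points are assumptions rather than statements of the disclosure: Num is counted here as the number of occurrences of first type items in the sequence, and ties between equally loaded tasks are broken by taking the lowest task index (the text allows a random choice).

```python
def assign_sequences(sequences, first_type, m):
    """Greedy load balancing: each sequence goes to the least-loaded task."""
    workloads = [0] * m             # WL_p, initialised to 0 for every task
    tasks = [[] for _ in range(m)]  # sequence identifiers per task
    for sid, items in sequences:
        # Num: occurrences of first type items in this sequence (assumption).
        num = sum(1 for item in items if item in first_type)
        # Task with the smallest workload; min() returns the lowest index on ties.
        p = min(range(m), key=lambda t: workloads[t])
        tasks[p].append(sid)
        workloads[p] += num         # WL_p becomes WL_p + Num
    return tasks, workloads

# Hypothetical sequences given as (identifier, items in order).
sequences = [
    ("s1", ["a", "b", "a", "c"]),
    ("s2", ["a", "d"]),
    ("s3", ["b", "b", "c"]),
    ("s4", ["a"]),
]
tasks, workloads = assign_sequences(sequences, first_type={"a", "b"}, m=2)
print(tasks, workloads)
```

In this toy run the two tasks end with workloads 4 and 3, whereas assigning sequences in order of arrival to alternating tasks would give 5 and 2.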

In addition, a "task" as described herein may also be referred to as a task file. Hereinafter, the terms "task" and "task file" may be used interchangeably.

By the load balancing algorithm, the influence of unbalanced workload among nodes on the mining algorithm caused by the division of the database can be avoided, so that the workload among the nodes is balanced, and the speed of mining calculation is effectively improved.

Step S203 may include three substeps S2031 to S2033. In step S2031, local high utility sequence patterns may be mined from the respective partitions of the sequence database according to the determined first type items. Then, in step S2032, at least one candidate global high utility sequence pattern may be determined from the mined local high utility sequence patterns. Then, in step S2033, the first set may be determined. Step S2033 may be performed simultaneously with step S2032.

In the present disclosure, local high utility sequence patterns may be mined from the respective tasks according to the determined first type items. Some of these local high utility sequence patterns may be global high utility sequence patterns, while others may not be. The latter may be taken as candidate global high utility sequence patterns.

The process of mining local high utility sequence patterns from each partition of the sequence database in step S2031 is described below. Specifically, for each item that belongs to the first type of item in each sequence included in a partition, the utility value and the remaining utility value of the item in each sequence are calculated, a utility list of the item in each sequence is constructed, and a utility value chain (utility chain) of the item is determined from the utility lists of the item in the sequences; local high utility sequence patterns are then mined from the partition according to the utility value chains of the items in the partition.

In the present disclosure, the remaining utility value of an item in a sequence may be the sum of the utility values of all items that follow the item in the sequence. Further, the utility list of an item in a sequence may include the identification information of the sequence (which may be denoted sid), the identification information of each item set in which the item is located (which may be denoted tid), the utility value (which may be denoted acu) and the remaining utility value (which may be denoted ru) of the item in each such item set, and indication information (e.g., a pointer, which may be denoted next) pointing from one item set to the next. Further, the utility value chain of an item may include the utility lists of the item in the respective sequences.

An example of the utility list of an item in a sequence is given below. Assume that a partition includes the sequences s₁ to s₅ shown in Table 1 and that item a belongs to the first type of item. For sequence s₁, the identification information of the sequence is 1. Item a appears in the 1st item set of sequence s₁, where its utility value and remaining utility value are determined to be 10 and 84, respectively. Item a also appears in the 2nd item set of sequence s₁, where its utility value and remaining utility value are determined to be 15 and 57, respectively. Item a also appears in the 3rd item set of sequence s₁, where its utility value and remaining utility value are determined to be 20 and 26, respectively. Thus, the utility list of item a in sequence s₁ can be constructed. FIG. 3 shows a schematic diagram of the utility list of item a in sequence s₁. As shown in FIG. 3, in the first group of data (1, 1, 10, 84), the first "1" represents sequence s₁, the second "1" represents the 1st item set of sequence s₁, "10" represents the utility value of item a in the 1st item set of sequence s₁, and "84" represents the remaining utility value of item a in the 1st item set of sequence s₁. In the second group of data (1, 2, 15, 57), "1" represents sequence s₁, "2" represents the 2nd item set of sequence s₁, "15" represents the utility value of item a in the 2nd item set of sequence s₁, and "57" represents the remaining utility value of item a in the 2nd item set of sequence s₁. In the third group of data (1, 3, 20, 26), "1" represents sequence s₁, "3" represents the 3rd item set of sequence s₁, "20" represents the utility value of item a in the 3rd item set of sequence s₁, and "26" represents the remaining utility value of item a in the 3rd item set of sequence s₁. The black arrows in FIG. 3 represent the pointers from one item set to the next.

An example of the utility value chain of an item is given below. In the above example, the utility lists of item a in sequences s₂ to s₅ may be determined similarly. The utility value chain of item a may then be determined from the utility lists of item a in sequences s₁ to s₅. FIG. 4 shows a schematic diagram of the utility value chain of item a. As shown in FIG. 4, the utility value chain of item a includes the utility list of item a in sequence s₁, the utility list of item a in sequence s₂, the utility list of item a in sequence s₃, the utility list of item a in sequence s₄, and the utility list of item a in sequence s₅.
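The utility list and utility value chain described above can be sketched as follows. The function names `utility_list` and `utility_chain` and the input format are assumptions for illustration, and the "next" pointer is represented simply by the order of the entries in a Python list. Table 1 is not reproduced here, so both sample sequences are hypothetical: sequence 1 was chosen so that the entries quoted in the text for item a come out (the utilities of d and e are placeholders), and sequence 2 is invented.

```python
def utility_list(sid, itemsets, item):
    """Entries (sid, tid, acu, ru) for `item` in one sequence.

    `itemsets` is a list of item sets, each a list of (item, utility) pairs
    in order. ru is the sum of the utilities of all items that follow the
    occurrence, including later items of the same item set.
    """
    total = sum(u for itemset in itemsets for _, u in itemset)
    seen = 0  # utility of everything up to and including the current item
    entries = []
    for tid, itemset in enumerate(itemsets, start=1):
        for name, u in itemset:
            seen += u
            if name == item:
                entries.append((sid, tid, u, total - seen))
    return entries

def utility_chain(sequences, item):
    """Utility lists of `item` in every sequence that contains it, keyed by sid."""
    chain = {}
    for sid, itemsets in sequences.items():
        entries = utility_list(sid, itemsets, item)
        if entries:
            chain[sid] = entries
    return chain

# Hypothetical partition with two sequences (assumed data, see the note above).
sequences = {
    1: [[("a", 10), ("c", 12)],
        [("a", 15), ("b", 3), ("c", 8)],
        [("a", 20), ("b", 15), ("d", 5), ("e", 6)]],
    2: [[("b", 4)], [("a", 6), ("c", 2)]],
}
chain_a = utility_chain(sequences, "a")
print(chain_a[1])
```

For sequence 1 the sketch yields the three groups (1, 1, 10, 84), (1, 2, 15, 57) and (1, 3, 20, 26) discussed in the text.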

Similarly, the utility value chain of each item that belongs to the first type of item in the sequences included in each partition may be determined. Local high utility sequence patterns may then be mined from the partition based on the utility value chains of the items in the partition. For example, each item in the partition and its utility value chain may be used as input to a conventional high utility sequence pattern mining algorithm (e.g., the HUS-Span algorithm), and the algorithm may output one or more local high utility sequence patterns corresponding to the partition. In addition, the algorithm may also output the utility value of each local high utility sequence pattern in the corresponding sequence and the identification information of that sequence. The output of the algorithm may be represented as a key-value pair (pattern, {sid, utility}), where pattern represents a local high utility sequence pattern, sid represents the identification of a sequence containing the local high utility sequence pattern, and utility represents the utility value of the local high utility sequence pattern in that sequence.

The above-described operations of step S2031 may be performed by the Mappers in the second stage MapReduce. For example, the plurality of partitions of the sequence database may be processed by a plurality of Mappers in the second stage MapReduce, respectively, such that each Mapper mines local high utility sequence patterns from its corresponding partition. In this case, the algorithm output described above may be the output of the Mapper. That is, for a partition of the sequence database, the output of the Mapper corresponding to the partition is one or more key-value pairs (pattern, {sid, utility}), where the one or more patterns are the local high utility sequence patterns mined from the partition.

After step S2031, at least one candidate global high utility sequence pattern may be determined from the mined local high utility sequence patterns in step S2032. For example, key-value pairs with the same key among the outputs of the plurality of Mappers may be input to one Reducer in the second stage MapReduce. That is, among the outputs of the plurality of Mappers, the key-value pairs corresponding to the same pattern, for example the key-value pairs (pattern x, {sid, utility}) corresponding to pattern x, are input to one Reducer. The Reducer may determine the sum of the utility values corresponding to pattern x, and determine whether pattern x is a global high utility sequence pattern based on the sum and the first threshold. If the sum is greater than or equal to the first threshold, pattern x is determined to be a global high utility sequence pattern. If the sum is less than the first threshold, pattern x is determined not to be a global high utility sequence pattern but a candidate global high utility sequence pattern.
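The Reducer decision just described (sum the utilities reported for one pattern and compare the sum with the first threshold) can be sketched as follows. The function name `reduce_patterns`, the tuple layout standing in for the Mapper key-value pairs, and the numbers are hypothetical.

```python
from collections import defaultdict

def reduce_patterns(mapper_outputs, threshold):
    """Group (pattern, (sid, utility)) pairs by pattern and classify each pattern."""
    grouped = defaultdict(list)
    for pattern, (sid, utility) in mapper_outputs:
        grouped[pattern].append((sid, utility))
    global_patterns, candidates = {}, {}
    for pattern, occurrences in grouped.items():
        total = sum(u for _, u in occurrences)
        # Sum >= threshold: already global; otherwise kept as a candidate.
        target = global_patterns if total >= threshold else candidates
        target[pattern] = occurrences
    return global_patterns, candidates

# Hypothetical Mapper outputs from the second stage.
mapper_outputs = [
    ("pattern 1", ("s1", 20)),
    ("pattern 1", ("s2", 25)),
    ("pattern 2", ("s3", 10)),
    ("pattern 2", ("s4", 12)),
]
global_patterns, candidates = reduce_patterns(mapper_outputs, threshold=40)
print(sorted(global_patterns), sorted(candidates))
```

Here pattern 1 (sum 45) passes the threshold directly, while pattern 2 (sum 22) is retained as a candidate whose utility in other sequences still has to be checked.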

In addition, each Reducer may output one or more new key-value pairs, each of which may be composed of a candidate global high utility sequence pattern, the identification of a sequence corresponding to the candidate, and the utility value of the candidate in that sequence. For example, the new key-value pair may be expressed as (sid, (pattern, utility)); that is, the form of the key-value pair output by the Mapper is changed.

The "at least one candidate global high utility sequence pattern" in step S2032 may be determined from the outputs of the plurality of Reducers in the second stage MapReduce, for example from the sequence patterns in the key-value pairs output by those Reducers. For example, if the outputs of the plurality of Reducers are (s₁, (pattern 1, utility 1)), (s₂, (pattern 1, utility 1)), (s₃, (pattern 2, utility 2)), (s₃, (pattern 1, utility 1)), and (s₄, (pattern 2, utility 2)), then the "at least one candidate global high utility sequence pattern" in step S2032 may be pattern 1 and pattern 2.

Further, after step S2032, in step S2033, the first set may be determined. For example, the first set may be determined from the outputs of the plurality of Reducers in the second stage MapReduce. The first set may include the at least one candidate global high utility sequence pattern, the identifications of the sequences that include each candidate global high utility sequence pattern, and the utility value of each candidate global high utility sequence pattern in the respective sequences. For example, the first set may include a plurality of subsets, each subset including the identification of a sequence, the candidate global high utility sequence patterns included in the sequence, and the utility values of those candidates in the sequence. For example, the outputs of the plurality of Reducers in the second stage MapReduce may be (s₁, (pattern 1, utility 1)), (s₂, (pattern 1, utility 1)), (s₃, (pattern 2, utility 2)), (s₃, (pattern 1, utility 1)), and (s₄, (pattern 2, utility 2)). The first set may then include four subsets, of which the 1st subset is (s₁, (pattern 1, utility 1)), the 2nd subset is (s₂, (pattern 1, utility 1)), the 3rd subset is (s₃, (pattern 2, utility 2), (pattern 1, utility 1)), and the 4th subset is (s₄, (pattern 2, utility 2)).
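Grouping the Reducer outputs by sequence identification, as in the four-subset example, could be done as in the sketch below. The function name `build_first_set` and the nested-dictionary layout are assumptions made for illustration, and the numeric utilities are placeholders for the symbolic "utility 1" and "utility 2" of the text.

```python
def build_first_set(reducer_outputs):
    """Group (sid, (pattern, utility)) pairs into one subset per sequence."""
    first_set = {}
    for sid, (pattern, utility) in reducer_outputs:
        first_set.setdefault(sid, {})[pattern] = utility
    return first_set

# Same shape as the example in the text; the numbers are placeholders.
reducer_outputs = [
    ("s1", ("pattern 1", 11)),
    ("s2", ("pattern 1", 7)),
    ("s3", ("pattern 2", 9)),
    ("s3", ("pattern 1", 4)),
    ("s4", ("pattern 2", 6)),
]
first_set = build_first_set(reducer_outputs)
print(len(first_set), first_set["s3"])
```

Keying by sequence first makes the later lookup "utility of this candidate in this sequence" a pair of dictionary accesses.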

It will be appreciated that with the first set organized as such a data structure, the calculation of the utility values of candidate global high utility sequence patterns may be sped up. In particular, if a sequence includes a candidate global high utility sequence pattern, the utility value of the candidate in that sequence may be obtained directly from the first set rather than calculated again, since repeated calculation may take a lot of time.

In the above example, no corresponding combination module is configured for the Mapper in the second stage MapReduce. However, the present disclosure is not limited thereto. For example, a corresponding combination module may also be configured for the Mapper in the second stage MapReduce.

Returning to fig. 2, in step S204, a global high utility sequence pattern is mined from the at least one candidate global high utility sequence pattern according to the utility value linked list of each sequence and the first set. Step S204 may be performed by the integrating part 130 described above (i.e., the third stage MapReduce).

Step S204 will be specifically described below with reference to FIG. 5. FIG. 5 is a flowchart of a method 500 of mining global high utility sequence patterns from at least one candidate global high utility sequence pattern, according to an embodiment of the present disclosure. As shown in FIG. 5, in step S501, the local utility value of each candidate global high utility sequence pattern may be determined according to the utility value linked list of each sequence and the first set.

Specifically, the at least one candidate global utility sequence pattern and the first set may be taken as inputs for a plurality of mappers in the third stage MapReduce. For example, at least one candidate global utility sequence pattern may be divided into a plurality of groups, and then the plurality of groups are input to a plurality of mappers, respectively. Further, the first set may be input to each Mapper.

Each Mapper may then determine the utility value of each of the candidate global high utility sequence patterns assigned to it. For example, for one of the candidate global high utility sequence patterns assigned to a Mapper, the Mapper may determine whether the first set includes that candidate. When the first set includes the candidate global high utility sequence pattern, the utility value of the candidate may be determined from the first set. When the first set does not include the candidate global high utility sequence pattern, the utility value of the candidate may be determined from the utility value linked lists of the sequences.
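The two cases handled by each Mapper can be pictured as a cache lookup with a fallback, as in the sketch below. The function name `pattern_utility`, the nested-dictionary form of the first set, and the callback `compute_from_linked_list` are hypothetical; the callback merely stands in for the computation over the utility value linked list and returns a placeholder value here.

```python
def pattern_utility(pattern, sid, first_set, compute_from_linked_list):
    """Utility of `pattern` in sequence `sid`: first set first, else computed."""
    cached = first_set.get(sid, {}).get(pattern)
    if cached is not None:
        return cached  # already calculated in the second stage
    return compute_from_linked_list(pattern, sid)

# Hypothetical first set and a stub for the expensive computation.
first_set = {"s1": {"pattern 1": 11}}
calls = []

def slow_compute(pattern, sid):
    calls.append((pattern, sid))  # record that the fallback was needed
    return 7                      # placeholder utility value

hit = pattern_utility("pattern 1", "s1", first_set, slow_compute)
miss = pattern_utility("pattern 2", "s1", first_set, slow_compute)
print(hit, miss, calls)
```

Only the second call reaches the fallback, so the costly computation runs only for pattern and sequence pairs that the second stage did not already cover.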

This is because, when the utility value of a candidate global high utility sequence pattern has already been calculated, it can be obtained directly by querying the sidset of the sequences that include the candidate. However, when the utility value of the candidate global high utility sequence pattern has not been calculated, it is necessary to check whether the candidate appears in a specific sequence. If it does, the utility value of the candidate global high utility sequence pattern needs to be calculated from that sequence. It should be noted that this calculation is time consuming, because the specific sequence needs to be scanned and the candidate global high utility sequence pattern may have multiple matches in it. The specific sequence would therefore need to be scanned multiple times to find the match of the candidate with the largest utility value, and completing the mining task would require multiple scans of the entire sequence database. The utility value linked list of a sequence proposed by the present disclosure is a compact data structure that is suitable for handling such big data problems.

An example of determining the utility value of a candidate global high utility sequence pattern from the utility value linked list of a sequence is described below. For example, the utility value of the candidate global high utility sequence pattern <[a, c], b> may be determined from the utility value linked list of sequence s₁ shown in Table 3 above. Specifically, since item a and item c are in the same item set, all positions at which a and c occur together can be found starting from the positions at which item c occurs: the first position pair (1, 2), with utility value 22, and the second position pair (3, 5), with utility value 23. For the first position pair (1, 2), all qualifying positions of item b, namely 4 and 7, can be found, and the utility values of a, c, b taken together can be calculated as 22 + 3 = 25 and 22 + 15 = 37. For the second position pair (3, 5), the only qualifying position of item b, namely 7, can be found, and the utility value of a, c, b taken together can be calculated as 23 + 15 = 38. Thus, the utility value of the sequence pattern <[a, c], b> in sequence s₁ is max{25, 37, 38} = 38.
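The maximum over all matches in the worked example can be checked with a short sketch. Note that this is not the linked-list traversal of the disclosure: it swaps in a plain dynamic programme over item sets, used here only to verify the arithmetic. The function name `max_pattern_utility` is hypothetical, and because Table 1 is not reproduced here the sequence is an assumed one, chosen to agree with the positions and utilities quoted in the example (the utilities of d and e are placeholders).

```python
def max_pattern_utility(itemsets, pattern):
    """Maximum utility of `pattern` over all of its matches in one sequence.

    `itemsets`: list of dicts {item: utility}, one per item set, in order.
    `pattern`: list of item lists, e.g. [["a", "c"], ["b"]] for <[a, c], b>.
    Returns None when the pattern does not occur in the sequence.
    """
    # best[j]: best utility of matching the first j pattern elements so far.
    best = [0] + [None] * len(pattern)
    for itemset in itemsets:
        # Right to left, so one item set serves at most one pattern element.
        for j in range(len(pattern), 0, -1):
            element = pattern[j - 1]
            if best[j - 1] is None or not all(i in itemset for i in element):
                continue
            value = best[j - 1] + sum(itemset[i] for i in element)
            if best[j] is None or value > best[j]:
                best[j] = value
    return best[-1]

# Assumed item sets of sequence s1 (see the note above).
s1 = [
    {"a": 10, "c": 12},
    {"a": 15, "b": 3, "c": 8},
    {"a": 20, "b": 15, "d": 5, "e": 6},
]
print(max_pattern_utility(s1, [["a", "c"], ["b"]]))  # 38
```

The three matches have utilities 25, 37 and 38, and the sketch returns the largest of them.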

In the present disclosure, each Mapper in the third stage MapReduce may output one or more new key-value pairs, where each new key-value pair may be composed of one candidate global high utility sequence pattern and its utility value. For example, the new key-value pair may be expressed as (pattern, utility).

In addition, the same Mapper may output multiple key-value pairs (pattern, utility) corresponding to the same candidate global high utility sequence pattern. For example, for one candidate global high utility sequence pattern, the same Mapper may output two key-value pairs, (pattern y, utility 1) and (pattern y, utility 2). These two key-value pairs may also be denoted as (pattern y, G_u), where G_u is a set that includes utility 1 and utility 2.

In this case too, a combination module (which may be referred to as, e.g., a combiner) may be configured for each Mapper to determine the local utility value of the same candidate global high utility sequence pattern. In particular, the local utility value of a candidate global high utility sequence pattern may be determined from the utility values in the plurality of key-value pairs corresponding to that candidate. For example, the local utility value may be the sum of the utility values in the plurality of key-value pairs corresponding to the candidate. For example, if, for a candidate global high utility sequence pattern, the same Mapper outputs the two key-value pairs (pattern y, utility 1) and (pattern y, utility 2), then the local utility value (local-utility) of the candidate is (utility 1 + utility 2).

In the present disclosure, the combination module may also output one or more new key-value pairs, where each new key-value pair may be composed of a candidate global high utility sequence pattern and its local utility value. For example, the new key-value pair may be expressed as (pattern, local-utility). In the example where the candidate global high utility sequence pattern is pattern y, the combination module may output the key-value pair (pattern y, utility 1 + utility 2).

Returning to fig. 5, in step S502, global utility values for each candidate global utility sequence pattern may be determined from local utility values for each candidate global utility sequence pattern. For example, for each candidate global utility sequence pattern, a global utility value for the candidate global utility sequence pattern may be determined from a plurality of local utility values for the candidate global utility sequence pattern. For example, the sum of the plurality of local utility values of the candidate global high utility sequence pattern may be taken as the global utility value of the candidate global high utility sequence pattern.

Specifically, the key value pairs with the same key value in the outputs of the plurality of combination modules may be input to one Reducer in the third stage MapReduce. That is, a plurality of key value pairs corresponding to the same candidate global high utility sequence pattern, for example, a plurality of key value pairs corresponding to candidate global high utility sequence pattern y, among outputs of a plurality of combination modules are input to one Reducer. The Reducer may sum the local utility values in the plurality of key-value pairs as a global utility value (global-utility) of the candidate global high utility sequence pattern.

Then, in step S503, a sequence pattern having a global utility value greater than the first threshold may be determined as a global high utility sequence pattern. For example, each Reducer in the third stage MapReduce may determine a sequence pattern with a global utility value greater than or equal to the first threshold as a global high utility sequence pattern. Each Reducer may output one or more new key-value pairs, where each new key-value pair may be composed of a global high utility sequence pattern and the global utility value of that pattern. For example, when pattern is a global high utility sequence pattern, a Reducer may output the key-value pair (pattern, global-utility). Therefore, the sequence patterns in the key-value pairs output by the Reducers in the third stage MapReduce are all global high utility sequence patterns.
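Steps S501 to S503 (local sums per Mapper, a global sum per candidate, and the threshold test) can be sketched in memory as follows. The function names `combine` and `reduce_global` and the sample numbers are hypothetical, and the sketch assumes the "greater than or equal to" variant of the threshold test given in the example above.

```python
from collections import defaultdict

def combine(mapper_output):
    """Combiner: local utility of each candidate within one Mapper's output."""
    local = defaultdict(int)
    for pattern, utility in mapper_output:
        local[pattern] += utility
    return dict(local)

def reduce_global(combiner_outputs, threshold):
    """Reducer: sum local utilities and keep patterns that reach the threshold."""
    global_utility = defaultdict(int)
    for local in combiner_outputs:
        for pattern, value in local.items():
            global_utility[pattern] += value
    # Assumed variant: global utility >= first threshold.
    return {p: v for p, v in global_utility.items() if v >= threshold}

# Hypothetical (pattern, utility) outputs of two third-stage Mappers.
mapper_1 = [("pattern y", 12), ("pattern y", 8), ("pattern z", 5)]
mapper_2 = [("pattern y", 25), ("pattern z", 4)]
locals_ = [combine(mapper_1), combine(mapper_2)]
result = reduce_global(locals_, threshold=40)
print(locals_[0], result)
```

In this toy run pattern y has local utilities 20 and 25 and a global utility of 45, so it is output as a global high utility sequence pattern, while pattern z (global utility 9) is dropped.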

With the method for mining global high utility sequence patterns provided by this embodiment, the utility value linked list of each sequence in the sequence database and the first set are determined, and global high utility sequence patterns are mined using these two data structures. This saves a great deal of time, speeds up the calculation of global utility values over the sequence database and hence the mining, and reduces the time complexity.

Hereinafter, an apparatus corresponding to the method shown in FIG. 2 according to an embodiment of the present disclosure will be described with reference to FIG. 6. FIG. 6 illustrates a schematic diagram of an apparatus 600 for mining global high utility sequence patterns according to an embodiment of the present disclosure. Since the functions of the apparatus 600 are the same as the details of the method described above with reference to FIG. 2, a detailed description of the same content is omitted here for simplicity. As shown in FIG. 6, the apparatus 600 includes: a first determining unit 610 configured to determine a first type of item in the sequence database, wherein the first type of item is an item whose global sequence weight utility value is above a first threshold; a second determining unit 620 configured to determine a utility value linked list of each sequence in the sequence database; a first mining unit 630 configured to mine at least one candidate global high utility sequence pattern from the sequence database according to the determined first type items and to determine a first set, wherein the first set comprises the at least one candidate global high utility sequence pattern, the identifications of the sequences that include each candidate global high utility sequence pattern, and the utility value of each candidate global high utility sequence pattern in the respective sequences; and a second mining unit 640 configured to mine a global high utility sequence pattern from the at least one candidate global high utility sequence pattern according to the utility value linked list of each sequence and the first set. In addition to these four units, the apparatus 600 may include other components; however, since these components are not related to the embodiments of the present disclosure, their illustration and description are omitted here.

The first determining unit 610 may determine a global sequence weight utility value of each item in the sequence database, and determine an item whose global sequence weight utility value is higher than a first threshold value as a first type item. The first determination unit 610 may be the identification portion 110 described above (i.e., the first stage MapReduce).

The process by which the first determining unit 610 determines the global sequence weight utility value for each item in the sequence database will be described below. According to one example of the present disclosure, for each item in the sequence database, the first determining unit 610 may first determine a local sequence weight utility value (Local Sequence Weight Utility, LSWU) for the item at each partition of the sequence database, and then determine a global sequence weight utility value for the item according to the determined local sequence weight utility value.

For example, first, the first determining unit 610 may divide the sequence database into a plurality of partitions and allocate the plurality of partitions to a plurality of Mappers in the first stage MapReduce. For example, the sequence database may be divided into n partitions, and the kth partition may be assigned to Mapper k in the first stage MapReduce, where 1 ≤ k ≤ n and k is a positive integer.

Furthermore, because different sequences in each partition may contain the same item, the Mapper may generate multiple key-value pairs for the same item in these different sequences. In this case, a combination module (which may be referred to as a combiner) may be configured for each Mapper to determine the local sequence weight utility value of the same item in each partition. In particular, the local sequence weight utility value of the item at each partition of the sequence database may be determined from the utility values of the sequences in the partition that include the item. For example, the local sequence weight utility value of the item at each partition of the sequence database may be the sum of the utility values of the sequences comprising the item in that partition.

In the above manner, the first determining unit 610 may determine the local sequence weight utility value of each item at each partition of the sequence database. After determining the local sequence weight utility value of each item in the respective partition of the sequence database, the first determining unit 610 may determine the global sequence weight utility value of the item according to the determined local sequence weight utility value. For example, the sum of the local sequence weight utility values of each item in the respective partition of the sequence database may be taken as the global sequence weight utility value of the item.

Specifically, the key-value pairs having the same key among the outputs of the plurality of combination modules may be input to one Reducer in the first stage MapReduce. That is, among the outputs of the plurality of combination modules, a plurality of key-value pairs corresponding to the same item, for example, a plurality of key-value pairs (i, LSWU_i-k) corresponding to item i, are input to one Reducer. The Reducer may sum the local sequence weight utility values in the plurality of key-value pairs to obtain a global sequence weight utility value GSWU_i for item i.

Thus far, a process has been described for determining a global sequence weight utility value for each item in a sequence database. After determining the global sequence weight utility value for each item in the sequence database, each Reducer in the first stage MapReduce may determine items for which the global sequence weight utility value is greater than or equal to a first threshold as first type items. Each Reducer may output one or more new key-value pairs, where each new key-value pair may be composed of a first type item and the global sequence weight utility value of the first type item. For example, when item i is a first type item, a certain Reducer may output a key-value pair (i, GSWU_i).
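The map, combine, and reduce steps described above can be sketched in a few lines. The following is a minimal single-process illustration, not the disclosed implementation: the data layout (a sequence as a list of itemsets, each itemset a list of (item, utility) pairs) and the function names are assumptions introduced only for this example, and the comparison uses "greater than or equal to" as in the Reducer description.

```python
from collections import defaultdict


def sequence_utility(sequence):
    """Utility of a whole sequence: the sum of the utilities of all its items."""
    return sum(utility for itemset in sequence for _, utility in itemset)


def local_swu(partition):
    """Mapper plus combiner for one partition: item -> LSWU in that partition."""
    lswu = defaultdict(int)
    for sequence in partition:
        su = sequence_utility(sequence)
        # Each distinct item of the sequence contributes the sequence utility once.
        distinct_items = {item for itemset in sequence for item, _ in itemset}
        for item in distinct_items:
            lswu[item] += su
    return dict(lswu)


def global_swu(partitions):
    """Reducer: sum the LSWU values of each item over all partitions."""
    gswu = defaultdict(int)
    for partition in partitions:
        for item, value in local_swu(partition).items():
            gswu[item] += value
    return dict(gswu)


def first_type_items(partitions, threshold):
    """Keep the items whose GSWU reaches the (assumed inclusive) threshold."""
    return {i: v for i, v in global_swu(partitions).items() if v >= threshold}
```

In an actual MapReduce job, local_swu would run once per Mapper and the summation in global_swu would be performed by the Reducers, one key (item) at a time.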

According to one example of the present disclosure, for a sequence in the sequence database, the second determining unit 620 may determine a utility value linked list of the sequence according to utility values of the respective items in the sequence and positions of the respective items in the sequence. The utility value of an item may be the product of the internal utility value and the external utility value of the item. The position of each item in the sequence may include an initial position of each item and a neighboring position, wherein the initial position of an item may be the position of the first occurrence of an item in the sequence and the neighboring position may be the position of the next occurrence of an item in the sequence. In addition, a utility value linked list of one sequence may include two rows, where the first row may be information about utility values and adjacent locations of individual items (may be simply referred to as Utility Position Information, UP information), and the second row may be information about initial locations of non-duplicate items in the sequence (may be simply referred to as a Header Table). The second row may include non-duplicate items and initial positions of respective non-duplicate items.
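As an illustration of the two-row structure described above, the following sketch builds the UP information and the Header Table for one sequence. It is only a sketch under stated assumptions: the sequence is flattened into an ordered list of (item, internal utility, external utility) triples, positions are 0-based, and a missing next occurrence is marked with None. None of these conventions is specified by the disclosure.

```python
def build_utility_linked_list(sequence):
    """Return (up_info, header_table) for one sequence.

    up_info[pos] holds the utility of the item at pos and the position of
    the next occurrence of the same item (None if there is none).
    header_table maps each non-duplicate item to its first position.
    """
    up_info = []
    header_table = {}
    last_seen = {}
    for pos, (item, internal, external) in enumerate(sequence):
        # Utility of an item: internal utility times external utility.
        up_info.append({"item": item, "utility": internal * external, "next": None})
        if item in last_seen:
            up_info[last_seen[item]]["next"] = pos  # link previous occurrence
        else:
            header_table[item] = pos  # first occurrence of this item
        last_seen[item] = pos
    return up_info, header_table


def occurrences(up_info, header_table, item):
    """Walk the linked list and yield every position of the given item."""
    pos = header_table.get(item)
    while pos is not None:
        yield pos
        pos = up_info[pos]["next"]
```

Starting from the Header Table and following the "next" links visits only the positions of the requested item, so the utility of an item in a sequence can be read without scanning the whole sequence.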

In the present disclosure, the first mining unit 630 may be the above-described local mining portion 120 (i.e., the second stage MapReduce).

According to one example of the present disclosure, the apparatus 600 may further include a load distribution unit (not shown in the drawing) configured to distribute the sequences in the sequence database into a plurality of tasks (tasks). The number of tasks may be denoted as m, where m is a positive integer. For example, m may be a multiple of the number of mappers in the second stage MapReduce. In the following example, the present disclosure is described taking the example that m is equal to the number of mappers in the second stage MapReduce.

In this example, the load distribution unit may divide the sequences in the sequence database into a plurality of partitions according to a load balancing algorithm. For example, the sequences in the sequence database may be distributed into a plurality of tasks according to the load balancing algorithm. Specifically, for a sequence in the sequence database, the number (Num) of items of the first type that the sequence includes may be determined. Then, a task p having the smallest workload is selected from the plurality of tasks, and the sequence is assigned to the task p, while the workload of the task p is updated according to the number of items of the first type included in the sequence. For example, the workload of the p-th task may be represented as WL_p, and when a sequence is assigned to the task, the workload of the task is updated from WL_p to (WL_p + Num).
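The greedy assignment described above can be sketched as follows. This is a minimal illustration under assumptions: each sequence is given as a flat list of items keyed by a sequence identifier, Num counts every occurrence of a first type item (the disclosure does not say whether repeated items are counted once or several times), and ties between equally loaded tasks are broken by the lower task index.

```python
import heapq


def assign_sequences(sequences, first_type_items, m):
    """Assign each sequence to the task with the smallest current workload.

    sequences: dict mapping a sequence id to its list of items.
    Returns (tasks, workloads), where tasks[p] lists the ids given to task p.
    """
    heap = [(0, p) for p in range(m)]  # (WL_p, p), smallest workload on top
    heapq.heapify(heap)
    tasks = [[] for _ in range(m)]
    workloads = [0] * m
    for sid, items in sequences.items():
        num = sum(1 for item in items if item in first_type_items)
        wl, p = heapq.heappop(heap)        # task p with the smallest workload
        tasks[p].append(sid)
        workloads[p] = wl + num            # WL_p is updated to WL_p + Num
        heapq.heappush(heap, (workloads[p], p))
    return tasks, workloads
```

Using a heap keeps the selection of the least-loaded task at O(log m) per sequence; a linear scan over the m workloads would give the same assignment.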

In the present disclosure, the first mining unit 630 may mine the local high utility sequence patterns from various partitions of the sequence database according to the determined first class item. The first mining unit 630 may then determine at least one candidate global high utility sequence pattern from the mined local high utility sequence patterns. The first mining unit 630 may then determine the first set.

The process by which the first mining unit 630 mines the local high utility sequence patterns from each partition of the sequence database will be described below. Specifically, for an item belonging to the first type of item in each sequence included in a partition, the first mining unit 630 may calculate a utility value and a remaining utility value of the item in each sequence, construct a utility list of the item in each sequence, and determine a utility value chain (utility chain) of the item according to the utility lists of the item in the respective sequences. A local high utility sequence pattern is then mined from the partition according to the utility value chains of the items in the partition.
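To make the utility value and the remaining utility value concrete, the sketch below computes both for every first type item of a sequence and groups the results per item. The remaining utility follows the definition used in the claims (the sum of the utility values of all items that follow the item in the sequence). The concrete layout of a utility list, here a list of (position, utility, remaining utility) tuples, and of a utility value chain, here a mapping from sequence identifier to utility list, is an assumption made for illustration only.

```python
def utility_lists(sequence, first_type_items):
    """sequence: ordered list of (item, utility) pairs.

    Returns item -> list of (position, utility, remaining_utility) for the
    first type items only; other items still count toward remaining utility.
    """
    total = sum(utility for _, utility in sequence)
    lists = {}
    prefix = 0
    for pos, (item, utility) in enumerate(sequence):
        prefix += utility
        if item in first_type_items:
            # Everything after this position is the remaining utility.
            lists.setdefault(item, []).append((pos, utility, total - prefix))
    return lists


def utility_chains(partition, first_type_items):
    """partition: dict sid -> sequence. Returns item -> {sid: utility list}."""
    chains = {}
    for sid, sequence in partition.items():
        for item, ulist in utility_lists(sequence, first_type_items).items():
            chains.setdefault(item, {})[sid] = ulist
    return chains
```

The sum of an occurrence's utility and its remaining utility is the largest utility that any extension starting at that occurrence could reach, which is why such values are commonly used to prune the search.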

Similarly, a utility value chain for each item belonging to the first type of item in each sequence included in each partition may be determined. The local high utility sequence patterns may then be mined from the partition based on the utility value chains of the various items in the partition. For example, each item in the partition and the utility value chain of each item may be input to a conventional high utility sequence pattern algorithm, through which one or more local high utility sequence patterns corresponding to the partition may be output. In addition, the algorithm can also output the utility value of each local high utility sequence pattern in the corresponding sequence and the identification information of the sequence. The output of the algorithm may be represented as a key-value pair (pattern, {sid, utility}), where pattern represents a local high utility sequence pattern, sid represents an identification of a sequence containing the local high utility sequence pattern, and utility represents a utility value of the local high utility sequence pattern in the corresponding sequence.

This may be done by a Mapper in the second stage MapReduce. For example, multiple partitions of the sequence database may be processed by multiple Mappers in the second stage MapReduce, respectively, such that each of the Mappers may mine local high utility sequence patterns from its corresponding partition. In this case, the algorithm output described above may be the output of the Mapper. That is, for a partition of the sequence database, the output of the Mapper corresponding to the partition is one or more key-value pairs (pattern, {sid, utility}), where the one or more patterns are one or more local high utility sequence patterns mined from the partition.

The first mining unit 630 may then determine at least one candidate global high utility sequence pattern from the mined local high utility sequence patterns. For example, key-value pairs having the same key among the outputs of the Mappers may be input to one Reducer in the second stage MapReduce. That is, among the outputs of the plurality of Mappers, a plurality of key-value pairs corresponding to the same pattern, for example, a plurality of key-value pairs (pattern x, {sid, utility}) corresponding to pattern x, are input to one Reducer. The Reducer may determine a sum of the plurality of utility values corresponding to pattern x, and determine whether pattern x is a global high utility sequence pattern based on the sum and a first threshold. If the sum is greater than or equal to the first threshold, then pattern x is determined to be a global high utility sequence pattern. If the sum is less than the first threshold, then it is determined that pattern x is not a global high utility sequence pattern, but a candidate global high utility sequence pattern.
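The Reducer decision described above can be sketched as a grouping step followed by a threshold test. This is an illustrative single-process version, assuming that the Mapper outputs are available as (pattern, sid, utility) triples; the function name and return shape are hypothetical.

```python
from collections import defaultdict


def classify_patterns(mapper_outputs, threshold):
    """Split patterns into confirmed global patterns and candidates.

    mapper_outputs: iterable of (pattern, sid, utility) triples.
    Returns (confirmed, candidates), each mapping a pattern to its
    list of (sid, utility) entries.
    """
    grouped = defaultdict(list)
    for pattern, sid, utility in mapper_outputs:
        grouped[pattern].append((sid, utility))  # same key -> same Reducer

    confirmed, candidates = {}, {}
    for pattern, entries in grouped.items():
        total = sum(utility for _, utility in entries)
        if total >= threshold:
            confirmed[pattern] = entries   # already a global high utility pattern
        else:
            candidates[pattern] = entries  # needs its utility in other sequences
    return confirmed, candidates
```

A pattern whose summed utility falls short remains a candidate because the Mappers report only the sequences in which the pattern was locally high utility, so its utility in the remaining sequences is still unknown at this stage.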

The "at least one candidate global high utility sequence pattern" may be determined from the outputs of the plurality of Reducers in the second stage MapReduce. For example, the "at least one candidate global high utility sequence pattern" in step S2032 may be determined according to the sequence patterns in the key-value pairs output by the plurality of Reducers in the second stage MapReduce. For example, if the outputs of the plurality of Reducers are (s₁, (pattern 1, utility 1)), (s₂, (pattern 1, utility 1)), (s₃, (pattern 2, utility 2)), (s₃, (pattern 1, utility 1)), and (s₄, (pattern 2, utility 2)), then the "at least one candidate global high utility sequence pattern" in step S2032 may be pattern 1 and pattern 2.

Further, the first mining unit 630 may determine the first set. For example, the first set may be determined from the outputs of the plurality of Reducers in the second stage MapReduce. The first set may include the at least one candidate global high utility sequence pattern, an identification of each sequence including a candidate global high utility sequence pattern, and the utility value of each candidate global high utility sequence pattern in the respective sequence. For example, the first set may include a plurality of subsets, each subset including an identification of a sequence, the candidate global high utility sequence patterns included by the sequence, and the utility values in the sequence of the candidate global high utility sequence patterns included by the sequence. For example, the outputs of the plurality of Reducers in the second stage MapReduce may be (s₁, (pattern 1, utility 1)), (s₂, (pattern 1, utility 1)), (s₃, (pattern 2, utility 2)), (s₃, (pattern 1, utility 1)), and (s₄, (pattern 2, utility 2)). The first set may then include four subsets, of which subset 1 is (s₁, (pattern 1, utility 1)), subset 2 is (s₂, (pattern 1, utility 1)), subset 3 is (s₃, (pattern 2, utility 2), (pattern 1, utility 1)), and subset 4 is (s₄, (pattern 2, utility 2)).
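The grouping of the Reducer outputs into the subsets of the first set can be sketched as follows. This is only an illustration: the representation of the first set as a dictionary keyed by sequence identifier, and the numeric utility values used in the usage note, are assumptions made for the example.

```python
def build_first_set(reducer_outputs):
    """Group (sid, (pattern, utility)) pairs into one subset per sequence."""
    first_set = {}
    for sid, (pattern, utility) in reducer_outputs:
        first_set.setdefault(sid, []).append((pattern, utility))
    return first_set


def candidate_patterns(first_set):
    """Collect the distinct candidate patterns that appear in the first set."""
    return {pattern for entries in first_set.values() for pattern, _ in entries}
```

With five outputs for sequences s1 to s4, of which two belong to s3, build_first_set yields four subsets and candidate_patterns yields the two patterns, which mirrors the example given in the text.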

Furthermore, in the present disclosure, the second mining unit 640 may be the integration portion 130 described above (i.e., the third stage MapReduce).

The second mining unit 640 may determine local utility values of each candidate global high utility sequence pattern according to the utility value linked list of each sequence and the first set.

Also in this case, a combination module (which may be referred to as a combiner) may be configured for each Mapper to determine the local utility value of the same candidate global high utility sequence pattern. In particular, the local utility value of a candidate global high utility sequence pattern may be determined from the utility values in the plurality of key-value pairs corresponding to that candidate. For example, the local utility value of a candidate global high utility sequence pattern may be the sum of the utility values in the plurality of key-value pairs corresponding to that candidate. For example, if, for a candidate global high utility sequence pattern y, the same Mapper outputs two key-value pairs, (pattern y, utility 1) and (pattern y, utility 2), then the local utility value local-utility of the candidate global high utility sequence pattern is (utility 1 + utility 2).

Then, the second mining unit 640 may determine the global utility value of each candidate global high utility sequence pattern according to the local utility values of that candidate. For example, for each candidate global high utility sequence pattern, a global utility value of the candidate may be determined from the plurality of local utility values of the candidate. For example, the sum of the plurality of local utility values of the candidate global high utility sequence pattern may be taken as the global utility value of the candidate global high utility sequence pattern.

Then, the second mining unit 640 may determine a sequence pattern in which the global utility value is greater than the first threshold value as a global high utility sequence pattern. For example, each Reducer in the third stage MapReduce may determine a sequence pattern with a global utility value greater than or equal to the first threshold as a global high utility sequence pattern. Each Reducer may output one or more new key-value pairs, where each new key-value pair may be composed of a global high utility sequence pattern and the global utility value of the global high utility sequence pattern. For example, when a pattern is a global high utility sequence pattern, a certain Reducer may output a key-value pair (pattern, global-utility). Therefore, the sequence patterns in the key-value pairs output by the Reducers in the third stage MapReduce are all global high utility sequence patterns.
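The combine and reduce steps of the third stage can be sketched as two summations followed by a filter. This is a minimal single-process illustration; the function names are hypothetical, each Mapper output is assumed to be a list of (pattern, utility) pairs, and the inclusive comparison follows the Reducer description.

```python
from collections import defaultdict


def combine(mapper_output):
    """Combiner: sum the utilities that one Mapper emitted for each pattern."""
    local = defaultdict(int)
    for pattern, utility in mapper_output:
        local[pattern] += utility
    return dict(local)


def reduce_global(local_results, threshold):
    """Reducer: sum local utilities per pattern and keep the high utility ones."""
    global_utility = defaultdict(int)
    for local in local_results:
        for pattern, value in local.items():
            global_utility[pattern] += value
    return {p: v for p, v in global_utility.items() if v >= threshold}
```

For instance, passing [combine(m) for m in mapper_outputs] to reduce_global returns only the patterns whose global utility reaches the threshold, together with that global utility.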

With the apparatus for mining global high utility sequence patterns provided by this embodiment, the utility value linked list of each sequence in the sequence database and the first set are determined, and the global high utility sequence patterns are mined according to these two data structures. This saves a large amount of time, speeds up the calculation of global utility values in the sequence database, increases the mining speed, and reduces the time complexity.

Furthermore, an apparatus according to embodiments of the present disclosure may also be implemented by means of the architecture of the computing device shown in fig. 7. Fig. 7 illustrates an architecture of the computing device. As shown in fig. 7, computing device 700 may include a bus 710, one or more CPUs 720, a Read Only Memory (ROM) 730, a Random Access Memory (RAM) 740, a communication port 750 connected to a network, an input/output component 760, a hard disk 770, and the like. A storage device, such as ROM 730 or hard disk 770, in computing device 700 may store various data or files for computer processing and/or communication and program instructions for execution by the CPU. Computing device 700 may also include a user interface 780. Of course, the architecture shown in FIG. 7 is merely exemplary, and one or more components of the computing device shown in FIG. 7 may be omitted as may be practical in implementing different devices.

Embodiments of the present disclosure may also be implemented as a computer-readable storage medium. A computer-readable storage medium according to embodiments of the present disclosure has computer-readable instructions stored thereon. When executed by a processor, the computer-readable instructions may perform a method according to embodiments of the present disclosure described with reference to the above figures. The computer-readable storage medium includes, but is not limited to, for example, volatile memory and/or nonvolatile memory. The volatile memory may include, for example, Random Access Memory (RAM) and/or cache memory (cache), and the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, and the like.

Those skilled in the art will appreciate that various modifications and improvements can be made to the disclosure. For example, the various devices or components described above may be implemented in hardware, or may be implemented in software, firmware, or a combination of some or all of the three.

Furthermore, as shown in the present disclosure and claims, unless the context clearly indicates otherwise, the words "a," "an," and/or "the" are not specific to the singular, but may include the plural. The terms "first," "second," and the like, as used in this disclosure, do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. Likewise, the word "comprising" or "comprises", and the like, means that the element or item preceding the word encompasses the elements or items listed after the word and equivalents thereof, but does not exclude other elements or items. The terms "connected" or "coupled," and the like, are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect.

Further, flowcharts are used in this disclosure to describe the operations performed by the system according to embodiments of the present disclosure. It should be understood that the preceding or following operations are not necessarily performed in the exact order shown. Rather, the various steps may be processed in reverse order or simultaneously. Also, other operations may be added to or removed from these processes.

Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

While the present disclosure has been described in detail above, it will be apparent to those skilled in the art that the present disclosure is not limited to the embodiments described in the present specification. The present disclosure may be embodied as modifications and variations without departing from the spirit and scope of the disclosure, which is defined by the appended claims. Accordingly, the description herein is for the purpose of illustration and is not intended to be in any limiting sense with respect to the present disclosure.

Claims

1. A method for mining global high utility sequence patterns, comprising:

determining a first type of item in a sequence database, wherein the first type of item is an item with a global sequence weight utility value higher than a first threshold value;

determining a utility value linked list of each sequence in the sequence database;

mining at least one candidate global high utility sequence pattern from the sequence database and determining a first set according to the determined first type of item, wherein the first set comprises a plurality of subsets, each subset comprising the at least one candidate global high utility sequence pattern, an identification of a sequence comprising a respective candidate global high utility sequence pattern, and utility values of the respective candidate global high utility sequence pattern in the respective sequence; and

mining a global high utility sequence pattern from the at least one candidate global high utility sequence pattern according to the utility value linked list of each sequence and the first set,

wherein mining the global high utility sequence pattern from the at least one candidate global high utility sequence pattern according to the utility value linked list of each sequence and the first set comprises:

when the first set includes the at least one candidate global high utility sequence pattern, determining a utility value of the at least one candidate global high utility sequence pattern from the first set.

2. The method of claim 1, wherein said determining a first type of item in a sequence database comprises:

determining a global sequence weight utility value of each item in the sequence database; and

determining items with global sequence weight utility values above the first threshold value as items of the first type.

3. The method of claim 2, wherein determining a global sequence weight utility value for each item in the sequence database comprises:

determining local sequence weight utility values of the item in each partition of the sequence database; and

determining a global sequence weight utility value of the item according to the determined local sequence weight utility values.

4. The method of claim 3, wherein the local sequence weight utility value of the item at each partition of the sequence database is determined from utility values of sequences comprising the item in that partition.

5. The method of any one of claims 1 to 4, wherein said determining a linked list of utility values for each sequence in the sequence database comprises:

determining a utility value linked list of the sequence according to the utility value of each item in the sequence and the position of each item in the sequence.

6. The method of any of claims 1 to 4, wherein the mining at least one candidate global high utility sequence pattern from the sequence database according to the determined first class of items comprises:

mining a local high utility sequence pattern from each partition of the sequence database according to the determined first class item; and

determining at least one candidate global high utility sequence pattern from the mined local high utility sequence patterns.

7. The method of claim 6, wherein mining a local high utility sequence pattern from each partition of the sequence database according to the determined first type of item comprises:

for an item belonging to the first type of item in the respective sequences comprised by the partition,

calculating utility values and remaining utility values of the item in each sequence, wherein the remaining utility value of the item in a sequence is the sum of utility values of all items in the sequence that follow the item;

constructing a utility list of the item in each sequence;

determining a utility value chain of the item according to the utility lists of the item in each sequence; and

mining a local high utility sequence pattern from the partition according to the utility value chains of the items in the partition.

8. The method of any of claims 1 to 4, wherein mining a global high utility sequence pattern from the at least one candidate global high utility sequence pattern according to the utility value linked list of each sequence and the first set comprises:

determining local utility values of each candidate global high utility sequence pattern according to the utility value linked list of each sequence and the first set;

determining the global utility value of each candidate global high utility sequence pattern according to the local utility values of each candidate global high utility sequence pattern; and

determining a sequence pattern having a global utility value greater than the first threshold value as a global high utility sequence pattern.

9. The method of claim 6, further comprising:

dividing the sequences in the sequence database into a plurality of partitions according to a load balancing algorithm.

10. An apparatus for mining global high utility sequence patterns, comprising:

a first determining unit configured to determine a first type of item in a sequence database, wherein the first type of item is an item for which the global sequence weight utility value is higher than a first threshold value;

a second determining unit configured to determine a utility value linked list of each sequence in the sequence database;

a first mining unit configured to mine at least one candidate global high utility sequence pattern from the sequence database according to the determined first type of item and determine a first set, wherein the first set comprises a plurality of subsets, each subset comprising the at least one candidate global high utility sequence pattern, an identification of a sequence comprising a respective candidate global high utility sequence pattern, and utility values of the respective candidate global high utility sequence pattern in the respective sequence; and

a second mining unit configured to mine a global high utility sequence pattern from the at least one candidate global high utility sequence pattern according to the utility value linked list of each sequence and the first set,

wherein the second mining unit is further configured to determine utility values of the at least one candidate global high utility sequence pattern from the first set when the first set includes the at least one candidate global high utility sequence pattern.

11. The apparatus of claim 10, wherein the first determining unit is configured to determine global sequence weight utility values for respective items in the sequence database; and determine an item with a global sequence weight utility value higher than the first threshold value as an item of the first type.

12. The apparatus according to claim 10 or 11, wherein the second determining unit is configured to determine a utility value linked list for each sequence based on the utility values of the respective items in the sequence and the positions of the respective items in the sequence.

13. The apparatus of claim 10 or 11, wherein the second mining unit is configured to determine local utility values of each candidate global high utility sequence pattern according to the utility value linked list of each sequence and the first set; determine the global utility value of each candidate global high utility sequence pattern according to the local utility values of each candidate global high utility sequence pattern; and determine a sequence pattern having a global utility value greater than the first threshold value as a global high utility sequence pattern.

14. An apparatus for mining global high utility sequence patterns, comprising:

a processor; and

a memory, wherein the memory has stored therein a computer executable program which, when executed by the processor, performs the method of any of claims 1-9.

15. A computer readable storage medium having stored thereon instructions which, when executed by a processor, cause the processor to perform the method of any of claims 1-9.