CN113407543A - Method, device and computer storage medium for mining high-utility continuous sequence patterns

Info

Publication number: CN113407543A
Application number: CN202110727658.2A
Authority: CN
Inventors: 张春慨
Original assignee: Shenzhen Graduate School Harbin Institute of Technology
Current assignee: Shenzhen Graduate School Harbin Institute of Technology
Priority date: 2021-06-29
Filing date: 2021-06-29
Publication date: 2021-09-17

Abstract

The invention discloses a method, a device and a computer storage medium for mining high-utility continuous sequence patterns. The method comprises the following steps: establishing a mapping database; generating initial candidate sequence patterns according to the mapping database, taking each initial candidate sequence pattern in turn as the current sequence pattern, and counting the utility value and the utility upper bound of the current sequence pattern; when the utility value is greater than or equal to a threshold, determining that the current sequence pattern is a high-utility continuous sequence pattern; when the utility upper bound is greater than or equal to the threshold, taking the current sequence pattern as a candidate sequence pattern; under the continuous constraint condition, if the candidate sequence pattern can be expanded, generating an expanded sequence pattern from the candidate sequence pattern, taking the expanded sequence pattern as the current sequence pattern, and counting the utility value and the utility upper bound of the current sequence pattern according to the mapping database of the current sequence pattern. The method and the device can meet the application requirements of current click stream log mining analysis.

Description

Method, device and computer storage medium for mining high-utility continuous sequence patterns

Technical Field

The present application relates to the field of data mining technologies, and in particular, to a method and an apparatus for mining a high-utility continuous sequence pattern, and a computer storage medium.

Background

In the internet era, many users visit different websites each day, generating a large number of click stream logs. A click stream records information such as the web browsing track of each user and the corresponding browsing durations, and can be represented simply as a sequence. For example, the click stream sequence < (A:1), (C:3), (D:4), (F:1) > shows that the user browses four web pages A, C, D and F in order, with browsing times of 1, 3, 4 and 1 time units respectively. A sequence of this type is also called a sequence with utility values, where the utility value refers to the browsing time. By mining and analyzing sequence patterns with a high sum of utility values in the click stream log, the following information can be obtained: which web page a user usually browses next after browsing a certain web page, i.e., which web pages are highly correlated; and which web pages users browse for longer, i.e., which web pages interest users most. Using this information, a website service provider can, on the technical side, improve the topology of the website and arrange quick access paths between highly correlated web pages to improve access efficiency; on the business side, it can place advertisements on popular web pages to increase their exposure, and recommend web page content according to user interest to improve the user experience. In conclusion, by applying high-utility sequence pattern mining, the user behavior rules contained in the website click stream log can be obtained, and this information is of great value to the website service provider.

Current high-utility sequence pattern mining algorithms can mine all sequence patterns in the click stream database whose utility value is above a predetermined threshold, i.e., high-utility sequence patterns. However, not all high-utility sequence patterns are meaningful in the specific application scenario of click stream analysis. For example, given two click stream sequences, < (B:10), (A:3), (C:1), (H:2), (D:1), (G:1), (F:5) > and < (B:9), (C:2), (F:6) >, with the minimum utility threshold set to 25, the pattern < B, F > has a utility value of (10 + 5) + (9 + 6) = 30, which is above the threshold. If this pattern is returned to the web service provider, they may mistakenly assume that a user who has finished browsing web page B is likely to browse web page F immediately afterwards. However, as the two click stream sequences show, this is not the case, especially in the first click stream, where B and F are separated by many web pages. To solve this problem, researchers proposed high-utility continuous sequence pattern mining, which adds a continuous constraint to the high-utility sequence pattern mining problem: each mined sequence pattern is required to be a continuous subsequence of at least one sequence in the database. Compared with conventional high-utility sequence patterns, high-utility continuous sequence patterns reflect the continuous access preferences of web users.
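The effect of the continuous constraint on the example above can be checked with a small sketch. It is only an illustration for simple click streams with one page per time point; the function names `best_utility` and `pattern_utility` are hypothetical, and the brute-force search over positions is chosen for clarity, not efficiency.

```python
from itertools import combinations

def best_utility(seq, pattern, contiguous):
    """Highest utility of `pattern` in `seq`, or None if it does not occur.

    `seq` is a list of (page, utility) pairs and `pattern` a list of pages.
    With contiguous=True the matched positions must be adjacent.
    """
    best = None
    for pos in combinations(range(len(seq)), len(pattern)):
        if contiguous and any(b - a != 1 for a, b in zip(pos, pos[1:])):
            continue
        if all(seq[p][0] == page for p, page in zip(pos, pattern)):
            total = sum(seq[p][1] for p in pos)
            best = total if best is None else max(best, total)
    return best

def pattern_utility(db, pattern, contiguous):
    """Sum of the best utility of `pattern` over the sequences containing it."""
    found = (best_utility(seq, pattern, contiguous) for seq in db)
    return sum(u for u in found if u is not None)

db = [
    [("B", 10), ("A", 3), ("C", 1), ("H", 2), ("D", 1), ("G", 1), ("F", 5)],
    [("B", 9), ("C", 2), ("F", 6)],
]
print(pattern_utility(db, ["B", "F"], contiguous=False))  # 30
print(pattern_utility(db, ["B", "F"], contiguous=True))   # 0
```

Without the constraint, < B, F > reaches the threshold of 25; with it, the pattern does not occur at all, because B and F are never adjacent.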

Existing high-utility continuous sequence pattern mining algorithms can only process sequences in which a single event occurs at each time point; in click stream terms, they assume that a user browses only one web page at any moment. However, in practical applications a user may browse multiple web pages simultaneously, and the click stream sequence is then more complex. For example, when shopping online, a user may open the pages of several e-commerce platforms at once to compare commodity prices; a user may also open a music platform to listen to music while browsing a news portal. There are many similar scenarios, in which a click stream sequence takes the form < { (B:3) (D:4) }, { (C:1) }, { (H:2) (E:1) } >, indicating that the user browses web pages B and D simultaneously, then C, and finally H and E simultaneously. In this form of sequence, the contents of a pair of curly braces "{ }" constitute an item set, and the elements of an item set are referred to as items; e.g., (B:3) is an item in the first item set of the sequence. Existing high-utility sequence pattern mining algorithms can handle such complex click stream sequences, but they do not consider the continuous constraint. Existing high-utility continuous sequence pattern mining algorithms consider the continuous constraint but cannot process complex click streams. Therefore, it is desirable to design a method and an apparatus for mining high-utility continuous sequence patterns from complex click streams.

In addition, with the rapid development of the internet, the number of internet users keeps increasing, and the volume of generated click stream data is becoming huge. Existing high-utility continuous sequence pattern mining algorithms perform well on small-scale databases but mine slowly on large-scale databases, and can hardly meet the requirements of large-scale data mining. How to improve the performance of the algorithm so that it can quickly mine useful information from a large-scale database is an urgent problem to be solved.

Disclosure of Invention

To address the above problems, the invention provides a method (FUCPM for short), a device and a computer storage medium for mining high-utility continuous sequence patterns in a large-scale complex click stream database.

In a first aspect of the present invention, a method for mining a high-utility continuous sequence pattern in a large-scale complex clickstream database is provided, which includes:

S1, establishing a mapping database of each sequence in the click stream sequence database;

S2, generating initial candidate sequence patterns according to the mapping database, taking each initial candidate sequence pattern in turn as the current sequence pattern, and counting the utility value and the utility upper bound of the current sequence pattern;

S3, when the utility value of the current sequence pattern is greater than or equal to a threshold, determining that the current sequence pattern is a high-utility continuous sequence pattern; when the utility upper bound of the current sequence pattern is greater than or equal to the threshold, taking the current sequence pattern as a candidate sequence pattern;

S4, under the continuous constraint condition, if the candidate sequence pattern can be expanded, generating an expanded sequence pattern from the candidate sequence pattern, taking the expanded sequence pattern as the current sequence pattern, establishing a current sequence pattern mapping database according to the current sequence pattern, counting the utility value and the utility upper bound of the current sequence pattern according to the current sequence pattern mapping database, and returning to S3; if the candidate sequence pattern cannot be expanded any further, ending the loop.

Further, in S2, generating an initial candidate sequence pattern according to the mapping database specifically includes: a sequence pattern of length 1 is determined as an initial candidate sequence pattern.

Further, in S4, expanding the candidate sequence pattern includes item expansion and item set expansion.

Further, the mapping database stores utility values of each item of each sequence in the click stream sequence database, and the current sequence pattern mapping database stores utility values and position information of the current sequence pattern.

Further, the S4 continuous constraint condition includes: item expansion can only be done with items that are in the same set of items as the last item of the candidate sequence pattern; item set expansion can only be done with items in the item set that are located after the last item of the candidate sequence pattern.

Further, S3 also includes: deleting the current sequence pattern when the utility upper bound of the current sequence pattern is smaller than the threshold.

In a second aspect of the present invention, there is provided an apparatus for mining a high-utility continuous sequence pattern, comprising: the mapping database establishing module is used for establishing a mapping database of the click stream sequence database;

the utility statistics module is used for generating initial candidate sequence patterns according to the mapping database, taking each initial candidate sequence pattern in turn as the current sequence pattern, and counting the utility value and the utility upper bound of the current sequence pattern;

the high-utility continuous sequence pattern judging module is used for determining that the current sequence pattern is a high-utility continuous sequence pattern when the utility value of the current sequence pattern is greater than or equal to a threshold;

the candidate sequence pattern decision module is used for taking the current sequence pattern as a candidate sequence pattern when the utility upper bound of the current sequence pattern is greater than or equal to the threshold;

and the candidate sequence pattern expansion module is used for expanding the candidate sequence pattern under the continuous constraint condition to generate an expanded sequence pattern, taking the expanded sequence pattern as the current sequence pattern, establishing a current sequence pattern mapping database according to the current sequence pattern, and counting the utility value and the utility upper bound of the current sequence pattern according to the current sequence pattern mapping database.

Further, the mapping database in the mapping database establishing module stores utility values of each item of each sequence in the click stream sequence database, and the current sequence pattern mapping database in the candidate sequence pattern expanding module stores utility values and location information of the current sequence pattern.

In a third aspect of the present invention, there is provided an apparatus for mining a high-utility continuous sequence pattern, comprising: a processor; and a memory, wherein the memory has stored therein a computer-executable program that, when executed by the processor, performs the above-described method.

In a fourth aspect of the invention, a computer-readable storage medium is provided, having stored thereon instructions, which, when executed by a processor, cause the processor to perform the above-mentioned method.

The invention provides a method, a device and a computer storage medium for mining high-utility continuous sequence patterns in a large-scale complex click stream database. First, the sequence patterns of length 1 are examined: the utility value and the utility upper bound of each pattern in the database are calculated, and the pattern is output if its utility value is higher than the threshold. If the utility upper bound is lower than the threshold, pruning is carried out, that is, expansion of the pattern stops; otherwise, item expansion and item set expansion are applied to the pattern, and the resulting patterns are examined in the same way. The process is carried out recursively until a pattern can no longer be expanded or meets the pruning condition. The mapping database comprises two structures: one stores the utility value of each item of each sequence in the click stream sequence database, and the other stores the utility value and position information of the current sequence pattern. The search space is reduced by exploiting the fact that, when the utility upper bound of a sequence pattern is smaller than the threshold, the utility values of its expanded patterns are necessarily smaller than the threshold as well. The utility upper bound calculation for item expansion and item set expansion supports this reduction of the search space. To respect the continuous constraint, only items located in the same item set as the last item of the sequence pattern are used for item expansion, and only items located in the item set after the last item of the sequence pattern are used for item set expansion. The beneficial effects are as follows: compared with existing high-utility continuous sequence pattern mining methods, the method, the device and the computer storage medium provided by the invention mine the required results more quickly, scale to larger complex click stream sequence data, meet the application requirements of current click stream log mining analysis, and have great practical value.

Drawings

FIG. 1 is a flow chart of a method for mining a high-utility continuous sequence pattern according to an embodiment of the present invention;

FIG. 2 is an Instance List of sequence pattern < { A } > in an embodiment of the present invention;

FIG. 3 is a schematic diagram of a search space tree structure according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of an apparatus for mining a high-utility continuous sequence pattern according to an embodiment of the present invention;

FIG. 5 shows an architecture of a computer device in an embodiment of the invention;

FIG. 6 is a graph comparing the runtime of FUCPM and HUCP-Miner in an embodiment of the present invention;

FIG. 7 is a graph illustrating the memory consumption comparison of FUCPM and HUCP-Miner according to an embodiment of the present invention;

FIG. 8 is a graph of FUCPM performance on artificial datasets of different sizes in an embodiment of the present invention.

Detailed Description

In order to further describe the technical scheme of the present invention in detail, the present embodiment is implemented on the premise of the technical scheme of the present invention, and detailed implementation modes and specific steps are given.

The embodiment of the invention provides a method (FUCPM for short), a device and a computer storage medium for mining high-utility continuous sequence patterns in a large-scale complex click stream database. FIG. 1 is a flowchart of the method for mining high-utility continuous sequence patterns according to an embodiment of the present invention.

In a specific embodiment, the data structure of the mapping database comprises two parts: the Sequence Information List (SIL) and the Instance List. The SIL stores the utility value of each item of each sequence in the click stream database; the Instance List stores the utility value and position information of the candidate sequence patterns formed during mining.

S2, generating initial candidate sequence patterns according to the mapping database, taking each initial candidate sequence pattern in turn as the current sequence pattern, and counting the utility value and the utility upper bound of the current sequence pattern according to the mapping database of the current sequence pattern;

In S2, generating the initial candidate sequence patterns according to the mapping database specifically means: each sequence pattern of length 1 is determined as an initial candidate sequence pattern.

In high-utility sequence pattern mining, the utility upper bound is an important concept. All high-utility sequence pattern mining algorithms use a specific utility upper bound to prune the search space and thereby improve efficiency. The utility upper bound of a sequence pattern must satisfy two conditions: first, the utility upper bound of the sequence pattern is not smaller than its utility value; second, the utility upper bound of the sequence pattern is not smaller than that of any of its expanded patterns. Therefore, when the utility upper bound of a sequence pattern is smaller than the threshold, the utility values of its expanded patterns are necessarily smaller than the threshold as well, so the expansion of the sequence pattern can be stopped and the search space is reduced.
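The pruning argument can be written as a short chain of inequalities. The symbols used here (UB for the utility upper bound, ξ for the threshold, t' for an expanded pattern of t) are notation assumed for illustration and do not appear in the original text.

```latex
% Notation assumed for illustration: u(t) is the utility value of pattern t,
% UB(t) its utility upper bound, t' any expanded pattern of t, \xi the threshold.
\begin{align*}
  u(t)   &\le UB(t), \\
  UB(t') &\le UB(t) \quad \text{for every expanded pattern } t' \text{ of } t, \\
  UB(t) < \xi \;&\Longrightarrow\; u(t') \le UB(t') \le UB(t) < \xi .
\end{align*}
```

The last line applies the first condition to t' and the second condition to the pair (t, t'), so no expanded pattern of t can reach the threshold.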

S3 further includes: deleting the current sequence pattern when the utility upper bound of the current sequence pattern is smaller than the threshold.

In S4, the candidate sequence pattern being expandable includes the candidate sequence pattern being expandable by item and by item set. In an embodiment, the Instance List is the current sequence pattern mapping database in S4.

The S4 continuous constraint condition includes: item expansion can only be done with items that are in the same set of items as the last item of the candidate sequence pattern; item set expansion can only be done with items in the item set that are located after the last item of the candidate sequence pattern.

In a specific embodiment, the utility upper bound, Item Extension Utility (IEU for short), is calculated as follows:

For item expansion (i.e., adding one item to the last item set of the sequence pattern), assume that the sequence pattern α is expanded by item i to obtain t; then the IEU value of t at position p in the click stream sequence s is:

In the above equation, u(α, p, s) is the utility value in s of the instance of α whose last item is located in the p-th item set of s, and ru(i, p, s) is the sum of the utility values of item i in the p-th item set of s and of all items after it in the remainder of s.

For item set expansion (i.e., appending an item set containing a single item to the tail of the sequence pattern), assume that the sequence pattern α is expanded by an item set containing item i to obtain t; then the IEU value of t at position p in the click stream sequence s is:

The IEU value of the sequence pattern t in a click stream sequence s is defined as:

The IEU value of the sequence pattern t in the click stream database is defined as:
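The formulas for the four definitions above are not reproduced in this text. One reading that is consistent with the stated meanings of u(α, p, s) and ru(i, p, s) is sketched below. The item expansion line follows directly from those meanings; the item set expansion line (with p taken as the item set holding i, so that α ends in item set p − 1), the maximum over positions, and the sum over sequences are assumptions modelled on common practice in utility mining, and D and ⊑ are notation introduced here.

```latex
% A reading consistent with the textual definitions; the second to fourth
% lines are assumptions. D is the click stream database, and t \sqsubseteq s
% means that t occurs in s as a continuous subsequence.
\begin{align*}
  \text{item expansion:}\quad     & IEU(t, p, s) = u(\alpha, p, s) + ru(i, p, s) \\
  \text{item set expansion:}\quad & IEU(t, p, s) = u(\alpha, p - 1, s) + ru(i, p, s) \\
  & IEU(t, s) = \max_{p} \, IEU(t, p, s) \\
  & IEU(t) = \sum_{s \in D,\; t \sqsubseteq s} IEU(t, s)
\end{align*}
```

Because ru(i, p, s) includes the utility of i itself, each IEU(t, p, s) under this reading equals the utility of the instance of t ending at i plus the utility of everything after i in s.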

The mining process of FUCPM is a recursive depth-first search. First, a sequence pattern of length 1, such as < { A } >, is examined: its utility value and utility upper bound in the database are calculated, and the pattern is output if its utility value is higher than the threshold. If the utility upper bound is lower than the threshold, pruning is carried out, that is, expansion of the pattern stops; otherwise, item expansion and item set expansion are applied to the pattern, and the resulting patterns are examined in the same way. The process is carried out recursively until a pattern can no longer be expanded or meets the pruning condition. After the search for one sequence pattern is completed, FUCPM continues to examine the subsequent sequence patterns in the same manner.

An example is given below using the click stream sequence database shown in Table 1.

TABLE 1 click stream database

SID	Clickstream
1	<{(A:2)(C:5)},{(B:4)}>
2	<{(C:3)},{(B:6)},{(A:1)}>
3	<{(B:2)(C:4)},{(A:4)}>

First, the SIL of the database is constructed, as shown in Table 2.

TABLE 2 SIL of clickstream database

SID	Content
1	<{(A,2,9)(C,5,4)},{(B,4,0)}>
2	<{(C,3,7)},{(B,6,1)},{(A,1,0)}>
3	<{(B,2,8)(C,4,4)},{(A,4,0)}>

Each triplet in the SIL records an item, its utility value and the remaining utility value. For example, in the first item set of sequence 1, (A,2,9) indicates that A has a utility value of 2 and that the sum of the utility values of all items after A in the sequence is 9.
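The construction of the SIL can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `build_sil` and the nested-list layout are assumptions, and the utility of C in sequence 1 is taken as 5, the value recorded in Table 2.

```python
# Each sequence is a list of item sets; each item set is a list of
# (item, utility) pairs in their stored order.
DB = {
    1: [[("A", 2), ("C", 5)], [("B", 4)]],
    2: [[("C", 3)], [("B", 6)], [("A", 1)]],
    3: [[("B", 2), ("C", 4)], [("A", 4)]],
}

def build_sil(db):
    """Return {SID: item sets of (item, utility, remaining utility)}."""
    sil = {}
    for sid, seq in db.items():
        # Start from the total utility of the sequence and subtract as we scan.
        remaining = sum(u for itemset in seq for _, u in itemset)
        rows = []
        for itemset in seq:
            row = []
            for item, u in itemset:
                remaining -= u  # utility of everything after this item
                row.append((item, u, remaining))
            rows.append(row)
        sil[sid] = rows
    return sil

sil = build_sil(DB)
print(sil[1])  # [[('A', 2, 9), ('C', 5, 4)], [('B', 4, 0)]]
```

A single pass per sequence is enough because the remaining utility of an item is the running total minus everything up to and including that item.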

Secondly, FUCPM constructs the Instance Lists of all sequence patterns of length 1. For example, the Instance List of < { A } > is shown in FIG. 2, where TID is the number of the item set containing the last item of the sequence pattern (for < { A } >, the item set containing item A), and Utility is the utility value of the sequence pattern.
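An Instance List for a length-1 pattern could be built as in the sketch below. The exact layout of FIG. 2 is not reproduced in this text, so the row format (SID, TID, Utility) and the function name `instance_list` are assumptions; item set numbers start at 1, and the utility of C in sequence 1 is taken as 5, as in Table 2.

```python
DB = {
    1: [[("A", 2), ("C", 5)], [("B", 4)]],
    2: [[("C", 3)], [("B", 6)], [("A", 1)]],
    3: [[("B", 2), ("C", 4)], [("A", 4)]],
}

def instance_list(db, item):
    """Rows (SID, TID, utility) for the length-1 pattern <{item}>."""
    rows = []
    for sid, seq in db.items():
        for tid, itemset in enumerate(seq, start=1):
            for name, u in itemset:
                if name == item:
                    rows.append((sid, tid, u))
    return rows

rows_a = instance_list(DB, "A")
print(rows_a)                      # [(1, 1, 2), (2, 3, 1), (3, 2, 4)]
print(sum(u for _, _, u in rows_a))  # 7, the utility of <{A}> in the database
```

Summing the rows gives the pattern utility here only because A occurs once per sequence; with several instances in one sequence, the highest-utility instance per sequence would be taken.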

Thirdly, the utility value and the utility upper bound of each sequence pattern of length 1 are examined, and item expansion and item set expansion are then performed. Because of the continuous constraint, item expansion can only be done with items that are in the same item set as the last item of the sequence pattern, and item set expansion can only be done with items that are in the item set after the last item of the sequence pattern. For example, for < { A } > the only item expansion candidate is C and the only item set expansion candidate is B.
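The candidate generation under the continuous constraint can be sketched for length-1 patterns as follows. Two assumptions are made: items inside an item set keep a fixed order and only items after the last item of the pattern qualify for item expansion, and "the item set after the last item" means the immediately following item set. The function name `extension_candidates` is hypothetical.

```python
DB = {
    1: [[("A", 2), ("C", 5)], [("B", 4)]],
    2: [[("C", 3)], [("B", 6)], [("A", 1)]],
    3: [[("B", 2), ("C", 4)], [("A", 4)]],
}

def extension_candidates(db, item):
    """Return (item expansion candidates, item set expansion candidates) of <{item}>."""
    item_ext, itemset_ext = set(), set()
    for seq in db.values():
        for tid, itemset in enumerate(seq):
            names = [name for name, _ in itemset]
            if item not in names:
                continue
            # Item expansion: later items in the same item set.
            item_ext.update(names[names.index(item) + 1:])
            # Item set expansion: items of the immediately following item set.
            if tid + 1 < len(seq):
                itemset_ext.update(name for name, _ in seq[tid + 1])
    return sorted(item_ext), sorted(itemset_ext)

print(extension_candidates(DB, "A"))  # (['C'], ['B'])
```

For < { A } > this reproduces the candidates named in the text: C for item expansion and B for item set expansion.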

As shown in FIG. 3, the search space is represented as a tree structure. Each node in the tree except the root node contains a tuple, whose content is the utility value and the utility upper bound of the corresponding sequence pattern. In this example, the minimum utility threshold is set to 16. No sequence pattern of length 1 has a utility value above the threshold, but their utility upper bounds are not below the threshold, so they need to be extended. The utility upper bounds of the sequence patterns corresponding to the nodes framed by boxes are below the threshold, so these nodes are pruned and not grown further. Finally, only < { C }, { B } > is found to be a high-utility continuous sequence pattern.
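The whole search on this example can be sketched in a few dozen lines. This is a simplified illustration under stated assumptions, not the FUCPM implementation: the upper bound of an instance is taken as its utility plus the remaining utility after its last item, patterns are tuples of item tuples, and the utility of C in sequence 1 is taken as 5, as in Table 2. All names (`mine`, `evaluate`, `search`) are hypothetical.

```python
DB = {
    1: [[("A", 2), ("C", 5)], [("B", 4)]],
    2: [[("C", 3)], [("B", 6)], [("A", 1)]],
    3: [[("B", 2), ("C", 4)], [("A", 4)]],
}

def mine(db, threshold):
    """Depth-first search for high-utility continuous sequence patterns."""
    # rem[sid][tid][idx]: utility of all items after position (tid, idx).
    rem = {}
    for sid, seq in db.items():
        left = sum(u for itemset in seq for _, u in itemset)
        rem[sid] = []
        for itemset in seq:
            row = []
            for _, u in itemset:
                left -= u
                row.append(left)
            rem[sid].append(row)
    results = {}

    def evaluate(instances):
        # Per sequence take the best instance, then sum over sequences.
        util, bound = {}, {}
        for sid, tid, idx, u in instances:
            util[sid] = max(util.get(sid, 0), u)
            bound[sid] = max(bound.get(sid, 0), u + rem[sid][tid][idx])
        return sum(util.values()), sum(bound.values())

    def search(pattern, instances):
        utility, upper = evaluate(instances)
        if utility >= threshold:
            results[pattern] = utility
        if upper < threshold:
            return  # prune: no expanded pattern can reach the threshold
        ext = {}
        for sid, tid, idx, u in instances:
            seq = db[sid]
            # Item expansion: later items of the same item set.
            for j in range(idx + 1, len(seq[tid])):
                item, w = seq[tid][j]
                new = pattern[:-1] + (pattern[-1] + (item,),)
                ext.setdefault(new, []).append((sid, tid, j, u + w))
            # Item set expansion: items of the next item set only.
            if tid + 1 < len(seq):
                for j, (item, w) in enumerate(seq[tid + 1]):
                    new = pattern + ((item,),)
                    ext.setdefault(new, []).append((sid, tid + 1, j, u + w))
        for new, inst in sorted(ext.items()):
            search(new, inst)

    singles = {}
    for sid, seq in db.items():
        for tid, itemset in enumerate(seq):
            for idx, (item, u) in enumerate(itemset):
                singles.setdefault(((item,),), []).append((sid, tid, idx, u))
    for pattern, inst in sorted(singles.items()):
        search(pattern, inst)
    return results

print(mine(DB, 16))  # {(('C',), ('B',)): 18}
```

With the threshold of 16 the sketch returns only < { C }, { B } >, with utility 9 + 9 = 18 from sequences 1 and 2, which agrees with the result stated for FIG. 3.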

Hereinafter, an apparatus 100 for mining high-utility continuous sequence patterns according to an embodiment of the present disclosure, corresponding to the method shown in FIG. 1, is described with reference to FIG. 4, which is a schematic structural diagram of the apparatus. Since the functions of the apparatus 100 are the same as the details of the method described above with reference to FIG. 1, a detailed description of them is omitted here for the sake of simplicity. As shown in FIG. 4, the apparatus 100 includes: a mapping database establishing module 101, configured to establish a mapping database of a click stream sequence database; a utility statistics module 102, configured to generate initial candidate sequence patterns according to the mapping database, take each initial candidate sequence pattern in turn as the current sequence pattern, and count the utility value and the utility upper bound of the current sequence pattern; a high-utility continuous sequence pattern judging module 103, configured to determine that the current sequence pattern is a high-utility continuous sequence pattern when the utility value of the current sequence pattern is greater than or equal to a threshold; a candidate sequence pattern decision module 104, configured to take the current sequence pattern as a candidate sequence pattern when the utility upper bound of the current sequence pattern is greater than or equal to the threshold; and a candidate sequence pattern expansion module 105, configured to expand the candidate sequence pattern under the continuous constraint condition to generate an expanded sequence pattern, take the expanded sequence pattern as the current sequence pattern, establish a current sequence pattern mapping database according to the current sequence pattern, and count the utility value and the utility upper bound of the current sequence pattern according to the current sequence pattern mapping database. The apparatus 100 may include other components in addition to these five modules; however, since these components are not related to the contents of the embodiments of the present disclosure, their illustration and description are omitted herein.

The mapping database in the mapping database establishment module 101 stores utility values of each item of each sequence in the click stream sequence database, and the current sequence pattern mapping database in the candidate sequence pattern extension module 105 stores utility values and location information of the current sequence pattern.

For the specific working process of the apparatus 100 for mining high-utility continuous sequence patterns, refer to the description of the method for mining high-utility continuous sequence patterns above; it is not repeated here.

Furthermore, an apparatus according to an embodiment of the invention may also be implemented by means of the architecture of the computing device shown in FIG. 5. As shown in FIG. 5, the computing device may include a computer system 201, a system bus 203, one or more CPUs 204, input/output components 202, memory 205, and the like. The memory 205 may store various data or files used in computer processing and/or communications, as well as program instructions executed by the CPU. The architecture shown in FIG. 5 is merely exemplary, and one or more of the components in FIG. 5 may be adjusted as needed to implement different devices.

Embodiments of the invention may also be implemented as a computer-readable storage medium. A computer-readable storage medium according to an embodiment has computer-readable instructions stored thereon. The computer readable instructions, when executed by a processor, may perform a method according to embodiments of the invention as described with reference to the above figures.

The method, device and computer storage medium embodiments for mining high-utility continuous sequence patterns in a large-scale complex click stream database are compared, in terms of both running speed and memory consumption, with HUCP-Miner, currently the best-performing high-utility continuous sequence pattern mining method. The comparison is performed on four real click stream data sets, MSNBC, Kosarak10K, FIFA and BMS; the first three are derived from the click stream logs of the news portals MSNBC, Kosarak and FIFA, and the last is derived from the click stream logs of a shopping website. In these four data sets, every item set of every click stream sequence contains only one item. In one embodiment, the minimum utility threshold parameter is a ratio, which is multiplied by the sum of the utility values of all sequences in the data set to obtain the specific minimum utility threshold. The running time and memory consumption of FUCPM and HUCP-Miner on the four data sets are shown in FIG. 6 and FIG. 7, respectively.
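The conversion from a threshold ratio to an absolute minimum utility threshold can be sketched as below. The function name `min_utility_threshold` is hypothetical, the data are the small example database used earlier in the description (with the utility of C in sequence 1 taken as 5, as in Table 2), and the ratio 0.5 is an arbitrary illustrative value, not one used in the reported experiments.

```python
DB = {
    1: [[("A", 2), ("C", 5)], [("B", 4)]],
    2: [[("C", 3)], [("B", 6)], [("A", 1)]],
    3: [[("B", 2), ("C", 4)], [("A", 4)]],
}

def min_utility_threshold(db, ratio):
    """Absolute threshold = ratio * total utility of all sequences."""
    total = sum(u for seq in db.values() for itemset in seq for _, u in itemset)
    return ratio * total

print(min_utility_threshold(DB, 0.5))  # 15.5, since the total utility is 31
```

Expressing the threshold as a ratio makes the same parameter value comparable across data sets whose total utilities differ by orders of magnitude.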

As shown in FIG. 6, FUCPM is approximately 50% faster than HUCP-Miner, and its speed advantage is most pronounced on the FIFA data set. As shown in FIG. 7, FUCPM consumes less memory than HUCP-Miner on the MSNBC data set, but slightly more memory than HUCP-Miner on the FIFA and BMS data sets.

This embodiment is also evaluated on artificial click stream data sets of different sizes, in which an item set of a click stream sequence may contain multiple items, to simulate scenarios where a user browses multiple web pages simultaneously. Since existing high-utility continuous sequence pattern mining algorithms cannot process complex click stream sequences in which a single item set contains multiple items, only the efficiency of the FUCPM method is evaluated here. As can be seen from FIG. 8, FUCPM mines high-utility continuous sequence patterns efficiently and quickly from complex click stream databases of varying sizes.

In summary, the method, the device and the computer storage medium provided by the embodiments for mining high-utility continuous sequence patterns in a large-scale complex click stream database mine the required results more quickly, scale to larger complex click stream sequence data, meet the application requirements of current click stream log mining analysis, and have great practical value.

In this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process or method.

The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims

1. A method of mining a high utility continuous sequence pattern, the method comprising:

2. The method according to claim 1, wherein the generating of the initial candidate sequence pattern according to the mapping database in S2 specifically includes: a sequence pattern of length 1 is determined as an initial candidate sequence pattern.

3. The method of claim 1, wherein in S4 the candidate sequence pattern being expandable comprises the candidate sequence pattern being expandable by item and by item set.

4. The method of claim 1, wherein the mapping database stores utility values for each item of sequences in the click stream sequence database, and wherein the current sequence pattern mapping database stores utility values and location information for the current sequence pattern.

5. The method according to claim 3, wherein the S4 continuous constraint condition comprises: item expansion can only be done with items that are in the same set of items as the last item of the candidate sequence pattern; item set expansion can only be done with items in the item set that are located after the last item of the candidate sequence pattern.

6. The method according to claim 1, wherein S3 further comprises: deleting the current sequence pattern when the utility upper bound of the current sequence pattern is smaller than the threshold.

7. An apparatus for mining a high utility continuous sequence pattern, the apparatus comprising:

the mapping database establishing module is used for establishing a mapping database of the click stream sequence database;

8. The apparatus of claim 7, wherein the mapping database in the mapping database building module stores utility values of each item of each sequence in the click stream sequence database, and wherein the current sequence pattern mapping database in the candidate sequence pattern expanding module stores utility values and location information of the current sequence pattern.

9. An apparatus for mining a high utility continuous sequence pattern, comprising:

a processor; and a memory, wherein the memory has stored therein a computer-executable program that, when executed by the processor, performs the method of any of claims 1-6.

10. A computer-readable storage medium having stored thereon instructions that, when executed by a processor, cause the processor to perform the method of any one of claims 1-6.