CN113407543A - Method, device and computer storage medium for mining high-utility continuous sequence mode - Google Patents


Info

Publication number
CN113407543A
Authority
CN
China
Prior art keywords
sequence mode
utility
sequence
mode
current
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110727658.2A
Other languages
Chinese (zh)
Inventor
张春慨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Shenzhen Graduate School Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Graduate School Harbin Institute of Technology
Priority to CN202110727658.2A
Publication of CN113407543A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/221 Column-oriented storage; Management thereof
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases

Abstract

The invention discloses a method, a device and a computer storage medium for mining high-utility continuous sequence patterns. The method comprises the following steps: establishing a mapping database; generating initial candidate sequence patterns according to the mapping database, taking each initial candidate sequence pattern in turn as the current sequence pattern, and computing the utility value and the utility upper bound of the current sequence pattern; when the utility value is greater than or equal to a threshold, determining that the current sequence pattern is a high-utility continuous sequence pattern; when the utility upper bound is greater than or equal to the threshold, taking the current sequence pattern as a candidate sequence pattern; under the continuity constraint, if the candidate sequence pattern can be extended, generating an extended sequence pattern from the candidate sequence pattern, taking the extended sequence pattern as the current sequence pattern, and computing the utility value and the utility upper bound of the current sequence pattern according to the mapping database of the current sequence pattern. The method and the device can meet the application requirements of click stream log mining and analysis.

Description

Method, device and computer storage medium for mining high-utility continuous sequence patterns
Technical Field
The present application relates to the field of data mining technologies, and in particular, to a method and an apparatus for mining a high-utility continuous sequence pattern, and a computer storage medium.
Background
In the internet era, large numbers of users visit websites every day and generate large volumes of click stream logs. A click stream records information such as each user's web browsing path and the time spent on each page, and can be represented simply as a sequence. For example, the click stream sequence <(A:1), (C:3), (D:4), (F:1)> shows that the user browsed the four web pages A, C, D and F in order, for 1, 3, 4 and 1 time units respectively. A sequence of this type is also called a sequence with utility values, where the utility value here is the browsing time. By mining sequence patterns whose total utility value in the click stream log is high, the following information can be obtained: which web page a user usually visits next after browsing a given page, i.e. which pages are highly correlated; and which pages users browse for longer, i.e. which pages interest users most. Using this information, a website service provider can, on the technical side, improve the topology of the website and provide quick access paths between highly correlated pages to improve access efficiency; on the business side, it can place advertisements on popular pages to increase their exposure, and recommend page content according to user interest to improve the user experience. In short, high-utility sequence pattern mining extracts the user behavior patterns contained in a website's click stream log, and this information is of great value to the website service provider.
Current high-utility sequence pattern mining algorithms can mine all sequence patterns in a click stream database whose utility value is above a predetermined threshold, i.e. the high-utility sequence patterns. However, not all high-utility sequence patterns are meaningful in the specific application scenario of click stream analysis. For example, given the two click stream sequences <(B:10), (A:3), (C:1), (H:2), (D:1), (G:1), (F:5)> and <(B:9), (C:2), (F:6)> and a minimum utility threshold of 25, the pattern <B, F> has a utility value of 30 (10 + 5 in the first sequence and 9 + 6 in the second), which is above the threshold. If this pattern is returned to the website service provider, the provider may mistakenly assume that a user who has finished browsing page B will very likely browse page F immediately afterwards. As the two click stream sequences show, this is not the case, especially in the first sequence, where B and F are separated by five other pages. To solve this problem, researchers proposed high-utility continuous sequence pattern mining, which adds a continuity constraint to the high-utility sequence pattern mining problem: each mined sequence pattern must be a continuous subsequence of at least one sequence in the database. Compared with conventional high-utility sequence patterns, high-utility continuous sequence patterns better reflect the consecutive access preferences of web users.
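The arithmetic in this example can be checked with a short sketch. The function names `occurrence_utilities` and `pattern_utility` are illustrative and not part of the application; the utility of a pattern in one sequence is assumed to be the largest utility over its occurrences, and sequences are assumed to hold one page per time point.

```python
from itertools import combinations

# The two click streams from the example, as (page, browsing time) pairs.
DB = [
    [("B", 10), ("A", 3), ("C", 1), ("H", 2), ("D", 1), ("G", 1), ("F", 5)],
    [("B", 9), ("C", 2), ("F", 6)],
]

def occurrence_utilities(pattern, seq, contiguous):
    """Yield the utility of every occurrence of `pattern` in `seq`."""
    pages = [page for page, _ in seq]
    n, m = len(seq), len(pattern)
    if contiguous:
        # The pattern must match a run of adjacent pages.
        for start in range(n - m + 1):
            if pages[start:start + m] == list(pattern):
                yield sum(u for _, u in seq[start:start + m])
    else:
        # Any order-preserving choice of positions counts (gaps allowed).
        for idx in combinations(range(n), m):
            if all(pages[i] == p for i, p in zip(idx, pattern)):
                yield sum(seq[i][1] for i in idx)

def pattern_utility(pattern, db, contiguous):
    """Sum over sequences of the best occurrence utility (0 if absent)."""
    return sum(
        max(occurrence_utilities(pattern, seq, contiguous), default=0)
        for seq in db
    )

gapped = pattern_utility(("B", "F"), DB, contiguous=False)
adjacent = pattern_utility(("B", "F"), DB, contiguous=True)
print(gapped, adjacent)
```

Without the continuity constraint <B, F> reaches 30; with it, the pattern has no occurrence in either sequence, which is the distinction the paragraph draws.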
Existing high-utility continuous sequence pattern mining algorithms can only process sequences in which exactly one event occurs at each time point; for click streams, this means they assume that a user browses only one web page at any moment. In practice, however, users often browse several web pages at the same time, and the resulting click stream sequences are more complex. For example, a user shopping online may open pages of several e-commerce platforms at once to compare prices, or may open a music platform to listen to music while browsing a news portal. In such scenarios a click stream sequence has the form <{(B:3) (D:4)}, {(C:1)}, {(H:2) (E:1)}>, which indicates that the user browsed web pages B and D simultaneously, then C, and finally H and E simultaneously. In this form, the contents of one pair of curly braces "{ }" constitute an item set, and each element of an item set is called an item; for example, (B:3) is an item in the first item set of the sequence. Existing high-utility sequence pattern mining algorithms can handle such complex click stream sequences, but they do not take the continuity constraint into account. Existing high-utility continuous sequence pattern mining algorithms take the continuity constraint into account, but cannot process complex click streams. It is therefore desirable to design a method and apparatus for mining high-utility continuous sequence patterns from complex click streams.
In addition, with the rapid development of the internet, the number of internet users keeps growing and the volume of click stream data generated is becoming huge. Existing high-utility continuous sequence pattern mining algorithms perform well on small databases but are slow on large ones, and can hardly meet the requirements of large-scale data mining. How to improve algorithm performance so that useful information can be mined quickly from large databases is an urgent problem.
Disclosure of Invention
In view of the above problems, the invention provides a method (FUCPM for short), a device and a computer storage medium for mining high-utility continuous sequence patterns in a large-scale complex click stream database.
In a first aspect of the present invention, a method for mining a high-utility continuous sequence pattern in a large-scale complex clickstream database is provided, which includes:
S1, establishing a mapping database for each sequence in the click stream sequence database;
S2, generating initial candidate sequence patterns according to the mapping database, taking each initial candidate sequence pattern in turn as the current sequence pattern, and computing the utility value and the utility upper bound of the current sequence pattern;
S3, when the utility value of the current sequence pattern is greater than or equal to a threshold, determining that the current sequence pattern is a high-utility continuous sequence pattern; when the utility upper bound of the current sequence pattern is greater than or equal to the threshold, taking the current sequence pattern as a candidate sequence pattern;
S4, under the continuity constraint, if the candidate sequence pattern can be extended, generating an extended sequence pattern from the candidate sequence pattern, taking the extended sequence pattern as the current sequence pattern, building a current sequence pattern mapping database for the current sequence pattern, computing the utility value and the utility upper bound of the current sequence pattern according to the current sequence pattern mapping database, and returning to S3; if the candidate sequence pattern cannot be extended any further, ending the loop.
Further, in S2, generating the initial candidate sequence patterns according to the mapping database specifically includes: taking the sequence patterns of length 1 as the initial candidate sequence patterns.
Further, in S4, extending the candidate sequence pattern includes item extension and item set extension.
Further, the mapping database stores the utility value of each item of each sequence in the click stream sequence database, and the current sequence pattern mapping database stores the utility values and position information of the current sequence pattern.
Further, the continuity constraint in S4 includes: item extension may only use items that are in the same item set as the last item of the candidate sequence pattern; item set extension may only use items in the item set immediately after the one containing the last item of the candidate sequence pattern.
Further, S3 also includes: discarding the current sequence pattern when the utility upper bound of the current sequence pattern is smaller than the threshold.
In a second aspect of the present invention, there is provided an apparatus for mining high-utility continuous sequence patterns, comprising:
a mapping database establishing module, configured to establish a mapping database of the click stream sequence database;
a utility statistics module, configured to generate initial candidate sequence patterns according to the mapping database, take each initial candidate sequence pattern in turn as the current sequence pattern, and compute the utility value and the utility upper bound of the current sequence pattern;
a high-utility continuous sequence pattern determining module, configured to determine that the current sequence pattern is a high-utility continuous sequence pattern when the utility value of the current sequence pattern is greater than or equal to a threshold;
a candidate sequence pattern decision module, configured to take the current sequence pattern as a candidate sequence pattern when the utility upper bound of the current sequence pattern is greater than or equal to the threshold;
and a candidate sequence pattern extension module, configured to extend the candidate sequence pattern under the continuity constraint to generate an extended sequence pattern, take the extended sequence pattern as the current sequence pattern, build a current sequence pattern mapping database for the current sequence pattern, and compute the utility value and the utility upper bound of the current sequence pattern according to the current sequence pattern mapping database.
Further, the mapping database in the mapping database establishing module stores the utility value of each item of each sequence in the click stream sequence database, and the current sequence pattern mapping database in the candidate sequence pattern extension module stores the utility values and position information of the current sequence pattern.
In a third aspect of the present invention, there is provided an apparatus for mining a high-utility continuous sequence pattern, comprising: a processor; and a memory, wherein the memory has stored therein a computer-executable program that, when executed by the processor, performs the above-described method.
In a fourth aspect of the invention, a computer-readable storage medium is provided, having stored thereon instructions, which, when executed by a processor, cause the processor to perform the above-mentioned method.
The invention provides a method, a device and a computer storage medium for mining high-utility continuous sequence patterns in a large-scale complex click stream database. First, each sequence pattern of length 1 is examined: its utility value and utility upper bound in the database are computed, and if the utility value is not below the threshold the sequence pattern is output. If the utility upper bound is below the threshold, the sequence pattern is pruned, i.e. it is not extended further; otherwise item extension and item set extension are applied to it, and the resulting sequence patterns are examined in the same way. The process is carried out recursively until a sequence pattern can no longer be extended or meets the pruning condition. The mapping database comprises two structures: one stores the utility value of each item of each sequence in the click stream sequence database, and the other stores the utility values and position information of the current sequence pattern. The search space is reduced by using the fact that when the utility upper bound of a sequence pattern is smaller than the threshold, the utility value of every sequence pattern extended from it is necessarily smaller than the threshold. The utility upper bound calculation for item extension and item set extension supports this reduction of the search space. To respect the continuity constraint, item extension uses only items in the same item set as the last item of the sequence pattern, and item set extension uses only items in the item set immediately after the one containing the last item of the sequence pattern. The beneficial effects are as follows: compared with existing high-utility continuous sequence pattern mining, the method, the device and the computer storage medium provided by the invention mine the required results more quickly, scale to larger complex click stream sequence data, meet the application requirements of current click stream log mining and analysis, and have great practical value.
Drawings
FIG. 1 is a flow chart of a method for mining a high-utility continuous sequence pattern according to an embodiment of the present invention;
FIG. 2 is an Instance List of sequence pattern < { A } > in an embodiment of the present invention;
FIG. 3 is a schematic diagram of a search space tree structure according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an apparatus for mining a high-utility continuous sequence pattern according to an embodiment of the present invention;
FIG. 5 shows an architecture of a computer device in an embodiment of the invention;
FIG. 6 is a graph comparing the runtime of FUCPM and HUCP-Miner in an embodiment of the present invention;
FIG. 7 is a graph comparing the memory consumption of FUCPM and HUCP-Miner in an embodiment of the present invention;
FIG. 8 is a graph of FUCPM performance on artificial data sets of different sizes in an embodiment of the present invention.
Detailed Description
To describe the technical solution of the present invention in further detail, this embodiment is implemented on the basis of that technical solution, and detailed implementations and specific steps are given.
The embodiment of the invention concerns a method (FUCPM for short), a device and a computer storage medium for mining high-utility continuous sequence patterns in a large-scale complex click stream database. FIG. 1 is a flowchart of the method for mining high-utility continuous sequence patterns according to an embodiment of the present invention:
S1, establishing a mapping database for each sequence in the click stream sequence database;
In a specific embodiment, the data structure of the mapping database comprises two parts: the Sequence Information List (SIL) and the Instance List. The SIL stores the utility value of each item of each sequence in the click stream database; the Instance List stores the utility values and position information of the candidate sequence patterns formed during mining.
S2, generating initial candidate sequence patterns according to the mapping database, taking each initial candidate sequence pattern in turn as the current sequence pattern, and computing the utility value and the utility upper bound of the current sequence pattern according to the mapping database of the current sequence pattern;
In S2, generating the initial candidate sequence patterns according to the mapping database specifically means: taking the sequence patterns of length 1 as the initial candidate sequence patterns.
In high-utility sequence pattern mining, the utility upper bound is an important concept. High-utility sequence pattern mining algorithms use a specific utility upper bound to prune the search space and thereby improve efficiency. The utility upper bound of a sequence pattern must satisfy two conditions: first, the utility upper bound of the sequence pattern is not smaller than its utility value; second, the utility upper bound of the sequence pattern is not smaller than the utility upper bound of any sequence pattern extended from it. Therefore, when the utility upper bound of a sequence pattern is smaller than the threshold, the utility value of every sequence pattern extended from it is necessarily also smaller than the threshold, so extension of that sequence pattern can stop and the search space is reduced.
S3, when the utility value of the current sequence pattern is greater than or equal to a threshold, determining that the current sequence pattern is a high-utility continuous sequence pattern; when the utility upper bound of the current sequence pattern is greater than or equal to the threshold, taking the current sequence pattern as a candidate sequence pattern;
S3 further includes: discarding the current sequence pattern when the utility upper bound of the current sequence pattern is smaller than the threshold.
S4, under the continuity constraint, if the candidate sequence pattern can be extended, generating an extended sequence pattern from the candidate sequence pattern, taking the extended sequence pattern as the current sequence pattern, building a current sequence pattern mapping database for the current sequence pattern, computing the utility value and the utility upper bound of the current sequence pattern according to the current sequence pattern mapping database, and returning to S3; if the candidate sequence pattern cannot be extended any further, ending the loop.
In S4, extending the candidate sequence pattern includes item extension and item set extension. In an embodiment, the Instance List is the current sequence pattern mapping database of S4.
The continuity constraint in S4 includes: item extension may only use items that are in the same item set as the last item of the candidate sequence pattern; item set extension may only use items in the item set immediately after the one containing the last item of the candidate sequence pattern.
In a specific embodiment, the utility upper bound, called Item Extension Utility (IEU for short), is calculated as follows.
For item extension (i.e. adding one item to the last item set of the sequence pattern), assume that the sequence pattern α is extended with item i to obtain t. The IEU value of t at position p in the click stream sequence s is:
Figure BDA0003138113050000051
In the above equation, u(α, p, s) is the utility value in s of the instance of α whose last item lies in the p-th item set of s, and ru(i, p, s) is the sum of the utility values of item i in the p-th item set of s and of all items that follow it in the rest of s.
For item set extension (i.e. appending an item set containing a single item to the end of the sequence pattern), assume that the sequence pattern α is extended with item i to obtain t. The IEU value of t at position p in the click stream sequence s is:
Figure BDA0003138113050000061
The IEU value of the sequence pattern t in a click stream sequence s is defined as:
Figure BDA0003138113050000062
The IEU value of the sequence pattern t in the click stream database is defined as:
Figure BDA0003138113050000063
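The four formulas are available here only as figure references. One reading that is consistent with the symbol definitions in the text is sketched below; it is an assumption, not the notation of the application. It assumes that in the item set extension case p denotes the item set holding i, so that α ends in item set p − 1; that the per-sequence value takes the maximum over positions; and that D denotes the click stream database.

```latex
\begin{align}
\mathrm{IEU}(t, p, s) &= u(\alpha, p, s) + ru(i, p, s)
  && \text{(item extension, assumed form)} \\
\mathrm{IEU}(t, p, s) &= u(\alpha, p - 1, s) + ru(i, p, s)
  && \text{(item set extension, assumed form)} \\
\mathrm{IEU}(t, s) &= \max_{p} \, \mathrm{IEU}(t, p, s) \\
\mathrm{IEU}(t) &= \sum_{s \in D} \mathrm{IEU}(t, s)
\end{align}
```

Under this reading, because ru(i, p, s) includes the utility of i itself, IEU(t, p, s) equals the utility of the extended instance plus the utility of everything that follows it in s, which is what makes it usable as an upper bound for further extensions.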
The mining process of FUCPM is a recursive depth-first search. First, a sequence pattern of length 1, such as <{A}>, is examined: its utility value and utility upper bound in the database are computed, and the pattern is output if its utility value is not below the threshold. If the utility upper bound is below the threshold, the pattern is pruned, i.e. it is not extended further; otherwise item extension and item set extension are applied to it, and the resulting sequence patterns are examined in the same way. The process is carried out recursively until a sequence pattern can no longer be extended or meets the pruning condition. After the search for one sequence pattern is completed, FUCPM continues with the subsequent sequence patterns in the same manner.
The following example uses the click stream sequence database shown in Table 1.
TABLE 1 click stream database
SID Clickstream
1 <{(A:2)(C:5)},{(B:4)}>
2 <{(C:3)},{(B:6)},{(A:1)}>
3 <{(B:2)(C:4)},{(A:4)}>
First, the SIL of the database is constructed, as shown in Table 2.
TABLE 2 SIL of clickstream database
SID Content
1 <{(A,2,9)(C,5,4)},{(B,4,0)}>
2 <{(C,3,7)},{(B,6,1)},{(A,1,0)}>
3 <{(B,2,8)(C,4,4)},{(A,4,0)}>
Each triplet in the SIL records an item, its utility value and its remaining utility value. For example, in the first item set of sequence 1, (A,2,9) indicates that A has a utility value of 2 and that the sum of the utility values of all items in the rest of the sequence is 9.
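The SIL construction can be sketched as a single pass per sequence. This is a minimal illustration, with the item utilities taken from the SIL in Table 2; the function name `build_sil` and the nested-list layout are assumptions made for the example.

```python
# Click streams as lists of item sets; each item is (page, utility).
# Utilities are those recorded in the SIL (Table 2).
DB = [
    [[("A", 2), ("C", 5)], [("B", 4)]],
    [[("C", 3)], [("B", 6)], [("A", 1)]],
    [[("B", 2), ("C", 4)], [("A", 4)]],
]

def build_sil(db):
    """Return, per sequence, item sets of (item, utility, remaining utility)."""
    sil = []
    for seq in db:
        # Start from the total utility and subtract as items are consumed.
        remaining = sum(u for itemset in seq for _, u in itemset)
        row = []
        for itemset in seq:
            entries = []
            for item, u in itemset:
                remaining -= u  # utility of everything after this item
                entries.append((item, u, remaining))
            row.append(entries)
        sil.append(row)
    return sil

SIL = build_sil(DB)
for sid, row in enumerate(SIL, start=1):
    print(sid, row)
```

The remaining utility of an item here excludes the item itself, which matches the triplets (A,2,9), (C,5,4) and (B,4,0) of sequence 1.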
Secondly, FUCPM constructs the Instance Lists of all sequence patterns of length 1. For example, the Instance List of <{A}> is shown in FIG. 2, where TID is the number of the item set containing the last item of the sequence pattern (for <{A}>, the number of the item set containing item A) and Utility is the utility value of the sequence pattern.
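A possible in-memory form of these Instance Lists is sketched below. The (SID, TID, Utility) tuple layout and the function name `length1_instance_lists` are assumptions for illustration, since FIG. 2 is not reproduced in the text; numbering is 1-based to match the tables, and item utilities are those of Table 2.

```python
from collections import defaultdict

DB = [
    [[("A", 2), ("C", 5)], [("B", 4)]],
    [[("C", 3)], [("B", 6)], [("A", 1)]],
    [[("B", 2), ("C", 4)], [("A", 4)]],
]

def length1_instance_lists(db):
    """Map each item to its instances as (SID, TID, utility) tuples."""
    lists = defaultdict(list)
    for sid, seq in enumerate(db, start=1):
        for tid, itemset in enumerate(seq, start=1):
            for item, u in itemset:
                lists[item].append((sid, tid, u))
    return dict(lists)

INSTANCES = length1_instance_lists(DB)
print(INSTANCES["A"])
```

For <{A}> this yields one instance per sequence: item set 1 of sequence 1, item set 3 of sequence 2 and item set 2 of sequence 3.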
Thirdly, the utility value and utility upper bound of each sequence pattern of length 1 are examined, and item extension and item set extension are then performed. Because of the continuity constraint, item extension may only use items that are in the same item set as the last item of the sequence pattern, and item set extension may only use items in the item set immediately after the one containing the last item of the sequence pattern. For example, for <{A}> the only item extension candidate is C and the only item set extension candidate is B.
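The candidate rule can be sketched as follows. The function name `extension_candidates` is illustrative, and the sketch assumes that items inside an item set are kept in alphabetical order and that item extension only considers items that come after the pattern's last item in that order.

```python
DB = [
    [[("A", 2), ("C", 5)], [("B", 4)]],
    [[("C", 3)], [("B", 6)], [("A", 1)]],
    [[("B", 2), ("C", 4)], [("A", 4)]],
]

def extension_candidates(db, instances, last_item):
    """Collect extension candidates for a pattern ending in `last_item`.

    `instances` holds (SID, TID) pairs, 1-based, locating the item set
    that contains the pattern's last item in each matching sequence.
    """
    item_ext, itemset_ext = set(), set()
    for sid, tid in instances:
        seq = db[sid - 1]
        current = seq[tid - 1]
        # Item extension: later items of the same item set.
        item_ext.update(item for item, _ in current if item > last_item)
        # Item set extension: items of the directly following item set.
        if tid < len(seq):
            itemset_ext.update(item for item, _ in seq[tid])
    return sorted(item_ext), sorted(itemset_ext)

# Instances of <{A}>: item set 1 of S1, item set 3 of S2, item set 2 of S3.
item_candidates, itemset_candidates = extension_candidates(
    DB, [(1, 1), (2, 3), (3, 2)], "A"
)
print(item_candidates, itemset_candidates)
```

Only sequence 1 contributes candidates for <{A}>, because A sits in the last item set of sequences 2 and 3.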
As shown in FIG. 3, the search space is represented as a tree. Each node except the root contains a tuple holding the utility value and the utility upper bound of the corresponding sequence pattern. In this example, the minimum utility threshold is set to 16. No sequence pattern of length 1 has a utility value that reaches the threshold, but their utility upper bounds are not below the threshold, so they need to be extended. For the nodes marked with a box, the utility upper bound of the corresponding sequence pattern is below the threshold, so these nodes are pruned and not extended further. In the end, only <{C}, {B}> is a high-utility continuous sequence pattern.
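The whole search on this example can be sketched in a few dozen lines. This is a simplified illustration under stated assumptions, not the FUCPM implementation: the upper bound used is "instance utility plus remaining utility" (maximum per sequence, summed over sequences), item utilities are those of Table 2, indices are 0-based, and all function names are hypothetical.

```python
from collections import defaultdict

DB = [
    [[("A", 2), ("C", 5)], [("B", 4)]],
    [[("C", 3)], [("B", 6)], [("A", 1)]],
    [[("B", 2), ("C", 4)], [("A", 4)]],
]

def remaining_utilities(db):
    """rem[sid][tid][pos] = utility of everything after that item."""
    rem = []
    for seq in db:
        left = sum(u for itemset in seq for _, u in itemset)
        row = []
        for itemset in seq:
            cells = []
            for _, u in itemset:
                left -= u
                cells.append(left)
            row.append(cells)
        rem.append(row)
    return rem

def evaluate(instances, rem):
    """Return (utility, assumed upper bound) of a pattern."""
    util, bound = defaultdict(int), defaultdict(int)
    for (sid, tid), (u, pos) in instances.items():
        util[sid] = max(util[sid], u)
        bound[sid] = max(bound[sid], u + rem[sid][tid][pos])
    return sum(util.values()), sum(bound.values())

def extend(db, pattern, instances):
    """Generate child patterns by item and item set extension."""
    children = defaultdict(dict)
    for (sid, tid), (u, pos) in instances.items():
        seq = db[sid]
        # Item extension: later items in the same item set.
        for j in range(pos + 1, len(seq[tid])):
            item, iu = seq[tid][j]
            child = pattern[:-1] + (pattern[-1] + (item,),)
            children[child][(sid, tid)] = (u + iu, j)
        # Item set extension: items of the directly following item set.
        if tid + 1 < len(seq):
            for j, (item, iu) in enumerate(seq[tid + 1]):
                child = pattern + ((item,),)
                children[child][(sid, tid + 1)] = (u + iu, j)
    return children

def mine(db, threshold):
    rem = remaining_utilities(db)
    found = {}

    def search(pattern, instances):
        util, bound = evaluate(instances, rem)
        if util >= threshold:
            found[pattern] = util
        if bound < threshold:
            return  # prune: no extension can reach the threshold
        for child, child_instances in extend(db, pattern, instances).items():
            search(child, child_instances)

    roots = defaultdict(dict)
    for sid, seq in enumerate(db):
        for tid, itemset in enumerate(seq):
            for pos, (item, u) in enumerate(itemset):
                roots[((item,),)][(sid, tid)] = (u, pos)
    for pattern, instances in roots.items():
        search(pattern, instances)
    return found

RESULT = mine(DB, threshold=16)
print(RESULT)
```

With threshold 16 the sketch reports only <{C}, {B}>, with utility 18 (9 from sequence 1 and 9 from sequence 2), in line with the conclusion above. Keying instances by (sequence, item set) works because, under the continuity constraint, a pattern has at most one instance ending in a given item set.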
An apparatus 100 for mining high-utility continuous sequence patterns, corresponding to the method shown in FIG. 1, is described below with reference to FIG. 4, which is a schematic structural diagram of the apparatus. Since the functions of the apparatus 100 correspond to the details of the method described above with reference to FIG. 1, a detailed description is omitted here for brevity. As shown in FIG. 4, the apparatus 100 includes: a mapping database establishing module 101, configured to establish a mapping database of the click stream sequence database; a utility statistics module 102, configured to generate initial candidate sequence patterns according to the mapping database, take each initial candidate sequence pattern in turn as the current sequence pattern, and compute the utility value and the utility upper bound of the current sequence pattern; a high-utility continuous sequence pattern determining module 103, configured to determine that the current sequence pattern is a high-utility continuous sequence pattern when the utility value of the current sequence pattern is greater than or equal to a threshold; a candidate sequence pattern decision module 104, configured to take the current sequence pattern as a candidate sequence pattern when the utility upper bound of the current sequence pattern is greater than or equal to the threshold; and a candidate sequence pattern extension module 105, configured to extend the candidate sequence pattern under the continuity constraint to generate an extended sequence pattern, take the extended sequence pattern as the current sequence pattern, build a current sequence pattern mapping database for the current sequence pattern, and compute the utility value and the utility upper bound of the current sequence pattern according to the current sequence pattern mapping database. The apparatus 100 may include other components in addition to these five modules; since they are not related to the content of the embodiments of the present disclosure, their illustration and description are omitted here.
The mapping database in the mapping database establishing module 101 stores the utility value of each item of each sequence in the click stream sequence database, and the current sequence pattern mapping database in the candidate sequence pattern extension module 105 stores the utility values and position information of the current sequence pattern.
For the specific working process of the apparatus 100, refer to the description of the method for mining high-utility continuous sequence patterns above; it is not repeated here.
Furthermore, an apparatus according to an embodiment of the invention may also be implemented by means of the architecture of the computing device shown in FIG. 5. As shown in FIG. 5, the computing device may include a computer system 201, a system bus 203, one or more CPUs 204, input/output components 202, memory 205, and the like. The memory 205 may store various data or files used in computer processing and/or communications, as well as program instructions executed by the CPU. The architecture shown in FIG. 5 is merely exemplary, and one or more of the components in FIG. 5 may be adjusted as needed to implement different devices.
Embodiments of the invention may also be implemented as a computer-readable storage medium. A computer-readable storage medium according to an embodiment has computer-readable instructions stored thereon. The computer readable instructions, when executed by a processor, may perform a method according to embodiments of the invention as described with reference to the above figures.
The embodiment of the invention is directed to the method embodiment, the device embodiment and the computer storage medium embodiment capable of mining the high-utility continuous sequence mode in the large-scale complex clickstream database, and the results of the three embodiments are compared with the performance of the currently optimal high-utility continuous sequence mode mining method HUCP-Miner implementation result in terms of both the running speed and the memory consumption, and the embodiment is performed on four real clickstream data sets MSNBC, Kosarak10K, FIFA and BMS, wherein the first three data sets are derived from clickstream logs of news portals MSNBC, Kosarak and FIFA, and the last data set is derived from clickstream logs of a shopping website. Of these four data sets, the set of items for each click stream sequence contains only one item. In one embodiment, the minimum utility threshold parameter is a ratio that is multiplied by the sum of the utility values of all sequences in the data set to obtain a specific minimum utility threshold. The runtime and memory consumption of the FUCPM algorithm and the HUCP-Miner on the four datasets according to the embodiment of the present invention are shown in fig. 6 and fig. 7, respectively.
As can be seen from FIG. 6, FUCPM is approximately 50% faster than HUCP-Miner, and its speed advantage is most pronounced on the FIFA data set. As can be seen from FIG. 7, FUCPM consumes less memory than HUCP-Miner on the MSNBC data set, but slightly more on the FIFA and BMS data sets.
This embodiment is evaluated on artificial clickstream data sets of different sizes, in which each item set of a clickstream sequence may contain multiple items, to simulate a user browsing multiple web pages simultaneously. Since the existing high-utility continuous sequence mode mining algorithms cannot process complex clickstream sequences in which a single item set contains multiple items, this embodiment only evaluates the efficiency of the FUCPM method. As can be seen from FIG. 8, FUCPM mines high-utility continuous sequence modes efficiently and quickly from complex clickstream databases of varying sizes.
In summary, the method, apparatus and computer storage medium provided by the embodiments for mining high-utility continuous sequence modes in large-scale complex clickstream databases can mine the required results more quickly, can scale to larger complex clickstream sequence data, meet the application requirements of current clickstream log mining and analysis well, and have great practical value.
In this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process or method.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims (10)

1. A method of mining a high utility continuous sequence pattern, the method comprising:
S1, establishing a mapping database of each sequence in the click stream sequence database;
S2, generating an initial candidate sequence mode according to the mapping database, sequentially taking the initial candidate sequence mode as a current sequence mode, and counting the utility value and the utility upper bound of the current sequence mode;
S3, when the utility value of the current sequence mode is greater than or equal to a threshold value, determining that the current sequence mode is a high-utility continuous sequence mode; when the utility upper bound of the current sequence mode is greater than or equal to a threshold value, taking the current sequence mode as a candidate sequence mode;
and S4, under the continuous constraint condition, if the candidate sequence mode can be expanded, generating an expanded sequence mode from the candidate sequence mode, taking the expanded sequence mode as the current sequence mode, establishing a current sequence mode mapping database according to the current sequence mode, counting the utility value and the utility upper bound of the current sequence mode according to the current sequence mode mapping database, and returning to S3; if the candidate sequence mode cannot be expanded any further, ending the loop.
2. The method according to claim 1, wherein the generating of the initial candidate sequence pattern according to the mapping database in S2 specifically includes: a sequence pattern of length 1 is determined as an initial candidate sequence pattern.
3. The method of claim 1, wherein, in S4, the candidate sequence pattern being expandable comprises the candidate sequence pattern being expandable by item and by item set.
4. The method of claim 1, wherein the mapping database stores utility values for each item of sequences in the click stream sequence database, and wherein the current sequence pattern mapping database stores utility values and location information for the current sequence pattern.
5. The method according to claim 3, wherein the S4 continuous constraint condition comprises: item expansion can only be done with items that are in the same set of items as the last item of the candidate sequence pattern; item set expansion can only be done with items in the item set that are located after the last item of the candidate sequence pattern.
6. The method according to claim 1, wherein the S3 further comprises: deleting the current sequence mode when the utility upper bound of the current sequence mode is smaller than the threshold value.
7. An apparatus for mining a high utility continuous sequence pattern, the apparatus comprising:
the mapping database establishing module is used for establishing a mapping database of the click stream sequence database;
the utility statistical module is used for generating an initial candidate sequence mode according to the mapping database, sequentially taking the initial candidate sequence mode as a current sequence mode, and counting the utility value and the utility upper bound of the current sequence mode;
the high-utility continuous sequence mode judging module is used for determining that the current sequence mode is the high-utility continuous sequence mode when the utility value of the current sequence mode is greater than or equal to a threshold value;
a candidate sequence mode decision module, configured to, when an upper utility bound of the current sequence mode is greater than or equal to a threshold, take the current sequence mode as a candidate sequence mode;
and the candidate sequence mode expansion module is used for expanding the candidate sequence under the continuous constraint condition to generate an expanded sequence mode, taking the expanded sequence mode as a current sequence mode, establishing a current sequence mode mapping database according to the current sequence mode, and counting the utility value and the utility upper bound of the current sequence mode according to the current sequence mode mapping database.
8. The apparatus of claim 7, wherein the mapping database in the mapping database building module stores utility values of each item of each sequence in the click stream sequence database, and wherein the current sequence pattern mapping database in the candidate sequence pattern expanding module stores utility values and location information of the current sequence pattern.
9. An apparatus for mining a high utility continuous sequence pattern, comprising:
a processor; and a memory, wherein the memory has stored therein a computer-executable program that, when executed by the processor, performs the method of any of claims 1-6.
10. A computer-readable storage medium having stored thereon instructions that, when executed by a processor, cause the processor to perform the method of any one of claims 1-6.
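
Steps S1 to S4 of claim 1, together with the continuous constraint of claim 5 and the pruning of claim 6, can be sketched in Python as follows. This is a minimal illustration and not the FUCPM implementation: the data layout, the function name `mine_continuous_patterns`, the toy database and in particular the utility upper bound (the total utility of every sequence that contains the pattern) are assumptions chosen for simplicity, since the upper bound actually used by the embodiment is not specified in this section. Item expansion is assumed to add only items that sort after the last item of the pattern, and item set expansion is assumed to draw from the item set immediately following the last matched one.

```python
from collections import defaultdict


def mine_continuous_patterns(db, min_util):
    """db: list of sequences; a sequence is a list of item sets (dict item -> utility).
    Returns {pattern: utility}; a pattern is a tuple of tuples of items."""
    seq_util = [sum(sum(s.values()) for s in seq) for seq in db]
    high_utility = {}

    def grow(pattern, occ):
        # occ plays the role of the current pattern's mapping database:
        # sequence id -> list of (index of last matched item set, utility).
        utility = sum(max(u for _, u in hits) for hits in occ.values())
        # Assumed upper bound: total utility of the sequences containing the pattern.
        upper = sum(seq_util[sid] for sid in occ)
        if utility >= min_util:          # S3: high-utility pattern
            high_utility[pattern] = utility
        if upper < min_util:             # claim 6: discard, do not expand
            return
        last_item = pattern[-1][-1]
        extensions = defaultdict(lambda: defaultdict(list))
        for sid, hits in occ.items():
            seq = db[sid]
            for pos, u in hits:
                # Item expansion: same item set as the last item of the pattern.
                for item, item_u in seq[pos].items():
                    if item > last_item:
                        new = pattern[:-1] + (pattern[-1] + (item,),)
                        extensions[new][sid].append((pos, u + item_u))
                # Item set expansion: only the directly following item set.
                if pos + 1 < len(seq):
                    for item, item_u in seq[pos + 1].items():
                        new = pattern + ((item,),)
                        extensions[new][sid].append((pos + 1, u + item_u))
        for new, new_occ in extensions.items():  # S4: recurse on each expansion
            grow(new, new_occ)

    # S1/S2: record every occurrence of every length-1 pattern.
    initial = defaultdict(lambda: defaultdict(list))
    for sid, seq in enumerate(db):
        for pos, item_set in enumerate(seq):
            for item, item_u in item_set.items():
                initial[((item,),)][sid].append((pos, item_u))
    for pattern, occ in initial.items():
        grow(pattern, occ)
    return high_utility


# Hypothetical toy database with sequence utilities 10 and 8.
toy_db = [
    [{"a": 2, "b": 3}, {"c": 4}, {"a": 1}],
    [{"a": 5}, {"c": 1}, {"b": 2}],
]
patterns = mine_continuous_patterns(toy_db, 9)
```

On this toy database with a threshold of 9, the sketch reports `<(a),(c)>` with utility 12, `<(a,b),(c)>` with utility 9 and `<(a,b),(c),(a)>` with utility 10. The non-contiguous pattern `<(a),(a)>` is never generated, because item set expansion only looks at the item set directly after the last matched one.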
CN202110727658.2A 2021-06-29 2021-06-29 Method, device and computer storage medium for mining high-utility continuous sequence mode Pending CN113407543A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110727658.2A CN113407543A (en) 2021-06-29 2021-06-29 Method, device and computer storage medium for mining high-utility continuous sequence mode

Publications (1)

Publication Number Publication Date
CN113407543A true CN113407543A (en) 2021-09-17

Family

ID=77680309

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110727658.2A Pending CN113407543A (en) 2021-06-29 2021-06-29 Method, device and computer storage medium for mining high-utility continuous sequence mode

Country Status (1)

Country Link
CN (1) CN113407543A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107870956A (en) * 2016-09-28 2018-04-03 腾讯科技(深圳)有限公司 A kind of effective item set mining method, apparatus and data processing equipment
CN109460424A (en) * 2018-10-18 2019-03-12 哈尔滨工业大学(深圳) Effective sequence pattern processing method, device and computer equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHUNKAI ZHANG et al.: "TKUS: Mining top-K high-utility sequential patterns", HTTP://DOI.ORG/10.48550/ARXIV.2011.13454 *
JINLIN CHEN et al.: "Mining Contiguous Sequential Patterns from Web Logs", PROCEEDINGS OF THE 16TH INTERNATIONAL CONFERENCE ON WORLD WIDE WEB *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210917