CN113377766A - Sequence database contrast mining method and device based on utility and computer equipment - Google Patents

Info

Publication number
CN113377766A
Authority
CN
China
Prior art keywords
sequence
utility
comparison
mode
contrast
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110554575.8A
Other languages
Chinese (zh)
Other versions
CN113377766B (en)
Inventor
张春慨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Shenzhen Graduate School Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Graduate School Harbin Institute of Technology
Priority to CN202110554575.8A
Publication of CN113377766A
Application granted
Publication of CN113377766B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2246 Trees, e.g. B+trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a utility-based sequence database contrast mining method, device, and computer equipment. The method comprises the following steps: preprocessing two sequence databases to be contrast-mined; representing the search spaces of the two preprocessed sequence databases by two LQS-trees, respectively, and determining the common part of the two LQS-trees as the search tree for contrast mining of the sequence databases; and pruning the search tree using contrast utility upper bounds, traversing each node of the search tree according to a pruning strategy, and outputting utility contrast sequence patterns. The invention combines utility with contrast sequence patterns, designs a utility-driven contrast sequence pattern, and provides an efficient mining algorithm to find the utility contrast sequence patterns in the two input databases.

Description

Sequence database contrast mining method and device based on utility and computer equipment
Technical Field
The application relates to the technical field of data mining, and in particular to a utility-based sequence database contrast mining method, device, and computer equipment.
Background
The main purpose of data mining and analysis is to find novel, potentially useful patterns in a database that can be used in practical applications to gain useful knowledge. To identify and evaluate different types of patterns, prior work has proposed a number of techniques and constraints, such as support, confidence, sequence order, and utility parameters (e.g., weight, price, profit, quantity, satisfaction). In recent years, there has been an increasing demand for utility-oriented pattern mining (UPM, also called utility mining). UPM is a vital technical task with many highly influential applications, including e-commerce, financial, medical, and biomedical applications. In particular, utility mining may be applied to business intelligence to support business decisions such as product placement, catalog design, customer segmentation, and cross-selling.
In UPM, each object/item has a unit utility (e.g., unit profit) and can occur more than once in a transaction or event (e.g., purchase quantity). The utility of a pattern represents its importance or satisfaction, which can be measured by risk, profit, cost, quantity, or other information depending on user preferences. In retail analysis (where utility is profit), each recorded customer transaction contains several products/items together with their purchase times and purchase quantities. For example, the transaction <2021.3.1 (A,1) (B,3), (C,3), (D,4)> records a user's purchases on March 1, 2021: the user bought 1 unit of A and 3 units of B, and then bought 3 units of C and 4 units of D in turn. Because the profit per unit of each item sold is known, the profit of the transaction can be derived. Applying different utility mining techniques to the retail database yields the corresponding transaction patterns. For example, a high-utility sequence pattern mining algorithm may return the pattern <(AB), (C), (D)>, which indicates that this sales pattern produces a profit higher than the profit value specified by the user; the retailer can also use the pattern as a reference for the order in which to recommend commodities. The basic task of high-utility pattern mining is to mine from the database all patterns whose utility values are above a specified utility value.
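The profit calculation for the example transaction can be sketched as follows. This is a minimal illustration only: the unit profits in `unit_profit` are hypothetical values chosen for the example and do not come from the patent.

```python
# Hypothetical unit profits (assumed for illustration; not taken from the patent).
unit_profit = {"A": 5, "B": 2, "C": 1, "D": 3}

# The transaction from the text, as an ordered list of itemsets.
# Each itemset is a list of (item, purchased quantity) pairs.
transaction = [[("A", 1), ("B", 3)], [("C", 3)], [("D", 4)]]


def transaction_utility(seq, profit):
    """Sum quantity * unit profit over every item of every itemset."""
    return sum(qty * profit[item] for itemset in seq for item, qty in itemset)


total = transaction_utility(transaction, unit_profit)
print(total)  # 1*5 + 3*2 + 3*1 + 4*3 = 26
```

With these assumed profits the transaction yields 26; a high-utility pattern miner compares such sums, accumulated over the whole database, against the user-specified threshold.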
However, existing utility mining algorithms are weak at analyzing differences between databases, and in many cases the patterns of interest are those with high utility in one database and low utility in another. For example, to support business decision-making, a poorly performing data set can be compared with a well-performing one to find out where they differ.
Disclosure of Invention
The invention provides a utility-based sequence database contrast mining method, device, and computer equipment. Specifically, it combines utility with contrast sequence patterns, designs a utility-driven contrast sequence pattern, and gives an efficient mining algorithm to find the utility contrast sequence patterns in two input databases. Utility contrast sequence patterns are sequences whose utility values in the two databases differ greatly (beyond a predefined contrast), and they explain the differences between the two databases well.
In a first aspect of the present invention, a utility-based sequence database contrast mining method is provided, the method comprising:
preprocessing two sequence databases to be contrast-mined;
representing the search spaces of the two preprocessed sequence databases by two LQS-trees, respectively, and determining the common part of the two LQS-trees as the search tree for contrast mining of the sequence databases;
pruning the search tree using contrast utility upper bounds, traversing each node of the search tree according to a pruning strategy, and outputting utility contrast sequence patterns.
Further, the specific process of pruning the search tree using contrast utility upper bounds, traversing each node of the search tree according to a pruning strategy, and outputting utility contrast sequence patterns comprises the following steps:
S201: building data matrices for the two preprocessed sequence databases, and building utility-value linked lists for the sequence patterns of length 1 in the data matrices;
S202: determining, from the utility-value linked lists, the utility ratio and the contrast utility upper bound of a sequence pattern common to the two sequence databases;
S203: when the utility ratio of the common sequence pattern is greater than or equal to a contrast threshold, determining the common sequence pattern to be a utility contrast sequence pattern and outputting it; when the contrast utility upper bound of the common sequence pattern is greater than or equal to the contrast threshold, taking the common sequence pattern as a candidate contrast sequence pattern;
S204: performing item extension and sequence extension on the candidate contrast sequence pattern to generate extended contrast sequence patterns, treating each extended contrast sequence pattern as the common sequence pattern, and returning to S202 until an end condition is met.
Further, S203 further includes: when the utility ratio of the common sequence pattern is smaller than the threshold, stopping the extension of the common sequence pattern and deleting it.
Further, the contrast utility upper bound is of three types, expressed as follows:
Figure BDA0003076729760000021
Figure BDA0003076729760000022
Figure BDA0003076729760000023
where iOUR(t; D+, D-) is the upper bound on the contrast, in sequence databases D+ and D-, of a sequence pattern t obtained by the first type of item extension, which adds one item to the last itemset of a sequence pattern; iESU(t, D+) and iESU(t, D-) are the upper bounds on the utility value of that pattern t in D+ and D-, respectively; sOUR(t; D+, D-) is the upper bound on the contrast, in D+ and D-, of a sequence pattern t obtained by the second type of item extension, which adds one itemset to a sequence pattern; sESU(t, D+) and sESU(t, D-) are the upper bounds on the utility value of that pattern t in D+ and D-, respectively; miui(t, D-) and miui(t, D+) are the minimum utility values with which the pattern t obtained by the first or second type of item extension occurs in D- and D+, respectively; wOUR(t; D+, D-) is the upper bound on the contrast, in D+ and D-, of a sequence pattern t obtained by extending a sequence pattern t' with one item in sequence extension; DIU(t, D+) and DIU(t, D-) are the upper bounds on the utility value of that pattern t in D+ and D-, respectively; and miui(t', D-) and miui(t', D+) are the minimum utility values with which the sequence pattern t' occurs in D- and D+, respectively.
Further, the pruning strategy specifically includes:
the extended contrast sequence patterns comprise item-extension descendant nodes and sequence-extension descendant nodes; when the item-extension descendant nodes are traversed, two sets of extension items are obtained by scanning the utility-value linked lists of the two sequence databases respectively, and each element in the intersection of the two sets is concatenated with the prefix sequence of the item-extension descendant node to generate a new candidate contrast sequence pattern.
In a second aspect of the present invention, a utility-based sequence database contrast mining device is provided, comprising:
a sequence database preprocessing module, used for preprocessing two sequence databases to be contrast-mined;
a search tree confirmation module, used for representing the search spaces of the two preprocessed sequence databases by two LQS-trees, respectively, and determining the common part of the two LQS-trees as the search tree for contrast mining of the sequence databases;
and a contrast mining module, used for pruning the search tree using contrast utility upper bounds, traversing each node of the search tree according to a pruning strategy, and outputting utility contrast sequence patterns.
Further, the specific process of the contrast mining module includes:
S201: building data matrices for the two preprocessed sequence databases, and building utility-value linked lists for the sequence patterns of length 1 in the data matrices;
S202: determining, from the utility-value linked lists, the utility ratio and the contrast utility upper bound of a sequence pattern common to the two sequence databases;
S203: when the utility ratio of the common sequence pattern is greater than or equal to a contrast threshold, determining the common sequence pattern to be a utility contrast sequence pattern and outputting it; when the contrast utility upper bound of the common sequence pattern is greater than or equal to the contrast threshold, taking the common sequence pattern as a candidate contrast sequence pattern;
S204: performing item extension and sequence extension on the candidate contrast sequence pattern to generate extended contrast sequence patterns, treating each extended contrast sequence pattern as the common sequence pattern, and returning to S202 until an end condition is met.
Further, step S203 in the contrast mining module further includes: when the utility ratio of the common sequence pattern is smaller than the threshold, stopping the extension of the common sequence pattern and deleting it.
In a third aspect of the present invention, a utility-based sequence database contrast mining device is provided, comprising: a processor; and a memory, wherein the memory stores a computer-executable program that, when executed by the processor, performs the utility-based sequence database contrast mining method described above.
In a fourth aspect of the invention, a computer-readable storage medium is provided having instructions stored thereon which, when executed by a processor, cause the processor to perform the utility-based sequence database contrast mining method described above.
The utility-based sequence database contrast mining method, device, and computer equipment provided by the invention have the following advantages. The two sequence databases to be contrast-mined are preprocessed and irrelevant information is removed. The search spaces of the two preprocessed sequence databases are represented by two LQS-trees, respectively, and the common part of the two LQS-trees is determined as the search tree for contrast mining; a new search space is thus defined using a property of contrast mining, and useless search nodes are filtered out. The search tree is pruned using contrast utility upper bounds, each node of the search tree is traversed according to a pruning strategy, and utility contrast sequence patterns are output, so that the search space is reduced and contrast mining performance is improved. The final beneficial effect is that utility is combined with contrast sequence patterns, a utility-driven contrast sequence pattern is designed, and an efficient mining algorithm is provided to find the utility contrast sequence patterns in the two input databases. Utility contrast sequence patterns are sequences whose utility values in the two databases differ greatly (beyond a predefined contrast), and they explain the differences between the two databases well.
Drawings
FIG. 1 is a flow chart of a utility-based sequence database contrast mining method in an embodiment of the present invention;
FIG. 2 is an example diagram of the search space of a high-utility sequence pattern mining algorithm in an embodiment of the present invention;
FIG. 3 is an example diagram of the search space tree of the UCPM algorithm in an embodiment of the present invention;
FIG. 4 is an example diagram of the matrix corresponding to the entry of Table 1 shown in
Figure BDA0003076729760000041
in an embodiment of the present invention;
FIG. 5 is an example diagram of the Utility-chains corresponding to the sequence <a> in the databases of Table 1 and Table 2 in an embodiment of the present invention;
FIG. 6 is a flowchart of traversing each node of the search tree according to the pruning strategy and outputting utility contrast sequence patterns in an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a utility-based sequence database contrast mining device in an embodiment of the present invention;
FIG. 8 is the architecture of a computer device in an embodiment of the invention.
Detailed Description
To describe the technical solution of the present invention in further detail, this embodiment is implemented on the premise of the technical solution of the present invention, and detailed implementation modes and specific steps are given.
The embodiment of the invention takes an application in the field of business intelligence as an example and provides a utility-based sequence database contrast mining method. As shown in FIG. 1, the method comprises the following steps:
S101: preprocessing two sequence databases to be contrast-mined;
In a specific embodiment, the preprocessing includes data collection and cleaning: the two sequence databases to be analyzed and contrasted are preprocessed, irrelevant information is removed, and the data are formatted and input into the contrast mining system. The formatted data are shown in Tables 1-3:
Table 1. Input database D+
Figure BDA0003076729760000051
Table 2. Input database D-
Figure BDA0003076729760000052
Table 3. Profit list for goods
Figure BDA0003076729760000053
S102: respectively representing the search spaces of the two preprocessed sequence databases by using two data trees LQS-tree, and determining a common part of the two data trees LQS-tree as a search tree for contrastive mining of the sequence databases;
the UCPM (utility-driven contrast pattern mining) algorithm excavates the utility contrast sequence patterns in the two contrast sequence databases that exceed a predefined contrast. And taking the formatted two-comparison transaction sequence database, the commodity profit list and the user-specified utility ratio as the input of the UCPM algorithm, and mining the two-comparison transaction sequence database, wherein the algorithm outputs a utility comparison sequence mode to help the user to analyze and make a decision.
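The inputs and the output test described above can be illustrated with a brute-force sketch. Several things here are assumptions rather than statements of the patent: the toy databases and profits are invented, the utility of a pattern in a sequence is taken as the maximum over its occurrences (a common convention in high-utility sequence mining), and the utility ratio is taken as the utility in D+ divided by the utility in D-. The function names are hypothetical.

```python
from itertools import combinations

# Assumed toy inputs: each database is a list of sequences, each sequence a list
# of itemsets, each itemset a dict mapping item -> purchased quantity.
profit = {"a": 2, "b": 1, "c": 3}
d_plus = [[{"a": 2, "b": 1}, {"c": 1}], [{"a": 1}, {"c": 3}]]
d_minus = [[{"a": 1, "b": 1}, {"c": 1}]]


def utility_in_sequence(pattern, seq, profit):
    """Maximum utility of `pattern` over all its occurrences in one sequence."""
    best = 0
    # Strictly increasing positions keep the itemsets of the pattern in order.
    for positions in combinations(range(len(seq)), len(pattern)):
        if all(set(items).issubset(seq[p]) for items, p in zip(pattern, positions)):
            u = sum(seq[p][i] * profit[i]
                    for items, p in zip(pattern, positions) for i in items)
            best = max(best, u)
    return best


def utility_in_database(pattern, db, profit):
    """Sum of the per-sequence utilities of `pattern` over the database."""
    return sum(utility_in_sequence(pattern, seq, profit) for seq in db)


def utility_ratio(pattern, d_plus, d_minus, profit):
    """Assumed contrast measure: u(t, D+) / u(t, D-); inf if absent from D-."""
    u_minus = utility_in_database(pattern, d_minus, profit)
    u_plus = utility_in_database(pattern, d_plus, profit)
    return u_plus / u_minus if u_minus else float("inf")


pattern = [{"a"}, {"c"}]          # the sequence pattern <(a), (c)>
delta = 2.0                       # user-specified contrast threshold (assumed)
ratio = utility_ratio(pattern, d_plus, d_minus, profit)
is_contrast_pattern = ratio >= delta
print(ratio, is_contrast_pattern)
```

Enumerating every occurrence like this is exponential in the pattern length; it only makes the definitions concrete and is not how an efficient miner would compute utilities.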
The search space of conventional high-utility sequence pattern mining is represented by an LQS-tree, as shown in FIG. 2; each node of the search space represents a candidate that may become a sequence pattern. To complete the sequence pattern mining task, the nodes in the tree are traversed, and the utility value of each node is calculated and compared with a preset value.
In the specific embodiment, a new search space is defined using a property of contrast mining, and useless search nodes are filtered out. Because a utility contrast pattern is a sequence whose utility differs greatly between the two contrasted databases, a candidate node in the search tree that may become a result pattern must appear in the LQS-trees of both sequence databases. As shown in FIG. 3, the common part of the LQS-trees of the two sequence databases constitutes the search space of the UCPM algorithm.
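At the first level of the tree, taking the common part amounts to keeping only the items that occur in both databases. The sketch below shows this for 1-sequences only; the toy databases and the helper name `one_sequences` are assumptions made for illustration.

```python
# Assumed toy databases: lists of sequences, each a list of itemsets (sets of items).
d_plus = [[{"a", "b"}, {"c"}], [{"d"}]]
d_minus = [[{"a"}, {"c", "e"}]]


def one_sequences(db):
    """All items that occur anywhere in the database (the level-1 tree nodes)."""
    return {item for seq in db for itemset in seq for item in itemset}


# Only items present in both databases can root a subtree of the shared search space.
common_roots = sorted(one_sequences(d_plus) & one_sequences(d_minus))
print(common_roots)  # ['a', 'c']
```

Items `b`, `d`, and `e` occur in only one database, so the subtrees rooted at them are never explored.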
Further, the embodiment uses projected databases and compact data structures, namely the CU-matrix (hereinafter, matrix) and the Utility-chain. Because traversing a node requires only the database information related to that node, the runtime data structures take the form of a projected database: matrix and Utility-chain, as shown in FIG. 4 and FIG. 5. The former stores the utility value of each item of each sequence in the database; the latter stores the utility values and position information of the candidate sequence patterns formed during mining.
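A much simplified sketch of the two structures follows. The exact layout of the CU-matrix and Utility-chain is given in FIG. 4 and FIG. 5, which are not reproduced in this text, so the representation below (a per-sequence item-by-itemset utility table, and a list of `(sequence id, itemset position, utility)` entries) is an assumption, as are the toy data; the actual structures may carry additional fields.

```python
# Assumed toy inputs: itemsets map item -> quantity; profits are hypothetical.
profit = {"a": 2, "b": 1, "c": 3}
db = [
    [{"a": 2, "b": 1}, {"c": 1}, {"a": 1}],
    [{"b": 2}, {"a": 3, "c": 1}],
]


def build_matrix(seq, profit):
    """Item-by-itemset table of utilities (quantity * unit profit, 0 if absent)."""
    items = sorted({item for itemset in seq for item in itemset})
    return {item: [itemset.get(item, 0) * profit[item] for itemset in seq]
            for item in items}


def build_chain(db, profit, item):
    """Occurrences of the 1-sequence <item>: (sequence id, position, utility)."""
    return [(sid, pos, itemset[item] * profit[item])
            for sid, seq in enumerate(db)
            for pos, itemset in enumerate(seq)
            if item in itemset]


matrix0 = build_matrix(db[0], profit)
chain_a = build_chain(db, profit, "a")
print(matrix0)
print(chain_a)
```

The matrix answers "what is the utility of item x in itemset j" without rescanning quantities, and the chain records where a candidate pattern ends in each sequence so that extensions only look at later positions.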
S103: pruning the search tree by using the upper bound of the comparison effect, traversing each node of the search tree according to a pruning strategy, and outputting an effect comparison sequence mode.
The method adopts the compact utility upper bound for pruning, and the utility upper bound is an important concept in the high-utility sequence pattern mining technology. All high-utility sequence pattern mining algorithms utilize a specific utility upper bound to prune a search space, so that the algorithm efficiency is improved. The upper bound of utility for a sequence pattern must satisfy two conditions: firstly, the utility upper bound value of the sequence mode is larger than the utility value; and the utility upper bound value of the sequence mode is larger than that of the extended sequence. Therefore, when the utility upper bound value of a sequence mode is smaller than the threshold, the utility value of the extended sequence is necessarily smaller than the threshold, so that the sequence mode can be stopped from being extended, and the search space is reduced. The invention provides three compact upper boundary itemset and sequence optimization of Utility Ratio (iOUR and sOUR) and width optimization of Utility Ratio (wOUR) for contrast effect, and the calculation modes are as follows:
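The pruning argument in the preceding paragraph can be written out as a short derivation. The symbols UB (a generic utility upper bound), δ (the threshold), and t' (any extension of t) are notation chosen here for illustration; this is the general reasoning behind upper-bound pruning, not the specific iOUR, sOUR, or wOUR formulas.

```latex
% UB: generic utility upper bound; \delta: threshold; t': any extension of t.
\begin{align*}
  &\text{(1)}\quad UB(t) \ge u(t), \\
  &\text{(2)}\quad UB(t) \ge UB(t') \quad \text{for every extension } t' \text{ of } t, \\
  &\text{hence}\quad UB(t) < \delta \;\Longrightarrow\;
     u(t') \le UB(t') \le UB(t) < \delta .
\end{align*}
```

Condition (1) applied to t' gives the first inequality of the chain and condition (2) gives the second, so no extension of t can reach the threshold once UB(t) falls below it.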
Figure BDA0003076729760000061
Figure BDA0003076729760000062
Figure BDA0003076729760000071
where iOUR(t; D+, D-) is the upper bound on the contrast, in sequence databases D+ and D-, of a sequence pattern t obtained by the first type of item extension; iESU(t, D+) and iESU(t, D-) are the upper bounds on the utility value of that pattern t in D+ and D-, respectively; sOUR(t; D+, D-) is the upper bound on the contrast, in D+ and D-, of a sequence pattern t obtained by the second type of item extension; sESU(t, D+) and sESU(t, D-) are the upper bounds on the utility value of that pattern t in D+ and D-, respectively; miui(t, D-) and miui(t, D+) are the minimum utility values with which the pattern t obtained by the first or second type of item extension occurs in D- and D+, respectively; wOUR(t; D+, D-) is the upper bound on the contrast, in D+ and D-, of a sequence pattern t obtained by extending a sequence pattern t' with one item in sequence extension; DIU(t, D+) and DIU(t, D-) are the upper bounds on the utility value of that pattern t in D+ and D-, respectively; and miui(t', D-) and miui(t', D+) are the minimum utility values with which the sequence pattern t' occurs in D- and D+, respectively. Here the first type of item extension yields a sequence pattern t by adding one item to the last itemset of a sequence pattern, and the second type of item extension yields a sequence pattern t by adding one itemset to a sequence pattern.
In the specific implementation, the three utility upper bounds iESU, sESU, and DIU and the minimum utility value miui are defined as follows:
Assuming that the sequence pattern t has an extension position P in a sequence S of the database, we have
Figure BDA0003076729760000072
Figure BDA0003076729760000073
where u(t, P, S) is the utility value of the sequence pattern t at extension position P in S, and u(S/(t, P)) is the sum of the utility values of all items associated with extension position P.
The iESU and sESU of the sequence pattern t in the sequence S are defined as
Figure BDA0003076729760000074
Figure BDA0003076729760000075
Further, the iESU and sESU of the sequence pattern t in the database D are defined as
Figure BDA0003076729760000076
Figure BDA0003076729760000077
An example for the sequence pattern <{a, b}> in the embodiment is given below:
Figure BDA0003076729760000081
Assuming that the sequence pattern t is generated by extending the sequence pattern t' with one item, the DIU of t in the sequence S is defined as
Figure BDA0003076729760000082
Further, the DIU of the sequence pattern t in the database D is defined as
Figure BDA0003076729760000083
An example for the sequence pattern <{a, b}> in the embodiment is given below:
Figure BDA0003076729760000084
The minimum utility of the sequence pattern t in the database D is defined as
Figure BDA0003076729760000085
An example for the sequence pattern <{a, b}> in the embodiment is given below:
Figure BDA0003076729760000086
Combining the definitions and calculation examples of the utility upper bounds and the minimum utility given above, we take the sequence <{a, b}> as an example and calculate each contrast utility upper bound:
Figure BDA0003076729760000087
Figure BDA0003076729760000088
Figure BDA0003076729760000089
Further, as shown in FIG. 6, the specific process of pruning the search tree using contrast utility upper bounds, traversing each node of the search tree according to a pruning strategy, and outputting utility contrast sequence patterns includes:
S201: building data matrices for the two preprocessed sequence databases, and building utility-value linked lists for the sequence patterns of length 1 in the data matrices;
S202: determining, from the utility-value linked lists, the utility ratio and the contrast utility upper bound of a sequence pattern common to the two sequence databases;
S203: when the utility ratio of the common sequence pattern is greater than or equal to a contrast threshold, determining the common sequence pattern to be a utility contrast sequence pattern and outputting it; when the contrast utility upper bound of the common sequence pattern is greater than or equal to the contrast threshold, taking the common sequence pattern as a candidate contrast sequence pattern;
S204: performing item extension and sequence extension on the candidate contrast sequence pattern to generate extended contrast sequence patterns, treating each extended contrast sequence pattern as the common sequence pattern, and returning to S202 until an end condition is met.
Further, S203 further includes: when the utility ratio of the common sequence pattern is smaller than the threshold, stopping the extension of the common sequence pattern and deleting it.
Further, the pruning strategy specifically includes:
the extended contrast sequence patterns comprise item-extension descendant nodes and sequence-extension descendant nodes; when the item-extension descendant nodes are traversed, two sets of extension items are obtained by scanning the utility-value linked lists of the two sequence databases respectively, and each element in the intersection of the two sets is concatenated with the prefix sequence of the item-extension descendant node to generate a new candidate contrast sequence pattern.
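The intersection step can be sketched for a prefix of length 1. This is a simplified illustration under stated assumptions: it scans the raw sequences rather than utility-value linked lists, it follows the common convention that an item extension only adds items lexicographically greater than the last item of the prefix, and the toy databases and function name are invented for the example.

```python
# Assumed toy databases: lists of sequences, each a list of itemsets (sets of items).
d_plus = [[{"a", "b"}, {"c"}, {"a", "d"}], [{"a", "c"}, {"b"}]]
d_minus = [[{"a", "b"}, {"c"}], [{"a"}, {"b", "e"}]]


def extension_items(db, item):
    """Items that can extend the 1-sequence <item> in `db`.

    Returns (i_ext, s_ext): items sharing an itemset with `item` (item extension)
    and items in any itemset after an occurrence of `item` (sequence extension).
    """
    i_ext, s_ext = set(), set()
    for seq in db:
        seen = False  # True once `item` has occurred in an earlier itemset
        for itemset in seq:
            if seen:
                s_ext.update(itemset)
            if item in itemset:
                i_ext.update(x for x in itemset if x > item)
                seen = True
    return i_ext, s_ext


i_plus, s_plus = extension_items(d_plus, "a")
i_minus, s_minus = extension_items(d_minus, "a")

# Keep only extension items available in both databases.
i_common = sorted(i_plus & i_minus)
s_common = sorted(s_plus & s_minus)

# Concatenate with the prefix <(a)>; patterns are tuples of itemset tuples.
prefix = (("a",),)
i_candidates = [prefix[:-1] + (prefix[-1] + (x,),) for x in i_common]
s_candidates = [prefix + ((x,),) for x in s_common]
print(i_candidates)  # [(('a', 'b'),)]
print(s_candidates)  # [(('a',), ('b',)), (('a',), ('c',))]
```

Items `c` and `d` could extend `<(a)>` by item extension in D+ but not in D-, so they are dropped before any utility is computed for them.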
In a specific embodiment, the calculated upper bounds prune the search space effectively. Once wOUR(<{ab}>) of the sequence <{ab}> has been obtained, if wOUR(<{ab}>) is smaller than the predefined contrast threshold, neither the sequence <{ab}> nor its extensions need to be traversed. When wOUR(<{ab}>) meets the requirement, the upper bounds iOUR(<{ab}>) and sOUR(<{ab}>) are calculated; only when iOUR(<{ab}>) and sOUR(<{ab}>) meet the preset contrast utility condition is the utility ratio of the current sequence calculated and the item (sequence) extension of the current sequence continued.
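The order of the checks can be sketched as a small decision function. This is one possible reading, stated as an assumption: the width bound is checked first, the depth bounds and the utility ratio are computed lazily only if it passes, and iOUR and sOUR gate item extension and sequence extension separately. The function and key names are hypothetical, and the bound values are supplied by the caller because the formulas themselves are not reproduced in this text.

```python
def evaluate_node(wour, depth_bounds, ratio, delta):
    """Decide what to do with one candidate node.

    wour         -- width bound of the node, already computed
    depth_bounds -- callable returning (iour, sour); evaluated lazily
    ratio        -- callable returning the node's utility ratio; evaluated lazily
    delta        -- contrast threshold
    """
    decision = {"output": False, "item_ext": False, "seq_ext": False}
    if wour < delta:
        return decision  # width pruning: skip the node and all its extensions
    iour, sour = depth_bounds()
    decision["item_ext"] = iour >= delta   # depth pruning for item extension
    decision["seq_ext"] = sour >= delta    # depth pruning for sequence extension
    decision["output"] = ratio() >= delta  # the node itself is a result pattern
    return decision


calls = []


def bounds_stub():
    calls.append("bounds")
    return 2.2, 1.0


def ratio_stub():
    calls.append("ratio")
    return 1.8


pruned = evaluate_node(1.5, bounds_stub, ratio_stub, delta=2.0)
calls_after_pruned = list(calls)   # stays empty: nothing was computed
kept = evaluate_node(2.5, bounds_stub, ratio_stub, delta=2.0)
print(pruned, kept, calls)
```

Passing the expensive quantities as callables makes the saving visible: for the pruned node neither the depth bounds nor the utility ratio is ever evaluated.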
In the specific embodiment, the flow of the UCPM (utility-driven contrast pattern mining) algorithm includes:
S1001: the input comprises two formatted databases to be contrasted, D+ and D-, an external utility table, and a minimum utility contrast threshold δ; the output is all utility contrast sequence patterns of the two input databases D+ and D-.
S1002: CU-matrices are built for the two databases, and Utility-chains are built for the sequences of length 1 (1-sequences). For each 1-sequence common to the two databases, UCPM first calculates and checks its utility ratio, and outputs the 1-sequence when the condition is met;
S1003: the descendant nodes of a 1-sequence node are divided into item-extension descendants and sequence-extension descendants, and the item extension method and the sequence extension method are called respectively to generate and process the corresponding descendant nodes, with the item-extension descendant nodes traversed first. For each type of descendant node a depth pruning strategy, namely iOUR or sOUR, is applied: traversal of the descendant nodes continues only when the corresponding utility ratio upper bound is greater than or equal to the contrast threshold δ;
S1004: when item-extension descendant nodes are processed, the sets of extendable items are obtained by scanning the Utility-chains of the two databases respectively, and each element in the intersection of the two sets is concatenated with the prefix sequence to generate a new candidate sequence. For each new sequence generated, a width pruning strategy is applied: wOUR is calculated first, and only when wOUR is greater than or equal to δ are the corresponding Utility-chains generated and the utility upper bounds and other parameters calculated; the subsequent processing is the same as for the parent node. The algorithm ends when no new sequences are generated.
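The overall control flow of S1001 to S1004 can be sketched as a generic depth-first traversal with pluggable functions. This is a skeleton under assumptions, not the UCPM implementation: `utility_ratio`, `ext_bound`, and `children` stand in for the Utility-chain computations and the iOUR/sOUR/wOUR bounds, and the toy tree with its ratio and bound values is invented for the example.

```python
def mine(roots, utility_ratio, ext_bound, children, delta):
    """Depth-first traversal that outputs patterns and prunes by an upper bound.

    roots         -- candidate patterns common to both databases
    utility_ratio -- pattern -> utility ratio of the pattern
    ext_bound     -- pattern -> upper bound on the ratio of any extension
    children      -- pattern -> extended patterns (item and sequence extensions)
    """
    results, visited = [], []
    stack = list(roots)
    while stack:                       # ends when no new sequences are generated
        pattern = stack.pop()
        visited.append(pattern)
        if utility_ratio(pattern) >= delta:
            results.append(pattern)    # utility contrast sequence pattern
        if ext_bound(pattern) >= delta:
            stack.extend(children(pattern))  # otherwise the subtree is pruned
    return results, visited


# Invented toy search tree with precomputed ratios and bounds.
tree = {"a": ["ab", "ac"], "b": ["ba"], "ab": [], "ac": [], "ba": []}
ratios = {"a": 1.2, "ab": 3.0, "ac": 0.8, "b": 0.5, "ba": 0.6}
bounds = {"a": 3.0, "ab": 0.0, "ac": 0.0, "b": 0.6, "ba": 0.0}

results, visited = mine(["a", "b"], ratios.get, bounds.get, tree.get, delta=2.0)
print(results)   # ['ab']
print(visited)   # 'ba' is never visited because the bound of 'b' is below delta
```

In this toy run the node `b` has a bound of 0.6, below δ = 2.0, so its child `ba` is never generated, while `a` is extended even though its own ratio is below the threshold.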
Hereinafter, an apparatus corresponding to the method shown in fig. 1, a utility-based sequence database comparison mining apparatus 300 according to an embodiment of the present disclosure will be described with reference to fig. 7, and fig. 7 is a schematic structural diagram of the utility-based sequence database comparison mining apparatus according to an embodiment of the present disclosure. Since the function of the apparatus 300 is the same as the details of the method described above with reference to fig. 1, a detailed description of the same is omitted here for the sake of simplicity. As shown in fig. 7, the apparatus 300 includes: the sequence database preprocessing module 301 is used for preprocessing two sequence databases which are subjected to pre-contrast mining; the search tree confirmation module 302 is used for respectively representing the search spaces of the two preprocessed sequence databases by two data trees LQS-tree, and determining the public part of the two data trees LQS-tree as a search tree for the comparative mining of the sequence databases; and the comparison mining module 303 is used for pruning the search tree by using the comparison effect upper bound, traversing each node of the search tree according to a pruning strategy and outputting an effect comparison sequence mode. The apparatus 300 may include other components in addition to the 3 modules, however, since these components are not related to the contents of the embodiments of the present disclosure, illustration and description thereof are omitted herein.
Further, the specific process of the contrast mining module 303 includes:
S201: establishing data matrices for the two preprocessed sequence databases, and establishing utility value linked lists for the sequence patterns of length 1 in the data matrices;
S202: determining, from the utility value linked lists, the utility ratio and the contrast utility upper bound of the same sequence pattern in the two sequence databases;
S203: when the utility ratio of the sequence pattern is greater than or equal to a contrast threshold, determining the sequence pattern to be a utility contrast sequence pattern and outputting it; when the contrast utility upper bound of the sequence pattern is greater than or equal to the contrast threshold, taking the sequence pattern as a candidate contrast sequence pattern;
S204: performing item extension and sequence extension on the candidate contrast sequence pattern to generate extended contrast sequence patterns, treating each extended contrast sequence pattern as the same sequence pattern, and returning to S202 until an end condition is met.
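The control flow of S202 to S204 can be sketched as a simple worklist loop. This is a hedged illustration under stated assumptions: the utility ratio, the contrast utility upper bound, and the extension step are passed in as callables (utility_ratio, upper_bound, extend are hypothetical names), because their actual definitions depend on the utility value linked lists and formulas that are not reproduced in this sketch.

```python
from typing import Callable, Iterable, List, Tuple

Pattern = Tuple[str, ...]  # simplified stand-in for a sequence pattern


def mine_contrast_patterns(
    seeds: Iterable[Pattern],
    utility_ratio: Callable[[Pattern], float],
    upper_bound: Callable[[Pattern], float],
    extend: Callable[[Pattern], Iterable[Pattern]],
    threshold: float,
) -> List[Pattern]:
    """Output patterns whose ratio reaches the threshold; extend only those
    whose upper bound reaches the threshold (all inputs are assumptions)."""
    results = []
    stack = list(seeds)  # length-1 patterns, as prepared in S201
    while stack:  # end condition: nothing left to examine
        pattern = stack.pop()
        # S203, first branch: the pattern itself is a contrast pattern.
        if utility_ratio(pattern) >= threshold:
            results.append(pattern)
        # S203, second branch: descendants may still qualify, so extend (S204).
        if upper_bound(pattern) >= threshold:
            stack.extend(extend(pattern))
    return results


# Toy inputs: unknown patterns default to 0.0, so extension stops after length 2.
ratio = {("a",): 0.5, ("b",): 3.0, ("a", "b"): 2.5}
bound = {("a",): 4.0, ("b",): 1.0}
found = mine_contrast_patterns(
    [("a",), ("b",)],
    lambda p: ratio.get(p, 0.0),
    lambda p: bound.get(p, 0.0),
    lambda p: [p + (x,) for x in ("a", "b")],
    2.0,
)
print(sorted(found))
```

The two checks are independent: a pattern can be output without being extended, or extended without being output, which is why the upper bound rather than the ratio decides whether a branch is explored.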
Further, S203 in the contrast mining module 303 further includes: when the utility ratio of the same sequence pattern is smaller than the threshold, ending the extension of that sequence pattern and deleting it.
For the specific working process of the utility-based sequence database contrast mining apparatus 300, refer to the description of the utility-based sequence database contrast mining method above; it is not repeated here.
Furthermore, an apparatus according to an embodiment of the present invention may also be implemented by means of the architecture of the computing device shown in fig. 8. As shown in fig. 8, the computing device may include a computer system 401, a system bus 403, one or more CPUs 404, input/output components 402, memory 405, and the like. The memory 405 may store various data or files used in computer processing and/or communication, as well as program instructions executed by the CPU. The architecture shown in fig. 8 is merely exemplary, and one or more of the components in fig. 8 may be adjusted as needed to implement different devices.
Embodiments of the invention may also be implemented as a computer-readable storage medium. A computer-readable storage medium according to an embodiment has computer-readable instructions stored thereon. The computer readable instructions, when executed by a processor, may perform a method according to embodiments of the invention as described with reference to the above figures.
In summary, with the utility-based sequence database contrast mining method, apparatus and computer device provided by the above embodiments, the two sequence databases to be contrast-mined are preprocessed to remove irrelevant information; the search spaces of the two preprocessed sequence databases are represented by two LQS-trees, respectively, and the common part of the two LQS-trees is determined as the search tree for contrast mining, so that a new search space is defined using the properties of the contrast algorithm and useless search nodes are filtered out; and the search tree is pruned by using contrast utility upper bounds, each node of the search tree is traversed according to a pruning strategy, and utility contrast sequence patterns are output, which reduces the search space and improves contrast mining performance. The resulting beneficial effects are as follows: utility is combined with contrast sequence patterns to define a utility-driven contrast sequence pattern, and an efficient mining algorithm is provided to find the utility contrast sequence patterns in the two input databases. Utility contrast sequence patterns are sequences whose utility values in the two databases differ greatly (exceeding a predefined contrast threshold), and they describe the differences between the two databases well.
In this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process or method.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims (10)

1. A utility-based sequence database contrast mining method, the method comprising:
preprocessing two sequence databases to be contrast-mined;
representing the search spaces of the two preprocessed sequence databases by two LQS-trees, respectively, and determining a common part of the two LQS-trees as a search tree for contrast mining of the sequence databases;
pruning the search tree by using contrast utility upper bounds, traversing each node of the search tree according to a pruning strategy, and outputting utility contrast sequence patterns.
2. The method of claim 1, wherein pruning the search tree by using contrast utility upper bounds, traversing each node of the search tree according to a pruning strategy, and outputting utility contrast sequence patterns comprises:
S201: establishing data matrices for the two preprocessed sequence databases, and establishing utility value linked lists for the sequence patterns of length 1 in the data matrices;
S202: determining, from the utility value linked lists, the utility ratio and the contrast utility upper bound of the same sequence pattern in the two sequence databases;
S203: when the utility ratio of the sequence pattern is greater than or equal to a contrast threshold, determining the sequence pattern to be a utility contrast sequence pattern and outputting it; when the contrast utility upper bound of the sequence pattern is greater than or equal to the contrast threshold, taking the sequence pattern as a candidate contrast sequence pattern;
S204: performing item extension and sequence extension on the candidate contrast sequence pattern to generate extended contrast sequence patterns, treating each extended contrast sequence pattern as the same sequence pattern, and returning to S202 until an end condition is met.
3. The utility-based sequence database contrast mining method according to claim 2, wherein S203 further comprises: when the utility ratio of the same sequence pattern is smaller than the threshold, ending the extension of that sequence pattern and deleting it.
4. The utility-based sequence database contrast mining method according to claim 2, wherein the contrast utility upper bound comprises three types, expressed as follows:
Figure FDA0003076729750000011
Figure FDA0003076729750000012
Figure FDA0003076729750000013
wherein iOUR(t; D+, D-) is the contrast utility upper bound, in the sequence databases D+ and D-, of a sequence pattern t obtained by the first type of item extension, in which an item is added to the last itemset of the sequence pattern; iESU(t, D+) and iESU(t, D-) are the upper bounds on the utility value of the sequence pattern t obtained by the first type of item extension in the sequence databases D+ and D-, respectively; sOUR(t; D+, D-) is the contrast utility upper bound, in the sequence databases D+ and D-, of a sequence pattern t obtained by the second type of item extension, in which an itemset is added at the end of the sequence pattern; sESU(t, D+) and sESU(t, D-) are the upper bounds on the utility value of the sequence pattern t obtained by the second type of item extension in the sequence databases D+ and D-, respectively; miui(t, D-) and miui(t, D+) are the minimum utility values with which the sequence pattern t obtained by the first or second type of item extension occurs in the sequence databases D- and D+, respectively; wOUR(t; D+, D-) is the contrast utility upper bound, in the sequence databases D+ and D-, of a sequence pattern t obtained by extending a sequence pattern t' by one item in sequence extension; DIU(t, D+) and DIU(t, D-) are the upper bounds on the utility value of the sequence pattern t obtained by the sequence extension in the databases D+ and D-, respectively; and miui(t', D-) and miui(t', D+) are the minimum utility values with which the sequence pattern t' occurs in the databases D- and D+, respectively.
5. The utility-based sequence database contrast mining method according to claim 3, wherein the pruning strategy further comprises:
the extended contrast sequence patterns comprise item-extension descendant nodes and sequence-extension descendant nodes; when the item-extension descendant nodes are traversed, two sets of extensible items are obtained by scanning the utility value linked lists corresponding to the two sequence databases, respectively, and each element in the intersection of the two sets is joined to the prefix sequence of the item-extension descendant node to generate a new candidate contrast sequence pattern.
6. A utility-based sequence database contrast mining apparatus, the apparatus comprising:
a sequence database preprocessing module, configured to preprocess two sequence databases to be contrast-mined;
a search tree confirmation module, configured to represent the search spaces of the two preprocessed sequence databases by two LQS-trees, respectively, and to determine a common part of the two LQS-trees as a search tree for contrast mining of the sequence databases;
and a contrast mining module, configured to prune the search tree by using contrast utility upper bounds, traverse each node of the search tree according to a pruning strategy, and output utility contrast sequence patterns.
7. The apparatus of claim 6, wherein the specific process of the contrast mining module comprises:
S201: establishing data matrices for the two preprocessed sequence databases, and establishing utility value linked lists for the sequence patterns of length 1 in the data matrices;
S202: determining, from the utility value linked lists, the utility ratio and the contrast utility upper bound of the same sequence pattern in the two sequence databases;
S203: when the utility ratio of the sequence pattern is greater than or equal to a contrast threshold, determining the sequence pattern to be a utility contrast sequence pattern and outputting it; when the contrast utility upper bound of the sequence pattern is greater than or equal to the contrast threshold, taking the sequence pattern as a candidate contrast sequence pattern;
S204: performing item extension and sequence extension on the candidate contrast sequence pattern to generate extended contrast sequence patterns, treating each extended contrast sequence pattern as the same sequence pattern, and returning to S202 until an end condition is met.
8. The apparatus of claim 7, wherein S203 in the contrast mining module further comprises: when the utility ratio of the same sequence pattern is smaller than the threshold, ending the extension of that sequence pattern and deleting it.
9. A utility-based sequence database contrast mining device, comprising:
a processor; and a memory, wherein the memory has stored therein a computer-executable program that, when executed by the processor, performs the method of any of claims 1-5.
10. A computer-readable storage medium having stored thereon instructions that, when executed by a processor, cause the processor to perform the method of any one of claims 1-5.
CN202110554575.8A 2021-05-21 2021-05-21 Sequence database contrast mining method and device based on utility and computer equipment Active CN113377766B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110554575.8A CN113377766B (en) 2021-05-21 2021-05-21 Sequence database contrast mining method and device based on utility and computer equipment

Publications (2)

Publication Number Publication Date
CN113377766A true CN113377766A (en) 2021-09-10
CN113377766B CN113377766B (en) 2022-09-13

Family

ID=77571453

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110554575.8A Active CN113377766B (en) 2021-05-21 2021-05-21 Sequence database contrast mining method and device based on utility and computer equipment

Country Status (1)

Country Link
CN (1) CN113377766B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6928368B1 (en) * 1999-10-26 2005-08-09 The Board Regents, The University Of Texas System Gene mining system and method
US20120130964A1 (en) * 2010-11-18 2012-05-24 Yen Show-Jane Fast algorithm for mining high utility itemsets
CN105590237A (en) * 2015-12-18 2016-05-18 齐鲁工业大学 Application of high utility sequential pattern with negative-profit items in electronic commerce business decision making
CN105868296A (en) * 2016-03-24 2016-08-17 银江股份有限公司 Fast pruning policy based method for drug DDD value data analysis in efficient sequence modes
US20180307722A1 (en) * 2016-09-27 2018-10-25 Tencent Technology (Shenzhen) Company Limited Pattern mining method, high-utility itemset mining method, and related device
CN108733705A (en) * 2017-04-20 2018-11-02 哈尔滨工业大学深圳研究生院 A kind of effective sequential mode mining method and device
CN110349678A (en) * 2019-07-19 2019-10-18 齐鲁工业大学 A kind of Chinese medicine marketing system and its working method based on the positive and negative sequence rule digging of effective
CN111930804A (en) * 2020-08-07 2020-11-13 河北工业大学 Top-k self-adaptive contrast mode mining method based on incomplete net tree

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CHOWDHURY FARHAN AHMED: "Mining High Utility Web Access Sequences in Dynamic Web Log Data", 《IEEE XPLORE》 *
JERRY CHUN-WEI LIN: "High-Utility Sequential Pattern Mining with Multiple Minimum Utility Thresholds", 《SPRINGER》 *
魏芹双: "Research Progress on Contrast Pattern Mining" (对比模式挖掘研究进展), 《网络安全技术与应用》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117407442A (en) * 2023-12-11 2024-01-16 珠海大横琴科技发展有限公司 Mining method and device for judging high utility mode, electronic equipment and medium
CN117407442B (en) * 2023-12-11 2024-03-19 珠海大横琴科技发展有限公司 Mining method and device for judging high utility mode, electronic equipment and medium

Also Published As

Publication number Publication date
CN113377766B (en) 2022-09-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant