CN113377766A - Sequence database contrast mining method and device based on utility and computer equipment
- Publication number
- CN113377766A (application number CN202110554575.8A)
- Authority
- CN
- China
- Prior art keywords
- sequence
- utility
- comparison
- mode
- contrast
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
- G06F16/2246—Trees, e.g. B+trees
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Computational Linguistics (AREA)
- Mathematical Physics (AREA)
- Fuzzy Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a utility-based sequence database contrast mining method, device, and computer equipment. The method comprises: preprocessing the two sequence databases to be contrast-mined; representing the search spaces of the two preprocessed sequence databases with two LQS-trees, and taking the common part of the two LQS-trees as the search tree for contrast mining of the sequence databases; and pruning the search tree with contrast utility upper bounds, traversing each node of the search tree according to a pruning strategy, and outputting the utility contrast sequence patterns. The invention combines utility with contrast sequence patterns, defines the utility-driven contrast sequence pattern, and provides an efficient mining algorithm for finding the utility contrast sequence patterns in the two input databases.
Description
Technical Field
The application relates to the technical field of data mining, and in particular to a utility-based sequence database contrast mining method, device, and computer equipment.
Background
The main purpose of data mining and analysis is to find novel, potentially useful patterns from a database that can be used in practical applications to gain useful knowledge. To identify and evaluate the effectiveness of different types of patterns, prior work has proposed a number of techniques and constraints, such as support, confidence, sequence order, and utility parameters (e.g., weight, price, profit, quantity, satisfaction, etc.). In recent years, there has been an increasing demand for utility-oriented pattern mining (UPM, or referred to as utility mining). UPM is a vital technical task that has many highly influential applications, including e-commerce, financial, medical, and biomedical applications. In particular, utility mining may be applied to business intelligence to make business decisions, such as product placement, catalog design, customer segmentation, cross-selling, and the like.
In UPM, each object/item has a unit utility (e.g., unit profit) and can occur more than once in a transaction or event (e.g., purchase quantity). The utility of a pattern represents its importance or satisfaction, which can be measured by risk, profit, cost, quantity, or other information depending on user preferences. In retail analysis (where utility is profit), each recorded customer transaction contains several products/items together with their purchase times and purchase quantities. For example, the transaction <2021.3.1 (A,1) (B,3), (C,3), (D,4)> records a customer's purchases on March 1, 2021: the customer first bought 1 unit of A and 3 units of B, then 3 units of C, and then 4 units of D. Since the unit profit of each item sold is known, the profit of the transaction can be derived. Applying different utility mining techniques to the retail database yields the corresponding transaction patterns. For example, a high-utility sequence pattern mining algorithm may return the pattern <(AB), (C), (D)>, indicating that this sales pattern yields a profit higher than the value specified by the user; the retailer can also use the pattern as a reference for the order in which goods are recommended. The basic task of high-utility pattern mining is to mine from the database all patterns whose utility values are above a specified utility value.
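To make the utility computation concrete, the following minimal sketch computes the profit of the example transaction above. The unit profits in `UNIT_PROFIT` are hypothetical values chosen for illustration (the source gives no figures), and the function name is likewise an assumption.

```python
from typing import Dict, List, Tuple

# Hypothetical unit profits; these numbers are NOT from the source.
UNIT_PROFIT: Dict[str, float] = {"A": 2.0, "B": 1.0, "C": 3.0, "D": 0.5}

# A quantitative sequence: an ordered list of itemsets,
# each itemset being a list of (item, quantity) pairs.
QSequence = List[List[Tuple[str, int]]]


def sequence_utility(seq: QSequence, profit: Dict[str, float]) -> float:
    """Sum quantity * unit profit over every item of every itemset."""
    return sum(qty * profit[item] for itemset in seq for item, qty in itemset)


# The example transaction: (A,1) and (B,3) together, then (C,3), then (D,4).
transaction: QSequence = [[("A", 1), ("B", 3)], [("C", 3)], [("D", 4)]]
total = sequence_utility(transaction, UNIT_PROFIT)
print(total)
```

With these assumed profits the transaction utility is 1*2.0 + 3*1.0 + 3*3.0 + 4*0.5.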
However, existing utility mining algorithms cannot readily analyze the differences between databases, and in many cases what is needed are patterns that have high utility in one database and low utility in another. For example, to support business decision-making, a poorly performing data set can be compared with a well-performing one to find out where they differ.
Disclosure of Invention
The invention provides a utility-based sequence database contrast mining method, device, and computer equipment. Specifically, utility is combined with contrast sequence patterns to define the utility-driven contrast sequence pattern, and an efficient mining algorithm is provided for finding the utility contrast sequence patterns in two input databases. Utility contrast sequence patterns are sequences whose utility values in the two databases differ greatly (by more than a predefined contrast threshold), so they explain the differences between the two databases well.
In a first aspect of the present invention, a utility-based sequence database contrast mining method is provided, the method comprising:
preprocessing the two sequence databases to be contrast-mined;
representing the search spaces of the two preprocessed sequence databases with two LQS-trees, and taking the common part of the two LQS-trees as the search tree for contrast mining of the sequence databases;
pruning the search tree with contrast utility upper bounds, traversing each node of the search tree according to a pruning strategy, and outputting the utility contrast sequence patterns.
Further, pruning the search tree with the contrast utility upper bounds, traversing each node of the search tree according to the pruning strategy, and outputting the utility contrast sequence patterns specifically comprises:
S201, building the data matrices of the two preprocessed sequence databases, and building utility value linked lists for the sequence patterns of length 1 in the data matrices;
S202, determining, from the utility value linked lists, the utility ratio and the contrast utility upper bound of the same sequence pattern in the two sequence databases;
S203, when the utility ratio of the sequence pattern is greater than or equal to the contrast threshold, determining that the sequence pattern is a utility contrast sequence pattern and outputting it; when the contrast utility upper bound of the sequence pattern is greater than or equal to the contrast threshold, taking the sequence pattern as a candidate contrast sequence pattern;
S204, performing item extension and sequence extension on the candidate contrast sequence pattern to generate extended contrast sequence patterns, treating each extended contrast sequence pattern as the sequence pattern, and returning to S202 until an end condition is met.
Further, S203 also comprises: when the utility ratio of the sequence pattern is smaller than the threshold, ending the extension of the sequence pattern and deleting it.
Further, there are three contrast utility upper bounds, expressed as follows:
where $\mathrm{iOUR}(t; D^{+}, D^{-})$ is the contrast utility upper bound, in sequence databases $D^{+}$ and $D^{-}$, of the sequence pattern $t$ obtained by the first type of item extension (adding an item to the last itemset of the sequence pattern), and $\mathrm{iESU}(t, D^{+})$ and $\mathrm{iESU}(t, D^{-})$ are the utility upper bounds in $D^{+}$ and $D^{-}$, respectively, of the sequence pattern $t$ obtained by the first type of item extension; $\mathrm{sOUR}(t; D^{+}, D^{-})$ is the contrast utility upper bound, in $D^{+}$ and $D^{-}$, of the sequence pattern $t$ obtained by the second type of item extension (adding an itemset to the sequence pattern), and $\mathrm{sESU}(t, D^{+})$ and $\mathrm{sESU}(t, D^{-})$ are the utility upper bounds in $D^{+}$ and $D^{-}$, respectively, of the sequence pattern $t$ obtained by the second type of item extension; $\mathrm{miui}(t, D^{-})$ and $\mathrm{miui}(t, D^{+})$ are the minimum utility values, in $D^{-}$ and $D^{+}$ respectively, of the sequence pattern $t$ obtained by the first or second type of item extension; $\mathrm{wOUR}(t; D^{+}, D^{-})$ is the contrast utility upper bound, in $D^{+}$ and $D^{-}$, of the sequence pattern $t$ obtained by extending the sequence pattern $t'$ by one item in sequence extension; $\mathrm{DIU}(t, D^{+})$ and $\mathrm{DIU}(t, D^{-})$ are the utility upper bounds in $D^{+}$ and $D^{-}$, respectively, of the sequence pattern $t$ obtained in sequence extension; and $\mathrm{miui}(t', D^{-})$ and $\mathrm{miui}(t', D^{+})$ are the minimum utility values of the sequence pattern $t'$ in $D^{-}$ and $D^{+}$, respectively.
Further, the pruning strategy specifically comprises:
the extended contrast sequence patterns comprise item-extension descendant nodes and sequence-extension descendant nodes; when the item-extension descendant nodes are traversed, two extension item sets are obtained by scanning the utility value linked lists of the two sequence databases respectively, and each element in the intersection of the two extension item sets is joined with the prefix sequence of the item-extension descendant node to generate a new candidate contrast sequence pattern.
In a second aspect of the present invention, a utility-based sequence database contrast mining device is provided, comprising:
a sequence database preprocessing module, configured to preprocess the two sequence databases to be contrast-mined;
a search tree confirmation module, configured to represent the search spaces of the two preprocessed sequence databases with two LQS-trees and to take the common part of the two LQS-trees as the search tree for contrast mining of the sequence databases;
and a contrast mining module, configured to prune the search tree with contrast utility upper bounds, traverse each node of the search tree according to a pruning strategy, and output the utility contrast sequence patterns.
Further, the specific process of the contrast mining module comprises:
S201, building the data matrices of the two preprocessed sequence databases, and building utility value linked lists for the sequence patterns of length 1 in the data matrices;
S202, determining, from the utility value linked lists, the utility ratio and the contrast utility upper bound of the same sequence pattern in the two sequence databases;
S203, when the utility ratio of the sequence pattern is greater than or equal to the contrast threshold, determining that the sequence pattern is a utility contrast sequence pattern and outputting it; when the contrast utility upper bound of the sequence pattern is greater than or equal to the contrast threshold, taking the sequence pattern as a candidate contrast sequence pattern;
S204, performing item extension and sequence extension on the candidate contrast sequence pattern to generate extended contrast sequence patterns, treating each extended contrast sequence pattern as the sequence pattern, and returning to S202 until an end condition is met.
Further, S203 in the contrast mining module also comprises: when the utility ratio of the sequence pattern is smaller than the threshold, ending the extension of the sequence pattern and deleting it.
In a third aspect of the present invention, a utility-based sequence database contrast mining device is provided, comprising: a processor; and a memory storing a computer-executable program that, when executed by the processor, performs the utility-based sequence database contrast mining method described above.
In a fourth aspect of the invention, a computer-readable storage medium is provided having instructions stored thereon which, when executed by a processor, cause the processor to perform the utility-based sequence database contrast mining method described above.
The utility-based sequence database contrast mining method, device, and computer equipment provided by the invention preprocess the two sequence databases to be contrast-mined and remove irrelevant information; represent the search spaces of the two preprocessed sequence databases with two LQS-trees and take the common part of the two LQS-trees as the search tree for contrast mining, thereby defining a new search space from the properties of the contrast mining algorithm and filtering out useless search nodes; and prune the search tree with contrast utility upper bounds, traverse each node of the search tree according to a pruning strategy, and output the utility contrast sequence patterns, which reduces the search space and improves contrast mining performance. The resulting benefit is that utility is combined with contrast sequence patterns to define the utility-driven contrast sequence pattern, and an efficient mining algorithm is provided for finding the utility contrast sequence patterns in the two input databases. Utility contrast sequence patterns are sequences whose utility values in the two databases differ greatly (by more than a predefined contrast threshold), so they explain the differences between the two databases well.
Drawings
FIG. 1 is a flow chart of a utility-based sequence database contrast mining method in an embodiment of the present invention;
FIG. 2 is an example diagram of the search space of a high-utility sequence pattern mining algorithm in an embodiment of the present invention;
FIG. 3 is an example diagram of the search space tree of the UCPM algorithm in an embodiment of the present invention;
FIG. 5 is an example diagram of the Utility-chains corresponding to the sequence <a> in the databases of Table 1 and Table 2 in an embodiment of the present invention;
FIG. 6 is a flow chart of traversing each node of the search tree according to the pruning strategy and outputting the utility contrast sequence patterns in an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a utility-based sequence database contrast mining device in an embodiment of the present invention;
FIG. 8 shows the architecture of a computer device in an embodiment of the invention.
Detailed Description
To describe the technical solution of the present invention in further detail, this embodiment is implemented on the premise of the technical solution of the present invention, and detailed implementations and specific steps are given.
The embodiment takes an application in the field of business intelligence as an example and provides a utility-based sequence database contrast mining method. As shown in fig. 1, the method comprises the following steps:
S101: preprocessing the two sequence databases to be contrast-mined;
In a specific embodiment, the preprocessing includes data collection and cleaning: the two sequence databases to be analyzed and contrasted are preprocessed to remove irrelevant information, formatted, and input into the contrast mining system. The formatted data are shown in Tables 1-3:
Table 1. Input database $D^{+}$
Table 2. Input database $D^{-}$
Table 3. Profit list of goods
S102: representing the search spaces of the two preprocessed sequence databases with two LQS-trees, and taking the common part of the two LQS-trees as the search tree for contrast mining of the sequence databases;
The UCPM (utility-driven contrast pattern mining) algorithm mines, from the two contrasted sequence databases, the utility contrast sequence patterns whose contrast exceeds a predefined value. The two formatted contrast transaction sequence databases, the commodity profit list, and the user-specified utility ratio are taken as the input of the UCPM algorithm, which mines the two databases and outputs the utility contrast sequence patterns to help the user analyze and make decisions.
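As a rough illustration of the contrast test, the sketch below assumes that the utility ratio of a pattern is its utility in D+ divided by its utility in D-. That direction, the handling of a zero denominator, and the function names are assumptions made for illustration, not definitions taken from the source.

```python
def utility_ratio(u_plus: float, u_minus: float) -> float:
    """Assumed ratio: utility of a pattern in D+ over its utility in D-."""
    if u_minus == 0:
        # Assumption: a pattern with no utility in D- contrasts maximally
        # if it has any utility in D+, and not at all otherwise.
        return float("inf") if u_plus > 0 else 0.0
    return u_plus / u_minus


def is_utility_contrast_pattern(u_plus: float, u_minus: float, delta: float) -> bool:
    """A pattern qualifies when its ratio reaches the user-specified threshold."""
    return utility_ratio(u_plus, u_minus) >= delta


# Hypothetical utilities of one pattern in the two databases.
print(is_utility_contrast_pattern(30.0, 10.0, delta=2.5))
print(is_utility_contrast_pattern(10.0, 30.0, delta=2.5))
```

Under this assumed direction, a pattern that is strong in D- but weak in D+ would be found by swapping the roles of the two databases.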
The search space of conventional high-utility sequence pattern mining is represented by an LQS-tree, as shown in fig. 2, in which each node represents a candidate sequence pattern. To complete the mining task, the nodes of the tree are traversed, and the utility value of each node is calculated and checked against the preset value.
In the specific embodiment, a new search space is defined from the properties of the contrast mining algorithm, and useless search nodes are filtered out. Because a utility contrast pattern is a sequence whose utility differs greatly between the two contrasted databases, a candidate node that may become a result pattern must appear in the LQS-trees of both sequence databases. As shown in fig. 3, the common part of the LQS-trees of the two sequence databases constitutes the search space of the UCPM algorithm.
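The common-part idea can be sketched as a membership test: a candidate stays in the search space only if it occurs in at least one sequence of each database. The helper names and toy databases below are hypothetical, and quantities are omitted because only occurrence matters for this check.

```python
from typing import FrozenSet, List, Sequence

Itemset = FrozenSet[str]
Pattern = Sequence[Itemset]


def occurs(pattern: Pattern, sequence: Sequence[Itemset]) -> bool:
    """True if the pattern's itemsets can be matched, in order, to
    itemsets of the sequence that contain them (greedy matching)."""
    i = 0
    for itemset in sequence:
        if i < len(pattern) and pattern[i] <= itemset:
            i += 1
    return i == len(pattern)


def in_common_search_space(pattern: Pattern,
                           db_plus: List[Sequence[Itemset]],
                           db_minus: List[Sequence[Itemset]]) -> bool:
    """Keep a candidate only if it appears in both databases."""
    return (any(occurs(pattern, s) for s in db_plus)
            and any(occurs(pattern, s) for s in db_minus))


fs = frozenset
db_plus = [[fs("ab"), fs("c")], [fs("a"), fs("d")]]   # toy D+
db_minus = [[fs("a"), fs("c")], [fs("b"), fs("c")]]   # toy D-

print(in_common_search_space([fs("a"), fs("c")], db_plus, db_minus))
print(in_common_search_space([fs("ab")], db_plus, db_minus))
```

In the toy data, <{a},{c}> occurs in both databases and is kept, while <{a,b}> occurs only in the first and is filtered out.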
Further, the embodiment uses projected databases and compact data structures, namely the CU-matrix (hereinafter, matrix) and the Utility-chain. Because traversing a node requires only the database information related to that node, the run-time data structures take the form of a projected database: the matrix and the Utility-chain, as shown in fig. 4 and fig. 5. The matrix stores the utility value of each item of each sequence in the database; the Utility-chain stores the utility values and position information of the candidate sequence patterns formed during mining.
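A minimal sketch of how these two structures might be laid out is shown below. The layout is an assumption for illustration (a per-sequence item-by-position table for the matrix, and per-sequence lists of (position, utility) pairs for the chain of a 1-sequence); the figures referenced above may organize the data differently, and all utilities are assumed positive so that 0 can mark an absent item.

```python
from typing import Dict, List, Tuple

QSequence = List[List[Tuple[str, int]]]      # itemsets of (item, quantity)
Matrix = Dict[str, List[float]]              # item -> utility per position
Chain = Dict[int, List[Tuple[int, float]]]   # sequence id -> (position, utility)


def build_matrix(seq: QSequence, profit: Dict[str, float]) -> Matrix:
    """Rows are items, columns are itemset positions; 0.0 means absent."""
    items = sorted({item for itemset in seq for item, _ in itemset})
    matrix: Matrix = {item: [0.0] * len(seq) for item in items}
    for pos, itemset in enumerate(seq):
        for item, qty in itemset:
            matrix[item][pos] = qty * profit[item]
    return matrix


def build_chain(item: str, matrices: List[Matrix]) -> Chain:
    """Chain of the 1-sequence <item>: every occurrence with its utility."""
    chain: Chain = {}
    for sid, matrix in enumerate(matrices):
        hits = [(pos, u) for pos, u in enumerate(matrix.get(item, [])) if u > 0]
        if hits:
            chain[sid] = hits
    return chain


profit = {"a": 1.0, "b": 2.0}                               # hypothetical profits
db = [[[("a", 2), ("b", 1)], [("a", 1)]], [[("b", 3)]]]     # toy database
matrices = [build_matrix(s, profit) for s in db]
chain_a = build_chain("a", matrices)
print(matrices[0], chain_a)
```

Only sequences that contain the item get an entry in its chain, which is what makes the chain act as a projected view of the database.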
S103: pruning the search tree with contrast utility upper bounds, traversing each node of the search tree according to a pruning strategy, and outputting the utility contrast sequence patterns.
The method prunes with tight utility upper bounds. The utility upper bound is an important concept in high-utility sequence pattern mining: all high-utility sequence pattern mining algorithms use some utility upper bound to prune the search space and thereby improve efficiency. The utility upper bound of a sequence pattern must satisfy two conditions: first, the upper bound of the sequence pattern is not less than its utility value; second, the upper bound of the sequence pattern is not less than the upper bound of any of its extended sequences. Therefore, when the utility upper bound of a sequence pattern is smaller than the threshold, the utility value of every extended sequence is necessarily smaller than the threshold, so extension of the sequence pattern can stop and the search space is reduced. The invention provides three tight upper bounds for contrast utility, namely the itemset and sequence optimizations of utility ratio (iOUR and sOUR) and the width optimization of utility ratio (wOUR), calculated as follows:
where $\mathrm{iOUR}(t; D^{+}, D^{-})$ is the contrast utility upper bound, in sequence databases $D^{+}$ and $D^{-}$, of the sequence pattern $t$ obtained by the first type of item extension, and $\mathrm{iESU}(t, D^{+})$ and $\mathrm{iESU}(t, D^{-})$ are the utility upper bounds in $D^{+}$ and $D^{-}$, respectively, of the sequence pattern $t$ obtained by the first type of item extension; $\mathrm{sOUR}(t; D^{+}, D^{-})$ is the contrast utility upper bound, in $D^{+}$ and $D^{-}$, of the sequence pattern $t$ obtained by the second type of item extension, and $\mathrm{sESU}(t, D^{+})$ and $\mathrm{sESU}(t, D^{-})$ are the utility upper bounds in $D^{+}$ and $D^{-}$, respectively, of the sequence pattern $t$ obtained by the second type of item extension; $\mathrm{miui}(t, D^{-})$ and $\mathrm{miui}(t, D^{+})$ are the minimum utility values, in $D^{-}$ and $D^{+}$ respectively, of the sequence pattern $t$ obtained by the first or second type of item extension; $\mathrm{wOUR}(t; D^{+}, D^{-})$ is the contrast utility upper bound, in $D^{+}$ and $D^{-}$, of the sequence pattern $t$ obtained by extending the sequence pattern $t'$ by one item in sequence extension; $\mathrm{DIU}(t, D^{+})$ and $\mathrm{DIU}(t, D^{-})$ are the utility upper bounds in $D^{+}$ and $D^{-}$, respectively, of the sequence pattern $t$ obtained in sequence extension; and $\mathrm{miui}(t', D^{-})$ and $\mathrm{miui}(t', D^{+})$ are the minimum utility values of the sequence pattern $t'$ in $D^{-}$ and $D^{+}$, respectively. Here the first type of item extension adds one item to the last itemset of the sequence pattern, and the second type of item extension appends one itemset after the last itemset of the sequence pattern.
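The reason the two conditions on an upper bound make pruning safe can be written out in one line. The derivation below is a sketch in assumed notation: $\mathrm{UB}$ stands for any of the upper bounds, $u$ for the quantity it bounds, $\delta$ for the threshold, and $t' \sqsupseteq t$ means that $t'$ is obtained from $t$ by zero or more extensions; the symbols $\mathrm{UB}$ and $\sqsupseteq$ are introduced here for illustration only.

```latex
% Condition 1: UB(t') >= u(t') for every pattern t'.
% Condition 2: UB(t) >= UB(t') whenever t' extends t
%              (applied repeatedly along the extension path).
\[
  \mathrm{UB}(t) < \delta
  \;\Longrightarrow\;
  u(t') \le \mathrm{UB}(t') \le \mathrm{UB}(t) < \delta
  \quad \text{for every } t' \sqsupseteq t .
\]
```

So once the bound of a node falls below the threshold, neither the node nor any of its descendants can qualify, and the whole subtree can be skipped.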
In the specific implementation, the three utility upper bounds iESU, sESU, and DIU and the minimum utility value miui are defined as follows:
Assume that the sequence pattern $t$ has an extensible position $P$ in a sequence $S$ of the database; then
where $u(t, P, S)$ is the utility value of the sequence pattern $t$ at the extensible position $P$ in $S$, and $u(S/(t, P))$ is the sum of the utility values of all items of $S$ after the extensible position $P$.
The iESU and sESU of the sequence pattern $t$ in the sequence $S$ are defined as
Further, the iESU and sESU of the sequence pattern $t$ in the database $D$ are defined as
An example for the sequence pattern <{a, b}> is given below:
Assuming that the sequence pattern $t$ is generated by extending the sequence pattern $t'$ by one item, the DIU of $t$ in the sequence $S$ is defined as
Further, the DIU of the sequence pattern $t$ in the database $D$ is defined as
An example for the sequence pattern <{a, b}> is given below:
The minimum utility of the sequence pattern $t$ in the database $D$ is defined as
An example for the sequence pattern <{a, b}> is given below:
Combining the definitions and calculation methods of the utility upper bounds and the minimum utility given above, each contrast utility upper bound is calculated below, taking the sequence <{a, b}> as an example.
Further, pruning the search tree with the contrast utility upper bounds, traversing each node of the search tree according to the pruning strategy, and outputting the utility contrast sequence patterns, as shown in fig. 6, specifically comprises:
S201, building the data matrices of the two preprocessed sequence databases, and building utility value linked lists for the sequence patterns of length 1 in the data matrices;
S202, determining, from the utility value linked lists, the utility ratio and the contrast utility upper bound of the same sequence pattern in the two sequence databases;
S203, when the utility ratio of the sequence pattern is greater than or equal to the contrast threshold, determining that the sequence pattern is a utility contrast sequence pattern and outputting it; when the contrast utility upper bound of the sequence pattern is greater than or equal to the contrast threshold, taking the sequence pattern as a candidate contrast sequence pattern;
S204, performing item extension and sequence extension on the candidate contrast sequence pattern to generate extended contrast sequence patterns, treating each extended contrast sequence pattern as the sequence pattern, and returning to S202 until an end condition is met.
Further, S203 also comprises: when the utility ratio of the sequence pattern is smaller than the threshold, ending the extension of the sequence pattern and deleting it.
Further, the pruning strategy specifically comprises:
the extended contrast sequence patterns comprise item-extension descendant nodes and sequence-extension descendant nodes; when the item-extension descendant nodes are traversed, two extension item sets are obtained by scanning the utility value linked lists of the two sequence databases respectively, and each element in the intersection of the two extension item sets is joined with the prefix sequence of the item-extension descendant node to generate a new candidate contrast sequence pattern.
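The intersection step can be sketched as follows. The function name and the toy inputs are hypothetical, and the two extension item sets are passed in directly, whereas the embodiment obtains them by scanning the utility value linked lists of the two databases.

```python
from typing import FrozenSet, List, Set

Pattern = List[FrozenSet[str]]


def item_extension_candidates(prefix: Pattern,
                              ext_plus: Set[str],
                              ext_minus: Set[str]) -> List[Pattern]:
    """Join each item that is extensible in BOTH databases with the prefix
    by adding it to the prefix's last itemset (prefix must be non-empty)."""
    common = sorted(ext_plus & ext_minus)   # sorted for a stable order
    head, last = prefix[:-1], prefix[-1]
    return [head + [last | {item}] for item in common]


prefix = [frozenset({"a"})]
candidates = item_extension_candidates(prefix, {"b", "c"}, {"c", "d"})
print(candidates)
```

Items that can extend the prefix in only one database (here b and d) never produce a candidate, which keeps the traversal inside the common search space.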
In a specific embodiment, the calculated upper bounds effectively prune the search space. Once wOUR(<{ab}>) of the sequence <{ab}> has been obtained, if wOUR(<{ab}>) is smaller than the predefined contrast threshold, neither the sequence <{ab}> nor its extended sequences need to be traversed. When wOUR(<{ab}>) meets the requirement, the upper bounds iOUR(<{ab}>) and sOUR(<{ab}>) are calculated; only when iOUR(<{ab}>) and sOUR(<{ab}>) meet the preset contrast utility condition is the utility ratio of the current sequence calculated and the item (sequence) extension continued on it.
In the specific embodiment, the flow of the UCPM (utility-driven contrast pattern mining) algorithm is as follows:
S1001, the input comprises the two formatted contrast databases $D^{+}$ and $D^{-}$, an external utility table, and a minimum utility contrast threshold δ; the output is the complete set of utility contrast sequence patterns of the two input databases $D^{+}$ and $D^{-}$.
S1002, the CU-matrices of the two databases are built, and Utility-chains are built for the sequences of length 1 (1-sequences). For each 1-sequence common to the two databases, UCPM first calculates and checks its utility ratio, and outputs the 1-sequence when the ratio meets the condition;
S1003, the descendant nodes of a 1-sequence node are divided into item-extension descendants and sequence-extension descendants, and the item extension method and the sequence extension method are called respectively to generate and process the corresponding descendant nodes, with the item-extension descendant nodes traversed first. For each type of descendant node, a depth pruning strategy (iOUR or sOUR) is applied: traversal of the descendant nodes continues only when the corresponding utility ratio upper bound is greater than or equal to the contrast threshold δ;
S1004, when item-extension descendant nodes are processed, an extensible item set is obtained from each database by scanning the corresponding Utility-chains, and each element in the intersection of the two sets is joined with the prefix sequence to generate a new candidate sequence. For each newly generated sequence, a width pruning strategy is applied: wOUR is calculated first, and only when wOUR is greater than or equal to δ are the corresponding Utility-chains generated and the utility upper bounds and other parameters calculated; the subsequent processing is the same as for the parent node. The algorithm ends when no new sequences are generated.
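The control flow of steps S1002 to S1004 can be sketched as a recursive traversal. This is a skeleton under stated assumptions, not the patented implementation: the bound and ratio computations are passed in as hypothetical callables (`utility_ratio`, `iour`, `sour`, `wour`), as are the child generators, and it assumes that iOUR gates the item-extension descendants while sOUR gates the sequence-extension descendants.

```python
from typing import Callable, Hashable, Iterable, List

Pattern = Hashable


def ucpm_skeleton(roots: Iterable[Pattern],
                  utility_ratio: Callable[[Pattern], float],
                  iour: Callable[[Pattern], float],
                  sour: Callable[[Pattern], float],
                  wour: Callable[[Pattern], float],
                  item_children: Callable[[Pattern], Iterable[Pattern]],
                  seq_children: Callable[[Pattern], Iterable[Pattern]],
                  delta: float) -> List[Pattern]:
    """Traverse the common search tree and collect qualifying patterns."""
    results: List[Pattern] = []

    def visit(t: Pattern) -> None:
        if utility_ratio(t) >= delta:          # output test
            results.append(t)
        if iour(t) >= delta:                   # depth pruning, item extension first
            for child in item_children(t):
                if wour(child) >= delta:       # width pruning
                    visit(child)
        if sour(t) >= delta:                   # depth pruning, sequence extension
            for child in seq_children(t):
                if wour(child) >= delta:       # width pruning
                    visit(child)

    for root in roots:                         # 1-sequences common to both databases
        visit(root)
    return results
```

Passing the bounds as callables keeps the skeleton independent of how iOUR, sOUR, and wOUR are actually computed, so the pruning order can be checked with simple lookup tables before any real utility computation is plugged in.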
Hereinafter, a device corresponding to the method shown in fig. 1, namely a utility-based sequence database contrast mining device 300 according to an embodiment of the present disclosure, is described with reference to fig. 7, which is a schematic structural diagram of the device. Since the functions of the device 300 are the same as the details of the method described above with reference to fig. 1, a detailed description is omitted here for simplicity. As shown in fig. 7, the device 300 includes: a sequence database preprocessing module 301, configured to preprocess the two sequence databases to be contrast-mined; a search tree confirmation module 302, configured to represent the search spaces of the two preprocessed sequence databases with two LQS-trees and to take the common part of the two LQS-trees as the search tree for contrast mining of the sequence databases; and a contrast mining module 303, configured to prune the search tree with contrast utility upper bounds, traverse each node of the search tree according to a pruning strategy, and output the utility contrast sequence patterns. The device 300 may include other components in addition to these three modules; however, since those components are not related to the embodiments of the present disclosure, their illustration and description are omitted here.
Further, the specific process of the contrast mining module 303 comprises:
S201, building the data matrices of the two preprocessed sequence databases, and building utility value linked lists for the sequence patterns of length 1 in the data matrices;
S202, determining, from the utility value linked lists, the utility ratio and the contrast utility upper bound of the same sequence pattern in the two sequence databases;
S203, when the utility ratio of the sequence pattern is greater than or equal to the contrast threshold, determining that the sequence pattern is a utility contrast sequence pattern and outputting it; when the contrast utility upper bound of the sequence pattern is greater than or equal to the contrast threshold, taking the sequence pattern as a candidate contrast sequence pattern;
S204, performing item extension and sequence extension on the candidate contrast sequence pattern to generate extended contrast sequence patterns, treating each extended contrast sequence pattern as the sequence pattern, and returning to S202 until an end condition is met.
Further, S203 in the contrast mining module 303 also comprises: when the utility ratio of the sequence pattern is smaller than the threshold, ending the extension of the sequence pattern and deleting it.
For the specific working process of the utility-based sequence database contrast mining device 300, refer to the description of the utility-based sequence database contrast mining method above; it is not repeated here.
Furthermore, an apparatus according to an embodiment of the present invention may also be implemented by means of the architecture of the computing device shown in fig. 8. As shown in fig. 8, the computing device may include a computer system 401, a system bus 403, one or more CPUs 404, input/output components 402, a memory 405, and the like. The memory 405 may store various data or files used in computer processing and/or communications, as well as program instructions executed by the CPU. The architecture shown in fig. 8 is merely exemplary, and one or more of the components in fig. 8 may be adjusted as needed to implement different devices.
Embodiments of the invention may also be implemented as a computer-readable storage medium. A computer-readable storage medium according to an embodiment has computer-readable instructions stored thereon. The computer readable instructions, when executed by a processor, may perform a method according to embodiments of the invention as described with reference to the above figures.
In summary, with the utility-based sequence database contrast mining method, apparatus and computer device provided by the above embodiments, the two sequence databases to be contrast-mined are preprocessed to remove irrelevant information; the search spaces of the two preprocessed sequence databases are represented by two LQS-trees, respectively, and the common part of the two LQS-trees is determined as the search tree for contrast mining, so that a new search space is defined using the properties of the contrast algorithm and useless search nodes are filtered out; and the search tree is pruned using contrast utility upper bounds, each node of the search tree is traversed according to a pruning strategy, and utility contrast sequence patterns are output, which reduces the search space and improves contrast mining performance. The beneficial effects finally achieved are as follows: utility is combined with contrast sequence patterns to define a utility-driven contrast sequence pattern, and an efficient mining algorithm is provided to find the utility contrast sequence patterns in the two input databases. Utility contrast sequence patterns are those sequences whose utility values in the two databases differ greatly (beyond a predefined contrast threshold), and they can explain the differences between the two databases well.
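A small worked example may make the notion of a utility contrast sequence pattern concrete. The sketch below assumes the definition commonly used in high-utility sequential pattern mining (the utility of a pattern in one sequence is the maximum over its occurrences, and its utility in a database is the sum over sequences) and assumes the utility ratio is the utility in D+ divided by the utility in D-. Both are assumptions for illustration, and the function names are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Set

QSequence = List[Dict[str, float]]   # each element maps item -> utility


def sequence_utility(pattern: Sequence[Set[str]], qseq: QSequence) -> float:
    """Maximum utility over all occurrences of `pattern` in one sequence (0 if none)."""
    # best[k] = best utility of matching the first k pattern elements so far.
    best: List[Optional[float]] = [0.0] + [None] * len(pattern)
    for element in qseq:
        # Go backwards so one element of the sequence is used at most once per match.
        for k in range(len(pattern), 0, -1):
            prev = best[k - 1]
            if prev is not None and all(i in element for i in pattern[k - 1]):
                gain = prev + sum(element[i] for i in pattern[k - 1])
                if best[k] is None or gain > best[k]:
                    best[k] = gain
    return best[-1] or 0.0


def database_utility(pattern: Sequence[Set[str]], db: List[QSequence]) -> float:
    """Sum of the pattern's utility over every sequence of the database."""
    return sum(sequence_utility(pattern, qseq) for qseq in db)


def utility_ratio(pattern: Sequence[Set[str]], d_pos: List[QSequence],
                  d_neg: List[QSequence]) -> float:
    """Assumed ratio u(t, D+) / u(t, D-); infinite if the pattern is absent from D-."""
    u_pos = database_utility(pattern, d_pos)
    u_neg = database_utility(pattern, d_neg)
    if u_neg == 0:
        return float("inf") if u_pos > 0 else 0.0
    return u_pos / u_neg


d_pos = [[{"a": 2, "b": 3}, {"c": 4}], [{"a": 1}, {"c": 6}, {"c": 1}]]
d_neg = [[{"a": 1}, {"c": 1}]]
pattern = [{"a"}, {"c"}]
# Utilities: 6 + 7 = 13 in d_pos and 2 in d_neg, so the ratio is 6.5.
print(utility_ratio(pattern, d_pos, d_neg))
```

With a contrast threshold of, say, 3, the pattern would be reported under these assumptions, since its utility in the first database is 6.5 times its utility in the second.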
In this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process or method.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.
Claims (10)
1. A utility-based sequence database contrast mining method, the method comprising:
preprocessing two sequence databases to be contrast-mined;
representing the search spaces of the two preprocessed sequence databases with two LQS-trees, respectively, and determining the common part of the two LQS-trees as a search tree for contrast mining of the sequence databases;
pruning the search tree using contrast utility upper bounds, traversing each node of the search tree according to a pruning strategy, and outputting utility contrast sequence patterns.
2. The utility-based sequence database contrast mining method according to claim 1, wherein pruning the search tree using contrast utility upper bounds, traversing each node of the search tree according to a pruning strategy, and outputting utility contrast sequence patterns comprises:
S201, establishing data matrices for the two preprocessed sequence databases, and establishing utility value linked lists from the sequence patterns of length 1 in the data matrices;
S202, determining, from the utility value linked lists, the utility ratio and the contrast utility upper bound of the same sequence pattern in the two sequence databases;
S203, when the utility ratio of the same sequence pattern is greater than or equal to a contrast threshold, determining the same sequence pattern to be a utility contrast sequence pattern and outputting it; when the contrast utility upper bound of the same sequence pattern is greater than or equal to the contrast threshold, taking the same sequence pattern as a candidate contrast sequence pattern;
S204, performing item extension and sequence extension on the candidate contrast sequence pattern to generate extended contrast sequence patterns, taking each extended contrast sequence pattern as the same sequence pattern, and returning to S202 until an end condition is met.
3. The utility-based sequence database contrast mining method according to claim 2, wherein S203 further comprises: when the utility ratio of the same sequence pattern is smaller than a threshold, stopping the extension of the same sequence pattern and deleting it.
4. The utility-based sequence database contrast mining method according to claim 2, wherein the contrast utility upper bound comprises three types, and the specific expression is as follows:
wherein $\mathrm{iOUR}(t; D^+, D^-)$ is the contrast upper bound, in the sequence databases $D^+$ and $D^-$, of the sequence pattern $t$ obtained by the first type of item extension, which adds an item to the last item set of the sequence pattern; $\mathrm{iESU}(t, D^+)$ and $\mathrm{iESU}(t, D^-)$ are, respectively, the upper bounds of the utility value in $D^+$ and $D^-$ of the sequence pattern $t$ obtained by the first type of item extension; $\mathrm{sOUR}(t; D^+, D^-)$ is the contrast upper bound, in $D^+$ and $D^-$, of the sequence pattern $t$ obtained by the second type of item extension, which adds an item set at the end of the sequence pattern; $\mathrm{sESU}(t, D^+)$ and $\mathrm{sESU}(t, D^-)$ are, respectively, the upper bounds of the utility value in $D^+$ and $D^-$ of the sequence pattern $t$ obtained by the second type of item extension; $\mathrm{miui}(t, D^-)$ and $\mathrm{miui}(t, D^+)$ are, respectively, the minimum utility values occurring in $D^-$ and $D^+$ of the sequence pattern $t$ obtained by the first or second type of item extension; $\mathrm{wOUR}(t; D^+, D^-)$ is the contrast upper bound, in $D^+$ and $D^-$, of the sequence pattern $t$ obtained by extending the sequence pattern $t'$ by one item in sequence extension; $\mathrm{DIU}(t, D^+)$ and $\mathrm{DIU}(t, D^-)$ are, respectively, the upper bounds of the utility value in $D^+$ and $D^-$ of the sequence pattern $t$ obtained by sequence extension; and $\mathrm{miui}(t', D^-)$ and $\mathrm{miui}(t', D^+)$ are, respectively, the minimum utility values of the sequence pattern $t'$ occurring in $D^-$ and $D^+$.
5. The utility-based sequence database contrast mining method according to claim 3, wherein the pruning strategy further comprises:
the extended contrast sequence patterns comprise item-extension descendant nodes and sequence-extension descendant nodes; when the item-extension descendant nodes are traversed, two extension item sets are obtained by scanning the utility value linked lists corresponding to the two sequence databases, respectively, and each element in the intersection of the two extension item sets is concatenated with the prefix sequence of the item-extension descendant node to generate a new candidate contrast sequence pattern.
6. A utility-based sequence database contrast mining apparatus, the apparatus comprising:
a sequence database preprocessing module, configured to preprocess two sequence databases to be contrast-mined;
a search tree confirmation module, configured to represent the search spaces of the two preprocessed sequence databases with two LQS-trees, respectively, and to determine the common part of the two LQS-trees as a search tree for contrast mining of the sequence databases;
and a contrast mining module, configured to prune the search tree using contrast utility upper bounds, traverse each node of the search tree according to a pruning strategy, and output utility contrast sequence patterns.
7. The apparatus of claim 6, wherein the specific process of the contrast mining module comprises:
S201, establishing data matrices for the two preprocessed sequence databases, and establishing utility value linked lists from the sequence patterns of length 1 in the data matrices;
S202, determining, from the utility value linked lists, the utility ratio and the contrast utility upper bound of the same sequence pattern in the two sequence databases;
S203, when the utility ratio of the same sequence pattern is greater than or equal to a contrast threshold, determining the same sequence pattern to be a utility contrast sequence pattern and outputting it; when the contrast utility upper bound of the same sequence pattern is greater than or equal to the contrast threshold, taking the same sequence pattern as a candidate contrast sequence pattern;
S204, performing item extension and sequence extension on the candidate contrast sequence pattern to generate extended contrast sequence patterns, taking each extended contrast sequence pattern as the same sequence pattern, and returning to S202 until an end condition is met.
8. The apparatus of claim 7, wherein S203 in the contrast mining module further comprises: when the utility ratio of the same sequence pattern is smaller than a threshold, stopping the extension of the same sequence pattern and deleting it.
9. A utility-based sequence database contrast mining device, comprising:
a processor; and a memory, wherein the memory has stored therein a computer-executable program that, when executed by the processor, performs the method of any of claims 1-5.
10. A computer-readable storage medium having stored thereon instructions that, when executed by a processor, cause the processor to perform the method of any one of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110554575.8A CN113377766B (en) | 2021-05-21 | 2021-05-21 | Sequence database contrast mining method and device based on utility and computer equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113377766A (en) | 2021-09-10 |
CN113377766B (en) | 2022-09-13 |
Family
ID=77571453
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN202110554575.8A | Active | CN113377766B (en) | 2021-05-21 | 2021-05-21 | Sequence database contrast mining method and device based on utility and computer equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113377766B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6928368B1 (en) * | 1999-10-26 | 2005-08-09 | The Board Regents, The University Of Texas System | Gene mining system and method |
US20120130964A1 (en) * | 2010-11-18 | 2012-05-24 | Yen Show-Jane | Fast algorithm for mining high utility itemsets |
CN105590237A (en) * | 2015-12-18 | 2016-05-18 | 齐鲁工业大学 | Application of high utility sequential pattern with negative-profit items in electronic commerce business decision making |
CN105868296A (en) * | 2016-03-24 | 2016-08-17 | 银江股份有限公司 | Fast pruning policy based method for drug DDD value data analysis in efficient sequence modes |
US20180307722A1 (en) * | 2016-09-27 | 2018-10-25 | Tencent Technology (Shenzhen) Company Limited | Pattern mining method, high-utility itemset mining method, and related device |
CN108733705A (en) * | 2017-04-20 | 2018-11-02 | 哈尔滨工业大学深圳研究生院 | A kind of effective sequential mode mining method and device |
CN110349678A (en) * | 2019-07-19 | 2019-10-18 | 齐鲁工业大学 | A kind of Chinese medicine marketing system and its working method based on the positive and negative sequence rule digging of effective |
CN111930804A (en) * | 2020-08-07 | 2020-11-13 | 河北工业大学 | Top-k self-adaptive contrast mode mining method based on incomplete net tree |
Non-Patent Citations (3)
Title |
---|
CHOWDHURY FARHAN AHMED: "Mining High Utility Web Access Sequences in Dynamic Web Log Data", 《IEEE XPLORE》 * |
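JERRY CHUN-WEI LIN: "High-Utility Sequential Pattern Mining with Multiple Minimum Utility Thresholds", 《SPRINGER》 * |
WEI QINSHUANG (魏芹双): "Research Progress on Contrast Pattern Mining" (对比模式挖掘研究进展), 《网络安全技术与应用》 (Network Security Technology & Application) * |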
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117407442A (en) * | 2023-12-11 | 2024-01-16 | 珠海大横琴科技发展有限公司 | Mining method and device for judging high utility mode, electronic equipment and medium |
CN117407442B (en) * | 2023-12-11 | 2024-03-19 | 珠海大横琴科技发展有限公司 | Mining method and device for judging high utility mode, electronic equipment and medium |
Also Published As
Publication number | Publication date |
---|---|
CN113377766B (en) | 2022-09-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Lin et al. | An incremental mining algorithm for high utility itemsets | |
CN105320756B (en) | A kind of database association rule digging method based on improvement Apriori algorithm | |
JP3195233B2 (en) | System and method for finding generalized relevant rules in a database | |
Lee et al. | Sliding window filtering: an efficient method for incremental mining on a time-variant database | |
Masseglia et al. | Efficient mining of sequential patterns with time constraints: Reducing the combinations | |
CN113377766B (en) | Sequence database contrast mining method and device based on utility and computer equipment | |
Le et al. | Mining frequent closed inter-sequence patterns efficiently using dynamic bit vectors | |
Truong et al. | Efficient algorithms for mining frequent high utility sequences with constraints | |
Nguyen et al. | An efficient algorithm for mining frequent weighted itemsets using interval word segments | |
Adhikari et al. | Advances in knowledge discovery in databases | |
Huang et al. | US-Rule: Discovering utility-driven sequential rules | |
Wang et al. | Flexible online association rule mining based on multidimensional pattern relations | |
Chand et al. | Target oriented sequential pattern mining using recency and monetary constraints | |
Alsaeedi et al. | An incremental interesting maximal frequent itemset mining based on FP-Growth algorithm | |
CN110471960B (en) | High-utility item set mining method containing negative utility | |
CN104408641A (en) | Brand feature extraction method and system of electronic commerce recommendation model | |
Ioakimidis | Robust reliability under uncertainty conditions by using modified info-gap models with two to four horizons of uncertainty and quantifier elimination | |
Iguchi et al. | A second-order discretization for degenerate systems of stochastic differential equations | |
Paik et al. | A new method for mining association rules from a collection of XML documents | |
CN107688581B (en) | Data model processing method and device | |
KR100430479B1 (en) | System and mechanism for discovering temporal realtion rules from interval data | |
Dalmas et al. | Heuristics for high-utility local process model mining | |
Sasi et al. | Comparative analysis of ARIMA and double exponential smoothing for forecasting rice sales in fair price shop | |
Sethi et al. | Association rule mining: A review | |
Király et al. | Bit-table based biclustering and frequent closed itemset mining in high-dimensional binary data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||