CN111475551A - High average utility sequence pattern mining method under non-overlapping condition - Google Patents


Info

Publication number
CN111475551A
Authority
CN
China
Prior art keywords
pattern
mode
length
candidate
average utility
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010544978.XA
Other languages
Chinese (zh)
Inventor
武优西
耿萌
户倩
雷荣
刘锦
陈明婕
翟景琦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hebei University of Technology
Original Assignee
Hebei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hebei University of Technology filed Critical Hebei University of Technology
Priority to CN202010544978.XA priority Critical patent/CN111475551A/en
Publication of CN111475551A publication Critical patent/CN111475551A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition
    • G06N5/025 Extracting rules from data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 Data mining

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Fuzzy Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a high average utility sequence pattern mining method under a non-overlapping condition, and relates to the technical field of electric digital data processing.

Description

High average utility sequence pattern mining method under non-overlapping condition
Technical Field
The technical solution of the invention relates to the technical field of electric digital data processing, and in particular to a high average utility sequence pattern mining method under a non-overlapping condition.
Background
With the rapid development of information technology and the arrival of the internet era, data is generated in enormous volumes. How to convert this data into useful information has become an urgent problem, and data mining emerged in response. Data mining analyzes and automatically processes massive data to obtain the relationships, trends and regularities it contains. Current research on data mining is divided mainly into association rule mining and sequence pattern mining. Association rule mining finds relationships between different items within the same transaction, with the aim of discovering reliable and representative rules. Sequence pattern mining, one of the most promising research topics, extracts frequently occurring subsequences from a single sequence or a sequence database in order to find associations between items for prediction and planning.
In the big data era, data volume has grown sharply and many finely divided data sets with complex structures have appeared. Traditional sequence pattern mining techniques cannot cope with them, so targeted constraints must be added to sequence pattern mining. Gap constraints curb the explosive growth of subsequences and narrow the mining range, which both meets user requirements and improves running efficiency. The periodic gap constraint is a special gap constraint: under a periodic gap constraint, the allowed gap between every pair of adjacent items of a pattern is the same, whereas under an ordinary gap constraint the allowed gap between each pair of adjacent items may differ. Example A below describes gap constraints in detail.
Example A. $P = p_1[min_1, max_1]p_2 \ldots [min_{j-1}, max_{j-1}]p_j \ldots [min_{m-1}, max_{m-1}]p_m$ is a pattern, where $p_j \in \Sigma$, $m$ is the length of $P$, and $min_j$ and $max_j$ are integers with $0 \le min_j \le max_j$.
$min_{j-1}$ and $max_{j-1}$ are the gap constraints of pattern $P$; they are, respectively, the minimum and the maximum number of wildcards allowed between item $p_{j-1}$ and item $p_j$. If $min_1 = min_2 = \ldots = min_{m-1} = M$ and $max_1 = max_2 = \ldots = max_{m-1} = N$, the pattern is called a pattern with periodic gap constraints, abbreviated as $P = p_1p_2 \ldots p_j \ldots p_m$, $gap = [M, N]$. For example, A[0,1]T[1,3]G is a pattern with gap constraints, and A[0,2]T[0,2]G is a pattern with the periodic gap constraint [0,2].
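The definition in Example A can be sketched in Python as follows. This is only an illustrative sketch: the names `GapPattern` and `is_occurrence` are hypothetical and do not come from the patent, and positions are 1-based to match the notation of the text.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class GapPattern:
    """A pattern p_1[min_1,max_1]p_2...p_m (hypothetical helper type)."""
    items: str                    # the characters p_1 ... p_m
    gaps: List[Tuple[int, int]]   # gaps[j] = (min, max) between items[j] and items[j+1]

    def is_periodic(self) -> bool:
        # Periodic gap constraint: every gap uses the same [M, N].
        return len(set(self.gaps)) <= 1


def is_occurrence(seq: str, pat: GapPattern, positions: Sequence[int]) -> bool:
    """Check whether 1-based positions form an occurrence of pat in seq."""
    if len(positions) != len(pat.items):
        return False
    for j, pos in enumerate(positions):
        if not 1 <= pos <= len(seq) or seq[pos - 1] != pat.items[j]:
            return False
        if j > 0:
            lo, hi = pat.gaps[j - 1]
            wildcards = pos - positions[j - 1] - 1  # characters skipped between the two items
            if not lo <= wildcards <= hi:
                return False
    return True


general = GapPattern("ATG", [(0, 1), (1, 3)])    # A[0,1]T[1,3]G
periodic = GapPattern("ATG", [(0, 2), (0, 2)])   # A[0,2]T[0,2]G, i.e. ATG with gap=[0,2]
```

Storing one `(min, max)` pair per adjacent pair of items keeps the general and the periodic case in the same structure; a periodic pattern is simply one whose pairs are all equal.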
According to how the support is calculated, sequence pattern mining under periodic gap constraints is divided into mining under no special condition, under the one-off condition and under the non-overlapping condition. No special condition means that the number of times a character is used is unrestricted and the character can be reused at any position. The one-off condition means that each character in the sequence can be used at most once. The non-overlapping condition means that the character at a given position in the sequence cannot be reused by the same position of the pattern, but can be reused by different positions of the pattern. Example B below explains the three conditions in detail.
Example B. Given the pattern $P = p_1[min_1, max_1]p_2[min_2, max_2]p_3$ = A[0,1]T[0,1]T, the pattern occurs 5 times in the sequence $S = s_1s_2s_3s_4s_5s_6s_7s_8s_9$ = AATTATTAT; see Fig. 1 of the drawings.
No special condition: the character at any position in sequence S may be reused, that is, no constraint is imposed on occurrences. As shown in Fig. 1, pattern P has 5 occurrences in sequence S, namely <1, 3, 4>, <2, 3, 4>, <2, 4, 6>, <5, 6, 7> and <5, 7, 9>; all 5 are occurrences under no special condition.
One-off condition: the character at any position in sequence S can be used at most once. In Example B the sequence length is 9 and the pattern length is 3, so under the one-off condition the pattern can occur at most 9/3 = 3 times. In fact the support of the pattern is 2, for example the two occurrences <1, 3, 4> and <5, 6, 7>. (The set of occurrences giving support 2 is not unique under the one-off condition; {<1, 3, 4>, <5, 7, 9>} and {<2, 3, 4>, <5, 6, 7>} are also feasible.) However, calculating pattern support under the one-off condition is an NP-hard problem, so sequence pattern mining under the one-off condition is incomplete mining.
Non-overlapping condition: a character in the sequence may be matched in several occurrences, but not by the same position of the pattern. In Example B, the occurrences <1, 3, 4> and <2, 3, 4> overlap, because $s_3$ is matched by $p_2$ in both and $s_4$ is matched by $p_3$ in both. By contrast, <1, 3, 4> and <2, 4, 6> are non-overlapping: although $s_4$ is used twice, it is matched by $p_3$ in the first occurrence and by $p_2$ in the second, which satisfies the non-overlapping constraint. In this example there are 3 occurrences under the non-overlapping condition, namely <1, 3, 4>, <2, 4, 6> and <5, 6, 7>. Thus the non-overlapping condition neither generates a large number of redundant patterns, as no special condition does, nor ignores valuable patterns, as the one-off condition does, so valuable patterns are easier to mine under the non-overlapping condition.
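The three conditions of Example B can be checked with a short Python sketch. The function names `all_occurrences`, `overlaps` and `share_position` are hypothetical and chosen only for illustration; the pattern is written as a plain string with one periodic gap [gmin, gmax], and positions are 1-based as in the text.

```python
from typing import List, Tuple


def all_occurrences(seq: str, pat: str, gmin: int, gmax: int) -> List[Tuple[int, ...]]:
    """Enumerate every occurrence (no special condition), as 1-based positions."""
    found: List[Tuple[int, ...]] = []

    def extend(prefix: Tuple[int, ...]) -> None:
        j = len(prefix)
        if j == len(pat):
            found.append(prefix)
            return
        if j == 0:
            candidates = range(1, len(seq) + 1)
        else:
            last = prefix[-1]
            # Between gmin and gmax wildcards may follow the previous item.
            candidates = range(last + gmin + 1, min(last + gmax + 1, len(seq)) + 1)
        for pos in candidates:
            if seq[pos - 1] == pat[j]:
                extend(prefix + (pos,))

    extend(())
    return found


def overlaps(a: Tuple[int, ...], b: Tuple[int, ...]) -> bool:
    """Non-overlapping condition is violated if some p_j uses the same position in both."""
    return any(x == y for x, y in zip(a, b))


def share_position(a: Tuple[int, ...], b: Tuple[int, ...]) -> bool:
    """One-off condition is violated if any sequence position is used by both."""
    return bool(set(a) & set(b))


occs = all_occurrences("AATTATTAT", "ATT", 0, 1)
print(occs)  # [(1, 3, 4), (2, 3, 4), (2, 4, 6), (5, 6, 7), (5, 7, 9)]
print(overlaps((1, 3, 4), (2, 3, 4)))        # True: s3 and s4 reused by the same p_j
print(overlaps((1, 3, 4), (2, 4, 6)))        # False: s4 is matched by p3, then by p2
print(share_position((1, 3, 4), (2, 4, 6)))  # True: not allowed under the one-off condition
```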
In real applications the number of frequent patterns is too large for them to guide people in finding regularities. Moreover, traditional frequent pattern mining considers only how often each item occurs in a sequence and ignores the utility value of each item, such as profit, price or importance, although these utility values are often important evidence for decision making. For example, in shopping data, traditional frequent pattern mining can only find the most frequently purchased commodities, but in practice sales quantity alone is far from enough; the cost and profit of commodities are of greater research value. Therefore high utility pattern mining was proposed, which attaches a utility value to each item on the basis of frequent pattern mining, and it has become a hot research topic in data mining.
At present, most research on utility-oriented sequence pattern mining considers only the overall utility of a pattern and not the influence of pattern length on utility. In general, the longer a pattern is, the larger its calculated utility, which is unfavorable in practical applications. To mitigate the influence of pattern length and identify the high utility patterns the user really needs, an average utility measure should be used. It is defined as the total utility of a pattern divided by the length of the pattern, and patterns whose average utility is not lower than a user-given threshold are mined as high average utility patterns. Example C below explains how the total utility value and the average utility value of a pattern are calculated.
Example C. Given the pattern $P = p_1[min_1, max_1]p_2[min_2, max_2]p_3$ = A[0,1]T[0,1]T, under the non-overlapping condition its support in the sequence $S = s_1s_2s_3s_4s_5s_6s_7s_8s_9$ = AATTATTAT is 3. With the utility value of item A set to 10 and the utility value of item T set to 5, the total utility value of pattern P is $PU(P) = \sum_{1 \le j \le m} U(p_j) \times sup(P) = (10 + 5 + 5) \times 3 = 60$, and its average utility value is
$PAU(P) = \frac{PU(P)}{m} = \frac{60}{3} = 20$
The sequence pattern mining problem was proposed in 1995, and many classical methods have been developed since. They fall into two types according to the mining method. The first is based on the Apriori property: the database is scanned several times and longer candidate patterns are generated from shorter frequent patterns. The second generates candidate patterns by pattern growth, which reduces the size of the candidate pattern set. For example, the paper "Efficient mining of closed repetitive gapped subsequences from a sequence database, in Proc. IEEE Int. Conf. Data Eng." published by Ding et al. reports non-overlapping sequence pattern mining, but the method does not consider how the external utility of each item affects the importance of a pattern. In a biological sequence, for example, frequency alone may not be enough to find a gene sequence related to a certain disease: a gene may occur infrequently yet be highly significant because of its high expression, whereas a suppressor gene may occur frequently but have no practical significance. The thesis by Xi Ting, "Sequence pattern mining without overlap constraint, Hubei university Master thesis", adopts depth-first and breadth-first mining strategies for frequent sequence pattern mining under the non-overlapping constraint. These works all study frequent sequence pattern mining; they consider only how often a pattern occurs in a sequence and ignore the importance, profit, external utility and so on of each item in the pattern, which leads to the discovery of patterns that occur frequently but are of no use.
Pattern matching is also an important part of the mining problem, and it is required to be complete, flexible and efficient. The paper "Strict pattern matching under non-overlapping condition, Science China Information Sciences" published by Wu et al. uses a nettree structure to count the occurrences of strict pattern matching under the non-overlapping condition and proves the completeness, correctness and effectiveness of the proposed method. However, although that paper studies strict pattern matching with gap constraints, it does not take into account the attributes the user is interested in, so the matching results contain occurrences inconsistent with the intended meaning of the pattern. The paper "NETASPNO: Approximate strict pattern matching under nonoverlapping condition, IEEE Access", also by Wu et al., studies approximate pattern matching based on the Hamming distance and improves efficiency through backtracking-avoidance and pruning strategies, but because it studies approximate matching its results are not exact. All of the above research concerns frequent patterns and considers only how often a pattern occurs; it is more reasonable to incorporate the utility value of each item into the mining process.
Current research on high utility pattern mining proceeds in two stages. First, a candidate set is generated, either with an enumeration tree or by pattern growth. Second, the utility value of each candidate pattern is calculated to judge whether it is a high utility pattern. The paper "A foundational approach to mining itemset utilities from databases" published by Yao et al. first proposed a definition and a mathematical model of high utility pattern mining, and also proposed judging whether a pattern can be a high utility pattern from an estimated value of the pattern; in practice, however, the method generates a large number of candidates, and the time and space cost of mining is high. Erwin et al., in "CTU-Mine: An efficient high utility itemset mining algorithm using the pattern growth approach", proposed a tree-structure-based mining method that performs better on dense data sets. The papers "UP-Growth: an efficient algorithm for high utility itemset mining" and "Efficient algorithms for mining high utility itemsets from transactional databases" published by Tseng et al. design a pattern growth method based on the UP-tree structure, which stores the data in compressed form, and design pruning strategies to reduce time and space overhead. In all of the above methods the utility of a pattern increases with its length, so considering only the overall utility of a pattern has many disadvantages. In a sales data set, for example, a set consisting of many low-profit commodities also yields a high profit, but such a commodity combination has no decision value for the store.
CN110399406A discloses a method, an apparatus and a computer-readable storage medium for mining global high utility sequence patterns. The method uses a linked-list data structure to mine global high utility sequence patterns. Its defect is that it considers only the global utility of a pattern without taking the pattern length into account, so the mining results include overly long patterns that contain useless items. CN109101530A discloses a high utility event sequence pattern mining algorithm for security events. It treats a pattern as a high utility pattern whenever the cumulative sum of the transaction attributes in the pattern is greater than or equal to a given threshold, without considering that accumulating many low-impact transactions can also yield a high utility; its defect is that the high utility patterns include transactions with little influence on the result. CN108733705A discloses a high utility sequence pattern mining method and device for commodity sales; its defect is that the number of occurrences of a pattern in a sequence and the total profit of the commodity combination are not considered, and user requirements are given little consideration.
In summary, the prior art on high utility pattern mining has the defect that, because the pattern utility value does not satisfy the downward closure property, the number of candidate patterns is difficult to reduce.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a high average utility sequence pattern mining method under the non-overlapping condition. The method generates a candidate set by pattern growth and uses a queue data structure to calculate the average utility value of candidate patterns quickly and online, thereby mining high average utility sequence patterns under the non-overlapping condition. It overcomes the defect of the prior art on high utility pattern mining that the number of candidate patterns is difficult to reduce because the pattern utility value does not satisfy the downward closure property, guarantees the completeness of the calculation, greatly reduces the number of candidate patterns through a pruning strategy, and improves the time and space efficiency of the calculation.
The technical solution adopted by the invention to solve this technical problem is as follows: a high average utility sequence pattern mining method under the non-overlapping condition, which generates a candidate set by pattern growth and uses a queue data structure to calculate the average utility value of candidate patterns quickly and online, with the following specific steps:
Step one, read in the sequence database SDB, the minimum gap min, the maximum gap max and the minimum average utility threshold minun:
Read in the given sequence database SDB and determine the total number N of sequences it contains. Denote the sequences in SDB as sequence $S_1$, sequence $S_2$, …, sequence $S_k$, …, sequence $S_N$, where $1 \le k \le N$, and denote the characters contained in sequence $S_k$ as character $s_1$, character $s_2$, …, character $s_n$. Read in the given minimum gap min, maximum gap max and minimum average utility threshold minun;
Step two, generate the high average utility pattern set and the high upper bound pattern set of length 1:
Calculate the average utility value and the average utility upper bound of each character in the sequence database SDB read in step one. Add each character whose average utility value is greater than or equal to the minimum average utility threshold minun to the high average utility pattern set of length 1, and add each character whose average utility upper bound is greater than or equal to minun to the high upper bound pattern set of length 1, thereby generating the high average utility pattern set and the high upper bound pattern set of length 1;
Step three, generate the candidate pattern set of length i+1:
The candidate pattern set of length i+1 is generated from the high upper bound pattern set of length i.
① When i = 1, the characters in the high upper bound pattern set of length 1 obtained in step two are combined pairwise to generate the candidate pattern set of length i+1;
② When i > 1, candidate patterns are generated from prefixes and suffixes. For a pattern $P = p_1p_2 \ldots p_{m-1}p_m$, prefix(P) is the part that remains after removing the last sub-pattern $p_m$ of P, that is, $prefix(P) = p_1p_2 \ldots p_{m-1}$, and suffix(P) is the part that remains after removing the first sub-pattern $p_1$ of P, that is, $suffix(P) = p_2 \ldots p_{m-1}p_m$. When two patterns P and R of length i exist and the suffix of P equals the prefix of R, that is, $suffix(P) = p_2p_3 \ldots p_i = prefix(R) = r_1r_2 \ldots r_{i-1}$, pattern splicing joins P and R into a pattern T of length i+1:
$T = p_1p_2 \ldots p_i r_i$
The specific procedure for generating the candidate pattern set of length i+1 by pattern splicing is as follows:
When the high upper bound pattern set of length i is not empty, traverse it from left to right and take out each pattern $P_a$ in turn. Compute $suffix(P_a)$, then search from left to right for a pattern $P_b$ satisfying $suffix(P_a) = prefix(P_b)$, and splice $P_a$ and $P_b$ into a pattern T of length i+1 that consists of $P_a$ followed by the last sub-pattern of $P_b$. Add T to the candidate pattern set of length i+1, and splice in the same way with every pattern $P_b$ in the high upper bound pattern set that satisfies $suffix(P_a) = prefix(P_b)$. Repeat this for all patterns in the high upper bound pattern set until the last pattern has been spliced, thereby generating the candidate pattern set of length i+1;
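The pattern splicing of step three can be sketched in Python as follows. This is a minimal sketch under the assumption that each sub-pattern is a single character, so a pattern can be held as a string; the function name `splice_candidates` is hypothetical.

```python
from typing import List


def splice_candidates(high_upper: List[str]) -> List[str]:
    """Generate length-(i+1) candidates from the length-i high upper bound set."""
    candidates = []
    for p in high_upper:             # traverse the set from left to right
        suffix = p[1:]               # suffix(P): drop the first sub-pattern
        for r in high_upper:
            if suffix == r[:-1]:     # suffix(P) == prefix(R)
                candidates.append(p + r[-1])   # T = p_1 ... p_i r_i
    return candidates


print(splice_candidates(["A", "T"]))    # ['AA', 'AT', 'TA', 'TT']
print(splice_candidates(["AT", "TT"]))  # ['ATT', 'TTT']
```

For i = 1 both the suffix and the prefix are empty, so every pair of characters is combined, which reproduces case ① without a separate branch.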
Step four, calculate the average utility value and the average utility upper bound of the patterns in the candidate pattern set of length i+1:
(4.1) Calculate in turn the support of each pattern in the candidate pattern set of length i+1 obtained in step three:
First, read a pattern P from the candidate pattern set of length i+1 and, according to the number n of its sub-patterns, create n queues, denoted queue $Q_1$, queue $Q_2$, …, queue $Q_j$, …, queue $Q_n$, $1 \le j \le n$.
Then create nodes in the n queues in turn using a depth-first strategy with backtracking. A node of queue $Q_j$ records a position, in a sequence S of the sequence database SDB read in step one, at which the j-th sub-pattern $p_j$ of pattern P is matched. Under the non-overlapping constraint, the same node may not appear twice in the same queue, but it may appear in different queues. Before a node is created at the tail of queue $Q_j$, it must be determined whether the periodic gap constraint is satisfied and whether the node already exists in queue $Q_j$. A node already exists in queue $Q_j$ in two cases: 1) the node has been used by a previous occurrence; 2) an earlier search through the node failed to find an occurrence. In both cases the node cannot be created at the tail of queue $Q_j$, and the search continues for the next position that satisfies the periodic gap constraint. When a node is created in the last queue, an occurrence has been found. Nodes continue to be added in the same way until the last character of the sequence has been scanned and node creation is finished; the number of nodes in the last queue is the support of the pattern. In this way the support sup(P) of each pattern P in the candidate pattern set of length i+1 obtained in step three is calculated in turn;
(4.2) Calculate the average utility value and the average utility upper bound of the patterns in the candidate pattern set of length i+1 in the following two steps:
① Obtain the average utility value PAU(P) of each pattern in the candidate pattern set of length i+1 according to formula (1):
$PAU(P) = \frac{\sum_{j=1}^{m} U(p_j) \times sup(P)}{m}$ (1),
In formula (1), $U(p_j)$ is the utility value of the j-th item of pattern P in the candidate pattern set, sup(P) is the support of pattern P, and m is the length of pattern P.
② Calculate the average utility upper bound SPU(P) of each pattern in the candidate pattern set of length i+1 according to formula (2):
$SPU(P) = sup(P) \times U_{max}$ (2),
In formula (2), sup(P) is the support of pattern P in the candidate pattern set obtained in step (4.1), and $U_{max}$ is the maximum utility value over all characters;
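Step four can be illustrated with the following Python sketch. It is a simplified stand-in rather than the patent's exact queue procedure: each queue $Q_j$ is replaced by a set `visited[j]` of positions that $p_j$ may no longer use, and the search is a greedy depth-first search with backtracking from left to right. The function names are hypothetical, and patterns are assumed to be strings of single-character items with one periodic gap [gmin, gmax].

```python
from typing import Dict, List, Optional


def nonoverlap_occurrences(seq: str, pat: str, gmin: int, gmax: int) -> List[List[int]]:
    """Find non-overlapping occurrences (1-based) by depth-first search with backtracking."""
    m = len(pat)
    # visited[j]: positions p_j can no longer use, either consumed by an earlier
    # occurrence or proven to be a dead end (stands in for queue Q_j).
    visited = [set() for _ in range(m)]

    def extend(j: int, pos: int) -> Optional[List[int]]:
        if pos in visited[j] or seq[pos] != pat[j]:
            return None
        if j == m - 1:
            return [pos]                      # node created in the last queue
        last = min(pos + gmax + 1, len(seq) - 1)
        for nxt in range(pos + gmin + 1, last + 1):
            tail = extend(j + 1, nxt)
            if tail is not None:
                return [pos] + tail
        visited[j].add(pos)                   # no occurrence passes through this node
        return None

    result = []
    for start in range(len(seq)):
        occ = extend(0, start)
        if occ is not None:
            for j, pos in enumerate(occ):
                visited[j].add(pos)           # mark nodes used by this occurrence
            result.append([pos + 1 for pos in occ])
    return result


def pau(pat: str, sup: int, utility: Dict[str, float]) -> float:
    """Average utility value, formula (1)."""
    return sum(utility[c] for c in pat) * sup / len(pat)


def spu(sup: int, utility: Dict[str, float]) -> float:
    """Average utility upper bound, formula (2)."""
    return sup * max(utility.values())


utility = {"A": 10, "T": 5}
occs = nonoverlap_occurrences("AATTATTAT", "ATT", 0, 1)
print(occs)                          # [[1, 3, 4], [2, 4, 6], [5, 6, 7]]
print(pau("ATT", len(occs), utility))  # 20.0
print(spu(len(occs), utility))         # 30
```

Because the mean of the item utilities never exceeds $U_{max}$, PAU(P) is never larger than SPU(P), which is why a pattern whose SPU is below the threshold cannot itself be a high average utility pattern.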
Step five, obtain the high average utility pattern set and the high upper bound pattern set of length i+1:
Using step four, calculate in turn the average utility value PAU(P) and the average utility upper bound SPU(P) of each candidate pattern P in the candidate pattern set of length i+1 generated in step three. When PAU(P) is greater than or equal to the minimum average utility threshold minun, add the candidate pattern to the high average utility pattern set of length i+1; when SPU(P) is greater than or equal to the minimum average utility threshold minun, add the candidate pattern to the high upper bound pattern set of length i+1, thereby obtaining the high average utility pattern set and the high upper bound pattern set of length i+1;
Step six, judge whether the candidate pattern set of length i+1 or the high upper bound pattern set of length i+1 is empty. If neither is empty, return to step three and repeat steps three, four and five; if either is empty, mining of high average utility patterns under the non-overlapping condition is finished;
Step seven, output all the mined high average utility patterns on a display.
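Steps one to seven can be combined into the following level-wise loop. This is a compact Python sketch under stated assumptions, not the patent's implementation: items are single characters, the support routine is a greedy depth-first search that uses sets in place of the queues, and the names `support` and `mine` are hypothetical.

```python
from typing import Dict, List, Tuple


def support(seq: str, pat: str, gmin: int, gmax: int) -> int:
    """Non-overlapping support of pat in one sequence (greedy DFS with backtracking)."""
    m = len(pat)
    used = [set() for _ in range(m)]   # used[j]: positions unavailable to p_j

    def extend(j: int, pos: int):
        if pos in used[j] or seq[pos] != pat[j]:
            return None
        if j == m - 1:
            return [pos]
        for nxt in range(pos + gmin + 1, min(pos + gmax + 1, len(seq) - 1) + 1):
            tail = extend(j + 1, nxt)
            if tail is not None:
                return [pos] + tail
        used[j].add(pos)               # dead end
        return None

    count = 0
    for start in range(len(seq)):
        occ = extend(0, start)
        if occ is not None:
            for j, pos in enumerate(occ):
                used[j].add(pos)
            count += 1
    return count


def mine(sdb: List[str], utility: Dict[str, float], gmin: int, gmax: int,
         minun: float) -> List[Tuple[str, float]]:
    """Return (pattern, average utility) for every high average utility pattern."""
    if minun <= 0:
        raise ValueError("minun must be positive so that the loop terminates")
    u_max = max(utility.values())
    high_avg: List[Tuple[str, float]] = []
    candidates = sorted({c for s in sdb for c in s})      # length-1 candidates
    while candidates:
        high_upper = []
        for p in candidates:
            sup = sum(support(s, p, gmin, gmax) for s in sdb)
            pau = sup * sum(utility[c] for c in p) / len(p)   # formula (1)
            if pau >= minun:
                high_avg.append((p, pau))
            if sup * u_max >= minun:                          # formula (2)
                high_upper.append(p)
        # Pattern splicing: suffix(P) == prefix(R) gives T = P + last item of R.
        candidates = [p + r[-1] for p in high_upper for r in high_upper
                      if p[1:] == r[:-1]]
    return high_avg


print(mine(["AATTATTAT"], {"A": 10, "T": 5}, 0, 1, 30))
# [('A', 40.0), ('AT', 30.0)]
```

Only patterns that pass the upper bound test are spliced, so the candidate set shrinks at each level even though the average utility itself does not satisfy the downward closure property.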
The high average utility sequence pattern mining method under the non-overlapping condition uses VC++ 6.0 as the programming software and Visio 2015 as the drawing tool, and runs on a Pentium(R) Dual-Core 32 processor or above with Windows 7 or a later version as the operating system, all of which are well known to those skilled in the art.
The beneficial effects of the invention are as follows. Compared with the prior art, the invention has the following prominent substantive features:
(1) Based on pattern growth, the invention studies high utility sequence pattern mining with periodic gap constraints under the non-overlapping condition and provides a depth-first online pattern matching method. It first finds the character set that appears in the data set and mines the high average utility patterns of length 1. It then splices patterns from the high upper bound pattern set of the previous level to generate the candidate patterns of the next level. Finally it calculates the average utility value of each candidate pattern and screens by the minimum average utility threshold set by the user: a candidate pattern whose average utility value is not lower than the threshold is a high average utility sequence pattern. Users can therefore set different thresholds for different needs, achieving personalized mining suited to a variety of practical problems;
(2) CN110232084A discloses an approximate pattern matching method with local-global constraints that finds occurrences in a sequence under no special condition, that is, characters in the sequence can be reused at any position of the pattern. CN110232140A discloses a one-off approximate pattern matching method with local-global constraints, which performs approximate pattern matching under the one-off condition, that is, each character in the sequence can be used only once. CN110245167A discloses a non-overlapping approximate pattern matching method with local-global constraints, which studies non-overlapping approximate pattern matching in a sequence, that is, characters in the sequence can be reused at different positions of the pattern. These three prior patents are earlier applications by the inventors' team. All of them study approximate pattern matching, in which a distance between the pattern and the sequence is tolerated, and are suited to time series analysis. The prominent substantive feature of the present method is that no distance between the sequence and the pattern is allowed, so it performs exact pattern matching; in addition, it assigns a utility value to each item of the character set and considers both the occurrence frequency of a pattern and the utility of each item. Its notable progress is that, by incorporating utility values, the user obtains a more targeted pattern set and useless data is ignored. Given these prominent substantive features and notable progress relative to the three prior patents, it would not be obvious to a person skilled in the art to arrive at the method of the invention by combining the above references with common general knowledge or conventional technical means in the field.
(3) Compared with the CN106095930A petroleum production data frequent pattern mining method based on weak wildcards, the method has the prominent substantive characteristics that: mining the sequence pattern under the condition of no overlapping constraint, namely, characters in the sequence can be repeatedly used at different positions of the pattern; the method has the remarkable advantages that the utility values of all items are increased on the basis of considering the occurrence frequency of the mode, and the targeted mode which is more in line with the user requirement is mined by combining the importance degree of the utility value evaluation mode.
(4) Compared with 'mining with wildcard sequence mode meeting non-overlapping conditions, Xie Feiqiang Su Peng, No. 5 in 2017', the prominent substantive feature of the present method is that the utility value of each item is considered when measuring the importance of a pattern; the significant progress is that the occurrence frequency of a pattern, the utility value of each item and the pattern length are considered together, so that important patterns of moderate length are mined.
(5) CN109271419A discloses an online string matching method without gap constraints, which uses the first-in first-out property of a queue to output all occurrences dynamically and thereby realizes online pattern matching of character strings. That invention studies pattern matching without gap constraints, where the number of sub-patterns is large, whereas the present invention performs pattern matching under periodic gap constraints, which curb the explosive growth of sub-patterns and improve running efficiency; this is the greatest substantive difference between the two;
(6) CN108984695A discloses a string matching method that quickly and accurately matches similar character strings through a string filtering threshold, in which similar strings are screened by calculating the minimum number of slices of the target string that each similar string must match. Although that invention also belongs to string matching, the present invention studies the exact pattern matching problem with the non-overlapping constraint and periodic gap constraints, which is the greatest substantive difference between the two;
(7) CN108255836A discloses a character string matching method that uses preset rules to calculate a first edit distance between a first string and a second string and a matching value of a key string, and finally obtains a similarity from the two values, thereby improving the accuracy of string matching. That invention studies approximate matching of character strings, whereas the present invention studies pattern matching with periodic gaps under the non-overlapping condition, which is the greatest substantive difference between the two;
(8) CN108846083A discloses a frequent pattern mining method that first converts each word segment into a corresponding code, then screens the codes to obtain target frequent item set code combinations, and finally constructs an FP-Tree from those code combinations for frequent pattern mining. That method performs association rule mining on item sets and does not consider the influence of the utility values of the items on the overall result, whereas the present invention adds utility value as a new measure in sequence pattern mining and considers both the occurrence frequency of a pattern and the importance of each item, which is the greatest substantive difference between the two;
(9) CN110188131A discloses a frequent pattern mining method and device. The method takes the transaction data to be analyzed as a data set, generates a database tree according to the PrefixSpan algorithm, traverses and prunes the database tree according to pruning conditions, mines the frequent patterns that meet preset conditions, sorts them in descending order of their competence values, and finally outputs the first k frequent patterns. That invention linearly combines three measures (occupancy, support and full confidence) into the competence value to evaluate patterns comprehensively, also performs a series of pruning operations during mining, and mines the Top-k patterns of interest to the user efficiently, whereas the present invention mines high average utility patterns whose average utility value is higher than the threshold given by the user, which is the greatest substantive difference between the two;
(10) CN109344179A provides a frequent adjacent sequence pattern mining method. It first sorts the data set and obtains the number mt and the maximum length L of the items in the data set, then creates an empty one-dimensional array and an empty one-dimensional sparse tensor, traverses the sequence strings in turn to store the sequences of length 1 in the one-dimensional array, maps each row (column) of each array to a position index in the sparse tensor, and accumulates the value of each element of the sparse tensor (i.e. the frequency of the corresponding sequence pattern); it then screens out the elements of the sparse tensor whose frequency is higher than the support threshold, and the corresponding sequence patterns are the frequent sequence patterns. That invention uses the sparse tensor data structure to avoid data explosion and improve timeliness, whereas the present invention uses a queue data structure to calculate the support of a pattern and a pattern splicing method to generate candidate patterns, which greatly reduces the number of candidate patterns and improves efficiency; this is the greatest substantive difference between the two;
(11) CN106250549B discloses a frequent pattern mining method based on memory, firstly creating a frequent pattern tree, initializing a tree structure, traversing a database, perfecting the frequent pattern tree in sequence, and then traversing by using a depth-first search strategy logarithm;
(12) CN107145548B discloses a parallel sequence pattern mining method based on the Spark platform. It first splits the data into database segments of the same size, then uses MapReduce to generate the sequences of size 1 and stores them in an RDD, and finally reads the sequence patterns of size (k-1) from the RDD, generates the candidate sequence patterns $C_k$ of size k through a candidate sequence pattern generation step, calls the ReducebyKey() function to calculate the corresponding supports, and outputs the sequence patterns whose support is greater than or equal to the set minimum support. This addresses the low computational efficiency of conventional mining methods on large amounts of data and largely resolves load imbalance, but the utility of each item is not considered, so the generated result is unfair;
(13) CN107870939A discloses a pattern mining method with utility values. For the generated candidate pattern set, it calculates the utility value of each candidate pattern in each transaction and deletes the transactions whose utility value is smaller than the minimum utility threshold set by the user, which greatly reduces the running time of the mining algorithm; it then calculates the period value of each candidate pattern and takes the candidate patterns whose period value is less than or equal to the set period threshold as the final mining result. That invention addresses frequent pattern mining in which utility values are used for pruning. Although it introduces the concept of utility value, it does not consider the influence of pattern length on the overall utility of a pattern, whereas the present invention introduces the definition of average utility, calculates the utility value of a pattern reasonably, and mines sequence patterns of reasonable length and high utility value, which is the greatest substantive difference between the two;
(14) CN106777182A discloses a data stream high utility item set mining algorithm that reduces candidate items. It first scans a data stream window to create a global tree, then generates candidate patterns on the basis of the global tree, and finally calculates the actual utility to screen out the high utility item sets. That invention targets item set mining and does not consider the influence of pattern length when calculating the utility value of a pattern, whereas the present invention mines high utility sequence patterns in sequence data sets, and not only calculates the utility value of a pattern but also brings the pattern length into the average utility value formula, which better meets practical needs; this is the greatest substantive difference between the two;
(15) CN105868296B discloses a method for analyzing medication DDD value data using high utility sequence patterns based on a fast pruning strategy. That invention builds a utility matrix for each item using a matrix structure, which simplifies pruning and reduces the number of database scans, whereas the present invention stores each item and its utility value in a map and calculates the average utility value of a pattern, which is the greatest substantive difference between the two;
(16) CN105590237A discloses the application of high utility sequence patterns with negative profit items in e-commerce decision making. That invention belongs to association rule mining and mines high utility patterns with negative items so that sellers can arrange shelf layouts and marketing strategies reasonably, whereas the present invention mines patterns with high utility value in a sequence database and belongs to sequence pattern mining, which is the greatest substantive difference between the two.
Compared with the prior art, the method has the following remarkable progress:
(1) The method adds gap constraints to sequence pattern mining and studies sequence pattern mining with periodic gap constraints, i.e. the user can flexibly set the gap size between items according to the actual situation. This not only curbs the massive growth of subsequences and improves the time and space efficiency of the method, but also makes patterns more flexible and more widely applicable.
(2) The method introduces the non-overlapping constraint on top of the periodic gap constraint. To calculate the average utility of a pattern, the support of the pattern must be calculated first, which is a pattern matching problem. In matching with periodic gap constraints, a new occurrence is produced as soon as the position of one item changes. Under no special condition, a character at any position of the sequence S may be reused by any occurrence, i.e. occurrences are unconstrained; the exact support of a pattern can be calculated, but it does not satisfy the Apriori property, so only an Apriori-like property can be used, which enlarges the search space and makes the solution space grow exponentially. Under the one-off condition, a character at any position of the sequence S can be used at most once; although the result set shrinks and the Apriori property holds, the support of a pattern can only be calculated approximately, some valuable information may be missed, and the user's need to mine the complete set of frequent patterns is not met. The non-overlapping condition satisfies the Apriori property and yields an exact solution; it neither generates a large number of redundant patterns as under no special condition nor misses valuable patterns as under the one-off condition, and it is therefore of theoretical research significance.
(3) The invention studies high utility sequence pattern mining, which considers not only the occurrence frequency of patterns but also brings the utility value of each item into the mining process, so as to find the patterns the user really needs. In real life, an item often carries additional information such as profit, price and importance, which can help people make decisions. For example, in shopping data, traditional frequent pattern mining can only find the most frequently purchased goods, whereas high utility sequence pattern mining can find the combinations that bring sellers more profit, helping them arrange shelves and choose marketing measures reasonably. Calculation shows that longer patterns have larger utility values; for the sake of fairness, the invention defines the concept of average utility value, which reduces the influence of pattern length on mining, so that high utility sequence pattern mining is more reasonable and has more application advantages.
(4) The method provided by the invention can be applied to biological sequences such as DNA and protein, helping users mine the regularities contained in such sequences and focus their research accordingly. The embodiment of the invention is a simple DNA sequence, in which the sequence S is a character-type biological sequence over the character set {A, T, C, G}. Biological sequence data are large in volume and difficult to analyze; the high utility sequence pattern mining method provided by the invention extracts representative patterns from the sequence S, helps the user carry out further analysis and research, and reduces the difficulty of data processing, so it has great research significance and development potential.
Drawings
The invention is further illustrated with reference to the following figures and examples.
FIG. 1 shows all occurrences of the given pattern P of example B in the given sequence S.
FIG. 2 is a schematic block diagram of the computer processing flow used in the method of the present invention.
FIG. 3 shows the utility values of the characters given in embodiment 1 of the present invention.
FIG. 4 shows the character set of the given sequence string in embodiment 1 of the present invention and the number of occurrences of each character.
FIG. 5 shows the queues created when calculating the support of the pattern P = ACA in embodiment 1 of the present invention.
FIG. 6 shows the final result of embodiment 1 of the present invention, i.e. all the mined high average utility patterns.
Detailed Description
FIG. 1 shows that the pattern P of example B occurs 5 times in the given sequence string S. The 9 characters of the sequence S have the position indexes '1', '2', '3', '4', '5', '6', '7', '8' and '9', and each occurrence is written as the tuple of position indexes of its matched characters, for example <1, 3, 4>, <2, 4, 6>, <5, 6, 7> and <5, 7, 9>. Among them, the occurrences that satisfy the non-overlapping constraint are <1, 3, 4>, <2, 4, 6> and <5, 6, 7>.
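The non-overlapping test on the occurrences above can be illustrated with a short sketch. It assumes that two occurrences overlap when they match the same sequence position at the same pattern position, which is consistent with the occurrences kept above; the function name `is_nonoverlapping` is hypothetical and not part of the invention.

```python
from itertools import combinations

def is_nonoverlapping(occurrences):
    """Return True if no two occurrences share a sequence position
    at the same pattern position (assumed reading of the constraint)."""
    for a, b in combinations(occurrences, 2):
        if any(x == y for x, y in zip(a, b)):
            return False
    return True

kept = [(1, 3, 4), (2, 4, 6), (5, 6, 7)]
print(is_nonoverlapping(kept))                # True
print(is_nonoverlapping(kept + [(5, 7, 9)]))  # False: position 5 reused for p1
```

Position 4 appears in both <1, 3, 4> and <2, 4, 6>, but at different pattern positions, so these two occurrences do not overlap.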
FIG. 2 shows the computer processing flow adopted by the method of the present invention: 1) start → 2) read in the sequence database SDB, the minimum gap min, the maximum gap max and the minimum average utility threshold minun → 3) generate the high average utility pattern set and the high upper bound pattern set of pattern length 1 → 4) generate the candidate pattern set of length i+1 → 5) calculate the average utility value and the average utility upper bound value of the patterns in the candidate pattern set of length i+1 → 6) obtain the high average utility pattern set and the high upper bound pattern set of length i+1 → 7) determine whether the candidate pattern set of length i+1 or the high upper bound pattern set of length i+1 is empty; if "no", return to step 4); if "yes", proceed to step 8) → 8) output all mined high average utility patterns on the display → 9) end.
Example 1
Given a DNA sequence $S = s_1 s_2 \ldots s_{13}$ = ATTCATCACATCA, a periodic gap of [0, 3] and a minimum average utility threshold minun of 25, the utility values of the items in the character set are shown in FIG. 3.
Step one, reading in a sequence database SDB, a minimum gap min, a maximum gap max and a minimum average utility threshold minun:
reading in the given sequence database SDB, which contains 1 sequence $S = s_1 s_2 \ldots s_{13}$ = ATTCATCACATCA; the character set is {A, T, C}, the minimum gap min is 0, the maximum gap max is 3, and the minimum average utility threshold minun is 25.
Step two, generating the high average utility pattern set and the high upper bound pattern set with the length of 1:
calculating the average utility value and the average utility upper bound value of each character in the sequence database read in the first step, adding the characters whose average utility value is greater than or equal to the minimum average utility threshold minun to the high average utility pattern set with the length of 1, and adding the characters whose average utility upper bound value is greater than or equal to the minimum average utility threshold minun to the high upper bound pattern set with the length of 1, thereby generating the high average utility pattern set and the high upper bound pattern set with the length of 1;
the specific operation of this embodiment is as follows:
1) processing the first character 'A': the average utility value of the character is calculated as 50 and the average utility upper bound value as 50; both are greater than the minimum average utility threshold 25, so the character 'A' is stored in the high average utility pattern set and the high upper bound pattern set;
2) processing the second character 'T': the average utility value of the character is calculated as 20 and the average utility upper bound value as 40; the average utility value is less than the minimum average utility threshold 25, but the average utility upper bound value is greater than it, so the character 'T' is stored only in the high upper bound pattern set;
3) processing the third character 'C': the average utility value of the character is calculated as 32 and the average utility upper bound value as 40; both are greater than the minimum average utility threshold 25, so the character 'C' is stored in the high average utility pattern set and the high upper bound pattern set;
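The second step can be sketched as follows. The utility values used here (A = 10, T = 5, C = 8) are an assumption: they are inferred from the values reported above (for example, 5 occurrences of 'A' with an average utility value of 50) rather than read from FIG. 3, and the function name `length_one_sets` is hypothetical.

```python
from collections import Counter

SEQ = "ATTCATCACATCA"
# Assumed utility values, inferred from the reported results (not from FIG. 3).
UTILITY = {"A": 10, "T": 5, "C": 8}
MINUN = 25

def length_one_sets(seq, utility, threshold):
    """Split single characters into the length-1 high average utility set
    and the length-1 high upper bound set."""
    counts = Counter(seq)            # support of a single character is its count
    u_max = max(utility.values())
    high_avg, high_upper = [], []
    for ch, u in utility.items():
        pau = counts[ch] * u         # average utility value of a length-1 pattern
        spu = counts[ch] * u_max     # average utility upper bound value
        if pau >= threshold:
            high_avg.append(ch)
        if spu >= threshold:
            high_upper.append(ch)
    return high_avg, high_upper

print(length_one_sets(SEQ, UTILITY, MINUN))  # (['A', 'C'], ['A', 'T', 'C'])
```

Under these assumed utilities the sketch reproduces the outcome of this embodiment: 'T' fails the average utility test but is kept for splicing because its upper bound is 40.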
Step three, generating the candidate pattern set with the length of i+1:
generating the candidate pattern set with the length of i+1 from the high upper bound pattern set with the length of i:
① when i = 1, the characters in the high upper bound pattern set with pattern length 1 obtained in the second step are combined with each other to generate the candidate pattern set with pattern length i+1;
② when i > 1, candidate patterns are generated by pattern splicing. For a pattern $P = p_1 p_2 \ldots p_{m-1} p_m$, prefix(P) is the prefix of the pattern P, i.e. what remains after removing its last sub-pattern $p_m$: $\mathrm{prefix}(P) = p_1 p_2 \ldots p_{m-1}$; suffix(P) is the suffix of the pattern P, i.e. what remains after removing its first sub-pattern $p_1$: $\mathrm{suffix}(P) = p_2 \ldots p_{m-1} p_m$. When there are two patterns P and R of length i and the suffix of P equals the prefix of R, i.e. $\mathrm{suffix}(P) = p_2 p_3 \ldots p_i = \mathrm{prefix}(R) = r_1 r_2 \ldots r_{i-1}$, the pattern splicing method joins P and R into a pattern T of length i+1: $T = P \oplus R = p_1 p_2 \ldots p_{i-1} p_i \oplus r_1 r_2 \ldots r_{i-1} r_i = p_1 p_2 \ldots p_i r_i$.
In this embodiment, a specific processing method for generating a candidate pattern set with a length of i +1 by using the pattern splicing method is as follows:
the high upper bound patterns with the length of 1 obtained in the second step are A, T and C;
1) processing the 1st high upper bound pattern A: taking A as the prefix and appending A, T and C in turn gives three candidate patterns of length 2 with the prefix A, namely AA, AT and AC,
2) processing the 2nd high upper bound pattern T: taking T as the prefix and appending A, T and C in turn gives three candidate patterns of length 2 with the prefix T, namely TA, TT and TC,
3) processing the 3rd high upper bound pattern C: taking C as the prefix and appending A, T and C in turn gives three candidate patterns of length 2 with the prefix C, namely CA, CT and CC;
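The pattern splicing of the third step can be sketched as below. Patterns are represented as Python strings and the function name `splice_candidates` is hypothetical; for i = 1 the suffix and prefix are both empty, so every pair of characters is joined, which matches case ①.

```python
def splice_candidates(high_upper):
    """Join P and R whenever suffix(P) == prefix(R); the candidate is
    P followed by the last item of R."""
    candidates = []
    for p in high_upper:
        for r in high_upper:
            if p[1:] == r[:-1]:          # suffix(P) == prefix(R)
                candidates.append(p + r[-1])
    return candidates

print(splice_candidates(["A", "T", "C"]))
# ['AA', 'AT', 'AC', 'TA', 'TT', 'TC', 'CA', 'CT', 'CC']
```

Because only patterns from the high upper bound set are joined, a candidate is generated only when both its prefix and its suffix passed the upper bound test.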
Step four, calculating the average utility value and the average utility upper bound value of the patterns in the candidate pattern set with the length of i+1:
(4.1) calculating in turn the pattern support of the patterns in the candidate pattern set with the length of i+1 obtained in the third step:
First, a pattern P in the candidate pattern set with the length of i+1 is read in, and n queues are created according to the number n of sub-patterns of the pattern, recorded as queue $Q_1$, queue $Q_2$, …, queue $Q_j$, …, queue $Q_n$, 1 ≤ j ≤ n.
Then, nodes are created in the n queues in turn using a depth-first and backtracking strategy. A node of queue $Q_j$ records the position, in the sequence S of the sequence database SDB read in the first step, at which the sub-pattern $p_j$ of the pattern P is matched. Under the non-overlapping constraint, the same node is not allowed to appear twice in the same queue, but the same node is allowed to appear in different queues. Before a node is created at the tail of queue $Q_j$, it must therefore be determined whether the periodic gap constraint is satisfied and whether queue $Q_j$ already contains that node. A node is already present in queue $Q_j$ in two cases: 1) the node has been used by a previous occurrence; 2) an earlier search through the node could not find an occurrence. In both cases the node cannot be created at the tail of queue $Q_j$, and the search continues for another tail node of queue $Q_j$ that satisfies the periodic gap constraint. When a node is created in the last queue, one occurrence has been found. Occurrences are added in the same way until the last character of the sequence string has been scanned, at which point the creation of queue nodes is finished. The last queue is then traversed, and the number of its nodes is the support of the pattern. The pattern support sup(P) is calculated in this way for each pattern P in the candidate pattern set with the length of i+1 obtained in the third step;
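The support calculation of step (4.1) can be approximated by the following sketch, which is one possible reading of the queue procedure rather than a definitive implementation. It assumes 0-based positions, a gap counted as the number of characters between two consecutive matched positions, and a leftmost-first depth-first search; the names `nonoverlap_support` and `extend` are hypothetical.

```python
def nonoverlap_support(seq, pattern, min_gap, max_gap):
    """Count non-overlapping occurrences of pattern in seq under a
    periodic gap constraint (depth-first search with backtracking)."""
    m = len(pattern)
    # queues[j] holds every position already tried for pattern[j]:
    # positions used by an occurrence and positions whose search failed.
    queues = [[] for _ in range(m)]

    def extend(j, prev):
        """Try to match pattern[j:] after position prev."""
        if j == m:
            return True
        lo = prev + min_gap + 1
        hi = min(prev + max_gap + 1, len(seq) - 1)
        for pos in range(lo, hi + 1):
            if seq[pos] != pattern[j] or pos in queues[j]:
                continue                 # wrong character or node already queued
            queues[j].append(pos)
            if extend(j + 1, pos):
                return True              # occurrence completed through pos
        return False                     # backtrack

    for start, ch in enumerate(seq):
        if ch == pattern[0]:
            queues[0].append(start)
            extend(1, start)
    return len(queues[-1])               # nodes in the last queue

print(nonoverlap_support("ATTCATCACATCA", "AAA", 0, 3))  # 3
```

A position whose search failed stays in its queue, mirroring case 2) above: since the queues only grow, a search through that position could not succeed later either.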
(4.2) calculating the average utility value and the average utility upper bound value of the patterns in the candidate pattern set with the length of i+1, in the following two steps:
① the average utility value PAU(P) of a pattern in the candidate pattern set with the length of i+1 is obtained according to formula (1),
$PAU(P) = \dfrac{sup(P) \times \sum_{j=1}^{m} U(p_j)}{m}$ (1),
in formula (1), $U(p_j)$ is the utility value of the j-th item of the pattern P in the candidate pattern set, sup(P) is the support of the pattern P, and m is the length of the pattern P in the candidate pattern set,
② the average utility upper bound value SPU(P) of a pattern in the candidate pattern set with the length of i+1 is calculated according to formula (2),
$SPU(P) = sup(P) \times U_{max}$ (2),
in formula (2), sup(P) is the support of the pattern P in the candidate pattern set obtained in step (4.1), and $U_{max}$ is the maximum of the utility values of the characters;
the specific operation of this embodiment is as follows:
sequentially traversing each candidate pattern in the candidate pattern set, calculating the support degree of the pattern P by using a depth-first online matching method, calculating the average utility value PAU (P) of the pattern P by using a formula (1), and calculating the average utility upper limit value SPU (P) of the pattern P by using a formula (2).
The specific implementation process of the steps comprises the following steps:
1) processing the 1st candidate pattern AA: the support is calculated first, giving sup(AA) = 4; formula (1) gives an average utility value of 40, and formula (2) gives an average utility upper bound value of 40;
2) processing the 2nd candidate pattern AT: the support is calculated first, giving sup(AT) = 3; formula (1) gives an average utility value of 22.5, and formula (2) gives an average utility upper bound value of 30;
3) processing the 3rd candidate pattern AC: the support is calculated first, giving sup(AC) = 4; formula (1) gives an average utility value of 36, and formula (2) gives an average utility upper bound value of 40;
4) processing the 4th candidate pattern TA: the support is calculated first, giving sup(TA) = 3; formula (1) gives an average utility value of 22.5, and formula (2) gives an average utility upper bound value of 30;
5) processing the 5th candidate pattern TT: the support is calculated first, giving sup(TT) = 2; formula (1) gives an average utility value of 10, and formula (2) gives an average utility upper bound value of 20;
6) processing the 6th candidate pattern TC: the support is calculated first, giving sup(TC) = 4; formula (1) gives an average utility value of 26, and formula (2) gives an average utility upper bound value of 40;
7) processing the 7th candidate pattern CA: the support is calculated first, giving sup(CA) = 4; formula (1) gives an average utility value of 36, and formula (2) gives an average utility upper bound value of 40;
8) processing the 8th candidate pattern CT: the support is calculated first, giving sup(CT) = 2; formula (1) gives an average utility value of 13, and formula (2) gives an average utility upper bound value of 20;
9) processing the 9th candidate pattern CC: the support is calculated first, giving sup(CC) = 3; formula (1) gives an average utility value of 24, and formula (2) gives an average utility upper bound value of 30;
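The nine results above can be rechecked with a short sketch. It takes the supports as reported and assumes that formula (1) is the support multiplied by the sum of the item utilities and divided by the pattern length; the utility values A = 10, T = 5 and C = 8 are again inferred from the reported results rather than read from FIG. 3, and the names `pau` and `spu` are hypothetical.

```python
# Assumed utility values, inferred from the reported results (not from FIG. 3).
UTILITY = {"A": 10, "T": 5, "C": 8}
U_MAX = max(UTILITY.values())

# Supports of the length-2 candidates as reported in this embodiment.
SUPPORT = {"AA": 4, "AT": 3, "AC": 4, "TA": 3, "TT": 2,
           "TC": 4, "CA": 4, "CT": 2, "CC": 3}

def pau(pattern, sup):
    """Average utility value: sup * sum of item utilities / pattern length."""
    return sup * sum(UTILITY[item] for item in pattern) / len(pattern)

def spu(sup):
    """Average utility upper bound value: sup * maximum item utility."""
    return sup * U_MAX

for pattern, sup in SUPPORT.items():
    print(pattern, pau(pattern, sup), spu(sup))
```

For example, AT has support 3 and item utilities 10 and 5, so its average utility value is 3 × 15 / 2 = 22.5 and its upper bound is 3 × 10 = 30.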
Step five, obtaining the high average utility pattern set and the high upper bound pattern set with the length of i+1:
the average utility value PAU(P) and the average utility upper bound value SPU(P) of each candidate pattern P in the candidate pattern set with the length of i+1 generated in the third step are calculated in turn by the fourth step. When PAU(P) is greater than or equal to the minimum average utility threshold minun, the candidate pattern is added to the high average utility pattern set with the length of i+1; when SPU(P) is greater than or equal to the minimum average utility threshold minun, the candidate pattern is added to the high upper bound pattern set with the length of i+1. The high average utility pattern set and the high upper bound pattern set with the length of i+1 are thereby obtained. If the high upper bound pattern set with the length of i+1 is not empty, the third, fourth and fifth steps are executed in a loop: the candidate pattern set of the next layer is generated by the pattern growth method, the supports of its patterns are calculated by matching in turn, the average utility value of each pattern is calculated by formula (1) and its average utility upper bound value by formula (2); the patterns whose average utility value is not lower than the given threshold, i.e. the high average utility patterns, are stored in the high average utility pattern set $H_{i+1}$, and the patterns whose average utility upper bound value is not lower than the given threshold, i.e. the high upper bound patterns, are stored in the high upper bound pattern set $U_{i+1}$;
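The screening of the fifth step can be sketched as follows, using the length-2 values calculated in the fourth step of this embodiment as input. The function name `split_sets` and the tuple layout (average utility value, average utility upper bound value) are hypothetical.

```python
MINUN = 25

# (average utility value, average utility upper bound value) of the
# length-2 candidates, as calculated in the fourth step.
VALUES = {"AA": (40, 40), "AT": (22.5, 30), "AC": (36, 40),
          "TA": (22.5, 30), "TT": (10, 20), "TC": (26, 40),
          "CA": (36, 40), "CT": (13, 20), "CC": (24, 30)}

def split_sets(values, threshold):
    """Return the high average utility set and the high upper bound set."""
    high_avg = [p for p, (pau, _) in values.items() if pau >= threshold]
    high_upper = [p for p, (_, spu) in values.items() if spu >= threshold]
    return high_avg, high_upper

h2, u2 = split_sets(VALUES, MINUN)
print(h2)  # ['AA', 'AC', 'TC', 'CA']
print(u2)  # ['AA', 'AT', 'AC', 'TA', 'TC', 'CA', 'CC']
```

Only the patterns in the second list are spliced into longer candidates; TT and CT are dropped because even their upper bound of 20 is below the threshold.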
the specific implementation process of the steps comprises the following steps:
1) traversing the candidate pattern set with the length of 2: the 1st candidate pattern AA is processed; since PAU(AA) = 40 > minun = 25 and SPU(AA) = 40 > minun = 25, the pattern AA is added to both $H_2$ and $U_2$; the 2nd candidate pattern AT is processed; since PAU(AT) = 22.5 < minun = 25 and SPU(AT) = 30 > minun = 25, the pattern AT is added only to $U_2$; the 3rd candidate pattern AC is processed; since PAU(AC) = 36 > minun = 25 and SPU(AC) = 40 > minun = 25, the pattern AC is added to both $H_2$ and $U_2$; the 4th candidate pattern TA is processed; since PAU(TA) = 22.5 < minun = 25 and SPU(TA) = 30 > minun = 25, the pattern TA is added only to $U_2$; the 5th candidate pattern TT is processed; since PAU(TT) = 10 < minun = 25 and SPU(TT) = 20 < minun = 25, the pattern TT is added to neither $H_2$ nor $U_2$; the 6th candidate pattern TC is processed; since PAU(TC) = 26 > minun = 25 and SPU(TC) = 40 > minun = 25, the pattern TC is added to both $H_2$ and $U_2$; the 7th candidate pattern CA is processed; since PAU(CA) = 36 > minun = 25 and SPU(CA) = 40 > minun = 25, the pattern CA is added to both $H_2$ and $U_2$; the 8th candidate pattern CT is processed; since PAU(CT) = 13 < minun = 25 and SPU(CT) = 20 < minun = 25, the pattern CT is added to neither $H_2$ nor $U_2$; the 9th candidate pattern CC is processed; since PAU(CC) = 24 < minun = 25 and SPU(CC) = 30 > minun = 25, the pattern CC is added only to $U_2$; after all the patterns in the candidate pattern set have been judged, the candidate pattern set is emptied so that it can store the candidate patterns of the next layer;
2) processing a high upper bound mode set with the length of 2, when a first mode AA is obtained, sequentially traversing the high upper bound mode of the layer by taking the suffix mode of the mode as A, finding a 1 st element, namely AA, wherein the prefix of the AA is also A, so that a new mode AAA can be formed, storing the AAA into a candidate mode set, continuously searching a 2 nd element AC backwards, generating AAC, storing into the candidate mode set, continuously searching a 3 rd element AT, splicing with the AA to form a mode AAT, storing into the candidate mode set, continuously traversing the high upper bound mode set backwards, and splicing with the AA without other modes with prefixes of A; a second pattern AC is obtained, with the suffix pattern C, and the concatenation is performed in the same way, to obtain 2 patterns with length 3, which are: ACA, ACC; and splicing all the high bound modes of the layer to obtain a candidate mode set with the length of 3, wherein the result is { AAA, AAC, AAT, ACA, ACC, ATA, ATC, CAA, CAC, CAT, CCA, CCC, TAA, TAT, TAC, TCA and TCC }. Sequentially calculating the average utility value of the candidate mode centralized mode, firstly, calculating to obtain sup (AAA) 3 by using a depth-first online matching method, and calculating the average utility value according to the formula (1)
$$\mathrm{PAU}(\mathrm{AAA}) = \frac{\mathrm{sup}(\mathrm{AAA}) \times (U(\mathrm{A}) + U(\mathrm{A}) + U(\mathrm{A}))}{3}$$
This is above the minimum average utility threshold, so AAA is added to the layer-3 high average utility pattern set H3 and to the high upper bound pattern set U3. The 2nd pattern, AAC, is calculated next:
$$\mathrm{PAU}(\mathrm{AAC}) = \frac{\mathrm{sup}(\mathrm{AAC}) \times (U(\mathrm{A}) + U(\mathrm{A}) + U(\mathrm{C}))}{3}$$
Like the previous pattern, AAC is added to both H3 and U3. The 3rd pattern, AAT, follows:
$$\mathrm{PAU}(\mathrm{AAT}) = \frac{\mathrm{sup}(\mathrm{AAT}) \times (U(\mathrm{A}) + U(\mathrm{A}) + U(\mathrm{T}))}{3}$$
This is below the threshold, so AAT is not a high average utility pattern, and its utility upper bound must be calculated according to formula (2), SPU(AAT) = sup(AAT) × Umax. The upper bound obtained is still below the threshold, so the pattern is discarded. The 4th pattern, ACA, has PAU(ACA) = 33.3, which is above the threshold, so it is stored in H3 and U3. The 5th pattern, ACC, has PAU(ACC) = 26, which is above the threshold, so it is stored in H3 and U3. The 6th pattern, ATA, has PAU(ATA) = 25, which is not lower than the threshold, so it is stored in H3 and U3. The 7th pattern, ATC, has PAU(ATC) = 23, which is below the threshold, so it is not a high average utility pattern; its utility upper bound is calculated to decide whether it can still be spliced downward, and SPU(ATC) = 30 meets the threshold, so ATC is stored in the high upper bound pattern set U3. Each remaining pattern in the candidate pattern set is calculated in the same way to decide whether it can be stored in the corresponding sets, and the candidate pattern set is emptied after the scan is finished. This finally gives the high average utility pattern set H3 and the high upper bound pattern set U3 of length 3.
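The suffix/prefix splicing used in step 2) to grow length-3 candidates from the length-2 high upper bound patterns can be sketched as below. The function name `splice` is hypothetical, and the input order of U2 follows the traversal order described in step 2) (AA, AC, AT, ...); this is an illustration under those assumptions rather than the patent's code.

```python
def splice(upper_set):
    """Join P and R into P + last(R) whenever suffix(P) equals prefix(R)."""
    candidates = []
    for p in upper_set:
        for r in upper_set:
            # suffix(P) drops the first character, prefix(R) drops the last one
            if p[1:] == r[:-1]:
                candidates.append(p + r[-1])
    return candidates

# High upper bound patterns of length 2 from the worked example
U2 = ["AA", "AC", "AT", "CA", "CC", "TA", "TC"]
C3 = splice(U2)
print(len(C3))
print(C3)
```

Because TT and CT are absent from U2, no candidate containing them is ever generated, which is how the upper bound prunes the search space.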
3) Process the high upper bound pattern set of length 3. The first pattern obtained is AAA, whose suffix is AA. The high upper bound patterns of this layer are traversed in order. The 1st element, AAA, also has prefix AA, so a new pattern AAAA is formed and stored in the candidate pattern set. The search continues to the 2nd element, AAC, which generates AAAC; it is stored in the candidate pattern set. Traversing the rest of the high upper bound pattern set finds no other pattern with prefix AA to splice with AAA. Next, the second pattern, AAC, has suffix AC; traversing the high upper bound patterns again finds the patterns with prefix AC, namely ACA and ACC, and splicing gives AACA and AACC. All high upper bound patterns of this layer are spliced in the same way to obtain the candidate pattern set of length 4, which is {AAAA, AAAC, AACA, AACC, ACAA, ACAC, ACCA, ATCA, ATCC, CAAA, CAAC, CACA, CACC, CCAA, CCAC, TCAA, TCAC}. The average utility values of the patterns in the candidate pattern set are then calculated in order. For the first pattern, AAAA, the depth-first online matching method gives sup(AAAA) = 2, and the average utility value is calculated according to formula (1):
$$\mathrm{PAU}(\mathrm{AAAA}) = \frac{\mathrm{sup}(\mathrm{AAAA}) \times (U(\mathrm{A}) + U(\mathrm{A}) + U(\mathrm{A}) + U(\mathrm{A}))}{4}$$
This is less than the minimum average utility threshold, so AAAA is not a high average utility pattern, and its utility upper bound must be calculated according to formula (2): SPU(AAAA) = sup(AAAA) × Umax = 2 × 10 = 20. The upper bound is still below the threshold, so the pattern is discarded. The 2nd pattern, AAAC, is calculated next:
$$\mathrm{PAU}(\mathrm{AAAC}) = \frac{\mathrm{sup}(\mathrm{AAAC}) \times (U(\mathrm{A}) + U(\mathrm{A}) + U(\mathrm{A}) + U(\mathrm{C}))}{4}$$
Its upper bound SPU(AAAC) = sup(AAAC) × Umax = 2 × 10 = 20 is still below the threshold, so it is discarded like the previous pattern. The 3rd pattern, AACA, follows:
$$\mathrm{PAU}(\mathrm{AACA}) = \frac{\mathrm{sup}(\mathrm{AACA}) \times (U(\mathrm{A}) + U(\mathrm{A}) + U(\mathrm{C}) + U(\mathrm{A}))}{4}$$
This is above the threshold, so AACA is a high average utility pattern and is added to the layer-4 high average utility pattern set H4 and to the high upper bound pattern set U4. The 4th pattern, AACC, has an average utility of 15, below the threshold, and its calculated utility upper bound of 20 is also below the threshold, so it is discarded. The 5th pattern, ACAA, has PAU(ACAA) = 28.5, above the threshold, so it is stored in H4 and U4. The 6th pattern, ACAC, has PAU(ACAC) also above the threshold, so it is stored in H4 and U4. The 7th pattern, ACCA, has PAU(ACCA) = 27, above the threshold, so it is stored in H4 and U4. The 8th pattern, ATCA, has PAU(ATCA) = 24.75, below the threshold, so it is not a high average utility pattern; its utility upper bound is calculated to decide whether it can still be spliced downward, and SPU(ATCA) = 30 meets the threshold, so ATCA is stored in the high upper bound pattern set U4. Each remaining pattern in the candidate pattern set is calculated in the same way to decide whether it can be stored in the corresponding sets, and the candidate pattern set is emptied after the scan is finished. The resulting high average utility pattern set H4 of length 4 is {AACA, ACAA, ACAC, ACCA, CACA}, and the high upper bound pattern set U4 of length 4 is {AACA, ACAA, ACAC, ACCA, ATCA, CACA, TCAA, TCAC, TCCA}.
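The arithmetic behind formulas (1) and (2) in step 3) can be checked with a short sketch. The per-character utility values are an assumption: they are given in Fig. 3, which is not reproduced here, and U(A) = 10, U(T) = 5, U(C) = 8 are inferred from the values quoted in the text (Umax = 10 is quoted). Only sup(AAAA) = 2 is quoted; the support of 3 for the other patterns is inferred from SPU(ATCA) = 30 and should be treated as hypothetical.

```python
# Assumed utilities, inferred from the quoted results (Fig. 3 is not shown here)
UTILITY = {"A": 10, "T": 5, "C": 8}
U_MAX = max(UTILITY.values())  # 10, as quoted in the text

def pau(pattern, sup):
    """Formula (1): support times the summed item utilities, divided by length."""
    return sup * sum(UTILITY[ch] for ch in pattern) / len(pattern)

def spu(sup):
    """Formula (2): support times the largest single-item utility."""
    return sup * U_MAX

print(pau("AAAA", 2), spu(2))   # 20.0 20
print(pau("ACAA", 3))           # 28.5
print(pau("ACCA", 3))           # 27.0
print(pau("ATCA", 3), spu(3))   # 24.75 30
```

Since no item utility exceeds Umax, SPU(P) is never smaller than PAU(P), which is why a pattern whose upper bound misses the threshold can be dropped safely.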
4) Process the high upper bound pattern set of length 4. The first pattern obtained is AACA, whose suffix is ACA. Traversing the high upper bound patterns of this layer in order finds the 2nd element, ACAA; because the prefix of ACAA is also ACA, a new pattern AACAA is formed and stored in the candidate pattern set. The search continues to the 3rd element, ACAC, which generates AACAC; it is stored in the candidate pattern set. Traversing the rest of the high upper bound pattern set finds no other pattern with prefix ACA to splice with AACA. Next, the second pattern, ACAA, has suffix CAA; traversing the high upper bound patterns again finds no pattern with prefix CAA, so no new pattern can be generated by splicing. The third pattern, ACAC, has suffix CAC; traversing the high upper bound pattern set finds the pattern with prefix CAC, namely CACA, and splicing gives ACACA. Continuing the traversal, pattern ACCA cannot be spliced. Pattern ATCA is spliced with patterns TCAA and TCAC to give ATCAA and ATCAC. Pattern CACA is spliced with patterns ACAA and ACAC to give CACAA and CACAC. Neither pattern TCAA nor pattern TCCA can be spliced. Pattern TCAC can be spliced with CACA to give TCACA. The resulting candidate pattern set of length 5 is {AACAA, AACAC, ACACA, ATCAA, ATCAC, CACAA, CACAC, TCACA}.
After the candidate patterns of length 5 are obtained, their average utility values are calculated one by one in the same way as in the fourth step. Starting from the first pattern, PAU(AACAA) = 19.2 and SPU(AACAA) = 20, so AACAA is neither a high average utility pattern nor a high upper bound pattern and is discarded. The second pattern, AACAC, has an average utility of 18.4 and a utility upper bound of 20, so it is discarded as well. The third pattern, ACACA, has an average utility of 27.6, so it is a high average utility pattern and is stored in the layer-5 high average utility pattern set H5 and the high upper bound pattern set U5. Continuing down through the patterns ATCAA, ATCAC, CACAA and CACAC, none is a high average utility pattern or a high upper bound pattern, so all are discarded. The last candidate pattern, TCACA, has an average utility of 24.6, so it is not a high average utility pattern, but its utility upper bound is 30, which meets the condition; it is a high upper bound pattern and is stored in U5. The resulting high average utility pattern set H5 of length 5 is {ACACA}, and the high upper bound pattern set U5 of length 5 is {ACACA, TCACA}.
5) Process the high upper bound pattern set of length 5. The first pattern obtained is ACACA, whose suffix is CACA; no pattern in this layer's pattern set has prefix CACA, so no new candidate pattern is generated. The second pattern, TCACA, likewise generates no new candidate pattern.
Sixth step: when the candidate pattern set of length i + 1 is empty, mining of the high average utility patterns under the non-overlapping condition ends, and the seventh step is executed;
Since the candidate pattern set of length 6 is empty in the fifth step, the high average utility sequence pattern mining ends.
The specific operation of this embodiment is:
traverse the high average utility pattern sets generated in the above steps, output the high average utility patterns on the display layer by layer, and finally count the number of high average utility patterns and the number of calculations, and calculate the running time.
Fig. 4 shows that, in this embodiment, the characters present in the sequence and the number of occurrences of each character are determined by traversing the given sequence string S = ATTCATCACATCA; the character set is {A, T, C}, and the occurrence counts are sup(A) = 5, sup(T) = 4 and sup(C) = 4;
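The single pass over the sequence described for Fig. 4 can be sketched with Python's standard `collections.Counter`; the variable names are illustrative only.

```python
from collections import Counter

S = "ATTCATCACATCA"          # sequence string of the embodiment

# One traversal yields both the character set and each character's count
counts = Counter(S)
alphabet = sorted(counts)

print(alphabet)
print(dict(counts))
```

For a length-1 pattern every occurrence is trivially non-overlapping, so these counts are also the supports sup(A), sup(T) and sup(C).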
Fig. 5 shows how, in this embodiment, the support of the pattern P = ACA is calculated under the periodic gap constraint [0, 3] by creating queues with the depth-first online pattern matching method. The number of queues to create is determined by the number of sub-patterns of the pattern; since pattern P has three sub-patterns, three queues are created. The first character in the sequence that matches p1 is found, and it is checked whether its index has already been used in the queue of the current layer. If not, the index is enqueued into the first queue; if it is occupied, the search continues further along the sequence for a matching character index until one is enqueued successfully. Then a character matching p2 is sought within a gap of 0 to 3 after s1. If the matching character index 4 is not occupied, it enters the second queue; if it is occupied, the search continues further along until an index is enqueued or all characters within the gap have been scanned. If the whole gap has been scanned without a match, the method backtracks to the upper-layer sub-pattern, finds the next character that can match it, and enqueues that index. Characters that can enter the third queue are then sought in the same way, until the sequence string has been completely scanned. The number of indices stored in the last queue is the support of pattern P in the sequence string S;
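The support calculation described for Fig. 5 can be approximated by the sketch below. It is a simplified stand-in, not the patent's queue structure: one set of used positions per sub-pattern plays the role of each queue, the function name `support` is hypothetical, positions are 0-based, and dead ends are not memoised, so it is meant for short sequences only.

```python
def support(seq, pattern, min_gap, max_gap):
    """Count non-overlapping occurrences of pattern in seq under a gap constraint."""
    m = len(pattern)
    used = [set() for _ in range(m)]   # positions already taken, per sub-pattern

    def extend(level, pos):
        """Depth-first search with backtracking from pattern[level] matched at pos."""
        if level == m - 1:
            return [pos]
        first = pos + min_gap + 1
        last = min(pos + max_gap + 1, len(seq) - 1)
        for nxt in range(first, last + 1):
            if seq[nxt] == pattern[level + 1] and nxt not in used[level + 1]:
                tail = extend(level + 1, nxt)
                if tail is not None:
                    return [pos] + tail
        return None                     # no completion from here: backtrack

    count = 0
    for start, ch in enumerate(seq):
        if ch != pattern[0]:
            continue
        occurrence = extend(0, start)
        if occurrence is not None:
            for level, p in enumerate(occurrence):
                used[level].add(p)      # a position may be reused at another level
            count += 1
    return count

S = "ATTCATCACATCA"
print(support(S, "AAA", 0, 3))    # the text quotes sup(AAA) = 3
print(support(S, "AAAA", 0, 3))   # the text quotes sup(AAAA) = 2
```

Keeping a separate used set per sub-pattern is what expresses the non-overlapping condition: the same sequence position may serve as p1 in one occurrence and as p3 in another, but never twice in the same role.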
Fig. 6 shows that, in this embodiment, high average utility sequence pattern mining is performed on a given sequence S with the gap constraint [0, 3] and the minimum average utility threshold minun: patterns whose average utility value meets the threshold are searched for layer by layer to obtain the complete high average utility pattern set.
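The layer-by-layer flow summarised for Fig. 6 can be outlined as a loop that alternates evaluation and splicing until no candidates remain. This is a hedged skeleton: the name `mine`, the `support_fn` callback and the toy support table are all hypothetical, and the support calculation itself is deliberately left to the caller.

```python
def mine(alphabet, utility, support_fn, minun):
    """Return all high average utility patterns found by layer-wise growth."""
    u_max = max(utility.values())
    found = []
    candidates = list(alphabet)               # layer 1: single characters
    while candidates:                         # stop when a layer has no candidates
        upper = []
        for p in candidates:
            sup = support_fn(p)
            if sup * sum(utility[c] for c in p) / len(p) >= minun:
                found.append(p)               # high average utility pattern
            if sup * u_max >= minun:
                upper.append(p)               # may still grow into a result
        # splice: suffix(P) == prefix(R) gives P + last character of R
        candidates = [p + r[-1] for p in upper for r in upper if p[1:] == r[:-1]]
    return found

# Toy data, invented purely to exercise the loop (not taken from the patent)
toy_support = {"A": 5, "B": 3, "AA": 3, "AB": 2, "BA": 3}
result = mine(["A", "B"], {"A": 10, "B": 2},
              lambda p: toy_support.get(p, 0), minun=25)
print(result)
```

In the toy run, B and BA never reach the threshold themselves but stay in the upper bound lists, so they still take part in splicing until their descendants fail the bound.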
Seventh step: output all mined high average utility patterns on a display.
Example 2
Given a DNA sequence S = s1s2s3s4s5s6s7s8s9s10s11s12 = ATTCATCACATC, the periodic gap is [0, 3] and the given minimum average utility threshold is minun = 25; the utility values of the items in the character set are shown in Fig. 3.
Sixth step: when the high upper bound pattern set of length i + 1 is empty, mining of the high average utility sequence patterns ends, and the seventh step is executed.
Since the high upper bound pattern set of length 5 is empty in the fifth step, the high average utility sequence pattern mining ends.
Except for the above differences, the same procedure as in example 1 was repeated.
In the above embodiments, the programming software is VC++ 6.0, the drawing tool is Visio 2015, the processor is Pentium(R) Dual-Core 32Processor +, and the operating system is Windows 7 or above, all of which are well known to those skilled in the art.

Claims (1)

1. A method for mining high average utility sequence patterns under the non-overlapping condition, characterized in that a candidate set is generated by a pattern growth method and the average utility value of each candidate pattern is calculated quickly online using a queue data structure, with the following specific steps:
First step: read in the sequence database SDB, the minimum gap min, the maximum gap max and the minimum average utility threshold minun:
read in the given sequence database SDB and determine that it contains N sequences in total; denote the sequences in SDB as sequence S1, sequence S2, …, sequence Sk, …, sequence SN, where 1 ≤ k ≤ N; denote the characters contained in sequence Sk as character s1, character s2, …, character sn; read in the given minimum gap min, maximum gap max and minimum average utility threshold minun;
Second step: generate the high average utility pattern set and the high upper bound pattern set of length 1:
calculate the average utility value and the average utility upper bound of each character in the sequence database SDB read in the first step; add the characters whose average utility value is greater than or equal to the minimum average utility threshold minun to the high average utility pattern set of length 1, and add the characters whose average utility upper bound is greater than or equal to the minimum average utility threshold minun to the high upper bound pattern set of length 1, thereby generating the high average utility pattern set and the high upper bound pattern set of length 1;
Third step: generate the candidate pattern set of length i + 1:
generate the candidate pattern set of length i + 1 from the high upper bound pattern set of length i:
① when i = 1, the characters in the high upper bound pattern set of length 1 obtained in the second step are combined with each other to generate the candidate pattern set of pattern length i + 1;
② when i > 1, in the process of generating candidate patterns, for a pattern P = p1p2…pm-1pm, prefix(P) is the prefix of pattern P, that is, the part that remains after removing the last sub-pattern pm of pattern P, so prefix(P) = p1p2…pm-1; suffix(P) is the suffix of pattern P, that is, the part that remains after removing the first sub-pattern p1 of pattern P, so suffix(P) = p2…pm-1pm. When there are two patterns P and R of length i and the suffix of pattern P equals the prefix of pattern R, that is, suffix(P) = p2p3…pi = prefix(R) = r1r2…ri-1, pattern P and pattern R are spliced by the pattern splicing method into a pattern T of length i + 1:
$$T = p_1 p_2 \cdots p_i \, r_i$$
The specific processing method for generating the candidate pattern set of length i + 1 by the pattern splicing method is as follows:
when the high upper bound pattern set of length i is not empty, traverse it from left to right and take out each pattern Pa in turn; calculate suffix(Pa), then search from left to right for a pattern Pb satisfying the condition suffix(Pa) = prefix(Pb); splice pattern Pa and pattern Pb into a pattern T of pattern length i + 1 and add T to the candidate pattern set of pattern length i + 1; splice Pa with every pattern Pb in the high upper bound pattern set that satisfies suffix(Pa) = prefix(Pb); repeat these steps for all patterns in the high upper bound pattern set until the last pattern has been spliced, thereby generating the candidate pattern set of length i + 1;
Fourth step: calculate the average utility value and the average utility upper bound of the patterns in the candidate pattern set of length i + 1:
(4.1) calculate in turn the support of each pattern in the candidate pattern set of length i + 1 obtained in the third step:
first, read in a pattern P from the candidate pattern set of length i + 1 and, according to the number n of sub-patterns of the pattern, determine that n queues need to be created, denoted queue Q1, queue Q2, …, queue Qj, …, queue Qn, 1 ≤ j ≤ n;
then, create the nodes in the n queues in turn using a depth-first strategy with backtracking. Specifically, a node of queue Qj represents the position, in a sequence S of the sequence database SDB read in the first step, at which the j-th sub-pattern pj of pattern P occurs. Under the non-overlapping constraint, the same node is not allowed to appear twice in the same queue, but the same node is allowed to appear in different queues. Before a tail node of queue Qj is created, it must first be determined whether the periodic gap constraint is satisfied and whether the node already exists in queue Qj. A node already exists in queue Qj in two cases: 1) the node has been used by a previous occurrence; 2) the node has already been visited and no occurrence could be found through it. In both cases the node cannot be created at the tail of queue Qj, and the search continues for another tail node of queue Qj that satisfies the periodic gap constraint. When the last queue creates a node, one occurrence has been found; nodes continue to be added in the same way until the last character of the sequence string has been scanned, at which point the creation of queue nodes ends. The last queue is then traversed, and the number of nodes obtained is the support of the pattern. The pattern support sup(P) is calculated in this way, in turn, for each pattern P in the candidate pattern set of length i + 1 obtained in the third step;
(4.2) calculate the average utility value and the average utility upper bound of the patterns in the candidate pattern set of length i + 1 according to the following formulas; the calculation is divided into two steps:
① the average utility value PAU(P) of a pattern in the candidate pattern set of length i + 1 is obtained according to the following formula (1):
$$\mathrm{PAU}(P) = \frac{\mathrm{sup}(P) \times \sum_{j=1}^{m} U(p_j)}{m} \tag{1}$$
In formula (1), U(pj) is the utility value of the j-th item of pattern P in the candidate pattern set, sup(P) is the support of pattern P, and m is the length of pattern P in the candidate pattern set;
② the average utility upper bound SPU(P) of a pattern in the candidate pattern set of length i + 1 is calculated according to the following formula (2):
$$\mathrm{SPU}(P) = \mathrm{sup}(P) \times U_{max} \tag{2}$$
In formula (2), sup(P) is the support of pattern P in the candidate pattern set obtained in step (4.1), and Umax is the maximum of the utility values of the characters;
Fifth step: obtain the high average utility pattern set and the high upper bound pattern set of length i + 1:
for each candidate pattern P in the candidate pattern set of length i + 1 generated in the third step, calculate in turn, by the fourth step, its average utility value PAU(P) and its average utility upper bound SPU(P); when PAU(P) is greater than or equal to the minimum average utility threshold minun, add the candidate pattern to the high average utility pattern set of length i + 1; when SPU(P) is greater than or equal to the minimum average utility threshold minun, add the candidate pattern to the high upper bound pattern set of length i + 1, thereby obtaining the high average utility pattern set and the high upper bound pattern set of length i + 1;
Sixth step: judge whether the candidate pattern set of length i + 1 or the high upper bound pattern set of length i + 1 is empty; if not, return to the third, fourth and fifth steps; if so, mining of the high average utility patterns under the non-overlapping condition ends;
Seventh step: output all mined high average utility patterns on a display.
CN202010544978.XA 2020-06-15 2020-06-15 High average utility sequence pattern mining method under non-overlapping condition Pending CN111475551A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010544978.XA CN111475551A (en) 2020-06-15 2020-06-15 High average utility sequence pattern mining method under non-overlapping condition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010544978.XA CN111475551A (en) 2020-06-15 2020-06-15 High average utility sequence pattern mining method under non-overlapping condition

Publications (1)

Publication Number Publication Date
CN111475551A true CN111475551A (en) 2020-07-31

Family

ID=71765277

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010544978.XA Pending CN111475551A (en) 2020-06-15 2020-06-15 High average utility sequence pattern mining method under non-overlapping condition

Country Status (1)

Country Link
CN (1) CN111475551A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112801793A (en) * 2021-01-31 2021-05-14 哈尔滨工业大学(威海) Method for mining high-profit commodities in e-commerce transaction data
CN113792099A (en) * 2021-08-12 2021-12-14 上海熙业信息科技有限公司 Data flow high-utility item set mining system based on historical effective table pruning
CN113886396A (en) * 2021-10-20 2022-01-04 电子科技大学 Power system fault detection method and system based on high-utility frequent pattern mining
CN115964415A (en) * 2023-03-16 2023-04-14 山东科技大学 Pre-HUSPM-based database sequence insertion processing method

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112801793A (en) * 2021-01-31 2021-05-14 哈尔滨工业大学(威海) Method for mining high-profit commodities in e-commerce transaction data
CN112801793B (en) * 2021-01-31 2022-04-15 哈尔滨工业大学(威海) Method for mining high-profit commodities in e-commerce transaction data
CN113792099A (en) * 2021-08-12 2021-12-14 上海熙业信息科技有限公司 Data flow high-utility item set mining system based on historical effective table pruning
CN113792099B (en) * 2021-08-12 2023-08-25 上海熙业信息科技有限公司 Data flow high-utility item set mining system based on historical utility table pruning
CN113886396A (en) * 2021-10-20 2022-01-04 电子科技大学 Power system fault detection method and system based on high-utility frequent pattern mining
CN115964415A (en) * 2023-03-16 2023-04-14 山东科技大学 Pre-HUSPM-based database sequence insertion processing method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200731
