CN111581262A

CN111581262A - Order-preserving sequence pattern mining method

Info

Publication number: CN111581262A
Application number: CN202010544303.5A
Authority: CN
Inventors: 武优西; 户倩; 郭媛; 王晓慧; 赵晓倩; 王珠林; 崔文峰
Original assignee: Hebei University of Technology
Current assignee: Hebei University of Technology
Priority date: 2020-06-15
Filing date: 2020-06-15
Publication date: 2020-08-25

Abstract

The invention discloses an order-preserving sequence pattern mining method and relates to the technical field of electric digital data processing. Candidate patterns are generated by a pattern fusion method, which reduces the number of candidate patterns, and the support of each candidate pattern is calculated through a series of conversion and verification steps. The method overcomes the defects of the prior art in mining frequent patterns from time series: it is difficult to achieve accuracy, generality and completeness of the solution at the same time, important information is easily lost when the time series is processed, and key trends are difficult to analyze by mining frequent patterns.

Description

Order-preserving sequence pattern mining method

Technical Field

The technical solution of the invention relates to the technical field of electric digital data processing, in particular to an order-preserving sequence pattern mining method.

Background

Sequence pattern mining has become one of the important tasks in data mining and is widely applied in sequence analysis, classification, prediction and the like; its task is to find frequently occurring patterns in massive sequence data. Sequence data is generally divided into two categories: character sequences and time series. Common character sequences include DNA sequences and protein sequences, and the frequent patterns in them can help solve biological problems. A time series is numerical data measured and recorded over time; daily stock prices, oil production and daily temperatures are common examples. The individual numerical values are often of little significance, and people are more interested in the trend the data presents. In the stock market, for example, an analyst may want to know whether there are periods in which a company's stock price falls for 10 consecutive days and then rises within the next 5 days; in that case the pattern of change in the stock price is more meaningful than its actual values. Finding frequent trends in time series therefore helps people understand how things develop and provides a theoretical basis for prediction and decision-making.

A frequent pattern is a pattern whose support is not less than the minimum support threshold minsup, that is, a pattern whose number of occurrences in the data set is not less than minsup. Many frequent pattern mining methods for character sequences have been proposed, but they cannot be applied directly to time series, because time series are high-dimensional, continuous and large in volume. Before mining, a preprocessing step is therefore usually applied to convert the numerical data into data of another domain. A common approach is to symbolize the time series, for example with the SAX method, which converts the numerical data into character data that is then mined. This preprocessing has the following defects: parameters must be set manually, important information is easily lost in the process, and the continuity of the time series is broken to some extent. Taking SAX as an example, two time series with different trends may be symbolized into the same character sequence: as shown in (a) and (b) of Fig. 1 of the accompanying drawings, two time series with clearly different trends both become "beccde" after SAX symbolization. This is very unfavorable for trend analysis of time series, so a more complete mining method is required.

The concept of order preservation offers a new approach to trend analysis of time series and has already been applied to the order-preserving matching problem. The idea is to search for patterns determined by the relative order of the values rather than by their absolute values: a match is found when the relative order of a subsequence is the same as that of the given pattern. Example A below explains relative order and the order-preserving pattern matching problem in detail.

Example A. Given the time series S = (s₁, s₂, s₃, s₄, s₅, s₆, s₇, s₈, s₉, s₁₀, s₁₁, s₁₂, s₁₃, s₁₄, s₁₅, s₁₆, s₁₇) = (9, 12, 11, 17, 16, 21, 14, 18, 15, 19, 21, 19, 26, 18, 25, 26, 27) and the pattern P = (p₁, p₂, p₃, p₄, p₅) = (6, 5, 8, 4, 7).

In Example A, the relative order of the given pattern P = (6, 5, 8, 4, 7) is (3, 2, 5, 1, 4): in the pattern P of length 5, p₄ = 4 is the smallest of the 5 values, so the relative order of p₄ is 1; likewise, p₅ = 7 is the fourth smallest of the 5 values, so the relative order of p₅ is 4. The task of order-preserving pattern matching is to find the subsequences of the time series S that have the same relative order as the pattern P. Fig. 2 of the accompanying drawings shows that (s₄, s₅, s₆, s₇, s₈) = (17, 16, 21, 14, 18) is one occurrence, because its relative order is also (3, 2, 5, 1, 4), the same as that of P; for the same reason, (s₁₁, s₁₂, s₁₃, s₁₄, s₁₅) = (21, 19, 26, 18, 25) is another occurrence. As Fig. 2 shows, the fluctuation trends of the two matched subsequences are very similar to that of the pattern P. This is the defining characteristic of order-preserving sequence patterns: they represent the trend of a time series well.
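
The relative-order computation and the matching in Example A can be checked with a short Python sketch. It compares windows one by one rather than using any optimized matching algorithm, and the function names `relative_order` and `find_matches` are illustrative choices that do not come from the source. The pattern is assumed to contain distinct values.

```python
def relative_order(values):
    """Return the rank (1 = smallest) of every element of `values`."""
    ranked = sorted(values)
    return tuple(ranked.index(v) + 1 for v in values)


def find_matches(series, pattern):
    """Return the 1-based start positions of the windows of `series`
    whose relative order equals the relative order of `pattern`."""
    m = len(pattern)
    target = relative_order(pattern)
    return [i + 1
            for i in range(len(series) - m + 1)
            if relative_order(series[i:i + m]) == target]


# Data of Example A.
S = (9, 12, 11, 17, 16, 21, 14, 18, 15, 19, 21, 19, 26, 18, 25, 26, 27)
P = (6, 5, 8, 4, 7)

print(relative_order(P))   # (3, 2, 5, 1, 4)
print(find_matches(S, P))  # [4, 11]
```

The two reported start positions correspond to the subsequences (s₄, ..., s₈) and (s₁₁, ..., s₁₅) named in the example.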

Example A shows that existing order-preserving pattern matching techniques can find the subsequences of a time series that have the same trend as a given pattern P. This does not always meet user needs, however, because the user sometimes has no prior knowledge, cannot specify a pattern in advance, and is more interested in patterns that occur frequently but are unknown. The invention provides an order-preserving sequence pattern mining method for mining the order-preserving sequence patterns that occur frequently in a time series. Each frequent order-preserving sequence pattern found represents a frequent trend, so that from the mining result the user can learn how the data changes over a period of time and predict the trend of future data accordingly, which is of practical significance and value. Example B below describes the order-preserving sequence pattern mining problem in detail.

Example B. Given the time series S = (s₁, s₂, s₃, s₄, s₅, s₆, s₇, s₈, s₉, s₁₀, s₁₁, s₁₂, s₁₃, s₁₄, s₁₅, s₁₆) = (12, 11, 22, 26, 13, 15, 19, 20, 27, 14, 17, 21, 25, 31, 16, 18) and the minimum support threshold minsup = 3.

The relative order of the subsequence (s₃, s₄, s₅, s₆) = (22, 26, 13, 15) is (3, 4, 1, 2). In the same way, the relative orders of the subsequences (s₈, s₉, s₁₀, s₁₁) = (20, 27, 14, 17) and (s₁₃, s₁₄, s₁₅, s₁₆) = (25, 31, 16, 18) are also (3, 4, 1, 2), so subsequences with relative order (3, 4, 1, 2) occur 3 times in total. The pattern expressed by the relative order (3, 4, 1, 2) is called an order-preserving sequence pattern; since its 3 occurrences reach minsup = 3, it is frequent. Fig. 3 of the accompanying drawings shows that the trends of the subsequences (s₃, s₄, s₅, s₆), (s₈, s₉, s₁₀, s₁₁) and (s₁₃, s₁₄, s₁₅, s₁₆) are very similar and can all be expressed as (3, 4, 1, 2). The goal of the order-preserving sequence pattern mining problem is to mine all frequent order-preserving sequence patterns in a given time series. In Example B there are 7 frequent order-preserving sequence patterns in the time series S, namely (1,2), (2,1), (1,2,3), (2,3,1), (3,1,2), (1,2,3,4) and (3,4,1,2). These are all important trends that occur frequently in S, and the user can base subsequent prediction and decision-making on the mining result, which is of great practical significance.
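
The seven patterns listed for Example B can be reproduced with a brute-force baseline that simply counts the relative order of every window of every length. This is a sketch for checking the example, not the method of the invention, which avoids such exhaustive enumeration; `brute_force_frequent` is a hypothetical name, and windows containing equal values are skipped as an assumption.

```python
from collections import Counter


def relative_order(values):
    """Return the rank (1 = smallest) of every element of `values`."""
    ranked = sorted(values)
    return tuple(ranked.index(v) + 1 for v in values)


def brute_force_frequent(series, minsup):
    """Count the relative order of every window, length by length."""
    result = []
    for m in range(2, len(series) + 1):
        windows = (series[i:i + m] for i in range(len(series) - m + 1))
        # Assumption: a window with repeated values matches no pattern.
        counts = Counter(relative_order(w) for w in windows
                         if len(set(w)) == m)
        level = sorted(p for p, c in counts.items() if c >= minsup)
        if not level:
            # Every occurrence of a longer pattern contains an occurrence
            # of its prefix, so no longer pattern can be frequent either.
            break
        result.extend(level)
    return result


# Data of Example B.
S = (12, 11, 22, 26, 13, 15, 19, 20, 27, 14, 17, 21, 25, 31, 16, 18)
for pattern in brute_force_frequent(S, minsup=3):
    print(pattern)
```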

Time series pattern mining generally requires accuracy, generality and completeness of the solution. When a high-dimensional time series is processed, loss of valuable information and excessive time and space complexity must be avoided, and the key trends in the time series should be obtainable by finding frequent patterns; existing techniques have difficulty meeting these conditions at the same time. CN107451293A discloses a method and an apparatus for mining contrast patterns in multi-class sequence data sets, but the technique targets character data; because time series are high-dimensional, applying it directly to time series mining leads to excessive time and space complexity. The document by Chen et al., "Text emotion feature extraction method based on order-preserving submatrix and frequent sequence pattern mining", Journal of Shandong University, studies a method for mining order-preserving submatrices from feature vectors converted from Chinese online review data, but the method vectorizes text data and builds a matrix, must consider the rows and columns of the matrix simultaneously, does not fit the characteristics of one-dimensional time series, cannot be applied to time series analysis, and thus lacks generality. The document by Keogh et al., "HOT SAX: Efficiently finding the most unusual time series subsequence", IEEE International Conference on Data Mining, studies a method for finding anomalous patterns in time series, but it requires a SAX preprocessing step before mining, which loses important information and breaks the continuity of the original time series to some extent. The document by Kim et al., "Order-preserving matching", Theoretical Computer Science, studies a method for finding the subsequences of a time series that have the same relative order as a known pattern, but the technique can only calculate the support of a given order-preserving pattern; it cannot discover frequently occurring but unknown order-preserving patterns in a data set, so key trends in the time series cannot be analyzed, and it is limited in both the difficulty of the problems it solves and its range of application.

In summary, the prior art for mining frequent patterns from time series has the following defects: it is difficult to achieve accuracy, generality and completeness of the solution at the same time, important information is easily lost when the time series is processed, and key trends are difficult to analyze by mining frequent patterns.

Disclosure of Invention

The technical problem to be solved by the invention is as follows: an order-preserving sequence pattern mining method is provided in which candidate patterns are generated by a pattern fusion method, which reduces the number of candidate patterns, and the support of each candidate pattern is calculated through a series of conversion and verification steps. The method overcomes the defects of the prior art in mining frequent patterns from time series: it is difficult to achieve accuracy, generality and completeness of the solution at the same time, important information is easily lost when the time series is processed, and key trends are difficult to analyze by mining frequent patterns.

The technical solution adopted by the invention to solve this technical problem is as follows: the order-preserving sequence pattern mining method generates candidate patterns by a pattern fusion method, which reduces the number of candidate patterns, and calculates the support of the candidate patterns through a series of conversion and verification steps. The specific steps are as follows:

First, input the time series S and the minimum support threshold minsup:

Input a time series S of length n, whose elements are denoted s₁, s₂, ..., s_n, and input the minimum support threshold minsup, which is the user-specified minimum number of occurrences in the time series S required of a pattern;

Second, obtain the frequent pattern set fre₂ of pattern length 2:

The candidate pattern set of pattern length 2 is cand₂ = {(1,2), (2,1)}. Following the support calculation steps below, the support in the time series S of each candidate pattern P_d in cand₂ is calculated in turn; when the support of a candidate pattern is greater than or equal to the minimum support threshold minsup, the candidate pattern P_d is a frequent pattern of pattern length 2 and is added to the frequent pattern set fre₂ of pattern length 2, thereby obtaining fre₂.

The support of a pattern is calculated as follows:

First, the elements of the currently processed candidate pattern P_d are sorted in ascending order, and the position index in P_d of the element ranked i-th is recorded as index[i], so that p_{index[i]} < p_{index[i+1]} holds in P_d, where p_{index[i]} is the element of P_d ranked i-th, p_{index[i+1]} is the element of P_d ranked (i+1)-th, 1 ≤ i ≤ m−1, and m is the pattern length of the currently processed candidate pattern P_d,

Then the candidate pattern P_d is converted into a binary string P' according to formula (1), the elements of P' being denoted a₁, ..., a_i, ..., a_{m−1}, and the time series S is converted into a binary string S' according to formula (2), the elements of S' being denoted b₁, ..., b_j, ..., b_{n−1}. Formulas (1) and (2) are as follows:

$$a_i = \begin{cases} 1, & p_i < p_{i+1} \\ 0, & p_i > p_{i+1} \end{cases} \qquad 1 \le i \le m-1 \tag{1}$$

$$b_j = \begin{cases} 1, & s_j < s_{j+1} \\ 0, & s_j > s_{j+1} \end{cases} \qquad 1 \le j \le n-1 \tag{2}$$

In formulas (1) and (2), m is the pattern length of the currently processed candidate pattern P_d, with initial value 2, and n is the length of the time series S. a_i is the value of the i-th element of the binary string P', where 1 ≤ i ≤ m−1: two consecutive elements p_i and p_{i+1} of the candidate pattern P_d are compared, and a_i equals 1 when p_i < p_{i+1} and 0 when p_i > p_{i+1}. b_j is the value of the j-th element of the binary string S', where 1 ≤ j ≤ n−1: two consecutive elements s_j and s_{j+1} of the time series S are compared, and b_j equals 1 when s_j < s_{j+1} and 0 when s_j > s_{j+1};

Finally, the occurrences of the binary string P' in the binary string S' are found with a classical pattern matching algorithm. Each time an occurrence is found, the corresponding subsequence of the time series S is retained as a candidate subsequence, and, with l₁ the position index of the first element of the candidate subsequence, it is verified whether the condition

$$s_{l_1 + index[i] - 1} < s_{l_1 + index[i+1] - 1}, \qquad 1 \le i \le m-1$$

is satisfied. If it is satisfied, the support of the candidate pattern P_d is increased by one; if it is not satisfied, the support of the candidate pattern P_d is unchanged. Here s_{l₁+index[i]−1} is the element of the candidate subsequence at the position corresponding to the element p_{index[i]} of the candidate pattern P_d, and s_{l₁+index[i+1]−1} is the element of the candidate subsequence at the position corresponding to the element p_{index[i+1]} of the candidate pattern P_d, with 1 ≤ i ≤ m−1. When all occurrences have been found and all candidate subsequences have been verified, the support of the candidate pattern P_d is obtained;
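
The support calculation of the second step (sorting, binary conversion, matching and verification) can be sketched in Python as follows. The function names `to_binary` and `support` are illustrative and not from the source, Python's built-in `str.find` stands in for the unspecified "classical pattern matching algorithm", indices are 0-based, and mapping two equal neighbouring values to '0' is an assumption, since the text only defines the cases < and >.

```python
def to_binary(seq):
    """Formulas (1)/(2): '1' for a rise between neighbours, '0' otherwise.
    Assumption: equal neighbours are mapped to '0'; the strict comparison
    in the verification step rejects such windows afterwards."""
    return "".join("1" if a < b else "0" for a, b in zip(seq, seq[1:]))


def support(pattern, series):
    """Count the windows of `series` with the same relative order as `pattern`."""
    m = len(pattern)
    # index[i]: 0-based position in the pattern of the element ranked i-th.
    index = sorted(range(m), key=lambda k: pattern[k])
    p_bin, s_bin = to_binary(pattern), to_binary(series)

    count = 0
    start = s_bin.find(p_bin)          # filter: same up/down shape
    while start != -1:
        # verification: elements taken in rank order must strictly increase
        if all(series[start + index[i]] < series[start + index[i + 1]]
               for i in range(m - 1)):
            count += 1
        start = s_bin.find(p_bin, start + 1)   # allow overlapping occurrences
    return count


S_A = (9, 12, 11, 17, 16, 21, 14, 18, 15, 19, 21, 19, 26, 18, 25, 26, 27)
S_B = (12, 11, 22, 26, 13, 15, 19, 20, 27, 14, 17, 21, 25, 31, 16, 18)

print(to_binary((3, 4, 1, 2)))          # 101
print(support((3, 4, 1, 2), S_B))       # 3
print(support((3, 2, 5, 1, 4), S_A))    # 2
```

In Example A the binary filter alone keeps four candidate subsequences for the pattern (3,2,5,1,4); the verification step discards two of them, which is why both stages are needed.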

Third, generate the candidate pattern set cand_{L+1} of pattern length L+1:

Using the pattern fusion method, the candidate pattern set cand_{L+1} of pattern length L+1 is generated from the frequent pattern set fre_L of pattern length L, where L is the pattern length of the frequent patterns currently being processed and has initial value 2. In generating the candidate pattern set, for a frequent pattern P with elements p₁, p₂, ..., p_L, the part remaining after the last element p_L is removed is called the prefix of P, denoted prefix(P), and the relative order of that prefix is called the prefix relative order of P; the part remaining after the first element p₁ is removed is called the suffix of P, denoted suffix(P), and the relative order of that suffix is called the suffix relative order of P.

The pattern fusion method has the following fusion rules for two different cases:

1) General case: for a frequent pattern P and a frequent pattern Q, both of pattern length L, with elements p₁, p₂, ..., p_L and q₁, q₂, ..., q_L respectively, when the suffix relative order of P is equal to the prefix relative order of Q but suffix(P) is not equal to prefix(Q) (which, the relative orders being equal, means p₁ ≠ q_L), P and Q can be fused into one candidate pattern of pattern length L+1, denoted X, with elements x₁, x₂, ..., x_{L+1}. The specific fusion rule is as follows:

Compare the first element p₁ of the frequent pattern P with the last element q_L of the frequent pattern Q:

① when p₁ < q_L, let the first element of the candidate pattern X be x₁ = p₁ and the last element be x_{L+1} = q_L + 1; then compare each element p_u of P other than the first with q_L: when p_u > q_L, the element of X at the corresponding position is x_u = p_u + 1, otherwise x_u = p_u, where 2 ≤ u ≤ L;

② when p₁ > q_L, let the first element of the candidate pattern X be x₁ = p₁ + 1 and the last element be x_{L+1} = q_L; then compare each element q_v of Q other than the last with p₁: when q_v > p₁, the element of X at the corresponding position is x_{v+1} = q_v + 1, otherwise x_{v+1} = q_v, where 1 ≤ v ≤ L−1;

2) Special case: for a frequent pattern P and a frequent pattern Q, both of pattern length L, with elements p₁, p₂, ..., p_L and q₁, q₂, ..., q_L respectively, when not only the suffix relative order of P is equal to the prefix relative order of Q but suffix(P) is also equal to prefix(Q) (which, the relative orders being equal, means p₁ = q_L), P and Q can be fused into two candidate patterns of pattern length L+1, denoted T and K, with elements t₁, t₂, ..., t_{L+1} and k₁, k₂, ..., k_{L+1} respectively. The specific fusion rule is as follows:

When generating the candidate pattern T, let the first element be t₁ = p₁ + 1 and the last element be t_{L+1} = p₁; then compare each element p_u of P other than the first with p₁: when p_u > p₁, the element of T at the corresponding position is t_u = p_u + 1, otherwise t_u = p_u, where 2 ≤ u ≤ L;

When generating the candidate pattern K, let the first element be k₁ = p₁ and the last element be k_{L+1} = p₁ + 1; then compare each element p_u of P other than the first with p₁: when p_u > p₁, the element of K at the corresponding position is k_u = p_u + 1, otherwise k_u = p_u, where 2 ≤ u ≤ L;

Using the pattern fusion method, the candidate pattern set cand_{L+1} of pattern length L+1 is generated from the frequent pattern set fre_L of pattern length L by the following procedure:

When the frequent pattern set fre_L of pattern length L is not empty, first take the first frequent pattern P_a from fre_L and calculate the prefix relative order and the suffix relative order of P_a. Then traverse the frequent patterns P_b in fre_L from left to right and judge in turn whether P_b and P_a satisfy either of the two cases of the pattern fusion method. When either case is satisfied, the candidate pattern or patterns of pattern length L+1 are generated according to the corresponding fusion rule and added to the candidate pattern set cand_{L+1} of pattern length L+1. When all frequent patterns P_b have been traversed, the fusion process for P_a ends; the next frequent pattern in fre_L is then taken as P_a and the above steps are repeated until the last frequent pattern in fre_L has been processed, which completes the generation of the candidate pattern set cand_{L+1} of pattern length L+1;
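
The fusion rules and the candidate generation of the third step can be sketched in Python as follows. The names `fuse` and `generate_candidates` are illustrative and not from the source. As an assumption of this sketch, every ordered pair (P_a, P_b) is tried once with P_a supplying the prefix side of the candidate; the text does not spell out the pairing direction.

```python
def relative_order(values):
    """Return the rank (1 = smallest) of every element of `values`."""
    ranked = sorted(values)
    return tuple(ranked.index(v) + 1 for v in values)


def fuse(p, q):
    """Fuse two patterns of length L into zero, one or two candidates of
    length L+1, following the general and special cases described above."""
    if relative_order(p[1:]) != relative_order(q[:-1]):
        return []                                   # not fusible
    p1, qL = p[0], q[-1]
    if p1 < qL:                                     # general case, rule 1
        middle = [pu + 1 if pu > qL else pu for pu in p[1:]]
        return [tuple([p1] + middle + [qL + 1])]
    if p1 > qL:                                     # general case, rule 2
        middle = [qv + 1 if qv > p1 else qv for qv in q[:-1]]
        return [tuple([p1 + 1] + middle + [qL])]
    # special case: p1 == qL, i.e. suffix(P) == prefix(Q)
    middle = [pu + 1 if pu > p1 else pu for pu in p[1:]]
    t = tuple([p1 + 1] + middle + [p1])
    k = tuple([p1] + middle + [p1 + 1])
    return [t, k]


def generate_candidates(frequent):
    """Build cand_{L+1} from fre_L by trying every ordered pair."""
    candidates = []
    for p_a in frequent:
        for p_b in frequent:
            candidates.extend(fuse(p_a, p_b))
    return candidates


print(generate_candidates([(1, 2), (2, 1)]))
# [(1, 2, 3), (2, 3, 1), (1, 3, 2), (3, 1, 2), (2, 1, 3), (3, 2, 1)]
print(fuse((2, 3, 1), (3, 1, 2)))
# [(3, 4, 1, 2), (2, 4, 1, 3)]
```

Starting from fre₂ = {(1,2), (2,1)}, the sketch yields all six permutations of length 3, so no length-3 pattern is missed. A fused candidate is fully determined by the pair of patterns it comes from, so different pairs cannot produce the same candidate.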

Fourth, obtain the frequent pattern set fre_{L+1} of pattern length L+1:

According to the support calculation of the second step, the support sup(P_d, S) of each candidate pattern P_d in the candidate pattern set cand_{L+1} of pattern length L+1 is calculated in turn; when sup(P_d, S) is greater than or equal to the minimum support threshold minsup, the candidate pattern P_d is added to the frequent pattern set fre_{L+1} of pattern length L+1. When the support of all candidate patterns in cand_{L+1} has been calculated, the frequent pattern set fre_{L+1} of pattern length L+1 is obtained;

Fifth, complete the mining of order-preserving sequence patterns:

When the frequent pattern set fre_{L+1} of pattern length L+1 is not empty, the third step and the fourth step are repeated for the next pattern length, until the candidate pattern set cand_{L+1} of pattern length L+1 is empty or the frequent pattern set fre_{L+1} of pattern length L+1 is empty, at which point the mining of order-preserving sequence patterns is complete.
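
The five steps can be put together in a compact end-to-end Python sketch and run on Example B. All function names (`rel`, `to_binary`, `support`, `fuse`, `mine`) are illustrative and not from the source; `str.find` stands in for the classical pattern matching algorithm, and equal neighbouring values are mapped to '0' as an assumption.

```python
def rel(values):
    """Relative order: rank (1 = smallest) of every element."""
    ranked = sorted(values)
    return tuple(ranked.index(v) + 1 for v in values)


def to_binary(seq):
    """'1' for a rise between neighbours, '0' otherwise (ties assumed '0')."""
    return "".join("1" if a < b else "0" for a, b in zip(seq, seq[1:]))


def support(pattern, series):
    """Second step: binary filtering followed by rank-order verification."""
    m = len(pattern)
    index = sorted(range(m), key=lambda k: pattern[k])
    p_bin, s_bin = to_binary(pattern), to_binary(series)
    count, start = 0, s_bin.find(p_bin)
    while start != -1:
        if all(series[start + index[i]] < series[start + index[i + 1]]
               for i in range(m - 1)):
            count += 1
        start = s_bin.find(p_bin, start + 1)
    return count


def fuse(p, q):
    """Third step: pattern fusion of two length-L patterns."""
    if rel(p[1:]) != rel(q[:-1]):
        return []
    a, b = p[0], q[-1]
    if a < b:
        return [(a,) + tuple(x + (x > b) for x in p[1:]) + (b + 1,)]
    if a > b:
        return [(a + 1,) + tuple(x + (x > a) for x in q[:-1]) + (b,)]
    mid = tuple(x + (x > a) for x in p[1:])
    return [(a + 1,) + mid + (a,), (a,) + mid + (a + 1,)]


def mine(series, minsup):
    """Level-wise loop: stop when cand_{L+1} or fre_{L+1} is empty."""
    frequent = []
    level = [p for p in [(1, 2), (2, 1)] if support(p, series) >= minsup]
    while level:
        frequent.extend(level)
        candidates = [x for p in level for q in level for x in fuse(p, q)]
        level = [c for c in candidates if support(c, series) >= minsup]
    return frequent


S = (12, 11, 22, 26, 13, 15, 19, 20, 27, 14, 17, 21, 25, 31, 16, 18)
for pattern in mine(S, minsup=3):
    print(pattern, support(pattern, S))
```

On Example B the loop evaluates 2 candidates of length 2, 6 of length 3, 8 of length 4 and 1 of length 5 before the frequent set becomes empty, and it returns the seven patterns listed in the Background.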

In the order-preserving sequence pattern mining method, the programming software used is VC++ 6.0, the drawing tool is Visio 2013, the processor used is a Pentium (R) Dual-Core 32Processor +, and the operating system is Windows 7 or later. The software and hardware environment and the classical pattern matching algorithm used are well known to those skilled in the art.

The beneficial effects of the invention are as follows. Compared with the prior art, the invention has the following prominent substantive features:

(1) The invention solves the order-preserving sequence pattern mining problem. The time series S and the minimum support threshold minsup are first read in and the frequent pattern set fre₂ of pattern length 2 is determined. By the pattern fusion method, the candidate pattern set cand₃ of pattern length 3 is generated from fre₂; the support of each candidate pattern in cand₃ is then calculated, and the candidate patterns whose support is not less than minsup are added to the frequent pattern set fre₃ of pattern length 3. When the support of a candidate pattern is calculated, its number of occurrences in the time series S is obtained through a series of conversion and verification steps, which guarantees a complete support result while reducing the number of invalid sequence segments that must be examined. This process is iterated until the candidate pattern set cand_{L+1} or the frequent pattern set fre_{L+1} of pattern length L+1 is empty, at which point the mining of order-preserving sequence patterns is complete. In this way the number of support calculations is reduced and the time and space complexity are lowered, so that the order-preserving sequence pattern mining problem is solved.

(2) CN106339609A discloses an optimal contrast sequence pattern mining method with free gap constraints. That method mines contrast patterns, takes two classes of sequence sets (positive and negative examples) as input, and allows gap constraints. The present invention mines order-preserving patterns: its input is a single class of sequences with no positive or negative labels, which makes it more general, and it has no gap constraints, which makes the mined patterns more accurate. This is the most substantial difference between the two.

(3) CN105868314A discloses a method for mining weighted negative sequence patterns under multiple supports. That method mines negative sequence patterns and requires two parameters, a weighted support and a minimum support; if the parameters are set improperly, the mining result is inaccurate. The present invention mines the desired frequent order-preserving sequence patterns with only one parameter, the minimum support threshold. This is the most substantial difference between the two.

(4) CN109101530A discloses a high-utility event sequence pattern mining algorithm, which mines high-utility event sequence patterns. The present invention mines order-preserving sequence patterns, which concern frequent trends of change and do not require the utility value of a pattern to be considered. This is the most substantial difference between the two.

(5) CN104750830A discloses a method for mining periodicity in time series data. Although it also mines time series, what it mines are the periodic patterns in them, which are of little use for analyzing trends. The present invention mines order-preserving sequence patterns, which capture the key trends that occur frequently in a time series. This is the most substantial difference between the two.

(6) CN104182461A discloses a time series data mining system in which a time series cluster analysis module divides the time series data into categories according to their degree of association and the frequent patterns of the time series are finally computed. The present invention analyzes time series from the perspective of sequence pattern mining: candidate patterns are generated by the pattern fusion method and their support in the time series is calculated through a series of conversion and verification steps, so that all frequent patterns are obtained without classifying the time series. Compared with CN104182461A, the invention represents significant progress.

(7) CN108874952A discloses a method for mining maximal frequent sequence patterns based on distributed logs. Because that method mines only maximal frequent sequence patterns, many frequent patterns are removed from the result set and much useful information may be missed. The present invention mines all frequent patterns, which guarantees the completeness of the result. Compared with CN108874952A, the invention represents significant progress.

(8) CN106469171A discloses a parallel frequent time series mining method, which first converts the time series into a character sequence matrix and mines frequent patterns on that basis. The conversion easily loses valuable information, and the result is unfavorable for trend analysis of the time series. The present invention mines the time series directly without conversion, and the mining result helps people understand the important trends in how things develop. Compared with CN106469171A, the invention represents significant progress.

(9) CN109344179A discloses a frequent adjacent sequence pattern mining method, which stores patterns of different lengths in different sparse tensors and finds the frequent patterns in each, resulting in high time and space complexity. The present invention first generates short frequent patterns and then generates longer candidate patterns by fusing them with the pattern fusion method, which effectively reduces the number of candidate patterns, lowers the time and space complexity, and improves efficiency. Compared with CN109344179A, the invention represents significant progress.

(10) CN107844540A discloses a time series mining method for power data, which first preprocesses the data and then partitions the database to generate sequence pattern sets. Preprocessing the time series loses valuable information and breaks the continuity of the time series to some extent. The present invention mines without preprocessing the time series and can find the key trends in it without losing useful information. Compared with CN107844540A, the invention represents significant progress.

(11) CN109033341A discloses a concurrency-based Top-k contrast sequence pattern mining algorithm with gap constraints. It mines contrast sequence patterns from character sequences and requires gap constraints to be set, which makes the mining process more complex and the mining result less accurate. The present invention mines order-preserving sequence patterns from time series and requires no gap constraints, so the mined result is accurate and effective. Compared with CN109033341A, the invention represents significant progress.

(12) CN108073701A discloses a method for mining rare patterns in multi-dimensional time series data. It mines multi-dimensional time series for rare patterns that occur infrequently, so its range of application is narrower, whereas in practice people are more concerned with frequently occurring patterns. The present invention mines one-dimensional time series and is more widely applicable. Compared with CN108073701A, the invention represents significant progress.

(13) CN107451293A discloses a method for mining contrast patterns from multi-class sequence data. That method applies to character sequences and cannot be applied directly to time series. The substantive feature of the mining method of the present invention is that it targets time series data and analyzes the trend of the time series. Its notable advantage is that it mines frequent order-preserving sequence patterns in the time series, finds the trend changes that the numerical data exhibit frequently, and provides a theoretical basis for prediction and decision-making.

(14) In the text emotion feature extraction method based on order-preserving submatrices and frequent sequence pattern mining, published in the Journal of Shandong University, the order-preserving submatrix represents data as a matrix, and the model is used to analyze the emotional tendency of text. The substantive feature of the mining method of the present invention is that the order-preserving sequence pattern represents data as a sequence, and the key trends in a time series are analyzed through the sequence pattern. Its notable advantage is that frequent patterns can be mined from the time series without preprocessing the data, so that valuable information is not lost and the key trends in the time series can be analyzed.

Compared with the prior art, the method has the following remarkable progress:

(1) The method mines order-preserving sequence patterns in a time series and can find the frequent trends in the time series without preprocessing. It overcomes the information loss caused by the additional preprocessing steps of the prior art, and fills the gap that the prior art can only locate the occurrences of a known order-preserving pattern in a sequence and cannot discover unknown frequent order-preserving patterns;

(2) The method introduces the concept of order preservation into sequence pattern mining. Most existing methods focus on the absolute values of a pattern and ignore its overall trend, so they cannot effectively reflect the differences between time series in time series analysis. The present method focuses on the relative magnitudes of the values and expresses the trend of numerical data through the relative order within a pattern, which better matches the characteristics of numerical data and is general and practically meaningful for time series analysis;

(3) The invention discovers frequently occurring trend changes in a time series. Existing mining techniques must preprocess the time series before mining, for example by SAX symbolization, whereas the invention mines without this step. The continuity of the time series is therefore not damaged and valuable information is not missed, so the mining results are more complete, more widely applicable, and better suited to the needs of practical work;

(4) Mining order-preserving sequence patterns involves two core problems: calculating pattern support and generating candidate patterns. The prior art mainly addresses the first problem, calculating the support of a pattern by order-preserving pattern matching. That technique alone cannot complete the mining, because a method for generating candidate patterns is also needed, and no technique for generating order-preserving candidate patterns currently exists. Based on the characteristics of order-preserving sequence patterns, the invention therefore provides a new pattern fusion method for generating candidate patterns, which allows the mining process to run to completion, fills the gap that the prior art can only locate the occurrences of a known order-preserving pattern in a sequence and cannot discover unknown frequent order-preserving patterns, and has great practical significance.

(5) Applied appropriately to time series, the method of the invention can help a user obtain the rule of data change over a period of time and provides a theoretical basis for predicting future trends and for decision-making, so the method has important research value. The order-preserving sequence pattern mining method of the invention not only helps a user extract valuable information and knowledge but also reduces the difficulty of data processing and analysis, and has great development potential.

Drawings

The invention is further illustrated with reference to the figures and examples.

Fig. 1 is a comparison of two clearly different time series that SAX symbolizes into the same character sequence.

Fig. 2 shows all occurrences of the pattern P of Example A in the time series S.

Fig. 3 is a trend graph of the time series S in Example B; the sub-sequences drawn with dotted lines are the occurrences of the order-preserving sequence pattern (3,4,1,2) in the time series S.

FIG. 4 is a schematic flow chart of a computer process used in the method of the present invention.

Detailed Description

As shown in Fig. 1, the prior art must symbolize a time series with the SAX method before mining it, converting the numerical data into character data. Because SAX segments the series by piecewise aggregate approximation (PAA) and then averages each segment, two time series with different trend information can be symbolized into the same symbol sequence. Fig. 1 (a) and (b) are time series with two distinct trends, yet SAX symbolizes both as "beccde". This shows that the existing processing of time series loses important information in the data and hinders the analysis of time series trends.
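The information loss described above can be illustrated with a minimal sketch of the PAA averaging stage alone. The two toy series, the segment length of 2, and the function names `paa` and `relative_order` are assumptions made for illustration and are not taken from Fig. 1; the SAX breakpoint mapping is omitted, since identical segment means are necessarily mapped to identical symbols.

```python
def paa(series, segment_len):
    """Piecewise aggregate approximation: mean of each fixed-length segment."""
    return [
        sum(series[i:i + segment_len]) / segment_len
        for i in range(0, len(series), segment_len)
    ]


def relative_order(values):
    """Rank of each value within the sequence (1 = smallest); values must be distinct."""
    ranked = sorted(values)
    return tuple(ranked.index(v) + 1 for v in values)


# Toy data (hypothetical): the two series move in different directions
# inside every segment, yet their segment means are identical.
series_a = [1.0, 3.0, 2.0, 4.0]
series_b = [3.0, 1.0, 4.0, 2.0]

print(paa(series_a, 2), paa(series_b, 2))                  # same segment means
print(relative_order(series_a), relative_order(series_b))  # different trends
```

Both series reduce to the same segment means, so any symbolization applied afterwards cannot tell them apart, while their relative orders differ.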

In the embodiment shown in Fig. 2, the pattern P of Example A has 2 occurrences in the time series S. The time series S has length 17, and the position indices of its 17 values are denoted '1' to '17'. The 1st and 2nd occurrences of the pattern P = (6,5,8,4,7) in the time series S are expressed by the corresponding position indices in S, so the 2 order-preserving occurrences of P in S are <4,5,6,7,8> and <11,12,13,14,15>. This illustrates that the existing order-preserving pattern matching technology can only find the positions at which a given pattern P occurs in a time series S and cannot discover frequent but unknown patterns in the time series.
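Order-preserving matching of a known pattern, as described above, can be sketched as a naive sliding-window scan. The series below is hypothetical (the series of Example A is not reproduced here), `find_occurrences` is an illustrative name, and windows containing equal values are skipped as an assumption.

```python
def relative_order(values):
    """Rank of each value within the sequence (1 = smallest); values must be distinct."""
    ranked = sorted(values)
    return tuple(ranked.index(v) + 1 for v in values)


def find_occurrences(pattern, series):
    """Return 1-based position tuples of the contiguous windows of `series`
    whose relative order equals the relative order of `pattern`."""
    m = len(pattern)
    target = relative_order(pattern)
    hits = []
    for start in range(len(series) - m + 1):
        window = series[start:start + m]
        # Assumption: a window with tied values matches no pattern.
        if len(set(window)) == m and relative_order(window) == target:
            hits.append(tuple(range(start + 1, start + m + 1)))
    return hits


series = [10, 30, 20, 5, 25, 15, 3, 12, 8]   # hypothetical data
pattern = (6, 8, 7)                          # relative order (1, 3, 2)
print(find_occurrences(pattern, series))
# [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
```

The scan needs the pattern as input, which is exactly the limitation noted above: it locates a given pattern but does not discover unknown frequent ones.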

In the example shown in Fig. 3, the relative order of the subsequences (s₃,s₄,s₅,s₆), (s₈,s₉,s₁₀,s₁₁) and (s₁₃,s₁₄,s₁₅,s₁₆) in Example B is (3,4,1,2), so the order-preserving sequence pattern (3,4,1,2) occurs 3 times in the time series S. The support of the pattern (3,4,1,2) is therefore not less than the minimum support threshold minsup, so (3,4,1,2) is a frequent order-preserving sequence pattern. As this example shows, by mining frequent order-preserving sequence patterns the method overcomes the defects of the prior art: valuable information is not missed and the key trends in the time series can be analyzed.

Fig. 4 shows the flow of the computer processing employed by the method of the present invention: 1) start → 2) input the time series S and the minimum support threshold minsup → 3) obtain the frequent pattern set fre₂ with pattern length 2 → 4) generate the candidate pattern set cand_{L+1} with pattern length L+1 → 5) is the candidate pattern set cand_{L+1} empty? If yes, execute step 10; if no, execute step 6 → 6) calculate the support sup(P_d, S) in the time series S of each candidate pattern P_d in cand_{L+1} → 7) determine whether sup(P_d, S) is not less than the minimum support threshold minsup; if yes, execute step 8 → 8) add the candidate pattern P_d to the frequent pattern set fre_{L+1} with pattern length L+1 → 9) is the frequent pattern set fre_{L+1} empty? If no, execute step 4; if yes, execute step 10 → 10) end.
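The flow of Fig. 4 can be sketched as the loop below. This is an illustrative sketch under stated assumptions, not the implementation of the method: support is counted by a brute-force sliding window rather than by the conversion and verification steps of the method, windows containing equal values are skipped, and `fuse` searches for candidates by definition instead of applying the closed-form fusion rules. All function names are hypothetical.

```python
from collections import Counter


def relative_order(values):
    """Rank of each value within the sequence (1 = smallest); values must be distinct."""
    ranked = sorted(values)
    return tuple(ranked.index(v) + 1 for v in values)


def window_supports(series, m):
    """Count the relative order of every contiguous window of length m."""
    counts = Counter()
    for i in range(len(series) - m + 1):
        w = series[i:i + m]
        if len(set(w)) == m:          # assumption: windows with ties are skipped
            counts[relative_order(w)] += 1
    return counts


def fuse(p, q):
    """Candidates x of length L+1 whose first L elements have relative order p
    and whose last L elements have relative order q (brute-force search)."""
    out = []
    for r in range(1, len(p) + 2):    # rank tried for the new last element
        x = tuple(v + 1 if v >= r else v for v in p) + (r,)
        if relative_order(x[1:]) == q:
            out.append(x)
    return out


def mine(series, minsup):
    """Steps 2) to 10) of the flow; returns {pattern length: frequent patterns}."""
    sup = window_supports(series, 2)
    fre = [p for p in [(1, 2), (2, 1)] if sup[p] >= minsup]        # step 3
    result, length = {}, 2
    while fre:                                                     # step 9
        result[length] = fre
        cand = [x for p in fre for q in fre for x in fuse(p, q)]   # step 4
        if not cand:                                               # step 5
            break
        length += 1
        sup = window_supports(series, length)                      # step 6
        fre = [x for x in cand if sup[x] >= minsup]                # steps 7-8
    return result


S = (1.1, 1.2, 1.3, 1.4, 1.5, 1.1, 1.2, 1.3, 1.4, 1.5,
     1.1, 1.2, 1.3, 1.4, 1.5, 1.3, 1.4)
print(mine(S, 3))
```

Run on the time series of Example 1 with minsup = 3, the sketch returns the frequent pattern sets of lengths 2 to 5 that the embodiment derives step by step, and stops when the length-6 set is empty.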

Example 1

Given the time series S = (1.1,1.2,1.3,1.4,1.5,1.1,1.2,1.3,1.4,1.5,1.1,1.2,1.3,1.4,1.5,1.3,1.4) and the minimum support threshold minsup = 3.

First step, input the time series S and the minimum support threshold minsup:

The input time series is S = (1.1,1.2,1.3,1.4,1.5,1.1,1.2,1.3,1.4,1.5,1.1,1.2,1.3,1.4,1.5,1.3,1.4), and the minimum support threshold is minsup = 3;

Second step, obtain the frequent pattern set fre₂ with pattern length 2:

The candidate pattern set with pattern length 2 is cand₂ = {(1,2),(2,1)}. The support in the time series S of each candidate pattern in cand₂ is calculated in turn according to the support calculation steps below; when the support of a candidate pattern P_d is greater than or equal to the minimum support threshold minsup, P_d is a frequent pattern with pattern length 2 and is added to the frequent pattern set fre₂ with pattern length 2;

the calculation steps of the mode support degree are as follows:

the specific operation of this embodiment is as follows:

1) Calculate the support in the time series S of the 1st candidate pattern P₁ = (1,2) in the candidate pattern set cand₂ with pattern length 2: sup(P₁, S) = 13. Because sup(P₁, S) ≥ minsup, the candidate pattern P₁ is added to the frequent pattern set fre₂ with pattern length 2, giving fre₂ = {(1,2)},

2) Calculate the support in the time series S of the 2nd candidate pattern P₂ = (2,1) in the candidate pattern set cand₂ with pattern length 2: sup(P₂, S) = 3. Because sup(P₂, S) ≥ minsup, the candidate pattern P₂ is added to the frequent pattern set fre₂ with pattern length 2, giving fre₂ = {(1,2),(2,1)},

In summary, the frequent pattern set with pattern length 2 is fre₂ = {(1,2),(2,1)};
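For pattern length 2 the support calculation reduces to counting adjacent rises and falls, which gives a quick check of the two values above. The sketch below is illustrative only and is not the support calculation procedure of the method; the variable names are assumptions.

```python
S = (1.1, 1.2, 1.3, 1.4, 1.5, 1.1, 1.2, 1.3, 1.4, 1.5,
     1.1, 1.2, 1.3, 1.4, 1.5, 1.3, 1.4)
minsup = 3

# An occurrence of (1,2) is an adjacent pair that rises,
# an occurrence of (2,1) is an adjacent pair that falls.
rises = sum(a < b for a, b in zip(S, S[1:]))
falls = sum(a > b for a, b in zip(S, S[1:]))

fre2 = [p for p, sup in (((1, 2), rises), ((2, 1), falls)) if sup >= minsup]
print(rises, falls, fre2)   # 13 3 [(1, 2), (2, 1)]
```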

Adopting the pattern fusion method, the candidate pattern set cand_{L+1} with pattern length L+1 is generated from the frequent pattern set fre_L with pattern length L, where L is the pattern length of the frequent patterns currently being processed and its initial value is 2. In generating the candidate pattern set, for a frequent pattern P with elements p₁, p₂, …, p_L, the part remaining after removing the last element p_L is called the prefix of P and is denoted prefix(P), and the relative order of the prefix of P is denoted prefixorder(P); the part remaining after removing the first element p₁ is called the suffix of P and is denoted suffix(P), and the relative order of the suffix of P is denoted suffixorder(P),

② When p₁ > q_L, let the first element of the candidate pattern X be x₁ = p₁ + 1 and the last element be x_{L+1} = q_L; then compare each element q_v of the frequent pattern Q, other than its last element, with the first element p₁ of the frequent pattern P: when q_v > p₁, the corresponding element of the candidate pattern X is x_{v+1} = q_v + 1, otherwise x_{v+1} = q_v, where 1 ≤ v ≤ L−1;

2) Special case: for a frequent pattern P and a frequent pattern Q, both with pattern length L, whose elements are p₁, p₂, …, p_L and q₁, q₂, …, q_L respectively, when not only the relative order of the suffix of P equals the relative order of the prefix of Q but the suffix of P also equals the prefix of Q, P and Q can be fused into two candidate patterns with pattern length L+1, denoted candidate pattern T with elements t₁, t₂, …, t_{L+1} and candidate pattern K with elements k₁, k₂, …, k_{L+1}. This is the special case, and the specific fusion rule is as follows:

When generating the candidate pattern K, let its first element be k₁ = p₁ and its last element be k_{L+1} = p₁ + 1; then compare each element p_u of the frequent pattern P, other than its first element, with p₁: when p_u > p₁, the corresponding element of the candidate pattern K is k_u = p_u + 1, otherwise k_u = p_u, where 2 ≤ u ≤ L;
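The two fusion rules spelled out in this passage, rule ② of the general case and the rule for the candidate pattern K of the special case, can be written directly as code. The sketch covers only these two rules; the function names are hypothetical, and the caller is assumed to have already checked that the relevant case applies.

```python
def fuse_general_rule_2(p, q):
    """General case, rule (2): applies when p1 > qL.
    x1 = p1 + 1, x_{L+1} = qL, and each q_v (v = 1..L-1) is raised by 1
    when it is greater than p1."""
    p1, q_last = p[0], q[-1]
    if p1 <= q_last:
        raise ValueError("rule (2) requires p1 > qL")
    middle = tuple(qv + 1 if qv > p1 else qv for qv in q[:-1])
    return (p1 + 1,) + middle + (q_last,)


def fuse_special_k(p):
    """Special case, candidate K: k1 = p1, k_{L+1} = p1 + 1, and each p_u
    (u = 2..L) is raised by 1 when it is greater than p1."""
    p1 = p[0]
    middle = tuple(pu + 1 if pu > p1 else pu for pu in p[1:])
    return (p1,) + middle + (p1 + 1,)


print(fuse_general_rule_2((3, 1, 2), (2, 3, 1)))   # (4, 2, 3, 1)
print(fuse_special_k((2, 3, 1)))                   # (2, 4, 1, 3)
```

Both printed results agree with candidate patterns that appear in the worked embodiment of Example 1.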

The operation of this embodiment is as follows:

1. Generate the candidate pattern set cand₃ with pattern length 3 from the frequent pattern set fre₂ with pattern length 2;

The second step gives the frequent pattern set fre₂ = {(1,2),(2,1)} with pattern length 2,

1) Process the 1st frequent pattern P₁ = (1,2) in the frequent pattern set fre₂ with pattern length 2:

The suffix of the frequent pattern P₁ is suffix(P₁) = (2), and the relative order of its suffix is suffixorder(P₁) = (1),

① Take the 1st frequent pattern P₁ = (1,2) from fre₂; its prefix is prefix(P₁) = (1), and the relative order of its prefix is prefixorder(P₁) = (1). Because suffixorder(P₁) = prefixorder(P₁) but suffix(P₁) ≠ prefix(P₁), this is the general case of the pattern fusion method, so fusing the frequent pattern P₁ with the frequent pattern P₁ generates one candidate pattern (1,2,3) with pattern length 3, which is added to the candidate pattern set cand₃ with pattern length 3, giving cand₃ = {(1,2,3)},

② Take the 2nd frequent pattern P₂ = (2,1) from fre₂; its prefix is prefix(P₂) = (2), and the relative order of its prefix is prefixorder(P₂) = (1). Because suffixorder(P₁) = prefixorder(P₂) and suffix(P₁) = prefix(P₂), this is the special case of the pattern fusion method, so fusing the frequent pattern P₁ with the frequent pattern P₂ generates two candidate patterns (2,3,1) and (1,3,2) with pattern length 3, which are added to the candidate pattern set cand₃ with pattern length 3, giving cand₃ = {(1,2,3),(2,3,1),(1,3,2)},

This completes the processing of the 1st frequent pattern P₁ in the frequent pattern set fre₂ with pattern length 2;

2) Process the 2nd frequent pattern P₂ = (2,1) in the frequent pattern set fre₂ with pattern length 2:

The suffix of the frequent pattern P₂ is suffix(P₂) = (1), and the relative order of its suffix is suffixorder(P₂) = (1),

① Take the 1st frequent pattern P₁ = (1,2) from fre₂; its prefix is prefix(P₁) = (1), and the relative order of its prefix is prefixorder(P₁) = (1). Because suffixorder(P₂) = prefixorder(P₁) and suffix(P₂) = prefix(P₁), this is the special case of the pattern fusion method, so fusing the frequent pattern P₂ with the frequent pattern P₁ generates two candidate patterns (3,1,2) and (2,1,3) with pattern length 3, which are added to the candidate pattern set cand₃ with pattern length 3, giving cand₃ = {(1,2,3),(2,3,1),(1,3,2),(3,1,2),(2,1,3)};

② Take the 2nd frequent pattern P₂ = (2,1) from fre₂; its prefix is prefix(P₂) = (2), and the relative order of its prefix is prefixorder(P₂) = (1). Because suffixorder(P₂) = prefixorder(P₂) but suffix(P₂) ≠ prefix(P₂), this is the general case of the pattern fusion method, so fusing the frequent pattern P₂ with the frequent pattern P₂ generates one candidate pattern (3,2,1) with pattern length 3, which is added to the candidate pattern set cand₃ with pattern length 3, giving cand₃ = {(1,2,3),(2,3,1),(1,3,2),(3,1,2),(2,1,3),(3,2,1)},

This completes the processing of the 2nd frequent pattern P₂ in the frequent pattern set fre₂ with pattern length 2;

In summary, the candidate pattern set with pattern length 3 is cand₃ = {(1,2,3),(2,3,1),(1,3,2),(3,1,2),(2,1,3),(3,2,1)};
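The case analysis walked through above can be reproduced with a short sketch. It classifies each ordered pair of frequent patterns as the general case, the special case, or neither, and then searches for every candidate whose prefix has relative order P and whose suffix has relative order Q. This brute-force search is an assumption used for illustration, not the closed-form fusion rules of the method; `classify` and `fuse` are hypothetical names.

```python
def relative_order(values):
    """Rank of each value within the sequence (1 = smallest); values must be distinct."""
    ranked = sorted(values)
    return tuple(ranked.index(v) + 1 for v in values)


def classify(p, q):
    """Return 'special', 'general', or None when p and q cannot be fused."""
    suffix, prefix = p[1:], q[:-1]
    if relative_order(suffix) != relative_order(prefix):
        return None
    return "special" if suffix == prefix else "general"


def fuse(p, q):
    """All candidates x of length L+1 with relative_order(x[:-1]) == p
    and relative_order(x[1:]) == q, found by trying each rank for the
    new last element."""
    out = []
    for r in range(1, len(p) + 2):
        x = tuple(v + 1 if v >= r else v for v in p) + (r,)
        if relative_order(x[1:]) == q:
            out.append(x)
    return out


fre2 = [(1, 2), (2, 1)]
cand3 = []
for p in fre2:
    for q in fre2:
        print(p, q, classify(p, q), fuse(p, q))
        cand3.extend(fuse(p, q))
print(cand3)
```

On fre₂ the sketch yields one candidate for each general-case pair and two for each special-case pair, and the resulting list equals the candidate pattern set cand₃ derived above.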

After the candidate pattern set cand₃ with pattern length 3 has been generated, the support in the time series S of each candidate pattern in cand₃ is calculated in 1) of the fourth step below, thereby obtaining the frequent pattern set fre₃ with pattern length 3;

2. Generate the candidate pattern set cand₄ with pattern length 4 from the frequent pattern set fre₃ with pattern length 3:

1) of the fourth step gives the frequent pattern set fre₃ = {(1,2,3),(2,3,1),(3,1,2)} with pattern length 3,

1) Process the 1st frequent pattern P₁ = (1,2,3) in the frequent pattern set fre₃ with pattern length 3:

The suffix of the frequent pattern P₁ is suffix(P₁) = (2,3), and the relative order of its suffix is suffixorder(P₁) = (1,2),

① Take the 1st frequent pattern P₁ = (1,2,3) from fre₃; its prefix is prefix(P₁) = (1,2), and the relative order of its prefix is prefixorder(P₁) = (1,2). Because suffixorder(P₁) = prefixorder(P₁) = (1,2) but suffix(P₁) ≠ prefix(P₁), this is the general case of the pattern fusion method, so fusing the frequent pattern P₁ with the frequent pattern P₁ generates one candidate pattern (1,2,3,4) with pattern length 4, which is added to the candidate pattern set cand₄ with pattern length 4, giving cand₄ = {(1,2,3,4)};

② Take the 2nd frequent pattern P₂ = (2,3,1) from fre₃; its prefix is prefix(P₂) = (2,3), and the relative order of its prefix is prefixorder(P₂) = (1,2). Because suffixorder(P₁) = prefixorder(P₂) and suffix(P₁) = prefix(P₂), this is the special case of the pattern fusion method, so fusing the frequent pattern P₁ with the frequent pattern P₂ generates two candidate patterns (2,3,4,1) and (1,3,4,2) with pattern length 4, which are added to the candidate pattern set cand₄ with pattern length 4, giving cand₄ = {(1,2,3,4),(2,3,4,1),(1,3,4,2)};

③ Take the 3rd frequent pattern P₃ = (3,1,2) from fre₃; its prefix is prefix(P₃) = (3,1), and the relative order of its prefix is prefixorder(P₃) = (2,1). Because suffixorder(P₁) ≠ prefixorder(P₃), neither case of the pattern fusion method is satisfied, so the frequent pattern P₁ and the frequent pattern P₃ cannot be fused into a candidate pattern with pattern length 4.

This completes the processing of the 1st frequent pattern P₁ in the frequent pattern set fre₃ with pattern length 3;

2) Process the 2nd frequent pattern P₂ = (2,3,1) in the frequent pattern set fre₃ with pattern length 3:

The suffix of the frequent pattern P₂ is suffix(P₂) = (3,1), and the relative order of its suffix is suffixorder(P₂) = (2,1),

① Take the 1st frequent pattern P₁ = (1,2,3) from fre₃; its prefix is prefix(P₁) = (1,2), and the relative order of its prefix is prefixorder(P₁) = (1,2). Because suffixorder(P₂) ≠ prefixorder(P₁), neither case of the pattern fusion method is satisfied, so the frequent pattern P₂ and the frequent pattern P₁ cannot be fused into a candidate pattern with pattern length 4.

② Take the 2nd frequent pattern P₂ = (2,3,1) from fre₃; its prefix is prefix(P₂) = (2,3), and the relative order of its prefix is prefixorder(P₂) = (1,2). Because suffixorder(P₂) ≠ prefixorder(P₂), neither case of the pattern fusion method is satisfied, so the frequent pattern P₂ and the frequent pattern P₂ cannot be fused into a candidate pattern with pattern length 4.

③ Take the 3rd frequent pattern P₃ = (3,1,2) from fre₃; its prefix is prefix(P₃) = (3,1), and the relative order of its prefix is prefixorder(P₃) = (2,1). Because suffixorder(P₂) = prefixorder(P₃) and suffix(P₂) = prefix(P₃), this is the special case of the pattern fusion method, so fusing the frequent pattern P₂ with the frequent pattern P₃ generates two candidate patterns (3,4,1,2) and (2,4,1,3) with pattern length 4, which are added to the candidate pattern set cand₄ with pattern length 4, giving cand₄ = {(1,2,3,4),(2,3,4,1),(1,3,4,2),(3,4,1,2),(2,4,1,3)};

This completes the processing of the 2nd frequent pattern P₂ in the frequent pattern set fre₃ with pattern length 3;

3) Process the 3rd frequent pattern P₃ = (3,1,2) in the frequent pattern set fre₃ with pattern length 3:

The suffix of the frequent pattern P₃ is suffix(P₃) = (1,2), and the relative order of its suffix is suffixorder(P₃) = (1,2),

① Take the 1st frequent pattern P₁ = (1,2,3) from fre₃; its prefix is prefix(P₁) = (1,2), and the relative order of its prefix is prefixorder(P₁) = (1,2). Because suffixorder(P₃) = prefixorder(P₁) and suffix(P₃) = prefix(P₁), this is the special case of the pattern fusion method, so fusing the frequent pattern P₃ with the frequent pattern P₁ generates two candidate patterns (3,1,2,4) and (4,1,2,3) with pattern length 4, which are added to the candidate pattern set cand₄ with pattern length 4, giving cand₄ = {(1,2,3,4),(2,3,4,1),(1,3,4,2),(3,4,1,2),(2,4,1,3),(4,1,2,3),(3,1,2,4)};

② Take the 2nd frequent pattern P₂ = (2,3,1) from fre₃; its prefix is prefix(P₂) = (2,3), and the relative order of its prefix is prefixorder(P₂) = (1,2). Because suffixorder(P₃) = prefixorder(P₂) but suffix(P₃) ≠ prefix(P₂), this is the general case of the pattern fusion method, so fusing the frequent pattern P₃ with the frequent pattern P₂ generates one candidate pattern (4,2,3,1) with pattern length 4, which is added to the candidate pattern set cand₄ with pattern length 4, giving cand₄ = {(1,2,3,4),(2,3,4,1),(1,3,4,2),(3,4,1,2),(2,4,1,3),(4,1,2,3),(3,1,2,4),(4,2,3,1)};

③ Take the 3rd frequent pattern P₃ = (3,1,2) from fre₃; its prefix is prefix(P₃) = (3,1), and the relative order of its prefix is prefixorder(P₃) = (2,1). Because suffixorder(P₃) ≠ prefixorder(P₃), neither case of the pattern fusion method is satisfied, so the frequent pattern P₃ and the frequent pattern P₃ cannot be fused into a candidate pattern with pattern length 4.

This completes the processing of the 3rd frequent pattern P₃ in the frequent pattern set fre₃ with pattern length 3;

In summary, the candidate pattern set with pattern length 4 is cand₄ = {(1,2,3,4),(2,3,4,1),(1,3,4,2),(3,4,1,2),(2,4,1,3),(4,1,2,3),(3,1,2,4),(4,2,3,1)};

After the candidate pattern set cand₄ with pattern length 4 has been generated, the support in the time series S of each candidate pattern in cand₄ is calculated in 2) of the fourth step below, thereby obtaining the frequent pattern set fre₄ with pattern length 4;

3. Generate the candidate pattern set cand₅ with pattern length 5 from the frequent pattern set fre₄ with pattern length 4;

2) of the fourth step gives the frequent pattern set fre₄ = {(1,2,3,4)} with pattern length 4,

1) Process the 1st frequent pattern P₁ = (1,2,3,4) in the frequent pattern set fre₄ with pattern length 4:

The suffix of the frequent pattern P₁ is suffix(P₁) = (2,3,4), and the relative order of its suffix is suffixorder(P₁) = (1,2,3),

① Take the 1st frequent pattern P₁ = (1,2,3,4) from fre₄; its prefix is prefix(P₁) = (1,2,3), and the relative order of its prefix is prefixorder(P₁) = (1,2,3). Because suffixorder(P₁) = prefixorder(P₁) but suffix(P₁) ≠ prefix(P₁), this is the general case of the pattern fusion method, so fusing the frequent pattern P₁ with the frequent pattern P₁ generates one candidate pattern (1,2,3,4,5) with pattern length 5, which is added to the candidate pattern set cand₅ with pattern length 5, giving cand₅ = {(1,2,3,4,5)};

This completes the processing of the 1st frequent pattern P₁ in the frequent pattern set fre₄ with pattern length 4;

In summary, the candidate pattern set with pattern length 5 is cand₅ = {(1,2,3,4,5)};

After the candidate pattern set cand₅ with pattern length 5 has been generated, the support in the time series S of each candidate pattern in cand₅ is calculated in 3) of the fourth step below, thereby obtaining the frequent pattern set fre₅ with pattern length 5;

4. Generate the candidate pattern set cand₆ with pattern length 6 from the frequent pattern set fre₅ with pattern length 5;

3) of the fourth step gives the frequent pattern set fre₅ = {(1,2,3,4,5)} with pattern length 5,

1) Process the 1st frequent pattern P₁ = (1,2,3,4,5) in the frequent pattern set fre₅ with pattern length 5:

The suffix of the frequent pattern P₁ is suffix(P₁) = (2,3,4,5), and the relative order of its suffix is suffixorder(P₁) = (1,2,3,4),

Take the 1st frequent pattern P₁ = (1,2,3,4,5) from fre₅; its prefix is prefix(P₁) = (1,2,3,4), and the relative order of its prefix is prefixorder(P₁) = (1,2,3,4). Because suffixorder(P₁) = prefixorder(P₁) but suffix(P₁) ≠ prefix(P₁), this is the general case of the pattern fusion method, so fusing the frequent pattern P₁ with the frequent pattern P₁ generates one candidate pattern (1,2,3,4,5,6) with pattern length 6, which is added to the candidate pattern set cand₆ with pattern length 6, giving cand₆ = {(1,2,3,4,5,6)};

This completes the processing of the 1st frequent pattern P₁ in the frequent pattern set fre₅ with pattern length 5;

In summary, the candidate pattern set with pattern length 6 is cand₆ = {(1,2,3,4,5,6)};

After the candidate pattern set cand₆ with pattern length 6 has been generated, the support in the time series S of each candidate pattern in cand₆ is calculated in 4) of the fourth step below, thereby obtaining the frequent pattern set fre₆ with pattern length 6;

The operation of this embodiment is as follows:

1) Obtain the frequent pattern set fre₃ with pattern length 3:

① Calculate the support in the time series S of the 1st candidate pattern P₁ = (1,2,3) in the candidate pattern set cand₃ with pattern length 3: sup(P₁, S) = 9. Because sup(P₁, S) ≥ minsup, the candidate pattern P₁ = (1,2,3) is added to the frequent pattern set fre₃ with pattern length 3, giving fre₃ = {(1,2,3)};

② Calculate the support in the time series S of the 2nd candidate pattern P₂ = (2,3,1) in cand₃: sup(P₂, S) = 3. Because sup(P₂, S) ≥ minsup, the candidate pattern P₂ = (2,3,1) is added to the frequent pattern set fre₃ with pattern length 3, giving fre₃ = {(1,2,3),(2,3,1)};

③ Calculate the support in the time series S of the 3rd candidate pattern P₃ = (1,3,2) in cand₃: sup(P₃, S) = 0. Because sup(P₃, S) < minsup, the candidate pattern P₃ = (1,3,2) is not frequent;

④ Calculate the support in the time series S of the 4th candidate pattern P₄ = (3,1,2) in cand₃: sup(P₄, S) = 3. Because sup(P₄, S) ≥ minsup, the candidate pattern P₄ = (3,1,2) is added to the frequent pattern set fre₃ with pattern length 3, giving fre₃ = {(1,2,3),(2,3,1),(3,1,2)};

⑤ Calculate the support in the time series S of the 5th candidate pattern P₅ = (2,1,3) in cand₃: sup(P₅, S) = 0. Because sup(P₅, S) < minsup, the candidate pattern P₅ = (2,1,3) is not frequent;

⑥ Calculate the support in the time series S of the 6th candidate pattern P₆ = (3,2,1) in cand₃: sup(P₆, S) = 0. Because sup(P₆, S) < minsup, the candidate pattern P₆ = (3,2,1) is not frequent;

In summary, the frequent pattern set with pattern length 3 is fre₃ = {(1,2,3),(2,3,1),(3,1,2)};

2) Obtain the frequent pattern set fre₄ with pattern length 4:

① Calculate the support in the time series S of the 1st candidate pattern P₁ = (1,2,3,4) in the candidate pattern set cand₄ with pattern length 4: sup(P₁, S) = 6. Because sup(P₁, S) ≥ minsup, the candidate pattern P₁ = (1,2,3,4) is added to the frequent pattern set fre₄ with pattern length 4, giving fre₄ = {(1,2,3,4)};

② Calculate the support in the time series S of the 2nd candidate pattern P₂ = (2,3,4,1) in cand₄: sup(P₂, S) = 2. Because sup(P₂, S) < minsup, the candidate pattern P₂ = (2,3,4,1) is not frequent;

③ Calculate the support in the time series S of the 3rd candidate pattern P₃ = (1,3,4,2) in cand₄: sup(P₃, S) = 0. Because sup(P₃, S) < minsup, the candidate pattern P₃ = (1,3,4,2) is not frequent;

④ Calculate the support in the time series S of the 4th candidate pattern P₄ = (3,4,1,2) in cand₄: sup(P₄, S) = 2. Because sup(P₄, S) < minsup, the candidate pattern P₄ = (3,4,1,2) is not frequent;

⑤ Calculate the support in the time series S of the 5th candidate pattern P₅ = (2,4,1,3) in cand₄: sup(P₅, S) = 0. Because sup(P₅, S) < minsup, the candidate pattern P₅ = (2,4,1,3) is not frequent;

⑥ Calculate the support in the time series S of the 6th candidate pattern P₆ = (4,1,2,3) in cand₄: sup(P₆, S) = 2. Because sup(P₆, S) < minsup, the candidate pattern P₆ = (4,1,2,3) is not frequent;

⑦ Calculate the support in the time series S of the 7th candidate pattern P₇ = (3,1,2,4) in cand₄: sup(P₇, S) = 0. Because sup(P₇, S) < minsup, the candidate pattern P₇ = (3,1,2,4) is not frequent;

⑧ Calculate the support in the time series S of the 8th candidate pattern P₈ = (4,2,3,1) in cand₄: sup(P₈, S) = 0. Because sup(P₈, S) < minsup, the candidate pattern P₈ = (4,2,3,1) is not frequent;

In summary, the frequent pattern set with pattern length 4 is fre₄ = {(1,2,3,4)};
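The eight support values above can be checked with a naive sliding-window count. This is only an illustrative stand-in for the support calculation of the method, and it assumes that a window containing equal values (such as 1.3, 1.4, 1.5, 1.3 at the end of S) matches no pattern, which is consistent with the supports reported in this embodiment. The function name `support` is hypothetical.

```python
def relative_order(values):
    """Rank of each value within the sequence (1 = smallest); values must be distinct."""
    ranked = sorted(values)
    return tuple(ranked.index(v) + 1 for v in values)


def support(pattern, series):
    """Number of contiguous windows of `series` whose relative order is `pattern`."""
    m = len(pattern)
    count = 0
    for i in range(len(series) - m + 1):
        window = series[i:i + m]
        # Assumption: windows with tied values are not counted for any pattern.
        if len(set(window)) == m and relative_order(window) == pattern:
            count += 1
    return count


S = (1.1, 1.2, 1.3, 1.4, 1.5, 1.1, 1.2, 1.3, 1.4, 1.5,
     1.1, 1.2, 1.3, 1.4, 1.5, 1.3, 1.4)
minsup = 3
cand4 = [(1, 2, 3, 4), (2, 3, 4, 1), (1, 3, 4, 2), (3, 4, 1, 2),
         (2, 4, 1, 3), (4, 1, 2, 3), (3, 1, 2, 4), (4, 2, 3, 1)]

for p in cand4:
    s = support(p, S)
    print(p, s, "frequent" if s >= minsup else "not frequent")
```

Only (1,2,3,4) reaches the threshold, in agreement with fre₄ = {(1,2,3,4)}.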

3) Obtaining the frequent pattern set fre₅ with a pattern length of 5:

① calculate the pattern support of the first candidate pattern P₁ = (1,2,3,4,5) in the candidate pattern set cand₅ with a pattern length of 5 in the time series S: sup(P₁, S) = 3; because sup(P₁, S) ≥ the minimum support threshold minsup, the candidate pattern P₁ = (1,2,3,4,5) is added to the frequent pattern set fre₅ with a pattern length of 5, giving fre₅ = {(1,2,3,4,5)};

In summary, the frequent pattern set with a pattern length of 5 is obtained: fre₅ = {(1,2,3,4,5)};

4) Obtaining the frequent pattern set fre₆ with a pattern length of 6:

① calculate the pattern support of the candidate pattern P₁ = (1,2,3,4,5,6) in the candidate pattern set cand₆ with a pattern length of 6 in the time series S: sup(P₁, S) = 0; because sup(P₁, S) < the minimum support threshold minsup, the candidate pattern P₁ = (1,2,3,4,5,6) is not frequent;

In summary, the frequent pattern set with a pattern length of 6 is obtained: fre₆ = ∅;

Fifth step, finishing the mining of order-preserving sequence patterns:

When the frequent pattern set fre_{L+1} with a pattern length of L+1 is empty, the mining of order-preserving sequence patterns is finished.

Because the frequent pattern set fre₆ with a pattern length of 6 obtained in the fourth step is empty, the mining of order-preserving sequence patterns is finished.

Example 2

Given the time series S = (2,1,3,4,8,9,7,12,14,13,15,17) and the minimum support threshold minsup = 3.

"Fifth step: when the candidate pattern set cand_{L+1} with a pattern length of L+1 is empty, the mining of order-preserving sequence patterns is finished.

Because the candidate pattern set cand₅ with a pattern length of 5 obtained in the third step is empty, the mining of order-preserving sequence patterns is finished."

Except for the above differences, the procedure is the same as in Example 1.
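The level-by-level procedure of this example can be traced with the following minimal Python sketch. Several points are assumptions rather than statements of the method: an occurrence is taken to be a window of consecutive values whose relative order equals the pattern, values are assumed distinct, the length-2 candidates are assumed to be (1,2) and (2,1), and `fuse` enumerates every possible rank for the appended element instead of applying the exact fusion rule. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Pattern = Tuple[int, ...]


def relative_order(values: Sequence[float]) -> Pattern:
    """Rank of each value within the window (1 = smallest); values assumed distinct."""
    ordered = sorted(values)
    return tuple(ordered.index(v) + 1 for v in values)


def support(pattern: Pattern, series: Sequence[float]) -> int:
    """Number of windows of consecutive values whose relative order equals pattern."""
    m = len(pattern)
    return sum(
        relative_order(series[i:i + m]) == pattern
        for i in range(len(series) - m + 1)
    )


def fuse(p: Pattern, q: Pattern) -> List[Pattern]:
    """Length L+1 patterns whose first L elements follow p and last L elements follow q."""
    results = []
    for last in range(1, len(p) + 2):  # try every rank for the appended element
        head = tuple(x + 1 if x >= last else x for x in p)
        candidate = head + (last,)
        if relative_order(candidate[1:]) == q:
            results.append(candidate)
    return results


def mine(series: Sequence[float], minsup: int) -> List[List[Pattern]]:
    """Return [fre_2, fre_3, ...]; stop once the candidate set is empty."""
    candidates: List[Pattern] = [(1, 2), (2, 1)]  # assumed length-2 candidates
    levels = []
    while candidates:
        frequent = [c for c in candidates if support(c, series) >= minsup]
        levels.append(frequent)
        candidates = [t for p in frequent for q in frequent for t in fuse(p, q)]
    return levels


series = (2, 1, 3, 4, 8, 9, 7, 12, 14, 13, 15, 17)
for length, frequent in enumerate(mine(series, minsup=3), start=2):
    print(length, frequent)
```

Under these assumptions the loop ends because no candidate pattern of length 5 can be formed from the frequent patterns of length 4, which agrees with the stopping condition stated for this example.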

In the above embodiments, the programming software is VC++ 6.0, the drawing tool is Visio 2013, the processor is Pentium(R) Dual-Core 32Processor +, and the operating system is Windows 7 or above. The classic pattern matching algorithm and the software and hardware environments used above are well known to those skilled in the art.

Claims

1. An order-preserving sequence pattern mining method, characterized in that candidate patterns are generated by a pattern fusion method, which reduces the number of candidate patterns, and the pattern support of the candidate patterns is calculated through a series of conversion and verification steps, the specific steps being as follows:

First step, inputting a time series S and a minimum support threshold minsup:

Second step, obtaining the frequent pattern set fre₂ with a pattern length of 2:

The pattern support is calculated as follows:

A pattern fusion method is adopted to generate the candidate pattern set cand_{L+1} with a pattern length of L+1 from the frequent pattern set fre_L with a pattern length of L, where L denotes the pattern length of the frequent patterns currently being processed and has an initial value of 2; in the process of generating the candidate pattern set, for a frequent pattern P with elements p₁, p₂, …, p_L, the part of P that remains after removing its last element p_L is called the prefix of the frequent pattern P and is denoted as prefix(P), and the relative order of the prefix of the frequent pattern P is denoted as prefix(P); the part of P that remains after removing its first element p₁ is called the suffix of the frequent pattern P and is denoted as suffix(P), and the relative order of the suffix of the frequent pattern P is denoted as suffix(P),

According to the pattern support calculation method of the second step, the pattern support sup(P_d, S) of each candidate pattern P_d in the candidate pattern set cand_{L+1} with a pattern length of L+1 is calculated in turn; when the pattern support sup(P_d, S) of the candidate pattern P_d is greater than or equal to the minimum support threshold minsup, the candidate pattern P_d is added to the frequent pattern set fre_{L+1} with a pattern length of L+1; once the pattern supports of all candidate patterns in the candidate pattern set cand_{L+1} have been calculated, the frequent pattern set fre_{L+1} with a pattern length of L+1 is obtained;
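The candidate generation and screening described in the two preceding steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the claimed procedure itself: patterns are assumed to be tuples of distinct ranks, `fuse` tries every possible rank for the appended element and keeps the results whose suffix follows the second pattern, and `screen` takes precomputed supports. The names `prefix_order`, `suffix_order`, `fuse` and `screen` are hypothetical, and minsup = 3 is an assumed value.

```python
from typing import Dict, List, Sequence, Tuple

Pattern = Tuple[int, ...]


def relative_order(values: Sequence[int]) -> Pattern:
    """Rank of each element (1 = smallest); elements assumed distinct."""
    ordered = sorted(values)
    return tuple(ordered.index(v) + 1 for v in values)


def prefix_order(p: Pattern) -> Pattern:
    """Relative order of p with its last element removed."""
    return relative_order(p[:-1])


def suffix_order(p: Pattern) -> Pattern:
    """Relative order of p with its first element removed."""
    return relative_order(p[1:])


def fuse(p: Pattern, q: Pattern) -> List[Pattern]:
    """Fuse p and q (both of length L) into patterns of length L+1.

    Fusion is attempted only when suffix_order(p) == prefix_order(q); each
    result keeps the order of p in its first L elements and the order of q
    in its last L elements.
    """
    if suffix_order(p) != prefix_order(q):
        return []
    results = []
    for last in range(1, len(p) + 2):  # candidate rank of the appended element
        head = tuple(x + 1 if x >= last else x for x in p)
        candidate = head + (last,)
        if suffix_order(candidate) == q:
            results.append(candidate)
    return results


def screen(candidates: Sequence[Pattern], supports: Dict[Pattern, int], minsup: int) -> List[Pattern]:
    """Keep the candidates whose support reaches minsup."""
    return [c for c in candidates if supports[c] >= minsup]


cand = fuse((2, 3, 1), (3, 1, 2))
print(cand)  # [(3, 4, 1, 2), (2, 4, 1, 3)]
# Supports as reported for these two candidates in Example 1; minsup = 3 is assumed.
print(screen(cand, {(3, 4, 1, 2): 2, (2, 4, 1, 3): 0}, minsup=3))  # []
```

The two fused patterns coincide with the candidate patterns (3,4,1,2) and (2,4,1,3) of Example 1, and neither passes the screening under the assumed threshold.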

Fifth step, finishing the mining of order-preserving sequence patterns: