CN110609857B - Dynamic threshold-based time series data sequence pattern mining method - Google Patents

Dynamic threshold-based time series data sequence pattern mining method

Publication number
CN110609857B
Authority
CN
China
Prior art keywords
frequent
item
sequence
mining
suffix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910811085.4A
Other languages
Chinese (zh)
Other versions
CN110609857A (en)
Inventor
王巍
辛国栋
田静
吕芳
黄俊恒
魏玉良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Weihai Tianzhiwei Network Space Safety Technology Co ltd
Harbin Institute of Technology Weihai
Original Assignee
Weihai Tianzhiwei Network Space Safety Technology Co ltd
Harbin Institute of Technology Weihai
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Weihai Tianzhiwei Network Space Safety Technology Co ltd and Harbin Institute of Technology Weihai
Priority to CN201910811085.4A
Publication of CN110609857A
Application granted
Publication of CN110609857B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G06F16/2474 Sequence data queries, e.g. querying versioned data
    • G06F16/2477 Temporal data queries
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/04 Trading; Exchange, e.g. stocks, commodities, derivatives or currency exchange

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • General Engineering & Computer Science (AREA)
  • Fuzzy Systems (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Technology Law (AREA)
  • Strategic Management (AREA)
  • Marketing (AREA)
  • Development Economics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention belongs to the technical field of data processing and relates to a method for mining sequential patterns. The method comprises the following steps: dividing the original sequence with a time window to form a time sequence set; scanning the time sequence set to obtain the one-item set; screening the one-items with a dynamic threshold according to the definition of a frequent one-item, and deleting the infrequent items from the sequence set to obtain the frequent one-item set; starting from the first item in the frequent item set, taking that item as a prefix, constructing its suffix, and mining frequent multi-item patterns on the suffix, where an item that satisfies the frequent multi-item definition is a frequent item, and mining ends when the suffix is empty; and traversing the frequent one-item set, repeating the previous step for each item until the whole set has been traversed, to obtain the set of frequent multi-item patterns. When a frequent multi-item pattern is identified, the method uses a dynamically changing support threshold; it mines not only patterns that occur frequently across different sequences but also patterns that occur frequently within a single sequence.

Description

Dynamic threshold-based time series data sequence pattern mining method
Technical Field
The invention belongs to the technical field of data processing and relates to a method for mining sequential patterns.
Background
An abnormal financial transaction pattern is a distinctive pattern in an account's transaction data that reflects abnormal transaction behavior, such as an unusual transaction amount. Abnormal financial transaction patterns are concealed in many illegal financial activities, such as money laundering, credit card fraud, illegal fundraising, and pyramid selling. Combating pyramid selling is an urgent task in the field of financial security. In essence, organizers illegally pool and transfer funds by recruiting downline members, which endangers people's lives and property and disrupts the normal social and economic order. Analysis of financial transaction data is an effective way to mine abnormal transaction patterns and to identify the hierarchy of abnormal organizations. However, existing methods for mining abnormal financial transaction patterns from financial transaction data rely mainly on manually built rule bases or simple statistical methods, which consume a great deal of time and labor and are therefore inefficient. Data mining methods can help criminal investigators discover relationships in the data, solve cases quickly and efficiently, and safeguard China's financial and economic security and social stability.
At present, abnormal financial transaction pattern recognition based on itemset mining and sequential pattern mining has made considerable progress in fields such as anti-money laundering and credit card fraud detection. Some studies have applied clustering algorithms to transaction behavior pattern recognition in money laundering detection and credit card fraud detection, with some success. Researchers have proposed the O-Apriori (VSO-Apriori) algorithm, which mines cross-transaction association rules from multivariate time series and is significant for predicting trends in financial time series. However, that method extracts only patterns that co-occur across several different time series and does not address finding frequent patterns within a single time series. Other researchers have studied typical sequential pattern mining algorithms, such as GSP and PrefixSpan, and applied them to Web log mining. The GSP algorithm uses a breadth-first search strategy to find all sequential patterns but is inefficient on long sequences. The PrefixSpan algorithm is a pattern-growth algorithm that searches depth-first and is based on projected databases. It improves the efficiency of sequential pattern mining but occupies excessive storage space.
The data sets targeted by sequential pattern mining are ordered, such as time series and gene sequences. Bank transaction records are a typical example of time series data. The core idea of existing sequential pattern mining work is to mine one-items and multi-item patterns against a uniform support threshold. Applying such algorithms directly to mining pyramid selling patterns raises three problems: 1) normal amounts usually occur more often in transaction records than abnormal amounts, so mining with a uniform threshold tends to produce cluttered and useless results; 2) the purchase and rebate amounts in pyramid selling activity are time-sensitive, whereas general sequential pattern mining algorithms rarely consider the timeliness of data; 3) existing sequential pattern mining algorithms consider only whether a pattern is common across the sequence set, not the significance of a pattern recurring within a sequence. In pyramid selling activity, the pattern occurs frequently not only across sequences but also within a sequence, which existing sequential pattern mining algorithms cannot handle.
Disclosure of Invention
The invention aims to solve the above problems of existing sequential pattern mining techniques by providing a dynamic threshold-based sequential pattern mining method that mines frequent patterns from time series data, so that analysts can examine correlations in the data and find valuable information.
The technical scheme adopted by the invention to solve the technical problem is as follows: a dynamic threshold-based time series data sequence pattern mining method, comprising the following steps:
(1) preprocessing the original bank transaction sequence data, which includes amount and transaction time information, sorting the data by time, and dividing the original sequence with a time window to form a time sequence set;
(2) scanning the time sequence set to obtain the one-item set;
(3) because normal amounts occur more frequently than abnormal amounts in bank transactions, using different threshold standards when identifying frequent items: screening the one-items with a dynamic threshold according to the definition of a frequent one-item, and deleting the infrequent items from the sequence set to obtain the frequent one-item set;
(4) starting from the first item in the frequent item set, taking that item as a prefix, constructing its suffix, and mining frequent multi-item patterns on the suffix, where an item that satisfies the frequent multi-item definition is a frequent item, and mining ends when the suffix is empty;
(5) traversing the frequent one-item set and repeating step (4) for each item until the whole set has been traversed, to obtain the set of frequent multi-item patterns;
the frequent one-item is defined as follows: given the probabilities with which the items appear in normal transaction data, $P_{normal} = \{p_1, p_2, \dots, p_n\}$, $n > 0$, calculate the probabilities with which the items appear in the transaction sequence set, $Q = \{q_1, q_2, \dots, q_n\}$; items that satisfy the following formula are considered frequent one-items:
[Formula image GDA0002897694520000031 (frequent one-item condition on $q_i$) not reproduced in this text]
where $\alpha$ is a threshold coefficient, $\alpha \times p_i$ is the threshold for $q_i$, and the subscript $n$ denotes the n-th item;
the frequent multi-item pattern is defined as follows: if events A and B are independent of each other, then $P(AB) = P(A) \times P(B)$; if $P(AB) < P(A) \times P(B)$, the two are considered negatively correlated; if $P(AB) > P(A) \times P(B)$, the two are considered positively correlated. Here $P(A)$ = number of occurrences of prefix A / total number of sequences, $P(B)$ = number of occurrences of suffix B / total number of sequences, and $P(AB)$ = number of co-occurrences of pattern AB / total number of sequences, where the total number of sequences is the number of sequences produced by the time-window division. The goal is to find positively correlated item sets as frequent items; when two amounts are positively correlated, they are considered a purchase-rebate pattern.
As a further improvement of the present invention, when frequent multi-item patterns are mined, the construction of the suffix must satisfy the following: an item at a given position cannot be matched with identical items at multiple positions; it can be matched with the identical item at only one position.
The dynamic threshold-based sequence pattern mining method has the following beneficial effects:
(1) when a frequent multi-item pattern is identified, a dynamically changing support threshold is used instead of a uniform support threshold min-sup;
(2) patterns that occur frequently across different sequences are mined, and so are patterns that occur frequently within a single sequence;
(3) the sequence mining process must satisfy the One-Off condition, which solves the problem of patterns being counted repeatedly.
Drawings
FIG. 1 is a flow chart of a dynamic threshold based mining method for time series data patterns according to an embodiment of the present invention;
FIG. 2 is a projection diagram of prefixes and suffixes in the embodiment;
FIG. 3 is a diagram of mining frequent multi-item patterns with $l_1$ as the prefix.
Detailed Description
The dynamic threshold-based time series data sequence pattern mining method of the present invention is illustrated and described in detail below with reference to the accompanying drawings and embodiments, so as to enable those skilled in the art to better understand the technical solution of the present invention, but the technical idea of the present invention is not limited to the specific contents described in the embodiments.
The flow of the dynamic threshold-based time series data pattern mining method in this embodiment is shown in FIG. 1 and comprises the following steps:
First, acquire the original sequence set
Let $Sequences = (S_1, S_2, S_3, \dots, S_n)$ be a sequence set containing n sequences, where $S_i$ denotes the i-th sequence. The format of the sequence set in this example is shown in Table 1.
Table 1. Sequence set
[Table image GDA0002897694520000051 not reproduced in this text]
Second, divide the original sequence with a time window to form the time sequence set
For a sequence $S_i$, a time window of length w and a sliding step t are given. The window is placed at the start of the sequence, where it covers a subsequence of length w; it is then moved right by t steps to form another subsequence of length w, and so on until the window contains the last item.
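The window division in this step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `sliding_windows` is hypothetical, w and t are taken to be counts of items rather than durations, and the final window is assumed to be truncated when fewer than w items remain.

```python
def sliding_windows(sequence, w, t):
    """Split one sequence into windows of length w with sliding step t.

    ASSUMPTION: w and t count items, and the last window may be shorter
    than w so that it still contains the final item of the sequence.
    """
    if w <= 0 or t <= 0:
        raise ValueError("w and t must be positive")
    if not sequence:
        return []
    windows = []
    start = 0
    while True:
        windows.append(list(sequence[start:start + w]))
        if start + w >= len(sequence):  # window already covers the last item
            break
        start += t
    return windows


# Hypothetical usage: nine items, window length 4, step 2.
example_windows = sliding_windows(list("abcdefghi"), w=4, t=2)
```

The windows produced here play the role of the "sequences" whose total number is later used as the denominator of the probabilities.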
Third, screen the one-items with a dynamic threshold according to the frequent one-item definition, and delete the infrequent items from the sequence set to obtain the frequent one-item set
For the sequence set, the probabilities with which the items occur in normal data are given as $P_{normal} = \{p_1, p_2, \dots, p_n\}$, $n > 0$. When the one-item set is mined, the probabilities $Q = \{q_1, q_2, \dots, q_n\}$ are calculated, where $q_i$ corresponds to $p_i$. Items satisfying the following formula are considered frequent one-items.
[Formula image GDA0002897694520000061 (frequent one-item condition on $q_i$) not reproduced in this text]
where $\alpha$ is the threshold coefficient and $\alpha \times p_i$ is the threshold for $q_i$.
Let $\alpha = 0.5$, and let $P_{normal}$ be the proportion of each item in the normal data, regardless of the structure of the sequences: $P_{normal} = \{l_1: 4/27,\ l_2: 8/27,\ l_3: 3/27,\ l_4: 8/27,\ l_5: 3/27,\ l_6: 9/27,\ l_7: 8/27\}$.
First, the prefixes of length one in the sequence set and their frequencies are counted, giving Table 2.
Table 2. Prefix occurrence frequencies
[Table image GDA0002897694520000062 not reproduced in this text]
From the frequent one-item definition, the frequent one-item set is obtained: $OneFrequentSeq = \{l_1, l_2, l_3, l_4, l_5\}$. Because $l_6$ and $l_7$ do not satisfy the frequent one-item definition, they are removed from the sequence set. The updated sequence set is shown in Table 3.
Table 3. Updated sequence set
[Table image GDA0002897694520000063 not reproduced in this text]
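A minimal Python sketch of the dynamic-threshold screening follows. Because the formula image is not reproduced in this text, the comparison $q_i \ge \alpha \times p_i$ is an assumption, as is estimating $q_i$ as the share of all item occurrences in the sequence set. The function names and the toy data in the test are hypothetical; only $\alpha = 0.5$ and the $P_{normal}$ values come from the embodiment.

```python
from collections import Counter
from fractions import Fraction


def item_probabilities(sequences):
    """ASSUMPTION: q_i = occurrences of item i / all item occurrences."""
    counts = Counter(item for seq in sequences for item in seq)
    total = sum(counts.values())
    return {item: Fraction(n, total) for item, n in counts.items()}


def frequent_one_items(sequences, p_normal, alpha):
    """Return (frequent items, sequences with infrequent items deleted).

    ASSUMPTION: an item is frequent when q_i >= alpha * p_i.
    Every observed item must have an entry in p_normal.
    """
    q = item_probabilities(sequences)
    frequent = {item for item, q_i in q.items() if q_i >= alpha * p_normal[item]}
    pruned = [[item for item in seq if item in frequent] for seq in sequences]
    return frequent, pruned


# Per-item thresholds alpha * p_i for the embodiment's values.
ALPHA = Fraction(1, 2)
P_NORMAL = {
    "l1": Fraction(4, 27), "l2": Fraction(8, 27), "l3": Fraction(3, 27),
    "l4": Fraction(8, 27), "l5": Fraction(3, 27), "l6": Fraction(9, 27),
    "l7": Fraction(8, 27),
}
THRESHOLDS = {item: ALPHA * p for item, p in P_NORMAL.items()}
```

Each item gets its own threshold, so an amount that is common in normal data must appear proportionally more often before it is kept, which is the point of replacing a uniform min-sup.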
Fourth, starting from the first item in the frequent item set, take that item as the prefix and construct its suffix
For example, for the sequence $S_1 = \langle l_1, l_2, l_3, l_4, l_5, l_1, l_3, l_2, l_7 \rangle$ with $P = \langle l_1 \rangle$ and $Q = \langle l_2, l_3, l_4, l_5, l_1, l_3, l_2, l_7 \rangle$, Q is the suffix of P. Note that $\langle l_3, l_2, l_7 \rangle$ is not a suffix of $\langle l_1 \rangle$ in the sequence $S_1$. The suffix projections constructed for the sequences in Table 3 are shown in FIG. 2.
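The suffix construction in this example can be sketched in Python as below. The function name `suffix_after_first` is hypothetical, and the sketch assumes that the suffix is everything after the first occurrence of the prefix item, which is how the $S_1$ example reads.

```python
def suffix_after_first(sequence, prefix_item):
    """Return the items that follow the FIRST occurrence of prefix_item.

    An empty list is returned when the prefix item does not occur.
    """
    try:
        first = sequence.index(prefix_item)
    except ValueError:
        return []
    return list(sequence[first + 1:])


# The sequence S1 from the example above.
S1 = ["l1", "l2", "l3", "l4", "l5", "l1", "l3", "l2", "l7"]
Q = suffix_after_first(S1, "l1")
```

Only the first occurrence defines the suffix, which is why the shorter tail after the second `l1` is not treated as a separate suffix at this stage.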
Fifth, mine frequent multi-item patterns on the suffix. When the suffix sequence contains the prefix, the suffix sequence is split and updated. An item B that satisfies the frequent multi-item definition forms a frequent item AB with its prefix A, and mining continues with AB as the new prefix until the suffix is empty, at which point mining ends.
Frequent multi-item definition: in mining multi-item sets, by the principle of independence, if events A and B are independent of each other, then $P(AB) = P(A) \times P(B)$. If $P(AB) < P(A) \times P(B)$, the two are considered negatively correlated; if $P(AB) > P(A) \times P(B)$, the two are considered positively correlated. Here $P(A)$ = number of occurrences of prefix A / total number of sequences, $P(B)$ = number of occurrences of suffix B / total number of sequences, and $P(AB)$ = number of co-occurrences of pattern AB / total number of sequences. The total number of sequences is the number of sequences produced by the time-window division. The goal is to find positively correlated item sets as frequent items.
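The correlation test in this definition can be written as a short Python sketch. The function name `correlation` and the counts used in the test are hypothetical; how occurrences are counted is left to the caller, since the text does not spell it out. Exact fractions are used so that the equality case is not blurred by floating-point rounding.

```python
from fractions import Fraction


def correlation(count_a, count_b, count_ab, total_sequences):
    """Compare P(AB) with P(A) * P(B), all over the same denominator.

    total_sequences is the number of sequences produced by the
    time-window division.
    """
    if total_sequences <= 0:
        raise ValueError("total_sequences must be positive")
    p_a = Fraction(count_a, total_sequences)
    p_b = Fraction(count_b, total_sequences)
    p_ab = Fraction(count_ab, total_sequences)
    if p_ab > p_a * p_b:
        return "positive"      # AB qualifies as a frequent multi-item pattern
    if p_ab < p_a * p_b:
        return "negative"
    return "independent"
```

The effective support threshold for AB is $P(A) \times P(B)$, so it changes with every prefix and candidate item instead of being one fixed value.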
One-Off condition definition: let $p_1 = (i_1, i_2, \dots, i_m)$ and $p_2 = (j_1, j_2, \dots, j_m)$ be two occurrences of pattern P. If $i_p \neq j_q$ for all $1 \le p \le m$ and $1 \le q \le m$, then $p_1$ and $p_2$ are said to satisfy the One-Off condition.
For example, for $S_1 = \langle l_1, l_2, l_3, l_4, l_5, l_1, l_3, l_2, l_7 \rangle$, $location(l_1) = \{0, 5\}$ and $location(l_2) = \{1, 7\}$. Suppose the pattern is $P = \langle l_1, l_2 \rangle$. If $p_1 = \langle 0, 1 \rangle$, then $p_2$ cannot be $\langle 0, 7 \rangle$ or $\langle 5, 1 \rangle$; it can only be $\langle 5, 7 \rangle$.
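To make the One-Off condition concrete, the following Python sketch collects occurrences of a pattern so that no position is used twice. It is a greedy, leftmost-first illustration with a hypothetical function name, not the procedure claimed in the patent, and a greedy choice is not guaranteed to find the largest possible set of occurrences.

```python
def one_off_occurrences(sequence, pattern):
    """Greedily collect occurrences of pattern (gaps allowed) such that
    every position of the sequence is used by at most one occurrence."""
    if not pattern:
        return []
    used = set()
    occurrences = []
    while True:
        positions = []
        start = 0
        for item in pattern:
            # earliest unused position at or after start holding this item
            pos = next(
                (k for k in range(start, len(sequence))
                 if sequence[k] == item and k not in used),
                None,
            )
            if pos is None:
                return occurrences
            positions.append(pos)
            start = pos + 1
        used.update(positions)
        occurrences.append(tuple(positions))


S1 = ["l1", "l2", "l3", "l4", "l5", "l1", "l3", "l2", "l7"]
found = one_off_occurrences(S1, ["l1", "l2"])
```

For the example sequence this yields the two occurrences at positions (0, 1) and (5, 7), so the pattern is counted twice and never three or four times.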
Taking $l_1$ as the prefix for example, each item in the suffix of $l_1$ is checked against the frequent multi-item definition. While the suffix of $l_1$ is mined, it must be determined whether the suffix itself contains $l_1$; if it does, the suffix is split at the corresponding position to form new suffixes. However, when $P(AB)$, $P(A)$ and $P(B)$ are calculated, their denominators refer to the original sequence set. Calculation shows that only $l_3$ satisfies the frequent multi-item definition, so $l_3$ and $l_1$ form a new prefix $\langle l_1, l_3 \rangle$, which is taken as prefix A in the next step. The suffix sequences corresponding to A are located in the original sequences and mined; no item satisfying the condition is found, so the mining of items with $l_1$ as the prefix is complete. Mining then continues with the suffixes of $l_2$, $l_3$, and so on, following the same procedure as for $l_1$. A schematic diagram of the mining process is shown in FIG. 3.
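One possible reading of the suffix splitting described above is sketched below in Python. The text does not state exactly how the split segments are formed, so both the function name `split_suffix` and the choice to drop the prefix item at each split point are assumptions.

```python
def split_suffix(suffix, prefix_item):
    """Split a suffix at every occurrence of prefix_item.

    ASSUMPTION: the prefix item itself is dropped and each remaining
    run of items becomes its own suffix segment.
    """
    segments = []
    current = []
    for item in suffix:
        if item == prefix_item:
            if current:
                segments.append(current)
            current = []
        else:
            current.append(item)
    if current:
        segments.append(current)
    return segments


# Suffix of <l1> in the example sequence S1.
suffix_of_l1 = ["l2", "l3", "l4", "l5", "l1", "l3", "l2", "l7"]
segments = split_suffix(suffix_of_l1, "l1")
```

Under this reading each segment belongs to exactly one occurrence of the prefix, which fits the requirement that an item at one position is matched with only one occurrence.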
Sixth, once all frequent items have been traversed, all frequent multi-item patterns have been mined. Throughout the process, attention must be paid to the dynamic splitting of suffixes and to the One-Off condition: an item at a given position cannot be matched with identical items at multiple positions and can be matched with the identical item at only one position.
The method can be applied to the bank transaction data of a pyramid selling organization in order to mine the organization's abnormal transaction patterns. For example, member A makes a purchase of amount PM; A's immediate upline obtains rebate RM1, and that upline's own upline obtains rebate RM2. By analogy, the i-th upline of A obtains rebate RMi, where A has n uplines and 0 < i <= n. The purchase-rebate pattern of pyramid selling, that is, the pattern <PM, RMi>, is thereby mined.
The method is suitable for pattern mining tasks on time series data in fields such as politics, economics, culture, and medicine, and can be used for automatic detection of abnormal behavior, for example to help financial security incident handlers identify abnormal behavior, so it has broad application prospects.

Claims (2)

1. A dynamic threshold-based time series data sequence pattern mining method, characterized by comprising the following steps:
(1) preprocessing the original bank transaction sequence data, which includes amount and transaction time information, sorting the data by time, and dividing the original sequence with a time window to form a time sequence set;
(2) scanning the time sequence set to obtain the one-item set;
(3) because normal amounts occur more frequently than abnormal amounts in bank transactions, using different threshold standards when identifying frequent items: screening the one-items with a dynamic threshold according to the definition of a frequent one-item, and deleting the infrequent items from the sequence set to obtain the frequent one-item set;
(4) starting from the first item in the frequent item set, taking that item as a prefix, constructing its suffix, and mining frequent multi-item patterns on the suffix, where an item that satisfies the frequent multi-item definition is a frequent item, and mining ends when the suffix is empty;
(5) traversing the frequent one-item set and repeating step (4) for each item until the whole set has been traversed, to obtain the set of frequent multi-item patterns;
the frequent one-item is defined as follows: given the probabilities with which the items appear in normal transaction data, $P_{normal} = \{p_1, p_2, \dots, p_n\}$, $n > 0$, calculate the probabilities with which the items appear in the transaction sequence set, $Q = \{q_1, q_2, \dots, q_n\}$; items that satisfy the following formula are considered frequent one-items:
[Formula image FDA0002897694510000011 (frequent one-item condition on $q_i$) not reproduced in this text]
where $\alpha$ is a threshold coefficient, $\alpha \times p_i$ is the threshold for $q_i$, and the subscript $n$ denotes the n-th item;
the frequent multi-item pattern is defined as follows: if events A and B are independent of each other, then $P(AB) = P(A) \times P(B)$; if $P(AB) < P(A) \times P(B)$, the two are considered negatively correlated; if $P(AB) > P(A) \times P(B)$, the two are considered positively correlated; here $P(A)$ = number of occurrences of prefix A / total number of sequences, $P(B)$ = number of occurrences of suffix B / total number of sequences, and $P(AB)$ = number of co-occurrences of pattern AB / total number of sequences; the total number of sequences is the number of sequences produced by the time-window division; the goal is to find positively correlated item sets as frequent items; when two amounts are positively correlated, they are considered a purchase-rebate pattern.
2. The dynamic threshold-based time series data sequence pattern mining method of claim 1, wherein, when frequent multi-item patterns are mined, the construction of the suffix must satisfy the following: an item at a given position cannot be matched with identical items at multiple positions and can be matched with the identical item at only one position.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910811085.4A CN110609857B (en) 2019-08-30 2019-08-30 Dynamic threshold-based time series data sequence pattern mining method

Publications (2)

Publication Number    Publication Date
CN110609857A          2019-12-24
CN110609857B          2021-03-05

Family

ID=68890755

Country Status (1)

Country Link
CN (1) CN110609857B (en)

Also Published As

Publication number Publication date
CN110609857A (en) 2019-12-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant