CN110609857B

CN110609857B - Dynamic threshold-based time series data sequence pattern mining method

Info

Publication number: CN110609857B
Application number: CN201910811085.4A
Authority: CN
Inventors: 王巍; 辛国栋; 田静; 吕芳; 黄俊恒; 魏玉良
Original assignee: Weihai Tianzhiwei Network Space Safety Technology Co ltd; Harbin Institute of Technology Weihai
Current assignee: Weihai Tianzhiwei Network Space Safety Technology Co ltd; Harbin Institute of Technology Weihai
Priority date: 2019-08-30
Filing date: 2019-08-30
Publication date: 2021-03-05
Anticipated expiration: 2039-08-30
Also published as: CN110609857A

Abstract

The invention belongs to the technical field of data processing and relates to a sequential pattern mining method. The method comprises the following steps: dividing the original sequence with a time window to form a time sequence set; scanning the time sequence set to obtain a one-item set; screening the one-items with a dynamic threshold according to the definition of a frequent one-item, and deleting the non-frequent items from the sequence set to obtain a frequent one-item set; starting from the first item in the frequent one-item set, taking that item as a prefix, constructing its suffix, and mining frequent multi-item patterns on the suffix, where an item that satisfies the definition of a frequent multi-item pattern is a frequent item, and mining finishes when the suffix is empty; and traversing the frequent one-item set, repeating the previous step for each item until the whole set has been traversed, to obtain the frequent item set. When a frequent multi-item pattern is identified, a dynamically changing support threshold is used; the method mines not only patterns that occur frequently across different sequences but also patterns that occur frequently within a single sequence.

Description

Dynamic threshold-based time series data sequence pattern mining method

Technical Field

The invention belongs to the technical field of data processing and relates to a sequential pattern mining method.

Background

An abnormal financial transaction pattern is a special transaction pattern in an account's transaction data that reflects abnormal transaction behavior, such as a special transaction amount. Abnormal financial transaction patterns are concealed in many illegal financial activities, such as money laundering, credit card fraud, illegal fund-raising, and pyramid selling. Combating pyramid selling is an urgent task in the field of financial security. In essence, the organizers of a pyramid scheme illegally pool and transfer funds by recruiting downline members, which endangers people's lives and property and disturbs the normal social and economic order. Analyzing financial transaction data is an effective way to mine abnormal transaction patterns and to identify the hierarchy of an abnormal organization. However, existing methods for mining abnormal financial transaction patterns from financial transaction data rely mainly on manually built rule bases or simple statistics, which consume a great deal of time and labor and are therefore inefficient. Data mining methods can help criminal investigators discover relationships in the data, solve cases quickly and efficiently, and safeguard the country's financial and economic security and social stability.

At present, abnormal financial transaction pattern recognition based on itemset and sequential pattern mining has made considerable progress in fields such as anti-money laundering and credit card fraud detection. Some studies have applied clustering algorithms to transaction behavior pattern recognition in money laundering detection and credit card fraud detection, with some success. Researchers have proposed the O-Apriori (VSO-Apriori) algorithm, which mines cross-transaction association rules from multivariate time series and is significant for predicting trends in financial time series. However, that method only extracts patterns that co-occur across several different time series and does not address finding frequent patterns inside a single time series. Other researchers have studied typical sequential pattern mining algorithms, such as GSP and PrefixSpan, and applied them to Web log mining. GSP uses a breadth-first search strategy to find all sequential patterns but handles long sequences inefficiently. PrefixSpan is a pattern-growth algorithm based on projected databases that searches depth-first. It mines sequential patterns more efficiently but occupies excessive storage space.

The data sets targeted by sequential pattern mining are ordered, such as time series and gene sequences. Bank transaction records are a typical example of time series data. The core idea of existing sequential pattern mining work is to mine single items and multi-item patterns against one uniform support threshold. Applying such algorithms directly to mining pyramid selling patterns raises three problems: 1) normal amounts usually occur more often in transaction records than abnormal amounts, so mining with a uniform threshold tends to produce results that are cluttered and of little use; 2) the purchase and rebate amounts in pyramid selling activity are time-sensitive, whereas general sequential pattern mining algorithms rarely consider the timeliness of the data; 3) existing sequential pattern mining algorithms only consider whether a pattern is common across the sequence set and ignore the significance of a pattern recurring within one sequence. In pyramid selling activity, a pyramid selling pattern occurs frequently not only across sequences but also within a sequence, which existing sequential pattern mining algorithms cannot handle.

Disclosure of Invention

The invention aims to solve the problems of existing sequential pattern mining techniques by providing a dynamic threshold-based sequential pattern mining method that mines frequent patterns from time series data, so that analysts can examine the correlations in the data and find valuable information.

The technical solution adopted by the invention is a dynamic threshold-based sequential pattern mining method for time series data, comprising the following steps:

(1) preprocessing the original bank transaction sequence data, which includes amount and transaction time information, sorting the data by time, and dividing the original sequence with a time window to form a time sequence set;

(2) scanning the time sequence set to obtain a one-item set;

(3) because normal amounts occur more often than abnormal amounts in bank transactions, using different thresholds for different items when identifying frequent one-items: screening the one-items with a dynamic threshold according to the definition of a frequent one-item, and deleting the non-frequent items from the sequence set to obtain a frequent one-item set;

(4) starting from the first item in the frequent one-item set, taking that item as a prefix, constructing its suffix, and mining frequent multi-item patterns on the suffix, where an item that satisfies the definition of a frequent multi-item pattern is a frequent item, and mining finishes when the suffix is empty;

(5) traversing the frequent one-item set and repeating step (4) for each item until the whole set has been traversed, to obtain the frequent item set;

a frequent one-item is defined as follows: given the probabilities with which the items appear in normal transaction data, $P_{normal} = \{p_1, p_2, \ldots, p_n\}$, $n > 0$, calculate the probability of each item appearing in the transaction sequence set, $Q = \{q_1, q_2, \ldots, q_n\}$; an item whose probability $q_i$ meets its threshold $\alpha \times p_i$ is considered a frequent one-item, where $\alpha$ is a threshold coefficient and $i$ indexes the items, $1 \leq i \leq n$;

a frequent multi-item pattern is defined as follows: if events A and B are independent of each other, $P_{AB} = P_A \cdot P_B$; if $P_{AB} < P_A \cdot P_B$, the two are considered negatively correlated, and if $P_{AB} > P_A \cdot P_B$, the two are considered positively correlated; here $P_A$ = number of occurrences of prefix A / total number of sequences, $P_B$ = number of occurrences of suffix B / total number of sequences, and $P_{AB}$ = number of co-occurrences of pattern AB / total number of sequences; the total number of sequences is the number of sequences obtained by dividing with the time window; the goal is to find positively correlated item sets as frequent items; when two amounts are positively correlated, they are considered a purchase-rebate pattern.

As a further improvement of the invention, when mining frequent multi-item patterns, the construction of the suffix must satisfy the following: an item at a given position cannot be matched with the same item at multiple positions; it can be matched with that item at only one position.

The dynamic threshold-based sequence pattern mining method has the following beneficial effects:

(1) when a frequent multi-item pattern is identified, a dynamically changing support threshold is used instead of a uniform support threshold min-sup;

(2) the method mines not only patterns that occur frequently across different sequences but also patterns that occur frequently within a single sequence;

(3) the sequence mining process must satisfy the One-Off condition, which solves the problem of counting a pattern repeatedly.

Drawings

FIG. 1 is a flow chart of a dynamic threshold based mining method for time series data patterns according to an embodiment of the present invention;

FIG. 2 is a diagram of the prefix and suffix projection in an embodiment;

FIG. 3 is a diagram of mining frequent multi-item patterns with l₁ as the prefix.

Detailed Description

The dynamic threshold-based time series data sequence pattern mining method of the present invention is illustrated and described in detail below with reference to the accompanying drawings and embodiments, so as to enable those skilled in the art to better understand the technical solution of the present invention, but the technical idea of the present invention is not limited to the specific contents described in the embodiments.

The flow of the time series data pattern mining method based on the dynamic threshold value in this embodiment is shown in fig. 1, and specifically includes the following steps:

First, acquire the original sequence set

Let Sequences = (S1, S2, S3, …, Sn) be a sequence set containing n sequences, where Si denotes the i-th sequence. The format of the sequence set in this example is shown in Table 1.

TABLE 1 Sequence set

Second, divide the original sequence with a time window to form a time sequence set

For a sequence Si, given a time window of length w and a sliding step t, place the time window at the start of the sequence, where it covers a subsequence of length w; then move the window to the right by t to obtain another subsequence of length w, and so on until the time window contains the last item.
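The windowing step can be sketched in Python as follows. This is a minimal illustration rather than the patent's implementation: the function name `sliding_windows` is hypothetical, the window length `w` and step `t` are assumed to be counted in items (the text could equally mean time units), and the final window is allowed to be shorter than `w` when the sequence length does not line up with the step.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sliding_windows(seq: Sequence[T], w: int, t: int) -> List[List[T]]:
    """Cut seq into windows of length w, moving the window t items per step.

    Assumption: w and t are item counts; the last window may be shorter than w.
    """
    if w <= 0 or t <= 0:
        raise ValueError("w and t must be positive")
    if not seq:
        return []
    windows: List[List[T]] = []
    start = 0
    while True:
        windows.append(list(seq[start:start + w]))
        if start + w >= len(seq):  # this window already contains the last item
            break
        start += t
    return windows


print(sliding_windows(list("abcdefg"), w=3, t=2))
# [['a', 'b', 'c'], ['c', 'd', 'e'], ['e', 'f', 'g']]
```

With `w = 3` and `t = 2`, a seven-item sequence yields three overlapping windows, the last of which ends on the final item.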

Third, screen the one-items with a dynamic threshold according to the definition of a frequent one-item, and delete the non-frequent items from the sequence set to obtain the frequent one-item set

For the sequence set, the probability of each item occurring in normal data is given as $P_{normal} = \{p_1, p_2, \ldots, p_n\}$, $n > 0$. When mining one-items, the probability $Q = \{q_1, q_2, \ldots, q_n\}$ is calculated, with $q_i$ computed in the same way as $p_i$. An item whose $q_i$ meets its threshold $\alpha \times p_i$ is considered a frequent one-item, where $\alpha$ is the threshold coefficient.

Let $\alpha = 0.5$, and let $P_{normal}$ be the proportion of each item among all items in the normal data, regardless of the structure of the sequences: $P_{normal} = \{l_1: 4/27,\ l_2: 8/27,\ l_3: 3/27,\ l_4: 8/27,\ l_5: 3/27,\ l_6: 9/27,\ l_7: 8/27\}$.

First, the prefixes with length of one in the sequence set and the corresponding frequencies are counted to obtain table 2.

Table 2 prefix occurrence frequency table

From the definition of a frequent one-item, the frequent one-item set is obtained: OneFrequentSeq = {l₁, l₂, l₃, l₄, l₅}. Because l₆ and l₇ do not satisfy the frequent one-item definition, they are removed from the sequence set. The updated sequence set is shown in Table 3.

TABLE 3 Updated sequence set
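The screening in this step can be illustrated with a short Python sketch. It assumes that the probability of an item in the sequence set is its share of all item occurrences (mirroring how the normal-data proportions are described), that an item is kept when its probability is greater than or equal to α times its normal probability (whether the comparison is strict is an assumption), and that an item missing from the normal-data table has normal probability 0. The function name `frequent_one_items` and the toy data are hypothetical; they are not the values of Tables 2 and 3.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def frequent_one_items(
    sequences: List[List[str]],
    p_normal: Dict[str, float],
    alpha: float,
) -> Tuple[Set[str], List[List[str]]]:
    """Return the frequent one-items and the sequence set with other items removed."""
    counts = Counter(item for seq in sequences for item in seq)
    total = sum(counts.values())
    frequent: Set[str] = set()
    for item, count in counts.items():
        q = count / total                            # share of all item occurrences
        threshold = alpha * p_normal.get(item, 0.0)  # per-item (dynamic) threshold
        if q >= threshold:                           # assumption: non-strict comparison
            frequent.add(item)
    pruned = [[item for item in seq if item in frequent] for seq in sequences]
    return frequent, pruned


# Hypothetical toy data.
sequences = [["a", "b", "a"], ["a", "c"]]
p_normal = {"a": 0.5, "b": 0.6, "c": 0.2}
frequent, pruned = frequent_one_items(sequences, p_normal, alpha=0.5)
print(sorted(frequent))  # ['a', 'c']
print(pruned)            # [['a', 'a'], ['a', 'c']]
```

Item `b` is dropped even though it occurs as often as `c`, because its normal-data probability, and therefore its threshold, is higher. That is the effect the dynamic threshold is meant to achieve.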

Fourth, starting from the first item in the frequent one-item set, take that item as a prefix and construct its suffix

For example, for the sequence S1 = <l₁, l₂, l₃, l₄, l₅, l₁, l₃, l₂, l₇>, with P = <l₁> and Q = <l₂, l₃, l₄, l₅, l₁, l₃, l₂, l₇>, Q is the suffix of P. Note that <l₃, l₂, l₇> is not a suffix of <l₁> in sequence S1. The suffix projection built from the sequences in Table 3 is shown in FIG. 2.
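A minimal Python sketch of this projection, assuming the suffix of a one-item prefix is everything after the first occurrence of that item in the sequence, which is what the S1 example shows. The helper names `first_suffix` and `project` are hypothetical.

```python
from typing import List, Optional, Sequence


def first_suffix(seq: Sequence[str], prefix_item: str) -> Optional[List[str]]:
    """Return the items after the first occurrence of prefix_item, or None if absent."""
    for pos, item in enumerate(seq):
        if item == prefix_item:
            return list(seq[pos + 1:])
    return None


def project(sequences: Sequence[Sequence[str]], prefix_item: str) -> List[List[str]]:
    """Collect the non-empty suffixes of prefix_item over a whole sequence set."""
    suffixes: List[List[str]] = []
    for seq in sequences:
        suffix = first_suffix(seq, prefix_item)
        if suffix:  # skip sequences without the prefix or with an empty suffix
            suffixes.append(suffix)
    return suffixes


S1 = ["l1", "l2", "l3", "l4", "l5", "l1", "l3", "l2", "l7"]
print(first_suffix(S1, "l1"))
# ['l2', 'l3', 'l4', 'l5', 'l1', 'l3', 'l2', 'l7']
```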

Fifth, mine frequent multi-item patterns on the suffix. When a suffix sequence contains the prefix, the suffix sequence is divided and updated. An item B that satisfies the definition of a frequent multi-item pattern is combined with its prefix A to form the frequent item AB, and mining continues with AB as the new prefix until the suffix is empty, at which point mining finishes.

Definition of a frequent multi-item pattern: when mining multi-item sets, by the principle of independence, if events A and B are independent of each other, then $P_{AB} = P_A \cdot P_B$. If $P_{AB} < P_A \cdot P_B$, the two are considered negatively correlated; if $P_{AB} > P_A \cdot P_B$, the two are considered positively correlated. Here $P_A$ = number of occurrences of prefix A / total number of sequences, $P_B$ = number of occurrences of suffix B / total number of sequences, and $P_{AB}$ = number of co-occurrences of pattern AB / total number of sequences. The total number of sequences is the number of sequences obtained by dividing with the time window. The goal is to find positively correlated item sets as frequent items.
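The positive-correlation test can be written as a small Python check. This sketch assumes the three occurrence counts and the number of windowed sequences have already been obtained; the function name and the numbers in the usage lines are hypothetical. It compares integers instead of floating-point ratios, which is equivalent because multiplying both sides of the inequality by the squared total preserves its direction.

```python
def is_positively_correlated(
    count_a: int, count_b: int, count_ab: int, total_sequences: int
) -> bool:
    """True when P_AB > P_A * P_B, with each probability taken as count / total_sequences."""
    if total_sequences <= 0:
        raise ValueError("total_sequences must be positive")
    # P_AB > P_A * P_B
    #   <=> count_ab / N > (count_a / N) * (count_b / N)
    #   <=> count_ab * N > count_a * count_b
    return count_ab * total_sequences > count_a * count_b


# Hypothetical counts over 10 windowed sequences.
print(is_positively_correlated(count_a=4, count_b=5, count_ab=3, total_sequences=10))  # True
print(is_positively_correlated(count_a=4, count_b=5, count_ab=2, total_sequences=10))  # False
```

In the second call the joint frequency equals the product of the individual frequencies (0.2 = 0.4 × 0.5), so A and B look independent and AB is not treated as a frequent item.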

Definition of the One-Off condition: let $p_1 = (i_1, i_2, \ldots, i_m)$ and $p_2 = (j_1, j_2, \ldots, j_m)$ be two occurrences of pattern P. If $i_p \neq j_q$ for all $1 \leq p \leq m$ and $1 \leq q \leq m$, then $p_1$ and $p_2$ are said to satisfy the One-Off condition.

For example: S1 = <l₁, l₂, l₃, l₄, l₅, l₁, l₃, l₂, l₇>, where location(l₁) = {0, 5} and location(l₂) = {1, 7}. Suppose the pattern is P = <l₁, l₂>. If p₁ = <0, 1>, then p₂ cannot be <0, 7> or <5, 1>; it can only be <5, 7>.
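The example can be checked with a few lines of Python. `satisfies_one_off` tests two occurrences (tuples of positions) directly against the definition. `count_one_off_pairs` is a hypothetical greedy, left-to-right way of collecting occurrences of a two-item pattern that never reuse a position; the text does not specify a matching strategy, so this is only one possible choice.

```python
from typing import List, Sequence, Set, Tuple


def satisfies_one_off(occ1: Sequence[int], occ2: Sequence[int]) -> bool:
    """Two occurrences satisfy One-Off when they share no position at all."""
    return set(occ1).isdisjoint(occ2)


def count_one_off_pairs(seq: Sequence[str], a: str, b: str) -> List[Tuple[int, int]]:
    """Greedily collect occurrences of the pattern <a, b> that use each position at most once."""
    used: Set[int] = set()
    occurrences: List[Tuple[int, int]] = []
    for i, item in enumerate(seq):
        if item != a or i in used:
            continue
        # Pair this a with the nearest later b that has not been used yet.
        for j in range(i + 1, len(seq)):
            if seq[j] == b and j not in used:
                occurrences.append((i, j))
                used.update((i, j))
                break
    return occurrences


S1 = ["l1", "l2", "l3", "l4", "l5", "l1", "l3", "l2", "l7"]
print(satisfies_one_off((0, 1), (5, 7)))   # True
print(satisfies_one_off((0, 1), (0, 7)))   # False
print(count_one_off_pairs(S1, "l1", "l2"))  # [(0, 1), (5, 7)]
```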

Taking l₁ as the prefix as an example, whether each item satisfies the definition of a frequent multi-item pattern is calculated on the suffixes of l₁. While mining a suffix of l₁, it must be determined whether the suffix itself contains l₁; if it does, the suffix must be split at the corresponding position to form new suffixes. However, when $P_{AB}$, $P_A$, and $P_B$ are calculated, their denominator refers to the original sequence set. The calculation shows that only l₃ satisfies the definition, so l₃ and l₁ form the new prefix <l₁, l₃>, which serves as prefix A in the next step. The suffix sequences corresponding to A are located in the original sequences and mined; the calculation finds no item that satisfies the condition, so mining with l₁ as the prefix is complete. Mining then continues with the suffixes of l₂, l₃, and so on, in the same way as for l₁. A schematic diagram of the mining process is shown in FIG. 3.
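One possible reading of the splitting rule is sketched below in Python: each occurrence of the prefix item starts a new suffix segment that runs until the next occurrence of the same item. This interpretation, and the function name `split_suffixes`, are assumptions; the text only states that a suffix containing the prefix is split at the corresponding position.

```python
from typing import List, Optional, Sequence


def split_suffixes(seq: Sequence[str], prefix_item: str) -> List[List[str]]:
    """Split seq into the segments that follow each occurrence of prefix_item.

    Assumption: a segment ends just before the next occurrence of prefix_item,
    and empty segments are dropped.
    """
    segments: List[List[str]] = []
    current: Optional[List[str]] = None  # None until the first prefix is seen
    for item in seq:
        if item == prefix_item:
            if current:                  # close the previous non-empty segment
                segments.append(current)
            current = []
        elif current is not None:
            current.append(item)
    if current:
        segments.append(current)
    return segments


S1 = ["l1", "l2", "l3", "l4", "l5", "l1", "l3", "l2", "l7"]
print(split_suffixes(S1, "l1"))
# [['l2', 'l3', 'l4', 'l5'], ['l3', 'l2', 'l7']]
```

Under this reading, items that follow the second l₁ are counted against that second occurrence and not against the first, which keeps the counting consistent with the One-Off condition.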

Sixth, once every frequent one-item has been traversed, all frequent items have been mined. Throughout the process, attention must be paid to the dynamic division of suffixes and to the One-Off condition: an item at a given position cannot be matched with the same item at multiple positions, only with that item at one position.

The method can be applied to the bank transaction data of a pyramid selling organization to mine the organization's abnormal transaction patterns. For example, member A makes a subscription purchase of amount PM; A's direct upline receives rebate RM1, and the upline one level above that receives rebate RM2. By analogy, the rebate received by A's i-th upline is RMi, where A has n uplines and 0 < i ≤ n. The method then mines the purchase-rebate pattern of the scheme, namely the <PM, RMi> pattern.

The method is applicable to pattern mining tasks on time series data in fields such as politics, economics, culture, and medicine, and allows abnormal behavior to be detected automatically at any time and place, for example when staff handling financial security incidents need to identify abnormal behavior, so it has broad application prospects.

Claims

1. A time series data sequence pattern mining method based on dynamic threshold is characterized by comprising the following steps:

(2) scanning the time sequence set to obtain a one-item set;

2. The dynamic threshold-based time series data sequence pattern mining method of claim 1, wherein, when mining frequent multi-item patterns, the construction of the suffix must satisfy the following: an item at a given position cannot be matched with the same item at multiple positions; it can be matched with that item at only one position.