CN105589900A

CN105589900A - Data mining method based on multi-dimensional analysis

Info

Publication number: CN105589900A
Application number: CN201410671003.8A
Authority: CN
Inventors: 王骏; 杨鸿超
Original assignee: China Unionpay Co Ltd
Current assignee: China Unionpay Co Ltd
Priority date: 2014-11-21
Filing date: 2014-11-21
Publication date: 2016-05-18

Abstract

The invention provides a data mining method based on multi-dimensional analysis. The method comprises the following steps: extracting original event records from a database; screening and converting the extracted original event records to form a time-series-based event record transaction set, wherein the event record transaction set is composed of multiple transactions and each transaction is composed of a plurality of event elements; generating a frequent pattern tree (FP-tree) based on the event record transaction set; and screening out, according to the generated frequent pattern tree, the frequent items associated with a predetermined target event element. The disclosed data mining method based on multi-dimensional analysis is suitable for parallel computation in a distributed environment and can process massive target data.

Description

Data mining method based on multi-dimensional analysis

Technical field

The present invention relates to data mining methods and, more specifically, to a data mining method based on multi-dimensional analysis.

Background art

At present, as computer network applications become increasingly widespread and the types of business in different fields become increasingly rich, it is becoming more and more important to perform data mining on massive event record data (for example, transaction record data in the financial field) in order to discover associations between the occurrences of different events. One example is the association between different customer consumption events in the financial field: after a certain class of cardholder has made a purchase at a certain type of merchant at a certain time, at what time and at what type of merchant is the cardholder likely to make the next purchase, and with what probability; and at what time and at what type of merchant is the cardholder likely to make a purchase after that, and with what probability.

In existing technical solutions, processes related to sequential pattern analysis (which refers to mining frequently occurring patterns from time-ordered or other sequence databases) are usually carried out on a single machine in order to discover potential associations between different data.

However, the existing technical solutions have the following problems: (1) the analysis can only be carried out on a single machine, so it is difficult to adapt to parallel computation in a distributed environment; (2) the volume of data that can be processed is limited and cannot cover all samples, which lowers the accuracy of the analysis; (3) analysis is possible only along a single dimension, so multi-dimensional data association analysis cannot be achieved.

Therefore, there is a need for a data mining method based on multi-dimensional analysis that can adapt to parallel computation in a distributed environment and can process massive target data.

Summary of the invention

In order to solve the above problems of the prior art, the present invention proposes a data mining method based on multi-dimensional analysis that can adapt to parallel computation in a distributed environment and can process massive target data.

The object of the present invention is achieved through the following technical solution:

A data mining method based on multi-dimensional analysis, comprising the following steps:

(A1) extracting original event records from a database, and screening and converting the extracted original event records to form a time-series-based event record transaction set, wherein the event record transaction set is composed of multiple transactions and each transaction is composed of several event elements;

(A2) generating a frequent pattern tree (FP-tree) based on the event record transaction set;

(A3) screening out, according to the generated frequent pattern tree, the frequent items associated with a predetermined target event element.

In the solution disclosed above, preferably, step (A1) further comprises:

(1) extracting original event records from the database according to set screening conditions;

(2) for each extracted original event record, selecting predetermined fields and forming a basic event record in a predetermined format, wherein each basic event record represents one actual event and includes at least an event subject field, an event type field and an event occurrence time field;

(3) grouping all basic event records according to a predetermined rule, using at least the values of the event subject field and the event occurrence time field as the primary key;

(4) cleaning the records in each group separately, merging basic event records in the same group that have identical event type and event occurrence time values into one basic event record;

(5) treating each basic event record in a group as an event element expressed in the form "event type$event occurrence time", and merging all event elements in the same group to form one event record transaction, which is expressed as <event type 1$event occurrence time 1, event type 2$event occurrence time 2, ..., event type i$event occurrence time i, ...>, wherein "event type i$event occurrence time i" denotes the i-th event element in the event record transaction; the event record transactions corresponding to all groups thereby form the event record transaction set.

In the solution disclosed above, preferably, step (A2) further comprises:

(1) traversing the event record transaction set, calculating the total frequency with which each event element occurs, and sorting all event elements in descending order of frequency to obtain an event element frequency list;

(2) for each transaction in the event record transaction set, reordering the event elements in the transaction according to the order of the event elements in the event element frequency list;

(3) creating the root node of the frequent pattern tree (FP-tree), traversing the event record transaction set again, and inserting the event elements of each transaction processed in step (2) into the created frequent pattern tree as frequent items.
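Steps (1) and (2) above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names `frequency_list` and `reorder`, the sample event elements, and the tie-breaking rule (alphabetical order among equally frequent elements) are assumptions introduced here.

```python
from collections import Counter

def frequency_list(transactions):
    """Count how many transactions contain each event element and sort
    the elements in descending order of frequency."""
    counts = Counter(e for t in transactions for e in set(t))
    # Assumption: ties are broken alphabetically to keep the order deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

def reorder(transactions, freq_list):
    """Reorder the elements of every transaction by their rank in freq_list.
    Elements missing from freq_list (for example, pruned ones) are dropped."""
    rank = {element: i for i, (element, _) in enumerate(freq_list)}
    return [
        sorted((e for e in set(t) if e in rank), key=rank.__getitem__)
        for t in transactions
    ]

# Hypothetical transactions in "event type$event occurrence time" form.
transactions = [
    ["mcc_1$15", "mcc_2$16"],
    ["mcc_2$16", "mcc_3$18"],
    ["mcc_3$18", "mcc_1$15", "mcc_2$16"],
    ["mcc_2$16"],
]
freq = frequency_list(transactions)
ordered = reorder(transactions, freq)
```

Sorting every transaction by the same global order is what lets transactions that share frequent elements also share a prefix path in the tree built in step (3).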

In the solution disclosed above, preferably, step (A2) further comprises: after all event elements have been sorted in descending order of frequency to obtain the event element frequency list, removing the event elements whose support is less than a predetermined threshold, wherein the support of an event element is calculated by the following formula: event element support = frequency / total number of transactions.

In the solution disclosed above, preferably, step (A3) further comprises screening the frequent items associated with a predetermined target event element as follows:

(1) finding all nodes of the target event element in the frequent pattern tree and traversing upward through their ancestor nodes to obtain all paths, thereby obtaining the conditional pattern base of the event element in the frequent pattern tree;

(2) using the conditional pattern base as the original transaction set to build the conditional pattern tree of the target event element, thereby obtaining all frequent itemsets of the target event element;

(3) screening out, from the obtained frequent itemsets, the frequent items associated with the target event element according to a predetermined minimum confidence.
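As a rough illustration of step (2), the sketch below derives, from a conditional pattern base, how often each other event element co-occurs with the target. It covers only the first level of the mining (pairs of the form {element, target}), not the full recursive construction of conditional pattern trees. The function name `co_occurrence_counts`, the `min_count` parameter and the sample base are assumptions.

```python
from collections import Counter

def co_occurrence_counts(conditional_base, min_count=1):
    """conditional_base is a list of (prefix_path, count) pairs for one
    target element. Returns {element: frequency of {element, target}},
    keeping only entries that reach min_count."""
    counts = Counter()
    for path, count in conditional_base:
        for element in path:
            # Every transaction that followed this prefix path contains
            # both the element and the target.
            counts[element] += count
    return {e: n for e, n in counts.items() if n >= min_count}

# Hypothetical conditional pattern base of the target "mcc_3$18".
base = [(["mcc_2$16"], 1), (["mcc_2$16", "mcc_1$15"], 1)]
pairs = co_occurrence_counts(base)
```

Dividing each of these pair frequencies by the frequency of the target element gives the confidence used for the screening in step (3).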

The data mining method based on multi-dimensional analysis disclosed in the present invention has the following advantages: (1) it can adapt to parallel computation in a distributed environment; (2) it can process massive data, thereby covering all samples and achieving higher analysis accuracy; (3) it enables multi-dimensional data association analysis.

Brief description of the drawings

The technical features and advantages of the present invention will be better understood by those skilled in the art with reference to the accompanying drawing, in which:

Fig. 1 is a flow chart of the data mining method based on multi-dimensional analysis according to an embodiment of the present invention.

Detailed description of the invention

Fig. 1 is a flow chart of the data mining method based on multi-dimensional analysis according to an embodiment of the present invention. As shown in Fig. 1, the data mining method based on multi-dimensional analysis disclosed in the present invention comprises the following steps: (A1) extracting original event records from a database, and screening and converting the extracted original event records to form a time-series-based event record transaction set, wherein the event record transaction set is composed of multiple transactions and each transaction is composed of several event elements; (A2) generating a frequent pattern tree (FP-tree) based on the event record transaction set; (A3) screening out, according to the generated frequent pattern tree, the frequent items associated with a predetermined target event element (that is, other event elements, different from the target event element, that are related to it).

Preferably, in the data mining method based on multi-dimensional analysis disclosed in the present invention, step (A1) further comprises: (1) extracting original event records (for example, transaction records, each of which represents one transaction that actually occurred) from the database according to set screening conditions; (2) for each extracted original event record, selecting predetermined fields (for example, card number, transaction time and merchant type) and forming a basic event record in a predetermined format, wherein each basic event record represents one actual event (for example, a certain cardholder made one purchase at a merchant of a certain type at a certain time) and includes at least an event subject field (for example, a card number field), an event type field (for example, a merchant type field) and an event occurrence time field; (3) grouping all basic event records according to a predetermined rule, using at least the values of the event subject field and the event occurrence time field as the primary key (for example, all transaction records of the same card number on the same day are assigned to the same group); (4) cleaning the records in each group separately, merging basic event records in the same group that have identical event type and event occurrence time values into one basic event record; (5) treating each basic event record in a group as an event element expressed in the form "event type$event occurrence time", and merging all event elements in the same group to form one event record transaction (for example, one transaction represents the merchant types and transaction times of all transactions made with one card number on a given day), which is expressed as <event type 1$event occurrence time 1, event type 2$event occurrence time 2, ..., event type i$event occurrence time i, ...>, wherein "event type i$event occurrence time i" denotes the i-th event element in the event record transaction; the event record transactions corresponding to all groups thereby form the event record transaction set.
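The grouping, cleaning and merging steps above can be sketched as follows, using the card number / same day example from the text. The record layout (`card_no`, `time`, `merchant_type`), the ISO timestamp format, the hour-level time granularity and the function name `build_transactions` are all assumptions made for illustration; the source does not fix any of them.

```python
from collections import defaultdict
from datetime import datetime

def build_transactions(raw_records):
    """Group basic event records by (card number, calendar day) and emit
    one transaction of "merchant_type$hour" event elements per group."""
    groups = defaultdict(set)
    for rec in raw_records:
        # Step (2): keep only the subject, type and time fields.
        ts = datetime.fromisoformat(rec["time"])
        # Step (3): the grouping key is (event subject, day of occurrence).
        key = (rec["card_no"], ts.date())
        # Step (4): a set merges records with the same type and time value.
        groups[key].add((ts.hour, rec["merchant_type"]))
    transactions = []
    for key in sorted(groups):
        # Step (5): one transaction per group, elements in time order.
        transactions.append([f"{mcc}${hour}" for hour, mcc in sorted(groups[key])])
    return transactions

# Hypothetical original event records; "amount" is dropped in step (2).
records = [
    {"card_no": "c1", "time": "2014-11-21T15:05:00", "merchant_type": "mcc_1", "amount": 10},
    {"card_no": "c1", "time": "2014-11-21T15:40:00", "merchant_type": "mcc_1", "amount": 5},
    {"card_no": "c1", "time": "2014-11-21T16:10:00", "merchant_type": "mcc_2", "amount": 7},
    {"card_no": "c2", "time": "2014-11-21T09:00:00", "merchant_type": "mcc_3", "amount": 1},
]
transactions = build_transactions(records)
```

Because each group depends only on its own key, this step could in principle be distributed by partitioning records on the grouping key, which is consistent with the parallel-computation goal stated in the text; the sketch itself runs on a single machine.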

Preferably, in the data mining method based on multi-dimensional analysis disclosed in the present invention, step (A2) further comprises: (1) traversing the event record transaction set, calculating the total frequency with which each event element occurs, and sorting all event elements in descending order of frequency to obtain an event element frequency list; (2) for each transaction in the event record transaction set, reordering the event elements in the transaction according to the order of the event elements in the event element frequency list; (3) creating the root node of the frequent pattern tree (FP-tree), labelled "null", traversing the event record transaction set again, and inserting the event elements of each transaction processed in step (2) into the created frequent pattern tree as frequent items.

The frequent pattern tree consists of a root node (whose value is null), item prefix subtrees (as children of the root) and a frequent item header table. Each node in an item prefix subtree contains three fields: item_name, count and node_link. The item_name field records the identifier of the item represented by the node; the count field records the number of transactions that reach the node along its path; the node_link field links to the next node in the tree with the same identifier, and its value is "null" if no such node exists. Each entry of the frequent item header table contains a frequent item identifier field item_name and a head pointer headofnode_link that points to the first node in the tree carrying that identifier.

For an item α held by some node in the frequent pattern tree, there is a path from the root node to α. The part of this path that excludes the node of α itself is called a prefix subpath of α, and α is called the suffix of the path. A frequent pattern tree may contain several nodes that hold α; starting from the entry for α in the frequent item header table, they are linked together through the headofnode_link of the header table and the node_link fields in the item prefix subtrees. Each node that holds α yields a different prefix subpath of α, and all of these paths together form the conditional pattern base of α. The frequent pattern tree built from the conditional pattern base of α is called the conditional pattern tree of α.

In the data mining method based on multi-dimensional analysis disclosed in the present invention, the frequent pattern tree is basically constructed as follows: the root node T of the frequent pattern tree is created and labelled "null"; the event elements of each transaction processed in step (2) are taken as the sorted frequent item list [p|P], where p is the first frequent item and P is the list of remaining frequent items; the function insert_tree([p|P], T) is then called, which proceeds as follows: if T has a child N such that N.item_name = p.item_name, the count of N is incremented by 1; otherwise a new node N is created with its count set to 1, linked to its parent node T, and linked through node_link to the nodes with the same item_name; if P is not empty, insert_tree(P, N) is called recursively. This completes the construction of the frequent pattern tree.

The frequent pattern tree disclosed in the present invention stores all of the information needed for mining frequent itemsets. The memory occupied by the frequent pattern tree is proportional to the depth and width of the tree: the depth of the tree is the maximum number of items contained in a single transaction, and the width of the tree is the average number of items contained in each level.
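The construction procedure and the conditional pattern base described above can be sketched as follows. The field names `item_name`, `count` and `node_link` and the function name `insert_tree` follow the text; the `parent` and `children` attributes, the dictionary used as the header table, and the helpers `build_fp_tree` and `conditional_pattern_base` are assumptions added so that the sketch is self-contained.

```python
class FPNode:
    def __init__(self, item_name, parent):
        self.item_name = item_name   # identifier of the item; None for the root
        self.count = 0               # transactions reaching this node
        self.node_link = None        # next node with the same item_name
        self.parent = parent         # assumption: needed to walk upward
        self.children = {}           # assumption: item_name -> child node

def insert_tree(items, node, header):
    """Insert the ordered item list [p|P] below node, as described in the text."""
    if not items:
        return
    p, rest = items[0], items[1:]
    child = node.children.get(p)
    if child is None:
        child = FPNode(p, node)
        node.children[p] = child
        if p not in header:
            header[p] = child        # plays the role of headofnode_link
        else:
            last = header[p]
            while last.node_link is not None:
                last = last.node_link
            last.node_link = child   # thread the new node onto the chain
    child.count += 1
    insert_tree(rest, child, header)

def build_fp_tree(ordered_transactions):
    root, header = FPNode(None, None), {}
    for transaction in ordered_transactions:
        insert_tree(transaction, root, header)
    return root, header

def conditional_pattern_base(item, header):
    """Collect (prefix subpath, count) for every node that holds item."""
    base, node = [], header.get(item)
    while node is not None:
        path, ancestor = [], node.parent
        while ancestor is not None and ancestor.item_name is not None:
            path.append(ancestor.item_name)
            ancestor = ancestor.parent
        base.append((path[::-1], node.count))
        node = node.node_link
    return base

# Hypothetical transactions, already sorted by descending element frequency.
ordered = [
    ["mcc_2$16", "mcc_1$15"],
    ["mcc_2$16", "mcc_3$18"],
    ["mcc_2$16", "mcc_1$15", "mcc_3$18"],
    ["mcc_2$16"],
]
root, header = build_fp_tree(ordered)
base = conditional_pattern_base("mcc_3$18", header)
```

In this example all four transactions share the prefix "mcc_2$16", so the root has a single child with a count of 4, and the two nodes holding "mcc_3$18" are chained through `node_link`.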

Preferably, in the data mining method based on multi-dimensional analysis disclosed in the present invention, step (A2) further comprises: after all event elements have been sorted in descending order of frequency to obtain the event element frequency list, removing the event elements whose support is less than a predetermined threshold (for example, 0.01), wherein the support of an event element is calculated by the following formula: event element support = frequency / total number of transactions.
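The support-based pruning can be sketched in a few lines. The function name `prune_by_support` and the sample frequencies are assumptions; the threshold of 0.01 is the example value given in the text.

```python
def prune_by_support(freq_list, total_transactions, min_support=0.01):
    """Keep only event elements whose support (frequency divided by the
    total number of transactions) reaches min_support."""
    return [
        (element, frequency)
        for element, frequency in freq_list
        if frequency / total_transactions >= min_support
    ]

# Hypothetical frequency list over 1000 transactions.
freq_list = [("mcc_1$15", 500), ("mcc_2$16", 120), ("mcc_3$9", 4)]
kept = prune_by_support(freq_list, total_transactions=1000)
```

Here "mcc_3$9" has a support of 4/1000 = 0.004, which is below 0.01, so it is removed before the tree is built.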

Preferably, in the data mining method based on multi-dimensional analysis disclosed in the present invention, step (A3) further comprises screening the frequent items associated with a predetermined target event element as follows: (1) finding all nodes of the target event element in the frequent pattern tree and traversing upward through their ancestor nodes to obtain all paths, thereby obtaining the conditional pattern base of the event element in the frequent pattern tree; (2) using the conditional pattern base as the original transaction set to build the conditional pattern tree of the target event element, thereby obtaining all frequent itemsets of the target event element; (3) screening out, from the obtained frequent itemsets, the frequent items associated with the target event element according to a predetermined minimum confidence. For example, suppose that the event element (mcc_1$15) occurs 10000 times in the transaction set and that a frequent item (mcc_1$15, mcc_2$16) found from the frequent pattern tree has a frequency of 1250. This indicates that a cardholder who makes a purchase at a merchant of type mcc_1 at 15:00 is likely to make another purchase at a merchant of type mcc_2 within the following hour, with a likelihood of 1250/10000 = 0.125; this value is the confidence of the frequent item. The minimum confidence can thus be adjusted to screen out the frequent items with a high likelihood of occurrence.
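The confidence screening in the worked example can be reproduced with a short sketch. The function name `filter_by_confidence`, the minimum confidence of 0.1 and the second itemset with a frequency of 300 are assumptions added for contrast; the figures 10000 and 1250 come from the example above.

```python
def filter_by_confidence(target_count, itemset_counts, min_confidence):
    """Return (itemset, confidence) pairs whose confidence, defined as
    itemset frequency divided by target element frequency, reaches
    min_confidence, sorted from most to least likely."""
    result = []
    for itemset, count in itemset_counts.items():
        confidence = count / target_count
        if confidence >= min_confidence:
            result.append((itemset, confidence))
    return sorted(result, key=lambda pair: -pair[1])

itemset_counts = {
    ("mcc_1$15", "mcc_2$16"): 1250,  # from the example in the text
    ("mcc_1$15", "mcc_3$18"): 300,   # hypothetical, for contrast
}
selected = filter_by_confidence(10000, itemset_counts, min_confidence=0.1)
```

With a minimum confidence of 0.1, only the itemset from the text (confidence 0.125) survives; the hypothetical one (confidence 0.03) is screened out.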

Therefore, the data mining method based on multi-dimensional analysis disclosed in the present invention has the following advantages: (1) it can adapt to parallel computation in a distributed environment; (2) it can process massive data, thereby covering all samples and achieving higher analysis accuracy; (3) it enables multi-dimensional data association analysis.

Although the present invention has been described by way of the above preferred embodiment, its implementation is not limited to that embodiment. It should be recognized that those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope.

Claims

1. A data mining method based on multi-dimensional analysis, comprising the following steps:

(A1) extracting original event records from a database, and screening and converting the extracted original event records to form a time-series-based event record transaction set, wherein the event record transaction set is composed of multiple transactions and each transaction is composed of several event elements;

2. The data mining method based on multi-dimensional analysis according to claim 1, characterized in that step (A1) further comprises:

3. The data mining method based on multi-dimensional analysis according to claim 2, characterized in that step (A2) further comprises:

4. The data mining method based on multi-dimensional analysis according to claim 3, characterized in that step (A2) further comprises: after all event elements have been sorted in descending order of frequency to obtain the event element frequency list, removing the event elements whose support is less than a predetermined threshold, wherein the support of an event element is calculated by the following formula: event element support = frequency / total number of transactions.

5. The data mining method based on multi-dimensional analysis according to claim 4, characterized in that step (A3) further comprises screening the frequent items associated with a predetermined target event element as follows:

(3) screening out, from the obtained frequent itemsets, the frequent items associated with the target event element according to a predetermined minimum confidence.