CN105589900A - Data mining method based on multi-dimensional analysis - Google Patents

Data mining method based on multi-dimensional analysis

Info

Publication number
CN105589900A
Authority
CN
China
Prior art keywords
event
event record
transaction
tree
frequent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410671003.8A
Other languages
Chinese (zh)
Inventor
王骏
杨鸿超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Unionpay Co Ltd
Original Assignee
China Unionpay Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Unionpay Co Ltd
Priority to CN201410671003.8A
Publication of CN105589900A
Legal status: Pending

Abstract

The invention provides a data mining method based on multi-dimensional analysis. The method comprises the following steps: extracting original event records from a database; screening and converting the extracted original event records to form a time-series-based event-record transaction set, wherein the transaction set is composed of multiple transactions and each transaction is composed of a number of event elements; generating a frequent pattern tree (FP-tree) based on the event-record transaction set; and screening out, according to the generated frequent pattern tree, the frequent items associated with a predetermined target event element. The disclosed method is suitable for parallel computation in a distributed environment and can process massive amounts of target data.

Description

Data mining method based on multi-dimensional analysis
Technical field
The present invention relates to data mining methods and, more specifically, to a data mining method based on multi-dimensional analysis.
Background
At present, as computer and network applications become more widespread and the variety of services in different fields keeps growing, it is becoming increasingly important to mine massive event record data (for example, transaction record data in the financial field) in order to discover associations between the occurrences of different events. In the financial field, for example, this means associations between different customer consumption events: when a certain class of cardholder spends at a certain type of merchant at a certain time, at what time and at what type of merchant the cardholder is likely to spend next, and with what probability; and at what later time and at what type of merchant the cardholder may spend again, and with what probability.
In existing technical solutions, processing related to sequential pattern analysis (that is, mining from a sequence database the patterns that occur with high frequency relative to time or another ordering) is usually carried out on a single machine in order to discover potential associations between different data.
However, the existing technical solutions have the following problems: (1) the analysis can only be carried out on a single machine, so it is difficult to adapt it to parallel computation in a distributed environment; (2) the amount of data that can be processed is limited and cannot cover all samples, which lowers the accuracy of the analysis results; (3) analysis is possible along a single dimension only, so multi-dimensional data association analysis cannot be realized.
Therefore, there is a need for a data mining method based on multi-dimensional analysis that can adapt to parallel computation in a distributed environment and can process massive amounts of target data.
Summary of the invention
To solve the above problems of the prior art, the present invention proposes a data mining method based on multi-dimensional analysis that can adapt to parallel computation in a distributed environment and can process massive amounts of target data.
The object of the invention is achieved through the following technical solution:
A data mining method based on multi-dimensional analysis, comprising the following steps:
(A1) extracting original event records from a database, and screening and converting the extracted original event records to form a time-series-based event-record transaction set, wherein the event-record transaction set is composed of multiple transactions and each transaction is composed of a number of event elements;
(A2) generating a frequent pattern tree (FP-tree) based on the event-record transaction set;
(A3) screening out, according to the generated frequent pattern tree, the frequent items associated with a predetermined target event element.
In the solution disclosed above, preferably, step (A1) further comprises:
(1) extracting original event records from the database according to set screening conditions;
(2) for each extracted original event record, selecting a number of predetermined fields and forming a basic event record in a predetermined format, wherein each basic event record represents one actual event and comprises at least an event subject field, an event type field, and an event occurrence time field;
(3) grouping all basic event records according to a predetermined rule, using at least the values of the event subject field and the event occurrence time field as the key;
(4) cleaning the records in each group by merging basic event records within a group that have identical values of the event type field and the event occurrence time field into a single basic event record;
(5) taking each basic event record in a group as an event element expressed in the form "event type$event occurrence time", and merging all event elements in the same group to form one event-record transaction, expressed as <event type 1$event occurrence time 1, event type 2$event occurrence time 2, ..., event type i$event occurrence time i, ...>, wherein "event type i$event occurrence time i" denotes the i-th event element in the transaction; the event-record transactions corresponding to all groups thus form the event-record transaction set.
In the solution disclosed above, preferably, step (A2) further comprises:
(1) traversing the event-record transaction set, computing the total frequency with which each event element occurs, and sorting all event elements in descending order of frequency to obtain an event element frequency list;
(2) for each transaction in the event-record transaction set, reordering the event elements in the transaction according to their order in the event element frequency list;
(3) creating the root node of the frequent pattern tree, traversing the event-record transaction set again, and inserting the event elements of each transaction processed in step (2) into the created frequent pattern tree as frequent items.
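Sub-steps (1) and (2) above can be sketched as follows. This is a minimal illustration and not the patented implementation: the function names, the sample transactions, and the alphabetical tie-breaking rule for equal frequencies are assumptions introduced here.

```python
from collections import Counter

def build_frequency_list(transactions):
    """Sub-step (1): count each event element and sort by descending frequency."""
    counts = Counter(item for tx in transactions for item in tx)
    # Ties are broken alphabetically so the order is deterministic (an assumption).
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

def reorder_transactions(transactions, frequency_list):
    """Sub-step (2): reorder every transaction to follow the frequency list."""
    rank = {item: pos for pos, (item, _) in enumerate(frequency_list)}
    return [sorted(tx, key=rank.__getitem__) for tx in transactions]

# Hypothetical transactions in the "event type$event occurrence time" format.
transactions = [
    ["mcc_2$16", "mcc_1$15"],
    ["mcc_1$15", "mcc_3$18"],
    ["mcc_3$18", "mcc_2$16", "mcc_1$15"],
]
frequency_list = build_frequency_list(transactions)
ordered = reorder_transactions(transactions, frequency_list)
print(frequency_list)
print(ordered)
```

Reordering every transaction by the same global ranking is what lets transactions with common frequent elements share a prefix when they are later inserted into the tree.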
In the solution disclosed above, preferably, step (A2) further comprises: after all event elements have been sorted in descending order of frequency to obtain the event element frequency list, removing the event elements whose support is less than a predetermined threshold, wherein the support of an event element is calculated by the following formula: event element support = frequency / total number of transactions.
In the solution disclosed above, preferably, step (A3) further comprises screening the frequent items associated with a predetermined target event element as follows:
(1) finding all nodes of the target event element in the frequent pattern tree and traversing upward through their ancestor nodes to obtain all paths, thereby obtaining the conditional pattern base of the target event element in the frequent pattern tree;
(2) using the conditional pattern base as the original transaction set to build the conditional pattern tree of the target event element, thereby obtaining all frequent itemsets of the target event element;
(3) screening out, from the obtained frequent itemsets, the frequent items associated with the target event element according to a predetermined minimum confidence.
The data mining method based on multi-dimensional analysis disclosed in the present invention has the following advantages: (1) it can adapt to parallel computation in a distributed environment; (2) it can process massive data and thus cover all samples, yielding more accurate analysis results; (3) it enables multi-dimensional data association analysis.
Brief description of the drawings
The technical features and advantages of the present invention will be better understood by those skilled in the art with reference to the accompanying drawing, in which:
Fig. 1 is a flow chart of the data mining method based on multi-dimensional analysis according to an embodiment of the present invention.
Detailed description
Fig. 1 is a flow chart of the data mining method based on multi-dimensional analysis according to an embodiment of the present invention. As shown in Fig. 1, the data mining method based on multi-dimensional analysis disclosed in the present invention comprises the following steps: (A1) extracting original event records from a database, and screening and converting the extracted original event records to form a time-series-based event-record transaction set, wherein the event-record transaction set is composed of multiple transactions and each transaction is composed of a number of event elements; (A2) generating a frequent pattern tree (FP-tree) based on the event-record transaction set; (A3) screening out, according to the generated frequent pattern tree, the frequent items associated with a predetermined target event element (that is, other event elements, different from the target event element, that are related to it).
Preferably, in the data mining method based on multi-dimensional analysis disclosed in the present invention, step (A1) further comprises:
(1) extracting original event records from the database according to set screening conditions (for example, transaction records, each of which represents one transaction that actually occurred);
(2) for each extracted original event record, selecting a number of predetermined fields (for example, card number, transaction time, and merchant type) and forming a basic event record in a predetermined format, wherein each basic event record represents one actual event (for example, a cardholder spending once at a merchant of a certain type at a certain time) and comprises at least an event subject field (for example, a card number field), an event type field (for example, a merchant type field), and an event occurrence time field;
(3) grouping all basic event records according to a predetermined rule, using at least the values of the event subject field and the event occurrence time field as the key (for example, all transaction records of the same card number on the same day are assigned to the same group);
(4) cleaning the records in each group by merging basic event records within a group that have identical values of the event type field and the event occurrence time field into a single basic event record;
(5) taking each basic event record in a group as an event element expressed in the form "event type$event occurrence time", and merging all event elements in the same group to form one event-record transaction (for example, one transaction represents the merchant types and transaction times of all transactions made with one card number on a given day), expressed as <event type 1$event occurrence time 1, event type 2$event occurrence time 2, ..., event type i$event occurrence time i, ...>, wherein "event type i$event occurrence time i" denotes the i-th event element in the transaction; the event-record transactions corresponding to all groups thus form the event-record transaction set.
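The five sub-steps of (A1) can be illustrated with a small sketch. It is not taken from the patent: the record field names (card_no, merchant_type, time), the `keep` screening predicate, the grouping rule (same card number, same calendar day), and the use of the hour of day as the event occurrence time are all assumptions chosen to match the card-transaction example in the text.

```python
from collections import defaultdict
from datetime import datetime

def build_transaction_set(raw_records, keep=lambda r: True):
    """Turn raw records into {(card_no, day): [event elements ordered by hour]}."""
    # Sub-steps (1)-(2): screen, then keep only subject, type, and time.
    basic = [
        (r["card_no"], r["merchant_type"], datetime.fromisoformat(r["time"]))
        for r in raw_records if keep(r)
    ]
    groups = defaultdict(set)
    for card_no, merchant_type, ts in basic:
        # Sub-step (3): group by (card number, calendar day).
        # Sub-step (4): the set merges records with the same type and hour.
        # Sub-step (5): encode each record as "event type$event occurrence time".
        groups[(card_no, ts.date())].add((ts.hour, f"{merchant_type}${ts.hour}"))
    # Sort by hour so that each transaction is time-ordered.
    return {key: [elem for _, elem in sorted(items)] for key, items in groups.items()}

# Hypothetical raw records; card numbers and amounts are made up.
raw_records = [
    {"card_no": "card_A", "merchant_type": "mcc_1", "time": "2014-11-21T15:05:00", "amount": 120.0},
    {"card_no": "card_A", "merchant_type": "mcc_1", "time": "2014-11-21T15:40:00", "amount": 35.0},
    {"card_no": "card_A", "merchant_type": "mcc_2", "time": "2014-11-21T16:10:00", "amount": 80.0},
    {"card_no": "card_B", "merchant_type": "mcc_3", "time": "2014-11-21T09:30:00", "amount": 15.0},
]
transactions = build_transaction_set(raw_records)
for key, elements in transactions.items():
    print(key, elements)
```

In this sketch the two mcc_1 purchases made by card_A during the 15:00 hour collapse into the single event element mcc_1$15, which corresponds to the cleaning in sub-step (4).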
Preferably, in the data mining method based on multi-dimensional analysis disclosed in the present invention, step (A2) further comprises:
(1) traversing the event-record transaction set, computing the total frequency with which each event element occurs, and sorting all event elements in descending order of frequency to obtain an event element frequency list;
(2) for each transaction in the event-record transaction set, reordering the event elements in the transaction according to their order in the event element frequency list;
(3) creating the root node of the frequent pattern tree (labelled "null"), traversing the event-record transaction set again, and inserting the event elements of each transaction processed in step (2) into the created frequent pattern tree as frequent items.
The frequent pattern tree consists of a root node (whose value is null), a set of item prefix subtrees (as children of the root), and a frequent-item header table. Each node in an item prefix subtree has three fields: item_name, count, and node_link. The item_name field records the identifier of the item represented by the node; the count field records the number of transactions represented by the sub-path reaching the node; and the node_link field points to the next node in the tree with the same identifier, or is "null" if there is no such node. Each entry in the frequent-item header table contains a frequent-item identifier field item_name and a head pointer headofnode_link that points to the first node in the tree carrying that identifier.
For an item α contained in some node of the frequent pattern tree, there is a path from the root node to α. The part of this path that excludes the node holding α is called a prefix sub-path of α, and α is called the suffix of the path. A frequent pattern tree may contain several nodes holding α; they are chained together, starting from the entry for α in the frequent-item header table, through headofnode_link in the header table and node_link in the item prefix subtrees. Each node holding α yields a different prefix sub-path of α. All these paths together form the conditional pattern base of α, and the frequent pattern tree built from the conditional pattern base of α is called the conditional pattern tree of α.
In the data mining method based on multi-dimensional analysis disclosed in the present invention, the basic construction process of the frequent pattern tree is as follows. The root node T of the frequent pattern tree is created and labelled "null". The event elements of each transaction processed in step (2) are taken as a sorted frequent-item list [p|P], where p is the first frequent item and P is the list of remaining frequent items. The function insert_tree([p|P], T) is then called, which proceeds as follows: if T has a child N such that N.item_name = p.item_name, the count of N is incremented by 1; otherwise a new node N is created with its count set to 1, linked to its parent node T, and linked via node_link to the nodes with the same item_name. If P is not empty, insert_tree(P, N) is called recursively. This completes the construction of the frequent pattern tree. The frequent pattern tree disclosed in the present invention stores all the information needed for mining frequent itemsets.
The memory occupied by the frequent pattern tree is proportional to the depth and width of the tree, where the depth of the tree is the maximum number of items contained in any single transaction and the width of the tree is the average number of items per level.
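The node layout, the insert_tree procedure, and the conditional pattern base described above can be sketched as follows. The field names item_name, count, and node_link and the function name insert_tree come from the text; the Node class, the explicit parent and children attributes, the plain dict standing in for the frequent-item header table, and the helper names build_fp_tree and conditional_pattern_base are assumptions made for this illustration.

```python
class Node:
    def __init__(self, item_name, parent):
        self.item_name = item_name   # identifier of the item; None for the root
        self.count = 0               # transactions whose sub-path reaches this node
        self.parent = parent         # link to the parent node
        self.children = {}           # item_name -> child Node
        self.node_link = None        # next node in the tree with the same item_name

def insert_tree(items, node, header):
    """Insert the frequency-ordered list [p|P] below `node`."""
    if not items:
        return
    p, rest = items[0], items[1:]
    child = node.children.get(p)
    if child is None:
        child = Node(p, node)
        node.children[p] = child
        # Append the new node to the node_link chain that starts in the header table.
        if p not in header:
            header[p] = child
        else:
            last = header[p]
            while last.node_link is not None:
                last = last.node_link
            last.node_link = child
    child.count += 1
    insert_tree(rest, child, header)

def build_fp_tree(ordered_transactions):
    root, header = Node(None, None), {}
    for tx in ordered_transactions:
        insert_tree(tx, root, header)
    return root, header

def conditional_pattern_base(item, header):
    """Collect (prefix sub-path, count) for every node that holds `item`."""
    base = []
    node = header.get(item)
    while node is not None:
        path, ancestor = [], node.parent
        while ancestor is not None and ancestor.item_name is not None:
            path.append(ancestor.item_name)
            ancestor = ancestor.parent
        if path:
            base.append((path[::-1], node.count))
        node = node.node_link
    return base

# Hypothetical transactions, already reordered by descending frequency.
A, B, C = "mcc_1$15", "mcc_2$16", "mcc_3$18"
root, header = build_fp_tree([[A, B, C], [A, B], [A, C], [B, C]])
base_c = conditional_pattern_base(C, header)
print(root.children[A].count)
print(base_c)
```

Here the three nodes holding mcc_3$18 sit on three different branches, so the conditional pattern base of that element contains three prefix sub-paths, each with count 1.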
Preferably, in the data mining method based on multi-dimensional analysis disclosed in the present invention, step (A2) further comprises: after all event elements have been sorted in descending order of frequency to obtain the event element frequency list, removing the event elements whose support is less than a predetermined threshold (for example, 0.01), wherein the support of an event element is calculated by the following formula: event element support = frequency / total number of transactions.
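The support filter can be written in a few lines. The following is only a sketch: the function name, the sample frequencies, and the total of 100000 transactions are hypothetical, while the 0.01 threshold is the example value given in the text.

```python
def prune_by_support(frequency_list, total_transactions, min_support=0.01):
    """Keep event elements whose support (frequency / total transactions) reaches the threshold."""
    return [
        (item, freq) for item, freq in frequency_list
        if freq / total_transactions >= min_support
    ]

# Hypothetical frequency list, sorted by descending frequency.
frequency_list = [("mcc_1$15", 10000), ("mcc_2$16", 4000), ("mcc_9$3", 50)]
kept = prune_by_support(frequency_list, total_transactions=100000)
print(kept)
```

With these numbers the supports are 0.1, 0.04, and 0.0005, so only the last element falls below the threshold and is removed.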
Preferably, in the data mining method based on multi-dimensional analysis disclosed in the present invention, step (A3) further comprises screening the frequent items associated with a predetermined target event element as follows:
(1) finding all nodes of the target event element in the frequent pattern tree and traversing upward through their ancestor nodes to obtain all paths, thereby obtaining the conditional pattern base of the target event element in the frequent pattern tree;
(2) using the conditional pattern base as the original transaction set to build the conditional pattern tree of the target event element, thereby obtaining all frequent itemsets of the target event element;
(3) screening out, from the obtained frequent itemsets, the frequent items associated with the target event element according to a predetermined minimum confidence.
For example, suppose that the event element (mcc_1$15) occurs 10000 times in the transaction set and that a frequent itemset (mcc_1$15, mcc_2$16) found from the frequent pattern tree has a frequency of 1250. This indicates that a cardholder who spends at an mcc_1-type merchant at 15:00 may spend again at an mcc_2-type merchant within the following hour, with a probability of 1250/10000 = 0.125; this value is the confidence of the frequent item. The frequent items with a high probability of occurrence can thus be screened out by adjusting the minimum confidence.
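The confidence screening in sub-step (3) can be sketched as below. The frequencies 10000 and 1250 are the ones used in the example above; the function name, the second itemset, and the minimum confidence of 0.1 are hypothetical, and the itemset frequencies are assumed to have been obtained already from the conditional pattern tree.

```python
def filter_by_confidence(target, target_frequency, itemset_frequencies, min_confidence):
    """Return (itemset, confidence) pairs for itemsets that contain `target`,
    where confidence = itemset frequency / target frequency."""
    results = []
    for itemset, freq in itemset_frequencies.items():
        if target not in itemset or len(itemset) < 2:
            continue
        confidence = freq / target_frequency
        if confidence >= min_confidence:
            results.append((itemset, confidence))
    # Highest confidence first.
    return sorted(results, key=lambda r: -r[1])

itemset_frequencies = {
    ("mcc_1$15", "mcc_2$16"): 1250,   # from the example in the text
    ("mcc_1$15", "mcc_3$18"): 300,    # hypothetical
}
associated = filter_by_confidence("mcc_1$15", 10000, itemset_frequencies, min_confidence=0.1)
print(associated)
```

Raising or lowering min_confidence is the adjustment the text refers to: a higher value keeps only the follow-up events that are most likely to occur after the target event.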
Therefore, the data mining method based on multi-dimensional analysis disclosed in the present invention has the following advantages: (1) it can adapt to parallel computation in a distributed environment; (2) it can process massive data and thus cover all samples, yielding more accurate analysis results; (3) it enables multi-dimensional data association analysis.
Although the present invention has been described by way of the above preferred embodiment, its implementation is not limited to that embodiment. It should be appreciated that those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope.

Claims (5)

1. A data mining method based on multi-dimensional analysis, comprising the following steps:
(A1) extracting original event records from a database, and screening and converting the extracted original event records to form a time-series-based event-record transaction set, wherein the event-record transaction set is composed of multiple transactions and each transaction is composed of a number of event elements;
(A2) generating a frequent pattern tree (FP-tree) based on the event-record transaction set;
(A3) screening out, according to the generated frequent pattern tree, the frequent items associated with a predetermined target event element.
2. The data mining method based on multi-dimensional analysis according to claim 1, characterized in that step (A1) further comprises:
(1) extracting original event records from the database according to set screening conditions;
(2) for each extracted original event record, selecting a number of predetermined fields and forming a basic event record in a predetermined format, wherein each basic event record represents one actual event and comprises at least an event subject field, an event type field, and an event occurrence time field;
(3) grouping all basic event records according to a predetermined rule, using at least the values of the event subject field and the event occurrence time field as the key;
(4) cleaning the records in each group by merging basic event records within a group that have identical values of the event type field and the event occurrence time field into a single basic event record;
(5) taking each basic event record in a group as an event element expressed in the form "event type$event occurrence time", and merging all event elements in the same group to form one event-record transaction, expressed as <event type 1$event occurrence time 1, event type 2$event occurrence time 2, ..., event type i$event occurrence time i, ...>, wherein "event type i$event occurrence time i" denotes the i-th event element in the transaction; the event-record transactions corresponding to all groups thus form the event-record transaction set.
3. The data mining method based on multi-dimensional analysis according to claim 2, characterized in that step (A2) further comprises:
(1) traversing the event-record transaction set, computing the total frequency with which each event element occurs, and sorting all event elements in descending order of frequency to obtain an event element frequency list;
(2) for each transaction in the event-record transaction set, reordering the event elements in the transaction according to their order in the event element frequency list;
(3) creating the root node of the frequent pattern tree, traversing the event-record transaction set again, and inserting the event elements of each transaction processed in step (2) into the created frequent pattern tree as frequent items.
4. The data mining method based on multi-dimensional analysis according to claim 3, characterized in that step (A2) further comprises: after all event elements have been sorted in descending order of frequency to obtain the event element frequency list, removing the event elements whose support is less than a predetermined threshold, wherein the support of an event element is calculated by the following formula: event element support = frequency / total number of transactions.
5. The data mining method based on multi-dimensional analysis according to claim 4, characterized in that step (A3) further comprises screening the frequent items associated with a predetermined target event element as follows:
(1) finding all nodes of the target event element in the frequent pattern tree and traversing upward through their ancestor nodes to obtain all paths, thereby obtaining the conditional pattern base of the target event element in the frequent pattern tree;
(2) using the conditional pattern base as the original transaction set to build the conditional pattern tree of the target event element, thereby obtaining all frequent itemsets of the target event element;
(3) screening out, from the obtained frequent itemsets, the frequent items associated with the target event element according to a predetermined minimum confidence.
CN201410671003.8A 2014-11-21 2014-11-21 Data mining method based on multi-dimensional analysis Pending CN105589900A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410671003.8A CN105589900A (en) 2014-11-21 2014-11-21 Data mining method based on multi-dimensional analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410671003.8A CN105589900A (en) 2014-11-21 2014-11-21 Data mining method based on multi-dimensional analysis

Publications (1)

Publication Number Publication Date
CN105589900A true CN105589900A (en) 2016-05-18

Family

ID=55929482

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410671003.8A Pending CN105589900A (en) 2014-11-21 2014-11-21 Data mining method based on multi-dimensional analysis

Country Status (1)

Country Link
CN (1) CN105589900A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106250549A (en) * 2016-08-14 2016-12-21 重庆大学 A kind of Frequent Pattern Mining method based on internal memory
CN107798021A (en) * 2016-09-07 2018-03-13 北京京东尚科信息技术有限公司 Data correlation processing method, system and electronic equipment
CN109359176A (en) * 2018-09-10 2019-02-19 平安科技(深圳)有限公司 Data extraction method, device, computer equipment and storage medium
CN112667827A (en) * 2020-12-23 2021-04-16 北京奇艺世纪科技有限公司 Data anomaly analysis method and device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001167098A (en) * 1999-12-07 2001-06-22 Hitachi Ltd Distributed parallel analyzing method for mass data
CN102142992A (en) * 2011-01-11 2011-08-03 浪潮通信信息系统有限公司 Communication alarm frequent itemset mining engine and redundancy processing method


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李彤岩: ""基于数据挖掘的通信网告警相关性分析研究"", 《中国博士学位论文全文数据库 信息科技辑》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106250549A (en) * 2016-08-14 2016-12-21 重庆大学 A kind of Frequent Pattern Mining method based on internal memory
CN106250549B (en) * 2016-08-14 2019-09-20 重庆大学 A kind of Frequent Pattern Mining method memory-based
CN107798021A (en) * 2016-09-07 2018-03-13 北京京东尚科信息技术有限公司 Data correlation processing method, system and electronic equipment
CN107798021B (en) * 2016-09-07 2021-04-30 北京京东尚科信息技术有限公司 Data association processing method and system and electronic equipment
CN109359176A (en) * 2018-09-10 2019-02-19 平安科技(深圳)有限公司 Data extraction method, device, computer equipment and storage medium
CN112667827A (en) * 2020-12-23 2021-04-16 北京奇艺世纪科技有限公司 Data anomaly analysis method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN104200369B (en) Method and device for determining commodity distribution range
US20110161132A1 (en) Method and system for extracting process sequences
CN103761236A (en) Incremental frequent pattern increase data mining method
CN111159428A (en) Method and device for automatically extracting event relation of knowledge graph in economic field
JP2019502979A (en) Automatic interpretation of structured multi-field file layouts
CN105589900A (en) Data mining method based on multi-dimensional analysis
CN104636337B (en) A kind of data cleansing storage method for value-added tax
CN116415206B (en) Operator multiple data fusion method, system, electronic equipment and computer storage medium
CN104462421A (en) Multi-tenant expanding method based on Key-Value database
CN109669975B (en) Industrial big data processing system and method
CN105095436A (en) Automatic modeling method for data of data sources
JP6242540B1 (en) Data conversion system and data conversion method
CN103440265A (en) MapReduce-based CDC (Change Data Capture) method of MYSQL database
CN107103035A This earth's surface data-updating method and device
JP5169560B2 (en) Business flow processing program, method and apparatus
CN103984723A (en) Method used for updating data mining for frequent item by incremental data
CN104391891A (en) Heterogeneous replication method for database
CN107291938A (en) Order Query System and method
CN111881126A (en) Big data management system
CN109063063B (en) Data processing method and device based on multi-source data
CN105653567A (en) Method for quickly looking for feature character strings in text sequential data
CN111522819A (en) Method and system for summarizing tree-structured data
CN110378569A (en) Industrial relations chain building method, apparatus, equipment and storage medium
CN114090590B (en) Multi-object label data extraction method and system
CN106776607A (en) Search engine operation behavior treating method and apparatus

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160518
