CN104765851A - Big data analysis and extraction method - Google Patents

Big data analysis and extraction method

Info

Publication number
CN104765851A
Authority
CN
China
Prior art keywords
index
node
attribute
layer
rule
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510189396.3A
Other languages
Chinese (zh)
Inventor
杨立波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Bo Yuan Epoch Softcom Ltd
Original Assignee
Chengdu Bo Yuan Epoch Softcom Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Bo Yuan Epoch Softcom Ltd
Priority to CN201510189396.3A
Publication of CN104765851A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a big data analysis and extraction method. According to the characteristics of the data types in a data set, the attributes of the data set are divided into discrete (non-continuous) and continuous types, a two-level index is built on the basis of multi-dimensional indexing, and the data set is mined and matched in real time through this hierarchical index. With the hierarchical index and its data matching method, performance is significantly improved without reducing mining accuracy.

Description

A big data analysis and extraction method
Technical field
The present invention relates to big data analysis, and in particular to a big data analysis and extraction method.
Background
Online mining of the operational data of large enterprises by means of big data processing has broad application prospects. In a big data environment, a data set contains data in different media formats. Building an index over the mining rules speeds up rule evaluation and greatly improves the efficiency of online mining. A data set contains meta-information for different kinds of attributes, such as text, pictures, audio and video, and these attributes differ considerably from one another. In practice, however, the mining rule set is large and involves many dimensions, so mining over a large-scale data set is computationally expensive and inefficient. Moreover, most of the prior art optimizes the index for a data set of a single type, with a comparatively small rule set, and does not fully exploit the relationships among the different attributes of a multi-type data set. Such approaches are therefore difficult to apply directly to mining over multiple data types, which directly degrades mining performance.
Summary of the invention
To solve the above problems of the prior art, the present invention proposes a big data analysis and extraction method, comprising:
According to the characteristics of the data types in the data set, dividing the attributes of the data set into discrete and continuous types, building a two-level index on the basis of multi-dimensional indexing, and using this hierarchical index to mine and match the data set in real time.
Preferably, the hierarchical index comprises a binary search tree index layer, a multidimensional vector index layer, and an association table of mining rules and operators, the association table being used to combine the mining rule results of the two index layers. A binary search tree is generated from the operators on discrete attributes as the bottom-layer index. On top of the bottom layer, all continuous attributes are mapped to a multidimensional space, a multi-dimensional index is generated from the operators on continuous attributes, and operators on the same attribute are partitioned by dimension to generate the index. The operations on the hierarchical index comprise search, insertion and deletion;
The nodes of the hierarchical index fall into three classes: first-layer nodes (top), second-layer intermediate nodes (mid), and leaf nodes (leaf). A first-layer node top contains the following elements: attr, the discrete attribute corresponding to the first-layer binary search tree node; value, the discrete value corresponding to the node; weight, the priority of the operator the node represents; and left and right, the left and right child nodes of the node. An intermediate node contains the element branch, which represents the intermediate-node pointers of the multidimensional space tree structure corresponding to the second-layer index. A leaf node contains the element mbr, the multidimensional vector corresponding to the second-layer leaf node.
Preferably, using the hierarchical index to mine and match the data set in real time further comprises:
For each data tuple, exploiting the characteristics of rule sets in data mining: first, the first-layer binary search tree index is scanned to compute the set of discrete operators that are satisfied; then, according to the valid nodes hit in the first-layer index, the operator association table is searched to find the node pointers of the second-layer index, and the search continues in the second-layer multi-dimensional index; finally, the hits from the first and second layers are combined, and the finally hit rule set is computed by fast combination;
Wherein, for any data set tuple, the discrete attribute values are first scanned in the multi-attribute binary search tree index; the mining rule processing of the second-layer index is carried out only when a discrete operator is hit, and otherwise the method returns directly. For each multi-dimensional index reached from the first layer, the attribute-value key-value pairs are matched against the mining rules in turn and the results are stored in a cache; a rule aggregation algorithm is then run over the cached results to obtain the final set of hit mining rules.
Compared with the prior art, the present invention has the following advantages:
The present invention proposes a data analysis and extraction method which, by means of the hierarchical index it builds and the corresponding data matching method, significantly improves performance without reducing mining accuracy.
Brief description of the drawings
Fig. 1 is a flowchart of the big data analysis and extraction method according to an embodiment of the present invention.
Detailed description
A detailed description of one or more embodiments of the present invention is provided below together with the accompanying drawing that illustrates the principles of the invention. The invention is described in connection with such embodiments, but it is not limited to any embodiment. The scope of the invention is defined only by the claims, and the invention encompasses many alternatives, modifications and equivalents. Many specific details are set forth in the following description to provide a thorough understanding of the invention. These details are provided for the purpose of example, and the invention may be practised according to the claims without some or all of them.
One aspect of the present invention provides a big data analysis and extraction method. Fig. 1 is a flowchart of the big data analysis and extraction method according to an embodiment of the present invention.
On the basis of multi-dimensional indexing, the present invention introduces the idea of a hierarchical index. Considering the characteristics of the different types of data sets, the attributes of a data set are divided into discrete attributes and continuous attributes. Because operators on discrete attributes are highly shared among rules, a two-layer hierarchical index is proposed, together with its index generation method and its mining and matching method.
The present invention proposes a dynamic index for evaluating a large number of mining rules over data sets of different types, and it supports real-time updates of the mining rules. The main process comprises index generation and real-time matching. During index generation, the attribute set of the data sets is first classified into continuous attributes and discrete attributes. Then, according to attribute type, the input mining rule set is divided into different operator sets, and the hierarchical index is generated from them: a binary search tree is generated from the operators on discrete attributes as the first-layer index; in the second layer, all continuous attributes are mapped to a multidimensional space, and a multi-dimensional index is generated from the operators on continuous attributes. Because the operators on discrete attributes all take discrete values, the first-layer index can locate the relevant mining rules quickly, and its space overhead is small. In the second layer, operators on the same attribute are partitioned by dimension to generate the index, so as to speed up mining rule processing as much as possible. When a data set tuple t arrives in real time, vector extraction and pruning are first applied to t; the quantized vector is then classified by operator attribute to obtain the different operator vectors; finally, the hierarchical index is used to filter, through its two index levels, the set of mining rules whose conditions are satisfied.
Structurally, the hierarchical index of the present invention comprises three main components and three main operations. The three components are: (1) the binary search tree index of the first layer; (2) the multidimensional vector index of the second layer; (3) the association table of mining rules and operators. The three main operations on the hierarchical index are: (1) search; (2) insertion; (3) deletion.
Overall, the hierarchical index is a two-layer index. The first layer is a binary search tree index generated from the discrete operators. The second layer is a multidimensional space tree generated from the multidimensional vectors corresponding to the continuous operators. The other important component is the association table of mining rules and operators, which is used to combine the mining rule results of the two index layers quickly.
The nodes of the hierarchical index fall into three classes: first-layer nodes (top), second-layer intermediate nodes (mid), and leaf nodes (leaf).
A first-layer node contains the following elements: attr is the discrete attribute corresponding to the first-layer binary search tree node, value is the discrete value corresponding to the node, weight is the priority of the operator the node represents, and left and right are the left and right child nodes of the node. An intermediate node contains branch, which represents the intermediate-node pointers of the multidimensional space tree structure corresponding to the second-layer index. A leaf node contains mbr, the multidimensional vector corresponding to the second-layer leaf node.
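The three node classes can be pictured as plain records. The following Python sketch is only an illustration: the field names attr, value, weight, left, right, branch and mbr come from the description, while the class names, the field types and the encoding of mbr as one (low, high) pair per dimension are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple, Union


@dataclass
class TopNode:
    """First-layer binary search tree node."""
    attr: str                          # discrete attribute indexed by this node
    value: str                         # discrete value held by this node
    weight: float                      # priority of the operator the node represents
    left: Optional["TopNode"] = None   # left child
    right: Optional["TopNode"] = None  # right child


@dataclass
class LeafNode:
    """Second-layer leaf node."""
    # mbr: the multidimensional vector of the leaf, stored here as one
    # (low, high) pair per dimension (an assumed encoding).
    mbr: List[Tuple[float, float]]


@dataclass
class MidNode:
    """Second-layer intermediate node."""
    # branch: pointers to the child nodes of the multidimensional tree.
    branch: List[Union["MidNode", LeafNode]] = field(default_factory=list)


# Hypothetical example values, for illustration only.
top = TopNode(attr="media_type", value="video", weight=0.7)
mid = MidNode(branch=[LeafNode(mbr=[(0.0, 10.0), (5.0, 8.0)])])
```

Keeping the first-layer and second-layer node types separate mirrors the split between the discrete and the continuous part of the index.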
Based on this hierarchical index, a feasible index generation method is proposed. It has three steps: rule preprocessing, operator set division, and hierarchical index generation. The preprocessed rule set is first divided according to attribute classification, and the index is then generated layer by layer over the divided data.
When the rule set is divided, the n preprocessed rules are divided, according to attribute category and value domain, into a discrete operator set A and a continuous operator set B, where:
$$\|A\| + \|B\| = \sum_{q \in Q} p(q)$$
Here Q is the set of registered mining rules, q is a single mining rule, and p is an operator. Suppose the operator set of the first dimension is divided into s intervals $I_1, I_2, \ldots, I_s$, each of which contains only discrete operators or only continuous operators. Thanks to the dimension transformation performed during preprocessing, the operators within each of the intervals $I_1, I_2, \ldots, I_s$ have similar attributes, which facilitates layer-by-layer index generation. In addition, rule sets over data sets of different types usually carry priority information: the higher the priority of a rule, the more urgently it needs to be evaluated. When a new mining rule is registered in the system, the preprocessing module first divides it, according to attribute type, into discrete operators $p_d$ and continuous operators $p_c$. Then $p_d$ is inserted into the first-layer index of the hierarchical index, that is, into the binary search tree index corresponding to the discrete attributes. Finally, $p_c$ is inserted into the second-layer index of the hierarchical index.
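The division of registered rules into a discrete operator set A and a continuous operator set B can be sketched as follows. This is a minimal illustration under stated assumptions: the attribute catalogue, the attribute names and the representation of an operator as an (attribute, comparison, operand) triple are all hypothetical and do not come from the description.

```python
from typing import Dict, List, Tuple

# Hypothetical attribute catalogue: attribute name -> "discrete" or "continuous".
ATTRIBUTE_KIND: Dict[str, str] = {
    "media_type": "discrete",
    "codec": "discrete",
    "duration": "continuous",
    "bitrate": "continuous",
}

# An operator is modelled as (attribute, comparison, operand); assumed encoding.
Operator = Tuple[str, str, object]
Rule = List[Operator]


def split_rule(rule: Rule) -> Tuple[List[Operator], List[Operator]]:
    """Divide one mining rule q into discrete operators p_d and continuous operators p_c."""
    p_d = [op for op in rule if ATTRIBUTE_KIND[op[0]] == "discrete"]
    p_c = [op for op in rule if ATTRIBUTE_KIND[op[0]] == "continuous"]
    return p_d, p_c


def divide_rule_set(rules: List[Rule]) -> Tuple[List[Operator], List[Operator]]:
    """Collect the discrete operator set A and the continuous operator set B."""
    A: List[Operator] = []
    B: List[Operator] = []
    for rule in rules:
        p_d, p_c = split_rule(rule)
        A.extend(p_d)
        B.extend(p_c)
    return A, B


# Two hypothetical registered rules.
Q: List[Rule] = [
    [("media_type", "==", "video"), ("duration", "<=", 60.0)],
    [("codec", "==", "h264"), ("bitrate", ">=", 500.0), ("duration", ">=", 5.0)],
]
A, B = divide_rule_set(Q)
```

Every operator lands in exactly one of the two sets, so the sizes of A and B add up to the total number of operators across the registered rules.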
When $p_d$ is inserted into the first-layer binary search tree, it is first inserted in the standard way for a sorted binary tree. This may violate the heap property that the tree maintains on node priorities, so the node is rotated bottom-up until the heap property holds again. Deletion works the other way round: the priority of the node is first set to the minimum, the node is rotated top-down until it becomes a leaf, and it is then removed.
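Sorted insertion followed by priority-restoring rotations resembles the structure commonly called a treap. The sketch below illustrates that reading and is not the patent's own implementation: the choice of a max-heap on weight (higher priority closer to the root), the class and function names, and the sample keys are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    key: str       # discrete value, ordered as in a sorted binary tree
    weight: float  # priority of the operator
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def rotate_right(n: Node) -> Node:
    """Lift the left child of n above n."""
    l = n.left
    n.left, l.right = l.right, n
    return l


def rotate_left(n: Node) -> Node:
    """Lift the right child of n above n."""
    r = n.right
    n.right, r.left = r.left, n
    return r


def insert(root: Optional[Node], key: str, weight: float) -> Node:
    """Sorted insert, then rotate upwards while a child outranks its parent."""
    if root is None:
        return Node(key, weight)
    if key < root.key:
        root.left = insert(root.left, key, weight)
        if root.left.weight > root.weight:
            root = rotate_right(root)
    elif key > root.key:
        root.right = insert(root.right, key, weight)
        if root.right.weight > root.weight:
            root = rotate_left(root)
    return root  # an existing key is left unchanged


def delete(root: Optional[Node], key: str) -> Optional[Node]:
    """Treat the node as lowest priority, rotate it down to a leaf, then drop it."""
    if root is None:
        return None
    if key < root.key:
        root.left = delete(root.left, key)
    elif key > root.key:
        root.right = delete(root.right, key)
    else:
        if root.left is None and root.right is None:
            return None
        lw = root.left.weight if root.left else float("-inf")
        rw = root.right.weight if root.right else float("-inf")
        if lw > rw:  # lift the higher-priority child, sink the doomed node
            root = rotate_right(root)
            root.right = delete(root.right, key)
        else:
            root = rotate_left(root)
            root.left = delete(root.left, key)
    return root


def inorder(n: Optional[Node]) -> list:
    return [] if n is None else inorder(n.left) + [n.key] + inorder(n.right)


# Hypothetical discrete values with priorities.
root = None
for k, w in [("jpeg", 0.2), ("mp4", 0.9), ("avi", 0.5), ("wav", 0.4)]:
    root = insert(root, k, w)
```

Treating a missing child as having priority minus infinity lets the same deletion code handle nodes with a single child.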
Insertion into the second layer of the hierarchical index proceeds roughly as follows. The target leaf node is located first; locating it is itself a recursive process. Inserting $p_c$ starts at the root of the second layer and searches breadth-first, following the containment relations of the multidimensional space. Once a leaf node n is found, the number of branches of n is checked. If it exceeds the branch threshold M, the node is split: a new node is created, the entries already in n together with the vector of $p_c$ are distributed evenly between the two nodes using a heuristic strategy, and the parent information is then updated level by level. If the number of branches of n does not exceed M, the insertion is completed simply by updating the parent node.
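The locate, insert, split and update-parent sequence can be sketched with a small tree of bounding boxes. This is a simplified illustration, not the patent's algorithm: it descends recursively into the child that needs the least enlargement (zero enlargement corresponds to containment) instead of searching breadth-first, it uses a placeholder split that sorts along the first dimension and halves the entries, and the threshold value M = 4 and all names are assumptions.

```python
from typing import List, Optional, Tuple

Box = Tuple[Tuple[float, float], ...]  # one (low, high) pair per dimension
M = 4                                  # branch threshold (assumed value)


def union(a: Box, b: Box) -> Box:
    """Smallest box covering both a and b."""
    return tuple((min(x[0], y[0]), max(x[1], y[1])) for x, y in zip(a, b))


def volume(b: Box) -> float:
    v = 1.0
    for lo, hi in b:
        v *= hi - lo
    return v


class Node:
    def __init__(self, leaf: bool):
        self.leaf = leaf
        self.items: list = []            # boxes in a leaf, child nodes otherwise
        self.mbr: Optional[Box] = None   # box covering everything below this node

    def refresh(self) -> None:
        boxes = self.items if self.leaf else [c.mbr for c in self.items]
        self.mbr = boxes[0]
        for b in boxes[1:]:
            self.mbr = union(self.mbr, b)


def split(node: Node) -> Node:
    """Placeholder split: order entries by their low edge in dimension 0 and halve."""
    key = (lambda it: it[0][0]) if node.leaf else (lambda it: it.mbr[0][0])
    node.items.sort(key=key)
    half = len(node.items) // 2
    sibling = Node(node.leaf)
    sibling.items = node.items[half:]
    node.items = node.items[:half]
    node.refresh()
    sibling.refresh()
    return sibling


def _insert(node: Node, box: Box) -> Optional[Node]:
    if node.leaf:
        node.items.append(box)
    else:
        # Descend into the child whose box grows least when it absorbs the new box.
        child = min(node.items,
                    key=lambda c: volume(union(c.mbr, box)) - volume(c.mbr))
        sibling = _insert(child, box)
        if sibling is not None:
            node.items.append(sibling)
    node.refresh()  # update parent information on the way back up
    return split(node) if len(node.items) > M else None


def insert(root: Node, box: Box) -> Node:
    sibling = _insert(root, box)
    if sibling is None:
        return root
    new_root = Node(leaf=False)  # the root itself was split: grow the tree by one level
    new_root.items = [root, sibling]
    new_root.refresh()
    return new_root


# Insert ten hypothetical two-dimensional boxes.
root = Node(leaf=True)
for i in range(10):
    root = insert(root, ((float(i), i + 1.0), (0.0, 1.0)))
```

Returning the new sibling from the recursive call is one simple way to propagate a split to the parent, which then checks its own branch count against M.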
Generating the first-layer index of the hierarchical index as a binary search tree exploits the characteristics of discrete operators and speeds up rule evaluation for discrete operators. If the priority of an empty tree is taken to be infinite, the insertion and deletion methods above also handle correctly the case of a node with only one child. The expected time complexity of insertion and deletion in the first-layer index is O(log n).
The most important issue in insertion is the node splitting strategy of the second-layer index, for which the present invention adopts a heuristic strategy. First, all blocks to be split are taken out. Then the two blocks with the smallest overlap area and the largest area covering both blocks are selected. Finally, the remaining blocks are assigned in turn to the different nodes according to the difference in overlap area.
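One possible reading of this splitting heuristic is sketched below. The description does not say how the two seed criteria are combined or how ties are resolved, so the sketch makes explicit assumptions: the seed pair maximizes (covering area minus overlap area), each remaining block joins the group whose bounding box it overlaps more, and ties go to the group that needs less enlargement. All function names and sample blocks are hypothetical.

```python
from itertools import combinations
from typing import List, Tuple

Box = Tuple[Tuple[float, float], ...]  # one (low, high) pair per dimension


def volume(b: Box) -> float:
    v = 1.0
    for lo, hi in b:
        v *= hi - lo
    return v


def union(a: Box, b: Box) -> Box:
    """Smallest box covering both a and b."""
    return tuple((min(x[0], y[0]), max(x[1], y[1])) for x, y in zip(a, b))


def overlap(a: Box, b: Box) -> float:
    """Area (volume) shared by a and b, or 0.0 if they are disjoint."""
    v = 1.0
    for (alo, ahi), (blo, bhi) in zip(a, b):
        side = min(ahi, bhi) - max(alo, blo)
        if side <= 0:
            return 0.0
        v *= side
    return v


def split_blocks(blocks: List[Box]) -> Tuple[List[Box], List[Box]]:
    # Seed selection: small mutual overlap and large joint cover, folded
    # into a single score (an assumption of this sketch).
    s1, s2 = max(
        combinations(range(len(blocks)), 2),
        key=lambda p: volume(union(blocks[p[0]], blocks[p[1]]))
        - overlap(blocks[p[0]], blocks[p[1]]),
    )
    g1, g2 = [blocks[s1]], [blocks[s2]]
    m1, m2 = blocks[s1], blocks[s2]  # running bounding boxes of the two groups
    for i, b in enumerate(blocks):
        if i in (s1, s2):
            continue
        # Positive d: b overlaps group 1 more than group 2.
        d = overlap(m1, b) - overlap(m2, b)
        if d == 0:
            # Tie-break: prefer the group that grows less.
            d = (volume(union(m2, b)) - volume(m2)) - (volume(union(m1, b)) - volume(m1))
        if d >= 0:
            g1.append(b)
            m1 = union(m1, b)
        else:
            g2.append(b)
            m2 = union(m2, b)
    return g1, g2


# Two hypothetical clusters of two-dimensional blocks.
a = ((0.0, 1.0), (0.0, 1.0))
b = ((0.5, 1.5), (0.0, 1.0))
c = ((10.0, 11.0), (10.0, 11.0))
d = ((10.5, 11.5), (10.0, 11.0))
g1, g2 = split_blocks([a, b, c, d])
```

In this example the far-apart blocks a and d become the seeds, and b and c then join the group they overlap.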
This embodiment describes how a data set tuple is mined through the index structure. For a tuple t, the hierarchical index fully exploits the characteristics of rule sets in data mining. The first-layer binary search tree index is scanned first to compute the set of discrete operators that are satisfied. Because discrete operators are highly shared among rules, this greatly speeds up the scan and hence the whole mining rule process. Then, according to the valid nodes hit in the first-layer index, the operator association table is searched to find the node pointers of the second-layer index, and the search continues in the second-layer multi-dimensional index. Finally, the hits from the first and second layers are combined, and the finally hit rule set is computed by fast combination.
For any data set tuple t, the discrete attribute values are first scanned in the multi-attribute binary search tree index. A hit means that a discrete operator has been matched, and only in that case is the mining rule processing of the second-layer index carried out; otherwise the procedure returns directly. Using the first layer as a trigger index in this way greatly speeds up mining rule processing for data set tuples. If a discrete operator is hit, the mining rule procedure of the second-layer index follows: for each multi-dimensional index reached from the first layer, the attribute-value key-value pairs are matched against the mining rules in turn, and the results are stored in a cache. A rule aggregation algorithm is then run over the cached results to obtain the final set of hit mining rules.
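The two-stage matching flow can be sketched in a few lines. To keep the example short, a plain dictionary stands in for the first-layer binary search tree and a per-rule list of ranges stands in for the second-layer multi-dimensional index; these substitutions, the rule format, and all names and sample values are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical rules: rule id -> (discrete operators, continuous range operators).
RULES: Dict[str, Tuple[Dict[str, str], Dict[str, Tuple[float, float]]]] = {
    "r1": ({"media_type": "video"}, {"duration": (0.0, 60.0)}),
    "r2": ({"media_type": "video", "codec": "h264"}, {"bitrate": (500.0, 4000.0)}),
    "r3": ({"media_type": "audio"}, {}),
}


def build(rules):
    """Build stand-ins for the first layer and for the second layer."""
    first_layer: Dict[Tuple[str, str], Set[str]] = {}   # (attr, value) -> rule ids
    second_layer: Dict[str, List[Tuple[str, float, float]]] = {}
    for rid, (disc, cont) in rules.items():
        for attr, value in disc.items():
            first_layer.setdefault((attr, value), set()).add(rid)
        second_layer[rid] = [(a, lo, hi) for a, (lo, hi) in cont.items()]
    return first_layer, second_layer


def match(t: dict, rules, first_layer, second_layer) -> Set[str]:
    # Stage 1: scan the discrete attribute values of the tuple.
    hits: Dict[str, int] = {}  # rule id -> number of discrete operators satisfied
    for attr, value in t.items():
        for rid in first_layer.get((attr, value), ()):
            hits[rid] = hits.get(rid, 0) + 1
    if not hits:
        return set()  # no discrete operator hit: return directly

    # Association step: keep rules whose discrete operators were all hit.
    candidates = [rid for rid, n in hits.items() if n == len(rules[rid][0])]

    # Stage 2: check continuous operators and cache the per-rule outcome.
    cache: Dict[str, bool] = {}
    for rid in candidates:
        cache[rid] = all(a in t and lo <= t[a] <= hi
                         for a, lo, hi in second_layer[rid])

    # Aggregation: the final result is every rule whose cached outcome is True.
    return {rid for rid, ok in cache.items() if ok}


first_layer, second_layer = build(RULES)
t = {"media_type": "video", "codec": "h264", "duration": 30.0, "bitrate": 8000.0}
result = match(t, RULES, first_layer, second_layer)
```

The early return after stage 1 is what makes the first layer act as a trigger: tuples that hit no discrete operator never reach the more expensive continuous checks.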
In summary, the present invention proposes a data analysis and extraction method which, by means of the hierarchical index it builds and the corresponding data matching method, significantly improves performance without reducing mining accuracy.
Obviously, those skilled in the art should understand that each module or step of the present invention described above can be implemented with a general-purpose computing system. The modules or steps can be concentrated on a single computing system or distributed over a network of multiple computing systems. Optionally, they can be implemented as program code executable by a computing system, so that they can be stored in a storage system and executed by the computing system. The present invention is thus not restricted to any specific combination of hardware and software.
It should be understood that the above embodiments of the present invention are only intended to illustrate or explain the principles of the invention and do not limit it. Therefore, any modification, equivalent replacement, improvement and the like made without departing from the spirit and scope of the invention shall fall within its scope of protection. Furthermore, the appended claims are intended to cover all changes and modifications that fall within the scope and boundaries of the claims, or equivalents of such scope and boundaries.

Claims (3)

1. A big data analysis and extraction method, characterized by comprising:
According to the characteristics of the data types in the data set, dividing the attributes of the data set into discrete and continuous types, building a two-level index on the basis of multi-dimensional indexing, and using this hierarchical index to mine and match the data set in real time.
2. The method according to claim 1, characterized in that the hierarchical index comprises a binary search tree index layer, a multidimensional vector index layer, and an association table of mining rules and operators, the association table being used to combine the mining rule results of the two index layers, wherein a binary search tree is generated from the operators on discrete attributes as the bottom-layer index; on top of the bottom layer, all continuous attributes are mapped to a multidimensional space, a multi-dimensional index is generated from the operators on continuous attributes, and operators on the same attribute are partitioned by dimension to generate the index; the operations on the hierarchical index comprise search, insertion and deletion;
The nodes of the hierarchical index fall into three classes: first-layer nodes (top), second-layer intermediate nodes (mid), and leaf nodes (leaf); a first-layer node top contains the following elements: attr, the discrete attribute corresponding to the first-layer binary search tree node; value, the discrete value corresponding to the node; weight, the priority of the operator the node represents; and left and right, the left and right child nodes of the node; an intermediate node contains the element branch, which represents the intermediate-node pointers of the multidimensional space tree structure corresponding to the second-layer index; a leaf node contains the element mbr, the multidimensional vector corresponding to the second-layer leaf node.
3. The method according to claim 2, characterized in that using the hierarchical index to mine and match the data set in real time further comprises:
For each data tuple, exploiting the characteristics of rule sets in data mining: first, the first-layer binary search tree index is scanned to compute the set of discrete operators that are satisfied; then, according to the valid nodes hit in the first-layer index, the operator association table is searched to find the node pointers of the second-layer index, and the search continues in the second-layer multi-dimensional index; finally, the hits from the first and second layers are combined, and the finally hit rule set is computed by fast combination;
Wherein, for any data set tuple, the discrete attribute values are first scanned in the multi-attribute binary search tree index; the mining rule processing of the second-layer index is carried out only when a discrete operator is hit, and otherwise the method returns directly; for each multi-dimensional index reached from the first layer, the attribute-value key-value pairs are matched against the mining rules in turn and the results are stored in a cache; a rule aggregation algorithm is then run over the cached results to obtain the final set of hit mining rules.
CN201510189396.3A 2015-04-21 2015-04-21 Big data analysis and extraction method Pending CN104765851A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510189396.3A CN104765851A (en) 2015-04-21 2015-04-21 Big data analysis and extraction method

Publications (1)

Publication Number Publication Date
CN104765851A true CN104765851A (en) 2015-07-08

Family

ID=53647679

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510189396.3A Pending CN104765851A (en) 2015-04-21 2015-04-21 Big data analysis and extraction method

Country Status (1)

Country Link
CN (1) CN104765851A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105787386A (en) * 2016-03-03 2016-07-20 南京航空航天大学 Cloud database access control model based on PBAC model

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103823846A (en) * 2014-01-28 2014-05-28 浙江大学 Method for storing and querying big data on basis of graph theories
US20140237554A1 (en) * 2013-02-15 2014-08-21 Infosys Limited Unified platform for big data processing
CN104331421A (en) * 2014-10-14 2015-02-04 安徽四创电子股份有限公司 High-efficiency processing method and system for big data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
臧文羽 等,: ""H-Tree:一种面向大数据流在线检测的层次索引"", 《计算机学报》 *

Similar Documents

Publication Publication Date Title
CN101031907B (en) Index processing
US8756231B2 (en) Search using proximity for clustering information
CN110147455B (en) Face matching retrieval device and method
KR20040103495A (en) Positional access using a b-tree
CN105787126B (en) K-d tree generation method and k-d tree generation device
CN103064906A (en) File management method and device
CN111737228B (en) Database and table dividing method and device
CN102207935A (en) Method and system for establishing index
CN102999495B (en) A kind of synonym Semantic mapping relation determines method and device
CN104778259A (en) High-efficiency data analyzing and processing method
CN112115313A (en) Regular expression generation method, regular expression data extraction method, regular expression generation device, regular expression data extraction device, regular expression equipment and regular expression data extraction medium
CN105354283A (en) Resource searching method and apparatus
Kopelowitz et al. Dynamic weighted ancestors
US20150293971A1 (en) Distributed queries over geometric objects
CN112445776B (en) Presto-based dynamic barrel dividing method, system, equipment and readable storage medium
US9235578B2 (en) Data partitioning apparatus and data partitioning method
CN104765851A (en) Big data analysis and extraction method
CN109657060B (en) Safety production accident case pushing method and system
CN111522918A (en) Data aggregation method and device, electronic equipment and computer readable storage medium
CN104850591A (en) Data conversion storage method and device
CN114490835B (en) High-utility item set mining method and device, electronic equipment and medium
CN105740371A (en) Density-based incremental clustering data mining method and system
JPWO2012049883A1 (en) Data structure, index creation device, data search device, index creation method, data search method, index creation program, and data search program
CN110866088B (en) Method and system for fast full-text retrieval between corpora
CN104991963B (en) Document handling method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by sipo to initiate substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150708