CN104765851A - Big data analysis and extraction method

Info

Publication number: CN104765851A
Application number: CN201510189396.3A
Authority: CN
Inventors: 杨立波
Original assignee: Chengdu Bo Yuan Epoch Softcom Ltd
Current assignee: Chengdu Bo Yuan Epoch Softcom Ltd
Priority date: 2015-04-21
Filing date: 2015-04-21
Publication date: 2015-07-08

Abstract

The invention provides a big data analysis and extraction method. The method divides the attributes of a data set into discrete and continuous types according to the characteristics of the data types in the set, builds a two-level index on the basis of multi-dimensional indexing, and uses this hierarchical index to mine and match the data set in real time. Because matching is performed against the hierarchical index, performance improves substantially without reducing mining accuracy.

Description

A big data analysis and extraction method

Technical Field

The present invention relates to big data analysis, and in particular to a big data analysis and extraction method.

Background Art

Online mining of operational data with big data processing has wide application in large enterprises. In a big data environment, a data set contains data in different media formats. Building an index over the mining rules speeds up rule evaluation and greatly improves the efficiency of online mining. A data set carries meta-information for attributes as different as text, pictures, audio, and video, and these attributes differ considerably from one another. In practice, however, the mining rule set is large and its dimensionality is high, so mining a large-scale data set is computationally expensive and inefficient. Most prior art optimizes the index for a data set of a single type with a relatively small rule set, and does not exploit the relationships between different attributes in a multi-type data set. Such methods are therefore hard to apply directly to mining multi-type data, and mining performance suffers.

Summary of the invention

To solve the above problems of the prior art, the present invention proposes a big data analysis and extraction method, comprising:

dividing the attributes of the data set into discrete and continuous types according to the characteristics of the data types in the data set, building a two-level index on the basis of multi-dimensional indexing, and using this hierarchical index to mine and match the data set in real time.

Preferably, the hierarchical index comprises a binary search tree index, a multidimensional vector index, and an association table between mining rules and operators; the association table combines the mining rule results of the two index layers. A binary search tree is generated as the bottom-layer index from the operators on discrete attributes; above the bottom layer, all continuous attributes are mapped to a multidimensional space, a multi-dimensional index is generated from the operators on continuous attributes, and operators on the same attribute are partitioned by dimension when the index is generated. The operations on the hierarchical index comprise search, insertion, and deletion;

The nodes of the hierarchical index fall into three classes: the first-layer node top, the second-layer intermediate node mid, and the leaf node leaf. A first-layer node top contains the following elements: attr, the discrete attribute corresponding to the first-layer binary search tree node; value, the discrete value corresponding to that node; weight, the priority of the operator the node represents; and left and right, the node's left and right children. An intermediate node contains branch, the pointers to intermediate nodes of the multidimensional space tree of the second-layer index. A leaf node contains mbr, the multidimensional vector corresponding to the second-layer leaf node.

Preferably, using the hierarchical index to mine and match the data set in real time further comprises:

For each data tuple, exploiting the characteristics of rule sets in data mining, the binary search tree index of the first layer is scanned first to compute the set of discrete operators that are satisfied. Then, for each valid node hit in the first-layer index, the operator association table is searched to find the node pointer into the second-layer index, and the search continues in the second-layer multi-dimensional index. Finally, the first-layer and second-layer hits are combined to compute the set of rules that are finally hit;

Specifically, for any data set tuple, its discrete attribute values are first scanned in the multi-attribute binary search tree index; second-layer rule processing is performed only if a discrete operator is hit, otherwise the method returns immediately. Each multi-dimensional index reached from the first layer is then matched in turn against the tuple's attribute key-value pairs, the matching results are stored in a cache, and a rule aggregation algorithm is run over the cache to obtain the final set of hit mining rules.

Compared with the prior art, the present invention has the following advantages:

The present invention proposes a data analysis and extraction method whose data matching, based on the hierarchical index it builds, significantly improves performance without reducing mining accuracy.

Brief Description of the Drawings

Fig. 1 is a flowchart of the big data analysis and extraction method according to an embodiment of the present invention.

Detailed Description

A detailed description of one or more embodiments of the present invention is provided below together with the accompanying drawing that illustrates the principles of the invention. The invention is described in connection with such embodiments, but it is not limited to any embodiment. The scope of the invention is defined only by the claims, and the invention encompasses many alternatives, modifications, and equivalents. Many specific details are set forth in the following description to provide a thorough understanding of the invention. These details are provided for the purpose of example, and the invention may be practiced according to the claims without some or all of them.

One aspect of the present invention provides a big data analysis and extraction method. Fig. 1 is a flowchart of the method according to an embodiment of the present invention.

Building on multi-dimensional indexing, the present invention introduces the idea of a hierarchical index. Taking into account the characteristics of data sets of different types, the attributes of a data set are divided into discrete attributes and continuous attributes. Because operators on discrete attributes are highly shared among rules, a two-layer hierarchical index is proposed, together with methods for generating the index and for mining and matching.

The present invention proposes a dynamic index for evaluating large-scale mining rules over data sets of different types and supports real-time updates of the mining rules. Its main process comprises index generation and real-time matching. During index generation, the attribute set of the data sets is first classified into continuous attributes and discrete attributes. Then, according to attribute type, the input mining rule set is divided into different operator sets, and a hierarchical index is generated from them: a binary search tree is generated as the first-layer index from the operators on discrete attributes; at the second layer, all continuous attributes are mapped to a multidimensional space and a multi-dimensional index is generated from the operators on continuous attributes. Because the operators on discrete attributes all take discrete values, the first-layer index can locate the relevant mining rules quickly, and its space overhead is small. At the second layer, operators on the same attribute are partitioned by dimension when the index is generated, to speed up mining rule processing as much as possible.

When a data set tuple t arrives in real time, vector extraction and pruning are first applied to t. The quantized vector is then classified by operator attribute into different operator vectors. Finally, the matching method of the hierarchical index filters through the two index levels to obtain the set of mining rules whose conditions are satisfied.

Structurally, the hierarchical index comprises three components and three operations. The three components are: (1) the binary search tree index of the first layer; (2) the multidimensional vector index of the second layer; (3) the association table between mining rules and operators. The three main operations on the hierarchical index are: (1) search; (2) insertion; (3) deletion.

Overall, the hierarchical index is a two-layer index. The first layer is a binary search tree index generated from the discrete operators; the second layer is a multidimensional space tree generated from the multidimensional vectors corresponding to the continuous operators. The other essential component is the association table between mining rules and operators, which is used to combine the mining rule results of the two index layers quickly.

The nodes of the hierarchical index fall into three classes: the first-layer node top, the second-layer intermediate node mid, and the leaf node leaf.

A first-layer node contains the following elements: attr is the discrete attribute corresponding to the first-layer binary search tree node, value is the discrete value corresponding to that node, weight is the priority of the operator the node represents, and left and right are the node's left and right children. In an intermediate node, branch holds the pointers to intermediate nodes of the multidimensional space tree of the second-layer index. In a leaf node, mbr is the multidimensional vector corresponding to the second-layer leaf node.
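The three node classes can be written down as plain records. The following is only a minimal sketch: the field names attr, value, weight, left, right, branch, and mbr come from the description, while the class names, the Python types, and the reading of mbr as a pair of lower and upper corner vectors are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple, Union

# Assumed representation of mbr: (lower corner, upper corner) of a box.
Box = Tuple[Tuple[float, ...], Tuple[float, ...]]


@dataclass
class TopNode:
    """First-layer binary search tree node."""
    attr: str                      # discrete attribute
    value: str                     # discrete value for that attribute
    weight: float                  # priority of the operator
    left: Optional["TopNode"] = None
    right: Optional["TopNode"] = None


@dataclass
class LeafNode:
    """Second-layer leaf node."""
    mbr: Box


@dataclass
class MidNode:
    """Second-layer intermediate node; branch points to its children."""
    branch: List[Union["MidNode", LeafNode]] = field(default_factory=list)


# Hypothetical example values.
top = TopNode(attr="media_type", value="video", weight=3.0)
leaf = LeafNode(mbr=((30.0, 0.0), (120.0, 500.0)))
mid = MidNode(branch=[leaf])
```

The description does not say how a first-layer node refers to its second-layer tree; it places that link in the association table, so no such field is added here.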

Based on the hierarchical index, a feasible index generation method is proposed, in three steps: rule preprocessing, operator set partitioning, and hierarchical index generation. The preprocessed rule set is first partitioned by attribute class, and the index is then generated layer by layer over the partitioned sets.

When the rule set is partitioned, the n preprocessed rules are divided, by attribute class and value domain, into a discrete operator set A and a continuous operator set B, where:

$$\lVert A \rVert + \lVert B \rVert = \sum_{q \in Q} p(q)$$

Here Q is the registered mining rule set, q is a single mining rule, and p is the operator. Suppose the operator set across the dimensions is first divided into s intervals $I_1, I_2, \ldots, I_s$, each of which contains only discrete operators or only continuous operators. Owing to the dimension transformation applied in preprocessing, the operators within each of the intervals $I_1, I_2, \ldots, I_s$ have similar attributes, which makes it easier to generate the index layer by layer. Meanwhile, the rule sets for data sets of different types usually carry priority information: the higher a rule's priority, the more urgently it needs to be evaluated. When a new mining rule is registered in the system, the preprocessing module first divides it by attribute type into discrete operators $p_d$ and continuous operators $p_c$. Then $p_d$ is inserted into the first-layer index of the hierarchical index, that is, into the binary search tree index for the discrete attributes. Finally, $p_c$ is inserted into the second-layer index.
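The registration step, in which a rule is divided by attribute type into discrete and continuous operators, can be sketched as follows. This is an illustration under stated assumptions, not the patented preprocessing module: the Predicate record, the SCHEMA dictionary, the attribute names, and the function split_rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical schema: attribute name -> "discrete" or "continuous".
SCHEMA: Dict[str, str] = {
    "media_type": "discrete",
    "language": "discrete",
    "duration": "continuous",
    "file_size": "continuous",
}


@dataclass(frozen=True)
class Predicate:
    """One operator of a mining rule, e.g. duration >= 30."""
    attr: str
    op: str
    value: object


def split_rule(
    rule: List[Predicate], schema: Dict[str, str]
) -> Tuple[List[Predicate], List[Predicate]]:
    """Divide a rule into its discrete and continuous operators."""
    discrete = [p for p in rule if schema[p.attr] == "discrete"]
    continuous = [p for p in rule if schema[p.attr] == "continuous"]
    return discrete, continuous


# A hypothetical rule: videos between 30 and 120 seconds long.
rule = [
    Predicate("media_type", "==", "video"),
    Predicate("duration", ">=", 30.0),
    Predicate("duration", "<=", 120.0),
]
p_d, p_c = split_rule(rule, SCHEMA)
```

In this sketch p_d would go to the first-layer binary search tree and p_c to the second-layer multi-dimensional index, as the paragraph above describes.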

When $p_d$ is inserted into the first-layer binary search tree, it is first inserted by the standard insertion procedure for an ordered binary tree. This may violate the heap property of the tree, so the node is rotated upward until the heap property is restored. Deletion works in reverse: the node's priority is first set to the minimum, the node is rotated down to a leaf, and the leaf is then deleted.
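A search tree that is ordered by key and also keeps a heap property on a priority is commonly called a treap. The sketch below shows insertion with upward rotation and deletion by sinking a node to a leaf, assuming a max-heap on weight and keys of the form (attr, value); the function names and the handling of an already indexed key are illustrative choices, not taken from the source.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple

Key = Tuple[str, str]  # assumed key: (attr, value)


@dataclass
class Node:
    key: Key
    weight: float
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def rotate_right(n: Node) -> Node:
    l = n.left
    n.left, l.right = l.right, n
    return l


def rotate_left(n: Node) -> Node:
    r = n.right
    n.right, r.left = r.left, n
    return r


def insert(root: Optional[Node], key: Key, weight: float) -> Node:
    """Ordered insert, then rotate upward while the heap property fails."""
    if root is None:
        return Node(key, weight)
    if key < root.key:
        root.left = insert(root.left, key, weight)
        if root.left.weight > root.weight:
            root = rotate_right(root)
    elif key > root.key:
        root.right = insert(root.right, key, weight)
        if root.right.weight > root.weight:
            root = rotate_left(root)
    return root  # equal key: the shared operator is already indexed


def _sink(n: Node) -> Optional[Node]:
    """Rotate a minimum-priority node down to a leaf and drop it."""
    if n.left is None and n.right is None:
        return None
    if n.right is None or (n.left is not None and n.left.weight >= n.right.weight):
        top = rotate_right(n)
        top.right = _sink(n)
    else:
        top = rotate_left(n)
        top.left = _sink(n)
    return top


def delete(root: Optional[Node], key: Key) -> Optional[Node]:
    if root is None:
        return None
    if key < root.key:
        root.left = delete(root.left, key)
    elif key > root.key:
        root.right = delete(root.right, key)
    else:
        root.weight = -math.inf  # set priority to the minimum
        root = _sink(root)
    return root


def keys(n: Optional[Node]) -> List[Key]:
    return [] if n is None else keys(n.left) + [n.key] + keys(n.right)


def is_heap(n: Optional[Node]) -> bool:
    if n is None:
        return True
    kids = [c for c in (n.left, n.right) if c is not None]
    return all(c.weight <= n.weight for c in kids) and all(is_heap(c) for c in kids)


root = None
for k, w in [(("lang", "en"), 2), (("type", "video"), 5),
             (("lang", "zh"), 1), (("type", "audio"), 4)]:
    root = insert(root, k, w)
root = delete(root, ("type", "video"))
```

With a max-heap on weight, the operator with the highest priority sits at the root, so high-priority operators are reached first during a scan.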

The second-layer insertion method is roughly as follows. The target leaf node is located first; the location step is itself recursive. Inserting $p_c$ starts from the root of the second layer and searches breadth-first according to the containment relation in the multidimensional space. Once a leaf node n is found, the number of branches of n is checked. If it exceeds the branch threshold M, the node is split: a new node is created, and the entries already in n together with the vector of $p_c$ are distributed evenly between the two nodes by a heuristic strategy, after which the parent information is updated level by level. If the number of branches of n does not exceed M, the insertion is completed simply by updating the parent node.
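The overflow check against the branch threshold M can be illustrated with a deliberately reduced two-level structure: one root holding leaves, each leaf holding boxes. Several simplifications are assumptions and are not from the source: leaves are chosen by least enlargement of their bounding box rather than by a breadth-first containment search, an overflowing leaf is split into two halves sorted on the first dimension, and parent bounding boxes are recomputed on demand instead of being stored and updated.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[Tuple[float, ...], Tuple[float, ...]]  # (lows, highs)
M = 3  # hypothetical branch threshold


def union(a: Box, b: Box) -> Box:
    """Smallest box covering both a and b."""
    return (tuple(map(min, a[0], b[0])), tuple(map(max, a[1], b[1])))


def area(b: Box) -> float:
    out = 1.0
    for lo, hi in zip(*b):
        out *= hi - lo
    return out


@dataclass
class Leaf:
    entries: List[Box] = field(default_factory=list)

    def mbr(self) -> Box:
        box = self.entries[0]
        for e in self.entries[1:]:
            box = union(box, e)
        return box


@dataclass
class Root:
    leaves: List[Leaf] = field(default_factory=list)


def insert(root: Root, box: Box) -> None:
    if not root.leaves:
        root.leaves.append(Leaf([box]))
        return
    # Pick the leaf whose bounding box grows least (0 if it contains box).
    leaf = min(root.leaves,
               key=lambda l: area(union(l.mbr(), box)) - area(l.mbr()))
    leaf.entries.append(box)
    if len(leaf.entries) > M:
        # Overflow: split evenly into two nodes (simplified strategy).
        ordered = sorted(leaf.entries, key=lambda e: e[0][0])
        half = len(ordered) // 2
        leaf.entries = ordered[:half]
        root.leaves.append(Leaf(ordered[half:]))


root = Root()
for b in [((0.0, 0.0), (1.0, 1.0)), ((1.0, 1.0), (2.0, 2.0)),
          ((2.0, 2.0), (3.0, 3.0)), ((3.0, 3.0), (4.0, 4.0)),
          ((10.0, 10.0), (11.0, 11.0))]:
    insert(root, b)
```

A full implementation would keep intermediate nodes and propagate splits upward; this sketch only shows where the threshold test and the split sit in the insertion path.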

Owing to the characteristics of discrete operators, the first-layer binary search tree index accelerates the evaluation of mining rules on discrete operators. If the priority of the empty tree is taken to be infinite, the insertion and deletion methods above also handle correctly the case where a node has only one child. The expected time complexity of insertion and deletion in the first-layer index is O(log n).

The most important issue in insertion is the node splitting strategy of the second-layer index, for which the present invention uses a heuristic. First, all blocks to be split are taken out. Then the two blocks with the smallest overlap area and the largest combined covering area are selected. Finally, the remaining blocks are merged into the two nodes in turn according to the difference in overlap area.
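One possible reading of this heuristic is sketched below. The seed pair is the pair of blocks with minimal overlap, with ties broken by the larger covering area; each remaining block then joins the group it overlaps more, with ties broken by the smaller enlargement. The tie-breaking rules and all function names are assumptions, since the description does not fix them.

```python
from itertools import combinations
from typing import List, Tuple

Box = Tuple[Tuple[float, ...], Tuple[float, ...]]  # (lows, highs)


def union(a: Box, b: Box) -> Box:
    return (tuple(map(min, a[0], b[0])), tuple(map(max, a[1], b[1])))


def area(b: Box) -> float:
    out = 1.0
    for lo, hi in zip(*b):
        out *= hi - lo
    return out


def overlap(a: Box, b: Box) -> float:
    """Area of the intersection of a and b, 0.0 if they are disjoint."""
    out = 1.0
    for alo, ahi, blo, bhi in zip(a[0], a[1], b[0], b[1]):
        side = min(ahi, bhi) - max(alo, blo)
        if side <= 0:
            return 0.0
        out *= side
    return out


def split(blocks: List[Box]) -> Tuple[List[Box], List[Box]]:
    # Seeds: minimal overlap first, then maximal covering area.
    i, j = min(
        combinations(range(len(blocks)), 2),
        key=lambda p: (overlap(blocks[p[0]], blocks[p[1]]),
                       -area(union(blocks[p[0]], blocks[p[1]]))),
    )
    g1, g2 = [blocks[i]], [blocks[j]]
    m1, m2 = blocks[i], blocks[j]
    for k, b in enumerate(blocks):
        if k in (i, j):
            continue
        # Prefer the group with more overlap, then the smaller enlargement.
        d1 = (-overlap(m1, b), area(union(m1, b)) - area(m1))
        d2 = (-overlap(m2, b), area(union(m2, b)) - area(m2))
        if d1 <= d2:
            g1.append(b)
            m1 = union(m1, b)
        else:
            g2.append(b)
            m2 = union(m2, b)
    return g1, g2


a = ((0.0, 0.0), (1.0, 1.0))
b = ((0.5, 0.5), (1.5, 1.5))
c = ((9.0, 9.0), (10.0, 10.0))
d = ((8.5, 8.5), (9.5, 9.5))
g1, g2 = split([a, b, c, d])
```

In this example the two far-apart blocks a and c become the seeds, and b and d each join the seed they overlap.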

This embodiment describes how a data set tuple is mined against the index structure. For a tuple t, the hierarchical index takes full advantage of the characteristics of rule sets in data mining: the binary search tree index of the first layer is scanned first to compute the set of discrete operators that are satisfied. Because discrete operators are highly shared among rules, this greatly increases scanning speed and accelerates the whole rule processing procedure. Then, for each valid node hit in the first-layer index, the operator association table is searched to find the node pointer into the second-layer index, and the search continues in the second-layer multi-dimensional index. Finally, the first-layer and second-layer hits are combined to compute the set of rules that are finally hit.

For any data set tuple t, its discrete attribute values are first scanned in the multi-attribute binary search tree index. A hit means that a discrete operator is satisfied, and only in that case is the rule processing of the second-layer index carried out; otherwise the method returns immediately. Using the first layer as a trigger index in this way greatly accelerates the processing of data set tuples. If a discrete operator is hit, the rule matching procedure of the second-layer index follows: each multi-dimensional index reached from the first layer is matched in turn against the tuple's attribute key-value pairs, and the matching results are stored in a cache. A rule aggregation algorithm is then run over the cache to obtain the final set of hit mining rules.
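The two-stage matching flow can be sketched end to end as below. Everything concrete here is an assumption made for illustration: a dictionary stands in for the first-layer binary search tree, the association table maps a discrete operator id to a flat list of (rule id, box) entries instead of a tree pointer, the rule ids and attribute names are invented, and the aggregation step is taken to be a plain union because the source does not specify the rule aggregation algorithm.

```python
from typing import Dict, List, Set, Tuple

Box = Tuple[Tuple[float, ...], Tuple[float, ...]]  # (lows, highs)

# First-layer stand-in: (attr, value) -> discrete operator id.
first_layer: Dict[Tuple[str, object], int] = {
    ("media_type", "video"): 0,
    ("media_type", "audio"): 1,
}

# Association table stand-in: operator id -> second-layer entries.
association: Dict[int, List[Tuple[str, Box]]] = {
    0: [("r1", ((30.0, 0.0), (120.0, 500.0))),
        ("r2", ((0.0, 0.0), (10.0, 50.0)))],
    1: [("r3", ((0.0, 0.0), (600.0, 100.0)))],
}

CONTINUOUS = ("duration", "file_size")  # order of box dimensions


def contains(box: Box, point: Tuple[float, ...]) -> bool:
    return all(lo <= x <= hi for lo, x, hi in zip(box[0], point, box[1]))


def match(t: Dict[str, object]) -> Set[str]:
    # Stage 1: scan the discrete attribute values of the tuple.
    hits = [first_layer[(a, v)] for a, v in t.items() if (a, v) in first_layer]
    if not hits:
        return set()  # no discrete operator hit: return immediately
    # Stage 2: follow the association table into the second layer.
    point = tuple(t[a] for a in CONTINUOUS)
    cache: Dict[int, List[str]] = {}
    for op in hits:
        cache[op] = [rid for rid, box in association[op] if contains(box, point)]
    # Aggregate the cached results (assumed here to be a union).
    result: Set[str] = set()
    for rids in cache.values():
        result.update(rids)
    return result


video = {"media_type": "video", "duration": 60.0, "file_size": 200.0}
image = {"media_type": "image", "duration": 60.0, "file_size": 200.0}
video_hits = match(video)
image_hits = match(image)
```

The early return for the image tuple is the trigger behaviour described above: the second layer is never touched when no discrete operator is hit.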

In summary, the present invention proposes a data analysis and extraction method whose data matching, based on the hierarchical index it builds, significantly improves performance without reducing mining accuracy.

Obviously, those skilled in the art will appreciate that the modules or steps of the invention described above can be implemented on general-purpose computing systems. They can be concentrated on a single computing system or distributed over a network of multiple computing systems. Optionally, they can be implemented as program code executable by a computing system, and thus stored in a storage system and executed by the computing system. The invention is therefore not limited to any specific combination of hardware and software.

It should be understood that the above embodiments serve only to illustrate or explain the principles of the invention and do not limit it. Any modification, equivalent replacement, or improvement made without departing from the spirit and scope of the invention therefore falls within its scope of protection. Furthermore, the appended claims are intended to cover all changes and modifications that fall within the scope and boundaries of the claims, or the equivalents of such scope and boundaries.

Claims

1. A big data analysis and extraction method, characterized by comprising:

2. The method according to claim 1, characterized in that the hierarchical index comprises a binary search tree index, a multidimensional vector index, and an association table between mining rules and operators; the association table combines the mining rule results of the two index layers, wherein a binary search tree is generated as the bottom-layer index from the operators on discrete attributes; above the bottom layer, all continuous attributes are mapped to a multidimensional space, a multi-dimensional index is generated from the operators on continuous attributes, and operators on the same attribute are partitioned by dimension when the index is generated; the operations on the hierarchical index comprise search, insertion, and deletion;

3. The method according to claim 2, characterized in that using the hierarchical index to mine and match the data set in real time further comprises: