CN103678550A - Mass data real-time query method based on dynamic index structure - Google Patents

Info

Publication number
CN103678550A
Authority
CN
China
Prior art keywords
node
result
data
query
mass data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310648180.XA
Other languages
Chinese (zh)
Other versions
CN103678550B (en)
Inventor
陈丹伟
庄俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Longkon wisdom Polytron Technologies Inc
Original Assignee
Nanjing Post and Telecommunication University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Post and Telecommunication University
Priority to CN201310648180.XA
Publication of CN103678550A
Application granted
Publication of CN103678550B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2246 Trees, e.g. B+trees

Abstract

The invention discloses a real-time query method for mass data based on a dynamic index structure (DC-Tree). The method reduces the dimensionality of a massive multi-dimensional data set, achieves high space efficiency and low query time, and supports distributed redundant storage; it thereby improves the efficiency of data distribution over traditional distributed mechanisms and is suited to mass data processing. The method comprises five steps. Step 1: in the MasterNode, a multi-dimensional data record (DR) is mapped by the Z Curve mapping function $f_z$ to generate a dimensionality-reduced result set S. Step 2: the MasterNode selects k hash functions and maps the result set S through a Bloom Filter to generate a node set NN. Step 3: the data record DR is updated, and dynamic construction is carried out on each element of the node set NN. Step 4: a user queries by MDS, obtains the node set NN through steps 1 and 2, and starts the parallel query method. Step 5: the user aggregates the result sets of all accessed nodes in the node set NN to obtain the final query result Rset.

Description

A real-time query method for mass data based on a dynamic index structure
Technical field
The present invention relates to big data query technology in the computer field, and in particular to a real-time query method for mass data based on a dynamic index structure.
Background
With the rapid development of the Internet, social networks and mobile applications have become increasingly popular, and the volume of networked information keeps growing. Big data has been defined as an emerging concept, and data, as the carrier of information, plays a very important role. The explosive growth of data has brought us into an era of large-scale data analysis, which is characterized by high computational intensity and the need for large-scale concurrent storage and processing capability. How to process mass data quickly, and how to extract valuable information from it in a timely and effective manner, is a technical problem in urgent need of a solution.
At present there are two mainstream technologies for large-scale data analysis. The first is the parallel database, which emerged in the 1980s, is represented by Teradata and the Gamma research project, and has gradually matured. A parallel database is composed of a series of operators, where the output stream of one operator is the input stream of the next; records pass through these operators in turn in a pipelined fashion, which gives relatively high performance. The second, led by Google, is a "shared-nothing" parallel computing framework composed of the simple functional programming model MapReduce and the distributed file system GFS, which supports hundreds of millions of searches every day. Apache Hadoop is an open-source implementation of MapReduce. However, these large-scale data processing technologies have difficulty meeting real-time requirements and are aimed mostly at processing offline data. Hadoop is more like an ETL tool, and its relationship with the parallel database is complementary rather than competitive.
On the other hand, the dynamic index structure R-Tree proposed by Guttman, and the variants based on it, allow operations such as insertion and query to be carried out at the same time and support multi-dimensional models, which gives them a clear advantage among spatial data indexing techniques. When they are applied to large-scale data, however, the overlap between the nodes visited by a query grows as the height of the tree increases, and query efficiency declines quickly. The present invention addresses the above problems.
Summary of the invention
The object of the present invention is to provide a real-time query method for large-scale multi-dimensional data based on a dynamic index structure (DC-Tree). The method solves the latency problem of large-scale multi-dimensional data processing and realizes a real-time query model for mass data in a distributed architecture.
The technical solution adopted by the present invention to solve this technical problem is a real-time query method for mass data based on a dynamic index structure (DC-Tree), comprising the following steps:
Step 1: in the MasterNode, a multi-dimensional data record DR is mapped by the Z Curve mapping function $f_z$ to generate a dimensionality-reduced result set S;
Step 2: the MasterNode selects k hash functions and maps the result set S through a Bloom Filter to generate a node set NN;
Step 3: the data record DR is updated, and dynamic construction is carried out on each element of the node set NN;
Step 4: the user (User) queries by MDS, obtains the node set NN through step 1 and step 2, and starts the parallel query method;
Step 5: the user (User) aggregates the result sets of all accessed nodes in the node set NN to obtain the final query result Rset.
By reducing the dimensionality of a massive multi-dimensional data set on the basis of a dynamic index structure, the present invention provides a method with high space efficiency and low query time that supports distributed redundant storage; it thereby improves the efficiency of data distribution over traditional distributed mechanisms and is suited to processing large-scale data. The invention builds a multi-dimensional data tree with a concept hierarchy and departs from the traditional single-attribute query method: a data set with multi-dimensional functional attributes is divided by dimension when the tree is built, which greatly reduces the aggregation workload incurred by single-attribute queries.
By mapping data from a high-dimensional data space to a one-dimensional space, the present invention greatly reduces the workload of the data management node and supports the dynamic addition of data storage nodes. The invention also designs insertion and query methods for mass data, which support the dynamic construction of multi-dimensional attribute data and real-time mass data queries, and it adds a lock mechanism to the query process to meet the concurrency requirements of queries.
I. System architecture
Fig. 1 shows the architecture of the mass data real-time query system. The system consists of the following four parts: the data management node (MasterNode), the dynamic index tree (DC-Tree), the data storage node (DataNode) and the user (User). The MasterNode is responsible for locating data for queries and updates, mainly using dimensionality reduction and fast lookup techniques. The DC-Tree is used to dynamically build the query tree over multi-dimensional attribute data and provides real-time query performance. The DataNode is responsible for storing the actual data. The User sends a query request to the MasterNode; the MasterNode processes the request, determines which DataNodes hold the queried content, and returns these matching DataNodes to the user. After this operation completes, the user disconnects from the MasterNode and actively accesses the returned DataNodes to carry out the query. The overall system framework is shown in Fig. 1.
The mass data real-time query scheme of the present invention consists of the following four operations: MDS (minimum describing subset) decomposition, Z Curve dimensionality reduction, Bloom Filter location, and DC-Tree indexing with result aggregation.
II. Method flow
1. MDS (minimum describing subset) decomposition
An MDS (minimum describing subset) has the form $(M_1, \ldots, M_d)$, where
Figure BDA0000430114430000021
Without loss of generality, let $M_i = \{a_{i1}, a_{i2}, \ldots, a_{ik}\}$, where $1 \le i \le d$ and $a_{ik} \in D_i$. The multi-dimensional data record set corresponding to this MDS is $\{(a_{11}, a_{21}, \ldots, a_{d1}), \ldots, (a_{1k}, a_{2k}, \ldots, a_{dk})\}$, denoted MM.
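The construction of MM can be illustrated with a small Python sketch. It follows the record set exactly as printed above, pairing the j-th value of every $M_i$ into one record; this positional reading is an assumption, and if a Cartesian product of the $M_i$ is intended, `itertools.product` would be used instead. The function name `mds_to_records` and the sample values are hypothetical.

```python
from typing import List, Sequence, Tuple


def mds_to_records(mds: Sequence[Sequence[int]]) -> List[Tuple[int, ...]]:
    """Expand an MDS (M_1, ..., M_d) into the record set MM.

    Assumption: record j is (a_1j, a_2j, ..., a_dj), i.e. the j-th value
    of every dimension, as the set is written in the text.
    """
    if not mds:
        return []
    k = len(mds[0])
    if any(len(m) != k for m in mds):
        raise ValueError("every M_i must hold the same number k of values")
    return [tuple(m[j] for m in mds) for j in range(k)]


# Hypothetical MDS with d = 2 dimensions and k = 3 values per dimension.
mds = ([1, 2, 3], [10, 20, 30])
MM = mds_to_records(mds)
```

Each tuple in `MM` is one multi-dimensional record p that is later passed to the Z Curve mapping.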
2. Z Curve dimensionality reduction
The result set MM obtained in step 1 above is reduced in dimension with the Z Curve method. Let the Z Curve mapping function be $f_z(p, m, n)$, where $p \in MM$, m is the order of the Z Curve and n is the number of dimensions of the multi-dimensional model, and let the return value of $f_z$ be $y_p$. The computation of this mapping function is given by the following pseudocode:
(1) $y_p = 0$;
REPEAT (for $i = 1, \ldots, m$)
    REPEAT (for $j = 1, \ldots, n$)
(2)     $y_p = y_p + 2^{n(i-1)+j-1} a_{ji}$
    UNTIL $j \ge n$
UNTIL $i \ge m$
(3) RETURN $y_p$
Because the space complexity of the mapping function of an n-dimensional, m-order Z Curve is O(n), an array of length n is needed to store the results $y_p$; let this result set be S.
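The pseudocode above can be made concrete with a short Python sketch. It assumes that $a_{ji}$ denotes the i-th least-significant bit of the j-th coordinate of p, which the text does not state explicitly; under that assumption the double loop is ordinary bit interleaving (Morton order). The function name `z_curve` and the sample points are hypothetical.

```python
from typing import Sequence


def z_curve(p: Sequence[int], m: int) -> int:
    """Map an n-dimensional point p to one integer y_p.

    Assumption: a_ji is bit (i-1) of coordinate j, counted from the least
    significant bit, and every coordinate fits in m bits.
    """
    n = len(p)
    if any(not 0 <= c < (1 << m) for c in p):
        raise ValueError("each coordinate must fit in m bits")
    y = 0
    for i in range(1, m + 1):          # bit position within a coordinate
        for j in range(1, n + 1):      # dimension index
            a_ji = (p[j - 1] >> (i - 1)) & 1
            y += (1 << (n * (i - 1) + j - 1)) * a_ji
    return y


# Hypothetical record set MM with n = 2 dimensions and order m = 3.
MM = [(3, 5), (0, 1), (1, 0)]
S = [z_curve(p, m=3) for p in MM]
```

For the point (3, 5) the bits of 3 (011) and 5 (101) interleave to 100111 in binary, so the first entry of `S` is 39.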
3. Bloom Filter location
Given the dimensionality-reduced result set $S = \{y_1, \ldots, y_n\}$ obtained in step 2 above, and following the description of the Bloom Filter in related work, k hash functions $HF_i$, $1 \le i \le k$, now need to be selected. A Bloom Filter inherently has a certain error rate. To reduce these false positives, the present invention follows Knuth's result when constructing the hash functions: from two hash functions $HF_1$ and $HF_2$, further hash functions can be generated in the following form:
$HF_i = [HF_1 + HF_2 + f(i)] \bmod r$
where $1 \le i \le k$, r is the length of the Bloom Filter array, and $HF_1$ and $HF_2$ are two mutually independent hash functions. When $f(i) = 0$ this is the double hashing scheme; otherwise it is the extended hashing scheme. Hash functions generated in this way keep the false positive rate unchanged while improving the computational efficiency of the system.
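A minimal sketch of generating k array positions from two base hashes follows. Two points are assumptions rather than statements of the source: the base hashes $HF_1$ and $HF_2$ are taken from two halves of a SHA-256 digest, and the second hash is multiplied by i, as in the commonly used double-hashing form, so that the k positions differ even when $f(i) = 0$ (the formula as printed above carries no such multiplier). The names `base_hashes` and `bloom_positions` are hypothetical.

```python
import hashlib
from typing import Callable, List, Tuple


def base_hashes(key: int) -> Tuple[int, int]:
    """Derive two base hash values HF_1, HF_2 from one digest (assumption)."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big")
    return h1, h2


def bloom_positions(key: int, k: int, r: int,
                    f: Callable[[int], int] = lambda i: 0) -> List[int]:
    """Return k positions in a Bloom Filter array of length r.

    f(i) = 0 gives plain double hashing; a non-zero f gives the
    extended scheme described in the text.
    """
    h1, h2 = base_hashes(key)
    # Assumption: multiplier i on h2, following the usual double-hashing form.
    return [(h1 + i * h2 + f(i)) % r for i in range(1, k + 1)]


double = bloom_positions(39, k=4, r=1024)
extended = bloom_positions(39, k=4, r=1024, f=lambda i: i * i)
```

Only two hash evaluations are needed per key regardless of k, which is where the saving in computation comes from.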
After the k functions are selected, the data in set S are mapped, and a set of DataNodes, denoted NN, is returned. This set NN is returned to the user.
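The text does not spell out how the Bloom Filter output is turned into DataNodes. One possible reading, sketched below purely as an assumption, is that the MasterNode keeps one Bloom Filter per DataNode and returns every node whose filter may contain at least one key of S. The class `BloomFilter`, the function `locate` and the node names are hypothetical.

```python
import hashlib
from typing import Dict, Iterable, List, Set


class BloomFilter:
    """Small Bloom Filter with k positions derived by double hashing."""

    def __init__(self, r: int = 1024, k: int = 4) -> None:
        self.r, self.k = r, k
        self.bits = bytearray(r)  # one byte per bit, for readability

    def _positions(self, key: int) -> List[int]:
        d = hashlib.sha256(str(key).encode("utf-8")).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big")
        return [(h1 + i * h2) % self.r for i in range(1, self.k + 1)]

    def add(self, key: int) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: int) -> bool:
        return all(self.bits[pos] for pos in self._positions(key))


def locate(S: Iterable[int], filters: Dict[str, BloomFilter]) -> Set[str]:
    """Return NN: DataNodes whose filter may hold any key of S (assumption)."""
    keys = list(S)
    return {name for name, bf in filters.items()
            if any(bf.might_contain(y) for y in keys)}


# Hypothetical cluster with two DataNodes.
filters = {"dn1": BloomFilter(), "dn2": BloomFilter()}
filters["dn1"].add(39)
filters["dn2"].add(7)
NN = locate([39], filters)
```

Because a Bloom Filter can report false positives but never false negatives, `NN` always includes the node that stores the key and may occasionally include an extra node.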
4. DC-Tree indexing and result aggregation
Using the set NN obtained in step 3 above, the user locates the DataNodes on which the index must be searched, and each DataNode performs the lookup using the DC-Tree indexing method. After the lookup on each DataNode completes, the results are sent to an index result set, denoted Rset; this result set is then aggregated to obtain the final query result.
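The fan-out over NN and the final aggregation might look like the following sketch. Everything concrete here is an assumption: each DataNode is stood in for by an in-memory list of (record, measure) rows instead of a DC-Tree, the per-node lookup is a linear scan, and the aggregate is a sum, which the text does not specify. The names `search_node` and `query_and_aggregate` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence, Tuple

Record = Tuple[int, ...]
Rows = List[Tuple[Record, float]]


def search_node(rows: Rows, matches: Callable[[Record], bool]) -> List[float]:
    """Stand-in for a DC-Tree lookup on one DataNode."""
    return [measure for rec, measure in rows if matches(rec)]


def query_and_aggregate(NN: Sequence[str], data_nodes: Dict[str, Rows],
                        matches: Callable[[Record], bool]) -> float:
    """Query every node of NN in parallel, collect Rset, then aggregate."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(
            lambda name: search_node(data_nodes[name], matches), NN))
    rset = [m for part in partials for m in part]   # the result set Rset
    return sum(rset)                                 # assumed aggregate: sum


# Hypothetical data; dn3 is not in NN and is therefore never contacted.
data_nodes = {
    "dn1": [((1, 10), 5.0), ((2, 20), 7.0)],
    "dn2": [((1, 30), 1.5)],
    "dn3": [((1, 10), 100.0)],
}
result = query_and_aggregate(["dn1", "dn2"], data_nodes,
                             matches=lambda rec: rec[0] == 1)
```

Only the nodes named in NN are queried, so the cost of a query scales with the size of NN rather than with the size of the cluster.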
Beneficial effects:
1. The present invention improves the efficiency of data distribution, is suited to processing large-scale data, and reduces the aggregation workload of single-attribute queries.
2. The present invention achieves efficient concurrent processing of large-scale data in real time.
Brief description of the drawings
Fig. 1 is the system architecture diagram of the present invention.
Fig. 2 is the flow chart of the dynamic insertion method of the present invention.
Fig. 3 is the flow chart of the parallel query method of the present invention.
Detailed description
The technical solution of the present invention is further described below with reference to the accompanying drawings.
Embodiment 1
As shown in Figs. 2 and 3, the present invention proposes a real-time query method for mass data based on a dynamic index structure (DC-Tree), comprising the following steps:
Step 1: in the MasterNode, a multi-dimensional data record DR is mapped by the Z Curve mapping function $f_z$ to generate a dimensionality-reduced result set S;
Step 2: the MasterNode selects k hash functions and maps the result set S through a Bloom Filter to generate a node set NN;
Step 3: the data record DR is updated, and dynamic construction is carried out on each element of the node set NN;
Step 4: the user (User) queries by MDS, obtains the node set NN through step 1 and step 2, and starts the parallel query method;
Step 5: the user (User) aggregates the result sets of all accessed nodes in the node set NN to obtain the final query result Rset.
By reducing the dimensionality of a massive multi-dimensional data set on the basis of a dynamic index structure, the present invention provides a method with high space efficiency and low query time that supports distributed redundant storage; it thereby improves the efficiency of data distribution over traditional distributed mechanisms and is suited to processing large-scale data. The invention builds a multi-dimensional data tree with a concept hierarchy and departs from the traditional single-attribute query method: a data set with multi-dimensional functional attributes is divided by dimension when the tree is built, which greatly reduces the aggregation workload incurred by single-attribute queries.
For a new multi-dimensional data record DR, the present invention quickly locates the node set NN through the MasterNode and dynamically appends the record to the corresponding DC-Tree; the user queries the node set NN by MDS, and the aggregated query result is returned.
The specific implementation is as follows:
(1) In the MasterNode, a multi-dimensional data record DR is mapped by the Z Curve mapping function $f_z$ to generate a dimensionality-reduced result set S;
(2) The MasterNode selects k hash functions and maps the result set S through a Bloom Filter to generate a node set NN;
(3) The data record DR is updated, and dynamic construction is carried out on each element of the node set NN;
Dynamic insertion: request the lock (LOCK) for the root node D; update the Measure value of the directory node; if DR is contained in the MDS of exactly one child of D, set D to that directory child node; if DR is contained in the MDS of several children of D, find the child among them that contains the fewest data nodes and set D to that directory child node; if DR is not contained in the MDS of any child of D, first make a copy of D, denoted D', add DR to each child node of D, compute the overlap value after the addition, select the child node with the smallest overlap value, and set it as D; insert the data record DR into D and update the Measure value of D; if the capacity of D has reached its maximum, call the split function SPLIT with D as the parameter; update the Measure and MDS of the parent node of D; let D point to its parent node, and if D has not been updated or D is not the root node, again insert the data record DR into D, update the Measure value of D and continue, otherwise finish; release the lock (UNLOCK) on the root node D;
(4) The user (User) queries by MDS, obtains the node set NN through step 1 and step 2, and starts the parallel query method;
Parallel query: for all nodes in the node set NN, if a node is not locked, access all nodes in NN concurrently; request the lock (LOCK) for the root node D; for each child node C of D and for each dimension of C, if C and the query MDS are not on the same level of the dimension hierarchy, convert the lower of the two levels to the higher level; if the MDS of C is contained in the query MDS, add that MDS and its Measure value to the result set; if the MDS of C overlaps the query MDS but is not contained in it, set the child node C as D, call the parallel query function PARALLEL QUERY recursively, and carry out the same operation on the remaining nodes in NN; if C is a leaf node, the access ends; release the lock (UNLOCK) on the root node D;
(5) The user (User) aggregates the result sets of all accessed nodes in the node set NN to obtain the final query result Rset;
(6) The whole process ends.
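The child-selection rule of the dynamic insertion and the containment/overlap test of the query can be sketched as follows. This is a deliberately simplified illustration under stated assumptions, not the full procedure: each MDS is a flat list of value sets (no concept hierarchy, so no level conversion), the split function SPLIT is omitted, MDS and Measure are updated on the way down instead of being propagated upward, "fewest data nodes" is approximated by the number of stored records, the overlap value is the number of value combinations shared with sibling nodes, and leaf nodes that only partly overlap the query are scanned record by record. All names are hypothetical.

```python
import threading
from math import prod
from typing import List, Set, Tuple

Record = Tuple[int, ...]
MDS = List[Set[int]]


class Node:
    """Directory node (leaf=False) or data node (leaf=True)."""

    def __init__(self, dims: int, leaf: bool = True) -> None:
        self.mds: MDS = [set() for _ in range(dims)]
        self.measure = 0.0
        self.leaf = leaf
        self.children: List["Node"] = []
        self.records: List[Tuple[Record, float]] = []

    def contains(self, rec: Record) -> bool:
        return all(v in s for v, s in zip(rec, self.mds))

    def size(self) -> int:
        if self.leaf:
            return len(self.records)
        return sum(c.size() for c in self.children)

    def absorb(self, rec: Record, measure: float) -> None:
        """Extend the MDS to cover rec and add its measure."""
        for s, v in zip(self.mds, rec):
            s.add(v)
        self.measure += measure


def make_leaf(dims: int, rows: List[Tuple[Record, float]]) -> Node:
    leaf = Node(dims)
    for rec, m in rows:
        leaf.records.append((rec, m))
        leaf.absorb(rec, m)
    return leaf


def make_dir(dims: int, children: List[Node]) -> Node:
    d = Node(dims, leaf=False)
    d.children = list(children)
    for c in children:
        d.measure += c.measure
        d.mds = [a | b for a, b in zip(d.mds, c.mds)]
    return d


def overlap_after_adding(child: Node, rec: Record, siblings: List[Node]) -> int:
    extended = [s | {v} for s, v in zip(child.mds, rec)]
    return sum(prod(len(e & t) for e, t in zip(extended, sib.mds))
               for sib in siblings if sib is not child)


def choose_child(node: Node, rec: Record) -> Node:
    inside = [c for c in node.children if c.contains(rec)]
    if len(inside) == 1:                      # exactly one child covers DR
        return inside[0]
    if inside:                                # several cover DR: smallest one
        return min(inside, key=Node.size)
    return min(node.children,                 # none covers DR: least overlap
               key=lambda c: overlap_after_adding(c, rec, node.children))


root_lock = threading.Lock()                  # LOCK / UNLOCK on the root


def insert(root: Node, rec: Record, measure: float) -> None:
    with root_lock:
        node = root
        while not node.leaf:
            node.absorb(rec, measure)         # update directory Measure/MDS
            node = choose_child(node, rec)
        node.records.append((rec, measure))
        node.absorb(rec, measure)


def query(root: Node, q: MDS) -> float:
    def visit(node: Node) -> float:
        total = 0.0
        for c in node.children:
            if all(a <= b for a, b in zip(c.mds, q)):    # MDS inside query
                total += c.measure
            elif all(a & b for a, b in zip(c.mds, q)):   # partial overlap
                if c.leaf:
                    total += sum(m for r, m in c.records
                                 if all(v in s for v, s in zip(r, q)))
                else:
                    total += visit(c)
        return total
    with root_lock:
        return visit(root)


# Hypothetical two-dimensional tree with two data nodes.
c1 = make_leaf(2, [((1, 10), 1.0), ((1, 20), 1.0)])
c2 = make_leaf(2, [((2, 20), 4.0)])
root = make_dir(2, [c1, c2])
insert(root, (1, 10), 5.0)   # covered by c1 only
insert(root, (2, 10), 2.0)   # covered by neither; c2 gives the smaller overlap
partial = query(root, [{1, 2}, {10}])
whole = query(root, [{2}, {10, 20}])
```

In the example the second query is answered from the stored Measure of c2 without touching its records, which is the saving that the containment test is meant to provide.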
Embodiment 2
As shown in Fig. 1, the present invention provides the architecture of a mass data real-time query system. The system consists of the following four parts: the data management node (MasterNode), the dynamic index tree (DC-Tree), the data storage node (DataNode) and the user (User). The MasterNode is responsible for locating data for queries and updates, mainly using dimensionality reduction and fast lookup techniques. The DC-Tree is used to dynamically build the query tree over multi-dimensional attribute data and provides real-time query performance. The DataNode is responsible for storing the actual data. The User sends a query request to the MasterNode; the MasterNode processes the request, determines which DataNodes hold the queried content, and returns these matching DataNodes to the user. After this operation completes, the user disconnects from the MasterNode and actively accesses the returned DataNodes to carry out the query.
The mass data real-time query method of the present invention consists of the following four operations: MDS (minimum describing subset) decomposition, Z Curve dimensionality reduction, Bloom Filter location, and DC-Tree indexing with result aggregation.

Claims (5)

1. A real-time query method for mass data based on a dynamic index structure, characterized in that the method comprises the following steps:
Step 1: in the MasterNode, a multi-dimensional data record DR is mapped by the Z Curve mapping function $f_z$ to generate a dimensionality-reduced result set S;
Step 2: the MasterNode selects k hash functions and maps the result set S through a Bloom Filter to generate a node set NN;
Step 3: the data record DR is updated, and dynamic construction is carried out on each element of the node set NN;
Step 4: the user (User) queries by MDS, obtains the node set NN through step 1 and step 2, and starts the parallel query method;
Step 5: the user (User) aggregates the result sets of all accessed nodes in the node set NN to obtain the final query result Rset.
2. The real-time query method for mass data based on a dynamic index structure according to claim 1, characterized in that the method establishes a real-time query model and reduces the dimensionality of a massive multi-dimensional data set.
3. The real-time query method for mass data based on a dynamic index structure according to claim 1, characterized in that the method builds a multi-dimensional data tree with a concept hierarchy.
4. The real-time query method for mass data based on a dynamic index structure according to claim 1, characterized in that the method comprises: MDS (minimum describing subset) decomposition, Z Curve dimensionality reduction, Bloom Filter location, and DC-Tree indexing with result aggregation.
5. The real-time query method for mass data based on a dynamic index structure according to claim 1, characterized in that the method is based on a dynamic index structure.
CN201310648180.XA 2013-09-09 2013-12-04 Mass data real-time query method based on dynamic index structure Active CN103678550B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310648180.XA CN103678550B (en) 2013-09-09 2013-12-04 Mass data real-time query method based on dynamic index structure

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN201310408184.0 2013-09-09
CN2013104081840 2013-09-09
CN201310408184 2013-09-09
CN201310648180.XA CN103678550B (en) 2013-09-09 2013-12-04 Mass data real-time query method based on dynamic index structure

Publications (2)

Publication Number Publication Date
CN103678550A true CN103678550A (en) 2014-03-26
CN103678550B CN103678550B (en) 2017-02-08

Family

ID=50316095

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310648180.XA Active CN103678550B (en) 2013-09-09 2013-12-04 Mass data real-time query method based on dynamic index structure

Country Status (1)

Country Link
CN (1) CN103678550B (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104090962A (en) * 2014-07-14 2014-10-08 西北工业大学 Nested query method oriented to mass distributed-type database
CN104731875A (en) * 2015-03-06 2015-06-24 浙江大学 Method and system for obtaining multi-dimensional data stability
CN106020724A (en) * 2016-05-20 2016-10-12 南京邮电大学 Neighbor storage method based on data mapping algorithm
CN106528773A (en) * 2016-11-07 2017-03-22 山东首讯信息技术有限公司 Spark platform supported spatial data management-based diagram calculation system and method
CN107273471A (en) * 2017-06-07 2017-10-20 国网上海市电力公司 A kind of binary electric power time series data index structuring method based on Geohash
CN107832347A (en) * 2017-10-16 2018-03-23 北京京东尚科信息技术有限公司 Method of Data with Adding Windows, system and electronic equipment
CN109597807A (en) * 2018-10-25 2019-04-09 阿里巴巴集团控股有限公司 Number storehouse list processing method and apparatus
CN109783441A (en) * 2018-12-24 2019-05-21 南京中新赛克科技有限责任公司 Mass data inquiry method based on Bloom Filter
CN114866262A (en) * 2022-07-07 2022-08-05 万商云集(成都)科技股份有限公司 Storage access method, device, equipment and medium for data certificate file
US11741258B2 (en) 2021-04-16 2023-08-29 International Business Machines Corporation Dynamic data dissemination under declarative data subject constraints

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6122628A (en) * 1997-10-31 2000-09-19 International Business Machines Corporation Multidimensional data clustering and dimension reduction for indexing and searching
JP5490905B2 (en) * 2009-09-29 2014-05-14 エヌイーシー ヨーロッパ リミテッド Method and system for stochastic processing of data
EP2564306A4 (en) * 2010-04-27 2017-04-26 Cornell University System and methods for mapping and searching objects in multidimensional space

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104090962B (en) * 2014-07-14 2017-03-29 西北工业大学 Towards the nested query method of magnanimity distributed data base
CN104090962A (en) * 2014-07-14 2014-10-08 西北工业大学 Nested query method oriented to mass distributed-type database
CN104731875A (en) * 2015-03-06 2015-06-24 浙江大学 Method and system for obtaining multi-dimensional data stability
CN104731875B (en) * 2015-03-06 2018-04-17 浙江大学 A kind of method and system for obtaining multidimensional data stability
CN106020724A (en) * 2016-05-20 2016-10-12 南京邮电大学 Neighbor storage method based on data mapping algorithm
CN106528773B (en) * 2016-11-07 2020-06-26 山东联友通信科技发展有限公司 Map computing system and method based on Spark platform supporting spatial data management
CN106528773A (en) * 2016-11-07 2017-03-22 山东首讯信息技术有限公司 Spark platform supported spatial data management-based diagram calculation system and method
CN107273471A (en) * 2017-06-07 2017-10-20 国网上海市电力公司 A kind of binary electric power time series data index structuring method based on Geohash
CN107832347A (en) * 2017-10-16 2018-03-23 北京京东尚科信息技术有限公司 Method of Data with Adding Windows, system and electronic equipment
CN109597807A (en) * 2018-10-25 2019-04-09 阿里巴巴集团控股有限公司 Number storehouse list processing method and apparatus
CN109783441A (en) * 2018-12-24 2019-05-21 南京中新赛克科技有限责任公司 Mass data inquiry method based on Bloom Filter
US11741258B2 (en) 2021-04-16 2023-08-29 International Business Machines Corporation Dynamic data dissemination under declarative data subject constraints
CN114866262A (en) * 2022-07-07 2022-08-05 万商云集(成都)科技股份有限公司 Storage access method, device, equipment and medium for data certificate file
CN114866262B (en) * 2022-07-07 2022-11-22 万商云集(成都)科技股份有限公司 Storage access method, device, equipment and medium for data certificate file

Also Published As

Publication number Publication date
CN103678550B (en) 2017-02-08

Legal Events

Date Code Title Description
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20190314

Address after: Room A808, World Trade Center Building, 67 Shanxi Road, Gulou District, Nanjing City, Jiangsu Province

Patentee after: Longkon wisdom Polytron Technologies Inc

Address before: 210003 new model road 66, Gulou District, Nanjing, Jiangsu

Patentee before: Nanjing Post & Telecommunication Univ.

CP02 Change in the address of a patent holder

Address after: Floor 31, Asia Pacific business building, No. 2 Hanzhong Road, Gulou District, Nanjing, Jiangsu 210005

Patentee after: LUCULENT SMART TECHNOLOGIES CO.,LTD.

Address before: Room A808, World Trade Center Building, 67 Shanxi Road, Gulou District, Nanjing City, Jiangsu Province

Patentee before: LUCULENT SMART TECHNOLOGIES CO.,LTD.
