CN103425772A - Method for searching massive data with multi-dimensional information - Google Patents


Info

Publication number
CN103425772A
Authority
CN
China
Prior art keywords
dimension
data
cube
information
constraint
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2013103501267A
Other languages
Chinese (zh)
Other versions
CN103425772B (en)
Inventor
宋杰
郭朝鹏
王智
徐澍
张一川
朱志良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China filed Critical Northeastern University China
Priority to CN201310350126.7A priority Critical patent/CN103425772B/en
Publication of CN103425772A publication Critical patent/CN103425772A/en
Application granted granted Critical
Publication of CN103425772B publication Critical patent/CN103425772B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for querying massive data with multidimensional information, which relates to the field of data mining. The method comprises the steps of: loading the dimension information of the massive data with multidimensional information; loading the massive data; and querying the massive data using online analytical processing (OLAP). The method organizes the massive data with multidimensional information through dimension coding, uses segmented data storage to simplify data block addressing, quickly realizes dimension hierarchy transformation by means of an intermediate variable (i.e. the analytic path), screens the data with a data-block-based selection method, and computes and processes only the data that actually participates in the query.

Description

A method for querying massive data with multidimensional information
Technical field
The invention relates to the field of data mining, and in particular to a method for querying massive data with multidimensional information.
Background
With the arrival of the big data era, traditional data analysis fields such as data management and querying face great challenges. To cope with the challenges brought by massive data, academia and industry have widely adopted the MapReduce programming model and distributed file systems. OLAP (On-Line Analytical Processing) is a very important means of analysis in the traditional data analysis field, and big data places new requirements on OLAP analysis.
According to how it is implemented, OLAP can be divided into three kinds: ROLAP (Relational OLAP), MOLAP (Multidimensional OLAP) and HOLAP (Hybrid OLAP). ROLAP stores dimension information and fact data in relational tables; MOLAP stores dimension information and fact data in multidimensional data structures; HOLAP, also called hybrid OLAP, combines the ROLAP and MOLAP techniques. Whichever kind of OLAP is used, it needs the support of a storage and computing platform, especially in a big data environment. To address the many challenges brought by big data, academia and industry have produced many new technologies, such as distributed file systems, NoSQL (Not Only SQL) database systems, the MapReduce programming model and related optimization methods, all of which are widely applied in big data analysis.
In a big data environment, two OLAP optimization approaches are commonly used: improving OLAP performance with precomputed results and condensed cubes, and improving OLAP performance by optimizing the storage structure and algorithms. The former generates a large amount of data and is therefore unsuitable for a massive data environment, while the optimization measures of the latter are mostly based on ROLAP and do not bring a qualitative improvement in OLAP performance. One study proposed the SPAJG-OLAP subset of OLAP queries and investigated optimization strategies and implementation techniques for a massively parallel processing framework for massive data, covering storage, querying, data distribution, network transmission and distributed caching. That study optimizes ROLAP performance on the basis of parallel database technology and accelerates OLAP by optimizing OLAP queries and storage. However, because ROLAP is based on relational database technology, it produces a large number of join operations, and the optimization has little effect when the data volume is very large.
As for distributed OLAP systems, some Hadoop-based cloud database systems, such as Hive, HadoopDB and HBase, support OLAP. At present, for massive data, the distributed OLAP field widely uses methods such as data indexing and sharding to optimize ROLAP. But ROLAP relies on the relational model and on resource-consuming join operations, so the benefit of indexing and sharding drops sharply as the data volume grows. There are also methods that optimize ROLAP by optimizing the query conditions, but because join operations are likewise unavoidable, their benefit is not very pronounced. MOLAP stores data as a data cube, but the dimensions need to be managed and optimized, and as a result there are currently no authoritative reports on MOLAP research and systems.
Summary of the invention
In view of the deficiencies of the prior art, the purpose of this invention is to provide a method for querying massive data with multidimensional information, so that data can be queried and aggregated in a massive data environment.
The technical scheme of the invention is realized as follows: a method for querying massive data with multidimensional information, comprising the following steps:
Step 1: load the dimension information of the massive data with multidimensional information, which specifically comprises the following steps:
Step 1.1: examine the dimension information of the massive data and judge whether every dimension simultaneously satisfies the following three constraints:
Constraint 1: a dimension consists of one and only one dimension hierarchy, and the hierarchy is an ordering relation over all of its dimension levels;
Constraint 2: every dimension level of a dimension contains only one dimension attribute, and this dimension attribute comprises several dimension values;
Constraint 3: in the dimension value tree formed by all dimension values, sibling nodes have the same number of child nodes;
If all constraints are satisfied, go to step 1.3; otherwise, go to step 1.2;
Step 1.2: process the dimension information so that every dimension forms a dimension value tree that satisfies the constraints, as follows:
For constraint 1: if there are several dimension hierarchies, discard hierarchies as required and retain only one;
For constraint 2: if a dimension level contains several dimension attributes, discard attributes as required and retain only one;
For constraint 3: if sibling nodes have different numbers of child nodes, add null values so that sibling nodes have the same number of child nodes;
Step 1.3: encode the dimension information;
The dimension values of each dimension level in the dimension value tree are encoded in turn, from left to right, with decimal numbers; encoding is finished once every dimension value has a corresponding code;
Step 1.4: store the encoded dimension information;
For every dimension of the massive data, store the name of each dimension level and the number of sibling nodes at that level; the result is a file describing all the dimension information of the massive data, which is stored in the distributed file system;
Step 2: load the massive data;
Step 2.1: the user establishes, as required, the correspondence between the actual meaning of each dimension value of the massive data and its code, so that any datum in the massive data is represented by dimension codes;
Step 2.2: all the multidimensional data at the finest granularity form a data cube. Every datum of the massive data is a cell in this data cube, and the information of a cell comprises the coordinate of the cell in the cube and the fact data value that the cell represents. The coordinate of a cell is expressed as:
<code in the first dimension, code in the second dimension, ..., code in the last dimension>;
The finest granularity refers to the data at the lowest dimension level;
Step 2.3: partition the data cube:
According to the user's query requirements, partition the data cube into data blocks such that the query time is as short as possible, and thereby determine the edge length of the data blocks;
Step 2.4: encode the data blocks produced in step 2.3 as follows: divide the coordinate of any cell in the data block by the edge length of the data block and round the result up; the value obtained is the code of the data block;
Step 2.5: store the data blocks produced in step 2.3 in the distributed file system, using the code established in step 2.4 as the name of the data block file;
Step 3: query the massive data using online analytical processing (OLAP);
Step 3.1: the user sets the query conditions, which comprise:
Query target: the data cube to be queried, i.e. the target cube;
Query range: the part of the data in the chosen query target that is to be queried;
Dimension information of the result: the dimension information of the result data cube;
Aggregation method: the operation used to aggregate the data within the query range;
Step 3.2: judge whether the query conditions set in step 3.1 satisfy the following constraints:
Constraint 1: the query target exists, and the query range is less than or equal to the data range of the query target;
Constraint 2: the number of dimensions of the result data cube is less than or equal to the number of dimensions of the query target;
Constraint 3: the lowest dimension level of every dimension of the result data cube is higher than the lowest dimension level of the corresponding dimension of the query target;
Constraint 4: the aggregation method must be distributive or algebraic;
If constraints 1, 2 and 4 are satisfied at the same time, go to step 3.3;
If constraints 1 to 4 are all satisfied at the same time, go to step 3.4;
If neither of the above conditions is met, the query fails and the procedure ends;
Step 3.3: change the query target: find the upper-level cube of the current query target and judge whether it satisfies constraint 3; if not, continue with the upper-level cube of that cube. If constraint 3 can never be satisfied, the query fails and the query procedure ends; if an upper-level cube satisfying constraint 3 is found, it replaces the target cube;
Step 3.4: coarse screening of the data: determine from the query range the minimal set of data blocks required by the query;
Step 3.5: fine screening of the data: scan the data block files selected in step 3.4 and screen all the cells in these data blocks against the query range; if a cell lies within the query range, go to step 3.6; otherwise, discard the cell;
Step 3.6: use the dimension codes to change the dimension level of the cell: compare the dimension information of the result data cube with that of the target cube, determine which dimensions change, and modify the corresponding components of the cell coordinate;
Step 3.7: for cells with the same coordinate, aggregate the fact data values in the cells according to the chosen aggregation method;
Step 3.8: the data aggregated in step 3.7 form the result data cube; the information of this result data cube is returned to the user, and the result cube is stored as a new data cube so that it can serve as the query target of a subsequent query.
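Steps 3.4 to 3.7 can be illustrated with a minimal single-machine sketch. It is not the patented MapReduce implementation: the function names, the in-memory list of cells, and the use of integer (floor) division for both block codes and level changes are assumptions made for illustration. Each dimension is described by its per-level sibling counts, as stored in step 1.4.

```python
from collections import defaultdict
from math import prod

def block_range(query_range, edge_lengths):
    """Coarse screening: per dimension, the (first, last) block code that can
    contain cells of query_range. query_range holds inclusive (low, high)
    cell codes, one pair per dimension. Floor division is an assumption."""
    return [(lo // edge, hi // edge)
            for (lo, hi), edge in zip(query_range, edge_lengths)]

def in_range(coord, query_range):
    """Fine screening: a cell is kept only if every coordinate is in range."""
    return all(lo <= c <= hi for c, (lo, hi) in zip(coord, query_range))

def roll_up(code, sibling_counts, levels_up):
    """Move a code at the lowest level up by levels_up levels by dividing it
    by the product of the sibling counts of the levels being dropped."""
    if levels_up == 0:
        return code
    return code // prod(sibling_counts[-levels_up:])

def query(cells, query_range, sibling_counts, levels_up, aggregate=sum):
    """Screen cells, change their dimension levels, then aggregate the fact
    values of cells that end up with the same coordinate."""
    groups = defaultdict(list)
    for coord, value in cells:
        if not in_range(coord, query_range):
            continue
        new_coord = tuple(roll_up(c, counts, up)
                          for c, counts, up in zip(coord, sibling_counts, levels_up))
        groups[new_coord].append(value)
    return {coord: aggregate(values) for coord, values in groups.items()}

# Hypothetical data: four slot-level cells rolled up to the day level with sum.
cells = [((0, 0, 0), 18.0), ((1, 0, 0), 19.0), ((2, 0, 0), 20.0), ((3, 0, 0), 21.0)]
dims = [[3, 12, 31, 3], [2] * 7, [5, 2, 5]]  # Time, Area, Depth sibling counts
result = query(cells, [(0, 3), (0, 0), (0, 0)], dims, (1, 0, 0))
```

Here `sum` stands in for a distributive aggregation method, which is what constraint 4 of step 3.2 requires; any other distributive or algebraic function could be passed as `aggregate`.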
Beneficial effects of the invention: the method for querying massive data with multidimensional information has the following advantages:
1. The massive data with multidimensional information is organized by means of dimension coding.
2. Storing the data in data blocks simplifies the addressing of data blocks.
3. The dimension codes are used to realize the transformation of the dimension hierarchy quickly.
4. Data is screened by a data-block-based selection method, so that only the data that actually participates is computed and processed.
Brief description of the drawings
Fig. 1 is a schematic diagram of the dimension structure of the marine science data set in one embodiment of the invention;
Fig. 2 is a schematic diagram of the time dimension coding in one embodiment of the invention;
Fig. 3 is a schematic diagram of the area dimension coding in one embodiment of the invention;
Fig. 4 is a schematic diagram of the depth dimension coding in one embodiment of the invention;
Fig. 5 is a schematic diagram of a time dimension that does not satisfy constraint 1 in one embodiment of the invention;
Fig. 6 is a schematic diagram of the cube structure of the multidimensional massive data in one embodiment of the invention;
Fig. 7 is a schematic diagram of the partitioning of the multidimensional massive data cube into blocks in one embodiment of the invention;
Fig. 8 is a schematic diagram of the coarse screening of data in one embodiment of the invention;
Fig. 9 is a schematic diagram of the MapReduce implementation of the massive data query method in one embodiment of the invention;
Fig. 10 is a schematic diagram of changing the dimension coding in one embodiment of the invention;
Fig. 11 is an example diagram of changing the dimension coding in MapReduce in one embodiment of the invention;
Fig. 12 is a flow chart of the method for querying massive data with multidimensional information in one embodiment of the invention.
Detailed description of the embodiments
The invention is described in further detail below through an embodiment, taking a marine science data set as an example and with reference to the accompanying drawings.
This embodiment takes a marine science data set as its research object. The data set is massive data with three dimensions: a time dimension, an area dimension and a depth dimension, denoted Time, Area and Depth respectively; its dimension structure is shown in Fig. 1. Time has 4 dimension levels: year (Year), month (Month), day (Day) and slot (Slot). Slot means that a measurement is taken in each of three periods of a day, and its dimension values are "morning", "afternoon" and "evening"; the dimension value tree is shown in Fig. 2. In Fig. 2, with 3 years of data (January 1, 2010 to December 31, 2012), the Year level has 3 dimension values: 2010, 2011 and 2012; the Month level has 36 dimension values: January 2010, February 2010, ..., December 2012; on the Day level, assuming that every month has 31 days, there are 1116 dimension values: January 1, 2010, January 2, 2010, ..., December 31, 2012; and the Slot level has 3348 dimension values: the morning of January 1, 2010, the afternoon of January 1, 2010, ..., the evening of December 31, 2012. Area has 7 dimension levels: 1°, 1/2°, 1/4°, 1/8°, 1/16°, 1/32° and 1/64°. Here 1° refers to a square region formed by 1° of longitude and 1° of latitude, so the earth's surface can be divided into 360 × 180 such 1° squares; the dimension value tree is shown in Fig. 3. Fig. 3 covers data for a 2° region: the 1° level has 2 dimension values, 1° and 2°; the 1/2° level has 4 dimension values, 1/2°, 1°, 3/2° and 2°; and so on, down to the 1/64° level, which has 128 dimension values, 1/64°, 1/32°, ..., 2°. Depth has 3 dimension levels: 100m, 50m and 10m, where 100m means that the depth of the ocean is divided at intervals of 100 meters; the dimension value tree is shown in Fig. 4. In Fig. 4, with a water depth of 500m, the 100m level has 5 dimension values: 100m, 200m, ..., 500m; the 50m level has 10 dimension values: 50m, 100m, ..., 500m; and the 10m level has 50 dimension values: 10m, 20m, ..., 500m. In the cube of the oceanographic data set each cell holds only one fact datum, the ocean temperature, whose data type is double-precision floating point.
In this embodiment, the method for querying the above oceanographic data with 3 dimensions follows the flow shown in Fig. 12 and comprises the following steps:
Step 1: load the dimension information of the massive data with 3 dimensions, which specifically comprises the following steps:
Step 1.1: examine the dimension information of the massive data and judge whether every dimension simultaneously satisfies the following three constraints:
Constraint 1: a dimension consists of one and only one dimension hierarchy, and the hierarchy is an ordering relation over all of its dimension levels.
In this embodiment, for the oceanographic data, suppose the time dimension also contained a week level (for example, the morning of Friday, January 1, 2010). Because a month does not contain a whole number of weeks (for example, January 1 is the start of January 2010, but it is a Friday and not the start of a complete week), two dimension hierarchies would arise: year-week-day-slot and year-month-day-slot. Such a dimension does not satisfy the constraint; it is shown in Fig. 5. The ordering relation required in this embodiment means that, for the time dimension of the oceanographic data, a year contains quarters, a quarter contains months, a month contains days and a day contains slots; a dimension that does not have this kind of relation does not satisfy the constraint.
Constraint 2: every dimension level of a dimension contains only one dimension attribute, and this dimension attribute comprises several dimension values.
Consider a hypothetical city dimension whose hierarchy is province-city-district (for example, Liaoning Province-Shenyang City-Heping District). The city level contains 2 dimension attributes, the city name and the city area code (for example, Shenyang/024). This dimension does not satisfy the constraint.
Constraint 3: in the dimension value tree formed by all dimension values, sibling nodes have the same number of child nodes;
In this embodiment, for the oceanographic data, in reality the months of the time dimension contain different numbers of days (for example, January 2010 has 31 days and February 2010 has 28 days). In the dimension value tree, the months of the same year are siblings; for example, January 2010 and February 2010 are sibling nodes whose parent node is 2010. January 2010 has 31 child nodes and February 2010 has 28 child nodes, so the time dimension does not satisfy the constraint.
If all constraints are satisfied, go to step 1.3; otherwise, go to step 1.2;
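As an illustration of the check for constraint 3, the following sketch tests whether sibling nodes of a dimension value tree all have the same number of children. The nested-dictionary representation of the tree and the function name are assumptions made for this example; the patent does not prescribe a data structure.

```python
def satisfies_constraint_3(tree):
    """Return True if, at every level, sibling nodes have equally many children.

    tree maps each dimension value to a dict of its children; a leaf maps to
    an empty dict. The keys of one dict are siblings of each other.
    """
    child_counts = {len(children) for children in tree.values()}
    if len(child_counts) > 1:
        return False
    return all(satisfies_constraint_3(children) for children in tree.values())

# Real calendar: January 2010 has 31 days, February 2010 has 28.
real_2010 = {
    "2010-01": {day: {} for day in range(1, 32)},
    "2010-02": {day: {} for day in range(1, 29)},
}
# Padded calendar: every month is given 31 day nodes.
padded_2010 = {
    "2010-01": {day: {} for day in range(1, 32)},
    "2010-02": {day: {} for day in range(1, 32)},
}
```

With these two trees, the real calendar fails the check and the padded calendar passes it, which is the situation that step 1.2 is meant to repair.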
Step 1.2: process the dimension information so that every dimension forms a dimension value tree that satisfies the constraints, as follows:
For constraint 1: if there are several dimension hierarchies, discard hierarchies as required and retain only one;
For example, for the time dimension containing Friday, January 1, 2010, there are 2 dimension hierarchies, so the week level needs to be discarded. The resulting time dimension is shown in Fig. 1.
For constraint 2: if a dimension level contains several dimension attributes, discard attributes as required and retain only one;
In the hypothetical city dimension, the city level contains 2 dimension attributes, the city name and the city code. To satisfy constraint 2, the city name attribute is deleted and only the city code is retained, so that the city dimension satisfies constraint 2.
For constraint 3: if sibling nodes have different numbers of child nodes, add null values so that sibling nodes have the same number of child nodes.
For example, in this embodiment the time dimension of the oceanographic data is assumed to have 12 months per year and a fixed 31 days per month. This assumption does not always match reality: for example, the time dimension in Fig. 2 contains February 31, 2010, a date that does not exist. This embodiment corrects the time dimension by setting null values: in a leap year, February 30 and 31 are set to null; in a non-leap year, February 29 to 31 are set to null; and for a month with only 30 days, the 31st is set to null. The resulting dimension value tree of the time dimension is shown in Fig. 2: every month is fixed at 31 days, and months with fewer than 31 days are padded with null values.
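The null-value padding of the time dimension can be sketched as follows. This is only an illustration: the function name and the use of `None` as the null value are assumptions, and Python's standard `calendar` module supplies the real month lengths.

```python
import calendar

DAYS_PER_MONTH = 31  # fixed number of day nodes per month in the embodiment

def padded_month(year, month):
    """Return 31 day entries for a month; days that do not exist are None."""
    real_days = calendar.monthrange(year, month)[1]
    return [day if day <= real_days else None
            for day in range(1, DAYS_PER_MONTH + 1)]

feb_2010 = padded_month(2010, 2)  # non-leap year: the 29th to 31st are None
feb_2012 = padded_month(2012, 2)  # leap year: the 30th and 31st are None
apr_2010 = padded_month(2010, 4)  # 30-day month: only the 31st is None
```

Every month then contributes exactly 31 children to the dimension value tree, so the sibling months satisfy constraint 3.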
Step 1.3: encode the dimension information;
The dimension values of each dimension level in the dimension value tree are encoded in turn, from left to right, with decimal numbers; encoding is finished once every dimension value has a corresponding code;
In this embodiment, the result of encoding the time dimension is shown in Fig. 2. The real time dimension does not satisfy the assumptions on dimensions made in this patent, so every month is fixed at 31 days and months with fewer than 31 days are padded with null values. February 2010 actually has only 28 days, so 3 null values are used to pad it to 31 days; Fig. 2 shows only one of them, February 31, 2010. In this embodiment the marine science data set contains 3 years of ocean temperature data, so the year level has 3 dimension values, encoded from left to right as 0, 1, 2; the month level has 36 dimension values (January 2010 to December 2012), encoded as 0, 1, ..., 35; the day level has 1116 dimension values, encoded as 0, 1, ..., 1115; and the slot level has 3348 dimension values, encoded as 0, 1, ..., 3347. The coding of the area dimension is shown in Fig. 3. In this embodiment the data set covers a 2° region, so the 1° level contains 2 dimension values, encoded as 0, 1; and so on, down to the lowest level (1/64°), which has 128 dimension values, encoded as 0, 1, ..., 127. The coding of the depth dimension is shown in Fig. 4. In this embodiment the data set covers depths down to 500 meters: the 100m level has 5 dimension values, encoded as 0, 1, ..., 4; the 50m level has 10 dimension values, encoded as 0, 1, ..., 9; and the 10m level has 50 dimension values, encoded as 0, 1, ..., 49.
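Because constraint 3 gives every node of a level the same number of children, the number of codes at each level is the running product of the per-level sibling counts. The sketch below reproduces the counts quoted in this embodiment; the function names are assumptions made for illustration.

```python
from itertools import accumulate
from operator import mul

def values_per_level(sibling_counts):
    """Number of dimension values at each level, from the top level down."""
    return list(accumulate(sibling_counts, mul))

def codes_for_level(sibling_counts, level):
    """Codes 0, 1, ..., n-1 assigned from left to right at the given level."""
    return range(values_per_level(sibling_counts)[level])

time_levels = values_per_level([3, 12, 31, 3])   # Year, Month, Day, Slot
area_levels = values_per_level([2] * 7)          # 1 deg down to 1/64 deg
depth_levels = values_per_level([5, 2, 5])       # 100m, 50m, 10m
```

The largest code at a level is one less than the number of values, for example 3347 at the Slot level.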
Step 1.4: store the encoded dimension information;
For every dimension of the massive data, store the name of each dimension level and the number of sibling nodes at that level; the result is a file describing all the dimension information of the massive data, which is stored in the distributed file system.
In this embodiment, the result of encoding the time dimension is shown in Fig. 2. In the dimension value tree of the time dimension, the year level contains 3 dimension values, which are siblings; their parent node is ALL, i.e. the whole data set, and the number of sibling nodes is 3. The month level contains 36 nodes, of which the nodes under the same year are siblings whose parent is that year's node; for example, 2010 has 12 child nodes, January 2010 to December 2010, so the number of sibling nodes at this level is 12. The day level contains 1116 nodes, of which the nodes under the same month are siblings whose parent is that month's node; for example, January 2010 has 31 child nodes, January 1, 2010 to January 31, 2010, so the number of sibling nodes at this level is 31. Likewise, the number of sibling nodes at the slot level is 3. This embodiment stores the time dimension in an XML file, as shown in Table 1. The file contains the name of the time dimension, Time, together with the name of each dimension level and its number of sibling nodes; for example, the year level is named Year and its number of sibling nodes is 3.
Table 1: The time dimension stored in XML format
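As a hedged illustration of what such a dimension file might contain, the following sketch serializes the dimension name, the level names and the sibling counts with Python's standard `xml.etree.ElementTree`. The element and attribute names (`dimension`, `level`, `name`, `siblings`) are hypothetical and are not taken from Table 1 of the patent.

```python
import xml.etree.ElementTree as ET

def dimension_to_xml(name, levels):
    """Serialize a dimension as XML.

    levels is a list of (level name, sibling count) pairs, top level first.
    The tag and attribute names used here are illustrative assumptions.
    """
    root = ET.Element("dimension", {"name": name})
    for level_name, siblings in levels:
        ET.SubElement(root, "level", {"name": level_name, "siblings": str(siblings)})
    return ET.tostring(root, encoding="unicode")

time_xml = dimension_to_xml(
    "Time", [("Year", 3), ("Month", 12), ("Day", 31), ("Slot", 3)]
)
```

Storing only the sibling count per level is enough to rebuild the code ranges of every level, since the number of values at a level is the product of the sibling counts down to that level.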
Step 2: load the massive data;
Step 2.1: the user establishes, as required, the correspondence between the actual meaning of each dimension value of the massive data and its code, so that any datum in the massive data is represented by dimension codes;
The translator between the meaning of a dimension value and its dimension code is implemented by the user according to the user's own needs. For example, in this embodiment the translator maps the morning of January 1, 2010 on the time dimension to code 0 on the slot level, and maps the value 1/64° on the area dimension to code 0 on the 1/64° level. Common ways of implementing the translator are as follows. When the data volume is large, use a database for storage and lookup: for example, create a table with only 2 columns, storing the meaning of each dimension value and its code. When the number of dimension values is small, use distributed memory for storage and lookup: for example, create a hash table in distributed memory that stores the meaning of each dimension value and its code. When the code and the dimension value are related mathematically, compute the code directly: for example, for the depth dimension, the code can be obtained by dividing the current depth by the step of the corresponding dimension level and truncating the quotient to an integer. For example, the code of the 45m depth on the 10m level is 45/10 = 4.5, truncated to 4.
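The third kind of translator, direct computation, can be sketched as below. Both functions are illustrative assumptions: `depth_code` follows the 45m example from the text and makes no claim about how exact multiples of the step are assigned, and `slot_code` assumes the padded calendar of this embodiment (12 months per year, 31 days per month, 3 slots per day, first year 2010, slots numbered 0 to 2).

```python
def depth_code(depth_m, step_m):
    """Code of a depth on the level with the given step, by truncated division."""
    return int(depth_m // step_m)

def slot_code(year, month, day, slot, first_year=2010):
    """Code on the Slot level for a date in the padded 31-day calendar.

    slot is 0 for morning, 1 for afternoon, 2 for evening (an assumed order).
    """
    months = (year - first_year) * 12 + (month - 1)
    days = months * 31 + (day - 1)
    return days * 3 + slot

code_45m = depth_code(45, 10)              # the example from the text
first_slot = slot_code(2010, 1, 1, 0)      # morning of January 1, 2010
last_slot = slot_code(2012, 12, 31, 2)     # evening of December 31, 2012
```

A computed translator needs no lookup table, which is why the text reserves the database and hash-table variants for dimensions whose values have no arithmetic relation to their codes.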
Step 2.2: all the multidimensional data at the finest granularity form a data cube. Every datum of the massive data is a cell in this data cube, and the information of a cell comprises the coordinate of the cell in the cube and the fact data value that the cell represents. The coordinate of a cell is expressed as:
<code in the first dimension, code in the second dimension, ..., code in the last dimension>;
The finest granularity refers to the data at the lowest dimension level;
In this embodiment the multidimensional massive data form a data cube whose structure is shown in Fig. 6. Every datum of the massive data is a cell in this data cube, and the information of a cell comprises the coordinate of the cell in the cube and the fact data value that the cell represents. For example, take the oceanographic datum: on the morning of January 1, 2010, the temperature at a water depth of 10m in the 1/64° square was 18 °C. This datum is one cell in the data cube, and the information of the cell is saved in the distributed file system. It comprises the coordinate <0, 0, 0> and the fact datum 18 °C. The first 0 is the code on the time dimension: the morning of January 1, 2010 is encoded as 0 on the slot level. The second 0 is the code on the area dimension: 1/64° is encoded as 0 on the 1/64° level. The third 0 is the code on the depth dimension: 10m is encoded as 0 on the 10m level.
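A cell can be modelled as a coordinate plus one fact value. The sketch below is a minimal in-memory representation; the class and field names are assumptions, and the on-disk layout used in the distributed file system is not specified here.

```python
from typing import NamedTuple, Tuple

class Cell(NamedTuple):
    """One cell of the data cube at the finest granularity."""
    coordinate: Tuple[int, int, int]  # (time code, area code, depth code)
    value: float                      # fact datum: ocean temperature in deg C

# The example from the text: morning of January 1, 2010, first 1/64 degree
# square, 10m depth, 18 degrees Celsius.
example_cell = Cell(coordinate=(0, 0, 0), value=18.0)
time_code, area_code, depth_code_ = example_cell.coordinate
```

Because the coordinate consists only of integer codes, cells can be compared, grouped and mapped to blocks with plain integer arithmetic.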
Step 2.3: partition the data cube:
According to the user's query requirements, partition the data cube into data blocks such that the query time is as short as possible, and thereby determine the edge length of the data blocks;
User's query requirements: the query conditions that users commonly use, together with the probability with which each occurs. Taking the user's query requirements as the objective has the following benefits: it guarantees query efficiency, since the system can respond quickly to the user's queries; and it balances the degree of parallelism of the system against the scheduling cost.
In this embodiment the query method is implemented with the MapReduce programming model. The input of MapReduce is the data block files, and the performance of the whole query method is closely related to the block size. The smaller the block files, the better the parallelism and the smaller the amount of data that actually takes part in the computation, but the higher the scheduling cost; choosing the block file size is therefore a critical trade-off. Because it is difficult to enumerate all query conditions and their probabilities of occurrence, this embodiment uses random sampling to extract some query conditions and their probabilities. In addition, the choice of block file size also takes the running environment of the query method into account: since the query method is implemented with MapReduce, characteristics of MapReduce such as file addressing time and data processing time must also be considered. Table 2 lists the definitions of the relevant symbols, where λ_i is a variable, T and N_a are computed results, and the others are known constants.
Table 2: Definitions of the relevant symbols
Figure BDA0000365703220000091
The average number of blocks hit by one query, N_a, can be obtained from the probability with which each query condition occurs, as shown in Formula 1.
Figure BDA0000365703220000092
After the relevant MapReduce factors are taken into account, the average time T consumed by one OLAP operation can be obtained, as shown in Formula 2.
Figure BDA0000365703220000093
From Formula 2, the values of λ_i at which T is minimal can be calculated, which in turn gives the block size.
In the present embodiment, suppose the calculated numbers of blocks on the three dimensions are 6, 4 and 5, so that the edge lengths of a block are 558, 32 and 10: the time dimension is divided with a step of 558, the area dimension with a step of 32, and the depth dimension with a step of 10. The final partition is shown in Figure 7, where the smaller grid squares are cells and the larger ones are data blocks (for example data blocks 5, 11, 18, 19, 20, 21, 22, 23, 42, 43 and 44). For ease of drawing, each data block in the figure contains only 9 cells, whereas in practice a data block contains 558×32×10 = 178560 cells. The present embodiment has 6×4×5 = 120 data blocks in total.
When the calculated number of blocks on a dimension does not divide the length of that dimension evenly, the division step on that dimension is obtained by rounding up. For example, if the calculated numbers of blocks on the three dimensions were 7, 4 and 5, the time dimension could not be divided evenly because 3348/7 is not an integer, and the division step on the time dimension would be 479. The data blocks that fall short are then padded with null values.
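The division-step rule can be illustrated with a short Python sketch. The function names `block_edge_lengths` and `padding_per_dimension` are hypothetical and are not taken from the patent; the cube edge lengths 3348, 128 and 50 are those of the embodiment.

```python
def block_edge_lengths(cube_edges, blocks_per_dim):
    """Division step on each dimension, rounded up when it does not divide evenly."""
    # -(-a // b) is integer ceiling division.
    return [-(-edge // count) for edge, count in zip(cube_edges, blocks_per_dim)]


def padding_per_dimension(cube_edges, blocks_per_dim):
    """Number of null-valued positions needed on each dimension to fill the last block."""
    steps = block_edge_lengths(cube_edges, blocks_per_dim)
    return [step * count - edge
            for step, count, edge in zip(steps, blocks_per_dim, cube_edges)]


CUBE_EDGES = [3348, 128, 50]  # time, area, depth (values from the embodiment)

print(block_edge_lengths(CUBE_EDGES, [6, 4, 5]))     # [558, 32, 10]
print(block_edge_lengths(CUBE_EDGES, [7, 4, 5]))     # [479, 32, 10]
print(padding_per_dimension(CUBE_EDGES, [7, 4, 5]))  # [5, 0, 0]
```

With 7 blocks on the time dimension, 7×479 = 3353 positions are addressed, so 5 of them would be filled with null values.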
Step 2.4: Encode the data blocks partitioned in step 2.3 as follows: divide the coordinate of any cell in the data block by the edge lengths of the data block, dimension by dimension, and take the integer part of each quotient (rounding down); the result is the code of the data block.
Logically, a data block can be regarded as a small cube that contains part of the cells of the data cube, and the data cube can in turn be regarded as composed of data blocks. In the present embodiment, suppose there is a datum stating that the sea temperature at the 40 m layer depth in the 1/32° area on the morning of June 1, 2012 is 15 ℃, with the coordinate <2697, 1, 4>. If the edge lengths of a block are 558, 32 and 10, the coordinate of the data block to which the datum belongs is <4, 0, 0>. The 4 is obtained by dividing the code 2697 on the time dimension by the block length 558 on the time dimension and rounding down; the first 0 is obtained by dividing the code 1 on the area dimension by the block length 32 on the area dimension; and the second 0 is obtained by dividing the code 4 on the depth dimension by the block length 10 on the depth dimension.
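As a minimal sketch of the block encoding in step 2.4, the following Python function maps a cell coordinate to the coordinate of its data block by integer division. The name `block_coordinate` is an illustrative assumption, not an identifier from the patent.

```python
def block_coordinate(cell, block_edges):
    """Coordinate of the data block that contains the given cell (step 2.4)."""
    # Integer (floor) division on every dimension.
    return tuple(c // e for c, e in zip(cell, block_edges))


BLOCK_EDGES = (558, 32, 10)  # block edge lengths of the embodiment

print(block_coordinate((2697, 1, 4), BLOCK_EDGES))  # (4, 0, 0)
```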
Step 2.5: Store the data blocks partitioned in step 2.3 in the distributed file system, using the code established in step 2.4 as the name of the data block file.
To store the data blocks and the cells within them in the distributed file system, the present embodiment first linearizes their codes. Logically, the structure "cube and its cells" or "cube and its blocks" is analogous to the structure "multidimensional array and its elements". Physically, the block is the storage unit of the cube: after the cells in a block are linearized, the block can be stored as a single file. For ease of addressing, both blocks and cells must support linearization and delinearization, and these operations are the same as the linearization and delinearization of a multidimensional array. Given an n-dimensional array whose dimension sizes are denoted <A_1, A_2, ..., A_n>, and an element X whose coordinate in the multidimensional space is denoted (X_1, X_2, ..., X_n), the linearized coordinate is denoted index(X). The linearization method is shown in Formula (3) and the delinearization method in Formula (4).
index(X) = (...((X_n × A_{n-1} + X_{n-1}) × A_{n-2} + ... + X_3) × A_2 + X_2) × A_1 + X_1    (3)
temp_1 = index
X_1 = temp_1 % A_1
temp_2 = ⌊temp_1 / A_1⌋
...
X_n = temp_n % A_n    (4)
The linearization and delinearization of cell codes and data block codes are identical to the linearization and delinearization of a multidimensional array described above. When a data block is linearized or delinearized, the dimension sizes are the numbers of data blocks on each dimension; when a cell is linearized or delinearized, the dimension sizes are the edge lengths of the data cube, that is, the number of dimension values at the lowest dimension level of each dimension of the data cube.
Suppose there is a datum stating that the sea temperature at the 40 m layer depth in the 1/32° area on the morning of June 1, 2012 is 15 ℃. Its coordinate is <2697, 1, 4>, and the edge lengths of the data cube are 3348, 128 and 50, where 3348 is the number of dimension values at the lowest level of the time dimension (the time-of-day level), 128 is the number of dimension values at the lowest level of the area dimension (the 1/64° level), and 50 is the number of dimension values at the lowest level of the layer-depth dimension (the 10 m level). Applying Formula (3) directly gives the linearized coordinate (4×128+1)×3348+2697 = 1720221. For delinearization with Formula (4), initialize temp_1 = 1720221; then X_1 = 1720221 % 3348 = 2697, temp_2 = ⌊1720221/3348⌋ = 513, X_2 = 513 % 128 = 1, temp_3 = ⌊513/128⌋ = 4, and X_3 = 4 % 50 = 4, so the coordinate after delinearization is <2697, 1, 4>.
Suppose a data block has the coordinate <4, 0, 0>, and the numbers of blocks on the three dimensions are 6, 4 and 5. The linearized coordinate of this block is (0×4+0)×6+4 = 4. For delinearization with Formula (4), initialize temp_1 = 4; then X_1 = 4 % 6 = 4, temp_2 = ⌊4/6⌋ = 0, X_2 = 0 % 4 = 0, temp_3 = ⌊0/4⌋ = 0, and X_3 = 0 % 5 = 0, so the coordinate of the block after delinearization is <4, 0, 0>.
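Formula (3) and Formula (4) can be sketched in Python as follows. The function names `linearize` and `delinearize` are illustrative assumptions; the dimension sizes are passed in the order A_1, A_2, ..., A_n.

```python
def linearize(coord, sizes):
    """Formula (3): fold (X_1, ..., X_n) into one index; X_1 varies fastest."""
    index = 0
    # Start from the last dimension and work inwards, as in the nested formula.
    for x, a in zip(reversed(coord), reversed(sizes)):
        index = index * a + x
    return index


def delinearize(index, sizes):
    """Formula (4): recover (X_1, ..., X_n) by repeated modulo and floor division."""
    coord = []
    temp = index
    for a in sizes:
        coord.append(temp % a)
        temp //= a
    return tuple(coord)


CUBE_EDGES = (3348, 128, 50)  # cell level: edge lengths of the data cube
BLOCK_COUNTS = (6, 4, 5)      # block level: number of blocks per dimension

print(linearize((2697, 1, 4), CUBE_EDGES))  # 1720221
print(delinearize(1720221, CUBE_EDGES))     # (2697, 1, 4)
print(linearize((4, 0, 0), BLOCK_COUNTS))   # 4
print(delinearize(4, BLOCK_COUNTS))         # (4, 0, 0)
```

The same two functions serve cells and blocks; only the dimension sizes differ.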
Suppose the data block currently being loaded has the coordinate <4, 0, 0>. The coordinates of the cells in this block then range from <2232, 0, 0> to <2789, 31, 9>. The start value 2232 is the first code of the block on the time dimension, calculated as 4×558; the end value 2789 on the time dimension is calculated as 2232+558−1. Suppose the datum currently being loaded has the coordinate <2769, 31, 9>, which corresponds to the sea temperature at the 90 m depth in the 32/64° area on the morning of June 25, 2012, and that its value is 20 ℃. The newly generated record is then <3963453, 20>, where 3963453 = (9×128+31)×3348+2769. The record <3963453, 20> is appended to the data block file. Once all cells of the block have been written, the block file has been fully imported; its file name is the linearized coordinate of the data block, namely 4.
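The cell range covered by a block follows directly from the block coordinate and the block edge lengths. The sketch below is a hedged illustration; `block_cell_range` is a hypothetical helper name, and it computes the bounds purely from the arithmetic start = block × edge and end = start + edge − 1.

```python
def block_cell_range(block, block_edges):
    """Lowest and highest cell coordinate contained in a data block."""
    low = tuple(b * e for b, e in zip(block, block_edges))
    high = tuple(start + e - 1 for start, e in zip(low, block_edges))
    return low, high


low, high = block_cell_range((4, 0, 0), (558, 32, 10))
print(low)   # (2232, 0, 0)
print(high)  # (2789, 31, 9)
```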
Following the method above, all block files are loaded. The data cube just loaded is named the original cube, which completes the data loading.
Step 3: Query the mass data using online analytical processing (OLAP);
Step 3.1: The user sets the query conditions, which comprise:
Query target: the data cube against which the query is run, i.e. the target cube.
For an initial query, the data cube formed by the mass data is called the original data cube. An initial query produces a next-level cube of the original data cube; after several queries, next-level cubes of that next-level cube, or further next-level cubes of the original cube, may also be produced. Any of these newly produced data cubes can be selected as the basis of a later query.
Query range: the part of the data in the chosen query target that is to be queried.
The client supplies the data range to be analyzed. A user-defined translation method translates the practical meaning of the dimension values in the data range into the codes of those dimension values. If the user wants to query, in the original cube, the temperature data from the 4/64° area at a water depth of 1 m on the afternoon of April 20, 2012 to the 32/64° area at a water depth of 90 m on the morning of June 25, 2012, the corresponding code range after translation is <2570, 4, 0> to <2769, 31, 9>.
Dimension information of the result: the dimension information of the result data cube.
If we want the daily mean sea temperature over all data in the selected range, the dimension information of the result data is as follows: the time dimension is year-month-day; the area dimension is the same as in the target cube; the depth dimension is the same as in the target cube.
Aggregation method: the operation used to aggregate the data within the query range.
For example, if the user wants to query, in the original cube, the daily mean over the range from the 4/64° area at a water depth of 1 m on the afternoon of April 20, 2012 to the 32/64° area at a water depth of 90 m on the morning of June 25, 2012, the aggregation method is averaging. Other common aggregation methods include sum, maximum and minimum.
Step 3.2: Judge whether the query conditions set in step 3.1 satisfy the following constraints:
Constraint 1: the query target exists, and the query range is less than or equal to the data range of the query target;
The query target specified by the user must exist in the system. For example, in the system initialization stage the user has loaded all the mass data and named it the original cube; if the user specifies some other cube in the first query, constraint 1 is not satisfied. Likewise, if the data loaded by the user cover 2010 to 2012, a 2° area and a 500 m layer depth, and a query on this original cube involves data from 2013, the query range exceeds the range of the query target and constraint 1 is not satisfied. For multidimensional data, constraint 1 is violated as soon as the query range on any one dimension falls outside the data range of the query target.
Constraint 2: the number of dimensions of the result data cube should be less than or equal to that of the query target;
If the query target supplied by the user is the original cube, whose dimensions are the time dimension, the area dimension and the layer-depth dimension, but the result data cube set by the user involves some other dimension, for example a record dimension, constraint 2 is not satisfied. If the user specifies only the time dimension and leaves the area and layer-depth dimensions unspecified, constraint 2 is satisfied; this means that the user does not change the granularity of the area and layer-depth dimensions, in other words the result data cube is consistent with the query target on those two dimensions.
Constraint 3: on every dimension, the lowest dimension level of the result data cube should not be lower than the lowest dimension level of the corresponding dimension of the query target;
In the present embodiment, the result data cube is unchanged from the query target on the area and layer-depth dimensions. Only on the time dimension do they differ: the dimension hierarchy of the result data cube is year-month-day, while that of the original data cube is year-month-day-time of day. The result data cube is therefore one level higher than the query target on the time dimension, which satisfies constraint 3. If the dimension information of the query cube and the result data cube were exchanged, the result data cube would be one level lower than the query target on the time dimension, which would not satisfy constraint 3.
Constraint 4: the aggregation method must be distributive or algebraic;
An aggregate function is distributive if it can be computed in the following distributed manner: partition the data into n sets and apply the function to each set to obtain n aggregate values; if applying the function to these n aggregate values gives the same result as applying it to all the data, the function can be computed in a distributed manner. Common distributive aggregation methods include sum, maximum, minimum and count.
An aggregate function is algebraic if it can be computed by an algebraic function of several arguments, each of which can be obtained with a distributive aggregate function. A common algebraic aggregation method is the average.
The present embodiment uses the mean as the aggregation method, which satisfies constraint 4. Using the median as the aggregation method would not satisfy constraint 4.
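The difference between distributive, algebraic and other aggregates can be checked on a small example. The Python sketch below uses made-up temperature values split into three partitions; the values and variable names are illustrative assumptions only.

```python
from statistics import median

# Hypothetical fact data values, already split into three partitions.
partitions = [[18.0, 20.0], [15.0], [16.0, 17.0, 22.0]]
all_values = [v for part in partitions for v in part]

# Distributive: applying max to the per-partition maxima gives the global maximum.
distributed_max = max(max(part) for part in partitions)

# Algebraic: the mean is a function of two distributive aggregates, sum and count.
partials = [(sum(part), len(part)) for part in partitions]
distributed_mean = sum(s for s, _ in partials) / sum(c for _, c in partials)

# Neither: the median of per-partition medians generally differs from the global median.
median_of_medians = median(median(part) for part in partitions)

print(distributed_max == max(all_values))                     # True
print(distributed_mean == sum(all_values) / len(all_values))  # True
print(median_of_medians, median(all_values))                  # 17.0 17.5
```

This is why a mean can be assembled from partial results produced on separate data blocks, while a median cannot.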
If constraints 1, 2 and 4 are satisfied but constraint 3 is not, perform step 3.3;
If constraints 1 to 4 are all satisfied, perform step 3.4;
If neither of the above conditions holds, the query fails and the process ends;
Step 3.3: Convert the query target: find the upper-level cube of the current query target and judge whether it satisfies constraint 3; if not, continue with the upper-level cube of that upper-level cube; if constraint 3 can never be satisfied, the query fails and the query process ends; if an upper-level cube satisfying constraint 3 is found, replace the target cube with it;
When constraint 3 is not satisfied, converting the target cube may make the query satisfy it. A query that cannot be made to satisfy constraint 3 by conversion fails.
Suppose there are three cubes. Cube one has the dimension hierarchies year-month-day-time of day; 1°-1/2°-1/4°-1/8°-1/16°-1/32°-1/64°; and 100m-50m-10m. Cube two has year-month-day; 1°-1/2°-1/4°-1/8°-1/16°-1/32°-1/64°; and 100m-50m-10m. Cube three has year-month-day-time of day; 1°-1/2°-1/4°-1/8°-1/16°-1/32°-1/64°; and 100m-50m-10m. In terms of data range, cube one contains cube two and cube two contains cube three. Suppose cube two was obtained by a query on cube one, and we now want to obtain cube three by a query whose query target is cube two. This query does not satisfy constraint 3, so the target cube must be converted. Because cube two was obtained by querying cube one, cube one is called the upper-level cube of cube two. The target cube is changed to cube one, so that the query now has cube three as the result data cube and cube one as the target cube; constraint 3 is checked again, is satisfied, and the conversion succeeds. If the conversion has reached the original cube as the target cube and constraint 3 still cannot be satisfied, the query fails.
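The target conversion of step 3.3 amounts to walking up a chain of upper-level cubes. The following Python sketch is one possible way to model it, under stated assumptions: the `Cube` class, its fields, and the representation of a lowest dimension level as an integer depth (1 = coarsest) are all hypothetical and not defined in the patent.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Cube:
    name: str
    # For each dimension, the depth of its lowest level (1 = coarsest level).
    lowest_level: Dict[str, int]
    parent: Optional["Cube"] = None  # the upper-level cube this one was queried from


def satisfies_constraint_3(result_levels: Dict[str, int], target: Cube) -> bool:
    """The result may not be finer than the target on any dimension."""
    return all(depth <= target.lowest_level[dim]
               for dim, depth in result_levels.items())


def resolve_target(target: Cube, result_levels: Dict[str, int]) -> Optional[Cube]:
    """Walk up the upper-level cubes until constraint 3 holds; None means the query fails."""
    cube: Optional[Cube] = target
    while cube is not None:
        if satisfies_constraint_3(result_levels, cube):
            return cube
        cube = cube.parent
    return None


# Cube one: year-month-day-time of day (4 levels), 7 area levels, 3 depth levels.
cube_one = Cube("cube one", {"time": 4, "area": 7, "depth": 3})
# Cube two: year-month-day (3 levels), obtained by querying cube one.
cube_two = Cube("cube two", {"time": 3, "area": 7, "depth": 3}, parent=cube_one)
cube_three_levels = {"time": 4, "area": 7, "depth": 3}

print(resolve_target(cube_two, cube_three_levels).name)  # cube one
```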
Step 3.4: Coarse screening of the data: determine the minimum range of data blocks required by the query according to the query range;
The coarse screening process is illustrated in Figure 8. In the present embodiment, the query is run against the original cube, and the data range is the temperature data from the 4/64° area at a water depth of 1 m on the afternoon of April 20, 2012 to the 32/64° area at a water depth of 90 m on the morning of June 25, 2012, whose code range is <2570, 4, 0> to <2769, 31, 9>. In Figure 8, point B has the coordinate <2570, 4, 0>, point H has the coordinate <2769, 31, 9>, and the cuboid ABCD-EFGH is the data range to be queried. To determine the minimum range of data blocks, the coordinates of points B and H are each converted to the coordinate of the block that contains them, using the same conversion as in step 2.4. In the present embodiment the edge lengths of a data block are 558, 32 and 10, so after conversion the minimum range of data blocks is <4, 0, 0> to <4, 0, 0>. Only one block file therefore needs to be scanned. In Figure 8, A'B'C'D'-E'F'G'H' corresponds to the data contained in block <4, 0, 0>. The present embodiment contains 6×4×5 = 120 data blocks in total; with coarse screening, only the data required by the query are processed, and the other 119 data blocks are excluded from the computation.
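Coarse screening can be sketched as converting the two corner points of the query range to block coordinates and enumerating every block in between. The function name `blocks_for_query` is an illustrative assumption.

```python
from itertools import product


def blocks_for_query(low, high, block_edges):
    """Coordinates of all data blocks that overlap the query range [low, high]."""
    first = tuple(c // e for c, e in zip(low, block_edges))
    last = tuple(c // e for c, e in zip(high, block_edges))
    # Every combination of per-dimension block indices between the two corners.
    return list(product(*(range(f, l + 1) for f, l in zip(first, last))))


BLOCK_EDGES = (558, 32, 10)

print(blocks_for_query((2570, 4, 0), (2769, 31, 9), BLOCK_EDGES))      # [(4, 0, 0)]
print(len(blocks_for_query((0, 0, 0), (3347, 127, 49), BLOCK_EDGES)))  # 120
```

A query over the whole cube touches all 120 blocks, while the example query of the embodiment touches only one.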
Step 3.5: Fine screening of the data: scan the data block files selected in step 3.4 and screen all cells in those data blocks against the query range; if a cell lies within the query range, perform step 3.6; otherwise, discard the cell;
In the present embodiment, the method for querying mass data with multidimensional information is implemented with the MapReduce programming model. The MapReduce job consists of four parts, InputFormatter, Mapper, Reducer and OutputFormatter, which correspond respectively to four steps: fine screening of the data, changing the dimension level, aggregation, and output of the result set. The operation flow is shown in Figure 9.
In the present embodiment, the data block that passes coarse screening has the coordinate <4, 0, 0>, and the corresponding data block file is named 4, which is the linearized block coordinate. The InputFormatter reads data at file level, so it reads the data block files that passed coarse screening, here data block file 4. The InputFormatter reads the records of data block 4 one by one; each record has the form <key, value>, where key is the linearized coordinate of the cell being read and value is the value of the datum in that cell. The InputFormatter then delinearizes the key to obtain the coordinate of the datum in the data cube and compares it with the query range: the datum is kept if it lies within the query range and discarded otherwise. For example, suppose the InputFormatter reads the record <3963453, 20>. Delinearizing 3963453 gives the coordinate <2769, 31, 9>; the query range in the present embodiment is <2570, 4, 0> to <2769, 31, 9>, so this datum lies within the query range and proceeds to the next step, in which the dimension level of the cell is changed.
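Fine screening can be sketched in plain Python as follows. This is not Hadoop code: `fine_screen` is a hypothetical stand-in for the record filtering described for the InputFormatter, and the sample records are made-up keys computed with Formula (3) for the cube edge lengths 3348, 128 and 50.

```python
def delinearize(index, sizes):
    """Formula (4): recover the cell coordinate from its linearized key."""
    coord = []
    temp = index
    for a in sizes:
        coord.append(temp % a)
        temp //= a
    return tuple(coord)


def fine_screen(records, sizes, low, high):
    """Keep only the records whose cell coordinate lies inside the query range."""
    for key, value in records:
        coord = delinearize(key, sizes)
        if all(lo <= c <= hi for lo, c, hi in zip(low, coord, high)):
            yield coord, value


CUBE_EDGES = (3348, 128, 50)
QUERY_LOW, QUERY_HIGH = (2570, 4, 0), (2769, 31, 9)

# Hypothetical records of block file 4: (linearized cell key, fact data value).
records = [(3963453, 20), (15962, 18), (1720221, 15)]

for coord, value in fine_screen(records, CUBE_EDGES, QUERY_LOW, QUERY_HIGH):
    print(coord, value)
# (2769, 31, 9) 20
# (2570, 4, 0) 18
```

The third record delinearizes to (2697, 1, 4), whose area code 1 lies below the lower bound 4, so it is discarded.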
Step 3.6: Change the dimension level of the cell: compare the dimension information of the result data cube with that of the target cube, determine which dimension information has changed, and modify the component of the cell coordinate that represents that dimension;
Figure 9 shows the overall process of the mass data query method implemented with MapReduce in the present embodiment. A record in a block file can be abstracted as <key, value>, where key is the linearized coordinate of the cell and value is the value of the datum in the cell. The InputFormatter reads the required block files and delinearizes each key, giving data of the form <(a_1, a_2, ..., a_n), value>, where (a_1, a_2, ..., a_n) is the actual coordinate of the cell. After checking against the query conditions, all qualifying data enter the Mapper, which changes the dimension level. In the Mapper, a datum is changed from <(a_1, a_2, ..., a_n), value> to <(b_1, b_2, ..., b_n), value>, where (b_1, b_2, ..., b_n) is the coordinate of the cell after the dimension level has been changed. This coordinate is finally linearized and passed to the Reducer, so the output of every Mapper has the form <index, value>. The value never changes during this processing. Records with the same index enter the same Reducer, which aggregates all their values with the specified aggregation method and passes the result to the OutputFormatter. Finally, the OutputFormatter generates the block files. Throughout the computation, changing the dimension level of the cell coordinates in the Mapper is the crucial operation.
To change the dimension code of a cell, the present embodiment introduces the concept of the analysis path. Figure 10 shows a schematic of a dimension whose level is to be changed; its dimension hierarchy is year-month-day. The year level has 4 dimension values, 2010, 2011, 2012 and 2013; the month level has 48 values, January 2010, February 2010, ..., December 2013; at the day level every month is fixed at 31 days, giving 1488 values. Take the dimension value February 2, 2010 as an example. Suppose we know that the dimension code of February 2, 2010 is 32 and need to change this dimension value to the month level, that is, to calculate the dimension code of February 2010. For brevity, the dimension value February 2, 2010 is written <2010_1, 2_2, 2_3>, where 2010_1 indicates that at the 1st level of the time dimension, the year level, the dimension value is 2010, and likewise for the other entries. The position of a dimension value among its sibling nodes (counted from left to right, starting from 0) is called the analysis path of that dimension value. For example, the analysis path of <2010_1, 2_2, 2_3> is <0, 1, 1>. At the year level, 2010 is a sibling of all other dimension values and is the first of them, so its analysis path is 0. At the month level, February 2010 is a sibling of the other months of 2010, whose parent node is 2010; it is the 2nd dimension value, but because the analysis path counts from 0, its analysis path is 1. The analysis path of February 2, 2010 is calculated in the same way as that of February 2010. It is known that <2010_1, 2_2, 2_3> is coded 32, and the code of <2010_1, 2_2> is required. The present embodiment introduces the conversion between analysis path and code shown in Formula (5) and Formula (6). The code is denoted code(); for example, the code of 2010 is code(2010_1) = 0. The analysis path is denoted order(); for example, the analysis path of 2010 is order(2010_1) = 0. |l_i| denotes the number of sibling nodes of any dimension value at the i-th dimension level. For example, at the year level |l_1| = 4, since the years are siblings of one another; at the month level |l_2| = 12, since the months of the same year are siblings; and at the day level |l_3| = 31, since the days of the same month are siblings.
code(v_i) = (...((0 + order(v_1)) × |l_2| + order(v_2)) × |l_3| + order(v_3) ...) × |l_i| + order(v_i)    (5)
temp_i = code(v_i)
order(v_i) = temp_i % |l_i|
temp_{i-1} = ⌊temp_i / |l_i|⌋
...
order(v_1) = temp_1 % |l_1|    (6)
It is known that code(2010_1, 2_2, 2_3) = 32. According to Formula (6), temp_3 = 32, order(2010_1, 2_2, 2_3) = 32 % 31 = 1, temp_2 = ⌊32/31⌋ = 1, order(2010_1, 2_2) = 1 % 12 = 1, temp_1 = ⌊1/12⌋ = 0, and order(2010_1) = 0 % 4 = 0. The analysis path of <2010_1, 2_2, 2_3> is therefore <0, 1, 1>. Since the relative position of a node does not change, the analysis path of <2010_1, 2_2> is <0, 1>. Formula (5) then gives code(2010_1, 2_2) = (0+0)×12+1 = 1, so February 2010 is coded 1. This completes the conversion of the code from February 2, 2010 to February 2010.
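The conversion between code and analysis path can be sketched in Python as below. The function names `code_to_path`, `path_to_code` and `roll_up` are hypothetical; `sibling_counts` holds |l_1|, |l_2|, ... from the coarsest level down, here 4 years, 12 months per year and 31 days per month as in the example.

```python
def code_to_path(code, sibling_counts):
    """Formula (6): split a dimension code into its analysis path (coarsest level first)."""
    path = []
    temp = code
    for count in reversed(sibling_counts):
        path.append(temp % count)
        temp //= count
    return tuple(reversed(path))


def path_to_code(path, sibling_counts):
    """Formula (5): fold an analysis path back into a dimension code."""
    code = 0
    for order, count in zip(path, sibling_counts):
        code = code * count + order
    return code


def roll_up(code, sibling_counts, target_depth):
    """Code of the ancestor at target_depth, obtained by truncating the analysis path."""
    path = code_to_path(code, sibling_counts)
    return path_to_code(path[:target_depth], sibling_counts[:target_depth])


TIME_LEVELS = (4, 12, 31)  # |l_1|, |l_2|, |l_3| for year, month, day

print(code_to_path(32, TIME_LEVELS))          # (0, 1, 1)
print(path_to_code((0, 1, 1), TIME_LEVELS))   # 32
print(roll_up(32, TIME_LEVELS, 2))            # 1
```

Rolling the day-level code 32 (February 2, 2010) up to the month level yields 1, the code of February 2010.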
In the present embodiment, the query method for mass data is implemented with MapReduce; the key points of the implementation are updating the cell coordinates and aggregating the measures within cells. For ease of description, a 2-dimensional cube is used to explain the process. Figure 11 shows how the cell coordinates and their values change during execution of the query method, where d_1 and d_2 are the 2 dimensions, each divided into 4 parts. The dimension hierarchy of d_1 is l_1^1-l_2^1-l_3^1 and that of d_2 is l_1^2-l_2^2-l_3^2-l_4^2. The corresponding parameters are shown in Table 3, where |l| is the number of sibling nodes of any node at dimension level l.
Table 3: Parameters for executing the OLAP algorithm
Figure BDA0000365703220000163
Suppose the cell being processed has the coordinate 126 and the fact data 67, and the coordinate of its block is 6. If this cell satisfies the query condition, it is handed to a Mapper with the input <126, 67>. In the Mapper, 126 is first delinearized, giving (10, 6). Suppose the new dimension levels required by the query are l_1^2 and l_2^2. The two dimension codes are updated with Formula (5) and Formula (6), giving (1, 0), which is linearized again to the index 1. The final output of this Mapper is <1, 67>. This step ends once the codes of all input data have been converted.
Step 3.7: For cells with the same coordinate, aggregate the fact data values in the cells according to the aggregation method that was set;
Continuing with the example of Figure 11: if other Mappers output the data <1, 57> and <1, 43>, the input of the Reducer in Figure 11 is <1, {67, 57, 43}>. Suppose summation is used as the aggregation method in this example; the output of the Reducer is then <1, 167>. The output of the Reducer is stored as records in the block file generated by the OutputFormatter.
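The grouping and aggregation performed by the Reducer can be imitated in plain Python. This sketch is not Hadoop code; `reduce_by_index` is a hypothetical stand-in for the shuffle and reduce phases, and the input pairs are the Mapper outputs of the example above.

```python
from collections import defaultdict


def reduce_by_index(mapper_output, aggregate=sum):
    """Group <index, value> pairs by index and aggregate the values of each group."""
    groups = defaultdict(list)
    for index, value in mapper_output:
        groups[index].append(value)
    return {index: aggregate(values) for index, values in groups.items()}


mapper_output = [(1, 67), (1, 57), (1, 43)]

print(reduce_by_index(mapper_output))                 # {1: 167}
print(reduce_by_index(mapper_output, aggregate=max))  # {1: 67}
```

Passing a different `aggregate` function changes the aggregation method without touching the grouping logic.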
Step 3.8: The data aggregated in step 3.7 form the result data cube; return the information of this result data cube to the user, and store the result cube as a new data cube so that it can serve as the query target of a later query.
After all MapReduce tasks have completed, a success message is returned to the client, and the result cube is stored as a new data cube so that it can serve as the query target of a later query.
Although specific embodiments of the present invention have been described above, those skilled in the art should understand that they are illustrative only, and that various changes or modifications may be made to these embodiments without departing from the principle and essence of the present invention. The scope of the present invention is limited only by the appended claims.

Claims (2)

1. A method for querying mass data with multidimensional information, characterized by comprising the following steps:
Step 1: Load the dimension information of the mass data with multidimensional information, which specifically comprises the following steps:
Step 1.1: Examine the dimension information of the mass data and judge whether every piece of dimension information satisfies all of the following three constraints:
Constraint 1: a dimension consists of one and only one dimension hierarchy, where a dimension hierarchy is the ordering relation formed by all of its dimension levels;
Constraint 2: every dimension level of a dimension contains only one dimension attribute, and that dimension attribute contains several dimension values;
Constraint 3: in the dimension value tree formed by all dimension values, sibling nodes have the same number of child nodes;
If the constraints are satisfied, perform step 1.3; otherwise, perform step 1.2;
Step 1.2: Process the dimension information so that each dimension forms a dimension value tree that satisfies the constraints, as follows:
For constraint 1: if there are several dimension hierarchies, discard hierarchies as required and retain only one;
For constraint 2: if a dimension level contains several dimension attributes, discard attributes as required and retain only one;
For constraint 3: if sibling nodes have different numbers of child nodes, add null values so that the sibling nodes have the same number of child nodes;
Step 1.3: Encode the dimension information;
For the dimension values at each level of the dimension value tree, assign codes successively from left to right using decimal numbers; the encoding is finished once every dimension value has a corresponding code;
Step 1.4: Store the codes of the dimension information;
For each piece of dimension information of the mass data, store the name of every dimension level and the number of sibling nodes at that level, finally forming a file of all dimension information of the mass data, which is stored in the distributed file system;
Step 2: Load the mass data;
Step 2.1: The user establishes, as required, the correspondence between the practical meaning of the dimension information of the mass data and the codes of that dimension information, so that any datum in the mass data is represented by codes of dimension information;
Step 2.2: All multidimensional mass data at the finest granularity form a data cube; each datum of the mass data is a cell of this data cube, and the information of a cell comprises the coordinate of the cell in the cube and the fact data value that the cell represents, where the coordinate of a cell is expressed as:
<code of one dimension information, code of another dimension information, ..., code of the last dimension information>;
Step 2.3: Partition the data cube:
According to the users' query requirements, and under the condition that query time is minimized, partition the data cube into data blocks and determine the edge lengths of a data block;
Step 2.4: Encode the data blocks partitioned in step 2.3 as follows: divide the coordinate of any cell in the data block by the edge lengths of the data block, dimension by dimension, and take the integer part of each quotient (rounding down); the result is the code of the data block;
Step 2.5: Store the data blocks partitioned in step 2.3 in the distributed file system, using the code established in step 2.4 as the name of the data block file;
Step 3: Query the mass data using online analytical processing (OLAP);
Step 3.1: The user sets the query conditions, which comprise:
Query target: the data cube against which the query is run, i.e. the target cube;
Query range: the part of the data in the chosen query target that is to be queried;
Dimension information of the result: the dimension information of the result data cube;
Aggregation method: the operation used to aggregate the data within the query range;
Step 3.2: Judge whether the query conditions set in step 3.1 satisfy the following constraints:
Constraint 1: the query target exists, and the query range is less than or equal to the data range of the query target;
Constraint 2: the number of dimensions of the result data cube should be less than or equal to that of the query target;
Constraint 3: on every dimension, the lowest dimension level of the result data cube should not be lower than the lowest dimension level of the corresponding dimension of the query target;
Constraint 4: the aggregation method must be distributive or algebraic;
If constraints 1, 2 and 4 are satisfied but constraint 3 is not, perform step 3.3;
If constraints 1 to 4 are all satisfied, perform step 3.4;
If neither of the above conditions holds, the query fails and the process ends;
Step 3.3: query aim is changed, found the upper level cube of current query aim, judge whether upper level cube meets constraint 3, if do not meet, then continue the upper level cube of this upper level of inquiry cube, if can't meet all the time, inquire about unsuccessfully, finish query script; If found the upper level cube of satisfied constraint 3, should cube replace with target cube;
Step 3.4: coarse screening of the data: determine, according to the query range, the range of the minimum data blocks required by the query;
Step 3.5: fine screening of the data: scan the data block files selected in step 3.4 and screen every cell in those data blocks against the query range; if a cell lies within the query range, perform step 3.6; otherwise, discard the cell;
Step 3.6: change the dimension level of the cell: compare the dimension information of the result data cube with that of the target cube to determine which dimension information has changed, and modify the coordinate that represents that dimension in the cell's coordinates;
Step 3.7: for cells having the same coordinates, aggregate the fact data values in the cells according to the specified aggregation method;
Step 3.8: the data aggregated in step 3.7 form the result data cube; the information of this result data cube is returned to the user, and the result cube is stored as a new data cube so that it can serve as the query target of the next round of queries.
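
Steps 3.4 to 3.7 can be sketched as a small in-memory pipeline. This is a minimal sketch under stated assumptions, not the patented storage format: data blocks are modelled as a dictionary keyed by a hypothetical bounding box, cell coordinates are plain integer tuples, and the per-dimension `roll_up` functions stand in for whatever dimension-coding transformation the method actually uses. Storing the result as a new cube (step 3.8) is left out.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

Coord = Tuple[int, ...]
Box = Tuple[Tuple[int, int], ...]  # per-dimension inclusive (low, high)
Cell = Tuple[Coord, float]         # (coordinates, fact data value)


def intersects(box: Box, query: Box) -> bool:
    """True if the block's bounding box overlaps the query range."""
    return all(
        b_lo <= q_hi and q_lo <= b_hi
        for (b_lo, b_hi), (q_lo, q_hi) in zip(box, query)
    )


def contains(query: Box, coord: Coord) -> bool:
    """True if the cell coordinate lies inside the query range."""
    return all(lo <= c <= hi for (lo, hi), c in zip(query, coord))


def run_query(
    blocks: Dict[Box, List[Cell]],
    query: Box,
    roll_up: Sequence[Callable[[int], int]],
    aggregate: Callable[[List[float]], float] = sum,
) -> Dict[Coord, float]:
    # Step 3.4: coarse screening, keep only blocks that can hold matching cells.
    candidates = [cells for box, cells in blocks.items() if intersects(box, query)]

    groups: Dict[Coord, List[float]] = defaultdict(list)
    for cells in candidates:
        for coord, value in cells:
            # Step 3.5: fine screening, discard cells outside the query range.
            if not contains(query, coord):
                continue
            # Step 3.6: rewrite each coordinate at the result cube's level.
            new_coord = tuple(f(c) for f, c in zip(roll_up, coord))
            groups[new_coord].append(value)

    # Step 3.7: aggregate the values of cells sharing the same coordinates.
    return {coord: aggregate(values) for coord, values in groups.items()}


# Example with two dimensions (day, city); days are grouped into hypothetical
# 4-day periods and all cities are rolled up into a single region.
example_blocks = {
    ((0, 3), (0, 1)): [((0, 0), 1.0), ((1, 0), 2.0), ((3, 1), 4.0)],
    ((4, 7), (0, 1)): [((4, 0), 8.0), ((5, 1), 16.0)],
    ((8, 11), (0, 1)): [((8, 0), 32.0)],
}
example_query = ((1, 5), (0, 1))
example_roll_up = (lambda day: day // 4, lambda city: 0)
example_result = run_query(example_blocks, example_query, example_roll_up)
```

Collecting values per coordinate before aggregating keeps the sketch short; with a distributive function such as sum, the values could instead be folded incrementally as cells are scanned.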
2. The mass data query method with multidimensional information according to claim 1, characterized in that the fine granularity described in step 2.2 refers to the data pointed to by the lowest dimension level.
CN201310350126.7A 2013-08-13 2013-08-13 Mass data query method with multidimensional information Expired - Fee Related CN103425772B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310350126.7A CN103425772B (en) 2013-08-13 2013-08-13 Mass data query method with multidimensional information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310350126.7A CN103425772B (en) 2013-08-13 2013-08-13 Mass data query method with multidimensional information

Publications (2)

Publication Number Publication Date
CN103425772A true CN103425772A (en) 2013-12-04
CN103425772B CN103425772B (en) 2016-08-10

Family

ID=49650511

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310350126.7A Expired - Fee Related CN103425772B (en) 2013-08-13 2013-08-13 Mass data query method with multidimensional information

Country Status (1)

Country Link
CN (1) CN103425772B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102663117A (en) * 2012-04-18 2012-09-12 中国人民大学 OLAP (On Line Analytical Processing) inquiry processing method facing database and Hadoop mixing platform
US20120317137A1 (en) * 2010-01-28 2012-12-13 Guangzhou Ccm Information Science & Technology Co., Ltd. Method for multi-dimensional database storage and inquiry

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103942266A (en) * 2014-03-27 2014-07-23 上海巨数信息科技有限公司 Data analysis method capable of achieving self-defining of complex service computational logic on basis of OLAP
CN105302838A (en) * 2014-07-31 2016-02-03 华为技术有限公司 Classification method as well as search method and device
CN105302838B (en) * 2014-07-31 2019-01-15 华为技术有限公司 Classification method, lookup method and equipment
CN104462446A (en) * 2014-12-15 2015-03-25 北京国双科技有限公司 Data cube based two-dimensional visual data display method and device
CN104408196A (en) * 2014-12-15 2015-03-11 北京国双科技有限公司 Two-dimensional visual data display method and device based on data cubes
CN104408200A (en) * 2014-12-15 2015-03-11 北京国双科技有限公司 Visual data display method and device based on data cubes
CN104408181A (en) * 2014-12-15 2015-03-11 北京国双科技有限公司 Visual data display method and device based on data cubes
CN104462450A (en) * 2014-12-15 2015-03-25 北京国双科技有限公司 Data cube based two-dimensional visual data display method and device
CN104391997A (en) * 2014-12-15 2015-03-04 北京国双科技有限公司 Data cube based visual data display method and device
CN104408186A (en) * 2014-12-15 2015-03-11 北京国双科技有限公司 Two-dimensional visual data display method and device based on data cubes
CN104408184A (en) * 2014-12-15 2015-03-11 北京国双科技有限公司 Two-dimensional visual data display method and device based on data cubes
CN105117497A (en) * 2015-09-28 2015-12-02 上海海洋大学 Ocean big data master-slave index system and method based on Spark cloud network
CN105117497B (en) * 2015-09-28 2018-12-07 上海海洋大学 Ocean big data master-slave index system and method based on Spark cloud network
CN106897293A (en) * 2015-12-17 2017-06-27 中国移动通信集团公司 Data processing method and device
CN106897293B (en) * 2015-12-17 2020-09-11 中国移动通信集团公司 Data processing method and device
CN105677840A (en) * 2016-01-06 2016-06-15 东北大学 Data query method based on multi-dimensional increasing data model
CN105677840B (en) * 2016-01-06 2019-02-05 东北大学 Data query method based on multi-dimensional increasing data model
CN105843842A (en) * 2016-03-08 2016-08-10 东北大学 Multi-dimensional gathering querying and displaying system and method in big data environment
CN105956071A (en) * 2016-04-28 2016-09-21 乐视控股(北京)有限公司 Memory optimization method and memory optimization device for OLAP aggregation operation
CN106528847A (en) * 2016-11-24 2017-03-22 北京集奥聚合科技有限公司 Multi-dimensional processing method and system for massive data
CN110019541A (en) * 2017-07-21 2019-07-16 杭州海康威视数字技术股份有限公司 Data query method, apparatus and computer readable storage medium
CN110019541B (en) * 2017-07-21 2022-04-05 杭州海康威视数字技术股份有限公司 Data query method and device and computer readable storage medium
CN108388640A (en) * 2018-02-26 2018-08-10 北京环境特性研究所 Data conversion method, device and data processing system
CN109408459A (en) * 2018-08-29 2019-03-01 先临三维科技股份有限公司 3D data processing method, device, computer equipment and storage medium
CN109241236A (en) * 2018-10-16 2019-01-18 中国海洋大学 Distributed organization and query processing method for ocean geospatial multi-dimensional time-varying field data
CN109471893A (en) * 2018-10-24 2019-03-15 上海连尚网络科技有限公司 Querying method, equipment and the computer readable storage medium of network data
CN110543940A (en) * 2019-08-29 2019-12-06 中国人民解放军国防科技大学 Neural circuit body data processing method, system and medium based on hierarchical storage
CN116821174A (en) * 2023-07-17 2023-09-29 深圳计算科学研究院 Data query method and device based on logic data block

Also Published As

Publication number Publication date
CN103425772B (en) 2016-08-10

Similar Documents

Publication Publication Date Title
CN103425772A (en) Method for searching massive data with multi-dimensional information
CN110059067B (en) Water conservancy space vector big data storage management method
Goring et al. Neotoma: A programmatic interface to the Neotoma Paleoecological Database
Peuquet et al. An event-based spatiotemporal data model (ESTDM) for temporal analysis of geographical data
CN106933833B (en) Method for quickly querying position information based on spatial index technology
Song et al. HaoLap: A Hadoop based OLAP system for big data
CN108804602A (en) Distributed spatial data storage and computation method based on SPARK
CN102722531B (en) Query method based on regional bitmap indexes in cloud environment
CN102306180A (en) Modeling method based on mass laser radar grid point cloud data
CN109635068A (en) Method for efficient organization and fast retrieval of massive remote sensing data in a cloud computing environment
CN107220285A (en) Temporal index construction method for massive trajectory point data
CN104199986A (en) Vector data spatial indexing method based on HBase and GeoHash
CN113946700A (en) Space-time index construction method and device, computer equipment and storage medium
CN103995861A (en) Distributed data device, method and system based on spatial correlation
CN108009265B (en) Spatial data indexing method in cloud computing environment
Froese et al. The border k-means clustering algorithm for one dimensional data
CN106021567A (en) Mass vector data partition method and system based on Hadoop
CN113010620A (en) Natural resource data index statistical method and system based on geographical multilevel grids
CN114048204A (en) Beidou grid space indexing method and device based on database inverted index
CN103678550A (en) Mass data real-time query method based on dynamic index structure
CN111104457A (en) Massive space-time data management method based on distributed database
CN102609490A (en) Column-storage-oriented B+ tree index method for DWMS (data warehouse management system)
CN116775661A (en) Big space data storage and management method based on Beidou grid technology
CN105117442A (en) Probability based big data query method
CN107992584A (en) Ocean big data classification parsing and gridded storage method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160810