CN105787090A - Index building method and system of OLAP system of electric data - Google Patents

Publication number
CN105787090A
Authority
CN
China
Prior art keywords
data
electric power
prefix trees
index
regional code
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610147684.7A
Other languages
Chinese (zh)
Inventor
崔蔚
王亚玲
刘万涛
刘越
虎嵩林
黄高攀
张明明
夏飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NANJING NANRUI GROUP CO
State Grid Corp of China SGCC
State Grid Information and Telecommunication Co Ltd
Institute of Computing Technology of CAS
State Grid Zhejiang Electric Power Co Ltd
State Grid Liaoning Electric Power Co Ltd
Information and Telecommunication Branch of State Grid Jiangsu Electric Power Co Ltd
Original Assignee
NANJING NANRUI GROUP CO
State Grid Corp of China SGCC
State Grid Information and Telecommunication Co Ltd
Institute of Computing Technology of CAS
State Grid Zhejiang Electric Power Co Ltd
State Grid Liaoning Electric Power Co Ltd
Information and Telecommunication Branch of State Grid Jiangsu Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NANJING NANRUI GROUP CO, State Grid Corp of China SGCC, State Grid Information and Telecommunication Co Ltd, Institute of Computing Technology of CAS, State Grid Zhejiang Electric Power Co Ltd, State Grid Liaoning Electric Power Co Ltd, Information and Telecommunication Branch of State Grid Jiangsu Electric Power Co Ltd filed Critical NANJING NANRUI GROUP CO
Priority to CN201610147684.7A priority Critical patent/CN105787090A/en
Publication of CN105787090A publication Critical patent/CN105787090A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06 Energy or water supply

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Economics (AREA)
  • Public Health (AREA)
  • Water Supply & Treatment (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides an index building method and system for an OLAP system for electric power data. The method comprises: building a prefix tree according to archive data in the electric power data; reorganizing the electricity information acquisition data of the electric power data within a data file according to the leaf nodes of the prefix tree, thereby generating multiple data slices in the data file; and determining the position of each data slice in the data file and establishing mapping relations between the data slice positions and the leaf nodes, which serve as the index of the OLAP system for the electric power data. Building the OLAP index in this way reorganizes the electricity information acquisition data so that it can be read sequentially more efficiently. The index can be built at the finest granularity of the acquisition data, so irrelevant data can be filtered out during a query and query efficiency is greatly improved.

Description

Index building method and system for an OLAP system for electric power data
Technical field
The present invention relates to the technical field of data processing, and in particular to an index building method and system for an OLAP system for electric power data.
Background
In the electric power field, OLAP (Online Analytical Processing) queries need to be run on large volumes of electricity information acquisition data. Such queries involve large data volumes, frequent multi-table joins, and complex SQL structure.
In the prior art, electricity information acquisition data is queried and analyzed by an acquisition data analysis system, which comprises an electricity information acquisition system and multiple acquisition terminals. The acquisition system includes an OLAP system and HDFS (Hadoop Distributed File System). After data is collected, the acquisition system stores in HDFS both the acquisition data gathered by the acquisition terminals and the archive data kept in a relational database, such as user information and power equipment information (smart meters, transformers, etc.). The OLAP system then uses the in-memory computing framework Spark, and the SQL tool Shark built on it, to run OLAP queries on the acquisition data in HDFS.
However, because the acquisition data is collected at a very high frequency, the volume of data stored in HDFS grows rapidly. The SQL tool Shark that the OLAP system uses for OLAP queries supports only coarse-grained partitioning and does not support fine-grained indexes, so query efficiency is very low.
Summary of the invention
In view of this, the present invention provides an index building method and system for an OLAP system for electric power data, in order to improve the query speed and analysis performance of the acquisition data analysis system and to meet the query demands of acquisition-type big data.
The invention provides an index building method for an OLAP system for electric power data, comprising:
building a prefix tree according to the archive data in the electric power data;
reorganizing, according to the leaf nodes of the prefix tree, the electricity information acquisition data of the electric power data within a data file, and generating multiple data slices in the data file;
determining the data slice position of each data slice in the data file, and establishing mapping relations between the data slice positions and the leaf nodes of the prefix tree as the index of the OLAP system for the electric power data.
Preferably, building the prefix tree according to the archive data in the electric power data comprises:
reading the archive data stored in a relational database management system;
extracting a region hierarchy code table of the electric power data from the archive data, the region hierarchy code table including, for each item of acquisition data, the region code of the region it belongs to and the region level that the region code belongs to;
building the prefix tree according to the region hierarchy code table, such that the region codes in the region hierarchy code table correspond one-to-one to the leaf nodes of the prefix tree, and the region levels in the region hierarchy code table correspond one-to-one to the levels of the prefix tree.
Preferably, reorganizing the acquisition data within the data file according to the leaf nodes of the prefix tree and generating multiple data slices in the data file comprises:
reading the region code in each item of acquisition data, and identifying the acquisition data that contain the same region code;
reorganizing each item of acquisition data within the data file, gathering the acquisition data that contain the same region code into one data slice, and recording the data slice position of each data slice in the data file.
Preferably, determining the data slice position of each data slice in the data file and establishing the mapping relations between the data slice positions and the leaf nodes of the prefix tree as the index of the OLAP system for the electric power data comprises:
determining the data slice position of each data slice in the data file, and, according to the region code of the acquisition data in each data slice and the one-to-one correspondence between the region codes in the region hierarchy code table and the leaf nodes of the prefix tree, establishing the mapping relations between each data slice position and the leaf nodes of the prefix tree as the index of the OLAP system for the electric power data.
Preferably, the method further comprises:
saving the prefix tree in the memory of the server that hosts the OLAP system for the electric power data, and saving the index of the OLAP system for the electric power data in a distributed storage system.
Another aspect of the present invention discloses an index building system for an OLAP system for electric power data, comprising:
a prefix tree building module, configured to build a prefix tree according to the archive data in the electric power data;
a data reorganization module, configured to reorganize, according to the leaf nodes of the prefix tree, the electricity information acquisition data of the electric power data within a data file, and to generate multiple data slices in the data file;
an index building module, configured to determine the data slice position of each data slice in the data file, and to establish mapping relations between the data slice positions and the leaf nodes of the prefix tree as the index of the OLAP system for the electric power data.
Preferably, the prefix tree building module comprises:
a first reading unit, configured to read the archive data stored in a relational database management system;
an extraction unit, configured to extract a region hierarchy code table of the electric power data from the archive data, the region hierarchy code table including, for each item of acquisition data, the region code of the region it belongs to and the region level that the region code belongs to;
a building unit, configured to build the prefix tree according to the region hierarchy code table, such that the region codes in the region hierarchy code table correspond one-to-one to the leaf nodes of the prefix tree, and the region levels in the region hierarchy code table correspond one-to-one to the levels of the prefix tree.
Preferably, the data reorganization module comprises:
a second reading unit, configured to read the region code in each item of acquisition data and to identify the acquisition data that contain the same region code;
a data reorganization unit, configured to reorganize each item of acquisition data within the data file, gather the acquisition data that contain the same region code into one data slice, and record the data slice position of each data slice in the data file.
Preferably, the index building module comprises:
an index building unit, configured to determine the data slice position of each data slice in the data file, and, according to the region code of the acquisition data in each data slice and the one-to-one correspondence between the region codes in the region hierarchy code table and the leaf nodes of the prefix tree, to establish the mapping relations between each data slice position and the leaf nodes of the prefix tree as the index of the OLAP system for the electric power data.
Preferably, in the system:
the prefix tree is saved in the memory of the server that hosts the OLAP system for the electric power data, and the index of the OLAP system for the electric power data is saved in a distributed storage system.
As the above technical solution shows, this application provides an index building method and system for an OLAP system for electric power data. The method builds a prefix tree according to the archive data in the electric power data; reorganizes, according to the leaf nodes of the prefix tree, the acquisition data of the electric power data within a data file, generating multiple data slices in the data file; and determines the data slice position of each data slice in the data file and establishes mapping relations between the data slice positions and the leaf nodes of the prefix tree as the index of the OLAP system for the electric power data. Building the OLAP index thus reorganizes the acquisition data and improves sequential read efficiency. Because the index is built at the finest granularity of the acquisition data, irrelevant data can be filtered out at query time, which greatly improves query efficiency.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing them are briefly introduced below. The drawings described below are merely embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of an index building method for an OLAP system for electric power data provided by an embodiment of the present application;
Fig. 2 is a schematic diagram of a specific structure in an embodiment of the present application;
Fig. 3 is a schematic diagram of another specific structure in an embodiment of the present application;
Fig. 4 is another flow chart of an index building method for an OLAP system for electric power data provided by an embodiment of the present invention;
Fig. 5 is a structural schematic diagram of a data processing system provided by an embodiment of the present application.
Detailed description of the invention
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
In the prior art, the SQL tool Shark that the OLAP system uses for OLAP queries supports only coarse-grained partitioning and does not support fine-grained indexes, which makes queries very inefficient. To solve this technical problem, this application provides an index building method for an OLAP system for electric power data.
Fig. 1 is a flow chart of an index building method for an OLAP system for electric power data provided by an embodiment of the present application.
In the prior art, analysis of electricity information acquisition data falls into two categories: offline analysis and OLAP analysis. Acquisition data is the electricity usage data reported by smart meters. Offline analysis covers business scenarios such as line loss analysis, data integrity rate calculation, electricity quantity calculation, and anomaly analysis. Its business logic is complex and its run time is long (tens of minutes to several hours), so to avoid resource contention it is generally run during non-working hours at night on the previous day's acquisition data.
OLAP analysis refers to the various ad-hoc queries issued by front-end business systems. These typically require join operations between the report data generated by offline analysis, or the reported data, and the archive data. Examples include querying the meter readings of the 10 dedicated-transformer users with the highest electricity consumption in a city, or querying the details of the terminals currently installed for the dedicated-transformer users of a power supply bureau. Such queries are ad hoc and have strict response-time requirements (several seconds to several minutes). In the concrete system design, offline analysis uses Hive and the MapReduce computing framework, with Oozie for task scheduling; the generated report data is written back to HDFS and then handed to the OLAP system for online analytical queries.
OLAP analysis uses the in-memory computing framework Spark and the SQL tool Shark built on it. Because it uses distributed in-memory computing, Spark suits analytical computations over massive data with demanding response-time requirements. Shark provides an SQL-like interface on top of Spark, which lowers the barrier for users and shortens the migration of existing business logic based on relational databases. However, native Shark does not support fine-grained indexes, so query efficiency is extremely low.
Therefore, to solve this technical problem of the prior art, this application discloses an index building method for an OLAP system for electric power data, comprising:
S101: building a prefix tree according to the archive data in the electric power data;
S102: reorganizing, according to the leaf nodes of the prefix tree, the acquisition data of the electric power data within a data file, and generating multiple data slices in the data file;
S103: determining the data slice position of each data slice in the data file, and establishing mapping relations between the data slice positions and the leaf nodes of the prefix tree as the index of the OLAP system for the electric power data.
In this embodiment, a prefix tree (Trie) is built according to the archive data in the electric power data. The archive data is stored in a relational database management system (RDBMS). The RDBMS holds the region hierarchy code table of the electric power data, which includes, for each item of acquisition data, the region code of the region it belongs to and the region level that the region code belongs to. Each layer of the prefix tree corresponds to one level of the region hierarchy code, and the leaf nodes represent the finest-granularity level codes. The index is built at the leaf nodes: each leaf node points to all records whose region hierarchy code equals the value of that node.
Referring to Fig. 2, records with the same region hierarchy code are usually scattered at random positions in the data files. Before reorganization, these records may be placed in different data files in the distributed file system HDFS and thus stored on different nodes. When Shark retrieves a given region hierarchy code, a large number of random disk reads are produced, which hurts query performance.
When building the index, the present invention generates multiple data slices so that all records with the same region code are stored contiguously; the data slice holding these records is called a Slice. This converts inefficient random disk reads into efficient sequential reads and improves query efficiency.
Finally, the data slice position of each data slice in the data file is determined, and mapping relations between the data slice positions and the leaf nodes of the prefix tree are established as the index of the OLAP system for the electric power data.
In practice, referring to Fig. 3, Fig. 3 shows a simplified prefix tree (Trie) model corresponding to the region hierarchy codes of the acquisition data. Apart from the root node, the three layers of the tree correspond to city-level, district-level, and power-supply-station-level codes respectively. As shown in Fig. 3, the leaf node with value 1010101 points to all records whose region hierarchy code is 1010101. The Trie structure of the index, TrieIndex, is built by reading an external table and is kept in the server's memory to speed up access. The mapping table between leaf nodes and data slice positions, by contrast, is stored in the HBase database. Each leaf node has to keep the data slice positions for every indexed table, and the acquisition system contains a large number of tables, so the complete Trie tree together with the mapping relations cannot all be held in memory. Moreover, an overly large prefix tree structure in memory would cause problems such as slow system startup and slow fault recovery. Records in the HBase table are key-value pairs: the key is a finest-granularity region hierarchy code, and the value is the file name and offsets of all records with that code.
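The key-value layout described above can be illustrated with a minimal Python sketch. It is not the patent's implementation: a plain dictionary stands in for the HBase table, a local temporary file stands in for an HDFS data file, and the names `read_slice`, `lookup`, and `demo`, the file name `part-00000`, and the convention that the end offset is exclusive are all assumptions made for illustration.

```python
import os
import tempfile
from typing import Dict, List, Tuple

# (file name, start offset, end offset); an exclusive end offset is an assumption
SliceLocation = Tuple[str, int, int]


def read_slice(path: str, start: int, end: int) -> List[str]:
    """Read one contiguous slice with a single seek followed by a sequential read."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start).decode("utf-8").splitlines()


def lookup(index: Dict[str, List[SliceLocation]], region_code: str, base_dir: str) -> List[str]:
    """Return all records for one finest-granularity region code."""
    records: List[str] = []
    for filename, start, end in index.get(region_code, []):
        records.extend(read_slice(os.path.join(base_dir, filename), start, end))
    return records


def demo() -> List[str]:
    """Build a tiny reorganized file plus its index, then look up one region code."""
    base_dir = tempfile.mkdtemp()
    with open(os.path.join(base_dir, "part-00000"), "wb") as f:
        # Records with the same region code are already stored contiguously.
        f.write(b"a,1010101\nb,1010101\nc,1010102\n")
    index = {
        "1010101": [("part-00000", 0, 20)],
        "1010102": [("part-00000", 20, 30)],
    }
    return lookup(index, "1010101", base_dir)
```

Because each region code maps to a byte range, a lookup touches only the bytes of the matching slice instead of scanning the whole file.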
By building the OLAP index, the present invention thus reorganizes the acquisition data and improves sequential read efficiency. Because the index is built at the finest granularity of the acquisition data, irrelevant data can be filtered out at query time, which greatly improves query efficiency.
Fig. 4 is another flow chart of an index building method for an OLAP system for electric power data provided by an embodiment of the present invention.
The index building method for an OLAP system for electric power data provided by the present invention comprises:
S401: reading the archive data stored in a relational database management system;
S402: extracting a region hierarchy code table of the electric power data from the archive data, the region hierarchy code table including, for each item of acquisition data, the region code of the region it belongs to and the region level that the region code belongs to;
S403: building the prefix tree according to the region hierarchy code table, such that the region codes in the region hierarchy code table correspond one-to-one to the leaf nodes of the prefix tree, and the region levels in the region hierarchy code table correspond one-to-one to the levels of the prefix tree;
S404: reorganizing, according to the leaf nodes of the prefix tree, the acquisition data of the electric power data within a data file, and generating multiple data slices in the data file;
S405: determining the data slice position of each data slice in the data file, and establishing mapping relations between the data slice positions and the leaf nodes of the prefix tree as the index of the OLAP system for the electric power data.
Preferably, reorganizing the acquisition data within the data file according to the leaf nodes of the prefix tree and generating multiple data slices in the data file comprises:
reading the region code in each item of acquisition data, and identifying the acquisition data that contain the same region code;
reorganizing each item of acquisition data within the data file, gathering the acquisition data that contain the same region code into one data slice, and recording the data slice position of each data slice in the data file.
Preferably, determining the data slice position of each data slice in the data file and establishing the mapping relations between the data slice positions and the leaf nodes of the prefix tree as the index of the OLAP system for the electric power data comprises:
determining the data slice position of each data slice in the data file, and, according to the region code of the acquisition data in each data slice and the one-to-one correspondence between the region codes in the region hierarchy code table and the leaf nodes of the prefix tree, establishing the mapping relations between each data slice position and the leaf nodes of the prefix tree as the index of the OLAP system for the electric power data.
The prefix tree is saved in the memory of the server that hosts the OLAP system for the electric power data, and the index of the OLAP system for the electric power data is saved in a distributed storage system.
In this embodiment, building the index is divided into two main steps. The first step creates the Trie tree; the second step reorganizes the data and establishes the leaf-node mapping relations.
In the first step, the region hierarchy code table of the electric power data is extracted from the archive data; it includes, for each item of acquisition data, the region code of the region it belongs to and the region level that the region code belongs to. The prefix tree is built according to the region hierarchy code table, such that the region codes in the table correspond one-to-one to the leaf nodes of the prefix tree, and the region levels in the table correspond one-to-one to the levels of the prefix tree.
As shown in Algorithm 1, the table that stores the region codes is first read from the relational database management system RDBMS (line 1) and an empty Trie tree is declared (line 2). A prefix tree, i.e. the Trie tree, is then created from all the region codes read (lines 3-5), and finally the created Trie tree is returned (line 6). Once the Trie tree has been built, it is kept in the server's memory, which speeds up index access during query processing.
Algorithm 1. Trie tree building process.
Input: the external table of the region hierarchy (stored in the RDBMS)
Output: Trie tree
1. regionCodeSet = readFromRDBMS();
2. TrieTree = emptyTrie();
3. FOR regionCode IN regionCodeSet DO
4.   TrieTree.insert(regionCode);
5. END FOR
6. RETURN TrieTree.
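Algorithm 1 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the patent's code: the RDBMS read is replaced by an in-memory list of codes, and the fixed digit widths per level `(3, 2, 2)` for city, district, and power supply station are hypothetical, chosen only so that a code such as 1010101 splits into three levels. The class and function names are likewise illustrative.

```python
from typing import Dict, Iterable, List


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_leaf = False  # True when a complete region code ends at this node


class TrieTree:
    def __init__(self, level_widths: Iterable[int] = (3, 2, 2)) -> None:
        # Assumed layout: each hierarchy level occupies a fixed number of digits.
        self.level_widths = list(level_widths)
        self.root = TrieNode()

    def _segments(self, code: str) -> List[str]:
        """Split a code (or level-aligned prefix) into one segment per hierarchy level."""
        segments, pos = [], 0
        for width in self.level_widths:
            if pos >= len(code):
                break
            segments.append(code[pos:pos + width])
            pos += width
        return segments

    def insert(self, region_code: str) -> None:
        node = self.root
        for segment in self._segments(region_code):
            node = node.children.setdefault(segment, TrieNode())
        node.is_leaf = True

    def search(self, prefix: str) -> List[str]:
        """Return every finest-granularity code below a level-aligned prefix."""
        node = self.root
        for segment in self._segments(prefix):
            node = node.children.get(segment)
            if node is None:
                return []
        found, stack = [], [(node, prefix)]
        while stack:
            current, code = stack.pop()
            if current.is_leaf:
                found.append(code)
            for segment, child in current.children.items():
                stack.append((child, code + segment))
        return sorted(found)


def build_trie(region_codes: Iterable[str]) -> TrieTree:
    """Mirror of Algorithm 1: start from an empty Trie and insert every region code."""
    trie = TrieTree()
    for code in region_codes:
        trie.insert(code)
    return trie
```

With this layout, searching for a city-level prefix returns all power-supply-station codes under that city, which is the set of leaf nodes a query would need.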
In the second step, the acquisition data of the electric power data is reorganized within the data file according to the leaf nodes of the prefix tree, generating multiple data slices in the data file. The data slice position of each data slice in the data file is determined, and mapping relations between the data slice positions and the leaf nodes of the prefix tree are established as the index of the OLAP system for the electric power data.
The data tables are first partitioned by date, and the index (TrieIndex) is then built on each partition. Building the index does not increase the number of directories under a table; it only reorganizes the records in the data files, i.e. changes their relative order. Because the volume of acquisition data is huge, index building runs as a MapReduce task in parallel on multiple nodes, which improves efficiency. The MapReduce task has two stages: the Map stage and the Reduce stage.
The flow chart of this task is shown in Fig. 5. Building the index amounts to reorganizing the data of the whole table. Since this cannot reduce the amount of data that has to be written out, the MapReduce job that builds the index does not need a Combine function.
The logic of the Map stage is shown in Algorithm 2. In the Map function, for each record, the column number of the region code is obtained (the region code is in a different column in different tables; this information is available from the metadata; line 1) and used to extract the value of the record's region code field (line 2). That value is used as the Key and the whole record as the Value, and the pair is sent to the Reduce function (line 3).
Algorithm 2. TrieIndex building process: Map function.
Input: each row of the acquisition-type electric power data; the key is the offset of the row and the value is the row itself, line
Output: the key is the region code and the value is the whole record
1. idx = conf.get(REGION_CODE_COL);
2. key = getRegionCode(line, idx);
3. Emit(key, line);
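As a rough Python illustration of the Map function in Algorithm 2, the sketch below emits a (region code, record) pair for each input row. It assumes comma-separated records and a plain dictionary for the job configuration; the key name `region.code.col` and the function name `map_record` are hypothetical.

```python
from typing import Dict, Iterator, Tuple

REGION_CODE_COL = "region.code.col"  # hypothetical configuration key


def map_record(offset: int, line: str, conf: Dict[str, int],
               sep: str = ",") -> Iterator[Tuple[str, str]]:
    """Emit (region code, whole record) for one input row.

    `offset` is the input key (the row's byte offset); it is not needed
    to compute the output, matching Algorithm 2.
    """
    idx = int(conf[REGION_CODE_COL])     # line 1: column number from the metadata
    region_code = line.split(sep)[idx]   # line 2: extract the region code field
    yield region_code, line              # line 3: Emit(key, line)
```

For example, with the region code in column 1, the row `m001,1010101,2016-03-01,12.5` is emitted under the key `1010101`, so the shuffle phase groups all rows of one region together.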
The Reduce stage uses the output of the Map stage to complete the data reorganization; its logic is shown in Algorithm 3. The Reduce function records the current offset of the output file as the start offset (line 1), initializes the Slice's end offset to -1 and the current Slice size to 0, and obtains the name of the current output file (lines 2-4). For a given Key (i.e. a region code), it iterates over all records with that Key; for each record it computes the record's length, adds it to the Slice size variable, and writes the record directly to the output file (lines 5-8). Next, it computes the end offset as the sum of the start offset and the Slice size (line 9). Finally, with the region code as the Key and the triple of output file name, start offset, and end offset as the Value, it writes the entry to HBase (line 10). After index building, the relations between the finest-grained region codes and the corresponding data slices (Slices) are thus stored in the index table in HBase.
Algorithm 3. Reduce function of the TrieIndex building process.
Input: the key is the region code region; the value is the list lineList of records for that region code
Output: index information written to HBase; whole records written to the output file
1. start = current offset of the output file;
2. end = -1;
3. sliceSize = 0;
4. filename = name of the output file;
5. FOR line IN lineList DO
6.     sliceSize += sizeOf(line);
7.     output(line);
8. END FOR
9. end = start + sliceSize;
10. HBase.put(region, <filename, start, end>);
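To make the offset bookkeeping concrete, the Reduce step can be sketched as below. This is a sketch under stated assumptions: an in-memory `io.BytesIO` stands in for the HDFS output file, a plain dict stands in for the HBase index table, and the function name `build_slices` and the sample data are hypothetical. The end offset is treated as exclusive.

```python
import io
from itertools import groupby


def build_slices(pairs, out, filename, index):
    """Write records grouped by region code and record each group's byte range."""
    by_key = sorted(pairs, key=lambda kv: kv[0])  # stands in for the shuffle/sort
    for region, group in groupby(by_key, key=lambda kv: kv[0]):
        start = out.tell()            # current offset of the output file
        slice_size = 0
        for _, line in group:
            data = line.encode("utf-8")
            slice_size += len(data)   # accumulate the Slice size
            out.write(data)           # write the record straight to the file
        end = start + slice_size      # end offset (exclusive)
        index[region] = (filename, start, end)
    return index


pairs = [
    ("330102", "u001,330102,12.5\n"),
    ("330103", "u002,330103,7.1\n"),
    ("330102", "u003,330102,3.3\n"),
]
out = io.BytesIO()
index = build_slices(pairs, out, "part-00000", {})
```

After the run, all records with the same region code are contiguous in the output, so a Slice can later be fetched with a single seek and sequential read.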
After the TrieIndex has been built, the process by which Shark queries with it can be divided into three phases. (1) The first phase is shown in Algorithm 4. Shark first parses the query and obtains the region-code conditions in the query predicate (line 1). It then searches the in-memory Trie with those conditions to obtain the set of finest-grained region codes relevant to the query (line 2). Finally, it reads from HBase the location information of the Slices associated with those region codes (line 3) and writes it to a temporary file on HDFS for use by the later phases (line 4).
Algorithm 4. GetQueryRelatedRegionCodes(Q)
Input: query Q
Output: the set of finest-grained region codes relevant to the query
1. regionCodePred = extract(Q);
2. keyset = Trie.search(regionCodePred);
3. sliceLocations = HBase.getAll(keyset);
4. writeToTmpFileonHDFS(sliceLocations);
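The first query phase can be illustrated with a small sketch. It assumes the Trie is held as nested dicts keyed by region code, with an empty dict marking a finest-grained (leaf) code, and it uses a dict in place of the HBase index table. The names `find_node`, `leaf_codes`, and `query_related_slices`, and all sample codes, are hypothetical.

```python
def find_node(trie, code):
    """Depth-first search for the subtree rooted at `code`; None if absent."""
    for key, child in trie.items():
        if key == code:
            return child
        found = find_node(child, code)
        if found is not None:
            return found
    return None


def leaf_codes(code, subtree):
    """Collect the finest-grained codes under a node (a childless node is a leaf)."""
    if not subtree:
        return [code]
    leaves = []
    for key, child in subtree.items():
        leaves.extend(leaf_codes(key, child))
    return leaves


def query_related_slices(trie, index, region_code):
    """Map a region-code condition to the Slice locations that must be read."""
    subtree = find_node(trie, region_code)
    if subtree is None:
        return []
    keyset = leaf_codes(region_code, subtree)
    return [index[k] for k in keyset if k in index]


trie = {"33": {"3301": {"330102": {}, "330103": {}}, "3302": {"330203": {}}}}
index = {
    "330102": ("part-00000", 0, 33),
    "330103": ("part-00000", 33, 49),
    "330203": ("part-00001", 0, 20),
}
slice_locations = query_related_slices(trie, index, "3301")
```

A condition on a coarse region such as `"3301"` expands to its leaf codes, so only the two matching Slices are selected and the Slice for `"330203"` is never touched.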
(2) The second phase is shown in Algorithm 5. First, the variable holding the set of Splits retained after filtering (a Split is an InputSplit in MapReduce) is initialized to empty (line 1). A Split can be understood as a unit of input data. Second, all Splits are obtained from the input table (line 2), and the position of each query-relevant Slice within its file is read from the temporary file produced by Algorithm 4 (line 3). Third, Splits that do not intersect any query-relevant Slice are filtered out (lines 4-8). Finally, for each retained Split, the positions of all Slices that must be read within that Split are obtained (lines 9-10) and sorted by offset in ascending order (line 11); then, with the Split's file name and Split offset as the key and the offset list of all Slices in the Split as the value, an entry is stored in a temporary HBase table for use when irrelevant Slices are filtered in the next phase (line 12). The set of retained Splits is then returned (line 14).
Algorithm 5. FilterUnrelatedSplits
Input: the temporary file produced by Algorithm 4, which contains the location information of the query-relevant Slices
Output: the filtered Splits, i.e. the input Splits of the Spark program
1. finalSplits = ∅;
2. allSplits = getAllSplitsFromInputPath();
3. sliceLocations = readFromTmpFileOnHDFS();
4. FOR split IN allSplits DO
5.     IF split.overlap(sliceLocations) THEN
6.         finalSplits.add(split);
7.     END IF
8. END FOR
9. FOR split IN finalSplits DO
10.     slicesLocs = getSlicesInSplit(sliceLocations);
11.     Sort(slicesLocs);
12.     HBase.put(split.name_split.start, slicesLocs);
13. END FOR
14. Return finalSplits;
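The Split filtering can be sketched as an interval-overlap test. The following is an illustration only: the `Split` tuple, the half-open byte ranges, the `name_start` key format for the temporary table, and the dict standing in for that HBase table are assumptions, and the two loops of Algorithm 5 are merged into one for brevity.

```python
from collections import namedtuple

# Hypothetical stand-in for an InputSplit: file name, start offset, length.
Split = namedtuple("Split", "name start length")


def overlaps(split, loc):
    """True if the Slice byte range [s_start, s_end) intersects the Split."""
    filename, s_start, s_end = loc
    return (
        filename == split.name
        and s_start < split.start + split.length
        and s_end > split.start
    )


def filter_unrelated_splits(all_splits, slice_locations):
    """Keep Splits that touch a relevant Slice; record their Slices sorted by offset."""
    final_splits = []
    split_to_slices = {}  # stands in for the temporary HBase table
    for split in all_splits:
        hits = sorted(
            (loc for loc in slice_locations if overlaps(split, loc)),
            key=lambda loc: loc[1],
        )
        if hits:
            final_splits.append(split)
            split_to_slices[f"{split.name}_{split.start}"] = hits
    return final_splits, split_to_slices


all_splits = [
    Split("part-00000", 0, 40),
    Split("part-00000", 40, 40),
    Split("part-00001", 0, 40),
]
slice_locations = [("part-00000", 33, 49), ("part-00000", 0, 33)]
final_splits, split_to_slices = filter_unrelated_splits(all_splits, slice_locations)
```

Here the Slice at bytes 33 to 49 straddles the boundary between the first two Splits, so it is listed under both, while the Split on `part-00001` is dropped entirely.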
(3) The third phase is shown in Algorithm 6. The RecordReader is responsible for the actual data reads. Using the current Split's name and offset as the key, this class first reads from HBase the offset information of the set of Slices that must be read within the Split (line 1). For each Slice, it then reads the data in that Slice and sends it to the Spark operators, so that irrelevant data is filtered out (lines 2-4).
Algorithm 6. FilterUnrelatedSlices
Input: the temporary table in HBase produced by Algorithm 5; the Split
Output: the query-relevant Slices, read and sent to Spark
slicesLocs = HBase.get(split.name_split.start);
FOR slice IN slicesLocs DO
    sendDataToSpark(slice);
END FOR
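The read path of the third phase amounts to seeking to each listed byte range and reading only that range. A minimal sketch follows, assuming newline-delimited UTF-8 records, end-exclusive offsets, and an in-memory file in place of HDFS; `read_slices` is a hypothetical name, and the handling of a Slice that crosses a Split boundary is deliberately left out.

```python
import io


def read_slices(data_file, slice_locs):
    """Yield records from only the byte ranges listed in slice_locs."""
    for _, start, end in slice_locs:
        data_file.seek(start)                 # jump directly to the Slice
        chunk = data_file.read(end - start)   # sequential read of the whole Slice
        for line in chunk.decode("utf-8").splitlines():
            yield line                        # would be handed to the Spark operators


# Reorganized data file: two records for 330102, then one for 330103.
data_file = io.BytesIO(
    b"u001,330102,12.5\nu003,330102,3.3\nu002,330103,7.1\n"
)
records = list(read_slices(data_file, [("part-00000", 33, 49)]))
```

Only the bytes of the requested Slice are read, which is how the reorganized layout avoids scanning records for unrelated regions.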
The index of the present invention does not add a large number of directories as partitions. Instead, by reorganizing the data files, it achieves the goal of filtering irrelevant data with the finest-grained region codes, which greatly improves query performance.
Fig. 5 is a schematic structural diagram of an index building system for an OLAP system of electric power data provided by an embodiment of the present invention.
The index building system for an OLAP system of electric power data provided by an embodiment of the present invention comprises:
a prefix tree building module 501, configured to build a prefix tree from the archive-class data in the electric power data;
a data reorganization module 502, configured to reorganize, according to the leaf nodes of the prefix tree, the power consumption collection data in the electric power data within a data file, generating multiple data slices in the data file;
an index building module 503, configured to determine the position of each data slice within the data file and to establish a mapping between the data slice positions and the leaf nodes of the prefix tree, the mapping serving as the index of the OLAP system of the electric power data.
Preferably, the prefix tree building module comprises:
a first reading unit, configured to read the archive-class data stored in a relational database management system;
an extraction unit, configured to extract a region hierarchy code table of the electric power data from the archive-class data, the region hierarchy code table including the region code of the region to which each item of power consumption collection data belongs and the region level to which each region code belongs;
a building unit, configured to build the prefix tree from the region hierarchy code table such that the region codes in the table correspond one-to-one to the leaf nodes of the prefix tree and the region levels in the table correspond one-to-one to the levels of the prefix tree.
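One way to picture the building unit is the sketch below, which builds a nested-dict prefix tree from rows of (region code, region level). It rests on an assumption not stated in the text: that a parent region's code is a string prefix of its children's codes, so the parent can be found by prefix matching. The function name `build_trie` and the sample codes are hypothetical.

```python
def build_trie(rows):
    """Build a nested-dict prefix tree from (region_code, level) rows.

    Tree depth follows the region level; a node whose dict stays empty
    is a finest-grained (leaf) region code.
    """
    root = {}
    nodes = {}  # code -> (level, children dict)
    for code, level in sorted(rows, key=lambda r: r[1]):  # parents before children
        children = {}
        # Assumed rule: the parent is the code one level up that prefixes this code.
        parents = [
            c for c, (lvl, _) in nodes.items()
            if lvl == level - 1 and code.startswith(c)
        ]
        siblings = nodes[parents[0]][1] if parents else root
        siblings[code] = children
        nodes[code] = (level, children)
    return root


# Hypothetical region hierarchy code table, deliberately out of order.
rows = [("330102", 3), ("33", 1), ("3301", 2), ("330103", 3)]
trie = build_trie(rows)
```

Because the tree holds only region codes and not the data itself, it is small enough to keep in server memory, which matches the storage arrangement described in the text.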
Preferably, the data reorganization module comprises:
a second reading unit, configured to read the region code in each item of power consumption collection data and to identify the items that contain the same region code;
a data reorganization unit, configured to reorganize each item of power consumption collection data within the data file, gathering the items that contain the same region code into one data slice, and to record the position of each data slice within the data file.
Preferably, the index building module comprises:
an index building unit, configured to determine the position of each data slice within the data file and, according to the region code of the power consumption collection data in each data slice and the one-to-one correspondence between the region codes in the region hierarchy code table and the leaf nodes of the prefix tree, to establish a mapping between each data slice position and the corresponding leaf node of the prefix tree, the mapping serving as the index of the OLAP system of the electric power data.
Preferably, the following is also included:
the prefix tree is stored in the memory of the server hosting the OLAP system of the electric power data, and the index of the OLAP system of the electric power data is stored in a distributed storage system.
The present application provides an index building method and system for an OLAP system of electric power data. The system builds a prefix tree from the archive-class data in the electric power data; reorganizes, according to the leaf nodes of the prefix tree, the power consumption collection data in the electric power data within a data file, generating multiple data slices in the data file; and determines the position of each data slice within the data file and establishes a mapping between the data slice positions and the leaf nodes of the prefix tree, the mapping serving as the index of the OLAP system of the electric power data. Thus, by building the OLAP index, the power consumption collection data is reorganized so that sequential reads become more efficient, and because the index is built at the finest granularity of that data, irrelevant data can be filtered out at query time, which improves query performance.
It should be noted that the index building system of this embodiment may adopt the index building method of the method embodiment above and may be used to implement all of the technical solutions of that embodiment. The functions of its modules may be implemented according to the method in the method embodiment; for the specific implementation process, refer to the related description in the above embodiment, which is not repeated here.
For convenience of description, the system above is described in terms of modules divided by function. Of course, when implementing the present application, the functions of the modules may be implemented in one or more pieces of software and/or hardware.
The embodiments in this specification are described progressively. For parts that are the same or similar among the embodiments, reference may be made between them, and each embodiment focuses on its differences from the others. In particular, since the device and system embodiments are substantially similar to the method embodiments, they are described relatively briefly, and the relevant parts may be understood by referring to the description of the method embodiments. The device and system embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. A person of ordinary skill in the art can understand and implement this without creative effort.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above generally in terms of function. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present invention.
The steps of the methods or algorithms described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
The above description of the disclosed embodiments enables those skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. An index building method for an OLAP system of electric power data, characterized by comprising:
building a prefix tree from the archive-class data in the electric power data;
reorganizing, according to the leaf nodes of the prefix tree, the power consumption collection data in the electric power data within a data file, and generating multiple data slices in the data file;
determining the position of each data slice within the data file, and establishing a mapping between the data slice positions and the leaf nodes of the prefix tree as the index of the OLAP system of the electric power data.
2. The method according to claim 1, characterized in that building a prefix tree from the archive-class data in the electric power data comprises:
reading the archive-class data stored in a relational database management system;
extracting a region hierarchy code table of the electric power data from the archive-class data, the region hierarchy code table including the region code of the region to which each item of power consumption collection data belongs and the region level to which each region code belongs;
building the prefix tree from the region hierarchy code table such that the region codes in the table correspond one-to-one to the leaf nodes of the prefix tree and the region levels in the table correspond one-to-one to the levels of the prefix tree.
3. The method according to claim 2, characterized in that reorganizing, according to the leaf nodes of the prefix tree, the power consumption collection data in the electric power data within a data file and generating multiple data slices in the data file comprises:
reading the region code in each item of power consumption collection data, and identifying the items that contain the same region code;
reorganizing each item of power consumption collection data within the data file, gathering the items that contain the same region code into one data slice, and recording the position of each data slice within the data file.
4. The method according to claim 3, characterized in that determining the position of each data slice within the data file and establishing a mapping between the data slice positions and the leaf nodes of the prefix tree as the index of the OLAP system of the electric power data comprises:
determining the position of each data slice within the data file and, according to the region code of the power consumption collection data in each data slice and the one-to-one correspondence between the region codes in the region hierarchy code table and the leaf nodes of the prefix tree, establishing a mapping between each data slice position and the corresponding leaf node of the prefix tree as the index of the OLAP system of the electric power data.
5. The method according to claim 1, characterized by further comprising:
storing the prefix tree in the memory of the server hosting the OLAP system of the electric power data, and storing the index of the OLAP system of the electric power data in a distributed storage system.
6. An index building system for an OLAP system of electric power data, characterized by comprising:
a prefix tree building module, configured to build a prefix tree from the archive-class data in the electric power data;
a data reorganization module, configured to reorganize, according to the leaf nodes of the prefix tree, the power consumption collection data in the electric power data within a data file, generating multiple data slices in the data file;
an index building module, configured to determine the position of each data slice within the data file and to establish a mapping between the data slice positions and the leaf nodes of the prefix tree as the index of the OLAP system of the electric power data.
7. The system according to claim 6, characterized in that the prefix tree building module comprises:
a first reading unit, configured to read the archive-class data stored in a relational database management system;
an extraction unit, configured to extract a region hierarchy code table of the electric power data from the archive-class data, the region hierarchy code table including the region code of the region to which each item of power consumption collection data belongs and the region level to which each region code belongs;
a building unit, configured to build the prefix tree from the region hierarchy code table such that the region codes in the table correspond one-to-one to the leaf nodes of the prefix tree and the region levels in the table correspond one-to-one to the levels of the prefix tree.
8. The system according to claim 7, characterized in that the data reorganization module comprises:
a second reading unit, configured to read the region code in each item of power consumption collection data and to identify the items that contain the same region code;
a data reorganization unit, configured to reorganize each item of power consumption collection data within the data file, gathering the items that contain the same region code into one data slice, and to record the position of each data slice within the data file.
9. The system according to claim 8, characterized in that the index building module comprises:
an index building unit, configured to determine the position of each data slice within the data file and, according to the region code of the power consumption collection data in each data slice and the one-to-one correspondence between the region codes in the region hierarchy code table and the leaf nodes of the prefix tree, to establish a mapping between each data slice position and the corresponding leaf node of the prefix tree as the index of the OLAP system of the electric power data.
10. The system according to claim 5, characterized in that the following is also included:
the prefix tree is stored in the memory of the server hosting the OLAP system of the electric power data, and the index of the OLAP system of the electric power data is stored in a distributed storage system.
CN201610147684.7A 2016-03-15 2016-03-15 Index building method and system of OLAP system of electric data Pending CN105787090A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610147684.7A CN105787090A (en) 2016-03-15 2016-03-15 Index building method and system of OLAP system of electric data

Publications (1)

Publication Number Publication Date
CN105787090A true CN105787090A (en) 2016-07-20

Family

ID=56393668

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102087646A (en) * 2009-12-07 2011-06-08 北大方正集团有限公司 Method and device for establishing index
CN104572678A (en) * 2013-10-16 2015-04-29 北大方正集团有限公司 Index establishment method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
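Wang Yaling et al.: "OLAP analysis system for electric power consumption collection big data based on Sharp/Shark", Journal of University of Science and Technology of China *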

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160720
