CN107092656B - A kind of tree data processing method and system - Google Patents

Publication number
CN107092656B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710178695.6A
Other languages
Chinese (zh)
Other versions
CN107092656A (en)
Inventor
陈世敏
王智义
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Application filed by Institute of Computing Technology of CAS
Priority to CN201710178695.6A
Publication of CN107092656A
Application granted
Publication of CN107092656B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/258 Data format conversion from or to a database
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/81 Indexing, e.g. XML tags; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/84 Mapping; Conversion

Abstract

The present invention proposes a tree-structured data processing method and system (System for TrEE structured Data, STEED) and relates to the technical field of data processing. The system reads text data and parses it into row-oriented or column-oriented binary data; during parsing, it dynamically generates a syntax tree that stores the definition of the semi-structured data. It stores the row-oriented or column-oriented binary data, converts between the two formats, and can output the binary data directly as JSON text. Based on the binary data, it performs query operations on the semi-structured data.

Description

A kind of tree data processing method and system
Technical field
The present invention relates to the technical field of data processing, and in particular to a tree-structured data processing method and system (System for TrEE structured Data, STEED).
Background
With the development of computer networks and big data processing, traditional relational data increasingly fails to meet the requirements for defining and using data in network and big data environments. Semi-structured data, represented by JSON and Protocol Buffers, can fully express the data of objects (Object) in programming languages, and its format can be modified and extended as the data changes, so it is widely used in practice.
Tree-structured data is defined as follows:
T_value = T_primitive | T_object | T_array
T_primitive = string | number | boolean | null
Record = T_object
In words, the definition above states:
1. A value in tree-structured data is one of three kinds: a value of object structure, a value of array structure, or a value of atomic type.
2. A value of object structure is enclosed in braces and consists of key-value pairs. The number of pairs is arbitrary, but the same object must not contain duplicate keys.
3. A value of array structure is enclosed in square brackets and consists of values. The number of values is arbitrary, and duplicate values may occur.
4. A value of atomic type is a string, a number, a Boolean, or null.
5. In the key-value pairs of item 2, a key can only be of type string.
6. Every tree-structured record is of object structure.
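The six rules above can be checked mechanically. The following is a minimal sketch in Python, not part of the patent text; the function names `is_tree_value`, `reject_duplicate_keys`, and `parse_record` are hypothetical and chosen only for illustration.

```python
import json

def is_tree_value(value):
    """Return True if value is an atomic value, an array, or an object."""
    if value is None or isinstance(value, (str, bool, int, float)):
        return True                                    # rule 4: atomic types
    if isinstance(value, list):
        return all(is_tree_value(v) for v in value)    # rule 3: duplicates allowed
    if isinstance(value, dict):
        # rule 5: keys must be strings; values are checked recursively
        return all(isinstance(k, str) and is_tree_value(v) for k, v in value.items())
    return False

def reject_duplicate_keys(pairs):
    """Hook for json.loads: enforce rule 2 (no duplicate key in one object)."""
    keys = [k for k, _ in pairs]
    if len(keys) != len(set(keys)):
        raise ValueError("duplicate key in object")
    return dict(pairs)

def parse_record(text):
    """Parse one record; rule 6 requires the top-level value to be an object."""
    value = json.loads(text, object_pairs_hook=reject_duplicate_keys)
    if not isinstance(value, dict) or not is_tree_value(value):
        raise ValueError("a record must be an object")
    return value
```

For example, `parse_record('{"a": [1, 1, {"b": null}]}')` is accepted, while a top-level array or an object with two `"a"` keys raises `ValueError`.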
Common sources of such data include the following:
1) Data feeds (Data Feeds)
Twitter is representative: it transmits data over the network in JSON format, and users and API programs obtain data updates by listening on the corresponding port. Because the content is rich, the structure is relatively complex, the source is stable, and the data volume is large, the experiments and analysis of the present invention are based mainly on the Twitter data set. The table below gives the present invention's analysis of the nesting level and the number of repeated fields in Twitter data.

Leaf node level   No repeated field   1 repeated field   2 or more repeated fields   Total
1                 16                  0                  0                           16
2                 61                  2                  0                           63
3                 51                  21                 4                           76
4                 1                   19                 4                           24
5                 0                   12                 0                           12
6                 0                   12                 0                           12
Total             129                 66                 8                           203

2) Online data services (Online Data Service)
Online data services exchange data in JSON format. Typically, the client sends the content of an operation and the service returns the result. The present invention studied semi-structured data from online data services of different origins, such as Yahoo, Sina Weibo, and IMDB. A user usually composes a data service request in JSON according to the format required by an API, sends it to the data server, and then parses the returned JSON data to complete one data service call.
The present invention analyzed the online data service of the Weibo (microblog) API; the result is shown in Fig. 1.
The analysis focuses on the number of repeated fields on each path: in the figure, the black portion represents paths from the root to a leaf node with no repeated field, the light-colored portion represents paths with exactly 1 repeated field, and the white portion represents paths with 2 or more repeated fields. The histogram shows the proportions: in most syntax trees, a path from the root to a leaf node contains at most 1 repeated field.
3) Communication protocols
The present invention analyzed the communication-related protocol formats in Apache Hadoop and Hadoop HBase, which transmit communication data as semi-structured data defined with Protocol Buffers. These systems define many different semi-structured communication formats for communication and control between machines. Most of the semi-structured data used for communication has a very simple format.
The analysis of the Apache Hadoop communication protocol is shown in Fig. 2.
Again the analysis focuses on the number of repeated fields on each path: in Fig. 2, the black portion represents paths from the root to a leaf node with no repeated field, the light-colored portion represents paths with exactly 1 repeated field, and the white portion represents paths with 2 or more repeated fields. The histogram shows the proportions: in most syntax trees, a path from the root to a leaf node contains at most 1 repeated field.
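The path statistics used in these analyses can be sketched as a walk over a schema tree. The example below is an illustration only: the nested-dictionary schema representation and the name `classify_paths` are assumptions, not the patent's data structures. It counts root-to-leaf paths by leaf level and by the number of repeated fields on the path.

```python
from collections import Counter

def classify_paths(node, depth=0, repeated=0, out=None):
    """Count root-to-leaf paths by (leaf level, number of repeated fields).

    A schema node is a dict with optional keys "repeated" (bool) and
    "children" (field name -> node). This layout is assumed for the sketch.
    """
    if out is None:
        out = Counter()
    children = node.get("children", {})
    if not children and depth > 0:
        # Leaf reached: bucket the path as "0", "1", or "2+" repeated fields.
        bucket = "2+" if repeated >= 2 else str(repeated)
        out[(depth, bucket)] += 1
        return out
    for child in children.values():
        extra = 1 if child.get("repeated") else 0
        classify_paths(child, depth + 1, repeated + extra, out)
    return out
```

Applied to a schema, the resulting counter has the same shape as the table of leaf levels and repeated-field counts given for the Twitter data.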
4) Public data sets
Analysis of the data in DBpedia and data.gov shows that public data sets are stored in JSON format. Unlike traditional semi-structured data files, however, each of these data sets consists of a single JSON record. The record has two main parts: the first is a nested substructure (an object in JSON) that stores the data format of the data set; the second is an array that stores the content of every entry, and the entries themselves are not nested. The present invention can easily split such a record into a data definition part and a data content part and then process it with conventional semi-structured data processing methods.
5) Sensor data
Recent sensor platforms, such as Arduino, Dragon Board, and Beagle Bone, can generate and process data in JSON format. The present invention analyzed data from these sources and found that its internal format is fairly simple: the nesting depth of any field is at most 2, and at most one multi-valued field appears on any path from the root to a leaf node.
However, existing data processing systems cannot handle semi-structured data in JSON format from the above sources well, that is, provide complete functionality while every operation also performs well. The present invention analyzed a large number of systems that support semi-structured data management. They follow three main approaches:
1) Extending a traditional relational database
Systems such as PostgreSQL and Oracle store semi-structured data such as JSON in a relational table as one contiguous data block, either as text or in an internally encoded binary format. To run a query, the system calls internal parsing functions to parse the block and read the values of the required fields, and then calls the relational database's operators to carry out the query.
2) NoSQL data processing systems
Systems such as MongoDB encode semi-structured data internally in binary in a more flexible way. Their advantage is that they can parse, store, and query native semi-structured data, with comparatively strong storage and query capabilities. Their implementations define new query operations, or extend existing ones, according to the structural features of semi-structured data.
3) Processing data in a columnar format
Google Protocol Buffers and Apache Hive+Parquet support processing and querying semi-structured data. Compared with the two classes of row-oriented systems above, a column-oriented system provides better query and analysis performance in most cases, but its internal implementation is more complex: data is usually stored internally as column clusters, which makes parsing semi-structured data and implementing query operations more difficult.
Each of the three approaches to building a semi-structured data processing system currently has problems to some degree.
1) Extending an existing relational database handles semi-structured data relatively inefficiently
Analysis of the relational databases that currently support semi-structured data shows that most of them neither encode nor optimize the data according to its structure and characteristics. They mainly store semi-structured data as text data blocks and use internal parsing functions to parse the text blocks and extract the information needed from each record. Storing JSON data as text directly in the database in this way wastes a large amount of space.
Queries also require a large number of string comparisons and lookups, which greatly limits processing efficiency. According to the present invention's study, although many systems support operations on semi-structured data, query run times often become too long as the data volume grows, so real-time requirements are hard to meet.
Relational databases also do not support well some structural features that are new in semi-structured data, such as direct support for defining nested and repeated fields, or an extended query syntax that matches the structural features of semi-structured data.
2) NoSQL data processing systems encode and query data inefficiently
The present invention analyzed and studied the widely used NoSQL data processing system MongoDB. Because of the flexibility of JSON semantics, MongoDB defines a redundant and cumbersome internal data encoding format. The study found the encoding to be very inefficient: in most cases, the encoded data file is larger than the original text data. The internal encoding does not effectively reduce the redundancy in the JSON text and instead adds performance overhead during queries. This limits data processing performance, especially for large volumes of data.
In addition, limitations in the internal design of these NoSQL systems make some operations impossible to execute. For example, MongoDB cannot implement the SQL join operation completely and efficiently (the latest version adds a similar operator, but it still does not fully satisfy the join operation defined in SQL and its execution efficiency is too low).
3) Processing data in a columnar format
Among relational databases, columnar databases generally outperform row-oriented databases in storage and queries, because a query does not need to read and process the fields of a record that are unrelated to it. However, their internal principles are complex and their functions are relatively difficult to implement.
Similarly, among systems that process semi-structured data, those that store and query data in columnar form are internally more complex. Most row-oriented management systems place no syntactic restriction on the internal form of JSON: the content of the data does not need to be defined in advance, and the structure of the data can evolve during use. A column-oriented data management system, by contrast, requires the definition (Schema) of the column data in advance and cannot change the data structure dynamically during use, which greatly limits the flexibility of semi-structured data.
Moreover, few column-oriented semi-structured data processing systems are currently available to users. The only column-oriented system available at present is Apache Hive+Parquet, which is implemented in Java. Because of the limitations of the Java programming language, its query efficiency still has room for further optimization. Its platform also depends on Apache Hadoop and HDFS, so the cost of initializing and running the system is very high.
In researching semi-structured data processing, the present invention found that the limitations of the three existing approaches stem from their data structures and implementations.
First, the internal structural features of semi-structured data mean that processing it cannot be achieved simply by extending a relational database. The two make different assumptions about data format, so processing semi-structured data with a relational database incurs a cost that is too high to bear. The present invention therefore redesigns and implements a data processing system intended for semi-structured data, so that it can process semi-structured data with complex structure.
Second, because semi-structured data is flexibly defined and its structure may change during use, most current NoSQL data management systems store it directly in a text-like structure. This makes storage efficiency too low and makes processing the data during queries very costly. In the design of the present invention, the structure extracted from the data is stored in the schema syntax definition, and only the minimum structural information is retained in the data. This reduces the repeated structural information in the data and also makes some query optimizations on the data content possible.
Finally, current column-oriented semi-structured storage implemented in Java depends on many underlying components, such as a file storage system and a scheduling system. These impose additional restrictions on the system's functionality and use and can make execution inefficient. The data processing system of the present invention (STEED) is implemented in C/C++ and developed entirely independently, which makes it possible to optimize the system as a whole; it is also free of platform-induced restrictions such as having to define the data format in advance and being unable to change it.
Summary of the invention
In view of the deficiencies of the prior art, the present invention proposes a tree-structured data processing method and system (STEED).
The present invention proposes a tree-structured data processing method, comprising:
Step 1: read semi-structured data and parse it into row-oriented or column-oriented binary data, wherein during parsing a syntax tree is generated dynamically or built from a definition, and stores the definition of the semi-structured data;
Step 2: store the row-oriented or column-oriented binary data, wherein the row-oriented and column-oriented binary data can be converted into each other and the binary data can be output directly as JSON text;
Step 3: perform query operations on the semi-structured data based on the binary data.
Step 1 includes defining the binary data types and nested structures used to describe and define the fields in Protocol Buffers and JSON text data, and establishing the definition of the semi-structured data. For Protocol Buffers text data, the syntax tree is built from the definitions in its definition file before the data is parsed; for JSON data, the definition of the syntax tree is generated dynamically during parsing from the format and content of the data.
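For the Protocol Buffers case, building a syntax tree from predefined message types can be sketched as a recursive expansion from a root type. This is an illustrative sketch only: it assumes the proto file has already been parsed into a dictionary of message definitions, the field tuples and the name `expand` are hypothetical, and recursive message types are not handled.

```python
def expand(messages, type_name, name="", label="required"):
    """Expand a message type into a schema tree, depth first from the root.

    messages maps a message name to a list of (label, type, field name)
    tuples; this layout is an assumption made for the sketch.
    """
    node = {"name": name, "type": type_name, "label": label, "children": []}
    # Basic types (int64, string, ...) are not in `messages`, so they become leaves.
    for field_label, field_type, field_name in messages.get(type_name, []):
        node["children"].append(expand(messages, field_type, field_name, field_label))
    return node

# Hypothetical message definitions, as if already parsed from a proto file.
MESSAGES = {
    "Post": [("required", "int64", "id"), ("optional", "User", "user")],
    "User": [("required", "string", "name"), ("repeated", "string", "emails")],
}
tree = expand(MESSAGES, "Post")
```

Here `tree` has two children, `id` (a leaf) and `user`, and `user` in turn expands into `name` and the repeated field `emails`.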
Step 1 further includes parsing the semi-structured data into either a row-oriented storage structure, which stores each record as a unit with its nesting preserved layer by layer, or a column-oriented storage structure, which stores data per leaf of the tree-structured definition.
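The difference between the two layouts can be illustrated with a deliberately simplified sketch. It is not the patent's binary format: it works on Python dictionaries, handles only non-repeated fields, treats a missing field and an explicit null alike, and the names `to_columns` and `to_rows` are hypothetical.

```python
def to_columns(records, leaf_paths):
    """Store one list per leaf path; None marks a leaf missing from a record."""
    columns = {path: [] for path in leaf_paths}
    for record in records:
        for path in leaf_paths:
            value = record
            for name in path.split("."):
                value = value.get(name) if isinstance(value, dict) else None
            columns[path].append(value)
    return columns

def to_rows(columns):
    """Reassemble nested records from the per-leaf columns."""
    count = len(next(iter(columns.values()), []))
    rows = []
    for i in range(count):
        record = {}
        for path, column in columns.items():
            if column[i] is None:
                continue                      # leaf absent in this record
            *parents, leaf = path.split(".")
            node = record
            for name in parents:
                node = node.setdefault(name, {})
            node[leaf] = column[i]
        rows.append(record)
    return rows
```

A query that touches only `user.name` would read one list in the column layout, whereas the row layout has to walk every record.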
Semi-structured data is processed as follows:
The definition of each node of the syntax tree of the semi-structured data is filled in. A node describes not only its own information; nodes are also linked to one another through their IDs in the syntax tree, forming a tree structure.
During parsing, syntax trees are established separately for JSON and for Protocol Buffers:
Establishing the JSON syntax tree: the syntax tree is built dynamically from the data while the data is parsed. It is assumed that the type of the value of each field does not change and that the members of an array all have the same type. While the syntax tree is built, the type of a field is determined from the type of its value in the data; a JSON field whose value is an array is defined as repeated, and the remaining nodes are defined as optional. During parsing, the symbol table is first searched by the parent node ID and the field name for an existing definition; if there is none, a corresponding node is added to the syntax tree, and otherwise the value of the node is parsed.
Establishing the Protocol Buffers syntax tree: Protocol Buffers defines a message in a proto file as a new data type, and each field it contains is of a basic data type or of another compound type. To build the Protocol Buffers syntax tree, the proto file is parsed first and the new data types are registered; then, starting from the specified root node, the data type definitions are expanded one by one and assembled into the syntax tree of the data structure.
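The dynamic JSON case can be sketched as follows. The sketch assumes a symbol table keyed by (parent node ID, field name), as the description suggests; the class name `SchemaTree`, its methods, and the node layout are hypothetical, and arrays nested directly inside arrays are simply typed as "array" without further analysis.

```python
import json

class SchemaTree:
    def __init__(self):
        # Node 0 is the root record; each node stores its own description.
        self.nodes = [{"id": 0, "parent": None, "name": "",
                       "type": "object", "label": "required"}]
        self.symbols = {}            # (parent node ID, field name) -> node ID

    @staticmethod
    def type_of(value):
        if isinstance(value, bool):  # bool must be tested before int
            return "boolean"
        if isinstance(value, (int, float)):
            return "number"
        if isinstance(value, str):
            return "string"
        if isinstance(value, dict):
            return "object"
        if isinstance(value, list):
            return "array"
        return "null"

    def observe(self, parent_id, name, value):
        """Look the field up in the symbol table; add a node if it is new."""
        repeated = isinstance(value, list)
        node_id = self.symbols.get((parent_id, name))
        if node_id is None:
            node_id = len(self.nodes)
            self.nodes.append({"id": node_id, "parent": parent_id, "name": name,
                               "type": None,
                               "label": "repeated" if repeated else "optional"})
            self.symbols[(parent_id, name)] = node_id
        node = self.nodes[node_id]
        for item in (value if repeated else [value]):
            if node["type"] in (None, "null"):
                node["type"] = self.type_of(item)   # type taken from the data
            if isinstance(item, dict):
                for child_name, child_value in item.items():
                    self.observe(node_id, child_name, child_value)
        return node_id

    def add_record(self, text):
        for name, value in json.loads(text).items():
            self.observe(0, name, value)
```

Feeding two records with overlapping fields adds each field once; a field that first appears in a later record, such as a new member of a nested object, is added when it is first seen.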
The following data types are used to store and operate on row-oriented or column-oriented binary data:
1) Integer: TypeInt (8/16/32/64) denotes an 8-, 16-, 32-, or 64-bit integer;
2) Floating point: Type (Float/Double) denotes a floating-point number of type float or double;
3) String: TypeString denotes a character string;
4) Timestamp: TypeTimeStamp denotes a timestamp and is implemented internally with TypeInt64.
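As a rough illustration of how such types map to fixed-width binary values, the sketch below uses Python's `struct` module. The type names follow the list above, but the little-endian byte order, the 4-byte length prefix for strings, and the function name `encode` are assumptions of this sketch, not details given in the source.

```python
import struct

# struct format per type name; "<" selects little-endian (an assumption).
FORMATS = {
    "TypeInt8": "<b", "TypeInt16": "<h", "TypeInt32": "<i", "TypeInt64": "<q",
    "TypeFloat": "<f", "TypeDouble": "<d",
    "TypeTimeStamp": "<q",   # a timestamp is stored as a 64-bit integer
}

def encode(type_name, value):
    """Encode one value of the given type as bytes."""
    if type_name == "TypeString":
        raw = value.encode("utf-8")
        # The length prefix is a choice made for this sketch.
        return struct.pack("<I", len(raw)) + raw
    return struct.pack(FORMATS[type_name], value)
```

With this mapping, `encode("TypeTimeStamp", t)` and `encode("TypeInt64", t)` yield identical bytes, which mirrors the statement that the timestamp type is implemented with TypeInt64.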
Step 3 includes, when executing a query, first generating from the content of the query statement the operation tree that the query requires; each node of the operation tree is an SQL operation.
The method further includes an extended query syntax, as follows:
(1) " ": the separator between nesting levels in a field path expression;
(2) "any": denotes any one value in a repeated field;
(3) "all": denotes all values in a repeated field.
The output is either data in JSON format or JSON-like data in which the nested structure is ignored.
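The intent of "any" and "all" over a repeated field can be sketched as quantifiers applied to every value a path reaches. The sketch below is an interpretation, not the patent's query syntax: it assumes a dot as the path separator, and the names `values_at` and `match` are hypothetical.

```python
def values_at(record, path):
    """Collect every value reached by a dotted path, fanning out over arrays."""
    current = [record]
    for name in path.split("."):
        reached = []
        for value in current:
            if isinstance(value, dict) and name in value:
                child = value[name]
                # A repeated field contributes each of its values.
                reached.extend(child if isinstance(child, list) else [child])
        current = reached
    return current

def match(record, quantifier, path, predicate):
    """Apply a predicate to a path under "any" or "all" semantics."""
    values = values_at(record, path)
    if quantifier == "any":
        return any(predicate(v) for v in values)
    if quantifier == "all":
        # Note: all() is True for a path with no values (vacuous truth).
        return all(predicate(v) for v in values)
    raise ValueError("quantifier must be 'any' or 'all'")
```

Without such a quantifier, a condition like "score greater than 5" is ambiguous when a record holds several scores; the quantifier states whether one match suffices or every value must match.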
The method further includes the following operations.
Row data read operation: a complete row-structured record is read from the row-oriented binary data. At each read, one Row Object is read from the row-oriented binary data, and reads proceed in sequence until the end of file (EOF) is reached.
Row data filter operation: the conditions in the where clause are evaluated against the row-oriented binary data that has been read, and, after the group by clause has generated new row-oriented binary data, the aggregation results are filtered. The where clause is parsed first and each predicate is instantiated as an object that performs data comparison; the values read are then compared, the truth value of each predicate is determined, and this decides whether the record passes the condition.
Row data projection operation: a recursive function is called to handle the nested structure in semi-structured data. For each field, the field's value is read from the original data and parsed, and only the fields relevant to the query are written to the result of the operation.
Join operation: the join is implemented as a hash join. For each record of one data set, a hash value is computed from the value of its join key and the whole record is stored in a hash table. The other data set is then traversed, the position in the hash table with the same hash key is looked up, the two row-structured records are merged, and the merged record waits to be pulled by the operator one layer above.
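The build and probe phases of the hash join described above can be sketched as follows. Records are plain dictionaries here rather than binary rows, the name `hash_join` is hypothetical, and writing the function as a generator stands in for results being pulled by the operator above.

```python
from collections import defaultdict

def hash_join(build_rows, probe_rows, key):
    """Equi-join two record collections on one key using a hash table."""
    table = defaultdict(list)
    for row in build_rows:                    # build phase
        if row.get(key) is not None:
            table[row[key]].append(row)       # store the whole record
    for row in probe_rows:                    # probe phase
        if row.get(key) is None:
            continue                          # a missing key never matches
        for match in table.get(row[key], ()):
            # Merge the two records; on a name clash the probe side wins.
            yield {**match, **row}
```

In practice the smaller input is usually chosen as the build side so that the hash table stays small; the source does not say which side STEED picks.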
Grouping operation: a HashValueItemContainer is first defined to hold each storage unit in the hash table; its value is the address of a HashValueItem in the hash table. (1) An intermediate layer stores the address at which the record is stored and the content of each aggregation to be computed.
(2) A Block Buffer object stores the actual content of the saved records. When the whole grouping operation has completed, the aggregation results are written to the corresponding positions and wait to be pulled up by the other operators in the layer above.
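A hash-based grouping with running aggregates can be sketched as below. The list `buffer` stands in for the Block Buffer and each hash table entry holds an index into it plus the aggregates; the fixed choice of count and sum, and the name `group_by`, are assumptions of this sketch rather than the patent's HashValueItem layout.

```python
def group_by(rows, key, agg_field):
    """Group records by one key and compute count and sum per group."""
    groups = {}    # key value -> [index of the saved record, count, sum]
    buffer = []    # stands in for the block buffer holding record contents
    for row in rows:
        entry = groups.get(row[key])
        if entry is None:
            buffer.append(row)                       # save the first record of the group
            entry = groups[row[key]] = [len(buffer) - 1, 0, 0]
        entry[1] += 1
        entry[2] += row[agg_field]
    # Results are produced only after all input has been consumed.
    for index, count, total in groups.values():
        yield {key: buffer[index][key], "count": count, "sum": total}
```

Keeping only an index in the hash table entry, with the record content held elsewhere, mirrors the separation between the intermediate layer and the Block Buffer in the description.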
Sort operation: all records are stored in a buffer and sorted by comparison. Each time, only a fixed-size block of memory is requested from the operating system to hold the records obtained from the lower-level operation. During comparison, an array records the start address of every record, and sorting changes only the positions of the pointers in that array.
A comparator is defined according to the sort condition, and the operation proceeds as follows:
(1) the comparator reads from the row-oriented binary data the values of all fields used in the comparison;
(2) to improve comparison efficiency, comparison and output proceed as follows:
a) 8 bytes are reserved in every record to store the most significant part of the data in the first field to be compared;
b) the comparator takes the values of the sort fields in order and compares them until the result of the comparison is determined;
c) sorting uses the STL::sort function;
d) the record data itself is not copied during comparison; only the pointer array that records the output order is modified.
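The two ideas in this procedure, a short fixed-size prefix compared first and an index array reordered instead of the records, can be sketched as follows. The sketch assumes that the first sort field is a string and that the prefix is the first 8 bytes of its UTF-8 encoding; Python's `list.sort` stands in for the STL sort, and the name `sort_order` is hypothetical.

```python
def sort_order(records, fields):
    """Return record indices in sorted order without moving the records."""
    first = fields[0]

    def prefix(record):
        # First 8 bytes of the leading sort field, zero-padded to a fixed width.
        return record[first].encode("utf-8")[:8].ljust(8, b"\x00")

    def sort_key(index):
        record = records[index]
        # The cheap prefix decides most comparisons; the full field values
        # are consulted only when two prefixes are equal.
        return (prefix(record), tuple(record[f] for f in fields))

    order = list(range(len(records)))   # plays the role of the pointer array
    order.sort(key=sort_key)
    return order
```

Because UTF-8 byte order agrees with code point order, comparing the byte prefix first gives the same result as comparing the full strings, while most comparisons touch only 8 bytes.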
The present invention also proposes a kind of system based on the tree data processing method.
As it can be seen from the above scheme the present invention has the advantages that
1, the line storage organization of semi-structured data;It realizes and the row binary of semi-structured data is stored, make it The characteristics of semanteme of semi-structured data can completely be expressed and adapt to the variation of its data definition.Furthermore, it is desirable that the letter of its structure List is easy to express, storage efficiency with higher;
2, the column storage organization of semi-structured data;It realizes to the column binary storage of semi-structured data, makes it It is able to use the structure of column storage expressed intact semi-structured data.It is required that its knot that can express semi-structured data complexity The content of structure feature and efficient storing data;
3, two kinds of formats of semi-structured data line and column mutually convert realization;It is realized using parsing and packing algorithm Binary system line and column data mutually convert;
4, the syntax tree that semi-structured data defines is realized;Use the definition information of structure in tree storing data;
5, inquiry operation is carried out to semi-structured data;It is grasped using the inquiry that line and column data carry out class SQL to it Make;
6, be based on semi-structured data the characteristics of, the query grammar of generalized Petri net;Since there are multivalues in semi-structured data Domain defines " ANY ", " ALL " and path expression and solves the problems, such as the data ambiguousness in query process;
7, the optimization based on simple path in semi-structured data;Simple path refers on from root node to leaf node most Only exist a multi-domain more.Present invention discover that there are a large amount of such structures in common semi-structured data, propose and real Show the storage for this spline structure and query optimization, greatly improves the efficiency of inquiry.
As shown in Fig. 4, the present invention has carried out data using different size of data set and has been already loaded into memory (hot Cached) and data be loaded into memory not yet (cold cached) query analysis experiment.In experiment, the present invention is used Different SQL query statements is to obtain the performance comparison of corresponding arithmetic operation, including project mapping, filter are filtered, Group grouping, sort sequence and join attended operation.
Query performance according to Fig.4, in the experiment that cold cached data are not loaded into memory, STEED phase There is 4.1 to 17.8 times of performance speed-up ratio for Hive+Parquet, there is 55.9 to 105.2 times of acceleration relative to MongoDB Than there is 33.8 to 1294 times of speed-up ratio relative to PostgreSQL;And in the experiment of hot cached, STEED pairs MongoDB has 19.5 to 59.3 times of speed-up ratio, there is 19.5 to 59.3 times of speed-up ratio to Hive+Parquet, right PostgreSQL has 16.9 to 392 times of speed-up ratio.The inquiry language of each inquiry operation of the invention is listed in annex in detail Sentence.
Brief description of the drawings
Fig. 1 is the analysis of the JSON data format defined by the Weibo (microblog) API;
Fig. 2 is the analysis of the Apache Hadoop communication protocol;
Fig. 3 is the module composition diagram of STEED;
Fig. 4 is the query performance comparison of STEED;
Fig. 5 shows the process by which a syntax tree is established for Protocol Buffers;
Fig. 6 is a schematic diagram of the row-oriented compound type structure;
Fig. 7 is a schematic diagram of the column-oriented data storage structure;
Fig. 8 is a schematic diagram of the optimized column-oriented data storage structure;
Fig. 9 is a schematic diagram of each query operation of STEED;
Fig. 10 is a schematic diagram of the storage structure during the computation of the grouping operation;
Fig. 11 is a schematic diagram of the optimized row-oriented storage structure;
Fig. 12 is a schematic diagram of the optimization scheme of an alternative row-oriented storage structure.
Detailed description of the embodiments
In view of the above deficiencies of the prior art, the present invention redesigns and implements a semi-structured data processing system, STEED. The following presents the overall architecture of the STEED system and briefly introduces the functional requirements of each module, then analyzes the interface definitions between the modules and briefly explains how STEED processes and stores data internally.
As shown in Fig. 3, STEED consists mainly of three modules:
(1) Data parsing module:
Reads text data, parses it into row-oriented or column-oriented binary data, and stores it in the data storage module. During parsing, a syntax tree that stores the definition of the semi-structured data is generated dynamically. JSON data does not define its data format (syntax tree, schema tree), so the present invention can only generate the format definition dynamically while parsing the data. For data in Protocol Buffers format, the definition of the data is provided before the data is parsed, so the present invention can build the syntax tree from that definition before parsing the text data. According to the field definitions in the syntax tree, the present invention converts the text data into row-oriented and column-oriented binary data.
(2) Data storage module:
Stores the row-oriented and column-oriented binary files generated by the data parsing module. Internally it can convert between the two formats and output the data directly as JSON text. In the STEED system, the present invention also optimizes the storage structures according to the characteristics of row-oriented and column-oriented storage, so that storage and query efficiency are higher.
(3) Query analysis module:
Performs query operations on semi-structured data based on the row-oriented and column-oriented data, including project, filter, group, sort, and join. When STEED executes a query, the query parser (Query Parser) first generates, from the content of the query statement, the operation tree (Operator Tree) that the query requires; each node of the tree is an SQL operation. Data flows through the operation tree from the leaves toward the root, with each node completing its operation, until it reaches the root node and the query is complete. The present invention also implements multithreaded versions of some operations, supporting project, filter, and group.
The STEED system is divided into three modules; the implementation details and workflow of each module are introduced in turn below.
Part 1 Data parsing module
This part describes in detail the implementation of the data parsing module of STEED and its key internal algorithms and, based on the structural characteristics of semi-structured data, explains how STEED parses JSON and Protocol Buffers data and builds their syntax trees.
1.1 Architecture overview of the data parsing module
The data parsing module mainly consists of the following three parts:
(1) Data Type:
Describes and defines the binary data type of each field in JSON and Protocol Buffers text data. The STEED system defines some basic data types, such as int, double and string. For JSON data, the values in the text only need to be mapped to the system's internal data types; for Protocol Buffers, the composite data types defined in its schema are converted into the corresponding STEED data types for later use when building the syntax tree.
(2) Schema Tree:
Holds the definition of the semi-structured data, i.e. the syntax tree.
For Protocol Buffers text data, the syntax tree is generated from the definitions in its schema file before the data is parsed. During parsing, the content and structure of the syntax tree remain unchanged.
For JSON data, the present invention must generate the syntax tree definition dynamically during parsing, according to the format and content of the data. The present invention assumes that the type of the value in each field does not change and that all elements of an array have the same type.
STEED stores the syntax tree definition of each data set. In the query analysis module, STEED performs query operations on a data set according to the data definition in the syntax tree.
(3) Parser:
Splits the semi-structured text data into key-value pairs and then parses them into the row or column storage structures defined inside STEED. For Protocol Buffers data, parsing only needs to convert the data format according to the syntax tree definition; for JSON data, the present invention must also check during parsing whether a newly defined field appears in the data, and modify the existing syntax tree accordingly.
1.2 Data Type
1.2.1 Basic data types supported by STEED
The STEED system internally defines several binary data types for storing and operating on row and column data:
1) Integer: TypeInt(8/16/32/64), representing 8-, 16-, 32- and 64-bit integers respectively;
2) Floating point: Type(Float/Double), representing float and double floating-point numbers respectively;
3) String: TypeString, representing a character string;
4) Timestamp: TypeTimeStamp, representing a timestamp, implemented internally with TypeInt64.
The above data types support null checks on their values, conversion to and from binary data, comparison operations, and so on.
1.2.2 Conversion of JSON data types
JSON defines the possible data types of the value in each field. The present invention maps each of these data types to the corresponding internal STEED data type, as shown in the following table:
For basic data types, the type defined by JSON is mapped directly to a basic data type inside STEED. For the nested complex types object and array in JSON, STEED also defines corresponding row and column storage forms; see the next chapter, on the data storage module, for the specific storage forms.
1.2.3 Conversion of Protocol Buffers data types
Similar to JSON, Protocol Buffers also defines some basic data types. In the internal implementation of STEED, the present invention converts these basic data types directly into C++ types (C++ Type) and stores their values in the parsed result. See https://developers.google.com/protocol-buffers/docs/proto3#scalar.
In addition, the composite data type message can be defined in a Protocol Buffers schema. Using composite data types, a multi-level nested data format can be defined. In the definition of a composite type, an attribute can also be chosen for each field: required (the field must appear), optional (the field may appear), or repeated (the field may appear multiple times).
1.3 Syntax tree (Schema Tree)
This subsection introduces how STEED uses the syntax tree (Schema Tree) to describe semi-structured data, and how the syntax tree is built during parsing for the data and structural characteristics of JSON and Protocol Buffers.
1.3.1 Definition of the syntax tree
Semi-structured data has the following structural characteristics:
1) The data contains a large number of nested structures: each field definition has depth, which is more complex than traditional flat relational data;
2) The data contains many repeated fields: within one record a field may have many values, i.e. the field is repeated;
3) The data contains many sparse fields: a large number of fields are unassigned in most records, so processing them as tables in a traditional relational database makes storage and queries very inefficient.
To describe these characteristics of each field efficiently, while improving the storage and query efficiency of the row and column formats, the present invention defines each node of the syntax tree as follows:
Node information: a node not only describes the data type itself, the nesting level, the number of times the field may be assigned, and so on; nodes are also linked to one another by SchemaNode IDs to form a tree. The following describes how the syntax tree is built during parsing for JSON and Protocol Buffers respectively.
1.3.1 Building the JSON syntax tree
Since JSON carries no data definition, the syntax tree can only be built dynamically from the data during parsing. Here the present invention assumes that the type of the value in each field does not change and that all elements of an array have the same type. While building the syntax tree, the type of a field is determined from the type of its value in the data. On the other hand, since it is uncertain whether a given field appears in a JSON record, the present invention defines JSON fields whose value is an array as repeated, and all other nodes as optional. During parsing, STEED first looks up the symbol table by the parent node ID and the field name to check whether a definition already exists. If the node has no definition, a node is added to the Schema Tree; otherwise the value of the node is parsed. The detailed parsing process is described in the next subsection.
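The lookup-or-add step described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `SchemaNode` fields, the `SchemaTree` class, `type_name` and the `(parent_id, field_name)` symbol-table key are hypothetical names chosen for this sketch.

```python
import json
from dataclasses import dataclass

@dataclass
class SchemaNode:
    node_id: int
    parent_id: int
    name: str
    type_name: str    # "object", "int", "double", "string", "bool" or "null"
    repetition: str   # "repeated" for array-valued fields, otherwise "optional"

def type_name(value):
    if isinstance(value, dict):
        return "object"
    if isinstance(value, bool):   # bool must be tested before int
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    return "null"

class SchemaTree:
    def __init__(self):
        self.nodes = [SchemaNode(0, -1, "root", "object", "optional")]
        self.symbols = {}  # assumed symbol table: (parent_id, field_name) -> node_id

    def lookup_or_add(self, parent_id, name, value):
        key = (parent_id, name)
        if key not in self.symbols:
            repeated = isinstance(value, list)
            # Element type is taken from the first array element (types assumed stable).
            sample = value[0] if repeated and value else value
            node = SchemaNode(len(self.nodes), parent_id, name, type_name(sample),
                              "repeated" if repeated else "optional")
            self.nodes.append(node)
            self.symbols[key] = node.node_id
        return self.symbols[key]

    def observe(self, obj, parent_id=0):
        """Walk one parsed JSON object and extend the tree with any new fields."""
        for name, value in obj.items():
            node_id = self.lookup_or_add(parent_id, name, value)
            for item in (value if isinstance(value, list) else [value]):
                if isinstance(item, dict):
                    self.observe(item, node_id)

tree = SchemaTree()
tree.observe(json.loads('{"id": 1, "tags": ["a", "b"], "user": {"name": "x"}}'))
tree.observe(json.loads('{"id": 2, "user": {"name": "y", "age": 3}}'))
summary = [(n.parent_id, n.name, n.type_name, n.repetition) for n in tree.nodes[1:]]
```

The second record only adds the new `age` node under `user`; fields already in the symbol table are left unchanged.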
1.3.2 Building the Protocol Buffers syntax tree, as shown in Fig. 5:
As shown in the following example, Protocol Buffers can define a message as a new data type in a proto file. Each field it contains may be either a basic data type or another composite type. When building the tree, the present invention first parses the proto file and registers the new data types; then, starting from the root node (root) specified by the user, the definitions of these data types are expanded one by one and assembled into the syntax tree (Schema Tree) of the data structure. After that, each piece of text data can be parsed according to the definition in the syntax tree.
1.4 Data parsing
This subsection introduces the data parsing algorithms of STEED. The implementation of many low-level base classes in the system is omitted here; only the algorithms related to parsing text data are listed.
Since semi-structured data defines two composite data structures, object and array, the present invention parses the two with different methods. On the other hand, the output of row and column binary data is the same for JSON and Protocol Buffers in this implementation, so the following first introduces the parsing algorithms of JSON and Protocol Buffers separately, and then explains how the parsed data is output as row and column binary data.
1.4.1 JSON data parsing algorithm
As shown in the following algorithm, atomic and composite data types are parsed with different strategies: atomic data is converted directly from its text value into binary data to be stored or output; composite data must be analyzed and parsed structurally until all of its child fields are atomic, and is then written to the storage file according to its row or column storage structure. When parsing JSON text data, each field must also be checked to determine whether it is a new node, and the existing syntax tree modified accordingly.
For a nested structure in semi-structured data (left half of the box above), the fields at the same level are first split into key-value pairs, and each pair is then analyzed in turn. Each key is checked to see whether it has appeared before; if not, the corresponding Schema Tree is updated, and the value of the field is recorded in the Schema Tree. The values recorded at each node of the Schema Tree are then parsed recursively: for a composite data type, the corresponding composite parsing function is called to continue parsing; for a simple value, it is output directly to the final result.
For the array of a repeated field (right half of the box above), which holds multiple values of the same field, only the corresponding parsing function needs to be called on each element in turn; no modification of the schema tree needs to be considered.
1.4.2 Protocol Buffers data parsing algorithm
For Protocol Buffers data, parsing the text data is simpler than for JSON: since the data format is already defined before parsing, the present invention does not need to check or modify the syntax tree during parsing and only needs to parse the value of each field in the record. The parsing method itself is similar to that for JSON: for composite types the corresponding parsing function is called; for simple types the value is output directly to the result.
1.4.3 Output algorithm for row and column data
During parsing, STEED parses the data into the binary row or column format. The detailed process of outputting row or column data is as follows:
(1) Row composite type output algorithm:
As shown in the algorithm above, for the composite types object and array, row data uses the corresponding row-structure object and adds the value of each field to it until the whole record has been parsed.
(2) Column composite type output algorithm:
Compared with row output, column output only needs to write the concrete values at the leaf nodes, together with their structural information, directly to the files. The semantic structure of objects and arrays therefore need not be retained during parsing; only the structure-related information is recorded and written to the column storage files. This makes the binary output process relatively simple and efficient.
Part 2 Data storage module
After the data parsing module has parsed the data into row or column format, the data storage module stores the parsed result and performs certain structural conversions, such as conversion between the row and column formats and direct output of binary data as text. This chapter first introduces the underlying storage structures of row and column binary data. Then, based on the assembly algorithm of Google Dremel, it explains how STEED assembles column data into row data.
2.1 Overview of the row storage structure
In the parsing process described in the previous chapter, atomic data is stored in its binary format; the two composite structures, object and array, are stored in the format shown in Fig. 6:
The row storage structure, similar to the column storage structure, mainly consists of the following parts:
(1) Header Information: records information about this storage structure, such as its size and the number of elements it contains.
(2) (ID) OFFSET Array: for an object, the id of each field must be recorded to indicate that its value is present; for an array, every value is an assignment of the same field, so only the offset of each value is kept.
(3) Value Array: the row storage structure stores all values as an array; each value may be either atomic data or composite data. In an object, the values belong to different fields, so their types may differ; in an array, the values are multiple assignments of the same field, so the present invention assumes by default that they all have the same type. Using the offset of each value recorded above, the value of any field can be accessed randomly.
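To make the three-part layout concrete, the following sketch packs a row object as header, (id, offset) array and value area, then reads one field back by offset. The byte layout (little-endian 32-bit size, count, id and offset) and the names `pack_object` and `read_field` are assumptions for illustration; the source does not specify field widths.

```python
import struct

def pack_object(fields):
    """fields: list of (field_id, value_bytes).
    Assumed layout: header <size, count>, then <id, offset> pairs, then the values."""
    count = len(fields)
    offset = 8 + 8 * count            # values start right after header and offset array
    entries, values = [], b""
    for field_id, payload in fields:
        entries.append(struct.pack("<II", field_id, offset))
        values += payload
        offset += len(payload)
    header = struct.pack("<II", offset, count)   # final offset equals the total size
    return header + b"".join(entries) + values

def read_field(buf, field_id):
    """Random access to one field through the (id, offset) array; None if absent."""
    size, count = struct.unpack_from("<II", buf, 0)
    for i in range(count):
        fid, start = struct.unpack_from("<II", buf, 8 + 8 * i)
        if fid == field_id:
            if i + 1 < count:
                end = struct.unpack_from("<II", buf, 8 + 8 * (i + 1))[1]
            else:
                end = size
            return buf[start:end]
    return None

row = pack_object([(1, struct.pack("<q", 42)), (3, b"hello")])
```

An absent field id (here 2) simply has no entry in the offset array, which is how an unassigned optional field can be represented without storing a placeholder value.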
2.2 Overview of the column storage structure
The column storage structure is more complex than the row structure. The present invention defines the following concepts to represent and store structural information in the column structure:
(1) Repetition Level: at which repeated field in the field's path the value has repeated, i.e. the level on which the repetition in the data takes place.
(2) Definition Level: the number of fields in the path that could be undefined (optional or repeated fields) but are actually present in the data.
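The two levels can be illustrated with a small sketch that walks one record along one field path and emits a (repetition level, definition level, value) triple per column item, in the style of Dremel column striping. The function name `stripe` and the `(name, kind)` path encoding are hypothetical; the sample record is made up for the example.

```python
def stripe(record, path):
    """path: list of (field_name, kind), kind in {"required", "optional", "repeated"}.
    Returns one (repetition_level, definition_level, value) triple per column item."""
    out = []

    def walk(node, depth, rep_level, def_level, cur_rep):
        if depth == len(path):
            out.append((cur_rep, def_level, node))
            return
        name, kind = path[depth]
        child = node.get(name)
        if child is None or child == []:
            # Field missing at this level: emit a null with the levels reached so far.
            out.append((cur_rep, def_level, None))
            return
        if kind == "repeated":
            rep_level += 1
        if kind != "required":
            def_level += 1            # only omissible fields count towards the definition level
        items = child if kind == "repeated" else [child]
        for i, item in enumerate(items):
            # The first item inherits the caller's repetition level; later items repeat here.
            walk(item, depth + 1, rep_level, def_level, cur_rep if i == 0 else rep_level)

    walk(record, 0, 0, 0, 0)
    return out

record = {"Name": [
    {"Language": [{"Code": "en-us"}, {"Code": "en"}], "Url": "A"},
    {"Url": "B"},
    {"Language": [{"Code": "en-gb"}]},
]}
code_path = [("Name", "repeated"), ("Language", "repeated"), ("Code", "required")]
items = stripe(record, code_path)
```

In this example the second `Name` has no `Language`, so it contributes a null item with repetition level 1 (repeated at `Name`) and definition level 1 (only `Name` is present).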
How this information is used to convert column data into row data is explained in detail in a later subsection; here the present invention only introduces the column storage structure.
The CAB (Column Align Block) is the basic unit of column storage in the present invention. During parsing, each value generates one Column Item that is stored in a CAB. Because semi-structured data contains many repeated fields, a repeated field may cause several Column Items to be inserted into the CAB for a single record. To improve storage and query efficiency, CABs are aligned by record id: every CAB stores the same number of records, regardless of the number of column items.
The structure of the CAB is shown in Fig. 7. It mainly consists of the following four parts:
(1) Header information: describes the CAB, including its size and the number of records it stores.
(2) Repetition Array: records the repetition value of each Column Item. Because the maximum repetition value is bounded by the maximum depth of the field, the value is stored in a few bits rather than an integer. By analyzing the data, the present invention identifies the following templates for the patterns that can occur and optimizes for each, as shown in Fig. 8.
A) None Repeated: no field in the nesting path is repeated, so the field has at most one value per record and is never assigned repeatedly. If a field has no value, STEED inserts a null during parsing, so records are naturally aligned: every record has exactly one column item in the corresponding column data file. The present invention therefore omits this array from the storage structure.
B) Single Repeated: exactly one level in the nesting path can repeat. Only the first Column Item of each record (the record boundary) needs to be marked, so 1 bit suffices to distinguish the first Column Item of a record from a repetition at the single repeated level.
C) Multi Repeated: several levels in the nesting path can repeat. If there are multiple repeated fields on the path from the root to the leaf node, multiple bits are needed to indicate the field at which the repetition occurred.
After analyzing concrete data, the present invention finds that the vast majority of fields repeat on at most one level. By storing the repetition array with these three templates, the column storage structure improves the efficiency of storage and operations.
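A minimal sketch of how a template could be chosen from the schema path of a column. The function name and the bit-width rule for the Multi Repeated case (enough bits to encode levels 0 to the number of repeated fields) are assumptions; the source only states that multiple bits are needed.

```python
def repetition_template(path_kinds):
    """path_kinds: kinds of the fields from root to leaf, e.g. ["repeated", "optional"].
    Returns (template_name, bits_per_column_item) for the repetition array."""
    repeated = sum(1 for kind in path_kinds if kind == "repeated")
    if repeated == 0:
        return ("None Repeated", 0)      # array omitted: exactly one item per record
    if repeated == 1:
        return ("Single Repeated", 1)    # one bit marks the record boundary
    # Assumed encoding: repetition levels 0..repeated need bit_length(repeated) bits.
    return ("Multi Repeated", repeated.bit_length())

flat = repetition_template(["optional", "optional"])
single = repetition_template(["repeated", "optional"])
multi = repetition_template(["repeated", "repeated", "required"])
```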
(3) Value Area: records the values of all Column Items. The present invention uses two different storage strategies for fixed-length and variable-length data:
A) Fixed-length data types: every value has the same length, so the space occupied by each value is recorded in the Header, and the next value is read by advancing a fixed length each time; no additional offset array is needed.
B) Variable-length data types: values differ in length, so the storage position of each value must be recorded in an offset array. For fields whose values are heavily repeated (such as a user's language), only one copy of each distinct variable-length value is stored, and different column items reuse the offset of that value, which improves storage efficiency.
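The offset reuse for variable-length values can be sketched as below: each distinct value is written to the value area once, and every column item stores a (start, length) pair pointing at it. The function name and the (start, length) representation are assumptions for illustration.

```python
def build_value_area(values):
    """Store each distinct string once; every column item gets a (start, length) offset."""
    area = bytearray()
    offsets = []   # one entry per column item
    seen = {}      # encoded value -> (start, length) of its single stored copy
    for value in values:
        data = value.encode("utf-8")
        if data not in seen:
            seen[data] = (len(area), len(data))
            area += data
        offsets.append(seen[data])
    return bytes(area), offsets

def read_value(area, offsets, i):
    start, length = offsets[i]
    return area[start:start + length].decode("utf-8")

area, offsets = build_value_area(["en", "zh", "en", "en"])
```

Four column items occupy only four bytes of value area here, because the three occurrences of `en` share one stored copy.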
2.3 Row and column format conversion algorithms
This part introduces the algorithms by which the storage module converts between row and column files. For row data, each semi-structured data set produces one row data file after parsing, which stores all field values and structural information of all records. For column data, each text data set produces several column storage files: each field produces one column file that stores all values of that field across all records. At run time the storage module therefore needs to convert stored data between the row and column formats, and also needs to be able to output data directly as JSON text.
2.3.1 Parsing row data into column data
Converting row data to column data is similar to parsing text data and is not repeated here. Because no character matching against the text structure is needed (the already parsed row object and array structures are used) and no conversion from text to binary is needed, converting rows to columns is significantly more efficient than parsing text.
2.3.2 Assembling column data into row data
Assembling the column data files into a row-format file according to certain rules converts column data into row data. STEED uses an algorithm similar to the assembly algorithm of Google Dremel to assemble the column files. The specific algorithm is as follows:
During assembly, STEED reads Column Items from the Column Readers in the order given by a finite-state machine and by the repetition values in the Column Items, and then uses the definition value to determine and output the corresponding nesting-level information. When the last Column Reader has read the last Column Item of the current record, i.e. all Column Readers have been traversed, the Assembler has completed the assembly of one record. The Assembler keeps running until all records have been assembled, at which point every Column Reader should have reached the end of file (EOF).
In the algorithm below, besides the main assembly procedure AssembleRecd, the two functions move and return each use the definition value to determine the nesting-level structure information of the data.
In the following pseudocode, the present invention first traverses the schema tree from the root node using depth-first search, then orders the column files by the order in which their leaf nodes appear, and reads the column items in each column file in that order. The content of each column item read controls the nesting-level information of the data: first the information describing the nested structure of the current column is output, then the value is output into the row structure being assembled, then the next column file to jump to and read is determined, and finally one column item is pre-read from the next column to determine the nesting level to return to, and the related structural information is output. Once every column file has been read at least once, the assembly of one record is complete. This process is repeated until all column files have been read completely, which completes the assembly of the entire data set.
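The following is a deliberately simplified sketch of record assembly, not the finite-state-machine algorithm described above: it only handles top-level fields that are either non-repeated or repeated on a single level, and uses a repetition value of 0 as the record boundary. All names and the column representation are assumptions for illustration.

```python
def assemble(columns):
    """columns: {field_name: (is_repeated, [(repetition, value), ...])}.
    A repetition of 0 marks the first column item of a record; None marks a null."""
    cursors = {name: 0 for name in columns}
    records = []
    while any(cursors[name] < len(items) for name, (_, items) in columns.items()):
        record = {}
        for name, (is_repeated, items) in columns.items():
            i = cursors[name]
            values = [items[i][1]]          # first item of this record
            i += 1
            while i < len(items) and items[i][0] != 0:
                values.append(items[i][1])  # further repetitions within the same record
                i += 1
            cursors[name] = i
            values = [v for v in values if v is not None]   # drop alignment nulls
            if values:
                record[name] = values if is_repeated else values[0]
        records.append(record)
    return records

columns = {
    "id": (False, [(0, 1), (0, 2)]),
    "tags": (True, [(0, "a"), (1, "b"), (0, None)]),
}
records = assemble(columns)
```

Because every column holds at least one item per record (a null where the field is absent), the column cursors stay aligned and reach the end of their data together.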
Part 3 Query analysis module
Based on row and column data, STEED supports SQL-like query analysis. Compared with traditional relational table data, however, semi-structured data has nested and repeated fields, which introduces some ambiguity into queries. The present invention therefore extends the query syntax so that this ambiguity can be eliminated to a certain extent. The present invention also implements some basic SQL operations, such as project, filter, group by and sort. This chapter first introduces the extended semantics for semi-structured data, and then describes in turn the algorithms of the semi-structured data operations implemented in the system.
3.1 Semantic extensions of SQL for semi-structured data
Traditional relational data is stored flat in tables: all values are on the same level with no nesting; each field has exactly one value; and tables are split at design time so that there are no large numbers of sparse fields. None of these properties holds for semi-structured data. To support operations on semi-structured data, the present invention defines the following new operators:
(1) " ": separates the nesting levels in the path expression of a field;
(2) "any": denotes an arbitrary value of a repeated field;
(3) "all": denotes all values of a repeated field.
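One way to read the "any" and "all" operators is as quantifiers over the values a path expression reaches in a record. The sketch below assumes `.` as the path separator (the separator character is not legible in the source) and treats "all" over an empty value set as false; both are assumptions, as are the function names.

```python
def values_at(record, path):
    """Collect every value reachable along a dotted path, flattening repeated fields."""
    nodes = [record]
    for name in path.split("."):
        next_nodes = []
        for node in nodes:
            child = node.get(name) if isinstance(node, dict) else None
            if child is None:
                continue
            next_nodes.extend(child if isinstance(child, list) else [child])
        nodes = next_nodes
    return nodes

def match_any(record, path, predicate):
    return any(predicate(v) for v in values_at(record, path))

def match_all(record, path, predicate):
    values = values_at(record, path)
    return bool(values) and all(predicate(v) for v in values)

rec = {"user": {"langs": [{"code": "en"}, {"code": "zh"}]}}
codes = values_at(rec, "user.langs.code")
```

With this reading, a condition such as "any language code equals zh" holds for `rec`, while "all language codes equal en" does not.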
The output result can take several forms:
(1) data in JSON format;
(2) JSON-like data that ignores the nested structure.
3.2 Operation types supported by STEED
As shown in Fig. 9, STEED supports several types of operations on row and column data. Data flows between operators in pull mode, until the output operator at the top converts the binary data to text. The implementation details of each operator are introduced in turn below.
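The pull model can be sketched with Python generators, where each operator pulls rows from its child only when its own consumer asks for one. The operator names mirror those in the text, but the in-memory rows, the predicate form and the JSON output are stand-ins for the binary files and internal classes, which the source does not detail.

```python
import json

def row_from(rows):
    """Leaf operator: yields one row object at a time (stands in for reading a row file)."""
    for row in rows:
        yield row

def schema_filter(child, predicates):
    """Pulls rows from its child and passes on those satisfying every predicate."""
    for row in child:
        if all(pred(row) for pred in predicates):
            yield row

def output(child):
    """Top operator: converts each row it pulls into JSON text."""
    for row in child:
        yield json.dumps(row, sort_keys=True)

rows = [{"name": "a", "age": 25}, {"name": "b", "age": 41}]
plan = output(schema_filter(row_from(rows), [lambda r: r.get("age", 0) > 30]))
result = list(plan)   # nothing is read until the top operator is pulled
```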
3.2.1 Row From Operator (row data read operation)
STEED reads a whole row record from the row data file. Since records are stored one by one in the row data file, each in the Row Object storage format, the present invention reads one Row Object from the row binary file at a time, in order, until it reaches the end of file (EOF).
3.2.2 Schema Filter (Where or Having Clause) Operator (schema-based filter operation for where and having clauses)
In this operator, the present invention performs the filter operation on row data. It can be used to evaluate the conditions of the where clause after STEED has read the row data, and it can also be used to filter the aggregation results after the group by clause has generated new row data.
In the filter operation itself, the present invention defines RowCondition (row condition class) to determine whether the relevant fields of a record satisfy the condition of each predicate. The evaluation proceeds as follows:
The present invention first parses the where clause and instantiates each predicate as an object that can read data from the row structure and compare it. The values read are then compared to determine the truth value of each predicate, which decides whether the record passes the condition of this operator.
3.2.3 Project Operator (projection operation)
Row data stores all fields of every record, but most queries need the values of only a few fields. Without projection, the data of many fields unrelated to the query would be copied repeatedly between operators throughout the query, and these extra memory copies would reduce query efficiency. The present invention therefore implements the project operation on row data to extract the fields relevant to the query, so that only those fields are copied, which improves query efficiency.
During the operation, the present invention handles the nested structure of semi-structured data with recursive function calls. For each field, the value in the original data is read and parsed, and only the fields relevant to the query are written to the result of the operation. In this way the many unrelated fields in the row data are skipped, which improves the efficiency of the query. For a repeated field, if the repetition is at a leaf node, STEED only needs to copy the contiguously stored values in the array of this field directly; if the repetition is at a non-leaf node, each sub-structure in the array is recursed into and parsed separately. Note that when extracting a subtree, only subtrees that have been assigned are kept; that is, if none of the relevant fields in a subtree is assigned, the subtree is not retained in the result of the projection.
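The recursive projection can be sketched on plain Python dictionaries and lists, which stand in here for the binary row objects and arrays. The `wanted` specification format (a nested dict of field names, with `{}` meaning "keep the whole value") is an assumption for the example.

```python
def project(value, wanted):
    """Keep only the fields named in `wanted`; return None if nothing assigned remains."""
    if not wanted:
        return value                      # leaf of the projection: copy the value as is
    if isinstance(value, list):
        # Repetition at a non-leaf node: recurse into each sub-structure separately.
        kept = [p for p in (project(v, wanted) for v in value) if p is not None]
        return kept or None
    if not isinstance(value, dict):
        return None
    out = {}
    for name, sub in wanted.items():
        if name in value:
            projected = project(value[name], sub)
            if projected is not None:
                out[name] = projected
    return out or None                    # unassigned subtrees are dropped

rec = {
    "id": 1,
    "user": {"name": "x", "lang": ["en", "zh"]},
    "geo": {"lat": 1.0},
    "items": [{"sku": "a", "qty": 2}, {"note": "n"}],
}
projected = project(rec, {"user": {"lang": {}}, "items": {"sku": {}}, "missing": {"x": {}}})
```

The second element of `items` has no `sku`, so its subtree is not retained, matching the rule that only assigned subtrees survive the projection.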
3.2.4 Assemble Operator (assembly of column data into row data)
In this operator, STEED assembles the fields relevant to the query from column data into row data. The assembly algorithm is described above. STEED first uses the Query Parser to parse the SQL statement to be executed and obtain all fields relevant to the query, and uses them to build a finite-state machine (FSM) that controls the order in which the column data is read during assembly. The conversion from column to row data is then completed according to the assembly algorithm described earlier, and is not repeated here.
3.2.5 Column Filter Operator (column-to-row assembly with filtering)
Compared with the Assemble operator (assembly of column data into row data), the Column Filter Operator not only assembles column data into row data but can also filter each record during assembly. Since the where clause filters out records that do not satisfy the condition, skipping these invalid records during assembly can greatly improve query efficiency. During a query, the present invention therefore reads one CAB at a time, applies the filter, and builds a corresponding bitmap that records the comparison results; the recorded results then determine which records are assembled.
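A minimal sketch of filtering one record-aligned block before assembly. The block is modelled as a dict of per-record value lists, and a record passes if any of its values satisfies the predicate; this representation, the "any value" rule and the function names are assumptions for illustration.

```python
def filter_bitmap(column_values, predicate):
    """column_values: per-record lists of values for one column of a record-aligned block.
    Bit i of the result is set if record i passes the predicate."""
    bitmap = 0
    for i, values in enumerate(column_values):
        if any(predicate(v) for v in values):
            bitmap |= 1 << i
    return bitmap

def assemble_selected(block, bitmap):
    """Assemble only the records whose bit is set; the others are skipped entirely."""
    record_count = len(next(iter(block.values())))
    return [
        {name: column[i] for name, column in block.items()}
        for i in range(record_count)
        if (bitmap >> i) & 1
    ]

block = {"id": [[1], [2], [3]], "score": [[10], [55, 3], [70]]}
bitmap = filter_bitmap(block["score"], lambda v: v > 50)
selected = assemble_selected(block, bitmap)
```

Because every column of the block covers the same records, one bitmap computed on the filtered column can be applied to all other columns without re-evaluating the predicate.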
3.2.6 Join Operator (join operation)
STEED implements the join operation as a hash join and currently supports only joins of two tables. When executing this operation, STEED computes a hash value from the join key value of each record in one data set and stores the whole record in the hash table. It then traverses the other data set and looks up the position (bucket) in the hash table that has the same hash key. The two row records are then merged, and the merged record waits to be pulled by the operator on the layer above. STEED does not currently use a query optimizer like those in relational databases, so it is recommended that the smaller data set be listed first in the from clause, to obtain higher storage efficiency.
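The build and probe phases described above can be sketched as follows, with Python dictionaries standing in for row records and for the hash table. The function signature is hypothetical, and treating the first argument as the data set stored in the hash table is an assumption based on the recommendation above.

```python
from collections import defaultdict

def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Two-table hash join: hash the first data set, then probe with the second."""
    table = defaultdict(list)
    for row in build_rows:                 # build phase: store whole records by join key
        table[row[build_key]].append(row)
    for row in probe_rows:                 # probe phase: look up the matching bucket
        for match in table.get(row[probe_key], []):
            yield {**match, **row}         # merged record, pulled by the parent operator

users = [{"uid": 1, "name": "a"}, {"uid": 2, "name": "b"}]
orders = [{"uid": 2, "item": "x"}, {"uid": 2, "item": "y"}, {"uid": 3, "item": "z"}]
joined = list(hash_join(users, orders, "uid", "uid"))
```

Only the build side is held in memory, which is why passing the smaller data set first keeps the hash table small.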
3.2.7 Group Operator (grouping operation)
Among the query operations supported at this stage, Group is the most complex. The present invention introduces some newly defined classes used in the computation and analyzes the corresponding execution process. Like the join operator, the group operator uses a hash table to store the key values by which the group key is grouped. During the operation, data is first read from the row-structured data, its hash key is computed, and it is added to the hash table. Then, depending on whether the query contains aggregation operations, the content of the hash value is updated accordingly. The data storage structure of the hash value is shown in Figure 10:
The present invention first defines HashValueItemContainer to hold each storage unit (bucket) in the hash table; its value is the address of the HashValueItem objects in the hash table. Each such object has the structure shown in Figure 10:
(1) In the middle layer, the present invention first saves the address at which the record is stored and the content of each aggregation that needs to be computed.
(2) The Block Buffer object stores the actual content of the saved records. Note that, apart from the fields whose values serve as the grouping fields, the fields in these records represent aggregation results that have not yet been assigned. After the entire group operation completes, the present invention writes the aggregation results into the corresponding positions, and the result waits to be pulled by other operators in the layer above.
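The two-layer arrangement described above can be approximated by the following Python sketch. It is a simplified illustration under stated assumptions: a single sum aggregation, a Python list standing in for the Block Buffer, and a list index standing in for the stored record address. The names `group_sum`, `block_buffer` and `table` are hypothetical.

```python
from typing import Dict, List


def group_sum(rows: List[dict], key: str, agg_field: str) -> List[dict]:
    """Group rows by `key` and sum `agg_field` within each group."""
    out_field = f"sum({agg_field})"
    # Stand-in for the Block Buffer: one stored record per group, whose
    # aggregation field is left unassigned until grouping completes.
    block_buffer: List[dict] = []
    # Stand-in for the hash table: group key -> [record index, running sum].
    table: Dict[object, list] = {}

    for row in rows:
        k = row[key]
        if k not in table:
            block_buffer.append({key: k, out_field: None})
            table[k] = [len(block_buffer) - 1, 0]
        table[k][1] += row[agg_field]

    # After the whole group operation completes, write each aggregation
    # result into the corresponding stored record.
    for index, total in table.values():
        block_buffer[index][out_field] = total
    return block_buffer
```

Deferring the write until all input has been consumed mirrors the description: the stored records carry placeholders, and the aggregation state lives in the intermediate layer.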
3.2.8 Order Operator (sorting operation)
For the order by operation, the present invention needs to store all records in a buffer and then sort them by comparison. For efficient memory allocation, the present invention requests only fixed-size blocks of memory from the operating system each time, which avoids the memory-copy cost incurred when realloc reallocates memory. In addition, to avoid repeatedly copying data during sorting, the present invention uses an array to record the start address of every record and, during sorting, changes only the positions of the pointers in that array. When the array is finally accessed sequentially, the records are reached in the required sort order.
Furthermore, according to the sort conditions, the present invention defines a comparer to compare records; it operates as follows:
(1) The comparer can read from the row storage structure the values of all fields used in the comparison.
(2) To improve comparison efficiency, the present invention implements the compare-and-output process as follows:
a) Each record reserves 8 bytes to store the leading 8 bytes of the data in the first field to be compared. For all numeric types, this space is enough to hold the value itself, so no value needs to be fetched from the more complex row structure; for strings, comparing the first 8 bytes yields a definite result in most cases. During comparison, therefore, the present invention first compares these 8 cached bytes. Only when the data is a string and the prefixes are identical does the present invention proceed to the next step of the comparison.
b) The comparer fetches values field by field, in the order of the fields to be sorted on, and compares them until a comparison result is obtained.
c) The present invention implements the comparison itself with the STL::sort function.
d) The records themselves are not copied during comparison; only the pointer array that defines the output order is modified, which avoids repeated memory copies. When an upper-layer operator pulls data, the present invention likewise provides only the corresponding pointers, improving data-processing efficiency.
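The prefix-first comparison and pointer-array sorting described above can be sketched in Python as follows. This is an illustration only: an index list stands in for the pointer array, Python's built-in `sorted` stands in for the STL sort, integers are assumed to fit in a signed 64-bit range, and the first sort field is assumed to hold either integers or strings. The helper names are hypothetical.

```python
import struct
from functools import cmp_to_key
from typing import List


def prefix8(value) -> bytes:
    """Return an 8-byte prefix whose byte order matches the value order."""
    if isinstance(value, int):
        # Bias a signed 64-bit integer into unsigned range so that
        # big-endian byte order equals numeric order.
        return struct.pack(">Q", value + (1 << 63))
    # Strings: first 8 UTF-8 bytes, zero-padded to a fixed width.
    return value.encode("utf-8")[:8].ljust(8, b"\x00")


def sort_records(records: List[dict], fields: List[str]) -> List[int]:
    """Return record indexes in sorted order; the records are never moved."""
    prefixes = [prefix8(r[fields[0]]) for r in records]

    def compare(i: int, j: int) -> int:
        # Fast path: the cached 8-byte prefixes usually decide the order.
        if prefixes[i] != prefixes[j]:
            return -1 if prefixes[i] < prefixes[j] else 1
        # Slow path: fetch full values field by field until they differ.
        for f in fields:
            a, b = records[i][f], records[j][f]
            if a != b:
                return -1 if a < b else 1
        return 0

    return sorted(range(len(records)), key=cmp_to_key(compare))
```

An upper-layer operator would then walk the returned index list and read each record in place, which corresponds to handing out pointers rather than copies.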
Part 4: Method and system for optimizing tree data using the simple-path property
In this part, the present invention summarizes the concept of a simple path from data of several existing data sources and uses this property to optimize queries in STEED.
4.1 Definition of a simple path
Analysis of data from a variety of sources shows that, in the syntax tree of each data set, a large number of root-to-leaf paths contain at most one repeated field. The present invention can exploit this structural property to optimize the query process and improve query efficiency. The present invention therefore defines a simple path as follows: in the syntax tree of a data set, a path from the root to a leaf node on which at most one field (a node in the syntax tree) is multi-valued is called a simple path. STEED can use simple paths to optimize both the storage and the querying of tree data.
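The definition above can be checked mechanically. The following Python sketch walks a syntax tree and marks each root-to-leaf path as simple when it contains at most one repeated node. The `SchemaNode` class and the dotted path names are assumptions made for this illustration and are not STEED's data structures.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SchemaNode:
    name: str
    repeated: bool = False
    children: List["SchemaNode"] = field(default_factory=list)


def simple_leaf_paths(root: SchemaNode) -> Dict[str, bool]:
    """Map each root-to-leaf path to True if it is a simple path."""
    result: Dict[str, bool] = {}

    def walk(node: SchemaNode, path: str, repeated_count: int) -> None:
        if node.repeated:
            repeated_count += 1
        path = f"{path}.{node.name}" if path else node.name
        if not node.children:
            # A path is simple when at most one node on it is multi-valued.
            result[path] = repeated_count <= 1
        for child in node.children:
            walk(child, path, repeated_count)

    walk(root, "", 0)
    return result
```

In a tree where a repeated `items` node contains a repeated `labels` leaf, the path to `labels` has two repeated nodes and is therefore not simple, while the path to a sibling non-repeated leaf is.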
4.2 Structure of row storage for semi-structured data
As described earlier, STEED uses a relatively complex row storage structure in order to express the hierarchy of tree-structured data exactly. Analysis indicates that, as a general data representation, this structure cannot be optimized or improved much further. However, the analysis of simple paths above shows that the present invention can improve how efficiently data is represented inside the system by simplifying the structural information stored with the data, which further speeds up parsing and querying. The improved row storage structure envisaged by the present invention is shown in Figure 11: for data on a simple path, STEED stores only the structural information of the leaf node (field) to refer to the corresponding path, replacing the original nested storage structure. With the simple-path optimization, STEED can use the leaf-node information in the data to look up, in the system's syntax tree (Schema Tree), the information of every node on the whole path. By simplifying the structural information stored in the data, STEED thus improves both the representation efficiency of row data and the execution efficiency of queries.
4.2 Flatten Assemble (flat row structure assembler)
During data assembly, STEED has to spend considerable cost restoring the hierarchy of the data. As noted earlier, the fields in most data sets repeat no more than 2 levels deep, so most values in the data can be optimized with simple paths. For fields on simple paths, assembly in STEED is much simpler:
The Flatten Assembler (flat row structure assembler) ignores the hierarchical relationships of the default binary data: it uses only the leaf node to represent the path from the root to that leaf and ignores all non-leaf nodes on the path. In this way the present invention limits the nesting of row-structured data to a single level, which saves memory during data queries and improves query efficiency.
The specific assembly algorithm is as shown above:
Before assembly, the columns to be assembled are sorted by the ID of their leaf nodes. Then, for every record, all Column Items of that record are read in order from each Column Reader, and the values read are written, together with the relevant structural information, into the assembled result. Because the assembled result keeps only one level of nesting, STEED only needs to append the value of each field to the current object during assembly and does not need to consider nesting relationships in the result.
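The assembly steps just described can be sketched in Python as follows. This is a simplified illustration: each column is modeled as a list holding, for every record, the list of values that record has for the leaf (zero, one or several), and every column is assumed to cover the same number of records. The function name and data layout are hypothetical.

```python
from typing import Dict, List, Tuple


def flatten_assemble(
    columns: Dict[int, List[List[object]]],
) -> List[List[Tuple[int, object]]]:
    """Assemble flat rows of (leaf ID, value) pairs from per-leaf columns."""
    # Sort the columns to be assembled by the ID of their leaf node.
    leaf_ids = sorted(columns)
    record_count = len(next(iter(columns.values()))) if columns else 0

    rows = []
    for r in range(record_count):
        row: List[Tuple[int, object]] = []
        for leaf_id in leaf_ids:
            # Read every item this record has in the column and append it
            # to the current object; no nesting needs to be rebuilt.
            for value in columns[leaf_id][r]:
                row.append((leaf_id, value))
        rows.append(row)
    return rows
```

Because the result has a single level of nesting, the inner loop is a plain append, which is the saving the flat assembler is meant to provide.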
4.3 Storage structure of flat row data
In the present invention, STEED uses row data in the flat structure during queries to optimize query computation. For non-simple paths in the syntax tree, STEED must distinguish multi-valued fields at different nesting levels, so the present invention keeps the system's default tree data representation. For simple paths, the present invention stores or assembles the data using the structure shown in Figure 11:
1) No repeated node on the path from the root to the leaf node in the syntax tree: the flat data storage structure only needs to store the ID of the leaf node and the value of the corresponding field;
2) Exactly one repeated node on the path from the root to the leaf node in the syntax tree: the flat data storage structure can be output in either of the following two forms (see Figure 12):
a) Each value of the repeated field is stored in the flat structure as a separate item: the data contains multiple items with the same ID, and their number is determined by the number of repetitions of the field;
b) The repeated field is stored in the flat structure as a whole: the data contains a single ID for the repeated field, and the field holds its multiple values in array form.
3) Multiple repeated nodes on the path from the root to the leaf node in the syntax tree: the flat data storage structure cannot express at which level of the path the values of the repeatable fields repeat, so the present invention keeps the original default tree data storage structure: the flat-structured data still uses the ID of the leaf node, but the corresponding value is an offset that points to where the complete nested structure is stored.
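The three cases above can be summarized in a short Python sketch. It is an illustration under stated assumptions: flat items are modeled as (leaf ID, value) tuples, a Python list stands in for the area that holds complete nested structures, and a list index stands in for the offset. The function name, the `as_array` flag and the `("offset", n)` marker are hypothetical.

```python
from typing import List, Tuple


def encode_field(
    leaf_id: int,
    values,
    repeated_on_path: int,
    nested_store: List[object],
    as_array: bool = False,
) -> List[Tuple[int, object]]:
    """Encode one field as flat items according to its path's repetition."""
    if repeated_on_path == 0:
        # Case 1: store only the leaf ID and the single value.
        return [(leaf_id, values)]
    if repeated_on_path == 1:
        if as_array:
            # Case 2b: one item whose value is an array of all values.
            return [(leaf_id, list(values))]
        # Case 2a: one item per value, all sharing the same leaf ID.
        return [(leaf_id, v) for v in values]
    # Case 3: keep the nested structure elsewhere and store an offset to it.
    nested_store.append(values)
    return [(leaf_id, ("offset", len(nested_store) - 1))]
```

Form 2a suits readers that scan item by item, while form 2b keeps one item per field; the choice between them is left open by the description.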

Claims (7)

1. A tree data processing method, characterized by comprising:
Step 1: setting the definitions of the binary data types and nested structures used to describe and define the fields in Protocol Buffers and JSON text data; establishing the definition of the semi-structured data, wherein, for Protocol Buffers text data, the syntax tree is dynamically generated from the syntax tree definition file before the data is parsed, and, for data in JSON format, the syntax tree is dynamically generated during parsing according to the format and content of the data; reading the semi-structured data and parsing it into binary format data in row or column form, wherein during parsing the syntax tree is dynamically generated or is established from a definition, comprising:
establishing a JSON syntax tree: the syntax tree is established dynamically from the data while the data is parsed, wherein it is assumed that the type of the value of each field does not change and that the member types within an array are consistent; while the syntax tree is being established, the type of each value is determined from the type of the value in the data, a field whose JSON value is an array is defined as repeated, and the remaining nodes are defined as optional; during parsing, the symbol table is first queried, using the parent node ID and the field name, for an existing structure definition; if there is none, the relevant node is added to the syntax tree; otherwise the value of the node is parsed;
establishing a Protocol Buffers syntax tree: Protocol Buffers defines a message as a new data type in a proto file, and each field it contains is of a basic data type or of another compound type; while the Protocol Buffers syntax tree is being established, the proto file is first parsed and the new data types are extended, after which the data type definitions are expanded one by one from the specified root node and assembled into the syntax tree of the data structure;
and storing the definition of the semi-structured data, comprising: defining and filling in each node of the syntax tree of the semi-structured data, wherein a node not only describes its own information but is also associated with other nodes through node IDs in the syntax tree, forming a tree structure;
Step 2: storing the binary format data in row or column form, wherein binary format data in row form and in column form can be converted into each other, and the binary format data can be output directly as JSON data in text format;
Step 3: performing query operations on the semi-structured data based on the binary format data.
2. The tree data processing method according to claim 1, characterized in that step 1 further comprises parsing the semi-structured data into: a row storage structure, stored with a single record as the unit and nested level by level; and a column storage structure, stored with a leaf in the tree data definition as the unit.
3. The tree data processing method according to claim 1, characterized in that data types of the binary format are defined for the storage and operation of the binary format data in row or column form:
1) integers: TypeInt(8/16/32/64), representing 8/16/32/64-bit integers respectively;
2) floating-point numbers: Type(Float/Double), representing floating-point numbers of type float and double respectively;
3) strings: TypeString, representing a character string;
4) timestamps: TypeTimeStamp, representing a timestamp, implemented internally with TypeInt64.
4. The tree data processing method according to claim 1, characterized in that step 3 comprises: when a query operation is executed, first generating, from the content of the query statement, the operator tree needed for this query, wherein every node in the operator tree is a SQL operation.
5. The tree data processing method according to claim 1, characterized by further comprising an extended query syntax, as follows:
(1) " ": used to separate the nesting levels in the path expression of a field;
(2) "any": denotes an arbitrary value in a repeated field;
(3) "all": denotes all values in a repeated field;
The output result is: data in JSON format; or JSON-like data that ignores the nested structure.
6. The tree data processing method according to claim 1, characterized by further comprising:
Row data read operation: reading whole row-structured records from the binary format data in row form, wherein one Row Object is read from the binary format data at a time, in sequence, until the end of file (EOF) is reached;
Row data filter operation: evaluating the conditions of the where clause on the row-form binary format data that has been read, and, after new row-form binary format data has been generated by the group by clause, filtering the aggregation results, wherein the where clause is first parsed, each predicate is instantiated as an object that can compare data, and the values read are then compared to determine the truth value of each predicate and decide whether the condition is passed;
Row data mapping operation: calling a recursive function to handle the nested structure of the semi-structured data, wherein for each field the field's value is read from the original data and parsed, and only fields relevant to the query are written into the result of the operation;
Join operation: implementing the join as a hash join, wherein a hash value is computed from the join key value of each record in one of the data sets and the whole record is stored in a hash table, the other data set is then traversed to look up the position in the hash table that has the same hash key, and the two row-structured records are then merged, the merged record waiting to be pulled by the operator in the layer above;
Group operation: first defining HashValueItemContainer to hold each storage unit in the hash table, its value being the address of the HashValueItem in the hash table, wherein (1) the middle layer first saves the address at which the record is stored and the content of each aggregation that needs to be computed;
(2) the Block Buffer object stores the actual content of the saved records, wherein, after the entire group operation completes, the aggregation results are written into the corresponding positions and the result waits to be pulled by other operators in the layer above;
Sort operation: storing all records in a buffer and sorting them by comparison, wherein only fixed-size memory is requested from the operating system each time, to store the multiple records obtained from the lower-layer operation, and during comparison an array records the start address of every record and the positions of the pointers in the array are changed during sorting;
defining a comparator according to the sort conditions, which operates as follows:
(1) the comparator reads from the row-form binary format data the values of all fields used in the comparison;
(2) to improve comparison efficiency, the compare-and-output process is as follows:
a) each record reserves 8 bytes to store the leading 8 bytes of the data in the first field to be compared;
b) the comparator fetches values field by field, in the order of the fields to be sorted on, and compares them until a comparison result is obtained;
c) the comparison is performed using the STL::sort function;
d) the records themselves are not copied during comparison; only the pointer array that defines the output order of the records is modified.
7. A system based on the tree data processing method according to any one of claims 1 to 6.
CN201710178695.6A 2017-03-23 2017-03-23 A kind of tree data processing method and system Active CN107092656B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710178695.6A CN107092656B (en) 2017-03-23 2017-03-23 A kind of tree data processing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710178695.6A CN107092656B (en) 2017-03-23 2017-03-23 A kind of tree data processing method and system

Publications (2)

Publication Number Publication Date
CN107092656A CN107092656A (en) 2017-08-25
CN107092656B true CN107092656B (en) 2019-12-03

Family

ID=59646394

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710178695.6A Active CN107092656B (en) 2017-03-23 2017-03-23 A kind of tree data processing method and system

Country Status (1)

Country Link
CN (1) CN107092656B (en)

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107783850B (en) * 2017-09-28 2020-10-16 北京天元创新科技有限公司 Method, device, server and system for analyzing node tree checking record
CN107801213A (en) * 2017-10-23 2018-03-13 深圳市沃特沃德股份有限公司 Data transmission method and device
CN107992992B (en) * 2017-11-07 2021-12-21 中国银行股份有限公司 Unionpay IC card transaction data analysis system and method
CN108491207B (en) * 2018-03-02 2020-11-17 平安科技(深圳)有限公司 Expression processing method, device, equipment and computer readable storage medium
CN108520053B (en) * 2018-04-04 2020-03-31 东北大学 Big data query method based on data distribution
CN109325022B (en) * 2018-07-20 2021-04-27 新华三技术有限公司 Data processing method and device
CN109508409A (en) * 2018-10-23 2019-03-22 魔秀科技(北京)股份有限公司 A kind of semi-structured json data freely parse adaptation method
CN109710620B (en) * 2018-12-29 2021-03-16 杭州复杂美科技有限公司 Data storage method, data reading method, device and storage medium
CN111435372A (en) * 2019-01-11 2020-07-21 阿里巴巴集团控股有限公司 Data display method and system, data editing method and system, equipment and medium
CN110263104B (en) * 2019-05-14 2022-12-27 创新先进技术有限公司 JSON character string processing method and device
CN110309007A (en) * 2019-07-02 2019-10-08 深圳市友华通信技术有限公司 The display output method and device of D-Bus
CN110618983B (en) * 2019-08-15 2023-01-06 复旦大学 JSON document structure-based industrial big data multidimensional analysis and visualization method
CN110719290A (en) * 2019-10-15 2020-01-21 杭州鸿雁智能科技有限公司 Protocol translation method and device for home interconnected network
CN111046630B (en) * 2019-12-06 2021-07-20 中国科学院计算技术研究所 Syntax tree extraction method of JSON data
CN111159316B (en) * 2020-02-14 2023-03-14 北京百度网讯科技有限公司 Relational database query method, device, electronic equipment and storage medium
CN112527794B (en) * 2020-12-07 2023-05-26 广州海量数据库技术有限公司 Data processing method and system for realizing aggregate data types in database
CN112559527B (en) * 2020-12-15 2022-06-07 武汉大学 Data conversion method based on multi-branch tree node relation matching
CN113297296B (en) * 2021-05-31 2022-08-16 西南大学 JSON processing method for multi-style type data
CN113505269B (en) * 2021-07-02 2024-03-29 卡斯柯信号(成都)有限公司 Binary file detection method and device based on XML
CN114357054B (en) * 2022-03-10 2022-06-03 广州宸祺出行科技有限公司 Method and device for processing unstructured data based on ClickHouse
CN116050358B (en) * 2023-03-21 2023-06-06 北京飞轮数据科技有限公司 Data processing method and device applied to dynamic data and electronic equipment

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR2821458A1 (en) * 2001-02-28 2002-08-30 Koninkl Philips Electronics Nv SCHEME, SYNTAX ANALYSIS METHOD, AND METHOD FOR GENERATING A BINARY STREAM FROM A SCHEME
ATE341901T1 (en) * 2001-07-13 2006-10-15 France Telecom METHOD FOR COMPRESSING A TREE HIERARCHY, ASSOCIATED SIGNAL AND METHOD FOR DECODING A SIGNAL
DE10231970B3 (en) * 2002-07-15 2004-02-26 Siemens Ag Coding method for data element positions in data structure e.g. for XML document coding, has position codes assigned to data element positions in given serial sequence
US7761459B1 (en) * 2002-10-15 2010-07-20 Ximpleware, Inc. Processing structured data
US20130151534A1 (en) * 2011-12-08 2013-06-13 Digitalsmiths, Inc. Multimedia metadata analysis using inverted index with temporal and segment identifying payloads
US10262012B2 (en) * 2015-08-26 2019-04-16 Oracle International Corporation Techniques related to binary encoding of hierarchical data objects to support efficient path navigation of the hierarchical data objects
WO2017070188A1 (en) * 2015-10-23 2017-04-27 Oracle International Corporation Efficient in-memory db query processing over any semi-structured data formats

Also Published As

Publication number Publication date
CN107092656A (en) 2017-08-25

Similar Documents

Publication Publication Date Title
CN107092656B (en) A kind of tree data processing method and system
CN107016071B (en) A kind of method and system using simple path characteristic optimization tree data
CN107491561B (en) Ontology-based urban traffic heterogeneous data integration system and method
CN107066551A (en) The line and column storage method and system of a kind of tree shaped data
CN110837492B (en) Method for providing data service by multi-source data unified SQL
CN111046630B (en) Syntax tree extraction method of JSON data
US20130124545A1 (en) System and method implementing a text analysis repository
US20240012810A1 (en) Clause-wise text-to-sql generation
US20130006968A1 (en) Data integration system
CN108509543B (en) Streaming RDF data multi-keyword parallel search method based on Spark Streaming
CN102411580B (en) The search method of XML document and device
CN103116625A (en) Volume radio direction finde (RDF) data distribution type query processing method based on Hadoop
CN105989150A (en) Data query method and device based on big data environment
CN105808746A (en) Relational big data seamless access method and system based on Hadoop system
CN107491476B (en) Data model conversion and query analysis method suitable for various big data management systems
CN113094449B (en) Large-scale knowledge map storage method based on distributed key value library
CN105608228B (en) A kind of efficient distributed RDF data storage method
US20060161525A1 (en) Method and system for supporting structured aggregation operations on semi-structured data
US20230350899A1 (en) Query engine for recursive searches in a self-describing data system
CN110795526A (en) Mathematical formula index creating method and system for retrieval system
CN114218472A (en) Intelligent search system based on knowledge graph
CN111752542A (en) Database query interface engine based on XML template
CN106484815B (en) A kind of automatic identification optimization method based on mass data class SQL retrieval scene
CN113157723B (en) SQL access method for Hyperridge Fabric
CN114372174A (en) XML document distributed query method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant