CN107092656A - A kind of tree data processing method and system - Google Patents

Info

Publication number
CN107092656A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710178695.6A
Other languages
Chinese (zh)
Other versions
CN107092656B (en)
Inventor
陈世敏
王智义
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN201710178695.6A priority Critical patent/CN107092656B/en
Publication of CN107092656A publication Critical patent/CN107092656A/en
Application granted granted Critical
Publication of CN107092656B publication Critical patent/CN107092656B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/258Data format conversion from or to a database
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/81Indexing, e.g. XML tags; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/84Mapping; Conversion

Abstract

The present invention proposes a tree-structured data processing method and system (System for TrEE structured Data, STEED), and relates to the technical field of data processing. The system reads text data and parses it into binary data in row (line) format or column format; during parsing, it dynamically generates a syntax tree that stores the definition of the semi-structured data. It stores the row-format or column-format binary data, converts between the two binary formats, and outputs the binary data directly as JSON text. Based on the binary data, it performs query operations on the semi-structured data.

Description

A tree-structured data processing method and system
Technical field
The present invention relates to the technical field of data processing, and in particular to a tree-structured data processing method and system (System for TrEE structured Data, STEED).
Background art
With the development of computer networks and big data processing technology, traditional relational data increasingly fails to meet the requirements for defining and using data in network and big data environments. Semi-structured data, typified by JSON and Protocol Buffers, can fully express object data in programming languages and also allows the original data format to be modified and extended as the format changes, so it is widely used in practice.
Tree-structured data is defined as follows:
Tvalue = Tprimitive | Tobject | Tarray
Tprimitive = string | number | boolean | null
Record = Tobject
That is:
1. A value in tree-structured data is one of three kinds: an object value, an array value, or a value of an atomic (primitive) type.
2. An object value is enclosed in braces and consists of key-value pairs. There may be any number of pairs, but the same key must not appear twice within one object.
3. An array value is enclosed in square brackets and consists of values. There may be any number of values, and repeated values are allowed.
4. A value of an atomic type is a string, a number, a Boolean (boolean), or null.
5. In the key-value pairs of item 2, a key can only be of string type.
6. Each record of tree-structured data is an object.
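The rules above can be illustrated with a minimal Python sketch. It is not part of the patented system: the function names parse_record and kind are hypothetical, and the standard json module stands in for a real parser. It rejects duplicate keys within one object (rule 2), requires each record to be an object (rule 6), and classifies values into the three kinds of rule 1.

```python
import json


def parse_record(text):
    """Parse JSON text into a record, enforcing rules 2 and 6 above."""
    def reject_duplicates(pairs):
        # Rule 2: the same key must not appear twice within one object.
        obj = {}
        for key, value in pairs:
            if key in obj:
                raise ValueError(f"duplicate key: {key!r}")
            obj[key] = value
        return obj

    record = json.loads(text, object_pairs_hook=reject_duplicates)
    # Rule 6: every record is an object.
    if not isinstance(record, dict):
        raise ValueError("a record must be an object")
    return record


def kind(value):
    """Classify a parsed value as 'object', 'array' or 'primitive' (rule 1)."""
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        return "array"
    if value is None or isinstance(value, (str, bool, int, float)):
        return "primitive"
    raise TypeError(f"not a tree-structured value: {value!r}")
```

Rule 5 (keys are strings) needs no extra check here, because JSON syntax itself only allows string keys.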
Common data sources fall into the following categories:
1) Data feeds
Services typified by Twitter transmit data over the network in JSON format. Users and the related API programs obtain data updates by listening on the corresponding port. Because the Twitter data is rich in content and relatively complex in structure, and its source is stable and provides a sufficiently large volume, the experiments and data analysis of the present invention are based mainly on Twitter data sets. The table below gives the present invention's analysis of the nesting level and the number of repeated fields in the Twitter data.
Leaf node level    No repeated field    1 repeated field    2 or more repeated fields    Total
1                  16                   0                   0                            16
2                  61                   2                   0                            63
3                  51                   21                  4                            76
4                  1                    19                  4                            24
5                  0                    12                  0                            12
6                  0                    12                  0                            12
Total              129                  66                  8                            203
2) Online data services
Online data services exchange data in JSON format. Typically, a request carries the operation requested by the client and the response carries the result of that operation. The present invention studied the semi-structured data of online data services from different sources, such as Yahoo, Sina Weibo and IMDB. A user usually writes the service request in JSON according to the format of a given API, sends it to the corresponding data server, and then parses the JSON response, which completes one data service call.
The present invention's analysis of the Weibo (microblog) API online data service is shown in Fig. 1.
The analysis focuses on the number of repeated fields contained in each path: in the figure, the black portion represents root-to-leaf paths with no repeated field, the light-colored portion represents paths with exactly one repeated field, and the white portion represents paths with two or more repeated fields. A histogram shows the proportion of each kind: in most syntax trees, a path from the root to a leaf node contains at most one repeated field.
3) Communication protocols
The present invention analyzed the communication protocol formats in Apache Hadoop and Hadoop HBase, which use semi-structured data defined with Protocol Buffers for communication-related data transfer. These systems define many different types of semi-structured communication formats for communication and control between machines. Most of the semi-structured data used for communication has a very simple format.
The present invention's analysis of the Apache Hadoop communication protocols is shown in Fig. 2.
The analysis again focuses on the number of repeated fields contained in each path: in Fig. 2, the black portion represents root-to-leaf paths with no repeated field, the light-colored portion represents paths with exactly one repeated field, and the white portion represents paths with two or more repeated fields. A histogram shows the proportion of each kind: in most syntax trees, a path from the root to a leaf node contains at most one repeated field.
4) Public data sets
Analysis of DBpedia and data.gov shows that they store public data sets in JSON format. Unlike traditional semi-structured data files, however, each of these data sets consists of a single JSON record. The record has two main parts: the first is a nested substructure (an object in JSON) that stores the format of the data in the data set; the second is an array that stores the content of every record, and these records have no nested structure. The present invention can easily split such a record into a data definition part and a data content part, and then process it with conventional semi-structured data processing methods.
5) Sensor data
Recent sensor platforms, such as Arduino, Dragon Board and Beagle Bone, can produce and process JSON data. The present invention analyzed data from these sources and found that its internal format is fairly simple: the nesting depth of every field is at most 2, and at most one multi-valued field appears on any path from the root to a leaf node.
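The path statistics used throughout this analysis (leaf depth and the number of repeated fields on each root-to-leaf path) can be computed with a short sketch like the one below. The schema representation is an assumption made for illustration only: each node is a dict with an optional "repeated" flag and an optional "children" mapping; the function names are hypothetical.

```python
from collections import Counter


def leaf_paths(node, depth=0, repeated=0):
    """Yield (leaf depth, repeated-field count) for every root-to-leaf path."""
    for child in node.get("children", {}).values():
        child_depth = depth + 1
        child_repeated = repeated + (1 if child.get("repeated") else 0)
        if child.get("children"):
            # Inner node: keep walking down.
            yield from leaf_paths(child, child_depth, child_repeated)
        else:
            # Leaf node: report the finished path.
            yield child_depth, child_repeated


def summarize(schema):
    """Count leaves per (depth, bucket); bucket 2 means 'two or more'."""
    return Counter((d, min(r, 2)) for d, r in leaf_paths(schema))
```

A path counted in bucket 0 or 1 corresponds to the black and light-colored portions of the histograms, respectively.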
However, existing data processing systems cannot handle JSON semi-structured data from the above sources well, that is, provide complete functionality and good performance for every operation at the same time. The present invention analyzed a large number of systems that support semi-structured data management; they take one of the following three approaches:
1) Extending a traditional relational database
Systems such as PostgreSQL and Oracle store semi-structured data such as JSON in a relational table as one contiguous data block, either as text or in an internally encoded binary format. To run a query, they call internal parsing functions to parse the content of the data block and read the values of the required fields, and then call the operators of the relational database to perform the query on those values.
2) NoSQL data processing systems
Systems such as MongoDB encode semi-structured data in binary internally in a more flexible way. Their advantage is that they parse, store and query semi-structured data natively, which gives them stronger storage and query capabilities. In their implementation they define new query operations, or extend existing ones, according to the structural features of semi-structured data.
3) Processing data in a column format
Google Protocol Buffers and Apache Hive+Parquet support processing and querying semi-structured data. Compared with the two kinds of row-based systems above, a column-based data processing system provides better query and analysis performance in most cases, but its internals are more complex: data is usually stored internally as column clusters, which makes parsing semi-structured data and implementing query operations more difficult.
Each of the three approaches above has problems to some degree.
1) Extending an existing relational database to support semi-structured data is relatively inefficient
Analysis of the relational databases that currently support semi-structured data shows that most of them do not encode or optimize the data according to its structure and characteristics. They mainly store semi-structured data as text data blocks and use internal parsing functions to parse the text blocks and obtain the information needed from each record. Storing JSON data directly as text in the database wastes a large amount of space.
Queries also require a large number of string comparison and lookup operations, which greatly limits processing efficiency. According to the present invention's research, although many systems support operations on semi-structured data, their query time often becomes too long as the data volume grows, so they can hardly meet real-time requirements.
In addition, relational databases do not support some of the new structural features of semi-structured data well, for example direct support for defining nested and repeated fields, or an extended SQL query syntax that reflects the structure of semi-structured data.
2) NoSQL data processing systems encode and query data inefficiently
The present invention analyzed and studied MongoDB, a widely used NoSQL data processing system. Because of the flexibility of JSON semantics, MongoDB defines a redundant and cumbersome internal data encoding format. The study found that its encoding efficiency is very low: in most cases, the encoded data file is larger than the original text data. The internal encoding does not effectively reduce the redundancy in the JSON text and even introduces extra overhead during queries. This limits its data processing performance, especially for large volumes of data.
Moreover, the internal design of these NoSQL systems prevents some operations from being executed. For example, they cannot fully and efficiently implement the SQL join operation (a similar operator has been added to recent versions of MongoDB, but it still does not fully conform to the join defined in SQL and executes too slowly).
3) Processing data in a column format
In relational databases, the storage and query performance of a columnar database is generally better than that of a row-oriented database, because a query does not need to read or process the fields of a record that are unrelated to it. However, the internal principles are complex and the functionality is relatively hard to implement.
Similarly, among systems that support semi-structured data, those that store and query data in columns are internally more complex. Most management systems that use row data place no syntactic restrictions on the internal form of JSON: the content of the data does not have to be defined in advance, and the structure of the data can evolve during use. A columnar data management system, however, requires the definition (schema) of the column data in advance, and the structure of the data cannot be changed dynamically during use. This greatly limits the flexibility of semi-structured data.
In addition, users currently have few column-based semi-structured data processing systems to choose from. The only columnar system available to users at this stage is the Java-based Apache Hive+Parquet. Because of the limitations of the Java programming language, its query efficiency leaves room for further optimization. It also needs Apache Hadoop and HDFS as its platform, so the cost of initializing and running the system is high.
In its research on processing semi-structured data, the present invention found that the problems of the three existing solutions are caused by the limitations they place on data structure and implementation when handling semi-structured data.
First, because of the internal structural features of semi-structured data, a system for processing it cannot be obtained by extending a relational database. The two make different assumptions about the data format, so processing semi-structured data with a relational database incurs a cost too high to bear. The present invention therefore redesigns and implements a data processing system dedicated to semi-structured data, so that it can process semi-structured data with complex structure.
Second, because the definition of semi-structured data is flexible and its structure may change during use, most current NoSQL data management systems store it directly in a text-like structure. This makes storage inefficient and parsing during queries expensive. In the design of the present invention, the structure extracted from the data is stored in the schema (syntax definition), and the data retains only minimal structural information. This removes repeated structural information from the data and also makes some query optimizations on the data content possible.
Finally, the current Java-based columnar semi-structured store needs the support of many basic components, such as a file storage system and a scheduling system. These impose extra restrictions on its functions and use and lower its execution efficiency. The data processing system of the present invention (STEED) is developed completely independently in C/C++, which makes it possible to optimize the system as a whole; it is also free of platform-induced limitations such as having to define the data format in advance and being unable to change it.
Summary of the invention
To address the shortcomings of the prior art, the present invention proposes a tree-structured data processing method and system (STEED).
The present invention proposes a tree-structured data processing method, comprising:
Step 1: read semi-structured data and parse it into binary data in row or column format, wherein during parsing a syntax tree is generated dynamically or built from a definition, and stores the definition of the semi-structured data;
Step 2: store the row-format or column-format binary data, wherein the row-format and column-format binary data can be converted into each other and the binary data can be output directly as JSON text;
Step 3: perform query operations on the semi-structured data based on the binary data.
Step 1 includes providing binary data types and nested-structure definitions for describing and defining the fields in Protocol Buffers and JSON text data, and building the definition of the semi-structured data. For Protocol Buffers text data, the syntax tree is generated from its definition file before the data is parsed; for JSON data, the definition of the syntax tree is generated dynamically during parsing from the format and content of the data.
Step 1 also includes parsing the semi-structured data into a row storage structure, which stores each record as a unit with its nesting levels in order, or into a column storage structure, which stores data per leaf node of the tree-structured data definition.
The semi-structured data is processed as follows:
The syntax tree defines the nodes that populate the semi-structured data. Each node of the syntax tree not only describes information about the node itself, but is also associated with other nodes through the node IDs in the syntax tree, forming a tree.
During parsing, syntax trees are built separately for JSON and Protocol Buffers, wherein:
Building the JSON syntax tree: the syntax tree is built dynamically from the data during parsing. It is assumed that the type of the value of each field does not change and that all members of an array have the same type. While the syntax tree is built, the type of each field is determined from the type of the value in the data; a JSON field whose value is an array is defined as repeated, and the remaining nodes are defined as optional. During parsing, the symbol table is first searched by parent node ID and field name to check whether a definition for the field already exists; if not, a corresponding node is added to the syntax tree, and otherwise the value of the node is parsed;
Building the Protocol Buffers syntax tree: a Protocol Buffers message defined in a proto file is treated as a new data type, each field of which is either a basic data type or another compound type. To build the Protocol Buffers syntax tree, the proto files are parsed first to register the new data types, and then, starting from the specified root node, the type definitions are expanded one by one and assembled into the syntax tree of the data structure.
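The dynamic construction of the JSON syntax tree described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the STEED implementation: the class name SchemaTree, the node layout and the type names are hypothetical. It keeps the two elements the text names, namely a symbol table keyed by (parent node ID, field name) and the rule that array-valued fields are marked as repeated. Nested arrays are not handled.

```python
def type_name(value):
    """Map a parsed JSON value to a (hypothetical) schema type name."""
    if isinstance(value, dict):
        return "object"
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    return "null"


class SchemaTree:
    def __init__(self):
        # Node 0 is the root; every record is an object.
        self.nodes = [{"id": 0, "parent": None, "name": "",
                       "type": "object", "repeated": False}]
        self.symbols = {}  # (parent node ID, field name) -> node ID

    def _lookup_or_add(self, parent_id, name, value_type, repeated):
        key = (parent_id, name)
        node_id = self.symbols.get(key)
        if node_id is None:
            # Field not seen before under this parent: add a schema node.
            node_id = len(self.nodes)
            self.nodes.append({"id": node_id, "parent": parent_id,
                               "name": name, "type": value_type,
                               "repeated": repeated})
            self.symbols[key] = node_id
        elif self.nodes[node_id]["type"] == "null":
            # Only null or an empty array was seen so far: refine the type.
            self.nodes[node_id]["type"] = value_type
        return node_id

    def add_record(self, obj, parent_id=0):
        """Extend the schema with the fields found in one parsed record."""
        for name, value in obj.items():
            repeated = isinstance(value, list)
            items = value if repeated else [value]
            sample = items[0] if items else None
            node_id = self._lookup_or_add(parent_id, name,
                                          type_name(sample), repeated)
            for item in items:
                if isinstance(item, dict):
                    self.add_record(item, node_id)
```

Feeding records one at a time lets the schema grow as new fields appear, which mirrors the assumption in the text that field types stay stable while the set of fields may expand.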
The following data types are used to store and operate on the row-format or column-format binary data:
1) Integers: TypeInt(8/16/32/64) represent 8-, 16-, 32- and 64-bit integers respectively;
2) Floating-point numbers: Type(Float/Double) represent floating-point numbers of the float and double types respectively;
3) Strings: TypeString represents a character string;
4) Timestamps: TypeTimeStamp represents a timestamp and is implemented internally with TypeInt64.
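As a rough illustration of how such fixed-width types map to binary data, the sketch below encodes and decodes single values with Python's struct module. Several details are assumptions that the text does not specify: the expanded type names (TypeInt8, TypeFloat and so on), little-endian byte order, and a 4-byte length prefix for strings.

```python
import struct

# Hypothetical mapping from the type names in the text to struct format codes.
FORMATS = {
    "TypeInt8": "<b", "TypeInt16": "<h", "TypeInt32": "<i", "TypeInt64": "<q",
    "TypeFloat": "<f", "TypeDouble": "<d",
    "TypeTimeStamp": "<q",  # a timestamp is stored as a 64-bit integer
}


def encode(type_name, value):
    """Encode one value into bytes according to its type name."""
    if type_name == "TypeString":
        data = value.encode("utf-8")
        # Assumption: strings carry a 4-byte length prefix.
        return struct.pack("<I", len(data)) + data
    return struct.pack(FORMATS[type_name], value)


def decode(type_name, data):
    """Decode one value from bytes produced by encode()."""
    if type_name == "TypeString":
        (length,) = struct.unpack_from("<I", data)
        return data[4:4 + length].decode("utf-8")
    return struct.unpack(FORMATS[type_name], data)[0]
```

Compared with JSON text, a 64-bit integer always occupies 8 bytes here regardless of how many digits its decimal form has.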
Step 3 includes, when a query is executed, first generating from the content of the query statement the operation tree needed for the query, each node of which is an SQL operation.
The method also includes an extended SQL query syntax, as follows:
(1) ".": separates the nesting levels in the path expression of a field;
(2) "any": denotes any one value of a repeated field;
(3) "all": denotes all values of a repeated field;
The output is either data in JSON format or JSON-like data that ignores the nested structure.
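The semantics of path expressions with "any" and "all" can be sketched as follows. The concrete query syntax is not given in this section, so the function signatures below (resolve, matches, and the quantifier parameter) are hypothetical; the sketch only shows how a dot-separated path is followed through nested objects and how a predicate over a repeated field is quantified.

```python
def resolve(value, path):
    """Return every value reached by a dot-separated path.

    Arrays met along the path are flattened, so a path through a
    repeated field yields one value per array member.
    """
    values = [value]
    for name in path.split("."):
        next_values = []
        for v in values:
            items = v if isinstance(v, list) else [v]
            for item in items:
                if isinstance(item, dict) and name in item:
                    next_values.append(item[name])
        values = next_values
    # Flatten a repeated field at the end of the path as well.
    flat = []
    for v in values:
        flat.extend(v if isinstance(v, list) else [v])
    return flat


def matches(record, path, predicate, quantifier="any"):
    """Evaluate a predicate over a path under 'any' or 'all' semantics."""
    values = resolve(record, path)
    test = any if quantifier == "any" else all
    # Design choice of this sketch: a missing path matches neither quantifier.
    return bool(values) and test(predicate(v) for v in values)
```

Without a quantifier, a condition on a multi-valued field would be ambiguous; "any" and "all" make the intended meaning explicit.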
The method further includes:
Row data read operation: reads the data of one complete row structure from the row-format binary data. Each read takes one Row Object from the row-format binary data, and reads proceed in sequence until the end of file (EOF) is reached;
Row data filter operation: evaluates the conditions of the where clause on the row-format binary data that has been read, and, after the group by clause has generated new row-format binary data, filters the aggregation results. The where clause is parsed first, and each predicate is instantiated as an object that performs data comparison; the values read are then compared to determine the truth value of each predicate and decide whether the record passes the condition;
Row data projection operation: calls a recursive function to handle the nested structure of the semi-structured data. For each field, the field's value in the original data is read and parsed, and only the fields relevant to the query are written to the result of the operation;
Join operation: implemented as a hash join. For each record of one data set, the hash value of its join key is computed and the whole record is stored in a hash table; the other data set is then traversed, and for each record the entry in the hash table with the same hash key is looked up. The two row structures are then merged, and the merged record waits to be pulled by the operator of the layer above;
Group operation: HashValueItemContainer is first defined to store the individual storage units of the hash table; each specific value in the hash table is the address of a HashValueItem, wherein (1) an intermediate layer stores the specific address at which each record is stored and the content of each aggregation to be computed;
(2) a Block Buffer object stores the actual content of the saved records. When the whole group operation has finished, the aggregation results are written to the corresponding positions and wait to be pulled by other operators in the upper layers;
Sort operation: stores all records in a buffer and sorts them by comparison. Each time, only a fixed-size block of memory is requested from the operating system to store the records obtained from the lower-level operation; during comparison, an array records the start address of each record, and sorting changes only the positions of the pointers in the array;
A comparator is defined according to the sort condition and operates as follows:
(1) the comparator reads the values of all fields used in the comparison from the row-format binary data;
(2) to make comparison more efficient, comparison and output proceed as follows:
A) each record reserves 8 bytes that store the 8 most significant bytes of the first field to be compared;
B) the comparator takes the values in the order of the sort fields and compares them in turn until the result of the comparison is determined;
C) the STL std::sort function is used for sorting;
D) the records themselves are not copied during comparison; only the pointer array that records the output order is modified.
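The sort steps A) to D) can be illustrated with the following Python sketch. It is a simplification under stated assumptions: records are plain dicts with string-valued sort fields, Python's built-in sorted stands in for std::sort, and a list of indices stands in for the pointer array. The field names in the usage are hypothetical.

```python
def sort_order(records, fields):
    """Return record indices in sorted order without moving the records."""
    def key(index):
        record = records[index]
        first = record[fields[0]].encode("utf-8")
        # Step A: an 8-byte prefix of the first sort field is compared first.
        # Step B: the full first field and the remaining fields break ties.
        return (first[:8], first, tuple(record[f] for f in fields[1:]))

    # Steps C and D: sort an array of indices; the records are never copied.
    return sorted(range(len(records)), key=key)
```

Because tuple comparison stops at the first differing element, most comparisons are decided by the short prefix alone; only records that share the same 8-byte prefix fall back to the full values. In this Python sketch the keys are still computed in full, so it shows the comparison order rather than the memory savings of the original design.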
The present invention also proposes a system based on the tree-structured data processing method described above.
As the above scheme shows, the advantages of the present invention are:
1. A row storage structure for semi-structured data: binary row storage of semi-structured data is implemented so that it fully expresses the semantics of semi-structured data and adapts to changes in its data definition. The structure is also required to be simple, easy to express, and efficient in storage;
2. A column storage structure for semi-structured data: binary column storage of semi-structured data is implemented so that column storage can express the complete structure of semi-structured data. It is required to express the complex structural features of semi-structured data and store the data content efficiently;
3. Conversion between the row and column formats of semi-structured data: parsing and assembly algorithms convert binary row data and column data into each other;
4. A syntax tree for semi-structured data definitions: the definition of the data structure is stored in a tree;
5. Query operations on semi-structured data: SQL-like queries are executed on the row and column data;
6. An extended SQL query syntax based on the characteristics of semi-structured data: because semi-structured data contains multi-valued fields, "ANY", "ALL" and path expressions are defined to resolve the ambiguity of data in queries;
7. Optimization based on simple paths in semi-structured data: a simple path is a path from the root node to a leaf node that contains at most one multi-valued field. The present invention found that such structures are very common in typical semi-structured data, and proposes and implements storage and query optimizations for them, which greatly improve query efficiency.
As shown in Fig. 4, the present invention ran query analysis experiments on data sets of different sizes, both with the data already loaded into memory (hot cached) and with the data not yet loaded into memory (cold cached). The experiments used different SQL query statements to compare the performance of the corresponding operations, including project, filter, group, sort and join.
According to the query performance in Fig. 4, in the cold cached experiments STEED achieves a speedup of 4.1 to 17.8 times over Hive+Parquet, 55.9 to 105.2 times over MongoDB, and 33.8 to 1294 times over PostgreSQL; in the hot cached experiments, STEED achieves a speedup of 19.5 to 59.3 times over MongoDB, 19.5 to 59.3 times over Hive+Parquet, and 16.9 to 392 times over PostgreSQL. The query statements of each query operation are listed in detail in the appendix.
Brief description of the drawings
Fig. 1 is the analysis of the JSON data format defined by the Weibo (microblog) API;
Fig. 2 is the analysis of the Apache Hadoop communication protocols;
Fig. 3 is the module diagram of STEED;
Fig. 4 is the query performance comparison of STEED;
Fig. 5 shows the process of building a syntax tree for Protocol Buffers;
Fig. 6 is a schematic diagram of the compound-type structure of row data;
Fig. 7 is a schematic diagram of the column data storage structure;
Fig. 8 is a schematic diagram of the optimized column data storage structure;
Fig. 9 is a schematic diagram of each STEED query operation;
Fig. 10 is a schematic diagram of the storage structure used during the group operation;
Fig. 11 is a schematic diagram of the optimized row storage structure;
Fig. 12 is a schematic diagram of an alternative optimization scheme for the row storage structure.
Detailed description of the embodiments
In view of above the deficiencies in the prior art, the present invention redesigns and realizes a semi-structured data processing system STEED.The following present the overall architecture of STEED systems and briefly introduce the functional requirement of each module, post analysis this The interface definition of several intermodules, while it is how to handle and data storage to briefly explain inside STEED.
As shown in figure 3, STEED is main by three module compositions:
(1) data resolution module:
Text data is read, and is resolved to the binary format data of line or column, data storage is stored in In module.During data are parsed, dynamic generation syntax tree stores the definition of semi-structured data.To JSON forms When data are parsed, because it does not define corresponding data format (syntax tree, schema tree), so the present invention is only Can parse data during dynamic generation data format definition;And to the data of Protocol Buffers forms, text The definition that the data of this form are related to data can be previous with being provided in data parsing, so the present invention is in parsing text formatting Data before syntax tree can be set up according to its definition.According to the definition in domain in syntax tree, the data of the invention by text structure It is converted into the binary format data of line and column.
(2) Data storage module:
Stores the row and column binary files generated by the data parsing module. Internally it can convert between the two formats and can output the data directly as JSON text. In the STEED system, the present invention also optimizes the storage structures according to the characteristics of row and column storage, so that they achieve higher storage and query efficiency.
(3) Query analysis module:
Performs query operations on semi-structured data in row and column formats, including projection, filtering, grouping, sorting and joining. When STEED needs to execute a query, the Query Parser first builds the operator tree (Operator Tree) required by the query from the content of the query statement; each node in the tree is a SQL operation. Data flows through the operator tree from the leaf nodes to the root node, completing each part of the computation along the way, and the query is complete when the data reaches the root node. The present invention also implements multithreaded versions of some operations, supporting projection, filtering and grouping.
The STEED system is divided into three modules in total. The implementation details and workflow of each module are described in turn below.
Part 1 Data parsing module
This part describes in detail the implementation of STEED's data parsing module and its key internal algorithms, and explains, according to the structural characteristics of semi-structured data, how STEED parses JSON and Protocol Buffers data and builds their schema trees.
1.1 Architecture overview of the data parsing module
The data parsing module consists mainly of the following three parts:
(1) Data Type:
Describes and defines the binary data types of the fields in JSON and Protocol Buffers text data. The STEED system defines some basic data types, such as int, double and string. For JSON data, the values in the text only need to be mapped to the internal data types of the system. For Protocol Buffers, the data types defined in its schema, including compound data types, are converted to the corresponding default STEED data types for use when the schema tree is built later.
(2) Schema Tree:
Builds the definition of the semi-structured data, i.e. the schema tree.
For Protocol Buffers text data, the schema tree is generated from the schema definition file before the data is parsed. During parsing, the content and structure of the schema tree remain unchanged.
For JSON data, the present invention has to generate the schema tree definition dynamically from the format and content of the data while parsing it. The present invention assumes that the type of the value in each field remains unchanged, and that all elements of an array have the same value type.
STEED stores the schema tree definition corresponding to each data set. In the query analysis module, STEED performs query operations on a data set according to the definition of its data in the schema tree.
(3) Parser:
Splits the semi-structured text data into key-value pairs and then parses them into the row or column storage structures defined inside STEED. For Protocol Buffers data, parsing only needs to convert the data format according to the schema tree definition. For JSON data, the present invention also has to check during parsing whether the data contains a newly defined field and, if so, modify the existing schema tree.
1.2 Data Type
1.2.1 Basic data types supported by STEED
The STEED system internally defines several binary data types for storing and computing on row-format and column-format data:
1) Integer: TypeInt(8/16/32/64) represents 8-bit, 16-bit, 32-bit and 64-bit integers respectively;
2) Floating point: Type(Float/Double) represents floating-point numbers of float and double type respectively;
3) String: TypeString represents a character string;
4) Timestamp: TypeTimeStamp represents a timestamp and is implemented internally with TypeInt64.
All of the above data types support null checks on their values, conversion to and from binary data, comparison operations, and so on.
1.2.2 Conversion of JSON data types
JSON defines the possible types of the data in each field. The present invention maps each of these data types to the corresponding internal STEED data type, as shown in the following table:
For basic data types, the type defined by JSON is mapped directly to a basic data type inside STEED. For the nested compound types in JSON, object and array, STEED also defines corresponding row and column storage methods; the specific storage methods are described in the next part on the data storage module.
1.2.3 Conversion of Protocol Buffers data types
Like JSON, Protocol Buffers also defines some built-in basic data types. In the internal implementation of STEED, the present invention converts these basic data types directly into C++ types (C++ Type) and stores their values in the parsed result. See https://developers.google.com/protocol-buffers/docs/proto3#scalar.
In addition, the compound data type message can be defined in a Protocol Buffers schema. Using compound data types, a multi-level nested data format can be defined. In the definition of a compound type, one of the following attributes can be chosen for each field: required (the field must appear), optional (the field may appear) and repeated (the field may appear multiple times).
1.3 Schema tree (Schema Tree)
This subsection introduces how STEED uses the schema tree (Schema Tree) to describe semi-structured data, and how the schema tree is built during parsing according to the data and structural characteristics of JSON and Protocol Buffers.
1.3.1 Definition of the schema tree
Semi-structured data has the following structural characteristics:
1) The data contains many nested structures: the definition of each field has a depth, which makes it more complex than traditional flat relational data;
2) The data contains many repeated fields: within one record, a field may be assigned many values;
3) The data contains many sparse fields: a large number of fields are not assigned in most records, and processing them as tables in a traditional relational database would make storage and querying very inefficient.
To describe these characteristics of each field efficiently, and to improve the storage and query efficiency of the row and column formats, the present invention defines each node of the schema tree as follows:
A node records not only information about the node itself, such as its data type, its nesting level and the number of times the field may be assigned; nodes are also linked to one another through SchemaNode IDs, forming a tree structure. The following describes how the schema tree is built during parsing for JSON and for Protocol Buffers.
1.3.1 Building the JSON schema tree
Because JSON carries no definition of its data, the present invention can only build the schema tree dynamically from the data during parsing. Here the present invention assumes that the type of the value in each field does not change and that all members of an array have the same type, so when building the schema tree the value type of a field only needs to be determined from the type of a value in the data. In addition, because it is uncertain whether a given field of the JSON data appears in a record, the present invention defines JSON fields whose value is an array as repeated, and all other nodes as optional. During parsing, STEED first looks up the symbol table by the parent node ID and the field name to check whether a definition of the structure already exists. If the node has no definition, a corresponding node is added to the Schema Tree; otherwise the value of the node is parsed. The detailed parsing process is described in the next subsection.
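The lookup-or-add step described above can be sketched as follows. This is a minimal illustration in Python rather than the C++ of the actual system; the names `SchemaNode`, `SchemaTree`, `lookup_or_add`, `observe` and the type names returned by `kind_of` are assumptions made for the example, not identifiers taken from STEED.

```python
import json

def kind_of(value):
    """Map a parsed JSON value to an illustrative internal type name (assumed names)."""
    if isinstance(value, bool):      # check bool first: bool is a subclass of int
        return "bool"
    if isinstance(value, int):
        return "int64"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        return "object"
    return "unknown"                 # null or empty array: type not yet known

class SchemaNode:
    def __init__(self, node_id, parent_id, name, kind, label):
        self.node_id, self.parent_id = node_id, parent_id
        self.name, self.kind, self.label = name, kind, label

class SchemaTree:
    def __init__(self):
        self.nodes = [SchemaNode(0, None, "<root>", "object", "optional")]
        self.symbols = {}            # (parent node ID, field name) -> node ID

    def lookup_or_add(self, parent_id, name, value):
        key = (parent_id, name)
        if key in self.symbols:      # a definition already exists
            return self.nodes[self.symbols[key]]
        is_array = isinstance(value, list)
        sample = value[0] if is_array and value else (None if is_array else value)
        node = SchemaNode(len(self.nodes), parent_id, name, kind_of(sample),
                          "repeated" if is_array else "optional")
        self.nodes.append(node)
        self.symbols[key] = node.node_id
        return node

    def observe(self, parent_id, obj):
        """Walk one parsed JSON object and extend the tree with unseen fields."""
        for name, value in obj.items():
            node = self.lookup_or_add(parent_id, name, value)
            for item in (value if isinstance(value, list) else [value]):
                if isinstance(item, dict):
                    self.observe(node.node_id, item)

tree = SchemaTree()
tree.observe(0, json.loads('{"id": 1, "tags": ["a", "b"], "user": {"name": "x"}}'))
tree.observe(0, json.loads('{"id": 2, "user": {"name": "y", "lang": "en"}}'))
```

In this sketch the second record adds only the new field `user.lang`; all other fields are found in the symbol table and leave the tree unchanged.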
1.3.2 Building the Protocol Buffers schema tree, as shown in Fig. 5:
As shown in the following example, Protocol Buffers can define a message as a new data type in a proto file. Each field it contains can be of either a basic data type or another compound type. When building the tree, the present invention first parses the proto file and registers the new data types; then, starting from the root node (root) specified by the user, it expands the definitions of these data types one by one and assembles them into the schema tree (Schema Tree) of the data structure. After that, each piece of text data can be parsed according to the schema tree definition.
1.4 Data parsing
This subsection introduces the data parsing algorithms of STEED. The implementation of many low-level base classes in the system is omitted here; only the algorithms related to parsing text data are listed.
Semi-structured data defines two compound data structures, object and array, so during parsing the present invention parses these two structures with different methods. On the other hand, the output of row and column binary data is implemented in the same way for JSON and Protocol Buffers. The following therefore first introduces the parsing algorithms for JSON and Protocol Buffers separately, and then explains how the parsed data is output as binary row and column data.
1.4.1 JSON data parsing algorithm
As shown in the following algorithm, the present invention parses atomic data types and compound data types with different strategies. For data of an atomic type, the text value is converted directly into binary data and stored or output. For data with a compound structure, the structure is analyzed and parsed until all child fields are of atomic types, and the result is then written to the storage file according to its row or column storage structure. When parsing JSON text data, the present invention has to compare each field to determine whether it is a newly added node and, if so, modify the existing schema tree.
For a nested structure in semi-structured data (left half of the box above), the fields at the same level are first split into key-value pairs, and each key-value pair is then analyzed separately. Each key is checked to see whether its definition has appeared before; if not, the corresponding Schema Tree is updated, and the value of the corresponding field is recorded in the Schema Tree. Parsing then proceeds recursively on the value recorded for each node in the Schema Tree: if it is of a compound data type, the corresponding compound-structure parsing function is called to continue parsing; if it is a value of a simple type, it is written directly to the final result.
For an array of a repeated field (right half of the box above), the array represents multiple values of the same field, so the present invention only needs to call the corresponding parsing function on each element in turn, without checking whether the schema tree has to be modified.
1.4.2 Protocol Buffers data parsing algorithm
For Protocol Buffers data, parsing the text data is simpler than for JSON: the data format has already been defined before parsing, so the present invention does not need to check or modify the schema tree during parsing and only needs to parse the value of each field in the record. The parsing method is similar to that for JSON: compound types are parsed by calling the corresponding parsing function, and values of simple types are written directly to the result.
1.4.3 Output algorithms for row and column data
During parsing, STEED can parse the data into a row or column binary format. The detailed process of outputting row-format and column-format data is as follows:
(1) Row-format compound type data output algorithm:
As shown in the algorithm above, for the object and array compound data types, the value of each field is added to the corresponding row-structure object until the whole record has been parsed.
(2) Column-format compound type data output algorithm:
Compared with the output of a row-structure data file, outputting column-structure data only requires writing the values at the leaf nodes and their structure information directly to the files. During parsing, the present invention therefore does not need to keep the semantic object and array structures; it only records their structure information and writes it to the column storage files. This makes the output of the binary format relatively simple and efficient.
Part 2 Data storage module
After the data parsing module has parsed the data into row or column format, the data storage module stores the parsed result and performs certain structural conversions, such as converting between the row and column formats and outputting binary data directly as text. This part first introduces the underlying storage structures of the row and column binary data. It then explains, based on the assembly algorithm of Google Dremel, how STEED implements the assembly algorithm that converts column-structure data into row-structure data.
2.1 Overview of the row storage structure
As described in the parsing process of the previous part, the present invention stores data of atomic types in their binary format. The other two compound structures, object and array, are stored in the form shown in Fig. 6:
The row and column storage structures are relatively similar and consist mainly of the following parts:
(1) Header Information: records information about this storage structure, such as its size and the number of elements it contains.
(2) (ID) OFFSET Array: for an object, the present invention has to record the id of each field it contains, which indicates that the field has a value. For an array, every value is an assignment of the same field, so only the offset of each value is kept.
(3) Value Array: the row storage structure stores repeated values as an array, whose values can be of atomic or compound types. In an object, the values are assignments of different fields, so their types can differ. In an array, the values are multiple assignments of the same field, so the present invention assumes by default that they all have the same type. Using the offset recorded for each value, the present invention can access the value of any field at random.
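A minimal sketch of such a layout for an object, written in Python with the standard `struct` module. The concrete byte layout (two 32-bit header words for total size and element count, then one 32-bit id and one 32-bit offset per field) and the function names `pack_object` and `get_field` are assumptions chosen for illustration; the source does not specify field widths.

```python
import struct

HEADER = struct.Struct("<II")   # assumed header: total size in bytes, number of elements
ENTRY = struct.Struct("<II")    # assumed entry: field id, offset into the value array

def pack_object(fields):
    """fields: list of (field_id, value_bytes). Returns header + (id, offset) array + values."""
    entries, values, offset = b"", b"", 0
    for field_id, raw in fields:
        entries += ENTRY.pack(field_id, offset)
        values += raw
        offset += len(raw)
    total = HEADER.size + len(entries) + len(values)
    return HEADER.pack(total, len(fields)) + entries + values

def get_field(buf, field_id):
    """Random access to one field: scan the id array, then slice the value array by offset."""
    total, count = HEADER.unpack_from(buf, 0)
    values_start = HEADER.size + ENTRY.size * count
    for i in range(count):
        fid, off = ENTRY.unpack_from(buf, HEADER.size + ENTRY.size * i)
        if fid == field_id:
            if i + 1 < count:   # the value ends where the next one starts
                _, end = ENTRY.unpack_from(buf, HEADER.size + ENTRY.size * (i + 1))
            else:               # the last value runs to the end of the structure
                end = total - values_start
            return buf[values_start + off:values_start + end]
    return None                 # the field is not assigned in this object

row = pack_object([(3, struct.pack("<q", 42)), (7, b"hello")])
```

Because absent fields simply have no entry in the id array, sparse fields cost nothing in this layout, which is one plausible reading of why the ids are stored per object.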
2.2 Overview of the column storage structure
The column storage structure is more complex than the row structure. The present invention defines the following concepts to represent and store structure information in the column structure:
(1) Repetition Level: at which repeated field in the field's path the value has repeated, i.e. the nesting level at which the data repeats.
(2) Definition Level: how many fields in the path that could be undefined (optional and repeated fields) are actually present in the data.
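The two levels can be illustrated with a small Python sketch that computes them for one column of a nested record, following the Dremel-style definitions above. The function name `stripe` and the representation of a path as a list of `(field name, label)` pairs are assumptions made for the example.

```python
def stripe(record, path):
    """Return [(value, repetition_level, definition_level)] for one column of a record.

    path: list of (field_name, label), label in {"required", "optional", "repeated"}.
    """
    out = []

    def walk(node, depth, r, d, rep_depth):
        if depth == len(path):                 # reached the leaf value
            out.append((node, r, d))
            return
        name, label = path[depth]
        child = node.get(name) if isinstance(node, dict) else None
        if label == "repeated":
            level = rep_depth + 1              # this is the level-th repeated field on the path
            if not child:                      # no values: emit a null at the current depth
                out.append((None, r, d))
                return
            for i, item in enumerate(child):
                # the first item inherits r; later items repeat at this field
                walk(item, depth + 1, r if i == 0 else level, d + 1, level)
        elif child is None:                    # missing field: emit a null
            out.append((None, r, d))
        else:
            walk(child, depth + 1, r, d + (1 if label == "optional" else 0), rep_depth)

    walk(record, 0, 0, 0, 0)
    return out

doc = {"Name": [
    {"Language": [{"Code": "en-us"}, {"Code": "en"}]},
    {},
    {"Language": [{"Code": "en-gb"}]},
]}
path = [("Name", "repeated"), ("Language", "repeated"), ("Code", "required")]
column = stripe(doc, path)
```

For this record the column holds four items: `en-us` starts the record (repetition 0), `en` repeats at the second repeated field `Language` (repetition 2), the null for the empty `Name` and `en-gb` both repeat at `Name` (repetition 1). The null has definition level 1 because only `Name` is present on its path.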
How column-structure data uses this information when it is converted to row data is explained in detail in a later subsection; only its storage structure is introduced here.
CAB (Column Align Block) is the basic unit of column storage in the present invention. During parsing, each value produces a Column Item that is stored in a CAB. Because semi-structured data contains many repeated fields, a repeated field can cause several Column Items of one record to be inserted into the CAB. To improve storage and query efficiency, the present invention stores CABs aligned by record id: every CAB stores the same number of records, regardless of the number of Column Items.
The structure of a CAB is shown in Fig. 7. It consists mainly of the following four parts:
(1) Header: describes the CAB, including its size and the number of records it stores.
(2) Repetition Array: records the repetition value of each Column Item. Because the maximum repetition value is the maximum depth of the field, the present invention stores it in a few bits rather than a full integer. By analyzing the data, the present invention summarizes the patterns that can occur into the following templates and optimizes for each of them, as shown in Fig. 8.
A) None Repeated (non-repeated field): no level of the nesting path is repeatable, so the field has at most one value in a record, i.e. it is never assigned repeatedly. If a field has no value during parsing, STEED inserts a null so that all records align naturally. Every record therefore has exactly one Column Item in the corresponding column data file, and the present invention omits this array from the storage structure.
B) Single Repeated (repeatable at only one nesting level): exactly one level of the nesting path is repeatable. Only the first Column Item of each record (the Record Boundary) needs to be marked, so 1 bit per Column Item is enough to indicate whether it is the first Column Item of a record or a repetition at the single repeatable level.
C) Multi Repeated (repeatable at several nesting levels): if there are several repeatable fields on the path from the root to the leaf node, several bits are needed to indicate at which field the repetition occurs.
After analyzing the data in detail, the present invention finds that the vast majority of fields repeat at no more than one level. Storing the array with these three templates therefore improves the storage and processing efficiency of the column storage structure.
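The choice among the three templates, and the bit-level packing it enables, might look like the following Python sketch. The function names and the rule of using the smallest bit width that can hold every repetition value are assumptions for illustration; the source only states that a few bits are used.

```python
def choose_template(path_labels):
    """Pick a template and a bit width from the labels on a field's path."""
    repeated = sum(1 for label in path_labels if label == "repeated")
    if repeated == 0:
        return "none", 0            # array omitted entirely
    if repeated == 1:
        return "single", 1          # 1 bit: record boundary or repetition
    # repetition values range over 0..repeated, which needs bit_length() bits
    return "multi", repeated.bit_length()

def pack_levels(levels, bits):
    """Pack repetition values into bytes, `bits` bits per Column Item."""
    word = 0
    for i, level in enumerate(levels):
        word |= level << (i * bits)
    return word.to_bytes((len(levels) * bits + 7) // 8, "little")

def unpack_levels(buf, bits, count):
    word = int.from_bytes(buf, "little")
    mask = (1 << bits) - 1
    return [(word >> (i * bits)) & mask for i in range(count)]

template, bits = choose_template(["optional", "repeated"])
packed = pack_levels([0, 1, 1, 0, 0, 1], bits)   # six Column Items, three records
```

With the single-repeated template, six Column Items fit in one byte here, compared with six full integers in a naive layout.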
(3) Value Area: records the values of all Column Items. The present invention uses different storage strategies for fixed-length and variable-length data:
A) Fixed-length data types: every value has the same length, so the present invention records the space occupied by each value in the Header and reads the next value by advancing a fixed length each time; no additional offset array is needed.
B) Variable-length data types: values differ in length, so the present invention has to record the storage location of each value in an offset array. For fields whose values repeat heavily (such as user language), only one copy of each distinct variable-length value is stored, and storage efficiency is improved by reusing the offset of that value in different Column Items.
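The offset reuse for variable-length values can be sketched in a few lines of Python. Storing an `(offset, length)` pair per Column Item and the names `build_value_area` and `read_value` are assumptions for the example; the source only says that an offset is recorded and shared.

```python
def build_value_area(values):
    """Store each distinct string once; every Column Item keeps only (offset, length)."""
    area = bytearray()
    offsets = []
    seen = {}                                  # raw bytes -> (offset, length)
    for value in values:
        raw = value.encode("utf-8")
        if raw not in seen:                    # first occurrence: append to the value area
            seen[raw] = (len(area), len(raw))
            area += raw
        offsets.append(seen[raw])              # repeated occurrence: reuse the offset
    return bytes(area), offsets

def read_value(area, offsets, i):
    start, length = offsets[i]
    return area[start:start + length].decode("utf-8")

area, offsets = build_value_area(["en", "zh", "en", "en", "zh"])
```

Five Column Items share four bytes of value data in this example, since only `en` and `zh` are written to the value area.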
2.3 Row and column format conversion algorithms
This part introduces the algorithms with which the storage module converts between row and column files. For row data files, each semi-structured data set produces one row data file after parsing, which stores all field values and the related structure information of all records. For column data files, each text data set produces several column storage files: each field produces one column storage file that stores all values of that field across all records. The storage module therefore has to implement row/column format conversion of the stored data at run time. It also has to support outputting the data directly as JSON text.
2.3.1 Row-to-column data conversion
Converting row-structure data to column-structure data is similar to parsing text data and is not repeated here. This conversion does not need to match the structure of text data character by character, because it works on the already parsed objects and arrays of the row storage structure, and it does not need to convert text values into binary form. The conversion from row to column structure is therefore considerably more efficient than parsing text.
2.3.2 Column-to-row data assembly
Assembling the column data files into a row-format file according to certain rules converts column-structure data into row-structure data. Based on the assembly algorithm of Google Dremel, STEED uses a similar algorithm internally to assemble the column files. The algorithm is as follows:
During assembly, STEED reads Column Items from the Column Readers in the order given by a finite-state automaton and by the repetition value of each Column Item, and then uses the definition value to determine and output the corresponding nesting-level information. When the last Column Reader has read the last Column Item of the current record, i.e. all Column Readers have been traversed, the Assembler has finished assembling one record. The Assembler keeps running until all records have been assembled, at which point every Column Reader should have reached the end of file (EOF).
In the following algorithm, apart from the main assembly procedure AssembleRecd, the two functions move and return each use the definition value to determine the nesting-level structure information of the data.
In the following pseudocode, the present invention traverses the schema tree depth-first from the root node and orders the column files by the order in which their leaf nodes appear. The Column Items in each column file are then read in that order. The content of each Column Item that is read determines the nesting structure of the data: first the information representing the nested structure of the current column is output, then the value is written to the row structure being assembled, then the next column file to jump to and read is determined, and finally one Column Item is pre-read from that next column to determine the nesting level to return to and to output the related structure information. Once every column file has been read at least once, one record has been assembled. This process is repeated until all column files have been read to the end, which completes the assembly of the whole data set.
Part 3 Query analysis module
On row-structure and column-structure data, STEED can perform query analysis similar to SQL. Compared with relational data in traditional table structures, however, semi-structured data has nested and repeated fields, which can make queries ambiguous. The present invention therefore extends the query syntax to eliminate this ambiguity to a certain extent. The present invention also implements some basic SQL operations, such as projection, filtering, group by and sorting. This part first introduces the extended semantics for semi-structured data, and then describes in turn the algorithms of the operations on semi-structured data that the system implements.
3.1 Semantic extensions of SQL for semi-structured data
Traditional relational data stores flat data in table structures: all values are at the same level and there is no nested structure; each field is assigned exactly one value; and tables are split at design time so that large numbers of sparse fields do not occur. None of these properties holds for semi-structured data. To support operations on semi-structured data, the present invention defines the following new operators:
(1) “.”: separates the nesting levels in the path expression of a field;
(2) “any”: denotes any one value of a repeated field;
(3) “all”: denotes all values of a repeated field.
The present invention offers several options for the output:
(1) data in JSON format;
(2) JSON-like data that ignores the nested structure.
3.2 Operator types supported by STEED
As shown in Fig. 9, STEED supports several types of operations on row and column data. Data flows between the operators in pull mode until the output operator at the top level converts the binary data into text. The implementation details of each operator are introduced in turn below.
3.2.1 Row From Operator (row data read operator)
STEED reads one complete row-structure record at a time from the row data file. In the row data file, data is stored record by record, and each record is stored as a Row Object. When reading records, the present invention therefore reads one Row Object at a time from the row binary data file, in order, until the end of file (EOF) is reached.
3.2.2 Schema Filter (Where or Having Clause) Operator (schema-based filter operator for where and having clauses)
In this operator, the present invention performs the filter operation on row-format data. STEED can use this operation to evaluate the conditions of the where clause after reading row data; it can also use it to filter the aggregation results after the group by clause has produced new row data.
For the filter operation, the present invention defines RowCondition (row condition class) to decide whether the relevant fields of a record satisfy each predicate. The decision process is as follows:
The present invention first parses the where clause and instantiates each predicate as an object that can compare data. The data is read from the row data structure, the values read are compared, and the truth value of each predicate determines whether the record passes the condition of this operator.
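A predicate object of this kind could be sketched as follows in Python. Only the class name `RowCondition` comes from the text; its constructor arguments, the helper `field_values`, the treatment of a where clause as a conjunction of predicates, and the rule that a record with no value for the field fails the predicate are all assumptions made for the example. Records are modelled as nested dicts and lists rather than binary Row Objects.

```python
import operator

OPS = {"=": operator.eq, "!=": operator.ne, "<": operator.lt,
       "<=": operator.le, ">": operator.gt, ">=": operator.ge}

def field_values(node, parts):
    """Collect every value reachable through a dotted path, flattening arrays."""
    if isinstance(node, list):
        return [v for item in node for v in field_values(item, parts)]
    if not parts:
        return [node]
    if isinstance(node, dict) and parts[0] in node:
        return field_values(node[parts[0]], parts[1:])
    return []

class RowCondition:
    def __init__(self, path, op, constant, quantifier="any"):
        self.parts = path.split(".")       # "." separates nesting levels
        self.op = OPS[op]
        self.constant = constant
        self.quantifier = quantifier       # "any" or "all" over a repeated field

    def matches(self, record):
        values = field_values(record, self.parts)
        if not values:                     # assumed: unassigned fields never match
            return False
        test = all if self.quantifier == "all" else any
        return test(self.op(v, self.constant) for v in values)

def passes(record, conditions):
    """Treat the where clause as a conjunction of predicates (an assumption)."""
    return all(c.matches(record) for c in conditions)

rec = {"user": {"age": 30}, "items": [{"price": 5}, {"price": 20}]}
ok = passes(rec, [RowCondition("user.age", ">", 18),
                  RowCondition("items.price", ">", 10, "any")])
```

The `quantifier` argument shows how the "any" and "all" operators of section 3.1 remove the ambiguity of comparing a repeated field with a constant.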
3.2.3 Project Operator (projection operator)
Row-structure data stores all fields of every record, but most query statements need the values of only a few fields. Without projection, large amounts of data from fields unrelated to the query would be copied repeatedly between operators throughout the query, and these extra memory copies would reduce query efficiency. The present invention therefore implements the projection operation on row-structure data to extract the fields related to the query, so that only those fields are copied and query efficiency improves.
During the operation, the present invention uses recursive function calls to handle the nested structure of semi-structured data. At each field, the assignment in the original data is read and parsed, and only the fields related to the query are written to the result. Large numbers of unrelated fields in the row data are thus skipped, which improves query efficiency. For a repeated field that repeats at a leaf node, STEED only needs to copy directly the contiguously stored values in the array of that field. If the field repeats at a non-leaf node, each substructure in the array is recursed into and parsed separately. Note that when extracting subtrees, the present invention keeps only subtrees that have been assigned; that is, if none of the related fields in a subtree is assigned, the subtree is not kept in the projection result.
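The recursive projection, including the rule that unassigned subtrees are dropped, can be sketched in Python on nested dicts and lists. The helper names `build_wanted` and `project` are assumptions; the real operator works on the binary row structure rather than on Python objects.

```python
def build_wanted(paths):
    """Turn dotted paths such as "items.price" into a tree of wanted field names."""
    tree = {}
    for path in paths:
        cur = tree
        for part in path.split("."):
            cur = cur.setdefault(part, {})
    return tree

def project(node, wanted):
    """Copy only the wanted fields; return None when nothing in the subtree is assigned."""
    if not wanted:                     # leaf of the wanted tree: copy the value as is,
        return node                    # including a whole array of repeated leaf values
    if isinstance(node, list):         # repeated non-leaf field: recurse into each element
        kept = [p for p in (project(item, wanted) for item in node) if p is not None]
        return kept or None
    if not isinstance(node, dict):
        return None
    out = {}
    for name, sub in wanted.items():
        if name in node:
            value = project(node[name], sub)
            if value is not None:      # drop subtrees with no assigned wanted field
                out[name] = value
    return out or None

rec = {"id": 1, "user": {"name": "a", "age": 3},
       "items": [{"price": 5, "sku": "x"}, {"sku": "y"}], "meta": {"a": 1}}
result = project(rec, build_wanted(["id", "items.price", "meta.b"]))
```

Here `result` keeps `id` and the one `items` element that has a `price`; `user` is not requested, and `meta` is dropped because the requested `meta.b` is not assigned.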
3.2.4 Assemble Operator (column-to-row assembly operator)
In this operator, STEED assembles the fields related to the query from column-structure data into row-structure data. The assembly algorithm is described above. STEED first uses the Query Parser to parse the SQL statement to be executed and obtain all fields related to the query, and builds from them a finite-state machine (FSM) that controls the order in which the column-structure data is read during assembly. The conversion from column to row data is then completed by the assembly algorithm described earlier and is not repeated here.
3.2.5 Column Filter Operator (column-to-row assembly operator with filtering)
Compared with the Assemble Operator (column-to-row assembly operator), the Column Filter Operator not only assembles the column structure into the row structure but can also filter each record during assembly. In a query, the where clause filters out records that do not satisfy its conditions, so skipping the assembly of these records can greatly improve query efficiency. During a query, the present invention therefore reads one CAB at a time, performs the filter operation on it, builds a corresponding bitmap to record the comparison results, and finally decides from the recorded results whether to assemble each record.
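A simplified Python sketch of the filter-then-assemble idea follows. It assumes a CAB is available as a list of `(value, repetition_level)` Column Items in which level 0 marks the first item of a record, uses "any" semantics for repeated values, and assembles flat per-column value lists instead of the full nested structure; `filter_cab` and `assemble_selected` are hypothetical names.

```python
def filter_cab(items, predicate):
    """Return a bitmap (as an int) with bit i set when record i has a matching value."""
    bitmap, rec = 0, -1
    for value, level in items:
        if level == 0:                 # repetition level 0 starts a new record
            rec += 1
        if value is not None and predicate(value):
            bitmap |= 1 << rec
    return bitmap

def assemble_selected(columns, bitmap):
    """Assemble only records whose bit is set (flat value lists, a simplification)."""
    out = {}
    for name, items in columns.items():
        rec = -1
        for value, level in items:
            if level == 0:
                rec += 1
            if ((bitmap >> rec) & 1) and value is not None:
                out.setdefault(rec, {}).setdefault(name, []).append(value)
    return [out[i] for i in sorted(out)]

price = [(5, 0), (20, 1), (3, 0), (None, 0), (50, 0)]   # four records
name = [("a", 0), ("b", 0), ("c", 0), ("d", 0)]
bitmap = filter_cab(price, lambda v: v > 10)
rows = assemble_selected({"price": price, "name": name}, bitmap)
```

Because CABs of different columns are aligned by record, the bitmap computed from the `price` column can be applied directly to the `name` column without any lookup.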
3.2.6 Join Operator (join operator)
STEED implements the join operation with a hash join and currently supports joining only two tables. When executing this operation, STEED computes the hash value of the join key of each record in one data set and stores the whole record in a hash table. It then traverses the other data set and looks up the position (bucket) in the hash table that has the same hash key. The two row-structure records are then merged, and the merged record waits to be pulled by the operator in the layer above. STEED does not currently optimize queries with a query optimizer as relational databases do, so it is recommended to list the smaller data set first in the from clause of a query in order to obtain higher storage efficiency.
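The build and probe phases can be sketched as a Python generator, which also mirrors the pull model: a merged record is produced only when the caller asks for the next one. The function name `hash_join`, the dict-based records and the rule that probe-side fields win on a name clash are assumptions for the example.

```python
from collections import defaultdict

def hash_join(build, probe, build_key, probe_key):
    """Two-table equi-join: hash the build side, then stream the probe side."""
    table = defaultdict(list)
    for rec in build:                      # build phase: store whole records by join key
        key = rec.get(build_key)
        if key is not None:
            table[key].append(rec)
    for rec in probe:                      # probe phase: look up the matching bucket
        for match in table.get(rec.get(probe_key), []):
            yield {**match, **rec}         # merge; probe fields win on a name clash

users = [{"uid": 1, "name": "a"}, {"uid": 2, "name": "b"}]
orders = [{"uid": 1, "item": "x"}, {"uid": 3, "item": "y"}, {"uid": 1, "item": "z"}]
joined = list(hash_join(users, orders, "uid", "uid"))
```

Passing the smaller data set as `build` keeps the hash table small, which corresponds to the recommendation above to list the smaller data set first.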
3.2.7 Group Operator (grouping operation)
Of the built-in operations that queries support at this stage, Group is the most complex. This section introduces some classes newly defined for the computation and analyses the corresponding execution process. Like the join operator, the group operator uses a hash table to store the group key values. During the computation, data is first read from the row-structured data, its hash key is computed, and it is added to the hash table. Then, depending on whether an aggregation is required, the content held in the hash value is computed. The data storage structure of the hash value is shown in Figure 10:
The present invention first defines HashValueItemContainer to store each storage unit (bucket) of the hash table; the values in the hash table are the addresses of these HashValueItem objects. Each such object has the structure shown in Figure 10:
(1) The intermediate layer stores the address of the first record and the content of each aggregation that needs to be computed.
(2) The Block Buffer object stores the actual content of the saved records. Note that in these records, every field other than the grouping fields (the fields whose values the grouping is based on) is an aggregation-result field that has not yet been assigned. After the whole group operation completes, the aggregation results are written to the corresponding positions, and the result waits to be pulled by other operators in the layer above.
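The two-phase behaviour described above (accumulate per group in a hash table, write the aggregation results only once all input has been read) can be sketched as follows. This is a simplified illustration: a Python dictionary replaces HashValueItemContainer and Block Buffer, only sum and count are computed, and all names are hypothetical.

```python
def group_aggregate(rows, group_field, agg_field):
    """Group records by one field and compute sum and count per group."""
    table = {}  # group key -> first record seen plus running aggregation state
    for row in rows:
        key = row[group_field]
        item = table.get(key)
        if item is None:
            # Keep a reference to the first record and empty aggregation slots.
            item = table[key] = {"first": row, "sum": 0, "count": 0}
        item["sum"] += row[agg_field]
        item["count"] += 1
    # Only after all input is consumed are the aggregation results emitted.
    for key, item in table.items():
        yield {group_field: key, "sum": item["sum"], "count": item["count"]}


rows = [{"k": "a", "v": 1}, {"k": "b", "v": 5}, {"k": "a", "v": 3}]
groups = list(group_aggregate(rows, "k", "v"))
```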
3.2.8 Order Operator (sort operation)
For the order by operation, all records must first be stored in a buffer and then compared and sorted. For efficient memory allocation, the present invention requests only fixed-size blocks of memory from the operating system each time, which avoids the cost of copying memory when realloc reallocates. In addition, to avoid repeatedly copying data while sorting, an array records the start address of each record, and sorting changes only the positions of the pointers in this array. When the array is finally accessed sequentially, the records are visited in the required order.
According to the sort conditions, the present invention further defines a comparer (comparator) to compare records. It operates as follows:
(1) The comparer reads from the row storage structure the values of all fields used in the comparison.
(2) To improve comparison efficiency, comparison and output are implemented as follows:
A) Each record reserves 8 bytes to store the leading 8 bytes of the first field to be compared. For all numeric types, this space is enough to store the value itself, so the value does not have to be fetched from the complex row structure; for strings, comparing the first 8 bytes determines the result in most cases. During comparison, the cached 8 bytes are therefore compared first. Only when the data being compared are strings and their prefixes are identical does the next comparison step take place.
B) The comparer takes the values of the sort fields in order and compares them one after another until the result of the comparison is determined.
C) The comparison function is used with the STL sort function.
D) The records themselves are not copied during comparison; only the pointer array that records the output order is modified, which avoids repeated memory copies. When the operator in the layer above pulls the data, only the corresponding pointers are provided, which improves data processing efficiency.
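The combination of a cached 8-byte prefix and a sorted index array can be illustrated with the sketch below. It is an approximation under stated assumptions: list indices stand in for record pointers, only a single string field is sorted, and the names prefix8 and sort_order are hypothetical. Python's list.sort is used here in place of the STL sort function.

```python
from functools import cmp_to_key


def prefix8(value):
    """Cache the leading 8 bytes of the UTF-8 encoding of a string field."""
    return value.encode("utf-8")[:8]


def sort_order(records, field):
    """Return record indices in sorted order without moving the records."""
    cached = [prefix8(record[field]) for record in records]

    def compare(i, j):
        # Cheap path: the cached prefixes usually decide the comparison.
        if cached[i] != cached[j]:
            return -1 if cached[i] < cached[j] else 1
        # Equal prefixes: fall back to comparing the full field values.
        a = records[i][field].encode("utf-8")
        b = records[j][field].encode("utf-8")
        return (a > b) - (a < b)

    order = list(range(len(records)))  # plays the role of the pointer array
    order.sort(key=cmp_to_key(compare))
    return order


records = [{"name": "delta-long-name-2"}, {"name": "alpha"}, {"name": "delta-long-name-1"}]
order = sort_order(records, "name")
```

Reading records[i] for each i in order yields the sorted output while the records list itself stays in its original order.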
Part 4: Method and system for optimizing tree data using the simple path property
In this part, the present invention analyses data from a variety of existing data sources, summarizes the concept of a simple path, and uses this property for query optimization in STEED.
4.1 Definition of a simple path
Analysis of data from many different sources shows that the syntax tree of each data set contains a large number of paths from the root to a leaf node that have at most one repeated field. This structural property can be exploited during queries to optimize query processing and improve query efficiency. The present invention therefore defines a simple path as follows: in the syntax tree of a data set, a path from the root to a leaf node on which at most one field (a node in the syntax tree) is multi-valued is called a simple path. STEED can use simple paths to optimize both the storage and the querying of tree data.
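The definition translates directly into a check over the syntax tree. The sketch below assumes each node carries a parent link and a flag marking it as repeated (multi-valued); the class SchemaNode and the function is_simple_path are hypothetical names used only for illustration.

```python
class SchemaNode:
    """A syntax tree node with a link to its parent and a 'repeated' flag."""

    def __init__(self, node_id, name, repeated=False, parent=None):
        self.node_id = node_id
        self.name = name
        self.repeated = repeated
        self.parent = parent


def is_simple_path(leaf):
    """True if at most one node between the leaf and the root is repeated."""
    repeated_count = 0
    node = leaf
    while node is not None:
        if node.repeated:
            repeated_count += 1
        node = node.parent
    return repeated_count <= 1


root = SchemaNode(0, "root")
items = SchemaNode(1, "items", repeated=True, parent=root)
price = SchemaNode(2, "price", parent=items)               # one repeated node on the path
tags = SchemaNode(3, "tags", repeated=True, parent=items)  # two repeated nodes on the path
```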
4.2 Row storage structure for semi-structured data
As noted earlier, STEED uses a relatively complex storage structure in its row storage in order to express the hierarchical information of tree data exactly. Analysis suggests that, from the standpoint of data representation, this structure cannot be optimized further. Given the analysis of simple paths above, however, the structural information stored in the data can be simplified to make the internal representation more efficient, which further improves parsing and query efficiency. The improved row storage structure is shown in Figure 11: for data on a simple path, STEED stores only the structural information of the leaf node (field), which stands for the corresponding path and replaces the original nested storage structure. With the simple path optimization, STEED can use the leaf node information in the data to obtain information about every node on the whole path from the syntax tree (Schema Tree) in the system. In this way, STEED improves the representation efficiency of row data and the execution efficiency of queries by simplifying the structural information stored in the data.
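To illustrate why storing only the leaf node is sufficient, the sketch below recovers the full path from a leaf ID by walking parent links in a schema table. The table layout (node ID mapped to a name and a parent ID) and the names SCHEMA and path_of are assumptions made for this example, not the Schema Tree format used by STEED.

```python
# Hypothetical schema table: node ID -> (field name, parent node ID).
SCHEMA = {
    0: ("root", None),
    1: ("user", 0),
    2: ("address", 1),
    3: ("city", 2),
}


def path_of(leaf_id, schema=SCHEMA):
    """Rebuild the dotted path of a leaf from its ID using parent links."""
    names = []
    node_id = leaf_id
    while node_id is not None:
        name, parent_id = schema[node_id]
        names.append(name)
        node_id = parent_id
    # names runs from leaf to root; drop the root and reverse the rest.
    return ".".join(reversed(names[:-1]))
```

For example, path_of(3) walks city, address, user and root, and returns the dotted path without the root name.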
4.2 Flatten Assembler (flat row-structure assembler)
During data assembly, STEED spends considerable effort restoring the hierarchical structure of the data. As noted earlier, in most data sets the fields repeat at no more than 2 levels, so most values in the data can be optimized using simple paths. Assembling the fields of a simple path in STEED is more straightforward:
The Flatten Assembler (flat row-structure assembler) ignores the hierarchical relationships of the default binary data: the path to a leaf node is represented by the leaf node alone, and all non-leaf nodes on the path are ignored. This limits the nesting of row-structured data to one level, which saves memory during data queries and improves query efficiency.
The assembly algorithm is as follows:
Before assembly, the columns to be assembled are sorted by leaf node ID. Then, for each record, all Column Items are read in order from each Column Reader, and the values read, together with the related structural information, are written in turn into the assembled result. Because the assembled result retains only one nesting level, STEED needs only to append the value of each field to the current object during assembly, without considering nesting in the assembled result.
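These steps can be sketched as follows. The sketch assumes each Column Reader can be modelled as a list holding, for every record, the values of one leaf field; the dictionary layout and the name flatten_assemble are hypothetical, and the structural information written alongside each value is reduced here to the leaf ID.

```python
def flatten_assemble(column_readers, record_count):
    """Assemble flat records from per-leaf columns.

    column_readers: leaf ID -> list with one entry per record, each entry
    being the list of values that the record holds for that leaf.
    """
    leaf_ids = sorted(column_readers)  # sort the columns by leaf node ID first
    records = []
    for r in range(record_count):
        flat = []  # a single nesting level: (leaf ID, value) pairs
        for leaf_id in leaf_ids:
            for value in column_readers[leaf_id][r]:
                flat.append((leaf_id, value))  # append only, no nesting to restore
        records.append(flat)
    return records


readers = {7: [["x", "y"], []], 3: [[1], [2]]}
assembled = flatten_assemble(readers, 2)
```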
4.3 Storage structure of flat row data
In the present invention, STEED uses flat row data to optimize both query processing and storage. For non-simple paths in the syntax tree, STEED must identify multi-valued fields at different nesting levels, so the default tree data representation of the system continues to be used. For simple paths, the data is stored or assembled using the structure in Figure 11:
1) Paths from the root to a leaf node with no repeated node: the flat storage structure needs to store only the leaf node ID and the value of the corresponding field;
2) Paths from the root to a leaf node with exactly one repeated node: the flat storage structure can take either of the following two forms, see Figure 12:
A) Each value of the repeated field is stored as a separate value in the flat structure: the data contains multiple values with the same ID, and their number is determined by the number of repetitions of the field;
B) The repeated field is stored in the flat structure as a single item: the data contains only one ID for the repeated field, and the multiple values of the field are represented as an array.
3) Paths from the root to a leaf node with multiple repeated nodes: the flat storage structure cannot express at which level of the path each repetition occurs, so the original default tree storage structure continues to be used: the flat-structure data still uses the leaf node ID, but the corresponding value is an offset that points to the location where the complete nested structure is stored.
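The three cases can be summarized in a small encoder. This is a sketch under assumptions: entries are (leaf ID, value) tuples, the offset is modelled as an index into a separate list that holds the nested data, and the name encode_field and its parameters are hypothetical.

```python
def encode_field(leaf_id, values, repeated_levels, as_array=False, nested_store=None):
    """Encode one field of a record into flat (leaf ID, value) entries."""
    if repeated_levels == 0:
        # Case 1: no repeated node on the path, store the single value.
        return [(leaf_id, values)]
    if repeated_levels == 1:
        if as_array:
            # Case 2 B): one entry whose value is an array of all repetitions.
            return [(leaf_id, list(values))]
        # Case 2 A): one entry per repetition, all sharing the same leaf ID.
        return [(leaf_id, value) for value in values]
    # Case 3: several repeated nodes, keep the nested form elsewhere and
    # store only an offset to it in the flat structure.
    nested_store.append(values)
    return [(leaf_id, {"offset": len(nested_store) - 1})]


store = []
single = encode_field(1, 42, 0)
per_value = encode_field(2, ["a", "b"], 1)
as_array = encode_field(2, ["a", "b"], 1, as_array=True)
nested = encode_field(3, [[1], [2, 3]], 2, nested_store=store)
```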

Claims (10)

1. A tree data processing method, characterized by comprising:
Step 1: reading semi-structured data and parsing it into binary-format data in row or column layout, wherein during parsing a syntax tree is generated dynamically or built from a definition, to store the definition of the semi-structured data;
Step 2: storing the row or column binary-format data, wherein the row and column binary-format data can be converted into each other, and the binary-format data can be output directly as JSON data in text format;
Step 3: performing query operations on the semi-structured data based on the binary-format data.
2. The tree data processing method of claim 1, characterized in that step 1 comprises: providing definitions that describe and define the binary data types and nested structures of the fields in Protocol Buffers and JSON text data; and establishing the definition of the semi-structured data, wherein for Protocol Buffers text data, the syntax tree is generated dynamically from the syntax tree definition in its definition file before the data is parsed, and for data in JSON format, the syntax tree definition is generated dynamically from the format and content of the data while the data is being parsed.
3. The tree data processing method of claim 2, characterized in that step 1 further comprises parsing the semi-structured data into: a row storage structure, in which each single record is a unit and its nesting is stored level by level; and a column storage structure, in which each leaf in the tree data definition is a unit of storage.
4. The tree data processing method of claim 2, characterized in that the semi-structured data is processed as follows:
The definition of the semi-structured data is filled into the nodes of the syntax tree; each node not only describes information about the node itself but is also associated with other nodes through the IDs of the syntax tree nodes, forming a tree.
5. The tree data processing method of claim 4, characterized in that syntax trees are built separately for JSON and Protocol Buffers during parsing, wherein,
Building the JSON syntax tree: the syntax tree is built dynamically from the data while the data is parsed, on the assumption that the type of each field's value does not change and that all members of an array have the same type. While the syntax tree is built, the type of a field is determined from the type of its value in the data; a JSON field whose value is an array is defined as repeated, and all remaining nodes are defined as optional. During parsing, the symbol table is first searched by parent node ID and field name for a corresponding definition; if none exists, the corresponding node is added to the syntax tree; otherwise the value of the node is parsed;
Building the Protocol Buffers syntax tree: each Protocol Buffers message defined in a proto file is treated as a new data type, each of whose fields is of a basic data type or of another compound type. To build the Protocol Buffers syntax tree, the proto file is parsed first and the new data types are registered; then, starting from the specified root node, the data type definitions are expanded one by one and assembled into the syntax tree of the data structure.
6. The tree data processing method of claim 1, characterized in that binary-format data types are defined for the storage and computation of the row or column binary-format data:
1) Integers: TypeInt (8/16/32/64) represent 8-, 16-, 32- and 64-bit integers respectively;
2) Floating-point numbers: Type (Float/Double) represent floating-point numbers of float and double type respectively;
3) Strings: TypeString represents a character string;
4) Timestamps: TypeTimeStamp represents a timestamp and is implemented internally with TypeInt64.
7. The tree data processing method of claim 1, characterized in that step 3 comprises: when a query is executed, first generating from the content of the query statement the operation tree required for the query, each node of the operation tree being an SQL operation.
8. The tree data processing method of claim 1, characterized by further comprising an extended query syntax, as follows:
(1) ".": separates the nesting levels in the path expression of a field;
(2) "any": denotes an arbitrary value of a repeated field;
(3) "all": denotes all values of a repeated field;
The output is: data in JSON format; or JSON-like data that ignores the nested structure.
9. The tree data processing method of claim 1, characterized by further comprising:
Row data read operation: reading whole row-structured records from the row binary-format data, wherein each read takes one Row Object from the row binary-format data and reading continues in sequence until EOF is reached;
Row data filter operation: evaluating the conditions of the where clause on the row binary-format data that has been read, and, after the group by clause has generated new row binary-format data, filtering the aggregation results, wherein the where clause is parsed first, each predicate is instantiated as an object that can compare data, the values read are then compared to determine the truth value of each predicate, and the result decides whether the record passes the condition;
Row data mapping operation: calling a recursive function to handle the nested structure of the semi-structured data, wherein for each field the value of the field is read from the original data and parsed, and only fields related to the query are written to the result of the operation;
Join operation: implementing the join with a hash join, wherein a hash value is computed from the join key value of each record in one of the data sets and the whole record is stored in a hash table; the other data set is then traversed and the position in the hash table with the same hash key is looked up; the two row-structured records are then merged, and the merged record waits to be pulled by the operator in the layer above;
Group operation: first defining HashValueItemContainer to store each storage unit of the hash table, the values in the hash table being the addresses of the HashValueItem objects, wherein (1) the intermediate layer stores the address of the first record and the content of each aggregation that needs to be computed;
(2) the Block Buffer object stores the actual content of the saved records, wherein after the whole group operation completes, the aggregation results are written to the corresponding positions, and the result waits to be pulled by other operators in the layer above;
Sort operation: storing all records in a buffer and comparing and sorting them, wherein each request to the operating system is for a fixed-size block of memory used to store the records obtained from the lower-level operation, and during comparison an array records the start address of each record and sorting changes the positions of the pointers in the array;
According to the sort conditions, a comparator is defined and operates as follows:
(1) the comparator reads from the row binary-format data the values of all fields used in the comparison;
(2) to improve comparison efficiency, comparison and output proceed as follows:
A) each record reserves 8 bytes to store the leading 8 bytes of the first field to be compared;
B) the comparator takes the values of the sort fields in order and compares them one after another until the result of the comparison is determined;
C) the STL sort function is used for the comparison;
D) the records themselves are not copied during comparison; only the pointer array that records the output order is modified.
10. A system based on the tree data processing method of any one of claims 1-9.
CN201710178695.6A 2017-03-23 2017-03-23 A kind of tree data processing method and system Active CN107092656B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710178695.6A CN107092656B (en) 2017-03-23 2017-03-23 A kind of tree data processing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710178695.6A CN107092656B (en) 2017-03-23 2017-03-23 A kind of tree data processing method and system

Publications (2)

Publication Number Publication Date
CN107092656A true CN107092656A (en) 2017-08-25
CN107092656B CN107092656B (en) 2019-12-03

Family

ID=59646394

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710178695.6A Active CN107092656B (en) 2017-03-23 2017-03-23 A kind of tree data processing method and system

Country Status (1)

Country Link
CN (1) CN107092656B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant