CN107066551B - Row-type and column-type storage method and system for tree-shaped data - Google Patents

Publication number
CN107066551B
Authority
CN
China
Prior art keywords
data
column
value
line
type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710179108.5A
Other languages
Chinese (zh)
Other versions
CN107066551A (en
Inventor
陈世敏
王智义
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS
Priority to CN201710179108.5A
Publication of CN107066551A
Application granted
Publication of CN107066551B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML

Abstract

The invention provides a row-type and column-type storage method and system for tree-structured data. The method reads tree-structured text data and parses it into a row-type or column-type binary format for storage. During parsing, a syntax tree is generated dynamically to hold the definition of the semi-structured data; during a query, STEED reads the structural information of the original data from the syntax tree and combines it with the content of the binary data to complete the query operations. The row-type storage structure takes the record as its unit and represents the nested and repeated fields of the semi-structured data with nested substructures defined inside it; the column-type storage structure takes the path as its unit and, for each path from the root to a leaf node of the syntax tree, stores the values and structural information of that path across all records separately. By analyzing the storage structure of semi-structured data, the invention simplifies the data storage structure and improves storage efficiency.

Description

Row-type and column-type storage method and system for tree-shaped data
Technical Field
The invention relates to the technical field of data processing, and in particular to a method and a system for storing tree-shaped data in row-type and column-type form.
Background
With the development of computer networks and big data processing technologies, traditional relational data is increasingly unable to meet the requirements for data definition and use in network and big data environments. Semi-structured data, represented by JSON and Protocol Buffers, is widely used in practice because it can fully express the objects of programming languages and because the original data format can be modified and extended as the format of the data changes.
Definition of tree-structured data:
Tvalue = Tprimitive | Tobject | Tarray
Tprimitive = string | number | boolean | null
Tobject = { key1: Tvalue1, ..., keyn: Tvaluen }
Tarray = [ Tvalue1, ..., Tvaluen ]
Record = Tobject
As indicated above, tree-structured data is defined as follows:
1. A value in tree-structured data is one of three types: an object value, an array value, or a value of an atomic type.
2. An object value is enclosed in curly brackets and consists of key-value pairs. There may be any number of key-value pairs, but a key may not be repeated within the same object.
3. An array value is enclosed in square brackets and consists of values. There may be any number of values, and repeated values may appear.
4. A value of an atomic type is a string, a number, a boolean, or null.
5. In the key-value pairs described in item 2, the key can only be of string type.
6. Each record of tree-structured data is an object.
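The six rules above can be checked mechanically. The following is a minimal sketch in Python, assuming that a parsed JSON document stands in for tree-structured data; the function names is_tree_value and is_record are illustrative and do not come from the invention.

```python
import json

def is_tree_value(value):
    """Return True if value matches Tvalue = Tprimitive | Tobject | Tarray."""
    # Tprimitive: string, number, boolean or null
    if value is None or isinstance(value, (str, bool, int, float)):
        return True
    # Tobject: keys must be strings, every member must itself be a Tvalue
    if isinstance(value, dict):
        return all(isinstance(k, str) and is_tree_value(v) for k, v in value.items())
    # Tarray: any number of Tvalue elements, repeats allowed
    if isinstance(value, list):
        return all(is_tree_value(v) for v in value)
    return False

def is_record(value):
    """Rule 6: a record is always an object."""
    return isinstance(value, dict) and is_tree_value(value)

record = json.loads('{"id": 7, "tags": ["a", "a"], "user": {"name": null}}')
```

Note that key uniqueness (rule 2) is already enforced by the dictionary type used here, so the sketch does not test it separately.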
Such data commonly comes from the following sources:
1) Data feeds (Data Feeds)
Data is transmitted over the network in JSON format, with Twitter as a representative example. Users and the associated API programs obtain data updates by listening on the corresponding port. Because the Twitter data set has rich content, a relatively complex structure, a relatively stable source, and a sufficiently large volume, the experiments and data analysis of the invention are mainly based on it. The following table gives the invention's analysis of the nesting level and the number of repeated fields in the Twitter data.
Leaf node level | No repeated field | 1 repeated field | 2 or more repeated fields | Total
1               | 16                | 0                | 0                         | 16
2               | 61                | 2                | 0                         | 63
3               | 51                | 21               | 4                         | 76
4               | 1                 | 19               | 4                         | 24
5               | 0                 | 12               | 0                         | 12
6               | 0                 | 12               | 0                         | 12
Total           | 129               | 66               | 8                         | 203
2) Online data services (Online Data Service)
Online data services exchange data in JSON format. A common pattern is for the client to send the content of an operation and to receive the result of that operation in return. The invention studies semi-structured data of online data services from different sources, such as Yahoo, Sina Weibo, and IMDB. Typically, the user composes a service request in JSON according to a given API format, sends it to the corresponding data server, and parses the JSON data that is returned, which completes one data service call.
The invention's analysis of the online data service of the microblog API is shown in FIG. 1.
The analysis focuses on the number of repeated fields contained in each path: in the figure, the black part represents paths from the root to a leaf node with no repeated field, the light part represents paths with exactly 1 repeated field, and the white part represents paths with 2 or more repeated fields. The proportions are displayed as a histogram: in most syntax trees, a path from the root to a leaf node contains at most 1 repeated field.
3) Communication protocols
The invention analyzes the communication protocol formats in Apache Hadoop and Hadoop HBase, which use semi-structured data defined with Protocol Buffers to transmit communication data. These systems define many different types of semi-structured communication formats for communication and control between machines. Most of the semi-structured data used for communication has a very simple format.
The invention's analysis of the communication protocol of Apache Hadoop is shown in FIG. 2.
The analysis focuses on the number of repeated fields contained in each path: in FIG. 2, the black part represents paths from the root to a leaf node with no repeated field, the light part represents paths with exactly 1 repeated field, and the white part represents paths with 2 or more repeated fields. The proportions are displayed as a histogram: in most syntax trees, a path from the root to a leaf node contains at most 1 repeated field.
4) Common data sets
Common data sets are stored in JSON format, as found by parsing the data in DBpedia and data. Unlike semi-structured data files in the traditional sense, however, each of these data sets consists of only one JSON record. This record is divided into two parts: the first part is a nested substructure (an object in JSON) that stores the format of the data that follows in the data set; the second part is an array that stores the content of each record, and each record is a structure without nesting. The invention can easily split such a record into a data definition part and a data content part, and then process it with conventional semi-structured data processing methods.
5) Sensor data
Recent sensor platforms, such as Arduino, Dragon Board, and Beagle Board, are capable of generating and processing JSON data. The invention analyzes data from these sources and finds that its internal format is fairly simple: the nesting depth of all fields is at most 2, and at most one multi-valued field appears on the path from the root to a leaf node.
However, existing data processing systems cannot handle semi-structured JSON data from the above sources well: none of them provides complete functionality while also delivering good performance for all operations. The invention analyzes a large number of management systems that support semi-structured data and finds that they follow three main approaches:
1) Extending the functionality of a traditional relational database
Systems such as PostgreSQL and Oracle store semi-structured data such as JSON in a relational table as a contiguous data block, either as text or in an internally encoded binary format. To run a query, an internal parsing function is called to parse the content of the data block and read the values of the required fields; the operators of the relational database are then called to perform the query on those values.
2) NoSQL data processing systems
Systems such as MongoDB encode semi-structured data in binary internally using a more flexible approach. Their advantage is that they natively support parsing, storing, and querying semi-structured data. In their implementation, some query operations are newly defined or extended according to the structural characteristics of semi-structured data.
3) Processing data in columnar data format
Google Protocol Buffers and Apache Hive + Parquet support operations such as data processing and querying on semi-structured data. Compared with the two types of row-based data processing systems above, columnar data processing systems provide better query analysis performance in most cases, but their internal implementation is more complex: data is usually stored internally in the form of column clusters, which makes parsing and querying semi-structured data harder to implement.
At present, all three of the above approaches to building a semi-structured data processing system have problems to varying degrees.
1) Extending existing relational databases to support semi-structured data is quite inefficient
An analysis of the relational databases that currently support semi-structured data processing shows that most of them apply no data encoding or optimization tailored to the structure and data characteristics of semi-structured data. They mainly store semi-structured data as text data blocks and parse those text blocks with a set of built-in parsing functions to obtain the information required from each record. Storing JSON data directly as text in the database wastes a great deal of space.
In addition, queries require a large number of string comparison and lookup operations, which severely limits data processing efficiency. According to the invention's existing research, although many systems support operations on semi-structured data, query running time is often too long once the data volume grows, so real-time requirements are difficult to meet.
Relational databases also do not support some of the new structural features of semi-structured data well. For example, they lack direct support for defining nested and repeated fields, and for extensions of the SQL query syntax that reflect the structural characteristics of semi-structured data.
2) NoSQL data processing systems encode and query data inefficiently
The invention analyzes and studies MongoDB, a widely used NoSQL data processing system. Because JSON data semantics are flexible, MongoDB defines a redundant and cumbersome internal data encoding format. The study finds that its encoding efficiency is low: in most cases the encoded data file is larger than the data in its original text format. The internal encoding does not effectively reduce the redundant information in the JSON text data, and instead adds overhead during queries. This limits its data processing performance, especially for large amounts of data.
In addition, some operations cannot be performed by these NoSQL data processing systems because of limitations in their internal design. For example, MongoDB cannot implement the join operation of SQL efficiently and completely (although similar operators were added in the latest version, they still do not fully satisfy the join operation defined in SQL, and their execution efficiency is too low).
3) Processing data in columnar data format
In relational databases, the storage and query performance of a columnar database is generally superior to that of a row database, because a query does not need to read and process fields of a record that are irrelevant to it. However, the internal principles are complex and the functionality is relatively difficult to implement.
Similarly, among systems that support semi-structured data, those that store and query columnar data are internally more complex. Most management systems that use row data place no syntactic restriction on the internal JSON format: the content of the data does not need to be defined in advance, and the structure of the data can keep changing during use. A columnar data management system, however, requires the definition (Schema) of the columnar data in advance, and the structure of the data cannot change dynamically during use. This greatly limits the flexibility of semi-structured data.
In addition, few columnar semi-structured data processing systems are currently available for users to choose from. At present, the only columnar system available to users is Apache Hive + Parquet, which is implemented in Java. Because of the limitations of the Java programming language, its query efficiency leaves room for further optimization. Its runtime platform also requires Apache Hadoop and HDFS, so system initialization and operating costs are high.
In its research on processing semi-structured data, the invention finds that the shortcomings of the three existing approaches all stem from limitations of their data structures and implementations.
First, the internal structural features of semi-structured data make extending a relational database a poor fit for processing it. The two make different assumptions about the data format, so handling semi-structured data with a relational database is costly. The invention therefore redesigns and implements a processing system oriented to semi-structured data, so that it can handle semi-structured data with complex structure.
Second, because the definition of semi-structured data is flexible and its structure may change during use, most current NoSQL data management systems store data directly in a text-like structure. This results in low storage efficiency and costly value extraction during queries. In the design of the invention, the structure extracted from the data is stored in the Schema syntax definition, and only minimal structural information is retained in the data. This removes repeated structural information from the data and also enables some query optimizations based on the data content.
Finally, the current Java-based columnar semi-structured storage needs the support of several basic modules, such as a file storage system and a scheduling system. These impose additional limitations on the functionality and use of the system and may make it inefficient. The data processing system (STEED), implemented in C/C++, is developed entirely independently, so it can be optimized as a whole, and it is free of platform-imposed limitations such as the data format having to be defined in advance and being unchangeable.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a method and a system for storing tree-shaped data in row-type and column-type form.
The invention provides a row-type and column-type storage method for tree-shaped data, comprising the following steps:
step 1, storing the tree-shaped data in a row-type manner and in a column-type manner to generate a row-type storage structure and a column-type storage structure, wherein the row-type storage structure comprises structure header information, an array of values, and an ID array and an offset array of the assigned nodes, and the column-type storage structure comprises a repetition value and a definition value;
step 2, parsing data from the row-type storage structure into the column-type storage structure;
step 3, assembling data from the column-type storage structure into the row-type storage structure.
The column-type storage structure comprises CAB basic units. During parsing, each value generates one column data item, which is stored in a CAB basic unit. The CAB basic units are stored aligned by record id: each CAB basic unit stores the same number of records, regardless of how many column data items those records produce.
A CAB basic unit in the column-type storage structure comprises:
(1) header information: describes the CAB basic unit, including its size and the number of records stored;
(2) a bit array: records the repetition value and the definition value of each column data item, wherein a few bits are used instead of integers to store the repetition value and the definition value, and the number of bits is optimized according to the structural information of the path;
(3) a value region: records the values of all column data items.
The bit array (2) in the CAB basic unit of the column-type storage structure is handled as follows:
a) when a field has no value during parsing, a null is inserted so that the records remain naturally aligned;
b) single-valued fields: the repetition value is the same for every column data item, so the array of repetition values is omitted;
c) values that can repeat at only a single nesting level: the repetition value marks the first column data item of each record, so the repetition value of each column data item needs only 1 bit;
d) values that can repeat at multiple nesting levels: the repetition value is represented by several bits.
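Cases b) to d) can be illustrated with a short sketch. It assumes that the repetition value of a path with n repeated fields ranges over 0..n, so that the bit width follows directly from n; the helper names repetition_bits, pack_bits and unpack_bits and the little-endian packing are illustrative assumptions, not the layout actually used by STEED.

```python
def repetition_bits(num_repeated_fields):
    """Bits needed per repetition value for a path with this many repeated fields.

    0 repeated fields -> 0 bits (case b: the array is omitted)
    1 repeated field  -> 1 bit  (case c)
    n repeated fields -> enough bits to hold the values 0..n (case d)
    """
    return num_repeated_fields.bit_length()

def pack_bits(values, width):
    """Pack small integers into bytes, using `width` bits per value."""
    bits = 0
    for i, v in enumerate(values):
        bits |= v << (i * width)
    return bits.to_bytes((len(values) * width + 7) // 8, "little")

def unpack_bits(data, width, count):
    """Inverse of pack_bits."""
    bits = int.from_bytes(data, "little")
    mask = (1 << width) - 1
    return [(bits >> (i * width)) & mask for i in range(count)]

# Path with one repeated field: 0 marks the first item of a record, 1 a repeat.
reps = [0, 1, 1, 0, 1]
packed = pack_bits(reps, repetition_bits(1))
```

With one bit per item, the five repetition values above fit into a single byte, whereas storing them as 32-bit integers would take 20 bytes.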
The value region (3) in the CAB basic unit of the column-type storage structure stores variable-length and fixed-length data with the following strategies:
a) fixed-length data types: the space occupied by each value is recorded in the header information, and the next value is read by advancing by that fixed length each time, so no additional offset array is needed;
b) variable-length data types: the storage position of each value is recorded in an offset array. For a field whose values contain many repeated contents, each distinct variable-length value is stored only once, and storage efficiency is improved by reusing its offset in different column data items.
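The two strategies can be sketched as follows. This is an illustration under assumptions: 64-bit little-endian integers stand in for a fixed-length type, and each variable-length value is stored with a 4-byte length prefix so that it can be read back from its offset alone. Neither choice is stated in the text, and the function names are hypothetical.

```python
import struct

def encode_fixed(values, fmt="<q"):
    """Fixed-length values are stored back to back; no offset array is needed."""
    return b"".join(struct.pack(fmt, v) for v in values)

def read_fixed(region, index, fmt="<q"):
    """The i-th value sits at i * size, where size would come from the header."""
    size = struct.calcsize(fmt)
    return struct.unpack_from(fmt, region, index * size)[0]

def encode_variable(strings):
    """Store each distinct string once; repeated items reuse the same offset."""
    region, offsets, seen = bytearray(), [], {}
    for s in strings:
        if s not in seen:
            data = s.encode("utf-8")
            seen[s] = len(region)
            region += struct.pack("<I", len(data)) + data  # assumed length prefix
        offsets.append(seen[s])
    return bytes(region), offsets

def read_variable(region, offsets, index):
    """Follow the offset of the index-th column data item to its value."""
    start = offsets[index]
    (length,) = struct.unpack_from("<I", region, start)
    return region[start + 4:start + 4 + length].decode("utf-8")

fixed = encode_fixed([10, 20, 30])
region, offsets = encode_variable(["en", "zh", "en", "en"])
```

In the example, the four column data items share two stored strings: the items at positions 0, 2 and 3 all point to offset 0.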
The row-type storage structure comprises: header information, which records information about the row-type storage structure; an ID array and an offset array: for an object, the ID of each field is recorded to indicate that the field has a value, whereas for an array every value is an assignment of the same field, so only the offset of each value is kept; and an array of values: the row-type storage structure stores values that occur repeatedly in the form of an array, where each value is of an atomic type or of a composite type.
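A rough sketch of how an object could be laid out with these three parts follows. The concrete byte layout (an 8-byte header holding the value size and field count, 16-bit field IDs, 32-bit offsets) and the names encode_object and decode_field are hypothetical choices made for illustration; the text does not specify them.

```python
import struct

def encode_object(fields, field_ids):
    """fields maps a field name to its already-encoded value bytes.

    Hypothetical layout: header | ID array | offset array | values.
    """
    ids, offsets, values = [], [], b""
    for name, raw in fields.items():
        ids.append(field_ids[name])   # the ID marks the field as present
        offsets.append(len(values))   # where this field's value starts
        values += raw
    n = len(ids)
    header = struct.pack("<II", len(values), n)
    return header + struct.pack(f"<{n}H", *ids) + struct.pack(f"<{n}I", *offsets) + values

def decode_field(buf, field_id):
    """Return the raw bytes of one field, or None if the record omits it."""
    size, n = struct.unpack_from("<II", buf, 0)
    ids = struct.unpack_from(f"<{n}H", buf, 8)
    offsets = struct.unpack_from(f"<{n}I", buf, 8 + 2 * n)
    base = 8 + 2 * n + 4 * n
    if field_id not in ids:
        return None
    i = ids.index(field_id)
    end = offsets[i + 1] if i + 1 < n else size
    return buf[base + offsets[i]:base + end]

# IDs as they might be assigned by the syntax tree (assumed values).
FIELD_IDS = {"id": 0, "name": 1, "lang": 2}
row = encode_object({"id": struct.pack("<q", 7), "lang": b"en"}, FIELD_IDS)
```

Because field names live in the syntax tree and only small IDs appear in the record, a field lookup needs no string comparison, and a field that is absent simply has no entry in the ID array.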
The column-type storage structure represents the structure of the tree-structured data using the repetition value and the definition value:
(1) repetition value: the level at which the repetition in the data occurs;
(2) definition value: the level down to which fields that may be omitted are present in the data.
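To make the two values concrete, the sketch below computes them for one path of a nested record. It assumes the meaning these values conventionally have in nested columnar formats: the repetition value is the number of repeated fields on the path up to the one that repeated, and the definition value counts the optional or repeated fields on the path that are actually present. The function shred and the field kinds are illustrative, not the invention's own parser.

```python
def shred(record, path):
    """Return (value, repetition, definition) triples for one root-to-leaf path.

    path is a list of (field_name, kind), kind being "required",
    "optional" or "repeated".
    """
    out = []

    def walk(node, depth, rep, dfn):
        if depth == len(path):
            out.append((node, rep, dfn))
            return
        name, kind = path[depth]
        child = node.get(name)
        if child is None or child == []:
            # Field omitted: emit a null that records how far the path was defined.
            out.append((None, rep, dfn))
            return
        next_dfn = dfn + (kind != "required")
        if kind == "repeated":
            # Level of this field = number of repeated fields up to and including it.
            level = sum(k == "repeated" for _, k in path[:depth + 1])
            for i, item in enumerate(child):
                walk(item, depth + 1, rep if i == 0 else level, next_dfn)
        else:
            walk(child, depth + 1, rep, next_dfn)

    walk(record, 0, 0, 0)
    return out

doc = {"Name": [
    {"Language": [{"Code": "en-us"}, {"Code": "en"}], "Url": "http://A"},
    {"Url": "http://B"},
    {"Language": [{"Code": "en-gb"}]},
]}
code_path = [("Name", "repeated"), ("Language", "repeated"), ("Code", "required")]
column = shred(doc, code_path)
```

For this record the column holds four items. The second one, "en", has repetition value 2 because it repeats at the Language level, and the null produced by the second Name entry has definition value 1 because only Name is present on its path.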
Data parsing from the row-type storage structure into the column-type storage structure includes:
During parsing, no character matching against the structure of the text data is needed; the already-parsed data of the row-type storage structure is parsed directly into data of the column-type storage structure.
Data assembly from the column-type storage structure into the row-type storage structure includes:
Starting from the root node, the syntax tree is traversed and the column-type storage structures are ordered by the order in which their leaf nodes appear. The column data items of each column-type storage structure are then read in that order. The content of each column data item read controls the nesting hierarchy: the information representing the nesting structure in the current column-type storage structure is output, and the value is output to the row-type storage structure being assembled. Next, the column-type storage structure to jump to and read next is determined, and one column data item is pre-read from it to decide which nesting level to return to and which structural information to output. Once every column-type storage structure has been read at least once, one record has been assembled. This process is repeated until all column-type storage structures have been read to the end, which completes the assembly of the whole data set.
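The full assembly procedure jumps between several columns; the sketch below covers only the simplest building block, a single column whose path contains one repeated field. It assumes that a repetition value of 0 marks the first column data item of a record and that a null item stands for a record without values. The function assemble_simple is a hypothetical name and is not the multi-column algorithm described above.

```python
def assemble_simple(column):
    """Rebuild one list of values per record from (value, repetition, definition) items.

    Only valid for a path with a single repeated field.
    """
    records = []
    for value, rep, _dfn in column:
        if rep == 0:
            records.append([])        # repetition value 0: a new record starts
        if value is not None:
            records[-1].append(value) # nulls only keep the records aligned
    return records

column = [("a", 0, 1), ("b", 1, 1), (None, 0, 0), ("c", 0, 1)]
rows = assemble_simple(column)
```

Here the null item keeps the second record in place, so the three rebuilt records stay aligned with the records of every other column.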
The invention also provides a system implemented using the row-type and column-type storage method for tree-shaped data.
According to the above scheme, the invention has the following advantages:
1, a row-type storage structure for semi-structured data: row-type binary storage of semi-structured data is implemented so that it fully expresses the semantics of the semi-structured data and adapts to changes in its data definition. The structure is also required to be simple, easy to express, and efficient in storage;
2, a column-type storage structure for semi-structured data: column-type binary storage of semi-structured data is implemented so that columnar storage fully expresses the structure of the semi-structured data. It is required to express the complex structural characteristics of semi-structured data and to store the data content efficiently;
3, interconversion of the row format and the column format of semi-structured data: parsing and assembly algorithms convert binary row-type data and column-type data into each other;
4, a syntax tree implementation of the semi-structured data definition: the definition information of the structure in the data is stored in a tree structure;
5, query operations on semi-structured data: SQL-like query operations are performed on row-type and column-type data;
6, an extension of the SQL query syntax based on the characteristics of semi-structured data: because multi-valued fields exist in semi-structured data, ANY, ALL, and path expressions are defined to resolve data ambiguity during queries;
7, optimization based on simple paths in semi-structured data: a simple path is one with at most one multi-valued field from the root node to the leaf node. The invention finds that such structures are very common in semi-structured data, and proposes and implements storage and query optimizations for them, which greatly improve query efficiency.
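The notion of a simple path in item 7 can be illustrated by counting, for every leaf path of a record, how many multi-valued (array) levels lie between the root and the leaf. This is a sketch over parsed JSON; the function names are illustrative, and a real implementation would presumably derive the counts from the syntax tree rather than from individual records.

```python
def leaf_paths(value, path=(), repeated=0):
    """Yield (dotted_path, number_of_array_levels) for every leaf of a JSON-like value."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from leaf_paths(child, path + (key,), repeated)
    elif isinstance(value, list):
        for child in value:
            yield from leaf_paths(child, path, repeated + 1)
    else:
        yield ".".join(path), repeated

def repeated_field_counts(record):
    """Map each leaf path to the number of multi-valued levels on it."""
    counts = {}
    for name, repeated in leaf_paths(record):
        counts[name] = max(repeated, counts.get(name, 0))
    return counts

def is_simple_path(repeated):
    """A simple path has at most one multi-valued field from root to leaf."""
    return repeated <= 1

tweet = {"id": 1, "tags": ["a", "b"], "geo": {"coords": [[1.0, 2.0], [3.0, 4.0]]}}
counts = repeated_field_counts(tweet)
```

In this made-up record, id and tags lie on simple paths, while geo.coords passes through two array levels and would need the general multi-level encoding.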
As shown in FIG. 4, the invention uses data sets of different sizes to run query analysis experiments both with the data already loaded into memory (hot cache) and with the data not yet loaded into memory (cold cache). In the experiments, the invention uses different SQL query statements to compare the performance of the corresponding operations, including project, filter, group, sort, and join.
According to the query performance shown in FIG. 4, in the cold cache experiments, where the data is not loaded into memory, STEED achieves a speedup of 4.1 to 17.8 times over Hive + Parquet, 55.9 to 105.2 times over MongoDB, and 33.8 to 1294 times over PostgreSQL; in the hot cache experiments, STEED achieves a speedup of 19.5 to 59.3 times over MongoDB, 19.5 to 59.3 times over Hive + Parquet, and 16.9 to 392 times over PostgreSQL. The query statements for the various query operations of the invention are detailed in the appendix.
Drawings
FIG. 1 is an analysis of the JSON data formats defined by the microblog API;
FIG. 2 is an analysis of the Apache Hadoop communication protocol;
FIG. 3 is a block diagram of the composition of STEED;
FIG. 4 is a graph comparing the query performance of STEED;
FIG. 5 is a diagram of the process of building a syntax tree for Protocol Buffers;
FIG. 6 is a schematic diagram of the composite-type structure of row data;
FIG. 7 is a schematic diagram of a columnar data storage structure;
FIG. 8 is a schematic diagram of a columnar data optimized storage structure;
FIG. 9 is a schematic diagram of the operation of a STEED query;
FIG. 10 is a diagram illustrating a memory structure during a grouping operation;
FIG. 11 is a schematic diagram of an optimized row-wise storage structure;
FIG. 12 is a schematic diagram of an alternative optimization scheme for a row-wise storage structure.
Detailed Description
In view of the above deficiencies of the prior art, the invention redesigns and implements a semi-structured data processing system, STEED. The following describes the overall architecture of the STEED system and the functional requirements of each module, then analyzes the interface definitions between the modules, and briefly describes how data is processed and stored inside the STEED system.
As shown in FIG. 3, STEED is mainly composed of three modules:
(1) The data parsing module:
It reads text data, parses it into row-type or column-type binary format data, and stores the result in the data storage module. During parsing, a syntax tree is generated dynamically to hold the definition of the semi-structured data. For data in JSON format, no data format (syntax tree) is defined in advance, so the definition of the data format can only be generated dynamically while the data is parsed; for Protocol Buffers, the text-format data and its definition are provided together before parsing, so the syntax tree can be built from the definition before the text-format data is parsed. According to the definition of the fields in the syntax tree, the invention converts the text-structured data into row-type and column-type binary format data.
(2) The data storage module:
It stores the row-type and column-type binary files generated by the data parsing module. Data in the two formats can be converted into each other internally and can be output directly as JSON data in text format. In the STEED system, the invention also optimizes the storage structures according to the characteristics of row-type and column-type data storage, so that storage and query efficiency is higher.
(3) The query analysis module:
It performs query operations on semi-structured data based on the data in row and column formats, including project, filter, group, sort, and join. When STEED needs to execute a query, the Query Parser generates the operation tree (Operator Tree) required by the query from the content of the query statement; each node in the tree is an SQL operation. The data passes through the operations of the tree in order from the leaves toward the root, and the query is complete once the root node has been reached. The invention also implements multi-threaded versions of some operations, including project, filter, and group.
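The leaf-to-root flow through an operation tree can be sketched with three toy operators. The classes Scan, Filter and Project below are illustrative stand-ins, not STEED's operators, and the tree is built by hand here instead of by a query parser. The root pulls rows from its child, so data still moves from the leaf toward the root.

```python
class Scan:
    """Leaf node: produces the stored records."""
    def __init__(self, records):
        self.records = records

    def rows(self):
        yield from self.records

class Filter:
    """Keeps only the rows for which the predicate holds."""
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate

    def rows(self):
        return (r for r in self.child.rows() if self.predicate(r))

class Project:
    """Keeps only the requested fields of each row."""
    def __init__(self, child, fields):
        self.child, self.fields = child, fields

    def rows(self):
        for r in self.child.rows():
            yield {f: r.get(f) for f in self.fields}

data = [{"id": 1, "lang": "en"}, {"id": 2, "lang": "zh"}, {"id": 3, "lang": "en"}]
# Roughly: SELECT id FROM data WHERE lang = 'en'
plan = Project(Filter(Scan(data), lambda r: r["lang"] == "en"), ["id"])
result = list(plan.rows())
```

Because each operator consumes only the row stream of its child, operators can be recombined freely, which is one common reason for organizing query execution as a tree.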
The STEED system is thus divided into three modules. The implementation details and procedures of each module are introduced one by one below.
Part 1 Data parsing module
This part describes in detail the implementation of the STEED data parsing module and its key internal algorithms, and explains how STEED parses data and builds syntax trees for JSON and Protocol Buffers, respectively, according to the structural features of semi-structured data.
1.1 Structural overview of the data parsing module
The data parsing module mainly comprises the following three parts:
(1) Data Type:
The binary data types used to describe and define the fields in JSON and Protocol Buffers text data. The STEED system defines some basic data types, such as int, double, and string. For data in JSON format, the values of the text data only need to be mapped to the data types in the system; for Protocol Buffers, the data types defined by the schema, including composite types, are converted to the corresponding default data types of STEED for later use in building the syntax tree.
(2) Schema Tree data syntax Tree:
a definition of semi-structured data, i.e. a syntax tree, is built.
For text data of Protocol Buffers, before parsing the data, a syntax tree is dynamically generated for schema definition according to a schema definition file of the text data. During the data parsing process, the content and structure of the defined syntax tree remain unchanged.
Data in the JSON format requires the present invention to dynamically generate the definition of the syntax tree according to the format and content in the data during the data parsing process. The present invention assumes that the type of value in each domain remains the same, while the type of value for each element in the array is the same.
Sted stores the syntax tree definition for each data set. In the query analysis module, the sted performs corresponding query operations on the data set according to the definition of the data in the syntax tree.
(3)Parser:
The method is used for splitting the semi-structured data in the text format into key value pairs (key value pairs) and then resolving the key value pairs into a storage structure of line type or column type defined in STEED. For Protocol Buffers data, format conversion is only needed to be carried out on the data according to the definition of a syntax tree in the analysis process; for data in the JSON format, whether a newly defined domain appears in the data or not needs to be analyzed in the analyzing process, and then the existing syntax tree is modified.
1.2 Data Type
1.2.1 basic data types supported by STEED
The STEED system internally defines some binary-format data types for storing and operating on row-type and column-type data:
1) integer: TypeInt(8/16/32/64) represents an integer of 8/16/32/64 bits, respectively;
2) floating-point number: Type(Float/Double) represents floating-point numbers of type float and double, respectively;
3) character string: TypeString represents a character string;
4) timestamp: TypeTimeStamp represents a timestamp and is implemented internally by TypeInt64.
All of the above data types support determining their values, converting between text and binary data, comparison operations, and the like.
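As an illustration of the text/binary conversion and comparison operations that these types support, the following Python sketch maps each type name to a fixed binary layout. The `struct` format codes, the little-endian byte order, the length-prefixed string layout and the function names are all assumptions made for this example; the actual binary layout used by STEED is not specified in this section.

```python
import struct

# Assumed little-endian layouts; TypeTimeStamp reuses the TypeInt64 layout.
FORMATS = {
    "TypeInt8": "<b", "TypeInt16": "<h", "TypeInt32": "<i", "TypeInt64": "<q",
    "TypeFloat": "<f", "TypeDouble": "<d", "TypeTimeStamp": "<q",
}

def text_to_binary(type_name, text):
    """Convert a text value into its (assumed) binary form."""
    if type_name == "TypeString":
        data = text.encode("utf-8")
        return struct.pack("<I", len(data)) + data  # length prefix + bytes
    fmt = FORMATS[type_name]
    value = float(text) if fmt[-1] in "fd" else int(text)
    return struct.pack(fmt, value)

def binary_to_value(type_name, data):
    """Convert the binary form back into a Python value."""
    if type_name == "TypeString":
        (length,) = struct.unpack_from("<I", data)
        return data[4:4 + length].decode("utf-8")
    return struct.unpack(FORMATS[type_name], data)[0]

def compare(type_name, left, right):
    """Three-way comparison of two binary values of the same type."""
    a = binary_to_value(type_name, left)
    b = binary_to_value(type_name, right)
    return (a > b) - (a < b)
```

With these hypothetical helpers, `text_to_binary("TypeInt32", "42")` yields four bytes, and `compare` decodes both operands before comparing them, so the same interface works for numbers and strings.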
1.2.2 Conversion of JSON data types
JSON defines the possible types of the data in each of its fields. The invention maps each data type that JSON defines to a corresponding internal data type of STEED, as shown in the following table:
Figure BDA0001253161390000121
The basic data types defined by JSON are mapped directly to the basic data types in STEED. For the nested composite data types object and array in JSON, STEED also defines corresponding row-type and column-type storage formats internally; these are described in the next part, on the data storage module.
1.2.3 Conversion of Protocol Buffers data types
Similar to JSON, Protocol Buffers also defines some internal basic data types. In the internal implementation of STEED, the invention converts these basic data types directly into types in C++ (C++ Type) and stores their values in the parsed result. See https://developers.google.com/protocol-buffers/docs/proto3#scalar.
Figure BDA0001253161390000122
In addition, a composite data type, message, can be defined in the schema of Protocol Buffers. Using composite data types, the invention can define data formats with multiple levels of nesting. In the definition of a composite type, the invention can also choose the assignment attribute of each field: a required field must occur, an optional field may occur, and a repeated field may occur repeatedly.
1.3 syntax Tree (Schema Tree)
This section introduces how STEED describes semi-structured data using a syntax tree (Schema Tree). It also introduces how the syntax tree is built during parsing, according to the data and structural characteristics of JSON and Protocol Buffers.
1.3.1 definition of syntax Tree
Semi-structured data has some structural features:
1) The data contains a large number of nested structures: the definition of each field is deeper and more complex than in traditional flat relational data;
2) The data contains many multi-valued fields: a field may be assigned many values repeatedly within one record;
3) The data contains a large number of sparse fields: many fields are not assigned values in most records, and handling them as table columns in a conventional relational database makes storage and querying very inefficient.
To describe these characteristics of each field in semi-structured data efficiently, while also improving the storage and query efficiency of the row-type and column-type formats, the invention populates each node of the syntax tree of the semi-structured data according to the following definition:
Figure BDA0001253161390000131
Each node not only describes information about itself (data type, nesting level, number of possible assignments of the field, and so on), but is also associated with other nodes through the syntax node ID of the schema node, so that the nodes form a tree structure. Next, the invention describes how the syntax trees for JSON and Protocol Buffers are built during parsing.
1.3.1 building of JSON syntax trees
Since JSON carries no separate definition of its data, the invention can only build the syntax tree dynamically from the data while parsing it. Here, the invention assumes that the value type of each field does not change and that the member types within an array are consistent, so when building the syntax tree it only needs to determine the type of a field from the type of its value. In addition, since it is uncertain whether each field of the JSON data appears in a given record, the invention defines JSON fields whose value is an array as repeated, and defines all other nodes as optional. During parsing, STEED looks up in a symbol table, by parent node ID and field name, whether the corresponding structure definition already exists. If the definition of the node does not exist, the node is added to the syntax tree (Schema Tree); otherwise, the value of the node is parsed. The detailed parsing process is described in the next section.
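The dynamic construction described above can be sketched in Python as follows. This is a minimal illustration, not the STEED implementation: the `SchemaNode` class, the type names returned by `type_name`, and the use of a plain dictionary keyed by (parent node ID, field name) as the symbol table are assumptions made for the example.

```python
class SchemaNode:
    def __init__(self, node_id, parent_id, name, value_type, label):
        self.node_id = node_id
        self.parent_id = parent_id
        self.name = name
        self.value_type = value_type
        self.label = label  # "repeated" for arrays, otherwise "optional"

def type_name(value):
    """Illustrative type names; bool is tested first because it subclasses int."""
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        return "object"
    return "null"

def build_schema(records):
    """Build the schema tree while scanning already-decoded JSON records."""
    nodes = [SchemaNode(0, -1, "root", "object", "required")]
    symbols = {}  # symbol table: (parent node ID, field name) -> node ID

    def visit(obj, parent_id):
        for name, value in obj.items():
            is_array = isinstance(value, list)
            elements = value if is_array else [value]
            key = (parent_id, name)
            if key not in symbols:  # new field: extend the schema tree
                node_id = len(nodes)
                sample = elements[0] if elements else None
                nodes.append(SchemaNode(node_id, parent_id, name, type_name(sample),
                                        "repeated" if is_array else "optional"))
                symbols[key] = node_id
            for element in elements:  # recurse into nested objects
                if isinstance(element, dict):
                    visit(element, symbols[key])

    for record in records:
        visit(record, 0)
    return nodes, symbols
```

The sketch relies on the same assumptions as the text: the type of a field is taken from the first value seen, and the type of an array from its first element.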
1.3.2 Building of Protocol Buffers syntax trees, as shown in FIG. 5:
As shown in the example, Protocol Buffers defines each message in the proto file as a new data type. Each field it contains may be of a basic data type or of another composite type. To build the tree, the proto file is first parsed and the newly defined data types are added; then, starting from the root node (root) specified by the user, the definitions of the data types are expanded one by one and assembled into the syntax tree (Schema Tree) of the data structure. The invention can then parse each piece of text data in turn according to the definition of the syntax tree.
1.4 Data parsing
This section introduces the data parsing algorithm of STEED. The implementation of many underlying base classes in the system is omitted; only the algorithms related to parsing text-format data are listed.
Since semi-structured data defines two composite data structures, object and array, the invention parses these two composite structures with different methods. On the other hand, the output of row-type and column-type binary data is implemented identically for JSON and Protocol Buffers. The invention therefore first introduces the parsing algorithms for JSON and Protocol Buffers separately, and then explains how the parsed data is output as binary row-type and column-type data.
1.4.1 JSON data parsing algorithm
As shown in the following algorithm, the invention adopts different strategies for parsing atomic and composite data types: data of an atomic type is converted directly from its text-format value into binary-format data for storage or output; for data of a composite structure, the invention analyzes and parses its structure until all child fields are of atomic data types, and then writes it to a storage file according to its row-type or column-type storage structure.
Figure BDA0001253161390000151
While parsing data in JSON text format, each field must be checked to determine whether it is a newly added node, in which case the existing syntax tree is modified.
For a nested structure (the left part of the algorithm box above), the fields of the same level are first split into key-value pairs, and each key-value pair is then parsed separately. The invention checks whether the definition of each key has already appeared; if not, the corresponding Schema Tree is updated, and the value of the corresponding field is recorded in the Schema Tree. Parsing then proceeds recursively according to the value recorded at each node of the Schema Tree: a value of a composite data type is passed to the corresponding composite-structure parsing function, and a value of a simple type is output directly to the final result.
For the array of a multi-valued field (the right part of the algorithm box above), since the array represents multiple repeated values of the same field, the invention only needs to call the corresponding parsing function on each element of the array in turn, without checking for modifications to the schema tree.
1.4.2 Protocol Buffers data parsing algorithm
Figure BDA0001253161390000161
For data in Protocol Buffers format, parsing the text-format data is simpler than for JSON: since the format of the data is defined before the data is parsed, the invention does not need to check or modify the syntax tree during parsing, and only needs to parse the value of each field in the record. The specific parsing method is similar to that of JSON: a composite type is passed to the corresponding parsing function, and the value of a simple type is output directly into the result.
1.4.3 Output algorithm for row-type and column-type data
During parsing, STEED can parse data into a row-type or column-type binary format. How the data is output in row-type or column-type format is described in detail here:
Figure BDA0001253161390000162
(1) Output algorithm for row-type composite data:
As shown in the above algorithm, for the composite data types object and array, the value of each field is appended to a row-structure object until the whole record has been parsed.
(2) Output algorithm for column-type composite data:
Compared with the output of a row-type data file, the output of column-type data only needs to write the specific values and the structure information at the leaf nodes directly into the file. Therefore, during parsing, the invention does not need to keep the object and array structures themselves; it only records the structure-related information and outputs it to the column-stored file. This makes the process of outputting the binary format relatively simple and efficient.
Figure BDA0001253161390000171
Part 2 Data storage module
After the data parsing module has parsed the data into row-type or column-type format, the data storage module stores the parsing result and performs certain structural conversions, such as conversion between the row-type and column-type formats and direct output of binary-format data as text. This part first introduces the underlying storage structures for row-type and column-type binary data. Then, based on the assembly algorithm of Google Dremel, it describes how STEED implements the assembly algorithm that converts column-structured data into row-structured data.
2.1 Overview of the row-type storage structure
As described for the parsing process in the previous part, the invention stores data of atomic types in its binary format; the two composite structures, object and array, are stored in the format shown in FIG. 6:
The two storage structures are similar and mainly consist of the following parts:
(1) Header Information: records information about the storage structure, such as its size and the number of elements it contains.
(2) (ID and) Offset Array: for an object, the invention must record the ID of each field to indicate which field each value belongs to; for an array, every value is an assignment to the same field, so only the offset of each value is kept.
(3) Value Array: the row-type storage structure stores the repeatedly occurring values as an array, where each value may be of an atomic type or of a composite type. In an object, the values may have different types, since they are assignments to different fields; in an array, they are multiple assignments to the same field, so the invention assumes by default that all values have the same type. Using the offset information recorded for each value, the invention can randomly access the value of any field.
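The three parts listed above can be illustrated with a small Python sketch of an object in row-type format. The concrete layout (a header of two 32-bit integers, followed by (field ID, offset) pairs, followed by the raw values) and the function names are assumptions chosen for the example; FIG. 6 defines the actual format.

```python
import struct

HEADER = struct.Struct("<II")  # assumed: total size in bytes, number of elements
ENTRY = struct.Struct("<II")   # assumed: field ID, offset of the value in the buffer

def pack_object(fields):
    """fields: list of (field_id, value_bytes) pairs -> one binary object."""
    offset = HEADER.size + ENTRY.size * len(fields)  # where the value array starts
    entries, values = [], b""
    for field_id, value in fields:
        entries.append(ENTRY.pack(field_id, offset))
        values += value
        offset += len(value)
    return HEADER.pack(offset, len(fields)) + b"".join(entries) + values

def get_field(buf, field_id):
    """Random access: find the field ID, then slice the value by its offsets."""
    total_size, count = HEADER.unpack_from(buf, 0)
    for i in range(count):
        fid, start = ENTRY.unpack_from(buf, HEADER.size + i * ENTRY.size)
        if fid == field_id:
            if i + 1 < count:
                _, end = ENTRY.unpack_from(buf, HEADER.size + (i + 1) * ENTRY.size)
            else:
                end = total_size  # the last value runs to the end of the object
            return buf[start:end]
    return None  # the field is not assigned in this object
```

An array could be packed the same way with the field ID dropped from each entry, since all of its values belong to the same field.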
2.2 Overview of the columnar storage structure
The columnar storage structure is more complex than the row-type structure. The invention defines the following concepts for representing and storing structure information in the columnar structure:
(1) Repetition Level: indicates at which repeated field in the field's path the value has repeated.
(2) Definition Level: the number of fields in the path that could be omitted (optional and repeated fields) and are actually present.
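To make the two levels concrete, the following Python sketch computes them for a single hypothetical path `a.b`, in which both `a` and `b` are repeated fields. The path, the function name and the (value, repetition level, definition level) tuple layout are assumptions for illustration, following the Dremel-style definitions given above.

```python
def stripe_a_b(record):
    """Return (value, repetition level, definition level) items for path a.b.

    Assumes `a` is a repeated field of objects and `b` is a repeated field
    of atomic values inside each `a`.
    """
    a_list = record.get("a", [])
    if not a_list:
        return [(None, 0, 0)]  # neither a nor b is present
    items = []
    for i, a in enumerate(a_list):
        rep_a = 0 if i == 0 else 1      # 0 starts a record, 1 repeats at level a
        b_list = a.get("b", [])
        if not b_list:
            items.append((None, rep_a, 1))  # a is present, b is not
            continue
        for j, b in enumerate(b_list):
            rep = rep_a if j == 0 else 2    # 2 repeats at level b
            items.append((b, rep, 2))       # both a and b are present
    return items
```

For the record `{"a": [{"b": [1, 2]}, {"b": [3]}]}`, the value 2 repeats at the level of `b` and the value 3 repeats at the level of `a`, which is exactly the information needed later to rebuild the nesting.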
How this information is used to convert column-structured data into row-type data is described in the next section; here the invention describes only the storage structure of the column-structured data.
CAB (Column Align Block) is the basic unit of columnar storage in the invention. During parsing, each value produces a Column Item that is stored in a CAB. Because semi-structured data contains many repeated fields, a repeated field may cause multiple Column Items to be inserted into the CAB for a single record. To improve the efficiency of storage and query, CABs are aligned by record ID: every CAB stores the same number of records, regardless of the number of Column Items.
The structure of a CAB is shown in FIG. 7. It mainly comprises the following four parts:
(1) Header information: describes the CAB, including its size and the number of records stored.
(2) Repetition Array: records the repetition value of each Column Item. Since the maximum repetition value is bounded by the maximum nesting depth of the field, the invention stores it in a few bits rather than in a full integer. By analyzing the data content, the invention derives the following templates to summarize and optimize the possible patterns, as shown in FIG. 8.
a) No repeat: no repeatable field exists at any nesting level, so the field has at most one value per record and no repeated assignment. If a field has no value in a record, STEED inserts null during parsing so that the records are naturally aligned. Each record therefore has exactly one Column Item in the corresponding column data file, and the invention omits this array from the storage structure.
b) Single repeat: repetition occurs at only one nesting level, i.e. exactly one of the nesting levels can be repeated. The invention only needs to mark the first Column Item of each record (the Record Boundary), so 1 bit is enough to distinguish the first Column Item of a record from a repetition at the only repeatable nesting level.
c) Multi repeat: repetition occurs at multiple nesting levels. If several fields on the path from the root to the leaf node can be repeated, the invention needs multiple bits to indicate which field is repeated.
From its analysis of the data, the invention finds that most fields are repeated at no more than one level. The columnar storage structure of the invention improves the efficiency of storage and operation by using these three templates.
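The single-repeat template can be sketched in Python as follows: one boundary bit per Column Item marks the start of each record, and a null is inserted for records without a value so that records stay aligned. The function names and the use of Python lists in place of packed bit arrays are simplifications assumed for the example.

```python
def encode_single_repeat(records):
    """records: one list of values per record, for a field repeated at one level."""
    values, boundary_bits = [], []
    for record_values in records:
        items = record_values if record_values else [None]  # null keeps alignment
        for i, value in enumerate(items):
            values.append(value)
            boundary_bits.append(1 if i == 0 else 0)  # 1 marks a Record Boundary
    return values, boundary_bits

def decode_single_repeat(values, boundary_bits):
    """Rebuild the per-record value lists from the boundary bits."""
    records = []
    for value, bit in zip(values, boundary_bits):
        if bit:
            records.append([])
        if value is not None:
            records[-1].append(value)
    return records

def bits_per_item(repeatable_levels):
    """Bits needed to store repetition values 0..repeatable_levels."""
    return repeatable_levels.bit_length()
```

`bits_per_item` reflects the three templates: zero bits when nothing repeats, one bit for a single repeatable level, and several bits when more than one level can repeat.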
(3) Value Area: records the values of all Column Items. The invention uses different storage strategies for fixed-length and variable-length data:
a) Fixed-length data types: every value has the same length, so the invention records the space occupied by each value in the Header. Reading the next value only requires advancing by this fixed length, and no additional offset array is needed.
b) Variable-length data types: the values differ in length, so the invention must record the storage location of each value in an offset array. For fields whose values are largely repeated (e.g. the user's language), only one copy of each distinct variable-length value is stored, and different Column Items reuse the offset of that value, which improves storage efficiency.
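The two strategies can be sketched in Python as follows. The 32-bit length prefix in front of each variable-length value, the default `"<i"` format and the function names are assumptions made for the example, not the layout defined by the invention.

```python
import struct

def build_fixed_area(values, fmt="<i"):
    """Fixed-length values: only the width is recorded, no offset array."""
    width = struct.calcsize(fmt)
    return width, b"".join(struct.pack(fmt, v) for v in values)

def read_fixed(area, width, index, fmt="<i"):
    """The position of the i-th value is simply i * width."""
    return struct.unpack_from(fmt, area, index * width)[0]

def build_variable_area(values):
    """Variable-length values: one offset per Column Item, duplicates shared."""
    area, offsets, seen = bytearray(), [], {}
    for value in values:
        data = value.encode("utf-8")
        if data not in seen:  # store each distinct value only once
            seen[data] = len(area)
            area += struct.pack("<I", len(data)) + data
        offsets.append(seen[data])  # repeated values reuse the same offset
    return bytes(area), offsets

def read_variable(area, offsets, index):
    start = offsets[index]
    (length,) = struct.unpack_from("<I", area, start)
    return area[start + 4:start + 4 + length].decode("utf-8")
```

For a language column such as `["en", "zh", "en", "en"]`, the value area holds only two strings, while the offset array still has one entry per Column Item.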
2.3 Row-type and column-type format conversion algorithms
This section describes the algorithms with which the storage module converts between row-type and column-type files. For row-type data, each semi-structured data set produces one row-type data file after parsing, which stores all field values and the related structure information of all records. For column-type data, on the other hand, each text data set produces several column-type storage files: each field produces one column-type storage file, which stores all values of that field across all records. The storage module therefore needs to implement row-column format conversion of the stored data at run time, and must also support outputting data directly in JSON text format.
2.3.1 Row-to-column data parsing
The process of converting row-structured data into column-structured data is similar to the process of parsing text data and is not described in detail here. Because the already-parsed row-type object and array storage structures are used, no character matching on the structure of text data is needed during this process, and no conversion from text format to binary format is needed either. The row-to-column conversion is therefore markedly more efficient than parsing text.
2.3.2 Column-to-row data assembly
The column-type data files are assembled into a file in row-type format according to certain rules, thereby converting column-structured data into row-structured data. STEED completes the assembly of columnar files with an algorithm similar to the assembly algorithm of Google Dremel. The specific algorithm is as follows:
During assembly, STEED reads Column Items from the Column Readers in the order determined by the finite state machine and by the repetition value in each Column Item, and then determines and outputs the corresponding nesting-level information according to the definition value. When the last Column Reader has read the last Column Item of the record, all Column Readers have been traversed and the Assembler has finished assembling the record. The Assembler keeps running until all records have been assembled, at which point every Column Reader should have reached end of file (EOF).
In the following algorithm, besides the main assembly procedure (assembly Recd), the two functions move and return each use the definition value to determine the structure information of the nesting levels of the data.
In the following pseudo code, the invention traverses the schema tree depth-first from the root node, orders the column files by the order in which their leaf nodes appear, and reads the Column Items of each column file in that order. The content of each Column Item read controls the nesting-structure information of the data: the invention first outputs the information representing the nested structure in the current column, then outputs the value to the row structure being assembled, then determines the next column file to jump to and read, and finally pre-reads a Column Item from that next column to determine the nesting level to return to, outputting the related structure information. When every column file has been read at least once, the assembly of one record is complete. This process is repeated until all column files have been read to the end, which completes the assembly of the whole data set.
Figure BDA0001253161390000221
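As a much-reduced illustration of the assembly idea, the following Python sketch rebuilds records from just two columns: a non-repeated column `id` with one item per record, and a column for a hypothetical path `a.b` in which both `a` and `b` are repeated. The column names, the (value, repetition, definition) tuple layout and the fixed reading order are assumptions for this example; the general algorithm drives the reading order with a finite state machine over all columns.

```python
def assemble(id_column, ab_column):
    """Assemble row records from an `id` column and an `a.b` column.

    id_column: one value (or None) per record.
    ab_column: (value, repetition level, definition level) items, where
               repetition 0 marks the first item of a record.
    """
    records = []
    pos = 0
    for record_id in id_column:
        record = {} if record_id is None else {"id": record_id}
        first = True
        # Pre-read the next item: repetition 0 means the next record begins.
        while pos < len(ab_column) and (first or ab_column[pos][1] != 0):
            value, rep, definition = ab_column[pos]
            if definition >= 1:
                if rep <= 1:  # a new element of the repeated field a starts
                    record.setdefault("a", []).append({})
                if definition == 2:  # b is present: append its value
                    record["a"][-1].setdefault("b", []).append(value)
            pos += 1
            first = False
        records.append(record)
    return records
```

The definition value decides how much of the nesting is emitted (nothing, an empty `a` element, or a value of `b`), while the repetition value decides whether a new `a` element or a new record begins.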
Part 3 Query analysis module
Based on the row-type and column-type structured data, STEED can perform SQL-like query analysis. However, compared with the relational data of the traditional table structure, queries on semi-structured data are to some degree ambiguous because of nesting and multi-valued fields. The invention therefore extends the query syntax so that this ambiguity can be eliminated to a certain extent. The invention also implements some basic SQL operations, such as project, filter, group by and sort. This part first introduces the extended semantics for semi-structured data, and then introduces in turn the specific algorithms of the semi-structured data operations implemented in the system.
3.1 semantic extension of SQL against semi-structured data
Conventional relational data stores flat data in a table structure: all values are on the same level, with no nested substructure; each field has exactly one value that can be assigned to it; and tables are split at design time so that they do not contain large numbers of sparse fields. None of these properties holds for semi-structured data. To support operations on semi-structured data, the invention defines the following new operators:
(1) ".": separates the nesting levels of fields in a path expression;
(2) "any": represents any one of the values in a repeated field;
(3) "all": represents all values in a repeated field.
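The effect of these operators on a predicate can be sketched in Python as follows. The function names are hypothetical, and the choice that a record without any value for the path never matches is an assumption of this example, not something the text specifies.

```python
def path_values(record, path):
    """Collect every value reachable through a dotted path, flattening arrays."""
    nodes = [record]
    for name in path.split("."):  # "." separates the nesting levels
        next_nodes = []
        for node in nodes:
            if isinstance(node, dict) and name in node:
                child = node[name]
                next_nodes.extend(child if isinstance(child, list) else [child])
        nodes = next_nodes
    return nodes

def matches(record, path, quantifier, predicate):
    """Evaluate a predicate over a repeated field with "any" or "all"."""
    values = path_values(record, path)
    if not values:
        return False  # assumption: no value for the path means no match
    test = any if quantifier == "any" else all
    return test(predicate(value) for value in values)
```

For a record whose `user.langs` field holds `["en", "fr"]`, the predicate "equals en" holds under "any" but not under "all", which is the ambiguity the two operators are meant to resolve.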
The invention offers several options for the output result:
(1) data in JSON format;
(2) JSON-like data in which the nested structure is ignored.
3.2 STEED-supported operation types
As shown in FIG. 9, STEED supports multiple types of operations on row-type and column-type data. Data flows between operators in a pull-based manner, until the output operator at the top completes the conversion from binary-format to text-format data. The implementation details of each operator are described next.
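The pull-based flow between operators can be illustrated with Python generators, where each operator pulls rows from its child only when its own parent asks for the next row. The operator names and the use of dictionaries as rows are assumptions made for this sketch; they are not the operator classes of STEED.

```python
import json

def row_from(rows):
    """Leaf operator: yields one row at a time from the source."""
    for row in rows:
        yield row

def filter_op(child, predicate):
    """Pulls rows from its child and passes on those that satisfy the predicate."""
    for row in child:
        if predicate(row):
            yield row

def project_op(child, fields):
    """Keeps only the fields that the query needs."""
    for row in child:
        yield {name: row[name] for name in fields if name in row}

def output_op(child):
    """Top operator: pulls everything and converts rows to text (JSON)."""
    return [json.dumps(row, sort_keys=True) for row in child]
```

A plan is built by nesting the operators, for example `output_op(project_op(filter_op(row_from(rows), pred), ["id"]))`; no row is read until `output_op` starts pulling.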
3.2.1 Row From Operator (row data read operation)
STEED reads whole records from the row-type data file. Since the row-type data file stores data record by record, each record is stored in the format of a Row Object. To read records, the invention therefore reads one Row Object from the row-type binary data file at a time, proceeding sequentially until end of file (EOF) is reached.
3.2.2 Schema Filter (Where or Having Clause) Operator (filtering operation for the Where and Having clauses, based on the Schema definition)
In this Operator, the invention performs the filter operation on row data. STEED can use this operation to evaluate the conditions of the where clause after reading row data; it can also use it to filter the aggregation results after the group by clause has generated new row data.
For the filter operation, the invention defines RowCondition (row condition class), which determines whether the relevant fields of a record satisfy the condition of each predicate. The determination proceeds as follows:
The invention first parses the where clause and instantiates each predicate as an object that can be compared against data. The object reads the data from the row data structure, compares the values read, and determines the truth value of each predicate, which in turn determines whether the record satisfies the condition of the operator.
3.2.3 Project Operator (projection operation)
In row-structured data, the invention stores all fields of each record, but most query statements need the values of only some fields. Without further measures, a large amount of data from fields irrelevant to the query would be copied between operators throughout the query, and these extra memory copies would reduce query efficiency. The invention therefore implements the project operation on row-structured data to extract the fields relevant to the query, so that only those fields are copied, which improves query efficiency.
During the operation, the invention uses a recursive function to handle the nested structures in semi-structured data. For each field, the invention reads its assignment in the original data and, after parsing it, writes only the fields relevant to the query into the result of the operation. A large number of irrelevant fields in the row data can thus be ignored, which improves the efficiency of the query. For a multi-valued field that is repeated at a leaf node, STEED only needs to copy directly the values stored contiguously in the array of that field; if it is repeated at a non-leaf node, the substructure of each array element is recursed into and parsed separately. Note that when extracting a subtree, the invention retains only assigned subtrees: if none of the relevant fields in a subtree is assigned, the subtree is not retained in the result of the project operation.
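The recursive projection can be sketched in Python as follows. The representation of the wanted fields as a nested dictionary (an empty dictionary marking a leaf of the query) and the function name are assumptions made for the example; the sketch works on decoded Python values rather than on the binary row structure.

```python
EMPTY = ({}, [])

def project(value, wanted):
    """Keep only the parts of `value` named in the nested dict `wanted`."""
    if not wanted:
        return value  # leaf of the query: copy the value (or whole array) as is
    if isinstance(value, list):  # repeated at a non-leaf node: recurse per element
        kept = [project(item, wanted) for item in value]
        return [item for item in kept if item not in EMPTY]
    if isinstance(value, dict):
        result = {}
        for name, sub_wanted in wanted.items():
            if name in value:
                child = project(value[name], sub_wanted)
                if child not in EMPTY:  # retain only assigned subtrees
                    result[name] = child
        return result
    return {}  # an atomic value cannot contain the requested sub-fields
```

For example, projecting `{"posts": {"title": {}}}` out of a record keeps only those elements of `posts` that actually have a `title`, and drops `posts` altogether when none of them does.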
3.2.4 Assembly Operator (column-to-row data assembly operation)
In this operator, STEED carries out the assembly process that converts the query-related fields from column-structured data into row-structured data. The specific assembly algorithm was described earlier. STEED first parses the SQL statement to be executed with the Query Parser to obtain all fields relevant to the query, and uses these fields to build a finite state machine (FSM) that controls the order in which the column-structured data is read during assembly. The format conversion from column-type to row-type data is then performed according to the assembly algorithm described earlier and is not repeated here.
3.2.5 Column Filter Operator (column-to-row data assembly with filtering)
Compared with the Assembly Operator (assembly of column-type data into row-type data), the Column Filter Operator not only assembles the column-type structure into the row-type structure but can also filter each record during assembly. Because the where clause filters out the records that do not satisfy the condition, query efficiency improves greatly if such records are never assembled in the first place. During a query, the invention therefore reads one CAB at a time, performs the filter operation on it, sets the corresponding bits of a bitmap to record the comparison results, and finally decides from the recorded results whether to assemble each record.
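The idea of filtering before assembly can be sketched in Python as follows. The sketch assumes non-repeated columns with exactly one item per record (null where a field is unassigned), a hypothetical block size standing in for the number of records per CAB, and a Python list of booleans in place of a packed bitmap.

```python
def column_filter(columns, filter_field, predicate, cab_size=4):
    """Filter on one column block by block, then assemble only matching records.

    columns: field name -> list with one value per record (None if unassigned).
    """
    num_records = len(columns[filter_field])
    results = []
    for start in range(0, num_records, cab_size):
        end = min(start + cab_size, num_records)
        # Evaluate the predicate on one block of the filter column only.
        bitmap = [predicate(v) for v in columns[filter_field][start:end]]
        for i, keep in enumerate(bitmap):
            if keep:  # assemble only the records whose bit is set
                results.append({name: col[start + i]
                                for name, col in columns.items()
                                if col[start + i] is not None})
    return results
```

Records whose bit is clear are never materialized as rows, so the other columns are not touched for them at all.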
3.2.6 Join Operator (connection operation)
STEED implements the join operation with a hash join, and currently supports only joins of two tables. When executing the operation, STEED computes a hash value from the value of the join key in each record of one data set and stores the whole record in the hash table. It then traverses the other data set, looking up the location (bucket) in the hash table with the same hash key. The two row-structured records are merged, and the merged record waits to be pulled by the operator in the layer above. At present, STEED does not use a query optimizer of the kind found in relational databases, so it is recommended to put the smaller data set first in the from clause of the query, to obtain higher storage efficiency.
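The build and probe phases described above can be sketched in Python as follows. Python's built-in dictionary hashing stands in for the hash computation, and the function and parameter names are hypothetical.

```python
from collections import defaultdict

def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Two-table hash join: build on the first data set, probe with the second."""
    table = defaultdict(list)
    for row in build_rows:  # build phase: store whole records by join key
        table[row[build_key]].append(row)
    for row in probe_rows:  # probe phase: look up the bucket with the same key
        for match in table.get(row[probe_key], []):
            yield {**match, **row}  # merged record, pulled by the caller
```

Because the whole first data set is held in the hash table, passing the smaller data set as `build_rows` keeps the table small, which matches the recommendation above.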
3.2.7 Group Operator (grouping operation)
Among the operations currently supported by the query module, Group is the most complex. The invention introduces some newly defined classes used in the operation and explains the corresponding execution process. Similar to the join operator, the group operator uses a hash table to store the values of the group key. During the operation, data is read from the row-structured data, and its hash key is computed and added to the hash table. Then, if an aggregation operation is required, the content stored under the hash value is updated accordingly. The data storage structure of the hash value is shown in FIG. 10:
The invention first defines HashValueItemContainer to hold each storage unit (bucket) of the hash table; the value stored in the hash table is the address of a HashValueItem. Each such object has the structure shown in FIG. 10:
(1) The intermediate layer saves the address at which the record is stored and the content of each aggregate to be computed.
(2) The Block Buffer object stores the actual content of the saved record. Note that in these records only the fields being grouped on hold values; the fields that represent aggregation results are left unassigned. After the whole group operation is complete, the invention writes the aggregation results into the corresponding positions and waits for the results to be pulled by the operator in the layer above.
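The grouping process can be sketched in Python as follows, with a dictionary playing the role of the hash table. Each entry keeps the stored record (grouping fields only) next to its running aggregates, and the aggregate fields are written into the record only at the end. The function name, the choice of count and sum as aggregates, and the output field names are assumptions for the example.

```python
def group_by(rows, key_fields, sum_field):
    """Group rows by key_fields and compute count and sum(sum_field) per group."""
    table = {}  # group key -> [stored record, running count, running sum]
    for row in rows:
        key = tuple(row.get(f) for f in key_fields)
        if key not in table:
            # Store only the grouping fields; aggregate fields are filled in later.
            table[key] = [{f: row.get(f) for f in key_fields}, 0, 0]
        entry = table[key]
        entry[1] += 1
        entry[2] += row.get(sum_field, 0)
    results = []
    for record, count, total in table.values():
        record["count"] = count  # write the aggregation results at the end
        record["sum_" + sum_field] = total
        results.append(record)
    return results
```

Keeping the running aggregates outside the stored record mirrors the split between the intermediate layer and the Block Buffer described above.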
3.2.8 Order Operator (sorting operation)
For order by sort operation, the invention needs to store all records into the buffer cache, and then compare and sort the records. In consideration of the problem of memory space allocation efficiency, the invention only applies for the memory with fixed size to the operating system every time, thus saving the cost of memory copy in the process of realloc reallocating the memory. Meanwhile, in order to avoid the cost of copying data for multiple times in the sorting process, the invention uses an array to record the starting address of each record in the comparison process and changes the position of a pointer in the array in the sorting process. And finally, when the number group is sequentially accessed, the accessed records are the results meeting the sorting requirement.
Furthermore, according to the sort conditions, the invention defines a comparator for comparing records, which operates as follows:
(1) The comparator can read the values of all fields from the row-wise storage structure for the compare operation.
(2) To improve comparison efficiency, the invention implements the comparison and output processes as follows:
a) In each record, 8 bytes are reserved to store the first 8 bytes of the first field to be compared. For all numeric types, this space is sufficient to store the whole value, so no complex evaluation of the row-structured data is needed; for a string, comparing the first 8 bytes will in most cases already decide the comparison. So during comparison the invention first compares the cached 8 bytes. When the data type is a string and the prefixes are equal, the comparison proceeds to the next step.
b) The comparator takes the values of the fields to be sorted, in the order in which those fields are specified, and compares them in turn until a comparison result is obtained.
c) For the sorting itself, the invention uses the STL sort function (std::sort) together with this comparison function.
d) During comparison, the records themselves are not copied; only the pointer array that gives the output order of the records is modified, which avoids repeated memory copies. When an upper-layer operator pulls data, the invention provides only the corresponding pointer, which improves data processing efficiency.
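The following Python sketch illustrates the idea of comparing a cached 8-byte prefix first and reordering only an index array. It is a simplified illustration under assumptions: it handles a single string field, uses Python's list sort in place of std::sort, and the names PREFIX_LEN, make_prefix and sorted_order are hypothetical.

```python
from functools import cmp_to_key

PREFIX_LEN = 8  # number of bytes cached per record for the first sort field


def make_prefix(value: str) -> bytes:
    """Return the first PREFIX_LEN bytes of the UTF-8 encoded value."""
    return value.encode("utf-8")[:PREFIX_LEN]


def sorted_order(records, field):
    """Return record indices in sort order; the records themselves never move."""
    prefixes = [make_prefix(r[field]) for r in records]

    def compare(i, j):
        # Fast path: the cached prefixes usually decide the comparison.
        if prefixes[i] != prefixes[j]:
            return -1 if prefixes[i] < prefixes[j] else 1
        # Prefixes are equal: fall back to comparing the full values.
        a = records[i][field].encode("utf-8")
        b = records[j][field].encode("utf-8")
        return (a > b) - (a < b)

    order = list(range(len(records)))    # plays the role of the pointer array
    order.sort(key=cmp_to_key(compare))  # only indices are rearranged
    return order
```

Reading `records[i]` for each `i` in the returned list visits the records in sorted order without any record having been copied.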
Part 4 method and system for optimizing tree structure data by using simple path characteristics
In this section, based on data from a number of existing data sources, the invention summarizes and generalizes the concept of a simple path, and uses this characteristic for query optimization in STEED.
4.1 definition of simple Path
After analyzing data from a variety of sources, we found that in the syntax tree of each data set a large number of paths from the root to a leaf node contain at most one repeated field. The invention can use this structural characteristic of the data to optimize the query process and improve query efficiency. The invention therefore defines a simple path as follows: in the syntax tree of a data set, a path from the root to a leaf node on which at most one repeated field (a node in the syntax tree) exists is called a simple path. In STEED, simple paths can be used to optimize both the storage and the querying of tree-structured data.
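A minimal sketch of this definition, assuming each schema node records whether it is a repeated field and keeps a link to its parent (SchemaNode and is_simple_path are hypothetical names, not STEED's classes):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SchemaNode:
    """Hypothetical schema-tree node used only for this illustration."""
    name: str
    repeated: bool = False
    parent: Optional["SchemaNode"] = None


def is_simple_path(leaf: SchemaNode) -> bool:
    """True if the root-to-leaf path contains at most one repeated field."""
    repeated_count = 0
    node = leaf
    while node is not None:
        if node.repeated:
            repeated_count += 1
        node = node.parent
    return repeated_count <= 1
```

For example, a path doc → links (repeated) → url is simple, whereas doc → links (repeated) → tags (repeated) is not.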
4.2 Structure of semi-structured data row storage
As described earlier, STEED uses a relatively complex row-wise storage structure in order to express the hierarchical information in tree-structured data accurately. Through analysis, the invention considers that this structure cannot be improved further in terms of expressiveness. However, the analysis of simple paths shows that the invention can represent data more efficiently by simplifying the structural information stored with the data, which further improves the efficiency of parsing and querying. The improved row-wise storage structure contemplated by the invention is shown in fig. 11: for data on a simple path, STEED stores only the structure information of the leaf node (field) to refer to the corresponding path, instead of the original nested storage structure. With this optimization, STEED can use the leaf-node information in the data to obtain the information of all nodes on the whole path from the syntax tree (Schema Tree) in the system. STEED thus improves the representation efficiency of row data and the execution efficiency of queries by simplifying the structure information stored in the data.
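To illustrate how a leaf-node identifier alone can stand for a whole path, the sketch below recovers the path from a small schema table. The table layout (node ID mapped to a name and a parent ID), the example field names, and the function path_of are assumptions made for this example.

```python
# Hypothetical schema tree: node ID -> (field name, parent ID); None marks the root.
SCHEMA_TREE = {
    0: ("doc", None),
    1: ("links", 0),
    2: ("url", 1),
    3: ("title", 0),
}


def path_of(leaf_id, schema=SCHEMA_TREE):
    """Rebuild the root-to-leaf path from a leaf ID by following parent links."""
    names = []
    node_id = leaf_id
    while node_id is not None:
        name, node_id = schema[node_id]
        names.append(name)
    return ".".join(reversed(names))
```

Because the path can be rebuilt on demand, the stored row only needs the leaf ID and the value.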
4.3 Flatten Assembler (flat row-structure assembler)
[Figure BDA0001253161390000281: listing of the Flatten Assembler algorithm; image not reproduced]
STEED spends a significant amount of effort restoring the hierarchy of the data during assembly. As mentioned above, for most fields in the data sets the repetition nesting does not exceed 2 levels, so most values in the data can be optimized by using simple paths. The assembly process for simple-path fields in STEED is much simpler:
the Flatten Assembler (flat row-structure assembler) ignores the hierarchical relationships of the default binary data: only the leaf node is used to represent the path from the root to that leaf node, and all non-leaf nodes on the path are ignored. The invention thereby limits the nesting level of the row-structured data to one layer, which saves memory space and improves query efficiency.
The specific assembly algorithm is as shown above:
Before assembly, the columns to be assembled are sorted by the ID of their leaf nodes. Then, for each record, all column items of that record are read in turn from each Column Reader, and the values read, together with the related structure information, are written in order into the assembled result. Since the assembly result retains only one nesting level, STEED only needs to append the value of each field to the current object during assembly and does not need to consider nesting relationships in the result.
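The assembly loop described here can be sketched as follows. This is a simplified illustration, not the algorithm in the figure: it assumes the column items have already been grouped per record (a real Column Reader would find record boundaries from the repetition values), and the name flatten_assemble and the input layout are hypothetical.

```python
def flatten_assemble(columns, num_records):
    """Assemble flat records from columns.

    columns maps a leaf-node ID to a list with one entry per record,
    each entry being the list of values that record holds for the leaf.
    """
    ordered = sorted(columns.items())  # sort the columns by leaf-node ID
    records = []
    for r in range(num_records):
        record = []                    # one nesting level only: a flat item list
        for leaf_id, per_record_items in ordered:
            for value in per_record_items[r]:
                record.append((leaf_id, value))
        records.append(record)
    return records
```

Each assembled record is a flat list of (leaf ID, value) pairs, so no nesting state has to be tracked while writing.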
4.4 Storage structure of flat row data
In the invention, STEED optimizes the query process by using flat-structured row data for query and storage. For non-simple paths in the syntax tree, the invention continues to use the default tree-structured data representation, since STEED needs to identify multi-valued fields at different nesting levels. For simple paths, the invention stores or assembles the data using the structure in fig. 11:
1) There is no repeated node on the path from the root of the syntax tree to the leaf node: only the ID of the leaf node and the value of the corresponding field need to be stored in the flat data storage structure;
2) There is exactly one repeated field on the path from the root of the syntax tree to the leaf node: the flat data storage structure can be output in either of the following two forms, as detailed in fig. 12:
a) storing each value of the repeated field as a separate item in the flat structure: the data then contains multiple items with the same ID, and the number of items equals the number of repetitions of the field;
b) storing the repeated field as a whole in the flat structure: the ID of the repeated field appears only once in the data, and its value is an array-form structure holding the multiple values.
3) There are multiple repeated nodes on the path from the root of the syntax tree to the leaf node: the flat data storage structure cannot express at which level the values of the several repeatable fields on the path repeat, so the invention continues to use the original default tree-shaped data storage structure. That is, the ID of the leaf node is still used in the flat-structured data, but the corresponding value is an offset that points to the position where the complete nested structure is stored.
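The three cases can be summarized in a short Python sketch. The function flat_items, its parameters, and the ("offset", n) tuple used to mark a pointer to the nested structure are illustrative assumptions; the actual binary layout is the one shown in figs. 11 and 12.

```python
def flat_items(leaf_id, values, repeated_fields_on_path,
               as_array=False, nested_offset=None):
    """Return the flat items stored for one leaf of one record."""
    if repeated_fields_on_path == 0:
        # Case 1: single-valued leaf, store the leaf ID and its value.
        return [(leaf_id, values[0])]
    if repeated_fields_on_path == 1:
        if as_array:
            # Case 2b: one item whose value is an array of all repetitions.
            return [(leaf_id, list(values))]
        # Case 2a: one item per repetition, all carrying the same leaf ID.
        return [(leaf_id, v) for v in values]
    # Case 3: keep the leaf ID, but point at the full nested structure.
    return [(leaf_id, ("offset", nested_offset))]
```

Form 2a keeps every item the same shape, while form 2b keeps one item per field; which is preferable depends on how the query accesses the repeated values.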

Claims (10)

1. A method for storing tree-like data in a row-wise and column-wise manner, comprising:
step 1, storing the tree data in a row-wise manner and a column-wise manner to generate a row-wise storage structure and a column-wise storage structure, wherein the row-wise storage structure comprises structure header information, an array of values, and an ID array and an offset array of assigned nodes;
step 2, parsing data from the row-wise storage structure into the column-wise storage structure;
step 3, assembling data from the column-wise storage structure into the row-wise storage structure through an assembly algorithm based on Google Dremel.
2. A method for row-wise and column-wise storage of tree data according to claim 1, wherein the column-wise storage structure comprises CAB base units; during parsing, each value produces a column data item that is stored in a CAB base unit; the CAB base units are stored aligned by record ID; and each CAB base unit stores the same number of records regardless of the column.
3. A method for row-wise and column-wise storage of tree data according to claim 2, wherein the CAB base unit in the column-wise storage structure comprises:
(1) header information: used to describe information about the CAB base unit, including its size and the number of records stored;
(2) bit array: used to record the repetition value and the definition value of each column data item, wherein several bits, rather than integers, are used to store the repetition value and the definition value, and the repetition value and the definition value are optimized according to the structure information of the path;
(3) value region: used to record the values of all column data items.
4. A method for row-wise and column-wise storage of tree data according to claim 3, wherein the (2) bit array in the CAB base unit of the column-wise storage structure comprises:
a) null is inserted when a field has no value during parsing, so that the records are naturally aligned;
b) single-valued fields: the repetition value is the same for every column data item, so the repetition-value array is omitted;
c) values that repeat at only a single nesting level: the repetition value marks the first column data item of each record, so only 1 bit is needed for the repetition value of each column data item;
d) values that repeat at multiple nesting levels: the repetition value is represented with multiple bits.
5. A method for row-wise and column-wise storage of tree data according to claim 3, wherein the (3) value region in the CAB base unit of the column-wise storage structure further comprises: variable-length and fixed-length data formats are stored using the following strategies:
a) fixed-length data types: the space occupied by each value of the fixed-length data type is recorded in the header information, and the next value is read by advancing by that fixed length each time, without an additional offset array;
b) variable-length data types: the storage position of each value of the variable-length data type is recorded in an offset array; for a field whose values repeat heavily, each distinct variable-length value is stored only once, and storage efficiency is improved by reusing the offset of that value in different column data items.
6. A method for row-wise and column-wise storage of tree data according to claim 1, wherein the header information in the row-wise storage structure is used to record information about the row-wise storage structure; ID array and offset array: for an object, the ID of each field marks that the field has a value, and for an array, every value belongs to the same field, so only the offset information of each value is retained; array of values: the row-wise storage structure stores values that occur repeatedly in the form of an array, wherein the type of the values is an atomic type or a composite type.
7. A method for row-wise and column-wise storage of tree data according to claim 1, wherein the column-wise storage structure represents the structure of the tree-structured data using repetition values and definition values:
(1) repetition value: the level at which the repetition in the data occurs;
(2) definition value: the level at which fields that may be omitted occur in the data.
8. A method for row-wise and column-wise storage of tree data according to claim 1, wherein the data parsing from the row-wise storage structure into the column-wise storage structure comprises:
during parsing, no character matching needs to be performed on the structure of text data, and the already parsed data of the row-wise storage structure is parsed directly into data of the column-wise storage structure.
9. A method for row-wise and column-wise storage of tree data according to claim 1, wherein the data assembly from the column-wise storage structure into the row-wise storage structure comprises:
traversing the syntax tree from the root node, sorting the column-wise storage structures according to the order in which the leaf nodes appear, reading the content of the column data items in each column-wise storage structure in the sorted order, controlling the hierarchy information of data nesting according to the content of the column data items read, outputting the information representing the nesting structure in the current column-wise storage structure, outputting the values to the row-wise storage structure being assembled, then determining the next column-wise storage structure to jump to and read, and finally pre-reading a column data item from that next column-wise storage structure to determine the nesting level to return to and outputting the related structure information; when all the column-wise storage structures have been read at least once, the assembly of one record is complete; the above process is repeated until all the column-wise storage structures have been read completely, at which point the assembly of the whole data set is complete.
10. A system implemented using a method for row-wise and column-wise storage of tree data according to any of claims 1-9.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710179108.5A CN107066551B (en) 2017-03-23 2017-03-23 Row-type and column-type storage method and system for tree-shaped data

Publications (2)

Publication Number Publication Date
CN107066551A CN107066551A (en) 2017-08-18
CN107066551B true CN107066551B (en) 2020-04-03

Family

ID=59618034

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710179108.5A Active CN107066551B (en) 2017-03-23 2017-03-23 Row-type and column-type storage method and system for tree-shaped data

Country Status (1)

Country Link
CN (1) CN107066551B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304499B (en) * 2018-01-15 2021-06-29 贵州易鲸捷信息技术有限公司 Method, terminal and medium for pushing down predicate in SQL connection operation
CN108153911B (en) * 2018-01-24 2022-07-19 广西师范学院 Distributed cloud storage method of data
CN108572925B (en) * 2018-02-26 2022-04-12 湖南戈人自动化科技有限公司 STEP file equivalent binary data storage method
CN110569300A (en) * 2018-05-17 2019-12-13 江苏优瀛科技有限公司 Method and system for realizing data sorting of report forms with tree hierarchical structure
CN110287190A (en) * 2019-06-25 2019-09-27 四川深度在线广告传媒有限公司 A kind of big data analysis custom coding memory structure and coding, coding/decoding method
CN111190896B (en) * 2019-08-16 2023-10-17 腾讯科技(深圳)有限公司 Data processing method, device, storage medium and computer equipment

Citations (6)

Publication number Priority date Publication date Assignee Title
CN101630322A (en) * 2009-08-26 2010-01-20 中国人民解放军信息工程大学 Method for storing and accessing file set under tree directory structure in database
CN102609490A (en) * 2012-01-20 2012-07-25 东华大学 Column-storage-oriented B+ tree index method for DWMS (data warehouse management system)
CN103095819A (en) * 2013-01-04 2013-05-08 微梦创科网络科技(中国)有限公司 Data information pushing method and data information pushing system
CN105488235A (en) * 2016-02-03 2016-04-13 苏州见微物联网科技有限公司 Cloud platform data management system based on industrial big data and construction method thereof
CN106250523A (en) * 2016-08-04 2016-12-21 北京国电通网络技术有限公司 A kind of method of distributed column storage system index
CN106503084A (en) * 2016-10-10 2017-03-15 中国科学院软件研究所 A kind of storage and management method of the unstructured data of facing cloud database

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8312050B2 (en) * 2008-01-16 2012-11-13 International Business Machines Corporation Avoiding database related joins with specialized index structures

Non-Patent Citations (3)

Title
"DB2 with BLU acceleration: So much more than just a column store"; V. Raman et al.; The 39th International Conference on Very Large Data Bases; 20130830; pp. 1-12 *
"Storing and Querying Tree-Structured Records in Dremel"; Foto N. Afrati et al.; The 40th International Conference on Very Large Data Bases; 20140905; pp. 1131-1142 *
"Big Data Analysis and High-Speed Data Updates"; 陈世敏; Journal of Computer Research and Development; 20150215; pp. 333-342 *

Also Published As

Publication number Publication date
CN107066551A (en) 2017-08-18

Similar Documents

Publication Publication Date Title
CN107066551B (en) Row-type and column-type storage method and system for tree-shaped data
CN107092656B (en) A kind of tree data processing method and system
US10846285B2 (en) Materialization for data edge platform
CN107016071B (en) A kind of method and system using simple path characteristic optimization tree data
CN111046630B (en) Syntax tree extraction method of JSON data
Karnitis et al. Migration of relational database to document-oriented database: Structure denormalization and data transformation
US10769124B2 (en) Labeling versioned hierarchical data
US11416473B2 (en) Using path encoding method and relational set operations for search and comparison of hierarchial structures
CN112219199A (en) Efficient use of TRIE data structures in databases
Maneth et al. Grammar-based graph compression
CN108509543B (en) Streaming RDF data multi-keyword parallel search method based on Spark Streaming
US10671586B2 (en) Optimal sort key compression and index rebuilding
US20200334252A1 (en) Clause-wise text-to-sql generation
CN107491476B (en) Data model conversion and query analysis method suitable for various big data management systems
CN110795526B (en) Mathematical formula index creating method and system for retrieval system
CN106557568A (en) The processing method that the XML file format of pattern match is changed with relational database
CN114356971A (en) Data processing method, device and system
CN113094449A (en) Large-scale knowledge map storage scheme based on distributed key value library
CN113157723B (en) SQL access method for Hyperridge Fabric
CN116628066B (en) Data transmission method, device, computer equipment and storage medium
CN110389953B (en) Data storage method, storage medium, storage device and server based on compression map
CN108595588B (en) Scientific data storage association method
Sakr et al. Centralized RDF query processing
RU2605387C2 (en) Method and system for storing graphs data
CN112835920B (en) Distributed SPARQL query optimization method based on hybrid storage mode

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant