WO2018157765A1 - 数据库系统中对象的编码及运算方法与数据库服务器 - Google Patents

数据库系统中对象的编码及运算方法与数据库服务器 Download PDF

Info

Publication number
WO2018157765A1
WO2018157765A1 PCT/CN2018/077164 CN2018077164W WO2018157765A1 WO 2018157765 A1 WO2018157765 A1 WO 2018157765A1 CN 2018077164 W CN2018077164 W CN 2018077164W WO 2018157765 A1 WO2018157765 A1 WO 2018157765A1
Authority
WO
WIPO (PCT)
Prior art keywords
database
global
dictionary table
sample space
query
Prior art date
Application number
PCT/CN2018/077164
Other languages
English (en)
French (fr)
Inventor
刘辉军
孟聪
张朝威
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Priority to EP18760336.0A priority Critical patent/EP3582124B1/en
Publication of WO2018157765A1 publication Critical patent/WO2018157765A1/zh
Priority to US16/559,245 priority patent/US11194806B2/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation
    • G06F16/24534Query rewriting; Transformation
    • G06F16/24542Plan optimisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/221Column-oriented storage; Management thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2282Tablespace storage structures; Management thereof

Definitions

  • the present invention relates to the field of database technology, and more particularly to a method and a database management system and a database server for a database system.
  • Modern information processing is based on an object-based identification (ID)-based cognitive model in which objects in any sample space have an identifier (ID) to distinguish them from other objects in the sample space.
  • ID identifier
  • the object is then described in human language.
  • the database is based on the above cognitive model when processing data. Take a commercial database as an example. It usually uses a dictionary table to encode/decode various strings/texts and then participate in various query operations. Specifically, if the database uses a dictionary table for a column of the source data file, the data of this column constitutes an object sample space. When importing data, the database selects the appropriate encoding algorithm, encodes each object in the column, generates a unique identifier ID, and saves it to the physical file of the data table. In the source data file, the other data provided by this column corresponds to the description information of the column object. In addition, the table has a private dictionary table that stores mapping information for the column object for decoding. For example, the encoded value (object ID) obtained by encoding the "branch name" column in the source data file is as follows:
  • the database first loads the encoded value of the column, and then reversely queries its private dictionary table to obtain the description information of each object of the column, and then performs subsequent query operations according to the description information.
  • each column of the database table uses a private dictionary table in the encoding process, which increases the coding consumption and wastes disk storage space.
  • the encoded information cannot be shared between the multiple tables in contact, and the obtained encoded value (object ID) cannot be used across the table.
  • the invention provides a method for encoding and querying objects in a database, a database management system and a database server, which can use the global dictionary table to encode objects in the database, and directly use the object identifier to participate in various information processing operations, and improve Database performance.
  • a method for encoding an object in a database comprising: determining a global object sample space according to a correlation of a plurality of object sample spaces in the database; and creating a global dictionary table based on the global object sample space,
  • the global dictionary table includes an object identifier (ID) and object description information of each object in the global object sample space, and an object identifier of each object is unique within the global object sample space, and the object ID and the object in the global dictionary table.
  • the description information is a one-to-one mapping relationship
  • the global dictionary table is provided with an encoding interface; the encoding interface that calls the global dictionary table encodes at least one object in the database to obtain an encoded value of the at least one object, the at least one The encoding value of the object is the object ID of the at least one object.
  • the global dictionary table is divided into multiple object sample spaces (such as multiple tables) by means of share encoding. Shared use, so it can provide globally unique coded values for objects in these sample spaces, and the encoded values and object description information are in a one-to-one mapping relationship, thereby avoiding the creation of multiple local dictionary tables, reducing coding consumption and saving. disk space.
  • an object sample space including: tuples, columns, or rows.
  • an object sample space can be a partial data set in a database, such as a table, a column of data in a table, or multiple columns of data in multiple tables. In this way, multiple tables or columns in the database can share the global dictionary table, thereby avoiding the creation of multiple local dictionary tables, reducing coding consumption and saving disk space.
  • determining the global object sample space according to the correlation of the plurality of object sample spaces in the database comprises: determining at least two columns having relevance in the database, the at least two columns constituting the global object sample space .
  • At least two columns of relevance include at least two columns that are simultaneously operated when the query operation is performed.
  • the global dictionary table provides an operation interface, the method further comprising: invoking the operation interface to operate on the global dictionary table, the operation comprising: querying, updating, deleting, partitioning or importing operations At least one of them. This supports dynamic updates to the global dictionary table.
  • the method further comprises: calling the encoding interface of the global dictionary table to encode the object of the database, and storing the encoded data into the data storage.
  • the global dictionary table is one of a data file, an in-memory table, a database table, or an index table.
  • the global dictionary table can be stored in external memory and loaded into memory when used.
  • the second aspect provides a method for query operation based on an object ID, comprising: receiving a Structured Query Language (SQL) statement sent by a client, where the SQL statement is used to request a query operation on at least one object in the database, Processing the SQL statement to generate an execution plan; extracting an stored object identifier (ID) of the at least one object; wherein the object ID of the at least one object is obtained by encoding the at least one object by calling an encoding interface of the global dictionary table And stored in a data store; the global dictionary table includes an object ID and object description information of each object in the global object sample space, and the object identifier of each object is unique within the global object sample space, the global The object ID in the dictionary table is a one-to-one mapping relationship with the description information of the object; the global object sample space is composed of a plurality of object sample spaces having correlations in the database; according to the execution plan, the at least The object ID of an object performs the query operation to generate a query result and the query If returned to the
  • the encoded object identifier is unique and has a one-to-one mapping relationship with the object description information. Then, after loading the database object, it is not necessary to decompress, and the encoding value of each object (ie, the object identifier) can be directly used to participate in query operations such as JOIN, SORT, AGGREGATE, etc., and the operation of the string type is optimized to the operation of the integer type, and the operation is improved. effectiveness.
  • a third aspect provides a database management system, including: a structured query language SQL engine, an execution engine, and a storage engine; wherein the SQL engine is configured to generate a corresponding execution plan according to a SQL statement submitted by a client, The SQL statement is configured to request a query operation on at least one object in the database; the storage engine is configured to store a global dictionary table, where the global dictionary table includes an object ID and an object description information of each object in the global object sample space.
  • An object identifier of each object is unique within the global object sample space, and the object ID in the global dictionary table is in a one-to-one mapping relationship with the description information of the object;
  • the global object sample space is related by the database Constructing a plurality of object sample spaces;
  • the execution engine is configured to extract an object ID of the at least one object stored by the storage engine, and performing the using the object ID of the at least one object directly according to the execution plan Query operations to generate query results and return the query results to the client.
  • the storage engine is also used to create the global dictionary table.
  • another database management system having the above method implemented.
  • the database management system can be implemented by hardware or by corresponding software implementation by hardware.
  • the hardware or software includes one or more functional modules that implement the above methods.
  • the database management system may comprise means for performing the method of any of the first or second aspects of the second aspect.
  • a database server comprising: at least one processor, a non-transitory computer readable medium storing executable code; the executable code, when executed by the at least one processor, configured to perform the method.
  • a database system comprising: a client and any of the above database management systems.
  • a cluster database system comprising: a hardware layer and a virtual machine monitor (VMM) running on the hardware layer, and a plurality of virtual machines.
  • the virtual machine runs an executable program based on the VMM and the hardware resources provided by the hardware layer to implement some or all of the functions of the above database management system.
  • VMM virtual machine monitor
  • a non-transitory computer readable medium storing an executable program, the executable program comprising a program for implementing the functions of the database management system described above.
  • the database is stored in a data store, which is a hard disk, a disk, a disk array, or a storage server.
  • the same type of objects in the database constitute an object sample space, and the objects include: a tuple, a column, or a row.
  • the plurality of object sample spaces having correlations includes at least two columns having correlations in the database, and the at least two columns constitute the global object sample space.
  • the at least two columns having a correlation include at least two columns that are simultaneously operated when the query operation is performed.
  • the query operation includes at least one of a JOIN, a sort SORT, or an aggregate AGGREGATE.
  • the global dictionary table is one of a data file, an in-memory table, a database table, or an index table.
  • an object in a database is encoded by using a global dictionary table, and the global dictionary table can be used by multiple tables in a share encoding manner, and the encoded object identifier is unique, and The object description information is a one-to-one mapping relationship. Further, after loading the database object, it is not necessary to decompress, and the encoding value (ie, object identifier) of each object can be directly used to participate in query operations such as JOIN, SORT, and AGGREGATE, and the operation of the string type is optimized to an integer type operation, thereby improving the operation. s efficiency.
  • FIGS. 1A-1C are schematic diagrams showing a database system provided by an embodiment of the present invention.
  • FIG. 2 is a schematic structural diagram of a database server according to an embodiment of the present invention.
  • FIG. 3 is a schematic diagram of encoding and compressing objects in a database using a global dictionary table according to an embodiment of the present invention.
  • Figure 4 shows a schematic diagram of a global dictionary table of an embodiment of the invention.
  • FIG. 5 is a schematic diagram showing a process of encoding and compressing a specified column based on a global dictionary table according to an embodiment of the present invention.
  • FIG. 6 is a flowchart of a method for performing a database query operation using an object ID according to an embodiment of the present invention.
  • FIG. 7 and 8 are schematic diagrams showing a process of performing an equivalent JOIN query operation according to an embodiment of the present invention.
  • FIG. 9 is a schematic diagram showing a process of executing an equivalent JOIN query operation by an execution engine in the prior art.
  • FIGS 10 and 11 are schematic diagrams showing the process of processing an equivalent expression according to an embodiment of the present invention.
  • FIG. 12 is a schematic diagram showing a process of executing an engine equivalent expression in the prior art.
  • FIG. 13 and 14 are diagrams showing a process of performing a packet aggregation operation according to an embodiment of the present invention.
  • FIG. 15 is a schematic diagram showing a process in which an execution engine performs a packet aggregation operation in the prior art.
  • FIG. 16 is a schematic block diagram of another database server according to an embodiment of the present invention.
  • FIG. 17 is a schematic diagram of a cluster database system according to an embodiment of the present invention.
  • FIG. 18 is a schematic diagram of a database system according to an embodiment of the present invention.
  • the technical solutions in the embodiments of the present invention are clearly and completely described in the following with reference to the accompanying drawings in the embodiments of the present invention. It is obvious that the described embodiments are a part of the embodiments of the present invention, but not all embodiments.
  • the method, the database management system and the database server provided by the embodiments of the present invention can be applied to a stand-alone database system or a cluster database system.
  • the Database System is an ideal data processing system developed to meet the needs of data processing.
  • the database system generally consists of the following three parts: (1) database (database, DB), refers to a collection of organized, shareable data that is stored in the computer for a long time.
  • the data in the database is organized, described and stored according to a certain mathematical model, with less redundancy, higher data independence and scalability, and can be shared by various users.
  • Hardware including data storage required to store data, such as memory and/or disk.
  • Software including DBMS (database management system), DBMS is the core software of the database system, is the system software for scientifically organizing and storing data, and efficiently acquiring and maintaining data.
  • the database engine is DBMS. core content.
  • FIG. 1A is a schematic diagram of a stand-alone database system, including a database management system and a data storage (Data Store), which is used to provide services such as querying and modifying a database, and the database management system will The data is stored in the data memory.
  • the database management system and data storage are usually located on a single server, such as an SMP (Symmetric Multi-Processor) server.
  • the SMP server includes multiple processors, all of which share resources such as bus, memory, and I/O systems.
  • the functionality of the database management system can be implemented by one or more processors executing programs in memory.
  • FIG. 1B is a schematic diagram of a cluster database system adopting a shared-storage architecture, including multiple nodes (such as nodes 1-N in FIG. 1B), and each node is deployed with a database management system to provide a database query for the user. And modifying services, multiple database management systems store shared data in the shared data store, and perform read and write operations on the data in the data store through the switch.
  • the shared data storage can be a shared disk array.
  • a node in a clustered database system can be a physical machine, such as a database server, or a virtual machine running on an abstract hardware resource. If the node is a physical machine, the switch is a Storage Area Network (SAN) switch, an Ethernet switch, a fiber switch, or other physical switching device. If the node is a virtual machine, the switch is a virtual switch.
  • SAN Storage Area Network
  • FIG. 1C is a schematic diagram of a cluster database system adopting a shared-nothing architecture, each node has its own unique hardware resources (such as data storage), an operating system, and a database, and nodes communicate through a network. Under this system, the data will be distributed to each node according to the database model and application characteristics. The query task will be divided into several parts, executed in parallel on all nodes, and coordinated with each other to provide database services as a whole. All communication functions are in the same way. Implemented on a high-bandwidth network interconnection system. Like the clustered database system of the Shared-storage architecture described in Figure 1B, the nodes here can be either physical or virtual machines.
  • the data store of the database system includes, but is not limited to, a solid state drive (SSD), a disk array, or other type of non-transitory computer readable medium.
  • SSD solid state drive
  • the database is not shown in Figures 1A-1C, it should be understood that the database is stored in a data store.
  • a database system may include fewer or more components than those shown in Figures 1A-1C, or include components other than those shown in Figures 1A-1C, Figures 1A-1C only Components that are more relevant to the implementations disclosed by embodiments of the present invention are shown.
  • a cluster database system can include any number of nodes.
  • the database management system functions of each node may be implemented by appropriate combinations of software, hardware, and/or firmware running on each node, respectively.
  • an embodiment of the present invention provides a database server 100, including: at least one processor 104, a non-transitory computer-readable medium 106 and database management for storing executable code.
  • System 108 The executable code, when executed by at least one processor 104, is configured to implement the components and functions of database management system 108.
  • the non-transitory computer readable medium 106 can include one or more non-volatile memories including, by way of example, a semiconductor memory device, such as an EPROM (Erasable Programmable Read Only Memory).
  • the non-transitory computer readable medium 106 can also include any device that is configured as a main memory.
  • the at least one processor 104 can include any type of general purpose computing circuit or special purpose logic circuit, such as an FPGA (Field Programmable Gate Array) or an ASIC (Application Specific Integrated Circuit).
  • the at least one processor 104 can also be one or more processors, such as a CPU, coupled to one or more semiconductor substrates.
  • the database management system 108 may be an RDBMS (Relational Database Management System).
  • the database management system 108 supports SQL (Structured Query Language).
  • SQL refers to a specialized programming language that is dedicated to managing data stored in relational databases.
  • SQL can refer to various types of data-related languages, including, for example, data definition languages and data manipulation languages, where SQL can include data insertion, query, update and delete, schema creation and modification, and data access control.
  • SQL can include descriptions related to various language elements, including clauses, expressions, predicates, queries, and statements.
  • a clause can refer to various components of a statement and a query, and in some cases, a clause can be considered optional.
  • the expression can be configured to generate scalar values and/or tables including data columns and/or rows.
  • predicates can be configured to specify conditions for adjusting the effects of statements and queries.
  • a query is a request to view, access, and/or manipulate data stored in a database.
  • Database management system 108 can receive queries in SQL format (referred to as SQL queries) from database client 102.
  • SQL queries queries in SQL format
  • the database management system 108 generates query results corresponding to the query by accessing relevant data from the database and manipulating the relevant data, and returns the query results to the database client 102.
  • a database is a collection of data organized, described, and stored in a mathematical model that can include one or more database structures or formats, such as row storage and column storage.
  • the database is typically stored in a data store, such as external data store 120 in FIG. 2, or non-transitory computer readable medium 106.
  • the database management system 108 is an in-memory database management system.
  • Database client 102 can include any type of device or application that is configured to interact with database management system 108.
  • database client 102 includes one or more application servers.
  • Database management system 108 includes SQL engine 110, execution engine 122, and storage engine 134.
  • the SQL engine 110 generates a corresponding execution plan according to a SQL statement submitted by the client 102, such as a query (Query), and the execution engine 122 operates in accordance with the execution plan of the statement to generate a query result.
  • the storage engine 134 is responsible for managing the data of the table and the actual content of the index on the file system, and also managing the data such as Cache, Buffer, transaction, and Log at runtime. For example, storage engine 134 can write execution results of execution engine 122 to data store 120 via physical I/O.
  • the SQL engine 110 includes a parser 112 and an optimizer 114, wherein the parser 110 is configured to perform syntax and semantic analysis of SQL statements, and to expand and divide the views in the query into small query blocks.
  • the optimizer 114 generates a set of execution plans that may be used for the statement, estimates the cost of each execution plan, compares the cost of the plan, and ultimately selects a plan with the least cost.
  • an embodiment of the present invention provides a method for encoding an object in a database by using a global dictionary table, including:
  • S310 Form a global object sample space according to the correlation of the plurality of object sample spaces.
  • an object sample space can be a partial data set in a database, such as a column of data in a table.
  • the embodiment of the present invention determines a column or row having relevance from the database to form a global object space.
  • a basic unit of operation in the database is an object, where the basic units include: tuples, columns/rows, tables. For example, one tuple in each column or each row is an object, or one column is an object.
  • a column with relevance refers to a column with an associated relationship that the database operates at the same time as the query operation. These columns may come from the same table or from different tables. Take the most common JOIN operation in the database as an example. The two columns selected when the database performs a JOIN operation are clearly relevant.
  • a row with relevance refers to a row that will be operated simultaneously during a query operation.
  • S320 Creating a global dictionary table in the global object sample space: (1) providing a globally unique code value for each object as its object identifier (ID); (2) ensuring that the object ID and the object description information are one-to-one mapping relationship
  • ID object identifier
  • object description information are one-to-one mapping relationship
  • the global dictionary table provides a globally unique encoded value to identify an object and is used by multiple information objects in a shared encoding manner.
  • the data inside the global dictionary table is mainly divided into two major sets: ⁇ object code value ⁇ and ⁇ object description information ⁇ . These two parts need to meet the one-to-one mapping constraint shown in Figure 4.
  • a specific implementation of the global dictionary table will be described in detail in the following embodiments.
  • S330 Calling an encoding interface of the global dictionary table to encode the specified object to obtain an encoded value (object ID) of the specified object.
  • object ID an encoded value of the specified object.
  • the specified object here can be a column or columns of data in the database.
  • the foregoing method further includes:
  • the global dictionary table provides the following two external interfaces:
  • Decoding interface used to input integer type encoded values and output object description information.
  • the database management system encodes/decodes one or more specified objects by calling the above encoding/decoding interface.
  • the global dictionary table further provides an operation interface, and correspondingly, the foregoing method further includes:
  • the operation interface of the global dictionary table is invoked to operate on the global dictionary table, including but not limited to one or more of operations such as query, update, delete, partition, or import.
  • the global dictionary table is shared by multiple tables in the form of share encoding, it can provide globally unique code values for the column data of these tables, and the column data ID and column description information are one-to-one mapping. The relationship, thus avoiding the creation of multiple local dictionary tables, reducing the coding consumption and saving disk space.
  • join_col_1/join_col_2 The global object sample space is formed for join_col_1/join_col_2 as follows:
  • global_dic_table For Use global dictionary table encoding/decoding: for Each column data is encoded/decoded by global_dic_table, where global_dic_table(encoding_num, desc_num) is defined as a global dictionary table.
  • global_dic_table(encoding_num, desc_num) is defined as a global dictionary table.
  • the global dictionary table provides the encoded value as an ID for uniquely identifying each column object, avoiding the creation of multiple local dictionary tables, reducing coding consumption and reducing the use of disk space.
  • the implementation of the global dictionary table includes but is not limited to the following:
  • the memory table can be represented by the Dictionary ⁇ key,value> data structure, so for the global dictionary table implemented by the memory table, the key value is the encoding value for each object; and the value is the description information of each object, as shown below:
  • the global dictionary table GLOBAL_DIC can be created by the following function:
  • the operation interface is defined as follows:
  • the memory table can be stored in data store 120 and loaded into non-transitory computer readable medium 106 when in use.
  • the operational interface is defined as follows:
  • the essential difference between the global dictionary table and the existing local dictionary table is that the global dictionary table guarantees the full mapping relationship and does not depend on the existing data; the local dictionary table only maps the existing data. relationship.
  • the encoding interface of the created global dictionary table can be called for encoding.
  • a global dictionary table implemented as an in-memory table If you want to use the global dictionary table to encode and compress a 11-bit mobile phone number column of the table, you need the following operations:
  • the original 11-digit mobile phone number is stored by the string type, converted into a numerical type of encoded value by embedding the sub-query, and inserted into the target table after the compression is completed.
  • the specific process for the database management system 108 to encode and compress the specified column based on the global dictionary table includes: when performing a batch import or insert operation on the table, for example, storing 11-bit phones for a specified column in the table.
  • the character type of the number is listed as COL1 and COL2, and the global coding table is obtained to obtain the object identifier (ID), and then all the object identification IDs (value types) corresponding to the full column data obtained by the query are compressed and stored, so that the actually stored data is a numeric type.
  • Object ID ID
  • the database management system 108 can call the decoding interface of the global dictionary table for decoding.
  • a global dictionary table implemented as an in-memory table If a column needs to participate in projection, etc., operations that must use column data description information, the decoding operation can be completed as follows:
  • the encoded value is used as a condition of the subquery, and the string value of the 11-digit mobile phone number is decoded by the embedded query, and finally outputted as a projection column.
  • the projection is a monocular operation, and the operation selects a specified attribute value from the table to form a new table.
  • the global dictionary table is shared by multiple tables in a shared encoding manner, it can provide globally unique encoded values for the column data of these tables. For the various query operations of the database, including but not limited to: JOIN, SORT, AGGREGATE, etc. If the column that completes the encoding using the global dictionary table does not participate in the operation that must use the object description information, it can directly use its column data object ID for operation. .
  • the column data identifier ID can be directly used to replace the description information of the column data to perform various query operations, and the decoding is not required at a high cost, thereby improving query execution efficiency. Therefore, as shown in FIG.
  • the foregoing embodiment of the present invention further provides a method for performing a database query operation using an object ID, including: S610: SQL engine 110, based on the above-described coding and compression of a database object based on a global shared dictionary table.
  • the SQL statement sent by the client 102 is received, and the SQL statement is processed to generate an execution plan.
  • the SQL statement requests a query operation on at least one specified object in the database, and the query operation includes, but is not limited to, at least one of operations such as JOIN, SORT, or AGGREGATE.
  • the specified object here is specified by the condition in the SQL statement, usually one or more data columns/rows of the database.
  • the execution engine 122 extracts an object identifier (ID) of the specified object that is compressed and stored; wherein the object identifier of the specified object is obtained by encoding and compressing the specified object through the global dictionary table described in the related embodiments of FIG. 3-5.
  • the storage engine is stored in the data store.
  • the execution engine 122 directly performs a query operation using the object ID of the specified object according to the execution plan to generate a query result, and returns the query result to the client.
  • the method for querying the database query provided by the embodiment of the present invention directly replaces the description information of the column object with the object ID to perform various query operations based on the share encoding mode provided by the global dictionary table. Since the granularity of operations is unified from various variable length types to integer types, query execution efficiency is greatly improved.
  • the same tuple of the character type columns COL1 and COL2 of the 11-digit telephone number is stored in the selection tables T1 and T2.
  • the SQL engine 110 performs processing such as parsing, semantic analysis, and optimization rewriting on the SQL statement output by the user, and finally generates an execution plan, and passes the execution plan to the execution engine 122 for calculation.
  • the processing flow of the execution engine 122 based on the global dictionary table is as shown in FIG. 8.
  • the SCAN operator of the execution engine 122 extracts the physical storage compressed object identifier directly as the specified columns COL1 and COL2 by the storage engine 134 when performing the single table scan.
  • the result is passed to the JOIN operator.
  • the JOIN operator directly uses the object identifier to perform the equivalent JOIN operation.
  • the materialization operator queries the global dictionary table for decompression to obtain the object identifier. The corresponding character type original value.
  • the processing flow of the execution engine 122 is as shown in FIG.
  • the storage engine 134 first needs to decompress the COL1 and COL2 columns: load the data of the specified columns COL1 and COL2, and reversely query the local dictionary table to obtain the original character value of the 11-digit mobile phone number. Then the SCAN operator scans the original values of the decompressed character strings of COL1 and COL2 and passes them to the JOIN operator. The JOIN operator uses the original character values of COL1 and COL2 to participate in the equivalent JOIN operation.
  • COL1 holds the character value of the 11-digit mobile phone number
  • the processing flow of the execution engine 122 based on the global dictionary table is as shown in FIG. 11: the SCAN operator converts the string constant '13512345678' in the equivalence condition of the above-mentioned equivalence expression into an object ID by querying the global dictionary table ( 10000), and then directly compare the coded value in the COL1 column with the converted object ID (10000).
  • the COL1 column is not encoded using the global dictionary table, the COL1 is coded and decoded using a local dictionary table. Then, when performing the equivalent expression operation, as shown in FIG.
  • the storage engine 134 first needs to decompress the COL1 column: load the data of the COL1 column, and query the local dictionary table corresponding to COL1 to obtain the character of the 11-digit mobile phone number. Original value. Then, the SCAN operator of the execution engine 122 compares the character original value of all 11 mobile phone numbers in the COL1 column with the string constant in the equivalent condition, and obtains the operation result.
  • the user inputs the SQL statement through the client 102: SELECT COUNT (COL1) FROM T1 GROUP BY COL1; similar to the embodiment shown in FIG. 7 and FIG. 10, the SQL engine 110 performs parsing, semantic analysis, and optimization rewriting on the equivalence expression, and finally generates an execution plan and transmits the execution plan to the execution.
  • the engine 122 performs an operation.
  • the processing flow of the execution engine 122 based on the global dictionary table is as shown in FIG. 14: when the SCAN operator performs the single table scan, the physical storage compressed object identifier is directly extracted by the storage engine 134 as the scan result of the COL1 column, and is transmitted.
  • the AGG operator directly uses the object identifier for group aggregation operation; finally, if it is necessary to display the character type original value of the execution result, the materialization operator queries the global dictionary table for decompression to obtain the character type corresponding to the object identifier. Original value.
  • the COL1 column is not encoded using the global dictionary table, the COL1 is coded and decoded using a local dictionary table.
  • the storage engine 134 first needs to decompress the COL1 column: load the data of the COL1 column, and query the local dictionary table corresponding to COL1 to obtain the character type of the 11-digit mobile phone number. value. Then, the AGG operator performs a group aggregation operation using the original character value of the 11-bit mobile phone number decompressed by the COL1 column, thereby obtaining an operation result.
  • a non-equivalent JOIN scenario for example, the user enters a SQL statement of a non-equivalent JOIN through the client 102: SELECT T1.COL1, T2.COL2 FROM T1, T2 WHERE T1.COL1>T2.
  • the processing of COL2; SQL engine 110 and execution engine 122 is substantially the same as the process in the equivalent JOIN scenario described in Figures 7-8.
  • the main difference is that since the operation of the non-equivalent JOIN scene is not only one-to-one mapping, but also the consistency of the operation result, it is necessary to additionally perform the following processing when establishing the global dictionary table:
  • the sorted sequence number is used as the object identifier (ID) to ensure consistency between the non-equivalent operation using the object identifier (ID) and the non-equivalent operation using the character type original value.
  • the embodiment of the present invention uses a global dictionary table to encode an object in a database.
  • the global dictionary table can be used by multiple tables in a shared encoding manner, and the encoded object identifier is unique. And the object description information is a one-to-one mapping relationship. Then, after loading the database object, it is not necessary to decompress, and the encoding value of each object (ie, the object identifier) can be directly used to participate in query operations such as JOIN, SORT, AGGREGATE, etc., and the operation of the string type is optimized to the operation of the integer type, and the operation is improved. effectiveness.
  • the cold data test mainly includes two processes of loading data and JOIN operations from the data memory. After the data is loaded and cached into memory, the repeated hot data testing process mainly includes the query operation process. It can be seen from the above table data that the beneficial effects of the technical solutions of the present application are mainly as follows:
  • FIG. 16 is a schematic block diagram of another database server 400 according to an embodiment of the present invention.
  • the database server 400 includes a processor 410, a memory 420, a bus system 430, an input device 440, and an output device 450.
  • bus system 430 which may include, in addition to the data bus, a power bus, a control bus, a status signal bus, and the like. However, for clarity of description, various buses are labeled as bus system 430 in the figure.
  • Processor 410 may comprise any type of general purpose computing circuit or special purpose logic circuit such as a central processing unit (CPU), an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
  • Memory 420 can include read only memory (ROM) and random access memory (RAM) for storing data and executable programs. A portion of the memory 420 may also include non-volatile random access memory (NVRAM).
  • the processor 410 is adapted to execute an executable program stored in the memory 420 to implement or perform the method steps described in the various method embodiments of the present invention described above. For example, some or all of the functions of the SQL engine and the execution engine in the embodiments of FIGS. 2 through 14 described above are implemented.
  • the database server 400 may correspond to a method of encoding and compressing a database object according to an embodiment of the present invention, and an execution body of a method of performing a database query operation using an object ID, and each of the database server 400
  • the above and other operations and/or functions of the components are respectively implemented in order to implement the respective processes of the respective methods in FIG. 2 to FIG. 14 , and are not described herein again for brevity.
  • an embodiment of the present invention further provides a cluster database system 500, including: a hardware layer 1007 and a virtual machine monitor (VMM) 1001 running on the hardware layer 1007, and a plurality of virtual machines 1002.
  • a virtual machine can be used as a data node of the cluster database system 500.
  • the virtual machine 1002 is a virtual computer simulated by a virtual machine software on a common hardware resource.
  • the virtual machine can be installed with an operating system and an application program, and the virtual machine can also access network resources.
  • the virtual machine is like working on a real computer.
  • Hardware layer 1007 A hardware platform running in a virtualized environment, which can be abstracted from hardware resources of one or more physical hosts.
  • the hardware layer may include a variety of hardware, including, for example, a processor 1004 (eg, a CPU) and a memory 1005, and may also include a network card 1003 (eg, an RDMA network card), a high speed/low speed input/output (I/O, Input/Output) device. And other devices with specific processing capabilities.
  • the virtual machine 1002 runs an executable program based on the VMM, and the hardware resources provided by the hardware layer 1007, to implement some or all of the functions of the SQL engine and the execution engine in the embodiments of FIGS. 2 through 14 described above. For the sake of brevity, it will not be repeated here.
  • the cluster database system 500 may further include a host: as a management layer, to complete management and allocation of hardware resources; to present a virtual hardware platform for the virtual machine; and to implement scheduling and isolation of the virtual machine.
  • Host may be a virtual machine monitor (VMM); it may also be a combination of VMM and a privileged virtual machine.
  • the virtual hardware platform provides various hardware resources for each virtual machine running on it, such as providing a virtual processor (such as a VCPU), virtual memory, a virtual disk, a virtual network card, and the like.
  • the virtual disk may correspond to a file of the Host or a logical block device.
  • the virtual machine runs on the virtual hardware platform that Host prepares for it, and one or more virtual machines run on the Host.
  • the VCPU of the virtual machine 1002 implements or performs the method steps described in the various method embodiments of the present invention by executing an executable program stored in its corresponding virtual memory. For example, some or all of the functions of the SQL engine and the execution engine in the embodiments of FIGS. 2 through 14 described above are implemented.
  • an embodiment of the present invention further provides a database system, including: a database server 800, based on a client device 900 connected to a database server 800 through a communication network. among them,
  • a client operating system 904 is running on the hardware layer 906 of the client device 900, and an application 902 is running on the operating system 904; an operating system 814 is running on the hardware layer 816 of the database server 800, and a database is running on the operating system 814.
  • the application 902 interfaces with a database management system 812 running on the database server 800 via a communication network and accesses or manipulates a database stored in the data store 818, for example, by querying, updating or deleting data in the database through SQL statements, or importing new data. Data to the database.
  • Hardware layers 906 and 816 contain the basic hardware elements required for the operating system and applications to run, such as processors, such as CPUs, memory, input/output devices, network interfaces, and the like.
  • the data store 818 can be an external memory of the database server 800, such as a hard disk, a disk, a storage array, or a storage server, etc., in communication with the database server 800.
  • data store 818 may also be integrated within database server 800 to interact with the processor and I/O devices via a bus or other internal communication means.
  • Executable code is stored in the memory of database server 800, which when executed by the processor is configured to implement the components and functions of database management system 812.
  • the database management system 812 may be specifically the database management system 108 shown in FIG. 2 .
  • FIG. 2 For related functions and implementation details, reference may be made to the related embodiments in FIG. 2 to FIG. 15 , and details are not described herein again.
  • executable program should be interpreted broadly to include, but not be limited to, instructions, instruction sets, code, code segments, subroutines, software modules, applications, software packages, Threads, processes, functions, firmware, middleware, etc.
  • the size of the sequence of the method steps described in the above embodiments does not mean that the order of execution is sequential, and the order of execution of each process should be determined by its function and internal logic, and should not constitute any limitation on the actual implementation process of the embodiment of the present invention.
  • the disclosed systems, devices, and methods may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • the division of the unit is only a logical function division.
  • there may be another division manner for example, multiple units or components may be combined or Can be integrated into another system, or some features can be ignored or not executed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be in an electrical, mechanical or other form.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the functions may be stored in a computer readable storage medium if implemented in the form of a software functional unit and sold or used as a standalone product.
  • the technical solution of the present invention which is essential or contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium, including
  • the instructions are used to cause a computer device (which may be a personal computer, server, or network device, etc.) to perform all or part of the steps of the methods described in various embodiments of the present invention.
  • the foregoing storage medium includes: a U disk, a mobile hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk, and the like. .

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Operations Research (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

本发明公开了一种数据库系统中对象的编码和运算方法、数据库管理系统以及数据库服务器。根据数据库中多个对象样本空间的相关性,形成全局对象样本空间,在全局对象样本空间内创建全局字典表,以为每一个对象提供全局唯一的编码值作为其对象标识,并保证对象标识与对象的描述信息为一一映射关系。基于全局字典表对数据库对象进行编码压缩,减少编码开销。进一步地,直接使用对象标识进行数据库查询运算,将字符串类型的运算优化为整数类型的运算,提升了运算的效率。

Description

数据库系统中对象的编码及运算方法与数据库服务器 技术领域
本发明涉及数据库技术领域,并且更具体地,涉及一种数据库系统中对象的编码及运算方法、数据库管理系统与数据库服务器。
背景技术
现代化信息处理是建立在基于对象标识(ID)的认识模型上的,在该认知模型中,任一样本空间中对象(Object)都有一个标识(ID),以区别于样本空间中的其他对象,然后使用人类语言来描述此对象。这样,所有对象都具有的两个基本属性:对象ID和对象描述信息,即任一对象可以被表示为:Object=(ID,description)。基于该认知模型的信息处理主要包括:1)建立对象样本空间。2)通过人工/自动编码方式为对象样本空间中的每个对象提供编码值标识该对象的唯一ID,并保存该对象的人类语言描述信息。3)根据信息处理逻辑,使用{Object=(ID,description)}信息参与各种信息运算。
数据库在处理数据时,也正是基于上述认知模型的。以商用数据库为例,它通常使用字典表对各种字符串/文本进行编码/解码,然后参与各种查询运算。具体地,若数据库对源数据文件某一列采用字典表,则这一列的数据就构成了一个对象样本空间。导入数据时,数据库会选择相应的编码算法,对该列中每一个对象进行编码,生成唯一标识ID并保存到数据表的物理文件中。源数据文件中,该列提供的其他数据即对应于该列对象的描述信息。此外,该表拥有一个私有的字典表,存储了该列对象的映射信息,用于解码。比如对源数据文件中“支行名称”列进行编码得到的编码值(对象ID)如下:。
对象ID 支行名称(描述)
0001 XX银行(A支行)
0002 XX银行(B支行)
0003 XX银行(C支行)
最后,参与各种查询运算时,数据库首先加载该列的编码值,然后反向查询其私有的字典表,以获取该列每个对象的描述信息,然后根据这些描述信息进行后续的查询运算。
由上可见,数据库表的各个列在编码过程中使用了私有字典表,增加了编码消耗,浪费了磁盘存储空间。另外,由于各个列使用自身的私有字典表,存在联系的多表之间无法共享编码信息,得到的编码值(对象ID)不能够跨表通用。当需要参与查询运算时,首先必须通过私有字典表得到所有对象的描述信息,然后才能够对这些描述信息进行各种复杂的运算,大大降低了数据库执行查询运算的性能。
发明内容
本发明提供一种对数据库中的对象编码和查询运算的方法、数据库管理系统以及数据 库服务器,能够使用全局字典表对数据库中的对象进行编码,进而直接使用对象标识参与各种信息处理运算,提升数据库的性能。
第一方面,提供了一种对数据库中的对象编码的方法,包括:根据该数据库中多个对象样本空间的相关性,确定全局对象样本空间;基于所述全局对象样本空间创建全局字典表,该全局字典表包含该全局对象样本空间内每一个对象的对象标识(ID)和对象描述信息,每一个对象的对象标识在该全局对象样本空间内唯一,该全局字典表中对象ID与对象的描述信息为一一映射关系,且该全局字典表提供有编码接口;调用该全局字典表的编码接口对该数据库中的至少一个对象进行编码,以得到该至少一个对象的编码值,该至少一个对象的编码值为该至少一个对象的对象ID。
采用上述方法,充分考虑了多个对象样本空间的相关性,创建针对全局样本空间的全局字典表,全局字典表以共享编码(share encoding)的方式被多个对象样本空间(比如多个表)共享使用的,所以它能够为这些样本空间的对象提供全局唯一的编码值,并且编码值与对象描述信息为一一映射的关系,从而避免创建多个局部字典表,减少了编码消耗,节省了磁盘空间。
在一个可能的设计中,该数据库中的同一类对象构成一个对象样本空间,所述对象包括:元组、列或者行。在另一个可能的设计中,一个对象样本空间可以为数据库中的部分数据集,比如一张表、一张表中的某一列数据,或者多张表中的多列数据。这样,数据库中的多张表或多个列可以共享全局字典表,从而避免创建多个局部字典表,减少了编码消耗,节省了磁盘空间。
在一个可能的设计中,根据所述数据库中多个对象样本空间的相关性,确定全局对象样本空间包括:确定该数据库中具有相关性的至少两列,该至少两列构成该全局对象样本空间。
在一个可能的设计中,具有相关性的至少两列包括:在查询运算时会被同时操作的至少两列。
在一个可能的设计中,该全局字典表提供操作接口,该方法还包括:调用所述操作接口对所述全局字典表进行操作,所述操作包括:查询、更新、删除、分区或导入操作中的至少一种。这样可以支持全局字典表的动态更新。
在一个可能的设计中,该方法还包括:调用该全局字典表的编码接口对该数据库的对象进行编码后,将编码后的数据存储到数据存储器中。
在一个可能的设计中,该全局字典表为数据文件、内存表、数据库表或索引表中的一种。该全局字典表可以存储在外部存储器中,使用的时候加载到内存中。
第二方面,提供了一种基于对象ID的查询运算的方法,包括:接收客户端发送的结构化查询语言(SQL)语句,该SQL语句用于请求对数据库中的至少一个对象进行查询运算,处理该SQL语句以生成执行计划;提取存储的该至少一个对象的对象标识(ID);其中,该至少一个对象的对象ID是通过调用全局字典表的编码接口对该至少一个对象进行编码后得到并存储在数据存储器中的;该全局字典表包含全局对象样本空间内每一个对象的对象ID和对象描述信息,所述每一个对象的对象标识在所述全局对象样本空间内唯一,所述全局字典表中对象ID与对象的描述信息为一一映射关系;所述全局对象样本空间是由所述 数据库中具有相关性多个对象样本空间构成的;根据所述执行计划,直接使用所述至少一个对象的对象ID进行所述查询运算,以生成查询结果,并将所述查询结果返回给所述客户端。
采用上述方法,由于全局字典表可被多个表以share encoding方式使用,编码后的对象标识是唯一的,并且和对象描述信息是一一映射关系。那么,加载数据库对象后不必进行解压,可直接使用各个对象的编码值(即对象标识)参与JOIN、SORT、AGGREGATE等查询运算,将字符串类型的运算优化为整数类型的运算,提升了运算的效率。
第三方面,提供一种数据库管理系统,包括:结构化查询语言SQL引擎、执行引擎和存储引擎;其中,所述SQL引擎用于根据客户端提交的SQL语句,生成对应的执行计划,所述SQL语句用于请求对数据库中的至少一个对象进行查询运算;所述存储引擎用于存储全局字典表,所述全局字典表包含全局对象样本空间内每一个对象的对象ID和对象描述信息,所述每一个对象的对象标识在所述全局对象样本空间内唯一,所述全局字典表中对象ID与对象的描述信息为一一映射关系;所述全局对象样本空间是由所述数据库中具有相关性多个对象样本空间构成的;所述执行引擎用于提取所述存储引擎存储的所述至少一个对象的对象ID,根据所述执行计划,直接使用所述至少一个对象的对象ID执行所述查询运算,以生成查询结果,并将所述查询结果返回给所述客户端。
在一个可能的设计中,所述存储引擎还用于创建所述全局字典表。
第四方面,提供另一种数据库管理系统,该数据库管理系统具有实现上述方法。该数据库管理系统可以通过硬件实现,也可以通过硬件执行相应的软件实现。所述硬件或软件包括实现上述方法一个或多个功能模块。具体地,所述数据库管理系统可以包括用于执行第一方面或第二方面的任一可能的实现方式中的方法的模块。
第五方面,提供一种数据库服务器,包括:至少一个处理器、存储可执行代码的非瞬态计算机可读介质;所述可执行代码在被所述至少一个处理器执行时被配置为执行上述方法。
第六方面,提供一种数据库系统,包括:客户端和上述任一数据库管理系统。
第七方面,提供一种集群数据库系统,包括:硬件层和运行在硬件层之上的虚拟机监控器(VMM),以及多个虚拟机。虚拟机基于VMM,以及硬件层提供的硬件资源,运行可执行程序,以实现上述数据库管理系统的部分或全部功能。
第八方面,提供一种存储有可执行程序的非瞬态计算机可读介质,该可执行程序包括用于实现上述数据库管理系统的功能的程序。
在上述各个实现方式中,数据库存储在数据存储器中,该数据存储器为硬盘、磁盘、磁盘阵列或存储服务器。
在上述各个实现方式中,所述数据库中的同一类对象构成一个对象样本空间,所述对象包括:元组、列或者行。
在上述各个实现方式中,具有相关性多个对象样本空间包括:所述数据库中具有相关性的至少两列,所述至少两列构成所述全局对象样本空间。
在上述各个实现方式中,所述具有相关性的至少两列包括:在所述查询运算时会被同时操作的至少两列。
在上述各个实现方式中,所述查询运算包含:连接JOIN、排序SORT或聚集AGGREGATE中的至少一种。
在上述各个实现方式中,所述全局字典表为数据文件、内存表、数据库表或索引表中的一种。
基于上述技术方案,在本发明实施例中,使用全局字典表对数据库中的对象进行编码,该全局字典表可被多个表以share encoding方式使用,编码后的对象标识是唯一的,并且和对象描述信息是一一映射关系。进一步地,加载数据库对象后不必进行解压,可直接使用各个对象的编码值(即对象标识)参与JOIN、SORT、AGGREGATE等查询运算,将字符串类型的运算优化为整数类型的运算,提升了运算的效率。
附图说明
为了更清楚地说明本发明实施例的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍。
图1A-1C示出了本发明实施例提供的数据库系统的示意图。
图2示出了本发明实施例提供的数据库服务器的结构示意图。
图3示出了本发明实施例提供的使用全局字典表对数据库中的对象进行编码压缩的示意图。
图4示出了发明实施例的全局字典表的示意图。
图5示出了本发明实施例的基于全局字典表对指定列进行编码压缩存储的过程示意图。
图6示出了本发明实施例提供的使用对象ID进行数据库查询运算的方法的流程图。
图7和图8为本发明实施例的执行等值JOIN查询运算的过程示意图。
图9示出了现有技术中执行引擎执行等值JOIN查询运算的过程示意图。
图10和图11示出了本发明实施例的处理等值表达式的过程示意图。
图12示出了现有技术中执行引擎处理等值表达式的过程示意图。
图13和图14示出了本发明实施例的执行分组聚集运算的过程示意图。
图15示出了现有技术中执行引擎执行分组聚集运算的过程示意图。
图16示出了本发明实施例提供的另一种数据库服务器的示意性框图
图17示出了本发明实施例提供的一种集群数据库系统的示意图。
图18示出了本发明实施例提供的一种数据库系统的示意图。
具体实施方式
下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本发明一部分实施例,而不是全部的实施例。本发明实施例提供的方法、数据库管理系统及数据库服务器可以应用于单机数据库系统或者集群数据库系统。数据库系统(Database System)是为适应数据处理的需要而发展起来的一种较为理想的数据处理系统。数据库系统一般由以下三部分组成:(1)数据库(database,DB),指长期存储在计算机内的,有组织,可共享的数据的集合。数据库中的 数据按一定的数学模型组织、描述和存储,具有较小的冗余,较高的数据独立性和易扩展性,并可为各种用户共享。(2)硬件,包括存储数据所需的数据存储器,例如内存和/或磁盘。(3)软件,包括DBMS(database management system,数据库管理系统),DBMS是数据库系统的核心软件,是用于科学地组织和存储数据,以及高效获取和维护数据的系统软件,数据库引擎是DBMS的核心内容。
具体地,如图1所示,图1A为单机数据库系统示意图,包括一个数据库管理系统和数据存储器(Data Store),该数据库管理系统用于提供数据库的查询和修改等服务,该数据库管理系统将数据存储到数据存储器中。在单机数据库系统中,数据库管理系统和数据存储器通常位于单一服务器上,比如一台SMP(Symmetric Multi-Processor)服务器。该SMP服务器包括多个处理器,所有的处理器共享资源,如总线,内存和I/O系统等。数据库管理系统的功能可由一个或多个处理器执行内存中的程序来实现。
图1B为采用共享磁盘(Shared-storage)架构的集群数据库系统示意图,包括多个节点(如图1B中的节点1-N),每个节点部署有数据库管理系统,分别为用户提供数据库的查询和修改等服务,多个数据库管理系统存储有共享的数据在共享数据存储器中,并且通过交换机对数据存储器中的数据执行读写操作。共享数据存储器可以为共享磁盘阵列。集群数据库系统中的节点可以为物理机,比如数据库服务器,也可以为运行在抽象硬件资源上的虚拟机。若节点为物理机,则交换机为存储区网络(Storage Area Network,SAN)交换机、以太网交换机,光纤交换机或其它物理交换设备。若节点为虚拟机,则交换机为虚拟交换机。
图1C为采用无共享(Shared-nothing)架构的集群数据库系统示意图,每个节点具有各自独享的硬件资源(如数据存储器)、操作系统和数据库,节点之间通过网络来通信。该体系下,数据将根据数据库模型和应用特点被分配到各个节点上,查询任务将被分割成若干部分,在所有节点上并行执行,彼此协同计算,作为整体提供数据库服务,所有通信功能都在一个高宽带网络互联体系上实现。如同图1B所描述的Shared-storage架构的集群数据库系统一样,这里的节点既可以是物理机,也可以是虚拟机。
在本发明所有实施例中,数据库系统的数据存储器(Data Store)包括但不限于固态硬盘(SSD)、磁盘阵列或其他类型的非瞬态计算机可读介质。图1A-1C中虽未示出数据库,应理解,数据库存储在数据存储器中。所属领域的技术人员可以理解一个数据库系统可能包括比图1A-1C中所示的部件更少或更多的组件,或者包括与图1A-1C中所示组件不同的组件,图1A-1C仅仅示出了与本发明实施例所公开的实现方式更加相关的组件。例如,虽然图1B和1C中已经描述了4个节点,但所属领域的技术人员可理解成一个集群数据库系统可包含任何数量的节点。各节点的数据库管理系统功能可分别由运行在各节点上的软件、硬件和/或固件的适当组合来实现。
为了便于理解和描述,作为示例而非限定,下面以Shared-nothing架构集群数据库系统中一个节点,即数据库服务器为例说明本发明实施例的方案。但是本领域技术人员根据本发明实施例的教导可以很清楚地理解,本发明实施例的方法同样可以应用于Shared-storage架构的集群数据库系统,单机数据库系统,以及任何类型的关系型数据库系统。
如图2所示,本发明实施例提供一种数据库服务器100,包括:至少一个处理器104、存储可执行代码的非瞬态计算机可读介质(non-transitory computer-readable medium)106和数据库管理系统108。所述可执行代码在被至少一个处理器104执行时被配置为实现数据库管理系统108的组件和功能。非瞬态计算机可读介质106可以包括一个或多个非易失性存储器,作为示例,非易失性存储器包括半导体存储器设备,例如EPROM(Erasable Programmable Read Only Memory,可擦可编程只读存储器),EEPROM(Electrically Erasable Programmable Read Only Memory,电可擦只读存储器)和闪存(flash memory);磁盘,例如内部硬盘(internal hard disk)或可移动磁盘(removable disk),磁光盘(magneto optical disk),以及CD ROM和DVD-ROM。此外,非瞬态计算机可读介质106还可以包括被配置为主存储器(main memory)的任何设备。至少一个处理器104可以包括任何类型的通用计算电路或专用逻辑电路,例如FPGA(现场可编程门阵列)或ASIC(专用集成电路)。至少一个处理器104也可以是耦合到一个或多个半导体基板的一个或多个处理器,例如CPU。
数据库管理系统108可以是RDBMS(Relational Database Management System,关系型数据库管理系统)。数据库管理系统108支持SQL(Structured Query Language,结构化查询语言)。通常,SQL是指专门用于管理关系型数据库中保存的数据的专用编程语言。SQL可以指代各种类型的数据相关语言,包括例如数据定义语言和数据操纵语言,其中SQL的范围可以包括数据插入,查询,更新和删除,模式创建和修改以及数据访问控制。此外,在一些示例中,SQL可以包括与各种语言元素相关的描述,包括子句(clause),表达式(expression),谓词(predicate),查询(query)和语句(statement)。例如,子句可以指语句和查询的各种组成部分,并且在一些情况下,子句可以被认为是可选的。此外,表达式可以被配置为产生包括数据列和/或行的标量值(scalar value)和/或表。另外,谓词可经配置以指定条件,以用于调节语句和查询的效果。
查询(query)是请求查看,访问和/或操纵存储在数据库中的数据。数据库管理系统108可以从数据库客户端102接收SQL格式的查询(称为SQL查询)。通常,数据库管理系统108通过从数据库访问相关数据并操纵相关数据以生成查询所对应的查询结果,并将查询结果返回到数据库客户端102。数据库是按一定的数学模型组织、描述和存储的数据集合,数据库可以包括一个或多个数据库结构或格式,例如行存储和列存储。数据库通常存储于数据存储器中,比如图2中的外部数据存储器120,或者非瞬态计算机可读介质106。当数据库存储于非瞬态计算机可读介质106时,数据库管理系统108为内存数据库管理系统。
数据库客户端102可以包括被配置成与数据库管理系统108交互的任何类型的设备或应用程序。在一些示例中,数据库客户端102包括一个或多个应用服务器。
数据库管理系统108包括SQL引擎110、执行引擎122和存储引擎134。SQL引擎110根据客户端102提交的SQL语句,例如查询(Query),生成对应的执行计划,执行引擎122依照语句的执行计划进行操作,以产生查询结果。存储引擎134负责在文件系统之上,管理表的数据、索引的实际内容,同时也会管理运行时的Cache、Buffer、事务、Log等数据。例如存储引擎134可以将执行引擎122的执行结果通过物理I/O写入数据存储器120。SQL引 擎110包括解析器112和优化器114,其中,解析器110用于执行对SQL语句的语法、语义分析,将查询中的视图展开、划分为小的查询块。优化器114为语句生成一组可能被使用的执行计划,估算出每个执行计划的代价,比较计划的代价,最终选择一个代价最小的执行计划。
基于以上描述的数据库管理系统108,如图3所示,本发明实施例提供一种使用全局字典表对数据库中的对象进行编码的方法,包括:
S310:根据多个对象样本空间的相关性,形成全局对象样本空间。
通常,同一类对象或实体(entity)构成一个对象样本空间。典型地,在数据库场景下,一个对象样本空间可以为数据库中的部分数据集,比如一张表中的某一列数据。本发明实施例从数据库中确定出具有相关性的列或行,构成一个全局对象空间。数据库中的一个基本运算单元为一个对象,这里的基本单元包括:元组,列/行,表。例如,每一列或者每一行中的一个元组为一个对象,或者一列为一个对象。具有相关性的列,是指数据库在查询运算时会同时操作的具有关联关系的列,这些列可能来自于同一张表,也可能来自于不同的表。以数据库最常见的JOIN运算为例。数据库执行JOIN运算时所选定的两个列显然具有相关性。同理,具有相关性的行,是指在查询运算时会被同时操作的行。
S320:在全局对象样本空间内创建全局字典表:(1)为每一个对象提供全局唯一的编码值作为其对象标识(ID);(2)保证对象ID与对象的描述信息为一一映射关系;其中,对象的描述信息用于描述或表达该对象的含义,便于人的理解。对象描述信息通常为人类语言。
全局字典表提供全局唯一的编码值来标识一个对象,并以共享编码(share encoding)方式由多个信息对象使用。全局字典表内部的数据主要分为两大集合:{对象编码值}和{对象描述信息}。这两部分需满足如图4所示的一一映射约束。关于全局字典表的具体实现会在后面的实施例中详细介绍。
S330:调用全局字典表的编码接口对指定对象进行编码,以得到该指定对象的编码值(对象ID)。这里的指定对象可以为数据库中的某一列或某几列数据。
可选地,上述方法还包括:
340:调用全局字典表的解码接口对指定对象进行解码,以得到该指定对象的描述信息。
全局字典表提供以下两个对外接口:
1)编码接口:用于输入对象描述信息,以及输出整数型编码值;
2)解码接口:用于输入整数型编码值,以及输出对象描述信息。
数据库管理系统通过调用上述编码/解码接口,对一个或多个指定对象进行编/解码。可选地,全局字典表还提供操作接口,相应地,上述方法还包括:
350:调用全局字典表的操作接口对全局字典表进行操作,包括但不限于:查询、更新、删除、分区或导入等操作中的一种或多种。
因为全局字典表是以共享编码(share encoding)的方式被多个表共享使用的,所以它能够为这些表的列数据提供全局唯一的编码值,并且列数据ID与列描述信息为一一映射的关系,从而避免创建多个局部字典表,减少了编码消耗,节省了磁盘空间。
下面以数据库JOIN运算为例来进一步说明以上方法步骤。若用户选定两个列执行JOIN 运算,由于执行JOIN运算的两个列具有相关性,所以可针对这两个列的所有数据建立一个全局对象样本空间,并使用全局字典表进行编解码:
join_table_1(join_col_1,…,other_cols);
join_table_2(join_col_2,…,other_cols);
对join_col_1/join_col_2形成全局对象样本空间如下:
Figure PCTCN2018077164-appb-000001
对于
Figure PCTCN2018077164-appb-000002
使用全局字典表编码/解码:对于
Figure PCTCN2018077164-appb-000003
中每一个列数据,通过global_dic_table进行编码/解码,其中,定义global_dic_table(encoding_num,desc_num)为全局字典表。这样,全局字典表提供编码值作为ID用于唯一标识各个列对象,避免创建多个局部字典表,减少了编码消耗,减少了磁盘空间的使用。
全局字典表的实现方式包含但不限于如下方式:
1)数据文件
2)内存表(使用Dictionary数据结构进行表示)
3)数据库表
4)索引表
内存表可使用Dictionary<key,value>数据结构进行表示,因此对于通过内存表实现的全局字典表,其key值就是对于每一个对象编码值;而value就是每一个对象描述信息,如下所示:
Key Value
1 ‘00000000000’
2 ‘00000000001’
...... ……
100000 ‘13512345678’
...... ……
100000000000 ‘99999999999’
在一个示例中,假设需要处理的字符串/文本为11位移动电话号码,则可以通过如下 函数来创建全局字典表GLOBAL_DIC:
Figure PCTCN2018077164-appb-000004
由于移动电话号码有着按号段划分的特点,我们可进一步将其设计为如下分区表以提升查询性能:
Figure PCTCN2018077164-appb-000005
进一步地,在一个实施例中,对于上述通过内存表实现的全局字典表,定义其操作接口如下:
添加一个key/value键值对:dic[key]=value/dic.add(key,value);
去除某个key/value键值对:dic.remove(key);
移除所有元素:dic.clear();
判断是否包含特定键key:dic.find(key);
判断某个value对应的key值:需要遍历dic逐个对比value。
内存表可以存储在数据存储器120中,使用的时候加载到非瞬态计算机可读介质106。
在另一个实施例中,对于通过数据库表实现的全局字典表,定义其操作接口如下:
(1)可以通过批量导入或者INSERT操作将‘00000000000’-‘99999999999’及对应的1-100000000000的对象ID值进行入表操作;
(2)可以通过UPDATE操作更新其中的记录;
(3)可以通过DELETE操作更新其中的记录;
(4)可以通过SELECT操作查询表中的记录。
上述全局字典表GLOBAL_DIC建立后,可以提供如下标准的SQL接口对外提编码/解码功能:
编码:11位移动电话号码—>整数型编码值
SELECT ENCODING_NUM FROM GLOBAL_DIC WHERE MOBILE_NUM=?;
解码:整数型编码值—>11位移动电话号码
SELECT MOBILE_NUM FROM GLOBAL_DIC WHERE ENCODING_NUM=?;
若想使用全局字典表,可通过建表语句进行指定:
CREATE TABLE XXX
(
)WITH GLOBAL_DIC;
无论使用具体的实现方式,全局字典表跟现有局部字典表的本质区别在于:全局字典表保证了全量的映射关系,不依赖于已有数据;而局部字典表仅对于现有数据建立了映射关系。
进一步地,可以调用已创建的全局字典表的编码接口进行编码。对于实现为内存表的全局字典表:如果要利用全局字典表对于表的某一11位移动电话号码列进行编码压缩存储,需要如下操作:
(1)对于该11位移动电话号码列中每一个电话号码在内存表中查询其对应的ID值;
(2)修改下列导入SQL语句如下:
INSERT INTO DST_TABLE SELECT COL1,…,ID mobile_num,…,COLn FROM FST_TABLE;如上SQL所示,原先的11位移动电话号码由字符串类型的存储,通过查询内存表替换为数值类型的编码值,在完成压缩后插入目标表。
对于实现为数据库表的全局字典表:如果要利用全局字典表对于表的某一11位移动电话号码列进行编码压缩存储,需修改导入SQL语句如下:
INSERT INTO DST_TABLE SELECT COL1,…,(SELECT ENCODING_NUM FROM GLOBAL_DIC WHERE MOBILE_NUM=COL mobile_num),…,COLn FROM FST_TABLE;
如上SQL所示,原先的11位移动电话号码由字符串类型的存储,通过嵌入子查询转换为数值类型的编码值,在完成压缩后插入目标表。
如图5所示,数据库管理系统108基于全局字典表对指定列进行编码压缩存储的具体过程包括:在对表进行批量导入或者插入等操作时,对于表中的指定列,例如存储11位电话号码的字符类型列COL1和COL2,查询全局编码表获取对象标识(ID),然后将查询得到的全列数据对应的所有对象标识ID(数值类型)进行压缩存储,这样实际存储的数据是数值类型的对象标识(ID)。
进一步的,数据库管理系统108可以调用全局字典表的解码接口进行解码。对于实现为内存表的全局字典表:如果某一列需要参与投影等必须使用列数据描述信息的运算,可通过如下操作完成解码操作:
(1)执行SQL:SELECT COL1,…,ID encoding_num,…,COLn FROM…;可以获取数值类型的编码值;
(2)使用编码值获取11位移动电话号码的字符串值。
对于实现为数据库表的全局字典表:如果某一列需要参与投影等必须使用列数据描述信息参与的运算,可同样通过嵌入子查询完成解码操作:
SELECT COL1,…,(SELECT MOBILE_NUM FROM GLOBAL_DIC WHERE ENCODING_NUM= COL encoding_num),…,COLn FROM…;
如上SQL子查询所示,将编码值作为子查询的条件,通过嵌入查询的方式解码取得11位移动电话号码的字符串值,最后以投影列的形式输出。其中,投影是单目运算,该运算从表中选出指定的属性值组成一个新表。
因为全局字典表是以共享编码(share encoding)的方式被多个表共享使用的,所以它能够为这些表的列数据提供全局唯一的编码值。对于数据库的各种查询运算,包含但并不限于:JOIN、SORT、AGGREGATE等,如果利用全局字典表完成编码的列不参加必须使用对象描述信息的运算,则可直接使用其列数据对象ID进行运算。可直接使用列数据标识ID替换列数据的描述信息进行各种查询运算,而不再需要高代价地进行解码,从而提高查询执行效率。因此,如图6所示,在上述基于全局共享字典表对数据库对象进行编码压缩的基础上,本发明实施例进一步提供一种使用对象ID进行数据库查询运算的方法,包括:S610:SQL引擎110接收客户端102发送的SQL语句,处理该SQL语句以生成执行计划。其中,该SQL语句请求对数据库中的至少一个指定对象进行查询运算,所述查询运算包含但并不限于:JOIN、SORT或AGGREGATE等运算中的至少一种。这里的指定对象,是由SQL语句中的条件指定的,通常为数据库的一个或多个数据列/行。
S620:执行引擎122提取压缩存储的指定对象的对象标识(ID);其中,指定对象的对象标识是通过图3-5相关实施例所描述的全局字典表对指定对象进行编码压缩后得到并通过存储引擎存储在数据存储器中的。
S630:执行引擎122根据执行计划,直接使用指定对象的对象ID进行查询运算,以生成查询结果,并将所述查询结果返回给所述客户端。
本发明实施例提供的数据库查询运算的方法,基于上述全局字典表提供的share encoding模式,直接使用对象ID替换列对象的描述信息进行各种查询运算。由于运算的粒度从各种变长类型统一为整数类型,大大地提高查询执行效率。
下面针对几种具体场景来说明上述直接使用对象ID进行查询运算的方法。
在一个实施例中,参照图2和图7,若用户通过客户端102输入了等值JOIN的SQL语句:SELECT T1.COL1,T2.COL2 FROM T1,T2 WHERE T1.COL=T2.COL2;用于选择表T1和T2中存储11位电话号码的字符类型列COL1和COL2的相同的元组。SQL引擎110对于用户输出的SQL语句进行语法分析、语义分析和优化重写等处理,最终生成执行计划,并将执行计划传给执行引擎122进行运算。
执行引擎122基于全局字典表的处理流程如图8所示:执行引擎122的SCAN算子在执行单表扫描时,直接通过存储引擎134提取物理存储的压缩态对象标识作为指定列COL1和COL2扫描结果,传递给JOIN算子,JOIN算子直接使用对象标识进行等值JOIN运算;最后,若需要显示执行结果的字符型原值,则物化算子查询全局字典表进行解压缩,以获取对象标识对应的字符型原值。相比之下,若未采用全局字典表对指定列COL1和COL2进行share encoding,则执行引擎122的处理流程如图9所示。存储引擎134首先需要对于COL1和COL2列进行解压缩操作:加载指定列COL1和COL2的数据,反向查询局部字典表得到11位移动电话号码的字符型原值。然后SCAN算子扫描获取COL1和COL2解压缩后的字符型原值,并传递给JOIN算子,JOIN算子使用COL1和COL2的字符型原值参与等值JOIN运算。
对比图8和图9描述的处理过程可以发现,使用全局字典表对数据列share encoding之后,存储引擎只需存储数值类型的对象标识,进一步地,等值JOIN时,执行引擎可以直接使用数值类型的对象标识而不是列对象描述信息参与运算,性能有显著提升。
在另一个实施例中,参照图2和图10,若COL1保存了11位移动电话号码的字符值,用户通过客户端102输入了等值表达式:SELECT T1.COL1 FROM T1 WHERE T1.COL1=‘13512345678’;与图7所示的实施例类似,SQL引擎110对于该等值表达式进行语法分析、语义分析和优化重写等处理,最终生成执行计划,并将执行计划传给执行引擎122进行运算。
相应地,执行引擎122基于全局字典表的处理流程如图11所示:SCAN算子通过查询全局字典表将上述等值表达式的等值条件中的字符串常量‘13512345678’转换为对象ID(10000),然后直接将COL1列中的编码值与转换后的对象ID(10000)进行数值等值比较即可。相比之下,若未采用全局字典表对COL1列进行编码,而是使用局部字典表对COL1进行编解码。那么进行等值表达式运算的时候,如图12所示,存储引擎134首先需要对COL1列进行解压缩操作:加载COL1列的数据,查询COL1对应的局部字典表得到11位移动电话号码的字符型原值。然后执行引擎122的SCAN算子将COL1列全部11位移动电话号码的字符型原值与等值条件中的字符串常量进行字符串等值比较,从而得到运算结果。
对比图11和图12描述的处理过程可以发现,在等值表达式场景下,使用全局字典表对数据列编码之后,可以省去COL1列进行解压缩操作,仅需查询全局字典表获取等值条件中的字符串常量的ID,减少了大量I/O操作。
在另一个实施例中,参照图2和图13,对于分组聚集场景,若COL1保存了11位移动电话号码的字符值,用户通过客户端102输入了SQL语句:SELECT COUNT(COL1)FROM T1 GROUP BY COL1;与图7和图10所示的实施例类似,SQL引擎110对于该等值表达式进行语法分析、语义分析和优化重写等处理,最终生成执行计划,并将执行计划传给执行引擎122进行运算。
相应地,执行引擎122基于全局字典表的处理流程如图14所示:SCAN算子在执行单表扫描时,直接通过存储引擎134提取物理存储的压缩态对象标识作为COL1列的扫描结果,传递给AGG算子,AGG算子直接使用对象标识进行分组聚集运算;最后,若需要显示执行结果的字符型原值,则物化算子查询全局字典表进行解压缩,以获取对象标识对应的字符型原值。相比之下,若未采用全局字典表对COL1列进行编码,而是使用局部字典表对COL1进行编解码。那么进行分组聚集运算的时候,如图15所示,存储引擎134首先需要对COL1列进行解压缩操作:加载COL1列的数据,查询COL1对应的局部字典表得到11位移动电话号码的字符型原值。然后AGG算子将使用COL1列解压缩后的11位移动电话号码的字符型原值进行分组聚集运算,从而得到运算结果。
对比图14和图15描述的处理过程可以发现,在分组聚集场景下,使用全局字典表对数据列编码之后,可以省去查询COL1的局部字典表对COL1列解压缩的操作,减少了大量I/O操作。
在另一个实施例中,对于非等值JOIN场景,例如,用户通过客户端102输入了非等值JOIN的SQL语句:SELECT T1.COL1,T2.COL2 FROM T1,T2 WHERE T1.COL1>T2.COL2;SQL 引擎110和执行引擎122的处理过程与图7-8所描述的等值JOIN场景下的流程基本相同。主要的区别在于,由于非等值JOIN场景的运算除了一一映射,还要保证运算结果的一致性,因此需要在建立全局字典表时,额外进行如下处理:
(1)对于全局字典表进行全表SORT(排序)处理;
(2)使用排序后的序号作为对象标识(ID),以保证使用对象标识(ID)的非等值运算与使用字符型原值的非等值运算的结果一致性。
通过上述多种场景下的示例可以看出,本发明实施例使用全局字典表对数据库中的对象进行编码,该全局字典表可被多个表以share encoding方式使用,编码后的对象标识是唯一的,并且和对象描述信息是一一映射关系。那么,加载数据库对象后不必进行解压,可直接使用各个对象的编码值(即对象标识)参与JOIN、SORT、AGGREGATE等查询运算,将字符串类型的运算优化为整数类型的运算,提升了运算的效率。
通过对包含1000万条记录的表进行测试,性能收益实测对比如下:
对于JOIN运算,性能收益对比:
  现有技术方案 本发明技术方案 性能提升
冷数据测试 5401ms 4400ms 18.5%
热数据测试 3792ms 2752ms 27.4%
对于JOIN+AGGREGATE运算,性能收益对比:
  现有技术方案 本发明技术方案 性能提升
冷数据测试 28081ms 9813ms 186.2%
热数据测试 24300ms 7093ms 242.59%
其中,冷数据测试主要包括从数据存储器加载数据和JOIN运算两个过程。在将数据加载并缓存到内存中后,重复进行的热数据测试过程主要包括查询运算过程。由上表数据可以看出,本申请技术方案的有益效果主要有:
(1)使用对象ID而不是对象描述信息参与各种信息处理运算时,性能有显著提升;
(2)使用全局字典表来减少磁盘存储,对加载数据有一定的提升效果,这主要取决于表数据和全局字典表的大小以及它们的存储方式;
(3)与冷数据测试相比,热数据测试的性能提升更加明显;信息处理所涉及的运算越复杂,性能优化的效果就越明显。
图16示出了本发明实施例提供的另一种数据库服务器400的示意性框图,数据库服务器400包括:处理器410、内存(Memory)420、总线系统430、输入设备440、输出设备450。
数据库管理系统400的各个组件通过总线系统430耦合在一起,其中总线系统430除包括数据总线之外,还可以包括电源总线、控制总线和状态信号总线等。但是为了清楚说明起见,在图中将各种总线都标为总线系统430。
处理器410可以包括任何类型的通用计算电路或专用逻辑电路,例如中央处理器(CPU)、FPGA(现场可编程门阵列)或ASIC(专用集成电路)。内存420可以包括只读存 储器(ROM)和随机存取存储器(RAM),用于存储数据和可执行程序。内存420的一部分还可以包括非易失性随机存取存储器(NVRAM)。
处理器410被适配为执行存储在内存420中的可执行程序,以实现或者执行本发明上述各方法实施例中所描述的方法步骤。例如,实现上述图2至图14相关的实施例中SQL引擎和执行引擎的部分或全部功能。
应理解,根据本发明实施例的数据库服务器400可对应于本发明实施例的对数据库对象进行编码压缩的方法,以及使用对象ID进行数据库查询运算的方法的执行主体,并且数据库服务器400中的各个组件的上述和其它操作和/或功能分别为了实现图2至图14中的各个方法的相应流程,为了简洁,在此不再赘述。
参见图17,本发明实施例还提供一种集群数据库系统500,包括:包括硬件层1007和运行在硬件层1007之上的虚拟机监控器(VMM)1001,以及多个虚拟机1002。一个虚拟机可以作为集群数据库系统500的一个数据节点。可选第,还可以指定一个虚拟机作为协调节点。
具体的,虚拟机1002是通过虚拟机软件在公共硬件资源上模拟出的虚拟的计算机,虚拟机上可以安装操作系统和应用程序,虚拟机还可访问网络资源。对于在虚拟机中运行的应用程序而言,虚拟机就像是在真正的计算机中进行工作。
硬件层1007:虚拟化环境运行的硬件平台,可以由一个或多个物理主机的硬件资源抽象得到的。其中,硬件层可包括多种硬件,例如包括处理器1004(例如CPU)和存储器1005,还可以包括网卡1003(例如RDMA网卡)、高速/低速输入/输出(I/O,Input/Output)设备,及具有特定处理功能的其它设备。
虚拟机1002基于VMM,以及硬件层1007提供的硬件资源,运行可执行程序,以实现上述图2至图14相关的实施例中SQL引擎和执行引擎的部分或全部功能。为了简洁,在此不再赘述。
进一步地,该集群数据库系统500还可以包括宿主机(Host):作为管理层,用以完成硬件资源的管理、分配;为虚拟机呈现虚拟硬件平台;实现虚拟机的调度和隔离。其中,Host可能是虚拟机监控器(VMM);也有可能是由VMM和1个特权虚拟机的结合。其中,虚拟硬件平台对其上运行的各个虚拟机提供各种硬件资源,如提供虚拟处理器(如VCPU)、虚拟内存、虚拟磁盘、虚拟网卡等等。其中,该虚拟磁盘可对应Host的一个文件或者一个逻辑块设备。虚拟机运行在Host为其准备的虚拟硬件平台上,Host上运行一个或多个虚拟机。虚拟机1002的VCPU通过执行存储在其对应的虚拟内存中的可执行程序,以实现或者执行本发明上述各方法实施例中所描述的方法步骤。例如,实现上述图2至图14相关的实施例中SQL引擎和执行引擎的部分或全部功能。
参见图18,本发明实施例还提供一种数据库系统,包括:数据库服务器800,基于与数据库服务器800通过通信网络连接的客户端设备900。其中,
客户端设备900的硬件层906上运行有客户端操作系统904,操作系统904上运行有应用程序902;数据库服务器800的硬件层816上运行有操作系统814,以及在操作系统814上运行有数据库管理系统812。应用程序902经由通信网络与在数据库服务器800上运行的数据库管理系统812连接并且访问或者操作存储在数据存储器818中的数据库,例如,通过SQL语句查询、更新或删除数据库中的数据,或者导入新的数据至数据库。
硬件层906和816包含操作系统和应用程序运行所需的基本硬件单元,例如,处理器,例如CPU,内存(Memory)、输入/输出设备、网络接口等。
数据存储器818可以是数据库服务器800的外部存储器,比如硬盘、磁盘、存储阵列,或存储服务器等,与数据库服务器800通信连接。或者,数据存储器818也可以集成在数据库服务器800内部,与处理器和I/O设备通过总线或其它内部通信方式交互数据。
数据库服务器800的内存中存储有可执行代码,该可执行代码在被处理器执行时被配置为实现数据库管理系统812的组件和功能。数据库管理系统812具体可以为图2所示的数据库管理系统108,相关功能及实现细节可参照图2至图15相关的实施例,此处不再赘述。
应理解,在本发明的各种实施例中,“可执行程序”应被广泛地解释为包括但不限于:指令,指令集,代码,代码段,子程序,软件模块,应用,软件包,线程,进程,函数,固件,中间件等。上述实施例描述的方法步骤的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本发明实施例的实上施过程构成任何限定。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以硬件、或者计算机软件和硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的数据库服务器、数据管理系统和数据库系统的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。另外,在本发明各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
所述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本发明的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本发明各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、磁碟或者光盘等各种可以存储程序代码的介质。

Claims (21)

  1. 一种对数据库中的对象编码的方法,其特征在于,包括:
    根据所述数据库中多个对象样本空间的相关性,确定全局对象样本空间;
    基于所述全局对象样本空间创建全局字典表,所述全局字典表包含所述全局对象样本空间内每一个对象的对象标识ID和对象描述信息,所述每一个对象的对象标识在所述全局对象样本空间内唯一,所述全局字典表中对象ID与对象的描述信息为一一映射关系,且所述全局字典表提供有编码接口;
    调用所述全局字典表的所述编码接口对所述数据库中的至少一个对象进行编码,以得到所述至少一个对象的编码值,所述至少一个对象的编码值为所述至少一个对象的对象ID。
  2. 根据权利要求1所述的方法,其特征在于,所述数据库中的同一类对象构成一个对象样本空间,所述对象包括:元组。
  3. 根据权利要求1或2所述的方法,其特征在于,所述根据所述数据库中多个对象样本空间的相关性,确定全局对象样本空间包括:
    确定所述数据库中具有相关性的至少两列,所述至少两列构成所述全局对象样本空间。
  4. 根据权利要求3所述的方法,其特征在于,所述具有相关性的至少两列包括:在查询运算时会被同时操作的至少两列。
  5. 根据权利要求1至4任一项所述的方法,其特征在于,所述全局字典表提供操作接口,所述方法还包括:
    调用所述操作接口对所述全局字典表进行操作,所述操作包括:查询、更新、删除、分区或导入操作中的至少一种。
  6. 根据权利要求1至5任一项所述的方法,其特征在于,所述全局字典表为数据文件、内存表、数据库表或索引表中的一种。
  7. 一种数据库查询运算的方法,其特征在于,包括:
    接收客户端发送的结构化查询语言SQL语句,处理该SQL语句以生成执行计划;所述SQL语句用于请求对数据库中的至少一个对象进行查询运算;
    提取存储的所述至少一个对象的对象标识ID;其中,所述对象的对象标识是通过调用全局字典表的编码接口对所述至少一个对象进行编码后得到并存储在数据存储器中的;所述全局字典表包含全局对象样本空间内每一个对象的对象ID和对象描述信息,所述每一个对象的对象标识在所述全局对象样本空间内唯一,所述全局字典表中对象ID与对象的描述信息为一一映射关系;所述全局对象样本空间是由所述数据库中具有相关性多个对象样本空间构成的;
    根据所述执行计划,直接使用所述至少一个对象的对象ID进行所述查询运算,以生成查询结果,并将所述查询结果返回给所述客户端。
  8. 根据权利要求7所述的方法,其特征在于,所述数据库中的同一类对象构成一个对象样本空间,所述对象包括:元组。
  9. 根据权利要求7或8所述的方法,其特征在于,所述具有相关性多个对象样本空 间包括:所述数据库中具有相关性的至少两列,所述至少两列构成所述全局对象样本空间。
  10. 根据权利要求9所述的方法,其特征在于,所述具有相关性的至少两列包括:在所述查询运算时会被同时操作的至少两列。
  11. 根据权利要求7至10任一项所述的方法,其特征在于,所述查询运算包含:连接JOIN、排序SORT或聚集AGGREGATE中的至少一种。
  12. 根据权利要求7至11任一项所述的方法,其特征在于,所述全局字典表为数据文件、内存表、数据库表或索引表中的一种。
  13. 一种数据库管理系统,其特征在于,包括:结构化查询语言SQL引擎、执行引擎和存储引擎;其中,
    所述SQL引擎用于根据客户端提交的SQL语句,生成对应的执行计划,所述SQL语句用于请求对数据库中的至少一个对象进行查询运算;
    所述存储引擎用于存储全局字典表,所述全局字典表包含全局对象样本空间内每一个对象的对象标识ID和对象描述信息,所述每一个对象的对象ID在所述全局对象样本空间内唯一,所述全局字典表中对象ID与对象的描述信息为一一映射关系;所述全局对象样本空间是由所述数据库中具有相关性多个对象样本空间构成的;
    所述执行引擎用于提取所述存储引擎存储的所述至少一个对象的对象ID,根据所述执行计划,直接使用所述至少一个对象的对象ID执行所述查询运算,以生成查询结果,并将所述查询结果返回给所述客户端。
  14. 根据权利要求13所述的数据库管理系统,其特征在于,所述数据库中的同一类对象构成一个对象样本空间,所述对象包括:元组。
  15. 根据权利要求13或14所述的数据库管理系统,其特征在于,所述具有相关性多个对象样本空间包括:所述数据库中具有相关性的至少两列,所述至少两列构成所述全局对象样本空间。
  16. 根据权利要求15所述的数据库管理系统,其特征在于,所述具有相关性的至少两列包括:在所述查询运算时会被同时操作的至少两列。
  17. 根据权利要求13至16任一项所述的数据库管理系统,其特征在于,所述存储引擎还用于创建所述全局字典表。
  18. 根据权利要求13至17任一项所述的数据库管理系统,其特征在于,所述全局字典表为数据文件、内存表、数据库表或索引表中的一种。
  19. 根据权利要求13至18任一项所述的数据库管理系统,其特征在于,所述查询运算包含:连接JOIN、排序SORT或聚集AGGREGATE中的至少一种。
  20. 一种数据库服务器,包括:至少一个处理器、存储可执行代码的非瞬态计算机可读介质;所述可执行代码在被所述至少一个处理器执行时被配置为执行权利要求7-12中任一项所述的方法。
  21. 一种数据库系统,包括:客户端和权利要求13至20任一项所述的数据库管理系统。
PCT/CN2018/077164 2017-03-02 2018-02-24 数据库系统中对象的编码及运算方法与数据库服务器 WO2018157765A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP18760336.0A EP3582124B1 (en) 2017-03-02 2018-02-24 Method for coding and calculation of object in database system and database server
US16/559,245 US11194806B2 (en) 2017-03-02 2019-09-03 Object encoding and computation method in database system and database server

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710121672.1A CN108536705B (zh) 2017-03-02 2017-03-02 数据库系统中对象的编码及运算方法与数据库服务器
CN201710121672.1 2017-03-02

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/559,245 Continuation US11194806B2 (en) 2017-03-02 2019-09-03 Object encoding and computation method in database system and database server

Publications (1)

Publication Number Publication Date
WO2018157765A1 true WO2018157765A1 (zh) 2018-09-07

Family

ID=63369793

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/077164 WO2018157765A1 (zh) 2017-03-02 2018-02-24 数据库系统中对象的编码及运算方法与数据库服务器

Country Status (4)

Country Link
US (1) US11194806B2 (zh)
EP (1) EP3582124B1 (zh)
CN (1) CN108536705B (zh)
WO (1) WO2018157765A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109582786A (zh) * 2018-10-31 2019-04-05 中国科学院深圳先进技术研究院 一种基于自动编码的文本表示学习方法、系统及电子设备
CN110597857A (zh) * 2019-08-30 2019-12-20 南开大学 一种基于共享样本的在线聚集方法
CN112395304A (zh) * 2020-10-30 2021-02-23 迅鳐成都科技有限公司 基于数据行为模拟的数据安全计算方法、系统及存储介质

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11086828B2 (en) * 2018-10-12 2021-08-10 Sap Se Compression of column store tables
US11048816B2 (en) * 2019-04-02 2021-06-29 Sap Se Secure database utilizing dictionary encoding
CN110704407B (zh) * 2019-09-04 2022-07-05 苏宁云计算有限公司 一种数据去重的方法和系统
CN111259205B (zh) * 2020-01-15 2023-10-20 北京百度网讯科技有限公司 一种图数据库遍历方法、装置、设备及存储介质
CN111723313A (zh) * 2020-06-23 2020-09-29 中国平安财产保险股份有限公司 页面跳转的处理方法、装置、设备及存储介质
CN112035257B (zh) * 2020-08-31 2022-05-31 浪潮云信息技术股份公司 基于KV存储的mergejoin异步块运算方法
US11520790B2 (en) 2020-09-17 2022-12-06 International Business Machines Corporation Providing character encoding
CN112182021B (zh) * 2020-11-03 2022-10-18 浙江大搜车软件技术有限公司 一种用户数据查询方法、装置和系统
US11416469B2 (en) * 2020-11-24 2022-08-16 International Business Machines Corporation Unsupervised feature learning for relational data
CN113326261B (zh) * 2021-04-29 2024-03-08 奇富数科(上海)科技有限公司 数据血缘关系提取方法、装置及电子设备
CN113626004A (zh) * 2021-08-19 2021-11-09 新疆大学 一种基于配置模式的快速开发系统
KR20230135257A (ko) * 2022-03-16 2023-09-25 주식회사 티맥스티베로 상이한 파티션 정보를 갖는 테이블들을 쿼리하기 위한 방법
CN117424765B (zh) * 2023-12-19 2024-03-22 天津医康互联科技有限公司 分布式独热编码方法、装置、电子设备及计算机存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020087545A1 (en) * 2001-01-02 2002-07-04 Bright Walter G. Method and apparatus for simplified access to online services
CN103324733A (zh) * 2013-06-30 2013-09-25 四川效率源信息安全技术有限责任公司 一种基于数据库的文件可重构存储及管理方法
CN103326732A (zh) * 2013-05-10 2013-09-25 华为技术有限公司 压缩数据的方法、解压数据的方法、编码器和解码器
CN103942234A (zh) * 2013-01-21 2014-07-23 中国电信股份有限公司 对多个异构数据库操作的方法、中间件装置及系统
CN104346377A (zh) * 2013-07-31 2015-02-11 克拉玛依红有软件有限责任公司 一种基于唯一标识的数据集成和交换方法

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5317742A (en) * 1991-06-21 1994-05-31 Racal-Datacom, Inc. Dynamic translation of network management primitives to queries to a database
WO1996007147A1 (en) * 1994-09-01 1996-03-07 Datacraft Technologies Pty. Ltd. X.500 system and methods
US20020035432A1 (en) * 2000-06-08 2002-03-21 Boguslaw Kubica Method and system for spatially indexing land
US8692695B2 (en) 2000-10-03 2014-04-08 Realtime Data, Llc Methods for encoding and decoding data
JP4381012B2 (ja) * 2003-03-14 2009-12-09 ヒューレット・パッカード・カンパニー 万物識別子を用いたデータ検索システムおよびデータ検索方法
US8589574B1 (en) * 2005-12-29 2013-11-19 Amazon Technologies, Inc. Dynamic application instance discovery and state management within a distributed system
CN101419600A (zh) * 2007-10-22 2009-04-29 深圳市亚贝电气技术有限公司 基于面向对象文件系统的数据副本映射方法及装置
CN101626563B (zh) 2008-07-08 2012-07-04 中国移动通信集团公司 一种通信网络中的数据存储系统及信息处理方法
CN102436479B (zh) * 2011-10-12 2014-01-08 远光软件股份有限公司 全局唯一数据标志符的生成方法和系统
CN102685221B (zh) * 2012-04-29 2014-12-03 华北电力大学(保定) 一种状态监测数据的分布式存储与并行挖掘方法
US8924373B2 (en) * 2012-08-09 2014-12-30 International Business Machines Corporation Query plans with parameter markers in place of object identifiers
CN103765391A (zh) 2012-08-23 2014-04-30 数创株式会社 分布式数据库系统
US10909113B2 (en) * 2013-07-31 2021-02-02 Sap Se Global dictionary for database management systems
US9818010B2 (en) * 2014-10-09 2017-11-14 The Code Corporation Barcode-reading system
CN105045577B (zh) * 2015-06-29 2018-12-18 用友优普信息技术有限公司 接口变更的提示方法和接口变更的提示系统
US20170249393A1 (en) * 2016-02-26 2017-08-31 Salesforce.Com, Inc. Method and browser plugin for creation of objects in a cloud-based object management system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020087545A1 (en) * 2001-01-02 2002-07-04 Bright Walter G. Method and apparatus for simplified access to online services
CN103942234A (zh) * 2013-01-21 2014-07-23 中国电信股份有限公司 对多个异构数据库操作的方法、中间件装置及系统
CN103326732A (zh) * 2013-05-10 2013-09-25 华为技术有限公司 压缩数据的方法、解压数据的方法、编码器和解码器
CN103324733A (zh) * 2013-06-30 2013-09-25 四川效率源信息安全技术有限责任公司 一种基于数据库的文件可重构存储及管理方法
CN104346377A (zh) * 2013-07-31 2015-02-11 克拉玛依红有软件有限责任公司 一种基于唯一标识的数据集成和交换方法

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3582124A4 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109582786A (zh) * 2018-10-31 2019-04-05 中国科学院深圳先进技术研究院 一种基于自动编码的文本表示学习方法、系统及电子设备
CN109582786B (zh) * 2018-10-31 2020-11-24 中国科学院深圳先进技术研究院 一种基于自动编码的文本表示学习方法、系统及电子设备
CN110597857A (zh) * 2019-08-30 2019-12-20 南开大学 一种基于共享样本的在线聚集方法
CN110597857B (zh) * 2019-08-30 2023-03-24 南开大学 一种基于共享样本的在线聚集方法
CN112395304A (zh) * 2020-10-30 2021-02-23 迅鳐成都科技有限公司 基于数据行为模拟的数据安全计算方法、系统及存储介质
CN112395304B (zh) * 2020-10-30 2024-01-02 迅鳐成都科技有限公司 基于数据行为模拟的数据安全计算方法、系统及存储介质

Also Published As

Publication number Publication date
EP3582124A1 (en) 2019-12-18
CN108536705B (zh) 2021-10-01
EP3582124A4 (en) 2020-01-22
CN108536705A (zh) 2018-09-14
EP3582124B1 (en) 2023-06-14
US11194806B2 (en) 2021-12-07
US20190391978A1 (en) 2019-12-26

Similar Documents

Publication Publication Date Title
WO2018157765A1 (zh) 数据库系统中对象的编码及运算方法与数据库服务器
US10838940B1 (en) Balanced key range based retrieval of key-value database
US9965513B2 (en) Set-orientated visibility state retrieval scheme
US10866971B2 (en) Hash collision tables for relational operations
US9542442B2 (en) Accessing data in a column store database based on hardware compatible indexing and replicated reordered columns
US7966343B2 (en) Accessing data in a column store database based on hardware compatible data structures
US11899666B2 (en) System and method for dynamic database split generation in a massively parallel or distributed database environment
US10089377B2 (en) System and method for data transfer from JDBC to a data warehouse layer in a massively parallel or distributed database environment
US10380114B2 (en) System and method for generating rowid range-based splits in a massively parallel or distributed database environment
US10180973B2 (en) System and method for efficient connection management in a massively parallel or distributed database environment
US10528596B2 (en) System and method for consistent reads between tasks in a massively parallel or distributed database environment
US10733184B2 (en) Query planning and execution with source and sink operators
US11544268B2 (en) System and method for generating size-based splits in a massively parallel or distributed database environment
US20150142733A1 (en) System and method for efficient management of big data in a database using streaming tables
US10089357B2 (en) System and method for generating partition-based splits in a massively parallel or distributed database environment
US10078684B2 (en) System and method for query processing with table-level predicate pushdown in a massively parallel or distributed database environment
US10671625B2 (en) Processing a query primitive call on a value identifier set
US11016973B2 (en) Query plan execution engine
US10558661B2 (en) Query plan generation based on table adapter
CN110413642B (zh) 一种应用无感知的分片数据库解析及优化方法
CN117632947A (zh) 在列存储中存储和查询知识图谱
CN117529714A (zh) 用于为迁移rdbms推荐存储格式的方法和系统

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18760336

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2018760336

Country of ref document: EP

Effective date: 20190913