CN117112691A - Storage method of big data-oriented multi-storage engine database

Storage method of big data-oriented multi-storage engine database

Info

Publication number
CN117112691A
CN117112691A (application CN202310883974.8A)
Authority
CN
China
Prior art keywords
data
unified
management
storage
sql
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310883974.8A
Other languages
Chinese (zh)
Inventor
岳丽军
周万宁
届峰
刘静涛
邹雨
刘超
苏思
王一
杨春
陈单英
谢德晓
张新建
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unit 91977 Of Pla
Original Assignee
Unit 91977 Of Pla
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unit 91977 Of Pla
Priority to CN202310883974.8A
Publication of CN117112691A
Legal status: Pending (current)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval of structured data, e.g. relational data
    • G06F16/27: Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F16/21: Design, administration or maintenance of databases
    • G06F16/214: Database migration support
    • G06F16/22: Indexing; Data structures therefor; Storage structures
    • G06F16/221: Column-oriented storage; Management thereof
    • G06F16/25: Integrating or interfacing systems involving database management systems
    • G06F16/28: Databases characterised by their database models, e.g. relational or object models
    • G06F16/284: Relational databases
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46: Multiprogramming arrangements
    • G06F9/466: Transaction processing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a big data-oriented multi-storage engine database system and a storage method thereof. The system comprises: an SQL compiler; a stored procedure compiler; a transaction management unit; a distributed in-memory columnar store; a distributed execution engine; a data source connector; a multi-tenant management component; and a middleware management unit. The method comprises the following steps: unified data storage; unified metadata management; a unified data model; a unified SQL parsing engine; and unified security control. The invention supports at least two domestic processors, and the multi-storage engine database delivers a 10x to 100x performance improvement over the open-source Hadoop version, meeting the low-latency requirements of online storage and online service analysis systems. Its SQL support completeness and performance are far ahead of Cloudera Impala. It also outperforms other Hadoop and MPP products in the TPC-DS and TPC-H benchmarks, and enables deep fusion of military databases, typical domestic CPUs and military application scenarios.

Description

Storage method of big data-oriented multi-storage engine database
Technical Field
The invention relates to the technical field of data storage, in particular to a storage method of a multi-storage engine database oriented to big data.
Background
With the explosive growth of battlefield information, the large-scale deployment of sensors and the continuous improvement of information systems, the traditional database storage mode can no longer meet the requirements of capturing and understanding battlefield information and a large body of military knowledge. In military applications, data types such as pictures, text and numerical values are becoming ever more complex and voluminous. Statistics, screening and machine learning are nearly impossible over massive disordered data, so data analysts and model trainers must first prepare the data before analysis and training: ordering and classifying it, filtering out dirty and erroneous records, and filling in missing values. As military information data keeps expanding, analysts and model trainers must spend enormous effort on data preparation, learn a great deal of additional data-processing knowledge and possess a certain engineering capability, which raises the entry threshold for analysts and algorithm engineers and greatly increases enterprises' labor costs. To solve these problems, a highly available multi-storage engine database based on a big data architecture is needed to automatically normalize and classify data, reducing labor cost and letting data analysts and model trainers spend more time and effort on analysis and algorithms. To fill this technical gap, big data-oriented multi-storage engine database technology shows great advantages in knowledge representation, knowledge storage and knowledge query, and can meet requirements such as military information retrieval, intelligence analysis and situation awareness. As the physical starting point of data analysis, storage faces the challenges of big data applications.
Based on the basic software and hardware structure and technical characteristics of domestic base platforms, this work tackles key technologies for massive military data, such as unified multi-storage engine management, unified SQL parsing and model mapping, multi-engine distributed elastic scaling, real-time database computation engines and graph databases, in order to form a typical domestically autonomous and controllable multi-storage engine fusion database. A prototype of the database core engine is developed to support the unified management and processing of complex data in typical military application scenarios.
Disclosure of Invention
The invention aims to provide a system and a storage method for a big data-oriented multi-storage engine database, which are used for solving the problems existing in the prior art.
In order to achieve the above object, the present invention provides a big data oriented multi-storage engine database system, comprising:
the SQL compiler supports the ANSI SQL 92 and SQL 99 standards as well as the ANSI SQL 2003 OLAP core extensions, meeting the SQL requirements of data warehouse business and facilitating smooth application migration;
the stored procedure compiler supports complete data types, flow control, packages, cursors, exception handling and dynamic SQL execution, and supports high-speed statistics, insert/update/delete and distributed transaction operations within stored procedures, satisfying the migration of data applications from relational databases to the Inceptor platform;
the transaction management unit realizes consistency and isolation control through a two-phase locking protocol and MVCC, and supports the Serializable Snapshot Isolation level, ensuring transaction consistency under concurrency;
the distributed in-memory columnar store, a memory- or SSD-based columnar storage engine named Holodesk, stores data column-wise in memory or on SSD and, together with a memory-based execution engine, completely avoids IO-induced latency;
the distributed execution engine builds an independent distributed data layer, keeping computation data independent of the computing engine's JVM memory space and effectively reducing the impact of JVM GC on system performance and stability;
the data source connector connects the execution engine with various data sources, allowing the engine to perform real-time statistical analysis over data from many different sources without importing it into HDFS in advance, making it more convenient for users to build diversified services;
the multi-tenant management component provides complete multi-tenant management functions, including tenant resource management, tenant permission management and security control modules, facilitating enterprise multi-tenant management and allocation on a unified big data platform; it allows the configuration and management of CPU and memory resources for multiple tenants, with different tenants using different CPU and memory resource pools so that tenants do not interfere with each other;
the middleware management unit supports the JDBC 4.0 and ODBC 3.5 standards, so it can support Hibernate/Spring middleware, is fully compatible with Tableau/QlikView/Cognos report tools, and can integrate completely with an enterprise's current data application layer.
Further, the stored procedure compiler comprises a complete optimizer, including a CFG Optimizer and a Parallel Optimizer; the CFG Optimizer optimizes the code of stored procedures, performing major optimizations such as loop unrolling, redundant code elimination and function inlining.
Further, the transaction management unit supports starting a transaction with Begin Transaction and ending it with commit or rollback, realizes consistency and isolation control through a two-phase locking protocol and MVCC, and supports the Serializable Snapshot Isolation level, ensuring transaction consistency under concurrency.
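As an illustrative sketch (not part of the patented implementation), the MVCC visibility rule described above can be modeled as follows: each write creates a new row version stamped with a commit id, and a reader sees only the newest version committed before its snapshot. All class and method names here are invented for illustration.

```python
# Minimal MVCC sketch: writers append versions, readers see a snapshot.
class MVCCStore:
    def __init__(self):
        self.versions = {}   # key -> list of (commit_id, value)
        self.next_txn = 1    # monotonically increasing transaction counter

    def begin(self):
        """Take a snapshot: the reader sees only commits made before this point."""
        snapshot = self.next_txn
        self.next_txn += 1
        return snapshot

    def commit_write(self, key, value):
        commit_id = self.next_txn
        self.next_txn += 1
        self.versions.setdefault(key, []).append((commit_id, value))
        return commit_id

    def read(self, snapshot, key):
        """Return the latest version visible to the given snapshot."""
        visible = [v for cid, v in self.versions.get(key, []) if cid <= snapshot]
        return visible[-1] if visible else None

store = MVCCStore()
store.commit_write("k", "v1")   # committed before the reader starts
snap = store.begin()            # reader takes its snapshot here
store.commit_write("k", "v2")   # concurrent write, invisible to the reader
print(store.read(snap, "k"))    # -> v1
```

The key property shown is isolation: the reader's result does not change even though a concurrent writer committed after the snapshot was taken.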
Further, Holodesk supports building distributed indexes on data fields, allows a user to build an OLAP Cube over multi-field combinations, and stores the Cube directly in memory or on SSD without requiring additional BI tools to build it.
Further, the distributed execution engine is based on a cost optimizer and a rule optimizer, supplemented by more than 100 optimization rules, so that SQL applications achieve maximum performance without manual changes; the distributed execution engine includes two execution modes: a low-latency mode and a high-throughput mode.
Further, after the execution plan starts, the data source connector extracts the required data from other data sources through pre-established connections and feeds it into the execution engine layer to participate in SQL computation; after the computation completes, the related database connections and corresponding resources are released.
The invention also provides a storage method for the above big data-oriented multi-storage engine database system, comprising the following steps:
step 1, unified data storage: integrate each physically independent, separately managed large database system into one complete unified data store;
step 2, unified metadata management: a metadata warehouse is set up inside the logical data warehouse, using an RDBMS cluster dedicated to storing and managing metadata; whenever a new table or index is created in any member database system, the logical data warehouse synchronously creates a metadata two-dimensional mapping table in the metadata warehouse and records the relation between the mapping table and the original table and/or index, laying the foundation for unified SQL access; the metadata management system stores information about all database objects and provides a query interface for other systems, where the queryable information includes the physical distribution of data, its distribution characteristics, maximum and minimum values, and permission information;
step 3, unified data model: the data models of the multi-engine database systems are uniformly mapped into relational two-dimensional mapping tables of a relational database, laying the theoretical foundation for centralized unified metadata management and unified SQL access;
step 4, unified SQL parsing engine: provides unified SQL parsing, optimization and execution services for all data systems in the logical data warehouse; SQL statements submitted by users are first received by the SQL engine, converted into Spark code after parsing and optimization, and then executed by a high-performance distributed computing cluster, calling each system's native APIs during execution to access the data;
step 5, unified security control: based on a unified user-and-role authentication system conforming to an account and/or role RBAC model, permission management is realized through roles and through batch authorization of users; the Kerberos security protocol is supported, LDAP serves as the account management system, and account information is uniformly authenticated through Kerberos.
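The parse, optimize and execute flow of step 4 can be sketched roughly as below. The toy parser, the push-down "optimization" and the per-engine stub functions are all invented stand-ins for illustration, not the patent's real Spark-based implementation.

```python
# Rough shape of the unified SQL path: parse -> optimize -> dispatch to the
# native API of whichever engine holds the data.
def parse(sql):
    # toy parser for statements of the form "SELECT <col> FROM <table>"
    toks = sql.split()
    return {"col": toks[1], "table": toks[3]}

def optimize(plan):
    plan["pushdown"] = True   # e.g. push the column projection to the source
    return plan

# toy in-memory "engines" standing in for real Hive/HBase clusters
HIVE_DATA = {"logs": [{"level": "INFO"}, {"level": "WARN"}]}
HBASE_DATA = {}

NATIVE_APIS = {   # per-engine access functions (stubs for illustration)
    "hive":  lambda table, col: [r[col] for r in HIVE_DATA[table]],
    "hbase": lambda table, col: [r[col] for r in HBASE_DATA[table]],
}

def execute(plan, engine):
    return NATIVE_APIS[engine](plan["table"], plan["col"])

plan = optimize(parse("SELECT level FROM logs"))
print(execute(plan, "hive"))   # -> ['INFO', 'WARN']
```

The point of the sketch is the separation of concerns: the SQL front end never touches storage directly; execution always goes through each member system's own API, as the step describes.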
Further, the unified metadata management of step 2 also includes: providing metadata access and collection for Hive, HDFS and HBase; providing a UI for unified metadata management and Restful API interfaces for the related services, offering various microservice integration modes, with the data behind the unified metadata management UI foreground pages stored in MySQL database tables; providing a data message bus mode consisting of a message queue (currently Kafka) and metadata operation API interfaces; providing a unified metadata TypeSystem, a graph computation and storage query engine layer, intelligent labeling algorithms and a knowledge graph model; and providing a common storage encapsulation layer for the graph computation query engine, supporting the open-source JanusGraph graph computation, storage and query engine.
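The central mapping idea of step 2, that every table created in any member engine is mirrored as a record in one metadata warehouse so a single SQL layer can later resolve names to their physical home, can be sketched as follows. The field names are illustrative assumptions, not the patent's actual schema.

```python
# Sketch of a unified metadata warehouse: one central registry of where each
# logical table physically lives and what the optimizer can know about it.
class MetadataWarehouse:
    def __init__(self):
        self.mappings = {}   # logical name -> descriptor dict

    def register_table(self, logical_name, engine, physical_name,
                       columns, stats=None):
        """Called synchronously whenever a member engine creates a table."""
        self.mappings[logical_name] = {
            "engine": engine,            # which storage engine holds the data
            "physical": physical_name,   # the table's name inside that engine
            "columns": columns,          # schema for the unified SQL layer
            "stats": stats or {},        # e.g. min/max values, for optimization
        }

    def resolve(self, logical_name):
        """Answer 'where does this table physically live?' for the SQL layer."""
        return self.mappings[logical_name]

mw = MetadataWarehouse()
mw.register_table("orders", engine="hbase", physical_name="ns:orders",
                  columns=["id", "amount"], stats={"amount": (0, 9999)})
print(mw.resolve("orders")["engine"])   # -> hbase
```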
Furthermore, the unified metadata management of step 2 also includes real-time, efficient indexing of heterogeneous graph, key-value, document and relational data, supporting BRPQ, a mixed index over numerical point attributes, numerical range attributes, spatial positions and text; this specifically includes construction of the PSPQ index, combination of PSPQ with a keyword index, distributed spatial keyword queries with relational attributes, and index construction based on Lucene.
Further, in step 5 the unified security control architecture is divided into 4 layers: the system layer uses an improved Apache DS, raising read-write efficiency more than 10-fold, with a single set of users and a unified LDAP/Kerberos authentication mode, avoiding OpenLDAP's use of Kerberos authentication and accelerating LDAP authentication; the service layer realizes complete ARBAC model support and provides a REST API, a user-friendly WebUI and password policies; the plug-in layer provides authentication, authorization, group mapping and quota management for each component in plug-in form, so every component can use a unified user, group and permission management model; the application layer interfaces with the PaaS and SaaS services of the various platforms, all protected by the unified security control.
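The RBAC model of step 5 can be sketched in a few lines: permissions attach to roles, users hold roles, and authorization is a role-membership lookup, which is also why batch authorization works, since granting a role updates every user holding it. The role and permission names below are made up for illustration.

```python
# Minimal RBAC sketch: user -> roles -> permissions.
class RBAC:
    def __init__(self):
        self.role_perms = {}   # role -> set of permissions
        self.user_roles = {}   # user -> set of roles

    def grant_role(self, role, *perms):
        """Attach permissions to a role (batch authorization for its holders)."""
        self.role_perms.setdefault(role, set()).update(perms)

    def assign(self, user, role):
        self.user_roles.setdefault(user, set()).add(role)

    def allowed(self, user, perm):
        return any(perm in self.role_perms.get(r, set())
                   for r in self.user_roles.get(user, ()))

rbac = RBAC()
rbac.grant_role("analyst", "SELECT")
rbac.assign("alice", "analyst")
print(rbac.allowed("alice", "SELECT"))   # -> True
print(rbac.allowed("alice", "DROP"))     # -> False
# batch authorization: one grant reaches every holder of the role at once
rbac.grant_role("analyst", "INSERT")
print(rbac.allowed("alice", "INSERT"))   # -> True
```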
The method of the invention has the following advantages:
Through verification on a prototype system, the invention supports at least two domestic processors, and the multi-storage engine database delivers a 10x-100x performance improvement over the open-source Hadoop version, meeting the low-latency requirements of online storage and online service analysis systems. With integrated optimization of the execution engine and the data storage layer, its SQL support completeness and performance are far ahead of Cloudera Impala. It also outperforms other Hadoop and MPP products in the TPC-DS and TPC-H benchmarks, and enables deep fusion of military databases, typical domestic CPUs and military application scenarios.
Drawings
FIG. 1 illustrates a key technology implementation overall architecture diagram;
FIG. 2 illustrates a unified metadata management architecture;
FIG. 3 shows a thread communication protocol diagram;
FIG. 4 illustrates a unified security management architecture diagram;
FIG. 5 shows a Kerberos authentication process diagram;
FIG. 6 shows a PSPQ index structure;
FIG. 7 shows a B-tree based query schematic;
FIG. 8 shows a BRPQ index structure.
Detailed Description
The technical solution of the present invention will be clearly and completely described in conjunction with specific embodiments, but it should be understood by those skilled in the art that the embodiments described below are only for illustrating the present invention and should not be construed as limiting its scope. All other embodiments obtained by one of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of the present invention.
Based on a domestic processor platform, the invention creatively proposes a big data-oriented multi-storage engine database technology that builds a unified data storage engine, a unified metadata management mechanism and a unified data model, and constructs a unified, highly efficient distributed SQL engine to provide strong platform support for military service scenarios.
The invention discloses a big data-oriented multi-storage engine database system, as shown in figure 1, comprising:
1. SQL Compiler SQL 2003 Compiler
Applications such as enterprise data warehouses and data marts are mostly developed based on SQL, but most products in the Hadoop ecosystem have relatively poor SQL compatibility or do not support modular SQL extensions, so the cost of application migration is very high, or migration is simply not feasible. To reduce this cost, Transwarp Inceptor provides a complete SQL compiler that supports the ANSI SQL 92 and SQL 99 standards as well as the ANSI SQL 2003 OLAP core extensions, meeting the SQL requirements of most existing data warehouse services and facilitating smooth application migration.
In addition to a solid SQL semantic analysis layer, the Inceptor contains a powerful optimizer to ensure SQL achieves the best performance on the engine. The Inceptor comprises a three-stage optimizer: first, a rule-based optimizer applies static optimization rules and generates a logical execution plan; second, a cost-based optimizer selects the more reasonable plan by weighing the CPU, IO and network costs of several different execution plans and generates a physical execution plan; finally, a code generator produces more efficient execution code or Java bytecode for the most performance-critical execution logic, ensuring that SQL services have optimal performance on the distributed platform.
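The three-stage pipeline just described can be sketched as three composable functions: static rule rewriting, cost-based physical plan choice, then code generation. The rules, cost numbers and plan fields below are invented for illustration, not the Inceptor's actual rules.

```python
# Toy three-stage optimizer: rules -> cost model -> "code generation".
def rule_stage(plan):
    # static rule example: drop a trivially true filter (constant folding)
    if plan.get("filter") == "1=1":
        plan = {k: v for k, v in plan.items() if k != "filter"}
    return plan

def cost_stage(plan):
    # pick the cheapest physical join strategy under a (fake) cost model:
    # broadcasting is cheap only when the right side is small
    candidates = {"broadcast_join": 10 if plan["right_rows"] < 1000 else 500,
                  "shuffle_join": 100}
    plan["join"] = min(candidates, key=candidates.get)
    return plan

def codegen_stage(plan):
    # stand-in for emitting efficient execution code / Java bytecode
    return f"exec:{plan['join']}:{plan['table']}"

plan = {"table": "t", "filter": "1=1", "right_rows": 10}
print(codegen_stage(cost_stage(rule_stage(plan))))   # -> exec:broadcast_join:t
```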
2. Stored procedure Compiler PL/SQL Compiler
Existing data warehouse applications in China are mostly based on SQL 2003 and use a large number of stored procedures to build complex applications. Thus, in addition to the SQL compiler, Transwarp Inceptor contains a stored procedure compiler for compiling and executing stored procedures.
The Inceptor supports the two mainstream SQL standards Oracle PL/SQL and DB2 SQL PL, including complete data types, flow control, packages, cursors, exception handling and dynamic SQL execution, and supports high-speed statistics, insert/update/delete and distributed transaction operations inside stored procedures. With the stored procedure compiler as a complement, the Inceptor can therefore satisfy the migration of most data applications from relational databases to the Inceptor platform.
Beyond SQL grammar-level support, the stored procedure compiler contains a complete optimizer, including a CFG Optimizer, a Parallel Optimizer and a DAG Optimizer. The CFG Optimizer optimizes the code of stored procedures, performing major optimizations such as loop unrolling, redundant code elimination and function inlining. The Parallel Optimizer parallelizes originally serial logic and uses the cluster's computing power to raise overall execution speed, yielding very noticeable performance gains for key features such as cursors. The DAG Optimizer performs secondary optimization on the generated DAG graph to produce a more reasonable physical execution plan, focusing on reducing task costs such as shuffles. To be effectively compatible with other databases, the Inceptor supports isolating the differences between SQL standards through dialect settings, avoiding ambiguity in data computation and processing standards and ensuring the correctness of data processing.
3. Transaction management unit Transaction Manager
To better meet the requirements of data warehouse business scenarios, the Inceptor provides complete insert/update/delete SQL support, allowing data to be processed from multiple data sources. Meanwhile, to effectively ensure the accuracy of data processing, the Inceptor provides distributed transaction support, guaranteeing ACID, i.e. atomicity, consistency, isolation and durability, of data during processing.
The Inceptor supports starting a transaction with Begin Transaction and ending it with commit or rollback. The transaction management unit realizes consistency and isolation control through a two-phase locking protocol and MVCC, and supports the Serializable Snapshot Isolation level, ensuring transaction consistency under concurrency.
The Inceptor supports the semantics of the insert/update/delete portion of SQL 2003, supporting the Insert, Update, Delete and Merge Into primitives, supporting single-table updates or updates from other data tables and nested queries, and embeds a consistency check function to prevent illegal changes.
Through SQL compiler optimization, insert/update/delete SQL execution plans are executed in the cluster through the distributed engine, and the throughput of the whole system can reach several times that of a relational database, meeting the high-throughput requirements of batch processing business. In addition, through reasonable resource planning, the Inceptor allows tenants to perform high-speed statistical analysis on data while it is being inserted, updated and deleted.
4. Distributed memory columnar storage Holodesk
To speed up interactive analysis, the Inceptor provides Holodesk, a memory- or SSD-based columnar storage engine. Holodesk stores data column-wise in memory or on SSD and, assisted by a memory-based execution engine, can completely avoid IO-induced latency and greatly improve data scanning speed.
Beyond columnar storage speeding up statistical analysis, Holodesk supports building distributed indexes on data fields. Using intelligent indexing technology to construct an optimal query scheme for each query, the Inceptor can reduce SQL query latency to the millisecond level.
Holodesk allows a user to construct an OLAP Cube over multi-field combinations and store the Cube directly in memory or on SSD without additional BI tools, so that for some complex statistical analyses and interactive report queries, Holodesk can achieve second-level responses.
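The reason columnar layout accelerates the statistical analysis described above is that an aggregate over one field touches only that field's contiguous values rather than every row. A minimal in-memory column store, illustrative only and unrelated to Holodesk's real format, makes the point:

```python
# Minimal column store: rows are decomposed into per-column arrays, so an
# aggregate scans exactly one column and never touches the others.
class ColumnStore:
    def __init__(self, columns):
        self.columns = columns            # column name -> list of values

    @classmethod
    def from_rows(cls, rows):
        cols = {}
        for row in rows:
            for name, value in row.items():
                cols.setdefault(name, []).append(value)
        return cls(cols)

    def sum(self, name):
        # reads exactly one column; "id" and "note" are never touched
        return sum(self.columns[name])

store = ColumnStore.from_rows([
    {"id": 1, "amount": 10, "note": "a"},
    {"id": 2, "amount": 32, "note": "b"},
])
print(store.sum("amount"))   # -> 42
```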
Beyond its performance advantages, Holodesk also performs well in terms of availability. Both Holodesk's metadata and storage natively support high availability, supporting exception handling and disaster recovery through consensus protocols and multi-versioning. Under abnormal conditions, Holodesk can automatically restore and rebuild all table information and data without manual recovery, reducing development, operation and maintenance costs and ensuring system stability. The Inceptor heavily optimizes Holodesk performance for SSDs, so that a PCIe SSD-based deployment reaches more than 80% of the performance of a full-memory scheme. A low-cost hybrid memory/flash storage scheme can therefore approximate the analysis performance of full in-memory storage, ensuring the high cost-effectiveness of the solution.
5. Distributed execution engine Distributed Execution Engine
The Inceptor develops a dedicated distributed computing engine deeply customized from Apache Spark, greatly improving computing performance and effectively solving many of Spark's stability problems, so the computing engine can run 7x24 hours without downtime. In addition, the Inceptor engine builds an independent distributed data layer that keeps computation data out of the computing engine's JVM memory space, effectively reducing the impact of JVM GC on system performance and stability.
For SQL execution plan optimization, the optimizer implements both a cost-based optimizer and a rule-based optimizer, supplemented by more than 100 optimization rules, ensuring that SQL applications achieve maximum performance without manual modification. For common data processing problems such as data skew, the execution engine can automatically identify and optimize them, resolving most computing scenarios with data skew and avoiding its impact on system stability.
To better adapt to various data scenarios, the Inceptor's execution engine offers two execution modes: a low-latency mode and a high-throughput mode. The low-latency mode mainly targets scenarios with smaller data volumes: the execution engine generates a physical execution plan with low execution latency, keeping SQL execution time short by reducing or avoiding high-latency tasks (such as IO and network), so as to reach or approach relational database performance in these scenarios. The high-throughput mode mainly targets big data scenarios, improving the performance of complex statistical analysis over very large data volumes through reasonable distributed execution. The Inceptor's execution engine can therefore meet data service requirements across data volumes from GB to PB.
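Conceptually, the two execution modes amount to a planner switch keyed on the estimated input size. The one-million-row threshold in this sketch is an arbitrary stand-in, not a figure from the patent:

```python
# Toy planner switch between the two execution modes described above.
def choose_mode(estimated_rows):
    """Small inputs get the low-latency plan; large ones the high-throughput plan."""
    return "low_latency" if estimated_rows < 1_000_000 else "high_throughput"

print(choose_mode(50_000))         # -> low_latency
print(choose_mode(2_000_000_000))  # -> high_throughput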
6. Data source connector Stargate
Enterprise data may be scattered across multiple systems that cannot share data with each other or perform joint analysis, creating data islands. Building a unified big data platform effectively solves the data island problem in most scenarios, yet for various reasons some data still cannot be migrated to the unified platform. To address this, the Inceptor provides the data source connector Stargate. Stargate connects the execution engine with various data sources, enabling real-time statistical analysis over data from many different sources without importing it into HDFS in advance, making it more convenient for users to build diversified services. At the grammar level, the Inceptor is compatible with the Oracle DB-Link specification: a connection pool to another data source is established in advance by creating a database link, after which that source's data can be accessed in real time from SQL in the Inceptor using the table_name@database_link form, with no other operations required. After the execution plan starts, Stargate extracts the required data from the other data sources through the pre-established connections and feeds it into the execution engine layer to participate in SQL computation. After the computation completes, the related database connections and corresponding resources are released.
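The table_name@database_link convention can be sketched as a name split plus a registry lookup: the part after "@" selects a pre-registered link, and rows are pulled through it at execution time. The registry and the fetch function below are illustrative stubs, not Stargate's real interface.

```python
# Sketch of DB-Link style name resolution: "emp@ora1" -> link "ora1", table "emp".
LINKS = {}   # database link name -> callable(table) -> rows

def create_database_link(name, fetch_fn):
    """Register a pre-established connection (modeled here as a function)."""
    LINKS[name] = fetch_fn

def resolve(qualified_name):
    table, _, link = qualified_name.partition("@")
    return LINKS[link](table)   # pull rows through the pre-built connection

# pretend "ora1" is a pre-established connection pool to an Oracle source
create_database_link("ora1",
                     lambda table: {"emp": [("alice",), ("bob",)]}[table])
print(resolve("emp@ora1"))   # -> [('alice',), ('bob',)]
```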
The relational databases currently supported by Stargate include Oracle, DB2, MySQL, Teradata and PostgreSQL. In addition, Stargate can currently access Holodesk, HDFS, Hyperbase, Elasticsearch and more, and Redis will be supported as a data source in the future.
7. Multi-tenant management component Guardian
Guardian provides complete multi-tenant management functions, including tenant resource management, tenant permission management and security control modules, facilitating enterprise multi-tenant management and allocation on a unified big data platform.
Guardian allows configuration and management of CPU and memory resources for multiple tenants, and different tenants use different CPU and memory resource pools, so that the tenants do not interfere with each other. In addition, different priority levels can be set for different users to realize quality of service (QoS). The Guardian supports configuration and management of the disk space of the user through SQL, including quota of the data space and the temporary space, modification and management, so as to facilitate reasonable allocation, management and charging of the platform to the storage resources.
The Guardian supports the use of LDAP protocol for user access control and Kerberos protocol for bottom access control, thereby ensuring the security and isolation of data. The Guardian supports the authority control of a whole set of SQL-based database/table, an administrator can set the authority of inquiring, modifying, deleting and the like of the table of the user, and comprises a whole set of role setting, so that the authority control of the user can be conveniently realized through the setting of a role group.
In addition, guardian supports Row Level Security for precise row-level authority control over the data of the table. Under the multi-tenant scene, different tenants can only see the data with authority in the table, but cannot see the data belonging to other tenants, so that accurate data isolation is realized.
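The row-level isolation described above can be sketched as follows. In this illustrative Python sketch, each row carries a tenant tag and a per-tenant view filters rows at read time; the schema and policy shape are assumptions, not Guardian's actual implementation.

```python
# Illustrative sketch of Row Level Security under multi-tenancy: each
# row carries a tenant tag and reads are filtered by that tag.

rows = [
    {"tenant": "t1", "id": 1, "value": 10},
    {"tenant": "t2", "id": 2, "value": 20},
    {"tenant": "t1", "id": 3, "value": 30},
]

def tenant_view(rows, tenant):
    """Return only the rows the given tenant is authorized to see."""
    return [r for r in rows if r["tenant"] == tenant]

t1_rows = tenant_view(rows, "t1")
```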
8. Middleware management unit Connector
Inceptor fully supports the JDBC 4.0 and ODBC 3.5 standards, so middleware such as Hibernate and Spring is supported, reporting tools such as Tableau, QlikView, and Cognos are fully compatible, and the platform can be fully docked with an enterprise's existing data application layer.
In addition, Inceptor supports interfacing with other data synchronization tools; mutual certification and integration with IBM CDC has been completed, and tools such as Oracle GoldenGate and SAP Data Services can be supported. Enterprise users can therefore synchronize transactional data into Inceptor in real time for interactive statistical analysis services.
The technical scheme of the invention comprises the following steps:
1. unified data store
Physically independent, separately managed large database systems are integrated into a complete, unified data store.
2. Unified metadata management mechanism
To achieve centralized, unified management of metadata, a metadata repository, i.e., an RDBMS cluster dedicated to storing and managing metadata, is deployed in the logical data warehouse. Whenever a new table or index is created in any member database system, the logical data warehouse synchronously creates a metadata two-dimensional mapping table in the metadata repository and records its relationship to the original table/index, laying the foundation for unified SQL access.
The metadata management system is responsible for storing information about all database objects, including schemas, tables, views, columns, triggers, sequences, stored procedures, and the like, and provides a query interface for other systems; the queryable information includes, but is not limited to, the physical distribution of data, distribution characteristics, maximum and minimum values, and permission information. Because metadata is critical data whose reliability determines whether the entire database system can provide service, the prototype system must design for the reliability of the metadata storage system. The present invention contemplates using a highly consistent, highly reliable storage service for recording metadata information, with candidate systems including Apache ZooKeeper or etcd. Both ZooKeeper and etcd store the same data on multiple instances and use a consensus algorithm (e.g., Raft) to ensure the data in every instance is completely consistent. Thus, even if some instances cannot provide service due to software or hardware problems, the metadata system continues to provide service as long as one instance still works normally.
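The mapping-table synchronization described above can be sketched as follows. The class, method names, and unified naming scheme in this Python sketch are illustrative assumptions, not the invention's actual metadata repository implementation.

```python
# Sketch of metadata synchronization: when a table is created in any
# member database, the logical warehouse records a mapping entry
# linking a unified name to the original engine object, which the SQL
# layer later uses to route queries. Names are illustrative only.

class MetadataRepository:
    def __init__(self):
        self.mappings = {}   # unified name -> (engine, original object)

    def register_table(self, engine, original_name, unified_name):
        self.mappings[unified_name] = (engine, original_name)

    def lookup(self, unified_name):
        # The SQL layer uses this to route a query to the owning engine.
        return self.mappings[unified_name]

repo = MetadataRepository()
repo.register_table("hbase", "ns:orders", "dw.orders")
engine, original = repo.lookup("dw.orders")
```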
3. Unified data model
For all data systems in the multi-engine database to support SQL access, the data model must first be unified. Research shows that although these systems differ in technical principle, most adopt a data model similar to a two-dimensional table. HBase, by comparison, effectively uses a four-dimensional table: data is organized by row, and column constraints are relaxed by introducing column families; under specific constraints, this four-dimensional table can still be mapped onto a two-dimensional table. Following this idea, the invention uniformly maps the data models of the multi-engine database systems onto relational two-dimensional tables, also called relational two-dimensional mapping tables, laying the theoretical foundation for centralized, unified metadata management and unified SQL access.
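The HBase-to-relational mapping described above can be sketched as follows. The sketch assumes one specific constraint for illustration (only the latest version of each cell is kept) and flattens "family:qualifier" into a column name; this is not the invention's actual mapping code.

```python
# Sketch of mapping HBase's four-dimensional cells
# (row key, column family, column qualifier, timestamp) -> value
# onto a relational-style 2-D table: keep only the latest version of
# each cell and flatten "family:qualifier" into a column name.

cells = [
    ("r1", "cf", "name", 100, "alice"),
    ("r1", "cf", "name", 200, "alicia"),   # newer version wins
    ("r1", "cf", "age",  100, "30"),
    ("r2", "cf", "name", 150, "bob"),
]

def to_2d(cells):
    table = {}
    latest = {}
    for row, fam, qual, ts, val in cells:
        col = fam + ":" + qual
        key = (row, col)
        if key not in latest or ts > latest[key]:
            latest[key] = ts
            table.setdefault(row, {})[col] = val
    return table

flat = to_2d(cells)
```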
4. Unified SQL parsing engine
A general-purpose distributed SQL engine is built on the high-performance computing framework Spark. The engine provides unified SQL parsing, optimization, and execution services for all data systems in the logical data warehouse: SQL statements submitted by users are first received by the SQL engine, converted into Spark code after parsing and optimization, and then executed by the high-performance distributed computing cluster, which calls each system's native APIs during execution to access the data.
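The parse-optimize-execute flow described above can be sketched as follows. The toy parser, plan dictionary, and stand-in engine APIs are purely illustrative assumptions, not the actual Spark-based engine internals.

```python
# Minimal sketch of the unified SQL pipeline: a statement passes
# through parse -> optimize -> execute stages, and the execute stage
# calls the owning engine's native API. All names are illustrative.

def parse(sql):
    # Toy parser: extract the table name from "SELECT ... FROM <table>".
    tokens = sql.replace(";", "").split()
    return {"table": tokens[tokens.index("FROM") + 1]}

def optimize(plan):
    plan["optimized"] = True   # stand-in for real plan rewriting
    return plan

ENGINE_APIS = {"dw.orders": lambda: ["row1", "row2"]}  # stand-in native APIs

def execute(plan):
    return ENGINE_APIS[plan["table"]]()

result = execute(optimize(parse("SELECT * FROM dw.orders")))
```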
5. Unified safety control
A unified user-and-role authentication system conforms to the account/role RBAC (role-based access control) model: permissions are managed through roles, and users are authorized in batches. The Kerberos security protocol is supported, LDAP serves as the account management system, and account information is uniformly authenticated through Kerberos. In a complex battlefield confrontation environment, unified security control prevents an adversary's information-warfare systems from eavesdropping on transmitted content, jamming friendly systems' signals, or injecting false or deceptive information into the information system. The information security requirements that informationized warfare places on data applications are:
(1) Data encryption: encryption protects data both in transit and at rest, so sensitive data remains effectively protected even under unauthorized access. For encryption and decryption, an efficient scheme achieves high-performance, low-latency end-to-end and storage-layer encryption (non-sensitive data need not be encrypted, so performance is unaffected). Effective use of encryption also requires secure, flexible key management; open-source schemes are weak here, so commercial key management products are needed. In addition, encryption and decryption are transparent to upper-layer services: a service only needs to designate which data is sensitive and is otherwise entirely unaware of the encryption and decryption process.
(2) User privacy data desensitization: data desensitization and personal-information de-identification functions are provided, along with user data encryption services that satisfy international cryptographic algorithm standards.
(3) Multi-tenant isolation: multi-tenant access isolation measures and data security grading are implemented, label-based mandatory access control is supported, an ACL-based data-access authorization model is provided, and global and private data views are provided together with access control over those views.
(4) Data lifecycle management: understanding where data in a big data platform comes from, how it is used, and where and by whom it is destroyed is critical to monitoring illegal data access in a big data system; this must be accomplished through security audit. The purpose of security audit is to capture complete, unalterable records of activity within the system. The big data platform can exercise all-around security control over data: governance beforehand, control during, and audit afterwards.
(5) Log audit: log auditing is an essential measure for data management, data tracing, and attack detection. The big data platform has log management and analysis capabilities.
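The account/role RBAC model underlying the unified security control above can be sketched as follows. The role names, permission tuples, and check function are illustrative assumptions, not the system's actual authorization code.

```python
# Sketch of account/role RBAC: permissions attach to roles, users
# receive roles, and a check resolves user -> roles -> permissions.

role_perms = {
    "analyst": {("sales", "SELECT")},
    "admin":   {("sales", "SELECT"), ("sales", "DELETE")},
}
user_roles = {"zhang": ["analyst"], "li": ["admin"]}

def is_allowed(user, table, action):
    """True if any of the user's roles grants (table, action)."""
    return any((table, action) in role_perms[r] for r in user_roles[user])
```

Batch authorization, as described above, then amounts to assigning one role to a group of users rather than granting table permissions user by user.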
The specific embodiment of the invention comprises the following steps:
1) Unified metadata management implementation
(1) Metadata management architecture
To achieve centralized, unified management of metadata, a metadata repository, i.e., an RDBMS cluster dedicated to storing and managing metadata, is deployed in the logical data warehouse. Whenever a new table or index is created in any member database system, the logical data warehouse synchronously creates a metadata two-dimensional mapping table in the metadata repository and records its relationship to the original table/index, laying the foundation for unified SQL access. We have conducted intensive studies on metadata management; the architecture design is shown in fig. 2:
Metadata Sources Access
Provides metadata access and acquisition functions for Hive, HDFS, HBase, and the like.
Spring Framework UI & Restful API
Provides the UI interface for unified metadata management and RESTful API interfaces for the related services, offering a microservice docking mode for each type. The data behind the unified-metadata-management UI foreground pages is stored in MySQL database tables, and the foreground pages obtain page data by querying the background platform services via operator input, in real time, or offline.
Metadata Integration & Notification API
Provides metadata-operation interfaces and a data message bus based on message queues (currently Kafka) and API interfaces (HTTP or REST).
Core Platform
Provides the unified metadata Type System, the graph computation, storage, and query engine layer, intelligent labeling algorithms, knowledge graph models, and the like.
Graph Database
Provides a common storage encapsulation layer for the graph-computation query engine, supporting the open-source JanusGraph graph computation/storage/query engine.
(2) Metadata storage principles
The metadata store uses HBase for entity information, and index information is stored in Elasticsearch. The MetaStore client connects to the MetaStore service, which in turn connects to the MySQL database to access the metadata. With the MetaStore service, multiple clients can connect simultaneously without knowing the MySQL user name and password; they only need to connect to the MetaStore service. Unified metadata is enabled by running the MetaStore as a separate service: a MetaStoreServer is started on the server side, and clients access the metadata repository through the MetaStoreServer using the Thrift protocol. The Thrift communication protocol stack is shown in fig. 3:
The bottom I/O module is responsible for the actual data transmission, covering sockets, files, compressed data streams, and the like.
TTransport sends and receives messages as byte streams. In the Thrift framework, each bottom I/O module has a corresponding TTransport responsible for transmitting Thrift byte-stream data over that I/O module; for example, TSocket corresponds to socket transmission and TFileTransport to file transmission.
TProtocol is mainly responsible for assembling structured data into messages and reading structured data back out of message structures: it converts typed data into a byte stream handed to TTransport, or reads a given length of byte data from TTransport and decodes it into typed data. For example, TBinaryProtocol encodes an int32 as four bytes of data, or extracts four bytes from TTransport and decodes them as an int32.
TServer receives client requests and forwards them to the Processor for handling. TServer's primary task is to accept client requests efficiently, and in particular to complete requests quickly under highly concurrent load.
The Processor (TProcessor) handles client requests, covering RPC request forwarding, call-parameter parsing, invocation of user logic, and writing back of return values. The Processor is the key juncture where the server hands off from the Thrift framework to user logic; it is also responsible for writing data into, and reading data out of, the message structure.
2) Unified security management and control implementation
In a complex battlefield confrontation environment, unified security control prevents an adversary's information-warfare systems from eavesdropping on transmitted content, jamming friendly systems' signals, or injecting false or deceptive information into the information system. Research on unified security management and control for the multi-storage-engine database is therefore very important: by studying a unified security control architecture and unified service encryption and authentication services, unified data-access control guarantees secure, unified management of the entire multi-storage-engine database platform.
As shown in fig. 4, the unified security management architecture is divided into 4 layers:
The system layer uses an improved ApacheDS, raising read/write efficiency more than tenfold; the same set of users and a unified LDAP/Kerberos authentication mode are used, avoiding OpenLDAP-plus-Kerberos authentication and accelerating LDAP authentication.
The service layer implements full ARBAC model support and provides a REST API, a user-friendly WebUI, password policies, and related functions. A JWT token mechanism lays the groundwork for SSO. User authentication and authorization are unified: the same set of users and the same authorization mechanism apply from web services down to the Hadoop bottom layer, and an LDAP interface, REST API, and LoginService are exposed for third-party applications to integrate authentication and authorization.
The plug-in layer provides authentication, authorization, group mapping, and quota management for each component in plug-in form, so that every component can use a unified user, group, and rights management model.
The application layer interfaces with the PaaS and SaaS services of the various platforms, all protected by unified security control.
Unified service encryption and authentication are provided through the Kerberos protocol. The Kerberos authentication procedure is as follows: the client sends a request to the Authentication Server (AS) asking for credentials for a given server, and the AS's response contains these credentials encrypted with the client's key.
As shown in fig. 5, the credentials consist of: 1) a server TGT; 2) a temporary encryption key (also known as the session key). The client transmits the TGT (containing the client identity encrypted with the server's key and a copy of the session key) to the server. The session key, now shared by client and server, can be used to authenticate the client or the authentication server, to encrypt subsequent communication between the two parties, or to provide further communication encryption by exchanging a separate sub-session key.
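The credential flow above can be modeled schematically. In the sketch below, "sealing" is a tagged tuple standing in for real encryption, and all names are illustrative: it demonstrates only the message flow (TGT under the server key, session key under the client key), not actual Kerberos cryptography.

```python
# Toy model of the Kerberos credential flow: the AS returns a TGT
# (client identity sealed under the server's key) plus the session
# key sealed under the client's key. "Sealing" is NOT real crypto.

import secrets

def seal(key, payload):
    return ("sealed", key, payload)

def unseal(key, box):
    tag, used_key, payload = box
    if tag != "sealed" or used_key != key:
        raise PermissionError("wrong key")
    return payload

def as_issue(client_id, client_key, server_key):
    session_key = secrets.token_hex(8)
    tgt = seal(server_key, {"client": client_id, "session": session_key})
    return tgt, seal(client_key, session_key)

tgt, box = as_issue("alice", "k_alice", "k_server")
session = unseal("k_alice", box)    # client recovers the session key
ticket = unseal("k_server", tgt)    # server validates the TGT
```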
3) Efficient index real-time mode for heterogeneous data such as graph, key value, document, relation and the like
To address the difficulty traditional relational databases have in meeting the complex data-type requirements of military big data processing, which include heterogeneous data such as documents, relations, and spatio-temporal positions, the invention first proposes the PSPQ (Point and Segment Preferences Query) index for processing preference-attribute information based on numeric points and numeric segments, and then, combined with a keyword index, proposes the hybrid index BRPQ (Boolean Range with Preferences Query index) supporting numeric point attributes, numeric segment attributes, spatial positions, and text. The work specifically covers constructing the PSPQ index, combining PSPQ with the keyword index, spatial-keyword distributed querying with relational attributes, and Lucene-based index construction.
PSPQ index structure
The PSPQ index structure consists of a hash table and an inverted file, as shown in fig. 6.
(1) Hash table. Indexes numeric point relational-attribute information as key-value pairs (key, pIF), where key identifies the numeric point attribute information and pIF is a pointer to the inverted file.
Hash-table key design. There are several numeric point attributes, the query on each attribute is a range query, and supporting a joint range query over multiple attributes while preserving the pruning rate is the central challenge of the key design. To meet this challenge, each attribute's value range is divided into several intervals, each interval's key is the interval's minimum value, and the hash-table keys are then formed by taking the Cartesian product of the per-attribute interval keys. This approach has two advantages: on one hand, data can be accessed quickly by key value and joint range queries over multiple attributes are supported; on the other hand, during pruning, hash-table keys outside the preference query's range are filtered quickly, and the finer each attribute's value range is divided, the more hash-table keys are filtered and the stronger the pruning.
Each attribute's value range is pre-cut to obtain the hash-table keys: for each numeric point attribute, the attribute values of all objects are divided into equal-frequency parts to obtain the intervals. Equal-frequency division keeps the number of objects in each interval roughly equal, so the data is evenly distributed across the hash-table keys, pruning is more stable, and a pruned hash-table key is not associated with too many or too few objects.
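The key construction just described can be sketched as follows: equal-frequency cuts per attribute, then the Cartesian product of the per-attribute interval minima. The interval counts and sample attribute values are illustrative assumptions.

```python
# Sketch of PSPQ hash-key construction: each numeric point attribute's
# value range is cut into equal-frequency intervals keyed by their
# minimum value, and hash-table keys are the Cartesian product of the
# per-attribute interval keys.

from itertools import product

def equal_frequency_cuts(values, parts):
    """Interval minima such that each interval holds ~equal object counts."""
    s = sorted(values)
    step = len(s) // parts
    return [s[i * step] for i in range(parts)]

# Two illustrative attributes, e.g. price and rating.
price_keys  = equal_frequency_cuts([1, 2, 3, 4, 5, 6, 7, 8], parts=2)
rating_keys = equal_frequency_cuts([10, 20, 30, 40], parts=2)

# Hash-table keys: Cartesian product of the per-attribute keys.
hash_keys = list(product(price_keys, rating_keys))
```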
Of course, since the numeric point attributes are numeric and mutually independent, B-trees could also index the numeric point preference attributes. However, the cost of the B-tree approach is much higher than that of the BRPQ index, so this project does not use B-tree indexes. This is analyzed in detail below, as shown in fig. 7.
Query cost is higher with B-trees. Indexing several numeric point preference attributes with B-trees requires one B-tree per attribute, and those B-trees cannot be combined with the index handling spatial keywords or the index handling numeric segment user-preference attributes. As shown in the figure, with B-trees one must obtain a candidate set satisfying the spatial and keyword query from the spatial-keyword index, a candidate set satisfying the numeric segment preference from the numeric segment user-preference index, and a candidate set satisfying each attribute preference from each B-tree, and then intersect all the candidate sets. With the BRPQ index, by contrast, the preference query is applied directly to the result of the spatial-keyword query over the data set. If the number of numeric point user-preference attributes is p, the B-tree approach requires p+2 passes over the whole data set, whereas the BRPQ index is queried only once. The cost of the spatial-keyword phase is the same in both cases; for the preference phase, BRPQ applies the preference query to the spatial-keyword results, while the B-tree approach runs p preference queries over the data set to find the objects satisfying each attribute preference. Moreover, BRPQ's numeric point preference lookup uses hashing with time complexity O(1), whereas a single B-tree lookup costs O(log2 N), where N is the number of indexed entries, and p B-trees cost O(p log2 N). In summary, the B-tree-based query cost is clearly higher than BRPQ's.
Index space is larger with B-trees. Indexing several numeric point preference attributes with B-trees requires one B-tree per attribute. With p numeric point user-preference attributes and n data-set objects, not only must (p+2)n data pointers be stored, but the p B-tree indexes themselves occupy considerable space, whereas BRPQ only needs additional hash-table keys.
Insertion cost is higher with B-trees. With B-tree indexes on the numeric point user-preference attributes, each object must be inserted p+2 times across all the indexes, whereas BRPQ requires only one insertion. The cost of inserting into the spatial-keyword index is the same in both cases, but for the preference attributes, BRPQ uses hashing with O(1) insertion cost, while the worst-case B-tree insertion cost is O(log n) and inserting into p B-trees costs O(p log n). The B-tree-based insertion cost is therefore clearly higher than BRPQ's.
(2) Inverted file. Indexes the numeric segment user-preference attribute information as key-value pairs (key, pL), where key identifies the numeric segment user-preference attribute information and pL points to the inverted list corresponding to key; the inverted list stores object identifiers (or pointers pO), each uniquely identifying an object.
When querying the inverted-file index, the DAAT (document-at-a-time) algorithm efficiently finds entries that appear simultaneously in several inverted lists. A numeric segment preference query therefore first prunes a large number of branches quickly using the key values and then queries efficiently with the DAAT algorithm.
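Document-at-a-time intersection over sorted inverted lists, as referenced above, can be sketched as follows; this is an illustrative implementation, not the system's actual DAAT code.

```python
# Sketch of DAAT intersection: advance one cursor per posting list,
# emitting ids present in all ascending-sorted lists simultaneously.

def daat_intersect(lists):
    if not lists:
        return []
    cursors = [0] * len(lists)
    out = []
    while all(c < len(l) for c, l in zip(cursors, lists)):
        current = [l[c] for c, l in zip(cursors, lists)]
        top = max(current)
        if all(v == top for v in current):
            out.append(top)
            cursors = [c + 1 for c in cursors]
        else:
            # advance every cursor whose posting is behind the max
            cursors = [c + (l[c] < top) for c, l in zip(cursors, lists)]
    return out

hits = daat_intersect([[1, 3, 5, 7], [3, 4, 5, 8], [2, 3, 5, 9]])
```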
PSPQ in combination with keyword index
The PSPQ index supports cooperative pruning on the numeric point and numeric segment user-preference attributes; to answer spatial-keyword range queries under user-preference constraints, PSPQ must be combined with a spatial-keyword index.
Spatial-keyword querying has been studied extensively; spatial position is usually indexed with a quadtree or an R-tree. As shown in fig. 8, BRPQ adopts a quadtree, though an R-tree could also be used. Because retrieving ordered keywords with an ordered keyword trie yields few intermediate results and better query performance, text information is indexed with an ordered keyword trie, whose pruned paths clearly cannot satisfy the keyword query. The structure of the combined BRPQ index is shown in the figure. Each quadtree child node contains a pointer to an ordered keyword trie, and a threshold β is set on the number of objects associated with a quadtree node; when the threshold is reached, the node splits. To guarantee few intermediate results, leaf nodes (hereafter, only nodes associated with an ordered keyword trie are called leaf nodes) continue to divide until each quadtree node contains at most k objects. Each path of the ordered keyword trie contains a pointer into the PSPQ index.
Spatial keyword distributed query algorithm with relational attributes
In recent years, data volumes have grown from the GB level to the TB and even PB level, so managing large amounts of data plays a very important role in data analysis, and efficiently indexing and querying data has become a bottleneck for it. When retrieving data, a search engine should find, as far as possible, the data that best matches the text, is nearest geographically, and satisfies the relational attributes. However, because of the surge in data volume, currently available search engines cannot efficiently answer query statements containing these three types of information, since no corresponding efficient index processes these data types simultaneously; even existing spatial-keyword queries do not consider relational attributes, which reduces query efficiency. Moreover, in the era of large-scale data, traditional index structures and algorithms are implemented on a single machine, and their processing speed cannot meet user requirements. To solve these problems, a baseline distributed indexing algorithm KLPDQ (Baseline Algorithm for Keywords and Location-aware with relational attributes Distributed Query index) supporting keywords, geographic location information, and relational attributes is proposed; on this basis, a more efficient spatial-keyword query processing algorithm with relational attributes is provided. The algorithm not only indexes the three data types simultaneously but also markedly reduces indexing and query time by adopting distributed indexing and query processing.
Spatial keyword Object with relational attributes: O = {K, G, P}. O.K is the object's keyword set, O.G is its latitude-longitude coordinates, and O.P is its set of relational attributes, with O.P = {P', S}: O.P.P' is the set of numeric point relational attributes, each element representing a minimum value of that attribute, and O.P.S is the set of numeric segment relational attributes. In the present invention, the numeric segment attribute considers only a merchant's business-hours attribute.
Spatial keyword range Query with relational attributes: Q = {K, R, P}. Q.K is the query's keyword set, and Q.R is the query's geographic range, with Q.R = {T, L, B, R}: Q.R.T is the upper-left corner of the geographic range, Q.R.L the lower-left corner, Q.R.B the lower-right corner, and Q.R.R the upper-right corner. Q.P is the query's set of relational-attribute constraints; like an object's, it comprises numeric point and numeric segment queries.
The necessary and sufficient conditions for an object to satisfy a query are: the object must contain the keywords of the query statement, lie within the query's spatial range, have each numeric point attribute no smaller than the query's, and fall within the query's numeric segment attribute range.
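The satisfaction conditions just stated can be written as a predicate. In the sketch below, the dictionary layout, the rectangle encoding, and the interpretation that the query's time span must fall within the object's business-hours segment are illustrative assumptions.

```python
# Sketch of the object-satisfies-query predicate: all query keywords
# present, location inside the query rectangle, each numeric point
# attribute >= the queried minimum, query segment within the object's
# numeric segment (modeled as business hours covering the queried span).

def satisfies(obj, query):
    if not set(query["K"]) <= set(obj["K"]):
        return False
    (x, y) = obj["G"]
    (xmin, ymin, xmax, ymax) = query["R"]
    if not (xmin <= x <= xmax and ymin <= y <= ymax):
        return False
    if any(obj["P"][a] < v for a, v in query["Pmin"].items()):
        return False
    lo, hi = obj["S"]
    qlo, qhi = query["S"]
    return lo <= qlo and qhi <= hi

shop = {"K": {"coffee", "wifi"}, "G": (3, 4), "P": {"rating": 4.5}, "S": (8, 22)}
q = {"K": {"coffee"}, "R": (0, 0, 10, 10), "Pmin": {"rating": 4.0}, "S": (9, 18)}
ok = satisfies(shop, q)
```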
Because both the indexing and querying processes take place in a big data environment, the mechanism is a Lucene-based distributed indexing and querying mechanism: spatial, textual, and relational attributes are indexed together, multidimensional information is converted into one-dimensional information, and spatial keyword objects are stored and queried at big data scale using Lucene and Hadoop.
Lucene index construction
Lucene's basic concepts include indexes, documents, fields, and terms. An index holds a series of documents; a document holds a series of fields; a field holds a series of terms. The same string in different fields constitutes different terms, so a term represents a pair of strings: a field name and a string within that field.
(1) Index: in computing, an inverted index is a data structure that stores a mapping from text to its locations, such as from a word to its positions in database files.
(2) Segment: a Lucene index is cut into smaller chunks called segments; each segment is itself an index. All segments are accessed in turn during a Lucene query.
(3) Document: the document is the smallest unit in the Lucene indexing and querying process. A document holds a series of fields, and each field has a field name and a series of values. Because every field is stored within the document, each field needs a unique field name to prevent conflicts.
(4) Field: each field consists of two parts, a field name and a value. The value may be any string or number, and these values represent the data. A value may optionally be stored in the index so that, when the document is selected, the various types of information within the field can be returned.
(5) Term: a term represents a string and consists of two elements: the text of the string and the name of the field it belongs to.
Lucene index establishment procedure
For example, consider the following three text messages:
T0=I was a student,and I came from Nanjing.
T1=My father is a teacher,and I am a student.
T2=Nanjing is my hometown,but my father’s hometown is Chengdu.
First, keywords are extracted from the three texts with a tokenizer. The procedure is as follows.
1) In the preprocessing stage, Lucene uses a tokenizer to split the text content into segments. For Chinese, a Chinese parser may be chosen because semantics are involved.
2) In the analysis stage, Lucene filters out punctuation and stop words and normalizes the case of each word. The indexer then calls the function addDocument(Doc), passing the input to Lucene for indexing. The processed data is stored in the index file and written to disk as inverted-file data.
After the preprocessing and analysis stages, the keyword information of the three pieces of text data becomes:
key0=I am student I come from Nanjing.
key1=My father is teacher I am student.
key2=Nanjing is my hometown my father hometown is Chengdu.
The inverted index can now be built from this keyword information. Once the index is established, updates and deletions can be performed through the IndexReader.
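The walkthrough above can be reproduced with a small inverted-index builder over the three sample texts. The stop-word list and lowercase normalization below are illustrative assumptions rather than Lucene's actual analyzer behavior.

```python
# Sketch of inverted-index construction over the three sample texts:
# tokenize, drop punctuation and a few stop words, then map each term
# to the set of document ids containing it.

import re

docs = {
    0: "I was a student,and I came from Nanjing.",
    1: "My father is a teacher,and I am a student.",
    2: "Nanjing is my hometown,but my father's hometown is Chengdu.",
}
STOP = {"a", "and", "but", "was", "the"}   # illustrative stop-word list

def build_inverted_index(docs):
    index = {}
    for doc_id, text in docs.items():
        for token in re.findall(r"[A-Za-z]+", text.lower()):
            if token not in STOP:
                index.setdefault(token, set()).add(doc_id)
    return index

idx = build_inverted_index(docs)
```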
While the invention has been described above through a general description and specific embodiments, it will be apparent to those skilled in the art that modifications and improvements can be made on this basis. Such modifications and improvements made without departing from the spirit of the invention fall within the scope of the claimed invention.

Claims (10)

1. A big data oriented multi-storage engine database system comprising:
the SQL compiler supports the ANSI SQL 92 and SQL 99 standards and the ANSI SQL 2003 OLAP core extensions, meets the SQL requirements of data warehouse business, and facilitates smooth application migration;
the stored procedure compiler supports complete data types, flow control, packages, cursors, exception handling and dynamic SQL execution, and supports high-speed statistics, insert, delete and update operations and distributed transactions within stored procedures, enabling migration of data applications from a relational database to the target platform;
the transaction management unit implements consistency and isolation control through a two-phase locking protocol and MVCC, and supports the Serializable Snapshot Isolation level, thereby guaranteeing transaction consistency under concurrency;
the distributed columnar memory store uses a memory- or SSD-based columnar storage engine, Holodesk, to store data in columnar form in memory or on SSD, together with a memory-based execution engine, completely avoiding the delay caused by IO;
the distributed execution engine builds an independent distributed data layer, keeping computation data outside the JVM memory space of the compute engine, which effectively reduces the impact of JVM GC on system performance and stability;
the data source connector connects the execution engine to various data sources, enabling real-time statistical analysis across the data access engines of different data sources without importing the data into HDFS in advance, making it easier for users to build diversified business requirements;
the multi-tenant management component provides complete multi-tenant management functions, including tenant resource management, tenant authority management and a security control module, facilitating enterprise multi-tenant management and allocation on a unified big data platform; CPU and memory resources can be configured and managed per tenant, with different tenants using different CPU and memory resource pools so that tenants do not interfere with each other;
the middleware management unit supports the JDBC 4.0 and ODBC 3.5 standards, so it can support Hibernate/Spring middleware, is fully compatible with Tableau/QlikView/Cognos reporting tools, and can interoperate fully with an enterprise's existing data application layer.
2. The big data oriented multi-storage engine database system of claim 1, wherein the stored procedure compiler comprises a complete optimizer including a CFG optimizer, a parallel optimizer and a DAG optimizer, the CFG optimizer optimizing stored-procedure code to perform loop unrolling, redundant code elimination and function inlining.
3. The big data oriented multi-storage engine database system of claim 1, wherein the transaction management unit supports starting a transaction with BEGIN TRANSACTION and ending it with COMMIT or ROLLBACK, implements consistency and isolation control through a two-phase locking protocol and MVCC, and supports the Serializable Snapshot Isolation level, thereby guaranteeing transaction consistency under concurrency.
4. The big data oriented multi-storage engine database system of claim 1, wherein Holodesk supports building a distributed index on data fields, allowing a user to build an OLAP Cube over multi-field combinations and store the Cube directly in memory or on SSD without additional BI tools to build the Cube.
5. The big data oriented multi-storage engine database system of claim 1, wherein the distributed execution engine is based on a cost optimizer and a rule optimizer, with 100 optimization rules ensuring that SQL applications achieve maximum performance without manual modification; the distributed execution engine includes two execution modes: a low-latency mode and a high-throughput mode.
6. The big data oriented multi-storage engine database system of claim 1, wherein, after an execution plan is started, the data source connector extracts the needed data from other data sources through pre-established connections, passes it to the execution engine layer to participate in SQL computation, and releases the related database connections and corresponding resources after the computation is completed.
7. A method of storing a big data oriented multi-storage engine database employing the big data oriented multi-storage engine database system of any of claims 1-6, comprising:
step 1, unified data storage: integrating each large database system that is physically independent and separately managed into one complete unified data store;
step 2, unified metadata management: the logical data warehouse contains a metadata repository, an RDBMS cluster dedicated to storing and managing metadata; whenever a new table or index is created in any database system, the logical data warehouse synchronously creates a metadata two-dimensional mapping table in the metadata repository and records its relation to the original table and/or index, laying the foundation for unified SQL access; the metadata management system stores information on all database objects and provides a query interface for other systems, through which the physical distribution of data, the distribution characteristics of the data, maximum/minimum values and authority information can be retrieved;
step 3, unified data model: uniformly mapping the data models of the multi-engine database systems into relational two-dimensional mapping tables of a relational database, laying a theoretical foundation for centralized unified metadata management and unified SQL access;
step 4, unified SQL analysis engine: providing unified SQL parsing, optimization and execution services for all data systems in the logical data warehouse; SQL statements submitted by users are first received by the SQL engine, converted into Spark code after parsing and optimization, and then executed by a high-performance distributed computing cluster, which calls the native APIs of each system during execution to access the data;
step 5, unified security management and control: an authentication system unified around users and roles follows an account and/or role RBAC model, with authority managed through roles and batch authorization management applied to users; the Kerberos security protocol is supported, LDAP is used as the account management system, and account information undergoes unified security authentication through Kerberos.
8. The method of claim 7, wherein the unified metadata management of step 2 further comprises providing metadata access and collection for Hive, HDFS and HBase; providing a UI for unified metadata management and Restful API interfaces for the related services, providing micro-service docking modes of each type, and storing the data of the unified metadata management UI foreground pages in MySQL database tables; providing a data message bus mode with a message queue and metadata operation interfaces exposed as APIs; providing a unified metadata Type System, a graph computation and storage query engine layer, an intelligent labeling algorithm and a knowledge graph model; and providing a common storage encapsulation layer for the graph computation query engine, supporting the JanusGraph open-source graph computation and storage query engine.
9. The method for storing the big data oriented multi-storage engine database of claim 8, wherein the unified metadata management of step 2 further comprises real-time, efficient indexing of heterogeneous graph, key-value, document and relational data, supporting mixed indexes (BRPQ) over numerical point attributes, numerical range attributes, spatial positions and text, including PSPQ index construction, combination of PSPQ and keyword indexes, spatial keyword distributed queries with relational attributes, and Lucene-based index construction.
10. The method for storing the big data oriented multi-storage engine database of claim 7, wherein the unified security management architecture of step 5 is divided into 4 layers: the system layer uses an improved Apache DS, improving read/write efficiency by more than 10x, using the same set of users and a unified LDAP/Kerberos authentication mode, avoiding the use of Kerberos authentication through OpenLDAP and accelerating LDAP authentication; the service layer implements complete ARBAC model support and provides a REST API, a user-friendly Web UI and password policy support; the plug-in layer provides authentication, authorization, group mapping and quota management for each component in plug-in form, so that all components use a unified user, group and authority management model; the application layer interfaces with the PaaS and SaaS services of each platform, which are protected by unified security control.
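The transaction behavior claimed above (MVCC with Serializable Snapshot Isolation, claims 1 and 3) rests on the general MVCC visibility rule: each transaction reads from a snapshot of the commit clock taken at its start, and only versions committed before that snapshot are visible. The following is a minimal sketch of that rule, assuming a toy version-chain store; it illustrates the general technique, not the patent's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Version:
    value: object
    begin_txn: int                       # id of the transaction that wrote this version
    committed_at: Optional[int] = None   # commit timestamp; None while uncommitted

class MVCCStore:
    """Toy multi-version store: each key holds a chain of versions, newest last."""
    def __init__(self):
        self.chains = {}   # key -> list[Version]
        self.clock = 0     # logical commit clock

    def begin(self):
        # A transaction's snapshot is the commit clock value at start time.
        self.clock += 1
        return self.clock

    def write(self, txn, key, value):
        # Append a new uncommitted version to the key's version chain.
        self.chains.setdefault(key, []).append(Version(value, txn))

    def commit(self, txn):
        # Stamp all of this transaction's versions with a fresh commit timestamp.
        self.clock += 1
        for chain in self.chains.values():
            for v in chain:
                if v.begin_txn == txn and v.committed_at is None:
                    v.committed_at = self.clock

    def read(self, snapshot, key):
        # Snapshot isolation: see the newest version committed before our snapshot.
        for v in reversed(self.chains.get(key, [])):
            if v.committed_at is not None and v.committed_at <= snapshot:
                return v.value
        return None

store = MVCCStore()
t1 = store.begin()
store.write(t1, "x", 1)
store.commit(t1)

t2 = store.begin()           # t2's snapshot is taken here
t3 = store.begin()
store.write(t3, "x", 2)
store.commit(t3)             # t3 commits after t2's snapshot

print(store.read(t2, "x"))   # → 1 : t2 does not see t3's later commit
```

Because readers walk version chains rather than acquire read locks, reads never block writes; the two-phase locking mentioned in the claims would then apply to write-write conflicts on the newest version.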
CN202310883974.8A 2023-07-19 2023-07-19 Storage method of big data-oriented multi-storage engine database Pending CN117112691A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310883974.8A CN117112691A (en) 2023-07-19 2023-07-19 Storage method of big data-oriented multi-storage engine database

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310883974.8A CN117112691A (en) 2023-07-19 2023-07-19 Storage method of big data-oriented multi-storage engine database

Publications (1)

Publication Number Publication Date
CN117112691A true CN117112691A (en) 2023-11-24

Family

ID=88801026

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310883974.8A Pending CN117112691A (en) 2023-07-19 2023-07-19 Storage method of big data-oriented multi-storage engine database

Country Status (1)

Country Link
CN (1) CN117112691A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117632035A * 2023-12-13 2024-03-01 中国电子投资控股有限公司 Data storage method, system, storage medium and computer equipment
CN117632035B * 2023-12-13 2024-06-04 中国电子投资控股有限公司 Data storage method, system, storage medium and computer equipment
CN117891895A * 2024-03-15 2024-04-16 创云融达信息技术(北京)有限公司 Method, system and equipment for managing unstructured data in data center platform
CN117891895B * 2024-03-15 2024-07-05 创云融达信息技术(北京)有限公司 Method, system and equipment for managing unstructured data in data center platform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination