CN111258977A - Tax big data storage and analysis platform

Info

Publication number: CN111258977A
Application number: CN202010024869.5A
Authority: CN
Inventors: 王国强; 程林; 杨培强
Original assignee: Shandong Inspur Business System Co Ltd
Current assignee: Shandong Inspur Business System Co Ltd
Priority date: 2020-01-10
Filing date: 2020-01-10
Publication date: 2020-06-09

Abstract

The invention discloses a tax big data storage and analysis platform and belongs to the technical field of tax data. The platform comprises Ambari, a Kudu cluster and an Impala cluster. Ambari provides a visual operation interface. The Kudu cluster comprises at least two management nodes and a plurality of data storage nodes, the two management nodes serving as the master node and the standby node respectively. The Impala cluster comprises the Impala Daemon, the Impala Catalog and the Impala Statestore, and the Impala Daemon is deployed on the same nodes as the data storage nodes of the Kudu cluster. The platform provides simple, fast and accurate data processing and analysis for the storage, analysis and computation of tax big data, and has good value for popularization and application.

Description

Tax big data storage and analysis platform

Technical Field

The invention relates to the technical field of tax data, and particularly provides a tax big data storage and analysis platform.

Background

After the merger of the tax agencies, tax-related data must be comprehensively aggregated by means of big data, so that the advanced technical capabilities of big data can be brought to bear and drive the integration and development of tax services.

The tax system has accumulated many years of historical data of various kinds, but traditional storage devices cannot meet the long-term storage requirements of tax big data. In addition, because of limits in the technical architecture and in server computing resources, system performance drops sharply once the data volume reaches the TB level: response is slow, reliability and security are poor, and data processing and analysis are inefficient.

Traditional solutions include distributed clusters built on relational databases such as Oracle, and HBase and Hive distributed clusters that use HDFS as the underlying storage. Distributed clusters built on traditional databases suffer from slow cross-region data reads and inefficient analysis and computation on large data volumes. HBase is a column-oriented database that stores data efficiently and answers queries quickly, but its queries are essentially limited to row-key conditions and it does not support SQL. Hive serves as a data warehouse and is mainly used for offline analysis, typically long-running computations over large data volumes, so it is not suited to fast, real-time reads and writes. In contrast, Kudu as the underlying data store supports high-concurrency, low-latency queries while retaining good analytical capability, which allows it to serve both OLTP and OLAP workloads. Impala is an interactive SQL query engine whose syntax is highly compatible with Hive. It provides standard ODBC and JDBC interfaces and executes queries more efficiently than Hive.

Disclosure of Invention

The technical task of the invention is to provide a tax big data storage and analysis platform which can provide simple, rapid and accurate data processing and analysis capability for storage, analysis and calculation of tax big data.

In order to achieve the purpose, the invention provides the following technical scheme:

A tax big data storage and analysis platform comprises Ambari, a Kudu cluster and an Impala cluster. Ambari provides a visual operation interface and is directly associated with the Kudu cluster and the Impala cluster. The Kudu cluster comprises at least two management nodes and a plurality of data storage nodes, the two management nodes serving as the master node and the standby node respectively. The Impala cluster comprises the Impala Daemon, the Impala Catalog and the Impala Statestore, and the Impala Daemon is deployed on the same nodes as the data storage nodes of the Kudu cluster.

Impala is a query system developed mainly by Cloudera. It provides SQL semantics and can query PB-level big data stored in Hadoop's HDFS and HBase.

Kudu is a storage engine open-sourced by Cloudera that provides both low-latency random reads and writes and efficient data analysis. It is a new storage component that combines functions of HDFS and HBase and sits between the two.

The Impala Daemon is the daemon process of Impala and its core component; the name of the process on each node is impalad.

The Impala Statestore is responsible for collecting the resource information of the Impala processes distributed across the cluster and the health of each node, and for synchronizing node information.

The Impala Catalog is the catalog service component of Impala; it relays the metadata changes produced by Impala SQL statements to all DataNodes in the cluster. The process name of this service is catalogd, and an Impala cluster needs only one catalogd process. When SQL statements executed in the Impala cluster change the metadata, the catalogd service pushes the change to the other Impala process nodes.

Ambari provides a visual management platform with functions such as cluster installation, configuration, modification, expansion and monitoring. During operation, maintenance and use, it monitors the health of the cluster and the usage of storage, memory, IO and other resources in real time, giving the big data platform a stable and reliable foundation. Before cluster deployment, the number of cluster nodes is planned according to the data storage and analysis requirements. Ambari is deployed on one node, and its web management interface can be accessed directly through port 8080.

The Kudu cluster comprises at least two management nodes. When the Kudu cluster has two management nodes, they serve as the master node and the standby node respectively, and only one of them acts as the master node at any time. If the management node currently in use becomes unavailable, a new one is determined through election, which ensures the availability of the Kudu cluster. Data storage nodes can be added continuously as storage requirements grow, which ensures the scalability of the Kudu cluster.
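The failover behavior described above can be illustrated with a minimal Python sketch. The class and method names are hypothetical, and the "first healthy node wins" rule is only a simplified stand-in for the election that Kudu actually performs among its management nodes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ManagementNode:
    name: str
    healthy: bool = True


@dataclass
class ManagementGroup:
    nodes: List[ManagementNode]
    master: Optional[ManagementNode] = field(default=None)

    def elect(self) -> ManagementNode:
        """Keep the current master while it is healthy; otherwise choose a new one.

        Assumption: the first healthy node in the list wins the election.
        """
        if self.master is not None and self.master.healthy:
            return self.master
        for node in self.nodes:
            if node.healthy:
                self.master = node
                return node
        raise RuntimeError("no healthy management node available")


group = ManagementGroup([ManagementNode("master-1"), ManagementNode("master-2")])
first = group.elect()      # initial master
first.healthy = False      # simulate a failure of the active master
second = group.elect()     # the standby node takes over
```

At every point in the sketch exactly one node is recorded as the master, which is the invariant the paragraph above relies on.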

The Impala Daemon is deployed on the same nodes as the data storage nodes of the Kudu cluster. When Impala reads data from the Kudu cluster, it reads directly from local tables, which avoids the network, IO and other overhead of fetching data remotely and keeps the cluster efficient.
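The benefit of co-location can be sketched as a scheduling rule: prefer an Impala Daemon that runs on a host already holding the data. The function below is a hypothetical illustration in Python, not Impala's actual scheduler, and the host and tablet names are made up.

```python
from typing import Dict, List


def assign_scans(tablet_locations: Dict[str, List[str]],
                 impala_hosts: List[str]) -> Dict[str, str]:
    """Map each tablet to the Impala Daemon host that should scan it.

    A daemon on a host that stores a replica of the tablet is preferred;
    only if none exists is a remote daemon chosen (round-robin).
    """
    assignment: Dict[str, str] = {}
    fallback = 0
    for tablet, replica_hosts in sorted(tablet_locations.items()):
        local = [host for host in replica_hosts if host in impala_hosts]
        if local:
            assignment[tablet] = local[0]
        else:
            assignment[tablet] = impala_hosts[fallback % len(impala_hosts)]
            fallback += 1
    return assignment


# Every data storage node also runs an Impala Daemon (the co-located layout).
locations = {"tablet-a": ["node1", "node2"], "tablet-b": ["node3"]}
plan = assign_scans(locations, ["node1", "node2", "node3"])
remote_reads = sum(1 for tablet, host in plan.items()
                   if host not in locations[tablet])
```

With a daemon on every data storage node, no scan in this sketch has to cross the network; removing a daemon from a host forces the tablets stored only there to be read remotely.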

Preferably, all data of the Kudu cluster is stored in the data storage nodes; each table in the data storage nodes has its own table structure, primary key and partitioning, and its data is stored in primary-key order.
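A toy Python model can make the combination of partitioning and primary-key ordering concrete. The partition count, the use of CRC32 as the hash function and the `ToyTable` class are all assumptions chosen for illustration; they do not describe Kudu's internal implementation.

```python
import bisect
import zlib
from typing import Dict, List, Tuple

NUM_PARTITIONS = 4  # assumed partition count, not taken from the source


def partition_of(primary_key: str) -> int:
    """Hash-partition a row by its primary key (CRC32 is an arbitrary choice)."""
    return zlib.crc32(primary_key.encode("utf-8")) % NUM_PARTITIONS


class ToyTable:
    """Rows are grouped by partition and kept sorted by primary key."""

    def __init__(self) -> None:
        self.partitions: Dict[int, List[Tuple[str, dict]]] = {
            p: [] for p in range(NUM_PARTITIONS)
        }

    def insert(self, primary_key: str, row: dict) -> None:
        rows = self.partitions[partition_of(primary_key)]
        keys = [key for key, _ in rows]
        pos = bisect.bisect_left(keys, primary_key)  # sorted insert position
        if pos < len(keys) and keys[pos] == primary_key:
            raise KeyError(f"duplicate primary key: {primary_key}")
        rows.insert(pos, (primary_key, row))

    def keys_in(self, partition: int) -> List[str]:
        return [key for key, _ in self.partitions[partition]]


table = ToyTable()
for taxpayer_id in ["T0042", "T0007", "T0100", "T0001", "T0055"]:
    table.insert(taxpayer_id, {"taxpayer_id": taxpayer_id})
```

Regardless of the insertion order, each partition lists its keys in ascending order, and a second insert with an existing primary key is rejected.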

Preferably, the data in the data storage nodes is divided into tablets; a tablet keeps adjacent data together, and the data storage nodes carry out the read and write operations on the tablets. A tablet has multiple replicas placed on different servers, and only one replica is the leader at any time. Every follower replica can serve reads, while writes must be applied consistently across the replicas. One data storage node can serve the reads and writes of multiple tablets.
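The replica behavior described above can be sketched as follows. This is a simplified Python model under stated assumptions: a write succeeds only when the leader and a majority of replicas are online, and every online replica may answer reads. The `Replica` and `Tablet` classes are hypothetical and omit the log replication that a real consensus protocol performs.

```python
from typing import Dict, List


class Replica:
    def __init__(self, server: str) -> None:
        self.server = server
        self.online = True
        self.rows: Dict[str, dict] = {}


class Tablet:
    def __init__(self, servers: List[str]) -> None:
        self.replicas = [Replica(server) for server in servers]
        self.leader = self.replicas[0]  # exactly one leader at a time

    def write(self, key: str, row: dict) -> bool:
        """Apply a write only if the leader and a majority of replicas are online."""
        online = [r for r in self.replicas if r.online]
        if not self.leader.online or len(online) <= len(self.replicas) // 2:
            return False
        for replica in online:
            replica.rows[key] = row
        return True

    def read(self, key: str, server: str) -> dict:
        """Any online replica, leader or follower, can serve a read."""
        replica = next(r for r in self.replicas
                       if r.server == server and r.online)
        return replica.rows[key]


tablet = Tablet(["ts1", "ts2", "ts3"])
accepted = tablet.write("inv-1", {"amount": 100})
from_follower = tablet.read("inv-1", "ts2")

# With two of three replicas offline there is no majority, so the write fails.
tablet.replicas[1].online = False
tablet.replicas[2].online = False
rejected = tablet.write("inv-2", {"amount": 50})
```

The majority rule is what keeps the replicas consistent in this model: a write that cannot reach more than half of the replicas is refused rather than applied to a minority.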

Preferably, the management nodes store all metadata, and only one management node is the master node at any time.

Preferably, the Impala Daemon is the core process: it receives client requests, generates the query plan, coordinates with the Impala Catalog and the Impala Statestore to execute the query, and aggregates the query results and returns them to the client.
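The request flow in the paragraph above (receive, plan, execute, aggregate, return) can be sketched in a few lines of Python. The functions are hypothetical and reduce the query to a single SUM over per-node data; a real query plan is far richer.

```python
from typing import Dict, List


def plan_query(node_data: Dict[str, List[int]]) -> List[str]:
    """Simplified query plan: one scan fragment per node that holds data."""
    return sorted(node for node, rows in node_data.items() if rows)


def execute_fragment(rows: List[int]) -> int:
    """Partial aggregation, run on the node that stores the rows."""
    return sum(rows)


def coordinate(node_data: Dict[str, List[int]]) -> int:
    """Coordinator role: plan, dispatch the fragments, merge partial results."""
    fragments = plan_query(node_data)
    partials = [execute_fragment(node_data[node]) for node in fragments]
    return sum(partials)


# Hypothetical tax amounts stored on three nodes.
amounts_by_node = {"node1": [100, 250], "node2": [75], "node3": []}
total = coordinate(amounts_by_node)
```

Only the small partial sums travel back to the coordinating daemon, which then returns the merged result to the client.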

Preferably, the Impala Statestore is responsible for notifying and distributing cluster metadata; the metadata includes the Impala catalog and cluster membership, and SQL queries depend on this metadata.

Preferably, the Impala Daemon and the data storage nodes of the Kudu cluster are deployed on the same nodes for deep integration, and SQL statements executed through the Impala Daemon create, insert into, update and delete from the Kudu tax database tables.
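The four operations named above might look like the following Impala SQL statements, assembled here as Python strings. The table name `tax_invoice`, its columns and the partition count are hypothetical examples, not part of the source. The statements would be submitted to an Impala Daemon through a client such as the ODBC or JDBC interfaces mentioned earlier, which this sketch does not do.

```python
from typing import Dict, List


def create_kudu_table_sql(table: str, columns: Dict[str, str],
                          primary_key: List[str], hash_partitions: int) -> str:
    """Build an Impala CREATE TABLE statement for a Kudu-backed table."""
    column_defs = ",\n  ".join(f"{name} {sql_type}"
                               for name, sql_type in columns.items())
    key_list = ", ".join(primary_key)
    return (
        f"CREATE TABLE {table} (\n"
        f"  {column_defs},\n"
        f"  PRIMARY KEY ({key_list})\n"
        f")\n"
        f"PARTITION BY HASH ({key_list}) PARTITIONS {hash_partitions}\n"
        f"STORED AS KUDU"
    )


# Hypothetical tax table: name, columns and partition count are assumptions.
create_stmt = create_kudu_table_sql(
    "tax_invoice",
    {"invoice_id": "BIGINT", "taxpayer_id": "STRING", "amount": "DOUBLE"},
    primary_key=["invoice_id"],
    hash_partitions=4,
)
insert_stmt = "INSERT INTO tax_invoice VALUES (1, 'T0001', 100.0)"
update_stmt = "UPDATE tax_invoice SET amount = 120.0 WHERE invoice_id = 1"
delete_stmt = "DELETE FROM tax_invoice WHERE invoice_id = 1"
```

The primary-key columns are listed first and reused as the hash-partitioning columns, matching the requirement that each Kudu table declares a primary key and a partitioning scheme.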

Preferably, the Impala Catalog is responsible for creating and updating the metadata of databases and tables, and Impala Catalog updates are distributed to the Impala Daemons by the Impala Statestore.
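The distribution path described above (Catalog to Statestore to Daemons) resembles a publish/subscribe pattern, sketched below in Python. The class names mirror the components, but the methods and the topic name `catalog-update` are illustrative assumptions rather than Impala's real interfaces.

```python
from typing import Callable, Dict, List


class Statestore:
    """Forwards each published update to every subscriber of the topic."""

    def __init__(self) -> None:
        self.subscribers: Dict[str, List[Callable[[dict], None]]] = {}

    def subscribe(self, topic: str, callback: Callable[[dict], None]) -> None:
        self.subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic: str, update: dict) -> None:
        for callback in self.subscribers.get(topic, []):
            callback(update)


class ImpalaDaemon:
    def __init__(self, name: str, statestore: Statestore) -> None:
        self.name = name
        self.tables: Dict[str, List[str]] = {}  # local metadata cache
        statestore.subscribe("catalog-update", self.tables.update)


class Catalog:
    def __init__(self, statestore: Statestore) -> None:
        self.statestore = statestore
        self.tables: Dict[str, List[str]] = {}

    def create_table(self, name: str, columns: List[str]) -> None:
        """Record the metadata change, then hand it to the Statestore."""
        self.tables[name] = columns
        self.statestore.publish("catalog-update", {name: columns})


statestore = Statestore()
daemons = [ImpalaDaemon(f"impalad-{i}", statestore) for i in range(3)]
catalog = Catalog(statestore)
catalog.create_table("tax_invoice", ["invoice_id", "taxpayer_id", "amount"])
```

After the call to `create_table`, every daemon's local cache contains the new table without the Catalog having contacted any daemon directly.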

Compared with the prior art, the tax big data storage and analysis platform has the following outstanding beneficial effects. For massive tax data, storing and analyzing data with Kudu and Impala is reasonable, feasible and efficient. Ambari makes rapid deployment, expansion, and operation and maintenance management of the cluster easy, which addresses the storage problem posed by the rapid growth of tax data. Impala allows fast, accurate and in-depth mining of the data, which supports decision-making in data management, research and similar work. The platform therefore has good value for popularization and application.

Drawings

FIG. 1 is a structural framework diagram of the tax big data storage and analysis platform according to the invention.

Detailed Description

The tax big data storage and analysis platform of the invention will be further described in detail with reference to the accompanying drawings and embodiments.

Examples

As shown in FIG. 1, the tax big data storage and analysis platform of the present invention comprises Ambari, a Kudu cluster and an Impala cluster.

Ambari provides a visual operation interface and is directly associated with the Kudu cluster and the Impala cluster. It provides a visual management platform with functions such as cluster installation, configuration, modification, expansion and monitoring. During operation, maintenance and use, it monitors the health of the cluster and the usage of storage, memory, IO and other resources in real time, giving the big data platform a stable and reliable foundation. Before cluster deployment, the number of cluster nodes is planned according to the data storage and analysis requirements. Ambari is deployed on one node, and its web management interface can be accessed directly through port 8080.

When the Kudu cluster is deployed, it comprises two management nodes that serve as the master node and the standby node respectively. Only one management node acts as the master node at any time; if the management node currently in use becomes unavailable, a new one is determined through election, which ensures the availability of the Kudu cluster. Data storage nodes can be added continuously as storage requirements grow, which ensures the scalability of the Kudu cluster. In the Kudu cluster configuration stage, the configuration is tuned to the actual environment through the following measures: setting the block_cache_capacity_mb parameter, which controls the maximum amount of memory allocated to the block cache of a data storage node; and setting the memory_limit_hard_bytes parameter, which is the maximum amount of memory a data storage node may use. Because this parameter strongly affects the write capacity of the Kudu cluster, it is generally set to 80% of the machine's memory.
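The 80% rule above can be turned into a small calculation. The following Python sketch derives the two flag values for a data storage node; the 64 GiB machine size and the 4096 MB block cache value are assumptions chosen for illustration, and the helper name `kudu_tserver_flags` is hypothetical. Only the two parameter names come from the text.

```python
from typing import List

GIB = 1024 ** 3
MIB = 1024 ** 2


def kudu_tserver_flags(total_memory_bytes: int,
                       memory_percent: int = 80,
                       block_cache_mb: int = 4096) -> List[str]:
    """Return flag lines for a data storage node.

    memory_percent follows the 80% guideline in the text; block_cache_mb
    is an illustrative value that should be tuned per environment.
    """
    hard_limit = total_memory_bytes * memory_percent // 100
    if block_cache_mb * MIB >= hard_limit:
        raise ValueError("block cache must be smaller than the hard memory limit")
    return [
        f"--memory_limit_hard_bytes={hard_limit}",
        f"--block_cache_capacity_mb={block_cache_mb}",
    ]


# Assumed machine with 64 GiB of memory.
flags = kudu_tserver_flags(64 * GIB)
```

Integer arithmetic is used for the percentage so that the byte count is exact, and the sanity check prevents a block cache setting that could never fit under the hard limit.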

All data of the Kudu cluster is stored in the data storage nodes; each table in the data storage nodes has its own table structure, primary key and partitioning, and its data is stored in primary-key order. The data in the data storage nodes is divided into tablets; a tablet keeps adjacent data together, and the data storage nodes carry out the read and write operations on the tablets. A tablet has multiple replicas placed on different servers, and only one replica is the leader at any time. Every follower replica can serve reads, while writes must be applied consistently across the replicas. One data storage node can serve the reads and writes of multiple tablets. The management nodes store all metadata, and only one management node is the master node at any time.

The Impala cluster comprises the Impala Daemon, the Impala Catalog and the Impala Statestore, and the Impala Daemon is deployed on the same nodes as the data storage nodes of the Kudu cluster. The Impala Daemon is the daemon process of Impala and its core component; the name of the process on each node is impalad. The Impala Statestore is responsible for collecting the resource information of the Impala processes distributed across the cluster and the health of each node, and for synchronizing node information. The Impala Catalog is the catalog service component of Impala; it relays the metadata changes produced by Impala SQL statements to all DataNodes in the cluster. The process name of this service is catalogd, and an Impala cluster needs only one catalogd process. When SQL statements executed in the Impala cluster change the metadata, the catalogd service pushes the change to the other Impala process nodes.

Because the Impala Daemon is deployed on the same nodes as the data storage nodes of the Kudu cluster, Impala reads data from the Kudu cluster directly from local tables, which avoids the network, IO and other overhead of fetching data remotely and keeps the cluster efficient. The Impala Daemon receives client requests, generates the query plan, coordinates with the Impala Catalog and the Impala Statestore to execute the query, and aggregates the query results and returns them to the client.

The Impala Statestore is responsible for notifying and distributing cluster metadata; the metadata includes the Impala catalog and cluster membership, and SQL queries depend on this metadata. The Impala Daemon and the data storage nodes of the Kudu cluster are deployed on the same nodes for deep integration, and SQL statements executed through the Impala Daemon create, insert into, update and delete from the Kudu tax database tables. The Impala Catalog is responsible for creating and updating the metadata of databases and tables, and Impala Catalog updates are distributed to the Impala Daemons by the Impala Statestore.

The embodiments described above are merely preferred embodiments of the present invention; ordinary changes and substitutions made by those skilled in the art within the technical scope of the present invention fall within its scope of protection.

Claims

1. A tax big data storage and analysis platform, characterized in that: it comprises Ambari, a Kudu cluster and an Impala cluster, wherein Ambari provides a visual operation interface and is directly associated with the Kudu cluster and the Impala cluster; the Kudu cluster comprises at least two management nodes and a plurality of data storage nodes, the two management nodes serving as the master node and the standby node respectively; the Impala cluster comprises the Impala Daemon, the Impala Catalog and the Impala Statestore, and the Impala Daemon is deployed on the same nodes as the data storage nodes of the Kudu cluster.

2. The tax big data storage and analysis platform according to claim 1, wherein: all data of the Kudu cluster is stored in the data storage nodes, each table in the data storage nodes has its own table structure, primary key and partitioning, and the data is stored in primary-key order.

3. The tax big data storage and analysis platform according to claim 2, wherein: the data in the data storage nodes is divided into tablets, a tablet keeps adjacent data together, and the data storage nodes carry out the read and write operations on the tablets.

4. The tax big data storage and analysis platform according to claim 3, wherein: the management nodes store all metadata, and only one management node is the master node at any time.

5. The tax big data storage and analysis platform according to claim 4, wherein: the Impala Daemon is the core process, which receives client requests, generates the query plan, coordinates with the Impala Catalog and the Impala Statestore to execute the query, and aggregates the query results and returns them to the client.

6. The tax big data storage and analysis platform according to claim 5, wherein: the Impala Statestore is responsible for notifying and distributing cluster metadata, the metadata includes the Impala catalog and cluster membership, and SQL queries depend on the metadata.

7. The tax big data storage and analysis platform according to claim 6, wherein: the Impala Daemon and the data storage nodes of the Kudu cluster are deployed on the same nodes for deep integration, and SQL statements executed through the Impala Daemon create, insert into, update and delete from the Kudu tax database tables.

8. The tax big data storage and analysis platform according to claim 7, wherein: the Impala Catalog is responsible for creating and updating the metadata of databases and tables, and Impala Catalog updates are distributed to the Impala Daemons by the Impala Statestore.