CN111258977A - Tax big data storage and analysis platform - Google Patents


Publication number
CN111258977A
Authority
CN
China
Prior art keywords
impala
data storage
cluster
kudu
big data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010024869.5A
Other languages
Chinese (zh)
Inventor
王国强
程林
杨培强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Inspur Business System Co Ltd
Original Assignee
Shandong Inspur Business System Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-01-10
Filing date
2020-01-10
Publication date
2020-06-09
Application filed by Shandong Inspur Business System Co Ltd
Priority to CN202010024869.5A
Publication of CN111258977A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/256 Integrating or interfacing systems involving database management systems in federated or virtual databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a tax big data storage and analysis platform and belongs to the technical field of tax data. The platform comprises an Ambari cluster, a Kudu cluster and an Impala cluster. Ambari provides a visual operation interface. The Kudu cluster comprises at least two management nodes and a plurality of data storage nodes, the two management nodes serving as the main node and the standby node respectively. The Impala cluster comprises the Impala Daemon, Impala Catalog and Impala StateStore, and the Impala Daemon and the data storage nodes of the Kudu cluster are deployed on the same node. The platform provides simple, fast and accurate data processing and analysis for the storage, analysis and computation of tax big data, and has good value for popularization and application.

Description

Tax big data storage and analysis platform
Technical Field
The invention relates to the technical field of tax data, and particularly provides a tax big data storage and analysis platform.
Background
After the merger of the tax agencies, the relevant tax-related data needs to be aggregated comprehensively by means of big data, so that the technical strengths of big data can drive the integration and development of tax services.
The tax system has accumulated many kinds of historical data over the years, but traditional storage devices cannot keep up with the long-term storage requirements of growing tax big data. In addition, because of constraints in the technical architecture and in server computing resources, once the data volume reaches the TB level, system performance drops sharply: response is slow, reliability and security are poor, and data processing and analysis are inefficient.
Traditional solutions include distributed clusters built on relational databases such as Oracle, and HBase and Hive distributed clusters that use HDFS as the underlying storage. Distributed clusters built on traditional databases suffer from slow cross-region data reads and inefficient analysis and computation on large data volumes. HBase is a column-oriented database that stores data efficiently and answers queries quickly, but its queries are largely limited to conditions on the row key, and it does not support SQL operations. Hive serves as a data warehouse used mainly for offline analysis; it typically runs long computations over large data volumes and is not suited to fast, real-time reads and writes. In contrast, Kudu as the underlying data store supports high-concurrency, low-latency queries while retaining good analytical capability, which allows it to serve both OLTP and OLAP workloads. Impala is an interactive SQL query engine whose syntax is highly compatible with Hive; it provides standard ODBC and JDBC interfaces and outperforms Hive in execution efficiency.
Disclosure of Invention
The technical task of the invention is to provide a tax big data storage and analysis platform which can provide simple, rapid and accurate data processing and analysis capability for storage, analysis and calculation of tax big data.
To achieve this purpose, the invention provides the following technical solution:
A tax big data storage and analysis platform comprises Ambari, a Kudu cluster and an Impala cluster. Ambari provides a visual operation interface and is directly associated with the Kudu cluster and the Impala cluster. The Kudu cluster comprises at least two management nodes and a plurality of data storage nodes, the two management nodes serving as the main node and the standby node respectively. The Impala cluster comprises the Impala Daemon, Impala Catalog and Impala StateStore, and the Impala Daemon and the data storage nodes of the Kudu cluster are deployed on the same node.
Impala is a query system developed mainly by Cloudera. It provides SQL semantics and can query PB-scale data stored in Hadoop's HDFS and HBase.
Kudu is a storage engine open-sourced by Cloudera that provides both low-latency random reads and writes and efficient data analysis. It is a new storage component that sits between HDFS and HBase and combines features of both.
The Impala Daemon is Impala's daemon process and its core component; the process running on each node is named impalad.
The Impala StateStore collects resource information from the Impala processes distributed across the cluster, tracks the health of each node, and synchronizes node information.
The Impala Catalog is Impala's catalog service component; it relays metadata changes produced by Impala SQL statements to all DataNodes in the cluster. The corresponding process is named catalogd, and an Impala cluster needs only one catalogd process. When an SQL statement executed in the Impala cluster changes metadata, the catalogd service pushes the change to the other Impala process nodes.
Ambari provides a visual management platform for cluster installation, configuration, modification, expansion and monitoring. During operation, maintenance and use, it monitors in real time the health of the cluster and the usage of storage, memory, IO and other resources, providing a stable and reliable foundation for the big data platform. Before cluster deployment, the number of cluster nodes is planned according to the data storage and analysis requirements. Ambari is deployed on one node, and its web management interface is accessed directly through port 8080.
The Kudu cluster comprises at least two management nodes. When there are two, they serve as the main node and the standby node: only one management node acts as the main node at any time, and if the current main node becomes unavailable, a new main node is determined by election, which ensures the availability of the Kudu cluster. Data storage nodes can be added continuously as storage requirements grow, which ensures the scalability of the Kudu cluster.
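The main/standby behaviour of the management nodes can be illustrated with a small toy model. This is only a sketch: the `Master` class, the `elect_leader` function and the node names are hypothetical, and "pick the first available node" is a stand-in for a real election (Kudu masters actually elect a leader through the Raft consensus protocol, which is not implemented here).

```python
from dataclasses import dataclass


@dataclass
class Master:
    """Hypothetical stand-in for one Kudu management node."""
    name: str
    available: bool = True


def elect_leader(masters, current=None):
    """Keep the current main node while it is available; otherwise
    choose the first available node (toy replacement for an election)."""
    if current is not None and current.available:
        return current
    for master in masters:
        if master.available:
            return master
    raise RuntimeError("no management node is available")


masters = [Master("master-1"), Master("master-2")]
leader = elect_leader(masters)
leader_before = leader.name

# Simulate a failure of the current main node, then re-run the election.
masters[0].available = False
leader = elect_leader(masters, leader)
leader_after = leader.name
print(leader_before, "->", leader_after)
```

At every point exactly one node holds the main role, and the standby takes over only after the main node is marked unavailable.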
The Impala Daemon and the data storage nodes of the Kudu cluster are deployed on the same node. When Impala reads data from the Kudu cluster, it reads directly from the local tables, which avoids the network, IO and other overhead of fetching data remotely and keeps the cluster efficient.
Preferably, all data of the Kudu cluster is stored on the data storage nodes; each table on the data storage nodes has its own table structure, primary key and partitioning, and its data is stored in primary-key order.
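Why primary-key ordering matters can be shown with a minimal in-memory sketch. The `SortedTable` class and the taxpayer IDs are invented for illustration and say nothing about Kudu's on-disk format; the point is only that keeping keys sorted turns a key-range query into two binary searches plus a contiguous slice.

```python
import bisect


class SortedTable:
    """Toy table that keeps its rows ordered by primary key."""

    def __init__(self):
        self._keys = []   # primary keys, always sorted
        self._rows = {}   # primary key -> row

    def upsert(self, key, row):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = row

    def scan(self, low, high):
        """Return rows with low <= key < high, in primary-key order."""
        start = bisect.bisect_left(self._keys, low)
        end = bisect.bisect_left(self._keys, high)
        return [(k, self._rows[k]) for k in self._keys[start:end]]


table = SortedTable()
for taxpayer_id, amount in [(30, 120.0), (10, 75.5), (20, 310.0)]:
    table.upsert(taxpayer_id, {"amount": amount})

result = table.scan(10, 30)
print(result)
```

Rows inserted out of order still come back sorted by key, and the scan never touches rows outside the requested range.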
Preferably, the data on the data storage nodes is divided into segment tables (tablets, in Kudu terminology); each segment table holds adjacent data together, and the data storage nodes carry out the read and write operations on the segment tables. Each segment table has multiple replicas placed on different servers. At any time only one replica acts as the main replica, every secondary replica can serve reads, and writes must be applied consistently across the replicas. A single data storage node handles reads and writes for multiple segment tables.
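The replica rules above (one main replica, readable secondaries, consistent writes) can be sketched as a majority-acknowledgement model. Everything here is an illustrative assumption: the `Replica` and `Tablet` classes, the server names and the "more than half of the replicas must be reachable" rule are a simplification, not Kudu's implementation.

```python
class Replica:
    """One copy of a segment table on one server (hypothetical model)."""

    def __init__(self, server, up=True):
        self.server = server
        self.up = up
        self.rows = []


class Tablet:
    def __init__(self, replicas):
        self.replicas = replicas
        self.leader = replicas[0]  # exactly one main replica at a time

    def write(self, row):
        """Apply the row only if the main replica is up and a strict
        majority of all replicas is reachable; otherwise reject it."""
        reachable = [r for r in self.replicas if r.up]
        if not self.leader.up or 2 * len(reachable) <= len(self.replicas):
            return False
        for replica in reachable:
            replica.rows.append(row)
        return True

    def read(self, server):
        """Any reachable replica, main or secondary, can serve a read."""
        for replica in self.replicas:
            if replica.server == server and replica.up:
                return list(replica.rows)
        raise LookupError(f"no reachable replica on {server}")


tablet = Tablet([Replica("ts-1"), Replica("ts-2"), Replica("ts-3")])
first = tablet.write({"invoice_id": 1})

tablet.replicas[2].up = False               # one of three replicas fails
second = tablet.write({"invoice_id": 2})    # two of three still form a majority
follower_rows = tablet.read("ts-2")         # read served by a secondary replica

tablet.replicas[1].up = False               # only the main replica is left
third = tablet.write({"invoice_id": 3})     # rejected: no majority
```

In this model a write survives the loss of one replica out of three but is refused once a majority can no longer acknowledge it, which is the sense in which writes stay consistent.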
Preferably, the management nodes store all metadata, and only one management node is a master node at a time.
Preferably, the Impala Daemon is the core process: it receives client requests, generates the query plan, coordinates with the Impala Catalog and Impala StateStore to execute the query, aggregates the query results and returns them to the client.
Preferably, the Impala StateStore is responsible for notifying and distributing cluster metadata; the metadata includes the Impala Catalog and cluster membership, and SQL queries depend on this metadata.
Preferably, the Impala Daemon and the data storage nodes of the Kudu cluster are deployed on the same node for deep integration, and SQL statements executed through the Impala Daemon create, insert into, update and delete from the Kudu tax database tables.
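As a rough illustration of the create, insert, update and delete operations, the sketch below holds four Impala SQL statements for a Kudu-backed table and sends them through any DB-API style cursor. The table name `tax_invoice`, its columns and the partition count are hypothetical, and `RecordingCursor` is a stand-in that only records statements so the example runs without a cluster.

```python
# Hypothetical tax table; the Impala-on-Kudu clauses used are
# PRIMARY KEY, PARTITION BY HASH and STORED AS KUDU.
STATEMENTS = [
    """CREATE TABLE tax_invoice (
           invoice_id BIGINT,
           taxpayer_id STRING,
           amount DOUBLE,
           PRIMARY KEY (invoice_id)
       )
       PARTITION BY HASH (invoice_id) PARTITIONS 4
       STORED AS KUDU""",
    "INSERT INTO tax_invoice VALUES (1, 'T-001', 100.0)",
    "UPDATE tax_invoice SET amount = 120.0 WHERE invoice_id = 1",
    "DELETE FROM tax_invoice WHERE invoice_id = 1",
]


def run_statements(cursor, statements=STATEMENTS):
    """Execute each statement in order on a DB-API style cursor."""
    for sql in statements:
        cursor.execute(sql)
    return len(statements)


class RecordingCursor:
    """Stand-in cursor that records SQL instead of contacting Impala."""

    def __init__(self):
        self.executed = []

    def execute(self, sql):
        self.executed.append(sql)


cursor = RecordingCursor()
count = run_statements(cursor)
print(count, "statements sent")
```

With a real cluster, a cursor obtained from a DB-API driver connected to an Impala Daemon could be passed to `run_statements` in place of the recording stand-in.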
Preferably, the Impala Catalog is responsible for creating and updating the metadata of databases and tables, and Impala Catalog updates are distributed to the Impala Daemon by the Impala StateStore.
Compared with the prior art, the tax big data storage and analysis platform has the following notable beneficial effects: for massive tax data, storing and analyzing data with Kudu + Impala is reasonable, feasible and efficient; Ambari makes rapid deployment, expansion, and operation and maintenance management of the cluster easy, and addresses the storage problem posed by rapidly growing tax data; Impala enables fast and accurate in-depth mining of the data and can support decision-making in data management, research and similar work; the platform therefore has good value for popularization and application.
Drawings
FIG. 1 is a structural framework diagram of the tax big data storage and analysis platform according to the invention.
Detailed Description
The tax big data storage and analysis platform of the invention will be further described in detail with reference to the accompanying drawings and embodiments.
Examples
As shown in FIG. 1, the tax big data storage and analysis platform of the present invention includes Ambari, a Kudu cluster and an Impala cluster.
Ambari provides a visual operation interface and is directly associated with the Kudu cluster and the Impala cluster. Ambari provides a visual management platform for cluster installation, configuration, modification, expansion and monitoring. During operation, maintenance and use, it monitors in real time the health of the cluster and the usage of storage, memory, IO and other resources, providing a stable and reliable foundation for the big data platform. Before cluster deployment, the number of cluster nodes is planned according to the data storage and analysis requirements. Ambari is deployed on one node, and its web management interface is accessed directly through port 8080.
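Besides the web interface, Ambari serves a REST API under `/api/v1` on the same port, which is convenient for scripted health checks. The sketch below only builds a request object and does not send it; the host name, user and password are placeholders, and the helper `ambari_request` is a hypothetical name introduced for this example.

```python
import base64
import urllib.request


def ambari_request(host, path, user, password, port=8080):
    """Build (but do not send) a GET request for Ambari's REST API."""
    url = f"http://{host}:{port}/api/v1/{path.lstrip('/')}"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", f"Basic {token}")
    return request


# Placeholder host and credentials, for illustration only.
request = ambari_request("ambari.example.internal", "clusters", "admin", "secret")
print(request.full_url)
```

Passing the request to `urllib.request.urlopen` against a reachable Ambari server would return the cluster list as JSON; that step is left out so the sketch runs offline.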
Kudu is a storage engine open-sourced by Cloudera that provides both low-latency random reads and writes and efficient data analysis. It is a new storage component that sits between HDFS and HBase and combines features of both.
When the Kudu cluster is deployed, it comprises two management nodes that serve as the main node and the standby node. Only one management node acts as the main node at any time, and if the current main node becomes unavailable, a new main node is determined by election, which ensures the availability of the Kudu cluster. Data storage nodes can be added continuously as storage requirements grow, which ensures the scalability of the Kudu cluster. In the Kudu cluster configuration stage, the configuration is tuned to the actual environment through the following measures: the block_cache_capacity_mb parameter is set to control the maximum amount of memory allocated to the data storage node's cache; and the memory_limit_hard_bytes parameter, which is the maximum amount of memory the data storage node may use, is set. Because this parameter strongly affects the write capacity of the Kudu cluster, it is generally set to 80% of the machine's memory.
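The sizing rule above can be worked through with a small helper. The 80% figure comes from the text; the share of the hard limit given to the block cache (25% here) is purely an assumption for illustration, as is the function name `kudu_tserver_flags`.

```python
def kudu_tserver_flags(machine_memory_bytes, hard_limit_percent=80,
                       block_cache_percent=25):
    """Derive the two memory flags for a data storage node.

    hard_limit_percent follows the 80% guideline in the text;
    block_cache_percent is an illustrative assumption, not a recommendation.
    """
    hard_limit = machine_memory_bytes * hard_limit_percent // 100
    block_cache_mb = hard_limit * block_cache_percent // 100 // (1024 * 1024)
    return [
        f"--memory_limit_hard_bytes={hard_limit}",
        f"--block_cache_capacity_mb={block_cache_mb}",
    ]


# Example: a machine with 64 GiB of memory.
flags = kudu_tserver_flags(64 * 1024 ** 3)
print("\n".join(flags))
```

For a 64 GiB machine this yields a hard limit of 54975581388 bytes (51.2 GiB) and, under the assumed 25% share, a block cache of 13107 MB.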
All data of the Kudu cluster is stored on the data storage nodes; each table on the data storage nodes has its own table structure, primary key and partitioning, and its data is stored in primary-key order. The data on the data storage nodes is divided into segment tables (tablets, in Kudu terminology); each segment table holds adjacent data together, and the data storage nodes carry out the read and write operations on the segment tables. Each segment table has multiple replicas placed on different servers. At any time only one replica acts as the main replica, every secondary replica can serve reads, and writes must be applied consistently across the replicas. A single data storage node handles reads and writes for multiple segment tables. The management nodes store all metadata, and only one management node is the master node at a time.
The Impala cluster comprises the Impala Daemon, Impala Catalog and Impala StateStore, and the Impala Daemon and the data storage nodes of the Kudu cluster are deployed on the same node. The Impala Daemon is Impala's daemon process and its core component; the process running on each node is named impalad. The Impala StateStore collects resource information from the Impala processes distributed across the cluster, tracks the health of each node, and synchronizes node information. The Impala Catalog is Impala's catalog service component; it relays metadata changes produced by Impala SQL statements to all DataNodes in the cluster. The corresponding process is named catalogd, and an Impala cluster needs only one catalogd process. When an SQL statement executed in the Impala cluster changes metadata, the catalogd service pushes the change to the other Impala process nodes.
The Impala Daemon and the data storage nodes of the Kudu cluster are deployed on the same node. When Impala reads data from the Kudu cluster, it reads directly from the local tables, which avoids the network, IO and other overhead of fetching data remotely and keeps the cluster efficient. The Impala Daemon receives client requests, generates the query plan, coordinates with the Impala Catalog and Impala StateStore to execute the query, aggregates the query results and returns them to the client.
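The benefit of co-locating the Impala Daemon with the data storage nodes can be sketched as a locality-aware assignment of scans. The function `assign_scans`, the node names and the round-robin fallback are hypothetical; this is not Impala's scheduler, only an illustration of preferring a daemon that sits next to the data.

```python
def assign_scans(tablet_replicas, impalad_hosts):
    """Map each segment table to an Impala Daemon host, preferring a
    host that also stores a replica (a local read)."""
    hosts = sorted(impalad_hosts)
    plan = {}
    for index, (tablet, replica_hosts) in enumerate(sorted(tablet_replicas.items())):
        local = [h for h in replica_hosts if h in impalad_hosts]
        if local:
            plan[tablet] = (local[0], "local")
        else:
            # No daemon shares a node with the data: fall back to a
            # round-robin choice, which implies a remote read.
            plan[tablet] = (hosts[index % len(hosts)], "remote")
    return plan


replicas = {
    "t1": ["node1", "node2"],
    "t2": ["node3"],
    "t3": ["node9"],   # stored on a node that runs no Impala Daemon
}
plan = assign_scans(replicas, {"node1", "node2", "node3"})
for tablet, (host, kind) in plan.items():
    print(tablet, host, kind)
```

When every data storage node also runs a daemon, the remote branch is never taken, which is the situation the platform's deployment aims for.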
The Impala StateStore is responsible for notifying and distributing cluster metadata; the metadata includes the Impala Catalog and cluster membership, and SQL queries depend on this metadata. The Impala Daemon and the data storage nodes of the Kudu cluster are deployed on the same node for deep integration, and SQL statements executed through the Impala Daemon create, insert into, update and delete from the Kudu tax database tables. The Impala Catalog is responsible for creating and updating the metadata of databases and tables, and Impala Catalog updates are distributed to the Impala Daemon by the Impala StateStore.
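The flow of metadata from the Catalog through the StateStore to the daemons is essentially publish/subscribe, and can be sketched as follows. The three classes and their method names are hypothetical models written for this example and do not mirror Impala's internal interfaces.

```python
class StateStore:
    """Keeps the subscriber list and fans each update out to all daemons."""

    def __init__(self):
        self.daemons = []

    def subscribe(self, daemon):
        self.daemons.append(daemon)

    def publish(self, update):
        for daemon in self.daemons:
            daemon.apply(update)


class Daemon:
    """Holds a local copy of table metadata used when planning queries."""

    def __init__(self, name):
        self.name = name
        self.tables = {}

    def apply(self, update):
        self.tables.update(update)


class Catalog:
    """Creates or updates metadata and hands each change to the StateStore."""

    def __init__(self, statestore):
        self.statestore = statestore
        self.version = 0

    def create_table(self, name, columns):
        self.version += 1
        self.statestore.publish(
            {name: {"columns": list(columns), "version": self.version}}
        )


statestore = StateStore()
daemons = [Daemon("impalad-1"), Daemon("impalad-2")]
for daemon in daemons:
    statestore.subscribe(daemon)

catalog = Catalog(statestore)
catalog.create_table("tax.invoice", ["invoice_id", "amount"])
print(daemons[1].tables)
```

The Catalog never talks to a daemon directly in this model; every daemon learns about the new table only through the StateStore broadcast.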
The embodiments described above are merely preferred embodiments of the present invention, and ordinary changes and substitutions made by those skilled in the art within the technical scope of the present invention fall within the protection scope of the present invention.

Claims (8)

1. A tax big data storage and analysis platform, characterized in that: it comprises Ambari, a Kudu cluster and an Impala cluster, wherein Ambari provides a visual operation interface and is directly associated with the Kudu cluster and the Impala cluster; the Kudu cluster comprises at least two management nodes and a plurality of data storage nodes, the two management nodes serving as a main node and a standby node respectively; and the Impala cluster comprises the Impala Daemon, Impala Catalog and Impala StateStore, the Impala Daemon and the data storage nodes of the Kudu cluster being deployed on the same node.
2. The tax big data storage and analysis platform according to claim 1, wherein: all data of the Kudu cluster is stored on the data storage nodes, each table on the data storage nodes has its own table structure, primary key and partitioning, and the data is stored in primary-key order.
3. The tax big data storage and analysis platform according to claim 2, wherein: the data on the data storage nodes is divided into segment tables, each segment table holds adjacent data together, and the data storage nodes carry out the read and write operations on the segment tables.
4. The tax big data storage and analysis platform according to claim 3, wherein: the management nodes store all metadata, and only one management node is a master node at a time.
5. The tax big data storage and analysis platform according to claim 4, wherein: the Impala Daemon is the core process, which receives client requests, generates the query plan, coordinates with the Impala Catalog and Impala StateStore to execute the query, aggregates the query results and returns them to the client.
6. The tax big data storage and analysis platform according to claim 5, wherein: the Impala StateStore is responsible for notifying and distributing cluster metadata, the metadata comprises the Impala Catalog and cluster membership, and SQL queries depend on the metadata.
7. The tax big data storage and analysis platform according to claim 6, wherein: the Impala Daemon and the data storage nodes of the Kudu cluster are deployed on the same node for deep integration, and SQL statements executed through the Impala Daemon create, insert into, update and delete from the Kudu tax database tables.
8. The tax big data storage and analysis platform according to claim 7, wherein: the Impala Catalog is responsible for creating and updating the metadata of databases and tables, and Impala Catalog updates are distributed to the Impala Daemon by the Impala StateStore.
CN202010024869.5A 2020-01-10 2020-01-10 Tax big data storage and analysis platform Pending CN111258977A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010024869.5A CN111258977A (en) 2020-01-10 2020-01-10 Tax big data storage and analysis platform

Publications (1)

Publication Number Publication Date
CN111258977A true CN111258977A (en) 2020-06-09

Family

ID=70946915

Country Status (1)

Country Link
CN (1) CN111258977A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112287017A (en) * 2020-11-05 2021-01-29 浪潮云信息技术股份公司 OpenSSH-based Impala cluster visual management method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109951463A (en) * 2019-03-07 2019-06-28 成都古河云科技有限公司 A kind of Internet of Things big data analysis method stored based on stream calculation and novel column
CN110519100A (en) * 2019-09-03 2019-11-29 浪潮云信息技术有限公司 A kind of more cluster management methods, terminal and computer readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
宁群仪等: "基于Kudu+Impala的交通大数据存储和分析平台", 《电脑编程技巧与维护》, no. 11, 18 November 2018 (2018-11-18), pages 91 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200609
