CN111258977A - Tax big data storage and analysis platform - Google Patents
Tax big data storage and analysis platform
- Publication number
- CN111258977A (application CN202010024869.5A)
- Authority
- CN
- China
- Prior art keywords
- impala
- data storage
- cluster
- kudu
- big data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/256—Integrating or interfacing systems involving database management systems in federated or virtual databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a tax big data storage and analysis platform and belongs to the technical field of tax data. The platform comprises Ambari, a Kudu cluster and an Impala cluster. Ambari provides a visual operation interface. The Kudu cluster comprises at least two management nodes and a plurality of data storage nodes, the two management nodes serving as the main node and the standby node respectively. The Impala cluster comprises Impala Daemon, Impala Catalog and Impala StateStore, and the Impala Daemon and the data storage nodes of the Kudu cluster are deployed on the same nodes. The platform provides simple, fast and accurate data processing and analysis for the storage, analysis and computation of tax big data, and has good value for popularization and application.
Description
Technical Field
The invention relates to the technical field of tax data, and particularly provides a tax big data storage and analysis platform.
Background
After the merger of the tax agencies, the relevant tax-related data must be comprehensively aggregated by means of big data technology, so that the advanced capabilities of big data can drive the integration and development of tax services.
The tax system has accumulated many years of historical data of various kinds, but traditional storage devices cannot keep up with the long-term storage requirements of tax big data. In addition, because of the constraints of the technical architecture and of server computing resources, once the data volume reaches the TB level, system performance drops sharply: response is slow, reliability and security are poor, and data processing and analysis are inefficient.
Traditional solutions include distributed clusters built on relational databases such as Oracle, and HBase and Hive distributed clusters that use HDFS as the underlying storage. Distributed clusters built on traditional databases suffer from slow cross-region data reads and inefficient analysis and computation on large data volumes. HBase is a column-oriented database that stores data efficiently and answers queries quickly, but its queries are largely restricted to row-key conditions and it does not support SQL operations. Hive serves as a data warehouse, is mainly used for offline data analysis, typically runs long computations over large data volumes, and is not suited to fast real-time reads and writes. In contrast, Kudu, used as the underlying data store, supports high-concurrency, low-latency queries while retaining good analytical performance, which allows it to serve both OLTP and OLAP workloads. Impala is an interactive SQL query engine whose syntax is highly compatible with Hive; it provides standard ODBC and JDBC interfaces and outperforms Hive in execution efficiency.
Disclosure of Invention
The technical task of the invention is to provide a tax big data storage and analysis platform that offers simple, fast and accurate data processing and analysis for the storage, analysis and computation of tax big data.
To achieve this purpose, the invention provides the following technical solution:
A tax big data storage and analysis platform comprises Ambari, a Kudu cluster and an Impala cluster, wherein Ambari provides a visual operation interface and is directly associated with the Kudu cluster and the Impala cluster; the Kudu cluster comprises at least two management nodes and a plurality of data storage nodes, the two management nodes serving as the main node and the standby node respectively; the Impala cluster comprises Impala Daemon, Impala Catalog and Impala StateStore, and the Impala Daemon and the data storage nodes of the Kudu cluster are deployed on the same nodes.
Impala is a novel query system developed mainly by Cloudera; it provides SQL semantics and can query PB-level big data stored in Hadoop's HDFS and HBase.
Kudu is a storage engine open-sourced by Cloudera that provides both low-latency random reads and writes and efficient data analysis. It is a new storage component that combines functions of HDFS and HBase and sits between the two.
Impala Daemon is the Impala daemon process, the core component of Impala; the process running on each node is named impalad.
The Impala StateStore is responsible for collecting the resource information of the Impala processes distributed across the cluster and the health status of each node, and for synchronizing node information.
The Impala Catalog is the Impala catalog service component; it notifies all DataNodes in the cluster of metadata changes produced by Impala SQL statements. The process name of this service is catalogd, and one Impala cluster needs only one catalogd process. When SQL statements executed in the Impala cluster change the metadata, the catalogd service pushes the change to the other Impala process nodes.
Ambari provides a visual management platform with functions such as cluster installation, configuration, modification, expansion and monitoring. During operation, maintenance and use, it monitors the health of the cluster and the usage of storage, memory, IO and other resources in real time, giving the big data platform a stable and reliable foundation. Before cluster deployment, the number of cluster nodes is planned according to the data storage and analysis requirements. Ambari is deployed on one node, and its web management interface is accessed directly through port 8080.
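As a minimal sketch of reaching the Ambari server on port 8080 from a script rather than a browser, the following Python builds (but does not send) a request for Ambari's REST endpoint that lists managed clusters. The host name and credentials are hypothetical placeholders, and using the REST API at all is an illustrative assumption that is not part of the patent text.

```python
import base64
import urllib.request

AMBARI_PORT = 8080  # port named in the description


def build_cluster_list_request(host: str, user: str, password: str) -> urllib.request.Request:
    """Build a GET request for Ambari's cluster listing; nothing is sent here."""
    url = f"http://{host}:{AMBARI_PORT}/api/v1/clusters"
    # HTTP basic authentication header built from the supplied credentials.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url, headers={"Authorization": f"Basic {token}"}, method="GET"
    )


# Hypothetical host and credentials, for illustration only.
request = build_cluster_list_request("ambari.example.internal", "admin", "change-me")
print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would perform the call against a live Ambari server.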
The Kudu cluster comprises at least two management nodes. When the Kudu cluster has two management nodes, they serve as the main node and the standby node respectively; only one management node acts as the main node at any time, and if the management node currently in use becomes unavailable, the main node is re-determined through election, which ensures the availability of the Kudu cluster. Data storage nodes can be added continuously as data storage requirements grow, which ensures the scalability of the Kudu cluster.
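The main/standby behaviour described above can be illustrated with a toy model. This sketch only shows the rule "exactly one main node at a time, re-determined when it becomes unavailable"; the selection rule (lowest name among available nodes) and the node names are assumptions made for the example and do not reproduce the consensus protocol that Kudu itself uses.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ManagementNode:
    name: str
    available: bool = True


def elect_main_node(nodes: List[ManagementNode]) -> Optional[ManagementNode]:
    """Pick one available node as the main node (toy rule: lowest name)."""
    candidates = [node for node in nodes if node.available]
    return min(candidates, key=lambda node: node.name) if candidates else None


# Two management nodes, as in the two-node case of the description.
nodes = [ManagementNode("master-1"), ManagementNode("master-2")]
main = elect_main_node(nodes)        # master-1 acts as the main node
nodes[0].available = False           # the current main node becomes unavailable
new_main = elect_main_node(nodes)    # the standby node takes over
print(main.name, "->", new_main.name)
```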
The Impala Daemon and the data storage nodes of the Kudu cluster are deployed on the same nodes, so when Impala fetches data from the Kudu cluster it reads directly from local tables. This avoids the network, IO and other overhead of fetching data remotely and keeps the cluster efficient.
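To make the benefit of co-location concrete, the sketch below assigns each scan to a daemon host that already holds the data and falls back to a remote host only when no such daemon exists. The function, the host names and the fallback rule are hypothetical; this is a simplified picture of locality-aware assignment, not Impala's actual scheduler.

```python
from typing import Dict, List, Set


def assign_scans(data_hosts: Dict[str, List[str]], daemon_hosts: Set[str]) -> Dict[str, str]:
    """Map each data unit to a daemon host, preferring a host that stores it."""
    if not daemon_hosts:
        raise ValueError("at least one daemon host is required")
    fallback = sorted(daemon_hosts)[0]  # remote read when no local daemon exists
    assignment = {}
    for unit, hosts in sorted(data_hosts.items()):
        local = [host for host in hosts if host in daemon_hosts]
        assignment[unit] = local[0] if local else fallback
    return assignment


# Hypothetical layout: unit "t2" lives only on a node without a daemon.
data_hosts = {"t1": ["node-a", "node-b"], "t2": ["node-c"]}
plan = assign_scans(data_hosts, {"node-a", "node-b"})
print(plan)
```

In the fully co-located deployment the description calls for, every data unit has a local daemon and the fallback branch is never taken.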
Preferably, all data of the Kudu cluster is stored in the data storage nodes; each table in the data storage nodes has a corresponding table structure, primary key and partitioning, and the data is stored in primary-key order.
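A table with a structure, a primary key and a partitioning scheme can be declared through Impala's `CREATE TABLE ... STORED AS KUDU` syntax. The following sketch only generates such a statement as text; the table name, column names, hash partitioning and partition count are hypothetical choices for illustration, since the patent does not give a concrete schema.

```python
from typing import Dict, List


def kudu_create_table_sql(table: str, columns: Dict[str, str], primary_key: List[str],
                          hash_columns: List[str], partitions: int) -> str:
    """Render an Impala DDL statement for a hash-partitioned Kudu table."""
    if any(col not in columns for col in primary_key):
        raise ValueError("primary key columns must be defined in columns")
    if any(col not in primary_key for col in hash_columns):
        raise ValueError("partition columns must be part of the primary key")
    # Primary key columns are listed first, followed by the remaining columns.
    ordered = primary_key + [col for col in columns if col not in primary_key]
    col_lines = [f"  {name} {columns[name]}" for name in ordered]
    col_lines.append(f"  PRIMARY KEY ({', '.join(primary_key)})")
    lines = [
        f"CREATE TABLE {table} (",
        ",\n".join(col_lines),
        ")",
        f"PARTITION BY HASH ({', '.join(hash_columns)}) PARTITIONS {partitions}",
        "STORED AS KUDU",
    ]
    return "\n".join(lines)


# Hypothetical tax table used only for this example.
ddl = kudu_create_table_sql(
    "tax_invoice",
    {"invoice_id": "BIGINT", "taxpayer_id": "STRING", "amount_cents": "BIGINT"},
    primary_key=["invoice_id"],
    hash_columns=["invoice_id"],
    partitions=4,
)
print(ddl)
```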
Preferably, the data in the data storage nodes is divided into segment tables; a segment table keeps adjacent data together, and the data storage nodes carry out the read and write operations on the segment tables. Each segment table has multiple replicas placed on different servers. At any time only one replica acts as the main table; every auxiliary replica can serve reads, while writes must be applied consistently across the replicas. One data storage node handles the read and write operations of multiple segment tables.
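The idea that a segment table keeps adjacent data together can be sketched with key ranges. The example below assumes, purely for illustration, that segment tables are split on integer primary-key boundaries; the split points and keys are made up, and replication is left out.

```python
import bisect
from typing import List


def segment_index(split_keys: List[int], primary_key: int) -> int:
    """Index of the segment table whose key range contains primary_key.

    split_keys holds the sorted lower bounds of segments 1..n;
    segment 0 holds every key below split_keys[0].
    """
    return bisect.bisect_right(split_keys, primary_key)


def place_rows(split_keys: List[int], keys: List[int]) -> List[List[int]]:
    """Distribute keys over segment tables, keeping each one in key order."""
    segments: List[List[int]] = [[] for _ in range(len(split_keys) + 1)]
    for key in keys:
        bisect.insort(segments[segment_index(split_keys, key)], key)
    return segments


split_keys = [1000, 2000]  # three segments: below 1000, 1000 to 1999, 2000 and above
segments = place_rows(split_keys, [1500, 10, 1001, 2500, 999])
print(segments)
```

Keys that are close together end up in the same segment and stay sorted inside it, which is what makes range scans over neighbouring primary keys cheap.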
Preferably, the management nodes store all metadata, and only one management node is the main node at any time.
Preferably, the Impala Daemon is the core process: it receives client requests, generates a query plan, coordinates with the Impala Catalog and Impala StateStore to execute the query, aggregates the query results and returns them to the client.
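The receive, plan, execute and aggregate flow can be pictured as a scatter-gather computation. In this simplified sketch each daemon sums the rows it holds locally and a coordinator merges the partial results; the row layout, host names and the choice of a SUM query are assumptions for the example, not details from the patent.

```python
from typing import Dict, List


def local_partial_sum(rows: List[dict], taxpayer_id: str) -> int:
    """Work done on one daemon: sum matching rows stored on that node."""
    return sum(row["amount_cents"] for row in rows if row["taxpayer_id"] == taxpayer_id)


def coordinate_query(rows_by_daemon: Dict[str, List[dict]], taxpayer_id: str) -> int:
    """Work done by the coordinating daemon: gather partial sums and merge them."""
    partials = [local_partial_sum(rows, taxpayer_id) for rows in rows_by_daemon.values()]
    return sum(partials)


# Hypothetical rows spread over two nodes.
rows_by_daemon = {
    "node-a": [
        {"taxpayer_id": "T1", "amount_cents": 500},
        {"taxpayer_id": "T2", "amount_cents": 700},
    ],
    "node-b": [{"taxpayer_id": "T1", "amount_cents": 250}],
}
total = coordinate_query(rows_by_daemon, "T1")
print(total)
```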
Preferably, the Impala StateStore is responsible for notifying and distributing cluster metadata; the metadata includes the Impala Catalog and cluster membership, and SQL queries depend on this metadata.
Preferably, the Impala Daemon and the data storage nodes of the Kudu cluster are deployed on the same nodes for deep integration, and SQL statements executed through the Impala Daemon create the Kudu tax database tables and insert, update and delete their data.
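For a Kudu-backed table, Impala accepts row-level `INSERT`, `UPDATE` and `DELETE` statements. The sketch below lists one statement of each kind as plain text; the table `tax_invoice` and its columns are hypothetical and assumed to exist already as a Kudu table, and submitting the statements through an Impala client is not shown.

```python
TABLE = "tax_invoice"  # hypothetical Kudu table, assumed to exist already

STATEMENTS = [
    f"INSERT INTO {TABLE} (invoice_id, taxpayer_id, amount_cents) VALUES (1, 'T1', 500)",
    f"UPDATE {TABLE} SET amount_cents = 900 WHERE invoice_id = 1",
    f"DELETE FROM {TABLE} WHERE invoice_id = 1",
]


def operation(statement: str) -> str:
    """Return the leading SQL keyword of a statement."""
    return statement.split()[0].upper()


for statement in STATEMENTS:
    print(operation(statement), "->", statement)
```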
Preferably, the Impala Catalog is responsible for creating and updating the metadata of databases and tables, and Impala Catalog updates are distributed to the Impala Daemons by the Impala StateStore.
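The division of labour between the three services can be modelled as a small publish/subscribe loop: the catalog records a metadata change, the statestore fans it out, and every daemon refreshes its cached view. The class and method names below are invented for this sketch and do not correspond to Impala's internal interfaces.

```python
from typing import FrozenSet, List


class Daemon:
    """Caches the catalog contents needed to plan SQL queries."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.catalog_version = 0
        self.tables: FrozenSet[str] = frozenset()

    def apply_update(self, version: int, tables: FrozenSet[str]) -> None:
        if version > self.catalog_version:  # ignore stale updates
            self.catalog_version = version
            self.tables = tables


class StateStore:
    """Distributes catalog updates to all subscribed daemons."""

    def __init__(self) -> None:
        self.subscribers: List[Daemon] = []

    def subscribe(self, daemon: Daemon) -> None:
        self.subscribers.append(daemon)

    def broadcast(self, version: int, tables: FrozenSet[str]) -> None:
        for daemon in self.subscribers:
            daemon.apply_update(version, tables)


class Catalog:
    """Owns database and table metadata and publishes every change."""

    def __init__(self, statestore: StateStore) -> None:
        self.statestore = statestore
        self.version = 0
        self.tables: set = set()

    def create_table(self, name: str) -> None:
        self.tables.add(name)
        self.version += 1
        self.statestore.broadcast(self.version, frozenset(self.tables))


statestore = StateStore()
daemons = [Daemon("node-a"), Daemon("node-b")]
for daemon in daemons:
    statestore.subscribe(daemon)
catalog = Catalog(statestore)
catalog.create_table("tax_invoice")  # hypothetical table name
print([(d.name, d.catalog_version) for d in daemons])
```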
Compared with the prior art, the tax big data storage and analysis platform of the invention has the following outstanding beneficial effects: for massive tax data, storing and analyzing data with Kudu + Impala is reasonable, feasible and efficient; Ambari makes rapid deployment, expansion, and operation and maintenance management of the cluster easy, which addresses the storage problem posed by rapidly growing tax data; and Impala enables fast, accurate in-depth mining of the data, which can support decision-making in data management, research and similar work. The platform therefore has good value for popularization and application.
Drawings
FIG. 1 is a structural framework diagram of the tax big data storage and analysis platform according to the invention.
Detailed Description
The tax big data storage and analysis platform of the invention will be further described in detail with reference to the accompanying drawings and embodiments.
Examples
As shown in FIG. 1, the tax big data storage and analysis platform of the present invention comprises Ambari, a Kudu cluster and an Impala cluster.
Ambari provides a visual operation interface and is directly associated with the Kudu cluster and the Impala cluster. Ambari provides a visual management platform with functions such as cluster installation, configuration, modification, expansion and monitoring. During operation, maintenance and use, it monitors the health of the cluster and the usage of storage, memory, IO and other resources in real time, giving the big data platform a stable and reliable foundation. Before cluster deployment, the number of cluster nodes is planned according to the data storage and analysis requirements. Ambari is deployed on one node, and its web management interface is accessed directly through port 8080.
Kudu is a storage engine open-sourced by Cloudera that provides both low-latency random reads and writes and efficient data analysis. It is a new storage component that combines functions of HDFS and HBase and sits between the two.
When the Kudu cluster is deployed, it comprises two management nodes serving as the main node and the standby node respectively; only one management node acts as the main node at any time, and if the management node currently in use becomes unavailable, the main node is re-determined through election, which ensures the availability of the Kudu cluster. Data storage nodes can be added continuously as data storage requirements grow, which ensures the scalability of the Kudu cluster. In the Kudu cluster configuration stage, the configuration is tuned to the actual environment, including the following measures: setting the block_cache_capacity_mb parameter, which controls the maximum amount of memory allocated to the data storage node's cache; and setting the memory_limit_hard_bytes parameter, which is the maximum amount of memory the data storage node may use. Because this parameter has a large influence on the write capacity of the Kudu cluster, it is generally set to 80% of the machine's memory.
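As a worked example of the 80% rule, the sketch below derives the two settings for one data storage node and prints them as command-line style flags. The 64 GiB machine size and the 4096 MB block cache are hypothetical inputs chosen for the example; the patent states only the 80% guideline for memory_limit_hard_bytes and gives no value for block_cache_capacity_mb.

```python
from typing import List


def kudu_memory_flags(machine_memory_bytes: int, block_cache_mb: int,
                      hard_limit_percent: int = 80) -> List[str]:
    """Derive memory flags for a data storage node from the machine's memory."""
    # Integer arithmetic avoids floating-point rounding on large byte counts.
    hard_limit = machine_memory_bytes * hard_limit_percent // 100
    if block_cache_mb * 1024 * 1024 >= hard_limit:
        raise ValueError("block cache must be smaller than the hard memory limit")
    return [
        f"--block_cache_capacity_mb={block_cache_mb}",
        f"--memory_limit_hard_bytes={hard_limit}",
    ]


# Hypothetical node with 64 GiB of memory and a 4096 MB block cache.
flags = kudu_memory_flags(64 * 1024**3, block_cache_mb=4096)
print("\n".join(flags))
```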
All data of the Kudu cluster is stored in the data storage nodes; each table in the data storage nodes has a corresponding table structure, primary key and partitioning, and the data is stored in primary-key order. The data in the data storage nodes is divided into segment tables; a segment table keeps adjacent data together, and the data storage nodes carry out the read and write operations on the segment tables. Each segment table has multiple replicas placed on different servers. At any time only one replica acts as the main table; every auxiliary replica can serve reads, while writes must be applied consistently across the replicas. One data storage node handles the read and write operations of multiple segment tables. The management nodes store all metadata, and only one management node is the main node at any time.
The Impala cluster comprises Impala Daemon, Impala Catalog and Impala StateStore, and the Impala Daemon and the data storage nodes of the Kudu cluster are deployed on the same nodes. Impala Daemon is the Impala daemon process, the core component of Impala; the process running on each node is named impalad. The Impala StateStore is responsible for collecting the resource information of the Impala processes distributed across the cluster and the health status of each node, and for synchronizing node information. The Impala Catalog is the Impala catalog service component; it notifies all DataNodes in the cluster of metadata changes produced by Impala SQL statements. The process name of this service is catalogd, and one Impala cluster needs only one catalogd process. When SQL statements executed in the Impala cluster change the metadata, the catalogd service pushes the change to the other Impala process nodes.
Because the Impala Daemon and the data storage nodes of the Kudu cluster are deployed on the same nodes, Impala reads directly from local tables when it fetches data from the Kudu cluster, which avoids the network, IO and other overhead of fetching data remotely and keeps the cluster efficient. The Impala Daemon receives client requests, generates a query plan, coordinates with the Impala Catalog and Impala StateStore to execute the query, aggregates the query results and returns them to the client.
The Impala StateStore is responsible for notifying and distributing cluster metadata; the metadata includes the Impala Catalog and cluster membership, and SQL queries depend on this metadata. The Impala Daemon and the data storage nodes of the Kudu cluster are deployed on the same nodes for deep integration, and SQL statements executed through the Impala Daemon create the Kudu tax database tables and insert, update and delete their data. The Impala Catalog is responsible for creating and updating the metadata of databases and tables, and Impala Catalog updates are distributed to the Impala Daemons by the Impala StateStore.
The above-described embodiments are merely preferred embodiments of the present invention, and general changes and substitutions by those skilled in the art within the technical scope of the present invention are included in the protection scope of the present invention.
Claims (8)
1. A tax big data storage and analysis platform, characterized in that: the platform comprises Ambari, a Kudu cluster and an Impala cluster, wherein Ambari provides a visual operation interface and is directly associated with the Kudu cluster and the Impala cluster; the Kudu cluster comprises at least two management nodes and a plurality of data storage nodes, the two management nodes serving as the main node and the standby node respectively; the Impala cluster comprises Impala Daemon, Impala Catalog and Impala StateStore, and the Impala Daemon and the data storage nodes of the Kudu cluster are deployed on the same nodes.
2. The tax big data storage and analysis platform according to claim 1, wherein: all data of the Kudu cluster is stored in the data storage nodes, each table in the data storage nodes has a corresponding table structure, primary key and partitioning, and the data is stored in primary-key order.
3. The tax big data storage and analysis platform according to claim 2, wherein: the data in the data storage nodes is divided into segment tables, a segment table keeps adjacent data together, and the data storage nodes carry out the read and write operations on the segment tables.
4. The tax big data storage and analysis platform according to claim 3, wherein: the management nodes store all metadata, and only one management node is the main node at any time.
5. The tax big data storage and analysis platform according to claim 4, wherein: the Impala Daemon is the core process, which receives client requests, generates a query plan, coordinates with the Impala Catalog and Impala StateStore to execute the query, aggregates the query results and returns them to the client.
6. The tax big data storage and analysis platform according to claim 5, wherein: the Impala StateStore is responsible for notifying and distributing cluster metadata, the metadata includes the Impala Catalog and cluster membership, and SQL queries depend on the metadata.
7. The tax big data storage and analysis platform according to claim 6, wherein: the Impala Daemon and the data storage nodes of the Kudu cluster are deployed on the same nodes for deep integration, and SQL statements executed through the Impala Daemon create the Kudu tax database tables and insert, update and delete their data.
8. The tax big data storage and analysis platform according to claim 7, wherein: the Impala Catalog is responsible for creating and updating the metadata of databases and tables, and Impala Catalog updates are distributed to the Impala Daemons by the Impala StateStore.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010024869.5A CN111258977A (en) | 2020-01-10 | 2020-01-10 | Tax big data storage and analysis platform |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111258977A (en) | 2020-06-09 |
Family
ID=70946915
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010024869.5A Pending CN111258977A (en) | 2020-01-10 | 2020-01-10 | Tax big data storage and analysis platform |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111258977A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112287017A (en) * | 2020-11-05 | 2021-01-29 | 浪潮云信息技术股份公司 | OpenSSH-based Impala cluster visual management method |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109951463A (en) * | 2019-03-07 | 2019-06-28 | 成都古河云科技有限公司 | A kind of Internet of Things big data analysis method stored based on stream calculation and novel column |
CN110519100A (en) * | 2019-09-03 | 2019-11-29 | 浪潮云信息技术有限公司 | A kind of more cluster management methods, terminal and computer readable storage medium |
-
2020
- 2020-01-10 CN CN202010024869.5A patent/CN111258977A/en active Pending
Non-Patent Citations (1)
Title |
---|
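宁群仪 (Ning Qunyi) et al.: "Traffic big data storage and analysis platform based on Kudu+Impala", 《电脑编程技巧与维护》 (Computer Programming Skills & Maintenance), no. 11, 18 November 2018 (2018-11-18), pages 91 * |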
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11816126B2 (en) | Large scale unstructured database systems | |
CN107423422B (en) | Spatial data distributed storage and search method and system based on grid | |
US12050622B2 (en) | Replicating big data | |
CN106599043A (en) | Middleware used for multilevel database and multilevel database system | |
WO2019109854A1 (en) | Data processing method and device for distributed database, storage medium, and electronic device | |
CN110309233A (en) | Method, apparatus, server and the storage medium of data storage | |
US20230418811A1 (en) | Transaction processing method and apparatus, computing device, and storage medium | |
CN106649687B (en) | Big data online analysis processing method and device | |
CN105608126A (en) | Method and apparatus for establishing secondary indexes for massive databases | |
CN108228725B (en) | GIS application system based on distributed database | |
CN103365987A (en) | Clustered database system and data processing method based on shared-disk framework | |
CN115114294A (en) | Self-adaption method and device of database storage mode and computer equipment | |
CN105956041A (en) | Data model processing method based on Spring Data for MongoDB cluster | |
CN106780157B (en) | Ceph-based power grid multi-temporal model storage and management system and method | |
CN113961546B (en) | Real-time query library design method supporting online analysis and statistics | |
CN111258977A (en) | Tax big data storage and analysis platform | |
CN117111856A (en) | Data lake data processing method, device, system, equipment and medium | |
CN115934819A (en) | Universal distributed expansion method for industrial time sequence database | |
CN115587147A (en) | Data processing method and system | |
CN112434010A (en) | Interaction method for master station database of electricity consumption information acquisition system | |
CN110569310A (en) | Management method of relational big data in cloud computing environment | |
CN111104396A (en) | Cross-database data migration method and data access method | |
Chen et al. | Research and Application of Topology Analysis Method for Large-scale Distribution Grids | |
Ma et al. | Evaluating distributed transactional database system | |
WO2024108639A1 (en) | Data management method and apparatus based on multi-dimensional features |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200609 |