CN112800058A

CN112800058A - Method for realizing HBase secondary index

Info

Publication number: CN112800058A
Application number: CN202110107933.0A
Authority: CN
Inventors: 赵圣杰; 徐伟涛; 高传集; 胡清
Original assignee: Inspur Cloud Information Technology Co Ltd
Current assignee: Inspur Cloud Information Technology Co Ltd
Priority date: 2021-01-27
Filing date: 2021-01-27
Publication date: 2021-05-14

Abstract

The invention relates to a method for realizing an HBase secondary index. The method integrates an Elasticsearch search engine, a NiFi data flow platform and an HBase distributed column storage database. The NiFi data flow platform is responsible for extracting source data and writing it into both the Elasticsearch search engine and the HBase distributed column storage database; the Elasticsearch search engine is responsible for storing the indexed data together with the rowkey primary key of the HBase distributed column storage database; and the HBase distributed column storage database is responsible for storing the full detailed data. A query first searches the Elasticsearch search engine according to the query conditions to obtain the rowkey primary keys of the HBase distributed column storage database, and then uses those rowkey primary keys as the query condition to retrieve the detailed data stored in the HBase distributed column storage database, thereby providing an efficient query and retrieval function for HBase. The method saves memory cache space and disk storage space on the server and greatly improves query and retrieval efficiency.

Description

Method for realizing HBase secondary index

Technical Field

The invention relates to the technical field of data retrieval, in particular to a method for realizing HBase secondary index.

Background

With the rapid development of computer and network technology, large amounts of data are stored in HBase databases. HBase uses only the rowkey as its primary index. To retrieve or query data by a non-primary-key field, a full-table scan is usually performed through a distributed computing framework such as MapReduce or Spark, which incurs both high hardware resource consumption and high latency. HBase on its own therefore cannot support fast, complex queries. The advantages and disadvantages of HBase data storage are as follows:

Apache HBase is the Hadoop database: a distributed, scalable big data store. The HBase distributed column storage database can host very large tables, with billions of rows and millions of columns, on clusters of commodity hardware. HBase is an open-source, distributed, versioned, non-relational database modeled on Google's Bigtable, and it provides Bigtable-like capabilities on top of Hadoop and HDFS. HBase is an important member of the Apache Hadoop ecosystem and is mainly used for storing massive amounts of structured data. Its main goal is to increase computing and storage capacity through horizontal scaling, that is, by adding inexpensive commodity servers. Queries by rowkey primary key complete at millisecond-level speed, but HBase is not suited to queries with complex logic, which usually require a full-table scan and consume substantial resources.

Based on the above situation, the invention provides a method for realizing HBase secondary index.

Disclosure of Invention

In order to make up for the defects of the prior art, the invention provides a simple and efficient method for realizing an HBase secondary index.

The invention is realized by the following technical scheme:

A method for realizing an HBase secondary index, characterized by integrating an Elasticsearch search engine, a NiFi data flow platform and an HBase distributed column storage database;

the NiFi data flow platform is responsible for extracting source data and writing the source data into the Elasticsearch search engine and the HBase distributed column storage database, the Elasticsearch search engine is responsible for storing the indexed data and the rowkey primary key of the HBase distributed column storage database, and the HBase distributed column storage database is responsible for storing the full detailed data;

the Elasticsearch search engine is searched according to the query condition to obtain the rowkey primary key of the HBase distributed column storage database, and the detailed data stored in the HBase distributed column storage database is then queried using the rowkey primary key as the query condition, thereby providing an efficient query and retrieval function for HBase.
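
The two-phase lookup described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the two stores are replaced by hypothetical in-memory stand-ins (`ES_INDEX`, `HBASE_TABLE`), and the field names and sample rows are invented for illustration only.

```python
from typing import Dict, List

# Hypothetical stand-in for an Elasticsearch index: only the indexed
# fields plus the HBase rowkey are kept here.
ES_INDEX: List[Dict[str, str]] = [
    {"rowkey": "u001", "city": "Jinan", "status": "active"},
    {"rowkey": "u002", "city": "Beijing", "status": "active"},
    {"rowkey": "u003", "city": "Jinan", "status": "closed"},
]

# Hypothetical stand-in for an HBase table: full detailed rows keyed by rowkey.
HBASE_TABLE: Dict[str, Dict[str, str]] = {
    "u001": {"name": "Zhang", "city": "Jinan", "status": "active", "address": "Road 1"},
    "u002": {"name": "Li", "city": "Beijing", "status": "active", "address": "Road 2"},
    "u003": {"name": "Wang", "city": "Jinan", "status": "closed", "address": "Road 3"},
}


def search_rowkeys(conditions: Dict[str, str]) -> List[str]:
    """Phase 1: match the indexed fields and return only the rowkeys."""
    return [
        doc["rowkey"]
        for doc in ES_INDEX
        if all(doc.get(field) == value for field, value in conditions.items())
    ]


def fetch_details(rowkeys: List[str]) -> List[Dict[str, str]]:
    """Phase 2: point lookups by rowkey, the access path HBase serves quickly."""
    return [HBASE_TABLE[key] for key in rowkeys if key in HBASE_TABLE]


def query(conditions: Dict[str, str]) -> List[Dict[str, str]]:
    """Run both phases: index search first, then detail retrieval."""
    return fetch_details(search_rowkeys(conditions))
```

Calling `query({"city": "Jinan", "status": "active"})` touches the detail store only for the rowkeys returned by the index search, so no full-table scan is needed.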

The method comprises the following steps:

S1, the NiFi data flow platform configures the address of a data source, the data source address being the URL of a relational database, a remote directory or a data flow tool;

S2, the NiFi data flow platform configures the correspondence between the fields of the data source and the fields of the HBase distributed column storage database; if the corresponding storage table does not exist in the HBase distributed column storage database, the NiFi data flow platform automatically creates the HBase table according to the configuration information;

S3, the NiFi data flow platform configures the fields to be written into the Elasticsearch search engine; if the index for storing the corresponding data does not exist in the Elasticsearch search engine, the NiFi data flow platform automatically creates the index according to the configuration information;

S4, the NiFi data flow platform is started to extract source data and write the source data into the index of the Elasticsearch search engine;

S5, the NiFi data flow platform is started to extract source data and write the source data into the table of the HBase distributed column storage database;

S6, a user inputs a detail query SQL statement and clicks the query button; the Elasticsearch search engine queries the corresponding index according to the SQL input by the user and returns the primary keys of the HBase distributed column storage database stored in the index; the corresponding table of the HBase distributed column storage database is then queried with the returned primary keys, and the detailed data is returned and displayed on the page;

S7, the user inputs a statistical analysis SQL statement and clicks the query button; the statistics interface of the Elasticsearch search engine receives the statistical analysis SQL, carries out the statistical analysis and returns the result in JSON form.
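
The "create if it does not exist" behavior of steps S2 and S3 can be sketched as an idempotent operation. The `Catalog` class below is a hypothetical in-memory stand-in for the list of HBase tables or Elasticsearch indices; how NiFi actually performs the check is not specified in this document.

```python
from typing import Dict, List


class Catalog:
    """Hypothetical stand-in for the tables of HBase or the indices of Elasticsearch."""

    def __init__(self) -> None:
        self.objects: Dict[str, List[str]] = {}

    def ensure(self, name: str, fields: List[str]) -> bool:
        """Create `name` from the configured fields if it is missing.

        Returns True when the object was created and False when it already
        existed, in which case the existing definition is left untouched.
        """
        if name in self.objects:
            return False
        self.objects[name] = list(fields)
        return True


hbase_tables = Catalog()
es_indices = Catalog()

# Step S2: the HBase table carries every source field.
hbase_tables.ensure("orders", ["id", "city", "status", "address"])
# Step S3: the index carries only the configured fields plus the rowkey.
es_indices.ensure("orders", ["rowkey", "city", "status"])
```

Because `ensure` is idempotent, restarting the flow does not recreate or alter an existing table or index.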

In step S1, the NiFi data flow platform extracts data from other data sources or reads file contents from a remote shared directory, cleans, converts and splits the contents, and then writes them into other data storage components.

The data flow tool is the Kafka stream processing platform.

In step S2, the NiFi data flow platform configures connection information of the HBase distributed column storage database, including a ticket of the HBase distributed column storage database, the IP address of ZooKeeper, and the port number of the HBase distributed column storage database, so as to ensure that the NiFi data flow platform can communicate with the HBase distributed column storage database.

In step S3, the NiFi data stream platform configures connection information of the Elasticsearch engine, including a port number and an IP address of the Elasticsearch engine, to ensure that the NiFi data stream platform can communicate with the Elasticsearch engine.
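
The connection information named in steps S2 and S3 might be modeled as follows. This is only a sketch: the class names, the `http` scheme and the sample addresses and ports are assumptions, and the form of the HBase "ticket" is not specified in this document, so it is held as an opaque string.

```python
from dataclasses import dataclass


def _check_port(port: int) -> None:
    """Reject values outside the valid TCP port range."""
    if not 1 <= port <= 65535:
        raise ValueError(f"invalid port number: {port}")


@dataclass(frozen=True)
class HBaseConnection:
    ticket: str        # credential named in step S2; treated as opaque here
    zookeeper_ip: str  # IP address of ZooKeeper
    port: int          # port number of the HBase database

    def __post_init__(self) -> None:
        _check_port(self.port)


@dataclass(frozen=True)
class ElasticsearchConnection:
    ip: str
    port: int

    def __post_init__(self) -> None:
        _check_port(self.port)

    def base_url(self) -> str:
        # Assumption: plain HTTP; a secured cluster would use https.
        return f"http://{self.ip}:{self.port}"


# Placeholder values from the documentation address range, not real hosts.
hbase_conn = HBaseConnection(ticket="<ticket>", zookeeper_ip="192.0.2.20", port=16000)
es_conn = ElasticsearchConnection(ip="192.0.2.10", port=9200)
```

Validating the settings when they are constructed surfaces a mistyped port before the flow is started rather than at the first write.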

The NiFi data flow platform also needs to configure the correspondence between each source data field name and the corresponding indexed field in the Elasticsearch search engine. Only the query and statistics fields required by the business, together with the rowkey primary key field of the HBase distributed column storage database, need to be configured for writing into the Elasticsearch search engine. In step S4, fields that are used neither as query conditions nor for statistics are not written into the Elasticsearch search engine, so that the memory cache space and the disk storage space of the server are saved and the query and retrieval efficiency is improved.
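
The field selection for the index can be sketched as a projection of each source record. The function name, the mapping layout and the sample record are hypothetical; the sketch only shows that the index document carries the mapped query and statistics fields plus the rowkey, and nothing else.

```python
from typing import Any, Dict


def to_index_document(
    record: Dict[str, Any],
    field_mapping: Dict[str, str],
    rowkey_field: str,
) -> Dict[str, Any]:
    """Keep only the mapped fields and attach the HBase rowkey.

    `field_mapping` maps a source field name to its indexed field name.
    Fields absent from the mapping are deliberately dropped.
    """
    document = {
        indexed_name: record[source_name]
        for source_name, indexed_name in field_mapping.items()
    }
    document["rowkey"] = record[rowkey_field]
    return document


source_record = {"id": "u001", "city": "Jinan", "amount": 12.5, "remark": "long free text"}
index_doc = to_index_document(
    source_record,
    field_mapping={"city": "city", "amount": "amount"},
    rowkey_field="id",
)
```

Here the `remark` field never reaches the index, which is where the saving in memory and disk space comes from.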

In step S5, the data written into the HBase distributed column storage database by the NiFi data stream platform is full data, and includes all source data field information.
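
By contrast with the index, the HBase write keeps every source field. The sketch below converts one record into a rowkey and a set of cells. HBase addresses a cell by column family and qualifier and stores values as bytes; the column family name `d`, the choice of rowkey field and the string encoding of values are assumptions made for illustration.

```python
from typing import Any, Dict, Tuple


def to_hbase_cells(
    record: Dict[str, Any],
    rowkey_field: str,
    column_family: str = "d",  # hypothetical column family name
) -> Tuple[bytes, Dict[bytes, bytes]]:
    """Turn a full source record into (rowkey, {family:qualifier -> value})."""
    rowkey = str(record[rowkey_field]).encode("utf-8")
    cells = {
        f"{column_family}:{name}".encode("utf-8"): str(value).encode("utf-8")
        for name, value in record.items()
    }
    return rowkey, cells


full_record = {"id": "u001", "city": "Jinan", "amount": 12.5, "remark": "long free text"}
rowkey, cells = to_hbase_cells(full_record, rowkey_field="id")
```

Every field of the source record, including those left out of the index, appears as one cell under the rowkey.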

In step S7, the query condition in the statistical analysis sql input by the user at the interface must be a field stored in the Elasticsearch engine, and the statistical field must also be stored in the Elasticsearch engine.
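
The constraint on statistical queries can be illustrated with a small check followed by a count aggregation that returns JSON. The indexed field set, the sample documents and the function name are hypothetical, and the aggregation runs over an in-memory list rather than calling the Elasticsearch statistics interface.

```python
import json
from collections import Counter
from typing import Dict, List

# Hypothetical set of fields configured for the index in step S3.
INDEXED_FIELDS = {"rowkey", "city", "status"}

DOCS: List[Dict[str, str]] = [
    {"rowkey": "u001", "city": "Jinan", "status": "active"},
    {"rowkey": "u002", "city": "Beijing", "status": "active"},
    {"rowkey": "u003", "city": "Jinan", "status": "closed"},
]


def count_by(docs: List[Dict[str, str]], group_field: str, conditions: Dict[str, str]) -> str:
    """Count matching documents per value of `group_field`, returned as JSON.

    Both the condition fields and the statistics field must be indexed,
    otherwise the request is rejected before any data is read.
    """
    missing = (set(conditions) | {group_field}) - INDEXED_FIELDS
    if missing:
        raise ValueError(f"fields not stored in the index: {sorted(missing)}")
    matched = (
        doc for doc in docs
        if all(doc.get(field) == value for field, value in conditions.items())
    )
    counts = Counter(doc[group_field] for doc in matched)
    return json.dumps(dict(counts), sort_keys=True)
```

A request that groups by or filters on a field held only in HBase raises an error, since the index has no data with which to answer it.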

In step S6, the query condition in the detail query SQL input by the user on the interface must be a field stored in the Elasticsearch search engine; a displayed field need not be stored in the Elasticsearch search engine, but must be a field stored in the HBase distributed column storage database.
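
The rule for detail queries separates condition fields from displayed fields, which could be checked before the query runs. The function name and the field sets in this sketch are hypothetical.

```python
from typing import Iterable, List


def check_detail_query(
    condition_fields: Iterable[str],
    display_fields: Iterable[str],
    indexed_fields: Iterable[str],
    hbase_fields: Iterable[str],
) -> List[str]:
    """Return a list of rule violations; an empty list means the query is valid."""
    errors: List[str] = []
    not_indexed = sorted(set(condition_fields) - set(indexed_fields))
    if not_indexed:
        errors.append(f"condition fields not in Elasticsearch: {not_indexed}")
    not_stored = sorted(set(display_fields) - set(hbase_fields))
    if not_stored:
        errors.append(f"display fields not in HBase: {not_stored}")
    return errors


indexed = {"rowkey", "city", "status"}
stored = {"name", "city", "status", "address"}

# Valid: filter on an indexed field, display fields held only in HBase.
ok = check_detail_query(["city"], ["name", "address"], indexed, stored)
# Invalid: "address" is not indexed, so it cannot serve as a condition.
bad = check_detail_query(["address"], ["name"], indexed, stored)
```

The asymmetry reflects the two-phase design: conditions are evaluated by the index search, while displayed values are read from the full rows in HBase.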

The invention has the beneficial effects that: the method for realizing the HBase secondary index can save the memory cache space and the disk storage space of a server, provides an efficient query and retrieval function for the HBase, and greatly improves the query and retrieval efficiency.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a schematic diagram of a data writing process according to the present invention.

FIG. 2 is a schematic diagram of a data retrieval process according to the present invention.

Detailed Description

In order to make those skilled in the art better understand the technical solution of the present invention, the technical solution in the embodiment of the present invention will be clearly and completely described below with reference to the embodiment of the present invention. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The NiFi data flow platform is an easy-to-use, powerful and reliable system for processing and distributing data. Through a web-based graphical interface, flow-based programming is carried out by dragging, connecting and configuring components, realizing functions such as data acquisition. It is suited to the visual creation and management of directed graphs of processors. NiFi is asynchronous by nature, which allows very high throughput and natural buffering as processing and flow rates fluctuate; it provides a model for highly concurrent processing, so developers need not worry about the typical complexities of concurrency. It promotes the development of cohesive, loosely coupled components that can be reused in other contexts and tested as units. Resource-constrained connections make critical functions such as back pressure and pressure release natural and intuitive. The points at which data enters and exits the system, and how it flows through, are easy to understand and to track.

The Elasticsearch search engine is a distributed, RESTful search and data analytics engine. Elasticsearch provides full-text retrieval, structured retrieval, data analysis and other functions, and can process massive amounts of data in near real time. It can scale out to hundreds of servers and handle petabyte-scale structured or unstructured data.

The method for realizing the HBase secondary index integrates an Elasticsearch search engine, a NiFi data flow platform and an HBase distributed column storage database;

The method comprises the following steps:

The data flow tool is the Kafka stream processing platform.

The above-described embodiment is only one specific embodiment of the present invention, and general changes and substitutions by those skilled in the art within the technical scope of the present invention are included in the protection scope of the present invention.

Claims

1. A method for realizing HBase secondary index is characterized in that: integrating an Elasticsearch search engine, a NiFi data flow platform and an HBase distributed column storage database;

2. The method for implementing the HBase secondary index according to claim 1, comprising the following steps:

3. The method for implementing the HBase secondary index according to claim 2, wherein: in the step S1, the NiFi data stream platform extracts data of other data sources or reads file contents in the remote shared directory, and cleans, converts and segments the contents, and then writes the contents into other data storage components;

the data flow tool is the Kafka stream processing platform.

4. The method for implementing the HBase secondary index according to claim 2, wherein: in step S2, the NiFi data flow platform configures connection information of the HBase distributed column storage database, including a ticket of the HBase distributed column storage database, an IP address of the zookeeper, and a port number of the HBase distributed column storage database, so as to ensure that the NiFi data flow platform can communicate with the HBase distributed column storage database.

5. The method for implementing the HBase secondary index according to claim 2, wherein: in the step S3, the NiFi data stream platform configures connection information of the Elasticsearch engine, including a port number and an IP address of the Elasticsearch engine, to ensure that the NiFi data stream platform can communicate with the Elasticsearch engine;

6. The method for implementing the HBase secondary index according to claim 2, wherein: in step S5, the data written into the HBase distributed column storage database by the NiFi data stream platform is full data, and includes all source data field information.

7. The method for implementing the HBase secondary index according to claim 2, wherein: in step S7, the query condition in the statistical analysis sql input by the user at the interface must be a field stored in the Elasticsearch engine, and the statistical field must also be stored in the Elasticsearch engine.

8. The method for implementing the HBase secondary index according to claim 2, wherein: in step S6, the query condition in the query detail sql input by the user on the interface must be a field stored in the Elasticsearch engine, and the presentation field may not be stored in the Elasticsearch engine, but must be a field stored in the HBase distributed column storage database.