CN105608223B

CN105608223B - Storage method and system for an HBase database for Kafka

Info

Publication number: CN105608223B
Application number: CN201610019054.1A
Authority: CN
Inventors: 曹宇; 余效伟; 肖赞; 李旭阳
Original assignee: BEIJING SINOIOV VEHICLE NETWORK TECHNOLOGY Co Ltd
Current assignee: BEIJING SINOIOV VEHICLE NETWORK TECHNOLOGY Co Ltd
Priority date: 2016-01-12
Filing date: 2016-01-12
Publication date: 2019-04-30
Anticipated expiration: 2036-01-12
Also published as: CN105608223A

Abstract

The present invention relates to a storage method for an HBase database for Kafka, comprising: S1: collecting the data of all topics in a Kafka cluster and saving the data in a queue; S2: configuring, in a configuration file, the correspondence between each topic and the separator, filtering rule, and storage rule for its data; S3: serializing the data according to the correspondence, and filtering the data; S4: reading the configuration information in the configuration file, creating a connection to the HBase database, and building the data into put objects; S5: storing the put objects in the HBase database. The invention removes the need to handle each Kafka topic separately by providing a general method that adapts to all topics, and it constructs a high-performance HBase storage method that greatly improves data write efficiency, increases output bandwidth, and makes maximum use of the machine's network and disk performance. A dual-queue design safeguards the data to the greatest extent and avoids data loss.

Description

Storage method and system for an HBase database for Kafka

Technical field

The present invention relates to the technical field of computer data processing, and in particular to a storage method and system for an HBase database for Kafka.

Background technique

Kafka (a distributed message queue) is a high-throughput distributed publish-subscribe messaging system that can handle all the action stream data of a consumer-scale website. Such actions (page views, searches, and other user actions) are a key factor in many social functions on the modern web. Because of throughput requirements, these data are usually handled through log processing and log aggregation. For log data and offline analysis systems such as Hadoop (a distributed system framework) that are nevertheless subject to real-time processing requirements, Kafka is a feasible solution. The purpose of Kafka is to unify online and offline message processing through Hadoop's parallel loading mechanism, and also to provide real-time consumption through a cluster.

HBase is a distributed, column-oriented open-source database. HBase technology can be used to build large-scale structured storage clusters on inexpensive PC servers. HBase is an open-source implementation of Google Bigtable (a distributed data storage system). Just as Google Bigtable uses GFS (a scalable distributed file system) as its file storage system, HBase uses Hadoop HDFS (a distributed file system); just as Google runs MapReduce (a programming model) to process the massive data in Bigtable, HBase uses Hadoop MapReduce to process the massive data in HBase; and just as Google Bigtable uses Chubby as its coordination service, HBase uses ZooKeeper (a distributed application coordination service) as the counterpart.

In data acquisition, relaying data through Kafka is a very common practice. Kafka is organized into many topics, and writing the data of different topics into an HBase database usually requires a different approach for each, because the data differ in structure and content from topic to topic. Separate logic code therefore has to be written for each topic to perform the data write, which is time-consuming and inefficient; it also hampers later maintenance, since the more logic there is, the higher the maintenance cost.

Summary of the invention

The technical problem to be solved by the present invention is how to provide a general HBase database storage method and system applicable to different Kafka topics, so that data writing for different topics can be completed with the least work, the highest efficiency, and the lowest maintenance cost.

To this end, the present invention proposes a storage method for an HBase database for Kafka (a distributed message queue), characterized by comprising the following steps:

S1: collecting the data of all topics in a Kafka cluster and saving the data in a queue;

S2: configuring, in a configuration file, the correspondence between each topic and the separator, filtering rule, and storage rule for its data;

S3: serializing the data according to the correspondence, and filtering the data;

S4: reading the configuration information in the configuration file, creating a connection to the HBase database, and building the data into put objects;

S5: storing the put objects in the HBase database.

Preferably, if an individual topic is not suited to the general storage mode, the method further includes: obtaining, by reflection, an independent processing class for that topic to process the data of that topic.

Preferably, storing the put objects in the HBase database includes: writing the data into the HBase database by calling the storage interface of the HBase database.

Preferably, the configuration information of the configuration file includes data filtering rules, data storage rules, the correspondence between topics and independent processing classes, and the correspondence between topics and HBase database tables.

Preferably, storing the put objects in the HBase database includes: calling the interface provided by the storage framework of the HBase database, so that the data are written into the HBase database through the storage framework.

Preferably, building the put objects includes: serializing the data again according to the rules of the configuration file, and building them into the object type required for storage in the HBase database.

Preferably, storing the put objects in the HBase database includes the following steps:

S51: obtaining table names according to the queue, and allocating, according to the table names, an operation thread group, a table object group, and a buffer corresponding one-to-one with the operation thread group and the table object group;

S52: the operation thread group processing the received data and writing the received data into the corresponding buffer;

S53: obtaining the data in the buffer and the corresponding table object group, and distributing the data in order to the corresponding table object group for data writing.

Preferably, the data are written into the corresponding buffer in the form of a queue.

Preferably, the table object group performs data writing in the form of multiple parallel queues.

In another aspect, the present invention also provides a storage system for an HBase database for Kafka, comprising:

a data acquisition unit, configured to collect the data of all topics in a Kafka cluster and save the data in a queue;

a configuration unit, configured to configure, in a configuration file, the correspondence between each topic and the separator, filtering rule, and storage rule for its data;

a serialization and filtering unit, configured to serialize the data according to the correspondence and to filter the data;

an object construction unit, configured to read the configuration information in the configuration file, create a connection to the HBase database, and build the data into put objects;

a data storage unit, configured to store the put objects in the HBase database.

Preferably, if an individual topic is not suited to the general storage mode, the system further includes an independent processing unit, which obtains, by reflection, the independent processing class for that topic to process the data of that topic.

The storage method and system for an HBase database for a distributed message queue provided by the present invention remove the need to handle each Kafka topic separately: different types of data in the Kafka cluster are consumed in a general way and stored uniformly in the HBase database. In addition, the use of multiple threads, multiple HTable objects, and the producer-consumer pattern greatly improves data write efficiency, increases output bandwidth, and makes maximum use of the machine's network and disk performance. The dual-queue design safeguards the data to the greatest extent and avoids data loss.

Detailed description of the invention

The features and advantages of the present invention will be understood more clearly with reference to the accompanying drawings, which are schematic and should not be construed as limiting the invention in any way. In the drawings:

Fig. 1 shows a flow diagram of the storage method of the present invention for an HBase database for Kafka;

Fig. 2 shows a detailed storage flow diagram of the present invention for an HBase database for Kafka.

Specific embodiment

Embodiments of the present invention are described in detail below with reference to the accompanying drawings.

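As shown in Fig. 1, the present invention provides a storage method for an HBase database for Kafka (a distributed message queue), characterized by comprising the following steps:

S1: collecting the data of all topics in a Kafka cluster and saving the data in a queue;

S2: configuring, in a configuration file, the correspondence between each topic and the separator, filtering rule, and storage rule for its data;

S3: serializing the data according to the correspondence, and filtering the data;

S4: reading the configuration information in the configuration file, creating a connection to the HBase database, and building the data into put objects;

S5: storing the put objects in the HBase database.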

As shown in Fig. 2, specifically, S1: a unified data receiver collects the data of all topics in the Kafka cluster; the topic list is specified in the configuration file, and the data are saved in a queue. The unified data collector uses a queue to collect the data of all topics in the Kafka cluster, so that data are produced and consumed smoothly and part of the data can be cached when the HBase database is under excessive pressure or in other abnormal situations.
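The queue-based collection in step S1 can be sketched as follows. This is a minimal illustration only: the list of (topic, payload) pairs stands in for a real Kafka consumer, and the names `collect`, `topics`, and `buffer` are hypothetical and do not come from the patent.

```python
import queue


def collect(records, topics, buffer):
    """Put every (topic, payload) pair whose topic is configured onto the queue."""
    accepted = 0
    for topic, payload in records:
        if topic in topics:
            # put() blocks while the bounded queue is full, which gives
            # back-pressure when the downstream writer is slow.
            buffer.put((topic, payload))
            accepted += 1
    return accepted


# Hypothetical topic list, as it might be listed in the configuration file.
topics = {"track", "alarm"}
buffer = queue.Queue(maxsize=1000)

# Stand-in for messages polled from the Kafka cluster.
incoming = [("track", "1|116.3|39.9"), ("alarm", "7|overspeed"), ("debug", "x")]
count = collect(incoming, topics, buffer)
```

A bounded queue is one way to obtain the caching behaviour described above; the patent does not state the queue capacity or the blocking policy.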

S2: the correspondence between each topic and the separator, filtering rule, and storage rule for its data is configured in the configuration file.

S3: the data then undergo serialization. According to the correspondence configured in step S2, the data collected in step S1 are serialized into a LIST (array); the correspondence between topics and their filtering rules is read from the configuration file, and the collected data are filtered. For example, if the data begin with an identifier that is not to be stored in the database, it must be filtered out; because each topic is different, filtering is applied per topic.
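A small sketch of step S3 under stated assumptions: each topic has one separator, and a filtering rule is read as "drop the record when the field at a given position equals a given string". That reading, and the names `serialize`, `keep`, `separators`, and `filter_rules`, are illustrative assumptions rather than the patent's definitions.

```python
def serialize(topic, raw, separators):
    """Split one raw message into a list of fields using the topic's separator."""
    return raw.split(separators[topic])


def keep(fields, rules):
    """Return False when any (position, text) rule matches, i.e. the record is dropped."""
    for position, text in rules:
        if position < len(fields) and fields[position] == text:
            return False
    return True


separators = {"track": "|"}                   # hypothetical per-topic separator
filter_rules = {"track": [(0, "CAPACITY")]}   # drop records whose field 0 is 'CAPACITY'

messages = ["CAPACITY|0|0", "v001|116.3|39.9"]
rows = [serialize("track", m, separators) for m in messages]
kept = [r for r in rows if keep(r, filter_rules["track"])]
```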

S4: the configuration information in the configuration file is read, including the configuration of topics and table names, rowkey (row key) rules, and column families; a connection to the HBase database is created, and the filtered data are built into put objects. The entire storage process is defined by the configuration file, which can configure the data filtering rules, the data storage rules, the correspondence between topics and independent processing classes, the correspondence between topics and HBase tables, the number of threads in the producer thread group of the storage framework, the number of HTable objects, and so on. Defining the filtering rules and storage rules in the configuration file with a custom syntax ensures that writing Kafka topics into the database is general. Configuration file example:

Data filtering: topicname:up,'CAPACITY',g

topicname:3,r

Data storage: topicname:test,self,[1,2,3:family,column],[4,5,6:family,column]

Rule interpretation: the data filtering syntax is topicname: data position, specific string, rule (g means filter, r means use as rowkey, c means use as column).

The storage syntax is topicname: table name, rowkey rule (self means the rowkey is generated automatically; otherwise it is extracted from the data), and the correspondence between data positions and family, column.
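To make the custom rule syntax concrete, the following sketch parses the example lines into dictionaries. It is one possible reading of the syntax: the position field is kept as a string because the example uses both `up` and `3`, and each bracket group is read as "positions : family, column". The function names are hypothetical.

```python
import re


def parse_filter_rule(line):
    """Parse "topic:position[,'text'],rule" where rule is g, r or c."""
    topic, rest = line.split(":", 1)
    parts = [p.strip().strip("'").strip() for p in rest.split(",")]
    text = parts[1] if len(parts) == 3 else None
    return {"topic": topic.strip(), "position": parts[0],
            "text": text, "rule": parts[-1]}


def parse_storage_rule(line):
    """Parse "topic:table,rowkey_rule,[positions:family,column],..."."""
    topic, rest = line.split(":", 1)
    head = rest.split("[", 1)[0]
    table, rowkey_rule = [p.strip() for p in head.split(",") if p.strip()]
    columns = []
    for group in re.findall(r"\[([^\]]*)\]", rest):
        positions, target = group.split(":", 1)
        family, column = [p.strip() for p in target.split(",")]
        columns.append(([int(p) for p in positions.split(",")], family, column))
    return {"topic": topic.strip(), "table": table,
            "rowkey": rowkey_rule, "columns": columns}


f1 = parse_filter_rule("topicname:up, ' CAPACITY ', g")
f2 = parse_filter_rule("topicname:3,r")
s1 = parse_storage_rule(
    "topicname:test, self, [1,2,3:family, column], [4,5,6:family, column]")
```

The parser tolerates the stray spaces seen in the example lines; a production parser would also need to validate the rule letter and report malformed lines.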

Here, building the put objects means serializing the data a second time according to the rules of the above configuration file, so that the data are built into the object type required for storage in the HBase database.
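The second serialization can be illustrated with a stand-in for the put object. The `Put` class below is a simplified model written for this sketch (a row key plus cells keyed by family and qualifier), not the HBase client class, and the structured `rule` dictionary is a hypothetical, already-parsed form of a storage rule.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Put:
    """Simplified stand-in for a put object: row key plus (family, qualifier) cells."""
    row: bytes
    cells: dict = field(default_factory=dict)

    def add_column(self, family, qualifier, value):
        self.cells[(family.encode(), qualifier.encode())] = value.encode()
        return self


def build_put(fields, rule):
    """Build one Put from a list of fields according to a parsed storage rule."""
    if rule["rowkey"] == "self":
        row = uuid.uuid4().hex.encode()        # assumption: 'self' means a generated key
    else:
        row = fields[rule["rowkey"]].encode()  # otherwise the key is taken from a field
    put = Put(row)
    for position, (family, qualifier) in rule["columns"].items():
        put.add_column(family, qualifier, fields[position])
    return put


# Hypothetical rule: field 0 is the rowkey, fields 1 and 2 go to family 'f'.
rule = {"table": "test", "rowkey": 0, "columns": {1: ("f", "lon"), 2: ("f", "lat")}}
put = build_put(["v001", "116.3", "39.9"], rule)
```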

S5: after the data are built into put objects, the put objects are stored in the HBase database in batches. Specifically, general data write logic is used, and the data are written into the HBase database by calling the interface of the high-performance HBase storage framework.

If an individual topic is found in step S3 to be unsuited to the above general storage mode, or an individual topic collected in step S1 is unsuited to it, a factory class can be provided that obtains, by reflection, the independent processing class for that topic to process the data. An independent processing class must implement the top-level interface provided in the program; this interface defines the data processing methods that an independent processing class must contain.
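The factory-plus-reflection idea can be sketched as below. The interface name `TopicHandler`, the example class `AlarmHandler`, and the mapping `HANDLER_CLASSES` are hypothetical; looking a class up by its configured name among the interface's subclasses is used here as a Python analogue of reflective class loading.

```python
from abc import ABC, abstractmethod


class TopicHandler(ABC):
    """Top-level interface that every independent processing class must implement."""

    @abstractmethod
    def process(self, raw):
        ...


class AlarmHandler(TopicHandler):
    """Example handler for a topic that does not fit the general storage mode."""

    def process(self, raw):
        return raw.upper().split(";")


# Hypothetical configuration: topic name -> name of its independent processing class.
HANDLER_CLASSES = {"alarm": "AlarmHandler"}


def create_handler(topic):
    """Factory: return a handler instance for the topic, or None for the general path."""
    class_name = HANDLER_CLASSES.get(topic)
    if class_name is None:
        return None
    # Resolve the class from its configured name at run time.
    for cls in TopicHandler.__subclasses__():
        if cls.__name__ == class_name:
            return cls()
    raise LookupError(f"no handler class named {class_name!r}")


handler = create_handler("alarm")
```

Topics without an entry fall through to the general path, so adding a special case only requires a new class and one configuration line.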

Storing the put objects in the HBase database includes the following steps:

S51: table names are obtained from the data list of step S1, threads are obtained from a thread pool, and for each table name an operation thread group, an HTable object group (table object group), and a buffer corresponding one-to-one with the operation thread group and the HTable object group are allocated. The HTable object group is the consumer thread group; consumer thread groups and operation thread groups correspond one-to-one, and each thread in a consumer thread group has several HTable objects. In the high-performance storage framework the threads are divided into multiple thread groups, and each thread group corresponds one-to-one with a table to be written. That is, if 5 topics are to be written into 3 tables, there are 3 thread groups; each thread group contains several threads, and each thread processes part of the data.

S52: the operation thread group processes the received data and writes them into the corresponding buffer. The data can be written into the buffer in the form of a queue, with queues corresponding one-to-one with threads. Using a queue protects the data when the HBase database is under too much pressure to accept writes or in other abnormal situations; this is the second queue of the present invention. The invention uses a dual-queue design: a queue is present both at the start and at the end of the process, so that data are cached at both input and output. This prevents data loss in abnormal situations and safeguards the data to the greatest extent.

S53: the data in the buffer and the corresponding HTable object group are obtained. Specifically, a thread is obtained from the storage processing thread pool, an HTable object is obtained from the HTable object pool, and the data are distributed in order to the corresponding HTable object group for writing. The correspondence among table names, thread groups, object groups, and buffers ensures fast data writes, greatly increases output bandwidth, makes maximum use of server network and disk I/O bandwidth, and improves write speed. The HTable object group can write data in the form of multiple parallel queues. Using multiple parallel queues at the final output, together with multiple threads and multiple HTable objects, maximizes parallel output and improves performance.
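Steps S51 to S53 follow a producer-consumer pattern, which can be sketched as below. This is a simplified model under several assumptions: `FakeTable` stands in for an HTable object and only records what it is given, the main thread plays the role of the operation thread group, and the parameters `writers_per_table` and `batch_size` are illustrative.

```python
import queue
import threading


class FakeTable:
    """Stand-in for an HTable object; records batches instead of writing to HBase."""

    def __init__(self):
        self.rows = []

    def put(self, batch):
        self.rows.extend(batch)


def consume(buffer, table, batch_size):
    """Consumer thread: drain the buffer and write in batches until a None sentinel."""
    batch = []
    while True:
        item = buffer.get()
        if item is None:
            break
        batch.append(item)
        if len(batch) >= batch_size:
            table.put(batch)
            batch = []
    if batch:
        table.put(batch)


def store(records, topic_to_table, writers_per_table=2, batch_size=4):
    # S51: one buffer and one group of writers per target table.
    tables = sorted(set(topic_to_table.values()))
    buffers = {name: queue.Queue(maxsize=64) for name in tables}
    clients = {name: [FakeTable() for _ in range(writers_per_table)] for name in tables}
    threads = [threading.Thread(target=consume, args=(buffers[name], client, batch_size))
               for name in tables for client in clients[name]]
    for thread in threads:
        thread.start()
    # S52: route each record into the buffer (second queue) of its table.
    for topic, row in records:
        buffers[topic_to_table[topic]].put(row)
    # One sentinel per writer tells every consumer to flush and stop.
    for name in tables:
        for _ in range(writers_per_table):
            buffers[name].put(None)
    # S53: the writers drain the buffers in parallel; wait for them to finish.
    for thread in threads:
        thread.join()
    return {name: sorted(r for c in clients[name] for r in c.rows) for name in tables}


# Five topics written into three tables, as in the example in the text.
topic_to_table = {"t1": "track", "t2": "track", "t3": "alarm",
                  "t4": "alarm", "t5": "status"}
records = [(f"t{i % 5 + 1}", i) for i in range(20)]
result = store(records, topic_to_table)
```

Because several writers share one buffer per table, rows of the same table may be written out of arrival order across writers; the sketch sorts the results only to make them easy to compare.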

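In another aspect, using the above storage method, the present invention also provides a storage system for an HBase database for Kafka, comprising:

a data acquisition unit, configured to collect the data of all topics in a Kafka cluster and save the data in a queue;

a configuration unit, configured to configure, in a configuration file, the correspondence between each topic and the separator, filtering rule, and storage rule for its data;

a serialization and filtering unit, configured to serialize the data according to the correspondence and to filter the data;

an object construction unit, configured to read the configuration information in the configuration file, create a connection to the HBase database, and build the data into put objects;

a data storage unit, configured to store the put objects in the HBase database.

If an individual topic is not suited to the general storage mode, the system further includes an independent processing unit, which obtains, by reflection, the independent processing class for that topic to process the data of that topic.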

Although embodiments of the present invention have been described with reference to the accompanying drawings, those skilled in the art can make various modifications and variations without departing from the spirit and scope of the invention, and all such modifications and variations fall within the scope defined by the appended claims.

Claims

1. A storage method for an HBase database for Kafka, characterized by comprising the following steps:

S1: collecting the data of all topics in a Kafka cluster and saving the data in a queue;

S2: configuring, in a configuration file, the correspondence between each topic and the separator, filtering rule, and storage rule for its data;

S3: serializing the data according to the correspondence, and filtering the data;

S4: reading the configuration information in the configuration file, creating a connection to the HBase database, and building the data into put objects;

S5: obtaining table names according to the queue, and allocating, according to the table names, an operation thread group, a table object group, and a buffer corresponding one-to-one with the operation thread group and the table object group; the operation thread group processing the received data and writing the received data into the corresponding buffer; and obtaining the data in the buffer and the corresponding table object group, and distributing the data in order to the corresponding table object group for data writing, so that the put objects are stored in the HBase database.

2. The storage method for an HBase database for Kafka according to claim 1, characterized in that, if an individual topic is not suited to the general storage mode, the method further includes: obtaining, by reflection, an independent processing class for that topic to process the data of that topic.

3. The storage method for an HBase database for Kafka according to claim 2, characterized in that the configuration information of the configuration file includes data filtering rules, data storage rules, the correspondence between topics and independent processing classes, and the correspondence between topics and HBase database tables.

4. The storage method for an HBase database for Kafka according to claim 1 or 2, characterized in that storing the put objects in the HBase database includes: calling the interface provided by the storage framework of the HBase database, so that the data are written into the HBase database through the storage framework.

5. The storage method for an HBase database for Kafka according to claim 1 or 2, characterized in that building the put objects includes: serializing the data again according to the rules of the configuration file, and building them into the object type required for storage in the HBase database.

6. The storage method for an HBase database for Kafka according to claim 1, characterized in that the data are written into the corresponding buffer in the form of a queue.

7. The storage method for an HBase database for Kafka according to claim 1, characterized in that the table object group performs data writing in the form of multiple parallel queues.

8. A storage system for an HBase database for Kafka, characterized by comprising:

a data acquisition unit, configured to collect the data of all topics in a Kafka cluster and save the data in a queue;

a configuration unit, configured to configure, in a configuration file, the correspondence between each topic and the separator, filtering rule, and storage rule for its data;

a serialization and filtering unit, configured to serialize the data according to the correspondence and to filter the data;

an object construction unit, configured to read the configuration information in the configuration file, create a connection to the HBase database, and build the data into put objects;

a data storage unit, configured to obtain table names according to the queue, and to allocate, according to the table names, an operation thread group, a table object group, and a buffer corresponding one-to-one with the operation thread group and the table object group; the operation thread group processes the received data and writes the received data into the corresponding buffer; the data in the buffer and the corresponding table object group are obtained, and the data are distributed in order to the corresponding table object group for data writing, so that the put objects are stored in the HBase database.

9. The storage system for an HBase database for Kafka according to claim 8, characterized in that, if an individual topic is not suited to the general storage mode, the system further includes an independent processing unit, which obtains, by reflection, the independent processing class for that topic to process the data of that topic.