CN105956106B - Method and system for accessing big data based on memory database and Hbase

Info

Publication number: CN105956106B
Application number: CN201610289753.8A
Authority: CN
Inventors: 李晓静
Original assignee: Beijing Si Tech Information Technology Co Ltd
Current assignee: Beijing Si Tech Information Technology Co Ltd
Priority date: 2016-05-04
Filing date: 2016-05-04
Publication date: 2019-12-13
Anticipated expiration: 2036-05-04
Also published as: CN105956106A

Abstract

The invention discloses a method and a system for accessing big data based on a memory database and Hbase. The method comprises the following steps: S1, reading a plurality of source files to be processed and performing task processing on each source file, wherein one task processing comprises a plurality of processing steps; S2, after each processing step is performed on each source file, storing the file processing state of each source file in a memory database in a first preset mode; and S3, storing the file data of each source file after each processing step in Hbase in a second preset mode. The intermediate data files produced during task processing are stored in Hbase, and the processing state of each file after each processing step is stored in the memory database. This combines the ability of Hbase to store large volumes of data with the high access speed of the memory database, so that the data can be accessed quickly.

Description

Method and system for accessing big data based on memory database and Hbase

Technical Field

The invention relates to the technical field of data access, in particular to a method and a system for accessing big data based on a memory database and Hbase.

Background

A distributed memory database keeps all data in memory and can therefore exploit the very high speed of memory access. Data reliability is guaranteed through a full data file (checkpoint) and a redo log, and flexible data access through SQL is supported. In addition, the database is distributed: it is deployed on a plurality of network nodes and presents a uniform access interface to the outside.

Hbase is a NoSQL database. Data in Hbase can be conveniently retrieved by rowkey or by a range of rowkeys, but flexible queries that do not use the rowkey as the key cannot be satisfied.

Disclosure of Invention

The invention aims to provide a method and a system for accessing big data based on a memory database and Hbase, which can improve the data access speed.

The technical scheme for solving the technical problems is as follows:

In one aspect, the invention provides a method for accessing big data based on a memory database and Hbase, comprising the following steps:

S1, reading a plurality of source files to be processed, and respectively performing task processing on each source file, wherein one task processing comprises a plurality of processing steps;

S2, after each processing step is carried out on each source file, storing the file processing state of each source file in a memory database in a first preset mode;

S3, storing the file data of each source file after each processing step in Hbase in a second preset mode.

In another aspect, the present invention provides a system for accessing big data based on a memory database and Hbase, comprising:

The file reading module is used for reading a plurality of source files to be processed;

The task processing module is used for performing task processing on each source file, wherein one task processing comprises a plurality of processing steps;

The first storage module is used for storing the file processing state of each source file in the memory database in a first preset mode after each processing step is carried out on each source file;

The second storage module is used for storing the file data of each source file after each processing step in the Hbase in a second preset mode.

According to the method and the system for accessing big data based on the memory database and Hbase, the intermediate data files produced during task processing are stored in Hbase, and the file processing state of each file after each processing step is stored in the memory database. This combines the ability of Hbase to store large volumes of data with the high access speed of the memory database, so that the data can be accessed quickly.

Drawings

Fig. 1 is a flowchart of a method for accessing big data based on a memory database and Hbase according to embodiment 1 of the present invention;

Fig. 2 is a flowchart of Example 2;

Fig. 3 is a schematic diagram of a system for accessing big data based on a memory database and Hbase according to embodiment 3 of the present invention.

Detailed Description

The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.

Example 1, a method for accessing big data based on in-memory database and Hbase. The method provided by the present embodiment is described below with reference to fig. 1.

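Referring to fig. 1, the method provided in this embodiment includes: S1, reading a plurality of source files to be processed, and respectively performing task processing on each source file, wherein one task processing comprises a plurality of processing steps; S2, after each processing step is carried out on each source file, storing the file processing state of each source file in a memory database in a first preset mode; and S3, storing the file data of each source file after each processing step in Hbase in a second preset mode.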

In this embodiment, a process sequence number is configured for each task process, and the step S2 specifically includes:

After each processing step is performed on each source file, the file processing state of each source file and the processing time of the source file are stored in the memory database in the form of a data table, with the process sequence number and the file identifier as indexes.

The step S3 specifically includes:

A reasonable Rowkey is designed for the file produced by each processing step of each source file, and the file data is stored in Hbase in the form of a data table with the Rowkey as the index, wherein the Rowkey is formed by combining the file identifier with the process sequence number.

In this embodiment, when the intermediate file produced by each processing step is stored in Hbase and the file processing state is stored in the memory database, a corresponding data table is created at a first predetermined time interval according to the data amount in each source file and stored in the memory database or Hbase as appropriate, and the corresponding data table is cleaned up periodically at a second predetermined time interval.

Example 2

To further the understanding of the method for accessing big data based on the in-memory database and Hbase provided by the present invention, a specific example is described below.

Referring to fig. 2, first, a program based on Hbase storage (hereinafter referred to as the work order program for convenience of description) is developed to record and save the files produced by each processing step of the task processing, together with the breakpoint file records. Take preprocessing and deduplication of a source file as an example. After each file is preprocessed, the work order program writes the preprocessed file into Hbase and records the processing state of the file in the memory database (for example, preprocessing completed, deduplication not started). The work order program then reads the preprocessed file from Hbase, places it at the deduplication entry, and updates the processing state of the file in the memory database (preprocessing completed, deduplication started). After deduplication finishes, the work order program takes the file from the deduplication exit, writes it into Hbase, and updates the processing state of the file in the memory database (preprocessing completed, deduplication completed).
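The hand-off between the two processing steps can be pictured with a minimal Python sketch. Two plain dictionaries stand in for Hbase and the memory database, and the functions `preprocess`, `deduplicate`, and `run_task`, the `|` separator, and the state strings are hypothetical placeholders rather than the programs described in the patent. Overwriting the intermediate file under a single key is also a simplifying assumption.

```python
hbase = {}        # stand-in for Hbase: Rowkey -> file data
state_table = {}  # stand-in for the memory database: (ProcID, FileName) -> state


def preprocess(lines):
    """Placeholder preprocessing: strip whitespace and drop empty lines."""
    return [s.strip() for s in lines if s.strip()]


def deduplicate(lines):
    """Placeholder deduplication: keep the first occurrence of each line."""
    return list(dict.fromkeys(lines))


def run_task(proc_id, file_name, lines):
    key = f"{proc_id}|{file_name}"
    # Step 1: preprocess, persist the intermediate file, then record the state.
    hbase[key] = preprocess(lines)
    state_table[(proc_id, file_name)] = "preprocessed"
    # Step 2: mark the start, read the intermediate file back, deduplicate, persist.
    state_table[(proc_id, file_name)] = "deduplicating"
    hbase[key] = deduplicate(hbase[key])
    state_table[(proc_id, file_name)] = "deduplicated"
    return hbase[key]


print(run_task("001", "a.dat", ["x ", "y", "", "x"]))
```

Recording the state only after the data has been written means that the state table never claims a step is complete before its output exists, which is what makes the recorded state usable as a breakpoint.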

After each processing step is performed on each processed file, the file processing state of each source file and the processing time of the source file are stored in the memory database in a data table form by taking the process sequence number and the file identifier as indexes.

The table in the memory database is named the file state table (FileStatusTable), and an index is built on the process sequence number ProcID. The table has the following columns: the process sequence number ProcID, the file name FileName, the file processing state Status, and the file processing time deal_time.
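As an illustration of the file state table, the following sketch uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for the distributed memory database; the patent does not name a specific product. The column types, the composite primary key on `(ProcID, FileName)`, the epoch-seconds time format, and the helpers `record_status` and `get_status` are assumptions made for illustration.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")  # stand-in for the distributed memory database
conn.execute(
    """
    CREATE TABLE FileStatusTable (
        ProcID    TEXT NOT NULL,     -- process sequence number
        FileName  TEXT NOT NULL,     -- file identifier
        Status    INTEGER NOT NULL,  -- file processing state
        deal_time REAL NOT NULL,     -- file processing time (epoch seconds, assumed)
        PRIMARY KEY (ProcID, FileName)
    )
    """
)
conn.execute("CREATE INDEX idx_procid ON FileStatusTable (ProcID)")


def record_status(proc_id: str, file_name: str, status: int) -> None:
    """Insert or overwrite the latest processing state of one file."""
    conn.execute(
        "INSERT OR REPLACE INTO FileStatusTable VALUES (?, ?, ?, ?)",
        (proc_id, file_name, status, time.time()),
    )
    conn.commit()


def get_status(proc_id: str, file_name: str):
    """Return the stored state of one file, or None if no row exists."""
    row = conn.execute(
        "SELECT Status FROM FileStatusTable WHERE ProcID = ? AND FileName = ?",
        (proc_id, file_name),
    ).fetchone()
    return row[0] if row else None


record_status("001", "a.dat", 0)
record_status("001", "a.dat", 1)  # a later step overwrites the earlier state
print(get_status("001", "a.dat"))
```

Keeping one row per file and overwriting it after each step keeps the table small, at the cost of not retaining a history of earlier states.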

The values of the file processing state Status have the following meanings:

0: program No. 1 has placed this file at its exit;

1: the work order write program has written this file into the Hbase table;

2: the work order read program has placed this file at the entry of program No. 2;

3: program No. 2 has processed this file;

-1: the file name read from the memory database table by the work order write program does not exist on the file system;

-2: the file name read from the memory database table by the work order read program does not exist in Hbase.
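The four normal state values listed above can be captured as an enumeration. This is a sketch only: the member names and the helper `next_status` are hypothetical, the assumption that a file advances through the states strictly in order is not stated in the patent, and the two error states are omitted.

```python
from enum import IntEnum


class FileStatus(IntEnum):
    AT_PROGRAM_1_EXIT = 0       # program No. 1 has placed the file at its exit
    WRITTEN_TO_HBASE = 1        # the work order write program stored the file in Hbase
    AT_PROGRAM_2_ENTRY = 2      # the work order read program placed the file at the entry of program No. 2
    PROCESSED_BY_PROGRAM_2 = 3  # program No. 2 has processed the file


def next_status(current: FileStatus) -> FileStatus:
    """Return the state that follows `current` in the normal flow (assumed linear)."""
    if current is FileStatus.PROCESSED_BY_PROGRAM_2:
        raise ValueError("file is already fully processed")
    return FileStatus(current + 1)


state = FileStatus.AT_PROGRAM_1_EXIT
while state is not FileStatus.PROCESSED_BY_PROGRAM_2:
    state = next_status(state)
    print(state.value, state.name)
```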

Processes that belong to the same group of processing tasks share the same process sequence number (for example, program No. 1, the work order program, and program No. 2, which process the same task, all have the process sequence number 001). The process sequence number is the basis for task allocation.

The table in Hbase is named the file data table FileData01, and its index is built by using the process sequence number + the file name as the Rowkey. The process sequence number is included in the Rowkey so that the data to be processed by the same process is stored together; read and write operations then fall on the same Region, which gives better performance.
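The effect of prefixing the Rowkey with the process sequence number can be illustrated with a short Python sketch. The helper name `make_rowkey` and the `|` separator are assumptions; the text only specifies that the Rowkey is the process sequence number followed by the file name. A sorted list stands in for the byte-ordered storage of rows in Hbase.

```python
def make_rowkey(proc_id: str, file_name: str, sep: str = "|") -> bytes:
    """Build a Rowkey by prefixing the file name with the process sequence number."""
    if sep in proc_id:
        raise ValueError("process sequence number must not contain the separator")
    return f"{proc_id}{sep}{file_name}".encode("utf-8")


# Hbase keeps rows sorted by Rowkey bytes, so a shared prefix keeps the
# rows of one process adjacent. Sorting the keys mimics that ordering.
keys = sorted(
    make_rowkey(p, f)
    for p, f in [("002", "a.dat"), ("001", "b.dat"), ("002", "c.dat"), ("001", "a.dat")]
)
for key in keys:
    print(key.decode("utf-8"))
```

All keys for process 001 sort ahead of those for process 002, so a range scan over one prefix touches a contiguous block of rows.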

In addition, when data is stored in the memory database or Hbase, a data table can be created every day or every month depending on the data amount, and expired data tables are cleaned up periodically. For example, with one table per day, the tables are named FileData_20160310, FileData_20160311, …, FileData_20160318.
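The per-day naming scheme and the periodic cleanup can be sketched as follows. The helper names `table_name` and `expired_tables` and the seven-day retention window are hypothetical; the patent only states that tables are created per day or per month and that expired tables are cleaned up at a predetermined interval.

```python
from datetime import date, timedelta
from typing import List


def table_name(day: date, prefix: str = "FileData_") -> str:
    """Name of the per-day table, for example FileData_20160310."""
    return f"{prefix}{day:%Y%m%d}"


def expired_tables(existing: List[str], today: date, keep_days: int,
                   prefix: str = "FileData_") -> List[str]:
    """Return the tables whose date suffix is older than the retention window."""
    cutoff = table_name(today - timedelta(days=keep_days), prefix)
    # Names share a prefix and a fixed-width date suffix,
    # so string order equals date order.
    return sorted(n for n in existing if n.startswith(prefix) and n < cutoff)


tables = [table_name(date(2016, 3, 10) + timedelta(days=i)) for i in range(9)]
print(expired_tables(tables, today=date(2016, 3, 18), keep_days=7))
```

Dropping a whole table is typically cheaper than deleting individual rows, which is one plausible motivation for partitioning the data by date.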

When the system restarts after an abnormal downtime or a disk failure, the last processing state of each file is read from the memory database, and processing resumes from the step that follows that state. In other words, the memory database records the breakpoint of each file, so that after a restart only the steps after the breakpoint need to be processed, and processing does not need to start again from the beginning.
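The breakpoint-resume behaviour can be sketched in a few lines of Python. The step names in `STEPS`, the helpers `remaining_steps` and `resume_all`, and the use of a plain dictionary in place of the rows read back from the memory database are all assumptions made for illustration.

```python
from typing import Dict, List, Optional

# Ordered processing steps; the names are illustrative (hypothetical).
STEPS = ["preprocess", "deduplicate"]


def remaining_steps(last_completed: Optional[str]) -> List[str]:
    """Steps still to run for a file, given the last step recorded as completed."""
    if last_completed is None:
        return list(STEPS)
    return STEPS[STEPS.index(last_completed) + 1:]


def resume_all(state_table: Dict[str, Optional[str]]) -> Dict[str, List[str]]:
    """Map each file to the steps it still needs after a restart."""
    return {name: remaining_steps(done) for name, done in state_table.items()}


# A plain dict stands in for the state rows read back from the memory database.
state_after_crash = {"a.dat": "deduplicate", "b.dat": "preprocess", "c.dat": None}
print(resume_all(state_after_crash))
```

A file whose last recorded step is the final one needs no further work, while a file with no recorded step is processed from the start.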

Example 3, a system for accessing big data based on a memory database and Hbase. The system provided by the present embodiment is described below with reference to fig. 3.

Referring to fig. 3, the system provided in this embodiment includes a file reading module 31, a configuration module 32, a task processing module 33, a table creating module 34, a first storage module 35, a second storage module 36, and a table cleaning module 37.

Specifically, the file reading module 31 is configured to read a plurality of source files to be processed.

A task processing module 33, configured to perform task processing on each source file, where a task processing includes multiple processing steps.

The first storage module 35 is configured to store the file processing state of each source file in the memory database in a first preset manner after each processing step is performed on each source file.

And a second storage module 36, configured to store the file data after each processing step of each source file in the Hbase in a second preset manner.

The system provided in this embodiment further includes a configuration module 32, configured to configure a process sequence number for each task process; the first storage module 35 is specifically configured to: after each processing step is performed on each source file, the file processing state of each source file and the processing time of the source file are stored in the memory database in a data table form by taking the process sequence number and the file identifier as indexes.

The second storage module 36 is specifically configured to: design a reasonable Rowkey for the file produced by each processing step of each source file, and store the file data in Hbase in the form of a data table with the Rowkey as the index, wherein the Rowkey is formed by combining the file identifier with the process sequence number.

The system provided by this embodiment further includes a table establishing module 34 and a table cleaning module 37, where the table establishing module 34 is configured to establish a corresponding data table according to the data amount in each source file at a first predetermined time interval, and store the data table in a corresponding memory database or Hbase; and a table cleaning module 37, configured to clean the corresponding data table periodically according to a second predetermined time interval.

In the description herein, references to the terms "embodiment one," "example," "specific example," or "some examples," etc., mean that a particular method, apparatus, or feature described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, methods, apparatuses, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, various embodiments or examples described in this specification, and features of different embodiments or examples, can be combined by one skilled in the art without contradiction.

The above description only illustrates the preferred embodiments of the present invention and is not to be construed as limiting the invention; any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included within its scope of protection.

Claims

1. A method for accessing big data based on a memory database and Hbase, characterized by comprising the following steps:

The step S2 specifically includes:

After each processing step is carried out on each source file, the file processing state of each source file and the processing time of the source file are stored in a memory database in a data table mode by taking the process sequence number and the file identification as indexes;

S3, storing the file data of each processed source file in Hbase in a second preset mode;

The step S3 specifically includes:

2. The method for accessing big data based on a memory database and Hbase according to claim 1, wherein a corresponding data table is created according to the data amount in each source file and a first predetermined time interval, and is stored in the corresponding memory database or Hbase.

3. The method for accessing big data based on a memory database and Hbase according to claim 2, wherein the corresponding data table is periodically cleaned up according to a second predetermined time interval.

4. The method according to any one of claims 1 to 3, wherein upon restart after an abnormal downtime or a disk failure, the file processing state of each source file is read from the memory database, and the processing steps that follow the point of the abnormal downtime or disk failure are performed on the file data according to the file processing state.

5. A system for accessing big data based on an in-memory database and Hbase, comprising:

The first storage module is used for storing the file processing state of each source file in the memory database in a first preset mode after each processing step is carried out on each source file; specifically, after each processing step is carried out on each source file, the file processing state of each source file and the processing time of the source file are stored in the memory database in the form of a data table, with the process sequence number and the file identifier as indexes;

The second storage module is used for storing the file data of each source file after each processing step in the Hbase in a second preset mode; specifically, a reasonable Rowkey is designed for the file produced by each processing step of each source file, and the file data is stored in Hbase in the form of a data table with the Rowkey as the index, wherein the Rowkey is formed by combining the file identifier with the process sequence number.

6. The system for accessing big data based on a memory database and Hbase according to claim 5, further comprising:

The table establishing module is used for establishing a corresponding data table according to the data amount in each source file and a first preset time interval, and storing the data table in a corresponding memory database or Hbase;

And the table cleaning module is used for cleaning the corresponding data table periodically according to a second preset time interval.