CN105956106B - Method and system for accessing big data based on memory database and Hbase - Google Patents

Method and system for accessing big data based on memory database and Hbase

Info

Publication number
CN105956106B
Authority
CN
China
Prior art keywords
file
processing
hbase
memory database
source file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610289753.8A
Other languages
Chinese (zh)
Other versions
CN105956106A (en)
Inventor
李晓静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Si Tech Information Technology Co Ltd
Original Assignee
Beijing Si Tech Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Si Tech Information Technology Co Ltd
Priority to CN201610289753.8A
Publication of CN105956106A
Application granted
Publication of CN105956106B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a system for accessing big data based on a memory database and Hbase. The method comprises the following steps: S1, reading a plurality of source files to be processed and performing task processing on each source file, wherein one task processing comprises a plurality of processing steps; S2, after each processing step is performed on each source file, storing the file processing state of each source file in a memory database in a first preset mode; and S3, storing the file data of each source file after each processing step in Hbase in a second preset mode. The intermediate data files produced during task processing are stored in Hbase, and the processing state of each file after each processing step is stored in the memory database. This combines the capacity of Hbase to store massive data with the high access speed of the memory database, so that the data can be accessed quickly.

Description

Method and system for accessing big data based on memory database and Hbase
Technical Field
The invention relates to the technical field of data access, in particular to a method and a system for accessing big data based on a memory database and Hbase.
Background
A distributed memory database keeps all of its data in memory, which exploits the very high speed of memory access. Data reliability is guaranteed through a full data file (checkpoint) and a redo log, and flexible data access through SQL is supported. The database is also distributed: it is deployed on a plurality of network nodes and presents a uniform access interface to the outside.
Hbase is a NoSQL database. Data in Hbase can be conveniently retrieved by rowkey or by a range of rowkeys, but Hbase cannot satisfy flexible queries whose conditions do not use the rowkey as the key.
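The contrast drawn above can be illustrated with a small sketch. It does not use Hbase itself: a sorted list of keys stands in for a table ordered by rowkey, and the row contents, key format, and function names are all hypothetical. A rowkey range is located by binary search, whereas a condition on any other field has to inspect every row.

```python
import bisect

# Stand-in for an Hbase table (assumption): rows addressed by a sorted rowkey.
rows = {
    "001_a.dat": {"status": 1},
    "001_b.dat": {"status": 3},
    "002_a.dat": {"status": 1},
}
sorted_keys = sorted(rows)


def scan_range(start, stop):
    """Return the rowkeys in [start, stop) using binary search on the sorted keys."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


def find_by_status(status):
    """A condition that is not on the rowkey has no ordering to exploit,
    so every row is inspected."""
    return [k for k in sorted_keys if rows[k]["status"] == status]
```

For example, `scan_range("001_", "002_")` touches only the keys of process 001, while `find_by_status(1)` walks the whole table. This is the kind of query the memory database is meant to serve instead.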
Disclosure of Invention
The invention aims to provide a method and a system for accessing big data based on a memory database and Hbase, which can improve the data access speed.
The technical scheme for solving the technical problems is as follows:
In one aspect, the invention provides a method for accessing big data based on a memory database and Hbase, comprising the following steps:
S1, reading a plurality of source files to be processed, and respectively performing task processing on each source file, wherein one task processing comprises a plurality of processing steps;
S2, after each processing step is carried out on each source file, storing the file processing state of each source file in a memory database in a first preset mode;
And S3, storing the file data of each processing step of each source file in Hbase in a second preset mode.
In another aspect, the present invention provides a system for accessing big data based on a memory database and Hbase, comprising:
The file reading module is used for reading a plurality of source files to be processed;
The task processing module is used for performing task processing on each source file, wherein one task processing comprises a plurality of processing steps;
The first storage module is used for storing the file processing state of each source file in the memory database in a first preset mode after each processing step is carried out on each source file;
And the second storage module is used for storing the file data of each source file after each processing step in the Hbase in a second preset mode.
According to the method and the system for accessing big data based on the memory database and Hbase, the intermediate data files produced during task processing are stored in Hbase, and the file processing state of each file after each processing step is stored in the memory database. This combines the capacity of Hbase to store massive data with the high access speed of the memory database, so that the data can be accessed quickly.
Drawings
Fig. 1 is a flowchart of a method for accessing big data based on a memory database and Hbase according to embodiment 1 of the present invention;
Fig. 2 is a flowchart of the method according to embodiment 2;
Fig. 3 is a schematic diagram of a system for accessing big data based on a memory database and Hbase according to embodiment 3 of the present invention.
Detailed Description
The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.
Example 1, a method for accessing big data based on in-memory database and Hbase. The method provided by the present embodiment is described below with reference to fig. 1.
Referring to fig. 1, the method provided in this embodiment includes: S1, reading a plurality of source files to be processed, and respectively performing task processing on each source file, wherein one task processing comprises a plurality of processing steps;
S2, after each processing step is carried out on each source file, storing the file processing state of each source file in a memory database in a first preset mode;
And S3, storing the file data of each processing step of each source file in Hbase in a second preset mode.
In this embodiment, a process sequence number is configured for each task process, and the step S2 specifically includes:
After each processing step is performed on each source file, the file processing state of each source file and the processing time of the source file are stored in the memory database in a data table form by taking the process sequence number and the file identifier as indexes.
The step S3 specifically includes:
Designing a reasonable Rowkey for the file after each processing step of each source file, and storing the file data in Hbase in a data table form by taking the Rowkey as an index, wherein the Rowkey is obtained by adding the process sequence number to the file identifier.
In this embodiment, when the intermediate file after each processing step is stored in Hbase and the file processing state is stored in the memory database, a corresponding data table is established at a first predetermined time interval according to the data amount in each source file and stored in the memory database or Hbase as appropriate, and the data tables are periodically cleaned at a second predetermined time interval.
Example 2
To further the understanding of the method for accessing big data based on the in-memory database and Hbase provided by the present invention, a specific example is described below.
Referring to fig. 2, first, a program based on Hbase storage (hereinafter referred to as the work order program for descriptive convenience) is developed to record and save the files produced by each processing step of the task processing, together with the breakpoint file records. Take preprocessing and deduplication of a source file as an example. After each file is preprocessed, the work order program writes the preprocessed file into Hbase, and the processing state of the file is recorded in the memory database (for example, preprocessing finished, deduplication not started). The work order program then reads the preprocessed file from Hbase, puts it at the deduplication entry, and changes the processing state of the file in the memory database (preprocessing finished, deduplication started). After deduplication is finished, the work order program takes the file from the deduplication exit, puts it into Hbase, and changes the processing state of the file in the memory database (preprocessing finished, deduplication finished).
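The flow in this example can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: two dictionaries stand in for Hbase and the memory database, the two processing functions are hypothetical placeholders, and the `_pre` / `_dedup` key suffixes are invented here only to keep the two intermediate files apart.

```python
# In-process stand-ins (assumptions): dicts replace Hbase and the memory database.
hbase = {}       # rowkey -> file contents
file_state = {}  # (proc_id, file_name) -> processing state


def preprocess(lines):
    """Hypothetical preprocessing step: trim whitespace and drop empty lines."""
    return [ln.strip() for ln in lines if ln.strip()]


def deduplicate(lines):
    """Hypothetical deduplication step: keep the first occurrence of each line."""
    return list(dict.fromkeys(lines))


def run_task(proc_id, file_name, source_lines):
    key = (proc_id, file_name)
    # Step 1: preprocess, write the intermediate file to "Hbase", record the state.
    hbase[f"{proc_id}_{file_name}_pre"] = preprocess(source_lines)
    file_state[key] = "preprocessing finished, deduplication not started"
    # Step 2: read the intermediate file back and hand it to deduplication.
    staged = hbase[f"{proc_id}_{file_name}_pre"]
    file_state[key] = "preprocessing finished, deduplication started"
    # Step 3: store the deduplicated file and record the final state.
    hbase[f"{proc_id}_{file_name}_dedup"] = deduplicate(staged)
    file_state[key] = "preprocessing finished, deduplication finished"
    return hbase[f"{proc_id}_{file_name}_dedup"]
```

Because the state is updated after every step, a reader of `file_state` can always tell which intermediate file in `hbase` is the most recent valid one for a given source file.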
After each processing step is performed on each processed file, the file processing state of each source file and the processing time of the source file are stored in the memory database in a data table form by taking the process sequence number and the file identifier as indexes.
The table in the memory database is the file state table (FileStatusTable), with an index built on the process sequence number ProcID. Its structure is: the process sequence number ProcID, the file name FileName, the file processing state Status, and the file processing time deal_time.
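A minimal sketch of such a state table is given below, using SQLite's in-memory mode purely as a stand-in for the distributed memory database. The table and column names follow the text (with the time column written as deal_time); the column types, the composite primary key, and the helper functions `set_status` and `get_status` are assumptions made for illustration.

```python
import sqlite3
import time

# SQLite in-memory database used as a stand-in (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE FileStatusTable (
           ProcID    TEXT    NOT NULL,  -- process sequence number
           FileName  TEXT    NOT NULL,  -- file identifier
           Status    INTEGER NOT NULL,  -- file processing state
           deal_time REAL    NOT NULL,  -- file processing time (epoch seconds)
           PRIMARY KEY (ProcID, FileName)
       )"""
)
# Index on the process sequence number, as described in the text.
conn.execute("CREATE INDEX idx_proc ON FileStatusTable (ProcID)")


def set_status(proc_id, file_name, status):
    """Insert or overwrite the state of one file after a processing step."""
    conn.execute(
        "INSERT OR REPLACE INTO FileStatusTable VALUES (?, ?, ?, ?)",
        (proc_id, file_name, status, time.time()),
    )


def get_status(proc_id, file_name):
    """Return the last recorded state of a file, or None if it is unknown."""
    row = conn.execute(
        "SELECT Status FROM FileStatusTable WHERE ProcID = ? AND FileName = ?",
        (proc_id, file_name),
    ).fetchone()
    return row[0] if row else None
```

Keeping one row per (ProcID, FileName) pair means each update replaces the previous state, so the table always holds the latest state of every file.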
The values of the file processing state Status have the following meanings:
0: program No. 1 has placed this file at the exit;
1: the work order writing program has written this file into the Hbase table;
2: the work order reading program has put this file at the entry of program No. 2;
3: program No. 2 has processed this file;
-1: the file name given in the memory database table, read by the work order program when writing to Hbase, does not exist on the file system;
-2: the file name given in the memory database table, read by the work order program when reading from Hbase, does not exist in Hbase.
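The state values listed above could be modelled as an enumeration, as in the sketch below. The member names are invented for readability. The two error codes are assumed here to be -1 and -2, because the values 1 and 2 are already used by normal states; the helper functions are likewise hypothetical.

```python
from enum import IntEnum


class FileStatus(IntEnum):
    AT_EXIT_OF_PROGRAM_1 = 0     # program No. 1 has placed the file at the exit
    WRITTEN_TO_HBASE = 1         # work order writer has stored the file in Hbase
    AT_ENTRY_OF_PROGRAM_2 = 2    # work order reader has put the file at the entry
    PROCESSED_BY_PROGRAM_2 = 3   # program No. 2 has processed the file
    MISSING_ON_FILE_SYSTEM = -1  # assumed value: file not found when writing to Hbase
    MISSING_IN_HBASE = -2        # assumed value: file not found when reading from Hbase


def is_error(status):
    """Error states are the negative codes."""
    return FileStatus(status) < 0


def next_status(status):
    """Advance a normal state by one step; error and final states do not advance."""
    current = FileStatus(status)
    if current < 0 or current is FileStatus.PROCESSED_BY_PROGRAM_2:
        return current
    return FileStatus(current + 1)
```

Because the normal states are consecutive integers, the value itself indicates how far a file has progressed through the chain of program No. 1, the work order program, and program No. 2.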
The processes of the same group of processing tasks share the same sequence number (for example, program No. 1, the work order program, and program No. 2, which process the same task, all have the process sequence number 001), and the process sequence number is the basis for task allocation.
The Hbase table is the file data table FileData01, and the process sequence number + the file name is used as the Rowkey on which the index is built. The process sequence number is included in the Rowkey so that the data to be processed by the same process is stored together; read and write operations then fall on the same Region, which gives better performance.
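The effect of putting the process sequence number first in the Rowkey can be seen in a short sketch. The underscore separator and the three-digit zero padding are assumptions (the text only says the two parts are combined), and the function names are hypothetical. Sorting the keys, as a rowkey-ordered store would, leaves all files of one process next to each other.

```python
from itertools import groupby


def make_rowkey(proc_id, file_name):
    # Zero padding keeps lexicographic order equal to numeric order (assumption).
    return f"{proc_id:03d}_{file_name}"


def group_by_process(rowkeys):
    """Sort the keys, then collect each contiguous run that shares a process prefix."""
    ordered = sorted(rowkeys)
    return {
        prefix: list(run)
        for prefix, run in groupby(ordered, key=lambda k: k.split("_", 1)[0])
    }
```

If the file name came first instead, keys belonging to one process would be scattered across the key space, and a single process would be more likely to read and write several regions.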
In addition, when data is stored in the memory database or Hbase, a data table can be established every day or every month according to the data amount, and expired data tables are cleaned up periodically. For example, with one table per day the table names are: FileData_20160310, FileData_20160311, …, FileData_20160318.
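The daily naming scheme and the periodic cleanup can be sketched as below. The `FileData_` prefix and date format come from the example names in the text; the retention window `keep_days` and both function names are assumptions, since the text does not say how long a table is kept.

```python
from datetime import date, timedelta

PREFIX = "FileData_"


def table_name(day):
    """Name of the daily table for a given date, e.g. FileData_20160310."""
    return f"{PREFIX}{day:%Y%m%d}"


def expired_tables(existing, today, keep_days):
    """Tables older than the retention window (keep_days is an assumed policy).

    The YYYYMMDD suffix sorts chronologically, so a string comparison suffices.
    """
    cutoff = table_name(today - timedelta(days=keep_days))
    return sorted(t for t in existing if t.startswith(PREFIX) and t < cutoff)
```

A scheduled job could call `expired_tables` at the second predetermined interval and drop whatever it returns.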
When the system restarts after an abnormal downtime or a disk failure, the last processing state of each file is looked up in the memory database, and the steps interrupted by the downtime or disk failure are carried out from that state. In other words, the memory database records the breakpoint of each file, so after a restart only the steps after the breakpoint need to be processed, and processing does not have to start again from the beginning.
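Breakpoint recovery of this kind can be sketched as follows. The step names, the representation of a file's state as the number of completed steps, and the `run_step` callback are all assumptions made for illustration; a dictionary again stands in for the state table in the memory database.

```python
STEPS = ["preprocess", "deduplicate"]  # hypothetical step names


def remaining_steps(completed):
    """Steps still to run, given how many had completed before the failure."""
    return STEPS[completed:]


def resume(state_table, run_step):
    """Re-run only the steps after each file's breakpoint.

    state_table maps (proc_id, file_name) to the number of completed steps.
    """
    for key, completed in sorted(state_table.items()):
        for step in remaining_steps(completed):
            run_step(key, step)
            state_table[key] += 1  # record progress after every step
```

A file that had already finished every step is skipped entirely, and a file that failed midway resumes at its first unfinished step.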
Example 3, a system for accessing big data based on a memory database and Hbase. The system provided by the present embodiment is described below with reference to fig. 3.
Referring to fig. 3, the system provided in this embodiment includes a file reading module 31, a configuration module 32, a task processing module 33, a table establishing module 34, a first storage module 35, a second storage module 36, and a table cleaning module 37.
Specifically, the file reading module 31 is configured to read a plurality of source files to be processed.
The task processing module 33 is configured to perform task processing on each source file, where one task processing includes multiple processing steps.
The first storage module 35 is configured to store the file processing state of each source file in the memory database in a first preset manner after each processing step is performed on each source file.
And a second storage module 36, configured to store the file data after each processing step of each source file in the Hbase in a second preset manner.
The system provided in this embodiment further includes a configuration module 32, configured to configure a process sequence number for each task process; the first storage module 35 is specifically configured to: after each processing step is performed on each source file, the file processing state of each source file and the processing time of the source file are stored in the memory database in a data table form by taking the process sequence number and the file identifier as indexes.
The second storage module 36 is specifically configured to: design a reasonable Rowkey for the file after each processing step of each source file, and store the file data in Hbase in a data table form by taking the Rowkey as an index, wherein the Rowkey is obtained by adding the process sequence number to the file identifier.
The system provided by this embodiment further includes a table establishing module 34 and a table cleaning module 37, where the table establishing module 34 is configured to establish a corresponding data table according to the data amount in each source file at a first predetermined time interval, and store the data table in a corresponding memory database or Hbase; and a table cleaning module 37, configured to clean the corresponding data table periodically according to a second predetermined time interval.
According to the method and the system for accessing big data based on the memory database and Hbase, the intermediate data files produced during task processing are stored in Hbase, and the file processing state of each file after each processing step is stored in the memory database. This combines the capacity of Hbase to store massive data with the high access speed of the memory database, so that the data can be accessed quickly.
In the description herein, references to the terms "embodiment one," "example," "specific example," or "some examples," etc., mean that a particular method, apparatus, or feature described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, methods, apparatuses, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples, and one skilled in the art can combine the various embodiments or examples described in this specification, and the features of different embodiments or examples, provided they do not contradict one another.
The above description only illustrates the preferred embodiments of the present invention and is not to be construed as limiting the invention; any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included within its scope of protection.

Claims (6)

1. A method for accessing big data based on a memory database and Hbase, characterized by comprising the following steps:
S1, reading a plurality of source files to be processed, and respectively performing task processing on each source file, wherein one task processing comprises a plurality of processing steps;
S2, after each processing step is carried out on each source file, storing the file processing state of each source file in a memory database in a first preset mode;
The step S2 specifically includes:
After each processing step is carried out on each source file, the file processing state of each source file and the processing time of the source file are stored in a memory database in a data table mode by taking the process sequence number and the file identification as indexes;
S3, storing the file data of each processed source file in Hbase in a second preset mode;
The step S3 specifically includes:
Designing a reasonable Rowkey for the file after each processing step of each source file, and storing the file data in Hbase in a data table form by taking the Rowkey as an index, wherein the Rowkey is obtained by adding a file identifier with a process sequence number.
2. The method according to claim 1, wherein a corresponding data table is created according to the data amount in each source file and a first predetermined time interval, and stored in the corresponding memory database or Hbase.
3. The method for accessing big data based on an in-memory database and Hbase of claim 2, wherein the corresponding data table is periodically cleaned up according to a second predetermined time interval.
4. The method according to any one of claims 1 to 3, wherein when an abnormal downtime or disk failure restart occurs, the file processing status of each source file is read from the memory database, and the processing steps after the abnormal downtime or disk failure are performed on the file data according to the file processing status.
5. A system for accessing big data based on an in-memory database and Hbase, comprising:
The file reading module is used for reading a plurality of source files to be processed;
The task processing module is used for performing task processing on each source file, wherein one task processing comprises a plurality of processing steps;
The first storage module is used for storing the file processing state of each source file in the memory database in a first preset mode after each processing step is carried out on each source file; specifically: after each processing step is carried out on each source file, the file processing state of each source file and the processing time of the source file are stored in a memory database in a data table mode by taking the process sequence number and the file identification as indexes;
The second storage module is used for storing the file data of each source file after each processing step in the Hbase in a second preset mode; the method specifically comprises the following steps: designing a reasonable Rowkey for the file after each processing step of each source file, and storing the file data in Hbase in a data table form by taking the Rowkey as an index, wherein the Rowkey is obtained by adding a file identifier with a process sequence number.
6. The system for accessing big data based on an in-memory database and Hbase of claim 5, further comprising:
The table establishing module is used for establishing a corresponding data table according to the data amount in each source file and a first preset time interval, and storing the data table in a corresponding memory database or Hbase;
And the table cleaning module is used for cleaning the corresponding data table periodically according to a second preset time interval.
CN201610289753.8A 2016-05-04 2016-05-04 method and system for accessing big data based on memory database and Hbase Active CN105956106B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610289753.8A CN105956106B (en) 2016-05-04 2016-05-04 method and system for accessing big data based on memory database and Hbase

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610289753.8A CN105956106B (en) 2016-05-04 2016-05-04 method and system for accessing big data based on memory database and Hbase

Publications (2)

Publication Number Publication Date
CN105956106A (en) 2016-09-21
CN105956106B (en) 2019-12-13

Family

ID=56913637

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610289753.8A Active CN105956106B (en) 2016-05-04 2016-05-04 method and system for accessing big data based on memory database and Hbase

Country Status (1)

Country Link
CN (1) CN105956106B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7426516B1 (en) * 2003-11-24 2008-09-16 Novell, Inc. Mechanism for supporting indexed tagged content in a general purpose data store
CN103246700A (en) * 2013-04-01 2013-08-14 厦门市美亚柏科信息股份有限公司 Mass small file low latency storage method based on HBase
CN103955538A (en) * 2014-05-19 2014-07-30 携程计算机技术(上海)有限公司 HBase data persistence and query methods and HBase system
CN104391903A (en) * 2014-11-14 2015-03-04 广州科腾信息技术有限公司 Distributed storage and parallel calculation-based power grid data quality detection method
CN105138592A (en) * 2015-07-31 2015-12-09 武汉虹信技术服务有限责任公司 Distributed framework-based log data storing and retrieving method

Also Published As

Publication number Publication date
CN105956106A (en) 2016-09-21

Similar Documents

Publication Publication Date Title
US9183268B2 (en) Partition level backup and restore of a massively parallel processing database
CN109164980B (en) Aggregation optimization processing method for time sequence data
EP2474919B1 (en) System and method for data replication between heterogeneous databases
US9256665B2 (en) Creation of inverted index system, and data processing method and apparatus
CN102541757B (en) Write cache method, cache synchronization method and device
CN109213756A (en) Data storage, search method, device, server and storage medium
WO2012083754A1 (en) Method and device for processing dirty data
CN102141963A (en) Method and equipment for analyzing data
CN111078657A (en) Service log query method, system, medium and equipment of distributed system
CN106570163A (en) Unreliable environment-oriented audit log read-write managing method and system
CN106815353A (en) A kind of method and apparatus of data query
CN105630934A (en) Data statistic method and system
WO2023277819A3 (en) Data processing method, system, device, computer program product, and storage function
CN111078719A (en) Data recovery method and device, storage medium and processor
CN118069712A (en) Data life cycle management method and device, electronic equipment and storage medium
CN114297196A (en) Metadata storage method and device, electronic equipment and storage medium
CN114297204A (en) Data storage and retrieval method and device for heterogeneous data source
CN103092955B (en) Checkpointed method, Apparatus and system
CN111159117B (en) Low-overhead file operation log acquisition method
CN105956106B (en) method and system for accessing big data based on memory database and Hbase
CN108334565A (en) A kind of data mixing storage organization, data store query method, terminal and medium
CN104331460A (en) Hbase-based data read-write operation method and system
CN110399396B (en) Efficient data processing
CN113064943A (en) Data acquisition method and device, electronic equipment and storage medium
CN115858471A (en) Service data change recording method, device, computer equipment and medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant