CN115544149A - Small file storage method and system based on HBase multi-terminal fusion

Info

Publication number: CN115544149A
Application number: CN202211286417.XA
Authority: CN
Inventors: 佘平; 罗琳; 李静茹; 徐鑫朋; 袁铭
Original assignee: CETC 32 Research Institute
Current assignee: CETC 32 Research Institute
Priority date: 2022-10-20
Filing date: 2022-10-20
Publication date: 2022-12-30

Abstract

The invention provides a small file storage method and system based on HBase multi-terminal fusion, in which a table is created in the HBase database, file-name-related information is used as the row key, and a single column family stores the content, type, size and creation time of each small file. The file operation process comprises the following steps: step 1: inputting a file path and a file name; step 2: forming the HBase basic data entry metadata according to the file name; step 3: determining the type of file operation, calling the client to connect to the HBase database, and performing small file query, deletion, download and addition operations according to the metadata. The invention provides C++/C#/Java data input interfaces to achieve unified access to multi-source data and, given the large number of small files, provides multiple HBase Thrift services so that massive numbers of small files are stored efficiently through flexible load balancing.

Description

Small file storage method and system based on HBase multi-terminal fusion

Technical Field

The invention relates to the technical field of data storage and processing, in particular to a small file storage method and system based on HBase multi-terminal fusion.

Background

In the field of mass data storage, data is generally stored in a distributed file system, which has a data redundancy mechanism and supports horizontal scaling of the storage system. A distributed file system generally consists of multiple data nodes; a metadata service provides file attribute information, and accessing a file requires first accessing the file's metadata and then its actual data. The data itself is stored in data blocks as the basic storage unit, and the data block size is generally larger than that of a single-machine file system. For example, in the distributed file system HDFS, the size of one data block is 128 MB.

Patent document CN114595255A (application number: CN202210238856.7) discloses multi-source heterogeneous data fusion storage and relates to the technical field of data storage. The multi-source heterogeneous data fusion storage comprises a HaiNaTable database management system and a storage hard disk, wherein the HaiNaTable database management system provides the functions of starting fusion, adding, modifying, primary key lookup and finishing fusion. The HaiNaTable database management system stores data files using Tdb for data and TIndex for the index, and writes them to the storage hard disk in real time; the index files store the characteristic information of the data files as Int128 values derived from strings generated by MD5.

However, if a distributed file system holds a large number of small files, overall data access performance is poor, because accessing a large number of small files incurs metadata and data block overhead. The present invention therefore accepts input through C++/C#/Java interfaces, stores file metadata information in the distributed column-oriented database system HBase in the form of a Rowkey, and stores small file data in a single-column data unit of HBase, so that small file data can be stored and accessed quickly and reliably.

Disclosure of Invention

In view of the defects in the prior art, the invention aims to provide a small file storage method and system based on HBase multi-terminal fusion.

According to the small file storage method based on HBase multi-terminal fusion provided by the invention, a table is created in the HBase database, file-name-related information is used as the row key, and a single column family is used to store the content, type, size and creation time of a small file, wherein the file operation process comprises the following steps:

step 1: inputting a file path and a file name;

step 2: forming the HBase basic data entry metadata according to the file name;

step 3: determining the type of file operation, calling the client to connect to the HBase database, and performing small file query, deletion, download and addition operations according to the metadata.

Preferably, the small file query process is as follows: inputting a file name, calling the HBase Thrift C++/C#/Java interface to query, and determining whether the file exists; if so, constructing and outputting an encapsulated small file object, and if not, outputting null.

Preferably, the small file deletion process is as follows: inputting a file name, and calling the HBase Thrift C++/C#/Java interface to delete;

the small file addition process is as follows: inputting a small file object, reading the file content, and calling the HBase Thrift C++/C#/Java interface to add.

Preferably, the small file download process is as follows: inputting a file name and a download address, calling the HBase Thrift C++/C#/Java query interface, and determining whether null is returned; if so, ending the flow directly, and if not, writing the file content field of the small file to the specified file.

Preferably, when a small file is stored, the reverse timestamp, the file path and the file name information are concatenated into the row key in the HBase table, and the file size, the file time, the file type and the file content are stored in a column family in the HBase table.

According to the small file storage system based on HBase multi-terminal fusion provided by the invention, a table is created in the HBase database, file-name-related information is used as the row key, and a single column family is used to store the content, type, size and creation time of a small file, wherein the file operation process comprises the following modules:

module M1: inputting a file path and a file name;

module M2: forming the HBase basic data entry metadata according to the file name;

module M3: determining the type of file operation, calling the client to connect to the HBase database, and performing small file query, deletion, download and addition operations according to the metadata.

Compared with the prior art, the invention has the following beneficial effects:

1) The invention can effectively improve small file storage performance by exploiting column-oriented data storage;

2) The invention provides C++/C#/Java data input interfaces, which achieve unified access to multi-source data;

3) The invention supports file operations, implementing small file creation, deletion, reading, writing and similar functions based on the HBase database interface;

4) The invention provides multiple HBase Thrift services in view of the large number of small files, and achieves efficient storage of massive numbers of small files through flexible load balancing.

Drawings

Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:

FIG. 1 is a diagram of a system configuration;

FIG. 2 is a schematic diagram of a multi-terminal fusion;

FIG. 3 is a flow chart of the C++ small file query implementation;

FIG. 4 is a flow chart of the C++ small file deletion implementation;

FIG. 5 is a flow chart of the C++ small file download implementation;

FIG. 6 is a flow chart of the C++ small file addition implementation;

FIG. 7 is a flow chart of the C# small file query implementation;

FIG. 8 is a flow chart of the C# small file deletion implementation;

FIG. 9 is a flow chart of the C# small file download implementation;

FIG. 10 is a flow chart of the C# small file addition implementation;

FIG. 11 is a flow chart of the Java small file query implementation;

FIG. 12 is a flow chart of the Java small file deletion implementation;

FIG. 13 is a flow chart of the Java small file download implementation;

FIG. 14 is a flow chart of the Java small file addition implementation;

FIG. 15 is a diagram of the small file metadata implementation;

FIG. 16 is a flow chart of client small file data storage.

Detailed Description

The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that variations and modifications can be made by persons skilled in the art without departing from the concept of the invention. All such variations and modifications fall within the scope of the present invention.

Example 1:

The invention provides a small file storage method based on HBase multi-terminal fusion, in which a table is created in the HBase database, file-name-related information is used as the row key, and a single column family is used to store the content, type, size and creation time of a small file. The file operation process comprises the following steps: step 1: inputting a file path and a file name; step 2: forming the HBase basic data entry metadata according to the file name; step 3: determining the type of file operation, calling the client to connect to the HBase database, and performing small file query, deletion, download and addition operations according to the metadata.

The small file query process is as follows: inputting a file name, calling the HBase Thrift C++/C#/Java interface to query, and determining whether the file exists; if so, constructing and outputting an encapsulated small file object, and if not, outputting null.

The small file deletion process is as follows: inputting a file name, and calling the HBase Thrift C++/C#/Java interface to delete.

The small file download process is as follows: inputting a file name and a download address, calling the HBase Thrift C++/C#/Java query interface, and determining whether null is returned; if so, ending the flow directly, and if not, writing the file content field of the small file to the specified file.

When a small file is stored, the reverse timestamp, the file path and the file name information are concatenated into the row key in the HBase table, and the file size, the file time, the file type and the file content are stored in a column family in the HBase table.
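The query, deletion, download and addition flows described in this example can be sketched as follows. This is a minimal illustration only: a Python dictionary stands in for the HBase table, and the names `SmallFile` and `InMemorySmallFileStore` are hypothetical and do not come from the HBase Thrift API. The sketch also assumes, for simplicity, that a file is looked up directly by its file name.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, Optional


@dataclass
class SmallFile:
    """Hypothetical encapsulated small file object (content plus attributes)."""
    name: str
    content: bytes
    file_type: str
    size: int
    created_ms: int


class InMemorySmallFileStore:
    """Dict-backed stand-in for the HBase table; not a real HBase client."""

    def __init__(self) -> None:
        self._rows: Dict[str, SmallFile] = {}

    def add(self, small_file: SmallFile) -> None:
        # Addition: store the object under its file name (assumed lookup key).
        self._rows[small_file.name] = small_file

    def query(self, name: str) -> Optional[SmallFile]:
        # Query: return the object if it exists, otherwise None ("null").
        return self._rows.get(name)

    def delete(self, name: str) -> None:
        # Deletion: remove the row if present; a missing file is ignored.
        self._rows.pop(name, None)

    def download(self, name: str, target: Path) -> bool:
        # Download: query first and end the flow if nothing is returned.
        found = self.query(name)
        if found is None:
            return False
        # Otherwise write the file content field to the specified file.
        target.write_bytes(found.content)
        return True


def demo() -> dict:
    store = InMemorySmallFileStore()
    store.add(SmallFile("a.txt", b"hello", "txt", 5, 1666224000000))
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "a_copy.txt"
        downloaded = store.download("a.txt", target)
        copied = target.read_bytes()
        missing = store.download("missing.txt", Path(tmp) / "none.txt")
    store.delete("a.txt")
    return {
        "downloaded": downloaded,
        "copied": copied,
        "missing": missing,
        "after_delete": store.query("a.txt"),
    }
```

In a real deployment each method would presumably issue the corresponding call to an HBase Thrift service instead of touching a dictionary.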

Example 2:

The invention also provides a small file storage system based on HBase multi-terminal fusion, which can be implemented by executing the steps of the small file storage method based on HBase multi-terminal fusion; that is, a person skilled in the art can understand the method as a preferred embodiment of the system. The system provides data access from terminals in multiple programming languages and has the high-performance access capability of HBase, so it can store multi-source small file data quickly and uniformly. The system is also built on a distributed file system and therefore offers high data reliability and dynamic capacity expansion. The system implements upload and download interfaces for small files. The upload interface stores small file metadata and data based on the Rowkey; the actual data is written to the HBase system through the file data interface, with support for data redundancy. The download interface queries small file data by metadata information and downloads the data content. For the system composition, refer to FIG. 1.

The HBase multi-terminal fusion-based small file storage system is implemented by creating one large table in HBase, using file-name-related information as the row key, and using a single column family to store attribute information such as the content, type, size and creation time of each small file. The general flow of a file operation is as follows:

1) Inputting a file path and a file name;

2) Forming HBase basic data entry metadata according to the file name information;

3) If the operation is uploading operation, acquiring file content;

4) Calling the client to connect to the HBase database, and performing upload, query, download and other operations according to the metadata.
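Steps 1) to 3) of this general flow can be sketched as below. The function name `build_entry`, the dictionary layout and the operation strings are assumptions made for illustration; the text does not define the structure of the metadata entry. The sketch also uses the file's modification time as a stand-in for its creation time.

```python
import tempfile
from pathlib import Path


def build_entry(file_path: str, file_name: str, operation: str) -> dict:
    """Form a metadata entry from the path and name (hypothetical layout)."""
    full_path = Path(file_path) / file_name
    entry = {
        "path": file_path,
        "name": file_name,
        # File type derived from the extension, lower-cased (assumption).
        "type": full_path.suffix.lstrip(".").lower(),
        "operation": operation,
        "content": None,
    }
    if operation == "upload":
        # Step 3: only an upload needs the file content.
        data = full_path.read_bytes()
        entry["content"] = data
        entry["size"] = len(data)
        # Modification time used in place of creation time (assumption).
        entry["created_ms"] = int(full_path.stat().st_mtime * 1000)
    return entry


def demo() -> tuple:
    # Create a temporary file and build entries for two operation types.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "note.TXT").write_bytes(b"abc")
        upload = build_entry(tmp, "note.TXT", "upload")
        query = build_entry(tmp, "note.TXT", "query")
    return upload, query
```

Step 4) would then pass the resulting entry to whichever client connects to the HBase database.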

For the read and write processes in the file system, the HBase multi-terminal fusion-based small file storage system provides clients with functionally equivalent interface implementations in multiple programming languages, supporting multi-terminal writing and reading of small files. For multi-terminal fusion, refer to FIG. 2.

The system specifically supports the following data inputs:

1. C++ small file interface

The C++ small file interface implements the query, deletion, download and addition of small files.

Small file query: receive the input file name, call the HBase Thrift C++ interface to query, and return a small file object if the file exists; see FIG. 3.

Small file deletion: receive the input file name and call the HBase Thrift C++ interface to delete; see FIG. 4.

Small file download: receive the input file name and download path, call the C++ small file interface to query, and download if the file exists; see FIG. 5.

Small file addition: receive the constructed small file object information and call the HBase Thrift C++ interface to add; see FIG. 6.

2. C# small file interface

The C# small file interface implements the query, deletion, download and addition of small files.

Small file query: receive the input file name, call the HBase Thrift C# interface to query, and return a small file object if the file exists; see FIG. 7.

Small file deletion: receive the input file name and call the HBase Thrift C# interface to delete; see FIG. 8.

Small file download: receive the input file name and download path, call the C# small file interface to query, and download if the file exists; see FIG. 9.

Small file addition: receive the constructed small file object information and call the HBase Thrift C# interface to add; see FIG. 10.

3. Java small file interface

The Java small file interface implements the query, deletion, download and addition of small files.

Small file query: receive the input file name, call the HBase Thrift Java interface to query, and return a small file object if the file exists; see FIG. 11.

Small file deletion: receive the input file name and call the HBase Thrift Java interface to delete; see FIG. 12.

Small file download: receive the input file name and download path, call the Java small file interface to query, and download if the file exists; see FIG. 13.

Small file addition: receive the constructed small file object information and call the HBase Thrift Java interface to add; see FIG. 14.

4. Small file metadata storage

When a small file is stored, the reverse timestamp, the file path and the file name information are concatenated into the row key (Rowkey) in the HBase table, and file metadata such as the file size, file time and file type is stored together with the file content in a column family in the HBase table; see FIG. 15.
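One possible reading of this row layout is sketched below. The text does not state how the reverse timestamp is computed or how the parts are delimited, so the sketch assumes the common convention of subtracting the millisecond timestamp from the largest signed 64-bit value, a hypothetical `|` separator, and a hypothetical column family named `f` with one qualifier per attribute.

```python
MAX_LONG = 2**63 - 1   # largest signed 64-bit integer
SEPARATOR = "|"        # hypothetical delimiter; the text does not name one
COLUMN_FAMILY = "f"    # hypothetical column family name


def reverse_timestamp(created_ms: int) -> str:
    # Zero-pad to 19 digits so that string order matches numeric order.
    return f"{MAX_LONG - created_ms:019d}"


def make_row_key(created_ms: int, file_path: str, file_name: str) -> str:
    # Concatenate reverse timestamp, file path and file name into the row key.
    return SEPARATOR.join([reverse_timestamp(created_ms), file_path, file_name])


def make_row(created_ms: int, file_path: str, file_name: str, content: bytes) -> dict:
    # Metadata and content share one column family, one qualifier each.
    file_type = file_name.rsplit(".", 1)[-1] if "." in file_name else ""
    return {
        "row_key": make_row_key(created_ms, file_path, file_name),
        "cells": {
            f"{COLUMN_FAMILY}:size": str(len(content)).encode(),
            f"{COLUMN_FAMILY}:time": str(created_ms).encode(),
            f"{COLUMN_FAMILY}:type": file_type.encode(),
            f"{COLUMN_FAMILY}:content": content,
        },
    }
```

Under these assumptions a newer file receives a lexicographically smaller row key than an older one, so a scan over the sorted rows would encounter the most recently stored files first.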

5. Client small file data storage process

Small file data storage places the file content in the column family of the row identified by the small file's Rowkey in the HBase table. Because the number of small files can be large, multiple HBase Thrift services can be provided when the client stores data, and the client selects one of them for storing data according to a configured load balancing strategy. For the design and implementation of the client data storage flow, refer to FIG. 16.
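The text does not specify which load balancing strategy is configured. As one simple possibility, the sketch below rotates through a list of Thrift service endpoints in round-robin order. The class name `RoundRobinBalancer` and the host names are hypothetical.

```python
import itertools
from typing import List, Tuple

Endpoint = Tuple[str, int]  # (host, port) of one HBase Thrift service


class RoundRobinBalancer:
    """Pick Thrift endpoints in rotation; one of many possible strategies."""

    def __init__(self, endpoints: List[Endpoint]) -> None:
        if not endpoints:
            raise ValueError("at least one Thrift endpoint is required")
        # Copy the list so later changes by the caller do not affect rotation.
        self._cycle = itertools.cycle(list(endpoints))

    def pick(self) -> Endpoint:
        # Return the next endpoint, wrapping around after the last one.
        return next(self._cycle)


# Hypothetical endpoints for illustration only.
balancer = RoundRobinBalancer(
    [("thrift-1", 9090), ("thrift-2", 9090), ("thrift-3", 9090)]
)
picks = [balancer.pick() for _ in range(4)]
```

A client would call `pick()` before each store operation and open its Thrift connection to the returned endpoint; weighted or least-connections strategies could be substituted behind the same method.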

Those skilled in the art will appreciate that, in addition to implementing the system, apparatus and modules provided by the present invention purely as computer-readable program code, the same functions can be implemented in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like by logically programming the method steps. Therefore, the system, apparatus and modules provided by the present invention may be regarded as a hardware component, and the modules included therein for implementing various programs may also be regarded as structures within the hardware component; modules for performing various functions may likewise be regarded both as software programs implementing the method and as structures within the hardware component.

The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims

1. A small file storage method based on HBase multi-terminal fusion, characterized in that a table is created in the HBase database, file-name-related information is used as the row key, and a single column family is used to store the content, type, size and creation time of a small file, the file operation process comprising the following steps:

step 1: inputting a file path and a file name;

step 2: forming the HBase basic data entry metadata according to the file name;

2. The HBase multi-terminal fusion-based small file storage method according to claim 1, wherein the small file query process is as follows: inputting a file name, calling the HBase Thrift C++/C#/Java interface to query, and determining whether the file exists; if so, constructing and outputting an encapsulated small file object, and if not, outputting null.

3. The HBase multi-terminal fusion-based small file storage method according to claim 1, wherein the small file deletion process is as follows: inputting a file name, and calling the HBase Thrift C++/C#/Java interface to delete;

4. The HBase multi-terminal fusion-based small file storage method according to claim 1, wherein the small file download process is as follows: inputting a file name and a download address, calling the HBase Thrift C++/C#/Java query interface, and determining whether null is returned; if so, ending the flow directly, and if not, writing the file content field of the small file to the specified file.

5. The HBase multi-terminal fusion-based small file storage method according to claim 1, wherein when a small file is stored, the reverse timestamp, the file path and the file name information are concatenated into the row key in the HBase table, and the file size, the file time, the file type and the file content are stored in a column family in the HBase table.

6. A small file storage system based on HBase multi-terminal fusion, characterized in that a table is created in the HBase database, file-name-related information is used as the row key, and a single column family is used to store the content, type, size and creation time of a small file, the file operation process comprising the following modules:

module M1: inputting a file path and a file name;

7. The HBase multi-terminal fusion-based small file storage system according to claim 6, wherein the small file query process is as follows: inputting a file name, calling the HBase Thrift C++/C#/Java interface to query, and determining whether the file exists; if so, constructing and outputting an encapsulated small file object, and if not, outputting null.

8. The HBase multi-terminal fusion-based small file storage system according to claim 6, wherein the small file deletion process is as follows: inputting a file name, and calling the HBase Thrift C++/C#/Java interface to delete;

the small file addition process is as follows: inputting a small file object, reading the file content, and calling the HBase Thrift C++/C#/Java interface to add.

9. The HBase multi-terminal fusion-based small file storage system according to claim 6, wherein the small file download process is as follows: inputting a file name and a download address, calling the HBase Thrift C++/C#/Java query interface, and determining whether null is returned; if so, ending the flow directly, and if not, writing the file content field of the small file to the specified file.

10. The HBase multi-terminal fusion-based small file storage system according to claim 6, wherein when a small file is stored, the reverse timestamp, the file path and the file name information are concatenated into the row key in the HBase table, and the file size, the file time, the file type and the file content are stored in a column family in the HBase table.