CN113204520B

CN113204520B - Remote sensing data rapid concurrent read-write method based on distributed file system

Info

Publication number: CN113204520B
Application number: CN202110469599.3A
Authority: CN
Inventors: 段延松; 张祖勋; 陶鹏杰; 柯涛; 张永军
Original assignee: Wuhan University WHU
Current assignee: Wuhan University WHU
Priority date: 2021-04-28
Filing date: 2021-04-28
Publication date: 2023-04-07
Anticipated expiration: 2041-04-28
Also published as: CN113204520A

Abstract

The invention provides a method for rapid concurrent reading and writing of remote sensing data based on a distributed file system, in which the underlying physical structure inherits the characteristics of the HDFS file system. The method comprises installing a Hadoop system on each data server in a computer cluster, establishing the HDFS file system, and then setting aside part of the space on each data server as the physical storage space of a self-owned file system. A first-level wrapper is applied to the HDFS service processing layer to take over the operating system's access to the file system. When the operating system only needs to read a file and the file data already exists, the HDFS file system interface is called directly and HDFS completes the read. When the operating system's access to the file includes a write operation, the file operation is taken over completely: the self-owned file system performs the data reading and writing, and the data are synchronized into HDFS once the reading and writing are finished. Each read or write by the self-owned file system involves only one server. The invention enables rapid concurrent reading and writing of massive remote sensing data.

Description

Remote sensing data rapid concurrent read-write method based on distributed file system

Technical Field

The invention relates to the fields of computer application technology and remote sensing big data processing, in particular to concurrent read-write technology for big data files, and specifically to a technique for rapid concurrent reading and writing of remote sensing data based on a distributed file system.

Background

With the rapid development of remote sensing technology, more and more remote sensing satellites are being launched by various countries, and the satellite data received every day reaches the PB level. Processing such massive remote sensing data places higher demands on storage technology and processing speed. Meanwhile, with the rapid development of computer technology, and Internet technology in particular, data storage technology has made a qualitative leap; with the recent emergence of cloud technology it has reached an unprecedented level. Cloud technology is usually built on a distributed file system, which guarantees high reliability while improving capacity and cost-effectiveness. Among distributed file systems, the open-source HDFS (Hadoop Distributed File System) is particularly popular. HDFS is one of the core components of the Hadoop project and is the basis for distributed data storage. HDFS is designed for streaming access to very large files and can run on cost-effective general-purpose servers. It offers high fault tolerance, high reliability, high scalability, high availability and high throughput, and greatly facilitates the processing of very large data sets.

However, HDFS also has the following 3 disadvantages, which are cases where HDFS is not applicable: 1. Low-latency data access, such as reading and writing data within milliseconds, cannot be achieved. HDFS is suited only to high-throughput scenarios, that is, writing a large amount of data at a time; it does not support reading data back immediately afterwards. 2. Storing large numbers of small files is not supported. Storing many small files consumes a large amount of memory in the index service (NameNode) to hold the index information of the data blocks, but the memory of the HDFS index service is limited and cannot be expanded at scale. In addition, a large number of index entries causes the seek time to exceed the read time, which greatly reduces access efficiency. 3. Files cannot be written concurrently or modified at random positions. An HDFS file must be accessed exclusively: multiple threads may not write to it simultaneously, only append mode is supported, and random modification is not supported. In remote sensing big data processing, however, rewriting and updating data is a processing mode that must be supported. A remote sensing image is usually very large and cannot be reprocessed in full each time; only the positions that need modification are modified. This characteristic is determined by the algorithms used in professional remote sensing image processing.

Because of the 3 disadvantages above, HDFS cannot be applied directly to remote sensing big data processing. The invention therefore provides a method that improves the distributed file system to achieve rapid concurrent reading and writing of remote sensing big data.

Disclosure of Invention

To overcome the 3 disadvantages of HDFS, the invention provides an improved distributed file system obtained by modifying the native HDFS distributed file system.

To achieve this purpose, the technical scheme of the invention is as follows:

The invention provides a method for rapid concurrent reading and writing of remote sensing data based on a distributed file system, in which the underlying physical structure inherits the characteristics of the HDFS file system. The method comprises installing a Hadoop system on each data server in a computer cluster, establishing the HDFS file system, and then setting aside part of the space on each data server as the physical storage space of a self-owned file system. A first-level wrapper is applied to the HDFS service processing layer to take over the operating system's access to the file system. When the operating system only needs to read a file and the file data already exists, the HDFS file system interface is called directly and HDFS completes the read.

When the operating system's access to the file includes a write operation, the file-writing operation is taken over completely: the self-owned file system performs the data reading and writing, and once the reading and writing are finished, the file-writing interface of the HDFS file system is called to synchronize the data into HDFS. Each read or write by the self-owned file system involves only one server.

Furthermore, a plurality of general-purpose data servers are deployed in the computer cluster to provide data storage and scientific computing. After hardware installation is completed, a Hadoop system is installed on all data servers and the HDFS file system is established, completing the construction of the native HDFS storage cluster. In addition, part of the space on each data server is set aside as the physical storage space of the self-owned file system.

Furthermore, the self-owned file system reads and writes data in RAID mode.

Furthermore, one index server is deployed to manage the whole distributed file system.

Furthermore, when the self-owned file system schedules a request, it always selects the data server with the best read-write performance to serve it. The criterion for best performance is implemented as follows:

An indicator is provided for each data server to report its data read-write status in real time. The indicator value is calculated as

I = (I_max - (D/T) / (1 - (C - S))) × (1 - W)

where I is the indicator value, I_max is the maximum data throughput provided by the hard disk group of the server, D is the amount of data most recently read and written, T is the time taken to read and write D, C is the current time, S is the time at which statistics started, and W is the CPU utilization of the storage server.

Furthermore, after the index server selects the best-performing data server, subsequent file reading and writing are carried out by that data server. No data passes through the index server during reading and writing, which ensures that the index server does not become a read-write bottleneck.

Furthermore, the data server handles read-write requests in two cases.

The first is writing brand-new data. The self-owned file system only needs to provide simple file reading and writing; after the reading and writing are finished, it calls the file-writing function of the HDFS system to synchronize the file into the HDFS file system, so that a file-reading service can be provided externally.

The second is file rewriting, which modifies an existing file. First, a storage space of the same size as the file to be rewritten is immediately requested in the self-owned file system, divided into blocks consistent with the corresponding blocks in the HDFS file system, and marked 0. Then the file in HDFS is marked invalid, the HDFS data blocks are read, and the data are updated and written into the self-owned file system. Finally, the file in the self-owned file system is synchronized into the HDFS file system.

Furthermore, for large numbers of small files, the file names in each folder are managed with the HBase database, and the file contents are managed with a large file that uses a fixed record size.

The invention is based on the open-source HDFS system. To address its 3 disadvantages, a channel read-write technology (Gate IO) is designed: the distributed file system provides services externally, while the self-owned file system makes up for the lack of concurrent, fast file reading and writing. To balance the load of concurrent reads and writes, a scheduling strategy for concurrent reading and writing is designed in the self-owned file system, ensuring good read-write performance. Finally, to support massive numbers of small files, the invention uses the management capability of the HBase database: the file names in each folder are managed by HBase, and the file contents are managed in a large file with a fixed record size. Together these achieve rapid concurrent reading and writing of massive data.

The scheme of the invention is simple and convenient to implement, has strong practicability, solves the problems of low practicability and inconvenient practical application of the related technology, can improve the user experience, and has important market value.

Drawings

FIG. 1 is a block diagram of a logical structure according to an embodiment of the present invention;

FIG. 2 is a process flow diagram of an embodiment of the invention;

FIG. 3 is a diagram illustrating a physical storage structure of each server according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of an entire distributed file system according to an embodiment of the present invention;

FIG. 5 is a schematic view of the read-write scheduling process of the self-owned file system according to an embodiment of the present invention.

Detailed Description

The technical solution of the present invention is specifically described below with reference to the accompanying drawings and examples.

The invention provides a method that improves a distributed file system to achieve rapid concurrent reading and writing of remote sensing data. It is named channel read-write technology (Gate IO) and enables rapid concurrent reading and writing of massive data. To overcome the 3 disadvantages of HDFS, the embodiment of the present invention modifies the native HDFS distributed file system as follows:

First, at the level of the underlying physical structure, the high fault tolerance, high reliability, high scalability, high availability and high throughput of the HDFS file system are inherited, which ensures the efficiency and stability of the distributed file system.

Then, a first-level wrapper is applied to the HDFS service processing layer to take over the operating system's access to the file system. If the operating system only needs to read a file and the file data already exists, the HDFS file system interface is called directly and HDFS completes the read. If the operating system's access to the file includes a write operation (creating or rewriting), the file operation is taken over completely: the self-owned file system performs the data reading and writing, and once they are complete, the file-writing interface of the HDFS file system is called to synchronize the data into HDFS.

The logical structure and the processing flow of the embodiment of the invention are shown in FIG. 1 and FIG. 2. In the embodiment, Gate IO provides the file read-write service externally, the native HDFS file system provides the distributed data organization, and file writing is provided by the self-owned file system. The flow can be designed as follows:

Determine from the file read-write request whether the file needs to be written.

If so, the self-owned file system reads and writes the data; when the reading of existing data or the writing of data is finished, the data are synchronized with HDFS.

If not, the native HDFS serves the request.
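The routing decision above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `route_request`, the fopen-style `mode` string, and the returned labels are hypothetical and do not come from the patent. Sending a read of a file that is not currently valid in HDFS to the self-owned file system is an assumption drawn from the rewrite procedure, where accesses to an invalidated file are handed to the self-owned file system.

```python
def route_request(mode: str, valid_in_hdfs: bool) -> str:
    """Decide which layer serves a file request.

    mode          -- hypothetical fopen-style mode string, e.g. "r", "w", "r+", "a"
    valid_in_hdfs -- True if the file exists in HDFS and is not marked invalid
    """
    # Any mode that can create or modify data counts as a write request.
    wants_write = any(flag in mode for flag in ("w", "a", "+", "x"))
    if wants_write:
        # Writes are fully taken over; the result is synchronized to HDFS afterwards.
        return "own_fs"
    if valid_in_hdfs:
        # Pure read of existing data: call the native HDFS interface directly.
        return "hdfs"
    # Assumption: a read of a file that HDFS currently does not serve
    # (for example, one being rewritten) goes to the self-owned file system.
    return "own_fs"


print(route_request("r", True))
print(route_request("r+", True))
```

Keeping the decision in one small pure function makes the wrapper layer easy to test independently of any real HDFS client.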

To achieve rapid concurrent reading and writing of data, the invention requires a computer cluster, that is, a plurality of general-purpose data servers deployed to provide data storage and scientific computing. Each data server has more than 4 hard disks installed; 8 are recommended. After hardware installation is finished, a Hadoop system is installed on all data servers and the HDFS file system is established, completing the native HDFS storage cluster. Part of the space on each data server is then set aside as the physical storage space of the self-owned file system. To ensure the read-write performance of the self-owned file system, its physical space must make full use of the disk array (RAID) formed by the multiple hard disks, that is, the self-owned file system must read and write data in RAID mode. The RAID level is set according to the number of hard disks in each data server; RAID 5, in which the redundancy count is 1, is recommended. The storage division of each data server is shown in FIG. 3: each data server contains both the physical storage space used by the native HDFS file system and the physical storage space of the self-owned file system.

After the storage space of the individual data servers has been organized, the invention, like HDFS, requires an index server (i.e., a NameNode server) to manage the entire distributed file system. The composition of the entire distributed file system is shown in FIG. 4. In a specific implementation, a standby index server (NameNode) can also be set up. Unlike native HDFS, the index server deployed by the invention provides two types of service:

The first is the file-reading service, which directly calls the file reading of native HDFS and is not described in detail here.

The second is the file-writing service (including creating and rewriting files), which is performed by the self-owned file system.

Unlike HDFS, the self-owned file system stores files on the data servers according to a predefined rule, and each read or write involves only one data server. To support concurrent reading and writing of multiple files by multiple machines, the self-owned file system must provide a scheduling algorithm that balances the load of concurrent reads and writes across machines. The general idea of the scheduling algorithm is to select, for each request, the server with the best read-write performance. To measure performance, each storage server is given an indicator that reports its data read-write status in real time. The maximum value of the indicator is 100. If there has been no recent read-write request and free physical disk space is available, the indicator takes the maximum value of 100; if there is no free space, it is set directly to 0. If data are being read or written, the indicator value is calculated from the data flow over a period of time, and the recommended formula is:

I = (I_max - (D/T) / (1 - (C - S))) × (1 - W)

where I is the indicator value, I_max is the maximum data throughput provided by the hard disk group of the data server, D is the amount of data most recently read and written, T is the time taken to read and write D, C is the current time, S is the time at which statistics started, and W is the CPU utilization of the storage server.
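As a worked illustration, the indicator can be computed with a short Python function. This is a sketch under assumptions rather than the patented implementation: the function name, the parameter names, and the `has_free_space` flag are hypothetical, and the source does not state the units of D, T, C and S, so the formula is applied verbatim and the case where its denominator is zero is rejected.

```python
def indicator(i_max, d, t, c, s, w, has_free_space=True):
    """Indicator value I = (I_max - (D/T) / (1 - (C - S))) * (1 - W).

    Units are not specified in the source and are left to the caller
    (assumption). Special cases follow the text: no free space gives 0,
    no recent read-write traffic gives the maximum value of 100.
    """
    if not has_free_space:
        return 0.0
    if d == 0 or t == 0:
        # No recent traffic to measure: report the maximum indicator value.
        return 100.0
    denominator = 1.0 - (c - s)
    if denominator == 0:
        raise ValueError("formula is undefined when C - S equals 1")
    return (i_max - (d / t) / denominator) * (1.0 - w)


# Example inputs (made up for illustration): I_max = 100, D/T = 5, C - S = 0.5, W = 0.2
value = indicator(i_max=100.0, d=50.0, t=10.0, c=0.5, s=0.0, w=0.2)
print(value)
```

With these example inputs the arithmetic is (100 - 5/0.5) × (1 - 0.2) = 90 × 0.8 = 72.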

When selecting the data server for a read-write request, the case in which the machine making the request is itself a data server providing the storage service must be considered. In that case, if the local indicator value is not 0, the local storage service is selected preferentially.
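The selection rule, including the preference for local storage, might look like the following Python sketch. The function name `select_server`, the dictionary of indicator values, and the error raised when no server is usable are all assumptions made for illustration. Treating only positive indicator values as usable is also an assumption, since the text only states that 0 means no space.

```python
def select_server(indicators, requester=None):
    """Pick the data server that should serve a read-write request.

    indicators -- dict mapping server name to its current indicator value
    requester  -- name of the requesting machine if it is itself a data server
    """
    # Prefer local storage when the requester is a data server with a usable indicator.
    if requester is not None and indicators.get(requester, 0) > 0:
        return requester
    # Otherwise choose the server reporting the highest indicator value.
    usable = {name: value for name, value in indicators.items() if value > 0}
    if not usable:
        raise RuntimeError("no data server can currently accept the request")
    return max(usable, key=usable.get)


print(select_server({"ds1": 40.0, "ds2": 90.0, "ds3": 0.0}))
print(select_server({"ds1": 40.0, "ds2": 90.0}, requester="ds1"))
```

Only server names and indicator values are exchanged here, which matches the idea that the index server picks a target but never relays file data.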

After the index server selects a specific data storage server, subsequent file reading and writing are carried out by that data server. No data passes through the index server during reading and writing, which ensures that the index server does not become a read-write bottleneck.

The selected data storage server handles read-write requests in two cases.

The first is writing new data. This case is simple: the self-owned file system only needs to provide simple file reading and writing. After the reading and writing are finished, the self-owned file system calls the file-writing function of the HDFS system to synchronize the file into HDFS, so that a file-reading service can be provided externally.

The second is rewriting a file, that is, modifying an existing file. This case is more complex; the processing flow is shown in FIG. 5, and the main steps are as follows:

(1) Immediately request a storage space in the self-owned file system of the same size as the file to be rewritten, and divide the space into blocks whose size matches the block size of that file in HDFS (Hadoop Distributed File System).

(2) Mark all blocks as 0, indicating that the data block does not yet exist.

(3) Mark the file to be modified as invalid in HDFS so that HDFS no longer accepts access to it; any file access request that arrives is handed directly to the self-owned file system.

(4) Determine the data blocks to be modified from the file addresses being rewritten, read their data into memory through the file-reading interface of HDFS, update the modified part, write the data into the corresponding blocks of the self-owned file system, and set the mark of those blocks to 1.

(5) After all modifications are finished, call the file-writing function of the HDFS system, as for a new file, to synchronize the file into the HDFS file system.

(6) Delete the original file marked invalid in HDFS and release its storage space.
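Steps (1) to (6) can be traced with a small in-memory simulation in Python. It is a sketch, not the patented implementation: the `HdfsFile` class, the dictionary standing in for HDFS, and the function name `rewrite_file` are hypothetical. The source does not say how blocks still marked 0 are treated during synchronization; the sketch assumes they are taken unchanged from the original HDFS file.

```python
from dataclasses import dataclass


@dataclass
class HdfsFile:
    """Hypothetical in-memory stand-in for a file stored in HDFS."""
    data: bytes
    valid: bool = True


def rewrite_file(hdfs, name, offset, patch, block_size):
    """Overwrite len(patch) bytes at offset and return the per-block marks."""
    old = hdfs[name]
    size = len(old.data)
    if offset < 0 or offset + len(patch) > size:
        raise ValueError("patch must lie inside the existing file")
    # (1) Reserve same-size space, split into blocks matching the HDFS block size.
    n_blocks = (size + block_size - 1) // block_size
    staging = bytearray(size)
    # (2) Mark every block 0: not yet present in the self-owned file system.
    marks = [0] * n_blocks
    # (3) Invalidate the file in HDFS so that HDFS no longer serves it.
    old.valid = False
    # (4) Read each affected block, update the modified part, stage it, mark it 1.
    if patch:
        first = offset // block_size
        last = (offset + len(patch) - 1) // block_size
        for b in range(first, last + 1):
            start, end = b * block_size, min((b + 1) * block_size, size)
            block = bytearray(old.data[start:end])
            lo, hi = max(offset, start), min(offset + len(patch), end)
            block[lo - start:hi - start] = patch[lo - offset:hi - offset]
            staging[start:end] = block
            marks[b] = 1
    # (5) Synchronize: staged blocks where marked 1, original blocks otherwise (assumption).
    merged = bytearray()
    for b in range(n_blocks):
        start, end = b * block_size, min((b + 1) * block_size, size)
        merged += staging[start:end] if marks[b] else old.data[start:end]
    # (6) Replace the invalidated original, which releases its storage.
    hdfs[name] = HdfsFile(bytes(merged))
    return marks


store = {"image.tif": HdfsFile(b"abcdefghij")}
marks = rewrite_file(store, "image.tif", offset=3, patch=b"XYZ", block_size=4)
print(marks, store["image.tif"].data)
```

In the example, a 3-byte patch at offset 3 with a block size of 4 touches the first two blocks only, so the third block is never staged.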

In addition, to handle large numbers of small files, the embodiment of the invention uses an HBase database and a dedicated small-file space provided by the Hadoop system. Because the native HDFS system does not support managing massive numbers of file names, the HBase database is introduced to manage file names. HBase provides string-based indexing and optimization and in theory supports an unlimited number of entries; tests show that HBase takes less than 2 seconds to look up a string among billions of records. With such a database, retrieving file names at massive scale is no longer a problem. For the contents of small files, the invention stores multiple small files in one large file. When the system is installed, the user may set a size limit for small files (the recommended default is 64 KB); with the default, every file smaller than 64 KB is treated as a small file while the system runs. In the file system provided by the invention, all files are created through a single channel: they are created in the self-owned file system and then synchronized into HDFS. When a small file is encountered during synchronization, the system does not synchronize it individually but writes its contents into a shared large file. Note that the large file holding the small files stores data in fixed-size records, where the size of each record equals the small-file size threshold (e.g., 64 KB); data smaller than the threshold still occupy a full record.
When a small file is stored in the large file, the corresponding entry in the file-name database must record the name of the large file that actually holds the data and the offset of the data within it, so that the actual content can be located when the file is later read. This organization handles the writing and reading of massive numbers of small files effectively but is not well suited to deleting small files, so the invention adds extra work to deal with deletion; the overall result is still better than the prior art. When a small-file deletion request arrives, the index server operates directly on the HBase database to remove the file-name record; the record is not actually deleted but moved to another table that lists deleted files. When the deleted-file list reaches a certain size (for example, 100,000 records), compaction of the small-file contents is started. Compaction needs no separate design or development: it is essentially a file rewrite, and the file-rewriting function described above is executed directly on the large file that stores the small-file contents.
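The fixed-record layout can be illustrated with a minimal in-memory Python sketch. Everything named here is hypothetical: the class `SmallFileStore`, the container name, and the use of plain dictionaries in place of the HBase file-name table and the deleted-file table. Storing the actual data length alongside the large-file name and offset is an added assumption, needed so that the padding can be stripped on read.

```python
RECORD_SIZE = 64 * 1024  # small-file threshold and record size (recommended default)


class SmallFileStore:
    """Packs small files into one large container using fixed-size records."""

    def __init__(self, record_size=RECORD_SIZE, container_name="pack-0"):
        self.record_size = record_size
        self.container_name = container_name
        self.container = bytearray()  # stands in for the large file
        self.index = {}               # stands in for the HBase file-name table
        self.deleted = {}             # stands in for the deleted-file table

    def put(self, name, data):
        if len(data) >= self.record_size:
            raise ValueError("not a small file; store it as a regular file")
        offset = len(self.container)
        # Each record occupies a full slot even when the data are shorter.
        self.container += data.ljust(self.record_size, b"\0")
        self.index[name] = (self.container_name, offset, len(data))

    def get(self, name):
        _, offset, length = self.index[name]
        return bytes(self.container[offset:offset + length])

    def delete(self, name):
        # The record is moved to the deleted table; the space is reclaimed later.
        self.deleted[name] = self.index.pop(name)


store = SmallFileStore(record_size=8)
store.put("a.txt", b"hi")
store.put("b.txt", b"world")
store.delete("a.txt")
print(store.get("b.txt"), len(store.container), sorted(store.deleted))
```

Because every record has the same size, a record's offset is always a multiple of the record size, and a later compaction pass can reuse the slots listed in the deleted table.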

In summary, the present invention is based on the open-source HDFS system and designs a channel read-write technology to overcome the 3 disadvantages of HDFS, ultimately achieving rapid concurrent reading and writing of massive data.

It will be apparent to those skilled in the art that the steps of the present invention described above may be implemented by a general purpose computer, by designing, developing or directly arranging executable program code to exist on a single computer or to be distributed over a network of multiple computing servers, and in some cases may differ from the order and execution of the steps listed herein or be separately fabricated into single or multiple independent modules, and thus the present invention is not limited to any particular combination of hardware and software.

The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments or alternatives may be employed by those skilled in the art without departing from the spirit or ambit of the invention as defined in the appended claims.

Claims

1. A method for rapid concurrent reading and writing of remote sensing data based on a distributed file system, characterized in that: the underlying physical structure inherits the characteristics of the HDFS file system; a Hadoop system is installed on each data server in a computer cluster, the HDFS file system is established, and part of the space on each data server is then set aside as the physical storage space of a self-owned file system; a first-level wrapper is applied to the HDFS service processing layer to take over the operating system's access to the file system, and when the operating system only needs to read a file and the file data already exists, the HDFS file system interface is called directly and HDFS completes the reading of the file data;

when the operating system's access to the file includes a file-writing operation, the file-writing operation is taken over completely, the self-owned file system performs the data reading and writing, and after the data reading and writing are finished, the file-writing interface of the HDFS file system is called to synchronize the data into HDFS; each read or write by the self-owned file system involves only one server; the self-owned file system reads and writes data in RAID mode; a storage space of the same size as the file to be rewritten is requested in the self-owned file system, divided into blocks consistent with the corresponding blocks in the HDFS file system, and marked 0; the file in HDFS is then marked invalid, the HDFS data blocks are read, and the data are updated and written into the self-owned file system; finally, the file in the self-owned file system is synchronized into the HDFS file system.

2. The method for rapid concurrent reading and writing of remote sensing data based on a distributed file system according to claim 1, characterized in that: a plurality of general-purpose data servers are deployed in the computer cluster to provide data storage and scientific computing; after hardware installation is completed, a Hadoop system is installed on all data servers and the HDFS file system is established, completing the construction of the native HDFS storage cluster; in addition, part of the space on each data server is set aside as the physical storage space of the self-owned file system.

3. The method for rapid concurrent reading and writing of remote sensing data based on a distributed file system according to claim 2, characterized in that: an index server is deployed to manage the whole distributed file system.

4. The method for rapid concurrent reading and writing of remote sensing data based on a distributed file system according to claim 3, characterized in that: when the self-owned file system schedules a request, the data server with the best read-write performance is always selected to serve it, and the criterion for best performance is implemented as follows,

I = (I_max - (D/T) / (1 - (C - S))) × (1 - W)

where I is the indicator value, I_max is the maximum data throughput provided by the hard disk group of the server, D is the amount of data most recently read and written, T is the time taken to read and write D, C is the current time, S is the time at which statistics started, and W is the CPU utilization of the storage server.

5. The method for rapid concurrent reading and writing of remote sensing data based on a distributed file system according to claim 4, characterized in that: after the index server selects the best-performing data server, subsequent file reading and writing are carried out by that data server, and no data passes through the index server during reading and writing, which ensures that the index server does not become a read-write bottleneck.

6. The method for rapid concurrent reading and writing of remote sensing data based on a distributed file system according to claim 1, 2, 3, 4 or 5, characterized in that: the data server handles read-write requests in two cases,

the first is writing brand-new data, in which the self-owned file system only needs to provide simple file reading and writing, and after the reading and writing are finished, the self-owned file system calls the file-writing function of the HDFS system to synchronize the file into the HDFS file system, so that a file-reading service can be provided externally;

the second is file rewriting, which modifies an existing file, in which a storage space of the same size as the file to be rewritten is first immediately requested in the self-owned file system, divided into blocks consistent with the corresponding blocks in the HDFS file system, and marked 0; the file in HDFS is then marked invalid, the HDFS data blocks are read, and the data are updated and written into the self-owned file system; finally, the file in the self-owned file system is synchronized into the HDFS file system.

7. The method for rapid concurrent reading and writing of remote sensing data based on a distributed file system according to claim 1, 2, 3, 4 or 5, characterized in that: for large numbers of small files, the file names in each folder are managed with the HBase database, and the file contents are managed with a large file that uses a fixed record size.