CN113806298A - Distributed file system storage optimization method, computer equipment and storage medium - Google Patents

Publication number
CN113806298A
Authority
CN
China
Legal status
Pending
Application number
CN202110906828.3A
Other languages
Chinese (zh)
Inventor
阎继宁
王力哲
王志鹏
刘洪�
宋维静
邓泽
Current Assignee
China University of Geosciences
Original Assignee
China University of Geosciences

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G06F16/137 Hash-based
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/172 Caching, prefetching or hoarding of files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/1727 Details of free space management performed by the file system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems

Abstract

The invention provides a distributed file system storage optimization method, computer equipment and a storage medium. The method comprises the following steps: performing hash calculation on the identifier of each data node of the distributed file system cluster to obtain a node hash value corresponding to each data node, and mapping the node hash value onto a hash ring to represent the position of each data node on the hash ring; obtaining remote sensing data, performing cube segmentation on the remote sensing data to obtain data slices, and performing at least one preset processing on the data slices to obtain storage slices; acquiring feature data of each storage slice, and performing hash calculation on the feature data to obtain a slice hash value, wherein the feature data comprises a time feature and a space feature; and storing the storage slice on a data node through a preset rule according to the slice hash value. In this way, a storage strategy that preserves the temporal and spatial association of the remote sensing data can be realized, and retrieval efficiency is improved.

Description

Distributed file system storage optimization method, computer equipment and storage medium
Technical Field
The invention relates to the field of distributed file storage, in particular to a distributed file system storage optimization method, computer equipment and a storage medium.
Background
With the development of computer technology, data volumes keep growing. A traditional network storage system stores all data on a centralized storage server; that server becomes the bottleneck of system performance and can no longer meet the requirements of large-scale storage applications.
For the storage and management of large-volume data such as remote sensing data, distributed file systems and databases are mostly adopted. A distributed storage system uses a scalable architecture in which a plurality of storage servers share the storage load, which improves the reliability and availability of the system. HDFS is a typical distributed file system. In practice, however, a distributed file system scatters data across different data nodes, so a range access has to fetch data from a plurality of data nodes, which is inefficient and hinders efficient data access and processing.
Disclosure of Invention
The problem addressed by the present invention is how to increase the speed of accessing and processing remotely sensed data in a distributed file system.
In order to solve the above problems, the present invention provides a distributed file system storage optimization method, which comprises the following steps:
performing hash calculation on the identifier of each data node of the distributed file system cluster to obtain a node hash value corresponding to each data node, and mapping the node hash value onto a hash ring to represent the position of each data node on the hash ring; obtaining remote sensing data, performing cube segmentation on the remote sensing data to obtain data slices, and performing at least one preset processing on the data slices to obtain storage slices; acquiring feature data of the storage slice, and performing hash calculation on the feature data to obtain a slice hash value, wherein the feature data comprises a time feature and a space feature; storing the storage slice on the data node according to the slice hash value by a preset rule, wherein the preset rule comprises: comparing the slice hash value with the node hash values, and storing the storage slice corresponding to the slice hash value on a calibration data node, wherein the calibration data node is the data node whose node hash value is greater than the slice hash value and closest to the slice hash value.
Compared with the prior art, the method has the following advantages. A unique node hash value is obtained by performing hash calculation on the identifier of each data node of the distributed file system cluster, and the data nodes are placed on the hash ring according to their hash values, so that the data nodes are arranged randomly and uniformly, the load of each data node is balanced, storage efficiency is increased, and the maintenance cost of the data nodes is reduced. The data slices are subjected to at least one preset processing to obtain storage slices, hash calculation is performed on the feature data of the storage slices to obtain their hash values, and the data are stored according to these hash values; based on the temporal and spatial correlation of the data, data that are close in time or space are stored on the same node or on neighbouring nodes, which improves retrieval efficiency and retrieval accuracy.
Optionally, the performing cube segmentation on the remote sensing data to obtain a data slice includes: obtaining time characteristics, space characteristics and spectral dimension characteristics of the remote sensing data; and taking the time characteristic, the space characteristic and the spectrum dimensional characteristic as dimensions of a cube, and slicing the remote sensing data.
Therefore, cube segmentation of the remote sensing data yields data slices related to the three dimensions of the cube (space, time and spectrum), and ensures that the segmented remote sensing data retain the temporal and spatial correlation they had before segmentation.
Optionally, performing at least one preset process on the data slice, and obtaining a storage slice includes:
performing second preset processing on the data slice to obtain a second name of the data slice, wherein the second preset processing comprises: truncating the time feature of the data slice to a preset precision, and appending the space feature after the truncated time feature.
Therefore, the second preset processing yields a name carrying the time feature and the space feature, which supports keyword retrieval by time and space.
Optionally, before the second preset processing is performed on the data slice, the method further includes the following steps:
and performing first preset processing on the data slice to obtain a first name of the data slice, wherein the first preset processing comprises naming the data slice with respect to the time feature, the spatial projection type and the spatial coding.
Therefore, a name carrying the time feature, the spatial projection type and the spatial coding is obtained, and retrieval can be carried out by time, projection type and spatial coding.
Optionally, the obtaining feature data of the storage slice, and performing hash calculation on the feature data to obtain a slice hash value, where the feature data includes temporal features and spatial features includes:
obtaining the temporal features and the spatial features of the stored slices; and respectively carrying out hash calculation on the time characteristic and the space characteristic to respectively obtain a time hash and a space hash, wherein the slice hash value comprises at least one of the time hash and the space hash.
Therefore, the time hash and the space hash of the storage slice are calculated, and the storage slice can be correspondingly stored on the hash ring according to the time hash and the space hash.
Optionally, the storing the storage slice on the data node according to the slice hash value by a preset rule includes:
copying the storage slice to obtain two slice copies, recorded as a first copy and a second copy; storing the storage slice on the data node that issued the write request; storing the first copy through the preset rule based on the temporal hash; and storing the second copy through the preset rule based on the spatial hash.
Therefore, the data to be stored exist in three copies, namely the storage slice and the two slice copies, which are placed on data nodes by different strategies. This increases the fault tolerance of the distributed file system and prevents data loss caused by the downtime of a data node.
Optionally, the storing the storage slice on the data node according to the slice hash value by a preset rule further includes: if the selected data node cannot store the storage slice, randomly selecting another data node on the same rack as that data node for storage.
Therefore, when the selected data node cannot store the storage slice, the storage slice can be stored on another data node of the same rack, so that the data are stored without reducing retrieval and reading efficiency.
Optionally, the storing the storage slice on the data node according to the slice hash value by a preset rule includes:
each block of the distributed file system stores only one of the storage slices.
Therefore, each block only stores one storage slice, so that the disk space of the data node is not wasted, the addressing efficiency during retrieval is improved, and the retrieval efficiency is increased.
In another aspect, the present invention further provides a computer storage medium, where the computer storage medium stores a computer program, and the computer program is read by a processor and executed to implement the distributed file system storage optimization method as described above.
Compared with the prior art, the computer storage medium has the same advantages as the distributed file system storage optimization method, which are not described herein again.
The invention also provides a computer device, which includes a processor and a computer readable storage medium storing a computer program, wherein the distributed file system storage optimization method described above is implemented when the computer program is read and run by the processor.
Compared with the prior art, the computer device has the same advantages as the distributed file system storage optimization method, which are not described herein again.
Drawings
FIG. 1 is a flowchart illustrating a distributed file system storage optimization method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a distributed file system storage optimization method according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a refinement of step S400 of the distributed file system storage optimization method according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating the effect of the distributed file system storage optimization method according to the embodiment of the present invention;
FIG. 5 is a diagram illustrating another effect of the distributed file system storage optimization method according to the embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
A distributed file system is a file system in which the physical storage resources managed by the file system are not necessarily attached directly to the local node but are connected to the node through a computer network; the term also covers a complete file system formed by combining several different logical disk partitions or volume labels.
With the exponential increase of the data volume to be stored, the method of expanding the storage capacity by simply increasing the number of hard disks is not satisfactory in the aspects of capacity size, capacity growth speed, data stability, data security and the like.
However, existing distributed file systems access data inefficiently when storing remote sensing image data. With the continuous development of satellite sensors, the spatial resolution of the collected image data keeps improving, so that a single image can exceed several hundred megabytes, and the data volume keeps growing over time. This continuously growing data puts great pressure on data storage and management. Existing distributed file systems are not well adapted to remote sensing data and are therefore inefficient in range-access scenarios.
The invention provides a storage optimization method of a distributed file system. With reference to FIG. 1, the method comprises the following steps:
Step S100, performing hash calculation on the identifier of each data node of the distributed file system cluster to obtain a node hash value corresponding to each data node, and mapping the node hash value to a hash ring to represent the position of each data node on the hash ring.
Optionally, in an embodiment, the distributed file system is HDFS (Hadoop Distributed File System).
Optionally, the identifier of each data node is the combination of the host IP address and the port of the data node, which ensures that the identifier of each data node, and hence its hash value, is unique.
In an embodiment, the node hash value is calculated from the string formed by concatenating the host IP address and the port. For example, if the host IP address of a data node is 192.168.2.2 and the port is 8080, this embodiment uses the Ketama algorithm to calculate the hash value of 192.168.2.2:8080, which is 3540412423, and the data node is mapped to the corresponding position on the hash ring. All data nodes are mapped onto the hash ring in this way.
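The node hashing described above can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: it assumes the common Ketama-style derivation (MD5 digest, first four bytes read as a little-endian integer), and the function name `node_hash` is hypothetical. The text does not say how the Ketama algorithm is parameterised, so this sketch is not expected to reproduce the value 3540412423 quoted above.

```python
import hashlib

RING_SIZE = 2 ** 32  # ring positions run from 0 to 2**32 - 1


def node_hash(host: str, port: int) -> int:
    """Hash a 'host:port' identifier to a 32-bit ring position (Ketama-style sketch)."""
    identifier = f"{host}:{port}"
    digest = hashlib.md5(identifier.encode("utf-8")).digest()
    # Assumption: read the first four digest bytes as a little-endian integer.
    return int.from_bytes(digest[:4], "little")


position = node_hash("192.168.2.2", 8080)
print(position)  # some integer in [0, 2**32 - 1]
```

Because the position depends only on the identifier string, every client that knows the node list computes the same ring layout without coordination.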
The distributed file system is designed to run on commodity hardware, is suitable for deployment on inexpensive machines, and is suitable for applications with large-scale data sets. It adopts a master-slave structure: a distributed file system cluster consists of one NameNode and a plurality of DataNodes, wherein the NameNode serves as the master server that builds the namespace and manages file metadata, and the DataNodes serve as data nodes for storing data.
In an embodiment, hash calculation is performed on an identifier of each data node of a distributed file system cluster, wherein the identifier of each data node is unique, so that hash values after hash calculation are not repeated, and each data node is mapped on different positions of a hash ring according to the size of the hash values.
In another embodiment, the hash ring is a ring whose value space is 0 to 2^32 - 1. Joining this value space end to end in the clockwise direction, the points representing 0 to 2^32 - 1 form a closed ring. After the hash value of each DataNode is calculated (the hash value is greater than or equal to 0 and less than or equal to 2^32 - 1), the hash value is mapped onto the hash ring, and this mapping specifies the position of each data node. When data nodes are added or removed, only the adjacent data nodes are affected, and the other data nodes are not affected.
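The claim that adding a data node only affects its neighbours on the ring can be checked with a small experiment. The sketch below is illustrative only: the MD5-based `ring_hash`, the node addresses and the slice keys are all assumptions, and ownership follows the rule of picking the first node whose hash is greater than the key hash, wrapping around at the end of the ring.

```python
import bisect
import hashlib


def ring_hash(key):
    """Map a string to a position in 0 .. 2**32 - 1 (MD5-based, an assumption)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:4], "little")


def build_ring(node_ids):
    """Return the ring as a sorted list of (position, node id) pairs."""
    return sorted((ring_hash(node_id), node_id) for node_id in node_ids)


def owner(ring, key_hash):
    """First node clockwise whose position is greater than key_hash."""
    positions = [position for position, _ in ring]
    return ring[bisect.bisect_right(positions, key_hash) % len(ring)][1]


nodes = [f"192.168.2.{i}:8080" for i in range(2, 7)]  # hypothetical cluster
keys = [f"slice-{i}" for i in range(1000)]            # hypothetical slice keys

old_ring = build_ring(nodes)
before = {k: owner(old_ring, ring_hash(k)) for k in keys}

new_node = "192.168.2.7:8080"
new_ring = build_ring(nodes + [new_node])
after = {k: owner(new_ring, ring_hash(k)) for k in keys}

# Every key that changed owner now belongs to the new node; nothing else moved.
moved = [k for k in keys if before[k] != after[k]]
print(len(moved), "of", len(keys), "keys moved, all to", new_node)
```

Only the keys lying between the new node and its predecessor change owner, which is why the remaining data nodes are untouched.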
Step S200, obtaining remote sensing data, conducting cube segmentation on the remote sensing data to obtain data slices, conducting at least one preset treatment on the data slices, and obtaining storage slices.
The data cube is a description mode of a multidimensional data model, data are spatially arranged on the basis of three dimensions, data can be retrieved from the three dimensions, and data interpretation is more convenient. In one embodiment, the three dimensions of the data cube are the temporal, spatial, and spectral bands, respectively.
The storage slice is the storage unit in which the remote sensing data is stored in the distributed file system. A storage slice is stored in a block of the distributed file system; when the storage slice is smaller than the storage space of the block, it still occupies the storage space of one whole block.
Cube segmentation physically segments the remote sensing data by means of a data cube, on the basis of time, space and spectral band, into a plurality of data slices. Each data slice is smaller than or equal to the size of one block and as close to the block size as possible, so that data with temporal and spatial correlation are not scattered over different blocks and blocks are not left largely empty, which would waste disk storage space.
In one embodiment, 500 GB of MODIS remote sensing image data is physically sliced; the pixel size of each slice is 250 × 250, the size of each block is kept within 128 MB, and the file format is NetCDF. Because a single remote sensing image is often large, it normally has to be split when stored in HDFS, and the resulting blocks are distributed over the data nodes by the rack awareness strategy. That approach does not take into account the time-series or spatial-region characteristics of remote sensing data. Therefore, this embodiment uses the data cube to segment the data and ensures that each slice occupies the storage space of one block.
In one embodiment, the block size is 128 MB and a storage slice is 120 MB. After being stored in the block, the storage slice occupies the 128 MB of space, and the block stores no other storage slice, even one smaller than the remaining 8 MB, so as to ensure retrieval efficiency and retrieval accuracy.
When the blocks are too small, the NameNode of the distributed file system uses too much memory to store metadata, and since the physical memory of the NameNode is limited, too many blocks can cause NameNode problems. When the blocks are too large, the time to transfer the data from disk becomes much longer than the addressing time, which slows the program down when it processes the block. Data reading, writing and computation also require network transmission: if the block is too large, network transmission takes longer, and pulling blocks from other nodes or retrying after a failure during task execution becomes very costly. In addition, NameNode monitoring may easily judge a data node to be dead, so that the cluster frequently creates or removes copies and occupies hardware resources.
Optionally, the size of the block is set based on a preset formula: block size = disk transfer rate × average addressing time / 0.01.
For example, in a system with an average addressing time of 10 ms and a disk transfer rate of 200 MB/s, the size calculated using the preset formula is 200 MB, and the size of each block is set to 256 MB.
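The arithmetic of this example can be reproduced with a short sketch. It reads the preset formula as block size = transfer rate × addressing time / 0.01, that is, addressing should cost about 1% of the transfer time. The step from 200 MB to 256 MB is modelled here as rounding up to the next power of two, which is an assumption: the text only states that 256 MB is chosen. Both function names are hypothetical.

```python
def preset_block_size_mb(transfer_rate_mb_s, avg_addressing_s, ratio=0.01):
    """Block size for which addressing takes `ratio` of the transfer time."""
    return transfer_rate_mb_s * avg_addressing_s / ratio


def round_up_to_power_of_two(size_mb):
    """Smallest power of two (in MB) that is not below size_mb (assumed policy)."""
    size = 1
    while size < size_mb:
        size *= 2
    return size


raw = preset_block_size_mb(200, 0.010)       # 200 MB/s disk, 10 ms addressing
configured = round_up_to_power_of_two(raw)
print(round(raw), configured)  # 200 256
```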
Optionally, step S200 includes: obtaining time characteristics, space characteristics and spectral dimension characteristics of the remote sensing data; and taking the time characteristic, the space characteristic and the spectrum dimensional characteristic as dimensions of a cube, and slicing the remote sensing data.
In one embodiment, the remote sensing data includes a spectral image with features in three dimensions: time, space and spectrum. For example, a multispectral image contains image data of a detection region, which carries the spatial geographic feature of that region; it also carries the detection time, that is, the time feature; and its three bands carry the spectral feature. The multispectral image can therefore be divided into a plurality of time periods, a plurality of image sub-blocks and three bands, and physically segmented to obtain data slices. Because the time, space and spectrum features of the remote sensing data are correlated, the data slices obtained after cube segmentation remain correlated.
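One way to picture cube segmentation is to enumerate the index ranges of every slice along the time, space and band dimensions. The sketch below is only an illustration under assumed parameters: the 250-pixel tile edge is borrowed from the MODIS embodiment, while the cube dimensions, the one-time-step and one-band granularity, and the names `split_range` and `cube_slices` are hypothetical.

```python
from itertools import product


def split_range(length, chunk):
    """Split [0, length) into consecutive (start, stop) pieces of at most `chunk`."""
    return [(start, min(start + chunk, length)) for start in range(0, length, chunk)]


def cube_slices(n_times, height, width, n_bands,
                tile=250, times_per_slice=1, bands_per_slice=1):
    """Yield one descriptor per data slice of the time x space x band cube."""
    for t, rows, cols, bands in product(
        split_range(n_times, times_per_slice),
        split_range(height, tile),
        split_range(width, tile),
        split_range(n_bands, bands_per_slice),
    ):
        yield {"time": t, "rows": rows, "cols": cols, "bands": bands}


# Hypothetical cube: 2 time steps, a 500 x 500 pixel scene, 3 spectral bands.
slices = list(cube_slices(n_times=2, height=500, width=500, n_bands=3))
print(len(slices))  # 24 = 2 time steps x 4 tiles x 3 bands
print(slices[0])
```

Each descriptor keeps its position in all three dimensions, which is the information later encoded in the slice name.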
Optionally, step S200 includes: and performing first preset processing on the data slice to obtain a first name of the data slice, wherein the first preset processing comprises naming the data slice with respect to the time feature, the spatial projection type and the spatial coding.
In an embodiment, the first preset process includes naming the data slice, and the naming rule includes satellite sensor, spatial resolution, grid number, time, and GeoSOT spatial coding.
For example, remote sensing image data of a MODIS product dated February 25, 2000 is subjected to cube segmentation to form four data slices, and the first name of one data slice is MOD13Q1_250_sinusoidal_5_1_20000225235959500000.nc. This name denotes data observed at 23:59:59 and 500 milliseconds on February 25, 2000, with a spatial resolution of 250 meters, a spatial projection type of sinusoidal, horizontal and vertical codes of 5 and 1 respectively, and a file type of MOD13Q1. Naming the data slices by this rule prevents them from being split when stored in HDFS and preserves the relative integrity and correlation of the data and the image, so that the temporal and spatial correlation of the data is maintained.
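The naming rule of the first preset processing can be sketched as a pair of helper functions. The field order (product, resolution, projection, horizontal code, vertical code, timestamp) and the 20-digit timestamp layout (date, time, six-digit microseconds) are inferred from the example name and are assumptions; `first_name` and `parse_first_name` are hypothetical names, and the parser assumes that no field contains an underscore.

```python
from datetime import datetime

STAMP_FORMAT = "%Y%m%d%H%M%S%f"  # yyyyMMddHHmmss followed by 6-digit microseconds


def first_name(product, resolution_m, projection, h, v, observed):
    """Build a first name from the slice's product, space and time attributes."""
    stamp = observed.strftime(STAMP_FORMAT)
    return f"{product}_{resolution_m}_{projection}_{h}_{v}_{stamp}.nc"


def parse_first_name(name):
    """Recover the attributes encoded in a first name."""
    stem = name[:-3] if name.endswith(".nc") else name
    product, resolution, projection, h, v, stamp = stem.split("_")
    return {
        "product": product,
        "resolution_m": int(resolution),
        "projection": projection,
        "h": int(h),
        "v": int(v),
        "observed": datetime.strptime(stamp, STAMP_FORMAT),
    }


observed = datetime(2000, 2, 25, 23, 59, 59, 500000)
name = first_name("MOD13Q1", 250, "sinusoidal", 5, 1, observed)
print(name)  # MOD13Q1_250_sinusoidal_5_1_20000225235959500000.nc
```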
Optionally, step S200 includes: and performing second preset processing on the data slice to obtain a second name of the data slice, wherein the second preset processing comprises truncating the time characteristic of the data slice to a preset precision, adding the space characteristic to the rear of the truncated time characteristic, and obtaining the second name.
In an embodiment, the second preset processing truncates the time feature to a preset precision. For example, for the data slice with the file name MOD13Q1_250_sinusoidal_5_1_20000225235959500000.nc, the string 20000225235959500000 representing the time is truncated to 20000225; in this embodiment the time feature is truncated to date precision. After the time feature is truncated, a space feature is appended behind it; for example, the space feature numbered G00013 is placed after 20000225, and the resulting second name is MOD13Q1_250_sinusoidal_5_1_20000225_G00013.
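The second preset processing amounts to a small string transformation, sketched below. The helper name `second_name` and its `time_digits` parameter are hypothetical, and the space code is passed in as an input because the text does not spell out how a code such as G00013 is derived from the grid codes.

```python
def second_name(first_name, space_code, time_digits=8):
    """Truncate the trailing timestamp and append the space feature.

    time_digits=8 keeps yyyyMMdd, i.e. date precision as in the embodiment.
    """
    stem = first_name[:-3] if first_name.endswith(".nc") else first_name
    prefix, stamp = stem.rsplit("_", 1)  # the timestamp is the last field
    return f"{prefix}_{stamp[:time_digits]}_{space_code}"


result = second_name("MOD13Q1_250_sinusoidal_5_1_20000225235959500000.nc", "G00013")
print(result)  # MOD13Q1_250_sinusoidal_5_1_20000225_G00013
```

A coarser or finer preset precision only changes `time_digits`, for example 10 to keep the hour.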
In the second preset processing, the space feature is obtained by converting the two-dimensional horizontal and vertical grid codes into a one-dimensional code, which increases retrieval efficiency.
In another embodiment, the first preset processing that produces the first name is skipped: the second preset processing is performed directly on the data slice to obtain the second name, and the data slice is then stored on the corresponding data node according to the preset rule.
Step S300, acquiring feature data of the storage slice, and performing hash calculation on the feature data to obtain a slice hash value, wherein the feature data comprises a time feature and a space feature.
Hash calculation is performed on feature data that represents the content of the storage slice, and the result is used to determine the storage position of the storage slice. In one embodiment, the feature data includes a feature that uniquely maps to the remote sensing data contained in the corresponding storage slice. On the one hand, when data is retrieved by hash value, the retrieval range can be kept as small as possible, which reduces retrieval time and increases retrieval accuracy. On the other hand, because the hash of a feature uniquely mapped to the stored data determines the storage position, retrieval by related features becomes possible; for example, data with similar features can be retrieved more easily, which reduces addressing time and increases retrieval efficiency.
The feature data comprises a time feature and a space feature. Time and space attributes are correlated in remote sensing data, so the time feature and the space feature of each storage slice are extracted and hashed to obtain the corresponding hash values. Because the time and space features of adjacent storage slices are correlated, the storage positions of those storage slices are correlated as well.
Optionally, step S300 includes the steps of:
the temporal features and the spatial features of the stored slices are obtained.
And respectively carrying out hash calculation on the time characteristic and the space characteristic to respectively obtain a time hash and a space hash, wherein the slice hash value comprises at least one of the time hash and the space hash.
Specifically, the observation time is obtained from the second name of the storage slice. For example, if the observation time is 15:15:15 on January 1, 2020, hash calculation is performed on the time string 2020010115 to obtain the time hash value.
The space feature, which comprises the region number, is likewise obtained from the second name of the storage slice, and hash calculation is performed on the region number to obtain the space hash value.
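Extracting and hashing the two features can be sketched as below. The sketch assumes that the truncated time and the region number are the last two underscore-separated fields of the second name, and it uses an MD5-based 32-bit hash as a stand-in for whatever hash function the embodiment uses; `ring_hash` and `slice_hashes` are hypothetical names, and the second slice name is an invented neighbour of the first.

```python
import hashlib


def ring_hash(text):
    """Map a string to a position in 0 .. 2**32 - 1 (MD5-based, an assumption)."""
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest()[:4], "little")


def slice_hashes(second_name):
    """Split a second name into its time and space features and hash each one."""
    *_, time_feature, space_feature = second_name.split("_")
    return {"time": ring_hash(time_feature), "space": ring_hash(space_feature)}


a = slice_hashes("MOD13Q1_250_sinusoidal_5_1_20000225_G00013")
b = slice_hashes("MOD13Q1_250_sinusoidal_5_2_20000225_G00014")  # hypothetical neighbour
print(a["time"] == b["time"])  # True: same truncated date, same time hash
```

Slices that share the truncated time string get the same time hash, so a copy placed by time hash lands on the same data node for all of them.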
Step S400, as shown in FIG. 2, storing the storage slice on the data node according to the slice hash value by a preset rule, where the preset rule includes: comparing the slice hash value with the node hash values, and storing the storage slice corresponding to the slice hash value on a calibration data node, wherein the calibration data node is the data node whose node hash value is greater than the slice hash value and closest to the slice hash value.
The storage slices are stored on the data nodes according to their slice hash values. The data nodes were placed at unique positions on the hash ring in step S100. Once the hash value of a storage slice is obtained, it is compared with the node hash values, and the storage slice is stored on the data node that is closest to the slice hash value in the clockwise direction.
As shown in FIG. 2, in one embodiment, nodes A, B, C, D, E, F and G are different data nodes. Each node determines its position on the hash ring by calculating its node hash value from its node identifier, which is the combination of host IP address and port; since the node identifier is unique in the system, the node hash value calculated for each node is also unique. In this embodiment, hash calculation is performed on each storage slice to obtain its hash value, and the storage slice is then placed on the hash ring based on that value. For example, a storage slice whose hash value lies between those of data node A and data node B is stored on data node B, and a storage slice whose hash value is greater than the node hash value of data node G is stored on data node A.
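The lookup for nodes A to G, including the wrap-around from G back to A, can be sketched with a binary search over the sorted node positions. The positions 100 to 700 are made-up values chosen for readability (real positions would be 32-bit hash values), and `calibration_node` is a hypothetical name.

```python
import bisect

# Hypothetical ring positions for the seven data nodes of the example.
NODE_POSITIONS = [(100, "A"), (200, "B"), (300, "C"), (400, "D"),
                  (500, "E"), (600, "F"), (700, "G")]


def calibration_node(slice_hash, ring=NODE_POSITIONS):
    """Return the node whose hash is greater than slice_hash and closest to it.

    If no node hash is greater, wrap around to the first node on the ring.
    """
    positions = [position for position, _ in ring]
    index = bisect.bisect_right(positions, slice_hash)  # first position > slice_hash
    return ring[index % len(ring)][1]


print(calibration_node(150))  # B: between A and B
print(calibration_node(750))  # A: greater than G, wraps around
```

`bisect_right` implements the "greater than" comparison literally, so a slice hash equal to a node hash goes to the next node clockwise.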
Alternatively, as shown in fig. 3, the step S400 includes the steps of:
Step S401, copying the storage slice to obtain two slice copies, which are recorded as a first copy and a second copy.
Specifically, the storage slice is backed up to obtain two slice copies, which are stored on different data nodes. This prevents data loss when a data node goes down and allows the bandwidth of a plurality of racks to be fully used when data is read. When the data itself is corrupted, it can be obtained from a backup copy, and in some cases the storage slice data can be restored from a backup copy.
Step S402, storing the storage slice on the data node sending the write request.
Specifically, a distributed file system usually runs on a cluster deployed across a plurality of racks, and the communication latency between hosts on the same rack is lower than that between hosts on different racks, so the storage slice is placed on the node that writes it.
In one embodiment, when the node requesting the write is a data node, the storage slice is placed on that node and the two slice copies are stored on other data nodes; data of the same block is not allowed to be placed twice on the same data node.
Step S403, storing the first copy through the preset rule based on the temporal hash.
Specifically, in an embodiment, the time hash value of the storage slice is obtained, and the first copy is placed on the data node whose node hash value is greater than the time hash value and closest to it, where the distance between a node hash value and the time hash value is the absolute value of their difference.
Optionally, when the selected data node is the same as the data node requesting the write, the first copy is placed on a randomly chosen other data node on the same rack as the data node requesting the write.
Optionally, when the selected data node is the same as the data node requesting the write, the first copy is placed on a randomly chosen other data node.
Step S404, storing the second copy through the preset rule based on the spatial hash.
Specifically, in an embodiment, a spatial hash value of the storage slice is obtained, and the second copy is placed on the data node whose node hash value is greater than the spatial hash value and closest to it, where the distance between a node hash value and the spatial hash value is the absolute value of their difference.
Optionally, when that data node is the same as the data node requesting the write, the second copy is placed on a random other data node in the same rack as the data node requesting the write.
Optionally, when that data node is the same as the data node requesting the write, the second copy is placed on a random other data node.
Optionally, the storage slice, the first copy, and the second copy are placed on different data nodes; placing more than one of them on the same data node is not allowed.
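Steps S401 to S404 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: MD5 as the hash, the feature strings, the `nodes` mapping from node id to rack id, and the function names are hypothetical. The collision handling merges the optional variants above (prefer a random unused node in the writer's rack, otherwise any unused node), and the sketch assumes at least three data nodes.

```python
import bisect
import hashlib
import random


def ring_hash(key: str) -> int:
    """Hash a string to a 32-bit ring position (MD5 is an assumption)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:4], "big")


def successor(sorted_ring, value):
    """Node whose hash is the closest one above `value`, wrapping around."""
    hashes = [h for h, _ in sorted_ring]
    return sorted_ring[bisect.bisect_right(hashes, value) % len(sorted_ring)][1]


def place_replicas(writer, time_feature, space_feature, nodes, rng=random):
    """Return (original, first_copy, second_copy) node ids, all distinct.

    `nodes` maps node id -> rack id (a hypothetical cluster description).
    """
    ring = sorted((ring_hash(n), n) for n in nodes)
    used = [writer]  # S402: the storage slice stays on the writing node

    def pick(feature):
        target = successor(ring, ring_hash(feature))
        if target in used:
            # Collision: prefer a random unused node in the writer's rack,
            # otherwise fall back to any unused node.
            same_rack = [n for n in nodes
                         if nodes[n] == nodes[writer] and n not in used]
            others = [n for n in nodes if n not in used]
            target = rng.choice(same_rack or others)
        used.append(target)
        return target

    first = pick(time_feature)    # S403: first copy follows the time hash
    second = pick(space_feature)  # S404: second copy follows the spatial hash
    return writer, first, second


# Usage with made-up feature strings and a five-node, three-rack cluster.
cluster = {"A": "rack1", "B": "rack1", "C": "rack2", "D": "rack2", "E": "rack3"}
placement = place_replicas("A", "20210809", "N30E114", cluster, random.Random(0))
```

Because the first copy depends only on the time feature and the second only on the spatial feature, slices that share a time or a location tend to land on the same node, which is the co-location effect the method aims for.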
Optionally, if the data node cannot store the storage slice, another data node in the same rack as that data node is selected at random for storage.
In an embodiment, as shown in fig. 2, a storage slice should be stored on data node B according to the preset rule, but the storage space of data node B is full and it cannot store a new storage slice. The storage slice is then placed on data node D, which is in the same rack as data node B, to preserve storage and retrieval efficiency.
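The full-node fallback can be sketched as below. The `free_space` and `rack_of` dictionaries and the function name are hypothetical, and filtering candidates by remaining space is an added assumption; the embodiment only states that another node in the same rack is chosen at random.

```python
import random


def store_with_fallback(target, slice_size, free_space, rack_of, rng=random):
    """Return the node that will hold the slice, or None if the rack is full.

    free_space: node id -> free capacity (same unit as slice_size)
    rack_of:    node id -> rack id
    """
    if free_space[target] >= slice_size:
        return target
    # Target is full: pick a random node in the same rack that still has room.
    candidates = [n for n in rack_of
                  if n != target
                  and rack_of[n] == rack_of[target]
                  and free_space[n] >= slice_size]
    return rng.choice(candidates) if candidates else None


# Usage mirroring the example: node B is full, node D shares its rack.
free_space = {"B": 0, "C": 500, "D": 500}
rack_of = {"B": "rack1", "C": "rack2", "D": "rack1"}
chosen = store_with_fallback("B", 128, free_space, rack_of)
```

Staying inside the rack keeps the slice close to where the hash rule would have placed it, so later reads still benefit from intra-rack latency.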
Optionally, storing the storage slice on the data node according to the slice hash value by a preset rule includes: making each block storage space of the distributed file system store only one storage slice.
Preferably, each block is 128 MB in size.
As shown in fig. 4 and fig. 5, in an embodiment, 1 GB, 5 GB, 10 GB, 25 GB, and 50 GB of remote sensing data were selected for experimental comparison. Using the distributed file system storage optimization method provided by the present invention improves the association rate of files, and thereby the access speed and access efficiency of related data. For file reading, the method provided by the invention reads data quickly and improves access efficiency, and the improvement becomes more pronounced as the amount of data increases.
A computer storage medium according to another embodiment of the present invention stores a computer program, which when read and executed by a processor, implements the distributed file system storage optimization method as described above.
A computer device according to another embodiment of the present invention includes a computer-readable storage medium storing a computer program and a processor, where the computer program is read by the processor and executed to implement the distributed file system storage optimization method as described above.
Although the present disclosure has been described above, the scope of the present disclosure is not limited thereto. Various changes and modifications may be effected therein by one of ordinary skill in the pertinent art without departing from the spirit and scope of the present disclosure, and these changes and modifications are intended to be within the scope of the present disclosure.

Claims (10)

1. A distributed file system storage optimization method is characterized by comprising the following steps:
performing hash calculation on the identification of each data node of the distributed file system cluster to obtain a node hash value corresponding to each data node, and mapping the node hash value onto a hash ring to represent the position of each data node on the hash ring;
obtaining remote sensing data, carrying out cube segmentation on the remote sensing data to obtain a data slice, and carrying out at least one preset treatment on the data slice to obtain a storage slice;
acquiring feature data of the storage slice, and performing hash calculation on the feature data to obtain a slice hash value, wherein the feature data comprises a time feature and a space feature;
storing the storage slice on the data node according to the slice hash value by a preset rule, wherein the preset rule comprises: comparing the slice hash value with the node hash value, and storing the storage slice corresponding to the slice hash value on a calibration data node, wherein the calibration data node is the data node which is larger than the slice hash value and corresponds to the node hash value closest to the slice hash value.
2. The distributed file system storage optimization method of claim 1, wherein the cube slicing the remote sensing data to obtain data slices comprises:
obtaining time characteristics, space characteristics and spectral dimension characteristics of the remote sensing data;
and taking the time characteristic, the space characteristic and the spectrum dimensional characteristic as dimensions of a cube, and slicing the remote sensing data.
3. The distributed file system storage optimization method according to claim 2, wherein the performing at least one preset process on the data slice to obtain a storage slice includes:
performing second preset processing on the data slice to obtain a second name of the data slice, wherein the second preset processing comprises: truncating the time characteristic of the data slice to a preset precision, and appending the space characteristic after the truncated time characteristic.
4. The distributed file system storage optimization method according to claim 3, further comprising, before the second preset processing is performed on the data slice:
performing first preset processing on the data slice to obtain a first name of the data slice, wherein the first preset processing comprises: the data slices are named with respect to the temporal features, spatial projection types, and spatial encoding.
5. The distributed file system storage optimization method according to any one of claims 1 to 4, wherein the obtaining the feature data of the storage slice, and performing hash calculation on the feature data to obtain a slice hash value includes:
obtaining the temporal features and the spatial features of the stored slices;
and respectively carrying out hash calculation on the time characteristic and the space characteristic to respectively obtain a time hash and a space hash, wherein the slice hash value comprises at least one of the time hash and the space hash.
6. The distributed file system storage optimization method according to claim 5, wherein the storing the storage slice on the data node according to the slice hash value by a preset rule comprises:
copying the storage slice to obtain two slice copies, and recording the two slice copies as a first copy and a second copy;
storing the storage slice on a data node that issued a write request;
storing the first copy through the preset rule based on the temporal hash;
and storing the second copy through the preset rule based on the spatial hash.
7. The distributed file system storage optimization method according to claim 6, wherein the storing the storage slice on the data node according to the slice hash value by a preset rule further comprises:
and if the data node cannot store the storage slice, randomly selecting another data node on the same rack as the data node for storage.
8. The distributed file system storage optimization method according to claim 5, wherein the storing the storage slice on the data node according to the slice hash value by a preset rule comprises:
each block storage space of the distributed file system is made to store only one of the storage slices.
9. A computer storage medium, characterized in that the computer readable storage medium stores a computer program which, when read and executed by a processor, implements the distributed file system storage optimization method according to any one of claims 1 to 8.
10. A computer device comprising a computer readable storage medium storing a computer program and a processor, the computer program being read and executed by the processor to implement the distributed file system storage optimization method of any one of claims 1 to 8.
CN202110906828.3A 2021-08-09 2021-08-09 Distributed file system storage optimization method, computer equipment and storage medium Pending CN113806298A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110906828.3A CN113806298A (en) 2021-08-09 2021-08-09 Distributed file system storage optimization method, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110906828.3A CN113806298A (en) 2021-08-09 2021-08-09 Distributed file system storage optimization method, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113806298A true CN113806298A (en) 2021-12-17

Family

ID=78942858

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110906828.3A Pending CN113806298A (en) 2021-08-09 2021-08-09 Distributed file system storage optimization method, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113806298A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination