CN114896203A - File processing method and device based on distributed file system - Google Patents


Info

Publication number
CN114896203A
Authority
CN
China
Prior art keywords
file
metadata node
metadata
target
cache
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202210316455.9A
Other languages
Chinese (zh)
Inventor
张垚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202210316455.9A
Publication of CN114896203A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G06F16/164 File meta data generation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/172 Caching, prefetching or hoarding of files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the field of computers, and in particular to a file processing method and device based on a distributed file system. The method comprises the following steps: defining a first metadata node and a second metadata node; in response to receiving a processing request for a target file, acquiring the file size of the target file and comparing the file size with a first preset value; in response to the file size exceeding the first preset value, allocating the processing request to the first metadata node for processing; and in response to the file size not exceeding the first preset value, allocating the processing request to the second metadata node for processing. The method uses different metadata nodes to process large files and small files respectively, which reduces the load on any single metadata node, strengthens the distributed file system's management of small files, significantly improves data processing efficiency in the distributed file system, and enriches its data processing modes.

Description

File processing method and device based on distributed file system
Technical Field
The invention relates to the field of computers, in particular to a file processing method based on a distributed file system.
Background
HDFS (Hadoop Distributed File System) is a foundation of data storage management in distributed computing. It was developed to meet the requirements of streaming data access and the processing of very large files, and it can run on low-cost commodity servers. It offers high fault tolerance, high reliability, high scalability, high availability and high throughput, provides fault-tolerant storage for massive data, and greatly facilitates applications that process large data sets. HDFS has two roles: the metadata node (Namenode) and the data node (Datanode). The Namenode stores the structure of the file system and the mapping between each file and its Datanode data blocks, and the Datanode stores the data.
At present, a traditional distributed file system does not distinguish files by size. The Namenode is the core of the file storage system in the whole distributed system, and every operation of the distributed file storage system requires its participation. Because the Namenode maintains the metadata and directory information of all files in the whole distributed file system, HDFS has the following problems when storing a large number of small files. First, the read-only nature of HDFS is not conducive to modifying small files: repeated modifications of small files generate many modification records and a large number of file fragments, which seriously affects access to the small files and wastes storage space. Second, when a single HDFS block (the minimum access unit of the file system) stores a large number of small files, retrieving the storage location of a specific file is slow. Third, managing small files inevitably means handling a very large number of files, and the Namenode occupies a large amount of memory when storing and processing that much file metadata, which lowers efficiency and overloads the Namenode.
Disclosure of Invention
In view of the above, it is necessary to provide a file processing method and apparatus based on a distributed file system to solve the above technical problems.
According to a first aspect of the present invention, there is provided a file processing method based on a distributed file system, the method comprising:
defining a first metadata node and a second metadata node;
in response to receiving a processing request for a target file, acquiring the file size of the target file and comparing the file size with a first preset value;
in response to the file size exceeding the first preset value, allocating the processing request to the first metadata node for processing;
and in response to the file size not exceeding the first preset value, allocating the processing request to the second metadata node for processing.
In some embodiments, the method further comprises:
in response to the second metadata node receiving the processing request and the processing request being a write request, storing the file content and the file path of the target file into a cache of the second metadata node, and marking them as new data;
and writing the file name of the target file into the file index of the second metadata node.
In some embodiments, the method further comprises:
counting the access frequency of all files in the cache of the second metadata node;
determining inactive files based on the counted access frequency, and counting a total file size of all inactive files;
in response to the total file size being equal to a second preset value, storing metadata of all inactive files into a metadata area of the second metadata node, and writing all inactive files and the file index corresponding to each inactive file into a disk;
and deleting the file content and the file index corresponding to each inactive file written into the disk from the cache of the second metadata node and the file index of the second metadata node respectively.
In some embodiments, the method further comprises:
in response to the second metadata node receiving the processing request and the processing request being a read request, judging whether the file name of the target file exists in the file index of the second metadata node;
in response to the file name of the target file existing in the file index of the second metadata node, reading the file content of the target file from the cache of the second metadata node based on the read request;
in response to the file name of the target file not existing in the file index of the second metadata node, judging whether the file name of the target file exists in the metadata area of the second metadata node;
and in response to the file name of the target file existing in the metadata area of the second metadata node, storing the file content of the target file in the disk into the cache of the second metadata node, and adding a cold-data read-cache mark.
In some embodiments, the method further comprises:
in response to the second metadata node receiving the processing request and the processing request being a modification request, judging whether the file name of the target file exists in the file index of the second metadata node;
in response to the file name of the target file existing in the file index of the second metadata node, modifying the corresponding file content in the cache based on the modification request;
in response to the file name of the target file not existing in the file index of the second metadata node, judging whether the file name of the target file exists in the metadata area of the second metadata node;
and in response to the file name of the target file existing in the metadata area of the second metadata node, storing the file content of the target file in the disk into the cache of the second metadata node, and adding a cold-data read-cache mark and a modification mark.
In some embodiments, the method further comprises:
in response to the second metadata node receiving the processing request and the processing request being a deletion request, judging whether the file name of the target file exists in the file index of the second metadata node;
in response to the file name of the target file existing in the file index of the second metadata node, deleting, based on the deletion request, the corresponding file content in the cache of the second metadata node and the file name in the file index of the second metadata node, and adding a deletion mark for the target file in the cache of the second metadata node;
in response to the file name of the target file not existing in the file index of the second metadata node, judging whether the file name of the target file exists in the metadata area of the second metadata node;
and in response to the file name of the target file existing in the metadata area of the second metadata node, adding a deletion mark for the target file in the cache of the second metadata node.
In some embodiments, the method further comprises:
periodically reading deletion marks and modification marks from the cache of the second metadata node;
in response to reading a deletion mark, deleting the file content corresponding to the deletion mark from the disk, and clearing the corresponding deletion mark in the cache of the second metadata node;
and in response to reading a modification mark, overwriting the corresponding file content in the disk with the corresponding file content in the cache of the second metadata node, based on the modification mark.
In some embodiments, the file size of the target file is the file size after the processing request is executed, and the method further includes:
in response to the file size exceeding the first preset value, judging whether the file name of the target file exists in the file index of the second metadata node;
in response to the file name of the target file existing in the file index of the second metadata node, storing the file content of the target file in the cache of the second metadata node into the cache of the first metadata node;
and adding the name of the target file to the file index of the first metadata node, and deleting the corresponding file content from the cache of the second metadata node and the corresponding file name from the file index of the second metadata node.
In some embodiments, the method further comprises:
in response to the file size not exceeding the first preset value, judging whether the file name of the target file exists in the file index of the first metadata node;
and in response to the file name of the target file existing in the file index of the first metadata node, storing the file content of the target file in the cache of the first metadata node into the cache of the second metadata node, adding the name of the target file to the file index of the second metadata node, and deleting the corresponding file content from the cache of the first metadata node and the corresponding file name from the file index of the first metadata node.
In some embodiments, the method further comprises:
defining a plurality of third metadata nodes;
selecting one of the plurality of third metadata nodes as a target third metadata node;
periodically synchronizing the full cache data, the file index and the metadata area of the second metadata node to the target third metadata node;
and periodically synchronizing the full cache data, the file index and the metadata area of the target third metadata node to the remaining third metadata nodes other than the target third metadata node.
According to a second aspect of the present invention, there is also provided a file processing apparatus based on a distributed file system, the apparatus comprising:
the definition module is used for defining a first metadata node and a second metadata node;
the comparison module is used for responding to the received processing request of the target file, acquiring the file size of the target file and comparing the file size with a first preset value;
the first allocation module is used for responding to the situation that the file size exceeds the first preset value, and allocating the processing request to the first metadata node for processing;
and the second allocation module is used for allocating the processing request to the second metadata node for processing in response to the file size not exceeding the first preset value.
According to the file processing method and device based on the distributed file system, the first metadata node and the second metadata node are defined in advance. When a processing request for a target file is received, a request whose file size exceeds the first preset value is allocated to the first metadata node, and a request whose file size is less than or equal to the first preset value is allocated to the second metadata node. Large files and small files are thus processed by different metadata nodes, which reduces the load on any single metadata node, strengthens the distributed file system's management of small files, significantly improves data processing efficiency in the distributed file system, and enriches its data processing modes.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other embodiments from these drawings without creative effort.
Fig. 1 is a flowchart illustrating a file processing method 100 based on a distributed file system according to an embodiment of the present invention;
Fig. 2 is a block diagram illustrating the architecture of a distributed file system using two types of metadata nodes according to another embodiment of the present invention;
Fig. 3 is a functional diagram of a second metadata node according to another embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a file processing apparatus 200 based on a distributed file system according to another embodiment of the present invention;
Fig. 5 is an internal structural view of a computer device in another embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.
It should be noted that all uses of "first" and "second" in the embodiments of the present invention serve to distinguish two entities or two parameters that share the same name but are not identical. "First" and "second" are used merely for convenience of description and should not be construed as limiting the embodiments of the present invention; this is not repeated in the following embodiments.
In an embodiment, referring to Fig. 1, the present invention provides a file processing method 100 based on a distributed file system, where the method includes:
Step 101, defining a first metadata node and a second metadata node;
In an embodiment, the first metadata node is used for processing large files and may be denoted as the Namenode, and the second metadata node is used for processing small files and may be denoted as the small_file_namenode.
Step 102, in response to receiving a processing request for a target file, acquiring the file size of the target file and comparing the file size with a first preset value;
In this embodiment, the file size of the target file may be the size before the operation or the size after the operation, and the first preset value may be defined by the user according to the file sizes commonly seen in actual service processing; that is, "large file" and "small file" are both relative to the first preset value.
Step 103, in response to the file size exceeding the first preset value, allocating the processing request to the first metadata node for processing;
Step 104, in response to the file size not exceeding the first preset value, allocating the processing request to the second metadata node for processing.
According to the file processing method based on the distributed file system, the first metadata node and the second metadata node are defined in advance. When a processing request for a target file is received, a request whose file size exceeds the first preset value is allocated to the first metadata node, and a request whose file size is less than or equal to the first preset value is allocated to the second metadata node. Large files and small files are thus processed by different metadata nodes, which reduces the load on any single metadata node, strengthens the distributed file system's management of small files, significantly improves data processing efficiency in the distributed file system, and enriches its data processing modes.
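Steps 101 to 104 amount to a single threshold comparison. The following is a minimal sketch of that allocation, not the patented implementation; the `ProcessingRequest` type, the `allocate` function and the 1 MiB value for the first preset value are hypothetical choices made for illustration, since the source leaves the threshold to the user.

```python
from dataclasses import dataclass

# Hypothetical threshold; the source leaves the first preset value to the user.
FIRST_PRESET_VALUE = 1 * 1024 * 1024  # 1 MiB, assumed for illustration


@dataclass
class ProcessingRequest:
    path: str
    file_size: int  # size of the target file in bytes
    op: str         # e.g. "write", "read", "modify", "delete"


def allocate(request: ProcessingRequest, threshold: int = FIRST_PRESET_VALUE) -> str:
    """Return the name of the metadata node that should handle the request."""
    if request.file_size > threshold:
        return "namenode"            # first metadata node (large files)
    return "small_file_namenode"     # second metadata node (small files)
```

A file whose size is exactly equal to the threshold goes to the second metadata node, matching the "not exceeding" wording of step 104.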
In some embodiments, the method further comprises:
in response to the second metadata node receiving the processing request and the processing request being a write request, storing the file content and the file path of the target file into a cache of the second metadata node, and marking them as new data;
and writing the file name of the target file into the file index of the second metadata node.
In this embodiment, unlike in conventional HDFS, a small file is stored directly in the LRU (Least Recently Used) cache of the second metadata node in key-value form, where the key is the path of the small file and the value is the content of the small file, and the entry is marked as new data. The small file index is written after the file has been written into the LRU cache.
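The write path described above can be sketched as follows. This is an illustrative model only: the `SmallFileCache` class, the `"new"` mark string, and the use of the file path as the index entry are assumptions, and Python's `collections.OrderedDict` stands in for whatever LRU structure the real node uses.

```python
from collections import OrderedDict


class SmallFileCache:
    """Toy model of the second metadata node's LRU cache and file index."""

    def __init__(self):
        self.lru = OrderedDict()  # path -> {"content": bytes, "marks": set}
        self.file_index = set()   # entries for files currently in the cache

    def write(self, path: str, content: bytes) -> None:
        # Store the file as a key-value pair and mark it as new data.
        self.lru[path] = {"content": content, "marks": {"new"}}
        self.lru.move_to_end(path)  # most recently used entries sit at the end
        # The index is written only after the cache write has completed.
        self.file_index.add(path)
```

Writing the index last means a lookup never finds an index entry whose content has not yet reached the cache.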
In some embodiments, the method further comprises:
counting the access frequency of all files in the cache of the second metadata node;
determining inactive files based on the counted access frequency, and counting a total file size of all inactive files;
in response to the total file size being equal to a second preset value, storing metadata of all inactive files into a metadata area of the second metadata node, and writing all inactive files and the file index corresponding to each inactive file into a disk;
and deleting the file content and the file index corresponding to each inactive file written into the disk from the cache of the second metadata node and the file index of the second metadata node respectively.
In this embodiment, the data in the cache is stored in an LRU structure. The capacity of the LRU is limited, and it holds the small files that are actively accessed. As small files are written and time passes, data with a low access frequency moves toward the tail of the LRU. When the amount of data written to the LRU cache exceeds a certain percentage of the total cache capacity (the threshold is configurable), a flush-to-disk operation is triggered so that part of the data is persisted in time to prevent data loss. This improves the flexibility of the distributed file system, the utilization of the second metadata node's cache, and the processing speed of small-file data.
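The flush of inactive files can be sketched as below. The function name, the dictionary-based stores, and the parameters `max_accesses` and `second_preset_value` are hypothetical; the sketch also treats reaching or exceeding the second preset value as the trigger, which is an assumption, since the method text speaks of the total size being equal to that value.

```python
def flush_inactive(cache, file_index, access_count, metadata_area, disk,
                   max_accesses=1, second_preset_value=64):
    """Move inactive files from the cache to disk once they add up to enough data.

    cache: dict path -> bytes; file_index: set of paths;
    access_count: dict path -> int. Returns the list of flushed paths.
    """
    # A file counts as inactive when it was accessed at most max_accesses times.
    inactive = [p for p in cache if access_count.get(p, 0) <= max_accesses]
    total = sum(len(cache[p]) for p in inactive)
    if total < second_preset_value:
        return []  # not enough inactive data yet
    for p in inactive:
        metadata_area[p] = {"size": len(cache[p])}  # metadata of the cold file
        disk[p] = cache.pop(p)                      # content moves to disk
        file_index.discard(p)                       # no longer in the cache
    return inactive
```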
In some embodiments, the method further comprises:
in response to the second metadata node receiving the processing request and the processing request being a read request, judging whether the file name of the target file exists in the file index of the second metadata node;
in response to the file name of the target file existing in the file index of the second metadata node, reading the file content of the target file from the cache of the second metadata node based on the read request;
in response to the file name of the target file not existing in the file index of the second metadata node, judging whether the file name of the target file exists in the metadata area of the second metadata node;
and in response to the file name of the target file existing in the metadata area of the second metadata node, storing the file content of the target file in the disk into the cache of the second metadata node, and adding a cold-data read-cache mark.
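The read embodiment above is a two-level lookup: the file index first, then the metadata area. A minimal sketch follows, with hypothetical names throughout (`read_small_file`, the `"cold_read"` mark string, and plain dictionaries for the cache, metadata area and disk). Returning the content after a cold load is also an assumption; the source only states that the content is loaded into the cache and marked.

```python
def read_small_file(name, cache, file_index, metadata_area, disk, marks):
    """Return the file content, or None if the second metadata node does not hold it."""
    if name in file_index:
        return cache[name]              # hot file: served from the cache
    if name in metadata_area:
        cache[name] = disk[name]        # cold file: load from disk into the cache
        marks.setdefault(name, set()).add("cold_read")
        return cache[name]
    return None                         # unknown here; the caller may route elsewhere
```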
In some embodiments, the method further comprises:
in response to the second metadata node receiving the processing request and the processing request being a modification request, judging whether the file name of the target file exists in the file index of the second metadata node;
in response to the file name of the target file existing in the file index of the second metadata node, modifying the corresponding file content in the cache based on the modification request;
in response to the file name of the target file not existing in the file index of the second metadata node, judging whether the file name of the target file exists in the metadata area of the second metadata node;
and in response to the file name of the target file existing in the metadata area of the second metadata node, storing the file content of the target file in the disk into the cache of the second metadata node, and adding a cold-data read-cache mark and a modification mark.
In some embodiments, the method further comprises:
in response to the second metadata node receiving the processing request and the processing request being a deletion request, judging whether the file name of the target file exists in the file index of the second metadata node;
in response to the file name of the target file existing in the file index of the second metadata node, deleting, based on the deletion request, the corresponding file content in the cache of the second metadata node and the file name in the file index of the second metadata node, and adding a deletion mark for the target file in the cache of the second metadata node;
in response to the file name of the target file not existing in the file index of the second metadata node, judging whether the file name of the target file exists in the metadata area of the second metadata node;
and in response to the file name of the target file existing in the metadata area of the second metadata node, adding a deletion mark for the target file in the cache of the second metadata node.
In some embodiments, the method further comprises:
periodically reading deletion marks and modification marks from the cache of the second metadata node;
in response to reading a deletion mark, deleting the file content corresponding to the deletion mark from the disk, and clearing the corresponding deletion mark in the cache of the second metadata node;
and in response to reading a modification mark, overwriting the corresponding file content in the disk with the corresponding file content in the cache of the second metadata node, based on the modification mark.
In this embodiment, when a small file in the distributed file system is modified or deleted, the data in the cache is modified or deleted first, and the data on disk is later updated to the modified data or deleted in a unified pass. Because the cache takes priority over the HDFS disk, deletion and modification operations take effect immediately, while the flush to disk is completed in batches within a reasonable time window. This reduces unnecessary disk writes while keeping the data safe.
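The deferred application of deletion and modification marks can be sketched as a periodic pass over the marks held in the cache. The `reconcile` function and the mark strings are hypothetical, and clearing the modification mark after the overwrite is an assumption: the source explicitly states only that the deletion mark is cleared.

```python
def reconcile(cache, marks, disk):
    """Apply pending deletion and modification marks to the on-disk copy.

    cache and disk map file name -> bytes; marks maps file name -> set of mark strings.
    """
    for name in list(marks):           # copy the keys, since entries may be removed
        pending = marks[name]
        if "delete" in pending:
            disk.pop(name, None)       # remove the on-disk content
            del marks[name]            # clear the deletion mark
        elif "modify" in pending:
            disk[name] = cache[name]   # overwrite disk with the cached version
            pending.discard("modify")  # clearing this mark is an assumption
```

In a running node this pass would be driven by a timer; the scheduling is left out because the source does not specify the interval.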
In some embodiments, the file size of the target file is the file size after the processing request is executed, and the method further includes:
in response to the file size exceeding the first preset value, judging whether the file name of the target file exists in the file index of the second metadata node;
in response to the file name of the target file existing in the file index of the second metadata node, storing the file content of the target file in the cache of the second metadata node into the cache of the first metadata node;
and adding the name of the target file to the file index of the first metadata node, and deleting the corresponding file content from the cache of the second metadata node and the corresponding file name from the file index of the second metadata node.
In this embodiment, a change in file size caused by the operation is determined in advance, so that a small file that has grown into a large file is detected in time and stored in the correct place beforehand. This prevents management errors caused by a small file becoming a large file after processing and improves the reliability and stability of the distributed file system.
In some embodiments, the method further comprises:
in response to the file size not exceeding the first preset value, judging whether the file name of the target file exists in the file index of the first metadata node;
and in response to the file name of the target file existing in the file index of the first metadata node, storing the file content of the target file in the cache of the first metadata node into the cache of the second metadata node, adding the name of the target file to the file index of the second metadata node, and deleting the corresponding file content from the cache of the first metadata node and the corresponding file name from the file index of the first metadata node.
In this embodiment, a change in file size caused by the operation is determined in advance, so that a large file that has shrunk into a small file is detected in time and stored in the correct place beforehand. This prevents management errors caused by a large file becoming a small file after processing and improves the reliability and stability of the distributed file system.
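The two size-change embodiments can be read as one migration routine applied in opposite directions. The sketch below takes that reading, which is an assumption: it treats the large-to-small case as the exact mirror of the small-to-large case, so the file is always added to the destination node's index and removed from the source node's index. `MetadataNode`, `migrate` and `rebalance` are hypothetical names.

```python
class MetadataNode:
    def __init__(self, name):
        self.name = name
        self.cache = {}          # file name -> content
        self.file_index = set()  # names of files held in the cache


def migrate(file_name, src, dst):
    """Move a file's cached content and index entry from src to dst."""
    if file_name not in src.file_index:
        return False             # src does not hold the file; nothing to move
    dst.cache[file_name] = src.cache.pop(file_name)
    dst.file_index.add(file_name)
    src.file_index.discard(file_name)
    return True


def rebalance(file_name, size_after, threshold, large_node, small_node):
    """Place the file on the node that matches its size after the operation."""
    if size_after > threshold:
        return migrate(file_name, small_node, large_node)
    return migrate(file_name, large_node, small_node)
```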
In some embodiments, the method further comprises:
defining a plurality of third metadata nodes;
selecting one of the plurality of third metadata nodes as a target third metadata node;
periodically synchronizing the full cache data, the file index and the metadata area of the second metadata node to the target third metadata node;
and periodically synchronizing the full cache data, the file index and the metadata area of the target third metadata node to the remaining third metadata nodes other than the target third metadata node.
In this embodiment, a plurality of third metadata nodes that synchronize the second metadata node in a timely manner are defined, which addresses the loss of data through a single point of failure. Meanwhile, only one third metadata node synchronizes data from the second metadata node, at second-level intervals, which reduces the access pressure on the second metadata node and significantly improves the stability and reliability of the distributed file storage system.
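The two-stage synchronization can be sketched as follows: the target third metadata node copies from the second metadata node, and the remaining third metadata nodes copy from the target. `MetadataReplica` and `sync_round` are hypothetical names, the deep copy stands in for a network transfer, and the timer that would invoke `sync_round` periodically is omitted.

```python
import copy


class MetadataReplica:
    """Holds the three structures the source says are synchronized."""

    def __init__(self):
        self.cache = {}          # file name -> content
        self.file_index = set()
        self.metadata_area = {}  # file name -> metadata

    def pull_from(self, source):
        # Full (not incremental) synchronization via deep copies.
        self.cache = copy.deepcopy(source.cache)
        self.file_index = copy.deepcopy(source.file_index)
        self.metadata_area = copy.deepcopy(source.metadata_area)


def sync_round(second_node, target, others):
    target.pull_from(second_node)  # only the target contacts the second node
    for replica in others:
        replica.pull_from(target)  # remaining replicas copy from the target
```

Because the other replicas never read from the second metadata node directly, its load stays constant as more third metadata nodes are added.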
In some embodiments, to facilitate understanding of the technical solution of the present invention, the method is applied to the distributed file system shown in Fig. 2. In this embodiment, the first metadata node is denoted as the Namenode and the second metadata node is denoted as the small_file_namenode. The file system structure metadata and other data stored in the Namenode occupy a large amount of memory, and an in-memory small-file cache also occupies a large amount of memory; keeping both on the Namenode could put it under excessive pressure, so the small-file cache is managed by the small_file_namenode. In addition, the small_file_namenode is responsible for writing infrequently accessed small files into blocks and storing the blocks on disk, and for merging modifications of small files into the corresponding disk block to form a new block. A routing table is recorded in the Namenode, and operations related to small files are routed directly to the small_file_namenode. Similarly, an access request received by the small_file_namenode for a large file that it does not manage is routed to the Namenode.
The functions of the small_file_namenode are described in detail below with reference to Fig. 3:
(1) Small file index: the index of the small files held in the memory cache; the index itself is also kept in memory. It is an index of file names and is used to quickly determine whether a file exists in the memory cache.
(2) Small file metadata area: like the file metadata in the Namenode, it stores the metadata of small files that are not in the memory cache but are stored in blocks (a small file that is not in the memory cache but is stored on disk in a block is cold data).
(3) LRU cache: the memory cache, which uses an LRU eviction mechanism to speed up access to active files. Inactive small files are periodically merged into blocks and written to disk, which reduces memory usage and balances the system's resource consumption against its response speed.
The implementation flow of the file processing method based on the distributed file system according to this embodiment is described in detail below, covering request distribution and each specific type of processing request:
One, processing request distribution using the routing function
Both the Namenode and the small_file_namenode have a routing function. The HDFS client does not need to know the specific role of each node, so the division is transparent from the client's point of view. When the Namenode receives a request, it determines the size of the file and compares it with a threshold. If the file is a large file, the Namenode processes it with the same logic as the original HDFS namenode, which is not repeated here. If the file is a small file, the request is handed to the small_file_namenode. For a file lookup operation, the small_file_namenode queries the small file index and then the small file metadata; if the file is found in neither, it does not exist. Similarly, the small_file_namenode also has a routing function: if the requested file is in neither the small file index nor the small file metadata that it maintains, the operation is routed to the Namenode.
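The two routing decisions can be sketched as follows. This is only an illustration: the threshold value and the function names are assumptions, since the source states only that the threshold is compared with the file size.

```python
# Assumed value; the source does not give a concrete threshold.
SMALL_FILE_THRESHOLD = 1024 * 1024


def route_from_namenode(file_size: int,
                        threshold: int = SMALL_FILE_THRESHOLD) -> str:
    """Namenode side: files above the threshold stay, the rest are forwarded."""
    return "namenode" if file_size > threshold else "small_file_namenode"


def route_from_small_file_namenode(path: str, index: set, metadata: dict) -> str:
    """small_file_namenode side: forward anything it does not manage."""
    if path in index or path in metadata:
        return "small_file_namenode"
    return "namenode"
```

A file whose size equals the threshold is treated as a small file here, matching the "does not exceed the first preset value" wording of the claims.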
Two, small file write request handling
Unlike traditional HDFS, a small file is stored directly into the LRU cache of the small_file_namenode in key-value form. The key is the path of the small file, the value is the content of the small file, and the entry is marked as new data. After the file has been written into the LRU cache, it is recorded in the small file index, at which point the write logic is complete. The data in memory is stored in an LRU structure. The capacity of the LRU is limited, so it retains the small files that are actively accessed. As small files are written and time passes, data with a low access frequency moves toward the tail of the LRU. When the amount of data written to the LRU cache exceeds a certain percentage of the total cache capacity (the threshold is configurable), a write block operation is triggered. The write block operation reads from the LRU cache the data with the lowest access frequency, in an amount that is an integer multiple of the block size. It then records the corresponding small file index entries together with the position offset at which each file is stored in the block file. Finally, the index and the content of the small files are stored in a block, and this part of the data is removed from the memory cache and from the small file index.
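The write path and the write block trigger can be sketched roughly as below. This is a simplified model under stated assumptions: `SmallFileCache` and its parameters are hypothetical names, a Python list stands in for block files on disk, and the flush fills at least one block rather than an exact integer multiple of the block size.

```python
from collections import OrderedDict


class SmallFileCache:
    def __init__(self, capacity_bytes, flush_ratio=0.8, block_size=64):
        self.capacity_bytes = capacity_bytes
        self.flush_ratio = flush_ratio   # configurable trigger percentage
        self.block_size = block_size
        self.cache = OrderedDict()       # path -> (content, mark); coldest first
        self.index = set()               # small file index
        self.metadata = {}               # path -> (block_id, offset, length)
        self.blocks = []                 # stand-in for block files on disk

    def used(self):
        return sum(len(content) for content, _ in self.cache.values())

    def write(self, path, content):
        # Store the file as key/value, mark it as new data, then index it.
        self.cache[path] = (content, "new")
        self.cache.move_to_end(path)
        self.index.add(path)
        if self.used() > self.flush_ratio * self.capacity_bytes:
            self.write_block()

    def write_block(self):
        # Move the coldest entries into one block until the block is filled.
        block = bytearray()
        block_id = len(self.blocks)
        while self.cache and len(block) < self.block_size:
            path, (content, _) = self.cache.popitem(last=False)
            self.metadata[path] = (block_id, len(block), len(content))
            block += content
            self.index.discard(path)
        self.blocks.append(bytes(block))
```

Recording the offset and length of each file in the metadata region is what later allows a cold file to be located inside its block without scanning it.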
Three, small file reading processing
The small file index is queried to determine whether the file is in the memory cache; if it is, the file is read directly from memory. If a deleted mark for the file is read from the memory cache, the response is that the file does not exist. If the file is not in the small file index, the small file metadata is queried next. If the file is found there, the small file has been written to disk in a block. The corresponding block is read; because the block file written earlier contains index information, the position of the small file within it can be located quickly. The file is then written back to the memory cache and given a cold data read cache mark (data with this mark is simply moved out of the cache if it later becomes inactive in the LRU cache, and the disk record corresponding to the file does not need to be processed).
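The read flow above can be sketched as a single function. The data shapes are assumptions made for illustration: each cache entry is a dict with `content` and a set of `marks`, the metadata region maps a path to `(block_id, offset, length)`, and `blocks` is a list of byte strings standing in for block files. Adding the file back to the small file index on a cold read is also an assumption.

```python
def read_small_file(path, cache, index, metadata, blocks):
    """Return the file content, or None if the file does not exist."""
    entry = cache.get(path)
    # A deleted mark in the memory cache hides any copy still on disk.
    if entry is not None and "deleted" in entry["marks"]:
        return None
    # Hot path: the file is indexed, so it is served from memory.
    if path in index and entry is not None:
        return entry["content"]
    # Cold path: locate the file inside its block and load it back.
    if path in metadata:
        block_id, offset, length = metadata[path]
        content = blocks[block_id][offset:offset + length]
        cache[path] = {"content": content, "marks": {"cold_read"}}
        index.add(path)  # assumption: a reloaded file is indexed again
        return content
    return None
```

Because the entry carries only the cold read mark, evicting it later requires no disk write: the block on disk still holds an identical copy.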
Four, small file update processing
The small file index is queried, and if the data is in the memory cache it is modified directly. If the data is not in the index, the small file metadata is searched; if the file is found there, the file content is written into the cache and marked as modified, and the block in which the file resides is left unchanged for the time being. Because the memory cache takes priority over HDFS when a small file is read, the modification takes effect immediately. The record of the updated file on the HDFS disk is actually updated during the small file data block sorting step.
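A minimal sketch of the update flow follows, using the same assumed data shapes as before (cache entries are dicts with `content` and `marks`). Setting both a cold read mark and a modification mark on the cold path follows claim 5; adding the path to the index is an assumption.

```python
def update_small_file(path, new_content, cache, index, metadata):
    """Apply an update; return True if the file exists, otherwise False."""
    if path in index:
        # Hot path: modify the cached data directly.
        cache[path]["content"] = new_content
        return True
    if path in metadata:
        # Cold path: cache the new content and defer the block rewrite
        # to the periodic sorting operation.
        cache[path] = {"content": new_content,
                       "marks": {"cold_read", "modified"}}
        index.add(path)  # assumption: the cached copy is indexed
        return True
    return False
```

Since reads consult the memory cache first, the new content is visible at once even though the block on disk is still stale.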
Five, small file deletion processing
The small file index is searched for the file. If the file is in the index, it is deleted from the memory cache and also removed from the index. If it is not found in the small file index, the small file metadata is searched next. If the file is found there, a deleted mark for the file is written into the memory cache, the block in which the file resides is left unmodified for the time being, and the record is cleared in the subsequent small file data block sorting step.
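The delete flow of this section can be sketched as follows, again with assumed data shapes. The deleted mark is modeled as a cache entry with no content, a common tombstone pattern; the source does not specify how the mark is represented.

```python
def delete_small_file(path, cache, index, metadata):
    """Delete a small file; return True if it existed, otherwise False."""
    if path in index:
        # Hot path: drop the file from the memory cache and the index.
        cache.pop(path, None)
        index.discard(path)
        return True
    if path in metadata:
        # Cold path: leave the block untouched and record a deleted mark;
        # the sorting operation removes the on-disk record later.
        cache[path] = {"content": None, "marks": {"deleted"}}
        return True
    return False
```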
Six, small file sorting operation
After small files have been updated and deleted for a period of time, the data cached in memory in the small_file_namenode is no longer consistent with the data in the small file blocks, so the system needs to perform sorting operations periodically. In a specific implementation, the sorting operation may proceed as follows: data marked as modified or deleted in the LRU cache is read periodically, the block in which the data resides is found and updated, and the block is then rewritten to overwrite the original block. After the block has been rewritten successfully, the modification mark in the memory cache is removed if the file was modified, and the record in the memory cache is deleted if the file was deleted.
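One way to picture the sorting operation is the sketch below. It assumes the same hypothetical data shapes as the earlier flows (cache entries with `content` and `marks`, metadata entries of the form `(block_id, offset, length)`, and a list of byte strings as blocks) and rebuilds each affected block in full, recomputing the offsets of the files that remain in it.

```python
def sort_blocks(cache, metadata, blocks):
    """Apply pending 'modified' and 'deleted' marks to the on-disk blocks."""
    pending = [p for p, e in cache.items()
               if e["marks"] & {"modified", "deleted"} and p in metadata]
    for block_id in {metadata[p][0] for p in pending}:
        old = blocks[block_id]
        members = sorted((p for p in metadata if metadata[p][0] == block_id),
                         key=lambda p: metadata[p][1])
        new_block = bytearray()
        for p in members:
            _, offset, length = metadata[p]
            marks = cache[p]["marks"] if p in cache else set()
            if "deleted" in marks:
                del metadata[p]              # drop the on-disk record
                continue
            if "modified" in marks:
                data = cache[p]["content"]   # take the newer cached content
            else:
                data = old[offset:offset + length]
            metadata[p] = (block_id, len(new_block), len(data))
            new_block += data
        blocks[block_id] = bytes(new_block)  # overwrite the original block
    # Only after the rewrite succeeds are the marks cleared.
    for p in pending:
        if "deleted" in cache[p]["marks"]:
            del cache[p]
        else:
            cache[p]["marks"].discard("modified")
```

Clearing the marks only after the block has been replaced keeps the cache authoritative if the rewrite fails partway through.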
Seven, master-slave switching strategy of the small_file_namenode
The small_file_namenode uses a memory cache. To ensure that data is not lost if a node goes down abnormally, a plurality of small_file_namenodes can be enabled in the cluster, and one of them is elected as the master node, namely the leader, which is responsible for all small file operations. The other small_file_namenodes are slave nodes, namely followers, and are only responsible for small file read operations. In addition, the leader is responsible for pushing each modification of its memory cache to one of the followers; after receiving the pushed modification, that follower applies the same modification to the memory data that it maintains. If there are several followers, the remaining followers pull the data from that follower in turn, until all followers have synchronized the modification. The followers do not all pull from the leader at the same time in order to reduce the access pressure on the leader. If the leader crashes, one of the followers is elected as the new leader and takes over the functions of the original leader. If a new follower joins the cluster, it immediately synchronizes the full memory cache, small file index, and small file metadata from any existing follower.
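The replication pattern described above can be sketched in a single process as follows. This is a toy model under stated assumptions: `SmallFileNode`, `replicate`, and `elect_leader` are hypothetical names, network transfer is replaced by direct method calls, and the election rule (pick the first follower) is a placeholder because the source does not specify one.

```python
class SmallFileNode:
    def __init__(self, name):
        self.name = name
        self.cache = {}

    def apply(self, change):
        """Apply one cache modification of the form (op, path, content)."""
        op, path, content = change
        if op == "put":
            self.cache[path] = content
        elif op == "delete":
            self.cache.pop(path, None)


def replicate(leader, followers, change):
    """The leader pushes to one follower; the others pull from that follower."""
    leader.apply(change)
    if not followers:
        return
    first, rest = followers[0], followers[1:]
    first.apply(change)             # push: leader -> first follower
    for follower in rest:
        follower.apply(change)      # pull: follower <- first follower


def elect_leader(followers):
    """Return (new_leader, remaining_followers) after the leader crashes."""
    return followers[0], followers[1:]
```

With this arrangement the leader sends each modification exactly once, regardless of how many followers exist.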
Eight, small_file_namenode handling of small file upgrade and large file degradation
If the size of a small file exceeds the small file threshold after the file is modified, this is called an upgrade. Correspondingly, if the size of a large file falls below the small file threshold after the file is modified, this is called a degradation. When a small file is upgraded, the small_file_namenode forwards the request to write the file to the Namenode, which uses the traditional HDFS file write logic; after the Namenode has written the file successfully, the file is marked as deleted in the cache. The actual deletion is performed in the sorting operation. When a large file is degraded, the Namenode first forwards the request to the small_file_namenode, which treats it as a new file write operation. After the write succeeds, the small_file_namenode notifies the Namenode, and the Namenode then deletes its copy of the file.
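The upgrade and degradation handling can be sketched with two dictionaries standing in for the two nodes. The function name `migrate_if_needed` and the data shapes are assumptions for illustration; in particular, `large_store` is only a placeholder for files managed by the Namenode.

```python
def migrate_if_needed(path, new_content, threshold, small_cache, large_store):
    """Move a file between the two nodes when a write crosses the threshold.

    small_cache: path -> {"content": ..., "marks": set} (small_file_namenode)
    large_store: path -> bytes (stand-in for files managed by the Namenode)
    """
    size = len(new_content)
    if size > threshold and path in small_cache:
        # Upgrade: the Namenode writes first, then the cached copy is marked
        # deleted; the sorting operation performs the actual deletion.
        large_store[path] = new_content
        small_cache[path] = {"content": None, "marks": {"deleted"}}
        return "upgraded"
    if size <= threshold and path in large_store:
        # Degradation: treated as a new small file write, after which the
        # Namenode deletes its own copy.
        small_cache[path] = {"content": new_content, "marks": {"new"}}
        del large_store[path]
        return "degraded"
    return "unchanged"
```

In both directions the destination is written before the source copy is removed, so a failure between the two steps leaves at least one readable copy.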
The file processing method based on the distributed file system has at least the following beneficial technical effects: (1) The ability of HDFS to manage small files is enhanced. (2) The read and write latency of small files is reduced. (3) The impact of small file management and of frequent small file operations on cluster performance and stability is reduced. (4) The small_file_namenode has a backup mechanism, which solves the problem of data loss caused by a single point of failure. (5) Large files and small files are detected and distinguished automatically, and a different storage strategy is adopted for each.
In another embodiment, referring to fig. 4, the present invention further provides a file processing apparatus 200 based on a distributed file system, the apparatus including:
a defining module 201, configured to define a first metadata node and a second metadata node;
a comparing module 202, configured to, in response to receiving a processing request for a target file, obtain a file size of the target file and compare the file size with a first preset value;
a first allocating module 203, configured to allocate the processing request to the first metadata node for processing in response to the file size exceeding the first preset value;
a second allocating module 204, configured to allocate the processing request to the second metadata node for processing in response to that the file size does not exceed the first preset value.
According to the file processing device based on the distributed file system, the first metadata node and the second metadata node are predefined. When a processing request for a target file is received, a processing request whose file size exceeds the first preset value is allocated to the first metadata node for processing, and a processing request whose file size is smaller than or equal to the first preset value is allocated to the second metadata node for processing. Large files and small files are thus processed by different metadata nodes, which reduces the load pressure on any single metadata node, enhances the ability of the distributed file system to manage small files, significantly improves the efficiency of data processing in the distributed file system, and enriches the data processing modes of the distributed file system.
It should be noted that, for specific limitations of the file processing apparatus based on the distributed file system, reference may be made to the above limitations of the file processing method based on the distributed file system, and details are not repeated here. The modules in the file processing apparatus based on the distributed file system may be implemented wholly or partially in software, in hardware, or in a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call them and execute the operations corresponding to each module.
According to another aspect of the present invention, a computer device is provided, the computer device may be a server, and the internal structure thereof is shown in fig. 5. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by a processor, implements the distributed file system based file processing method described above, in particular the method comprising the steps of:
defining a first metadata node and a second metadata node;
in response to receiving a processing request for a target file, acquiring the file size of the target file and comparing the file size with a first preset value;
in response to the file size exceeding the first preset value, allocating the processing request to the first metadata node for processing;
and responding to the file size not exceeding the first preset value, and allocating the processing request to the second metadata node for processing.
According to yet another aspect of the present invention, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the distributed file system-based file processing method described above, and in particular, comprises performing the steps of:
defining a first metadata node and a second metadata node;
in response to receiving a processing request for a target file, acquiring the file size of the target file and comparing the file size with a first preset value;
in response to the file size exceeding the first preset value, allocating the processing request to the first metadata node for processing;
and responding to the file size not exceeding the first preset value, and allocating the processing request to the second metadata node for processing.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described; however, any combination of them should be considered to fall within the scope of this specification as long as the combination involves no contradiction.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A file processing method based on a distributed file system is characterized by comprising the following steps:
defining a first metadata node and a second metadata node;
in response to receiving a processing request for a target file, acquiring the file size of the target file and comparing the file size with a first preset value;
in response to the file size exceeding the first preset value, allocating the processing request to the first metadata node for processing;
and responding to the file size not exceeding the first preset value, and allocating the processing request to the second metadata node for processing.
2. The distributed file system based file processing method of claim 1, wherein the method further comprises:
in response to the second metadata node receiving the processing request, wherein the processing request is a write-in request, storing the file content and the file path of the target file into a cache of the second metadata node, and marking the file content and the file path as new data;
and writing the file name of the target file into the file index of the second metadata node.
3. The distributed file system based file processing method of claim 2, further comprising:
counting the access frequency of all files in the cache of the second metadata node;
determining inactive files based on the counted access frequency, and counting a total file size of all inactive files;
in response to that the total file size is equal to a second preset value, storing metadata of all inactive files into a metadata area of the second metadata node, and writing all inactive files and a file index corresponding to each inactive file into a disk;
and deleting the file content and the file index corresponding to each inactive file written into the disk from the cache of the second metadata node and the file index of the second metadata node respectively.
4. The distributed file system based file processing method of claim 3, further comprising:
in response to the second metadata node receiving the processing request and the processing request being a read request, judging whether a file name of the target file exists in a file index of the second metadata node;
in response to the file index of the second metadata node having the file name of the target file, reading the file content of the target file from the cache of the second metadata node based on the reading request;
in response to that the file index of the second metadata node does not have the file name of the target file, judging whether the metadata area of the second metadata node has the file name of the target file or not;
and in response to the file name of the target file existing in the metadata area of the second metadata node, storing the file content of the target file in the disk into the cache of the second metadata node, and adding a cold data read cache mark.
5. The distributed file system-based file processing method of claim 4, wherein the method further comprises:
in response to the second metadata node receiving the processing request and the processing request being a modification request, judging whether a file name of the target file exists in a file index of the second metadata node;
in response to the file index of the second metadata node having the file name of the target file, modifying the corresponding text content in the cache based on the modification request;
in response to that the file index of the second metadata node does not have the file name of the target file, judging whether the metadata area of the second metadata node has the file name of the target file or not;
and in response to the file name of the target file existing in the metadata area of the second metadata node, storing the file content of the target file in the disk into a cache of the second metadata node, and adding cold data read cache marks and modification marks.
6. The distributed file system based file processing method of claim 5, wherein the method further comprises:
in response to the second metadata node receiving the processing request and the processing request being a deletion request, judging whether a file name of the target file exists in a file index of the second metadata node;
in response to the file index of the second metadata node having the file name of the target file, deleting the corresponding text content in the cache of the second metadata node and the file name in the file index of the second metadata node based on the deletion request, and adding a deletion marker of the target file in the cache of the second metadata node;
in response to that the file index of the second metadata node does not have the file name of the target file, judging whether the metadata area of the second metadata node has the file name of the target file or not;
and in response to the file name of the target file existing in the metadata area of the second metadata node, adding a deletion mark of the target file in the cache of the second metadata node.
7. The distributed file system based file processing method of claim 6, wherein the method further comprises:
regularly reading a deletion mark and a modification mark from the cache of the second metadata node;
in response to reading the deletion mark, deleting the file content corresponding to the deletion mark in the disk based on the read deletion mark, and clearing the corresponding deletion mark in the cache of the second metadata node;
and in response to reading the modification mark, overwriting the corresponding file content in the disk with the corresponding file content in the cache of the second metadata node based on the modification mark.
8. The file processing method based on the distributed file system according to any one of claims 1 to 7, wherein the file size of the target file is a file size after the processing request is executed, and the method further comprises:
responding to the fact that the file size exceeds the first preset value, and judging whether the file name of the target file exists in the file index of the second metadata node or not; in response to the file index of the second metadata node having the file name of the target file, storing the file content of the target file in the cache of the second metadata node into the cache of the first metadata node; adding the name of the target file into the file index of the first metadata node, and deleting the corresponding file content in the cache of the second metadata node and the corresponding file name in the file index of the second metadata node; or
In response to that the file size does not exceed the first preset value, judging whether a file name of the target file exists in a file index of the first metadata node; and in response to the file index of the first metadata node having the file name of the target file, storing the file content of the target file in the cache of the first metadata node into the cache of the second metadata node, adding the name of the target file to the file index of the first metadata node, and deleting the file content corresponding to the cache of the first metadata node and the file name corresponding to the file index of the second metadata node.
9. The distributed file system based file processing method of any of claims 1 to 7, wherein the method further comprises:
defining a plurality of third metadata nodes;
selecting one of the plurality of third metadata nodes as a target third metadata node;
synchronizing the full amount of cache data, the file indexes and the metadata areas of the second metadata node to the target third metadata node at regular time;
and regularly synchronizing the full cache data, the file indexes and the metadata areas of the target third metadata node to the remaining third metadata nodes other than the target third metadata node.
10. A file processing apparatus based on a distributed file system, the apparatus comprising:
the definition module is used for defining a first metadata node and a second metadata node;
the comparison module is used for responding to the received processing request of the target file, acquiring the file size of the target file and comparing the file size with a first preset value;
the first allocation module is used for responding to the situation that the file size exceeds the first preset value, and allocating the processing request to the first metadata node for processing;
and the second distribution module is used for distributing the processing request to the second metadata node for processing in response to the file size not exceeding the first preset value.
CN202210316455.9A 2022-03-29 2022-03-29 File processing method and device based on distributed file system Withdrawn CN114896203A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210316455.9A CN114896203A (en) 2022-03-29 2022-03-29 File processing method and device based on distributed file system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210316455.9A CN114896203A (en) 2022-03-29 2022-03-29 File processing method and device based on distributed file system

Publications (1)

Publication Number Publication Date
CN114896203A true CN114896203A (en) 2022-08-12

Family

ID=82715045

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210316455.9A Withdrawn CN114896203A (en) 2022-03-29 2022-03-29 File processing method and device based on distributed file system

Country Status (1)

Country Link
CN (1) CN114896203A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20220812
